This paper proposes PASTA (PArameter-free Solutions for Textual Analysis), a large-scale engine providing strategies to automatically tune the algorithm parameters for the whole text clustering process. A data weighting strategy (e.g., TF-IDF) and a transformation method of input data (e.g., LSI) are explored before performing the cluster analysis to reduce sparseness and make the overall analysis problem more effectively tractable. PASTA includes auto-selection strategies to off-load the end-user from parameter tuning and achieve a good quality of the clustering results. PASTA's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, PASTA has been validated on three collections of Wikipedia documents. The experimental results show the effectiveness and the efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents.
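The idea of weighting documents and then selecting a clustering parameter automatically can be illustrated with a minimal single-machine sketch. This is not PASTA's implementation: the abstract does not state which auto-selection criterion is used, so the silhouette score is assumed here as a stand-in for "cohesive and well-separated" groups, the LSI step is omitted, and plain Python replaces Apache Spark. All function names (`tfidf`, `kmeans`, `silhouette`, `auto_cluster`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one unit-length sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity; inputs are assumed to be unit-length."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(members):
    """Unit-length mean direction of a group of sparse vectors."""
    total = Counter()
    for vec in members:
        for t, w in vec.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
    return {t: w / norm for t, w in total.items()}


def kmeans(vectors, k, max_iter=20):
    """Spherical k-means with deterministic farthest-first initialisation."""
    centers = [vectors[0]]
    while len(centers) < k:
        centers.append(max(
            vectors,
            key=lambda v: min(cosine_distance(v, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: cosine_distance(v, centers[j]))
            for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels


def silhouette(vectors, labels):
    """Mean silhouette; singleton clusters contribute 0."""
    scores = []
    for i, v in enumerate(vectors):
        by_cluster = {}
        for j, u in enumerate(vectors):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(cosine_distance(v, u))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                   # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def auto_cluster(docs, k_values):
    """Try each candidate k and keep the one with the best silhouette."""
    vectors = tfidf(docs)
    best_score, best_k, best_labels = -1.0, None, None
    for k in k_values:
        labels = kmeans(vectors, k)
        score = silhouette(vectors, labels)
        if score > best_score:
            best_score, best_k, best_labels = score, k, labels
    return best_k, best_labels, best_score


# Toy corpus (hypothetical): two topics with disjoint vocabularies.
docs = [
    "spark cluster", "cluster engine", "engine spark",
    "pasta tomato", "tomato basil", "basil pasta",
]
k, labels, score = auto_cluster(docs, range(2, 5))
print(k, labels, round(score, 3))
```

On this toy corpus the loop settles on two clusters, because splitting either topic further creates singleton or poorly separated groups that lower the average silhouette. A distributed version would replace these helpers with the corresponding Spark stages.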
Self-tuning techniques for large scale cluster analysis on textual data collections / Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco. - Print. - (2017), pp. 1-6. (Paper presented at the ACM SIGAPP Symposium On Applied Computing, held in Marrakesh, Morocco, April 3rd-7th, 2017) [10.1145/3019612.3019661].
Title: | Self-tuning techniques for large scale cluster analysis on textual data collections | |
Authors: | ||
Publication date: | 2017 | |
Appears in types: | 4.1 Conference proceedings contribution |
Files in this item:
File | Description | Type | License | |
---|---|---|---|---|
Cerquitelli-etAl.pdf | | 2. Post-print / Author's Accepted Manuscript | Not public - Private/restricted access | |
CerquitelliDiCorsoVenturaSAC2017.pdf | | 1. Preprint / submitted version [pre-review] | Not public - Private/restricted access | |
http://hdl.handle.net/11583/2662148