Self-tuning techniques for large scale cluster analysis on textual data collections

Di Corso, Evelina; Cerquitelli, Tania; Ventura, Francesco

doi:10.1145/3019612.3019661

This paper proposes PASTA (PArameter-free Solutions for Textual Analysis), a large-scale engine that provides strategies to automatically tune the algorithm parameters for the whole text clustering process. A data weighting strategy (e.g., TF-IDF) and a transformation method for the input data (e.g., LSI) are explored before the cluster analysis to reduce sparseness and make the overall analysis problem more tractable. PASTA includes auto-selection strategies that off-load parameter tuning from the end user while achieving good-quality clustering results. PASTA's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. As a case study, PASTA has been validated on three collections of Wikipedia documents. The experimental results show the effectiveness and the efficiency of the proposed solution in analyzing collections of documents without tuning algorithm parameters and in discovering cohesive and well-separated groups of documents.
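The pipeline described in the abstract (weight the documents, cluster them, and pick the parameters automatically) can be illustrated with a minimal sketch. This is not PASTA's implementation: the abstract does not say which criterion PASTA uses for auto-selection, so the silhouette score used here to choose the number of clusters is an assumption, the LSI step is omitted, and the code is plain single-machine Python rather than Apache Spark. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return L2-normalised TF-IDF vectors (lists of floats), one per document."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    df = Counter(t for doc in tokens for t in set(doc))  # document frequency
    n = len(docs)
    vectors = []
    for doc in tokens:
        tf = Counter(doc)
        vec = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors

def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(points, k, iters=20):
    """Plain K-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def select_k(points, candidates):
    """Assumed selection rule: keep the K with the highest mean silhouette."""
    best = None
    for k in candidates:
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[2]:
            best = (k, labels, score)
    return best

# Made-up toy corpus with two topics that share no vocabulary.
docs = [
    "spark cluster distributed computing",
    "distributed computing spark engine",
    "spark engine cluster computing",
    "pasta recipe tomato sauce",
    "tomato sauce pasta basil",
    "basil recipe pasta sauce",
]
best_k, labels, score = select_k(tfidf(docs), [2, 3, 4])
print(best_k, labels)
```

The silhouette score rewards exactly the property the abstract names, cohesive and well-separated groups, which is why it is used here as a stand-in criterion. A production version on Spark would replace these helpers with the framework's own distributed feature and clustering components.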

Self-tuning techniques for large scale cluster analysis on textual data collections / Di Corso, E., Cerquitelli, T., Ventura, F. - Print. - (2017), pp. 1-6. (ACM SIGAPP Symposium on Applied Computing, Marrakesh, Morocco, April 3rd-7th, 2017) [10.1145/3019612.3019661].


Year of publication: 2017

Publication type: 4.1 Contribution in conference proceedings


Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2662148
