Prompting the data transformation activities for cluster analysis on collections of documents

Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S.

In this work we argue for a new self-learning engine that suggests to the analyst suitable transformation methods and weighting schemas for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine that explores different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluates and compares the solutions through different quality indices (e.g., weighted Silhouette), and presents the top 3 solutions to the analyst. SELF-DATA will also include a knowledge base storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. The current implementation of SELF-DATA runs on Apache Spark, a state-of-the-art distributed computing framework. The preliminary validation, performed on 4 collections of documents, highlights that the TF-IDF and logarithmic entropy weighting methods are effective at measuring item relevance in sparse datasets, and that the LSI method outperforms PCA in the presence of a larger feature domain.
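To make the two weighting schemas named in the abstract concrete, the following is a minimal sketch in plain Python, not the SELF-DATA implementation (which runs on Apache Spark). It assumes the common textbook definitions: TF-IDF as raw term frequency times log(N / document frequency), and log-entropy as log(1 + tf) times a global entropy weight. The paper may use different variants; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_counts(docs):
    """Tokenize on whitespace and count the terms of each document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(counts):
    """TF-IDF weight: tf * log(N / df), assuming the textbook definition."""
    n_docs = len(counts)
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts for term in c)
    return [{t: tf * math.log(n_docs / df[t]) for t, tf in c.items()}
            for c in counts]


def log_entropy(counts):
    """Log-entropy weight: log(1 + tf) * (1 + sum_j p log p / log N).

    Requires at least two documents, otherwise log N is zero.
    """
    n_docs = len(counts)
    # Global frequency: total occurrences of each term in the collection.
    gf = Counter()
    for c in counts:
        gf.update(c)
    global_weight = {}
    for t in gf:
        entropy = 0.0
        for c in counts:
            if t in c:
                p = c[t] / gf[t]
                entropy += p * math.log(p)
        global_weight[t] = 1.0 + entropy / math.log(n_docs)
    return [{t: math.log(1 + tf) * global_weight[t] for t, tf in c.items()}
            for c in counts]


# Hypothetical toy corpus, used only for illustration.
docs = ["spark cluster spark", "cluster analysis", "document cluster"]
counts = term_counts(docs)
tfidf_weights = tf_idf(counts)
log_entropy_weights = log_entropy(counts)
```

Under both schemas a term that is spread evenly over every document ("cluster" in the toy corpus) receives a weight of zero, while a term concentrated in a single document ("spark") keeps a positive weight. This is the sense in which such schemas measure item relevance in sparse collections.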

Prompting the data transformation activities for cluster analysis on collections of documents / Cerquitelli, T., Di Corso, E., Ventura, F., Chiusano, S. - ELECTRONIC. - (2017), pp. 1-8. (25th Italian Symposium on Advanced Database Systems, SEBD 2017, Squillace Lido, Catanzaro, Italy, June 25th-29th, 2017).


Publication year

2017

Appears in types

4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
paper_32.pdf (open access; Type: 2a Post-print, publisher's version / Version of Record; License: Creative Commons)	408.03 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2707902

Warning: the data displayed have not been validated by the university.

PORTO @ Archivio Istituzionale della Ricerca
