In this work we argue for a new self-learning engine that suggests to the analyst suitable transformation methods and weighting schemas for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine that explores different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluates and compares the resulting solutions through different quality indices (e.g., weighted Silhouette), and presents the top three solutions to the analyst. SELF-DATA will also include a knowledge base storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. SELF-DATA’s current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. A preliminary validation on four collections of documents indicates that the TF-IDF and logarithmic entropy weighting methods are effective at measuring item relevance in sparse datasets, and that LSI outperforms PCA when the feature domain is larger.
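To make the two weighting schemas named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the SELF-DATA implementation (which runs on Apache Spark), and the abstract does not give the exact formulas used; the code assumes the common textbook definitions, namely TF-IDF as raw term frequency times log(N / document frequency), and log-entropy as log(1 + tf) times a global weight 1 + sum_j(p_ij log p_ij) / log N. The function names `tf_idf` and `log_entropy` and the dict-based document representation are illustrative choices.

```python
import math


def tf_idf(counts):
    """TF-IDF weights. counts: list of documents, each a dict term -> raw frequency.

    Assumed formula: tf * log(N / df), with N the number of documents
    and df the number of documents containing the term.
    """
    n_docs = len(counts)
    df = {}
    for doc in counts:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [
        {t: f * math.log(n_docs / df[t]) for t, f in doc.items()}
        for doc in counts
    ]


def log_entropy(counts):
    """Log-entropy weights for the same input format.

    Assumed formula: local weight log(1 + tf) times the global weight
    1 + sum_j(p_ij * log p_ij) / log N, where p_ij = tf_ij / gf_i and
    gf_i is the total frequency of term i over the collection.
    """
    n_docs = len(counts)
    gf = {}
    for doc in counts:
        for t, f in doc.items():
            gf[t] = gf.get(t, 0) + f
    # Accumulate sum_j p_ij * log p_ij for every term (always <= 0).
    entropy = {t: 0.0 for t in gf}
    for doc in counts:
        for t, f in doc.items():
            p = f / gf[t]
            entropy[t] += p * math.log(p)
    # Guard against log(1) = 0 for a single-document collection.
    log_n = math.log(n_docs) if n_docs > 1 else 1.0
    global_w = {t: 1.0 + entropy[t] / log_n for t in gf}
    return [
        {t: math.log(1 + f) * global_w[t] for t, f in doc.items()}
        for doc in counts
    ]


# Toy collection: "a" occurs evenly in both documents, "b" in only one.
docs = [{"a": 1, "b": 2}, {"a": 1}]
tfidf_weights = tf_idf(docs)
entropy_weights = log_entropy(docs)
```

Under these definitions, a term spread evenly over every document (here `"a"`) receives zero weight under both schemas, while a term concentrated in one document (here `"b"`) keeps its full local weight, which is the behaviour that makes both schemas useful for ranking item relevance in sparse collections.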
Prompting the data transformation activities for cluster analysis on collections of documents / Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. - Electronic. - (2017), pp. 1-8. (Paper presented at the 25th Italian Symposium on Advanced Database Systems, SEBD 2017, held in Squillace Lido, Catanzaro, Italy, June 25th-29th, 2017.)
Title: | Prompting the data transformation activities for cluster analysis on collections of documents | |
Authors: | Cerquitelli, T.; Di Corso, E.; Ventura, F.; Chiusano, S. | |
Publication date: | 2017 | |
Appears in types: | 4.1 Contribution in conference proceedings |
Files in this record:
File | Description | Type | License | |
---|---|---|---|---|
paper_32.pdf | 2a Post-print publisher's version / Version of Record | | Visible to all |
http://hdl.handle.net/11583/2707902