In this paper we propose a new self-learning engine to streamline the analytics process, as it enables analysts to mine massive data repositories with minimal user intervention. In the context of cluster analysis on a collection of documents this new system, named SELF-DATA (SELF-learning DAta TrAnsformation), suggests to the analyst how to configure the whole mining process for a given dataset. SELF-DATA relies on an engine exploring different data weighting schemas (e.g., normalized term frequencies) and data transformation methods (e.g., PCA) before applying the cluster analysis, evaluating and comparing solutions through different quality indices (e.g., weighted Silhouette), and presenting the k-top solutions to the analyst. SELF-DATA will also include a knowledge base storing results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best configuration for the whole mining process for an unexplored dataset. The first development of SELF-DATA running on Apache Spark has been validated on 5 collections of documents. Experimental results highlight that TF-IDF and logarithmic entropy are effective to measure item relevance with sparse datasets, and the LSI method outperforms PCA with a large dictionary.
Data miners' little helper: data transformation activity cues for cluster analysis on document collections / Cerquitelli, Tania; DI CORSO, Evelina; Ventura, Francesco; Chiusano, SILVIA ANNA. - STAMPA. - (2017), pp. 1-6. (Intervento presentato al convegno Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics tenutosi a Amantea nel June 19-22, 2017) [10.1145/3102254.3102288].
Data miners' little helper: data transformation activity cues for cluster analysis on document collections
CERQUITELLI, TANIA;DI CORSO, EVELINA;VENTURA, FRANCESCO;CHIUSANO, SILVIA ANNA
2017
Abstract
In this paper we propose a new self-learning engine to streamline the analytics process, as it enables analysts to mine massive data repositories with minimal user intervention. In the context of cluster analysis on a collection of documents this new system, named SELF-DATA (SELF-learning DAta TrAnsformation), suggests to the analyst how to configure the whole mining process for a given dataset. SELF-DATA relies on an engine exploring different data weighting schemas (e.g., normalized term frequencies) and data transformation methods (e.g., PCA) before applying the cluster analysis, evaluating and comparing solutions through different quality indices (e.g., weighted Silhouette), and presenting the k-top solutions to the analyst. SELF-DATA will also include a knowledge base storing results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best configuration for the whole mining process for an unexplored dataset. The first development of SELF-DATA running on Apache Spark has been validated on 5 collections of documents. Experimental results highlight that TF-IDF and logarithmic entropy are effective to measure item relevance with sparse datasets, and the LSI method outperforms PCA with a large dictionary.Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2678073
Attenzione
Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo