Data miners' little helper: data transformation activity cues for cluster analysis on document collections

Cerquitelli, Tania; Di Corso, Evelina; Ventura, Francesco; Chiusano, Silvia Anna

doi:10.1145/3102254.3102288

In this paper we propose a new self-learning engine that streamlines the analytics process by enabling analysts to mine massive data repositories with minimal user intervention. In the context of cluster analysis on a collection of documents, this new system, named SELF-DATA (SELF-learning DAta TrAnsformation), suggests to the analyst how to configure the whole mining process for a given dataset. SELF-DATA relies on an engine that explores different data weighting schemas (e.g., normalized term frequencies) and data transformation methods (e.g., PCA) before applying the cluster analysis, evaluates and compares solutions through different quality indices (e.g., weighted Silhouette), and presents the top-k solutions to the analyst. SELF-DATA will also include a knowledge base storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best configuration of the whole mining process for an unexplored dataset. The first version of SELF-DATA, running on Apache Spark, has been validated on five collections of documents. Experimental results highlight that TF-IDF and logarithmic entropy are effective for measuring item relevance in sparse datasets, and that the LSI method outperforms PCA when the dictionary is large.
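The abstract names TF-IDF and logarithmic entropy as weighting schemas but does not give their formulas. The sketch below uses common textbook definitions as an assumption: TF-IDF as tf · log(N / df), and log-entropy as log(1 + tf) multiplied by a global weight 1 + Σ p·log(p) / log(N), where p is a term's count in one document divided by its count in the whole collection. The function names are hypothetical, the tokenizer is a naive whitespace split, and the code is plain Python rather than the Apache Spark implementation of SELF-DATA.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count raw term frequencies per document (naive whitespace tokenizer)."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(counts):
    """Weight each term by tf * log(N / df) (assumed textbook variant)."""
    n_docs = len(counts)
    # Iterating a Counter yields each distinct term once, giving document frequency.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]


def log_entropy(counts):
    """Weight each term by log(1 + tf) * (1 + sum_j p_j * log(p_j) / log(N))."""
    n_docs = len(counts)
    if n_docs < 2:
        raise ValueError("log-entropy needs at least two documents")
    # Global frequency: total occurrences of each term in the collection.
    gf = Counter()
    for c in counts:
        gf.update(c)
    # Start each global weight at 1 and add the (non-positive) entropy terms.
    global_weight = {term: 1.0 for term in gf}
    for c in counts:
        for term, tf in c.items():
            p = tf / gf[term]
            global_weight[term] += p * math.log(p) / math.log(n_docs)
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]


docs = ["spark cluster spark", "cluster analysis", "cluster documents"]
counts = term_counts(docs)
print(tf_idf(counts)[0])
print(log_entropy(counts)[0])
```

Under both schemas a term spread evenly over every document (here `cluster`) receives a weight of about zero, while a term concentrated in a single document (here `spark`) keeps its full local weight.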

Data miners' little helper: data transformation activity cues for cluster analysis on document collections / Cerquitelli, T., Di Corso, E., Ventura, F., Chiusano, S.A. - Print. - (2017), pp. 1-6. (Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, Amantea, June 19-22, 2017) [10.1145/3102254.3102288].

Year: 2017
ISBN: 978-1-4503-5225-3
Type: 4.1 Contribution in conference proceedings

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2678073

Warning: the data shown have not been validated by the university.
