Scaling associative classification for very large datasets

Venturini, Luca; Baralis, Elena; Garza, Paolo

doi:10.1186/s40537-017-0107-2

Supervised learning algorithms now scale successfully to very large datasets by leveraging in-memory cluster-computing Big Data frameworks. Still, massive datasets with many large-domain categorical features remain a difficult challenge for any classifier, and most off-the-shelf solutions cannot cope with them. In this work we introduce DAC, a Distributed Associative Classifier. DAC exploits ensemble learning to distribute the training of an associative classifier among parallel workers and to improve the final quality of the model. Furthermore, it adopts several novel techniques to reach high scalability without sacrificing quality, including a preventive pruning of classification rules, based on Gini impurity, during the extraction phase. We ran experiments on Apache Spark, on a real large-scale dataset with more than 4 billion records and 800 million distinct categories. The results showed that DAC improves on a state-of-the-art solution in both prediction quality and execution time. Since the generated model is human-readable, it can not only classify new records but also expose both the logic behind each prediction and the properties of the model, making it a useful aid for decision makers.
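
The abstract names Gini impurity as the basis for pruning rules during extraction but does not spell out the criterion. As a rough illustration only, the sketch below computes the Gini impurity of the class labels matched by a candidate rule and compares it with the impurity of the whole dataset. The function names, the `min_gain` parameter, and the keep/prune test are assumptions made for this example, not the authors' actual method.

```python
from collections import Counter
from typing import Hashable, Iterable


def gini_impurity(labels: Iterable[Hashable]) -> float:
    """Gini impurity 1 - sum(p_c^2) over the class labels in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # convention chosen here: an empty set is treated as pure
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def keep_rule(matched_labels, all_labels, min_gain=0.0):
    """Hypothetical pruning test (an assumption, not taken from the paper):
    keep a candidate rule only if the records it matches are purer than the
    dataset as a whole by more than `min_gain`."""
    gain = gini_impurity(all_labels) - gini_impurity(matched_labels)
    return gain > min_gain
```

With a balanced two-class dataset, a rule whose antecedent matches only one class lowers the impurity and would be kept under this test, while a rule that matches both classes in the same proportion as the dataset brings no gain and would be pruned.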

Scaling associative classification for very large datasets / Venturini, L., Baralis, E., Garza, P. - In: JOURNAL OF BIG DATA. - ISSN 2196-1115. - ELECTRONIC. - 4:1(2017), pp. 1-24. [10.1186/s40537-017-0107-2]

Year of publication: 2017
DOI: https://dx.doi.org/10.1186/s40537-017-0107-2
Journal title: JOURNAL OF BIG DATA
Appears in types: 1.1 Journal article

Files in this item:

File	Size	Format
DAC_JBD17.pdf (open access; description: main article, publisher's version; type: 2a Post-print, publisher's version / Version of Record; license: Creative Commons)	1.3 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2695525

PORTO @ Institutional Research Archive (Archivio Istituzionale della Ricerca)
