Scaling data mining activities on very large datasets / Grand, Alberto. - Print. - (2013).

Scaling data mining activities on very large datasets

GRAND, ALBERTO
2013

Abstract

This thesis addresses the scalability of data mining techniques, with specific emphasis on association rule and frequent itemset mining. In particular, it proposes a scalable itemset mining approach relying on (i) a persistent (disk-based) representation of the transactional data, (ii) ad hoc data retrieval techniques, and (iii) strategies for integrating existing itemset mining algorithms. A parallel design based on the same approach, which performs itemset extraction in a parallel and/or distributed environment, is also described. To address the manageability of frequent itemsets, a concise disk-based representation, together with a set of querying techniques, is proposed. This work has been preliminarily validated in the Semantic Web domain, where it is used to identify semantic relationships in textual collections with a semi-automatic approach. As a minor topic, the extraction of frequent itemsets from data streams, modelled as a set of transactional data windows, has also been tackled by proposing an online/offline analysis approach. Finally, a software platform, developed in a joint effort with the Institute for Cancer Research and Treatment (Candiolo), is presented; it supports the collection, integration, and analysis of heterogeneous data from the molecular oncology field.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2507907