DBSCOUT: A Density-based Method for Scalable Outlier Detection in Very Large Datasets / Corain, Matteo; Garza, Paolo; Asudeh, Abolfazl. - ELECTRONIC. - (2021), pp. 37-48. (Paper presented at the 2021 IEEE 37th International Conference on Data Engineering (ICDE), held in Chania, Greece, 19-22 April 2021) [10.1109/ICDE51399.2021.00011].
DBSCOUT: A Density-based Method for Scalable Outlier Detection in Very Large Datasets
Corain, Matteo; Garza, Paolo; Asudeh, Abolfazl
2021
Abstract
Recent technological advancements have made it possible to generate and collect huge amounts of data daily. This data is used for different purposes that may impact us on an unprecedented scale. Understanding the data, including detecting its outliers, is a critical step before utilizing it. Outlier detection has been well studied in the literature, but existing approaches fail to scale to such very large settings. In this paper, we propose DBSCOUT, an efficient exact algorithm for outlier detection with linear complexity that can run in parallel over multiple independent machines, making it suitable for settings with billions of tuples. Besides the theoretical analysis, our experimental results confirm orders of magnitude improvement over the existing work, proving the efficiency, scalability, and effectiveness of our approach.

File | Description | Type | License | Size | Format | Availability
---|---|---|---|---|---|---
DBSCOUT.pdf | Post-print version of the article | 2a Post-print, publisher's version / Version of Record | Non-public, private/restricted access | 1.83 MB | Adobe PDF | Not available
DBSCOUTAcceptedVersion.pdf | Accepted version of the article | 2. Post-print / Author's Accepted Manuscript | Public, all rights reserved | 1.82 MB | Adobe PDF | Open access
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2912196