In the era of Big Data, effective data reduction through feature selection is of paramount importance for machine learning. This paper presents GLEm-Net (Grouped Lasso with Embeddings Network), a novel neural framework that processes both categorical and numerical features to reduce the dimensionality of data while retaining as much information as possible. By integrating embedding layers, GLEm-Net manages high-cardinality categorical features and compresses their information into a lower-dimensional space. By using a grouped Lasso penalty function in its architecture, GLEm-Net processes categorical and numerical data simultaneously, reducing high-dimensional data while preserving the essential information. We test GLEm-Net on a real-world industrial application with 6 million records, each described by a mixture of 19 numerical and 7 categorical features, and with a strong class imbalance. A comparative analysis against state-of-the-art methods shows that, despite the difficulty of building a high-performance model, GLEm-Net outperforms the other methods in both feature selection and classification, with a better balance in the selection of numerical and categorical features.
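
The grouped Lasso idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each input feature owns a group of first-layer weight rows (one row for a numerical feature, one row per embedding dimension for a categorical feature) and that the penalty is the standard group-lasso form, the sum over groups of lambda * sqrt(group size) * L2 norm of the group. The names `group_lasso_penalty`, `groups`, `weights`, and `lam` are hypothetical and chosen only for illustration.

```python
import math


def group_lasso_penalty(weights, groups, lam=0.01):
    """Return sum over groups of lam * sqrt(|g|) * ||W[g]||_2.

    weights: list of rows, one row of first-layer weights per input column.
    groups:  dict mapping a feature name to the input-column indices it owns.
    The sqrt(|g|) scaling is the textbook group-lasso choice and is an
    assumption here, not something stated in the abstract.
    """
    total = 0.0
    for cols in groups.values():
        # Squared L2 norm of all weights belonging to this feature.
        squared_norm = sum(w * w for c in cols for w in weights[c])
        total += lam * math.sqrt(len(cols)) * math.sqrt(squared_norm)
    return total


# Hypothetical layout: one numerical feature (1 input column) and one
# categorical feature whose embedding has 2 dimensions (2 input columns).
groups = {"num_0": [0], "cat_0": [1, 2]}
weights = [
    [3.0, 4.0],  # num_0
    [0.0, 0.0],  # cat_0, embedding dimension 0
    [0.0, 0.0],  # cat_0, embedding dimension 1
]
penalty = group_lasso_penalty(weights, groups, lam=1.0)
```

Under these assumptions, a feature whose whole group of weights is driven to zero (here `cat_0`) contributes nothing to the penalty and can be treated as discarded, which is how a grouped penalty can select numerical and categorical features on the same footing.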

GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features / De Santis, Francesco; Giordano, Danilo; Mellia, Marco; Damilano, Alessia. - Electronic. - (2023), pp. 4240-4247. (Paper presented at the 2023 IEEE International Conference on Big Data (Big Data), held in Sorrento, Italy, on 15-18 December 2023) [10.1109/BigData59044.2023.10386901].

GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features

Francesco De Santis;Danilo Giordano;Marco Mellia;Alessia Damilano
2023

ISBN: 979-8-3503-2445-7
Files in this item:

Desantis_Bigdata.pdf
Access: open access
Description: author's version
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 539.23 kB
Format: Adobe PDF

GLEm-Net_Unified_Framework_for_Data_Reduction_with_Categorical_and_Numerical_Features.pdf
Access: not available
Type: 2a. Post-print, publisher's version / Version of Record
License: Non-public - Private/restricted access
Size: 597.25 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2984914