GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features / De Santis, Francesco; Giordano, Danilo; Mellia, Marco; Damilano, Alessia. - Electronic. - (2023), pp. 4240-4247. (Paper presented at the 2023 IEEE International Conference on Big Data (Big Data), held in Sorrento, Italy, 15-18 December 2023) [10.1109/BigData59044.2023.10386901].
GLEm-Net: Unified Framework for Data Reduction with Categorical and Numerical Features
Francesco De Santis;Danilo Giordano;Marco Mellia;Alessia Damilano
2023
Abstract
In the era of Big Data, effective data reduction through feature selection is essential for machine learning. This paper presents GLEm-Net (Grouped Lasso with Embeddings Network), a novel neural framework that processes both categorical and numerical features to reduce the dimensionality of data while retaining as much information as possible. Through embedding layers, GLEm-Net handles high-cardinality categorical features and compresses their information into a lower-dimensional space. Through a grouped Lasso penalty in its architecture, GLEm-Net processes categorical and numerical data simultaneously, reducing high-dimensional data while preserving the essential information. We test GLEm-Net on a real-world industrial application with 6 million records, each described by a mixture of 19 numerical and 7 categorical features, with strong class imbalance. A comparative analysis against state-of-the-art methods shows that, despite the difficulty of building a high-performance model, GLEm-Net outperforms the other methods in both feature selection and classification, with a better balance in the selection of numerical and categorical features.

File | Size | Format | |
---|---|---|---|
Desantis_Bigdata.pdf (open access). Description: author's version. Type: 2. Post-print / Author's Accepted Manuscript. License: Public - All rights reserved | 539.23 kB | Adobe PDF | |
GLEm-Net_Unified_Framework_for_Data_Reduction_with_Categorical_and_Numerical_Features.pdf (restricted access). Type: 2a. Post-print, publisher's version / Version of Record. License: Not public - Private/restricted access | 597.25 kB | Adobe PDF | |
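
The abstract describes a grouped Lasso penalty applied jointly to numerical features and embedded categorical features. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the penalty acts on the rows of a first-layer weight matrix `W`, that each numerical feature owns one row while each categorical feature owns one row per embedding dimension, and that groups are weighted by the square root of their size (a common group-Lasso convention, assumed here). The names `group_lasso_penalty`, `selected_features`, `groups`, and the feature names are hypothetical.

```python
import numpy as np

def group_lasso_penalty(W, groups):
    """Sum over groups of sqrt(group size) times the L2 norm of the group's weights.

    W      : 2-D array, one row per input dimension of the first layer.
    groups : dict mapping a feature name to the list of rows of W it owns.
    """
    total = 0.0
    for rows in groups.values():
        block = W[rows, :]
        # np.linalg.norm on a 2-D array defaults to the Frobenius norm,
        # i.e. the L2 norm over all weights in the group.
        total += np.sqrt(block.size) * np.linalg.norm(block)
    return total

def selected_features(W, groups, tol=1e-6):
    """Features whose whole weight group has not been driven to (near) zero."""
    return [name for name, rows in groups.items()
            if np.linalg.norm(W[rows, :]) > tol]

# Hypothetical layout: two numerical features (one row each) and one
# categorical feature represented by a 3-dimensional embedding (three rows).
groups = {"num_a": [0], "num_b": [1], "cat_c": [2, 3, 4]}

W = np.zeros((5, 4))   # 5 input dimensions, 4 hidden units
W[0, :] = 0.5          # weights attached to num_a
W[2:5, :] = 0.1        # weights attached to the embedding of cat_c

penalty = group_lasso_penalty(W, groups)
kept = selected_features(W, groups)
```

Because the penalty is computed per group rather than per weight, a categorical feature is kept or discarded as a whole, regardless of how many embedding dimensions it occupies. In this toy example `num_b` has an all-zero group and is the only feature dropped.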
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2984914