Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence architecture. In this domain, the imperative role of representative datasets is a cornerstone in shaping the trajectory of artificial intelligence (AI) development. Representative datasets are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model’s complexity, power, and uncertainties. In this paper, we investigate the reliability of the epsilon-representativeness method to assess the dataset similarity from a theoretical perspective for decision trees. We decided to focus on the family of decision trees because it includes a wide variety of models known to be explainable. Thus, in this paper, we provide a result guaranteeing that if two datasets are related by epsilon-representativeness, i.e., both of them have points closer than epsilon, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that epsilon-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGboost, a machine learning component widely adopted for dealing with tabular data.
Application of the Representative Measure Approach to Assess the Reliability of Decision Trees in Dealing with Unseen Vehicle Collision Data / Perera-Lago, Javier; Toscano-Duran, Victor; Paluzo-Hidalgo, Eduardo; Narteni, Sara; Rucco, Matteo. - 2156:(2024), pp. 384-395. (Intervento presentato al convegno The 2nd world conference on eXplainable Artificial Intelligence (xAI 2024) tenutosi a La Valletta (Malta) nel 17-19 July 2024) [10.1007/978-3-031-63803-9_21].
Application of the Representative Measure Approach to Assess the Reliability of Decision Trees in Dealing with Unseen Vehicle Collision Data
Sara Narteni;
2024
Abstract
Machine learning algorithms are fundamental components of novel data-informed Artificial Intelligence architecture. In this domain, the imperative role of representative datasets is a cornerstone in shaping the trajectory of artificial intelligence (AI) development. Representative datasets are needed to train machine learning components properly. Proper training has multiple impacts: it reduces the final model’s complexity, power, and uncertainties. In this paper, we investigate the reliability of the epsilon-representativeness method to assess the dataset similarity from a theoretical perspective for decision trees. We decided to focus on the family of decision trees because it includes a wide variety of models known to be explainable. Thus, in this paper, we provide a result guaranteeing that if two datasets are related by epsilon-representativeness, i.e., both of them have points closer than epsilon, then the predictions by the classic decision tree are similar. Experimentally, we have also tested that epsilon-representativeness presents a significant correlation with the ordering of the feature importance. Moreover, we extend the results experimentally in the context of unseen vehicle collision data for XGboost, a machine learning component widely adopted for dealing with tabular data.File | Dimensione | Formato | |
---|---|---|---|
xAI2024_published_seville.pdf
accesso riservato
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
784.35 kB
Formato
Adobe PDF
|
784.35 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
XAI24_RepDecisionTrees (2).pdf
embargo fino al 10/07/2025
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
Pubblico - Tutti i diritti riservati
Dimensione
474.96 kB
Formato
Adobe PDF
|
474.96 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2990592