Machine learning algorithms are fundamental components of novel data-informed artificial intelligence (AI) architectures. In this domain, representative datasets play a central role in shaping the trajectory of AI development, because they are needed to train machine learning components properly. Proper training has multiple benefits: it reduces the final model's complexity, power, and uncertainties. In this paper, we investigate, from a theoretical perspective, the reliability of the epsilon-representativeness method for assessing dataset similarity in the case of decision trees. We focus on the family of decision trees because it includes a wide variety of models known to be explainable. We provide a result guaranteeing that if two datasets are related by epsilon-representativeness, i.e., the points of one lie within distance epsilon of points of the other, then the predictions of the classic decision tree are similar. Experimentally, we also observe that epsilon-representativeness correlates significantly with the ordering of the feature importances. Moreover, we extend the results experimentally to unseen vehicle collision data for XGBoost, a machine learning component widely adopted for tabular data.
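The notion of epsilon-representativeness mentioned in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the definition (every point of a dataset has some point of the candidate dataset within Euclidean distance epsilon); the function names are hypothetical and this is not the authors' implementation.

```python
import math


def representativeness_epsilon(dataset, candidate):
    """Smallest epsilon such that every point of `dataset` has a point of
    `candidate` within Euclidean distance epsilon (assumed reading of the
    definition). Both arguments are non-empty sequences of coordinate tuples."""
    # For each point, take the distance to its nearest candidate point,
    # then take the worst case over the whole dataset.
    return max(min(math.dist(x, c) for c in candidate) for x in dataset)


def is_epsilon_representative(dataset, candidate, epsilon):
    """True if `candidate` covers `dataset` within distance `epsilon`."""
    return representativeness_epsilon(dataset, candidate) <= epsilon


# Toy data, invented purely for illustration.
full = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
subset = [(0.0, 0.1), (1.0, 0.1)]
eps = representativeness_epsilon(full, subset)
```

Under this reading, a smaller value of `eps` means the candidate dataset covers the original more tightly. The pairwise scan costs O(n·m) distance evaluations, which is adequate for a sketch but not for large datasets.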

Application of the Representative Measure Approach to Assess the Reliability of Decision Trees in Dealing with Unseen Vehicle Collision Data / Perera-Lago, Javier; Toscano-Duran, Victor; Paluzo-Hidalgo, Eduardo; Narteni, Sara; Rucco, Matteo. - 2156:(2024), pp. 384-395. (Paper presented at The 2nd world conference on eXplainable Artificial Intelligence (xAI 2024), held in Valletta (Malta), 17-19 July 2024) [10.1007/978-3-031-63803-9_21].

Application of the Representative Measure Approach to Assess the Reliability of Decision Trees in Dealing with Unseen Vehicle Collision Data

Sara Narteni
2024

ISBN: 978-3-031-63802-2; 978-3-031-63803-9
Files in this record:

xAI2024_published_seville.pdf
Access: restricted
Type: 2a Post-print, publisher's version / Version of Record
License: Not public - private/restricted access
Size: 784.35 kB
Format: Adobe PDF

XAI24_RepDecisionTrees (2).pdf
Access: embargoed until 10/07/2025
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 474.96 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2990592