A Graph RAG Approach to Enhance Explainability in Dataset Discovery / Diamantini, Claudia; Mele, Alessandro; Mircoli, Alex; Potena, Domenico; Rossetti, Cristina; Storti, Emanuele. - In: DATA SCIENCE AND ENGINEERING. - ISSN 2364-1185. - (2025). [10.1007/s41019-025-00313-x]

A Graph RAG Approach to Enhance Explainability in Dataset Discovery

Diamantini, Claudia; Mele, Alessandro; Mircoli, Alex; Potena, Domenico; Rossetti, Cristina; Storti, Emanuele
2025

Abstract

Discovering relevant datasets in large, heterogeneous data ecosystems, such as Data Lakes or Data Spaces, is a complex task, often hindered by a lack of transparency and user-centric explanations in the discovery process. Explainability is critical for enabling users to understand why specific datasets are recommended, what information they contain, and how they align with user-defined criteria and preferences. To address these challenges, this work proposes a novel Graph Retrieval-Augmented Generation (Graph RAG) framework to enhance explainability in a platform for the discovery of summary data sources. The proposed approach leverages a Knowledge Graph (KG) to interpret user requests and extract relevant contextual information. These enriched requests are then transformed by a Large Language Model (LLM) into actionable dataset queries for a dataset discovery platform. Candidate solutions are evaluated and enriched with statistical insights on value distributions and contextual knowledge from the KG. Finally, the LLM ranks these solutions based on user preferences, producing a final report. This dual strategy of query enrichment and contextual explanation fosters transparency and enhances user understanding of the discovery process. We demonstrate the effectiveness of the approach through an experimental validation, highlighting its potential to improve both the accuracy and interpretability of dataset discovery.

Files in this item:
File: Diamantini_et_al-2025-Data_Science_and_Engineering.pdf
Access: open access
Description: published paper
Type: 2a Post-print, publisher's version / Version of Record
License: Creative Commons
Size: 2.78 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3005148