ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema-the Documentation Test Sheet (dts)-that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the dts to investigate which information were present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.

Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation / Rondina, Marco; Vetro', Antonio; De Martin, Juan Carlos. - 14115:(2023), pp. 79-91. (Intervento presentato al convegno 22nd Portuguese Conference on Artificial Intelligence tenutosi a Horta, Faial Island, Azores nel September 5 – September 8, 2023) [10.1007/978-3-031-49008-8_7].

Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation

Rondina, Marco;Vetro', Antonio;De Martin, Juan Carlos
2023

Abstract

ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema-the Documentation Test Sheet (dts)-that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the dts to investigate which information were present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.
2023
978-3-031-49007-1
978-3-031-49008-8
File in questo prodotto:
File Dimensione Formato  
01-1_dts_postprint.pdf

embargo fino al 15/12/2024

Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 329.92 kB
Formato Adobe PDF
329.92 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
978-3-031-49008-8_7.pdf

non disponibili

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 586.21 kB
Formato Adobe PDF
586.21 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2981538