ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema-the Documentation Test Sheet (dts)-that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the dts to investigate which information were present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.
Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation / Rondina, Marco; Vetro', Antonio; De Martin, Juan Carlos. - 14115:(2023), pp. 79-91. (Intervento presentato al convegno 22nd Portuguese Conference on Artificial Intelligence tenutosi a Horta, Faial Island, Azores nel September 5 – September 8, 2023) [10.1007/978-3-031-49008-8_7].
Completeness of Datasets Documentation on ML/AI Repositories: An Empirical Investigation
Rondina, Marco;Vetro', Antonio;De Martin, Juan Carlos
2023
Abstract
ML/AI is the field of computer science and computer engineering that arguably received the most attention and funding over the last decade. Data is the key element of ML/AI, so it is becoming increasingly important to ensure that users are fully aware of the quality of the datasets that they use, and of the process generating them, so that possible negative impacts on downstream effects can be tracked, analysed, and, where possible, mitigated. One of the tools that can be useful in this perspective is dataset documentation. The aim of this work is to investigate the state of dataset documentation practices, measuring the completeness of the documentation of several popular datasets in ML/AI repositories. We created a dataset documentation schema-the Documentation Test Sheet (dts)-that identifies the information that should always be attached to a dataset (to ensure proper dataset choice and informed use), according to relevant studies in the literature. We verified 100 popular datasets from four different repositories with the dts to investigate which information were present. Overall, we observed a lack of relevant documentation, especially about the context of data collection and data processing, highlighting a paucity of transparency.File | Dimensione | Formato | |
---|---|---|---|
01-1_dts_postprint.pdf
Open Access dal 16/12/2024
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
Pubblico - Tutti i diritti riservati
Dimensione
329.92 kB
Formato
Adobe PDF
|
329.92 kB | Adobe PDF | Visualizza/Apri |
978-3-031-49008-8_7.pdf
accesso riservato
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
586.21 kB
Formato
Adobe PDF
|
586.21 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2981538