This survey provides a synthesis of the practical copyright compliance challenges inherent in the input stages of Generative Artificial Intelligence (GenAI) systems. Specifically, we sought to address three research questions, examining the types of copyright-protected data involved, the corresponding challenges for copyright compliance in input data processing practices, and potential mitigation strategies proposed by researchers. To achieve this, we conducted a Systematic Literature Review (SLR), aimed at establishing a methodical foundation for the research. A recurring theme is the opacity of training data usage, particularly within proprietary models. This review highlights the tension between the scale of data acquisition necessary for pre-training foundation models and the intensive curation required for fine-tuning, alongside the misalignment of content licences. Particular attention is dedicated to the absence of mechanisms to govern bot activity, and to the risk of copyright infringement through either model fine-tuning aimed at stylistic emulation or inadvertent memorisation of protected training data. To counter these risks, various mitigation strategies have emerged, including watermarking, adversarial perturbation, training data attribution with eventual distribution of royalties, the use of synthetic datasets, and Text and Data Mining (TDM) opt-out mechanisms in machine-readable formats.
Copyright Infringement Issues and Mitigations in Data for Training Generative AI / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Vetro, Antonio; Borghi, Maurizio; Raso, Riccardo; Khan, Bryan. - (In press). (IEEE BigData 2025, Macau (CHN), 12-18 Dec 2025).
Copyright Infringement Issues and Mitigations in Data for Training Generative AI
Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Antonio Vetro; Maurizio Borghi; Riccardo Raso; Bryan Khan
In press
| File | Size | Format |
|---|---|---|
| with_template__Copyright_Infringement_Issues_and_Mitigations_in_Data_for_Training_Generative_AI (9).pdf | 296.99 kB | Adobe PDF |

Access: restricted
Type: 1. Preprint / submitted version [pre-review]
Licence: Non-public - Private/restricted access
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/3006276
