Copyright Infringement Issues and Mitigations in Data for Training Generative AI / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Vetro, Antonio; Borghi, Maurizio; Raso, Riccardo; Khan, Bryan. - (In press). (IEEE BigData 2025, Macau (CHN), 12-18 Dec 2025).

Copyright Infringement Issues and Mitigations in Data for Training Generative AI

Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Antonio Vetro; Maurizio Borghi
In press

Abstract

This survey synthesises the practical copyright compliance challenges inherent in the input stages of Generative Artificial Intelligence (GenAI) systems. Specifically, we address three research questions, examining the types of copyright-protected data involved, the corresponding challenges for copyright compliance in input data processing practices, and the mitigation strategies proposed by researchers. To this end, we conducted a Systematic Literature Review (SLR) to establish a methodological foundation for the research. A recurring theme is the opacity of training data usage, particularly within proprietary models. The review highlights the tension between the scale of data acquisition necessary for pre-training foundation models and the intensive curation required for fine-tuning, alongside the misalignment of content licences. Particular attention is devoted to the absence of mechanisms to govern bot activity and to the risk of copyright infringement through either model fine-tuning aimed at stylistic emulation or inadvertent memorisation of protected training data. To counter these risks, various mitigation strategies have emerged, including watermarking, adversarial perturbation, training data attribution with subsequent distribution of royalties, the use of synthetic datasets, and Text and Data Mining (TDM) opt-out mechanisms in machine-readable formats.
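
Of the mitigation strategies listed above, machine-readable TDM opt-outs are the most directly operational. The Python sketch below is an illustrative assumption, not drawn from the paper: it imagines a data-collection pipeline that consults a site's robots.txt before ingesting a page into a training corpus. The user-agent name "GPTBot" and the example URL are hypothetical placeholders, and a real pipeline would also need to honour other reservation signals (e.g. TDM reservation headers or rights metadata).

from urllib import robotparser
from urllib.parse import urlparse

def may_collect(page_url: str, user_agent: str = "GPTBot") -> bool:
    """Return True only if the site's robots.txt permits `user_agent` to fetch `page_url`."""
    parts = urlparse(page_url)
    robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetch and parse the site's machine-readable crawling rules
    return rp.can_fetch(user_agent, page_url)

if __name__ == "__main__":
    url = "https://example.com/articles/some-protected-work"  # hypothetical URL
    if may_collect(url):
        print("No machine-readable opt-out found; the page may be considered for the corpus.")
    else:
        print("Opt-out detected; the page is excluded from the training corpus.")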
Files in this record:
with_template__Copyright_Infringement_Issues_and_Mitigations_in_Data_for_Training_Generative_AI (9).pdf
Type: 1. Preprint / submitted version [pre-review]
Licence: Non-public - Private/restricted access
Size: 296.99 kB
Format: Adobe PDF
Access: restricted (copy available on request)

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3006276