The voracious appetite of GenAI for data has raised a number of copyright issues. We conducted a Systematic Literature Review (SLR) focusing on the types of copyright-protected data involved, the main challenges for compliance, and the ongoing mitigation efforts. The study, covering literature published between 2019 and the first half of 2025, examines the entire Generative AI pipeline in light of EU copyright legislation. At the input stage, recurring issues include opacity surrounding the use of training data, misalignment between content licences and their usage, and the lack of mechanisms to regulate web scrapers collecting training material. Concerns also arise from fine-tuning aimed at stylistic emulation, as well as inadvertent memorisation and verbatim reproduction of training material. Various mitigation strategies have emerged: watermarking, adversarial perturbation, training data attribution with eventual royalty distribution, synthetic datasets, and machine-readable Text and Data Mining (TDM) opt-out mechanisms. Output-related vulnerabilities are similarly relevant, as adversarial attacks may extract sensitive data or generate style-imitating content. Retrieval-Augmented Generation (RAG) offers improved transparency but remains prone to misattribution. The review highlights ongoing research on enhancing the transparency of GenAI output, including provenance-tracking protocols, blockchain-based integrity systems, watermarking techniques, and inference-time defences. Given the heterogeneity of the field, this review prioritised breadth over depth, leaving room for future studies while serving as a foundational guide for readers with prior exposure to, or a developing understanding of, the field.
The findings underscore the necessity for enhanced copyright safeguards — potentially structured as multi-layered protective systems — and cross-sectoral collaboration to align rapid generative AI development with robust intellectual property protection.
Generative AI and intellectual property: a systematic literature review of possible issues and mitigations / Arnaudo, Anna; Coppola, Riccardo; Morisio, Maurizio; Vetro, Antonio; Borghi, Maurizio; Raso, Riccardo; Khan, Bryan. - In: PeerJ Computer Science. - ISSN 2376-5992. - Electronic. - (2026).
Generative AI and intellectual property: a systematic literature review of possible issues and mitigations
Anna Arnaudo; Riccardo Coppola; Maurizio Morisio; Antonio Vetro; Maurizio Borghi
2026
https://hdl.handle.net/11583/3010554
