Synthetic Data Generation with Large Language Models for Personalized Community Question Answering

Braga, Marco; Kasela, Pranav; Raganato, Alessandro; Pasi, Gabriella

doi:10.1109/wi-iat62293.2024.00057

Personalization in Information Retrieval (IR) has long been studied by the research community. However, there is still a lack of datasets for large-scale evaluation of personalized IR, mainly because collecting and curating high-quality user-related information requires significant cost and time. Furthermore, the creation of datasets for Personalized IR (PIR) tasks is affected by both privacy concerns and the need for accurate user-related data, which are often not publicly available. Recently, researchers have started to explore the use of Large Language Models (LLMs) to generate synthetic datasets, a possible way to obtain data for low-resource tasks. In this paper, we investigate the potential of LLMs for generating synthetic documents to train an IR system for a Personalized Community Question Answering task. To study the effectiveness of IR models fine-tuned on LLM-generated data, we introduce a new dataset, named Sy-SE-PQA. We build Sy-SE-PQA on top of an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities. Starting from questions in SE-PQA, we generate synthetic answers using different prompt techniques and LLMs. Our findings suggest that LLMs have high potential in generating data tailored to users' needs. The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
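The generation step described in the abstract (starting from a question and producing a synthetic answer under different prompt styles) could be sketched roughly as follows. This is only an illustrative sketch, not the authors' pipeline: the `Question` fields, the prompt wording, the `max_history` parameter, and the `llm` callable are all hypothetical, and the stub model used in the demo simply returns a fixed string.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Question:
    """Hypothetical record for a community question (field names are assumptions)."""
    title: str
    body: str
    community: str
    # Assumed personalization signal: earlier posts written by the asker.
    user_history: List[str] = field(default_factory=list)


def basic_prompt(q: Question) -> str:
    """Build a non-personalized prompt from the question alone."""
    return (
        f"You are answering a question on the {q.community} community.\n"
        f"Title: {q.title}\n"
        f"Question: {q.body}\n"
        "Answer:"
    )


def personalized_prompt(q: Question, max_history: int = 3) -> str:
    """Prepend up to max_history earlier posts by the asker to the basic prompt."""
    history = q.user_history[:max_history]
    if not history:
        # No user context available: fall back to the basic prompt.
        return basic_prompt(q)
    context = "\n".join(f"- {post}" for post in history)
    return f"The asker previously wrote:\n{context}\n\n" + basic_prompt(q)


def generate_synthetic_answers(
    questions: List[Question],
    build_prompt: Callable[[Question], str],
    llm: Callable[[str], str],
) -> List[Tuple[Question, str]]:
    """Pair each question with the text returned by the supplied model callable."""
    return [(q, llm(build_prompt(q))) for q in questions]


if __name__ == "__main__":
    # Stub standing in for a real LLM call; replace with an actual client.
    def stub_llm(prompt: str) -> str:
        return "synthetic answer"

    q = Question(
        title="How do I keep pasta from sticking?",
        body="My spaghetti clumps together after draining.",
        community="cooking",
        user_history=["I mostly cook for one person."],
    )
    for question, answer in generate_synthetic_answers([q], personalized_prompt, stub_llm):
        print(question.title, "->", answer)
```

Passing the prompt builder and the model as plain callables keeps the sketch independent of any particular LLM client, so different prompt styles and models can be swapped in without touching the loop.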

Synthetic Data Generation with Large Language Models for Personalized Community Question Answering / Braga, M., Kasela, P., Raganato, A., Pasi, G. - (2024), pp. 360-366. (2024 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2024, Bangkok (THA), 09-12 December 2024) [10.1109/wi-iat62293.2024.00057].

Year: 2024
ISBN: 979-8-3315-0494-6
Publication type: 4.1 Contribution in conference proceedings

Files in this record:

File	Size	Format
Synthetic_Data_Generation_with_Large_Language_Models_for_Personalized_Community_Question_Answering.pdf (restricted access; type: 2a Post-print, publisher's version / Version of Record; license: Non-public - private/restricted access)	216.42 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3002211

PORTO @ Archivio Istituzionale della Ricerca
