Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing

Braga, Marco; Milanese, Gian Carlo; Pasi, Gabriella

Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the F1 measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat
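To make the notion of "replicating" a traditional method concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes a toy stopword list, a hard-coded stand-in for an LLM's output, and a simple per-token agreement score; the abstract does not specify how accuracy is computed, so the metric and all names here (`TOY_STOPWORDS`, `keep_mask`, `agreement`) are hypothetical.

```python
from typing import Iterable, List, Set

# Toy stopword list, assumed for illustration only; real lists are language-specific.
TOY_STOPWORDS: Set[str] = {"the", "is", "on", "a", "of"}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Context-free baseline: drop every token whose lowercase form is listed."""
    return [t for t in tokens if t.lower() not in stopwords]


def keep_mask(tokens: List[str], kept: List[str]) -> List[bool]:
    """For each input token, record whether it survives in `kept`.

    Assumes `kept` is an order-preserving subsequence of `tokens`.
    """
    mask: List[bool] = []
    i = 0
    for t in tokens:
        if i < len(kept) and kept[i] == t:
            mask.append(True)
            i += 1
        else:
            mask.append(False)
    return mask


def agreement(tokens: List[str], baseline: List[str], candidate: List[str]) -> float:
    """Fraction of tokens on which both outputs make the same keep/drop decision."""
    if not tokens:
        return 1.0
    a = keep_mask(tokens, baseline)
    b = keep_mask(tokens, candidate)
    return sum(x == y for x, y in zip(a, b)) / len(tokens)


tokens = ["The", "cat", "is", "on", "the", "mat"]
baseline = remove_stopwords(tokens, TOY_STOPWORDS)
# Hypothetical LLM output that keeps "on" because it judges it meaningful in context.
llm_output = ["cat", "on", "mat"]
score = agreement(tokens, baseline, llm_output)
```

In this toy case the two outputs disagree only on "on", so the score is 5/6. A context-aware method may legitimately diverge from the list-based baseline, which is why agreement with the baseline and downstream classification quality are separate questions.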

Investigating Large Language Models’ Linguistic Abilities for Text Preprocessing / Braga, Marco; Milanese, Gian Carlo; Pasi, Gabriella. - (In press). (2025 IEEE/WIC International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), London, 15-18 November 2025).

Publication year

In press

Appears in categories

4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
_Braga__Milanese__LLM_for_text_processing__WI_IAT_2025_.pdf (restricted access). Description: Accepted version of the paper. Type: 2. Post-print / Author's Accepted Manuscript. License: Not public, private/restricted access.	285.74 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3010273

PORTO @ Archivio Istituzionale della Ricerca
