Classifying Gas Pipe Damage Descriptions in Low-Diversity Corpora / Catalano, Luca; D'Asaro, Federico; Pantaleo, Michele; Jamshed, Minal; Acharjee, Prima; Giulietti, Nicola; Fossat, Eugenio; Rizzo, Giuseppe. - 4112:(2025). (CLiC-it 2025 – Eleventh Italian Conference on Computational Linguistics, Cagliari (ITA), September 24-26, 2025).

Classifying Gas Pipe Damage Descriptions in Low-Diversity Corpora

Catalano, Luca; D'Asaro, Federico; Pantaleo, Michele; Jamshed, Minal; Acharjee, Prima; Rizzo, Giuseppe
2025

Abstract

This paper introduces a retrieval-based text classification framework for gas pipe damage descriptions, with a specific focus on determining patch applicability. Because free-text damage descriptions are scarce in this domain, we construct a synthetic binary classification dataset, CoRe-S. It consists of 11,904 damage descriptions generated from structured attributes, each labeled as either Patchable (True) or Unpatchable (False). CoRe-S presents two primary challenges: (i) class imbalance, with positive cases in the minority, and (ii) frequent use of domain-specific terminology, which results in low lexical diversity across descriptions. To quantify this lack of variation, we introduce the Corpus Pairwise Diversity statistic, which measures the degree of lexical dissimilarity between documents in a corpus. We adopt a training-free, retrieval-based classification approach and show that Sentence-BERT-NLI is the most effective encoder under low-diversity conditions, as it captures subtle lexical and semantic differences between otherwise similar documents. To address the class imbalance, we apply random undersampling, which outperforms the other undersampling strategies in our experiments. The proposed retrieval-based classifier significantly outperforms other training-free text classification methods (zero-shot, few-shot, and similarity-based), improving macro F1-score by approximately 35.2% over the second-best method. Our code is publicly available at: https://github.com/links-ads/core-unimodal-retrieval-for-classification.
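
The pipeline outlined in the abstract (balance the labeled pool, embed the descriptions, retrieve the nearest neighbors, vote on the label) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a toy bag-of-words vector stands in for Sentence-BERT-NLI, the k-nearest-neighbor majority vote is an assumed decision rule, and the mean pairwise Jaccard distance is only a simple proxy for lexical diversity, not the paper's Corpus Pairwise Diversity definition. All function names and example descriptions are invented for illustration and do not come from CoRe-S.

```python
import math
import random
from collections import Counter
from itertools import combinations


def tokens(text):
    """Lowercase whitespace tokenizer (toy stand-in for real preprocessing)."""
    return text.lower().split()


def mean_pairwise_jaccard_distance(docs):
    """Average Jaccard distance over all document pairs (0 = same words, 1 = disjoint).

    Assumption: an illustrative proxy only, NOT the paper's statistic.
    Expects at least two non-empty documents.
    """
    sets = [set(tokens(d)) for d in docs]
    dists = [1 - len(a & b) / len(a | b) for a, b in combinations(sets, 2)]
    return sum(dists) / len(dists)


def random_undersample(examples, seed=0):
    """Randomly drop examples so every class matches the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced


def embed(text):
    """Toy bag-of-words vector; a sentence encoder would replace this in practice."""
    return Counter(tokens(text))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def classify(query, index, k=3):
    """Retrieve the k most similar labeled descriptions and return the majority label."""
    q = embed(query)
    ranked = sorted(index, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy descriptions: True = Patchable, False = Unpatchable.
corpus = [
    ("small circular hole on straight steel pipe", True),
    ("minor dent on straight steel pipe", True),
    ("long crack along welded joint", False),
    ("severe corrosion along welded joint", False),
    ("large rupture near valve", False),
    ("deep crack near valve", False),
]

index = random_undersample(corpus, seed=0)  # balanced retrieval pool
print(mean_pairwise_jaccard_distance([text for text, _ in corpus]))
print(classify("small hole on steel pipe", index))
print(classify("crack along welded joint near valve", index))
```

No model is trained here: the labeled descriptions act only as a retrieval index, which is what makes the approach training-free. Swapping `embed` for a sentence encoder changes the similarity scores but not the overall structure.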
Files in this item:
21_main_long.pdf (open access)
Type: 2a Post-print, publisher's version / Version of Record
License: Creative Commons
Size: 6.2 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3002060