
On the use of Pretrained Language Models for Legal Italian Document Classification / Benedetto, Irene; Sportelli, Gianpiero; Bertoldo, Sara; Tarasconi, Francesco; Cagliero, Luca; Giacalone, Giuseppe. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - ELECTRONIC. - (2023). (Paper presented at the 27th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, held in Athens, Greece, September 6-8, 2023).

On the use of Pretrained Language Models for Legal Italian Document Classification

Irene Benedetto; Luca Cagliero
2023

Abstract

Document classification helps law professionals browse and retrieve content more effectively. Pretrained Language Models, such as BERT, have become established tools for legal document classification. However, legal content is quite diversified: documents vary in length from very short maxims to relatively long judgements, and certain document types are rich in domain-specific expressions and can be annotated with multiple labels from domain-specific taxonomies. This paper studies to what extent existing pretrained models are suited to the legal domain. Specifically, we examine a real business case focused on Italian legal document classification. On a proprietary dataset with thousands of diversified categories spanning several content types (e.g., legal judgements, maxims, and legal news), we explore the use of Pretrained Language Models adapted to handle various content types. We collect both quantitative and qualitative results, highlighting best and worst cases, anomalous categories, and limitations of currently available models.
Files in this record:
showpdf.pdf

Open access

Type: 2. Post-print / Author's Accepted Manuscript
License: Creative Commons
Size: 279.75 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2982618