On the use of Pretrained Language Models for Legal Italian Document Classification / Benedetto, Irene; Sportelli, Gianpiero; Bertoldo, Sara; Tarasconi, Francesco; Cagliero, Luca; Giacalone, Giuseppe. - In: PROCEDIA COMPUTER SCIENCE. - ISSN 1877-0509. - ELECTRONIC. - 225:(2023), pp. 2244-2253. (Paper presented at the 27th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems, held in Athens, Greece, September 6-8, 2023) [10.1016/j.procs.2023.10.215].
On the use of Pretrained Language Models for Legal Italian Document Classification
Irene Benedetto; Luca Cagliero
2023
Abstract
Document classification helps law professionals browse and retrieve content more effectively. Pretrained Language Models, such as BERT, have become established for legal document classification. However, legal content is highly diversified: documents vary in length from very short maxims to relatively long judgements, and certain document types are rich in domain-specific expressions and can be annotated with multiple labels from domain-specific taxonomies. This paper studies to what extent existing pretrained models are suited to the legal domain. Specifically, we examine a real business case focused on Italian legal document classification. On a proprietary dataset with thousands of diversified categories (e.g., legal judgements, maxims, and legal news), we explore the use of Pretrained Language Models adapted to handle various content types. We collect both quantitative and qualitative results, highlighting best and worst cases, anomalous categories, and limitations of currently available models.
File | Type | License | Size | Format
---|---|---|---|---
showpdf.pdf (open access) | 2. Post-print / Author's Accepted Manuscript | Creative Commons | 279.75 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2982618