Inferring multilingual domain-specific word embeddings from large document corpora

Cagliero, Luca; La Quatra, Moreno

doi:10.1109/ACCESS.2021.3118093

Distributed vector representations of words are now well established in Natural Language Processing. Several domain adaptation techniques have been proposed to tailor general-purpose vector spaces to the context under analysis, and they all require sufficiently large document corpora tailored to the target domains. However, in several cross-lingual NLP domains, both sufficiently large domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper addresses this issue. It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process: it first automatically identifies domain-specific words, and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation on the established word retrieval task. To this end, a new benchmark multilingual dataset, derived from Wikipedia, has been released. The results confirmed the effectiveness and usability of the proposed approach.
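The two-step process described in the abstract can be illustrated with a small sketch. This is not the authors' implementation, and every name in it is hypothetical. It assumes that the source-language general-purpose and domain-specific vectors live in the same space, that the target-language general-purpose vectors are already aligned with the source ones, and that domain-specific words can be ranked by how far their vectors moved during adaptation. For brevity, a ridge-regularized linear map stands in for the non-linear transformations mentioned in the abstract.

```python
import numpy as np

def domain_specific_words(general, domain, top_k):
    """Step 1 (assumed criterion): rank shared words by the cosine distance
    between their general-purpose and domain-specific vectors."""
    shared = sorted(set(general) & set(domain))

    def cos_dist(word):
        g, d = general[word], domain[word]
        return 1.0 - float(g @ d / (np.linalg.norm(g) * np.linalg.norm(d)))

    return sorted(shared, key=cos_dist, reverse=True)[:top_k]

def fit_transformation(general, domain, words, ridge=1e-3):
    """Step 2a: learn, on the source language, a map from general-purpose to
    domain-specific vectors. Linear ridge regression is a simplification;
    the abstract refers to non-linear transformations."""
    X = np.stack([general[w] for w in words])
    Y = np.stack([domain[w] for w in words])
    dim = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + ridge * I)^-1 X^T Y
    return np.linalg.solve(X.T @ X + ridge * np.eye(dim), X.T @ Y)

def infer_target_domain_vectors(target_general, W):
    """Step 2b: reuse the learned map on the (assumed aligned) target-language
    general-purpose vectors."""
    return {word: vec @ W for word, vec in target_general.items()}

def retrieve(query_vec, vectors, k=5):
    """Word retrieval by cosine similarity, as a stand-in for the extrinsic
    evaluation task."""
    words = list(vectors)
    M = np.stack([vectors[w] for w in words])
    sims = M @ query_vec / (np.linalg.norm(M, axis=1) * np.linalg.norm(query_vec))
    return [words[i] for i in np.argsort(-sims)[:k]]
```

In this sketch, the transformation is fitted only on the words selected in step 1 and then applied to every target-language vector; whether the paper restricts the transformation in the same way is not stated in the abstract.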

Inferring multilingual domain-specific word embeddings from large document corpora / Cagliero, Luca; La Quatra, Moreno. - In: IEEE ACCESS. - ISSN 2169-3536. - Electronic. - 9:(2021), pp. 137309-137321. [10.1109/ACCESS.2021.3118093]

Year of publication: 2021
DOI: https://dx.doi.org/10.1109/ACCESS.2021.3118093
Journal title: IEEE ACCESS
Appears in types: 1.1 Journal article

Files in this item:

File	Size	Format
IEEEAccess_2021_pre.pdf (restricted access). Description: Pre-print version of the manuscript. Type: 1. Preprint / submitted version [pre-review]. License: Not public - private/restricted access.	1.1 MB	Adobe PDF
Inferring_Multilingual_Domain-Specific_Word_Embeddings_From_Large_Document_Corpora.pdf (open access). Type: 2a Post-print, publisher's version / Version of Record. License: Creative Commons.	1.89 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2927412

PORTO @ Archivio Istituzionale della Ricerca
