ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis

Cagliero, Luca; Garza, Paolo; Baralis, Elena

doi:10.1145/3298987

Sentence-based summarization aims at extracting concise summaries of collections of textual documents. Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely either on Latent Semantic Analysis (LSA) or on frequent itemset mining. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value Decomposition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets, because similar itemsets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both kinds of algorithm, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis, and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.
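The pipeline outlined in the abstract (mine frequent itemsets, build an itemset-by-sentence matrix, reduce it with SVD, pick sentences that cover the resulting concepts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `frequent_itemsets` and `summarize`, the parameters `min_support`, `max_size`, `n_concepts`, and `n_sentences`, the whitespace tokenizer, the brute-force itemset enumeration, and the "one sentence per top singular vector" selection rule are all assumptions made for illustration. The abstract does not specify how ELSA weights the matrix or selects sentences.

```python
from itertools import combinations

import numpy as np


def frequent_itemsets(sentences, min_support=2, max_size=2):
    """Brute-force stand-in for a frequent itemset miner (assumption).

    Returns every set of up to `max_size` terms that co-occurs in at
    least `min_support` sentences, plus the per-sentence term sets.
    """
    term_sets = [set(s.lower().split()) for s in sentences]
    vocab = sorted(set().union(*term_sets))
    itemsets = []
    for size in range(1, max_size + 1):
        for cand in combinations(vocab, size):
            support = sum(1 for ts in term_sets if set(cand) <= ts)
            if support >= min_support:
                itemsets.append(frozenset(cand))
    return itemsets, term_sets


def summarize(sentences, n_sentences=2, n_concepts=2, min_support=2):
    """Pick sentences covering the top latent concepts (illustrative rule)."""
    itemsets, term_sets = frequent_itemsets(sentences, min_support)
    if not itemsets:
        return []
    # Binary itemset-by-sentence matrix: rows are itemsets, columns sentences.
    A = np.array([[1.0 if it <= ts else 0.0 for ts in term_sets]
                  for it in itemsets])
    # Rows of Vt relate each latent concept to the sentences.
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for c in range(min(n_concepts, len(S))):
        # Assumed selection rule: for each concept, take the not-yet-chosen
        # sentence with the largest absolute weight.
        for idx in np.argsort(-np.abs(Vt[c])):
            if int(idx) not in chosen:
                chosen.append(int(idx))
                break
        if len(chosen) == n_sentences:
            break
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    docs = [
        "the storm hit the coast at night",
        "the storm damaged houses on the coast",
        "rescue teams reached the houses by boat",
        "rescue teams worked through the night",
    ]
    for sentence in summarize(docs, n_sentences=2):
        print(sentence)
```

Picking one sentence per singular vector is a simple way to limit redundancy, since the singular vectors are mutually orthogonal. A real system would use a dedicated miner (for example Apriori or FP-growth) instead of the exhaustive enumeration above, which is only practical for tiny inputs.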

ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis / Cagliero, Luca; Garza, Paolo; Baralis, Elena. - In: ACM TRANSACTIONS ON INFORMATION SYSTEMS. - ISSN 1046-8188. - ELECTRONIC. - 37:2(2019), pp. 1-33. [10.1145/3298987]

Publication year: 2019
DOI: https://dx.doi.org/10.1145/3298987
Journal title: ACM TRANSACTIONS ON INFORMATION SYSTEMS
Appears in categories: 1.1 Journal article

Files in this item:

File	Size	Format
a21-cagliero.pdf (not available). Description: Main article. Type: 2a Post-print, publisher's version / Version of Record. License: Not public - private/restricted access.	1.28 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2749925

PORTO @ Archivio Istituzionale della Ricerca
