Sentence-based summarization aims at extracting concise summaries of collections of textual documents.Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely on Latent Semantic Analysis (LSA) and on frequent itemset mining, respectively. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value De-composition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets because similar item-sets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both of the above mentioned algorithms, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.
ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis / Cagliero, Luca.; Garza, Paolo; Baralis, Elena. - In: ACM TRANSACTIONS ON INFORMATION SYSTEMS. - ISSN 1046-8188. - ELETTRONICO. - 37:2(2019), pp. 1-33. [10.1145/3298987]
ELSA: A multilingual document summarization algorithm based on frequent itemsets and latent semantic analysis
Cagliero, Luca.;Garza, Paolo;Baralis, Elena
2019
Abstract
Sentence-based summarization aims at extracting concise summaries of collections of textual documents.Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely on Latent Semantic Analysis (LSA) and on frequent itemset mining, respectively. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value De-composition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets because similar item-sets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both of the above mentioned algorithms, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.File | Dimensione | Formato | |
---|---|---|---|
a21-cagliero.pdf
non disponibili
Descrizione: Articolo principale
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
1.28 MB
Formato
Adobe PDF
|
1.28 MB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2749925