Tackling redundancy in text summarization through different levels of language analysis
Highlights
► The problem of redundancy in text summarization is analyzed from three perspectives.
► The best way of exploiting redundancy in a text summarization system is analyzed.
► Semantic-based methods are the best ones, detecting up to 90% of redundant data.
► Lexical-based approaches detect only 19% of redundant information.
Introduction
Nowadays, the vast amount of information available has become a disadvantage rather than an advantage, since users cannot cope with all of it, and the existing tools may not be suitable for their specific needs and purposes. This poses great challenges for different aspects of present-day society, such as health [34], business [42], administrative procedures [4], or education [44]. These and other fields need to be improved with the help of automatic tools and procedures that make the analysis, processing and interpretation of information and data more effective and efficient. The potential of text summarization (TS) for facilitating information access and helping users to easily manage large amounts of data is reflected in the numerous approaches proposed by the research community in recent years. This task allows users to obtain a brief fragment of text that conveys the essential and most relevant information from a larger one [53]. However, producing good summaries automatically is still a big challenge.
For instance, redundant information, the temporal dimension or coreference resolution are issues that have to be taken into consideration, especially when summarizing a set of documents (multi-document summarization), which makes this task even more challenging [22]. In particular, it is worth mentioning that multi-document summarization systems play a crucial role in the management of such information from different perspectives: i) providing users with condensed versions of texts which contain the essence of a range of documents and serve as substitutes for the original ones (informative summaries); ii) addressing a specific information need expressed in a query (query-focused summaries) or by means of opinions (opinion-oriented summaries); or iii) simply helping people to decide whether it is worth accessing and reading a whole document or not (indicative summaries). These are only a few examples of the most common and well-known kinds of summaries; more types can be found in [52], [26], [37].
Recent efforts have been concentrated on multi-document summarization, for instance [3], [62], [33], as it has turned out to be an essential task given the large amounts of documents that have to be dealt with. This task is also much more complex than summarizing a single document, a difficulty that arises from the diversity of topics within a large set of documents. A good summarization technology aims to combine the main topics with completeness, readability, and conciseness. According to [22], the main differences between single- and multi-document summarization concern: a) the degree of redundancy, which is much higher than in single-document summarization; b) the temporal dimension of the documents; c) the compression rate, which will typically be much lower for collections of related documents than for single-document summaries; and d) the coreference problem across documents. Since 2003, international evaluation forums, such as the Document Understanding Conferences (DUC) or the Text Analysis Conferences (TAC), have issued guidelines only for multi-document summarization tasks.
Motivated by the current need for high-quality summaries, and their potential use and application for facilitating information access in human-computer interaction, it is crucial to investigate and explore different factors, which can be broadly classified into the following groups: 1) methods that help to detect relevant fragments of information; 2) techniques that go beyond the simple extraction of sentences; 3) approaches that consider the linguistic quality of summaries; and 4) approaches that deal with redundant information. Redundancy is a well-known problem in TS, in the sense that a summary should avoid giving a piece of information more than once; otherwise, information will be repeated, affecting the quality of the final summaries and introducing noise. Moreover, there are multiple levels of language processing when humans produce or understand language, comprising phonology, morphology, the lexicon, syntax, semantics, discourse and pragmatics [39]. However, current Natural Language Processing (NLP) systems mainly deal with the lower levels of processing (lexical, syntactic and semantic), partly due to the difficulty of interpretation at the higher levels.
Many different techniques have been proposed to tackle the redundancy problem, and TS systems use a wide range of them to avoid incorporating repeated information in the final summary. To the best of our knowledge, there is no previous study that analyzes the different levels of language analysis (lexical, syntactic and semantic) in the context of TS for removing redundancy. Therefore, the first contribution of this paper is to quantify the influence of the proposed levels of language analysis on detecting redundancy. Specifically, each level is represented by a well-known method: cosine similarity for the lexical level, textual entailment for the syntactic level, and sentence alignment for the semantic one. The second contribution is to analyze the usefulness of such methods within the TS task, determining their positive and negative effects. On the one hand, redundant information has always been considered a negative factor for TS, and as a consequence, most approaches remove repeated information at a first stage. On the other hand, redundant information can also be exploited to produce summaries, assuming that the more often a piece of information is repeated across documents, the more important it is, following the idea of [7]. This dual perspective on tackling redundancy is analyzed within the scope of this paper, and a comparison with existing TS systems is also provided.
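As an illustration of the lexical level, the sketch below shows how cosine similarity over bag-of-words vectors can flag a candidate sentence as redundant with respect to sentences already selected for a summary. It is a minimal sketch: the tokenization, the term-frequency weighting and the 0.8 threshold are our own illustrative assumptions, not the exact configuration used in the paper.

```python
# A minimal sketch of lexical-level redundancy detection via cosine
# similarity. Tokenization, TF weighting and the threshold value are
# illustrative assumptions, not the authors' exact configuration.
import math
import re
from collections import Counter

def bag_of_words(sentence):
    """Build a term-frequency vector from a lowercased sentence."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine_similarity(s1, s2):
    """Cosine of the angle between two term-frequency vectors."""
    v1, v2 = bag_of_words(s1), bag_of_words(s2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

def is_redundant(candidate, selected, threshold=0.8):
    """Flag a candidate sentence as redundant if it is too similar
    to any sentence already selected for the summary."""
    return any(cosine_similarity(candidate, s) >= threshold
               for s in selected)
```

A pair such as "The president visited Paris" and "The president visited Paris on Monday" scores high under this measure, whereas paraphrases with little word overlap are missed, which is consistent with the low detection rate reported for lexical-based approaches.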
This paper is organized as follows. Section 2 gives an overview of previous work on TS, focusing on multi-document summarization and on how different approaches have been investigated to deal with redundancy in summaries. Section 3 explains in detail the methods analyzed for detecting redundant information across documents. Section 4 deals with the twofold use of redundant information: first, preventing it from appearing in the summary; and second, using it to identify potentially relevant information within a document. The experiments together with the evaluation are presented in Section 5, and finally, Section 6 draws some relevant conclusions and outlines further work.
Section snippets
Related work
As far as TS is concerned, much effort has been devoted to automatically identifying relevant content within a document (or a set of documents) in order to select which sentences should appear in the summary, producing extracts as a consequence.
A wide range of techniques has been explored, such as statistical features [10]; linguistic models [23]; graph-based algorithms [29], [38], [45]; or machine learning techniques [1]. All these methodologies produce extracts by selecting a set of
Redundancy detection approaches
The objective of this section is to present three methods for detecting redundant information: cosine similarity, textual entailment and sentence alignment.
The reason why these methods have been selected for our research is, on the one hand, the different types of knowledge they employ, and, on the other hand, their popularity among the NLP community. The final goal is to integrate them into a TS approach, and carry out a comparison and an in-depth analysis of their benefits and limitations. In
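As a rough illustration of how the three methods could be compared under identical conditions, the hypothetical sketch below places them behind a common interface. The class names are our own, and the entailment and alignment back-ends are deliberately left abstract, since this excerpt does not show the authors' actual implementations; CosineDetector reuses the cosine_similarity helper from the earlier sketch.

```python
# A hypothetical common interface for the three detectors (our own
# naming, not the authors' code). Only the lexical detector is
# concrete here; the other two would wrap external engines.
from abc import ABC, abstractmethod

class RedundancyDetector(ABC):
    @abstractmethod
    def redundant(self, a: str, b: str) -> bool:
        """Return True if sentence b repeats the content of sentence a."""

class CosineDetector(RedundancyDetector):
    """Lexical level: surface word overlap."""
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
    def redundant(self, a: str, b: str) -> bool:
        return cosine_similarity(a, b) >= self.threshold

class EntailmentDetector(RedundancyDetector):
    """Syntactic level: b is redundant if a entails it."""
    def redundant(self, a: str, b: str) -> bool:
        raise NotImplementedError("plug in a textual entailment engine")

class AlignmentDetector(RedundancyDetector):
    """Semantic level: b is redundant if its content aligns with a's."""
    def redundant(self, a: str, b: str) -> bool:
        raise NotImplementedError("plug in a sentence alignment component")
```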
Text summarization approach
In Section 3, three different methods to detect redundant information were described: i) cosine similarity; ii) textual entailment; and iii) sentence alignment.
In order to assess the performance of the proposed methods and analyze their benefits and limitations, the objective of this section is to explain how they can be integrated into a TS approach. The core of this TS approach is similar to the one described in [32], which relies on statistical and linguistic features to detect relevant
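The sketch below illustrates, under our own simplifying assumptions, how such a detector can be plugged into a greedy extractive loop covering both uses of redundancy discussed in this paper: always filtering repeated sentences out of the summary, and optionally rewarding content that recurs across the document collection. The scoring function is a placeholder for the statistical and linguistic features of the underlying system [32].

```python
# A minimal sketch of a greedy extractive summarizer with a pluggable
# redundancy test. The relevance scoring is a placeholder assumption.
def summarize(sentences, score, redundant, k=5, reward=False):
    """Select up to k sentences by score, never repeating content.

    score(s)        -> relevance score of sentence s
    redundant(a, b) -> True if sentence b repeats the content of a
    reward          -> if True, treat repetition across the collection
                       as evidence of importance rather than only noise
    """
    scores = [score(s) for s in sentences]
    if reward:
        # Dual perspective: boost a sentence by the number of other
        # sentences in the collection that repeat its content.
        for i, s in enumerate(sentences):
            scores[i] += sum(redundant(t, s)
                             for j, t in enumerate(sentences) if j != i)
    summary = []
    for i in sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True):
        # Filtering perspective: skip candidates already covered.
        if not any(redundant(chosen, sentences[i]) for chosen in summary):
            summary.append(sentences[i])
        if len(summary) == k:
            break
    return summary

# Example usage, combining the pieces from the earlier sketches:
# summarize(doc_sentences, score=lambda s: len(s.split()),
#           redundant=CosineDetector().redundant, k=3, reward=True)
```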
Evaluation and discussion
The objective of this section is to test the proposed redundancy detection methods in terms of their applicability to TS and, at the same time, to assess the benefits of the different levels of language analysis for this problem.
In order to provide a detailed evaluation of the different approaches, this section is structured in several subsections. First, the corpora used in all the experiments are described, together with some details concerning the experimental setup (Section 5.1).
Conclusions and future work
In this paper, an analysis of three redundancy detection methods that employ different levels of language analysis (lexical, syntactic and semantic), and of their influence within the text summarization task, was presented. In particular, the proposed methods for tackling the redundancy problem were cosine similarity, textual entailment and sentence alignment, each corresponding, respectively, to one of the levels of language analysis previously mentioned. It has been shown that
Acknowledgments
This research has been funded by the Spanish Government under the project TEXT-MESS 2.0 (TIN2009-13391-C04-01). Moreover, it has also been supported by the Conselleria d'Educació, Generalitat Valenciana (grant nos. PROMETEO/2009/119 and ACOMP/2010/286).
The authors would also like to thank Ester Boldrini, Helena Burruezo, Javi Fernández, Óscar Ferrández, José Manuel Gómez and Juanma Martínez for participating in the manual evaluation of summaries.
References (64)
- et al., MCMR: maximum coverage and minimum redundant text summarization model, Expert Systems with Applications (2011)
- et al., Providing standard-oriented data models and interfaces to egovernment services: a semantic-driven approach, Computer Standards & Interfaces (2009)
- et al., The iCabiNET system: harnessing Electronic Health Record standards from domestic and mobile devices to support better medication adherence, Computer Standards & Interfaces (2012)
- et al., A standard language for service delivery: enabling understanding among stakeholders, Computer Standards & Interfaces (2012)
- et al., A semantic graph-based approach to biomedical summarisation, Artificial Intelligence in Medicine (2011)
- Automatic summarising: the state of the art, Information Processing and Management (2007)
- et al., Text summarization features selection method using pseudo genetic-based model
- et al., pSum-SaDE: a modified p-median problem and self-adaptive differential evolution algorithm for text summarization, Applied Computational Intelligence and Soft Computing (2011)
- et al., A contextual query expansion approach by term clustering for robust text summarization, in: Document Understanding Workshop (presented at HLT/NAACL) (2007)
- et al., Information fusion in the context of multi-document summarization
- Sentence fusion for multidocument news summarization, Computational Linguistics
- Learning document-level semantic properties from free-text annotations, Journal of Artificial Intelligence Research
- The use of MMR, diversity-based reranking for reordering documents and producing summaries
- A statistical approach for automatic text summarization by extraction
- Text summarization via hidden Markov models and pivoted QR matrix decomposition
- CLASSY Arabic and English multi-document summarization
- Back to basics: CLASSY 2006
- Overview of DUC 2006
- Overview of the TAC 2008 update summarization task
- Learning to fuse disparate sentences
- Alicante University at TAC 2009: experiments in RTE
- Multi-sentence compression: finding shortest paths in word graphs
- Framework for abstractive summarization using text-to-text generation
- Syntax: A Functional–Typological Introduction, II
- Multi-document summarization by sentence extraction
- Summarizing text by ranking text units according to shallow linguistic features
- CCNU at TAC 2008: proceeding on using semantic method for automated summarization
- Reducing redundancy in multi-document summarization using lexical semantic similarity
- Automated multilingual text summarization and its evaluation
- A query-specific opinion summarization system
- ROUGE: a package for automatic evaluation of summaries
Dr. Elena Lloret is a post-doctoral researcher at the University of Alicante on the European project FIRST: A Flexible Interactive Reading Support Tool (grant no. FP7-287607). She is a Computer Science graduate and received her Ph.D. at the University of Alicante. Her main fields of interest are text summarization, text simplification and text comprehension. She is the author of over 25 scientific publications in relevant journals and international conferences. She has been collaborating with international researchers and has participated in a number of projects at a national level (TIN2006-15265-C06, TIN2009-13391-C04). She has also been collaborating with international groups in Wolverhampton, Sheffield and Edinburgh.
Prof. Dr. Manuel Palomar is the University President of the University of Alicante and head of the Natural Language Processing and Information Systems Research Group of the same university. He has also been a full professor at this university since 1991, and his main teaching area focuses on the analysis, design and management of databases, data warehouses, and information systems. He received his Master's degree and Ph.D. in Computer Science from the Polytechnic University of Valencia, Spain. His research interests are Human Language Technologies (HLT) and Natural Language Processing (NLP), in particular text summarization, semantic roles, textual entailment, information extraction and anaphora resolution. He has supervised more than 12 theses and is the author of more than 70 scientific publications in international journals and conferences on different topics related to HLT and NLP. Furthermore, he has coordinated and been involved in a number of regional, national and international research projects funded by the Generalitat Valenciana (Valencian Government), the Ministry of Science and Innovation (Spanish Government) and the European Council.