
Multidocument summarization: An added value to clustering in interactive retrieval

Authors:
Manuel J. Maña-López, Universidad de Vigo, Huelva, Spain
Manuel De Buenaga, Universidad Europea de Madrid, Madrid, Spain
José M. Gómez-Hidalgo, Universidad Europea de Madrid, Madrid, Spain

ACM Transactions on Information Systems, Volume 22, Issue 2, pp. 215–241. https://doi.org/10.1145/984321.984323

Published: 01 April 2004

Abstract

An increasingly common problem in effective information access is the presence, in the same corpus, of multiple documents that contain similar information. For a topic addressed by a group of similar documents, users are often interested in locating one or several particular aspects. This kind of task, called instance or aspectual retrieval, has been explored in several TREC Interactive Tracks. In this article, we propose complementing the classification capacity of clustering techniques with an indicative extract of the contents of several sources, produced by means of multidocument summarization techniques. Two kinds of summaries are provided. The first covers the similarities of each cluster of retrieved documents. The second shows the particularities of each document with respect to the common topic of its cluster. The multitopic structure of documents is used to determine the similarities and differences among topics in each cluster. The system is independent of document domain and genre. An evaluation of the proposed system with users shows significant improvements in effectiveness. The results of previous experiments comparing clustering algorithms are also reported.
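As an illustration of how instance (aspectual) retrieval is typically scored, the sketch below computes instance recall and instance precision as they are commonly defined for the TREC Interactive Track: recall is the fraction of known instances covered by the documents a user saved, and precision is the fraction of saved documents that contain at least one instance. The function names and the data layout (a mapping from document id to the set of instances it covers) are assumptions made for this example, not taken from the article.

```python
def instance_recall(saved_docs, doc_instances, all_instances):
    """Fraction of all known instances covered by the saved documents."""
    covered = set()
    for doc_id in saved_docs:
        covered |= doc_instances.get(doc_id, set())
    if not all_instances:
        return 0.0
    return len(covered & all_instances) / len(all_instances)


def instance_precision(saved_docs, doc_instances):
    """Fraction of saved documents that contain at least one instance."""
    if not saved_docs:
        return 0.0
    useful = sum(1 for doc_id in saved_docs if doc_instances.get(doc_id))
    return useful / len(saved_docs)
```

For example, if a user saves three documents, two of which together cover two of the four known instances, instance recall is 2/4 and instance precision is 2/3.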

Reviews

Reviewer: Bei Yu

Clustering retrieved documents is a typical post-retrieval processing technique used to present an organized result set, rather than a simple ranked list, to the user, in order to reduce the cognitive burden of going through a large number of returned results. Some commercial search engines, such as Vivisimo, have implemented this strategy fairly successfully. However, studies show that the benefit of clustering is undermined by a poor visual connection between the clusters and the document content. By providing extra indicative extracts, covering both the similarities and particularities of each document contributing to the specific cluster, the authors of this paper take a further step toward improving retrieved documents’ organization and accessibility.

With the assumption that each document consists of several subtopics, the authors first use the TextTiling algorithm to segment the documents. K-means variants are then used to cluster the text segments, and a sentence extraction-based multidocument summary is generated for each cluster, to cover common aspects using surface-level information (for example, locations, headings, tf*idf values, and so on). Finally, a single summary is generated for each document, indicating its originality.

Commonality detection is relatively easier than difference identification because, for the latter problem, it is even harder to balance originality and relevance. Text segment alignment is also necessary if multiple aspects are addressed for the same topic. Regrettably, this paper does not offer a detailed solution for difference and originality detection.

The authors used both objective (instance precision/recall) and subjective (questionnaires) methods to evaluate their system’s effectiveness. The objective evaluation shows that commonality summarization helps reduce the reading load by 20 to 30 percent, and it is not a surprise to see that the difference summary does not significantly help. The usability problem remains in this approach: in the subjective evaluation, users reported that the new system is hard to use.

Online Computing Reviews Service
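The pipeline outlined in this review (segment each document into subtopics, cluster the segments, then extract sentences that represent each cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' system: blank-line paragraph splitting stands in for TextTiling, a plain cosine k-means with a deterministic farthest-first start stands in for the K-means variants, and sentences are ranked only by tf*idf weight against the cluster centroid. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def split_segments(doc):
    # Stand-in for TextTiling: blank-line-separated paragraphs act as subtopic segments.
    return [p.strip() for p in re.split(r"\n\s*\n", doc) if p.strip()]


def normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(texts):
    counts = [Counter(tokenize(t)) for t in texts]
    df = Counter(term for c in counts for term in c)  # document frequency per term
    n = len(texts)
    return [normalize({t: tf * math.log(n / df[t]) for t, tf in c.items()}) for c in counts]


def cosine(a, b):
    # Both arguments are unit-length sparse vectors, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vecs, k, iters=10):
    # Deterministic farthest-first initialisation (an assumption of this sketch).
    centroids = [vecs[0]]
    while len(centroids) < k:
        idx = min(range(len(vecs)), key=lambda i: max(cosine(vecs[i], c) for c in centroids))
        centroids.append(vecs[idx])
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vecs]
        for j in range(k):
            total = {}
            for v, label in zip(vecs, labels):
                if label == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:
                centroids[j] = normalize(total)
    return labels, centroids


def sentence_score(sentence, centroid):
    terms = set(tokenize(sentence))
    return sum(centroid.get(t, 0.0) for t in terms) / len(terms) if terms else 0.0


def summarize_clusters(docs, k, n_sentences=1):
    segments = [seg for d in docs for seg in split_segments(d)]
    labels, centroids = kmeans(tfidf_vectors(segments), k)
    summaries = {}
    for j in range(k):
        sentences = [s for seg, label in zip(segments, labels) if label == j
                     for s in re.split(r"(?<=[.!?])\s+", seg) if s]
        ranked = sorted(sentences, key=lambda s: sentence_score(s, centroids[j]), reverse=True)
        summaries[j] = ranked[:n_sentences]
    return labels, summaries
```

Calling `summarize_clusters(docs, k)` on a list of document strings returns one cluster label per segment and, for each cluster, its highest-scoring sentences. The per-document "difference" summary is not sketched here, since the review notes that the paper does not describe that step in detail.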

Reviewer: Ian Ruthven

Simultaneously accessing large numbers of text documents is an activity that is not well supported by current search engine interfaces. Many solutions have been explored that employ some form of clustering, or document summarization, to facilitate the assessment of retrieved documents. The research outlined in this paper combines the two solutions to provide summaries of document clusters. The idea itself is not new; what is new is that some attempt has been made to exploit the topical structure of documents in the summarization process. However, this novel aspect is not well described, and other researchers may have difficulty replicating this work based on the description given.

Two types of summaries are described: multi-document summaries, which emphasize the similarities between documents within a cluster, and single-document summaries, which emphasize the differences between a document and the other documents in the same cluster.

A user evaluation, based on a standard information retrieval experimental protocol, is presented to assess the effectiveness of the clustering and summarization techniques. The evaluation compares three interfaces: a standard interface that offers lists of retrieved documents; a clustering interface offering a series of document clusters; and a summarization interface that offers clusters, a summary of each cluster, and a summary of each document. In the published version of the paper, the screen dumps are of such poor quality that they are illegible, so it is not possible to evaluate the quality or readability of the summaries created.

The results are not conclusive, displaying no real user preference for one interface over the others. One of the main findings is that familiarity with the search topic correlates with a preference for the standard list interface, whereas unfamiliarity with the topic correlates with a preference for the more advanced interfaces. It is a pity that more correlations like this, or qualitative results, were not reported, as they might form a useful guide to why users prefer one interface over another.

Online Computing Reviews Service

Published in

ACM Transactions on Information Systems, Volume 22, Issue 2
April 2004
178 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/984321

Copyright © 2004 ACM

Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Multidocument summarization
topic segmentation
Qualifiers
- article
Article Metrics
- Total citations: 36
- Total downloads: 2,060
- Downloads (last 12 months): 6
- Downloads (last 6 weeks): 0