
Multidocument summarization: An added value to clustering in interactive retrieval

Published: 01 April 2004

Abstract

An increasingly common problem in effective information access is the presence, within the same corpus, of multiple documents that contain similar information. For a topic addressed by a group of similar documents, users are generally interested in locating one or several particular aspects. This kind of task, called instance or aspectual retrieval, has been explored in several TREC Interactive Tracks. In this article, we propose complementing the classification capacity of clustering techniques with an indicative extract of the contents of several sources, produced by multidocument summarization techniques. Two kinds of summaries are provided. The first covers the similarities among the documents in each retrieved cluster. The second shows the particularities of each document with respect to the topic common to its cluster. The multitopic structure of the documents is used to determine the similarities and differences of topics within a cluster. The system is independent of document domain and genre. An evaluation of the proposed system with users shows significant improvements in effectiveness. The results of previous experiments comparing clustering algorithms are also reported.
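To make the two-stage idea concrete, here is a minimal sketch of the first stage, grouping retrieved documents by topic. It is an illustration of mine, not the authors' implementation; the use of scikit-learn, and every name and parameter in it, are assumptions.

    # Hypothetical sketch (not the authors' code): cluster retrieved documents
    # with tf*idf vectors and k-means, as the basis for attaching a
    # cluster-level summary and per-document extracts to the result set.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_results(docs, n_clusters=5):
        """Return {cluster_label: [documents]} for a list of retrieved texts."""
        vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
        clusters = {}
        for doc, label in zip(docs, labels):
            clusters.setdefault(label, []).append(doc)
        return clusters

A second sketch, after the first review below, illustrates how the two kinds of summaries could then be extracted from each cluster.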



Reviews

Bei Yu

Clustering retrieved documents is a typical post-retrieval processing technique for presenting the user with an organized result set, rather than a simple ranked list, in order to reduce the cognitive burden of going through a large number of returned results. Some commercial search engines, such as Vivisimo, have implemented this strategy fairly successfully. However, studies show that the benefit of clustering is undermined by a poor visual connection between the clusters and the document content. By providing extra indicative extracts, covering both the similarities and the particularities of each document contributing to a cluster, the authors of this paper take a further step toward improving the organization and accessibility of retrieved documents. Assuming that each document consists of several subtopics, the authors first use the TextTiling algorithm to segment the documents. Variants of k-means are then used to cluster the text segments, and a sentence-extraction-based multidocument summary is generated for each cluster to cover the common aspects, using surface-level information (for example, location, headings, and tf*idf values). Finally, a single summary is generated for each document, indicating its originality. Detecting commonality is easier than identifying differences; for the latter, it is particularly hard to balance originality against relevance, and text segments must also be aligned when multiple aspects of the same topic are addressed. Regrettably, this paper does not offer a detailed solution for difference and originality detection. The authors used both objective (instance precision/recall) and subjective (questionnaire) methods to evaluate their system's effectiveness. The objective evaluation shows that commonality summarization helps reduce the reading load by 20 to 30 percent; it is no surprise that the difference summary does not help significantly. A usability problem remains, however: in the subjective evaluation, users found the new system hard to use.

Online Computing Reviews Service
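The review's description of the extraction step suggests a simple centroid-style selection. The sketch below is my own guess at such a scheme, not the authors' method: the sentence splitting is naive (the paper segments subtopics with TextTiling), and scikit-learn plus all names here are assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def summarize_cluster(docs, n_common=3):
        """For one cluster: take the sentences nearest the cluster centroid as
        the commonality summary, and for each document its least centroid-like
        sentence as a crude 'particularity' extract."""
        # Naive sentence splitting stands in for proper subtopic segmentation.
        sents = [(i, s.strip()) for i, d in enumerate(docs)
                 for s in d.split(".") if s.strip()]
        texts = [s for _, s in sents]
        X = TfidfVectorizer(stop_words="english").fit_transform(texts)
        centroid = np.asarray(X.mean(axis=0))          # dense (1, n_features)
        sims = cosine_similarity(X, centroid).ravel()  # one score per sentence
        common = [texts[j] for j in np.argsort(-sims)[:n_common]]
        diffs = {}  # doc index -> (similarity, sentence)
        for j, (doc_id, sent) in enumerate(sents):
            if doc_id not in diffs or sims[j] < diffs[doc_id][0]:
                diffs[doc_id] = (sims[j], sent)
        return common, {i: s for i, (_, s) in diffs.items()}

Note the asymmetry the review points out: commonality falls out of centroid similarity almost for free, whereas a low-similarity sentence is only a crude proxy for content that is original yet still relevant.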

Ian Ruthven

Simultaneously accessing large numbers of text documents is an activity that is not well supported by current search engine interfaces. Many solutions have been explored that employ some form of clustering, or document summarization, to facilitate the assessment of retrieved documents. The research outlined in this paper combines the two solutions to provide summaries of document clusters. The idea itself is not new; what is new is that some attempt has been made to exploit the topical structure of documents in the summarization process. However, this novel aspect is not well described, and other researchers may have difficulty replicating this work based on the description given. Two types of summaries are described: multi-document summaries, which emphasize the similarities between documents within a cluster, and single document summaries, which emphasize the difference between a document and other documents in the same cluster. A user evaluation, based on a standard information retrieval experimental protocol, is presented to assess the effectiveness of the clustering and summarization techniques. The evaluation compares three interfaces: a standard interface that offers lists of retrieved documents; a clustering interface offering a series of document clusters; and a summarization interface that offers clusters, a summary of each cluster, and a summary of each document. In the published version of the paper, the screen dumps are of such poor quality that they are illegible, so it is not possible to evaluate the quality or readability of the summaries created. The results are not conclusive, displaying no real user preference for one interface over the others. One of the main findings is that familiarity with the search topic correlates with a preference for the standard list interface, whereas user unfamiliarity with the topic correlates with a preference for the more advanced interfaces. It is a pity that more correlations like this, or qualitative results, were not reported, as they may form a useful guide to why users prefer one interface over another.

Online Computing Reviews Service


Published in

ACM Transactions on Information Systems, Volume 22, Issue 2 (April 2004), 178 pages.
ISSN: 1046-8188; EISSN: 1558-2868. DOI: 10.1145/984321.
Copyright © 2004 ACM. Published by the Association for Computing Machinery, New York, NY, United States.
