Abstract
An increasingly common problem in effective information access is the presence, in the same corpus, of multiple documents that contain similar information. For a topic addressed by a group of similar documents, users are often interested in locating one or several particular aspects. This kind of task, called instance or aspectual retrieval, has been explored in several TREC Interactive Tracks. In this article, we propose complementing the classification capacity of clustering techniques with an indicative extract of the contents of several sources, produced by means of multidocument summarization techniques. Two kinds of summaries are provided. The first covers the similarities within each cluster of retrieved documents. The second shows the particularities of each document with respect to the common topic of the cluster. The multitopic structure of the documents is used to determine the similarities and differences among topics in each cluster. The system is independent of document domain and genre. An evaluation of the proposed system with users shows significant improvements in effectiveness. The results of previous experiments that compared clustering algorithms are also reported.
Index Terms
- Multidocument summarization: An added value to clustering in interactive retrieval