ABSTRACT
Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm to build a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes coverage while maintaining the distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, the ultimate test being user judgment. We use several objective measures, such as coverage and reach time, to compare the proposed algorithm empirically with two other monothetic clustering algorithms and to demonstrate its superiority. Even though our algorithm is slightly more computationally intensive than one of these algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
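The idea of choosing topics that maximize coverage while staying distinct can be illustrated with a small greedy sketch. This is not the DisCover algorithm itself, whose scoring is not given in the abstract. It is a minimal illustration under assumed definitions: the function names `greedy_topics` and `coverage`, the weight `w`, and both scoring formulas are hypothetical. Each topic is a single term (monothetic), coverage gain is the fraction of documents newly covered, and distinctiveness is the fraction of a term's documents not already covered by earlier topics.

```python
from typing import Dict, List, Set


def greedy_topics(docs: Dict[str, Set[str]], k: int, w: float = 0.5) -> List[str]:
    """Pick up to k single-term topics for one level of a hierarchy.

    docs maps a document id to the set of terms it contains.
    w is a hypothetical weight trading coverage against distinctiveness.
    """
    # Invert the mapping: term -> ids of the documents that contain it.
    postings: Dict[str, Set[str]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(k):
        best_term, best_score = None, 0.0
        for term in sorted(postings):  # sorted so that ties break deterministically
            if term in chosen:
                continue
            members = postings[term]
            new_docs = members - covered
            # Assumed scores, not taken from the paper.
            coverage_gain = len(new_docs) / len(docs)
            distinctiveness = len(new_docs) / len(members)
            score = w * coverage_gain + (1 - w) * distinctiveness
            if score > best_score:
                best_term, best_score = term, score
        if best_term is None:  # no remaining term covers a new document
            break
        chosen.append(best_term)
        covered |= postings[best_term]
    return chosen


def coverage(docs: Dict[str, Set[str]], topics: List[str]) -> float:
    """Fraction of documents that contain at least one of the topics."""
    if not docs:
        return 0.0
    topic_set = set(topics)
    return sum(1 for terms in docs.values() if terms & topic_set) / len(docs)


if __name__ == "__main__":
    # Toy results for an ambiguous query, with the query term itself removed.
    results = {
        "d1": {"car", "dealer"},
        "d2": {"car", "price"},
        "d3": {"cat", "habitat"},
        "d4": {"cat", "zoo"},
        "d5": {"os", "apple"},
    }
    topics = greedy_topics(results, k=3)
    print(topics, coverage(results, topics))
```

Applying the same selection recursively to the documents under each chosen topic would yield sub-topics. A real system would also need term weighting, phrase extraction, and a stopping rule, none of which this sketch attempts.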
Index Terms
- A hierarchical monothetic document clustering algorithm for summarization and browsing search results