DOI: 10.1145/988672.988762
Article

A hierarchical monothetic document clustering algorithm for summarization and browsing search results

Published: 17 May 2004

ABSTRACT

Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm to build a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes coverage while maintaining the distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, with user judgment being the ultimate test. We use several objective measures, such as coverage and reach time, for an empirical comparison of the proposed algorithm with two other monothetic clustering algorithms to demonstrate its superiority. Even though our algorithm is slightly more computationally intensive than one of the other algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
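The abstract gives only the intuition behind topic selection: favor topics that cover many results while staying distinct from topics already chosen. The sketch below illustrates that trade-off for a single level of a hierarchy. It is not the published DisCover procedure; the function name `pick_topics`, the weights `w_cov` and `w_dist`, and the scoring formula are all assumptions made for illustration, and each topic is a single term, in keeping with the monothetic setting.

```python
from typing import Dict, List, Set


def pick_topics(docs: Dict[str, Set[str]], k: int,
                w_cov: float = 1.0, w_dist: float = 1.0) -> List[str]:
    """Greedily pick up to k single-term topics (illustrative only).

    Assumed score: reward the fraction of the collection a term newly
    covers, penalize the fraction of its documents already covered.
    """
    # Invert the collection: term -> set of documents containing it.
    postings: Dict[str, Set[str]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    chosen: List[str] = []
    covered: Set[str] = set()
    n = len(docs)
    while len(chosen) < k:
        best_term, best_score = None, 0.0
        # Sorted iteration makes tie-breaking deterministic.
        for term in sorted(postings):
            if term in chosen:
                continue
            members = postings[term]
            coverage_gain = len(members - covered) / n
            overlap = len(members & covered) / len(members)
            score = w_cov * coverage_gain - w_dist * overlap
            if score > best_score:
                best_term, best_score = term, score
        if best_term is None:
            break  # no remaining term has a positive score
        chosen.append(best_term)
        covered |= postings[best_term]
    return chosen
```

Under these assumptions, a hierarchy could be grown by calling the same function again on the documents under each chosen term, with that term removed from their term sets.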

      • Published in

        WWW '04: Proceedings of the 13th international conference on World Wide Web
        May 2004
        754 pages
        ISBN: 158113844X
        DOI: 10.1145/988672

        Copyright © 2004 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
