Article

A hierarchical monothetic document clustering algorithm for summarization and browsing search results

Authors:
Krishna Kummamuru, IBM India Research Lab, New Delhi
Rohit Lotlikar, IBM India Research Lab, New Delhi
Shourya Roy, IBM India Research Lab, New Delhi
Karan Singal, Indian Institute of Technology, Guwahati
Raghu Krishnapuram, IBM India Research Lab, New Delhi

WWW '04: Proceedings of the 13th international conference on World Wide Web, May 2004, Pages 658–665, https://doi.org/10.1145/988672.988762

Published: 17 May 2004

ABSTRACT

Organizing Web search results into a hierarchy of topics and sub-topics facilitates browsing the collection and locating results of interest. In this paper, we propose a new hierarchical monothetic clustering algorithm to build a topic hierarchy for a collection of search results retrieved in response to a query. At every level of the hierarchy, the new algorithm progressively identifies topics in a way that maximizes the coverage while maintaining distinctiveness of the topics. We refer to the proposed algorithm as DisCover. Evaluating the quality of a topic hierarchy is a non-trivial task, the ultimate test being user judgment. We use several objective measures such as coverage and reach time for an empirical comparison of the proposed algorithm with two other monothetic clustering algorithms to demonstrate its superiority. Even though our algorithm is slightly more computationally intensive than one of the algorithms, it generates better hierarchies. Our user studies also show that the proposed algorithm is superior to the other algorithms as a summarizing and browsing tool.
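
To make the idea of trading coverage against distinctiveness concrete, the following is a minimal illustrative sketch of one level of a monothetic (single-term) topic selection. It is not the DisCover algorithm as published: the abstract does not give the scoring function, so the function name `greedy_monothetic_topics`, the linear score, and the weights `w_cover` and `w_distinct` are all assumptions made for illustration. Each topic is a single term, a document belongs to a topic if it contains that term, and terms are picked greedily so that each new topic covers many uncovered documents while overlapping little with topics already chosen.

```python
from typing import Dict, List, Optional, Set


def greedy_monothetic_topics(
    docs: Dict[str, Set[str]],
    k: int,
    w_cover: float = 1.0,     # hypothetical weight on newly covered documents
    w_distinct: float = 1.0,  # hypothetical penalty on overlap with earlier topics
) -> List[str]:
    """Greedily pick up to k single-term topics for one level of a hierarchy."""
    # Invert the collection: term -> ids of the documents that contain it.
    postings: Dict[str, Set[str]] = {}
    for doc_id, terms in docs.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)

    chosen: List[str] = []
    covered: Set[str] = set()
    while len(chosen) < k:
        best_term: Optional[str] = None
        best_score = 0.0
        for term in sorted(postings):  # sorted so that ties break deterministically
            if term in chosen:
                continue
            members = postings[term]
            new_docs = len(members - covered)  # coverage gained by this term
            overlap = len(members & covered)   # loss of distinctiveness
            score = w_cover * new_docs - w_distinct * overlap
            if score > best_score:
                best_term, best_score = term, score
        if best_term is None:  # no remaining term has a positive score
            break
        chosen.append(best_term)
        covered |= postings[best_term]
    return chosen


# Toy collection: each search result is reduced to a set of terms (assumed input).
results = {
    "d1": {"car", "dealer"},
    "d2": {"car", "price"},
    "d3": {"car", "engine"},
    "d4": {"cat", "zoo"},
    "d5": {"cat", "habitat"},
    "d6": {"apple", "os"},
}
top_level = greedy_monothetic_topics(results, k=3)
```

Under the same assumptions, a hierarchy could be built by calling the function again on the documents of each selected topic, with the parent term removed from their term sets.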

Index Terms

A hierarchical monothetic document clustering algorithm for summarization and browsing search results
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
WWW '04: Proceedings of the 13th international conference on World Wide Web
May 2004
754 pages
ISBN:158113844X
DOI:10.1145/988672
Conference Chairs: Stuart Feldman (IBM Research), Mike Uretsky (New York University)
Program Chairs: Marc Najork (Microsoft Research), Craig Wills (Worcester Polytechnic Institute)
Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
automatic taxonomy generation
clustering
data mining
search
summarization
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%