DOI: 10.1145/564376.564411
Document clustering with cluster refinement and model selection capabilities

Published: 11 August 2002

ABSTRACT

In this paper, we propose a document clustering method that strives to achieve (1) high document clustering accuracy and (2) the capability of estimating the number of clusters in the document corpus (i.e., the model selection capability). To cluster the given document corpus accurately, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability, in turn, is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N at which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations show clear superiority of the proposed method in both document clustering accuracy and model selection accuracy. The evaluations also demonstrate how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
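The model selection idea in the abstract (randomized initialization, repeated runs, and a check that the results are sufficiently similar) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the paper's procedure: plain k-means stands in for the GMM/EM clustering with refinement, agreement between runs is measured with the pairwise Rand index, and the candidate with the highest average agreement is returned, whereas the abstract only requires "sufficiently similar" results. The function names `kmeans_labels`, `rand_index`, `stability`, and `estimate_num_clusters` are hypothetical.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, iters=25):
    """Stand-in clusterer (plain k-means, NOT the paper's GMM/EM) with random init."""
    centers = rng.sample(points, k)  # random seeds are the source of run-to-run variation
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of item pairs that two labelings treat alike (together or apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, runs=6, seed=0):
    """Average pairwise agreement over `runs` randomly initialized clusterings."""
    rng = random.Random(seed)
    results = [kmeans_labels(points, k, rng) for _ in range(runs)]
    scores = [rand_index(x, y) for x, y in combinations(results, 2)]
    return sum(scores) / len(scores)


def estimate_num_clusters(points, candidates, runs=6, seed=0):
    """Assumed selection rule: return the candidate k with the most stable results."""
    return max(candidates, key=lambda k: stability(points, k, runs, seed))


if __name__ == "__main__":
    # Synthetic 2-D points standing in for document feature vectors.
    rng = random.Random(1)
    blobs = [(0, 0), (10, 0), (5, 9)]
    docs = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)) for cx, cy in blobs for _ in range(20)]
    for k in range(2, 6):
        print(k, round(stability(docs, k), 3))
```

The Rand index is used here because it compares partitions without requiring cluster labels to match across runs. Candidates should start at 2, since a single cluster is trivially stable under this measure.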


Published in

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN: 1581135610
DOI: 10.1145/564376

Copyright © 2002 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '02 paper acceptance rate: 44 of 219 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.
