Document clustering with cluster refinement and model selection capabilities

Authors:
Xin Liu (NEC USA, Inc., Cupertino, CA)
Yihong Gong (NEC USA, Inc., Cupertino, CA)
Wei Xu (NEC USA, Inc., Cupertino, CA)
Shenghuo Zhu (University of Rochester, Rochester, NY)

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, August 2002, Pages 191–198. https://doi.org/10.1145/564376.564411

Published: 11 August 2002

ABSTRACT

In this paper, we propose a document clustering method that aims to achieve (1) high document clustering accuracy and (2) the capability of estimating the number of clusters in the document corpus (i.e., the model selection capability). To cluster the given corpus accurately, we represent each document with a richer feature set and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to obtain an initial clustering. From this initial result, we identify a set of discriminative features for each cluster and refine the initial clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability, in turn, is achieved by introducing randomness into the cluster initialization stage and then finding a value C for the number of clusters N at which running the clustering process a fixed number of times yields sufficiently similar results. Performance evaluations show that the proposed method improves both document clustering accuracy and model selection accuracy. The evaluations also show how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
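
The model selection idea in the abstract (randomized initialization, repeated runs, and picking the cluster count whose runs agree most) can be illustrated with a small sketch. This is not the authors' implementation. It assumes plain k-means with random initial centers as a stand-in for GMM + EM, and the Rand index as the measure of similarity between runs; the abstract specifies neither. The function names (`kmeans`, `rand_index`, `stability`, `select_num_clusters`) and the toy data are hypothetical.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=50):
    """Plain k-means with random initial centers (assumed stand-in for GMM + EM)."""
    centers = rng.sample(points, k)  # the randomness in the initialization stage
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, runs=8, seed=0):
    """Mean pairwise Rand index over a fixed number of randomized runs."""
    rng = random.Random(seed)
    results = [kmeans(points, k, rng) for _ in range(runs)]
    scores = [rand_index(x, y) for x, y in combinations(results, 2)]
    return sum(scores) / len(scores)


def select_num_clusters(points, candidates, runs=8, seed=0):
    """Return the candidate cluster count with the most consistent runs."""
    scores = {k: stability(points, k, runs, seed) for k in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy 2-D "documents": three noisy groups (made-up data, illustration only).
    data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(20)]
    best, scores = select_num_clusters(data, candidates=[2, 3, 4, 5])
    print(best, scores)
```

Candidates should start at 2, because a single cluster is trivially stable. Plain k-means can also settle into different local optima from different starting points, so the stability signal from this stand-in may be noisier than what a GMM-based clustering with cluster refinement would produce.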

Index Terms

Document clustering with cluster refinement and model selection capabilities
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN: 1581135610
DOI: 10.1145/564376
General Chair: Kalervo Järvelin (University of Tampere, Finland)
Program Chairs: Micheline Beaulieu (University of Sheffield, UK), Ricardo Baeza-Yates (University of Chile, Chile), Sung Hyon Myaeng (Chungnam National University, Korea)
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
EM algorithm
document clustering
gaussian mixtures model
model selection
Qualifiers
- Article
Acceptance Rates
SIGIR '02 paper acceptance rate: 44 of 219 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.