ABSTRACT
In this paper, we propose a document clustering method that strives to achieve: (1) high document clustering accuracy, and (2) the capability of estimating the number of clusters in the document corpus (i.e., the model selection capability). To cluster the given document corpus accurately, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N at which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations show that the proposed method improves both document clustering accuracy and model selection accuracy. The evaluations also demonstrate how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
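The model selection idea in the abstract (repeat the clustering with random initialization and prefer a cluster count whose runs agree with one another) can be illustrated with a minimal sketch. Several parts are assumptions made only for illustration and are not taken from the paper: plain k-means stands in for the GMM/EM clustering and refinement steps, the Rand index is used as the similarity between two runs, the most stable candidate is chosen rather than the first one passing a threshold, and all function names and the toy 2-D "documents" are hypothetical.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=50):
    """Plain k-means with randomly chosen initial centers (stand-in for GMM/EM)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, runs=5, seed=0):
    """Average pairwise Rand index over `runs` randomly initialized clusterings."""
    rng = random.Random(seed)
    results = [kmeans(points, k, rng) for _ in range(runs)]
    scores = [rand_index(x, y) for x, y in combinations(results, 2)]
    return sum(scores) / len(scores)


def select_num_clusters(points, candidates, runs=5, seed=0):
    """Pick the candidate cluster count whose repeated runs agree the most."""
    return max(candidates, key=lambda k: stability(points, k, runs, seed))


# Toy data: three synthetic groups of 2-D points standing in for document vectors.
data_rng = random.Random(42)
docs = [
    (data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(10)
]
for k in range(2, 6):
    print(k, round(stability(docs, k), 3))
print("selected:", select_num_clusters(docs, range(2, 6)))
```

The Rand index is invariant to how cluster labels are numbered, which matters here because independent runs assign arbitrary label ids to the same grouping. A stability criterion like this one can also favor a cluster count that is too small when some groups merge consistently, so the selected value should be read as a heuristic estimate.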
Index Terms
- Document clustering with cluster refinement and model selection capabilities