Document clustering via adaptive subspace iteration

Authors:
Tao Li, University of Rochester, Rochester, NY
Sheng Ma, IBM T.J. Watson Research Center, Hawthorne, NY
Mitsunori Ogihara, University of Rochester, Rochester, NY

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 2004, Pages 218–225. https://doi.org/10.1145/1008992.1009031

Published: 25 July 2004

ABSTRACT

Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI (Adaptive Subspace Iteration), which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by this optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections between ASI and various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.

Index Terms

Document clustering via adaptive subspace iteration
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992
General Chair: Mark Sanderson, University of Sheffield (UK)
Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
adaptive subspace identification
alternating optimization
document clustering
factor analysis
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%

Article Metrics
- Total Citations: 106
- Total Downloads: 1,973
- Downloads (Last 12 months): 6
- Downloads (Last 6 weeks): 1