ABSTRACT
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI (Adaptive Subspace Iteration), which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
Index Terms
- Document clustering via adaptive subspace iteration