DOI: 10.1145/1008992.1009031
Article

Document clustering via adaptive subspace iteration

Published: 25 July 2004

ABSTRACT

Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI, which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
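
The abstract does not spell out the ASI update rules, so the following is only a generic illustration of what an "iterative alternating optimization procedure" looks like in a clustering setting. It is a plain k-means-style loop, not the ASI algorithm: it holds the cluster representatives fixed while updating assignments, then holds the assignments fixed while updating the representatives, and stops when the assignments no longer change. The function name `alternating_kmeans` and its parameters are hypothetical, chosen for this sketch.

```python
def alternating_kmeans(points, centers, max_iter=100):
    """Generic alternating optimization sketch (k-means style).

    NOTE: this is an illustration of the alternating pattern only;
    it is NOT the ASI algorithm described in the paper.
    """
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Step 1: with centers fixed, assign each point to its nearest center
        # (squared Euclidean distance).
        new_labels = [
            min(
                range(len(centers)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])),
            )
            for p in points
        ]
        # Stop once the assignments are stable.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 2: with assignments fixed, move each center to the mean
        # of the points assigned to it (empty clusters keep their center).
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = alternating_kmeans(points, centers=[(0, 0), (10, 10)])
print(labels)   # cluster index for each point
print(centers)  # final cluster representatives
```

Each of the two steps can only lower (or leave unchanged) the total squared distance, which is why alternating schemes of this kind converge to a local optimum. According to the abstract, ASI applies the same alternating idea to jointly update the data reduction and the per-cluster subspace structure.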


Published in: SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Copyright © 2004 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 792 of 3,983 submissions, 20%
