Subspace clustering for high dimensional data: a review

Published: 01 June 2004

Abstract

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. In high dimensional data, many dimensions are often irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms instead localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering, distinguished by their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests, and discuss some potential applications where subspace clustering could be particularly useful.
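To make the bottom-up strategy concrete, below is a minimal Python sketch in the spirit of grid-based methods such as CLIQUE: each dimension is partitioned into bins, 1-dimensional bins holding more than a fraction tau of the points are kept as dense units, and dense units are then joined, Apriori-style, into higher-dimensional candidates. The grid resolution xi, density threshold tau, and all function names are illustrative assumptions for this sketch, not code or parameters from the paper.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, xi=10, tau=0.05):
    """Find 1-d dense units: bins holding more than tau * n points."""
    n, d = X.shape
    units = {}
    for dim in range(d):
        col = X[:, dim]
        edges = np.linspace(col.min(), col.max(), xi + 1)
        bins = np.clip(np.digitize(col, edges[1:-1]), 0, xi - 1)
        for b in range(xi):
            ids = np.where(bins == b)[0]
            if len(ids) > tau * n:
                units[((dim, b),)] = set(ids)  # key: tuple of (dim, bin) pairs
    return units

def join_dense_units(units, min_pts):
    """Apriori-style join: build (k+1)-d candidate units from k-d dense units."""
    joined = {}
    for u, v in combinations(units, 2):
        merged = tuple(sorted(set(u) | set(v)))
        # A valid join agrees on shared dimensions and adds exactly one new one.
        if (len(merged) == len(u) + 1
                and len({dim for dim, _ in merged}) == len(merged)):
            ids = units[u] & units[v]
            if len(ids) > min_pts:  # the higher-dim unit must itself be dense
                joined[merged] = ids
    return joined

# Toy data: a cluster embedded in dimensions (0, 1); dimension 2 is pure noise.
rng = np.random.default_rng(0)
cluster = np.column_stack([rng.normal(2.0, 0.1, 100),
                           rng.normal(2.0, 0.1, 100),
                           rng.uniform(-5, 5, 100)])
noise = rng.uniform(-5, 5, (300, 3))
X = np.vstack([cluster, noise])

units_1d = dense_units_1d(X, xi=10, tau=0.05)
units_2d = join_dense_units(units_1d, min_pts=0.05 * len(X))
# Typically prints [(0, 1)]: the subspace hiding the cluster.
print(sorted({tuple(dim for dim, _ in u) for u in units_2d}))
```

The join step exploits the downward-closure property that bottom-up methods rely on: a region can only be dense in a subspace if it is dense in every lower-dimensional projection of that subspace, so candidates can be pruned aggressively before the full space is ever examined.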

