Subspace clustering for high dimensional data: a review

Published: 01 June 2004

Abstract

Subspace clustering is an extension of traditional clustering that seeks to find clusters in different subspaces within a dataset. In high dimensional data, many dimensions are often irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms instead localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. There are two major branches of subspace clustering, distinguished by their search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches find dense regions in low dimensional spaces and combine them to form clusters. This paper presents a survey of the various subspace clustering algorithms along with a hierarchy organizing the algorithms by their defining characteristics. We then compare the two main approaches to subspace clustering using empirical scalability and accuracy tests, and discuss some potential applications where subspace clustering could be particularly useful.
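To make the bottom-up strategy concrete, below is a minimal Python sketch in the spirit of grid-based methods such as CLIQUE: each dimension is partitioned into bins, 1-dimensional bins holding more than a fraction tau of the points are kept as dense units, and dense units are then joined, Apriori-style, into higher-dimensional candidates. The grid resolution xi, density threshold tau, and all function names are illustrative assumptions for this sketch, not code or parameters from the paper.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(X, xi=10, tau=0.05):
    """Find 1-d dense units: bins holding more than tau * n points."""
    n, d = X.shape
    units = {}
    for dim in range(d):
        col = X[:, dim]
        edges = np.linspace(col.min(), col.max(), xi + 1)
        bins = np.clip(np.digitize(col, edges[1:-1]), 0, xi - 1)
        for b in range(xi):
            ids = np.where(bins == b)[0]
            if len(ids) > tau * n:
                units[((dim, b),)] = set(ids)  # key: tuple of (dim, bin) pairs
    return units

def join_dense_units(units, min_pts):
    """Apriori-style join: build (k+1)-d candidate units from k-d dense units."""
    joined = {}
    for u, v in combinations(units, 2):
        merged = tuple(sorted(set(u) | set(v)))
        # A valid join agrees on shared dimensions and adds exactly one new one.
        if (len(merged) == len(u) + 1
                and len({dim for dim, _ in merged}) == len(merged)):
            ids = units[u] & units[v]
            if len(ids) > min_pts:  # the higher-dim unit must itself be dense
                joined[merged] = ids
    return joined

# Toy data: a cluster embedded in dimensions (0, 1); dimension 2 is pure noise.
rng = np.random.default_rng(0)
cluster = np.column_stack([rng.normal(2.0, 0.1, 100),
                           rng.normal(2.0, 0.1, 100),
                           rng.uniform(-5, 5, 100)])
noise = rng.uniform(-5, 5, (300, 3))
X = np.vstack([cluster, noise])

units_1d = dense_units_1d(X, xi=10, tau=0.05)
units_2d = join_dense_units(units_1d, min_pts=0.05 * len(X))
# Typically prints [(0, 1)]: the subspace hiding the cluster.
print(sorted({tuple(dim for dim, _ in u) for u in units_2d}))
```

The join step exploits the downward-closure property that bottom-up methods rely on: a region can only be dense in a subspace if it is dense in every lower-dimensional projection of that subspace, so candidates can be pruned aggressively before the full space is ever examined.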

