Abstract
Clustering high dimensional data is an emerging research field. Subspace clustering and projected clustering group similar objects in subspaces, i.e., projections, of the full-dimensional space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and comparison of these paradigms on a common basis.
Conclusive evaluation and comparison are challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used that reflect different aspects of the clustering result. Finally, in typical publications, authors have limited their analysis to their favored paradigm only, paying little or no attention to the other paradigms.
In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering algorithms to characterize the different aspects of each paradigm and give a detailed comparison of their properties. We provide a benchmark set of results on a large variety of real world and synthetic data sets. Using different evaluation measures, we broaden the scope of the experimental analysis and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website.
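To illustrate how an object-based evaluation measure compares a detected clustering against hidden clusters, the following is a minimal sketch (not taken from the paper; the function names and toy data are hypothetical) of a cluster-level F1 score: each hidden cluster is matched with the found cluster that agrees with it best, and the per-pair F1 values are averaged.

```python
# Hedged sketch: a simple object-based F1 measure for comparing a
# detected clustering against hidden ground-truth clusters.
# Function names and example data are illustrative only.

def f1_pair(found, hidden):
    """F1 of one found cluster vs. one hidden cluster (sets of object ids)."""
    overlap = len(found & hidden)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)   # fraction of the found cluster that is correct
    recall = overlap / len(hidden)     # fraction of the hidden cluster that is covered
    return 2 * precision * recall / (precision + recall)

def clustering_f1(found_clusters, hidden_clusters):
    """Average, over hidden clusters, of the best F1 with any found cluster."""
    return sum(
        max(f1_pair(f, h) for f in found_clusters)
        for h in hidden_clusters
    ) / len(hidden_clusters)

# Toy example: two hidden clusters, two detected clusters.
hidden = [{1, 2, 3, 4}, {5, 6, 7, 8}]
found = [{1, 2, 3}, {4, 5, 6, 7, 8}]
print(round(clustering_f1(found, hidden), 3))  # → 0.873
```

Note that this measure rewards object coverage only; measures such as RNIA or CE additionally account for the subspaces in which clusters were found, which is one reason different measures can rank the same algorithms differently.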