
Evaluating clustering in subspace projections of high dimensional data

Published: 01 August 2009

Abstract

Clustering high dimensional data is an emerging research field. Subspace clustering and projected clustering group similar objects in subspaces, i.e., projections, of the full data space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and comparison of these paradigms on a common basis.
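As a hypothetical illustration (not taken from the paper), consider points that form a tight cluster in only two of ten dimensions: the cluster is compact in that projection but washed out by noise in the full space, which is the effect subspace and projected clustering exploit. All data and dimension choices below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 points: a tight cluster in dimensions 0 and 1,
# uniform noise in the remaining 8 dimensions.
relevant = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
noise = rng.uniform(0.0, 10.0, size=(50, 8))
points = np.hstack([relevant, noise])

centroid = points.mean(axis=0)

# Mean distance to the centroid in the relevant 2-d subspace
# versus the full 10-d space: the cluster is compact only in
# the projection.
d_sub = np.linalg.norm(points[:, :2] - centroid[:2], axis=1).mean()
d_full = np.linalg.norm(points - centroid, axis=1).mean()
print(d_sub < 1.0 < d_full)
```

In the full space, the noise dimensions dominate every distance, so the cluster is only detectable in its relevant projection.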

Conclusive evaluation and comparison is challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used, each reflecting different aspects of the clustering result. Third, authors have typically limited their analysis to their favored paradigm, paying little or no attention to competing paradigms.

In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering algorithms to characterize the different aspects of each paradigm and give a detailed comparison of their properties. We provide a benchmark set of results on a large variety of real world and synthetic data sets. Using different evaluation measures, we broaden the scope of the experimental analysis and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website.
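One family of evaluation measures the abstract alludes to compares found clusters against hidden (ground-truth) clusters on the object level. A minimal sketch of such a measure is an F1-style score that matches each hidden cluster to its best-covering found cluster; the function name and toy clusters below are illustrative assumptions, not the paper's exact definitions.

```python
def f1_value(found, hidden):
    """Average, over hidden clusters, of the best F1 against any found cluster.

    `found` and `hidden` are lists of sets of object ids.
    """
    def f1(f, h):
        tp = len(f & h)  # objects correctly covered
        if tp == 0:
            return 0.0
        precision = tp / len(f)
        recall = tp / len(h)
        return 2 * precision * recall / (precision + recall)

    return sum(max(f1(f, h) for f in found) for h in hidden) / len(hidden)


# Toy example: two hidden clusters, one found cluster misses an
# object and the other includes a spurious one.
hidden = [{0, 1, 2, 3}, {4, 5, 6, 7}]
found = [{0, 1, 2}, {4, 5, 6, 7, 8}]
print(round(f1_value(found, hidden), 3))  # → 0.873
```

Object-based measures such as this ignore the subspaces in which clusters were found; measures that also score the detected dimensions exist and capture a different aspect of result quality.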

