Abstract
Clustering high dimensional data is an emerging research field. Subspace clustering and projected clustering group similar objects in subspaces, i.e., projections, of the full-dimensional space. In the past decade, several clustering paradigms have been developed in parallel, without thorough evaluation and comparison of these paradigms on a common basis.
Conclusive evaluation and comparison are challenged by three major issues. First, there is no ground truth that describes the "true" clusters in real world data. Second, a large variety of evaluation measures have been used that reflect different aspects of the clustering result. Finally, in typical publications, authors have limited their analysis to their favored paradigm only, paying little or no attention to the other paradigms.
In this paper, we take a systematic approach to evaluate the major paradigms in a common framework. We study representative clustering algorithms to characterize the different aspects of each paradigm and give a detailed comparison of their properties. We provide a benchmark set of results on a large variety of real world and synthetic data sets. Using different evaluation measures, we broaden the scope of the experimental analysis and create a common baseline for future developments and comparable evaluations in the field. For repeatability, all implementations, data sets and evaluation measures are available on our website.
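To illustrate how an object-based evaluation measure compares a detected clustering against hidden clusters, the following is a minimal sketch (not taken from the paper; the function names and toy data are hypothetical) of a cluster-level F1 score: each hidden cluster is matched with the found cluster that agrees with it best, and the per-pair F1 values are averaged.

```python
# Hedged sketch: a simple object-based F1 measure for comparing a
# detected clustering against hidden ground-truth clusters.
# Function names and example data are illustrative only.

def f1_pair(found, hidden):
    """F1 of one found cluster vs. one hidden cluster (sets of object ids)."""
    overlap = len(found & hidden)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)   # fraction of the found cluster that is correct
    recall = overlap / len(hidden)     # fraction of the hidden cluster that is covered
    return 2 * precision * recall / (precision + recall)

def clustering_f1(found_clusters, hidden_clusters):
    """Average, over hidden clusters, of the best F1 with any found cluster."""
    return sum(
        max(f1_pair(f, h) for f in found_clusters)
        for h in hidden_clusters
    ) / len(hidden_clusters)

# Toy example: two hidden clusters, two detected clusters.
hidden = [{1, 2, 3, 4}, {5, 6, 7, 8}]
found = [{1, 2, 3}, {4, 5, 6, 7, 8}]
print(round(clustering_f1(found, hidden), 3))  # → 0.873
```

Note that this measure rewards object coverage only; measures such as RNIA or CE additionally account for the subspaces in which clusters were found, which is one reason different measures can rank the same algorithms differently.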