A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings

Data Mining and Knowledge Discovery

Abstract

Data clustering is a fundamental and very popular method of data analysis. Its subjective nature, however, means that different clustering algorithms or different parameter settings can produce widely varying and sometimes conflicting results. This has led to the use of clustering comparison measures to quantify the degree of similarity between alternative clusterings. Existing measures, though, can be limited in their ability to assess similarity and sometimes generate unintuitive results. They also cannot be applied to compare clusterings that contain different data points, an activity that is important for scenarios such as data stream analysis. In this paper, we introduce a new clustering similarity measure, known as ADCO, which aims to address some limitations of existing measures by allowing greater flexibility of comparison via the use of density profiles to characterize a clustering. In particular, it adopts a ‘data mining style’ philosophy to clustering comparison, whereby two clusterings are considered more similar if they are likely to give rise to similar types of prediction models. Furthermore, we show that this new measure can be applied as a highly effective objective function within a new algorithm, known as MAXIMUS, for generating alternate clusterings.
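The abstract does not spell out how a density profile is built or compared, so the following is only a rough sketch of the general idea and should not be read as the paper's definition of ADCO. It assumes equal-width binning of each attribute, the same number of clusters in both clusterings, a brute-force search over cluster pairings, and normalization by the larger self-score; the function names `density_profile`, `best_match_score`, and `profile_similarity` are hypothetical.

```python
from itertools import permutations


def density_profile(points, labels, n_clusters, n_bins, lo, hi):
    """Per-cluster counts over equal-width bins of every attribute (assumed scheme)."""
    n_attrs = len(points[0])
    profile = [[0] * (n_attrs * n_bins) for _ in range(n_clusters)]
    for point, label in zip(points, labels):
        for a, value in enumerate(point):
            width = (hi[a] - lo[a]) / n_bins
            # Clamp the bin index so values on the range boundary stay in range.
            b = max(0, min(int((value - lo[a]) / width), n_bins - 1))
            profile[label][a * n_bins + b] += 1
    return profile


def best_match_score(prof_a, prof_b):
    """Largest sum of dot products over all one-to-one cluster pairings."""
    k = len(prof_a)
    dot = [[sum(x * y for x, y in zip(pa, pb)) for pb in prof_b] for pa in prof_a]
    # Brute force is fine for a handful of clusters; it grows as k factorial.
    return max(sum(dot[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))


def profile_similarity(prof_a, prof_b):
    """Matched score divided by the larger self-score (assumed normalization)."""
    norm = max(best_match_score(prof_a, prof_a), best_match_score(prof_b, prof_b))
    return best_match_score(prof_a, prof_b) / norm if norm else 0.0


# Toy data: four 2-D points and three labelings of them.
points = [(0.1, 0.1), (0.2, 0.3), (0.8, 0.9), (0.9, 0.7)]
lo, hi = (0.0, 0.0), (1.0, 1.0)
labels_a = [0, 0, 1, 1]
labels_b = [1, 1, 0, 0]  # same partition as labels_a, cluster ids swapped
labels_c = [0, 1, 0, 1]  # mixes the two dense regions

prof_a = density_profile(points, labels_a, 2, 2, lo, hi)
prof_b = density_profile(points, labels_b, 2, 2, lo, hi)
prof_c = density_profile(points, labels_c, 2, 2, lo, hi)

print(profile_similarity(prof_a, prof_b))  # 1.0
print(profile_similarity(prof_a, prof_c))  # 0.5
```

Because the comparison operates on bin counts rather than on point identities, a sketch like this can in principle compare clusterings of different data sets, provided both are binned over the same attribute ranges. For more than a few clusters, the permutation search would need to be replaced by an assignment solver.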

References

  • Aggarwal CC (2003) A framework for diagnosing changes in evolving data streams. In: Proceedings of ACM SIGMOD international conference on management of data, pp 575–586

  • Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, pp 81–92

  • Bacardit J, Garrell JM (2004) Analysis and improvements of the adaptive discretization intervals knowledge representation. In: GECCO, vol 2, pp 726–738

  • Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: International conference on data mining, pp 53–62

  • Bae E, Bailey J, Dong G (2006) Clustering similarity comparison using density profiles. In: Australian joint conference on artificial intelligence, pp 342–351

  • Borg I, Groenen P (1997) Modern multidimensional scaling: theory and applications. Springer, Berlin

  • Caruana R, Elhawary M, Nguyen N, Smith C (2006) Meta clustering. In: International conference on data mining, pp 107–118

  • Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. In: International journal of approximate reasoning, pp 294–301

  • Davidson I (2005a) Clustering with constraints: feasibility issues and the k-means algorithm. In: SIAM international conference on data mining

  • Davidson I (2005b) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Pacific Asia conference on knowledge discovery, pp 59–70

  • Davidson I, Ravi S (2006) Identifying and generating easy sets of constraints for clustering. In: Conference on artificial intelligence

  • Dunn J (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3: 32–57

  • Ekman G (1963) A direct method for multidimensional ratio scaling. Psychometrika 28(1): 33–41

  • Estivill-Castro V (2002) Why so many clustering algorithms: a position paper. SIGKDD Explor Newsl 4(1): 65–75

  • Fayyad UM, Irani KB (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8: 87–102

  • Fred A, Jain A (2003) Robust data clustering. In: Proceedings of conference on computer vision and pattern recognition, pp 128–133

  • Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6): 835–850

  • Gondek D (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82

  • Gondek D, Hofmann T (2003) Conditional information bottleneck clustering. In: International conference on data mining, pp 36–42

  • Gondek D, Hofmann T (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82

  • Gower JC, Legendre P (1986) Metric and dissimilarity properties of dissimilarity coefficients. J Classif 3: 5–48

  • Gregson RAM (1975) Psychometrics of similarity. Academic Press, San Diego

  • Hamers L, Hemeryck Y, Herweyers G, Janssen M, Keters H, Rousseau R, Vanhoutte A (1989) Similarity measures in scientometric research: the Jaccard index versus Salton’s cosine formula. Inf Process Manag 25(3): 315–318

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1): 193–218

  • Karypis G, Aggarwal R, Kumar V, Shekhar S (1997) Multilevel hypergraph partitioning: application in VLSI domain. In: Design automation conference, p 526

  • Kendall K (1999) A database of computer attacks for the evaluation of intrusion detection systems. Master's thesis, Massachusetts Institute of Technology

  • Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2: 83–97

  • Larsen B, Aone C (1999) Fast and effective text mining using linear time document clustering. In: Proceedings of the conference on knowledge discovery and data mining, pp 16–22

  • Meila M (2002) Comparing clusterings. Technical Report, Department of Statistics, University of Washington

  • Meila M (2003) Comparing clusterings—technical report. http://citeseer.ist.psu.edu/meila02comparing.html

  • Meila M (2005) Comparing clusterings—an axiomatic view. In: International conference on machine learning

  • Meilǎ M (2005) Comparing clusterings: an axiomatic view. In: Proceedings of the 22nd international conference on machine learning, pp 577–584

  • Mixed Integer Linear Programming (MILP) Solver (2007). http://lpsolve.sourceforge.net

  • Mirkin B (2005) Clustering for data mining: a data recovery approach. Chapman and Hall/CRC, Boca Raton

  • Rand W (1971a) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 846–850

  • Rand WM (1971b) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 622–626

  • Ratanamahatana C (2003) Cloni: clustering of square root of n interval discretization. Data Mining IV, Info. and Comm. Tech 29

  • UCI Machine Learning Repository (2008) http://archive.ics.uci.edu/ml

  • Richeldi M, Rossotto M (1995) Class-driven statistical discretization of continuous attributes (extended abstract). In: Proceedings of the 8th European conference on machine learning. Springer, London, UK, pp 335–338

  • Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617

  • Streilein WW, Cunningham RK, Webster SE (2001) Improved detection of low-profile probe and denial-of-service attacks. In: Proceedings of workshop on statistical and machine learning techniques in computer intrusion detection

  • Sung AH, Mukkamala S (2003) Identifying important features for intrusion detection using support vector machines and neural networks. In: Proceedings of the symposium on applications and the internet (SAINT), pp 209–217

  • Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic Press, San Diego

  • Tishby N, Pereira F, Bialek W (1999) The information bottleneck method. Allerton Conference on Communication, Control and Computing, pp 368–377

  • Topchy A, Jain AK (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12): 1866–1881

  • Topchy AP, Law MHC, Jain AK, Fred AL (2004a) Analysis of consensus partition in cluster ensemble. In: Proceedings of the 4th IEEE international conference on data mining, pp 225–232

  • Topchy A, Martin H, Law C, Jain A, Fred A (2004b) Analysis of consensus partition in cluster ensemble. In: International conference on data mining, pp 225–232

  • Torgo L, Soares C (1998) Dynamic discretization of continuous attributes. In: Proceedings of the 6th Ibero-American conference on AI, pp 160–169

  • Wallace DL (1983) Comment. J Am Stat Assoc 78(383): 569–576

  • Yang Y, Webb GI (2009) Discretization for naive-Bayes learning: managing discretization bias and variance. Mach Learn 74(1): 39–74

  • Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on machine learning, pp 1028–1035

Author information

Corresponding author

Correspondence to James Bailey.

Additional information

Responsible editor: Charu Aggarwal.

Part of this work appeared in a preliminary form in Bae et al. (2006). See Sect. 2 for discussion.

About this article

Cite this article

Bae, E., Bailey, J. & Dong, G. A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings. Data Min Knowl Disc 21, 427–471 (2010). https://doi.org/10.1007/s10618-009-0164-z

  • DOI: https://doi.org/10.1007/s10618-009-0164-z
