Characterization and evaluation of similarity measures for pairs of clusterings

  • Regular Paper
Knowledge and Information Systems

Abstract

In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Given the dearth of research into techniques to express the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
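As an illustration of one of the measure families named above, the following is a minimal sketch of Normalised Mutual Information between two flat clusterings of the same items. It is not taken from the article: the function names `entropy` and `nmi`, the choice of the geometric mean of the two entropies as the normaliser (only one of several forms in use), and the convention of returning 0 when either clustering has a single cluster are all assumptions made for this example. The second check uses a pair whose labels are statistically independent, which is how the "independently codistributed" worst case is read here.

```python
from collections import Counter
from math import log, sqrt


def entropy(counts, n):
    """Shannon entropy (natural log) of a clustering given its cluster sizes."""
    return -sum((c / n) * log(c / n) for c in counts if c > 0)


def nmi(labels_a, labels_b):
    """Normalised mutual information between two clusterings of the same items.

    Assumption: normalise by sqrt(H(A) * H(B)); other normalisers exist.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("clusterings must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)               # cluster sizes in A
    count_b = Counter(labels_b)               # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))  # contingency table cells

    # Mutual information summed over the non-empty contingency cells.
    mi = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = entropy(count_a.values(), n)
    h_b = entropy(count_b.values(), n)
    if h_a == 0 or h_b == 0:
        return 0.0  # assumed convention for a single-cluster clustering
    return mi / sqrt(h_a * h_b)


# Same partition under different label names: similarity close to 1.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# Labels of one clustering carry no information about the other: similarity 0.
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score depends only on the contingency table, it is unaffected by how the clusters are named, which is the minimum a clustering similarity measure needs before its behaviour in worst-case scenarios can be examined.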


Author information

Corresponding author

Correspondence to Darius Pfitzner.

About this article

Cite this article

Pfitzner, D., Leibbrandt, R. & Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowl Inf Syst 19, 361–394 (2009). https://doi.org/10.1007/s10115-008-0150-6

  • DOI: https://doi.org/10.1007/s10115-008-0150-6
