
Pattern Recognition

Volume 37, Issue 3, March 2004, Pages 487-501

Validity index for crisp and fuzzy clusters

https://doi.org/10.1016/j.patcog.2003.06.005

Abstract

In this article, a cluster validity index and its fuzzification are described, which can provide a measure of the goodness of clustering on different partitions of a data set. The maximum value of this index, called the PBM-index, across the hierarchy provides the best partitioning. The index is defined as a product of three factors, maximization of which ensures the formation of a small number of compact clusters with large separation between at least two of them. We have used both the k-means and the expectation maximization algorithms as the underlying crisp clustering techniques. For fuzzy clustering, we have utilized the well-known fuzzy c-means algorithm. Results demonstrating the superiority of the PBM-index in appropriately determining the number of clusters, as compared to three other well-known measures (the Davies–Bouldin index, Dunn's index and the Xie–Beni index), are provided for several artificial and real-life data sets.

Introduction

Clustering [1], [2], [3], [4], [5] is an unsupervised classification method used when the only data available are unlabelled and no structural information about them is available. In clustering (also known as exploratory data analysis), a set of patterns, usually vectors in a multi-dimensional space, is organized into coherent and contrasted groups, such that patterns in the same group are similar in some sense and patterns in different groups are dissimilar in the same sense. The purpose of any clustering technique is to evolve a partition matrix U(X) of a given data set X (consisting of, say, n patterns, X = {x_1, x_2, …, x_n}) so as to find a number, say K, of clusters (C_1, C_2, …, C_K). The partition matrix U(X) of size K×n may be represented as U = [u_kj], k = 1,…,K and j = 1,…,n, where u_kj is the membership of pattern x_j to cluster C_k. In crisp partitioning of the data, the following condition holds: u_kj = 1 if x_j ∈ C_k, otherwise u_kj = 0. The purpose is to classify the data set X such that

C_i ≠ ∅ for i = 1,…,K,
C_i ∩ C_j = ∅ for i = 1,…,K, j = 1,…,K and i ≠ j, and
∪_{i=1}^{K} C_i = X.

In the case of fuzzy clustering, the purpose is to evolve an appropriate partition matrix U = [u_kj]_{K×n}, where u_kj ∈ [0,1], such that u_kj denotes the grade of membership of the jth element to the kth cluster. In fuzzy partitioning of the data, the following conditions hold:

0 < Σ_{j=1}^{n} u_kj < n for k = 1,…,K,
Σ_{k=1}^{K} u_kj = 1 for j = 1,…,n, and
Σ_{k=1}^{K} Σ_{j=1}^{n} u_kj = n.

The k-means algorithm [5] is one of the best-known partitional clustering methods; it produces minimum-squared-error partitions. When the number of clusters is known a priori, the k-means algorithm optimizes the distance criterion either by minimizing the within-cluster spread or by maximizing the inter-cluster separation. The expectation maximization (EM) algorithm [6] is considered an appropriate optimization algorithm for constructing proper statistical models of the data. It provides a probabilistic clustering, where each data element has a certain probability of being a member of any cluster. Unlike the k-means algorithm, it does not depend on any distance measure, and it accommodates categorical and continuous data in a superior manner.
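As a concrete illustration of the crisp partition conditions above, the following sketch (a toy example of ours, not taken from the paper) builds the K×n partition matrix U from a label vector and verifies the three conditions:

```python
import numpy as np

# Toy assignment: n = 6 patterns placed into K = 2 crisp clusters.
labels = np.array([0, 0, 1, 1, 0, 1])
K, n = 2, len(labels)

# Build the K x n crisp partition matrix U = [u_kj]: u_kj = 1 iff x_j is in C_k.
U = np.zeros((K, n), dtype=int)
U[labels, np.arange(n)] = 1

# The three crisp partition conditions from the introduction:
assert all(U[k].sum() > 0 for k in range(K))   # C_i is non-empty for every i
assert np.all(U.sum(axis=0) == 1)              # clusters are pairwise disjoint
assert U.sum() == n                            # the union of all C_i covers X
```

Each column of U is a one-hot vector, so disjointness and coverage are equivalent to every column summing to exactly 1.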

The two fundamental questions that need to be addressed in any typical clustering scenario are: (i) how many clusters are actually present in the data, and (ii) how real or good is the clustering itself. That is, whatever the clustering technique may be, one has to determine the number of clusters and also the validity of the clusters formed [7]. The measure of validity of the clusters should be such that it is able to impose an ordering of the clusterings in terms of their goodness. In other words, if U_1, U_2, …, U_m are m partitions of X, and the corresponding values of a validity measure are V_1, V_2, …, V_m, then V_{k1} ⩾ V_{k2} ⩾ ⋯ ⩾ V_{km}, k_i ∈ {1,2,…,m}, i = 1,2,…,m, will indicate that U_{k1} ↑ U_{k2} ↑ ⋯ ↑ U_{km}. Here 'U_i ↑ U_j' indicates that partition U_i is a better clustering than U_j. Note that a validity measure may also define an increasing sequence instead of a decreasing sequence V_{k1},…,V_{km}.

In this paper, we describe an index, called PBM-index, which can be used to associate a measure with different partitions of a data set; the maximum value of which indicates the appropriate partitioning. Therefore, if the number of clusters, K, is varied within some range, and an underlying clustering technique is used to partition the data, then the value of K corresponding to the maximum value of PBM-index will indicate the correct number of clusters present in the data. The effectiveness of this index, for determining the appropriate number of clusters, is demonstrated for four artificial and two real-life data sets.
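The selection scheme described above can be sketched in code. The snippet below sweeps K, runs a minimal k-means of our own (an illustrative sketch with deterministic farthest-first seeding, not the authors' implementation), and picks the K maximizing the PBM-index, computed here from its published definition PBM(K) = ((1/K)·(E_1/E_K)·D_K)²:

```python
import numpy as np

def farthest_first_centers(X, K):
    # Deterministic farthest-first seeding so the sketch is reproducible.
    centers = [X[0]]
    for _ in range(K - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    return np.array(centers)

def kmeans(X, K, iters=50):
    # Minimal Lloyd-style k-means (illustrative, not the paper's code).
    centers = farthest_first_centers(X, K)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels, centers

def pbm_index(X, labels, centers):
    # PBM(K) = ((1/K) * (E_1 / E_K) * D_K)^2, where
    #   E_1: total distance of all points to the grand mean (the K = 1 case),
    #   E_K: sum of distances of points to their own cluster centers,
    #   D_K: maximum separation between any two cluster centers.
    K = len(centers)
    E1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()
    EK = sum(np.linalg.norm(X[labels == k] - centers[k], axis=1).sum()
             for k in range(K))
    DK = max(np.linalg.norm(centers[i] - centers[j])
             for i in range(K) for j in range(i + 1, K))
    return ((E1 / (K * EK)) * DK) ** 2

# Three well-separated Gaussian blobs: the index should peak at K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.2, size=(30, 2)) for m in ([0, 0], [5, 0], [0, 5])])
scores = {K: pbm_index(X, *kmeans(X, K)) for K in range(2, 7)}
best_K = max(scores, key=scores.get)
```

The three factors interact as the abstract describes: 1/K penalizes many clusters, E_1/E_K rewards compactness, and D_K rewards a large separation between at least two clusters.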

Other well-known cluster validity indices available in the literature are the Davies–Bouldin (DB) index [8], Dunn's index [9] (both primarily for hard clusters), and the Xie–Beni (XB) index [10] (for fuzzy clusters). The Davies–Bouldin index is a function of the ratio of within-cluster scatter to between-cluster separation. Dunn's index is the ratio of the minimum between-cluster separation to the maximum within-cluster distance. The Xie–Beni index is the ratio of the fuzzy within-cluster sum of squared distances to the product of the number of elements and the minimum between-cluster separation. In order to demonstrate the effectiveness of the PBM-index, we compare its performance with the other indices in evolving the proper number of clusters for four artificial and four real-life data sets. For this purpose, both the k-means [5] and EM algorithms [6] have been used as the underlying clustering strategy.
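For comparison, the DB index just described can be sketched directly from its standard definition (a minimal illustration of ours, not the paper's experimental code); note that, unlike the PBM-index, it is minimized rather than maximized:

```python
import numpy as np

def davies_bouldin(X, labels, centers):
    # DB = (1/K) * sum_i max_{j != i} (S_i + S_j) / d(z_i, z_j), where S_i is
    # the average distance of the points of cluster C_i to its center z_i.
    # Lower values indicate a better partition.
    K = len(centers)
    S = np.array([np.linalg.norm(X[labels == k] - centers[k], axis=1).mean()
                  for k in range(K)])
    R = [max((S[i] + S[j]) / np.linalg.norm(centers[i] - centers[j])
             for j in range(K) if j != i)
         for i in range(K)]
    return float(np.mean(R))

# Two tight, well-separated pairs of points: the correct 2-cluster partition
# scores far lower (better) than a deliberately wrong one.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = davies_bouldin(X, np.array([0, 0, 1, 1]), np.array([[0.0, 0.5], [10.0, 0.5]]))
bad = davies_bouldin(X, np.array([0, 1, 0, 1]), np.array([[5.0, 0.0], [5.0, 1.0]]))
```

Here `good` evaluates to 0.1 (scatter 0.5 + 0.5 over separation 10), while `bad` evaluates to 10, illustrating how the index penalizes large scatter relative to separation.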

In a part of the investigation, a fuzzified version of the PBM-index is proposed. Again, the maximum value of the fuzzy index over different fuzzy partitions of the data indicates the appropriate clustering. For this purpose, the fuzzy c-means (FCM) algorithm [11] is used as the underlying clustering technique. FCM uses the principles of fuzzy sets to partition a data set into a fixed number, c, of clusters, thereby providing the appropriate c×n partition matrix. The performance of the fuzzy PBM-index is compared with that of the XB-index in determining the proper number of fuzzy clusters for different data sets.
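A minimal FCM sketch shows how the c×n fuzzy partition matrix arises and that it satisfies the fuzzy partition conditions from the introduction by construction (the fuzzifier m = 2 and the toy data are our assumptions, not the authors' experimental settings):

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    # Minimal fuzzy c-means: alternate the standard center and membership
    # updates. Returns the c x n membership matrix U and the cluster centers.
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                        # columns of U must sum to 1
    for _ in range(iters):
        W = U ** m                            # fuzzified memberships
        centers = (W @ X) / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))           # u_kj proportional to d_kj^(-2/(m-1))
        U /= U.sum(axis=0)
    return U, centers

# Two Gaussian blobs; check the fuzzy partition conditions on the result.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(mu, 0.3, size=(20, 2)) for mu in ([0, 0], [4, 4])])
U, centers = fcm(X, c=2)
assert np.allclose(U.sum(axis=0), 1.0)                          # sum_k u_kj = 1
assert np.isclose(U.sum(), len(X))                              # sum over all u_kj = n
assert np.all((U.sum(axis=1) > 0) & (U.sum(axis=1) < len(X)))   # 0 < sum_j u_kj < n
```

The first condition holds because each column is explicitly normalized; the other two follow from it, since every membership lies strictly between 0 and 1.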

Section snippets

PBM-index: a measure for cluster validity

In this section, we first define the PBM-index. This is followed by an explanation of the interaction among the different components of the index, which enables it to appropriately indicate the proper partitioning of the data.

Experimental results

Four artificial and four real-life data sets are considered for experiments. They are described first. This is followed by a demonstration of the variation of the PBM-index with the number of clusters, when the k-means algorithm and the EM algorithm are used as the underlying clustering mechanisms. Finally, a comparison of the PBM-index with the Davies–Bouldin (DB) index, Dunn's index and the Xie–Beni (XB) index is made in terms of the number of clusters and the clustering obtained for the above-mentioned data sets.

Discussion and conclusions

A cluster validity index is described in this article. It is found to attain its maximum value when the data are properly clustered. Therefore, this new index may be used for evolving the appropriate number of clusters in a given data set. Moreover, proper partitioning of the data set may also be achieved using the PBM-index. The performance of this index in providing the correct number of clusters is compared with those of the well-known DB-index, Dunn's index and XB-index.

Summary

Clustering is an unsupervised classification scheme in which no a priori knowledge of the data set is available. In clustering (also known as exploratory data analysis), a set of patterns, usually vectors in a multidimensional space, is organized into a number of coherent and contrasted groups, such that patterns in the same group are similar in some sense and patterns in different groups are dissimilar in the same sense. Clustering can be performed in either crisp or fuzzy mode.


References (18)

  • R.C. Dubes et al.

Clustering techniques: the user's dilemma

    Pattern Recognition

    (1976)
  • R. Kothari et al.

    On finding the number of clusters

    Pattern Recognition Lett.

    (1999)
  • M.R. Anderberg

Cluster Analysis for Applications

    (1973)
  • J.A. Hartigan

    Clustering Algorithms

    (1975)
  • P.A. Devijver et al.

    Pattern Recognition: A Statistical Approach

    (1982)
  • A.K. Jain et al.

    Algorithms for Clustering Data

    (1988)
  • J.T. Tou et al.

    Pattern Recognition Principles

    (1974)
  • P.S. Bradley, U. Fayyad, C. Reina, Scaling EM (expectation maximization) clustering to large databases, Technical...
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1979)


About the Author—MALAY K. PAKHIRA received his B.Sc. degree in Physics and B.Tech. degree in Computer Science and Technology from the University of Calcutta, Calcutta, India in 1987 and 1990, respectively. He received his Masters in Computer Science and Engineering from Jadavpur University, Calcutta, India in 1992. Currently he is a lecturer in Computer Science and Technology at Kalyani Government Engineering College, West Bengal, India. At present he is doing his Ph.D. research work on Unsupervised Pattern Classification. His research interests include Image Processing, Pattern Recognition, Evolutionary Algorithms, Soft Computing and Data Mining.

About the Author—SANGHAMITRA BANDYOPADHYAY did her Bachelors in Physics and Computer Science in 1988 and 1991, respectively, from University of Calcutta, Calcutta, India. Subsequently, she did her Masters in Computer Science from Indian Institute of Technology, Kharagpur, India in 1993 and Ph.D. in Computer Science from Indian Statistical Institute, Calcutta, India in 1998. Currently she is an Assistant Professor at Indian Statistical Institute, Calcutta, India. She is the first recipient of Dr. Shanker Dayal Sharma Gold Medal and Institute Silver Medal for being adjudged the best all round postgraduate performer in 1994. She has worked in Los Alamos National Laboratory in 1997 as a graduate research assistant and in the University of New South Wales, Sydney, Australia. as a post doctoral fellow. Dr. Bandyopadhyay received the Indian National Science Academy (INSA) and the Indian Science Congress Association (ISCA) Young Scientist Awards in 2000, and Indian National Academy of Engineers (INAE) Young Scientist Awards in 2002. Her research interests include Evolutionary and Soft Computing, Pattern Recognition, Data Mining, Parallel Processing and Distributed Computing.

About the Author—UJJWAL MAULIK did his Bachelors in Physics and Computer Science in 1986 and 1989 respectively from University of Calcutta, Calcutta, India. Subsequently, he did his Masters and Ph.D. in Computer Science in 1991 and 1997, respectively, from Jadavpur University, India. Dr. Maulik has worked as a scientist in Center for Adaptive Systems Application, Los Alamos, and Los Alamos National Laboratory, New Mexico, USA in 1997. In 1999, he went on a postdoctoral assignment to University of New South Wales, Sydney, Australia. He is the recipient of the Government of India BOYSCAST fellowship for doing research in University of Texas at Arlington, USA in 2001. Dr. Maulik has been elected Fellow of the Institute of Electronics and Telecommunication Engineers (IETE). He is currently an assistant professor in the Department of Computer Science, Kalyani Government Engineering College, India. His research interests include Parallel and Distributed Processing, Natural Language Processing, Evolutionary Computation, Pattern Recognition and Data Mining.
