Constraint-driven clustering

Authors:
Rong Ge (Simon Fraser University)
Martin Ester (Simon Fraser University)
Wen Jin (Simon Fraser University)
Ian Davidson (State University of New York: Albany)

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2007, Pages 320–329
https://doi.org/10.1145/1281192.1281229

Published: 12 August 2007

ABSTRACT

Clustering methods can be either data-driven or need-driven. Data-driven methods aim to discover the true structure of the underlying data, while need-driven methods aim to organize that structure to meet certain application requirements. Thus, need-driven (e.g., constrained) clustering can find more useful and actionable clusters in applications such as energy-aware sensor networks, privacy preservation, and market segmentation. However, existing constrained clustering methods require users to provide the number of clusters, which is often unknown in advance but has a crucial impact on the clustering result. In this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. For this purpose, we introduce a novel cluster model, Constraint-Driven Clustering (CDC), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. Two general types of constraints are considered, namely minimum significance constraints and minimum variance constraints, as well as combinations of the two. We prove the NP-hardness of the CDC problem with different constraints. We propose a novel dynamic data structure, the CD-Tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the CDC constraints and minimizes the objective function. Based on CD-Trees, we develop an efficient algorithm to solve the new clustering problem. Our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
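
To make the cluster model in the abstract more concrete, the following minimal Python sketch checks whether a given clustering is feasible and scores its compactness. It is an illustration only, not the paper's algorithm or its CD-Tree. The abstract does not define the constraints formally, so the sketch assumes that a minimum significance constraint means "each cluster contains at least `min_size` points", that a minimum variance constraint means "the mean squared distance to the cluster centroid is at least `min_variance`", and that compactness is measured by the sum of squared distances to the centroids. All function and parameter names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def sum_squared_distance(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def cluster_variance(points):
    """Mean squared distance to the centroid (assumed variance measure)."""
    return sum_squared_distance(points) / len(points)


def satisfies_constraints(clusters, min_size=1, min_variance=0.0):
    """True if every cluster meets both assumed constraints.

    min_size     -- assumed reading of a minimum significance constraint
    min_variance -- assumed reading of a minimum variance constraint
    """
    return all(
        len(c) >= min_size and cluster_variance(c) >= min_variance
        for c in clusters
    )


def objective(clusters):
    """Assumed compactness objective: total within-cluster squared distance."""
    return sum(sum_squared_distance(c) for c in clusters)


# Two small 2-D clusters; the number of clusters is not fixed in advance,
# it is whatever partition satisfies the constraints with a low objective.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(10.0, 10.0), (10.0, 12.0), (10.0, 14.0)],
]
feasible = satisfies_constraints(clusters, min_size=2, min_variance=1.0)
cost = objective(clusters)
```

Under these assumptions, raising `min_size` or `min_variance` forces clusters to merge, so the constraints rather than a user-supplied k determine how many clusters a feasible solution can have.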

Index Terms

Constraint-driven clustering
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192
General Chair: Pavel Berkhin (Yahoo!, USA)
Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
NP-hardness
clustering
constraints
Qualifiers
- Article
Conference

Acceptance Rates
KDD '07 Paper Acceptance Rate: 111 of 573 submissions, 19%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
Article Metrics
- Total Citations: 19
- Total Downloads: 1,213
- Downloads (Last 12 months): 6
- Downloads (Last 6 weeks): 0