Article

Cross-relational clustering with user's guidance

Authors:
Xiaoxin Yin (UIUC), Jiawei Han (UIUC), Philip S. Yu (IBM T. J. Watson Res. Center)

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 2005, Pages 344–353. https://doi.org/10.1145/1081870.1081910

Published: 21 August 2005

ABSTRACT

Clustering is an essential data mining task with numerous applications. However, data in most real-life applications are high-dimensional, and the relevant information is often spread across multiple relations. To make high-dimensional, cross-relational clustering both effective and efficient, we propose a new approach, called CrossClus, which performs cross-relational clustering with user's guidance. We believe that user's guidance, even in very simple forms, could be essential for effective high-dimensional clustering, since the user knows the application requirements and data semantics well. CrossClus proceeds as follows: a user specifies a clustering task and selects one or a small set of features pertinent to the task. CrossClus then extracts the set of highly relevant features in multiple relations connected via linkages defined in the database schema, evaluates their effectiveness based on the user's guidance, and identifies interesting clusters that fit the user's needs. The method addresses both the quality of feature extraction and the efficiency of clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of this approach.
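
The workflow described in the abstract can be illustrated with a small Python sketch. This is not the CrossClus algorithm itself: the toy relations, the cosine-based feature similarity, the 0.5 relevance threshold, and the greedy grouping step are all assumptions chosen for illustration. The sketch only shows the general shape of the idea: a user-chosen guidance feature is used to score candidate features from linked relations, and the relevant ones are then used to group the target tuples.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy data (not from the paper): the target tuples are students.
students = {"s1": "ai", "s2": "ai", "s3": "db", "s4": "db"}  # user guidance: research group
takes = [("s1", "ml"), ("s1", "vision"), ("s2", "ml"),
         ("s3", "sql"), ("s4", "sql"), ("s4", "storage")]    # linked relation: course areas
office = {"s1": "north", "s2": "south", "s3": "north", "s4": "south"}  # unrelated feature

def value_vectors(pairs):
    """Map each target tuple to a normalized distribution over a feature's values."""
    counts = defaultdict(Counter)
    for tid, value in pairs:
        counts[tid][value] += 1
    return {tid: {v: n / sum(c.values()) for v, n in c.items()}
            for tid, c in counts.items()}

def tuple_similarity(vecs, a, b):
    """Overlap (dot product) of two tuples' value distributions under one feature."""
    va, vb = vecs.get(a, {}), vecs.get(b, {})
    return sum(p * vb.get(v, 0.0) for v, p in va.items())

def feature_similarity(f, g, ids):
    """Cosine similarity between the pairwise tuple similarities induced by two features."""
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    x = [tuple_similarity(f, a, b) for a, b in pairs]
    y = [tuple_similarity(g, a, b) for a, b in pairs]
    norm = sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y))
    return sum(p * q for p, q in zip(x, y)) / norm if norm else 0.0

def cluster(features, ids, min_sim=0.25):
    """Greedy single-link grouping on mean similarity (a simple stand-in clustering step)."""
    label = {t: t for t in ids}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            mean = sum(tuple_similarity(f, a, b) for f in features) / len(features)
            if mean >= min_sim:
                old, new = label[b], label[a]
                for t in ids:
                    if label[t] == old:
                        label[t] = new
    groups = defaultdict(list)
    for t in ids:
        groups[label[t]].append(t)
    return sorted(groups.values())

ids = sorted(students)
guidance = value_vectors(students.items())
candidates = {"takes.area": value_vectors(takes),
              "office": value_vectors(office.items())}
# Score each candidate feature against the user's guidance feature.
scores = {name: feature_similarity(guidance, vecs, ids)
          for name, vecs in candidates.items()}
selected = [name for name, s in scores.items() if s >= 0.5]  # assumed threshold
clusters = cluster([guidance] + [candidates[n] for n in selected], ids)
print(scores, selected, clusters)
```

In this toy setting the course-area feature groups students the same way the guidance feature does, so it is kept, while the office feature is discarded before clustering.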

Index Terms

Information systems → Information systems applications → Data mining

Published in
KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN: 159593135X
DOI: 10.1145/1081870
General Chair: Robert Grossman (University of Illinois at Chicago & Open Data Partners, USA)
Program Chairs: Roberto Bayardo (IBM Almaden Research, USA), Kristin Bennett (RPI, USA)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
data mining
relational databases
Acceptance Rates
Overall acceptance rate: 1,133 of 8,635 submissions (13%)