ABSTRACT
Clustering is an essential data mining task with numerous applications. However, data in most real-life applications are high-dimensional, and the relevant information is often spread across multiple relations. To make high-dimensional, cross-relational clustering both effective and efficient, we propose a new approach, called CrossClus, which performs cross-relational clustering under user guidance. We believe that user guidance, even in very simple forms, can be essential for effective high-dimensional clustering, since the user knows the application requirements and data semantics well. CrossClus proceeds as follows: a user specifies a clustering task and selects one or a small set of features pertinent to the task. CrossClus then extracts the set of highly relevant features from multiple relations connected via linkages defined in the database schema, evaluates their effectiveness based on the user's guidance, and identifies interesting clusters that fit the user's needs. The method addresses both the quality of feature extraction and the efficiency of clustering. Our comprehensive experiments demonstrate the effectiveness and scalability of this approach.
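The workflow described above (a user picks a guiding feature, candidate features are gathered from linked relations, each candidate is scored against the guidance, and clusters are built from the relevant ones) can be illustrated with a small sketch. This is not the CrossClus algorithm itself: the toy relations (`students`, `enrollments`), the pairwise-agreement score, the 0.8 threshold, and all function names are assumptions chosen only to make the idea concrete.

```python
from itertools import combinations

# Hypothetical target relation: the user picks "area" as the guiding feature.
students = {
    "s1": {"area": "db", "dorm": "north"},
    "s2": {"area": "db", "dorm": "south"},
    "s3": {"area": "ai", "dorm": "north"},
    "s4": {"area": "ai", "dorm": "south"},
}
# Hypothetical linked relation, joined to students on the student id.
enrollments = [("s1", "sql"), ("s2", "sql"), ("s3", "ml"), ("s4", "ml")]

ids = sorted(students)


def pairwise_agreement(f, g, ids):
    """Fraction of tuple pairs on which two features agree about
    'same group' versus 'different group' (a Rand-index-style score).
    This is a stand-in measure, not the one used in the paper."""
    pairs = list(combinations(ids, 2))
    agree = sum((f[a] == f[b]) == (g[a] == g[b]) for a, b in pairs)
    return agree / len(pairs)


def cluster(ids, features):
    """Group tuples that share the same values on all selected features."""
    groups = {}
    for i in ids:
        key = tuple(features[name][i] for name in sorted(features))
        groups.setdefault(key, []).append(i)
    return sorted(groups.values())


# User guidance: one feature of the target relation.
guide = {i: students[i]["area"] for i in ids}

# Candidate features: one local attribute and one reached through the join.
candidates = {
    "dorm": {i: students[i]["dorm"] for i in ids},
    "course_topic": dict(enrollments),
}

# Score every candidate against the guidance and keep the relevant ones.
scores = {name: pairwise_agreement(guide, feat, ids)
          for name, feat in candidates.items()}
selected = {name: candidates[name]
            for name, s in scores.items() if s >= 0.8}

clusters = cluster(ids, selected)
print(scores)
print(clusters)
```

In this toy data, `course_topic` partitions the students exactly as the guiding feature does, so it is kept, while `dorm` groups them differently and is dropped. A real system would need a similarity measure that tolerates multi-valued joins and a proper clustering algorithm; the exact-match grouping here only keeps the sketch short.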
Index Terms
- Cross-relational clustering with user's guidance