rFILTA: relevant and nonredundant view discovery from collections of clusterings via filtering and ranking

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Meta-clustering is a popular approach for finding multiple clusterings in a dataset: it takes a large number of base clusterings as input and groups them for further user navigation and refinement. However, the effectiveness of meta-clustering is highly dependent on the distribution of the base clusterings, and open challenges remain with regard to its stability and noise tolerance. In addition, the clustering views returned may not all be relevant, so how to rank these clustering views is also an open challenge. In this paper, we propose a simple and effective filtering algorithm that can be used flexibly in conjunction with any meta-clustering method. We also propose an unsupervised method for ranking the returned clustering views. We evaluate the resulting framework (rFILTA) on both synthetic and real-world datasets, and show how its use can enhance clustering view discovery in complex scenarios.


Notes

  1. Please refer to Sect. 9.5.1 for more details about the dataset and experiments.

  2. The similarity between two clusterings is measured by adjusted mutual information (AMI), which is introduced in Sect. 4.1 (an illustrative computation is sketched after these notes).

  3. We demonstrate this point in the experimental section; refer to Fig. 21.

  4. The normalized version of mutual information, which scales mutual information to [0, 1].

  5. Even though we are not sampling the clustering space uniformly, the size of the meta-cluster can be considered as one reasonable standard.

  6. The card images are downloaded from https://code.google.com/p/vectorized-playing-cards/.

  7. The code of feature extraction is available at https://github.com/adikhosla/feature-extraction.

  8. www.cs.cmu.edu/webkb.
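As a brief illustration of footnotes 2 and 4, the snippet below computes AMI and the normalized variant of mutual information between two label vectors using scikit-learn. This is an editorial sketch added for the reader's convenience, not code from the paper; the label vectors are made-up examples.

```python
# Hypothetical illustration (not the paper's code): measuring the similarity
# between two clusterings with adjusted mutual information (footnote 2) and
# with normalized mutual information (footnote 4).
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

# Two base clusterings of the same 8 objects, encoded as label vectors.
clustering_a = [0, 0, 1, 1, 2, 2, 2, 1]
clustering_b = [1, 1, 0, 0, 2, 2, 2, 0]

# AMI corrects mutual information for chance agreement (see ref. [40]);
# NMI simply rescales mutual information to [0, 1].
print(adjusted_mutual_info_score(clustering_a, clustering_b))
print(normalized_mutual_info_score(clustering_a, clustering_b))
```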

References

  1. Azimi J, Fern X (2009) Adaptive cluster ensemble selection. In: IJCAI vol 9, pp 992–997

  2. Bache K, Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

  3. Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: Sixth international conference on data mining (ICDM'06). IEEE, pp 53–62

  4. Bailey J (2013) Alternative clustering analysis: a review. In: Aggarwal C, Reddy C (eds) Data clustering: algorithms and applications. CRC Press, Boca Raton

  5. Caruana R, Elhaway M, Nguyen N, Smith C (2006) Meta clustering. In: Proceedings of ICDM, pp 107–118

  6. Cui Y, Fern XZ, Dy JG (2007) Multi-view clustering via orthogonalization. In: Proceedings of ICDM, pp 133–142

  7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE computer society Conference on computer vision and pattern recognition, 2005 (CVPR’2005) IEEE, vol 1, pp 886–893

  8. Dang XH, Bailey J (2010) A hierarchical information theoretic technique for the discovery of non linear alternative clusterings. In: Proceedings of KDD'10, pp 573–582

  9. Dang XH, Bailey J (2014) Generating multiple alternative clusterings via globally optimal subspaces. Data Min Knowl Discov 28(3):569–592

  10. Dang XH, Bailey J (2015) A framework to uncover multiple alternative clusterings. Mach Learn 98(1–2):7–30

  11. Davidson I, Qi Z (2008) Finding alternative clusterings using constraints. In: Proceedings of ICDM, pp 773–778

  12. Faivishevsky L, Goldberger J (2010) Nonparametric information theoretic clustering algorithm. In: Proceedings of ICML, pp 351–358

  13. Fern XZ, Lin W (2008) Cluster ensemble selection. Stat Anal Data Min 1(3):128–141

  14. Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data (TKDD) 1(1):4

  15. Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38:293–306

  16. Gullo F, Domeniconi C, Tagarelli A (2015) Metacluster-based projective clustering ensembles. Mach Learn 98(1–2):181–216

  17. Hadjitodorov ST, Kuncheva LI, Todorova LP (2006) Moderate diversity for better cluster ensembles. Inf Fusion 7(3):264–275

  18. Havens TC, Bezdek JC (2012) An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans Knowl Data Eng 24(5):813–822

  19. Havens TC, Bezdek JC, Keller JM, Popescu M (2009) Clustering in ordered dissimilarity data. Int J Intell Syst 24(5):504–528

  20. Hossain MS, Ramakrishnan N, Davidson I, Watson LT (2013) How to “alternatize” a clustering algorithm. Data Min Knowl Discov 27(2):193–224

  21. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

  22. Jain P, Meka R, Dhillon IS (2008) Simultaneous unsupervised learning of disparate clusterings. Stat Anal Data Min: ASA Data Sci J 1(3):195–210

  23. Jaskowiak PA, Moulavi D, Furtado AC, Campello RJ, Zimek A, Sander J (2016) On strategies for building effective ensembles of relative clustering validity criteria. Knowl Inf Syst 47(2):329–354

  24. Lei Y, Vinh NX, Chan J, Bailey J (2014) FILTA: better view discovery from collections of clusterings via filtering. In: Machine learning and knowledge discovery in databases. Springer, Berlin, pp 145–160

  25. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge

  26. Naldi MC, Carvalho A, Campello RJ (2013) Cluster ensemble selection based on relative validity indexes. Data Min Knowl Discov 27(2):259–289

  27. Nguyen N, Caruana R (2007) Consensus clusterings. In: Seventh IEEE international conference on data mining (ICDM’2007). IEEE, pp 607–612

  28. Nie F, Xu D, Li X (2012) Initialization independent clustering with actively self-training method. IEEE Trans Syst, Man, Cybern, Part B (Cybern) 42(1):17–27

  29. Nie F, Wang X, Huang H (2014) Clustering and projected clustering with adaptive neighbors. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 977–986

  30. Nilsback ME, Zisserman A (2006) A visual vocabulary for flower classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 2, pp 1447–1454

  31. Niu D, Dy JG, Jordan MI (2014) Iterative discovery of multiple alternative clustering views. IEEE Trans Pattern Anal Mach Intell 36(7):1340–1353

  32. Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

  33. Phillips JM, Raman P, Venkatasubramanian S (2011) Generating a diverse set of high-quality clusterings. arXiv:1108.0017

  34. Pihur V, Datta S, Datta S (2007) Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 23(13):1607–1615

  35. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  36. Sheng W, Swift S, Zhang L, Liu X (2005) A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Trans Syst, Man, Cybern, Part B (Cybern) 35(6):1156–1167

  37. Strehl A, Ghosh J (2003) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

  38. Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881

  39. Vinh NX, Epps J (2010) minCEntropy: a novel information theoretic approach for the generation of alternative clusterings. In: Proceedings of the ICDM, pp 521–530

  40. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of ICML. ACM, pp 1073–1080

  41. Wang L, Nguyen UT, Bezdek JC, Leckie CA, Ramamohanarao K (2010) iVAT and aVAT: enhanced visual analysis for cluster tendency assessment. In: Proceedings of PAKDD, pp 16–27

  42. Wang H, Shan H, Banerjee A (2011) Bayesian cluster ensembles. Stat Anal Data Min 4(1):54–70

  43. Zhang Y, Li T (2011) Extending consensus clustering to explore multiple clustering views. In: Proceedings of the SDM, pp 920–931

Author information

Corresponding author

Correspondence to Yang Lei.

Additional information

Jeffrey Chan conducted part of this work while at the University of Melbourne.

Appendices

Appendix 1: Flower dataset

The flower image dataset [30] consists of 17 species of flowers with 80 images of each. We choose images from four species: Buttercup, Daisy, Windflower and Sunflower (Fig. 28). For each species, we randomly choose 16 images, giving 64 images in total. There are two natural clustering views in this dataset: color (white and yellow) and shape (sharp and round). To reduce background clutter and focus on the flowers, we preprocessed the images by blacking out the background. We scaled these images to \(120\times 120\) pixels and extracted their features in the same way as for the card dataset, so that each image is finally represented by 22 features. We generate 700 base clusterings on this dataset, each with 2 clusters. The two ground truth clustering views are shown in Fig. 29.
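The procedure used to generate the pool of base clusterings is described in the main text and is not restated here. Purely as a rough, hypothetical sketch of how such a pool could be produced, the snippet below runs k-means repeatedly under random feature weightings and seeds; the weighting scheme and all parameter choices are assumptions for illustration, not the authors' generation method.

```python
# Hypothetical sketch only: generating a pool of diverse 2-cluster base clusterings
# by randomly re-weighting features before k-means. The counts (700 clusterings,
# 64 objects, 22 features) mirror the appendix, but the procedure itself is an
# assumption, not the authors' base clustering generation method.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((64, 22))  # stand-in for the 64 flower images x 22 features

def generate_base_clusterings(X, n_clusterings=700, n_clusters=2):
    clusterings = []
    for i in range(n_clusterings):
        weights = rng.random(X.shape[1])  # random feature weighting for diversity
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=i)
        clusterings.append(km.fit_predict(X * weights))
    return clusterings

base_clusterings = generate_base_clusterings(X)
print(len(base_clusterings), len(base_clusterings[0]))  # 700 clusterings of 64 objects
```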

Fig. 28 Example images of buttercup, sunflower, windflower and daisy flowers, from left to right, from top to bottom

Fig. 29 Two ground truth clustering views on the flower dataset. The first row is the color view and the second row is the shape view (colour figure online)

Fig. 30 Results for the unfiltered base clusterings on the flower dataset. a The iVAT diagram of the 700 unfiltered base clusterings. b The top 4 views obtained from the 700 unfiltered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

We first show the results on the unfiltered base clusterings in Fig. 30. We obtained 62 clustering views from this set of unfiltered base clusterings. The top 4 clustering views are shown in Fig. 30b. The first row is the color view, containing two clusters that represent the two colors, yellow and white. The second row is the shape view, containing the two shapes, sharp and round. The results after filtering with \(L=100,\beta =0.6\) are shown in Fig. 31. As we can see from Fig. 31a, the iVAT diagram contains two clearly separated blocks (meta-clusters) after filtering out the irrelevant clusterings (compare with the unfiltered iVAT diagram in Fig. 30a). The clustering views generated from these two meta-clusters are shown in Fig. 31b, and they are exactly the color and shape views.
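For intuition about the role of the filtering parameters \(L\) and \(\beta\), the sketch below shows a generic greedy quality-versus-redundancy selection of \(L\) base clusterings. Both the quality measure (silhouette) and the combination rule are placeholders chosen for illustration; the actual rFILTA filtering objective is the one defined in the main text.

```python
# Schematic sketch of a quality-diversity filtering step of the general kind used
# in rFILTA: greedily keep L base clusterings, trading an (unsupervised) quality
# score against redundancy with the already-selected set via the parameter beta.
# The scoring functions here are placeholders, not the rFILTA criterion.
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

def filter_clusterings(X, clusterings, L=100, beta=0.6):
    quality = [silhouette_score(X, c) for c in clusterings]  # placeholder quality
    selected = [max(range(len(clusterings)), key=lambda i: quality[i])]
    while len(selected) < min(L, len(clusterings)):
        best, best_score = None, float("-inf")
        for i in range(len(clusterings)):
            if i in selected:
                continue
            # Redundancy: average AMI with the clusterings already selected.
            redundancy = sum(adjusted_mutual_info_score(clusterings[i], clusterings[j])
                             for j in selected) / len(selected)
            score = beta * quality[i] - (1 - beta) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [clusterings[i] for i in selected]
```

Applied to a pool such as the one sketched in Appendix 1, decreasing \(\beta\) in this placeholder scheme puts more weight on the redundancy penalty and so admits more diverse clusterings, which is consistent with the behaviour described below for \(\beta =0.3\).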

Fig. 31 Results for the 100 filtered base clusterings on the flower dataset. a The iVAT diagram of the 100 filtered base clusterings with \(\beta =0.6\). b The 2 clustering views obtained from the filtered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

To further demonstrate the utility of ranking, we show another set of results in Fig. 32 with \(L=100, \beta =0.3\). When we decrease the tradeoff parameter to \(\beta =0.3\), more diverse clusterings are included. Thus, the iVAT diagram in Fig. 32a is fuzzier and less tidy than the one for the higher value \(\beta =0.6\) in Fig. 31a. We generated 9 clustering views from this filtered set of base clusterings. The top 4 clustering views are shown in Fig. 32b; the first row is the color view and the second row is the shape view. As we decrease the tradeoff parameter \(\beta \), we select more diverse clusterings, which results in more clustering views.

Fig. 32 Results for the filtered base clusterings on the flower dataset. a The iVAT diagram of the 100 filtered base clusterings with \(\beta =0.3\). b The top 4 views generated from the filtered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

Fig. 33 The MBM scores for two sets of clustering views generated from the unfiltered and filtered base clusterings on the flower dataset

The MBM scores for clustering views generated from the unfiltered base clusterings and the filtered base clusterings with \(\beta =0.3\) are shown in Fig. 33. In summary:

  1. We generated 62 clustering views from the unfiltered base clusterings and 9 clustering views from the filtered base clusterings with \(\beta =0.3\).

  2. The top 2 clustering views from both sets of clusterings recover and match well with the ground truth clustering views.

  3. The ranking function works well, ranking the color and shape views as the top 2.

Appendix 2: Object dataset

The Amsterdam Library of Object Images (ALOI) consists of 110,250 images of 1000 common objects. For each object, a number of photos are taken from different angles and under various lighting conditions. We choose 9 objects with different colors and shapes, for a total of 108 images (Fig. 34). We processed them in the same way as the card dataset and finally extracted 15 features for each image. The two ground truth clustering views on the object dataset are shown in Fig. 35.
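The actual features come from the feature-extraction code referenced in footnote 7. As a hypothetical sketch only of how images can be reduced to a small fixed-length representation (here HOG descriptors followed by PCA down to 15 dimensions, or 22 for the flower dataset), one might proceed as below; this pipeline is an assumption for illustration, not the authors' procedure.

```python
# Hypothetical sketch only: grayscale HOG descriptors reduced by PCA to a small
# number of features per image. Not the authors' feature extraction pipeline.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA

def extract_features(images, n_features=15):
    # images: list of RGB arrays; compute one HOG descriptor per image.
    descriptors = []
    for img in images:
        gray = rgb2gray(resize(img, (120, 120)))
        descriptors.append(hog(gray, orientations=9, pixels_per_cell=(16, 16),
                               cells_per_block=(2, 2)))
    # Project the HOG descriptors down to n_features dimensions.
    return PCA(n_components=n_features).fit_transform(np.array(descriptors))
```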

Fig. 34 Example images of the nine selected objects

Fig. 35 Two ground truth clustering views on the object dataset. The first row is the color view and the second row is the shape view (colour figure online)

In this set of experiments, we generate 700 base clusterings, each with 3 clusters. We would like to demonstrate the performance of the filtering and ranking functions in our rFILTA framework. The experimental results on the unfiltered set of base clusterings are shown in Fig. 36. As we can see from the iVAT diagram of the 700 base clusterings in Fig. 36a, there is one big block and two small blocks along the diagonal. We finally generate 3 clustering views, shown in Fig. 36b. The first row is the color view, containing three clusters representing red, green and yellow. We do not find the shape view among the unfiltered base clusterings. Next, we show the results on the filtered set of base clusterings with \(L=100, \beta =0.95\) in Fig. 37. The iVAT diagram now contains multiple clear blocks. The top 4 clustering views are shown in Fig. 37b; the first row is the color view and the fourth row is the shape view.

Fig. 36 Results for the unfiltered base clusterings on the object dataset. a The iVAT diagram of the unfiltered base clusterings. b The top 3 clustering views obtained from the unfiltered base clusterings; the first row is the color view (colour figure online)

Fig. 37 Results for the filtered base clusterings on the object dataset with \(L=100, \beta =0.95\). a The iVAT diagram of the filtered base clusterings on the object dataset. b The top 4 views obtained from the filtered base clusterings; the first row is the color view and the fourth row is the shape view (colour figure online)

Comparing the two sets of results from the unfiltered and filtered base clusterings, we make some observations. Fewer clustering views are generated from the unfiltered base clusterings than from the filtered ones. This may be because many of the generated base clusterings connect different clustering views in the clustering space, so that they appear as one big meta-cluster. After filtering, these connecting base clusterings are removed and the different clustering views become clearly separated. Thus, the iVAT diagram of the unfiltered base clusterings contains only one big dark block and two small blocks, while the iVAT diagram of the filtered base clusterings contains multiple clear blocks. The shape view is not discovered from the unfiltered base clusterings because its meta-cluster is concealed within the big block. After filtering, the shape view is discovered and the quality of the color view increases.
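For readers unfamiliar with iVAT diagrams (refs. [18, 41]), the sketch below shows one way such a diagram over a collection of base clusterings can be computed: pairwise dissimilarities, here assumed to be \(1-\mathrm{AMI}\) in line with footnote 2, are replaced by minimax path distances and reordered with a VAT-style traversal, so that meta-clusters appear as dark blocks along the diagonal. This is an illustrative reimplementation, not the authors' code.

```python
# Illustrative reimplementation of an iVAT matrix over a collection of base
# clusterings (after refs. [18, 41]); not the authors' code. The dissimilarity
# between two clusterings is assumed to be 1 - AMI, consistent with footnote 2.
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def ivat_matrix(clusterings):
    n = len(clusterings)
    # Pairwise dissimilarities between base clusterings.
    D = np.array([[1.0 - adjusted_mutual_info_score(a, b) for b in clusterings]
                  for a in clusterings])
    # iVAT transform: replace each entry by the minimax path distance
    # (Floyd-Warshall-style relaxation with max in place of sum).
    Dp = D.copy()
    for k in range(n):
        Dp = np.minimum(Dp, np.maximum(Dp[:, [k]], Dp[[k], :]))
    # VAT-style ordering: start from a row containing the largest dissimilarity,
    # then repeatedly append the closest not-yet-selected clustering (Prim-like).
    order = [int(np.argmax(D.max(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        nxt = min(remaining, key=lambda j: min(D[i, j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return Dp[np.ix_(order, order)]

# Displaying the returned matrix as a grayscale image (e.g. with matplotlib's
# imshow) shows meta-clusters as dark blocks along the diagonal.
```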

The MBM scores for clustering views generated from the unfiltered and filtered base clusterings are shown in Fig. 38. In summary:

  1. We found 3 clustering views from the unfiltered set of base clusterings and 9 clustering views from the filtered base clusterings with \(L=100, \beta = 0.95\).

  2. The MBM scores for the 3 clustering views generated from the unfiltered base clusterings are flat, as only the color view is recovered and its quality does not improve.

  3. The top 4 clustering views returned from the filtered set of base clusterings recover and match well with the two ground truth views, with MBM\((\mathcal {C} _4)=0.9\).

Fig. 38 MBM scores for clustering views generated from the unfiltered and filtered base clusterings on the object dataset

Cite this article

Lei, Y., Vinh, N.X., Chan, J. et al. rFILTA: relevant and nonredundant view discovery from collections of clusterings via filtering and ranking. Knowl Inf Syst 52, 179–219 (2017). https://doi.org/10.1007/s10115-016-1008-y
