rFILTA: relevant and nonredundant view discovery from collections of clusterings via filtering and ranking

  • Regular Paper
  • Published in Knowledge and Information Systems

Abstract

Meta-clustering is a popular approach for finding multiple clusterings in a dataset: it takes a large number of base clusterings as input and groups them for further user navigation and refinement. However, the effectiveness of meta-clustering is highly dependent on the distribution of the base clusterings, and open challenges remain with regard to its stability and noise tolerance. In addition, the clustering views returned may not all be relevant, so how to rank these clustering views is also an open challenge. In this paper, we propose a simple and effective filtering algorithm that can be used flexibly in conjunction with any meta-clustering method. We also propose an unsupervised method for ranking the returned clustering views. We evaluate the resulting framework (rFILTA) on both synthetic and real-world datasets, and show how its use can enhance clustering view discovery in complex scenarios.


Notes

  1. Please refer to Sect. 9.5.1 for more details about the dataset and experiments.

  2. The similarity between two clusterings is measured by adjusted mutual information (AMI), which is introduced in Sect. 4.1 (an illustrative computation is sketched after these notes).

  3. We demonstrate this point in the experimental section; refer to Fig. 21.

  4. The normalized version of mutual information, which scales mutual information to [0, 1].

  5. Even though we are not sampling the clustering space uniformly, the size of the meta-cluster can be considered as one reasonable standard.

  6. The card images are downloaded from https://code.google.com/p/vectorized-playing-cards/.

  7. The code of feature extraction is available at https://github.com/adikhosla/feature-extraction.

  8. www.cs.cmu.edu/webkb.
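As a brief illustration of footnotes 2 and 4, the snippet below computes AMI and the normalized variant of mutual information between two label vectors using scikit-learn. This is an editorial sketch added for the reader's convenience, not code from the paper; the label vectors are made-up examples.

```python
# Hypothetical illustration (not the paper's code): measuring the similarity
# between two clusterings with adjusted mutual information (footnote 2) and
# with normalized mutual information (footnote 4).
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

# Two base clusterings of the same 8 objects, encoded as label vectors.
clustering_a = [0, 0, 1, 1, 2, 2, 2, 1]
clustering_b = [1, 1, 0, 0, 2, 2, 2, 0]

# AMI corrects mutual information for chance agreement (see ref. [40]);
# NMI simply rescales mutual information to [0, 1].
print(adjusted_mutual_info_score(clustering_a, clustering_b))
print(normalized_mutual_info_score(clustering_a, clustering_b))
```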

References

  1. Azimi J, Fern X (2009) Adaptive cluster ensemble selection. In: IJCAI vol 9, pp 992–997

  2. Bache K, Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml

  3. Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: Sixth international conference on data mining (ICDM'06). IEEE, pp 53–62

  4. Bailey J (2013) Alternative clustering analysis: a review. In: Aggarwal C, Reddy C (eds) Data clustering: algorithms and applications. CRC Press, Boca Raton

  5. Caruana R, Elhaway M, Nguyen N, Smith C (2006) Meta clustering. In: Proceedings of ICDM, pp 107–118

  6. Cui Y, Fern XZ, Dy JG (2007) Multi-view clustering via orthogonalization. In: Proceedings of ICDM, pp 133–142

  7. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE computer society Conference on computer vision and pattern recognition, 2005 (CVPR’2005) IEEE, vol 1, pp 886–893

  8. Dang XH, Bailey J (2010) A hierarchical information theoretic technique for the discovery of non linear alternative clusterings. In: Proceedings of KDD'10, pp 573–582

  9. Dang XH, Bailey J (2014) Generating multiple alternative clusterings via globally optimal subspaces. Data Min Knowl Discov 28(3):569–592

  10. Dang XH, Bailey J (2015) A framework to uncover multiple alternative clusterings. Mach Learn 98(1–2):7–30

  11. Davidson I, Qi Z (2008) Finding alternative clusterings using constraints. In: Proceedings of ICDM, pp 773–778

  12. Faivishevsky L, Goldberger J (2010) Nonparametric information theoretic clustering algorithm. In: Proceedings of ICML, pp 351–358

  13. Fern XZ, Lin W (2008) Cluster ensemble selection. Stat Anal Data Min 1(3):128–141

  14. Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data (TKDD) 1(1):4

  15. Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38:293–306

  16. Gullo F, Domeniconi C, Tagarelli A (2015) Metacluster-based projective clustering ensembles. Mach Learn 98(1–2):181–216

  17. Hadjitodorov ST, Kuncheva LI, Todorova LP (2006) Moderate diversity for better cluster ensembles. Inf Fusion 7(3):264–275

  18. Havens TC, Bezdek JC (2012) An efficient formulation of the improved visual assessment of cluster tendency (iVAT) algorithm. IEEE Trans Knowl Data Eng 24(5):813–822

  19. Havens TC, Bezdek JC, Keller JM, Popescu M (2009) Clustering in ordered dissimilarity data. Int J Intell Syst 24(5):504–528

  20. Hossain MS, Ramakrishnan N, Davidson I, Watson LT (2013) How to “alternatize” a clustering algorithm. Data Min Knowl Discov 27(2):193–224

  21. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

  22. Jain P, Meka R, Dhillon IS (2008) Simultaneous unsupervised learning of disparate clusterings. Stat Anal Data Min: ASA Data Sci J 1(3):195–210

  23. Jaskowiak PA, Moulavi D, Furtado AC, Campello RJ, Zimek A, Sander J (2016) On strategies for building effective ensembles of relative clustering validity criteria. Knowl Inf Syst 47(2):329–354

  24. Lei Y, Vinh NX, Chan J, Bailey J (2014) FILTA: better view discovery from collections of clusterings via filtering. In: Machine learning and knowledge discovery in databases. Springer, Berlin, pp 145–160

  25. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval, vol 1. Cambridge University Press, Cambridge

  26. Naldi MC, Carvalho A, Campello RJ (2013) Cluster ensemble selection based on relative validity indexes. Data Min Knowl Discov 27(2):259–289

  27. Nguyen N, Caruana R (2007) Consensus clusterings. In: Seventh IEEE international conference on data mining (ICDM’2007). IEEE, pp 607–612

  28. Nie F, Xu D, Li X (2012) Initialization independent clustering with actively self-training method. IEEE Trans Syst, Man, Cybern, Part B (Cybern) 42(1):17–27

  29. Nie F, Wang X, Huang H (2014) Clustering and projected clustering with adaptive neighbors. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 977–986

  30. Nilsback ME, Zisserman A (2006) A visual vocabulary for flower classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol 2, pp 1447–1454

  31. Niu D, Dy JG, Jordan MI (2014) Iterative discovery of multiple alternative clustering views. IEEE Trans Pattern Anal Mach Intell 36(7):1340–1353

  32. Peng H, Long F, Ding C (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

  33. Phillips JM, Raman P, Venkatasubramanian S (2011) Generating a diverse set of high-quality clusterings. arXiv:1108.0017

  34. Pihur V, Datta S, Datta S (2007) Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 23(13):1607–1615

  35. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  36. Sheng W, Swift S, Zhang L, Liu X (2005) A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Trans Syst, Man, Cybern, Part B (Cybern) 35(6):1156–1167

  37. Strehl A, Ghosh J (2003) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

  38. Topchy A, Jain AK, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881

  39. Vinh NX, Epps J (2010) minCEntropy: a novel information theoretic approach for the generation of alternative clusterings. In: Proceedings of the ICDM, pp 521–530

  40. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of ICML. ACM, pp 1073–1080

  41. Wang L, Nguyen UT, Bezdek JC, Leckie CA, Ramamohanarao K (2010) iVAT and aVAT: enhanced visual analysis for cluster tendency assessment. In: Proceedings of PAKDD, pp 16–27

  42. Wang H, Shan H, Banerjee A (2011) Bayesian cluster ensembles. Stat Anal Data Min 4(1):54–70

  43. Zhang Y, Li T (2011) Extending consensus clustering to explore multiple clustering views. In: Proceedings of the SDM, pp 920–931

Author information

Corresponding author

Correspondence to Yang Lei.

Additional information

Jeffrey Chan conducted part of this work while at the University of Melbourne.

Appendices

Appendix 1: Flower dataset

The flower image dataset [30] consists of 17 species of flowers with 80 images of each. We choose images from four species: Buttercup, Daisy, Windflower and Sunflower (Fig. 28). For each species, we randomly choose 16 images, giving 64 images in total. There are two natural clustering views in this dataset: color (white and yellow) and shape (sharp and round). To reduce background clutter and focus on the flowers, we preprocessed the images by blacking out the background. We scaled these images to \(120\times 120\) pixels and extracted their features in the same way as for the card dataset, so that each image is finally represented by 22 features. We generate 700 base clusterings on this dataset, each with 2 clusters. The two ground truth clustering views are shown in Fig. 29.
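The procedure used to generate the pool of base clusterings is described in the main text and is not restated here. Purely as a rough, hypothetical sketch of how such a pool could be produced, the snippet below runs k-means repeatedly under random feature weightings and seeds; the weighting scheme and all parameter choices are assumptions for illustration, not the authors' generation method.

```python
# Hypothetical sketch only: generating a pool of diverse 2-cluster base clusterings
# by randomly re-weighting features before k-means. The counts (700 clusterings,
# 64 objects, 22 features) mirror the appendix, but the procedure itself is an
# assumption, not the authors' base clustering generation method.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((64, 22))  # stand-in for the 64 flower images x 22 features

def generate_base_clusterings(X, n_clusterings=700, n_clusters=2):
    clusterings = []
    for i in range(n_clusterings):
        weights = rng.random(X.shape[1])  # random feature weighting for diversity
        km = KMeans(n_clusters=n_clusters, n_init=1, random_state=i)
        clusterings.append(km.fit_predict(X * weights))
    return clusterings

base_clusterings = generate_base_clusterings(X)
print(len(base_clusterings), len(base_clusterings[0]))  # 700 clusterings of 64 objects
```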

Fig. 28 Example images of buttercup, sunflower, windflower and daisy flowers, from left to right, from top to bottom

Fig. 29 Two ground truth clustering views on the flower dataset. The first row is the color view and the second row is the shape view (colour figure online)

Fig. 30 Results for the unfiltered base clusterings on the flower dataset. a The iVAT diagram of the 700 unfiltered base clusterings. b The top 4 views obtained from the 700 unfiltered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

We first show the results on the unfiltered base clusterings in Fig. 30. We obtained 62 clustering views from this set of unfiltered base clusterings. The top 4 clustering views are shown in Fig. 30b. The first row is the color view, containing two clusters that represent the two colors, yellow and white. The second row is the shape view, containing the two shapes, sharp and round. The results after filtering with \(L=100,\beta =0.6\) are shown in Fig. 31. As we can see from Fig. 31a, the iVAT diagram contains two clearly separated blocks (meta-clusters) after filtering out the irrelevant clusterings (compare with the unfiltered iVAT diagram in Fig. 30a). The clustering views generated from these two meta-clusters are shown in Fig. 31b, and they are exactly the color and shape views.
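For intuition about the role of the filtering parameters \(L\) and \(\beta\), the sketch below shows a generic greedy quality-versus-redundancy selection of \(L\) base clusterings. Both the quality measure (silhouette) and the combination rule are placeholders chosen for illustration; the actual rFILTA filtering objective is the one defined in the main text.

```python
# Schematic sketch of a quality-diversity filtering step of the general kind used
# in rFILTA: greedily keep L base clusterings, trading an (unsupervised) quality
# score against redundancy with the already-selected set via the parameter beta.
# The scoring functions here are placeholders, not the rFILTA criterion.
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score

def filter_clusterings(X, clusterings, L=100, beta=0.6):
    quality = [silhouette_score(X, c) for c in clusterings]  # placeholder quality
    selected = [max(range(len(clusterings)), key=lambda i: quality[i])]
    while len(selected) < min(L, len(clusterings)):
        best, best_score = None, float("-inf")
        for i in range(len(clusterings)):
            if i in selected:
                continue
            # Redundancy: average AMI with the clusterings already selected.
            redundancy = sum(adjusted_mutual_info_score(clusterings[i], clusterings[j])
                             for j in selected) / len(selected)
            score = beta * quality[i] - (1 - beta) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [clusterings[i] for i in selected]
```

Applied to a pool such as the one sketched in Appendix 1, decreasing \(\beta\) in this placeholder scheme puts more weight on the redundancy penalty and so admits more diverse clusterings, which is consistent with the behaviour described below for \(\beta =0.3\).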

Fig. 31 Results for the 100 filtered base clusterings on the flower dataset. a The iVAT diagram of the 100 filtered base clusterings with \(\beta =0.6\). b The 2 clustering views obtained from the filtered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

To further demonstrate the utility of ranking, we show another set of results in Fig. 32 with \(L=100, \beta =0.3\). When we decrease the tradeoff parameter to \(\beta =0.3\), more diverse clusterings are included. Thus, the iVAT diagram in Fig. 32a is fuzzier and less tidy than the one for the higher value \(\beta =0.6\) in Fig. 31a. We generated 9 clustering views from this filtered set of base clusterings. The top 4 clustering views are shown in Fig. 32b; the first row is the color view and the second row is the shape view. As we decrease the tradeoff parameter \(\beta \), we select more diverse clusterings, which results in more clustering views.

Fig. 32 Results for the filtered base clusterings on the flower dataset. a The iVAT diagram of the 100 filtered base clusterings with \(\beta =0.3\). b The top 4 views generated from the filtered base clusterings; the first row is the color view and the second row is the shape view (colour figure online)

Fig. 33 The MBM scores for two sets of clustering views generated from the unfiltered and filtered base clusterings on the flower dataset

The MBM scores for clustering views generated from the unfiltered base clusterings and the filtered base clusterings with \(\beta =0.3\) are shown in Fig. 33. In summary:

  1. We generated 62 clustering views from the unfiltered base clusterings and 9 clustering views from the filtered base clusterings with \(\beta =0.3\).

  2. The top 2 clustering views from both sets of clusterings recover and match well with the ground truth clustering views.

  3. The ranking function works well, ranking the color and shape views as the top 2.

Appendix 2: Object dataset

The Amsterdam Library of Object Images (ALOI) consists of 110,250 images of 1000 common objects. For each object, a number of photos are taken from different angles and under various lighting conditions. We choose 9 objects with different colors and shapes, for a total of 108 images (Fig. 34). We processed them in the same way as the card dataset and finally extracted 15 features for each image. The two ground truth clustering views on the object dataset are shown in Fig. 35.
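The actual features come from the feature-extraction code referenced in footnote 7. As a hypothetical sketch only of how images can be reduced to a small fixed-length representation (here HOG descriptors followed by PCA down to 15 dimensions, or 22 for the flower dataset), one might proceed as below; this pipeline is an assumption for illustration, not the authors' procedure.

```python
# Hypothetical sketch only: grayscale HOG descriptors reduced by PCA to a small
# number of features per image. Not the authors' feature extraction pipeline.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.decomposition import PCA

def extract_features(images, n_features=15):
    # images: list of RGB arrays; compute one HOG descriptor per image.
    descriptors = []
    for img in images:
        gray = rgb2gray(resize(img, (120, 120)))
        descriptors.append(hog(gray, orientations=9, pixels_per_cell=(16, 16),
                               cells_per_block=(2, 2)))
    # Project the HOG descriptors down to n_features dimensions.
    return PCA(n_components=n_features).fit_transform(np.array(descriptors))
```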

Fig. 34 Example images of the nine selected objects

Fig. 35 Two ground truth clustering views on the object dataset. The first row is the color view and the second row is the shape view (colour figure online)

In this set of experiments, we generate 700 base clusterings, each with 3 clusters. We would like to demonstrate the performance of the filtering and ranking functions in our rFILTA framework. The experimental results on the unfiltered set of base clusterings are shown in Fig. 36. As we can see from the iVAT diagram of the 700 base clusterings in Fig. 36a, there is one big block and two small blocks along the diagonal. We finally generate 3 clustering views, shown in Fig. 36b. The first row is the color view, containing three clusters representing red, green and yellow. We do not find the shape view among the unfiltered base clusterings. Next, we show the results on the filtered set of base clusterings with \(L=100, \beta =0.95\) in Fig. 37. The iVAT diagram now contains multiple clear blocks. The top 4 clustering views are shown in Fig. 37b; the first row is the color view and the fourth row is the shape view.

Fig. 36 Results for the unfiltered base clusterings on the object dataset. a The iVAT diagram of the unfiltered base clusterings. b The top 3 clustering views obtained from the unfiltered base clusterings; the first row is the color view (colour figure online)

Fig. 37 Results for the filtered base clusterings on the object dataset with \(L=100, \beta =0.95\). a The iVAT diagram of the filtered base clusterings on the object dataset. b The top 4 views obtained from the filtered base clusterings; the first row is the color view and the fourth row is the shape view (colour figure online)

Comparing the two sets of results from the unfiltered and filtered base clusterings, we make some observations. Fewer clustering views are generated from the unfiltered base clusterings than from the filtered ones. This may be because many of the generated base clusterings connect different clustering views in the clustering space, so that they appear as one big meta-cluster. After filtering, these connecting base clusterings are removed and the different clustering views become clearly separated. Thus, the iVAT diagram of the unfiltered base clusterings contains only one big dark block and two small blocks, while the iVAT diagram of the filtered base clusterings contains multiple clear blocks. The shape view is not discovered from the unfiltered base clusterings because its meta-cluster is concealed within the big block. After filtering, the shape view is discovered and the quality of the color view increases.
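For readers unfamiliar with iVAT diagrams (refs. [18, 41]), the sketch below shows one way such a diagram over a collection of base clusterings can be computed: pairwise dissimilarities, here assumed to be \(1-\mathrm{AMI}\) in line with footnote 2, are replaced by minimax path distances and reordered with a VAT-style traversal, so that meta-clusters appear as dark blocks along the diagonal. This is an illustrative reimplementation, not the authors' code.

```python
# Illustrative reimplementation of an iVAT matrix over a collection of base
# clusterings (after refs. [18, 41]); not the authors' code. The dissimilarity
# between two clusterings is assumed to be 1 - AMI, consistent with footnote 2.
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def ivat_matrix(clusterings):
    n = len(clusterings)
    # Pairwise dissimilarities between base clusterings.
    D = np.array([[1.0 - adjusted_mutual_info_score(a, b) for b in clusterings]
                  for a in clusterings])
    # iVAT transform: replace each entry by the minimax path distance
    # (Floyd-Warshall-style relaxation with max in place of sum).
    Dp = D.copy()
    for k in range(n):
        Dp = np.minimum(Dp, np.maximum(Dp[:, [k]], Dp[[k], :]))
    # VAT-style ordering: start from a row containing the largest dissimilarity,
    # then repeatedly append the closest not-yet-selected clustering (Prim-like).
    order = [int(np.argmax(D.max(axis=1)))]
    remaining = set(range(n)) - set(order)
    while remaining:
        nxt = min(remaining, key=lambda j: min(D[i, j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return Dp[np.ix_(order, order)]

# Displaying the returned matrix as a grayscale image (e.g. with matplotlib's
# imshow) shows meta-clusters as dark blocks along the diagonal.
```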

The MBM scores for clustering views generated from the unfiltered and filtered base clusterings are shown in Fig. 38. In summary:

  1. We found 3 clustering views from the unfiltered set of base clusterings and 9 clustering views from the filtered base clusterings with \(L=100, \beta = 0.95\).

  2. The MBM scores for the 3 clustering views generated from the unfiltered base clusterings are flat, as only the color view is recovered and its quality does not improve.

  3. The top 4 clustering views returned from the filtered set of base clusterings recover and match well with the two ground truth views, with MBM\((\mathcal {C} _4)=0.9\).

Fig. 38 MBM scores for clustering views generated from the unfiltered and filtered base clusterings on the object dataset

Cite this article

Lei, Y., Vinh, N.X., Chan, J. et al. rFILTA: relevant and nonredundant view discovery from collections of clusterings via filtering and ranking. Knowl Inf Syst 52, 179–219 (2017). https://doi.org/10.1007/s10115-016-1008-y
