Efficient discovery of contrast subspaces for object explanation and characterization

  • Regular Paper
Knowledge and Information Systems

Abstract

We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes \(C_+\) and \(C_-\) and a query object \(o\), we want to find the top-\(k\) subspaces that maximize the ratio of likelihood of \(o\) in \(C_+\) against that in \(C_-\). Such subspaces are very useful for characterizing an object and explaining how it differs between two classes. We demonstrate that this problem has important applications, and, at the same time, is very challenging, being MAX SNP-hard. We present CSMiner, a mining method that uses kernel density estimation in conjunction with various pruning techniques. We experimentally investigate the performance of CSMiner on a range of data sets, evaluating its efficiency, effectiveness, and stability and demonstrating it is substantially faster than a baseline method.
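The problem statement above can be illustrated with a brute-force sketch. This is not CSMiner and uses none of its pruning techniques: it simply scores every non-empty subspace by the ratio of two kernel density estimates and keeps the top \(k\). The function names, the single fixed `bandwidth`, the Gaussian product kernel, and the small `eps` guard against division by zero are all assumptions made for illustration; the preview does not state which kernel or bandwidth rule the paper uses.

```python
import itertools
import math


def kde_likelihood(query, points, dims, bandwidth):
    """Gaussian-kernel density estimate at `query`, restricted to `dims`."""
    d = len(dims)
    # Normalising constant of a d-dimensional Gaussian kernel, times sample size.
    norm = len(points) * (math.sqrt(2.0 * math.pi) * bandwidth) ** d
    total = 0.0
    for p in points:
        sq_dist = sum((query[i] - p[i]) ** 2 for i in dims)
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / norm


def top_k_contrast_subspaces(query, pos, neg, k, bandwidth=1.0, eps=1e-12):
    """Exhaustively score all non-empty subspaces; return the k best.

    Each result is a pair (likelihood ratio, tuple of dimension indices).
    The cost is exponential in the number of dimensions (assumed baseline).
    """
    n_dims = len(query)
    scored = []
    for size in range(1, n_dims + 1):
        for dims in itertools.combinations(range(n_dims), size):
            l_pos = kde_likelihood(query, pos, dims, bandwidth)
            l_neg = kde_likelihood(query, neg, dims, bandwidth)
            # eps is a hypothetical guard so an empty-looking C_- cannot divide by zero.
            scored.append((l_pos / (l_neg + eps), dims))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data: the classes differ in dimension 0 and overlap in dimension 1.
    c_plus = [(0.0, 0.0), (0.2, 1.0), (-0.1, -1.0), (0.1, 0.5)]
    c_minus = [(3.0, 0.1), (3.2, 0.9), (2.9, -1.1), (3.1, 0.4)]
    for ratio, dims in top_k_contrast_subspaces((0.0, 0.0), c_plus, c_minus, k=2):
        print(dims, round(ratio, 2))
```

On the toy data, the subspaces that include dimension 0 should rank highest, since that is the only dimension in which the query sits close to \(C_+\) and far from \(C_-\). Enumerating all \(2^d - 1\) subspaces is only feasible for small \(d\), which is why a practical method needs pruning.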

Notes

  1. While [8] presented a contrast-pattern-length-based algorithm to detect global outliers, their problem setting is different from ours.

  2. Generally, given a set of observations \(Q\), the plausibility of two models \(M_1\) and \(M_2\) can be assessed by the Bayes factor \(K=\frac{Pr(Q\mid M_1)}{Pr(Q \mid M_2)}\).

  3. If it is not unimodal, then there could be multiple peaks at different distances from the query, which is counter to intuition. Similarly, we have no basis for preferring any direction over another, so symmetry is natural.
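One way to connect Note 2 to the problem in the abstract is sketched below. It is an interpretation, not a formula quoted from the paper: take \(Q=\{o\}\) and let \(M_1\) and \(M_2\) be density models estimated from \(C_+\) and \(C_-\) in a subspace \(S\). The symbols \(S\), \(o^S\) (the projection of \(o\) onto \(S\)), and \(\hat{f}\) (a density estimate) are introduced here for illustration only.

```latex
K_S(o) \;=\; \frac{Pr(o^S \mid C_+)}{Pr(o^S \mid C_-)}
\;\approx\; \frac{\hat{f}_{C_+}(o^S)}{\hat{f}_{C_-}(o^S)}
```

Under this reading, a value of \(K_S(o)\) above 1 means the observation is better explained by \(C_+\) than by \(C_-\) in subspace \(S\), and the top-\(k\) subspaces are those with the largest ratio.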

References

  1. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. ACM Sigmod Rec 30:37–46

  2. Bache K, Lichman M (2013) UCI machine learning repository

  3. Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246

  4. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proc. of the 7th Int’l Conf on Database Theory, pp 217–235

  5. Böhm K, Keller F, Müller E, Nguyen HV, Vreeken J (2013) CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: Proc. of the 13th SIAM Int’l Conf on Data Min, pp 198–206

  6. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proc. of the 2000 ACM SIGMOD Int’l Conf on Manag of Data, pp 93–104

  7. Cai Y, Zhao HK, Han H, Lau RYK, Leung HF, Min H (2012) Answering typicality query based on automatically prototype construction. In: Proc. of the 2012 IEEE/WIC/ACM Int’l Joint Conf Web Intell Intell Agent Technol, 01:362–366

  8. Chen L, Dong G (2006) Masquerader detection using OCLEP: one class classification using length statistics of emerging patterns. In: Proc. of the Int’l Workshop on Information Processing over Evolving Networks (WINPEN), p 5

  9. Dong G, Bailey J (eds) (2013) Contrast data mining: concepts, algorithms, and applications. CRC Press, Boca Raton

  10. Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proc. of the 5th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 43–52

  11. Duan L, Tang G, Pei J, Bailey J, Dong G, Campbell A, Tang C (2014) Mining contrast subspaces. In: Proc. of the 18th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 249–260

  12. Fagin R, Kumar R, Sivakumar D (2003) Comparing top k lists. In: Proc. of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 28–36

  13. He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118

  14. Hua M, Pei J, Fu AW, Lin X, Leung HF (2009) Top-k typicality queries and efficient query answering methods on large databases. VLDB J 18(3):809–835

  15. Jeffreys H (1961) The theory of probability, 3rd edn. Oxford

  16. Keller F, Müller E, Böhm K (2012) HiCS: high contrast subspaces for density-based outlier ranking. In: Proc. of the IEEE 28th Int’l Conf on Data Engineering, pp 1037–1048

  17. Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proc. of the 14th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 444–452

  18. Kriegel HP, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proc. of the 13th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 831–838

  19. Novak PK, Lavrac N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403

  20. Papadimitriou CH, Yannakakis M (1991) Optimization, approximation, and complexity classes. J Comput Syst Sci 43(3):425–440

  21. Rymon R (1992) Search through systematic set enumeration. In: Proc. of the 3rd Int’l Conf on Principles of Knowledge Representation and Reasoning, pp 539–550

  22. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall/CRC, London

  23. Wang L, Zhao H, Dong G, Li J (2005) On the complexity of finding emerging patterns. Theor Comput Sci 335(1):15–27

  24. Webber W, Moffat A, Zobel J (2010) A similarity measure for indefinite rankings. ACM Trans Inf Syst 28(4):20:1–20:38

  25. Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proc. of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, pp 78–87

  26. Wu S, Crestani F (2003) Methods for ranking information retrieval systems without relevance judgments. In: Proc. of the 2003 ACM Symposium on Applied Computing. ACM, New York, NY, USA, pp 811–816

Acknowledgments

The authors are grateful to the editor and the anonymous reviewers for their constructive comments, which helped to improve this paper. Lei Duan’s research was supported in part by the National Natural Science Foundation of China (Grant No. 61103042), the China Postdoctoral Science Foundation (Grant No. 2014M552371), and SRFDP 20100181120029. Jian Pei’s and Guanting Tang’s research was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey’s work was supported by an ARC Future Fellowship (FT110100112). Work by Lei Duan and Guozhu Dong at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Author information

Corresponding author

Correspondence to Lei Duan.

About this article

Cite this article

Duan, L., Tang, G., Pei, J. et al. Efficient discovery of contrast subspaces for object explanation and characterization. Knowl Inf Syst 47, 99–129 (2016). https://doi.org/10.1007/s10115-015-0835-6

  • DOI: https://doi.org/10.1007/s10115-015-0835-6
