DOI: 10.1145/2487575.2487676
research-article

Subsampling for efficient and effective unsupervised outlier detection ensembles

Published: 11 August 2013

ABSTRACT

Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample can, besides inducing diversity, under certain conditions already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples improves the results further. So far, the literature has merely transferred the intuition that ensembles improve over single outlier detectors from the classification literature; here, we also justify analytically why ensembles can be expected to work in the unsupervised setting of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than a single outlier detector on the complete data.
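
The abstract does not fix a base detector, a sample rate, or a combination rule, so the following is only a minimal sketch under assumed choices: the k-nearest-neighbor distance serves as the outlier score, each ensemble member draws a subsample without replacement, every point of the full dataset is scored against that subsample, and the member scores are averaged. The function names `subsample_ensemble_scores` and `relative_cost`, all parameter defaults, and the brute-force cost model are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def subsample_ensemble_scores(X, n_members=20, sample_rate=0.1, k=3, seed=0):
    """Average k-NN-distance outlier scores over several random subsamples.

    Assumed setup (not taken from the paper): k-NN distance as the base
    detector and the mean as the combination function.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Subsample size; keep at least k + 1 points so that a sampled point
    # still has k neighbors after its self-distance is masked out.
    m = min(n, max(k + 1, int(round(sample_rate * n))))
    total = np.zeros(n)
    for _ in range(n_members):
        idx = rng.choice(n, size=m, replace=False)  # subsample without replacement
        # Distances from every point in X to every subsample point: shape (n, m).
        diff = X[:, None, :] - X[idx][None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        # A sampled point must not count itself as its own neighbor.
        dist[idx, np.arange(m)] = np.inf
        # k-th smallest distance per row is this member's outlier score.
        total += np.partition(dist, k - 1, axis=1)[:, k - 1]
    return total / n_members


def relative_cost(n_members, sample_rate):
    """Distance computations of the ensemble relative to one full run.

    Assumes brute-force neighbor search: one member needs about
    n * (sample_rate * n) distances, a full run about n * n.
    """
    return n_members * sample_rate


# Toy data: 200 Gaussian inliers plus one planted outlier at index 200.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(size=(200, 2)), [[10.0, 10.0]]])
scores = subsample_ensemble_scores(X)
print("highest-scoring index:", int(np.argmax(scores)))
print("relative cost of 5 members at rate 0.1:", relative_cost(5, 0.1))
```

Under this simplified cost model, the ensemble is cheaper than a single full run whenever the product of ensemble size and sample rate is below 1, which illustrates the efficiency remark at the end of the abstract. Detectors with index support or other neighbor-search strategies would change the arithmetic.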


Published in

KDD '13: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2013
1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575

Copyright © 2013 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '13 paper acceptance rate: 125 of 726 submissions, 17%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
