Subsampling for efficient and effective unsupervised outlier detection ensembles

Authors:
Arthur Zimek, Matthew Gaudet, Ricardo J.G.B. Campello, Jörg Sander

University of Alberta, Edmonton, Alberta, Canada

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2013, Pages 428–436
https://doi.org/10.1145/2487575.2487676

Published: 11 August 2013

ABSTRACT

Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample can, besides inducing diversity, under certain conditions already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples further improves the results. While the literature so far has merely transferred the intuition that ensembles improve over single outlier detectors from the classification literature, here we also justify analytically why ensembles are expected to work in the unsupervised area of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than a single outlier detector on the complete data.
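
The idea described in the abstract can be illustrated with a minimal sketch. The abstract does not fix a base detector, a combination rule, or parameter values, so everything concrete here is an assumption made for illustration: the k-nearest-neighbor distance as the outlier score, averaging as the combination rule, and the hypothetical names `knn_distance_scores` and `subsampling_ensemble` with their defaults. Each ensemble member draws a subsample without replacement and scores every point of the full dataset against that subsample only.

```python
import numpy as np


def knn_distance_scores(data, sample_idx, k):
    """Score every point by the distance to its k-th nearest neighbor
    among the subsampled points (assumed base detector, for illustration)."""
    reference = data[sample_idx]
    # Pairwise Euclidean distances, shape (n_points, n_sample).
    diffs = data[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # A point drawn into the subsample must not count as its own neighbor.
    dists[sample_idx, np.arange(len(sample_idx))] = np.inf
    # k-th smallest distance in each row.
    return np.partition(dists, k - 1, axis=1)[:, k - 1]


def subsampling_ensemble(data, n_members=25, sample_fraction=0.1, k=5, seed=0):
    """Average the scores of several detectors, each built on its own subsample."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Keep at least k + 1 reference points so a k-th neighbor always exists.
    m = max(k + 1, int(round(sample_fraction * n)))
    scores = np.zeros(n)
    for _ in range(n_members):
        sample_idx = rng.choice(n, size=m, replace=False)
        scores += knn_distance_scores(data, sample_idx, k)
    return scores / n_members


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    inliers = rng.normal(0.0, 1.0, size=(300, 2))
    outlier = np.array([[8.0, 8.0]])  # synthetic point far from the cluster
    data = np.vstack([inliers, outlier])
    scores = subsampling_ensemble(data)
    print("index of highest score:", int(np.argmax(scores)))
```

On the efficiency remark: with a naive pairwise distance computation, one detector on all n points needs on the order of n² distance evaluations, while an ensemble of t members at sample rate r needs about t · r · n². Under that assumption the ensemble is cheaper than the single full-data detector whenever t · r < 1, and more expensive otherwise (the defaults above give t · r = 2.5).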


Index Terms

Information systems → Information systems applications → Data mining

Published in
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575
Editors: Rayid Ghani, University of Chicago; Ted E. Senator, SAIC; Paul Bradley, MethodCare, Inc.; Rajesh Parekh, Groupon; Jingrui He, Stevens Institute of Technology
General Chairs: Robert L. Grossman, University of Chicago and Open Data Group; Ramasamy Uthurusamy, General Motors Corporation (retired)
Program Chairs: Inderjit S. Dhillon, University of Texas; Yehuda Koren, Google
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
ensemble
outlier detection
Qualifiers
- research-article
Acceptance Rates
KDD '13 Paper Acceptance Rate: 125 of 726 submissions, 17%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%