ABSTRACT
Outlier detection and ensemble learning are well-established research directions in data mining, yet the application of ensemble techniques to outlier detection has rarely been studied. Here, we propose and study subsampling as a technique to induce diversity among individual outlier detectors. We show analytically and experimentally that an outlier detector based on a subsample can, besides inducing diversity, under certain conditions already improve upon the results of the same outlier detector on the complete dataset. Building an ensemble on top of several subsamples improves the results further. While the literature so far has merely transferred the intuition that ensembles improve over single outlier detectors from the classification literature, here we also justify analytically why ensembles can be expected to work in the unsupervised setting of outlier detection. As a side effect, running an ensemble of several outlier detectors on subsamples of the dataset is more efficient than ensembles based on other means of introducing diversity and, depending on the sample rate and the size of the ensemble, can be even more efficient than a single outlier detector on the complete data.
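The general idea of a subsampling ensemble can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the base detector (distance to the k-th nearest neighbor), the averaging of scores, the function name `subsampling_ensemble_scores`, and the default values for `n_members`, `sample_rate`, and `k` are all assumptions chosen for illustration. Each ensemble member draws a subsample without replacement and scores every point of the full dataset against that subsample only.

```python
import numpy as np


def subsampling_ensemble_scores(data, n_members=25, sample_rate=0.1, k=5, seed=0):
    """Average k-NN-distance outlier scores over several random subsamples.

    Hypothetical helper for illustration; higher scores mean "more outlying".
    """
    data = np.asarray(data, dtype=float)
    n = len(data)
    if n <= k:
        raise ValueError("need more than k points")
    rng = np.random.default_rng(seed)
    # Each subsample must hold at least k neighbors besides the point itself.
    sample_size = min(n, max(k + 1, int(round(sample_rate * n))))

    total = np.zeros(n)
    for _ in range(n_members):
        # Draw one subsample without replacement.
        idx = rng.choice(n, size=sample_size, replace=False)
        # Euclidean distances from every point to every subsample member.
        diffs = data[:, None, :] - data[idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # A point inside the subsample must not count itself as a neighbor.
        dists[idx, np.arange(sample_size)] = np.inf
        # Score = distance to the k-th nearest neighbor within the subsample.
        total += np.partition(dists, k - 1, axis=1)[:, k - 1]

    # Combine the members by simple averaging (one of several possible choices).
    return total / n_members
```

Calling `subsampling_ensemble_scores(X)` on an `(n, d)` array returns one score per row. In this sketch each member computes `n * sample_size` distances instead of `n * n`, which is where the efficiency argument comes from: with a sample rate of 0.1, ten members together cost roughly as much as one detector on the complete data.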
Index Terms
- Subsampling for efficient and effective unsupervised outlier detection ensembles