Abstract
In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.
- AAR96 A. Arning, Rakesh Agrawal, and R Raghavan. A linear method for deviation detection in large databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Portland, Oregon, August 1996.Google Scholar
- AMS+95 Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules, chapter 14. 1995.Google Scholar
- BKNS00 Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander. Lof:indetifying density-based local outliers. In Proc. of the ACM SIGMOD Conference on Management of Data, May 2000. Google ScholarDigital Library
- BKSS90 N. Beckmann, H.-R Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of ACM SIGMOD, pages 322-331, Atlantic City, NJ, May 1990. Google ScholarDigital Library
- BL94 V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.Google Scholar
- EKX95 Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google Scholar
- GRS98 Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1998. Google ScholarDigital Library
- JD88 Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarDigital Library
- KN98 Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392-403, New York, USA, September 1998. Google ScholarDigital Library
- KN99 Edwin Knorr and Raymond Ng. Finding intensional knowledge of distance-based outliers. In Proc. of the VLDB Conference, pages 211-222, Edinburgh, UK, September 1999. Google ScholarDigital Library
- NH94 Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarDigital Library
- RKV95 N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM SIGMOD, pages 71-79, San Jose, CA, 1995. Google ScholarDigital Library
- RRS98 Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. Technical report, Bell Laboratories, Murray Hill, 1998.Google Scholar
- RS98 Rajeev Rastogi and Kyuseok Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. of the Int'l Conf. on Vet7 Large Data Bases, New York, 1998. Google ScholarDigital Library
- Sam89 H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989. Google ScholarDigital Library
- SAM98 S. Sarawagi, R. Agrawal, and N. Megiddo. Discoverydriven exploration of olap data cubes. In Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998. Google ScholarDigital Library
- ZRL96 Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarDigital Library
Index Terms
Efficient algorithms for mining outliers from large data sets
Recommendations
Efficient algorithms for mining outliers from large data sets
SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of dataIn this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in ...
Mining top-n local outliers in large databases
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data miningOutlier detection is an important task in data mining with numerous applications, including credit card fraud detection, video surveillance, etc. A recent work on outlier detection has introduced a novel notion of local outlier in which the degree to ...
Distributed Strategies for Mining Outliers in Large Data Sets
We introduce a distributed method for detecting distance-based outliers in very large data sets. Our approach is based on the concept of outlier detection solving set [2], which is a small subset of the data set that can be also employed for predicting ...
Comments