skip to main content
article
Free Access

Efficient algorithms for mining outliers from large data sets

Authors Info & Claims
Published:16 May 2000Publication History
Skip Abstract Section

Abstract

In this paper, we propose a novel formulation for distance-based outliers that is based on the distance of a point from its kth nearest neighbor. We rank each point on the basis of its distance to its kth nearest neighbor and declare the top n points in this ranking to be outliers. In addition to developing relatively straightforward solutions to finding such outliers based on the classical nested-loop join and index join algorithms, we develop a highly efficient partition-based algorithm for mining outliers. This algorithm first partitions the input data set into disjoint subsets, and then prunes entire partitions as soon as it is determined that they cannot contain outliers. This results in substantial savings in computation. We present the results of an extensive experimental study on real-life and synthetic data sets. The results from a real-life NBA database highlight and reveal several expected and unexpected aspects of the database. The results from a study on synthetic data sets demonstrate that the partition-based algorithm scales well with respect to both data set size and data set dimensionality.

References

  1. AAR96 A. Arning, Rakesh Agrawal, and R Raghavan. A linear method for deviation detection in large databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Portland, Oregon, August 1996.Google ScholarGoogle Scholar
  2. AMS+95 Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast Discovery of Association Rules, chapter 14. 1995.Google ScholarGoogle Scholar
  3. BKNS00 Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jorg Sander. Lof:indetifying density-based local outliers. In Proc. of the ACM SIGMOD Conference on Management of Data, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BKSS90 N. Beckmann, H.-R Kriegel, R. Schneider, and B. Seeger. The R*-tree: an efficient and robust access method for points and rectangles. In Proc. of ACM SIGMOD, pages 322-331, Atlantic City, NJ, May 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. BL94 V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, New York, 1994.Google ScholarGoogle Scholar
  6. EKX95 Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. A database interface for clustering in large spatial databases. In Int'l Conference on Knowledge Discovery in Databases and Data Mining (KDD-95), Montreal, Canada, August 1995.Google ScholarGoogle Scholar
  7. GRS98 Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. Cure: An efficient clustering algorithm for large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, June 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. JD88 Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, New Jersey, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. KN98 Edwin Knorr and Raymond Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. of the VLDB Conference, pages 392-403, New York, USA, September 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. KN99 Edwin Knorr and Raymond Ng. Finding intensional knowledge of distance-based outliers. In Proc. of the VLDB Conference, pages 211-222, Edinburgh, UK, September 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. NH94 Raymond T. Ng and Jiawei Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the VLDB Conference, Santiago, Chile, September 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. RKV95 N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. of ACM SIGMOD, pages 71-79, San Jose, CA, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. RRS98 Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. Efficient algorithms for mining outliers from large data sets. Technical report, Bell Laboratories, Murray Hill, 1998.Google ScholarGoogle Scholar
  14. RS98 Rajeev Rastogi and Kyuseok Shim. Public: A decision tree classifier that integrates building and pruning. In Proc. of the Int'l Conf. on Vet7 Large Data Bases, New York, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Sam89 H. Samet. The Design and Analysis of Spatial Data Structures. Addison-Wesley, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. SAM98 S. Sarawagi, R. Agrawal, and N. Megiddo. Discoverydriven exploration of olap data cubes. In Proc. of the Sixth Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. ZRL96 Tian Zhang, Raghu Ramakrishnan, and Miron Livny. Birch: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 103-114, Montreal, Canada, June 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient algorithms for mining outliers from large data sets

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM SIGMOD Record
          ACM SIGMOD Record  Volume 29, Issue 2
          June 2000
          609 pages
          ISSN:0163-5808
          DOI:10.1145/335191
          Issue’s Table of Contents
          • cover image ACM Conferences
            SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data
            May 2000
            604 pages
            ISBN:1581132174
            DOI:10.1145/342009

          Copyright © 2000 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 16 May 2000

          Check for updates

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader