DOI: 10.1145/1242572.1242591

Scaling up all pairs similarity search

Published: 08 May 2007

ABSTRACT

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
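To make the problem statement concrete, the following is a minimal sketch of an exact all-pairs search over sparse vectors using a plain inverted index. It is a baseline illustration only, not the optimized algorithm the paper proposes. The names `normalize` and `all_pairs`, the dict-based sparse vector representation, and the choice to report pairs whose cosine similarity is at or above the threshold are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm else {}


def all_pairs(vectors, threshold):
    """Return {(i, j): similarity} for every i > j with cosine >= threshold.

    Hypothetical baseline: every vector is fully indexed, with none of the
    pruning a tuned implementation would apply.
    """
    index = defaultdict(list)  # dimension -> [(vector id, weight), ...]
    results = {}
    for i, raw in enumerate(vectors):
        vec = normalize(raw)
        scores = defaultdict(float)
        # Accumulate dot products against previously indexed vectors only,
        # so each unordered pair is scored exactly once.
        for dim, w in vec.items():
            for j, w_j in index[dim]:
                scores[j] += w * w_j
        for j, s in scores.items():
            if s >= threshold:
                results[(i, j)] = s
        # Index the current vector after it has been compared.
        for dim, w in vec.items():
            index[dim].append((i, w))
    return results


docs = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.1},
    {"d": 1.0},
]
pairs = all_pairs(docs, 0.9)  # only vectors 1 and 0 share enough weight
```

Because only vectors that share at least one nonzero dimension ever meet in a posting list, the sketch avoids scoring pairs with zero similarity, and its results are exact rather than approximate.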

Published in

WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN: 9781595936547
DOI: 10.1145/1242572

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
