DOI: 10.1145/2487575.2487625
poster

Scalable all-pairs similarity search in metric spaces

Published: 11 August 2013

ABSTRACT

Given a set of entities, all-pairs similarity search aims to identify all pairs of entities whose similarity is greater than (or whose distance is smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partitioning the input dataset with minimal redundancy to achieve good load balance in the presence of limited computing resources; iii) an on-the-fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability, and effectiveness of the approach on hundreds of machines using real and synthetic datasets.
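
To make the problem statement concrete, the following is a minimal single-machine sketch of all-pairs similarity search under a distance threshold. It is not the parallel framework proposed in the paper; the function names, the use of Euclidean distance as the default metric, and the single-pivot pruning step are all assumptions chosen for illustration. The pruning relies only on the triangle inequality, which holds in any metric space: if two points differ in their distance to a common pivot by at least the threshold, they cannot be within the threshold of each other.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+), used as an example metric


def all_pairs_brute_force(points, threshold, metric=dist):
    """Return index pairs (i, j), i < j, whose distance is below the threshold."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if metric(p, q) < threshold
    ]


def all_pairs_with_pivot(points, threshold, metric=dist, pivot_index=0):
    """Same result as the brute-force version, with triangle-inequality pruning.

    Hypothetical illustration: for any metric d and pivot p,
    |d(p, x) - d(p, y)| <= d(x, y), so a large gap in pivot distances
    rules a pair out without evaluating d(x, y).
    """
    pivot = points[pivot_index]
    to_pivot = [metric(pivot, p) for p in points]
    # Visit points in order of increasing distance to the pivot.
    order = sorted(range(len(points)), key=to_pivot.__getitem__)
    result = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if to_pivot[j] - to_pivot[i] >= threshold:
                break  # every later point is at least this far from i
            if metric(points[i], points[j]) < threshold:
                result.append((min(i, j), max(i, j)))
    return sorted(result)


points = [(0, 0), (1, 0), (5, 5), (5, 6), (10, 10)]
pairs = all_pairs_with_pivot(points, threshold=1.5)  # [(0, 1), (2, 3)]
```

Any function that satisfies the metric axioms can be passed as `metric`; the pruning step is only valid when the triangle inequality holds.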

    • Published in

      KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
      August 2013
      1534 pages
      ISBN: 9781450321747
      DOI: 10.1145/2487575

      Copyright © 2013 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      KDD '13 Paper Acceptance Rate: 125 of 726 submissions, 17%
      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
