poster

Scalable all-pairs similarity search in metric spaces

Authors:
Ye Wang (The Ohio State University, Columbus, Ohio, USA)
Ahmed Metwally (Google Inc., Mountain View, California, USA)
Srinivasan Parthasarathy (The Ohio State University, Columbus, Ohio, USA)

KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2013, Pages 829–837
https://doi.org/10.1145/2487575.2487625

Published: 11 August 2013

ABSTRACT

Given a set of entities, all-pairs similarity search aims to identify all pairs of entities whose similarity is greater than (or whose distance is smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partitioning the input dataset with minimal redundancy to achieve good load balance in the presence of limited computing resources; and iii) an on-the-fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability, and effectiveness of the approach on hundreds of machines using real and synthetic datasets.
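
To make the problem statement concrete, the following is a minimal single-machine sketch of all-pairs search under a distance threshold. It is not the parallel framework proposed in the paper: the function name `all_pairs_within`, the choice of the first point as a pivot, and the sample data are assumptions made only for illustration. The sketch relies on one property that every metric provides, the triangle inequality, to skip pairs that cannot fall under the threshold.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available since Python 3.8


def all_pairs_within(points, threshold, metric=dist):
    """Return index pairs (i, j), i < j, whose distance is below threshold.

    Illustrative baseline only (hypothetical helper, not the paper's method).
    `metric` must be a true metric so that the pruning rule is valid.
    """
    if not points:
        return []
    pivot = points[0]  # assumption: any point can serve as the pivot
    # One metric evaluation per point, reused for every pair below.
    to_pivot = [metric(p, pivot) for p in points]
    result = []
    for i, j in combinations(range(len(points)), 2):
        # Triangle inequality: |d(a, pivot) - d(b, pivot)| <= d(a, b).
        # If this lower bound reaches the threshold, the pair cannot qualify.
        if abs(to_pivot[i] - to_pivot[j]) >= threshold:
            continue
        if metric(points[i], points[j]) < threshold:
            result.append((i, j))
    return result


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 4.0), (3.0, 4.4)]
pairs = all_pairs_within(points, threshold=1.0)
print(pairs)
```

Because the `metric` argument is pluggable, the same routine works for any distance function that satisfies the metric axioms. The number of candidate pairs still grows quadratically with the input size, which is the cost that partitioning and parallel execution are meant to address.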

Index Terms

Scalable all-pairs similarity search in metric spaces
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2013
1534 pages
ISBN: 9781450321747
DOI: 10.1145/2487575
Editors:
Rayid Ghani (University of Chicago)
Ted E. Senator (SAIC)
Paul Bradley (MethodCare, Inc.)
Rajesh Parekh (Groupon)
Jingrui He (Stevens Institute of Technology)
General Chairs:
Robert L. Grossman (University of Chicago and Open Data Group)
Ramasamy Uthurusamy (General Motors Corporation, retired)
Program Chairs:
Inderjit S. Dhillon (University of Texas)
Yehuda Koren (Google)
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
all-pairs similarity search
similarity joins
Qualifiers
- poster
Acceptance Rates
KDD '13 Paper Acceptance Rate: 125 of 726 submissions, 17%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%