ABSTRACT
Given a set of entities, all-pairs similarity search aims to identify all pairs of entities whose similarity is greater than (or whose distance is smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partitioning the input dataset with minimal redundancy, to achieve good load balance in the presence of limited computing resources; iii) an on-the-fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability, and effectiveness of the approach on hundreds of machines using real and synthetic datasets.
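To make the problem statement concrete, the following is a minimal brute-force sketch of all-pairs similarity search under a distance threshold. It is not the parallel framework proposed in the article; it only illustrates the definition. The function name `all_pairs_within`, the `manhattan` helper, and the sample points are hypothetical, and the metric is passed in as a parameter to mirror the idea of supporting multiple metrics.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def manhattan(p: Point, q: Point) -> float:
    """L1 distance, used here as a second example metric."""
    return sum(abs(a - b) for a, b in zip(p, q))


def all_pairs_within(
    points: Sequence[Point],
    threshold: float,
    metric: Callable[[Point, Point], float] = dist,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose distance is below the threshold.

    This compares every pair once, so it performs n * (n - 1) / 2
    distance computations. It is a baseline for illustration only.
    """
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if metric(p, q) < threshold
    ]


# Hypothetical sample data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
close_euclidean = all_pairs_within(points, threshold=1.5)
close_manhattan = all_pairs_within(points, threshold=10.0, metric=manhattan)
```

The quadratic number of comparisons in this baseline is what motivates partitioning the input across machines, as the abstract describes.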
Index Terms
- Scalable all-pairs similarity search in metric spaces