Article

Scaling up all pairs similarity search

Authors:
Roberto J. Bayardo, Google, Inc., Mountain View, CA
Yiming Ma, University of California, Irvine, Irvine, CA
Ramakrishnan Srikant, Google, Inc., Mountain View, CA

WWW '07: Proceedings of the 16th international conference on World Wide Web, May 2007, Pages 131–140
https://doi.org/10.1145/1242572.1242591

Published: 08 May 2007

ABSTRACT

Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine distance) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show the approach efficiently handles a variety of datasets across a wide setting of similarity thresholds, with large speedups over previous state-of-the-art approaches.
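
To make the problem statement concrete, the following is a minimal sketch of an exact all-pairs search over sparse vectors using a plain inverted index. It is a naive baseline, not the optimized algorithm the paper proposes (the abstract does not describe those indexing and optimization strategies). The function names `normalize` and `all_pairs`, the dict-based sparse vector representation, and the choice to treat "above" as "at or above" the threshold are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse vector (dict: dimension -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {d: w / norm for d, w in vec.items()} if norm > 0 else {}


def all_pairs(vectors, threshold):
    """Return {(i, j): cosine} for every pair i < j with cosine >= threshold.

    Exact baseline (hypothetical helper, not the paper's method): the inverted
    index is built incrementally, and each new vector is scored only against
    earlier vectors that share at least one nonzero dimension with it.
    """
    index = defaultdict(list)  # dimension -> [(vector id, weight), ...]
    results = {}
    for j, raw in enumerate(vectors):
        vec = normalize(raw)
        scores = defaultdict(float)  # candidate id -> accumulated dot product
        for dim, w in vec.items():
            for i, w_i in index[dim]:
                scores[i] += w * w_i
        for i, s in scores.items():
            if s >= threshold:
                results[(i, j)] = s
        # Index the vector only after scoring, so each pair is seen once.
        for dim, w in vec.items():
            index[dim].append((j, w))
    return results


docs = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0, "c": 0.1},
    {"x": 1.0},
]
pairs = all_pairs(docs, threshold=0.9)
```

Because the vectors are normalized to unit length, the accumulated dot product equals the cosine similarity, so no approximation is involved. The cost of this baseline grows with the length of the inverted lists, which is the kind of work an optimized exact method would aim to reduce.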

Index Terms

Scaling up all pairs similarity search
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)
Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
similarity join
similarity search
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

Article Metrics
- Total Citations: 505
- Total Downloads: 2,306
- Downloads (Last 12 months): 105
- Downloads (Last 6 weeks): 13