ABSTRACT
Given a large collection of sparse vectors in a high-dimensional space, we investigate the problem of finding all pairs of vectors whose similarity score (as determined by a function such as cosine similarity) is above a given threshold. We propose a simple algorithm based on novel indexing and optimization strategies that solves this problem without relying on approximation methods or extensive parameter tuning. We show that the approach efficiently handles a variety of datasets across a wide range of similarity thresholds, with large speedups over previous state-of-the-art approaches.
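To make the problem statement concrete, the following is a minimal sketch of an exact all-pairs search using a plain inverted index. It is a generic baseline, not the optimized algorithm the paper proposes. The function names (`normalize`, `all_pairs`), the dict-based sparse vector representation, and the choice to report pairs scoring at or above the threshold are assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def normalize(vec):
    """Scale a sparse vector (dict: feature -> weight) to unit length."""
    norm = sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def all_pairs(vectors, threshold):
    """Return (j, i, score) for every pair j < i with cosine >= threshold.

    Baseline sketch: each vector is matched against an inverted index of
    the vectors seen so far, then added to the index itself. Only vectors
    sharing at least one feature are ever compared, so no approximation
    is involved.
    """
    index = defaultdict(list)  # feature -> list of (vector id, weight)
    results = []
    for i, raw in enumerate(vectors):
        vec = normalize(raw)
        # Accumulate dot products with earlier vectors, one feature at a time.
        scores = defaultdict(float)
        for feature, weight in vec.items():
            for j, other_weight in index[feature]:
                scores[j] += weight * other_weight
        for j, score in scores.items():
            if score >= threshold:
                results.append((j, i, score))
        # Index the current vector so later vectors can match against it.
        for feature, weight in vec.items():
            index[feature].append((i, weight))
    return results


# Hypothetical toy data: the first two vectors are near-duplicates.
docs = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0, "c": 0.1}, {"x": 1.0}]
similar = all_pairs(docs, threshold=0.9)
```

Because the vectors are normalized to unit length, the accumulated dot product equals the cosine similarity. The baseline indexes every feature of every vector, so its cost grows with the length of the inverted lists it scans.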
Index Terms
- Scaling up all pairs similarity search