skip to main content
article

Streaming similarity search over one billion tweets using parallel locality-sensitive hashing

Published:01 September 2013Publication History
Skip Abstract Section

Abstract

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high dimensional data (where spatial indexes like kd-trees do not perform well) is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH) designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and which supports high-throughput streaming of new data. Our approach employs several novel ideas, including: cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and efficient data expiration algorithm for streaming data; and a performance model that accurately estimates performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of > 1 Billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.

References

  1. E2LSH. http://www.mit.edu/~andoni/LSH/.Google ScholarGoogle Scholar
  2. LikeLike. http://code.google.com/p/likelike/.Google ScholarGoogle Scholar
  3. LSH-Hadoop. https://github.com/LanceNorskog/LSH-Hadoop.Google ScholarGoogle Scholar
  4. LSHKIT. http://lshkit.sourceforge.net.Google ScholarGoogle Scholar
  5. OptimalLSH. https://github.com/yahoo/Optimal-LSH.Google ScholarGoogle Scholar
  6. Twitter breaks 400 million tweet-per-day barrier, sees increasing mobile revenue. http://bit.ly/MmXObG.Google ScholarGoogle Scholar
  7. A. Andoni and P. Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of SODA, pages 1203-1212, 2006. Google ScholarGoogle Scholar
  8. A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. CACM, 58, 2008. Google ScholarGoogle Scholar
  9. N. Askitis and J. Zobel. Cache-conscious collision resolution in string hash tables. In SPIRE, pages 92-104, 2005. Google ScholarGoogle Scholar
  10. B. Bahmani, A. Goel, and R. Shinde. Efficient distributed locality sensitive hashing. In CIKM, pages 2174-2178. ACM, 2012. Google ScholarGoogle Scholar
  11. J. L. Bentley. Multidimensional binary search trees used for associative searching. CACM, 18(9):509-517, 1975. Google ScholarGoogle Scholar
  12. C. Böhm, S. Berchtold, and D. A. Keim. Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases. ACM Computing Surveys, 33(3):322-373, Sept. 2001. Google ScholarGoogle Scholar
  13. M. Charikar. Similarity estimation techniques from rounding. In Proceedings of STOC, pages 380-388, 2002. Google ScholarGoogle Scholar
  14. L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In Proceedings of SIGMOD, 2005. Google ScholarGoogle Scholar
  15. A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, 2007. Google ScholarGoogle Scholar
  16. A. Dasgupta, R. Kumar, and T. Sarlós. Fast locality-sensitive hashing. In SIGKDD, pages 1073-1081. ACM, 2011. Google ScholarGoogle Scholar
  17. I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices. Oxford University Press, Inc., 1986. Google ScholarGoogle Scholar
  18. P. C.-M. K. A. P. Haghani. Lsh at large - distributed knn search in high dimensions. In WebDB, 2008.Google ScholarGoogle Scholar
  19. M. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, 2006. Google ScholarGoogle Scholar
  20. P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. In Proceedings of STOC, 1998. Google ScholarGoogle Scholar
  21. C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, et al. Sort vs. hash revisited: Fast join implementation on multi-core cpus. PVLDB, 2(2):1378-1389, 2009. Google ScholarGoogle Scholar
  22. T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. In VLDB, 1986. Google ScholarGoogle Scholar
  23. Y. Li, J. M. Patel, and A. Terrell. Wham: A high-throughput sequence alignment method. TODS, 37(4):28:1-28:39, Dec. 2012. Google ScholarGoogle Scholar
  24. F. Liu, C. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In Proceedings of ACM SIGMOD, 2006. Google ScholarGoogle Scholar
  25. G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of WWW, 2007. Google ScholarGoogle Scholar
  26. E. Mohr, D. A. Kranz, and R. H. Halstead. Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2:185-197, 1991. Google ScholarGoogle Scholar
  27. J. Pan and D. Manocha. Fast GPU-based locality sensitive hashing for k-nearest neighbor computation. In SIGSPATIAL, 2011. Google ScholarGoogle Scholar
  28. S. Petrovic, M. Osborne, and V. Lavrenko. Streaming first story detection with application to twitter. In NAACL, volume 10, pages 181-189, 2010. Google ScholarGoogle Scholar
  29. A. Sadilek and H. Kautz. Modeling the impact of lifestyle on health at scale. In Proceedings of ACM ICWSDM, pages 637-646, 2013. Google ScholarGoogle Scholar
  30. N. Satish, C. Kim, J. Chhugani, et al. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In SIGMOD, 2010. Google ScholarGoogle Scholar
  31. M. Slaney, Y. Lifshits, and J. He. Optimal parameters for locality-sensitive hashing. Proceedings of the IEEE, 100(9):2604-2623, 2012.Google ScholarGoogle Scholar
  32. X. Yan, P. S. Yu, and J. Han. Substructure similarity search in graph databases. In Proceedings of SIGMOD, 2005. Google ScholarGoogle Scholar

Index Terms

  1. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 6, Issue 14
          September 2013
          384 pages

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 September 2013
          Published in pvldb Volume 6, Issue 14

          Qualifiers

          • article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader