ABSTRACT
Minwise hashing has become a standard tool for computing signatures that allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, computing signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art, without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. Finally, a series of tests verifies the new algorithm and also reveals limitations of other recently published approaches.
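As background for the terms used in the abstract, the following minimal sketch illustrates what "signatures that allow estimation of Jaccard similarity" means. It is not BagMinHash: it shows the exact weighted Jaccard similarity (the quantity weighted signatures aim to estimate) and the classical unweighted MinHash scheme, which costs one hash evaluation per element and per signature component. All function names, the choice of BLAKE2b as the hash, and the convention for two empty inputs are assumptions made for illustration only.

```python
import hashlib


def weighted_jaccard(a, b):
    """Exact weighted Jaccard similarity of two weight maps (element -> weight >= 0).

    Computed as sum of element-wise minima divided by sum of element-wise maxima.
    Returning 1.0 for two empty inputs is an assumed convention.
    """
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 1.0


def _hash(element, seed):
    """Hypothetical helper: 64-bit hash of an element, salted with the component index."""
    data = f"{seed}:{element}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(elements, m):
    """Classical unweighted MinHash signature with m components.

    Each component is the minimum hash value over the (non-empty) set under
    its own salted hash function.
    """
    return [min(_hash(e, i) for e in elements) for i in range(m)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of equal signature components."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

For two sets sharing a third of their union, the fraction of matching components approaches 1/3 as the signature length m grows. The cost of this naive scheme is proportional to m times the set size, which is the kind of overhead that faster algorithms aim to reduce.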
Index Terms
- BagMinHash - Minwise Hashing Algorithm for Weighted Sets