Research article · DOI: 10.1145/3219819.3220089

BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Published: 19 July 2018

ABSTRACT

Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests verifies the new algorithm and also reveals limitations of other approaches published in the recent past.
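For orientation, the following is a minimal Python sketch of the unweighted setting the abstract refers to: each signature component is the minimum hash value of a set's elements under an independent hash function, and the fraction of matching components between two signatures estimates the Jaccard similarity. It also computes the generalized (weighted) Jaccard similarity that weighted minwise hashing schemes such as BagMinHash are designed to estimate. This is not the BagMinHash algorithm itself; the salted use of Python's built-in hash and all function names are illustrative assumptions, not taken from the paper.

import random

def minhash_signature(elements, num_hashes=256, seed=0):
    # One signature component per hash function: keep the minimum hash
    # value of the set's elements under that (salted) hash function.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, e)) for e in elements) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching components is an unbiased estimator of
    # the Jaccard similarity of the two underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def weighted_jaccard(wa, wb):
    # Generalized (weighted) Jaccard similarity of two weight vectors,
    # given as dicts mapping elements to non-negative weights; this is
    # the quantity a weighted minwise hashing scheme is meant to estimate.
    keys = set(wa) | set(wb)
    num = sum(min(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    den = sum(max(wa.get(k, 0.0), wb.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 1.0

if __name__ == "__main__":
    A = {"apple", "banana", "cherry", "date"}
    B = {"banana", "cherry", "date", "elderberry"}
    print("exact Jaccard:    ", len(A & B) / len(A | B))                  # 0.6
    print("estimated Jaccard:", estimate_jaccard(minhash_signature(A),
                                                 minhash_signature(B)))   # close to 0.6
    print("weighted Jaccard: ", weighted_jaccard({"x": 2.0, "y": 1.0},
                                                 {"x": 1.0, "y": 3.0}))   # 2/5 = 0.4

Within a single process the estimate concentrates around the exact value as the number of signature components grows; the paper's contribution is computing such signatures efficiently when elements carry arbitrary non-negative weights.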


Published in

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018, 2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819

Copyright © 2018 ACM. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 19 July 2018

Acceptance rates: KDD '18 paper acceptance rate: 107 of 983 submissions (11%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
