BagMinHash - Minwise Hashing Algorithm for Weighted Sets

Author:
Otmar Ertl

Dynatrace, Linz, Austria


KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, July 2018, Pages 1368–1377. https://doi.org/10.1145/3219819.3220089

Published: 19 July 2018

ABSTRACT

Minwise hashing has become a standard tool for calculating signatures that allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, calculating signatures for weighted sets is still a time-consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than the current state of the art, without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests verifies the new algorithm and also reveals limitations of other recently published approaches.
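
For readers unfamiliar with the setting, the following is a minimal sketch of the quantities the abstract refers to. It is not BagMinHash. It shows the weighted Jaccard similarity (sum of element-wise minimum weights divided by sum of element-wise maximum weights) and the classic minwise hashing scheme for unweighted sets, where each of m signature components is the minimum of a separate hash function over the set. The function names, the use of seeded BLAKE2b as the hash, and the signature length are assumptions made for illustration only.

```python
import hashlib


def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two weight maps (element -> weight >= 0)."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Convention assumed here: two empty (all-zero) inputs are treated as identical.
    return num / den if den > 0 else 1.0


def _hash(item, seed):
    """64-bit hash of an item under a given seed (illustrative choice of hash)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, m):
    """Classic minwise hashing for a non-empty unweighted set.

    Component i is the minimum of hash function i over all items,
    so the cost is O(m * len(items)).
    """
    return [min(_hash(x, i) for x in items) for i in range(m)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of equal signature components, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    A = set(range(100))
    B = set(range(50, 150))  # true Jaccard similarity is 50 / 150
    sig_a = minhash_signature(A, 512)
    sig_b = minhash_signature(B, 512)
    print("estimated:", estimate_jaccard(sig_a, sig_b))
    print("weighted:", weighted_jaccard({"a": 2.0, "b": 1.0}, {"a": 1.0, "c": 1.0}))
```

The per-component loop over all items is what makes this naive scheme slow for long signatures. Extending it to real-valued weights is the problem the abstract describes as time-consuming.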

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN: 9781450355520
DOI: 10.1145/3219819
General Chairs: Yike Guo (Imperial College London), Faisal Farooq (IBM)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
consistent weighted sampling
jaccard similarity
locality-sensitive hashing
sketching algorithms
weighted minwise hashing
Qualifiers
- research-article
Acceptance Rates
KDD '18 Paper Acceptance Rate: 107 of 983 submissions, 11%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%