Finding associations and computing similarity via biased pair sampling

Campagna, Andrea; Pagh, Rasmus

doi:10.1007/s10115-011-0428-y

Regular paper
Published: 17 June 2011

Volume 31, pages 505–526 (2012)
Knowledge and Information Systems

Andrea Campagna¹ &
Rasmus Pagh¹

Abstract

Sampling-based methods have previously been proposed for the problem of finding interesting associations in data, even for low-support items. While these methods do not guarantee precise results, they can be vastly more efficient than approaches that rely on exact counting. However, for many similarity measures no such methods have been known. In this paper, we show how a wide variety of measures can be supported by a simple biased sampling method. The method also extends to find high-confidence association rules. We demonstrate theoretically that our method is superior to exact methods when the threshold for “interesting similarity/confidence” is above the average pairwise similarity/confidence, and the average support is not too low. Our method is particularly advantageous when transactions contain many items. We confirm in experiments on standard association mining benchmarks that we obtain a significant speedup on real data sets. Reductions in computation time of over an order of magnitude, and significant savings in space, are observed.
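
The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of what biased pair sampling could look like for one measure, cosine similarity. It is not the authors' algorithm. The function names, the `scale` parameter, and the choice of keeping each co-occurrence with probability `scale / sqrt(support(a) * support(b))` are assumptions made for illustration. Under that choice, the expected number of times a pair is kept is `scale` times its cosine similarity, so frequently sampled pairs are candidates for high similarity even when both items have low support. This naive version still enumerates every pair in every transaction, so it shows the bias only, not any of the efficiency gains the abstract reports.

```python
import itertools
import math
import random
from collections import Counter


def biased_pair_sample(transactions, scale, seed=0):
    """Illustrative only: sample co-occurring pairs with a support-dependent bias.

    `scale` is a hypothetical tuning parameter, not taken from the paper.
    """
    rng = random.Random(seed)
    # Pass 1: support of each item = number of transactions containing it.
    support = Counter(item for t in transactions for item in set(t))
    sampled = Counter()
    # Pass 2: keep each co-occurrence with probability
    # scale / sqrt(support[a] * support[b]), capped at 1, so that pairs of
    # low-support items are favoured over pairs of very common items.
    for t in transactions:
        for a, b in itertools.combinations(sorted(set(t)), 2):
            p = min(1.0, scale / math.sqrt(support[a] * support[b]))
            if rng.random() < p:
                sampled[(a, b)] += 1
    return support, sampled


def estimate_cosine(sampled, scale):
    """Convert sample counts to cosine estimates.

    The estimate is unbiased only for pairs whose keep-probability was
    below the cap of 1; capped pairs are underestimated.
    """
    return {pair: count / scale for pair, count in sampled.items()}
```

A caller would pick `scale` small enough that most keep-probabilities stay below 1, then report the pairs whose estimate exceeds the similarity threshold of interest.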

Author information

Authors and Affiliations

Efficient Computation Group, IT University of Copenhagen, Rued Langgaards Vej 7, 2300, København S, Denmark
Andrea Campagna & Rasmus Pagh

Corresponding author

Correspondence to Rasmus Pagh.

About this article

Cite this article

Campagna, A., Pagh, R. Finding associations and computing similarity via biased pair sampling. Knowl Inf Syst 31, 505–526 (2012). https://doi.org/10.1007/s10115-011-0428-y

Received: 15 January 2010
Revised: 19 April 2011
Accepted: 27 May 2011
Published: 17 June 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10115-011-0428-y
