Time-weighted counting for recently frequent pattern mining in data streams

Lim, Yongsub; Kang, U.

doi:10.1007/s10115-017-1045-1

Regular Paper
Published: 22 March 2017

Volume 53, pages 391–422, (2017)

Knowledge and Information Systems

Yongsub Lim¹ &
U. Kang²

Abstract

How can we discover interesting patterns from time-evolving, high-speed data streams? How can we analyze such streams quickly and accurately with little space overhead? How can we guarantee that the discovered patterns are self-consistent? High-speed data streams have been receiving increasing attention because of their wide range of applications, such as sensors, network traffic, and social networks. The most fundamental task on a data stream is frequent pattern mining; in real applications, focusing on recentness is especially important. In this paper, we develop two algorithms for discovering recently frequent patterns in data streams. First, we propose TwMinSwap to find the top-k recently frequent items in a data stream. TwMinSwap is a deterministic version of our motivating algorithm TwSample, which provides theoretical guarantees based on item sampling. TwMinSwap improves on TwSample in speed, accuracy, and memory usage. Both require only O(k) memory and no prior knowledge of the stream, such as its length or the number of distinct items it contains. Second, we propose TwMinSwap-Is to find the top-k recently frequent itemsets in a data stream. We focus in particular on keeping the discovered itemsets self-consistent, which is the most important property for reliable results, while using O(k) memory under the assumption of a constant itemset size. Through extensive experiments, we demonstrate that TwMinSwap outperforms all competitors in accuracy and memory usage, with fast running time. We also show that TwMinSwap-Is is more accurate than its competitor and discovers recently frequent itemsets of reasonably large size (at most 5–7, depending on the dataset). Using TwMinSwap and TwMinSwap-Is, we report interesting discoveries in real-world data streams, including the difference in trends between the winning and losing U.S. presidential candidates, and temporal human contact patterns.
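
To make the idea of tracking recently frequent items with a bounded number of counters concrete, the following is a minimal Python sketch. It is not the TwMinSwap or TwSample algorithm, whose details are not given in this abstract. The class name `DecayedTopK`, the decay factor `alpha`, and the rule of replacing the smallest counter are all assumptions chosen for illustration. The sketch only shows how decaying every counter at each arrival lets recent items displace older ones while at most k counters are stored.

```python
from typing import Dict, Hashable, List, Tuple


class DecayedTopK:
    """Illustrative time-decayed counter that stores at most k items.

    Hypothetical sketch: `alpha` and the replace-the-minimum rule are
    assumptions, not the method described in the paper.
    """

    def __init__(self, k: int, alpha: float) -> None:
        if k <= 0 or not 0.0 < alpha <= 1.0:
            raise ValueError("need k > 0 and 0 < alpha <= 1")
        self.k = k
        self.alpha = alpha
        self.counts: Dict[Hashable, float] = {}

    def update(self, item: Hashable) -> None:
        # Age every stored count by one time step so old arrivals weigh less.
        for key in self.counts:
            self.counts[key] *= self.alpha
        if item in self.counts:
            self.counts[item] += 1.0
        elif len(self.counts) < self.k:
            self.counts[item] = 1.0
        else:
            # Assumed rule: evict the smallest decayed count, but only if
            # it has fallen below the weight of a single fresh arrival.
            victim = min(self.counts, key=self.counts.get)
            if self.counts[victim] < 1.0:
                del self.counts[victim]
                self.counts[item] = 1.0

    def top(self) -> List[Tuple[Hashable, float]]:
        # Items ordered from largest to smallest decayed count.
        return sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)


tracker = DecayedTopK(k=1, alpha=0.5)
for symbol in ["a"] * 5 + ["b"] * 5:
    tracker.update(symbol)
recent_winner = tracker.top()[0][0]
```

With a single counter and a strong decay, the item that dominated the start of the stream is displaced once it stops arriving, so `recent_winner` ends up being the item seen most recently. Each update in this sketch touches all k counters; a practical implementation would likely avoid that cost.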

Notes

If an itemset \(\mu \) is frequent, every subset \(\nu \subseteq \mu \) is also frequent.
Here, \(binomial(\omega ,\theta )\) denotes a binomial random variable with the number \(\omega \) of independent trials and the success probability \(\theta \).
This is a different concept from closed frequent itemsets [2].
In the original paper proposing Skip LC-SS, k is set to a large value, \(50{,}000\le k\le 70{,}000\).
http://www.yelp.com/dataset_challenge/.
http://konect.uni-koblenz.de/networks/mit.
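
The subset property in the first note can be checked mechanically on any reported collection of itemsets. The Python sketch below assumes that self-consistency means the collection is closed under non-empty proper subsets; the function name `missing_subsets` and the example itemsets are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple


def missing_subsets(
    itemsets: Iterable[Iterable[str]],
) -> List[Tuple[FrozenSet[str], FrozenSet[str]]]:
    """Return (itemset, subset) pairs where a non-empty proper subset
    of a reported itemset is itself not reported."""
    reported: Set[FrozenSet[str]] = {frozenset(s) for s in itemsets}
    violations = []
    for itemset in reported:
        items = sorted(itemset)
        # Enumerate every non-empty proper subset of the itemset.
        for size in range(1, len(items)):
            for combo in combinations(items, size):
                subset = frozenset(combo)
                if subset not in reported:
                    violations.append((itemset, subset))
    return violations


consistent = [{"a"}, {"b"}, {"a", "b"}]
inconsistent = [{"a"}, {"a", "b"}]  # {"b"} is missing

ok = missing_subsets(consistent)
bad = missing_subsets(inconsistent)
```

Here `ok` is empty, while `bad` contains the single pair made of the itemset {"a", "b"} and its unreported subset {"b"}. The enumeration is exponential in the itemset size, which is tolerable only because the itemsets are assumed to be small.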

References

Arasu A, Manku GS (2004) Approximate counts and quantiles over sliding windows. In: PODS
Borgelt C, Yang X, Nogales-Cadenas R, Carmona-Saez P, Pascual-Montano A (2011) Finding closed frequent item sets by intersecting transactions. In: EDBT
Boyer RS, Moore JS (1991) MJRTY: a fast majority vote algorithm. In: Automated reasoning: essays in honor of Woody Bledsoe
Brijs T, Swinnen G, Vanhoof K, Wets G (1999) Using association rules for product assortment decisions: a case study. In: Knowledge discovery and data mining
Calders T, Dexters N, Gillis JJM, Goethals B (2014) Mining frequent itemsets in a stream. Inf Syst 39:233–255
Chang JH, Lee WS (2003) Finding recent frequent itemsets adaptively over online data streams. In: KDD
Chang JH, Lee WS (2004) Decaying obsolete information in finding recent frequent itemsets over data streams. IEICE Trans 87–D(6):1588–1592
Charikar M, Chen K, Farach-Colton M (2002) Finding frequent items in data streams. In: ICALP
Chen L, Mei Q (2014) Mining frequent items in data stream using time fading model. Inf Sci 257:54–69
Cormode G, Hadjieleftheriou M (2010) Methods for finding frequent items in data streams. VLDB J 19(1):3–20
Cormode G, Muthukrishnan S (2004a) An improved data stream summary: the count-min sketch and its applications. In: LATIN
Cormode G, Muthukrishnan S (2004b) What’s new: finding significant differences in network data streams. In: INFOCOM
Dallachiesa M, Palpanas T (2013) Identifying streaming frequent items in ad hoc time windows. Data Knowl Eng 87:66–90
Demaine ED, López-Ortiz A, Munro JI (2002) Frequency estimation of internet packet streams with limited space. In: ESA
Eagle N, Pentland A (2006) Reality mining: sensing complex social systems. Pers Ubiquitous Comput 10(4):255–268
Fischer MJ, Salzberg SL (1982) Finding a majority among \(n\) votes: solution to problem 81–5 (Journal of Algorithms, June 1981). J Algorithms 3(4):362–380
Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: SIGMOD
Golab L, DeHaan D, Demaine ED, López-Ortiz A, Munro JI (2003) Identifying frequent items in sliding windows over on-line packet streams. In: Internet measurement conference
Golab L, DeHaan D, López-Ortiz A, Demaine ED (2004) Finding frequent items in sliding windows with multinomially-distributed item frequencies. In: SSDBM
Jin C, Qian W, Sha C, Yu JX, Zhou A (2003) Dynamically maintaining frequent items over a data stream. In: CIKM
Jin R, Agrawal G (2005) An algorithm for in-core frequent itemset mining on streaming data. In: ICDM
Karp RM, Shenker S, Papadimitriou CH (2003) A simple algorithm for finding frequent elements in streams and bags. ACM Trans Database Syst 28:51–55
Lee D, Lee W (2005) Finding maximal frequent itemsets over online data streams adaptively. In: ICDM
Leskovec J, Backstrom L, Kleinberg J (2009) Meme-tracking and the dynamics of the news cycle. In: KDD
Lewis DD, Yang Y, Rose TG, Li F (2004) RCV1: a new benchmark collection for text categorization research. J Mach Learn Res 5:361–397
Li H, Lee S, Shan M (2005) Online mining (recently) maximal frequent itemsets over data streams. In: 15th international workshop on research issues in data engineering
Li H, Zhang N, Chen Z (2012) A simple but effective maximal frequent itemset mining algorithm over streams. JSW 7(1):25–32
Lim Y, Choi J, Kang U (2014) Fast, accurate, and space-efficient tracking of time-weighted frequent items from data streams. In: CIKM
Lim Y, Jung M, Kang U (2017) Memory-efficient and accurate sampling for counting local triangles in graph streams: from simple to multigraphs. ACM Trans Knowl Discov Data 11(4)
Lim Y, Kang U (2015) Mascot: memory-efficient and accurate sampling for counting local triangles in graph streams. In: KDD
Liu H, Lu Y, Han J, He J (2006) Error-adaptive and time-aware maintenance of frequency counts over data streams. In: WAIM
Manerikar N, Palpanas T (2009) Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl Eng 68(4):415–430
Manku GS, Motwani R (2002) Approximate frequency counts over data streams. In: VLDB
McAuley JJ, Leskovec J (2013) From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In: WWW
Metwally A, Agrawal D, Abbadi AE (2005) Efficient computation of frequent and top-k elements in data streams. In: ICDT
Misra J, Gries D (1982) Finding repeated elements. Sci Comput Program 2(2):143–152
Vitter JS (1985) Random sampling with a reservoir. ACM Trans Math Softw 11(1):37–57
Yamamoto Y, Iwanuma K, Fukuda S (2014) Resource-oriented approximation for frequent itemset mining from bursty data streams. In: SIGMOD
Zhang S, Chen L, Tu L (2009a) Frequent items mining on data stream based on time fading factor. In: AICI
Zhang S, Chen L, Tu L (2009b) Frequent items mining on data stream using hash-table and heap. In: ICIS

Acknowledgements

This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0190-15-2012, High Performance Big Data Analytics Platform Performance Acceleration Technologies Development). The Institute of Engineering Research at Seoul National University provided research facilities for this work. The ICT at Seoul National University provided research facilities for this study.

Author information

Authors and Affiliations

Big Data Tech. Lab, SK Telecom, Seongnam, Republic of Korea
Yongsub Lim
Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea
U. Kang

Corresponding author

Correspondence to U. Kang.

About this article

Cite this article

Lim, Y., Kang, U. Time-weighted counting for recently frequent pattern mining in data streams. Knowl Inf Syst 53, 391–422 (2017). https://doi.org/10.1007/s10115-017-1045-1

Received: 22 April 2016
Revised: 20 January 2017
Accepted: 17 March 2017
Published: 22 March 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s10115-017-1045-1
