ABSTRACT
Probability density function estimation is a fundamental component in several stream mining tasks such as outlier detection and classification. The nonparametric adaptive kernel density estimate (AKDE) provides a robust and asymptotically consistent estimate for an arbitrary distribution. However, its extensive computational requirements make it difficult to apply this technique to the stream environment. This paper tackles the issue of developing efficient and asymptotically consistent AKDE over data streams while heeding the stringent constraints imposed by the stream environment. We propose the concept of local regions to effectively synopsize local density features, design a suite of algorithms to maintain the AKDE under a time-based sliding window, and analyze the estimates' asymptotic consistency and computational costs. In addition, extensive experiments were conducted with real-world and synthetic data sets to demonstrate the effectiveness and efficiency of our approach.
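For readers unfamiliar with the AKDE, the following is a minimal batch sketch of the standard two-stage adaptive kernel estimator: a fixed-bandwidth pilot estimate is computed first, and each sample's bandwidth is then widened where the pilot density is low and narrowed where it is high. It is not the local-region synopsis or the sliding-window maintenance algorithm proposed in the paper. The Gaussian kernel, the function names, the global bandwidth `h`, and the sensitivity parameter `alpha = 0.5` are all illustrative assumptions.

```python
import math


def gaussian(u):
    """Standard normal kernel (assumed kernel choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def fixed_kde(x, data, h):
    """Fixed-bandwidth kernel density estimate at x; used as the pilot."""
    return sum(gaussian((x - xi) / h) for xi in data) / (len(data) * h)


def local_bandwidths(data, h, alpha=0.5):
    """Per-sample bandwidths h_i = h * (pilot(x_i) / g) ** (-alpha),
    where g is the geometric mean of the pilot values."""
    pilot = [fixed_kde(xi, data, h) for xi in data]
    g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
    return [h * (p / g) ** (-alpha) for p in pilot]


def adaptive_kde(x, data, widths):
    """Adaptive estimate at x: each sample contributes a kernel
    scaled by its own bandwidth."""
    total = sum(gaussian((x - xi) / hi) / hi for xi, hi in zip(data, widths))
    return total / len(data)


# Hypothetical toy sample: a dense cluster near 0 and one isolated point.
data = [0.0, 0.1, 0.2, 0.3, 5.0]
widths = local_bandwidths(data, h=0.5)
density_at_zero = adaptive_kde(0.0, data, widths)
```

In this toy sample the isolated point at 5.0 receives a wider kernel than the points in the cluster. Evaluating the estimate this way costs time linear in the number of retained samples per query, which is the kind of cost the abstract describes as hard to sustain over a stream.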
Index Terms
- A framework for estimating complex probability density structures in data streams