DOI: 10.1145/1458082.1458164
research-article

A framework for estimating complex probability density structures in data streams

Published: 26 October 2008

ABSTRACT

Probability density function estimation is a fundamental component in several stream mining tasks such as outlier detection and classification. The nonparametric adaptive kernel density estimate (AKDE) provides a robust and asymptotically consistent estimate for an arbitrary distribution. However, its extensive computational requirements make it difficult to apply this technique to the stream environment. This paper tackles the issue of developing efficient and asymptotically consistent AKDE over data streams while heeding the stringent constraints imposed by the stream environment. We propose the concept of local regions to effectively synopsize local density features, design a suite of algorithms to maintain the AKDE under a time-based sliding window, and analyze the estimates' asymptotic consistency and computational costs. In addition, extensive experiments were conducted with real-world and synthetic data sets to demonstrate the effectiveness and efficiency of our approach.
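
For orientation, the following is a minimal sketch of a textbook adaptive kernel density estimate maintained over a time-based sliding window. It is not the local-region algorithm the paper proposes: it recomputes everything from the raw points in the window, which is exactly the cost the paper sets out to avoid. The class name SlidingWindowAKDE, the Gaussian kernel, the fixed pilot bandwidth, and the sensitivity parameter alpha = 0.5 (the square-root law) are all assumptions made for illustration.

```python
import math
from collections import deque


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


class SlidingWindowAKDE:
    """Naive 1-D adaptive KDE over a time-based sliding window (hypothetical)."""

    def __init__(self, window, bandwidth, alpha=0.5):
        self.window = window        # window length, in the units of the timestamps
        self.bandwidth = bandwidth  # global pilot bandwidth h, kept fixed here
        self.alpha = alpha          # sensitivity; 0.5 gives the square-root law
        self.points = deque()       # (timestamp, value) pairs, oldest first

    def insert(self, timestamp, value):
        """Add a point, then evict points that have left the window."""
        self.points.append((timestamp, value))
        while self.points and self.points[0][0] <= timestamp - self.window:
            self.points.popleft()

    def _pilot(self, x, values):
        """Fixed-bandwidth KDE, used only to choose the local bandwidths."""
        h = self.bandwidth
        return sum(gaussian_kernel((x - v) / h) for v in values) / (len(values) * h)

    def density(self, x):
        """Adaptive estimate at x; O(n^2) per query in this naive form."""
        values = [v for _, v in self.points]
        if not values:
            return 0.0
        pilot = [self._pilot(v, values) for v in values]
        # The geometric mean of the pilot densities normalizes the local factors.
        g = math.exp(sum(math.log(p) for p in pilot) / len(pilot))
        total = 0.0
        for v, p in zip(values, pilot):
            # Sparse regions (low pilot density) get a wider kernel.
            h_i = self.bandwidth * (p / g) ** (-self.alpha)
            total += gaussian_kernel((x - v) / h_i) / h_i
        return total / len(values)


# Usage: four clustered readings and one isolated reading.
est = SlidingWindowAKDE(window=10.0, bandwidth=0.5)
for t, v in enumerate([0.0, 0.1, 0.2, 5.0, 0.15]):
    est.insert(float(t), v)
print(est.density(0.1), est.density(5.0))
```

Every query in this sketch touches all points in the window twice, so memory and time grow with the window size. A synopsis-based approach, such as the one the abstract describes, would replace the raw points with compact summaries.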


    • Published in

      CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
      October 2008
      1562 pages
      ISBN:9781595939913
      DOI:10.1145/1458082

      Copyright © 2008 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
