A geometric approach to monitoring threshold functions over distributed data streams

Published: 01 November 2007

Abstract

Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still require very high communication overhead when handled by naive, centralized algorithms.

We present a novel geometric approach which reduces monitoring the value of a function (vis-à-vis a threshold) to a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner.
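The abstract does not spell out what form the local constraints take. As a rough illustration only, the sketch below assumes one possible ball-based constraint: each stream keeps a shared estimate vector, the statistics vector it last reported, and its current statistics vector, and it stays silent while a ball spanned by the estimate and its own drift lies entirely on one side of the threshold. The monitored function here is the Euclidean norm, chosen only because its range over a ball has a closed form. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def norm(x: Sequence[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(c * c for c in x))


def drift_vector(estimate: Sequence[float],
                 last_sent: Sequence[float],
                 current: Sequence[float]) -> List[float]:
    """Shared estimate shifted by the local change since the last report."""
    return [e + (c - s) for e, s, c in zip(estimate, last_sent, current)]


def local_ball(estimate: Sequence[float],
               drift: Sequence[float]) -> Tuple[List[float], float]:
    """Ball whose diameter is the segment from the estimate to the drift vector."""
    center = [(e + u) / 2.0 for e, u in zip(estimate, drift)]
    radius = norm([e - u for e, u in zip(estimate, drift)]) / 2.0
    return center, radius


def norm_range_over_ball(center: Sequence[float],
                         radius: float) -> Tuple[float, float]:
    """Minimum and maximum of the Euclidean norm over a closed ball."""
    c = norm(center)
    return max(0.0, c - radius), c + radius


def local_constraint_holds(estimate: Sequence[float],
                           last_sent: Sequence[float],
                           current: Sequence[float],
                           threshold: float) -> bool:
    """True if the local ball is entirely on one side of the threshold.

    While this holds, the stream does not need to communicate (assumption:
    the monitored function is the Euclidean norm of the statistics vector).
    """
    drift = drift_vector(estimate, last_sent, current)
    center, radius = local_ball(estimate, drift)
    lo, hi = norm_range_over_ball(center, radius)
    return hi <= threshold or lo > threshold
```

With an estimate of `[1.0, 0.0]` and a threshold of `2.0`, a small local change to `[1.2, 0.0]` keeps the ball below the threshold, so the check returns `True` and nothing is sent. A large change to `[4.0, 0.0]` produces a ball that straddles the threshold, so the check returns `False` and the stream would have to report. For functions without a closed-form range over a ball, the minimum and maximum would need to be bounded some other way.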

We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms.

    • Published in

      ACM Transactions on Database Systems, Volume 32, Issue 4
      November 2007
      364 pages
      ISSN: 0362-5915
      EISSN: 1557-4644
      DOI: 10.1145/1292609

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
