Abstract
Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still incur very high communication overhead when handled by naive, centralized algorithms.
We present a novel geometric approach that reduces the task of monitoring the value of a function (vis-à-vis a threshold) to a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables efficient monitoring of arbitrary threshold functions over distributed data streams.
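The idea of a local constraint that filters out harmless increments can be illustrated with a minimal sketch. The abstract does not spell out the form of the constraint, so everything here is an assumption made for illustration: each node is taken to hold a shared estimate vector, its "drift" is the estimate plus the change in its local vector since the last synchronization, and the constraint is that the ball whose diameter runs from the estimate to the drift lies entirely on one side of the threshold. A linear function is used only because its minimum and maximum over a ball have a closed form; the names `Node`, `ball_bounds_linear`, and `local_constraint_holds` are hypothetical.

```python
import math


def ball_bounds_linear(weights, center, radius):
    """Exact min and max of f(x) = weights . x over a ball (closed form for linear f)."""
    mid = sum(w * c for w, c in zip(weights, center))
    spread = radius * math.sqrt(sum(w * w for w in weights))
    return mid - spread, mid + spread


def local_constraint_holds(estimate, drift, threshold, weights):
    """True if the ball with diameter [estimate, drift] does not cross the threshold."""
    center = [(e + d) / 2 for e, d in zip(estimate, drift)]
    radius = math.dist(estimate, drift) / 2
    lo, hi = ball_bounds_linear(weights, center, radius)
    return lo > threshold or hi <= threshold


class Node:
    """One stream site (hypothetical): applies the constraint to each increment."""

    def __init__(self, estimate, local_at_sync):
        self.estimate = list(estimate)        # shared estimate from the last sync
        self.last_sync = list(local_at_sync)  # local vector at the last sync
        self.current = list(local_at_sync)    # local vector now

    def update(self, increment, threshold, weights):
        """Apply an increment; return True only if the node must communicate."""
        self.current = [c + d for c, d in zip(self.current, increment)]
        drift = [
            e + (c - s)
            for e, c, s in zip(self.estimate, self.current, self.last_sync)
        ]
        return not local_constraint_holds(self.estimate, drift, threshold, weights)


node = Node(estimate=[1.0, 1.0], local_at_sync=[1.0, 1.0])
small = node.update([0.1, 0.1], threshold=4.0, weights=[1.0, 1.0])
large = node.update([1.5, 1.5], threshold=4.0, weights=[1.0, 1.0])
print(small, large)
```

In this sketch the small increment is absorbed silently, while the larger one makes the ball straddle the threshold and forces a message. For a nonlinear function, `ball_bounds_linear` would have to be replaced by some other way of bounding the function over the ball.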
We present experimental results on real-world data that demonstrate that our algorithms are highly scalable and considerably reduce communication load in comparison to centralized algorithms.