Abstract
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.
First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:
• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.
Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.
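As a rough illustration of the completeness reasoning described above, the following minimal sketch shows how a watermark can support two of the listed needs: emitting a single answer per window and garbage collecting that window's state afterwards. It is not the implementation used by either engine. The class name `TumblingWindowCounter`, its parameters, and the heuristic itself (watermark = largest event time seen minus a fixed out-of-orderness bound) are assumptions chosen for illustration.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per fixed-size event-time window, emitting each count once.

    Hypothetical sketch: the watermark is assumed to be the largest event
    time seen so far minus a fixed out-of-orderness bound.
    """

    def __init__(self, window_size, max_out_of_orderness):
        self.window_size = window_size
        self.max_out_of_orderness = max_out_of_orderness
        self.watermark = float("-inf")
        self.counts = defaultdict(int)  # window start -> running count
        self.late_events = 0

    def on_event(self, event_time):
        """Process one event; return the (window_start, count) pairs it finalizes."""
        window_start = event_time - event_time % self.window_size
        if window_start + self.window_size <= self.watermark:
            # The window was already declared complete, so the event is late.
            self.late_events += 1
        else:
            self.counts[window_start] += 1
        # The watermark only ever moves forward.
        self.watermark = max(
            self.watermark, event_time - self.max_out_of_orderness
        )
        return self._close_complete_windows()

    def _close_complete_windows(self):
        closed = []
        for start in sorted(self.counts):
            if start + self.window_size <= self.watermark:
                # Emit the single final answer and drop the window's state.
                closed.append((start, self.counts.pop(start)))
        return closed


counter = TumblingWindowCounter(window_size=10, max_out_of_orderness=5)
for t in [1, 7, 12, 3, 16, 2, 25]:
    for window_start, count in counter.on_event(t):
        print(f"window [{window_start}, {window_start + 10}) -> {count}")
print("late events:", counter.late_events)
```

In this sketch the out-of-order event at time 3 is still counted because the watermark has not yet passed the end of its window, while the event at time 2 arrives after that window was finalized and is only tallied as late. A larger bound trades latency for fewer late events.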
Index Terms
- Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow