research-article

Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Published: 01 July 2021

Abstract

Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.

First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:

• Computing a single correct answer, as in notifications.

• Reasoning about a lack of data, as in dip detection.

• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.

• Safely and punctually garbage collecting obsolete inputs and intermediate state.

• Surfacing a reliable signal of overall pipeline health.

Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.
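
As a rough illustration of the first need listed above (computing a single correct answer), the following sketch emits each tumbling-window sum exactly once, when a heuristic watermark passes the end of the window. It is a simplified model only, not the Apache Flink or Google Cloud Dataflow implementation: the function name `windowed_sums` and the parameters `window_size` and `max_delay` are assumptions made for this example, and the watermark here is simply the largest event time seen so far minus a fixed allowed delay.

```python
from collections import defaultdict


def windowed_sums(events, window_size, max_delay):
    """Yield (window_start, total) exactly once per tumbling window.

    events: iterable of (event_time, value) pairs in arrival order.
    window_size, max_delay: hypothetical parameters for this sketch.
    """
    open_windows = defaultdict(int)  # window_start -> running sum
    watermark = float("-inf")        # claim: no more events earlier than this
    max_seen = float("-inf")         # largest event time observed so far

    for event_time, value in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            # Late event: its window was already finalized, so drop it.
            continue
        open_windows[start] += value

        # Heuristic watermark: assume events arrive at most max_delay late.
        max_seen = max(max_seen, event_time)
        watermark = max(watermark, max_seen - max_delay)

        # A window [start, start + window_size) is complete once the
        # watermark reaches its end; emit its result and free its state.
        done = sorted(s for s in open_windows if s + window_size <= watermark)
        for s in done:
            yield s, open_windows.pop(s)


# Example: the event (2, 100) arrives after window [0, 10) was finalized.
events = [(1, 10), (4, 5), (3, 7), (16, 1), (2, 100), (25, 2)]
print(list(windowed_sums(events, window_size=10, max_delay=5)))
# prints [(0, 22), (10, 1)]
```

In this model the same watermark comparison drives both the single emission and the release of per-window state, and the late event is discarded rather than triggering a revised result. The window starting at 20 is never emitted because the watermark never reaches 30 in this input.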


Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 12
July 2021
587 pages
ISSN: 2150-8097

Publisher: VLDB Endowment
