ABSTRACT
Sliding-window aggregation is a widely used approach for extracting insights from the most recent portion of a data stream. The aggregations of interest can usually be cast as binary operators that are associative, but not necessarily commutative or invertible. Non-invertible operators, however, are difficult to support efficiently. The best published algorithms require O(log n) aggregation steps per window operation, where n is the sliding-window size at that point. For a FIFO window, this can be improved to O(1) on average by using two aggregation stacks.
This paper presents DABA, a novel algorithm for aggregating FIFO sliding windows that significantly improves upon these time bounds. DABA requires only O(1) aggregation steps per operation in the worst case (not just on average). As such, DABA asymptotically improves the performance of sliding-window aggregation without restricting the operator to be invertible. Our experimental results demonstrate that these theoretical improvements hold in practice. DABA is a substantial improvement over the state of the art in terms of both latency and throughput.
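The two-stack baseline mentioned in the abstract can be illustrated with a minimal sketch. This is a common formulation of the idea, not the authors' code and not DABA itself; the class name `TwoStacksWindow` and its method names are hypothetical. It assumes only an associative operator `op` with an identity element, and it preserves element order so that non-commutative operators also work.

```python
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class TwoStacksWindow(Generic[T]):
    """FIFO sliding-window aggregation with two stacks (hypothetical sketch).

    insert and query take O(1) operator applications; evict takes O(1)
    on average, but O(n) in the worst case when the back stack is flipped.
    """

    def __init__(self, op: Callable[[T, T], T], identity: T) -> None:
        self._op = op
        self._identity = identity
        # Each entry is (value, aggregate).
        # front: top is the oldest element; its aggregate covers that
        #        element and every newer element still in front.
        # back:  top is the newest element; its aggregate covers every
        #        element in back, oldest to newest.
        self._front: List[Tuple[T, T]] = []
        self._back: List[Tuple[T, T]] = []

    def insert(self, value: T) -> None:
        """Append the newest element to the window."""
        prefix = self._back[-1][1] if self._back else self._identity
        self._back.append((value, self._op(prefix, value)))

    def evict(self) -> None:
        """Remove the oldest element from the window."""
        if not self._front:
            # Flip: move everything from back to front, rebuilding
            # suffix aggregates. This is the occasional O(n) step.
            while self._back:
                value, _ = self._back.pop()
                suffix = self._front[-1][1] if self._front else self._identity
                self._front.append((value, self._op(value, suffix)))
        if not self._front:
            raise IndexError("evict from an empty window")
        self._front.pop()

    def query(self) -> T:
        """Aggregate of the whole window, oldest to newest."""
        front_agg = self._front[-1][1] if self._front else self._identity
        back_agg = self._back[-1][1] if self._back else self._identity
        return self._op(front_agg, back_agg)


if __name__ == "__main__":
    # max is associative but not invertible.
    w = TwoStacksWindow(max, float("-inf"))
    for x in (3, 1, 4):
        w.insert(x)
    print(w.query())
    w.evict()
    print(w.query())
```

The flip inside `evict` is what makes the bound amortized rather than worst-case: one eviction occasionally pays for moving the entire back stack. That latency spike is the cost the abstract says DABA avoids.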
Index Terms
- Low-Latency Sliding-Window Aggregation in Worst-Case Constant Time