ABSTRACT
Realtime data processing powers many use cases at Facebook, including realtime reporting of the aggregated, anonymized voice of Facebook users, analytics for mobile applications, and insights for Facebook page administrators. Many companies have developed their own systems; we have a realtime data processing ecosystem at Facebook that handles hundreds of gigabytes per second across hundreds of data pipelines.
Many decisions must be made while designing a realtime stream processing system. In this paper, we identify five important design decisions that affect such a system's ease of use, performance, fault tolerance, scalability, and correctness. We compare the alternative choices for each decision and contrast what we built at Facebook with other published systems.
Our main decision was targeting seconds of latency, not milliseconds. Seconds is fast enough for all of the use cases we support and it allows us to use a persistent message bus for data transport. This data transport mechanism then paved the way for fault tolerance, scalability, and multiple options for correctness in our stream processing systems Puma, Swift, and Stylus.
We then illustrate how our decisions and systems satisfy our requirements for multiple use cases at Facebook. Finally, we reflect on the lessons we learned as we built and operated these systems.