Distributed data stream processing and edge computing: A survey on resource elasticity and future directions

https://doi.org/10.1016/j.jnca.2017.12.001

Highlights

  • The paper surveys the state of the art on stream processing engines and mechanisms.

  • The work describes how existing solutions exploit resource elasticity features of cloud computing in stream processing.

  • It presents a gap analysis and future directions for stream processing in heterogeneous environments.

Abstract

Under several emerging application scenarios, such as smart cities, operational monitoring of large infrastructure, wearable assistance, and the Internet of Things, continuous data streams must be processed under very short delays. Several solutions, including multiple software engines, have been developed for processing unbounded data streams in a scalable and efficient manner. More recently, architectures have been proposed to use edge computing for data stream processing. This paper surveys the state of the art on stream processing engines and mechanisms for exploiting the resource elasticity features of cloud computing in stream processing. Resource elasticity allows an application or service to scale out/in according to fluctuating demands. Although such features have been extensively investigated for enterprise applications, stream processing poses challenges to achieving elastic systems that can make efficient resource management decisions based on current load. Elasticity becomes even more challenging in highly distributed environments comprising edge and cloud computing resources. This work examines some of these challenges and discusses solutions proposed in the literature to address them.

Introduction

The increasing availability of sensors, mobile phones, and other devices has led to an explosion in the volume, variety, and velocity of data that is generated and requires some type of analysis. As society becomes more interconnected, organisations are producing vast amounts of data as a result of instrumented business processes, monitoring of user activity (CISCO, 2012, Clifford and Hardy, 2013), wearable assistance (Ha et al., 2014), website tracking, sensors, finance, accounting, and large-scale scientific experiments, among other sources. This data deluge is often termed big data due to the challenges it poses to existing infrastructure regarding, for instance, data transfer, storage, and processing (de Assuncao et al., 2015).

A large part of this big data is most valuable when it is analysed quickly, as it is generated. Under several emerging application scenarios, such as smart cities, operational monitoring of large infrastructure, and the Internet of Things (IoT) (Atzori et al., 2010), continuous data streams must be processed under very short delays. In several domains, there is a need to process data streams to detect patterns, identify failures (Rettig et al., 2015), and gain insights.

Several stream processing frameworks and tools have been proposed for carrying out analytical tasks in a scalable and efficient manner. Many tools employ a dataflow approach in which incoming data is organised as streams that flow through a directed graph of operators placed on distributed hosts, each executing algebra-like operations or user-defined functions. Other frameworks discretise incoming data streams by temporarily storing arriving data during small time windows and then performing micro-batch processing, triggering distributed computations on the previously stored data. The second approach aims to improve the scalability and fault tolerance of distributed stream processing tools by handling straggler tasks and faults more efficiently.
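
To make the two models concrete, the sketch below contrasts them on a toy word count written in plain Python; it uses no real engine's API, and the function names (split_words, dataflow_process, microbatch_process) are purely illustrative. The dataflow variant pushes each record through the operator chain as it arrives, while the micro-batch variant buffers records for a small window and processes the whole batch at once.

    # Illustrative sketch only: a toy word count written in two styles,
    # independent of any real stream processing engine's API.
    from collections import Counter, deque

    # --- Dataflow style: each record flows through a graph of operators ---
    def split_words(line):              # operator 1: tokenise
        return line.lower().split()

    def count_words(words, counts):     # operator 2: stateful aggregation
        counts.update(words)
        return counts

    def dataflow_process(record, counts):
        # Records are processed one at a time, as they arrive.
        return count_words(split_words(record), counts)

    # --- Micro-batch style: buffer records and trigger periodic batch jobs ---
    def microbatch_process(buffer, counts):
        # The whole buffered window is processed as a single small batch.
        batch = list(buffer)
        buffer.clear()
        for record in batch:
            counts.update(split_words(record))
        return counts

    if __name__ == "__main__":
        stream = ["edge computing and cloud computing",
                  "stream processing on the edge"]
        df_counts, mb_counts = Counter(), Counter()
        window = deque()

        for line in stream:
            dataflow_process(line, df_counts)   # processed immediately
            window.append(line)                 # or buffered for a window

        microbatch_process(window, mb_counts)   # triggered when the window closes
        assert df_counts == mb_counts
        print(df_counts)

In a real engine, the operators would be distributed across hosts and the window trigger would be time- or size-based rather than explicit.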

Also to improve scalability, many stream processing frameworks have been deployed on clouds (Armbrust et al., 2009), aiming to benefit from characteristics such as resource elasticity. Elasticity refers to the ability of a cloud to allow a service to allocate additional resources or release idle capacity on demand so as to match the application workload. Although efforts have been made towards making stream processing more elastic, many issues remain unaddressed. There are challenges regarding the placement of stream processing tasks on available resources, the identification of bottlenecks, and application adaptation. These challenges are exacerbated when services are part of a larger infrastructure comprising multiple execution models (e.g. lambda architecture, workflows, or resource-management bindings for high-level programming abstractions (Boykin et al., 2014, Google Cloud Dataflow, 2015)) or hybrid environments that include both cloud and edge computing resources (Hu et al., 2015, Hu et al., 2016).

More recently, software frameworks (Apache Edgent, 2017, Pisani et al., 2017) and architectures have been proposed for carrying out data stream processing using constrained resources located at the edge of the Internet. This scenario introduces additional challenges regarding application scheduling, resource elasticity, and programming models. This article surveys stream-processing solutions and approaches for deploying data stream processing on cloud computing and edge environments. By so doing, it makes the following contributions:

  • It reviews multiple generations of data stream processing frameworks, describing their architectural and execution models.

  • It analyses and classifies existing work on exploiting elasticity to adapt resource allocation to match the demands of stream processing services. Previous work has surveyed stream processing solutions without a focus on how resource elasticity is addressed (Zhao et al., 2017). The present work provides a more in-depth analysis of existing solutions and discusses how they attempt to achieve resource elasticity.

  • It discusses ongoing efforts on resource elasticity for data stream processing and their deployment on edge computing environments, and outlines future directions on the topic.

The rest of this paper is organised as follows. Section 2 provides background information on big-data ecosystems and architecture for online data processing. Section 3 describes existing engines and other software solutions for data stream processing whereas Section 4 discusses managed cloud solutions for stream processing. In Section 5 we elaborate on how existing work tries to tackle aspects of resource elasticity for data stream processing. Section 6 discusses solutions that aim to leverage multiple types of infrastructure (e.g. cloud and edge computing) to improve the performance of stream processing applications. Section 7 presents future directions on the topic and finally, Section 8 concludes the paper.

Section snippets

Background and architecture

This section describes background on stream-processing systems for big data. It first discusses how layered real-time architectures are often organised and then presents a historical summary of how such systems have evolved over time.

Stream processing engines and tools

While the first generation of Stream Processing Engines (SPEs) was analogous to Database Management Systems (DBMSs), developed to perform long-running queries over dynamic data and consisting essentially of centralised solutions, the second generation introduced distributed processing and revealed challenges in load balancing and resource management. The third generation of solutions resulted in more general application frameworks that enable the specification and execution of user-defined functions (UDFs). This section presents a historical overview of data stream processing

Managed cloud systems

This section describes public cloud solutions for processing streaming data and presents details on how elasticity features are made available to developers and end users. The section primarily identifies prominent technological solutions for processing of streaming data and highlights their main features.

Elasticity in stream processing systems

Over time, several types of applications have benefited from resource elasticity, a key feature of cloud computing (Lorido-Botran et al., 2014). As highlighted by Lorido-Botran et al., elasticity in cloud environments is often accomplished via a Monitoring, Analysis, Planning and Execution (MAPE) process (a minimal code sketch of such a loop is given after the list) where:

  • 1. application and system metrics are monitored;

  • 2. the gathered information is analysed to assess current performance and utilisation, and optionally to predict future load;

  • 3. based on an auto-scaling policy, a plan is devised to adjust the resource allocation; and

  • 4. the plan is executed.
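
To ground these steps, the following minimal sketch wires them into a single control loop. It is only an illustration under simplifying assumptions: metrics() returns a simulated utilisation sample, the thresholds are arbitrary, and scale_to() stands in for whatever engine or cloud API would actually add or remove operator replicas; none of these names corresponds to a real system's interface.

    # Hedged sketch of a MAPE-style auto-scaler; metrics(), scale_to() and the
    # thresholds are hypothetical placeholders, not a real system's API.
    import random
    import time

    def metrics():
        """Monitoring: return current operator CPU utilisation (simulated here)."""
        return {"cpu_utilisation": random.uniform(0.0, 1.0)}

    def analyse(sample, high=0.8, low=0.3):
        """Analysis: classify the observed load against simple thresholds."""
        if sample["cpu_utilisation"] > high:
            return "overloaded"
        if sample["cpu_utilisation"] < low:
            return "underloaded"
        return "stable"

    def plan(state, replicas, max_replicas=16, min_replicas=1):
        """Planning: decide how many operator replicas to run next."""
        if state == "overloaded":
            return min(replicas + 1, max_replicas)   # scale out
        if state == "underloaded":
            return max(replicas - 1, min_replicas)   # scale in
        return replicas

    def scale_to(replicas):
        """Execution: apply the plan (a real system would call the engine/cloud API)."""
        print(f"running with {replicas} replica(s)")

    if __name__ == "__main__":
        replicas = 2
        for _ in range(5):                  # a few control-loop iterations
            state = analyse(metrics())
            replicas = plan(state, replicas)
            scale_to(replicas)
            time.sleep(0.1)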

Distributed and hybrid architecture

Most distributed data stream processing systems have been traditionally designed for cluster environments. More recently, architectural models have emerged for more distributed environments spanning multiple data centres or for exploiting the edges of the Internet (i.e., edge and fog computing (Hu et al., 2015, Sarkar et al., 2015)). Existing work aims to use the Internet edges by trying to place certain stream processing elements on micro data centres, often called Cloudlets (Satyanarayanan et al., 2009).
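
To illustrate the kind of placement decision such hybrid architectures face, the sketch below splits a small operator pipeline between an edge site (cloudlet) and the cloud with a naive greedy heuristic: keep early, data-reducing operators at the constrained edge and ship the reduced stream to the cloud once its rate is low enough. The pipeline, the cost figures, and the place() heuristic are hypothetical and are not taken from any of the surveyed systems.

    # Hypothetical placement heuristic for a cloud/edge (cloudlet) deployment.
    # The operator graph, costs and rules are illustrative only.

    # Dataflow: source -> filter -> aggregate -> sink, with estimated
    # per-operator CPU demand and output data rate (arbitrary units).
    operators = [
        {"name": "source",    "cpu": 1, "out_rate": 100},
        {"name": "filter",    "cpu": 2, "out_rate": 20},   # reduces data volume
        {"name": "aggregate", "cpu": 8, "out_rate": 5},
        {"name": "sink",      "cpu": 1, "out_rate": 0},
    ]

    EDGE_CPU_CAPACITY = 4        # constrained edge resources
    WAN_RATE_THRESHOLD = 25      # cross the WAN only once the rate is small

    def place(ops):
        """Greedy heuristic: keep early, data-reducing operators at the edge
        while capacity allows; move the remaining operators to the cloud."""
        placement, edge_cpu_used = {}, 0
        crossed_to_cloud = False
        for op in ops:
            fits_edge = edge_cpu_used + op["cpu"] <= EDGE_CPU_CAPACITY
            if not crossed_to_cloud and fits_edge:
                placement[op["name"]] = "edge"
                edge_cpu_used += op["cpu"]
                # Cross the WAN once the stream has been reduced enough.
                if op["out_rate"] <= WAN_RATE_THRESHOLD:
                    crossed_to_cloud = True
            else:
                placement[op["name"]] = "cloud"
                crossed_to_cloud = True
        return placement

    if __name__ == "__main__":
        print(place(operators))   # e.g. {'source': 'edge', 'filter': 'edge', ...}

Surveyed approaches replace this heuristic with far more sophisticated formulations (e.g. QoS-aware or cost-aware operator placement), but the trade-off they negotiate, limited edge capacity versus wide-area bandwidth and latency, is the same.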

Future directions

Organisations often demand not only online processing of large amounts of streaming data, but also solutions that can perform computations on large data sets by using models such as MapReduce. As a result, big data processing solutions employed by large organisations exploit hybrid execution models (e.g. using batch and online execution) that can span multiple data centres. In addition to providing elasticity for computing and storage resources, ideally, a big data processing service should be

Summary and conclusions

This paper discussed solutions for stream processing and techniques to manage resource elasticity. It first presented how stream processing fits in the overall data processing framework often employed by large organisations. Then it presented a historical perspective on stream processing engines, classifying them into three generations. After that, we elaborated on third-generation solutions and discussed existing work that aims to manage resource elasticity for stream processing engines. In

Acknowledgements

We thank Rodrigo Calheiros (Western Sydney University), Srikumar Venugopal (IBM Research Ireland), Xunyun Liu (The University of Melbourne), and Piotr Borylo (AGH University) for their comments on a preliminary version of this work. This work has been carried out in the scope of a joint project between the French National Center for Scientific Research (CNRS) and the University of Melbourne.

References (128)

  • Allen, S.T., et al., 2015. Storm Applied: Strategies for Real-time Event Processing.
  • Amazon CloudWatch,...
  • Amazon EC2 Container Service,...
  • Amazon Kinesis Firehose,...
  • Amini, L., Andrade, H., Bhagwan, R., Eskesen, F., King, R., Selo, P., Park, Y., Venkatramani, C., 2006. SPC: A...
  • Aniello, L., Baldoni, R., Querzoni, L., 2013. Adaptive Online Scheduling in Storm, pp....
  • Apache ActiveMQ,...
  • Apache Beam,...
  • Apache Edgent,...
  • Apache Flink,...
  • Apache flink - iterative graph processing, API Documentation 2017. URL...
  • Apache Kafka,...
  • Apache Samza,...
  • Apache Thrift,...
  • Apache Zookeeper,...
  • Arasu, A., Babcock, B., Babu, S., Cieslewicz, J., Datar, M., Ito, K., Motwani, R., Srivastava, U., Widom, J., 2004....
  • Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R.H., Konwinski, A., Lee, G., Patterson, D.A., Rabkin, A.,...
  • Azure IoT Hub,...
  • Azure Stream Analytics,...
  • Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In:...
  • Babcock, B., Babu, S., Motwani, R., Datar, M., 2003. Chain: Operator scheduling for memory minimization in data stream...
  • Balazinska, M., Balakrishnan, H., Stonebraker, M., 2004. Contract-based load management in federated distributed...
  • Borthakur, D., Gray, J., Sarma, J.S., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., Ranganathan, K., Molkov, D.,...
  • Boykin, O., et al., 2014. A framework for integrating batch and online MapReduce computations. Proc. VLDB Endow.
  • Cardellini, V., Grassi, V., Presti, F.L., Nardelli, M., 2015. Distributed QoS-aware scheduling in Storm. In:...
  • Cardellini, V., Grassi, V., Presti, F.L., Nardelli, M., 2016. Optimal operator placement for distributed stream...
  • Centenaro, M., Vangelista, L., Zanella, A., Zorzi, M., 2016. Long-range Communications in Unlicensed Bands: The Rising...
  • Chan, S., 2016. Apache quarks, watson, and streaming analytics: Saving the world, one smart sprinkler at a time....
  • Chen, W., Paik, I., Li, Z., 2017. Cost-aware streaming workflow allocation on geo-distributed data centers. IEEE...
  • Chen, J., DeWitt, D.J., Tian, F., Wang, Y., 2000. NiagaraCQ: A scalable continuous query system for internet databases....
  • Chen, Y., Alspaugh, S., Borthakur, D., Katz, R., 2012. Energy efficiency for large-scale MapReduce workloads with...
  • Cheng, B., Papageorgiou, A., Bauer, M., 2016. Geelytics: Enabling on-demand edge analytics over scoped data sources....
  • Unlocking Game-Changing Wireless Capabilities: Cisco and SITA help Copenhagen Airport Develop New Services for...
  • Clifford, S., Hardy, Q., 2013. Attention, shoppers: Store is tracking your cell. New York Times. URL...
  • Cloud Foundry,...
  • Dabek, F., Cox, R., Kaashoek, F., Morris, R., 2004. Vivaldi: A decentralized network coordinate system. In: Conference...
  • de Assuncao, M.D., et al., 2015. Big data computing and clouds: trends and future directions. J. Parallel Distrib. Comput.
  • Dean, J., Ghemawat, S. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51...
  • DistributedLog,...
  • Ellis, B., 2014. Real-time Analytics: Techniques to Analyze and Visualize Streaming Data.

    Marcos Dias de Assunção is a Researcher at Inria and a former research scientist at IBM Research - Brazil (2011–2014). He holds a Ph.D. in Computer Science and Software Engineering (2009) from The University of Melbourne, Australia. He has published more than 40 technical papers in conferences and journals. His interests include resource management in cloud computing, methods for improving the energy efficiency of data centres, and resource elasticity for data stream processing.

    Alexandre da Silva Veith is a PhD student at École Normale Supérieure (ENS) Lyon, France. He obtained his master's degree in applied computing at Unisinos University (2014). His interests include distributed systems, stream processing, resource elasticity, and auto-parallelisation of stream processing dataflows.

    Rajkumar Buyya is Professor of Computer Science and Software Engineering and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also the founding CEO of Manjrasoft, a spin-off company of the University, commercialising its innovations in Cloud Computing. He has authored over 400 publications and several text books. He is one of the most cited authors in computer science and software engineering worldwide.
