Big Data computing and clouds: Trends and future directions

https://doi.org/10.1016/j.jpdc.2014.08.003

Highlights

  • Survey of solutions for carrying out analytics and Big Data on Clouds.

  • Identification of gaps in technology for Cloud-based analytics.

  • Recommendations of research directions for Cloud-based analytics and Big Data.

Abstract

This paper discusses approaches and environments for carrying out analytics on Clouds for Big Data applications. It revolves around four important areas of analytics and Big Data, namely (i) data management and supporting architectures; (ii) model development and scoring; (iii) visualisation and user interaction; and (iv) business models. Through a detailed survey, we identify possible gaps in technology and provide recommendations for the research community on future directions on Cloud-supported Big Data computing and analytics solutions.

Introduction

Society is becoming increasingly instrumented and, as a result, organisations are producing and storing vast amounts of data. Managing and gaining insights from the produced data is a challenge and key to competitive advantage. Analytics solutions that mine structured and unstructured data are important as they can help organisations gain insights not only from their privately acquired data, but also from large amounts of data publicly available on the Web [118]. The ability to cross-relate private information on consumer preferences and products with information from tweets, blogs, product evaluations, and data from social networks opens a wide range of possibilities for organisations to understand the needs of their customers, predict their wants and demands, and optimise the use of resources. This paradigm is popularly termed Big Data.

Despite the popularity of analytics and Big Data, putting them into practice is still a complex and time-consuming endeavour. As Yu [136] points out, Big Data offers substantial value to organisations willing to adopt it, but at the same time poses a considerable number of challenges for the realisation of such added value. An organisation willing to use analytics technology frequently acquires expensive software licences; employs large computing infrastructure; and pays for consulting hours of analysts who work with the organisation to better understand its business, organise its data, and integrate it for analytics [120]. This joint effort of organisation and analysts often aims to help the organisation understand its customers’ needs, behaviours, and future demands for new products or marketing strategies. Such effort, however, is generally costly and often lacks flexibility. Nevertheless, research and application of Big Data are being extensively explored by governments, as evidenced by initiatives from the USA [20] and the UK [106]; by academics, such as the bigdata@csail initiative from MIT [19]; and by companies such as Intel [122].

Cloud computing has been revolutionising the IT industry by adding flexibility to the way IT is consumed, enabling organisations to pay only for the resources and services they use. In an effort to reduce IT capital and operational expenditures, organisations of all sizes are using Clouds to provide the resources required to run their applications. Clouds vary significantly in their specific technologies and implementation, but often provide infrastructure, platform, and software resources as services  [25], [13].

The most often claimed benefits of Clouds include offering resources in a pay-as-you-go fashion, improved availability and elasticity, and cost reduction. Clouds can prevent organisations from spending money on maintaining peak-provisioned IT infrastructure that they are unlikely to use most of the time. Whilst at first glance the value proposition of Clouds as a platform to carry out analytics is strong, there are many challenges that need to be overcome to make Clouds an ideal platform for scalable analytics.

In this article we survey approaches, environments, and technologies in areas that are key to Big Data analytics capabilities and discuss how they help build analytics solutions for Clouds. We focus on the most important technical issues in enabling Cloud analytics, but also highlight some of the non-technical challenges faced by organisations that want to provide analytics as a service in the Cloud. In addition, we describe a set of gaps and recommendations for the research community on future directions for Cloud-supported Big Data computing.

Section snippets

Background and methodology

Organisations are increasingly generating large volumes of data as a result of instrumented business processes, monitoring of user activity [14], [127], web site tracking, sensors, finance, and accounting, among other sources. With the advent of social network Web sites, users create records of their lives by posting daily details of the activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data

Data management

One of the most time-consuming and labour-intensive tasks of analytics is preparation of data for analysis; a problem often exacerbated by Big Data as it stretches existing infrastructure to its limits. Performing analytics on large volumes of data requires efficient methods to store, filter, transform, and retrieve the data. Some of the challenges of deploying data management solutions on Cloud environments have been known for some time  [1], [113], [82], and solutions to perform analytics on

Model building and scoring

The data storage and Data as a Service (DaaS) capabilities provided by Clouds are important, but for analytics, it is equally relevant to use the data to build models that can be utilised for forecasts and prescriptions. Moreover, as models are built based on the available data, they need to be tested against new data in order to evaluate their ability to forecast future behaviour. Existing work has discussed means to offload such activities–termed here as model building and scoring–to Cloud

Visualisation and user interaction

With the increasing amounts of data with which analyses need to cope, good visualisation tools are crucial. These tools should consider the quality of data and presentation to facilitate navigation [44]. The type of visualisation may need to be selected according to the amount of data to be displayed, to improve both display and performance. Visualisation can assist in the three major types of analytics: descriptive, predictive, and prescriptive. Many visualisation tools do not describe

Business models and non-technical challenges

In addition to providing tools that customers can use to build their Big Data analytics solutions on the Cloud, models for delivering analytics capabilities as services on a Cloud have been discussed in previous work  [120]. Sun et al.  [119] provide an overview of the current state of the art on the development of customised analytics solutions on customers’ premises and elaborate on some of the challenges to enable analytics and analytics as a service on the Cloud. Some of the potential

Other challenges

In business models where high-level analytics services may be delivered by the Cloud, human expertise cannot be easily replaced by machine learning and Big Data analysis  [99]; in certain scenarios, there may be a need for human analysts to remain in the loop  [91]. Management should adapt to Big Data scenarios and deal with challenges such as how to assist human analysts in gaining insights and how to explore methods that can help managers in making quicker decisions.

Application profiling is

Summary and conclusions

The amount of data currently generated by the various activities of society has never been so large, and it is being generated at an ever-increasing speed. This Big Data trend is seen by industries as a way of gaining an advantage over their competitors: if a business is able to make sense of the information contained in the data more quickly, it will be able to win more customers, increase the revenue per customer, optimise its operations, and reduce its costs. Nevertheless, Big

Marcos Dias de Assuncao, a former member of the research staff at IBM, is interested in workload migration, resource management in Cloud computing, and techniques for big data analysis. Marcos obtained his Ph.D. in Computer Science and Software Engineering (2009) from the University of Melbourne, Australia.

References (140)

  • Apache S4: distributed stream computing platform,...
  • Apache Hadoop,...
  • Apache Mahout,...
  • Apache Samza,...
  • M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R.H. Katz, A. Konwinski, G. Lee, D.A. Patterson, A. Rabkin, I. Stoica,...
  • Attention, shoppers: Store is tracking your cell, New York Times. URL...
  • A. Balmin et al., A platform for eXtreme Analytics, IBM J. Res. Dev. (2013)
  • R.S. Barga et al., Project Daytona: Data Analytics as a Cloud Service
  • G. Bell et al., Beyond the Data Deluge, Science (2009)
  • I. Bhattacharya et al., Enabling Analysts in Managed Services for CRM Analytics
  • bigdata@csail,...
  • ‘Big Data’ has Big Potential to Improve Americans’ Lives, Increase Economic Opportunities, Committee on Science, Space...
  • Birst Inc.,...
  • R. Bonney et al., Next steps for citizen science, Science (2014)
  • D. Borthakur et al., Apache Hadoop Goes Realtime at Facebook
  • C. Bunch et al., An Evaluation of Distributed Datastores Using the AppScale Cloud Platform
  • B. Calder et al., Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
  • P.H. Carns et al., PVFS: A parallel file system for linux clusters
  • J. Chang et al., Workload diversity and dynamics in big data analytics: implications to system designers
  • Y. Chen et al., Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
  • Y. Chen et al., Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads, Proceedings of the VLDB Endowment (2012)
  • H. Chen et al., Business Intelligence and Analytics: From Big Data to Big Impact, MIS Quarterly (2012)
  • Q. Chen et al., Experience in Continuous analytics as a Service (CaaaS)
  • Y. Chen et al., Analytics ecosystem transformation: A force for business model innovation
  • K. Chen et al., CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud
  • N. Chohan, A. Gupta, C. Bunch, K. Prakasam, Hybrid Cloud Support for Large Scale Analytics and Web Processing, in:...
  • J. Choo et al., Customizing Computational Methods for Visual Analytics with Big Data, IEEE Computer Graphics and Applications (2013)
  • Cloud9 Analytics,...
  • J. Cohen et al., MAD skills: new analysis practices for big data, Proceedings of the VLDB Endowment (2009)
  • A. Cuzzocrea et al., Analytics over large-scale multidimensional data: the big data revolution!
  • DataDirect Cloud, http://cloud.datadirect.com/...
  • T.H. Davenport et al., Competing on Analytics: The New Science of Winning (2007)
  • T.H. Davenport et al., Analytics at Work: Smarter Decisions, Better Results (2010)
  • J. Davey, F. Mansmann, J. Kohlhammer, D. Keim, The future internet, Springer-Verlag, Berlin, Heidelberg, 2012, Ch....
  • J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM...
  • G. DeCandia et al., Dynamo: Amazon’s Highly Available Key-Value Store, SIGOPS Operating Systems Review (2007)
  • E. Deelman et al., Data management challenges of data-intensive scientific workflows
  • P. Deepak et al., Configurable and Extensible Multi-flows for Providing Analytics as a Service on the Cloud
  • P. Deyhim, Best practices for Amazon EMR, White paper, Amazon (2013). URL...
  • U. Fayyad et al., The KDD process for extracting useful knowledge from volumes of data, Commun. ACM (1996)


    Dr. Rodrigo N. Calheiros is a Research Fellow in the Department of Computing and Information Systems, the University of Melbourne, Australia. He has been a member of the CLOUDS Lab of the University of Melbourne since 2010, where he researches various aspects of cloud computing. He has worked in the field of Cloud computing since 2008. His research interests also include virtualization, grid computing, and simulation and emulation of distributed systems.

    Silvia Bianchi is a Research Staff Member in the Service Systems group of IBM Research Brazil. She joined IBM in March 2012. Silvia received a B.Sc. degree in Computer Science from the Federal University of Santa Catarina (UFSC), Brazil, an M.Sc. degree in Computer Science from Paul Sabatier University (UPS), France, and a Ph.D. in Computer Science from the University of Neuchatel (Unine) in Switzerland. She is currently involved in projects on Cloud Computing, Peer-to-Peer and Publish/Subscribe.

    Marco A.S. Netto is a Researcher at IBM Research Brazil, where he works on Cloud Computing and Analytics related projects. Marco obtained his Ph.D. in Computer Science and Software Engineering (2010) from the University of Melbourne, Australia. His research interests are Cluster/Grid/Cloud Computing with a focus on SLA management, virtualisation, performance evaluation, job scheduling, Quality-of-Service, and optimisation issues.

    Dr. Rajkumar Buyya is Professor of Computer Science and Software Engineering and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also the founding CEO of Manjrasoft, a spin-off company of the University, commercialising its innovations in Cloud Computing. He has authored 400 publications and four textbooks. He is one of the highly cited authors in computer science and software engineering worldwide.
