Big Data computing and clouds: Trends and future directions
Introduction
Society is becoming increasingly instrumented and, as a result, organisations are producing and storing vast amounts of data. Managing and gaining insights from the produced data is a challenge and key to competitive advantage. Analytics solutions that mine structured and unstructured data are important as they can help organisations gain insights not only from their privately acquired data, but also from large amounts of data publicly available on the Web [118]. The ability to cross-relate private information on consumer preferences and products with information from tweets, blogs, product evaluations, and data from social networks opens a wide range of possibilities for organisations to understand the needs of their customers, predict their wants and demands, and optimise the use of resources. This paradigm is popularly termed Big Data.
Despite the popularity of analytics and Big Data, putting them into practice is still a complex and time-consuming endeavour. As Yu [136] points out, Big Data offers substantial value to organisations willing to adopt it, but at the same time poses a considerable number of challenges for the realisation of such added value. An organisation willing to use analytics technology frequently acquires expensive software licences; employs large computing infrastructure; and pays for consulting hours of analysts who work with the organisation to better understand its business, organise its data, and integrate it for analytics [120]. This joint effort of organisation and analysts often aims to help the organisation understand its customers’ needs, behaviours, and future demands for new products or marketing strategies. Such effort, however, is generally costly and often lacks flexibility. Nevertheless, research and application of Big Data are being extensively explored by governments, as evidenced by initiatives from the USA [20] and the UK [106]; by academics, such as the bigdata@csail initiative from MIT [19]; and by companies such as Intel [122].
Cloud computing has been revolutionising the IT industry by adding flexibility to the way IT is consumed, enabling organisations to pay only for the resources and services they use. In an effort to reduce IT capital and operational expenditures, organisations of all sizes are using Clouds to provide the resources required to run their applications. Clouds vary significantly in their specific technologies and implementation, but often provide infrastructure, platform, and software resources as services [25], [13].
The most often claimed benefits of Clouds include offering resources in a pay-as-you-go fashion, improved availability and elasticity, and cost reduction. Clouds can prevent organisations from spending money on maintaining peak-provisioned IT infrastructure that they are unlikely to use most of the time. Whilst at first glance the value proposition of Clouds as a platform to carry out analytics is strong, there are many challenges that need to be overcome to make Clouds an ideal platform for scalable analytics.
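The cost argument behind peak provisioning can be illustrated with a small worked example. The sketch below is not taken from the article: the hourly demand profile, the function names, and the prices are all hypothetical, and it assumes one price per server-hour for each option. It compares keeping enough servers for the peak running all day against paying only for the servers needed in each hour.

```python
# Hypothetical number of servers needed in each hour of one day
# (illustrative values only, not measured data).
HOURLY_DEMAND = [2, 2, 2, 2, 2, 3, 5, 8, 12, 16, 20, 20,
                 18, 16, 14, 12, 10, 8, 6, 4, 3, 2, 2, 2]


def peak_provisioned_cost(demand, price_per_server_hour):
    """Cost of running enough servers for the peak during every hour."""
    return max(demand) * len(demand) * price_per_server_hour


def pay_as_you_go_cost(demand, price_per_server_hour):
    """Cost of paying only for the servers actually needed in each hour."""
    return sum(demand) * price_per_server_hour


def utilisation(demand):
    """Fraction of peak-provisioned capacity that is actually used."""
    return sum(demand) / (max(demand) * len(demand))


if __name__ == "__main__":
    # Assumed prices: the same nominal price for both options.
    owned = peak_provisioned_cost(HOURLY_DEMAND, 1.0)
    cloud = pay_as_you_go_cost(HOURLY_DEMAND, 1.0)
    print(f"peak-provisioned: {owned:.0f} server-hours paid")
    print(f"pay-as-you-go:    {cloud:.0f} server-hours paid")
    print(f"utilisation of peak capacity: {utilisation(HOURLY_DEMAND):.0%}")
```

With this assumed profile, the peak-provisioned option pays for 480 server-hours while only 191 are used. In practice the price per server-hour usually differs between owned and rented capacity, so the two functions accept the price as a separate argument; the comparison is only as good as the demand profile and prices supplied.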
In this article we survey approaches, environments, and technologies in areas that are key to Big Data analytics capabilities and discuss how they help build analytics solutions for Clouds. We focus on the most important technical issues in enabling Cloud analytics, but also highlight some of the non-technical challenges faced by organisations that want to provide analytics as a service in the Cloud. In addition, we describe a set of gaps and recommendations for the research community on future directions for Cloud-supported Big Data computing.
Background and methodology
Organisations are increasingly generating large volumes of data as a result of instrumented business processes, monitoring of user activity [14], [127], web site tracking, sensors, finance, and accounting, among other sources. With the advent of social network Web sites, users create records of their lives by daily posting details of activities they perform, events they attend, places they visit, pictures they take, and things they enjoy and want. This data deluge is often referred to as Big Data
Data management
One of the most time-consuming and labour-intensive tasks of analytics is preparation of data for analysis; a problem often exacerbated by Big Data as it stretches existing infrastructure to its limits. Performing analytics on large volumes of data requires efficient methods to store, filter, transform, and retrieve the data. Some of the challenges of deploying data management solutions on Cloud environments have been known for some time [1], [113], [82], and solutions to perform analytics on
Model building and scoring
The data storage and Data as a Service (DaaS) capabilities provided by Clouds are important, but for analytics, it is equally relevant to use the data to build models that can be utilised for forecasts and prescriptions. Moreover, as models are built based on the available data, they need to be tested against new data in order to evaluate their ability to forecast future behaviour. Existing work has discussed means to offload such activities–termed here as model building and scoring–to Cloud
Visualisation and user interaction
With the increasing amounts of data with which analyses need to cope, good visualisation tools are crucial. These tools should consider the quality of data and presentation to facilitate navigation [44]. The type of visualisation may need to be selected according to the amount of data to be displayed, to improve both display and performance. Visualisation can assist in the three major types of analytics: descriptive, predictive, and prescriptive. Many visualisation tools do not describe
Business models and non-technical challenges
In addition to providing tools that customers can use to build their Big Data analytics solutions on the Cloud, models for delivering analytics capabilities as services on a Cloud have been discussed in previous work [120]. Sun et al. [119] provide an overview of the current state of the art on the development of customised analytics solutions on customers’ premises and elaborate on some of the challenges to enable analytics and analytics as a service on the Cloud. Some of the potential
Other challenges
In business models where high-level analytics services may be delivered by the Cloud, human expertise cannot be easily replaced by machine learning and Big Data analysis [99]; in certain scenarios, there may be a need for human analysts to remain in the loop [91]. Management should adapt to Big Data scenarios and deal with challenges such as how to assist human analysts in gaining insights and how to explore methods that can help managers in making quicker decisions.
Application profiling is
Summary and conclusions
The amount of data currently generated by the various activities of society has never been so large, and it is being generated at an ever-increasing speed. This Big Data trend is seen by industries as a way of obtaining an advantage over their competitors: if one business is able to make sense of the information contained in the data more quickly than its competitors, it will be able to get more customers, increase the revenue per customer, optimise its operation, and reduce its costs. Nevertheless, Big
References (140)
- et al. Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst. (2009)
- et al. The Aneka platform and QoS-driven resource provisioning for elastic applications on hybrid Clouds. Future Gener. Comput. Syst. (2012)
- Data management in the cloud: Limitations and opportunities. IEEE Data Engineering Bulletin (2009)
- Amazon redshift,...
- Amazon data pipeline,...
- Amazon Elastic MapReduce...
- Amazon Kinesis,...
- et al. Cloud Analytics: Do We Really Need to Reinvent the Storage Stack?
- et al. Visual analytics tools for analysis of movement data. SIGKDD Explor. Newsl. (2007)
- Announcing Suro: Backbone of Netflix’s Data Pipeline,...
- A platform for eXtreme Analytics. IBM J. Res. Dev.
- Project Daytona: Data Analytics as a Cloud Service
- Beyond the Data Deluge. Science
- Enabling Analysts in Managed Services for CRM Analytics
- Next steps for citizen science. Science
- Apache Hadoop Goes Realtime at Facebook
- An Evaluation of Distributed Datastores Using the AppScale Cloud Platform
- Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency
- PVFS: A parallel file system for linux clusters
- Workload diversity and dynamics in big data analytics: implications to system designers
- Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis
- Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads. Proceedings of the VLDB Endowment
- Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly
- Experience in Continuous analytics as a Service (CaaaS)
- Analytics ecosystem transformation: A force for business model innovation
- CloudVista: Visual Cluster Exploration for Extreme Scale Data in the Cloud
- Customizing Computational Methods for Visual Analytics with Big Data. IEEE Computer Graphics and Applications
- MAD skills: new analysis practices for big data. Proceedings of the VLDB Endow
- Analytics over large-scale multidimensional data: the big data revolution!
- Competing on Analytics: The New Science of Winning
- Analytics at Work: Smarter Decisions, Better Results
- Dynamo: Amazon’s Highly Available Key-Value Store. SIGOPS Operating Systems Review
- Data management challenges of data-intensive scientific workflows
- Configurable and Extensible Multi-flows for Providing Analytics as a Service on the Cloud
- The KDD process for extracting useful knowledge from volumes of data. Commun. ACM
Marcos Dias de Assuncao, a former member of the research staff at IBM, is interested in workload migration, resource management in Cloud computing, and techniques for big data analysis. Marcos obtained his Ph.D. in Computer Science and Software Engineering (2009) from the University of Melbourne, Australia.
Dr. Rodrigo N. Calheiros is a Research Fellow in the Department of Computing and Information Systems, the University of Melbourne, Australia. Since 2010, he has been a member of the CLOUDS Lab of the University of Melbourne, where he researches various aspects of cloud computing. He has worked in the field of Cloud computing since 2008. His research interests also include virtualization, grid computing, and simulation and emulation of distributed systems.
Silvia Bianchi is a Research Staff Member in the Service Systems group of IBM Research Brazil. She joined IBM in March 2012. Silvia received a B.Sc. degree in Computer Science from the Federal University of Santa Catarina (UFSC), Brazil, an M.Sc. degree in Computer Science from Paul Sabatier University (UPS), France, and a Ph.D. in Computer Science from the University of Neuchatel (Unine) in Switzerland. She is currently involved in projects on Cloud Computing, Peer-to-Peer and Publish/Subscribe.
Marco A.S. Netto is a Researcher at IBM Research Brazil, where he works on Cloud Computing and Analytics related projects. Marco obtained his Ph.D. in Computer Science and Software Engineering (2010) from the University of Melbourne, Australia. His research interests are Cluster/Grid/Cloud Computing with focus on SLA management, virtualisation, performance evaluation, job scheduling, Quality-of-Service, and optimisation issues.
Dr. Rajkumar Buyya is Professor of Computer Science and Software Engineering and Director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also the founding CEO of Manjrasoft, a spin-off company of the University, commercialising its innovations in Cloud Computing. He has authored 400 publications and four textbooks. He is one of the highly cited authors in computer science and software engineering worldwide.