1 Introduction

Nowadays, the computational complexity of the processes and decisions made on a daily basis depends on the availability of high-quality data, a condition that is often met in practice thanks to the massive digitization of traditional activity sectors. Unfortunately, such information is often produced at unprecedented rates and in an unstructured fashion, outstripping the scales at which traditional data management systems collected and mined it. This situation eventually gave rise to the so-called Big Data paradigm, which refers to the collection, analysis and visualization of data at scales that surpass the capacities of traditional infrastructures for information storage and processing. The core concept of Big Data is the derivation of alternative and efficient computing means to ingest, retrieve, process and visualize large amounts of data [1, 2]. The Internet of Things (IoT) and Cloud Computing are standard bearers of the digitization process currently under way in different sectors, as they support the connectivity and management of the devices in charge of data gathering, delivery, processing, and computation under different architectural strategies. All in all, data play a paramount role in both paradigms, the difference being the requirements and specifications imposed (e.g., processing latency or transmission bandwidth).

In this context, notable past milestones (e.g., Map-Reduce programming, complex event processing or NoSQL databases) have led to a relatively high degree of maturity of Big Data technologies. However, algorithms for information fusion, processing and data mining have not kept pace with these technologies. Indeed, only a fraction of the classical approaches for drawing knowledge from data have been adapted to the new requirements and computing procedures brought by Big Data technologies. Although adaptations of these approaches keep growing at a steady pace, many remain unaddressed. The complexity, heterogeneity, dynamism and inherently distributed nature of Big Data technologies further complicate this task. Even models that adapt straightforwardly to Big Data computing environments (e.g., ensembles for predictive modeling) can be severely affected by the obsolescence of the information from which they are learned [3], or by the failure of a node in a distributed Map-Reduce computing grid [4]. All in all, data fusion, processing, learning and visualization of Big Data require a major focus not only on tailoring the algorithmic steps underlying each model/technique to the computing technologies underneath, but also on endowing them with higher levels of resilience against failures, adaptation to changes in data and the accommodation of unprecedented levels of data volume, heterogeneity and veracity. In short, algorithmic adaptation must be coupled with systems’ adaptation.

In light of the above, Big Data environments call for computationally efficient techniques that meet such requirements by embracing self-learning and adaptation capabilities at the core of their design. This opens up a significant opportunity for bio-inspired computation, which has gained remarkable momentum in the Big Data literature. Inspired by intelligent behavioral patterns observed in nature, many practitioners in the scientific community have emulated such processes in the form of computational algorithms, aiming to harness the adaptability and self-learning capabilities of biological systems to face complex problems [5]. Consequently, a wide range of sources of inspiration has historically been considered for the design and development of bio-inspired methods for different computational problems. For optimization problems, examples include the behavioral patterns of animals [6, 7], genetic inheritance mechanisms [8] or physical phenomena [9], among many others. With regard to modeling, connections among neurons in the brain have stimulated a flurry of neural network approaches, arriving at the current myriad of Deep Learning models, all sharing a similar bio-inspired rationale [10].

Bio-inspired computation can provide promising solutions to the acknowledged drawbacks of Big Data processing in IoT and Cloud Computing environments, such as poor scalability, security issues, task distribution, fault tolerance, or low performance in traditional information technology frameworks. New optimization, scaling and management approaches can benefit greatly from the adaptability of bio-inspired methods, even more so when considering the different dimensions of Big Data (volume, variety, velocity, veracity and variability), which increase the complexity of the problems to be solved. Fortunately, the synergy between Big Data and bio-inspired computation is clear and meaningful. On the one hand, bio-inspired computation can act as a beacon for attaining near-optimal solutions to the complex modeling and optimization problems that arise in the Big Data paradigm. For instance, bio-inspired heuristic methods for optimization can efficiently accommodate the dynamic nature of the objectives and constraints of an optimization problem characterizing load balancing in a cloud computing grid [11]. Fuzzy logic can help account for the uncertainty of Big Data decision making, especially when data sources are unreliable or the decision is made in a context subject to exogenous, unconsidered factors [12]. The benefits resulting from this synergistic relationship are evidenced by new Big Data infrastructures, tools and technologies that have adopted bio-inspired algorithms to reach a higher level of efficiency in their tasks. Examples of technologies that take advantage of the capabilities of bio-inspired algorithms include, among many others, NoSQL databases [13,14,15], load planners/schedulers [16], and tools assisting analytical tasks such as feature selection [17], dimensionality reduction [18] or data fusion [19].
On the other hand, from the perspective of bio-inspired computation, Big Data offers great volumes and varieties of data, together with the efficient implementation of solvers through new technologies that support parallel, distributable and scalable workloads. In this context, there are numerous studies and surveys focused on Big Data analytics [20]. All evidence confirms that research efforts on this topic have been growing lately, which calls for reference material that organizes the achievements made so far and connects them with a prospect of valuable research directions.

The goal of this survey is to answer this call by enumerating and thoroughly examining the principal points of connection between Big Data technologies and bio-inspired computation. To this end, we undertake several interconnected tasks, all departing from a critical assessment of the recent literature:

  • First, we review the main concepts related to Big Data and bio-inspired computation, settling common grounds for an adequate understanding of our study.

  • We examine contributed works where Big Data infrastructure, tools and technologies have been improved through bio-inspired computation approaches.

  • We exhaustively review how bio-inspired algorithms have enhanced the Big Data domain, classifying them into different steps of the Big Data life cycle (i.e., data fusion, processing, learning and visualization).

  • We explore and compare the specific scope of the problems tackled so far by the community, identifying further applications that can be addressed in the future.

  • Finally, we provide our envisioned future for this research in the form of a prospect of challenges, trends and research directions that can be pursued for stepping further in this research topic.

This work is structured as follows. In Sect. 2 we present in detail the concepts of both Big Data and bio-inspired computing. Section 3 delves into the synergies between these two paradigms, providing a taxonomy to classify advances reported so far and a critical review of the existing literature. Next, we introduce current challenges and open opportunities in Sect. 4. Section 5 ends the survey by summarizing the main conclusions and by providing an outlook on the future of this exciting field.

2 Big data and bio-inspired computation: first concepts

As has been anticipated in the introduction, this section first defines concepts underneath Big Data (Sect. 2.1) and bio-inspired computation (Sect. 2.2). On the one hand, we focus on the Big Data life cycle phases, along with their associated technologies. On the other hand, we classify bio-inspired algorithms as per the kind of problems they can solve, as well as their biological source of inspiration. This allows detecting which bio-inspired algorithms have demonstrated a better off-the-shelf applicability to large data volumes, or have been specifically designed for such a purpose.

2.1 Big data paradigm

Briefly explained, Big Data is a concept that encompasses large volumes of high-speed, complex, variable and heterogeneous data, along with the advanced technologies and techniques that enable their collection, storage, processing/analysis and visualization. This specific definition expands the one provided by Gartner in [1]. In this subsection, we first discuss the relationships between Big Data and bio-inspired computation, which have stimulated the research conducted so far in this field. We next describe in detail the Big Data life cycle, which is of capital importance for properly understanding the research carried out in this area and the subsequent analysis of the literature.

2.1.1 Big data dimensions and bio-inspired computation

There is a clear consensus within the community that Big Data relies on five main features of data: volume, velocity, variety, variability and veracity [21]. All these characteristics are critical and determine the way data is managed across the environment. They can be defined as follows [22]: (1) volume represents the magnitude of the data in terms of size; (2) velocity refers to the speed at which the information is produced, received and processed; (3) variety is related to the heterogeneity of data produced by different domains; (4) variability refers to non-stationary changes that affect data, whose effects on the system and/or models must be accommodated over time; and (5) veracity deals with the provenance and reliability of the collected information. These five dimensions are cross-domain and, unless properly resolved, can hinder the adoption of data-based operational workflows in a diversity of applications.

Fortunately, bio-inspired computation can effectively help legacy technologies cope with the challenges stemming from the above features. In terms of volume, for example, bio-inspired optimization metaheuristics can make traditional data mining models feasible for large datasets under assorted strategies, including instance reduction, feature selection, or model simplification [23]. Indeed, making the optimization problems formulated in these strategies compatible with the typical volumes of Big Data is among the motivations for the upsurge of large-scale global optimization, a subarea within bio-inspired optimization that deals with problems of very high dimensionality (thousands to millions of decision variables [24]). Bio-inspired solvers have also been proven to excel at data integration, aggregation and fusion [25, 26], standing out as essential drivers for dealing with the variety and variability dimensions of Big Data. Lastly, the velocity and veracity dimensions affect data and service quality, as well as monitoring and security problems. Examples of bio-inspired optimization algorithms dealing with these issues can be found in [27, 28], whereas elements from fuzzy logic have also been utilized in Big Data environments subject to data uncertainty (see [29, 30] and references in the comprehensive overview in [31]).

Figure 1 graphically summarizes each of the five dimensions of Big Data described above, as well as the problems that typically arise from each of them. Along with this information, we include citations to several landmark reviews focusing on how bio-inspired computation has managed to overcome the barriers imposed by the Big Data paradigm.

Fig. 1 Big Data dimensions associated with typical problems liable to be solved by bio-inspired computation. Nodes colored in blue correspond to computational tasks, whereas those colored in light brown indicate specific applications where the Big Data dimension indicated in their parent node is particularly relevant. Computational requirements enabled by bio-inspired computation are indicated in the gray box set in the background

2.1.2 Big data life cycle

A logical line of thinking springing from the aforementioned dimensions is that Big Data requires highly adaptive techniques to efficiently process large quantities of data within tolerable computational times. Following [32], three questions must be asked regarding the management and treatment of data: (1) Is it technologically affordable to capture and store all data? (2) Is it possible to clean, enrich, and analyze the data? and (3) Is it possible to retrieve, search, integrate, and visualize the data? Answering these three questions (which can be summarized as the store-process-manage triplet) is essential for extracting valuable insights from data in practical use cases.

Considering these three technological concerns, a common way to orchestrate the heterogeneity of technologies under the Big Data paradigm is around the Big Data life cycle, which comprises data storage, data fusion, data learning, searching, sharing, transferring, visualization, querying, updating and information privacy. Among these new areas, the ones that best fit with the main philosophy of bio-inspired computation, and those in which solutions of greater value can be provided, are the following:

  • Data Fusion: This phase represents the process of merging multiple data sources in order to produce consistent, accurate, and useful information. Data fusion is clearly related to the variety feature of Big Data, and its complexity stems from the large volumes of data that must be fused. Bio-inspired algorithms inherently provide great benefits for this purpose, with an increasing prevalence of model-based data fusion built on Deep Learning neural network models. In fact, the main concept of data fusion originates from the ability of humans and animals to incorporate information from multiple senses to improve their monitoring capabilities. The design flexibility and unified learning framework of current Deep Learning models are among the enablers of so-called model-based data fusion. Indeed, hierarchical features can nowadays be learned from image, video, text, and other forms of data in the spatial and/or sequential (time) domains, which permits learning them together by assembling neural parts devoted to each domain. In these settings, information sharing is realized by exchanging and sharing parts of the neural networks, which are trained together for the task at hand. Therefore, Deep Learning methods can effectively implement Data Fusion through multi-modal feature extraction over a mixture of neural units specialized for the sequential domain (e.g., LSTM or GRU cells) and the spatial domain (convolutional filters). Once assembled, the training process of the overall neural network (gradient backpropagation) tunes the parameters of these units so that they learn which features to extract and fuse for solving the task at hand. Emerging learning paradigms such as transfer learning, domain adaptation and multitask learning are also largely harnessing the possibilities brought by neural computation [33].

  • Data Storage: This stage refers to the need for effective repositories capable of storing and efficiently managing huge volumes of data. This process poses a remarkable challenge in terms of distribution, scalability and performance. Additional problems to face in this regard are the concurrency and consensus issues derived from writing and accessing data in the repositories. Bio-inspired algorithms are well suited to this purpose, since most of them consider distribution and parallelism intrinsically in their design. Furthermore, data reduction has also leveraged bio-inspired computation in a number of representative works [34, 35].

  • Data Processing: This phase concerns the proper processing of all the merged and stored data. Any technique developed for this purpose must accommodate the great amount of information available in a Big Data context and the rate at which it is produced. Once again, the inherent parallelism of bio-inspired methods makes them promising alternatives for managing the distribution of large volumes of data, particularly with regard to feature selection, instance filtering and data imputation, as well as in streaming environments [36]. Likewise, a large fraction of the data in the context of Big Data is composed of images/videos. Consequently, image prioritization/video summarization technologies are key contributors to data reduction.

  • Data Learning: This step concerns all processes aimed at retrieving relevant knowledge from the available Big Data. At this point we stress the paramount relevance that bio-inspired computation has held in data mining, with a plethora of studies exhaustively reviewing the activity in this confluence of technologies over the years. However, the interest in extrapolating these prior achievements in bio-inspired data mining to the scales, speeds and variety of Big Data has not yet reached its full potential. Neural computation draws extensively on the biological mechanisms inside the human brain. Modern variants such as convolutional neural networks hinge on how the visual cortex operates when fed with an image. Modern neural computation, collectively referred to as Deep Learning, can be conceived as a family of bio-inspired computation techniques in its own right that requires heavy loads of data for learning its constituent parameters. However, as we will show later, the possibilities of bio-inspired computation span far beyond the biological principles of the models and algorithms currently utilized.

  • Data Visualization: Once the learning model has produced an insight from Big Data, this phase undertakes the visualization of large volumes of data and information, coupled with the added knowledge extracted by the learning models in use. Visualization is a challenge that has not yet been addressed as thoroughly as other phases of the life cycle, possibly due to the strong link between Artificial Intelligence, computer graphics and cognitive sciences [37, 38].
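The multi-modal fusion idea described under Data Fusion above can be illustrated with a minimal forward-pass sketch. This is only an illustration under stated assumptions: the function names, the toy weights and the inputs are hypothetical, a plain tanh recurrent unit stands in for the LSTM/GRU cells mentioned in the text, and no training is performed (in practice all parameters would be tuned jointly by gradient backpropagation).

```python
import math

def conv1d_features(signal, kernel):
    """Spatial branch: valid 1-D cross-correlation followed by max pooling."""
    k = len(kernel)
    responses = [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]
    return [max(responses)]

def recurrent_features(sequence, w_in=0.5, w_rec=0.8):
    """Sequential branch: plain tanh recurrent unit (stand-in for LSTM/GRU).

    The weights are hypothetical constants; returns the final hidden state.
    """
    hidden = 0.0
    for x in sequence:
        hidden = math.tanh(w_in * x + w_rec * hidden)
    return [hidden]

def fuse_and_score(image_row, time_series, kernel, head_weights, head_bias=0.0):
    """Concatenate the features of both branches and apply a logistic head."""
    fused = conv1d_features(image_row, kernel) + recurrent_features(time_series)
    score = sum(w * f for w, f in zip(head_weights, fused)) + head_bias
    return fused, 1.0 / (1.0 + math.exp(-score))

# Toy inputs: one row of pixel intensities and a short sensor time series.
fused, probability = fuse_and_score(
    image_row=[0, 1, 2, 1, 0],
    time_series=[0.2, 0.4, 0.1],
    kernel=[1, 1],
    head_weights=[0.3, -0.7],
)
```

The point of the sketch is the concatenation step: each branch extracts features from its own domain, and a shared head operates on the fused vector, so that a joint training procedure could adjust both branches for the same task.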

With all this, Fig. 2 shows the five phases of the data life cycle described above, which are used to convert simple, raw data into valuable knowledge. By carrying out these steps, the sixth and last dimension of Big Data is attained: value.

From the technological point of view, these phases need to be efficiently implemented using suitable tools and mechanisms. The techniques and technologies involved in this process are jointly integrated into a single system, forming what is called a Big Data platform, which resides in complex server infrastructures. Additional technologies applied to Big Data include massively parallel-processing (MPP) systems, search-based applications, data mining grids, distributed file systems, distributed databases or NoSQL databases, and cloud-based infrastructure (applications, storage and computing resources).

All the components that comprise a Big Data architecture have different technological requirements and characteristics, which depend on the purpose they should cover in the ecosystem. In accordance with the increase in these requirements, adopted solutions usually tend to be a set of integrated and suitable tools for data analytics and Big Data. These combined systems are called Big Data suites. In the specific context of security [39], several technologies can be found in the Big Data technology stack. In this paper, we analyze the initiatives proposed to improve any of the above technologies (from cloud technologies to analysis assistance tools) by means of bio-inspired computation.

In what refers to infrastructure, Big Data technologies [40] support three options: on-premise, cloud and hybrid. Thus, depending on the approach, the infrastructure management complexity and the needed tools vary significantly. In this case, bio-inspired metaheuristics have demonstrated a remarkable performance when solving complex problems associated with infrastructure and technologies, such as resource allocation and management [41, 42], job scheduling [43], log synchronization and information security [44], or anomaly detection in the management and health of the IT infrastructure [45]. We will later examine them thoroughly.

Fig. 2 Phases of the Big Data life cycle

2.2 Fundamentals of bio-inspired computation

In a nutshell, bio-inspired computation [46] can be defined as the combination of computational intelligence [47] and collective intelligence behaviors [48]. Usually, the computational methods classified in this category are conceived for efficiently solving highly complex problems. These solvers are designed using as a source of inspiration a wide variety of principles and phenomena encountered in nature and biological systems. The main reason for mimicking such observed behaviors when solving complex computational tasks is to harness the adaptive, reactive and distributed features of these natural systems. In this way, every aspect that defines the solving method is modeled by mirroring living phenomena and biological systems, such as the evolution of species [8], immune systems [49], the human brain [50], or the collective behavior of animals [6, 51, 52], among others.

In this survey, we focus our attention on four specific areas that can be placed within the wider field of bio-inspired computation: neural networks [53, 54], Evolutionary Computation [55, 56], Swarm Intelligence [57] and Fuzzy Systems [58, 59]. All four are closely related to the Big Data paradigm and to the main problems arising in this field, given their suitability for application in this area [60]. Our decision to undertake this study stems from the findings we recently reported in [60]. In that work we presented a taxonomy of bio-inspired computational intelligence, highlighting four major families: Natural Computing, Artificial Immune Systems, Fuzzy Systems and Neural Networks. In the present survey, we do not consider Artificial Immune Systems given the lack of works reporting advances in the application of this family of algorithms to Big Data systems. This scarcity, however, unveils an interesting research direction in security that we will later discuss in detail.

A convenient criterion to organize all techniques under the bio-inspired computation umbrella is the kind of computational problems that can be solved. As such, computational intelligence techniques and methods can undertake three generic problems (Fig. 3), which differ from each other depending on the unknown information to be solved by the technique at hand [56]:

  1. Modeling or system identification, in which, given a prior set of inputs and their corresponding outputs, the goal is to determine the model that best relates both, so that a new output can be produced for any given input. All predictive modeling techniques belong to this first category.

  2. Simulation, in which, given input data and an assumed expression for the system, the goal is to observe the properties of the output it produces. A clear example of simulation in the wide sense is clustering: given input data, a clustering algorithm is applied to observe whether the output exhibits a certain group structure.

  3. Optimization, in which, given a system and a measure of the quality of its output, the goal is to find the input that maximizes that quality. This is what bio-inspired meta-heuristic algorithms do.
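The distinction among the three generic problems can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the system is a hypothetical linear map y = w·x + b, modeling is done by ordinary least squares, and optimization is a plain search over a candidate grid rather than a bio-inspired solver.

```python
def fit_line(xs, ys):
    """Modeling: inputs and outputs are known, the system (w, b) is unknown."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b

def simulate(w, b, xs):
    """Simulation: system and inputs are known, the outputs are observed."""
    return [w * x + b for x in xs]

def optimize(w, b, quality, candidates):
    """Optimization: system and quality measure are known, the input is unknown."""
    return max(candidates, key=lambda x: quality(w * x + b))

# Modeling: recover the system from observed input/output pairs.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# Simulation: observe the output of the identified system for a new input.
outputs = simulate(w, b, [4])

# Optimization: find the input whose output is closest to a target of 5.
best_input = optimize(
    w, b,
    quality=lambda y: -(y - 5.0) ** 2,
    candidates=[i / 10 for i in range(41)],
)
```

In each call, two of the three elements (input, system, output quality) are given and the third is sought, which is the criterion used above to separate the three problem types.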

Fig. 3 Conceptual diagram showing the three tasks that can be tackled with Computational Intelligence

We now define the four aforementioned large families of bio-inspired computation methods and their connection to the above generic problems. Table 1 complements these explanations with an excerpt of the particular problems in the context of Big Data that such families can address, as well as the affected Big Data dimensions.

2.2.1 Neural networks

Neural networks are computational models inspired by brain modeling studies. They consist of a set of units, called artificial neurons, connected together to transmit signals. The smallest unit of analysis of neural networks in the computational domain is the so-called neuron or perceptron. An important feature of neural networks is their ability to learn from their environment. Neural networks have been widely applied to supervised, unsupervised, hybrid and reinforcement learning [61]. For this reason they have been extensively applied to modeling problems such as classification, regression or matching, as well as to simulation problems via unsupervised neural approaches such as Kohonen maps, auto-encoders, Hebbian learning and the like.
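As a minimal sketch of the smallest unit mentioned above, the following code trains a single perceptron with the classical perceptron learning rule. The choice of task (the logical AND gate), the learning rate and the number of epochs are assumptions made for illustration only.

```python
def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Perceptron rule: shift the weights in proportion to the prediction error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

# Supervised learning of a linearly separable toy problem (logical AND).
AND_GATE = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_GATE)
predictions = [predict(weights, bias, x) for x, _ in AND_GATE]
```

Because the AND gate is linearly separable, the learning rule settles on weights that classify all four samples correctly; deeper networks stack many such units and replace this rule with gradient-based training.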

2.2.2 Evolutionary computation

Evolutionary Computation (EC) comprises a family of algorithms for global optimization inspired by biological evolution. Recurrent ideas that have been used as inspiration so far include, among others, the survival of the fittest, natural selection, reproduction, mutation, competition and symbiosis. To properly emulate the processes involved in nature and the natural selection mechanism, candidate solutions are organized in a population, and a fitness function determines how well they are adapted to the environment in which the solutions live. This fitness should be strictly related to the problem at hand, being proportional to the quality of the solution to that problem. The most representative EC techniques, which differ in the way they represent and evolve individuals, are as follows: (1) genetic programming, in which individuals are represented as executable programs [62]; (2) evolutionary programming, which is phenotype-oriented [63]; (3) evolutionary strategies, which can be deemed the evolution of evolution [64]; (4) differential evolution, a population-based search strategy in which the modification of individuals is based on the difference between them [65]; (5) genetic algorithms, population-based techniques based on the Darwinian theory of the evolution of species [8]; (6) cultural evolution, which models adaptation to the environment at faster rates than biological evolution [66]; and (7) co-evolution, distributed solvers in which multiple subpopulations evolve jointly [67].

Up to now, EC has been applied in a wide spectrum of knowledge fields. We refer interested readers to the findings reported in works such as [68,69,70] for an analysis of recent research trends in some specific applications.
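The population, fitness, selection, reproduction and mutation ideas listed above can be sketched with a small genetic algorithm. This is a hedged illustration rather than any specific method from the cited works: the OneMax fitness (count of ones in a bit string), tournament selection, one-point crossover, single-individual elitism and all parameter values are assumptions chosen for brevity.

```python
import random

def one_max(bits):
    """Toy fitness: the number of ones in the bit string."""
    return sum(bits)

def genetic_algorithm(fitness, n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.05, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness observed at each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_population = [ranked[0][:]]  # elitism: keep the best individual
        while len(next_population) < pop_size:
            # Tournament selection: the fittest of three random individuals.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n_bits - 1)
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = genetic_algorithm(one_max)
```

Thanks to elitism, the best fitness recorded in `history` never decreases from one generation to the next; the other EC variants listed above differ mainly in how individuals are encoded and which variation operators replace crossover and mutation.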

Fig. 4 Diagram depicting the differences between (a) EC and (b) SI

2.2.3 Swarm intelligence

Swarm Intelligence (SI) is a specific branch of Computational Intelligence also dedicated to the optimization of complex problems through the study and adaptation of the collective behavior of decentralized, self-organized agents. SI methods usually consist of a population (swarm) of simple agents, which evolve jointly over time through local interactions with one another and with their environment. Furthermore, although the interactions among individuals are determined beforehand, social interaction plays a key role in the resulting behavior of the swarm towards achieving a global objective. In other words, although every agent relies on local interactions that shape the resulting behavior of the swarm, the global performance of the group simultaneously determines the conditions under which individual agents perform. As previously mentioned, a wide spectrum of inspirational sources has been embraced over the last couple of decades for producing SI methods. Among such sources we can highlight the behavioral patterns of animals such as bees [7], cuckoos [51], fireflies [71], or cats [72]. Other inspiring motifs for SI methods are physical processes, such as electromagnetic theory [73], optic systems [74], or general relativity [75]. Social human behaviors have also served as inspiration for modeling novel metaheuristics, with renowned examples such as anarchic societies [76].

One of the main features that make SI methods especially efficient for solving optimization problems is their ability to distribute the optimization tasks, thereby decentralizing the evolution of solutions. This feature makes them particularly appealing for implementation in ephemeral Big Data environments, in which computation resources are intermittently available. Another acknowledged difference between this optimization paradigm and EC lies in the behavioral mechanisms by which the swarm evolves towards the best solution of the problem at hand, which are driven by simple one-to-one interaction rules rather than by population-based selection and crossover operators (see Fig. 4 for a diagram illustrating such differences).
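To illustrate how a swarm moves through simple interaction rules instead of selection and crossover, the sketch below implements a basic particle swarm optimizer. Particle swarm optimization is not named in the text above; it is used here as an assumed, widely known representative of SI, and the sphere objective, the coefficients and the swarm size are illustrative choices.

```python
import random

def sphere(x):
    """Toy objective to minimize: sum of squares, with its optimum at the origin."""
    return sum(v * v for v in x)

def pso(objective, dim=2, n_particles=20, iterations=100, bounds=(-5.0, 5.0),
        inertia=0.7, c_cognitive=1.5, c_social=1.5, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_val = [objective(p) for p in positions]
    leader = min(range(n_particles), key=lambda i: personal_val[i])
    global_best, global_val = personal_best[leader][:], personal_val[leader]
    history = [global_val]  # best objective value found so far, per iteration
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Each particle is pulled towards its own memory and the swarm's.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + c_cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + c_social * r2 * (global_best[d] - positions[i][d])
                )
                positions[i][d] += velocities[i][d]
            value = objective(positions[i])
            if value < personal_val[i]:
                personal_best[i], personal_val[i] = positions[i][:], value
                if value < global_val:
                    global_best, global_val = positions[i][:], value
        history.append(global_val)
    return global_best, global_val, history

best_position, best_value, history = pso(sphere)
```

No individual is discarded or recombined: every particle persists and only adjusts its velocity, which is the contrast with EC drawn in the paragraph above. The evaluation of each particle is independent, which is what makes such swarms amenable to distributed execution.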

2.2.4 Fuzzy systems

Fuzzy systems are specific mechanisms within Computational Intelligence that closely follow the human reasoning model and the real world. Fuzzy logic provides a better treatment of clauses such as "it is hot", "it is high" or "it is fast". In this context, the term fuzzy refers to the fact that the logic involved can deal with concepts that cannot be expressed as true or false, but rather as partially true. To this end, the core concept of fuzzy systems is to capture the qualitative quantifiers used in inference and human reasoning. Fuzzy systems are usually used as mechanisms inside other methods, but also as monolithic methods. Up to now, many real-world applications have benefited from these paradigms, mainly control (optimization), prediction (modeling) and decision support [77,78,79].
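The notion of partial truth can be sketched with membership functions for clauses such as "it is hot" and "it is fast". The thresholds below (20 to 35 °C, 50 to 120 km/h) are hypothetical, and the min/max/complement operators are the standard Zadeh choices, assumed here for illustration.

```python
def ramp_up(x, low, high):
    """Membership degree: 0 below `low`, 1 above `high`, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def hot(temperature_c):
    """Degree to which 'it is hot' holds (hypothetical thresholds)."""
    return ramp_up(temperature_c, 20.0, 35.0)

def fast(speed_kmh):
    """Degree to which 'it is fast' holds (hypothetical thresholds)."""
    return ramp_up(speed_kmh, 50.0, 120.0)

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

# 'It is hot AND it is fast' evaluated for 27.5 degrees Celsius and 120 km/h.
rule_strength = fuzzy_and(hot(27.5), fast(120.0))
```

Instead of a true/false answer, the rule yields a degree in [0, 1] that a fuzzy controller or decision-support system could then use to weight its actions.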

Table 1 Relationship between the four families of bio-inspired computation approaches, typical problems and dimensions in the Big Data context

3 A joint perspective on bio-inspired computation and big data

This section is devoted to presenting and describing the main synergies between both paradigms studied in this paper: Big Data and bio-inspired computation. Several reviews and surveys have so far addressed this intersection from different perspectives, domains or applications. Table 2 summarizes the essential information of such works carried out during the last two years, including the period of time covered by the articles analyzed in it, the number of reviewed works, the proposal of a taxonomy to organize them, the phases of the Big Data life cycle covered, families of bio-inspired algorithms under scope and, finally, whether a critical analysis, challenges and research directions are given. The comparison made in these terms with the present work reveals several aspects of improvement:

  • A self-contained introduction to the concepts underneath Big Data and bio-inspired computation (Sect. 2), helping the reader understand their synergies and complementarities, as reflected in Table 1 and Fig. 1.

  • A significantly higher number of reviewed works (324), which nearly triples the number of references considered in similar surveys.

  • A wider domain coverage than other similar studies, which focus only on a reduced subset of the phases of the Big Data life cycle, namely collection and processing/analysis. In our case, we use a three-fold criterion when designing our taxonomy: Big Data infrastructure, Big Data technologies and Big Data life cycle phases.

  • A more extensive taxonomy to classify the works under analysis in terms of the families of bio-inspired algorithms used in every reviewed work.

  • A critical analysis dissecting what has been done so far in the field, along with a set of future challenges that are tightly connected to bio-inspired computation and Big Data, avoiding generic formulations.

Table 2 Recent overviews on Big Data connected to bio-inspired computation, and their comparison to this work

For this purpose, two separate perspectives have been adopted: (i) the adoption of bio-inspired computation for modifying different technologies of the Big Data stack, in terms of infrastructure and life cycle technologies; and (ii) the evolution of bio-inspired algorithms adapted to the Big Data life cycle and its features, such as programming models and Big Data volumes. To this end, we divide the analysis into three subsections. The first two are associated with Big Data infrastructure (Sect. 3.1) and Big Data technologies (Sect. 3.2), elaborating on how they can leverage the adoption of bio-inspired computation approaches. Section 3.3 rounds off this joint perspective by outlining bio-inspired computing algorithms that have been adapted to the Big Data domain, emphasizing the life cycle phases involved in each bibliographic item. Figure 5 summarizes the recent literature noted in the field, in which the combination of these technologies has reported remarkable performance and efficiency gains so far.

3.1 Bio-inspired computation for big data infrastructures

Generally, Big Data platforms can be deployed on two different kinds of infrastructure: on-premise or in the cloud [82]. A third approach hybridizing these two is also possible. This variety calls for tools that systematize deployment and guide the system administrator. It is at this point where the optimization capabilities of bio-inspired computation solvers become relevant, allowing these tasks to be automated efficiently. The main goal of the system administrator is to achieve smart system management, which can lead to significant improvements in resource-related tasks such as provisioning; virtualization and allocation [41]; scheduling and optimization; balancing and reservation; and anomaly detection, among many others. In this regard, it should be clarified that resources are conceived as the elements that make up the infrastructure, such as virtual machines, containers, network elements, physical servers or computer nodes.

In addition to the above heterogeneity of resources and tasks, the inherent characteristics of new approaches to Big Data Analytics (speed, non-stationarities, and resilience to failure/ephemeral computing resources) have opened up new challenges in terms of adaptability, learning and self-organization. Analytical models are nowadays deployed on hybrid, volatile, highly scalable and rapidly reconfigurable resources. It is within this complex ecosystem of computation technologies where it becomes essential to ensure that systems and processes meet the aforementioned capabilities, paving the way for bio-inspired computation to become an enabler for this purpose.

To properly categorize the analysis of the study, we follow the previously mentioned classification, which is the most commonly used within the Big Data context: on-premises infrastructures (Sect. 3.1.1), cloud infrastructures (Sect. 3.1.2) and hybrid approaches (Sect. 3.1.3).

Fig. 5 Taxonomy of works related to the application of bio-inspired computation to the big data domain, classified as per the different application areas under consideration

3.1.1 Bio-inspired computation applied to on-premise infrastructures

Briefly explained, on-premise refers to software and technology located within the physical confines of an organization, as opposed to running the system remotely on hosted servers or in the cloud. By installing and running software on hardware located within the premises of the company, full physical access to the data is available. Furthermore, the configuration, management and security of the computing infrastructure can be handled directly on the system.

Regarding the configuration and management, bio-inspired computation can resolve problems related to task allocation and resource scheduling. In [83], for example, the authors present an approach based on distributed SI mechanisms that mimic the behavior of social insects to solve problems such as overlay management, routing, task allocation, and resource discovery. Through this approach, the authors of [83] construct an adaptive and robust management system for peer-to-peer networks.

The use of Graphics Processing Units (GPUs) and cluster-based parallel computing techniques is also a research trend, aiming at accelerating the process of extracting the correlations between items in sizeable data instances. In [84], for instance, authors propose four different population-based metaheuristics for efficiently mining association rules, which benefit from the cluster intensive computing and massive GPU threading.

In another vein, a special case of Big Data on-premise infrastructure is the so-called High Performance Computing (HPC, [85]), which refers to hardware and programming models specialized in solving highly complex problems mainly via parallelization. In this sense, using HPC solutions requires new techniques for memory management. An interesting recent survey published by Pupykina et al. [86] discusses the challenges of memory management in HPC and Cloud Computing, including a review of bio-inspired optimization methods to increase memory utilization.

In the security context, which covers application-level security as well as advanced protection against malware, the paper by Mthunzi et al. [44] offers a comprehensive review of the benefits that bio-inspired algorithms bring to the specific field of cybersecurity. Also of interest is the work of Rauf et al. [87], which highlights and discusses challenges and open opportunities at the intersection of cybersecurity and bio-inspired computation. Lastly, a totally different approach can be found in [88], where several management problems related to growing complexity and energy demand are addressed in detail. To this end, a bio-inspired self-organized technique is proposed for the redistribution of load among servers in data centers.

Reflecting on the activity noted so far on bio-inspired computation applied to the design, management and operation of on-premise Big Data infrastructures, we stress the lack of evidence on whether bio-inspired algorithms can cope with the realistic complexity scales of large computing farms. Furthermore, even if resource utilization does not vary as dynamically as in shared computing environments, most works reviewed in this strand of literature do not report the latencies induced by the use of bio-inspired methods for, e.g., resource balancing or fast-evolving computing tasks, which could hinder their practical adoption in Big Data environments subject to timing constraints. This criticism mostly refers to optimization methods: biologically inspired modeling solutions suited for deployment over Big Data infrastructure are far more mature than their optimization counterparts.

3.1.2 Bio-inspired computation applied to cloud computing infrastructures

In a few words, Cloud Computing infrastructure can be defined as the collection of hardware and software elements needed to enable the remote management of the whole Big Data system. These elements include computing power, networking and storage. The infrastructure also includes an interface for users to access their virtualized resources, such as cloud management software, deployment software and platform virtualization. In the Big Data context, the ability of Cloud Computing to offer fully scalable technical resources adapted to the needs of each project is crucial, since it avoids the limitations of traditional physical servers. However, appropriate management tools are needed to efficiently handle tasks such as resource virtualization or the optimization of service deployment.

In the current literature, works in this line of research can be classified into two main strands: (i) approaches related to the resource provisioning and allocation in Cloud Computing environments, and (ii) tasks related to the deployment, planning and optimization of services and applications:

  • On the one hand, the allocation and scheduling of multiple virtual resources, such as virtual machines (VMs), is a well-known research field in Cloud Computing. In [89], for example, a Genetic Algorithm is proposed for the optimization of VM distribution across a federated cloud. A similar approach is followed by Rocha et al. in [90], who present a hybrid optimization model that allows a cloud service provider to establish VM placement strategies, so that energy efficiency and network quality of service are jointly optimized. More recent is the work presented in [91], which solves the same problem by means of an ant colony system. In addition, the research introduced in [92] hybridizes a Firefly Algorithm with fuzzy logic for server consolidation and VM placement in cloud data centers. Also interesting is the study presented in [93], which focuses on Hadoop Big Data technology. In that work, the authors implement a bio-inspired solver for optimizing the placement of VMs in OpenStack. In [94], Pires et al. propose a novel multi-objective formulation of the VM placement problem, which is addressed by means of a novel multi-objective memetic algorithm. Additionally, in [95] Ant Colony Optimization and dynamic forecast scheduling are combined to solve the VM placement problem, showing remarkable efficiency in terms of fewer wasted resources and better load balancing. Finally, an interesting approach based on Cuckoo Search is proposed in [96] for data center resource provisioning in the cloud.

  • On the other hand, task scheduling over distributed and virtual resources is a main concern that can affect the performance of a Big Data system. In [97], a meta-heuristic algorithm called Chaotic Social Spider Algorithm is developed for solving task scheduling problems in virtual machines. The authors of this work focus on minimizing the overall makespan while leveraging load balancing. Additionally, the survey presented in [98] analyzes different bio-inspired approaches for tackling the aforementioned problem. A work closer to Big Data technologies is conducted in [99], in which the authors theorize on how the MapReduce programming model performs the assignment of tasks in Cloud Computing environments. This analysis is carried out by resorting to assorted algorithms, including bio-inspired techniques.
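
To make the two strands above more concrete, the following Python sketch shows how a simple Genetic Algorithm can assign tasks to VMs so as to minimize the makespan. It is a toy formulation under our own assumptions (task lengths divided by VM speeds, binary tournament selection, one-point crossover, elitism of size two) and does not reproduce the method of any of the reviewed works; all names and parameter values are hypothetical.

```python
import random


def makespan(assignment, task_lengths, vm_speeds):
    """Completion time of the busiest VM for a task-to-VM assignment."""
    load = [0.0] * len(vm_speeds)
    for task, vm in enumerate(assignment):
        load[vm] += task_lengths[task] / vm_speeds[vm]
    return max(load)


def genetic_schedule(task_lengths, vm_speeds, pop_size=30, generations=100,
                     mutation_rate=0.1, seed=0):
    """Toy generational GA; returns (best_assignment, best_makespan).

    Assumes at least two tasks and pop_size >= 2.
    """
    rng = random.Random(seed)
    n_tasks, n_vms = len(task_lengths), len(vm_speeds)

    def cost(individual):
        return makespan(individual, task_lengths, vm_speeds)

    # Each individual is a list: position = task index, value = VM index.
    population = [[rng.randrange(n_vms) for _ in range(n_tasks)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        next_gen = population[:2]  # elitism: the two best survive unchanged
        while len(next_gen) < pop_size:
            # Binary tournament selection for each parent.
            parent1 = min(rng.sample(population, 2), key=cost)
            parent2 = min(rng.sample(population, 2), key=cost)
            cut = rng.randrange(1, n_tasks)          # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            for i in range(n_tasks):                 # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_vms)
            next_gen.append(child)
        population = next_gen
    best = min(population, key=cost)
    return best, cost(best)


# Hypothetical instance: eight tasks, three VMs of different speeds.
tasks = [4, 8, 2, 6, 10, 3, 7, 5]
speeds = [1.0, 2.0, 1.5]
assignment, span = genetic_schedule(tasks, speeds)
print(assignment, span)
```

Because the fitness of each individual is evaluated independently, this loop is also the part that distributed implementations typically parallelize. Note that the sketch says nothing about the latency of the optimizer itself, which is precisely the aspect that validation studies often leave unreported.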

It is also worth mentioning that one of the key goals in cloud environments is the optimal use of resources, for which load balancing techniques are often applied. This has been a particularly profitable playground for bio-inspired optimization techniques, yielding extensive surveys such as the one in [100], which provides a wide coverage of nature-inspired meta-heuristic techniques applied in the area of cloud load balancing. In this line, [101] addresses the problem of load balancing in cloud environments by proposing a hybrid Cuckoo Search and Firefly Algorithm, showing promising performance. An additional approach for load balancing is described in [102], focused on both Fog and Cloud Computing environments. The authors compare the performance of several bio-inspired computation methods, including Cuckoo Search, Flower Pollination and Bat Algorithm.

Our review of the literature related to Cloud Computing infrastructure has revealed that, in most cases, the conditions under which algorithmic proposals are validated are largely uncoupled from the constraints and computation budgets that such algorithms would encounter in practical settings. This criticism refers not only to the scales at which, e.g., load balancing methods are validated (regime of tasks/users being concurrently handled), but also to the variability over time of the tasks under computation. Furthermore, little to no attention is paid to the efficiency of the bio-inspired algorithm itself, mainly due to the simplicity of the simulation settings under which algorithms are validated. We advocate for a closer look at the implications of using bio-inspired algorithms, stepping away from common practice and informing the community of bio-inspired methods that can truly be adopted under computation-intensive regimes.

3.1.3 Bio-inspired computation applied to hybrid big data infrastructures

As mentioned, hybrid infrastructures comprise a blend of private clouds, public clouds and on-premise data centers. Thus, Big Data systems and applications can be deployed on any of these environments, depending on business considerations such as the main objective of the system, its tactical requirements and the required outcome. This is the case for heterogeneous distributed systems, in which environments and resources such as cluster computing, grid computing, peer-to-peer computing, cloud computing and ubiquitous computing are mixed [103, 104]. This particular scenario brings the need to efficiently manage a large variety of tools and software, which motivates the development of new algorithmic schemes for event and task scheduling. New methods for resource management should also be designed to increase the performance of such systems. In [105], for example, a valuable survey is presented revolving around the advances on scheduling algorithms, energy-aware models, self-organizing resource management, dataware service allocation, Big Data management and performance analysis. This whole analysis is conducted from the perspective of bio-inspired computation. In [106], a review of biological concepts and principles to solve service provisioning problems is presented, along with the proposal of a bio-inspired cost minimization mechanism for data-intensive scenarios where such a problem emerges. The proposed method utilizes bio-inspired mechanisms to search for the optimal data service solution in Big Data environments, considering data management and service maintenance costs. Finally, in [107], a preliminary work is presented on the deployment of evolutionary algorithms on hybrid Big Data infrastructures. To this end, the authors extend the functionality of the well-known ECJ tool [108].

On a short reflective note, we foresee an increasing prevalence of bio-inspired algorithms capable of reconciling multiple conflicting objectives. Such objectives emerge as a result of the hybridization of different infrastructures, both private and public, which may have some goals in common (e.g., energy efficiency), but others that delineate an interesting Pareto trade-off to be balanced (e.g., cost of service versus fairness in the distribution of shared public computing resources). This opens a magnificent opportunity for multi-criteria decision making algorithms suited to deal with multiple conflicting objectives, such as multi-objective meta-heuristics. Our examination of the literature uncovers that this is a niche of opportunity that should attract more efforts in the near future.
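
The Pareto trade-off mentioned above can be illustrated with a short Python sketch that filters a set of candidate deployments down to its non-dominated subset. Both objectives are assumed to be minimized, and the candidate values (cost of service, unfairness score) are made up for the example; multi-objective meta-heuristics rely on this dominance relation to rank their populations.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(points):
    """Return the non-dominated subset of `points`, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidates as (cost of service, unfairness) pairs.
candidates = [(10, 0.9), (12, 0.5), (15, 0.4), (14, 0.6), (20, 0.4), (11, 0.9)]
print(pareto_front(candidates))  # [(10, 0.9), (12, 0.5), (15, 0.4)]
```

The three surviving candidates cannot be ranked against each other without an additional preference between cost and fairness, which is where multi-criteria decision making comes into play.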

3.1.4 Bio-inspired computation applied to big data networks

We finish this subsection by turning our attention towards a particularly significant element within the infrastructure: the network. In fact, different computing models can configure their operation based on the network topology and the associated communication latency. Examples of these models are Fog [109] and Edge Computing [110]. In this area, there are multiple open opportunities and ample room for improvement by means of optimization techniques that orchestrate the deployment of elements depending on the features and distribution of the network. It is in this specific stream where bio-inspired algorithms can emerge as an efficient approach for such orchestration. For instance, in [111] a scheduling method for application modules in a fog computing environment is proposed using bio-inspired solving schemes such as Genetic Algorithm, Particle Swarm Optimization and Ant Colony Optimization to reduce energy consumption and execution time. A similar approach is proposed in [112], which describes a framework for optimal deployment in Fog/Edge Computing environments via bio-inspired algorithms.

Another cornerstone task related to the infrastructure network is the security of communications. For this problem, bio-inspired algorithms can also be very useful, as shown in [113]. In that paper, the authors propose a semi-class intrusion detection method that combines multiple classifiers to distinguish anomalous from normal activity in a computer system. Another axis of interest is the scalability of the network, which is also an aspect of utmost relevance in Big Data scenarios. In [114, 115], for example, the authors propose and utilize a framework that supports simulation and testbed experiments to investigate the scalability and adaptability of ant routing algorithms in networking.

In this application area, there is a notable inertia towards the use of bio-inspired techniques for network security purposes. However, Big Data networks, stricto sensu, have so far not raised much interest in the use of bio-inspired computation to address inherent problems such as latency minimization, routing or network dimensioning. We nevertheless envision that the extrapolation of the Big Data paradigm towards ephemeral computing will spawn further opportunities due to the intermittency of the network, the variability of task completion schedules and the uncontrolled availability of computation nodes. It is only under these circumstances that the complexity of governing ephemeral computing resources will require the flexibility and adaptability granted by bio-inspired computation.

3.2 Bio-inspired computation applied to big data technologies

The fast evolution and emergence of new technologies in the Big Data stack, along with the adoption of this paradigm by a growing number of organizations, give rise to new challenges and opportunities in this field. Usually, these challenges are associated with the development, management and operation of new functionalities. In this regard, one of the essential aspects related to the Big Data technology stack is the set of non-functional requirements that solutions and tools need to consider. Singh et al. explain in [116] some of the most representative ones: (i) scalability; (ii) data I/O performance; (iii) fault tolerance; (iv) real-time processing; (v) supported data size; and (vi) iterative task support. Based on these six criteria, we can classify Big Data tools into three large groups [117]: NoSQL databases, parallel and distributed programming models and ecosystems of tools. We now analyze them in detail:

3.2.1 Bio-inspired computation and NoSQL databases

In a nutshell, a NoSQL [118] database provides a mechanism for the storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. This kind of database offers several points of improvement that can be addressed through the application of bio-inspired algorithms. Some of these applications are related to horizontal scalability (choice of cluster topology), availability and replication of the data (assignment of the replicas to the nodes), or the consistency level of the information (ensuring the writing optimization), among many others.

In [119], for example, the authors present a framework that allows Hadoop to manage the distribution of the data and its placement based on cluster analysis of the data itself. This work is not directly related to NoSQL databases, but it arguably represents an interesting approach for optimal data distribution in physical storage using evolutionary clustering techniques. The paper presented by Nowosielski et al. [120] is a good example of how bio-inspired solvers, specifically the Flower Pollination and Krill Herd metaheuristic algorithms, can help achieve horizontal scalability. In the specific context of data availability and replication, the work published in [14] presents an adaptive distributed database replication technique based on an algorithm inspired by colonies of pogo ants. Additional valuable research can be found in [121], in which the Firefly Algorithm is applied for the positioning and optimization of traffic in a NoSQL database system, modeled with exponentially distributed service and vacation. Bio-inspired computation can also contribute to the design of the logical data schema. The research presented in [122] is an example of this trend, proposing a design repository for storing and retrieving biological (and engineering) design strategies.

Another interesting investigation is shown in [123], in which a data warehouse schema design is optimized by means of a Particle Swarm Optimization approach. In [124], a mathematical model of the performance of a column-oriented database is presented. The authors propose the use of the Flower Pollination Algorithm to optimize the coefficients of the regression equation. Furthermore, they highlight its accuracy and sophistication, which make it an appropriate foundation for database performance optimization.

Another highly relevant field of study combining NoSQL databases and bio-inspired computing is the so-called query optimization [125]. The work presented by Rani et al. in [126], for example, proposes the use of a bio-inspired algorithm based on the antibody-antigen clonal selection scientific theory for the efficient modeling of distributed query plans. The same author presents in [127] a study revolving around the distributed query processing optimization based on artificial immune systems, which is among the few references identified so far where immune systems have been utilized in Big Data scenarios.

Furthermore, there are situations in which bio-inspired techniques assist in the extraction of association rules over databases, as can be seen in [128]. In that study, authors showcase an approach for extracting association rules by applying a Bee Swarm Optimization meta-heuristic algorithm to a large database using the massively parallel threads of a GPU processor. An additional valuable approach is proposed in [129] for association rule mining, in which the JAYA algorithm is applied to big database instances.
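
For readers unfamiliar with association rule mining, the following Python sketch computes the two classical quality measures of a rule, support and confidence, together with a hypothetical fitness function of the kind a population-based miner could maximize. The weighted sum and its parameter alpha are our own assumption for illustration and are not taken from [128] or [129].

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    s_antecedent = support(antecedent, transactions)
    if s_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / s_antecedent


def rule_fitness(antecedent, consequent, transactions, alpha=0.5):
    """Hypothetical fitness: weighted sum of rule support and confidence."""
    both = set(antecedent) | set(consequent)
    s = support(both, transactions)
    c = confidence(antecedent, consequent, transactions)
    return alpha * s + (1 - alpha) * c


# Toy database of four transactions, each stored as a set of items.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(support({"a", "b"}, db))       # 0.5
print(confidence({"a"}, {"b"}, db))  # support 0.5 divided by support 0.75
```

Since every candidate rule is scored independently against the same database, the evaluation step lends itself to the massively parallel execution exploited in the works above.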

Finally, an additional viewpoint can also be highlighted in this section, which evinces even further how bio-inspired optimization methods can take advantage of NoSQL technologies. This is the concrete proposal of Jordan et al. in [130]. In this paper, the authors showcase how a system benefits from optimization knowledge persisted on a NoSQL database, which serves as an associative memory to better guide the optimizer through dynamic environments. This supports our claim that bio-inspired computation can not only benefit non-conventional databases, but can also, conversely, leverage the storage capabilities of such databases to store historical information that can be retrieved and exploited by the bio-inspired algorithm when required, as in, e.g., recurrently changing concepts modeled by neural networks (continual learning) or dynamic optimization with bio-inspired meta-heuristics. This synergy is worth exploring further in prospective studies around recurrent evolving learning environments.
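
The idea of an associative memory for dynamic optimization can be sketched as follows. In this Python example a plain in-memory dictionary stands in for the NoSQL key-value store, and the class, its methods and the environment keys are hypothetical names introduced for illustration; the design in [130] may differ substantially.

```python
class SolutionMemory:
    """Keeps the best known solution for each environment signature.

    A dict is used here as a stand-in for a persistent key-value store.
    """

    def __init__(self):
        self._store = {}

    def remember(self, env_key, solution, cost):
        """Store `solution` if it improves on what is known for `env_key`."""
        known = self._store.get(env_key)
        if known is None or cost < known[1]:
            self._store[env_key] = (list(solution), cost)

    def recall(self, env_key):
        """Return (solution, cost) for a recurring environment, or None."""
        return self._store.get(env_key)


memory = SolutionMemory()
memory.remember("peak-load", [1, 0, 2], cost=5.0)
memory.remember("peak-load", [0, 1, 2], cost=3.0)   # better, replaces the first
memory.remember("peak-load", [2, 2, 2], cost=4.0)   # worse, ignored

# When the "peak-load" environment reappears, the optimizer can seed its
# population with the recalled solution instead of restarting from scratch.
print(memory.recall("peak-load"))  # ([0, 1, 2], 3.0)
```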

3.2.2 Bio-inspired computation for parallel and distributed computing models

The significant rise of distributed and parallel processing techniques has dramatically transformed the use case landscape, improving existing levels of processing performance. In this context, two clear approaches can be spotted: batch programming models and those adapted to real-time or streaming environments. As in other situations discussed before, problems arising in these two scenarios can be tackled through the perspective of bio-inspired computation.

On the one hand, regarding batch parallel programming models, two main challenges can be found: (i) improvements over existing programming models (such as MapReduce [131]), and (ii) the development of new improved computing approaches under bio-inspired computation techniques. In the first case, we find interesting works such as [132] and [133]: the former introduces improvements into the programming model regarding the efficient distribution of tasks, whereas the latter showcases more precise locations of the distributed data. Another remarkable research work can be found in [134], which provides a Big Data scheme based on Spark to handle highly imbalanced datasets. The authors successfully validated their approach over several datasets composed of up to 17 million instances. In [135], Hans et al. present details about reshaping the DEAP library for Evolutionary Computation by parallelizing the costly evaluation of encoded programs (individuals) on a Spark cluster. It is also interesting to highlight the work presented in [136], where the authors focus on the Cloud Computing paradigm with emerging programming models, such as Spark, to show how several parallel differential evolutionary algorithms can perform well in this situation. The obtained outcomes demonstrate a competitive speedup against serial implementations, along with remarkable horizontal scalability. Finally, we can find new programming models such as the one proposed in [107], in which a new approach to deploy computing-intensive runs of enterprise applications on Big Data infrastructures is presented.
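
The pattern shared by several of these works, namely distributing the costly fitness evaluation of a population as a map operation, can be sketched in a few lines of Python. A local thread pool from the standard library is used here purely as a stand-in for a Spark cluster, and the sphere function is a toy fitness chosen for the example; none of this reproduces the actual code of [135] or [136].

```python
from concurrent.futures import ThreadPoolExecutor


def sphere(individual):
    """Toy fitness to be minimized: sum of squared components."""
    return sum(x * x for x in individual)


def evaluate_population(population, fitness, max_workers=4):
    """Map-style evaluation: one independent fitness call per individual.

    Results are returned in the same order as `population`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fitness, population))


population = [[1, 2], [0, 0], [3, 4]]
print(evaluate_population(population, sphere))  # [5, 0, 25]
```

In CPython, threads do not speed up CPU-bound code, so the sketch only conveys the structure of the computation. In a distributed setting the same map step would be executed by the workers of the cluster, while selection and variation usually remain on the driver.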

On the other hand, a streaming system can be referred to as real-time if it guarantees a response within tight deadlines. Depending on the specific context of the application, such deadlines can be a matter of minutes, seconds, or even milliseconds. Nowadays, due to the velocity dimension of Big Data, these systems are cornerstones of the technology stack in the treatment of large volumes of data, and they can take advantage of the characteristics of bio-inspired computation, such as its speed and efficiency when solving complex problems. An example supporting this claim is the so-called Software Model for Distributed Incremental Closeness Factor-Based Algorithms (SMDICFBA), in which incremental clustering models are proposed to learn dynamically about embedded patterns from raw data [137]. An additional example can be found in [138], in which a new approach to stream computing is introduced. To achieve online optimization and scheduling, a particle swarm optimization algorithm hybridized with back-propagation and an immune clonal algorithm are used in that work. Lastly, we pause at the term Organic Computing [139], which refers to systems that behave and interact with humans in a bio-inspired manner. All in all, it is important to ensure that the efficiency of bio-inspired algorithms does not clash with the stringent computational requirements imposed by avant-garde parallel computing setups. Connecting back with our reflections offered previously, there is little evidence of implementations of bio-inspired optimization algorithms that can perform within realistic computational boundaries.

3.2.3 Bio-inspired computation for big data ecosystem tools

A Big Data ecosystem can be defined as a framework for solving Big Data problems, comprising a suite of cluster management and task/job scheduling/assignment tools, which encompasses a number of valuable services (ingesting, storing, analyzing and maintaining). An example of this kind of ecosystem optimized by bio-inspired computation can be found in [140], which presents a hybrid Particle Swarm Optimization–Genetic Algorithm for solving the task assignment problem. Another case is presented in [141], in which a bio-inspired method based on ant systems is developed for optimizing the distribution of service deployment. Regarding scheduling, we can find works such as [142], in which a multi-stage multi-machine multi-product scheduling problem is solved using the Bat Algorithm. In [143], energy-aware cloud task scheduling is studied by resorting to the same method. Finally, a task scheduler for diverse computing systems is described in [144]. In that case, the system is developed as a hybridization of the Bat Algorithm and the Artificial Bee Colony. Apart from these reviewed works, we have not found any further contributions showcasing tools for Big Data ecosystems empowered by bio-inspired algorithms.

3.2.4 Bio-inspired computation for security

We finish this section by devoting a few lines to works related to security technologies. Interesting investigations in this context can be found in [145,146,147]. Strictly speaking, these works are not directly associated with Big Data environments, but rather with paradigms such as Cloud Computing or the Internet of Things. All these papers adopt bio-inspired algorithms for solving different problems such as access control or intrusion detection, which are common to any complex networked system. Big Data is by no means an exception, and should embrace advances in bio-inspired computation for security purposes in future evolutions of its technology stack, including all applications for which this area of Artificial Intelligence has a long history of successes in network security.

3.3 Big data life cycle bolstered by bio-inspired computation

In Sect. 2.1.2 we introduced the Big Data life cycle, which is made up of different phases. Bio-inspired computation can improve each of such phases in terms of efficiency and fulfillment of non-functional requirements. In this section, we outline a significant group of valuable works for each phase, which arguably help understand the importance of the consideration of bio-inspired algorithms over each of these life cycle phases.

The relevance of bio-inspired methods applied to the Big Data paradigm has been previously studied, but always associated with specific algorithm categories or under the prism of specific problems. For example, a survey on data science with population-based algorithms is presented in [149]. Authors of this work focus on EC and SI, and they acknowledge the need for new techniques in the field to appropriately deal with the problems, scales and requirements arising from Big Data. Likewise, the work in [150] paves the way towards using genetic programming in Big Data problems. This work shows and discusses different ways of configuring Big Data training evaluations and parallelization, and demonstrates their impact on efficient problem solving.

For the sake of comprehensiveness, we show in Fig. 6 the different life cycle phases and solutions that bio-inspired computation provides for each of them. We proceed now to overview the research conducted up to now on each of these life cycle steps: data fusion (Sect. 3.3.1), data storage (Sect. 3.3.2), data processing and learning (Sect. 3.3.3) and data visualization (Sect. 3.3.4).

3.3.1 Data fusion and bio-inspired computation

Data fusion is the process of integrating multiple data sources to produce more consistent and useful information. In the Big Data paradigm, this is a crucial procedure due to the large number and heterogeneity of the data sources that can currently be found in a given use case. From the perspective of bio-inspired computation, this is a problem that has been tackled in the literature before, as can be seen in valuable reviews such as [151, 152] or [153]. Furthermore, there is a clear consensus that the relevance of this topic grows as the volume of information increases.

Fig. 6 Application areas of bio-inspired computation for each Big Data life cycle phase

The heterogeneity of the data and the diversity of their sources cause difficulties when accessing and understanding their underlying structure. In particular, users find it hard to properly represent and interpret the same real-world objects when they are recovered from different data sources. In this context, [154] presents an approach to dynamic feature selection based on Big Data fusion with multi-objective particle swarm optimization. Another example is proposed by Dong et al. in [155], in which the authors determine security threats in the power grid by making full use of heterogeneous data sources in power big data. In that paper, the researchers map heterogeneous data in different formats to a unified embedded vector space with a deep restricted Boltzmann machine, achieving an efficient fusion of heterogeneous data sources. Furthermore, Zhang et al. have published several recent works on Big Data fusion techniques with ensemble learning and neural networks at their core [156, 157]. As a matter of fact, ensemble learning can also be conceived as a fusion of the decisions made by the constituent models in the ensemble. Bearing this in mind, the automatic construction of ensembles has also largely leveraged bio-inspired optimization algorithms [158, 159], with recent examples of their application to Big Data scenarios [134, 160].
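The view of an ensemble as a fusion of decisions, tuned by a bio-inspired optimizer, can be illustrated with a small sketch. It is not taken from any of the cited works: the weighted majority vote, the (1+1) evolution strategy and the toy votes below are all assumptions chosen for brevity, and the names `fuse`, `accuracy` and `evolve_weights` are hypothetical.

```python
import random

def fuse(votes, weights):
    """Weighted majority vote over the binary decisions (0/1) of the members."""
    score = sum(w for v, w in zip(votes, weights) if v == 1)
    return 1 if score >= 0.5 * sum(weights) else 0

def accuracy(weights, member_votes, labels):
    """Fraction of samples whose fused decision matches the label."""
    hits = sum(fuse(v, weights) == y for v, y in zip(member_votes, labels))
    return hits / len(labels)

def evolve_weights(member_votes, labels, generations=200, sigma=0.2, seed=0):
    """(1+1) evolution strategy: mutate the weights, keep the child if not worse."""
    rng = random.Random(seed)
    parent = [1.0] * len(member_votes[0])      # start from a plain majority vote
    best = accuracy(parent, member_votes, labels)
    for _ in range(generations):
        # Gaussian mutation; weights are clipped to stay strictly positive.
        child = [max(1e-6, w + rng.gauss(0.0, sigma)) for w in parent]
        score = accuracy(child, member_votes, labels)
        if score >= best:
            parent, best = child, score
    return parent, best

# Toy data (assumed): each row holds the decisions of three ensemble members.
votes = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 1, 1)]
labels = [1, 0, 1, 0, 1, 0]
weights, acc = evolve_weights(votes, labels)
print(weights, acc)
```

Because a child is accepted only when it is at least as accurate as its parent, the fused accuracy can never fall below that of the unweighted vote used as the starting point. In a Big Data setting the accuracy evaluation would be the expensive, distributable step.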

Data fusion techniques can be applied to multiple domains such as culture, health, language analysis, and transportation and mobility in Smart Cities. In the cultural heritage domain, Piccialli et al. [161] present and discuss the application of a clustering approach for the behavioral classification of IoT cultural data collected in the National Archaeological Museum of Naples (Italy). In the health domain, we find studies like [162], in which e-health data are collected from patients suffering from different diseases, and the optimal attributes are chosen with an improved Dragonfly Algorithm for an enhanced classification. In the text analysis domain, the research introduced in [163] proposes and compares effective fusion matching methods that use neural networks to automatically remove semantic collisions of files. In Smart Cities, Wang et al. [164] present an approach to urban Big Data fusion based on Deep Learning. The investigation detailed in [165] is also centered on Smart Cities, focusing on the management of natural disasters using fuzzy models. In the transportation domain, the work in [166] studies train transport, revolving around delay prediction by means of Big Data fusion based on bio-inspired techniques. Finally, we note the fruitful strand of literature on rule mining with bio-inspired methods, which has also permeated the Big Data field. An example is [167], which proposes an efficient associative classifier for large imbalanced datasets based on an evolutionary algorithm that efficiently discovers rare yet reliable association rules.

Without a doubt, the main algorithmic player in bio-inspired computation when it comes to data fusion is Deep Learning. The flexibility of neural architectures to blend together features extracted from different information domains has advanced the state of the art in model-based information fusion. Other subfamilies of bio-inspired computation have also been used for this purpose, but rather for auxiliary tasks that help, yet do not realize on their own, the fusion of different information flows (e.g., meta-heuristics for neural architecture search).

3.3.2 Data storage and bio-inspired computation

The case of Big Data storage is closely linked to the correct selection and optimization of persistence tools and technologies, which were already covered in Sect. 3.2.1. Indeed, there are specific tasks associated with this phase of the life cycle which are also likely to be improved by virtue of bio-inspired algorithms. Additionally, these tasks do not only relate to the storage technology itself. An example is the conceptual design of the database schema, with multiple related works such as [122, 123] or [168]. The management and maintenance of large volumes of data is also subject to improvement. This research trend is exemplified by [169], where a biologically inspired algorithm is proposed to identify and mitigate the impact of misbehavior on the performance of data management in social networks. Finally, it is also interesting to highlight [170], which introduces a bio-inspired approach combining Big Data with data-intensive computing issues within a future vision of smart healthcare data management.

A further interesting work related to data persistence is [171], in which the authors propose a new algorithm, inspired by the working principle of human memory, for storing Hierarchical Temporal Memory features detected from an image. A few explorations of data allocation and data reduction using bio-inspired methods have been reported in [172, 173] and [34, 174], respectively. Finally, it is worth pointing out that some studies are dedicated to the secure sharing of large volumes of data using bio-inspired computing approaches, such as the one presented by Ogiela et al. in [175]. Unfortunately, our bibliographic analysis has not yielded any further evidence of biologically inspired mechanisms used for improving the data management efficiency of modern data storage technologies. The plethora of works dealing with relational databases enhanced by bio-inspired mechanisms seems not to have been extrapolated to the Big Data realm, even though the diversity of data and the confluence of spatial and temporal information flows open up large possibilities for the research domain targeted in this survey.

3.3.3 Bio-inspired computation for data processing and learning

These are arguably the most important phases within the Big Data life cycle, since they are the ones in charge of converting data into knowledge. There are many works to consider in this specific area [46]. For this reason, we split these works into two groups: (i) techniques based on bio-inspired concepts for the pre- or post-processing of data, and (ii) adaptations of bio-inspired algorithms that make them capable of meeting the requirements and dimensions of the Big Data paradigm:

  • Bio-inspired pre- and post-processing techniques have been widely utilized in the literature for an assortment of purposes, from data imputation to instance selection, noise filtering, dimensionality reduction or model output simplification [176]. A growing corpus of works can be found in the literature with new algorithmic proposals that undertake the aforementioned tasks in scenarios and setups that could be considered close to the computational requirements imposed by the Big Data paradigm [177, 178]. However, a closer inspection of the literature reveals that an open challenge emerges from the extrapolation of such bio-inspired approaches to the scales of Big Data, which we later discuss in depth in Sect. 4.

  • Bio-inspired algorithms adapted to Big Data: in this case, two computational problems have been actively investigated in Big Data environments: clustering (simulation) and prediction (modeling). For clustering purposes, a multitude of research studies have been conducted using different bio-inspired methods, such as [27, 179,180,181] or [182]. In [183], a technique based on the Whale Optimization solver is presented as a clustering technique to be used in the Big Data domain. The authors evaluate their proposal against four alternative clustering techniques, obtaining promising results. In prediction, many interesting works can be found in the current literature. In [184], for example, an ant colony-based algorithm is used to perform prediction over data streams. In [185], an Ant Colony Optimization method is also employed for Big Data distribution considerations. The same method is used in [186], where decision analysis is studied over mobile Big Data.

  • Another notable group of works to mention are those in which distributed and parallelizable programming models are used for the implementation of the bio-inspired algorithms. An example of this trend can be found in [187], which uses MapReduce to develop a particle swarm optimization-back-propagation neural network algorithm. In [188], Spark is used for developing a Particle Swarm Optimization and a Differential Evolution algorithm. Likewise, the authors of [189] introduce a parallel population-based optimization algorithm with Spark. Another interesting work along this line is [190], in which a scalable Genetic Algorithm is developed using Apache Spark. To do so, the authors maintain the population diversity and minimize the materialization and shuffles in resilient distributed datasets.
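The map/reduce decomposition that underlies the population-based implementations mentioned above can be sketched in a few lines. This is an illustration only, not the implementation of any cited work: a local thread pool stands in for a MapReduce or Spark cluster (in CPython, threads do not give real CPU parallelism), the `sphere` function stands in for an expensive fitness evaluation over data, and the inertia and acceleration coefficients (0.7, 1.5, 1.5) are common textbook values assumed here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sphere(x):
    """Toy fitness to minimize; stands in for a costly evaluation over data."""
    return sum(v * v for v in x)

def pso(fitness, dim=5, n_particles=20, iters=50, workers=4, seed=0):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # "Map" step: evaluate all particles independently.
        pbest_fit = list(pool.map(fitness, pos))
        # "Reduce" step: pick the best particle of the swarm.
        g = min(range(n_particles), key=pbest_fit.__getitem__)
        gbest, gbest_fit = pbest[g][:], pbest_fit[g]
        for _ in range(iters):
            # Velocity and position updates are cheap and done on the driver.
            for i in range(n_particles):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][d] = (0.7 * vel[i][d]
                                 + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                                 + 1.5 * r2 * (gbest[d] - pos[i][d]))
                    pos[i][d] += vel[i][d]
            fits = list(pool.map(fitness, pos))          # map
            for i, f in enumerate(fits):
                if f < pbest_fit[i]:
                    pbest[i], pbest_fit[i] = pos[i][:], f
            g = min(range(n_particles), key=pbest_fit.__getitem__)  # reduce
            if pbest_fit[g] < gbest_fit:
                gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    return gbest, gbest_fit

best, best_fit = pso(sphere)
print(best_fit)
```

The design choice worth noting is that only the fitness evaluation is distributed, since it is usually the step that touches the data; the swarm state is small and stays on the driver, which limits the data that has to be shuffled between workers.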

Finally, it is interesting to highlight that bio-inspired computation can also be used in conjunction with other techniques, such as time series analysis [191], or for the calculation of similarity functions [192]. Furthermore, novel bio-inspired approaches can be created specifically for this field of application, such as the Danger Theory presented in [193].

We end this glimpse at the literature with a notable mention of the prominence of bio-inspired methods used for automating hyper-parameter tuning, which have lately grown towards covering the design of the entire data mining pipeline [194, 195]. As we will later expose, the popularity and track record of recent success cases of the so-called AutoML research area [196] open up a vast research niche for extending the functionalities of existing tools and frameworks to Big Data scenarios. The possibility of federating models without compromising the privacy and confidentiality of the Big Data from which they were learned (also referred to as Federated Learning) is another research line with a close connection to bio-inspired learning models. However, practically all federated learning scenarios reported to date have gravitated around neural network models, as they easily allow for privacy-aware knowledge sharing, aggregation and redistribution among peers. Furthermore, even though many of these studies resort to the Big Data term in their introduction and claims, they lag notably behind the scales expected for realistic Big Data use cases, and they do not generalize to other models for which the federation of knowledge is not as straightforward. We will later return to these issues and their implications for effective Big Data governance.

3.3.4 Data visualization and bio-inspired computation

As a concluding point for this section, we underscore that techniques for the efficient visualization of large volumes of data are at a relatively less mature stage of development. The same holds for their synergy with bio-inspired computation, since works relating both areas of research are scarce. The closest work that falls in this intersection is the one presented by Gritsenko et al. [197]. That work does not present a visualization method as such; instead, it proposes a neural network approach, coined Extreme Learning Machines for visualization, that improves the output of results so that it can be visualized more easily. The difficulty of measuring the level of visual perception by users, their cognitive assimilation of the visualization, and the strongly case-specific nature of visualization have hitherto yielded largely ad-hoc tools and techniques. However, we foresee that the current momentum of eXplainable Artificial Intelligence (XAI) tools will spawn a new visualization era in which insights about the data are produced by explaining and understanding the knowledge captured by models constructed during the learning phase. The need for coupling the information embedded in the generated explanations with the cognitive capabilities of the audience becomes very relevant in Big Data contexts. In our targeted application domain, spatial and temporal data often come together (especially in applications related to Smart Cities, Earth observation or digital twins of large industrial assets), calling for explanations with a higher degree of sophistication when presented to non-specialized users. We will elaborate on this claim in Sect. 4.4.

4 Critical analysis, open challenges and research directions

The vast activity noted in the literature is a clear reflection of the technical advances attained lately with bio-inspired computation applied to Big Data. Indeed, manifold domains have capitalized on bio-inspired computation in data-based applications, including energy [198, 199], transport and mobility [60], health [200], industry [201, 202], agriculture [203], cyber-physical systems [20], social networks [204,205,206] or sensor networks [207], among many others. Recent worldwide developments around the COVID-19 pandemic have also ignited research activity on Big Data and Artificial Intelligence (in many cases, using deep neural networks for CT scan-based diagnosis), yet without much evidence that the scales of studies claiming to be Big Data so far can be considered as such [208, 209].

In this section we summarize several weak and promising aspects detected at the intersection between Big Data and bio-inspired computation. As a result of our literature assessment, we have observed that there are still many questions to investigate when hybridizing both paradigms. In what follows, several research niches are enumerated and discussed with respect to the previously analyzed literature. Figure 7 graphically summarizes our prospects for the future of the field.

Fig. 7 Challenges envisioned in the crossroads between bio-inspired computation and Big Data

4.1 Is the community really focusing on big data?

To begin with, a pause for reflection must be made on the short albeit rich history of bio-inspired computation and Big Data. To quantitatively support this reflection, Fig. 8 depicts the number of yearly publications retrieved from the Scopus database when queried with the term Big Data and different concepts related to bio-inspired computation. The corpus of literature is impressive, and keeps growing steadily over the years. However, this seemingly vigorous momentum of the field must be assessed with caution: a large proportion of the works encountered during our examination of the literature made insufficiently justified use of the term Big Data, reporting algorithmic advances and designing experimental setups far from the scales assumed for Big Data scenarios. No evidence was given of the implementation of the algorithm in question in Big Data frameworks, nor were the datasets in use large enough and/or produced fast enough to justify the Big Data label.

Among the reasons for the issue identified above, we underscore the lack of real public datasets and problems that match the scales assumed for Big Data scenarios, either in terms of volume, variety or velocity. For example, Mann et al. [210] have already detected this problem in the health domain, identifying a wide mismatch between the optimism surrounding the solutions implemented by Big Data technologies and the real existence of Big Data problems to test them. This remarkable absence of reference problems calls into question the veracity of bio-inspired solutions, and reduces the impact and soundness of the tools and software frameworks contributed to date. A benchmark comprising several Big Data problems/tasks of realistic scales could be very helpful to set a reference for the assessment of the relative gains claimed by new bio-inspired Big Data solutions. Such a benchmark should pose technical challenges, along at least one of the Big Data dimensions formulated in Sect. 2.1.1, that cannot be tackled by conventional computers and programming tools. Some recent compendiums [211,212,213] can help to discern whether new studies on bio-inspired computation are indeed Big Data or, instead, embrace the term in a less demanding setup. These works also define several metrics that could be used in prospective studies (particularly those related to efficiency all along the Big Data life cycle).

Furthermore, most works have focused on a very narrow portfolio of application scenarios, with the optimization of cloud environments at the forefront of the application of bio-inspired methods. Other Big Data areas, such as security and governance, undoubtedly offer new opportunities that, at present, remain largely uncharted by the community.

Another aspect that casts doubt on the field is the justification of the novelty of a bio-inspired algorithm solely by the metaphor that inspires its design. This is a widely acknowledged concern in bio-inspired computation [5], igniting controversial debates around whether these practices advance knowledge in the field. As in other application domains, we have found evidence that such poor practices also prevail in bio-inspired computation for Big Data: many contributions in this line design biased experimental benchmarks that favor their proposed algorithm and penalize others by, e.g., tuning the parameters of only some of the algorithms in the benchmark, or varying the conditions under which each algorithm is evaluated (different machines, datasets and/or software implementations). Regardless of the true intentions behind these poor practices, prospective studies should be required to provide the means for third parties to validate their results, embracing the recommendations elicited by recent works on this topic [214].

On a constructive note, it is our firm belief that the community should welcome new biological metaphors for improving the efficiency of Big Data systems along their different dimensions. Nevertheless, new works must conform firmly to methodological principles: fairness in the comparisons, experimental replicability and a solid justification of why the design of the algorithm is driven by the requirements of the Big Data task to be solved [215]. Implementations of bio-inspired computation approaches in high-performance languages and platforms are largely available nowadays (including GPU versions of optimization algorithms [216, 217]). Furthermore, large-scale global optimization solvers are also a subject of intense investigation [24]. This provides a solid stepping stone and an unprecedented opportunity for bio-inspired computation to meet the scales of Big Data, leaving behind studies with a loose connection to Big Data requirements and questionable scientific impact.

Fig. 8 Yearly publications retrieved from Scopus by submitting the queries indicated in the legend (as per June 1, 2021). The vertical axis is in logarithmic scale

4.2 Towards a bio-inspired operationalization of big data pipelines

Traditionally, the scientific community in the field of Artificial Intelligence has focused on the development of new algorithms and techniques. These activities are often conducted under laboratory or experimental settings, overlooking real-world potential and risks. An important challenge can be found in this regard, namely the life cycle management of Artificial Intelligence approaches and their implementation and maintenance in production environments.

Related to this, Big Data technologies are complex and numerous, and the lack of adequate tools to automate and operationalize their use and management is a clear problem. In this context, the AIOps concept [218] becomes relevant. AIOps aims to improve and automate all tasks of the software operation phase by employing Artificial Intelligence techniques. As we have analyzed throughout this study, the self-learning capabilities of bio-inspired computation techniques have a lot to offer in this research direction, given that they are widely used in key tasks of the operationalization process, such as optimization tasks [219] and resource planning [220]. Furthermore, the versatility of bio-inspired algorithms can solve complex problems for highly configurable systems [221], as is the case of the Big Data technology stack specialized in analysis and deployment in Cloud Computing infrastructure.

At this point, a relevant distinction must be made between (i) the automatic configuration of data-based pipelines (which are collectively referred to as AutoML methods), and (ii) the automated deployment of such pipelines over the resources available in Big Data infrastructures. Both tasks have recently been tackled in isolation: AutoML has no regard for the available computing resources underneath, nor do deployment tools consider the chance to redesign the data-based pipeline as per the needs and restrictions of the deployment itself. We definitely advocate for more research efforts invested in blending together requirements imposed at the software (data mining, visualization) and hardware (latency, memory, time) levels. Some recent advances have been made in this direction with the proposal of new specification languages that incorporate elements and requirements from both realms for the distribution of analytical pipelines [222]. Nevertheless, there is still a long road ahead to reach enough maturity for the adoption of these advances in real-world production environments.

4.3 Feasibility of bio-inspired computation for real-time big data

A widely acknowledged problem of bio-inspired algorithms is that, in their seminal form, they do not accommodate stringent time constraints such as those emerging in streaming contexts. Instead, the original forms of optimization and modeling approaches are better suited to stationary data contexts, in which all the information from which knowledge is extracted is made available before the data processing and learning phases (batch setting). However, when information flows continuously, in large volumes and at a fast pace, bio-inspired techniques must be endowed with the features (incrementality, resiliency to data changes, efficiency in the consumption of resources, model memory) required to sustain their analysis and produce outcomes in a similar fashion to the batch setting. Renowned benchmarks for Big Data streaming such as Yahoo! Streaming Benchmark [223] and other recent proposals [224,225,226] are designed to pose complex challenges for Big Data processing systems in terms of throughput and latency that propagate to the upper layers, e.g., efficient implementations of algorithms that learn incrementally from data that are available for very short periods of time.
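To make the notion of incrementality concrete, the following sketch shows a single artificial neuron that learns from a stream one sample at a time, keeping only its weights in memory, and that is evaluated in a test-then-train (prequential) fashion. It is an illustrative assumption rather than a method from the cited benchmarks: the synthetic stream, its abrupt change of concept halfway through, and the names `OnlineNeuron`, `stream` and `prequential_accuracy` are all hypothetical.

```python
import math
import random

class OnlineNeuron:
    """Single logistic neuron updated one sample at a time (constant memory)."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))           # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # One stochastic gradient step on the log-loss of this single sample.
        err = self.predict_proba(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err

def stream(n, seed=0):
    """Synthetic stream whose concept changes abruptly halfway through."""
    rng = random.Random(seed)
    for t in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        y = int(x[0] > 0) if t < n // 2 else int(x[1] > 0)
        yield x, y

def prequential_accuracy(model, data):
    """Test-then-train: each sample is predicted first and then learned once."""
    hits = total = 0
    for x, y in data:
        hits += int((model.predict_proba(x) >= 0.5) == bool(y))
        model.learn_one(x, y)
        total += 1
    return hits / total

print(prequential_accuracy(OnlineNeuron(2), stream(2000)))
```

Each sample is discarded right after the update, so memory use does not grow with the length of the stream. The same loop is where drift detectors or forgetting mechanisms would be plugged in to react faster to changes in the data.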

In this regard, it would be interesting to investigate new developments or reimplementations of existing algorithms to adapt them to real-time Big Data contexts, even if it is necessary to consider new strategies and methodologies for the deployment of analytical models in streaming systems [227]. For this to occur in the future, a closer look should be paid to emerging paradigms in bio-inspired computation that are specifically suited to real-time scenarios, such as extremely optimized versions of EC and SI solvers, new forms of neural computation for non-stationary streams, or studies in which the operation of the bio-inspired technique is driven not only by the quality of its output, but also by the complexity of its implementation.

Interestingly, the community has already dedicated notable efforts towards anticipating the above needs in the design of algorithms, yielding research areas of utmost relevance such as dynamic optimization [228, 229], learning models over non-stationary data streams [230] or evolving fuzzy systems [231]. Unfortunately, we note very little evidence that such algorithmic developments can be deployed effectively in Big Data contexts, either for the processing, learning and visualization phases of the Big Data cycle, or for supporting the underlying processes of data fusion, storage and governance (in particular, load balancing, dynamic resource allocation or task scheduling, which are often performed in real time). This is a research niche that should be addressed in the future to shed light on the potential of bio-inspired computation for real-time Big Data platforms.

4.4 Explainable AI (XAI) and big data visualization

Big Data often lies at the core of critical decisions, which in some domains of application may entail severe consequences. Health diagnosis is arguably the most illustrative example supporting this statement. A wrong diagnosis of the patient can lead to a wrongly prescribed therapy. Likewise, if Big Data models fail to detect an illness, the patient at hand might suffer fatal consequences. A similar observation can be made in other domains (e.g., defense, law, state administration), mostly in those where decisions directly affect human life. When this is the case, veracity rises as the Big Data dimension on which a primary focus must be placed, allowing for the quantification of uncertainty, accountability and the delivery of explanations of the insights drawn from data. In other words: for decisions to be fully informed, opaque models should be avoided or, at least, complemented with techniques that allow understanding the reasons why they were made.

Traditionally, visualization tools have been at the forefront of inspecting large volumes of Big Data, seeking new forms of data representation that allow understanding relationships between heterogeneous data and their evolution over space and time. The term visual analytics was actually forged to highlight the potential that a good visualization has to explore and analyze data without resorting to additional models [232]. However, the scales, variety and veracity of current Big Data scenarios make visualization alone no longer sufficient. Powerful bio-inspired modeling approaches such as Deep Learning networks are in many cases the only viable option to analyze Big Data, in some cases surpassing human performance. However, the superior modeling capability of such models clashes with their black-box nature, hindering any chance to explain what they observe in their input data to produce their outputs.

Based on the above rationale, bio-inspired computation for Big Data should massively embrace explainability as one of its main design drivers, either by developing new approaches from scratch that are more algorithmically transparent than their predecessors, or by incorporating tools that provide such explanations. The design of these explainability tools is the motivation behind the upsurge of XAI [233, 234] witnessed in the last couple of years. Specifically, XAI refers to methods and techniques developed to ease the interpretation and understanding, by humans, of decisions made by Artificial Intelligence models, regardless of their expertise or background in this discipline. Another akin research area that contributes to the trustworthiness of Big Data decisions is confidence estimation, namely, the quantitative evaluation of the epistemic uncertainty of Artificial Intelligence models. Since most bio-inspired learning algorithms are controlled by stochastic processes (for instance, stochastic gradient descent in neural networks, or the search operators in EC- and SI-based search meta-heuristics), a very relevant piece of side information is the variability of the output with respect to the input data and the distribution of the stochastic components of the model.
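One simple way to expose the variability induced by the stochastic components of a model is to repeat its training under different random seeds and report the spread of its outputs. The sketch below does so for a toy regression task fitted with a (1+1) evolution strategy. It is only a rough proxy under stated assumptions, not a complete treatment of epistemic uncertainty: the noise-free data, the solver settings and the names `fit_line_es` and `predict_with_spread` are hypothetical.

```python
import random
import statistics

def fit_line_es(data, seed, generations=300, sigma=0.1):
    """Fit y = a*x + b with a (1+1) evolution strategy; the result depends on seed."""
    rng = random.Random(seed)

    def mse(params):
        a, b = params
        return sum((a * x + b - y) ** 2 for x, y in data) / len(data)

    parent = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    best = mse(parent)
    for _ in range(generations):
        # Gaussian mutation of both parameters; keep the child if not worse.
        child = (parent[0] + rng.gauss(0, sigma), parent[1] + rng.gauss(0, sigma))
        score = mse(child)
        if score <= best:
            parent, best = child, score
    return parent

def predict_with_spread(data, x_query, n_runs=20):
    """Mean and sample standard deviation of the prediction across seeds."""
    preds = []
    for seed in range(n_runs):
        a, b = fit_line_es(data, seed)
        preds.append(a * x_query + b)
    return statistics.mean(preds), statistics.stdev(preds)

# Toy data (assumed): points on the line y = 2x + 1 for x in [0, 1].
data = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
mean, spread = predict_with_spread(data, x_query=0.5)
print(mean, spread)
```

A large spread across seeds signals that the reported output depends heavily on the random choices of the solver, which is the kind of side information the paragraph above argues should accompany a decision. Variability with respect to the input data could be probed in the same loop by resampling `data` at every run.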

When Big Data applications are endowed with functionalities to explain decisions and estimate the confidence of the deployed algorithms, the entire Big Data life cycle can become trustworthy, ensuring that the veracity dimension is appropriately considered. Nonetheless, most existing work published nowadays focuses on new algorithms and applications, stressing performance rather than usability and interpretability for real users. We believe that it is now time to go beyond performance and focus on practical value, bridging the gap between the achievements reported by academia and the real-world problems faced by practitioners in their respective sectors [235]. For this purpose, and in accordance with recent studies [236, 237], Big Data visualization must enter the XAI arena and help depict highly dimensional explanations of outputs produced by bio-inspired models in an understandable manner. For this to occur, we foresee that the functionalities currently under development in the XAI research field should grow in maturity and be suitably adapted to deal with models distributed over computing nodes, each learning from different data silos. Specifically, the multi-modality of data present in a significant segment of Big Data applications (those capturing data over both space and time, e.g., Smart Cities, transport, Earth observation) requires a new generation of explainability tools that allow humans to reason about patterns and explanations held over such domains simultaneously [238].

4.5 Big data governance and security

Another challenge emerges from all those activities necessary for the data to be correctly and fairly managed, secured and traced, which are collectively called data governance. The characteristics of modern bio-inspired Deep Learning models (in particular, their capability to ingest and fuse different information flows along the learning process) usually pose a severe threat to data governance approaches, especially with regard to privacy regulation and informed consent. Enhanced governance techniques and tools are required to help preserve the autonomy and rights of individuals to control their personal information, and to guarantee that protected data remains as such over the entire Big Data cycle. There are already works focused on studying the maintenance of privacy in the analysis of personal data [239], and the achievement of traceability of the data flow during the analysis process [80, 240]. It is undeniable that techniques such as differential privacy, federated learning and homomorphic encryption are expected to play a major role in Big Data governance for years to come. However, the question remains whether current bio-inspired computation techniques will smoothly accommodate the assumptions and restrictions imposed by these emerging privacy-preserving methods.

A related research direction is that of security. In recent years, vibrant activity has been noted around the development of algorithms for ensuring confidentiality, integrity, and availability in complex data-based systems. It is well established that the existing cyber-infrastructure has numerous inherent limitations that prevent the maintenance of current network security devices from scaling well, and that provide the adversary with asymmetric advantages. For example, cybersecurity, with problems such as spam filtering [241] or real-time intrusion detection [242,243,244], is a research area in which numerous studies try to bring the advantages of bio-inspired computation to this kind of system. The reality is that security is an indispensable and complex requirement in any system, for which bio-inspired approaches can yield a competitive advantage. This claim is supported by the current literature, where bio-inspired algorithms are a promising approach currently yielding great results in Cloud Computing environments [245]. The huge amount of logging information generated by complex Big Data infrastructures is, without a doubt, a rich substrate for detecting, identifying and counteracting security threats. The self-organizing nature of bio-inspired computation can provide the required level of robustness and resilience against such threats, especially approaches inspired by artificial immune systems for authentication and access control systems [246], evolutionary algorithms as constituent parts of intrusion detection systems relying on predictive modeling [247], or swarm intelligence methods for forensic analysis [248]. The record of successes around the application of bio-inspired methods to the security of complex networked systems is encouraging evidence in favor of embracing them massively in the Big Data realm.

5 Conclusions and outlook

We live in the era of digitization, which has caused an explosion of data in sectors that had traditionally lagged behind in the adoption of information and communication technologies. Consequently, multiple opportunities to generate value from data have emerged in almost all sectors. In this context, Big Data encompasses all tools and technologies that support the efficient analysis of data produced at volumes, rates and heterogeneity levels that cannot be managed by traditional means. Big Data systems are being increasingly adopted by enterprises that exploit applications to manage data-driven processes, practices and systems in a business-wide context. Specifically, Big Data systems and their underlying applications empower enterprises with analytical decision making for optimizing organizational productivity and competitiveness.

Despite the above benefits, the stringent operational conditions under which Big Data platforms operate demand several capabilities from their underlying processes, technologies and algorithms. Among them, in this survey we have focused on adaptability to data changes, scalability, computational efficiency, flexibility, integrability and uncertainty modeling. All these requirements address well-known issues arising from different phases of the Big Data life cycle. In this regard, we have stressed the pivotal role that bio-inspired computation can play for Big Data technologies to acquire and effectively provide such functionalities. Indeed, modeling, simulation and optimization tasks can be formulated at different phases of the life cycle, and biologically inspired methods have been applied to them. To properly inform the audience about the history of bio-inspired Big Data, we have performed a critical literature analysis along two axes: i) the Big Data technology that benefits from the application of bio-inspired methods (infrastructure, NoSQL database technology, network and parallel/distributed computing model); and ii) the Big Data life cycle phase in question (data fusion, storage, processing, learning and visualization). Relevant references have been thoroughly discussed, unveiling research trends and niches that remain open in the field.

As a result of our critical examination of the literature, we have outlined several research directions that may effectively deal with the main challenges in bio-inspired Big Data. Three of them stand out as those that deserve more research efforts in years to come:

  • Common methodological grounds in the proposal of new bio-inspired algorithms for Big Data, including the adoption of good practices and recommendations to ensure their scientific and practical value.

  • An explicit consideration of complexity in the design of new algorithms, especially those for real-time environments, avoiding by all means the use of the term Big Data to refer to problems and scenarios that do not correspond to the expected scales of this paradigm.

  • A close look at the possibilities brought by avant-garde research areas in bio-inspired computation, such as XAI as a core element adding value to the data visualization phase of the Big Data life cycle.

This survey intends to serve as a smooth entry point for practitioners and newcomers interested in performing research around bio-inspired Big Data technologies. The natural behaviors that inspire bio-inspired computation techniques embody thousands of years of accumulated experience in addressing complex modeling, simulation and optimization tasks. It is therefore natural to expect that Big Data technologies, given the scales, variability and uncertainty of the problems they tackle nowadays, should leverage the capabilities offered by bio-inspired methods. Nature knows how to best adapt to changes, scale up nicely under environmental pressure and resiliently react against threats. Bio-inspired Big Data is, on balance, a natural choice.