Abstract

The accurate prediction of genetic networks using computational tools is one of the greatest challenges in the postgenomic era. Recurrent Neural Network is one of the most popular but simple approaches to model the network dynamics from time-series microarray data. To date, it has been successfully applied to computationally derive small-scale artificial and real-world genetic networks with high accuracy. However, they underperformed for large-scale genetic networks. Here, a new methodology has been proposed where a hybrid Cuckoo Search-Flower Pollination Algorithm has been implemented with Recurrent Neural Network. Cuckoo Search is used to search the best combination of regulators. Moreover, Flower Pollination Algorithm is applied to optimize the model parameters of the Recurrent Neural Network formalism. Initially, the proposed method is tested on a benchmark large-scale artificial network for both noiseless and noisy data. The results obtained show that the proposed methodology is capable of increasing the inference of correct regulations and decreasing false regulations to a high degree. Secondly, the proposed methodology has been validated against the real-world dataset of the DNA SOS repair network of Escherichia coli. However, the proposed method sacrifices computational time complexity in both cases due to the hybrid optimization process.

1. Introduction

A gene regulatory network (GRN) represents the regulatory behaviour or dependencies among a group of genes inside a cell. A GRN is characterized by a directed graph in which nodes denote genes, and regulatory dependencies among genes are depicted by directed edges between the corresponding nodes. There are two types of interaction, namely, activation and inhibition. This kind of network is unique for particular functions within a cell. Thus, the study of GRN is essential to ascertain the genetic causes of a particular disease. As a consequence of this, scientists can venture into the development of new and improved techniques for the treatment of a disease [1].

Nowadays, DNA microarrays [2, 3] are extensively utilized for the investigation of finding reasons for different illnesses. A microarray dataset contains the gene expression levels of millions of genes of a species under investigation, at a particular condition or set of conditions. Time series microarray database consists of changes in the expression of genes with time in response to some disease-causing events or any form of treatment at different time points. This includes essential information regarding the dynamic behaviour as well as the dependencies of genes.

In this work, Recurrent Neural Network (RNN) [4] which is a closed loop Neural Network with a delayed feedback has been used to model dynamics and dependencies of the genetic system from temporal genetic data. Using suitable optimization methods, RNN based GRN can be inferred for the purpose, where the objective function of optimization is chosen so that it becomes proportional to the training error. Different metaheuristics techniques, namely, Genetic Algorithm (GA) [5], Particle Swarm Optimization (PSO) [6], -Means Population-Based Incremental Learning (KPBIL) [7], Invasive Weed and Artificial Bee Colony (ABC) [8], PSO and Ant Colony Optimization (ACO) [9], Bat Algorithm [10], have been implemented to infer GRNs from the time series microarray data. Computational reconstruction of GRN is essentially a reverse engineering problem. All the approaches mentioned above are tested against both small-scale artificial and real-life GRNs. However, Noman et al. [11] proposed Decoupled Recurrent Neural Network, which was trained by Differential Evolution (DE), where a penalty term or L1 regularizer was introduced into the objective function to balance between the accuracy of the parameters and the actual network structure. The model mentioned above [11] is very efficient in finding all the valid regulations and accurately reproducing the dynamics of both of the small networks studied. However, the main disadvantage of this model is that the given model predicts a large number of false regulations for more extensive or larger networks. The balance between fitness value and actual network structure for large-scale networks is the primary concern of the reverse engineering problem of GRN, and it is still an open area of research.

In this paper, a hybrid Cuckoo Search (CS)-Flower Pollination Algorithm (FPA), CS-FPA, is proposed for the inference of GRNs from time-series data. FPA is used to train the RNN parameters, and CS is introduced to find the biologically plausible network architecture to select the best combination of genes that are responsible for modifying the expression of each gene. The preliminary notions of RNN, CS, and FPA are discussed in the next section. The details of the fitness function of FPA for decoupled RNN and the learning process for finding the actual structure of a GRN using CS are discussed in Section 3. Next, the effectiveness of the proposed CS-FPA based RNN model is tested against a large artificial GRN without the presence of noise as well as with noisy data. The result is also compared with other state-of-the-art methods. The conclusion is given in Section 4.

2. Theoretical Background

We have discussed the fundamental theoretical concepts of RNN, CS, and FPA, in this section, for a better understanding of the proposed methodology.

2.1. Preliminaries of RNN

The RNN model [12] is a closed loop Artificial Neural Network that has a delay variable, between the outputs of each neuron in the output layer of the RNN, to each of the neurons in the input layer, which is suitable to model temporal behaviour or dynamics of data (Figure 1). For a canonical RNN model [49, 11] it is assumed that each of the total output neurons in the unit is a gene expression value of next time instant , and the input neurons are the gene expression of present state for the same genes; thus they interact with each and every one in regenerative way:

Here, is usually a sigmoid function; is used as a classification function; and is the weight of inputs of the RNN model, and it stands for the type and strength of the regulatory interaction of the th gene with the th gene. From the point of view of a GRN, each node corresponds to a gene, and a connection between two nodes defines their interaction. The weight values can be either positive, negative, or zero; is the most significant term for a GRN as the value of is the connecting weight of an edge of the GRN, which represent the connections between gene- and gene-. A positive value of represents activation of gene- by gene-, a negative value denotes repression or inhibition of gene- by gene-, and means that gene- has no regulatory control on gene-. The term represents the basal expression level or a bias term, and is a time constant (delay) of the th gene; is incremental time instance; in this work it is always set as 1. Thus, any RNN model can be expressed by a set of parameters, , where .

2.2. Preliminaries of Cuckoo Search (CS) Optimization

Yang [13] first proposed Cuckoo Search optimization [1418] based on brood parasitism of cuckoo birds that reproduce their eggs by utilizing nests of other host birds. These birds have the ability to use other birds for raising the new generation. Cuckoo lay their eggs, one or more than one, in the nest of the host birds in their absence using Lévy flight. Lévy flight [19] is an important characteristic of CS. Levy flight is defined as a random movement done by the birds with a step value of distributed probability. While new solution for a cuckoo , Lévy flight is performed as

Here is the step size; means entrywise multiplications. Levy flights essentially provide a random walk and the random steps are drawn from a Lévy distribution [13, 19] as follows:

In this equation, is the standard gamma function. This distribution is valid for large steps . For optimization problems, one cuckoo nest corresponds to one solution of an optimization problem. When the host bird recognizes the alien eggs, the host bird may destroy the eggs or may leave its nest and build a new nest with a certain probability . To avoid this, cuckoos learn to make eggs similar to the host bird’s eggs. However, the highest quality nest with eggs (i.e., best solutions) will be selected to move over to the next generation where the quality of an egg or fitness of a solution is simply proportional to the value of objective function.

2.3. Preliminaries of Flower Pollination Algorithm (FPA)

FPA [20] is typically associated with the transfer of pollen for reproduction or flowering of plants, and pollinators such as insects, birds, and bats are mainly responsible for such transfer. FPA [2123] is a recently proposed metaheuristic that is based on some simplified rules for pollination. Biotic cross-pollination can be assumed as a process of global pollination, and pollen carrying pollinators follow Lévy flights during transport (Rule  1). For local pollination, abiotic pollination and self-pollination are used (Rule  2). Pollinators may develop flower reliability, which is proportional to the resemblance of two flowers, that is, reproduction probability (Rule  3). The switching of local to global pollination can be controlled by a switch probability , slightly biased towards local pollination (Rule  4). Here, each pollen or flower corresponds to a solution of the optimization problem being considered.

Global and local pollination (i.e., search) are done according to the following two equations [21], respectively:

Here, is the pollen or solution vector at iteration , is the scaling factor to control the step, is the current best solution found among all solutions at the current iteration, and are pollens from the different flowers of the same plant species, and stands for random walk step size within a uniform distribution in . The reason behind selecting FPA as optimization method is that it gives better convergence and accuracy than other popular metaheuristic techniques [21].

3. Methodology

The RNN formalism is based on a set of parameters, , , , which we refer to as RNN model parameters in this work. The reverse engineering of GRNs from temporal microarray data requires finding the optimum values of the RNN parameters with the help of optimization techniques such that the training error is minimized.

3.1. Model Used for Validation

We choose a large artificial network with 30 genes, to investigate the inference capability of the proposed algorithm. This network structure is very sparse in nature, and it has already been studied in [11]. The parameters of this architecture were chosen arbitrarily as shown in Table 1 where there are only 36 connections or regulations in the network; for example, , signifying an edge (a regulatory relationship) that exists between Gene 1 and Gene 14 in the GRN, and the negative value implies that Gene 14 inhibits or suppresses the expression of Gene 1. Any negative value denotes inhibition or repression while a positive value implies activation. The numerical value has only mathematical significance for the RNN formalism implemented here. Only the nature of the value of , that is, whether , 0, or , has biological significance.

The parameters and also have no significance in the biological structure of a GRN. They are required for the mathematical modelling of the RNN formalism. They basically determine the bias and rate of change of dynamics for a particular gene. Using (1), 5 time series datasets, with 50 time points in each, are generated for learning of RNN assuming that . These time series data are used as the training data, represented by a gene expression matrix where each column indicates a gene and each row indicates a time point; the data in each cell means the gene expression level of a particular gene at a particular time point.

The corresponding GRN is shown in Figure 2, where arrow-head and T-head denote activation and suppression, respectively, for positive and negative weights, respectively, between any two genes.

3.2. Decoupled Recurrent Neural Network Training

In all optimization methods, an objective/fitness function is used to measure the quality of a solution. One of the most regularly implemented estimation criteria is the squared error that is defined as follows:

In the above equation, is the number of genes in the GRN, is the number of sampling instances of the observed gene expression data, and is the number of training datasets; is the numerically calculated gene expression value of the th gene at time , for the th dataset; is the actual gene expression level of the th gene at time , for the th dataset. The function represents the total squared error between the calculated and experimental gene expression data.

Thus, the training of the RNN model parameters is, in essence, a nonlinear optimization problem. The primary objective is to determine the optimal RNN parameters such that the mean square error is minimised, and the calculated gene expression data fits with the observed gene expression data. Since, for genes, parameters must be determined to find the solution of a set of equations as in (1), the RNN model parameter search space is of -dimensional space (here 960). However, this space becomes too computationally expensive in the case of large-scale genetic networks. Therefore, the problem of inferring a GRN from temporal expression data is decoupled or divided into several subproblems corresponding to each of the genes. Now, the objective of each subproblem corresponding to th gene is to find the values of decoupled RNN model parameters which minimizes the prediction. Thus, we define the objective function for the th gene only as

Hence, to solve the differential equation (1), the number of RNN parameters needed to determine is only parameters for the th gene only. Thus, this decoupling method divides a -dimensional problem subproblems of dimension . Accumulating the parameters of all genes, the overall structure of the RNN model can be achieved which in turn will define the inferred GRN.

3.3. CS and FPA Hybridization

Real-life GRNs are sparsely connected; that is, very few connections exist among genes, and the measured data are also very noisy. Thus, for GRN inference problems, it is expected that the values of a majority of the weight parameters are zero. Nevertheless, it is found that though the complexity is reduced up to a certain limit by using a decoupled strategy, it may lead to different solutions. Each solution may have very low error value depending on the different connectivities or structures among the genes in the GRN and the corresponding values of the kinetic parameters. This is known as the overfitting problem that occurs mainly due to the nonlinearity of the RNN methodology, large search space, and a large number of parameters (here 32 for the decoupled RNN formalism, i.e., still quite large) to be optimized. Thus, to overcome this problem in real-life genetic networks, a balance between prediction error minimization and actual regulatory structure of GRN needs to be achieved.

The proposed Cuckoo Search-Flower Pollination Algorithm hybrid approach utilizes a single-layer RNN formalism to determine the weights of the edges between the input and output layers of RNN, with the help of FPA that minimizes the error between the calculated and experimental gene expression profiles. This RNN structure is hybridized with a new metaheuristic CS algorithm for creating hypothetical interconnections or regulatory edges among genes by selecting the best combination of affecting/regulatory genes. Moreover, the maximum connectivity of each gene is restricted to as it is observed that real-life GRNs are sparsely connected; that is, very few genes participate in actual regulations.

In our present problem, we have chosen the value of ; that is, we assume that at most any three genes can affect any single target gene. CS technique has been introduced to detect these three regulatory genes, and FPA is used to know the weight of the RNN model where these three regulatory genes act as the input node of RNN and the target gene acts as the output node. The RNN is implemented to determine how the gene expression levels of those three genes, at a particular time point, affect the target gene’s expression value at the next time instance. Assuming this, we assign the value of to be zero except for those three regulatory genes and the search space thus becomes 5-dimensional (3 weight, 1 bias, and 1 delay parameter). It may be possible that, for a particular regulatory gene set, one or more becomes zero after optimization using FPA. This implies no regulation for those genes with zero value of .

The CS is initialized randomly with a population of different solutions or eggs on the different random nest, and the quality of each egg in the host nest is calculated using FPA based RNN to observe the impact of regulatory genes on other genes in the GRN as part of one generation. However, we initialized with a population of 10, each consisting of 3 different combinations of genes starting from . This type of predefined initialization is used as it increases the probability of covering all regulatory genes without repetition. Search range for the model is chosen as previous work [11] like , , and , respectively.

The quality of a host nest or fitness of a solution is simply proportional to the objective function that is the resultant error of RNN for the set of particular genes or nest. FPA always tries to minimize the training error by optimizing the value of the RNN model parameters. During training of RNN, each pair in the training dataset contains the gene expression values for only 3 regulatory genes from the one time instance of the microarray data to be the input values to the RNN model, and the expression value of the target gene at the next time instance of the microarray data is the target output of the RNN. This helps to reduce the execution time by reducing the unnecessary calculations corresponding to nonregulatory genes. Moreover, it is observed that during optimization if the fitness value for FPA based RNN becomes less than 1 × 10−8; then it gives the almost actual value of the parameters of RNN for the regulatory genes. Therefore, a stopping criterion is introduced to minimize the execution time of the algorithm; that is, if the fitness value for a particular gene set becomes less than 1 × 10−8 after some iterations, the program stops execution instantly. Moreover, the corresponding genes set becomes the desired output.

Sometimes, it is found that if any solution does not consist of the actual regulatory genes, then the convergence is very slow which may increase the execution time as it is not able to go below the value 1 × 10−8 even after a large number of iterations. Therefore, another alternative stopping criterion is also imposed which can be stated as follows: if the difference between current best fitness value and fitness value of the (current-200)th iteration is less than 1 × 10−10, then the program execution will also stop. Better host nest with better quality eggs of each generation will move to the next generations. After successful completion of all iterations, we have a set of regulatory genes which can affect the target gene most. This process is repeated for all 30 genes one by one to get the final GRN structure.

Here inputs are the time series data, generated by (1), using parameters as shown in Table 1, and the outputs are a combination of 3 regulatory genes (th) for a particular target gene (th) along with the value of regulatory weights, that is, , corresponding to those 3 regulatory genes for which the minimum or optimum fitness value is achieved. It is worth mentioning that for a target gene we always get a combination of three responsible genes, but this does not mean that during GRN reconstruction there will always be regulatory edges from those regulatory genes towards target gene. The existence of a directed edge in a network also depends on the weights between target and regulatory genes. If amplitudes of weights are zero or very small, there will be no edges for the target gene; that is, there will be noninteraction. Small perturbation of obtained weights from actual one can be ignored unless and until it does not change the polarity or sign, that is, the type of regulation. Moreover, small values of weights can be considered as zero. It is believed that if the number of iterations and population size are large enough, these small perturbations can also be avoided, but execution time will also increase consequently. Thus, during reconstruction of GRN, both the set of regulatory genes and the value of the corresponding gene’s weights must be kept in mind. The pseudocode of this CS-FPA hybrid is given as in Algorithm 1.

For gn (gene) = 1 : 30
   Initialize the population para with host nests (solutions) with
   dimension = 3 that is .
For  : 10
   Calculate fob (fitness value) for all 10 solutions using RNN and FPA;
If ()
  Break;
End if;
  End for;
  If (minimum (fob) )
   For  : max iteration (100)
  Randomly select a cuckoo (th) avoid current best;
  Randomly generate another nest (th) keeping current best nests by Lévy flights;
  If ()
   is replaced by the news solution;
  End if;
  If ()
  Break;
  End;
  Discard the worse nests with a fractional probability ();
   Keep the highest quality nest that is best solution with best fitness value;
   Rank the available solutions and locate the current best;
   End for;
  End if;
End for;
Post-processing and visualization of GRN;
Function fob (para, gn)
 Initialize a rnnpara population of nf (30) with dimension df (5) pollen randomly with a switch probability
 Find the fitness fun for all solutions and best pollen among them
 For (tf < MaxGeneration (2000))
  For  : nf
   If ,
  Draw (df-dimensional) step vector L which obeys a Lévy distribution
   Global pollination via equation (4)
  Else
   Draw from a uniform distribution in
   Randomly choose jf and kf among all the solutions
   Do local pollination via equation (5)
   End if
   Evaluate fitness fun of new solutions of pollens
   If new solutions are better
  Update them in the population
   End if
  End for
  Find store the best fitness of current iteration
  If ((bestfitness (tf) <) (() && ((bestfitness () − bestfitness (tf)) < )))
    Break
  End if
 End for
 Return bestfitness
End fob
Function fun (rnnpara, para, gn)
 Define (5) times series data with (50) sample point
 Calculate the gene expression value of next time instance using equation (1)
 Determine the squared error using equation (7)
 Return error
End fun
3.4. Results for Artificial Dataset

After execution of the proposed methodology, we get the hypothetical interconnections among genes regarding weights of the RNN model. Now, the performance of any inference algorithm for GRN is measured regarding sensitivity (), specificity (), accuracy, and Matthews Correlation Coefficient (MCC) [11, 24] which are defined as follows:

TP (True Positive) denotes the number of correctly predicted regulations, and TN (True Negative) represents the number of properly predicted nonregulations. FP (False Positive) denotes the number of incorrectly predicted regulations, and FN (False Negative) represents the number of falsely predicted nonregulations by the inference algorithm. All the experiments have been performed using MATLAB R2009b, running on Windows 7 with an Intel© Dual Core processor and 2 GB of RAM. In this work, both noiseless and noisy data are considered for the construction of the artificial network to validate the proposed algorithm. Validation of the proposed method is achieved in terms of correct prediction of RNN model parameters, that is, correct prediction of regulations in the GRN. The artificial time series datasets generated using (1) are used as training inputs.

For the reconstruction of a GRN consisting of 30 genes, we need an adjacency matrix (here weight matrix of RNN) of dimension. In this case, there are only 36 regulations, that is, 36 nonzero value of weights (see the Supplementary Material available online at http://dx.doi.org/10.1155/2016/5283937). For noiseless data, using RNN model, we found that the proposed algorithm can predict 32 regulations correctly compared to those which were present in actual GRN, and also includes only four unwanted regulations. Thus, , , , and . It is found that the proposed algorithm is able to detect maximum correct regulations or TPs with a very good accuracy (for maximum cases, the fitness value went below ). Though this method only detects 4 false regulations, 4 true regulations are also missing. Nevertheless its inference capability or performance is quite satisfactory and better than other state-of-the-art methods like DE [11].

Next, we add 5% random noise to the initial training dataset and apply the CS-FPA hybrid algorithm to it to check the robustness against noise like real-world data. It is observed that, in the presence of noise, the number of FPs increased significantly. However, it can still identify all the TPs (previously inferred) with good accuracy. For noisy data, , , , and . Table 2 shows a comparative study of the performance of the proposed methodology with earlier work with respect to specificity and sensitivity.

It is quite interesting to observe that performance (in terms of sensitivity) of CS-FPA does not change for detecting true positive regulations in the presence of noise whereas sensitivity of DE [11] drastically falls to very low value in the presence on noise that denotes inability of DE [11] for finding actual regulation in the presence of noise. Moreover, specificity, accuracy, and MCC are better than DE [11], and values of these parameters do not change significantly due to noise though some false regulations are included into the GRN. However, due to a parallel implementation of two metaheuristic for one problem, overall computational time complexity is still significant although several break conditions are applied to minimize the execution time as much as possible which is 1.5 hours on average for this large network model.

3.5. Results for Real-Time Dataset of E. coli

Microarray experiments on the SOS DNA repair network for E. coli [25] were first done by the Uri Alon group [26]. The experimental datasets are considered as the benchmark for the evaluation of algorithms for reverse engineering of GRNs from real-world datasets. In the SOS network, 8 genes were considered (uvrD, lexA, umuD, recA, uvrA, uvrY, ruvA, and polB as shown in Figure 3) due to their significant involvement in the process of DNA repair. During their experiments, the E. coli cells were irradiated with UV light. Four experiments were performed with different UV light intensities. Each experiment consists of 50 time steps spaced by 6 minutes for each of the eight genes. However, in this work, the first dataset with all eight genes has been considered where it is being preprocessed by neglecting first time point (zero) normalizing in the range .

Search range for the model is chosen as previous work [11] like , , and , respectively. For CS, the value of is set as 2 as the number of available genes is limited to 8 only and other parameter settings remain the same. If we apply CS-FPA hybrid to the SOS dataset, only 4 out of 9 potential regulations can be appropriately predicted from the data, which are the inhibition of lexA on uvrD, umuD, recA, and ruvA. Moreover, it also includes 11 false regulations in the network which is a disadvantage of this process. The results are shown in Table 3.

Furthermore, using two or more time series does not yield any enhancement for the results. The cause behind this is that inference of real-time GRNs is an ill-posed problem that has no unique solution. Noise and measurement error in this type of real-time series microarray is another issue. This inherent difficulty is a limitation of our proposed method.

4. Conclusion

Various researchers have already proposed numerous techniques to solve the reverse engineering problem of GRNs from temporal genetic expression data in the domain of computational biology and bioinformatics. It is imperative to enhance the accuracy of the inference algorithms as well as to reduce the number of incorrect predictions (i.e., FPs) within a plausible runtime.

The RNN formalism is a very popular candidate for inferring GRNs from microarray gene expression data regarding biological plausibility and computational efficiency. In this work, we have implemented the decoupled RNN model where the regulatory parameters of each gene are calculated independently in separate search instances. We incorporated hybridized technique where two metaheuristics are paired to obtain the RNN based GRN model with less search space, less computational complexity, and more accuracy. In this paper, hybridized CS-FPA is proposed for reconstruction of GRNs where FPA is used to train the RNN parameters, and CS is introduced to select the best combination of genes that are responsible for modifying the expression of each gene. Moreover, the maximum connectivity of each gene is restricted to as it is observed that real-life GRNs are sparsely connected; that is, very few genes participate in regulations.

To prove the efficiency of this inference algorithm, it is applied to a benchmark problem of the artificial network with 30 genes with and without noise. With the use of fewer data points, CS-FPA based RNN can infer the network with very high accuracy. However, in the presence of noise, the number of FPs increases significantly, but it can still identify all TPs (inferred in the noiseless scenario) with good accuracy. It is also found that noise robustness is better than other existing methods for artificial data. In the instance of the E. coli dataset, it can detect only four true regulations and includes some false regulations.

Another important observation that is apparent from our results is that the proposed methodology can reconstruct the large artificial GRNs more efficiently than that of real-life GRNs. However, this needs further study on different networks available to us, and the existing boundary of our work validates this observation. In the future, various regularization techniques, the inclusion of prior knowledge about GRNs, and parallel computing methods may be utilized to improve the accuracy and speed further.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Supplementary Materials

We have included the following in the supplementary materials:

(i) The noiseless dataset of the artificial network, used for training, and the corresponding results.

(ii) The dataset of the artificial network with 5% noise added, and the corresponding results.

(iii) The dataset of the E. coli DNA SOS repair network, and the corresponding results.

(iv) All the codes developed by the authors.

  1. Supplementary Material