1 State-of-the-art

Motif finding is a well-studied problem in computing. Various motif search algorithms have been developed, falling into two main categories: heuristic and exact. Heuristic algorithms perform an iterative local search, for instance by repeatedly refining an input sampling or projection until a motif is found. Gibbs sampling and Expectation Maximization (EM), the latter used in the motif-finding tool MEME, both use probabilistic computations to optimize an initial random alignment. [An alignment is simply a vector (a1, a2,…,an) of n positions, which predicts that the motif occurs at position ai in the given sequence Si.] Gibbs sampling tries to refine the alignment one position at a time; in contrast, EM may recompute the entire alignment in a single iteration. Projection combines a pattern-based approach with EM’s probabilistic approach, trying to guess every successive character of a tentative motif and using EM to verify its guesses. GARPS uses a random version of projection, in tandem with a genetic algorithm (GA), for yet another iterative approach. These are just some of many successful heuristic algorithms [1, 2]. However, heuristics are non-exhaustive and are therefore not guaranteed to find a solution. Exact algorithms, on the other hand, perform an exhaustive search of possible motifs and so always find the planted motif.
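To make the notion of an alignment concrete, the following minimal sketch extracts the l-mers that an alignment vector points to and scores them against their column-wise consensus. It illustrates only the data structure, not Gibbs sampling or EM themselves. The function names, the 0-based positions, and the toy sequences are assumptions made for this example.

```python
from collections import Counter

def alignment_lmers(sequences, alignment, l):
    """Return the l-mer starting at position a_i in each sequence S_i (0-based)."""
    return [s[a:a + l] for s, a in zip(sequences, alignment)]

def consensus_and_score(lmers):
    """Column-wise majority consensus and the number of characters agreeing with it."""
    consensus = []
    score = 0
    for column in zip(*lmers):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base)
        score += count
    return "".join(consensus), score

# Toy input (hypothetical): three sequences and one candidate alignment.
sequences = ["TTACGTAA", "ACGTCCCC", "GGGACGAG"]
alignment = (2, 0, 3)
picked = alignment_lmers(sequences, alignment, l=4)
result = consensus_and_score(picked)
print(picked, result)
```

An iterative heuristic would repeatedly adjust the alignment vector to increase such a score; how it does so is what distinguishes the methods described above.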

WINNOWER and its successor MITRA are exact algorithms that look at pairwise l-mer similarity to find motifs. In a set of DNA sequences, there are numerous pairs of “similar” l-mers, which come from different sequences and have Hamming distances of at most 2d from each other (meaning that they could be two d neighbors of the same l-mer). WINNOWER represents these pairs in a graph, with l-mers as nodes and edges connecting l-mer pairs. It then prunes the graph to identify “cliques” of pairs that indicate a motif. MITRA refines this graph representation into a mismatch tree containing all possible l-mers, organized by prefix [3, 4]. The tree structure allows MITRA to eliminate entire branches at a time, making it faster than WINNOWER at removing the spurious edges that are not part of any motif clique.
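The pairwise similarity criterion can be sketched as follows: two l-mers from different sequences are connected when their Hamming distance is at most 2d. This is only a naive construction of the edge list, assuming toy inputs and hypothetical function names; it does not implement WINNOWER's pruning or MITRA's mismatch tree.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def lmers(seq, l):
    """All l-mers of seq, in order of starting position."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]

def similar_pairs(sequences, l, d):
    """Edges ((i, p), (j, q)) linking the l-mer at position p of sequence i
    to the l-mer at position q of sequence j when their distance is <= 2d."""
    edges = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for p, u in enumerate(lmers(s, l)):
            for q, v in enumerate(lmers(t, l)):
                if hamming(u, v) <= 2 * d:
                    edges.append(((i, p), (j, q)))
    return edges

# Toy input (hypothetical): each sequence holds exactly one 4-mer.
edges = similar_pairs(["AAAA", "AATT", "TTTT"], l=4, d=1)
print(edges)
```

Here the first and third sequences are not linked, because their l-mers differ in four positions, which exceeds 2d = 2.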

The current state-of-the-art in exact motif search is qPMS9, the most recent in a series of Planted Motif Search algorithms. It performs a sample-driven step, which generates a k-tuple of l-mers from each of k input strings, followed by a pattern-driven step, which generates the common d-neighborhood of the tuple and then checks whether any of the l-mers in this common neighborhood is a motif. To identify neighbors, qPMS9 efficiently traverses the tree of all possible l-mers, using certain pruning criteria explored by predecessors PMSPrune and qPMS7 to quickly discard non-neighbor branches. Sampling in qPMS9 is an improvement on its predecessor PMS8; in building a k-tuple, qPMS9 intelligently prioritizes l-mers that have fewer matches with the l-mers already selected, such that the common d-neighborhood becomes smaller and thus faster to check through. Finally, PMS8 and qPMS9 have been implemented to run on multiple processors, allowing them to solve instances with (l, d) as large as (50, 21) in a few hours.
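The pattern-driven step can be illustrated with a brute-force sketch: enumerate every l-mer over the alphabet, keep those within distance d of all l-mers in the tuple, and then test each survivor against all input strings. This is an assumption-laden toy version with hypothetical function names. It enumerates all 4^l candidates, whereas qPMS9 avoids that cost through its pruned tree traversal, which is not reproduced here.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def common_neighborhood(lmer_tuple, d, alphabet="ACGT"):
    """All l-mers within Hamming distance d of every l-mer in the tuple."""
    l = len(lmer_tuple[0])
    neighbors = []
    for chars in product(alphabet, repeat=l):
        candidate = "".join(chars)
        if all(hamming(candidate, x) <= d for x in lmer_tuple):
            neighbors.append(candidate)
    return neighbors

def is_motif(candidate, sequences, d):
    """True if every sequence contains an l-mer within distance d of candidate."""
    l = len(candidate)
    return all(
        any(hamming(candidate, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in sequences
    )

# Toy input (hypothetical): a 2-tuple of 3-mers and two short sequences.
neighbors = common_neighborhood(("AAA", "AAT"), d=1)
motifs = [c for c in neighbors if is_motif(c, ["CCAATCC", "GAAAG"], d=1)]
print(neighbors, motifs)
```

The example also shows why a smaller common neighborhood helps: fewer candidates remain to be verified against the full input.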

2 Work done till now

We have proposed and implemented a distributed parallel algorithm for the motif discovery problem: a simple, scalable, and efficient parallel implementation of Planted Motif Search using OpenMP and Open MPI on a cluster computer. We have also presented a method for creating a Beowulf cluster [5, 6]. The efficiency of the algorithm is validated by testing it on both simulated and real biological data sets.

2.1 Experimental result on simulated data set

The input sequences are simulated data sets with t = 20 sequences of m = 600 characters each, drawn from the alphabet {A, C, G, T}. Each (l, d) input instance is generated as follows. We generate t random strings of length (m − l) each, in which the characters appear with equal probability [7, 8]. We then randomly generate a string M of length l and plant a copy of it in each sequence at a random position, after applying at most d random mutations to the copy (Fig. 1; Table 1).
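The generation procedure can be sketched as below. This is a minimal illustration under stated assumptions, not the generator used for the reported experiments: the function name, the seeding, and the choice to mutate by redrawing characters at d distinct positions (so that at most d positions actually change) are all assumptions of this example.

```python
import random

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def generate_instance(t, m, l, d, alphabet="ACGT", seed=None):
    """Return (motif, sequences, positions) for one planted (l, d) instance."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        # Background of length m - l, characters drawn with equal probability.
        background = "".join(rng.choice(alphabet) for _ in range(m - l))
        # Redraw d distinct positions; a redraw may return the same character,
        # so the planted copy differs from the motif in at most d positions.
        variant = list(motif)
        for idx in rng.sample(range(l), d):
            variant[idx] = rng.choice(alphabet)
        # Insert the mutated copy at a random position, giving length m.
        pos = rng.randrange(m - l + 1)
        sequences.append(background[:pos] + "".join(variant) + background[pos:])
        positions.append(pos)
    return motif, sequences, positions

# Small illustrative call; the paper's setting is t = 20 and m = 600.
motif, seqs, pos = generate_instance(t=5, m=50, l=9, d=2, seed=1)
```

Returning the planted positions alongside the sequences makes it straightforward to check afterwards whether a search algorithm recovered the planted motif.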

Fig. 1 Column chart for some instances of simulated data set

Table 1 Running time in seconds for different challenging instances [9, 10]

Figure 2 shows the scalability results for our algorithm where

$$ \text{speedup} = \frac{\text{time on 1 node}}{\text{time on } C \text{ nodes}} $$

We note that our proposed method reduces the running time, and that the speedup achieved scales well as the number of cluster nodes increases.
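As a worked illustration of the speedup definition, the sketch below computes the ratio for several node counts. The timings are hypothetical placeholders chosen for easy arithmetic, not measurements from our experiments, and the function name is an assumption of this example.

```python
def speedup(time_on_one_node, time_on_c_nodes):
    """Speedup = time on 1 node / time on C nodes."""
    if time_on_one_node <= 0 or time_on_c_nodes <= 0:
        raise ValueError("running times must be positive")
    return time_on_one_node / time_on_c_nodes

# Hypothetical running times in seconds, keyed by number of nodes C.
timings = {1: 120.0, 2: 60.0, 4: 40.0}
speedups = {c: speedup(timings[1], t) for c, t in timings.items()}
print(speedups)
```

With these placeholder values, doubling the nodes from 1 to 2 halves the time (speedup 2.0), while 4 nodes give a speedup of 3.0, which is below the ideal value of 4.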

Fig. 2 Scalability plot of our algorithm for some instances

2.2 Experimental result on real data set

We test PMS on a set of real biological data used in the literature. This data set contains the upstream DNA regions of a set of genes from different species (Table 2).

Table 2 Motifs detected in real biological dataset [9, 10]

3 Future work

Solving computationally intensive problems on high performance computing architectures can significantly reduce the run time of the solution when proper task distribution, a suitable scheduling strategy, and suitable parallel computing paradigms are used. Deploying more cluster computers can bridge the speed gap between architectures and will result in fewer concurrent jobs being allocated to the system. Future work may include the use of different scheduling strategies, together with intelligent selection criteria for choosing the best scheduling strategy for a given computationally intensive problem. We believe that this paper is a step towards a complete system for solving computationally intensive problems on heterogeneous architectures.

3.1 Research papers submitted

  1. “GENOME WIDE IDENTIFICATION OF CIS–REGULATORY MOTIF USING BEOWULF CLUSTER”, submitted to “IETE Journal of Research” on 6th April 2017. Status: Under Review

  2. “REVIEW OF REGULATORY MOTIF DISCOVERY ALGORITHMS”, submitted to “IETE Technical Review” on 15th August 2017. Status: Under Review