1 Introduction

Accurate annotation employing domain information extracted from sequence/structure and related attributes immensely enhances our current understanding of viral genomes. Data-driven modelling plays a major role in recent advances in vaccine development, epidemiological studies, pathogenicity determination, and drug design [1]. The introduction of next-generation sequencing (NGS) technology, coupled with novel experimental techniques, has provided very large volumes of data that require accurate machine-learning-based modelling techniques. These techniques can be broadly categorized into supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning can be explained with a classic example of function annotation (see Fig. 1). In this task, prior experimental annotation gives us a set of sequences known to belong to a functional class (‘class1’) and another set of sequences known not to belong to it (‘class2’). As shown in Fig. 1, a knowledge-based model is built that separates the data into two classes. This knowledge may be expressed as domain attributes extracted from sequence, structure, etc. The set of domain attributes is known as the input data, and the experimentally annotated class information is known as the output data. The supervised learning model derives a functional relation between input and output, which can then be used to assign a query example to a functional class. This approach can be extended to classification into multiple functional classes.

Fig. 1 Supervised vs. unsupervised learning

In unsupervised learning, we do not have prior knowledge about the classes. Unsupervised learning is a class of machine learning techniques that enables us to discover patterns in the data. The data given to the unsupervised algorithm are not labelled, which means that only the input variables (X) representing sequence/structure are presented to the algorithm, with no corresponding output variables (Fig. 1). This type of learning is used extensively in viral biology to infer phylogeny. The unsupervised learning method groups the data without any prior knowledge of class labels. After the model is built, one can derive knowledge about the examples clustered in any specific group. While supervised and unsupervised learning learn from data, the reinforcement learning paradigm learns from experience. In the following sections we provide details of SVM algorithms, a list of domain attributes presented to the algorithm, the selection of informative attributes, and finally a discussion of some applications of SVM in viral biology.

2 Support Vector Machines for Classification

Support Vector Machines can be used both for supervised and unsupervised learning tasks. In viral biology, SVM is used mainly for supervised learning. SVM classifiers are a set of universal feed-forward network-based algorithms that have been rigorously formulated from statistical learning theory by Vapnik [2]. They are very popular machine learning paradigms which are routinely used in different branches of science and engineering.

2.1 SVM Binary Classifier for Linearly Separable Data

Let us take a simple case study to explain the principle of SVM linear classification. The task is to build a model that separates a set of sequences belonging to functional class 1 from another set of sequences belonging to functional class 2. For example, class 1 examples can be peptides having antiviral activity, while class 2 examples are peptides not known to possess any antiviral activity. The input data vector for the ith example is denoted by x_i and the corresponding class label by y_i. Examples belonging to class 1 have the label y_i = +1 and those belonging to class 2 have the label y_i = −1. The separating hyperplane for linearly separable data can be defined as:

$$ \mathbf{w}\cdot {\mathbf{x}}_i+b=0 $$

This hyperplane (Fig. 2) separates the data into two classes. Here w is the weight vector, with one element per attribute, and b is the bias. The problem is to find the values of the elements of the weight vector that maximally separate the two classes with reference to a given performance measure (e.g. accuracy). This amounts to finding the hyperplane that maximizes the margin, so that at the training stage the examples belonging to class 1 are maximally separated from the examples belonging to class 2. It can be shown that this problem can be formulated as a convex quadratic optimization problem [2]. Such a problem has a single global optimum, whereas candidate algorithms such as neural networks can have multiple local optima, in any of which the training algorithm can get stuck. This highly beneficial property, coupled with superior performance, has attracted researchers and practitioners from different fields to Support Vector Machines. After model building, the weight vector can be obtained from only a subset of the training examples. The examples in this subset are known as support vectors, hence the name Support Vector Machines. It must be noted that the SVM optimization uses the examples only through their pairwise dot products, so each pair of N-dimensional examples enters the problem as a single scalar.
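As a minimal sketch of how a trained linear hyperplane is used, the snippet below evaluates the sign of w · x + b for a query example and computes the margin width 2/||w|| between the planes w · x + b = ±1. The weight vector, bias and query points are hypothetical, hand-picked values for illustration; they are not the result of an actual SVM optimization.

```python
import math

def decision_value(w, x, b):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict(w, x, b):
    """Return +1 (class 1) or -1 (class 2)."""
    return 1 if decision_value(w, x, b) >= 0 else -1

def margin_width(w):
    """Distance between the planes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical hyperplane in a two-attribute space (assumed values).
w, b = [1.0, 1.0], -3.0
query_a = [3.0, 2.0]   # lies on the positive side of the hyperplane
query_b = [0.5, 0.5]   # lies on the negative side of the hyperplane
label_a = predict(w, query_a, b)
label_b = predict(w, query_b, b)
```

A smaller norm of w gives a wider margin, which is why margin maximization is equivalent to minimizing the norm of the weight vector.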

Fig. 2 Maximum margin-minimum norm classifier

2.2 Non-linear Support Vector Machines

Biological data are inherently non-linear. A linear hyperplane cannot satisfactorily separate such non-linear data (Fig. 3). To handle these data, SVM first transforms the data to a higher-dimensional feature space and then employs a linear hyperplane there. There are two inherent difficulties in this approach: (i) it is difficult to find a suitable transformation by trial and error; (ii) reasonable classification accuracy may require a transformation to a very high-dimensional space, which becomes computationally intractable. To solve these problems SVM employs appropriate kernel functions. Kernel functions are defined as functions of dot products in the original space and are equivalent to dot products in the higher-dimensional feature space. The SVM separating surface can thus be defined as a linear hyperplane in the high-dimensional feature space, while the kernel function makes it possible to do all the computations in the original space itself. Kernel functions have to satisfy Mercer's theorem: they must correspond to an inner product in a Hilbert space and must be positive definite. The most popular kernel functions are the polynomial, Gaussian radial basis function (RBF), and multi-layer perceptron kernels. Apart from these there are several domain-dependent kernel functions. In computational biology, string kernels and Fisher kernels are very popular. The formulation described so far is known as hard-margin SVM classification.
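The equivalence between a kernel evaluated in the original space and a dot product in the feature space can be checked numerically. The sketch below does so for a degree-2 homogeneous polynomial kernel on two-dimensional inputs and also defines a Gaussian RBF kernel. The function names, the example vectors and the default value of gamma are assumptions made for this illustration.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return dot(x, z) ** 2

def poly2_feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel
    for 2-D inputs: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, -1.0]           # hypothetical examples
in_original_space = poly2_kernel(x, z)    # computed without any mapping
in_feature_space = dot(poly2_feature_map(x), poly2_feature_map(z))
```

Both quantities agree up to floating-point rounding, although the kernel never constructs the feature-space vectors. For the RBF kernel the corresponding feature space is infinite dimensional, so only the kernel form is usable in practice.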

Fig. 3 Non-linearly separable data

2.3 Soft Margin SVM

If we try to find the hyperplane that yields the maximum possible training accuracy, the margin obtained may become very narrow. Such a hyperplane, while classifying the training set very well, over-fits the data and may fail badly on unseen query examples. It may be possible to increase the margin at the cost of a slight loss of training accuracy (Fig. 4). A hyperplane with a wider margin generalizes better than one with a narrow margin and has more robust prediction capabilities. In soft-margin formulations, this trade-off between margin maximization and misclassification error is controlled by a new parameter, ‘C’, which has to be tuned.
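One common way to write this trade-off is the standard textbook soft-margin formulation, given here as an illustration rather than as the exact form used by any software mentioned in this chapter. The slack variables ξ_i, one per training example, are notation assumed for this illustration; they measure how far an example falls on the wrong side of its margin.

$$ \underset{\mathbf{w},b,\boldsymbol{\xi}}{\min}\ \frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^2+C\sum_{i=1}^n{\xi}_i\kern1em \mathrm{subject}\ \mathrm{to}\kern1em {y}_i\left(\mathbf{w}\cdot {\mathbf{x}}_i+b\right)\ge 1-{\xi}_i,\kern0.5em {\xi}_i\ge 0 $$

A large value of C penalizes training errors heavily and tends to give a narrow margin, whereas a small value of C tolerates some misclassification in exchange for a wider margin.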

Fig. 4 Trade off: increasing margin/reducing misclassification

3 Brief Details of Classification of Real-Life Binary Datasets

Given a dataset, we must first find the optimal hyperplane in the original dimension. In SVM terminology this is known as using a linear kernel. After building the model we estimate the required performance measure (e.g. accuracy). If it is not satisfactory, we must resort to nonlinear separation and employ conventional kernels such as the polynomial, Gaussian radial basis function (RBF), exponential radial basis function, and multi-layer perceptron kernels. Every kernel has its own kernel parameters. With each kernel, apart from finding the best kernel parameters, one must also tune the ‘C’ parameter discussed in Sect. 2.3. If these kernels are also not satisfactory, we must resort to domain-dependent kernels.

4 Support Vector Machines for Regression

In classification, examples are grouped into discrete sets. In regression, a functional relationship is found between the input data and an output that takes continuous values. Many problems require a nonlinear model to adequately regress the data. The methodology described in the previous sections can be easily extended to employ SVM for nonlinear regression (Schölkopf et al. 1999) [2]. The methodology for linear regression is the same as that of conventional regression models: data that can be regressed linearly are handled in the original dimension itself. What is different in SVM linear regression is that a novel epsilon-insensitive loss function is defined, which is robust against outliers in the data [3, 4]. For data that cannot be regressed linearly, the principle used in classification carries over: the data are taken to a higher-dimensional feature space and regressed linearly there. Appropriate kernel functions can again be defined to simplify computation.
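The epsilon-insensitive loss can be sketched in a few lines. Errors smaller than epsilon cost nothing, and larger errors grow only linearly, which is why a single outlier influences the fit less than under squared loss. The default epsilon and the example values below are assumptions chosen for illustration.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Zero inside the epsilon tube, linear in the error outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def squared_loss(y_true, y_pred):
    """Conventional least-squares loss, shown only for comparison."""
    return (y_true - y_pred) ** 2

# Hypothetical predictions: one small error and one outlier.
small_error = epsilon_insensitive_loss(1.0, 1.05)      # inside the tube
outlier_eps = epsilon_insensitive_loss(0.0, 10.0)      # grows linearly
outlier_sq = squared_loss(0.0, 10.0)                   # grows quadratically
```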

5 Attributes Used in Viral Biology Problems

In viral biology we encounter a variety of attribute types, each providing a very large number of domain attributes. Broadly, these attributes can be classified as sequence based, structure based, spectroscopic (i.e. based on the spectrum of light or radiation), microarray gene expression profiles, etc. Protein sequence k-mer features range from amino acid (AA) (k = 1), dipeptide (k = 2) and tripeptide (k = 3) to tetrapeptide (k = 4) and so on. It is also possible to derive physicochemical properties such as hydrophobicity, charge and hydrophilicity from each letter of the AA alphabet. The simplest discrete set of features is the AA composition, which reduces a protein sequence to the frequencies of the 20 letters of the AA alphabet. While this is beneficial, all sequence-order information is lost. Chou defined and introduced different types of pseudo-AA (PseAA) compositional attributes of protein sequences; these are sets of discrete numbers derived from AA sequences that retain some sequence-order or pattern information [5]. Ever since the first PseAA composition was formulated, these attributes have been successfully employed in several protein function identification tasks. Two classes of attributes frequently used in viral biology are described below:

5.1 QSAR Descriptors

In quantitative structure–activity relationship (QSAR) modelling, domain information about a molecule is provided in terms of different types of descriptors. The earliest QSAR descriptors comprised hydrophobic, electronic, and steric parameters. Currently, descriptors of dimensions ranging from 0 to 3 are routinely employed in QSAR analysis. Zero-dimensional descriptors comprise atom counts, bond counts, molecular weight, and sums of atomic properties; among the higher-dimensional descriptors, two-dimensional descriptors capture topology and three-dimensional descriptors provide geometrical information. QSAR was originally a regression problem in which a functional relationship is obtained between the activity of a molecule and its descriptors. This relationship can be linear or non-linear, so a regressor such as SVM or random forest can be employed. This is illustrated in Fig. 5.

Fig. 5 QSAR regression using SVM

5.2 PSSM Descriptors

Evolutionary information, one of the most important types of information for assessing function in biological analysis, has been successfully used to encode proteins in many applications. PSI-BLAST repeatedly searches a specified database, using a multiple alignment of the high-scoring sequences found in each search as input to the next round of searching. Iterations normally continue until a user-specified number of iterations is reached, and at the end the final Position Specific Scoring Matrix (PSSM) is generated. Such a matrix provides remote homology information, so using PSSM attributes as descriptors in SVM is useful when remotely related sequences have similar functions. Because SVM requires fixed-length feature vectors, a vector of dimension 400 can be derived from the PSSM for use as input to the SVM classifier.
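The text does not specify how the 400-dimensional vector is obtained. One widely used scheme, assumed here purely for illustration, sums the PSSM rows of all positions that hold the same residue type, giving a 20 × 20 table that is then flattened. The column order in `PSSM_COLUMNS` and the toy matrix are likewise assumptions; a real PSSM would be parsed from PSI-BLAST output.

```python
# Assumed order of the 20 PSSM columns (one per amino acid).
PSSM_COLUMNS = "ARNDCQEGHILKMFPSTWYV"

def pssm_to_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a fixed-length 400-vector.

    Row i of the 20 x 20 table is the sum of the PSSM rows of every
    sequence position occupied by amino acid i.
    """
    index = {aa: i for i, aa in enumerate(PSSM_COLUMNS)}
    table = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            continue  # skip non-standard residues such as 'X'
        for j, score in enumerate(row):
            table[i][j] += score
    return [value for row in table for value in row]

# Toy input: a three-residue sequence with constant PSSM rows.
toy_sequence = "AAR"
toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
features = pssm_to_400(toy_sequence, toy_pssm)
```

Whatever the sequence length, the output always has 400 elements, which is the property the SVM input requires.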

Apart from the attributes described above, many different types of attributes are used, depending on the particular domain problem encountered.
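As an illustration of the sequence-based k-mer attributes introduced at the start of this section, the sketch below computes AA (k = 1), dipeptide (k = 2) or longer k-mer compositions as fixed-length frequency vectors. The function name, the alphabetical ordering of the alphabet and the toy sequence are assumptions made for this example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical

def kmer_composition(sequence, k=1):
    """Return the relative frequency of every possible k-mer (20**k values)."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    # Windows containing non-standard letters are ignored in the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

toy_peptide = "AAC"  # hypothetical sequence
aa_comp = kmer_composition(toy_peptide, k=1)   # 20 features
dipep_comp = kmer_composition(toy_peptide, k=2)  # 400 features
```

The vector length grows as 20 to the power k, which is why feature selection becomes important once tripeptide or longer k-mers are included.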

5.3 Attribute Selection

Not all attributes in a data set are informative. Non-informative features have no discriminative power, act as noise, and interfere with the classification process, so that the model has very little predictive accuracy. In protein function identification in viral biology, several sequence and structural features can be extracted [6, 7]. For example, the AA, dipeptide and tripeptide compositional features together amount to roughly 8400 features (20 + 400 + 8000 = 8420), and not all of them will be important in a particular function annotation task. Selecting a subset of informative features by brute force requires evaluating a huge number of feature subsets, which is computationally very time consuming. Various feature/attribute selection methods are available to simplify the process of subset selection. Feature selection techniques help to avoid overfitting, improve model performance, and provide faster and more cost-effective models; they also provide invaluable domain information. However, because feature selection techniques have to employ appropriate search techniques, they bring in an additional level of complexity and computational cost. Feature selection techniques differ from each other in the way they incorporate this search over the added space of feature subsets into model selection. Figure 6 illustrates the advantages of feature selection. These methods can be broadly classified as filter, wrapper and embedded methods.

Fig. 6 Advantages of feature selection

5.3.1 Filter Ranking Methods

Filter ranking methods use heuristics to score and rank the features (Fig. 7). In the example given above, once all the compositional features are ranked by an appropriate filter method, the most informative subset of features can be selected, and the model can be built on this subset to maximize performance. Popular filter methods include mutual information, Student's t-test, correlation-based feature selection (CFS), several variants of the Markov blanket filter method, Minimum Redundancy-Maximum Relevance (mRmR), and the Uncorrelated Shrunken Centroid (USC) algorithm. Some of the methods used in viral biology problems are described below:

Fig. 7 Filter ranking & classification accuracy

5.3.1.1 Information Gain

The information gain score of an attribute is calculated as the difference between the entropy of the class labels over the entire data set and the conditional entropy of the labels given the value of the attribute. For a continuous attribute, this can be done by binning the attribute and counting the frequency of occurrence of the different labels within each bin. Based on the scores, the top-ranking subset of attributes can be easily identified to build the model.
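The calculation can be sketched as follows for an attribute that has already been binned into discrete values. The bin names and toy labels are hypothetical; the entropy is measured in bits.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus their conditional entropy given the attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]                          # toy class labels
informative = ["low", "low", "high", "high"]   # bins that track the labels
uninformative = ["a", "b", "a", "b"]           # bins unrelated to the labels
gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

Ranking attributes by this score and keeping the top of the list is the filter step described above.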

5.3.1.2 mRmR

The attributes are selected in such a way that they are mutually dissimilar (minimum redundancy) and, at the same time, maximally relevant to the target class.

5.3.1.3 Mutual Information

Mutual information is a measure defined between two random variables that quantifies the information obtained about one of them through the other. For the purpose of feature selection, the mutual information between the subset of selected features and the target variable should be maximal.

5.3.1.4 Correlation Filter

Correlation-based feature selection (CFS) selects a subset of features that are uncorrelated with each other but maximally correlated with the output variable.

5.3.1.5 Chi-Square

The chi-square test is a statistical test of independence that computes a score reflecting the dependency between two variables. We calculate the chi-square statistic between every feature variable and the target variable to check whether a relationship exists between them. If the target variable is independent of a feature variable, that feature can be discarded; if they are dependent, the feature is important. For continuous variables, chi-square can be applied after binning the variable.
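A minimal sketch of the chi-square statistic for one discrete (or binned) feature against the class label is given below. It compares the observed counts in the contingency table with the counts expected under independence. The toy data are hypothetical, and converting the statistic to a p-value is left out.

```python
from collections import Counter

def chi_square(feature, target):
    """Chi-square statistic between a discrete feature and the target."""
    n = len(target)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    statistic = 0.0
    for f, f_count in feature_totals.items():
        for t, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected if independent
            statistic += (observed[(f, t)] - expected) ** 2 / expected
    return statistic

target = [0, 0, 1, 1]            # toy class labels
dependent_feature = [0, 0, 1, 1]    # identical to the target
independent_feature = [0, 1, 0, 1]  # unrelated to the target
score_dependent = chi_square(dependent_feature, target)
score_independent = chi_square(independent_feature, target)
```

A score near zero suggests the feature can be discarded, while larger scores indicate stronger dependence on the target.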

5.3.2 Wrapper Methods

While filter methods are fast, they are not very accurate, as they do not take correlations between features into account. Wrapper methods employ a learning classifier to repeatedly evaluate different subsets of features. These methods include forward selection and backward selection algorithms. In forward selection we start with an empty set and add, one at a time, the feature that maximally improves accuracy, until all features have been added. The subset that exhibits maximum accuracy can then be chosen. In backward selection we start with all features and remove the least significant features one by one.
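Forward selection as described above can be sketched with a generic scoring function. In a real wrapper, `score` would train the classifier on the given subset and return, for example, its cross-validation accuracy. The `toy_score` function and the feature names below are hypothetical stand-ins so that the example runs on its own.

```python
def forward_selection(features, score):
    """Greedy forward selection; returns (best_subset, best_score).

    At each step the feature giving the highest score is added, and the
    best-scoring subset seen along the way is returned at the end.
    """
    selected, remaining, history = [], list(features), []
    while remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), score(selected)))
    return max(history, key=lambda item: item[1])

def toy_score(subset):
    """Hypothetical accuracy: two features help, every other one hurts."""
    useful = {"hydrophobicity": 0.3, "charge": 0.2}
    return 0.5 + sum(useful.get(f, -0.05) for f in subset)

candidates = ["length", "charge", "hydrophobicity", "noise"]
best_subset, best_score = forward_selection(candidates, toy_score)
```

Backward selection follows the same pattern, starting from the full set and removing the feature whose removal costs the least at each step.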

Recently, recursive feature elimination wrappers have become very popular. In the SVM recursive feature elimination algorithm (SVM-RFE), the simulations start with all features, and the SVM weights are determined. The features with the smallest absolute weights are then recursively removed until no feature is left. Here again, the best-performing subset can be easily identified and used in the final model (see Figs. 8 and 9). Several wrapper-based methods are population based and use genetic algorithms, ant colony optimization (ACO) or other swarm intelligence methods. These methods mimic nature-inspired phenomena and evolve optimal solutions. For example, ACO is based on the co-operative search behaviour of live ants. Biogeography is the study of the geographical distribution and dynamics of a large number of species over a period of time. Biogeography-based optimization (BBO) mimics the natural process of migration over a population in iterative generations, simulating discrete time. Atulji Srivatsava et al. employed BBO for simultaneous feature selection and MHC class I peptide binding prediction using Support Vector Machines and Random Forests [8].

Fig. 8 Wrappers: schematic representation

Fig. 9 Wrappers & classification accuracy

5.3.3 Embedded Methods

In the embedded class of feature selection techniques, the search for an optimal subset is built into the classification model. In random forest there are two inbuilt feature ranking methods, viz., Gini importance and variable importance. SVM-RFE, described in Sect. 5.3.2, is also often treated as an embedded method, because the features are ranked by the weights of the trained SVM itself: starting with all features, those with the smallest absolute weights are recursively removed until no feature is left, and the best-performing subset is used in the final model.

6 Performance Measures

While accuracy is the conventional performance measure, it may not be appropriate in all situations. In some examples we may require maximizing the positive accuracy while in some other situations negative accuracy may be the desired performance measure. Also, in imbalanced datasets, where we have more examples in one class than the other we have to optimize both positive and negative accuracies.

Referring to Fig. 10, true positives (TP) are the examples that are actually positive and are predicted positive by the SVM. True negatives (TN) are the examples that are actually negative and predicted negative. False positives (FP) are the examples that are actually negative but predicted positive. False negatives (FN) are the examples that are actually positive but predicted negative. With these definitions, we can define positive and negative accuracies. The true positive rate (TPR), or sensitivity, is defined as:

$$ TPR=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{true}\ \mathrm{positive}\ \mathrm{examples}}{\mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{positive}\ \mathrm{examples}}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $$

True negative rate or specificity can be defined as:

$$ TNR=\frac{\mathrm{number}\ \mathrm{of}\ \mathrm{true}\ \mathrm{negative}\ \mathrm{examples}}{\mathrm{total}\ \mathrm{number}\ \mathrm{of}\ \mathrm{negative}\ \mathrm{examples}}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}} $$

Precision or positive predictive value can be defined as:

$$ PPV=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}} $$

The F1 score is the harmonic mean of precision and sensitivity:

$$ {F}_1=2\;\left[\;\frac{\mathrm{PPV}\ast \mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}\right]=\frac{2\mathrm{TP}}{2\mathrm{TP}+\mathrm{FP}+\mathrm{FN}} $$

Apart from these, the Matthews correlation coefficient (MCC) is used as a measure that balances positive and negative accuracy, and is defined as:

$$ MCC=\frac{\mathrm{TP}\ast \mathrm{TN}-\mathrm{FP}\ast \mathrm{FN}}{\sqrt{\left( TP+ FP\right)\left( TP+ FN\right)\left( TN+ FP\right)\left( TN+ FN\right)}} $$

An MCC of −1 indicates total disagreement between predictions and annotations (very poor classification), 0 indicates performance no better than random, and +1 indicates the highest possible performance. For imbalanced datasets it is customary to use MCC as the performance measure.
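The measures defined above can be computed directly from the four counts of the confusion matrix. The sketch below assumes that every denominator except the one in MCC is non-zero; the counts used are hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, F1 and MCC from confusion counts."""
    tpr = tp / (tp + fn)                      # sensitivity
    tnr = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                      # precision
    f1 = 2 * tp / (2 * tp + fp + fn)          # harmonic mean of PPV and TPR
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv, "F1": f1, "MCC": mcc}

# Hypothetical confusion counts for a binary classifier.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
perfect = classification_metrics(tp=10, tn=10, fp=0, fn=0)
```

Reporting all of these together, rather than accuracy alone, makes the behaviour on the minority class visible when the classes are imbalanced.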

Fig. 10 Distribution of examples classified by the model

6.1 Cross Validation Measures

A simple way to test performance is to split the data into 80% training and 20% test sets. The model is built on the training data and tested on the test data. While this gives a quick estimate of the performance of the model, it may not be fully adequate. To reduce statistical bias, two cross-validation procedures are used to gauge performance and to obtain the best algorithm parameters. In k-fold cross validation, the training set is randomly divided into k folds. To start with, the first fold is used as the test set and the remaining k − 1 folds are used to build the SVM hyperplane model, which is then evaluated on the examples in the first fold. Similarly, each of the other k − 1 folds is used in turn as the test set, with the remaining k − 1 folds employed to build the model. From these k experiments the cross-validation accuracy is estimated as the average of the k test accuracies. Conventionally, fivefold or tenfold cross validation is used (k = 5 or k = 10). In the leave-one-out cross-validation procedure, each time one example is left out as the test example and the remaining n − 1 examples are used to build the model, which is then tested on the left-out example. In k-fold cross validation, k models are always built irrespective of the number of examples in the dataset, whereas in leave-one-out cross validation the number of models equals the number of examples in the training set.
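The fold construction can be sketched as a generator of train/test index lists. Setting k equal to the number of examples gives leave-one-out cross validation. The function name and the fixed random seed are assumptions made for this illustration.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

five_fold = list(k_fold_indices(n_examples=10, k=5))
leave_one_out = list(k_fold_indices(n_examples=4, k=4))
```

In practice a model is trained on each `train` list and evaluated on the matching `test` list, and the k test accuracies are averaged.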

7 SVM Extension to Solve Multi-class Type of Classification Problems

Different algorithms address the multi-class classification problem. Two well-known techniques are the one-against-all method (Weston and Watkins 1999) and the one-against-one technique [9]. The one-against-all method treats the multi-class problem as a collection of binary classification problems. In general, k classifiers are needed to solve a k-class problem. The kth classifier constructs a hyperplane between class k and the k − 1 other classes. A majority vote across the classifiers is applied to classify a new test point. In the one-against-one technique, k(k − 1)/2 classifiers are needed, one for each pair of classes. Each classifier is built with the examples of one class against the examples of another class. Here again, a majority vote decides the class label of a test example.
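The two decompositions can be enumerated explicitly. The sketch below lists the binary tasks for each scheme and applies a simple majority vote to a set of predicted labels. One-against-one needs one classifier per unordered pair of classes, i.e. k(k − 1)/2 of them. The class names and the votes are hypothetical.

```python
from collections import Counter
from itertools import combinations

def one_against_all_tasks(classes):
    """One binary task per class: that class versus all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_against_one_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

def majority_vote(votes):
    """Return the most frequently predicted class label."""
    return Counter(votes).most_common(1)[0][0]

classes = ["monomer", "dimer", "trimer", "tetramer"]
ova = one_against_all_tasks(classes)   # k tasks
ovo = one_against_one_tasks(classes)   # k * (k - 1) / 2 tasks
winner = majority_vote(["dimer", "trimer", "dimer"])
```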

8 Other SVM Types

8.1 Least Square SVM (LSSVM)

Least squares SVM (LSSVM) classifiers were proposed by Suykens and Vandewalle [10]. Their formulation adds a term to the objective function that penalizes the square of the error between the prediction and the actual class label. The problem is then formulated as a set of linear equations instead of the convex quadratic problem of classical SVMs, which makes computation simpler and faster. Several problems in bioinformatics have been solved using LSSVM. The LSSVM formulation has also been extended to regression problems.

8.2 One Class SVM

Several real-life datasets are highly imbalanced. Function annotation problems in viral biology often have a small number of positive examples, while the number of negative examples can be very large. Such a distribution causes imbalance in the datasets, and the prediction accuracy for the minority class will be very poor. One-class SVM has been proposed in the literature to overcome this issue. In one-class SVM, only the data belonging to the majority class is used to build the model. Two different versions of one-class SVM have been proposed in the literature. In Tax and Duin’s version [11], a model of the smallest hypersphere enclosing all the majority class examples is formed. A new example is predicted as a majority class example if it falls inside the sphere; otherwise it is predicted as a minority class example. For non-linearly separable patterns, appropriate kernel functions can be defined as in the case of binary SVMs. In the other version of one-class SVM, a hyperplane model is used instead of a hypersphere model [12]. One-class SVM can also be used to detect anomalies and faults.

9 Applications of SVM in Virology

In this section, we outline a few important problems in viral biology where SVMs have been successfully applied on many case studies.

9.1 Quantitative Structure Activity Relationship (QSAR) Applications

Rapid assessment of the desired activities of a large number of small-molecule compounds can be achieved by high-throughput screening (HTS). QSAR analysis plays a key role in the screening of compounds by building knowledge-based models [13], which greatly reduces the experimental screening load. QSAR methodology focuses on finding a model that correlates the experimentally determined activity of a family of compounds with their molecular structure. Once a high-performance model is built, it can be used to predict the activity of any new compound from appropriate domain attributes extracted from its molecular structure. A molecular structure can be defined by its set of atoms and the covalent bonds between them. However, structure–activity relationship models cannot be built directly from the structure of the molecule. Domain information has to be presented to the algorithm in the form of descriptors; molecular descriptors range from physicochemical and quantum-chemical to geometrical and topological features. The methodology of building QSAR models consists of four steps: (a) extracting descriptors from the molecular structure; (b) choosing the descriptors most informative for the activity; (c) building a model based on the selected molecular descriptors; and (d) screening molecules for the activity in question. Table 1 lists examples of different descriptors, categorised by structural conformation [13].

Table 1 Examples of different descriptors based on structural conformation [15]

QSAR modelling with descriptor selection has become increasingly important because a large number of descriptors of different types can, in principle, be extracted. Descriptor selection is an important pre-processing tool for QSAR studies: it can improve the accuracy of QSAR classification and reduce computational complexity by removing irrelevant and redundant descriptors. The sparse support vector machine (SSVM), one of the embedded methods, is of particular interest because it performs descriptor selection and classification simultaneously.

Further explanation is included in Sect. 5.3.

SVMs have been found to provide robust and accurate QSAR models for several problems encountered in viral biology. Two types of QSAR models can be built. The first is a regression model, in which the descriptors are related to the experimentally annotated activities, as shown schematically in Fig. 11. The second is a classification model, for which a threshold value of the experimental activity has to be defined. Compounds with activities below this threshold are grouped into ‘class1’ and the other compounds into ‘class2’. An SVM classification model is then built to separate the compounds into the two groups, and a new query compound can be classified as active or inactive, as schematically represented in Fig. 12.

Fig. 11 SVM regression model

Fig. 12 SVM classification model

Human immunodeficiency virus (HIV) attacks and destroys the immune system and causes acquired immunodeficiency syndrome (AIDS). As per the UNAIDS report [14], 77.3 million [59.9 million–100 million] people have become infected with HIV since the start of the epidemic and 35.4 million [25.0 million–49.9 million] people have died from AIDS-related illnesses. Numerous molecular modelling approaches have been attempted to address the design of new anti-HIV compounds, most of them based on QSAR [15]. In an interesting and comprehensive study [15], QSAR-based attributes were selected for predicting the inhibitory activity of compounds against HIV proteins including protease (PR), reverse transcriptase (RT) and integrase (IN). Around 18,000 molecular descriptors, including geometric, electrostatic, structural, constitutional, path and graph fingerprints, were extracted using the open-source PaDEL software. To reduce the number of descriptors, attribute selection was carried out using ‘Best-First’ as the search method in the Waikato Environment for Knowledge Analysis (Weka) suite. The SMO regression algorithm in the Weka suite was used to classify the data into active and inactive sets. The models achieved excellent values of the Pearson correlation coefficient for all three data sets, viz., PR, RT and IN. An integrated web-based platform, HIVprotI [16], was further developed using this model.

The tetra-hydro-imidazo[4,5,1-jk][1,4]-benzodiazepines (TIBOs) constitute a group of potent system inhibitors of HIV-1 reverse transcriptase. With a view to segregating TIBO compounds into high- and low-activity classes of inhibitors of HIV-1 reverse transcriptase, Hdoufane et al. carried out SAR studies on 89 TIBO derivatives using different classifiers, such as support vector machines, artificial neural networks, random forests, and decision trees [17]. They successfully employed seven molecular descriptors characterizing hydrophobic, electronic, and topological aspects of the molecules and obtained excellent training and test accuracies.

The successful identification of HIV proteins may be significant for treatment, since the epidemiological and biological characteristics of HIV-1 and HIV-2 are quite different. Juan Mei et al. employed SVM along with other classifiers to predict HIV-1 and HIV-2 proteins based on pseudo AA compositions and the increment of diversity (ID) algorithm [18]. With jackknife tests, the SVM models gave the highest prediction accuracy of 0.9909.

Both HBV and HCV are of immense significance as leading causes of liver cancer as well as co-infection with HIV. A potentially important study included 172 positives and 8998 negative cases and built a classification model of the HBV dataset; in the same study HCV dataset included 533 positives and 7287 negatives [19]. The data had obvious imbalance in the number of examples in the positive and negative data sets. Three different imbalance handling methods, viz., (i) Downsize, (ii) Multi downsize, and (iii) SMOTE were used. SMOTE provided the best performance; SVM prediction accuracies of 64% for HBV and 71% for HCV were reported for this model.

Influenza, a respiratory virus, is associated with high morbidity and mortality rates. Neuraminidase (NA) and haemagglutinin (HA) are two major glycoproteins found on the surface of the influenza virus. Compounds that inhibit neuraminidase can protect host cells from viral infection and retard the spread of the virus among cells. A two-stage approach has been used to build a QSAR classification model separating neuraminidase inhibitors into active and weakly active classes [20]. In the first stage, the minimum redundancy maximum relevance criterion was employed to select the most informative descriptors. The second stage uses the selected descriptors as input to SSVM L1-norm classifiers. The dataset consisted of 479 neuraminidase inhibitors of H1N1 virus whose experimentally measured IC50 values were available. This set of training compounds was separated into two categories by thresholding the activity: compounds with IC50 < 20 μM were considered active, while those with IC50 > 20 μM were considered weakly active. The 7 top descriptors selected gave an SVM classifier accuracy of 90.62%, which is far higher than that of earlier SVM approaches.

The classification of protein quaternary structure complexes is of significant interest in computational biology research. Chi-Chou Huang et al. developed a two-stage architecture for five-class classification of the quaternary structure of a protein complex; the five classes are monomer, dimer, trimer, tetramer, and other subunit classes [21]. AA frequency, Shannon entropy, and accessible surface areas were employed as domain attributes. One-against-all SVM classifiers were used, in which the positive data consisted of examples of a given class and the negative data consisted of all the remaining classes. Because of this division, the number of examples on the positive side of each classifier was much smaller than on the negative side. This created imbalance and reduced classification accuracy. To counter this, the authors employed a bootstrap method for repeated sampling and generated different subsets of data. The majority class was further subjected to random sub-sampling. The Matthews correlation coefficient (MCC) was used as the performance measure. The bootstrapping method produced an MCC of 0.696 and above. A list of examples is given in Table 2.
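As a small illustration of the one-against-all labelling and of the Matthews correlation coefficient used as the performance measure, the following sketch relabels a multi-class problem for one target class and computes the MCC from the confusion counts. The helper names and the toy labels and predictions are hypothetical and are not taken from [21].

```python
import math

def one_vs_all(labels, positive_class):
    """Relabel a multi-class problem: +1 for the chosen class, -1 otherwise."""
    return [1 if lab == positive_class else -1 for lab in labels]

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels coded as +1 / -1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    return tp, tn, fp, fn

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom

# Toy example: the "dimer" classifier sees dimers as +1 and all else as -1.
labels = ["monomer", "dimer", "dimer", "trimer", "other", "dimer"]
y_true = one_vs_all(labels, "dimer")
y_pred = [-1, 1, -1, -1, 1, 1]  # hypothetical classifier output
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
```

Unlike plain accuracy, the MCC uses all four cells of the confusion table, which is why it is often preferred when the positive class is much smaller than the negative class.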

Table 2 Illustrative examples for QSAR applications

9.2 SVM Applications Based on Next Generation Sequencing (NGS) Data

The term ‘Next-Generation Sequencing’ (NGS) refers to advances in nucleic acid sequencing technologies. The number of sequence reads generated per run has increased progressively with time, owing to improved understanding of molecular biology as well as technological advances. Current sequencing platforms can generate enormous numbers of sequence reads with a quick turnaround time, allowing researchers to explore biomedical questions at the molecular level and dig deeper into their genetic aspects. NGS has proven to be an efficient, fast, and reliable approach to problems in evolution, ecology, and genetics, overcoming the limitations of traditional molecular approaches [22]. Another great advantage of NGS over traditional molecular studies is its cost efficiency. An end-to-end human genome can be sequenced in a few hours using NGS technology, whereas it took over a decade to sequence and assemble the human genome using Sanger sequencing. A number of NGS platforms, differing in their underlying chemistry, have been developed over the last decade. Bioinformatics plays an important role in assembling the fragments sequenced in parallel by mapping all the reads to the human reference genome. The depth of sequencing, i.e. the number of times the template has been sequenced, supports the accuracy of sequencing by helping to ensure that observed variation in the sequenced data is the result of mutations and not of sequencing errors. NGS can be used to sequence targeted regions identified in a genetic study, all coding genes (whole exome sequencing), or the entire genome.

Variations in the human genome can be changes of a few nucleotide bases (substitutions), insertions and deletions of DNA, large genomic deletions of exons or whole genes, and rearrangements such as inversions and translocations. All these anomalies are collectively termed ‘mutations’. Traditional sequencing methods were able to discover only a handful of mutation types, including small insertions and deletions. This led to the development of dedicated assays to discover additional types of variation. Examples include fluorescence in situ hybridization (FISH) for conventional karyotyping and comparative genomic hybridization (CGH) microarrays to detect sub-microscopic chromosomal copy number changes such as microdeletions.

With recent advances in NGS technologies and a better understanding of life at the genomic level, various questions have been answered using whole genome sequencing. Areas of application include genome diversity, metagenomics, epigenetics, discovery of non-coding RNAs and protein-binding sites, and gene-expression profiling by RNA sequencing [22,23,24,25,26]. Apart from high-throughput whole genome sequencing, typical applications of NGS methods in microbiology and virology include the discovery of new microorganisms and viruses by metagenomic approaches, investigation of microbial communities in the environment and in the human body for understanding healthy and disease conditions, analysis of viral genome variability within the host, and detection of antiviral drug-resistance mutations in patients with human immunodeficiency virus (HIV) infection or viral hepatitis.

In the context of microbial analysis, the term metagenomics designates the analysis of all the nucleic acid present in a given sample. Entire communities of microorganisms can be explored without isolating and culturing individual microbial species. NGS applications in metagenomic studies include the discovery of novel viruses from clinical samples in human and animal diseases, e.g. the new Ebola virus Bundibugyo [27], the identification of a viral etiology of a disease outbreak in honeybees [28], and the involvement of a new arenavirus in transplant-associated disease clusters [29]. The scope of applications also includes characterization of the viral community in the environment [30, 31], in animals [32], and in humans [33,34,35,36]. Owing to the high replication capacity and low fidelity of the replication enzyme, high intra-host variability is shown by reverse transcriptase-dependent viruses (e.g. hepatitis B virus, human immunodeficiency virus) and RNA viruses (e.g. hepatitis C virus, influenza virus). Such a set of closely related genomes within a given host allows a viral population to adapt swiftly to dynamic environments and evolve resistance to vaccines and antiviral drugs [37]. Significant work using NGS has been done to characterize the intra-host variability of influenza virus [38, 39], HCV, HIV, and HBV.

Jian’an Jia et al. designed an approach to distinguish between two disease groups caused by hepatitis B virus: chronic hepatitis B (CHB) and hepatocellular carcinoma (HCC) [40]. NGS was used to sequence the pre-S region from a large number of CHB and HCC individuals. The attributes used were word pattern frequency vectors for word lengths ranging from k = 2 to k = 8. The maximum cross-validation mean AUC of 0.93 was obtained at k = 5. The prediction accuracy was found to be much higher than that obtained using KNN classifiers.
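A word pattern frequency vector of the kind described here can be sketched as follows. This is a minimal illustration assuming overlapping words over the alphabet A, C, G, T and normalization by the total word count; the exact counting and normalization conventions of [40] may differ, and the function name is hypothetical.

```python
from itertools import product

def word_frequency_vector(seq, k, alphabet="ACGT"):
    """Relative frequency of every length-k word, in lexicographic order.

    Overlapping words are counted; words containing symbols outside the
    alphabet (e.g. the ambiguity code N) are skipped.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]

# Toy read (illustrative only): 2-mers give a vector of 4**2 = 16 entries.
vec = word_frequency_vector("ACGTACGT", k=2)
```

The vector length grows as 4^k (1024 entries at k = 5), so larger word lengths yield sparser and higher-dimensional inputs for the classifier.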

To investigate HBV genotypes and predict HCC status, Xin Bai et al. used NGS to sequence the pre-S region of the HBV sequence from 94 HCC patients and 45 chronic HBV (CHB) infected individuals [41]. Word pattern frequencies in the sequence data of all individuals were calculated and compared using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbours (KNN) and support vector machine (SVM). On an independent data set of 46 HCC patients and 31 CHB individuals, a good AUC score of 0.77 was obtained using SVM.
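The pairwise comparison step can be illustrated with a short sketch that computes the Manhattan distance between frequency profiles and assembles the distance matrix that methods such as PCoA or hierarchical clustering take as input. The profiles below are made-up three-word frequency vectors, and the function names are hypothetical.

```python
def manhattan(u, v):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(profiles):
    """Symmetric matrix of pairwise Manhattan distances."""
    n = len(profiles)
    return [[manhattan(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]

# Toy word-frequency profiles for three individuals (illustrative values).
profiles = [
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
    [0.5, 0.25, 0.25],
]
D = distance_matrix(profiles)
```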

Apart from applications in viral disease diagnosis, a recent study demonstrated the usefulness of a hybrid approach for early risk assessment: the host of influenza viruses was predicted using a Support Vector Machine (SVM) classifier based on a word vector representation and feature extraction method for biological sequences [42]. Accuracies for host prediction in avian, human, and swine influenza viruses were 99.7%, 96.9%, and 90.6%, respectively. Table 3 lists some examples of SVM applications using NGS data to address problems in virology.

Table 3 Illustrative examples for SVM applications based on NGS approach

9.3 SVM Applications Based on Spectroscopy Data

Among the array of spectroscopic techniques, Raman spectroscopy and infrared (IR) absorption spectroscopy have led to major breakthroughs in biological, pharmaceutical, and clinical research [43,44,45]. Using visible-light laser beams, Raman spectroscopy can serve as a non-invasive characterization technique and achieve the same resolution as fluorescence microscopy. The inelastic scattering of photons by vibrating molecules in the sample is called Raman scattering. The resulting change in photon frequency carries information about molecular vibrations, which is useful in diagnostic studies. Such changes in frequency result from interactions of molecular bonds. Initial changes in almost all types of disease (including cancer and viral infections) occur at the molecular level. Conventional laboratory tests are often inadequate for identifying such changes. Raman spectroscopy has the potential to monitor these molecular changes at an early stage of the disease [46]. Information about abnormalities can be retrieved from the spectral differences between normal and diseased samples and used for diagnosis. With diverse areas of application, spectroscopy is a promising clinical tool for the real-time diagnosis of diseases and the assessment of living healthy and cancerous tissue, cells, and their subcellular compounds and structures. It can also be used to track the mode of action of drugs at the molecular level.

Owing to its high sensitivity and selectivity, Raman spectroscopy requires only a small sample volume and minimal preparation effort. The high resolution, ease of sample preparation, and very short data collection time make the technology well suited to the study of viruses and virally infected cells. Because acquisition can be fast, processes can be studied in real time. Informative molecular details can be extracted under different conditions and environments, since an aqueous environment disturbs these spectra only to a slight extent. This makes the technique well suited, compared to other available methods, to studies of viral protein assembly, dynamics, interactions, and structural alterations. The stereochemistry and structures of the protein and nucleic acid components of viruses can be determined using spectroscopy [47, 48]. The conformational changes that lead to viral procapsid and capsid assembly were identified using Raman spectroscopy [49, 50]. Raman spectroscopy is also effective in distinguishing between even homogenous viruses, which further increases its possible role in diagnostic medicine.

Dengue fever, yellow fever, Japanese encephalitis, Murray Valley encephalitis, tick-borne encephalitis, and West Nile encephalitis are diseases attributed to flavivirus infection. Early detection is important to prevent these diseases from progressing to severe or terminal stages. Non-structural protein 1 (NS1) is acknowledged as one of the biomarkers for flavivirus-related diseases. Radzol AR et al. defined a PCA-SVM model with an MLP kernel for classification of the flavivirus biomarker NS1 from Surface Enhanced Raman Spectroscopic (SERS) spectra of saliva [51]. The best PCA-SVM (MLP) model defined in this study yielded an accuracy of 96.9%.

Another example of a life-threatening viral infection is hepatitis B, which attacks the liver. In a study analysing hepatitis B virus (HBV) infection in human blood serum using Raman spectroscopy combined with pattern recognition techniques, SVM models with two different kernels, i.e. a polynomial function and a Gaussian radial basis function (RBF), were investigated for discriminating normal blood sera from HBV-infected sera based on Raman spectral features [52]. The best performance was achieved with a polynomial kernel of order 2, with an accuracy of 98% using fivefold cross-validation.

In the case of chronic hepatitis C, liver biopsy was the reference for staging the degree of fibrosis until the last decade. For obvious reasons, non-invasive tests, e.g. blood tests measuring markers involved in either the synthesis or the degradation of the extracellular matrix, are the preferred alternative for the assessment of hepatic fibrosis. However, the performance of these non-invasive methods is limited in differentiating between mild and moderate stages of fibrosis and in evaluating the effect of treatment on the liver fibrosis process. The use of Fourier transform infrared (FTIR) spectroscopy applied to serum in the assessment of hepatic fibrosis was demonstrated by Scaglia et al. [53]. Infrared spectral characteristics of patient serum were used to differentiate chronic hepatitis C patients with extensive hepatic fibrosis from those without fibrosis, and thus to predict the degree of hepatic fibrosis. With leave-one-out cross-validation, the accuracy achieved was 97.7%.

A similar study was performed for the classification of suspected dengue in human sera. SVM models built with three different kernel functions, namely Gaussian radial basis function (RBF), polynomial function, and linear function, were employed to classify human blood sera based on features obtained from Raman spectra [54]. With tenfold cross-validation, the best results were obtained for the polynomial kernel of order 1, with a diagnostic accuracy of about 85%.

The applications are not limited to medical diagnosis. Viruses can infect hundreds of different plant species, including crops such as tobacco, tomato, pepper, and cucumber. Viruses can survive outside the plant and remain in a dormant state to infect growing crops. Once a plant is infected, no chemical cure is effective, and usually all the infected crops should be removed. For detecting seed infestation caused by cucumber green mottle mosaic virus (CGMMV), a near-infrared (NIR) hyperspectral imaging system was used to discriminate virus-infected seeds from healthy seeds with partial least square discriminant analysis (PLS-DA) and least square support vector machine (LS-SVM) [55]. The classification accuracy for virus-infected watermelon seeds was 83.3% with the best model.

Jiyu Peng et al., in turn, proposed an approach to discriminate TMV-infected tobacco based on laser-induced breakdown spectroscopy (LIBS) [56]. Two different kinds of tobacco samples (fresh leaves and dried leaf pellets) were collected for spectral acquisition, and partial least squares discriminant analysis (PLS-DA) was used to establish classification models. In the prediction set, accuracies of 94.4% and 94.7% were obtained for the observed emission lines of dried and fresh leaves. Compared to PLS-DA, SVM proved to be efficient at eliminating the influence of moisture content. Some other examples are listed in Table 4.

Table 4 Illustrative examples for SVM applications using spectroscopy

9.4 SVM Applications for Epitope Prediction

An epitope is a specific target of a few AA residues on an antigen molecule that is recognized by B-cells or T-cells of the immune system [57, 58]. A B-cell epitope is the portion of the antigen that binds to the B-cell receptor (BCR) on B-cells, where the BCR contains a membrane-bound antibody. There are two types of B-cell epitopes based on their orientation. The first is the linear epitope, which comprises a continuous string of AAs. The second, which accounts for most B-cell epitopes, is the conformational epitope, which is made up of discontinuous AAs that come close together upon protein folding [59, 60]. A T-cell epitope binds to the major histocompatibility complex (MHC) on the surface of antigen-presenting cells (APCs), and the MHC presents the antigen to the T-cell receptor (TCR) on T-cells [59]. The major histocompatibility complex (MHC), or human leukocyte antigen (HLA), is the gene family that helps the immune system to identify and destroy foreign substances [61].

Vaccines have proven to be useful tools for controlling various viral diseases such as influenza, smallpox, polio, hepatitis, and rotavirus infection. Conventional methods of developing vaccines use an attenuated or killed whole pathogen that improves immunity to a specific disease, and involve only experimental methods of epitope identification. Vaccine development takes a long time with conventional methods because of the time-consuming experimental screening of a huge number of potential candidates [62]. The fact that only a few AA residues, rather than the whole pathogen, are detected by B- and T-cells is leveraged for vaccine development, understanding disease etiology, disease diagnosis, and immune monitoring [58, 59]. Moreover, with advances in next-generation sequencing methods, proteomics, and transcriptomics, as well as ever-increasing immune system data and databases, epitopes can be identified in a few years. Once epitopes are predicted using computational methods, the peptides can be experimentally tested for their binding affinity and ability to elicit the desired immune response. Immunoinformatics involves the development of bioinformatics tools that analyse data to predict B- and T-cell epitopes that can stimulate an immune response. In-silico epitope prediction methods can help to decrease the number of potential epitopes requiring experimental confirmation, to develop epitope-based vaccines for hypervariable viruses, and to develop chimeric vaccines [59, 62]. Epitope-based vaccines can be safer and less expensive than those produced by conventional methods [62].

Epitope prediction should take into account the desirable features of epitopes: they should be conserved in different parts of the viral lifecycle, have good binding affinity and efficacy, and bind to more than one allele of immune system molecules; most of them are proteins [59, 62]. Most epitope prediction methods are based on proteins and their descriptors, including profiles of physicochemical properties, evolutionary data, sequence motifs, and quantitative matrices (QM) [58, 59]. SVM has been one of the most popular methods for both B-cell and T-cell epitope prediction.

9.4.1 T-Cell Epitope Prediction

T-cell epitopes are processed within a cell, linked with MHC, and presented on the cell surface to be recognized by the T-cell receptor. Each of these steps influences the immunogenicity of T-cell epitopes. However, most T-cell epitope prediction methods focus on the step in which a peptide is linked with MHC-I or MHC-II [59]. MHC-I binds to peptides of length 9–11 AAs, and its pockets prefer peptides with certain physicochemical properties. Hence, peptide-MHC-I binding prediction methods work on peptide sequences of 9 AA residues. MHC-II, on the other hand, binds to longer peptides, but the prediction methods focus on the part of the peptide that binds to the MHC-II groove. A large number of databases, such as IEDB, EPIMHC, and AntiJen, store epitopes verified through experimental approaches [59]. These have served as rich sources of positive examples for several prediction methods.

Different computational methods and models have been used to predict epitopes, such as sequence motifs, motif matrices, and quantitative affinity matrices (QAM). However, machine learning (ML) methods have proven to be the most robust for prediction [63]. With high-dimensional data and a limited number of observations, SVM is often the better choice. In one study, 36 stimulatory peptides and 167 non-stimulatory peptides were gathered, and physical properties of the 20 AAs were used to develop models with artificial neural networks, decision trees, and SVM. SVM outperformed the other methods in predicting stimulatory peptides, with a maximum sensitivity of 0.76 [64].

MHC2Pred is a freely available SVM-based tool for predicting MHC-II binding peptides [65]. To develop the MHC2Pred model, binding and non-binding peptides, defined on the basis of IC50, were collected from the MHCBN and JenPep databases. Peptides with fewer than 9 AA residues were discarded, and the remaining peptides were searched for the 9 AAs that would bind the MHC-II groove using the Matrix Optimization Techniques (MOT) package. A vector of length 20 was created for each AA in the 9-mer peptide, and binders were labelled +1 and non-binders −1. Each peptide was thus represented by a vector of length 180 (9 × 20). These data were used to develop an SVM model, which was validated using fivefold cross-validation and achieved an overall accuracy of >78% [65].
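The 9 × 20 representation can be illustrated with a sparse binary (one-hot) encoding in which each residue contributes a block of 20 entries with a single 1. This is a sketch of the general encoding idea only; the ordering of the amino acid alphabet, the function name, and the example peptides are assumptions, and the actual MHC2Pred preprocessing may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues

def encode_9mer(peptide):
    """Binary encoding of a 9-residue peptide as a 9 x 20 = 180 vector."""
    if len(peptide) != 9:
        raise ValueError("expected a peptide of exactly 9 residues")
    vector = []
    for residue in peptide:
        block = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for a non-standard residue code.
        block[AMINO_ACIDS.index(residue)] = 1
        vector.extend(block)
    return vector

# Hypothetical training pairs: (peptide, label) with +1 binder, -1 non-binder.
examples = [("ACDEFGHIK", +1), ("LMNPQRSTV", -1)]
X = [encode_9mer(peptide) for peptide, _ in examples]
y = [label for _, label in examples]
```

Each row of `X` together with its label in `y` is then in the form that an SVM training routine expects.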

SVMHC is another tool for prediction of both MHC class I and class II binding peptides [66]. For MHC-I prediction model, peptides of length 8–10 were represented by a binary sparse encoding. For MHC-II peptide binding prediction, matrices by Sturniolo et al. [67] were used. These matrices represent HLA-DR peptide binding specificity where HLA-DR is an MHC-II cell surface receptor [67] (see sr. no. 1 of Table 6).

Predicting the immunogenicity of epitopes can help in vaccine design. POPISK is a tool that predicts the reactivity of T-cells to peptides and identifies the positions that are recognized by the TCR [68]. POPISK uses an SVM model with a weighted degree string kernel (see sr. no. 2 of Table 6).

9.4.2 B-Cell Epitope Prediction

B-cell epitopes can be predicted based on physicochemical properties such as hydrophilicity, flexibility, polarity, and exposed surface, as well as secondary and 3D structures [62]. AAindex lists 566 AA indices that represent physicochemical properties of AAs [69].

Linear epitopes can be predicted from antigen sequences by calculating AA propensity scales based on physicochemical properties. The AA propensity (AAP) calculation slides an overlapping window of k AAs along the protein sequence and, for each window, computes the average propensity value of its AAs, where the propensity can be hydrophilicity, accessibility, flexibility, polarity, antigenicity, beta-turn, surface exposure, etc. The average value is assigned to the AA in the middle of the window. AA residues whose value passes a threshold are considered potential epitopes. A combination of different propensity values can be used with specific weights [70].
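The sliding-window calculation described above can be sketched as follows. The propensity scale used here is a made-up toy scale covering only four residues, chosen so the arithmetic is easy to follow; a real analysis would use a published scale, and the function names and threshold are hypothetical.

```python
def window_propensity(seq, scale, k=3):
    """Average propensity over each window of odd length k.

    The average is assigned to the residue at the centre of the window;
    positions too close to either end to be a window centre get None.
    """
    if k % 2 == 0:
        raise ValueError("window length k must be odd to have a centre")
    half = k // 2
    scores = [None] * len(seq)
    for centre in range(half, len(seq) - half):
        window = seq[centre - half:centre + half + 1]
        scores[centre] = sum(scale[aa] for aa in window) / k
    return scores

def candidate_positions(scores, threshold):
    """Indices of residues whose window average reaches the threshold."""
    return [i for i, s in enumerate(scores) if s is not None and s >= threshold]

# Toy scale (illustrative values only, not a published propensity scale).
toy_scale = {"A": 0.0, "K": 1.0, "D": 1.0, "L": -1.0}
scores = window_propensity("LAKDKAL", toy_scale, k=3)
hits = candidate_positions(scores, threshold=0.5)
```

In this toy sequence the central K-D-K stretch scores highest, so residues 2 to 4 (zero-based) are flagged as candidates.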

Due to poor performance of AA propensity scales, Machine learning (ML) methods were later adopted to distinguish B-cell epitopes from non-epitopes. BCPREDS and SVMtrip [71] are epitope prediction tools based on Support Vector Machine (SVM) [59]. More information on SVMtrip is provided in sr. no. 3 of Table 6.

Conformational B-cell epitopes can be predicted using features related to the structure of the proteins. One study used a combination of physicochemical features, evolutionary PSSM features, and structural features such as protrusion index (PI), accessible surface area (ASA), relative accessible surface area (RSA), and B-factor [72] (see sr. no. 4 of Table 6). Physicochemical properties of AAs were derived from AAIndex. The PSSM represents attributes extracted from iterated multiple sequence alignment, and can be generated using PSI-BLAST with a specified number of iterations. It is a scoring matrix in which each position in the multiple sequence alignment is given AA substitution scores. The PSSM is used to incorporate evolutionary information about a peptide [73,74,75]. Another study on conformational B-cell epitopes, by Ansari et al. [76], uses three types of features, namely binary profile of patterns (BPP), physicochemical profile of patterns (PPP), and composition profile of patterns (CPP) (see sr. no. 5 of Table 6). In this study, patterns of different lengths were created from the sequences. For each pattern, three feature vectors were then created: (1) BPP, a vector of length 21 based on binary numbers for the occurrence and non-occurrence of each AA; (2) PPP, a vector of length 5 based on five physicochemical properties, namely Hydrophobicity, Flexibility, Polarity_Grantham, Polarity_Ponnuswami, and Antigenicity; and (3) CPP, based on the composition of patterns. The CBTope server uses this method for predicting B-cell epitopes [76].

Listed in Table 5 are some freely available T-cell & B-cell epitope prediction web servers based on SVM.

Table 5 List of freely available epitope prediction servers

Information on some more SVM-based epitope prediction studies is provided in Table 6.

Table 6 Illustrative examples of epitope prediction based on SVM

9.5 Applications of SVM Involving Protein-Protein Interaction in Virology

Proteins are the workhorses of the cell and carry out the majority of its functions. Eighty percent of proteins are not functional in isolation; instead, they operate in complexes by interacting with other molecules [77, 78]. Protein-protein interactions (PPIs) are the physical and functional interactions between proteins that control a wide range of molecular processes in a cell, such as signal transduction, cell-cell communication, transcription, and replication [79]. PPIs can alter the kinetic properties of enzymes, modify protein activity, change the specificity of protein binding, construct new binding sites, and serve regulatory functions. Alteration or malfunction of these interactions can lead to disease [79]. The collection of all the protein-protein interactions of a cell or an organism is called the interactome. The study of PPIs can help to predict a biological process involving a protein of unknown function, accelerate the understanding of functional pathways, and elucidate the biochemistry of a cell [77, 79, 80]. Knowledge of specific PPIs can also help in the identification of drug targets [79].

PPI data can be mapped to large-scale networks in which nodes represent proteins and edges represent their physical or functional interactions. These networks are known as PPI networks (PIN) [77, 79]. PPI networks can be used to extract various kinds of information, such as the function of a protein based on its placement in the network, since closely linked proteins can have similar biological activity. PPI networks can also be used to decipher which complex a protein belongs to and the diseases related to a protein [79]. The knowledge encapsulated in PPI networks can help to improve biological and biomedical applications [77].
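A PPI network of this kind can be represented as an undirected graph built from a list of interacting pairs. The sketch below stores the graph as adjacency sets and ranks proteins by degree (number of interaction partners), a simple centrality measure that identifies highly connected proteins. The protein identifiers and function names are placeholders, not real data.

```python
from collections import defaultdict

def build_network(interactions):
    """Undirected PPI network as a mapping protein -> set of partners."""
    adjacency = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions in this simple sketch
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def degree_ranking(adjacency):
    """Proteins sorted by decreasing degree, ties broken alphabetically."""
    return sorted(adjacency, key=lambda p: (-len(adjacency[p]), p))

# Placeholder interaction list (hypothetical protein identifiers).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
network = build_network(edges)
ranking = degree_ranking(network)
```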

Virus-host protein interactions are key to viral infection and subsequent pathogenesis. Many PPIs between virus and host are involved during a viral infection, in which the viral proteins take over the host transcriptional machinery [78]. Viral proteins are believed to bind to host proteins that are highly connected [81]. With respect to virus-host systems, endogenous interfaces are responsible for interactions within the same system, i.e. host-host PPIs and virus-virus PPIs. Exogenous interfaces, on the other hand, are responsible for virus-host interactions. Both virus and host compete for endogenous and exogenous interfaces [81]. Mutations at protein interfaces can reduce or increase binding affinities by changing protein electrostatics and structural properties. Virus and host proteins change their surface residues through mutations as an evolutionary response to competition for binding partners. However, the host tends to be less variable than viruses. Viruses diversify through various modes of molecular evolution, including conservation, horizontal gene transfer, gene duplication, and molecular mimicry [81]. Viral proteins constantly inhibit host-host interactions; blocking such interactions between virus and host can therefore aid biomedical applications through the identification of drug targets and the development of antiviral therapies [81]. For example, the drug Maraviroc binds the cellular co-receptor CCR5, a receptor on white blood cells involved in the immune system, preventing it from interacting with GP120 of HIV-1, which is essential for the entry of HIV-1 into the host [82]. As viruses pose a global threat, understanding virus-human PPIs can help in the development of vaccines and treatments.

Comprehensive PPI networks have been generated using experimental methods. These experimental methods employ different techniques like tandem affinity purification, affinity chromatography, coimmunoprecipitation, protein arrays, protein fragment complementation, phage display, X-ray crystallography, and NMR spectroscopy [79]. However, due to the huge PPI data and the time consuming experimental methods, computational methods are increasingly becoming popular to analyse the PPI networks and find out the functions of unexplored proteins. Computational methods of PPI detection are based on sequences, structure of molecules, gene fusion, phylogenetic tree and gene expression [79].

Detection of virus-host interactions using machine learning methods has proved to be very useful. Several SVM models have been developed for this purpose; known PPIs, used as the positive set, train the models to predict whether two proteins interact or not. Positive data can be extracted from experimental data available in databases. Selecting a negative dataset is more complicated. Negatome, a database of negative interactions developed using text mining, can be used to gather a negative data set [83, 84].

Emamjomeh et al. [85] developed an SVM model to predict PPIs between human and hepatitis C virus (HCV). In this study, SVM was combined with other learning methods such as random forest (RF), Naïve Bayes (NB), and multilayer perceptron (MLP). Feature vectors were generated for HCV and human proteins from six different features: AA composition (ACC), pseudo AA composition (PAC), PSSM as the evolutionary information feature, network centrality measures, tissue information, and post-translational modification (PTM) information [85]. AA composition is the simplest descriptor used to represent a protein sequence. However, this descriptor loses the sequence order of the AAs; hence pseudo AA composition is used, which involves AA composition as well as sequence order-based features [5] (see sr. no. 1 of Table 7).

Table 7 Illustrative examples of Protein-Protein interaction studies based on SVM

Cui et al. [86] developed an SVM model for the prediction of virus-host PPIs for two viruses, human papillomavirus (HPV) and hepatitis C virus (HCV). This SVM model is based on the relative frequency of AA triplets (RFAT) between virus and host AA sequences and on GO annotations of the proteins. RFAT generates fixed-length vectors for variable-length proteins and enables models to achieve better accuracy. In this study, a vector based on AA triplets and biochemical similarity is generated. Based on the biochemical properties of AA residues, 6 categories are defined as {IVLM}, {FYW}, {HKR}, {DE}, {QNTP}, and {ACGS}. Using this classification of AAs, there are 6 × 6 × 6 = 216 possible AA triplets [86]. The protein sequence is converted into AA triplets, and a vector of length 216 is created that contains the frequency of each triplet of categories, whatever the length of the sequence. LIBSVM [87] was used to generate the model with the radial basis function (RBF) as the kernel function. HCV-human interaction data were extracted from the infection mapping project (I-MAP), whereas HPV data were extracted from the NCBI BioSystems database. An accuracy of 85.1 was achieved for HCV, whereas for HPV it was 87.5 [86] (see sr. no. 2 of Table 7).
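The triplet encoding can be sketched as follows, using the six residue categories listed above. The sketch counts overlapping triplets of categories in a single sequence and normalizes by the total triplet count; this normalization, the function name, and the way virus and host vectors are later combined are assumptions for illustration and may differ from the exact RFAT definition in [86].

```python
from itertools import product

# The six biochemical categories given in the text.
GROUPS = ["IVLM", "FYW", "HKR", "DE", "QNTP", "ACGS"]
AA_TO_GROUP = {aa: g for g, members in enumerate(GROUPS) for aa in members}

# All 6 x 6 x 6 = 216 ordered triplets of categories, with a fixed index.
TRIPLETS = list(product(range(len(GROUPS)), repeat=3))
TRIPLET_INDEX = {t: i for i, t in enumerate(TRIPLETS)}

def triplet_frequencies(seq):
    """Fixed-length (216) vector of category-triplet relative frequencies."""
    # Residues outside the 20 standard AAs are dropped before counting.
    codes = [AA_TO_GROUP[aa] for aa in seq if aa in AA_TO_GROUP]
    counts = [0] * len(TRIPLETS)
    for i in range(len(codes) - 2):
        counts[TRIPLET_INDEX[tuple(codes[i:i + 3])]] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [c / total for c in counts]

# Sequences of different lengths map to vectors of the same length.
short_vec = triplet_frequencies("MKDE")
long_vec = triplet_frequencies("MKDEIVLFYWHKRQNTPACGS")
```

Because the output length depends only on the number of categories, proteins of any length can be fed to the same SVM model.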

RFAT has been used in many studies with different combinations of categories and k-mer lengths. In a study of HIV-human PPIs [88], four-mer sequences were used instead of triplets. With 7 categories and 4-mer sequences, an RFAT vector of length 4802 (7^4 × 2) was generated (see sr. no. 5 of Table 7).

Kim et al. [89] used 4 categories based on the chemical properties of the AA side chains, giving 64 (4 × 4 × 4) possible AA triplets (see sr. no. 3 of Table 7).

In other PPI studies by Zhou et al. [78] and Shen et al. [90], a similar feature vector of triplets is produced, but 7 categories of AA residues are used instead of 6, and these categories are based on the dipoles and volumes of the AA side chains. With 7 categories, 343 (7 × 7 × 7) AA triplets are possible. The RFAT feature vector had 686 elements, i.e. 343 for the host and 343 for the virus. Zhou et al. [78] use additional features: the frequency difference of AA triplets (FDAT) between virus and host proteins, the AA composition (AC) of each pair of host and virus proteins, the normalized frequency of each AA group, and the transition and distribution of AA groups. Together, these 6 features gave a feature vector of length 1175. Again, LIBSVM [87] with the RBF kernel was used to develop the model. The best performance, an accuracy of 85.64%, was obtained with the combination of all 6 features (see sr. no. 4 of Table 7).
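The vector lengths quoted for the different RFAT variants follow from simple counting: the number of residue categories raised to the k-mer length, multiplied by the number of proteins whose vectors are concatenated. The helper below is a hypothetical illustration of that arithmetic and is not taken from any of the cited studies.

```python
def rfat_vector_length(n_categories, k=3, n_proteins=1):
    """Number of k-mers over n_categories residue groups,
    times the number of proteins whose vectors are concatenated."""
    return (n_categories ** k) * n_proteins

print(rfat_vector_length(6))                     # 216  (6 categories, triplets [86])
print(rfat_vector_length(4))                     # 64   (4 categories, triplets [89])
print(rfat_vector_length(7))                     # 343  (7 categories, triplets [78, 90])
print(rfat_vector_length(7, n_proteins=2))       # 686  (host + virus [78])
print(rfat_vector_length(7, k=4, n_proteins=2))  # 4802 (7 categories, 4-mers [88])
```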

Most of the prediction methods are specific to one virus-host combination. However, there are SVM-based methods that are generic enough to predict PPIs for viruses and hosts that were not included in the training set. The approach by Zhou et al. [78] is one such method, i.e. it does not require a separate model for each host-virus pair. Another generic method, DeNovo, can predict novel PPIs. It is based on an SVM that is trained on PPIs from different virus-host systems [91].

Table 7 lists some studies in which SVM models were used to predict virus-host protein-protein interactions.

10 Miscellaneous Examples

Apart from the above examples, there are some notable studies employing other approaches to address problems in virology. A microarray is a microscopic chip on which each spot carries an attached DNA/cDNA sequence. These sequences bind to complementary unknown sequences and thereby allow the expression of thousands of genes to be measured. In virology, microarrays are used to screen for viruses whose genomes are available in GenBank by probing conserved viral sequences. Microarray gene expression profiles are also used to detect the immune response, which can further help in classifying diseases caused by viruses, a task conventionally performed using quantitative real-time PCR (qPCR). SVM can be used to detect the immune response from microarray gene expression data. Because microarray data sets are large, the important features are first extracted using feature selection methods.

In one study [92], the authors reported that DNA microarray technology can be used as a high-throughput method to analyse polymorphisms within a short region of the FMDV genome encoding functions relevant to antigenicity and receptor recognition. Their SVM-based methodology classifies the samples based on their hybridization signal. This prediction methodology has wide-ranging applications in fine genotyping, including studies of heterogeneous viral populations and of genetic changes in viruses, bacteria, and the genes of rapidly evolving cells, such as tumor cells.

Predicting the hosts of newly discovered viruses is important for pandemic surveillance of infectious diseases. Li and Sun [93] investigated the use of alignment-based and alignment-free methods, and of a support vector machine using mononucleotide frequency and dinucleotide bias, to predict the hosts of viruses. They applied these approaches to three datasets: rabies virus, coronavirus, and influenza A virus. The authors [93] also showed that SVM predicts the hosts of viruses with a high degree of accuracy.
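Alignment-free features of this kind can be computed directly from a genome sequence. The sketch below assumes the commonly used odds-ratio definition of dinucleotide bias (observed dinucleotide frequency divided by the product of the two mononucleotide frequencies); the exact definition used in [93] may differ, and the function name `nucleotide_features` is hypothetical.

```python
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]

def nucleotide_features(genome):
    """Return 4 mononucleotide frequencies followed by 16 dinucleotide
    bias values (observed / expected under independence).

    RNA input is handled by mapping U to T; ambiguous bases are ignored
    (both are assumptions made for this sketch).
    """
    seq = genome.upper().replace("U", "T")
    mono = {b: 0 for b in BASES}
    for b in seq:
        if b in mono:
            mono[b] += 1
    di = {d: 0 for d in DINUCLEOTIDES}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in di:
            di[pair] += 1
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    freq = {b: mono[b] / n_mono if n_mono else 0.0 for b in BASES}
    bias = []
    for d in DINUCLEOTIDES:
        expected = freq[d[0]] * freq[d[1]]
        observed = di[d] / n_di if n_di else 0.0
        bias.append(observed / expected if expected else 0.0)
    return [freq[b] for b in BASES] + bias

features = nucleotide_features("ACGTACGTAC")
print(len(features))  # 20
```

A bias value near 1 indicates that a dinucleotide occurs about as often as expected from base composition alone; values far from 1 indicate over- or under-representation, which is the kind of signal such a classifier can exploit.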

The phosphorylation of virus proteins by host kinases is linked to viral replication, leading to an inhibition of normal host-cell functions. Unravelling the phosphorylation mechanisms of virus proteins can aid in drug design and treatment. In one study [94], a two-layered SVM was applied to train a predictive model for the identification of phosphorylation sites.

Replication of their DNA genomes is a central step in the reproduction of many viruses. [V4] proposes a novel least-squares support vector machine (LS-SVM) model, evaluated on viruses of the herpes family along with data sets comprising caudoviruses from three viral families of the order Caudovirales. The LS-SVM approach performs better than the previous methods. When combined in an ensemble with previously proposed methods, the LS-SVM approach further improves the prediction accuracy for herpesvirus replication origins. Recursive feature elimination was used to extract the most informative attributes, which provides important domain knowledge about the most significant features of the data sets. The authors [95] further conclude that LS-SVMs can potentially be a very reliable and robust tool for viral replication origin prediction.

11 Web Server

SVM has been used in a variety of studies on viruses across different data types. Some of the tools mentioned in these studies are available as standalone tools, whereas others run in the backend of freely available web servers. Web servers are user friendly and intuitive, making it easy for the user to input data and analyse the output. Table 8 shows some of the web servers based on SVM models that are used in virology.

Table 8 Examples of SVM-based web servers for virology studies

12 Concluding Remarks

In this review, we illustrated the use of Support Vector Machines as a tool for building learning models in viral biology. SVM plays a vital role in building quantitative structure-activity relationship models. The robustness and accuracy of SVM models, which are rigorously grounded in statistical learning theory, have paved the way for faster and more reliable methods of identifying potent molecules in drug design. SVM models have also enabled the development of tools for the rational design of novel vaccines. Data from recent advances in NGS technology could also be readily incorporated into SVM models to improve their performance. We have also listed a large number of case studies and examples from different areas of viral biology where SVM has been deployed with productive results.