Sequence retrieval and structural analysis of the target proteins
Amino acid sequences of target proteins (S [Genbank: QHD43416.1], E) [Genbank: QHD43418.1] and M [QHD43419.1]) of SARS-COV-2 were collected in fasta format from Genbank. Their allergenicity and antigenicity were checked by AllerTop and vaxijen respectively. All proteins were found to be non-allergenic and antigenic. E protein was considered to be the most antigenic followed by M and S protein with 0.60, 0.51 and 0.46 antigenic values respectively. Besides, 40 S protein disulphide bonds, 3 E protein disulphide bonds and 6 M protein disulphide bonds calculations were made in DIANNA v1.1 (Additional File 3: Table S2). Physiochemical properties evaluated by protparam showed that all proteins are stable. S protein was found to be acidic and hydrophilic, while E and M protein was found to be basic and hydrophobic (Table 1). TMHMM carried out a prediction of the transmembrane topology of the target proteins displaying one transmembrane helix in S and E protein while three transmembrane helices in M protein. Residues of the S protein from 1-1213 were on the surface while residues from 1214-1236 were within the transmembrane region and residues from 1237-1273 were buried within the core region of S protein. Similarly, in E protein residues from 1-11 were exposed on the surface and residues from 12-34 were inside the transmembrane region and residues from 35-75 were buried inside the core region of E protein. In M protein residues from 1-19 and 74-77 were exposed on the surface, while residues from 20-39, 51-73, and 78-100 were inside the transmembrane region and residues from 40-50 and 101-222 were buried inside the core region. Secondary structure details of the target proteins predicted by PSIPRED are mentioned in Additional File 4: Table S3.
Homology modeling and validation
To predict structure-based epitopes of SARS-COV-2 proteins (S, E, and M), homology modeling of the proteins was performed. Chain A, Spike glycoprotein of SARS-COV (PDB ID: 6ACC) and Chain A, Envelope small membrane protein of SARS-COV (PDB ID: 5X29) were found to be the best template based on E-value, percent identity, and query coverage for S and E protein respectively. S protein was found to have a 75% identity with a spike protein of SARS-COV (6ACC). E protein had 88.71% identity with Chain A, Envelope small membrane protein of SARS-COV. No suitable template was found for M protein, so its structure was predicted by Raptor X [48]. The amino acid sequence was provided as an input that yielded to have two domains and the best template used was model 5yckA. The P-value for the modeled structure was calculated at 9.14× 10−5. All 222 (100%) amino acids were modeled and 27 (12%) positions were predicted as disordered in the model. Visualization of the models was done by Chimera (Additional file 1: Figure S2). Structures were refined by Galaxy refine server and ModRefiner. The quality factor (z-score) and Ramachandran plot values of refined models are mentioned in Additional File 5: Table S4 (Additional file 1: Figure S3-S5).
B cell epitopes prediction
ABCpred was used to predict the linear B cell epitopes of the target proteins and the TMHMM server was used to check the surface availability. Antigenicity was checked by vaxijen. Epitopes that were exposed on the surface, antigenic, and 100% conserved in the protein were selected from all the predicted epitopes. Total 23 epitopes (S-19, E-1, and M-3) were selected based on these criteria. Among the chosen epitopes, ‘SPTKLNDLCFTNVY’ of S protein showed the highest antigenicity (1.69) and predicted score (Table 2). The position of epitopes on their respective protein structure was visualized by Pymol (Figure 1).
Besides, potential B cell surface accessibility needs to be examined. By analyzing the physiochemical properties of amino acids and their concentration in previously identified B cell epitopes, the Kolaskar and Tongaonkar antigenicity tools evaluated target proteins for predicting B cell epitopes. The estimation threshold was set at 1.045, while the window size was maintained 7. It predicted the protein antigenic propensity value of S protein as 1.041 (Average), 0.866 (Minimum) and 1.261 (Maximum) (Additional file 1: Figure S6-A), of E protein as 1.119 (Average), 0.947 (Minimum) and 1.262 (Maximum) (Additional file 1: Figure S7-A) and of M protein as 1.053 (average), 0.904 (minimum), and 1.235 (maximum) (Additional file 1: Figure S8-A). Chou and Fasman beta-turn analyzing algorithm was used to predict beta-turn in target proteins since beta-turn in nature is exposed to the surface and hydrophilic and plays a vital role in beginning the defensive response. The threshold value was adjusted at 1.009, it calculated values of 0.097 (average), 0.541 (minimum) and 1.484 (maximum) in S protein (Additional file 1: Figure S6-B), 0.883 (average), 0.554 (minimum) and 1.264 (maximum) in E protein (Additional file 1: Figure S7-B), and 0.915 (average), 0.600 (minimum) and 1.384 (maximum) in M protein (Additional file 1: Figure S8-B). The findings show that area from 251 to 257 amino acids in S protein, area from 63-69 amino acids and 67-73amino acids in E protein and area from 209-216 amino acids in M protein are more likely to reassure beta turns in peptide structure. Experimental experience shows that the parts of the epitope bound with antibodies or alleles are essentially elastic. Hydrophilic protein regions are usually exposed on the surface and play a key role in eliciting an immune response. ABCpred score and the antigenic value calculated by vaxijen certainly show that all expected peptides are part of the transmembrane-protein extracellular region and capable of optimizing a defensive response within the host during COVID-19 infection. Therefore, to determine the hydrophilicity and the surface abundance of possible B cell epitopes, the parker surface accessibility prediction method having a threshold of 1.279 and Emini surface accessibility prediction methods having a threshold of 1.00 were used. Values calculated by parker-hydrophilicity was 1.238 (average), -7.629 (minimum) and 7.743 (maximum) in S protein (Additional file 1: Figure S6-C), -0.911 (average), -6.843 (minimum) and 4.929 (maximum) in E protein, and -0.499 (average) (Additional file 1: Figure S7-C), -9.257 (minimum) and 6.871 (maximum) in M protein (Additional file 1: Figure S8-C). Emini surface accessibility analyzing results are given in the Additional File 6: Table S5 and Additional file 1: Figure S6-D, S7-D, S8-D. Flexibility analysis by Karplus and schulz tool showed that area from 251-257 amino acids in S protein (Additional file 1: Figure S6-E), 65-71amino acids in E protein (Additional file 1: Figure S7-E) and area from 210-216 amino acids in M protein (Additional file 1: Figure S8-E) are highly versatile.
To further improve the specificity and variety of B-cell epitopes, Discotop 2.0 server was used to calculate surface abundance concerning residual contact number and use the novel amino acid score to forecast discontinuous epitopes. 3D structures of the target proteins were used to predict discontinuous epitopes; 90% specificity, − 3.700 thresholds and 22.000 Angstroms propensity score radius. 138 discontinuous epitopes of S protein, 1 epitope of the E protein and 22 epitopes of M protein were calculated (Table 3). The position of epitopes on their respective protein structure was visualized by Pymol (Figure 2).
T cell epitopes prediction and properties evaluation
T cell (MHC class I & MHC class II) epitopes of target proteins were predicted by the IEDB consensus method. Peptides that can bind to multiple alleles are considered the most appropriate peptides due to their strong defensive ability. Their antigenicity and allergenicity were checked by vaxijen and Allergen FP 1.0. Their conservation in protein sequences was analyzed by the IEDB conservancy analysis tool. Epitopes that are bound to multiple alleles, highly antigenic, non-allergenic and 100% conserved were screened out. Based on these criteria, 9 MHC class I (S-3, E-3, and M-3) and 7 MHC class II (S-1, E-3 and M-3) were shortlisted. Between MHC class I epitopes ‘IPFAMQMAYRFN’ of S protein suggested higher antigenicity score 1.3 binding with multiple alleles including HLA-B*35:01, HLA-B*35:03, HLA-A*24:02, HLA-B*51:01, HLA-B*53:01. The peptide ‘VTLACFVLAAVYRIN’ of M protein was considered most antigenic for its higher antigenicity score 1.0 between MHC class II epitopes and it was bound to multiple alleles including HLA-DRB1*07:03, HLA-DRB1*11:20, HLA-DRB1*01:02, HLA-DRB1*07:01, HLA-DRB1*11:14, HLA-DRB1*13:23, HLA-DRB1*03:09, HLA-DRB1*13:07, HLA-DRB1*11:28, HLA-DRB1*13:05, HLA-DRB1*03:05, HLA-DRB1*04:08 (Table 4). Protein digesting enzyme server was used to estimate peptides digesting enzymes. Enzymes that do not digest peptide into fragments is considered non-digestive enzyme. Peptides digestible with many enzymes are not stable. Less enzyme digested peptides, on the other hand, are very stable and favored vaccine candidates (Table 5).
World population coverage
Population coverage was calculated with finally selected nine MHC class I and seven MHC class II epitopes and related HLA alleles. Selected MHC class I and MHC class II epitopes represented 88.23% and 66.66% of the world’s population. Highest coverage of MHC class I epitopes found in the population of two countries Finland (96.16%) and Italy (96.38%). Similarly, the highest coverage of MHC class II epitopes was found within France (79.44%). In China where SARS-COV-2 was first identified, MHC class I epitopes showed 63.26% population coverage and MHC class II showed 41.17% coverage. Also, the population coverage was higher in regions where SARS-COV-2 cases have been reported (Figure 3).
Molecular docking analysis
The 3D structures of selected nine MHC class I epitopes were predicted using PEPFOLD (Additional file 1: Figure S9). Molecular docking is important to understand the patterns of protein-peptides interaction. To analyze the interaction of HLA allele with selected MHC class I epitopes, molecular docking was performed. Docking results revealed all the selected peptides binds inside the receptor binding site of HLA allele with strong interactions (Figure 4).