Next Article in Journal
Ursolic Acid Induces Apoptosis in Colorectal Cancer Cells Partially via Upregulation of MicroRNA-4500 and Inhibition of JAK2/STAT3 Phosphorylation
Previous Article in Journal
Electrochemical Detection of Dopamine at a Gold Electrode Modified with a Polypyrrole–Mesoporous Silica Molecular Sieves (MCM-48) Film
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

LAIPT: Lysine Acetylation Site Identification with Polynomial Tree

1
School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou 221018, China
2
School of Information Science and Engineering, Zaozhuang University, Zaozhuang 277100, China
3
School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China
*
Authors to whom correspondence should be addressed.
Int. J. Mol. Sci. 2019, 20(1), 113; https://doi.org/10.3390/ijms20010113
Submission received: 16 October 2018 / Revised: 30 November 2018 / Accepted: 5 December 2018 / Published: 29 December 2018
(This article belongs to the Section Molecular Biophysics)

Abstract

:
Post-translational modification plays a key role in the field of biology. Experimental identification methods are time-consuming and expensive. Therefore, computational methods to deal with such issues overcome these shortcomings and limitations. In this article, we propose a lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate amino-acid residue relationships in peptide segments. This polynomial style was enriched by the physical and chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree. Finally, some effect evaluation measurements were employed to test the model’s performance.

1. Introduction

Post-translational modification (PTM) is one of the most significant processes in the field of biology. More than 650 types of post-translational modification were reported across several decades of efforts. Among these types of post-translational modification, several modifications have the ability to reverse their processes. PTM provides a fine-tuned control of protein function in various types of cells in the field of disease research and drug design [1,2,3,4]. For example, the well-known tumor suppressor p53 is subject to many post-translational modifications, which have the ability to alter its localization, stability, and other related functions, thus ultimately modulating its response to various forms of genotoxic stress [5,6,7,8,9,10]. Therefore, p53 drives both the activation and repression of a large number of promoters, which ultimately define its tumor suppressor abilities. This tumor suppressor is a critical transcription factor in the field of post-translational modification [11]. With these reversible modifications, protein structures change and their functions are enriched to some degree. As one of the most typical and classical reversible types of modification, lysine acetylation was reported about half a century ago [1,2]. Acetylation occurs on the ε-amino group of lysine residues; it was noted that three enzymes take part in this process. Whereas lysine deacetylases (KDACs) remove the acetyl groups of proteins, lysine acetyl transferases (KATs) transfer the acetyl group across proteins [3,4,5,6]. Considering the key role of lysine acetylation in several diseases and novel drug designations, a great deal of experimental approaches were proposed and introduced to identify the acetylation sites of lysine residues in protein sequences. These experimental approaches, including radioactivity chemical methods, chromatin immune precipitation (ChIP), and mass spectrometry, play their roles in various degrees [7,8]. Unfortunately, these experimental methods can hardly meet the need of identifying sites, and they are time-consuming and expensive. Considering this issue, effective identification methods, based on a computational biology approach, are urgently needed to identify acetylation modification sites, especially with the increasingly number of protein resources.
When it comes to computational biology methods, several classical methods were introduced in the field of protein sequence procession [12,13,14,15,16]. Meanwhile, with the development of machine learning and artificial intelligence, some computational methods were proposed and designed to deal with similar issues at the DNA, RNA and protein levels [17,18,19]. Several milestone efforts were demonstrated in the field of identification of lysine modification sites. For instance, Xu et al. made use of a support vector machine (SVM) to identification lysine acetylation sites with ensemble information [20]. PLMLA(prediction of lysine methylation and lysine acetylation by combining multiple features), which was designed by Shi et al. in 2012, utilized information about protein sequences and secondary structure to demonstrate whether lysine residues were modified or not [21]. In the same year, PSKAcePred (Position-Specific Analysis and Prediction for Protein Lysine Acetylation Based on Multiple Features), which was proposed by Suo et al., was based on amino-acid composition and physicochemical properties to quantify protein segments [22]. Meanwhile, Shao et al. proposed BRABSB (bi-relative adapted binomial score Bayes), which made use of binomial score Bayesian [23]. Since then, SSPKA (species-specific lysine acetylation prediction), based on the random forest (RF) model, was proposed in 2014 to deal with such modification sites. Two years later, Wu et al. designed a novel approach named KA-predictor (Improved Species-Specific Lysine Acetylation Site Prediction) that utilized many different kinds of features to identify cases of lysine modification [24]. Overall, models for the effective identification of modification sites consist of two parts. The first part is feature description, which focuses on an effective method of showing protein sequence information or peptide segment information in several different aspects. The second part is the construction of the machine learning model, which aims to deal with different types of protein sequences or peptide segments with high accuracy and generalizability. The abovementioned methods, among others (PLMLA, Phosida, LysAcet, EnsemblePail, PSKAcePred, BRABSB, and SSPKA), can be regarded as the state of the art in this field.
Relationships among amino-acid residues need to be effectively described at the protein level. These relationships have the ability to demonstrate the local information of amino-acid residues in some peptide segments, and can be helpful in constructing more useful information with regards to the identification of modification sites. Some related work was proposed in DNA and RNA analysis [25,26,27,28,29,30,31]; methods such as DeepBind and DeepSea take advantage of deep convolutional neural networks (CNNs) to predict the sequence specificities of DNA-binding proteins [32,33,34,35]. In summary, these sequence analysis methods can be regarded as issues resolved using computational biology.
When it comes to the abovementioned issues, Chou proposed five steps for dealing with them [35,36,37,38]. In the first step, available benchmark datasets should be selected, which are used to train and test machine learning models. In the second step, available methods for sequence quality expression should be selected. In the third step, an available algorithm should be used to identify positive and negative samples. In the fourth step, validation methods for evaluating the performances of the proposed methods should be selected. In the final step, a web resource should be constructed to detail the workflow, along with related raw data. Therefore, in this paper, we introduce a method for the identification of lysine acetylation sites following these steps.
In this article, we propose lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate amino-acid residue relationships in peptide segments. This polynomial style was enriched by the physico-chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree (FNT). Finally, some effect evaluation measurements were employed to test the model’s performance. And the website of this work is shown in http://121.250.173.184/.

2. Results and Discussions

2.1. Comparison with Other Features

In order to evaluate the performance of the polynomial form features, several state-of-the-art methods were chosen for comparison, including binary encoding, amino acid composition (AA composition), grouping AA composition, physico-chemical property, k nearest neighbor features, and secondary tendency structure. The details of these comparisons are shown in Table 1, Table 2 and Table 3.

2.2. Comparison with Other Models

In order to more objectively evaluate the performance of the proposed feature description and employed classification model, we compared it with several state-of-the-art methods, including DBD (Breakdowns of B)-Threader, iDNA-Prot, and other similar tools in the field of sequence classification and post-translational modification. Details of these comparisons are shown in Table 4, Table 5 and Table 6.
In order to show the proposed model’s stability and generalization, we utilized the ROC (receiver operating characteristic curve) curve to show the classification results. Meanwhile, some cross-validation methods (fourfold, sixfold, eightfold, and 10-fold) were also utilized. The detailed ROC curves for each species are shown in Figure 1, Figure 2 and Figure 3.

2.3. Performance Using Differences Bandwidths

In this work, the bandwidths of sliding windows played a significant role in the feature size. On the one hand, the lack of a bandwidth can waste computational resources and result in ineffective feature description. On the other hand, different species may have unique bandwidths in this classification model. Therefore, we tested bandwidths ranging from 21 to 31, with an interval of 2. Detailed results for each of these bandwidths in the selected species are shown in Table 7. In order to more objectively show the results, we compared them with other machine learning methods, including SVM, NN, and RF.
From the above table, we can easily determine that the most appropriate bandwidths for Homo sapiens, Mus musculus, and Escherichia coli were 25, 25, and 23, respectively Furthermore, the FNT model performed better than the three other machine learning methods in the majority of measurements among these bandwidths.

2.4. Performance of Polynomial Feature Description

In this section, we discuss the parameter selection of polynomial feature description. The three proposed feature description methods constitute five parameters, one of which is described by Equation (1), two of which are described by Equation (2), and two of which are described by Equation (3). The three proposed methods were compared, taking into account the coefficients a1 and a2, and the constants b1, b2, and c. We defined a1 and a2 in the range [−10, 10], and three constants in the range [−100, 100], to test the performance of the employed classification method. We determined that the most appropriate parameters of the three proposed features were as follows: c = 57.6, a1 = 4.1, a2 = −2.7, b1 = 27.1, and b2 = 67.1. The three proposed features performed differently. The abovementioned classification models were also used for comparison. In order to reduce the usage of unnecessary computational resources, the most appropriate bandwidths determined previously were used. The Details of the results are shown in Table 8.
y = c 1
y = a 1 x 1 2 + b 1
y = a 2 x 2 + b 2
From the above table, we can easily determine that the different features performed differently. The most appropriate feature description was that described by Equation (3) for both H. sapiens and E. coli, while that described by the Equation (2) was most suitable for M. musculus.

3. Materials and Methods

Because of the ubiquity and universality of lysine acetylation at the protein level, we can find several acetylated proteins in various databases, including NCBI (National Center for Biotechnology Information), Uniprot, and other related proteomics databases. In this study, we selected about 30,000 protein sequences, which contain more than 111,200 acetylation sites among them [49]. These proteins could be extracted from the Protein Lysine Modification Database (PLMD) version 3.0 [50]. PLMD is one of the most well-known and commonly used post-translational modification site databases, and it contains more than 20 types of lysine modification in more than 170 species at the protein level. Generally, this database can be treated as the largest available acetylation database; thus, it was employed as the benchmark dataset in this work. Unfortunately, overestimation may be one of the most significant limitations when using machine learning. In order to overcome this shortcoming, CD-HIT (Cluster Database at High Identity with Tolerance) was utilized to remove some homologous sequences [51,52,53,54]. In this work, we utilized a threshold of 40% similarity with this tool. Following this process, we obtained 59,532 proven acetylated modification sites from 20,527 protein sequences. These protein data were used to construct the training, testing, and independent datasets. During this classification process, we defined the proven acetylated sites as positive samples and the non-proven modifications as negative samples. Detailed information of the employed datasets is shown in Table 9, and details with regards to the construction of datasets are shown in Figure 4.
In this work, we employed the general dataset as the training and testing datasets. In order to evaluate the generalization and stability, we employed three species incorporating lysine acetylation sites as the independent datasets.
After constructing the available datasets, some peptide segments were extracted from the whole protein sequences. In order to reduce the unnecessary usage of storage space and computational resources, some peptides with a central lysine residue were extracted in this work. We made use of sliding windows to extract peptide segments with a size of 2n + 1 [55], where n is the length of the upstream or downstream fragment, and 1 is the position of the central lysine residue in the segment. In this work, the length of the upstream fragment was equal to that of the downstream fragment, and n ranged from 10 to 15. Thus, the whole length of the sliding window was between 21 and 31. In the next section, we discuss the performances of the various selected lengths of sliding window.

3.1. Encoding of Protein Fragments

Several different types of features for quantifying biological sequences were presented across many years of protein research, such as amino-acid composition, position special scoring matrix, physico-chemical properties, and other related features [56,57,58] These features can demonstrate sequence information in various aspects, and they play various roles in protein sequence analysis. However, few features can demonstrate the relationships of amino-acid residues. In this paper, each peptide was treated as a sample. According to biological concepts, neighboring amino-acid residues present both coordination and individual functions. On this basis, we tried utilizing some of these functions to describe the relationships in this work.
We propose a polynomial method to describe the relationships between the central lysine residue and the neighboring amino-acid residues. Several forms of polynomial styles exist, such as the constant form, linear function form, quadratic function form, cubic function form, and so on. For example, we show the curves of these four forms in Figure 5.
In Figure 5, L1, L2, L3, L4, and L5 follow Equations (4)–(8), respectively.
y = x
y = x 1 2
y = x 1 3
y = x 2
y = x 3
From Figure 5, we can easily determine that both L2 and L4 are even functions, while the other curves are odd functions. Considering that the upstream and the downstream fragments played the same role in the selected peptide segments, the even functions were selected for this work; therefore, we utilized three types of functions. The first one was the constant function, whereby all amino-acid residues in the peptide segments have the same influence, as described in Equation (1). The second function followed Equation (2), and the third function followed the Equation (3).
y = c 1
y = a 1 x 1 2 + b 1
y = a 2 x 2 + b 2
where the parameters a1, a2, b1, b2, and c1 were optimized in this work. It was noted that both Equations (1) and (2) could hardly be described as linear functions. Thus, the center of the last two functions was designated as the origin point, i.e., the classified modification sites in the peptide segments. Regions to the left and right part of this origin point were designated as the upstream and the downstream segments, respectively. The influence of each neighboring amino-acid residue is defined below.
According to Equation (1), the relationship between a neighboring amino-acid residue and the central lysine is shown in Equation (9).
i n f l u 1 = [ c 1 , c 1 , , c 1 ]
where influ1 contains 2n + 1 elements in each sample, and c1 is the relationship between each amino-acid residue in the selected peptide segment. In this function, every amino-acid residue has the same influence; thus, the amino-acid composition can be regarded as a special form of this style.
According to Equation (2), the relationship between the neighboring and central residues are shown in Equation (10).
i n f l u 2 = [ a 1 n 1 2 + b 1 , a 1 + b 1 , b 1 , a 1 + b 1 , a 1 n 1 2 + b 1 ]
where influ2 also contains 2n + 1 elements, and each value of influ2 follows the discrete values of Equation (2) and has the range [−n, n].
According to the Equation (3), the relationship between two amino-acid residues is shown in Equation (11).
i n f l u 3 = [ a 1 n 2 + b 1 , a 1 + b 1 , b 1 , a 1 + b 1 , a 1 n 2 + b 1 ]
where he influ3 also contains 2n + 1 elements, and each value of influ3 follows the discrete values of Equation (11) and has the range [−n, n].
After demonstrating the fundamental relationship of amino-acid residues within the classified peptide, the next step was to enrich the related properties of amino-acid residues. In this step, physical, chemical, evolutional, structural, and other related information was enriched using the three styles proposed above.

3.2. Physico-Chemical Properties

Physico-chemical properties are widely and successfully utilized in the identification of protein post-translational modifications, including ubiquitination, phosphorylation, and others [59,60]. These properties can help determine the fundamental characteristics of proteins in several aspects. One of the most well-known and widely utilized databases is AAIndex [61,62], which contains a great deal of physico-chemical and biochemical information for each amino-acid residue and some amino-acid compositions. The latest version of this database describes 544 properties of amino acid residues. Among these properties, following previous efforts and research [62], we selected several of them, which are listed in Table 10.
Considering the abovementioned elements, we minimized the presence of useless information; therefore, the area under the receiver operating characteristic (ROC) curve (AUC) was used to evaluate the measurements in this work.

3.3. Prediction Algorithm

The computational identification of modification sites focuses on classification models in the field of machine learning. In this thesis, we employed machine learning models, including the flexible neural tree. We employed three machine learning methods for the three elements in the classification. The first element involved the bandwidth of the sliding windows in the classified peptide segments, the second involved the parameters of polynomial feature description, and the third involved the selection of different combinations. Therefore, the classification model was designed to deal with these three elements; the detailed outline of this algorithm is demonstrated in Figure 6.
The flexible neural tree (FNT) was proposed by Chen [63,64], and it can be treated as an alternative tree neural network. Therefore, this model can be utilized to deal with the issues of classification and prediction in the field of machine learning. The typical structure of an FNT is shown in Figure 7.
From the above figure, we can easily determine that the model contains three types of layers—the input layer, the hidden layer, and the output layer. The network function of this model is shown in Equations (12) and (13).
n e t w o r k i = j = 1 i ω j × y j
o u t i = f ( m i , n i , n e t w o r k i ) = e ( n e t w o r k i m i n i ) 2
where wj is the weight of the j-th input element, and yj is the j-th element of the input sample. Both mi and ni are parameters in this network.

3.4. Performance Measurements

Some well-known methods exist in the field of machine learning for evaluating performance measurements. In this work, some typical measurements, including sensitivity, specificity, accuracy, F1 scores, and Matthew’s correlation coefficients (MCCs) [65,66], of the identified modification sites were used. Furthermore, the AUC [67] was also employed to test the performance of imbalanced classification problems, whereby the negative sample size was much bigger than the positive sample size.
In this classification problem, samples can be defined as two types—positive samples and negative samples. Positive samples refer to peptide segments where the central lysine is acetylated, while negative samples refer to peptide segments where the central lysine is not. According to the definitions of the classified samples, there can be four outcomes. If a positive sample is classified as true, this can be deemed a true positive (TP). If a positive sample is classified as false, this can be deemed a false positive (FP). Following this concept, a negative sample classified as true is a true negative (TN), and a negative sample classified as false is a false negative (FN). According to the number of TP, TN, FP, and FN, we can easily obtain measures of sensitivity, specificity, accuracy, F1 scores, and MCC.
A c c u r a c y = T P + T N P + N
S e n s i t i v i t y = T P T P + F N
S p e c i f i c i t y = T N T N + F P
F 1 = 2 T P 2 T P + F N + F P
M C C = T P × T N F P × F N ( T P + F P ) ( T P + F N ) ( T N + F P ) ( T N + F N )
where P is the number of positive samples and N is the number of negative samples. Nevertheless, in Equations (14)–(18), there is a lack of intuitiveness, and they can hardly be described as easy to understand for the majority of researchers in the field of biology. The interpretation of MCC in particular is not at all intuitive in this form, although this measurement plays a key role in the evaluation of the classification model’s stability. Therefore, we made use of the concept based on Chou, proposed at the beginning of this century. In this concept, the total number of positive samples can be defined as N+, and the total number of negative samples can be defined as the N. Then, the number of misclassified positive samples can be treated as the N + , and the number of misclassified negative samples can be treated as the N + . With this definition, TP, TN, FP, and FN can be described in Equations (19)–(22).
T P = N + N +
T N = N N +
F P = N +
F N = N +
Thus, the abovementioned measurements can be newly defined as Equations (23)–(27).
A c c u r a c y = 1 N + + N + N + + N
S e n s i t i v i t y = 1 N + N +
S p e c i f i c i t y = 1 N + N
F 1 = 2 ( N + N + ) 2 N + N + + N +
M C C = 1 ( N + N + + N + N ) ( 1 + N + N + N + ) ( 1 + N + N + N )
The interpretations of each performance metric in Equations (23)–(27) are far more intuitive and easier to understand for biological researchers. For instance, when samples can be correctly classified, whereby all positive samples are classified as true and all negative samples are classified as false, we get N + = 0 and N + = 0, and the sensitivity and specificity are both equal to 1. Meanwhile, the accuracy is equal to 1 and MCC is also equal to 1 in such a situation. On the contrary, if all positive samples are classified as false and all negative samples are classified as true, N + and N + are both equal to 1, and the sensitivity and specificity are both equal to 0. Furthermore, the accuracy is equal to 0, and the MCC is equal to −1 in this situation. In a random classification issue, N + = 0.5N and N + = 0.5N+. Thus, the accuracy is equal to 0.5 and MCC is equal to 0 in this situation. This definition method has several advantages [68,69,70,71]; however, utilizing these five measurements can hardly meet required performance in a scenario of imbalanced classification. Therefore, we made use of ROC and precision recall. ROC can be shown by the relationship between the true positive rate (TPR) and the false positive rate (FPR) in the classification. Meanwhile, precision recall can be demonstrated by the relationship between the precision and recall.

4. Conclusions

In this article, we proposed a lysine acetylation site identification with polynomial tree method (LAIPT), making use of the polynomial style to demonstrate the amino-acid residue relationships in peptide segments. The polynomial style was enriched by the physico-chemical properties of amino-acid residues. Then, these reconstructed features were input into the employed classification model, named the flexible neural tree. Finally, some effect evaluation measurements were employed to test the model’s performances. We demonstrated that the three employed species modification sites constituted unique feature descriptions. In the future, we hope to determine more useful forms of feature description and to utilize effective classification models to deal with them. We hope that the algorithm described herein has the ability to deal with other types of protein post-translational modification sites in various species.

Author Contributions

W.B. conceived the study. Z.L. designed the method. B.Y. designed the algorithm. Y.Z. conducted the experiments. W.B. wrote the manuscript. All authors reviewed the manuscript.

Funding

This work was supported by grants from the National Science Foundation of China, Nos. 61702445, 61873270, and a grant from the PhD Programs Foundation of the Ministry of Education of China (No. 20120072110040).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kouzarides, T. Chromatin modifications and their function. Cell 2007, 128, 693–705. [Google Scholar] [CrossRef] [PubMed]
  2. Mann, M.; Jensen, O.N. Proteomic analysis of post-translational modifications. Nat. Biotechnol. 2003, 21, 255–261. [Google Scholar] [CrossRef] [PubMed]
  3. Dai, C.; Gu, W. P53 post-translational modification: Deregulated in tumorigenesis. Trends Mol. Med. 2010, 16, 528–536. [Google Scholar] [CrossRef] [PubMed]
  4. Ruthenburg, A.J.; Li, H.; Patel, D.J.; Allis, C.D. Multivalent engagement of chromatin modifications by linked binding modules. Nat. Rev. Mol. Cell Biol. 2007, 8, 983–994. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Wysocka, J.; Swigut, T.; Xiao, H.; Milne, T.A.; Kwon, S.Y.; Landry, J.; Kauer, M.; Tackett, A.J.; Chait, B.T.; Badenhorst, P. A phd finger of nurf couples histone h3 lysine 4 trimethylation with chromatin remodelling. Nature 2006, 442, 86–90. [Google Scholar] [CrossRef] [PubMed]
  6. Wysocka, J.; Swigut, T.; Milne, T.A.; Dou, Y.; Zhang, X.; Burlingame, A.L.; Roeder, R.G.; Brivanlou, A.H.; Allis, C.D. Wdr5 associates with histone h3 methylated at k4 and is essential for h3 k4 methylation and vertebrate development. Cell 2005, 121, 859–872. [Google Scholar] [CrossRef] [PubMed]
  7. Zeng, L.; Zhou, M. Bromodomain: An acetyl-lysine binding domain. FEBS Lett. 2002, 513, 124–128. [Google Scholar] [CrossRef]
  8. Jenuwein, T.; Allis, C.D. Translating the histone code. Science 2001, 293, 1074–1080. [Google Scholar] [CrossRef]
  9. Marmorstein, R.; Roth, S.Y. Histone acetyltransferases: Function, structure, and catalysis. Curr. Opin. Genet. Dev. 2001, 11, 155–161. [Google Scholar] [CrossRef]
  10. Bode, A.M.; Dong, Z. Post-translational modification of p53 in tumorigenesis. Nat. Rev. Cancer 2004, 4, 793–805. [Google Scholar] [CrossRef]
  11. Walsh, G.; Jefferis, R. Post-translational modifications in the context of therapeutic proteins. Nat. Biotechnol. 2006, 24, 1241–1252. [Google Scholar] [CrossRef] [PubMed]
  12. Janke, C.; Bulinski, J.C. Post-translational regulation of the microtubule cytoskeleton: Mechanisms and functions. Nat. Rev. Mol. Cell Biol. 2011, 12, 773–786. [Google Scholar] [CrossRef] [PubMed]
  13. Xu, Y.; Shao, X.; Wu, L.; Deng, N.; Chou, K. ISNO-AApair: Incorporating amino acid pairwise coupling into PseAAC for predicting cysteine s-nitrosylation sites in proteins. PeerJ 2013, 1, e171. [Google Scholar] [CrossRef] [PubMed]
  14. Qiu, W.; Xiao, X.; Lin, W.; Chou, K. iMethyl-PseAAC: Identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Res. Int. 2014, 2014, 947416. [Google Scholar] [CrossRef] [PubMed]
  15. Xu, Y.; Wen, X.; Shao, X.J.; Deng, N.Y.; Chou, K.C. iHyd-PseAAC: Predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int. J. Mol. Sci. 2014, 15, 7594–7610. [Google Scholar] [CrossRef] [PubMed]
  16. Xu, Y.; Wen, X.; Wen, L.; Wu, L.; Deng, N.; Chou, K. iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS ONE 2014, 9. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, W.; Feng, P.; Ding, H.; Lin, H.; Chou, K. iRNA-Methyl: Identifying N6-methyladenosine sites using pseudo nucleotide composition. Anal. Biochem. 2015, 490, 26–33. [Google Scholar] [CrossRef]
  18. Qiu, W.; Xiao, X.; Lin, W.; Chou, K. iUbiq-Lys: Prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J. Biomol. Struct. Dyn. 2015, 33, 1731–1742. [Google Scholar] [CrossRef]
  19. Chen, W.; Tang, H.; Ye, J.; Lin, H.; Chou, K. iRNA-PseU: Identifying RNA pseudouridine sites. Mol. Ther. Nucleic Acids 2016, 5, e332. [Google Scholar]
  20. Jia, J.; Liu, Z.; Xiao, X.; Liu, B.; Chou, K.C. iCar-PseCp: Identify carbonylation sites in proteins by monte carlo sampling and incorporating sequence coupled effects into general PseAAC. Oncotarget 2016, 7, 34558–34570. [Google Scholar] [CrossRef]
  21. Jia, J.; Zhang, L.; Liu, Z.; Xiao, X.; Chou, K.C. pSumo-CD: Predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC. Bioinformatics 2016, 32, 3133–3141. [Google Scholar] [CrossRef] [PubMed]
  22. Liu, Z.; Xiao, X.; Yu, D.J.; Jia, J.; Qiu, W.R.; Chou, K.C. pRNAm-PC: Predicting N6-methyladenosine sites in RNA sequences via physical–chemical properties. Anal. Biochem. 2016, 497, 60–67. [Google Scholar] [CrossRef] [PubMed]
  23. Qiu, W.R.; Sun, B.Q.; Xiao, X.; Xu, Z.C.; Chou, K.C. iPTM-mLys: Identifying multiple lysine PTM sites and their different types. Bioinformatics 2016, 32, 3116–3123. [Google Scholar] [CrossRef] [PubMed]
  24. Qiu, W.R.; Xiao, X.; Xu, Z.C.; Chou, K.C. iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget 2016, 7, 51270–51283. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  25. Feng, P.; Ding, H.; Yang, H.; Chen, W.; Lin, H.; Chou, K.C. iRNA-PseColl: Identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC. Mol. Ther. Nucleic Acids 2017, 7, 155–163. [Google Scholar] [CrossRef] [PubMed]
  26. Bao, W.; Huang, Z.; Yuan, C.A.; Huang, D.S. Pupylation sites prediction with ensemble classification model. Int. J. Data Min. Bioinform. 2017, 18, 91–104. [Google Scholar] [CrossRef]
  27. Qiu, W.R.; Jiang, S.Y.; Xu, Z.C.; Xiao, X.; Chou, K.C. iRNAm5C-PseDNC: Identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget 2017, 8, 41178–41188. [Google Scholar] [CrossRef]
  28. Qiu, W.R.; Sun, B.Q.; Xiao, X.; Xu, D.; Chou, K.C. iPhos-PseEvo: Identifying human phosphorylated proteins by incorporating evolutionary information into general PseAAC via grey system theory. Mol. Inform. 2017, 36. [Google Scholar] [CrossRef]
  29. Qiu, W.R.; Sun, B.Q.; Xuan, X.; Xu, Z.C.; Jia, J.H.; Chou, K.C. iKcr-PseEns: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics 2017. [Google Scholar] [CrossRef]
  30. Xu, Y.; Wang, Z.; Li, C.; Chou, K.C. iPreny-PseAAC: Identify c-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into PseAAC. Med. Chem. 2017, 13, 544. [Google Scholar] [CrossRef]
  31. Bao, W.; Yuan, C.A.; Zhang, Y.; Han, K.; Nandi, A.K.; Honig, B.; Huang, D.S. Mutli-features predction of protein translational modification sites. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 15, 1453–1460. [Google Scholar] [CrossRef] [PubMed]
  32. Bao, W.; Jiang, Z.; Huang, D.S. Novel human microbe-disease association prediction using network consistency projection. BMC Bioinform. 2017, 18, 543. [Google Scholar] [CrossRef] [PubMed]
  33. Feng, P.; Yang, H.; Ding, H.; Lin, H.; Chen, W.; Chou, K.C. iDNA6mA-PseKNC: Identifying DNA N6-methyladenosine sites by incorporating nucleotide physicochemical properties into PseKNC. Genomics 2018, S0888754318300090. [Google Scholar] [CrossRef] [PubMed]
  34. Khan, Y.D.; Rasool, N.; Hussain, W.; Khan, S.A.; Chou, K.C. iPhosT-PseAAC: Identify phosphothreonine sites by incorporating sequence statistical moments into PseAAC. Anal. Biochem. 2018, 550, 109–116. [Google Scholar] [CrossRef] [PubMed]
  35. Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L.; Chou, K.C. Pse-in-one: A web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015, 43, W65–W71. [Google Scholar] [CrossRef] [PubMed]
  36. Bao, W.; You, Z.H.; Huang, D.S. Cippn: Computational identification of protein pupylation sites by using neural network. Oncotarget 2017, 8, 108867–108879. [Google Scholar] [CrossRef] [PubMed]
  37. Lavecchia, A. Machine-learning approaches in drug discovery: Methods and applications. Drug Discov. Today 2015, 20, 318–331. [Google Scholar] [CrossRef] [PubMed]
  38. Chou, K. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Curr. Top. Med. Chem. 2017, 17, 2337–2358. [Google Scholar] [CrossRef]
  39. András, S.; Jeffrey, S. Efficient prediction of nucleic acid binding function from low-resolution protein structures. J. Mol. Biol. 2006, 358, 922–933. [Google Scholar]
  40. Lin, W.Z.; Fang, J.A.; Xuan, X.; Kuo-Chen, C. iDNA-Prot: Identification of DNA binding proteins using random forest with grey model. PLoS ONE 2011, 6, e24756. [Google Scholar] [CrossRef]
  41. Ma, X.; Guo, J.; Liu, H.; Xie, J.; Sun, X. Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Trans. Comput. Biol. Bioinform. 2012, 9, 1766–1775. [Google Scholar] [CrossRef] [PubMed]
  42. Shi, S.; Qiu, J.; Sun, X.; Suo, S.; Huang, S.; Liang, R. PLMLA: Prediction of lysine methylation and lysine acetylation by combining multiple features. Mol. BioSyst. 2012, 8, 1520–1527. [Google Scholar] [CrossRef] [PubMed]
  43. Gnad, F.; Ren, S.; Choudhary, C.; Cox, J.; Mann, M. Predicting post-translational lysine acetylation using support vector machines. Bioinformatics 2010, 26, 1666–1668. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  44. Li, S.; Li, H.; Li, M.; Shyr, Y.; Xie, L.; Li, Y. Improved prediction of lysine acetylation by support vector machines. Protein Pept. Lett. 2009, 16, 977–983. [Google Scholar] [CrossRef]
  45. Hou, T.; Zheng, G.; Zhang, P.; Jia, J.; Li, J.; Xie, L.; Wei, C.; Li, Y. LAceP: Lysine acetylation site prediction using logistic regression classifiers. PLoS ONE 2014, 9, e89575. [Google Scholar] [CrossRef]
  46. Suo, S.B.; Qiu, J.D.; Shi, S.P.; Sun, X.Y.; Huang, S.Y.; Chen, X.; Liang, R.P. Position-specific analysis and prediction for protein lysine acetylation based on multiple features. PLoS ONE 2012, 7, e49108. [Google Scholar] [CrossRef]
  47. Shao, J.; Xu, D.; Hu, L.; Kwan, Y.W.; Wang, Y.; Kong, X.; Ngai, S.M. Systematic analysis of human lysine acetylation proteins and accurate prediction of human lysine acetylation through bi-relative adapted binomial score bayes feature representation. Mol. BioSyst. 2012, 8, 2964–2973. [Google Scholar] [CrossRef]
  48. Li, Y.; Wang, M.; Wang, H.; Tan, H.; Zhang, Z.; Webb, G.I.; Song, J. Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features. Sci. Rep. 2014, 4, 5765. [Google Scholar] [CrossRef]
  49. Cao, D.; Xu, Q.; Liang, Y. propy: A tool to generate various modes of Chou’s PseAAC. Bioinformatics 2013, 29, 960–962. [Google Scholar] [CrossRef]
  50. Chen, W.; Lin, H.; Chou, K. Pseudo nucleotide composition or PseKNC: An effective formulation for analyzing genomic sequences. Mol. BioSyst. 2015, 11, 2620–2634. [Google Scholar] [CrossRef]
  51. Chou, K. Prediction of signal peptides using scaled window. Peptides 2001, 22, 1973–1979. [Google Scholar] [CrossRef]
  52. Chen, W.; Feng, P.; Lin, H.; Chou, K. iRSpot-PseDNC: Identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 2013, 41. [Google Scholar] [CrossRef] [PubMed]
  53. Cheng, X.; Xiao, X.; Chou, K. pLoc-mPlant: Predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general PseAAC. Mol. BioSyst. 2017, 13, 1722–1727. [Google Scholar] [CrossRef] [PubMed]
  54. Cheng, X.; Xiao, X.; Chou, K. pLoc-mHum: Predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial go information. Bioinformatics 2018, 34, 1448–1456. [Google Scholar] [CrossRef] [PubMed]
  55. Cheng, X.; Zhao, S.; Lin, W.; Xiao, X.; Chou, K. pLoc-mAnimal: Predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 2017, 33, 3524–3531. [Google Scholar] [CrossRef] [PubMed]
  56. Xiao, X.; Cheng, X.; Su, S.; Mao, Q.; Chou, K. pLoc-mGpos: Incorporate key gene ontology information into general PseAAC for predicting subcellular localization of gram-positive bacterial proteins. Nat. Sci. 2017, 09, 330–349. [Google Scholar] [CrossRef]
  57. Xiang, C.; Xuan, X.; Chou, K.C. pLoc-mEuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general PseAAC. Genomics 2017, 110, 50–58. [Google Scholar]
  58. Cheng, X.; Xiao, X.; Chou, K.C. pLoc-mGneg: Predict subcellular localization of Gram-negative bacterial proteins by deep gene ontology learning via general PseAAC. Genomics 2018, 110, 231–239. [Google Scholar] [CrossRef]
  59. Kuo-Chen, C. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. BioSyst. 2013, 9, 1092–1100. [Google Scholar]
  60. Chou, K.C.; Zhang, C.T. Prediction of protein structural classes. CRC Crit. Rev. Biochem. 2008, 30, 275–349. [Google Scholar] [CrossRef]
  61. Xiao, X.; Wang, P.; Chou, K. Quat-2l: A web-server for predicting protein quaternary structural attributes. Mol. Div. 2011, 15, 149–155. [Google Scholar] [CrossRef] [PubMed]
  62. Liu, B.; Fang, L.; Long, R.; Lan, X.; Chou, K.C. Ienhancer-2l: A two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics 2016, 32, 362. [Google Scholar] [CrossRef] [PubMed]
  63. Liu, B.; Yang, F.; Chou, K.C. 2L-piRNA: A two-layer ensemble classifier for identifying piwi-interacting RNAs and their function. Mol. Ther. Nucleic Acids 2017, 7, 267–277. [Google Scholar] [CrossRef] [PubMed]
  64. Liu, B.; Li, K.; Huang, D.S.; Chou, K.C. iEnhancer-EL: Identifying enhancers and their strength with ensemble learning approach. Bioinformatics 2018, 34, 3835–3842. [Google Scholar] [CrossRef] [PubMed]
  65. Liu, B.; Weng, F.; Huang, D.S.; Chou, K.C. iRO-3wPseKNC: Identify DNA replication origins by three-window-based pseknc. Bioinformatics 2018, 34, 3086–3093. [Google Scholar] [CrossRef] [PubMed]
  66. Liu, B.; Yang, F.; Huang, D.S.; Chou, K.C. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 2017, 34, 33–40. [Google Scholar] [CrossRef] [PubMed]
  67. Bao, W.; Chen, Y.; Wang, D. Prediction of protein structure classes with flexible neural tree. Biomed. Mater. Eng. 2014, 24, 3797–3806. [Google Scholar] [PubMed]
  68. Bao, W.; Wang, D.; Chen, Y. Classification of protein structure classes on flexible neutral tree. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 1122–1133. [Google Scholar] [CrossRef]
  69. Chen, Y.; Yang, B.; Dong, J.; Abraham, A. Time-series forecasting using flexible neural tree model. Inf. Sci. 2005, 174, 219–235. [Google Scholar] [CrossRef] [Green Version]
  70. Chen, Y.; Abraham, A.; Yang, B. Hybrid flexible neural-tree-based intrusion detection systems. Int. J. Intell. Syst. 2010, 22, 337–352. [Google Scholar] [CrossRef]
  71. Chen, Y.; Abraham, A.; Yang, B. Feature selection and classification using flexible neural tree. Neurocomputing 2006, 70, 305–313. [Google Scholar] [CrossRef]
Figure 1. Receiver operating characteristic (ROC) curves of Homo sapiens.
Figure 1. Receiver operating characteristic (ROC) curves of Homo sapiens.
Ijms 20 00113 g001
Figure 2. ROC curves of Mus musculus.
Figure 2. ROC curves of Mus musculus.
Ijms 20 00113 g002
Figure 3. ROC Curves of Escherichia coli.
Figure 3. ROC Curves of Escherichia coli.
Ijms 20 00113 g003
Figure 4. Distribution of employed datasets.
Figure 4. Distribution of employed datasets.
Ijms 20 00113 g004
Figure 5. Curves of Equations (4)–(8) in the range [−1, 1].
Figure 5. Curves of Equations (4)–(8) in the range [−1, 1].
Ijms 20 00113 g005
Figure 6. The outline of the lysine acetylation site identification with polynomial tree method (LAIPT).
Figure 6. The outline of the lysine acetylation site identification with polynomial tree method (LAIPT).
Ijms 20 00113 g006
Figure 7. Typical structure of a flexible neural tree (FNT).
Figure 7. Typical structure of a flexible neural tree (FNT).
Ijms 20 00113 g007
Table 1. Performances of different features in E. coli. Sn—sensitivity; Sp—specificity; Acc—accuracy; MCC—Matthew’s correlation coefficient.
Table 1. Performances of different features in E. coli. Sn—sensitivity; Sp—specificity; Acc—accuracy; MCC—Matthew’s correlation coefficient.
FeaturesSn (%)Sp (%)Acc (%)F1MCC
Binary encoding43.3675.8059.580.51750.2026
AA composition64.1452.7958.460.60700.1704
Grouping AA composition41.7876.0458.910.50420.1897
Physico-chemical properties55.5363.9359.730.57960.1953
KNN features64.9455.8560.390.62120.2088
Secondary tendency structure59.9657.4058.680.59200.1737
PSSM51.2069.3960.300.56320.2094
Proposed algorithm73.2674.0173.630.73540.4727
Table 2. Performances of different features in M. musculus.
Table 2. Performances of different features in M. musculus.
FeaturesSn (%)Sp (%)Acc (%)F1MCC
Binary encoding42.2775.3858.820.50650.1871
AA composition63.0552.3757.710.59850.1551
Grouping AA composition40.6975.6258.150.49300.1741
Physico-chemical properties54.4463.5158.970.57020.1802
KNN features63.8555.4359.640.61270.1935
Secondary tendency structure58.8756.9857.920.58320.1585
PSSM50.1168.9759.540.55330.1943
Proposed algorithm73.5274.3173.910.73810.4783
Table 3. Performances of different features in H. sapiens.
Table 3. Performances of different features in H. sapiens.
FeaturesSn (%)Sp (%)Acc (%)F1MCC
Binary encoding43.1474.7458.940.51230.1885
AA composition63.9251.7357.830.60250.1577
Grouping AA composition41.5674.9858.270.49900.1755
Physico-chemical properties55.3162.8759.090.57480.1823
KNN features64.7254.7959.760.61660.1961
Secondary tendency structure59.7456.3458.040.58740.1609
PSSM50.9868.3359.660.55820.1961
Proposed algorithm74.8272.4273.620.73930.4725
Table 4. Performances of different methods in E. coli.
Table 4. Performances of different methods in E. coli.
MethodSn (%)Sp (%)Acc (%)F1MCC
DNABIND [39]65.7867.9766.880.66510.3376
DNAbinder [39]56.8763.7960.330.58910.2071
DBD-Threader [40]22.7994.7158.750.35590.2519
DNA-Prot [40]67.8153.7160.760.63340.2174
iDNA-Prot [40]65.7165.7265.720.65710.3143
DBPPred [41]75.3772.8774.120.08110.4826
PLMLA [42]60.8064.7062.700.47480.2550
Phosida [43]70.6154.9062.700.65470.2580
LysAcet [44]27.5076.5052.000.36420.0450
EnsemblePail [45]27.5066.7047.100.3420-0.0640
PSKAcePred [46]41.2060.8051.000.45680.0200
BRABSB [47]51.0060.8055.900.53630.1180
SSPKA [48]54.9076.5065.700.61550.3210
Proposed algorithm73.2674.0173.630.73540.4727
Table 5. Performances of different methods in M. musculus.
Table 5. Performances of different methods in M. musculus.
MethodSn (%)Sp (%)Acc (%)F1MCC
DNABIND [39]63.3164.5463.930.63700.2785
DNAbinder [39]58.1465.5461.840.60370.2375
DBD-Threader [40]26.5492.4559.500.39590.2525
DNA-Prot [40]69.2158.4363.820.65670.2780
iDNA-Prot [40]69.5466.4568.000.68480.3601
DBPPred [41]78.4574.4576.450.76910.5294
PLMLA [42]51.6051.9051.700.51680.0350
Phosida [43]59.0054.6056.800.57730.1370
LysAcet [44]43.1067.0055.000.48950.1040
EnsemblePail [45]51.1076.2063.500.58430.2820
PSKAcePred [46]51.1065.9058.400.55180.1720
BRABSB [47]63.8058.4061.100.62120.2220
SSPKA [48]64.8066.5065.700.65360.3140
Proposed algorithm73.5274.3173.910.73810.4783
Table 6. Performances of different methods in H. sapiens.
Table 6. Performances of different methods in H. sapiens.
MethodSn (%)Sp (%)Acc (%)F1MCC
DNABIND [39]65.7267.4266.570.66280.3314
DNAbinder [39]57.8766.9762.420.60630.2494
DBD-Threader [40]27.2890.6358.960.39930.2315
DNA-Prot [40]66.7460.7463.740.64800.2753
iDNA-Prot [40]67.5465.7966.670.66950.3334
DBPPred [41]79.7473.8576.800.77460.5368
PLMLA [42]63.0066.3064.800.64060.2960
Phosida [43]55.3058.3056.800.56140.1360
LysAcet [44]50.3061.6055.800.53310.1200
EnsemblePail [45]45.7061.8053.500.49700.0760
PSKAcePred [46]55.3055.8055.600.55440.1110
BRABSB [47]61.2066.3063.700.62800.2750
SSPKA [48]48.2072.5060.000.54870.2140
Proposed algorithm74.8272.4273.620.73930.4725
Table 7. Performance using different bandwidths. SVM—support vector machine; NN—neural network; RF—random forest.
Table 7. Performance using different bandwidths. SVM—support vector machine; NN—neural network; RF—random forest.
SpeciesBandwidthMethodSn (%)Sp (%)Acc (%)F1MCC
H. sapiens21SVM70.3466.5268.430.69020.3689
NN67.6765.7266.700.67020.3340
RF72.3768.6170.490.71030.4101
FNT70.5170.8770.690.70640.4138
23SVM66.4865.8966.180.66280.3237
NN63.8165.0964.450.64220.2890
RF68.5167.9868.240.68330.3649
FNT66.6570.2468.440.67870.3691
25SVM72.2868.4370.360.70920.4075
NN69.6167.6368.620.68930.3725
RF74.3170.5272.420.72930.4487
FNT72.4572.7872.620.72570.4524
27SVM69.3865.9467.660.68210.3534
NN66.7165.1465.920.66190.3185
RF71.4168.0369.720.70220.3946
FNT69.5570.2969.920.69810.3984
29SVM69.7166.3168.010.68540.3604
NN67.0465.5166.270.66530.3255
RF71.7468.4070.070.70560.4016
FNT69.8870.6670.270.70150.4054
31SVM68.3664.2366.300.66980.3262
NN65.6963.4364.560.64960.2913
RF70.3966.3268.360.68990.3674
FNT68.5368.5868.560.68550.3711
M. musculus21SVM69.9969.1169.550.69680.3910
NN67.3265.3166.320.66650.3264
RF71.0270.2070.610.70730.4122
FNT72.3773.3672.870.72730.4573
23SVM69.0967.7068.400.68610.3679
NN66.4263.9065.160.65600.3033
RF70.1268.7969.460.69660.3891
FNT71.4771.9571.710.71640.4342
25SVM71.0669.9570.510.70670.4102
NN68.3966.1567.270.67630.3455
RF72.0971.0471.570.71710.4313
FNT73.4474.2073.820.73720.4764
27SVM70.9869.6770.320.70520.4065
NN68.3165.8767.090.67490.3419
RF72.0170.7671.380.71560.4277
FNT73.3673.9273.640.73560.4728
29SVM70.3769.5269.950.70070.3989
NN67.7065.7266.710.67040.3343
RF71.4070.6171.010.71120.4201
FNT72.7573.7773.260.73120.4652
31SVM70.2369.4769.850.69970.3970
NN67.5665.6766.620.66930.3324
RF71.2670.5670.910.71010.4182
FNT72.6173.7273.170.73020.4634
E. coli21SVM68.5368.9268.730.68670.3745
NN65.8665.1265.490.65620.3098
RF68.5670.0169.290.69060.3858
FNT70.7072.2771.490.71260.4298
23SVM70.8269.5670.190.70380.4038
NN68.1565.7666.960.67350.3392
RF70.8570.6570.750.70780.4150
FNT72.9972.9172.950.72960.4590
25SVM71.0369.8670.450.70620.4090
NN68.3666.0667.210.67580.3443
RF71.0670.9571.010.71020.4201
FNT73.2073.2173.210.73200.4641
27SVM70.3669.6069.980.70090.3996
NN67.6965.8066.740.67060.3349
RF70.3970.6970.540.70490.4108
FNT72.5372.9572.740.72680.4548
29SVM70.1269.2869.700.69820.3939
NN67.4565.4866.460.66790.3293
RF70.1570.3770.260.70220.4051
FNT72.2972.6372.460.72410.4491
31SVM69.5668.8169.180.69300.3837
NN66.8965.0165.950.66270.3190
RF69.5969.9069.740.69700.3949
FNT71.7372.1671.940.71880.4389
Table 8. Performance of different functions.
Table 8. Performance of different functions.
SpeciesFunctionMethodSn (%)Sp (%)Acc (%)F1MCC
H. sapiensEquation (1)SVM72.2868.4370.360.70910.4074
NN69.6167.6368.620.68930.3725
RF72.2868.4370.360.70910.4074
FNT74.3170.5272.4150.72930.4486
Equation (2)SVM73.1168.9071.000.71600.4205
NN70.4468.1069.270.69630.3855
RF73.1168.9071.000.71600.4205
FNT75.1470.9973.060.73610.4617
Equation (3)SVM72.7970.3371.560.71910.4313
NN70.1269.5369.820.69910.3965
RF72.7970.3371.560.71910.4313
FNT74.8272.4273.620.73930.4725
M. musculusEquation (1)SVM71.0669.9570.510.70670.4101
NN68.3966.1567.270.67630.3455
RF72.0971.0471.570.71710.4313
FNT73.4474.2073.820.73720.4764
Equation (2)SVM71.1470.0670.600.70760.4120
NN68.4766.2667.360.67720.3474
RF72.1771.1571.660.71800.4332
FNT73.5274.3173.910.73810.4783
Equation (3)SVM71.1670.6270.890.70970.4179
NN68.4966.8267.660.67930.3532
RF72.1971.7171.950.72020.4391
FNT73.5474.8774.210.74040.4842
E. coliEquation (1)SVM70.8269.5670.190.70380.4038
NN68.1565.7666.960.67350.3392
RF70.8570.6570.750.70780.4150
FNT72.9972.9172.950.72960.4590
Equation (2)SVM71.3270.0970.700.70880.4141
NN68.6566.2967.470.67850.3495
RF71.3571.1871.260.71290.4253
FNT73.4973.4473.460.73470.4693
Equation (3)SVM71.0970.6670.870.70940.4175
NN68.4266.8667.640.67890.3528
RF71.1271.7571.430.71340.4287
FNT73.2674.0173.630.73540.4727
Table 9. Detailed information of employed datasets.
Table 9. Detailed information of employed datasets.
DatasetProtein SequencesPositive SamplesNegative Samples
General15,57555,773863,365
Homo sapiens103210,29943,373
Mus musculus34334419788
Escherichia coli14320051528
Table 10. Selected properties from the AAIndex database. AA—amino acid.
Table 10. Selected properties from the AAIndex database. AA—amino acid.
NumberAAIndex IDName of Properties
1CHOP780207Normalized frequency of C-terminal non-helical region
2DAYM780201Relative mutability
3EISD860102Atom-based hydrophobic moment
4FAUJ880108Localized electrical effect
5FAUJ880111Positive charge
6FINA910103Helix termination parameter at position j-2, j-1, j
7JANJ780101Average accessible surface area
8KARP850103Flexibility parameter for two rigid neighbors
9KLEP840101Net charge
10KRIW710101Side-chain interaction parameter
11KRIW790102Fraction of site occupied by water
12NAKH920103AA composition of ejecta of single-spanning proteins
13QIAN880101Weights for alpha-helix at the window position of −6

Share and Cite

MDPI and ACS Style

Bao, W.; Yang, B.; Li, Z.; Zhou, Y. LAIPT: Lysine Acetylation Site Identification with Polynomial Tree. Int. J. Mol. Sci. 2019, 20, 113. https://doi.org/10.3390/ijms20010113

AMA Style

Bao W, Yang B, Li Z, Zhou Y. LAIPT: Lysine Acetylation Site Identification with Polynomial Tree. International Journal of Molecular Sciences. 2019; 20(1):113. https://doi.org/10.3390/ijms20010113

Chicago/Turabian Style

Bao, Wenzheng, Bin Yang, Zhengwei Li, and Yong Zhou. 2019. "LAIPT: Lysine Acetylation Site Identification with Polynomial Tree" International Journal of Molecular Sciences 20, no. 1: 113. https://doi.org/10.3390/ijms20010113

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop