RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features
Introduction
RNA post-transcriptional modification is an active research topic at the intersection of epigenetics and bioinformatics. This type of modification refers to the addition of chemical groups to the four bases on some ribonucleotides or to local structural changes in the RNA sequence [1], [2], [3], [4]. To date, more than 160 types of RNA modification have been verified experimentally; the RMBase database [5] mainly covers 13 of these types. Among the RNA modifications, N1-methyladenosine (m1A), N6-methyladenosine (m6A), pseudouridine (Ψ), 5-methylcytosine (m5C), and 2′-O-methylation (2′-O-Me) have been studied most frequently [6], [7]. Other types of RNA modification, such as N2-methylguanosine (m2G), N7-methylguanosine (m7G), and dihydrouridine (D), have also been identified. N2-methylguanosine has been found in the tRNA of eukaryotes and archaea and is formed by methylation of the amino group at the guanine C-2 position, catalyzed by rRNA guanine-(N2)-methyltransferase [8], [9]. Previous research has indicated that m2G plays an important role in biological processes [10]. For example, in tRNA it can help control and stabilize the tertiary structure by pairing with other bases [11], [12]. Because relatively little data on m2G modification is available, research has not yet been able to explore its biological functions in depth. Therefore, to uncover additional biological functions of m2G modification sites, it is important to develop time-saving computational tools for identifying them.
At present, the only prediction tool that has been developed for m2G modification sites is iRNA-m2G [13], which identifies RNA m2G modification sites in the eukaryotic transcriptome. It uses the jackknife test and an independent dataset to evaluate the stability of the constructed model, and it considers only the chemical properties of nucleotides and cumulative nucleotide frequencies when encoding RNA sequences. Although the predictor performs well, its training and testing datasets share the same positive samples, which affects the performance of the prediction model. To address this, we developed a novel predictor, RFhy-m2G, based on hybrid features and a random forest, to predict m2G modification sites. The feature encoding methods are ENAC, PseDNC, and NPPS. Together, these methods capture three kinds of properties: primary sequence derivation properties, physicochemical properties, and position-specific properties. To handle data imbalance, the synthetic minority oversampling technique (SMOTE) is applied to the positive and negative samples to eliminate the impact of the imbalance on the prediction model. In addition, the feature selection method MRMD2.0 was adopted to select the optimal feature subset from the hybrid feature set ENAC + PseDNC + NPPS. Feature analysis of the optimal subset showed that it still covers all three kinds of properties. The process framework of RFhy-m2G is shown in Fig. 1.
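To make two of the steps named above more concrete, the following is a minimal sketch of an ENAC-style encoding and a SMOTE-style oversampling step, written in plain Python. It is an illustration under stated assumptions, not the implementation used for RFhy-m2G: the window size of 5, the neighbour count k = 3, the function names enac_encode and smote_oversample, and the toy sequences are all hypothetical choices made here. PseDNC, NPPS, MRMD2.0, and the random forest itself are not covered.

```python
import random
from typing import List, Optional, Sequence

NUCLEOTIDES = "ACGU"


def enac_encode(seq: str, window: int = 5) -> List[float]:
    """ENAC-style encoding: slide a window along the sequence and record
    the frequency of A, C, G and U inside each window.
    The window size of 5 is an assumption, not a value from the paper."""
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    features: List[float] = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        features.extend(chunk.count(n) / window for n in NUCLEOTIDES)
    return features


def smote_oversample(
    minority: Sequence[Sequence[float]],
    n_new: int,
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """SMOTE-style oversampling: each synthetic sample is a random point on
    the line segment between a minority sample and one of its k nearest
    minority neighbours (squared Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Candidate neighbours: every other minority sample, nearest first.
        others = [v for v in minority if v is not base]
        others.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(base, v)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Made-up, equal-length toy sequences; not taken from the benchmark data.
    positives = ["ACGUGACGU", "UUGCGAACG", "GGAUGCCAU"]
    encoded = [enac_encode(s) for s in positives]
    extra = smote_oversample(encoded, n_new=4)
    print(len(encoded[0]), len(extra))
```

Because every synthetic vector is an interpolation between two encoded positive samples, the oversampled set stays within the region spanned by the real positives. In practice an established implementation, such as the SMOTE class in the imbalanced-learn package, would normally be preferred over a hand-written version like this one.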
Dataset preparation
The m2G modification site data used in this study were obtained from Chen et al. [13]. These data include 146 sequences of positive samples (i.e., m2G-site-containing sequences): H. sapiens 46, M. musculus 30, and S. cerevisiae 67. The negative samples (i.e., non-m2G-site-containing sequences) consisted of 1389 sequences: H. sapiens 601, M. musculus 474, and S. cerevisiae 314. Homologous sequences were removed with CD-Hit [14], [15] to obtain a high-quality dataset. The positive samples
Performance of single feature and classifier optimization
To extract additional information about RNA m2G modification sites, we selected the feature encoding methods ENAC, PseDNC, and NPPS for feature extraction; together they capture primary sequence properties, physicochemical properties, and position-specific information. To screen for the best classifier, we compared commonly used classifiers, namely decision tree (DT), random forest (RF), logistic regression (LR), and K-nearest neighbor (KNN), for m2G modification site prediction. In this
Conclusion
To predict RNA m2G modification sites, we developed the predictor RFhy-m2G, based on hybrid features and RF, which can accurately identify m2G modification sites in unknown RNA sequences. In the data obtained, positive samples (m2G modification sites) are scarce, so SMOTE was adopted to address the data imbalance and to build a robust prediction tool. To find the best features, we first chose the feature encoding methods ENAC, PseDNC, and NPPS,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Natural Science Foundation of China (No. 62072353, No. 61922020, and No. 61672406) and the Fundamental Research Funds for the Central Universities (No. JB180307).
References (87)
- et al. iRNA-PseKNC(2methyl): Identify RNA 2′-O-methylation sites by convolution neural network and Chou's pseudo components. J. Theor. Biol. (2019)
- et al. Posttranscriptionally modified nucleosides in transfer-RNA – their locations and frequencies. Biochimie (1995)
- et al. MD simulation studies to investigate iso-energetic conformational behaviour of modified nucleosides m(2)G and m(2) 2G present in tRNA. Computat. Struct. Biotechnol. J. (2013)
- et al. iRNA-m2G: Identifying N-2-methylguanosine sites based on sequence-derived information. Mol. Therapy-Nucleic Acids (2019)
- et al. RNA-MethylPred: a high-accuracy predictor to identify N6-methyladenosine in RNA. Anal. Biochem. (2016)
- et al. Identification of membrane protein types via multivariate information fusion with Hilbert-Schmidt Independence Criterion. Neurocomputing (2020)
- et al. Identification of drug-target interactions via dual Laplacian regularized least squares with multiple kernel fusion. Knowl.-Based Syst. (2020)
- et al. Inspector: a lysine succinylation predictor based on edited nearest-neighbor undersampling and adaptive synthetic oversampling. Anal. Biochem. (2020)
- et al. Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net. Anal. Biochem. (2020)
- et al. Computational methods for identifying similar diseases. Mol. Therapy Nucl. Acids (2019)
- Detecting N6-methyladenosine sites from RNA transcriptomes using random forest. J. Comput. Sci.
- i4mC-ROSE, a bioinformatics tool for the identification of DNA N4-methylcytosine sites in the Rosaceae genome. Int. J. Biol. Macromol.
- Investigating the gene expression profiles of cells in seven embryonic stages with machine learning algorithms. Genomics
- Analysis of variance. J. Consumer Psychol.
- Determining protein–protein functional associations by functional rules based on gene ontology and KEGG pathway. Biochim. Biophys. Acta (BBA) – Proteins and Proteomics
- Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier. Artif. Intell. Med.
- A novel hierarchical selective ensemble classifier with bioinformatics application. Artif. Intell. Med.
- Prediction of human protein subcellular localization using deep learning. J. Parallel Distrib. Comput.
- Multi-substrate selectivity based on key loops and non-homologous domains: new insight into ALKBH family. Cell. Mol. Life Sci.
- RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data. Nucleic Acids Res.
- Ribosomal RNA guanine-(N2)-methyltransferases and their targets. Nucleic Acids Res.
- Structural requirements for enzymatic activities of foamy virus protease-reverse transcriptase. Proteins-Struct. Funct. Bioinf.
- The modified nucleosides of RNA – summary. Nucleic Acids Res.
- CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics
- Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks. Briefings Bioinf.
- RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data. Nucleic Acids Res.
- Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res.
- GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes. Nucleic Acids Res.
- iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief. Bioinf.
- iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res.
- PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition. Bioinformatics
- RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou's five-step rule. Database (Oxford)
- iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res.
- Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques. Public Health Rep. (1896–1970)
- DNN-m6A: a cross-species method for identifying RNA N6-methyladenosine sites based on deep neural network with multi-information fusion. Genes
- Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences. Briefings Bioinf.
- Identifying N-6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine. Sci. Rep.
- RFAthM6A: a new tool for predicting m(6)A sites in Arabidopsis thaliana. Plant Mol. Biol.
- Identification of drug-target interactions via fuzzy bipartite local model. Neural Comput. Appl.