Elsevier

Methods

Volume 203, July 2022, Pages 32-39
RFhy-m2G: Identification of RNA N2-methylguanosine modification sites based on random forest and hybrid features

https://doi.org/10.1016/j.ymeth.2021.05.016

Highlights

  • A novel method was proposed to identify RNA m2G sites using hybrid features.

  • The over-sample method SMOTE was adopted to deal with the problem of data imbalance.

  • After using MRMD to select features, the performance of the model is improved.

  • RFhy-m2G outperforms other methods and can effectively identify m2G sites.

Abstract

N2-methylguanosine (m2G) is a post-transcriptional modification of RNA that is found in eukaryotes and archaea. The biological function of m2G modification discovered so far is to control and stabilize the three-dimensional structure of tRNA and the dynamic barrier of reverse transcription. To discover additional biological functions of m2G, it is necessary to develop time-saving and labor-saving computational tools to identify m2G sites. In this paper, a novel predictor, RFhy-m2G, based on hybrid features and a random forest, was developed to identify m2G modification sites in three species. The hybrid feature used by the predictor fuses three feature encodings: ENAC, PseDNC, and NPPS. Together, these encodings capture primary sequence derivation properties, physicochemical properties, and position-specific properties. Because the hybrid features contain redundant features, MRMD2.0 is used for optimal feature selection. Feature analysis shows that the optimal hybrid features still contain all three kinds of properties, and that the hybrid features identify m2G modification sites more accurately and improve prediction performance. When the prediction model was evaluated with five-fold cross-validation and independent testing, the accuracies obtained were 0.9982 and 0.9417, respectively. The robustness of the predictor is demonstrated by comparisons with other predictors.

Introduction

RNA post-transcriptional modification is a research hotspot at the intersection of epigenetics and bioinformatics. This type of modification refers to the addition of chemical groups to the four bases on some ribonucleotides or to local structural changes in the RNA sequence [1], [2], [3], [4]. To date, more than 160 types of RNA modifications have been verified experimentally, which involve 13 types and are mainly included in the RMBase database [5]. Among the RNA modifications, N1-methyladenosine (m1A), N6-methyladenosine (m6A), pseudouridine (Ψ), 5-methylcytosine (m5C), and 2′-O-methylation (2′-O-Me) have been studied more frequently [6], [7]. In addition, other types of RNA modifications, such as N2-methylguanosine (m2G), N7-methylguanosine (m7G), and dihydrouridine modification (D), have also been identified. N2-methylguanosine has been found by biologists in the tRNA of eukaryotes and archaea and is formed by amino methylation at the guanine C-2 site catalyzed by rRNA guanine-(N2)-methyltransferase [8], [9]. Previous research has indicated that m2G plays an important role in biological processes [10]. For example, tRNA can control and stabilize the tertiary structure by pairing with other bases to form an interaction [11], [12]. Because relatively little data on m2G modification is available, research on m2G has not been able to explore its biological functions in greater depth. Therefore, to determine the additional biological functions of m2G modification sites, it is important to develop time-saving computational tools to identify m2G modification sites.

At present, the only prediction tool that has been developed for m2G modification sites is iRNA-m2G [13], which is used to identify RNA m2G modification sites in the eukaryotic transcriptome. That method uses the jackknife test and an independent dataset to evaluate the stability of the constructed model, and it considers only the chemical properties of nucleotides and cumulative nucleotide frequencies when encoding RNA-modified sequences. Although the predictor performs well, its training and testing datasets share the same positive sample data, which affects the performance of the prediction model. For this reason, a novel predictor, RFhy-m2G, based on hybrid features and a random forest, was developed to predict m2G modification sites. The feature encoding methods are ENAC, PseDNC, and NPPS. These encoding methods capture three kinds of properties: primary sequence derivation properties, physicochemical properties, and position-specific properties. To address data imbalance, the synthetic minority oversampling technique (SMOTE) is used to process the positive and negative sample data and eliminate the impact of the imbalance on the prediction model. In addition, the feature selection method MRMD2.0 was adopted to select the optimal feature from the hybrid feature ENAC + PseDNC + NPPS. Feature analysis was performed on the optimal feature, and the obtained optimal feature still contains all three kinds of properties. The process framework of RFhy-m2G is shown in Fig. 1.
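As an illustration of two of the steps named above, the following minimal Python sketch shows how an ENAC-style encoding and a SMOTE-style oversampling step might look. It is not the authors' implementation. The function names `enac` and `smote_like`, the window size of 5, the nucleotide order ACGU, and the neighbour count `k` are assumptions made for this example. ENAC is taken here in its usual sense: nucleotide frequencies computed in a fixed-length window that slides from the 5′ to the 3′ end of the sequence.

```python
import random
from collections import Counter

ALPHABET = "ACGU"  # assumed nucleotide order for the feature vector


def enac(seq, window=5):
    """ENAC-style encoding: nucleotide frequencies in each sliding window.

    The window size of 5 is an assumption, not a value stated in the text.
    Returns a flat list of length 4 * (len(seq) - window + 1).
    """
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    if len(seq) < window:
        raise ValueError("sequence is shorter than the window")
    features = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        features.extend(counts[base] / window for base in ALPHABET)
    return features


def smote_like(minority, n_new, k=3, seed=0):
    """SMOTE-style oversampling (hypothetical helper for illustration).

    Each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    # Toy sequences, invented for the example only.
    positives = [enac(s) for s in ("ACGUACG", "GGGUACG", "ACGGGCG")]
    extra = smote_like(positives, n_new=4, k=2)
    print(len(positives[0]), len(extra))
```

For a five-nucleotide input such as `enac("ACGUA")` there is a single window, so the result is `[0.4, 0.2, 0.2, 0.2]`. In practice, an established SMOTE implementation and a tested feature-extraction toolkit would be preferable to hand-written helpers like these.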

Section snippets

Datasets preparation

The m2G modification site data used in this study were obtained from Chen et al. [13]. These data include 146 sequences of positive samples (i.e., m2G-site-containing sequences): H. sapiens 46, M. musculus 30, and S. cerevisiae 67. The negative samples (i.e., non-m2G-site-containing sequences) consisted of 1389 sequences: H. sapiens 601, M. musculus 474, and S. cerevisiae 314. By using CD-Hit [14], [15] to delete homologous sequences, a high-quality dataset was obtained. The positive samples

Performance of single feature and classifier optimization

To extract additional information about RNA m2G modification sites, we selected the feature encoding methods ENAC, PseDNC, and NPPS, which capture primary sequence properties, physicochemical properties, and position-specific information, for the feature extraction part. To screen for the best classifier, we chose commonly used classifiers such as decision tree (DT), random forest (RF), logistic regression (LR), and K-nearest neighbor (KNN) classifiers for m2G modification site prediction. In this

Conclusion

To predict RNA m2G modification sites, we developed the predictor RFhy-m2G, based on hybrid features and RF, which can accurately identify m2G modification sites from unknown RNA modification sequences. Among the obtained data, there are comparatively few positive samples (m2G modification sites). SMOTE was adopted to address the data imbalance and to construct an accurate and robust prediction tool. To find the best features, we first chose the feature encoding methods ENAC, PseDNC, and NPPS,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Natural Science Foundation of China (No. 62072353, No. 61922020, and No. 61672406) and the Fundamental Research Funds for the Central Universities [No. JB180307].

References (87)

  • A. Khan et al.

    Detecting N6-methyladenosine sites from RNA transcriptomes using random forest

    J. Comput. Sci.

    (2020)
  • M.M. Hasan et al.

    i4mC-ROSE, a bioinformatics tool for the identification of DNA N4-methylcytosine sites in the Rosaceae genome

    Int. J. Biol. Macromol.

    (2020)
  • L. Chen et al.

    Investigating the gene expression profiles of cells in seven embryonic stages with machine learning algorithms

    Genomics

    (2020)
  • A. Tybout et al.

    Analysis of variance

    J. Consumer Psychol.

    (2001)
  • Y.-H. Zhang et al.

    Determining protein–protein functional associations by functional rules based on gene ontology and KEGG pathway

    Biochim. Biophys. Acta (BBA) – Proteins and Proteomics

    (2021)
  • L. Wei et al.

    Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier

    Artif. Intell. Med.

    (2017)
  • L. Wei et al.

    A novel hierarchical selective ensemble classifier with bioinformatics application

    Artif. Intell. Med.

    (2017)
  • L. Wei et al.

    Prediction of human protein subcellular localization using deep learning

    J. Parallel Distrib. Comput.

    (2018)
  • T.M. Carlile, M.F. Rojas-Duran, W.V. Gilbert, Pseudo-Seq: Genome-Wide Detection of Pseudouridine Modifications in RNA....
  • S. Li, C.E. Mason, The Pivotal Regulatory Landscape of RNA Modifications. In: Annual Review of Genomics and Human...
  • C. Qi, P. Wang, T. Fu, M. Lu, Y. Cai, X. Chen, Cheng L: A comprehensive review for gut microbes: technologies,...
  • B. Xu et al.

    Multi-substrate selectivity based on key loops and non-homologous domains: new insight into ALKBH family

    Cell. Mol. Life Sci.

    (2021)
  • J.-J. Xuan et al.

    RMBase v2.0: deciphering the map of RNA modifications from epitranscriptome sequencing data

    Nucleic Acids Res.

    (2018)
  • K. Liu, W. Chen, iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics...
  • P.V. Sergiev et al.

    Ribosomal RNA guanine-(N2)-methyltransferases and their targets

    Nucleic Acids Res.

    (2007)
  • A. Schneider et al.

    Structural requirements for enzymatic activities of foamy virus protease-reverse transcriptase

    Proteins-Struct. Funct. Bioinf.

    (2014)
  • P.A. Limbach et al.

    The modified nucleosides of RNA – summary

    Nucleic Acids Res.

    (1994)
  • L. Fu et al.

    CD-HIT: accelerated for clustering the next-generation sequencing data

    Bioinformatics

    (2012)
  • Y. Zhu et al.

    Computational identification of eukaryotic promoters based on cascaded deep capsule neural networks

    Briefings Bioinf.

    (2020)
  • W.-J. Sun et al.

    RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data

    Nucleic Acids Res

    (2016)
  • M. Sprinzl et al.

    Compilation of tRNA sequences and sequences of tRNA genes

    Nucleic Acids Res

    (2005)
  • P.P. Chan et al.

    GtRNAdb 2.0: an expanded database of transfer RNA genes identified in complete and draft genomes

    Nucleic Acids Res

    (2016)
  • Z. Chen et al.

    iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data

    Brief. Bioinf.

    (2020)
  • Z. Chen et al.

    iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization

    Nucleic Acids Res

    (2021)
  • Y. Zuo et al.

    PseKRAAC: a flexible web server for generating pseudo K-tuple reduced amino acids composition

    Bioinformatics

    (2017)
  • L. Zheng et al.

    RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou's five-step rule

    Database (Oxford)

    (2019)
  • W. Chen et al.

    iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition

    Nucleic Acids Res.

    (2013)
  • J. Yerushalmy

    Statistical problems in assessing methods of medical diagnosis, with special reference to X-ray techniques

    Public Health Rep. (1896–1970)

    (1947)
  • L. Zhang et al.

    DNN-m6A: a cross-species method for identifying RNA N6-methyladenosine sites based on deep neural network with multi-information fusion

    Genes

    (2021)
  • Z. Chen et al.

    Comprehensive review and assessment of computational methods for predicting RNA post-transcriptional modification sites from RNA sequences

    Briefings Bioinf.

    (2020)
  • P. Xing et al.

    Identifying N-6-methyladenosine sites using multi-interval nucleotide pair position specificity and support vector machine

    Sci. Rep.

    (2017)
  • X. Wang et al.

    RFAthM6A: a new tool for predicting m(6)A sites in Arabidopsis thaliana

    Plant Mol. Biol.

    (2018)
  • Y. Ding et al.

    Identification of drug-target interactions via fuzzy bipartite local model

    Neural Comput. Appl.

    (2020)