
SCMFMDA: Predicting microRNA-disease associations based on similarity constrained matrix factorization

Abstract

miRNAs are small non-coding RNAs involved in many complex biological processes, and numerous studies have suggested that they are closely associated with many human diseases. In this study, we proposed a computational model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association Prediction (SCMFMDA). To combine different disease and miRNA similarity data effectively, we applied the similarity network fusion algorithm to obtain an integrated disease similarity (composed of disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity) and an integrated miRNA similarity (composed of miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity). In addition, L2 regularization terms and similarity constraint terms were added to the traditional Nonnegative Matrix Factorization algorithm to predict disease-related miRNAs. SCMFMDA achieved AUCs of 0.9675 and 0.9447 under global leave-one-out cross validation and five-fold cross validation, respectively. Furthermore, case studies on two common human diseases were carried out to demonstrate the prediction accuracy of SCMFMDA. Of the top 50 predicted miRNAs, 47 (colon neoplasms) and 46 (lung neoplasms) were confirmed by experimental reports, indicating that SCMFMDA is effective for predicting relationships between miRNAs and diseases.

Author summary

Numerous studies have suggested that miRNAs are closely associated with many human diseases, so predicting potential miRNA-disease associations can contribute to the diagnosis and treatment of disease. Computational models for discovering unknown miRNA-disease associations make this search more productive and efficient. We proposed SCMFMDA to obtain more accurate predictions by applying similarity network fusion to fuse multi-source disease and miRNA information and by using similarity constrained matrix factorization to make predictions from this biological information. Global leave-one-out cross validation and five-fold cross validation were applied to evaluate our model. SCMFMDA achieved AUCs of 0.9675 and 0.9447, respectively, which were higher than those of the previous computational models we compared. Furthermore, we carried out case studies on two important human diseases, colon neoplasms and lung neoplasms, in which 47 and 46 of the top 50 predicted miRNAs, respectively, were confirmed by experimental reports. These results indicate that SCMFMDA can be regarded as an effective way to discover unverified miRNA-disease associations.

Introduction

MicroRNAs (miRNAs) are non-coding RNAs of 17-24 nt that play a pivotal part in controlling gene expression through RNA cleavage or translational repression [1–3]. Lin-4, the first miRNA identified experimentally, was reported by Lee et al. [4] in 1993. Since then, a large number of miRNAs have been discovered experimentally [4,5] in a variety of species, including viruses, animals and plants [6]. Because miRNAs regulate the expression of a large number of target genes, the miRNA pathway as a whole plays a key role in the control of gene expression [7–9]. miRNAs are involved in several crucial biological processes, such as cell development, cell differentiation and cell proliferation [10]. Dysregulation of miRNAs can result in developmental defects and is also associated with the progression of diseases [11]. Numerous studies have likewise indicated that miRNAs are connected with a series of human neoplasms, including lung neoplasms [12] and prostate neoplasms [13]. Identifying disease-associated miRNAs can therefore deepen our understanding of the genetic causes of complex diseases. Many connections between miRNAs and diseases have been found by traditional experiments in the past few years [14,15]. Such experiments can establish connections between miRNAs and diseases, but they are time-consuming and laborious and have a high failure rate. Effective and stable computational methods are therefore needed to reveal potential miRNA-disease relationships, as they can supply increasingly reliable candidate connections [16].

In recent years, many computational algorithms and methods have been applied to predict potential miRNA-disease relationships [17,18]. For example, Jiang et al. [19] proposed a model that used the human phenome-microRNAome network to predict potential interactions between functionally similar miRNAs and phenotypically similar diseases. However, its predictive performance was lower than expected because the miRNA-target associations it relied on had high false positive and false negative rates. Later, the WBSMDA model [20] introduced Gaussian interaction profile similarity to enrich the similarity information of miRNAs and diseases. WBSMDA could also predict potential relationships for new miRNAs and new diseases without any verified associations. CMFMDA [21] applied collaborative matrix factorization to predict miRNA-disease relationships and could also exploit abundant biological information to uncover unknown interactions. EGBMMDA [22] used decision tree learning to discover novel miRNA-disease interactions by integrating verified miRNA-disease connections, miRNA functional similarity and disease semantic similarity; an informative feature vector was constructed from multiple measures to train a regression tree under the gradient boosting framework. Zhao et al. [23] applied adaptive boosting to identify unverified miRNA-disease associations in the ABMDA model, using k-means clustering on the negative samples to perform random sampling and thereby balance the positive and negative samples. The BHCMDA model [24] used the biased heat conduction (BHC) algorithm to predict unknown connections between miRNAs and diseases by combining the miRNA similarity matrix, the disease similarity matrix and the miRNA-disease association matrix. The IMIPMF model [25] used the probabilistic matrix factorization (PMF) algorithm to infer potential miRNA-disease interactions. Because PMF is widely used in recommender systems, it could make effective use of all available information to recommend miRNAs that are strongly associated with a given disease.

More recently, methods based on random walk have been proposed and have yielded more accurate predictions. Chen et al. [26] used the random walk with restart algorithm to construct the RWRMDA model. Because prediction based on global network similarity performed better than prediction based on local network similarity [27,28], RWRMDA employed global network similarity to identify likely interactions between miRNAs and diseases. Unfortunately, RWRMDA could not be applied to diseases without known associated miRNAs. Shi et al. [29] devised a model that exploited the functional links between human disease genes and miRNA targets, applying a random walk algorithm and a global network distance measure to search for likely miRNA-disease relationships. Liu et al. [30] also used the random walk with restart algorithm to improve prediction, running it on a heterogeneous graph built from disease similarity and miRNA similarity. Luo et al. [31] employed an imbalanced bi-random walk method on a heterogeneous network containing miRNA and disease information to identify likely miRNA-disease interactions. Niu et al. [32] applied random walk with restart to extract miRNA features from an integrated miRNA similarity network in the RWBRMDA model; these features were then used by a binary logistic regression algorithm to predict potential miRNA-disease associations.

To obtain reliable and accurate predictions, machine learning-based methods have increasingly been used to predict unknown miRNA-disease associations. For instance, the RBMMMDA model [33] used a restricted Boltzmann machine to predict multiple types of miRNA-disease association, so it could identify not only novel associations between miRNAs and diseases but also the corresponding association types. The PBMDA model [34] constructed a heterogeneous graph consisting of different interlinked sub-graphs and adopted a depth-first search algorithm to find potential miRNA-disease associations; it could serve as a useful computational tool to accelerate the prediction of miRNA-disease interactions. The DNRLMF-MDA model [35] integrated dynamic neighborhood regularization with logistic matrix factorization: logistic matrix factorization was applied to model the association probability between miRNAs and diseases, and dynamic neighborhood regularization was then used to improve predictive performance. Peng et al. [36] proposed the MDA-CNN model for identifying miRNA-disease connections. miRNA-disease interaction features were first captured by a three-layer network, an auto-encoder was then employed to identify the salient miRNA-disease feature combinations, and a convolutional neural network used the reduced feature representations to predict the final results. Zheng et al. proposed the machine learning-based model MLMDA [37] to predict unknown miRNA-disease relationships. A k-mer sparse matrix was used to extract miRNA sequence information, which was then integrated with miRNA and disease similarity information to construct feature vectors. A deep auto-encoder neural network (AE) and a random forest classifier used these feature vectors to calculate the prediction probability. The NCMCMDA model [38] integrated a neighborhood constraint with the matrix completion algorithm, turning the recovery task into an optimization problem, and applied the fast iterative shrinkage-thresholding algorithm to recover missing interactions between miRNAs and diseases. Zhang et al. [39] proposed the computational model MSFSP to predict miRNA-disease interactions more accurately. MSFSP first integrated various similarity measures to construct the miRNA and disease similarity matrices. These matrices and the verified miRNA-disease association matrix were then used to build a weighted network of miRNA-disease connections, and the final prediction labels were calculated by weighting the miRNA and disease space projection scores. Ji et al. [40] proposed the SVAEMDA model to infer more disease-related miRNAs. It used miRNA similarity and disease similarity to obtain representations of miRNAs and diseases, combined these representations with verified miRNA-disease interactions to generate feature vectors, and trained a variational autoencoder-based predictor to predict unknown miRNA-disease interactions.

Because previous models have several limitations, we present a novel model based on Similarity Constrained Matrix Factorization for miRNA-Disease Association Prediction (SCMFMDA). To obtain rich disease similarity data, we applied the similarity network fusion algorithm to integrate disease functional similarity, disease semantic similarity and disease Gaussian interaction profile kernel similarity. Similarly, miRNA similarity data were obtained by applying similarity network fusion to integrate miRNA functional similarity, miRNA sequence similarity and miRNA Gaussian interaction profile kernel similarity. In addition, we added L2 regularization terms and similarity constraint terms to the standard Nonnegative Matrix Factorization (NMF) method to predict unknown miRNA-disease associations. To evaluate the effectiveness of SCMFMDA, global leave-one-out cross validation and five-fold cross validation were carried out on the verified miRNA-disease association data downloaded from HMDD v2.0 [41], yielding AUC values of 0.9675 and 0.9447, respectively. Furthermore, we performed case studies on colon neoplasms and lung neoplasms and validated the results against the miR2Disease [42] and dbDEMC v2.0 [43] databases, which gave high confirmation ratios. The experimental results showed that SCMFMDA was effective for inferring possible relationships between miRNAs and diseases.

Materials

Human miRNA-disease associations

In this study, we downloaded verified human miRNA-disease associations from the HMDD v2.0 database, which included 5430 known associations between 383 diseases and 495 miRNAs. For computational convenience, we constructed an adjacency matrix A ∈ ℝ^(nd×nm) to represent the verified miRNA-disease associations, where nd and nm denote the number of diseases and miRNAs, respectively. We use aij to denote the (i,j)th element of matrix A: aij is set to 1 if disease di is related to miRNA mj, and to 0 otherwise.
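The construction of the adjacency matrix can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_association_matrix` and the toy disease and miRNA names are hypothetical, and the input is assumed to be a list of (disease, miRNA) pairs.

```python
import numpy as np

def build_association_matrix(pairs, diseases, mirnas):
    """Return A (nd x nm) with A[i, j] = 1 if disease i is linked to miRNA j."""
    d_index = {d: i for i, d in enumerate(diseases)}
    m_index = {m: j for j, m in enumerate(mirnas)}
    A = np.zeros((len(diseases), len(mirnas)), dtype=int)
    for disease, mirna in pairs:
        A[d_index[disease], m_index[mirna]] = 1
    return A

# Toy data with hypothetical names (not taken from HMDD v2.0).
diseases = ["d1", "d2", "d3"]
mirnas = ["m1", "m2"]
pairs = [("d1", "m1"), ("d3", "m2"), ("d3", "m1")]
A = build_association_matrix(pairs, diseases, mirnas)
```

With the real data, A would have 383 rows and 495 columns and contain 5430 ones.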

Disease functional similarity

Phenotypically similar diseases tend to be associated with similar genes, so disease functional similarity can be calculated from gene functional information. The log-likelihood score (LLS) represents the probability of a functional linkage between two genes. LLS values can be downloaded from the HumanNet database [44] and are normalized as follows: LLSn(ga, gb) = (LLS(ga, gb) − LLSmin) / (LLSmax − LLSmin) (1) where LLS(ga, gb) denotes the LLS between gene ga and gene gb, LLSmax and LLSmin are the maximum and minimum LLS in the HumanNet database, and LLSn(ga, gb) represents the normalized LLS.

Then, the gene functional similarity score is calculated as follows: (2) where SHumanNet denotes the set of all links between genes in the HumanNet database and e(a,b) denotes the link between gene ga and gene gb.

Furthermore, the functional similarity score between gene g and gene set G is defined as follows: (3)

Disease-gene association data were obtained from SIDD [45] and used to calculate the disease functional similarity SD1 as follows: (4)

Disease semantic similarity

Following a previous study [46], Medical Subject Headings (MeSH) descriptors can be used to calculate disease semantic similarity. Here, a directed acyclic graph (DAG) is adopted to represent the relationships among diseases. Concretely, DAG(D) = (D,T(D),E(D)) represents the DAG of disease D, in which T(D) denotes the node set containing D itself and its ancestor nodes, and E(D) denotes the set of direct edges from parent nodes to their child nodes. The semantic value of disease D is then calculated as follows: (5) where the semantic contribution of disease d to D is calculated as follows: (6) Here, Δ is the semantic contribution factor, which is set to 0.5 following previous literature [47].

This measure rests on the assumption that two diseases are more similar when they share a larger part of their DAGs. The semantic similarity DS1(di, dj) between disease di and disease dj is therefore defined as follows: (7)
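The DAG-based similarity can be illustrated with a short sketch. Because Eqs (5) to (7) are not reproduced in this text, the code assumes the widely used formulation in which a disease contributes 1 to itself, each ancestor receives Δ times the largest contribution among its children in the DAG, the semantic value is the sum of all contributions, and the similarity is the summed contributions of shared nodes divided by the sum of the two semantic values. The `parents` mapping and the node names are hypothetical.

```python
def semantic_contributions(disease, parents, delta=0.5):
    """Contribution of every node in DAG(disease) to `disease` (assumed form of Eq 6)."""
    contrib = {disease: 1.0}
    stack = [disease]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, []):
            value = delta * contrib[node]
            # Keep the largest contribution over all paths reaching this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                stack.append(parent)
    return contrib

def semantic_similarity(d_i, d_j, parents, delta=0.5):
    """Shared contributions divided by the two semantic values (assumed form of Eq 7)."""
    c_i = semantic_contributions(d_i, parents, delta)
    c_j = semantic_contributions(d_j, parents, delta)
    shared = set(c_i) & set(c_j)
    numerator = sum(c_i[t] + c_j[t] for t in shared)
    return numerator / (sum(c_i.values()) + sum(c_j.values()))

# Hypothetical toy hierarchy: x and y are both children of a, and a is a child of root.
parents = {"x": ["a"], "y": ["a"], "a": ["root"]}
sim_xy = semantic_similarity("x", "y", parents)
```

In this toy hierarchy each of x and y has a semantic value of 1 + 0.5 + 0.25 = 1.75, and the shared nodes a and root contribute 1.5 in total, so the similarity is 1.5 / 3.5.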

According to a previous study [48], a disease that appears in fewer DAGs may be more specific and ought to receive a higher semantic contribution. Different diseases located in the same layer of a DAG may therefore make different contributions. Specifically, the semantic contribution of disease d to D can be calculated in an alternative way as follows: (8)

Correspondingly, the semantic score of disease D and semantic similarity DS2(di, dj) between disease di and disease dj can be calculated as follows: (9) (10)

Finally, we integrated DS1 and DS2 to calculate the final disease semantic similarity SD2(di, dj) between disease di and disease dj as follows: (11)

miRNA functional similarity

miRNA functional similarity is calculated [49,50] on the assumption that functionally similar miRNAs tend to be linked with phenotypically similar diseases, and vice versa. We downloaded miRNA functional similarity data from http://www.cuilab.cn/files/images/cuilab/misim.zip and stored them in a matrix SM1 with nm rows and nm columns, where the element SM1(mi, mj) represents the functional similarity score between miRNA mi and miRNA mj.

miRNA sequence similarity

We used the Needleman-Wunsch algorithm to calculate miRNA sequence similarity, with the miRNA sequences obtained from the miRBase database [51]. As with miRNA functional similarity, we constructed a matrix SM2 ∈ ℝ^(nm×nm) to store the sequence similarity information, where SM2(mi, mj) is the sequence similarity score between miRNA mi and miRNA mj.
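For readers unfamiliar with global alignment, the following sketch computes a Needleman-Wunsch alignment score by dynamic programming. The scoring scheme (match = 1, mismatch = -1, gap = -1) is an assumption made for illustration; the text does not state the scoring parameters or how raw scores were normalized into SM2.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]

# Short made-up RNA fragments, not real miRBase sequences.
identical = needleman_wunsch_score("ACGU", "ACGU")
one_gap = needleman_wunsch_score("ACGU", "ACU")
```

Identical sequences score their full length, while "ACGU" against "ACU" needs one gap and scores 3 matches minus 1.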

Gaussian interaction profile kernel similarity for diseases and miRNAs

Following previous studies [49,50], and because functionally similar miRNAs are likely to be linked with phenotypically similar diseases, Gaussian interaction profile (GIP) kernel similarity can be calculated to represent miRNA similarity and disease similarity. Concretely, the binary vector K(di) represents the interaction profile of disease di, recording whether di has a known association with each miRNA. The GIP kernel similarity SD3(di, dj) between disease di and disease dj is calculated as follows: (12) (13)

Likewise, the GIP kernel similarity SM3(mi, mj) between miRNA mi and miRNA mj is calculated as follows: (14) (15) where the binary vector K(mi) represents the interaction profile of miRNA mi, recording whether mi has a known association with each disease, and the parameter ρm controls the kernel bandwidth.
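A GIP kernel can be sketched in a few lines. Since Eqs (12) to (15) are not reproduced in this text, the code assumes the commonly used form: the similarity is exp(−ρ‖K(i) − K(j)‖²), and the bandwidth ρ is a constant ρ′ divided by the mean squared norm of the interaction profiles. The value ρ′ = 1 and the function name `gip_kernel` are assumptions.

```python
import numpy as np

def gip_kernel(profiles, rho_prime=1.0):
    """Gaussian interaction profile kernel between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    sq_norms = np.sum(profiles ** 2, axis=1)
    # Bandwidth normalized by the average number of known associations per row
    # (assumes at least one known association, otherwise the mean is zero).
    rho = rho_prime / sq_norms.mean()
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negative rounding errors
    return np.exp(-rho * sq_dists)

# Toy association matrix: rows are diseases, columns are miRNAs.
A = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
SD3 = gip_kernel(A)    # disease similarity uses the rows of A
SM3 = gip_kernel(A.T)  # miRNA similarity uses the columns of A
```

Diseases with identical interaction profiles (rows 0 and 1 here) receive similarity 1, and the similarity decays as the profiles diverge.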

Methods

Overview

SCMFMDA consists of two major parts. First, similarity network fusion is applied to obtain an integrated disease similarity and an integrated miRNA similarity. Second, the known miRNA-disease associations and the integrated similarities are used in similarity constrained matrix factorization to infer unknown miRNA-disease associations. The flow chart of SCMFMDA is shown in Fig 1.

Integrating similarity for diseases and miRNAs

The similarity between two diseases can be represented by disease functional similarity, disease semantic similarity and disease GIP kernel similarity. Similarly, the similarity between two miRNAs can be represented by miRNA functional similarity, miRNA sequence similarity and miRNA GIP kernel similarity. Here, the similarity network fusion (SNF) [52] method is applied to integrate the various similarities for diseases and miRNAs. In SNF, fusion is expressed as an iterative update of the similarity matrices. The main steps for integrating the disease similarities SDn, n = 1,2,3, with SNF are as follows.

In the first step, we calculated the normalized weight matrix Pn of each similarity network as follows: (16)

In the second step, we used the k nearest neighbor (KNN) algorithm to measure the local relationships in each similarity network. The corresponding matrix Kn is obtained as follows: (17) where Ni denotes the neighbors of disease di.

In the third step, we applied SNF to integrate the normalized weight matrix Pn and the local relationship matrix Kn as follows: (18)

Because we had three disease similarity networks (disease functional similarity, disease semantic similarity and disease GIP kernel similarity), m was equal to 3. After the iterative updates, the final disease similarity matrix Sd was obtained as follows: (19)

Similarly, we applied the SNF algorithm to obtain the final miRNA similarity matrix Sm.
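The three SNF steps can be sketched as follows. Eqs (16) to (19) are not reproduced in this text, so the code follows the general SNF recipe under stated assumptions: the global matrix has diagonal 1/2 with off-diagonal entries scaled so each row sums to 1, the local matrix keeps each node's k most similar neighbors (self excluded) and row-normalizes them, each network is updated by diffusing the average of the other networks through its local matrix, and the fused result is the average over networks. The values of `k` and `iterations` are illustrative, not the authors' settings.

```python
import numpy as np

def normalize_full(S):
    """Global kernel: diagonal 1/2, off-diagonal entries scaled so each row sums to 1."""
    P = np.asarray(S, dtype=float).copy()
    np.fill_diagonal(P, 0.0)
    row_sums = P.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0  # guard against isolated nodes
    P = P / (2.0 * row_sums)
    np.fill_diagonal(P, 0.5)
    return P

def local_knn(S, k):
    """Local kernel: keep the k most similar neighbors per row, then row-normalize."""
    S = np.asarray(S, dtype=float)
    ranked = S.copy()
    np.fill_diagonal(ranked, -np.inf)  # never select the node itself
    idx = np.argsort(-ranked, axis=1)[:, :k]
    rows = np.arange(S.shape[0])[:, None]
    K = np.zeros_like(S)
    K[rows, idx] = S[rows, idx]
    sums = K.sum(axis=1, keepdims=True)
    sums[sums == 0] = 1.0
    return K / sums

def snf(similarities, k=2, iterations=10):
    """Fuse two or more similarity matrices defined over the same nodes."""
    P = [normalize_full(S) for S in similarities]
    K = [local_knn(S, k) for S in similarities]
    m = len(similarities)
    for _ in range(iterations):
        total = sum(P)
        # Each network diffuses the average of the other networks through its local kernel.
        P = [normalize_full(K[n] @ ((total - P[n]) / (m - 1)) @ K[n].T) for n in range(m)]
    fused = sum(P) / m
    return (fused + fused.T) / 2.0  # symmetrize the final matrix

rng = np.random.default_rng(0)

def random_similarity(n):
    """Random symmetric toy similarity matrix with unit diagonal."""
    X = rng.random((n, n))
    S = (X + X.T) / 2.0
    np.fill_diagonal(S, 1.0)
    return S

Sd = snf([random_similarity(5) for _ in range(3)], k=2)
```

The same function would be applied to the three miRNA similarity matrices to obtain Sm.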

Similarity constrained matrix factorization

After the integrated disease similarity and miRNA similarity are obtained, similarity constrained matrix factorization is used to uncover unknown miRNA-disease interactions; Fig 2 shows the details. SCMFMDA factorizes the matrix A ∈ ℝ^(nd×nm) into U ∈ ℝ^(nd×γ) and V ∈ ℝ^(nm×γ), where γ denotes the dimension of the disease and miRNA features in the low-rank space. Specifically, the association between a disease and a miRNA is approximately equal to the inner product of the disease feature vector and the miRNA feature vector: aij ≈ ui vj^T, where ui and vj represent the ith row of U and the jth row of V, respectively. The corresponding objective function is as follows: (20)

Then, the L2 regularization terms of ui and vj are added to Eq (20) to mitigate overfitting: (21) where σ is the regularization parameter for ui and vj.

Fig 2. The details of similarity constrained matrix factorization.

https://doi.org/10.1371/journal.pcbi.1009165.g002

According to previous studies [53,54], the geometric properties of data points can be preserved when they are mapped from a high-rank space into a low-rank space. The disease similarity Sd and miRNA similarity Sm describe the geometric structure of the data points, so we introduce the similarity constraint terms SU and SV as follows: (22) (23) where Sd(i,j) represents the similarity between disease di and disease dj, and Sm(i,j) denotes the similarity between miRNA mi and miRNA mj. Because the similarity between two data points reflects how close they are, SU incurs a heavy penalty when two highly similar diseases di and dj are mapped far apart in the disease feature space. Minimizing SU therefore preserves the geometric structure of the disease data points, so that similar diseases di and dj are mapped close together in the low-dimensional space. The same holds for miRNAs. Hence, the objective function of SCMFMDA is obtained by adding SU and SV to Eq (21) as follows: (24) where ε is a hyperparameter that controls the smoothness of the similarity consistency.

Optimization algorithm

In this section, we describe an optimization algorithm for minimizing the objective function of SCMFMDA. First, the partial derivatives of L with respect to ui and vj are calculated as follows: (25) where A(i,:) denotes the ith row of matrix A. (26) where A(:,j) denotes the jth column of matrix A.

Then, the second derivatives of L with respect to ui and vj are calculated as follows: (27) (28)

According to Newton’s method, ui and vj are updated iteratively as follows: (29) (30)

Hence, ui and vj can be updated by the following formulas: (31) (32)

The updates of ui and vj stop when the convergence condition is met. The prediction matrix is then obtained from the updated ui and vj: (33)

The (i,j)th entry of the prediction matrix denotes the association score between disease di and miRNA mj. The higher the score, the more likely the association.
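To make the role of each term concrete, the following sketch minimizes an objective of the kind described above. It is not the authors' algorithm: it uses plain gradient descent instead of the Newton-style row updates, it does not enforce non-negativity, and the exact form and scaling of the objective are assumptions, namely ½‖A − UVᵀ‖² + (σ/2)(‖U‖² + ‖V‖²) + ε[tr(UᵀLdU) + tr(VᵀLmV)], where Ld and Lm are the graph Laplacians of symmetric similarity matrices Sd and Sm. The toy values of `sigma`, `eps` and `step` are chosen for the small example only.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian D - S of a symmetric similarity matrix."""
    S = np.asarray(S, dtype=float)
    return np.diag(S.sum(axis=1)) - S

def objective(A, U, V, Ld, Lm, sigma, eps):
    fit = 0.5 * np.sum((A - U @ V.T) ** 2)                    # reconstruction error
    ridge = 0.5 * sigma * (np.sum(U ** 2) + np.sum(V ** 2))   # L2 regularization
    smooth = eps * (np.trace(U.T @ Ld @ U) + np.trace(V.T @ Lm @ V))  # similarity constraint
    return fit + ridge + smooth

def factorize(A, Sd, Sm, rank, sigma=0.1, eps=0.1, step=0.01, iterations=500, seed=0):
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)
    U = 0.1 * rng.random((A.shape[0], rank))
    V = 0.1 * rng.random((A.shape[1], rank))
    Ld, Lm = laplacian(Sd), laplacian(Sm)
    history = [objective(A, U, V, Ld, Lm, sigma, eps)]
    for _ in range(iterations):
        R = A - U @ V.T
        grad_U = -R @ V + sigma * U + 2.0 * eps * Ld @ U
        grad_V = -R.T @ U + sigma * V + 2.0 * eps * Lm @ V
        U, V = U - step * grad_U, V - step * grad_V
        history.append(objective(A, U, V, Ld, Lm, sigma, eps))
    return U, V, history

# Toy inputs: 4 diseases, 3 miRNAs, uniform off-diagonal similarities.
A = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1]])
Sd = np.full((4, 4), 0.3); np.fill_diagonal(Sd, 1.0)
Sm = np.full((3, 3), 0.3); np.fill_diagonal(Sm, 1.0)
U, V, history = factorize(A, Sd, Sm, rank=2)
scores = U @ V.T  # predicted association scores
```

The similarity term pulls the feature vectors of similar diseases (and similar miRNAs) toward each other, which is what allows scores to be assigned to pairs with no known association.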

Results

Parameters optimization

In this section, the parameters γ, σ and ε are analyzed quantitatively to study their effect on prediction performance. γ represents the dimension of the disease and miRNA features in the low-rank space; since γ < min(nd, nm), it can be expressed as a percentage of min(nd, nm). σ and ε are the regularization parameters. The AUC value under 5-CV is used to evaluate the influence of the parameter choice on model performance. Extensive preliminary experiments indicated that γ affects the results independently of the other parameters. We therefore fixed σ and ε at a suitable combination and tested γ ∈ {0, 10%, …, 100%} in SCMFMDA to find the most suitable value. To check the reliability of this test, it was repeated with σ and ε fixed at different combinations. Fig 3A shows that SCMFMDA performed best when γ = 50%. We then fixed γ = 50% so that the effect of the regularization parameters σ and ε could be evaluated clearly, and constructed SCMFMDA with all combinations of σ ∈ {2^−3, 2^−2, …, 2^3} and ε ∈ {2^−3, 2^−2, …, 2^3}. Fig 3B shows that SCMFMDA achieved its best AUC value of 0.9447 when σ = 2^2 and ε = 2^0. In summary, γ, σ and ε are set to 50%, 2^2 and 2^0, respectively, in our model.
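The exhaustive search over σ and ε amounts to evaluating 7 × 7 = 49 combinations. A minimal sketch is given below; the `evaluate` callable is a hypothetical stand-in for "train SCMFMDA with these values and return the 5-CV AUC", and the dummy scoring function exists only to make the example runnable.

```python
from itertools import product

def grid_search(evaluate, exponents=range(-3, 4)):
    """Return the (sigma, eps) pair with the highest score and the grid size."""
    grid = [2.0 ** e for e in exponents]  # 2^-3, 2^-2, ..., 2^3
    best = max(product(grid, grid), key=lambda pair: evaluate(*pair))
    return best, len(grid) ** 2

# Dummy score peaked at sigma = 4, eps = 1 (a placeholder, not a real AUC surface).
def dummy_score(sigma, eps):
    return -((sigma - 4.0) ** 2 + (eps - 1.0) ** 2)

best_pair, n_combinations = grid_search(dummy_score)
```

Because every combination requires a full cross validation run, the cost grows quadratically with the number of candidate values per parameter.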

Fig 3. The influence of parameters on SCMFMDA: (A) the influence of γ; (B) the influence of σ and ε.

https://doi.org/10.1371/journal.pcbi.1009165.g003

Model comparison

To evaluate the predictive ability of SCMFMDA, we compared it with several previous computational methods proposed for predicting unknown miRNA-disease associations. For a fair comparison, all methods were trained on the same dataset (HMDD v2.0). These methods are as follows.

  • MSCHLMDA [55] is a multi-similarity based combinative hypergraph learning model (published in 2020).
  • ICFMDA [56] is an improved collaborative filtering-based computational model (published in 2018).
  • SACMDA [57] is a short acyclic connections-based computational model (published in 2018).
  • GRNMF [58] is a graph regularized non-negative matrix factorization-based model (published in 2018).
  • GRL2,1NMF [59] is a graph Laplacian regularized L2,1-nonnegative matrix factorization-based computational model (published in 2020).
  • NPCMF [60] is a nearest profile-based collaborative matrix factorization model (published in 2019).
  • KBMFMDA [61] is a kernelized Bayesian matrix factorization-based computational model (published in 2020).

The HMDD v2.0 dataset included 5430 verified associations and 184155 unverified associations between 383 diseases and 495 miRNAs. On this dataset, global leave-one-out cross validation (global LOOCV) and five-fold cross validation (5-CV) were used to evaluate the prediction performance of these methods. In global LOOCV, each verified miRNA-disease association was held out in turn as the test sample, and the remaining verified associations formed the training set. All unknown miRNA-disease associations were treated as candidate samples. Similarly, in 5-CV, the verified miRNA-disease associations were randomly divided into five parts; each part was used in turn as the test set, and the other four parts formed the training set. Again, all unknown miRNA-disease associations were treated as candidate samples. Under both global LOOCV and 5-CV, we applied SCMFMDA to obtain the predicted association scores, so that the rank of each test sample relative to the candidate samples could be calculated. A test sample was regarded as correctly predicted when its rank was higher than a given threshold. Varying the threshold yields the receiver operating characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR), and the area under the ROC curve (AUC), whose value lies between 0 and 1, was used to evaluate the performance of SCMFMDA. The AUCs of the other computational methods were obtained in the same way from the HMDD v2.0 data.
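The evaluation protocol can be sketched with two small helpers. The first computes the AUC as the probability that a held-out verified association is scored above a candidate (unverified) association, which is equivalent to the area under the ROC curve; the second splits the verified associations into five random folds. Function names, the seed and the toy matrix are assumptions for illustration.

```python
import numpy as np

def auc_from_scores(test_scores, candidate_scores):
    """AUC as the fraction of (test, candidate) pairs ranked correctly; ties count 1/2."""
    pos = np.asarray(test_scores, dtype=float)[:, None]
    neg = np.asarray(candidate_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def five_fold_indices(A, n_folds=5, seed=0):
    """Randomly split the (disease, miRNA) index pairs of verified associations."""
    known = np.argwhere(np.asarray(A) == 1)
    rng = np.random.default_rng(seed)
    rng.shuffle(known)  # shuffles the rows, i.e. the index pairs
    return np.array_split(known, n_folds)

# Toy association matrix with 10 verified associations.
A = np.zeros((4, 5), dtype=int)
A[0, :3] = 1
A[1, 1:4] = 1
A[2, [0, 4]] = 1
A[3, [2, 3]] = 1
folds = five_fold_indices(A)
# In each round, the entries of one fold would be set to 0 in a copy of A before
# training, and their predicted scores compared with those of the unknown pairs.
perfect = auc_from_scores([0.9, 0.8], [0.1, 0.2, 0.3])
```

Global LOOCV corresponds to the limiting case in which every fold contains a single verified association.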

In global LOOCV, SCMFMDA, MSCHLMDA, ICFMDA and SACMDA achieved AUC values of 0.9675, 0.9287, 0.9072 and 0.8777, respectively (Fig 4). To reduce the bias introduced by random partitioning, we repeated the random division of the verified miRNA-disease associations 100 times in 5-CV; the average AUC values of SCMFMDA, MSCHLMDA, ICFMDA and SACMDA were 0.9447, 0.9263, 0.9046 and 0.8773, respectively (Fig 5). SCMFMDA thus outperformed the other methods under both validation schemes.

Fig 4. AUC of SCMFMDA under global LOOCV compared with those of MSCHLMDA, ICFMDA and SACMDA.

https://doi.org/10.1371/journal.pcbi.1009165.g004

Fig 5. AUC of SCMFMDA under 5-CV compared with those of MSCHLMDA, ICFMDA and SACMDA.

https://doi.org/10.1371/journal.pcbi.1009165.g005

To further assess the performance of SCMFMDA, we also compared it with other state-of-the-art matrix factorization-based methods, namely GRNMF, GRL2,1NMF, NPCMF and KBMFMDA. The 5-CV results of all models are shown in Table 1, where SCMFMDA has the highest AUC. SCMFMDA has the following advantages over the other matrix factorization-based models. First, it uses more types of biological similarity data. Second, it uses SNF instead of the traditional linear combination method to integrate the various similarity data, which helps preserve the completeness and effectiveness of the experimental data. Third, the L2 regularization and similarity constraint terms added to the NMF objective function help to correctly discover more unknown miRNA-disease connections.

Table 1. Comparisons between SCMFMDA and other MF-based models.

https://doi.org/10.1371/journal.pcbi.1009165.t001

Case studies

To demonstrate the effectiveness and accuracy of SCMFMDA, we carried out case studies on two human diseases, colon neoplasms and lung neoplasms, both of which do great harm to human health. Colon neoplasms are malignant tumors that have been confirmed to be associated with several miRNAs [62,63]. Lung neoplasms are among the most dangerous malignancies, with the fastest increase in morbidity and mortality [12], and a growing body of evidence indicates that they are closely related to a number of miRNAs. For each disease, the verified associations of all diseases in the HMDD v2.0 database were used as training samples, and the unverified associations of that disease were treated as candidate samples. After training the model, we ranked the candidate samples by predicted association score and selected the top 50 candidate miRNAs for the disease. We then used two databases, miR2Disease and dbDEMC v2.0, to check the ranked miRNAs. Tables 2 and 3 show the prediction results of SCMFMDA for colon neoplasms and lung neoplasms, respectively. Of the top 50 miRNAs inferred by our model, 94% (47) and 92% (46) were confirmed by miR2Disease and dbDEMC v2.0 to be associated with colon neoplasms and lung neoplasms, respectively. Only 3 and 4 of the top 50 predicted miRNAs for colon neoplasms and lung neoplasms, respectively, could not be confirmed in these databases.
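The ranking step of the case studies can be sketched as follows: for one disease, only miRNAs without a verified association are kept as candidates and sorted by predicted score. The function name, the toy scores and the miRNA names are hypothetical.

```python
import numpy as np

def top_candidates(scores, A, disease_idx, mirna_names, top_n=50):
    """Rank unverified miRNAs of one disease by predicted association score."""
    row_scores = np.asarray(scores)[disease_idx]
    candidates = np.flatnonzero(np.asarray(A)[disease_idx] == 0)  # unverified pairs only
    order = candidates[np.argsort(-row_scores[candidates], kind="stable")]
    return [(mirna_names[j], float(row_scores[j])) for j in order[:top_n]]

# Toy example: one disease, four miRNAs, the first already verified.
scores = np.array([[0.9, 0.2, 0.7, 0.4]])
A = np.array([[1, 0, 0, 0]])
names = ["m1", "m2", "m3", "m4"]
top_two = top_candidates(scores, A, 0, names, top_n=2)
```

The returned list would then be checked manually or programmatically against external databases such as miR2Disease and dbDEMC v2.0.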

Table 2. The top 50 potential miRNAs associated with colon neoplasms.

https://doi.org/10.1371/journal.pcbi.1009165.t002

Table 3. The top 50 potential miRNAs associated with lung neoplasms.

https://doi.org/10.1371/journal.pcbi.1009165.t003

Discussion and conclusion

In this paper, we introduced a new model named SCMFMDA, which uses a similarity constrained matrix factorization algorithm to predict possible miRNA-disease associations. To obtain rich disease and miRNA similarity data, the similarity network fusion algorithm is used to integrate various sources of disease and miRNA biological information, respectively. In addition, L2 regularization terms and similarity constraint terms are added to the standard NMF to predict unobserved miRNA-disease associations. Under global LOOCV and 5-CV, SCMFMDA achieved AUCs of 0.9675 and 0.9447, respectively, a clear improvement over the previous models compared. Furthermore, most of the predicted miRNAs related to colon neoplasms and lung neoplasms were confirmed by the experimental literature, which supports the reliability of the predictions.

It should be noted that the following factors may contribute to the reliable performance of SCMFMDA. First, the similarity network fusion algorithm was applied to integrate different disease and miRNA similarities, which ensures the richness of the biological data used in the experiment. Second, the L2 regularization terms help to avoid overfitting. Third, the similarity constraint terms consist of disease feature-based similarity and miRNA feature-based similarity, which can make the model robust to differences in data richness.
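To illustrate how L2 regularization terms and similarity constraint terms can be combined with nonnegative matrix factorization, the following is a minimal sketch of a generic graph-regularized NMF with multiplicative updates. It is an assumption-laden stand-in, not the exact objective or update rules of SCMFMDA: the function names, the hyperparameters `lam` and `mu`, their values, and the random toy matrices are all hypothetical. Here `Y` is a miRNA-by-disease association matrix, and `S_m` and `S_d` are symmetric nonnegative miRNA and disease similarity matrices.

```python
import numpy as np

def objective(Y, U, V, S_m, S_d, lam, mu):
    """Reconstruction error + L2 penalty + similarity (graph Laplacian) penalty."""
    L_m = np.diag(S_m.sum(axis=1)) - S_m  # Laplacian of the miRNA similarity graph
    L_d = np.diag(S_d.sum(axis=1)) - S_d  # Laplacian of the disease similarity graph
    fit = np.linalg.norm(Y - U @ V.T) ** 2
    l2 = lam * (np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)
    graph = mu * (np.trace(U.T @ L_m @ U) + np.trace(V.T @ L_d @ V))
    return fit + l2 + graph

def similarity_constrained_nmf(Y, S_m, S_d, rank=5, lam=0.1, mu=0.1,
                               n_iter=200, seed=0, eps=1e-12):
    """Factorize Y ~ U @ V.T with U, V >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_m, n_d = Y.shape
    U = rng.random((n_m, rank))  # latent miRNA features
    V = rng.random((n_d, rank))  # latent disease features
    D_m = np.diag(S_m.sum(axis=1))
    D_d = np.diag(S_d.sum(axis=1))
    for _ in range(n_iter):
        # Numerators hold the negative gradient parts, denominators the positive
        # parts, so nonnegative factors stay nonnegative after each update.
        U *= (Y @ V + mu * S_m @ U) / (U @ (V.T @ V) + lam * U + mu * D_m @ U + eps)
        V *= (Y.T @ U + mu * S_d @ V) / (V @ (U.T @ U) + lam * V + mu * D_d @ V + eps)
    return U, V

def random_similarity(n, rng):
    """Build a symmetric toy similarity matrix with unit diagonal (hypothetical data)."""
    A = rng.random((n, n))
    S = (A + A.T) / 2
    np.fill_diagonal(S, 1.0)
    return S

rng = np.random.default_rng(1)
Y = (rng.random((20, 15)) < 0.2).astype(float)  # toy binary association matrix
S_m = random_similarity(20, rng)
S_d = random_similarity(15, rng)

U0, V0 = similarity_constrained_nmf(Y, S_m, S_d, n_iter=0)  # random initial factors
U, V = similarity_constrained_nmf(Y, S_m, S_d, n_iter=200)
predicted_scores = U @ V.T  # higher entries suggest more likely associations
```

In this sketch the `lam` term shrinks the factors to limit overfitting, while the `mu` term pulls the latent vectors of similar miRNAs (or similar diseases) toward each other.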

However, several limitations may influence the performance of SCMFMDA. First, the model is applicable only to diseases and miRNAs that appear in the selected dataset and cannot make predictions for other diseases and miRNAs. In addition, for some important parameters in SCMFMDA, we had no appropriate way to select the most suitable values except by testing all combinations. Therefore, we will continue to optimize our model to improve its performance in future work.

Supporting information

S1 Table. Known human miRNA-disease associations obtained from HMDD v2.0 database.

https://doi.org/10.1371/journal.pcbi.1009165.s001

(XLSX)

S2 Table. Names of 383 diseases involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

https://doi.org/10.1371/journal.pcbi.1009165.s002

(XLSX)

S3 Table. Names of 495 miRNAs involved in known human miRNA-disease associations obtained from HMDD v2.0 database.

https://doi.org/10.1371/journal.pcbi.1009165.s003

(XLSX)

S4 Table. The constructed disease functional similarity score matrix.

https://doi.org/10.1371/journal.pcbi.1009165.s004

(XLSX)

S5 Table. The constructed disease semantic similarity score matrix.

https://doi.org/10.1371/journal.pcbi.1009165.s005

(XLSX)

S6 Table. The constructed miRNA functional similarity score matrix.

https://doi.org/10.1371/journal.pcbi.1009165.s006

(XLSX)

S7 Table. The constructed miRNA sequence similarity score matrix.

https://doi.org/10.1371/journal.pcbi.1009165.s007

(XLSX)

References

  1. Bartel DP. MicroRNAs: Genomics, Biogenesis, Mechanism, and Function. Cell. 2004;116(2):281–297. pmid:14744438
  2. Chatterjee S, Grosshans H. Active turnover modulates mature microRNA activity in Caenorhabditis elegans. Nature. 2009;461(7263):546–549. pmid:19734881
  3. He L, Hannon GJ. MicroRNAs: small RNAs with a big role in gene regulation. Nat Rev Genet. 2004;5(7):522–531. pmid:15211354
  4. Lee RC, Feinbaum RL, Ambros V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell. 1993;75(5):843–854. pmid:8252621
  5. Wightman B, Ha I, Ruvkun G. Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell. 1993;75(5):855–862. pmid:8252622
  6. Jopling CL, Yi M, Lancaster AM, Lemon SM, Sarnow P. Modulation of Hepatitis C Virus RNA Abundance by a Liver-Specific MicroRNA. Science. 2005;309(5740):1577–1581. pmid:16141076
  7. Xu P, Guo M, Hay BA. MicroRNAs and the regulation of cell death. Trends Genet. 2004;20(12):617–624. pmid:15522457
  8. Bartel DP. MicroRNAs: Target Recognition and Regulatory Functions. Cell. 2009;136(2):215–233. pmid:19167326
  9. Miska EA. How microRNAs control cell division, differentiation and death. Curr Opin Genet Dev. 2005;15(5):563–568. pmid:16099643
  10. Harfe BD. MicroRNAs in vertebrate development. Curr Opin Genet Dev. 2005;15(4):410–415. pmid:15979303
  11. Meola N, Gennarino V, Banfi S. microRNAs and genetic diseases. Pathogenetics. 2009;2(1):7. pmid:19889204
  12. Yanaihara N, Caplen N, Bowman E, Seike M, Kumamoto K. Unique microRNA molecular profiles in lung cancer diagnosis and prognosis. Cancer Cell. 2006;9(3):189–198. pmid:16530703
  13. Sita-Lumsden A, Dart DA, Waxman J, Bevan CL. Circulating microRNAs as potential new biomarkers for prostate cancer. Br J Cancer. 2013;108(10):1925–1930. pmid:23632485
  14. Mohammadi-Yeganeh S, Paryan M, Samiee SM, Soleimani M, Arefian E, Azadmanesh K, et al. Development of a robust, low cost stem-loop real-time quantification PCR technique for miRNA expression analysis. Mol Biol Rep. 2013;40(5):3665–3674. pmid:23307300
  15. Thomson JM, Parker JS, Hammond SM. Microarray Analysis of miRNA Gene Expression. Methods Enzymol. 2007;427:107–122. pmid:17720481
  16. Han K, Xuan P, Ding J, Zhao ZJ, Hui L, Zhong YL. Prediction of disease-related microRNAs by incorporating functional similarity and common association information. Genet Mol Res. 2014;13(1):2009–2019. pmid:24737426
  17. Yu S, Liang C, Xiao Q, Li G, Ding P, Luo J. MCLPMDA: A novel method for miRNA-disease association prediction based on matrix completion and label propagation. J Cell Mol Med. 2019;23(2):1427–1438. pmid:30499204
  18. Chen X, Gong Y, Zhang D, You Z, Li Z. DRMDA: deep representations–based miRNA–disease association prediction. J Cell Mol Med. 2018;22(1):472–485. pmid:28857494
  19. Jiang Q, Hao Y, Wang G, Juan L, Wang Y. Prioritization of disease microRNAs through a human phenome-microRNAome network. BMC Syst Biol. 2010;4(SUPPL. 1):S2. pmid:20522252
  20. Chen X, Yan C, Zhang X, You Z, Deng L, Liu Y, et al. WBSMDA: Within and Between Score for MiRNA-Disease Association prediction. Sci Rep. 2016;6:21106. pmid:26880032
  21. Shen Z, Zhang YH, Han K, Nandi AK, Honig B, Huang DS. miRNA-Disease Association Prediction with Collaborative Matrix Factorization. Complexity. 2017;2017:1–9.
  22. Chen X, Huang L, Xie D, Zhao Q. EGBMMDA: Extreme Gradient Boosting Machine for MiRNA-Disease Association prediction. Cell Death Dis. 2018;9(1):3. pmid:29305594
  23. Zhao Y, Chen X, Yin J. Adaptive boosting-based computational model for predicting potential miRNA-disease associations. Bioinformatics. 2019;35(22):4730–4738. pmid:31038664
  24. Zhu XY, Wang XZ, Zhao HC, Pei TR, Kuang LN, Wang L. BHCMDA: A New Biased Conduction Based Method for Potential MiRNA-Disease Association Prediction. Front Genet. 2020;11(1):384. pmid:32425979
  25. Ha J, Park C, Park C, Park S. IMIPMF: Inferring miRNA-disease interactions using probabilistic matrix factorization. J Biomed Inform. 2020;102:103358. pmid:31857202
  26. Chen X, Liu M, Yan G. RWRMDA: predicting novel human microRNA-disease associations. Mol Biosyst. 2012;8(10):2792–2798. pmid:22875290
  27. Köhler S, Bauer S, Horn D, Robinson PN. Walking the Interactome for Prioritization of Candidate Disease Genes. Am J Hum Genet. 2008;82(4):949–958. pmid:18371930
  28. Zhang H, Cao L, Gao S. A locality correlation preserving support vector machine. Pattern Recognition. 2014;47(9):3168–3178.
  29. Shi H, Xu J, Zhang G, Xu L, Xia L. Walking the interactome to identify human miRNA-disease associations through the functional link between miRNA targets and disease genes. BMC Syst Biol. 2013;7:101. pmid:24103777
  30. Liu Y, Zeng X, He Z, Zou Q. Inferring microRNA-disease associations by random walk on a heterogeneous network with multiple data sources. IEEE/ACM Trans Comput Biol Bioinform. 2017;14(4):905–915. pmid:27076459
  31. Luo J, Xiao Q. A novel approach for predicting microRNA-disease associations by unbalanced bi-random walk on heterogeneous network. J Biomed Inform. 2017;66:194–203. pmid:28104458
  32. Niu Y, Wang G, Yan G, Chen X. Integrating random walk and binary regression to identify novel miRNA-disease association. BMC Bioinformatics. 2019;20:59. pmid:30691413
  33. Chen X, Yan CC, Zhang X, Li Z, Deng L, Zhang Y, et al. RBMMMDA: predicting multiple types of disease-microRNA associations. Sci Rep. 2015;5(1):13877. pmid:26347258
  34. You Z, Huang Z, Zhu Z, Yan G, Chen X. PBMDA: A novel and effective path-based computational model for miRNA-disease association prediction. PLoS Comput Biol. 2017;13(1):e1005455. pmid:28339468
  35. Yan C, Wang JX, Ni P, Lan W, Wu FX, Pan Y. DNRLMF-MDA: Predicting microRNA-Disease Associations Based on Similarities of microRNAs and Diseases. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(1):233–243. pmid:29990253
  36. Peng J, Hui W, Li Q, Chen B, Hao J, Jiang Q, et al. A learning-based framework for miRNA-disease association identification using neural networks. Bioinformatics. 2019;35(21):4364–4371. pmid:30977780
  37. Zheng K, You ZH, Wang L, Zhou Y, Li LP, Li ZW. MLMDA: a machine learning approach to predict and validate MicroRNA-disease associations by integrating of heterogeneous information source. J Transl Med. 2019;17(1):260. pmid:31395072
  38. Chen X, Sun LG, Zhao Y. NCMCMDA: miRNA-disease association prediction through neighborhood constraint matrix completion. Brief Bioinform. 2021;22(1):485–496. pmid:31927572
  39. Zhang Y, Chen M, Cheng X, Wei H. MSFSP: A Novel miRNA-Disease Association Prediction Model by Federating Multiple-Similarities Fusion and Space Projection. Front Genet. 2020;11:389. pmid:32425980
  40. Ji C, Wang Y, Gao Z, Li L, Zheng C. A Semi-Supervised Learning Method for MiRNA-Disease Association Prediction Based on Variational Autoencoder. IEEE/ACM Trans Comput Biol Bioinform. 2021;1(1):99. pmid:33735084
  41. Li Y, Qiu C, Tu J, Geng B, Yang J, Jiang T, et al. HMDD v2.0: a database for experimentally supported human microRNA and disease associations. Nucleic Acids Res. 2013;42(Database issue):D1070–D1074. pmid:24194601
  42. Jiang Q, Wang Y, Hao Y, Liran J, Teng M, Zhang X, et al. miR2Disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Res. 2009;37(Database issue):D98–D104. pmid:18927107
  43. Yang Z, Wu L, Wang A, Tang W, Zhao Y, Zhao H, et al. dbDEMC 2.0: Updated database of differentially expressed miRNAs in human cancers. Nucleic Acids Res. 2016;45(D1):D812–D818. pmid:27899556
  44. Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 2011;21(7):1109–1121. pmid:21536720
  45. Cheng L, Wang G, Li J, Zhang T, Xu P, Wang Y, et al. SIDD: A Semantically Integrated Database towards a Global View of Human Disease. PLoS One. 2013;8(10):e75504. pmid:24146757
  46. Lipscomb CE. Medical Subject Headings (MeSH). Bull Med Libr Assoc. 2000;88(3):265–266. pmid:10928714
  47. Wang D, Wang JY, Lu M, Song F, Cui Q. Inferring the human microRNA functional similarity and functional network based on microRNA-associated diseases. Bioinformatics. 2010;26(13):1644–1650. pmid:20439255
  48. Xuan P, Han K, Guo MZ, Guo YH, Li JB, Ding J, et al. Correction: Prediction of microRNAs Associated with Human Diseases Based on Weighted k Most Similar Neighbors. PLoS One. 2013;8(9):e70204. pmid:24116246
  49. Goh KI, Cusick ME, Valle D, Childs B, Barabási AL. The human disease network. Proc Natl Acad Sci U S A. 2007;104(27):8685–8690. pmid:17502601
  50. Lu M, Zhang Q, Min D, Jing M, Guo Y, Guo W, et al. An Analysis of Human MicroRNA and Disease Associations. PLoS One. 2008;3(10):e3420. pmid:18923704
  51. Kozomara A, Griffiths-Jones S. miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2013;42(D1):D68–D73. pmid:24275495
  52. Wang B, Mezlini AM, Demir F, Fiume M, Tu ZW, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–337. pmid:24464287
  53. Zhang W, Liu XR, Chen YL, Wu WJ, Wang W, Li XH. Feature-derived graph regularized matrix factorization for predicting drug side effects. Neurocomputing. 2018;287:154–162.
  54. Rana B, Juneja A, Saxena M, Gudwani S, Kumaran SS, Behari M, et al. Graph Theory based Spectral Feature Selection for Computer Aided Diagnosis of Parkinson’s Disease Using T1-weighted MRI. International Journal of Imaging Systems and Technology. 2015;25(3):245–255.
  55. Wu Q, Wang Y, Gao Z, Ni J, Zheng C. MSCHLMDA: Multi-Similarity Based Combinative Hypergraph Learning for Predicting MiRNA-Disease Association. Front Genet. 2020;11:354. pmid:32351545
  56. Jiang Y, Liu B, Yu L, Yan C, Bian H. Predict MiRNA-Disease Association with Collaborative Filtering. Neuroinformatics. 2018;16:363–372. pmid:29948843
  57. Shao B, Liu B, Yan C. SACMDA: MiRNA-Disease Association Prediction with Short Acyclic Connections in Heterogeneous Graph. Neuroinformatics. 2018;16:373–382. pmid:29644547
  58. Xiao Q, Luo J, Liang C, Cai J, Ding P. A graph regularized non-negative matrix factorization method for identifying microRNA-disease associations. Bioinformatics. 2018;34(2):239–248. pmid:28968779
  59. Gao Z, Wang Y, Wu Q, Ni J, Zheng C. Graph regularized L2,1-nonnegative matrix factorization for miRNA-disease association prediction. BMC Bioinformatics. 2020;21:61. pmid:32070280
  60. Gao Y, Cui Z, Liu J, Wang J, Zheng C. NPCMF: Nearest Profile-based Collaborative Matrix Factorization method for predicting miRNA-disease associations. BMC Bioinformatics. 2019;20(1):353. pmid:31234797
  61. Chen X, Li S, Yin J, Wang C. Potential miRNA-disease association prediction based on kernelized Bayesian matrix factorization. Genomics. 2020;112(1):809–819. pmid:31136792
  62. Torre LA, Bray F, Siegel RL, Tieulent JL, Jemal A. Global cancer statistics, 2012. CA Cancer J Clin. 2015;65(2):87–108. pmid:25651787
  63. Hiroko OK, Masashi I, Daisuke K, Yoshitaka H, Yasuhide Y, Koh F, et al. Circulating Exosomal microRNAs as Biomarkers of Colon Cancer. PLoS One. 2014;9(4):e92921. pmid:24705249