Introduction

Pathway enrichment analysis is an essential step for interpreting high-throughput (omics) data that uses current knowledge of genes and biological processes. A common application determines statistical enrichment of molecular pathways, biological processes and other functional annotations in long lists of candidate genes1. Genomic, transcriptomic, proteomic and epigenomic experiments emphasize complementary aspects of underlying biology and are best analyzed integratively, as is now routinely done in large-scale projects such as The Cancer Genome Atlas (TCGA)2, Clinical Proteome Tumor Analysis Consortium (CPTAC), International Cancer Genome Consortium (ICGC)3,4, Genotype-Tissue Expression (GTEx)5, and others6. Thus, simultaneous analysis of multiple candidate gene lists for characteristic pathways is increasingly needed.

Numerous approaches are available for interpreting single gene lists. For example, the GSEA algorithm can detect upregulated and downregulated pathways in gene expression datasets7. Web-based methods such as Panther8, ToppCluster9, and g:Profiler10 detect significantly enriched pathways amongst ranked or unranked gene lists and are generally applicable to genes and proteins from various analyses. Some approaches allow analysis of multiple input gene lists however these primarily rely on visualization rather than data integration to evaluate the contribution of distinct gene lists towards each detected pathway9,10. Finally, no methods are available for unified pathway analysis of coding and non-coding mutations from whole-genome sequencing (WGS) data, or integrating these with other types of DNA aberrations such as copy number changes and balanced genomic rearrangements.

Cancer genomes are characterized by multiple classes of mutations, including single nucleotide variants (SNVs), small insertions–deletions (indels), copy number alterations, and translocations. These affect a small number of frequently mutated pan-cancer driver genes such as TP53, less-frequent and tissue-specific genes such as SPOP in prostate cancer, and numerous infrequently mutated genes. The majority of currently known driver mutations of SNVs and indels affect protein-coding sequence11 and only few high-confidence non-coding drivers have been found, such as the mutation hotspots in the TERT promoter12. Discovery of coding and non-coding driver mutations is a major goal of large cancer whole genome sequencing efforts. The PCAWG Consortium aggregated whole genome sequencing data from 2658 cancers across 38 tumor types generated by the ICGC and TCGA projects. These sequencing data were re-analysed with standardized, high-accuracy pipelines to align to the human genome (reference build hs37d5) and identify germline variants and somatically acquired mutations, as described in the PCAWG marker paper4. A consensus analysis of variant calls in PCAWG tumors generated a high-confidence catalog of driver mutations in protein-coding driver genes (CDS) and non-coding regions of 5′ and 3′ untranslated elements (UTRs), promoters and enhancers13. A consensus pathway and network analysis of PCAWG driver mutations used knowledge of molecular pathways and gene interaction networks as priors to further discover infrequent candidate driver variants including those in the non-coding genome14.

Here we report the development of the ActivePathways method that uses data fusion techniques to address the challenge of integrative pathway analysis of multi-omics data. It detects significantly enriched pathways across multiple datasets, including those pathways that are not apparent in any individual dataset. We present several analyses to demonstrate this method. First, we integrate cancer driver genes with coding and non-coding mutations predicted using the PCAWG dataset13 and reveal numerous processes and additional genes with frequent coding and non-coding mutations. Second, we integrate patient clinical information with transcriptomic and copy number alterations in breast cancers of the METABRIC project15 to discover prognostic pathways and processes in breast cancer subtypes. Third, we integrate transcriptomic data of normal tissues of the GTEx project5 with ChIP-seq data to infer gene regulatory networks and biological processes downstream of the Hippo pathway of tissue growth control and regeneration. Thus ActivePathways is a versatile method for combining diverse multiomics datasets.

Results

Multi-omics pathway enrichment analysis with ActivePathways

ActivePathways is a simple three-step method that extends our earlier work10 (Fig. 1). It requires two input datasets. The first input is a table of P-values with genes listed in rows and evidence from distinct datasets listed in columns. The columns can include P-values of differential gene expression, gene essentiality, mutation or copy number alteration burden and many others that are derived using platform-specific quantification methods. The second input is a collection of gene sets that represents collective knowledge of gene function and interactions we refer to as pathways. The most common analysis utilizes biological processes from gene ontology16 (GO) and molecular pathways from the Reactome database17. Depending on the hypothesis, these data may also include many other types of gene sets such as targets of transcription factors or microRNAs.

Fig. 1: Method overview.
figure 1

a ActivePathways requires as input (1) a matrix of gene P-values for different omics datasets, and (2) a collection of gene sets corresponding to biological pathways and processes. b In step (1), gene P-values are merged using the Brown procedure and filtered to produce an integrated gene list that combines evidence from datasets and is ranked by decreasing significance with a lenient threshold. In step (2), pathway enrichment analysis is conducted on the integrated gene list using the ranked hypergeometric test that determines the optimal level of enrichment in the ranked gene sub-list for every pathway. In step (3), separate gene lists are compiled from individual input datasets and analyzed for pathway enrichment using the ranked hypergeometric test, to find supporting evidence for each pathway from the integrative analysis. c ActivePathways provides a list of enriched pathways in the integrated gene list, the associated genes with significant Brown P-values, and annotations of evidence supporting each pathway. Results of ActivePathways are visualized as enrichment maps where nodes correspond to pathways and pathways with many shared genes are connected into networks representing broader biological themes.

In the first step of ActivePathways, we derive an integrated gene list that for each input gene aggregates significance from multiple omics datasets. The integrated gene list is compiled by a fusion of gene significance from different omics datasets (i.e., evidence) using the Brown’s extension18 of the Fisher’s combined probability test. The Brown’s method considers dependencies between datasets and thus provides more conservative estimates of significance for genes that are supported by multiple similar omics datasets. The integrated input gene list is then ranked by decreasing significance and filtered using a lenient cut-off, designed to capture additional candidate genes with sub-significant signals while discarding the bulk of insignificant genes (unadjusted Brown Pgene < 0.1). In the second step, a pathway enrichment analysis is conducted on the integrated gene using a ranked hypergeometric test10 and a collection of gene sets (i.e., biological processes, molecular pathways, and other gene annotations). The ranked hypergeometric test is designed to capture smaller pathways tightly associated with few top-ranking genes and also broader processes associated with larger subsets of input genes. The family-wise multiple testing correction method by Holm19 is then applied across tested pathways to select the pathways significantly enriched in the integrated gene list (Qpathway < 0.05). In the third step, we perform a similar analysis on the gene lists of individual omics datasets separately to determine the omics evidence supporting the integrative pathway analysis results determined in step 2. Importantly, the third step also highlights pathways that are only found through data integration and are not apparent in any single omics dataset separately, providing the added value of integrated analysis. Finally, the method provides input files for Enrichment Map20 for visualizing resulting pathways with the corresponding omics evidence.

Integrating coding and non-coding drivers in 2658 cancer genomes

We performed an integrative pathway analysis of driver genes predicted in the PCAWG project based on somatic SNVs and indels. This analysis comprised 29 cancer patients cohorts of histological tumor types and 18 meta-cohorts combining multiple types of tumors, with 47 cohorts in total (Supplementary Table 1). ActivePathways identified at least one significantly enriched process or pathway in 89% of these cohorts (42/47, Qpathway < 0.05, ranked hypergeometric test) (Fig. 2a). We analyzed the evidence supporting predictions of enriched pathways: most cohorts showed enrichments in pathways supported by protein-coding mutations in genes (37/47 or 79%). This serves as a positive control since the majority of currently-known cancer driver genes have frequent protein-coding mutations.

Fig. 2: Pathway enrichment analysis of cancer driver genes with ActivePathways.
figure 2

a Bar plot shows number of significantly enriched pathways (Q < 0.05) among predicted driver genes with coding and non-coding mutations in the PCAWG dataset. The majority of pathways detected by ActivePathways are supported by protein-coding mutations, as expected (dark green bars), while non-coding mutations (orange, red) reveal additional pathways. Pathways shown in dark red are found only in the integrated gene list of coding and non-coding mutations but not in gene lists of individual mutation scores. b Enrichment map shows pathways enriched in frequently mutated genes in the adenocarcinoma cohort of 1773 tumors. Nodes in the network represent pathways and similar pathways with many common genes are connected. Groups of similar pathways are indicated. Nodes are colored by supporting evidence from coding and non-coding cancer mutations. Arrow indicates kidney developmental processes. c The group of enriched kidney developmental processes is apparent from integrated evidence of coding and non-coding mutations but is not found among coding or non-coding candidate genes separately. d Heatmap shows P-values of driver genes involved in kidney developmental processes, including driver genes found in the driver analysis (indicated with #) and additional genes only found in the pathway analysis. Top row shows merged P-values from the Brown procedure. Genes listed in the Cancer Gene Census (CGC) database are indicated in boldface letters. e Pathway analysis recovers most genes of the driver list from PCAWG (orange asterisks), as well as additional infrequently mutated genes apparent due to their pathway associations. Additional known cancer genes detected in the pathway analysis are listed (green dots) and occur more frequently than expected from chance alone.

Non-coding mutations in genes also contributed broadly to the discovery of frequently mutated biological processes and pathways: 24/47 cohorts (51%) showed significantly enriched pathways that were apparent when only analyzing non-coding driver scores corresponding to UTRs, promoters or enhancers. The majority of PCAWG tumor cohorts (41/47 or 87%) revealed some frequently mutated pathways that were apparent when integrating coding and non-coding mutations however remained undetected when considering either coding or non-coding mutations separately, emphasizing the value of our integrative approach. As expected, cohorts with more patient tumor samples generated more significantly enriched pathways (Spearman ρ = 0.74, P = 2.3 × 10−9; Supplementary Fig. 1), suggesting that larger datasets are better powered to distinguish rarely mutated genes involved in biological pathways and processes. Discovery of pathways enriched in non-coding mutations suggests that our integrative pathway analysis is an attractive strategy for illuminating the dark matter of the non-coding cancer genome.

We studied the adenocarcinoma cohort of 1773 samples of 16 tumor types. Integrative pathway analysis highlighted 432 genes that were significantly enriched in 526 pathways (Qpathway < 0.05) (Fig. 2b). As expected, the majority of pathways were only supported by genes with frequent coding mutations (328/526 or 62%). However, an additional set of 101 pathways (19%) was supported by both coding and non-coding gene mutations, 72 pathways (14%) were only apparent in the integrated analysis of both coding and non-coding mutations, and 25 (5%) were only enriched in non-coding mutations. Accumulation of these individually infrequent non-coding mutations into relevant pathways and processes is apparent in our integrative analysis and remains undetected in a gene-focused analysis.

The major biological themes with frequent protein-coding mutations included hallmark cancer processes such as ‘apoptotic signaling’ (24 genes; Qpathway = 4.3 × 10−5) and ‘mitotic cell cycle’ (8 genes; Qpathway = 0.0026), and additional biological processes such as chromatin modification and RNA splicing that are increasingly recognized in cancer biology. Thus, ActivePathways captures the expected cancer pathways enriched in driver genes with protein-coding mutations as positive controls. In contrast to these solely protein-coding driver associations, a large group of developmental processes and signal transduction pathways was detected as enriched in both coding and non-coding mutations in genes; for example ‘embryo development process’ was supported by mutations in exons, 3′UTRs and gene promoters (68 genes; Qpathway = 2.9 × 10−12), while the Reactome pathway ‘repression of WNT target genes’ was only apparent in the integrated analysis of coding and non-coding mutations but not in either dataset alone (5 genes, Qpathway = 0.016). In summary, these data show that ActivePathways is a sensitive approach for integrating multi-omics signals such as coding and non-coding mutations, interpreting supporting omics evidence, and finding additional functional associations that are not apparent in any single input dataset.

Pathway-based prioritization of rarely mutated cancer genes

Pathway analysis can identify candidate genes that would otherwise remain undetected in gene-based analyses. ActivePathways enhances such discovery by integrating signals across multiple datasets. In the pathway analysis of coding and non-coding mutations in PCAWG, we focused on a group of processes involved in kidney development that were exclusively detected through the integration of coding and non-coding mutations (Fig. 2c, d). ActivePathways found 18 genes involved in these processes, only five of which were predicted as driver genes in the consensus driver analysis of the PCAWG project13. Additional known cancer genes included the oncogene MYC with 13 patients with 3′UTR mutations (driver P-value PUTR3 = 4.8 × 10−4), the transcription factor SMAD3 of the TGF-β pathway with 14 patients with protein-coding mutations (PCDS = 4.0 × 10−4) and the growth inhibitory tumor suppressor gene TSC1 with 23 patients with protein-coding mutations (PCDS = 1.4 × 10−4) as well as candidate cancer genes such as IQGAP1 with ten patients with promoter mutations (Ppromoter = 8.2 × 10−4) that encodes a signaling protein involved in cell motility and morphology. The additional genes remained below the FDR-adjusted significance cut-off in the gene-focused consensus driver analysis (Q = 0.17–0.62), however were found by ActivePathways due to their associations with kidney development. Thus ActivePathways can exploit functional gene annotations and multiple omics signals to find further candidate genes that remain undetected in gene-focused analyses.

We evaluated the effects of our data integration strategy and examined all 333 pathway-associated candidate genes detected in the adenocarcinoma cohort (Fig. 2e). As expected, these included a considerable proportion of known cancer genes. First, as positive controls we found 60/64 significantly mutated genes that were also identified in the PCAWG consensus driver analysis13. In addition, we found a set of 47/333 known cancer genes annotated in the COSMIC Cancer Gene Census database11 that were not detected in the driver analysis, significantly more than expected by chance alone (seven genes expected, Fisher’s exact P = 4.0 × 10−24). Those included well-known cancer driver genes MYC, IDH1, NF1, and BCL9. ActivePathways was able to detect these additional genes for several reasons. First, the integrated gene list was filtered using a lenient statistical cut-off in ActivePathways (Pgene < 0.1) that allowed a long tail of 273/333 genes with less-frequent mutations to be detected through pathway associations. Second, certain genes were upgraded through the data fusion procedure as a single stronger P-value per gene was derived by combining multiple weaker P-values corresponding to the coding regions, promoters, UTRs, enhancers of the gene. This affected 17/333 pathway-associated genes including six known cancer genes (HNRNPA2B1, STAG2, TCF7L2, SUZ12, CLTC, and ZNF521), Thus, the integration procedure prioritized specific pathway-related genes compared to their original rankings in individual mutation datasets. However, the majority of genes showed reduced significance after the fusion and were excluded from the pathway analysis: 3543 genes had at least one significant P-value prior to data fusion (Pgene < 0.1) while 88% of these (3112) were considered insignificant following the Brown P-value combination step. In contrast, the majority of pathway-associated genes (220/333) showed improved rankings in the integrated gene list compared to their original rankings in individual input datasets. This formal combination of P-values across omics datasets is therefore more conservative than a naïve approach of selecting a top P-value for every gene. Thus, ActivePathways finds additional candidate genes that remain undetected in gene-by-gene analyses and are highlighted due to their multiple omics signals in pathways.

Integrating prognostic CNA and mRNA signals in breast cancer

To demonstrate an integrative analysis of patient clinical information with multiple types of omics data, we studied the pathways and processes associated with patient prognosis in breast cancer. We leveraged the METABRIC dataset15 of 1780 breast cancer samples drawn from all four subtypes (HER2-enriched, basal-like, luminal-A, luminal-B) and evaluated all genes using three types of prognostic evidence. mRNA abundance profiles were deconvolved between mRNA abundance levels in tumor cells (TC) and tumor-adjacent cells (TAC) using the ISOpure algorithm21,22. mRNA values were associated with patient survival using median dichotomization and log-rank tests. Gene copy number alterations (CNA) were included as the third type of evidence and associated with patient survival using log-rank tests.

ActivePathways identified 192 significantly enriched GO biological processes and Reactome pathways across the four subtypes of breast cancer, of which nine pathways were enriched in multiple cancer subtypes and 33 pathways were only apparent through the integrative pathway analysis but not in any of the CNA or mRNA datasets alone. The major findings enriched in prognostic signatures in breast cancer subtypes involved the processes and pathways of immune response, apoptosis, ribosome biogenesis and chromosome segregation (Fig. 3a).

Fig. 3: Prognosis-associated pathways in four molecular subtypes of breast cancer.
figure 3

a Enrichment maps of prognostic pathways and processes were found in an integrative analysis of mRNA abundance in tumor cells (TC), tumor-adjacent cells (TAC), and gene copy number alterations (CNA) of the METABRIC dataset. Multicolored nodes indicate pathways that were prognostic according to several types of molecular evidence. Blue nodes indicate pathways that were only apparent through merging of molecular signals. b Hazard ratios (HR) of prognostic genes related to immune system development in basal and HER2-enriched subtypes of breast cancer. Strongest HR value of TC, TAC is shown. Genes commonly found in basal and HER2-enriched tumors are shown. Known cancer genes are shown in boldface. c Heatmap shows genes, corresponding log-rank P-values, and merged Brown P-values related to the GO process ‘negative regulation of apoptotic process’ that was found by integrating prognostic omics data in HER2-enriched breast cancer. d Kaplan–Meier plots show the strongest prognostic signal related to apoptotic signaling, the phosphatase DUSP1 that significantly associates with reduced patient survival through increased tumor-adjacent mRNA level (left), increased tumor mRNA level (center) and gene copy number amplification (right). Log-rank P-values are shown.

Immune activity was associated with prognostic genes in basal-like and HER2-enriched breast cancers with significant enrichment of GO processes such as ‘immune system development’ (Qbasal = 3.0 × 10−4, 113 genes; QHER2 = 0.035, 61 genes; ranked hypergeometric test) and ‘lymphocyte differentiation’ (QHER2 = 6.8 × 10−4, 46 genes; Qbasal = 8.4 × 10−4, 45 genes). The majority of related genes were associated with improved patient prognosis upon increased mRNA abundance in tumor cells or tumor-adjacent cells, comprising 50/61 genes in the HER2-enriched subtype and 78/113 genes in the basal subtype (Fig. 3b). Interestingly, only a minority of these genes (10) were significant in both of the two breast cancer subtypes, suggesting different modes of immune activity in subtypes and emphasizing the power of our pathway-based approach. Basal-like breast cancers were associated with additional 67 terms involving immune response and blood cells, however no immune-related terms were enriched for luminal subtypes of breast cancers. Prognostic features of immune-related genes in HER2-enriched and basal-like breast cancers are well known23,24. Our pathway-based findings indicate that immune activity in breast tumor cells and in the surrounding microenvironment negatively affects tumor progression and improves prognosis.

Apoptosis was associated with patient prognosis in HER2-enriched and luminal-A breast cancers through enriched GO processes such as ‘negative regulation of apoptotic process’ (QHER2 = 0.030, 122 genes; QluminalA = 0.015, 228 genes) and ‘programmed cell death’ (QHER2 = 0.015, 125 genes; QluminalA = 0.016, 231 genes) (Fig. 3c). Interestingly, anti-apoptotic pathways were only detected in the integrative analysis and not in genomic and transcriptomic gene signatures separately. Among the genes negatively regulating apoptosis, DUSP1 provided the strongest prognostic signal in HER2-enriched breast cancers. This was apparent in the molecular stratification of samples by mRNA of tumor cells (log-rank PTC = 0.019, HR = 1.5) and tumor-adjacent cells (PTAC = 8.3 × 10−4, HR = 1.8) as well as gene copy number amplifications (PCNA = 9.8 × 10−4, HR = 2.8) (Fig. 3d). DUSP1 encodes a phosphatase signaling protein of the MAPK pathway that is over-expressed in malignant breast cancer cells and inhibits apoptotic signaling25. HER2 over-expression is known to suppress apoptosis in breast cancer26. Anti-apoptotic signaling is a hallmark of cancer and expectedly associated with worse patient prognosis.

ActivePathways also identified prognostic pathway associations that were only apparent in single subtypes of breast cancer. For example, the prognostic genes for luminal-B subtype were enriched in processes related to ‘chromosome segregation’ (QluminalB = 0.017, 41 genes) that have been associated with worse outcome in breast cancer27. As another example, luminal-A breast cancers were associated with prognosis in ribosomal and RNA processing genes, such as ‘ribosome biogenesis’ (QluminalA = 6.9 × 10−10, 60 genes) and ‘rRNA metabolic process’ (QluminalA = 1.8 × 10−13, 64 genes). Although not described specifically in the luminal-A subtype, ribosomal mRNA abundance has been shown to be prognostic in breast cancer as a marker of cell proliferation28,29. In summary, ActivePathways can be used for integrating clinical data with multiomics information of molecular alterations. Such analyses can provide leads for functional studies and biomarker development.

Interpreting co-expressed and DNA-bound targets of Hippo TFs

To demonstrate the use of ActivePathways for studying gene regulation, we analyzed transcriptomes of non-cancerous human tissues from the GTEx project5. We focused on the Hippo signaling pathway involved in organ size control, tissue homeostasis and cancer30,31 and studied regulatory networks downstream of the two transcription factors (TFs) YAP and TAZ (encoded by YAP1 and WWTR1). YAP and TAZ are evolutionarily conserved master regulators of Hippo signaling in mammals that respond to intracellular and extracellular signals of cell–cell interactions, cell polarity, mechanical cues, G protein-coupled receptor signaling, and cellular energy status30,31.

We performed an integrative pathway enrichment analysis of transcriptomics and epigenomics data of the two master regulators of the Hippo pathway. First, we predicted transcriptional target genes for YAP and TAZ (1898 and 1319 genes, respectively), using co-expression analysis and robust rank aggregration32 over 9642 transcriptomes of 40 tissue types in GTEx (Q < 0.05). Second, we studied the set of 2356 target genes of YAP that have DNA-binding sites at gene promoters derived from a YAP ChIP-seq study33 reanalyzed in the ReMap database34 (Q < 0.05). The three gene lists with corresponding significance values were used as input to ActivePathways for the integrative analysis.

Integrative analysis of transcriptional and DNA-binding target genes of YAP/TAZ resulted in 225 significantly enriched GO processes and Reactome pathways (Qpathway < 0.05) (Fig. 4a). The resulting pathways are expected in the context of Hippo signaling and included development and morphogenesis, cell motility, organization of actin cytoskeleton and cell–cell junctions, signal transduction pathways such as EGFR, Wnt, Robo, TGF-beta, rho GTPase, and others. ActivePathways highlighted 2066 genes in these pathways. We examined known members of the Hippo pathway and found 13 of 32 genes among pathway-associated genes (WWTR1, LATS1, WWC1, YWHAQ, YAP1, MAPK8, MAPK9, STK3, TNIK, TEAD3, MAP4K3, MINK1, TEAD1), more than expected from chance alone (four genes expected, P = 3.6 × 10−5, Fisher’s exact test). We also compiled an extended list of 106 Hippo-related genes30,31,35 and confirmed that these were enriched in pathway-associated genes (36/308 genes expected, P = 9.1 × 10−26, Fisher’s exact test) (Fig. 4b). A large fraction of pathways (55 or 24%) was only identified in the integrated analysis and not in any input dataset alone, underlining the advantage of detecting significantly enriched pathways across multiple complementary omics datasets. Similarly, Hippo-related genes were either supported by ChIP-seq data alone (54 genes), mRNA data alone (30 genes), or both by mRNA and ChIP-seq data (22 genes), concordant with the notion that RNA-seq and ChIP-seq show limited agreement regarding TF target genes and provide complementary insights into gene regulation. Thus, our integrative analysis of co-expression networks and TF–DNA interactions of Hippo master regulators expectedly converges on Hippo-related genes and pathways.

Fig. 4: Integrative pathway enrichment analysis of Hippo target genes across human tissues.
figure 4

a Enrichment map of GO processes and Reactome pathways enriched in the target genes of transcriptional regulators YAP and TAZ of the Hippo pathway. Co-expressed target genes of YAP and TAZ across normal human tissues of the GTEx dataset (pathways are shown in green and yellow, respectively) and DNA-binding target genes of YAP from ChIP-seq experiments (shown in blue) were analyzed. Pathways only found through the integration of ChIP-seq and RNA-seq data are shown in red. b Euler diagram shows 106 Hippo-related genes that were significantly enriched in the detected pathways and supported by a combination of signals in RNA-seq and ChIP-seq datasets. Core Hippo genes detected in the analysis are listed. c Regulatory network of 17 TFs and 1,426 target genes detected in the ActivePathways analysis of gene sets representing transcription factor target genes. Transcription factors with enriched target genes in the YAP/TAZ regulome are shown in multi-colored circles. Target genes are colored by increasing statistical significance (turquoise to red). d Integrative analysis of pathways and GO processes is complementary to the analysis of transcription factor targets. Euler diagram shows total number of pathway-associated identified in the analysis of GO and Reactome terms (left) and TF target genes from ENCODE (right). Numbers of Hippo-related genes are shown in brackets.

In addition to GO terms and pathways, ActivePathways can be used to interpret omics data with other classes of functional gene sets such as TF target genes. To further elucidate gene regulatory networks downstream of YAP/TAZ, we considered potential enrichments of DNA-binding target genes of 161 TFs profiled in ChIP-seq studies of the ENCODE project36. We found a regulatory network of 17 TFs and 1426 target genes enriched in the YAP/TAZ regulome (Q < 0.05, ranked hypergeometric test) (Fig. 4c). These included the DNA-binding targets of the master regulators of pluripotent stem cells37 NANOG (208/774 target genes, Q = 1.3 × 10−6) and POU5F1 (107/406 target genes, Q = 7.3 × 10−11). This finding is in agreement with the role of Hippo pathway activity in stem cell regulation38. The regulatory network was significantly enriched in 50 Hippo-related genes (25 expected, P = 1.2 × 10−7, Fisher’s exact test) and six core Hippo genes (two expected, P = 0.030; WWTR1, VGLL4, TNIK, MAPK8, MOB1A, LATS1), similarly to the pathway-based analysis above. However, the two analyses revealed distinct genes: 886 genes were commonly found by both analyses, 1180 genes were found only in the pathway-based analysis, and 540 genes were found only in the TF-based analysis (Fig. 4d). Thus, integrative enrichment analysis of TF target genes provides complementary information to GO terms and pathways. In summary, this analysis highlights genes and pathways related to Hippo signaling in human tissues and demonstrates the use of ActivePathways for studying gene regulatory networks across complementary omics datasets and technology platforms.

Evaluating the robustness and sensitivity of ActivePathways

We carefully benchmarked ActivePathways using the dataset of cancer driver genes predicted by PCAWG13. First, we compared the performance of ActivePathways with six methods used in the PCAWG pathway and network analysis working group14 (Hierarchical HotNet39,40, SSA−ME41, NBDI42, induced subnetwork analysis40, CanIsoNet43, hypergeometric test). These diverse methods used molecular interaction networks, functional gene sets and/or transcriptomics data to analyze the PCAWG pan-cancer dataset of predicted cancer driver genes. A subsequent consensus analysis defined pathway-implicated driver gene lists with protein-coding and non-coding mutations, based on a majority vote of the pathway and network analysis methods14. ActivePathways recovered these consensus gene lists with the highest accuracy: 100% of coding driver genes (87/87) and 85% of non-coding candidate genes (79/93) were detected (Fig. 5a). Thus ActivePathways agreed the most with the ensemble of several other methods in prioritizing known and candidate cancer driver genes using pathway and network context.

Fig. 5: Benchmarking of ActivePathways.
figure 5

a Comparison of ActivePathways with six additional pathway and network analysis methods used in the PCAWG pathway and network consensus analysis. ActivePathways best recovers the consensus lists of pathway-implicated driver genes with coding and non-coding mutations (indicated by asterisk). The consensus lists are shown in the leftmost bars of the plot and have been compiled through a majority vote of the seven methods in the PCAWG pathway and network analysis working group. b Comparison of ActivePathways (leftmost bars) and common pathway enrichment analysis using multiple significance cut-offs of PCAWG gene lists with protein-coding and non-coding mutations. ActivePathways shows increased sensitivity of pathway analysis even at the most lenient gene list significance cut-offs and recovers additional pathways only detected through integration of multiple datasets (dark red).

We also compared the performance of ActivePathways with a standard approach of pathway enrichment analysis that considers a single statistically-filtered gene list using a ranked hypergeometric test1,10 (Fig. 5b). To this end, we analyzed individual gene lists of protein-coding and non-coding drivers of the PCAWG adenocarcinoma cohort using multiple gene selection thresholds (Q < 0.05, Q < 0.1, Q < 0.25, P < 0.1). ActivePathways showed increased sensitivity of pathway enrichment analysis compared to the standard approach, in particular for pathways involving non-coding mutations that were not promimently represented in any single gene lists. Even compared to the analysis employing the most lenient gene selection filter (P < 0.1), ActivePathways identified 72 additional pathways that were only apparent through the integration of coding and non-coding mutations and remained undetected in the analysis of individual datasets. Thus, our method provides additional information to common approaches that focus on single gene lists.

We evaluated the robustness of ActivePathways to parameter variations and missing data. We varied the parameter Pgene that determines the ranked gene lists used in the pathway enrichment analysis (default threshold Pgene < 0.1). The majority of PCAWG cohorts (40/47 or 85%) retrieved significantly enriched pathways even with a considerably more stringent threshold (Pgene < 0.001), however 67% fewer pathways were found compared to the default threshold in the median cohort (Supplementary Fig. 2). We then evaluated the robustness of ActivePathways to missing data by randomly removing subsets of driver scores from the initial dataset. Even when removing 50% of gene driver P-values with P < 0.001, the majority of cohorts (37/47 or 79%) had at least one significantly enriched pathway, however 66% fewer pathways were found on average (Supplementary Fig. 3).

We evaluated the expected false positive rates of ActivePathways. We tested 1,000 simulated datasets for each of 47 patient cohorts and expectedly found no significant pathways in 92% of these simulations (Supplementary Fig. 4). Simulated data were obtained by randomly reassigning P-values of driver predictions to different genomic elements, a conservative approach that disrupts pathway annotations of genes while retaining the presence of strong P-values observed in the real data. The median family-wise false discovery rate of ActivePathways computed across cohorts (7.2%) slightly exceeded the applied multiple testing correction (Q < 0.05). Higher rates were observed in cohorts including melanoma tumors, potentially due to abundant promoter mutations caused by impaired nucleotide excision repair in protein-bound genomic regions44. We evaluated quantile-quantile (QQ) plots of pathway-based ranked hypergeometric P-values from ActivePathways and found that these often deviated from the expected uniform distribution and appeared statistically inflated (Supplementary Fig. 5). However, P-values derived from simulated gene scores showed no inflation in our simulations. Anticipating that the most significant cancer driver genes associated with protein-coding mutations, we performed partial simulations. We constructed datasets with simulated gene P-values for protein-coding mutations and true P-values for non-coding mutations. As expected, partially simulated datasets showed a lesser extent of P-value inflation, suggesting that highly significant known cancer genes are responsible for inflation due to their involvement in many pathways. Statistical testing of highly redundant pathways and processes violates the independence assumption of statistical tests and multiple testing procedures, a known caveat of pathway enrichment analysis1, which likely explains the observed distribution of significance values of our method. Collectively, these benchmarks show that ActivePathways is a sensitive and robust method for detecting significantly enriched pathways and processes through integrative analysis of multi-omics data.

Discussion

Integrative pathway enrichment analysis helps distill thousands of high-throughput measurements to a smaller number of pathways and biological themes that are most characteristic of the experimental data at hand, ideally leading to mechanistic insights and candidate genes for follow-up studies. In particular, a joint analysis of complementary datasets often leads to insights that are unavailable in any particular dataset. ActivePathways provides a generally available framework for systematically prioritizing genes and pathways across multiple omics datasets that utilizes fusion of gene significance. This allows us to identify pathways and processes that stand out when the data are combined, yet are not apparent in any single analyzed dataset. The use of ActivePathways is demonstrated in the three case studies above. In our example of cancer driver discovery, pathway analysis provides complementary insights to gene-focused driver discovery as it also focuses on sub-significant genes that are clustered into relevant biological processes and carry coding and non-coding mutations. In the integrative analysis of molecular alterations in breast cancer subtypes, we find a spectrum of genes and pathways whose molecular signatures in the tumors or the microenvironment have potential prognostic significance. A subset of these findings, such as anti-apoptotic signaling, is only apparent through data integration. In the final example, we use ActivePathways to associate the co-expression and DNA-binding networks of Hippo master regulators to downstream pathways and processes with multi-omics evidence.

Our general pathway analysis strategy is applicable to diverse kinds of datasets where well-calibrated P-values are available for the entire set of genes or proteins. A multi-omics study may quantify genes using a series of genomic, transcriptomic and proteomic experiments and compute corresponding P-values. Data from epigenomic experiments and genome-wide association studies can be also analyzed after signals have been appropriately mapped to genes, for example by identifying ChIP-seq peaks in gene promoters similarly to our GTEx analysis. Clinical and phenotypic information of patients can be also included through association and survival statistics, as shown in our analysis of prognostic signatures of breast cancer subtypes. Our method is expected to work with raw, unadjusted P-values and also with Q-values adjusted for multiple testing, however it is primarily intended for unadjusted P-values for increased sensitivity. ActivePathways conducts multiple testing correction at the pathway level and reports significantly enriched pathways at a family-wise error rate cutoff, regardless of the gene-specific multiple testing correction applied upstream. Quantification of genes and proteins through P-values is more robust than quantification through their abundance measures such as counts or fold-changes. P-values provided to ActivePathways need to be computed using dedicated methods for individual omics platforms such that inherent biases in the data are accounted for prior to pathway analysis. In our example of cancer driver discovery, appropriately computed P-values of driver gene predictions from the PCAWG project13 account for confounding factors of somatic mutations, such as gene sequence length and nucleotide content, mutation signatures active in different types of tumors45 and biological correlates of mutation frequency such as transcription and replication timing46. On the one hand, considering all such variations directly in the pathway enrichment analysis would require substantially more complex models. On the other hand, directly analyzing pathways using simpler metrics (such as mutation counts or frequencies) would propagate any upstream biases to the pathway enrichment analysis and cause challenges with false positives and data interpretation. Thius, given appropriately derived P-values for genes, proteins and other molecules, ActivePathways can be applied to a wide range of analyses.

Our method comes with important caveats. First, we only evaluate genes and proteins annotated in pathway databases. Such databases have variable coverage, rely on frequent data updates47 and may miss sparsely annotated candidate genes. The most general type of pathway enrichment analysis considers biological processes and molecular pathways however many kinds of gene sets available in resources such as MSigDB48 can be used to expand the scope of ActivePathways. Second, pathway information is highly redundant and analyses of rich molecular datasets often result in many significant results reflecting the same underlying pathway. We address this redundancy by visualizing and summarizing pathway results as enrichment maps1,20 that summarize multiple similar pathways and processes into general biological themes. Statistical inflation of results accompanied by biological redundancy is addressed by a stringent multiple testing correction. Third, the analysis treats pathways as gene sets and does not consider interactions of genes in pathways. This simplified strategy allows us to consider a wider repertoire of pathways and processes as reliable mechanistic interactions are often context-specific and limited to a small subset of well-studied signaling pathways. Several advanced methods such as HotNet39, PARADIGM49, and GeneMania50 model pathways through gene and protein interactions.

Translation of discoveries into improved human health through actionable mechanistic insights, biomarkers, and molecular therapies is a long-standing goal of biomedical research. For example, next-generation cancer genomics projects such as ICGC-ARGO aim to collect multi-omics datasets with detailed clinical profiles that will present new challenges for pathway and network analysis techniques. In summary, ActivePathways is integrative pathway analysis method that improves systems-level understanding of cellular organization in health and disease.

Methods

Integrated and evidence-based gene lists

The main input of ActivePathways is a matrix of P-values where rows include all genes of a genome and columns correspond to evidence from omics datasets. To interpret multiple omics datasets, a combined P-value is computed for each gene using a data fusion approach, resulting in an integrated gene list. The integrated gene list is computed by merging all P-values of a given gene into one combined P-value using the Brown’s extension18 of the Fisher’s combined probability test that accounts for overall covariation of P-values from different sources of evidence. The integrated gene list of Brown P-values is then ranked in order of decreasing significance and filtered using a lenient threshold (unadjusted P < 0.1 by default). Evidence-based gene lists representing different omics datasets are based on ranked P-values from individual columns of the input matrix and filtered using the same significance threshold.

Statistical enrichment of pathways

Statistical enrichment of pathways in ranked lists of candidate genes is carried out with the ranked hypergeometric test. The test considers one pathway gene set at a time and analyzes increasing subsets of input genes from the top of the ranked gene list. The same procedure is used for integrated and evidence-based gene lists. At each iteration, the test computes the hypergeometric enrichment statistic and P-value for the set of genes shared by the pathway and top sub-list of the input gene list. For optimal processing speed, only gene lists ending with a pathway-related gene are considered. The ranked hypergeometric statistic selected the input gene sub-list that achieves the strongest enrichment and the smallest P-value as the final result for the given pathway, as:

$$\left( {P_{{{{\mathrm{pathway}}}}},G} \right) = \left\{ {{{{\mathrm{min}}}},{{{\mathrm{argmin}}}}_n} \right\}\mathop {\sum }\limits_{x = k}^{{{{\mathrm{min}}}}\left( {n,K} \right)} \frac{{\left( {\begin{array}{*{20}{c}} K \\ k \end{array}} \right)\left( {\begin{array}{*{20}{c}} {N - K} \\ {n - k} \end{array}} \right)}}{{\left( {\begin{array}{*{20}{c}} N \\ n \end{array}} \right)}},$$

where Ppathway stands for the hypergeometric P-value of the pathway enrichment at the optimal sub-list of the significance-ranked candidate genes, G represents the length of the optimal sub-list, i.e., the number of top genes from the input gene list, N is the number of protein-coding genes with annotations in the pathway database, i.e., in Gene Ontology and Reactome, K is the total number of genes in a given pathway, n is the number of genes in a given gene sub-list considered, and k is the number of pathway genes in the considered sub-list. For a conservative estimate of pathway enrichment, we consider as background N the universe of genes contained in input gene sets (terms from pathway databases and ontologies) rather than the complete repertoire of protein-coding genes. To obtain candidate genes involved in the pathway of interest, we intersect pathway genes with the optimal sub-list of candidate genes. The ranked hypergeometric P-value is computed for all pathways and resulting P-values are corrected for multiple testing using the Holm-Bonferroni method of family-wise error rate (FWER)19. Significant pathways are reported by default (Q < 0.05).

Evaluating omics evidence of enriched pathways

Each evidence-based gene list derived from a single omics dataset is also analyzed for enriched pathways with the ranked hypergeometric test. Pathways found in the integrated gene list are then labeled for supporting evidence in the case they are also found as significant in any evidence-based gene list. A pathway is considered to be found only through data integration and labeled as combined-only if it is identified as enriched in the integrated gene list but not identified as enriched in any of the evidence-based gene lists at equivalent significance cut-offs (default Q < 0.05). Each detected pathway is additionally annotated with pathway genes apparent in the optimal sub-list of candidate genes separately for the integrated gene list and each evidence-based gene list.

Pathways and processes

We used gene sets corresponding to biological processes of Gene Ontology16 and molecular pathways of the Reactome database17 downloaded from the g:Profiler web server10 on Jan 26th 2018. Large general gene sets with more than a thousand genes and small specific gene sets with less than five genes were excluded to avoid statistical inflation of large gene lists and interpretation challenges with very small lists.

Enrichment map visualization

ActivePathways creates input files for the EnrichmentMap app20 of Cytoscape51 for network visualization of similar pathways and their coloring according to supporting evidence. Enrichment maps for adenocarcinoma driver mutations, prognostic quantification of molecular alterations of breast cancer, and transcriptional networks downstream of Hippo signaling were visualized with stringent pathway similarity scores (Jaccard and overlap combined coefficient 0.6) and manually curated for the most representative groups of similar pathways and processes. Singleton pathways that were redundant with larger groups of pathways were merged with the latter or discarded. Coloring of pathways in the adenocarcinoma enrichment map was rearranged by merging colors of pathways supported by non-coding mutation scores of promoters, enhancers and/or UTRs into one group.

Coding and non-coding mutations of the PCAWG dataset

We used ActivePathways to analyze driver predictions of coding and non-coding mutations across white-listed 2583 whole cancer genomes of the ICGC-TCGA PCAWG project. P-values of driver predictions were computed separately for protein-coding sequences, promoters, enhancers and untranslated regions (UTR3, UTR5 across multiple subsets of samples representing histological tumor types and pan-cancer cohorts as reported in the PCAWG driver discovery study13. We used gene-enhancer mapping predictions provided by PCAWG, excluded enhancers with more than five target genes, and selected the most significant enhancer for each gene, if any. Unadjusted P-values for coding sequences, promoters, enhancers and UTRs were compiled as input matrices and analyzed as described above. Missing P-values were conservatively interpreted as ones. Results from ActivePathways were validated with two lists of cancer genes. Predicted drivers from the PCAWG consensus analysis13 were selected as statistically significant findings (Q < 0.05) following a stringent multiple testing correction spanning all types of elements (exons, UTRs, promoter, enhancers). The curated list of known cancer genes was retrieved from the COSMIC Cancer Gene Census (CGC) database11. One-tailed Fisher’s exact tests were used to estimate enrichment of these genes in our results, using all human protein-coding genes as the statistical background set.

Prognostic CNA and mRNA signals in breast cancer

ActivePathways was used to evaluate prognostic pathways in breast cancer subtypes. Multiple types of omics data were used for an integrative analysis: mRNA gene expression data and gene copy number alteration (CNA) data of the were derived from the METABRIC cohort of 1991 patients with a single primary fresh frozen breast cancer specimen each15. Curtis et al.15 classified the patients into the intrinsic breast cancer subtypes using the PAM50 mRNA-based classifier52 resulting in 330 basal-like breast cancers, 238 HER2-enriched breast cancers, 721 luminal-A breast cancers, 491 luminal-B breast cancers. Using these data, we computationally deconvolved tumor cell (TC) mRNA and tumor adjacent cell (TAC) mRNA abundance levels from the bulk profiled specimens. TC mRNA was deconvolved using the ISOpure21 method release 2010b in MATLAB. TAC mRNA abundance profiles were computed using the ISOpure.calculate.tac function from the R package ISOpureR22 v1.1.2. The deconvolution analyses were performed independently for each breast cancer subtype. The mRNA univariate survival analysis was conducted as follows. For each gene, patients were dichotomized based on mRNA abundance. Dichotomization was either based on the median mRNA abundance for that gene or a fixed value of 6.5. Based on the mRNA abundance distribution of genes on the Y chromosome in female samples, the value 6.5 was estimated as the threshold for noise for non-expressed genes. Median dichotomization was used if the median was above 6.5 or if there were no events in one of the groups when dichotomizing based on 6.5. The high and low groups based on mRNA abundance were compared by univariate log-rank tests for overall patient survival. TC and TAC mRNA abundance values were evaluated independently. Survival modeling was performed in the R statistical environment (v3.4.3) using the survival package (v2.42–3). The CNA univariate survival analysis was conducted as follows. For each gene, we assessed whether more gains or losses were apparent. The copy number status with a higher count was subsequently used to separate patients into two groups: those with the chosen copy number status and the remaining patients. The two groups were then used for overall survival modeling with log-rank tests.

Hippo pathway target genes in mRNA and ChIP-seq data

This analysis included two types of omics data, mRNA abundance measurements from RNA-seq experiments and transcription factor DNA-binding measurements from chromatin immunoprecipitation sequencing (ChIP-seq) experiments. The RNA-seq dataset of human tissues was downloaded from GTEx v7 data portal (https://www.gtexportal.org/home/). The dataset included transcript abundance values of 21,518 protein-coding genes in 11,688 samples across 53 tissues. Tissues with less than 25 available samples and low gene expression (mean TPM < 1.0; transcripts per million) were excluded from further analysis, resulting in 40 tissues and 9672 samples with mRNA abundance profiles of 19,025 genes. Transcriptional target gene lists for the master transcription factors YAP and TAZ (encoded by YAP1, WWTR1) were predicted separately in the following two steps. First, we computed pairwise Pearson correlations between a given TF and all other genes within a tissue of interest and ranked these by the significance P-value of positive correlations. Second, the resulting ranked gene lists were then aggregated into one master target gene list across the GTEx tissues using the robust rank aggregation (RRA) method32 with default parameters (Qgene < 0.05). FDR-adjusted values of genes from RRA for YAP and TAZ were used as the first and second evidence for input of ActivePathways, respectively. For DNA-binding targets of YAP, we retrieved ChIP-seq binding sites in three cell lines (CCLP1, MSTO, and HUCCT1) from an earlier study33 that were reprocessed in the ReMap database34. Binding sites were filtered by statistical significance (Q < 0.05) and mapped to gene promoters of the human genome (hg19) using gene promoters defined in the PCAWG driver analysis13. If a promoter had multiple binding sites, the site with the strongest FDR-value was selected as a representative site for that gene. FDR-adjusted values of genes with YAP DNA-binding sites were used as the third evidence for input of ActivePathways. Significantly enriched pathways among the putative target genes of YAP and TAZ were subsequently detected using ActivePathways. We compiled a list of 308 Hippo-related genes from the KEGG pathway database35 and two recent review papers30,31. To validate the analysis, we tested the overall sets of genes identified by ActivePathways for enrichment of Hippo-related genes using Fisher’s exact tests. The Hippo analysis was conducted separately for two collections of functional gene sets. First, we tested GO biological processes and Reactome pathways similarly to analyses described above. Second, we used gene sets corresponding to transcription factor binding sites (TFBS) derived from the ENCODE project36.

Method benchmarking

We benchmarked ActivePathways using multiple approaches, including simulated datasets of P-values, method parameter variations, and partial replacement of P-values scores with insignificant P-values. Benchmarking was performed on the PCAWG predictions of cancer driver genes with coding and non-coding mutations. To evaluate the false discovery rate of ActivePathways, we created simulated datasets by reassigning all observed P-values to random genes and their genomic elements. Simulations were conducted separately for different tumor cohorts. One thousand simulated datasets were analyzed with ActivePathways and those with at least one significantly detected pathway counted towards false discovery rates. A separate set of simulations maintained the positions of non-coding P-values among genes and randomly reassigned P-values corresponding to protein-coding mutations, expectedly leading to a reduction in detected pathways as the PCAWG datasets primarily included strong P-values for genes with frequent protein-coding mutations. Quantile-quantile analysis and QQ-plots were used to compare P-value distributions of pathways discovered from true P-values, partially shuffled P-values (true non-coding and shuffled protein-coding P-values), and fully shuffled P-values. To evaluate robustness of ActivePathways, we randomly replaced a fraction of significant driver P-values in the true dataset (P < 0.001) with insignificant P-values (P = 1). We tested different fractions of missing values (10, 25, and 50%) across a thousand datasets of randomly selected missing data points. We concluded that most PCAWG cohorts included significantly enriched pathways even with large fractions of missing data. To further evaluate robustness, we tested different values of the Brown P-value threshold used to select the integrated gene list for pathway enrichment analysis. The default parameter value (Pgene < 0.1) was compared to alternative values (0.001, 0.01, 0.05, and 0.2). We concluded that ActivePathways found enriched pathways in most tumor cohorts even at more stringent gene selection levels.

Ethical review

Sequencing of human subjects’ tissue was performed by ICGC and TCGA consortium members under a series of locally approved Institutional Review Board (IRB) protocols as described in Hudson et al. Informed consent was obtained from all human participants. Ethical review of the current data analysis project was granted by the University of Toronto Research Ethics Board (REB) under protocol #30278, “Pan-cancer Analysis of Whole Genomes: PCAWG”.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.