Introduction

Epithelial ovarian carcinoma is a heterogeneous disease, representing approximately 3.7% of all new female cancer diagnoses1. It comprises several distinct histological subtypes (including high- and low-grade serous, clear cell, endometrioid and mucinous), each one displaying different behaviours at both the clinical and molecular levels2. Around 70% of epithelial ovarian tumours are high-grade serous ovarian carcinomas (HGSOC), which are relatively aggressive and have a poor prognosis.

There is a significant genetic component to the risk of ovarian carcinoma3, with germline mutations in BRCA1 and BRCA2 identifiable in 11–23% of affected women with HGSOC4,5, rising to as high as 42% of affected women with a family history of two or more ovarian carcinomas6. Other genes make a smaller contribution to HGSOC risk (e.g. RAD51C, RAD51D, BRIP17,8,9,10,11), but the hereditary basis of approximately 50% of cases remains unexplained3, which compromises risk management for these women and their families.

Efforts to identify additional moderate-to-high-risk hereditary breast and ovarian cancer (HBOC) genes have largely been restricted to candidate gene approaches using targeted next-generation sequencing (NGS) panels of known cancer predisposition genes12,13,14,15,16,17,18, which have collectively only resolved a very small proportion of unexplained families. Although three studies utilised data from whole-exome sequencing (WES) of BRCA1 and BRCA2-negative ovarian carcinoma patients19,20,21, these analysed only a subset of candidate genes in the available data and included non-HGSOC tumour types in their case cohorts. Others utilised germline sequencing data from The Cancer Genome Atlas (TCGA)22,23,24,25, but this approach is limited by the diverse technologies used to generate TCGA data along with the absence of any linked family history information. None of the previous studies have identified candidate HBOC genes that have been validated in multiple independent studies; nor has there been any consistency of the candidates identified across different studies.

As a first step in resolving the missing heritability of ovarian carcinoma, we present WES data from a large cohort of women diagnosed with HGSOC, who were tested through a familial cancer clinic but returned negative findings for the BRCA1 and BRCA2 genes. Our results indicate that familial HGSOC is enriched for rare protein-coding loss-of-function (LoF) variants, but displays high genetic heterogeneity, with no single proposed candidate gene identified in our cohort found in more than 2.4% of cases. These genes are functionally diverse, with only a small number associated with DNA repair as with other known HGSOC predisposition genes, suggesting that much of the remaining missing heritability may lie in genes and pathways that are currently overlooked.

Results

Exome sequencing and variant filtering

Whole-exome sequencing was successfully performed on all germline DNA samples to an average depth of 126× with 98.4% of the bases covered to >20×. Principal component analysis (PCA) was performed using common single nucleotide polymorphisms (SNPs), demonstrating that over 95% of participants were of Western European origin (Supplementary Fig. 1). Numerous quality and variant frequency filters (as summarised in Fig. 1) were applied to the data to remove artefacts, common variants and lower-impact variants that are unlikely to represent moderate-to-high-risk alleles. Implementing these filters left 6733 unique, rare ‘HIGH’ impact variants in 4901 genes.

Fig. 1: Flowchart illustrating the filtering, ranking, prioritisation and curation steps used on the processed exome variant (vcf) data.

Steps performed in the post-sequencing pipeline (i.e. alignment of FASTQ reads, variant calling and annotation) are not displayed. Numerical figures refer to unique variants and genes. LoF loss of function, VEP Variant Effect Predictor, MAF minor allele frequency, RF failed random forests filter.

Variants in known and proposed ovarian carcinoma risk genes

Sequence data were analysed for deleterious variants in known ovarian carcinoma predisposition genes, including RAD51C, RAD51D7,8, BRIP19 and the Lynch syndrome genes (MLH1, MSH2, MSH6, PMS2)5. As expected, no BRCA1 or BRCA2 variants were identified in this pre-screened group and only six of the 516 cases (1.2%) had clinically actionable variants in one of the other genes (Table 1). Five individuals carried LoF variants in one of MSH6, RAD51C, RAD51D or BRIP1, and one had a likely pathogenic missense variant in RAD51C26,27. These six cases were removed from the discovery cohort, since the presence of deleterious variants in one of these genes is likely to explain their personal and family history of cancer.

Table 1 Known ovarian carcinoma predisposition genes with loss-of-function (LoF) and deleterious missense variants in the total case cohort.

Amongst the remaining 510 cases, 28 individuals (5.5%) had a LoF or known deleterious missense variant in 16 genes that have been proposed as ovarian cancer predisposition genes and are commonly included on HBOC gene testing panels (Table 2). After applying Fisher’s exact tests as described below, only PALB2, ATM and MRE11A were enriched for LoF variants in the cases compared to GnomAD, although the number of variants and cases was small, and caution should be exercised interpreting the odds ratios as risk estimates. As it is currently unclear whether variants in these genes have a genuine role in HGSOC predisposition, these individuals were retained in the discovery cohort for subsequent analysis.

Table 2 Proposed ovarian carcinoma predisposition genes with loss-of-function (LoF) and known deleterious missense variants in the discovery case cohort.

Analysis of ranked candidate genes and variants of interest

To assess for variant enrichment, the gene-level frequency of ‘HIGH’ impact variants in the remaining 510 cases was compared to the gene-level frequency in the GnomAD sub-population (n = 59,095), as detailed in the Methods. Overall, for all protein-coding genes represented on the WES panel (n = 19,818), there was a significantly higher number of rare LoF variants in the cases compared to GnomAD (p < 0.0001, chi-squared test). Two-tailed Fisher’s exact tests were performed to rank genes by level of enrichment (as represented by their p values), and plotting their distribution (Fig. 2) demonstrated a significantly greater number of genes enriched for rare LoF variants (n = 133, OR > 1 and p < 0.01) compared to genes depleted for rare LoF variants (n = 19, OR < 1 and p < 0.01) in the cases vs. GnomAD (p < 0.0001, chi-squared test).

Fig. 2: Waterfall bar chart displaying the degree of enrichment for all protein-coding genes represented on the WES panel (n = 19,818) with filtered LoF variants in comparison to GnomAD, ordered by decreasing log10 p value. Any negative log10 p values for genes with calculated odds ratios > 1 were transformed into positive values prior to plotting.

Genes with log10 p values of 0 (equivalent to a p value of 1 i.e. no difference to GnomAD frequency) were not plotted. Shaded areas to left and right of dashed lines represent genes with p values < 0.01 and odds ratios > 1 (n = 133) or < 1 (n = 19), respectively. Genes are labelled on the x-axis every 50 rows from the ordered list for illustration only.

To identify the most likely candidates with an excess of LoF variants from amongst the remaining 4863 genes (Supplementary Data 1), a number of additional steps were applied (Fig. 1). First, the Benjamini–Hochberg procedure28 for multiple testing was used on the ranked list of Fisher’s test p values to establish a ‘discovery’ threshold of 0.0094 (number of p values = 4863, false discovery rate = 0.3). Next, only protein-coding genes enriched with rare LoF variants (in any of the major GnomAD sub-populations) by at least three-fold in the cases were retained, reducing the list to 1700 unique LoF variants in 1307 genes amongst 491 individuals. Of these genes, the vast majority had a LoF variant in just one individual (n = 942) with most of the remainder occurring in 2–4 individuals (Fig. 3). Finally, genes with LoF variants in three or more individuals and p values below the calculated multiple testing threshold (n = 66) were prioritised for curation, including detailed GnomAD and bam file review. Twenty-three genes with low-confidence LoF variants were removed during curation; these included 15 genes that were removed due to their remaining valid variants occurring in fewer than three individuals, or falling below our three-fold enrichment threshold.
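The threshold step described above can be illustrated with a minimal Python sketch of the Benjamini–Hochberg step-up rule. It is not the authors' code (the Methods state that the statistics were run in R or Prism); the function name `bh_threshold` and the toy p values are assumptions made purely for illustration.

```python
def bh_threshold(p_values, fdr=0.3):
    """Return the largest p value p_(k) satisfying p_(k) <= (k / m) * fdr.

    p_values: unadjusted p values, one per gene (any order).
    fdr: target false discovery rate (0.3 in the text).
    Returns None when no p value meets the criterion.
    """
    m = len(p_values)
    threshold = None
    # Walk the p values in ascending order; keep the last one that passes,
    # which is the step-up rule of the Benjamini-Hochberg procedure.
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= rank / m * fdr:
            threshold = p
    return threshold


# Toy input (hypothetical values, not study data).
toy_p_values = [0.9, 0.03, 0.5, 0.01, 0.02]
print(bh_threshold(toy_p_values, fdr=0.3))
```

Every gene whose p value is at or below the returned threshold is treated as a discovery; with a false discovery rate of 0.3, up to roughly 30% of those discoveries are expected to be false positives.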

Fig. 3: Frequency bar chart displaying number of genes containing LoF variants vs. the number of individual cases they were present in.

Total number of genes stated above bar for each class of number of cases.

The remaining 43 highest-ranked candidate genes with high-confidence, rare LoF variants are displayed in Table 3 (for individual variants and associated case data, refer to Supplementary Data 2). The top-ranked genes are involved in very diverse functional pathways (e.g. transporter proteins and metabolic enzymes), and of note, few appear to have a role in DNA repair despite the fact that all known HBOC genes to date are directly or indirectly involved with that function2,29. The majority of these candidate genes have not been reported to contain pathogenic somatic mutations in serous ovarian tumour samples from the COSMIC database (Table 3), and for those that do, the frequency of somatic variants is low (<1% of samples). Comparing the family history distribution of candidate gene carriers (n = 138) and non-carriers (n = 378), there was no significant difference in the likelihood of being a carrier in those with a family history of breast and/or ovarian cancer in one or more first- or second-degree relatives vs. those with no family history (p = 0.55, Fisher’s exact test).

Table 3 Top-ranked candidate genes remaining after curation with loss-of-function (LoF) variants present in three or more discovery case cohort individuals.

To assess if this reflected a genuine lack of enrichment of DNA repair genes, the total frequency of rare LoF variants in DNA repair genes grouped by functional pathway30 amongst the cases in the discovery cohort was compared with the GnomAD sub-population (Table 4), excluding known HBOC genes that were previously searched for in the total case cohort (Table 1). One-hundred-and-five cases (21%) harboured at least one LoF variant across all DNA repair and associated genes, but the total frequency of LoF variants across all functional categories in the cases was very similar to GnomAD (0.063% vs. 0.061%, p = 0.60, Fisher’s exact test). Although the frequencies of LoF variants in the subset of genes involved in the nucleotide excision repair, homologous recombination repair, Fanconi anaemia and non-homologous end-joining pathways were higher in the cases vs. GnomAD, only the homologous recombination repair category was significantly enriched (p = 0.032, Fisher’s exact test).

Table 4 List of DNA repair genes with loss-of-function (LoF) variants in the discovery case cohort, grouped by pathway.

Discussion

Reported here is the largest WES study to date of HGSOC patients with no detectable BRCA1 or BRCA2 germline mutations. The extreme degree of genetic heterogeneity underlying HGSOC predisposition is demonstrated by the fact that 1307 genes are enriched for LoF variants by a minimum of three-fold, along with the fact that amongst the 43 high-priority candidates, the median number of LoF variants was only four. Although a proportion of these genes are likely to be false positives, the fact there is a significantly higher number of rare LoF variants in the case cohort compared to GnomAD as well as significantly more genes with ORs > 1 compared to those with ORs < 1 indicates that the list likely includes many genuine HGSOC predisposition genes.

Among the top-ranked genes (Table 3), a small number function in a manner analogous to other known tumour suppressor genes. For example, RPA331, USP5032 and RAD133 are thought to participate in arresting cell cycle progression in response to DNA damage. Others, such as SLC12A4 (a potassium and chloride ion co-transporter)34 and IMPDH2 (the rate-limiting enzyme in guanine nucleotide synthesis)35,36,37, are known to have an oncogenic role in various tumour types. Assuming their biological function as described in the literature is accurate and complete, it is unclear how germline LoF variants in these genes might predispose to tumour development. However, the vast majority of top-ranked genes either have no known role in tumorigenesis (e.g. LOXL2) and/or their function is currently unknown (e.g. ZBTB45). This uncertainty suggests that approaches to gene discovery that emphasise candidate gene function above other considerations (such as relative frequency of LoF mutations in cases vs. controls) may fail to identify HGSOC predisposition genes functioning in pathways other than those classically inactivated in HBOC, such as DNA repair pathway genes. Of these, only homologous recombination repair pathway genes were modestly enriched for rare LoF variants in the cohort, indicating that mutations in these genes cannot alone explain the missing heritability of HGSOC.

Only 16 top-ranked genes had any somatic mutations recorded in COSMIC (Table 3), none of which exceeded 1% of serous ovarian tumour samples present in the database. This is consistent with the finding that established germline susceptibility genes, with the exception of TP53, are also rarely found to harbour somatically-acquired mutations in sequenced tumour samples. BRCA1 pathogenic somatic mutations, for example, are only present in 1.59% of serous ovarian carcinoma samples in the COSMIC database38.

Only a small fraction of cases (6.6%) were potentially explained by genes known or suggested to be linked with a higher risk of HGSOC, which is consistent with the low frequency reported in other studies5,10,11,13,14,17,39. Of note, there was no enrichment for Lynch syndrome genes (MLH1, MSH2, MSH6, PMS240), despite the large size of the cohort. This reflects the fact that the ovarian tumour types most often associated with Lynch syndrome (i.e. clear cell and low-grade endometrioid)41 were not represented in this patient group.

Although many of the suspected HBOC-associated genes harboured LoF variants (Table 2), the frequencies were low and only PALB2, ATM and MRE11A showed some degree of enrichment compared to GnomAD. The level of enrichment was relatively modest, with very wide confidence intervals due to the small numbers present, making it challenging to interpret their true significance. Previous work suggested a similarly modest increase in risk for ATM and PALB2, but not for MRE11A13,14, casting doubt on whether the latter is truly an ovarian carcinoma predisposition gene. Recent additional data from PALB2 families found that pathogenic variants are associated with a two-to-three-fold increased risk of ovarian carcinoma42, independently of the known strong association with breast carcinoma.

The remaining proposed HBOC genes with LoF variants present in the cohort (BLM, CHEK2, FANCM, NBN, NF1, RAD50, RECQL) have similar or lower frequencies of LoF mutations compared to GnomAD. Whilst these results do not exclude the possibility that these genes may be associated with an increased risk of hereditary ovarian carcinoma, they do suggest that caution should be exercised when interpreting their causative role in the context of germline genetic testing for women with suspected hereditary ovarian carcinoma and no personal or family history of breast carcinoma.

To date, no studies have applied a wholly unbiased WES-based approach to ovarian carcinoma predisposition gene discovery in a case cohort selected for HGSOC and enriched for hereditary cases where BRCA1 and BRCA2 involvement have been excluded. Stafford et al.19 conducted WES on 48 BRCA1 and BRCA2-negative ovarian carcinoma cases with a high prior likelihood of genetic susceptibility, but restricted their candidate gene variant analysis to 155 genes involved in DNA damage response or cell cycle regulation, along with 64 ovarian carcinoma-associated genes listed in the Human Gene Mutation Database (HGMD). Similarly, Lu et al.20 interrogated WES data from Ambry Genetics for 2051 women with ovarian carcinoma for only a small number of known ‘cancer-associated’ genes, and demonstrated significant enrichment for variants in six genes (ATM, CHEK2, MSH6, PALB2, RAD51C and TP53). Recently, Zhu et al.21 analysed WES data from 158 BRCA1 and BRCA2-negative ovarian carcinoma cases and identified ANKRD11 and POLE as putative risk genes following validation studies. Neither gene was found to be enriched for LoF variants in our cohort. However, their analysis of the exome data excluded variants in genes based on expression data and residual variation intolerance scores, and retained predicted pathogenic missense variants. The selective focus of these studies on certain genes also reflects a prevailing assumption about the importance of DNA repair pathway genes in HGSOC that is not supported by our data, which further emphasises the importance of applying an open approach to candidate gene identification.

Other groups alternatively used TCGA germline WES data to search for disease-associated genetic variants, although as noted earlier, this approach has limitations. Kanchi et al.22, using data from 429 serous ovarian carcinoma TCGA cases and 557 controls, identified several genes enriched for germline deleterious variants that were not previously associated with ovarian carcinoma (e.g. ASXL1, MAP3K1 and SETD2). However, their subsequent studies23,25 did not validate their predisposition gene discoveries. Dicks et al.24 also used TCGA data from 412 HGSOC cases to identify disease-associated variants in 12 DNA repair genes, and subsequently assessed them in 3107 HGSOC cases and 3368 controls. Of these candidate genes, only FANCM had a significantly higher mutation frequency in cases vs. controls. None of the genes identified by Dicks et al. (including FANCM) were enriched for LoF mutations in our cohort.

Limitations of the current study include the use of GnomAD as the control population, given the differences in sequencing platforms and variant callers that could result in both false-positive and false-negative associations. Detailed review of variants in the top-ranked genes in both the cases and GnomAD to identify potentially unreliable calls aimed to reduce this problem. While ethnicity differences between the cases and GnomAD exist, these were demonstrated by PCA to be minimal with their predominant (over 95%) Western European ancestry being well matched with the GnomAD NFE non-cancer cohort. In addition, the frequencies of LoF mutations in HBOC genes in GnomAD were broadly comparable to our local population control figures from prior studies43,44, giving us confidence that in the context of a gene discovery phase, GnomAD is a suitable surrogate control population.

The largest potential source of uncertainty in this study is the extreme genetic heterogeneity of HGSOC predisposition, with most of the candidate genes only having LoF mutations in less than 0.5% of individuals, meaning that the risk of false-positive associations in the discovery set due to chance or to rare, benign variants will be high (up to 30% for ranked genes with p values < 0.0094 after multiple testing correction). Consequently, it will be essential to conduct further validation using very large case–control studies as well as orthogonal approaches such as tumour sequencing, which can provide powerful evidence of bi-allelic inactivation or other somatic genetic features consistent with the candidate gene actively driving carcinogenesis45,46,47.

In summary, WES of the largest cohort of BRCA1 and BRCA2-negative HGSOC cases assembled to date has demonstrated the extensive genetic heterogeneity that exists in the remaining unresolved cases of hereditary HGSOC. Furthermore, the lack of enrichment for LoF mutations in genes either directly or indirectly involved in DNA repair posits an explanation for the lack of success of previous candidate gene studies that have prioritised such classes of genes. This study provides an important, unbiased catalogue of ‘function-agnostic’ candidate genes based solely on mutation frequencies, which will facilitate additional genetic epidemiological and functional studies with the potential to translate the findings into future clinical practice.

Methods

Description of case cohort and controls

Cases consisted of 516 women from Australia recruited to the Variants in Practice (ViP) study between 2013 and 2018 (Table 5) with a confirmed or suspected diagnosis of HGSOC, as well as those with tumours of similar histology arising in the fallopian tube and peritoneum (which share similar clinical and molecular characteristics to HGSOC and are all thought to originate from foci of serous tubal intraepithelial carcinoma48). Represented histological subtypes were high-grade serous (including carcinosarcomas) (n = 443); high-grade endometrioid (n = 35), which is considered a subtype of HGSOC, distinct from low-grade endometrioid tumours29,49; mixed epithelial types with a predominant high-grade serous component (n = 11); and suspected high-grade serous tumours that were previously classed as adenocarcinoma not otherwise specified or as unknown (n = 27). All women were referred to a specialist familial cancer centre and assessed as fulfilling local criteria for offering them BRCA1 and BRCA2 testing (https://www.eviq.org.au/p/620)50. Clinical testing for germline variants in both genes was performed using validated, standard techniques (next-generation panel sequencing and/or Sanger sequencing for exon variants, along with multiplex ligation-dependent probe amplification for structural variants) in a certified diagnostic lab, and all tested individuals had no pathogenic or likely pathogenic variants nor any large deletions in these genes. These results were reconfirmed on analysis of their exome sequencing data for BRCA1 and BRCA2 pathogenic variants.

Table 5 Characteristics of total case cohort.

Population control frequencies of gene variants were obtained from publicly available sequencing data in GnomAD version 2.1.1 (https://gnomad.broadinstitute.org)51, containing 125,748 exome sequences and 15,708 genome sequences from unrelated individuals worldwide. Filtering options within GnomAD were used to remove data from individuals with a cancer diagnosis (including those sourced from TCGA) as well as those that were not from a non-Finnish European ethnic background, leaving 59,095 individuals.

Exome sequencing and variant calling

Exome sequencing was performed on leucocyte DNA extracted from patient whole-blood samples utilising the Agilent SureSelect (Human All Exon v4 for six samples, and v6 for the remainder) capture and Illumina HiSeq 2500 (150 paired-end reads) sequencing platforms at two commercial sequencing companies (BGI and Novogene). An in-house bioinformatics pipeline constructed using Seqliner v0.7 (http://bioinformatics.petermac.org/seqliner) was used to process raw sequencing data. Raw sequencing reads were quality checked using FastQC v0.11.2 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc), trimmed using cutadapt v1.5 52 then aligned to the GRCh37/hg19 human reference genome using BWA-MEM v0.7.10 53. Duplicate reads were filtered using Picard MarkDuplicates (http://broadinstitute.github.io/picard). Base quality score recalibration and indel realignment were then performed on the filtered reads using the Genome Analysis Toolkit (GATK) v3.8.0 54. Variants were called using GATK HaplotypeCaller and Platypus v0.8.1 55, then annotated for predicted consequences using Ensembl Variant Effect Predictor (VEP) database version v85 56 and LoFTEE (https://github.com/konradjk/loftee).

Principal component analysis was performed in PLINK v1.90 57 using a set of all SNPs passing filters in at least two samples that were targeted by both the Human All Exon v4 and v6 captures and passed linkage disequilibrium pruning (r2 threshold: <0.3, window size: 100 kb, step size: 5 kb). Clusters in the PCA results were classified to ethnicities informed by markers from the major sub-population groups as defined in the GnomAD database.

Variant filtering, ranking and curation

A series of filters were applied to the variant data (Fig. 1), using R v3.5.2 (2018) with tidyverse v1.2.1 installed, and the output viewed and analysed in Microsoft Excel v16.25 for Mac. For the discovery analysis, only variants classed by VEP56 as ‘HIGH’ impact were retained; these included classic LoF variants (stop-gain, start-loss, frameshift and essential splice site) in protein-coding transcripts, as well as equivalent variants in non-protein-coding transcripts (e.g. non-coding RNAs). Variants classed as ‘MODERATE’, ‘LOW’ or ‘MODIFIER’ impact (including missense, in-frame indel, stop-loss, cryptic splice site, synonymous etc. in protein-coding sequences) were removed. Analysis aimed to identify rare variants with strong pathogenic effect and good-quality sequencing metrics; hence, variants with GnomAD total population minor allele frequency (MAF) > 0.005 or those annotated to non-canonical transcripts (as defined by Ensembl58,59) were removed and a number of quality filters applied (Fig. 1). Following ranking (described below), additional filtering removed variants in transcripts that were not classed as ‘protein_coding’ in their Ensembl Biotype annotation, leaving only protein-coding LoF variants. Common variants (i.e. MAF > 0.005) in one or more of the major outbred population groups represented in GnomAD (i.e. excluding Finns, Ashkenazi Jewish and ‘other’ populations) were also removed, using the ‘popmax’ annotation. The latter filter facilitated the removal of common variants within the other major non-European ethnic groups (e.g. East Asian) represented in the patient sample, abrogating the need to use ethnicity-specific GnomAD data when performing filtering with these cases.
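As a rough illustration of the impact, transcript and frequency filters described above, the following Python sketch applies them to a small list of annotated variant records. The published analysis was run in R with tidyverse; the record layout, the field names (`impact`, `canonical`, `biotype`, `gnomad_af`, `popmax_af`) and the example records are hypothetical, and treating a variant that is absent from the reference data as having frequency zero is an assumption of this sketch.

```python
MAF_CUTOFF = 0.005  # rare-variant ceiling stated in the text


def passes_discovery_filters(variant, maf_cutoff=MAF_CUTOFF):
    """Return True if a record survives the impact, transcript and frequency filters."""
    # Assumption: a missing frequency (None) means the variant is absent
    # from the reference data and is treated as frequency 0.
    total_af = variant.get("gnomad_af") or 0.0
    popmax_af = variant.get("popmax_af") or 0.0
    return (
        variant["impact"] == "HIGH"      # drop MODERATE / LOW / MODIFIER
        and variant["canonical"]         # canonical transcripts only
        and total_af <= maf_cutoff       # rare in the total population
        and popmax_af <= maf_cutoff      # rare in every major population
    )


def is_protein_coding(variant):
    """Post-ranking filter: keep only 'protein_coding' biotype transcripts."""
    return variant["biotype"] == "protein_coding"


# Hypothetical records for illustration only.
variants = [
    {"id": "var1", "impact": "HIGH", "canonical": True,
     "biotype": "protein_coding", "gnomad_af": 0.0001, "popmax_af": 0.0003},
    {"id": "var2", "impact": "MODERATE", "canonical": True,
     "biotype": "protein_coding", "gnomad_af": 0.0001, "popmax_af": 0.0002},
    {"id": "var3", "impact": "HIGH", "canonical": True,
     "biotype": "protein_coding", "gnomad_af": 0.001, "popmax_af": 0.02},
    {"id": "var4", "impact": "HIGH", "canonical": True,
     "biotype": "lincRNA", "gnomad_af": None, "popmax_af": None},
]

discovery = [v for v in variants if passes_discovery_filters(v)]
protein_coding_lof = [v for v in discovery if is_protein_coding(v)]
print([v["id"] for v in discovery], [v["id"] for v in protein_coding_lof])
```

Keeping the biotype check as a separate function mirrors the order described in the text, where the protein-coding restriction is applied only after ranking.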

After excluding samples with deleterious variants in known ovarian cancer predisposition genes (Table 1), remaining genes were ranked by degree of enrichment for presumed deleterious variants in the case population. To facilitate this, total control population frequencies of ‘HIGH’ impact variants for every gene transcript were calculated using the GnomAD non-cancer reference data for the non-Finnish European (NFE) sub-population51; these figures excluded common variants with MAF > 0.005, and were adjusted for genes with multiple variants per individual using the formula $1 - \prod_i (1 - \mathrm{AF}_i)$, i.e. one minus the combined probability of not containing any of the variant alleles. Variants that were flagged in GnomAD as failing their ‘InbreedingCoeff’, ‘AC0’ or ‘RF’ (random forests) QC filters were excluded from these figures, to match our filtering. Total frequencies for every gene with retained variants in the sample were calculated, and a risk ratio between figures for the two population groups (case cohort and GnomAD non-cancer NFE) was derived.
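The gene-level frequency adjustment described above (one minus the product of one minus each variant allele frequency) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function name `combined_gene_frequency` and the example allele frequencies are hypothetical.

```python
from math import prod


def combined_gene_frequency(allele_freqs, maf_cutoff=0.005):
    """Gene-level frequency of carrying at least one rare variant allele.

    Implements 1 - prod(1 - AF_i) over variants with AF_i <= maf_cutoff,
    i.e. one minus the probability of carrying none of the variant alleles.
    """
    rare = [af for af in allele_freqs if af <= maf_cutoff]
    return 1.0 - prod(1.0 - af for af in rare)


# Hypothetical allele frequencies: three rare variants and one common
# variant (0.02) that is excluded by the MAF cutoff.
example_afs = [0.0002, 0.0001, 0.00005, 0.02]
gene_freq = combined_gene_frequency(example_afs)
print(f"{gene_freq:.9f}")
```

For rare variants the result is only marginally smaller than the simple sum of the allele frequencies (0.00035 in this example), because the chance of one chromosome carrying two of the variants is very small.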

A two-tailed chi-squared test was then used to compare the total number of rare (i.e. AF ≤ 0.005) LoF variants in the case cohort vs. the equivalent number in the GnomAD non-cancer NFE sub-population for all genes represented on the Agilent SureSelect v6 exome panel with ‘protein_coding’ Biotype transcripts (n = 19,818). p values, odds ratios (ORs) and confidence intervals for every gene were then calculated using a two-tailed Fisher’s exact test, incorporating allele counts in the sample vs. equivalent counts in the GnomAD non-cancer NFE sub-population (with the denominator as the maximum number of alleles from that population with available data in GnomAD for that specific gene). Genes were ranked in order of increasing p value, with the most enriched genes having the lowest p values, and the calculated risk ratios were used to prioritise variants in genes that were enriched by three-fold or more in the case cohort for further analysis. Additional two-tailed chi-squared tests were used to compare the observed vs. the expected distribution of Fisher’s test p values < 0.01 for odds ratios >1 and <1 for genes with ‘protein_coding’ Biotype transcripts. The Benjamini–Hochberg procedure28 for multiple testing was applied to the ranked list of p values to establish a ‘discovery’ threshold p value for prioritising top-ranked genes for further study, specifying a false discovery rate of 0.3. It is important to note that the p values used for ranking candidate genes do not imply a statistically significant difference in total LoF allele frequency between cases and the GnomAD sub-population for any individual gene, since the case cohort lacked the size and power required to demonstrate this.

A two-tailed Fisher’s exact test was also used to compare the total frequency of LoF variants in known DNA repair genes grouped by functional pathway (from Chae et al.30) in the discovery cohort (n = 510) with those in the GnomAD non-cancer NFE sub-population; this analysis did not include BRCA1 and BRCA2 or any of the other known ovarian carcinoma predisposition genes that had been analysed for LoF variants in the case cohort during filtering (described below). All graphs were plotted using GraphPad Prism v8.1.1 for Mac, and all statistical tests (Fisher’s exact test, chi-squared tests and the Benjamini–Hochberg procedure) were performed in R or Prism.
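To make the per-gene comparison concrete, the sketch below implements a two-tailed Fisher's exact test on a 2×2 table of allele counts using only the Python standard library. It is an illustrative approximation, not the analysis code (the tests were performed in R or Prism). The two-sided p value sums the probabilities of all tables no more likely than the observed one, and the odds ratio returned is the simple sample cross-product ratio, which can differ slightly from the conditional estimate reported by some statistical packages. The allele counts in the usage example are hypothetical.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the table [[a, b], [c, d]].

    a: case alleles carrying a LoF variant    b: case alleles without
    c: control alleles carrying a LoF variant d: control alleles without
    Returns (p_value, sample_odds_ratio).
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = _log_comb(row1 + row2, col1)

    def log_p(x):
        # Hypergeometric log-probability of x carriers in the case row.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = log_p(a)
    # Sum every table at least as extreme as the observed one; the small
    # tolerance guards against floating-point ties.
    p = sum(exp(log_p(x)) for x in range(lo, hi + 1)
            if log_p(x) <= log_obs + 1e-7)
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return min(p, 1.0), odds_ratio


# Hypothetical gene: 5 LoF alleles among 1020 case alleles (510 cases)
# vs. 40 among 118,190 control alleles (59,095 individuals).
p_value, odds_ratio = fisher_exact_two_tailed(5, 1015, 40, 118150)
print(f"p = {p_value:.3g}, OR = {odds_ratio:.2f}")
```

Working in log space with `lgamma` avoids overflow when the control allele counts are in the hundreds of thousands.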

Ranked genes and LoF variants were curated and scrutinised using available online databases (NCBI Gene, OMIM and COSMIC38) to annotate their function and possible role in cancer predisposition. GnomAD data for each gene were also reviewed, to identify those genes with problematic sequencing data, or variants that were found at an AF > 0.005 in one of the GnomAD sub-populations; any genes or variants affected as such were excluded from the top-ranked gene list. Finally, the candidate gene variants with borderline quality sequencing metrics (i.e. failed QC sequencing quality score < 500, read depth < 60, alt allele read frequency < 0.35 or variants not called bidirectionally) were manually reviewed within the raw sequencing (bam) files using the Integrative Genomics Viewer (IGV) software60; any doubtful variants were excluded when collating the top-ranked gene list. A two-tailed Fisher’s exact test was used at this point to compare the likelihood of being a candidate gene carrier in those with a family history of breast and/or ovarian cancer in one or more first- or second-degree relatives (n = 262) vs. those with no family history (n = 254).
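The borderline-quality criteria listed above lend themselves to a simple flagging helper. The Python sketch below is one possible reading of those criteria, in which any single failing metric triggers manual review; the function name, argument names and example values are hypothetical and are not taken from the study pipeline.

```python
def review_reasons(qual, depth, alt_fraction, bidirectional):
    """List the borderline-quality criteria that a variant call meets.

    qual: variant quality score; depth: read depth at the site;
    alt_fraction: fraction of reads supporting the alternate allele;
    bidirectional: True if the variant was called on both strands.
    An empty list means no manual review is triggered.
    """
    reasons = []
    if qual < 500:
        reasons.append("quality score < 500")
    if depth < 60:
        reasons.append("read depth < 60")
    if alt_fraction < 0.35:
        reasons.append("alt allele read frequency < 0.35")
    if not bidirectional:
        reasons.append("not called bidirectionally")
    return reasons


# Hypothetical calls for illustration only.
print(review_reasons(1200, 110, 0.48, True))   # clean call
print(review_reasons(450, 80, 0.30, True))     # flagged on two criteria
```

Returning the reasons rather than a single flag keeps a record of why each variant was sent for inspection in a genome browser.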

Analysis of known and proposed ovarian carcinoma risk genes

For known and proposed ovarian carcinoma predisposition genes (MLH1, MSH2, MSH6, PMS25, BRIP19, RAD51C7, RAD51D8, PALB261, FANCM24, ATM14, TP5362, CHEK263, BARD164, STK1165, CDH166, PTEN67, FANCC68, RECQL69, BLM68, NF170 and the MRN protein complex genes i.e. MRE11A, NBN, RAD5071), any identified LoF variants annotated to RefSeqGene transcripts72 were considered pathogenic and retained, but additionally checked in NCBI ClinVar (https://www.ncbi.nlm.nih.gov/clinvar) to exclude any that had been classed in this database as ‘benign’ or ‘likely benign’. Only missense variants classed as pathogenic in ClinVar with multiple sources of supporting evidence and consensus opinion were considered deleterious.

Ethics statement

This study protocol was approved by the Human Research Ethics Committees at each participating ViP study recruitment centre and the Peter MacCallum Cancer Centre (approval nos. 11/50 and 09/29). All participants provided informed consent for genetic analysis of their germline and tumour DNA.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.