BACKGROUND

Breast cancer is the most common cancer in the Western world and accounts for 15% of cancer-related deaths in women, with about 522,000 deaths worldwide in 2012.1 Survival after a diagnosis of breast cancer varies considerably between patients even with closely matching tumour characteristics. Models that predict the likelihood of survival after breast cancer treatment use tumour and treatment data, but currently do not take host factors into account. The identification of prognostic and predictive biomarkers inherent in the germline of the patients rather than the tumour could pinpoint mechanisms of tumour progression and help with treatment stratification to increase therapeutic benefit. Such markers include inherited genetic variation, as there is evidence for heritability of breast cancer-specific mortality in affected first-degree relatives.2,3,4,5 Germline variation may affect prognosis by affecting tumour biology, since such variants are known to be associated with risk of specific breast tumour subtypes, particularly those defined by hormone receptor status, and have different outcomes.6,7,8 Germline genotype could also affect the efficacy of adjuvant drug therapies9,10 or might condition the host tumour environment via vascularisation,11,12 metastatic pattern,13,14 stroma–tumour interaction15,16 and immune surveillance.17,18

The association between common germline genetic variation and breast cancer-specific mortality has been examined in many candidate gene studies,5,9,14,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36 as well as in moderate-sized genome-wide association studies (GWAS).37,38,39,40,41 However, it has been difficult link GWAS results to plausible candidate genes and few have been convincingly replicated.29,42 Large studies with long follow-up and reliable data on known prognostic factors are required if novel alleles associated with prognosis in breast cancer are to be identified at a level of genome-wide significance. In the present work, we pooled genotype data from multiple breast cancer GWAS discovery and replication efforts43,44 with new genotype data obtained from a large breast cancer series genotyped using the OncoArray chip.45,46 We examined associations with risk of breast cancer-specific mortality in a total of 96,661 breast cancer patients with survival time data. We then investigated the potential functional role of the selected variants by predicting possible target genes.

Materials and methods

Breast cancer patient samples

We included data from twelve datasets (n = 96,661) in which multiple breast cancer patient cohorts were genotyped by a variety of arrays providing genome-wide coverage of common variants. An overview of the datasets with specification of the arrays used is given in Supplementary Table 1. Data from eight of these datasets have been used in previous analyses (n = 37,954).44 However, the Collaborative Oncological Gene-Environment Study (COGS) dataset from the Breast Cancer Association Consortium (BCAC) was updated to include additional follow-up and death events and additional genotype data, increasing the number of events and samples to a total of n = 29,959 patients. Two new datasets, the BCAC OncoArray and the SUCCESS A trial, comprising 58,027 samples, were added for the current analyses.

The OncoArray is a custom Illumina genotyping array designed by the Genetic Associations and Mechanisms in Oncology (GAME-ON) consortium. It includes 533,000 variants of which 260,660 form a GWAS backbone, with the remainder being custom content, details of which have been described previously.45 The SUCCESS-A Study47 is a randomised phase III study of n = 3,299 breast cancer cases. Cases from the trial were genotyped using the Illumina Human OmniExpress array. We downloaded imputed genotypes from dbGaP (data reference 6266).

COGS samples that were also genotyped on the OncoArray were removed from the COGS dataset (n = 14,426). Female patients with invasive breast cancer diagnosed at age > 18 years, and with follow-up data available were included in the analyses. BCAC data from freeze 8 was used, in which 873 COGS samples with unknown breast cancer-specific mortality status were excluded from the analyses. All stages of cancer, including metastatic, were used in the analysis. Some individual studies applied additional selection criteria such as young age or early breast cancer stage (Supplementary Table 2).

Genotype and sample quality control, ancestry analysis and imputation

The genotype and sample quality control for the datasets have been described previously.44,45,47,48 Ancestry outliers for each dataset were identified by multidimensional scaling or LAMP49 on the basis of a set of unlinked variants and HapMap2 populations. Samples of European ancestry were retained for analyses.

Ten of the datasets were imputed using the reference panel from the 1000 Genomes Project in a two-stage procedure. The 1000 Genomes project Phase 3 (October 2014) release was used as the reference panel for all the datasets apart from SUCCESS-A, which used the Phase 1 release (March 2012). Imputation for CGEMS and BPC3 was performed using the programme MACH.50 Phased genotypes were first derived using SHAPEIT51 and IMPUTE252 and then used to perform imputation on the phased data. The main analyses were based on variants that were imputed with imputation r2 > 0.3 and had minor allele frequency (MAF) > 0.01 in at least one of the datasets leading to ~10.4 million variants. To match the individual datasets in the meta-analysis we used the chromosome position. Variants were kept in the analysis as long as they were present in one of the studies. In those cases where there was ambiguity over the naming of the insertions and deletions, the MAF was used for further matching.

Statistical and bioinformatic methods

Time-to-event was calculated from the date of diagnosis. For prevalent cases with study entry after diagnosis left truncation was applied, i.e., follow-up started at the date of study entry.53 Follow-up was right censored on the date of death, on the date last known alive if death did not occur, or at 15 years after diagnosis, whichever came first. We chose the 15 years cut-off because follow-up varied between studies and after that period follow-up data became scarce. Follow-up of the cohorts is illustrated in Kaplan Meier curves (Supplementary Figure 1).

The hazard ratios (HR) for the association of genotypes with breast cancer-specific mortality were estimated using Cox proportional hazards regression54 implemented in an in-house programme written in C++. Analysis of the CGEMS and BPC3 data was conducted using ProbABEL.55 The estimates of the individual studies were combined using an inverse-variance weighted meta-analysis. Since meta-analysis results based on the Wald test have been shown to be inflated for rare variants56 we recomputed the standard errors based on the likelihood ratio test statistic (see details in Supplementary methods), using the formula:

$${\mathrm{SE}} = {\mathrm{log}}\left( {{\mathrm{HR}}} \right){\mathrm{/sqrt}}\left( {{\mathrm{LRT}}} \right)$$

For each dataset we included as covariates a variable number of principal components (Supplementary Table 1) from the ancestry analysis as covariates in order to control for cryptic population substructure. The Cox models were stratified by country for the OncoArray dataset and by study for the COGS dataset. Statistical tests were performed for each variant by combining the results for all the datasets using a fixed-effects meta-analysis. Inflation of the test statistics (λ) was estimated by dividing the 45th percentile of the test statistic by 0.357 (the 45th percentile for a χ2 distribution on 1 degree of freedom). Analyses were carried out for all invasive breast cancer and for oestrogen receptor (ER)-positive and ER-negative disease separately.

To assess the probability of a variant being a false positive we used a Bayesian false discovery probability (BFDP)57 test based on the P value, a prior set to 0.0001 and an upper likely HR of 1.3.

To predict potential target genes, we used Bedtools v2.26 to intersect notable variants with genomic annotation data relevant to gene regulation activity in samples derived from breast tissue. We examined features including enhancers, promoters and transcription factor binding sites identified by the Roadmap58 and ENCODE59 Projects. Expression quantitative loci (eQTL) data from GTEx60 were queried for evidence of potential cis-regulatory activity.

Results

Genotype data from 96,661 breast cancer cases (64,171 ER-positive and 16,172 ER-negative) with 7697 breast cancer deaths within 15 years were included in the primary analyses. For 16,318 cases we did not have ER-status information. The average follow-up time was 6.38 years. Details of the numbers of samples and events in each dataset are given in Supplementary Table 3. Manhattan and quantile-quantile (Q–Q) plots for the associations between variants and breast cancer-specific mortality of all invasive, ER-negative and ER-positive breast cancers are shown in Fig. 1 and Fig. 2, respectively. There was some evidence of inflation of the test statistic with an inflation factor of 1.06 for all invasive and ER-positive, and 1.05 for ER-negative including all variants. These Q–Q plots showed no evidence of an association at P < 5 × 10−8; at less stringent thresholds for significance, there were an increasing number of observed associations for all three analyses (Fig. 2).

Fig. 1
figure 1

Association plot for the meta-analysis of the twelve datasets for breast cancer-specific mortality analyses (censored at 15 years) for a all breast tumours (censored at 15 years), b ER-negative tumours and c ER-positive tumours. The y-axis shows the −log10 P values of each variant analysed, and the x-axis shows their chromosome position. The red horizontal line represents P = 5 × 10−8

Fig. 2
figure 2

Q–Q plots for the meta-analysis of the twelve datasets for breast cancer-specific mortality analyses (censored at 15 years) for a all breast cancer tumours (censored at 15 years), b ER-negative tumours and c ER-positive tumours. The y-axis represents the observed −log10 P value, and the x-axis represents the expected −log10 P value. The red line represents the expected distribution under the null hypothesis of no association. Analyses were not corrected for LD-structure

We identified three variants at BFDP < 15% associated with breast cancer-specific mortality of patients with ER-negative disease (Table 1). These variants are part of an independent set of 32 highly correlated variants61 on chromosome 7q21.1 that were associated at P < 5 × 10−6 (Supplementary Table 4). The LD matrix between these variants computed based on the 1000 European genomes,62,63 and their chromosomal positions, are shown in Supplementary Figure 1. The strongest association was for rs67918676: HR = 1.27; 95% CI = 1.16–1.39; P = 1.38 × 10−7; risk allele A frequency = 0.12 and BFDP = 11%. The imputation efficiency for this variant was high, with r2 = 0.99 for all datasets.

Table 1 Results of the variants with BFDP < 15% in the meta-analysis of the 12 studies of breast cancer-specific mortality

The lead variant rs67918676 is located in an intron of a long intergenic non-coding RNA gene, LOC105375207 (AC004009.3), in close proximity to the HOXA gene cluster and the lncRNA HOTTIP. We tested the genes within a 500 MBp window around the 32 highly correlated variants for the association of their mRNA expression in breast tumours with recurrence-free survival using KMplotter (kmplot.com/analysis). Four of the ten closest genes with probes available showed moderate association with breast cancer survival at P < 0.005 (HOXA9, HOTTIP, EVX1 and TAX1BP1), with these associations mainly observed for ER-negative breast cancer (Supplementary Table 5A). Yet, intersecting the germline variants with several sources of genomic annotation information (e.g., chromosome conformation, enhancer–promoter correlations or gene expression) we could not find strong in silico evidence of gene regulation by the region containing the associated variants.

We also identified four variants at a BFDP < 15% associated with breast cancer-specific mortality of patients with ER-positive disease (Table 1). These variants were part of an independent set of 45 highly correlated variants on chromosome 7q11.22 that were associated at P < 5 × 10−6 (Supplementary Table 6). The LD matrix between these variants computed based on the 1000 European genomes,62,63 and their chromosomal positions, are shown in Supplementary Figure 3. The strongest association was for rs4717568: HR = 0.88; 95% CI:0.84–0.92; P = 1.28 × 10−7; risk allele A frequency = 0.62 and BFDP = 7%. The imputation efficiency for this variant was high, with an average r2 = 0.96 for all datasets. Two coding genes, AUTS2 and GALNT17, were located within a 500 MBp window around the 45 highly correlated variants, but the expression of neither of the two was associated with breast cancer survival in KMplotter analyses of TCGA data (Supplementary Table 5B).

The association of rs67918676 with ER-negative breast cancer was observed in eight of nine studies with no significant heterogeneity present at P < 0.01 (Fig. 3 and Supplementary Figure 4a). For ER-positive disease, the association of rs4717568 was detected in all seven studies with no heterogeneity present at P < 0.01 (Fig. 4 and Supplementary Figure 4b).

Fig. 3
figure 3

Forest plot showing the association between the ER-negative variant rs67918676 and breast cancer-specific mortality in ER-negative tumours for the datasets used in the meta-analysis. The size of the square reflects the size of the study (see also Supplementary Table 3)

Fig. 4
figure 4

Forest plot showing the association between the ER-positive variant rs4717568 and breast cancer-specific mortality in ER-positive tumours for the datasets used in the meta-analysis. The size of the square reflects the size of the study (see also Supplementary Table 3)

Apart from the 7q variants, only one isolated rare variant reached BFDP values below 15% for all tumours (Table 1). The variant, rs370332736: HR = 1.17; 95% CI: 1.10–1.24; P = 2.48 × 10−7; risk allele A frequency = 0.09 and BFDP = 13%, is located on chromosome 6 and has an average imputation efficiency of r2 = 0.96 for all datasets. In addition, there were several variants found at P < 10−6 for all three analyses (Supplementary Table 4, Supplementary Table 6 and Supplementary Table 7).

Discussion

In this large survival analysis, we report a genome-wide study for identifying genetic markers associated with breast cancer-specific mortality, involving 96,661 patients from a combined meta-analysis. We found one noteworthy region with 32 highly correlated variants on chromosome 7q21.1 for ER-negative. The lead variant rs67918676 (P = 1.38 × 10−7 and BFDP of 11% under reasonable assumptions for the prior probability of association) is located in a long intergenic non-coding RNA gene (AC004009.3). While this represents an uncharacterised transcript mainly expressed in testis and prostate, it is located about 200 kb away from a cluster of HOXA homeobox genes that has been implicated in breast cancer aetiology and prognosis.64,65 This region also contains HOTTIP, a lncRNA with prognostic value on clinical outcome in breast cancer.66 The flanking region on the opposite side contains TAX1BP1, a gene that may be involved in chemosensitivity.67 Interestingly, database mining using KMplotter revealed evidence for an association of the expression of these nearby genes with survival from ER-negative breast cancer. On the other hand, the enhancer activity at this noteworthy locus was predicted to be low based on the intersection with biofeatures characteristic of regulatory activity as no known eQTLs appear to exist in this region, suggesting that gene regulatory effects of the identified variants are limited in breast tissue or may be activated under certain untested conditions. For ER-positive tumours, we found another noteworthy region with 45 highly correlated variants at P < 5 × 10E−6 on chromosome 7q11.22. The lead variant rs4717568 (P = 1.28 × 10−7 and BFDP of 7%) is located between the AUTS2 and the GALNT17 genes. GALNT17 encodes an N-acetylgalactosaminyltransferase that may play a role in membrane trafficking.68 AUTS2 has been implicated in neurodevelopment,69 but AUTS2 overexpression in cancer has also been linked with resistance to chemotherapy and epithelial-to-mesenchymal transition.70 It has been postulated that overexpression of AUTS2 is specific for metastases,70 which may be consistent with the inconspicuous gene expression results in the TCGA database.

It is important to note the differences between the present and the previous GWAS study we had undertaken,44 the latter done in a much smaller dataset (3632 events versus 7697 events in the current study) that did not include the OncoArray study. The OncoArray study is the largest dataset used in the present meta-analysis and also the study with the highest imputation quality. The two previously reported variants (rs148760487 for all breast cancer tumours and rs2059614 for ER-negative tumours) were not associated with breast cancer-specific mortality in the current analyses (P = 1.59 × 10−3 and P = 5.41 × 10−4, respectively). The most likely explanation for this is that the original results were false-positive findings, despite the original association being nominally “genome-wide significant”. The BDFPs for the original reported associations were 54% and 16%, respectively. For the lead variants identified in the present analysis, we tested for differences in the imputation quality between the current and previous analysis. All variants had high imputation quality (~0.99) in the previous study, suggesting that the longer and more complete follow-up together with a higher number of events allowed more robust identification of breast cancer mortality associations. However, there are some weaknesses of the current meta-analysis such as heterogeneity between patient treatment over time and between countries and between datasets with different study designs that should be considered. These limitations, intrinsic to large survival meta-analyses, increase the noise and reduce the power to detect true associations.

In conclusion, we found two novel candidate regions at chromosome 7 for breast cancer survival, credible at a BFDP < 15% and associated with either ER-negative or ER-positive breast cancer-specific mortality. Concerning additional variants, we might still be underpowered to obtain a more comprehensive picture of genomic markers for breast cancer outcome. Overall, the role of germline variants in breast cancer mortality is still unclear36,37,71 and additional analyses with larger sample sizes and more complete follow-up including treatments are needed. In addition, alternative methods that integrate multiple data sources such as gene expression, protein–protein interactions or pathway analyses may be used to aggregate the effect of multiple variants with small effects.72 Such approaches could increase the power of the analyses while better explaining the underlying biological mechanisms associated with breast cancer mortality.