Introduction

Copy number variants (CNVs) are pervasive in the human genome but are more challenging to detect with current technologies than single nucleotide variants (SNVs). Recent comprehensive sequencing projects1,2 have characterised CNVs in large sample sets. The gnomAD project identified a median of 3,505 deletions and 723 duplications covering more than 50 base pairs per genome. Most deletions and duplications tend to be rare with longer variants tending to be rarer, suggesting negative selection against these variants. At the population level the 1000 Genomes project has mapped a large proportion of inherited CNVs3 and observed that 65% had a frequency below 2%.

While somatic copy number alterations play a major role in the pathogenesis of breast tumours4,5, some germline CNVs are known to be associated with inherited risk of breast cancer. Rare loss of function variants in susceptibility genes such as BRCA1 and CHEK2 are associated with a large increase in risk6. While the majority of these variants are single nucleotide mutations and short indels, they also include longer deletions and duplications. It has been reported that up to a third of loss of function BRCA1 variants in some populations may be CNVs7.

Large-scale genome-wide association studies (GWAS) have established breast cancer associations with common variants at more than 150 loci, mostly in non-coding regions8,9,10,11. At two of the loci, deletions imputed from the 1000 Genomes reference panel have been identified as likely causal variants. A deletion of the APOBEC3B gene-coding region increases breast cancer risk12 and analysis of the tumours of the germline deletion carriers showed an increase in APOBEC-mediated somatic mutations13. A deletion in a regulatory region was identified as a likely causal variant at the 2q35 locus14,15.

Detecting CNVs from the intensity measurements of genotyping array probes is prone to producing unreliable calls due to the high level of noise. We recently developed a new CNV calling method, CamCNV16, which focuses on rare CNVs and identifies outlier samples that may have a CNV, based on the intensity distribution across all samples at each probe. We showed that this approach is able to detect CNVs using as few as three probes16. Here, we apply this approach to a very large array genotype dataset to search for breast cancer-associated CNVs. The analyses are outlined in Fig. 1.

Fig. 1: Flow diagram of  the calling and analysis pipeline.
figure 1

CNVs were called using the CamCNV pipeline from the intensities from the iCOGs and OncoArray genotyping arrays. The CNVs were assigned to regions, association results generated for each array and then meta-analysed. CNVs covering the coding regions of genes were analysed in gene burden tests.

Results

Summary of CNVs detected

After quality control we detected a mean of 2.9 deletions (standard deviation 1.6) and 2.5 duplications (SD 2.0) per sample. Supplementary Data 5 shows the mean length, probe coverage, and segment z-scores of called CNVs. Duplications tended to be longer than deletions: for example, deletions called on OncoArray covered a mean of 45 Kilobases (Kb) (SD 106 Kb) over 9.8 probes (SD 17.2), while duplications covered a mean of 109 Kb (SD 202 Kb) over 18.9 probes (SD 36.5). CNV calls observed in multiple samples were concentrated in a small proportion of probes (Supplementary Data 6), with <11% of probes having frequency >0.01% and <2% of probes having frequency >0.5%.

We identified called CNVs which overlapped for at least 90% of their length with rare deletions and duplications (frequency < 1%) identified by the 1000 Genomes Project (Supplementary Data 5 and Supplementary Fig. 1.). Forty-nine percent of OncoArray deletions and 47% of iCOGs deletions matched a 1000 Genomes Project variant while 29% of OncoArray duplications and 20% of iCOGs duplications matched. In total, we identified CNVs closely matching 3273 of the deletion variants published by the 1000 Genomes Project (~9% of total) and 1255 of their duplication variants (~24% of the total).

CNVs associated with overall risk

Association results were derived for 1301 regions containing deletions and 992 regions containing duplications. QQ plots are shown in Fig. 2a for deletions and 2b for duplications. There was no evidence for inflation in the test statistics for duplications (lambda = 0.98; lambda1000 = 1.00) and minimal evidence for deletions (lambda = 1.11; lambda1000 = 1.00).

Fig. 2: QQ plots of association results.
figure 2

Quantile-quantile plots of P-values from association tests of deletion regions (a), duplication regions (b), gene burden analysis for deletions (c), and gene burden for duplications (d).

Seven deletion and two duplication regions were associated with breast cancer risk at p < 0.001 (Table 1); of these, deletions within the BRCA1 region achieved p < 10−6. The results for all regions are shown in Supplementary Data 7 and 8 and include statistics on the number of probes covered by the calls. The results for individual probes covered by the regions analysed are in Supplementary Data 9−12.

Table 1 CNVs associated with overall risk.

The BRCA1 locus contains multiple deletions across the gene. The CHEK2 region (OR 1.94, p = 0.0003) covers the whole gene but nearly all the calls correspond to a deletion of exons nine and ten, which was previously observed in 1% of breast cancer cases and 0.4% of controls in Poland17. We observed the deletion in 0.9% of Polish cases and 0.5% of controls.

The most significant association (OR = 0.69 P = 0.00001) for duplications covers a large region on 17p13.3 with multiple long variants overlapping shorter duplications. The OncoArray results by probe show the strongest associations at a series of probes (17: 814529–850542) in the first intron of NXN, with the lowest P-value at 17: 819141 (OR = 0.45, P = 0.002). The most significant probe position on iCOGs was also in this region (17:836631, OR = 0.58, P = 0.09) (Fig. 3).

Fig. 3: Associations with duplications at the 17p13.3 locus.
figure 3

Log odds ratios for individual probes from Oncoarray and iCOGs are shown at the top in green. Genes from the Ensembl browser are shown in the next row, followed by structural variants identified by the 1000 Genomes Project with duplications in blue and deletions in red.

Associations with risk of breast cancer subtypes

We repeated the analyses restricting cases to those with ER-positive, ER-negative, and triple negative disease. Deletions and duplications with p-values below 0.001 are shown in Tables 2 and 3 and the results for all regions are in Supplementary Data 13 and 14. An association was observed for BRCA1 for all subtypes, with the exception of duplications for ER-positive disease. The odds ratio for BRCA1 deletions was higher for ER-negative disease (OR = 27.03; 95% CI, 15.66−46.67) than ER-positive (OR = 2.81; CI, 1.56 to 5.08; P = 8.46E–28 for the difference), while for CHEK2 the odds ratio was higher for ER-positive disease (OR = 2.32;CI,1.56–3.44) than ER-negative (OR = 1.36; CI,0.66 to 2.82; P = 0.11 for the difference), consistent with the known subtype-specific associations for deleterious variants in these genes18.

Table 2 Subtype associations for deletions.
Table 3 Subtype associations for duplications.

In total, we observed five deletion and two duplication regions with p-values < 0.001 that did not reach p < 0.001 in the overall risk analysis. The strongest novel association for ER-positive was for an intronic deletion in ITGBL1 (OR = 3.3, P = 0.00007, P for difference by ER-status = 0.18). For ER-negative disease the strongest novel association was with an intergenic deletion between ABCC4 (MRP4) and CLDN10 (OR = 2.16, P = 0.0002, P for difference by ER-status = 0.02). Neither of these associations was significant for the other subtype. For triple negative disease, the strongest evidence of association was found for an intergenic duplication between TMC3 and MEX3B (OR = 2.39, P = 0.00009) and for two separate deletions upstream of the DDX18 gene: 2:118258797–118389164 (OR = 6.56, P = 0.00001) and 2:117973154–118107795 (OR = 4.54, P = 0.0008). The association at these two deletions was driven by the same samples, with 75% of the carriers of the first deletion observed to have the second deletion and normal copy number at the 62 kb gap between the deletions.

Associations at established common susceptibility Loci

Three of the most significant associations were observed within regions harbouring known breast cancer susceptibility loci for breast cancer. The most significant (OR = 1.42;CI,1.21−1.67; P = 0.00015) was upstream of FGFR2 and consistent with a 28 kb deletion in the 1000 Genomes Project data (chr10:123433204−123461492). Three independent risk signals have been previously identified at this region19,20. The effect size for the CNV was larger than those previously reported for these common SNPs (largest OR = 1.27;CI,1.22−1.25). The CNV is in linkage disequilibrium with two of the SNPs identified as likely causally associated variants: rs35054928 (D′ = 0.82) and rs2981578 (D′ = 0.88). Conditioning on those SNPs reduced somewhat the strength of the association for the deletion (OR = 1.30; CI 1.10−1.53; P = 0.002, Supplementary Data 15).

The third strongest signal (OR = 4.9, P = 0.00001) in the deletion analysis for overall breast cancer was at 8: 132199447−132252439, 144Kb downstream of ADCY8. The strongest GWAS signal in this region lies in an intron of ADCY8 (lead SNP rs73348588, OR = 1.13, P = 8.2e−7)9. A 3 kb deletion in intron 4 of KLF12 was associated with ER-negative disease (OR = 2.4, P = 0.0007, P for difference by ER-status=0.01). This is 389 kb distant from common SNPs, located between KLF12 and KLF5, associated with ER-negative disease (rs9573140, OR = 0.94, P = 3.62e−9)21. The KLF12 and ADCY8 deletions are not in strong linkage disequilibrium with the corresponding GWAS signals and conditioning on these SNPs did not alter the strengths of the association for the CNVs (Supplementary Data 15).

Gene burden tests

We performed gene burden tests based on CNVs that overlapped exons. Analyses were restricted to genes in which at least 24 CNVs were identified, leaving 645 genes with deletions (Supplementary Data 16) and 1596 genes with duplications (Supplementary Data 17). QQ plots are shown in Fig. 2c for deletions and 2d for duplications. The lambda for inflation was 1.18 (; lambda1000 = 1.00) for deletions and 1.07 (; lambda1000 = 1.00) for duplications.

For deletions, we found 10 genes with P < 0.01 (Table 4), the most significant being BRCA1 (OR = 7.66, P = 3.72E−18). Four of these 10 genes (ATM, BRCA1, BRCA2, and CHEK2) are known breast cancer susceptibility genes18. Deletions were also observed in two other known susceptibility genes: PALB2 (23 cases, 9 controls, OR = 2.02, P = 0.09) and RAD51C (21 cases, 9 controls, OR = 2.04, P = 0.08). The most significant novel association was for SUPT3H (OR = 0.27, P = 0.0004).

Table 4 Gene burden results for deletions.

For duplications we observed 15 genes with P < 0.01 (Table 5). The most significant association was for VPS53 (OR = 0.5, P = 0.0009). This gene and ABR (OR = 0.61, P = 0.008) both lie within the region on 17p which had the strongest association in the regional analysis. These associations were driven by duplications in different samples, with only one duplication in one sample overlapping both genes. Duplications were associated with an increased risk for only four of the 15 genes; the most statistically significant was RSU1 (OR = 3.4, P = 0.004). There was also some evidence of association with risk for duplications in BRCA1 (OR = 1.75, P = 0.01). However, analysis restricted to duplications that included exon 12 of BRCA1 showed clearer evidence of association (34 carriers, OR = 4.7, P = 0.0001), consistent with one of the more frequent known BRCA1 duplications that results in a frameshift22.

Table 5 Gene burden results for duplications.

The gene burden subtype results are shown in Supplementary Data 18 and 19. The strongest associations were observed for BRCA1 deletions for ER Negative (OR = 33, P = 5.5E−35) and Triple Negative (OR = 49, P = 7.1E−34) disease, CHEK2 deletions for ER positive disease (OR = 2.14, P = 0.0001) and ATM deletions for ER positive disease (OR = 4.85, P = 0.0001). No additional genes significant at P < 0.0001 were found.

Direction of effect tests

In the gene burden and individual probe analyses we observed a directional effect, whereby the strongest associations for deletions tended to increase risk and those for duplications tended to be protective. To test whether these associations deviated from what would be expected by chance, we computed ranked summed z-score tests and evaluated the significance of the maximum test statistic by permutation. Results are summarised in Table 6. The statistic for deletion regions was more significant than any of the permuted statistics (P = 0.04) but was reduced to P = 0.12 after removing the known genes BRCA1 and CHEK2. The significance of the gene burden test for deletions also was reduced from P = 0.04 to P = 0.2 when the known genes were removed. The statistic for the duplication regions was lower than any of the 50 permutations (P = 0.04). The gene burden analysis for duplications was not significant.

Table 6 Direction of effect results.

Discussion

We used the largest available breast cancer case-control dataset, comprising more than 86,000 cases and 76,000 controls with array genotyping, to test for associations with rare CNVs. Using the intensities from genotyping arrays to detect CNVs is not ideal due to a high level of noise and uncertainty in the calling, particularly for duplications. However, in tests of known CNVs and replication of calls between duplicate samples, the CamCNV method shows reasonable levels of sensitivity and specificity16. The main focus of this analysis was low frequency CNVs (<1% frequency) since higher frequency CNVs can generally be studied through imputation to a reference panel. In the 0.05–1% frequency range, we could detect ~20% of the CNVs identified by the 1000 Genomes project. For some loci, we only had evidence from one array because the probes do not exist to detect the variants on the other array. Thus, while this array-based approach provides power to evaluate the CNVs that can be assayed, much denser arrays or direct sequencing would be required to provide a complete evaluation of the contribution of CNVs.

In support of the reliability of the method, we detected evidence for both deletions and duplications in BRCA1, which was stronger for ER-negative disease, and for deletions in CHEK2, which were stronger for ER-positive disease. The latter appeared to be driven by a single founder deletion in East European populations. Weaker evidence of association was also observed for deletions in other susceptibility genes (BRCA2, ATM, PALB2, and RAD51C); the ORs were consistent with those seen for deleterious SNVs and indels18. In total, around 0.5% of cases in our analysis had a deletion in one of the known susceptibility genes with the proportion rising to ~1% for cases diagnosed under 50 years of age. The majority of coding deletions are expected to affect only part of the gene, with one study observing that a quarter covered only a single exon23. To detect all of these using array data would require at least three probes per exon. The OncoArray has this level of coverage for a few genes, including BRCA1 and BRCA2, but the coverage is lower for most genes and many coding CNVs will have been missed.

A key issue is the appropriate level of statistical significance to apply to these analyses. For the gene burden tests, P < 2.5 × 10−6, as used in exome-sequencing, seems an appropriate level. It is less clear what is appropriate for non-coding variants. A level of P < 5 × 10−8 has become standard in GWAS and has been shown to lead to acceptable replication, but this seems over-conservative for CNVs, which are more likely to be deleterious. Consistent with this, for at least two of the ~200 common susceptibility loci, the likely causal variant is a CNV, a higher fraction than expected given the relative frequencies of CNVs and SNPs. Based on frequency analysis of whole-genome sequence data Abel et al.1 estimated that rare CNVs are >800 times more likely to be deleterious than rare SNVs and >300 times more likely than rare indels. On the other hand, the significance level for non-coding CNVs should logically be more stringent than for the gene burden tests. Taken together, a significance level of ~10−6 seems appropriate, while associations at P < 0.001 may be worth following up in future studies. In our analyses, only the association at BRCA1 (both in the overall and gene burden tests) passes the higher threshold. We also calculated Bayesian False Discovery Probabilities (BFDPs)24 (Supplementary Data 20 and 21) for our associations using prior probabilities of 0.001 for regions and 0.002 for genes. Outside the known genes none of the BDFPs gave a probability below 10%, with the lowest BFDP of 0.11 for the deletion in the FGFR2 locus. For a CNV observed with a frequency of 0.1% (n = 91 samples in the OncoArray dataset) we had 40% power to detect an association with an odds ratio of 2 but only 1.5% power to detect an association with an odds ratio of 1.5. An OR of 2, comparable to that seen for deleterious variants in ATM and CHEK2, may be plausible for rare coding CNVs or non-coding CNVs that have a significant effect on gene expression. Larger sample sizes will clearly be required to evaluate rare CNVs with more modest associations.

In addition to the BRCA1 and CHEK2 loci, we found associations in three known susceptibility regions identified through GWAS, harbouring FGFR2, ADCY8, and KLF12. In each case, the variants are rarer than the established associated variants, but confer higher risks. The ADCY8 and KLF12 deletions are not in linkage disequilibrium with the associated SNPs. The FGFR2 deletion is in linkage disequilibrium with two of the likely causal common SNPs although there was still evidence of association with the deletion, albeit weaker, after conditional analysis. In-silico and functional analysis clearly demonstrate that FGFR2 is the target of the previously established variants;19,20 it will be interesting to establish if the same is true for the CNV.

Excluding loci in known susceptibility regions, the strongest evidence of association was for a 12 kb deletion (13:102121830–102133956) in the second intron of ITGBL1 (OR = 3.3, P = 0.00007 in the ER-positive analysis). This deletes a promoter flanking region (Ensembl ID: ENSR00001563823) and CTCF binding site (Ensembl ID: ENSR00001062398) active in mammary epithelial cells. There is experimental evidence that ITGBL1 expression, mediated by the RUNX2 transcription factor, enables breast cancer cells to form bone metastases25.

In the gene burden analysis, the strongest novel association was for deletions within SUPT3H, which were associated with a reduced risk. SUPT3H encodes human SPT3, a component of the STAGA complex that acts as a co-activator of the MYC oncoprotein26. SUPTH is located within the first intron of the RUNX2 transcription factor and the syntenic relationship between the two genes is highly evolutionarily conserved27. RUNX2 has a role in mammary gland development and high RUNX2 expression is found in ER-negative tumours28. The PCDHGB2 association appears to be due to a single variant (5:140739812–140740918) that deletes the first exon but as this gene is part of the protocadherin gamma gene cluster it is also possible that the deletion may be having an effect on one of the five genes that overlap PCDHGB2. It also deletes a promoter active in mammary epithelial cells (ENSR00001342785). The next strongest signals were for MEAK7 (OR = 2.19, P = 0.001), a gene implicated in a mTOR signalling pathway29, and MAD1L1 ((OR = 2.00, P = 0.005), a component of the mitotic spindle-assembly that has been suggested as a possible tumour suppressor30.

After BRCA1, the most significant association for ER Negative disease in the gene burden analysis was for CYP2C18 which overlaps CYP2C19 (ER-negative OR = 2.6, P = 0.002; triple-negative OR = 4.4, P = 0.0002). A previous analysis of CNVs and breast cancer in the Finnish population identified a founder mutation reaching an overall frequency of ~ 3% and reported a possible association at this locus for triple negative (OR 2.8, p = 0.02) and ER-negative breast cancer (OR = 2.2, p = 0.048)31.

The results from duplications are harder to interpret as there are often longer duplications overlapping whole genes with shorter variants covering some part of their length. For the gene burden analysis there was little evidence of strong associations. In the regional analysis, the two strongest associations cover multiple genes. The strongest evidence of association (OR = 0.69, P = 1.1E−05) was for a 1.5 Mb region at the start of chromosome 17 (17:13905–1559829). The probe-specific and gene burden results highlighted some stronger signals within this region, for example within NXN and VPS53, but the direction of effect was consistent across the region with 80% of the OncoArray probes having an odds ratio of 0.75 or lower (Fig. 3). This locus has established associations with prostate and colorectal cancer. Interestingly a possible association with ER-positive breast cancer survival was detected for a rare SNP in the first intron of NXN, rs118021774 (HR = 1.83, P = 3.8E−06)32. The detected duplications are not in LD with this SNP. For the 0.4 Mb duplication region on chromosome 21 (OR = 2.23, P = 0.0001) the probe-specific results from OncoArray highlighted that the signal is specific to a shorter intergenic region (21:33421860−33459975) between HUNK and LINC00159.

We observed some evidence of an aggregate directional effect, both for genes and non-genic regions, such that the deletions in aggregate were associated with increased risk. There was also some suggestion that duplications, in aggregate, were associated with a reduced risk. These results suggest that additional associations are present that could be established with a larger dataset. A new GWAS, Confluence (https://dceg.cancer.gov/research/cancer-types/breast-cancer/confluence-project), aims to double the available sample size for breast cancer. This GWAS includes probes specifically designed to assay some of the most significant CNVs observed in this study (those significant at P < 0.001), and the sample size should be sufficient to confirm or refute these associations.

Methods

Subjects

Data were derived from blood samples from study participants in 66 studies participating in the Breast Cancer Association Consortium (BCAC) and genotyped as part of the OncoArray9,33 and iCOGS8 collaborations (Supplementary Data 1). Studies included population-based and hospital-based case-control studies, and case-control studies nested within prospective cohorts; we only included data from studies that provided both cases and controls. Phenotype data were based on version 12 of the BCAC database. Cases were diagnosed with either invasive breast cancer or carcinoma-in situ. Oestrogen receptor (ER) status was determined from medical records or tissue microarray evaluation, where available. Analyses were restricted to participants of European ancestry, as defined by ancestry informative principal components8,9. Where samples were genotyped on both arrays, we excluded the iCOGS sample as the OncoArray has better genome-wide coverage. After sample quality control (see below), data on 36,980 cases and 34,706 controls with iCOGS genotyping, and 49,808 cases and 41,416 controls with OncoArray genotyping, were available for analysis (Supplementary Data 2).

Arrays

The Illumina iCOGs genotyping array8 includes 211,155 probes for SNVs and short insertions/deletions. Most variants were selected because of previous association in case-control studies for breast prostate and ovarian cancers, or for dense mapping of regions harbouring an association. The OncoArray includes 533,631 probes, of which approximately half were selected from the Illumina HumanCore backbone, a set of SNPs designed to tag the most common variants. The remainder were selected on the basis of evidence of previous association with breast, prostate, ovarian, lung, or colorectal cancer risk. Approximately 32,000 variants on the OncoArray were selected to provide dense coverage of associated loci and known genes. The remainder were mostly selected from lists of common variants ranked by p-value, with a small number from lists of candidate variants.

CNV calling

CNVs were called using the CamCNV pipeline as previously described16. In brief, the log R (LRR) intensity measurements and B allele frequency (BAF) for each sample at each probe were exported from Illumina’s Genome Studio software. A principal component adjustment (PCA) was applied to the LRR, grouped by study, to remove noise and batch effects. After removing noisy probes and those in regions with known common CNVs, the LRRs for each probe were converted to z-scores using the mean and standard deviation from all BCAC samples. Circular binary segmentation was applied to the z-scores ordered by probe position for each sample using the DNACopy package34. This produces a list of segments for each sample by chromosome where the z-score of consecutive probes changes by more than two standard deviations. Segments with a mean probe z-score between −3.7 and −14 were called as deletions and segments with a mean z-score between +2 and +10 as duplications. We restricted the calls to segments covering a minimum of three and a maximum of 200 probes.

As per the CamCNV pipeline, we then excluded deletions with inconsistent B Allele Frequency and CNVs with a shift in LRR at the sample level that was outside the expected range. The additional CNV exclusions are summarised in Supplementary Data 3. To exclude regions with a high level of noise we also excluded CNVs falling within 1 Mb of telomeres and centromeres and a number of immune loci such as the T-cell receptor genes where somatic mutations in the blood are often observed35.

Sample quality control

Standard sample quality control exclusions were performed, as previously described for the SNP genotype analyses8,9. These include exclusions for excess heterozygosity, ancestry outliers, mismatches with other genotyping, and close relatives. A stricter sample call rate of >99% was used for the CNV analysis, compared to >95% used in the genotype analyses. We also excluded any participants for whom a DNA sample was not collected from blood and any that had been whole genome amplified.

In addition, we used two metrics to exclude noisy samples liable to produce an excess of unreliable CNV calls. First, we calculated a derivative log ratio spread (DLRS) figure for each sample as the standard deviation of the differences between LRR for probes ordered by genomic position, divided by the square root of two. This measures the variance in the LRR from each probe to the next averaged over the whole genome and thus is insensitive to large fluctuations such as might be expected between different chromosomes in the same sample. An ideal sample would have a small DLRS as the only variance would come from a small number of genuine CNVs. We calculated the DLRS using the dLRs function in R package ADM3 (https://CRAN.R-project.org/package=ADM3) before and after the PCA. At both stages, we excluded samples with a DLRS more than 3.5 standard deviations above the mean DLRS for that study.

Second, we counted the number of short segments (between three and 200 probes) output by DNACopy for each sample. We observed that the distribution of segment counts was skewed to the right with an excess of samples with a large number of segments. We calculated a cut-off for the maximum number of segments using the following formula where x is the segment count for each sample (based on the rationale that the distribution of the true number of segments should be approximately Poisson):

$${y}=2* {{{{{\rm{sqrt}}}}}}({x})$$
$${{{{{\rm{cut}}}}}}-{{{{{\rm{off}}}}}}={{{{{\rm{median}}}}}}({y})+3.5$$

The sample exclusions resulting from these QC steps are summarised in Supplementary Data 4.

Association tests

All analyses were carried out separately for deletions and duplications, since different types of CNV at the same locus do not necessarily have the same effect on risk. As we were only assessing rare CNVs, we treated all carriers as heterozygotes and did not attempt to identify rare homozygotes.

To account for overlapping CNVs and imprecision in the breakpoints, we assigned individual CNVs to regions. To identify the regions, we moved sequentially along each chromosome, identifying the start as an Oncoarray probe position where deletions were observed in at least five samples, and then the end position as the probe position before the first probe where deletions were observed in fewer than five samples. Regions within five probes of each other were then merged together. The process was repeated for duplications. Regions were also merged such that the major susceptibility genes (BRCA1, BRCA2, and CHEK2) were included within a single region. We then assigned individual CNVs to regions where at least 90% of the CNV’s length fell within the region. For iCOGS, which generally has less dense probe coverage, we first assigned CNVs to the OncoArray regions where they showed >90% overlap. We then assigned any remaining CNVs to regions defined using the iCOGS probes, using the same procedure. Using this approach, 3306 deletion regions were identified from OncoArray data, 812 of which were also observed using iCOGS data, and 541 regions identified using iCOGS alone. For duplications, there were 2203 OncoArray regions, with 854 also observed using iCOGS data, and 483 iCOGS specific regions.

Associations were evaluated for each region using logistic regression, with breast cancer status as the outcome, and the presence of a CNV in the region (0 or 1) as a covariate to derive a log odds ratio per deletion/duplication. Statistical significance was evaluated using a likelihood ratio test (based on the above model and one excluding CNV as a covariate). The logistic regression analyses were conducted using in-house software (https://ccge.medschl.cam.ac.uk/software/mlogit/). Study and ten ancestry informative principal components, defined separately for each array, were also included as covariates. The analyses were conducted separately for the iCOGS and OncoArray and then combined in a standard fixed effect meta-analysis using the METAL software (after first deriving the standard error of the log-odds ratio from the likelihood test statistic)36. To avoid regions with too few observations, we excluded regions with fewer than 24 deletions or duplications (~0.015% of samples). Associations significant at P < 0.001 were considered noteworthy.

To detect more precisely the location of association signals, we also generated results for each probe. We created a vector of pseudo-genotypes for each probe with samples, such that a deletion covering that probe was coded as 1 and all other samples were coded as 0. We generated a similar set of genotypes for duplications. The results were analysed using logistic regression, as above.

To test for association between CNVs affecting the coding sequence of genes, in aggregate, and breast cancer risk, we identified samples with a deletion or duplication overlapping the exons of each gene. Exon positions were downloaded from the UCSC Genome Browser hg19 knownGene table. We used logistic regression to generate a log odds ratio (OR) for carriers of coding variants covering each gene, adjusted for study, as above. We generated results for each array and then for carriers combined across both arrays. For the combined analyses we treated studies with samples on both arrays as separate studies.

To calculate BFDPs we assumed a log-normally distributed prior effect size as described by Wakefield24. The prior log(OR) was determined by assuming a 95% probability that the OR was less than some bound K, where K = 3 for the regional and gene-based analysis, except for BRCA1 and BRCA2 where K = 20 was assumed. The prior probability of association was assumed to be 0.001 for the regional analysis, 0.99 for BRCA1, BRCA2, ATM, and CHEK2 and 0.002 for other genes. For the gene-based analysis, only positive associations were considered as the prior evidence for all genes was in favour of PTVs being positively associated with risk.

To determine whether there was a tendency for CNVs to be associated with an excess, or deficit, of risk across genes or regions, we computed signed z-scores as the square root of the chi-squared statistic for each gene, multiplied by ±1 depending on whether the effect estimate was positive or negative. These were ranked and normalised summed z-scores, based on the r most significant associations, were derived. The overall test statistic was the maximum summed z-score over all possible values of r:

$$U=\begin{array}{c}{\max }\\ r\le n\end{array}\frac{1}{\sqrt{r}}\mathop{\sum }\limits_{j=1}^{r}{z}_{j}$$
(1)

Where n is the total number of genes/regions being tested. The significance of U was then determined by permutation, randomly permuting case-control labels within study 50 times.

Ethical approval

All participating studies were approved by their appropriate ethics review board and all subjects provided written informed consent.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.