Background

Prostate cancer (PrCa) is the most common cancer in men living in the Western world, with a lifetime risk of approximately 1 in 8 for men in Europe and the USA (http://www.info.cancerresearchuk.org/cancerstats/types/prostate/). Its aetiology remains poorly understood. The increased risk of PrCa for men with a family history is consistent with variation in genetic susceptibility to the disease (Edwards and Eeles 2004). Genome-wide association studies (GWAS) are a proven way to identify common disease susceptibility alleles. Thus far, such studies have identified SNPs associated with prostate cancer in more than 30 independent regions (Liu et al. 2010). We previously conducted a GWAS in two stages and identified SNPs in seven regions associated with PrCa risk (Eeles et al. 2008). One of these SNPs, rs2735839 on 19q13.3, lies within an LD block of 34 kb, containing the kallikrein genes, KLK3 and KLK2 (combined stage 1 and 2 P = 2.3 × 10−17). This SNP was also shown to be associated with serum PSA level; however, an association with PrCa has been demonstrated using samples not ascertained by PSA screening, suggesting that the association is not solely due to selection of cases with elevated PSA and controls with a low PSA level (Ahn et al. 2008).

KLK3 encodes prostate-specific antigen (PSA), a member of the kallikrein family of serine proteases. Its main physiological function is to cleave semenogelins in the seminal coagulum, thereby regulating sperm motility. Like other kallikreins, it might also participate in the process of neoplastic growth and metastasis of PrCa (Lawrence et al. 2010; Paliouras et al. 2007). The expression of PSA is highly restricted to normal and malignant prostate epithelial cells in men. During tumorigenesis, however, serum PSA levels tend to be elevated, probably due to the loss of normal architecture of the gland. For this reason, PSA is extensively used as a biomarker to screen for and monitor treatment of PrCa.

The association with rs2735839 implies that one or more variants in the KLK3 region, correlated with rs2735839, are directly associated with PrCa risk. In an attempt to identify such variants, we performed a comprehensive analysis of SNPs across the region, using a combination of imputation and further genotyping.

Materials and methods

Samples

Analyses were based on the samples genotyped in first and second stages of an UK/Australian GWAS, collected as previously described (Eeles et al. 2008, 2009), together with a third stage involving a further 4,901 cases and 4,847 controls. Stage 3 samples were selected from UKGPCS as for stages 1 and 2; from Studies of Epidemiology and Risk factors in Cancer Heredity (SEARCH), a case–control study based on the region covered by the Eastern UK Cancer Registry and Information Centre (ECRIC); and from the Australian epidemiological studies as in stage 2 (Supplementary Table 1). We also included data from the Cancer Genetic Markers of Susceptibility (CGEMS) study, a GWAS of 1,117 PrCa cases and 1,105 controls drawn from the PLCO study (http://www.cgems.cancer.gov/).

Imputation

To evaluate associations with PrCa for SNPs not included in genotyping arrays used in the GWAS, we defined a 34 kb LD region that included all SNPs in the region correlated with the original best hit, rs2735839, at r 2 > 0.1 according to the HapMap Phase II CEU dataset. We then identified all SNPs in the interval from the HapMap Phase II as well as SNPs identified through sequencing of European (CEU) samples in the 1000 Genomes study at a frequency of at least 2%. Genotypes for these SNPs in stages 1 and 2 of the UK/Australian GWAS, and the CGEMS study, were then imputed using MACH 1.0 (http://www.sph.umich.edu/csg/abecasis/MACH/) and 50 rounds of imputation. We included SNPs with quality score (r 2) > 0.3 in the analysis.

Genotyping

SNP rs17632542 was genotyped for stages 1 and 2 samples by the 5′-endonuclease assay (Taqman™) using the ABIPrism 7900HT sequence detection system according to the manufacturer’s instructions. Primers and probes were supplied by Applied Biosystems (http://www.appliedbiosystems.com/) as Assays-By-Design™. The Stage 3 genotyping was done using an Illumina Golden Gate Assay (http://www.illumina.com).

Statistical methods

We assessed associations between each SNP and disease using a 1 df.

Cochran–Armitage trend test, stratified by GWAS stage (UK/Australia stage 1, 2 or 3, and CGEMS). For imputed SNPs, we used the corresponding test based on the estimated allele dose. Odds ratios and confidence limits were estimated using unconditional logistic regression, stratified by stage. Tests of homogeneity of the odds ratios across strata were assessed using likelihood ratio tests. To determine independently associated SNPs from imputed data, we identified all SNPs significant at P < 10−7 from combined stages 1, 2 and CGEMS and used forward and backward stepwise logistic regression, based on estimated allele dose. SNPs were included in the model if they were significant at P < 0.001 after adjustment for other SNPs in the model. For the genotyped SNPs, we identified all SNPs significant at P < 10−7 in the stages 1, 2, 3 and CGEMS data combined and used forward and backward stepwise logistic regression.

The associations of SNP genotypes with PSA level were assessed using linear regression, after log transformation of PSA level to reduce skewness. Analyses were performed in R and STATA.

Molecular dynamics modelling the structure of PSA

Potential conformations of the original and mutated crystal structure models were investigated using energy minimisation and force field (Goodford 1985). To gain further insights into the effects of substitution at Ile179 we simulated solvated Ile179 and Ile179Thr. Structure files were generated using Visual Molecular Dynamics (VMD) (Humphrey et al. 1996). The system was equilibrated using Nanoscale Molecular Dynamics (NAMD) (Phillips et al. 2005) and simulated using ACEMD (Giupponi et al. 2008) with the Chemistry at HARvard Molecular Mechanics (CHARMM) (Brooks et al. 2009) force field. Ten independent simulations were carried out for both variants which were then analysed to determine the models’ average energies.

Results and discussion

We defined a 34-kb region on 19q13.3 based on HapMap phase II CEU data that included all SNPs correlated with rs2735839, the most strongly associated SNP in the GWAS, at r 2 > 0.1. In this region, we identified 197 SNPs in HapMap2 and 312 SNPs in the 1000 Genomes Project. Genotypes for these SNPs were imputed for subjects in stages 1 and 2 of the UK/Australian GWAS and CGEMS. Based on the combined results, we identified 35 SNPs associated with PrCa at P < 10−7. The most statistically significant associated SNPs were rs17632542 and rs62113212 (P = 3.0 × 10−24 and P = 2.2 × 10−24, respectively, Supplementary Table 2). These two SNPs were strongly correlated with one another (r 2 = 0.99 in HapMap CEU and r 2 = 0.75 in 1000 Genomes, Supplementary Table 3); neither of these SNPs was genotyped in the original GWAS.

To further evaluate these associations, we genotyped rs17632542 directly for the stages 1 and 2 sample set (5,504 cases and 5,834 controls) using a Taqman assay. We also genotyped rs17632542 in a further set of 4,901 PrCa cases and 4,847 controls (unselected for PSA) as part of an Illumina Golden Gate Assay (stage 3). We also genotyped the original associated SNP, rs2735839, and two other SNPs (rs266849 and rs1058205) that reached P < 10−7 in the original GWAS stages 1 and 2. The estimated per-allele ORs for rs17632542 were similar to those estimated from the imputed data, and the statistical significance was further increased, stage 1 per allele OR 0.35 (95% CI 0.30–0.42) P = 2.6 × 10−29; stage 2 per allele OR 0.78 (95% CI 0.68–0.90) P = 4 × 10−4; stage 3 per allele OR 0.65 (95% CI 0.57–0.74), P = 4.2 × 10−10; combined P = 1.9 × 10−34 over all stages. The estimated OR in homozygotes for the C allele was lower than in heterozygotes in all stages, in a matter consistent with a log-additive allele dose model (Table 1). As expected, the estimated effected size was larger in stage 1, reflecting the selection of cases for family history of disease and the controls for low PSA (see “Materials and methods”), but the estimated ORs in stages 2 and 3 were similar.

Table 1 Summary results for 4 SNPs in the KLK3 region at 19q13.3

In a multiple logistic regression analysis, rs17632542 remained significant after adjustment for rs2735839 (P = 8.6 × 10−14) based on all stages combined, whereas rs2735839 showed weaker association after adjustment for rs17632542 (P = 0.02).

To determine whether there are additional SNPs in the region, other than rs17632542, that are independently associated with PrCa, we used two different analytical approaches. First, we used stepwise logistic regression analyses of 7 SNPs in this region that were genotyped in all stages and were significant at P < 10−7. In this analysis, one additional SNP, rs266849, remained in the model at P < 0.001 after adjusting for rs17632542 (P = 0.001, Table 2) Secondly, we used the imputed data (Stages 1, 2 and CGEMS) and put all 35 SNPs significant at P < 10−7 in a logistic regression analysis (Supplementary Table 2). From this analysis, one SNP was significant at P < 0.001 after adjusting for rs17632542 (either rs2659052 or rs2569753, as both were correlated with one another, r 2 = 0.64). rs266849 is about 12 kb centromeric to rs17632542 between the KLK15 and KLK3 genes. rs266849 and rs2659052 were weakly correlated with each other and with rs17632542 (Fig. 1; Supplementary Table 3), suggesting that there are two additional associated variants. In both analyses, however, rs17632542 remained the most significantly associated SNP after adjusting for the other significant SNPs (Table 2; Supplementary Table 2).

Table 2 Stepwise logistic regression analysis of 7 SNPs in this region associated with PrCa risk at P < 10−7 (stages 1, 2, 3 and CGEMS)
Fig. 1
figure 1

Regional LD plot and multiple transcripts of the KLK3 gene at 19q13.3. The newly identified SNPs associated with PrCa risk are labelled using blue diamonds, two SNPs are intragenic in KLK3, rs17632542 is a coding variant in exon 4, rs 62113212 is in intron 3

We previously reported that rs2735839 was associated with serum PSA level (Eeles et al. 2008). Based on the serum PSA levels of the stages 2 and 3 controls, the C allele of rs17632542 was also associated with lower PSA levels, with lowest levels in CC homozygotes (P = 1.5 × 10−11, Supplementary Table 4; Supplementary Fig. 1). In a multiple regression analysis, rs2735839 was marginally significantly associated with PSA levels, after adjusting for rs17632542 (P = 0.06). In a recent publication (Gudmundsson et al. 2010) it was also reported that at the KLK region rs17632542 was significantly associated with PSA level and prostate cancer risk.

We found evidence for a significant association between rs17632542 and Gleason score, with a stronger association for lower grade disease (based on Gleason score <7 vs. 7+; P-trend = 7.04 × 10−7 and based on Gleason score <8 vs. 8+ P-trend 3.2 × 10−7, Supplementary Table 5). A similar association has previously been reported for rs2735839 (Kader et al. 2009). We found no association between rs17632542 and family history or age of onset of the disease (data not shown).

Taken together, the above results suggest that observed associations with both PrCa and PSA level are largely driven by rs17632542 or rs62113212. These SNPs are intragenic to KLK3: rs17632542 is a non-synonymous SNP in exon 4 resulting in an Ile179Thr substitution, while rs62113212 is intronic and lies in a sequence region with no enhancer or splice effect function (Fig. 1). However, based on the TFSEARCH database, http://www.cbrc.jp/research/db/TFSEARCH.html, rs62113212 changes the scores for the binding of the octamer factor 1 (Oct-1) transcription factor. Although we cannot rule out rs62113212 as the directly associated variant and it remains possible that another correlated variant was missed by the resequencing in the 1000 genomes dataset, rs17632542 appears to be the most plausible functional variant.

Based on the previous structural studies, 3D modeling and X-ray crystallography of PSA and other kallikreins, Ile179 is not part of the active catalytic site (Villoutreix et al. 1994). However, when we carried out multiple molecular dynamic simulations in explicit solvent, we found that the solvated threonine (Thr) variant consistently achieved lower energy levels compared with the isoleucine (Ile) variant (−3,532 ± 54 kcal/mol compared with −3,053 ± 41 kcal/mol), suggesting superior stability in solution. Furthermore, modeling suggested that the Thr variant causes displacement of the kallikrein loop (residues 111–118), although verification of this will require empirical structural determination (Fig. 2).

Fig. 2
figure 2

Comparative molecular dynamic simulation of threonine and isoleucine PSA variants. Structures shown are averaged from 10 independent simulations. a Ribbon diagram of I179T variant of PSA coloured according to side chain displacement from the isoleucine variant. Position of the catalytic triad is highlighted by a pink oval and T179 substitution by a green oval. The kallikrein loop’s position is indicated by a red rectangle. b Molecular surface view of I179T variant PSA showing solvent accessible surface coloured according to RMS displacement from wild-type (I179) PSA. Maximal displacement is coloured red, whilst minimal displacement is blue as indicated by the heat map

Because multiple alternative transcripts have been reported for KLK3 in normal and tumor-derived tissue (Heuze-Vourc’h et al. 2003; Pampalakis et al. 2008; Whitbread et al. 2010), we also assessed the possible effect of the nucleotide substitution at the pre-mRNA level using the Human Splice Finder bioinformatic prediction tool (Desmet et al. 2009). New sites for exonic splice enhancer motifs (SRp40, SRp55) were created in the presence of the rs17632542 minor allele (C) by two algorithms. This putative enhancer site is close to the splicing site for PSA-RP2, a splice variant which results in retention of the intron between exons 3 and 4. Notably, the same splice variant is found in four other KLK genes (Michael et al. 2005) (KLK1, KLK2, KLK5 and KLK15) that surround KLK3 at the centomeric end of the locus, suggesting this may be a biologically important splicing event, given the evolutionary conservation. Further molecular studies (EMSA, luciferase assay) would be required to determine if this SNP has an effect at the pre-mRNA level.

The contribution of common genetic variants in KLK3 to PrCa risk has been a subject of considerable debate because SNPs in this region were also found to be associated with serum levels of PSA. The association seen for rs2735839 in stage 1 of our GWAS could reflect the use of low PSA controls. However, the association was also seen in stage 2 of the GWAS, in the CGEMS study, and in a validation study of our GWAS results in 29 groups in the PRACTICAL (Prostate cancer association group to investigate cancer associated alterations in the genome) consortium in 56,300 samples (unpublished data), the majority of which have not selected controls on the basis of PSA level. There is still some potential for confounding due to opportunistic PSA testing or testing as part of the diagnostic process. The newly identified SNP, rs17632542, exhibits a more significant association with PrCa than the previous tag SNP and studies of this SNP in men who have prostate biopsy as the primary PrCa screen, irrespective of PSA levels, could determine whether this SNP is causally related to PrCa risk independent of PSA. Given its stronger effect on PrCa risk, inclusion of rs17632542 in place of rs2735839 in risk prediction models will improve discrimination in such models.

As this variant is predicted to introduce changes in the stability of the protein and might have a role in pre-mRNA processing of alternative transcripts; it is a good candidate variant for further functional investigation of the underpinnings of the GWAS signal in this region with respect to PrCa risk. Similarly, additional studies are required to determine if this coding variant also influences PSA levels.