Background

Although inherited susceptibility underlies 35% of variance in colorectal cancer (CRC) risk (Lichtenstein et al, 2000), high-penetrance germline mutations account for <6% of cases (Aaltonen et al, 2007). Much of the remaining variation in genetic risk is likely to be a consequence of the co-inheritance of multiple low-penetrance variants, some of which are common.

The ‘common-disease common-variant’ model of CRC implies that association analyses based on scans of polymorphic variants should be a powerful strategy for identifying low-penetrance susceptibility alleles. This assertion has recently been vindicated by genome-wide association (GWA) studies, which have provided robust evidence for several common low-risk variants influencing CRC risk (Tomlinson et al, 2005, 2007; Broderick et al, 2007; Zanke et al, 2007; Houlston et al, 2008; Jaeger et al, 2008; Tenesa et al, 2008). Although the risk of CRC associated with each of these common variants is individually modest, they make a significant contribution to the overall disease burden by virtue of their high frequencies in the population. Moreover, by acting in concert with each other, they have the potential to significantly affect an individual's risk of developing CRC. Hence, this class of susceptibility alleles is potentially of public health importance, allowing risk stratification within populations. One benefit of such risk prediction is that the invasiveness or frequency of large bowel screening could be tailored to population subgroups, eventually leading to a reduction in mortality and even incidence through secondary prevention. Finally, the identification of new risk variants may identify new cancer pathways that may lead, in time, to the development of new prevention or treatment strategies for CRC.

To facilitate the study of predisposition to CRC, we established COGENT (COlorectal cancer GENeTics), an international consortium with the goal of identifying and characterising low-penetrance genetic variants that predispose to CRC. In this study, we review the rationale for studying low-penetrance susceptibility to CRC and our proposed strategy for COGENT.

Difficulties in conducting methodologically rigorous association studies

To date, most association studies based on the candidate gene approach have evaluated only a restricted number of polymorphisms, primarily in genes implicated in the metabolism of dietary carcinogens and the protection of DNA from carcinogen-induced damage. Results from these studies have largely been disappointing, with numerous positive associations initially reported from analyses of small case–control series remaining unconfirmed by subsequent analyses. Only a minority of studies have reported case–control data on the same variants, allowing pooling of data (Table 1). Although P-values from meta-analyses of such studies provide limited support for the role of variants in MTHFR (Huang et al, 2007; Hubner and Houlston, 2007), CCND1 (Tan et al, 2008), GSTT1 (de Jong et al, 2002), XPC (Zhang et al, 2008), NQO1 (Chao et al, 2006) and NAT2 (Chen et al, 2005), such analyses should be interpreted with caution even if publication bias is ignored. The false-positive report probability (FPRP) (Wacholder et al, 2004), which integrates the prior probability of association with statistical power, provides one method for assessing the robustness of summary estimates derived from pooled analyses. Although prior probabilities are partly subjective, influenced by previous findings and experimental evidence with regard to the known impact of variants, the prior probability for variants in candidate genes is unlikely to be better than 1 in 1000 (or 0.001) (Thomas and Clayton, 2004). With a ‘best case’ prior probability of 0.001 and a stipulated odds ratio of 1.2 for associations, the likelihood that any of these variants is truly associated with CRC risk is not high (i.e., FPRP >0.2, above the level suggested to be appropriate for summary analyses (Wacholder et al, 2004)). Hence, despite much research, few, if any, definitive susceptibility alleles for CRC had been unequivocally identified through association studies before the advent of GWA studies.

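
As an illustration of how the FPRP combines these quantities, the calculation can be sketched as follows. This is a minimal sketch of the general formula FPRP = α(1 − π) / [α(1 − π) + (1 − β)π], in which α is the observed P-value, 1 − β the power to detect the stipulated odds ratio and π the prior probability. The function name and the input values are illustrative assumptions and are not taken from Table 1; in practice the power would be derived from the stipulated odds ratio and the precision of the pooled estimate rather than supplied directly.

```python
def fprp(p_value, power, prior):
    """False-positive report probability for a single reported association.

    p_value : observed P-value (alpha)
    power   : power to detect the stipulated odds ratio at that alpha (1 - beta)
    prior   : prior probability that the variant is truly associated (pi)
    """
    false_positive = p_value * (1.0 - prior)  # null is true, yet result is significant
    true_positive = power * prior             # association is real and is detected
    return false_positive / (false_positive + true_positive)


# Illustrative inputs only: a 'best case' prior of 0.001 and an assumed power of 0.8.
for p in (0.05, 1e-3, 1e-5):
    print(f"P = {p:g}: FPRP = {fprp(p, power=0.8, prior=0.001):.3f}")
```

Under these assumed inputs, a nominally significant P-value of 0.05 leaves the FPRP far above 0.2, whereas a much smaller P-value is needed before the probability of a false-positive report becomes low.
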
The accumulated experience to date has served to highlight the difficulties in conducting statistically and methodologically rigorous association studies to identify new cancer predisposition loci. The main issues are summarised below:

  1. The increase in CRC risk conferred by any common polymorphic variant is almost certainly small (i.e., typical relative risk 1.2). The inherent statistical uncertainty of case–control studies involving just a few hundred cases and controls severely constrains study power to reliably identify genetic determinants conferring modest, but potentially important, risks.

  2. Because of the large number of polymorphisms in the genome, false-positive associations are inevitably more frequent than true-positive associations when testing large numbers of genetic markers (especially when using off-the-shelf SNP arrays), even if studies are conducted in a scientifically rigorous manner. Hence, associations need to attain a high level of statistical significance to be established beyond reasonable doubt. For this reason, in GWA studies, a P-value threshold of 5.0 × 10⁻⁷ has been advocated and is generally considered to be appropriate for genome-wide significance.

  3. Positive associations need to be replicated in independent case–control series to further limit the type 1 error rate. However, to increase power, the populations from which these case–control series are ascertained need to have similar ancestry and, ideally, the same linkage disequilibrium (LD) structure.

  4. It should be recognised that cancers such as CRC are somewhat heterogeneous with respect to aetiology and biology. Specifically for CRC, colonic and rectal disease may have different risk factors and a different spectrum of somatic mutations and epimutations. A given variant may therefore not affect the risk of all histological forms of CRC, and the power of any analysis stratified by histology is limited because of the smaller numbers of cases in each group.

  5. Careful attention must be paid to population stratification as a source of confounding, because cancer rates and allele frequencies vary with race/ethnicity. This is one possible explanation for some of the false-positive associations reported in the literature.

  6. Epidemiological risk factor data should ideally be taken into consideration to allow the examination of interactions between known aetiological factors (e.g., dietary risk factors) and genetic risk variants. As very large sample sizes are probably needed to detect interactions, the power of these types of analyses in the association studies reported to date has been extremely limited.

  7. Rare germline polymorphisms may be more highly penetrant and have significance for individuals, although the population-attributable risk may be low. Extreme examples include the previously identified mutations in DNA repair enzymes and Lynch syndrome. Only through genotyping and sequencing of large numbers of individuals can additional rare variants that confer important individual risk be identified. Advances in sequencing technology make this feasible.
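
The first two issues above can be made concrete with an approximate power calculation. The following is a minimal sketch, assuming a two-sided test comparing risk-allele frequencies between cases and controls under a normal approximation; the function name, the control allele frequency of 0.3 and the sample sizes are illustrative assumptions rather than values from any of the studies cited.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(odds_ratio, control_freq, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic test (normal approximation)."""
    norm = NormalDist()
    # Risk-allele frequency in cases implied by the per-allele odds ratio.
    case_odds = odds_ratio * control_freq / (1.0 - control_freq)
    case_freq = case_odds / (1.0 + case_odds)
    # Each individual contributes two alleles.
    se = sqrt(case_freq * (1.0 - case_freq) / (2 * n_cases)
              + control_freq * (1.0 - control_freq) / (2 * n_controls))
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return norm.cdf(abs(case_freq - control_freq) / se - z_crit)


# Hypothetical scenarios: relative risk 1.2, control risk-allele frequency 0.3.
for n in (300, 2000, 10000):
    for alpha in (0.05, 5.0e-7):
        power = allelic_power(1.2, 0.3, n, n, alpha)
        print(f"n = {n} cases and controls, alpha = {alpha:g}: power = {power:.3f}")
```

Under these assumptions, a series of a few hundred cases and controls has little power even at a nominal threshold of 0.05 and essentially none at 5.0 × 10⁻⁷, whereas series of around ten thousand cases and controls are well powered at the genome-wide threshold.
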

Table 1 Polymorphisms reported to be statistically significant in meta-analyses

Characteristics of low-penetrance variants

Most studies aimed at identifying low-penetrance alleles for cancer susceptibility have been based on a candidate gene approach formulated on preconceptions of pathology pertaining to the role of specific genes in the development of CRC. However, without a clear understanding of the biology of predisposition, the choice of suitable genes for the disease is inherently problematic, and very few susceptibility loci for CRC have been identified using this strategy. An unbiased approach to genetic analysis is therefore required.

The availability of high-resolution LD maps and hence of comprehensive sets of tagging SNPs that capture most of the common sequence variation allows GWA studies for disease associations to be efficiently conducted. This approach is agnostic in that it does not depend on previous knowledge of function or presumptive involvement of any gene in disease causation. Moreover, it minimises the probability of failing to identify important common variants in hitherto unstudied loci (i.e., genes and regulatory regions).

Three GWA studies of CRC have so far been reported and 10 independent loci have been shown conclusively to be associated with CRC risk: 8q24.21, 11q23, 18q21.1 (SMAD7), 8q23.1 (EIF3H), 15q (GREM1), 19q13.1 (RHPN2), 20p12.3, 14q22.2 (BMP4), 16q22.1 (CDH1) and 10p14 (Tomlinson et al, 2005, 2007; Broderick et al, 2007; Zanke et al, 2007; Houlston et al, 2008; Jaeger et al, 2008; Tenesa et al, 2008). Risks associated with each of the common variants at each of these loci are modest (ORs 1.1–1.3; Table 2) and there is little evidence of interactive effects. With homozygous risk variants conferring twice the heterozygote risk, the distribution of risk alleles follows a normal distribution in both cases and controls, with a shift towards a higher number of risk alleles in affected individuals consistent with a polygenic model of disease predisposition (Figure 1A). Figure 1B shows the ORs relative to the median number of risk alleles. Individuals with 15+ risk alleles have at least a three-fold increase in risk compared with those with a median number of risk alleles.

Table 2 The 10 loci associated with colorectal cancer risk identified from GWA studies (Tenesa and Dunlop, 2009)

Figure 1 Polygenic model of colorectal cancer susceptibility. (A) Distribution of risk alleles for CRC in cases (black bars) and controls (grey bars); (B) plot of the increasing ORs for CRC with increasing number of risk alleles. The ORs are relative to the median number of risk alleles; vertical bars correspond to 95% confidence intervals. Data from Houlston et al (2008).
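
The polygenic model described above can be sketched numerically. The following is a minimal illustration, assuming 10 independent loci in Hardy–Weinberg equilibrium, a single shared per-allele OR acting multiplicatively (so that homozygotes carry the per-allele effect twice), and a rare disease so that controls approximate the general population. The allele frequencies and the per-allele OR of 1.15 are placeholders and are not the estimates in Table 2, so the output will not reproduce Figure 1.

```python
def allele_count_distribution(risk_allele_freqs):
    """Distribution of the total risk-allele count in controls.

    Assumes independent loci, each in Hardy-Weinberg equilibrium.
    """
    dist = [1.0]  # before any locus is added, the count is 0 with certainty
    for f in risk_allele_freqs:
        per_locus = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]  # 0, 1 or 2 risk alleles
        updated = [0.0] * (len(dist) + 2)
        for count, p_count in enumerate(dist):
            for extra, p_extra in enumerate(per_locus):
                updated[count + extra] += p_count * p_extra
        dist = updated
    return dist


def case_distribution(control_dist, per_allele_or):
    """Reweight the control distribution by OR**count (shared per-allele OR)."""
    weights = [p * per_allele_or ** k for k, p in enumerate(control_dist)]
    total = sum(weights)
    return [w / total for w in weights]


def median_count(dist):
    """Smallest allele count at which the cumulative probability reaches 0.5."""
    cumulative = 0.0
    for count, p in enumerate(dist):
        cumulative += p
        if cumulative >= 0.5:
            return count
    return len(dist) - 1


def mean_count(dist):
    return sum(k * p for k, p in enumerate(dist))


# Placeholder inputs (assumptions for illustration only).
freqs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
per_allele_or = 1.15

controls = allele_count_distribution(freqs)
cases = case_distribution(controls, per_allele_or)
reference = median_count(controls)
for k in range(len(controls)):
    relative_or = per_allele_or ** (k - reference)  # OR relative to the median count
    print(k, round(controls[k], 4), round(cases[k], 4), round(relative_or, 2))
```

The case distribution is shifted towards higher allele counts relative to controls, and the OR relative to the median rises geometrically with each additional risk allele, which is the qualitative pattern described for Figure 1.
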

Data from these GWA studies and results from similar gene discovery efforts in other tumours are proving to be highly informative with regard to the allelic architecture of cancer susceptibility in general. The number of common variants that account for more than 1% of inherited risk is very low and only a small proportion of the heritability of any cancer can be explained by currently identified loci. Estimates of the contribution of currently identified loci to the excess familial risk of CRC may be conservative, as the tagging SNPs may be imperfect surrogates for the true aetiological variants. Multiple causal variants may also exist at each locus, including low-frequency variants with significantly larger cumulative effects on risk. Few of the observed disease-associated variants are coding variants, with many of the loci mapping to regions bereft of genes or protein-encoding transcripts. It is likely that much of the common variation in cancer risk is mediated through sequence changes influencing gene expression, perhaps in a subtle manner, or through effects on pathway components mitigated by functional redundancy.

Future directions

Prospects for identifying additional common variants

The power of existing GWA studies to identify common alleles conferring risks of 1.2 or greater (such as the 8q24 variant) is high. Hence, there are unlikely to be many additional CRC SNPs with similar effects for alleles with frequencies >0.2 in populations of European ancestry. Recent studies have had low power to detect alleles with smaller effects and/or minor allele frequencies (MAFs) <0.1. By implication, variants with such profiles are likely to collectively confer substantial risk because of their multiplicity or sub-maximal LD with tagging SNPs. The tagging SNPs used for GWA studies capture on average 80% of common SNPs in the European population (i.e., r² > 0.8), but only 12% of SNPs with MAFs of 5–10% are tagged at this level, limiting the power to detect this class of susceptibility allele. GWA-based strategies are not configured optimally to identify low-frequency variants with potentially stronger effects or to identify recessively acting alleles. Nor are current arrays formatted ideally to capture copy number variants or other structural variants such as small-scale insertions or deletions, which may affect CRC risk. It is therefore highly likely that a large number of low-penetrance variants remain to be discovered. This assertion is supported by the continued excess of associations observed over those expected in studies reported to date. Further efforts to expand the scale of GWA meta-analyses, in terms of both sample size and SNP coverage, and to increase the number of SNPs taken forward to large-scale replication may identify additional variants for CRC.

Analyses of most GWA studies have so far been primarily directed towards identifying single locus SNP associations. It is possible that analyses based on haplotypes of markers may identify ‘rarer’ disease alleles that may be present on rare haplotypes missed by single SNP analyses. Under certain circumstances, especially in which interaction effects are large and main effects are small, gene–gene interactions may be detected where no locus with a main effect has been identified (Marchini et al, 2005). Multi-locus approaches may therefore be the focus of future experiments as they may yield greater power to detect associations under certain genetic models.

Identifying causal variants

Validated tagSNPs are highly unlikely to directly cause CRC. Identifying the causal variant from a tagSNP that is statistically associated with disease is difficult. Although blocks of LD allow the efficient survey of the genome, they hamper fine mapping of the disease-associated region. Different ethnic groups are likely to have different LD block patterns and they can therefore be used to refine the location of a disease susceptibility locus before resequencing and functional analyses. The usefulness of this approach depends on the size of the study and SNP allele frequencies in different ethnic groups. Other challenges in some of these populations include lower exposure to environmental risk factors with lower CRC incidence, incomplete case ascertainment and recording tools, and the absence of large sample sets.

Incorporating non-genetic risk factors into risk models

Colorectal cancer risk will probably be determined by complex interactions between the various genetic and lifestyle/dietary risk factors. Epidemiological studies have established several dietary risk factors for colorectal neoplasia, including low vegetable consumption, high meat consumption (especially of processed meat), micronutrient deficiency and excessive alcohol intake. The associations of CRC with smoking and lack of physical activity are weaker. Common genetic variants are likely to interact with these environmental–lifestyle risk factors to modify risk. Furthermore, common gene variants will have a role in determining the effectiveness of chemoprevention agents such as non-steroidal anti-inflammatory drugs, hormone replacement therapy and micronutrient supplementation.

In assessing the interplay between inherited and non-genetic risk factors, analyses using different population cohorts will be highly informative. Wider comparisons between the population genetics of different ethnic groups have shown that SNP allele frequencies can vary greatly among ethnic groups, principally as a result of founder effects and genetic drift. Indeed, some SNPs may be informative in one population and not in another. At least in principle and probably in practice, some variants may have stronger or weaker effects on disease, depending on environment or general genetic background, as observed in inbred lines of mice.

Identification of the interaction between genetic variants and environmental risk factors is contingent on very large data sets ideally from different population cohorts, something only achievable through multi-centre collaborations. Even with such collaborative efforts, incorporating environmental risk factor data into models of predisposition is likely to be a serious challenge as, although ethnicity can be defined through genotype, environmental background is harder to standardise.

Inherited prognostic and predictive variants

In addition to influencing the risk of developing CRC, inherited genetic host factors are likely to influence the natural course of the disease. As a potential prognostic factor, the concept of germline variation imparting inter-individual variability in tumour development, progression and metastasis is receiving increasing attention. Compared with breast cancer, studies of the impact of germline variation on CRC prognosis have been more limited. Prognostic studies have generally examined the same candidate genes that are considered to have a role in predisposition. Genetic variation affecting inter-individual disease expression may influence the later stages of malignancy rather than early events associated with an inherited predisposition. Variants in growth factor, apoptosis or immune surveillance signalling pathways, for instance, might not cause CRC initiation but could have a substantial effect on the outcome of established disease. Chemotherapy response and toxicity may be related to germline genotype. Linking GWA data to patient outcome provides an attractive strategy for identifying new prognostic markers. It is essential to impose appropriate statistical thresholds and conduct replication analyses to avoid reporting false positives.

Rationale for the COGENT consortium

The recognition that low-penetrance alleles contribute to the inherited risk of CRC represents a major advance in our understanding. In view of the above-noted issues, over the last few years, collaborations have been steadily developing between groups in the United Kingdom, Canada, the Americas, Holland, Germany, Finland, Spain and Australia that are engaged in ongoing searches for low-penetrance CRC variants through association-based studies. What initially began as relatively loose affiliations centred around work on specific projects has begun to crystallise into a more formal collaborative network after replication analyses of two published GWA studies. To continue and expand collaboration, a meeting was held at the University of Leiden, the Netherlands, in January 2009 to review ongoing association studies. The meeting brought together an international team of researchers with expertise encompassing genetic epidemiology, statistical genetics, gene mapping, biology, molecular genetics, pathology, and the diagnosis and clinical management of CRC.

There was a consensus among participants that many of the challenges inherent in this field can best be addressed by international cooperative efforts, and the group unanimously decided to establish a CRC association consortium. An invitation to join COGENT was subsequently extended to other groups known to be performing CRC association studies and was well received. At present, 20 groups that are performing case–control genetic association studies have joined COGENT (Table 3). The eligibility criterion for inclusion is involvement in a case–control study based on at least 500 cases and 500 controls sampled from the same population. The sample size limit aims to ameliorate the potential statistical, biological and technological/methodological confounding effects of small sample sizes (Moonesinghe et al, 2008). Collectively, over 48 000 cases and 43 000 controls have so far been accrued by COGENT researchers (Table 3).

Table 3 Number of CRC cases and controls currently established by COGENT consortium members

In each of the study centres, collection of samples and of clinico-pathological information has been undertaken with informed consent and relevant ethical review board approval in accordance with the tenets of the Declaration of Helsinki. Material transfer agreements have already been used between partners to allow for sharing of individualised data, and similar procedures will be adopted for future collaborative work.

Data pooling provides a very cost-effective approach to achieve an adequate power for subgroup analyses, which are unlikely to have sufficient sample sizes in a single study. Several potential problems need to be considered at the stage of data pooling. Given that individual studies have different data formats, covariates from individual studies will be agreed upon and compiled into a common set of variables relevant to specific projects. Study data sets sent from different centres will be checked for outliers, aberrant distribution, inadmissible values and inconsistencies before pooling to ensure data accuracy. Systematic variation between centres in terms of genotyping will be assessed globally using principal components and on a per-SNP basis. Discrepancies can be cross-verified with study centres.
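
Two of the pre-pooling checks described above, screening for inadmissible values and comparing centres on a per-SNP basis, might be sketched as follows. This is an illustration only: the record layout (the `centre`, `status` and `genotype` fields), the genotype coding as a count of risk alleles and the function names are hypothetical and do not describe the consortium's actual data formats or procedures.

```python
from collections import defaultdict

# Hypothetical coding: number of risk alleles at one SNP; None marks a missing call.
ADMISSIBLE_GENOTYPES = {0, 1, 2, None}


def inadmissible_records(records):
    """Return the records whose genotype code is not an admissible value."""
    return [r for r in records if r["genotype"] not in ADMISSIBLE_GENOTYPES]


def control_allele_freq_by_centre(records):
    """Risk-allele frequency among controls with a called genotype, per centre."""
    risk_alleles = defaultdict(int)
    chromosomes = defaultdict(int)
    for r in records:
        if r["status"] != "control" or r["genotype"] not in (0, 1, 2):
            continue  # skip cases, missing calls and inadmissible codes
        risk_alleles[r["centre"]] += r["genotype"]
        chromosomes[r["centre"]] += 2
    return {c: risk_alleles[c] / chromosomes[c] for c in chromosomes}


# Toy records for a single SNP (invented values).
example = [
    {"centre": "A", "status": "control", "genotype": 0},
    {"centre": "A", "status": "control", "genotype": 2},
    {"centre": "A", "status": "case", "genotype": 1},
    {"centre": "B", "status": "control", "genotype": 1},
    {"centre": "B", "status": "control", "genotype": None},
    {"centre": "B", "status": "control", "genotype": 3},
]
print(inadmissible_records(example))
print(control_allele_freq_by_centre(example))
```

Markedly different control allele frequencies between centres drawn from populations of similar ancestry would flag a SNP for cross-verification with the study centres before pooling.
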

COGENT represents the first international collaborative study seeking to comprehensively understand the impact of low-penetrance susceptibility to CRC and to describe the genetic landscape of the disease. The immediate goal of the group is to work collaboratively to study polymorphisms that were previously associated with CRC risk and to plan for future high-quality studies. Past productive collaboration has laid the groundwork for these future studies, centred initially on the expansion of discovery and replication of GWA studies, with biological analyses of variants and epidemiological studies as longer-term aims.