Introduction

Remarkable achievements have been made in large-scale genetic studies of common diseases and complex traits.1, 2 The identification of variants in the human genome that are convincingly associated with different phenotypes has mainly been carried out in individuals of European descent, although increasingly studies involving non-Caucasian samples from diverse population groups have been published or are currently being conducted. Genome-wide meta-analyses (GWMA) involving tens of thousands of samples have extended the success in allowing novel variants with smaller effect sizes to be discovered.3, 4, 5, 6, 7 Despite these triumphs, these findings really account for only a small fraction of the total disease heritability,8 suggesting undiscovered genetic mechanisms may be responsible or alternative methods to analyze these data may be necessary to address the missing heritability.

Current implementation of GWMA requires the same SNPs to display consistent evidence of phenotypic association across multiple populations. This implicitly assumes that across these populations, (i) the same causal variant is present; (ii) the linkage disequilibrium (LD) pattern between the causal variant and the assayed SNPs is similar and (iii) the effect sizes observed at the assayed SNPs are consistent.9, 10 Random effects methods for combining data across studies do not utilize information from neighboring SNPs that may present concurring evidence of disease association in different studies, and often have the tendency to weaken association signals.9 SNP-based meta-analyses also require the same SNPs to be genotyped in all the populations, although this requirement can apparently be addressed by imputation strategies that effectively standardize the SNP content across different studies11, 12, 13 (Figure 1). However, imputation does not always present an effective solution, particularly in the absence of appropriate population reference panels.14, 15

Figure 1
figure 1

Illustration of the three scenarios in a meta-analysis, where the genotyped SNPs may be in different degree of LD with the unobserved causal variant (star): (i) the ideal situation where the same SNPs are genotyped in two studies, and the LD between them and the functional variant is identical in both populations (black arrow); (ii) a realistic situation where the same markers are genotyped in two studies, but different LD patterns exist between them and the functional variant (green arrow); (iii) realistic situation where different markers are genotyped in two studies, and cannot be meta-analysed without resorting to imputation (pink arrow). The LD between the causal variant and each SNP is represented in different color intensity ranging from white (low LD) to red (high LD).

Before recent whole-genome sequencing endeavors, SNP discoveries were predominantly made in populations of European ancestry.16 This strong ascertainment bias has inadvertently skewed the SNPs surveyed in the International Hap-Map Project,17 which consequently prejudiced the SNP content of commercial genotyping platforms to carry tagging SNPs that are liable to exhibit higher minor allele frequencies in European populations.18 This means current genotyping arrays may be less optimal for non-European populations, resulting in attenuated association signals due to lower allele frequencies and weaker LD.14, 15 The pursuit of evidence stronger than the genome-wide significance is thus more challenging, and larger sample sizes in non-European studies and meta-analyses of ethnically mixed populations are required to compensate for variations in LD patterns from European populations.14, 19 Instead of seeking individual variants that display convincing evidence of phenotypic association across multiple populations, a more realistic scenario is perhaps to look for genomic regions with consistent clustering of SNPs exhibiting moderate signals in these populations.

In this paper, we propose a novel paradigm for interrogating genetic data for disease association given either a dichotomous or a quantitative outcome. Our method works by quantifying the degree of over-representation of associated SNPs in a pre-defined genomic region, given a specific definition of statistical significance. Through an eigen-decomposition of the matrix measuring the LD between every possible pair of SNPs in the region, the effective number of independent SNPs as well as the number of independent SNPs exhibiting evidence of phenotypic association can be evaluated (Supplementary Figure S1). The regional evidence of phenotypic association is thus quantified as the extent of over-representation of independent associated SNPs against the effective total number of independent SNPs in the region. This approach can be applied in a genome-wide fashion by considering moving windows of a fixed length within a population. In addition, this presents a natural framework for integrating the results from multiple studies in a region-based genome-wide meta-analysis, where we can sum up the number of independent signals and independent SNPs in each region across the different studies, and to calculate a single regional P-value for this meta-analysis by quantifying the joint extent of over-representation. This framework also allows a straightforward extension to consider evidence across genes and biological pathways.

Methods

Region-based analysis

Our region-based meta-analysis approach relies on the principle that when L* independent hypotheses are tested at a statistical significance threshold of α% (Pcrit), on average we expect αL*/100 of these hypotheses to display statistical evidence more significant than α% by chance. In the application within a genome-wide association study (GWAS), suppose there are 100 SNPs in a particular genomic window of 250 kb and the threshold for defining statistical significance has been set at 1%. Under the null hypothesis that none of the 100 SNPs are associated with the phenotype, we expect one SNP on average to exhibit a P-value that is <0.01 if the 100 SNPs are mutually independent. An over-representation of independent SNPs with P-value <0.01 in this genomic region thus corresponds to evidence that suggests the region is associated with the phenotype. However, the presence of LD implies the assumption of independence between the SNPs is unlikely to be valid.

In order to evaluate the effective number of ‘independent’ SNPs in each genomic region, we perform an eigen-decomposition of the L × L symmetric correlation matrix M between the L SNPs with entry mij denoting the LD in directional r2 between the ith and the jth SNP, where the direction is determined by the sign of D′. Here we assume the minor allele frequencies of all L SNPs are at least 1%. The resulting eigenvectors effectively represent mutually independent contributions in explaining the variance in the correlation matrix, and each eigenvector is given as a linear combination of SNPs that are in at least some degree of LD. The SNP loadings of each eigenvector measure the extent each SNP contributes to the eigenvector, and the relative loadings between the SNPs for each eigenvector provide a surrogate for the degree of correlation between the SNPs. The L eigenvectors thus represent independent sources of information from all the SNPs in the window, and the number of eigenvectors Ntotal that cumulatively accounts for τ% of the variance can be determined as argminl ∑i=1lλi≥τL%, for 1≤l≤L and where λi represents the eigenvalue corresponding to the ith eigenvector ei. Let w denote a vector of length L with the wi entry corresponding to one if the observed P-value for the ith SNP is <Pcrit, and zero otherwise. Suppose eij denote the jth component in the ith eigenvector, then the corresponding component in the ith scaled eigenvector represented by e’i is ∣eij∣/∑j=1L∣eij∣. The effective number of independent SNPs that exhibit P-value <Pcrit is thus calculated as

The regional evidence for the extent of over-representation of SNPs with P-values <Pcrit is calculated as the upper-tailed P-value of the exact Binomial test for observing Nhit out of Ntotal SNPs when the success probability is given as Pcrit. However, as Nhit can be a non-integer, we estimate the P-value associated with Nhit by linear interpolating between the P-values obtained for the floor and ceiling integer values of Nhit, or equivalently Pfloor+(Pceiling−Pfloor) × (Nhit−⌊Nhit⌋), where Pfloor and Pceiling denote the P-values associated with ⌊Nhit⌋ (the floor integer value of Nhit) and ⌈Nhit⌉ (the ceiling integer value of Nhit), respectively. Our simulation study suggests this procedure is conservative and tends to underestimate the evidence (Supplementary Figure S2), thus the true evidence is likely to be more significant than the resultant linear-interpolated P-value.

A region-based meta-analysis across K independent populations can be performed by calculating the corresponding Nhit(k) and Ntotal(k) from each population k in the same genomic window. The cumulative evidence across the K populations will then be quantified by the upper tailed P-value of the exact Binomial test for observing ∑k Nhit(k)out of∑k Ntotal(k) SNPs when the success probability is given as Pcrit.

Type 2 diabetes (T2D) data sets

We applied our region-based meta-analysis approach to combine the evidence from three separate genome-wide surveys of type 2 diabetes in the Chinese, South-East Asian Malays and Asian Indians from Singapore. Results from each individual survey and the SNP-based meta-analysis have been reported elsewhere.20 Briefly, the Chinese GWAS examined 2010 cases and 1945 controls (post-QC) that were typed on a mixture of Illumina (San Diego, CA, USA) 610 (1082 cases/1006 controls) and Illumina1M arrays (928 cases/939 controls). The corresponding numbers for the Malay and Indian GWAS were 794 cases/1240 controls and 977 cases/1169 controls, and these were all genotyped on the Illumina 610 arrays. A genome-wide region-based meta-analysis was first performed between the Chinese data that were genotyped on the two arrays to yield a single set of findings for the Chinese experiment. The three experiments for the different population groups were used as discovery cohorts for a region-based meta-analysis with a window size of 250 kb and a sliding gap of 50 kb such that two consecutive windows have a 200-kb overlap. We also performed a gene-based meta-analysis across 30 037 genes identified from the hg18 version of the TransMap UCSC gene mapping, with each window spanning a 100-kb flanking buffer from the start and end coordinates of each gene. A pathway analysis was also performed for 212 pathways in the KEGG database21, 22, 23 (http://www.genome.jp/kegg/pathway.html). Each gene (inclusive of a 25-kb flanking buffer) in a particular pathway was considered as a distinct window, except for genes within 50 kb of each other, which we merged as one discrete window. The intra-population evidence for a pathway was calculated from the summation of the effective number of independent significant and total SNPs across the windows. The P-value threshold (Pcrit) was set at 0.01. We identified any genomic region that exhibited P-value <0.001 in at least two populations from the region-based and gene-based analyses. This is an additional criteria to ensure that at least two populations are contributing to the observed signals, given the fundamental strategy of our approach is to identify genomic regions that are associated with the outcome in multiple populations. We excluded any regions that are known to carry copy number changes as estimation of LD is likely to be inaccurate in these regions. For the pathway analysis, we identified a pathway that exhibited P-value <0.05 in at least two populations. To avoid artificial signals of disease association that were the results of erroneous genotype calling, genotyping quality was visually ascertained in each cohort for every SNP located in the discovered regions from the region-based analysis. To validate the findings, similar analyses were performed on the type 2 diabetes data from Phase 1 of the Wellcome Trust Case Control Consortium (WTCC).19 Calculation of LD in each of the discovery and validation cohorts was performed with 500 control samples from the respective study.

Software implementation

The method described in this paper is implemented in three separate C++ programs: (i) regionalP for performing genome-wide region-based analysis; (ii) regionalP-gene for performing gene-based analysis; and (iii) regionalP-pathway for performing pathway-based analysis. The programs are available from http://www.statgen.nus.edu.sg/~software/regionalP.html.

Descriptions of the set up for the simulations, along with additional methods and analyses are available in the Supplementary Material online.

Results

Power and false-positive rates

We compared our method for regional analysis against standard SNP-based analyses with (i) only the genotyped SNPs or with (ii) the full set of SNPs after imputing against reference panels from phase 2 of the HapMap (HapMap2). In the meta-analysis combining the results from all three populations, the power of the region-based strategy was similar to that from a meta-analysis of the imputed SNPs (Figure 2). This was significantly higher than the power from the meta-analysis of only the genotyped SNPs. The false-positive rates of all three meta-analytic approaches were <5%, although the region-based approach had a near-zero false-positive rate when we imposed an additional restriction requiring at least two populations to exhibit P-values of <0.001 in the same region (Supplementary Table S1 online). At a genome-wide significance of 10−8, this additional restriction resulted in only a marginal decrease in statistical power, although this decrease was more substantial at less stringent significance thresholds. Investigating the sensitivity of our method by the allelic spectrum of the simulated causal variants in CEU, we observed the region-based approach was less powerful in identifying low-frequency causal variants (MAF of causal variant ≤5%) but was marginally more powerful for common causal variants (MAF of causal variant >5%, see Figure 2).

Figure 2
figure 2

Power comparisons of the different methods for the meta-analysis across all three Hapmap populations. Simulations were performed with HAPGEN (Wellcome Trust Centre for Human Genetics, Oxford, UK) assuming a causal variant that was present in all HapMap phase 2 panels with a multiplicative allelic relative risk of 1.5. The case–control genotype data were subsequently thinned to the SNP content of Affymetrix 500K (CEU simulations), Illumina 1M (JPT+CHB simulations) and Affymetrix 6.0 (Santa Clara, CA, USA) (YRI simulations). We calculate the power when only the genotyped SNPs were considered (green triangles), and when we performed region-based analyses of 100 kb regions in each of the three populations (red circles). Imputation was performed with population-specific haplotypes to recover the SNPs removed from the thinning (except for the causal SNP), and a SNP-based analysis was performed on this denser set of imputed and genotyped SNPs (blue diamonds). The SNP-based meta-analyses considered either the genotyped SNPs present across all three platforms only (green triangles) or across the denser set of imputed and genotyped SNPs common to all three populations (blue diamonds). The region-based meta-analysis was performed without restriction (red circles), and with the restriction that at least two populations display region-based P-value <0.001 (red open circles).

We also explored the performance of the three approaches in the presence of allelic heterogeneity, defined as having different causal variants in the same gene or genomic location across different populations. Specifically, we performed another series of simulations assuming two different causal variants in CEU and JPT+CHB, while allowing YRI to carry either of the two possible causal variants. Our simulation explicitly selected causal variants that are at least 20 kb away but within 50 kb of each other. The region-based approach significantly outperformed both SNP-based approaches in the meta-analyses across CEU and JPT+CHB, particularly at lower Type I errors and when LD between the two causal variants is low (Figure 3). When the LD between the two causal variants is high (r2>0.8), there is almost no difference in the results of the SNP-based meta-analyses of all three populations at higher Type I errors as compared with the power observed in our earlier simulations with only one causal variant. This is reassuring since we expect the two causal variants to behave as effectively a single variant when the LD is high. However, the low power experienced by the SNP-based methods in the presence of two separate causal variants reflects the inadequacy of SNP-based approaches for integrating data across diverse populations, and the greatest merit of the region-based approach is in the presence of allelic heterogeneity between populations where the different causal variants are in weak and non-existent LD.

Figure 3
figure 3

Power comparisons of the different methods for meta-analysis in the presence of allelic heterogeneity. A different causal variant was selected in CEU and JPT+CHB, respectively, while either of the two causal variants was equally likely to be present in the YRI simulations. The two causal variants are located at least 20 kb away but are not >50 kb apart, and have minor allele frequencies of at least 10% in all three HapMap populations. The case–control genotype data simulated from HAPGEN were subsequently thinned to the SNP content of Affymetrix 500K (CEU), Illumina 1M (JPT+CHB) and Affymetrix 6.0 (YRI). We calculated the power when only the CEU and JPT+CHB populations were combined (top row), and when all three HapMap panels were combined (bottom row), investigating the performance of the meta-analysis across the SNPs on all three arrays (green triangles), and for the region-based meta-analysis considering 250 kb regions (red circles). Imputation was performed with population-specific haplotypes to recover the SNPs removed from the thinning, and a SNP-based meta-analysis was performed on this denser set of imputed and genotyped SNPs common to all three populations (blue diamonds). We binned the 3000 pairs of causal variants according to the LD between the two SNPs into four groups: (i) 0≤r2≤0.1; (ii) 0.1 <r2≤0.3; (iii) 0.3 <r2≤0.8; (iv) 0.8 <r2≤1.

Application to T2D data

We applied our method to perform region-based, gene-based and pathway-based meta-analyses in three independent genome-wide studies of type 2 diabetes (T2D) involving the Chinese, Malays and Asian Indians in Singapore. This was performed across all the autosomal chromosomes within each of the three GWAS in a hypothesis-generating fashion, where for the region-based analyses we considered sliding windows of 250 kb each with a sliding distance of 50 kb such that every pair of consecutive windows overlapped by 200 kb. About half of the Chinese samples were genotyped on the Illumina 1M array, while the remaining half of the Chinese, Malay and Indian samples were genotyped on the Illumina 610 array. Results of the SNP-based meta-analyses using both the genotyped SNPs and the imputed SNPs have been reported elsewhere.20 Briefly, none of the SNPs achieved genome-wide significance in the meta-analyses, although variants in CDKAL1 and HHEX/IDE/KIF11 displayed moderate evidence of T2D association in at least two of the three populations. In particular, variants in CDKAL1 were found against a genomic background exhibiting substantial LD variations between the populations.24

The genome-wide meta-analysis with our region-based method identified five regions exhibiting P<0.001 in at least two of the three populations (Table 1). Other than the region on chromosome 6 that encompassed CDKAL1, the other four regions did not emerge in the SNP-based meta-analyses20 (Supplementary Table S2 online). In the replication experiment with the WTCCC data, two of these five regions displayed strong evidence of regional association (P<10−4) in the case–control T2D GWAS, which included the stretch on chromosome 6 encompassing CDKAL1 and the region on chromosome 3 between 21.73 and 22.13 Mb that encompassed ZNF659. Suggestive corroborative evidences (P<0.05) from WTCCC were also seen in the region on chromosome 2 that spanned the STK39 gene and the region on chromosome 14 containing the genes GNG2 and NID2. There was no evidence of regional association in the WTCCC for the remaining region on chromosome 20 spanning STX16 and NPEPL1.

Table 1 Results of the region-based meta-analysis for type 2 diabetes

Remarkably, all five regions have been previously implicated in diabetes, obesity or other cardiovascular biomarkers. The convincing signal for the region encompassing CDKAL1 is consistent with established findings for T2D,25, 26, 27, 28, 29, 30 while ZNF659 has been associated with young-onset type 2 diabetes in the American Indians.31 The STK39 gene has been consistently reported to harbor variants implicated in hypertension and in obesity and diabetes-related rodent quantitative trait loci.32 Previous pathway analysis has identified the G-protein GNG2 to be associated with type 1 diabetes,33 suggesting a serotonin modulating mechanism that is similarly relevant in the etiology of type 2 diabetes. Variants in STX16 have also been reported to significantly slow the reversal of insulin-stimulated glucose transport,34, 35 a biological mechanism that is highly relevant to T2D.

Discussion

The scale of GWMA with diverse European and non-European populations is expected to increase markedly given the popularity of genome-wide designs in studying the genetic etiology of common diseases and complex traits. This, however, increases the challenge of accommodating varying patterns of LD that may exist between genetically diverse populations, which can compromise the ability to reproduce the association signals from surrogate markers that are correlated to the unobserved functional polymorphisms. We have introduced an alternative strategy for combining the evidence across different populations that is robust to dissimilar patterns of LD surrounding a bona fide association signal. The approach is applicable to both case–control studies or in association studies of quantitative traits. Our method has also been shown to perform comparably to imputation-based meta-analysis, except it relies on available genotype information from the experiment without requiring additional reference data from appropriately matched populations. In the presence of allelic heterogeneity, our approach outperforms both SNP-based approaches using either genotyped or imputed SNPs. The application of the region-based method to three genome-wide surveys in T2D resulted in the discovery of novel and established regions that are subsequently validated with data from the WTCCC.

The region-based approach relies on the elegant application of the concept of statistical significance in evaluating a genomic region for evidence of trait association. For example, under the null hypothesis that the region is independent of the phenotype, we expect 5% of the SNPs to be statistically significant by chance when adopting a P-value threshold of 5%, if indeed all the SNPs in this region are mutually independent. If this assumption of mutual independence is true, an over-representation of statistically significant SNPs in this region constitutes evidence that this region is associated with the phenotype, with the extent of over-representation indicating the strength of the evidence. This is analogous to the use of 5 × 10−8 as the definition of genome-wide significance for assessing the likely authenticity of single markers. LD between the SNPs can confound the measurement of over-representation, as this can either inflate the number of significant signals, which increases false positives, or produce an inflated estimate of the total number of SNPs, which decreases statistical power. The eigen-decomposition of the LD matrix allows the effective number of independent SNPs to be estimated and consequently, also the effective number of independent association signals that are statistically significant. By surveying the same genomic region across different independent populations, the same statistical framework can be extended to consolidate the evidence from multiple populations, simply by summing the effective number of independent SNPs and signals across these populations and assessing the evidence for an over-representation of significant signals. This provides a simple but, yet, effective solution to combining the results from experiments that use different genotyping platforms. By searching for the same regions rather than the same SNPs to emerge in the different GWAS, inter-population variation in LD patterns between the assayed SNPs and the causal variant is expected to have lesser impact on the sensitivity of our approach.

One feature of our method is the ability to sharpen the association evidence in regions containing multiple weak signals across different ethnic groups. These signals may be weaker as a result of SNP ascertainment biases in the design of genotyping arrays, resulting in weaker LD between the assayed SNPs and the causal variants. The current definition of genome-wide significance excludes many potential signals to be considered in a bid to protect against the abundance of false discoveries that is associated with testing in excess of a million hypotheses. This poses a significant challenge to genome-wide studies and GWMA in populations with short LD, such as African populations,15, 36 as it is less likely for variants to be in sufficient LD to exhibit statistical evidence stronger than the stringent threshold. Furthermore, the greater genetic diversity that is common of such populations means it is not immediately straightforward to compensate for the lower LD by increasing the effective sample size through a meta-analysis of several populations. Our method thus provides a viable solution within a sound statistical framework to exploit and combine the evidence from SNPs that are weakly associated with the phenotype.

The application of analytical methods that investigate regions in the genome rather than relying on individual SNPs is not a new concept. Neither is implementing a statistical strategy to estimate the effective number of independent association tests in the presence of LD. Numerous approaches have in fact been introduced to address the issue of multiple testing in the presence of correlated SNPs.37, 38, 39, 40, 41, 42, 43 However, these methods either assign the most significant SNP-based evidence as the statistical evidence for the set of loci,39 or do not explicitly incorporate the association evidence in adjusting for the effective number of tests.38, 40, 41, 42, 43 A recent region-based approach adopted a more sophisticated approach that borrows information from surrounding SNPs, although it tends to rely on heuristic measures such as the proximity to specific genomic features (eg, known genes, evolutionarily conserved regions and haplotype blocks) for defining SNP clusters.44 In our opinion, the imputation frameworks that MACH12 and IMPUTE13 are built on provide a more natural way to incorporate information from surrounding SNPs without relying on pre-defined features that may not adequately account for the correlation between SNPs. We thus benchmarked our method against the performance of the imputation-based approach, which has become the strategy of choice in recent genome-wide studies. More importantly, neither of the previous region-based approaches provide a natural solution to integrate the evidence across multiple genome-wide studies in a meta-analysis, nor adequately manage the complexity due to allelic heterogeneity.

We have proposed a novel and powerful strategy for querying the genome for genotype–phenotype associations that realistically manages the challenges imposed by the fundamental design of genome-wide studies and in combining several such studies from diverse populations. We envisage this approach has the potential to be further developed for burden-related tests of rare or low-frequency variants across multiple heterogeneous populations, which is an emerging issue given the increasing popularity of exome-sequencing experiments across numerous traits.