Statistical Genetics

Goldstein, Darlene R.

doi:10.1007/978-1-4614-1347-9_12

Part of the book series: Selected Works in Probability and Statistics ((SWPS))

Abstract

Terry Speed has produced many interesting and important contributions to the field of statistical genetics, with work encompassing both experimental crosses and human pedigrees. He has been instrumental in uncovering and elucidating algebraic structure underlying a diverse range of statistical problems, providing new and unifying insights.

Here, I provide a brief commentary and introduction to some of the key building blocks for an understanding of the papers. Some readers may also find useful a refresher on group action (see e.g. Fraleigh [6]) and hidden Markov models [13].

Linkage mapping

Linkage analysis studies inheritance of traits in families, with the aim of determining the chromosomal location of genes influencing the trait. The analysis proceeds by tracking patterns of coinheritance of the trait of interest and other traits or genetic markers, relying on the varying degree of recombination between trait and marker loci to map the loci relative to one another.

A measure of the degree of linkage is the recombination fraction θ, the chance of recombination occurring between two loci. For unlinked genes, θ = 1/2; for linked genes, 0 ≤ θ < 1/2.

Human pedigrees

S. Dudoit and T. P. Speed (1999). A score test for linkage using identity by descent data from sibships. Annals of Statistics 27:943–986.

This paper offers a novel and comprehensive algebraic view of sib-pair methods, fundamentally unifying a large collection of apparently ad hoc procedures and providing powerful insights into the methods.

Identical by descent allele sharing

Data for linkage analysis consist of sets of related individuals (pedigrees) and information on the genetic marker and/or trait genotypes or phenotypes. The recombination fraction is most commonly estimated by maximum likelihood for an appropriate genetic model for the coinheritance of the loci.

However, likelihood-based linkage analysis can be difficult to carry out because the mode of inheritance may be complex and is in any case usually unknown. Nonparametric approaches are thus appealing, since they do not require a genetic inheritance model to be specified. Such methods typically focus on identical by descent (IBD) allele sharing at a locus between a pair of relatives. DNA at a locus is shared by two relatives identical by descent if it originated from the same ancestral chromosome. In families of individuals possessing the trait of interest, there is association between the trait and allele sharing at loci linked to trait susceptibility loci, which can be used to localize trait susceptibility genes.

Testing for linkage with IBD data has developed differently, depending on the type of trait. For qualitative traits, tests are based on IBD sharing conditional on phenotypes. Affected sib-pair methods are a popular choice; these are often described as nonparametric since the mode of inheritance does not need to be specified (see Hauser and Boehnke [10] for a review). On the other hand, for quantitative trait loci (QTL), linkage analysis is based on examination of phenotypes conditional on sharing (for example, the method of Haseman and Elston [9] or one of its many extensions).

Inheritance vector

The pattern of IBD sharing at a locus within a pedigree is summarized by an inheritance vector, which completely specifies the ancestral source of DNA. For sibships of size k, label the paternally derived alleles at the locus (1, 2) and the maternally derived alleles (3, 4). The inheritance vector at a given locus is the vector \(x = ({x}_{1},{x}_{2},\ldots,{x}_{2k-1},{x}_{2k})\), where for sib i, \(x_{2i-1}\) is the label of the paternally inherited allele (1 or 2) and \(x_{2i}\) is that of the maternally inherited allele (3 or 4) at the locus.

For a pair of sibs, when paternal and maternal allele sharing are not distinguished, the 16 possible inheritance vectors give rise to three IBD configurations \(C_j\): the sibs may share 0, 1, or 2 alleles IBD at the locus (Table 12.1). The IBD configurations can be thought of as orbits of groups acting on the set of possible inheritance vectors \(\mathcal{X}\) [2].
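The partition of the 16 sib-pair inheritance vectors into three configurations can be checked by direct enumeration. The following is a minimal illustrative sketch; the function names are hypothetical, and the allele labels follow the convention used above (paternal 1 or 2, maternal 3 or 4).

```python
from collections import Counter
from itertools import product


def sib_pair_inheritance_vectors():
    """All vectors (x1, x2, x3, x4): x1, x3 are paternal labels (1 or 2),
    x2, x4 are maternal labels (3 or 4), for sib 1 and sib 2 respectively."""
    return list(product((1, 2), (3, 4), (1, 2), (3, 4)))


def ibd_count(x):
    """Number of alleles the two sibs share IBD at the locus (0, 1, or 2)."""
    x1, x2, x3, x4 = x
    return int(x1 == x3) + int(x2 == x4)


vectors = sib_pair_inheritance_vectors()
config_sizes = Counter(ibd_count(x) for x in vectors)
print(len(vectors), dict(sorted(config_sizes.items())))
# prints: 16 {0: 4, 1: 8, 2: 4}
```

Since all 16 vectors are equally likely under Mendelian segregation, these counts give the familiar null sharing probabilities 1/4, 1/2, 1/4.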

Score test for linkage

The literature contains several proposed tests of the null hypothesis of no linkage (\(H_0\): θ = 1/2) based on score functions of IBD configurations for sibships and other pedigrees, with scores chosen to yield good power against a particular alternative. The score test of Dudoit and Speed to detect linkage represents a major breakthrough in that it creates a coherent, unified approach to the linkage analysis of qualitative and quantitative traits using IBD data. The likelihood for the recombination fraction θ, conditional on the phenotypes of the relatives, is used to form a score test of the null hypothesis of no linkage (θ = 1/2).

Table 12.1 Sib-pair IBD configurations

The probability vector of IBD configurations, conditional on pedigree phenotypes, at a marker locus linked to a trait susceptibility locus at recombination fraction θ can be written as \(\rho(\theta,\pi)_{1\times m} = \pi_{1\times m}\,T(\theta)_{m\times m}\), where π represents the conditional probability vector for IBD configurations at the trait locus and the number of IBD configurations is m. T(θ) denotes the transition matrix between IBD configurations at loci separated by recombination fraction θ.
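For sib pairs (m = 3), the transition matrix can be built from first principles by summing over inheritance vectors. The sketch below is illustrative only: the function names are hypothetical, grandparental origin is encoded as 0/1 for each of the four meioses, and the example vector π is made up. Averaging over the vectors within a configuration is a convenience; by symmetry each vector in a configuration gives the same row.

```python
from itertools import product


def ibd_transition_matrix(theta):
    """T[j][k] = P(share k alleles IBD at marker | share j at trait locus), sib pairs."""
    # Each vector is (pat sib1, mat sib1, pat sib2, mat sib2), entries 0/1.
    vectors = list(product((0, 1), repeat=4))

    def ibd(v):
        return int(v[0] == v[2]) + int(v[1] == v[3])

    totals = [[0.0] * 3 for _ in range(3)]
    sizes = [0, 0, 0]
    for x in vectors:
        sizes[ibd(x)] += 1
        for y in vectors:
            d = sum(a != b for a, b in zip(x, y))  # number of recombinant meioses
            totals[ibd(x)][ibd(y)] += theta ** d * (1 - theta) ** (4 - d)
    return [[totals[j][k] / sizes[j] for k in range(3)] for j in range(3)]


def marker_ibd_distribution(pi, theta):
    """rho(theta, pi) = pi T(theta) as a plain list of length 3."""
    T = ibd_transition_matrix(theta)
    return [sum(pi[j] * T[j][k] for j in range(3)) for k in range(3)]


# Made-up trait-locus sharing probabilities, for illustration only.
rho = marker_ibd_distribution((0.0, 0.5, 0.5), theta=0.1)
```

At θ = 1/2 every row of T equals (1/4, 1/2, 1/4), so the marker carries no information about sharing at the trait locus; at θ = 0 the matrix is the identity.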

In general, the probability vector π depends on unknown genetic parameters. However, using their formulation of the problem, Dudoit and Speed [4] show rigorously that for affected sibships of a given size, the second-order Taylor series expansion of the log likelihood around the null of no linkage is independent of the genetic inheritance model. They thus provide a mathematically justified basis for affected sib-pair methods, which do not require an assumed mode of inheritance.
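A short sketch for the sib-pair case suggests why the expansion is taken to second order. The notation here is introduced for illustration and is not taken from the paper: \(N_j\) is the number of pairs observed in configuration \(C_j\), ℓ is the log likelihood, and ψ is the standard quantity through which the sib-pair transition matrix depends on θ.

```latex
\begin{align*}
\ell(\theta) &= \sum_{j} N_j \log \rho_j(\theta,\pi),
  \qquad \rho(\theta,\pi) = \pi\,T(\theta), \\
\psi(\theta) &= \theta^2 + (1-\theta)^2,
  \qquad \psi'(\theta) = 4\theta - 2,
  \qquad \psi'(1/2) = 0, \\
\rho'(\theta) &= \pi\,\frac{dT}{d\psi}\,\psi'(\theta)
  \quad\Longrightarrow\quad
  \ell'(1/2) = \sum_{j} N_j\,\frac{\rho_j'(1/2)}{\rho_j(1/2)} = 0, \\
\ell(\theta) &\approx \ell(1/2) + \tfrac{1}{2}\,(\theta - 1/2)^2\,\ell''(1/2).
\end{align*}
```

Because the first-order term vanishes at the null in this setting, information about linkage near θ = 1/2 enters through the second-order term.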

Practical advantages of the score test include: it is locally most powerful for alternatives close to the null; any genotype distribution can be used (i.e., Hardy-Weinberg equilibrium is not required); conditioning on phenotypes eliminates selection bias introduced by nonrandom ascertainment; and combining differently ascertained pairs is straightforward, providing the important benefit of allowing us to avoid discarding any data. For many realistic simulation scenarios [7, 8], the score test proves to be robust and shows large power gains over commonly used nonparametric tests.

Although the paper focuses on pairs of sibs, the same score test approach is also applicable to any set of relatives [3].

Experimental crosses

N. J. Armstrong, M. S. McPeek and T. P. Speed (2006). Incorporating interference into linkage analysis for experimental crosses. Biostatistics 7:374–386.

This paper improves multilocus linkage analysis of experimental crosses by incorporating a realistic model of crossover interference, and implementing it by extending the Lander-Green algorithm for genetic reconstruction. It represents the culmination of a series of studies of the modeling of crossover interference.

χ² model of crossover interference

During the (four-strand) process of crossing over in meiosis, two types of interference (nonindependence) are distinguished: chromatid interference, a situation in which the occurrence of a crossover between any pair of nonsister chromatids affects the probability of those chromatids being involved in other crossovers in the same meiosis; and crossover interference, which refers to nonrandom location of chiasmata along a chromosome.

Most genetic mapping is carried out assuming independence; that is, no chromatid interference and no crossover interference. This assumption simplifies likelihood calculations. Although there is little empirical evidence for chromatid interference, there is substantial evidence of crossover interference. Thus, a more realistic model incorporating crossover interference should be able to provide more accurately estimated genetic maps.

The χ² model of crossover interference [5] provides a dramatically improved fit over a wide range of models [12, 14]. This model assumes that recombination intermediates (structures formed after initiation of recombination) are resolved in one of two ways: either with or without crossing over. Recombination initiation events are assumed to occur according to a Poisson process, but constraints on the resolution of intermediates create interference. The χ² model assumes m unobserved non-crossover intermediates between successive crossovers. For m = 0, every intermediate is resolved as a crossover and the model reduces to the no (crossover) interference model. This model is a special case of the more general gamma model, but has the advantage of being computationally more feasible.
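The counting rule behind the model can be illustrated with a small simulation. This is a sketch under stated assumptions, not code from any of the cited papers: the function names and the `rate` parameter are hypothetical, m is taken to count the non-crossover intermediates between successive crossovers (so m = 0 means every intermediate becomes a crossover), and the position of the first crossover within the cycle is drawn uniformly.

```python
import random


def resolve_intermediates(positions, m, first_crossover_index):
    """Keep every (m+1)-th intermediate as a crossover, starting at the given index."""
    return positions[first_crossover_index::m + 1]


def simulate_crossovers(length, rate, m, rng):
    """Crossover positions on [0, length) under the counting rule.

    `rate` is the assumed intensity of the Poisson process of intermediates.
    """
    positions = []
    t = rng.expovariate(rate)  # exponential gaps give a Poisson process
    while t < length:
        positions.append(t)
        t += rng.expovariate(rate)
    start = rng.randrange(m + 1)  # uniform phase within the cycle
    return resolve_intermediates(positions, m, start)


rng = random.Random(0)
crossovers = simulate_crossovers(length=1.0, rate=2 * (4 + 1), m=4, rng=rng)
```

With m = 0 the crossovers are the Poisson events themselves. For larger m, successive crossovers are separated by m + 1 exponential gaps, which makes their spacing more regular than under independence.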

Genetic reconstruction and the Lander-Green algorithm

Genetic mapping in humans can be viewed as a missing data problem, since we are typically unable to observe the complete data (the number of recombinant and nonrecombinant meioses for each interval). If complete data were available, maximum likelihood estimates of a set of recombination fractions θ_i, \(i = 1,\ldots, T - 1\), for adjacent markers \({\mathcal{M}}_{1},\ldots, {\mathcal{M}}_{T}\) would just be the observed proportion of recombinants in an interval.
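With complete data the estimate is a simple proportion. The sketch below assumes a hypothetical encoding in which each meiosis is a sequence of 0/1 grandparental origins at the T ordered markers; the function name and the example data are made up for illustration.

```python
def recombination_fraction_mle(meioses):
    """Observed proportion of recombinants in each of the T-1 marker intervals.

    meioses: list of equal-length sequences of 0/1 grandparental origins,
    one sequence per fully observed meiosis.
    """
    n = len(meioses)
    n_markers = len(meioses[0])
    return [
        sum(h[i] != h[i + 1] for h in meioses) / n
        for i in range(n_markers - 1)
    ]


# Four made-up meioses typed at three markers.
example = [(0, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 1)]
print(recombination_fraction_mle(example))
# prints: [0.25, 0.75]
```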

The genetic reconstruction problem is to determine the expected number of recombinations that occurred in intervals of adjacent markers, given genotypes at multiple marker loci in a pedigree and the recombination fraction for each interval. Reconstruction is straightforward when there is complete genotype information, including the ancestral origin (paternal or maternal).

More commonly this information is not known, so a different strategy for likelihood calculation is needed to obtain recombination fraction estimates. Lander and Green [11] proposed an approach based on the use of inheritance vectors. They showed that the probability of the observed data can be calculated for any particular inheritance vector and that under no crossover interference, the inheritance vectors form a Markov chain along the chromosome. They model the pedigree and data as a hidden Markov model, where the hidden states are the (unobserved) inheritance vectors. The complexity of their algorithm for calculating likelihoods increases linearly with the number of markers but exponentially in pedigree size, making it appropriate for analysis of many markers on small to moderately sized pedigrees.
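The hidden Markov structure can be sketched with a plain forward recursion over inheritance vectors. This is an illustrative sketch, not the published algorithm: the function name is hypothetical, inheritance vectors are encoded as integers with one bit per meiosis, and the per-marker probabilities in `emissions` are assumed to be supplied by the caller.

```python
def lander_green_likelihood(emissions, thetas, n_meioses):
    """Likelihood of marker data via a forward pass over inheritance vectors.

    emissions[t][v]: P(data at marker t | inheritance vector v), assumed given
    thetas[t]: recombination fraction between markers t and t+1
    n_meioses: number of meioses, so there are 2**n_meioses hidden states
    """
    n_states = 2 ** n_meioses
    # Under Mendelian segregation all inheritance vectors are equally likely.
    alpha = [emissions[0][v] / n_states for v in range(n_states)]
    for t, theta in enumerate(thetas, start=1):
        # Transition probability depends only on the number of meioses that differ.
        step = [theta ** d * (1 - theta) ** (n_meioses - d)
                for d in range(n_meioses + 1)]
        alpha = [
            emissions[t][w]
            * sum(alpha[v] * step[bin(v ^ w).count("1")] for v in range(n_states))
            for w in range(n_states)
        ]
    return sum(alpha)
```

The loop over markers is linear in their number, while the state space doubles with every additional meiosis, which mirrors the scaling described above. This naive version does work of order 4^n per marker for n meioses.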

In experimental crosses, mapping is generally more straightforward since investigators can arrange crosses to produce complete data. However, the presence of unobserved recombination initiation points creates a new kind of missing data when the no interference model is not assumed. The creative insight of Armstrong et al. [1] is to model the crossover interference process as a hidden Markov model. This step works because even though crossovers resulting from initiation events do not occur independently (in the presence of crossover interference), the initiation events themselves are assumed to be independent. Armstrong et al. [1] are thus able to extend the Lander-Green algorithm to incorporate interference according to the χ² model, thereby providing more accurately estimated genetic maps.
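The idea can be sketched for a single meiotic product with fully observed recombination indicators, as in a backcross. This is a simplified illustration under stated assumptions and not the implementation of Armstrong et al.: the function name is hypothetical, m counts the non-crossover intermediates between successive crossovers, there is no chromatid interference, intermediates occur at an assumed rate of 2(m + 1) per Morgan on the four-strand bundle, and the Poisson sum is truncated at `max_events`. The hidden state is the number of intermediates since the last crossover.

```python
import math


def chi_square_likelihood(recombinant, distances, m, max_events=200):
    """Likelihood of interval recombination indicators for one meiotic product.

    recombinant[t]: 1 if interval t is recombinant, else 0
    distances[t]: genetic length of interval t in Morgans
    """
    n_states = m + 1
    alpha = [1.0 / n_states] * n_states  # uniform starting phase of the counter
    for r, d in zip(recombinant, distances):
        mean = 2.0 * n_states * d  # expected intermediates in the interval (assumed scaling)
        new_alpha = [0.0] * n_states
        pois = math.exp(-mean)  # P(K = 0)
        for k in range(max_events + 1):
            for c in range(n_states):
                n_cross, c_next = divmod(c + k, n_states)
                # With at least one crossover on the bundle, the product is
                # recombinant with probability 1/2; with none, it cannot be.
                emit = 0.5 if n_cross > 0 else float(r == 0)
                new_alpha[c_next] += alpha[c] * pois * emit
            pois *= mean / (k + 1)  # P(K = k + 1)
        alpha = new_alpha
    return sum(alpha)
```

With m = 0 there is a single hidden state, and the probability of recombination in an interval of length d reduces to (1 − e^{−2d})/2, the no-interference (Haldane) map function, independently across intervals.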

Conclusion

Terry’s work in statistical genetics has identified underlying commonalities across seemingly disparate procedures, contributing meaningful theoretical and practical improvements. An impressive aspect of these works is the fresh perspective offered by viewing the problems at a stripped-down, fundamental level. Applying an exceptional combination of extensive mathematical expertise and pragmatic sensibility, Terry provides inventive solutions and a richer structural understanding of significant questions in statistical genetics.

References

[1] N. J. Armstrong, M. S. McPeek, and T. P. Speed. Incorporating interference into linkage analysis for experimental crosses. Biostatistics, 7(3):374–386, 2006.
[2] K. P. Donnelly. The probability that related individuals share some section of genome identical by descent. Theor. Popul. Biol., 23:34–63, 1983.
[3] S. Dudoit. Linkage Analysis of Complex Human Traits Using Identity by Descent Data. PhD thesis, Department of Statistics, University of California, Berkeley, 1999.
[4] S. Dudoit and T. P. Speed. A score test for linkage using identity by descent data from sibships. Ann. Stat., 27(3):943–986, 1999.
[5] E. Foss, R. Lande, F. Stahl, and C. Steinberg. Chiasma interference as a function of genetic distance. Genetics, 133:681–691, 1993.
[6] J. B. Fraleigh. A First Course in Abstract Algebra. Addison-Wesley Pub. Co., Reading, Mass., 7th edition, 2002.
[7] D. R. Goldstein, S. Dudoit, and T. P. Speed. Power of a score test for quantitative trait linkage analysis of relative pairs. Genet. Epidemiol., 19(Suppl. 1):S85–S91, 2000.
[8] D. R. Goldstein, S. Dudoit, and T. P. Speed. Power and robustness of a score test for linkage analysis of quantitative trait based on identity by descent data on sib pairs. Genet. Epidemiol., 20(4):415–431, 2001.
[9] J. K. Haseman and R. C. Elston. The investigation of linkage between a quantitative trait and a marker locus. Behav. Genet., 2:3–19, 1972.
[10] E. R. Hauser and M. Boehnke. Genetic linkage analysis of complex genetic traits by using affected sibling pairs. Biometrics, 54:1238–1246, 1998.
[11] E. S. Lander and P. Green. Construction of multilocus genetic maps in humans. Proc. Natl. Acad. Sci. USA, 84:2363–2367, 1987.
[12] M. S. McPeek and T. P. Speed. Modeling interference in genetic recombination. Genetics, 139:1031–1044, 1995.
[13] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[14] H. Zhao, T. P. Speed, and M. S. McPeek. Statistical analysis of crossover interference using the chi-square model. Genetics, 139:1045–1056, 1995.

Author information

Authors and Affiliations

Institut de mathématiques d’analyse et applications, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Darlene R. Goldstein
Swiss Institute of Bioinformatics, Lausanne, Switzerland
Darlene R. Goldstein

Corresponding author

Correspondence to Darlene R. Goldstein.

Editor information

Editors and Affiliations

School of Public Health, Div. Biostatistics, University of California, Earl Warren Hall 140, Berkeley, 94720, California, USA
Sandrine Dudoit

Cite this chapter

Goldstein, D.R. (2012). Statistical Genetics. In: Dudoit, S. (eds) Selected Works of Terry Speed. Selected Works in Probability and Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1347-9_12

DOI: https://doi.org/10.1007/978-1-4614-1347-9_12
Published: 09 January 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1346-2
Online ISBN: 978-1-4614-1347-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
