Identifying nineteenth century genealogical links from genotypes

  • Original Investigation
  • Human Genetics

Abstract

We have developed a likelihood method to identify moderately distant genealogical relationships from genomewide scan data. The aim is to compare the genotypes of many pairs of people and identify those pairs most likely to be related to one another. We have tested the algorithm using the genotypes of 170 Tasmanians with multiple sclerosis recruited into a haplotype association study. It is estimated from genealogical records that approximately 65% of Tasmania’s current population of 470,000 are direct descendants of the 13,000 female founders living in this island state of Australia in the mid-nineteenth century. All cases and four to five relatives of each case have been genotyped with microsatellite markers at a genomewide average density of 4 cM. Previous genealogical research has identified 51 pairwise relationships linking 56 of the 170 cases. Testing the likelihood calculation on these known relative pairs, we have good power to identify relationships up to degree eight (e.g. third cousins once removed). Applying the algorithm to all other pairs of cases, we have identified a further 61 putative relative pairs, with an estimated false discovery rate of 10%. The power to identify genealogical links should increase when the new, denser sets of SNP markers are used. Except in populations where there is a searchable electronic database containing virtually all genealogical links in the past six generations, the algorithm should be a useful aid for genealogists working on gene-mapping projects, both linkage studies and association studies.

Notes

  1. The degree of relationship d, defined as \(-\log_2\)(expected IBD sharing), is not an integer for all of these pairs; d has been rounded up to the nearest integer for Figs. 4 and 5.
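As a rough illustration of the definition in this note, the sketch below computes d from an expected IBD sharing value and rounds it up. The function name and the example pair (related through two separate paths, so that the sharing values add) are assumptions made for illustration and are not taken from the article.

```python
import math

def relationship_degree(expected_ibd_sharing: float) -> float:
    """Degree of relationship d = -log2(expected IBD sharing)."""
    if not 0.0 < expected_ibd_sharing <= 1.0:
        raise ValueError("expected IBD sharing must lie in (0, 1]")
    return -math.log2(expected_ibd_sharing)

# Hypothetical pair related through two paths, one of degree 5 and one of
# degree 7; the expected sharing is assumed to be the sum over both paths.
sharing = 2.0 ** -5 + 2.0 ** -7
d = relationship_degree(sharing)   # not an integer
d_rounded = math.ceil(d)           # rounded up, as described in the note
```

A pair with a single connecting path gives an integer value directly, for example a sharing of 1/8 corresponds to d = 3.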

References

  • Abecasis GR, Cookson WO (2000) GOLD—graphical overview of linkage disequilibrium. Bioinformatics 16:182–183

  • Abecasis GR, Cherny SS, Cookson WO, Cardon LR (2002) Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 30:97–101

  • Bahlo M, Xing L, Wilkinson CR (2004) HumanMSD and MouseMSD: generating genetic maps for human and murine microsatellite markers. Bioinformatics 20:3280–3283

  • Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8

  • Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 57:289–300

  • Blouin MS (2003) DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol Evol 18:503–511

  • Burgess JR, Greenaway TM, Shepherd JJ (1998) Expression of the MEN-1 gene in a large kindred with multiple endocrine neoplasia type 1. J Intern Med 243:465–470

  • Carvajal-Carmona LG, Ophoff R, Service S, Hartiala J, Molina J, Leon P, Ospina J, Bedoya G, Freimer N, Ruiz-Linares A (2003) Genetic demography of Antioquia (Colombia) and the Central Valley of Costa Rica. Hum Genet 112:534–541

  • Donnelly KP (1983) The probabilities that related individuals share some section of genome identical by descent. Theor Popul Biol 23:34–63

  • Douglas JA, Skol AD, Boehnke M (2000) Probability of detection of genotyping errors and mutations as inheritance inconsistencies in nuclear-family data. Am J Hum Genet 66:487–495

  • Epstein MP, Duren WL, Boehnke M (2000) Improved inference of relationship for pairs of individuals. Am J Hum Genet 67:1219–1231

  • Ewen KR, Bahlo M, Treloar SA, Levinson DF, Mowry B, Barlow JW, Foote SJ (2000) Identification and analysis of error types in high-throughput genotyping. Am J Hum Genet 67:727–736

  • Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587

  • Golding J (1990) Children of the nineties. A longitudinal study of pregnancy and childhood based on the population of Avon (ALSPAC). West Engl Med J 105:80–82

  • Hanley JA, McNeil BJ (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143:29–36

  • Jobling MA, Hurles M, Tyler-Smith C (2004) Human evolutionary genetics: origins, peoples and disease. Garland Publishing, New York

  • John S, Shephard N, Liu G, Zeggini E, Cao M, Chen W, Vasavda N, Mills T, Barton A, Hinks A, Eyre S, Jones KW, Ollier W, Silman A, Gibson N, Worthington J, Kennedy GC (2004) Whole-genome scan, in a complex disease, using 11,245 single-nucleotide polymorphisms: comparison with microsatellites. Am J Hum Genet 75:54–64

  • Karayiorgou M, Torrington M, Abecasis GR, Pretorius H, Robertson B, Kaliski S, Lay S, Sobin C, Möller N, Lundy SL, Blundell ML, Gogos JA, Roos JL (2004) Phenotypic characterization and genealogical tracing in an Afrikaner schizophrenia database. Am J Med Genet 124B:20–28

  • Knuiman MW, Cullen KJ, Bulsara MK, Welborn TA, Hobbs MS (1994) Mortality trends, 1965 to 1989, in Busselton, the site of repeated health surveys and interventions. Aust J Public Health 18:129–135

  • Kong A, Gudbjartsson DF, Sainz J, Jonsdottir GM, Gudjonsson SA, Richardsson B, Sigurdardottir S, Barnard J, Hallbeck B, Masson G, Shlien A, Palsson ST, Frigge ML, Thorgeirsson TE, Gulcher JR, Stefansson K (2002) A high-resolution recombination map of the human genome. Nat Genet 31:241–247

  • Leutenegger A-L, Prum B, Génin E, Verny C, Lemainque A, Clerget-Darpoux F, Thompson EA (2003) Estimation of the inbreeding coefficient through use of genomic data. Am J Hum Genet 73:516–523

  • McPeek MS, Sun L (2000) Statistical tests for detection of misspecified relationships by use of genome-screen data. Am J Hum Genet 66:1076–1094

  • O’Connell JR, Weeks DE (1998) PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 63:259–266

  • Ophoff RA, Escamilla MA, Service SK, Spesny M, Meshi DB, Poon W, Molina J, Fournier E, Gallegos A, Mathews C, Neylan T, Batki SL, Roche E, Ramirez M, Silva S, De Mille MC, Dong P, Leon PE, Reus VI, Sandkuijl LA, Freimer NB (2002) Genomewide linkage disequilibrium mapping of severe bipolar disorder in a population isolate. Am J Hum Genet 71:565–574

  • Pridmore SA (1990) The large Huntington’s disease family of Tasmania. Med J Aust 153:595–595

  • Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945–959

  • Rahman P, Jones A, Curtis J, Bartlett S, Peddle L, Fernandez BA, Freimer NB (2003) The Newfoundland population: a unique resource for genetic investigation of complex diseases. Hum Mol Genet 12:R167–R172

  • R Development Core Team (2003) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Rubio JP, Bahlo M, Butzkueven H, van der Mei IAF, Sale MM, Dickinson JL, Groom P, Johnson LJ, Simmons RD, Tait B, Varney M, Taylor B, Dwyer T, Williamson R, Gough NM, Kilpatrick TM, Speed TP, Foote SJ (2002) Genetic dissection of the human leukocyte antigen region by use of haplotypes of Tasmanians with multiple sclerosis. Am J Hum Genet 70:1125–1137

  • Sieberts SK, Wijsman EM, Thompson EA (2002) Relationship inference from trios of individuals, in the presence of typing error. Am J Hum Genet 70:170–180

Acknowledgements

The authors thank all the families who participated in the Tasmanian multiple sclerosis study. Thanks also to the staff at the Menzies Research Institute who coordinated recruitment of the families, and to the neurologists who confirmed the diagnoses of MS. J.S. and C.W. were supported by The Cooperative Research Centre for Discovery of Genes for Common Human Diseases (Gene CRC). J.S. is also supported by a Transitional Institute Grant of the National Health and Medical Research Council (NHMRC) of Australia. M.B. and J.P.R. are supported by NHMRC Project Grants. S.J.F. and T.P.S. are fellows of the NHMRC. All genotyping was conducted at the Australian Genome Research Facility. Recruitment and genotyping were funded by the Gene CRC. The Gene CRC is established and supported by the Australian Government’s Cooperative Research Centres Scheme.

Author information

Corresponding author

Correspondence to Jim Stankovich.

Additional information

Electronic database information: URLs for the data in this article are as follows:

Division of Genetics and Bioinformatics, WEHI

(http://bioinf.wehi.edu.au/software/index.html) for software to identify genealogical relationships from genotype data: Genotype-based identification of relative pairs (GBIRP)

Colonial Tasmanian Family Links Database

(http://resources.archives.tas.gov.au/Pioneers/) for records of Tasmanian births, deaths and marriages prior to 1900

Appendix: Further details of the approximate likelihood calculation

Consider a pair of degree d cousins and the corresponding pair of related haplotypes (inherited from the cousins’ related parents) for a particular chromosome. Denote the two haplotypes \(h_1=(h_{11},h_{12},\ldots)\) and \(h_2=(h_{21},h_{22},\ldots)\), where \(h_{ik}\) is the allele at marker k along the ith haplotype. Then to evaluate the numerator in Eq. 2 we need to evaluate \(\Pr((h_1,h_2)|d)\). We consider integer values of d between 3 (first cousins) and 13 (e.g. 6th cousins) inclusive.

Define the IBD process \(\{D^d\}\) for \(h_1\) and \(h_2\) by

$$ D^d_k = \left\{ \begin{array}{*{20}l} 1 & \hbox{if } h_1 \hbox{ and } h_2 \hbox{ are IBD at marker } k \\ 0 & \hbox{if } h_1 \hbox{ and } h_2 \hbox{ are not IBD at marker } k \end{array} \right.$$

Denote the population frequencies of the alleles \(h_{1k}\) and \(h_{2k}\) by \(f_{1k}\) and \(f_{2k}\), respectively. Suppose that there is a probability \(\varepsilon\) that one of the alleles \(h_{1k}\) and \(h_{2k}\) has been incorrectly genotyped or inferred, or that one of the alleles has undergone a mutation in the meioses separating the pair of relatives. Then we have the following probabilities for the genotypes at marker k, given the IBD status at marker k:

$$\Pr((h_{1k},h_{2k})|D^d_k=0) = f_{1k} f_{2k}$$
(5)
$$\Pr((h_{1k},h_{2k})|D^d_k=1) = \left\{ \begin{array}{*{20}l}\varepsilon f_{1k} f_{2k}, & h_{1k} \neq h_{2k} \\ (1-\varepsilon) f_{1k} + \varepsilon f_{1k}^2, & h_{1k} = h_{2k}.\end{array}\right. $$
(6)

When the allele on a particular haplotype at a particular marker is ambiguous, the probabilities in Eqs. 5 and 6 were evaluated using each of the two equally likely alleles and then averaged.
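The per-marker probabilities in Eqs. 5 and 6 and the averaging rule for ambiguous alleles can be sketched as follows. This is a minimal illustration, not the authors' software: the function names, the dictionary of allele frequencies and the treatment of the case where both haplotypes are ambiguous (averaging over all candidate pairs) are assumptions.

```python
from itertools import product

def emission_prob(a1, a2, freq, ibd, eps):
    """Pr((h_1k, h_2k) | D_k = ibd) for one marker, following Eqs. 5 and 6.

    a1, a2 : alleles on the two haplotypes at this marker
    freq   : dict mapping allele -> population frequency (hypothetical input)
    ibd    : 1 if the haplotypes are IBD at the marker, 0 otherwise
    eps    : probability of a genotyping/inference error or a mutation
    """
    f1, f2 = freq[a1], freq[a2]
    if ibd == 0:
        return f1 * f2                           # Eq. 5
    if a1 != a2:
        return eps * f1 * f2                     # Eq. 6, alleles differ
    return (1.0 - eps) * f1 + eps * f1 ** 2      # Eq. 6, alleles match

def emission_prob_ambiguous(candidates1, candidates2, freq, ibd, eps):
    """Average the emission probability over equally likely candidate alleles."""
    pairs = list(product(candidates1, candidates2))
    return sum(emission_prob(a, b, freq, ibd, eps) for a, b in pairs) / len(pairs)

# Hypothetical marker with three alleles and a 1% error/mutation probability.
freq = {"A": 0.5, "B": 0.3, "C": 0.2}
eps = 0.01
p_match_ibd = emission_prob("A", "A", freq, ibd=1, eps=eps)
p_ambiguous = emission_prob_ambiguous(("A", "B"), ("A",), freq, ibd=1, eps=eps)
```

A useful sanity check on Eqs. 5 and 6 is that, for either IBD state, the probabilities summed over all ordered allele pairs equal 1.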

The IBD process \(\{D^d\}\) is not Markov for the cousin relationships we consider. However, it is possible to construct an augmented process \(\{A^d\}\) which contains all of the information of the IBD process \(\{D^d\}\) and which is Markov in the absence of interference (Donnelly 1983). McPeek and Sun (2000) present the minimal augmented Markov process \(\{A^3\}\) for first cousins in detail: they define 7 states 1,...,7, with states \(A^3=5\) and \(A^3=7\) corresponding to \(D^3=1\). More generally, suppose that the haplotypes \(h_1\) and \(h_2\) carried by dth degree cousins are descended from a pair of related first cousin haplotypes \(g_1\) and \(g_2\) via an additional d−3 meioses. Then the probability of the haplotype \(h_1\) being IBD with the haplotype \(g_1\) and the haplotype \(h_2\) being IBD with the haplotype \(g_2\) at a particular marker is \(1/2^{d-3}\): all d−3 of these additional meioses must be “switched” a certain way. Define the Markov process \(\{S^{d-3}\}\), which counts how many of these meioses are switched the right way for haplotypes \(h_1\) and \(h_2\) to be IBD with haplotypes \(g_1\) and \(g_2\), respectively. \(\{S^{d-3}\}\) has d−2 possible states: 0,1,...,d−3. Then the product of \(\{S^{d-3}\}\) and the process \(\{A^3\}\) for the first cousin haplotypes \(g_1\) and \(g_2\) is an augmented Markov process \(\{A^d\}\) which contains all of the information of the non-Markov IBD process \(\{D^d\}\).

The augmented process \(\{A^d\}\) has 7(d−2) states 1,...,7(d−2). To move from marker k−1 to marker k in the likelihood calculation requires evaluation of \([7(d-2)]^2\) transition probabilities

$$ \Pr(A^{d}_k=j | A^{d}_{k-1}=i, \theta_{k-1,k}), \quad i,j = 1,\ldots,7(d-2) $$

where \(\theta_{k-1,k}\) is the recombination fraction between markers k−1 and k. To reduce computation we follow the suggestion of McPeek and Sun (2000) and approximate the IBD process \(\{D^d\}\) by a Markov process \(\{B^d\}\) with the correct transition probabilities for adjacent markers: i.e. set

$$ \Pr(B^{d}_k=j | B^{d}_{k-1}=i) = \Pr(D^{d}_k=j | D^{d}_{k-1}=i), \quad i,j = 0,1. $$
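To give a sense of the saving, the short sketch below tabulates the number of states and transition probabilities of the augmented process for each degree considered, against the four transition probabilities of the two-state approximation. The function and variable names are illustrative assumptions.

```python
def augmented_state_count(d: int) -> int:
    """Number of states of the augmented process {A^d}: 7(d - 2)."""
    if d < 3:
        raise ValueError("d must be at least 3 (first cousins)")
    return 7 * (d - 2)

# Map each degree d to (number of states, number of transition probabilities).
sizes = {
    d: (augmented_state_count(d), augmented_state_count(d) ** 2)
    for d in range(3, 14)
}

# The two-state approximation {B^d} needs only a 2 x 2 transition matrix.
two_state_transitions = 2 ** 2
```

For d = 13 the augmented process has 77 states and 5929 transition probabilities per marker interval, compared with 4 for the approximation.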

First, evaluate

$$\Pr(D_k^d=1 | D_{k-1}^d=1) = \Pr(D_k^3 =1 | D_{k-1}^3=1) \Pr(S_k^{d-3}=d-3 | S_{k-1}^{d-3}=d-3).$$
(7)

Now, for the process \(\{S^{d-3}\}\) to stay in state d−3 with all d−3 meioses switched a certain way, there must be no recombinations between markers k−1 and k. Thus

$$\Pr(S_k^{d-3}=d-3 | S_{k-1}^{d-3}=d-3) = (1-\theta)^{d-3},$$
(8)

where \(\theta=\theta_{k-1,k}\). To evaluate the probability that the process \(\{D^3\}\) stays in state 1 (with the first cousin haplotypes staying IBD) between markers k−1 and k, refer back to the augmented Markov process \(\{A^3\}\), which contains all the information of \(\{D^3\}\). Using the labelling of the seven states of \(\{A^3\}\) and the transition probabilities between them given by McPeek and Sun (2000), we have

$$ \begin{aligned} \Pr(D_k^3 =1 | D_{k-1}^3=1) &= \sum_{i,j \in \{A^3: D^3=1\}} \Pr(A_{k-1}^3=i | D_{k-1}^3=1) \Pr(A_k^3=j | A_{k-1}^3=i) \\ &= \frac{1}{2}\left[\Pr(A_k^3=5 | A_{k-1}^3=5) + \Pr(A_k^3=7 | A_{k-1}^3=5) \right. \\ & \quad + \left. \Pr(A_k^3=5 | A_{k-1}^3=7) + \Pr(A_k^3=7 | A_{k-1}^3=7) \right] \\ &= \frac{1}{2}\left[(1-\theta)^2 \psi^2 + \theta^2 \phi^2 + 2 \psi^2 \phi + \psi^3 \right],\end{aligned}$$
(9)

where \(\psi=\theta^2+(1-\theta)^2\) and \(\phi=1-\psi\). Substituting Eqs. 8 and 9 into Eq. 7 gives

$$\Pr(D_k^d=1 | D_{k-1}^d=1) = \frac{{1}}{{2}} (1-\theta)^{d-3} \left[(1-\theta)^2 \psi^2 + \theta^2 \phi^2 + 2 \psi^2 \phi + \psi^3 \right].$$
(10)
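Equation 10 is straightforward to evaluate numerically. The sketch below is an illustration under the formulas above, with function names chosen here for convenience. Two limiting cases give a quick check: with no recombination (θ = 0) the haplotypes stay IBD with probability 1, and with free recombination (θ = 1/2) the probability reduces to \(1/2^{d-1}\).

```python
def psi(theta: float) -> float:
    """psi = theta^2 + (1 - theta)^2, as defined after Eq. 9."""
    return theta ** 2 + (1.0 - theta) ** 2

def first_cousin_stay_ibd(theta: float) -> float:
    """Pr(D_k^3 = 1 | D_{k-1}^3 = 1) from Eq. 9."""
    p = psi(theta)
    q = 1.0 - p  # phi in the text
    return 0.5 * ((1.0 - theta) ** 2 * p ** 2
                  + theta ** 2 * q ** 2
                  + 2.0 * p ** 2 * q
                  + p ** 3)

def t11(theta: float, d: int) -> float:
    """Pr(D_k^d = 1 | D_{k-1}^d = 1) from Eq. 10, for degree d >= 3."""
    if d < 3:
        raise ValueError("d must be at least 3 (first cousins)")
    if not 0.0 <= theta <= 0.5:
        raise ValueError("recombination fraction must lie in [0, 0.5]")
    return (1.0 - theta) ** (d - 3) * first_cousin_stay_ibd(theta)

# Example: degree 8 relatives at completely linked and unlinked markers.
stay_linked = t11(0.0, 8)
stay_unlinked = t11(0.5, 8)
```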

For convenience, write the transition probability from state i to j in abbreviated notation: \(t_{ij} = \Pr(D_k^d=j | D_{k-1}^d=i)\). Equation 10 gives \(t_{11}\). Then

$$ t_{10}=1-t_{11}.$$

Now the Markov chains \(\{A^3\}\) and \(\{S^{d-3}\}\) have stationary distributions

$$\pi_1 = \pi_2 =\pi_4 =\pi_5 =\pi_6 =\pi_7 = 1/8, \ \pi_3=1/4$$

and

$$\pi_{\ell} =\frac{{(d-3)!}}{{\ell! (d-3-\ell)!}}\frac{{{1}}}{{2^{d-3}}}, \quad \ell=0,1,\ldots,d-3$$

respectively. Thus the process \(\{D^d\}\), which has value 1 when \(S^{d-3}=d-3\) and \(A^3=5\) or \(A^3=7\), has stationary distribution

$$\pi_0 = 1-\frac{1}{2^{d-1}}, \quad \pi_1 = \frac{1}{2^{d-1}} $$
(11)

satisfying

$$\left( \begin{array}{*{20}l} t_{11} & t_{01} \cr t_{10} & t_{00} \cr \end{array}\right) \left( \begin{array}{*{20}l} \pi_1 \cr \pi_0 \end{array}\right)=\left( \begin{array}{*{20}l} \pi_1 \cr \pi_0 \end{array} \right).$$
(12)
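The value of \(\pi_1\) in Eq. 11 can be checked directly. Assuming, as the product construction suggests, that the chains \(\{S^{d-3}\}\) and \(\{A^3\}\) are independent, the stationary probability of \(D^d=1\) is the product of the stationary probability that \(S^{d-3}=d-3\) and the stationary probability that \(A^3\) is in state 5 or 7:

$$ \pi_1 = \frac{1}{2^{d-3}} \left( \frac{1}{8} + \frac{1}{8} \right) = \frac{1}{2^{d-1}}. $$

For example, d = 8 gives \(\pi_1 = 1/128\). Written out, the first row of Eq. 12 is

$$ t_{11} \pi_1 + t_{01} \pi_0 = \pi_1, \quad \hbox{so that} \quad t_{01} = \frac{\pi_1 (1 - t_{11})}{\pi_0}. $$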

Substituting Eq. 11 into Eq. 12 and solving for \(t_{01}\) gives

$$ t_{01} = {{1 - t_{11}}\over {2^{d-1}-1}}, $$

and, finally,

$$ t_{00} = 1 - t_{01}. $$
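With all four transition probabilities in hand, one standard way to evaluate \(\Pr((h_1,h_2)|d)\) under the two-state approximation is a forward recursion along the chromosome, starting from the stationary distribution of Eq. 11. The sketch below is an assumption about how such a calculation could be organised, not a description of the authors' software; the function names and input layout are hypothetical. It expects per-marker probabilities from Eqs. 5 and 6 and one value of \(t_{11}\) (Eq. 10) per marker interval.

```python
def transition_matrix(t11: float, d: int):
    """Two-state matrix T[i][j] = Pr(B_k = j | B_{k-1} = i) for states 0 and 1."""
    t01 = (1.0 - t11) / (2.0 ** (d - 1) - 1.0)
    return [[1.0 - t01, t01],
            [1.0 - t11, t11]]

def haplotype_pair_likelihood(emissions, t11_values, d: int) -> float:
    """Forward recursion for Pr((h1, h2) | d) under the two-state chain.

    emissions  : list of (e0, e1) per marker, where
                 e_s = Pr((h_1k, h_2k) | D_k = s) from Eqs. 5 and 6
    t11_values : t11 for each adjacent marker pair (one fewer than markers)
    """
    if len(t11_values) != len(emissions) - 1:
        raise ValueError("need exactly one t11 value per marker interval")
    pi1 = 1.0 / 2.0 ** (d - 1)
    e0, e1 = emissions[0]
    # Initialise with the stationary distribution of Eq. 11.
    alpha = [(1.0 - pi1) * e0, pi1 * e1]
    for (e0, e1), t11 in zip(emissions[1:], t11_values):
        T = transition_matrix(t11, d)
        alpha = [
            (alpha[0] * T[0][0] + alpha[1] * T[1][0]) * e0,
            (alpha[0] * T[0][1] + alpha[1] * T[1][1]) * e1,
        ]
    return alpha[0] + alpha[1]

# Hypothetical three-marker example for degree 5 relatives.
emissions = [(0.10, 0.30), (0.06, 0.02), (0.25, 0.50)]
likelihood = haplotype_pair_likelihood(emissions, [0.90, 0.85], d=5)
```

For long chromosomes with many markers the running products can become very small, so a practical implementation would usually rescale `alpha` at each step or work with logarithms.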

Cite this article

Stankovich, J., Bahlo, M., Rubio, J.P. et al. Identifying nineteenth century genealogical links from genotypes. Hum Genet 117, 188–199 (2005). https://doi.org/10.1007/s00439-005-1279-y
