Ancestry patterns inferred from massive RNA-seq data

Antonio Salas (1, 2)

  1. Unidade de Xenética, Instituto de Ciencias Forenses (INCIFOR), Facultade de Medicina, Universidade de Santiago de Compostela, and GenPoB Research Group of the Instituto de Investigación Sanitaria de Santiago (IDIS), Hospital Clínico Universitario de Santiago (SERGAS), 15706 Galicia, Spain
  2. Translational Pediatrics and Infectious Diseases Unit, and GENVIP Research Group of the Instituto de Investigación Sanitaria de Santiago (IDIS), Hospital Clínico Universitario de Santiago (SERGAS), 15706 Galicia, Spain

Corresponding author: antonio.salas{at}


There is a growing body of evidence suggesting that patterns of gene expression vary within and between human populations. However, the impact of this variation on human disease has been poorly explored, in part owing to the lack of a standardized protocol for estimating biogeographical ancestry from gene expression studies. Here we examine several studies that provide solid new evidence that the ancestral background of individuals affects gene expression patterns. Next, we test a procedure to infer genetic ancestry from RNA-seq data in 25 data sets for which information on ethnicity was reported. Genome data from reference continental populations, retrieved from the 1000 Genomes Project, were used for comparison. Remarkably, only eight of the 25 data sets passed FastQC default filters. We demonstrate that, for these eight data sets, the ancestral background of donors could be inferred very efficiently, even in data sets including samples with complex patterns of admixture (e.g., American-admixed populations). For most of the gene expression data sets of suboptimal quality, ancestral inference yielded odd patterns. The present study thus sounds a cautionary note for gene expression studies, highlighting the importance of controlling for the potential confounding effect of ancestral genetic background.


  • Received December 18, 2018.
  • Accepted April 16, 2019.

This article is distributed exclusively by the RNA Society for the first 12 months after the full-issue publication date. After 12 months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International).
