Review
High-throughput sequencing (HTS) for the analysis of viral populations

https://doi.org/10.1016/j.meegid.2020.104208

Highlights

  • High Throughput Sequencing techniques have revolutionized many fields of Biology, including Virology.

  • More and easier access to sequence information provides new possibilities for analyzing virus populations.

  • For RNA viruses, mutation rates are of similar magnitude to the error rates of HTS technologies.

  • There are many computer programs specifically designed for analyzing HTS data of virus populations.

  • We review analytical and methodological advances and some major applications of HTS of virus populations.

Abstract

The development of High-Throughput Sequencing (HTS) technologies is having a major impact on the genomic analysis of viral populations. Current HTS platforms can capture nucleic acid variation across millions of genes for both selected amplicons and full viral genomes. HTS has already facilitated the discovery of new viruses, hinted at new taxonomic classifications and provided a deeper and broader understanding of their diversity, population and genetic structure. Hence, HTS has already replaced standard Sanger sequencing in basic and applied research fields, but the next step is its implementation as a routine technology for the analysis of viruses in clinical settings. The most likely application of this implementation will be the analysis of viral genomics, because the huge population sizes, high mutation rates and very fast replacement of viral populations have exposed how limited the information obtained with Sanger technology is. In this review, we describe new technologies and provide guidelines for the high-throughput sequencing and genetic and evolutionary analyses of viral populations and metaviromes, including software applications. With the development of new HTS technologies, new and refurbished molecular and bioinformatic tools are also constantly being developed to process and integrate HTS data. These tools allow viral genomes to be assembled and viral population diversity and dynamics to be inferred. Finally, we also present several applications of these approaches to the analysis of viral clinical samples, including transmission clusters and outbreak characterization.

Introduction

Viruses are probably the most abundant organisms on Earth and major drivers of evolution at all levels of organization and time-scales (Zhang et al., 2018). They also represent the most diverse types of genome organization, which are used to establish the seven higher categories of the current classification of viruses (http://www.ictvonline.org/virustaxonomy.asp). This variety of genomes reflects not only their different evolutionary origins but also the ways they interact with their hosts. Being obligate parasites of cellular organisms, they have adapted to many different life-styles and survival strategies, often leading to very specific changes and traits. However, despite their variety, viruses usually attract the interest of researchers, clinicians, public health officials and the general population because of the serious problems they can cause when infecting a new species or when a new viral strain expands among individuals of a former host, overcoming their intrinsic or previously effective defenses, and causing disease. In fact, viruses represent the most common agents responsible for emerging and re-emerging infectious diseases.

Despite their apparent genomic simplicity, viruses encode all the necessary features to successfully complete their life-cycle. Many of these depend on interactions with their hosts and all are based on genetic differences that can be passed on to the usually large offspring produced by a single virus. Their large population sizes along with their mutation rates (Sanjuan et al., 2010), usually several orders of magnitude higher than those of cellular organisms, create the ideal conditions for making viruses exceptional “evolutionary machines”, capable of exploiting every minute genetic variation.

The relevance of analyzing the genetic variability of viruses was evident even before the advent of sequencing techniques. The first genome sequence was that of bacteriophage ΦX174 (Sanger et al., 1977), almost two decades earlier than that of a cellular organism, the bacterium Haemophilus influenzae (Fleischmann et al., 1995). Virologists, molecular biologists, infectious disease specialists, and many other researchers have used a wide range of molecular methodologies to learn about the genetic differences and properties of infecting viral populations. One frequent problem they have had to face is the extraordinary variability found in many of those populations, especially for viruses with a single-stranded RNA genome (Moya et al., 2004). The detailed analysis of the genetic variation in a viral population was out of reach for most studies and only “average” or “consensus” sequences were usually obtained.

The recent development of high-throughput sequencing (HTS) methods (Loman et al., 2015), along with their increased precision and lower costs, is shifting the focus from Sanger sequencing to HTS as the standard technique for obtaining viral sequences (Goodwin et al., 2016). As we will address below, this also means that new concepts and analytical methods have to be applied to accommodate the differences between the results produced by the new and the previous techniques and how they inform us about the genetic composition of the analyzed populations. However, current HTS technologies still present relatively high sequencing error rates, from about 0.1% in Illumina (Goodwin et al., 2016) up to 12.7% in MinION ONT (Bowden et al., 2019). Although such rates may be acceptable in some disciplines, they may not provide the required accuracy in others, such as the design of antiviral therapies (Del Campo et al., 2018) or the quality standards required in forensic genetics (Arenas et al., 2017; Budowle et al., 2014).
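
To illustrate why error rates of this magnitude matter when looking for minority variants, the following Python sketch asks how many reads must support a variant at a single site before sequencing error alone becomes an unlikely explanation. It is not taken from any of the programs covered in this review. It assumes that errors are independent between reads, it treats the quoted platform error rate as the per-read probability of observing the variant base by mistake (a conservative simplification), and the function names, the depth of 1000 reads and the significance level are hypothetical choices made only for illustration.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space for stability."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1.0 - p))
    return exp(log_pmf)


def min_supporting_reads(depth, error_rate, alpha=0.01):
    """Smallest read count k such that P(X >= k) < alpha under pure error.

    X is the number of erroneous reads at one site, modelled as
    Binomial(depth, error_rate). Assumes 0 < error_rate < 1.
    """
    cumulative = 0.0
    for k in range(depth + 1):
        cumulative += binomial_pmf(k, depth, error_rate)
        # P(X >= k + 1) equals 1 - P(X <= k)
        if 1.0 - cumulative < alpha:
            return k + 1
    return depth + 1  # no count within this depth reaches the threshold


# Error rates quoted in the text (Illumina and MinION ONT); depth is illustrative.
for platform, rate in [("Illumina", 0.001), ("MinION ONT", 0.127)]:
    k = min_supporting_reads(1000, rate)
    print(f"{platform}: at least {k} of 1000 reads ({100 * k / 1000:.1f}%)")
```

Under these assumptions, a true variant whose frequency is close to the platform error rate cannot be separated from noise at this depth, which is why error rates comparable to viral mutation rates complicate the analysis of within-sample diversity.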

Here, we provide a general overview of techniques, concepts, analytical methods and several applications of HTS to study viral populations. We will focus more on presenting the range of tools and concepts than on detailing their underlying theoretical or algorithmic bases. More detailed reviews of the major topics covered in this work have been published already (Table 1) and the interested reader is referred to those and the references cited herein for additional details on these topics.

Section snippets

HTS technologies and their application to the analyses of viral populations

The development of DNA sequencing by chain termination, commonly referred to as Sanger sequencing, revolutionized biological research (Sanger et al., 1977). This methodology coupled with the development of automated DNA sequencers in the mid 1990s allowed labs to sequence large numbers of genes and whole genomes, culminating with the sequencing of the human genome and the beginning of the sequencing era (Liesegang, 2001; Consortium, International Human Genome Sequencing, and International Human

Overview of major methods for the analysis of HTS data

There are four major types of analysis of HTS results from viral populations: assembly of new genomes, mapping of reads to a reference sequence (or resequencing), the study of viral sequences included in metagenomics data, and the analysis, usually detection, of specific or rare variants. There are many programs for the general analysis of these different applications that can be applied to HTS data of virus samples. However, most of the algorithms implemented in these tools were originally

Virus population genetics and genomics with HTS data

Most HTS technologies provide large amounts of information about the within-sample variability. For viruses, especially RNA viruses, this opens the possibility of characterizing the viral population not just by consensus or master sequences - which can be reconstructed from the raw data - but in terms of the actual variability present in the sample. Naturally, this requires the use of concepts and methods derived from population genetics, which summarize the relevant parameters governing and
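
As a minimal illustration of describing a sample by its variability rather than by its consensus, the Python sketch below computes two common per-site summaries from the base counts observed at one alignment position: Shannon entropy and nucleotide diversity (the probability that two reads drawn without replacement differ at that site). The function names and the example counts are hypothetical, and the sketch assumes that the counts have already been filtered for sequencing errors; it is not the method of any particular program discussed in this review.

```python
from math import log


def shannon_entropy(base_counts):
    """Shannon entropy (in nats) of the base frequencies at one site."""
    total = sum(base_counts.values())
    entropy = 0.0
    for count in base_counts.values():
        if count > 0:  # skip absent bases to avoid log(0)
            freq = count / total
            entropy -= freq * log(freq)
    return entropy


def nucleotide_diversity(base_counts):
    """Probability that two reads sampled without replacement differ here."""
    total = sum(base_counts.values())
    if total < 2:
        return 0.0
    sum_sq = sum((count / total) ** 2 for count in base_counts.values())
    # The factor total / (total - 1) corrects for the finite number of reads.
    return total / (total - 1) * (1.0 - sum_sq)


# Hypothetical site covered by 100 reads: 90 carry A and 10 carry G.
site = {"A": 90, "C": 0, "G": 10, "T": 0}
print(f"entropy = {shannon_entropy(site):.4f} nats")
print(f"diversity = {nucleotide_diversity(site):.4f}")
```

A consensus sequence would report only A at this position, whereas both summaries record that a tenth of the reads carry a different base.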

Molecular evolutionary analysis from HTS data

In this section, we describe the application of HTS data to the evolutionary analysis at the molecular level, including the estimation of substitution and recombination rates, signatures of selection, and genome-wide evolutionary histories. Clearly, genetic analyses based on large amounts of data (i.e., genome-wide) can provide high statistical confidence. However, systematic biases in the methods applied to analyze such large datasets can provide precise but inaccurate results (S. Kumar et

Analysis of environmental and engineered samples. Metaviromes

After consideration of the methodological and analytical issues and tools for the study of viral populations, the next sections will deal with some of the most frequent applications of HTS for this kind of studies. Viral metagenomics, or viromics, comprises the study of viral genome sequences (Angly et al., 2006; Edwards and Rohwer, 2005), or metavirome, from resident ecological communities adapted to a specific ecosystem type or biome. Ecosystems and their derived biosamples are classified

Major applications of HTS in the viral analysis of clinical samples

HTS has opened new opportunities in different areas of clinical virology (Capobianchi et al., 2013). The introduction of NGS represented a great boost in four main fields at the interface between clinical setting and applied research. These are (i) the diagnoses of new infectious agents involved in clinical syndromes caused by multiple agents or a combination of them; (ii) the impact of minority variants involved in antiviral resistance; (iii) the study of intra-host evolutionary dynamics, and

The application of HTS in the analysis of virus transmission clusters and outbreaks

The fast evolutionary rate of RNA viruses and the serious health issues for people infected by many of them have enabled the establishment of a new epidemiological approach, the analysis of transmission clusters and chains through the comparison of sequence information obtained from viruses infecting patients and their potential sources. This strategy is becoming a new standard for surveillance of general populations and specific groups (Agoti et al., 2019; Alamil et al., 2019), and it has also

Challenges and future directions

Over the last few years we have witnessed a transition from Sanger sequencing to High-Throughput Sequencing in the field of viral genomics. HTS can capture much more viral diversity than any other technique and generate large amounts of sequences for population analyses at affordable costs. But HTS also comes with some technological caveats (e.g., high sequencing errors, heterogeneous coverage) and bioinformatic challenges (e.g., genome assembly) that still need to be solved. Given the limited

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

We thank Michel Tibayrenc, Editor-in-Chief of the journal, for the invitation to contribute this article. MPL was partially funded by the Milken Institute School of Public Health Pilot Fund Program, the Margaret Q. Landenberger Research Foundation and the Fundação para a Ciência e a Tecnologia (T495756868-00032862). MA is funded by grants “RYC-2015-18241” from MICIU (Spanish Government) and “ED431F 2018/08” from the Xunta de Galicia. FGC, NGG, MAB and JH are funded by projects BFU2017-89594R

References (351)

  • Felipe Hernandes Coutinho et al.

    Metagenomics sheds light on the ecology of marine microbes and their viruses

    Trends Microbiol.

    (2018)
  • James H. Degnan et al.

    Gene tree discordance, phylogenetic inference and the multispecies coalescent

    Trends Ecol. Evol.

    (2009)
  • José A. Del Campo et al.

    Hepatitis C virus deep sequencing for sub-genotype identification in mixed infections: a real-life experience

    Int. J. Infect. Dis.

    (2018)
  • Ruud H. Deurenberg et al.

    Application of next generation sequencing in clinical microbiology and infection prevention

    J. Biotechnol.

    (2017)
  • Isabella Abbate et al.

    Detection of quasispecies variants predicted to use CXCR4 by ultra-deep pyrosequencing during early HIV infection

    AIDS

    (2011)
  • Ana B. Abecasis et al.

    Phylogenetic analysis as a forensic tool in HIV transmission investigations

    AIDS

    (2018)
  • Evelien M. Adriaenssens et al.

    Metagenomic analysis of the viral community in Namib Desert hypoliths

    Environ. Microbiol.

    (2015)
  • Evelien M. Adriaenssens et al.

    Environmental drivers of viral community composition in antarctic soils identified by viromics

    Microbiome

    (2017)
  • Charles N. Agoti et al.

    Transmission patterns and evolution of respiratory syncytial virus in a community outbreak identified by genomic analysis

    Virus Evol.

    (2017)
  • Charles N. Agoti et al.

    Genomic analysis of respiratory syncytial virus infections in households and utility in inferring who infects the infant

    Sci. Rep.

    (2019)
  • Derek Aguiar et al.

    HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data

    J. Comput. Biol.

    (2012)
  • Derek Aguiar et al.

    Haplotype assembly in polyploid genomes and identical by descent shared tracts

    Bioinformatics

    (2013)
  • M. Alamil et al.

    Inferring epidemiological links from deep sequencing data: a statistical learning approach for human, animal and plant diseases

    Philos. Trans. R. Soc. Lond. Ser. B Biol. Sci.

    (2019)
  • E.K. Alidjinou et al.

    RNA and DNA Sanger sequencing versus next-generation sequencing for HIV-1 drug resistance testing in treatment-naive patients

    J. Antimicrob. Chemother.

    (2017)
  • Christian L. Althaus et al.

    Stochastic interplay between mutation and recombination during the acquisition of drug resistance mutations in human immunodeficiency virus type 1

    J. Virol.

    (2005)
  • Rika E. Anderson et al.

    Evolutionary strategies of viruses, bacteria and archaea in hydrothermal vent ecosystems revealed through metagenomics

    PLoS One

    (2014)
  • Anders F. Andersson et al.

    Virus population dynamics and acquired virus resistance in natural microbial communities

    Science

    (2008)
  • Florent E. Angly et al.

    The marine viromes of four oceanic regions

    PLoS Biol.

    (2006)
  • Maria Anisimova et al.

    Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites

    Genetics

    (2003)
  • M. Anisimova

    State-of the art methodologies dictate new standards for phylogenetic analysis

    BMC Evol. Biol.

    (2013)
  • Sten Anslan et al.

    PipeCraft: flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data

    Mol. Ecol. Resour.

    (2017)
  • Leonardo Arbiza et al.

    Genome-wide heterogeneity of nucleotide substitution model fit

    Genome Biol. Evol.

    (2011)
  • John Archer et al.

    Identifying the important HIV-1 recombination breakpoints

    PLoS Comput. Biol.

    (2008)
  • Miguel Arenas

    The importance and application of the ancestral recombination graph

    Front. Genet.

    (2013)
  • Miguel Arenas

    Advances in computer simulation of genome evolution: toward more realistic evolutionary genomics analysis by approximate Bayesian computation

    J. Mol. Evol.

    (2015)
  • Miguel Arenas

    Trends in substitution models of molecular evolution

    Front. Genet.

    (2015)
  • Miguel Arenas

    Applications of the coalescent for the evolutionary analysis of genetic data

    Encycl. Bioinforma. Comput. Biol.

    (2019)
  • Miguel Arenas et al.

    The effect of recombination on the reconstruction of ancestral sequences

    Genetics

    (2010)
  • Miguel Arenas et al.

    The influence of recombination on the estimation of selection from coding sequence alignments

    Nat. Sel.

    (2014)
  • Miguel Arenas et al.

    Forensic genetics and genomics: much more than just a human affair

    PLoS Genet.

    (2017)
  • Abdullah Assiri et al.

    Hospital outbreak of middle east respiratory syndrome coronavirus

    N. Engl. J. Med.

    (2013)
  • Jasmijn A. Baaijens et al.

    De novo assembly of viral quasispecies using overlap graphs

    Genome Res.

    (2017)
  • Guy Baele et al.

    Emerging concepts of data integration in pathogen phylodynamics

    Syst. Biol.

    (2017)
  • G.J. Baillie et al.

    Evolutionary dynamics of local pandemic H1N1/2009 influenza virus lineages revealed by whole-genome analysis

    J. Virol.

    (2012)
  • Anton Bankevich et al.

    SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing

    J. Comput. Biol.

    (2012)
  • Joëlle Barido-Sottani et al.

    Detection of HIV transmission clusters from phylogenetic trees using a multi-state birth-death model

    J. R Soc. Interface/R Soc.

    (2018)
  • Mark A. Beaumont et al.

    The Bayesian revolution in genetics

    Nat. Rev. Genet.

    (2004)
  • Niko Beerenwinkel et al.

    Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data

    Front. Microbiol.

    (2012)
  • Shellie R. Bench et al.

    Metagenomic characterization of Chesapeake Bay Virioplankton

    Appl. Environ. Microbiol.

    (2007)
  • Michael G. Berg et al.

    A pan-HIV strategy for complete genome sequencing

    J. Clin. Microbiol.

    (2016)