Keywords
Seagrass, Zostera marina, eelgrass, chromosome-scale genome assembly, annotation
This article is included in the Plant Science gateway.
This article is included in the Genomics and Genetics gateway.
Seagrass, Zostera marina, eelgrass, chromosome-scale genome assembly, annotation
Seagrasses are a polyphyletic assemblage of early-diverging monocotyledoneous plants belonging to the Alismatales (Les, Cleland, and Waycott 1997; Du and Wang 2016); they are not true grasses (Poaceae). Several clades of seagrasses arose independently from freshwater sister taxa 3-4 times between the Paleocene and late Eocene (~65-34 mya) and are the only fully marine, flowering plants (~14 genera and ~65 species) (Chase et al. 2016). They occur in predominantly soft-sediment, marine coastal environments worldwide (Green, Short, and Frederick 2003) and as engineering species provide the foundation of three-dimensional habitats that are among the most productive and biodiverse (Costanza et al. 1997; Duffy et al. 2015). Seagrass meadows provide numerous ecosystem services, e.g., provisioning of fish and invertebrates, retention of nutrients (Larkum, Orth, and Duarte 2006) and carbon sequestration (Fourqurean et al. 2012). Unfortunately, they are also under threat related to human impacts (Waycott et al. 2009) that fundamentally change coastal system dynamics (Duffy et al. 2015) and make restoration extremely difficult (van Katwijk et al. 2016). Effective marine conservation strategies require integrative research perspectives between ecology and genomics (Hillebrand, Jacob, and Leslie 2020) because ecological and evolutionary change can and do occur on the same time scales (Carroll et al. 2007), e.g., genetic polymorphisms underlying critical traits or the role of genetic diversity at selectively relevant sites for population resilience.
Zostera marina (eelgrass) is a marine model species with >3000 papers covering a wide variety of ecological, evolutionary, conservation and biotech topics. Its unique, circumglobal, warm temperate to Arctic distribution has allowed it to withstand numerous cycles of rapid climate change during the Pliocene glacial and interglacial periods (Olsen et al. 2004), empirically demonstrating its capacity to adapt, acclimate and recover (Duarte et al. 2018), e.g., to temperature (Franssen et al. 2011; Jueterbock et al. 2016; Jueterbock et al. 2020), salinity gradients/osmoregulation (Shivaraj et al. 2017), ocean acidification (Zimmerman 2020) and potential pathogens (Brakel et al. 2014; Guan, Saha, and Weinberger 2020; Zang et al. 2020). Further, clonal populations of Z. marina can persist for hundreds to thousands of years ((Thorsten et al. 1999) for Z. marina, (Arnaud-Haond et al. 2012) for Posidonia oceanica)), yet have found ways to adapt through periods of rapid climate selection via genotypic plasticity, fostered by somatic mutation (Yu et al. 2020) and epigenetic modification of the methylome (DuBois, Williams, and Stachowicz 2020; Jueterbock et al. 2020). Microbiome interaction studies are being conducted in parallel with eelgrass resequencing, e.g., bacterial (Cucio et al. 2016; Fahimipour et al. 2017; Wang, Tomas, and Mueller 2020; Eisen et al. 2017) and fungal (Ettinger, Voerman, et al. 2017; Ettinger, Williams, et al. 2017; Ettinger and Eisen 2019, 2020; Ettinger, Vann, and Eisen 2020) to inform restoration strategies as well as meta-organismal function. Bioengineered salinity tolerance is also of interest (Wani et al. 2020).
The new assembly of the Z. marina reference genome will further advance studies in the aforementioned areas, as well as comparative analyses of genome structure and evolution, as new reference genomes for representatives of the other three seagrass lineages (i.e., Posidonia oceanica - Posidoniaceace, Cymodocea nodosa - Cymodoceaceae and Thalassia testudinum - Hydrocharitaceae) come online in the near future along with Zostera mueller (Lee et al. 2016) and Zostera japonica (unpublished).
We used an aliquot of the same DNA that served as the basis for Z. marina v.1.0 genome. We sequenced the Z. marina genome using a whole genome shotgun sequencing strategy and standard sequencing protocols. Sequencing reads were collected using Illumina and PacBio platforms at the HudsonAlpha Institute for Biotechnology in Huntsville, Alabama, USA. Illumina reads were sequenced using the Illumina HiSeq-2500 platform and the PacBio reads were sequenced using the SEQUEL II platform. One 400 bp insert 2×150 Illumina fragment library (162.7× coverage), and one 2×150 Hi-C Illumina library were constructed using Dovetail Hi-C kit and sequenced to 581.1× coverage. Prior to assembly, Illumina fragment reads were screened for PhiX contamination. Reads composed of >95% simple sequence were removed. Illumina reads <50 bp after trimming for adapter and quality (q<20) were removed. The final read set consists of 280,181,449 reads for a total of 156.4× of high-quality Illumina bases. For the PacBio sequencing, a standard PacBio long read library was constructed and a total of 8 PB chemistry 3.0 chips (10 hours movie time) were sequenced on a Sequel 1, a sequence yield of 75.97 Gb, with a total coverage of 189.93x.
The current v.3.1 assembly was generated by assembling the 5,615,408 PacBio reads (189.93x sequence coverage) using the MECAT assembler (Xiao et al. 2017) and subsequently polished using ARROW (Chin et al. 2013).
Misjoins in the assembly were identified using Hi-C data as part of the JUICER pipeline (Durand et al. 2016). No misjoins were identified in the polished assembly. The contigs were then oriented, ordered, and joined together using HI-C data using the JUICER pipeline. A total of 89 joins were applied to the assembled contigs to form the final set of six chromosomes. Each chromosome join is padded with 10,000 Ns. Significant telomeric sequence was identified using the (TTAGGG)n repeat, and care was taken to make sure that it was properly oriented in the production assembly. The remaining scaffolds were screened against bacterial proteins, organelle sequences, GenBank nr and removed if found to be a contaminant.
Finally, homozygous SNPs and INDELs were corrected in the released consensus sequence using 40x of Illumina reads (2x150, 400bp insert) by aligning the reads using bwa mem (Li 2013) and identifying homozygous SNPs and INDELs with the GATK’s UnifiedGenotyper tool (McKenna et al. 2010). A total of 1,876 homozygous SNPs and 64,447 homozygous INDELs were corrected in the release.
RepeatModeler v2.0 was used to build a custom repeat library for the genome assembly of Z. marina v.3.1. Subsequently, RepeatMasker v4.1 was used to discover and classify repeats based on the custom repeat libraries from RepeatModeler v2.0. Transfer RNAs (tRNA) were predicted by tRNAscan-SE v1.31 (Chan and Lowe 2019) with default parameters. We also identified noncoding RNAs, such as microRNAs (miRNAs), small nuclear RNAs (snRNAs) and ribosomal RNAs (rRNAs) by comparing with known noncoding RNA libraries (Rfam v14.2), using the cmscan program of Infernal v1.1.2 (Nawrocki and Eddy 2013). In addition, novel miRNA entries from the Z. marina v.1.0 assembly were aligned to hard-masking Z. marina v.3.1 using SeqMap (Jiang and Wong 2008) with no mismatches. We extracted ~ 110 bp upstream and downstream sequences surrounding every aligned locus and discarded the miRNAs candidates located within protein coding sequences or repetitive elements (“NNNNNNNNNNN”). The stem-loop structure and the minimum free energy (MFE) were predicted for each region using the RNAfold program of the ViennaRNA v 2.1.1 (Lorenz et al. 2011) with default settings. Finally, the results based on Rfam and Z.marina v.1.0 were combined into a non-redundant prediction of miRNAs.
Genome annotation was performed using a combination of ab initio prediction, homology searches and RNA-aided alignment. Augustus (Stanke, Tzvetkova, and Morgenstern 2006) was used for ab initio gene prediction using model training based on protein structures and RNA-seq data from Z. marina v.1.0 (Olsen et al. 2016). For homology-based predictions, the protein sequences of Z. marina v.1.0 and Oryza sativa were downloaded from PLAZA (https://bioinformatics.psb.ugent.be/plaza/) and aligned to Z. marina v.3.1 using TBLASTN with different e-values (Z. marina v.1.0 with e-value ≤ 1e-10 and O. sativa with e-value ≤ 1e-5). Next, regions were mapped by these query sequences to define gene models using Exonerate (Slater and Birney 2005). For RNA-aided annotation, we downloaded 23 libraries of Z. marina v.1.0 from NCBI (BioProject PRJNA280336). Firstly, we joined the paired-end reads using clc_assembly_cell to generate almost 2/3 of joined reads and 1/3 of un-joined reads. Then, we aligned the joined and un-joined RNA-seq data to Z. marina v.3.1 using HISAT2 (Kim, Langmead, and Salzberg 2015) with the parameters “--max-intronlen 50000” and assembled into potential transcripts using StringTie v2.1.1 (Kovaka et al. 2019). Further, TransDecoder v5.0.2 was used to predict open reading frames (ORFs) within the assembled transcripts. Finally, gene models obtained from all three approaches were integrated as the final non-redundant gene set using EVidenceModeler (v.1.1.1) (Haas et al. 2008). Specifically, a set of 1,460 bad gene models identified by the wgd software (Zwaenepoel and Van de Peer 2019) was manually curated using the genome browser GenomeView on the ORCAE platform (https://bioinformatics.psb.ugent.be/orcae/) and the gene annotation results were evaluated by BUSCO hits. Putative gene function was determined using InterProScan (Jones et al. 2014) with the different databases, including PFAM, Gene3D, PANTHER, CDD, SUPERFAMILY, ProSite, and GO. Meanwhile, functional annotation of predicted genes was also obtained by aligning the protein sequences against the sequences in public protein databases (Uniprot/SwissProt database) using BLASTP with an e-value cut-off of 1×10-5.
A single genotype (or clone) of Z. marina from the northern Baltic Sea, Finnish Archipelago Sea had been subjected to whole-genome assembly using Sanger and Illumina sequencing (referred to as Z. marina v.1.0) (Olsen et al. 2016). Since PacBio technology can deliver longer reads, necessary to improve assembly contiguity and obtain a nearly complete, reference genome, we re-sequenced the inbred, Finnish clone, leading to the final v.3.1 release, which contains 260.5 Mb of sequence, consists of 432 contigs with a contig N50 of 7.0 Mb and a total of 87.6% of assembled bases into six pseudo-chromosomes (2n = 12). Interestingly, during the assembly of the genome using Hi-C, it was noted that there was a seventh “chromosome” (scaffold_7 in this release) with a length of 8.68Mb, consisting of mainly repetitive DNA and a possible Nucleolus Organizing Region (NOR). Completeness of the euchromatic portion of v.3.1 assembly was assessed using 20,450 annotated genes from Z. marina v.1.0. The screened alignments indicate that 20,342 (99.47%) of the previously annotated version 1.0 genes aligned to the v.3.1 release. Of the remaining genes, 50 (0.24%) aligned at <50% coverage, while 58 (0.29%) were not found in the v.1.0 release. This shows a high degree of consistency between the two versions. However, version 3.1 presents much higher contiguity and fewer gaps compared to the previously published Z. marina v.1.0 (Table 1).
We used ab initio approaches to identify and annotate repetitive sequences, which accounted for 67.12 % of Z. marina v.3.1. 41.72 % of these TEs were long terminal repeat (LTR) elements (Table 2). Screening the Z. marina v.3.1 genome against the Rfam v14.2 database using Infernal identified 546 tRNAs, 376 rRNAs, 93 miRNAs, and 134 snRNAs (Table 3). In addition to the 93 known conserved miRNAs identified from Rfam v14.2, we also identified 23 novel miRNAs candidates compared to 19 novel miRNAs candidates in Z. marina v.1.0, resulting in a total of 116 miRNAs (Table 4).
Through a combination of ab initio prediction, homology searches and RNA-aided evidence, we annotated 21,533 gene models after masking repeat elements. After manually checking most gene models and improving 1395 incorrect gene models on the ORCAE platform and adding 30 new genes based on RNA-seq evidence, the final annotation produced 21,483 gene models, 91.8% of which (19,739 genes) are supported by transcriptome data from leaves, roots and flowers. On average, protein-coding genes in Z. marina v.3.1 are 3,300 bp long and contain 4.99 exons, values that are very similar to those of Z. marina v.1.0. Notably, intron lengths greatly improved after manual curation (Table 5). BUSCO assessment of the current gene set shows that the current annotation includes 95.7% complete genes in the embryophyte database10. 93.2% of the BUSCO genes were single copy while 2.5% of these BUSCO genes were found in duplicate. 0.5% of the BUSCO genes were fragmented and 3.8% was missing, which could be due to some specific pathways missing in Z. marina, compared to land plants (Olsen et al., 2016) (Table 6). BUSCO assessment in Z. marina v.3.1 shows more complete genes and fewer fragmented genes than for Z. marina v.1.0 (Figure 1). 20,665 genes (96.2%) obtained at least one functional assignment based on similarity to known proteins in the databases. Pfam domain information could be added to 15,716 (73.2%) predicted genes, and 12,406 (57.7%) predicted genes could be assigned a GO term (Table 7).
Here, we report a high-quality, chromosome-scale genome assembly of Z. marina using a combination of single-molecule real-time sequencing and Hi-C scaffolding. Although a draft genome sequence for Z. marina has been available for more than five years (Olsen et al. 2016), a chromosome-scale assembly and well-annotated reference genome is an important step to further advance our understanding with respect to its metabolism, evolution and adaptation. As we expected, there is a discrepancy in genome size between an Illumina-derived assembly and a PACBIO long read-derived assembly, which is mainly due to the more accurate coverage of repetitive content. Nevertheless, the genome size of 259 Mb still characterizes Z. marina as relatively compact monocot genome, and it is also the smallest genome among the seagrasses where genome size estimates exist (JGI pilot analyses). Also, telomere and centromere regions are generally captured more fully. As a first high quality, reference genome of an important marine angiosperm, v.3.1 of the genome of Z. marina will provide a great resource for further comparative and evolutionary studies.
NCBI BioProject: Zostera marina Genome sequencing and assembly. Accession number: PRJNA701932; https://identifiers.org/ncbi/bioproject:PRJNA701932.
The genome assembly and annotation for Z. marina v.3.1 is also available from https://data.jgi.doe.gov/refine-download/phytozome?organism=Zmarina and through the ORCAE platform (https://bioinformatics.psb.ugent.be/gdb/zostera/).
We thank Shengqiang Shu from the Joint Genome Institute for validating the annotation.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
References
1. Peska V, Mátl M, Mandáková T, Vitales D, et al.: Human-like telomeres in Zostera marina reveal a mode of transition from the plant to the human telomeric sequences. Journal of Experimental Botany. 2020; 71 (19): 5786-5793 Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: Telomere biology, genomics, sequencing
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genomics, population genetics, bioinformatics
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 1 15 Apr 21 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)