Data validation
To produce the
S. colias genome assembly, 2 sequencing strategies were used: Illumina PE short reads and PacBio HiFi long reads. The PE dataset was used to assess the genomic proprieties of the
S. colias species and scaffold the long-read assembly, while HiFi reads were used to perform the primary genome assembly and gap closing (Figure
2).
The Illumina sequencing yielded 149 M of PE reads and the PacBio sequencing generated 1.7 M of HiFi reads (Table
1). Trimmed short reads were used to estimate the genome size (817 Mbp), heterozygosity rate (1.31%), and genome repeat content (approximately 26%), using GenomeScope2 (Figure
3). The complete statistics of GenomeScope2 can be consulted in Additional File 3
[17, 18]. In parallel, the HiFi dataset was inspected, and mitochondrial reads, as well as possible sources of contamination, were removed (amounting to 0.31% of the initial dataset) (Table
1).
Figure 3.
Genomescope2 plot with k-mer spectra content and fitted models of the Scomber colias Illumina PE dataset.
For the mtDNA assemblies, a total of 38,868 mtDNA PE reads were filtered by GetOrganelle and a total of 792 mtDNA PacBio HiFi reads were filtered by BLASTN search. The 2 assemblies had the same length, 16,570 bp, and differed from each other by 0.29% (uncorrected
p-distances). Furthermore, the PE and PacBio HiFi mtDNA assemblies differed from the
S. colias mtDNA assembly available on NCBI (accession number
AB488406.1 [11]), by 0.35% and 0.40% respectively (uncorrected
p-distances). The mtDNA gene content and arrangement is as expected for most fishes and is standard for vertebrates
[72], consisting of 13 protein-coding genes, 22 transfer RNA (trn), and 2 ribosomal RNA (rrn) (Figure
4).
Figure 4.
Circular mitochondrial genome assembly of Scomber colias, obtained from the Illumina PE dataset (equal to that obtained from the PacBio HiFi long reads assembly). From the centre to the outmost features: GC content distribution; sequencing depth distribution of aligned Paired-End reads; gene elements (i.e., PCGs, rRNA genes, tRNA genes).
Table 1
General statistics of read datasets used to perform the Scomber colias genome assembly.
Sample | Sequencing type | Library type | Platform | Insert size (bp) | Number of reads (before clean-up) | Number of reads (after clean-up) | Application |
---|
Sco_PH | WGS | Long reads | PacBio Sequel II System | 15,500 | 1,792,104 | 1,786,541 | Genome Assembly, Gap Closing, Assessment |
Sco_PE | WGS | Short reads | HiSeq X Ten | 478 | 149,564,893 | 84,738,393 | Scaffold, Assessment |
The primary genome assembly was produced using filtered PacBio HiFi reads and the below software packages and settings. Following the above-mentioned criteria (Material and Methods: Nuclear genome assembly and assessment) the Sco_k21 assembly was selected, with both pseudo-haplotypes merged and subjected to purge_dups. Detailed statistics of Hifiasm and HiCanu genome assemblies can be consulted in Additional File 4
[17, 18]. Although the purge_dumps generated a primary and an alternative assembly, only the primary assembly was used in subsequent steps. At the same time, 4 short-read genome assemblies were performed with W2RAP software, and contigs with over 500 bp were used as “long reads” to scaffold the primary assembly. Additional File 5 shows QUAST and BUSCO statistics for the PE genome assemblies
[17, 18]. Importantly, during the scaffolding process, only structural information of short-read assemblies was used, without the inclusion of bases. Lastly, the remaining non-basal long read assemblies were used to fill gaps inserted during the scaffolding stage. The final assembly (primary assembly) of
S. Colias yielded a genome size of 814 Mbp, distributed in 2,028 scaffolds and 2,093 contigs with an N50 length of 4.19 and 3.34 Mbp, respectively. On the other hand, the alternative assembly had 807 Mbp and 5908 contigs with an N50 length of 0.47 Mbp (Table
2). The BUSCO analyses, at the nucleotide level, in Eukaryota and Actinopterygii datasets, showed high levels of completeness for both primary (97.3% and 97.9% of single-copy orthologs) and alternative (93.3% and 96% single-copy orthologs) assemblies (Table
2). Consistently, Merqury determined high QV (primary, 56.53%; alternative, 54.99%) and
k-mer completeness (primary, 86.11%; alternative, 84.60%) values for both assemblies (Table
2). In the primary assembly, the
k-mer analyses (via Merqury) showed a low level of
k-mer duplication in the genome (colour blue, green, purple, and orange in Figure
5a), indicating a high level of haplotype uniqueness (red colouring in Figure
5a), and a similar
k-mer distribution pattern to GenomeScope2 (performed with Illumina PE reads). Additionally, we found a high mapping rate in the Illumina, PacBio, and RNA-Seq reads, against the primary assembly of 95%, 99.8%, and 90.02%, respectively. Overall, these results provide evidence of the high quality of the
S. colias genome assembly (Table
2). Our
S. colias genome assembly ranks fourth in high-quality genome assemblies within the order Scombriformes and first in the genus
Scomber (Additional File 6)
[17, 18].
Table 2
Statistics of the Scomber colias genome assembly.
Assembly | Alternative | Primary |
---|
| Contigs (Sco_k21_a_c) | Contigs (Sco_k21_p_c) | Scaffolds (Sco_k21_p_s) |
---|
Number of contigs (≥10,000 bp) | 5,908 | 2,093 | 2,028 |
Number of contigs (≥50,000 bp) | 2,417 | 1,123 | 1,078 |
Number of contigs (≥100,000 bp) | 1,593 | 704 | 662 |
Number of contigs (≥200,000 bp) | 1,025 | 456 | 417 |
Number of contigs (≥500,000 bp) | 421 | 235 | 209 |
Number of contigs (≥1,000,000 bp) | 123 | 155 | 138 |
Total length (≥10,000 bp) | 807,928,680 | 813,976,802 | 814,072,661 |
Total length (≥50,000 bp) | 721,244,010 | 781,696,683 | 782,480,923 |
Total length (≥100,000 bp) | 662,374,873 | 751,893,146 | 752,912,084 |
Total length (≥200,000 bp) | 580,469,606 | 716,806,065 | 718,068,371 |
Total length (≥500,000 bp) | 385,329,197 | 648,055,626 | 653,890,381 |
Total length (≥1,000,000 bp) | 180,689,595 | 591,655,104 | 603,146,189 |
Largest contig (Mbp) | 3,248 | 22,804,600 | 22,804,600 |
Total length (Mbp) | 807,936 | 813,977 | 814,072 |
GC (%) | 39.94 | 40.09 | 40.09 |
N50 (Mbp) | 0,466 | 3,342 | 4,190 |
K-mer completeness (%) | 84.602 | | 86.1077 |
Consensus quality | 56.5369 | | 54.9969 |
Read back mapping PE (%) | - | | 95.0 |
Read back mapping PH (%) | - | | 99.8 |
Read back mapping RNA-Seq (%) | - | | 90.2 |
BUSCO statistics (databases) | - | | |
Eukariota** | T: 93.3, C: 90.2 [S: 88.6, D: 1.6], F: 3.1, M: 6.7, n: 255 | T: 97.3, C: 96.1 [S: 93.7, D: 2.4], F: 1.2, M: 2.7, n: 255 |
Actinopterygii** | T: 96.0, C: 94.8 [S: 91.9, D: 2.9], F: 1.2, M: 4.0, n: 3640 | T: 97.9, C: 97.2 [S: 96.2, D: 1.0], F: 0.7, M: 2.1, n: 3640 |
Figure 5.
Validation of the genome assembly and annotation process. (a) K-mer analyses of the Scomber colias genome assembly (Merqury). (b) Maximum Likelihood phylogenetic tree based on the concatenated alignments of amino acid sequences of 392 single-copy orthologs retrieved by OrthoFinder. Bootstrap values are shown next to the nodes. (c) BUSCO scores were obtained from searching the proteomes of the 3 Scombriformes species with genome annotation available, against the actinopterygii_odb10 (n:3640) lineage.
The RepeatMasker software masked 29.62% of bases in the primary genome assembly. The masked regions were predominantly linked to DNA elements (11.66%), long interspersed nuclear elements (4.11%), long terminal repeats (2.58%), and simple repeats (2.88%). Furthermore, 8.62% of the genome was masked and annotated as “unclassified”, and only a small percentage were classified as short interspersed nuclear elements, small RNA, or satellite repeats (Table
3). The genome annotation process generated about 27,675 protein-coding genes and 30,999 protein-coding sequences. On average, we found 9.5 exons and 1,656 bp lengths per CDS (Table
4). Of the CDS, 30,355 had at least 1 BLASTP hit in SwissProt or RefSeq databases, 27,101 were identified in the InterPro database, and 21,664 of these were classified as belonging to a specific homolog superfamily (Table
5).
Table 3
Report of RepeatMasker software. This report contains statistics of repetitive elements in the Scomber colias genome assembly.
Total number of sequences | 2,028 |
---|
Total length (bp) | 814,072,661 bp |
---|
GC level (%) | 40.09 |
---|
Number of bases masked | 241,071,029 bp (29.62%) |
---|
Type | Number of elements | Length in Genome | Percentage of Genome |
---|
SINEs | 16,132 | 2,679,916 | 0.33 |
ALUs | 0 | 0 | 0.00 |
MIRs | 7,082 | 1,280,739 | 0.16 |
LINEs | 113,089 | 33,426,533 | 4.11 |
LINE1 | 8,048 | 4,651,362 | 0.57 |
LINE2 | 57,670 | 14,551,177 | 1.79 |
L3/CR1 | 697 | 123,438 | 0.02 |
LTR elements | 82,410 | 20,969,171 | 2.58 |
ERVL | 10 | 279 | 0.00 |
ERVL-MaLRs | 0 | 0 | 0.00 |
ERV_classI | 22,786 | 4,702,084 | 0.58 |
ERV_classII 11490 | | 576,448 | 0.07 |
DNA elements | 623,126 | 94,930,706 | 11.66 |
hAT-Charlie | 27,952 | 5,526,534 | 0.68 |
TcMar-Tigger | 169 | 46,619 | 0.01 |
Unclassified | 278,199 | 70,161,089 | 8.62 |
Total interspersed repeats | - | 222,167,415 | 27.29 |
Small RNA | 11,380 | 1,807,250 | 0.22 |
Satellites | 18,552 | 3,093,792 | 0.38 |
Simple repeats | 84,014 | 23,465,814 | 2.88 |
Low complexity | 959 | 200,769 | 0.02 |
Table 4
Structural annotation report of the Scomber colias genome assembly.
Structural Annotation | Result |
---|
Number of genes | 27,675 |
Number of mRNAs | 30,999 |
Number of CDSs | 30,999 |
Number of exons | 295,102 |
Number of introns | 264,103 |
Number of exon in CDS | 295,102 |
Number of intron in CDS | 264,103 |
Number of introns in exon | 264,103 |
Number of introns in intron | 235,209 |
Number gene overlapping | 71 |
Number of single exon genes | 2,036 |
Number of single exon mRNA | 2,105 |
Mean mRNAs per gene | 1.1 |
Mean CDSs per mRNA | 1.0 |
Mean exons per mRNA | 9.5 |
Mean introns per mRNA | 8.5 |
Mean exons per CDS | 9.5 |
Mean introns in CDSs per mRNA | 8.5 |
Mean introns in exons per mRNA | 8.5 |
Mean introns in introns per mRNA | 7.6 |
Total gene length | 269,856,447 |
Total mRNA length | 310,580,471 |
Total CDS length | 51,346,678 |
Total exon length | 51,346,678 |
Total intron length | 259,233,793 |
Total intron length per CDS | 259,233,793 |
Total intron length per exon | 259,233,793 |
Total intron length per intron | 35,919,947 |
Mean gene length | 9,750 |
Mean mRNA length | 10,019 |
Mean CDS length | 1,656 |
Mean exon length | 173 |
Mean intron length | 981 |
Mean intron in exon length | 981 |
Mean intron in intron length | 152 |
Longest gene | 242,447 |
Longest mRNA | 242,447 |
Longest CDS | 98,436 |
Longest exon | 14,939 |
Longest intron | 76,003 |
Longest CDS piece | 14,939 |
Shortest gene | 303 |
Shortest mRNA | 144 |
Shortest CDS | 18 |
Shortest exon | 3 |
Shortest intron | 30 |
Table 5
Functional annotation report of S. colias genome assembly.
Functional Annotation | Number |
---|
Swiss-Prot/ RefSeq | 30,355 |
InterPro | 27,101 |
CDD | 12,832 |
Coils | 7,705 |
GO | 18,643 |
Gene3D | 22,209 |
HAMAP | 463 |
KEGG | 1,402 |
MetaCyc | 1,140 |
MobiDBlite | 16,765 |
PIRSF | 1,755 |
PRINTS | 7,143 |
Pfam | 25,708 |
PROSITE patterns | 8,082 |
PROSITE profiles | 16,229 |
Reactome | 7,376 |
SFDL | 114 |
SMART | 14,906 |
SUPERFAMILY | 21,664 |
TIGRFAMs | 1,427 |
To validate the protein-coding sequences we performed phylogenetic analysis (via OrthoFinder) and BUSCO analysis (using the Actinopterygii library profile) (Figure
5b, c). Of the 16 Actinopterygii proteins datasets inputted to OrthoFinder, 98.3% were assigned to 29,066 orthogroups, with 12,334 orthogroups present in all species. All OrthoFinder statistics can be consulted in Additional File 7
[17, 18]. Furthermore, a total of 392 single-copy orthologues were retrieved by OrthoFinder and used for the phylogenomic analysis. Alignment, trimming and concatenation of all single-copy orthologues, resulted in a final 120,886 aa-long supermatrix alignment that was used for phylogenomic inference in IQ-Tree. The resulting Maximum Likelihood phylogenetic tree has maximum support for almost all nodes (Figure
5b). The phylogeny recovered the reciprocal monophyletic Acanthopterygii groups Pelagiaria, Eupercaria, Anabantaria, Carangaria, and Ovalentaria, with Pelagiaria being the basal clade and represented by the 3 Scombrifomes, including
S. colias (Figure
5b). These results are in accordance with the most recent phylogenomic study of ray-finned fishes
[73], as well as the Ensembl Compara Species Tree of Ensembl database
[51]. BUSCO analysis showed the
S. colias proteome with 93.6% of the groups complete, 2% fragmented, and 4.4% missing (Figure
5c). In comparison,
T. maccoyii had 99.8% BUSCO groups complete, while
T. orientalis had but 82.8%. These results are expected, since the
T. maccoyii genome assembly, part of the Vertebrate Genome Project
[74], was built at chromosome level, with multiple technologies (including 46x PacBio data, 46x 10X Genomics Chromium data, BioNano data, and Arima Hi-C data) and several manual curation steps
[75]. In contrast, both
T. Orientalis [76] and
S. colias were built at scaffold level using only short- and long-read information.
We further explored the quality of the annotation by investigating the repertoire of the NRs superfamily in the
S. colias assembly. NRs are critical molecular physiology components, with vital roles in animal physiology and disruption
[77]. Moreover, their exact NR gene complement in vertebrate lineages has been shown to vary
[67]. We were able to deduce the existence of 76 NRs in the
S. colias genome, detailed in Additional File 8, in line with the repertoire described for other teleost species
[78]. Among the retrieved NRs we found those that are key components of the “chemical defensome”—an ensemble of related and unrelated genes that protect organisms against chemical stressors, and are thus critical under anthropogenic chemical build-up and climate change scenarios—such as the xenobiotic-inducible pregnane X receptor (
pxr,
nr1i2)
[68, 79]. Subsequent analysis, using gene names, further suggested the presence of gene annotations for the vast majority of the reported members of the teleost “chemical defensome” in
S. colias, similarly to that described for
D. rerio [68]. Additional BLAST searches were performed for a reduced set of genes (
fthl,
gstp,
hsph,
maff,
nme8, and
slc21), uncovering possible homologs for this gene subset, except for a single member of the GST family (
gstp). The chemical defensome repertoire identified in
S. colias species is detailed in Additional File 9 in the associated data entries
[17, 18].
We additionally validated our dataset by examining the present population structure of the species, since the genome may also provide clues regarding its past demographic history
[69]. One popular method to produce these inferences is the pairwise sequentially Markovian coalescent (PSMC) model, here applied to the
S. colias final genome assembly. Since PSMC requires an estimation of the genome-wide mutation rate, and since this has never previously been produced for
S. colias, we used the recently estimated genome-wide mutation rate of the yellowfin tuna,
T. albacares, of 7.3 × 10
−9 mutations/site/generation
[70]. The results suggest a past population expansion between 160,000–115,000 years ago, with maximum effective population size (
Ne) of 36,000 during the end of the Mid-Pleistocene Transaction, corresponding to the Eemian (i.e., the last interglacial period) and the transition between Marine Isotope Stages (MIS) 5 and 6 (Figure
6). This population expansion is followed by an apparent decrease in the
Ne to around 25,000 at the beginning of the Late Pleistocene, corresponding to the beginning of the Last Glacial Period. These results, suggesting the influence of climatic changes from the Pleistocene glaciation cycles on the
Ne, are following other recent studies on Scombriformes, such as the Pacific Sierra mackerel,
Scomberomorus sierra [80], and the Indo-Pacific Yellowfin tuna
T. albacares [70], as well as in other pelagic marine species such as the killer whale
[81].
Figure 6.
Pairwise sequentially Markovian coalescent (PSMC) estimates from the Scomber colias genome assembly. Estimations were obtained using a generation time of 2 years and genome-wide mutation rate of 7.3 × 10−9 mutations/site/generation. Effective population size (Ne) is presented in the left vertical axis, and changes estimated up to the present, over the last 3 myr, on the horizontal axis.