Published online: 14 June 2022
Data Release
A chromosome-level genome assembly and annotation of the maize elite breeding line Dan340

Cite this article as: Yikun Zhao, Yuancong Wang, De Ma, Guang Feng, Yongxue Huo, Zhihao Liu, Ling Zhou, Yunlong Zhang, Liwen Xu, Liang Wang, Han Zhao, Jiuran Zhao, Fengge Wang. A chromosome-level genome assembly and annotation of the maize elite breeding line Dan340. Gigabyte, 2022. https://doi.org/10.46471/gigabyte.63

Gigabyte
2709-4715
GigaScience Press
Sha Tin, New Territories, Hong Kong SAR
Data Description
Background
Maize (Zea mays ssp. mays L., NCBI:txid381124) is one of the most important crops grown worldwide for food, forage, and biofuel, with an annual production of more than 1 billion tons [1]. Owing to the rapid human population growth and economic demand, maize has been predicted to account for 45% of the total cereal demand by 2050 [2]. In addition, it is an important model organism for fundamental research in genetics and genomics [3].
Because of its importance in crop science, genetics and genomics, several reference genomes of common maize inbred lines used in breeding have been released since 2009 [4–8]. However, comparative genomic analyses have found that maize genomes exhibit high levels of genetic diversity among different inbred lines [1, 7, 9]. Meanwhile, accumulating studies have suggested that one or a few reference genomes cannot fully represent the genetic diversity of a species [7, 10, 11].
The maize cultivar Dan340 is an excellent backbone inbred line of the LvDa Red Cob Group that has several desirable characteristics, such as disease resistance, lodging resistance, high combining ability, and wide adaptability. More than 50 maize hybrid breeds have been derived from Dan340 since 2000, and their planting area has reached 19 million ha. It is considered that Dan340 originated from a landrace in China and exhibits significant genetic differences from other maize germplasms that represent the most important core maize germplasms in China [12]. Therefore, Dan340 could serve as a model inbred line for the genetic dissection of desirable agronomic traits, combining ability, heterosis, and breeding history.
In the present study, we constructed a high-quality chromosome-level reference genome for Dan340 by combining PacBio long HiFi sequencing reads, Illumina short reads, and chromosomal conformational capture (Hi-C) sequencing reads. The completeness and continuity of the resulting genome are comparable with those of other important maize inbred lines: B73 [4], Mo17 [7], SK [13], PH207 [5], and HZS [8]. Furthermore, comparative genomic analyses were performed between Dan340 and other maize lines. Genes and gene families specific to Dan340 were identified. In addition, large numbers of structural variations between Dan340 and other maize inbred lines were detected. The assembly and annotation of this genome will increase our understanding of the intraspecific genomic diversity in maize and provide a novel resource for maize breeding improvements.
Plant materials and DNA sequencing
The inbred line Dan340 (Figure 1) was selected for genome sequencing and assembly because it is an elite maize cultivar that plays an important role in maize breeding and genetic research. The plants were grown at 25 °C in a greenhouse of the Beijing Academy of Agriculture and Forestry Sciences, Beijing, China. Fresh and tender leaves were harvested from the best-growing individual, immediately frozen in liquid nitrogen, and then preserved at −80 °C in the laboratory prior to DNA extraction. Genomic DNA was extracted from the leaf tissue of a single plant using the DNAsecure Plant Kit (Tiangen Biotech Co., Ltd., Beijing, China). To ensure that the DNA extracts were useable for all types of genomic libraries, their quality and quantity were evaluated using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and electrophoresis on a 0.8% agarose gel, respectively.
Figure 1.
Ear appearances of the maize inbred lines Dan340, B73, Mo17, and SK.
In recent years, third-generation DNA sequencing technologies have undergone rapid technological innovation and are now widely used in genome assembly. In this study, PacBio circular consensus sequencing (CCS) libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA; Ref. No. 101-685-400), following the manufacturer’s protocols, and were subsequently sequenced on the PacBio Sequel II platform (Pacific Biosciences, RRID:SCR_017990). As a result, 63.53 Gb (approximately 27× coverage) of HiFi reads were generated and used for the genome assembly.
In addition, one Illumina paired-end sequencing library, with an insert size of 350 bp, was generated using the NEB Next Ultra DNA Library Prep Kit (NEB, Ipswich, MA, USA) following the manufacturer’s protocol and then sequenced using an Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA, RRID:SCR_016385) at the Novogene Bioinformatics Institute, Beijing, China. Approximately 80.66 Gb (∼34×) of Illumina sequencing data were obtained.
One Hi-C library was constructed using young leaves following previously published procedures [14], with slight modifications outlined in our published protocol [15] (Figure 2). In brief, approximately 5-g leaf samples from seedlings were cut into minute pieces and cross-linked using a 4% formaldehyde solution at room temperature in a vacuum for 30 min. Each sample was mixed with excess 2.5 M glycine for 5 min to quench the cross-linking reaction and then placed on ice for 15 min. The cross-linked DNA was extracted and then digested for 12 h with 20 units of DpnII restriction enzyme (NEB, Ipswich, MA, USA, Catalog #R0543S) at 37 °C. Next, the resuspended mixture was incubated at 62 °C for 20 min to inactivate the restriction enzyme. The sticky ends of the digested fragments were biotinylated and proximity ligated to form enriched ligation junctions and then ultrasonically sheared to a size of 200–600 bp. The biotin-labelled DNA fragments were pulled down and ligated with Illumina paired-end adapters, and then amplified by PCR to produce the Hi-C sequencing library. The library was sequenced using an Illumina HiSeq X Ten platform with 2 × 150 bp paired-end reads. After removing low-quality sequences and trimming adapter sequences, 304.37 Gb (approximately 130×) of clean data were generated and used for the genome assembly.
Figure 2.
Protocols.io protocol for the Hi-C library construction from young Maize leaves [15]. https://www.protocols.io/widgets/doi?uri=dx.doi.org/10.17504/protocols.io.bp2l61mkzvqe/v1
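The sequencing depths quoted above can be checked by dividing the total number of sequenced bases by the genome size. The sketch below assumes the assembled genome size reported in Table 1 as the denominator; the function and variable names are illustrative and are not part of the published pipeline.

```python
# Approximate sequencing depth = total sequenced bases / genome size.
# Assumption: the assembled genome size from Table 1 is used as the denominator.
GENOME_SIZE_BP = 2_348_678_871


def fold_coverage(total_bases_gb: float, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Return fold coverage for a data set given in gigabases (1 Gb = 1e9 bp)."""
    return total_bases_gb * 1e9 / genome_size_bp


# Data volumes reported in the text for each library type.
datasets_gb = {"PacBio HiFi": 63.53, "Illumina": 80.66, "Hi-C": 304.37}

for name, gigabases in datasets_gb.items():
    print(f"{name}: {fold_coverage(gigabases):.1f}x")
```

Rounded to whole numbers, the three values agree with the approximately 27×, 34×, and 130× depths stated in the text.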
Genome assembly
To obtain a high-quality genome assembly of Dan340, we employed both PacBio HiFi reads and Illumina short reads, with scaffolding informed by high-throughput Hi-C. The assembly was performed in a stepwise fashion. First, a de novo assembly of the long CCS reads generated from PacBio single-molecule real-time (SMRT) sequencing was performed using Hifiasm [16] (RRID:SCR_021069). A total of two SMRT cells produced 4,073,418 subreads, with an average length of 15,598 bp and a read N50 of 15,715 bp. The generation of HiFi reads and adapter trimming were performed using PacBio SMRTLink (Version 8.0) [17] with default parameters, followed by the deduplication of reads using pbmarkdup (Version 0.2.0) [18], as recommended by PacBio. The HiFi reads were then aligned to each other and assembled into genomic contigs using Hifiasm [16] with default parameters. Next, the primary contigs (p-contigs) were polished using Quiver [19] by aligning the SMRT reads. Then, Pilon [20] (RRID:SCR_014731) was used to perform a second round of error correction using the short paired-end reads generated on the Illumina HiSeq platform. Subsequently, the Purge Haplotigs pipeline [21] was used to remove redundant sequences formed due to heterozygosity. The draft genome assembly was 2348.68 Mb, with a high level of continuity and a contig N50 length of 45.11 Mb.
To reduce bias from experimental artefacts in the Hi-C reads, we removed the following read types using HiCUP [22] (RRID:SCR_005569): (a) reads with ≥10% unidentified nucleotides (N); (b) reads with >10 nt aligned to the adapter, allowing ≤10% mismatches; and (c) reads with >50% of bases having a Phred quality <5. Next, the filtered Hi-C reads were aligned against the contig assemblies using BWA (Version 0.7.8, RRID:SCR_010910) [23]. Reads were excluded from subsequent analyses if they did not align within 500 bp of a restriction site or did not map uniquely. The number of Hi-C read pairs linking each scaffold pair was also tabulated. ALLHiC (Version 0.8.12) [24] was used in simple diploid mode to scaffold the genome and optimize the ordering and orientation of each clustered group, producing a chromosome-level assembly. The Juicebox Assembly Tools (Version 1.9.8, RRID:SCR_021172) [25] were used to visualize and manually correct large-scale inversions and translocations to obtain the final pseudo-chromosomes (Figure 3). Finally, 2315 scaffolds (representing 91.30% of the total length) were anchored to 10 chromosomes (Figure 4).
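Read-filtering criteria (a) and (c) above can be expressed as a simple per-read check. The following Python sketch is only an illustration of those two thresholds and is not HiCUP's implementation; criterion (b), adapter matching, is omitted because it requires an alignment step. The function name and its parameters are hypothetical.

```python
from typing import Sequence


def passes_quality_filters(
    seq: str,
    quals: Sequence[int],
    max_n_frac: float = 0.10,      # criterion (a): reject reads with >= 10% N
    low_q: int = 5,                # criterion (c): Phred score counted as low if < 5
    max_low_q_frac: float = 0.50,  # criterion (c): reject if > 50% of bases are low quality
) -> bool:
    """Return True if a read passes criteria (a) and (c); adapter check (b) is not modelled."""
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac >= max_n_frac:
        return False
    low_frac = sum(q < low_q for q in quals) / len(quals)
    if low_frac > max_low_q_frac:
        return False
    return True


# Toy reads (hypothetical data): a clean read and a read with 10% N bases.
print(passes_quality_filters("ACGT" * 5, [30] * 20))
print(passes_quality_filters("NN" + "A" * 18, [30] * 20))
```

The boundary behaviour follows the wording of the text: a read with exactly 10% N is removed, whereas a read with exactly 50% low-quality bases is retained.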
Figure 3.
Hi-C contact heat map displaying the inter- and intra-chromosomal interactions in the genome of the maize inbred line Dan340.
Figure 4.
Circos plot of genomic features. Outer-to-inner tracks indicate the following: (A) Chromosome numbers of Dan340 and B73; (B) Repeat density; (C) Histogram of gene density distributions along the chromosomes; (D) Histogram of GC content distributions along the chromosomes; (E) Syntenic relationships of gene pairs between Dan340 and B73 genomes identified using the best-hit method.
The final assembly of the Dan340 genome was 2348.72 Mb, including 2738 contigs and 2315 scaffolds, with N50 of 41.49 Mb and 215.35 Mb, respectively (Table 1).
Table 1
Genome assembly and annotation statistics for the four tested maize inbred lines.
Genomic features                Dan340         B73            Mo17           SK
Assembled genome size (bp)      2,348,678,871  2,182,075,994  2,104,465,715  2,161,392,594
Number of scaffolds             2315           687            2203           671
Total length of scaffolds (Mb)  2348.72        2182           2182           2162
Scaffold N50 (bp)               222,765,871    226,353,449    220,382,597    73,237,962
Number of contigs               2738           1395           9040           1090
Total length of contigs (Mb)    2,144,444      2,178,268      2,147,495      2,150,874
Contig N50 (bp)                 45,109,016     47,037,903     1,491,782      15,776,512
Number of genes                 39,733         39,756         38,620         43,271
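The contig and scaffold N50 values in Table 1 follow the usual definition: N50 is the length L such that sequences of length L or longer together contain at least half of the total assembly length. A minimal Python sketch of that calculation is given below; the sequence lengths in the example are toy values, not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical): total = 290, half = 145.
toy_contigs = [80, 70, 50, 40, 30, 20]
print(n50(toy_contigs))  # 80 + 70 = 150 >= 145, so this prints 70
```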
Evaluation of the assembly quality
We assessed the quality of the assembly using several independent methods. First, the short reads obtained from the Illumina sequencing data were aligned to the final assembly using BWA [26]. The percentage of reads mapped to the reference genome was 97.48%. Second, a total of 248 conserved genes present in six eukaryotic model organisms were selected to form the core gene library for the Core Eukaryotic Genes Mapping Approach (CEGMA) [27] (RRID:SCR_015055) evaluation. To evaluate its integrity, our assembled Dan340 genome was aligned to this core gene library using TBLASTN (RRID:SCR_011822) [28], GeneWise (Version 2.2.0, RRID:SCR_015054) [29], and the GeneID tools (Version 1.4, RRID:SCR_021639) [30]. Our results showed that 238 complete (95.97%) and 243 partial (97.98%) genes were detected in our assembly. Third, the completeness was assessed using the benchmarking universal single-copy orthologs (BUSCO) [31] (RRID:SCR_015008). The final assembly was tested against BUSCO (v.3) with the embryophyta_odb10 database [32], which includes 1614 conserved core genes. Our results showed that 98.08% (1583), 1.11% (18), and 0.81% (13) of the plant single-copy orthologs were present in the assembled Dan340 genome as complete, fragmented, or missing genes, respectively. Fourth, the long-terminal repeat (LTR) Assembly Index (LAI) metric was used to evaluate the assembly continuity in Dan340 and three other maize genomes (B73, Mo17, and SK; Figure 5). The intact LTR retrotransposons were identified in the four genomes using LTRharvest (Version 1.6.1, RRID:SCR_018970) [33], LTR_Finder (Version 1.07, RRID:SCR_015247) [34], and LTR_retriever (Version 2.9.0, RRID:SCR_017623) [35]. The LAI pipeline was executed using the following parameter settings: -t 20 -intact genome.fasta.pass.list -all genome.ltr.fasta.out. Our Dan340 genome had an LAI score of 25.13, which was relatively high among the four maize genomes compared in this study. 
B73, Mo17, and SK scored 24.94, 24.45, and 27.12, respectively (Figure 5 and Table 2). A higher LAI score indicates a more complete genome assembly because more intact LTR retrotransposons are identified, as was the case of our Dan340 genome. Furthermore, whole-genome sequence alignments of Dan340 to the genomes of the other three maize inbred lines demonstrated that our assembly has highly collinear relationships with other published maize genomes (Figure 6). Taken together, our assessment results suggest that the Dan340 genome assembly is of high quality.
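The CEGMA and BUSCO completeness figures above are simple proportions of the respective core gene sets. The short Python sketch below recomputes them from the counts given in the text; the helper name is illustrative only.

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100 * count / total, 2)


# CEGMA: 248 core eukaryotic genes, counts taken from the text.
cegma_total = 248
print("CEGMA complete:", percent(238, cegma_total))
print("CEGMA partial: ", percent(243, cegma_total))

# BUSCO: 1614 genes in the embryophyta_odb10 set, counts taken from the text.
busco_counts = {"complete": 1583, "fragmented": 18, "missing": 13}
busco_total = 1614
# The three categories should account for every gene in the set.
assert sum(busco_counts.values()) == busco_total
print("BUSCO complete:", percent(busco_counts["complete"], busco_total))
```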
Figure 5.
Genome-wide LAI scores for Dan340, B73, Mo17, and SK.
Figure 6.
(A)–(C) Pairwise comparison of genome sequences using a dot plot between the Dan340 line and B73 (23,350 gene pairs), Mo17 (21,913 gene pairs), or SK (23,016 gene pairs). The horizontal axis represents the target species; the vertical axis represents the reference species; C1–C10 represents the respective chromosomes 1–10; 0–35 k represents the chromosome length scale marks, which mainly reflect the lengths of the chromosomes; a point represents a pair of shared genes.
Table 2
LAI scores of the four tested maize inbred lines.
Lines  Dan340  B73    Mo17   SK
LAI    25.13   24.94  24.45  27.12
Genome annotation
Repeat sequences of the Dan340 genome were annotated using both ab initio and homolog-based search methods. For the ab initio prediction, RepeatModeler (Version 1.0.8, RRID:SCR_015027) [36], RepeatScout (Version 1.0.5, RRID:SCR_014653) [37], and LTR_Finder [34] were used to discover transposable elements (TEs) and to build a TE library. An integrated TE library and a known repeat library (Repbase Version 15.02, homolog-based, RRID:SCR_021169) were subjected to RepeatMasker (Version 3.3.0, RRID:SCR_012954) [38] to predict the TEs. For the homolog-based predictions, RepeatProteinMask was used to detect the TEs in our genome by comparing it against a TE protein database. Tandem repeats were identified in the genome using Tandem Repeats Finder (Version 4.07b, RRID:SCR_022193) [39]. As a result, 1723.99 Mb of repeat sequences were identified, accounting for 73.40% of the genome size. Among these repeat sequences, 1555.57 Mb were predicted to be long-terminal repeat (LTR) retrotransposons, and 44.53 Mb were predicted to be DNA transposons, accounting for 66.23% and 1.60% of the genome, respectively. Furthermore, among the LTR retrotransposons, the Gypsy and Copia superfamilies comprised 23.81% and 12.75% of the genome, respectively. Thus, retrotransposons accounted for a large proportion of the Dan340 genome, which was consistent with the genomic characteristics of other maize inbred lines (Table 2).
All repetitive regions except the tandem repeats were soft-masked for protein-coding gene annotations. Five ab initio gene prediction programs, Augustus (Version 3.0.2, RRID:SCR_008417) [40–42], GENSCAN (Version 1.0, RRID:SCR_013362) [43], GeneID [30], GlimmerHMM (Version 3.0.2, RRID:SCR_002654) [44], and SNAP (Version 2013-02-16, RRID:SCR_007936) [45], were used to predict genes. In addition, the protein sequences of five homologous species (Sorghum bicolor, Setaria italica, Hordeum vulgare, Triticum aestivum, and Oryza sativa) were downloaded from Ensembl and NCBI. Homologous sequences were aligned against the genome using TBLASTN (E-value 1 × 10⁻⁵). GeneWise [29] was employed to predict gene models based on the sequence alignment results.
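Soft-masking, as used above before gene prediction, means converting repeat bases to lowercase while leaving the sequence itself intact, so that gene predictors can down-weight repeats without losing information. The Python sketch below illustrates the idea on a toy sequence; it assumes 0-based, half-open repeat intervals and an unmasked (uppercase) input, and the function name is hypothetical rather than part of RepeatMasker.

```python
from typing import Iterable, Tuple


def soft_mask(sequence: str, repeats: Iterable[Tuple[int, int]]) -> str:
    """Lowercase the bases inside each (start, end) repeat interval (0-based, half-open)."""
    masked = list(sequence.upper())  # assumption: the input carries no prior masking
    for start, end in repeats:
        # Clamp intervals to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


# Toy example (hypothetical coordinates): mask positions 2, 3 and 4.
print(soft_mask("ACGTACGTAC", [(2, 5)]))  # prints ACgtaCGTAC
```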
For the RNA-seq predictions, fresh samples of six tissues (stem, endosperm, embryo, bract, silk, and ear tip) were collected. The total RNA was extracted from each sample using an RNAprep Pure Plant Kit (Tiangen Biotech Co., Ltd., Beijing, China). The isolated, purified RNA, having fragment lengths of approximately 300 bp, was the template for constructing a cDNA library. The NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA) was used to construct the cDNA library following the manufacturer’s instructions. The sequencing was performed on an Illumina HiSeq X Ten platform, and 150-bp paired-end reads were generated. Raw reads were trimmed by removing the adapter sequences, reads with more than 5% of unknown base calls (N), and low-quality bases (base quality less than 5). Clean paired-end reads were aligned to the genome using TopHat (Version 2.0.13, RRID:SCR_013035) [46] to identify exon regions and splice positions. The alignment results were then used as input for Cufflinks (Version 2.1.1, RRID:SCR_014597) [47] to assemble the transcripts into gene models. In addition, RNA-seq data were assembled using Trinity (Version 2.1.1, RRID:SCR_013048) [48], creating several pseudo-ESTs (expressed sequence tags). These pseudo-ESTs were also mapped to the assembled genome using BLAT [49] (RRID:SCR_011919), and gene models were predicted using PASA [50] (RRID:SCR_014656). A weighted and non-redundant gene set was generated using EVidenceModeler (EVM, Version 1.1.1, RRID:SCR_014659) [51], which merged all the gene models predicted by the above three approaches. Finally, PASA was used to adjust the gene models generated by EVM. As a result, 39,733 protein-coding genes were annotated in our final set. 
To better understand gene functions, we used all our 39,733 protein-coding genes as queries against public protein databases, including NCBI non-redundant protein sequences, Swiss-Prot, Protein family, Kyoto Encyclopedia of Genes and Genomes (KEGG), InterPro, and Gene Ontology (GO). In total, 39,646 genes (99.8%) were annotated using these databases, and 24,402 (61.41%) were supported by RNA-seq data. Furthermore, the number of genes, the gene length distribution, and the exon length distribution were all comparable to those of other maize inbred lines and common crop species (Table 3).
Table 3
Summary statistics of annotated protein-coding genes in Dan340 and other maize inbred lines and common crop species.
Species  Number   Average transcript length (bp)  Average CDS length (bp)  Average exons per gene  Average exon length (bp)  Average intron length (bp)
Dan340   39,733   3793.47                         1140.91                  4.69                    243.47                    719.61
B73      39,756   3511.78                         1102.11                  4.58                    240.64                    673.10
Mo17     38,620   3362.68                         1140.26                  4.69                    242.98                    601.83
SK       42,942   3857.18                         1179.17                  4.83                    243.93                    698.48
Hvu      24,286   2116.13                         1093.77                  4.1                     267.02                    330.19
Osa      35,679   2165.58                         991.55                   3.78                    262.57                    422.87
Sbi      34,008   2626.44                         1164.14                  4.31                    270.09                    441.76
Sit      27,233   2982.22                         1336.29                  5.14                    260.2                     397.98
Tae      103,539  3087.61                         1277.31                  4.51                    283.23                    515.78
Abbreviations: Hvu: Hordeum vulgare; Osa: Oryza sativa; Sbi: Sorghum bicolor; Sit: Setaria italica; Tae: Triticum aestivum.
Transfer RNA (tRNA) genes were predicted using tRNAscan-SE (Version 1.4, RRID:SCR_010835) [52] with the default parameters. Ribosomal RNAs (rRNAs) were annotated based on their homology with the rRNAs of several species of higher plants using BLASTN with an E-value of 1 × 10⁻⁵. The microRNA (miRNA) and small nuclear RNA (snRNA) fragments were identified by searching the Rfam database (Version 11.0, RRID:SCR_007891) using Infernal (Version 1.1, RRID:SCR_011809) [53, 54]. Finally, 4547 miRNAs, 5963 tRNAs, 63,564 rRNAs, and 1422 snRNAs were identified, with average lengths of 126.79, 75.25, 309.47, and 132.10 bp, respectively (Table 4).
Table 4
Annotation statistics of the non-coding RNAs in the Dan340 genome using different databases.
Type        Copy (w*)  Average length (bp)  Total length (bp)  % of genome
miRNA       4547       126.79               576,516            0.024546
tRNA        5963       75.25                448,705            0.019104
rRNA        63,564     309.47               19,671,118         0.84
  18S       6607       1727.38              11,412,778         0.49
  28S       25,188     143.61               3,617,315          0.15
  5.8S      25,181     153.48               3,864,710          0.16
  5S        6588       117.84               776,315            0.033053
snRNA       1,422      132.1                187,845            0.007998
  CD-box    647        103.2                66,768             0.002843
  HACA-box  123        126.27               15,531             0.000661
  splicing  651        161.72               105,278            0.004482
Comparative genomic analysis between Dan340 and other maize lines
We applied the OrthoMCL pipeline [55] to identify orthologous gene families among the four maize inbred lines: Dan340, B73, Mo17, and SK. The longest protein from each gene was selected, and proteins shorter than 30 amino acids were removed. Subsequently, pairwise sequence similarities between all input protein sequences were calculated using BLASTP [56] (RRID:SCR_001010) with an E-value cut-off of 1 × 10⁻⁵. Markov clustering (MCL) of the resulting similarity matrix was used to define the ortholog cluster structure of the proteins, using an inflation value (-I) of 1.5 (default setting of OrthoMCL).
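The pre-filtering step described above (one representative protein per gene, minimum length of 30 amino acids) can be sketched in a few lines of Python. This is an illustration under assumed inputs, not the OrthoMCL code itself: the input is taken to be an iterable of (gene ID, protein ID, sequence) tuples, and all identifiers in the example are hypothetical.

```python
def select_representatives(proteins, min_len=30):
    """Keep the longest protein per gene, then drop genes whose protein is shorter than min_len.

    proteins: iterable of (gene_id, protein_id, sequence) tuples.
    Returns a dict mapping gene_id to (protein_id, sequence).
    """
    best = {}
    for gene_id, protein_id, seq in proteins:
        # On ties, the first protein seen for a gene is kept.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    return {gene: rep for gene, rep in best.items() if len(rep[1]) >= min_len}


# Toy input (hypothetical IDs): gene g1 has two isoforms, gene g2 is too short.
toy = [
    ("g1", "g1.t1", "M" * 50),
    ("g1", "g1.t2", "M" * 80),
    ("g2", "g2.t1", "M" * 20),
]
kept = select_representatives(toy)
print(sorted(kept))  # only g1 remains, represented by g1.t2
```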
Next, comparative analyses were performed among Dan340, B73, Mo17, and SK (Figure 7A). The genes from the Dan340 genome and those from B73, Mo17 and SK were clustered into 27,654 gene families. Of these, 15,690 families were shared among the four maize inbred lines, representing a core set of genes across these maize genomes. We found 1806 genes from 359 gene families that were specific to Dan340, of which many had functional GO annotations related to “protein phosphorylation”, “single-organism catabolic process”, and “pheromone binding” (Figure 7B). Using the KEGG functional enrichment, the most enriched pathways of the Dan340-specific genes were “antifolate resistance”, “epithelial cell signaling in Helicobacter pylori infection”, and “pentose and glucuronate interconversions” (Figure 7C).
Figure 7.
Gene family analyses and core- and pan-genomes of maize. (A) Comparisons of gene families in Dan340, B73, Mo17, and SK. The Venn diagram illustrates the shared and unique gene families among the four maize inbred lines. (B) GO enrichment analysis of Dan340-specific genes. (C) KEGG analysis of Dan340-specific genes. (D) Core- and pan-genomes of maize. The histograms show the core-gene clusters (shared by all four genomes), dispensable gene clusters (present in three or two genomes), and specific gene clusters (present only in one genome).
In addition, OrthoMCL was used to identify the core and dispensable gene sets based on gene families. The gene families that were shared among the four inbred lines were defined as core gene families. Furthermore, gene families shared among three inbred lines, between two inbred lines, and those only present in one inbred line (private gene families) are also displayed in Figure 7D.
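The core, dispensable, and private categories described above depend only on how many of the four inbred lines contribute genes to a family. A minimal Python sketch of this classification is shown below, assuming each family is represented by the set of lines present in it; the family identifiers are hypothetical toy data.

```python
from collections import Counter

LINES = ("Dan340", "B73", "Mo17", "SK")


def classify_family(lines_present):
    """Classify one gene family by the set of inbred lines that contribute genes to it."""
    present = set(lines_present) & set(LINES)
    if len(present) == len(LINES):
        return "core"         # shared by all four lines
    if len(present) >= 2:
        return "dispensable"  # present in two or three lines
    if len(present) == 1:
        return "private"      # present in a single line only
    raise ValueError("family has no members from the four lines")


# Toy families (hypothetical IDs and memberships).
families = {
    "fam1": {"Dan340", "B73", "Mo17", "SK"},
    "fam2": {"Dan340", "B73"},
    "fam3": {"Dan340"},
    "fam4": {"B73", "Mo17", "SK"},
}
summary = Counter(classify_family(members) for members in families.values())
print(dict(summary))
```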
Genetic variation analysis
To investigate the genetic and structural variations between Dan340 and other maize inbred lines, we first aligned the other three genomes to the Dan340 reference genome using MUMmer (V 4.0.0 beta2) [57] (RRID:SCR_018171) with the parameters “--mum -g 1000 -c 90 -l 40”. The alignment files were then filtered to generate 1-to-1 mappings using delta-filter with the parameters “-m -i 90 -l 100”. The genomes of B73 and Mo17 were downloaded from MaizeGDB [58], and the genome of SK was obtained from the National Genomics Data Center [59]. Next, the output of Nucmer was analyzed using SyRI [60] with default parameters to identify variations. Using this pipeline, we obtained structural variation sets and wrote them to a VCF file. We also used PBSV (Version 2.2.2) [61] to investigate the genetic and structural variations (details and outputs available via the GigaDB entry [62]).
The high-quality Dan340 reference genome allowed us to identify large structural variations (SVs) in different maize inbred lines. By aligning the genome of B73 to the Dan340 genome, we identified 36,363 structural variations (longer than 500 bp) between the two representative maize genomes, including 15,923 insertions, 16,173 deletions, 141 inversions, and 4,126 duplications (Table 5). Furthermore, the structural variations present in Mo17 and SK were also detected in this study (Tables 6 and 7). The dataset generated by PBSV is available in GigaDB [62]. These datasets provide abundant variation resources for future molecular improvements and breeding in maize.
Table 5
Structural variations between Dan340 and B73.
Chr. Number  Insertion            Deletion             Inversion            Duplication
             Number  Length (bp)  Number  Length (bp)  Number  Length (bp)  Number  Length (bp)
1            2,196   17,315,745   2,352   18,827,437   20      1,157,776    456     1,574,417
2            1,969   21,308,887   2,037   24,356,338   27      44,078,181   885     6,684,632
3            1,908   16,163,199   1,951   16,640,435   14      13,164,411   424     1,376,864
4            1,764   14,082,107   1,803   15,310,877   17      1,036,816    338     1,118,242
5            1,395   11,820,766   1,400   11,943,768   12      7,848,883    331     1,422,310
6            1,237   12,859,269   1,179   12,589,548   11      12,425,596   368     1,291,481
7            1,513   15,084,605   1,463   16,126,941   7       19,727,655   534     6,342,568
8            1,442   12,042,545   1,420   11,464,697   12      531,876      301     1,034,272
9            1,397   13,366,945   1,421   13,770,696   9       3,773,691    286     992,215
10           1,102   8,984,895    1,147   9,437,955    12      362,594      203     692,374
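The genome-wide totals quoted in the text for Dan340 versus B73 can be recovered by summing the per-chromosome counts in Table 5. The Python sketch below performs this check; the counts are transcribed from Table 5 in chromosome order 1 to 10, and the variable names are illustrative.

```python
# Per-chromosome SV counts for Dan340 vs. B73, transcribed from Table 5 (chromosomes 1-10).
table5_counts = {
    "insertion": [2196, 1969, 1908, 1764, 1395, 1237, 1513, 1442, 1397, 1102],
    "deletion": [2352, 2037, 1951, 1803, 1400, 1179, 1463, 1420, 1421, 1147],
    "inversion": [20, 27, 14, 17, 12, 11, 7, 12, 9, 12],
    "duplication": [456, 885, 424, 338, 331, 368, 534, 301, 286, 203],
}

# Sum each SV type across chromosomes, then sum across types.
totals = {sv_type: sum(counts) for sv_type, counts in table5_counts.items()}
grand_total = sum(totals.values())

for sv_type, total in totals.items():
    print(f"{sv_type}: {total}")
print(f"all SVs: {grand_total}")
```

The sums reproduce the 15,923 insertions, 16,173 deletions, 141 inversions, and 4,126 duplications reported in the text, for 36,363 structural variations in total.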
Table 6
Structural variations between Dan340 and Mo17.
Chr. Number  Insertion            Deletion             Inversion            Duplication
             Number  Length (bp)  Number  Length (bp)  Number  Length (bp)  Number  Length (bp)
1            2,112   16,595,761   2,268   18,826,385   24      1,892,611    364     1,360,243
2            1,775   17,433,432   1,822   19,928,356   27      44,197,495   671     3,764,856
3            1,810   15,987,579   1,850   16,785,758   11      12,931,342   364     1,093,740
4            1,305   10,393,408   1,360   12,151,836   12      8,399,292    270     932,811
5            1,401   11,584,434   1,481   12,571,981   13      6,688,118    292     961,542
6            1,163   11,784,318   1,182   12,742,685   18      13,228,176   382     1,182,254
7            1,466   15,144,929   1,515   18,539,394   6       19,813,930   520     4,613,073
8            1,266   9,520,525    1,386   11,021,444   15      1,387,813    290     1,134,832
9            1,335   12,303,841   1,388   13,598,480   13      4,032,985    282     1,024,261
10           1,083   8,801,874    1,138   9,738,720    17      1,044,530    247     812,820
Table 7
Structural variations between Dan340 and SK.
Chr. Number  Insertion            Deletion             Inversion            Duplication
             Number  Length (bp)  Number  Length (bp)  Number  Length (bp)  Number  Length (bp)
1            2,242   16,991,522   2,435   19,097,136   33      2,305,401    557     1,959,780
2            1,956   21,684,186   2,045   24,111,211   23      50,179,678   923     5,280,910
3            1,909   15,947,617   1,943   15,909,065   20      13,717,305   450     1,485,303
4            2,026   16,759,290   2,101   17,565,665   28      3,102,314    390     1,238,308
5            1,665   13,409,385   1,727   14,676,710   28      8,960,125    381     1,429,527
6            1,201   12,554,766   1,251   12,023,404   12      15,291,801   434     1,552,484
7            1,502   15,584,044   1,554   16,932,500   17      19,884,384   609     2,554,084
8            1,229   9,421,014    1,311   10,010,279   8       398,434      324     1,110,257
9            1,787   18,944,180   1,866   17,898,278   13      3,443,024    362     1,285,568
10           1,087   8,296,044    1,189   9,755,942    15      966,661      260     870,594
Conclusions
We assembled the chromosome-level genome of the maize elite inbred line Dan340 using long CCS reads from the third-generation PacBio Sequel II sequencing platform, with scaffolding informed by Hi-C. The final assembly of the Dan340 genome was 2348.72 Mb, including 2738 contigs and 2315 scaffolds with N50 of 41.49 Mb and 215.35 Mb, respectively. Comparisons of the Dan340 genome with the reference genomes of three other common maize inbred lines identified 1806 genes from 359 gene families that were specific to Dan340. In addition, we identified large numbers of structural variants between Dan340 and other maize inbred lines, which may underlie the phenotypic differences between Dan340 and other maize varieties. Therefore, the assembly and annotation of this genome improve our understanding of the intraspecific genomic diversity in maize and provide novel resources for maize breeding improvements.
Data Availability
The raw sequence data have been deposited in NCBI under project accession No. PRJNA795201. Data is also available in the GigaScience GigaDB repository [62].
Abbreviations
BUSCO: Benchmarking Universal Single-Copy Orthologs; Hi-C: chromosomal conformation capture; CCS: circular consensus sequencing; EVM: EVidenceModeler; HiFi: long high-fidelity; LTR: long-terminal repeat; LAI: long-terminal repeat assembly index; TEs: transposable elements; KEGG: Kyoto Encyclopedia of Genes and Genomes; GO: Gene Ontology; MCL: Markov clustering; NCBI: National Center for Biotechnology Information; N: unidentified nucleotides; PacBio: Pacific Biosciences; SMRT: single-molecule real-time; SV: structural variations; tRNA: transfer RNA; rRNA: ribosomal RNA; miRNA: microRNA; snRNA: small nuclear RNA.
Competing Interests
The authors declare that they have no competing interests.
Authors’ contributions
FW, JZ and HZ conceived the project; Y-KZ, DM and YW wrote and modified the manuscript; LX, GF and LW performed the data curation; YH, LZ, Y-LZ and ZL analyzed the data. All authors read and approved the final manuscript.
Funding
This research was supported by grants from the special project for the construction of scientific and technological innovation capacity of Beijing Academy of Agriculture and Forestry Sciences (NO. KJCX20200305).
References
1. Yang N, Xu XW, Wang RR, et al. Contributions of Zea mays subspecies mexicana haplotypes to modern maize. Nat. Commun., 2017; 8: 1874.
2. Hubert B, Rosegrant M, Boekel MA, et al. The future of food: scenarios for 2050. Crop. Sci., 2010; 50: 33–50.
3. Hake S, Ross-Ibarra J. Genetic, evolutionary and plant breeding insights from the domestication of maize. Elife, 2015; 4: e05861.
4. Schnable PS, Ware D, Fulton RS, et al. The B73 maize genome: complexity, diversity, and dynamics. Science, 2009; 326: 1112–1115.
5. Hirsch CN, Hirsch CD, Brohammer AB, et al. Draft assembly of elite inbred line PH207 provides insights into genomic and transcriptome diversity in Maize. Plant Cell, 2016; 28: 2700–2714.
6. Jiao Y, Peluso P, Shi J, et al. Improved maize reference genome with single-molecule technologies. Nature, 2017; 546: 524–527.
7. Sun S, Zhou Y, Chen J, et al. Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes. Nat. Genet., 2018; 50: 1289–1295.
8. Li C, Song W, Luo Y, et al. The HuangZaoSi Maize genome provides insights into genomic variation and improvement history of Maize. Mol. Plant., 2019; 12: 402–409.
9. Lu F, Romay MC, Glaubitz JC, et al. High-resolution genetic mapping of maize pan-genome sequence anchors. Nat. Commun., 2015; 6: 6914.
10. Haberer G, Kamal N, Bauer E, et al. European maize genomes highlight intraspecies variation in repeat and gene content. Nat. Genet., 2020; 52: 950–957.
11. Hufford MB, Seetharam AS, Woodhouse MR, et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science, 2021; 373: 655–662.
12. Zhang R, Xu G, Li J, et al. Patterns of genomic variation in Chinese maize inbred lines and implications for genetic improvement. Theor. Appl. Genet., 2018; 131: 1207–1221.
13. Yang N, Liu J, Gao Q, et al. Genome assembly of a tropical maize inbred line provides insights into structural variation and crop improvement. Nat. Genet., 2019; 51: 1052–1059.
14. Belton JM, McCord RP, Gibcus JH, et al. Hi-C: a comprehensive technique to capture the conformation of genomes. Methods, 2012; 58: 268–276.
15. Zhao Y, Wang Y, Ma D, et al. Hi-C library construction from young Maize leaves. protocols.io. 2022; https://dx.doi.org/10.17504/protocols.io.bp2l61mkzvqe/v1.
16. Cheng H, Concepcion GT, Feng X, et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Meth., 2021; 18: 170–175.
17. PacBio SMRTLink. https://www.pacb.com/support/software-downloads/. Accessed 16 May 2021.
18. Pbmarkdup. https://github.com/PacificBiosciences/pbmarkdup. Accessed 16 May 2021.
19. Chin CS, Alexander DH, Marks P, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Meth., 2013; 10: 563–569.
20. Walker BJ, Abeel T, Shea T, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 2014; 9(11): e112963.
21. Purge Haplotigs pipeline. https://bitbucket.org/mroachawri/purge_haplotigs/overview. Accessed 20 June 2021.
22. Wingett S, Ewels P, Furlan-Magaril M, et al. HiCUP: pipeline for mapping and processing Hi-C data. F1000Res, 2015; 4: 1310.
23. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint. 2013; https://arxiv.org/abs/1303.3997.
24. Zhang X, Zhang S, Zhao Q, et al. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants, 2019; 5: 833–845.
25. Durand NC, Robinson JT, Shamim MS, et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst., 2016; 3: 99–101.
26. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 2010; 26: 589–595.
27. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics, 2007; 23: 1061–1067.
28. Gertz EM, Yu YK, Agarwala R, et al. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol., 2006; 4: 41.
29. Birney E, Clamp M, Durbin R. GeneWise and genomewise. Genome Res., 2004; 14: 988–995.
30. Alioto T, Blanco E, Parra G, et al. Using geneid to identify genes. Curr. Protoc. Bioinform., 2018; 64: e56.
31. Simão FA, Waterhouse RM, Ioannidis P, et al. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31: 3210–3212.
32. BUSCO (v.3) with embryophyta_odb10 database. https://busco-archive.ezlab.org/v3/frame_plants.html. Accessed 6 August 2021.
33. Ellinghaus D, Kurtz S, Willhoeft U. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinform., 2008; 9: 18.
34. Xu Z, Wang H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucl. Acids Res., 2007; 35: W265–W268.
35. Ou S, Chen J, Jiang N. Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucl. Acids Res., 2018; 46(21): e126.
36. RepeatModeler. http://www.repeatmasker.org/RepeatModeler/. Accessed 6 August 2021.
37. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics, 2005; 21(Suppl 1): i351–i358.
38. RepeatMasker. http://www.repeatmasker.org/. Accessed 16 August 2021.
39. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucl. Acids Res., 1999; 27: 573–580.
40. Stanke M, Steinkamp R, Waack S, et al. AUGUSTUS: a web server for gene finding in eukaryotes. Nucl. Acids Res., 2004; 32: W309–W312.
41. Stanke M, Keller O, Gunduz I, et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucl. Acids Res., 2006; 34: W435–W439.
42. Stanke M, Morgenstern B. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucl. Acids Res., 2005; 33: W465–W467.
43BurgeC, KarlinS. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 1997; 268: 7894.
44MajorosWH, PerteaM, SalzbergSL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics, 2004; 20: 28782879.
45KorfI. Gene finding in novel genomes. BMC Bioinform., 2004; 5: 59.
46KimD, PerteaG, TrapnellC TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol., 2013; 14: R36.
47TrapnellC, RobertsA, GoffL Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc., 2012; 7: 562578.
48GrabherrMG, HaasBJ, YassourM Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol., 2011; 29: 644652.
49KentWJ. BLAT–the BLAST-like alignment tool. Genome Res., 2002; 12: 656664.
50HaasBJ, DelcherAL, MountSM Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucl. Acids Res., 2003; 31: 56545666.
51HaasBJ, SalzbergSL, ZhuW Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol., 2008; 9: R7.
52LoweTM, EddySR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucl. Acids Res., 1997; 25: 955964.
53. Nawrocki EP, Kolbe DL, Eddy SR. Infernal 1.0: inference of RNA alignments. Bioinformatics, 2009; 25: 1335–1337.
54. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 2013; 29: 2933–2935.
55. Li L, Stoeckert CJ, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res., 2003; 13: 2178–2189.
56. Altschul SF, Gish W, Miller W et al. Basic local alignment search tool. J. Mol. Biol., 1990; 215: 403–410.
57. Kurtz S, Phillippy AM, Delcher AL et al. Versatile and open software for comparing large genomes. Genome Biol., 2004; 5(2): 19.
58. MaizeGDB. https://maizegdb.org/download. Accessed 8 August 2021.
59. National Genomics Data Center. https://ngdc.cncb.ac.cn/search/?dbId=gsa&q=CRA001371. Accessed 8 August 2021.
60. Goel M, Sun H, Jiao WB et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol., 2019; 20: 277.
61. PBSV. https://github.com/PacificBiosciences/pbsv. Accessed 11 September 2021.
62. Zhao Y, Wang Y, Ma D et al. Supporting data for “A chromosome-level genome assembly and annotation of a maize elite breeding line Dan340”. GigaScience Database. 2022; http://dx.doi.org/10.5524/102221.