Published online : 8 October 2021
Article Outline
Scroll to top
Data Release
A chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)
 Views 913
 Downloads 239
Review History
Download PDF

Cite this article as... 

Kenta Shirasawa, Rakesh Chahota, Hideki Hirakawa, Soichiro Nagano, Hideki Nagasaki, Tilak Sharma, Sachiko Isobe, A chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)Gigabyte, 2021  https://doi.org/10.46471/gigabyte.30

 Copy citation
Gigabyte
Gigabyte
2709-4715
GigaScience Press
Sha Tin, New Territories, Hong Kong SAR
Data description
Background
Horsegram (Macrotyloma uniflorum [Lam.] Verdc.) (NCBI:txid271171), is an underutilized warm-season diploid legume (2n = 20, 22). It belongs to the Fabaceae family of the Phaseoleae tribe, and is cultivated mainly in semi-arid regions of the world. On the Indian subcontinent, horsegram is consumed primarily as a food legume, whereas in Africa and Australia it is grown mainly for use as a concentrated animal feed and fodder. This self-pollinating plant is thought to have originated in Africa because most of its 32 wild species exist there [1], and the Northwestern Himalayan region is considered its secondary center of origin [2]. Horsegram may have been domesticated as M. uniflorum var. uniflorum in the southern part of India, but its probable progenitor, M. axillare, has not been reported in India. Therefore, the process by which cultivated horsegram was domesticated from its wild ancestors has not yet been established [3].
Because of its ability to grow under water-deficient and marginal soil conditions, horsegram is a preferred choice in the era of global climate change. Horsegram contains 16.0–30.4% protein [4], and constitutes an important source of dietary protein for the undernourished population in south Asia. In addition, the seeds are a rich source of lysine and vitamins [5], and its antioxidant, antimicrobial, and unique antilithiatic properties make it a food of nutraceutical importance [68]. As a result of horsegram’s medicinal importance and ability to thrive under drought-like conditions, the US National Academy of Sciences has identified this legume as a potential food source for the future [9].
Context
The existence of many wild and unsolicited characteristics makes horsegram a less favorable legume for commercial cultivation, although it does possess numerous attributes that make it a potential food legume for warm arid regions. In addition, there is a lack of genetic and molecular tools with which to genetically enhance horsegram. To elucidate the potential of this food legume species, we generated and analyzed a draft genome sequence for HPK-4, a horsegram cultivar commercially released by CSK Himachal Pradesh Agricultural University (HPAU), Palampur, India. This variety, which has dark-brown seeds, is under cultivation in many parts of the Northwestern Indian Himalayan region. It is resistant to anthracnose (Colletotrichum truncatum) and tolerant to abiotic stresses such as drought, salinity, and heavy metals. This is the first attempt to generate a draft genome sequence of this ‘orphan’, but it is an important food legume species and will provide a reference for sequence-based analysis of horsegram germplasm to elucidate the genetic bases of important traits.
Methods
Whole genome sequencing and assembly of horsegram
The genome sequences of a horsegram variety, HPK-4, bred at CSK-HPAU, were generated from a paired-end (PE) library by Illumina HiSeq 2000 with a total length of 37.9 Gbp (gigabase pairs) [10]. All data analysis for this study was performed on Linux servers running Red Hat Enterprise Linux Server 7.1. Using the Jellyfish v1.1.6 program (RRID: SCR_005491[11], the genome size of HPK-4 was estimated to be approximately 343.6 Mbp (megabase pairs) (Figure 1). The parameters used in the analysis are listed in Table 1.
Figure 1.
Genome size estimation using Jellyfish with the distribution of the number of distinct k-mers (k = 17) with the given multiplicity values.
Table 1
Parameters used in each program.
Program nameParameters or BUSCO data setComments
Jellyfish v1.1.6–m 17 –s 1000000000 –t 32 –C
SOAPdenovo2 r223–K (61 71 81 91) –R –F –p 8
SSPACE v2.0–x 0 –z 0 –k 3 –a 0.7 –n 15 –T 8 –g 0 –v 1
GapFiller v1.10–m 30 –o 5 –r 0.7 –n 10 –d 50 –t 10 –g 0 –T 8 –i 1
Platanus v1.2.1–t 12 –m 300
MaSuRCA v2.3.2Default Parameters
TruSPAdes v3.6.2Default Parameters
RepeatMasker v3.2.9–poly –x –lib
RepeatScout v1.0.5Default Parameters
BRAKER1 v1.9Default Parameters
BUSCO v3.0Embryophyta, odb10
Samtools 0.1.19samtools mpileup mpileup –d 10000000 –D –u
bcftools 0.1.19bcftools view –c –g –v
vctools 0.1.12bvcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 5 –minQ 214 –max-missing 1SNP filterfing for 8 F2 WGS
vctools 0.1.12bvcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 10 –minQ 50 –max-missing 0.2SNP filterfing for 214 F2TAS
vctools 0.1.12bvcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 5 –minQ 999 –max-missing 0.5 –maf 0.05SNP filtering for 89 population
JoinMap 4Kosambi’s mapping function, linkage with rec. frec. Smaller than 0.4 and a LOD lather than 1.0, Goodness-of-fit for removal of loci  = 5.0, Number of added loci after which to perform a ripple  = 1, Third round  = yes
The Illumina PE reads were assembled by SOAPdenovo2 r223 (RRID: SCR_014986[12] with k-mers of 61 and 81, and contigs were generated with total lengths of 352.2 Mbp (k-mer = 81) and 389.3 Mbp (k-mer = 61) (Table 2). The contigs constructed with k-mer = 81 were selected and scaffolded with mate-pair (MP) reads with insert sizes of 2, 5, 10, and 15 Kbp (kilobase pairs) by using SSPACE v2.0 (RRID: SCR_005056[13]. The number of generated scaffolds was 6227 after gap filling by GapFiller [14] and excluding contamination. The total length of the scaffolds was 297.1 Mbp (Assembly 1, Table 2), which was approximately 55–92 Mbp shorter in length than the estimated genome size of HPK-4.
Table 2
Statistics of de novo whole genome assembly.
File nameSOAPdenovo ContigsAssembly 1 SOAPdenovo/SSPACE/GapFiller (k-mer  = 81)Assembly 2 PlatanusAssembly 3 MaSuRCAAssembly 4 TruSPAdesAssembly 5 GMcloserMUN_r1.1MUN_r1.11MUN_r1.11 pseudomolecule
Input reads PEPEPE + MPPE + MPPE + MPPE + MPSLRAssembly 1 + Assembly 4Assembly 5MUN_r1.1MUN_r1.11 + Linkage map
Comments k-mer  = 61 k-mer  = 81Include contaminationExclude contaminationInclude contaminationInclude contamination Exclude contamination≥500bpScaffolds revised
All
Number of sequences 1,534,576779,1017,1236,22862,32317,400374,2536,2283,4953,49710
Total length (bp) 389,388,347352,263,669297,816,217297,127,168287,695,252313,146,8821,357,659,302295,740,202294,688,765294,688,765259,245,825
Average length (bp) 25445241,81147,7084,61617,9973,62847,48684,31784,26925,924,583
Max length (bp) 43,86486,78613,495,99513,495,99513,114,3789,397,72179,94813,482,8539,844,2739,844,27333,386,276
Min length (bp) 6282300300100711,50014650050015,505,026
N50 length (bp) 2,6026,1083,571,8133,571,8134,221,4422,147,7354,1203,568,8832,818,5552,818,55528,154,654
A 137,634,352123,511,85598,936,27098,713,63595,878,438103,574,375440,896,819100,065,27599,718,91599,718,91588,615,763
T 128,359,494116,788,62998,000,49797,829,34396,128,274103,392,204440,199,34199,104,67898,784,55598,784,55588,538,203
G 61,810,61256,330,39443,804,35543,679,03842,484,81646,002,912238,337,48444,442,74444,254,55744,254,55738,863,122
C 61,583,88955,632,79143,617,62243,524,19642,490,69846,133,757238,219,30844,268,95644,083,22744,083,22738,986,862
N 0013,457,47313,380,95610,713,02614,043,6346,3507,858,5497,847,5117,847,5114,241,875
Total (ATGC, bp) 389,388,347352,263,669284,358,744283,746,212276,982,226299,103,2481,357,652,952287,881,653286,841,254286,841,254255,003,950
GC% (ATGC) 31.7030.730.730.730.835.130.830.830.830.5
≥300 bp
Number of sequences 96,83485,2297,1236,22813,04517,107374,2536,226---
Total length (bases) 255,385,256270,010,759297,816,217297,127,168281,104,166313,093,5591,357,659,302295,739,758---
Average length (bases) 2,6373,16841,81147,70821,54918,3023,62847,501---
≥500 bp
Number of sequences 72,09756,0653,9453,4688,51413,654374,2533,4693,4953,497
Total length (bases) 245,981,429258,994,765296,598,533296,074,149279,298,951311,725,7031,357,659,302294,688,765294,688,765294,688,765
Average length (bases) 3,4124,61975,18385,37332,80522,8303,62884,94984,31784,269
≥1 Kbp
Number of sequences 52,71039,1761,9761,8422,7169,787374,2531,8361,8621,864
Total length (bases) 232,331,716247,369,100295,266,774294,976,239275,198,779308,936,1661,357,659,302293,585,247293,585,247293,585,247
Average length (bases) 4,4086,314149,427160,139101,32531,5663,628159,905157,672157,503
≥2 Kbp
Number of sequences 27,00723,6881,2051,1865114,079190,8531,1761,2021,204
Total length (bases) 185,671,228219,707,018294,019,523293,893,601272,047,316299,058,101960,490,176292,496,469292,496,469292,496,469
Average length (bases) 6,8759,275244,000247,802532,38273,3175,033248,721243,341242,937
≥3 Kbp
Number of sequences 15,66816,5051,0841,0733953,26170,3961,0561,0821,084
Total length (bases) 141,491,218191,540,893293,553,905293,455,161271,611,649295,920,059495,770,522292,032,037292,032,037292,032,037
Average length (bases) 9,03111,605270,806273,490687,62490,7457,043276,545269,900269,402
We speculated that the shorter observed length of the total scaffolds may have been caused by misintegration of repeat sequences by SSPACE v2.0. Therefore, we performed subsequent assemblies using two programs, Platanus v1.2.1 (RRID: SCR_015531[15] and MaSuRCA v2.3.2 (RRID: SCR_010691[16]. The total ATGC lengths of the scaffolds were not significantly different among the three assemblies: 284.4 Mbp in SOAPdenovo-SSPACE (Assembly 1, before excluding contamination; Table 2), 277.0 Mbp in Platanus (Assembly 2), and 299.1 Mbp in MaSuRCA (Assembly 3).
Meanwhile, an Illumina synthetic long-reads (SLR) library was constructed with high-molecular-weight cellular DNA using a TruSeq synthetic long-read DNA library prep kit (Illumina). Sequences were generated by Illumina HiSeq 2000 and MiSeq systems with read lengths of 93 nt and 251 nt, respectively. The SLR reads were synthesized through the TruSPAdes v3.6.2 pipeline [17]. Among the three assemblies with PE and MP reads, Assembly 1 was used for subsequent analysis, and gaps were closed with Illumina SLRs by GMcloser (RRID: SCR_000646[18]. Potentially contaminated sequences were excluded using BLASTN searches against the chloroplast and mitochondrial genome sequences of Arabidopsis thaliana (accession numbers NC_000932.1 and NC_001284.2), human genome sequences (hg19 [19]), fungal genome sequences registered with the National Center for Biotechnology Information (NCBI) [20], bacterial genome sequences registered with [21], vector sequences in UniVec [22], and PhiX (NC_001422.1) [23] sequences with E-value cutoffs of 1 × 10−10 and length coverage >10%. The total length of the resultant assembly (Assembly 5) was 295.7 Mbp.
The results of benchmarking universal single-copy ortholog (BUSCO) analysis (RRID: SCR_015008[24] identified that 93.1% of BUSCOs were found as complete genes in Assembly 5. We therefore considered that Assembly 5 covered most of the coding regions of the horsegram genome. Sequences shorter than 500 bp were excluded from Assembly 5, and the remaining sequences were designated as MUN_r1.1.
Linkage map and pseudomolecule construction
To construct chromosome-scale genome sequences, a SNP linkage map was created with the 214 F2 progenies. SNPs segregating in the F2 population were detected by mapping Illumina re-sequence reads of the eight F2 individuals onto the assembled genome using Bowtie2 (RRID: SCR_016368[25], and by calling variants using SAMtools 0.1.19 (RRID: SCR_002105[26] and vcftools 0.1.12 (RRID: SCR_001235[27]. Target amplicon sequencing (TAS) was performed to genotype the identified SNPs according to the methods described in Shirasawa et al. [28].
The linkage map was constructed using JoinMap 4 with Kosambi’s mapping function (RRID: SCR_009248[29]. The assembled genome sequence scaffolds were aligned onto the linkage map for pseudomolecule construction. The female parent of the F2 progenies was HPK-4. The male parent was initially considered to be HPKM-193, but this assignment was later found to be wrong when the whole genome sequences of HPK-4, HPKM-193, and the eight F2 progenies were compared. Candidate SNPs segregating in the F2 progenies were identified by mapping the whole genome Illumina sequences of the eight randomly selected F2 progenies onto MUN_r1.1.
A total of 2942 SNPs were identified, and 1378 SNPs were successfully genotyped by TAS analysis in 214 F2 progenies. Of these, 1263 SNPs were mapped onto the ten linkage groups with a total length of 980 cM (Table 3). A total of 219 scaffolds in MUN_r1.1 were then aligned onto the linkage map (Figure 2; Table 3; and in GigaDB [10]). During the process of alignment, two scaffolds were discovered to be misscaffoldings and split. The revised set of scaffolds was designated as MUN_r1.11 (Table 4; Table 2). The number of sequences of MUN_r1.11 was 3,495, with a total length of 294.7 Mbp and an N50 length of 2.8 Mbp. The aligned scaffolds on the linkage map were connected to 10,000 Ns for the construction of chromosome-scale pseudomolecules. The total length of the ten pseudomolecules was 259.2 Mbp, with an N50 length of 28.2 Mbp (Table 4; Table 5). When the total length of the A, G, T, and C bases was compared, the 10 pseudomolecules were found to cover 89% of the scaffolds in MUN_r1.11. The ratios of complete BUSCOs identified in MUN_r1.11 and the 10 scaffolds were 93.1% and 87.4%, respectively. Most of the complete BUSCOs were identified as single copies, suggesting a slow rate of duplication in the coding regions of the assembled genomes.
Figure 2.
Anchoring the horsegram genome assembly to the genetic linkage map. The linkage groups (left black bars) and 219 anchored MUN_r1.1 scaffolds (right blue bars) with 1263 SNPs. The crossbars on the linkage groups show the positions of mapped SNPs. Blue, aqua, pink, and red colors represent the numbers of mapped SNPs per cM of 1–5, 6–10, 10–15, and ≧16, respectively.
Table 3
Statistics of a SNP linkage map and numbers of anchored scaffolds.
  Linkage mapNumber of anchored scaffolds (MUN_r1.1)
Number of mapped SNPsLength (cM)Mean distance between SNPs (cM)Segregation distortion ratio (%)
Chr1 148 97.2 0.664.0529
Chr2 128 73.1 0.57 74.2226
Chr3 84 120.3 1.433.5714
Chr4 131 123.1 0.94 14.5018
Chr5 124 100.8 0.813.2322
Chr6 76 88.9 1.17 14.5716
Chr7 148 119.1 0.80 12.8422
Chr8 185 117.1 0.632.7019
Chr9 87 51.7 0.59 81.6125
Chr10 152 88.8 0.583.9528
Total 1,263 980 0.78   219
Table 4
Statistics on the horsegram genome assembly and CDS.
  MUN_r1.11 MUN_r1.11MUN_r1.1_cds
Genome/ScaffoldsGenome/Pseudomolecules CDS
Number of sequences 3,497 10 36,105
Total length (bp) 294,688,765 259,245,825 38,820,013
Average length (bp) 84,269 25,924,583 1,075
Maximum length (bp) 9,844,273 33,386,276 15,732
Minimum length (bp) 500 15,505,026 150
N50 length (bp) 2,818,555 28,154,654 1,488
Total length of AGTC (bp) 286,841,254 255,003,950
Gaps (bp) 7,847,511 4,241,875 -
GC% 30.8 30.5 43.8
Repeat % 28.99497136 - -
Number of complete genes - - 35,508
Number of partial genes - - 597
Table 5
Assembly statistics of MUN_r1.11 pseudomolecules.
 MUN_chr01MUN_chr02MUN_chr03MUN_chr04MUN_chr05MUN_chr06MUN_chr07MUN_chr08MUN_chr09MUN_chr10
Total length of sequences (bp)28,154,65424,194,72723,973,32931,423,84725,354,79818,013,15928,753,26033,386,27615,505,02630,486,749
A 9,658,539 8,292,913 8,231,11610,813,335 8,677,934 6,151,137 9,805,58211,440,621 5,247,87010,296,716
T 9,689,700 8,302,615 8,197,37510,734,237 8,667,169 6,161,839 9,816,93511,432,900 5,212,37010,323,063
G 4,127,487 3,524,028 3,621,541 4,739,623 3,794,983 2,702,465 4,354,257 5,051,796 2,292,810 4,654,132
C 4,128,504 3,558,607 3,614,862 4,779,430 3,830,211 2,738,375 4,343,032 5,030,174 2,307,484 4,656,183
N 550,424 516,564 308,435 357,222 384,501 259,343 433,454 430,785 444,492 556,655
Total (ATGC)27,604,23023,678,16323,664,89431,066,62524,970,29717,753,81628,319,80632,955,49115,060,53429,930,094
GC% (ATGC) 29.9 29.9 30.6 30.6 30.5 30.6 30.7 30.6 30.5 31.1
Number of anchored scaffolds 29 26 14 18 22 16 22 19 25 28
Total length of scaffolds27,874,65423,944,72723,843,32931,253,84725,144,79817,863,15928,543,26033,206,27615,265,02630,216,749
Total length of inserted Ns (N10000) 280,000 250,000 130,000 170,000 210,000 150,000 210,000 180,000 240,000 270,000
Repetitive sequences
Repetitive sequences in the assembled genome were identified by RepeatMasker v3.2.9 (RRID: SCR_012954[30] for known repetitive sequences registered in Repbase (RRID: SCR_021169[31], and de novo repetitive sequences were defined by RepeatScout v1.0.5 (RRID: SCR_014653[32]. A total of 50.2 Mbp of repetitive sequences were identified on the assembled genome, occupying 29% of the total length (Table 6). Of the identified repetitive sequences, the sequences registered in Repbase were found on 12.0% of the assembled genome, whereas unique repetitive sequences, i.e., those not registered in Repbase, were located on 17.0% of the assembled genome. Simple sequence repeat (SSR) motifs were identified by MISA mode in SciRoKo software with the default parameters (RRID: SCR_000941[33]. A total of 74,362 SSRs were identified in MUN_r1.11 with an average frequency of 0.21 SSR per 100 Kbp [10]. The highest SSR frequency, 0.66 SSR per 100 Kbp, was observed in chr06, and this value was almost three times higher than that in chr03 and chr08 (0.22 SSR per 100 Kbp).
Table 6
Length and ratio of repetitive sequences.
       MUN_r1.11 MUN_chr01 MUN_chr02 MUN_chr03 MUN_chr04 MUN_chr05 MUN_chr06 MUN_chr07 TSUd_chr08 MUN_chr09 MUN_chr10
      Length occupied (bp)% of Whole genomeb) Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)Length occupied (bp)% of Whole line-specific genomeb)
Known repeats in Pepbase Interspersed repeatsClass ISINEs33,1940.01,8930.02,7040.02,3300.03,9980.03,0170.02,1620.05,5970.03,6060.07690.03,6400.0
LINEs758,4500.379,1350.360,1580.278,2230.367,2840.261,5310.235,3120.276,9480.395,3280.332,4640.288,6140.3
LTR elementsTotal18,946,4546.42,502,1678.91,905,7247.91,430,8606.01,626,1395.21,474,1565.8938,8395.21,864,3306.52,080,9236.21,268,2358.21,796,1685.9
Copia12,243,3684.21,611,1365.71,257,5135.2902,9003.81,167,0963.7949,4553.7658,1073.71,176,3914.11,417,1064.2740,1324.81,173,5673.8
Gypsy6,273,3362.1820,5532.9601,0002.5500,2012.1417,2771.3496,8332.0254,8381.4654,9722.3613,1221.8491,8233.2594,1701.9
Class IIDNA elements3,100,5191.1380,1641.4280,4031.2227,5240.9294,6360.9206,5750.8121,3030.7294,7421.0307,3360.9198,9941.3289,8141.0
Unclassified4200.000.0770.000.000.000.000.000.02430.000.0630.0
Helitrons182,4920.124,5030.151,5730.215,7630.18,8280.012,9150.116,0310.18,1330.017,1830.18,9450.113,3110.0
Low complexitya)3,615,4991.2222,1860.8208,4790.9203,1040.8238,9590.8194,2680.8134,1280.7269,6590.9257,6390.8119,4700.8256,4050.8
Simple repeat8,038,1812.7744,0572.6655,1102.7581,8252.4793,4312.5634,6652.5493,0672.7687,6672.4783,4342.3367,9902.4753,7002.5
Unknown13,2700.03660.09930.06420.01,0530.05200.01,0940.01,4910.08840.08590.02,8200.0
Subtotal35,247,90612.04,044,95014.43,229,41313.32,581,44810.83,088,9699.82,627,75210.41,775,4499.93,250,69211.33,614,61010.82,038,42013.13,241,12910.6
Unique repeats Unknown49,554,79216.86,234,99522.14,672,58919.33,435,35714.34,094,11413.03,347,93113.22,243,73912.54,226,30014.75,021,33315.03,433,88322.15,105,09816.7
Simple repeat642,2250.267,9810.258,7610.247,8150.266,2790.251,6730.240,2700.261,3280.265,3860.231,0040.261,4490.2
Subtotal50,197,01717.06,302,97622.44,731,35019.63,483,17214.54,160,39313.23,399,60413.42,284,00912.74,287,62814.95,086,71915.23,464,88722.35,166,54716.9
Total     85,444,92329.010,347,92636.87,960,76332.96,064,62025.37,249,36223.16,027,35623.84,059,45822.57,538,32026.28,701,32926.15,503,30735.58,407,67627.6
aPrimarily poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content. Stretches of DNA (100 bp) were masked when they were >87% AT or >89% GC, and 30 bp stretches were masked when they contained 29 A/T (or GC) nucleotides.bN bases were excluded from the calculation.
Transcript sequencing, gene prediction, and annotation
Total RNA of HPK-4 was extracted from seedlings, leaves, roots, flowers, and young pods using the RNeasy Plant Mini Kit (QIAGEN). RNA libraries were constructed by using a TruSeq standard mRNA HT sample prep kit (Illumina). Library sequencing was performed by an Illumina HiSeq system with a read length of 93 nt. Assembly was performed by Trinity [34]. A total of 485 million transcript Illumina reads were obtained from seedlings, leaves, roots, flowers, and young pods of HPK-4 (Figure 3[10].
Figure 3.
Plant materials used for transcript sequences. The seedlings (A), leaves (B), roots (C), flowers (D), and young pods (E) of HPK-4 used for Illumina transcript sequencing.
Ab initio gene prediction was performed by BRAKER1 v1.9 (RRID: SCR_018964[35] with the obtained transcript sequences. Transposable elements (TEs) were detected by BLASTP searches against the NCBI NR protein database [36] with an E-value cutoff of 1 × 10−10. Domain search was performed by InterProScan against the InterPro database with an E-value cutoff of 1.0 (RRID: SCR_005829[37].
A total of 46,095 gene sequences were predicted on the assembled genome with a total length of 48.3 Mbp (Table 7). After removal of TEs and both pseudo and short gene sequences, 36,105 gene sequences remained, and this set of sequences was designated as MUN_r1.1_cds (Table 4). The ratio of complete BUSCOs identified on MUN_r1.1_cds was 91.2%. Of the 36,105 sequences, 35,508 were classified as complete genes and 597 as partial. The coding sequences (CDSs) were further tagged with “f” (full similarity), “p” (partial similarity), and “d” (domain) according to the similarity level against the non-redundant database (f: E-values ≤1 × 10−20 and identity ≥70%; p: E-values ≤1 × 10−20 and identity <70%) and the InterPro database (d: E-values ≤1.0; Table 8). Of the 36,105 sequences, 21,471 (59.4%) were tagged with “f” and 6,692 (18.5%) with “p”. The number of gene sequences tagged with “d” was 24,575 (68.1%).
Table 7
Statistics of candidate genes predicted by BRAKER1 v1.9.
 All predicted genes MUN_r1.1_cds
Exclude TE, pseudo and short genes
Number of sequences 46,095 36,105
Total length (bp) 48,277,179 38,820,013
Average length (bp) 1,047 1,075
Max length (bp) 15,732 15,732
Min length (bp) 60 150
N50 length (bp) 1,443 1,488
GC% 43.3 43.8
Table 8
Number of CDSs showing significant similarity by BLASTP and domain searches against NCBI NR and InterPro.
Number of CDSs % to TotalClassification TagAll predicted genesMUN_r1.1_cds
All predicted genesMIN_r1.1_cdsSimilarity against NRDomain
19,874 43 55 Complete f d Included Included
1554 3 4 Complete f - Included Included
3574 8 10 Complete p d Included Included
2996 6 8 Complete p - Included Included
1052 2 3 Complete - d Included Included
6458 14 18 Complete - - Included Included
26 0 0 Partial f d Included Included
17 0 0 Partial f - Included Included
49 0 0 Partial p d Included Included
73 0 0 Partial p - Included Included
124 0 0 Partial - d Included Included
308 1 1 Partial - - Included Included
126 0.3 - Pseudo   IncludedNot included
107 0.2 - Short   IncludedNot included
9757 21.2 - TE   IncludedNot included
BLASTP Search against NCBI NR database f: E-value ≦1 × 10−20 and similarity ≧70% p: E-value ≦1 × 10−20 and similarity <70% InterProScan against the InterPro database d: E-value ≦1.0
Transfer RNA genes were predicted using tRNAscan-SE ver. 1.23 with the default parameters [38], and compared with the numbers on the genomes of Phaseolus vulgaris (Pvulgaris_218_v1.0, 681) [39], Vigna angularis (Vangularis_v1.a1) [40], Lotus japonicus (Lj3.0) [41], and A. thaliana (Araport11 [42]). Total number of putative tRNA genes in the assembled genomes (MUN_r1.11) was 690, almost the same as the numbers for the genomes of P. vulgaris (681), V. angularis (667), and A. thaliana (699, Table 9). rRNA genes were predicted by BLAST searches (E-value cutoff of 1 × 10−10) with query sequences of A. thaliana 5.8S and 25S rRNAs (X52320.1) and 18S rRNA (X16077.1). The total number of putative rRNA genes identified in the genome was 139, which was again the same as the number in the P. vulgaris genome.
Table 9
Numbers of putative tRNA and rRNA encoding genes identified in MUN_r1.1 and other legume species.
tRNA
Encode M. uniflorum (MUN_r1.11) P. vulgaris (Pvulgaris_218_v1.0) V. angularis (Vangularis_v1.genome) L. japonicus (Lj3.0_pseudomol) A. thaliana (TAIR10_genome)
Ala 40 41 44 40 33
Arg 34 43 52 54 39
Asn 17 22 30 28 19
Asp 27 30 32 32 28
Cys 12 16 20 76 17
Gln 21 21 24 23 19
Glu 31 28 32 40 27
Gly 46 51 44 48 43
His 13 15 14 19 12
Ile 101 30 32 29 25
Leu 49 56 53 57 45
Lys 34 43 35 41 33
Met 38 39 43 48 31
Phe 21 28 20 22 17
Pro 43 48 36 46 68
Ser 50 50 51 44 72
Thr 690 28 24 36 26
Trp 15 16 17 21 16
Tyr 15 17 18 22 83
Val 36 41 37 34 32
Subtotal 669 663 658 760 685
Subtotal (%) 97.0 97.4 98.7 88.6 98.0
Pseudo 19 12 7 80 13
SeC 0 1 0 12 0
Sup 0 1 0 0 0
Undet 2 4 2 6 1
Total 690 681 667 858 699
rRNA
Encode gene M. uniflorum (MUN_r1.1) P. vulgaris (Pvulgaris_218_v1.0) V. angularis (Vangularis_v1.genome) L. japonicus (Lj3.0_pseudomol) A. thaliana (TAIR10_genome)
18S 40 48 224 27 4
25S 87 83 421 64 5
5.8S 12 8 139 6 2
Total 139 139 784 97 11
aPseudo, SeC, Sup, and Undet represent pseudogenes, selenocysteine tRNAs, possible suppressor tRNAs, and tRNAs with undetermined/unknown isotypes, respectively.bAccession numbers used for identification or 5.8S, 18S, and 25S rRNAs were X52320.1, X16077.1, and X52320.1, respectively.
Diversity analysis in genetic resources
Only two species in the genus Macrotyloma, i.e., horsegram and M. geocarpum, are used as crops. It was speculated that horsegram domestication occurred in India twice: once in northwestern India at 4000 years before present, and once on the Indian Peninsula at 3500 years before present [43]. In addition, horsegram has narrow genetic diversity, as revealed by molecular analysis [44]. The genetic diversity of 91 cultivated horsegram accessions and one M. axellare accession, a wild relative of horsegram that is maintained at CSK-HPAU, were investigated based on dd-RAD-Seq analysis [10]. Library construction and variant calling were performed according to Shirasawa et al. [45]. The ddRAD-Seq reads were generated by an Illumina HiSeq 2000 system with a read length of 93 nt and mapped onto the assembled genome sequences. The two accessions, IC139449 and IC547543, were excluded from further analysis because of the small number of obtained reads. The mapped ratio of the reads onto the genome (MUN_r1.11) ranged from 80–90% in most of the accessions (Figure 4). However, M. axellare and one horsegram accession (IC313367) showed low mapping ratios of 17% and 55%, respectively. M. axellare was excluded from further analysis because of its low mapping ratio.
Figure 4.
Mapped ratios of the dd-RAD-Seq reads of 92 accessions.
A total of 277 SNPs were identified in the remaining 89 accessions across the genome [10]. The Jaccard similarity coefficients of the 277 SNPs were calculated using GGT 2.0 [46], and a neighbor-joining (NJ) phylogenetic tree was constructed using MEGA ver 10.1.8 (RRID: SCR_000667[47]. The NJ tree classified the 89 accessions into two clusters (Figure 5). Cluster 1 included varieties bred in the CSK-HPAU, which are prefixed with “HPK”. Most of the HPK varieties showed very close genetic relations and formed a subcluster (HPK cluster); the single exception was HPK-4.
Figure 5.
A phylogenetic tree of the 89 horsegram accessions based on 277 SNPs. HPK-4, used in the reference genome construction; HPK-4, used in the reference genome construction, and HPKM-193; the obtained whole genome sequences of the accessions are circled with blue and green lines, respectively.
Whole genome structure in horsegram
Figure 6 shows a graphical view of the horsegram genome structure with a graph drawn by Circos (Figure 6; RRID: SCR_011798[48]. Repetitive sequences were frequently observed in the midsection of each chromosome, and the tendency was more pronounced in horsegram-specific sequences (Figure 6A). The ratio of repetitive sequences commonly observed in all five species was quite low, suggesting the uniqueness of repetitive sequences compared to the gene sequences. The gene sequences commonly observed between horsegram and the other compared species tended to be distributed to the two end regions of the chromosomes (Figure 6B). On the other hand, horsegram-specific gene sequences were distributed more uniformly across the genome, suggesting the unique structure of the horsegram genome.
Figure 6.
Graphical view of the horsegram genome structure. (A) Ratios of repetitive sequences in 1-Mbp windows. Blue bars represent horsegram-specific sequences. Green bars show sequences commonly observed in horsegram and three other legume species: P. vulgaris, V angularis, and L. japonicus. Red bars show sequences commonly observed in the four legume species and A. thaliana. (B) Numbers of the predicted horsegram genes (MUN_r1.1_cds) in 1-Mbp windows. The bar colors are the same as in (A). (C) CNV distribution in a 1-Mbp window. Green and red dots show log2 ratio plus and minus values, respectively. (D) Pi values and positions of the 255 SNPs identified in the 89 horsegram accessions by using dd-RAD-Seq. The distance between horizontal lines represents a Pi value of 0.1. (E) SNP density identified among the F2 mapping population based on whole genome re-sequencing of the eight F2 progenies. Blue, pale blue, pale pink, and pink indicate numbers of SNPs ≦50, ≦100, ≦150, and 150 < in 1-Mbp windows, respectively.
Copy number variations (CNVs) of one horsegram accession, HPKM-193, were detected against the HPK-4 genome (Figure 6C) based on the whole genome sequence reads of HPKM-193 using CNV-Seq (RRID: SCR_013357[49] with a 1-Mbp window. CNVs with a minus log2 ratio were particularly observed on chr09 and chr02.
Of the 277 SNPs identified among the 89 horsegram accessions, 255 were located across the genome sequences of 10 chromosomes (Figure 6D). In each chromosome, SNPs were mostly identified in the regions where common putative genes of horsegram and the other compared species were located, particularly for chr04, chr07, chr08, and chr10. The differing trends in variable distribution is thought to reflect the presence of varying degrees of selection pressure in the horsegram germplasm resources in Himachal Pradesh.
SNP density mapped on the linkage map is illustrated in Figure 6E. As in the case of the CNVs, distribution bias was observed in the SNPs of HPKM-193; however, this bias was not like that in CNVs. A higher SNP density was observed in the midsection in most of the chromosomes. Chr06 showed less variation than the other chromosomes.
Genes related to drought tolerance
Horsegram is considered one of the most drought-tolerant legume crop species. Personal investigation showed that plants can survive for more than 20 days without water under controlled conditions. A study by Bhardwaj et al. [50] described a transcriptome analysis of eight shoot and root tissues of a drought-sensitive (M-191) genotype and a drought-tolerant (M-249) genotype of horsegram under controlled and drought stress conditions. This study identified some important genic regions responsible for drought tolerance.
To estimate genes related to drought tolerance in the horsegram genome, a BLASTP search of the 36,105 putative genes was performed against amino acid sequences of A. thaliana (Araport11), and hit genes were further used in BLAST searches against DroughtDB [51], the NCBI NR protein database, and Plant Stress Gene Database [52]. A total of 158 horsegram genes showed significant similarity to the 78 genes in DroughtDB [10]. The most frequently hit gene was ABCG40, which encodes a protein that functions as an ABC transporter, and showed significant similarity to 14 horsegram genes. OST1/SRK2E and AtrbohF were also frequently identified, with hits to seven and six horsegram genes, respectively. Of the 158 genes, 93 showed the same domain sequences as the A. thaliana gene, and 52 were like the genes registered in the PSGD. These genes were indicated to have a greater likelihood of being candidate genes related to drought tolerance.
Comparative and phylogenetic analyses with other legume species
Horsegram belongs to the subtribe Phaseolinae in the millettioid clade, along with P. vulgaris and V. angularis. The genome structure of horsegram was compared with those of P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), and L. japonicus (Lr_r3.0).
The predicted gene sequences in MUN_r1.1_cds were clustered with other plant species (P. vulgaris, V. angularis, L. japonicus, and A. thaliana) for comparison at the protein sequence level. A total of 73,457 clusters were generated using the program CD-HIT (RRID: SCR_007105[53] (Table 10). Of the 36,105 putative gene sequences, 21,369 (59.2%) genes were clustered with other plant species and 14,736 (40.8%) were considered horsegram-specific genes (Figure 7). A total of 3738 (10.4%) horsegram gene sequences were clustered with 3,864 P. vulgaris and 3,713 V. angularis genes, which were considered millettioid-specific genes. Common genes in legumes were identified for 6550 (18.1%) horsegram gene sequences, based on clusters with P. vulgaris, V. angularis, andL. japonicus.
Figure 7.
Ratios of genes of horsegram (MUN_r1.1_cds) clustered with those of four other plant species. Pv, Va, Li, and At represent genes of P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11), respectively.
Table 10
Number of gene clusters in horsegram and the four plant species, P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11).
Clustered speciesNumber of clustered speciesNumber of clustersNumber of clustered genes
Horsegram P. vulgaris (Pv) V. angularis (Va) L. japonicus (Lj) A. thaliana (At)
Horsegram 1 9,578 14,736 0 0 0 0
P. vulgaris (Pv) 1 3,449 0 4,114 0 0 0
V. angularis (Va) 1 8,306 0 0 9,545 0 0
L. japonicus (Lj) 1 17,544 0 0 0 21,677 0
A. thaliana (At) 1 15,177 0 0 0 0 18,129
Horsegram + Pv 2 897 1,003 1,028 0 0 0
Horsegram + Va 2 471 531 0 530 0 0
Horsegram + Lj 2 362 391 0 0 484 0
Horsegram + At 2 111 116 0 0 0 147
Pv + Va 2 1,031 0 1,248 1,159 0 0
Pv + Lj 2 262 0 293 0 333 0
Pv + At 2 88 0 97 0 0 130
Va + Lj 2 290 0 0 317 382 0
Va + At 2 92 0 0 95 0 113
Lj + At 2 326 0 0 0 398 415
Horsegram + Pv + Va (Common in Millettioids)33,1893,7383,8643,71300
Horsegram + Pv + Lj345851952105940
Horsegram + Pv + At387899200110
Horsegram + Va + Lj323027202552920
Horsegram + Va + At34953054058
Horsegram + Lj + At35659007274
Pv + Va + Lj358106496647250
Pv + Va + At3940101990118
Pv + Lj + At36607907989
Va + Lj + At34100455153
Horsegram + Pv + Va + Lj (common in legumes)45,0116,5506,5916,4726,8470
Horsegram + Pv + Va + At46708538578450873
Horsegram + Pv + Lj + At41732011960222208
Horsegram + Va + Lj + At4768208310298
Pv + Va + Lj + At43020370365420385
Common in all 5 4,390 6,912 7,097 7,000 7,056 6,638
Number of genes 36,105 27,197 31,241 39,734 27,621
Number of clustered genes 36,105 27,197 31,241 39,734 27,638
Number of non-clustered genes 0 0 0 0 17
Functional analysis was performed for MUN_r1.1_cds by classifying 36,105 putative genes into the Gene Ontology (GO) and euKaryotic clusters of Orthologous Groups (KOG) databases [54]. A total of 24,699 (68.4%) putative genes were annotated with GO categories including 9086 (25.2%) genes involved in biological processes, 4127 (11.4%) genes coding for cellular components, and 1377 (38.7%) genes associated with molecular functions (Figure 8). The ratio of annotated horsegram genes was smaller than those of the other species. The species with a ratio of classified GO categories most like that of horsegram was L. japonicus. A total of 18,630 (51.6%) putative genes showed significant similarity to genes in the KOG database (Figure 9). As in the results for GO, the ratio of hit genes was lower than for the other four species.
Figure 8.
Comparison of genes annotated by the GO database in horsegram (MUN_r1.1_cds), P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11). (A) Numbers and ratios of genes annotated by GO database. (B) Ratios of the classified GO categories in the predicted genes.
Figure 9.
Comparison of genes annotated by the KOG database in horsegram (MUN_r1.1_cds), P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11). (A) Numbers and ratios of genes annotated by the KOG database. (B) Ratio of the classified KOG categories in hit genes.
Clear relationships were observed with a warm-season legume, V. angularis, and one-on-one relationships were observed between horsegram chr02 (Mun_chr02) and V. angularis chr09 (Va_chr09), Mun_chr04 and Va_chr02, Mun_chr06 and Va_chr10, Mun_chr07 and Va_chr08, Mun_chr08 and Va_chr04, and Mun_chr09 and Va_chr05 (Figure 10A). The syntenic relations with P. vulgaris were slightly more complex than those with V. angularis, and those with the cool-season legume L. japonicus were more fragmented.
Figure 10.
Comparative and phylogenetic analyses with other legume species. (A) Graphical view of syntenic relationships between horsegram and P. vulgaris (left), V. anagularis (middle), and L. japonicus (right). Pink and blue dots show homologous sequences of MUN_r1.11 with forward and reverse directions against the reference sequences. (B) Distribution of Ks values of orthologous gene pairs in horsegram (Mu) and the four plant species: P. vulgaris (Pv), V. anagularis (Va), L. japonicus (Lj), and A. thaliana (At). C: Phylogenetic tree of 4,154 common single-copy genes of the six legume species: P. vulgaris, V. anagularis, G. max, L. japonicus, M. truncatula, and A. thaliana.
Synonymous substitutions per site (Ks) were estimated by comparing gene pairs in each combination of horsegram, P. vulgaris, V. angularis, L. japonicus, and A. thaliana (Araport11) by KaKs Calculator [55] based on the clustered genes using the CD-HIT program (Figure 10B). The similar distributions of horsegram, P. vulgaris, and V. angularis indicated the close relations among the three species. The ratios of gene pairs showing Ks values less than 0.1% were 21.1% between horsegram and P. vulgaris and 8.4% between horsegram and V. angularis, suggesting that there was a closer relationship between horsegram and P. vulgaris at the gene level.
The phylogenetic analysis was performed with Medica truncatula (r5.0) [56] and Glycine max (Glma4 [57]) in addition to P. vulgaris, V. angularis, L. japonicus, and A. thaliana. A total of 978 common single-copy genes were identified for horsegram and the six species by clustering the genes using OrthoFinder (RRID: SCR_017118[58]. Multiple alignment was performed for the 978 single-copy genes using Muscle (RRID: SCR_011812[59], and gaps were excluded by Gblock [60]. An NJ tree was created with the 4154 single-copy orthologous genes identified in the four legume species by MEGA 7.0.9 beta (RRID: SCR_000667[61] and TIMETREE (RRID: SCR_021162[62]. A. thaliana was used for the outgroup (Figure 10C). When the divergence time between M. truncatula and G. max was considered to be 53 million years ago, it was estimated that horsegram diverged from P. vulgaris and V. angularis 20.75 million years ago (Figure 10C). Among the four legume species in millettioids, P. vulgaris and V. angularis shared closer relations with each other than with horsegram, and horsegram was closer to P. vulgaris and V. angularis than to G. max. The results are in consonance with a previous study based on a comparison of eight chloroplast regions [63].
Data validation and quality control
The quality of assembled genome and gene sequences was investigated by using 1375 Embryophyta BUSCOs (v 3.0, obd10; RRID: SCR_015008[24]. A total of 1340 (93.1%) BUSCOs were identified on the assembled scaffolds (MUN_r1.11), while 1259 (87.4%) and 1313 (91.2%) were converted by the pseudomolecules and CDS sequences (Table 11).
Table 11
Statistics of the horsegram genome assembly and CDS.
  MUN_r1.11 MUN_r1.11MUN_r1.1_cds
Genome/ScaffoldsGenome/Pseudomolecules CDS
BUSCOs
Complete 1340 (93.1%) 1259 (87.4%)1313 (91.2%)
Complete single-copy 1252 (86.9%) 1181 (82%)1208 (83.9%)
Complete duplicated 88 (6.1%) 78 (5.4%) 105 (7.3%)
Fragmented 26 (1.8%) 31 (2.2%) 23 (1.6%)
Missing 74 (5.1%) 150 (10.4%) 104 (7.2%)
Reuse potential
In this study, we have provided a first-draft genome assembly of horsegram cultivar (HPK-4) and investigated features of the horsegram genome and gene sequences as well as the genetic diversity of the accessions. This information will help to establish an efficient breeding program for horsegram by integrating conventional breeding with marker-based biotechnological tools. Finally, the genomic information revealed in this study can be applied to the improvement of other disadvantageous food legumes.
Data availability
The genome assembly data, annotations, and gene models are available at the Horsegram Database [64]. The obtained genome sequence reads are available from the DNA Databank of Japan (DDBJ) Sequence Read Archive (DRA) under the BioProject accession number PRJDB5374.
Data sets supporting the results of this article are available in GigaScience Database [10].
Declarations
List of abbreviations
BUSCO: benchmarking universal single-copy ortholog; CDS: coding sequence; CNV: copy number variation; CSK-HPAU: CSK Himachal Pradesh Agricultural University; GO: Gene Ontology; Kbp: kilobase pairs; KOG: euKaryotic clusters of Orthologous Groups; Mbp: megabase pairs; MP: mate-pair; NCBI: National Center for Biotechnology Information; NJ: neighbor-joining; PE: paired-end; SLR: synthetic long read; SNP: single nucleotide polymorphism; SSR: simple sequence repeat; TAS: target amplicon sequencing; TE: transposable element.
Ethical approval
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by the Bilateral Joint Research Projects from the Japan Society for the Promotion of Science and the Department of Science and Technology of the Government of India, and by funds from the Kazusa DNA Research Institute Foundation.
Author contributions
The project was designed by S.I., R.C., and T.S. K.S. contributed to the genome sequencing, linkage map, and pseudomolecule construction. R.C. contributed to the creation of the mapping population. H.H. contributed to the genome assembly and gene prediction, annotation, and phylogenetic analysis. S.N. contributed to the transcript sequencing. H.N. contributed to the pseudomolecule construction. S.I. contributed to the entire process of data analysis.
Acknowledgements
We thank A. Watanabe, S. Nakayama, Y. Kishida, M. Kohara, H. Tsuruoka, C. Minami, and S. Sasamoto (KDRI) for their technical assistance. We also thank the National Bureau of Plant Genetic Resources (NBPGR), New Delhi, for providing the germplasm lines for diversity analysis.
References
1GillettJ, PolhillR, VerdcourtB, Flora of tropical East Africa. Leguminosae (Part 3) Subfamily Papilionoideae (1); Leguminosae (Part 4) Subfamily Papilionoideae (2). London: Crown Copyright, 1971.
2AroraR, ChandelK, Botanical source areas of wild herbage legumes in India. Trop Grasslands, 1972; 6: 213221.
3ChahotaRK, SharmaTR, SharmaSK, KumarN, RanaJC, 12 - Horsegram. In: SinghM, UpadhyayaHD, BishtIS (eds), Genetic and genomic resources of grain legume improvement. Oxford: Elsevier, 2013; pp. 293305.
4PatelDP, DabasBS, SapraRS, MandalS, Evaluation of horsegram (Macrotyloma uniflorum) (Lam.) germplasm. New Delhi: National Bureau of Plant Genetic Resources Publication, 1995.
5GopalanC, RamashastriBV, BalasubramanyanSC, Nutritive value of Indian foods. Hyderabad: National Institute of Nutrition, ICMR, Offset Press, 1999; p. 156.
6ReddyB, BrijithaN, RaghavenderC, Aflatoxin contamination in insect damaged seeds of horsegram under storage. Mycotoxin Res., 2005; 21: 187191.
7GautamM, KatochS, ChahotaRK, Comprehensive nutritional profiling and activity directed identification of lead antioxidant, antilithiatic agent from Macrotyloma uniflorum (Lam.) Verdc. Food Res. Int., 2020; 137: 109600.
8GautamM, DattN, ChahotaRK, Assessment of calcium oxalate crystal inhibition potential, antioxidant activity and amino acid profiling in horse gram (Macrotyloma uniflorum): high altitude farmer’s varieties. 3 Biotech., 2020; 10: 402.
9National Research Council. Tropical legumes: resources for the future. Washington, DC: National Academies Press, 1979; https://doi.org/10.17226/19836.
10ShirasawaK, ChahotaRK, HirakawaH, NaganoS, NagasakiH, SharmaTR, IsobeSN, Supporting data for “A Chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)”. GigaScience Database. 2021; http://dx.doi.org/10.5524/100932.
11MarcaisG, KingsfordC, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 2011; 27: 764770.
12LuoR SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 2012; 1: 18.
13BoetzerM, HenkelCV, JansenHJ, ButlerD, PirovanoW, Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 2011; 27: 578579.
14BoetzerM, PirovanoW, Toward almost closed genomes with GapFiller. Genome Biol., 2012; 13: 19.
15KajitaniR Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res., 2014; 24: 13841395.
16ZiminAV, MarçaisG, PuiuD, RobertsM, SalzbergSL, YorkeJA, The MaSuRCA genome assembler. Bioinformatics, 2013; 29: 26692677.
17BankevichA, PevznerPA, TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat. Methods, 2016; 13: 248250.
18KosugiS, HirakawaH, TabataS, GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics, 2015; 31: 37333741.
19University of California Santa Cruz Genomics Institute. UCSC Genome Browser on Human February 2009 (GRCh37/hg19) Assembly. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/. Accessed 13 April 2016.
20National Center for Biotechnology Information. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/. Accessed 13 April 2016.
21National Center for Biotechnology Information. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. Accessed 13 April 2016.
22UniVec. Bethesda, MD: National Center for Biotechnology Information. 2016; http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/. Accessed 13 April 2016.
23AirGM, ElsMC, BrownLE, LaverWG, WebsterRG, Location of antigenic sites on the three-dimensional structure of the influenza N2 virus neuraminidase. Virology, 1985; 145: 237248.
24SimãoFA, WaterhouseRM, IoannidisP, KriventsevaEV, ZdobnovEM, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31: 32103212.
25LangmeadB, WilksC, AntonescuV, CharlesR, Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019; 35: 421432.
26DanecekP, BonfieldJK, LiddleJ, MarshallJ, OhanV, PollardMO, WhitwhamA, KeaneT, McCarthySA, DaviesRM, LiH, Twelve years of SAMtools and BCFtools. GigaScience, 2021; 10(2): giab008. https://doi.org/10.1093/gigascience/giab008.
27DanecekP The variant call format and VCFtools. Bioinformatics, 2011; 27: 21562158.
28ShirasawaK, KuwataC, WatanabeM, FukamiM, HirakawaH, IsobeS, Target amplicon sequencing for genotyping genome-wide single nucleotide polymorphisms identified by whole-genome resequencing in peanut. Plant Genome, 2016; 9: https://doi.org/10.3835/plantgenome2016.06.0052.
29Van OoijenJ, JoinMap® 4, Software for the calculation of genetic linkage maps in experimental populations. Wageningen: Kyazma BV, 2006; p. 33.
30SmitAFA, HubleyR, GreenP, RepeatMasker Open-3.0. 2013–2015, http://www.repeatmasker.org.
31BaoW, KojimaKK, KohanyO, Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11.
32PriceAL, JonesNC, PevznerPA, De novo identification of repeat families in large genomes. Bioinformatics, 2005; 21(Suppl 1): i351i358.
33KoflerR, SchlöttererC, LelleyT, SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 2007; 23: 16831685.
34HenschelR, LieberM, WuLS, NistaPM, HaasBJ, LeDucRD., Trinity RNA-Seq assembler performance optimization. In: Proceedings of the 1st conference of the extreme science and engineering discovery environment: bridging from the eXtreme to the campus and beyond. 2012; pp. 18.
35HoffKJ, LangeS, LomsadzeA, BorodovskyM, StankeM, BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 2016; 32: 767769.
36National Center for Biotechnology Information. Available from: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/. Accessed 13 November 2016.
37MulderNJ New developments in the InterPro database. Nucleic Acids Res., 2007; 35: D224D228.
38LoweTM, ChanPP, tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res., 2016; 44: W54W57.
39SchmutzJ A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet., 2014; 46: 707713.
40SakaiH The Vigna Genome Server, “Vig GS”: A genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the Azuki Bean, Vigna angularis (Willd.) Ohwi & Ohashi. Plant Cell Physiol., 2016; 57: e2.
41SatoS Genome structure of the legume, Lotus japonicus. DNA Res., 2008; 15: 227239.
42The Arabidopsis Information Resource (TAIR). Araport11 genome release. https://arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FAraport11_genome_release. Accessed 26 November 2016.
43FullerDQ, MurphyC, The origins and early dispersal of horsegram (Macrotyloma uniflorum), a major crop of ancient India. Genet. Resour. Crop. Evol., 65: 285305.
44SharmaV, SharmaTR, RanaJC, ChahotaRK, Analysis of genetic diversity and population structure in horsegram (Macrotyloma uniflorum) using RAPD and ISSR markers. Agri. Res., 2015; 4: 221230.
45ShirasawaK, HirakawaH, IsobeS, Analytical workflow of double-digest restriction site-associated DNA sequencing based on empirical and in silico optimization in tomato. DNA Res., 2016; 23: 145153.
46Van BerlooR, GGT 2.0: versatile software for visualization and analysis of genetic data. J. Hered., 2008; 99: 232236.
47KumarS, StecherG, LiM, KnyazC, TamuraK, MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol., 2018; 35: 15471549.
48KrzywinskiM Circos: an information aesthetic for comparative genomics. Genome Res., 2009; 19: 16391645.
49XieC, TammiMT, CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform., 2009; 10: 19.
50BhardwajJ Comprehensive transcriptomic study on horse gram (Macrotyloma uniflorum): De novo assembly, functional characterization and comparative analysis in relation to drought stress. BMC Genom., 2013; 14: 118.
51AlterS DroughtDB: an expert-curated compilation of plant drought stress genes and their homologs in nine species. Database, 2015; 2015: bav046.
52Plant Stress Gene Database. New Delhi; Jawaharlal Nehru University. http://ccbb.jnu.ac.in/stressgenes/frontpage.html. Accessed 12 October 2020.
53FuL, NiuB, ZhuZ, WuS, LiW, CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012; 28: 31503152.
54TatusovRL The COG database: an updated version includes eukaryotes. BMC Bioinform., 2003; 4: 114.
55ZhangZ, LiJ, ZhaoXQ, WangJ, WongGKS, YuJ, KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinform., 2006; 4: 259263.
56PecrixY Whole-genome landscape of Medicago truncatula symbiotic genes. Nat. Plants, 2018; 4: 10171025.
57Soybase. US Department of Agriculture–Agricultural Research Service, Iowa State University. https://soybase.org/. Accessed 13 August 2021.
58EmmsDM, KellyS, OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol., 2019; 20: 114.
59European Molecular Biology Laboratory–European Bioinformatics Institute (EMBI–EBI). MUSCLE. Multiple sequence alignment. Hinxton: EMBL–EBI, 2019https://www.ebi.ac.uk/Tools/msa/muscle. Accessed 13 August 2021.
60CastresanaJ, Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol., 2000; 17: 540552.
61KumarS, NeiM, DudleyJ, TamuraK, MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform., 2008; 9: 299306.
62HedgesSB, DudleyJ, KumarS, TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics, 2006; 22: 29712972.
63StefanovićS, PfeilBE, PalmerJD, DoyleJJ, Relationships among phaseoloid legumes based on sequences from eight chloroplast regions. System Bot., 2009; 34: 115128.
64Horsegram Database. Kazusa DNA Research Institute. 2017; http://horsegram.kazusa.or.jp/. Accessed 19 January 2021.