Main

The origin of SARS-CoV-2, as well as its mode of introduction into the human population, are unknown at present. Since its emergence, numerous animal species have been studied to identify possible reservoirs and/or intermediate hosts of the virus, including a large diversity of insectivorous bats of the genus Rhinolophus. Despite the recent report of various SARS-CoV-2-related viruses in R. shameli (isolated in Cambodia in 201013), R. pusillus and R. malayanus (China, 2020 and 2019, respectively2), R. acuminatus (Thailand, 20203) and R. cornutus (Japan, 20134), the closest SARS-CoV-2 bat-borne genome still remains the one from R. affinis, RaTG13 (China, 20135,6), with 96.1% identity at the whole-genome level. Several studies also suggested the involvement of pangolin coronaviruses in the emergence of SARS-CoV-2 (refs. 7,8,9). Since its appearance in humans, SARS-CoV-2 has evolved through sporadic mutations and recombination events14, some of which correspond to gains in fitness allowing the virus to spread more widely, or to escape neutralizing antibodies15.

To decipher the origin of SARS-CoV-2, it is therefore essential to ascertain the diversity of animal coronaviruses, and more specifically, that of bat coronaviruses. Although the identification of SARS-CoV-2 in bats is a main goal, a more realistic objective is to identify the sequences that contribute to its mosaicism. The spike sequence seems essential, as it determines the binding affinity and accessibility of the receptor-binding domain (RBD) to the cellular ACE2 receptor and is therefore responsible for host range10,11,12. The closest related bat strain identified so far (RaTG13) has a low RBD sequence similarity to SARS-CoV-2, and with only 11/17 hACE2 contact amino acid residues conserved with SARS-CoV-2, its affinity for hACE2 is very limited16. Moreover, SARS-CoV-2 poorly infects bats and bat cells tested so far17. In addition, no bat SARS-CoV-2-like virus has been shown to use hACE2 to efficiently enter human cells, and none has the furin cleavage site that is associated with an increased pathogenicity in humans18. The SARS-CoV-2 RBD binds to R. macrotis ACE2 with a lower affinity than to hACE2 (ref. 19). An essential piece of information—finding bat viruses with an RBD motif genetically close to that of SARS-CoV-2 and capable of binding to hACE2 with high affinity—is therefore missing.

We speculated that this type of virus could be identified in bats living in the limestone karstic terrain common to China, Laos and Vietnam in the Indochinese peninsula. Here we report the presence of sarbecoviruses close to SARS-CoV-2 whose RBDs differ from that of SARS-CoV-2 by only one or two contact residues, strongly bind to the hACE2 protein and mediate hACE2-dependent entry and replication into human cells. Despite the absence of the furin cleavage site, these viruses may have contributed to the origin of SARS-CoV-2 and may intrinsically pose a future risk of direct transmission to humans.

Diversity of bat and coronavirus species

A total of 645 bats belonging to 6 families and 46 species were captured (Supplementary Table 1). Two hundred and forty-seven blood samples, 608 saliva, 539 anal/faecal and 157 urine swabs were collected from bats in the northern part of Laos (Supplementary Table 2). We first screened all 539 faecal samples through a pan-coronavirus nested RT-PCR analysis20. Overall, 24 individuals of 10 species were positive, and 1 individual (BANAL-27) was concomitantly infected by an alphacoronavirus and a betacoronavirus (Supplementary Table 3). BLAST analysis of amplicons identified alphacoronavirus sequences of the Decacovirus, Pedacovirus and Rhinacovirus subgenera and betacoronavirus sequences of the Nobecovirus and Sarbecovirus subgenera. Sequences of the Sarbecovirus subgenus were all identified from Rhinolophus individuals belonging to three different species (R. malayanus, R. marshalli and R. pusillus). Positive individuals were trapped in three different districts, and those infected with a sarbecovirus were all from the Fueng district in Vientiane province (Fig. 1a, site 1).

Fig. 1: Genomic description of bat-borne sarbecoviruses identified in Laos.
figure 1

a, Map of sampling sites. All BANAL isolates were collected from the same site (site 1). The map was downloaded from DIVA-GIS (https://www.diva-gis.org/gdata). b, Phylogenetic analysis of the protein sequence of the RBD of Laotian and representative human, bat, and pangolin sarbecoviruses. Sequences were aligned with MAFFT (ref. 49) in auto mode, and maximum-likelihood phylogenetic reconstruction was performed with PhyML implemented through the NGPhylogeny portal50 with the LG + G substitution model. Branch support was evaluated with the aBayes parameter. Bat species are specified in the name of the sequences. Sequences are coloured according to Fig. 1c. c, Similarity plot analysis of Laotian and representative bat and pangolin sarbecoviruses based on the full-length genome sequence of the SARS-CoV-2 human prototype strain (NC_045512, Wuhan-Hu-1) used as a reference. The analysis was performed with the SimPlot program version 3.5.1 (ref. 51) with the Kimura two-parameter model, a window size of 1,000 base pairs, a step size of 100 base pairs, a transition/transversion rate (T/t) of 2.0, and a Gap/Strip parameter: on51. nsp, non-structural protein; RdRP, RNA-dependent RNA polymerase. d, Heat map of identities at the protein level of representative human, bat and pangolin sarbecoviruses compared to human SARS-CoV-2 lineage B (NC_045512). Spike protein has been divided into functional domains, and the sequences are ordered according to percentage of identity of the RBD domain. The asterisk marks the absence of a functional ORF10 in Thai bat RacCS203 (accession number MW251308). The heat map was created using the gplots package in R (version 3.6.3). AA, amino acid.

Next-generation sequencing (NGS) and Sanger sequencing were used to obtain a complete genomic sequence of five of the seven sarbecoviruses (Fig. 1 and Supplementary Table 4). The coverage of the genome of the remaining two sarbecoviruses (BANAL-27 and BANAL-242 sampled from R. pusillus and R. malayanus bats, respectively) was 90%; therefore, they were not included in the final analyses. Phylogenetic analyses performed on the receptor-binding protein domain of lineages A and B human SARS-CoV-2 (ref. 21), and on representative bat and pangolin sarbecoviruses, placed the Laotian R. malayanus BANAL-52, R. pusillus BANAL-103 and R. marshalli BANAL-236 coronaviruses close to human SARS-CoV-2 and pangolin coronaviruses collected in 2019, whereas R. malayanus BANAL-116 and BANAL-247 coronaviruses belonged to a sister clade with other bat coronaviruses (RmYN02, RacCS203, RpYN06 and PrC31) from different Rhinolophus species. Pangolin coronaviruses sampled in 2017 exhibited a basal position relative to these strains (Fig. 1b). Very similar SARS-CoV-2-like viruses are shared by different bat species, suggesting possible circulation of viruses between different species living sympatrically in the same caves. These results are consistent with the similarity plot analysis showing that RaTG13 and BANAL-52 bat coronaviruses exhibit high nucleotide identity with SARS-CoV-2 throughout the length of the genome (96.8% for BANAL-52 and 96.1% for RaTG13). Notably, BANAL-52 has a higher level of nucleotide conservation than RaTG13 in the S1 domain of the spike protein, and especially in the amino-terminal domain (NTD) and RBD of the spike protein (Fig. 1c). These observations are congruent with amino acid identities between human SARS-CoV-2 and representative bat and pangolin coronaviruses, which have a high level of conservation, except for the open reading frame 8 (ORF8) of bat BANAL-116, BANAL-247, Rc-o319 and RmYN02 (Supplementary Fig. 1). The S1 domain of the spike protein (and especially the NTD) has a lower degree of conservation in several bat coronaviruses, suggesting that this domain may reflect a relative degree of adaptation of the virus to its mammalian host (Fig. 1d and Extended Data Fig. 1).

Bat Sarbecovirus evolutionary history

Following analysis using a genetic algorithm for recombination detection (GARD), we identified 14 recombinant breakpoints during the evolutionary history of sarbecoviruses, which were further confirmed by phylogenetic analyses performed on the 15 fragments of sequences defined by the breakpoints (Fig. 2 and Supplementary Fig. 2). No specific signature was identified in the breakpoints (Supplementary Table 5). SARS-CoV-2 has a mosaic genome, to which more than five sequences close to sequences published or determined during this study contributed: R. malayanus RmYN02 and R. pusillus RpYN06 viruses found in China in 2019, R. affinis RaTG13 coronavirus found in China in 2013, and R. malayanus BANAL-52 and R. pusillus BANAL-103 found in northern Laos in 2020 (this study). No pangolin coronavirus sequence was immediately associated with a recombination event at the origin of SARS-CoV-2. Laotian Rhinolophus bat coronaviruses had a lower degree of recombination compared to SARS-CoV-2. Such recombination events occurred between other BANAL viruses isolated from bats living sympatrically in caves in the same area.

Fig. 2: Recombination events in the evolutionary history of sarbecoviruses.
figure 2

Representation of the 15 recombinant fragments of relevant Sarbecovirus genomes compared to the SARS-CoV-2 human prototype strain (NC_045512). The coordinates of the breakpoints refer to the nucleotide position in the alignment. Where possible, the closest viral sequence is indicated for each fragment. In other cases, MULT indicates a group of multiple sequences. The asterisk marks unresolved fragment phylogeny (fragment 13, from positions 27,344 to 27,800 in the alignment). Sequences are coloured as in Fig. 1. The complete phylogenetic analyses are presented in Supplementary Fig. 2.

Notably, the origin of several fragments of SARS-CoV-2 genomes could be assigned to several donor strains and not a unique donor sequence. For example, a breakpoint was identified seven nucleotides upstream of the RBD region of S1: the downstream fragment of SARS-CoV-2, which comprises the RBD and the beginning of S2, could involve the R. malayanus BANAL-52, R. pusillus BANAL-103 and R. marshalli BANAL-236 viruses, which formed a highly supported sister clade of SARS-CoV-2 (fragment 11, Supplementary Fig. 2). In a more basal position are R. shameli bat coronaviruses and pangolin-2019 coronaviruses, consistent with the conservation of RBD amino acid sequences among SARS-CoV-2 and representative bat and pangolin coronaviruses (Extended Data Fig. 2). Among the 17 residues that interact with hACE2, 16 are conserved between SARS-CoV-2 and BANAL-52 or BANAL-103 (one mismatch, H498Q), and 15/17 are conserved for BANAL-236 (two mismatches, K493Q and H498Q) whereas only 13/17 residues are conserved for the Cambodian bat R. shameli virus and 11/17 for the Chinese bat R. affinis RaTG13 virus. At the full spike protein level, bat R. affinis RaTG13 and pangolin-2017 P4L viruses seemed closer to SARS-CoV-2 than bat R. malayanus BANAL-52, but this effect is due to a higher degree of conservation in S2. All of these viruses shared the absence of a furin cleavage site and the conservation of the internal fusion peptide (Supplementary Figs. 1 and 3 and Extended Data Fig. 3).

Interaction of BANAL RBDs with ACE2

Biolayer interferometry experiments to measure the interaction between hACE2 and the RBDs of BANAL-52/103 (which have identical residues in the receptor-binding motif; Extended Data Fig. 2), BANAL-236 and SARS-CoV-2 (residues 233–524) resulted in a dissociation constant Kd three times lower for the BANAL RBDs compared to SARS-CoV-2 (Fig. 3a). This higher affinity can be attributed to the Q498H mismatch that has been reported to increase the affinity of the SARS-CoV-2 RBD for hACE2, and also to be involved in the host range expansion of SARS-CoV-2 and SARS-CoV-2-like viruses22,23,24,25,26,27,28,29.

Fig. 3: Dynamics of the binding of hACE2 to bat-sarbecovirus-borne RBDs and insight into the structure of the complex.
figure 3

a, Biolayer interferometry (BLI) binding analysis of the hACE2 peptidase domain to immobilized BANAL-52/103, BANAL-236 or SARS-CoV-2 RBDs. Black lines correspond to a global fit of the data using a 1:1 binding model. b, Frequency of formation of salt bridges close to the RBD–ACE2 interface (from left to right: D30/K417, E35/K493, D38/K493, K31/E35 and D38/K353) during the course of the MD simulations. The analysis was performed for nine different MD simulations (three replicates for each complex) of hACE2 in complex with RBDs from SARS-CoV-2 (SARS-CoV-2.1, SARS-CoV-2.2 and SARS-CoV-2.3, shades of green), BANAL-236 (BANAL 236-CoV.1, BANAL 236-CoV.2 and BANAL 236-CoV.3, shades of red) and BANAL-52/103 (BANAL 52/103-CoV.1, BANAL 52/103-CoV.2 and BANAL 52/103-CoV.3, shades of blue). c, Ribbon representations of the crystal structures of the hACE2 peptidase domain (cyan) in complex with SARS-CoV-2 (PDB 6M0J) or BANAL-236 (this study, PDB 7PKI) RBDs (pink). Black arrows in the overall structures indicate the structural difference between the two complexes at the level of helix H4. The magnifications show the main interactions in the ACE2–RBD interfaces. Residues in the receptor-binding motif altered between SARS-CoV-2 and BANAL-236 are highlighted with coloured outlines. Numbers 1, 2 and 3 indicate the three main clusters of interactions between hACE2 and RBDs.

To study the effect of the alterations at the interface between these RBDs and hACE2, we performed molecular dynamics (MD) simulations of the complexes between the RBDs of SARS-CoV-2, BANAL-236 and BANAL-52/103 and hACE2 initiated from the crystal structure and homology models of these systems, respectively (Supplementary Table 6). Cluster analysis of the MD trajectories showed that, at the RBD–hACE2 interface, both BANAL complexes were identical to the complex between the SARS-CoV-2 RBD and hACE2 within 2 Å backbone root mean square deviation (RMSD; Extended Data Fig. 4), except for one of the BANAL-52/103 simulations that exhibited larger conformational variability of the RBD residues S443–Y449 (Supplementary Methods and Extended Data Fig. 5). Empirical scoring functions predicted a similar RBD–hACE2 binding energy in all three complexes (Extended Data Fig. 6).

The analysis of the persistence of hydrogen bonds and salt bridges provided further insights into the effect of the substitutions at the RBD–hACE2 interface (Fig. 3b and Extended Data Fig. 7). The H498Q mismatch present in both BANAL-52/103 and BANAL-236 RBDs disrupted the hydrogen bonds between RBD Q498 and both hACE2 K353 and Q42. However, these hydrogen bonds were only transiently formed in the SARS-CoV-2 complex: more persistent hydrogen bonds in this region (RBD T500 to hACE2 D355; RBD G502 to hACE2 K353; and RBD Y505 to hACE2 E37) were not affected. The K493Q mismatch enabled the formation of two salt bridges between RBD and hACE2 that were not present in the SARS-CoV-2 complex (RBD K493 to hACE2 E35 and RBD K493 to hACE2 D38).

For further insight into the molecular details of these interactions, we determined the crystal structure of the complex between the BANAL-236 RBD and the hACE2 peptidase domain to 2.9 Å resolution (Supplementary Table 7). The overall structure of this RBD is identical to that of SARS-CoV-2 (RMSD 0.360 Å, 150 Cα). The only significant difference is in the region between amino acids D363 and S375 (Fig. 3c, arrow). In this region, BANAL spikes have the A372T mismatch, which converts the sequence 370NSA372to 370NST372, an N-linked glycosylation sequon. The crystals indeed showed clear electron density for the first N-acetylglucosamine residue of glycan attached to N370. The main chain at residue T372 makes a hydrogen bond with the glycan residue, altering the conformation of the main chain downstream, which results in partial unwinding of helix H4 in RBD 236. The calculation of a simulated annealing composite omit map for the segment D363–S375 confirms the correct assignment of the structure for this polypeptide segment and the N-glycosylation at residue N370 (Extended Data Fig. 8).

As expected, most of the interactions observed in the SARS-CoV-2 RBD–hACE2 complex30 are also present in the structure of the complex between the BANAL-236 RBD and hACE2. In these interfaces, there are three main clusters of interactions as indicated in Fig. 3c (insets). The sequence mismatches are in clusters 2 (making the salt bridge between RBD K493 and hACE2 E35) and 3 (hydrogen bond between RBD H498 and D38). Although the interaction K493–E35 contributes to stabilizing the complex, it does not seem to markedly affect the binding to hACE2 because both BANAL RBDs have similar Kd values.

Virus replication in human cells

To assess whether the BANAL-236 spike protein could mediate entry into cells expressing hACE2, we generated lentiviral particles pseudotyped with spike from the SARS-CoV-2 strain first detected in Wuhan or BANAL-236 (Supplementary Fig. 4). We detected spike-mediated entry of the BANAL-236 spike-pseudotyped lentivirus into hACE2-expressing HEK-293T cells, but no such entry into control cells not expressing hACE2 (Fig. 4a). Entry was blocked by human sera neutralizing SARS-CoV-2, but not by control non-neutralizing sera, demonstrating that neutralization of BANAL-236 was specific for epitopes shared with the spike protein of SARS-CoV-2 (Fig. 4b).

Fig. 4: BANAL-236 entry and propagation in human cells.
figure 4

a, Results of spike-pseudotyped BANAL-236 (squares) and SARS-CoV-2 (strain BetaCoV/France/IDF0372/2020, GISAID accession number EPI_ISL_406596, diamonds) pseudovirus entry assay in HEK-293T cells expressing (purple lines) or not (grey lines) the hACE2 receptor, shown in relative luminescence units (RLUs) produced by the firefly luciferase present in the lentiviral backbone and the Bright-Glo luciferase substrate. A single experiment performed in triplicate representative of two experiments is shown. Centre values represent the average of the three replicates and error bars indicate s.d. b, Results of spike-pseudotyped BANAL-236 (black) and SARS-CoV-2 (strain BetaCoV/France/IDF0372/2020, grey) neutralization assay expressed as a percentage of neutralization of luciferase activity in the absence of serum. Sera neutralizing SARS-CoV-2 were from patients with confirmed infections whereas non-neutralizing sera samples were collected before the spread of SARS-CoV-2 Laos. The dashed line marks the neutralization threshold. A single experiment representative of three independent experiments is shown. c, Human cell lines expressing endogenous ACE2, Calu-3 (blue lines) and Caco-2 (green lines), were infected at an MOI of 0.01 with the BANAL-236 virus (squares) and the virus first detected in Wuhan (diamonds). VeroE6 cells were infected at an MOI of 0.0001 with BANAL-236 virus (squares) and SARS-CoV-2 (strain BetaCoV/France/IDF0372/2020, diamonds) pre-incubated with (grey lines) or without (pink lines) soluble hACE2 (sACE2) at 25 µg ml−1 for 30 min. Genome copy number was determined by quantitative RT-PCR in the supernatants recovered 3 and 4 days post-infection. A single experiment performed in triplicate is shown. Centre values represent the average of the three replicates and error bars indicate s.d.

Source data

To isolate infectious viruses, rectal swabs were inoculated on VeroE6 cells. No cytopathic effect (CPE) was observed 3 and 4 days after infection, but viral RNAs were detected for one of the two wells inoculated with the BANAL-236 sample (cycle threshold (Ct) = 25.1 at day 3, Ct = 21.7 at day 4). The culture supernatant (C1) formed plaques on VeroE6 and the titre was 3,800 pfu ml−1. A C2 viral stock was prepared by amplification on VeroE6 at a multiplicity of infection (MOI) of 10−4. The culture supernatant was collected on day 4 when CPE was observed and titrated on VeroE6 (Extended Data Fig. 9). The plaques’ phenotype was small, but the titre reached 2.6.106 pfu ml−1. The random NGS performed on the RNA extracted from this stock confirmed that the culture was pure and corresponded to the BANAL-236 virus, without any non-synonymous mutations between the original BANAL-236 genome and the C2 viral stock. Replication of BANAL-236 in VeroE6 was efficiently inhibited by soluble hACE2, thus showing that entry and propagation was largely ACE2 dependent (Fig. 4c and Extended Data Fig. 9). Furthermore, BANAL-236 replicated in human cell lines expressing endogenous levels of ACE2, Calu-3 and Caco-2 (Fig. 4c). The kinetics of RNA synthesis were slower compared to those for SARS-CoV-2. Infectious viral particles were produced at day 4 in the supernatant of Caco-2 and Calu-3 cells (respectively 104.7 and 102.9 pfu ml−1).

Discussion

Many sarbecoviruses circulate in Rhinolophus colonies living in caves in China and probably also in neighbouring countries further south31,32,33. During the course of a prospective study in northern Laos, we have identified, among other coronaviruses, five sarbecoviruses for which we obtained full-length sequences. Among these, three (BANAL-52, BANAL-103 and BANAL-236) were considered to be close to SARS-CoV-2 because of the similarity of one of the S1 domains (NTD or RBD) or S2 to that of SARS-CoV-2.

As genomic regions subject to recombination are probably contributing to host–virus interactions, we compared SARS-CoV-2 strains from the two lineages identified at the onset of the coronavirus disease 2019 (COVID-19) outbreak21 to these new bat sarbecoviruses and to pangolin strains in the SARS-CoV-2 clade. Strains close to R. pusillus RpYN06, R. malayanus RmYN02 and Rhinolophus sp. PrC31 isolated in China in 2018–2019, along with R. malayanus BANAL-52, R. pusillus BANAL-103 and R. marshalli BANAL-236 isolated in Laos in 2020, contributed to the appearance of SARS-CoV-2 in different regions of the genome. No closer viral genome has yet been identified as a possible contributor, and pangolin coronaviruses seem more distantly related than bat coronaviruses. We identified potential recombination sites, allowing for the reconstruction of the phylogenetic history of early isolated SARS-CoV-2 strains between homologous regions defined by recombination points. We identified a breakpoint at the beginning of the SARS-CoV-2 RBD, resulting in a downstream fragment key for the virus tropism and host spectrum composed of the RBD and the furin cleavage site, and ending in the N-terminal region of S2. Despite the absence of the furin site, phylogenetic reconstruction of this fragment showed that the Laotian R. malayanus BANAL-52, R. pusillus BANAL-103 and R. marshalli BANAL-236 coronaviruses are the closest ancestors of SARS-CoV-2 known so far. ORF8 was highly divergent between SARS-CoV-2 related genomes. The ORF8 genes from the strains BANAL-52, BANAL-103 and BANAL-236, like that of RaTG13, were closer to that of SARS-CoV-2 than to those of pangolin strains. ORF8 encodes a protein that has been proposed to participate in immune evasion34 and is deleted in many human SARS-CoV-2 strains that appeared after March 202035, which is reminiscent of the deletions identified during the 2003 severe acute respiratory syndrome epidemic36. Therefore, the presence of ORF8 is consistent with bats acting as a natural reservoir of early strains of SARS-CoV-2.

Structural and functional biology studies have identified the RBD domain that mediates the interaction with hACE2 and host range, as well as the main amino acids that are involved30,37,38. The RBDs (BANAL-52, BANAL-103 and BANAL-236) are closer to that of SARS-CoV-2 than that of any other bat strain described so far, including that of RaTG13, the virus identified in R. affinis from the Mojiang mineshaft where pneumonia cases with clinical characteristics a posteriori interpreted as similar to COVID-19 (ref. 6) were recorded in 201239,40. Overall, one (H498Q (BANAL-103 and BANAL-52)) or two (K493Q and H498Q (BANAL-236)) amino acids interacting with hACE2 are substituted in these strains in comparison to SARS-CoV-2. These substitutions did not destabilize the interface between BANAL-236 and hACE2, as shown by the biolayer interferometry experiments (Fig. 3a) and analysed by MD simulations.

Our results therefore support the hypothesis that SARS-CoV-2 could originally result from a recombination of sequences pre-existing in Rhinolophus bats living in the extensive limestone cave systems of Southeast Asia and South China41,42. Many species forage in the same cave areas, including R. malayanus and R. pusillus43. In addition, the distributions of R. marshalli, R. malayanus and R. pusillus overlap in the Indochinese subregion (Supplementary Fig. 5), which means that they may share caves as roost sites and foraging habitats44. With the new viruses described here, understanding the emergence of SARS-CoV-2 does not require speculation of recombination or natural selection for increased RBD affinity for hACE2 in an intermediate host such as the pangolin before spillover45, nor natural selection in humans following spillover46. However, we found no furin cleavage site in any of these viruses on sequences determined directly from original faecal swab samples, which prevent from any risk of counterselection of the furin site by amplification in Vero cells18. The lack of the furin cleavage site may be explained by insufficient sampling in bats. On the basis of comparison of the sequences around the cleavage site between S1 and S2 (Extended Data Fig. 3), it has been suggested that the furin cleavage site in SARS-CoV-2 could originate from recombination events between SARS-CoV-2-related coronaviruses co-circulating in bats2,47, meaning that BANAL-116, BANAL-247, bat RmYN02 (ref. 2) and bat RacCS203 (ref. 3) coronaviruses may share a common history with SARS-CoV-2. Alternatively, the furin cleavage site could have been acquired through passages of the virus in an alternative host or during an early poorly symptomatic unreported circulation in humans. Finally, the epidemiological link between these bat viruses and the first human cases remains to be established.

As expected from the high affinity for hACE2 of the S ectodomain of BANAL-236, pseudoviruses expressing it were able to efficiently enter human cells expressing endogenous hACE2 using an ACE2-dependent pathway. However, alternative routes of entry may still exist, especially in cells that do not express ACE2 (ref. 48). Entry was blocked by a serum neutralizing SARS-CoV-2. The RaTG13 strain, the closest to SARS-CoV-2 known previously, had never been isolated. By contrast, preliminary studies show that BANAL-236 replicated in primate VeroE6 cells with a small plaque phenotype compared to that of SARS-CoV-2. Further analysis may indicate more clearly which steps shape infectivity.

To conclude, our results pinpoint the presence of new bat sarbecoviruses that seem to have the same potential for infecting humans as early strains of SARS-CoV-2. Guano collectors, certain ascetic religious communities who spend time in or very close to caves and tourists visiting caves are particularly at risk of being exposed. Further investigations are needed to assess whether such exposed populations have been infected, symptomatically or not, by one of these viruses, and whether infection could confer protection against subsequent SARS-CoV-2 infections. In this context, it is noteworthy that SARS-CoV-2 with the furin site deleted replicates in hamsters and in transgenic mice expressing hACE2, but leads to less severe disease and protects from rechallenge with wild-type SARS-CoV-2 (ref. 18).

Methods

Ethical and legal statements

The bat study was approved by the wildlife authorities of the Department of Forest Resource Management (DFRM), and the Ministry of Agriculture and Forestry (MAF), Lao People’s Democratic Republic, no. 2493/DFRM, issued on 21 May 2020 and no. 0755/MAF issued on 2 June 2020. All animals were captured, handled and sampled following previously published protocols and ASM guidelines52,53. Exportation from Laos and importation in France were conducted according to national regulations. Human serum samples used for neutralization assays were already available54 and selected on the basis of their status regarding seroneutralization of SARS-CoV-2. They were collected following a protocol approved by the Lao National Ethics Committee for Health Research (NECHR; reference no. 052/2020).

Biosafety

Both the Institut Pasteur du Laos (IPL) and the Faculty of Environmental Science have extensive experience in safely collecting bats (appropriate biosafety training and personal protective equipment (PPE) for collectors). In this study, training before field work was organized once for field work participants. The aim was to teach participants the transmission risks of infectious agents from bats and how to identify, assess and mitigate these risks, as well to practice the use of PPE.

During field collection, sampling stations were selected to minimize potential exposure to infectious agents and stress on the animals during the handling time by selecting: an area easy to disinfect; an area out of view of the general public; a location that would not cause exposure of the general population (such as a picnic area); and procedures that reduce time and stress on bats caused by handling. For handling bats and bat samples, the following minimum PPE was required: eye protection, an N95 respirator, long clothing/coverall and latex gloves (2 pairs). All waste was disposed of in biohazard bags and was transported to a disposal site of IPL in Vientiane capital. Each sample box containing 81 cryovial tubes was placed inside an individual plastic ziplock and transported from the field to IPL on dry ice in a polystyrene foam box. At IPL, samples were stored in a specific −80 °C freezer until analysis.

For initial sample analysis at IPL, samples were transferred to a BSL-3 room for nucleic acid extraction. Initial coronavirus screening by nested RT-PCR was then performed under internal regulation on biosafety and security of IPL. Sample-extracted products (50 µl) were stored in NucleoSpin 8 sample boxes (8 × 12) and sent to IP-Paris packed in individual plastic ziplock bags on dry ice. Aliquots of anal swab samples were sent individually triple packaged on dry ice in a separate box.

All experiments on potentially infectious samples performed at Institut Pasteur (Paris) were conducted in BSL-3 laboratories according to procedures adapted for respiratory viruses.

Bat sampling areas and sample collection

Trapping sessions were conducted at four sites, in the Fueng and Meth districts, Vientiane province, and in the Namor and Xay districts, Oudomxay province, between July 2020 and January 2021 (Fig. 1a and Supplementary Tables 1 and 2). Bats were captured using four-bank harp traps55 and mist nets set in forest patches between rice fields or orange/banana plantations and karst limestone formations, for 5–8 nights depending on accessibility. Harp traps were set across natural trails in patches of forest understorey. Mist nets were set across natural trails, at the edges of forests, at entrances of caves and in areas near cave entrances, as well as in open areas or those with high forest canopy. Bats were morphologically identified following morphological criteria55,56,57. Other data such as forearm length (FA), sex, developmental stage (adult or juvenile) and reproductive condition (pregnant or lactating) were also recorded. Bats were sampled for saliva, faeces and/or urine, and blood before release at the capture site. Species identification of PCR-positive individuals was confirmed by sequencing the mitochondrial cytochrome oxidase 1 (ref. 20).

Initial coronavirus screening

Total RNA was extracted from faecal samples using the NucleoSpin 8 virus kit (Macherey Nagel). cDNA was synthesized using the Maxima H minus first strand cDNA synthesis kit (Thermo Scientific) and random hexamers following the manufacturer’s instructions. The presence of coronaviruses was tested by a nested RT-PCR approach using PCR master mix (Promega) and by targeting the RNA-dependent RNA Polymerase gene using combinations of degenerate and non-degenerate consensus primers as previously described20. PCR products of the expected size were directly sequenced on both strands by Sanger sequencing using the nested PCR primers. The sequences obtained were confirmed by similarity analysis using the NCBI BLASTn search (http://www.ncbi.nlm.nih.gov/BLAST).

Primer design for Betacoronavirus enrichment before NGS

Betacoronavirus enrichment was performed at the genus level by adapting a previously described protocol58 based on k-mers for targeted-sequence enrichment before NGS. Briefly, 2,000 complete Betacoronavirus genomes were downloaded from the GenBank and GISAID databases and then clustered to a 95% sequence identity using CD-HIT-EST (ref. 59). Overall, 185 representative sequences of all betacoronaviruses were used for further analysis. Owing to the high diversity in the genus Betacoronavirus, the full genomes belonging to the subgenera Sarbecovirus, Nobecovirus, Merbecovirus and Embecovirus were separately aligned using MAFFT multiple sequence alignment software and used to design 13-mer spiked primers for each cluster. The genomic position of the full set of primers was extracted from some representative subgenera sequences and close primers were removed. Finally, 416 spiked primers were synthesized by Eurofins Genomics.

Sample preparation for sequencing

Reverse transcription was performed using the mix of spiked primers and random hexamers at a 10:1 ratio using the SuperScript IV First-Strand Synthesis System (Invitrogen). After a denaturation step at 95 °C for 3 min in the presence of dNTPs (500 µM), DTT (5 mM) and RNaseOUT inhibitor, the reverse transcription reaction was incubated for 10 thermal cycles consisting of 6 steps at 8 °C for 12 s, 15 °C for 45 s, 20 °C for 45 s, 30 °C for 30 s, 35 °C for 2 min and 42 °C for 3 min, followed by a final incubation step at 42 °C for 20 min, as previously described60. Double-stranded cDNA was generated using the Sequenase 2.0 DNA Polymerase kit (Applied Biosystems) in the presence of dNTPs and then purified using the Beckman Coulter Agencourt AMPure XP.

For samples with a low nucleic acid content, a random amplification step was performed using the MALBAC Single Cell WGA kit (Yikon Genomics, Promega). The amplified product was then purified using AMPure XP Beads, eluted in a final volume of 20 μl of low TE (10 mM Tris-HCl (pH 8,0), 0,1 mM EDTA), and quantified with the Qubit DNA HS Assay (Life Technologies, Thermo Fisher Scientific).

Libraries were generated using the NEBNext Ultra II DNA Library Prep kit (New England Biolabs) after a fragmentation step using the Covaris M220 Focused-ultrasonicator using microTUBE-15 (peak incident power = 18 W, duty factor = 20%, cycles per burst = 50, treatment time = 60 s). The PCR-amplified libraries were cleaned up using 0.9× AMPure XP Beads and checked on the 2100 Bioanalyzer system with the High-Sensitivity DNA kit (Agilent Technologies) and quantified with the Qubit DNA HS Assay. Finally, the dual-multiplexed libraries were pooled (six samples per pool) and run on the Illumina NextSeq500 platform with High Output Kit version 2.5 (150 cycles).

Amplicon sequencing

In addition to enrichment-based sequencing, cDNA was amplified using the AmpliSeq for Illumina SARS-CoV-2 Research Panel (catalogue number 20020496), applying 26 amplification cycles in PCR1 and 9 cycles in PCR2. Primers at the end of the amplicons were partially digested during the library preparation, following the manufacturer’s instructions. Libraries were barcoded individually using the Illumina UD dual indices and normalized with the AmpliSeq Library Equalizer for Illumina (catalogue number 20019171), and then pooled and sequenced on the Illumina NextSeq500 instrument using a Mid Output version 2.5 kit (SR 150 cycles). To eliminate residual PCR primer sequences, raw reads were trimmed by 15 bases at each end and special attention was paid to checking that internal sequences corresponding to primer regions in overlapping amplicons did not derive from primer sequences of the multiplex PCR. As an internal control, we verified that sequences of the complete genome of sample BANAL-236 obtained from the enrichment-based sequencing approach and from the AmpliSeq approach were identical.

Genome assembly and finishing

Raw reads from the enrichment-based sequencing were processed with an in-house bioinformatics pipeline (Microseek; Bigot T. et al., unpublished observations) comprising quality check and trimming, read normalization, de novo assembly, and ORF prediction of contigs and singletons, followed by three levels of taxonomic assignation61. Sequences identified as Sarbecovirus were then mapped onto appropriate reference sequences using CLC Genomics Workbench 20.0 (Qiagen). Trimmed reads from the amplicon sequencing were mapped to the SARS-CoV-2 genome first, and then mapped again (refined mapping) to the closest genome relative. When needed, complete genomes were obtained by conventional PCR and Sanger sequencing. Briefly, viral RNA was reverse transcribed using SuperScript IV reverse transcriptase (Invitrogen) and cDNA was subsequently used to fill the gaps in the genomes using Phusion High Fidelity DNA polymerase (New England Biolabs) and specific primers flanking the missing regions. Positive PCR products were further purified and sequenced by Sanger sequencing at Eurofins Genomics.

Recombination and phylogenetic analyses

Identification of recombination events occurring during the evolutionary history of bat sarbecoviruses was performed using the IDPlot package62, a web-based workflow that includes multiple sequence alignment and phylogeny-based breakpoint prediction using the GARD algorithm from the HyPhy genetic analysis suite63. First, a comprehensive analysis comprising 106 sequences and covering all non-human Sarbecovirus and Sarbecovirus-related complete genomes available in GenBank and GISAID databases was performed, including prototype strains of SARS-CoV-2 isolated in 2019. Special attention was paid to including bat-borne and pangolin-borne viral sequences to maximize the ability to capture a large diversity of the sarbecoviruses. Then, a reduced set of 36 sequences was chosen because of their phylogenetic proximity to SARS-CoV-2, and the GARD algorithm was run to identify recombination breakpoints in Laotian and representative human, bat and pangolin sarbecoviruses. Breakpoint coordinates were confirmed by performing phylogenetic analyses on the corresponding fragments using PhyML implemented through the NGPhylogeny portal50. Branch support was evaluated with the aBayes parameter.

Generation of lentiviral pseudoviruses

The synthetic genes encoding spike in BANAL-236 and in SARS-CoV-2 (strain BetaCoV/France/IDF0372/2020, GISAID accession number EPI_ISL_406596) were cloned into the pVAX1 vector with a cytoplasmic tail truncation of 19 amino acids. Pseudotyped lentiviral particles were prepared using HEK-293T cells (ATCC CRL-3216) seeded in 10-cm dishes. HEK-293T cells were co-transfected with 5 µg of spike-encoding plasmid, 10 µg of lentiviral backbone plasmid expressing the firefly luciferase (pHAGE-CMV-Luc2-IRES-ZsGreen-W) and 3.3 µg of each lentiviral helper plasmid expressing HIV Gag-Pol (HDM-Hgpm2), Tat (HDM-tat1b) and Rev (pRC-CMV-Rev1b) using calcium phosphate precipitation64. The medium was replaced 5 h post-transfection by 6 ml of DMEM without fetal calf serum (FCS) and phenol red. Pseudotyped particles were collected 48 h post-transfection, clarified by centrifugation at 2,500g for 5 min and frozen at −80 °C. Mock pseudotyped lentivirus was generated as above but in the absence of an S-expressing plasmid.

Spike-pseudotyped lentivirus entry assays

HEK-293T cells stably expressing hACE2 were transduced in suspension by mixing 50 µl of threefold serial dilutions of S-pseudotyped lentiviruses with 50 µl of cells at 4 × 105 cells per ml in 96-well white culture plates65. At 60–72 h post-transduction, 100 µl of Bright-Glo luciferase substrate (Promega) was added to the wells and luminescence was measured using a Berthold Centro XS luminometer.

Neutralization assays

Sera neutralizing SARS-CoV-2 and non-neutralizing sera were described in ref. 54. They were decomplemented at 56 °C for 30 min and 2.5 µl was incubated with 0.5 µl of S-pseudotyped lentiviruses in a final volume of 50 µl DMEM–10% FCS without phenol red in 96-well white culture plates. After 30 min at room temperature, 50 µl of hACE2-expressing HEK-293T cells in suspension at 4 × 105 cells per ml was mixed into the wells. Luminescence was measured at 60–72 h post-transduction as described above. Neutralization was calculated using the following formula: 1 − (RLU in presence of serum/(mean of RLU in absence of serum determined in 12 wells − 3 × s.d.).

Virus isolation and multiplication

Rectal swabs were inoculated in duplicate in 24-well plates containing VeroE6 cells (ATCC CRL-1586) (1/5 dilution in 100 µl DMEM without FCS supplemented with 1% penicillin–streptomycin, 1% Fungizone and 1 µg ml−1 TPCK-treated trypsin). After 1 h of adsorption at 37 °C, inoculum was removed and 1 ml of the medium described above was added. At 3 and 4 days after infection, CPE was monitored and 100 µl of supernatant was collected for RNA extraction. Quantitative RT-PCR targeting a conserved sequence in the E gene was performed as described previously66. Culture supernatant (C1) was collected at day 4 and titrated by plaque assay on VeroE6 overlaid with 0.5% carboxymethylcellulose containing 1 µg ml−1 TPCK-treated trypsin. A viral stock was prepared by amplification on VeroE6 cells at an MOI of 10−4. Culture supernatant (C2) was collected at day 4 when a massive CPE was observed and titrated by plaque assay on VeroE6 as described above. A C3 viral stock was produced by a subsequent viral amplification at an MOI of 10−4 for 3 days on VeroE6. RNA was extracted from the viral stock and submitted to random NGS analysis using the SMARTer Stranded Total RNA-Seq Kit version 3 - Pico Input Mammalian (Takara Bio). Raw reads were processed with the Microseek pipeline, as described above.

For infection experiments, VeroE6 or the human cell lines Calu-3 (lung cells, ATCC HTB-55) and Caco-2 (intestinal cells, ATCC HTB-37) were infected in triplicate in 24-well plates with viral stocks of BANAL-236 (C3) or SARS-CoV-2 (strain BetaCoV/France/IDF0372/2020) at an MOI of 0.0001 and 0.01, respectively. Human soluble ACE2 was pre-incubated at 25 µg ml−1 (ref. 67) for 30 min with the viral inoculum before VeroE6 infection. Infections were carried out without TPCK in the medium described above for 3 and 4 days. Supernatants were recovered at 0, 3 and 4 days post-infection and genome copy number and viral titres were quantified as described above.

Protein expression and purification

BANAL-52/103, BANAL-236 and SARS-CoV-2 RBDs (residues 233–524, with carboxy-terminal 8×His-Strep and AVI tags) and hACE2 peptidase domain (residues 19–615, with C-terminal 8×His tag) were expressed in Expi293F cells at 37 °C and 8% CO2 (GnTI- Expi293, ThermoFisher Scientific). Cell culture supernatants were collected 5 days post transfection and purified by affinity chromatography followed by size-exclusion chromatography (SEC) using a 200 10/300 GL column pre-equilibrated in 20 mM Tris-HCl pH 8.0, 100 mM NaCl.

For crystallization experiments, the same constructs were expressed in Expi293F GnTI cells. The protein tags were cleaved overnight with thrombin and deglycosylated with EndoH. The RBD was mixed with a 1.3 molar excess of hACE2 and the complex was purified by SEC.

Biolayer interferometry

Purified Avi-tagged RBD was biotinylated using a BirA biotin–protein ligase kit according to the manufacturer’s instructions (Avidity). The biotinylated RBDs at 100 nM were immobilized to SA sensors. A 1:2 dilution series of hACE2 starting at 100 nM in PBS–BSA buffer was used in cycles of 200-s association followed by 200-s dissociation steps to determine protein–protein affinity. The data were baseline-subtracted and the plots were fitted using the Pall FortéBio/Sartorius analysis software (version 12.0). Data were plotted in Prism 9.1.0.

Crystallization and data collection

Crystals of the complex BANAL-236 RBD–hACE2 were obtained at 4 °C in sitting drops by mixing 200 nl of the protein complex at 8 mg ml−1 with 200 nl of reservoir solution containing 0.2 M lithium sulfate, 0.1 M Tris 8.5, 30% w/v PEG 4000. The crystals were soaked in reservoir solution containing 20% glycerol as a cryoprotectant before being flash-frozen in liquid nitrogen. X-ray diffraction data were collected on the beamline PROXIMA 1 at the SOLEIL synchrotron (St Aubin, France) and reduced using the XDS package68. The structure of the complex was determined by molecular replacement with Phaser software69 using the coordinates of the SARS-CoV-2 RBD in complex with hACE2 as a search template (Protein Data Bank (PDB) 6M0J). The model was manually corrected in COOT (ref. 70) and refined with phenix.refine (ref. 71). The final coordinates were deposited in the PDB with the accession code 7PKI.

MD simulations of RBD–hACE2 complexes

Generation of homology models

Homology models of the complexes between the RBDs of BANAL-236 and BANAL-52/103 and hACE2 were constructed using the X-ray structure of the complex between the SARS-CoV-2 RBD and hACE2 (PDB code 6M0J; resolution 2.45 Å) as a template using MODELLER version 10.1 (ref. 72). These models included the fragments S19–D615 and T333–G526 of hACE2 and RBD, respectively, which were the regions resolved in the template. In these regions, BANAL-236 and BANAL-52/103 RBDs have a sequence identity to SARS-CoV-2 RBD equal to 96.9% and 97.4%, respectively. The alignment reported in Extended Data Fig. 3 was used. The zinc and chloride atoms in the template were retained during homology modelling; N-acetylglucosamine (NAG) and water residues were removed. Seven disulfide bonds were detected by MODELLER in the template (three in hACE2 and four in the RBD) and enforced in the generation of the homology models using CHARMM-like distance and dihedral angles restraints. For each construct, 100 homology models were built and ranked on the basis of the normalized DOPE score73. The top three scoring models of each complex were used as starting points of three independent MD simulations, as described in the following section.

Set-up, equilibration and production of the MD simulations

The six homology models described in the previous section along with the X-ray structure of the SARS-CoV-2 RBD–hACE2 complex (PDB code 6M0J) were used as input to the CHARMM-GUI server74. In the case of the X-ray structure, for consistency with the homology models, the zinc and chloride atoms in 6M0J were retained whereas the NAG and water residues were removed. The seven systems were solvated in a triclinic box of initial xyz dimensions of ~13.5 nm × 9.2 nm × 8.3 nm. Potassium and chloride ions were added to ensure charge neutrality at a salt concentration of 0.15 M. The total number of atoms was ~104,000. Further details of the systems are reported in Supplementary Table 4. The CHARMM36m force field75 was used for the protein and ions, and the TIP3P model76 was used for the water molecules. CHARMM36m force field parameters for the seven pairs of cysteines linked by disulfide bonds were used. The CHARMM-GUI models were first energy-minimized using the steepest descent algorithm. After minimization, the systems were equilibrated using a 1-ns-long simulation in the NPT ensemble followed by a 1-ns-long simulation in the NVT ensemble. The temperature T was set at 300 K and the pressure P was set to 1 atm using the Bussi–Donadio–Parrinello thermostat77 and the Berendsen barostat78, respectively. During equilibration, harmonic restraints on the positions of the protein backbone and sidechain heavy atoms were applied. For each system studied, production simulations were performed in the NVT ensemble for 1 µs. A time step of 2 fs was used together with LINCS constraints on hydrogen bonds79. The van der Waals interactions were gradually switched off at 1.0 nm and cut off at 1.2 nm; the particle mesh Ewald method was used to calculate electrostatic interactions with a cutoff at 1.2 nm (ref. 80). Production simulations were performed at room temperature for consistency with the biolayer interferometry experiments.

Details of the analysis

To evaluate the stability of the starting model during the production simulations, we calculated the backbone RMSD with respect to the energy-minimized structure for each frame of the trajectories. The RMSD was calculated separately for the residues in the RBD (Extended Data Fig. 4a), the hACE2 (Extended Data Fig. 4b), the RBD–hACE2 interface (Extended Data Fig. 4c) and the entire complex (Extended Data Fig. 4d). Interfacial residues were defined as the residues in one subunit closer than 0.8 nm to the residues in the other subunit in the X-ray structure of the complex between the SARS-CoV-2 RBD and hACE2. RMSD calculations were performed using the driver utility of PLUMED version 2.7 (ref. 81). To characterize conformational heterogeneity at the RBD–hACE2 interface, we performed a cluster analysis of the MD trajectories using as similarity metrics the backbone RMSD of the interfacial residues and the GROMOS clustering approach82 with a cutoff equal to 0.2 nm. The nine trajectories were first concatenated, clustering was then performed, and finally the population of each cluster was calculated separately for each trajectory (Extended Data Fig. 4e). To estimate the binding energy between RBD and hACE2, we used the InterfaceAnalyzer tool in ROSETTA version 3.11 (ref. 83; Extended Data Fig. 6a) and the AnalyseComplex tool in FoldX version 4 (ref. 84; Extended Data Fig. 6b). To identify relevant interactions at the RBD–hACE2 interface that could contribute to the binding affinity of the complex, we quantified the frequency of formation of inter-subunit salt bridges (Fig. 3b) and hydrogen bonds (Extended Data Fig. 7) during the course of the MD simulations. For each frame of the trajectories, we used PLUMED to calculate the distances between the side-chain charged groups of aspartic acids (OD1/OD2), glutamic acids (OE1/OE2), lysines (NZ) and arginines (NH1/NH2). An inter-subunit salt bridge was then defined as formed if the distance between groups with opposite charge was lower than 0.32 nm. We confirmed this calculation using the Salt Bridges tool available in VMD (ref. 85). To monitor the formation of inter-subunit hydrogen bonds, we used the Hydrogen Bond Analysis module of the MDAnalysis library version 1.0.0 (ref. 86). A donor–acceptor distance and angular cutoff of 0.3 nm and 150° were used to define the formation of a hydrogen bond. We confirmed this calculation using the Hydrogen Bonds tool available in VMD.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.