Introduction

The development of genome editing tools associated with the clustered regularly interspaced short palindromic repeat (CRISPR) systems has revolutionized biomedical studies. Holding great potential for the treatment of genetic diseases, diverse precise genome editing tools based on CRISPR-Cas9 have been developed, such as homology-directed repair (HDR) based systems, as well as cytosine and adenine base editors (BE)1,2,3. While the HDR method requires double-stranded DNA breaks (DSBs) and causes unpredictable editing outcomes, BEs use nickase Cas9 (nCas9), enabling more precise modifications without generating DSBs4,5,6. For example, cytosine BEs (CBEs) are constructed by fusion of a cytidine deaminase domain with nCas9. This fusion protein forms a complex with the guide RNA and performs site-specific deamination to convert cytosine (C) to uracil (U) in the deaminase activity window. The base U is subsequently replaced with thymine (T) by the endogenous cellular repairing machinery, resulting in an overall C-to-T substitution at the defined genomic site. Since point mutations are responsible for more than half of human disease-associated genetic variants2, BEs are superior than HRD based systems due to their higher editing efficiency in the correction of pathogenic single nucleotide polymorphisms7, avoidance of unwanted DSBs, and prevention of insertions and deletions4,5.

While engineering of several BE variants has improved product purity and overall editing efficiency8,9, one of the major challenges in base editing is the discrimination of multiple identical bases located within the deaminase activity window2 of 4–10 nucleotides. As a result, the target base and other bystander bases will all be modified, negatively impacting the precision of genome editing outcomes. To address this issue, the introduction of beneficial mutations to deaminase has further advanced BEs10,11,12. For example, compared to the wild-type APOBEC3A (A3A)-BE3, an engineered A3A CBE with the mutation N57G maintained high editing activity at the target C in the TCR motif with greatly reduced activity against bystanders11. Also, followed by several rounds of screening and validation of rational mutagenesis, we previously engineered an APOBEC3G (A3G)-CBE that preferentially edits the second C in the “CC” motif with 6000-fold improvement in perfectly modified alleles compared to the original BE4max12. Despite these successes, a general theoretical framework to guide the design of mutations that can lead to high editing activity at the target base and low activity at bystanders (defined as BE high editing selectivity) is still missing. Mutation selections in these previous studies were mostly suggested by structural considerations: starting from the identification of key residues in the deaminase near the DNA binding motif and then mutating those residues to form a candidate library for experimental validation. The design process could be greatly accelerated with a comprehensive theoretical model that could quantitatively explain and predict the effect of specific mutations on editing activity at the target base and bystanders. In addition, such a theoretical model would also improve our fundamental understanding of the biochemical and biophysical processes that take place during base editing.

Molecular dynamic (MD) simulations have been used to study the activity of BE complexes and the role of beneficial mutations to enhance overall editing activity (both at target and bystanders)13,14. Herein we present a comprehensive multi-scale theoretical approach to describe the molecular processes taking place during BE editing, explaining at the microscopic level the role of beneficial mutations in discriminating the target base over bystanders. To fulfill this goal, we built a general theoretical framework combining a discrete-state stochastic (chemical-kinetic) model and MD simulations, explicitly calculating the base editing probability at both the target base and bystanders. In our model, we include an important parameter, \(\triangle {E}_{m}\), the binding affinity between deaminases and ssDNA. This parameter was modulated by introducing various mutations into BE and its values were measured through MD simulations. This framework helps establish a relationship between mutations and BE editing selectivity. We then proposed a theoretical principle arguing that the BE selectivity is non-monotonically dependent on \(\triangle {E}_{m}\). It is argued that the highest BE selectivity can be obtained by varying the binding affinity. In addition, other relevant kinetic parameters are included in the model, such as the binding rate between Cas9 and ssDNA and the deamination rate of BE, allowing us to discuss how \(\triangle {E}_{m}\) correlates with those parameters to affect BEs editing selectivity. Our model successfully explains how the mutations influence the editing selectivity of A3A-BE3. Finally, we designed mutations to further improve the selectivity of the A3G-BE system and we verified the improved editing selectivity experimentally. Thus, the framework we propose opens multiple opportunities for future efficient engineering of BE using theory-driven methods.

Results

Kinetic model of base editing

We developed a discrete-state stochastic model to describe the dynamics of target and bystander editing. This is a minimal chemical-kinetic approach that considers the most relevant chemical states and features of base editing. For convenience, unless noted otherwise, we will use A3A-BE3 to edit the EGFP site 1 as an example.

In this theoretical model (Fig. 1), it is assumed that the Cas9 domain of CBE can bind to ssDNA with a rate u0, initiating the base editing (transition from state 0 to state 2). Alternatively, the protein complex can go to an unproductive state where editing cannot take place, with a rate of u4 (transition from state 0 to state 1). Next, either the Cas9 domain dissociates from DNA with a rate w0 (Transition from state 2 to state 0), or the target cytidine binds to the deaminase catalytic site with a rate u1 (transition from state 2 to state 3). Then the cytidine can either dissociate from the site with a rate w1 without being edited (backward step from state 3 to state 2), or it can be chemically transformed to uridine with a rate u3 (transition from state 3 to state 5). Similarly, the bystander cytidine may bind to the deaminase with a rate u2 (transition from state 2 to state 4), and subsequently, it can either unbind with a rate w2 without being edited (transition from state 4 back to state 2), or it can be chemically transformed with a rate u3 (transition from state 4 to state 6). After that, while Cas9 is still bound to DNA (being in the state 5 or 6), the deaminase can continue editing other cytidines in this region with the same sequence of events (transition to the states 9–12). Alternatively, if Cas9 dissociates from DNA, uridine will be transformed to thymidine through DNA repair (transitions from state 5 to states 7 and 13, or transition from state 6 to states 8 and 14). This U-to-T editing decreases the rebinding rate of Cas9 to ssDNA (transition 7\(\to \)5) if the endogenous DNA repair and replication machinery has changed the DNA sequence from G:C pair to A:T pair. In this case, the new DNA sequence does not perfectly match the spacer sequence of sgRNA. Because the repairing rate is unknown, the rebinding rate is assumed to be mu0 with \(0\le m\le 1\). The parameter m reflects how much the rebinding ability of the BE complex is lowered in comparison with the original substrate. If the DNA repairing rate is slow then m tends to be closer to 1; otherwise, m tends to be closer to 0. Note that the kinetic network in Fig. 1 is a minimal description of complex chemical processes that take place during base editing.

Fig. 1: Chemical-kinetic model of A3A-BE3 editing the EGFP site 1.
figure 1

Deaminase, Cas9, target and bystander base are represented by blue, orange, red and yellow squares, respectively. “C” represents cytosine, “T” represents thymine and “X” represents uridine or thymine. Editing is modeled as a multiple-step chemical reaction where Cas9 first binds to ssDNA, cytidine then binds to the catalytic site of the deaminase and is chemically converted to thymidine. Here, uj and wj (j = 0–4) represent the chemical-kinetic rates for various transitions. The model has a total of 15 states and can produce four outcomes (dashed squares): CTC (failed editing), CTT (only the target base is edited), TTC (only the bystander is edited), TTT (both target and bystander bases are edited).

To evaluate the dynamics of base editing, we explored the first-passage probabilities method successfully used in various problems in chemistry, physics, and biology15,16,17,18. In the case of EGFP site 1 editing by A3A-BE3 there are four possible outcomes as shown in Fig. 1: CTC (state 1, failed editing), CTT (state 13, only the target base is edited), TTC (state 14, only the bystander is edited) and TTT (state 12, both the target base and the bystander are edited). The explicit solution for the probability for the system to end up in one of these products is given below (see derivations in the Supplementary Information Appendix):

$${P}_{{CTT}}={P}_{5}\cdot \frac{{u}_{4}{w}_{0}({u}_{3}+{w}_{2})}{\left({u}_{2}+{w}_{0}\right)\left({u}_{3}+{w}_{2}\right)\left({u}_{4}+m{u}_{0}\right)-{u}_{2}{w}_{2}\left({u}_{4}+{{mu}}_{0}\right)-{{mu}}_{0}{w}_{0}\left({u}_{3}+{w}_{2}\right)}$$
(1)
$${P}_{{TTC}}={P}_{6}\cdot \frac{{u}_{4}{w}_{0}({u}_{3}+{w}_{1})}{\left({u}_{1}+{w}_{0}\right)\left({u}_{3}+{w}_{1}\right)\left({u}_{4}+m{u}_{0}\right)-{u}_{1}{w}_{1}\left({u}_{4}+{{mu}}_{0}\right)-{{mu}}_{0}{w}_{0}\left({u}_{3}+{w}_{1}\right)}$$
(2)
$${P}_{{TTT}}= {P}_{5}\cdot \left[1-\frac{{u}_{4}{w}_{0}\left({u}_{3}+{w}_{2}\right)}{\left({u}_{2}+{w}_{0}\right)\left({u}_{3}+{w}_{2}\right)\left({u}_{4}+m{u}_{0}\right)-{u}_{2}{w}_{2}\left({u}_{4}+{{mu}}_{0}\right)-{{mu}}_{0}{w}_{0}\left({u}_{3}+{w}_{2}\right)}\right]\\ +{P}_{6}\cdot \left[1-\frac{{u}_{4}{w}_{0}({u}_{3}+{w}_{1})}{\left({u}_{1}+{w}_{0}\right)\left({u}_{3}+{w}_{1}\right)\left({u}_{4}+m{u}_{0}\right)-{u}_{1}{w}_{1}\left({u}_{4}+{{mu}}_{0}\right)-{{mu}}_{0}{w}_{0}\left({u}_{3}+{w}_{1}\right)}\right]$$
(3)

\({P}_{{CTC}}\) can be calculated as one minus the sum of the above three probabilities. In equations [13], \({P}_{5}\) and \({P}_{6}\) are two intermediate parameters satisfying:

$${P}_{5}=\frac{{u}_{0}{u}_{1}{u}_{3}({u}_{3}+{w}_{2})}{\left({u}_{1}+{u}_{2}+{w}_{0}\right)\left({u}_{3}+{w}_{1}\right)\left({u}_{3}+{w}_{2}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{1}{w}_{1}\left({u}_{3}+{w}_{2}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{2}{w}_{2}\left({u}_{3}+{w}_{1}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{0}{w}_{0}\left({u}_{3}+{w}_{1}\right)\left({u}_{3}+{w}_{2}\right)}$$
(4)
$${P}_{6}=\frac{{u}_{0}{u}_{1}{u}_{3}({u}_{3}+{w}_{1})}{\left({u}_{1}+{u}_{2}+{w}_{0}\right)\left({u}_{3}+{w}_{1}\right)\left({u}_{3}+{w}_{2}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{1}{w}_{1}\left({u}_{3}+{w}_{2}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{2}{w}_{2}\left({u}_{3}+{w}_{1}\right)\left({u}_{0}+{u}_{4}\right)-{u}_{0}{w}_{0}\left({u}_{3}+{w}_{1}\right)\left({u}_{3}+{w}_{2}\right)}$$
(5)

In experiments, a common way to quantify editing efficiency is to measure the overall probability of editing the target cytidine, \({P}_{t}\)11,12. To compare these predictions with experimental results, \({P}_{t}\) was calculated as:

$${P}_{t}={P}_{{CTT}}+{P}_{{TTT}}$$
(6)

Similarly, the overall probability of editing the bystander cytidine, \({P}_{b}\), was calculated as:

$${P}_{b}={P}_{{TTC}}+{P}_{{TTT}}$$
(7)

Our goal is to parameterize the model by reproducing experimentally measured probabilities, \({P}_{t}\) and \({P}_{b}\). Here, we assume that the binding between the cytidine (both target and bystander) and the deaminase is mainly a diffusion-controlled process. Therefore, considering that target and bystander cytidine are chemically identical and very close spatially, we added an additional approximation:

$${u}_{2}={u}_{1}$$
(8)
$${w}_{2}={w}_{1}{e}^{\triangle \triangle {E}_{0}/{kT}}={w}_{1}{e}^{\left[\triangle {E}_{0}\left({bystander}\right)-\triangle {E}_{0}\left({target}\right)\right]/{k}_{B}T}$$
(9)

The physical meaning of these expressions is the following: the binding rate to the target or the bystander are the same, but the unbinding is governed by the strength of the interactions between the DNA substrate and the protein complex. In Eq. (9), the term \(\triangle {E}_{0}\) represents the binding free energy between the ssDNA binding motif and the deaminase. \(\triangle \triangle {E}_{0}\) represents the difference in \(\triangle {E}_{0}\) between the dissociation from the target base and the dissociation from the bystander base. This difference arises from the sequence shift in the binding interface. An example is shown in Fig. 1, where the sequence of ssDNA binding motif changes from “T-1C0” in the case of target editing, to “G-1C0” in the case of bystander editing. This change can be formalized by a mutation from thymine to guanine at position -1, which perturbs the binding free energy and further influences the unbinding rate w of the cytidine from the catalytic site. Note that this approximation can also be explained using thermodynamic arguments, since the ratio between rates of binding and unbinding is related to the free energy difference between two states: the state where the protein-RNA complex is bound to the DNA chain and the state where both DNA and protein complex are free, \(\frac{{u}_{1}}{{w}_{1}}={e}^{-\triangle {E}_{0}\left({target}\right)/{k}_{B}T}\), \(\frac{{u}_{2}}{{w}_{2}}={e}^{-\triangle {E}_{0}\left({bystander}\right)/{k}_{B}T}\). Using Eq. (8) one can derive the result in Eq. (9).

Similarly, any deaminase mutation can be represented as a perturbation in binding free energy relative to the wild type,

$${w}_{1,{mutation}}={w}_{1,{WT}}{e}^{\triangle \triangle {E}_{m}/{k}_{B}T}={w}_{1,{WT}}{e}^{\left[\triangle {E}_{0}\left({mutation}\right)-\triangle {E}_{0}\left({WT}\right)\right]/{k}_{B}T}$$
(10)

\(\triangle \triangle {E}_{m}\) represents the difference in free energy due to mutations.

Substituting Eqs. (810) into Eqs. (67), we obtain:

$${P}_{t}=\frac{\left({\gamma }_{1}+m+{\gamma }_{1}{\gamma }_{3}+{\gamma }_{1}{\gamma }_{2}{\gamma }_{3}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)+({\gamma }_{1}+m)(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}})}{\left({\gamma }_{1}+m+{\gamma }_{1}{\gamma }_{3}+{\gamma }_{1}{\gamma }_{2}{\gamma }_{3}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\cdot [\left(2+{2\gamma }_{1}+{{\gamma }_{1}\gamma }_{3}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)-{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\left(1+{\gamma }_{1}\right)\left(1+2{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}+{e}^{\frac{\triangle \triangle {E}_{0}}{{k}_{B}T}}\right)]}$$
(11)
$${P}_{b}=\frac{\left({\gamma }_{1}+m+{\gamma }_{1}{\gamma }_{3}+{\gamma }_{1}{\gamma }_{2}{\gamma }_{3}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)+({\gamma }_{1}+m)(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}})}{\left({\gamma }_{1}+m+{\gamma }_{1}{\gamma }_{3}+{\gamma }_{1}{\gamma }_{2}{\gamma }_{3}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\cdot [\left(2+{2\gamma }_{1}+{{\gamma }_{1}\gamma }_{3}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)-{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}\left(1+{\gamma }_{1}\right)\left(1+2{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}+{e}^{\frac{\triangle \triangle {E}_{0}}{{k}_{B}T}}\right)]}$$
(12)
$${\gamma }_{1}={u}_{4}/{u}_{0}$$
(13)
$${\gamma }_{2}={w}_{1,{WT}}/{u}_{3}$$
(14)
$${\gamma }_{3}={w}_{0}/{u}_{1}$$
(15)

Equations (1115) give the full analytical expressions in terms of kinetic rates and binding affinities that can be used to calculate the editing probability. There are six free parameters to describe the base editing process (\({\gamma }_{1}\), \({\gamma }_{2}\), \({\gamma }_{3}\), \(m\), \(\triangle \triangle {E}_{0}\), \(\triangle \triangle {E}_{m}\)) but this number can be reduced using additional information. For example, previous binding experiments19 have indicated that A3A binds to ssDNA with \({K}_{d}=57\mu M,{K}_{M}=62\mu M\) and \({k}_{{cat}}=1.1/s\). From these values, one can infer that \({w}_{1,{WT}}=12.54/s\) and \({u}_{3}=1.1/s\). Therefore, after the cytidine binds to the catalytic site, the relative probability between unbinding and the chemical transformation step, \({\gamma }_{2}\), is 11.4. Next, if the changed ssDNA sequence no longer perfectly matches the sgRNA sequence, we assume that successful editing prevents rebinding of Cas9 to ssDNA, therefore \(m\)=0. Nevertheless, we show below that this assumption only has a minor effect on the final results. Lastly, we performed all-atom computational simulations to estimate \(\triangle \triangle {E}_{0}\) and \(\triangle \triangle {E}_{m}\), as shown in the next section. As a result, only two free parameters remain in the model, \({\gamma }_{1}\) and \({\gamma }_{3}\) (Eqs. 13 and 15), both of which are parameterized by reproducing experimental values of \({P}_{t}\) and \({P}_{b}\).

Computational estimates of binding free energy changes

We chose four CBEs developed by the Joung group:11 A3A(S99A), A3A(Y130F), A3A(N57Q), and A3A(N57A) to calculate the binding free energy changes between ssDNA and A3A. These CBE variants reduce the bystander effect to different extents while maintaining a high probability of on-target editing. The binding interface in the wild type A3A-ssDNA binding complex is shown in the crystal structure (PDB ID: 5KEG) (Fig. 2a). The carbonyl oxygen of Ser99 forms a hydrogen bond with the N4 atom of the cytidine in the catalytic site (dC0). The hydroxyl group of Tyr130 forms a hydrogen bond with the 5’-phosphate of dC0. Lastly, the nitrogen atom in the sidechain of Asp57 forms a hydrogen bond with the O3 atom of dC0. Therefore, all four CBE variants appear to destabilize the binding between A3A and ssDNA (\(\triangle \triangle {E}_{m}\) > 0) by breaking this hydrogen-bonding network. In addition, since A3A recognizes the T-1C0 instead of the G-1C0 motif, the binding free energy to the deaminase should be higher (more repulsive) for the bystander cytidine than for the target cytidine (\(\triangle \triangle {E}_{0}\) > 0). To quantitatively calculate \(\triangle \triangle {E}_{0}\) and \(\triangle \triangle {E}_{m}\), we utilized the so-called “alchemical free-energy calculations” based on MD simulations20,21. A thermodynamic cycle was constructed to convert \(\triangle \triangle {E}_{0}\) and \(\triangle \triangle {E}_{m}\) (Fig. 2b, \(\triangle {G}_{3}-\triangle {G}_{1}\)) to the difference between two slow alchemical transitions (Fig. 2b, \(\triangle {G}_{2}-\triangle {G}_{4}\)). One transition is the free energy change for the A3A-ssDNA complex due to mutations (Fig. 2b, \(\triangle {G}_{2}\)) whereas the other is the free energy change for A3A alone due to mutations (Fig. 2b, \(\triangle {G}_{4}\)). Calculated values indeed show that mutations cause an apparent increase in the deaminase-ssDNA binding free energy (Fig. 2c), consistent with predictions based on the structural data.

Fig. 2: Calculation of binding free energy changes between ssDNA binding motif and A3A deaminase under various mutations.
figure 2

a Hydrogen-bonding network between cytidine dC0 and A3A residues 57, 99 and 130. b Thermodynamic cycle used to calculate the binding free energy changes: ssDNA (black line), A3A (blue balloons). A mutation at the binding interface is shown by a transition from small orange circle to orange triangle. c Computationally estimated changes in binding free energy for A3A deaminase mutations S99A, Y130F, N57Q, N57G and binding free energy change between A3A binding to target and bystander cytidines. The energy unit is kBT and T = 300 K. Data are presented as mean values ± s.e.m, estimated by Bennett’s acceptance ratio method for 200,000 data points. Source data are provided as a Source Data file.

The rationale for A3A mutants that reduce the bystander effect

To check whether our model can reproduce the experimentally measured on-target and bystander editing probability, we substituted \(\triangle \triangle {E}_{0}\) and \(\triangle \triangle {E}_{m}\) calculated above into Eqs. (1115) and adjusted \({\gamma }_{1}\) and \({\gamma }_{3}\). The resulting theoretical prediction is in very good agreement with the experimental measurements11 (Fig. 3a), with values \({\gamma }_{1}=\frac{{u}_{4}}{{u}_{0}}=2.1\) and \({\gamma }_{3}=\frac{{w}_{0}}{{u}_{1}}=2.9* {10}^{-5}\). The value of \({\gamma }_{1}\) indicates that there is a significant fraction of BEs failing to initiate editing, whereas the value of \({\gamma }_{3}\) suggests that the residence time of Cas9 on ssDNA is sufficient for the deaminase to function. We note here that the choice of m, which quantifies the effect of sgRNA mismatch on the rebinding rate of Cas9 and ssDNA, does not significantly affect the result (Fig. S1). The model also well produced the editing patterns at multiple genomic loci (Fig. S3), demonstrating the generality of this model.

Fig. 3: Theoretical model of the A3A-BE3 editing system.
figure 3

a Comparison between theoretical calculations (solid lines) and experimental data11 (circles, representing the mean ± s.e.m. of three independent biological replicates). \(\triangle \triangle {E}_{m}\) represents the perturbation in binding free energy due to the different mutations; kBT is the unit of energy; \({P}_{t}\) and \({P}_{b}\) represent the overall probability of editing target and bystander cytidine, respectively. b Calculated editing probabilities for products CTT, TTC and TTT. \({P}_{t}={P}_{{CTT}}+{P}_{{TTT}}\); \({P}_{b}={P}_{{TTC}}+{P}_{{TTT}}\). Source data are provided as a Source Data file.

Our theoretical model can be used to explain why the single mutation N57G greatly improves the editing selectivity of A3A-BE3. First, the ratio between the probabilities of having the target cytidine edited before the bystander (Fig. 1, transition state \(2\to 3\to 5\to \ldots \)) and that of the reversed events (Fig. 1, transition state \(2\to 4\to 6\to \ldots \)) can be calculated as:

$${R}_{1}=\frac{P({state}2\to 3\to \ldots )}{P({state}2\to 4\to \ldots )}=\frac{\frac{{u}_{3}}{{u}_{3}+{w}_{1}}}{\frac{{u}_{3}}{{u}_{3}+{w}_{2}}}=\frac{1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}}{1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{m}}{{k}_{B}T}}}$$
(16)

with \({\gamma }_{2}=11.4\). In the case of A3A, \({R}_{1}\) can be approximated as \({e}^{\frac{\triangle \triangle {E}_{0}}{{k}_{B}T}}\). As A3A significantly prefers the TC motif to the GC motif (\(\triangle \triangle {E}_{0} \sim 6{k}_{B}T\)) this ratio is larger than 400. As a result, for both A3A(WT) and A3A(N57G), the probability of having only the bystander edited is very low (Fig. 3b, blue line, almost zero). After the target cytidine is edited (Fig. 1, state 5), the system has the choice of getting released with the product CTT (Fig. 1, state 13) or to continue editing the bystander, leading to the product TTT (Fig. 1, state 12). The outcome is largely influenced by the ratio between \({w}_{2}\) and u3. If \({w}_{2}\) is significantly larger than u3, bystander editing will be blocked because the residence time for the bystander cytidine in the catalytic site is too short to complete the transition to thymidine. Analytically, after the target cytidine gets edited, the probability ratio between CTT and TTT outcomes can be calculated as:

$${R}_{2}=\frac{P({state}5\to 13)}{P({state}5\to 12)}=\frac{{w}_{0}}{{u}_{2}}\left(1+\frac{{w}_{2}}{{u}_{3}}\right)={\gamma }_{3}\left(1+{\gamma }_{2}{e}^{\frac{\triangle \triangle {E}_{0}+\triangle \triangle {E}_{m}}{{k}_{B}T}}\right)$$
(17)

For A3A(WT), \({R}_{2}\) is 0.17, meaning that the dominant product is TTT (Fig. 3b, purple square at \(\triangle \triangle {E}_{m}=0\)). This explains that for the wild-type A3A the editing efficiency of the target cytidine is similar to that of the bystander. In sharp comparison, as \(\triangle \triangle {E}_{m}\) increases by 4.5 \({k}_{B}T\) for A3A(N57G), \({R}_{2}\) is 14.8. Now the dominant edited product changes to CTT (Fig. 3b, green line at \(\triangle \triangle {E}_{m}\)= 4.5 \({k}_{B}T\)). In this case, A3A(N57G) minimizes the bystander effect while maintaining a high probability of editing the target base.

Similar arguments can be presented for the other three A3A mutants (S99A, Y130F, and N57Q). Calculations (Table S1) show that \({R}_{2}\) gradually increases from 0.42 for S99A to 4.91 for N57Q mutants, indicating that the bystander effect gradually decreases. This is consistent with experimental findings (Fig. 3). However, it is critical to note that to gain high editing selectivity, mutated residues have a non-monotonic effect in the deaminase-ssDNA interface. Here, selectivity is defined as the difference in probabilities between editing the target and editing the bystander. Weakening the binding interface up to 4–6 kBT (depending on the system) greatly improves selectivity (Fig. 3a), but it drops when \(\triangle \triangle {E}_{m}\) continues further increasing. This result can be explained using the following physical considerations. Increasing \(\triangle \triangle {E}_{m}\) leads to faster-unbinding rates between cytidine and deaminase. At moderate values of \(\triangle \triangle {E}_{m}\), target editing is less affected (Fig. 1, state 2\(\to \)5) but bystander editing is blocked (Fig. 1, state 5\(\to \)12). However, for very large values of \(\triangle \triangle {E}_{m}\), both editing pathways are essentially blocked and the system prefers to go into the inactive state (Fig. 1, state 1). Therefore, proper modulation of the binding interface is the key to optimize base editing selectivity. We further prove this point in the next section.

The computational model helps design new A3G-BEs with improved editing selectivity

In this section, we employ our theoretical model to optimize the editing selectivity of the base editor A3G3.1 (Fig. 4a). First, the editing profiles at both target and bystander bases were calculated by using Eqs. (1115). Our calculations show that improving the editing selectivity of A3G3.1 requires mutations that increase \(\triangle \triangle {E}_{m}\) by 2–3 kBT (shaded area in Fig. 4a). Second, specific mutations were designed and \(\triangle \triangle {E}_{m}\) was calculated for each mutation by alchemical free-energy calculations as detailed above. Four mutations (T218S, T218N, T218I and T218G) fell into these selection criteria. A failure example is T218W, which loses the editing activity at the target base owing to overly increased \(\triangle \triangle {E}_{m}\) (Fig. 4a). We then experimentally verified these four mutations at three genomic loci containing the “TCC” motif, including EMX1 #a3, PPP1R12C #a1, and ATM #1. We chose these target sites with the “TCC” motif, which are generally more challenging over “ACC” or “GCC” for selectively editing the second C, since “T” and “C” are structurally more similar. In the “TCC” case, the deaminase tends to treat “T” as a “C” and preferentially edits the bystander first “C” as well. In our tests, A3G3.14 (A3G3.1 with T218S) and A3G3.15 (A3G3.1 with T218N) generally show much improved editing selectivity (Fig. 4b), with marginally or modestly decreased editing efficiency. Therefore, A3G3.14 and A3G3.15 were further tested at other five genomic loci, including MMS22L #1, FANCE #1, MRPL44 #1, FANCF #c1, and MRPL40 #1 (Fig. 4c). Compared to the original A3G3.1, the target-to-bystander editing ratio increases from average 2.9 to 8.6-fold with mutations.

Fig. 4: Engineering of A3G-BEs.
figure 4

a Theoretical calculation. C6 represents the target base. C5 represents the bystander. The shaded area represents the region with improved editing selectivity. b Experimental measurements at three genomic loci for four mutations picked by theoretical model; A3G3.1 is the full-length APOBEC3G deaminase with a set of mutations which increase the catalytic efficiency. A3G3.8, 3.9, 3.14, and 3.15 are A3G3.1 with T218G, T218I, T218S, and T218N, respectively. Bar plots represent the mean ± s.d. of three independent biological replicates. c Experimental measurements at other five genomic loci for A3G3.14 (T218N) and A3G3.15 (T218S). Bar plots represent the mean ± s.d. of three independent biological replicates, except for the bar representing the editing efficiency of A3G3.14 at FANCF #c1 site, which shown the mean ± s.d. from four biological replicates. Source data are provided as a Source Data file.

Our results indicate that mutagenesis stringency and genomic sites are tightly coupled in determining the target-to-bystander editing ratio. Mutagenesis stringency influences the overall editing patterns while specific genomic sites dictate the mutation with the best performance. The basic rule is that relatively large mutagenesis stringency (i.e., high \(\triangle \triangle {E}_{m}\)) is needed for genomic sites with low editing selectivity, and vice versa. Our tested eight genomic loci can be divided into two types in regards to the A3G3.1 editing selectivity. The first group, including EMX1 #a3, FANCE #1, and MRPL40 #1 sites, showed low selectivity, as expressed by the target-to-bystander editing ratio around 1.07–1.23; whereas the second group, including PPP1R12C #a1, ATM #1, MMS22L #1, MRPL44 #1 and FANCF #c1 sites, showed selectivity to some extent, with the target-to-bystander editing ratio ranging from 2.87 to 5.97. This natural site-dependent difference in selectivity arises from multiple reasons such as sequence context and the levels of DNA accessibility. Therefore, the first group needs mutations with higher \(\triangle \triangle {E}_{m}\) than the second group. The theoretical model predicts that \(\triangle \triangle {E}_{m}\) follows an order of S < N (Fig. 4a). As a result, for the first group, A3G3.15 (T218N) generally performs better than A3G3.14 (T218S). In contrast, for the second group, A3G3.14 (T218S) performs better. These results indicate that mutagenesis stringency and genomic sites should be considered simultaneously during the designing process.

Currently, one difficulty in designing BE is that there are few methods to predict the editing pattern for a novel mutation before experimental validation. In addition, the same mutation can function differently at different genomic loci. Using our model, the editing patterns of those two mutations on A3G3.1 are computationally predictable, and well-validated by experiments (Fig. 4b and 4c). This result demonstrates the power of combining theoretical and experimental approaches. EMX1 #a3 site was also tested in three cell lines, K562, Jurkat, and HeLa (Supplementary Fig. 4). Although these cell lines generally have low transfection efficiency, we still observed an increase of the target-to-bystander editing ratio in A3G3.15 treated cells, compared to those treated by A3G3.1 (Supplementary Fig. 4 and Fig. 4b), for two cell lines, K562 (two-tailed p = 0.0005 with unpaired t-test) and Jurkat (p = 0.0001). The improvement for Hela cells is less significant and needs further optimization in the future.

Discussion

In this work, we developed a theoretical framework to understand the molecular mechanisms of base editing. Our approach suggests several general rules to design BEs with improved editing selectivity. This goal is fulfilled by modulating the binding affinity between deaminase and ssDNA using mutagenesis (\(\triangle \triangle {E}_{m}\)). The principle is to guarantee that the residence time of deaminase on ssDNA is sufficiently long to complete the editing of the first on-target site, while being too short for editing the second (bystander) site. Our theoretical method predicts optimal values for \(\triangle \triangle {E}_{m}\). Away from these optimal values, selectivity decreases. Therefore, instead of testing experimentally a set of candidate BE mutants, one can instead set up a computational pre-screening process by estimating the \(\triangle \triangle {E}_{m}\) of those variants, and only candidates near the optimal value can then be tested experimentally. Herein, we used alchemical free-energy calculations to estimate \(\triangle \triangle {E}_{m}\). The accuracy of this method has been validated in the A3A and A3G system (Figs. 3 and 4). Future work will help to develop carefully parameterized scoring functions for ssDNA-protein interactions or combine machine learning techniques22 so that the prediction of \(\triangle \triangle {E}_{m}\) can be accelerated. In addition, when estimating \(\triangle \triangle {E}_{0}\) and \(\triangle \triangle {E}_{m}\), our model only considers the local sequence near the target base and neglects the long-range contributions from other bases in the same editing window. Because the local sequence context is a major chemical factor in determining the relative outcomes of bystander edits vs target site edits, with such approximations our model can still explain the existing experimental data (Fig. 3) and guide the design of new mutations (Fig. 4). However, the long-range contributions might serve as additional regulators and requires more detailed investigations in the future.

Equations (1115) indicate that for a given system the editing probability is regulated by two other parameters, \({\gamma }_{1}\) and \({\gamma }_{3}\), in addition to \(\triangle \triangle {E}_{m}\). Therefore, we plotted the editing probability for different values of \({\gamma }_{1}\) (Fig. 5a) and \({\gamma }_{3}\) (Fig. 5b). We first reduced the parameter \({\gamma }_{1}\) (Fig. 5a) which can be achieved by increasing the on-rate of Cas9 to ssDNA. It turns out that the editing selectivity for the WT system is not affected by \({\gamma }_{1}\) (Fig. 5a, solid blue line vs dashed blue line at \(\triangle \triangle {E}_{m}=0\)) as the efficiencies of both target and bystander editing increase synchronously. However, the selectivity greatly improves when \(\triangle \triangle {E}_{m}\) is 4–6 kBT, meaning that \({\gamma }_{1}\) amplifies the deaminase mutation regulation effect. This suggests an effective combination strategy in the design of highly selective BE: optimization of \(\triangle \triangle {E}_{m}\) first, then reducing \({\gamma }_{1}\) to amplify this effect. We then reduced the parameter \({\gamma }_{3}\) (Fig. 5b). Our calculation indicates that reducing \({\gamma }_{3}\) does not change the maximum editing selectivity but induces a right shift in the editing profile, i.e., a larger \(\triangle \triangle {E}_{m}\) value is required to achieve the maximum editing selectivity.

Fig. 5: Theoretical calculations on regulating the base editing pattern of A3A-BE3.
figure 5

Base editing pattern of A3A-BE3 regulated by (a) \({\gamma }_{1}\) and (b) \({\gamma }_{3}\). The definition of \(\triangle \triangle {E}_{m}\), \({\gamma }_{1}\) and \({\gamma }_{3}\) can be found in Eqs. (10,1315). \({P}_{t}\) and \({P}_{b}\) are the overall probabilities of editing the target and bystander cytidine, respectively. The difference between \({P}_{t}\) and \({P}_{b}\) is shown in blue. The setting with original parameters is represented by solid lines (case 1) whereas variants are represented by dashed lines (case 2: \({\gamma }_{1}\) divided by five; case 3: \({\gamma }_{3}\) divided by five). Source data are provided as a Source Data file.

Another interesting question is whether a general mutation to all BEs homologs exists that optimizes editing. Unfortunately, we found that a mutation working perfectly for one type of deaminase may fail for another type, even when they are homologs. For example, A3A mutations N57G and Y315 greatly reduce the bystander effect while maintaining a high probability of target editing, but A3G N244G almost loses the base editing ability (A3A N57 aligns with A3G N244) (Supplementary Fig. 2a). The theoretical model developed above explains this negative outcome, since \(\triangle \triangle {E}_{m}\) for N244G turns out to be much larger than its optimal value (Supplementary Fig. 2b). Our model shows that an energy shift of only few kBT can dramatically change the selectivity of a BE, from its optimal value to zero. This is due to subtle differences between homologs: the on-rate of A3G binding to ssDNA is lower that for A3A, as the former can form a large dimer that interferes with binding. In addition, A3G has lower \(\triangle \triangle {E}_{0}\) (see Eq. (9)) because the sequence of the ssDNA binding motif changes from “C-1C0” in the target editing to “T-1C0” in the bystander editing. The energy perturbation from C-1 to T-1 is smaller than that from T-1 to G-1 in the case of A3A. These differences cause A3A and A3G to have distinct behavior under equivalent mutations. In fact, even for the same BE, editing different loci can change the binding free energy by a few kBT because at different loci the neighboring bases of the bystander may vary. Therefore, each BE may require a unique optimization to achieve high editing selectivity. Nevertheless, our approach gives quantifiable parameters that can be used to accelerate the search for best editors. In summary, a general design strategy would be (a) employing the chemical-kinetic model (Eqs. (1115)) to determine the binding free energy changes required to achieve the maximum editing selectivity, \(\triangle {E}_{{peak}}\); (b) designing mutations in the deaminase, estimating \(\triangle \triangle {E}_{m}\) and selecting the ones near \(\triangle {E}_{{peak}}\); (c) keeping mutations picked in the previous step and designing extra mutations that increase the binding rate of Cas9 to DNA substrate; (d) experimental validation of these changes.

Methods

Free energy calculations by molecular dynamic simulations

We utilized MDs based on alchemical free-energy calculations20,21 (Fig. 2) to estimate the binding free energy changes under various mutations11. All simulations were carried out using the Gromacs package23. Amber99sb*ILDN force field with bsc0 correction for nucleic acids was used24,25,26,27. The integration time step was set to 2 fs. The initial states of the A3A-ssDNA28 and the A3G-ssDNA29 binding complex were taken from their crystal structures (PDB ID: 5keg and 6bux, respectively). Then we used the pmx webserver30,31 to generate hybrid structures and topologies representing mutations. Each system was solvated in a cubic box with TIP3P water molecules. The ions concentration was set to 0.1 M. The dimension of the box is 9 nm. The temperature was maintained at 300 K by the Berendsen thermostat32 while the pressure was maintained at 1.0 atm by using the Parrinello-Rahman barostat33. Electrostatic interactions were calculated by the Particle Mesh Ewald method34. The soft-core function was used for the nonbonded interactions during the alchemical transitions35. For each system, energy minimization was first performed, followed by 1 ns NVT and 1 ns NPT equilibration with the protein configuration restrained. Then the system was further equilibrated for 5 ns without any restrain. The last snapshot of the trajectory served as the starting configuration for the following alchemical transitions.

The alchemical transition (\(\lambda =0\to \lambda =1\)) was divided into 21 consecutive windows with the bin size of 0.05. For each window i, \(\lambda \) was first increased from 0 to \({\lambda }_{i}({\lambda }_{i}={{\mathrm{0,0.05,0.1,0.15}}}\ldots )\) with a slow rate 10−8/step, then was fixed to \({\lambda }_{i}\) for 40 ns production run. dH/dl values were recorded every 100 steps. The free energy and error bar were estimated by Bennett’s acceptance ratio method36.

Experiment

Mammalian cell culture

HEK293T cells (American Type Culture Collection, CRL-3216) were cultured in GlutaMAXTM high-glucose Dulbecco’s modified Eagle’s medium (DMEM, Thermo, cat. 10569044). HeLa (ATCC, CCL-2), K562 (ATCC, CRL-3343), and Jurkat (ATCC, TIB-152) lines were maintained in GlutaMAXTM RPMI 1640 medium (HEPES buffered, Thermo, cat. 72400146). The culture media were all supplemented with 10% fetal bovine serum (FBS, Thermo, cat. 10437028) and 100 U/mL penicillin-streptomycin (Thermo, cat. 15140122). Cells were grown in a humid atmosphere at 37 °C with 5% CO2. HEK293T and HeLa cells were passaged at a ratio of 1:4 when reaching 90% confluency by using TrypLE Express (Thermo, cat. 12605028). Jurkat and K562 cells were subcultured and added with fresh medium every 2 or 3 days to keep the density below 106 cells/mL. Mycoplasma testing was performed monthly using a mycoplasma PCR detection kit (abm, cat. G238).

Plasmid construction

The full-length human codon-optimized wild-type A3G with a set of mutations (P200A + N236A + P247K + Q318K + Q322K) was synthesized as gBlock and inserted into the BE4max construct (Addgene #112093) to replace the rAPOBEC1 region, resulting in A3G3.1. To do so, both insertion and vectors were amplified using primers with overhangs containing Esp3I recognition sites, which would generate compatible complementary sticky ends after cutting. Then the one-pot Golden Gate assembly was employed to cut and ligate two amplified pieces by using Esp3I and T4 DNA ligase (New England Biolabs, cat. R0734L and M0202L). Likewise, the Y315F and N244G variants were respectively generated by designing an extra pair of primers containing the indicated mutations and performing a three-piece assembly. The A3G3.8, 3.9, 3.14, and 3.15 constructs were Golden Gate cloned to introduce point mutations T218G, T218I, T218S, and T218N to A3G3.1.

Cell transfection, genomic DNA extraction, amplicon sequencing, and analysis

Cell transfection was performed as previously described with slight modifications12. Briefly, HEK293T or HeLa cells were seeded into a poly-D-lysine–coated 48-well plate (Corning, cat. 354509) at a density of 4.5 × 104 cells per well in 250 μL antibiotic-free culture medium supplemented with 10% FBS. In about 12–16 h, upon reaching 70% confluency, cells of each well were transfected. K562 and Jurkat cells were reverse-transfected at a density of 2 × 104 cells per well. Four cell types were all transfected with 750 ng BE plasmids and 250 ng sgRNA plasmids using 1.5 μl Lipofectamine 2000 (Thermo, cat. 11668019) dispersed in 25 μL Opti-MEM (Thermo, cat. 31985062) according to the manufacturer’s instructions. Three days later, the genomic DNAs were collected by removing the medium by aspiration or centrifugation, washing the cells gently with PBS (Thermo, cat. 10010049), and lysing the cells at 37 °C for 1–2 h with 100 μl per well of lysis buffer containing 10 mM Tris-HCl (pH 7.5, Thermo, cat. 15567027), 0.05% SDS (Sigma, cat. 71725), and 25 μg/mL proteinase K (Fisher BioReagents, cat. BP1700). The cell lysates containing genomic DNA were then subjected to heat inactivation of the proteinase K at 80 °C for 0.5–1 h. For Sanger sequencing, the genomic DNA amplification primers and the Sanger sequencing primers are listed in Supplementary Table 2. The 20 μL PCR reactions containing 0.4 U Q5 High-Fidelity DNA polymerase (New England Biolabs, cat. M0491L), 0.5 µM of forward and reverse primers, and 100 ng of genomic DNA, were performed using a 35-cycle PCR program. The Sanger sequencing results were analyzed using EditR online software (http://baseeditr.com/).

A total of 100 ng genomic DNA was amplified at the EMX1 target site by using primers attached with the partial Illumina adapters and 8 bp compatible and nucleotide-balanced indices on both 5′ and 3′ end. The forward and reverse primers are as follows. For: 5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNNNTGTGGTTCCAGAACCGGAG-3′; Rev: 5′-GACTGGAGTTCAGACGTGTGCTCTTCCGATCTNNNNNNNNCTCTGCCCTCGTGGGTTT-3′. The protospacer sequence is 5′-GAGTCCGAGCAGAAGAAGAA-3′. The amplicon sequence is 5′-TGTGGTTCCAGAACCGGAGGACAAAGTACAAACGGCAGAAGCTGGAGGAGGAAGGGCCTGAGTCCGAGCAGAAGAAGAAGGGCTCCCATCACATCAACCGGTGGCGCATTGCCACGAAGCAGGCCAATGGGGAGGACATCGATGTCACCTCCAATGACTAGGGTGGGCAACCACAAACCCACGAGGGCAGAG-3′.

Amplicons were pooled, column purified (Qiagen), recovered in nuclease-free water (Thermo, cat. 10977023), and quantified by the Qubit dsDNA HS assay (Thermo, cat. Q32851). A volume of 25 μL sample with the final concentration adjusted to 20 ng/μL was submitted for Amplicon-EZ sequencing (Genewiz). Fastq files were then downloaded from Genewiz Ftp server and analyzed by using CRISPResso2 (https://github.com/pinellolab/CRISPResso2) to align reads and quantify the base editing efficiency and frequency37.

All gblocks and primers were synthesized by Integrated DNA technologies.

Statistical analysis

All experiments were performed with 2–4 independent biological replicates. Bar plots in Fig. 4, and Supplementary Fig. 4 represent means ± standard derivation (s.d.). Bar plot in Supplementary Fig. 2a represents means ± standard error of the mean (s.e.m.).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.