
1 Introduction

In the post-quantum era, we need to study the security of cryptographic systems against quantum attackers. In fact, many cryptographic schemes turn out to be less secure against attacks based on quantum computing. Some asymmetric cryptographic primitives face devastating attacks due to Shor’s algorithm [24]. In contrast, the impact of quantum computing on secret-key cryptography seems to be less severe. Most of the existing works are based on Grover’s algorithm [12] and Simon’s algorithm [25]. Grover’s algorithm solves the unstructured search problem with a quadratic speed-up, while Simon’s algorithm finds a hidden period with polynomially many quantum queries. In such attacks, the quantum oracle of the target cipher has to be implemented. Due to its importance, AES is one of the most studied ciphers [3, 11, 15, 17, 18] in the context of efficient synthesis of quantum circuits. These implementations can potentially be used in quantum attacks against symmetric-key primitives involving AES [4, 9, 13, 16]. In this paper, we construct quantum circuits of AES with fewer qubits, and the techniques involved may provide more flexible trade-offs between the number of qubits and the circuit depth.

A quantum oracle for any classical vectorial Boolean function can be constructed with the Clifford \(+\) T gate set, which consists of the Hadamard gate (H), the Phase gate (S), the controlled-NOT gate (CNOT), and the non-Clifford T gate. There are some works on synthesizing optimal reversible circuits for reversible Boolean functions. Shende et al. [22] considered the synthesis of 3-bit reversible logic circuits using the NOT gate, the CNOT gate, and the Toffoli gate. Golubitsky et al. [10] proposed optimal 4-bit reversible circuits composed of the NOT gate, the CNOT gate, the Toffoli gate, and the 4-bit Toffoli gate. The goal of synthesizing an optimal quantum circuit implementation is to reduce the circuit depth and the number of qubits [3, 11, 17, 18]. According to our current understanding of fault-tolerant quantum computing, the T-depth is probably the most important metric. However, before practical quantum computers are built, reducing the number of qubits is also meaningful, and it may provide more flexible qubit and depth trade-offs.

Recently, the construction of efficient quantum circuits of AES has attracted much attention. In [8], Datta et al. presented a reversible implementation of AES. In [15], Jaques et al. proposed a method to minimize the depth-times-width cost metric for quantum circuits of AES. In [11], Grassl et al. proposed a quantum circuit of AES aiming at the lowest possible number of qubits. In [17], Kim et al. showed some time-memory trade-offs for key search on AES. In [3], Almazrooie et al. presented a new quantum circuit of AES-128. By utilizing the classical algebraic structure of the S-box [5], Langenberg et al. in [18] showed a new way to construct the quantum circuit of AES’s S-box, based on which Langenberg et al. proposed an efficient quantum circuit of AES-128. Compared to Almazrooie et al.’s and Grassl et al.’s estimates, the circuit proposed by Langenberg et al. could reduce the number of qubits and Toffoli gates simultaneously. Langenberg et al.’s work shows that we can construct an improved quantum circuit of AES by constructing a more efficient classical circuit of AES.

There are several works on how to reduce the gate number of AES in the classical setting [1, 7, 14, 19, 28]. In [14], Itoh and Tsujii proposed the tower field architecture for calculating the multiplicative inverse in \(\mathbb {F}_{2^8}\), which is a powerful technique for designing compact hardware implementations of the S-box. By using the tower field technique, Canright in [7] showed an efficient method for computing the multiplicative inverse of the input. In [6], Boyar and Peralta proposed a depth-16 circuit for the S-box of AES by using the tower field implementation.

Contribution. Firstly, we propose an improved quantum circuit for the S-box\(^{-1}\) of AES based on the improved classical circuit of the inverse of the AES S-box [28, 29]. Also, by exploiting some useful linear relationships, we propose improved qubit-depth trade-offs for the quantum circuits of the S-box/S-box\(^{-1}\) of AES. The improvements of the S-box and its inverse lead to corresponding improvements of the quantum circuits of the round function and the key-schedule algorithm of AES. Taking AES-128 as an example, we can generate \(W_{4i}\) by XORing \(SubWord(RotWord(W_{4i-1}))\), \(Rcon(i/s)\), \(W_{4i-1}\), \(W_{4i-5}\), and \(W_{4i-9}\) to \(W_{4i-13}\) (for \(4 \le i \le 10\)). In other words, we can obtain \(W_{4i}\) without introducing new qubits or cleaning up \(W_{4i-13}\) (for \(4 \le i \le 10\)). That is, our quantum circuits for the key schedules of AES-128/-192/-256 need 128/192/256 qubits and 6 ancilla qubits, which is fewer than in the previous works [3, 11, 15, 18].

Secondly, we propose an improved zig-zag method with fewer qubits. To compute the output of the AES round function, we need 256 qubits to store the 128-qubit input and the 128-qubit output of the round function. In other words, we need at least 256 qubits in the zig-zag method. By using our quantum circuits of AES’s S-box and S-box\(^{-1}\), we propose an improved zig-zag method for AES-128/-192/-256 with 256 qubits, which matches this minimum. That is, our improved zig-zag method requires 256/256/256 qubits for AES-128/-192/-256, while the prior work needed at least 528/656/656 qubits for AES-128/-192/-256, respectively.

We summarize the quantum resources to implement AES in Table 1. The # Toffoli/CNOT/NOT means the number of Toffoli gates, CNOT gates, and NOT gates, and # qubits means the number of qubits. We will adopt the same notations in the following tables. As shown in Table 1, our quantum circuit implementations of AES require fewer qubits than the prior works. Also, our quantum circuits of AES-128/-256 can obtain the best trade-off of \(T\cdot M\), where T is the Toffoli depth and M is the number of qubits.

Table 1. Summary of the quantum resources to implement AES

Remark. In this work, our metric is based on the Toffoli count and the Toffoli depth. A more fine-grained and accurate approach is to implement the entire circuit with the Clifford\(+T\) set, count the number of T gates, and measure the T-depth, as was done in [15]. In [15], the quantum circuit was implemented with Q# [26] and the cost of the quantum circuit was estimated by the resource estimator of Q#. However, it seems that there are some issues with the resource estimator (see https://github.com/microsoft/qsharp-runtime/issues/192), so we do not use it here.

Outline. In Sect. 2, we present the definitions of some quantum gates. Section 3 briefly introduces AES and presents the algebraic structures of AES’s S-box/S-box\(^{-1}\). In Sect. 4, we propose our improved quantum circuits of AES’s S-box and S-box\(^{-1}\). Section 5 shows our improved ideas for the zig-zag method and the key schedule of AES. In Sect. 6, we show our improved quantum circuit implementations of AES. We conclude this paper in Sect. 7.

2 Notations

Classical circuits allow wires to be joined together, as in \(a = a\oplus b\) and \(a = a\wedge b\). The AND operation in particular discards information, so such circuits are in general neither reversible nor unitary. In contrast, quantum circuits must be reversible and unitary, and they can be constructed by replacing classical gates with quantum gates. For example, we simulate an AND gate with the Toffoli gate, while an XOR gate can be simulated with the CNOT gate.
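The gate replacement described above can be illustrated with a small classical simulation on computational basis states. This is only a sketch: the function names and the four-wire layout are illustrative choices and do not come from the paper.

```python
from itertools import product

def cnot(state, control, target):
    """CNOT: XOR the control bit into the target bit (a reversible XOR)."""
    state[target] ^= state[control]

def toffoli(state, c1, c2, target):
    """Toffoli: XOR (c1 AND c2) into the target bit (a reversible AND)."""
    state[target] ^= state[c1] & state[c2]

def reversible_and_xor(a, b):
    """Write a AND b and a XOR b onto fresh zero wires, keeping a and b intact."""
    state = [a, b, 0, 0]      # wires: a, b, and_out, xor_out
    toffoli(state, 0, 1, 2)   # and_out = a AND b
    cnot(state, 0, 3)         # xor_out = a
    cnot(state, 1, 3)         # xor_out = a XOR b
    return state

for a, b in product((0, 1), repeat=2):
    print(a, b, reversible_and_xor(a, b))
```

Because each gate only XORs a function of untouched wires into its target, applying the same gate a second time undoes it, which is the property used later when ancilla qubits are cleaned up.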

Some prior works [2] showed that quantum circuits consisting only of Clifford gates are not advantageous over classical computing. In other words, we need some non-Clifford gates (e.g. the Toffoli gate) to obtain a quantum benefit. Also, some works [23, 27] showed that the Toffoli gate and Clifford gates are universal. That is, we can implement any quantum computation with these gates. As shown in [20], the Clifford gates are much cheaper than the Toffoli gate (or T gate). As a result, [11, 17, 18] defined the Toffoli depth as the time cost of the algorithm, while the memory cost is the total number of logical qubits required to perform the quantum algorithm. Similar to [11, 17, 18], we define the time and memory cost of our quantum circuit implementation of AES as follows.

Definition 1

A unit of quantum computational time cost is defined as the time for running a nonparallelizable logical Toffoli gate.

Definition 2

The space cost of the quantum circuit is defined as the number of logical qubits for the entire quantum computation.

Apart from the two definitions, we also distinguish three kinds of qubits to avoid confusion.

  1. Data qubits hold the input message, such as the round key or the input plaintext.

  2. Ancilla qubits (also called garbage qubits) are initialized qubits that assist a certain operation and get written with unwanted information after that operation. Note that we shall clean up the ancilla qubits at the end of the quantum circuit.

  3. Output qubits contain the output information of a certain operation. Note that we do not need to clean up the output qubits.

Based on the definitions of the three types of qubits, we adopt the following two strategies to reduce the number of qubits. First, we avoid applying Toffoli gates to ancilla qubits, because these wires have to be cleaned up. Output qubits, however, do not need to be cleaned up, so we apply the Toffoli gates to output qubits whenever possible to keep them out of the cleanup process. Second, some ancilla qubits remain idle until the end of the quantum circuit. By uncomputing these wires early, we can reuse them instead of introducing new ancilla qubits, which reduces the number of qubits.

3 The AES Block Cipher

AES [21] is a family of iterative block ciphers based on the SPN structure. Its members with 128-bit, 192-bit, and 256-bit keys are denoted as AES-128 (10-round), AES-192 (12-round), and AES-256 (14-round), respectively. We will show the round function and key schedule of AES in the following. We refer the reader to [21] for the full description of AES.

3.1 Specification of AES

The AES round function consists of the following four operations: \(\mathbf{AddRoundKey} \circ \mathbf{MixColumns} \circ \mathbf{ShiftRows} \circ \mathbf{SubBytes} \), where

  • AddRoundKey exclusive-ors each round key to the state;

  • SubBytes is the only non-linear transformation in AES, which applies an 8-bit S-box to the 16 bytes of the state in parallel. The algebraic structure of S-box is shown in Sect. 3.2.

  • ShiftRows cyclically rotates the cells of the i-th row to the left by i-byte (for \(0\le i\le 3\)).

  • MixColumns does a linear transformation on each column of the state with the MDS matrix

Similar to the encryption process of AES, the decryption process of AES also consists of four operations \(\mathbf{AddRoundKey} \circ \mathbf{InvMC} \circ \mathbf{InvShiftRows} \circ \mathbf{InvSubBytes} \), where

  • AddRoundKey exclusive-ors the round key to the state;

  • InvSubBytes is the inverse operation of SubBytes;

  • InvShiftRows cyclically rotates the cells of the i-th row to the right by i-byte (for \(0\le i\le 3\)).

  • InvMC does a linear transformation on each column with the MDS matrix

The key schedules of AES-128/-192/-256 are described in Algorithm 1 and Algorithm 2. The parameters s and t used in the key schedules of AES-128 are \(s=4\), \(t=43\), while AES-192 adopts \(s=6\), \(t=51\).

figure a
figure b

The operations RotWord, Rcon and SubWord used in Algorithm 1 and Algorithm 2 are explained as follows.

  • RotWord cyclically rotates the four bytes to the left by 1-byte;

  • Rcon exclusive-ors the constant to each byte of the word;

  • SubWord applies an S-box operation to each byte of the word.

3.2 The Algebraic Structures of the S-Box of AES

There are several ways to implement the S-box of AES. In [5], Boyar and Peralta showed an efficient way to compute AES’s S-box by using the tower field architecture. Since we did not find a circuit with fewer AND gates than the classical circuit proposed by Boyar and Peralta [5], we adopt their classical circuit to construct our quantum circuit of AES’s S-box in Sect. 4. Their circuit represents AES’s S-box as \(S(x)=B_S\cdot F_S(U_S\cdot x)\), where the matrix \(U_S\) takes \(x_0, x_1, \cdots , x_{7}\) as input and outputs \(x_7, y_1, \cdots , y_{21}\).

figure c

The function \(F_S:\mathbb {F}^{22}_2\rightarrow \mathbb {F}^{18}_2\) takes \(x_7, y_1, \cdots , y_{21}\) as input and outputs \(z_0, z_1, \cdots , z_{17}\).

figure d

The matrix \(B_S\) takes \(z_0, z_1, \cdots , z_{17}\) as input and outputs \(s_0, s_1, \cdots , s_{7}\).

figure e

3.3 Our Improved Classical Circuit of the S-Box\(^{-1}\) of AES

By using the tower field technique, we propose an improved implementation of the S-box\(^{-1}\) (see Table 2), which can be used to construct our quantum circuit of AES’s S-box\(^{-1}\). We can express AES’s S-box\(^{-1}\) as \(S^{-1}(x)=B'\cdot F'(U'\cdot x)\), where the matrix \(U'\in \mathbb {F}^{8\times 22}_{2}\) takes \(x_0, x_1, \cdots , x_{7}\) as input and outputs \(y_0, y_1, \cdots , y_{21}\), where \(U_{i}=x_i\) (for \(0\le i\le 7\)).

figure f

The non-linear function \(F': \mathbb {F}^{22}_2\rightarrow \mathbb {F}^{18}_2\) takes \(y_0, y_1, \cdots , y_{21}\) as input and outputs \(z_0, z_1, \cdots , z_{17}\).

figure g

The matrix \(B'\) takes \(z_0, z_1, \cdots , z_{17}\) as input and outputs \(s_0, s_1, \cdots , s_{7}\).

figure h
Table 2. Summary of the resources to implement AES’s S-box\(^{-1}\)

4 The Quantum Circuits for the Basic AES Operations

4.1 Quantum Circuits for Three Linear Transformations of AES

As pointed out in [11], the three linear transformations of AES can be implemented with the CNOT gates as follows. We just adopt their quantum circuit of three linear transformations in our quantum circuits of AES.

  1. AddRoundKey: The AddRoundKey transformation XORs the 128-bit round key to the state, which can be executed with 128 CNOT gates in parallel.

  2. ShiftRows: Since the ShiftRows transformation just permutes the sixteen bytes of the AES state, we do not need any quantum gates to execute this operation.

  3. MixColumns: The MixColumns transformation operates on one column (32 bits) at a time and can be specified by a \(32\times 32\) matrix. The resulting circuit of MixColumns has 277 CNOT gates with a total depth of 39, which can be obtained from an LUP-decomposition [11].
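To illustrate why an invertible linear map such as MixColumns needs only CNOT gates, the following sketch builds the \(32\times 32\) binary matrix of MixColumns and reduces it to the identity by row additions, each of which corresponds to one CNOT. It uses plain Gauss-Jordan elimination rather than the LUP-decomposition of [11], so the gate count it prints will in general differ from the 277 CNOT gates quoted above. The bit ordering and all function names are assumptions made for this example.

```python
def xtime(b):
    """Multiply a byte by 0x02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def mix_column(col):
    """AES MixColumns on one column, given as a list of 4 bytes."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3,
        xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3),
    ]

def mc_bits(x):
    """MixColumns on a 32-bit integer (byte i occupies bits 8i..8i+7)."""
    out = mix_column([(x >> (8 * i)) & 0xFF for i in range(4)])
    return sum(b << (8 * i) for i, b in enumerate(out))

def matrix_rows(n=32):
    """Row r is a bitmask: bit c is set iff output bit r depends on input bit c."""
    cols = [mc_bits(1 << c) for c in range(n)]  # images of the unit vectors
    return [sum(((cols[c] >> r) & 1) << c for c in range(n)) for r in range(n)]

def cnot_synthesis(rows):
    """Reduce an invertible GF(2) matrix to the identity with row additions only.

    Returns (control, target) pairs in circuit order. Raises StopIteration
    if the matrix is singular.
    """
    rows = list(rows)
    n = len(rows)
    ops = []

    def add(src, dst):
        rows[dst] ^= rows[src]   # same action as CNOT(control=src, target=dst)
        ops.append((src, dst))

    for j in range(n):
        if not (rows[j] >> j) & 1:
            pivot = next(r for r in range(j + 1, n) if (rows[r] >> j) & 1)
            add(pivot, j)
        for r in range(n):
            if r != j and (rows[r] >> j) & 1:
                add(j, r)
    # The matrix equals the product of the recorded elementary matrices,
    # so the circuit applies them in reverse order of the elimination.
    return ops[::-1]

def apply_cnots(ops, x):
    """Run the CNOT list on a 32-bit basis state."""
    for control, target in ops:
        x ^= ((x >> control) & 1) << target
    return x

CNOTS = cnot_synthesis(matrix_rows())
print("CNOT gates from plain elimination:", len(CNOTS))
```

Replaying `CNOTS` on any 32-bit input should reproduce `mc_bits`, which gives a quick consistency check of the synthesis.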

In the following, we present our improved quantum circuit implementations of AES’s S-box and S-box\(^{-1}\). The details of our implementation of AES S-box and S-box\(^{-1}\) are available at https://github.com/Asiacrypt2020submission370/aes/.

4.2 Improved Quantum Circuit Implementations of AES’s S-Box

In this subsection, we propose some improved quantum circuit implementations of AES’s S-box. Our quantum circuit of AES’s S-box considers the following two cases: \(|x\rangle |0^a\rangle {\longrightarrow }|x\rangle |S(x)\rangle |0^{a-8}\rangle \) and \(|x\rangle |b\rangle |0^{a-8}\rangle {\longrightarrow }|x\rangle |S(x)\oplus b\rangle |0^{a-8}\rangle \). Note that the prior works [11, 18] only considered \(|x\rangle |0^a\rangle {\longrightarrow }|x\rangle |S(x)\rangle |0^{a-8}\rangle \).

Firstly, we improve the quantum circuit sending \(|x\rangle |0^{8}\rangle \) to \(|x\rangle |S(x)\rangle \). In this part, we propose an improved quantum circuit of AES’s S-box, which requires fewer qubits than the prior work. In detail, our quantum circuit of AES’s S-box requires only 6 ancilla qubits and maps \(|x\rangle |0^{14}\rangle \) to \(|x\rangle |S(x)\rangle |0^{6}\rangle \). The prior work needed at least 16 ancilla qubits to compute the S-box, mapping \(|x\rangle |0^{24}\rangle \) to \(|x\rangle |S(x)\rangle |0^{16}\rangle \). Our improved quantum circuits of AES’s S-box rely on the following two new observations, which are based on the algebraic structures of the S-box (see Sect. 3.2).

Observation 1

As shown in Sect. 3.2, the 18 values of \(z_0, \cdots , z_{17}\) can be obtained with the knowledge of \(t_{29}\), \(t_{33}\), \(t_{37}\), \(t_{40}\), \(t_{41}\), \(t_{42}\), \(t_{43}\), \(t_{44}\), \(t_{45}\) and \(x_7, y_0, \cdots , y_{17}\), where \(y_0, \cdots , y_{17}\) are the linear combination of \(x_0, x_1, \cdots , x_{7}\). Besides, \(t_{41}\), \(t_{42}\), \(t_{43}\), \(t_{44}\), \(t_{45}\) can be obtained by the linear combination of \(t_{29}, t_{33}, t_{37}, t_{40}\). In other words, we can obtain \(z_0, \cdots , z_{17}\) only with the knowledge of \(t_{29}, t_{33}, t_{37}, t_{40}\) and \(x_0, x_1, \cdots , x_{7}\).

Observation 2

The \(s_0, s_1, \cdots , s_7\) can be obtained by a linear combination of \(z_0, \cdots , z_{17}\) as follows, where \(\bar{s}\) applies the NOT operation on s.

$$\begin{aligned} s_0&=z_3\oplus z_4\oplus z_6\oplus z_7\oplus z_9\oplus z_{10}\oplus z_{15}\oplus z_{16},\\ s_1&=\overline{z_0\oplus z_1\oplus z_6\oplus z_7\oplus z_9\oplus z_{10}\oplus z_{15}\oplus z_{16}},\\ s_2&=\overline{z_0\oplus z_2\oplus z_6\oplus z_8\oplus z_{12}\oplus z_{14}\oplus z_{15}\oplus z_{17}},\\ s_3&=z_0\oplus z_1\oplus z_3\oplus z_4\oplus z_9\oplus z_{10}\oplus z_{15}\oplus z_{16},\\ s_4&=z_1\oplus z_2\oplus z_4\oplus z_5\oplus z_9\oplus z_{10}\oplus z_{15}\oplus z_{16},\\ s_5&=z_0\oplus z_2\oplus z_3\oplus z_4\oplus z_7\oplus z_8\oplus z_{10}\oplus z_{11}\oplus z_{12}\oplus z_{14}\oplus z_{15}\oplus z_{16},\\ s_6&=\overline{z_4\oplus z_5\oplus z_7\oplus z_8\oplus z_{12}\oplus z_{13}\oplus z_{15}\oplus z_{16}},\\ s_7&=\overline{z_0\oplus z_2\oplus z_3\oplus z_5\oplus z_{12}\oplus z_{13}\oplus z_{15}\oplus z_{16}}. \end{aligned}$$
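The linear map of Observation 2 can be written down as index sets, which makes it easy to check how often each \(z_i\) is used. The sketch below copies the index sets from the equations above; the names `S_TERMS`, `COMPLEMENTED` and `fan_out` are illustrative.

```python
from functools import reduce
from operator import xor

# s_j is the XOR of z_i over S_TERMS[j], complemented if j is in COMPLEMENTED.
S_TERMS = {
    0: [3, 4, 6, 7, 9, 10, 15, 16],
    1: [0, 1, 6, 7, 9, 10, 15, 16],
    2: [0, 2, 6, 8, 12, 14, 15, 17],
    3: [0, 1, 3, 4, 9, 10, 15, 16],
    4: [1, 2, 4, 5, 9, 10, 15, 16],
    5: [0, 2, 3, 4, 7, 8, 10, 11, 12, 14, 15, 16],
    6: [4, 5, 7, 8, 12, 13, 15, 16],
    7: [0, 2, 3, 5, 12, 13, 15, 16],
}
COMPLEMENTED = {1, 2, 6, 7}   # the four outputs written with an overline

def output_bits(z):
    """Evaluate s_0..s_7 from a list of 18 bits z_0..z_17."""
    bits = []
    for j in range(8):
        value = reduce(xor, (z[i] for i in S_TERMS[j]), 0)
        bits.append(value ^ (1 if j in COMPLEMENTED else 0))
    return bits

# For each z_i, the list of outputs s_j it contributes to.
fan_out = {i: [j for j in range(8) if i in S_TERMS[j]] for i in range(18)}
single_use = {i: js for i, js in fan_out.items() if len(js) == 1}
print(single_use)   # {11: [5], 17: [2]}
```

The four complemented outputs account for the 4 NOT gates in the S-box circuits, and \(z_{11}\) and \(z_{17}\) are the only values that feed a single output bit.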

The above two observations exploit the linear relationships between different parameters in the algebraic structure of AES’s S-box. According to Observation 1, we can obtain \(z_0, \cdots , z_{17}\) with the knowledge of \(t_{29}, t_{33}, t_{37}, t_{40}\) and \(x_0, x_1, \cdots , x_{7}\). Obviously, we can obtain \(t_{29}, t_{33}, t_{37}, t_{40}\) by storing all \(t_i\) (for \(2 \le i \le 40\)), which requires 39 ancilla qubits (see Sect. 3.2). By reusing some ancilla qubits, Algorithm 3 can output \(t_{29}, t_{33}, t_{37}, t_{40}\) with only 6 ancilla qubits.

Our Algorithm 3 can be constructed with 6 ancilla qubits, 17 Toffoli gates, and 93 CNOT gates, while our previous version of Algorithm 3 required 6 ancilla qubits, 21 Toffoli gates, and 109 CNOT gates to calculate the same values. Several \(t_i\) can be computed in parallel as follows. First, we can compute \(t_7\) and \(t_9\) in parallel. Second, we can compute \(t_2\) and \(t_{18}\) in parallel. Third, \(t_{29}\) and \(t_{37}\) can also be computed in parallel. To sum up, the Toffoli depth of Algorithm 3 is 14.

Since Algorithm 3 needs to recompute \(t_{36}\) and \(t_{2}\), we can obtain a new depth-qubit trade-off of Algorithm 3 as follows. First, we observe that our new Algorithm 3 computes \(t_{36}\) three times. If we introduce a new ancilla qubit to store \(t_{36}\), we do not need to recompute it. That is, we can save two Toffoli gates and two units of Toffoli depth by storing \(t_{36}\) in a new ancilla qubit. Second, our new Algorithm 3 needs to compute \(t_{2}\) twice. If we introduce a new ancilla qubit to store \(t_{2}\), we can save one Toffoli gate and one unit of Toffoli depth. That is, we can obtain a new depth-qubit trade-off, indexed by i, of our new Algorithm 3 with \(14-i\) Toffoli depth, \(6+i\) ancilla qubits, \(17-(i+1)\) Toffoli gates, and \(93+(i+1)\) CNOT gates (for \(1\le i\le 2\)).

figure i

Note that Langenberg et al. in [18] also utilized the linear relationship between \(z_i\) and \(s_j\) (for \(0\le i\le 17\) and \(0\le j\le 7\)) to reduce the number of Toffoli gates. However, they did not explore the whole linear relationship like Observation 2. As a result, they needed to introduce a new ancilla qubit Z in their work. According to Observation 2, we can construct Algorithm 4 for AES’s S-box with the output of Algorithm 3.

figure j

We can obtain the time and memory cost of Algorithm 4 as follows.

  1. It needs 18 Toffoli gates and 140 CNOT gates to obtain \(z_i\) for \(0\le i\le 17\).

  2. Since Algorithm 4 applies Algorithm 3 twice, the second time to clean up the ancilla qubits, we can obtain a depth-qubit trade-off, indexed by i, of Algorithm 4 as follows.

    a. When \(i=0\), Algorithm 4 can compute the output of the S-box with 6 ancilla qubits, 52 Toffoli gates, 326 CNOT gates, and 4 NOT gates. The Toffoli depth of Algorithm 4 in this case is \(2\times 14+13=41\).

    b. When \(1\le i\le 2\), Algorithm 4 can compute the output of the S-box with \(6+i\) ancilla qubits, \(52-2(i+1)\) Toffoli gates, \(326+2(i+1)\) CNOT gates, and 4 NOT gates. The Toffoli depth of Algorithm 4 in this case is \(41-2i\).
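The cost figures above can be reproduced by simple bookkeeping: two runs of Algorithm 3 plus the stage that computes the \(z_i\). The sketch below only re-adds numbers quoted in the text; the stage depth of 13 is read off from the expression \(2\times 14+13\), and the dictionary layout and function name are illustrative.

```python
# Per-stage figures quoted in the text.
ALG3 = {"toffoli": 17, "cnot": 93, "toffoli_depth": 14}
Z_STAGE = {"toffoli": 18, "cnot": 140, "toffoli_depth": 13}

def algorithm4_cost(i=0):
    """Cost of Algorithm 4 at trade-off point i (0, 1 or 2 extra ancilla qubits)."""
    if i not in (0, 1, 2):
        raise ValueError("the text defines the trade-off only for 0 <= i <= 2")
    # Gates exchanged between the Toffoli and CNOT counts, per the text's formula.
    moved = 0 if i == 0 else 2 * (i + 1)
    return {
        "ancilla": 6 + i,
        "toffoli": 2 * ALG3["toffoli"] + Z_STAGE["toffoli"] - moved,
        "cnot": 2 * ALG3["cnot"] + Z_STAGE["cnot"] + moved,
        "not": 4,
        "toffoli_depth": 2 * ALG3["toffoli_depth"] + Z_STAGE["toffoli_depth"] - 2 * i,
    }

for i in range(3):
    print(i, algorithm4_cost(i))
```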

Next, we improve the quantum circuit sending \(|x\rangle |b\rangle \) to \(|x\rangle |S(x) \oplus b \rangle \). In this part, we propose a new quantum circuit of AES’s S-box, which maps \(|x\rangle |b\rangle |0^{7}\rangle \) to \(|x\rangle |S(x)\oplus b\rangle |0^{7}\rangle \) with the output of Algorithm 3. Since the qubits encoding b are not necessarily zero, we cannot adopt Algorithm 4 directly. According to Observation 2, this problem can be solved by introducing a new ancilla qubit Z, which is used to store each \(z_i\) in turn. After filling Z with \(z_i\), we just XOR Z to every \(s_j\) that contains \(z_i\) according to the linear relationship in Observation 2. Note that we shall clean up Z each time so that it can store a new \(z_i\).
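The role of the ancilla qubit Z can be sketched on classical bits: a Toffoli gate writes a product into Z, CNOT gates copy Z into the relevant output wires, and the same Toffoli gate resets Z. This is a simplified illustration in which each \(z_i\) is modelled as a single AND of two wires; the wire layout and function name are assumptions of the example.

```python
def xor_product_into_targets(state, u, v, z_wire, targets):
    """XOR (state[u] AND state[v]) into every target wire via the ancilla Z.

    Works even when the target wires are not zero, because the product
    is only ever XORed into them.
    """
    state[z_wire] ^= state[u] & state[v]   # Toffoli: Z = u AND v
    for t in targets:
        state[t] ^= state[z_wire]          # CNOT: target ^= Z
    state[z_wire] ^= state[u] & state[v]   # same Toffoli again: Z back to 0

# Wires: u, v, Z, then three output wires that already hold the bits of b.
state = [1, 1, 0, 1, 0, 1]
xor_product_into_targets(state, 0, 1, 2, targets=[3, 5])
print(state)   # [1, 1, 0, 0, 0, 1 ^ 1] -> Z is clean, targets 3 and 5 flipped
```

Each value routed through Z costs two Toffoli gates in this pattern, whereas a value that feeds only one output bit can be XORed into that bit directly with a single Toffoli gate.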

Since Algorithm 5 is similar to Algorithm 4, we just give a brief description of Algorithm 5 in the following pseudocode.

figure k

Similar to Algorithm 4, we can obtain the time and memory cost of Algorithm 5 as follows.

  1. Algorithm 5 calculates each \(z_i\) (for \(0\le i\le 17\)) in the same order as Algorithm 4. That is, Algorithm 5 needs the same cost to compute each \(t_i\) and \(y_j\) as Algorithm 4.

  2. Since \(z_{11}\) (or \(z_{17}\)) only appears in \(s_5\) (or \(s_2\)) (see Observation 2), we can store \(z_{11}\) (or \(z_{17}\)) directly in \(s_5\) (or \(s_2\)) without affecting the other output qubits. In other words, we can compute \(z_{11}\) and \(z_{17}\) in parallel with other \(z_i\). Because we do not need to store \(z_{11}\) and \(z_{17}\) in Z, we just need to clean up Z sixteen times so as to store new \(z_i\). That is, Algorithm 5 needs \(2\times 16+2=34\) Toffoli gates to calculate each \(z_i\) (for \(0\le i\le 17\)).

  3. Algorithm 5 applies Algorithm 3 twice to compute the S-box and clean up the ancilla qubits.

Similar to Algorithm 4, we can obtain a depth-qubit trade-off, indexed by i, of Algorithm 5 as follows.

  1. When \(i=0\), Algorithm 5 can compute the output of the S-box with 7 ancilla qubits, 68 Toffoli gates, 352 CNOT gates, 4 NOT gates, and a Toffoli depth of 60.

  2. When \(1\le i\le 2\), we can compute the S-box with \(7+i\) ancilla qubits, \(68-2(i+1)\) Toffoli gates, \(352+2(i+1)\) CNOT gates, 4 NOT gates, and a Toffoli depth of \(60-2i\).

4.3 Improved Quantum Circuit Implementation of the S-Box\(^{-1}\)

Here we propose a new quantum circuit of AES’s S-box\(^{-1}\) with 7 ancilla qubits, which maps \(|x\rangle |S(x)\rangle |0^{7}\rangle \) to \(|x\oplus S^{-1}(S(x))\rangle |S(x)\rangle |0^{7}\rangle =|0^8\rangle |S(x)\rangle |0^{7}\rangle \). In this way, our quantum circuit of S-box\(^{-1}\) can be used to erase state values that are no longer needed. We will use this property to improve the zig-zag method. Our quantum circuit of AES’s S-box\(^{-1}\) benefits from the following observations, which are based on our improved classical circuit of AES’s S-box\(^{-1}\).

Observation 3

The 18-bit \(z_0, \cdots , z_{17}\) for computing S-box\(^{-1}\) can be obtained with the knowledge of \(t_{29}\), \(t_{33}\), \(t_{37}\), \(t_{40}\), \(t_{41}\), \(t_{42}\), \(t_{43}\), \(t_{44}\), \(t_{45}\) and \(y_0, \cdots , y_{21}\). Note that \(y_0, \cdots , y_{21}\) are the linear combination of \(x_0, \cdots , x_{7}\). Besides, \(t_{41}\), \(t_{42}\), \(t_{43}\), \(t_{44}\), \(t_{45}\) can be obtained by the linear combination of \(t_{29}\), \(t_{33}\), \(t_{37}\), \(t_{40}\). That is, we can obtain \(z_0, \cdots , z_{17}\) with the knowledge of \(t_{29}, t_{33}, t_{37}, t_{40}\) and \(x_0, \cdots , x_{7}\).

Observation 4

The 8-bit output of S-box\(^{-1}\) \(s_0, \cdots , s_7\) can be seen as a linear combination of the 18-bit \(z_0, \cdots , z_{17}\) as follows.

$$\begin{aligned} s_0&=z_0\oplus z_2\oplus z_3\oplus z_5\\ s_1&=z_1\oplus z_2\oplus z_4\oplus z_5\oplus z_{13}\oplus z_{14}\oplus z_{16}\oplus z_{17}\\ s_2&=z_3\oplus z_5\oplus z_6\oplus z_8\oplus z_{9}\oplus z_{11}\oplus z_{15}\oplus z_{17}\\ s_3&=z_1\oplus z_2\oplus z_3\oplus z_5\oplus z_6\oplus z_{7}\oplus z_{9}\oplus z_{11}\oplus z_{15}\oplus z_{17}\\ s_4&=z_{10}\oplus z_{11}\oplus z_{12}\oplus z_{14}\oplus z_{15}\oplus z_{16}\\ s_5&=z_3\oplus z_5\oplus z_6\oplus z_8\oplus z_{10}\oplus z_{11}\oplus z_{16}\oplus z_{17}\\ s_6&=z_0\oplus z_1\oplus z_3\oplus z_4\oplus z_{9}\oplus z_{10}\oplus z_{12}\oplus z_{13}\\ s_7&=z_3\oplus z_5\oplus z_6\oplus z_8\oplus z_{12}\oplus z_{13}\oplus z_{15}\oplus z_{16} \end{aligned}$$

According to Observation 3, we can obtain \(z_0, \cdots , z_{17}\) by \(t_{29}, t_{33}, t_{37}, t_{40}\) and \(x_0, \cdots , x_{7}\). We propose Algorithm 6 to compute the \(t_{29}, t_{33}, t_{37}, t_{40}\). As shown in Algorithm 6, we can compute the \(z_0, \cdots , z_{17}\) of the S-box\(^{-1}\) with 6 ancilla qubits, 17 Toffoli gates, 110 CNOT gates and 12 NOT gates.

figure l

As shown in the above, we can obtain the 14 outputs of Algorithm 6 with 7 ancilla qubits, 17 Toffoli gates, 110 CNOT gates and 12 NOT gates. The Toffoli depth of Algorithm 6 is 14, because we can compute some \(t_i\) in parallel as follows. First, we can compute the two \(t_6\) in parallel. Second, we can compute the two \(t_7\) in parallel. Third, we can compute the \(t_8\) and \(t_{10}\) in parallel.

Similar to Algorithm 3, we can obtain a new depth-qubit trade-off of Algorithm 6 by introducing more ancilla qubits. Note that Algorithm 6 needs to compute \(t_6\), \(t_7\), and \(t_{26}\) twice. If we introduce 3 more ancilla qubits to store these values, we do not need to recompute \(t_6\), \(t_7\), and \(t_{26}\). That is, we can obtain a new depth-qubit trade-off of Algorithm 6, which needs \(7+i\) ancilla qubits, \(17-i\) Toffoli gates, \(110+i\) CNOT gates and 12 NOT gates (for \(0\le i\le 3\)). The Toffoli depth of Algorithm 6 under this trade-off is 13 (for \(1\le i\le 3\)).

After obtaining the 14-bit output of Algorithm 6, we can construct Algorithm 7 by using Observation 4. Since our algorithm for S-box\(^{-1}\) cannot assume that the output qubits are zero, we introduce a new ancilla qubit Z to store each \(z_i\) in this algorithm.

figure m

The time and space cost of Algorithm 7 can be computed as follows. First, Algorithm 7 needs 35 Toffoli gates and 115 CNOT gates to compute each \(z_i\) for \(0\le i\le 17\). Second, Algorithm 7 needs to adopt Algorithm 6 twice to compute S-box\(^{-1}\) and clean up the ancilla qubits, which set \(T[i]=0\) and \(U[j]=x_j\) for \(0\le i\le 5\) and \(0\le j\le 7\). To sum up, Algorithm 7 can output S-box\(^{-1}\) with 7 ancilla qubits, 69 Toffoli gates, 335 CNOT and 24 NOT gates. The depth of Algorithm 7 is 62. Given more ancilla qubits, we can also propose a new depth-qubit trade-off of Algorithm 7, which needs \(7+i\) ancilla qubits, \(69-2i\) Toffoli gates, \(335+2i\) CNOT, and 24 NOT gates (for \(0\le i\le 3\)). The Toffoli depth of the above algorithm is 60 (for \(1\le i\le 3\)).
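As a check of the \(i=0\) totals quoted above, the gate counts of Algorithm 7 are the sum of two runs of Algorithm 6 and the stage that computes the \(z_i\), using only figures stated in the text:

$$\begin{aligned} \#\,\text{Toffoli}&=2\times 17+35=69,\\ \#\,\text{CNOT}&=2\times 110+115=335,\\ \#\,\text{NOT}&=2\times 12=24. \end{aligned}$$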

5 Our Strategies for the Zig-Zag Method and the Key Schedule of AES

5.1 Zig-Zag Method with Improved Depth-Qubit Trade-Offs

The prior quantum circuits of AES [3, 11, 18] adopted the zig-zag method to reduce the number of qubits. As shown in Fig. 1, the prior zig-zag method needed 512 qubits by reusing some qubits. However, they could not remove Round 4, Round 7 and Round 9 unless the entire process was reversed. The reason for this drawback is that the prior work only considered the encryption algorithm in their zig-zag method. That is, they needed to know Round \(i-1\) in order to remove Round i. In this subsection, we propose an improved zig-zag method (see Fig. 2), which needs just 256 qubits. We achieve this goal by applying our quantum circuit of S-box\(^{-1}\) in our zig-zag method.

Fig. 1. Comparison between the pipeline architecture and the zig-zag method. The round i is indicated by \(R_i\), while \(R_i^{-1}\) means to remove the round i.

Denote the j-th output of the 16 S-boxes in Round i as \(s^i_j\) (for \(0\le j\le 15\)), and the j-th byte of Round \(i-1\) as \(r^{i-1}_j\) (for \(0\le j\le 15\)). Given \(|r^{i-1}\rangle |0^{128}\rangle \), we now explain how to obtain Round i and remove Round \(i-1\) within these 256 qubits.

  1. Given \(|r^{i-1}\rangle \), we can compute the first r bytes \(s^i_0\), \(s^i_1\), \(\cdots \), \(s^i_{r-1}\) with our Algorithm 4. We store \(s^i_0\), \(s^i_1\), \(\cdots \), \(s^i_{r-1}\) in the first \(8\cdot r\) qubits of \(|0^{128}\rangle \), while the remaining \(|0^{128-8\cdot r}\rangle \) qubits can be used as ancilla qubits. We can choose r to obtain an improved depth-qubit trade-off for our quantum circuit.

  2. After computing \(s^i_0\), \(\cdots \), \(s^i_{r-1}\), we can remove the first r bytes of Round \(i-1\) by using our Algorithm 7. That is, we can compute \(|s^{i}_j\rangle |r^{i-1}_j\rangle |0^{7}\rangle \longrightarrow |s^{i}_j\rangle |r^{i-1}_j\oplus Sbox^{-1}(s^{i}_j)\rangle |0^{7}\rangle \) (for \(0\le j\le r-1\)). Note that we can still use the remaining \(|0^{128-8\cdot r}\rangle \) qubits as ancilla qubits. Since \(r^{i-1}_j= Sbox^{-1}(s^{i}_j)\), we have \(|s^{i}_j\rangle |r^{i-1}_j\oplus Sbox^{-1}(s^{i}_j)\rangle |0^{7}\rangle =|s^{i}_j\rangle |0^8\rangle |0^{7}\rangle \).

  3. The \(8\cdot r\) qubits of Round \(i-1\) that have been reset to \(|0^{8\cdot r}\rangle \) can be used as ancilla qubits for obtaining \(s^i_r\), \(s^i_{r+1}\), \(\cdots \), \(s^i_{15}\) and removing the remaining \(16-r\) bytes of Round \(i-1\).

  4. After computing the 16 bytes \(s^i_0\), \(s^i_{1}\), \(\cdots \), \(s^i_{15}\), we can compute the 16 bytes of Round i by computing \(AK\circ MC \circ SR(S_i)\), where \(S_i\) is the 16-byte output of the S-boxes in Round i, and AK, MC and SR are the abbreviations for AddRoundKey, MixColumns and ShiftRows.
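Steps 1 to 3 can be mimicked classically to see that the register holding Round \(i-1\) really ends up all-zero. The sketch below builds the AES S-box from its standard definition (field inversion followed by the affine map) and then performs the two phases with a split point r. It tracks only the two 128-bit registers and does not model the ancilla qubits or the internal structure of Algorithms 4 and 7; all function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(a):
    """Field inverse as a^254; maps 0 to 0 as in AES."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def affine(x):
    """AES affine map: x ^ rotl(x,1) ^ rotl(x,2) ^ rotl(x,3) ^ rotl(x,4) ^ 0x63."""
    y = 0x63
    for shift in range(5):
        y ^= ((x << shift) | (x >> (8 - shift))) & 0xFF
    return y

SBOX = [affine(gf_inv(x)) for x in range(256)]
INV_SBOX = [0] * 256
for x, s in enumerate(SBOX):
    INV_SBOX[s] = x

def replace_round_state(prev, r):
    """Turn (Round i-1 bytes, zeros) into (zeros, S-box outputs of Round i)."""
    assert len(prev) == 16 and 1 <= r <= 16
    old, new = list(prev), [0] * 16
    for batch in (range(0, r), range(r, 16)):
        for j in batch:
            new[j] ^= SBOX[old[j]]       # role of Algorithm 4: |x>|0> -> |x>|S(x)>
        for j in batch:
            old[j] ^= INV_SBOX[new[j]]   # role of Algorithm 7: |x>|S(x)> -> |0>|S(x)>
    return old, new

old, new = replace_round_state(list(range(16)), r=5)
print(old, new)
```

After the call, `old` is all-zero and can serve as the target register for the next round, which is why two 128-qubit registers suffice.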

After generating Round i, we can compute Round \(i + 1\) and remove Round i in a similar way. We assign the newly calculated Round \(i+1\) to the 128 qubits of Round \(i-1\) that have been reset to zero. We can compute the ciphertext of AES-128 by repeating the above operation 10 times. We can construct the zig-zag method for AES-192/-256 with 256 qubits in a similar way, whereas the prior zig-zag method needs 656 qubits for both AES-192 and AES-256.

Fig. 2. Our method for improving the zig-zag method. The round i is indicated by \(R_i\), while \(R_i^{-1}\) means to remove the round i.

5.2 Improved Quantum Circuits for the Key Schedule of AES

In this subsection, we propose some improved quantum circuit implementations for the key schedule of AES-128/-192/-256.

Our Strategy for the Key Schedule of AES-128. Our quantum circuit for the key-schedule of AES-128 only requires 128 qubits, while the prior works needed at least 224 qubits. We can achieve this improvement by combining our quantum circuit of S-box (Algorithm 5) with the property proposed by Langenberg et al. [18] (see Table 3).

We take \(W_{16}\) as an example to explain Table 3, where the entry \(W_{16}: W_{15}, W_{11}, W_{7}, W_{3}\) means that \(W_{16}\) can be computed with the knowledge of \(W_{15}, W_{11}, W_{7},\) and \(W_{3}\). According to Algorithm 1, we can rewrite \(W_{16}\) as \(W_{16}=W_{15}\oplus W_{11}\oplus W_{7}\oplus W_{3} \oplus SubWord(RotWord(W_{15}))\oplus Rcon(4)\). We can obtain the other \(W_i\) in Table 3 similarly.
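The rewritten form of \(W_{16}\), and its analogue for the other \(W_{4i}\) with \(4\le i\le 10\), can be checked numerically against the standard AES-128 key expansion. The following sketch is a plain reference implementation written for this check; the helper names are illustrative, and the key is the AES-128 example key from the key-expansion appendix of FIPS 197.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def sbox_byte(x):
    """AES S-box: field inversion (0 maps to 0) followed by the affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, x)
    y = 0x63
    for shift in range(5):
        y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return y

SBOX = [sbox_byte(x) for x in range(256)]
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def sub_word(w):
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def rot_word(w):
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def rcon(i):
    return RCON[i - 1] << 24

def expand_key_128(key):
    """Standard AES-128 key expansion; returns the 44 words W_0..W_43."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = sub_word(rot_word(t)) ^ rcon(i // 4)
        w.append(w[i - 4] ^ t)
    return w

def rewritten_word(w, i):
    """W_{4i} expressed through W_{4i-1}, W_{4i-5}, W_{4i-9}, W_{4i-13}."""
    k = 4 * i
    return (w[k - 1] ^ w[k - 5] ^ w[k - 9] ^ w[k - 13]
            ^ sub_word(rot_word(w[k - 1])) ^ rcon(i))

W = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(all(rewritten_word(W, i) == W[4 * i] for i in range(4, 11)))
```

With \(i=4\) the function `rewritten_word` evaluates exactly the expression for \(W_{16}\) given above.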

According to Table 3, we can compute all \(W_j\) (for \(4\le j\le 43\)) with the ten 32-bit words \(W_{4i+3}\) (for \(1\le i\le 10\)). In [11], Grassl et al. simply stored these ten 32-bit words \(W_{4i+3}\) (for \(1\le i\le 10\)) with \(32\times 10\) = 320 qubits to generate each round key of AES-128. In [18], Langenberg et al. showed that they could generate all round keys of AES-128 with 224 qubits by reusing some qubits as follows. After computing seven 32-qubit words \(W_7\), \(W_{11}\), \(W_{15}\), \(W_{19}\), \(W_{23}\), \(W_{27}\), \(W_{31}\), they cleaned up \(W_7\) so as to assign \(W_{32}\) to these 32 re-zeroed qubits. Then they could compute \(W_{33}\), \(W_{34}\) and \(W_{35}\) one by one with the knowledge of \(W_{19}\), \(W_{23}\), \(W_{27}\), \(W_{31}\). The remaining round keys could be computed similarly.

When the output qubits were not zero, Langenberg et al. could not apply their quantum circuit of the S-box to compute AES’s S-box. As a result, they had to remove \(W_7\) before generating \(W_{32}\). Based on Algorithm 5, our improved quantum circuit for the key schedule of AES-128 can be explained as follows.

  1.

    As shown in Sect. 6, we can generate four 32-qubit words \(W_7\), \(W_{11}\), \(W_{15}\), \(W_{19}\) in the 128 zero qubits.

  2.

    Since we have no zero qubits left, we shall remove \(W_7\), \(W_{11}\), \(W_{15}\), \(W_{19}\) to generate the new \(W_{4i+3}\) (for \(5 \le i \le 10\)). In detail, we can compute \(W_{20}\) by XORing \(SubWord(RotWord(W_{19}))\), Rcon(5), \(W_{19}\), \(W_{15}\) and \(W_{11}\) onto \(W_{7}\). We shall adopt Algorithm 5 to compute SubWord, because the output qubits are not zero. As a result, we can assign the newly calculated \(W_{20}\) to the qubits of \(W_7\) without introducing new qubits.

  3.

    After generating \(W_{20}\), we can compute \(W_{21}\), \(W_{22}\) and \(W_{23}\) one by one with \(W_{11}\), \(W_{15}\), \(W_{19}\) (see Table 3). Since we only store \(W_{23}\) in the memory, we can assign the newly calculated \(W_{20+j}\) to \(W_{20+j-1}\) (for \(1 \le j \le 3\)).

  4.

    The remaining round keys \(W_i\) (for \(24 \le i \le 43\)) can be generated in a similar way. After generating \(W_{4i-1}\), \(W_{4i-5}\), \(W_{4i-9}\), and \(W_{4i-13}\), we can assign the newly calculated \(W_{4i}\) to the qubits of \(W_{4i-13}\) (for \(5 \le i \le 10\)) without introducing new qubits. After computing \(W_{4i}\), we can generate \(W_{4i+1}\), \(W_{4i+2}\), \(W_{4i+3}\) as follows: \(W_{4i+1}=W_{4i}\oplus W_{4i-1}\oplus W_{4i-9},\) \(W_{4i+2}=W_{4i+1}\oplus W_{4i-1}\oplus W_{4i-5},\) \(W_{4i+3}=W_{4i+2}\oplus W_{4i-1}.\) We can assign the newly calculated \(W_{4i+j}\) to the qubits of \(W_{4i+j-1}\) (for \(1 \le j \le 3\)).
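The four steps above can be checked classically. The following is a minimal sketch, not the quantum circuit itself: it simulates a four-word register holding \(W_{4i-13}\), \(W_{4i-9}\), \(W_{4i-5}\), \(W_{4i-1}\) and updates it in place with XORs only, then compares the result with a textbook key expansion. The names `g`, `expand_key_128` and `in_place_schedule` are illustrative, and XORing `g(d, i)` onto a non-zero word stands in for the role that Algorithm 5 plays in the circuit.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """AES S-box: field inverse followed by the affine map."""
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


SBOX = build_sbox()


def g(word, idx):
    """SubWord(RotWord(word)) xor Rcon(idx) on a 32-bit word."""
    rot = ((word << 8) | (word >> 24)) & 0xFFFFFFFF
    sub = 0
    for shift in (24, 16, 8, 0):
        sub |= SBOX[(rot >> shift) & 0xFF] << shift
    rc = 1
    for _ in range(idx - 1):
        rc = gf_mul(rc, 2)
    return sub ^ (rc << 24)


def expand_key_128(key):
    """Textbook AES-128 key expansion, returning W_0 .. W_43."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        t = g(w[i - 1], i // 4) if i % 4 == 0 else w[i - 1]
        w.append(w[i - 4] ^ t)
    return w


def in_place_schedule(w7, w11, w15, w19):
    """Produce W_20 .. W_43 using only a four-word register."""
    reg = [w7, w11, w15, w19]
    produced = {}
    for i in range(5, 11):
        a, b, c, d = reg            # W_{4i-13}, W_{4i-9}, W_{4i-5}, W_{4i-1}
        a ^= b ^ c ^ d ^ g(d, i)    # a now holds W_{4i}
        produced[4 * i] = a
        a ^= d ^ b                  # W_{4i+1}
        produced[4 * i + 1] = a
        a ^= d ^ c                  # W_{4i+2}
        produced[4 * i + 2] = a
        a ^= d                      # W_{4i+3}
        produced[4 * i + 3] = a
        reg = [b, c, d, a]          # slide the window by one stored word
    return produced
```

Each update is an XOR onto the word that is no longer needed, so the register never grows beyond 128 bits.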

Our Strategy for the Key Schedule of AES-192 and AES-256. Similar to AES-128, we can obtain a property for AES-192 (or AES-256) in Table 4 (or Table 5).

The quantum circuit for the key schedule of AES-192 is similar to that of AES-128. After generating \(W_{11}\), \(W_{17}\), \(W_{23}\), \(W_{29}\), \(W_{35}\) and \(W_{41}\) in the 192 qubits, we can compute \(W_{42}\) by XORing \(SubWord(RotWord(W_{41}))\), Rcon(7), \(W_{41}\), \(W_{35}\) and \(W_{17}\) onto \(W_{11}\). Then we can compute the round-key words \(W_{42+j}\) (for \(1 \le j \le 5\)) one by one with the knowledge of \(W_{42+j-1}\), \(W_{17}\), \(W_{23}\), \(W_{29}\), \(W_{35}\) and \(W_{41}\). The remaining round keys for AES-192 can be computed in a similar way. To sum up, we can compute the 12 round keys of AES-192 with 192 qubits.

The quantum circuit for the key schedule of AES-256 can be constructed as follows. After generating the eight round-key words \(W_{11}\), \(W_{15}\), \(W_{19}\), \(W_{23}\), \(W_{27}\), \(W_{31}\), \(W_{35}\) and \(W_{39}\) in the quantum memory, we can compute \(W_{40}\) for AES-256 by XORing \(SubWord (RotWord(W_{39}))\), Rcon(5), \(W_{35}\), \(W_{27}\) and \(W_{19}\) onto \(W_{11}\). Then we can compute the round-key words \(W_{40+j}\) (\(1 \le j \le 3\)) for Round 10 one by one with the knowledge of \(W_{39}\), \(W_{35}\), \(W_{27}\), \(W_{19}\). Similar to \(W_{40}\), we can obtain the remaining round-key words \(W_{44}\), \(W_{48}\), \(W_{52}\), and \(W_{56}\) without introducing new qubits. To sum up, we can compute the 14 round keys of AES-256 with 256 qubits.
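The two recombination rules can be checked against a textbook key expansion. The sketch below is a classical sanity check under the standard AES recurrences, not part of the circuit: for AES-192 it uses \(W_{36}=W_{41}\oplus W_{35}\oplus W_{17}\oplus W_{11}\), and for AES-256 it uses \(W_{32}=W_{35}\oplus W_{27}\oplus W_{19}\oplus W_{11}\). The helper names `w42_from_stored` and `w40_from_stored` are hypothetical.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def build_sbox():
    """AES S-box: field inverse followed by the affine map."""
    box = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        box.append(s ^ 0x63)
    return box


SBOX = build_sbox()


def sub_word(word):
    return sum(SBOX[(word >> s) & 0xFF] << s for s in (24, 16, 8, 0))


def rot_word(word):
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Textbook key expansion for 16-, 24- or 32-byte keys."""
    nk = len(key) // 4
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rc = 1
    for i in range(nk, 4 * (nk + 7)):
        t = w[i - 1]
        if i % nk == 0:
            t = sub_word(rot_word(t)) ^ (rc << 24)
            rc = gf_mul(rc, 2)
        elif nk == 8 and i % nk == 4:
            t = sub_word(t)
        w.append(w[i - nk] ^ t)
    return w


def w42_from_stored(w):
    """AES-192: W_42 from W_11, W_17, W_35, W_41 only (Rcon(7) = 0x40)."""
    return (w[11] ^ w[17] ^ w[35] ^ w[41]
            ^ sub_word(rot_word(w[41])) ^ (0x40 << 24))


def w40_from_stored(w):
    """AES-256: W_40 from W_11, W_19, W_27, W_35, W_39 (Rcon(5) = 0x10)."""
    return (w[11] ^ w[19] ^ w[27] ^ w[35]
            ^ sub_word(rot_word(w[39])) ^ (0x10 << 24))
```

In both cases the result depends only on words that are already stored, which is what allows the new word to overwrite \(W_{11}\).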

Table 3. The keys required to construct each round-key of AES-128.
Table 4. The keys required to construct round-key of AES-192.
Table 5. The keys required to construct round-key of AES-256.

6 Improved Quantum Circuit Implementations of AES

6.1 Our Improved Quantum Circuit of AES-128

As shown in Fig. 2, we can divide our quantum circuit of AES-128 into three parts. Part 1 only contains Round 1, which does not need the S-box\(^{-1}\) operation. Part 2 contains Round 2, Round 3 and Round 4. Part 3 contains the remaining 6 rounds, which shall use Algorithm 5 to compute the round keys.

Let \(r^{j}_i\) denote the i-th byte of Round j and \(s^{j+1}_i\) the i-th S-box output in Round \(j+1\) (for \(0\le j\le 9\) and \(0\le i\le 15\)). The time and memory cost of each part can then be computed as follows.

The Time and Space Cost of Part 1. We just compute Round 1 and remove Round 0 in Part 1 (see Fig. 3).

Fig. 3. Our method for computing Round 1.

  1.

    We can obtain Round 0 by applying at most 128 Pauli-X gates (also called NOT gates) to the input keys \(W_0\), \(W_1\), \(W_2\), \(W_3\).

  2.

    We can adopt Algorithm 4 in parallel to compute \(s^{1}_i\) (for \(0\le i\le 15\)), because we have 384 zero qubits (qubits 128 to 511 of the initial state in Fig. 3). Since we need 128 qubits to store these 16 bytes \(s^{1}_i\) (for \(0\le i\le 15\)), we have \(384-128=256\) qubits left for ancilla qubits. In other words, we can obtain a depth-qubit trade-off \(i=2\) for these 16 S-box operations. That is, we can implement these 16 S-box operations with 128 ancilla qubits, 736 Toffoli gates and 5,312 CNOT gates. The Toffoli depth of these 16 S-box operations is \(41-4=37\), because we can implement the 16 S-boxes in parallel.

  3.

    After obtaining \(s^{1}_i\) (for \(0\le i\le 15\)), we can apply at most 128 NOT gates to Round 0 so as to obtain \(W_0\), \(W_1\), \(W_2\), \(W_3\) again. Then we can compute the round-key \(W_4\), \(W_5\), \(W_6\), \(W_7\) for Round 1 with the knowledge of \(W_0\), \(W_1\), \(W_2\), \(W_3\). Similar to step 2, we can obtain a depth-qubit trade-off \(i=2\) for these 4 S-box operations for \(W_4\), because we have 224 ancilla qubits left. That is, we need 184 Toffoli gates and 1328 CNOT gates to implement these 4 S-box operations. The Toffoli depth of this operation is 37.

  4.

    We require \(3\times 32=96\) CNOT gates and 1 NOT gate to produce \(W_4\), \(W_5\), \(W_6\), \(W_7\), and 128 CNOT gates to implement the AddRoundKey operation. In addition, we need \(277\times 4=1108\) CNOT gates to implement the four MixColumns operations.

To sum up, we can implement Part 1 with 920 Toffoli gates, 7,972 CNOT gates, and 337 NOT gates. Since the 16 S-boxes of Round 1 and the 4 S-boxes of \(W_4\) cannot be implemented in parallel, the Toffoli depth of the above operation is 74.
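The Part 1 totals can be reproduced by adding up the per-step figures. The tally below is a bookkeeping sketch only. The step names are illustrative, and the NOT count of 4 per S-box is an assumption: it is not stated in the Part 1 steps but is inferred from the later steps, which report 40 NOT gates for 10 S-boxes.

```python
# (Toffoli, CNOT, NOT, Toffoli depth) per step, taken from the text
# unless marked as an assumption.
NOT_PER_SBOX = 4  # assumption, inferred from "40 NOT gates" for 10 S-boxes

part1_steps = {
    "load_round0":   (0,   0,       128,               0),
    "sbox_round1":   (736, 5312,    16 * NOT_PER_SBOX, 37),
    "unload_round0": (0,   0,       128,               0),
    "sbox_w4":       (184, 1328,    4 * NOT_PER_SBOX,  37),
    "w4_to_w7":      (0,   3 * 32,  1,                 0),
    "add_round_key": (0,   128,     0,                 0),
    "mix_columns":   (0,   277 * 4, 0,                 0),
}


def tally(steps):
    """Sum gate counts; depths add because the steps run sequentially."""
    totals = [0, 0, 0, 0]
    for cost in steps.values():
        for k, value in enumerate(cost):
            totals[k] += value
    return tuple(totals)


toffoli, cnot, not_gates, depth = tally(part1_steps)
```

Under this assumption the sums agree with the stated 920 Toffoli gates, 7,972 CNOT gates, 337 NOT gates and Toffoli depth 74.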

The Time and Space Cost of Part 2. Part 2 contains three similar rounds from Round 2 to Round 4.

Fig. 4. Our method for computing Round 4 and removing Round 3 of AES-128.

In the following, we show the time and memory cost of computing Round 4 and removing Round 3, which can be divided into 5 phases (see Fig. 4).

  1.

    We can compute \(s^{4}_0\), \(\cdots \), \(s^{4}_7\) in Round 4 and the first two S-box operations of \(W_{16}\), which requires 80 qubits to store the 10 output bytes of the S-box. Since we have 160 zero qubits (qubits 224–255 and 384–511 of state0 in Fig. 4), we have \(160-80=80\) qubits left for ancilla qubits. As a result, we can obtain a depth-qubit trade-off \(i=2\) for these 10 S-box operations. That is, we can implement these 10 S-box operations with 80 ancilla qubits, 460 Toffoli gates, 3320 CNOT gates and 40 NOT gates. The Toffoli depth of these 10 S-box operations is 37.

  2.

    We can remove \(r^{3}_0\), \(\cdots \), \(r^{3}_7\) in Round 3 by adopting Algorithm 7. Since we have 80 zero qubits (the 240–255 and 448–511 qubits in state1 in Fig. 4), we can obtain a depth-qubit trade-off \(i=3\) for these 8 S-box\(^{-1}\) operations. That is, we can implement these 8 S-box\(^{-1}\) operations with 80 ancilla qubits, 504 Toffoli gates, 2728 CNOT gates and 192 NOT gates. The Toffoli depth of the 8 S-box\(^{-1}\) operations is 60.

  3.

    We can compute \(s^{4}_8\), \(\cdots \), \(s^{4}_{15}\) in Round 4 and the last two S-box operations of \(W_{16}\), which requires 80 qubits to store the 10 output bytes of the S-box. Since we have 144 zero qubits (qubits 240–319 and 448–511 of state2 in Fig. 4), we have \(144-80=64\) qubits left for ancilla qubits. In other words, we can obtain the depth-qubit trade-off \(i=1\) for the first 4 S-box operations and \(i=0\) for the remaining 6 S-box operations. That is, we can implement the first 4 S-box operations with \(4\times 7=28\) ancilla qubits, 192 Toffoli gates, 1320 CNOT gates and 16 NOT gates, while the remaining 6 S-box operations can be implemented with 36 ancilla qubits, 312 Toffoli gates, 1956 CNOT gates and 24 NOT gates. To sum up, we can implement these 10 S-box operations with 64 ancilla qubits, 504 Toffoli gates, 3276 CNOT gates and 40 NOT gates. The Toffoli depth of these 10 S-box operations is 41.

  4.

    We can remove the \(r^{3}_8\), \(\cdots \), \(r^{3}_{15}\) in Round 3 by adopting Algorithm 7. Since we have 64 zero qubits here (the 256–319 qubits in state3 in Fig. 4), we can obtain a depth-qubit trade-off \(i=1\) for these 8 S-box\(^{-1}\) operations. That is, we can implement these 8 S-box\(^{-1}\) operations with 64 ancilla qubits, 544 Toffoli gates, 2688 CNOT gates and 192 NOT gates. The Toffoli depth of the 8 S-box\(^{-1}\) operations is 61.

  5.

    We shall implement the MixColumns and AddRoundKey operations so as to obtain Round 4. The MixColumns operation for 128-bit state requires \(277\times 4=1108\) CNOT operations. According to the round-key algorithm of AES-128, after the SubWord operation, we still need \(32\times 8=256\) CNOT gates and 1 NOT gate to compute \(W_{16}\), \(W_{17}\), \(W_{18}\), \(W_{19}\). As a result, we can implement the AddRoundKey operation with 256+128 = 384 CNOT gates and 1 NOT gate.

To sum up, we need 2012 Toffoli gates, 13504 CNOT gates and 465 NOT gates to obtain Round 4 and remove Round 3. The Toffoli depth of the above five steps is 199. Since the time and memory cost of the remaining two rounds in Part 2 is similar to the above operation, we just provide the results and omit the details. First, we require 1928 Toffoli gates, 13556 CNOT gates and 465 NOT gates to obtain Round 3 and remove Round 2. The Toffoli depth of this transformation is 194. Second, we require 1968 Toffoli gates, 13548 CNOT gates and 465 NOT gates to obtain Round 2 and remove Round 1, while the Toffoli depth is 157.
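The Round 4 totals follow directly from the five phases. As a bookkeeping sketch (the phase labels and variable names are illustrative, all figures are taken from the text), the sums can be checked as follows:

```python
# (Toffoli, CNOT, NOT, Toffoli depth) for the five phases of
# "compute Round 4, remove Round 3".
round4_phases = [
    (460, 3320,       40,  37),  # phase 1: 10 S-boxes, trade-off i = 2
    (504, 2728,       192, 60),  # phase 2: 8 inverse S-boxes
    (504, 3276,       40,  41),  # phase 3: 4 S-boxes (i = 1) + 6 (i = 0)
    (544, 2688,       192, 61),  # phase 4: 8 inverse S-boxes
    (0,   1108 + 384, 1,   0),   # phase 5: MixColumns + AddRoundKey
]

# Phase 3 is itself the sum of two groups of S-boxes.
phase3_split = [(192, 1320, 16), (312, 1956, 24)]


def column_sums(rows):
    """Add up each column of a list of equal-length tuples."""
    return tuple(sum(col) for col in zip(*rows))


round4_total = column_sums(round4_phases)

# Per-round (Toffoli, Toffoli depth) for Rounds 2, 3 and 4.
part2_rounds = [(1968, 157), (1928, 194), (2012, 199)]
part2_toffoli, part2_depth = column_sums(part2_rounds)
```

Summing the three rounds of Part 2 in the same way gives 5908 Toffoli gates and a Toffoli depth of 550.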

The Time and Space Cost of Part 3. Part 3 contains six similar rounds. In the following, we show the time and memory cost of obtaining Round 5 and removing Round 4.

Fig. 5. Our method for computing Round 5 and removing Round 4 of AES-128.

Then we can compute the time and memory cost of the other rounds in Part 3 in a similar way. As shown in Fig. 5, we can divide the above transformation into 5 phases.

  1.

    We can compute \(s^{5}_0\), \(\cdots \), \(s^{5}_{7}\) in Round 5 and the first two S-box operations of \(W_{20}\). Since we have 128 zero qubits (qubits 256 to 383 of state0 in Fig. 5) and need 64 of them, \(|0\rangle ^{\otimes 64}\), to store \(s^{5}_0\), \(\cdots \), \(s^{5}_{7}\), we have 128−64 = 64 qubits left for ancilla qubits. Since Algorithm 4 and Algorithm 5 require 6 and 7 ancilla qubits respectively, we need \(6\times 8+2\times 7=62\) qubits to run Algorithm 4 eight times and Algorithm 5 twice in parallel. Then we have \(64-48-14=2\) ancilla qubits left, which provide one more ancilla qubit for each of the first 2 S-boxes of \(W_{20}\). That is, we can implement the first 2 S-boxes of \(W_{20}\) with 16 ancilla qubits, 128 Toffoli gates, 706 CNOT gates and 8 NOT gates, while the 8 S-boxes of Round 5 can be implemented with 48 ancilla qubits, 416 Toffoli gates, 2608 CNOT gates and 32 NOT gates. To sum up, we can implement these 10 S-box operations with 64 ancilla qubits, 544 Toffoli gates, 3314 CNOT gates and 40 NOT gates. The Toffoli depth of these 10 S-box operations is 56, which is determined by Algorithm 5.

  2.

    We can remove \(r^{4}_0\), \(\cdots \), \(r^{4}_7\) in Round 4 by computing eight S-box\(^{-1}\) operations with Algorithm 7. Since we have 64 qubits left for ancilla qubits (see state1 in Fig. 5), we can obtain a depth-qubit trade-off \(i=1\) for these 8 S-box\(^{-1}\) operations. That is, we can implement these 8 S-box\(^{-1}\) operations with 64 ancilla qubits, 536 Toffoli gates, 2696 CNOT gates and 192 NOT gates. The Toffoli depth of these 8 S-box\(^{-1}\) operations is 60, because we can implement these 8 S-box\(^{-1}\) operations in parallel.

  3.

    We can compute \(s^{5}_8\), \(\cdots \), \(s^{5}_{15}\) in Round 5 and the last two S-box operations of \(W_{20}\). Similar to Step 1, we also have 2 ancilla qubits left, which gives a depth-qubit trade-off \(i=1\) for the last 2 S-box operations in \(W_{20}\). As in Step 1, we can implement these 10 S-box operations with 64 ancilla qubits, 544 Toffoli gates, 3314 CNOT gates and 40 NOT gates. The Toffoli depth of these 10 S-box operations is 56.

  4.

    We shall remove \(r^{4}_8\), \(\cdots \), \(r^{4}_{15}\) of Round 4 in state3 by implementing eight S-box\(^{-1}\) operations with Algorithm 7. Since we have 64 ancilla qubits here, we can implement these 8 S-box\(^{-1}\) operations with 64 ancilla qubits, 536 Toffoli gates, 2696 CNOT gates and 192 NOT gates. The Toffoli depth of the 8 S-box\(^{-1}\) operations is 60.

  5.

    We shall implement the MixColumns and AddRoundKey operations so as to obtain Round 5. The four MixColumns operations require \(277\times 4=1108\) CNOT gates. According to the key schedule of AES-128, after the SubWord operation, we still need \(32\times 8=256\) CNOT gates and 1 NOT gate to compute \(W_{20}\), \(W_{21}\), \(W_{22}\), \(W_{23}\). As a result, we can implement the AddRoundKey operation with 256 + 128 = 384 CNOT gates and 1 NOT gate.

Table 6. The quantum resources for AES-128, AES-192 and AES-256.

That is, we need 2160 Toffoli gates, 13512 CNOT gates and 465 NOT gates to obtain Round 5 and remove Round 4, while the Toffoli depth is 232. We can compute the time and space cost of the remaining 5 rounds in Part 3 in a similar way. However, different rounds of AES-128 incur different costs in the AddRoundKey operation. According to the key schedule of AES-128, we need \(256\times 3=768\) CNOT gates and \(1\times 3=3\) NOT gates to generate the 3 round keys of Round 6, Round 7 and Round 8, while the round keys of Round 9 and Round 10 require \(256\times 2=512\) CNOT gates and \(4\times 2=8\) NOT gates.
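The different NOT counts come from the Hamming weight of the round constant, since XORing a classical constant into a register costs one NOT gate per set bit. A short sketch, assuming the standard AES round constants (the function name `rcon` is illustrative):

```python
def rcon(i):
    """First byte of the AES round constant Rcon(i), i >= 1."""
    rc = 1
    for _ in range(i - 1):
        rc <<= 1            # multiply by x in GF(2^8)
        if rc & 0x100:
            rc ^= 0x11B     # reduce modulo x^8 + x^4 + x^3 + x + 1
    return rc


# Number of NOT gates needed to XOR Rcon(i) into a register.
not_gates = {i: bin(rcon(i)).count("1") for i in range(1, 11)}
```

Rcon(1) to Rcon(8) are single-bit constants, while Rcon(9) = 0x1B and Rcon(10) = 0x36 each have four set bits, which matches the \(1\times 3\) and \(4\times 2\) NOT gates above.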

The time and memory cost of our quantum circuit of AES-128 can be obtained by summing Part 1, Part 2 and Part 3. All in all, our quantum circuit of AES-128 needs 512 qubits, 19788 Toffoli gates, 128517 CNOT gates and 4528 NOT gates. The Toffoli depth of our quantum circuit of AES-128 is 2016 (see Table 6).
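The Toffoli count, NOT count and Toffoli depth can be re-derived from the per-part figures. The sketch below assumes that Rounds 6 to 10 have the same Toffoli count and Toffoli depth as Round 5, and that their NOT count differs from Round 5 only through the round constant (1 NOT gate for Rounds 5 to 8, 4 NOT gates for Rounds 9 and 10). CNOT gates are left out because per-round CNOT figures for Rounds 6 to 10 are not listed in the text. All names are illustrative.

```python
# Per-part figures from the text: (Toffoli, NOT, Toffoli depth).
part1 = (920, 337, 74)
part2 = [(1968, 465, 157), (1928, 465, 194), (2012, 465, 199)]  # Rounds 2-4

# Round 5: four S-box phases, Toffoli and depth summed phase by phase.
round5_toffoli = 544 + 536 + 544 + 536
round5_depth = 56 + 60 + 56 + 60
sbox_nots = 40 + 192 + 40 + 192  # NOT gates inside the S-box phases

# Assumption: only the Rcon weight changes between Rounds 5 and 10.
rcon_weight = {5: 1, 6: 1, 7: 1, 8: 1, 9: 4, 10: 4}
part3 = [(round5_toffoli, sbox_nots + rcon_weight[r], round5_depth)
         for r in range(5, 11)]

rounds = [part1] + part2 + part3
total_toffoli = sum(r[0] for r in rounds)
total_not = sum(r[1] for r in rounds)
total_depth = sum(r[2] for r in rounds)
```

Under these assumptions the sums reproduce the 19788 Toffoli gates, 4528 NOT gates and Toffoli depth 2016 stated above.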

6.2 Quantum Circuit Implementations of AES-192 and AES-256

Since our quantum circuit implementations of AES-192 and AES-256 are similar to that of AES-128, we just state the results and omit the details (see Table 6). Our quantum circuit of AES-192 requires 640 qubits, 22380 Toffoli gates, 152378 CNOT gates and 5128 NOT gates. The Toffoli depth of our quantum circuit implementation of AES-192 is 2022. Our quantum circuit of AES-256 requires 768 qubits, 26774 Toffoli gates, 177645 CNOT gates and 6103 NOT gates. The Toffoli depth of our quantum circuit implementation of AES-256 is 2292.

7 Conclusion

In this paper, we propose some improved quantum circuit implementations of AES. Several research directions remain for future work. First, we can explore possible time-space trade-offs for our quantum circuit of AES by using Kim et al.’s work. Second, we can explore improved quantum circuits for other constructions, such as Feistel-SPN. Third, we can explore improved quantum circuits for the S-boxes of other block ciphers, such as SM4 and Camellia.