Quantum process identification: a method for characterizing non-markovian quantum dynamics

Ryan S Bennink; Pavel Lougovski

doi:10.1088/1367-2630/ab3598

1. Introduction

The experimental characterization of dynamical processes is fundamental to physics. In the burgeoning field of quantum information science, the characterization of quantum processes in a general, systematic way has become particularly important. For example, in quantum key distribution, characterization of the information-preserving properties of the quantum channel between communicating parties is crucial to establishing the security of the generated key [1]. In the circuit model of quantum computing, a computation is represented as a sequence of primitive quantum logic operations or 'gates', each of which is implemented via a controlled quantum process. Characterization of these processes essential to assessing and improving their performance [2–5]. The paradigmatic way of characterizing a process involving some quantum system of interest is quantum process tomography (QPT) [6]. In QPT the process is tested many times using a variety of initial states and final measurements that collectively span the state space of the system. The process is then expressed as a quantum channel—a linear, completely positive, trace-preserving (CPTP) map on the system state space—that can be estimated from the experimental results.

A modern application of QPT (and the main context of this work) is to construct models for the behavior of fabricated quantum computing devices. Each qubit operation or 'gate' the device is capable of performing is characterized via QPT as a quantum channel on the targeted qubits. The execution of an arbitrary gate sequence is then modeled as a corresponding sequence of quantum channels on the device's qubits. While newer and more robust characterization methods such as gate set tomography (GST) [3, 5, 7] and randomized benchmarking (RB) [4, 8–14] have largely replaced QPT, these methods continue the approach of modeling each gate as a quantum channel involving only the targeted qubits.

This nearly ubiquitous approach to quantum device characterization is not valid, however, when the dynamics of interest are non-Markovian. For the purposes of this work, a process is Markovian if the dynamical map for any given time interval can be expressed as a sequence of maps representing the dynamics over successive subintervals [15]. Informally, this is the property that the (statistical) state of the system at any point in time is sufficient to determine its (statistical) state at at any future time; there is no additional dependence on prior states of the system or on any external degrees of freedom. By this definition a sequence of quantum channels constitutes a Markovian process. While the theory of non-Markovian quantum dynamics has received growing attention in recent years [16, 17], experimental strategies for characterizing such dynamics [18–20] are less well developed. Meanwhile, incoherent (Markovian) noise processes in quantum information technologies have been reduced to the point that non-Markovian errors are starting to becoming significant [3, 5].

In this paper we present a general, systematic method to empirically characterize arbitrary 'black box' quantum processes (i.e. processes not involving intermediate measurements), including non-Markovian dynamics of open quantum systems. More specifically, we show how to construct a minimal dynamical model of a quantum system solely from observations of that system at selected times. We call this task quantum process identification (QPI) in analogy with the systems engineering task known as system identification [21, 22]. Operationally, QPI consists of time-resolved process tomography followed by a numerical search for an appropriate model. But unlike QPT and its modern replacements, QPI assumes almost nothing about the system of interest, its environment, or their joint dynamics. In particular it does not restrict analysis to a presupposed system of interest, but identifies and characterizes as many degrees of freedom as are needed to account for the observations. It assumes no particular model, nor even quantum mechanics specifically. QPI assumes only that (1) there exists a finite-dimensional probabilistic model capable of predicting the observed behavior to desired accuracy, and (2) experiments on the system are statistically independent. Furthermore, QPI is fully self-contained; it requires no calibrated measurements or physical references. In the context of quantum computing, QPI can be used to better assess the quality of qubit operations and inform the development of more effective control and error mitigation techniques. Dynamical models produced by QPI may also be used to assess non-Markovianity or quantumness in processes of interest; however, such questions are beyond the scope of this work. We present QPI simply as a very general method—free of the usual assumptions of Markovianity—to obtain an accurate, empirically-derived mathematical model of a dynamical system's observable behavior.

The development of QPI was motivated by studies addressing the limitations of GST and RB for processes with non-Markovian aspects [5, 23–25] and by the emergence of ad-hoc adaptations of these methods to characterize specific types of non-Markovian errors [13, 26]. QPI provides a general, systematic solution to these challenges. (Unlike GST and RB, however, the version of QPI presented here characterizes only an individual gate, not a set of gates; see section 6 for further discussion.) As will be explained, our method for identifying quantum processes is based on a classic technique for reconstructing classical discrete-time linear systems from time traces, a technique also utilized in [27] for Markovian quantum processes. [19, 20, 28, 29] also present general frameworks for representing and characterizing non-Markovian quantum dynamics, but address slightly different situations than QPI and employ rather different formalisms. [28, 29] consider quantum communication channels with memory, wherein the system is prepared afresh for each use of the channel; in QPI, each occurrence of the process in question begins with the state resulting from the preceding occurrence, a model more appropriate in computing contexts. [20] considers controlled operations on a system interleaved with system-environment interactions and presents a way to determine a tensor that relates fixed-length control sequences to resulting system states. [19] has similar aims to the present work, but uses an autoregressive model to express non-Markovian features. QPI is also reminiscent of the task of inferring a quantum hidden Markov model (QHMM) [30–33], but with a key difference: in a QHMM each iteration of the process produces an observable output, whereas QPI characerizes 'black box' processes that can only be interrogated via disruptive measurements.

Other areas of study closely related to QPI include the theory of non-Markovian quantum dynamics [16, 17], the theory of quantum system identifiability [34], limitations of the quantum channel formalism [35–37], and the construction of effective bath models to simulate open quantum dynamics [38, 39]. More broadly, QPI falls under the general topic of learning quantum states and dynamics [32, 40–52], including dimension estimation [53–57] and determination of causal structure [58].

In the remainder of this article we give a detailed description and numerical demonstration of QPI. We begin in section 2 with a thought exercise to develop the intuition underlying QPI and illustrate several important principles. In section 3 we formulate the problem and review a model inference technique from the field of linear systems engineering that underlies QPI. The QPI protocol itself is described in section 4, followed in section 5 by simulation results which demonstrate the effectiveness of QPI for representative non-Markovian quantum processes. Lastly, in section 6 we summarize our results and identify possible directions of further development. Mathematical and algorithmic details are provided in the Appendices.

2. A thought exercise

The means of detecting and characterizing non-Markovian behavior can be understood through a simple example. Consider a slot machine which produces a 0 or 1 each time its lever is pulled (figure 1). A frequent player 'Alice' notices that each time the machine is powered on, the first two digits produced are equally likely to be 00, 01, 10, or 11. If this were the extent of Alice's observations, she might reasonably conclude that pulling the lever simply outputs a random digit. That is, she would expect the probability of observing a 1 would be 0.5, regardless of the value of previously observed digits.

**Figure 1.** A fictional slot machine illustrates how repeated observations of a system over time can be used to infer the existence of, and to model, latent degrees of freedom responsible for non-Markovian behavior. *(left)* The slot machine produces a digit (0 or 1) each time its lever is pulled. A frequent player observes that each time the slot machine is powered on it exhibits one of two distinct behavior patterns: the output either alternates with each pull of the lever or remains constant. This behavior cannot be explained by any stochastic model involving the only the most recently output digit. *(right)* The behavior can be explained by positing the existence of an ancillary bit that determines whether the output is constant or alternating.
Download figure:
Standard image High-resolution image

But suppose Alice observes longer output sequences and discovers that each time the machine is powered on, one of two distinct behaviors occurs: In half the cases, each pull of the lever repeats whatever digit appeared first. In the remaining cases, the output alternates between 0 and 1. Note that the value of the (k + 1)th digit is not predicted by the kth digit alone (the conditional probability is 1/2), but it is predicted by the (k − 1)th digit (they are always the same). In other words, the next output of the machine depends not only on the most recent output, but on prior outputs; the output is non-Markovian. Alice can explain the observed behavior by positing the existence of a 'control' bit inside the machine which determines whether the output is constant or alternating. The control bit is set to a random value when the machine is powered on and does not change.

Let us now imagine that the slot machine contains an internal counter which causes the control bit to change state after T pulls, thereby switching the behavior from constant to alternating or vice versa. If Alice only observes sequences of length ≤T, she will see nothing that indicates that the behavior of longer sequences is any different; her model will be incomplete.

Although this example is not a perfect analogy for QPI—the slot machine yields an output at every iteration whereas a disruptive measurement is needed to query a quantum process—it illustrates several relevant principles. The first principle is that repeated interactions between a system and non-forgetful ancillary degrees of freedom can yield non-Markovian system behavior. (Note that if the control bit was randomized at each time step, the target bit would be randomized at each time step independent of its previous value. In this case, there would be no need to track the state of the control bit; a stochastic model of the target bit alone would suffice.) The second principle is that the pattern of system behavior over an extended period of time can be used to infer the existence of, and determine the dynamics of, hidden degrees of freedom. In this example the inference could be made simply by inspection, but we will show how to do it in a systematic way. The third principle is that since experiments can have only finite duration, it is not possible to guarantee that one has observed enough behavior to obtain a complete model, i.e. with enough degrees of freedom to accurately predict all future behavior. But as we will show, it is possible to guarantee that an experimental characterization yields a complete model if one has an upper bound for the effective dimension of the process.

3. Theory of QPI

3.1. Problem formulation

The problem we address is to how to experimentally characterize the behavior of a quantum system of interest under some dynamical process, without assuming anything about the system's size or composition or the nature of the process. The process in question may be a repeatable procedure ${ \mathcal P }$ with definitive start and end, or it may be a continuous, ongoing dynamic. In the latter case we let ${ \mathcal P }$ denote evolution for some fixed time Δτ, where Δτ becomes the time resolution at which the dynamics are resolved.

Ultimately, all that can be known about a system comes from one's interactions with it. The kind of characterization we develop is relative to, and in terms of, the available ways of interacting with the system. In this work we consider only interactions in the form of 'experiments' in which the system is prepared, made to evolve under ${ \mathcal P }$ , and subsequently measured. (In section 6 we will speculate how QPI may be extended to more complicated interactions.) Let ${ \mathcal I }$ be the set of ways the system can be prepared (initialized) and let ${ \mathcal M }$ be the set of ways it can be measured. While in principle there exists a continuum of possible initializations and possible measurements, in practice one can utilize only finitely many of these; thus we assume ${ \mathcal I }$ and ${ \mathcal M }$ are finite sets¹ . An experiment is a specified by a triple $(i,t,m)$ which specifies the initialization $i\in { \mathcal I }$ , followed by t repetitions of ${ \mathcal P }$ , followed by the measurement $m\in { \mathcal M }$ . For notational convenience we suppose that each measurement has only two possible outcomes, YES or NO. (A measurement with k possible outcomes can be cast as k distinct YES/NO measurements.) Under these premises, the observable behavior of the system is the set of YES outcome probabilities for all possible experiments. The YES probability for experiment (i, t, m) will be denoted by ${F}_{i,m}^{(t)}$ .

It is worth reiterating that the characterization of ${ \mathcal P }$ produced by QPI is completely relative to the set of initializations ${ \mathcal I }$ and set of measurements ${ \mathcal M }$ considered. ${ \mathcal I }$ and ${ \mathcal M }$ need not be tomographically complete for the system, but if they are not then the resulting characterization may not cover the full state space of the system. Furthermore, if an 'absolute' characterization is desired, ${ \mathcal I }$ and ${ \mathcal M }$ should include the procedures that define the reference frame. As a side result, QPI also yields a characterization of the initializations and measurements relative to one another.

For our purposes, characterization amounts to constructing a mathematical model of the process ${ \mathcal P }$ , initializations ${ \mathcal I }$ , and measurements ${ \mathcal M }$ . We opt not to use the traditional Hilbert-space formulation of quantum mechanics, but instead the more general framework known as generalized probabilistic theory (GPT) [59–61]. GPT is a framework for a class of physical theories that includes quantum mechanics and classical mechanics as special cases. For us GPT has two appealing features: first, physical phenomena are described in operational terms. Rather than expressing phenomena in terms of complex operators that are not directly observable, GPT expresses phenomena in terms of the outcome probabilities of measurements that could be performed. This matches the task of experimental characterization extremely well. Secondly, the GPT formalism does not distinguish between quantum and classical behaviors; it treats all behaviors in a uniform way. This is useful because although the system of interest and the environment are nominally quantum, their behavior may have some purely classical aspects. GPT allows one to construct the most economical model of a process without choosing it to be quantum or classical. On the other hand, some symmetries required by quantum mechanics (such as complete positivity) are less conveniently expressed in GPT. We emphasize that QPI is not tied to the GPT formalism; if desired, QPI could be formulated entirely in terms of the traditional Hilbert space formalism.

One approach to representing non-Markovian dynamics is to construct dynamical maps that are not linear and/or not CPTP [36]. Another strategy is to go beyond time-local maps and use structures that explicitly encode temporal correlations, such as spectral densities [62, 63], autoregression parameters [19], or process tensors [20, 64]. QPI takes a different approach: the system's temporal evolution (whether Markovian not) is modeled by a fixed Markovian process involving both the system of interest and a sufficiently large abstract environment [65]. (This is analogous to Stinespring dilation of a CPTP channel [66, 67].) That is, an extended Markovian model is formulated to predict the system's observable (possibly non-Markovian) behavior. In technical terms, the central premise of QPI is that the system's observable behavior can be modeled to desired accuracy by a sufficiently large finite-dimensional, time-independent, time-local dynamical model² . Significantly, such a model can be inferred from measurements that nominally involve the system alone. This is important because the experimentalist often does not have access to, or even knowledge of, all the pertinent parts of the environment. By inferring a model on an extended state space, QPI implicitly characterizes the involvement of pertinent environmental degrees of freedom (though these degrees of freedom are abstract and might not be readily associated with particular physical degrees of freedom).

The only other assumption of QPI is that each experiment is statistically independent. One way of satisfying this second assumption is to use strong initializations that erase (as far as can be detected) any correlations between the system and the environment due to previous experiments. For example, the initialization procedure may include waiting a time much longer than the plausible memory lifetime of the environment. Note, however, that we allow the initialization procedure itself to correlate the system and environment. A weaker way of satisfying the independence assumption is to randomize the order of the experiments so that any residual correlations between the system and environment at the start of each experiment are sampled fairly. In this case, QPI infers a distribution of correlated initial states.

In QPI, a model of dimension d consists of a state vector ${s}_{i}\in {{\mathbb{R}}}^{d}$ for each initialization procedure $i\in { \mathcal I }$ , a property vector³ ${p}_{m}\in {{\mathbb{R}}}^{d}$ for each measurement procedure $m\in { \mathcal M }$ , and a transfer matrix $T\in {{\mathbb{R}}}^{d\times d}$ for the process ${ \mathcal P }$ . (The relation between these model elements and conventional quantities is shown in table 1.) Then the YES probability of experiment (i, t, m) is

$\begin{eqnarray}&&{F}_{i,m}^{(t)}={s}_{i}{T}^{t}{p}_{m}.\end{eqnarray} \tag{ 1 }$

(We have adopted the convention in which states are row vectors and operators are applied on the right.) In terms of the matrices $S=[\begin{array}{ccc}{s}_{1}; & {s}_{2}; & \ldots \end{array}]$ and $P=[\begin{array}{ccc}{p}_{1} & {p}_{2} & \ldots \end{array}]$ , we have

$\begin{eqnarray}&&{F}^{(t)}={{ST}}^{t}P.\end{eqnarray} \tag{ 2 }$

The goal of QPI is to find matrices S, P, and T that accurately predict F^(t) for all t = 0, 1, 2, ..., i.e. the observable behavior of the system as defined above. Here S and P are a characterization of the state preparations and measurements, respectively, while T characterizes the process ${ \mathcal P }$ itself. We note that these matrices can be determined only up to a similarity transformation, since for any invertible matrix G the mapping $S\to {SG}$ , $P\to {G}^{-1}P$ , $T\to {G}^{-1}{TG}$ leaves all probabilities unchanged. This indeterminacy of representation is known as 'gauge indeterminacy'. While gauge indeterminacy does lead to a minor theoretical conundrum regarding the assessment of process fidelity [68], it has (by definition) no experimental consequence and thus is not a limitation of QPI.

Table 1. Elements of a QPI model and the corresponding objects in the Hilbert-space formulation of quantum mechanics.

QPI model element	Corresponding Hilbert-space object
State row vector s	Density operator ρ
Property vector p	Observable Hermitian operator A
Transfer matrix T	CPTP map ${ \mathcal E }$
Expectation value sTp	Expectation value $\mathrm{Tr}(A{ \mathcal E }(\rho ))$

In the case of a continuous-time process, QPI can be used to obtain an effective master equation. Let $\rho (\tau )\in {{\mathbb{R}}}^{d}$ denote the extended system-environment state at time τ with F^(τ) ≡ ρ(τ)P. From (2) we have $\rho (t{\rm{\Delta }}\tau )={{ST}}^{t}=\rho ((t-1){\rm{\Delta }}\tau )T$ where Δτ is the chosen time resolution. For sufficiently small Δτ, T ≈ 1 + LΔτ for some matrix L. Then ρ approximately obeys the master equation

$\begin{eqnarray}&&\displaystyle \frac{{\rm{d}}\rho (\tau )}{{\rm{d}}\tau }=\rho (\tau )L.\end{eqnarray} \tag{ 3 }$

3.2. The Ho–Kalman method

A key to solving the problem of characterizing non-Markovian quantum processes is a technique that was devised half a century ago in the context of discrete-time linear systems and is now a standard topic in introductory texts on systems engineering [21, 22]. Consider a classical input-output system whose output $x\in {{\mathbb{R}}}^{n}$ at each time $t\in {\mathbb{Z}}$ is a linear function of its input $u\in {{\mathbb{R}}}^{m}$ at the previous d times. Ho and Kalman [69] showed that the system can be described by a d-dimensional state model of the form

$\begin{eqnarray}\begin{array}{rcl}s(t+1) & = & u(t)A+s(t)B\\ x(t) & = & s(t)C\end{array}\end{eqnarray} \tag{ 4 }$

where $s\in {{\mathbb{R}}}^{d}$ , $A\in {{\mathbb{R}}}^{m\times d}$ , $B\in {{\mathbb{R}}}^{d\times d}$ , and $C\in {{\mathbb{R}}}^{d\times n}$ . Furthermore, they showed how to obtain the coefficients A, B, C from the observable quantities X(1), ..., X(2d) where X_i,j(t) is the value of the output x_j(t) in response to a unit impulse on the ith input (u_i) at time 0. In this case $X(t+1)={{AB}}^{t}C$ . Let

$\begin{eqnarray}H=\left[\begin{array}{ccc}X(1) & \cdots & X(d)\\ \vdots & & \vdots \\ X(d) & \cdots & X(2d-1)\end{array}\right]\end{eqnarray} \tag{ 5 }$

and

$\begin{eqnarray}{H}^{{\prime} }=\left[\begin{array}{ccc}X(2) & \cdots & X(d+1)\\ \vdots & & \vdots \\ X(d+1) & \cdots & X(2d)\end{array}\right].\end{eqnarray} \tag{ 6 }$

The key fact here is that H and ${H}^{{\prime} }$ have related rank-d factorizations, H = LR and ${H}^{{\prime} }={LBR}$ where $L=[A;{AB};\,\cdots ;\,{{AB}}^{d-1}]$ and $R=[{CBC}\cdots {B}^{d-1}C]$ . The Ho–Kalman result is that a model of the form (4) can be obtained by finding any rank-d factorization H = LR, taking the first m rows of L for A, the first n columns of R for C, and taking $B={L}^{+}{H}^{{\prime} }{R}^{+}$ where ${}^{+}$ denotes the Moore–Penrose pseudo-inverse.

The relevance of this result to QPI becomes clear upon noting that $X(t+1)={{AB}}^{t}C$ has the exactly same form as equation (2). From the perspective of process tomography, A describes the preparable states and C describes the measurable properties. When $d\gt \min (m,n)$ these do not span the state space of the system, yet a full reconstruction of the process is still possible. That is, observations of the system over a sufficiently long period of time can yield complete information about the system, even when the preparable states and measurable properties are not intrinsically complete [27]. The critical implication of this result for QPI is that the properties of a system of interest over time are sufficient to reconstruct an accurate model of all relevant dynamics, including the dynamics of any external degrees of freedom with which the system interacts.

Mathematically, the construction of a d-dimensional model from H is possible because H has rank d. What if A and C are already tomographically complete? In that case it ought to be sufficient to use just X(1) = AC for H and X(2) = ABC for ${H}^{{\prime} }$ , as in QPT. In this case, the textbook Ho–Kalman method asks for more data than is actually necessary. The following Lemma, proved in appendix A, states that the number of time steps needed to ensure a complete characterization is determined by the number of latent (unmeasured) degrees of freedom only, not the total number of degrees of freedom:

Lemma 1. Let $X(t)$ be the impulse response function for an arbitrary linear, discrete-time, time-shift invariant system of dimension d. Let $r={\max }_{t}\mathrm{rank}\ X(t)$ denote the number of directly accessible dimensions and let $l\equiv d-r$ denote the number of latent dimensions. Let

$\begin{eqnarray}{H}^{(t;k)}\equiv \left[\begin{array}{cccc}X(t) & X(t+1) & \cdots & X(t+k)\\ X(t+1) & & & \vdots \\ \vdots & & & X(t+2k-1)\\ X(t+k) & \cdots & X(t+2k-1) & X(t+2k)\end{array}\right].\end{eqnarray} \tag{ 7 }$

For $k\geqslant l$ , $\mathrm{rank}\ {H}^{(t;k)}=d$ , i.e. ${H}^{(t;k)}$ captures the full dimensionality of the system. Furthermore a minimal ( $d$ -dimensional) model of the system can be constructed from ${X}^{(1)},\,\ldots ,\,{X}^{(2l+2)}$ . In particular, if ${LR}$ is any rank factorization of ${H}^{(1;l)}$ , then $X(t+1)={{AB}}^{t}C$ where $A$ is the first $m$ rows of $L,C$ is the first $n$ columns of $R$ , and $B={L}^{+}{H}^{(2;l)}{R}^{+}$ . Some $d$ dimensional systems can be reconstructed from fewer observations and/or observations at non-consecutive times, but some cannot. Thus $k\geqslant l$ is necessary to guarantee complete characterization of an unknown $d$ dimensional system.

The Ho–Kalman method is the basis of our approach to QPI. However, there are two important differences between a (classical) linear input-output system and a quantum process that require consideration:

In the Ho–Kalman problem X_i,j(t) is (in principle) a directly observable quantity, whereas in QPI the measurable quantity ${F}_{i,m}^{(t)}$ is a probability. One consequence is that in QPI, many repetitions of an experiment are needed to obtain each value ${F}_{i,m}^{(t)}$ . Even then, one only obtains an estimate ${\tilde{F}}_{i,m}^{(t)}$ whose precision is limited by the number of repetitions of the experiment. There is a large body of work on the extension of the Ho–Kalman approach to noisy data, but it considers additive or multiplicative Gaussian noise and is not applicable to QPI, where ${\tilde{F}}_{i,m}^{(t)}$ has a binomial distribution whose width depends on the (unknown) value of ${F}_{i,m}^{(t)}$ .
The observables in the Ho–Kalman method are all mutually compatible. In principle, the values ${X}_{i,j}(t)$ for all j and all t ≥ 1 can be obtained in a single experiment with initial state i. In contrast, measuring a quantum system generally disturbs it (an aspect of the uncertainty principle). Thus in QPI, not all measurable properties of the system can be measured at the same time, nor is the evolution of the system after a measurement the same as if the system had not been measured. Each ${F}_{i,m}^{(t)}$ must be estimated from a distinct set of experiments, in which the system evolves unobserved for t process repetitions and is then measured.

In the next section we present a method for QPI based on an extension of the Ho–Kalman approach to the quantum domain.

4. The QPI protocol

4.1. The experimental protocol

QPI is accomplished by performing a series of experiments, each repeated many times (figure 2). Each experiment is designated by a triple (i, t, m) which designates the ith way of initializing the system, followed by t repetitions of the process to be characterized, followed by the mth way of measuring the system. For each experiment (i, t, m) one records ${N}_{i,m}^{(t)}$ , the number of times the experiment was performed, and ${Y}_{i,m}^{(t)}$ , the number of times the YES outcome was observed. The frequency ${\tilde{F}}_{i,m}^{(t)}\equiv {Y}_{i,m}^{(t)}/{N}_{i,m}^{(t)}$ is the experimental estimate of ${F}_{i,m}^{(t)}$ . (Throughout this work, an experimental estimate of a quantity Q is denoted by $\tilde{Q}$ .)

The set of experiments performed must be chosen with some care in order for QPI to be effective. If the experimental probabilities could be determined perfectly, then by lemma 1 ${F}^{(0)},\,\ldots ,\,{F}^{(2l+1)}$ would be sufficient to characterize any process of dimension $d\leqslant l+\mathrm{rank}\ {F}^{(0)}$ . (Note that $\mathrm{rank}\ {F}^{(0)}$ is just the dimension of the space spanned by the initial states and measurements. If these are tomographically complete for the system of interest, $\mathrm{rank}\ {F}^{(0)}$ is just the system dimension.) However, some of the degrees of freedom may be slow to manifest in the sense that it takes many repetitions of the process for their impact on observable properties to become significant. In such cases the matrix ${H}^{(0;l)}$ is ill-conditioned and even a small amount of statistical error in the data yields an extremely poor characterization. For this reason it is advantageous to perform experiments covering a wide range of t values. As a general rule, an experiment with $2t$ process repetitions is twice as sensitive to the process parameters T as an experiment with t process repetitions [7].

The other main consideration is that the Ho–Kalman reconstruction technique requires the matrix H to have a generalized Hankel block structure:

$\begin{eqnarray}H=\left[\begin{array}{cccc}{F}^{({i}_{1}+{j}_{1})} & {F}^{({i}_{1}+{j}_{2})} & \cdots & {F}^{({i}_{1}+{j}_{n})}\\ {F}^{({i}_{2}+{j}_{1})} & {F}^{({i}_{2}+{j}_{2})} & & {F}^{({i}_{2}+{j}_{n})}\\ \vdots & & & \vdots \\ {F}^{({i}_{m}+{j}_{1})} & {F}^{({i}_{m}+{j}_{2})} & \cdots & {F}^{({i}_{m}+{j}_{n})}\end{array}\right]\end{eqnarray} \tag{ 8 }$

for some choice of integers i₁, ..., i_m and j₁, ..., j_n.

Our approach to satisfying these criteria is to use multiple 'flights' of experiments, where a flight is a set of experiments {(i, t, m)} with $2\tilde{l}+1$ consecutive t values and all values of i, m for each t (figure 3(a)). By lemma 1, each flight contains enough information to characterize a process of dimension at least $\tilde{l}+\mathrm{rank}\ {F}^{(0)}$ . To efficiently cover a wide range of t values we use flights whose base t values increase exponentially, {0, 1, 2, 4, ..., 2ⁱ}. We then add flights as needed to support the Ho–Kalman structure, equation (8). The result is that the experiments performed are those with t values in the set

$\begin{eqnarray}&&{ \mathcal T }={{ \mathcal T }}_{R}+{{ \mathcal T }}_{C},\end{eqnarray} \tag{ 9 }$

where

$\begin{eqnarray}&&{{ \mathcal T }}_{R}=\{0,1,2,4,\,\ldots ,\,{2}^{a}\}+\{0,1,2,\,\ldots ,\,\tilde{l}\}\end{eqnarray} \tag{ 10 }$

$\begin{eqnarray}&&{{ \mathcal T }}_{C}=\{0,1,2,4,\,\ldots ,\,{2}^{b}\}+\{0,1,2,\,\ldots ,\,\tilde{l}\}.\end{eqnarray} \tag{ 11 }$

Here + is the sumset operation and $a,b,\tilde{l}$ are chosen integers. We found that such a pattern of experiments is generally necessary to obtain accurate models, especially for systems in which the non-Markovian aspects manifest only over long timescales. Notably, the characterization accuracy grows as $\max { \mathcal T }$ , whereas the number of different experiments to be performed grows asymptotically as only ${(\mathrm{logmax}{ \mathcal T })}^{2}$ .

**Figure 3.** Arrangement of experimental data for analysis. (a) A flight is set of experiments with $2\tilde{l}+1$ consecutive values of process repetitions t, and is denoted by its lowest t value. The data for each flight is arranged into a matrix. (b) For dimension estimation and initial model estimation, the data is arranged to form a large Ho–Kalman matrix. (The matrices for flights with small t partially overlap.) (c) Nonlinear model fitting is performed by grouping flights into blocks. The data in block k = 1, 2, ..., b is related to the data in block 0 by a factor of ${T}^{{2}^{k-1}}$ , where T is to be determined.
Download figure:
Standard image High-resolution image

**Figure 3.** Arrangement of experimental data for analysis. (a) A flight is set of experiments with $2\tilde{l}+1$ consecutive values of process repetitions t, and is denoted by its lowest t value. The data for each flight is arranged into a matrix. (b) For dimension estimation and initial model estimation, the data is arranged to form a large Ho–Kalman matrix. (The matrices for flights with small t partially overlap.) (c) Nonlinear model fitting is performed by grouping flights into blocks. The data in block k = 1, 2, ..., b is related to the data in block 0 by a factor of ${T}^{{2}^{k-1}}$ , where T is to be determined.
Download figure:
Standard image High-resolution image

4.2. Model inference procedure

Once the experiments are performed, the data is analyzed to infer a model of the process ${ \mathcal P }$ , the initial states, and the measurements. Our basic approach is to seek the model with the highest likelihood for the given observations. This is a challenging task for two reasons: (1) The size of the model (number of dimensions) is unknown. (2) The likelihood is extremely nonlinear in T: T appears with powers up to $\max { \mathcal T }$ , which in practice can be on the order of 10³ or more. This creates an extremely unforgiving optimization landscape with many local (and very suboptimal) maxima.

Regarding the first challenge, a number of techniques have been devised to compare the likelihood of models with different numbers of parameters (e.g. the Akaike information criterion [70]) or to construct models whose complexity automatically scales with the complexity of the data (such as Dirichlet processes [71]). We use a simple but effective technique to first estimate the dimensionality of the process in question, which then determines the set of parameters to be estimated. The second challenge is essentially the same as that faced in gate set tomography. Building upon the progressive fitting approach described in [7], we devised a 4-stage model inference procedure, carefully honed over the course of our studies to reliably find concise, accurate process models. An overview of our procedure is given below; additional details are given in appendices B–D.

1.
Estimate the Model Dimension. First, the data is arranged into a matrix $\tilde{H}$ which is the experimental estimate of a Ho–Kalman matrix H. $\tilde{H}$ has the structure given in equation (8) with rows indexed by values from ${{ \mathcal T }}_{R}$ and columns indexed by values from ${{ \mathcal T }}_{C}$ . This may be viewed as an arrangement of the data from all the flights as shown in figure 3. The effective dimension d is then estimated as the number of statistically significant non-zero singular values of $\tilde{H}$ [72], where significance is assessed by a chi-square test. In brief, the uncertainties in the experimental data are estimated and propagated to yield uncertainties in the singular values of $\tilde{H}$ . This yields for each k = 1, 2, ... a threshold for the sum of squares of the k smallest singular values. If for some k the sum falls below the threshold, those singular values are discounted (deemed insignificant). The dimension d is taken to be the number of singular values that remain after the largest subset of singular values has been discounted. We note that this procedure does not determine the inherent dimension of a process so much as the dimension that is revealed by the data.
2.
Obtain an Initial Model. This stage is a variation of the Ho–Kalman method, modified to account for uncertainty in the experimental data. Let ${\tilde{H}}^{{\prime} }$ be the 'time shifted' version of $\tilde{H}$ , i.e. the matrix obtained by increasing the t value of each element of $\tilde{H}$ by 1. First, the variance of each experimental quantity ${\tilde{F}}_{i,m}^{(t)}$ is estimated. Then a rank-d factorization $\tilde{H}\approx {LR}$ is found by minimizing ${\sum }_{i,m}{W}_{i,m}({LR}-\tilde{H}{)}_{i,m}^{2}$ where W_i,m is the inverse variance of ${\tilde{H}}_{i,m}$ . The process matrix is estimated as the matrix T that minimizes ${\sum }_{i,m}{W}_{i,m}^{{\prime} }({LTR}-{\tilde{H}}^{{\prime} }{)}_{i,m}^{2}$ where ${W}_{i,m}^{{\prime} }$ is the inverse variance of ${\tilde{H}}_{i,m}^{{\prime} }$ . The first few rows of L and columns of R may be taken as initial estimates of S and P, respectively.
3.
Obtain an Improved Model. Although the previous stage uses all the data, it uses only the fact that $\tilde{H}$ and ${\tilde{H}}^{{\prime} }$ are related by a (single) factor of T. It does not use the fact that data from different flights should be related by higher powers of T. To incorporate this relationship into the model estimate, the data is organized into groups of flights or 'blocks' with exponentially increasing t values (figure 3(c)). Starting with the matrices L, T, R obtained in stage 2, a search is performed to find matrices A, T, B that best fit the data in all the blocks. The first few rows of A and columns of B may be taken as improved estimates of S and P, respectively. To ensure that the search finds a good solution and does not get stuck in a local extremum, the search is performed progressively, starting with the lowest-t blocks and gradually incorporating data from blocks with higher t values. The low-t blocks yield coarse estimates of T which are gradually refined using the data in higher-t blocks.
4.
Obtain a physically valid model estimate. The model produced by stage 3 is usually fairly accurate, but tends to make slightly unphysical predictions (i.e. probabilities outside the interval [0,1]). In this last stage, the model likelihood is maximized with the addition of penalty terms to encourage the validity of predicted probabilities. As in stage 3, the log-likelihood is approximated by a weighted sum of squared errors; however, this time the weights are not based on the (fixed) experimental frequencies but are based on the predicted probabilities. Also, in this step the data is not arranged into Ho–Kalman matrices; the model ${F}^{(t)}={{ST}}^{t}P$ (equation (2)) is fit directly to each experiment.

In our simulation studies, the model inference procedure described above proved to be both fast and reliable. In over 99% of the thousands of simulations we performed, it found a high-likelihood model without any intervention. In the few cases in which it did not, making minor adjustments to optimization parameters in stage 3 or 4 (e.g. penalty weights, convergence thresholds) invariably led to success. We found that all four stages were usually necessary to obtain accurate models. On a standard desktop computer, the time to complete all four stages ranged from just a few seconds for 7 dimensional processes to a few minutes for a 19-dimensional process.

We note that the exact arrangement of the data in steps 1–3 does not appear be critical. Beyond satisfying the Ho–Kalman structure, it appears to be important only that each row and column of $\tilde{H}$ cover both a wide range of t values and multiple runs of roughly l consecutive t values.

4.3. Example

To illustrate the QPI procedure let us consider how Alice would use it to characterize a 'black box' version of the slot machine described in section 2. In this version the slot machine has a display screen, a lever, and buttons labeled I0, I1, and M. Pushing I0 or I1 turns the machine on. When the lever is pulled the machine makes a momentary noise but otherwise has no observable effect. Pressing M causes the screen to briefly display the number 0 or 1, after which the machine turns off. In terms of the QPI formalism the initialization procedures are ${ \mathcal I }=\{{\mathsf{I}}{\mathsf{0}},{\mathsf{I}}{\mathsf{1}}\}$ . Pulling the lever performs an unknown process ${ \mathcal P }$ . Measurement is accomplished by pressing M. This is treated as two procedures, M0 and M1, which test whether the outcome is 0 or 1, respectively. Thus ${ \mathcal M }=\{{\mathsf{M}}{\mathsf{0}},{\mathsf{M}}{\mathsf{1}}\}$ .

The behavior of the machine, which Alice will attempt to discover, is as follows: If the machine is turned on using I0, the output after an even number of lever pulls is always 0; after an odd number of pulls it is 0 with probability 0.5 and 1 with probability 0.5. Likewise, if the machine is turned on using I1, the output after an even number of lever pulls is always 1; after an odd number of pulls it is 0 with probability 0.5 and 1 with probability 0.5. This behavior is expressed by the series of matrices F⁽⁰⁾, F⁽¹⁾, F⁽²⁾, ...with

$\begin{eqnarray}{F}^{(t)}\equiv \left[\begin{array}{cc}{F}_{{\mathsf{I}}{\mathsf{0}},{\mathsf{M}}{\mathsf{0}}}^{(t)} & {F}_{{\mathsf{I}}{\mathsf{0}},{\mathsf{M}}{\mathsf{1}}}^{(t)}\\ {F}_{{\mathsf{I}}{\mathsf{1}},{\mathsf{M}}{\mathsf{0}}}^{(t)} & {F}_{{\mathsf{I}}{\mathsf{1}},{\mathsf{M}}{\mathsf{1}}}^{(t)}\end{array}\right]=\left\{\begin{array}{ll}\left[\begin{array}{cc}1 & 0\\ 0 & 1\end{array}\right] & t\ \mathrm{even}\\ \left[\begin{array}{cc}0.5 & 0.5\\ 0.5 & 0.5\end{array}\right] & t\ \mathrm{odd}\end{array}\right..\end{eqnarray} \tag{ 12 }$

Alice seeks a model $(S,T,P)$ such that ${{ST}}^{t}P={F}^{(t)}$ for all t. The rows of $S\in {{\mathbb{R}}}^{2\times d}$ will express the states produced by I0 and I1 respectively in some d-dimensional state space. $T\in {{\mathbb{R}}}^{d\times d}$ will express the action of the lever in this state space; and the columns of $P\in {{\mathbb{R}}}^{d\times 2}$ will express the properties measured by M0 and M1 respectively.

To perform QPI Alice first estimates the probabilities ${F}_{i,m}^{(t)}$ for all $i\in \{{\mathsf{I}}{\mathsf{0}},{\mathsf{I}}{\mathsf{1}}\}$ , all $m\in \{{\mathsf{M}}{\mathsf{0}},{\mathsf{M}}{\mathsf{1}}\}$ , and for selected values of t. Each experiment involves initializing the machine using either I0 or I1 and pulling the lever some fixed number of times t before performing readout M. She repeats each experiment 100 times and records the fraction of times outcome m = {0, 1} occurs as ${\tilde{F}}_{i,m}^{(t)}\approx {F}_{i,m}^{(t)}$ . Alice first obtains

$\begin{eqnarray}{\tilde{F}}^{(0)}=\left[\begin{array}{cc}1 & 0\\ 0 & 1\end{array}\right].\end{eqnarray} \tag{ 13 }$

Since ${\tilde{F}}^{(0)}\approx {F}^{(0)}$ she surmises that $\mathrm{rank}\ {F}^{(0)}=2$ , i.e. the initializations and readout span a 2 dimensional space. Alice decides that she is willing to construct a model that can capture at least one latent dimension, $\tilde{l}=1$ . (In practice she should probably consider a larger value, but here we use a small value for simplicity of presentation.) Each flight of experiments will involve $2\tilde{l}+1=3$ consecutive t values. She chooses the flights to have base values from 0 up to 2¹ + 2¹ = 4. The complete set of t values used in Alice's characterization is ${ \mathcal T }={{ \mathcal T }}_{R}+{{ \mathcal T }}_{C}=\{0,1,2,3,4,5,6\}$ where ${{ \mathcal T }}_{R}={{ \mathcal T }}_{C}=\{0,{2}^{0},{2}^{1}\}+\{0,1\}$ . For the sake of discussion, we suppose Alice obtains

$\begin{eqnarray}{\tilde{F}}^{(0)}={\tilde{F}}^{(2)}={\tilde{F}}^{(4)}={\tilde{F}}^{(6)}=\left[\begin{array}{cc}1 & 0\\ 0 & 1\end{array}\right],\end{eqnarray} \tag{ 14 }$

$\begin{eqnarray}{\tilde{F}}^{(1)}=\left[\begin{array}{cc}0.48 & 0.52\\ 0.53 & 0.47\end{array}\right],{\tilde{F}}^{(3)}=\left[\begin{array}{cc}0.49 & 0.51\\ 0.46 & 0.54\end{array}\right],{\tilde{F}}^{(5)}=\left[\begin{array}{cc}0.52 & 0.48\\ 0.57 & 0.43\end{array}\right].\end{eqnarray} \tag{ 15 }$

Having obtained experimental estimates of F^(t) for a number of t values, Alice proceeds to perform the four steps of data analysis:

1.
Estimate the model dimension. Alice arranges nearly all the data into a Ho–Kalman matrix
$\begin{eqnarray}\tilde{H}=\left[\begin{array}{ccc}{\tilde{F}}^{(0)} & {\tilde{F}}^{(1)} & {\tilde{F}}^{(2)}\\ {\tilde{F}}^{(1)} & {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)}\\ {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)}\\ {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)} & {\tilde{F}}^{(5)}\end{array}\right].\end{eqnarray} \tag{ 16 }$
(The column that would contain ${\tilde{F}}^{(6)}$ is excluded because step 2 will require all the t values to be increased by one, and no data was taken for t = 7.) Using the procedure detailed in appendix B, Alice computes the singular values of $\tilde{H}$ , the cumulative sum of squares of the singular values from smallest to largest, and the corresponding chi-square thresholds:
Since the sum of squares falls below the threshold for d ≥ 4 but above the threshold for d ≤ 3, Alice concludes d = 3.
2.
To obtain an initial estimate, Alice shifts the entries in $\tilde{H}$ by one lever pull to obtain
$\begin{eqnarray}{\tilde{H}}^{{\prime} }=\left[\begin{array}{ccc}{\tilde{F}}^{(1)} & {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)}\\ {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)}\\ {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)} & {\tilde{F}}^{(5)}\\ {\tilde{F}}^{(4)} & {\tilde{F}}^{(5)} & {\tilde{F}}^{(6)}\end{array}\right].\end{eqnarray} \tag{ 18 }$
She numerically obtains a rank-3 factorization ${LR}\approx \tilde{H}$ that minimizes the total weighted error. The first 2 rows of L yield an initial estimate of S while the first two columns of R yield an initial estimate of P. She obtains an initial estimate of T by minimizing total weighted error of the approximation ${LTR}\approx {\tilde{H}}^{{\prime} }$ , keeping L and R fixed. Her initial model estimate is
$\begin{eqnarray}S=\left[\begin{array}{ccc}-0.659 & 0.705 & -0.010\\ -0.657 & -0.709 & -0.016\end{array}\right],\ T=\left[\begin{array}{ccc}1.00 & 0.015 & 0.013\\ 0.005 & -0.064 & -0.839\\ 0.006 & -1.19 & 0.057\end{array}\right],\ \tilde{P}=\left[\begin{array}{cc}-0.761 & -0.758\\ 0.706 & -0.708\\ -0.016 & 0.009\end{array}\right].\end{eqnarray} \tag{ 19 }$
3.
To obtain a better estimate, Alice arranges her data into three blocks:
$\begin{eqnarray}{D}_{0}=\left[\begin{array}{cc}{\tilde{F}}^{(0)} & {\tilde{F}}^{(1)}\\ {\tilde{F}}^{(1)} & {\tilde{F}}^{(2)}\\ {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)}\\ {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)}\end{array}\right],\qquad {D}_{1}=\left[\begin{array}{cc}{\tilde{F}}^{(1)} & {\tilde{F}}^{(2)}\\ {\tilde{F}}^{(2)} & {\tilde{F}}^{(3)}\\ {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)}\\ {\tilde{F}}^{(4)} & {\tilde{F}}^{(5)}\end{array}\right],\qquad {D}_{2}=\left[\begin{array}{cc}{\tilde{F}}^{(2)} & {\tilde{F}}^{(3)}\\ {\tilde{F}}^{(3)} & {\tilde{F}}^{(4)}\\ {\tilde{F}}^{(4)} & {\tilde{F}}^{(5)}\\ {\tilde{F}}^{(5)} & {\tilde{F}}^{(6)}\end{array}\right]\end{eqnarray} \tag{ 20 }$
Alice then uses the numerical search procedure described in appendix C to find $A\in {{\mathbb{R}}}^{8\times 3}$ , $B\in {{\mathbb{R}}}^{3\times 4}$ , and $T\in {{\mathbb{R}}}^{3\times 3}$ such that AB ≈ D₀, ATB ≈ D₁, and ${{AT}}^{2}B\approx {D}_{2}$ . For the starting values of A, T, B she uses L,T, and the first four columns of P from the previous step. This yields
$\begin{eqnarray}S=\left[\begin{array}{ccc}-0.659 & 0.705 & -0.014\\ -0.657 & -0.709 & 0.003\end{array}\right],\ T=\left[\begin{array}{ccc}1.00 & 0.015 & 0.013\\ 0.005 & -0.064 & -0.834\\ 0.006 & -1.19 & 0.057\end{array}\right].\ P=\left[\begin{array}{cc}-0.761 & -0.758\\ 0.706 & -0.708\\ -0.017 & 0.011\end{array}\right].\end{eqnarray} \tag{ 21 }$
4.
Alice improves her estimate even further using the procedure described in appendix D to directly minimize the total error of the model ${{ST}}^{t}P={F}^{(t)}$ over all $t\in { \mathcal T }$ , using the values of S, T, P obtained from step 3. This yields her final estimated model
$\begin{eqnarray}S=\left[\begin{array}{ccc}-0.659 & 0.701 & -0.025\\ -0.657 & -0.712 & 0.000\end{array}\right]\ T=\left[\begin{array}{ccc}1.00 & 0.010 & 0.009\\ 0.005 & -0.061 & -0.840\\ 0.006 & -1.19 & 0.055\end{array}\right]\ P=\left[\begin{array}{cc}-0.764 & -0.755\\ 0.706 & -0.708\\ -0.020 & 0.013\end{array}\right].\end{eqnarray} \tag{ 22 }$

For comparison, an exact minimal model (S, T, P) for this system is

$\begin{eqnarray}{S}^{{\prime} }=\left[\begin{array}{ccc}-0.5 & -0.5 & -0.707\\ 0.5 & 0.5 & -0.707\end{array}\right],\qquad {T}^{{\prime} }=\left[\begin{array}{ccc}-1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{array}\right],\qquad {P}^{{\prime} }=\left[\begin{array}{cc}-0.5 & 0.5\\ -0.5 & 0.5\\ -0.707 & -0.707\end{array}\right].\end{eqnarray} \tag{ 23 }$

Alice's final model does not appear to be close to this. However, recall that models are meaningful only up to similarity transformations. A suitable similarity transformation on Alice's final model yields

$\begin{eqnarray}\begin{array}{c}S=\left[\begin{array}{ccc}-0.486 & -0.498 & -0.707\\ 0.511 & 0.493 & -0.707\end{array}\right],\,T=\left[\begin{array}{ccc}-1.003 & 0.004 & 0.000\\ -0.004 & 0.997 & 0.000\\ 0.006 & 0.000 & 1.000\end{array}\right]\,P=\left[\begin{array}{cc}-0.503 & 0.503\\ -0.497 & 0.497\\ -0.710 & -0.704\end{array}\right].\end{array}\end{eqnarray} \tag{ 24 }$

which clearly approximates equation (23). As it turns out, this model is accurate only for t ≲ 10, a consequence of the fact that Alice tested only small values of t and repeated each experiment relatively few times.

5. Simulations

To demonstrate the effectiveness of QPI, we simulated the characterization of several different non-Markovian processes relevant to quantum computing. Each process consists of a simple unitary operation on a qubit with a weak, physically-motivated non-Markovian error. In each case, the system of interest is the qubit and the non-Markovian behavior of the qubit is generated from a Markovian model of the qubit interacting with some external degree(s) of freedom. For these studies the error processes were chosen to be not only weak but also slow to reveal their non-Markovian nature, in order to present a strong characterization challenge.

For each process, we first simulated the evolution of the quantum state for several different initial states and predicted the time-dependent outcome probabilities for the standard tomographic measurements. The 'standard tomographic measurements' are the measurements of the dimensionless Pauli operators σ_x, σ_y, σ_z. These operators have eigenvalues +1 ('YES') and −1 (' NO') with corresponding eigenstates denoted $| \pm {\rangle }_{x}$ , $| \pm {\rangle }_{y}$ , $| \pm {\rangle }_{z}$ , respectively. For simplicity, state preparation and measurement were assumed to be ideal. Then, we simulated the collection of experimental data by choosing a random number of YES outcomes for each measurement from the computed probability distribution. The simulated data for each experiment was then passed to our data analysis code, which estimated the dimension of the process and found the most likely model using the procedure outlined above. The error of the characterization was calculated by using the inferred model to predict the time-dependent qubit state and comparing the predicted state to the true (simulated) state. The error at each time step was averaged over all initial states and over 500 independent simulated characterizations (200 for the system described in section 5.4).

In each plot below, black dots show the times of the (simulated) measurements and their average error, where error is defined as the trace distance $\tfrac{1}{2}\mathrm{Tr}\left|\rho (t)-{\rho }_{\mathrm{est}}(t)\right|$ between the true state ρ(t) and the state ρ_est(t) estimated from the measurements at time t only. The blue line is the QPI error, defined as the average trace distance between the true state ρ(t) and the state ρ_QPI(t) predicted by QPI, which is based on all the simulated measurements. For comparison, the dotted black line is the corresponding error for perfect QPT, obtained using exact measurement outcome probabilities at t = 0 and t = 1 only. As will be evident, the intrinsically Markovian models produced by process tomography (even with no statistical measurement error) cannot accurately represent the non-Markovian processes considered below.

5.1. Qubit rotation with slowly varying systematic error

In quantum computing technologies, operations on qubits are typically accomplished by applying control pulses that temporarily modify the Hamiltonian to induce a unitary rotation of one or more qubits. In many implementations the amount of rotation is proportional to the pulse energy. A miscalibration of the pulse intensity or duration results in over- or under-rotation of the qubits, which generally contributes to computational error [73].

In this example the process of interest is an intended π rotation of a qubit about the y axis, i.e. a bit flip. We suppose however that the intensity of the control pulse drifts sinusoidally over time, such that the actual angle of rotation produced by the pulse at time t (where each pulse takes one unit of time) is

$\begin{eqnarray}&&{\theta }_{t}=\pi +\epsilon \sin {\rm{\Omega }}t.\end{eqnarray} \tag{ 25 }$

We choose = 0.01 and Ω = 0.02, which yields a small, slow oscillation of the qubit about the intended state in the x–z plane. This oscillation cannot be described by any fixed Markovian process involving only the qubit. In fact, the error can be described as phase modulation, a process which has an infinite number of sidebands. In other words, this process technically has an infinite dimension. However, relatively few dimensions are needed to accurately reproduce typical experimental data.

The evolution of the qubit under this process was simulated for two different initial states, $| +{\rangle }_{z}$ and $| +{\rangle }_{x}$ . For each initial state, tomographic measurements of the qubit were simulated at 304 times ranging from t = 0 to t = 1035, forming 57 flights of length 12. Each measurement was repeated 10⁴ times. In nearly all realizations, model inference yielded an 11-dimensional model. Figure 4 shows the average characterization error as a function of t.

**Figure 4.** Average characterization error (solid line) for a qubit undergoing π rotations with small, sinusoidally varying error in the rotation angle. Also shown for comparison are the raw measurement error (dots) and error of perfect process tomography (dotted line).
Download figure:
Standard image High-resolution image

5.2. Coherent qubit leakage

Leakage refers to the loss of a qubit, or more precisely, the transition of a qubit to a non-computational state. This process is usually modeled as irreversible and incoherent, in which case the qubit has no effect on any subsequent operation. More realistically, qubit operations create and manipulate superpositions of computational and non-computational states [26]. In such cases a qubit can effectively leave and later return to the computational subspace, interfering with the computational evolution. The particular details of the qubit's behavior will depend strongly on the physics of the physical implementation of the qubit and the control mechanisms employed. For this example we use the 3-level anharmonic oscillator model which is applicable to popular qubit technologies including superconducting flux qubits and trapped ions [74]. In this model, the control pulse drives not only the intended $| 0\rangle \leftrightarrow | 1\rangle$ transition, but also off-resonant transitions between $| 1\rangle$ and the next excited state $| 2\rangle$ . The interaction Hamiltonian is

$\begin{eqnarray}&&H={\rm{\Omega }}(t)\left(| 0\rangle \langle 1| +\sqrt{2}| 1\rangle \langle 2| +{\rm{h}}.{\rm{c}}.\right)+{\rm{\Delta }}| 2\rangle \langle 2| ,\end{eqnarray} \tag{ 26 }$

where Ω(t) is the control amplitude and Δ is the detuning of the non-computational state. For simulations we used Gaussian control pulses of width 0.25 time steps (truncated to have 1 time step duration) and chose an excited state detuning Δ = 20 yielding a system in which the leakage in and out of the computational space is small but not negligible. For these parameter values the total probability of state $| 2\rangle$ oscillates with a period of nearly 30 time steps, with maximal value of a few percent. Meanwhile, the computational state slowly precesses about the intended state.

The evolution of this 3-level system was simulated for four different initial qubit states ( $| +{\rangle }_{z}=| 0\rangle$ , $| -{\rangle }_{z}=| 1\rangle$ , $| +{\rangle }_{x}$ , and $| +{\rangle }_{y}$ ). For each initial state, tomographic measurements of the qubit were simulated at 196 times ranging from t = 0 to t = 1029, forming 57 flights of length 6. Each measurement was repeated 10⁴ times. In nearly all cases, the model inference yielded a 7-dimensional model. The average characterization error for this process is shown in figure 5.

**Figure 5.** Average characterization error (solid line) for a qubit undergoing bit flips in the anharmonic oscillator model. Also shown for comparison are the raw measurement error (dots) and error of perfect process tomography (dotted line).
Download figure:
Standard image High-resolution image

5.3. A qubit coupled to a single impurity

Another potential source of error in qubit devices is unintended coupling between qubits, or between a qubit and a nearby impurity. A simple model for such coupling is the isotropic spin exchange Hamiltonian

$\begin{eqnarray}&&H=\gamma \left({\sigma }_{x}\otimes {\sigma }_{x}+{\sigma }_{y}\otimes {\sigma }_{y}+{\sigma }_{z}\otimes {\sigma }_{z}\right).\end{eqnarray} \tag{ 27 }$

Taking γ = 0.01 to model weak coupling, the evolution of a qubit and impurity spin was simulated for three initial qubit states ( $| +{\rangle }_{x}$ , $| +{\rangle }_{y}$ , and $| +{\rangle }_{z}$ ). The impurity is always initially in the state $| +{\rangle }_{z}$ . For each initial state, tomographic measurements were simulated at 64 times ranging from t = 0 to t = 1030, forming 12 flights of length 7. Each measurement was repeated 10⁴ times. In nearly all cases, model inference yielded a 7-dimensional model. This may be compared with the nominal dimension 4² = 16 of a two qubit system. The average characterization error for this process is shown in figure 6.

**Figure 6.** Average characterization error (solid line) for a qubit weakly coupled to an impurity via a spin-exchange interaction. Also shown for comparison are the raw measurement error (dots) and error of perfect process tomography (dotted line).
Download figure:
Standard image High-resolution image

5.4. ³¹P qubit in silicon

For our final case study, we considered a ${}^{31}{\rm{P}}$ donor in ²⁸Si with ²⁹Si impurities. The nuclear spin of the ${}^{31}{\rm{P}}$ defines a qubit. The qubit is manipulated by a time-varying magnetic field which also affects the impurities. Meanwhile, the ³¹P and ²⁹Si impurities interact via hyperfine coupling to the donor electron. A detailed model for this system was formulated in [75]. We simulated the behavior of the qubit and impurities in response to a sequence of pulses that each nominally produce a π rotation of the qubit (bit flip). For these simulations, the donor was imagined to be at the center of a ${(5\mathrm{nm})}^{3}$ cube with a ²⁹Si concentration of 800 ppm, which translates to 6 impurities in the qubit volume. The impurities were distributed randomly throughout the volume; in all, 200 random configurations were simulated. The background magnetic field B_z was 1.5 T, the amplitude B_x of the driving magnetic field was 0.15 T, and the ambient temperature was 250 mK. The total length of each pulse sequence was $2.9\,\mu {\rm{s}}$ . (For further details see [75].)

For each configuration of the impurities, the evolution of the donor and impurities was simulated for 4 different initial qubit states ( $| +{\rangle }_{x}$ , $| +{\rangle }_{y}$ , $| +{\rangle }_{z}$ , and $| -{\rangle }_{z}$ ) and with the impurities initially in thermal equilibrium. For each initial state, tomographic measurements of the qubit were simulated at 272 different time steps ranging from t = 0 to t = 1033, forming 57 flights of length 10. Each measurement was repeated 10³ times. In most realizations the inferred model dimension was 19 or 20, which is much less than the nominal dimension of the ³¹P + ²⁹Si system (7 nuclear spins + 1 electron spin = 8 spins with a nominal dimension of ${\left({2}^{8}\right)}^{2}=65536$ ). Thus QPI reveals that the effective interaction between the qubit and the impurities involves only a small subspace of the available Hilbert space, an insight that is not readily apparent from the underlying model. The average characterization error for this system is shown in figure 7.

**Figure 7.** Average characterization error (solid line) for a ³¹P donor qubit with 6 nearby ²⁹Si impurities. Also shown for comparison are the raw measurement error (dots) and error of perfect process tomography (dotted line). Each time step is $2.9\,\mu {\rm{s}}$ .
Download figure:
Standard image High-resolution image

**Figure 7.** Average characterization error (solid line) for a ³¹P donor qubit with 6 nearby ²⁹Si impurities. Also shown for comparison are the raw measurement error (dots) and error of perfect process tomography (dotted line). Each time step is $2.9\,\mu {\rm{s}}$ .
Download figure:
Standard image High-resolution image

6. Discussion

In all of the simulation studies above, our QPI method was able to accurately capture and reproduce the observable behavior of the qubit over very long time scales, with trace distance errors on the order of 10⁻². While this level of error would not be impressive for tomography of the state at any one time, the fact that this accuracy is maintained over more than a thousand time steps implies that the process itself is actually characterized to an accuracy on the order of 10⁻⁵. Notably, the inferred models remained accurate well beyond the times at which measurements were taken (although the errors did tend to slowly grow the farther in time the models were extrapolated).

In contrast, standard quantum process tomography (which uses measurements only at t = 0 and t = 1) yielded inaccurate predictions of qubit behavior past a few tens of time steps, even under the assumption of zero statistical measurement error. For some of the examples the error of process tomography is oscillatory. We attribute this to an oscillatory aspect of the non-Markovian evolution, such that every so often it happens to (nearly) coincide with the prediction of a Markovian model. Arguably, it would be fairer to compare QPI with a version of QPT that incorporates the data at all times but assumes the process is Markovian. This can be achieved in our framework by restricting the model dimension to 4 for characterization of a single qubit. In fact we attempted to perform such a comparison, but found that dimension-limited (Markovian) models were such a poor fit to the data that the model inference would not converge.

For the finite dimensional processes studied (leakage and spin exchange), QPI inferred a process dimension that was equal to or close to the true dimension. For the high-dimensional processes studied (control drift and ³¹P + ²⁹Si systems), QPI was able to reproduce the observed behavior using relatively low-dimensional models. Fortunately, it was not necessary to use flights as long as suggested by the worst-case bound of lemma 1.

Although these simulation studies addressed processes nominally involving a single qubit, QPI is in principle just as applicable to the characterization of qudit or multiqubit operations. However, the state space size (and hence also the cost of data collection and data analysis) grows exponentially with the number of qubits. Thus like all forms of tomography, QPI is really only feasible for characterizing small quantum systems. Indeed, a possible concern is that QPI appears to require much more experimental data than QPT, which already requires a moderately large amount of data. To some extent this is to be expected: whereas QPT needs only enough information to produce a 4-dimensional model for a qubit process, QPI needs enough information to first of all determine the effective size of the environment's memory and then determine all the dynamical parameters involving the qubit, the pertinent part of the environment, and their interaction. We expect, however, that QPI can be made much more efficient than the studies above would suggest. In these studies, all measurements were arbitrarily chosen to be performed the same number of times. Almost certainly some measurements are more informative than others. We expect that an adaptive testing strategy, in which the current model estimate is used to choose the most informative experiment to perform next, will significantly reduce the cost of QPI [44, 45, 76].

As presented here, QPI has two main limitations. The first is that it characterizes only one process. In the context of a quantum computer, one would like a self-consistent characterization of all qubit operations ('gates') that may be performed; under the assumption of Markovianity, GST provides just such a characterization. As formulated here, QPI would yield a set of separate characterizations, one for each gate, that cannot be related. More specifically, the relationship between latent variables in the models for different gates would be indeterminate, i.e. one would not know whether different gates interact with the same or different environmental degrees of freedom. We conjecture that QPI can be extended to jointly characterize a set of processes: The principles of the Ho–Kalman method extend in a straightforward way to this case. What is unclear at present is how to extend lemma 1, i.e. how to identify a relatively small set of operation sequences that are likely to yield a complete model. This is an obvious direction for future work.

In principle, extending QPI to multiple processes would allow cross talk and other spatial correlations to be characterized by treating different combinations of simultaneous gates as different processes. However, due to the very large number of different combinations this would constitute (even in a small device), this does not seem to be a practical approach to characterizing spatial correlations.

The other main limitation of QPI as presented here is that it assumes all measurements are final; no attempt is made to model the effect of the measurement on the state of the system or environment. But intermediate measurements are a key component of quantum error correction protocols, thus the ability to characterize their impact on the remainder of a computation is important. We suspect the ability to characterize intermediate measurements would follow from the ability to characterize multiple processes, since each measurement outcome can be treated as a distinct, randomly selected process.

As a final remark, it may be ackowledged that QPI models are abstract and often not manifestly insightful. However, capturing empirical behavior in a concise and accurate mathematical model, as QPI does, is an essential first step toward understanding. Methods of extracting further insight from QPI models would be an interesting topic for future work.

7. Conclusion

In conclusion, we have developed a new method for the characterization of quantum dynamical processes, called QPI. QPI goes beyond state-of-the-art methods by providing the means to systematically characterize non-Markovian dynamics as well as Markovian dynamics. We presented a detailed experimental protocol and data analysis procedure and demonstrated their effectiveness using numerical simulations of realistic non-Markovian error processes affecting qubits. Directions for future work include extending QPI to enable the characterization of multiple processes and non-final measurements, and using adaptive strategies to reduce its experimental cost. Apart from experimental applications, the theory underlying QPI elucidates the relationship between the temporal evolution of an open quantum system and its effective environment, providing a useful alternative to existing representations of non-Markovian quantum dynamics.

Acknowledgments

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

This research was sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy. The authors would like to express appreciation to Nicholas Peters for helpful comments on the preparation of this manuscript.

Appendix A.: Proof of lemma 1

In this section we show that the minimum number of time observations needed to guarantee complete characterization of a discrete-time linear system depends on the number of latent (unmeasured) degrees of freedom, not the total number of degrees of freedom. Let d be the true dimension of the system and let a be the dimension of the subspace spanned by the initial states and measured quantities. We show that $\mathrm{rank}({H}_{l+1})=d$ where $l=d-a$ and H_l is the block-Hankel matrix formed from $X(1),\,\ldots ,\,X(2l-1)$ . It follows from the Ho–Kalman theory that X(1), ..., X(2(l + 1)) are sufficient to obtain a state model of the system .

The goal is to find matrices L, M, R such that $X(k)={{LM}}^{k-1}R$ . In the remainder of this section, X(k) will be denoted more compactly as X_k. For convenience, we assume that redundant initial states and redundant measurements have been omitted, so that X(k) is a × a. Since $\mathrm{rank}\ {X}_{i}\leqslant a$ for every X_i, we may choose a representation in which L has the form $L=\left[\begin{array}{cc}{1}_{a\times a} & {0}_{a\times l}\end{array}\right]$ . It follows that $R=\left[\begin{array}{c}{X}_{0}\\ {Y}_{0}\end{array}\right]$ where Y₀ is some unknown l × a matrix. We write M as

$\begin{eqnarray}M\equiv \left[\begin{array}{cc}A & B\\ C & D\end{array}\right],\end{eqnarray} \tag{ A1 }$

where A is a × a, D is l × l, and B and C are sized correspondingly. We have

$\begin{eqnarray}&&\left[\begin{array}{c}{X}_{n}\\ {Y}_{n}\end{array}\right]\equiv {M}^{n}\left[\begin{array}{c}{X}_{0}\\ {Y}_{0}\end{array}\right]\end{eqnarray} \tag{ A2 }$

$\begin{eqnarray}=\ \left[\begin{array}{cc}A & B\\ C & D\end{array}\right]\left[\begin{array}{c}{X}_{n-1}\\ {Y}_{n-1}\end{array}\right].\end{eqnarray} \tag{ A3 }$

Then

$\begin{eqnarray}&&{X}_{n+1}={{AX}}_{n}+{{BY}}_{n}\end{eqnarray} \tag{ A4 }$

$\begin{eqnarray}&&=\,{{AX}}_{n}+B({{CX}}_{n-1}+{{DY}}_{n-1})\end{eqnarray} \tag{ A5 }$

$\begin{eqnarray}&&\vdots \end{eqnarray} \tag{ A6 }$

$\begin{eqnarray}&&=\,{{AX}}_{n}+\left(\displaystyle \sum _{i=1}^{k}{{BD}}^{i-1}{{CX}}_{n-i}\right)+{{BD}}^{k}{Y}_{n-k}\end{eqnarray} \tag{ A7 }$

or

$\begin{eqnarray}&&{X}_{n+k+1}={{AX}}_{n+k}+\left(\displaystyle \sum _{i=1}^{k}{{BD}}^{i-1}{{CX}}_{n+l-i}\right)+{{BD}}^{k}{Y}_{n}\end{eqnarray} \tag{ A8 }$

for k = 0, 1, 2, ..., n. Let k = l. Our goal is to eliminate Y_n and obtain a recurrence relation involving only the X_j's. By the Cayley–Hamilton theorem there are coefficients ${\alpha }_{0},\,\ldots ,\,{\alpha }_{l-1}$ such that ${D}^{l}={\sum }_{j=0}^{l-1}{\alpha }_{i}{D}^{i}$ . Thus

$\begin{eqnarray}&&{X}_{n+l+1}={{AX}}_{n+l}+\left(\displaystyle \sum _{i=1}^{l}{{BD}}^{i-1}{{CX}}_{n+l-i}\right)+\displaystyle \sum _{j=0}^{l-1}{\alpha }_{j}{{BD}}^{j}{Y}_{n}.\end{eqnarray} \tag{ A9 }$

Setting k to j in (A8) yields

$\begin{eqnarray}&&{X}_{n+j+1}={{AX}}_{n+j}+\left(\displaystyle \sum _{i=1}^{j}{{BD}}^{i-1}{{CX}}_{n+j-i}\right)+{{BD}}^{j}{Y}_{n}.\end{eqnarray} \tag{ A10 }$

Solving for ${{BD}}^{j}{Y}_{n}$ and substituting into (A9) gives

$\begin{eqnarray}&&{X}_{n+l+1}={{AX}}_{n+l}+\left(\displaystyle \sum _{i=1}^{l}{{BD}}^{i-1}{{CX}}_{n+l-i}\right)+\displaystyle \sum _{j=0}^{l-1}{\alpha }_{j}\left({X}_{n+j+1}-{{AX}}_{n+j}-\displaystyle \sum _{i=1}^{j}{{BD}}^{i-1}{{CX}}_{n+l-i}\right)\end{eqnarray} \tag{ A11 }$

$\begin{eqnarray}&&={{AX}}_{n+l}+\displaystyle \sum _{i=1}^{l}\left(1-\displaystyle \sum _{j=i}^{l-1}{\alpha }_{j}\right){{BD}}^{i-1}{{CX}}_{n+l-i}+\displaystyle \sum _{j=0}^{l-1}{\alpha }_{j}\left({X}_{n+j+1}-{{AX}}_{n+j}\right).\end{eqnarray} \tag{ A12 }$

The important thing to note about this last equation is that ${X}_{n+l+1}$ is written in terms of X_n, ..., X_n+l. More precisely, there are matrices ${Q}_{0},\,\ldots ,\,{Q}_{l}$ such that

$\begin{eqnarray}&&{X}_{n+l+1}=\displaystyle \sum _{i=0}^{l}{Q}_{i}{X}_{n+i}.\end{eqnarray} \tag{ A13 }$

In other words, the rows of ${X}_{n+l+1}$ are linear combinations of the rows of ${X}_{n},\,\ldots ,\,{X}_{n+l}$ . (Note that if the Q_i's were scalars instead of matrices, the effective dimension of the system would be at most l + 1). By extension,

$\begin{eqnarray}&&{{ \mathcal T }}_{n+l+1}=\displaystyle \sum _{i=0}^{l}{Q}_{i}{{ \mathcal T }}_{n+i},\end{eqnarray} \tag{ A14 }$

where ${{ \mathcal T }}_{k}\equiv \left[\begin{array}{ccc}{X}_{k} & {X}_{k+1} & \cdots \end{array}\right]$ is the kth block of rows of H_∞. Thus ${{ \mathcal T }}_{1},\,\ldots ,\,{{ \mathcal T }}_{l+1}$ span ${{ \mathcal T }}_{l+1}$ and, by induction, all rows of H_∞.

Now, we could just as well have chosen a representation in which $R=[\begin{array}{cc}{1}_{a\times a}; & 0\end{array}]$ and $L=\left[\begin{array}{cc}{X}_{0} & {Z}_{0}\end{array}\right]$ , and written

$\begin{eqnarray}&&\left[\begin{array}{c}{X}_{n}\\ {Z}_{n}\end{array}\right]=\left[\begin{array}{c}{X}_{0}\\ {Z}_{0}\end{array}\right]{M}^{n}=\left[\begin{array}{c}{X}_{n-1}\\ {Z}_{n-1}\end{array}\right]\left[\begin{array}{cc}A & B\\ C & D\end{array}\right].\end{eqnarray} \tag{ A15 }$

This is the same problem as above, except that Y_n is replaced by Z_n, B and C swap roles, and recurrence relations involve factors of A, B, C, D on the right rather than on the left. A derivation analogous to that above would show that there exist matrices ${Q}_{0}^{{\prime} },\,\ldots ,\,{Q}_{l}^{{\prime} }$ such that

$\begin{eqnarray}&&{X}_{n+l+1}=\displaystyle \sum _{i=0}^{l}{X}_{n+i}{Q}_{i}^{{\prime} }\end{eqnarray} \tag{ A16 }$

by which we conclude that the first a(l + 1) columns span all the columns of H_∞. Thus $\mathrm{rank}({H}_{l+1})\,=\mathrm{rank}({H}_{\infty })=a+l$ . It follows from the Ho–Kalman theory that ${X}_{1},\,\ldots ,\,{X}_{2(l+1)}$ are sufficient to determine any system of dimension up to a + l.

Conversely, observing the output at less than 2(l + 1) times is generally insufficient to reproduce the system's behavior. We omit a detailed proof, but note that this fact can be established by constructing a d-dimensional system with a size l shift register that does not reveal its presence until the (l + 1)th time step. For this process $\mathrm{rank}\ {H}_{l}\lt \mathrm{rank}\ {H}_{l+1}=d$ . Similarly, it is easy to show that observing the output at non-consecutive times is also generally insufficient to construct a correct model. Consider a scalar system whose impulse response function (starting at time t = 0) cycles through the values $\alpha \to \beta \to \alpha \to \gamma \to \ldots$ . If the outputs are observed only at t = 0, 1, 2, 4, ..., 2^k, ..., the value γ would never be observed and a two-state system with output $\alpha \to \beta \to \alpha \to \beta \to \cdots$ would be inferred.

Appendix B.: Dimension estimation

In this section we consider the problem of estimating the dimension d of a process from an experimental estimate $\tilde{H}$ of a Ho–Kalman matrix H. Ho and Kalman showed if H is sufficiently large, then d is just the rank of H. But $\tilde{H}$ , which is a random perturbation of H, has a rank that is generally much higher than d. The singular values of $\tilde{H}$ are much more informative than the rank, as they reveal the magnitude of each dimension's contribution to the data. If the statistical errors in the data are small, all but d singular values of $\tilde{H}$ will be small. Here we derive a reliable criterion for determining which singular values are small enough to be considered spurious.

A naive approach would be to set a threshold for each singular value—say, based on its estimated variance—and count the number of singular values that exceed their threshold. There are problems with this approach, however, the most serious being that the number of singular values that exceed their threshold simply by chance grows with the size of $\tilde{H}$ , i.e. grows with the number of experiments performed. A way to overcome this problem is to test singular values collectively rather than individually: Given a threshold for statistical significance of the k smallest singular values, one finds the largest k for which the threshold is not met and takes d to be the number of singular values that remain.

An appropriate threshold can be derived from the uncertainty in the raw data. Each quantity ${\tilde{F}}_{i,m}^{(t)}$ that appears in $\tilde{H}$ is an independent random variable, distributed binomially with mean ${F}_{i,m}^{(t)}$ and variance ${F}_{i,m}^{(t)}(1-{F}_{i,m}^{(t)})/{N}_{i,m}^{(t)}$ . Let ${\epsilon }_{i,t,m}={\tilde{F}}_{i,m}^{(t)}-{F}_{i,m}^{(t)}$ denote the statistical deviation of ${\tilde{F}}_{i,m}^{(t)}$ and let ${\partial }_{(i,t,m)}H\equiv \partial H/\partial {F}_{i,m}^{(t)}$ . Then

$\begin{eqnarray}&&\tilde{H}=H+\displaystyle \sum _{x}({\partial }_{x}H){\epsilon }_{x},\end{eqnarray} \tag{ B1 }$

where x ranges over all experiments (i, t, m). Let h₁ × h₂ be the size of H and let $h=\min ({h}_{1},{h}_{2})$ . Let $H={{USV}}^{\dagger }$ be a 'full' singular value decomposition of H, i.e. U is h₁ × h₁ and V is h₂ × h₂, and S is h₁ × h₂ with diagonal elements ${s}_{1}\,\geqslant {s}_{2}\,\geqslant \cdots \geqslant \,{s}_{h}$ . Let U_k be the matrix consisting of columns k through h₁ of U, and V_k the matrix consisting of columns k through h₂ of V.

Suppose H has rank r. Then ${s}_{1},\,\ldots ,\,{s}_{r}\ne 0$ and ${s}_{r+1}\,=\,\cdots \,=\,{s}_{h}=0$ . According to perturbation theory, the corresponding singular values ${\tilde{s}}_{r+1},\,\ldots ,\,{\tilde{s}}_{h}$ of $\tilde{H}$ are the singular values of

$\begin{eqnarray}&&A\equiv {U}_{r+1}^{\dagger }\tilde{H}{V}_{r+1}\end{eqnarray} \tag{ B2 }$

$\begin{eqnarray}&&=\,{U}_{r+1}^{\dagger }\left(H+\displaystyle \sum _{x}({\partial }_{x}H){\epsilon }_{x}\right){V}_{r+1}.\end{eqnarray} \tag{ B3 }$

Since ${U}_{r+1}^{\dagger }{{HV}}_{r+1}=0$ we have

$\begin{eqnarray}&&A=\displaystyle \sum _{x}{A}_{x}{\epsilon }_{x},\end{eqnarray} \tag{ B4 }$

where

$\begin{eqnarray}&&{A}_{x}\equiv {U}_{r+1}^{\dagger }({\partial }_{x}H){V}_{r+1}.\end{eqnarray} \tag{ B5 }$

Let χ_r denote the residual 'energy' of the h − r smallest singular values,

$\begin{eqnarray}&&{\chi }_{r}=\displaystyle \sum _{i\gt r}{\tilde{s}}_{i}^{2}\end{eqnarray} \tag{ B6 }$

$\begin{eqnarray}&&=\,\mathrm{Tr}({A}^{T}A)\end{eqnarray} \tag{ B7 }$

$\begin{eqnarray}&&=\,\displaystyle \sum _{x,y}\mathrm{Tr}({A}_{x}^{T}{A}_{y}){\epsilon }_{x}{\epsilon }_{y}.\end{eqnarray} \tag{ B8 }$

Owing to the independence of experiments, we have $\left\langle {\epsilon }_{x}{\epsilon }_{y}\right\rangle =0$ for $x\ne y$ , yielding

$\begin{eqnarray*}&&\left\langle {\chi }_{r}\right\rangle =\displaystyle \sum _{i,j\gt r}\displaystyle \sum _{x}{B}_{{ij}}^{(x)},\end{eqnarray*}$

where ${B}^{(x)}\equiv {A}_{x}\sqrt{\left\langle {\epsilon }_{x}^{2}\right\rangle }$ .

If the threshold for rejecting ${\tilde{s}}_{r},\,\ldots ,\,{\tilde{s}}_{h}$ were set at $\left\langle {\chi }_{r}\right\rangle$ , then approximately half of the time the subset ${\tilde{s}}_{d+1},\,\ldots ,\,{\tilde{s}}_{h}$ would be incorrectly accepted as statistically significant, resulting in an overestimate of d. A higher threshold reduces the probability of this occurring, but increases the probability of rejecting ${\tilde{s}}_{d},\,\ldots ,\,{\tilde{s}}_{h}$ and thereby underestimating d. Thus the threshold should be just high enough to reject ${\tilde{s}}_{d+1},\,\ldots ,\,{\tilde{s}}_{h}$ a majority of the time. To that end we calculate the variance of χ_r. We have

$\begin{eqnarray*}\begin{array}{rcl}{\chi }_{r}^{2} & = & \mathrm{Tr}{\left({A}^{T}A\right)}^{2}\\ & = & \displaystyle \sum _{x,y}\displaystyle \sum _{p,q}\mathrm{Tr}({A}_{x}^{T}{A}_{y})\mathrm{Tr}({A}_{p}^{T}{A}_{q}){\epsilon }_{x}{\epsilon }_{y}{\epsilon }_{p}{\epsilon }_{q}.\end{array}\end{eqnarray*}$

Since the deviations are independent and zero-mean, $\left\langle {\epsilon }_{x}{\epsilon }_{y}{\epsilon }_{p}{\epsilon }_{q}\right\rangle$ vanishes unless it is of the form $\left\langle {\epsilon }_{x}^{4}\right\rangle$ or $\left\langle {\epsilon }_{x}^{2}{\epsilon }_{y}^{2}\right\rangle$ . The second form simplifies to $\left\langle {\epsilon }_{x}^{2}\right\rangle \left\langle {\epsilon }_{y}^{2}\right\rangle$ . Provided the statistical errors are small compared to the quantities being estimated, _x has an approximately normal distribution, for which $\left\langle {\epsilon }_{x}^{4}\right\rangle =3{\left\langle {\epsilon }_{x}^{2}\right\rangle }^{2}$ . Then

$\begin{eqnarray}\begin{array}{rcl}\left\langle {\chi }_{r}^{2}\right\rangle & = & 3\displaystyle \sum _{x}\mathrm{Tr}{\left({A}_{x}^{T}{A}_{x}\right)}^{2}{\left\langle {\epsilon }_{x}^{2}\right\rangle }^{2}+\displaystyle \sum _{x}\displaystyle \sum _{y\ne x}\left(\mathrm{Tr}({A}_{x}^{T}{A}_{x})\mathrm{Tr}({A}_{y}^{T}{A}_{y})+2\mathrm{Tr}{\left({A}_{x}^{T}{A}_{y}\right)}^{2}\right)\left\langle {\epsilon }_{x}^{2}\right\rangle \left\langle {\epsilon }_{y}^{2}\right\rangle \\ & = & {\left(\displaystyle \sum _{x}\mathrm{Tr}({A}_{x}^{T}{A}_{x})\left\langle {\epsilon }_{x}^{2}\right\rangle \right)}^{2}+2\displaystyle \sum _{x,y}\mathrm{Tr}{\left({A}_{x}^{T}{A}_{y}\right)}^{2}\left\langle {\epsilon }_{x}^{2}\right\rangle \left\langle {\epsilon }_{y}^{2}\right\rangle \\ & = & {\left\langle {\chi }_{r}\right\rangle }^{2}+2\displaystyle \sum _{i,j\gt r}\displaystyle \sum _{k,l\gt r}{\left(\displaystyle \sum _{x}{B}_{{ij}}^{(x)}{B}_{{kl}}^{(x)}\right)}^{2},\end{array}\end{eqnarray} \tag{ B9 }$

where

$\begin{eqnarray}&&{D}_{i,j;k,l}\equiv \displaystyle \sum _{x}{B}_{{ij}}^{(x)}{B}_{{kl}}^{(x)}.\end{eqnarray} \tag{ B10 }$

Our criterion for accepting the hypothesis ${s}_{r+1}={s}_{r+2}\,=\,\cdots \,=\,0$ is

$\begin{eqnarray}&&{\chi }_{r}\leqslant \left\langle {\chi }_{r}\right\rangle +\sqrt{\mathrm{Var}{\chi }_{r}}\end{eqnarray} \tag{ B11 }$

$\begin{eqnarray}&&=\,\displaystyle \sum _{i,j\gt r}\displaystyle \sum _{x}{B}_{i,j}^{(x)}+\sqrt{2\displaystyle \sum _{i,j\gt r}\displaystyle \sum _{k,l\gt r}{\left(\displaystyle \sum _{x}{B}_{{ij}}^{(x)}{B}_{{kl}}^{(x)}\right)}^{2}}.\end{eqnarray} \tag{ B12 }$

We take d be the smallest r for which this criterion is met. In theory the probability of overestimating d is then at most 0.16.

In practice this criterion cannot be applied exactly: The quantities on the right side of equation (B12) are not known, but can only be estimated. The singular values may not even be in quite the right order due to statistical variations. Furthermore, the approximations made in the derivation might not always be accurate. Nevertheless, in our simulations we observed that the criterion (B12) is effective: It usually yields an estimate of d quite close to the true value, provided one has taken enough data.

Appendix C.: Progressive fitting of data blocks

For large times t, the experimental quantity ${F}^{(t)}={{ST}}^{t}P$ is an extremely nonlinear function of T, presenting a considerable challenge for estimation of the parameters S, T, P from the data. Building on the approach described in [7], we devised a progressive fitting method which first uses low-t data to obtain a coarse estimate of T, then gradually incorporates data with higher t to increase the accuracy of the model. Our implementation involves multiple passes over the data, alternating between optimization of T given the current estimate of S, P and optimization of S, P given the current estimate of T. We found that if too much or too little data was used at any given step, either (S, P) or T could 'lock in' to the wrong region of parameter space from which the search could not recover. The procedure below was carefully honed to ensure that at each step, only reliable intermediate estimates were used. In our studies, 5–15 passes were usually sufficient to obtain good fits to the data.

Let H^(b) denote the bth block of data (figure 3(c)), W^(b) the matrix of corresponding statistical weights, and N_b the number of elements in blocks 0 through b. (The weights W^(b) are adjusted to account for the different multiplicity of different experiments, so that each experiment has the same total weight.) Let

$\begin{eqnarray}{\varrho }_{k}=\left\{\begin{array}{ll}0 & k=0\\ {2}^{k-1} & k\gt 0\end{array}\right..\end{eqnarray} \tag{ C1 }$

The function to be minimized is

$\begin{eqnarray}&&{{\rm{\Phi }}}_{b}(A,T,B)\equiv \displaystyle \frac{1}{{N}_{b}}\displaystyle \sum _{{b}^{{\prime} }=0}^{b}\displaystyle \sum _{i\in { \mathcal I }}\displaystyle \sum _{m\in { \mathcal M }}{W}_{i,m}^{({b}^{{\prime} })}{\left({{AT}}^{{\varrho }_{{b}^{{\prime} }}}B-{H}^{({b}^{{\prime} })}\right)}_{i,m}^{2}\end{eqnarray} \tag{ C2 }$

which is the average weighted error of the model over blocks 0 through b. Under the true model, the distribution of Φ_b has a strong peak at 1. A value significantly larger than this (say, 1.5) indicates a poor fit. Model fitting with the correct dimension usually yields Φ_b values around 0.8; overfitting leads to Φ_b around 0.5.

Let b_max denote is the index of the last block. Let b_highest denote the index of the highest block for which a fit has been attempted. The fitting procedure is as follows:

3.1 Let (L, T, R) be the outputs of Ho–Kalman estimation. Set A equal to L and set B equal to the first n(l + 1) columns of R. Find the smallest b such that Φ_b > 1.5 and set b_highest to this value (or to b_max if there is no such b).
3.2 Keeping T fixed, optimize $A,B$ . This is accomplished by minimizing a modified form of equation (C2) with $b={b}_{{\max }}$ . Whereas the current estimate of T is used for blocks ${b}^{{\prime} }\leqslant {b}_{\mathrm{highest}}$ , for blocks ${b}^{{\prime} }\gt {b}_{\mathrm{highest}}$ we replace ${T}^{{\varrho }_{{b}^{{\prime} }}}$ by the matrix ${Y}^{({b}^{{\prime} })}$ that is optimal with respect to A, B. In this way A, B are made to fit all the data but are not constrained by high powers of T that have not yet been determined to be accurate. Minimization can be accomplished via iterative linear algebraic methods.
3.3 Keeping A, B as fixed, optimize T.
- 3.3.1 Set b = 2.
- 3.3.2 If b < b_highest and Φ_b ≤ 1.5, continue to 3.3.3. Otherwise, find T that minimizes ${{\rm{\Phi }}}_{b}$ , keeping A, B fixed. This can be accomplished by various nonlinear optimization methods; we found the Gauss–Newton method with line search to work well.
- 3.3.3 If $b\lt {b}_{{\max }}$ and Φ_b ≤ 1.5, increase b by 1, set ${b}_{\mathrm{highest}}=\max (b,{b}_{\mathrm{highest}})$ , and go back to 3.3.2.
3.4 If $b={b}_{{\max }}$ and ${{\rm{\Phi }}}_{{b}_{{\max }}}$ did not improve significantly (say, by more than 0.001) compared to the previous pass, optimization has converged; continue to step 3.5. Otherwise, go back to step 3.2 (perform another pass).
3.5 If ${{\rm{\Phi }}}_{{b}_{{\max }}}\leqslant 1.5$ , the fit is deemed a success; in this case return T, return the first m rows of A for S, and return the first n columns of B for P. Otherwise, increase d by one, perform Ho–Kalman estimation again (stage 2, described in section 4.2), and start again at step 3.1.

Appendix D.: Final model optimization

The fourth and final stage of our model inference procedure is to fit the model ${F}^{(t)}={{ST}}^{t}P$ directly to the data. This last stage yields the most accurate models but requires a very good starting estimate S, T, P. As in the previous stage, we minimize a sum of weighted squared errors. But unlike in previous stages, the weights are determined by the model itself rather than estimated from the data, making the total squared error a good proxy for the negative log likelihood. Besides being simpler to work with than the log likelihood, weighted errors allow out-of-range probabilities (less than 0 or grater than 1) to be handled much more gracefully. In this stage we also impose soft constraints on the eigenvalues {λ_i} of T: Since the process in question is presumed to occur in a finite state space, the eigenvalues of T should not have magnitude greater than 1, as otherwise repeated application of the process would eventually take the state out of the state space.

The function to be minimized is

$\begin{eqnarray}&&{\rm{\Psi }}(S,T,P)\equiv \displaystyle \frac{1}{\left|{ \mathcal I }\right|\left|{ \mathcal M }\right|\left|{ \mathcal T }\right|}\displaystyle \sum _{t\in { \mathcal T }}\displaystyle \sum _{i\in { \mathcal I }}\displaystyle \sum _{m\in { \mathcal M }}{W}_{i,m}^{(t)}\left(S,T,P\right){\left({F}^{(t)}-{\tilde{F}}^{(t)}\right)}_{i,m}^{2}+E(T).\end{eqnarray} \tag{ D1 }$

where $E(T)={\sum }_{i}\max {\left(0,\left|{\lambda }_{i}\right|-1\right)}^{2}$ is the eigenvalue constraint. Nominally, ${W}_{i,m}^{(t)}$ is the inverse variance of ${F}_{i,m}^{(t)}$ . But since the inverse variance becomes negative or infinite if the model ever predicts ${F}_{i,m}^{(t)}\leqslant 0$ or ${F}_{i,m}^{(t)}\geqslant 1$ , we take

$\begin{eqnarray}&&{W}_{i,m}^{(t)}=\displaystyle \frac{1}{{V}_{i,m}^{(t)}+\sqrt{{V}_{i,m}^{(t)2}+4{\beta }_{i,m}^{(t)2}}},\end{eqnarray} \tag{ D2 }$

where

$\begin{eqnarray}&&{V}_{i,m}^{(t)}=\displaystyle \frac{{F}_{i,m}^{(t)}(1-{F}_{i,m}^{(t)})}{{N}_{i,m}^{(t)}}\end{eqnarray} \tag{ D3 }$

is the theoretical variance of ${F}_{i,m}^{(t)}$ and ${\beta }_{i,m}^{(t)}$ is a small buffer term that ensures ${W}_{i,m}^{(t)}$ remains finite, continuous, and positive. We start with ${\beta }_{i,m}^{(t)}=1/{N}_{i,m}^{(t)}$ . We then proceed to minimize D1 using the Gauss–Newton method with line search. After each search step the validity of the model predictions are assessed. If any ${F}_{i,m}^{(t)}$ is not in the interval $[0,1]$ , ${\beta }_{i,m}^{(t)}$ is reduced by 5%. This increases ${W}_{i,m}^{(t)}$ outside [0,1] and near its endpoints, thereby more strongly encouraging ${F}_{i,m}^{(t)}$ to become valid. If after a search step all the predicted probabilities are valid and Ψ was not significantly improved, the search concludes.