1 Introduction

We often model empirical processes as complex systems: they are composed of many agents or microstates that interact through simple rules but evolve in non-trivial ways. Much research has taken a complex-systems approach, constructing models from the bottom up even though empirical observations provide only incomplete information about the system. Applications range from Earth’s global climate [1], to urban dynamics [2], the human brain [3], financial markets [4], transportation networks [5], and marine fisheries [6]. However, data are usually available only as aggregate statistics of the microstates, because it is difficult to measure microstates directly. Thus, it is an ongoing challenge to develop methods that recover the latent microstates from aggregate and incomplete observations [7].

Estimating and forecasting complex systems accurately depends on (1) the inherent complexity of the system, which depends on features such as its state-space dimension, its Lyapunov exponents, and its attractors, (2) the sparsity and quality of the data, and (3) our ability to model the system. In low-dimensional systems with high-quality data, as shown by Packard et al. [8] and Takens [9], state-space reconstruction techniques can be applied to partial information to build a representation of the dynamical system. By reconstructing an attractor from data, we can make accurate predictions by selecting the reconstructed points closest to the current state of the system and extrapolating them [10]. These techniques work well even without any modeling [11].

In high-dimensional systems, it is typically not possible to use such time series models. It is nonetheless often possible to use a theoretical model, provided one can estimate the initial conditions that the model requires and compare its output to the data. The task of estimating initial conditions that match observations is known as initialization. As with state-space reconstruction techniques, we need to know the evolution function of the underlying dynamics of the system, or at least some approximation of it [12]. In particular, if the available observations are an aggregate of the dynamical system, the process is known as microstate initialization, or latent state initialization. To carry it out we need to know how the microstate is aggregated, in addition to having a model of its dynamics.

In the fields of meteorology and numerical weather prediction (NWP), researchers have developed a framework for estimating the latent states of a system [13, 14] when a large number of observed states are available. This framework is known as data assimilation, and, although it is well justified from the Bayesian perspective, it incorporates several heuristics specific to weather prediction. For instance, modelers often incorporate Gaussian priors with known statistics about the latent states into the cost functionals involved [15]. Ideas from data assimilation have already spread outside the NWP community, for example to urban dynamics [2]. Typically, data assimilation methods operate in a sequential manner: the microstate at time t gets nudged (or corrected) so that it optimally approaches the empirical observation at t. The modeler then simulates the nudged microstate from time t to time \(t + 1\), where the microstate is nudged again. The resulting sequence of microstates is therefore not a solution of the underlying dynamical system, because the microstate is repeatedly altered along its trajectory. We, in contrast, seek to analyze real trajectories of the latent states, so this sequential approach does not suit our goal.

The initialization process is an optimization problem in which we minimize a cost function that depends on some notion of distance between the observed and the model-generated data. Hence, in high-dimensional systems, it is essential to develop efficient algorithms that find initial conditions with high precision. Research has been done on the parameter estimation of stochastic dynamical systems in low-dimensional parameter spaces [7, 16]. Among the alternatives, gradient descent has proven superior in strongly nonlinear models [17, 18], but its main drawback is that it can get stuck in local minima. Other methods, such as genetic algorithms, simulated annealing, and other meta-heuristic algorithms [19], usually find the global minimum in low dimensions, but they are likely to diverge in high-dimensional systems. We provide a thorough comparison of state-of-the-art gradient descent methods [20] and discuss their performance in the context of microstate initialization.

Here, we propose a gradient-based method for initializing chaotic systems where only aggregate observations of short observation windows are available. Using a combination of numerical simulations and analytical arguments, we study the conditions in which certain systems may be initialized with arbitrary precision. We explore the performance of several gradient descent algorithms and the effects of observational noise on the accuracy and convergence of the initialization process. Furthermore, we quantify the accuracy of our method numerically with out-of-sample forecasts. Under this framework, we offer a better understanding of what information the observations provide about a system’s underlying dynamics, and, additionally, lay out the connections of our method to those in the data assimilation literature.

The remainder of this paper is laid out as follows. In Sect. 2, we describe the initialization problem under the framework of dynamical systems and develop our initialization methodology accordingly. In Sect. 3, we test our method in two systems, namely the Lorenz and Mackey–Glass systems. Finally, in Sect. 4, we discuss our results and suggest further research directions.

2 Methods

2.1 Problem setup

A wide class of dynamical systems can be formulated as follows

$$\begin{aligned} \varvec{x}(t + \Delta t) = \varvec{f}( \varvec{x}(t); \varvec{\theta }) + \varvec{\xi }(t) \ , \end{aligned}$$
(1)

where \(\varvec{x}(t) \in {\mathcal {X}} \subset {\mathbb {R}}^{N_x}\) is the (\(N_x\)-dimensional) state of the system at time \(t \in {\mathbb {R}}\) living in some manifold \({\mathcal {X}}\), \(\varvec{\xi }(t) \in {\mathbb {R}}^{N_x}\) are random variables that model the intrinsic noise of the system and \(\varvec{\theta } \in {\mathbb {R}}^d\) is a vector of parameters. We will hereafter refer to the state \(\varvec{x}\) as the microstate to emphasize that \(\varvec{x}\) is not directly observable and is potentially high dimensional. If \(\Delta t > 0\) is finite, the model \(\varvec{f}: {\mathcal {X}} \times {\mathbb {R}}^d \rightarrow {\mathcal {X}}\) is a discrete mapping that takes the microstate from time t to time \(t + \Delta t\). If \(\Delta t\) is infinitesimal, the system (1) defines a continuous-time dynamical system. Here, we assume that the dynamical system \(\varvec{f}\) is deterministic and perfectly specified, so that Eq. (1) depends on neither \(\varvec{\xi }\) nor \(\varvec{\theta }\).

We are interested in using information from the past to estimate the microstate \(\varvec{x}(t_0)\) of a dynamical system \(\varvec{f}\) at the present time \(t_0\). We assume we know \(\varvec{f}\), but cannot directly observe its microstate \(\varvec{x}(t)\). Instead, we only have access to a sequence of observations \(\varvec{y}= (y_{-T}, \dots , y_0)\) measured at times \(\{t_{-T}, \dots , t_0 \} \), where \(y_k \in {\mathbb {R}}^{N_y}\) and \(N_y < N_x\), i.e., we are interested in measurements that lose information about the microstate by reducing its dimension from \(N_x\) to \(N_y\). Moreover, these observations can be noisy. Thus, we can relate the observations to the dynamical system as

$$\begin{aligned} y_{k} = {\mathcal {H}} \big ( \varvec{x}(t_k) \big ) + \epsilon _{k} \ . \end{aligned}$$
(2)

where \({\mathcal {H}}: {\mathcal {X}} \subset {\mathbb {R}}^{N_x} \rightarrow {\mathbb {R}}^{N_y}\) is a known (possibly) nonlinear observation operator and \(\epsilon _k\) accounts for observational noise. Although it is not necessary for the following discussions, we will assume for simplicity that \(N_y = 1\) and that the noise has a known variance \(\sigma _n^2\). Our choice of indices from \(-T\) to 0 for \(y_k\) reflects our interest in the present time \(t_0\).

We assume that the observations are sampled at a uniform rate as follows

$$\begin{aligned} \Delta t_k = t_{k+1} - t_k = m \Delta t \ , \end{aligned}$$

where the nonzero integer m is the sampling interval. We treat m as a parameter that controls how often we sample observations from the system. Thus, we can recast Eq. (1) as a discrete-time mapping

$$\begin{aligned} \varvec{x}_{k+1} = \varvec{f}^{ \{m \} }( \varvec{x}_k ) = \varvec{f}^{ \{m (k + 1 + T) \} }( \varvec{x}_{-T}) \ , \end{aligned}$$
(3)

where \(\varvec{f}^{ \{ i \} }(\cdot )\) denotes the composition of \(\varvec{f}\) with itself i times and \(\varvec{x}_{-T} = \varvec{x}(t_{-T})\) is the latent microstate at the time of the first observation. Note that the whole evolution of the microstates is determined by \(\varvec{x}_{-T}\), so the microstate initialization problem is that of obtaining the best approximation to \(\varvec{x}_0\) given the observations from the assimilation time \(t_{-T}\) up to the present time \(t_0\). These observations, together with the model (3) and the observation operator (2), form a nonlinear system of equations with \(N_x\) variables (the dimension of the microstate space) and T equations (the number of observations).
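To make the notation concrete, the following minimal Python sketch (illustrative only; the helper observe_sequence and its signature are ours, not taken from the released repository) generates the observation sequence of Eq. (2) by composing a user-supplied one-step map \(\varvec{f}\) with the operator \({\mathcal {H}}\), as in Eq. (3):

```python
import numpy as np

def observe_sequence(f, H, x_assim, T, m, noise_std=0.0, rng=None):
    """Generate y_{-T}, ..., y_0 from the latent microstate x_{-T} = x_assim
    by applying the one-step map f (m times per observation) and the scalar
    observation operator H, optionally adding Gaussian observational noise."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x_assim, dtype=float)
    ys = [H(x) + noise_std * rng.standard_normal()]   # y_{-T}
    for _ in range(T):                                # y_{-T+1}, ..., y_0
        for _ in range(m):
            x = f(x)
        ys.append(H(x) + noise_std * rng.standard_normal())
    return np.array(ys), x                            # observations and x_0
```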

2.2 Initialization procedure

As we stated previously, the microstate initialization problem is that of obtaining the best representation of the present-time microstate \(\varvec{x}_0\) given the history of observations \(\varvec{y}\) under the dynamical system \(\varvec{f}\). We define what best means in what follows. We make the code of the initialization procedure freely available at https://github.com/blas-ko/LatentStateInitialization.

2.2.1 Cost function

Given an estimate \(\varvec{x}\) of the ground-truth microstate \(\varvec{x}_{-T}\) at the assimilation time \(t_{-T}\), a natural way to quantify the goodness of \(\varvec{x}\) is to measure the point-to-point discrepancy between the observed and estimated data. We do so with the following mean-squared cost function:

$$\begin{aligned} {\mathcal {J}}( \varvec{x} ) = \frac{1}{T \sigma _y^2} \sum \limits _{k = -T}^{0} \left( y_k - \hat{y}_k \right) ^2 , \end{aligned}$$
(4)

where

$$\begin{aligned} \hat{y}_k(\varvec{x}) = {\mathcal {H}} \Big ( \varvec{f}^{ \{ m (k + T) \} }( \varvec{x} ) \Big ) \end{aligned}$$
(5)

is our estimate of observation \(y_k\) given \(\varvec{x}\), and \(\sigma _y^2\), the variance of the observed data, is a normalization constant that is arbitrary but will be of practical use later. We call the state \(\hat{\varvec{x}}_{-T}\) that minimizes \({\mathcal {J}}\) the assimilated microstate, and the present-time state \(\hat{\varvec{x}}_0 = \varvec{f}^{ \{m T\} }(\hat{\varvec{x}}_{-T})\) the initialized microstate. Our goal is to find an initialized microstate \(\hat{\varvec{x}}_0\) that is a good representation of the ground-truth microstate \(\varvec{x}_0\).
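A direct implementation of Eqs. (4)-(5) needs only the map \(\varvec{f}\), the operator \({\mathcal {H}}\), and the observed series. The sketch below (our own illustrative code, not the repository implementation) evaluates \({\mathcal {J}}\) for a candidate microstate at the assimilation time:

```python
import numpy as np

def cost_J(x, f, H, y_obs, m, sigma_y2):
    """Cost function of Eq. (4): y_obs holds (y_{-T}, ..., y_0) and x is a
    candidate microstate at the assimilation time t_{-T}."""
    x = np.asarray(x, dtype=float)
    T = len(y_obs) - 1
    y_hat = [H(x)]                    # estimate of y_{-T}
    for _ in range(T):                # estimates of y_{-T+1}, ..., y_0
        for _ in range(m):
            x = f(x)
        y_hat.append(H(x))
    residuals = np.asarray(y_obs) - np.asarray(y_hat)
    return np.sum(residuals ** 2) / (T * sigma_y2)
```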

The cost function \({\mathcal {J}}\) is known as a filter in the interpolation assimilation community, a least-squares optimizer in the optimization and machine learning communities, and 4D-Var with infinite state uncertainty in the variational assimilation community. From a Bayesian perspective, the cost function \({\mathcal {J}}\) emerges naturally when we assume the prior \(p(\varvec{x})\) is uniform and observational noise is Gaussian: the posterior \({p(\varvec{x} | \varvec{y})}\) is maximal when \({\mathcal {J}}\) is minimal [14].

If the observations in the time series \(\varvec{y}\) are noiseless, and given that we assume that the model \(\varvec{f}\) perfectly describes the system, then \({\mathcal {J}}\) has a global minimum \(\hat{\varvec{x}}_{-T}\) for which \({\mathcal {J}}(\hat{\varvec{x}}_{-T}) = 0\). However, even in the noiseless scenario, \(\hat{\varvec{x}}_{-T}\) is not necessarily unique. Take, for instance, the Lorenz system [21], which is invariant under the transformation \((x, y, z) \mapsto (-x, -y, z)\), and take an observation operator of the form \({\mathcal {H}}(\varvec{x}) = \varvec{x}^\dagger \varvec{x}\) with \(\dagger \) denoting transposition. For any microstate \(\varvec{x}\), its image under this transformation produces the same sequence of noiseless observations, so the two microstates are indistinguishable in terms of \({\mathcal {J}}\). We will refer to the set of indistinguishable microstates as the feasible set of solutions, and we will denote it as \(\varvec{\Omega }\).

An additional problem arises when the observations contain noise. In any finite time series, there might exist microstates with a lower value of the cost function than the ground-truth microstate, i.e., microstates \(\hat{\varvec{x}}_{-T} \not \in \varvec{\Omega }\) with \({\mathcal {J}}( \hat{\varvec{x}}_{-T} ) < {\mathcal {J}}( \varvec{x}_{-T} ) \), which we will call dominating microstates, following [22]. Judd et al. [22] suggest that cost functions that minimize both the variance (as in Eq. (4)) and the kurtosis do a better job of identifying the true microstate when the observational noise is Gaussian. While this is a good idea when there are many observations, we consider short time series in this work, which calls for other alternatives. Instead of modifying the cost function \({\mathcal {J}}\), we preprocess the data to reduce the probability of finding dominating trajectories outside of the feasible set \(\varvec{\Omega }\).

Following [13], we distinguish between the error-free and the noise contributions to the cost function \({\mathcal {J}}\). Isolating the contributions from the discrepancy between ground-truth and assimilated microstate and the observational noise will let us design a well-suited methodology to deal with noise and dominating trajectories in a separate manner. Recall that, given \(\varvec{x}_{-T}\), \({\mathcal {H}}(\varvec{x}_k)\) is the error-free observation at time \(t_k\), \(\epsilon _k\) its associated noise and \(\hat{y}_k\) our estimation of \(y_k\). Thus, we can decompose Eq. (4) into three terms of the form

$$\begin{aligned} {\mathcal {J}}(\varvec{x}) = \frac{1}{\sigma _y^2} \Big [&\underbrace{\frac{1}{T}\sum _k ({\mathcal {H}}(\varvec{x}_k) - \hat{y}_k)^2}_{ \text {noise-free cost}} \\ +&\underbrace{\frac{1}{T}\sum _k \epsilon _k^2}_{\text {observation noise}} \\ +&\underbrace{\frac{2}{T}\sum _k ({\mathcal {H}}(\varvec{x}_k) - \hat{y}_k)\epsilon _k}_{\text {error-noise covariation}} \Big ] \ . \end{aligned}$$

The first term on the RHS is the noise-free cost, \({\mathcal {J}}_{\text {free}}(\varvec{x})\), that we would obtain in the absence of observational error. The second term is the average contribution of the square of the noise, which converges to the noise variance, \(\sigma _n^2\), for large T. Finally, the third term captures how the noise and the noise-free cost vary together, which goes to 0 for large T because we assume the observational noise and the system dynamics are uncorrelated. Considering a large number of observations, we can thus approximate \({\mathcal {J}}\) as

$$\begin{aligned} {\mathcal {J}}(\varvec{x}) \approx {\mathcal {J}}_{\text {free}}(\varvec{x}) + \frac{\sigma _n^2}{\sigma _y^2} \ , \end{aligned}$$
(6)

which shows the expected behavior of Eq. (4) in the presence of noise.

Now, note that if we take \(\hat{y}_k = {\mathbb {E}}[\varvec{y}]\) for all k [10], which is the best constant predictor for the observed time series, then \({\mathcal {J}}_{\text {free}} = 1\) and, therefore

$$\begin{aligned} {\mathcal {J}}(\varvec{x} : \hat{y}_k = {\mathbb {E}}[\varvec{y}] \ \forall k) = {\mathcal {J}}_{\text {const}} := 1 + \sigma _n^2 / \sigma _y^2. \end{aligned}$$

Naturally, we want to find an assimilated microstate \(\hat{\varvec{x}}_{-T}\) that performs better than a constant predictor, so that \({\mathcal {J}}(\hat{\varvec{x}}_{-T}) \le {\mathcal {J}}_{\text {const}}\). This motivates us to consider microstates \(\varvec{x}\) such that

$$\begin{aligned} \lbrace \varvec{x}: {\mathcal {J}}(\varvec{x}) \le \alpha + \frac{\sigma _n^2}{\sigma _y^2}\beta \rbrace \ , \end{aligned}$$
(7)

where the parameter \(\alpha \in (0,1]\) accounts for the error-free tolerance about the global minimum while \(\beta \in (0,1]\) accounts for the noise tolerance. We will explore these parameters in the following sections.

The above discussion suggests that we should pay special attention to handling dominating trajectories, which may arise either from observational noise or from microstates that lie far from the ground truth but have low cost-function values. Thus, we propose the following three stages to initialize the microstate from aggregate observations: (1) a preprocessing stage in which we reduce the noise of the observations, (2) a bounding stage in which we limit the region of the microstate space in which we search for an optimal solution, and (3) a refinement stage in which we minimize the cost function (4) in a small search space and estimate the optimal microstate given the observations.

2.2.2 Preprocess: noise reduction

First, we preprocess the observed time series to reduce the observational noise and thus lower the probability of obtaining dominating microstates. Casdagli et al. [23] showed that the distribution of local minima becomes increasingly complex as the level of observational noise grows, especially when the dynamics is chaotic. We handle time series with only a handful of data points (on the order of 100 data points or fewer), so reducing noise by orbit shadowing [12] or by large-window impulse response filters [24] is not a suitable option. Instead, we find that the best way to reduce the variance of the noise is a low-pass moving average (LPMA) filter,

$$\begin{aligned} z_k := {\left\{ \begin{array}{ll} \frac{1}{2} y_{k} + \frac{1}{2} y_{k+1} &{} k = -T \\ \frac{1}{2} y_{k} + \frac{1}{2} y_{k-1} &{} k = 0 \\ \frac{1}{2} y_k + \frac{1}{4} \left( y_{k-1} + y_{k+1} \right) &{} \text {otherwise} \\ \end{array}\right. } \ , \end{aligned}$$
(8)

where \(z_k\) is the filtered data point at time \(t_k\). We control the amount of noise reduction by repeatedly feeding the signal back into the LPMA filter of Eq. (8). Feeding the signal back into the filter q times, hereafter denoted as \(z_k^q\), is equivalent to increasing the filtering window from three to \(2q+1\) points, making the filtered signal smoother.
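A minimal implementation of the filter of Eq. (8), applied q times, could read as follows (illustrative sketch; it assumes the series has at least two points):

```python
import numpy as np

def lpma_filter(y, q=1):
    """Low-pass moving-average filter of Eq. (8) applied q times:
    endpoints are two-point averages, interior points use the
    (1/4, 1/2, 1/4) kernel."""
    z = np.asarray(y, dtype=float).copy()
    for _ in range(q):
        out = np.empty_like(z)
        out[0] = 0.5 * z[0] + 0.5 * z[1]
        out[-1] = 0.5 * z[-1] + 0.5 * z[-2]
        out[1:-1] = 0.5 * z[1:-1] + 0.25 * (z[:-2] + z[2:])
        z = out
    return z
```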

We expect the resulting noise variance, \((\sigma _n^q)^2\), to be lower than the original noise variance \(\sigma _n^2\). Thus, following [25] and assuming we can rewrite the filtered signal in terms of the microstates as \(z_k^q = {\mathcal {H}}(\varvec{x}_k) + \epsilon _k^q\), we measure the performance of the LPMA filter through the increase in the signal-to-noise ratio

$$\begin{aligned} r_0 = \sqrt{ \frac{\sum _k (\epsilon _k)^2 }{ \sum _k (\epsilon _k^q)^2 } } \approx \frac{\sigma _n}{\sigma _n^q} , \end{aligned}$$
(9)

where \(r_0 > 1\) whenever \(\sigma _n > \sigma _n^q\).

The noise distribution of the filtered signal converges to a zero-mean Gaussian if we have either many data samples or a large enough q. However, if q is too large, we may filter out parts of the dynamics and mix them into the noise, resulting in exotic noise distributions (see Fig. 12 in Appendix C for examples). What large enough, many samples, and too large mean depends heavily on the dynamical system and the noise distribution, although we stress that the LPMA filter works best when the noise has, on average, a higher-frequency spectrum than the dynamics of the system (see [26], Chapter 6.4.3.1).

2.2.3 Bound: exploring the attractor

After the preprocessing stage, the next step is to bound the search space. With no constraints on the cost function, the initialization procedure consists of searching the whole microstate space \({\mathcal {X}}\) for a set of microstates that minimize Eq. (4), with no prior preference on where to start the search. However, many real-world systems are dissipative, meaning that their dynamics relax onto an attractor, i.e., a submanifold \({\mathcal {M}} \subset {\mathcal {X}}\) of the microstate space. Thus, it is safe to assume that the observations \(\varvec{y}\) derive from a sequence of microstates that live in or near \({\mathcal {M}}\), and, by the properties of dissipative systems, any microstate \(\varvec{x}\) in the basin of attraction will eventually come arbitrarily close to every point of \({\mathcal {M}}\) [27]. This means that, if we wait long enough, any point in the basin of attraction will get arbitrarily close to the ground-truth microstate.

The bounding stage consists of exploiting the dissipative nature of real-world systems and letting any arbitrary estimate of the microstate explore the basin of attraction until it roughly approaches the ground truth microstate. To be more precise, we say that the microstate \(\varvec{x}_{-T}^R \in {\mathcal {X}}\) roughly approaches \(\varvec{x}_{-T}\) if

$$\begin{aligned} {\mathcal {J}}(\varvec{x}_{-T}^R) \le \delta _R := \alpha _R + \frac{\sigma _n^2}{\sigma _y^2}\beta _R, \end{aligned}$$
(10)

for some rough threshold \(0 \ll \delta _R < 1\). Thus, we let an arbitrary microstate evolve according to the model \(\varvec{f}\) until Eq. (10) is satisfied.

At this stage, we want to obtain solutions with a cost value on the order of the unfiltered noise level \(\sigma _n^2/\sigma _y^2\), so that \(\varvec{x}_{-T}^R\) is near either the feasible set \(\varvec{\Omega }\) or one of the dominating microstates driven by the noise. To achieve this, it suffices to set \(\beta _R \sim {\mathcal {O}}(1)\) and \(\alpha _R < \sigma _n^2/\sigma _y^2 \le 1\).

We note that whenever the time series \(\varvec{y}\) is noiseless, the bounding stage is driven only by \(\alpha _R\), the error-free tolerance about the global minima of the cost function, which are exactly the points in the feasible set \(\varvec{\Omega }\). Thus, our choice of \(\alpha _R\) controls how closely we approach \(\varvec{\Omega }\). If we set \(\alpha _R\) too close to 0, we would force \(\hat{\varvec{x}}_{-T}\) to lie near \(\varvec{\Omega }\), but the simulation time needed to satisfy Eq. (10) would become impractically long. The idea, ultimately, is to set the lowest \(\delta _R\) for which the time to satisfy Eq. (10) remains short.
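The bounding stage can be sketched as follows (illustrative code that reuses the cost_J helper sketched in Sect. 2.2.1; advancing one model update between cost evaluations is an arbitrary but simple choice):

```python
import numpy as np

def bound_microstate(x_guess, f, H, y_filtered, m, sigma_y2,
                     delta_R, max_updates=100_000):
    """Bounding stage: evolve an arbitrary guess under f and accept the first
    iterate whose cost (Eq. (4), evaluated against the filtered observations)
    satisfies the rough-approach condition of Eq. (10)."""
    x = np.asarray(x_guess, dtype=float)
    for _ in range(max_updates):
        if cost_J(x, f, H, y_filtered, m, sigma_y2) <= delta_R:
            return x
        x = f(x)
    return x  # budget exhausted; return the last iterate anyway
```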

2.2.4 Refine: cost minimization

The final step of the initialization procedure is to refine \(\varvec{x}_{-T}^R\). By this point, we expect that \(\varvec{x}_{-T}^R\), our estimate of \(\varvec{x}_{-T}\), has bypassed most of the high-valued local minima of the cost-function landscape. Additionally, we preprocessed the observations \(\varvec{y}\) to reduce their observational noise, but we have not fully exploited this preprocessing yet. By reducing the variance of the observational noise, we lower the number of dominating trajectories of the cost function, i.e., trajectories with \(\varvec{x} \not \in \varvec{\Omega }\) such that \({\mathcal {J}}(\varvec{x}) < {\mathcal {J}}(\varvec{x}_{-T})\). Thus, starting from \(\varvec{x}_{-T}^R\), we can minimize \({\mathcal {J}}\) using any optimization scheme until

$$\begin{aligned} {\mathcal {J}}(\hat{\varvec{x}}_{-T}) \le \delta _r := \alpha _r + \frac{\sigma _n^2}{\sigma _y^2}\beta _r \end{aligned}$$
(11)

for some refinement threshold \(0 < \delta _r \ll 1\), with \(\hat{\varvec{x}}_{-T}\) the assimilated microstate of the system. We then define \(\hat{\varvec{x}}_0 = \varvec{f}^{ \{m T\} }( \hat{\varvec{x}}_{-T} )\) to be the initialized microstate, hoping that \(\hat{\varvec{x}}_0\) is a good representation of \(\varvec{x}_0\).

We want \(\hat{\varvec{x}}_{-T}\) to have the lowest possible cost at this stage. In terms of the error-free tolerance, we look for \(\alpha _r \ll \alpha _R < \sigma _n^2/\sigma _y^2 \le 1\), but the actual magnitude of \(\alpha _r\) is left for the modeler to choose. Regarding the contribution of the noise, recall from Eq. (9) that \(r_0 \approx \sigma _n / \sigma _n^q > 1\), so the lowest expected cost we can attain is \((\sigma _n^2/\sigma _y^2) r_0^{-2}\) (see Eq. (6)). Thus, we define the refinement bound of our initialization procedure by setting \(\alpha _r \ll \alpha _R\) and \(\beta _r \sim {\mathcal {O}}(r_0^{-2})\).

For our optimization scheme, we explore a range of the most successful gradient-based algorithms in the literature [20]. These algorithms include stochastic gradient descent [28], momentum descent [29], Nesterov [30], Adagrad [31], Adadelta [32], Rmsprop [33], Adam [34], AdamX [35], and YamAdam [36].

In all cases, we set the hyper-parameters to the values given in the literature. We then compute the gradient of \({\mathcal {J}}\) using centered finite differences with step size \(\sqrt{\varepsilon _M} \approx 1.5 \times 10^{-8}\), where \(\varepsilon _M\) is the machine epsilon of double-precision arithmetic on our machine. In the absence of observational noise, provided that the dynamics is not degenerate and that we have sufficient observations, we expect the feasible set \(\varvec{\Omega }\) to collapse to the ground-truth microstate alone, so at this stage we expect to infer it from the data with high precision.
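As an illustration of the refinement stage, the sketch below combines a centered finite-difference gradient with a standard Adam update (hyper-parameter values are the usual defaults; this is our own minimal code, not the repository implementation; J is a callable evaluating Eq. (4) for a candidate microstate):

```python
import numpy as np

def fd_gradient(J, x, eps=1.5e-8):
    """Centered finite-difference gradient of the scalar cost J at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (J(x + e) - J(x - e)) / (2.0 * eps)
    return g

def refine_adam(J, x0, delta_r, lr=1e-3, beta1=0.9, beta2=0.999,
                eps_hat=1e-8, max_iters=5_000):
    """Refinement stage: minimize J with Adam starting from the bounded
    microstate, until the threshold of Eq. (11) is met or the budget ends."""
    x = np.asarray(x0, dtype=float).copy()
    m_t = np.zeros_like(x)
    v_t = np.zeros_like(x)
    for t in range(1, max_iters + 1):
        if J(x) <= delta_r:
            break
        g = fd_gradient(J, x)
        m_t = beta1 * m_t + (1 - beta1) * g
        v_t = beta2 * v_t + (1 - beta2) * g * g
        m_hat = m_t / (1 - beta1 ** t)      # bias-corrected first moment
        v_hat = v_t / (1 - beta2 ** t)      # bias-corrected second moment
        x -= lr * m_hat / (np.sqrt(v_hat) + eps_hat)
    return x
```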

2.3 Validation

We validate the initialized microstate \(\hat{\varvec{x}}_0\) by comparing it with the present-time microstate \(\varvec{x}_0\) and by making out-of-sample predictions of the observed time series. Recall that we take the convention in which the observations, \(\varvec{y}= (y_{-T}, \dots , y_{0})\), have non-positive time indices, so we refer to the observation times \(t_k\) for \(k < 0\) as assimilative and to the times for \(k \ge 0\) as predictive. We measure the discrepancy between the real and the simulated observations using both the normalized squared error in the observation space and the normalized squared error in the model space, i.e.,

$$\begin{aligned} \text {NSE}_k^{obs}&= \frac{ ( y_k - \hat{y}_k )^2 }{\sigma _y^2}, \end{aligned}$$
(12)
$$\begin{aligned} \text {NSE}_k^{mod}&= \frac{1}{N_x} (\varvec{x}_k - \hat{\varvec{x}}_k)^\dagger \Sigma _{\varvec{x}}^{-1}(\varvec{x}_k - \hat{\varvec{x}}_k) , \end{aligned}$$
(13)

where \(\sigma _y^2\) is the variance of the data, \(\Sigma _{\varvec{x}}\) is the covariance matrix of the microstates, and \((\cdot )^\dagger \) denotes transposition. In general, \(\varvec{x}_k\) and \(\Sigma _{\varvec{x}}\) are unknown to the modeler, but we use them to measure the performance of our initializations in the latent space of the microstates.
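For reference, Eqs. (12)-(13) translate directly into code (illustrative sketch; cov_x stands for \(\Sigma _{\varvec{x}}\)):

```python
import numpy as np

def nse_obs(y_k, y_hat_k, sigma_y2):
    """Normalized squared error in observation space, Eq. (12)."""
    return (y_k - y_hat_k) ** 2 / sigma_y2

def nse_mod(x_k, x_hat_k, cov_x):
    """Normalized squared error in model space, Eq. (13)."""
    d = np.asarray(x_k, dtype=float) - np.asarray(x_hat_k, dtype=float)
    return float(d @ np.linalg.solve(cov_x, d)) / d.size
```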

Note that if we let \(T \rightarrow \infty \), then \({\mathcal {J}}(\varvec{x})\) converges to \({\mathbb {E}} [ \text {NSE}^{obs}_0 ]\). As k grows, the trajectories \(y_k\) and \(\hat{y}_k\) diverge exponentially until they lose all memory of their initial conditions (\(\varvec{x}_0\) and \(\hat{\varvec{x}}_0\), respectively). Given that this divergence is exponential, we take the median of the NSE, rather than the mean, when comparing performance over an ensemble of experiments.

On the same note, another way to validate the inferred microstate is to look at how long our predictions accurately describe the system. Chaotic systems are by definition sensitive to initial conditions, meaning that microstates that are close to each other diverge exponentially over time. If these microstates live in a chaotic attractor, the distance between them is bounded by the size of the attractor [13]. Thus, we can assess the quality of our predictions by measuring how long the real and simulated observations retain memory of each other [10]. We refer to this limiting time as the predictability horizon of the system, \(k_{max}\), and we define it as the average number of steps before the separation between \(y_k\) and \(\hat{y}_k\) becomes greater than the distance between two random points in the attractor of the system. Mathematically,

$$\begin{aligned} k_{max} := {\mathbb {E}} \Big [ \arg \min \limits _{ k \ge 0} \big \lbrace ( y_k - \hat{y}_k )^2 \ge {\mathcal {D}}_{\mathcal {M}} \big \rbrace \Big ], \end{aligned}$$

where \({\mathcal {D}}_{\mathcal {M}}\) is the average squared distance between two random points in the attractor \({\mathcal {M}}\), normalized by the variance of the attractor. More specifically, if X, Y are two i.i.d. random variables such that \(X \sim \mu ({\mathcal {M}})\), with \(\mu (\cdot )\) denoting the natural measure, then

$$\begin{aligned} {\mathcal {D}}_{\mathcal {M}} := \frac{{\mathbb {E}}[ \Vert X - Y \Vert ^2 ] }{ Var(X) } = 2. \end{aligned}$$

Thus, given that \({\mathbb {E}}[ \sigma _y^2 ] = Var(X)\), we can simplify \(k_{max}\) into an expression that depends only on Eq. (12), so that

$$\begin{aligned} k_{max} = {\mathbb {E}} \Big [ \arg \min \limits _{ k \ge 0} \big \lbrace \text {NSE}^{obs}_{k} \ge 2 \big \rbrace \Big ] \ , \end{aligned}$$
(14)

which is the formula we use to compute \(k_{max}\).
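A per-experiment computation of the inner \(\arg \min \) in Eq. (14) can be sketched as follows (the ensemble average of this index gives \(k_{max}\); returning the window length when the threshold is never reached is our own convention):

```python
import numpy as np

def horizon_single_run(nse_obs_series, threshold=2.0):
    """Index of the first out-of-sample step k >= 0 at which NSE^obs_k
    reaches the attractor-distance threshold of Eq. (14)."""
    above = np.flatnonzero(np.asarray(nse_obs_series) >= threshold)
    return int(above[0]) if above.size else len(nse_obs_series)
```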

Finally, we benchmark the inferred microstate by comparing \(k_{max}\) with a measure of the natural rate of divergence of the dynamics. For this we use the Lyapunov tenfold time \(t_\lambda \), which indicates the average time for two neighboring microstates to diverge from each other by one order of magnitude [37]. It is defined in terms of the inverse of the maximum Lyapunov exponent \(\lambda \) as

$$\begin{aligned} t_\lambda = \frac{ \ln {10} }{m \Delta t} \lambda ^{-1}, \end{aligned}$$
(15)

where the factor \(\ln {10} /(m \Delta t)\) lets us interpret \(t_\lambda \) as the average number of observations after which the dynamics loses one order of magnitude of precision. We obtain \(\lambda \) numerically using the two-particle method of Benettin et al. [38].
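The two-particle method can be sketched as follows (a rough illustration assuming a deterministic one-step map f; the step counts and perturbation size d0 are placeholder values, not those used in our experiments). The tenfold time then follows from Eq. (15):

```python
import numpy as np

def max_lyapunov_benettin(f, x0, n_steps=20_000, d0=1e-8, n_transient=1_000):
    """Two-particle estimate of the maximum Lyapunov exponent per model
    update: evolve a reference and a slightly perturbed trajectory,
    accumulate the log growth of their separation, and renormalize it."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_transient):            # relax onto the attractor first
        x = f(x)
    pert = np.zeros_like(x)
    pert[0] = d0
    x2 = x + pert
    log_sum = 0.0
    for _ in range(n_steps):
        x, x2 = f(x), f(x2)
        d = np.linalg.norm(x2 - x)
        log_sum += np.log(d / d0)
        x2 = x + (d0 / d) * (x2 - x)        # rescale the separation back to d0
    return log_sum / n_steps                # lambda, in units of 1/update
```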

2.4 Initialization procedure summary

We summarize our microstate initialization procedure in the following steps:

  1. Preprocess: Smooth the observed time series using the LPMA filter (see Eq. (8)) or any other suitable noise reduction technique. The smoothed signal will have fewer dominating trajectories [22] and a simpler distribution of local minima than the raw noisy signal [23].

  2. Bound: Make an arbitrary guess \(\varvec{x} \in {\mathcal {X}}\) of the microstate, and let it evolve under the model \(\varvec{f}\) until it roughly approaches the smoothed observations, i.e., until \({\mathcal {J}}(\varvec{x}_{-T}^R) \le \delta _R\) for \(\varvec{x}_{-T}^R = \varvec{f}^{ \{mR\} }(\varvec{x})\) and some \(R \ge 0\) (see Eq. (10)). If several attractors exist, make one arbitrary guess for each basin of attraction of the system.

  3. Refine: Minimize the cost function \({\mathcal {J}}\) starting from \(\varvec{x}_{-T}^R\) using Adam gradient descent or any other suitable optimization scheme until \({\mathcal {J}}(\hat{\varvec{x}}_{-T}) \le \delta _r\) (see Eq. (11)), and call \(\hat{\varvec{x}}_{-T}\) the assimilated microstate and \(\hat{\varvec{x}}_0 = \varvec{f}^{ \{m T\} }(\hat{\varvec{x}}_{-T})\) the initialized microstate.

  4. Validate: Compute the discrepancy between the real and simulated observations (see Eq. (12)) and the predictability horizon \(k_{max}\) (see Eq. (14)) on out-of-sample predictions of the system to evaluate the quality of the initialized microstate \(\hat{\varvec{x}}_0\). If possible, benchmark the predictability horizon against the Lyapunov tenfold time (see Eq. (15)) of the system considered. A minimal end-to-end sketch of these steps is given below.
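For concreteness, the sketch below strings the previous illustrative helpers (lpma_filter, cost_J, bound_microstate, refine_adam) into the full pipeline; the tolerance values shown are placeholders rather than the settings used in our experiments:

```python
import numpy as np

def initialize_microstate(y_obs, f, H, m, sigma_y2, sigma_n2, x_guess,
                          q=2, alpha_R=0.05, beta_R=1.0,
                          alpha_r=1e-4, beta_r=0.25):
    """End-to-end sketch of Sect. 2.4, composed from the helpers sketched
    earlier; thresholds follow Eqs. (10)-(11)."""
    # 1. Preprocess: smooth the observed series with the LPMA filter.
    z = lpma_filter(y_obs, q=q)
    J = lambda x: cost_J(x, f, H, z, m, sigma_y2)
    # 2. Bound: evolve the guess until it roughly approaches the data, Eq. (10).
    delta_R = alpha_R + (sigma_n2 / sigma_y2) * beta_R
    x_R = bound_microstate(x_guess, f, H, z, m, sigma_y2, delta_R)
    # 3. Refine: gradient-based minimization of the cost down to Eq. (11),
    #    then push the assimilated microstate forward to the present time t_0.
    delta_r = alpha_r + (sigma_n2 / sigma_y2) * beta_r
    x_assim = refine_adam(J, x_R, delta_r)
    x_init = np.asarray(x_assim, dtype=float).copy()
    T = len(y_obs) - 1
    for _ in range(m * T):
        x_init = f(x_init)
    # 4. Validation is done separately, e.g., with nse_obs / horizon_single_run
    #    on out-of-sample observations.
    return x_assim, x_init
```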

3 Results

In this section, we test the microstate initialization procedure on two paradigmatic chaotic systems: the well-known Lorenz system [21] and the high-dimensional Mackey–Glass system [39]. We approximate both systems using numerical integrators (described in each section that follows), and we take the approximated system as the real dynamical system.

Table 1 Parameters and other quantities

In all cases, we sample the ground-truth microstate \(\varvec{x}_{-T}\) from the attractor of the system considered. We generate observations using the following nonlinear observation operator

$$\begin{aligned} {\mathcal {H}}(\varvec{x}) = \root 3 \of {\sum _{i=1}^{N_x} (\varvec{x})_i^3 }, \end{aligned}$$
(16)

where \((\varvec{x})_i\) is the i-th component of \(\varvec{x}\). Note that, while \({\mathcal {H}}\) is nonlinear, the mappings \(x \rightarrow x^3\) and \(x \rightarrow \root 3 \of {x}\) are bijective, so there exists a diffeomorphism between \({\mathcal {H}}\) and any non-degenerate linear operator \(\varvec{H}: {\mathcal {X}} \subset {\mathbb {R}}^{N_x} \rightarrow {\mathbb {R}}\). In addition to this operator, in Appendix B we study the quality of our initialization procedure with several observation operators, each with a different level of coupling between the microstate components. The predictions in the observation space are almost identical regardless of which observation operator we use. However, given the symmetry of the Lorenz system, the operator with the maximum coupling recovers the right x component but a reflection of the ground truth in the y and z components.

We test our initialization procedure on both noiseless time series and noisy time series. For the noisy series, we take a zero-mean Gaussian noise distribution \(\epsilon _k \sim {\mathcal {N}}(0, \sigma _n^2)\) for all k. Here, \(\sigma _n\) represents the noise level of the observations, which we take to be \(30 \%\) of the standard deviation of the observed data; i.e., \(\sigma _n = 0.3\sigma _y\). We always construct the noiseless and noisy time series from the same ground-truth microstate so that all our results are comparable.

Although the choice of initial guess is arbitrary, we initialize our method with a microstate that matches the first observation of \(\varvec{y}\) exactly. That is, we take our initial guess at random from the set \(\{ \varvec{x} \in {\mathcal {X}} : {\mathcal {H}}(\varvec{x}) = y_{-T} \}\). This is straightforward to do for any homogeneous operator, such as that of Eq. (16).
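As an illustration, the operator of Eq. (16) and an initial guess matching the first observation can be generated as follows (illustrative sketch; the rescaling relies on the degree-1 homogeneity of \({\mathcal {H}}\), i.e., \({\mathcal {H}}(c\varvec{x}) = c\,{\mathcal {H}}(\varvec{x})\)):

```python
import numpy as np

def H_cubic(x):
    """Aggregation operator of Eq. (16): signed cube root of the sum of cubes."""
    s = np.sum(np.asarray(x, dtype=float) ** 3)
    return np.sign(s) * np.abs(s) ** (1.0 / 3.0)

def random_guess_matching(y_first, n_x, rng=None):
    """Draw a random microstate and rescale it so that H_cubic(x) = y_first."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(n_x)
    h = H_cubic(x)
    if abs(h) < 1e-12:          # avoid a degenerate draw with H(x) ~ 0
        return random_guess_matching(y_first, n_x, rng)
    return x * (y_first / h)
```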

Throughout the results, we use the parameters and system features described in Table 1 unless otherwise stated. However, we evaluate the performance of our method for various choices of the rough threshold \(\delta _R\) (see Fig. 16 in Appendix C) and of the optimizers described in Sect. 2.2.4 (see Figs. 10 and 11 in Appendix C). Varying \(\delta _R\) determines the effect of bounding the search space to finer-grained regions of the attractor; we find that when observations are noiseless, the lower \(\delta _R\) the better the initialization, but the longer it takes to meet condition (10). For noisy observations, we find that if we set \(\delta _R\) very small, the refinement stage yields no improvement over the initialized microstates already obtained. In terms of the optimization schemes of the refinement stage, we find that Adadelta [32] and the various flavors of Adam [34,35,36] outperform all other alternatives, with significantly better results than vanilla gradient descent.

3.1 Low-dimensional example: Lorenz system

We first test our microstate initialization procedure on the Lorenz system [21], described by the following differential equations

$$\begin{aligned} \begin{aligned} \dot{x}&= \sigma ( y -x ) \ , \\ \dot{y}&= x(\rho - z) - y \ , \\ \dot{z}&= xy - \beta z \ , \end{aligned} \end{aligned}$$
(17)

where \(\varvec{x}_k = \big ( x(t_k), y(t_k), z(t_k) \big )\) is the microstate of the system at time \(t_k\). The dynamics exhibit chaotic behavior for the parameters \(\sigma = 10\), \(\rho = 28\), and \(\beta = 8/3\). We solve the system using a 4th-order Runge–Kutta integrator with a fixed step size of 0.01 units. Under these settings, the system’s Lyapunov tenfold time is \( t_\lambda = 127\) samples, while its attractor has a Kaplan–Yorke dimension of \(N_{{\mathcal {M}}} = 2.06\).
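A minimal sketch of the integrator used as the map \(\varvec{f}\) for this system (fixed-step 4th-order Runge–Kutta with the parameter values above; our own illustrative code):

```python
import numpy as np

def lorenz_rhs(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (17)."""
    return np.array([sigma * (x[1] - x[0]),
                     x[0] * (rho - x[2]) - x[1],
                     x[0] * x[1] - beta * x[2]])

def lorenz_step(x, dt=0.01):
    """One fixed-step RK4 update, used as the one-step map f."""
    k1 = lorenz_rhs(x)
    k2 = lorenz_rhs(x + 0.5 * dt * k1)
    k3 = lorenz_rhs(x + 0.5 * dt * k2)
    k4 = lorenz_rhs(x + dt * k3)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
```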

We perform 1000 independent experiments. For each experiment, we sample ground-truth microstates at random from the attractor of the system and generate time series of \(T = 50\) observations with a sampling interval of \(m = 2\) time steps per observation. We initialize the microstate for each time series following the steps presented in Sect. 2.4 using the parameters summarized in Table 1. We present our results in Fig. 1, where we show the median assimilation (\(k<0\)) and prediction errors (\(k \ge 0\)) of the noiseless and noisy time series for both the model and observation spaces (see Eqs. (12)-(13)).

Fig. 1 Prediction error for the Lorenz system, showing the median normalized squared error over 1000 experiments for the observation space (solid lines) and the model space (dotted lines) for the case of noiseless (green) and noisy (blue) observations. The solid vertical line separates the assimilative regime (\(k<0\)) from the predictive regime (\(k \ge 0\))

From the assimilation side, our estimates get progressively better the closer they are to the present time at \(k = 0\). This means that the longer the time series (and the more information we have from the past), the better the quality of the initialized microstate. Note that in the noisy case, the assimilative error plateaus near \(10^{-3}\) in the observation space, marking the noise level of the observations. In contrast, the estimates keep improving in the model space, indicating that even in the presence of noise, having more observations reduces the probability of encountering dominating trajectories.

From the prediction side, we observe that the error diverges at essentially the same rate in both the noiseless and noisy cases. The main difference is that the error intercept at \(k=0\) is higher in the noisy case, making the predictions saturate earlier than when the observations are noiseless. Specifically, we find prediction horizons of \(k_{max} = 171\) and \(k_{max}^{noisy} = 113\) steps, corresponding to \(1.35 t_\lambda \) and \(0.89 t_\lambda \) for the noiseless and noisy time series, respectively (see Fig. 13 in Appendix C for an alternative approach to the prediction horizons). Moreover, we find that the prediction errors in the model and observation spaces are almost identical throughout the whole prediction window. Thus, measuring how the errors diverge in the observation space gives us a good proxy for the out-of-sample behavior of the latent microstates of the system.

Fig. 2 Predictability vs. number of observations. We show how the predictability horizon \(k_{max}\) for the Lorenz system changes with the number of observations T for noiseless (green) and noisy (blue) ensembles of time series. The horizontal black solid line indicates the Lyapunov tenfold time \(t_\lambda \)

In Figs. 2 and 3, we analyze how the performance of our method depends on the length of the observation window, showing how the prediction horizons and the discrepancies between \(\varvec{x}_0\) and \(\hat{\varvec{x}}_0\) change with the number of observations.

Fig. 3 Initialized microstate error for the Lorenz system. We show how the average discrepancy \(\text {NSE}_0^{model}\) between the true present-time microstate \(\varvec{x}_0\) and the initialized microstate \(\hat{\varvec{x}}_0\) changes with the number of observations T for noiseless (green) and noisy (blue) ensembles of time series

In Fig. 2, we find that the prediction horizon increases linearly with the number of observations available, with a similar slope for both the noiseless and noisy cases. Not surprisingly, the noise affects the time horizon over which one can make an effective prediction.

Additionally, we observe in Fig. 3 that the discrepancy decreases monotonically in both the noiseless and noisy cases. For the noiseless case, we observe a faster-than-exponential decrease in the discrepancy, ranging from \(\text {NSE}_0^{model} \sim 10^{-3}\) for \(T = 5\) to \(\text {NSE}_0^{model} \sim 10^{-5}\) for \(T=50\) observations. While the change is less pronounced for the noisy time series, the discrepancy still decreases by 2 orders of magnitude, from \(\text {NSE}_0^{model} \sim 10^{-1.5}\) for \(T = 5\) to \(\text {NSE}_0^{model} \sim 10^{-3.5}\) for \(T=50\) observations.

The Lorenz equations are a low-dimensional system (\(N_x = 3\)) with a low-dimensional attractor of dimension \(N_{{\mathcal {M}}} = 2.06\). The number of observations in our experiments can be much larger than the dimension of the system. Combined with the fact that this system does not have any severe degeneracies (see Appendix A), we recover the ground-truth microstate precisely with only a handful of noiseless observations. Every additional observation is, in theory, redundant for finding \(\varvec{x}_{-T}\), but, in practice, measurements can have several sources of error, such as observational noise or the finite precision of numerical integration methods. Each additional observation thus further averages out these errors, which is probably why \(k_{max}\) improves proportionally to T.

3.2 High-dimensional system: Mackey–Glass

The Mackey–Glass system [39] describes the dynamics of blood cell density with the following delay differential equation

$$\begin{aligned} \dot{x} = {\mathscr {F}}\left( x, x_{t_d} \right) = \frac{a x_{t_d}}{1 + x_{t_d}^c} - b x. \end{aligned}$$
(18)

The state \(x_{t_d} = x(t-t_d)\) is the cell density delayed by \(t_d\) time units, and a, b, and c are parameters. The system exhibits chaotic dynamics for \(t_d > 16.8\) with \(a = 0.2\), \(b = 0.1\), and \(c = 10\) [40]. In terms of blood cell density, the chaotic regime represents pathological behavior.

The evolution of Eq. (18) relies on knowing the state of x over the continuous interval \([t-t_d, t]\), making its state space infinite-dimensional. However, we can approximate this state by taking \(N_x\) samples at intervals of length \(\Delta t = t_d/N_x\) and constructing the \(N_x\)-dimensional microstate vector

$$\begin{aligned} \varvec{x}_{k}&= \left( (\varvec{x}_{k})_1, \cdots , (\varvec{x}_{k})_{N_x} \right) \nonumber \\&= \big ( x( t_k - \underbrace{N_x \Delta t}_{t_d}), \dots , x(t_k - \Delta t), x(t_k) \big ) , \end{aligned}$$
(19)

where \((\varvec{x}_{k})_i = x(t_k - (N_x-i)\Delta t)\).

Using this vector, we can obtain trajectories of the Mackey–Glass system with any numerical integrator; for simplicity we use the Euler method with a fixed step size of \(\Delta t\). We can thus recast this approximate system as the following \(N_x\)-dimensional deterministic mapping

$$\begin{aligned} \varvec{x}_{k+1}&= \varvec{f}(\varvec{x}_{k}) \nonumber \\&= \left\{ \begin{array}{l} (\varvec{x}_k)_{N_x} + \Delta t \, {\mathscr {F}}\big ( (\varvec{x}_k)_{N_x}, (\varvec{x}_k)_{1} \big ) \\ (\varvec{x}_{k+1})_{1} + \Delta t \, {\mathscr {F}}\big ( (\varvec{x}_{k+1})_{1}, (\varvec{x}_k)_{2} \big ) \\ \vdots \\ (\varvec{x}_{k+1})_{N_x-1} + \Delta t \, {\mathscr {F}} \big ( (\varvec{x}_{k+1})_{N_x - 1}, (\varvec{x}_{k})_{N_x} \big ), \end{array} \right. \end{aligned}$$
(20)

using \({\mathscr {F}}\) as defined in Eq. (18). The microstate-space dimension \(N_x\) of this mapping is determined by \(t_d/\Delta t\), for which we take \(N_x = 50\) and \(t_d = 25\) so that the system exhibits chaotic dynamics. Under these settings, the system’s Lyapunov tenfold time is \(t_\lambda = 230\) samples, while its attractor has a Kaplan–Yorke dimension of \(N_{\mathcal {M}} = 2.34\).
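A sketch of the map (20) is given below (illustrative code; the inner loop performs the \(N_x\) sequential Euler sub-steps of the display above, each using the previous new component as the current value and the corresponding component of \(\varvec{x}_k\) as the delayed value):

```python
import numpy as np

def mackey_glass_rhs(x_now, x_delayed, a=0.2, b=0.1, c=10.0):
    """F(x, x_{t_d}) of Eq. (18)."""
    return a * x_delayed / (1.0 + x_delayed ** c) - b * x_now

def mackey_glass_map(x_k, t_d=25.0):
    """One update of the delay-embedded Euler map of Eq. (20)."""
    x_k = np.asarray(x_k, dtype=float)
    n_x = x_k.size
    dt = t_d / n_x
    x_next = np.empty_like(x_k)
    current = x_k[-1]                       # the most recent stored value
    for i in range(n_x):
        current = current + dt * mackey_glass_rhs(current, x_k[i])
        x_next[i] = current
    return x_next
```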

As before, we perform 1000 independent experiments in which, for each experiment, we sample ground-truth microstates at random from the attractor of the system and generate time series of \(T=25\) observations with a sampling interval of \(m = 2\) time steps per observation (see Table 1 for details). In contrast to the Lorenz system, we consider time series containing fewer data points than the dimension of the microstate space (in this case \(T=25\) and \(N_x=50\), respectively), making the problem under-determined. We present our results in Fig. 4, where we show the median assimilation (\(k<0\)) and prediction errors (\(k \ge 0\)) for the noiseless and noisy time series for both the model and observation spaces.

Fig. 4 Prediction error for the Mackey–Glass system. We show the median normalized squared error over 1000 experiments for the observation space (solid lines) and the model space (dotted lines) for the case of noiseless (green) and noisy (blue) observations. The solid vertical line separates the assimilative regime (\(k<0\)) from the predictive regime (\(k \ge 0\))

Unexpectedly, our method yields more accurate initializations for the Mackey–Glass system than for the Lorenz system, even though both the attractor and the microstate space of the former have a higher dimension than those of the latter. However, if we compare the power spectra of the two systems (see Fig. 14 in Appendix C), we find that the Mackey–Glass system has a faster frequency decay and more frequency peaks than the Lorenz system, suggesting that the former is easier to initialize than the latter. For instance, the power amplitude at frequency 1/6 is more than 1000 times higher in the Lorenz system than in the Mackey–Glass system. This further suggests that the power spectrum of a system is a better indicator of its initializability than the dimension of its chaotic attractor.

From the prediction side, our results are qualitatively similar to what we saw for the Lorenz system. The error diverges at roughly the same rate in both the noiseless and noisy time series, with an error intercept at \(k=0\) that is higher for the noisy experiments. Additionally, we find a very close correspondence between the errors in the model and observation spaces, which supports our claim that measuring the error in the observation space gives us a good proxy of the out-of-sample behavior of the latent microstate dynamics.

In terms of the predictability horizons, we find \(k_{max} = 556\) and \(k_{max}^{noisy} = 285\), corresponding to \(2.4t_\lambda \) and \(1.2t_\lambda \) for the noiseless and noisy ensembles, respectively (see Fig. 13 in Appendix C for an alternative approach to the prediction horizons). We find it remarkable that with only 25 observations of the system, we obtain predictions that stay accurate for significantly longer than the Lyapunov tenfold time of the system.

From the assimilation side, the microstate estimates get progressively better the closer they are to the present time, similar to what we observed for the Lorenz system. However, the assimilation error is significantly lower in the observation space than in the model space, suggesting, misleadingly, that the initialized microstate is much more accurate than the model-space error indicates. Nonetheless, the errors in the model and observation spaces converge to each other as soon as the prediction window starts, meaning that the error in the observation space is still an accurate proxy of the out-of-sample behavior in the model space.

Interestingly, the first few out-of-sample predictions in the model space have a lower discrepancy than in the observation space in both the noiseless and noisy cases. This happens, we believe, because the time span of the observations \(\varvec{y}\) is not long enough for the initialized microstate to converge onto the attractor. With only \(T = 25\) data points of a 50-dimensional chaotic system, we do not possess enough information to recover the present-time microstate precisely.

Fig. 5 Predictability horizon of the Mackey–Glass system. We show how the predictability horizon \(k_{max}\) changes with the number of observations T for noiseless (green) and noisy (blue) ensembles of time series. The horizontal black solid line indicates the Lyapunov tenfold time \(t_\lambda \). For the noiseless case, we observe a critical transition in the behavior of \(k_{max}\) at \(T_c = 25\)

The previous discussion suggests that we need more observations to initialize the system better. To investigate this, we perform a series of experiments in which we vary the number of data points in the observed time series and assess the quality of the predictions. As we did for the Lorenz system, we focus on the prediction horizon (see Fig. 5) and the discrepancy between the present-time and the initialized microstate (see Fig. 6).

We find contrasting behavior between the noisy and noiseless experiments. When the observations are noisy, the prediction horizon increases linearly with the number of observations (see Fig. 5, blue), ranging from \(k_{max} = 0.78 t_\lambda \) for \(T=5\) to \(k_{max} = 1.34 t_\lambda \) for \(T=50\). We also find that, in general, the discrepancy between the initialized and ground-truth present-time microstate decreases monotonically with the number of observations (see Fig. 6, bottom). These results are qualitatively similar to what we found for the Lorenz system: the more observations, the better.

When the observations are noiseless, we find a critical change of behavior at roughly \(T = 25\) observations. In Fig. 5 (green), the prediction horizon rises superlinearly for \(T < 25\), from \(k_{max} = 1.11 t_\lambda \) for \(T = 5\) to \(k_{max} = 2.31 t_\lambda \) for \(T = 25\). Afterward, the prediction horizon grows only linearly and marginally, reaching \(k_{max} = 2.67 t_\lambda \) for \(T = 50\). We note, however, that the increase in this linear regime is almost double that of the noisy counterpart, with a slope of \(\Delta k_{max} = 15.2\) steps per observation against \(\Delta k_{max}^{noisy} = 8.3\). In parallel, we find that the discrepancy of the initialized microstate decreases abruptly for time series of \( T \ge 25\) observations, as we show in Fig. 6 (green). This sharp transition reflects that time series with more than 25 observations carry enough information to pin down the present-time latent microstate precisely, meaning that we possess enough data points to uniquely separate the mixing of the microstate components generated by the observation operator \({\mathcal {H}}\).

Fig. 6 Initialized model error for the Mackey–Glass system. We show how \(\text {NSE}_0^{model}\), the average discrepancy between the true present-time microstate \(\varvec{x}_0\) and the initialized microstate \(\hat{\varvec{x}}_0\), changes with the number of observations T for noiseless (green) and noisy (blue) ensembles of time series. For the noiseless case, we observe a critical transition in the behavior of \(\text {NSE}_0^{model}\) at \(T_c = 25\)

Fig. 7 Critical transition heatmap of the Mackey–Glass system. On the z-axis, we show the (base-10 logarithm of the) initialized microstate discrepancy, \(\text {NSE}_0^{model}\), as a function of the number of observations T and the sampling interval m for ensembles of noiseless time series. In white, we plot the \(m = N_x/T\) curve for fixed \(N_x = 50\). We find that the microstate discrepancies drop abruptly across this curve

In short, having 25 (or more) noiseless measurements of the system gives us enough information to precisely recover the present-time microstate, which is 50-dimensional. Recall that we consider the discrete map (20) to be the real system, so the only information we lose comes either from measuring the system with \({\mathcal {H}}\) or from taking time series with a coarse-grained sampling frequency. Thus, for a fixed \({\mathcal {H}}\) and noiseless observations, recovering the initial microstates precisely should depend solely on how well the samples describe the latent trajectory of the system. Inspired by the Nyquist–Shannon sampling theorem [41], if we can establish a clear cutoff frequency in the power spectrum of the system, we can argue that observing the system at twice its cutoff frequency loses no information with respect to the signal sampled at every update of map (20).

We thus claim that, if we can establish a clear cutoff frequency \(f_c\) and the dynamical system does not suffer from severe degeneracies, we can precisely obtain the initial conditions of an \(N_x\)-dimensional (possibly) nonlinear system observed every m updates with a scalar (possibly) nonlinear observation operator \({\mathcal {H}}\) if (1) \(m^{-1} \ge 2f_c\) and (2) \(T \ge T_c = N_x / m\). Satisfying these conditions is equivalent to having an invertible observation matrix \(\varvec{M}\) that determines the solution of the system \(\varvec{y} = \varvec{M} \varvec{x}_0\) exactly (see Appendix A for a deeper development of this discussion).

We perform further experiments to check the validity of our claim, measuring the initialized microstate discrepancy while varying both the number of observations T and the sampling interval m and leaving the dimension of the system fixed at \(N_x = 50\). If our claim about the Nyquist–Shannon theorem is a good approximation, we expect to find a \(T_c \sim 1/m\) relation where, for \(T \ge T_c\), the error in the initialized microstate becomes significantly lower than for \(T < T_c\). In Fig. 7, we show the results of these experiments, in which we observe such a change of regimes across the \(T = N_x/m\) line, thus supporting the Nyquist–Shannon hypothesis.

4 Conclusions

Many natural and social processes can be accurately described by how their microstates interact and evolve into rich, non-trivial dynamics with emergent properties. We often possess only aggregate, noisy measurements of such processes, so it is of great interest to develop methods that let us extract information about the latent microstate dynamics from a given dataset.

In this paper, we tackled the problem of initializing the latent microstate of a known (possibly) nonlinear dynamical system from aggregate, (possibly) noisy and nonlinear observations of the system. We proposed a three-step method to obtain this latent microstate, which consists of (1) reducing the observational noise to mitigate possible dominating trajectories, (2) letting the system explore its attractor(s) to limit the region in which we search for an optimal solution, and (3) minimizing the discrepancy between the simulated and real observations to obtain a refined estimate of the ground-truth microstate. We quantified the discrepancy between observations and simulations using a least-squares cost function in the observation space, similar to [13, 14]. We minimized the cost function using a range of gradient-based algorithms, finding that Adadelta [32] and Adam-oriented schemes [34,35,36] perform best.

We tested our method on two paradigmatic chaotic examples: the Lorenz system and a high-dimensional approximation of the Mackey–Glass system. We obtained initialized microstates that accurately fit the data, with out-of-sample predictions that remained accurate beyond the systems’ Lyapunov tenfold times, even when the observed time series were very short. We found that good predictions in the observation space always implied good predictions in the space of the microstates. In all cases, we considered nonlinear observation operators that aggregate all the microstate components into a single real number, with robust results across all the operators considered. Surprisingly, we obtained better results for the Mackey–Glass system, which has a higher-dimensional model space and a higher-dimensional attractor but a faster-decaying frequency spectrum than the Lorenz system, suggesting that the frequency spectrum is a better proxy of a system's initializability than the ratio of observations to model dimension.

In most experiments, the quality of the initialized microstate was proportional to the number of data points in the observed time series. However, when the dimension of the system was higher than the number of observations and these observations were noiseless, the quality of the initialized microstate grew superlinearly. This superlinear regime transitions into the more common linear regime in a non-trivial manner, and we explored the conditions for such a transition. We claim that, as long as we can establish a clear cutoff frequency of the observed data and the data meet the Nyquist–Shannon sampling conditions with respect to that cutoff frequency, we can recover the ground-truth microstate precisely with fewer observations than the dimension of the system, which marks the transition between regimes. This implies that if we possess a dataset whose observations are sampled at an optimal rate, so that we lose the least possible information about the underlying system, we can obtain high-quality initializations with just a handful of samples.

How well we can initialize a system depends on the amount of information the observed data contain and on the intrinsic features of the system. On the one hand, the amount of observational noise, the mixing that results from aggregating the system, the number of observations, and the data sampling rate largely determine the risk of estimating microstates that fit the data well but extrapolate poorly into the future. On the other hand, the dimension of the dynamical system, the frequency spectrum of its attractor(s), and how chaotic the system is determine the window over which the predictions stay accurate.

This work gives us a conceptual framework to understand the interface between aggregate data and microscopic interactions. However, it is limited to the case in which we know the model that perfectly specifies the system, as well as the observation operator and the characteristics of the noise in the observations.

In future developments, we will explore how to deal with misspecified models and systems with stochastic behavior. These new sources of uncertainty should be included in the cost functions involved. With stochastic behavior alone, there are several new considerations for inferring the evolution of individual agents in a system [42]. Once we better understand what information low-dimensional observations give us about a high-dimensional dynamical system, the natural next step is to test our initialization method on real-world datasets.