1 Introduction

In this article, we expand the analysis of the deep parametric PDE method, provide further numerical results and demonstrate performance gains in a core computational problem in counterparty credit risk. The deep learning-based solver introduced in Glau and Wunderlich (2022) offers efficient approximations to high-dimensional parametric partial differential equations (PDEs) and builds on Sirignano and Spiliopoulos (2018). To comply with regulatory policies, for instance, financial institutions have to regularly evaluate a massive number of high-dimensional PDEs. In particular, large portfolios and scenario calculations demand PDE solutions for different parameters in one stochastic model. In order to advance these calculations in risk management, efficient high-dimensional parametric PDE solvers are required. The efficiency of classical PDE solvers deteriorates in high dimensions due to the curse of dimensionality. We show that the approximation rate of the deep neural networks (DNNs) used in the deep parametric PDE solver is independent of the dimension of the problem.

The deep parametric PDE method has a strong mathematical foundation, as it was shown to converge to the exact solution, provided perfect numerical minimisation. However, there are currently no approximation rates available for the underlying DNN which reflect the efficiency in high dimensions. While the approximation theory for DNNs is an active field, see, e.g., Elbrächter et al. (2021), Gühring et al. (2020), most results are only available for rectified linear units (ReLU) as the activation function. Their lack of smoothness makes them unsuitable for the approximation of PDEs in the given framework. We therefore extend the approximation theory for neural networks with smooth activation functions and prove for the first time (to the best of our knowledge) an approximation rate which is independent of the dimensionality of the problem, thus lessening the curse of dimensionality. This result is shown by providing best-approximation rates in the \(L^2\)-norm which can be applied flexibly to the approximation over a variable state space, a time-state space as well as to the full parametric solutions.

In order to confirm this theoretical insight in a numerical example of practical relevance, we consider credit exposure calculations. Here, financial institutions have to evaluate the risk they face in case counterparties of a derivative deal default. The common use of Monte-Carlo simulations to generate risk factors in exposure calculations requires frequent calls to the pricer. In particular for derivatives depending on several risk factors, this leads to unsustainably long run-times, rendering e.g. real-time nowcasting infeasible. As a prototypical example, we consider multi-asset option pricing and evaluate the expected exposure as well as the potential future exposure with up to ten underlying assets.

1.1 Literature review

The mathematical analysis of deep neural networks has advanced significantly in recent years, see, e.g. Berner et al. (2021). In part, this is due to the increased attention to the topic thanks to impressive results of deep learning (LeCun et al., 2015).

Applications in deep learning and DNN-based PDE solvers share the need for results on the best-approximation property of neural networks. For an overview of current results (mainly based on ReLU networks) see Elbrächter et al. (2021); Gühring et al. (2020). Approximation results for ReLU networks are shown in Yarotsky (2017) and were subsequently improved, e.g. specifically for solutions of PDEs (Grohs et al., 2019; Beck et al., 2020; Elbrächter et al., 2021), or in Sobolev spaces which include the first derivative (Gühring et al., 2020). Errors in Sobolev spaces for similar activation functions, rectified cubic units, were considered in Abdeljawad and Grohs (2022). For Kolmogorov PDEs it has furthermore been shown that the solution in the state space can be approximated without the curse of dimensionality, see for example (Jentzen et al., 2021; Grohs et al., 2018; Reisinger & Zhang, 2020). We note that in this article, we have chosen the more general approach of best-approximation rates, as the deep parametric PDE method can be used to solve non-Kolmogorov PDEs. Exponential convergence rates for analytic functions were shown in low dimensions (E & Wang, 2018) and in high dimensions (Opschoor et al., 2021). Constructive approaches based on Chebyshev interpolation were also investigated in Tang et al. (2019); Herrmann et al. (2022).

The deep parametric PDE method was first introduced in Glau and Wunderlich (2022) and is based on the deep Galerkin method (Sirignano & Spiliopoulos, 2018). For a detailed literature review, we refer the reader to these articles. Related ideas led to the development of physics-informed neural networks (Raissi et al., 2019; Karniadakis et al., 2021). The interest in DNN-based methods is large; a few recent examples are applications to Hamilton-Jacobi-Bellman equations (Grohs & Herrmann, 2021) and stochastic control in finance (Germain et al., 2021). In addition, the current research into high-dimensional problems has inspired approaches beyond neural networks, see, e.g. Antonov and Piterbarg (2021).

Due to the second-order derivative terms, some PDE solvers (including the deep parametric PDE method) require smooth activation functions. However, for smooth activation functions, there are fewer results available than for ReLU networks. For single-layer networks, Barron (1993) and Siegel and Xu (2020) have shown approximation estimates in the \(L^2\)-norm which do not exceed those of standard Monte-Carlo methods. Expression rates for deep neural networks are investigated in Ohn and Kim (2019); Langer (2021), but these rates deteriorate in high dimensions due to the remaining curse of dimensionality.

To the best of our knowledge, this article presents the first approximation rates for DNNs with smooth activation functions which are not deteriorating with the number of dimensions. The analysis is based on a sparse grid approach (Bungartz & Griebel, 2004; Garcke & Griebel, 2013). It is motivated by related results for ReLU activation functions (Montanelli & Du, 2019).

While the approximation power is a major contributor to the numerical error, research into the remaining error sources is highly important, too. Some gaps between theory and practice are demonstrated in Adcock and Dexter (2021). Approximation results which include the sampling error are considered, e.g., in Shin et al. (2020). First investigations into the convergence of the optimiser are also available, see, e.g. Jentzen and Riekert (2021) and the references therein.

For an overview of counterparty credit risk and exposure calculations, see Gregory (2010); Green (2015). Several articles have proposed techniques to speed up their calculation. For low dimensions, the dynamic Chebyshev method introduced in Glau et al. (2019) has proven highly suited for exposure calculations (Glau et al., 2021). For higher dimensions, several machine learning algorithms have been considered: a proof of concept for a supervised learning-based approach in Welack (2019), calculations based on the deep BSDE solver in Gnoatto et al. (2020), and based on the deep optimal stopping algorithm for Bermudan options in Andersson and Oosterlee (2021a, 2021b). Besides DNN-based approaches, sparse grid methods, for example, have recently been proposed for use in exposure calculations (Grzelak, 2021). In this article, we demonstrate the performance of the deep parametric PDE method for exposure calculations including high-dimensional pricing problems.

1.2 Structure of the article

In Sect. 2, we briefly recall the deep parametric PDE method. The theoretical analysis of the approximation through neural networks with smooth activation functions is presented in Sect. 3. Then in Sect. 4, we present the performance of the deep parametric PDE method in exposure calculations. Finally, we summarise and conclude the article in Sect. 5.

2 The deep parametric PDE method for option pricing

The deep parametric PDE method was recently introduced in Glau and Wunderlich (2022) as a neural network-based method to simultaneously compute the option price at all time, state and parameter values of interest. For the convenience of the reader, we recall its main properties.

Expressing the option price in logarithmic asset variables, \(u(\tau , x; \mu )\) denotes the fair price of an option at time to maturity \(\tau \) for the asset prices \(s_i=e^{x_i}\):

$$\begin{aligned} u(\tau , x; \mu )&= {\text {Price}}(T-\tau , e^x; \mu ),\\ {\text {Price}}({t}, s; \mu )&= e^{- r (T-{t}) } \mathbb {E}(G(S_T(\mu )) \mid S_{{t}}(\mu )=s), \end{aligned}$$

with \(d\) underlyings \(S_{t}(\mu ) = (S_{t}^1(\mu ), \ldots , S_{t}^d(\mu ))\) and the physical time \({t}= T - \tau \). Here, \(G\) denotes the payoff function at maturity and the assets \(S\) are modelled by a multivariate geometric Brownian motion. The parameter \(\mu \) can describe model parameters as well as option parameters. In our setting, the parameter vector \(\mu \) contains the risk-free rate of return, volatilities and correlations, each with a smooth parameter dependency.
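For illustration, the following Python snippet sketches how the risk-neutral price at \(t=0\) can be estimated by plain Monte-Carlo simulation of the multivariate geometric Brownian motion. The payoff, the parameter values and the function name mc_price are hypothetical examples for this sketch and not part of the method itself.

```python
import numpy as np

def mc_price(g, s0, r, sigma, corr, T, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of Price(0, s0; mu) = e^{-rT} E[G(S_T)]
    under a multivariate geometric Brownian motion (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    d = len(s0)
    chol = np.linalg.cholesky(corr)                       # correlate the Brownian increments
    z = rng.standard_normal((n_samples, d)) @ chol.T
    drift = (r - 0.5 * sigma**2) * T
    s_T = s0 * np.exp(drift + sigma * np.sqrt(T) * z)     # terminal asset values
    return np.exp(-r * T) * g(s_T).mean()

# hypothetical example: basket call on two assets with payoff (mean(S_T) - K)_+
payoff = lambda s, K=100.0: np.maximum(s.mean(axis=1) - K, 0.0)
print(mc_price(payoff, s0=np.array([100.0, 100.0]), r=0.01,
               sigma=np.array([0.2, 0.25]),
               corr=np.array([[1.0, 0.5], [0.5, 1.0]]), T=1.0))
```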

The option price solves the Black-Scholes PDE

$$\begin{aligned} \partial _\tau u(\tau ,x;\mu ) + \mathcal {L}^\mu _xu(\tau ,x;\mu )&= 0, \quad&(\tau , x)&\in \mathcal {Q}= (0,T)\times \Omega ,\\ u(0,x;\mu )&= g(x), \quad&x&\in \Omega , \\ u(\tau , x; \mu )&= u_{\Sigma }(\tau ,x;\mu ), \quad&(\tau , x)&\in \Sigma = (0,T) \times \partial \Omega , \end{aligned}$$

with \(\Omega = (x_{\textrm{min}}, x_{\textrm{max}})^d\), \(g(x_1, \ldots , x_d) = G(e^{x_1},\ldots ,e^{x_d})\) and

$$\begin{aligned} \mathcal {L}^\mu _xu(\tau ,x;\mu )&= r(\mu ) u(\tau , x; \mu ) - \sum _{i=1}^d \left( r(\mu )-\frac{\sigma _i(\mu )^2}{2}\right) \partial _{x_i} u(\tau , x;\mu ) \\&\quad - \sum _{i,j=1}^d \frac{\rho _{ij}(\mu ) \sigma _i(\mu ) \sigma _j(\mu )}{2} \, \partial _{x_ix_j} u(\tau ,x;\mu ). \end{aligned}$$

For a given function \(u:\mathcal {Q}\times \mathcal {P}\rightarrow \mathbb {R}\) of sufficient smoothness, we define the loss based on the PDE’s residuals:

$$\begin{aligned} \mathcal {J}(u) = \mathcal {J}_\mathrm{{int}}(u) + \mathcal {J}_\mathrm{{ic}}(u). \end{aligned}$$

The interior residual is defined as

$$\begin{aligned} \mathcal {J}_\mathrm{{int}}(u) = |\mathcal {Q}\times \mathcal {P}|^{-1} \int _{\mathcal {P}} \int _{\mathcal {Q}} \left( \partial _\tau u(\tau ,x;\mu ) + \mathcal {L}^\mu _xu(\tau ,x;\mu ) \right) ^2 ~\textrm{d}(\tau ,x) ~\textrm{d}\mu , \end{aligned}$$

with \(|\mathcal {Q}\times \mathcal {P}|\) the size of the domain, and the initial residual as

$$\begin{aligned} \mathcal {J}_\mathrm{{ic}}(u) = |\Omega \times \mathcal {P}|^{-1} \int _{\mathcal {P}} \int _\Omega \left( u(0, x; \mu ) - g(x) \right) ^2 ~\textrm{d}x ~\textrm{d}\mu . \end{aligned}$$

The deep parametric PDE method minimises the residual over the space of deep neural networks of a given size:

$$\begin{aligned} u_{\textrm{DPDE}}= \mathop {\text {arg}\,\text {min}}\limits _{u_{\textrm{DNN}}} \mathcal {J}(u_{\textrm{DNN}}). \end{aligned}$$

The integrals are numerically evaluated by Monte-Carlo quadrature, which yields a similarity to the mean squared error residuals often used in machine learning:

$$\begin{aligned} \mathcal {J}_\mathrm{{int}}(u)&\approx \sum _{i=1}^N \Big ( \partial _\tau u\big (\tau ^{(i)}\! ,x^{(i)};\mu ^{(i)}\big ) + \mathcal {L}^\mu _xu\big (\tau ^{(i)}\!,x^{(i)};\mu ^{(i)}\big ) \Big )^2 /N, \\ \mathcal {J}_\mathrm{{ic}}(u)&\approx \sum _{i=1}^N \Big ( u\big (0, {\hat{x}}^{(i)}; {\hat{\mu }}^{(i)}\big ) - g\big ({\hat{x}}^{(i)}\big ) \Big )^2 /N, \end{aligned}$$

where \((\tau ^{(i)}\!, x^{(i)}\!, \mu ^{(i)}) \in \mathcal {Q}\times \mathcal {P}\) and \(({\hat{x}}^{(i)}\!, {\hat{\mu }}^{(i)})\in \Omega \times \mathcal {P}\) for \(i=1,\ldots , N\) are chosen randomly with a uniform distribution. In our experiment, we choose \(N=10{,}000\) as we observed less accurate approximations with smaller values of \(N\).
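To illustrate how such a sampled loss can be assembled in practice, the following PyTorch sketch evaluates the interior residual by automatic differentiation and combines it with the initial residual. The network interface u_net(tau, x, mu), the way the parameters r, sigma and rho are supplied and all function names are assumptions of this sketch; it does not reproduce the authors' implementation.

```python
import torch

def pde_residual(u_net, tau, x, mu, r, sigma, rho):
    """Pointwise Black-Scholes residual d_tau u + L^mu_x u at sampled points.

    Assumed shapes: tau (N,1), x (N,d), mu (N,p), r (N,1), sigma (N,d),
    rho (N,d,d).  u_net is any smooth (e.g. tanh) network mapping
    (tau, x, mu) -> (N,1); sampled points are assumed to be leaf tensors.
    """
    tau = tau.requires_grad_(True)
    x = x.requires_grad_(True)
    u = u_net(tau, x, mu)                                              # (N,1)
    du_dtau = torch.autograd.grad(u.sum(), tau, create_graph=True)[0]
    du_dx = torch.autograd.grad(u.sum(), x, create_graph=True)[0]      # (N,d)
    # Hessian w.r.t. x: one backward pass per component of the gradient
    rows = [torch.autograd.grad(du_dx[:, i].sum(), x, create_graph=True)[0]
            for i in range(x.shape[1])]
    hess = torch.stack(rows, dim=1)                                    # (N,d,d)

    drift = r - 0.5 * sigma ** 2                                       # (N,d)
    diff = 0.5 * rho * sigma[:, :, None] * sigma[:, None, :]           # (N,d,d)
    Lu = (r * u
          - (drift * du_dx).sum(dim=1, keepdim=True)
          - (diff * hess).sum(dim=(1, 2)).unsqueeze(1))
    return du_dtau + Lu                                                # (N,1)

def loss(u_net, batch_int, batch_ic, g):
    """Monte-Carlo estimate of J = J_int + J_ic on uniformly sampled batches."""
    tau, x, mu, r, sigma, rho = batch_int
    x0, mu0 = batch_ic
    j_int = pde_residual(u_net, tau, x, mu, r, sigma, rho).pow(2).mean()
    j_ic = (u_net(torch.zeros_like(mu0[:, :1]), x0, mu0) - g(x0)).pow(2).mean()
    return j_int + j_ic
```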

In (Glau & Wunderlich, 2022, Theorem 1), it was shown that for sufficiently large networks, the \(L^2\)-error of the deep parametric PDE method can be made arbitrarily small. However, no estimates on the required network size were given. In this article, we investigate this question in more detail and provide new results on the approximation of functions by neural networks with smooth activation functions. In particular, we provide (to the best of our knowledge) the first approximation rates which do not deteriorate in higher dimensions, thus reducing the curse of dimensionality. We then discuss the consequences of these results for the deep parametric PDE method in more detail.

3 Function approximation by neural networks

We consider dense neural networks with \(d\) input nodes, a single output node and the hyperbolic tangent as the activation function. Results for several output nodes can be shown in an analogous way.

In this section, we improve the available approximation rates for a smooth activation function, based on available results for ReLU networks. There are two alternative pathways: starting from a given ReLU approximation, or directly constructing a network with a smooth activation function. For a proof based on a given ReLU approximation, we would first construct an auxiliary ReLU network which can then be approximated by a network with a smooth activation function. This was shown, for example, in Boulle et al. (2020); Telgarsky (2017) using rational activation functions. For the direct construction, we build the network with a smooth activation function without the auxiliary network, but in a similar way to the construction of the ReLU network. This way is technically more challenging, but results in sharper bounds. For this reason, we directly construct the neural network with a smooth activation function. In addition to sharper bounds, this approach also allows us to provide bounds on the magnitude of the parameters.

3.1 Deep neural networks

We consider neural networks with input \(h^0\in \mathbb {R}^n\), output dimension \(m\), width \(w\), depth \(L\) and activation function \(\sigma :\mathbb {R}^w\rightarrow \mathbb {R}^w\). Here \(\sigma \) is the component-wise evaluation of the hyperbolic tangent. The mapping from one layer to the next is an affine transformation combined with the activation function:

$$\begin{aligned} h^l = \sigma ( W^l h^{l-1} + b^l) \in \mathbb {R}^w. \end{aligned}$$

The output is given by an affine transformation of the last hidden layer:

$$\begin{aligned} W^{L+1} h^L + b^{L+1} \in \mathbb {R}^m. \end{aligned}$$

Here the weights are \(W^1 \in \mathbb {R}^{w\times n}\), \(W^l \in \mathbb {R}^{w\times w}\), \(l=2, \ldots , L\), \(W^{L+1} \in \mathbb {R}^{m\times w}\) and the biases are \(b^l\in \mathbb {R}^w\), \(l=1,\ldots , L\) and \(b^{L+1}\in \mathbb {R}^m\). Their values will be set by a numerical optimisation.

We are particularly interested in deep neural networks and their ability to approximate functions. While the exact threshold between deep and shallow networks is rather arbitrary, we consider networks with \(L\ge 4\) as deep neural networks.

We denote by \(\Theta (L, w, n, m)\) the set of all weights corresponding to a neural network of depth \(L\) and width \(w\) mapping \(\mathbb {R}^n\) to \(\mathbb {R}^m\):

$$\begin{aligned} \Theta (L, w, n, m)&= \big \{\left( W^1, \ldots , W^{L+1}, b^1, \ldots, b^{L+1}\right) : \\&\quad W^1 \in \mathbb {R}^{w\times n}, W^l \in \mathbb {R}^{w\times w}, l=2, \ldots , L, W^{L+1} \in \mathbb {R}^{m\times w}, \\&\quad b^l\in \mathbb {R}^w, l=1,\ldots , L, b^{L+1}\in \mathbb {R}^m \big \}. \end{aligned}$$

Note that we will frequently drop the arguments \(n, m\) when they are clear from the context. Typically, we will consider an input dimension \(n=d\) and an output dimension \(m=1\).

Next to the dimensions of the network, the magnitude of the weights and biases is also important. Excessively large parameters can cause numerical instabilities and may lead to slow convergence of an iterative solver. Therefore, we will also show bounds for the maximal parameter value of \(\theta \in \Theta (L, w, n, m)\), defined as

$$\begin{aligned} |\theta |_\infty = \max \left\{ |W^l_{i,j}|, |b^l_i|:i,j,l \right\} . \end{aligned}$$

The evaluation of the neural network is denoted as \(N(\cdot \mid \theta )\):

$$\begin{aligned} N(x\mid \theta ) = W^{L+1} \sigma \bigg (W^L \sigma \Big (\cdots \sigma \big (W^2 \sigma (W^1 x + b^1) + b^2\big ) + \cdots \Big ) + b^{L}\bigg ) + b^{L+1}. \end{aligned}$$
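The following small NumPy sketch mirrors this definition and may serve as a reference for the notation; the function name evaluate_network is ours.

```python
import numpy as np

def evaluate_network(x, weights, biases):
    """Evaluate N(x | theta) for theta = (W^1, ..., W^{L+1}, b^1, ..., b^{L+1}).

    x: input of shape (n,); weights and biases are lists of the L+1 matrices
    and vectors with the shapes given above.  Hidden layers apply tanh
    component-wise, the output layer is purely affine.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)              # h^l = sigma(W^l h^{l-1} + b^l)
    return weights[-1] @ h + biases[-1]     # W^{L+1} h^L + b^{L+1}
```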

Lemma 1

We can manipulate neural networks in the following ways:

  (a)

    For \(N\ge n\), we can trivially embed \(\Theta (L, w, n, m)\) in \(\Theta (L, w, N, m)\). I.e. for an injective index mapping \(\pi :\{1, \ldots , n\}\rightarrow \{1, \ldots , N\}\), we can extend a neural network \(\theta \in \Theta (L, w, n, m)\) to a neural network \({\tilde{\theta }}\in \Theta (L, w, N, m)\), such that

    $$\begin{aligned} N(x\mid \theta ) = N({\tilde{x}}\mid {\tilde{\theta }}), \end{aligned}$$

    for all \(x\in \mathbb {R}^n\), \({\widetilde{x}}\in \mathbb {R}^N\) with \(x_i = {{\tilde{x}}}_{\pi (i)}\), \(i=1,\ldots , n\). In particular

    $$\begin{aligned} \left|{\tilde{\theta }} \right|_\infty = \left|\theta \right|_\infty . \end{aligned}$$
  (b)

    Two neural networks can be evaluated in parallel. There is an operator \(\biguplus :\Theta (L, w_1, n, m_1)\times \Theta (L, w_2, n, m_2)\rightarrow \Theta (L, w_1+w_2, n, m_1 + m_2)\) producing the evaluation of two networks at the same time, i.e. such that for \(\theta _\textrm{u} = \theta _1 \uplus \theta _2\):

    $$\begin{aligned} N(x\mid \theta _\textrm{u}) = \begin{pmatrix} N(x\mid \theta _1) \\ N(x\mid \theta _2) \end{pmatrix} \end{aligned}$$

    for all \(x\in \mathbb {R}^n\). It furthermore holds

    $$\begin{aligned} \left|\theta _\textrm{u}\right|_\infty =\max \{ \left|\theta _1\right|_\infty , \left|\theta _2\right|_\infty \}. \end{aligned}$$
  (c)

    There exists an addition operator \(\bigoplus :\Theta (L, w_1, n, m)\times \Theta (L, w_2, n, m)\rightarrow \Theta (L, w_1+w_2, n, m)\) producing the sum of two networks with the same output size, i.e. such that for \(\theta _+ = \theta _1 \oplus \theta _2\):

    $$\begin{aligned} N(x\mid \theta _+) = N(x\mid \theta _1) + N(x\mid \theta _2) \end{aligned}$$

    for all \(x\in \mathbb {R}^n\). It furthermore holds

    $$\begin{aligned} \left|\theta _+\right|_\infty \le \max \{\left|\theta _1\right|_\infty , \left|\theta _2\right|_\infty , \left|b_1^{L+1} \right|_\infty + \left|b_2^{L+1}\right|_\infty \}, \end{aligned}$$

    where \( b_1^{L+1} \) and \(b_2^{L+1}\) are the final biases of \(\theta _1\) and \(\theta _2\), respectively. The input weights and biases \(W^1, b^1\) of \(\theta _+\) can be bounded more strictly by \(\max \{\left|W_1^1\right|_\infty , \left|b_1^1\right|_\infty , \left|W_2^1\right|_\infty , \left|b_2^1\right|_\infty \}\), where \(W_1^1, b_1^1\) and \(W_2^1, b_2^1\) are the input weights and biases of \(\theta _1\) and \(\theta _2\), respectively.

  (d)

    There exists a composition operator \(\bigodot :\Theta (L_1, w_1, n, m)\times \Theta (L_2, w_2, m, l)\rightarrow \Theta (L_1+L_2,\max \{w_1,w_2\}, n, l)\) producing the composition of two networks, i.e. such that for \(\theta _\odot = \theta _1 \odot \theta _2\):

    $$\begin{aligned} N(x\mid \theta _\odot ) = N( N(x\mid \theta _1) \mid \theta _2) \end{aligned}$$

    for all \(x\in \mathbb {R}^n\). We can estimate the new weights:

    $$\begin{aligned} \left|\theta _\odot \right|_\infty \le \max \left\{ \left|\theta _1\right|_\infty , \left|\theta _2\right|_\infty , \left|W_2^1\right|_\infty \left|W_1^{L_1+1}\right|_\infty , \left|W_2^1\right|_\infty \left|b_1^{L_1+1}\right|_\infty + \left|b_2^1\right|_\infty \right\} , \end{aligned}$$

    where \(W_1^{L_1+1}\) and \(b_1^{L_1+1}\) are the final layer weights and biases of \(\theta _1\) and \(W_2^1, b_2^1\) are the first layer weights and biases of \(\theta _2\). The weights of the first and final layer of \(\theta _\odot \) are equal to the weights of the first layer of \(\theta _1\) and the final layer of \(\theta _2\), respectively:

    $$\begin{aligned} W^{1}= W_1^{1}, \quad b^{1}= b_1^1, \quad W^{L_1+L_2+1} = W_2^{L_2+1}, \quad b^{L_1+L_2+1} = b_2^{L_2+1}. \end{aligned}$$

Proof

The four statements are shown by constructing the respective networks.

Part a Adding zero columns and permuting the columns according to \(\pi \) allows us to construct weights \({{\widetilde{W}}}^1\in \mathbb {R}^{w\times N}\) such that

$$\begin{aligned} W^1 x = {\widetilde{W}}^1 {\widetilde{x}} \end{aligned}$$

for \(x\in \mathbb {R}^n\), \({\widetilde{x}}\in \mathbb {R}^N\) with \(x_i = {{\tilde{x}}}_{\pi (i)}\), \(i=1,\ldots , n\). Leaving all remaining weights and biases unchanged yields the desired network.

Part b

Let \(\theta _1\in \Theta (L, w_1, n, m_1)\) with weights \(W_1^l\) and biases \(b_1^l\), as well as \(\theta _2\in \Theta (L, w_2, n, m_2)\) with weights \(W_2^l\) and biases \(b_2^l\) be given. We construct the new weights block-wise as

$$\begin{aligned} W^1&= \begin{pmatrix} W_1^1\\ W_2^1 \end{pmatrix} \in \mathbb {R}^{(w_1+w_2)\times n},\\ b^l&= \begin{pmatrix} b_1^l\\ b_2^l \end{pmatrix}\in \mathbb {R}^{w_1+w_2} \text { for } l=1,\ldots ,L,\\ W^l&= \begin{pmatrix} W_1^l &{} 0 \\ 0 &{} W_2^l \end{pmatrix} \in \mathbb {R}^{(w_1+w_2)\times (w_1+w_2)} \text { for } l=2,\ldots ,L,\\ b^{L+1}&= \begin{pmatrix} b_1^{L+1}\\ b_2^{L+1} \end{pmatrix}\in \mathbb {R}^{m_1+m_2},\\ W^{L+1}&= \begin{pmatrix} W_1^{L+1} &{} 0 \\ 0 &{} W_2^{L+1} \end{pmatrix} \in \mathbb {R}^{(m_1+m_2)\times (w_1+w_2)}. \end{aligned}$$

Clearly the absolute value of the new weights and biases are bounded by the maximal value in the networks \(\theta _1\) and \(\theta _2\).

For this new network we can show that the first hidden layer indeed combines the values of both networks as the activation function is applied element-wise and

$$\begin{aligned} W^1 h^0 + b^1&= \begin{pmatrix} W_1^1 h^0 + b_1^1 \\ W_2^1 h^0 + b_2^1 \end{pmatrix}. \end{aligned}$$

Inductively the same argument shows the final result.

Part c

We follow the construction from part (b), but change the final layer to obtain the sum of the output values:

$$\begin{aligned} W^{L+1}&= \begin{pmatrix} W_1^{L+1}&W_2^{L+1} \end{pmatrix} \in \mathbb {R}^{m \times (w_1 + w_2)}, \\ b^{L+1}&= b_1^{L+1} + b_2^{L+1} \in \mathbb {R}^m. \end{aligned}$$

Clearly the newly constructed weights can be bounded by the maximum of the original network weights and the sum of the output biases.

We consider the final hidden layers of both networks \(h_1^L\in \mathbb {R}^{w_1}\) and \(h_2^L\in \mathbb {R}^{w_2}\) and for the newly constructed network:

$$\begin{aligned} h^L = \begin{pmatrix} h_1^L\\ h_2^L \end{pmatrix} \in \mathbb {R}^{w_1+w_2}. \end{aligned}$$

Then the output of the network is given as

$$\begin{aligned} W^{L+1} h^L + b^{L+1}&= W_1^{L+1} h_1^L + b_1^{L+1} + W_2^{L+1}h_2^L + b_2^{L+1}\\&=N(x\mid \theta _1) + N(x\mid \theta _2). \end{aligned}$$

Part d We consider the case where \(w_1 = w_2 = w\), as the networks can trivially be extended similar to the construction in (a). The first \(L_1\) and last \(L_2\) layers of weights can be set equal to those of the networks \(\theta _1\) and \(\theta _2\), respectively:

$$\begin{aligned} W^l&= W_1^l\text { for } l=1,\ldots , L_1,\\ b^l&= b_1^l\text { for } l=1,\ldots , L_1,\\ W^{L_1+l}&= W_2^l \text { for } l=2,\ldots , L_2+1,\\ b^{L_1+l}&= b_2^l \text { for } l=2,\ldots , L_2+1, \end{aligned}$$

and it remains to construct the interface of both networks \(W^{L_1+1}, b^{L_1+1}\):

$$\begin{aligned} W^{L_1+1}&= W_2^1 W_1^{L_1+1} \in \mathbb {R}^{w\times w} \\ b^{L_1+1}&=W_2^1 b_1^{L_1+1} + b_2^1 \in \mathbb {R}^w. \end{aligned}$$

Note that as \(W_1^{L_1+1} \in \mathbb {R}^{m\times w}\), \( b_1^{L_1+1} \in \mathbb {R}^m\) and \( W_2^1 \in \mathbb {R}^{w\times m}\), the products are well-defined. The weights are clearly bounded by the maximum of the weights of \(\theta _1\) and \(\theta _2\) and the products at the interface.

The hidden layer \(h^{L_1}\) coincides with the final hidden layer of \(\theta _1\), thus

$$\begin{aligned} N(x\mid \theta _1) = W_1^{L_1+1} h^{L_1} + b_1^{L_1+1}. \end{aligned}$$

Having \(N(x\mid \theta _1)\) as the input to the network \(\theta _2\) yields the first hidden layer as

$$\begin{aligned} h_2^1&= \sigma \left( W_2^1N(x\mid \theta _1) + b_2^1\right) \\&=\sigma \left( W_2^1(W_1^{L_1+1} h^{L_1} + b_1^{L_1+1}) + b_2^1\right) \\&= \sigma \left( W^{L_1+1} h^{L_1} + b^{L_1+1} \right) = h^{L_1+1}. \end{aligned}$$

This means that the value of the hidden layer \(L_1+1\) equals the first hidden layer of \(\theta _2\) with the input \(N(x\mid \theta _1)\). As the remaining weights are equal, the output of \(\theta _\odot \) equals the composition of the two networks \(\theta _1\) and \(\theta _2\). \(\square \)
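The block-matrix constructions above translate directly into code. As an illustration (not part of the proof), the following NumPy sketch implements the addition operator of Lemma 1(c) and checks it on random networks; all function names are ours.

```python
import numpy as np

def net_eval(x, Ws, bs):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.tanh(W @ h + b)
    return Ws[-1] @ h + bs[-1]

def net_sum(theta1, theta2):
    """Lemma 1(c): stacked input weights, block-diagonal hidden weights,
    concatenated output weights and summed output biases."""
    (W1, b1), (W2, b2) = theta1, theta2
    L = len(W1) - 1                                   # number of hidden layers
    Ws = [np.vstack([W1[0], W2[0]])]                  # stacked input weights
    bs = [np.concatenate([b1[0], b2[0]])]
    for l in range(1, L):                             # block-diagonal hidden layers
        Ws.append(np.block([
            [W1[l], np.zeros((W1[l].shape[0], W2[l].shape[1]))],
            [np.zeros((W2[l].shape[0], W1[l].shape[1])), W2[l]],
        ]))
        bs.append(np.concatenate([b1[l], b2[l]]))
    Ws.append(np.hstack([W1[L], W2[L]]))              # concatenated output weights
    bs.append(b1[L] + b2[L])                          # summed output biases
    return Ws, bs

# numerical check on two random networks of equal depth L = 2
rng = np.random.default_rng(0)
def random_net(n, w, L, m):
    Ws = ([rng.normal(size=(w, n))] + [rng.normal(size=(w, w)) for _ in range(L - 1)]
          + [rng.normal(size=(m, w))])
    bs = [rng.normal(size=w) for _ in range(L)] + [rng.normal(size=m)]
    return Ws, bs

theta1, theta2 = random_net(3, 4, 2, 1), random_net(3, 5, 2, 1)
x = rng.normal(size=3)
assert np.allclose(net_eval(x, *net_sum(theta1, theta2)),
                   net_eval(x, *theta1) + net_eval(x, *theta2))
```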

3.2 Sparse grid theory

We show approximation results for neural networks in two steps: first, we approximate a function by a sparse grid function; then, the sparse grid function is approximated by a neural network. Therefore, we briefly recall basic results from sparse grid theory. For further information, we refer the reader to Bungartz and Griebel (2004); Garcke and Griebel (2013).

We consider a tensor-product space of \(W^{2,\infty }([0,1])\) functions:

$$\begin{aligned} X_0^{\infty , 2}&= \Big \{f\in L^\infty \left( [0,1]^d\right) :D_\alpha f \in L^\infty \left( [0,1]^d\right) \text { for } \alpha \in \mathbb {N}_{\ge 0}^d, ~|\alpha |_\infty \le 2, \\&~ f |_{\partial [0,1]^d} = 0 \Big \} \end{aligned}$$

with the semi-norm

$$\begin{aligned} |f |_{2, \infty } = \Vert D_{(2,\ldots ,2)} f\Vert _{L^\infty ([0,1]^d)}. \end{aligned}$$

Here \(D_\alpha \) refers to the derivative with respect to a multi-index \(\alpha \in \mathbb {N}_{\ge 0}^d\), i.e. for each \(i=1,\ldots ,d\) the derivative with respect to the variable \(x_i\) is taken \(\alpha _i\)-times.

Sparse grids are based on a tensor grid with hierarchical refinements in each direction. In contrast to a full grid, only a reduced number of combinations are considered. For a multi-index \({\textbf{l}} = (l_1, \ldots , l_d) \in \mathbb {N}^d_{>0}\), we consider the grid created by the points \({\textbf{x}}_{{\textbf{l}}, {\textbf{i}}} = (i_1 h_{l_1}, \ldots , i_d h_{l_d})\in [0,1]^d\) for \({\textbf{i}}\in \mathcal {I}_{{\textbf{l}}}\), where \(h_{l} = 2^{-l}\) and

$$\begin{aligned} \mathcal {I}_{{\textbf{l}}} =\left\{ {\textbf{i}} = (i_1, \ldots , i_d):i_j = 2 k_j-1, k_j = 1,\ldots , 2^{l_j-1} \right\} . \end{aligned}$$

The corresponding basis functions \(\Psi _{{\textbf{l}}, {\textbf{i}}}({\textbf{x}}) = \prod _{j=1}^d \psi _{l_j, i_j}(x_j)\) are created as a tensor product of the univariate basis functions \(\psi _{l, i}(x) = (h_l - |x - x_{l, i}|)_+\) with \(x_{l, i} = i h_l\). We also define the multivariate mesh-size \(h_{{\textbf{l}}} = 2^{-|{\textbf{l}}|_1}\) and note that \(|\mathcal {I}_{{\textbf{l}}} |= 2^{|{\textbf{l}}|_1 - d}\).

On refinement level \(n\), we consider the basis functions of all levels \({\textbf{l}}\) with \(|{\textbf{l}}|_1\le n+d-1\):

$$\begin{aligned} \left\{ \Psi _{{\textbf{l}}, {\textbf{i}}}:~{\textbf{i}} \in \mathcal {I}_{{\textbf{l}}},~ |{\textbf{l}}|_1\le n + d-1\right\} . \end{aligned}$$

Given coefficients \(v_{{\textbf{l}}, {\textbf{i}}}\), this yields a sparse grid function

$$\begin{aligned} f_n = \sum _{|{\textbf{l}}|_1\le n+d-1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} v_{{\textbf{l}}, {\textbf{i}}} \Psi _{{\textbf{l}}, {\textbf{i}}}. \end{aligned}$$
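For concreteness, the following Python snippet enumerates the sparse grid points of refinement level n in dimension d exactly as defined above; the function name is ours and the snippet only illustrates the index sets.

```python
from itertools import product

def sparse_grid_points(n, d):
    """Enumerate the points x_{l,i} of all levels l with |l|_1 <= n + d - 1."""
    points = []
    for l in product(range(1, n + d), repeat=d):            # levels l_j >= 1
        if sum(l) > n + d - 1:
            continue
        index_ranges = [range(1, 2 ** lj, 2) for lj in l]    # odd indices i_j
        for i in product(*index_ranges):
            points.append(tuple(ij * 2.0 ** -lj for ij, lj in zip(i, l)))
    return points

# example: d = 2 and n = 3 give 17 points, matching |I_l| = 2^{|l|_1 - d}
print(len(sparse_grid_points(3, 2)))
```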

Note that often a different scaling of the basis functions is used:

$$\begin{aligned} f_n = \sum _{|{\textbf{l}}|_1\le n+d-1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} {\widetilde{v}}_{{\textbf{l}}, {\textbf{i}}} \varphi _{{\textbf{l}}, {\textbf{i}}}, \end{aligned}$$

where \(\varphi _{{\textbf{l}}, {\textbf{i}}} = \Psi _{{\textbf{l}}, {\textbf{i}}}/h_{{\textbf{l}}}\) and \(\widetilde{v}_{{\textbf{l}},{\textbf{i}}} = h_{{\textbf{l}}}\, v_{{\textbf{l}}, {\textbf{i}}}\). This is the scaling used, for example, in Bungartz and Griebel (2004); Garcke and Griebel (2013).

For \(f\in X_0^{\infty , 2}\) with \(|f|_{2,\infty }=1\), Bungartz and Griebel (2004) construct coefficients \(v_{{\textbf{l}}, {\textbf{i}}}\), such that

$$\begin{aligned} |f(x) - f_n(x)|\le \varepsilon , \end{aligned}$$

for a number of degrees of freedom \(M=\mathcal {O}\left( \varepsilon ^{-1/2}|\log _2\varepsilon |^{3/2(d-1)}\right) \), see (Bungartz & Griebel, 2004, Lemma 3.13). In addition, (Bungartz & Griebel, 2004, Lemma 3.3) provides a bound on the coefficients:

$$\begin{aligned} |v_{{\textbf{l}},{\textbf{i}}}|= h_{{\textbf{l}}}^{-1} |\widetilde{v}_{{\textbf{l}},{\textbf{i}}}|\le h_{{\textbf{l}}}^{-1} 2^{-d} 2^{-2|{\textbf{l}}|_1} = 2^{-d-|{\textbf{l}}|_1}, \end{aligned}$$
(1)

where we used \(h_{{\textbf{l}}} = 2^{-|{\textbf{l}}|_1}\).

3.3 Improved approximation result using sparse grids

We show approximation results for neural networks in two stages. First, we show that we can approximate each basis function \(\Psi _{{\textbf{l}}, {\textbf{i}}}({\textbf{x}})\) by a neural network to the desired accuracy. Then, we replace the basis functions in the sparse grid formulation by the constructed neural networks. We show that this network approximates the function up to the accuracy \(\varepsilon \) and that the size of the network grows at a rate that does not depend on the dimension. This reduces the curse of dimensionality as the influence of the dimension is confined to the constant of the estimate.

To approximate the basis functions, we first need to approximate some elementary functions: the identity, a multiplication and the absolute value. Recall that throughout this article, we consider the hyperbolic tangent as the activation function.

Lemma 2

For \(a\ge 1\), \(1>\delta >0\), we can construct the following neural networks with \(\sigma = \tanh \). The generic constant \(c\) is independent of \(a\) and \(\delta \).

  (a)

    There is \(\theta _{\textrm{id}, a}\in \Theta (1,1)\) with \(|\theta _\mathrm{{id}, a}|_\infty \le \delta ^{-1/2} a^{3/2} \), such that

    $$\begin{aligned} \sup _{x\in [-a,a]} |x - N(x\mid \theta _{\textrm{id}, a})|\le \delta /3. \end{aligned}$$
  (b)

    There is \(\theta _{\times , a}\in \Theta (1, 4)\) with \(|\theta _{\times , a}|_\infty \le \frac{9\sqrt{3}}{4} a^6 \delta ^{-2}\), such that

    $$\begin{aligned} \sup _{(x_1,x_2)\in [-a,a]^2} |x_1x_2 - N((x_1, x_2) \mid \theta _{\times , a})|\le \delta . \end{aligned}$$
  (c)

    There is \(\theta _\mathrm{{abs}, a}\in \Theta (2, 4)\) with \(|\theta _\mathrm{{abs}, a}|_\infty \le ca^3\delta ^{-2} \), such that

    $$\begin{aligned} \sup _{x\in [-a,a]} \big ||x|- N(x\mid \theta _\mathrm{{abs}, a})\big |\le \delta . \end{aligned}$$

Proof

The proof is partially based on Ohn and Kim (2019), but shows improved results and overcomes some technical difficulties. In the following, let \(1>\delta >0\) and \(a\ge 1\).

Part a To construct the identity network with a single node, we consider for \(t>0\)

$$\begin{aligned} N(x\mid \theta ) = \frac{1}{t}\sigma (t x). \end{aligned}$$

We use the Taylor expansion with Lagrange remainder to represent \(\sigma (tx)\) as

$$\begin{aligned} \sigma (tx) = \sigma (0) + \sigma '(0) tx + \frac{1}{2}\sigma ''(0)(tx)^2+ \frac{1}{6} \sigma '''(\xi ) (tx)^3 = tx + \frac{1}{6}\sigma '''(\xi )(tx)^3, \end{aligned}$$

for some \(\xi \in \mathbb {R}\).

Here we used \(\sigma (0) = 0\), \(\sigma '(0)=1\) and \(\sigma ''(0) = 0\). With \(|\sigma '''(\xi )|\le 2\) for all \(\xi \in \mathbb {R}\), we have for \(|x|\le a\)

$$\begin{aligned} |N(x\mid \theta ) - x|&= \left|\frac{1}{t} \left( tx + \frac{1}{6}\sigma '''(\xi )(tx)^3 \right) - x \right|\\&= \left|\frac{\sigma '''(\xi )t^2x^3}{6}\right|\le \frac{a^3}{3} t^2. \end{aligned}$$

We choose \(t_a= \delta ^{1/2} a^{-3/2}\) and have \(\theta _{\textrm{id}, a}\in \Theta (1,1)\) with

$$\begin{aligned} \sup _{x\in [-a, a]} |N(x|\theta _{\textrm{id}, a}) - x|\le \delta /3. \end{aligned}$$

Using \(t_a<1\), we also have

$$\begin{aligned} |\theta _{\textrm{id}, a}|_\infty = \max \left\{ t_a, t_a^{-1} \right\} = \delta ^{-1/2}a^{3/2}. \end{aligned}$$

Part b.1

Using the polarisation identity, we can evaluate a product based on squares: \(xy = (\frac{x+y}{2})^2-(\frac{x-y}{2})^2\). The square function can be approximated in a similar way to the identity function.

For \(t>0\) and \(x_0 = \log (2+\sqrt{3})/2 \approx 0.66\), we consider the network \(\theta \) as

$$\begin{aligned} N(x|\theta ) = \frac{1}{t^2\sigma ''(x_0)} \sigma (x_0 + tx) + \frac{1}{t^2\sigma ''(x_0)} \sigma (x_0-tx) - \frac{2\sigma (x_0)}{t^2\sigma ''(x_0)}. \end{aligned}$$

As before, we can use a Taylor expansion at point \(x_0\) to estimate (with \(\xi , \xi ' \in \mathbb {R}\))

$$\begin{aligned} |N(x|\theta ) - x^2|&= \left|\frac{1}{6\sigma ''(x_0)}\sigma '''(\xi )tx^3 - \frac{1}{6\sigma ''(x_0)}\sigma '''(\xi ')tx^3 \right|\le \frac{\Vert \sigma '''\Vert _{\infty }}{3|\sigma ''(x_0)|} a^3 t. \end{aligned}$$

We choose \(t_a=\frac{3}{2} \delta a^{-3}\frac{|\sigma ''(x_0)|}{\Vert \sigma '''\Vert _{\infty }} = \frac{\sqrt{3}}{3} \delta a^{-3}\) and have \(\theta _{2,a}\in \Theta (1,2)\) with

$$\begin{aligned} |N(x\mid \theta _{2,a}) - x^2|\le \delta /2. \end{aligned}$$

We have used that \(|\sigma ''(x_0)|= 4\sqrt{3}/9\) and \(\Vert \sigma '''\Vert _{\infty }=2\).

Part b.2

Duplicating \(\theta _{2,a}\) for the inputs \(\frac{x_1+ x_2}{2}\) and \(\frac{x_1-x_2}{2}\) yields, according to Lemma 1(c), a neural network \(\theta _{\times , a}\in \Theta (1,4)\) as

$$\begin{aligned}&N((x_1,x_2)\mid \theta _{\times , a}) = N\left( \frac{x_1+x_2}{2}\mid \theta _{2,a}\right) - N\left( \frac{x_1-x_2}{2}\mid \theta _{2, a}\right) \nonumber \\&\quad = \frac{1}{t_a^2\sigma ''(x_0)} \sigma (x_0 + t_a/2 x_1 + t_a/2 x_2) + \frac{1}{t_a^2\sigma ''(x_0)} \sigma (x_0-t_a/2 x_1 - t_a/2 x_2) \nonumber \\&\quad - \frac{1}{t_a^2\sigma ''(x_0)} \sigma (x_0 + t_a/2 x_1 - t_a/2 x_2) - \frac{1}{t_a^2\sigma ''(x_0)} \sigma (x_0 - t_a/2 x_1 + t_a/2 x_2 ). \end{aligned}$$
(2)

We have

$$\begin{aligned} |\theta _{\times , a}|_\infty = \max \left\{ \frac{1}{t_a^2 |\sigma ''(x_0) |}, x_0, t_a/2 \right\} = \frac{9\sqrt{3}}{4}\, \delta ^{-2}a^6 \approx 3.90\,\delta ^{-2}a^6. \end{aligned}$$

For \((x_1,x_2)\in [-a, a]^2\), we note that \(\frac{x_1+x_2}{2}, \frac{x_1-x_2}{2}\in [-a, a]\) and have

$$\begin{aligned}&|N((x_1,x_2)\mid \theta _{\times , a}) - x_1x_2|\\&\quad \le \left|N\left( \frac{x_1+x_2}{2}\mid \theta _{2,a}\right) - \left( \frac{x_1+x_2}{2}\right) ^2\right|+ \left|N\left( \frac{x_1-x_2}{2}\mid \theta _{2, a}\right) - \left( \frac{x_1-x_2}{2}\right) ^2\right|\le \delta , \end{aligned}$$

which concludes the proof of part b.

Part c We first show the statements for \(a=1\), where we omit the index a. Then a scaling argument yields the results.

We use the representation \(|x |= {\text {sign}}(x) \cdot x\). First, we approximate the sign-function and then multiply it by the identity to obtain the absolute value.

Set \(\theta _{\textrm{sign}}\in \Theta (1,1)\) as \(N(x|\theta _{\textrm{sign}}) = \sigma (\alpha x)\) with \(\alpha = \delta ^{-1} \log (2/\delta -1)/2\) for \(0<\delta <1\). Taking into account point symmetry and monotonicity of the hyperbolic tangent, from \(N(\delta \mid \theta _\textrm{sign}) = \tanh (\log (2/\delta -1)/2) = 1-\delta \) it follows that

$$\begin{aligned} \sup _{x\not \in (-\delta ,\delta )} |N(x|\theta _{\textrm{sign}}) - {\text {sign}}(x) |\le \delta . \end{aligned}$$

Now we combine \(\theta _\textrm{sign}\in \Theta (1,1)\), \(\theta _\textrm{id}\in \Theta (1,1)\) and \(\theta _{\times , 1+\delta /3}\in \Theta (1,4)\) to \(\theta _{\textrm{abs}}\in \Theta (2,4)\) as

$$\begin{aligned} N(x|\theta _{\textrm{abs}})&= N( (N(x|\theta _\textrm{id}), N(x|\theta _{\textrm{sign}}) )\mid \theta _{\times , 1+\delta /3})\\&=\sum _{s_1,s_2\in \{-1, 1\}} \frac{s_1s_2}{t_{1+\delta /3}^2\sigma ''(x_0)} \sigma \left( x_0 + s_1 \frac{t_{1+\delta /3}}{2\delta ^{1/2}} \sigma (\delta ^{1/2} x) +s_2 \frac{t_{1+\delta /3}}{2} \sigma (\alpha x) \right) , \end{aligned}$$

with \(t_{1+\delta /3} = \frac{\sqrt{3} \delta }{3(1+\delta /3)^3}\).

For \(x\in [-1,1]\) we have

$$\begin{aligned} \big |N(x\mid \theta _{\textrm{abs}}) - |x |\big |&\le |N( (N(x\mid \theta _\textrm{id}), N(x\mid \theta _{\textrm{sign}}) )\mid \theta _{\times , 1+\delta /3}) - N(x\mid \theta _\textrm{id})\cdot N(x\mid \theta _{\textrm{sign}}) |\\&\quad + |N(x\mid \theta _\textrm{id}) \cdot N(x\mid \theta _{\textrm{sign}}) - {\text {sign}}(x) \cdot x|. \end{aligned}$$

The first term can directly be estimated by \(\delta \) as \(N(x|\theta _\textrm{id}) \in [-1-\delta /3, 1+\delta /3]\) and \(N(x\mid \theta _{\textrm{sign}})\in (-1,1)\).

Estimating the second term, we need to consider the cases \(x\in [-\delta ,\delta ]\) and \(x\not \in [-\delta ,\delta ]\) separately. For \(\delta <|x |\le 1\) we have

$$\begin{aligned} |N(x\mid \theta _\textrm{id}) N(x\mid \theta _{\textrm{sign}}) - {\text {sign}}(x) \cdot x |&\le |(N(x|\theta _\textrm{id}) -x)(N(x|\theta _{\textrm{sign}}) -{\text {sign}}(x)) |\\&\quad +|x(N(x|\theta _{\textrm{sign}}) -{\text {sign}}(x)) |\\&\quad + |{\text {sign}}(x)(N(x\mid \theta _\textrm{id}) -x) |\le 4/3\delta + \delta ^2/3 < 2\delta . \end{aligned}$$

For \(x\in [-\delta , \delta ]\), we get with \(|N(x|\theta _{\textrm{sign}})|< 1\):

$$\begin{aligned} \big |N(x\mid \theta _\textrm{id}) \cdot N(x\mid \theta _{\textrm{sign}}) - |x |\big |&\le |N(x\mid \theta _\textrm{id}) |\cdot |N(x|\theta _{\textrm{sign}})|+ |x |\le 7/3\delta < 3 \delta . \end{aligned}$$

In conclusion, we have for all \(x\in [-1,1]\):

$$\begin{aligned} \big |N(x\mid \theta _\textrm{id}) N(x\mid \theta _{\textrm{sign}}) -|x |\big |\le 3 \delta , \end{aligned}$$

with

$$\begin{aligned} |\theta _\textrm{abs}|_\infty \le \max \left\{ \frac{9\sqrt{3}}{4} \delta ^{-2} (1+\delta /3)^6, \delta ^{-1} \log (2/\delta -1)/2 \right\} \le c\, \delta ^{-2}. \end{aligned}$$

With a scaling argument, we can extend the result to include \(x\in [-a, a]\):

$$\begin{aligned} N(x|\theta _{\textrm{abs},a}) = a N(a^{-1}x\mid \theta _{\textrm{abs}}) \end{aligned}$$

satisfies

$$\begin{aligned} \sup _{x\in [-a,a]} \big |N(x\mid \theta _{\textrm{abs},a}) - |x|\big |\le 4\delta a \quad \text { and } \quad |\theta _{\textrm{abs},a}|_\infty \le ca \delta ^{-2}. \end{aligned}$$

Replacing \(\delta \) by \(\delta /(4a)\) in the construction yields the bounds stated in the lemma.

\(\square \)
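To make the construction of part (b) tangible, the following NumPy snippet evaluates the multiplication network of equation (2) with the constants from the proof and checks the error bound numerically; it is an illustration only and not part of the argument, and the function name is ours.

```python
import numpy as np

def mult_net(x1, x2, a, delta):
    """Evaluate the multiplication network theta_{x,a} of equation (2)."""
    x0 = np.log(2.0 + np.sqrt(3.0)) / 2.0          # maximiser of |sigma''| for tanh
    sigma2 = -4.0 * np.sqrt(3.0) / 9.0             # sigma''(x0)
    t = np.sqrt(3.0) / 3.0 * delta * a ** -3.0     # t_a from the proof
    c = 1.0 / (t ** 2 * sigma2)
    return c * (np.tanh(x0 + t / 2 * (x1 + x2)) + np.tanh(x0 - t / 2 * (x1 + x2))
                - np.tanh(x0 + t / 2 * (x1 - x2)) - np.tanh(x0 - t / 2 * (x1 - x2)))

# the maximal error on [-a, a]^2 stays below the prescribed delta
a, delta = 2.0, 1e-2
xs = np.linspace(-a, a, 201)
X1, X2 = np.meshgrid(xs, xs)
print(np.max(np.abs(mult_net(X1, X2, a, delta) - X1 * X2)))   # below 1e-2
```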

These three basic approximations are sufficient to construct a simple univariate hat function in the following lemma. Using the multiplication operator, we will later extend the result to multivariate basis functions.

Lemma 3

For \(1>\delta >0\), \(\sigma = \tanh \), \(x_0 \in [0,1]\) and \(1>h>0\), there is \(\theta _{h, x_0}\in \Theta (4,5)\), with \(c\) independent of \(\delta , x_0\) and \(h\), such that

$$\begin{aligned} \sup _{x\in [0,1]} |\psi _{h, x_0}(x)-N(x\mid \theta _{h, x_0})|\le \delta , \end{aligned}$$

with \(|\theta _{h, x_0}|_\infty \le c\delta ^{-2}\), where \(\psi _{h, x_0}(x) = (h - |x-x_0|)_+\).

Proof

We first approximate the rectified linear unit \((x)_+ = (x + |x|)/2\) up to an accuracy of \(\delta \) by a neural network \(\theta _{\textrm{relu}}\):

$$\begin{aligned} N(x\mid \theta _{\textrm{relu}}) = 0.5 \left( N(N(x\mid \theta _{\textrm{id}})\mid \theta _{\textrm{id},1+\delta /3}) + N(x\mid \theta _{\textrm{abs}}) \right) . \end{aligned}$$

To add the networks approximating x and \(|x|\) using Lemma 1(c), both networks need to have the same depth. With \(\theta _{\textrm{id}}\in \Theta (1,1)\) and \(\theta _{\textrm{abs}}\in \Theta (2,4)\), we compose the identity network with another identity network using Lemma 1(d). With Lemmas 1 and 2, we then have \(\theta _{\textrm{relu}} \in \Theta (2,5)\) and \(|\theta _{\textrm{relu}}|_\infty \le c \delta ^{-2}\).

The error is given by

$$\begin{aligned} |N(x\mid \theta _{\textrm{relu}}) - (x)_+|\le |N(N(x\mid \theta _{\textrm{id}})\mid \theta _{\textrm{id},1+\delta /3}) - x|/2 + |N(x\mid \theta _{\textrm{abs}}) - |x||/2. \end{aligned}$$

We start with the first term and show using Lemma 2 (a) for any \(x\in [-1,1]\)

$$\begin{aligned} |N(N(x\mid \theta _{\textrm{id}})\mid \theta _{\textrm{id},1+\delta /3}) - x|&\le |N(x\mid \theta _{\textrm{id}}) - x |\\&\quad + |N(N(x\mid \theta _{\textrm{id}})\mid \theta _{\textrm{id},1+\delta /3}) - N(x\mid \theta _{\textrm{id}}) |\\&\le \delta /3 + \delta /3 < \delta , \end{aligned}$$

as \(N(x\mid \theta _{\textrm{id}}) \in [-1-\delta /3, 1+\delta /3]\). The second term is directly bounded by \(\delta /2\) using Lemma 2(c), which yields

$$\begin{aligned} |N(x\mid \theta _{\textrm{relu}}) - (x)_+|\le \delta . \end{aligned}$$

With this approximated rectified linear unit, we can finally approximate univariate basis functions

$$\begin{aligned} \psi _{h, x_0}(x) = \left( h - \left|x - x_{0} \right|\right) _+ = (1+\delta ) \left( \frac{h - \left|x - x_{0} \right|}{1+\delta }\right) _+ \end{aligned}$$

by \(\theta _{h, x_0}\) as

$$\begin{aligned} N(x\mid \theta _{h, x_0})&= (1+\delta ) N\left( \frac{h - N(x - x_0\mid \theta _{\textrm{abs}})}{1+\delta }\mid \theta _{\textrm{relu}}\right) . \end{aligned}$$

The construction is a composition of the network \(\theta _\textrm{abs} \in \Theta (2,4)\) with \(\theta _\textrm{relu} \in \Theta (2,5)\), which yields \(\theta _{h, x_0}\in \Theta (4, 5)\) by Lemma 1(d).

For \(x\in [0,1]\), we have \(x - x_0\in [-1,1]\) and \(h - N(x - x_0\mid \theta _{\textrm{abs}}) \in [-1-\delta ,1+\delta ]\). Then we have

$$\begin{aligned}&|\psi _{h, x_0}(x) - N(x\mid \theta _{h, x_0})|\\&\quad \le \big |\left( h - |x - x_0 |\right) _+ - \left( h - N(x - x_0\mid \theta _\textrm{abs}) \right) _+ \big |\\&\qquad + (1+\delta ) \left|\left( \frac{ h - N(x - x_0\mid \theta _\textrm{abs} )}{1+\delta }\right) _+ - N\left( \frac{h - N(x - x_0\mid \theta _\textrm{abs})}{1+\delta }\mid \theta _{\textrm{relu}}\right) \right|. \end{aligned}$$

Both terms can be estimated easily as \(|(x)_+-(y)_+|\le |x-y|\):

$$\begin{aligned} \big |\left( h - \left|x - x_0 \right|\right) _+ - \left( h - N(x - x_0|\theta _\textrm{abs}) \right) _+ \big |\le \big ||x - x_0 |- N(x - x_0\mid \theta _\textrm{abs})\big |\le \delta , \end{aligned}$$

using Lemma 2(c) and

$$\begin{aligned} (1+\delta ) \left|\left( \frac{ h - N(x - x_0\mid \theta _\textrm{abs} )}{1+\delta }\right) _+ - N\left( \frac{h - N(x - x_0\mid \theta _{\textrm{abs}})}{1+\delta }\mid \theta _{\textrm{relu}}\right) \right|\le 2 \delta . \end{aligned}$$

All coefficients still fulfil \(\left|\theta _{h, x_0}\right|_\infty \le c \delta ^{-2}\). The two error terms add up to \(3\delta \), so replacing \(\delta \) by \(\delta /3\) in the construction (which only changes the constant \(c\)) yields the claimed accuracy and finishes the proof. \(\square \)

Successive applications of the multiplication operator yield an approximation of the multivariate basis functions. As in the previous lemmas, we are interested in the width, depth and parameter norm required for a specific accuracy.

Lemma 4

For \(d\in \mathbb {N}_{>0}\), \({\textbf{h}} = (h_1,\ldots ,h_d) \in [0,1]^d\) and \({\textbf{x}}^0 = (x^0_1, \ldots , x^0_d) \in [0,1]^d\), let

$$\begin{aligned} \Psi _{{\textbf{h}}, {\textbf{x}}^0}({\textbf{x}}) = \prod _{i=1}^d \psi _{h_i, x^0_i}(x_i). \end{aligned}$$

For \(1>\varepsilon >0\) and \(\sigma = \tanh \), there exists a neural network \(\theta _{{\textbf{h}}, {\textbf{x}}^0} \in \Theta (4+\lceil \log _2 d\rceil , 5d)\) such that

$$\begin{aligned} \sup _{{\textbf{x}}\in [0,1]^d} \big |\Psi _{{\textbf{h}}, {\textbf{x}}^0}({\textbf{x}}) - N({\textbf{x}}\mid \theta _{{\textbf{h}}, {\textbf{x}}^0})\big |\le \varepsilon , \end{aligned}$$

and

$$\begin{aligned} |\theta _{{\textbf{h}}, {\textbf{x}}^0} |_\infty \le c d^{2\log _2 3} \varepsilon ^{-2} \le c d^{4} \varepsilon ^{-2}. \end{aligned}$$

Proof

The statement is similar to (Rolnick & Tegmark, 2018, Proposition 4.6) and (Schwab & Zech, 2019, Proposition 3.7) as in all cases a d-dimensional product is approximated. We refine these proofs and in particular include a bound on the network’s weights and biases.

We start with the neural network approximations of the univariate basis functions \(\psi _{h_i, x^0_i}\) by a neural network \(\theta _i^{\textrm{hat}}\in \Theta (4,5)\) up to accuracy \(\delta = 2/9 d^{-\log _2 3}\varepsilon <1\), see Lemma 3. Using Lemma 1(b), we can create a network \(\theta ^0\in \Theta (4, 5d, d, d)\), which evaluates these networks in parallel:

$$\begin{aligned} N\left( {\textbf{x}}\mid \theta ^0\right) = \big (N(x_1\mid \theta _1^{\textrm{hat}}), N(x_2\mid \theta _2^{\textrm{hat}}), \ldots , N(x_d\mid \theta _d^{\textrm{hat}})\big ) \in \mathbb {R}^{d} \end{aligned}$$
Fig. 1: Illustration of the multiplication structure, displayed in the special case where \(\log _2 d\in \mathbb {N}\)

Formally we extend the d terms of the output by constant ones until we reach \(2^{\lceil \log _2 d\rceil }\) terms:

$$\begin{aligned} \big (N(x_1\mid \theta _1^{\textrm{hat}}), N(x_2\mid \theta _2^{\textrm{hat}}), \ldots , N(x_d\mid \theta _d^{\textrm{hat}}), 1, \ldots, 1\big ) \in \mathbb {R}^{2^{\lceil \log _2 d\rceil } } \end{aligned}$$

Then we use the multiplication operator in a binary tree structure to multiply these terms. In the first row of the tree structure, we have \(\theta ^1\) with

$$\begin{aligned} N\left( {\textbf{x}}\mid \theta ^1\right)&= \Bigg ( N\Big (\big (N(x_1\mid \theta _1^{\textrm{hat}}), N(x_2\mid \theta _2^{\textrm{hat}})\big )\mid \theta _{\times , 1+\delta } \Big ),\\&\quad N\Big (\big (N(x_3\mid \theta _3^{\textrm{hat}}), N(x_4\mid \theta _4^{\textrm{hat}})\big )\mid \theta _{\times , 1+\delta } \Big ), \cdots \Bigg ) \in \mathbb {R}^{2^{\lceil \log _2 d\rceil -1} }. \end{aligned}$$

Using Lemmas 1(a, b) and 2(b), \(2^{\lceil \log _2 d\rceil -1} < d\) multiplications can be approximated in parallel in \(\Theta (1, 4d, d, 2^{\lceil \log _2 d\rceil -1})\). To compose the multiplications and the basis functions, we use Lemma 1(d) which shows \(\theta ^1 \in \Theta (5, 5d, d, 2^{\lceil \log _2 d\rceil -1})\).

Adding the second row analogously, we define \(\theta ^2\) with

$$\begin{aligned} N\left( {\textbf{x}}\mid \theta ^2\right)&= \Bigg ( N\Big (\big (N({\textbf{x}}\mid \theta ^1)_1, N({\textbf{x}}\mid \theta ^1)_2\big )\mid \theta _{\times , 1+4\delta } \Big ) ,\\&\quad N\Big (\big (N({\textbf{x}}\mid \theta ^1)_3, N({\textbf{x}}\mid \theta ^1)_4\big )\mid \theta _{\times , 1+4\delta } \Big ), \ldots \Bigg ) \in \mathbb {R}^{2^{\lceil \log _2d\rceil -2}} \end{aligned}$$

where Lemma 1(d) yields \(\theta ^2\in \Theta (6, 5d, d, 2^{\lceil \log _2 d\rceil -2})\). We continue the construction for \(\theta ^l \in \Theta (4+l, 5d,d, 2^{\lceil \log _2 d\rceil -l})\) to construct \(\theta _{{\textbf{h}},{\textbf{x}}^0} =\theta ^{\lceil \log _2 d\rceil } \in \Theta (4+\lceil \log _2d\rceil , 5d, d, 1)\), see Fig. 1 for an illustration of the multiplicative tree.

In the following, we inductively show that the error after each layer is bounded by \(\tau _l = 1/2 (3^{l+1} - 1)\delta < 1\). With \(\tau _0 = \delta \), the induction start holds.

Let us assume the approximation error on layer l is bounded by \(\tau _l\). We denote the exact product approximated by \(N({\textbf{x}}\mid \theta ^l)_i\) by \(f^l_i({\textbf{x}})\), \(i=1,\ldots , 2^{\lceil \log _2d\rceil -l}\). As \(f^l_i\) is constructed as a multiplication of univariate hat functions (each of which are bounded by 1), we have \(|f^l_i({\textbf{x}})|\le 1\). By the induction hypothesis it holds

$$\begin{aligned} \sup _{{\textbf{x}}\in [0,1]^d} \left|N\left( {\textbf{x}} \mid \theta ^l \right)_i -f^l_i({\textbf{x}})\right|\le \tau _l, \quad \text { for } i=1,\ldots , 2^{\lceil \log _2 d\rceil -l}. \end{aligned}$$

We consider an arbitrary multiplication on layer \(l+1\): for some \(i\in \{1,\ldots ,2^{\lceil \log _2d\rceil -(l+1)}\}\), the product \(f^{l+1}_{i}({\textbf{x}}) = f^l_{2i-1}({\textbf{x}})\cdot f^l_{2i}({\textbf{x}})\) is approximated by

$$\begin{aligned} N\left( {\textbf{x}}\mid \theta ^{l+1}\right) _i = N\Big (\big (N({\textbf{x}}\mid \theta ^l)_{2i-1}, N({\textbf{x}}\mid \theta ^l)_{2i}\big )\mid \theta _{\times , 1+\tau _l} \Big ). \end{aligned}$$

Using Lemma 2(b), we have for \({\textbf{x}}\in [0,1]^d\):

$$\begin{aligned}&|N\left( {\textbf{x}}\mid \theta ^{l+1}\right) _i - f^l_{2i-1}({\textbf{x}})\cdot f^l_{2i}({\textbf{x}}) |\\&\quad \le \left|N\Big (\big (N({\textbf{x}}\mid \theta ^l)_{2i-1}, N({\textbf{x}}\mid \theta ^l)_{2i}\big )\mid \theta _{\times , 1+\tau _l} \Big ) - N({\textbf{x}}\mid \theta ^l)_{2i-1}\cdot N({\textbf{x}}\mid \theta ^l)_{2i} \right|\\&\qquad + \left|N({\textbf{x}}\mid \theta ^l)_{2i-1}\cdot N({\textbf{x}}\mid \theta ^l)_{2i} - f^l_{2i-1}({\textbf{x}})\cdot f^l_{2i}({\textbf{x}}) \right|\\&\quad \le \delta + \left|N({\textbf{x}}\mid \theta ^l)_{2i-1} - f^l_{2i-1}({\textbf{x}}) \right|\cdot \left|N({\textbf{x}}\mid \theta ^l)_{2i} \right|+ \left|N({\textbf{x}}\mid \theta ^l)_{2i} - f^l_{2i}({\textbf{x}}) \right|\cdot \left|f^l_{2i-1}({\textbf{x}}) \right|\\&\quad \le \delta + \tau _l(1+\tau _l) + \tau _l\le \delta + 3\tau _l = \tau _{l+1}. \end{aligned}$$

After \(\lceil \log _2 d \rceil \) layers, we arrive at a single network \(\theta _{{\textbf{h}}, {\textbf{x}}^0}\), which approximates \(\Psi _{{\textbf{h}}, {\textbf{x}}^0}\) up to an accuracy of

$$\begin{aligned} \tau _{\lceil \log _2 d \rceil } = \frac{1}{2}(3^{\lceil \log _2 d \rceil +1} -1 )\delta \le \frac{9}{2}3^{\log _2 d} \delta = \varepsilon . \end{aligned}$$

Since the absolute values of the input weights for \(\theta _{\times , a}\) are given by \(t_a /2 = \sqrt{3}/6 \delta a^{-3}\le \sqrt{3}/6\), see (2), Lemma 1(d) shows that by concatenating the networks, the asymptotic bounds of the weights remain in place. Thus we have \(|\theta _{{\textbf{h}},{\textbf{x}}^0}|_\infty \le c\delta ^{-2} = c d^{2\log _2 3} \varepsilon ^{-2}\). \(\square \)
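The binary-tree multiplication used in this proof can be summarised by the following small Python routine, which multiplies d numbers pairwise in \(\lceil \log _2 d\rceil \) rounds after padding with ones. It only illustrates the structure of the construction; the function name is ours.

```python
import math

def tree_product(values):
    """Multiply the entries of `values` pairwise in ceil(log2 d) rounds,
    mirroring the binary multiplication tree of the proof (padded with ones)."""
    vals = list(values)
    target = 2 ** math.ceil(math.log2(len(vals)))
    vals += [1.0] * (target - len(vals))
    while len(vals) > 1:
        vals = [vals[2 * k] * vals[2 * k + 1] for k in range(len(vals) // 2)]
    return vals[0]

print(tree_product([0.5, 0.25, 0.125]))   # 0.015625, computed in 2 rounds
```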

With the approximation of each basis function, we show approximations of sparse grid functions.

Lemma 5

Let a sparse grid function be given:

$$\begin{aligned} f_n = \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} v_{{\textbf{l}}, {\textbf{i}}}\, \Psi _{{\textbf{l}}, {\textbf{i}}}, \end{aligned}$$

with bound (1): \(|v_{{\textbf{l}}, {\textbf{i}}}|\le 2^{-d-|{\textbf{l}}|_1}\). With M the number of sparse grid basis functions, there exists \(\theta _{f, n} \in \Theta (4+\lceil \log _2 d\rceil , 5 d M )\) with \(\sigma =\tanh \), such that

$$\begin{aligned} \sup _{{\textbf{x}}\in [0,1]^d} |f_n({\textbf{x}}) - N({\textbf{x}}\mid \theta _{f, n} ) |\le \varepsilon , \end{aligned}$$

and \(|\theta _{f, n}|_\infty \le c\, d^{2\log _2 3} \max \{1, 2^{-4d} M^2 \varepsilon ^{-2}\}\), where \(c\) is independent of \(n, f_n, d\) and \(\varepsilon \).

Proof

The proof is closely related to (Montanelli & Du, 2019, Theorem 1), where only ReLU activation functions were used. ReLU neural networks can exactly represent the univariate basis functions, but are less efficient when approximating multiplications. As a consequence, estimating the approximation error for smooth activation functions differs even though the construction is analogous.

We denote the neural network approximations of each basis function \( \Psi _{{\textbf{l}}, {\textbf{i}}}\) with an accuracy of \(\delta \) as \( \theta _{{\textbf{l}}, {\textbf{i}}} = \theta _{{\textbf{h}}_{\textbf{l}}, {\textbf{x}}_{{\textbf{l}},{\textbf{i}}}}. \) Using Lemma 1(c) to add these M networks, we have \(\theta _{f, n} \in \Theta (4+\lceil \log _2 d\rceil , 5 d M )\) such that

$$\begin{aligned} N({\textbf{x}}\mid \theta _{f, n} ) = \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} v_{{\textbf{l}}, {\textbf{i}}}\, N({\textbf{x}}\mid \theta _{{\textbf{l}}, {\textbf{i}}}). \end{aligned}$$

By construction we have for \({\textbf{x}}\in [0,1]^d\):

$$\begin{aligned} |f_n({\textbf{x}}) - N({\textbf{x}}\mid \theta _{f, n} ) |&\le \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} |v_{{\textbf{l}}, {\textbf{i}}}|\, |\Psi _{{\textbf{l}}, {\textbf{i}}}({\textbf{x}}) - N({\textbf{x}}\mid \theta _{{\textbf{l}}, {\textbf{i}}})|\\&\le \delta \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} |v_{{\textbf{l}}, {\textbf{i}}}|. \end{aligned}$$

Using \( |v_{{\textbf{l}}, {\textbf{i}}}|\le 2^{-d-|{\textbf{l}}|_1}\), as given by (1), and \( |\mathcal {I}_{\textbf{l}}|= 2^{|{\textbf{l}}|_1-d}\) we have

$$\begin{aligned} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} |v_{{\textbf{l}}, {\textbf{i}}}| \le |\mathcal {I}_{\textbf{l}}|\, 2^{-d-|{\textbf{l}}|_1} = 2^{-2d}, \end{aligned}$$

and thus

$$\begin{aligned} \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}}|v_{{\textbf{l}}, {\textbf{i}}}| \le 2^{-2d} |\{{\textbf{l}}: |{\textbf{l}}|_1\le n + d - 1, \mathcal {I}_{\textbf{l}}\ne \emptyset \}|. \end{aligned}$$

The number of levels \({\textbf{l}}\) with a non-empty index set is in any case bounded by the number of basis functions \(M\), which yields

$$\begin{aligned} |f_n({\textbf{x}}) - N({\textbf{x}}\mid \theta _{f, n} )|\le \delta 2^{-2d} M. \end{aligned}$$

Setting \( \delta = \min \{1, 2^{2d} \varepsilon /M\} \), we get

$$\begin{aligned} |f_n({\textbf{x}} ) - N({\textbf{x}}\mid \theta _{f, n} )|\le \varepsilon . \end{aligned}$$

The coefficients can then be bounded by

$$\begin{aligned} |\theta _{f, n}|_\infty \le c d^{2\log _2 3} \delta ^{-2} = c d^{2\log _2 3} \max \{1, 2^{-4d} M^2 \varepsilon ^{-2}\}. \end{aligned}$$

\(\square \)

Now we can summarise our theory.

Theorem 1

For \(f\in X^{\infty ,2}_0([0,1]^d)\) with \(|f|_{2,\infty }=1\), there exists a deep neural network \(\theta _{f} \in \Theta (L, w, d, 1)\) with \(\sigma = \tanh \), \(L=4+\lceil \log _2 d\rceil \) and \(w=\mathcal {O}\left( \varepsilon ^{-1/2}|\log \varepsilon |^{3/2(d-1)}\right) \), such that

$$\begin{aligned} \sup _{{\textbf{x}}\in [0,1]^d} |f({\textbf{x}}) - N({\textbf{x}}\mid \theta _f)|\le \varepsilon \end{aligned}$$

with

$$\begin{aligned} |\theta _{f} |_\infty \le \mathcal {O}\left( \varepsilon ^{-3}|\log \varepsilon |^{3(d-1)}\right) . \end{aligned}$$

Proof

First, we construct the sparse grid approximation

$$\begin{aligned} f_n = \sum _{|{\textbf{l}}|_1\le n + d - 1} \sum _{{\textbf{i}}\in \mathcal {I}_{\textbf{l}}} v_{{\textbf{l}}, {\textbf{i}}}\, \Psi _{{\textbf{l}}, {\textbf{i}}}, \end{aligned}$$

up to the accuracy \(\varepsilon /2\). From the approximation theory of sparse grids (Bungartz & Griebel, 2004, Lemma 3.13), we have

$$\begin{aligned} M = \mathcal {O} \left( \varepsilon ^{-1/2} |\log \varepsilon |^{3/2(d-1)}\right) \end{aligned}$$

and (Bungartz & Griebel, 2004, Lemma 3.3) shows (1).

This shows that we can apply Lemma 5 to construct \(\theta _f = \theta _{f, n}\), choosing the same accuracy \(\varepsilon /2\). Then for \({\textbf{x}}\in [0,1]^d\),

$$\begin{aligned} |f({\textbf{x}}) - N({\textbf{x}}\mid \theta _f)|\le |f({\textbf{x}}) - f_n({\textbf{x}}) |+ |f_n({\textbf{x}}) - N({\textbf{x}}\mid \theta _{f})|\le \varepsilon . \end{aligned}$$

\(\square \)

3.4 Application of the results to the deep parametric PDE method

In this section, we discuss how the result can be applied to the deep parametric PDE method. We note that the theoretical results are necessary but not sufficient for an efficient approximation in high dimensions. The reasons are elaborated in the following.

Theorem 1 in Glau and Wunderlich (2022) shows that the \(L^2\)-error of the deep parametric PDE method is bounded by the value of the loss function, which is minimised:

$$\begin{aligned} \Vert u-u_{\textrm{DPDE}}\Vert _{L^2(\mathcal {Q})} \le c\, \mathcal {J}(u_{\textrm{DPDE}}) = c \min _{u_{\textrm{DNN}}} \mathcal {J}(u_{\textrm{DNN}}). \end{aligned}$$

As the loss includes terms with second derivatives, further estimates require approximation results for the first and second derivatives. Also, the reduced regularity of most option pricing problems prevents a direct application of the results shown in the previous section. Both points are subjects of future research.

In contrast, the results in this article show necessary conditions. For the deep parametric PDE method to be efficient in high dimensions, the best approximation in the \(L^2\)-norm must not deteriorate in high dimensions, as shown here with a dimension-independent rate. Without such results, the best approximation in the \(H^2\)-norm would have an approximation rate which decreases in higher dimensions, making the method inefficient. Thus our results are necessary for an efficient high-dimensional method.

While we acknowledge this remaining gap in the theory, the numerical results provide further insights into the efficiency of the method for higher dimensions.

4 Applications in credit risk management

As an illustrative example of the practical performance of the deep parametric PDE method, we investigate the evaluation of credit exposure, see Gregory (2010); Green (2015) for further information.

We consider the exposure \(V_t(X_t;\mu )\) at a given parameter \(\mu \), where \(X_t\) is the random variable describing the underlying assets/risk factors. In our case, \(V_t(x;\mu ) = {\text {Price}}(t, e^x; \mu ) > 0 \); in general, only the positive part of the price is considered.

For a given real-world measure \(\mathbb {P}\), we consider the expected exposure

$$\begin{aligned} {\text {EE}}^{\textrm{risk}}(t)&= \mathbb {E}^\mathbb {P}(V_t) \end{aligned}$$

and the potential future exposure for risk levels \(\alpha \):

$$\begin{aligned} {{\text {PFE}}^{\textrm{risk}}}_\alpha (t)&= \inf \{y: \mathbb {P}(V_t>y)\le 1-\alpha \}. \end{aligned}$$

In the numerical experiment we consider three different risk levels: \(\alpha =95\%, 97.5\%\) and \(99\%\).

In addition, we compute the expected exposure in the riskfree measure \(\mathbb {Q}\) as we know its exact value:

$$\begin{aligned} {\text {EE}}^{\textrm{price}}(t) = e^{-rt} \mathbb {E}^\mathbb {Q}(V_t) = V_0 \end{aligned}$$

To evaluate these exposure measures, we use a full re-evaluation on a grid of evaluation times \(0<t_1<t_2<\ldots < t_n\le T\) (a code sketch is given after this list):

  1.

    Sample \(M\) random paths for \(X_t\) according to \(\mathbb {P}\) or \(\mathbb {Q}\), each evaluated at the times \(t_i\): \({\widehat{X}}_i^j\), \(j=1,\ldots , M\);

  2.

    Compute the prices in each scenario: \(v_{i,j} = V_{t_i}({\widehat{X}}_i^j; \mu )\);

  3.

    Evaluate the mean value for the expected exposure:

    • In the real-world measure: \({\widehat{{\text {EE}}^{\textrm{risk}}}}(t_{{\hat{\i }}}) = \sum _{j=1}^M v_{{{\hat{\i }}},j}/M\);

    • In the risk-neutral measure, include the discount factor: \({\widehat{{\text {EE}}^{\textrm{price}}}}(t_{{\hat{\i }}}) = e^{-rt_{{\hat{\i }}}}\sum _{j=1}^M v_{{{\hat{\i }}},j}/M\).

  4.

    For each time \(t_{{\hat{\i }}}\), sort the values \((v_{{\hat{\i }}, j})_j\) and return the value at position \(\lceil \alpha M\rceil \) as the potential future exposure \({\widehat{{{\text {PFE}}^{\textrm{risk}}}}}_\alpha (t_{{\hat{\i }}})\).
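The following NumPy sketch outlines this full re-evaluation. It assumes a multivariate Black-Scholes model for the risk factors; the callable `pricer(t, x)`, standing in for the trained deep parametric PDE network or any other vectorised pricer, is a hypothetical interface introduced only for the illustration.

```python
import numpy as np

def simulate_risk_factors(s0, drift, sigma, corr, times, n_paths, rng):
    """Correlated geometric Brownian motion, returned as log-prices x = log(S)."""
    s0, drift, sigma = (np.asarray(a, dtype=float) for a in (s0, drift, sigma))
    d = len(s0)
    chol = np.linalg.cholesky(np.asarray(corr, dtype=float))
    x = np.broadcast_to(np.log(s0), (n_paths, d)).copy()
    out, t_prev = np.empty((len(times), n_paths, d)), 0.0
    for i, t in enumerate(times):
        h = t - t_prev
        z = rng.standard_normal((n_paths, d)) @ chol.T     # correlated normal increments
        x = x + (drift - 0.5 * sigma**2) * h + sigma * np.sqrt(h) * z
        out[i], t_prev = x, t
    return out

def exposure_profiles(times, x_paths, pricer, r, alphas):
    """Expected exposure (real-world and discounted) and PFE for each time point."""
    m = x_paths.shape[1]
    ee_risk, ee_price, pfe = [], [], []
    for i, t in enumerate(times):
        v = np.maximum(pricer(t, x_paths[i]), 0.0)         # positive part of the price
        ee_risk.append(v.mean())
        ee_price.append(np.exp(-r * t) * v.mean())
        v_sorted = np.sort(v)
        pfe.append([v_sorted[int(np.ceil(a * m)) - 1] for a in alphas])
    return np.array(ee_risk), np.array(ee_price), np.array(pfe)
```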

We note that this exposure calculation has two sources of numerical error: the Monte-Carlo approximation of the risk factors and the accuracy of the solver.

We use \(M=150\,000\) simulations to calculate the exposure. When we compare to a reference solution, we use \(M=1\,500\,000\) simulations and evaluate the price using the reference pricer based on Bayer et al. (2018). More precisely, we use the dimensionality reduction described in their article and solve the remaining auxiliary problem using a tensorised Gauss-Hermite quadrature. Due to the smoothness of the auxiliary problem, a small number of 9 quadrature points per dimension suffices for a relative error of only \(0.13\%\). For two underlying assets, the auxiliary problem is one-dimensional, so we can afford more quadrature points; with 33 Gauss-Hermite points we observed an accuracy close to machine precision. Further details can be found in Glau and Wunderlich (2022).
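The sketch below is not the reference pricer of Bayer et al. (2018); it only illustrates why the smoothness of the integrand matters for Gauss-Hermite quadrature. Integrating the (smooth) Black-Scholes price at an intermediate time over the lognormal law of the underlying reproduces the time-zero price by the tower property, and already a handful of nodes yields a small relative error. All parameter values are placeholders.

```python
import numpy as np
from scipy.stats import norm

def bs_call(s, k, r, sigma, tau):
    """Black-Scholes price of a European call, vectorised in s."""
    d1 = (np.log(s / k) + (r + 0.5 * sigma**2) * tau) / (sigma * np.sqrt(tau))
    d2 = d1 - sigma * np.sqrt(tau)
    return s * norm.cdf(d1) - k * np.exp(-r * tau) * norm.cdf(d2)

def nested_price_gauss_hermite(s0, k, r, sigma, t1, maturity, n=9):
    """Integrate the smooth inner price over the lognormal law of S_{t1}."""
    x, w = np.polynomial.hermite.hermgauss(n)              # nodes and weights for exp(-x^2)
    s_t1 = s0 * np.exp((r - 0.5 * sigma**2) * t1 + sigma * np.sqrt(t1) * np.sqrt(2.0) * x)
    inner = bs_call(s_t1, k, r, sigma, maturity - t1)      # analytic, hence smooth in s_t1
    return np.exp(-r * t1) * (w @ inner) / np.sqrt(np.pi)

# with only 9 nodes the quadrature closely matches the closed-form time-zero price
print(nested_price_gauss_hermite(100.0, 100.0, 0.01, 0.2, 1.0, 4.0),
      bs_call(100.0, 100.0, 0.01, 0.2, 4.0))
```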

For the deep parametric PDE method, we use the implementation developed in Glau and Wunderlich (2022); a sample code for two assets is available on GitHub. The computational domain covers \(s^i\in [21, 460]\), \(t\in [0, 4]\), \(\sigma _i \in [0.1, 0.3]\), \(r\in [0.01,0.03]\) and \(\rho _{i,i+1} \in [0.2, 0.8]\). Unless noted otherwise, we consider a riskfree rate \(r=0.01\) and a drift (not used in pricing) of \(\mu =0.02\). For asset values outside of the computational domain, we return the value of the payoff as the asymptotic approximation (see the sketch below). We compare the deep parametric PDE method to a Monte-Carlo solver with \(1\,000\) samples, the recently proposed sparse grid stochastic collocation method (Grzelak, 2021) and the reference pricer.
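The fallback outside of the computational domain can be realised by a simple wrapper, as in the following sketch; `network_price` and `payoff` are hypothetical callables introduced for the illustration, and the bounds are the ones stated above.

```python
import numpy as np

S_MIN, S_MAX = 21.0, 460.0   # computational domain for each asset value

def price_with_asymptotics(network_price, payoff, t, s):
    """Evaluate the trained network inside the domain, use the payoff outside."""
    s = np.atleast_2d(s)
    inside = np.all((s >= S_MIN) & (s <= S_MAX), axis=1)
    values = np.asarray(payoff(s), dtype=float)
    if inside.any():
        values[inside] = network_price(t, s[inside])
    return values
```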

We note that, although the approximation results in Theorem 1 were shown using sparse grids, there is no direct connection between our theoretical work and the sparse grid stochastic collocation method. The main motivation for considering this method is to compare the deep parametric PDE method with another modern method intended for medium to high dimensions.

4.1 Evaluations at fixed parameter

We start our investigation with fixed parameters and two or five underlying assets. As the deep parametric PDE method solves for a whole range of parameter values, the training phase was not specific to these parameters.

4.1.1 Two asset case

We consider two assets with the initial value \(s_0 = (50, 125)\), individual volatilities \(\sigma = (0.1, 0.3)\) and a correlation of \(\rho = 0.2\). We evaluate the exposure at 41 points in time.

In the two-dimensional case, we also investigate the need for genuinely multivariate models and therefore include a fitted one-dimensional model in our comparison. For \(S_t^i /s^i_0 \sim \mathcal{L}\mathcal{N}((r-\sigma _i^2/2)t, \sigma _i^2 t)\), \(i=1,2\), with \({{\,\textrm{corr}\,}}(\log (S_t^1), \log (S_t^2)) = \rho \), we consider an independent univariate process \({\bar{S}}_t/{\bar{s}}_0 \sim \mathcal{L}\mathcal{N}({\bar{\mu }} t, {\bar{\sigma }}^2 t)\) with the same expectation and variance as the basket: \(\mathbb {E}({\bar{S}}_t) = \mathbb {E}((S_t^1+S_t^2)/2)\) and \({{\,\textrm{var}\,}}({\bar{S}}_t ) = {{\,\textrm{var}\,}}((S_t^1+S_t^2)/2)\). Using this fitted model, the Black-Scholes formula provides an approximation of the option price; a sketch of the moment matching is given below.
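In the following sketch, the equally weighted basket call with strike \(K\) and the concrete parameter values are assumptions made for the illustration, and expressing the fit through the forward value of the basket is one of several equivalent ways to realise it.

```python
import numpy as np
from scipy.stats import norm

def fitted_lognormal_call(s0, sigma, rho, r, t, k):
    """Match mean and variance of the basket (S^1_t + S^2_t)/2 with one lognormal
    variable and price a call on it with the Black-Scholes/Black formula."""
    s0, sigma = np.asarray(s0, float), np.asarray(sigma, float)
    mean = 0.5 * s0.sum() * np.exp(r * t)                              # E[(S^1+S^2)/2]
    var_i = s0**2 * np.exp(2 * r * t) * (np.exp(sigma**2 * t) - 1.0)
    cov = s0[0] * s0[1] * np.exp(2 * r * t) * (np.exp(rho * sigma[0] * sigma[1] * t) - 1.0)
    var = 0.25 * (var_i[0] + var_i[1] + 2.0 * cov)                     # var[(S^1+S^2)/2]
    sig2t = np.log(1.0 + var / mean**2)                                # fitted total variance
    d1 = (np.log(mean / k) + 0.5 * sig2t) / np.sqrt(sig2t)
    d2 = d1 - np.sqrt(sig2t)
    return np.exp(-r * t) * (mean * norm.cdf(d1) - k * norm.cdf(d2))

# example with the two-asset parameters above; the strike 87.5 is an assumption
print(fitted_lognormal_call([50.0, 125.0], [0.1, 0.3], 0.2, 0.01, 1.0, 87.5))
```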

Our first comparison considers the expected exposure in the riskfree measure, where we know that the exposure profile is constant during the lifetime of the option. In Fig. 2 we see the expected exposure as well as its relative error. At maturity \(t=4\), all methods deliver the same exposure, as the option price coincides with the payoff. With the exception of the one-dimensional model, all methods yield an acceptable exposure profile over the lifetime of the option. The fitted one-dimensional model exhibits a significantly larger error than the multivariate models. This shows the importance of considering multivariate models, as low-dimensional fits do not produce accurate results. The sparse grid solver with 4 refinements shows more reasonable results, but still exhibits a notable error. The deep parametric PDE method exhibits only a slightly larger error than the reference pricer and the Monte-Carlo method, which confirms its accuracy.

In Fig. 3, we see the expected exposure as well as the potential future exposure for three different risk levels in the real-world measure. We see a similar picture as before, with the multivariate pricer consistently showing relative errors below \(5\%\), while the fitted one-dimensional method shows up to \(30\%\) error. The results are similar for the expected exposure and the potential future exposure. In this situation we note that the deep parametric PDE method and the sparse grid solver have a similar maximal error. While the deep parametric PDE method shows the largest error in the near future, the sparse grid method has its largest error in the middle of the time interval.

Due to the large error of the fitted one-dimensional model, we will not consider it in the remaining comparisons.

Fig. 2 Expected exposure in pricing measure (left) and its relative error (right). Two underlying assets with maturity \(T=4\)

Fig. 3 Expected exposure (top) and potential future exposure for \(\alpha =95\%, 97.5\%\) and \(99\%\) (bottom) in real-world measure (left) and their relative error (right). For the error in the potential future exposure, the maximal error of the three risk levels is measured. Two underlying assets

4.1.2 Five asset case

For five assets, we consider initial values \(s_0 = (125, 100, 75, 150, 80)\), volatilities \(\sigma = (0.1, 0.3, 0.25, 0.15, 0.2)\) and pairwise correlations \((\rho _{i,i+1})_i = (0.3, 0.6, 0.4, 0.5)\). We note that due to the parameters, the pricing problem is already 16-dimensional. Due to the longer runtimes of the reference solver, we only consider 21 points in time.

In Fig. 4 the expected exposure in the riskfree measure and its error are shown. Again, we see similar error values in all four cases, with the Monte-Carlo sampling of the risk factors contributing significantly to the overall error. In the five-dimensional case, the deep parametric PDE method yields about half the maximal error of the sparse grid approach.

A similar picture emerges for the real-world measure, as shown in Fig. 5. In all cases a good level of approximation is observed, with relative errors below \(2\%\) for the expected exposure and below \(1\%\) for the potential future exposure. In both cases, the deep parametric PDE method is more accurate than the sparse grid solver and matches the accuracy of the reference pricer.

Fig. 4 Expected exposure in pricing measure (left) and its relative error (right). Five underlying assets

Fig. 5 Expected exposure (top) and potential future exposure for \(\alpha =95\%, 97.5\%\) and \(99\%\) (bottom) in real-world measure (left) and their relative error (right). For the error in the potential future exposure, the maximal error of the three risk levels is measured. Five underlying assets

4.2 Runtime comparisons

With three of the four methods showing similar error values, the computational effort is decisive for the comparison; we recall that the sparse grid approach exhibits larger error values, especially for five underlying assets. In Table 1 the runtimes for the presented examples are given. The runtimes measure the computation of the expected exposure and the three potential future exposures at the considered points in time. The observed runtimes for five assets were scaled for comparability to account for the smaller number of time-steps in this example. The code was evaluated on a typical end-user device, a 2015 MacBook Pro with a 2.7 GHz dual-core CPU and 8 GB RAM.

In all cases, the evaluation of exposures using the deep parametric PDE method was significantly faster than with the traditional methods, due to the use of neural networks. In comparison to the Monte-Carlo solver the gain was up to a factor of 5, while the speed-up factor compared to the reference pricer was up to 25. We note that the evaluation using the sparse grid approach was significantly faster for two assets, with only a slightly larger error; this confirms that both methods are well suited for the task. However, for five assets, the sparse grid approach takes almost as long as the deep parametric PDE method but shows about twice the error. This highlights an important feature of the deep parametric PDE method: the stability of the runtime with a growing number of dimensions. The deep parametric PDE method does not take significantly longer to evaluate exposures for five assets than for two. In contrast, the Monte-Carlo method, the sparse grid method and the reference pricer take significantly longer in higher dimensions. This confirms our theoretical results, showing that the deep parametric PDE method significantly reduces the curse of dimensionality.

Table 1 Runtime comparison for the evaluation of the expected exposure and the potential future exposure for three risk levels at 41 points in time

4.3 Parametric evaluations

The parametric PDE method returns the solution not only for a single parameter value but for a whole range of option and market parameters. To demonstrate the full performance of the deep parametric PDE method, we present examples of exposure calculations in a parametric setting, a highly relevant task for scenario calculations and stress-tests.

To limit the number of varying parameters, we consider equal parameters for all five assets: we vary the initial state \(s_0 = ({\bar{s}}_0, \ldots , {\bar{s}}_0) \in \mathbb {R}^5\), the volatilities \(\sigma = ({\bar{\sigma }}, \ldots , {\bar{\sigma }})\in \mathbb {R}^5\) and the pairwise correlations \((\rho _{i,i+1})_i = ({\bar{\rho }}, \ldots , {\bar{\rho }})\in \mathbb {R}^4\). In addition, the riskfree rate of return is chosen randomly. We consider ten different random parameter sets, which are displayed in Fig. 6. In each case, we evaluate the expected exposure and the potential future exposure and estimate their errors.
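Such parameter scenarios can be generated, for instance, as in the following sketch. The seed and the exact sampling ranges used for Fig. 6 are not specified here, so the ranges below are placeholders taken from the training domain given above.

```python
import numpy as np

rng = np.random.default_rng(42)        # placeholder seed
n_scenarios, n_assets = 10, 5

# each scenario shares one value per parameter across all assets, as described above;
# the ranges are those of the computational domain and serve only as placeholders
scenarios = [
    {
        "s0":    np.full(n_assets, rng.uniform(21.0, 460.0)),      # shared initial state
        "sigma": np.full(n_assets, rng.uniform(0.1, 0.3)),         # shared volatility
        "rho":   np.full(n_assets - 1, rng.uniform(0.2, 0.8)),     # shared pairwise correlation
        "r":     rng.uniform(0.01, 0.03),                          # riskfree rate
    }
    for _ in range(n_scenarios)
]
```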

To be able to evaluate the reference pricer, we had to reduce the complexity of the problem. Therefore, we only evaluate the exposure at 21 points in time and use \(M=750\,000\) for the calculation of reference values.

Fig. 6 The ten random parameter values used in the experiments: initial state \({\bar{s}}_0\), the riskfree rate \(r\), the volatility \({\bar{\sigma }}\) and the correlation \({\bar{\rho }}\)

Figure 7 shows the relative errors in all ten examples. We see the average error over the ten scenarios as well as the maximal and minimal error. Again, we see a similar level of error for all four methods. For the expected exposure and \(t<0.5\), we see a slightly larger error for the deep parametric PDE method, while for the potential future exposure, we see a slightly larger error for the Monte Carlo solver. Nevertheless, in both cases the error remains small, almost always below \(1\%\). The sparse grid solver again shows a larger maximal error than the deep parametric PDE method. We note that again the error for the deep parametric PDE method is largest in the near future, while the sparse grid error is larger towards the maturity of the option.

Fig. 7 Average error for expected exposure (left) and PFE (right) in the real-world measure \(\mathbb {P}\). Area between maximal and minimal observed errors shaded

In summary, in all considered cases, the deep parametric PDE method provides accurate results with a speed-up factor of up to 25 relative to the reference pricer and up to 5 relative to the Monte-Carlo method. A calculation which used to take almost an hour can now be performed in under 10 min. In comparison to a state-of-the-art sparse grid solver, we observed about half the error at a comparable runtime.

4.4 Higher dimensional calculation

We have seen a clear performance gain of the deep parametric PDE method in comparison to a full re-evaluation with a Monte-Carlo solver as well as with the reference pricer. In comparison to another modern high-dimensional solver the performance gain for two and five assets was less pronounced, but we observed a trend that the deep parametric PDE method performs better in higher dimensions. In this section we continue this investigation and consider a case of ten underlying assets. We highlight that the deep parametric PDE method solves for the option price for all time, state and parameter values simultaneously, which overall is a 31-dimensional problem.

We evaluate the expected exposure and the potential future exposure (for risk-level \(\alpha =97.5\%\)) at 21 points in time. We consider an initial value of 100 and a volatility of 0.2 for each of the ten assets. The pairwise correlation is set to 0.5 for all pairs, the riskfree rate is set to 0.01 and the drift of the real-world measure is 0.02.

Due to the high dimensionality, we have not computed reference values except for the constant expected exposure in the riskfree measure. As the expected exposure in this situation equals the current option price, we can compute this single value with a Monte-Carlo method using \(1\,000\,000\) evaluations. Note that we had to reduce the number of refinements of the sparse grid method to three to obtain reasonable runtimes.

Fig. 8 Exposures for 10 underlying assets. Top row: expected exposure in the pricing measure \(\mathbb {Q}\) (left) and its relative error (right). Bottom row: expected exposure (left) and potential future exposure (right) in the real-world measure \(\mathbb {P}\)

In Fig. 8, the evaluated expected exposures as well as their relative error are shown. We see a good accuracy of below \(0.5\%\) for the deep parametric PDE method, while the sparse grid method shows larger errors of up to \(6\%\). A similar picture can be seen for the quantities in the real-world measure \(\mathbb {P}\), which are also shown in the figure. Even though we do not have a reference value available, we can see that the deep parametric PDE method provides significantly more accurate results.

In addition to the improved accuracy, the deep parametric PDE method features faster runtimes. While the sparse grid method took 9.9 min for the calculation of the expected exposures and the potential future exposure, the deep parametric PDE method took only about half that time, 4.3 min. Note that, in comparison to Table 1, we only considered 21 evaluations in time instead of 41, so the observed runtimes have to be roughly doubled for a comparison. Even then, the runtime of the deep parametric PDE method increases only slightly between 2 and 10 dimensions.

5 Conclusion

In this article, we advanced the approximation theory of deep neural networks with smooth activation functions and demonstrated the practical use of the deep parametric PDE method in a setting of interest for the financial industry.

We have shown (to the best of our knowledge) the first approximation results for deep neural networks with smooth activation functions that exhibit an approximation rate independent of the dimension, thereby lessening the curse of dimensionality. The proofs employ sparse grid estimates and arithmetic operations on DNNs.

We confirmed the efficiency in high dimensions implied by these results in a credit exposure example. The deep parametric PDE method consistently outperformed our reference method and a Monte-Carlo solver. While the sparse grid stochastic collocation method is competitive for two underlying assets, it too was outperformed for five and especially ten underlying assets. We observed speed-up factors of up to 25 on a standard office laptop.

Both the theoretical and the practical results show the strong performance of DNN-based methods, especially in finance.