1 Introduction

Methods from applied and computational algebraic topology have recently found substantial applications in the analysis of nonlinear and unstructured datasets (Ghrist 2008; Carlsson 2005). The modus operandi of topological data analysis is to first build a nested family of simplicial complexes around the elements of a dataset, and to then compute the associated persistent homology barcodes (Edelsbrunner and Harer 2010). Of central interest, when testing hypotheses under this paradigm, is the question of what homology groups to expect when the input data are randomly generated. Significant efforts have therefore been devoted to answering this question for various models of noise, giving rise to the field of stochastic topology (Kahle 2011; Bobrowski and Kahle 2018; Kahle 2009; Adler et al. 2014; Costa and Farber 2016). Our work here is a contribution to this area at the interface between probability theory and algebraic topology.

Distributional approximations provide a way of understanding random variables in cases where closed-form distributions cannot be easily obtained. This paper establishes the first multivariate normal approximations to three important counting problems in stochastic topology; as these approximations are based on Stein’s method, explicit bounds on the approximation errors are provided. Our starting point is the ubiquitous graph model \({{\textbf {G}}(n,p)}\); a graph G chosen from this model has as its vertex set \([n] = {\left\{ {1,2,\ldots ,n}\right\} }\), and each of its possible \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) edges is included independently with probability \(p \in [0,1]\). A natural higher-order generalisation of \({{\textbf {G}}(n,p)}\) is furnished by the random clique complex model \({{\textbf {X}}(n,p)}\), whose constituent complexes \({\mathscr {L}}\) are constructed as follows. One first selects an underlying graph \({G \sim {\textbf {G}}(n,p)}\), and then deterministically fills out all k-cliques in G with \((k-1)\)-dimensional simplices for \(k \ge 3\). Higher connectivity is measured by the Betti numbers \(\beta _k({\mathscr {L}})\), which are ranks of rational homology groups \(\text {H}_k({\mathscr {L}};{\mathbb {Q}})\)—in particular, \(\beta _0({\mathscr {L}})\) equals the number of connected components of the underlying random graph G. In Kahle (2014), Kahle proved the following far-reaching generalisation of the Erdős-Rényi connectivity result: for each \(k \ge 1\) and \(\epsilon > 0\),

  1. (1)

    if

    $$\begin{aligned} p \ge \left[ \left( \frac{k}{2}+1+\epsilon \right) \cdot \frac{\log (n)}{n}\right] ^{{1}/{(k+1)}}, \end{aligned}$$

    then \(\beta _k({\mathscr {L}}) = 0\) with high probability; and moreover,

  2. (2)

    if

    $$\begin{aligned} \left[ \frac{k+1+\epsilon }{n}\right] ^{1/k} \le p \le \left[ \left( \frac{k}{2}+1-\epsilon \right) \cdot \frac{\log (n)}{n}\right] ^{1/(k+1)}, \end{aligned}$$

    then \(\beta _k({\mathscr {L}}) \ne 0\) with high probability.

In contrast to Kahle’s result, we study \({{\textbf {X}}(n,p)}\) in the regime where p is a constant. In that case, we may have \({\beta _k({\textbf {X}}(n,p)) \ne 0}\) simultaneously for several values of k that depend on n, so studying the multivariate distribution of the Betti numbers is of interest in this regime. Since many results about Betti numbers in the univariate case are based on counting simplices, we hope that an understanding of multivariate simplex counts will facilitate a multivariate understanding of the Betti numbers. With this in mind, we motivate and describe three random vectors pertaining to \({{\mathscr {L}}\sim {\textbf {X}}(n,p)}\); the normal approximation of these three random vectors is the focus of this paper. All three are denoted \(T = (T_1,\ldots ,T_d)\) for an integer \(d > 0\). For the purposes of this introduction, we add a superscript (1), (2) or (3) to indicate the particular vector.

1.1 Random vector 1: critical simplex counts

The computation of Betti numbers \(\beta _k({\mathscr {L}})\) begins with the chain complex

$$\begin{aligned} \cdots \xrightarrow {d_{k+2}} \textrm{Ch}_{k+1} \xrightarrow {d_{k+1}} \textrm{Ch}_{k} \xrightarrow {d_{k}} \textrm{Ch}_{k-1} \xrightarrow {d_{k-1}} \cdots \end{aligned}$$

Here \(\textrm{Ch}_k\) is a vector space whose dimension equals the number of k-simplices in \({\mathscr {L}}\), while \(d_k:\textrm{Ch}_k \rightarrow \textrm{Ch}_{k-1}\) is an incidence matrix encoding which \((k-1)\)-simplices lie in the boundary of a given k-simplex. These matrices satisfy the property that every successive composite \(d_{k} \circ d_{k+1}\) equals zero, and \(\beta _k({\mathscr {L}})\) is the dimension of the quotient vector space \(\ker d_k/\text {img }d_{k+1}\). In order to calculate \(\beta _k({\mathscr {L}})\), one is required to put the matrices \({\left\{ {d_k:\textrm{Ch}_k \rightarrow \textrm{Ch}_{k-1}}\right\} }\) in reduced echelon form, which is a straightforward task in principle. Unfortunately, Gaussian elimination on an \(m \times m\) matrix incurs an \(O(m^3)\) cost, which becomes prohibitive when facing simplicial complexes built around large data sets (Otter 2017). The standard remedy is to construct a much smaller chain complex which has the same homology groups, and by far the most fruitful mechanism for achieving such homology-preserving reductions is discrete Morse theory (Forman 2002; Mischaikow and Nanda 2013; Henselman-Petrusek and Ghrist 2016; Lampret 2019).
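As a concrete (and purely illustrative) instance of this pipeline, and not of the reductions discussed next, the following sketch builds the signed boundary matrices and computes \(\beta _k\) as the dimension of \(\textrm{Ch}_k\) minus the ranks of \(d_k\) and \(d_{k+1}\); floating-point ranks suffice for small examples, and all names are ours.

```python
import numpy as np

def boundary_matrix(k_simplices, faces):
    """Signed incidence matrix d_k: Ch_k -> Ch_{k-1}; simplices are sorted vertex tuples."""
    row = {f: i for i, f in enumerate(faces)}
    D = np.zeros((len(faces), len(k_simplices)))
    for col, s in enumerate(k_simplices):
        for i in range(len(s)):
            D[row[s[:i] + s[i + 1:]], col] = (-1) ** i
    return D

def betti_numbers(simplices_by_dim):
    """Betti numbers from {dimension: [sorted vertex tuples]} via rank-nullity."""
    top = max(simplices_by_dim)
    rank = {k: 0 for k in range(top + 2)}
    for k in range(1, top + 1):
        rank[k] = np.linalg.matrix_rank(
            boundary_matrix(simplices_by_dim[k], simplices_by_dim[k - 1]))
    return [len(simplices_by_dim[k]) - rank[k] - rank[k + 1] for k in range(top + 1)]

# A filled triangle has Betti numbers [1, 0, 0].
print(betti_numbers({0: [(1,), (2,), (3,)],
                     1: [(1, 2), (1, 3), (2, 3)],
                     2: [(1, 2, 3)]}))
```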

The key structure here is that of an acyclic partial matching, which pairs together certain adjacent simplices of \({\mathscr {L}}\); and the homology groups of \({\mathscr {L}}\) may be recovered from a chain complex whose vector spaces are spanned by unpaired, or critical, simplices. One naturally seeks an optimal acyclic partial matching on \({\mathscr {L}}\) which admits the fewest possible critical simplices. Unfortunately, the optimal matching problem is computationally intractable to solve (Joswig and Pfetsch 2006) even approximately (Bauer and Rathod 2019) for large \({\mathscr {L}}\). Our first random vector \(T^{(1)}\) is obtained by letting \(T^{(1)}_k\) equal the number of critical k-simplices for a specific type of acyclic partial matching on \({\mathscr {L}}\), called the lexicographical matching. Knowledge of this random vector serves to simultaneously quantify the benefit of using discrete Morse theoretic reductions on random simplicial complexes and to provide a robust null model by which to measure their efficacy on general (i.e., not necessarily random) simplicial complexes. This is the first time the asymptotic distribution of this random vector is studied.

1.2 Random vector 2: link simplex counts

The link of a simplex t in \({\mathscr {L}}\), denoted \(\textbf{lk}(t)\), consists of all simplices s for which the union \(s \cup t\) is also a simplex in \({\mathscr {L}}\) and the intersection \(s \cap t\) is empty. The link of t forms a simplicial complex in its own right; and if we restrict attention to the underlying random graph G, then the link of a vertex is precisely the collection of its neighbours. Therefore, the Betti numbers \(\beta _k(\textbf{lk}(t))\) generalise the degree distribution for vertices of random graphs in two different ways—one can study neighbourhoods of higher-dimensional simplices by increasing the dimension of t, and one can examine higher-order connectivity properties by increasing the homological dimension k. The second random vector \(T^{(2)}\) of interest to us here is obtained by letting \(T^{(2)}_k\) equal the number of k-simplices that would lie in the link of a fixed simplex t in \({\mathscr {L}}\), if t were indeed a simplex in the random complex. As far as we are aware, ours is the first work that studies this random vector. A different conditional distribution, which follows directly from results on subgraph counts in \({{\textbf {G}}(n,p)}\), has been studied before; see Remark 5.1.

There are compelling reasons to better understand the combinatorics and topology of such links from a probabilistic viewpoint. For instance, the fact that the link of a k-simplex in a triangulated n-manifold is always a triangulated sphere of dimension \((n-k-1)\) has been exploited to produce canonical stratifications of simplicial complexes into homology manifolds (Asai and Shah 2022; Nanda 2020). Knowledge of simplex counts (and hence, Betti numbers) of links would therefore form an essential first step in any systematic study involving canonical stratifications of random clique complexes.

1.3 Random vector 3: total simplex counts

The strategy employed in Kahle’s proof of the second assertion above involves first checking that the expected number of k-simplices in \({{\mathscr {L}}\sim {\textbf {X}}(n,p)}\) is much larger than the expected number of simplices of dimensions \(k \pm 1\) whenever p lies in the range indicated by (2). Therefore, one may combine the Morse inequalities with the linearity of expectation in order to guarantee that the expected \(\beta _k({\mathscr {L}})\) is nonzero—see Kahle (2014, Section 4) for details. To facilitate more refined analysis and estimates of this sort, the third random vector \(T^{(3)}\) we study in this paper is obtained by letting \(T^{(3)}_k\) equal the total number of k-dimensional simplices in \({\mathscr {L}}\).

Since \(T^{(3)}_k\) is precisely the number of \((k+1)\)-cliques in \({G \sim {\textbf {G}}(n,p)}\), this random vector falls within the purview of generalised U-statistics. We extend results from Janson and Nowicki (1991) to show not only distributional convergence asymptotically but a stronger result, detailing explicit non-asymptotic bounds on the approximation. Several interesting problems can be seen as special cases—these include classical U-statistics (Lee 1990; Korolyuk and Borovskich 2013), monochromatic subgraph counts of inhomogeneous random graphs with independent random vertex colours, and the number of overlapping patterns in a sequence of independent Bernoulli trials. To the best of our knowledge, this is the first multivariate normal approximation result with explicit bounds where the sizes of the subgraphs are permitted to increase with n.

1.4 Main results

The central contributions of this work are multivariate normal approximations for all three random vectors T described above. The approximation error is quantified for finite n in terms of both smooth test functions and convex set indicator test functions. If the bound with respect to either test function class vanishes asymptotically, convergence in distribution follows. However, the convex set indicator test function result is stronger and can be more useful in statistical applications: for example, when estimating confidence regions, which are usually taken to be convex sets.

We state a simplified version of our normal approximation results here and note that the full statements and proofs have been recorded as Theorems 4.5, 5.3, and Corollary 6.1. Note that the quantities \(B_{5.3}\) and \(B_{6.1}\) below are explicit, which allows the parameters p and d to vary.

To state the results, for a positive integer d we define a class of test functions \(h: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) as follows. We say \(h \in {\mathscr {H}}_d\) iff h is three times partially differentiable with Lipschitz and bounded third partial derivatives. Moreover, we denote by \({\mathscr {K}}\) the class of convex sets in \({\mathbb {R}}^d\).

Theorem 1.1

Let \(W^{(i)}\) be an appropriately scaled and centered version of random vector \(T^{(i)}\) for \(i = 1,2,3\) as described above. Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma _i\) be the covariance matrix of \(W^{(i)}\) for each i. Let \(h \in {\mathscr {H}}_d.\)

  1. (1)

    There is a constant \(B_{1.1.1} >0\) independent of n and a natural number \(N_{1.1.1}\) such that for any \(n \ge N_{1.1.1}\) we have

    $$\begin{aligned}\left| {\mathbb {E}}h(W^{(1)}) - {\mathbb {E}}h(\Sigma _1^{\frac{1}{2}}Z) \right| \le B_{1.1.1} \sup _{i, j, k \in [d]} \left\Vert \frac{\partial ^3 h}{\partial x_i \partial x_j \partial x_k}\right\Vert _{\infty } n^{-1}.\end{aligned}$$

    Also, there is a constant \(B_{1.1.2} >0\) independent of n and a natural number \(N_{1.1.2}\) such that for any \(n \ge N_{1.1.2}\) we have

    $$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(1)} \in A) -{\mathbb {P}}(\Sigma _1^{\frac{1}{2}}Z \in A)| \le B_{1.1.2}n^{-\frac{1}{4}}.\end{aligned}$$
  2. (2)

    There is a quantity \(B_{5.3}\) independent of n and defined explicitly such that

    $$\begin{aligned} \left| {\mathbb {E}}h(W^{(2)}) - {\mathbb {E}}h(\Sigma _2^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{5.3} (n - |t|)^{-\frac{1}{2}}; \end{aligned}$$

    and

    $$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(2)} \in A) -{\mathbb {P}}(\Sigma _2^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{5.3}^{\frac{1}{4}}(n - |t|)^{-\frac{1}{8}}.\end{aligned}$$
  3. (3)

    There is a quantity \(B_{6.1}\) independent of n and defined explicitly such that

    $$\begin{aligned} \left| {\mathbb {E}}h(W^{(3)}) - {\mathbb {E}}h(\Sigma _3^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{6.1} n^{-1}; \end{aligned}$$

    and

    $$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^{(3)} \in A)-{\mathbb {P}}(\Sigma _3^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{6.1}^{\frac{1}{4}}n^{-\frac{1}{4}}. \end{aligned}$$

En route to proving Theorem 1.1, we also establish the following properties, which are of direct interest in computational topology. Here we assume that \(p \in (0,1)\) and \(k \in \{1, 2, \ldots \}\) are constants.

  1. (1)

The expected number of critical k-simplices is one order of n smaller than the expected total number of k-simplices; see Lemma 4.2.

  2. (2)

    The variance of the number of critical k-simplices is at least of the order \(n^{2k}\), as shown in Lemma 4.3. An upper bound of the same order can be proved similarly. The variance of the total number of k-simplices is also of the same order.

  3. (3)

Knowing the expected value and the variance, one can prove concentration results using standard concentration inequalities, for example Chebyshev’s inequality. This shows not only that the expected number of critical simplices is smaller than the expected total number of simplices, but also that large deviations from the mean are unlikely; hence the improvement by one order of n is not merely expected but also likely.

  4. (4)

    For counting critical simplices to high accuracy in probability, it is not necessary to check every simplex. Certain simplices have a very small chance of being critical, and can be safely ignored. The probability of this omission causing an error is vanishingly small asymptotically; see Proposition 4.4.

The main ingredient in establishing such results is often an abstract approximation theorem that can be applied to the random variables of interest. While there is no shortage of multivariate normal approximation theorems (Fang 2016; Raič 2004; Meckes 2009; Chen 2011), the existing ones are not sufficiently fine-grained for proving multivariate normal approximations to the random vectors studied here. We therefore return to the pioneering work of Barbour et al. (1989), who proved a univariate central limit theorem (CLT) for a decomposable sum of random variables using Stein’s method, treating the case of dissociated sums as a special case. Our approximation result (Theorem 3.2) forms an extension of their ideas to the multivariate setting.

1.5 Related work

There are different versions of distributional approximation results for subgraph counts in \({{\textbf {G}}(n,p)}\) that can be interpreted as simplex counts in \({{\textbf {X}}(n,p)}\). For example, a multivariate central limit theorem for centered subgraph counts in the more general setting of a random graph associated to a graphon can be found in Kaur and Röllin (2021). That proof is based on Stein’s method via a Stein coupling. Translating this result for uncentered subgraph counts would yield an approximation by a function of a multivariate normal. In Reinert and Röllin (2010), an exchangeable pair coupling led to a multivariate approximation (Reinert and Röllin 2010, Proposition 2) which can be specialised to joint counts of edges and triangles; our approximation significantly generalises this result beyond the case where \(k \in {\left\{ {1,2}\right\} }\). Several univariate normal approximation theorems for subgraph counts are available; recent developments in this area include Privault and Serafin (2020), which uses Malliavin calculus together with Stein’s method, and Eichelsbacher and Rednoß (2023), which uses the Stein-Tikhomirov method. Stein’s method is used in Kahle and Meckes (2013) to show a CLT for the Betti numbers of \({{\textbf {X}}(n,p)}\) in a sparse regime; in Owada et al. (2021) limit theorems for Betti numbers and Euler characteristic are proven in a dynamical random simplicial complex model, also using Stein’s method.

Theorem 3.2 is not the first generalisation of the results in Barbour et al. (1989) to a multivariate setting; see for example (Fang 2016; Raič 2004). The key advantage of our approach is that it allows for bounds which are non-uniform in each component of the vector W. This is useful when, for example, the numbers of summands in the components are of different orders, or when the sizes of the dependency neighbourhoods in the components are of different orders. The applications considered here are precisely of this type, where the non-uniformity of the bounds is crucial. Moreover, we do not require the covariance matrix \(\Sigma \) to be invertible, and can therefore accommodate degenerate multivariate normal distributions.

1.6 Organisation

In Sect. 2 we recall concepts from the theory of simplicial complexes, which we later use. In Sect. 3 we state the theorems that serve as main tools in proving the CLTs. In order to maintain focus on the main results, we defer the proofs of this section until the end of the paper. In Sect. 4 we prove an approximation theorem for critical simplex counts of lexicographical matchings. Two technical computations required in this section have been consigned to the Appendix. In Sect. 5 we prove an approximation theorem for count variables of simplices that are in the link of a fixed simplex. In Sect. 6 we study simplex counts in the random clique complex and prove a CLT for this random variable. This CLT is a corollary of a multivariate normal approximation of generalised U-statistics, which might be of independent interest. Finally, in Sect. 7 we prove our main tools: the abstract approximation theorem (Theorem 3.2) as well as the approximation theorem for U-statistics (Theorem 3.9). We first use smooth test functions and then extend the results to convex set indicators using a smoothing technique from Gan et al. (2017).

1.7 Notation

Throughout this paper we use the following notation. Given positive integers \(m \le n\) we write [m, n] for the set \({\left\{ {m,m+1,\ldots ,n}\right\} }\) and [n] for the set [1, n]. Given a set X we write \(\left| X \right| \) for its cardinality, \({\mathscr {P}}(X)\) for its powerset, and given a positive integer k we write \(C_k = \left\{ \,t \in {\mathscr {P}}([n]) \;\vert \; |t| = k\,\right\} \) for the collection of subsets of [n] which are of size k. For a function \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) we write \(\partial _{ij}f = \frac{\partial ^2f}{\partial x_i \partial x_j}\) and \(\partial _{ijk}f = \frac{\partial ^3f}{\partial x_i \partial x_j \partial x_k}\). Also, we write \(\left| f \right| _k = \sup _{i_1, i_2, \ldots , i_k \in [d]} \left\Vert \partial _{i_1 i_2 \ldots i_k}f\right\Vert _{\infty }\) for any integer \(k \ge 1\), as long as the quantities exist. Here \(|| \cdot ||_\infty \) denotes the supremum norm while \(|| \cdot ||_2\) denotes the Euclidean norm. The notation \(\nabla \) denotes the gradient operator in \({\mathbb {R}}^d\). The notation \(\textrm{Id}_{d \times d}\) denotes the \(d \times d\) identity matrix. The vertex set of all graphs and simplicial complexes is assumed to be [n]. We also use Bachmann-Landau asymptotic notation: we say \(f(n) = O(g(n))\) iff \(\limsup _{n \rightarrow \infty } \frac{|f(n)|}{g(n)} < \infty \) and \(f(n) = \Omega (g(n))\) iff \(\liminf _{n \rightarrow \infty } \frac{f(n)}{g(n)} > 0\). The notation \(f(n)= \omega (g(n))\) indicates that \(\lim _{n \rightarrow \infty } \frac{f(n)}{g(n)} = \infty \).

2 Simplicial complex preliminaries

2.1 First definitions

Firstly, we recall the notion of a simplicial complex (Spanier 1966, Ch 3.1); these provide higher-dimensional generalisations of a graph and constitute data structures of interest across algebraic topology in general as well as applied and computational topology in particular.

A simplicial complex \({\mathscr {L}}\) on a vertex set V is a set of nonempty subsets of V (i.e. \(\varnothing \notin {\mathscr {L}}\subseteq {\mathscr {P}}(V)\)) such that the following properties are satisfied:

  1. (1)

    for each \(v \in V\) the singleton \({\left\{ {v}\right\} }\) lies in \({\mathscr {L}}\), and

  2. (2)

    if \(t \in {\mathscr {L}}\) and \(s \subset t\) then \(s \in {\mathscr {L}}\).

The dimension of a simplicial complex \({\mathscr {L}}\) is \(\max _{s \in {\mathscr {L}}}|s| - 1\). Elements of a simplicial complex are called simplices. If s is a simplex, then its dimension is \(|s| - 1\). A simplex of dimension k is called a k-simplex. Note that a one-dimensional simplicial complex is equivalent to a graph, with singletons as vertices and two-element sets as edges.

Given a graph \(G = (V, E)\) the clique complex \({\mathscr {X}}\) of G is a simplicial complex on V such that

$$\begin{aligned} t \in {\mathscr {X}} \iff \forall u,v \in t, \; {\left\{ {u,v}\right\} } \in E.\end{aligned}$$

Recall that \({{\textbf {G}}(n,p)}\) is a random graph on n vertices where each pair of vertices is connected with probability p, independently of any other pair. The \({{\textbf {X}}(n,p)}\) random simplicial complex is the clique complex of the \({{\textbf {G}}(n,p)}\) random graph, which is a random model studied in stochastic topology (Kahle 2009, 2014). Note that \(t \in {{\mathscr {X}}} \) if and only if the vertices of t span a clique in G. Thus, elements in \({{\textbf {X}}(n,p)}\) are cliques in \({{\textbf {G}}(n,p)}\).
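For concreteness, \({{\textbf {X}}(n,p)}\) can be simulated directly from this definition. The short sketch below is purely illustrative (the function and variable names are ours, and the brute-force clique enumeration is only feasible for small n); it samples \({G \sim {\textbf {G}}(n,p)}\) and lists the simplices of the clique complex by dimension.

```python
import random
from itertools import combinations

def sample_clique_complex(n, p, max_dim=3, seed=0):
    """Sample G ~ G(n, p) on the vertex set [n] and list the simplices of its
    clique complex X(n, p), grouped by dimension, up to max_dim."""
    rng = random.Random(seed)
    vertices = list(range(1, n + 1))
    edge = {frozenset(e): rng.random() < p for e in combinations(vertices, 2)}
    simplices = {0: [(v,) for v in vertices]}
    for k in range(1, max_dim + 1):            # a k-simplex has k + 1 vertices
        simplices[k] = [s for s in combinations(vertices, k + 1)
                        if all(edge[frozenset(e)] for e in combinations(s, 2))]
    return simplices

complex_by_dim = sample_clique_complex(n=30, p=0.4)
print({k: len(s) for k, s in complex_by_dim.items()})   # simplex counts by dimension
```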

2.2 Links

The link of a simplex t in a simplicial complex \({\mathscr {L}}\) is the subcomplex

$$\begin{aligned} \textbf{lk}(t) = \left\{ \,s \in {\mathscr {L}} \;\vert \; s \cup t \in {\mathscr {L}}\text { and } t \cap s = \varnothing \,\right\} . \end{aligned}$$
Fig. 1 Left: the link (highlighted in blue) of the vertex 1 (highlighted in red). Right: the link (highlighted in blue) of the edge \({\left\{ {1,2}\right\} }\) (highlighted in red). The two-dimensional simplices are shaded in grey

Example 2.1

If we look at a graph as a one dimensional simplicial complex, then the vertices are sets of the form \({\left\{ {i}\right\} }\) and edges are sets of the form \({\left\{ {i,j}\right\} }\). For a vertex \(t = {\left\{ {v}\right\} }\), the edges of the form \(s = {\left\{ {v,u}\right\} }\) will not be in the link of t because \(t \cap s = \varnothing \) is not satisfied. If we pick \(s = {\left\{ {i,j}\right\} }\) and \(v \notin s\), then \(s \cup t \in {\mathscr {L}}\) is not satisfied. So there will be no edges in the link. However, if \(s = {\left\{ {u}\right\} }\) and u is a neighbour of v, then \(s \cup t \in {\mathscr {L}}\) and \(s \cap t = \varnothing \). Hence the link of a vertex will be precisely the other vertices that the vertex is connected to; the notion of the link generalises the idea of a neighbourhood in a graph.

Example 2.2

Now consider the simplicial complex depicted in Fig. 1: it has 8 vertices, 12 edges and 3 two-dimensional simplices that are shaded in grey. On the left hand side of the figure we see highlighted in blue the link of the vertex 1, which is highlighted in red. So \(\textbf{lk}({\left\{ {1}\right\} }) = {\left\{ {{\left\{ {2}\right\} }, {\left\{ {3}\right\} }, {\left\{ {5}\right\} }, {\left\{ {6}\right\} }, {\left\{ {8}\right\} }, {\left\{ {2,3}\right\} }, {\left\{ {2,8}\right\} }, {\left\{ {5,6}\right\} }}\right\} }\). On the right hand side of the figure we see highlighted in blue the link of the edge \({\left\{ {1,2}\right\} }\), which is highlighted in red. That is, \(\textbf{lk}({\left\{ {1,2}\right\} }) = {\left\{ {{\left\{ {3}\right\} }, {\left\{ {8}\right\} }}\right\} }\).
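The link is easy to compute directly from its definition. The sketch below is illustrative only: the names are ours and the toy complex is an assumption of ours, not the complex of Fig. 1.

```python
from itertools import combinations

def closure(top_simplices):
    """All nonempty faces of the given simplices, i.e. the simplicial complex they generate."""
    cx = set()
    for s in top_simplices:
        for r in range(1, len(s) + 1):
            cx.update(frozenset(f) for f in combinations(s, r))
    return cx

def link(t, cx):
    """lk(t) = { s in cx : s union t in cx and s intersect t is empty }."""
    t = frozenset(t)
    return {s for s in cx if not (s & t) and (s | t) in cx}

# Toy complex: a filled triangle {1,2,3} together with the extra edge {1,4}.
cx = closure([(1, 2, 3), (1, 4)])
print(sorted(map(sorted, link({1}, cx))))      # [[2], [2, 3], [3], [4]]
print(sorted(map(sorted, link({1, 2}, cx))))   # [[3]]
```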

2.3 Discrete Morse theory

A partial matching on a simplicial complex \({\mathscr {L}}\) is a collection

$$\begin{aligned} \Sigma \subseteq \left\{ \,(s, t) \;\vert \; s \subseteq t \in {\mathscr {L}}\text { and } |t| - |s| = 1\,\right\} \end{aligned}$$

such that every simplex appears in at most one pair of \(\Sigma \). A \(\mathbf {\Sigma }\)-path (of length \(k \ge 1)\) is a sequence of distinct simplices of \({\mathscr {L}}\) of the following form:

$$\begin{aligned}(s_1 \subseteq t_1 \supseteq s_2 \subseteq t_2 \supseteq \ldots \supseteq s_k \subseteq t_k)\end{aligned}$$

such that \((s_i, t_i) \in \Sigma \) for all \(i \in [k]\) and \(|t_i| - |s_{i+1}| = 1\) for all \(i \in [k-1]\). A \(\Sigma \)-path is called a gradient path if \(k=1\) or \(s_1\) is not a subset of \(t_k\). A partial matching \(\Sigma \) on \({\mathscr {L}}\) is called acyclic iff every \(\Sigma \)-path is a gradient path. Given a partial matching \(\Sigma \) on \({\mathscr {L}}\), we say that a simplex \(t \in {\mathscr {L}}\) is critical iff t does not appear in any pair of \(\Sigma \).

For a one-dimensional simplicial complex, viewed as a graph, a partial matching \(\Sigma \) consists of pairs \((v, \{u,v\})\) with v a vertex and \(\{u, v\}\) an edge. A \(\Sigma \)-path is then a sequence of distinct vertices and edges

$$\begin{aligned} v_1, {\left\{ {v_1,v_2}\right\} }, v_2, {\left\{ {v_2,v_3}\right\} }, \ldots , v_k, {\left\{ {v_k,v_{k+1}}\right\} } \end{aligned}$$

where each consecutive pair of the form \((v_i,{\left\{ {v_i,v_{i+1}}\right\} })\) is constrained to lie in \(\Sigma \).

We refer the interested reader to Forman (2002) for an introduction to discrete Morse theory and to Mischaikow and Nanda (2013) to see how it is used to reduce computations in the persistent homology algorithm. In this work we aim to understand how much improvement one is likely to get on a random input when using a specific type of acyclic partial matching, defined below.

Definition 2.3

Let \({\mathscr {L}}\) be a simplicial complex and assume that the vertices are ordered by \([n] = \{1,\ldots ,n\}\). For each simplex \(s \in {\mathscr {L}}\) define

$$\begin{aligned} I_{{\mathscr {L}}}(s) :=\{j \in [n] \mid j < \min (s) \text { and } s \cup \{j\} \in {\mathscr {L}}\}. \end{aligned}$$

Now consider the pairings

$$\begin{aligned} s \leftrightarrow s \cup \{i\}, \end{aligned}$$

where \(i = \min I_{{\mathscr {L}}}(s)\) is the smallest element in the set \( I_{{\mathscr {L}}}(s)\), defined whenever \(I_{{\mathscr {L}}}(s) \ne \varnothing \). We call this the lexicographical matching.

Due to the \(\min I_{{\mathscr {L}}}(s)\) construction in the lexicographical matching, the indices decrease along any \(\Sigma \)-path, and hence every \(\Sigma \)-path is a gradient path; this shows that the lexicographical matching is indeed an acyclic partial matching on \({\mathscr {L}}\).

Fig. 2 Lexicographical matching given by the red arrows. Critical simplices are highlighted in blue

Example 2.4

Consider the simplicial complex \({\mathscr {L}}\) depicted in Fig. 2. The complex has 5 vertices, 6 edges and one two-dimensional simplex that is shaded in grey. The red arrows show the lexicographical matching on this simplicial complex: there is an arrow from a simplex s to t iff the pair (s, t) is part of the matching. More explicitly, the lexicographical matching on \({\mathscr {L}}\) is

$$\begin{aligned}\Sigma = {\left\{ {({\left\{ {2}\right\} }, {\left\{ {1,2}\right\} }), ({\left\{ {3}\right\} }, {\left\{ {2,3}\right\} }), ({\left\{ {4}\right\} }, {\left\{ {1,4}\right\} }), ({\left\{ {5}\right\} }, {\left\{ {3,5}\right\} }), ({\left\{ {4, 5}\right\} }, {\left\{ {3,4, 5}\right\} })}\right\} }.\end{aligned}$$

Note that \({\left\{ {3,4}\right\} }\) cannot be matched because the set \(I_{{\mathscr {L}}}({\left\{ {3,4}\right\} })\) is empty. Also, in any lexicographical matching \({\left\{ {1}\right\} }\) is always critical as there are no vertices with a smaller label and hence the set \(I_{{\mathscr {L}}}({\left\{ {1}\right\} })\) is empty. So under this matching there are two critical simplices: \({\left\{ {1}\right\} }\) and \({\left\{ {3,4}\right\} }\), highlighted in blue in the figure. Hence, if we were computing the homology of this complex, considering only two simplices would be sufficient instead of all 12 which are in \({\mathscr {L}}\)—a significant improvement.
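Definition 2.3 can also be implemented verbatim. The sketch below is illustrative and brute force; the complex is reconstructed from the edges and triangle listed in Example 2.4, and it recovers the same critical simplices.

```python
from itertools import combinations

def closure(top_simplices):
    """All nonempty faces of the given simplices."""
    cx = set()
    for s in top_simplices:
        for r in range(1, len(s) + 1):
            cx.update(frozenset(f) for f in combinations(s, r))
    return cx

def lexicographical_matching(cx):
    """Pairs (s, s + {min I(s)}) of Definition 2.3, together with the critical simplices."""
    pairs, matched = [], set()
    for s in cx:
        I = [j for j in range(1, min(s)) if s | {j} in cx]
        if I:
            pairs.append((s, s | {min(I)}))
            matched |= {s, s | {min(I)}}
    return pairs, [s for s in cx if s not in matched]

# The complex of Example 2.4: edges {1,2}, {1,4}, {2,3}, {3,4}, {3,5}, {4,5} and triangle {3,4,5}.
cx = closure([(1, 2), (1, 4), (2, 3), (3, 4), (3, 5), (4, 5), (3, 4, 5)])
pairs, critical = lexicographical_matching(cx)
print(sorted(map(sorted, critical)))   # [[1], [3, 4]]
```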

3 Probabilistic tools

In this section we introduce the approximation theorems that are used to study the random variables of interest. In order not to obscure our main results, the proofs are deferred to Sect. 7.

3.1 CLT for dissociated sums

Let n and d be positive integers. For each \(i \in [d] =:\{1, 2, \ldots , d\}\), we fix an index set \({\mathbb {I}}_i \subset [n] \times {\left\{ {i}\right\} }\) and consider the union of disjoint sets \({\mathbb {I}}:= \bigcup _{i \in [d]} {\mathbb {I}}_i\). Associate to each such \(s = (k,i) \in {\mathbb {I}}\) a real centered random variable \(X_s\) and form for each \(i \in [d]\) the sum

$$\begin{aligned} W_i :=\sum _{s \in {\mathbb {I}}_i} X_s. \end{aligned}$$

Consider the resulting random vector \(W = (W_1,\ldots ,W_d) \in {\mathbb {R}}^d\). The following notion is a natural multivariate generalisation of the dissociated sum from McGinley and Sibson (1975); see also Barbour et al. (1989).

Definition 3.1

We call W a vector of dissociated sums if for each \(s \in {\mathbb {I}}\) and \(j \in [d]\) there exists a dependency neighbourhood \({\mathbb {D}}_j(s) \subset {\mathbb {I}}_j\) satisfying three criteria:

  1. (1)

    the difference \(\left( W_j - \sum _{u \in {\mathbb {D}}_j(s)} X_u\right) \) is independent of \(X_s\);

  2. (2)

    for each \(t \in {\mathbb {I}}\), the quantity \(\left( W_j - \sum _{u \in {\mathbb {D}}_j(s)} X_u - \sum _{v \in {\mathbb {D}}_j(t) \setminus {\mathbb {D}}_j(s)} X_v\right) \) is independent of the pair \((X_s,X_t)\); and finally,

  3. (3)

    \(X_s\) and \(X_t\) are independent if \(t \not \in \bigcup _j {\mathbb {D}}_j(s)\).

Let W be a vector of dissociated sums as defined above. For each \(s\in {\mathbb {I}}\), by construction, the sets \({\mathbb {D}}_j(s), j \in [d]\) are disjoint (although for \(s \ne t\), the sets \({\mathbb {D}}_j(s)\) and \({\mathbb {D}}_j(t)\) may not be disjoint). We write \({\mathbb {D}}(s) = \bigcup _{j \in [d]}{\mathbb {D}}_j(s)\) for the disjoint union of these dependency neighbourhoods. With this preamble in place, we state the abstract approximation theorem that is the main ingredient in the proofs of our normal approximation results.

Theorem 3.2

Let \(h \in {\mathscr {H}}_d.\) Consider a standard d-dimensional Gaussian vector \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). Assume that for all \(s \in {\mathbb {I}}\), we have \({\mathbb {E}}\left\{ X_s \right\} = 0\) and \({\mathbb {E}}\left| X_s^3 \right| < \infty .\) Then, for any vector of dissociated sums \(W \in {\mathbb {R}}^d\) with a positive semi-definite covariance matrix \(\Sigma \),

$$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le B_{3.2} \left| h \right| _3, \end{aligned}$$

where \(B_{3.2} = B_{3.2.1} + B_{3.2.2}\) is the sum given by

$$\begin{aligned} B_{3.2.1}&:=\frac{1}{3}\sum _{s \in {\mathbb {I}}} \sum _{t, u \in {\mathbb {D}}(s)} \hspace{-.4em} \left( \frac{1}{2}{\mathbb {E}}\left| X_sX_tX_u \right| +{\mathbb {E}}\left| X_sX_t \right| {\mathbb {E}}\left| X_u \right| \right) \\ B_{3.2.2}&:=\frac{1}{3} \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{v \in {\mathbb {D}}(t) \setminus {\mathbb {D}}(s)} \hspace{-.4em} \left( {\mathbb {E}}\left| X_sX_tX_v \right| + {\mathbb {E}}\left| X_sX_t \right| {\mathbb {E}}\left| X_v \right| \right) . \end{aligned}$$

The theorem above together with a smoothing technique will be used to prove the following approximation theorem in terms of convex set indicators.

Theorem 3.3

Consider a standard d-dimensional Gaussian vector \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). For any centered vector of dissociated sums \(W \in {\mathbb {R}}^d\) with a positive semi-definite covariance matrix \(\Sigma \) and finite third absolute moments we have

$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{3.2}^{\frac{1}{4}}, \end{aligned}$$

where the quantity \(B_{3.2}\) is as in Theorem 3.2.

The next result provides a simplification of Theorems 3.2 and 3.3 under the assumption that one uses bounds that are uniform in \(s, t, u \in {\mathbb {I}}\). Its proof follows immediately from writing the sum over \( \sum _{s \in {\mathbb {I}}} \sum _{t,u \in {\mathbb {D}}(s)} \) as the sum over \(\sum _{i \in [d]} \sum _{j \in [d]} \sum _{k \in [d]} \sum _{s \in {\mathbb {I}}_i} \sum _{t \in {\mathbb {D}}_j(s)} \sum _{u \in {\mathbb {D}}_k(s)}\).

Corollary 3.4

We have the following two bounds:

  1. (1)

    Under the assumptions of Theorem 3.2,

    $$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le B_{3.4} \left| h \right| _3. \end{aligned}$$
  2. (2)

    Assuming the hypotheses of Theorem 3.3,

    $$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{3.4}^{\frac{1}{4}}. \end{aligned}$$

Here \(B_{3.4}\) is a sum over \((i,j,k) \in [d]^3\) of the form

$$\begin{aligned} B_{3.4} :=\frac{1}{3} \sum _{(i,j,k)} \left| {\mathbb {I}}_i \right| \alpha _{ij} \left( \frac{3\alpha _{ik}}{2} + 2\alpha _{jk}\right) \beta _{ijk}; \end{aligned}$$

and \(\alpha _{ij}\) is the largest value attained by \(\left| {\mathbb {D}}_j(s) \right| \) over \(s \in {\mathbb {I}}_i\), and

$$\begin{aligned} \beta _{ijk} = \max _{s,t,u}\Big ({\mathbb {E}}\left| X_sX_tX_u \right| ,{\mathbb {E}}\left| X_sX_t \right| {\mathbb {E}}\left| X_u \right| \Big ) \end{aligned}$$

as \((s,t,u)\) ranges over \({\mathbb {I}}_i \times {\mathbb {I}}_j \times {\mathbb {I}}_k\).

In most of our applications, the variables \(X_s\) are centered and rescaled Bernoulli random variables. Hence, the following lemma is useful.

Lemma 3.5

Let \(\xi _1, \xi _2, \xi _3\) be Bernoulli random variables with expected values \(\mu _1, \mu _2, \mu _3\) respectively. Let \(c_1, c_2, c_3 > 0\) be any constants. Consider variables \(X_i :=c_i (\xi _i - \mu _i)\) for \(i = 1,2,3\). Then we have

$$\begin{aligned}&{\mathbb {E}}\left| X_1 X_2 X_3 \right| \le c_1 c_2 c_3 \left\{ \mu _1\mu _2(1-\mu _1)(1-\mu _2) \right\} ^{\frac{1}{2}}; \\&{\mathbb {E}}\left| X_{1} X_{2} \right| {\mathbb {E}}\left| X_{3} \right| \le c_1 c_2 c_3 \left\{ \mu _1\mu _2(1-\mu _1)(1-\mu _2) \right\} ^{\frac{1}{2}}. \end{aligned}$$

Proof

Note that \(X_{3}\) can take two values: \(-c_3 \mu _3\) or \(c_3(1-\mu _3)\). As \(0 \le \mu _3 \le 1\), we have

$$\begin{aligned}{\mathbb {E}}\left| X_{1} X_{2} \right| {\mathbb {E}}\left| X_{3} \right| \le c_3 {\mathbb {E}}\left| X_{1} X_{2} \right| ;\end{aligned}$$
$$\begin{aligned}{\mathbb {E}}\left| X_1 X_2 X_3 \right| \le c_3 {\mathbb {E}}\left| X_1 X_2 \right| .\end{aligned}$$

Applying the Cauchy-Schwarz inequality and direct calculation of the second moments gives

$$\begin{aligned}{\mathbb {E}}\left| X_1 X_2 \right| \le \left\{ {\mathbb {E}}\left\{ X_1^2 \right\} {\mathbb {E}}\left\{ X_2^2 \right\} \right\} ^{\frac{1}{2}} = c_1c_2 \left\{ \mu _1\mu _2(1-\mu _1)(1-\mu _2) \right\} ^{\frac{1}{2}},\end{aligned}$$

which finishes the proof. \(\square \)
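As a quick numerical sanity check of Lemma 3.5 (purely illustrative; the three Bernoulli variables are sampled independently here, an assumption the lemma itself does not need), one may compare the empirical moments against the stated bound:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.3, 0.6, 0.8])
c = np.array([2.0, 0.5, 1.5])
X = c * (rng.binomial(1, mu, size=(10**6, 3)) - mu)       # samples of (X_1, X_2, X_3)

bound = c.prod() * np.sqrt(mu[0] * mu[1] * (1 - mu[0]) * (1 - mu[1]))
print(np.abs(X.prod(axis=1)).mean(), "<=", bound)                                # E|X_1 X_2 X_3|
print(np.abs(X[:, 0] * X[:, 1]).mean() * np.abs(X[:, 2]).mean(), "<=", bound)    # E|X_1 X_2| E|X_3|
```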

3.2 CLT for U-statistics

Here we consider generalised U-statistics, which were first introduced in Janson and Nowicki (1991). The result we derive could be of independent interest but, most importantly, the approximation theorem for simplex counts follows as a consequence. Let \(\{\xi _i\}_{1 \le i \le n}\) be a sequence of independent random variables taking values in a measurable subset \({\mathscr {X}} \subseteq {\mathscr {U}}\) and let \(\{Y_{i,j}\}_{1 \le i < j \le n}\) be an array of independent random variables taking values in a measurable subset \({\mathscr {Y}} \subseteq {\mathscr {U}}\) which is independent of \(\{\xi _i\}_{1 \le i \le n}\). We use the convention that \(Y_{i,j} = Y_{j,i}\) for any \(i < j\). For example, one can think of \(\xi _i\) as a random label of a vertex i in a random graph where \(Y_{i,j}\) is the indicator for the edge connecting i and j. Given a subset \(s \subseteq [n]\) of size m, write \(s = {\left\{ {s_1, s_2, \ldots , s_m}\right\} }\) such that \(s_1< s_2< \ldots < s_m\) and set \({\mathscr {X}}_s = (\xi _{s_1}, \xi _{s_2}, \ldots , \xi _{s_m})\) and \({\mathscr {Y}}_s = (Y_{s_1, s_2}, Y_{s_1, s_3}, \ldots , Y_{s_{m-1},s_m})\). Recall that \(C_k\) denotes the set of subsets of [n] which are of size k.

Definition 3.6

Given \(1 \le k \le n\) and a measurable function \(f: {\mathscr {X}}^k \times {\mathscr {Y}}^{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\) define the associated generalised U-statistic by

$$\begin{aligned}S_{n,k}(f) = \sum _{s \in C_k} f({\mathscr {X}}_s, {\mathscr {Y}}_s).\end{aligned}$$

Let \({\left\{ {k_i}\right\} }_{i \in [d]}\) be a collection of positive integers, each being at most n, and for each \(i \in [d]\) let \(f_i: {\mathscr {X}}^{k_i} \times {\mathscr {Y}}^{\left( {\begin{array}{c}k_i\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\) be a measurable function. We are interested in the joint distribution of the variables \(S_{n,k_1}(f_1), S_{n,k_2}(f_2), \ldots S_{n,k_d}(f_d)\), which are assumed to have finite mean and variance.

Fix \(i \in [d]\). For \(s \in {\mathbb {I}}_i :=C_{k_i} \times {\left\{ {i}\right\} }\) define \(X_{s} = \sigma _i^{-1}\left( f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s}\right) \), where \(\mu _{s} = {\mathbb {E}}\left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) \right\} \) and \(\sigma _i^{2} = {{\,\textrm{Var}\,}}(S_{n,k_i}(f_i))\). Now let \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) be a random variable and write \(W = (W_1, W_2, \ldots W_d) \in {\mathbb {R}}^d\). By construction, \(W_i\) has mean 0 and variance 1.
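To ground the notation, the sketch below (illustrative; it assumes Bernoulli edge variables and uniformly random vertex colours, neither of which is prescribed by Definition 3.6) evaluates one generalised U-statistic \(S_{n,k}(f)\), namely the number of monochromatic triangles in a randomly coloured \({{\textbf {G}}(n,p)}\).

```python
import random
from itertools import combinations

def S(n, k, f, xi, Y):
    """Generalised U-statistic: the sum of f(X_s, Y_s) over all k-subsets s of [n]."""
    total = 0.0
    for s in combinations(range(1, n + 1), k):
        X_s = tuple(xi[v] for v in s)
        Y_s = tuple(Y[frozenset(e)] for e in combinations(s, 2))
        total += f(X_s, Y_s)
    return total

n, p, colours = 40, 0.3, 3
rng = random.Random(0)
xi = {v: rng.randrange(colours) for v in range(1, n + 1)}                        # vertex labels xi_i
Y = {frozenset(e): int(rng.random() < p) for e in combinations(range(1, n + 1), 2)}

# f indicates a monochromatic triangle: all three edges present, all three colours equal.
mono_triangle = lambda X_s, Y_s: float(all(Y_s) and len(set(X_s)) == 1)
print(S(n, 3, mono_triangle, xi, Y))
```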

Assumption 3.7

We assume that

  1. (1)

    For any \(i \in [d]\) there is some \(\alpha _i > 0\) such that for all \(s, t \in {\mathbb {I}}_i\), the variables \(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)\) are either independent or \({{\,\textrm{Cov}\,}}(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)) > \alpha _i\).

  2. (2)

    There is \(\beta \ge 0\) such that for any \(i,j,l \in [d]\) and any \(s \in {\mathbb {I}}_i, t \in {\mathbb {I}}_j, u \in {\mathbb {I}}_l\) we have

    $$\begin{aligned}{\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \left\{ f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right\} \right| \le \beta \end{aligned}$$

    as well as

    $$\begin{aligned}{\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \right| {\mathbb {E}}\left| f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right| \le \beta .\end{aligned}$$

The first assumption is not necessary but very convenient, and we use it to derive a lower bound for the variance \(\sigma _i^2\). A normal approximation theorem can still be proven in our framework when the assumption does not hold, provided that a sufficiently large lower bound for the variance is obtained in a different way. Similarly, we use the second assumption to get a convenient bound on mixed moments. In order to maintain the generality and simplicity of the proofs, we work under Assumption 3.7.

We also consider the important special case in which the functions in Definition 3.6 depend only on the second component, so that the sequence \({\left\{ {\xi _i}\right\} }_{i \in [n]}\) can be ignored. Hence, we add an additional assumption.

Assumption 3.8

We assume that the functions \(f_i\) only depend on the variables \({\left\{ {Y_{i,j}}\right\} }\) for \(1 \le i < j \le n\). That is, we can write \(f_i: {\mathscr {Y}}^{\left( {\begin{array}{c}k_i\\ 2\end{array}}\right) } \rightarrow {\mathbb {R}}\).

Such functions appear naturally, for example, when counting subgraphs in an inhomogeneous Bernoulli random graph. An example of such a generalised U-statistic is the simplex count in \({{\textbf {X}}(n,p)}\), which is worked out in Sect. 6. We recall, for the purposes of the following result, that n is the number of variables in the sequence \(\{\xi _i\}_{1 \le i \le n}\) and \(\left( {\begin{array}{c}n\\ 2\end{array}}\right) \) is the number of variables in the array \(\{Y_{i,j}\}_{1 \le i < j \le n}\) from Definition 3.6.

Theorem 3.9

Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and let \(h \in {\mathscr {H}}_d\). Assume W with covariance matrix \(\Sigma \) satisfies Assumption 3.7. Then

$$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{3.9} n^{-\gamma }; \end{aligned}$$

and

$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{3.9}^{\frac{1}{4}}n^{-\frac{\gamma }{4}}. \end{aligned}$$

Here,

$$\begin{aligned} B_{3.9} = \frac{2^{\delta } \beta }{3} \sum _{i,j,l=1}^d \frac{k_i^{\min (k_i,k_j) + 1}}{k_i! \sqrt{\alpha _i \alpha _j \alpha _l}} \left( k_i^{\min (k_i,k_l) + 1} + k_j^{\min (k_j,k_l) + 1} \right) K_i K_j K_l \end{aligned}$$

and

$$\begin{aligned} K_i = (2k_i^2 - k_i)^{-\frac{k_i}{2} + \frac{1}{2}}. \end{aligned}$$

If only Assumption 3.7 is satisfied, then \(\gamma = \frac{1}{2}\) and \(\delta = 1\). If additionally Assumption 3.8 is also satisfied, then \(\gamma = 1\) and \(\delta = 4\).

4 Critical simplex counts for lexicographical Morse matchings

Now we attend to our motivating problem, critical simplex counts. Consider the random simplicial complex \({{\textbf {X}}(n,p)}\). In this section we study the joint distribution of critical simplices in different dimensions with respect to the lexicographical matching on \({{\textbf {X}}(n,p)}\). We start with the following lemma, which is an immediate consequence of Definition 2.3, allowing us to write down the variables of interest in terms of the edge indicators.

Lemma 4.1

Let \({\mathscr {L}}\) be a simplicial complex endowed with the lexicographical acyclic partial matching, and consider a simplex \(t \in {\mathscr {L}}\) with minimal vertex \(i \in [n]\). Then, t is matched with

  1. (1)

    one of its co-faces if and only if there exists some \(j < i\) for which \(t \cup {\left\{ {j}\right\} } \in {\mathscr {L}}\); and,

  2. (2)

one of its faces if and only if for all \(j < i\) we have \((t \setminus {\left\{ {i}\right\} }) \cup {\left\{ {j}\right\} } \notin {\mathscr {L}}\).

For any pair of integers \(1 \le i < j \le n\) let \({Y_{i,j} :=\mathbb {1}\left( {\left\{ {i,j}\right\} } \in {\textbf {X}}(n,p)\right) }\) be the edge indicator. Fix \(s \in C_k\). Define \(X_s^{+}\) to be the indicator that s is matched with one of its cofaces, given that it is a simplex, and \(X_s^{-}\) to be the indicator that s is matched with one of its faces, given that it is a simplex. The events that the two variables indicate are disjoint. By Lemma 4.1 we can see that \(X_s^{+} = 1 - \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s} Y_{a,b} \right) \) and \(X_s^{-} = \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \), where \(s_{-} :=s {\setminus } {\left\{ {\min (s)}\right\} }\). Hence,

$$\begin{aligned}&\mathbb {1}\left( s \text { is a critical simplex}\right) = \mathbb {1}\left( s \in {\textbf {X}}(n,p)\right) (1 - (X_s^{+} + X_s^{-})) \\&= \prod _{i \ne j \in s} Y_{i,j} \left[ \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s}Y_{a,b}\right) - \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \right] . \end{aligned}$$

Thus, the random variable of interest, counting the number of \((k-1)\)-simplices that are critical under the lexicographical matching, is

$$\begin{aligned} T_k = \sum _{s \in C_k} \prod _{i \ne j \in s} Y_{i,j} \left[ \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s}Y_{a,b}\right) - \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \right] . \end{aligned}$$
(4.1)

Note that this random variable does not fit into the framework of generalised U-statistics, because the summands in \(T_k\) depend not only on the variables indexed by the subset s but also on the edge indicators \(Y_{a,b}\) with \(a < \min (s)\). Therefore, Theorem 3.9 cannot be applied here.
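Formula (4.1) can be evaluated by brute force from the edge indicators. The following sketch is illustrative only (names are ours, it is feasible only for small n, and we restrict to \(k \ge 2\) so that \(s_-\) is nonempty).

```python
import random
from itertools import combinations
from math import prod

def critical_count(Y, n, k):
    """T_k of (4.1): the number of (k-1)-simplices of the clique complex that are
    critical under the lexicographical matching, computed from edge indicators Y."""
    def y(a, b):
        return Y[frozenset((a, b))]
    total = 0
    for s in combinations(range(1, n + 1), k):           # s is sorted, so s[0] = min(s)
        if not all(y(i, j) for i, j in combinations(s, 2)):
            continue                                      # s does not span a simplex
        m, s_minus = s[0], s[1:]
        up   = prod(1 - prod(y(a, b) for b in s)       for a in range(1, m))
        down = prod(1 - prod(y(a, b) for b in s_minus) for a in range(1, m))
        total += up - down
    return total

n, p = 25, 0.5
rng = random.Random(3)
Y = {frozenset(e): int(rng.random() < p) for e in combinations(range(1, n + 1), 2)}
print([critical_count(Y, n, k) for k in (2, 3, 4)])       # critical 1-, 2- and 3-simplices
```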

4.1 Mean and variance

Lemma 4.2

For any \(1 \le k \le n - 1\) we have:

$$\begin{aligned} p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) +k}\left( {\begin{array}{c}n-2\\ k\end{array}}\right) (1-p) \le {\mathbb {E}}\{ T_{k+1} \} \le p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) -k-1}\left( {\begin{array}{c}n-1\\ k\end{array}}\right) (1-p).\end{aligned}$$

Proof

$$\begin{aligned} {\mathbb {E}}\ \{T_{k+1}\}&= \sum _{l=1}^{n-k} \sum _{\begin{array}{c} s \in C_{k+1} \\ \min (s) = l \end{array}} {\mathbb {E}}\left\{ \prod _{i \ne j \in s} Y_{i,j} \left[ \prod _{a=1}^{l - 1}\left( 1 - \prod _{b \in s}Y_{a,b}\right) - \prod _{a=1}^{l - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \right] \right\} \\&= p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \sum _{l=1}^{n-k} \sum _{\begin{array}{c} s \in C_{k+1} \\ \min (s) = l \end{array}} \left\{ (1-p^{k+1})^{l - 1} - (1-p^k)^{l-1} \right\} \\&=p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \sum _{l=0}^{n-k-1} \left( {\begin{array}{c}n-l-1\\ k\end{array}}\right) \left\{ (1-p^{k+1})^{l} - (1-p^k)^{l} \right\} \\&\le p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \left( {\begin{array}{c}n-1\\ k\end{array}}\right) \sum _{l=0}^\infty \left\{ (1-p^{k+1})^{l} - (1-p^k)^{l} \right\} \\&= p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) -k-1}\left( {\begin{array}{c}n-1\\ k\end{array}}\right) (1-p). \end{aligned}$$

Moreover,

$$\begin{aligned}{\mathbb {E}}\ \{T_{k+1}\}&= p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \sum _{l=0}^{n-k-1} \left( {\begin{array}{c}n-l-1\\ k\end{array}}\right) \left\{ (1-p^{k+1})^{l} - (1-p^k)^{l} \right\} \\&\ge p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \left( {\begin{array}{c}n-2\\ k\end{array}}\right) \left\{ (1-p^{k+1})^1 - (1-p^k)^1 \right\} \\&= p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) +k}\left( {\begin{array}{c}n-2\\ k\end{array}}\right) (1-p). \end{aligned}$$

\(\square \)
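The third line of the computation above is an exact expression for \({\mathbb {E}}\{T_{k+1}\}\), which can be checked numerically against the two bounds; a small illustrative sketch follows.

```python
from math import comb

def mean_T(n, k, p):
    """Exact E{T_{k+1}} from the intermediate expression in the proof of Lemma 4.2."""
    return p ** comb(k + 1, 2) * sum(
        comb(n - l - 1, k) * ((1 - p ** (k + 1)) ** l - (1 - p ** k) ** l)
        for l in range(n - k))

n, k, p = 200, 2, 0.5
lower = p ** (comb(k + 1, 2) + k) * comb(n - 2, k) * (1 - p)
upper = p ** (comb(k + 1, 2) - k - 1) * comb(n - 1, k) * (1 - p)
print(lower, "<=", mean_T(n, k, p), "<=", upper)
```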

In this example, bounding the variance is not immediate. The proof of the following Lemma 4.3 is a long (and not particularly insightful) calculation, which is deferred to the Appendix. In Lemma 4.3 the constant could have been made explicit at the expense of a lengthy calculation; an explicit expression for the variance is given in Lemma A.1.

Lemma 4.3

For a fixed integer \(1 \le k \le n-1\) and \(p \in (0,1)\) there is a constant \(C_{p,k} > 0\) independent of n and a natural number \(N_{p,k}\) such that for any \(n \ge N_{p,k}\):

$$\begin{aligned}{{\,\textrm{Var}\,}}(T_{k+1}) \ge C_{p,k}n^{2k}.\end{aligned}$$

Just knowing the expectation and the variance can already give us some information about the variable. For example, we obtain the following proposition. This proposition shows that considering only a subset of the simplices already gives a good approximation for the critical simplex counts.

Proposition 4.4

Fix \(k \in [n]\). Let \(K \le n - k\) and set the random variable:

$$\begin{aligned}T_{k+1}^K :=\sum _{\begin{array}{c} s \in C_{k+1} \\ \min (s) \le K \end{array}} \prod _{i \ne j \in s} Y_{i,j} \left[ \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s}Y_{a,b}\right) - \prod _{a=1}^{\min (s) - 1}\left( 1 - \prod _{b \in s_{-}} Y_{a,b}\right) \right] .\end{aligned}$$

If \(K =K(n)= \omega (\ln ^{1 + \epsilon }(n))\) for any \(\epsilon > 0\), then the variable \(T_{k+1} - T_{k+1}^K\) vanishes with high probability, provided that p and k stay constant.

Proof

A similar calculation to that for Lemma 4.2 shows that:

$$\begin{aligned} {\mathbb {E}}\left\{ T_{k+1} - T_{k+1}^K \right\}&= \sum _{i=K+1}^{n-k} \left( {\begin{array}{c}n-i\\ k\end{array}}\right) p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) } \left\{ (1-p^{k+1})^{i-1} - (1-p^k)^{i-1} \right\} \\&\le \left( {\begin{array}{c}n\\ k\end{array}}\right) p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) }(1-p^{k+1})^K\sum _{i=0}^{\infty } (1-p^{k+1})^{i} \\&\le p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) -k-1}\frac{n^k}{k!}(1-p^{k+1})^K. \end{aligned}$$

Using Markov’s inequality, we get:

$$\begin{aligned}{\mathbb {P}}(T_{k+1} - T_{k+1}^K \ge 1) \le p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) -k-1}\frac{n^k}{k!}(1-p^{k+1})^K,\end{aligned}$$

which asymptotically vanishes as long as \(K = \omega (\ln ^{1+\epsilon }(n))\). \(\square \)

4.2 Approximation theorem

For \(i \in [d]\), recall the random variable \(T_{i+1}\) counting i-simplices in \({{\textbf {X}}(n,p)}\) that are critical under the lexicographical matching, as given in (4.1). We write \({\mathbb {I}}_i :=C_{i+1} \times {\left\{ {i}\right\} }\) for the i-th index set. For \(s = (\phi ,i) \in {\mathbb {I}}_i\) we write

$$\begin{aligned}\mu _{s} = p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\left( (1-p^{i+1})^{\min (\phi )-1}-(1-p^{i})^{\min (\phi )-1}\right) \end{aligned}$$

and \(\sigma _i = \sqrt{{{\,\textrm{Var}\,}}(T_{i+1})}\). Let

$$\begin{aligned}X_{s} = \sigma _i^{-1}\left\{ \prod _{i \ne j \in \phi } Y_{i,j} \left[ \prod _{a=1}^{\min (\phi ) - 1}\left( 1 - \prod _{b \in \phi }Y_{a,b}\right) - \prod _{a=1}^{\min (\phi ) - 1}\left( 1 - \prod _{b \in \phi _{-}} Y_{a,b}\right) \right] - \mu _{s}\right\} .\end{aligned}$$

Let \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) and \(W = (W_1, W_2, \ldots , W_d) \in {\mathbb {R}}^d\). For bounds that asymptotically go to zero for this example, we use Theorems 3.2 and 3.3 directly: the uniform bounds from Corollary 3.4 are not fine enough here. We note that here, due to the requirement of criticality, two summands \(X_s\) and \(X_u\) become dependent as soon as the corresponding subsets share a vertex. This is in contrast to simplex counts, which required an overlap of at least two vertices.

Theorem 4.5

Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of W.

  1. (1)

    Let \(h \in {\mathscr {H}}_d\). Then there is a constant \(B_{4.5.1} >0\) independent of n and a natural number \(N_{4.5.1}\) such that for any \(n \ge N_{4.5.1}\) we have

    $$\begin{aligned}\left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le B_{4.5.1} \left| h \right| _3 n^{-1}.\end{aligned}$$
  2. (2)

    There is a constant \(B_{4.5.2} >0\) independent of n and a natural number \(N_{4.5.2}\) such that for any \(n \ge N_{4.5.2}\) we have

    $$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A) -{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le B_{4.5.2}n^{-\frac{1}{4}}.\end{aligned}$$

Proof

It is clear that W satisfies the conditions of Theorems 3.2 and 3.3 when, for any \(s = (\phi ,i) \in {\mathbb {I}}_i\), we set

$$\begin{aligned}{\mathbb {D}}_j(s) = \left\{ \,(\psi , j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 1\,\right\} .\end{aligned}$$

We apply Theorems 3.2 and 3.3; to bound the quantity \(B_{3.2}\) appearing there, we use Lemmas 3.5 and 4.3. We write C for an unspecified positive constant that does not depend on n. Also, we assume here that n is large enough for the bound in Lemma 4.3 to apply. Let \(\mu (i,a) = p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\left( (1-p^{i+1})^{a - 1}-(1-p^{i})^{a - 1} \right) .\) Then we have:

$$\begin{aligned}&B_{3.2} \le \frac{1}{3} \sum _{i,j,k=1}^d \sum _{a=1}^{n-i} \sum _{\begin{array}{c} \phi \in C_{i+1} \\ \min (\phi ) = a \end{array}} \sum _{b=1}^{n-j} \sum _{\begin{array}{c} (\psi ,j) \in {\mathbb {D}}_j((\phi ,i)) \\ \min (\psi ) = b \end{array}} \\&\Bigg \{ \sum _{r \in {\mathbb {D}}_k((\phi ,i))} \frac{3}{2} (\sigma _i \sigma _j \sigma _k)^{-1} \left\{ \mu (i,a)\mu (j,b)(1-\mu (i,a))(1-\mu (j,b)) \right\} ^{\frac{1}{2}} \\&\qquad + \sum _{r \in {\mathbb {D}}_k((\psi ,j))} (\sigma _i \sigma _j \sigma _k)^{-1} \left\{ \mu (i,a)\mu (j,b)(1-\mu (i,a))(1-\mu (j,b)) \right\} ^{\frac{1}{2}} \Bigg \}\\&\le \sum _{i,j,k=1}^d \sum _{a=1}^{n-i} \sum _{b=1}^{n-j} Cn^{i+j-1} n^k n^{-i-j-k} \\&\qquad \times \left\{ (1-p^{i+1})^{a - 1}(1-p^{j+1})^{b - 1} + (1-p^{i})^{a - 1}(1-p^{j})^{b - 1} \right\} ^{\frac{1}{2}}\\&\le Cn^{-1} \sum _{i,j,k=1}^d \sum _{a=1}^{\infty } \sum _{b=1}^{\infty } \Bigg [\left\{ (1-p^{i+1})^{a - 1}(1-p^{j+1})^{b - 1} \right\} ^{\frac{1}{2}} \\&\qquad + \left\{ (1-p^{i})^{a - 1}(1-p^{j})^{b - 1} \right\} ^{\frac{1}{2}} \Bigg ] \\&\le Cn^{-1} d^3 \left\{ \frac{1}{(1 - \sqrt{1-p^{d+1}})^2} + \frac{1}{(1 - \sqrt{1-p^{d}})^2} \right\} \le C n^{-1}. \\ \end{aligned}$$

\(\square \)

Remark 4.6

The relevance of understanding the number of critical simplices in the context of applied and computational topology is as follows. We assume that \(p \in (0,1)\) and \(k \in \{1, 2, \ldots \} \) are constants.

  1. (1)

As seen in Lemma 4.2, the expected number of critical k-simplices under the lexicographical matching is one power of n smaller than the expected total number of k-simplices in \({{\textbf {X}}(n,p)}\).

  2. (2)

By Proposition 4.4, with high probability no k-simplex \({s \in {\textbf {X}}(n,p)}\) with \(\min (s) > K(n)\) is critical, whenever \(K(n) = \omega (\ln ^{1+\epsilon }(n))\) for some fixed \(\epsilon > 0\).

5 Simplex counts in links

Consider the random simplicial complex \({{\textbf {X}}(n,p)}\). For \(1 \le i < j \le n\) define the edge indicator \({Y_{i,j} :=\mathbb {1}\left( {\left\{ {i,j}\right\} } \in {\textbf {X}}(n,p)\right) }\). In this section we study the count of \((k-1)\)-simplices that would be in the link of a fixed subset \(t \subseteq [n]\) if the subset spanned a simplex in \({{\textbf {X}}(n,p)}\). Given that t is a simplex, the variable counts the number of \((k-1)\)-simplices in \(\textbf{lk}(t)\). Thus, the random variable of interest is

$$\begin{aligned} T^t_k = \sum _{s \in C_k} \left\{ \mathbb {1}\left( t \cap s = \varnothing \right) \prod _{i \ne j \in s} Y_{i,j} \prod _{i \in s, j \in t} Y_{i,j} \right\} . \end{aligned}$$
(5.1)

Note that the product \( \prod _{i \in s, j \in t} Y_{i,j} \) ensures that \(t \cup s\) is a simplex if t spans a simplex.
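Formula (5.1) is likewise easy to evaluate directly. The sketch below is illustrative only (names are ours, brute force); it counts the simplices that would lie in \(\textbf{lk}(t)\) for a fixed t.

```python
import random
from itertools import combinations

def link_count(Y, n, t, k):
    """T^t_k of (5.1): the number of k-element subsets s of [n], disjoint from t, that span
    a simplex and are fully joined to t, i.e. (k-1)-simplices of lk(t) when t is a simplex."""
    def y(a, b):
        return Y[frozenset((a, b))]
    t = set(t)
    return sum(
        all(y(i, j) for i, j in combinations(s, 2)) and all(y(i, j) for i in s for j in t)
        for s in combinations(sorted(set(range(1, n + 1)) - t), k))

n, p, t = 30, 0.5, (1, 2)
rng = random.Random(7)
Y = {frozenset(e): int(rng.random() < p) for e in combinations(range(1, n + 1), 2)}
print([link_count(Y, n, t, k) for k in (1, 2, 3)])   # vertices, edges and triangles in lk(t)
```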

Remark 5.1

The random variable \(T^t_k\) does not fit into the framework of generalised U-statistics, because the summands depend not only on the variables indexed by the subset s, and because we do not sum over all subsets s but only over those that do not intersect t. Hence, Theorem 3.9 does not apply here.

Moreover, note that given the number of vertices of the link of a simplex t, the conditional distribution of the link of t is again \({{\textbf {X}}(n',p)}\), where \(n'\) is a random variable equal to the number of vertices in the link. If we are interested in such a conditional distribution, the results proved later in Sect. 6 apply. However, in this section we study the number of simplices in the link of t given that t is a simplex rather than given the number of vertices of the link of t. Such a random variable behaves differently from the simplex counts in \({{\textbf {X}}(n,p)}\). For example, the summands of \(T_k^t\) have a different dependence structure compared to the summands of \(T_k\) (see Eq. 6.1 below). As a result, the approximation bounds are of different order.

It is natural to ask whether the results obtained in this section follow from those of Sect. 6 below. This might well be the case, but the answer is not straightforward. One could derive an approximation for the number of simplices in \(\textbf{lk}(t)\) given the number of vertices in the link; the variable \(T_k^t\) could then be approximated by a mixture, induced by the distribution of the number of vertices in the link (which is binomial). However, applying this approach naïvely yields bounds that do not converge to zero. While it is certainly possible that a different approach would succeed, we prefer to prove the approximation directly.

5.1 Mean and variance

It is easy to see that for any positive integer k and \(t \subseteq [n]\),

$$\begin{aligned} {\mathbb {E}}\{ T^t_{k+1} \} = \left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) + |t|(k+1)} =: \left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) \mu _{k+1}^t \end{aligned}$$

since there are \(\left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) \) choices for \(s \in C_{k+1}\) such that \(s \cap t = \varnothing \). Next we derive a lower bound on the variance.

Lemma 5.2

For any fixed \(1 \le k \le n-1\) and \(t \subseteq [n]\) we have:

$$\begin{aligned}{{\,\textrm{Var}\,}}(T^t_{k+1}) \ge (k+1)\left( {\begin{array}{c}n-|t|\\ 2k+1\end{array}}\right) \left( {\begin{array}{c}2k+1\\ k\end{array}}\right) (\mu _{k+1}^t)^2 \left\{ p^{-|t|} - 1\right\} .\end{aligned}$$

Proof

First let us calculate \({{\,\textrm{Cov}\,}}(T^t_{k+1}, T^t_{l+1})\). For fixed subsets \(s \in C_{k+1}\) and \(u \in C_{l+1}\), if \(|s \cap u| = 0\), then the corresponding variables \(\prod _{i \ne j \in s} Y_{i,j} \prod _{i \in s, j \in t} Y_{i,j}\) and \(\prod _{i \ne j \in u} Y_{i,j} \prod _{i \in u, j \in t} Y_{i,j}\) are independent and so have zero covariance.

For \(1 \le m \le l + 1\), the number of pairs of subsets \(s \in C_{k+1}\) and \(u \in C_{l+1}\) such that \(s \cap t = \varnothing = u \cap t\) and \(|s \cap u| = m\) is \(\left( {\begin{array}{c}n-|t|\\ k+1\end{array}}\right) \left( {\begin{array}{c}k+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t| - k - 1\\ l + 1-m\end{array}}\right) \). Since each summand is non-negative, we lower bound by the \(m=1\) summand and get (with \(\left( {\begin{array}{c}1\\ 2\end{array}}\right) := 0\))

$$\begin{aligned}&{{\,\textrm{Cov}\,}}(T^t_{k+1}, T^t_{l+1}) \\&= \sum _{m=1}^{l+1} \left( {\begin{array}{c}n-|t|\\ k+1\end{array}}\right) \left( {\begin{array}{c}k+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t| - k - 1\\ l + 1-m\end{array}}\right) \Big \{ \mu _{k+1}^t\mu _{l+1}^t p^{- \left( {\begin{array}{c}m\\ 2\end{array}}\right) } p^{-|t|m} -\mu _{k+1}^t\mu _{l+1}^t \Big \} \\&\ge \left( {\begin{array}{c}n-|t|\\ k+1\end{array}}\right) (k+1) \left( {\begin{array}{c}n-|t| - k - 1\\ l\end{array}}\right) \mu _{k+1}^t\mu _{l+1}^t \left\{ p^{-|t|} - 1\right\} \\&= (k+1)\left( {\begin{array}{c}n-|t|\\ l+k+1\end{array}}\right) \left( {\begin{array}{c}l+k+1\\ l\end{array}}\right) \mu _{k+1}^t\mu _{l+1}^t \left\{ p^{-|t|} - 1\right\} . \end{aligned}$$

Taking \(l = k\) completes the proof. \(\square \)
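To make these quantities concrete, take \(t = {\left\{ {v}\right\} }\) a single vertex and \(k = 1\), so that \(T^{\{v\}}_{2}\) counts the triangles of \({{\textbf {X}}(n,p)}\) containing v; the mean formula above and Lemma 5.2 then specialise to

$$\begin{aligned} {\mathbb {E}}\{ T^{\{v\}}_{2} \} = \left( {\begin{array}{c}n-1\\ 2\end{array}}\right) p^{3} \qquad \text {and} \qquad {{\,\textrm{Var}\,}}(T^{\{v\}}_{2}) \ge 2\left( {\begin{array}{c}n-1\\ 3\end{array}}\right) \left( {\begin{array}{c}3\\ 1\end{array}}\right) p^{6}\left\{ p^{-1} - 1\right\} = 6\left( {\begin{array}{c}n-1\\ 3\end{array}}\right) p^{5}(1-p), \end{aligned}$$

so for fixed \(p \in (0,1)\) the mean is of order \(n^2\) and the variance lower bound is of order \(n^3\).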

5.2 Approximation theorem

For a multivariate normal approximation of the counts given in Eq. (5.1), we write \(\sigma _i = \sqrt{{{\,\textrm{Var}\,}}(T^t_{i+1})}\) and \(C_{i+1}^t = \left\{ \,\phi \in C_{i+1} \;\vert \; \phi \cap t = \varnothing \,\right\} \), as well as \({\mathbb {I}}_i :=C_{i+1}^t \times {\left\{ {i}\right\} }\). For \(s = (\phi ,i) \in {\mathbb {I}}_i\) define

$$\begin{aligned} X_{s} = \sigma _i^{-1}(\prod _{a \ne b \in \phi } Y_{a,b} \prod _{a \in \phi , b \in t} Y_{a,b} - \mu _{i+1}^t). \end{aligned}$$

It is clear that \({\mathbb {E}}\left\{ X_{s} \right\} = 0\). Let \(W^t_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\) and \(W^t = (W^t_1, W^t_2, \ldots , W^t_d) \in {\mathbb {R}}^d\). Then we have the following approximation theorem.

Theorem 5.3

Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of \(W^t\).

  1. (1)

    Let \(h \in {\mathscr {H}}_d\). Then

    $$\begin{aligned} \left| {\mathbb {E}}h(W^t) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{5.3} (n - |t|)^{-\frac{1}{2}}. \end{aligned}$$
  2. (2)

    Moreover,

    $$\begin{aligned}\sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W^t \in A) -{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{5.3}^{\frac{1}{4}}(n - |t|)^{-\frac{1}{8}}.\end{aligned}$$

Here

$$\begin{aligned}B_{5.3} = \frac{7}{6} (2d+1)^{5d+\frac{17}{2}} (p^{-|t|} - 1)^{-\frac{3}{2}}p^{-(d+1)(d+2|t|)}.\end{aligned}$$

Proof

It is clear that \(W^t\) satisfies the conditions of Corollary 3.4 with the dependency neighbourhood \({\mathbb {D}}_j(s) = \left\{ \,(\psi ,j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 1\,\right\} \) for any \(s = (\phi ,i) \in {\mathbb {I}}_i\). So we aim to bound the quantity \(B_{3.4}\) from the corollary.

Given \(\phi \in C^t_{i+1}\) and \(m \le \min (i+1, j+1)\), there are \(\left( {\begin{array}{c}i+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t| - i-1\\ j+1-m\end{array}}\right) \) subsets \(\psi \in C^t_{j+1}\) such that \(|\phi \cap \psi | = m\). Therefore, for any \(i,j \in [d]\) and \(s \in {\mathbb {I}}_i\) we have

$$\begin{aligned} |{\mathbb {D}}_j(s)|= & {} \sum _{m=1}^{\min (i,j) + 1} \left( {\begin{array}{c}i+1\\ m\end{array}}\right) \left( {\begin{array}{c}n-|t|-i-1\\ j+1-m\end{array}}\right) \nonumber \\\le & {} (i+1)^{\min (i,j)+2} (n-|t|)^{j} \nonumber \\\le & {} (d+1)^{d+2} (n-|t|)^{j} \end{aligned}$$
(5.2)

giving a bound for \(\alpha _{ij}\). For a bound on \(\beta _{ijk}\), applying Lemma 3.5, for any \(i,j,k \in [d]\) and \(s \in {\mathbb {I}}_i, u \in {\mathbb {I}}_j, v \in {\mathbb {I}}_k\) we get

$$\begin{aligned}&{\mathbb {E}}\left| X_{s} X_{u} X_{v} \right| \le (\sigma _i \sigma _j \sigma _k)^{-1} \left\{ \mu _{i+1}^t\mu _{j+1}^t(1-\mu _{i+1}^t)(1-\mu _{j+1}^t) \right\} ^{\frac{1}{2}}; \end{aligned}$$
(5.3)
$$\begin{aligned}&{\mathbb {E}}\left| X_{s} X_{u} \right| {\mathbb {E}}\left| X_{v} \right| \le (\sigma _i \sigma _j \sigma _k)^{-1} \left\{ \mu _{i+1}^t\mu _{j+1}^t(1-\mu _{i+1}^t)(1-\mu _{j+1}^t) \right\} ^{\frac{1}{2}}. \end{aligned}$$
(5.4)

Now we apply Lemma 5.2 and get

$$\begin{aligned} \sigma _i^2&\ge (i+1)\left( {\begin{array}{c}n-|t|\\ 2i+1\end{array}}\right) \left( {\begin{array}{c}2i+1\\ i\end{array}}\right) (\mu _{i+1}^t)^2 \left\{ p^{-|t|} - 1\right\} \\&\ge \frac{(n-|t|)^{2i+1}}{(2d+1)^{d+1}d^{d}} (\mu _{i+1}^t)^2 \left\{ p^{-|t|} - 1\right\} . \end{aligned}$$

Taking both sides of the inequality to the power of \(-\frac{1}{2}\) we get for any \(i \in [d]\)

$$\begin{aligned} \sigma _i^{-1} \le (n-|t|)^{-i - \frac{1}{2}}(2d+1)^{\frac{d+1}{2}}d^{\frac{d}{2}} (\mu _{i+1}^t)^{-1}\left\{ p^{-|t|} - 1\right\} ^{-\frac{1}{2}}. \end{aligned}$$
(5.5)

Using Eqs. (5.2)–(5.5) to bound \(B_{3.4}\) from Corollary 3.4 we get:

$$\begin{aligned}&B_{3.4} \le \frac{7}{6} \sum _{i,j,k=1}^d \left( {\begin{array}{c}n-|t|\\ i+1\end{array}}\right) (d+1)^{2d+4} (n-|t|)^{j+k} (\sigma _i \sigma _j \sigma _k)^{-1} \\&\left\{ \mu _{i+1}^t\mu _{j+1}^t(1-\mu _{i+1}^t)(1-\mu _{j+1}^t) \right\} ^{\frac{1}{2}} \\ \le&\frac{7}{6} \sum _{i,j,k=1}^d (n-|t|)^{i+j+k+1} (d+1)^{2d+4} (n-|t|)^{-i -j -k - \frac{3}{2}} (2d+1)^{\frac{3d+3}{2}}d^{\frac{3d}{2}} \\&(p^{-|t|} - 1)^{-\frac{3}{2}}(\mu _{k+1}^t\mu _{i+1}^t\mu _{j+1}^t)^{-1} \left\{ \mu _{i+1}^t\mu _{j+1}^t(1-\mu _{i+1}^t)(1-\mu _{j+1}^t) \right\} ^{\frac{1}{2}} \\ \le&(n - |t|)^{-\frac{1}{2}} \frac{7}{6} (2d+1)^{5d+\frac{11}{2}} (p^{-|t|} - 1)^{-\frac{3}{2}} \sum _{i,j,k=1}^d \left( (\mu _{i+1}^t\mu _{j+1}^t)^{-1}(\mu _{k+1}^t)^{-2}\right) ^{\frac{1}{2}} \\ \le&\left\{ \frac{7}{6} (2d+1)^{5d+\frac{17}{2}} (p^{-|t|} - 1)^{-\frac{3}{2}}p^{-(d+1)(d+2|t|)} \right\} (n - |t|)^{-\frac{1}{2}}. \end{aligned}$$

\(\square \)

Remark 5.4

Recall that \({\mathbb {E}}\left\{ T^t_{k+1} \right\} = \left( {\begin{array}{c}n - |t|\\ k+1\end{array}}\right) p^{\left( {\begin{array}{c}k+1\\ 2\end{array}}\right) + |t|(k+1)}\). By Stirling’s approximation, if \(p \in (0,1)\) is a constant and \(\max (k, |t|) = \Omega (\ln ^{1+\epsilon }(n))\) for some fixed \(\epsilon > 0\), then this expectation tends to 0 asymptotically. Hence, by Markov’s inequality, for constant p and any \(\epsilon > 0\), with high probability there are no k-simplices in the link of t whenever \(\max (k, |t|)\) is of order \(\ln ^{1+\epsilon }(n)\) or larger.

Recall that in Theorem 5.3 we count all simplices up to dimension d in the link of t. Note that if \(\max (d^2, d|t|) = O(\ln ^{1 - \epsilon }(n))\) for some \(\epsilon > 0\), then the bounds in Theorem 5.3 tend to 0 as n tends to infinity, as long as \(p \in (0,1)\) stays constant. In particular, if d is a constant, then Theorem 5.3 applies whenever \(|t| = O(\ln ^{1-\epsilon }(n))\); by the previous paragraph, once \(|t|\) is of order \(\ln ^{1+\epsilon }(n)\) the link contains no simplices with high probability, so the approximation covers essentially all sizes of t for which it is needed.

6 Simplex counts in \({{\textbf {X}}(n,p)}\)

In this section we apply Theorem 3.9 to approximate simplex counts. Consider \({G \sim {\textbf {G}}(n,p)}\). For \(1 \le x < y \le n\) let \(Y_{x,y} :=\mathbb {1}\left( x \sim y\right) \) be the edge indicator. We are interested in the \((i+1)\)-clique count in \({{\textbf {G}}(n,p)}\) or, equivalently, the i-simplex count in \({{\textbf {X}}(n,p)}\), given by

$$\begin{aligned} T_{i+1} = \sum _{s \in C_{i+1}} \prod _{x \ne y \in s} Y_{x,y}. \end{aligned}$$
(6.1)
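As a quick illustration of the counts in Eq. (6.1), the following Python sketch estimates \({\mathbb {E}}\{T_{i+1}\}\) by sampling \({{\textbf {G}}(n,p)}\) and counting cliques, and compares the estimate with the exact value \(\left( {\begin{array}{c}n\\ i+1\end{array}}\right) p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\). It is illustrative only (the parameter values are ours) and assumes the networkx package is available.

from math import comb
import networkx as nx

n, p, d, reps = 30, 0.4, 3, 200
counts = {i: 0 for i in range(1, d + 1)}   # running totals of i-simplex counts

for seed in range(reps):
    G = nx.gnp_random_graph(n, p, seed=seed)
    for clique in nx.enumerate_all_cliques(G):   # every clique of every size
        size = len(clique)
        if 2 <= size <= d + 1:
            counts[size - 1] += 1                # an (i+1)-clique is an i-simplex

for i in range(1, d + 1):
    empirical = counts[i] / reps
    exact = comb(n, i + 1) * p ** comb(i + 1, 2)
    print(f"i = {i}: empirical mean {empirical:.1f}, exact mean {exact:.1f}")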

Let \({\mathscr {Y}}^{i+1} = \{0,1\}^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\) and let \(f_i: {\mathscr {Y}}^{i+1} \rightarrow {\mathbb {R}}\) be the function

$$\begin{aligned}f_i({\mathscr {Y}}_s) = {\prod _{Y_{x,y} \in {\mathscr {Y}}_s} Y_{x,y}}.\end{aligned}$$

Then the associated generalised U-statistic \(S_{n, i + 1}(f_i)\) equals the \((i+1)\)-clique count \(T_{i+1}\), as given by Eq. (6.1). To apply Theorem 3.9 we need to center and rescale our variables. It is easy to see that \({\mathbb {E}}\{ f_i({\mathscr {Y}}_\phi )\} = p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\) if \(\phi \in C_{i+1}\). We let \({\mathbb {I}}_i :=C_{i+1} \times {\left\{ {i}\right\} }\) and for \(s = (\phi ,i) \in {\mathbb {I}}_i\) we define \(X_{s} :=\sigma _i^{-1} \left( f_i({\mathscr {Y}}_\phi ) - p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) } \right) \), where \(\sigma _i = \sqrt{{{\,\textrm{Var}\,}}(T_{i+1})}\), and \(W_i = \sum _{s \in {\mathbb {I}}_i} X_{s}\). Now the vector of interest is \(W = (W_1, W_2, \ldots , W_d) \in {\mathbb {R}}^d\). This brings us to the next approximation theorem.

Corollary 6.1

Let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\) and \(\Sigma \) be the covariance matrix of W.

  1. (1)

    Let \(h \in {\mathscr {H}}_d\). Then

    $$\begin{aligned} \left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \le \left| h \right| _3 B_{6.1} n^{-1}. \end{aligned}$$
  2. (2)

    Moreover,

    $$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)| \le 2^{\frac{7}{2}} 3^{-\frac{3}{4}}d^{\frac{3}{16}}B_{6.1}^{\frac{1}{4}}n^{-\frac{1}{4}}. \end{aligned}$$

Here

$$\begin{aligned}B_{6.1} = \frac{16}{3} d^{2d+5} p^{-3\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) + 1}(1-p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) })(p^{-1} - 1)^{-\frac{3}{2}}.\end{aligned}$$

Proof

Firstly, observe that for any \(\phi , \psi \in C_{i+1}\) for which \(| \phi \cap \psi | \le 1\) the covariance vanishes, while if \(| \phi \cap \psi | \ge 2\) the covariance is non-zero, and we have

$$\begin{aligned}{{\,\textrm{Cov}\,}}(f_i({\mathscr {Y}}_\phi ), f_i({\mathscr {Y}}_\psi )) = p^{2\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) - \left( {\begin{array}{c}|\phi \cap \psi |\\ 2\end{array}}\right) } - p^{2\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) } \ge p^{2\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }(p^{-1} - 1).\end{aligned}$$

For \(s = (\phi ,i) \in {\mathbb {I}}_i\) write \({\hat{X}}_{s} = f_i({\mathscr {Y}}_\phi ) - p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }\). Then by Lemma 3.5 we get:

$$\begin{aligned}&{\mathbb {E}}\left| {\hat{X}}_{s} {\hat{X}}_{t} \right| {\mathbb {E}}\left| {\hat{X}}_{u} \right| \le \left\{ p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) + \left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }(1-p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) })(1-p^{\left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }) \right\} ^{\frac{1}{2}}; \\&{\mathbb {E}}\left| {\hat{X}}_{s} {\hat{X}}_{t} {\hat{X}}_{u} \right| \le \left\{ p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) + \left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }(1-p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) })(1-p^{\left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }) \right\} ^{\frac{1}{2}}. \end{aligned}$$

Since \( \left\{ p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) + \left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }(1-p^{\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) })(1-p^{\left( {\begin{array}{c}j+1\\ 2\end{array}}\right) }) \right\} ^{\frac{1}{2}} \le p(1-p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) })\), we see that Assumption 3.7 holds. Assumption 3.8 also holds and therefore we can apply Theorem 3.9 with \(k_i = i+1\), \(K_i = (2(i+1)^2 - 2(i+1))^{-\frac{1}{2}(i+1) + 1}\), \(\alpha _i = p^{2\left( {\begin{array}{c}i+1\\ 2\end{array}}\right) }(p^{-1} - 1)\), and \(\beta = p(1-p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) })\). Using the bounds \(K_i \le 1\) as well as \(2 \le k_i^{\min (k_i,k_j) + 1} \le d^{d+1}\), and \(\sqrt{\alpha _i} \ge p^{\left( {\begin{array}{c}d+1\\ 2\end{array}}\right) }\sqrt{p^{-1}-1}\) finishes the proof. \(\square \)

Remark 6.2

It is easy to show that with high probability there are no large cliques in \({{\textbf {G}}(n,p)}\) for \(p<1\) constant. To see this, note that the expected number of k-cliques is \(\left( {\begin{array}{c}n\\ k\end{array}}\right) p^{\left( {\begin{array}{c}k\\ 2\end{array}}\right) }\). By Stirling’s approximation, if \(k = \Omega (\ln ^{1+\epsilon }(n))\) for some fixed \(\epsilon > 0\), then this expectation tends to 0 asymptotically. Hence, by Markov’s inequality, for any \(\epsilon > 0\), with high probability there are no cliques of order \(\ln ^{1+\epsilon }(n)\) or larger.
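One way to make the Stirling step explicit is via the elementary bound \(\left( {\begin{array}{c}n\\ k\end{array}}\right) \le (en/k)^k\):

$$\begin{aligned} \left( {\begin{array}{c}n\\ k\end{array}}\right) p^{\left( {\begin{array}{c}k\\ 2\end{array}}\right) } \le \left( \frac{en}{k}\right) ^{k} p^{\frac{k(k-1)}{2}} = \exp \left\{ k \left( 1 + \ln n - \ln k - \frac{k-1}{2}\ln \frac{1}{p} \right) \right\} , \end{aligned}$$

which tends to 0 whenever \(\frac{k-1}{2}\ln \frac{1}{p} - \ln n \rightarrow \infty \); in particular, this holds for \(k = \Omega (\ln ^{1+\epsilon }(n))\) with \(p < 1\) constant.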

Recall that in Corollary 6.1 the largest cliques we count have \(d + 1\) vertices. Note that if \(d = O(\ln ^{\frac{1}{2} - \epsilon }(n))\) for some \(\epsilon > 0\), then the bounds in Corollary 6.1 tend to 0 as n tends to infinity, as long as \(p \in (0,1)\) stays constant. This range might seem small, but in light of the fact that with high probability there are no cliques of order \(\ln ^{1+\epsilon }(n)\), it is meaningfully large.

Remark 6.3

Note that in Corollary 6.1 we use a multivariate normal distribution with covariance matrix \(\Sigma \), the covariance of W for finite n, which differs from the limiting covariance, as mentioned in Reinert and Röllin (2010). To approximate W with the limiting distribution, in the spirit of Reinert and Röllin (2010, Proposition 3) one could proceed in two steps: use the existing theorems to approximate W with \(\Sigma ^{\frac{1}{2}}Z\), and then approximate \(\Sigma ^{\frac{1}{2}}Z\) with \(\Sigma _L^{\frac{1}{2}} Z\), where \(\Sigma _L\) is the limiting covariance, which is non-invertible, as observed in Janson and Nowicki (1991).

Remark 6.4

Corollary 6.1 generalises the result (Reinert and Röllin 2010, Proposition 2) beyond the case \(d=2\), and we obtain a bound of the same order in n. Kaur and Röllin (2021, Theorem 3.1) considers centered subgraph counts in a random graph associated with a graphon. If we take the graphon to be constant, the associated random graph is just \({{\textbf {G}}(n,p)}\). Compared to Kaur and Röllin (2021, Theorem 3.1), we place weaker smoothness conditions on our test functions. However, we make use of the special structure of cliques, whereas Kaur and Röllin (2021, Theorem 3.1) applies to arbitrary centered subgraph counts. Translating Kaur and Röllin (2021, Theorem 3.1) into a result for uncentered subgraph counts, as we provide here in the special case of clique counts, is not trivial for general d.

However, it should be possible to extend our results, using the same abstract approximation theorem, beyond the random clique complex to Linial-Meshulam random complexes (Linial and Meshulam 2006) or even more general multiparameter random complexes (Costa and Farber 2016). We shall consider this conjecture in future work.

7 A proof of the multivariate CLT for dissociated sums and U-statistics

7.1 A proof of the multivariate CLT

Throughout this subsection, \(W \in {\mathbb {R}}^d\) is a vector of dissociated sums in the sense of Definition 3.1, with covariance matrix whose entries are \(\Sigma _{ij} = {{\,\textrm{Cov}\,}}(W_i, W_j)\) for \((i,j) \in [d]^2\). For each \(s \in {\mathbb {I}}\) we denote by \({\mathbb {D}}(s) \subset {\mathbb {I}}\) the disjoint union \(\bigcup _{j=1}^d {\mathbb {D}}_j(s)\). For each triple \((s,t,j) \in {\mathbb {I}}^2 \times [d]\) we write the set-difference \({\mathbb {D}}_j(t) {\setminus } {\mathbb {D}}_j(s)\) as \({\mathbb {D}}_j(t;s)\), with \({\mathbb {D}}(t;s) \subset {\mathbb {I}}\) denoting the disjoint union of such differences over \(j \in [d]\).

7.1.1 Smooth test functions

To prove Theorem 3.2 we use Stein’s method for multivariate normal distributions; for details see for example Chapter 12 in Chen (2011). Our proof of Theorem 3.2 is based on the Stein characterization of the multivariate normal distribution: \(Z \in {\mathbb {R}}^d\) is a multivariate normal \(\textrm{MVN}(0, \Sigma )\) if and only if the identity

$$\begin{aligned} {\mathbb {E}}\left\{ \nabla ^{T} \Sigma \nabla f(Z)-Z^{T} \nabla f(Z) \right\} =0 \end{aligned}$$
(7.1)

holds for all twice continuously differentiable \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) for which the expectation exists. In particular, we will use the following result based on Meckes (2009, Lemma 1 and Lemma 2). As Lemma 1 and Lemma 2 in Meckes (2009) are stated there only for infinitely differentiable test functions, we give the proof here for completeness.
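Before stating it, the following short simulation illustrates the characterisation (7.1) numerically; it plays no role in the proofs, and the covariance matrix, the direction b and the test function \(f(z) = \cos (b^T z)\) are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # an arbitrary positive definite covariance (our choice)
b = np.array([0.7, -1.3])        # fixed direction; test function f(z) = cos(b . z)

# For f(z) = cos(b . z):  grad f(z) = -sin(b . z) b,  Hess f(z) = -cos(b . z) b b^T,
# so  grad^T Sigma grad f(z) = -(b^T Sigma b) cos(b . z).
Z = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
bz = Z @ b
lhs = -(b @ Sigma @ b) * np.cos(bz)   # grad^T Sigma grad f(Z)
rhs = -np.sin(bz) * bz                # Z^T grad f(Z)
print(np.mean(lhs - rhs))             # approximately 0, in line with (7.1)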

Lemma 7.1

(Lemma 1 and Lemma 2 in Meckes (2009)) Fix \(n \ge 2\). Let \(h: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) be n times continuously differentiable with Lipschitz n-th partial derivatives, and let \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). Then, if \(\Sigma \in {\mathbb {R}}^{d \times d}\) is symmetric positive semidefinite, there exists a solution \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) to the equation

$$\begin{aligned} \nabla ^{T} \Sigma \nabla f(w)-w^{T} \nabla f(w)=h(w)-{\mathbb {E}}h\left( \Sigma ^{1 / 2} Z\right) , \quad w \in {\mathbb {R}}^{d}, \end{aligned}$$
(7.2)

such that f is n times continuously differentiable and we have for every \(k=1, \ldots , n\):

$$\begin{aligned}\left| f \right| _k \le \frac{1}{k}\left| h \right| _k.\end{aligned}$$

Proof

Let h be as in the assertion. It is shown in Lemma 2.1 in Chatterjee and Meckes (2008), which is based on a reformulation of Eq. (2.20) in Barbour (1990), that a solution of (7.2) for h is given by \(f(x) = f_h(x) = \int _0^1 \frac{1}{2t} {\mathbb {E}}\{ h( Z_{x,t} ) \} \, dt\), with \(Z_{x, t}=\sqrt{t} x+\sqrt{1-t} \Sigma ^{1 / 2} Z\). As the n-th partial derivatives of h are Lipschitz, we may differentiate under the integral sign, and it is then straightforward to see that the solution f is n times continuously differentiable.

The bound on \(\left| f \right| _k\) is a consequence of

$$\begin{aligned}\frac{\partial ^{k} f}{\partial x_{i_{1}} \cdots \partial x_{i_{k}}}(x)=\int _{0}^{1}(2 t)^{-1} t^{k / 2} {\mathbb {E}}\left\{ \frac{\partial ^{k} h}{\partial x_{i_{1}} \cdots \partial x_{i_{k}}}\left( Z_{x, t}\right) \right\} d t\end{aligned}$$

for any \(i_1, i_2, \ldots , i_k\); see, for example, Equation (10) in Meckes (2009). Taking the sup-norm on both sides and bounding the right hand side of the equation gives

$$\begin{aligned}\left\Vert \frac{\partial ^{k} f}{\partial x_{i_{1}} \cdots \partial x_{i_{k}}}\right\Vert _{\infty } \le \left\Vert \frac{\partial ^{k} h}{\partial x_{i_{1}} \cdots \partial x_{i_{k}}}\right\Vert _{\infty } \int _{0}^{1}(2 t)^{-1} t^{k / 2}dt \le \frac{1}{k} \left| h \right| _k.\end{aligned}$$

\(\square \)

Proof of Theorem 3.2

To prove Theorem 3.2, we replace w by W in Eq. (7.2) and take the expected value on both sides. As a result, we aim to bound the expression

$$\begin{aligned} \left| {\mathbb {E}}\left\{ \nabla ^{T} \Sigma \nabla f(W)-W^{T} \nabla f(W) \right\} \right| = \left| {\mathbb {E}}\left\{ \sum _{i,j=1}^d \partial _{ij}f(W)\Sigma _{ij} - \sum _{i=1}^d W_i \partial _i f(W) \right\} \right| \end{aligned}$$
(7.3)

where f is a solution to the Stein equation (7.2) for the test function h. Since the variables \({\left\{ {X_s \mid s \in {\mathbb {I}}}\right\} }\) are centered and \(X_t\) is independent of \(X_s\) whenever \(t \not \in {\mathbb {D}}(s)\), for each \((i,j) \in [d]^2\) we have

$$\begin{aligned} \Sigma _{ij} = {{\,\textrm{Cov}\,}}(W_i, W_j) = \hspace{-.2em} \sum _{s \in {\mathbb {I}}_i} \sum _{t \in {\mathbb {D}}_j(s)} \hspace{-.2em}{\mathbb {E}}\left\{ X_sX_t \right\} . \end{aligned}$$
(7.4)

We now use the decomposition of \(\Sigma _{ij}\) from (7.4) in the expression (7.3). For each pair \((s,j) \in {\mathbb {I}}\times [d]\) and \(t \in {\mathbb {D}}(s)\) we set \( {\mathbb {D}}_j(t;s) = {\mathbb {D}}_j(t) {\setminus } {\mathbb {D}}_j(s) \) and

$$\begin{aligned} U_j^s :=\hspace{-.4em} \sum _{u \in {\mathbb {D}}_j(s)} \hspace{-.4em} X_u;\quad W_j^s :=W_j - U_j^s, \quad \text{ and } \quad V_j^{s,t} :=\hspace{-1em} \sum _{v \in {\mathbb {D}}_j(t;s)} \hspace{-1em} X_v; \quad W_j^{s,t} :=W_j^s - V_j^{s,t}. \end{aligned}$$
(7.5)

By Definition 3.1, \(W_j^s \) is independent of \(X_s\), while \(W_j^{s,t}\) is independent of the pair \((X_s,X_t)\).

Next we decompose the right-hand side of (7.3):

$$\begin{aligned} \left| {\mathbb {E}}\left\{ \sum _{i=1}^d W_i \partial _i f(W) - \sum _{i,j=1}^d \partial _{ij}f(W)\Sigma _{ij} \right\} \right| = \left| R_1 + R_2 + R_3 \right| ; \end{aligned}$$

with

$$\begin{aligned} R_1&= \sum _{i=1}^d {\mathbb {E}}\left\{ W_i \partial _i f(W) \right\} - \sum _{s \in {\mathbb {I}}} \sum _{j=1}^d {\mathbb {E}}\left\{ X_s U^s_j \partial _{|s|j}f(W^s) \right\} , \end{aligned}$$
(7.6)
$$\begin{aligned} R_2&= \sum _{s \in {\mathbb {I}}} \sum _{j=1}^d {\mathbb {E}}\left\{ X_s U^s_j \partial _{|s|j}f(W^s) \right\} - \sum _{s \in {\mathbb {I}}}\sum _{t \in {\mathbb {D}}(s)} {\mathbb {E}}\left\{ X_sX_t \right\} {\mathbb {E}}\partial _{|s||t|}f(W^{s,t}), \text { and}\end{aligned}$$
(7.7)
$$\begin{aligned} R_3&= \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} {\mathbb {E}}\left\{ X_{s} X_{t} \right\} \left( {\mathbb {E}}\partial _{|s||t|}f(W^{s,t}) - {\mathbb {E}}\partial _{|s||t|}f(W) \right) . \end{aligned}$$
(7.8)

Here we recall that if \(s=(k,i)\) then \(|s|=i \in [d]\). \(\square \)

As with the vector of dissociated sums \(W \in {\mathbb {R}}^d\) itself, we can assemble these differences into random vectors. Thus, \(W^s \in {\mathbb {R}}^d\) is \((W^s_1,\ldots ,W^s_d)\), and similarly \(W^{s,t} = (W^{s,t}_1,\ldots ,W^{s,t}_d)\). In the next three claims, we provide bounds on \(R_i\) for \(i \in [3]\).

Claim 7.2

The absolute value of the expression \(R_1\) from (7.6) is bounded above by

$$\begin{aligned} |R_1| \le \left( \frac{1}{2} \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(s)} {\mathbb {E}}\left| X_sX_tX_u \right| \right) \left| f \right| _3. \end{aligned}$$

Proof

Note that

$$\begin{aligned} R_1&= \sum _{i=1}^d \sum _{s \in {\mathbb {I}}_i}{\mathbb {E}}\left\{ X_s \partial _i f(W) \right\} - \sum _{s \in {\mathbb {I}}} \sum _{j=1}^d {\mathbb {E}}\left\{ X_s U^s_j \partial _{|s|j}f(W^s) \right\}&\\&= \sum _{i=1}^d \sum _{s \in {\mathbb {I}}_i}\left( {\mathbb {E}}\left\{ X_s \partial _i f(W) \right\} - \sum _{j=1}^d {\mathbb {E}}\left\{ X_s U^s_j \partial _{ij}f(W^s) \right\} \right) .&\end{aligned}$$

For each \(s \in {\mathbb {I}}_{i}\), it follows from (7.5) that \(W = U^s + W^s\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain

$$\begin{aligned} \partial _i f(W) = \partial _i f(W^s) + \sum _{j=1}^d \partial _{ij}f(W^s) U^s_j + \frac{1}{2} \sum _{j,k=1}^d \partial _{ijk}f(W^s + \theta _s U^s) U^s_j U^s_k \end{aligned}$$

for some random \(\theta _s \in (0,1)\). Using this Taylor expansion in the expression for \(R_1\), we get the following four-term summand \(S_{i,s}\) for each \(i \in [d]\) and \(s \in {\mathbb {I}}_i\):

$$\begin{aligned} S_{i,s}&= {\mathbb {E}}\left\{ X_s \partial _i f(W^s) \right\} + \sum _{j=1}^d {\mathbb {E}}\left\{ X_s\partial _{ij}f(W^s) U^s_j \right\} \\&\quad + \frac{1}{2} \sum _{j,k=1}^d {\mathbb {E}}\left\{ X_s\partial _{ijk}f(W^s + \theta _s U^s) U^s_j U^s_k \right\} - \sum _{j=1}^d {\mathbb {E}}\left\{ X_s \partial _{ij}f(W^s)U^s_j \right\} . \end{aligned}$$

The second and fourth terms cancel each other. Recalling that \(X_s\) is centered by definition and independent of \(W^s\) by Definition 3.1, the first term also vanishes and

$$\begin{aligned} R_1 = \sum _{i=1}^d\sum _{s \in {\mathbb {I}}_i} S_{i,s} = \frac{1}{2} \sum _{i,j,k=1}^d\sum _{s \in {\mathbb {I}}_i} {\mathbb {E}}\left\{ X_s\partial _{ijk}f(W^s + \theta _s U^s) U^s_j U^s_k \right\} . \end{aligned}$$

Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) and that \(U^s_j = \sum _{t \in {\mathbb {D}}_j(s)}X_t\), we have:

$$\begin{aligned} |R_1|&\le \frac{1}{2} \sum _{i,j,k=1}^d \sum _{s \in {\mathbb {I}}_i} {\mathbb {E}}\left| X_s \partial _{ijk}f(W^s + \theta _s U^s) U^s_j U^s_k \right| \\&\le \frac{\left| f \right| _3}{2} \sum _{i,j,k=1}^d \sum _{s \in {\mathbb {I}}_i} {\mathbb {E}}\left| X_{s} \sum _{t \in {\mathbb {D}}_j(s)} X_{t} \sum _{u \in {\mathbb {D}}_k(s)} X_{u} \right| \\&\le \frac{\left| f \right| _3}{2} \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(s)} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| , \end{aligned}$$

as desired. \(\square \)

Claim 7.3

The absolute value of the expression \(R_2\) from (7.7) is bounded above by

$$\begin{aligned} |R_2| \le \left( \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(t;s)} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| \right) \left| f \right| _3. \end{aligned}$$

Proof

Recalling that \(U^s_j = \sum _{t \in {\mathbb {D}}_j(s)} X_{t}\) and \({\mathbb {D}}(s) = \bigcup _{j=1}^d {\mathbb {D}}_j(s)\),

$$\begin{aligned} R_2&= \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \left\{ {\mathbb {E}}\left\{ X_{s} X_{t} \partial _{\left| s \right| \left| t \right| }f(W^s) \right\} - {\mathbb {E}}\left\{ X_{s}X_{t} \right\} {\mathbb {E}}\left\{ \partial _{\left| s \right| \left| t \right| }f(W^{s,t}) \right\} \right\} . \end{aligned}$$

Fix \(s \in {\mathbb {I}}\) and \(t \in {\mathbb {D}}_j(s)\). Recall that by (7.5), \(W^s = W^{s,t} + V^{s,t}\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain:

$$\begin{aligned} \partial _{\left| s \right| \left| t \right| } f(W^s) = \partial _{\left| s \right| \left| t \right| } f(W^{s,t}) + \sum _{k=1}^d \partial _{\left| s \right| \left| t \right| k}f(W^{s,t} + \theta _{s,t} V^{s,t}) V^{s,t}_k \end{aligned}$$

for some random \(\theta _{s,t} \in (0,1)\). Using this Taylor expansion in the expression for \(R_2\), we get the following three-term summand \(S_{s,t}\) for each pair \((s, t) \in {\mathbb {I}}\times {\mathbb {D}}_j(s)\):

$$\begin{aligned} S_{s,t}&= {\mathbb {E}}\left\{ X_{s} X_{t} \partial _{\left| s \right| \left| t \right| }f(W^{s,t}) \right\} + \sum _{k=1}^d {\mathbb {E}}\left\{ X_{s} X_{t} \partial _{\left| s \right| \left| t \right| k}f(W^{s,t} + \theta _{s,t} V^{s,t}) V^{s,t}_k \right\} \\&\quad -{\mathbb {E}}\left\{ X_{s}X_{t} \right\} {\mathbb {E}}\left\{ \partial _{\left| s \right| \left| t \right| }f(W^{s,t}) \right\} . \end{aligned}$$

Recalling that \(W^{s,t}\) is independent of the pair \((X_s, X_t)\), the first and the last terms cancel each other and only the sum over k is left:

$$\begin{aligned} R_2&= \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} S_{s,t} = \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{k=1}^d {\mathbb {E}}\left\{ X_{s} X_{t} \partial _{\left| s \right| \left| t \right| k}f(W^{s,t} + \theta _{s,t} V^{s,t}) V^{s,t}_k \right\} . \end{aligned}$$

Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) and that \(V_k^{s,t} = \sum _{v \in {\mathbb {D}}_k(t;s)} X_v\) we have:

$$\begin{aligned} |R_2|&\le \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{k=1}^d \sum _{v \in {\mathbb {D}}_k(t;s)} {\mathbb {E}}\left| X_{s} X_{t} X_{v} \partial _{\left| s \right| \left| t \right| k}f(W^{s,t} + \theta _{s,t} V^{s,t}) \right| \\&\le \left| f \right| _3 \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(t;s)} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| , \end{aligned}$$

as required. \(\square \)

Claim 7.4

The absolute value of the expression \(R_3\) from (7.8) is bounded above by

$$\begin{aligned} |R_3| \le \left( \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \left\{ \sum _{u \in {\mathbb {D}}(s)} {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| + \sum _{u \in {\mathbb {D}}(t;s)} {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| \right\} \right) \left| f \right| _3 . \end{aligned}$$

Proof

Fix \((s, t) \in {\mathbb {I}}\times {\mathbb {D}}_j(s)\). Recall that by (7.5), \(W^{s,t} = W - U^s - V^{s,t}\). Using the Lagrange form of the remainder term in Taylor’s theorem, we obtain

$$\begin{aligned} \partial _{\left| s \right| \left| t \right| }f(W^{s,t}) = \partial _{\left| s \right| \left| t \right| }f(W) - \sum _{k=1}^d \partial _{\left| s \right| \left| t \right| k}f(W - \rho _{s,t}(U^s + V^{s,t}))(U^s_k + V^{s,t}_k) \end{aligned}$$

for some random \(\rho _{s,t} \in (0,1)\). Recalling that \(U_k^s = \sum _{u \in {\mathbb {D}}_k(s)} X_u\) and \(V_k^{s,t} =\sum _{v \in {\mathbb {D}}_k(t;s)} X_v\),

$$\begin{aligned} R_3 =&-\sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{k=1}^d {\mathbb {E}}\left\{ X_{s} X_{t} \right\} {\mathbb {E}}\left\{ \partial _{\left| s \right| \left| t \right| k}f(W - \rho _{s,t}(U^s + V^{s,t}))(U^s_k + V^{s,t}_k) \right\} \\ =&- \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(s)} {\mathbb {E}}\left\{ X_{s} X_{t} \right\} {\mathbb {E}}\left\{ X_{u} \partial _{|s||t||u|}f(W - \rho _{s,t}(U^s + V^{s,t})) \right\} \\&-\sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(t;s)} {\mathbb {E}}\left\{ X_{s} X_{t} \right\} {\mathbb {E}}\left\{ X_{u} \partial _{|s||t||u|}f(W - \rho _{s,t}(U^s + V^{s,t})) \right\} . \end{aligned}$$

Recalling that \(\left\Vert \partial _{ijk}f\right\Vert _{\infty } \le \left| f \right| _3\) we bound:

$$\begin{aligned} |R_3| \le&\left| f \right| _3 \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(s)} {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| + \left| f \right| _3 \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(t;s)} {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| , \end{aligned}$$

as required. \(\square \)

Take any \(h \in {\mathscr {H}}_d\). Let \(f: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be the associated solution from Lemma 7.1. Combining Claims 7.2–7.4 and using Lemma 7.1 we have:

$$\begin{aligned}&\left| {\mathbb {E}}h(W) - {\mathbb {E}}h(\Sigma ^{\frac{1}{2}}Z) \right| \\ \le&\left| {\mathbb {E}}\left\{ \nabla ^{T} \Sigma \nabla f(W)-W^{T} \nabla f(W) \right\} \right| \le |R_1| + |R_2| + |R_3| \\ \le&\left| f \right| _3 \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(s)} \left( \frac{1}{2} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| + {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| \right) \\&+\left| f \right| _3 \sum _{s \in {\mathbb {I}}} \sum _{t \in {\mathbb {D}}(s)} \sum _{u \in {\mathbb {D}}(t;s)} \left( {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| + {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| \right) \\ \le&\frac{1}{3} \left| h \right| _3 B_{3.2}. \end{aligned}$$

7.1.2 Non-smooth test functions

Convex set indicator test functions provide a stronger distance between probability distributions. Also, from a non-asymptotic perspective, this distance may be more useful in statistical applications: for example, when estimating confidence regions, which are often convex sets. Here we follow (Kaur and Röllin 2021, Section 5.3) very closely to derive a bound on the convex set distance between a vector of dissociated sums \(W \in {\mathbb {R}}^d\) with covariance matrix \(\Sigma \) and a target multivariate normal distribution \(\Sigma ^{\frac{1}{2}}Z\), where \(Z \sim \textrm{MVN}(0, \textrm{Id}_{d \times d})\). The smoothing technique used here was introduced in Gan et al. (2017). However, a better (polylogarithmic) dependence on d could potentially be achieved using a recent result (Gaunt and Li 2023, Proposition 2.6), at the expense of larger constants. The recursive approach from Schulte and Yukich (2019), Kasprzak and Peccati (2022) usually yields better dependence on n; however, it requires the target normal distribution to have an invertible covariance matrix. Since this property does not always hold in our applications of interest, we do not use the recursive approach here.

Proof of Theorem 3.3

Fix \(A \in {\mathscr {K}}\), \(\epsilon > 0\) and define

$$\begin{aligned} A^{\epsilon }=\left\{ y \in {\mathbb {R}}^{d}: d(y, A)<\epsilon \right\} , \quad \text{ and } \quad A^{-\epsilon }=\left\{ y \in {\mathbb {R}}^{d}: B(y; \epsilon ) \subseteq A\right\} \end{aligned}$$

where \(d(y, A)=\inf _{x \in A}\left\Vert x-y \right\Vert _2\) and \(B(y; \epsilon ) = \left\{ \,z \in {\mathbb {R}}^d \;\vert \; \left\Vert y-z\right\Vert _2 \le \epsilon \,\right\} \).

Let \( {\mathscr {H}}_{\epsilon , A} :={\left\{ {h_{\epsilon , A}: {\mathbb {R}}^{d} \rightarrow [0,1]; A \in {\mathscr {K}}}\right\} }\) be a class of functions such that \(h_{\epsilon , A}(x)=1\) for \(x \in A\) and 0 for \(x \notin A^{\epsilon }\). Then, by Bentkus (2003, Lemma 2.1) as well as inequalities (1.2) and (1.4) from Bentkus (2003), for any \(\epsilon > 0\) we have

$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}} Z \in A)| \le 4 d^{\frac{1}{4}} \epsilon +\sup _{A \in {\mathscr {K}}}\left| {\mathbb {E}}h_{\epsilon , A}(W)-{\mathbb {E}}h_{\epsilon , A}(\Sigma ^{\frac{1}{2}}Z)\right| . \end{aligned}$$

Let \(f: {\mathbb {R}}^{d} \rightarrow {\mathbb {R}}\) be a bounded Lebesgue measurable function, and for \(\delta >0\) let

$$\begin{aligned} \left( S_{\delta } f\right) (x)=\frac{1}{(2 \delta )^{d}} \int _{x_{1}-\delta }^{x_{1}+\delta } \cdots \int _{x_{d}-\delta }^{x_{d}+\delta } f(z) d z_{d} \ldots d z_{1}. \end{aligned}$$

Set \(\delta =\frac{\epsilon }{16\sqrt{d}}\) and \(h_{\epsilon , A}=S_{\delta }^{4} I_{A^{\epsilon /4}}\), where \(I_{A^{\epsilon /4}}\) is the indicator function of the subset \(A^{\epsilon /4} \subseteq {\mathbb {R}}^d\). By Gan et al. (2017, Lemma 3.9) we have that \(h_{\epsilon , A}\) is bounded and is in \({\mathscr {H}}_d\).

Moreover, the following bounds hold:

$$\begin{aligned} \left\| h_{\epsilon , A}\right\| _{\infty } \le 1, \quad \left| h_{\epsilon , A}\right| _{2} \le \frac{1}{\epsilon ^{2}}, \quad \left| h_{\epsilon , A}\right| _{3} \le \frac{1}{\epsilon ^{3}}. \end{aligned}$$

Note that \(h_{\epsilon , A}=S_{\delta }^{4} I_{A^{\epsilon /4}} \in {\mathscr {H}}_{\epsilon , A}\) and hence (Bentkus 2003, Lemma 2.1) applies. Using this with Theorem 3.2 we get

$$\begin{aligned} \sup _{A \in {\mathscr {K}}}|{\mathbb {P}}(W \in A)-{\mathbb {P}}(\Sigma ^{\frac{1}{2}}Z \in A)|\le & {} 4 d^{\frac{1}{4}} \epsilon +\sup _{A \in {\mathscr {K}}}\left| {\mathbb {E}}h_{\epsilon , A}(W)-{\mathbb {E}}h_{\epsilon , A}(\Sigma ^{\frac{1}{2}}Z)\right| \\\le & {} 4 d^{\frac{1}{4}} \epsilon + \frac{1}{3\epsilon ^3} B_{3.2}. \end{aligned}$$

Since this bound holds for every \(\epsilon > 0\), we take \(\epsilon = \left( \frac{3B_{3.2}}{4d^{\frac{1}{4}}}\right) ^{\frac{1}{4}}\), which yields the stated bound. \(\square \)

7.2 A proof of the CLT for U-statistics

Proof of Theorem 3.9

Note that if \(s = (\phi ,i) \in {\mathbb {I}}_i\) and \(u = (\psi , j) \in {\mathbb {I}}_j\) are chosen such that \(\phi \cap \psi = \varnothing \), then the corresponding variables \(X_{s}\) and \(X_{u}\) are independent since \(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s)\) and \(f_j({\mathscr {X}}_u, {\mathscr {Y}}_u)\) do not share any random variables from the sets \(\{\xi _i\}_{1 \le i \le n}\) and \(\{Y_{i,j}\}_{1 \le i < j \le n}\). Hence, if for any \(s = (\phi ,i) \in {\mathbb {I}}_i\) we set \({\mathbb {D}}_j(s) = \left\{ \,(\psi , j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 1\,\right\} \), then W satisfies the assumptions of Corollary 3.4. It remains to bound the quantity \(B_{3.4}\).

First, to find \(\alpha _{ij}\) as in Corollary 3.4: given \(\phi \in C_{k_i}\) and \(m \le \min (k_i, k_j)\), there are \(\left( {\begin{array}{c}k_i\\ m\end{array}}\right) \left( {\begin{array}{c}n-k_i\\ k_j-m\end{array}}\right) \) subsets \(\psi \in C_{k_j}\) such that \(|\phi \cap \psi | = m\). Therefore, for any \(i,j \in [d]\) and \(s \in {\mathbb {I}}_i\) we have

$$\begin{aligned} |{\mathbb {D}}_j(s)| = \sum _{m=1}^{\min (k_i,k_j)} \left( {\begin{array}{c}k_i\\ m\end{array}}\right) \left( {\begin{array}{c}n-k_i\\ k_j-m\end{array}}\right) = \alpha _{ij} \le k_i^{\min (k_i,k_j) + 1} (n-k_i)^{k_j-1}. \end{aligned}$$
(7.9)

Note that

$$\begin{aligned}{} & {} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| \\{} & {} \quad = (\sigma _i \sigma _j \sigma _l)^{-1} {\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \left\{ f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right\} \right| \end{aligned}$$

as well as

$$\begin{aligned}{} & {} {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| \\{} & {} \quad = (\sigma _i \sigma _j \sigma _l)^{-1} {\mathbb {E}}\left| \left\{ f_i({\mathscr {X}}_s, {\mathscr {Y}}_s) - \mu _{s} \right\} \left\{ f_j({\mathscr {X}}_t, {\mathscr {Y}}_t) - \mu _{t} \right\} \right| {\mathbb {E}}\left| f_l({\mathscr {X}}_u, {\mathscr {Y}}_u) - \mu _{u} \right| .\end{aligned}$$

Using Assumption 3.7, for any \(i,j,l \in [d]\) and \(s \in {\mathbb {I}}_i, t \in {\mathbb {I}}_j, u \in {\mathbb {I}}_l\)

$$\begin{aligned} {\mathbb {E}}\left| X_{s} X_{t} X_{u} \right| \le (\sigma _i \sigma _j \sigma _l)^{-1} \beta \; \text { and } \; {\mathbb {E}}\left| X_{s} X_{t} \right| {\mathbb {E}}\left| X_{u} \right| \le (\sigma _i \sigma _j \sigma _l)^{-1} \beta . \end{aligned}$$
(7.10)

To take care of the variance terms, we lower bound the variance using Assumption 3.7:

$$\begin{aligned} {{\,\textrm{Var}\,}}(S_{n,k_i}(f_i))&= \sum _{s \in C_{k_i}} \sum _{t \in {\mathbb {D}}_i(s)} {{\,\textrm{Cov}\,}}(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)) \\&= \sum _{m=1}^{k_i}\sum _{s \in C_{k_i}} \sum _{\begin{array}{c} t \in C_{k_i} \\ |s \cap t| = m \end{array}} {{\,\textrm{Cov}\,}}(f_i({\mathscr {X}}_s, {\mathscr {Y}}_s), f_i({\mathscr {X}}_t, {\mathscr {Y}}_t)) \\&\ge \left( {\begin{array}{c}n\\ k_i\end{array}}\right) \sum _{m=1}^{k_i} \left( {\begin{array}{c}k_i\\ m\end{array}}\right) \left( {\begin{array}{c}n-k_i\\ k_i-m\end{array}}\right) \alpha _i = \alpha _i \sum _{m=1}^{k_i} \left( {\begin{array}{c}n\\ 2k_i-m\end{array}}\right) \left( {\begin{array}{c}2k_i-m\\ k_i\end{array}}\right) \left( {\begin{array}{c}k_i\\ m\end{array}}\right) \\&\ge \alpha _i k_i \left( {\begin{array}{c}n\\ 2k_i-1\end{array}}\right) \left( {\begin{array}{c}2k_i-1\\ k_i\end{array}}\right) \ge \alpha _i k_i \frac{n^{2k_i - 1}}{(2k_i - 1)^{2k_i - 1}} \frac{(2k_i-1)^{k_i}}{k_i^{k_i}} \\&= \alpha _i \frac{n^{2k_i - 1}}{(2k_i^2 - k_i)^{k_i - 1}}. \end{aligned}$$

Here the second-to-last inequality follows by taking only the term for \(m=1\). Now we take both sides of the inequality to the power of \(-\frac{1}{2}\) to get that for any \(i \in [d]\)

$$\begin{aligned} \sigma _i^{-1} \le n^{-k_i + \frac{1}{2}}\alpha _i^{-\frac{1}{2}}(2k_i^2 - k_i)^{-\frac{k_i}{2} + \frac{1}{2}}. \end{aligned}$$
(7.11)

Using Eqs. (7.9)–(7.11) to bound the quantity \(B_{3.4}\) from Corollary 3.4 we get

$$\begin{aligned}&B_{3.4} \le \frac{2}{3} \sum _{i,j,l=1}^d (\sigma _i \sigma _j \sigma _l)^{-1} \beta \left( {\begin{array}{c}n\\ k_i\end{array}}\right) k_i^{\min (k_i,k_j) + 1} (n-k_i)^{k_j-1} \\&\Bigg \{ k_i^{\min (k_i,k_l) + 1} (n-k_i)^{k_l-1} + k_j^{\min (k_j,k_l) + 1} (n-k_j)^{k_l-1} \Bigg \} \\ \le&\frac{2}{3} \sum _{i,j,l=1}^d n^{k_i + k_j + k_l - 2} \frac{k_i^{\min (k_i,k_j) + 1}}{k_i!} \Bigg \{ k_i^{\min (k_i,k_l) + 1} + k_j^{\min (k_j,k_l) + 1} \Bigg \} \beta \\&\left( n^{-k_i - k_j - k_l + \frac{3}{2}}(\alpha _i \alpha _j \alpha _l)^{-\frac{1}{2}}(2k_i^2 - k_i)^{-\frac{k_i}{2} + \frac{1}{2}} (2k_j^2 - k_j)^{-\frac{k_j}{2} + \frac{1}{2}} (2k_l^2 - k_l)^{-\frac{k_l}{2} + \frac{1}{2}}\right) \\ \le&\left\{ \frac{2 \beta }{3} \sum _{i,j,l=1}^d \frac{k_i^{\min (k_i,k_j) + 1}}{k_i! \sqrt{\alpha _i \alpha _j \alpha _l}} \left( k_i^{\min (k_i,k_l) + 1} + k_j^{\min (k_j,k_l) + 1} \right) K_i K_j K_l \right\} n^{-\frac{1}{2}}. \end{aligned}$$

Now further assume that Assumption 3.8 holds. The key difference in this case is that the dependency neighbourhoods become smaller: now the subsets need to overlap in at least 2 elements for the corresponding summands to share at least one variable \(Y_{i,j}\) and hence become dependent. This makes both the variance and the size of the dependency neighbourhoods smaller. In the context of Theorem 3.2, the trade-off works out in our favour to give smaller bounds, as follows. For any \(s = (\phi ,i) \in {\mathbb {I}}_i\) we set \({\mathbb {D}}_j(s) = \left\{ \,(\psi ,j) \in {\mathbb {I}}_j \;\vert \; |\phi \cap \psi | \ge 2\,\right\} \), so that W, under the additional Assumption 3.8, satisfies the assumptions of Corollary 3.4.

Now Eq. (7.9) becomes

$$\begin{aligned}|{\mathbb {D}}_j(s)| \le k_i^{\min (k_i,k_j) + 1} (n-k_i)^{k_j-2}.\end{aligned}$$

Equation (7.11) becomes

$$\begin{aligned} \sigma _i^{-1} \le 2 n^{-k_i + 1}\alpha _i^{-\frac{1}{2}}(2k_i^2 - 2k_i)^{-\frac{k_i}{2} + 1}. \end{aligned}$$

Using the adjusted bounds in Corollary 3.4 gives the result. \(\square \)