1 Introduction

1.1 Semidefinite problems on random graphs

In this paper we present a simple and general method to prove consistency of various semidefinite optimization problems on random graphs.

Suppose we observe one instance of an \(n \times n\) symmetric random matrix A with unknown expectation \(\bar{A} := \mathbb {E}A\). We would like to estimate the solution of the discrete optimization problem

$$\begin{aligned} \text {maximize}\, x^\mathsf {T}\bar{A} x \quad \text {subject to} \quad x \in \{-1,1\}^n. \end{aligned}$$
(1.1)

A motivating example of A is the adjacency matrix of a random graph; the Boolean vector x can represent a partition of vertices of the graph into two classes. Such Boolean problems can be encountered in the context of community detection in networks which we will discuss shortly. For now, let us keep working with the general class of problems (1.1).

Since \(\bar{A}\) is unknown, one might hope to estimate the solution \(\bar{x}\) of (1.1) by solving the random instance of this problem, that is

$$\begin{aligned} \text {maximize}\, x^\mathsf {T}A x \quad \text {subject to} \quad x \in \{-1,1\}^n. \end{aligned}$$
(1.2)

The integer quadratic problem (1.2) is NP-hard for general (non-random) matrices A. Semidefinite relaxations of many problems of this type have been proposed; see [6, 34, 49, 57] and the references therein. Such relaxations are known to have constant relative accuracy. For example, a semidefinite relaxation in [6] computes, for any given positive semidefinite matrix A, a vector \(x_0 \in \{-1,1\}^n\) such that \(x_0^\mathsf {T}A x_0 \ge 0.56 \, \max _{x \in \{-1,1\}^n} x^\mathsf {T}A x\).

In this paper we demonstrate how semidefinite relaxations of (1.2) can recover a solution of (1.1) with any given relative accuracy. Like several previously known methods, our approach is based on Grothendieck’s inequality. We refer the reader to the surveys [41, 61] for many reformulations and applications of this inequality in mathematics, computer science, optimization and other fields. In contrast to the previous methods, we are going to apply Grothendieck’s inequality to the (random) error \(A - \bar{A}\) rather than to the original matrix A, and this will be responsible for the arbitrary accuracy.

We will describe the general method in Sect. 2. It is simple and flexible, and it can be used for showing consistency of a variety of semidefinite programs, which may or may not be related to Boolean problems like (1.1). But before describing the method, we would like to pause and give some concrete examples of results it yields for community detection.

For simplicity, we will first focus on the classical stochastic block model, which is a random network whose nodes are split into two equal-sized clusters. In Sect. 1.3 we will extend our discussion to broader models of networks with almost no extra effort.

1.2 Community detection: the classical stochastic block model

It is now customary to model networks as inhomogeneous random graphs [13], which generalize the classical Erdős-Rényi model G(n, p). A benchmark example is the stochastic block model [40]. In this section we focus on the basic model with two communities of equal sizes; in Sect. 1.3 we will consider a more general situation.

We define a random graph on vertices \(\{1,\ldots ,n\}\) as follows. Partition the set of vertices into two communities \(\mathcal {C}_1\) and \(\mathcal {C}_2\) of size n/2 each. For each pair of distinct vertices, we draw an edge independently with probability p if both vertices belong to the same community, and q (with \(q \le p\)) if they belong to different communities. For convenience we include loops, so each vertex has an edge connecting it to itself with probability 1. This defines a distribution on random graphs which is denoted G(n, p, q) and called the (classical) stochastic block model. When \(p=q\), we recover the classical Erdős-Rényi model of random graphs G(n, p).

The community detection problem asks to recover the communities \(\mathcal {C}_1\) and \(\mathcal {C}_2\) by observing one instance of a random graph drawn from G(n, p, q). As we will discuss in detail in Sect. 1.4, an array of algorithms is known to succeed for this problem for relatively dense graphs, those whose expected average degree (which is of order pn) is \(\Omega (\log n)\), while less is known for totally sparse graphs—those with bounded average degrees, i.e. with \(pn = O(1)\). Our paper focuses on this sparse regime.

Recovery of the communities \(\mathcal {C}_1\) and \(\mathcal {C}_2\) is equivalent to estimating the community membership vector, which we can define as

$$\begin{aligned} \bar{x}\in \{-1,1\}^n, \quad \bar{x}_i = {\left\{ \begin{array}{ll} 1, &{} i \in \mathcal {C}_1 \\ -1, &{} i \in \mathcal {C}_2. \end{array}\right. } \end{aligned}$$
(1.3)

We will estimate \(\bar{x}\) using the following semidefinite optimization problem:

$$\begin{aligned} \begin{aligned}&\text {maximize}\, \langle A,Z\rangle - \lambda \langle E_n,Z\rangle \\&\text {subject to}\, Z \succeq 0, \; \hbox {diag}(Z) \preceq \mathbf{I }_n. \end{aligned} \end{aligned}$$
(1.4)

Here the inner product of matrices is defined in the usual way, that is \(\langle A,B\rangle = \hbox {tr}(AB) = \sum _{i,j} A_{ij} B_{ij}\), \(\mathbf{I }_n\) denotes the identity matrix, the matrix \(E_n\) has all entries equal to 1, and \(A \succeq B\) means that \(A-B\) is positive semidefinite. Observe that \(E_n = \mathbf{1}_n \mathbf{1}_n^\mathsf {T}\) where \(\mathbf{1}_n \in \mathbb {R}^n\) is the vector all of whose coordinates equal 1. The constraint \(\hbox {diag}(Z) \preceq \mathbf{I }_n\) in (1.4) simply means that all diagonal entries of Z are bounded by 1.

For the value of \(\lambda \) we choose the average of the off-diagonal entries of A, i.e. the edge density of the graph with loops removed, which is

$$\begin{aligned} \lambda = \frac{2}{n(n-1)} \sum _{i < j} a_{ij} \end{aligned}$$
(1.5)

where \(a_{ij} \in \{0,1\}\) denote the entries of the adjacency matrix A.
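For readers who wish to experiment with (1.4), the following is a minimal sketch of the program in Python. It assumes the cvxpy package with an SDP-capable solver (such as SCS) is installed; the function and variable names are illustrative and not part of the paper.

```python
import numpy as np
import cvxpy as cp  # assumption: cvxpy with an SDP-capable solver is available

def solve_sdp_1_4(A):
    """Sketch of (1.4): maximize <A, Z> - lambda <E_n, Z> over PSD Z with diag(Z) <= 1."""
    n = A.shape[0]
    # lambda as in (1.5): average of the off-diagonal entries of A
    lam = (A.sum() - np.trace(A)) / (n * (n - 1))
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0,            # Z is positive semidefinite
                   cp.diag(Z) <= 1]   # all diagonal entries bounded by 1
    objective = cp.Maximize(cp.sum(cp.multiply(A, Z)) - lam * cp.sum(Z))
    cp.Problem(objective, constraints).solve()
    return Z.value
```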

Theorem 1.1

(Community detection in classical stochastic block model) Let \(\varepsilon \in (0,1)\) and \(n \ge 10^4 \varepsilon ^{-2}\). Let A be the adjacency matrix of the random graph drawn from the stochastic block model G(n, p, q) with \(\max \{p(1-p), q(1-q)\} \ge \frac{20}{n}\). Assume that \(p=\frac{a}{n} > q=\frac{b}{n}\), and

$$\begin{aligned} (a-b)^2 \ge 10^4 \, \varepsilon ^{-2} (a+b). \end{aligned}$$
(1.6)

Let \(\widehat{Z}\) be a solution of the semidefinite program (1.4). Then, with probability at least \(1- e^3 5^{-n}\), we have

$$\begin{aligned} \Vert \widehat{Z}- \bar{x}\bar{x}^\mathsf {T}\Vert _2^2 \le \varepsilon n^2 = \varepsilon \Vert \bar{x}\bar{x}^\mathsf {T}\Vert _2^2. \end{aligned}$$
(1.7)

Here and in the rest of this paper, \(\Vert \cdot \Vert _2\) denotes the Frobenius norm of matrices and the Euclidean norm of vectors.

Once we have estimated the rank-one matrix \(\bar{x} \bar{x}^\mathsf {T}\) using Theorem 1.1, we can also estimate the community membership vector \(\bar{x}\) itself in a standard way, namely by computing the leading eigenvector.

Corollary 1.2

(Community detection with o(n) misclassified vertices) In the setting of Theorem 1.1, let \(\widehat{x}\) denote an eigenvector of \(\widehat{Z}\) corresponding to the largest eigenvalue, and with \(\Vert \widehat{x}\Vert _2 = \sqrt{n}\). Then

$$\begin{aligned} \min _{\alpha = \pm 1} \Vert \alpha \widehat{x}- \bar{x}\Vert _2^2 \le \varepsilon n = \varepsilon \Vert \bar{x}\Vert _2^2. \end{aligned}$$

In particular, the signs of the coefficients of \(\widehat{x}\) correctly estimate the partition of the vertices into the two communities, up to at most \(\varepsilon n\) misclassified vertices.
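In computational terms, the rounding step of Corollary 1.2 is a single leading-eigenvector computation followed by taking signs. Below is a minimal numpy sketch, under the assumption that Z_hat is the solution of (1.4) returned by a solver; the function name is illustrative.

```python
import numpy as np

def round_to_membership(Z_hat):
    """Extract the estimated membership vector from the SDP solution,
    as in Corollary 1.2 (illustrative sketch, not from the paper)."""
    n = Z_hat.shape[0]
    # symmetrize to guard against small numerical asymmetry from the solver
    eigvals, eigvecs = np.linalg.eigh((Z_hat + Z_hat.T) / 2)
    v = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
    x_hat = np.sqrt(n) * v            # normalized so that ||x_hat||_2 = sqrt(n)
    labels = np.sign(x_hat)           # signs estimate the two communities, up to a global sign
    labels[labels == 0] = 1
    return x_hat, labels
```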

As we will discuss in Sect. 1.4.2 in more detail, there are previously known algorithms for recovery of two communities under conditions similar to (1.6). These include a spectral clustering algorithm based on truncating the high degree vertices (whose analysis can be derived from [31, 32]), combinatorial algorithms of [54, 59] based on path counting, and an algorithm [52] based on belief propagation, which minimizes the fraction of misclassified vertices.

An array of simple semidefinite programs like (1.4) and (1.10) has been proposed in the networks community. Such programs have been analyzed for relatively dense graphs; see [9] for a review. It has been unknown whether they could succeed for totally sparse graphs, where the expected degree is of constant order. Theorem 1.1 provides a positive answer to this question. Moreover, the method of this paper is flexible enough to analyze many semidefinite programs, and it can be applied to more general models of sparse networks than any previous results. To illustrate this point, we will now choose a different semidefinite program and show that it succeeds for a large class of stochastic models of networks.

1.3 Community detection: general stochastic block models

Let us describe a model of networks where one can have multiple communities of arbitrary sizes, arbitrarily many outliers, and unequal edge probabilities.

To define such a general stochastic block model, we assume that the set of vertices \(\{1,\ldots ,n\}\) is partitioned into communities \(\mathcal {C}_1,\ldots ,\mathcal {C}_K\) of arbitrary sizes. We do not restrict the sizes of the communities, so in particular this model can automatically handle outliers, the vertices that form communities of size 1. For each pair of distinct vertices (i, j), we draw an edge between i and j independently with a certain fixed probability \(p_{ij}\). As in the classical stochastic block model, we include loops for convenience, so \(p_{ii} = 1\). To promote more edges within than across the communities, we assume that there exist numbers \(p > q\) (thresholds) such that

$$\begin{aligned} \begin{aligned}&p_{ij} \ge p \quad \text {if}\, i\, \text {and}\, j\, \text {belong to the same community};\\&p_{ij} \le q \quad \text {if}\, i\, \text {and}\, j\, \text {belong to different communities}. \end{aligned} \end{aligned}$$
(1.8)

The community structure of such a network is captured by the cluster matrix \(\bar{Z} \in \{0,1\}^{n \times n}\) defined as

$$\begin{aligned} \bar{Z}_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if}\, i\, \text {and}\, j\, \text {belong to the same community}; \\ 0 &{} \text {if}\, i\, \text {and}\, j\, \text {belong to different communities}. \end{array}\right. } \end{aligned}$$
(1.9)

We will estimate \(\bar{Z}\) using the following semidefinite optimization program:

$$\begin{aligned} \begin{aligned}&\text {maximize}\, \langle A,Z\rangle \\&\text {subject to}\, Z \succeq 0, \; Z \ge 0, \; \hbox {diag}(Z) \preceq \mathbf{I }_n, \; \textstyle {\sum _{i,j=1}^n Z_{ij} = \lambda }. \end{aligned} \end{aligned}$$
(1.10)

Here as usual \(Z \succeq 0\) means that Z is positive semidefinite, and \(Z \ge 0\) means that all entries of Z are non-negative. We choose the value of \(\lambda \) to be the sum of the entries of the cluster matrix, that is

$$\begin{aligned} \lambda = \sum _{i,j=1}^n \bar{Z}_{ij} = \sum _{k=1}^K |\mathcal {C}_k|^2. \end{aligned}$$
(1.11)

If all communities have the same size s, then \(\lambda = K s^2 = n s\).
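As with (1.4), the program (1.10) can be passed to an off-the-shelf SDP solver. A minimal sketch follows, again assuming cvxpy is available; the value lam must be supplied by the user, e.g. from the community sizes as in (1.11) or chosen as discussed in Remark 1.5 below.

```python
import cvxpy as cp  # assumption: cvxpy with an SDP-capable solver is available

def solve_sdp_1_10(A, lam):
    """Sketch of (1.10): maximize <A, Z> over PSD, entrywise non-negative Z
    with diag(Z) <= 1 and total sum of entries equal to lam."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0,             # positive semidefinite
                   Z >= 0,             # entrywise non-negative
                   cp.diag(Z) <= 1,    # diagonal entries bounded by 1
                   cp.sum(Z) == lam]   # sum of all entries fixed to lambda
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, Z))), constraints).solve()
    return Z.value
```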

Theorem 1.3

(Community detection in general stochastic block model) Let \(\varepsilon \in (0,1)\). Let A be the adjacency matrix of the random graph drawn from the general stochastic block model described above. Denote by \(\bar{p}\) the average variance of the edges, that is \(\bar{p}= \frac{2}{n(n-1)} \sum _{i<j} p_{ij} (1 - p_{ij})\). Assume that \(p=\frac{a}{n} > q=\frac{b}{n}\), \(\bar{p}= \frac{g}{n}\), \(g \ge 9\) and

$$\begin{aligned} (a-b)^2 \ge 484\, \varepsilon ^{-2} g. \end{aligned}$$
(1.12)

Let \(\widehat{Z}\) be a solution of the semidefinite program (1.10). Then, with probability at least \(1-e^3 5^{-n}\), we have

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert _2^2 \le \Vert \widehat{Z}- \bar{Z}\Vert _1 \le \varepsilon n^2. \end{aligned}$$
(1.13)

Here as usual \(\Vert \cdot \Vert _2\) denotes the Frobenius norm of matrices, and \(\Vert \cdot \Vert _1\) denotes the \(\ell _1\) norm of the matrices considered as vectors, that is \(\Vert (a_{ij})\Vert _1 = \sum _{i,j} |a_{ij}|\).

Remark 1.4

(General community structure) The conclusion of Theorem 1.3 does not depend on the community structure, i.e. on the number and sizes of the communities. This seemingly surprising observation can be explained by the fact that small communities, those with sizes o(n), can get absorbed in the error term in (1.13), so they will not be recovered.

Remark 1.5

(If the sizes of communities are not known) Our choice of the parameter \(\lambda \) in (1.11) assumes that we know the sizes of the communities. What if they are not known? From the proof of Theorem 1.3 it will be clear what happens when \(\lambda >0\) is chosen arbitrarily. Assume that we choose \(\lambda \) so that \(\lambda \le \lambda _0 := \sum _k |\mathcal {C}_k|^2\). Then instead of estimating the full cluster graph (described in Remark 1.6), the solution \(\widehat{Z}\) will only estimate a certain subgraph of the cluster graph, which may miss at most \(\lambda _0 - \lambda \) edges. On the other hand, if we choose \(\lambda \) so that \(\lambda \ge \lambda _0\), then the solution \(\widehat{Z}\) will estimate a certain supergraph of the cluster graph, which may have at most \(\lambda - \lambda _0\) extra edges. In either case, such a solution can still be meaningful in practice.

Remark 1.6

(Cluster graph) It may be convenient to view the cluster matrix \(\bar{Z}\) as the adjacency matrix of the cluster graph, in which all vertices within each community are connected and there are no connections across the communities. This way, the semidefinite program (1.10) takes a sparse graph as an input, and it returns an estimate of the cluster graph as an output. The effect of the program is thus to “densify” the network inside the communities and “sparsify” it across the communities.

Remark 1.7

(Other semidefinite programs) There is nothing special about the semidefinite programs (1.4) and (1.10). For example, one can tighten the constraints and instead of \(\hbox {diag}(Z) \preceq \mathbf{I }_n\) require that \(\hbox {diag}(Z) = \mathbf{I }_n\) in both programs. Similarly, instead of placing in (1.10) the constraint on the sum of all entries of Z, one can place constraints on the sums of each row. In a similar fashion, one should be able to analyze other semidefinite relaxations, both new and those proposed in the previous literature on community detection, see [9].

For one more illustration of the method described here, we refer the reader to Section 7 of the extended version of this paper [36]. There we consider a minor modification of the semidefinite program (1.4), and we show that it succeeds in the presence of multiple communities of equal sizes (the so-called balanced planted partition model). The sufficient condition for that is \((a-b)^2 \ge 50^2 \varepsilon ^{-2} (a + b(K-1))\) where K is the number of communities, s is the size of each community, and \(p=a/s\), \(q=b/s\).

1.4 Related work

Community detection in stochastic block models is a fundamental problem that has been extensively studied in theoretical computer science and statistics. A plethora of algorithmic approaches have been proposed, in particular those based on combinatorial techniques [18, 30], spectral clustering [4, 5, 14, 22, 38, 43, 50, 56, 58, 62, 63], likelihood maximization [8, 10, 64], variational methods [3, 11, 20], Markov chain Monte Carlo [29, 64], belief propagation [29], and convex optimization including semidefinite programming [2, 7, 9, 18, 19, 23–25, 37, 60].

1.4.1 Relatively dense networks: average degrees are \(\Omega (\log n)\)

Most known rigorous results on community detection are proved for relatively dense networks whose expected degrees go to infinity with n. If the degrees grow no slower than \(\log n\), it may be possible to recover the community structure perfectly, without any misclassified vertices. A variety of community detection methods are known to succeed in this regime, including those based on spectral clustering, likelihood maximization and convex optimization mentioned above; see e.g. [18, 50] and the references therein.

The semidefinite programs (1.4) and (1.10) are similar to those proposed in the recent literature, most notably in [9, 18, 19, 23, 25]. The semidefinite relaxations discussed in [19, 25] can perfectly recover the community structure if \((a-b)^2 \ge C (a \log n+b)\) for a sufficiently large constant C; see [9] for a review of these results.

1.4.2 Totally sparse networks: bounded average degrees

The problem becomes more difficult for sparser networks, whose expected average degrees grow to infinity arbitrarily slowly or even remain bounded in n. Although studying such networks is well motivated from the practical perspective [45, 65], little has been known on the theoretical level.

If the degrees grow slower than \(\log n\), it is impossible to correctly classify all vertices, since with high probability some of the vertices will be isolated. Still, the fraction of isolated vertices is small, so we can hope to correctly classify a majority of the vertices in this regime.

The spectral method developed by J. Kahn and E. Szemerédi for random regular graphs [32] can be adapted for Erdős-Rényi random graphs [5, 31] and, more generally, for the stochastic block model \(G(n, \frac{a}{n}, \frac{b}{n})\). If one truncates the graph by removing all vertices with too large degrees (say, larger than \(10(a+b)\)), then the argument of [31, 32] can be adapted to conclude that with some positive probability, the truncated adjacency matrix concentrates near its expectation in the spectral norm. The communities can then be approximately recovered using spectral clustering, which is based on the signs of the coefficients of the second eigenvector. Working out the details, one finds that a sufficient condition for this method to succeed is similar to (1.6), that is

$$\begin{aligned} (a-b)^2 \ge C_\varepsilon (a+b) \end{aligned}$$
(1.14)

where \(C_\varepsilon \) depends only on the desired accuracy \(\varepsilon \) of recovery. However, for real networks it is usually impractical to remove high degree vertices, and the probabilistic estimate from [31] is not sharp.

A. Coja-Oghlan [27] proposed a different, complicated adaptive spectral algorithm that can approximately recover communities under the condition \((a-b)^2 \ge C_\varepsilon (a+b) \log (a+b)\). Recently, L. Massoulié [59] and E. Mossel, J. Neeman and A. Sly [54] came up with combinatorial algorithms based on path counting, which can approximately recover communities under the condition (1.14). These results are stated in the asymptotic regime for \(n \rightarrow \infty \) and without explicit dependence of \(C_\varepsilon \) on the desired accuracy \(\varepsilon \). Furthermore, E. Mossel, J. Neeman and A. Sly developed an algorithm based on belief propagation [52], which minimizes the fraction of misclassified vertices.

Condition (1.14) has the optimal form. Indeed, it was shown in [55] that the lower bound (1.14) is required for any algorithm to be able to recover communities with at most \(\varepsilon n\) misclassified vertices, where \(C_\varepsilon \rightarrow \infty \) as \(\varepsilon \rightarrow 0\). A conjecture of A. Decelle, F. Krzakala, C. Moore and L. Zdeborová, proved recently by E. Mossel, J. Neeman and A. Sly [53, 54] and L. Massoulié [59], states that one can find a partition correlated with the true community partition (i.e. with the fraction of misclassified vertices bounded away from \(50~\%\) as \(n \rightarrow \infty \)) if \((a-b)^2 \ge C(a+b)\) with some constant \(C>2\). Moreover, this result achieves the information-theoretic limit: no algorithm can succeed if \(C \le 2\).

It remains an open question whether semidefinite programming can achieve similar information-theoretic limits. Theorem 1.1 does not achieve them; addressing this problem will require tightening the absolute constant and the dependence on \(\varepsilon \) in (1.6).

1.4.3 The new results in historical perspective

A variety of simple semidefinite programs like (1.4) and (1.10) have been proposed in the network literature. Such programs have been analyzed only for dense networks where the degrees grow as \(\Omega (\log n)\), in which case perfect community detection is possible. The present paper shows that the same semidefinite programs succeed for totally sparse networks as well, producing a small number of misclassified vertices; moreover, the sufficient condition (1.14) is optimal up to an absolute constant.

Furthermore, the method of the present paper generalizes smoothly to a broad class of sparse networks. We saw in Sect. 1.3 that semidefinite programming succeeds for networks with variable edge probabilities \(p_{ij}\); community detection in such networks seems to be out of reach for known spectral methods.

We also saw how networks with multiple communities can be handled with the semidefinite approach. This has been studied in the statistical literature before; the semidefinite relaxations proposed in [9, 19, 23, 25] were designed for multiple communities and outliers. However, previous theoretical results for multiple communities were only available in the dense regime where the degrees grow as \(\Omega (\log n)\), in which case perfect community detection is possible.

1.4.4 Follow up work

After this paper had been submitted, several new results appeared on community detection in stochastic block models. We will mention here only results that apply to totally sparse networks. The initial discovery of [54, 59] mentioned in Sect. 1.4.2 was followed by the work [16]. Semidefinite programs on random graphs were further analyzed in [51] using higher-rank Grothendieck inequalities and insights from mathematical physics. Stochastic block models with labeled edges were addressed in [44] using truncated spectral clustering (with high degree vertices removed, based on [31]) and semidefinite programming (whose analysis is based on the method of the present paper). A two-stage algorithm based on truncated spectral clustering and swapping vertices (like e.g. in [55]) was analyzed in [26]; the swapping stage leads to the sufficient condition (1.14) with an optimal dependence on the accuracy, \(C_\varepsilon \sim \log (1/\varepsilon )\). A different combinatorial method was proposed and analyzed in [1]; regularized spectral clustering was shown to succeed in [46, 47]; and a computationally feasible likelihood-based algorithm that minimizes the risk for the misclassification proportion was found in [33]. Some of the mentioned work can be used for networks with multiple communities, see [1, 26, 33, 46, 47].

1.5 Plan of the paper

We discuss the method in general terms in Sect. 2. We explain how Grothendieck’s inequality can be used to show tightness of various semidefinite programs on random graphs. Section 3 is devoted to Grothendieck’s inequality and its implications for semidefinite programming. In Sect. 4 we prove a simple concentration inequality for random matrices in the cut norm. In Sect. 5 we specialize to the community detection problem for the classical stochastic block model, and we prove Theorem 1.1 and Corollary 1.2 there. In Sect. 6 we consider the general stochastic block model, and we prove Theorem 1.3 there.

2 Semidefinite optimization on random graphs: the method in a nutshell

In this section we explain the general method of this paper, which can be applied to a variety of optimization problems. To be specific, let us return to the problem we described in Sect. 1.1, which is to estimate the solution \(\bar{x}\) of the optimization problem (1.1) from a single observation of the random matrix A. We suggested there to approximate \(\bar{x}\) by the solution of the (random) program (1.2), which we can rewrite as follows:

$$\begin{aligned} \text {maximize } \langle A,x x^\mathsf {T}\rangle \quad \text {subject to} \quad x \in \{-1,1\}^n. \end{aligned}$$
(2.1)

Note that if we maximized \(\langle A,x x^\mathsf {T}\rangle \) over the Euclidean ball \(B(0,\sqrt{n})\), then the problem would be simple – the solution x would be a multiple of the eigenvector of A corresponding to its largest eigenvalue. This simpler problem underlies the most basic algorithm for community detection called spectral clustering, where the communities are recovered based on the signs of an eigenvector of the adjacency matrix (going back to [14, 39, 50], see [63]). The optimization problem (2.1) is harder and more subtle; the replacement of the Euclidean ball by the cube introduces a strong restriction on the coordinates of x. This restriction rules out localized solutions x where most of the mass of x is concentrated on a small fraction of coordinates. Since eigenvectors of sparse matrices tend to be localized (see [15]), basic spectral clustering is often unsuccessful for sparse networks.
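For comparison, here is a minimal sketch of the basic spectral clustering baseline mentioned above, in one common variant (signs of the eigenvector associated with the second largest eigenvalue of A). This is not the method of the paper, and for sparse graphs it can fail precisely because of the localization phenomenon just described.

```python
import numpy as np

def basic_spectral_clustering(A):
    """Toy baseline: split the vertices by the signs of the eigenvector of the
    adjacency matrix A corresponding to its second largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(A)   # eigenvalues in ascending order
    v = eigvecs[:, -2]                     # second leading eigenvector
    labels = np.sign(v)
    labels[labels == 0] = 1
    return labels
```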

Let us choose a convex subset \(\mathcal {M}_{\mathrm {opt}}\) of the set of positive semidefinite matrices all of whose entries are bounded by 1 in absolute value. (For now, it can be any subset.) Note that the matrices \(x x^\mathsf {T}\) appearing in (2.1) are examples of such matrices. We consider the following semidefinite relaxation of (2.1):

$$\begin{aligned} \text {maximize}\, \langle A,Z\rangle \quad \text {subject to} \quad Z \in \mathcal {M}_{\mathrm {opt}}. \end{aligned}$$
(2.2)

We might hope that the solution \(\widehat{Z}\) of this program would enable us to estimate the solution \(\bar{x}\) of (1.1).

To realize this hope, one needs to check a few things, which may or may not be true depending on the application. First, one needs to design the feasible set \(\mathcal {M}_{\mathrm {opt}}\) in such a way that the semidefinite relaxation of the expected problem (1.1) is tight. This means that the solution \(\bar{Z}\) of the program

$$\begin{aligned} \text {maximize}\, \langle \bar{A},Z\rangle \quad \text {subject to} \quad Z \in \mathcal {M}_{\mathrm {opt}}\end{aligned}$$
(2.3)

satisfies

$$\begin{aligned} \bar{Z} = \bar{x} \bar{x}^\mathsf {T}. \end{aligned}$$
(2.4)

This condition can be arranged for in various applications. In particular, this is the case in the setting of Theorem 1.1; we show this in Lemma 5.1.

Second, one needs a uniform deviation inequality, which would guarantee with high probability that

$$\begin{aligned} \max _{x,y \in \{-1,1\}^n} |\langle A-\bar{A},xy^\mathsf {T}\rangle | \le \varepsilon . \end{aligned}$$
(2.5)

This can often be proved by applying standard deviation inequalities for a fixed pair (x, y), followed by a union bound over all such pairs. We prove such a deviation inequality in Sect. 4.

Now we make the crucial step, which is an application of Grothendieck’s inequality. A reformulation of this remarkable inequality, which we explain in Sect. 3, states that (2.5) automatically implies that

$$\begin{aligned} \max _{Z \in \mathcal {M}_{\mathrm {opt}}} |\langle A-\bar{A},Z\rangle | \le C\varepsilon . \end{aligned}$$
(2.6)

This will allow us to conclude that the solution \(\widehat{Z}\) of (2.2) approximates the solution \(\bar{Z}\) of (2.3). To see this, let us compare the value of the expected objective function \(\langle \bar{A},Z\rangle \) at these two matrices. We have

$$\begin{aligned} \langle \bar{A},\widehat{Z}\rangle&\ge \langle A,\widehat{Z}\rangle - C\varepsilon \quad \text {(replacing}\, \bar{A}\, \text {by}\, A\, \text {using (2.6))} \nonumber \\&\ge \langle A,\bar{Z}\rangle - C\varepsilon \quad \text {(since}\, \widehat{Z}\,\text {is the maximizer in (2.2))} \nonumber \\&\ge \langle \bar{A},\bar{Z}\rangle - 2C\varepsilon \quad \text {(replacing}\, A\, \text {by}\, \bar{A}\, \text {back using (2.6)).} \end{aligned}$$
(2.7)

This means that \(\widehat{Z}\) almost maximizes the objective function \(\langle \bar{A},Z\rangle \) in (2.3).

The final piece of information we require is that the expected objective function \(\langle \bar{A},Z\rangle \) distinguishes points near its maximizer \(\bar{Z}\). This would allow one to automatically conclude from (2.7) that the almost maximizer \(\widehat{Z}\) is close to the true maximizer, i.e. that

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert \le \text {something small} \end{aligned}$$
(2.8)

where \(\Vert \cdot \Vert \) can be the Frobenius or operator norm. Intuitively, the requirement that the objective function distinguishes points amounts to a non-trivial curvature of the feasible set \(\mathcal {M}_{\mathrm {opt}}\) at the maximizer \(\bar{Z}\). In many situations, this property is easy to verify. In the setting of Theorems 1.1 and 1.3, we check it in Lemma 5.2 and Lemmas 6.2 and 6.3 respectively.

Finally, we can recall from (2.4) that \(\bar{Z} = \bar{x} \bar{x}^\mathsf {T}\). Together with (2.8), this yields that \(\widehat{Z}\) is approximately a rank-one matrix, and its leading eigenvector \(\widehat{x}\) satisfies

$$\begin{aligned} \Vert \widehat{x}- \bar{x}\Vert _2 \le \text {something small}. \end{aligned}$$

Thus we estimated the solution \(\bar{x}\) of the problem (1.1) as desired.

Remark 2.1

(General semidefinite programs) For this method to work, it is not crucial that the semidefinite program be a relaxation of any vector optimization problem. Indeed, one can analyze semidefinite programs of the type (2.2) without any vector optimization problem (2.1) in the background. In such cases, the requirement (2.4) of tightness of relaxation can be dropped. The solution \(\bar{Z}\) may itself be informative. An example of such a situation is Theorem 1.3, where the cluster matrix \(\bar{Z}\) is important by itself. However, \(\bar{Z}\) cannot be represented as \(\bar{x} \bar{x}^\mathsf {T}\) for any \(\bar{x}\), since \(\bar{Z}\) is not a rank-one matrix.

3 Grothendieck’s inequality and semidefinite programming

Grothendieck’s inequality is a remarkable result proved originally in the functional analytic context [35] and reformulated in [48] in the form we are going to describe below. This inequality has found applications in several areas [41, 61]. It has already been used to analyze semidefinite relaxations of hard combinatorial optimization problems [6, 57], although previous relaxations led to constant (rather than arbitrary) accuracy.

Theorem 3.1

(Grothendieck’s inequality) Consider an \(n \times n\) matrix of real numbers \(B = (b_{ij})\). Assume that

$$\begin{aligned} \Big | \sum _{i,j} b_{ij} s_i t_j \Big | \le 1 \end{aligned}$$

for all numbers \(s_i, t_i \in \{-1,1\}\). Then

$$\begin{aligned} \Big | \sum _{i,j} b_{ij} \langle X_i,Y_j\rangle \Big | \le K_\mathrm {G}\end{aligned}$$

for all vectors \(X_i,Y_i \in B_2^n\).

Here \(B_2^n = \{ x \in \mathbb {R}^n : \Vert x\Vert _2 \le 1\}\) is the unit ball for the Euclidean norm, and \(K_\mathrm {G}\) is an absolute constant referred to as Grothendieck’s constant. The best value of \(K_\mathrm {G}\) is still unknown, and the best known bound [17] is

$$\begin{aligned} K_\mathrm {G}< \frac{\pi }{2 \ln (1+\sqrt{2})} \le 1.783. \end{aligned}$$
(3.1)

3.1 Grothendieck’s inequality in matrix form

To restate Grothendieck’s inequality in a matrix form, observe that \(\sum _{i,j} b_{ij} s_i t_j = \langle B,s t^\mathsf {T}\rangle \) where s and t are the vectors in \(\mathbb {R}^n\) with coordinates \(s_i\) and \(t_j\) respectively. Similarly, \(\sum _{i,j} b_{ij} \langle X_i,Y_j\rangle = \langle B,X Y^\mathsf {T}\rangle \) where X and Y are the \(n \times n\) matrices with rows \(X_i^\mathsf {T}\) and \(Y_j^\mathsf {T}\) respectively. This motivates us to consider the following two sets of matrices:

$$\begin{aligned} \mathcal {M}_1 := \left\{ s t^\mathsf {T}:\; s, t \in \{-1,1\}^n \right\} , \quad \mathcal {M}_\mathrm {G}:= \left\{ XY^\mathsf {T}:\; \text {all rows}\, X_i, Y_j \in B_2^n \right\} . \end{aligned}$$

Clearly, \(\mathcal {M}_1 \subset \mathcal {M}_\mathrm {G}\). Grothendieck’s inequality can be stated as follows:

$$\begin{aligned} \forall B \in \mathbb {R}^{n \times n}, \quad \max _{Z \in \mathcal {M}_\mathrm {G}} \left| \langle B,Z\rangle \right| \le K_\mathrm {G}\max _{Z \in \mathcal {M}_1} \left| \langle B,Z\rangle \right| . \end{aligned}$$
(3.2)

We can view this inequality as a relation between two matrix norms. The right side of (3.2) defines the \(\ell _\infty \rightarrow \ell _1\) norm of \(B = (b_{ij})\), which is

$$\begin{aligned} \Vert B\Vert _{\infty \rightarrow 1}&= \max _{\Vert s\Vert _\infty \le 1} \Vert Bs\Vert _1 = \max _{s,t \in \{-1,1\}^n} \langle B,s t^\mathsf {T}\rangle = \max _{s,t \in \{-1,1\}^n} \sum _{i,j=1}^n b_{ij} s_i t_j \nonumber \\&= \max _{Z \in \mathcal {M}_1} \left| \langle B,Z\rangle \right| . \end{aligned}$$
(3.3)

We note in passing that this norm is equivalent to the so-called cut norm, whose importance in algorithmic problems is well understood in the theoretical computer science community, see e.g. [6, 41].
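For small matrices, the norm (3.3) can be computed by brute force, since for a fixed sign vector s the optimal t simply matches the signs of the coordinates of \(B^\mathsf{T} s\). A short Python sketch (illustrative only; the search is exponential in n):

```python
import itertools
import numpy as np

def norm_inf_to_1(B):
    """Brute-force the ell_infty -> ell_1 norm from (3.3); feasible for small n only."""
    n = B.shape[0]
    best = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(signs)
        # for fixed s, the optimal t takes the sign of each coordinate of B^T s
        best = max(best, np.abs(B.T @ s).sum())
    return best
```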

Let us restrict our attention to the part of Grothendieck’s set \(\mathcal {M}_\mathrm {G}\) consisting of positive semidefinite matrices. To do so, we consider the following set of \(n \times n\) matrices:

$$\begin{aligned} \mathcal {M}_\mathrm {G}^+ := \left\{ Z:\; Z \succeq 0, \; \hbox {diag}(Z) \preceq \mathbf{I }_n \right\} \subset \mathcal {M}_\mathrm {G}\subset [-1,1]^{n \times n}. \end{aligned}$$
(3.4)

To check the first inclusion in (3.4), let \(Z \in \mathcal {M}_\mathrm {G}^+\). Since \(Z \succeq 0\), there exists a matrix X such that \(Z = X^2\). The rows \(X_i^\mathsf {T}\) of X satisfy \( \Vert X_i\Vert _2^2 = \langle X_i,X_i\rangle = (X^\mathsf {T}X)_{ii} = Z_{ii} \le 1, \) where the last inequality follows from the assumption \(\hbox {diag}(Z) \preceq \mathbf{I }_n\). Choosing \(Y=X\) in the definition of \(\mathcal {M}_\mathrm {G}\), we conclude that \(Z \in \mathcal {M}_\mathrm {G}\). To check the second inclusion in (3.4), note that for every matrix \(X Y^\mathsf {T}\in \mathcal {M}_\mathrm {G}\), we have \((XY^\mathsf {T})_{ij} = \langle X_i,Y_j\rangle \le \Vert X_i\Vert _2 \; \Vert Y_j\Vert _2 \le 1.\)

Combining (3.2) with (3.4) and the identity (3.3), we obtain the following form of Grothendieck’s inequality for positive semidefinite matrices.

Fact 3.2

(Grothendieck’s inequality, PSD) Every matrix \(B \in \mathbb {R}^{n \times n}\) satisfies

$$\begin{aligned} \max _{Z \in \mathcal {M}_\mathrm {G}^+} \left| \langle B,Z\rangle \right| \le K_\mathrm {G}\, \Vert B\Vert _{\infty \rightarrow 1}. \end{aligned}$$
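The following toy check illustrates Fact 3.2 numerically: it samples a random matrix B and a random element of \(\mathcal{M}_\mathrm{G}^+\) (built as \(XX^\mathsf{T}\) with rows of X in the unit ball) and compares \(|\langle B,Z\rangle |\) with \(1.783\,\Vert B\Vert _{\infty \rightarrow 1}\), reusing the brute-force helper norm_inf_to_1 from the previous sketch. It is purely illustrative and assumes numpy.

```python
import numpy as np

def check_fact_3_2(n=10, seed=1):
    """Sample B and Z in M_G^+ and compare |<B, Z>| with 1.783 * ||B||_{infty->1} (Fact 3.2)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    X = rng.standard_normal((n, n))
    # rescale rows to lie in the unit Euclidean ball, so Z = X X^T has diag(Z) <= 1
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)
    Z = X @ X.T
    lhs = abs(np.sum(B * Z))
    rhs = 1.783 * norm_inf_to_1(B)
    return lhs, rhs   # lhs should not exceed rhs
```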

3.2 Semidefinite programming

To keep the discussion sufficiently general, let us consider the following class of optimization programs:

$$\begin{aligned} \text {maximize}\, \langle B,Z\rangle \quad \text {subject to} \quad Z \in \mathcal {M}_{\mathrm {opt}}. \end{aligned}$$
(3.5)

Here \(\mathcal {M}_{\mathrm {opt}}\) can be any subset of Grothendieck’s set \(\mathcal {M}_\mathrm {G}^+\) defined in (3.4). A good example is where B is the adjacency matrix of a random graph, possibly shifted by a constant matrix. For example, the semidefinite program (1.4) is of the form (3.5) with \(\mathcal {M}_{\mathrm {opt}}= \mathcal {M}_\mathrm {G}^+\) and \(B = A - \lambda E_n\).

Imagine that there is a similar but simpler problem where B is replaced by a certain reference matrix R, that is

$$\begin{aligned} \text {maximize}\, \langle R,Z\rangle \quad \text {subject to} \quad Z \in \mathcal {M}_{\mathrm {opt}}. \end{aligned}$$
(3.6)

A good example is where B is a random matrix and \(R = \mathbb {E}B\); this will be the case in the proof of Theorem 1.3. Let \(\widehat{Z}\) and \(Z_R\) be the solutions of the original problem (3.5) and the reference problem (3.6) respectively, thus

$$\begin{aligned} \widehat{Z}:= \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle B,Z\rangle , \quad Z_R := \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle R,Z\rangle . \end{aligned}$$

The next lemma shows that \(\widehat{Z}\) provides an almost optimal solution to the reference problem if the original and reference matrices B and R are close.

Lemma 3.3

(\(\widehat{Z}\) almost maximizes the reference objective function) We have

$$\begin{aligned} \langle R,Z_R\rangle - 2 K_\mathrm {G}\Vert B-R\Vert _{\infty \rightarrow 1} \le \langle R,\widehat{Z}\rangle \le \langle R,Z_R\rangle . \end{aligned}$$
(3.7)

Proof

The upper bound is trivial by definition of \(Z_R\). The lower bound is based on Fact 3.2, which implies that for every \(Z \in \mathcal {M}_{\mathrm {opt}}\), one has

$$\begin{aligned} |\langle B-R,Z\rangle | \le K_\mathrm {G}\Vert B-R\Vert _{\infty \rightarrow 1} =: \varepsilon . \end{aligned}$$
(3.8)

Now, to prove the lower bound in (3.7), we will first replace R by B using (3.8), then replace \(\widehat{Z}\) by \(Z_R\) using the fact that \(\widehat{Z}\) is a maximizer for \(\langle B,Z\rangle \), and finally replace back B by R using (3.8) again. This way we obtain

$$\begin{aligned} \langle R,\widehat{Z}\rangle \ge \langle B,\widehat{Z}\rangle - \varepsilon \ge \langle B,Z_R\rangle - \varepsilon \ge \langle R,Z_R\rangle - 2\varepsilon . \end{aligned}$$

This completes the proof of Lemma 3.3. \(\square \)

4 Deviation in the cut norm

To be able to effectively use Lemma 3.3, we will now show how to bound the cut norm of random matrices.

Lemma 4.1

(Deviation in \(\ell _\infty \rightarrow \ell _1\) norm) Let \(A = (a_{ij}) \in \mathbb {R}^{n \times n}\) be a symmetric matrix whose diagonal entries equal 1, whose entries above the diagonal are independent random variables satisfying \(0 \le a_{ij} \le 1\). Assume that

$$\begin{aligned} \bar{p}:= \frac{2}{n(n-1)} \sum _{i<j} \hbox {Var}(a_{ij})\ge \frac{9}{n}. \end{aligned}$$
(4.1)

Then, with probability at least \(1- e^3 5^{-n}\), we have

$$\begin{aligned} \Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1} \le 3 \, \bar{p}^{1/2} n^{3/2}. \end{aligned}$$

We will shortly deduce Lemma 4.1 from Bernstein’s inequality followed by a union bound over \(x,y \in \{-1,1\}^n\); arguments of this type are standard in the analysis of random graphs (see e.g. [12, Section 2.3]). But before we do this, let us pause to explain the conclusion of Lemma 4.1.

Remark 4.2

(Regularization effect of \(\ell _\infty \rightarrow \ell _1\) norm) Let us test Lemma 4.1 on the simple example where A is the adjacency matrix of a sparse Erdős-Rényi random graph G(n, p) with \(p=a/n\), \(a \ge 1\). Here we have \(\bar{p}= p(1-p) \le p = a/n\). Lemma 4.1 states that \(\Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1} \le 3 a^{1/2} n\). This can be compared with \(\Vert \mathbb {E}A\Vert _{\infty \rightarrow 1} = (1 + p(n-1)) n \ge an\). So we obtain

$$\begin{aligned} \Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1} \le 3 a^{-1/2} \, \Vert \mathbb {E}A\Vert _{\infty \rightarrow 1}. \end{aligned}$$

This deviation inequality is good when a exceeds a sufficiently large absolute constant. Since \(a = pn\) is the expected average degree of the graph, it follows that we can handle graphs with bounded expected degrees.

This is a good place to note the importance of the \(\ell _\infty \rightarrow \ell _1\) norm. Indeed, for the spectral norm a similar concentration inequality would fail. As is well known and easy to check, for \(a =O(1)\) one would have \(\Vert A - \mathbb {E}A\Vert \gg \Vert \mathbb {E}A\Vert \) due to contributions from high degree vertices. In fact, those are the only obstructions to concentration. Indeed, according to a result of U. Feige and E. Ofek [31], the removal of high-degree vertices forces a non-trivial concentration inequality to hold in the spectral norm. In contrast to this, the \(\ell _\infty \rightarrow \ell _1\) norm does not feel the vertices with high degrees. It has an automatic regularization effect, which averages the contributions of all vertices, and in particular the few high degree vertices.
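A small simulation can illustrate the quantities in Lemma 4.1 and Remark 4.2: generate a sparse G(n, p) graph with loops, compute \(\Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1}\) by brute force (using the helper norm_inf_to_1 from the sketch in Sect. 3.1), and compare it with \(3 \, \bar{p}^{1/2} n^{3/2}\) and with \(\Vert \mathbb {E}A\Vert _{\infty \rightarrow 1}\). The values of n and a below are illustrative and too small for assumption (4.1) to hold, so this is only a sanity check of the quantities involved, not a verification of the lemma.

```python
import numpy as np

def deviation_experiment(n=14, a=4.0, seed=0):
    """Toy comparison of ||A - EA||_{infty->1} with 3 * pbar^{1/2} * n^{3/2}
    and with ||EA||_{infty->1} for a small G(n, p) graph, p = a/n (sketch)."""
    rng = np.random.default_rng(seed)
    p = a / n
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)  # edges above the diagonal
    A = upper + upper.T + np.eye(n)                             # symmetrize, add loops
    EA = p * (np.ones((n, n)) - np.eye(n)) + np.eye(n)
    pbar = p * (1 - p)
    deviation = norm_inf_to_1(A - EA)     # brute-force helper from Sect. 3.1's sketch
    # ||EA||_{infty->1} equals the sum of all entries since EA is non-negative
    return deviation, 3.0 * np.sqrt(pbar) * n ** 1.5, np.abs(EA).sum()
```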

The proof of Lemma 4.1 will be based on Bernstein’s inequality, which we quote here (see, for example, Theorem 1.2.6 in [21]).

Theorem 4.3

(Bernstein’s inequality) Let \(Y_1,\ldots ,Y_N\) be independent random variables such that \(\mathbb {E}Y_k = 0\) and \(|Y_k| \le M\). Denote \(\sigma ^2 = \frac{1}{N} \sum _{k=1}^N \hbox {Var}(Y_k)\). Then for any \(t \ge 0\), one has

$$\begin{aligned} \mathbb {P}_{} \left\{ \frac{1}{N} \sum _{k=1}^N Y_k > t \right\} \le \exp \left( - \frac{N t^2/2}{\sigma ^2 + Mt/3} \right) . \end{aligned}$$

Proof of Lemma 4.1

Recalling the definition (3.3) of the \(\ell _\infty \rightarrow \ell _1\) norm, we see that we need to bound

$$\begin{aligned} \Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1} = \max _{x,y \in \{-1,1\}^n} \sum _{i,j=1}^n (a_{ij} - \mathbb {E}a_{ij}) x_i y_j. \end{aligned}$$
(4.2)

Let us fix \(x,y \in \{-1,1\}^n\). Using the symmetry of \(A - \mathbb {E}A\), the fact that diagonal entries of \(A - \mathbb {E}A\) vanish and collecting the identical terms, we can express the sum in (4.2) as a sum of independent random variables

$$\begin{aligned} \sum _{i < j} X_{ij}, \quad \text {where} \quad X_{ij} = 2(a_{ij} - \mathbb {E}a_{ij}) x_i y_j. \end{aligned}$$

To control the sum \(\sum _{i < j} X_{ij}\) we can use Bernstein’s inequality, Theorem 4.3. There are \(N = \frac{n(n-1)}{2}\) terms in this sum. Since \(|x_i| = |y_i| = 1\) for all i, the average variance \(\sigma ^2\) of all terms \(X_{ij}\) is at most \(2^2\) times the average variance of all \(a_{ij}\), which is \(\bar{p}\). In other words, \(\sigma ^2 \le 4 \bar{p}\). Furthermore, \(|X_{ij}| \le 2 |a_{ij} - \mathbb {E}a_{ij}| \le 2\) since \(0 \le a_{ij} \le 1\) by assumption. Hence \(M \le 2\). It follows that

$$\begin{aligned} \mathbb {P}_{} \left\{ \frac{1}{N} \sum _{i < j} X_{ij} > t \right\} \le \exp \left( - \frac{N t^2/2}{4 \bar{p}+ 2t/3} \right) . \end{aligned}$$
(4.3)

Let us substitute \(t = 6 \, (\bar{p}/n)^{1/2}\) here. Rearranging the terms and using that \(N = \frac{n(n-1)}{2}\) and \(\bar{p}> 9/n\) (so that \(t < 2 \bar{p}\)), we conclude that the probability in (4.3) is bounded by \(\exp (-3(n -1))\).

Summarizing, we have proved that for every \({x,y \in \{-1,1\}^n}\)

$$\begin{aligned} \mathbb {P}_{} \left\{ \frac{2}{n(n-1)} \sum _{i,j=1}^n (a_{ij} - \mathbb {E}a_{ij}) x_i y_j > 6 \Big ( \frac{\bar{p}}{n} \Big )^{1/2} \right\} \le e^{-3(n-1)}. \end{aligned}$$

Taking a union bound over all \(2^{2n}\) pairs (x, y), we conclude that

$$\begin{aligned} \mathbb {P}_{} \left\{ \max _{x,y \in \{-1,1\}^n} \frac{2}{n(n-1)} \sum _{i,j=1}^n (a_{ij} - \mathbb {E}a_{ij}) x_i y_j > 6 \Big ( \frac{\bar{p}}{n} \Big )^{1/2} \right\}&\le 2^{2n} \cdot e^{-3(n-1)} \\&\le e^3 \cdot 5^{-n}. \end{aligned}$$

Rearranging the terms and using the definition (4.2) of the \(\ell _\infty \rightarrow \ell _1\) norm, we conclude the proof of Lemma 4.1. \(\square \)

Remark 4.4

(The sum of entries) Note that by definition, the quantity \(\big | \sum _{i,j=1}^n (a_{ij} - \mathbb {E}a_{ij}) \big |\) is bounded by \(\Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1}\), and thus it can be controlled by Lemma 4.1. Alternatively, a bound on this quantity follows directly from the last line of the proof of Lemma 4.1. For future reference, we express it in the following way:

$$\begin{aligned} \frac{2}{n(n-1)} \Big | \sum _{i < j} (a_{ij} - \mathbb {E}a_{ij}) \Big | \le 3 \, \bar{p}^{1/2} n^{-1/2}. \end{aligned}$$

5 Stochastic block model: proof of Theorem 1.1

So far our discussion has been general, and the results could be applied to a variety of semidefinite programs on random graphs. In this section, we specialize to the community detection problem considered in Theorem 1.1. Thus we are going to analyze the optimization problem (1.4), where A is the adjacency matrix of a random graph distributed according to the classical stochastic block model G(n, p, q).

As we already noticed, this is a particular case of the class of problems (3.5) that we analyzed in Sect. 3.2. In our case,

$$\begin{aligned} B := A - \lambda E_n \end{aligned}$$

with \(\lambda \) defined in (1.5), and the feasible set is

$$\begin{aligned} \mathcal {M}_{\mathrm {opt}}:= \mathcal {M}_\mathrm {G}^+ = \left\{ Z :\; Z \succeq 0, \; \hbox {diag}(Z) \preceq \mathbf{I }_n \right\} . \end{aligned}$$

5.1 The maximizer of the reference objective function

In order to successfully apply Lemma 3.3, we will now choose a reference matrix R so that it is close to (but also conveniently simpler than) the expectation of B. To do so, we can assume without loss of generality that \(\mathcal {C}_1 = \{1,\ldots , n/2\}\) and \(\mathcal {C}_2 = \{ n/2+1,\ldots ,n\}\). Then we define R as a block matrix

$$\begin{aligned} R = \frac{p-q}{2} \begin{bmatrix} E_{n/2}&-E_{n/2} \\ -E_{n/2}&E_{n/2} \end{bmatrix} \end{aligned}$$
(5.1)

where as usual \(E_{n/2}\) denotes the \(n/2 \times n/2\) matrix all of whose entries equal 1.

Let us compute the expected value \(\mathbb {E}B = \mathbb {E}A - (\mathbb {E}\lambda ) E_n\) and compare it to R. To do so, note that the expected value of A has the form

$$\begin{aligned} \mathbb {E}A = \begin{bmatrix} p E_{n/2}&\quad q E_{n/2} \\ q E_{n/2}&\quad p E_{n/2} \end{bmatrix} +(1-p) I_n. \end{aligned}$$

(The contribution of the identity matrix \(I_n\) is required here since the diagonal entries of A and thus of \(\mathbb {E}A\) equal 1 due to the self-loops.) Furthermore, the definition of \(\lambda \) in (1.5) easily implies that

$$\begin{aligned} \mathbb {E}\lambda = \frac{1}{n(n-1)} \sum _{i \ne j} \mathbb {E}a_{ij} = \frac{p+q}{2} \frac{n^2}{n(n-1)} - \frac{p}{n-1} = \frac{p+q}{2} - \frac{p-q}{n-1}. \end{aligned}$$
(5.2)

Thus

$$\begin{aligned} \mathbb {E}B = \mathbb {E}A - (\mathbb {E}\lambda ) E_n = R + (1-p) \mathbf{I }_n - \frac{p-q}{n-1} E_n. \end{aligned}$$
(5.3)

In the near future we will think of R as the leading term and of the other two terms as negligible, so (5.3) intuitively states that \(R \approx \mathbb {E}B\). We save this fact for later.

Using the simple form of R, we can easily determine the form of the solution \(Z_R\) of the reference problem (3.6).

Lemma 5.1

(The maximizer of the reference objective function) We have

$$\begin{aligned} Z_R := \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle R,Z\rangle = \begin{bmatrix} E_{n/2}&-E_{n/2} \\ -E_{n/2}&E_{n/2} \end{bmatrix}. \end{aligned}$$

Proof

Let us first evaluate the maximizer of \(\langle R,Z\rangle \) on the larger set \([-1,1]^{n \times n}\), which contains the feasible set \(\mathcal {M}_{\mathrm {opt}}\) according to (3.4). Taking into account the form of R in (5.1), one can quickly check that the maximizer of \(\langle R,Z\rangle \) on \([-1,1]^{n \times n}\) is \(Z_R\). Since \(Z_R\) belongs to the smaller set \(\mathcal {M}_{\mathrm {opt}}\), it must be the maximizer on that set as well. \(\square \)

5.2 Bounding the error

We are going to conclude from Lemma 4.1 and Lemma 3.3 that the maximizer of the actual objective function,

$$\begin{aligned} \widehat{Z}= \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle B,Z\rangle , \end{aligned}$$

must be close to \(Z_R\), the maximizer of the reference objective function.

Lemma 5.2

(Maximizers of random and reference functions are close) Assume that \(\bar{p}\) satisfies (4.1). Then, with probability at least \(1- e^3 5^{-n}\), we have

$$\begin{aligned} \Vert \widehat{Z}- Z_R\Vert _2^2 \le \frac{116 \, \bar{p}^{1/2} n^{3/2}}{p-q}. \end{aligned}$$

Proof

We expand

$$\begin{aligned} \Vert \widehat{Z}- Z_R\Vert _2^2 = \Vert \widehat{Z}\Vert _2^2 + \Vert Z_R\Vert _2^2 - 2 \langle \widehat{Z},Z_R\rangle \end{aligned}$$
(5.4)

and control the three terms separately.

Note that \(\Vert \widehat{Z}\Vert _2^2 \le n^2\) since \(\widehat{Z}\in \mathcal {M}_{\mathrm {opt}}\subset [-1,1]^{n \times n}\) according to (3.4). Next, we have \(\Vert Z_R\Vert _2^2 = n^2\) by Lemma 5.1. Thus

$$\begin{aligned} \Vert \widehat{Z}\Vert _2^2 \le \Vert Z_R\Vert _2^2. \end{aligned}$$
(5.5)

Finally, we use Lemma 3.3 to control the cross term in (5.4). To do this, notice that (5.1) and Lemma 5.1 imply that \(R = \frac{p-q}{2} \cdot Z_R\). Then, by homogeneity, the conclusion of Lemma 3.3 implies that

$$\begin{aligned} \langle Z_R,\widehat{Z}\rangle \ge \langle Z_R,Z_R\rangle -\frac{4 K_\mathrm {G}}{p-q} \Vert R-B \Vert _{\infty \rightarrow 1}. \end{aligned}$$
(5.6)

To bound the norm of \(R-B\), let us express this matrix as

$$\begin{aligned} B - R = (B - \mathbb {E}B) + (\mathbb {E}B - R) = (A - \mathbb {E}A) - (\lambda - \mathbb {E}\lambda ) E_n + (\mathbb {E}B - R)\quad \end{aligned}$$
(5.7)

and bound each of the three terms separately. According to Lemma 4.1 and Remark 4.4, we obtain that with probability larger than \(1 - e^3 5^{-n}\),

$$\begin{aligned} \Vert A - \mathbb {E}A\Vert _{\infty \rightarrow 1} \le 3 \bar{p}^{1/2} n^{3/2} \quad \text {and} \quad |\lambda - \mathbb {E}\lambda | \le 3 \bar{p}^{1/2} n^{-1/2}. \end{aligned}$$

Moreover, according to (5.3),

$$\begin{aligned} \mathbb {E}B - R = (1-p) \mathbf{I }_n - \frac{p-q}{n-1} E_n. \end{aligned}$$

Substituting these bounds into (5.7) and using the triangle inequality along with the facts that \(\Vert E_n\Vert _{\infty \rightarrow 1} = n^2\), \(\Vert \mathbf{I }_n\Vert _{\infty \rightarrow 1} = n\), we obtain

$$\begin{aligned} \Vert B - R \Vert _{\infty \rightarrow 1} \le 6 \bar{p}^{1/2} n^{3/2} + (1-p)n + \frac{(p-q) n^2}{n-1}. \end{aligned}$$

Since \(\bar{p}\ge 9/n\), one can check that each of the last two terms is bounded by \(\bar{p}^{1/2} n^{3/2}\). Thus we obtain \(\Vert B - R \Vert _{\infty \rightarrow 1} \le 8 \bar{p}^{1/2} n^{3/2}\). Substituting into (5.6), we conclude that

$$\begin{aligned} \langle Z_R,\widehat{Z}\rangle \ge \langle Z_R,Z_R\rangle - 8 \bar{p}^{1/2} n^{3/2} \cdot \frac{4 K_\mathrm {G}}{p-q}. \end{aligned}$$

Recalling from (3.1) that Grothendieck’s constant \(K_\mathrm {G}\) is bounded by 1.783, we can replace \(8 \cdot 4 K_\mathrm {G}\) by 58 in this bound. Substituting this and (5.5) into (5.4), we conclude that

$$\begin{aligned} \Vert \widehat{Z}- Z_R\Vert _2^2 \le 2 \Vert Z_R\Vert _2^2 - 2 \langle \widehat{Z},Z_R\rangle \le \frac{116 \, \bar{p}^{1/2} n^{3/2}}{p-q}. \end{aligned}$$

The proof of Lemma 5.2 is complete. \(\square \)

Proof of Theorem 1.1

The conclusion of the theorem will quickly follow from Lemma 5.2. Let us check the lemma’s assumption (4.1) on \(\bar{p}\). A quick computation yields

$$\begin{aligned} \bar{p}= \frac{2}{n(n-1)} \sum _{i <j} \hbox {Var}(a_{ij}) = \frac{p(1-p)(n-2)}{2(n-1)} + \frac{q(1-q)n}{2(n-1)}. \end{aligned}$$
(5.8)

Since \(p(1-p) \le 1/4\), we get

$$\begin{aligned} \bar{p}\ge \frac{1}{2} \max \left\{ p(1-p), q(1-q) \right\} - \frac{1}{8(n-1)} > \frac{9}{n} \end{aligned}$$

where the last inequality follows from an assumption of Theorem 1.1. Thus the assumption (4.1) holds, and we can apply Lemma 5.2. It states that

$$\begin{aligned} \Vert \widehat{Z}- Z_R\Vert _2^2 \le \frac{116 \, \bar{p}^{1/2} n^{3/2}}{p-q} \end{aligned}$$
(5.9)

with probability at least \(1- e^3 5^{-n}\). From (5.8), it is not difficult to see that \(\bar{p}\le \frac{p+q}{2}\). Substituting this into (5.9) and expressing \(p = a/n\) and \(q=b/n\), we conclude that

$$\begin{aligned} \Vert \widehat{Z}- Z_R\Vert _2^2 \le \frac{116 \sqrt{(a+b)/2}}{a-b} \cdot n^2. \end{aligned}$$

Rearranging the terms, we can see that this expression is bounded by \(\varepsilon n^2\) if

$$\begin{aligned} (a-b)^2 \ge 7 \cdot 10^3 \varepsilon ^{-2} (a+b). \end{aligned}$$

But this inequality follows from the assumption (1.6).

It remains to recall that according to Lemma 5.1, we have \(Z_R = \bar{x}\bar{x}^\mathsf {T}\) where \(\bar{x}= [\mathbf{1}_{n/2} \; -\mathbf{1}_{n/2}] \in \mathbb {R}^n\) is the community membership vector defined in (1.3). Theorem 1.1 is proved. \(\square \)

Proof of Corollary 1.2

The result follows from the Davis-Kahan theorem [28] about the stability of eigenvectors under matrix perturbations. The largest eigenvalue of \( \bar{x}\bar{x}^\mathsf {T}\) is n while all the others are 0, so the spectral gap equals n. Expressing \(\widehat{Z}= (\widehat{Z}- \bar{x}\bar{x}^\mathsf {T})+ \bar{x}\bar{x}^\mathsf {T}\) and using that \(\Vert \widehat{Z}- \bar{x}\bar{x}^\mathsf {T}\Vert _2 \le \sqrt{\varepsilon }n\), we obtain from the Davis-Kahan theorem (see for example Corollary 3 in [66]) that

$$\begin{aligned} \Vert \widehat{v} - \bar{v} \Vert _2 = 2 | \sin (\theta /2) | \le C \sqrt{\varepsilon }. \end{aligned}$$

Here \(\hat{v}\) and \(\bar{v}\) denote the unit-norm eigenvectors associated to the largest eigenvalues of \(\widehat{Z}\) and \(\bar{x}\bar{x}^\mathsf {T}\) respectively, and \(\theta \in [0, \pi /2]\) is the angle between these two vectors. By definition, \(\widehat{x}= \sqrt{n} \widehat{v}\) and \(\bar{x}= \sqrt{n} \bar{v}\). This concludes the proof. \(\square \)

6 General stochastic block model: proof of Theorem 1.3

In this section we focus on the community detection problem for the general stochastic block-model considered in Theorem 1.3. The semidefinite program (1.10) is a particular case of the class of problems (3.5) that we analyzed in Sect. 3.2. In our case, we set \(B:=A\), choose the reference matrix to be

$$\begin{aligned} R := \bar{A} = \mathbb {E}A, \end{aligned}$$

and consider the feasible set

$$\begin{aligned} \mathcal {M}_{\mathrm {opt}}:= \Big \{ Z :\; Z \succeq 0, \; Z \ge 0, \; \hbox {diag}(Z) \preceq \mathbf{I }_n, \; \sum _{i,j=1}^n Z_{ij} = \lambda \Big \}. \end{aligned}$$

Then \(\mathcal {M}_{\mathrm {opt}}\) is a subset of Grothendieck’s set \(\mathcal {M}_\mathrm {G}^+\) defined in (3.4). Using (3.4), we see that

$$\begin{aligned} \mathcal {M}_{\mathrm {opt}}\subset \mathcal {M}_{\mathrm {opt}}' := \Big \{ Z :\; 0 \le Z_{ij} \le 1 \text { for all } i,j; \; \; \sum _{i,j=1}^n Z_{ij} = \lambda \Big \}. \end{aligned}$$
(6.1)

6.1 The maximizer of the expected objective function

Unlike before, the reference matrix \(R = \bar{A} = \mathbb {E}A = (p_{ij})_{i,j=1}^n\) is not necessarily a block matrix like in (5.1), since the edge probabilities \(p_{ij}\) may be different for all \(i<j\). However, we will observe that the solution \(Z_R\) of the reference problem (3.6) is a block matrix, and it is in fact the cluster matrix \(\bar{Z}\) defined in (1.9).

Lemma 6.1

(The maximizer of the expected objective function) We have

$$\begin{aligned} Z_R := \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle \bar{A},Z\rangle = \bar{Z}. \end{aligned}$$
(6.2)

Proof

Let us first compute the maximizer on the larger set \(\mathcal {M}_{\mathrm {opt}}'\), which contains the feasible set \(\mathcal {M}_{\mathrm {opt}}\) according to (6.1). The maximum of the linear form \(\langle \bar{A},Z\rangle \) on the convex set \(\mathcal {M}_{\mathrm {opt}}'\) is attained at an extreme point. These extreme points are 0/1 matrices with \(\lambda \) ones. Thus the maximizer of \(\langle \bar{A},Z\rangle \) has the ones at the locations of the \(\lambda \) largest entries of \(\bar{A}\).

From the definition of the general stochastic block model we can recall that \(\bar{A} = (p_{ij})\) has two types of entries. The entries that are at least p occupy the community blocks \(\mathcal {C}_k \times \mathcal {C}_k\), \(k=1,\ldots ,K\). The number of such large entries is the same as the number of ones in the cluster matrix \(\bar{Z}\), which in turn equals \(\lambda \) by the choice we made in Theorem 1.3. All other entries of \(\bar{A}\) are at most q. Thus the \(\lambda \) largest entries of \(\bar{A}\) form the community blocks \(\mathcal {C}_k \times \mathcal {C}_k\), \(k=1,\ldots ,K\).

Summarizing, we have shown that the maximizer of \(\langle \bar{A},Z\rangle \) on the set \(\mathcal {M}_{\mathrm {opt}}'\) is a 0/1 matrix with ones forming the community blocks \(\mathcal {C}_k \times \mathcal {C}_k\), \(k=1,\ldots ,K\). Thus the maximizer is the cluster matrix \(\bar{Z}\) from (1.9). Since \(\bar{Z}\) belongs to the smaller set \(\mathcal {M}_{\mathrm {opt}}\), it must be the maximizer on that set as well. \(\square \)

6.2 Bounding the error

We are going to conclude from Lemma 4.1 and Lemma 3.3 that the maximizer of the actual objective function,

$$\begin{aligned} \widehat{Z}= \arg \max _{Z \in \mathcal {M}_{\mathrm {opt}}} \langle A,Z\rangle , \end{aligned}$$

must be close to \(\bar{Z}\), the maximizer of the reference objective function. We will first show that the reference objective function \(\langle \bar{A},Z\rangle \) distinguishes points near its maximizer \(\bar{Z}\).

Lemma 6.2

(Expected objective function distinguishes points) Every \(Z \in \mathcal {M}_{\mathrm {opt}}\) satisfies

$$\begin{aligned} \langle \bar{A},\bar{Z} - Z\rangle \ge \frac{p-q}{2} \, \Vert \bar{Z} - Z\Vert _1. \end{aligned}$$
(6.3)

Proof

We will prove that the conclusion holds for every Z in the larger set \(\mathcal {M}_{\mathrm {opt}}'\), which contains the feasible set \(\mathcal {M}_{\mathrm {opt}}\) according to (6.1). Expanding the inner product, we can represent it as

$$\begin{aligned} \langle \bar{A},\bar{Z} - Z\rangle = \sum _{i,j=1}^n p_{ij} (\bar{Z}-Z)_{ij} = \sum _{(i,j) \in \mathrm {In}} p_{ij} (\bar{Z}-Z)_{ij} - \sum _{(i,j) \in \mathrm {Out}} p_{ij} (Z - \bar{Z})_{ij} \end{aligned}$$

where \(\mathrm {In}\) and \(\mathrm {Out}\) denote the set of edges that run within and across the communities, respectively. Formally, \(\mathrm {In}= \cup _{k=1}^K (\mathcal {C}_k \times \mathcal {C}_k)\) and \(\mathrm {Out}=\{1,\ldots , n\}^2{\setminus }\mathrm {In}\).

For the edges \((i,j) \in \mathrm {In}\), we have \(p_{ij} \ge p\) and \((\bar{Z}-Z)_{ij} \ge 0\) since \(\bar{Z}_{ij}=1\) and \(Z_{ij} \le 1\). Similarly, for the edges \((i,j) \in \mathrm {Out}\), we have \(p_{ij} \le q\) and \((Z - \bar{Z})_{ij} \ge 0\) since \(\bar{Z}_{ij}=0\) and \(Z_{ij} \ge 0\). It follows that

$$\begin{aligned} \langle \bar{A},\bar{Z} - Z\rangle \ge p S_{\mathrm {In}} - q S_{\mathrm {Out}} \end{aligned}$$
(6.4)

where

$$\begin{aligned} S_{\mathrm {In}} = \sum _{(i,j) \in \mathrm {In}} (\bar{Z}-Z)_{ij} \quad \text {and} \quad S_{\mathrm {Out}} = \sum _{(i,j) \in \mathrm {Out}} (Z - \bar{Z})_{ij}. \end{aligned}$$

Since both \(\bar{Z}\) and Z belong to \(\mathcal {M}_{\mathrm {opt}}\), the sum of all entries of both these matrices is the same (namely \(\lambda \)), so we have

$$\begin{aligned} S_{\mathrm {In}} - S_{\mathrm {Out}} = \sum _{i,j=1}^n \bar{Z}_{ij} - \sum _{i,j=1}^n Z_{ij} =0. \end{aligned}$$
(6.5)

On the other hand, as we already noticed, the terms in the sums that make up \(S_{\mathrm {In}}\) and \(S_{\mathrm {Out}}\) are all non-negative. Therefore

$$\begin{aligned} S_{\mathrm {In}} + S_{\mathrm {Out}} = \sum _{i,j=1}^n |(\bar{Z}-Z)_{ij}| = \Vert \bar{Z}-Z\Vert _1. \end{aligned}$$
(6.6)

Substituting (6.5) and (6.6) into (6.4), we obtain the conclusion (6.3). \(\square \)

Now we are ready to conclude that \(\widehat{Z}\approx \bar{Z}\).

Lemma 6.3

(Maximizers of random and expected functions are close) Assume that \(\bar{p}\) satisfies (4.1). With probability at least \(1- e^3 5^{-n}\), we have

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert _1 \le \frac{12 \,K_\mathrm {G}\, \bar{p}^{1/2} n^{3/2}}{p-q}. \end{aligned}$$

Proof

Using first Lemma 6.2, Lemma 3.3 (with \(R=\bar{A}\) and \(Z_R=\bar{Z}\) as before) and then Lemma 4.1, we obtain

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert _1 \le \frac{2}{p-q} \, \langle \bar{A},\bar{Z} - \widehat{Z}\rangle \le \frac{4 K_\mathrm {G}}{p-q} \Vert A - \bar{A}\Vert _{\infty \rightarrow 1} \le \frac{12 K_\mathrm {G}}{p-q} \bar{p}^{1/2} n^{3/2} \end{aligned}$$

with probability at least \(1- e^3 5^{-n}\). Lemma 6.3 is proved. \(\square \)

Proof of Theorem 1.3

The conclusion follows from Lemma 6.3. Indeed, substituting \(p=a/n\), \(q=b/n\) and \(\bar{p}= g/n\) and rearranging the terms, we obtain

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert _1 \le \frac{12 K_\mathrm {G}g^{1/2}}{a-b} \cdot n^2 \le \frac{22 g^{1/2}}{a-b} \cdot n^2 \end{aligned}$$

since we know from (3.1) that Grothendieck’s constant \(K_\mathrm {G}\) is bounded by 1.783. Rearranging the terms, we can see that this expression is bounded by \(\varepsilon n^2\) if \((a-b)^2 \ge 484 \, \varepsilon ^{-2} g\), which is our assumption (1.12). This proves the required bound for the \(\Vert \cdot \Vert _1\) norm.

Since for any matrix \((b_{ij})\) we have \(\sum _{i,j} |b_{ij}|^2 \le \max _{i,j} |b_{ij}| \cdot \sum _{i,j} |b_{ij}|\), we get

$$\begin{aligned} \Vert \widehat{Z}- \bar{Z}\Vert _2^2 \le \Vert \widehat{Z}- \bar{Z}\Vert _\infty \cdot \Vert \widehat{Z}- \bar{Z}\Vert _1. \end{aligned}$$

As we noted in (6.1), all entries of \(\widehat{Z}\) and \(\bar{Z}\) belong to [0, 1] hence \(\Vert \widehat{Z}- \bar{Z}\Vert _\infty \le 1\). The bound for the Frobenius norm follows and Theorem 1.3 is proved. \(\square \)