Introduction

In network science and engineering, a subfield of research is to find the network topology and nodal dynamical equations from data1. This is important because networks are ubiquitous in the real world but the details of their connection topology and the intrinsic dynamical systems governing the properties and physical observables of the network are often unknown. The details are desired not only for understanding but also for protecting, disabling, or controlling the network dynamical behaviors (depending on the specific applications), and a viable way is to solve the inverse problem of determining the network details through observational data if they are available. As for any inverse problems in mathematics and physical sciences, the network inverse problem is challenging. Previous works in this area focused on “conventional” networks with pairwise interactions only1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16. Existing methods include those which are based on drive-response3,5, adaptive synchronization2,11, noise correlation6,15, compressive sensing7,9,17, maximum likelihood estimation13,14,16, and Granger causality4,8. The data can be from continuous- or discrete-time dynamical processes. For example, the drive-response and adaptive synchronization methods use data from continuous-time nonlinear coupled systems2,3,5,11, while the maximum likelihood estimation method is suitable for data from discrete-time dynamics13,14,16. In this paper, motivated by the fact that higher-order networks have become a state-of-the-art subfield of research in network science18,19,20,21,22,23,24, we develop a reconstruction framework for finding from time-series data network topology with higher-order interactions.

While pairwise or node-to-node interactions are the familiar type in networks, it has been recognized that higher-order interactions are also ubiquitous and important. For example, in a social network, the collective recommendation of multiple friends can often be more persuasive than the recommendation of a single friend to convince the individual to buy a new product. In a rumor spreading process, a piece of false news is likely to be accepted by an individual if it is shared or promoted simultaneously by many people25,26,27. A similar situation occurs in neuronal networks, where a firing event is often the result of excitatory and inhibitory interactions among many neurons. In all these cases, the interaction arises simultaneously among a group of nodes in the network, and to describe the network by the conventional pairwise interactions is no longer adequate28: higher-order interactions beyond the pairwise relationship must be taken into account. Mathematically, higher-order interactions can be described as hypergraphs or simplicial complexes29, i.e., networks containing higher-order simplexes. In particular, a k-simplex describes the simultaneous interaction among (k + 1) nodes, where a zero-simplex specifies an isolated node (i.e., without any interaction), a 1-simplex represents the conventional pairwise interaction, a 2-simplex underlies the simultaneous interaction among three nodes, and so on.

The past three years have witnessed a growing interest in higher-order networks30. For example, random walks on hypergraphs were studied, where a walker chooses the next destination depending on the number and the size of the shared hyperedges31. A family of random walks on hypergraphs with a parameter controlling the bias of the dynamics towards hyperedges of small or large size was constructed and the impacts of walk strategy and walk time on community detection were elucidated32. The stability conditions of the general dynamical processes on hypergraphs were found18, and a social contagion model on hypergraphs was constructed which presents dynamical phenomena such as first- and second-order transitions, bistability and hysteresis33. A simplicial model of social contagion was proposed and it was demonstrated that the reinforcement mechanisms in 2-simplex can lead to a discontinuous phase transition34. The impacts of the heterogeneity of simplicial complexes on the SIS (susceptible-infected-susceptible) spreading model with collective and individual contagion were analyzed35, and a pair approximation theory to study the SIS dynamics in simplicial complexes was developed, which was argued to be more accurate than the Markov-chain and mean field methods36. A social communication model including idea integration and information transmission in simplicial complexes was proposed and the critical condition leading to the outbreak of information was identified37. A simplicial activity driven model was proposed and the impact of both simplicial and temporally evolving interactions were analyzed38. In terms of network reconstruction, a statistical method to detect higher-order interactions from network data of pairwise links has recently been developed21.

In this paper, we develop a framework to reconstruct complex networks with higher-order interactions from data. To be concrete, we focus on networks with 2-simplexes and assume that the dynamical processes on the network are social contagion and simplicial Ising dynamics that generate binary time-series data. Our method is of the statistical inference type pivoted on maximum likelihood estimation, with the aim to fully reconstruct both pairwise interactions (links) and 2-simplexes at the same time, thereby distinguishing our work from the recent method based on link data21. In particular, the central task is to estimate the probabilities of each node connecting to the reconstruction or target node (pairwise interaction) and of any two nodes forming a three-body 2-simplex with the target node. We articulate a two-step process to greatly enhance the computational efficiency and an effective truncation process to determine the final reconstructed structure of the simplicial complex. Using three synthetic and four real-world simplicial complexes, we demonstrate the accuracy of our reconstruction method and establish its robustness with respect to variations in the average degree of the network and stochastic fluctuations. Our work represents an initial effort in reconstructing complex networks with higher-order interactions based on observed time-series data.

Results

Simplicial complexes

A k-simplex σ is formed by a filled clique of a set of k + 1 nodes \(\left[{v}_{0},\cdots \,,{v}_{k}\right]\), which defines a (k + 1)-body interaction39. A 1-simplex is two nodes connected by an edge, a 2-simplex is three nodes connected pairwisely by edges and with an additional single face, i.e., a triangle, and a 3-simplex is four vertices connected pairwisely by edges and joined by four faces, which are filled in to form a solid tetrahedron, and so on. A simplicial complex \({{{{{{{\mathcal{K}}}}}}}}\) composed of a set of nodes \({{{{{{{\mathcal{V}}}}}}}}\) is a collection of simplexes, with the additional requirement39,40 that if a simplex is in \({{{{{{{\mathcal{K}}}}}}}}\) (\(\sigma \,\in\, {{{{{{{\mathcal{K}}}}}}}}\)), then any simplex ϱ composed of subsets of simplex σ should also be included in \({{{{{{{\mathcal{K}}}}}}}}\). For example, a 2-simplicial complex \({{{{{{{\mathcal{K}}}}}}}}\) is a collection of 0-, 1- and 2-simplexes.

Social contagion dynamics

Peer influence and reinforcement mechanisms are ubiquitous in the dynamical process of social contagion41, from which higher-order interactions in the network are originated. A social contagion model taking reinforcement into account on 2-simplicial complexes was proposed34, which exploits the SIS type of spreading dynamics with binary-state dynamical variables. In particular, let \({S}_{i}^{t}\) be the state of node i at time t. Each node has two possible states: susceptible (\({S}_{i}^{t}\,=\,0\)) or infected (\({S}_{i}^{t}\,=\,1\)). At the initial time, a fraction ρ0 of nodes is infected. A susceptible node i can get infection from an infected neighbor j through their pairwise interaction (i, j) with probability β1. Node i can also be infected through a 2-simplex (i, j, k), where both j and k have already been infected, with the probability β2, and this event can be understood as a synergistic reinforcement effect. For convenience, we set β1 = α/k1 and β2 = ω/k2, where α and ω are two non-zero positive constants, k1 and k2 are the average degrees of the two-body and three-body interactions in a 2-simplicial complex, respectively. In general, we have α < ω so as to ensure that the role of 2-simplex is embodied in the spreading dynamics. Each infected node recovers to the susceptible state with the probability μ. In our work, the values of β1 and β2 are selected near their respective thresholds to facilitate efficient and accurate reconstruction of 2-simplicial complexes. The effects of varying the values of β1 and β2 on the reconstruction accuracy are also studied (Sec. I of Supplementary Information (SI)).

Simplicial Ising dynamics

The Ising model arises in many fields due to its fundamental role in phase transitions in statistical physics. It has also been applied to many social systems42,43. While the Ising dynamics on networks have been extensively studied44,45,46, previous studies were exclusively conducted for networks with pairwise interactions only. To our knowledge, Ising dynamics on networks with higher-order interactions have not been studied.

To address the synergistic reinforcement effect of 2-simplex, we define a simplicial Ising dynamics on 2-simplicial complexes. Each node has two possible states: spin-down (\({S}_{i}^{t}\,=\,-1\)) or spin-up (\({S}_{i}^{t}\,=\,+1\)). At the initial time, the state of each node i is randomly assigned as +1 or −1 with equal probability. Defining the Hamiltonian as:

$$H(t)\,=\,-{J}_{1}\mathop{\sum}\limits_{(i,j)}{S}_{i}^{t}{S}_{j}^{t}\,-\,{J}_{2}\mathop{\sum}\limits_{(i,j,k)}{S}_{i}^{t}{S}_{j}^{t}{S}_{k}^{t},$$
(1)

where J1 and J2 are the strengths of two-body and three-body interactions, and (i, j) and (i, j, k) denote the two-body and the three-body connections in the 2-simplicial complex, respectively. The first term in the Hamiltonian characterizes the interaction between the edges (i.e., two-body connections) and the second term contains three-body interactions from the 2-simplex. At each time step, the spin-flipping probability of each node i is given by \({(1\,+\,{e}^{\delta {{\Delta }}{E}_{i}^{t}})}^{-1}\), where δ is the inverse temperature. The quantity

$${{\Delta }}{E}_{i}^{t}\,=\,2{J}_{1}\mathop{\sum}\limits_{(i,j)\,\in\, {\partial }_{i}}{S}_{i}^{t}{S}_{j}^{t}\,+\,2{J}_{2}\mathop{\sum}\limits_{(i,j,k)\,\in\, {\nabla }_{i}}{S}_{i}^{t}{S}_{j}^{t}{S}_{k}^{t}$$

represents the change in the energy caused by a flipping of node i at time t, where ∂i and i are the 1-simplex set and the 2-simplex set containing node i, respectively.

Statistical inference framework

For SIS and Ising processes taking place on a 2-simplicial complex of size N, the available time-series data representing the states of nodes at different time steps can be stored in a data matrix S, where each row is a time string representing all nodes’ states at that time step and each column is a node’s state at different time steps. We reconstruct 2-simplicial complexes from the data matrix S with our statistical inference framework. This task consists of three steps: (1) establishing the likelihood function based on the available data matrix S; (2) obtaining the connection probabilities of two- and three-body interactions by maximizing the likelihood function according to the idea of the expectation maximization (EM) method, and (3) executing an improved two-step reconstruction strategy to significantly increase the computational efficiency. The details of the framework are described in Methods.

Quantification of reconstruction performance

We use F1 score to quantify the reconstruction accuracy47, a global performance indicator defined as

$${{{{{{{\rm{F}}}}}}}}1\,=\,\frac{2P\,*\, R}{P\,+\,R},$$
(2)

where P = TP/(TP + FP) and R = TP/(TP + FN), with TP, FP, TN, FN being the numbers of true positive, false positive, true negative and false negative classes, respectively. A larger value of F1 corresponds to a higher accuracy and F1 = 1 indicates that the original network structure has been fully reconstructed with zero error.

Reconstructing synthetic and real-world simplicial complexes

For readability, the results from the social contagion dynamics are presented in the main text, while those from the simplicial Ising dynamics are presented in Sec. III of SI.

Figure 1 presents results of reconstructing three synthetic 2-simplicial complexes (see Sec. “Construction of synthetic and real-world 2-simplicial complexes” in Methods on how these networks are constructed), where squares, diamonds and circles denote the performance of reconstructing the two-body connections while triangles with different orientations demonstrate the performance of reconstructing three-body connections. Several features can be seen from Fig. 1. First, the reconstruction accuracy increases with the length T of the time series and can reach the unity value for T 8000. Second, the average degrees k1 and k2 of two-body and three-body simplexes, respectively, have different effects on the reconstruction accuracy. In particular, as shown in Fig. 1a–c, a small value of k1 tends to increase the reconstruction accuracy of both types of simplexes. This can be understood by noting that a small value of k1 means that there are fewer two-body connections that need to be reconstructed, thereby enhancing the accuracy of the two-body connections for the same length of the time series. At the same time, fewer two-body connections reduce the complexity in reconstructing three-body connections and thereby improving the reconstruction accuracy. Regarding the effects of k2, Fig. 1d–f reveal that its value affects only the reconstruction accuracy of three-body connections and has little effect on the accuracy of reconstructing two-body connections that have no dependence on the three-body connections in a 2-simplicial complex. Third, the reconstruction accuracy of three-body interactions is lower than that of two-body interactions owing to the complicated structure of former and its dependence on the latter.

Fig. 1: Reconstruction performance for synthetic 2-simplicial complexes.
figure 1

Shown is F1 score as a function of the length T of the observational binary time series for three synthetic 2-simplicial complexes: (a, d) random simplicial complex (ERSC), (b, e) scale-free simplicial complex (SFSC), and (c, f) small-world simplicial complex (SWSC). In each panel, squares, diamonds and circles denote the performance of reconstructing the two-body connections while triangles with different orientations demonstrate the performance of reconstructing three-body connections, and different values of the average degree are distinguished by colors. All simplicial complexes have the same size N = 200. Other parameter values are α = 0.8, ω = 2.4, ρ0 = 0.2, and μ = 1. The results are averaged over five realizations.

Figure 2 shows the results of reconstructing four real-world 2-simplicial complexes: Hypertext200948, Thiers1249, InVS1550, and LyonSchool51,52 (see Sec. “Construction of synthetic and real-world 2-simplicial complexes” in Methods for the details of these real-world networks). The basic parameters of these 2-simplicial complexes constructed from the datasets are listed in Table 1. It can be seen from Fig. 2 that, as for the real-world networks, the reconstruction accuracies for both the two-body and three-body interactions increase with the length of the time series. Remarkably, these network structures are quite irregular, complicating the reconstruction. Nonetheless, for T = 20,000, the F1 score can exceed 80%.

Fig. 2: Visualization and reconstruction performance for real-world 2-simplicial complexes.
figure 2

Visualization results for (a) Hypertext2009, (b) Thiers12, (c) InVS15, and (d) LyonSchool. The corresponding reconstruction performances are shown in panels (eh) in terms of F1 score versus the length T of the available time series. The blue squares and red circles demonstrate the reconstruction performance of two-body and three-body connections, respectively. Parameter values are α = 0.3, ω = 1, ρ0 = 0.2, and μ = 1. Each data point is the result of averaging over five realizations.

Table 1 Basic parameters of the four 2-simplicial complexes constructed from real-world datasets.

An issue of practical significance is the robustness of our reconstruction framework against random perturbations. To address this issue, we randomly flip a fraction f of infected states and the same number of susceptible states in the data matrix S (see Sec. “Details of the statistical inference framework” in Methods) and investigate the effect of f on the reconstruction accuracy as characterized by F1. The results are shown in Fig. 3 for three synthetic 2-simplicial complexes and three real-world 2-simplicial complexes. It can be seen that increasing the fraction f of flipping leads to a reduction of F1. In particular, the value of F1 for the two-body connections can be as high as 50% even when 30% of the infected states have been flipped (f = 0.3), attesting to the robustness of our framework in reconstructing pairwise links against stochastic fluctuations in the data.

Fig. 3: Effect of random flipping ratio f on reconstruction performance for synthetic and real-world simplicial complexes.
figure 3

Shown is the F1 score for (a) random simplicial complex (ERSC), (b) scale-free simplicial complex (SFSC), (c) small-world simplicial complex (SWSC), (d) Thiers12, (e) InVS15, and (f) LyonSchool. The blue squares and red circles demonstrate the reconstruction performance of two-body and three-body connections, respectively. The parameter values for the synthetic simplicial complexes are N = 200, k1 = 12, k2 = 4, T = 10,000, α = 0.8, ω = 2.4, ρ0 = 0.2, and μ = 1. For the real-world simplicial complexes, the parameter values are T = 20,000, α = 0.3, ω = 1, ρ0 = 0.2, and μ = 1. Each data point is the result of averaging over ten realizations.

Discussion

To find the network structure from observational data has been an active research field for more than fifteen years1. In previous studies, the term “network structure” is largely referred to as the collection of pairwise connections as characterized by the adjacency matrix of the network. Since the goal is to figure out whether there is a link between any two nodes, the existing methods focused on measures that are suitable for ascertaining the “two-body” interactions, such as those based on pairwise correlation or synchronization. From the beginning of modern network science and engineering slightly over two decades ago, networks with only pairwise connections have represented the standard setting of study. Likewise, the inverse problem of data-based discovery of the network structure has been exclusively carried out in this setting. To our knowledge, in the current literature, the problem of finding higher-order connections in complex networks from time-series data has not been addressed.

Higher-order interactions are nonetheless ubiquitous in complex networks and its importance has been gradually recognized with an accumulating interest, eventually generating an explosive growth of research recently18,19,20,21,22,23,24. The structure of networks with higher-order connections, also known as simplicial complexes, are represented by tensors of high orders. For example, three-body interactions or 2-simplexes in a network can be described by a tensor of rank 3. Structurally, simplicial complexes are significantly more sophisticated than the conventional networks with pairwise links only, and richer dynamics can be anticipated in the former, which have begun to be studied. From the point of view of inverse problem, to reconstruct simplicial complexes from time-series data represents a great challenge.

We have taken the first step to address this inverse problem. Focusing on complex networks with 2-simplexes, we have developed a statistical inference framework by which all two-body and three-body interactions in the network can be found simultaneously from binary time-series data only, i.e., no prior knowledge about the network to be reconstructed is required. The backbone of our reconstruction framework is maximum likelihood estimation that yields the probabilities of all the possible pairwise and three-body connections and a criterion to associate the probabilities with the actual interactions. To significantly increase the computational efficiency, we have proposed and tested a two-step process and a truncation process to determine the true structure of the simplicial complexes. The reconstruction framework has withstood tests on synthetic and real-world simplicial complexes with respect to accuracy and robustness against random fluctuations.

Many open problems remain. For example, our reconstruction framework is formulated in terms of binary time-series data from social contagion dynamics and simplicial Ising dynamics (Sec. III of SI). How to reconstruct higher-order networks from data generated by different dynamical processes needs to be studied. Also, our statistical inference method is developed for 2-simplicial complexes that are perhaps the “simplest” network structure beyond the conventional networks with pairwise interactions. To reconstruct networks with higher-order interactions such as 3-simplicial complexes and hypergraphs is worth pursuing. It is also necessary to develop methods to improve the reconstruction accuracy with shorter time series. We hope our work will stimulate further research in this emerging subfield of data-based reconstruction of complex networks with higher-order interactions.

Methods

Details of the statistical inference framework

We describe the details of our statistical inference framework through an illustrative example, as shown in Fig. 4, where a 2-simplicial complex with N = 30 nodes and its data matrix are illustrated in Fig. 4a, b, respectively. For such a network hosting SIS dynamics, the probability of a susceptible node i (i.e., \({S}_{i}^{t}\,=\,0\)) being infected (i.e., \({S}_{i}^{t\,+\,1}\,=\,1\)) is determined only by the infected neighbors and the infected 2-simplexes in which two other nodes in the 2-simplex are both infected, at time t. The transition probability from the infected state to the susceptible state does not depend on the states of the neighbors, so it is only necessary to consider the transition probability from the susceptible state to the infected state for constructing the network. We stress that the details of the dynamical process, such as the infection probabilities β1 and β2 as well as the recovery probability μ, are assumed to be unknown but only the binary time series of the nodal states are available. Figure 4 presents an illustrative example to describe the details of our method.

Fig. 4: Schematic illustration of reconstructing a 2-simplicial complex based on binary social contagion.
figure 4

a A 2-simplicial complex of size N = 30, where the black links represent 1-simplexes and the orange shadows represent 2-simplexes. b Data matrix S that stores all nodes’ states at different time steps, where each row is a time string representing all nodes’ states at that time step and each column is a node’s state at different time steps, where the black and blank squares denote the 1 and 0 states, respectively. Take node 13 as an example (the red frame), the values of \({{{{{{{{\rm{P}}}}}}}}}_{j}^{13}\left(\forall j\,\ne\, 13\right)\) and \({{{{{{{{\rm{P}}}}}}}}}_{jk}^{13}\left(\forall j\,\ne\, k\,\ne\, 13\right)\) can be calculated from data matrix S: \({{{{{{{{\rm{P}}}}}}}}}_{7}^{13}\,=\,1/2\) (the green frame) and \({{{{{{{{\rm{P}}}}}}}}}_{28,30}^{13}\,=\,1\) (the purple frame). c The values of Pj→13 and Pjk→13 are obtained through the EM algorithm, where only non-zero values of Pj→13 and the top 10 values of Pjk→13 are shown. d The values of Pji (each column in the left subgraph) and Pjki (each column in the right subgraph) for each node i based on the method described in Secs. 4.1.1 and 4.1.2, where the blue and red dots denote the actual and nonexistent two-body or three-body connections, respectively. e The 2-simplicial complex is inferred based on the probabilities in (d), in which the two-body connections are exactly predicted, but two 2-simplexes (5, 20, 22) and (13, 18, 29) in (a) cannot be predicted (marked by the light-yellow shadows). f The values of \({{{{{{{{\rm{P}}}}}}}}}_{j\to 13}^{0}\) for node 13 obtained by iterating Eqs. 2225, and only non-zero values are shown. g Compressed data matrix that records only the columns in S giving \({{{{{{{{\rm{P}}}}}}}}}_{j\to 13}^{0} \, > \, 0\). h The values of Pj→13 and Pjk→13 for node 13 based on the compressed data in (g). i The values of Pji and Pjki for each node i. Finally, the full 2-simplicial complex in (a) can be exactly reconstructed by determining whether Pji > 0 or \({{{{{{{{\rm{P}}}}}}}}}_{jk\to i} \, > \, {\hat{{{\Delta }}}}_{i}\).

Establishing the likelihood function

Let j → i denote the event that node j has a direct impact on the state of node i. For example, node j can directly spread the virus or send a piece of information to node i, which means that node j is one of immediate neighbors of node i. Nodes i and j thus form a 1-simplex, a property that is independent of time t. Similarly, let jk → i denote the event that the synergistic reinforcement effect coming from nodes j and k has a direct impact on the state of node i, which is also independent of time. In the following, we determine the probabilities of node i and node j being connected and of three nodes i, j, k forming a three-body connection (i, j, k).

The conditional probability of \({S}_{i}^{t\,+\,1}\,=\,1\) and j → i given \({S}_{j}^{t}\,=\,1\) and \({S}_{i}^{t}\,=\,0\) can be written as

$$ {{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{t\,+\,1}\,=\,1,j\to i\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}\,=\,1\right.\right)\\ \,=\, {{{{{{{\rm{P}}}}}}}}\left(j\to i\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}\,=\,1,{S}_{i}^{t\,+\,1}\,=\,1\right.\right)\,*\, {{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{t\,+\,1}\,=\,1\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}\,=\,1\right.\right)\,\\ =\, {{{{{{{{\rm{P}}}}}}}}}_{j\to i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i},$$
(3)

where \({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}\triangleq {{{{{{{\rm{P}}}}}}}}(j\to i|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}\,=\,1,{S}_{i}^{t\,+\,1}\,=\,1)\) is the probability of node i being infected by node j under the conditions \({S}_{i}^{t}\,=\,0\), \({S}_{j}^{t}\,=\,1\) and \({S}_{i}^{t\,+\,1}\,=\,1\), Pji > 0 indicates that node j is a neighbor of node i, and \({{{{{{{{\rm{P}}}}}}}}}_{j}^{i}\triangleq {{{{{{{\rm{P}}}}}}}}({S}_{i}^{t\,+\,1}\,=\,1|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}\,=\,1)\) is the probability of \({S}_{i}^{t\,+\,1}\,=\,1\) under the conditions \({S}_{i}^{t}\,=\,0\) and \({S}_{j}^{t}\,=\,1\), which can be estimated from the data matrix S. Take the matrix in Fig. 4b as an example and suppose we wish to estimate the value of \({{{{{{{{\rm{P}}}}}}}}}_{7}^{13}\), where nodes 13 (i.e., node 13 is the target node) and 7 are highlighted by red and green frames, respectively. It is necessary to extract each pair including the time string with \({S}_{13}^{t}\,=\,0\) and its next time strings at t + 1. It can be seen that seven pairs of time strings can be extracted: (t, t + 1), (t +  1, t + 2), (t + 2, t + 3), (t + 3, t + 4), (t + 4, t + 5), (t + 6, t + 7), and (t + 8, t + 9). It can also be seen that two time moments: t + 1 and t + 8, satisfy the conditions that node 13 is in the susceptible state and node 7 is in the infected state. The only time at which node 13 can be infected at the next time step is t + 8. As a result, we have \({{{{{{{{\rm{P}}}}}}}}}_{7}^{13}\,=\,1/2\).

Similarly, the conditional probability of \({S}_{i}^{t+1}=1\) and jk → i given \({S}_{j}^{t}{S}_{k}^{t}\,=\,1\) and \({S}_{i}^{t}\,=\,0\) can be written as

$$ {{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{t\,+\,1}\,=\,1,jk\to i\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}{S}_{k}^{t}\,=\,1\right.\right)\\ \, =\, {{{{{{{\rm{P}}}}}}}}\left(jk\to i\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}{S}_{k}^{t}\,=\,1\right.,{S}_{i}^{t\,+\,1}\,=\,1\right)\,*\, {{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{t\,+\,1}\,=\,1\left|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}{S}_{k}^{t}\,=\,1\right.\right)\,\\ =\, {{{{{{{{\rm{P}}}}}}}}}_{jk\to i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i},$$
(4)

where \({{{{{{{{\rm{P}}}}}}}}}_{jk\to i}\,\triangleq\, {{{{{{{\rm{P}}}}}}}}(jk\to i|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}{S}_{k}^{t}\,=\,1,{S}_{i}^{t\,+\,1}\,=\,1)\) is the probability of node i being infected through the synergistic interaction from nodes j and k, under the conditions \({S}_{i}^{t}\,=\,0\), \({S}_{j}^{t}{S}_{k}^{t}\,=\,1\), and \({S}_{i}^{t\,+\,1}\,=\,1\), and Pjki > 0 indicates that the three nodes i, j, k form a 2-simplex. The probability \({{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}\,\triangleq\, ({S}_{i}^{t\,+\,1}\,=\,1|{S}_{i}^{t}\,=\,0,{S}_{j}^{t}{S}_{k}^{t}\,=\,1)\) can be estimated from the data matrix S in a similar way. Again, take the three nodes 13, 28, and 30 in Fig. 4b as an example. It can be seen that the time instants at which \({S}_{13}^{t}\,=\,0\), \({S}_{28}^{t}\,=\,1\) and \({S}_{30}^{t}\,=\,1\) are fulfilled are t + 6 and t + 8. Because \({S}_{13}^{t\,+\,7}\,=\,1\) and \({S}_{13}^{t\,+\,9}\,=\,1\), we have \({{{{{{{{\rm{P}}}}}}}}}_{28,30}^{13}\,=\,1\).

According to Eqs. 3, 4, the expected number of susceptible node i being infected at tm + 1 is given by

$${{{{{{{{\rm{E}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\, = \mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{{t}_{m}\,+\,1}\,=\,1,j\,\to\, i\left|{S}_{i}^{{t}_{m}}\,=\,0,{S}_{j}^{{t}_{m}}\,=\,1\right.\right){{{\Psi }}}_{j}^{{t}_{m}}\\ \quad\ +\,\mathop{\sum}\limits_{j,k\left(j\,\ne\, k\,\ne\, i\right)}{{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{{t}_{m}\,+\,1}\,=\,1,jk\,\to\, i\left| {S}_{i}^{{t}_{m}}\,=\,0,\,\right.\left.{S}_{j}^{{t}_{m}}{S}_{k}^{{t}_{m}}\,=\,1\right.\right){{{\Psi }}}_{jk}^{{t}_{m}}\,+\,{\varepsilon }_{i}\\ \, = \mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\,+\,\mathop{\sum}\limits_{j,k\left(j\,\ne\, k\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\,+\,{\varepsilon }_{i},$$
(5)

where \({{{\Psi }}}_{j}^{{t}_{m}}\) represents the events that nodes j are infected at time tm, similarly, \({{{\Psi }}}_{jk}^{{t}_{m}}\) is the events that both nodes j and k are infected at time tm, and their values are zero or one. For example, if \({{{\Psi }}}_{j}^{{t}_{m}}\,=\,1\), it means that node j is infected at time tm; otherwise, \({{{\Psi }}}_{j}^{{t}_{m}}\,=\,0\) when it is not infected at time tm. The quantity εi represents the noise due to the errors from the collected data.

In general, the probability of a given number of events occurring in a fixed interval of time is characterized by the Poisson distribution, so we use it to capture the random nature of the times that node i is infected. An advantage of the Poisson distribution is that it makes a mathematical analysis and computations with the EM algorithm feasible53,54,55,56. Specifically, the likelihood function can be described as

$${{{{{{{\rm{P}}}}}}}}\left({\left\{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\right\}}_{m \,=\, 1,\cdots ,M}\left|{{\Theta }},{\left\{{{{\Psi }}}_{j}^{{t}_{m}}\right\}}_{m \,=\, 1,\cdots ,M;j \,=\, 1,\cdots ,N}\right.\right)\,=\,\mathop{\prod}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\frac{{e}^{-{{{{{{{{\rm{E}}}}}}}}}_{i}^{{t}_{m}\,+\,1}}{\left({{{{{{{{\rm{E}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\right)}^{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}}}{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}!},$$
(6)

where Θ denotes the set of variables Pji, Pjki and εi. We have \({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}!\,\equiv\, 1\) since \({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\) is either zero or one.

Maximizing the likelihood function based on EM algorithm

We next use the EM method to maximize the likelihood function57 for determining the parameter Θ in Eq. 6. Taking the logarithm form of Eq. 6, we get

$$L\left({{\Theta }}\right)\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\log {{{{{{{{\rm{E}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\,-\,{{{{{{{{\rm{E}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\right)\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left[\begin{array}{l}{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\log \left(\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right.\\ \left.+\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\,+\,{\varepsilon }_{i}\right)\\ \, -\,\left(\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right.\\ \left.\,+\,\mathop{\sum}\limits_{j,k\left(j\,\ne\, k\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\,+\,{\varepsilon }_{i}\right)\end{array}\right].$$
(7)

Applying the Jensen’s inequality to the logarithmic term on the right side of Eq. 7 yields

$$ \log \left(\mathop{\sum}\limits_{j\left(j\ne i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\to i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}+\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{{{{{{{{\rm{P}}}}}}}}}_{jk\to i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}+{\varepsilon }_{i}\right)\\ =\log \left(\mathop{\sum}\limits_{j\left(j\ne i\right)}{\rho }_{j}^{{t}_{m}}\frac{{{{{{{{{\rm{P}}}}}}}}}_{j\to i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}}{{\rho }_{j}^{{t}_{m}}}\right.\left.+\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{\rho }_{jk}^{{t}_{m}}\frac{{{{{{{{{\rm{P}}}}}}}}}_{jk\to i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}}{{\rho }_{jk}^{{t}_{m}}}+{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\frac{{\varepsilon }_{i}}{{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}}\right)\\ \ge \mathop{\sum}\limits_{j\left(j\ne i\right)}{\rho }_{j}^{{t}_{m}}\log \frac{{{{{{{{{\rm{P}}}}}}}}}_{j\to i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}}{{\rho }_{j}^{{t}_{m}}}+\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{\rho }_{jk}^{{t}_{m}}\log \frac{{{{{{{{{\rm{P}}}}}}}}}_{jk\to i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}}{{\rho }_{jk}^{{t}_{m}}}+{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\log \frac{{\varepsilon }_{i}}{{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}}\\ =\mathop{\sum}\limits_{j\left(j\ne i\right)}{\rho }_{j}^{{t}_{m}}\log \left({{{{{{{{\rm{P}}}}}}}}}_{j\to i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)+\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{\rho }_{jk}^{{t}_{m}}\log \left({{{{{{{{\rm{P}}}}}}}}}_{jk\to i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\right)\\ +{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\log {\varepsilon }_{i}-\mathop{\sum}\limits_{j\left(j\ne i\right)}{\rho }_{j}^{{t}_{m}}\log {\rho }_{j}^{{t}_{m}}-\mathop{\sum}\limits_{j,k\left(j\ne k\ne i\right)}{\rho }_{jk}^{{t}_{m}}\log {\rho }_{jk}^{{t}_{m}}-{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\log {\rho }_{{\varepsilon }_{i}}^{{t}_{m}},$$
(8)

where the equality holds if

$${\rho }_{j}^{{t}_{m}}\,=\,\frac{{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}}{{\varepsilon }_{i}\,+\,\left(\begin{array}{l}\mathop{\sum}\limits_{{j}^{\prime}\left({j}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}}^{{t}_{m}}\,+\,\\ \mathop{\sum}\limits_{{j}^{\prime},{k}^{\prime}\left({j}^{\prime}\,\ne\, {k}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}{k}^{\prime}}^{{t}_{m}}\end{array}\right)},$$
(9)
$${\rho }_{jk}^{{t}_{m}}\,=\,\frac{{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}}{{\varepsilon }_{i}\,+\,\left(\begin{array}{l}\mathop{\sum}\limits_{{j}^{\prime}\left({j}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}}^{{t}_{m}}\,+\,\\ \mathop{\sum}\limits_{{j}^{\prime},{k}^{\prime}\left({j}^{\prime}\ne {k}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}{k}^{\prime}}^{{t}_{m}}\end{array}\right)},$$
(10)

and

$${\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\,=\,\frac{{\varepsilon }_{i}}{{\varepsilon }_{i}\,+\,\left(\begin{array}{l}\mathop{\sum}\limits_{{j}^{\prime}\left({j}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}}^{{t}_{m}}\,+\,\\ \mathop{\sum}\limits_{{j}^{\prime},{k}^{\prime}\left({j}^{\prime}\,\ne\, {k}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}{k}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}{k}^{\prime}}^{{t}_{m}}\end{array}\right)}.$$
(11)

The maximization problem of Eq. 7 can then be transformed into maximizing the following equation:

$$\hat{L}\left({{\Theta }}\right)\, = \,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{j}^{{t}_{m}}\log \left({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)\right.\left.\,-\,{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{j}^{{t}_{m}}\log {\rho }_{j}^{{t}_{m}}\,-\,{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)\\ \,+\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\mathop{\sum}\limits_{j,k\left(j\,\ne\, k\,\ne\, i\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{jk}^{{t}_{m}}\log \left({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\right)\right.\left.\,-\,{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{jk}^{{t}_{m}}\log {\rho }_{jk}^{{t}_{m}}\,-\,{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\right)\\ \,+\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\log {\varepsilon }_{i}\,-\,{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\log {\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\,-\,{\varepsilon }_{i}\right).$$
(12)

Calculating the partial derivative of \(\hat{L}\left({{\Theta }}\right)\) with respect to Pji, Pjki and εi and setting them to zero, we get

$$\frac{\partial \hat{L}\left({{\Theta }}\right)}{\partial {{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}}\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left(\frac{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{j}^{{t}_{m}}}{{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}}\,-\,{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)\,=\,0,$$
(13)
$$\frac{\partial \hat{L}\left({{\Theta }}\right)}{\partial {{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}}\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left(\frac{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{jk}^{{t}_{m}}}{{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}}\,-\,{{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\right)\,=\,0,$$
(14)
$$\frac{\partial \hat{L}\left({{\Theta }}\right)}{\partial {\varepsilon }_{i}}\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left(\frac{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}}{{\varepsilon }_{i}}\,-\,1\right)\,=\,0,$$
(15)

which give

$${{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}\,=\,\frac{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{j}^{{t}_{m}}\right)}{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)},$$
(16)
$${{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\,=\,\frac{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{jk}^{{t}_{m}}\right)}{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{{{{{{\rm{P}}}}}}}}}_{jk}^{i}{{{\Psi }}}_{jk}^{{t}_{m}}\right)},$$
(17)
$${\varepsilon }_{i}\,=\,\frac{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\right)}{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left(1\right)}.$$
(18)

The six equations Eqs. 911 and Eqs. 1618 can be used to solve Pji, Pjki and εi. In particular, by initializing all values of \({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}{{{{{{{\rm{,}}}}}}}}{{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}{{{{{{{\rm{,}}}}}}}}{\varepsilon }_{i}\left(\forall j\,\ne\, k\,\ne\, i\right)\) to be one and then calculating the values of \({\rho }_{j}^{{t}_{m}}\), \({\rho }_{jk}^{{t}_{m}}\) and \({\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\) in Eqs. 911, we substitute them into Eqs. 1618 to find the values of Pji, Pjki, and εi. We repeat this process until convergence is achieved. Since a single iterative process does not ensure global optimization, we carry out the above iteration process a number of times and choose the proper values that give the maximum of the quantity in Eq. 12.

As an example, as shown in Fig. 4c, the values of Pj→13 and Pjk→13 are given according to this iteration process, where Pj→13 > 0 and the top 10 values of Pjk→13 are demonstrated. Similarly, all the values of Pji and Pjki can be calculated for each node i. As presented in Fig. 4d, each column above the abscissa corresponds to the predicted 1-simplex probabilities (the left subgraph of Fig. 4d) and 2-simplex probabilities (the right subgraph of Fig. 4d) of a node, and the blue and red dots denote the actual and nonexistent two-body or three-body connections, respectively.

An improved two-step reconstruction strategy

For a 2-simplicial complex structure with N nodes, when predicting the 2-simplexes of a node i, we randomly choose two nodes (e.g., j and k) and calculate the probability Pjki, which requires calculating \(\left(\genfrac{}{}{0.0pt}{}{N-1}{2}\right)\) values. To reduce the computational load and increase the reconstruction accuracy, we articulate an improved two-step strategy. The particularity of simplicial complexes stipulates that the other two nodes forming a 2-simplex with node i must be the neighbors of node i, so it is not necessary to calculate the probability Pjki if node j or node k is not a neighbor of node i. The reconstruction process can then be divided into two steps. At the first step, the “approximate” neighborhood of each node is predicted and their corresponding columns in the data matrix S are extracted, leading to a compressed data matrix. At the second step, based on the compressed data matrix, the values of Pji and Pjki for each node i are predicted by iterating Eqs. 911 and 1618. Our two-step method was not designed for the general challenging task of consistently inferring all the subfaces for arbitrarily higher-order simplices. In fact, our method requires the closure condition of simplicial complexes: it is necessary to know in advance that the network under reconstruction is a 2-simplicial complex. Given this premise, the two-step strategy infers first the two-body and then the three-body interactions (i.e., 2-simplex) from the inferred two-body interactions. While the two-step method is efficient to reconstruct 2-simplicial complexes, at the present it cannot be used to reconstruct the hypergraphs because its second step is to find the triangles from the neighbors (i.e., edges).

For the first step, the predicted neighbors are not accurate because the three-body interactions have been ignored. In fact, the main purpose of this step is to determine an approximate range of neighbors to reduce the time for calculating \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\left(\forall j\,\ne\, k\,\ne\, i\right)\). Without taking into account three-body interactions, the expected number of susceptible nodes being infected at tm + 1 can simply be expressed as

$${\tilde{{{{{{{{\rm{E}}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\, =\,\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{\rm{P}}}}}}}}\left({S}_{i}^{{t}_{m}\,+\,1}\,=\,1,j\,\to\, i\left|{S}_{i}^{{t}_{m}}\,=\,0,{S}_{j}^{{t}_{m}}\,=\,1\right.\right)\,*\, {{{\Psi }}}_{j}^{{t}_{m}}\,+\,{\varepsilon }_{i}\,\\ =\,\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\,+\,{\varepsilon }_{i},$$
(19)

where the notation \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) is used to emphasize that node j is only an “approximate” neighbor of node i. Assuming that the number Ψi of times of node i being infected in each time period obeys the Poisson distribution, we obtain the likelihood function as

$$ {{{{{{{\rm{P}}}}}}}} \left({\left\{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\right\}}_{m \,=\, 1,\cdots ,M}\left|\tilde{{{\Theta }}},{\left\{{{{\Psi }}}_{j}^{{t}_{m}}\right\}}_{m \,=\, 1,\cdots ,M; \,j \,=\, 1,\cdots ,N}\right.\right)\\ =\,\mathop{\prod}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\frac{{e}^{-{\tilde{{{{{{{{\rm{E}}}}}}}}}}_{i}^{{t}_{m}\,+\,1}}{\left({\tilde{{{{{{{{\rm{E}}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\right)}^{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}}}{{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}!},$$
(20)

where \(\tilde{{{\Theta }}}\) denotes the set of variables \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) and εi. Taking the logarithm of Eq. 20, we have

$$L\left(\tilde{{{\Theta }}}\right)\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\log {\tilde{{{{{{{{\rm{E}}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\,-\,{\tilde{{{{{{{{\rm{E}}}}}}}}}}_{i}^{{t}_{m}\,+\,1}\right)\,=\,\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left[\begin{array}{l}{{{\Psi }}}_{i}^{{t}_{m}\,+\,1}\log \left(\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\,+\,{\varepsilon }_{i}\right)\\ \,-\,\left(\mathop{\sum}\limits_{j\left(j\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\,+\,{\varepsilon }_{i}\right)\end{array}\right].$$
(21)

Using the EM method to maximize the likelihood function, we obtain the final parameters \(\tilde{{{\Theta }}}\) as

$${{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0}\,=\,\frac{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{j}^{{t}_{m}}\right)}{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}\right)},$$
(22)
$${\varepsilon }_{i}\,=\,\frac{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left({{{\Psi }}}_{i}^{{t}_{m}\,+\,1}{\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\right)}{\mathop{\sum}\limits_{m\left({{{\Psi }}}_{i}^{{t}_{m}}\,=\,0\right)}\left(1\right)},$$
(23)

where

$${\rho }_{j}^{{t}_{m}}\,=\,\frac{{{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{j}^{i}{{{\Psi }}}_{j}^{{t}_{m}}}{\mathop{\sum}\limits_{{j}^{\prime}\left({j}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}\,\to\, i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}}^{{t}_{m}}\,+\,{\varepsilon }_{i}},$$
(24)
$${\rho }_{{\varepsilon }_{i}}^{{t}_{m}}\,=\,\frac{{\varepsilon }_{i}}{\mathop{\sum}\limits_{{j}^{\prime}\left({j}^{\prime}\,\ne\, i\right)}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}\,\to\, i}^{0}{{{{{{{{\rm{P}}}}}}}}}_{{j}^{\prime}}^{i}{{{\Psi }}}_{{j}^{\prime}}^{{t}_{m}}\,+\,{\varepsilon }_{i}}.$$
(25)

With the initial conditions for \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) and εi, the values of \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) and εi can be obtained by iterating Eqs. 2225 until convergence is achieved. It is worth noting that \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) is a probability and we need to determine the “approximate” neighbors of the node under reconstruction. Theoretically, the “approximate” neighbors can be determined by testing whether \({{{{{{{{\rm{P}}}}}}}}}_{j\to i}^{0}\) is non-zero. However, practically this is not feasible due to noise or deviations from the assumptions. For example, as shown in Fig. 4f, nodes 6 and 14 are not neighbors of node 13 even though \({{{{{{{{\rm{P}}}}}}}}}_{6\to 13}^{0}\,=\,0.0002\) and \({{{{{{{{\rm{P}}}}}}}}}_{14\to 13}^{0}\,=\,0.0006\). To overcome this difficulty, we articulate a truncation method for determining the neighbors of node i, as follows.

First, note that the time complexity of the second step can be significantly reduced when fewer neighbors are predicted, but too few predicted neighbors can lead to missing neighbors. On the contrary, too many neighbors would increase the time complexity. A solution is to use a reasonable truncation to determine the “approximate” neighbors of each node. To this end, we re-rank the probability \({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0}\left(\forall j\,\ne\, i\right)\) in a descending order and place a threshold Δi in the maximum gap defined as14:

$${{{\Delta}} _i} \,=\, \mathop{{{{{\mathrm{arg}}}}}\,{{{{{\mathrm{max}}}}}}}\limits_l \left[ {\frac{{{{{{{{\rm{P}}}}}}}_l^{'}}}{{{{{{{{\rm{P}}}}}}}_{l \,+\, 1}^{'}}}\left( {{{{{{{\rm{P}}}}}}}_l^{'} \,-\, {{{{{{\rm{P}}}}}}}_{l \,+\, 1}^{'}} \right)} \right].$$
(26)

Next, we use Eq. 26 again to find a new threshold \({\bar{{{\Delta }}}}_{i}\) which is smaller than Δi. Finally, node j is regarded as an “approximate” neighbor of node i if \({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}^{0} \, > \, {\bar{{{\Delta }}}}_{i}\). The truncation method can ensure the detection of all real neighbors and 2-simplexes.

Once the “approximate” neighbors of node i have been obtained, the time series of these neighbors can be extracted (Fig. 4f, g). The neighbors of node i and its 2-simplexes can be quickly re-predicted based on the second step, i.e., by iterating Eqs. 911 and Eqs. 1618 based on the compressed data matrix. For example, the prediction results for node 13 are shown in Fig. 4h and the values of \({{{{{{{{\rm{P}}}}}}}}}_{j\,\to\, i}\left(\forall j\,\ne\, i\right)\) and \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\left(\forall j\,\ne\, k\,\ne\, i\right)\) for each node are presented in Fig. 4i. The actual two- and three-body connections of each node can then be determined based on the results in Fig. 4i. Because the identification of two-body connections has been refined in the second step, we simply assume that node j is a neighbor of node i if Pji > 0. Following previous work14,58, we assume that nodes i and j are connected when Pji > 0 or Pij > 0.

The case of three-body interactions is more complicated and the solution is sensitive to noise or errors. In fact, using the condition Pjki > 0 as a criterion to detect (i, j, k) as a 2-simplex can lead to many false positives. Our solution is to re-rank \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\left(\forall j\,\ne\, k\,\ne\, i\right)\) in a descending order and obtain a new threshold \({\hat{{{\Delta }}}}_{i}\) by using Eq. 26 again. As a result, an actual 2-simplex (i, j, k) is formed when \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\,\ge\, {\hat{{{\Delta }}}}_{i}\). To remove the conflicts in the prediction, we assume that there exists a 2-simplex (i, j, k) when two of three conditions hold at least, e.g., \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\,\ge\, {\hat{{{\Delta }}}}_{i}\), \({{{{{{{{\rm{P}}}}}}}}}_{ik\,\to\, j}\,\ge\, {\hat{{{\Delta }}}}_{j}\), and \({{{{{{{{\rm{P}}}}}}}}}_{ij\,\to\, k} \, < \, {\hat{{{\Delta }}}}_{k}\), but a three-body cannot form when \({{{{{{{{\rm{P}}}}}}}}}_{jk\,\to\, i}\,\ge\, {\hat{{{\Delta }}}}_{i}\), \({{{{{{{{\rm{P}}}}}}}}}_{ik\,\to\, j} \, < \, {\hat{{{\Delta }}}}_{j}\), and \({{{{{{{{\rm{P}}}}}}}}}_{ij\,\to\, k} \, < \, {\hat{{{\Delta }}}}_{k}\). Implementing the two-step strategy, we can reconstruct the whole 2-simplicial complexes. As shown in Fig. 4a, the 2-simplicial complex has been accurately reconstructed. Overall, the two-step strategy not only greatly reduces the computational time but also significantly improves the reconstruction accuracy (more details in Sec. II in SI).

Construction of synthetic and real-world 2-simplicial complexes

Synthetic 2-simplicial complexes

Here we describe the main steps of constructing synthetic 2-simplicial complexes of size N, average degrees of two-body and three-body interactions k1 and k2, respectively.

Random simplicial complex (ERSC)

First, a random graph is generated by connecting any two nodes with the probability p1. We then add 2-simplexes between any three nodes with the probability p2, where the formulas of p1 and p2 are34:

$${p}_{1}\,=\,\frac{{k}_{1}\,-\,2{k}_{2}}{\left(N\,-\,1\right)\,-\,2{k}_{2}},$$
(27)
$${p}_{2}\,=\,\frac{2{k}_{2}}{\left(N\,-\,1\right)\left(N\,-\,2\right)}.$$
(28)

A random 2-simplicial complex with the specified average degrees can then be constructed using the probabilities p1 and p2.

Scale-free simplicial complex (SFSC)

First, a scale-free network is generated, in which each new node connects m edges to the old nodes with degree preference59. We then add 2-simplexes between any three nodes according to probability p2 in Eq. 28. The average degree of 1-simplexes can be calculated as

$${k}_{1}\,=\,2m\,+\,2{k}_{2}\left(1\,-\,\frac{2m}{N}\right).$$
(29)

Small-world simplicial complex (SWSC)

First, a small-world network60 is generated from a regular lattice (all the nodes have the same degree 2m) with rewiring probability p. We then add 2-simplexes between any three nodes according to probability p2 in Eq. 28. The average degree of 1-simplexes is given by Eq. 29.

2-simplicial complexes from real-world data

In each real-world data set, the face-to-face interactions have been measured with a temporal resolution of 20 s. First, we generate a weighted network according to the data, where a weight represents the number of interactions between a pair of nodes in the whole time window. Second, we remove any link whose weight is less than a given threshold ζ and set the weights of retained links to one to generate an unweighted network. Finally, we cut the data into multiple segments with a temporal window of 5 min and record all the 2-simplexes. In particular, if three nodes communicate with each other in a short time, they are regarded as constituting a three-body connection. We record the frequencies of the 2-simplexes in each segment. According to the total frequency in all segments, we retain the first 50% of the 2-simplexes with the highest frequencies and count them as the actual 2-simplexes. The visualization of four real-world 2-simplicial complexes is shown in Fig. 2a–d.