
1 Introduction

Data are increasing in volume and complexity. A major challenge arising in many applications is to efficiently process large amounts of data, synthesizing the available bits of information into a concise but meaningful picture.

New data abstractions, emerging from modern applications, require new definitions for data-summarization and synthesis tasks. In particular, for many data that are typically modeled as networks, temporal information is nowadays readily available, leading to temporal networks [9, 19]. In a temporal network \(G=(V,E)\), edges describe interactions over a set of entities V. For each edge \((u,v,t)\in E\), the time of interaction t between entities \(u,v\in V\) is also available.

In this paper we introduce a new problem for summarizing temporal networks. The main idea is to consider that the entities of the network are active over presumably short time intervals. Edges (interactions) of the temporal network between two entities can be explained by at least one of the two entities being active at the time of the interaction. Our summarization task is to process the available temporal edges (interactions) and infer the latent activity intervals for all entities. In this way, we can infer an activity timeline for the whole network. To motivate the summarization task studied in this paper, consider the following application scenario.

Example. Consider a news story unfolding over the period of several months, or years, such as Brexit. There is a sequence of intertwined events (e.g., UK referendum, prime minister resigns, appointment of new prime minister, supreme court decision, invoking article 50, etc.) as well as a roster of key characters who participate in the events (e.g., Cameron, Johnson, May, Tusk, etc.). Consider now a stream of Brexit-related tweets, as events unfold, and hashtags mentioned in those tweets (e.g., #brexit, #remain, #ukip, #indyref2, etc.). For our purposes, we view the Twitter stream as a temporal network: a tweet mentioning two hashtags \(h_1\) and \(h_2\) and posted at time t is seen as a temporal edge \((h_1,h_2,t)\). A typical situation is that a hashtag bursts during a time interval that is associated with a main event, while it may also appear outside the time interval in connection with other secondary events. For instance, the peak activity for #remain may have been during the weeks leading to the referendum, but the same hashtag may also appear later, say, in reference to invoking article 50, by a user who wished that the UK had not voted for Brexit. The question that we ask in this paper is whether it is possible to process the temporal network of entity interactions and reconstruct the latent activity intervals for each entity (hashtags, in this example), and thus, infer the complete timeline of the news story.

Motivated by the previous example, and similar application scenarios, we introduce the network-untangling problem, where the goal is to reconstruct an activity timeline from a temporal network. Our formulation uses a simple model in which we assume that each network entity is active during a time interval. A temporal edge \((u,v,t)\) is covered if at least one of u or v is active at time t. The algorithmic objective is to find a set of activity intervals, one for each entity, so that all temporal edges are covered and the length of the activity intervals is minimized. We consider two definitions of interval length: total length and maximum length.

We show that the problem of minimizing the maximum length over all activity intervals can be mapped to 2-SAT, and thus be solved optimally in linear time. On the other hand, minimizing the total interval length is an NP-hard problem. To confront this challenge we offer two iterative algorithms that rely on the fact that certain subproblems can be solved approximately or optimally. In both cases the subproblems admit linear-time algorithms, yielding very practical and efficient methods overall.

We complement our theoretical results with an experimental evaluation, demonstrating that our methods are capable of finding ground-truth activity intervals planted in synthetic datasets. Additionally, we conduct a case study showing that the discovered intervals match the timeline of real-world events and related sub-events.

2 Preliminaries and Problem Definition

Our input is a temporal network \(G = (V, E)\), where V is a set of vertices and E is a set of time-stamped edges. The edges of the temporal network are triples of the form \((u,v,t)\), where \(u, v \in V\) and t is a time stamp indicating the time at which an interaction between vertices u and v takes place. In our setting we do not preclude the case that two vertices u and v interact multiple times. As is customary, we denote by n the number of vertices in the graph, and by m the number of edges. For our algorithms we assume that the edges are given in chronological order; if not, they can be sorted in additional \( \mathcal {O} \mathopen {}\left( m\log m\right) \) time.

Given a vertex \(u \in V\), we will write \( E \mathopen {}\left( u\right) \) to be the set of edges adjacent to vertex u, i.e., \( E \mathopen {}\left( u\right) = \{ (u,v,t) \in E \}\). We will also write \( N \mathopen {}\left( u\right) = \{ v \mid (u,v,t) \in E \}\) to represent the set of vertices adjacent to u, and \( T \mathopen {}\left( u\right) = \{ t \mid (u,v,t) \in E \}\) to represent the set of time stamps of the edges containing u. Finally, we write \( t \mathopen {}\left( e\right) \) to denote the time stamp of an edge \(e\in E\).

Given a vertex \(u\in V\) and two real numbers \(s_{u}\) and \(e_{u}\), we consider the interval \(I_{u}=[s_{u},e_{u}]\), where \(s_{u}\) is a start time and \(e_{u}\) is an end time. We refer to \(I_{u}\) as the activity interval of vertex u. Intuitively, we think of \(I_{u}\) as the time interval in which the vertex u has been active. A set of activity intervals \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\), one interval for each vertex \(u\in V\), is an activity timeline for the temporal network G.

Given a temporal network \(G = (V, E)\) and an activity timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\), we say that the timeline \(\mathcal {T}\) covers the network G if for each edge \((u, v, t) \in E\), we have \(t \in I_{u}\) or \(t \in I_{v}\), that is, when each network edge occurs, at least one of its endpoints is active.

Note that each temporal network has a trivial timeline that provides a cover. Such a timeline, defined by \(I_{u} = [\min T \mathopen {}\left( u\right) , \max T \mathopen {}\left( u\right) ]\), may have unnecessarily long intervals. Instead, we aim at finding an activity timeline whose intervals are as compact as possible. We measure the quality of a timeline by the total duration of all activity intervals in it. More formally, we define the total span, or sum-span, of a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) by

$$\begin{aligned} S \mathopen {}\left( \mathcal {T}\right) = \sum _{u \in V} \sigma \mathopen {}\left( I_{u}\right) , \end{aligned}$$

where \( \sigma \mathopen {}\left( I_{u}\right) = e_{u}-s_{u}\) is the duration of a single interval. An alternative way to measure the compactness of a timeline is by the duration of its longest interval,

$$\begin{aligned} \varDelta \mathopen {}\left( \mathcal {T}\right) = \max _{u \in V} \sigma \mathopen {}\left( I_{u}\right) . \end{aligned}$$

We refer to \( \varDelta \mathopen {}\left( \mathcal {T}\right) \) as the max-span of the timeline \(\mathcal {T}\).
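To make the definitions concrete, the covering condition and the two span measures can be sketched in a few lines of Python; the function names and the dict-based timeline representation (vertex mapped to \((s_v, e_v)\)) are ours, for illustration only.

```python
def covers(edges, timeline):
    """Covering condition: every temporal edge (u, v, t) needs
    t inside the activity interval of u or of v."""
    def active(v, t):
        s, e = timeline[v]
        return s <= t <= e
    return all(active(u, t) or active(v, t) for (u, v, t) in edges)

def sum_span(timeline):
    """S(T): total duration of all activity intervals."""
    return sum(e - s for (s, e) in timeline.values())

def max_span(timeline):
    """Delta(T): duration of the longest activity interval."""
    return max(e - s for (s, e) in timeline.values())
```

For instance, the trivial timeline \(I_u = [\min T(u), \max T(u)]\) always passes `covers`, but typically with unnecessarily large span.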

Associated with the above compactness measures we define the following two problems that we consider in this paper.

Problem 1

(MinTimeline). Given a temporal network \(G = (V, E)\), find a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) that covers G and minimizes the sum-span \( S \mathopen {}\left( \mathcal {T}\right) \).

Problem 2

(\({\textsc {MinTimeline}_\infty }\)). Given a temporal network \(G = (V, E)\), find a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) that covers G and minimizes the max-span \( \varDelta \mathopen {}\left( \mathcal {T}\right) \).

3 Computational Complexity and Algorithms

Surprisingly, while MinTimeline is an NP-hard problem, \({\textsc {MinTimeline}_\infty }\) can be solved optimally and efficiently. The optimality of \({\textsc {MinTimeline}_\infty }\) follows from the algorithm presented in Sect. 5. In this section we establish the complexity of MinTimeline, and we present two efficient algorithms, one for MinTimeline and one for \({\textsc {MinTimeline}_\infty }\).

Proposition 1

The decision version of the MinTimeline problem is NP-complete. Namely, given a temporal network \(G = (V, E)\) and a budget \(\ell \), it is NP-complete to decide whether there is a timeline \(\mathcal {T}^*=\left\{ I_{u}\right\} _{u\in V}\) that covers G and has \( S \mathopen {}\left( \mathcal {T}^*\right) \le \ell \).

Proof

We will prove the hardness by reducing VertexCover to MinTimeline. Assume that we are given a (static) network \(H = (W, A)\) with n vertices \(W = \{ w_1, \ldots , w_n \}\) and a budget \(\ell \). In the VertexCover problem we are asked to decide whether there exists a subset \(U \subseteq W\) of at most \(\ell \) vertices (\(|U|\le \ell \)) covering all edges in A.

We map an instance of VertexCover to an instance of MinTimeline by creating a temporal network \(G = (V, E)\), as follows. The vertex set V consists of 2n vertices: for each \(w_i \in W\), we add vertices \(v_i\) and \(u_i\). The edges are as follows: for each edge \((w_i, w_j) \in A\), we add a temporal edge \((v_i, v_j, 0)\) to E; for each vertex \(w_i \in W\), we add two temporal edges \((v_i, u_i, 1)\) and \((v_i, u_i, 2n + 1)\) to E.

Let \(\mathcal {T}^*\) be an optimal timeline covering G. We claim that \( S \mathopen {}\left( \mathcal {T}^*\right) \le \ell \) if and only if there is a vertex cover of H with \(\ell \) vertices. To prove the if direction, consider a vertex cover U of H with \(\ell \) vertices. Consider the following timeline: each \(u_i\) is active at \(2n + 1\) and each \(v_i\) is active at 1; additionally, for each \(w_i \in U\), vertex \(v_i\) is active at 0. The resulting intervals indeed form a timeline that covers G and has total span \(\ell \).

To prove the other direction, first note that covering each \(v_i\) by the interval [0, 1] and each \(u_i\) by the interval \([2n + 1, 2n + 1]\) yields a timeline covering G with total span n. Thus, \( S \mathopen {}\left( \mathcal {T}^*\right) \le n\); in particular, every single interval has duration at most n, so no interval can contain two time points at distance greater than n. This guarantees that if \(0 \in I_{v_i}\), then \(2n + 1 \notin I_{v_i}\), and hence \(2n + 1 \in I_{u_i}\). This in turn implies that \(1 \notin I_{u_i}\), and so \(1 \in I_{v_i}\). In summary, if \(0 \in I_{v_i}\), then \( \sigma \mathopen {}\left( I_{v_i}\right) \ge 1\). Consequently, if \( S \mathopen {}\left( \mathcal {T}^*\right) \le \ell \), then there are at most \(\ell \) vertices \(v_i\) with \(0 \in I_{v_i}\). Let U be the set of corresponding vertices \(w_i\). Since \(\mathcal {T}^*\) is a timeline covering G, U is a vertex cover for H.    \(\square \)
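The construction above is easy to mechanize; here is a minimal Python sketch of the reduction (the vertex labels `v0`, `u0`, etc. are our own naming convention):

```python
def vc_to_mintimeline(n, static_edges):
    """Map a VertexCover instance (n vertices 0..n-1, list of edges)
    to a MinTimeline instance, following the proof of Proposition 1."""
    E = []
    # each static edge (w_i, w_j) becomes a temporal edge at time 0
    for (i, j) in static_edges:
        E.append((f"v{i}", f"v{j}", 0))
    # each vertex w_i contributes two temporal edges, forcing either
    # v_i to stretch towards 0 or u_i to cover both 1 and 2n + 1
    for i in range(n):
        E.append((f"v{i}", f"u{i}", 1))
        E.append((f"v{i}", f"u{i}", 2 * n + 1))
    return E
```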

3.1 Iterative Method Based on Inner Points

As we saw, MinTimeline is an NP-hard problem. The next logical question is whether we can approximate this problem. Unfortunately, there is evidence that such an algorithm would be highly non-trivial: we can show that if we extend our problem definition to hyper-edges—where coverage means that at least one vertex of each hyper-edge must be active—then the problem becomes inapproximable. This suggests that an approximation algorithm would have to rely on the fact that we are dealing with edges and not hyper-edges.

Luckily, we can consider meaningful subproblems. Assume that we are given a temporal network \(G = (V, E)\) together with a set of time points \(\left\{ m_v\right\} _{v \in V}\), i.e., one time point \(m_v\) for each vertex \(v\in V\), and we are asked to find an optimal activity timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) such that the interval \(I_{v}\) of each vertex v contains the corresponding time point \(m_v\), i.e., \(m_v\in I_{v}\), for each \(v \in V\). Note that these inner points can be located anywhere within the interval (not just, say, at the center of the interval). This problem definition is useful when we know one time point at which each vertex was active, and we want to extend these points to an optimal timeline. We refer to this problem as \({\textsc {MinTimeline}_m}\).

Problem 3

(\({\textsc {MinTimeline}_m}\)). Given a temporal network \(G = (V, E)\) and a set of inner time points \(\left\{ m_v\right\} _{v \in V}\), find a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) that covers G, satisfies \(m_v \in I_{v}\) for each \(v \in V\), and minimizes the sum-span \( S \mathopen {}\left( \mathcal {T}\right) \).

Interestingly, we can show that the \({\textsc {MinTimeline}_m}\) problem can be solved approximately, in linear time, within a factor of 2 of the optimal solution. The 2-approximation algorithm is presented in Sect. 4.

Being able to solve \({\textsc {MinTimeline}_m}\) motivates the following algorithm for MinTimeline, which uses \({\textsc {MinTimeline}_m}\) as a subroutine: initialize \(m_v = (\min T \mathopen {}\left( v\right) + \max T \mathopen {}\left( v\right) ) / 2\) to be the inner time point for vertex v; recall that \( T \mathopen {}\left( v\right) \) is the set of time stamps of the edges containing v. We then use our approximation algorithm for \({\textsc {MinTimeline}_m}\) to obtain a set of intervals \(\left\{ I_{v}\right\} = \left\{ [s_{v},e_{v}]\right\} _{v\in V}\). We use these intervals to set the new inner points, \(m_v = (s_{v} + e_{v}) / 2\), and repeat until the score no longer improves. We call this algorithm Inner.
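A compact sketch of Inner in Python. The subroutine `solve(edges, m)`, which must return a covering timeline respecting the inner points, is an assumed parameter of this sketch; in practice it would be the 2-approximation algorithm of Sect. 4.

```python
from collections import defaultdict

def inner(edges, solve, max_iter=100):
    """Inner heuristic: fix inner points, solve MinTimeline_m,
    re-center the inner points at the interval midpoints, and
    repeat until the sum-span no longer improves."""
    T = defaultdict(list)              # T[v] = time stamps of v's edges
    for (u, v, t) in edges:
        T[u].append(t)
        T[v].append(t)
    m = {v: (min(ts) + max(ts)) / 2 for v, ts in T.items()}
    best_score, best = None, None
    for _ in range(max_iter):
        timeline = solve(edges, m)     # {v: (s_v, e_v)} with m[v] inside
        score = sum(e - s for (s, e) in timeline.values())
        if best_score is not None and score >= best_score:
            break                      # no improvement: stop
        best_score, best = score, timeline
        m = {v: (s + e) / 2 for v, (s, e) in timeline.items()}
    return best
```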

3.2 Iterative Method Based on Budgets

Our algorithm for \({\textsc {MinTimeline}_\infty }\) also relies on the idea of using a subproblem that is easier to solve.

In this case, we consider as subproblem an instance in which, in addition to the temporal network G, we are also given a set of budgets \(\left\{ b_{v}\right\} \) on interval durations, one budget \(b_{v}\) for each vertex v. The goal is to find a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) that covers the temporal network G such that the length of each activity interval \(I_{v}\) is at most \(b_{v}\). We refer to this problem as \({\textsc {MinTimeline}_b}\).

Problem 4

(\({\textsc {MinTimeline}_b}\)). Given a temporal network \(G = (V, E)\) and a set of budgets \(\left\{ b_{v}\right\} _{v \in V}\), find a timeline \(\mathcal {T}=\left\{ I_{u}\right\} _{u\in V}\) that covers G and satisfies \( \sigma \mathopen {}\left( I_{v}\right) \le b_{v}\) for each \(v \in V\).

Surprisingly, the \({\textsc {MinTimeline}_b}\) problem can be solved optimally in linear time. The algorithm is presented in Sect. 5. Note that this result is compatible with the NP-hardness of MinTimeline: here the budget of each individual interval is known, whereas in MinTimeline there is an exponential number of ways to distribute the total budget among the individual intervals.

We can now binary-search over a single uniform budget to find the optimal value \( \varDelta \mathopen {}\left( \mathcal {T}\right) \). We call this algorithm Budget. To guarantee a small number of binary-search steps, some attention is required. Let \(T = t_1, \ldots , t_m\) be all the time stamps, sorted. Assume that we have L, the largest budget known to be infeasible, and U, the smallest budget known to be feasible. To define a new candidate budget, we first define \(W(i) = \left\{ t_j - t_i \mid L< t_j - t_i < U\right\} \). The optimal budget is either U or one of the numbers in some W(i). If every W(i) is empty, then the answer is U. Otherwise, we compute m(i), the median of W(i), ignoring any empty W(i). Finally, we test the weighted median of all m(i), weighted by \({\left| W(i)\right| }\), as the new budget. We can show that at each iteration \(\sum {\left| W(i)\right| }\) is reduced by at least a quarter, so only \( \mathcal {O} \mathopen {}\left( \log m\right) \) iterations are needed. We can determine the medians m(i) and the sizes \({\left| W(i)\right| }\) in linear time since T is sorted, and we can determine the weighted median in linear time using a modified median-of-medians algorithm. This leads to an \( \mathcal {O} \mathopen {}\left( m \log m\right) \) running time. However, in our experimental evaluation, we use a straightforward binary search, testing \((U + L) / 2\) as the new budget.
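The straightforward binary-search variant can be sketched as follows. The feasibility oracle `feasible(edges, b)` is assumed here (in practice, the linear-time 2-SAT algorithm of Sect. 5 run with uniform budget b), and we use the fact that the optimal max-span is a difference of two time stamps, so it suffices to search over the sorted candidate differences.

```python
def budget_search(edges, feasible):
    """Binary-search the smallest feasible uniform budget over the
    candidate values {t_j - t_i}; feasibility is monotone in b."""
    ts = sorted({t for (_, _, t) in edges})
    cands = sorted({tj - ti for i, ti in enumerate(ts) for tj in ts[i:]})
    lo, hi = 0, len(cands) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(edges, cands[mid]):
            hi = mid                  # feasible: try a smaller budget
        else:
            lo = mid + 1              # infeasible: must go larger
    return cands[lo]
```

Note that enumerating all pairwise differences is quadratic; this is the simple variant, not the weighted-median scheme described above.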

4 Approximation Algorithm for MinTimeline\(_m\)

In this section we design a linear-time 2-approximation algorithm for the \({\textsc {MinTimeline}_m}\) problem. As defined in Problem 3, our input is a temporal network \(G = (V, E)\) and a set of inner time points \(\left\{ m_v\right\} _{v \in V}\). As before, \( T \mathopen {}\left( v\right) \) denotes the set of time stamps of the edges containing vertex v.

Consider a vertex v and the corresponding inner point \(m_v\). For a time point t we define the peripheral time stamps \( p \mathopen {}\left( t; v\right) \) to be the time stamps of v that lie on the opposite side of t from \(m_v\),

$$\begin{aligned} p \mathopen {}\left( t; v\right) = {\left\{ \begin{array}{ll} \left\{ s \in T \mathopen {}\left( v\right) \mid s \ge t\right\} &{} \text {if } t > m_v, \\ \left\{ s \in T \mathopen {}\left( v\right) \mid s \le t\right\} &{} \text {if } t < m_v, \\ T \mathopen {}\left( v\right) &{} \text {if } t = m_v. \end{array}\right. } \end{aligned}$$

Our next step is to express \({\textsc {MinTimeline}_m}\) as an integer linear program. To do so, we define a variable \(x_{vt}\) for each vertex \(v \in V\) and time stamp \(t \in T \mathopen {}\left( v\right) \). Instead of the obvious construction, where \(x_{vt} = 1\) would indicate that v is active at t, we use a different formulation: in our program, \(x_{vt} = 1\) indicates that t is either the beginning or the end of the activity interval of v. It follows that the integer program

$$\begin{aligned} \begin{aligned} \min&\sum _{v, t} {\left| t - m_v\right| } x_{vt}, \\ \text {such that}&\sum _{s \in p(t; v)} x_{vs} + \sum _{s \in p(t; u)} x_{us} \ge 1,\ \text {for all}\ (u, v, t) \in E \end{aligned} \end{aligned}$$

solves \({\textsc {MinTimeline}_m}\). Naturally, we also require that \(x_{vt} \in \{0,1\}\). Minimizing the objective corresponds to minimizing the sum-span of the timeline, while the constraints ensure that the resulting timeline covers the temporal network. Note that we do not require each vertex to have exactly one beginning and one end; however, the minimality of the optimal solution ensures that this constraint is satisfied as well.

Relaxing the integrality constraint and viewing the program as a linear program allows us to write the dual. The dual variables can be viewed as non-negative weights \(\alpha _e\) on the edges, with the goal of maximizing the total sum of these weights.

To express the constraints on the dual, let us define an auxiliary function \( h \mathopen {}\left( v, t, s\right) \) as the sum of the weights of adjacent edges between t and s,

$$\begin{aligned} h \mathopen {}\left( v, t, s\right) = \sum \left\{ \alpha _e \mid e \in E \mathopen {}\left( v\right) ,\ t \mathopen {}\left( e\right) \text { is between }s \text { and }t\right\} , \end{aligned}$$

where, recall that, \( E \mathopen {}\left( v\right) \) denotes the edges adjacent to v and \( t \mathopen {}\left( e\right) \) denotes the time stamp of edge \(e\in E\). The dual can now be formulated as

$$\begin{aligned} \begin{aligned} \max&\sum _{e \in E} \alpha _e, \quad \text {such that}\ \ h \mathopen {}\left( v, t, m_v\right) \le |t - m_v|, \ \text {for all}\ v \in V, \ t \in T \mathopen {}\left( v\right) , \end{aligned} \end{aligned}$$

that is, we maximize the total weight of edges such that for each vertex v and for each time stamp t, the sum of adjacent edges is bounded by \({\left| t - m_v\right| }\).

We say that a solution to the dual is maximal if we cannot increase any edge weight \(\alpha _e\) without violating the constraints. An optimal solution is maximal, but a maximal solution is not necessarily optimal.

Our next result shows that a maximal solution can be used to obtain a 2-approximate timeline.

Proposition 2

Consider a maximal solution \(\alpha _e\) to the dual program. Define a set of intervals \(\mathcal {T}= \left\{ I_{v}\right\} \) by \(I_{v} = [\min X_v, \max X_v]\), where

$$\begin{aligned} X_v = \left\{ m_v\right\} \cup \left\{ t \in T \mathopen {}\left( v\right) \mid h \mathopen {}\left( v, t, m_v\right) = |t - m_v|\right\} . \end{aligned}$$

Then \(\mathcal {T}\) is a 2-approximation solution for the problem \({\textsc {MinTimeline}_m}\).

Proof

We first show that a maximal dual solution yields a feasible timeline. Let \(e = (u, v, t)\) be a temporal edge. If \( p \mathopen {}\left( t; v\right) \cap X_v = \emptyset \) and \( p \mathopen {}\left( t; u\right) \cap X_u = \emptyset \), then we can increase the value of \(\alpha _e\) without violating the constraints, so the solution is not maximal. Thus \(t \in I_{v} \cup I_{u}\), making \(\mathcal {T}\) a feasible timeline.

Next we show that the resulting solution \(\mathcal {T}\) is a 2-approximation to \({\textsc {MinTimeline}_m}\). Write \(x_v = \min \{X_v\}\) and \(y_v = \max \{X_v\}\). Let \(\mathcal {T}^*\) be the optimal solution. Then

$$\begin{aligned} \begin{aligned} S \mathopen {}\left( \mathcal {T}\right)&= \sum _{v \in V} |x_v - m_v| + |y_v - m_v| = \sum _{v \in V} h \mathopen {}\left( v, x_v, m_v\right) + h \mathopen {}\left( v, y_v, m_v\right) \\&\le \sum _{v \in V} \sum _{e \in E \mathopen {}\left( v\right) } \alpha _e = 2\sum _{e \in E} \alpha _e \le 2 S \mathopen {}\left( \mathcal {T}^*\right) , \\ \end{aligned} \end{aligned}$$

where the second equality follows from the definition of \(X_v\), the first inequality follows from the fact that \(\alpha _e \ge 0\), and the last inequality follows from primal-dual theory. This proves the claim.   \(\square \)

We have established that as long as we can obtain a maximal solution for the dual, we can extract a timeline that is a 2-approximation. We now introduce a linear-time algorithm that computes a maximal dual solution. The algorithm visits each edge \(e = (u, v, t)\) in chronological order and increases \(\alpha _e\) as much as possible without violating the dual constraints. To obtain linear-time complexity we need to determine, in constant time, by how much we can increase \(\alpha _e\). The pseudo-code is given in Algorithm 1, and the remainder of the section proves the correctness of the algorithm.

[Algorithm 1: Maximal, computing a maximal dual solution in linear time]

Let us enumerate the edges chronologically by writing \(e_i\) for the i-th edge, and let us write \(\alpha _i\) to mean \(\alpha _{e_i}\). We will also write \(t_i\) for the time stamp of \(e_i\). Finally, let us define \(k_v\) to be the smallest index of an edge \((u,v,t)\) with \(t \ge m_v\), and \(o_v\) to be the largest index of an edge \((u,v,t)\) with \(t \le m_v\).

For simplicity, we rewrite the dual constraints using indices instead of time stamps. Given two indices \(i \le j\), we slightly overload the notation and write

$$\begin{aligned} h \mathopen {}\left( v, i, j\right) = \sum \left\{ \alpha _\ell \mid e_\ell \in E \mathopen {}\left( v\right) ,\ \ell \text { is between } i \text { and } j\right\} . \end{aligned}$$

The dual constraints can be written as

$$\begin{aligned} h \mathopen {}\left( v, i, o_v\right) \le {\left| t_i - m_v\right| }, \text { if } i < k_v, \quad \text {and}\quad h \mathopen {}\left( v, i, k_v\right) \le {\left| t_i - m_v\right| }, \text { if } i \ge k_v. \end{aligned}$$
(1)

Each dual constraint is included in these constraints. Equation (1) may also contain some additional constraints but they are redundant, so the dual constraints hold if and only if constraints in Eq. (1) hold.

As the algorithm goes over the edges, we maintain two counters for each vertex, a[v] and b[v]. Let \(e_j = (u, v, t)\) be the current edge. The counter a[v] is maintained only if \(t \ge m_v\), and the counter b[v] is maintained only if \(t < m_v\). Our invariant for the counters a[v] and b[v] is that at the beginning of the j-th round they are equal to

$$\begin{aligned} a[v] = h \mathopen {}\left( v, k_v, j\right) \quad \text {and}\quad b[v] = \min _{\ell < j} \{ m_v - t_\ell - h \mathopen {}\left( v, \ell , j - 1\right) \}. \end{aligned}$$

The following lemma tells us how to update \(\alpha _j\) using a[v] and b[v].

Lemma 1

Assume that we are processing edge \(e_j = (u, v, t)\). We can increase \(\alpha _j\) by at most

$$\begin{aligned} \min \{ z(u), z(v) \}, \quad \text {where}\quad z(w) = {\left\{ \begin{array}{ll} t - m_w - a[w] &{} \text { if } j \ge k_w, \\ \min \{ m_w - t, b[w] \} &{} \text { if } j < k_w. \end{array}\right. } \end{aligned}$$
(2)

Proof

We will prove this result by showing that \(\alpha _j \le z(v)\) if and only if all constraints in Eq. (1) related to v remain valid. Since the same holds for u, the lemma follows. We consider two cases.

First case: \(j < k_v\). In this case we have \(z(v) = \min \{ m_v - t, b[v]\} = \min _{\ell \le j} \{m_v - t_\ell - h \mathopen {}\left( v, \ell , o_v\right) \}\), before increasing \(\alpha _j\). This guarantees that if \(\alpha _j \le z(v)\), then \( h \mathopen {}\left( v, \ell , o_v\right) \le |t_\ell - m_v|\) for every \(\ell \le j\). Moreover, when \(\alpha _j = z(v)\), one of these constraints becomes tight. Since these are the only constraints containing \(\alpha _j\), we have proven the first case.

Second case: \(j \ge k_v\). If \(\ell < j\), the sum \( h \mathopen {}\left( v, \ell , k_v\right) \) does not contain \(\alpha _j\), so the corresponding constraint remains valid. If \(\ell \ge j\), then the corresponding constraint is valid if and only if \( h \mathopen {}\left( v, j, k_v\right) \le |t_j - m_v|\). This is because \(\alpha _\ell = 0\) for all \(\ell > j\). But z(v) corresponds exactly to the amount by which we can increase \(\alpha _j\) so that \( h \mathopen {}\left( v, j, k_v\right) = |t_j - m_v|\). This proves the second case.    \(\square \)

Our final step is to show how to maintain a[v] and b[v]. Maintaining a[v] is trivial: we simply add \(\alpha _j\) to a[v]. The new b[v] is equal to

$$\begin{aligned} \min _{\ell \le j} \{ m_v - t_\ell - h \mathopen {}\left( v, \ell , j\right) \} = \min \{ b[v] - \alpha _j, m_v - t - \alpha _j \}. \end{aligned}$$

Clearly the counters a[v] and b[v] and the dual variables \(\alpha _e\) can be maintained in constant time per edge processed, making Maximal a linear-time algorithm.
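Putting Lemma 1 and the counter updates together, Maximal can be sketched in Python as follows; the extraction pass at the end implements Proposition 2. This is our own rendering of the algorithm, favoring clarity over the constant-time bookkeeping of the actual linear-time implementation, and is not the authors' reference code.

```python
import math
from collections import defaultdict

def maximal(edges, m):
    """One chronological pass raises each dual weight alpha_e as much
    as Lemma 1 allows, maintaining the counters a[v] and b[v]; a second
    pass collects the tight time stamps X_v and returns the intervals
    of Proposition 2.  `edges` must be sorted by time stamp, and `m`
    maps every vertex to its inner point m_v."""
    a = defaultdict(float)                 # weight accumulated at t >= m_v
    b = defaultdict(lambda: math.inf)      # smallest slack at t < m_v
    alpha = []
    for (u, v, t) in edges:
        def z(w):                          # Lemma 1: largest safe increase
            if t >= m[w]:
                return t - m[w] - a[w]
            return min(m[w] - t, b[w])
        inc = min(z(u), z(v))
        alpha.append(inc)
        for w in (u, v):                   # counter updates
            if t >= m[w]:
                a[w] += inc
            else:
                b[w] = min(b[w] - inc, m[w] - t - inc)
    # extract X_v = {m_v} plus every stamp with h(v, t, m_v) = |t - m_v|
    per_vertex = defaultdict(list)
    for (u, v, t), ae in zip(edges, alpha):
        per_vertex[u].append((t, ae))
        per_vertex[v].append((t, ae))
    X = {v: [mv] for v, mv in m.items()}
    for w, wts in per_vertex.items():
        for side in (1, -1):               # stamps after m_w, then before
            seen = sorted(((t, ae) for (t, ae) in wts
                           if side * (t - m[w]) > 0),
                          key=lambda p: side * p[0])
            h, i = 0.0, 0
            while i < len(seen):
                t = seen[i][0]
                while i < len(seen) and seen[i][0] == t:
                    h += seen[i][1]        # parallel edges at stamp t
                    i += 1
                if math.isclose(h, abs(t - m[w])):
                    X[w].append(t)         # constraint at t is tight
    return {v: (min(xs), max(xs)) for v, xs in X.items()}, alpha
```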

5 Exact Algorithm for MinTimeline\(_b\)

In this section we develop a linear-time algorithm for the problem \({\textsc {MinTimeline}_b}\). Here we are given a temporal network G and a set of budgets \(\left\{ b_{v}\right\} \) on interval durations, and all activity intervals must satisfy \( \sigma \mathopen {}\left( I_{v}\right) \le b_{v}\).

The idea of this optimal algorithm is to map \({\textsc {MinTimeline}_b}\) to 2-SAT. To do so, we introduce a boolean variable \(x_{vt}\) for each vertex v and each time stamp \(t \in T \mathopen {}\left( v\right) \). To guarantee that the solution covers each edge \((u,v,t)\), we add a clause \((x_{vt} \vee x_{ut})\). To make sure that we do not exceed the budget, we require that for each vertex v and each pair of time stamps \(s, t \in T \mathopen {}\left( v\right) \) with \({\left| s - t\right| } > b_v\), either \(x_{vs}\) is false or \(x_{vt}\) is false; that is, we add a clause \((\lnot x_{vs} \vee \lnot x_{vt})\). It follows immediately that \({\textsc {MinTimeline}_b}\) has a solution if and only if the 2-SAT instance has a solution. The solution for \({\textsc {MinTimeline}_b}\) can be obtained from the 2-SAT solution by taking, for each vertex, the smallest interval containing all time stamps whose variables are set to true. Since 2-SAT is a polynomial-time solvable problem [1], we have the following.
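The clause construction can be sketched directly (this materializes the possibly quadratic number of budget clauses; the speed-up described later in this section avoids that). Encoding literals as `(vertex, stamp, polarity)` tuples is our own choice, for illustration.

```python
from collections import defaultdict

def mintimeline_b_clauses(edges, b):
    """Build the 2-SAT clauses for MinTimeline_b: one covering clause
    per temporal edge, and one budget clause per pair of stamps of a
    vertex v that are further apart than b[v]."""
    T = defaultdict(set)
    for (u, v, t) in edges:
        T[u].add(t)
        T[v].add(t)
    clauses = []
    for (u, v, t) in edges:            # (x_{ut} or x_{vt}): edge covered
        clauses.append(((u, t, True), (v, t, True)))
    for v, ts in T.items():            # (not x_{vs} or not x_{vt}): budget
        for s in ts:
            for t in ts:
                if s < t and t - s > b[v]:
                    clauses.append(((v, s, False), (v, t, False)))
    return clauses
```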

Proposition 3

\({\textsc {MinTimeline}_b}\) can be solved in polynomial time.

Solving 2-SAT can be done in linear time with respect to the number of clauses [1]. However, in our case we may have \( \mathcal {O} \mathopen {}\left( m^2\right) \) clauses. Fortunately, the 2-SAT instances created by our mapping have enough structure to be solvable in \( \mathcal {O} \mathopen {}\left( m\right) \) time. This speed-up is described in the remainder of the section.

Let us first review the algorithm by Aspvall et al. [1] for solving 2-SAT. The algorithm starts by constructing an implication graph \(H = (W, A)\). The graph H is directed, and its vertex set \(W = P \cup Q\) has a vertex \(p_{i}\) in P and a vertex \(q_{i}\) in Q for each boolean variable \(x_{i}\); the vertex \(p_i\) represents the literal \(x_i\) and \(q_i\) its negation \(\lnot x_i\). Then, for each clause \((x_i \vee x_j)\), there are two edges in A: \((q_i \rightarrow p_j)\) and \((q_j \rightarrow p_i)\). Negated literals are handled similarly.

In our case, the edges A are divided into two groups, \(A_1\) and \(A_2\). The set \(A_1\) contains two directed edges \((q_{vt} \rightarrow p_{ut})\) and \((q_{ut} \rightarrow p_{vt})\) for each edge \(e = (u, v, t) \in E\). The set \(A_2\) contains two directed edges \((p_{vt} \rightarrow q_{vs})\) and \((p_{vs} \rightarrow q_{vt})\) for each vertex v and each pair of time stamps \(s, t \in T \mathopen {}\left( v\right) \) such that \({\left| s - t\right| } > b_v\). Note that \(A_1\) goes from Q to P and \(A_2\) goes from P to Q. Moreover, \({\left| A_1\right| } \in \mathcal {O} \mathopen {}\left( m\right) \) and \({\left| A_2\right| } \in \mathcal {O} \mathopen {}\left( m^2\right) \).

Next, we decompose H into strongly connected components (SCCs) and order them topologically. If any strongly connected component contains both \(p_{vt}\) and \(q_{vt}\), then we know that the 2-SAT instance is not solvable. Otherwise, to obtain the solution, we enumerate the components, children first: if the boolean variables corresponding to the vertices in the component do not yet have a truth assignment, then we set \(x_{vt}\) to true if \(p_{vt}\) is in the component, and \(x_{vt}\) to false if \(q_{vt}\) is in the component.

The bottleneck of this method is the SCC decomposition, which requires \( \mathcal {O} \mathopen {}\left( {\left| W\right| } + {\left| A\right| }\right) \) time; the remaining steps can be done in \( \mathcal {O} \mathopen {}\left( {\left| W\right| }\right) \) time. Since \({\left| W\right| } \in \mathcal {O} \mathopen {}\left( m\right) \), we need to optimize the SCC decomposition to run in \( \mathcal {O} \mathopen {}\left( m\right) \) time. We will use the algorithm by Kosaraju (see [10]) for the SCC decomposition. This algorithm consists of two depth-first searches, performing constant-time operations at each visited node. Thus, we only need to optimize the DFS.

To speed up the DFS, we need to design an oracle that, given a vertex \(p \in P\), returns an unvisited neighboring vertex \(q \in Q\) in constant time. Since \({\left| Q\right| } \in \mathcal {O} \mathopen {}\left( m\right) \), this guarantees that the DFS spends at most \( \mathcal {O} \mathopen {}\left( m\right) \) time processing vertices \(p \in P\). On the other hand, if we are at \(q \in Q\), then we can use the standard DFS to find a neighboring vertex \(p \in P\). Since \({\left| A_1\right| } \in \mathcal {O} \mathopen {}\left( m\right) \), this guarantees that the DFS spends at most \( \mathcal {O} \mathopen {}\left( m\right) \) time processing vertices \(q \in Q\).

Next, we describe the oracle. First, we keep the unvisited vertices of Q in lists \(\ell [v] = (q_{vt} \in Q; q_{vt} \text { is not visited} )\), sorted chronologically. Assume that we are at \(p_{vt} \in P\). We retrieve the first vertex in \(\ell [v]\), say \(q_{vs}\), and check whether \({\left| s - t\right| } > b_v\). If so, then \(q_{vs}\) is a neighbor of \(p_{vt}\), so we return \(q_{vs}\). (Naturally, we delete \(q_{vs}\) from \(\ell [v]\) the moment we visit \(q_{vs}\).) If \({\left| s - t\right| } \le b_v\), then we similarly test the last vertex in \(\ell [v]\), say \(q_{vs'}\). If both \(q_{vs}\) and \(q_{vs'}\) are non-neighbors of \(p_{vt}\), then, since \(\ell [v]\) is sorted chronologically, we can conclude that \(\ell [v]\) contains no unvisited neighbors of \(p_{vt}\). Since \(p_{vt}\) has no neighbors outside \(\ell [v]\), we conclude that \(p_{vt}\) has no unvisited neighbors at all.
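The oracle can be sketched with one chronologically sorted deque per vertex; the names `make_oracle` and `next_neighbor` are ours. For simplicity, this sketch removes a stamp only when the oracle returns it, whereas the full algorithm also deletes \(q_{vs}\) when the DFS reaches it through an \(A_1\) edge.

```python
from collections import deque

def make_oracle(T, b):
    """For each vertex v, keep its unvisited time stamps sorted; the
    neighbors of p_{vt} are exactly the stamps s with |s - t| > b[v],
    so any unvisited neighbor must sit at one of the two ends."""
    ell = {v: deque(sorted(ts)) for v, ts in T.items()}
    def next_neighbor(v, t):
        d = ell[v]
        if d and abs(d[0] - t) > b[v]:
            return d.popleft()         # earliest stamp is a neighbor
        if d and abs(d[-1] - t) > b[v]:
            return d.pop()             # latest stamp is a neighbor
        return None                    # no unvisited neighbors remain
    return next_neighbor
```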

Using this oracle we can now perform DFS in \( \mathcal {O} \mathopen {}\left( m\right) \) time, which in turn allows us to do the SCC decomposition in \( \mathcal {O} \mathopen {}\left( m\right) \) time, which then allows us to solve \({\textsc {MinTimeline}_b}\) in \( \mathcal {O} \mathopen {}\left( m\right) \) time.

6 Related Work

To the best of our knowledge, the problem we consider in this paper has not been studied before in the literature. In this section we review briefly the lines of work that are most closely related to our setting.

Vertex cover. Our problem definition can also be considered a temporal version of the classic vertex-cover problem, one of the 21 problems shown to be NP-complete in Karp’s seminal paper [12]. A factor-2 approximation is available for vertex cover, obtained by taking all vertices of a maximal matching [6]. Slightly improved approximations exist for special cases of the problem, while, assuming the unique games conjecture, minimum vertex cover cannot be approximated within any constant factor better than 2 [13]. Nevertheless, our formulation cannot be mapped directly to the static vertex-cover problem, and thus the proposed solutions need to be tailor-made for the temporal setting.
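For illustration, the maximal-matching 2-approximation for static vertex cover fits in a few lines of Python (the function name and edge-list representation are ours): scanning the edges and taking both endpoints of every edge not yet covered builds a maximal matching, and its endpoints form a cover of size at most twice the optimum.

```python
def vertex_cover_2approx(edges):
    """Maximal-matching 2-approximation for static vertex cover:
    repeatedly pick an uncovered edge and add both its endpoints."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # edge is uncovered: it
            cover.add(u)                       # joins the matching, and
            cover.add(v)                       # both endpoints enter the cover
    return cover
```

The matched edges are vertex-disjoint, so any cover must contain at least one endpoint of each, which gives the factor-2 guarantee.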

Modeling and discovering burstiness on sequential data. Modeling and discovering bursts in time sequences is a very well-studied topic in data mining. In a seminal work, Kleinberg [14] detected burstiness using an exponential model over the delays between events. Alternative techniques are based on modeling event counts in a sliding window: Ihler et al. [11] modeled such a statistic with a Poisson process, while Fung et al. [5] used a binomial distribution. Additionally, Zhu and Shasha [26] used wavelet analysis, Vlachos et al. [23] applied Fourier analysis, and He and Parker [8] adopted concepts from mechanics to discover burst events. Finally, Lappas et al. [15] proposed discovering maximal bursts with large discrepancy.

A problem highly related to discovering bursty events is segmentation. Here the goal is to segment the sequence into k coherent pieces. One should expect that time periods of high activity will occupy their own segments. If the overall score is additive with respect to the segments, then this problem can be solved in \( \mathcal {O} \mathopen {}\left( n^2k\right) \) time [3]. Moreover, under some mild assumptions one can obtain a \((1 + \epsilon )\) approximation in linear time [7].
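To illustrate the \( \mathcal {O} \mathopen {}\left( n^2k\right) \) dynamic program of [3], here is a minimal Python sketch. The `costs` callable and all names are ours; the only assumption is the one stated above, that the overall score is additive over segments:

```python
def segment(costs, n, k):
    """O(n^2 k) DP for segmenting positions 0..n-1 into k segments.
    costs(i, j) is the cost of one segment covering positions i..j-1
    (a hypothetical user-supplied callable). Returns the minimum total cost."""
    INF = float('inf')
    # dp[j][p] = minimum cost of segmenting the prefix of length j into p segments
    dp = [[INF] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for p in range(1, k + 1):
        for j in range(1, n + 1):
            # try every split point i: last segment covers positions i..j-1
            dp[j][p] = min(dp[i][p - 1] + costs(i, j) for i in range(j))
    return dp[n][k]
```

With, say, the sum of squared deviations from the segment mean as `costs`, this recovers the classic least-squares segmentation.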

The difference between all these works and our setting is that we consider networked data, i.e., sequences of interactions among pairs of entities. By assuming that, for each interaction, only one entity needs to be active, our problem becomes highly combinatorial. To counter-balance this increased combinatorial complexity, we consider a simpler burstiness model than previous works: in particular, we assume that each entity has only one activity interval. Extending our definition to more complex activity models (multiple intervals per entity, or multiple activity levels) is left for future work.

Event detection in temporal data. As the input to our problem is a sequence of temporal edges, our work falls in the broad area of mining temporal networks [9, 19]. More precisely, the network-untangling problem can be considered an event-detection problem, where the goal is to find time intervals and/or sets of nodes with high activity. Typical event-detection methods use text or other meta-data, as they reveal event semantics. One line of work is based on constructing different types of word graphs [4, 18, 24]. The events are detected as clusters or connected components in such graphs, and temporal information is not considered directly.

Another family of methods uses statistical modeling to identify events as trends [2, 17]. Leskovec et al. [16] and Yang et al. [25] consider the spreading of short quotes in the citation network of social media. These methods rely on clustering “bursty” keywords. Our setting is considerably different, as we focus on interactions between entities and explicitly model entity activity by continuous time intervals.

Information maps. From an application point of view, our work is loosely related to papers that aim to process large amounts of data and create maps that present the available information in a succinct and easy-to-understand manner. Shahaf and co-authors have considered this problem in the context of news articles [21, 22] and scientific publications [20]. However, their approach is not directly comparable to ours, as their input is a set of documents and not a temporal network, and their output is a “metro map” and not an activity timeline.

7 Experimental Evaluation

In this section we empirically evaluate the performance of our methods.

Setup. We first test the algorithms on synthetic datasets and then present a case study on a real-world social-media dataset.

For the Synthetic dataset, we start by generating a static background network of \(n=100\) vertices with a power-law degree distribution (we use the configuration model with the power-law exponent set to 2.0). Then, for every vertex, we generate a ground-truth activity interval and add 100 interactions with random neighbors. These interactions are placed consecutively at unit time distance, and thus each activity interval has a length of \(\ell =99\) time units. We place the ground-truth activity intervals on a timeline in an overlapping manner, and we control their temporal overlap using a parameter \(p\in [0,1]\). When \(p=0\), all intervals are disjoint and every timestamp has only one interaction; thus, it should be easy to find the correct activity intervals. When \(p=1\), all intervals are merged into one, and every timestamp has 100 different interactions, so there is a large number of solutions whose score is even better than that of the ground-truth solution. In all cases Synthetic has \(10\,000\) interactions in total.
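A minimal Python sketch of this generator may clarify the construction. The exact interval-placement rule and the neighbor sampling are our assumptions: we shift consecutive intervals by \((1-p)\) times the interval span, and, for brevity, pick uniform random neighbors instead of neighbors from a configuration-model background network:

```python
import random

def make_synthetic(n=100, events=100, p=0.5):
    """Sketch of the synthetic generator (placement rule is our assumption).
    Each vertex u gets `events` interactions at consecutive unit time steps;
    consecutive intervals are shifted by (1 - p) * span, so p = 0 yields
    disjoint intervals and p = 1 stacks all intervals on the same window.
    Returns a list of temporal edges (u, v, t)."""
    span = events  # interval of length events - 1, plus one unit of spacing
    edges = []
    for u in range(n):
        start = int(round(u * (1 - p) * span))
        for k in range(events):
            # stand-in for a random configuration-model neighbor of u
            v = random.choice([w for w in range(n) if w != u])
            edges.append((u, v, start + k))
    return edges
```

With the defaults this produces exactly \(10\,000\) interactions, matching the description above.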

For the case study we use a dataset collected from Twitter. The dataset records the activity of Twitter users in Helsinki during 12.2008–05.2014. We consider only tweets with more than one hashtag (\(666\,487\) tweets) and build the co-occurrence network of these hashtags: vertices correspond to hashtags, and a time-stamped edge corresponds to a tweet in which two hashtags are mentioned together. The temporal network contains \(304\,573\) vertices and \(3\,292\,699\) edges.

Fig. 1. Output of both algorithms for different overlaps p in the ground-truth activity intervals. All values are averaged over 100 runs. (a) F-measure of correctly identified active time-stamped vertices, (b) L, total activity-interval length divided by the true total activity-interval length, (c) M, maximum activity-interval length divided by the true maximum activity-interval length.

Fig. 2. Convergence of the Maximal algorithm. Overlap p is set to 0.5; values are averaged over 100 runs. (a) Precision, recall, and F-measure, (b) L, relative total length, (c) M, relative length of the maximum interval.

Fig. 3. Part of the output of the Maximal algorithm on the Twitter dataset for November 2013. Intervals of activity of co-occurring tags, seeded from the hashtags #slush13, #mtvema, and #nokiaemg.

Results from synthetic datasets. To evaluate the quality of the discovered activity intervals we compare the set of discovered intervals with the ground-truth intervals. For every vertex u we define precision \(P_u=\frac{| TP _u|}{|F_u|}\), where \( TP _u\) is the set of correctly identified moments of activity of u, and \(F_u\) is the set of all discovered moments of activity of u. Similarly, we define the recall for vertex u as \(R_u=\frac{| TP _u|}{|A_u|}\), where \(A_u\) is the set of true moments of activity of u. We calculate the average precision and recall: \(P=\frac{1}{|V|}\sum _{u\in V}P_u\) and \(R=\frac{1}{|V|}\sum _{u\in V}R_u\); and report the F-measure \(F=\frac{2PR}{P+R}\).

In addition to F-measure, we calculate the relative total length L and the relative maximum length M. Here, L is the total length of the discovered intervals divided by the ground-truth total length of the activity intervals. Similarly, M is the maximum length of the discovered intervals divided by the true maximum length of activity intervals.
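These evaluation measures can be computed directly from sets of active timestamps per vertex. The following Python sketch (the names `found` and `truth` are ours) discretizes intervals into timestamp sets, so L and M are computed over set sizes rather than continuous interval lengths:

```python
def evaluate(found, truth):
    """Average per-vertex precision P and recall R, F-measure F, and the
    relative total (L) and maximum (M) lengths. found and truth map each
    vertex to its set of active timestamps (discretized intervals)."""
    P = R = 0.0
    for u in truth:
        tp = len(found[u] & truth[u])  # correctly identified active moments
        P += tp / len(found[u]) if found[u] else 0.0
        R += tp / len(truth[u])
    P /= len(truth)
    R /= len(truth)
    F = 2 * P * R / (P + R) if P + R > 0 else 0.0
    L = sum(len(found[u]) for u in truth) / sum(len(truth[u]) for u in truth)
    M = max(len(found[u]) for u in truth) / max(len(truth[u]) for u in truth)
    return P, R, F, L, M
```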

We test both algorithms on the Synthetic dataset with varying overlap parameter p. The results are shown in Fig. 1. All measures are averaged over 100 runs. Note that in the Synthetic dataset all activity intervals have the same length; thus, if the binary search finds the correct value of the budget, then all vertices automatically receive the correct budget.

Figure 1a demonstrates that for algorithm Maximal the F-measure is typically high for all values of the overlap parameter, but drops as p increases. On the other hand, Fig. 1b shows that algorithm Maximal takes advantage of the overlaps, and for large values of p it finds solutions that have a better score than the ground truth. This, however, leads to a decrease in accuracy. As for the maximum interval length, shown in Fig. 1c, algorithm Maximal is not designed to optimize it, and it typically finds a few large intervals while keeping the total length low. Budget finds solutions of the correct total and maximum lengths on the Synthetic dataset for all values of the overlap parameter p.

In Fig. 2 we show how the solution of Maximal evolves over the iterations with re-initialization. After a couple of iterations, the value and the quality (F-measure, precision, and recall) of the solution improve significantly. During the subsequent iterations the value of the solution does not change, but the quality keeps increasing. The method converges in fewer than 10 iterations.

Scalability. Both Budget and Inner use linear-time algorithms in their inner loops, and the number of outer-loop iterations needed is small. This means that our methods are scalable. To demonstrate this, we ran Maximal on a network of 1 million vertices and 1 billion interactions in 15 minutes, despite the large constant factor due to the Python implementation.

Case study. Next we present our results on the Twitter dataset. In Fig. 3 we show a subset of hashtags from tweets posted in November 2013. We also depict the activity intervals for those hashtags, as discovered by algorithm Maximal. Note that, to avoid cluttering the image, we depict only a subset of all relevant hashtags. In particular, we pick three “seed” hashtags, #slush13, #mtvema, and #nokiaemg, together with the set of hashtags that co-occur with the seeds. Each of the seeds corresponds to a known event: #slush13 corresponds to Slush’13, the world’s leading startup and tech event, organized in Helsinki on November 13–14, 2013; #mtvema corresponds to the MTV Europe Music Awards, held on November 10, 2013; and #nokiaemg corresponds to the Extraordinary General Meeting (EGM) of Nokia Corporation, held in Helsinki on November 19, 2013.

For each hashtag we plot its entire interval with a light color, and the discovered activity interval with a dark color. For each selected hashtag, we draw interactions (co-occurrence) with other selected hashtags using black vertical lines, while we mark interactions with non-selected hashtags by ticks.

Figure 3 shows that the tag #slush13 becomes active exactly on the starting date of the event. During its activity this tag covers many technical tags, e.g., #zenrobotics (a Helsinki-based automation company), #younited (a personal cloud service by a local company), and #walkbase (a local software company). Then, on November 19, the tag #nokiaemg becomes active: this event is very narrow and covers mentions of Microsoft executive Stephen Elop. Another large event occurs around November 10, with active tags #emazing, #ema2013, and #mtvema. These cover #bestpop, #bestvideo, and other related tags.

8 Conclusions

In this paper we introduced and studied a new problem, which we called network untangling. Given a set of temporal undirected interactions, our goal is to discover activity time intervals for the network entities, so as to explain the observed interactions. We consider two settings: MinTimeline, where we aim to minimize the total sum of activity-interval lengths, and \({\textsc {MinTimeline}_\infty }\), where we aim to minimize the maximum interval length. We show that the former problem is NP-hard and we develop efficient iterative algorithms, while the latter problem is solvable in polynomial time.

There are several natural open questions. First, it is not known whether there is an approximation algorithm for MinTimeline or whether the problem is inapproximable. Second, our model uses one activity interval for each entity; a natural extension of the problem is to consider k intervals per entity, and/or different activity levels.