1 Influencing conversations by controlling the topic

Conversation, an interactive discussion between two or more people, is one of the most essential and common forms of communication in our daily lives.Footnote 1 One of the many functions of conversation is influence: having an effect on the beliefs, opinions, or intentions of other participants. Using multi-party conversations to study and identify influencers, the people who influence others, has long been a focus of researchers in communication, sociology, and psychology (Katz and Lazarsfeld 1955; Brooke and Ng 1986; Weimann 1994), who have acknowledged a correlation between a participant's conversational behavior and how influential he or she is perceived to be by others (Reid and Ng 2000).

In an early study on this topic, Bales (1970) argues “To take up time speaking in a small group is to exercise power over the other members for at least the duration of the time taken, regardless of the content.” This statement asserts that structural patterns such as speaking time and activeness of participation are good indicators of power and influence in a conversation. Participants who talk most during a conversation are often perceived as having more influence (Sorrentino and Boutiller 1972; Regula and Julian 1973; Daley et al. 1977; Ng et al. 1993), more leadership ability (Stang 1973; Sorrentino and Boutiller 1972), more dominance (Palmer 1989; Mast 2002) and more control of the conversation (Palmer 1989). Recent work using computational methods also confirms that structural features such as number of turns and turn length are among the most discriminative features to classify whether a participant is influential or not (Rienks et al. 2006; Biran et al. 2012).

However, it is wrong to take Bales’s claim too far; the person who speaks loudest and longest is not always the most powerful. In addition to structural patterns, the characteristics of language used also play an important role in establishing influence and controlling the conversation (Ng and Bradac 1993). For example, particular linguistic choices such as message clarity, powerful and powerless language (Burrel and Koper 1998), and language intensity (Hamilton and Hunter 1998) in a message can increase influence. More recently, Huffaker (2010) showed that linguistic diversity expressed by lexical complexity and vocabulary richness has a strong relationship with leadership in online communities. To build a classifier to detect influencers in written online conversations, Biran et al. (2012) also propose to use a set of content-based features to capture various participants’ conversational behaviors, including persuasion and agreement/disagreement.

Among many studied behaviors, topic control and management is considered one of the most effective ways to control the conversation (Planalp and Tracy 1980). Palmer (1989) shows that the less related a participant's utterances are to the immediate topic, the more dominant that participant is, and argues, "the ability to change topical focus, especially given strong cultural and social pressure to be relevant, means having enough interpersonal power to take charge of the agenda." Recent work by Rienks et al. (2006) also shows that, among the features considered (including the structural patterns discussed above), topic change is the most robust feature for detecting influencers in small group meetings.

In this article, we introduce a new computational model that captures the role of topic control in participants' influence on conversations. Speaker Identity for Topic Segmentation (SITS), a hierarchical Bayesian nonparametric model, takes an unsupervised statistical approach that requires few resources and can be used in many domains without extensive training and annotation. More importantly, SITS incorporates an explicit model of speaker behavior by quantitatively characterizing individuals' tendency to exercise control over the topic of conversation (Sect. 3). By focusing on topic changes in conversations, we go beyond previous work on influencers in two ways:

  • First, while structural statistics such as number of turns, turn length, and speaking time are relatively easy to extract from a conversation, defining and detecting topic changes is less well understood. Topic, by itself, is a complex concept (Blei et al. 2003; Kellermann 2004). In addition, despite the large number of techniques proposed for dividing a document into smaller, topically coherent segments (Purver 2011), topic segmentation remains an open research problem. Most previous computational methods for topic discovery and topic segmentation focus on content, ignoring speaker identities. We show that we can capture conversational phenomena and influence better by explicitly modeling the behaviors of participants.

  • Second, the conversation is often controlled explicitly, to some extent, by a subset of participants. For example, in political debates questions come from the moderator(s), and candidates typically have a fixed time to respond. These imposed aspects of conversational structure decrease the value of the more easily extracted structural statistics for a variety of conversation types; similar constraints appear, for example, among hosts and guests in televised political discussion shows such as CNN's Crossfire.

Applying SITS to real-world conversations (Sect. 4), we show that this modeling approach is not only more effective than previous methods on traditional topic segmentation (Sect. 5), but also more intuitive in that it captures an important behavior of individual speakers during conversations (Sect. 6). We then show that using SITS to model topic control improves influencer detection (Sect. 7). Taken together, the quantitative and qualitative results suggest that our approach holds significant promise for further development; we discuss directions for future work in Sect. 8.

2 What is an influencer?

2.1 Influencer definition

In most research on persuasion and power, an influencer attempts to gain compliance from others or uses tactics to shape the opinions, attitudes, or behaviors of others (Scheer and Stern 1992; Schlenker et al. 1976). In research on social media, such as blogs and Twitter, measurements such as the number of followers or readers serve as a proxy for influence (Alarcon-del Amo et al. 2011; Booth and Matic 2011; Trammell and Keshelashvili 2005). Others have studied what influencers say; Drake and Moberg (1986) demonstrated that linguistic influence differs from attempts to influence that rely on power and exchange relationships. In interactions with targets, influencers may rely more on linguistic frames and language than on resources offered, which is proposed as the requirement for influence by exchange theorists (Blau 1964; Foa and Foa 1972; Emerson 1981).

We define an influencer as someone who has persuasive ability over where an interaction is headed, what topics are covered, and what positions are espoused within that interaction. In the same way that persuasion shapes, reinforces, or changes attitudes or beliefs, an influencer shapes, reinforces, or changes the direction of the interaction. An influencer within an interaction is someone who may introduce new ideas or arguments into the conversation that others pick up on and discuss (shapes new directions through topic shift), may express arguments about an existing topic that others agree to and further in the discussion (i.e., reinforces the direction), or may provide counter-arguments that others agree to and perpetuate, thereby redirecting where the topic of conversation is headed (i.e., changes the direction of the conversation).

2.2 Data scope and characteristics

We are interested in influence in turn-taking, multiparty discussions. This is a broad category including political debates, business meetings, online chats, discussions, conference panels, and many TV or radio talk shows. More formally, such datasets contain \(C\) conversations. A conversation \(c\) has \(T_c\) turns, each of which is a maximal uninterrupted utterance by one speaker.Footnote 2 In each turn \(t \in [1, T_c]\), a speaker \(a_{c,t}\) utters \(N_{c,t}\) words \(w_{c,t} = \{w_{c,t,n} \mid n \in [1, N_{c,t}]\}\). Each word is from a vocabulary of size \(V\), and there are \(M\) distinct speakers.
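To make this notation concrete, the following minimal sketch (in Python; the class and field names are ours for illustration, not part of the model) shows one way such a corpus could be represented:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    speaker: int        # speaker index a_{c,t}, in [0, M)
    words: List[int]    # token indices w_{c,t,n}, each in [0, V)

@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)   # T_c turns

@dataclass
class Corpus:
    vocab_size: int      # V
    num_speakers: int    # M
    conversations: List[Conversation] = field(default_factory=list)  # C conversations
```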

3 Modeling topic shift

In this section, we describe SITS, a hierarchical nonparametric Bayesian model for topic segmentation that takes into consideration speaker identities, allowing us to characterize speakers’ topic control behavior over the course of the discussion (Nguyen et al. 2012). We begin with an overview of the topic segmentation problem and some related work. We then highlight the differences between SITS and previous approaches and describe the generative process and the inference technique we use to estimate the model.

3.1 Topic segmentation and modeling approaches

Whether in an informal situation or in more formal settings such as a political debate or business meeting, a conversation is often not about just one thing: topics evolve and are replaced as the conversation unfolds. Discovering this hidden structure in conversations is a key problem for building conversational assistants (Tur et al. 2010) and developing tools that summarize (Murray et al. 2005) and display (Ehlen et al. 2007) conversational data. Understanding when and how the topics change also helps us study human conversational behaviors such as individuals' agendas (Boydstun et al. 2013), patterns of agreement and disagreement (Hawes et al. 2009; Abbott et al. 2011), relationships among conversational participants (Ireland et al. 2011), and dominance and influence among participants (Palmer 1989; Rienks et al. 2006).

One of the most natural ways to capture conversational structure is topic segmentation—the task of "automatically dividing single long recordings or transcripts into shorter, topically coherent segments" (Purver 2011). Previous work has broadly taken two approaches to this problem. The first focuses on identifying discourse markers that distinguish topical boundaries in conversations. Certain cue phrases such as well, now, and that reminds me explicitly indicate the end of one topic or the beginning of another (Hirschberg and Litman 1993; Passonneau and Litman 1997). These markers can also serve as features for a discriminative classifier (Galley et al. 2003) or as observed variables in a generative model (Dowman et al. 2008). However, in practice the discourse markers that are most indicative of topic change often depend heavily on the domain of the data (Purver 2011). This drawback makes methods that rely solely on these markers difficult to adapt to new domains or settings.

Our method follows the second general approach, which relies on the insight that topical segments evince lexical cohesion (Halliday and Hasan 1976). Intuitively, words within a segment will look more like their neighbors than like words in other segments. This has been a key idea in previous work. Morris and Hirst (1991) try to determine the structure of text by finding "lexical chains," which consist of units of text that are about the same thing. The widely used text segmentation algorithm TextTiling (Hearst 1997) exploits this insight by computing the lexical similarity between adjacent sentences. More recent improvements to this approach include using different lexical similarity metrics such as lsa (Choi et al. 2001; Olney and Cai 2005) and improving feature extraction for supervised methods (Hsueh et al. 2006). The same insight also inspires unsupervised models using bags of words (Purver et al. 2006), language models (Eisenstein and Barzilay 2008), and shared structure across documents (Chen et al. 2009).

We also exploit lexical cohesion, using a probabilistic topic modeling approach (Blei et al. 2003; Blei 2012). The approach we take is unsupervised, so it requires few resources and is applicable in many domains without extensive training. Following the literature on topic modeling, we define each topic as a multinomial distribution over the vocabulary. As in previous generative models proposed for topic segmentation (Purver et al. 2006), each turn is considered a bag of words generated from an admixture of topics, and topics are shared across different turns within a conversation and across different conversations.Footnote 3 In addition, we take a Bayesian nonparametric approach (Müller and Quintana 2004) to allow the number of topics to be unbounded, in order to better represent the observed data.

The settings described above are still consistent with those in popular topic models such as latent Dirichlet allocation (Blei et al. 2003, lda) or hierarchical Dirichlet processes (Teh et al. 2006, hdp), in which turns in a conversation are considered independent. In practice, however, this is not the case: the topics of a turn at time t are highly correlated with those of the turn at t+1. To address this issue, there have been several recent attempts to capture the temporal dynamics within a document. Du et al. (2010) propose Sequential lda to study how topics within a document evolve over its structure. It uses the nested two-parameter Poisson Dirichlet process (pdp) to model the progressive dependency between consecutive parts of a document, which captures the continuity of topical flow nicely but does not model topic changes explicitly. Fox et al. (2008) proposed Sticky hdp-hmm, an extension of hdp-hmm (Teh et al. 2006) for the speaker diarization problem of segmenting an audio recording into intervals associated with individual speakers. Applied to the conversational setting, Sticky hdp-hmm associates each turn with a single topic; this is a strong assumption, since people tend to talk about more than one thing in a turn, especially in political debates. We will, however, use it as one of the baselines in our topic segmentation experiment (Sect. 5). A related problem is to discover how topics themselves change over time (Blei and Lafferty 2006; Wang et al. 2008; Ren et al. 2008; Ahmed and Xing 2008, 2010); e.g., documents that talk about "physics" in 1900 will use very different terms than "physics" in 2000. These models assume documents that are much longer and topics that evolve much more slowly than in a conversation.

Moreover, many of these methods do not explicitly model the changes of the topics within a document or conversation. To address this, we endow each turn with a binary latent variable \(l_{c,t}\), called the topic shift indicator (Purver et al. 2006). This latent variable signifies whether in this turn the speaker changed the topic of the conversation. In addition, to capture the topic-controlling behavior of the speakers across different conversations, we further associate each speaker \(m\) with a latent topic shift tendency denoted by \(\pi_m\). Informally, this variable is intended to capture the propensity of a speaker to effect a topic shift. Formally, it represents the probability that speaker \(m\) will change the topic (distribution) of a conversation. In the remainder of this section, we will describe the model in more detail together with the inference techniques we use.

3.2 Generative process of SITS

SITS is a generative model of multiparty discourse that jointly discovers topics and speaker-specific topic shifts from an unannotated corpus (Fig. 1a). As in the hierarchical Dirichlet process (Teh et al. 2006), we allow an unbounded number of topics to be shared among the turns of the corpus. Topics are drawn from a base distribution \(H\) over multinomial distributions over the vocabulary of size \(V\); \(H\) is a finite Dirichlet distribution with symmetric prior \(\lambda\). Unlike the hdp, where every document (here, every turn) independently draws a new multinomial distribution from a Dirichlet process, the social and temporal dynamics of a conversation, as specified by the binary topic shift indicator \(l_{c,t}\), determine when new draws happen.

Fig. 1
figure 1

Plate diagrams of our proposed models: (a) nonparametric SITS; (b) parametric SITS. Nodes represent random variables (shaded nodes are observed); lines are probabilistic dependencies. Plates represent repetition. The innermost plates are turns, grouped in conversations

Generative process

The formal generative process is:

  1. For each speaker \(m \in [1, M]\), draw the speaker topic shift probability \(\pi_m \sim \mbox{Beta}(\gamma)\).

  2. Draw the global topic distribution \(G_0 \sim \mbox{DP}(\alpha, H)\).

  3. For each conversation \(c \in [1, C]\):

    (a) Draw a conversation-specific topic distribution \(G_c \sim \mbox{DP}(\alpha_0, G_0)\).

    (b) For each turn \(t \in [1, T_c]\) with speaker \(a_{c,t}\):

      i. If \(t = 1\), set the topic shift indicator \(l_{c,t} = 1\). Otherwise, draw \(l_{c,t} \sim \mbox{Bernoulli}(\pi_{a_{c,t}})\).

      ii. If \(l_{c,t} = 1\), draw \(G_{c,t} \sim \mbox{DP}(\alpha_c, G_c)\). Otherwise, set \(G_{c,t} \equiv G_{c,t-1}\).

      iii. For each word index \(n \in [1, N_{c,t}]\):

        • Draw a topic \(\psi_{c,t,n} \sim G_{c,t}\).

        • Draw a token \(w_{c,t,n} \sim \mbox{Multinomial}(\psi_{c,t,n})\).

The hierarchy of Dirichlet processes allows statistical strength to be shared across contexts: within a conversation and across conversations. The per-speaker topic shift tendency \(\pi_m\) allows speaker identity to influence the evolution of topics.

Intuitively, SITS generates a conversation as follows: At the beginning of a conversation \(c\), the first speaker \(a_{c,1}\) draws a distribution over topics \(G_{c,1}\) from the base distribution, and uses that topic distribution to draw a topic \(\psi_{c,1,n}\) for each token \(w_{c,1,n}\). Subsequently, at turn \(t\), speaker \(a_{c,t}\) will first flip a speaker-specific biased coin \(\pi_{a_{c,t}}\) to decide whether \(a_{c,t}\) will change the topic of the conversation. If the coin comes up tails (\(l_{c,t} = 0\)), \(a_{c,t}\) will not change the conversation topic and uses the previous turn's topic distribution \(G_{c,t-1}\) to generate turn \(t\)'s tokens. If, on the other hand, the coin comes up heads (\(l_{c,t} = 1\)), \(a_{c,t}\) will change the topic by drawing a new topic distribution \(G_{c,t}\) from the conversation-specific collection of topics \(\mbox{DP}(\alpha_c, G_c)\).
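The following sketch (in Python) simulates this generative story for one conversation. For simplicity it uses the parametric, finite-K variant of SITS (Fig. 1b) rather than the Dirichlet process hierarchy, so the turn-level distribution is drawn from a finite Dirichlet centred on the conversation-level one; all names and arguments are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def generate_conversation(pi, theta_conv, topics, speakers, turn_lengths,
                          alpha_c, rng=np.random.default_rng(0)):
    """Simulate one conversation under (parametric) SITS.
    pi[m]          -- speaker m's topic shift tendency
    theta_conv     -- conversation-level distribution over K topics
    topics[k]      -- multinomial over the vocabulary for topic k
    speakers[t]    -- a_{c,t};  turn_lengths[t] -- N_{c,t}"""
    K = len(topics)
    shifts, turns = [], []
    theta_turn = None
    for t, (a, n_words) in enumerate(zip(speakers, turn_lengths)):
        # The first turn always shifts; later turns flip the speaker's coin.
        l = 1 if t == 0 else int(rng.binomial(1, pi[a]))
        if l == 1:
            # New turn-level topic distribution centred on the conversation's.
            theta_turn = rng.dirichlet(alpha_c * np.asarray(theta_conv))
        shifts.append(l)
        # Each token first draws a topic, then a word from that topic.
        z = rng.choice(K, size=n_words, p=theta_turn)
        words = [int(rng.choice(len(topics[k]), p=topics[k])) for k in z]
        turns.append(words)
    return shifts, turns
```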

Segmentation notation

To make notation more concrete and to connect our model with topic segmentation, we introduce the notion of segments in a conversation. A segment s of conversation c is a sequence of turns [τ,τ′] such that

$$\left \{ \begin{array}{l} l _{c,\tau}= l _{c,{\tau' + 1}} = 1 \\ l _{c,t} = 0,\quad \forall t \in[\tau+ 1, \tau'] \end{array} \right . $$

When \(l_{c,t} = 0\), \(G_{c,t}\) is the same as \(G_{c,t-1}\), and all topics (i.e., multinomial distributions over words) \(\{\psi_{c,t,n} \mid n \in [1, N_{c,t}]\}\) that generate words in turn \(t\) and the topics \(\{\psi_{c,t-1,n'} \mid n' \in [1, N_{c,t-1}]\}\) that generate words in turn \(t-1\) come from the same distribution. Thus, all topics used in a segment \(s\) are drawn from a single segment-specific probability measure \(G_{c,s}\),

$$ G _{c,s} \mid l _{c,1}, l _{c,2}, \ldots, l _{c,{T_c}}, \alpha_c, G_c \sim\mbox{DP}({ \alpha_c}, {G_c}) $$
(1)

A visual illustration of these notations can be found in Fig. 2. For notational convenience, \(S_c\) denotes the number of segments in conversation \(c\), and \(s_t\) denotes the segment index of turn \(t\). We emphasize that all segment-related notations are derived from the posterior over the topic shifts \(\boldsymbol{l}\) and not part of the model itself.
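In code, deriving segment indices from a posterior sample of the topic shift indicators is a one-pass scan; a small sketch (in Python, names illustrative):

```python
def segments_from_shifts(shifts):
    """Map topic shift indicators l_{c,1..T_c} (0/1 values, with
    shifts[0] == 1) to a segment index s_t for every turn: a new segment
    starts exactly at the turns whose indicator is 1."""
    seg, seg_of_turn = -1, []
    for l in shifts:
        if l == 1:
            seg += 1
        seg_of_turn.append(seg)
    return seg_of_turn   # e.g. [1, 0, 0, 1, 0] -> [0, 0, 0, 1, 1]
```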

Fig. 2
figure 2

Diagram of notation for topic shift indicators and conversation segments: Each turn is associated with a latent binary topic shift indicator \(l\) specifying whether the topic of the turn is shifted. In this example, topic shifts occur in turns \(\tau\) and \(\tau'+1\). As a result, the topic shift indicators of turns \(\tau\) and \(\tau'+1\) are equal to 1 (i.e., \(l_{c,\tau} = l_{c,\tau'+1} = 1\)) and the topic shift indicators of all turns in between are 0 (i.e., \(l_{c,t} = 0, \forall t \in [\tau+1, \tau']\)). Turns \([\tau, \tau']\) form a segment \(s\) in which all topic distributions \(G_{c,\tau}, G_{c,\tau+1}, \ldots, G_{c,\tau'}\) are the same and are denoted collectively as \(G_{c,s}\)

3.3 Inference for SITS

To find the latent variables that best explain observed data, we use Gibbs sampling, a widely used Markov chain Monte Carlo inference technique (Neal 2000; Resnik and Hardisty 2010). The state space of our Gibbs sampler consists of the latent topic indices assigned to all tokens, \(\boldsymbol{z} = \{z_{c,t,n}\}\), and the topic shifts assigned to turns, \(\boldsymbol{l} = \{l_{c,t}\}\). We marginalize over all other latent variables. In each iteration of the sampling process, we loop over each turn in each conversation. For a given turn \(t\) in conversation \(c\), we first sample the topic shift indicator variable \(l_{c,t}\) (Sect. 3.3.2) and then sample the topic assignment \(z_{c,t,n}\) for each token in the turn (Sect. 3.3.1). Here, we only present the conditional sampling equations; for details on how they are derived, see Appendix A.
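A sketch of this sweep order (in Python); `state`, `sample_shift`, and `sample_topic` are illustrative placeholders for the collapsed state and the conditionals derived in Sects. 3.3.2 and 3.3.1:

```python
def gibbs_sweep(corpus, state, sample_shift, sample_topic):
    """One Gibbs sweep over the collapsed state: for every turn, resample
    its topic shift indicator l_{c,t}, then the topic assignment z_{c,t,n}
    of each of its tokens."""
    for c, conv in enumerate(corpus.conversations):
        for t, turn in enumerate(conv.turns):
            if t > 0:                         # l_{c,1} is fixed to 1
                state.l[c][t] = sample_shift(state, c, t)
            for n in range(len(turn.words)):
                state.z[c][t][n] = sample_topic(state, c, t, n)
    return state
```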

3.3.1 Sampling topic assignments

In Bayesian nonparametrics, the Chinese restaurant process (crp) metaphor is often used to explain the clustering effect of the Dirichlet process (Ferguson 1973). The crp is an exchangeable distribution over partitions of integers, which facilitates Gibbs sampling (Neal 2000) (as we will see in (2)). When used in topic models, each Chinese restaurant consists of an infinite number of tables, each of which corresponds to a topic. Customers, each of which corresponds to a token, are assigned to tables; if two tokens are assigned to the same table, they share the same topic.

The crp has a “rich get richer” property, which means that tables with many customers will attract yet more customers—a new customer will sit at an existing table with probability proportional to the number of customers currently at the table. The crp has no limit on the number of tables; when a customer needs to be seated, there is always a probability—proportional to the Dirichlet parameter α—that it will be seated at a new table. When a new table is formed, it is assigned a “dish”; this is a draw from the Dirichlet process’s base distribution. In a topic model, this atom associated with a new table is a multinomial distribution over word types. In a standard, non-hierarchical crp, this multinomial distribution comes from a Dirichlet distribution.

But it doesn’t have to—hierarchical nonparametric models extend the metaphor further by introducing a hierarchy of restaurants (Teh et al. 2006; Teh 2006), where the base distribution of one restaurant can be another restaurant. This is where things can get tricky. Instead of having a seating assignment, a customer now has a seating path and is potentially responsible for spawning new tables in every restaurant. In SITS there are restaurants for the current segment, the conversation, and the entire corpus, as shown in Fig. 3.
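The single-restaurant seating rule is the building block that this hierarchy stacks at the segment, conversation, and corpus levels. A minimal sketch of that rule (in Python; names illustrative):

```python
import numpy as np

def crp_seat(table_counts, alpha, rng=np.random.default_rng(0)):
    """Seat one customer under the Chinese restaurant process: an existing
    table is chosen with probability proportional to its current number of
    customers ("rich get richer"), a new table with probability
    proportional to the concentration parameter alpha."""
    weights = np.array(list(table_counts) + [alpha], dtype=float)
    seat = int(rng.choice(len(weights), p=weights / weights.sum()))
    return seat   # seat == len(table_counts) means "open a new table"
```

In the hierarchical version, opening a new table at one level triggers a seating decision at the level above, which is exactly the path sampling described next.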

Fig. 3
figure 3

Illustration of topic assignments in our inference algorithm. Each solid rectangle represents a restaurant (i.e., a topic distribution) and each circle represents a table (i.e., a topic). To assign token \(n\) of turn \(t\) in conversation \(c\) to a table \(z_{c,t,n}\) in the corpus-level restaurant, we need to sample a path assigning the token to a segment-level table, the segment-level table to a conversation-level table and the conversation-level table to a globally shared corpus-level table

To sample \(z_{c,t,n}\), the index of the shared topic assigned to token \(n\) of turn \(t\) in conversation \(c\), we need to sample the path assigning each word token to a segment-level table, each segment-level table to a conversation-level table and each conversation-level table to a shared dish. Before describing the sampling equations, we introduce notation denoting the counts:

  • \(N_{c,s,k}\): number of tokens in segment \(s\) of conversation \(c\) assigned to dish \(k\)

  • \(N_{c,k}\): number of segment-level tables in conversation \(c\) assigned to dish \(k\)

  • \(N_k\): number of conversation-level tables assigned to dish \(k\).

Note that we use \(k\) to index the global topics shared across the corpus, each of which corresponds to a dish in the corpus-level restaurant. In general, computing the exact values of these counts makes bookkeeping rather complicated. Since there might be multiple tables at a lower-level restaurant assigned to the same table at the higher-level restaurant, to compute the correct counts, we need to sum the number of customers over all these tables. For example, in Fig. 3, since both \(\psi_{c,1}\) and \(\psi_{c,2}\) are assigned to \(\psi_{0,2}\) (i.e., \(k=2\)), to compute \(N_{c,k}\) we have to sum over the number of customers currently assigned to \(\psi_{c,1}\) and \(\psi_{c,2}\) (which are 4 and 2 respectively in this example).

To mitigate this bookkeeping problem and to speed up the sampling process, we use the minimal path assumption (Cowans 2006; Wallach 2008) to generate the path assignments.Footnote 4 Under the minimal path assumption, a new table in a restaurant is created only when there is no table already serving the dish; in other words, in a restaurant there is at most one table serving a given dish. A more detailed example of the minimal path assumption is illustrated in Fig. 4. Using this assumption, in the example shown in Fig. 3, \(\psi_{c,1}\) and \(\psi_{c,2}\) will be merged, since they are both assigned to \(\psi_{0,2}\).

Fig. 4
figure 4

Illustration of the minimal path assumption. This figure shows an example of the seating assignments in a hierarchy of Chinese restaurants consisting of a higher-level restaurant and a lower-level restaurant. Each table in the lower restaurant is assigned to a table in the higher restaurant, and tables on the same path serve the same dish \(k\). When sampling the assignment for table \(\psi^{L}_{2}\) in the lower restaurant, given that dish \(k=2\) is assigned to this table, there are two options for how the table in the higher restaurant could be selected: it could be an existing table \(\psi^{H}_{2}\) or a new table \(\psi^{H}_{\mathit{new}}\), both serving dish \(k=2\). Under the minimal path assumption, it is always assigned to an existing table (if possible) and only assigned to a new table if there is no table with the given dish. In this case, the minimal path assumption will assign \(\psi^{L}_{2}\) to \(\psi^{H}_{2}\)

Now that we have introduced our notation, the conditional distribution for \(z_{c,t,n}\) is

$$\begin{aligned} &P\bigl(z _{c,t,n}\mid w _{c,t,n}, \boldsymbol{z}^{-{c, t, n}}, \boldsymbol{w}^{-{c, t, n}}, \boldsymbol{l}, *\bigr) \\ &\quad\propto P\bigl(z _{c,t,n} \mid\boldsymbol{z}^{-{c, t, n}}\bigr) P\bigl(w _{c,t,n} \mid z _{c,t,n}, \boldsymbol{w}^{-{c, t, n}}, \boldsymbol {l}, *\bigr) \end{aligned}$$
(2)

The first factor is the prior probability of assigning the token to a path according to the minimal path assumption (Wallach 2006, p. 60),

$$ P\bigl(z _{c,t,n} = k \mid\boldsymbol{z}^{-{c, t, n}}\bigr) \propto \frac{ N _{c,{s_t},k} ^{-{c, t, n}} + \alpha_c \frac{ N _{c,k} ^{-{c, t, n}} + \alpha_0 \frac{ N_k ^{-{c, t, n}} + \alpha\frac{1}{K^+}}{N_{\cdot} ^{-{c, t, n}} + \alpha}}{N _{c,\cdot}^{-{c, t, n}} + \alpha_0}}{ N _{c,{s_t},\cdot}^{-{c, t, n}} + \alpha_c}, $$
(3)

where \(K^+\) is the current number of shared topics.Footnote 5 Intuitively, (3) computes the probability of token \(w_{c,t,n}\) being generated from a shared topic \(k\). This probability is proportional to \(N_{c,s_t,k}\), the number of customers sitting at the table serving dish \(k\) in restaurant \(G_{c,s_t}\), smoothed by the probability of generating this token from the table serving dish \(k\) at the higher-level restaurant (i.e., restaurant \(G_c\)). This smoothing probability is computed in the same hierarchical manner until the top restaurant is reached, where the base distribution over topics is uniform and the probability of picking a topic is equal to \(1/K^+\). Equation (3) also captures the case where a table is empty: when the number of customers at that table is zero, the probability of generating the token from the corresponding topic relies entirely on the smoothing probability from the higher-level restaurant's table.

The second factor is the data likelihood. After integrating out all ψ’s, we have

$$ P\bigl(w _{c,t,n} = w \mid z _{c,t,n} = k, \boldsymbol {w}^{-{c, t, n}}, \boldsymbol{l}, *\bigr) \propto \left \{ \begin{array}{l@{\quad}l} \frac{M _{k, w} ^{-{c, t, n}} + \lambda}{M _{k, \cdot}^{-{c, t, n}} + V\lambda}, & \hbox{if $k$ exists;} \\ \frac{1}{V}, & \hbox{if $k$ is new.} \end{array} \right . $$
(4)

Here, \(M_{k,w}\) denotes the number of times word type \(w\) in the vocabulary is assigned to topic \(k\); marginal counts are represented with \(\cdot\); \(*\) represents all hyperparameters; \(V\) is the size of the vocabulary; and the superscript \(-(c,t,n)\) denotes the same counts excluding \(w_{c,t,n}\).
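A minimal sketch (in Python) of how the two factors in (2) could be computed under the minimal path assumption. The count bookkeeping, the variable names, and the way the brand-new-topic case is smoothed at the top level are our own simplifying assumptions for illustration:

```python
import numpy as np

def topic_prior(n_seg, n_conv, n_corpus, alpha_c, alpha_0, alpha):
    """Hierarchical prior of Eq. (3). n_seg[k], n_conv[k], n_corpus[k] are
    the counts N_{c,s_t,k}, N_{c,k}, N_k for the K existing topics, with
    the current token excluded; the returned vector has K+1 entries, the
    last standing for a brand-new topic (all counts zero)."""
    n_seg = np.append(np.asarray(n_seg, float), 0.0)
    n_conv = np.append(np.asarray(n_conv, float), 0.0)
    n_corpus = np.append(np.asarray(n_corpus, float), 0.0)
    K = len(n_corpus) - 1                  # current number of shared topics K+
    base = 1.0 / max(K, 1)                 # uniform base distribution over topics
    top = (n_corpus + alpha * base) / (n_corpus.sum() + alpha)
    mid = (n_conv + alpha_0 * top) / (n_conv.sum() + alpha_0)
    return (n_seg + alpha_c * mid) / (n_seg.sum() + alpha_c)

def word_likelihood(m_kw, m_k, lam, V):
    """Data likelihood of Eq. (4): smoothed topic-word probability for the
    K existing topics, and 1/V for a brand-new topic. m_kw[k] counts how
    often word w is assigned to topic k; m_k[k] is topic k's total count."""
    m_kw = np.asarray(m_kw, float)
    m_k = np.asarray(m_k, float)
    return np.append((m_kw + lam) / (m_k + V * lam), 1.0 / V)

# Sampling z_{c,t,n} then follows Eq. (2):
#   p = topic_prior(...) * word_likelihood(...)
#   z = rng.choice(len(p), p=p / p.sum())   # index K means "create a new topic"
```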

3.3.2 Sampling topic shift indicators

Sampling the topic shift variable \(l_{c,t}\) requires us to consider merging or splitting segments. We define the following notation:

  • \(\boldsymbol{k}_{c,t}\): the shared topic indices of all tokens in turn \(t\) of conversation \(c\).

  • \(S_{a_{c,t}, x}\): the number of times speaker \(a_{c,t}\) is assigned the topic shift with value \(x \in \{0, 1\}\).

  • \(J^{x}_{c,s}\): the number of topics in segment \(s\) of conversation \(c\) if \(l_{c,t} = x\).

  • \(N^{x}_{c,s,j}\): the number of tokens assigned to the segment-level topic \(j\) when \(l_{c,t} = x\).Footnote 6

Again, a superscript \(-(c,t)\) is used to denote the exclusion of turn \(t\) of conversation \(c\) from the corresponding counts.

Recall that the topic shift is a binary variable. We use 0 to represent the “no shift” case, i.e. when the topic distribution is identical to that of the previous turn. We sample this assignment with the following probability:

$$\begin{aligned} & P\bigl(l _{c,t} = 0 \mid\boldsymbol{l}^{-{c, t}}, \boldsymbol {w}, \boldsymbol{k}, \boldsymbol{a}, \ast\bigr) \\ &\quad\propto \frac{S ^{-{c, t}} _{a _{c,t}, 0} + \gamma}{S ^{-{c, t}} _{a _{c,t}, \cdot}+ 2 \gamma} \times \frac{\alpha_c^{J^0 _{c, s_t}} \prod_{j=1}^{J^0 _{c, s_t}} (N^0 _{c, s_t, j} - 1)!}{\prod_{x=1}^{N^0 _{c, s_t, \cdot}} (x-1+\alpha_c)} \end{aligned}$$
(5)

In (5), the first factor is proportional to the probability of assigning a topic shift of value 0 to speaker \(a_{c,t}\) and the second factor is proportional to the joint probability of all topics in segment \(s_t\) of conversation \(c\) when \(l_{c,t} = 0\).Footnote 7

The other alternative is for the topic shift to be 1, which represents the introduction of a new distribution over topics inside an existing segment. The probability of sampling this assignment is:

$$\begin{aligned} &P\bigl(l _{c,t} = 1 \mid\boldsymbol{l}^{-{c, t}}, \boldsymbol {w}, \boldsymbol{k}, \boldsymbol{a}, \ast\bigr) \\ &\quad\propto \frac{S ^{-{c, t}} _{a _{c,t}, 1} + \gamma}{S ^{-{c, t}} _{a _{c,t}, \cdot}+ 2 \gamma} \times \biggl( \frac{\alpha_c^{J^1 _{c, (s_{t}-1)}} \prod_{j=1}^{J^1 _{c, (s_{t}-1)}} (N^1 _{c, (s_{t}-1), j} - 1)!}{\prod_{x=1}^{N^1 _{c, (s_{t}-1), \cdot}} (x-1+\alpha_c)} \frac{\alpha_c^{J^1 _{c, s_{t}}} \prod_{j=1}^{J^1 _{c, s_{t}}} (N^1 _{c, s_{t}, j} - 1)!}{\prod_{x=1}^{N^1 _{c, s_{t}, \cdot}} (x-1+\alpha_c)} \biggr) \end{aligned}$$
(6)

As above, the first factor in (6) is proportional to the probability of assigning a topic shift of value 1 to speaker \(a_{c,t}\); the second factor, in the large parentheses, is proportional to the joint distribution of the topics in segments \(s_t - 1\) and \(s_t\). In this case, \(l_{c,t} = 1\) means splitting the current segment, which results in two joint probabilities for the two resulting segments.
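The segment-level factor shared by (5) and (6) is the CRP joint probability of one segment's table assignments, which is easy to compute in log space; comparing the two outcomes then amounts to combining the speaker factor with one such term for the merged segment (l = 0) or with two terms for the split segments (l = 1). A sketch (in Python; names illustrative):

```python
import math

def log_segment_factor(table_counts, alpha_c):
    """Log of the segment factor in Eqs. (5) and (6),
        alpha_c^J * prod_j (N_j - 1)! / prod_{x=1}^{N} (x - 1 + alpha_c),
    where table_counts[j] = N_j >= 1 is the number of tokens at segment-level
    table j, J = len(table_counts), and N is the segment's total token count."""
    n_total = sum(table_counts)
    log_p = len(table_counts) * math.log(alpha_c)
    log_p += sum(math.lgamma(n_j) for n_j in table_counts)   # log (N_j - 1)!
    log_p -= sum(math.log(x - 1 + alpha_c) for x in range(1, n_total + 1))
    return log_p
```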

4 Data and annotations

We validate our approach using five different datasets (Table 1). In this section, we describe the properties of each of the datasets and what information is available from the data. The datasets with interesting existing annotations typically are small and specialized. After validating our approach on simpler datasets, we move to larger datasets that we can explore qualitatively or by annotating them ourselves.

Table 1 Summary of datasets detailing how many distinct speakers are present, how many distinct conversations are in the corpus, the annotations available, and the general content of the dataset. The † marks datasets we annotated

4.1 Datasets

We first describe the datasets that we use in our experiments. For all datasets, we tokenize texts using Opennlp's tokenizer and remove common stopwords.Footnote 8 We then remove very short turns, since they carry little content and are unlikely to contain a topic shift; empirically, we discard turns that have fewer than 5 tokens after stopword removal.
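A sketch of this filtering rule (in Python; the 5-token threshold follows the description above, everything else is illustrative):

```python
def keep_turn(tokens, stopwords, min_tokens=5):
    """Drop stopwords from a tokenized turn; discard the whole turn (return
    None) if fewer than min_tokens content tokens remain."""
    content = [w for w in tokens if w.lower() not in stopwords]
    return content if len(content) >= min_tokens else None
```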

The icsi meeting corpus

The icsi Meeting Corpus consists of 75 transcribed meetings at the International Computer Science Institute in Berkeley, California (Janin et al. 2003). Among these, 25 meetings were annotated with reference segmentations (Galley et al. 2003). Segmentations are binary, i.e., each point in the document is either a segment boundary or not, and on average each meeting has 8 segment boundaries. We use this dataset for evaluating topic segmentation (Sect. 5). After preprocessing, there are 60 unique speakers and the vocabulary contains 3346 non-stopword tokens.

The 2008 presidential election debates

Our second dataset contains three annotated presidential debates between Barack Obama and John McCain and a vice presidential debate between Joe Biden and Sarah Palin (Boydstun et al. 2013). Each turn is one of two types: questions (Q) from the moderator or responses (R) from a candidate. Each clause in a turn is coded with a Question Topic Code (\(T_Q\)) and a Response Topic Code (\(T_R\)). Thus, a turn has a list of \(T_Q\)'s and \(T_R\)'s, both of length equal to the number of clauses in the turn. Topics are from the Policy Agendas Topics Codebook, a widely used inventory containing codes for 19 major topics and 225 subtopics.Footnote 9 Table 2 shows an example annotation.

Table 2 Example turns from the annotated 2008 election debates (Boydstun et al. 2013). Each clause in a turn is coded with a Question Topic Code (\(T_Q\)) and a Response Topic Code (\(T_R\)). The topic codes (\(T_Q\) and \(T_R\)) are from the Policy Agendas Topics Codebook. In this example, the following topic codes are used: Macroeconomics (1), Housing & Community Development (14), Government Operations (20)

To obtain reference segmentations in debates, we assign each turn a real value from 0 to 1 indicating how much a turn changes the topic. For a question-typed turn, the score is the fraction of clause topic codes not appearing in the previous turn; for response-typed turns, the score is the fraction of clause topic codes that do not appear in the corresponding question. This results in a set of non-binary reference segmentations. For evaluation metrics that require binary segmentations, we create a binary segmentation by labeling a turn as a segment boundary if the computed score is 1. This threshold is chosen to include only true segment boundaries. After preprocessing, this dataset contains 9 unique speakers and the vocabulary contains 1,761 non-stopword tokens.
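This scoring rule reduces to a small function; a sketch (in Python, with illustrative names):

```python
def turn_shift_score(clause_codes, reference_codes):
    """Non-binary reference score for a debate turn: the fraction of the
    turn's clause topic codes that do not appear in reference_codes (the
    previous turn's codes for a question turn, the corresponding question's
    codes for a response turn). A turn is marked as a binary segment
    boundary only when this score equals 1.0."""
    if not clause_codes:
        return 0.0
    ref = set(reference_codes)
    return sum(code not in ref for code in clause_codes) / len(clause_codes)
```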

The 2012 republican primary debates

We also downloaded transcripts of nine 2012 Republican Party presidential debates, listed in Table 3. Since the transcripts are pulled from different sources, we perform a simple entity resolution step using edit distance to merge duplicate participants' names. For example, "Romney" and "Mitt Romney" are resolved to "Romney"; "Paul", "Rep. Paul", and "Representative Ron Paul R-TX" are resolved to "Paul"; and so on. We also merge anonymous participants such as "Unidentified Female", "Unidentified Male", "Question", and "Unknown" into a single participant named "Audience". After preprocessing, there are 40 unique participants in these 9 debates, including candidates, moderators, and audience members. This dataset is not annotated, so we use it only for qualitative evaluation.
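A rough sketch of this kind of edit-distance merge (in Python). The matching rules and threshold here are our own guesses for illustration, not the exact procedure applied to the debate transcripts:

```python
import difflib

def resolve_name(raw_name, canonical_names, threshold=0.6):
    """Map a raw speaker string onto the closest canonical name if it
    contains that name or is sufficiently similar to it; otherwise keep
    the raw string as a new participant."""
    best, best_score = None, 0.0
    for canon in canonical_names:
        if canon.lower() in raw_name.lower():
            return canon                     # e.g. "Mitt Romney" -> "Romney"
        score = difflib.SequenceMatcher(None, raw_name.lower(), canon.lower()).ratio()
        if score > best_score:
            best, best_score = canon, score
    return best if best_score >= threshold else raw_name
```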

Table 3 List of the 9 Republican Party presidential debates used

CNN’s crossfire

Crossfire was a weekly U.S. television “talking heads” program engineered to incite heated arguments (hence the name). Each episode features two recurring hosts, two guests, and clips from the week’s news. Our Crossfire dataset contains 1134 transcribed episodes aired between 2000 and 2004.Footnote 10 There are 2567 unique speakers and the vocabulary size is 16,791. Unlike the previous two datasets, Crossfire does not have explicit topic segmentations, so we use it to explore speaker-specific characteristics (Sect. 6.2).

Wikipedia discussions

Each article on Wikipedia has a related discussion page so that the individuals writing and editing the article can discuss the content, editorial decisions, and the application of Wikipedia policies (Butler et al. 2008). Unlike the other situations, Wikipedia discussions are not spoken conversations that have been transcribed. Instead, these conversations are written asynchronously.

However, Wikipedia discussions share many of the same properties as our other corpora. Contributors have different levels of responsibility and prestige; many are actively working to persuade the group to accept their proposed policies (for an example, see Table 4), others are attempting to maintain civility, and still others are attacking their ostensible collaborators.

Table 4 Example of a Wikipedia discussion in our dataset

Unlike spoken conversations, Wikipedia discussions lack social norms that prevent an individual from writing as often or as much as they want. This makes common techniques such as counting turns or turn lengths less helpful for discovering who the influencers are.

4.2 Influencer annotation

Our goal is to discover who the influencers are in these discussions. To assess our ability to discover influencers, we annotated randomly selected documents from both the Wikipedia and Crossfire datasets. This process proceeded as follows. First, we followed the annotation guidelines for influencers proposed by Bender et al. (2011) for Wikipedia discussions. A discussant is considered an influencer if he or she initiated a topic shift that steered the conversation in a different direction, convinced others to agree to a certain viewpoint, or used an authoritative voice that caused others to defer to or reference that person's expertise. A discussant is not identified as an influencer if he or she merely initiated a topic at the start of a conversation, did not garner any support from others for the points he or she made, or was not recognized by others as speaking with authority. After annotating an initial set of documents, we revised our annotation guidelines and retrained two independent annotators until we reached an intercoder reliability (Cohen's Kappa; Artstein and Poesio 2008) of 0.8.Footnote 11

Wikipedia discussions

Coders first learned to annotate transcripts using the Wikipedia discussion data. The two coders annotated over 400 English Wikipedia discussion transcripts for influencers in batches of 20 to 30 transcripts each week. For the English transcripts, each coder annotated the transcripts independently; the annotations were then compared for agreement, and any discrepancies were resolved through discussion of how to apply the coding scheme. After the first four sets of 20 to 30 transcripts, the coders were able to code the transcripts with acceptable intercoder reliability (Cohen's Kappa >0.8). Once the coders reached acceptable intercoder reliability for two sets of English data in a row, they began independently coding the remaining transcripts. Intercoder reliability was maintained at an acceptable level (Cohen's Kappa >0.8) for the English transcripts over the subsequent weeks of coding.

Crossfire

We then turned our attention to the Crossfire dataset. We split each Crossfire episode into smaller segments using the "Commercial_Break" tags and use each segment as a unit of conversation. The same two coders annotated the Crossfire data. To prepare for annotating the Crossfire interactions, both coders annotated the same set of 20 interactions. First, intercoder reliability (Cohen's Kappa) was calculated for the agreement between the coders; then any disagreements were resolved through discussion of the discrepant annotations. The first set of 20 transcripts was coded with a Cohen's Kappa of 0.65 (before discussion). This procedure was repeated twice; each time, the coders jointly annotated 20 transcripts, reliability was calculated, and any discrepancies were resolved through discussion. The third set achieved an acceptable Cohen's Kappa of 0.8. The remaining transcripts were then split and annotated separately by the two coders. In all, 105 Crossfire episode segments were annotated. An annotation guideline for Crossfire is included in Appendix B.

5 Evaluating topic segmentation

In this section, we examine how well SITS can identify when new topics are introduced, i.e., how well it can segment conversations. We discuss metrics for evaluating an algorithm’s segmentation relative to a gold annotation, describe our experimental setup, and report those results.

5.1 Experiment setups

Evaluation metrics

To evaluate performance on topic segmentation, we use \(P_k\) (Beeferman et al. 1999) and WindowDiff (WD) (Pevzner and Hearst 2002). Both metrics measure the probability that two points in a document will be incorrectly separated by a segment boundary. Both consider all windows of size \(k\) in the document and count whether the two endpoints of the window are (im)properly segmented with respect to the gold segmentation. More formally, given a reference segmentation \(\mathcal{R}\) and a hypothesized segmentation \(\mathcal{H}\), the value of \(P_k\) for a given window size \(k\) is defined as follows:

$$ P_k = \frac{\sum_{i=1}^{N-k} \delta_{\mathcal{H}}(i, i+k) \oplus \delta _{\mathcal{R}}(i, i+k)}{N - k} $$
(7)

where \(\delta_{\mathcal{X}}(i,j)\) is 1 if the segmentation \(\mathcal {X}\) assigns i and j to the same segment and 0 otherwise; ⊕ denotes the Xor operator; N is the number of candidate boundaries.
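A direct transcription of (7) (in Python), representing each segmentation as a 0/1 list of boundary indicators (1 meaning a segment boundary immediately after that position); names are illustrative:

```python
def p_k(reference, hypothesis, k):
    """P_k of Eq. (7): the fraction of width-k windows whose two endpoints
    are placed in the same segment by one segmentation but not the other.
    Two positions i and i+k lie in the same segment iff no boundary falls
    between them."""
    N = len(reference)
    errors = 0
    for i in range(N - k):
        same_ref = sum(reference[i:i + k]) == 0
        same_hyp = sum(hypothesis[i:i + k]) == 0
        errors += (same_ref != same_hyp)
    return errors / (N - k)
```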

WD improves on \(P_k\) by considering how many boundaries lie between two points in the document, instead of just looking at whether the two points are separated or not. WD of size \(k\) between two segmentations \(\mathcal{H}\) and \(\mathcal{R}\) is defined as:

$$ \mbox{WD} = \frac{\sum_{i=1}^{N-k} [|b_{\mathcal{H}}(i,i+k) - b_{\mathcal{R}}(i, i+k)| > 0 ]}{N-k} $$
(8)

where \(b_{\mathcal{X}}(i,j)\) counts the number of boundaries that the segmentation \(\mathcal{X}\) puts between two points i and j.
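The corresponding sketch for (8), using the same boundary-indicator representation as above:

```python
def window_diff(reference, hypothesis, k):
    """WindowDiff of Eq. (8): the fraction of width-k windows in which the
    reference and the hypothesis place a different number of boundaries."""
    N = len(reference)
    errors = 0
    for i in range(N - k):
        errors += (sum(reference[i:i + k]) != sum(hypothesis[i:i + k]))
    return errors / (N - k)
```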

However, these metrics have a major drawback: they require both hypothesized and reference segmentations to be binary. Many algorithms (e.g., probabilistic approaches) give non-binary segmentations, where candidate boundaries have real-valued scores (e.g., probability or confidence). Thus, evaluation requires arbitrary thresholding to binarize soft scores. In previous work, to be fair to all methods, thresholds are usually set so that the number of segments is equal to a predefined value (Purver et al. 2006; Galley et al. 2003). In practice, this value is usually unknown.

To overcome these limitations, we also use the variant of the Earth Mover's Distance (emd) proposed by Pele and Werman (2008). Originally proposed by Rubner et al. (2000), emd is a metric that measures the distance between two normalized histograms. Intuitively, it measures the minimal cost that must be paid to transform one histogram into the other. emd is a true metric only when the two histograms are normalized (e.g., two probability distributions). The variant of Pele and Werman (2008) relaxes this restriction to define a metric for non-normalized histograms by adding or subtracting mass so that both histograms are of equal size.

In our segmentation setting, each segmentation can be considered a histogram in which each candidate boundary point corresponds to a bin. The probability of each point being a boundary is the mass of the corresponding bin. We use \(|i - j|\) as the ground distance between two points \(i\) and \(j\).Footnote 12 To compute this metric, we use the Fastemd implementation (Pele and Werman 2009).

Experimental methods

We applied the following methods to discover topic segmentations in a conversation:

  • TextTiling (Hearst 1997) is one of the earliest and most widely used general-purpose topic segmentation algorithms, sliding a fixed-width window to detect major changes in lexical similarity.

  • P-NoSpeaker-single: parametric version of SITS without speaker identity, run individually on each conversation (Purver et al. 2006).

  • P-NoSpeaker-all: parametric version of SITS without speaker identity run on all conversations.

  • P-SITS: the parametric version of SITS with speaker identity run on all conversations.

  • NP-HMM: the HMM-based nonparametric model with speaker identity. This model uses the same assumption as the Sticky hdp-hmm (Fox et al. 2008), where a single topic is associated with each turn.

  • NP-SITS: the nonparametric version of SITS with speaker identity run on all conversations.

Parameter settings and implementation

In our experiment, all parameters of TextTiling are the same as in Hearst (1997). For statistical models, Gibbs sampling with 10 randomly initialized chains is used. Initial hyperparameter values are sampled from U(0,1) to favor sparsity; statistics are collected after 500 burn-in iterations with a lag of 25 iterations over a total of 5000 iterations; and slice sampling (Neal 2003) optimizes hyperparameters. Parametric models are run with 25, 50 and 100 topics and the best results (averaged over 10 chains) are reported.

5.2 Results and analysis

Table 5 shows the performance of various models on the topic segmentation problem, using the icsi corpus and the 2008 debates.

Table 5 Results on the topic segmentation task. Lower is better. The parameter k is the window size of the metrics P k and WindowDiff chosen to replicate previous results

Consistent with previous results in the literature, probabilistic models outperform TextTiling. In addition, among the probabilistic models, the models that had access to speaker information consistently segment better than those lacking such information. Furthermore, np-sits outperforms np-hmm in both experiments, suggesting that using a distribution over topics for turns is better than using a single topic. This is consistent with the parametric models in Purver et al. (2006).

The contribution of speaker identity seems more valuable in the debate setting. Debates are characterized by strong rewards for setting the agenda; dodging a question or moving the debate toward an opponent's weakness can be useful strategies (Boydstun et al. 2013). In contrast, meetings (particularly the low-stakes icsi meetings, which are technical discussions within an r&d group) tend to have pragmatic rather than strategic topic shifts. In addition, agenda-setting roles are clearer in formal debates; a moderator is tasked with setting the agenda and ensuring the conversation does not wander too much.

The nonparametric model does best on the smaller debate dataset. We suspect that an evaluation that directly assessed topic quality, either via prediction (Teh et al. 2006) or interpretability (Chang et al. 2009b), would favor the nonparametric model more.

6 Evaluating topic control

In this section, we focus on the ability of SITS to capture the extent to which individual speakers affect topic shifts in conversations. Recall that SITS associates with each speaker a topic shift tendency π that represents the probability of changing the topic in the conversation. While topic segmentation is a well-studied problem (hence the evaluation in Sect. 5), there are no established quantitative measures of an individual's ability to control a conversation. To evaluate whether the tendency captures meaningful characteristics of speakers, we look qualitatively at the behavior of the model.

6.1 2008 election debates

To obtain a posterior estimate of π (Fig. 5), we create 10 chains with hyperparameters sampled from the uniform distribution U(0,1) and average π over the 10 chains (as described in Sect. 5.1). In these debates, Ifill is the moderator of the debate between Biden and Palin; Brokaw, Lehrer and Schieffer are the moderators of the three debates between Obama and McCain. Here "Question" denotes questions from the audience in the "town hall" debate; the role of this "speaker" can be considered equivalent to that of a debate moderator.

Fig. 5
figure 5

Topic shift tendency π of speakers in the 2008 Presidential Election Debates (larger means greater tendency). Ifill was the moderator in the vice presidential debate between Biden and Palin; Brokaw, Lehrer and Schieffer were the moderators in the three presidential debates between Obama and McCain; Question collectively refers to questions from the audiences

The topic shift tendencies of moderators are generally much higher than those of candidates. In the three debates between Obama and McCain, the moderators—Brokaw, Lehrer and Schieffer—have significantly higher scores than both candidates. This is a useful reality check, since in a debate the moderators are the ones asking questions and literally controlling the topical focus. Similarly, the "Question" speaker had a relatively high variance, consistent with that "participant" being, in the model, an amalgamation of many distinct speakers.

Interestingly, however, in the vice-presidential debate, the score of moderator Ifill is higher than the candidates’ scores only by a small margin, and it is indistinguishable from the degree of topic control displayed by Palin. Qualitatively, the assessment of the model is consistent with widespread perceptions and media commentary at the time that characterized Ifill as a weak moderator. For example, Harper’s Magazine’s Horton (2008) discusses the context of the vice-presidential debate, in particular the McCain campaign’s characterization of Ifill as a biased moderator because she “was about to publish a book entitled The Breakthrough that discusses Barack Obama, and a number of other black politicians, achieving national prominence”. According to Horton:

First, the charges against Ifill would lead to her being extremely passive in her questioning of Palin and permissive in her moderating the debate. Second, the charge of bias against Ifill would enable Palin to simply skirt any questions she felt uncomfortable answering and go directly to a pre-rehearsed and nonresponsive talking point. This strategy succeeded on both points.

Similarly, Fallows (2008) of The Atlantic included the following in his “quick guide” remarks on the debate:

Ifill, moderator: Terrible. Yes, she was constrained by the agreed debate rules. But she gave not the slightest sign of chafing against them or looking for ways to follow up the many unanswered questions or self-contradictory answers. This was the big news of the evening …

Palin: “Beat expectations.” In every single answer, she was obviously trying to fit the talking points she had learned to the air time she had to fill, knowing she could do so with impunity from the moderator.

That said, our quantitative modeling of topic shift tendency suggests that all candidates managed to succeed at some points in setting and controlling the topic of conversation in the debates. In the presidential debates, our model gives Obama a slightly higher score than McCain, consistent with social science claims that Obama had the lead in setting the agenda over McCain (Boydstun et al. 2013). Table 6 shows some examples of SITS-detected topic shifts.

Table 6 Example of turns designated as a topic shift by SITS. We chose turns to highlight speakers with high topic shift tendency π. Some keywords are manually italicized to highlight the topics discussed

6.2 Crossfire

The Crossfire dataset has many more speakers than the presidential and vice-presidential debates. This allows us to examine more closely what we can learn about speakers’ topic shift tendency and ask additional questions; for example, assuming that changing the topic is useful for a speaker, how can we characterize who does so effectively? In our analysis, we take advantage of properties of the Crossfire data to examine the relationship between topic shift tendency, social roles, and political ideology.

In order to focus on frequent speakers, we filter out speakers with fewer than 30 turns. Most speakers have relatively small π, with the mode around 0.3. There are, however, speakers with very high topic shift tendencies. Table 7 shows the speakers having the highest values according to SITS.

Table 7 Top speakers by topic shift tendencies from our Crossfire dataset. We mark hosts (†) and “speakers” who often (but not always) appeared in video clips (‡). Announcer makes announcements at the beginning and at the end of each show; Narrator narrates video clips; Male and Female refer to unidentified male and female respectively; Question collectively refers to questions from the audience across different shows. Apart from those groups, speakers with the highest tendency were political moderates

We find that there are three general patterns for who influences the course of a conversation in Crossfire. First, there are structural “speakers” that the show uses to frame and propose new topics. These are audience questions, news clips (e.g. many of Gore’s and Bush’s turns from 2000), and voiceovers. That SITS is able to recover these is reassuring, similar to what it has to say about moderators in the 2008 debates. Second, the stable of regular hosts receives high topic shift tendencies, which is again reasonable given their experience with the format and ostensible moderation roles (though in practice they also stoke lively discussion).

The third category is more interesting. The remaining non-hosts with high topic shift tendency appear to be relative moderates on the political spectrum:

  • John Kasich, one of the few Republicans to support the assault weapons ban, who was elected in 2010 as governor of Ohio, a swing state

  • Christine Todd Whitman, former Republican governor of New Jersey, a very Democratic state

  • John McCain, who before 2008 was known as a “maverick” for working with Democrats (e.g. Russ Feingold).

Although these observations are at best preliminary and require further investigation, we would conjecture that in Crossfire’s highly polarized context, it was the political moderates who pushed back, exerting more control over the agenda of the discussion, rather than going along with the topical progression and framing as posed by the show’s organizers. Table 6 shows several detected topic shifts from these speakers. In two of these examples, McCain and Whitman are Republicans disagreeing with President Bush. In the other, Kasich is defending a Republican plan (school vouchers) popular with traditional Democratic constituencies.

6.3 2012 Republican primary debates

As another qualitative data point, we include in Fig. 6 the model’s topic shift tendency scores for a subset of nine 2012 Republican primary debates. Although we do not have objective measures to compare against, nor clearly stated contemporary commentary as in the case of Ifill’s performance as moderator, we would argue that the model displays quite reasonable face validity in the context of the Republican race.

Fig. 6
figure 6

Topic shift tendency π of speakers in the 2012 Republican Primary Debates (larger means greater tendency). King, Blitzer and Cooper are moderators in these debates; the rest are candidates

For example, among the Republican candidates, Ron Paul is known for tight focus on a discrete set of arguments associated with his position that “the proper role for government in America is to provide national defense, a court system for civil disputes, a criminal justice system for acts of force and fraud, and little else” (Paul 2007), often regardless of the specific question that was asked. Similarly, Rick Santorum’s performance in the primary debates tended to include strong rhetoric on social issues. In contrast, Mitt Romney tended to be less aggressive in his responses, arguably playing things safer in a way that was consistent with his general position throughout the primaries as the front-runner.

7 Detecting influencers in conversations

7.1 Computational methods for influencer detection

In this section, we turn to the direct application and validation of the model in detecting influencers in conversations. Even though influence in conversations has been studied for decades in communication and social psychology, computational methods have emerged only in recent years, thanks to improvements in both the quantity and quality of conversational data. As one example, an early computational model to quantify influence between conversational participants (Basu et al. 2001) modeled interactions among a conversational group in a multi-sensor lounge room where people played interactive debating games. In these games, each participant can be in one of two states, speaking or silent. The model associates each participant with a Markov model: at each time step a participant is in either the speaking or the silent state, and an individual's transition from one state to another is influenced by the other participants' states. This allows the model to capture pairwise interactions among participants in the conversation. Zhang et al. (2005) then extended this work by proposing a model with a two-level structure: the participant level, representing the actions of individual participants, and the group level, representing group-level actions. In this setting, the influence of each participant on the actions of the whole group is explicitly captured by the model. These models use expensive features such as prosody and visual cues.

Another popular approach is to treat influencer detection as a supervised classification problem that separates influential individuals from non-influential ones. Rienks and Heylen (2005) focus on extracting a set of structural features that can predict participants’ involvement using Support Vector Machines (Cortes and Vapnik 1995, svm). Rienks et al. (2006) later improved on this work by extending the feature set with features capturing topic changes as well as features derived from audio and speech. In contrast, we do not use any features extracted from audio or visual data, which makes our approach more generalizable. The two most relevant and most useful features extracted from the meeting textual transcripts are number of turns and length of turns, which we use as baselines in our experiments described in Sect. 7.2. Biran et al. (2012) follow a similar approach to detecting influencers in written online conversations by extracting features that capture different conversational behaviors such as persuasion, agreement/disagreement and dialog patterns.

In this paper, we are interested in determining who the influencers in a conversation are, using only the conversation transcripts. We tackle this problem with an unsupervised ranking approach. It is worth mentioning that, even though we focus on how conversational influence is expressed in textual data, there is also a body of work that approaches this problem through audio data (Hung et al. 2011), visual data (Otsuka et al. 2006), and combined audio-visual activity cues (Jayagopi et al. 2009; Aran and Gatica-Perez 2010).

Our main purpose in these experiments is to assess how effective SITS can be in detecting influencers in conversations, especially in comparison with methods based on structural patterns of conversations. We focus on the influencer detection problem: given a speaker in a multi-party conversation, predict whether the speaker is influential. In the remainder of this section, we describe in detail the approach we take, the experimental setup, and the results.

7.2 Influencer detection problem

The influencer detection problem can be tackled using different methods that can be broadly classified into classification and ranking approaches. Most previous work follows the classification approach, in which different sets of features are proposed and a classifier is used (Rienks and Heylen 2005; Rienks et al. 2006; Biran et al. 2012). In this paper, we follow the ranking approach.

The ranking approach centers on functions that take a set of participants and produce an ordering over them from most influential to least influential. The function that produces this ordering is called a ranking method. More specifically, given a speaker a in a conversation c, each ranking method provides an influence score \(\mathcal{I} _{a, c}\) that indicates how influential speaker a is in conversation c. We emphasize that, unlike most classification approaches (Rienks and Heylen 2005; Rienks et al. 2006; Biran et al. 2012), the ranking approach we focus on is entirely unsupervised and thus requires no training data.

The ranking approach has a straightforward connection to the classification approach, as each ranking function can be turned into a feature in the supervised classification framework. However, viewing the ranking methods (features) independently allows us to compare and interpret the effectiveness of each feature in isolation. This is useful as an evaluation method because it is independent of the choice of classifier and is less sensitive to the size of training data, which is often a limiting factor in computational social science.

We consider two sets of ranking methods: (1) structure-based methods, which use structural features and (2) topic-change-based methods, which use features extracted from the outputs of SITS.

Structure-based methods

score each instance based on features extracted from the structure of the conversation. As defined in Sect. 2, we use \(T_c\) to denote the number of turns in conversation c; \(a_{c,t}\) to denote the speaker who utters turn t in conversation c; and \(N_{c,t}\) to denote the number of tokens in turn t of conversation c.

  1. Number of turns: assumes that the more turns a speaker has during a conversation, the more influential he or she is. The influence score of this method is

    $$ \mathcal{I} _{a, c} = \bigl\vert \bigl\{ t \in[1, T_c] : a _{c,t} = a \bigr\} \bigr\vert $$
    (9)
  2. Total turn lengths: instead of the number of turns, this method uses the total length of turns uttered by the speaker.

    $$ \mathcal{I} _{a, c} = \sum_{t \in[1, T_c] : a _{c,t} = a} N _{c,t} $$
    (10)

The two structural features used here capture the activeness of the speakers during a conversation and have been shown to be among the most effective features for detecting influencers. These two structure-based methods are appropriate baselines in our experiment since, although simple, they have proven very effective at detecting influencers, both qualitatively (Bales 1970) and quantitatively (Rienks et al. 2006; Biran et al. 2012).
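To make the two baselines concrete, the sketch below computes both scores for every speaker in a conversation. It is a minimal illustration only; the (speaker, token_count) input format and function names are our own assumptions, not part of the original experimental code.

```python
from collections import defaultdict

def structure_based_scores(turns):
    """Compute both structure-based influence scores for every speaker.

    `turns` is a list of (speaker, token_count) pairs for one conversation,
    a hypothetical input format chosen for illustration (Eqs. 9 and 10).
    """
    num_turns = defaultdict(int)    # Eq. (9): number of turns per speaker
    turn_length = defaultdict(int)  # Eq. (10): total turn length per speaker
    for speaker, n_tokens in turns:
        num_turns[speaker] += 1
        turn_length[speaker] += n_tokens
    return num_turns, turn_length

# Example: rank speakers of a toy conversation by total turn length.
conversation = [("A", 120), ("B", 40), ("A", 95), ("C", 10), ("B", 60)]
_, lengths = structure_based_scores(conversation)
ranking = sorted(lengths, key=lengths.get, reverse=True)  # ['A', 'B', 'C']
```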

Topic-change-based methods

score each instance based on features extracted from the posterior distributions of SITS; a short code sketch of both scores follows the list below.

  1. Total topic shifts is the expected total number of topic shifts speaker a makes in conversation c,

    $$ \mathcal{I} _{a, c} = \sum_{t \in[1, T_c] : a _{c,t} = a} \bar {l} _{c,t} $$
    (11)

    Recall that in SITS, each turn t in conversation c is associated with a binary latent variable \(l_{c,t}\), which indicates whether or not the topic of turn t is changed (these latent variables are introduced in Sect. 3). This expectation is computed through the empirical average of samples from the Gibbs sampler, \(\bar{l} _{c,t}\), after a burn-in period.Footnote 13 Intuitively, the higher \(\bar{l} _{c,t}\) is, the more successful the speaker \(a_{c,t}\) is in changing the topic of the conversation at turn t.

  2. Weighted topic shifts also quantifies the topic changes a speaker makes using the average topic shift indicator \(\bar{l} _{c,t}\), but weighted by (1−\(\pi_a\)), where \(\pi_a\) is the topic shift tendency score of speaker a. The basic idea here is that not all topic shifts should be counted equally: a successful topic shift by a speaker with a small topic shift tendency score should be weighted more heavily than one by a speaker with a high topic shift tendency score. The influence score of this ranking method is defined as

    $$ \mathcal{I} _{a, c} = (1- \pi_a) \cdot\sum _{t \in[1, T_c] : a _{c,t} = a} \bar{l} _{c,t} $$
    (12)
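As a minimal sketch of how the two topic-change-based scores above could be computed from SITS posterior summaries; the variable names and input layout are assumptions made for illustration, not the authors' actual implementation.

```python
def topic_change_scores(turns, lbar, pi):
    """Topic-change-based influence scores (Eqs. 11 and 12).

    turns : list of speakers, one per turn, for one conversation
    lbar  : list of posterior topic-shift probabilities, i.e. the
            Gibbs-sample average of the shift indicator l_{c,t} per turn
    pi    : dict mapping each speaker to their topic shift tendency pi_a
    These input structures are illustrative assumptions about how the
    posterior summaries might be stored.
    """
    total, weighted = {}, {}
    for speaker, shift_prob in zip(turns, lbar):
        # Eq. (11): sum of expected topic shifts over the speaker's turns
        total[speaker] = total.get(speaker, 0.0) + shift_prob
    for speaker, score in total.items():
        # Eq. (12): down-weight speakers who habitually shift topics
        weighted[speaker] = (1.0 - pi[speaker]) * score
    return total, weighted
```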

7.3 Experimental setup

Datasets

In this experiment, we use two datasets annotated for influencers: Crossfire and Wikipedia discussion pages. These two datasets and the annotation procedures are described in detail in Sect. 4. Table 8 shows dataset statistics.

Table 8 Statistics of the two datasets, Crossfire and Wikipedia discussions, in which we annotated influencers. We use these two datasets to evaluate SITS on influencer detection

Parameter settings and implementation

As before, we use Gibbs sampling with 10 randomly initialized chains for inference. Initial hyperparameter values are sampled from U(0,1) and statistics are collected after 200 burn-in iterations with a lag of 20 iterations over a total of 1000 iterations. Slice sampling optimizes the hyperparameters.
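This collection schedule amounts to keeping every 20th state after the first 200 iterations of each of the 10 chains. The sketch below shows that bookkeeping only; the per-iteration sampler call is a placeholder and not the actual SITS update.

```python
BURN_IN, LAG, TOTAL = 200, 20, 1000  # iterations, as in our setup

def collect_samples(run_one_iteration, state):
    """Keep every LAG-th sample after BURN_IN iterations of one chain.

    `run_one_iteration` is a stand-in for a full Gibbs sweep that updates
    and returns the latent state; the real SITS sampler is not shown here.
    """
    kept = []
    for it in range(1, TOTAL + 1):
        state = run_one_iteration(state)
        if it > BURN_IN and (it - BURN_IN) % LAG == 0:
            kept.append(state)
    return kept  # (1000 - 200) / 20 = 40 retained samples per chain
```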

Evaluation measurements

To evaluate the effectiveness of each ranking method in detecting the influencers, we use three standard evaluation measurements. The first measurement is \(F_1\), the harmonic mean of precision and recall,

$$ F_1 = \frac{2 \cdot\mathrm{Precision} \cdot\mathrm {Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$
(13)

Even though \(F_1\) is widely used, an important disadvantage is that it only examines the subset of top instances with the highest scores, which might be the “easiest” cases. This can lead to biased results when comparing the performance of different ranking methods. To overcome this problem, we also use auc-roc and auc-pr, which measure the area under the Receiver-Operating-Characteristic (roc) curve and the Precision-Recall (pr) curve, respectively. These two measurements let us compare ranking methods using the full ranked lists. Davis and Goadrich (2006) point out that the pr curve is more appropriate than the roc curve for skewed datasets.
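For reference, all three measurements can be computed directly from a ranked list; the snippet below is a sketch using scikit-learn, assuming binary influencer labels and real-valued influence scores. The choice of cutoff for \(F_1\) (here, the top k speakers where k is the number of annotated influencers) is an illustrative assumption, not necessarily the thresholding used in our experiments.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

def evaluate_ranking(scores, labels):
    """Evaluate one ranking method on one set of speakers.

    scores : influence scores I_{a,c} produced by a ranking method
    labels : 1 if the speaker was annotated as an influencer, else 0
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    # F1 requires a cutoff: predict the top-k ranked speakers as
    # influencers, with k set to the number of annotated influencers.
    k = int(labels.sum())
    preds = np.zeros_like(labels)
    preds[np.argsort(-scores)[:k]] = 1
    return {
        "F1": f1_score(labels, preds),
        "AUC-ROC": roc_auc_score(labels, scores),
        # average precision is a standard summary of the PR curve
        "AUC-PR": average_precision_score(labels, scores),
    }
```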

7.4 Results and analysis

Table 9 shows the results of the four ranking methods on the Crossfire and Wikipedia discussion datasets. Since we run our Gibbs samplers multiple times, the results of the two topic-change-based methods are reported with standard deviations (across chains).

Table 9 Influencer detection results on Crossfire and Wikipedia discussion pages. For both datasets, topic-change-based methods (⋆) outperform structure-based methods (⋄) by large margins. For all evaluation measurements, higher is better

For both datasets, the two topic-change-based methods outperform the two structure-based methods by a large margin for all three evaluation measurements. The standard deviations in all three measurements of the two topic-change-based methods are relatively small. This shows the effectiveness of features based on topic changes in detecting influencers in conversations. In addition, the weighted topic shifts ranking method generally performs better than the total topic shifts method. This provides strong evidence that SITS is capable of capturing the speakers’ propensity to change the topic. The improvement (if any) in the performance of the weighted topic shifts ranking method over the total topic shifts method is more obvious in the Crossfire dataset than in Wikipedia discussions. We argue that this is because conversations in Wikipedia discussion pages are generally shorter and contain more speakers than those in Crossfire debates. This leaves less evidence about the topic change behavior of the speakers in Wikipedia and thus SITS struggles to capture the speakers’ behavior.

8 Conclusions and future work

SITS is a nonparametric hierarchical Bayesian model that jointly captures topics, topic shifts, and individuals’ tendency to control the topic in conversations. SITS takes a nonparametric topic modeling approach, representing each turn in a conversation as a distribution over topics and consecutive turns’ topic distributions as dependent on each other.

Crucially, SITS also models speaker-specific properties. As such, it improves performance on practical tasks such as unsupervised segmentation, but it also is attractive philosophically. Accurately modeling individuals is part of a broader research agenda that seeks to understand individuals’ values (Fleischmann et al. 2011), interpersonal relationships (Chang et al. 2009a), and perspective (Hardisty et al. 2010), which creates a better understanding of what people think based on what they write or say (Pang and Lee 2008). One particularly interesting direction is to extend the model to capture how language is coordinated during the conversation and how it correlates with influence (Giles et al. 1991; Danescu-Niculescu-Mizil et al. 2012).

The problem of finding influencers in conversation has been studied for decades by researchers in communication, sociology, and psychology, who have long acknowledged qualitatively the correlation between a participant’s ability to control the conversational topic and his or her influence on other participants during the conversation. With SITS, we introduce a computational technique for modeling more formally who is controlling the conversation. Empirical results on the two datasets we annotated (the Crossfire TV show and Wikipedia discussion pages) show that methods based on SITS outperform previous methods that use conversational structure patterns to detect influencers.

Using an unsupervised statistical model for detecting influencers is an appealing choice because it extends easily to other languages and to corpora that are multilingual (Mimno et al. 2009; Boyd-Graber and Blei 2009). Moreover, topic models offer opportunities for exploring large corpora (Zhai et al. 2012) in a wide range of domains including political science (Grimmer 2009), music (Hoffman et al. 2009), programming source code (Andrzejewski et al. 2007) or even household archaeology (Mimno 2011). Recent work has created frameworks for interacting with statistical models (Hu et al. 2011) to improve the quality of the latent space (Chang et al. 2009b), understand relationships with other variables (Gardner et al. 2010), and allow the model to take advantage of expert knowledge (Andrzejewski et al. 2009) or knowledge resources (Boyd-Graber et al. 2007).

This work opens several future directions. First, even though associating each speaker with a scalar that models their tendency to change the topic does improve performance on both topic segmentation and influencer detection tasks, it is obviously an impoverished representation of an individual’s conversational behaviors and could be enriched. For example, instead of just using a fixed parameter π for each conversational participant, one could extend the model to capture evolving topic shift tendencies of participants during the conversation. Modeling individuals’ perspective (Paul and Girju 2010), “side” (Thomas et al. 2006), or personal preferences for topics (Grimmer 2009) would also enrich the model and better illuminate the interaction of influence and topic.

Another important future direction is to extend the model to capture more explicitly the distinction between agenda setting and interaction influence. For example, questions or comments from the moderators during a political debate just shape the agenda of the debate and have little influence over how candidates would respond. Agenda setting does not have a direct effect on the views or opinions of others, and it does not try to sway the attitudes and beliefs of others. Agenda setting focuses generally on the topics that will be addressed, determining what those topics will be from the outset (McCombs and Reynolds 2009). It is during an interaction that an influencer is able to shape the discussion by shifting the interaction from one topic to another or providing evidence or expertise that can shape the opinions and judgments about the topics. To be identified as an influencer, however, others in the interaction must acknowledge or recognize the value of the expertise or agree with the opinion and viewpoints that have been offered. Thus, adding modules to find topic expertise (Marin et al. 2010) or agreement/disagreement (Galley et al. 2004) during the conversation would enable SITS to better detect influencers.

Understanding how individuals use language to influence others goes beyond conversational turn taking and topic control, however. In addition to what is said, often how something is expressed—i.e., the syntax—is nearly as important (Greene and Resnik 2009; Sayeed et al. 2012). Combining SITS with a model that can discover syntactic patterns (Sayeed et al. 2012) or multi-word expressions (Johnson 2010) associated with those attempting to influence a conversation would allow us to better understand how individuals use word choice and rhetorical strategies to persuade (Cialdini 2000; Anand et al. 2011) or coordinate with (Danescu-Niculescu-Mizil et al. 2012) others. Such systems could have a significant social impact, as they could identify, quantify, and measure attempts to spin or influence at a large scale. Models for automatic analysis of influence could lead to more transparent public conversations, ultimately improving our ability to achieve more considered and rational discussion of important topics, particularly in the political sphere.