Evaluation of clustering and topic modeling methods over health-related tweets and emails

https://doi.org/10.1016/j.artmed.2021.102096

Highlights

  • Evaluation of topic modeling and clustering on health-related tweets and emails.

  • Topic modeling: LSI, LDA, BTM, GibbsLDA, Online LDA, Online Twitter LDA, and GSDMM.

  • Clustering: k-means with two feature representations (TF-IDF and Doc2Vec).

  • The evaluation is based on two internal and five external cluster validity indices.

Abstract

Background

The Internet provides different tools for communicating with patients, such as social media (e.g., Twitter) and email platforms. These platforms provide new data sources that shed light on patient experiences with health care and improve our understanding of patient-provider communication. Several existing topic modeling and document clustering methods have been adapted to analyze these new free-text data automatically. However, both tweets and emails are often composed of short texts, and existing topic modeling and clustering approaches have suboptimal performance on these short texts. Moreover, research over health-related short texts using these methods has become difficult to reproduce and benchmark, partially due to the absence of a detailed comparison of state-of-the-art topic modeling and clustering methods on these short texts.

Methods

We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., assessing the goodness of a clustering structure without external information) and five external indices (i.e., comparing the results of a cluster analysis to externally provided class labels).
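As a rough illustration of what an external validity index measures, the sketch below compares a cluster assignment against known class labels. The five external indices used in the study are not named in this section, so purity and the Rand index are illustrative assumptions here, and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for truth, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, []).append(truth)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the two partitions agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred  # True counts as 1
        total += 1
    return agree / total


# Hypothetical toy data: six documents, two known classes, two clusters.
labels_true = ["hpv", "hpv", "hpv", "hpv", "lynch", "lynch"]
labels_pred = [0, 0, 0, 1, 1, 1]

purity_score = purity(labels_true, labels_pred)
rand_score = rand_index(labels_true, labels_pred)
print(purity_score, rand_score)
```

Both scores lie in [0, 1], and higher values mean the clusters match the external labels more closely. For real datasets, a library implementation would normally be preferable to this pairwise loop, which is quadratic in the number of documents.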

Results

Overall, for a number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, for the tweets dataset (N = 286,971; HPV represents 94.6% of tweets and Lynch syndrome represents 5.4%), most of the methods reproduced this initial class distribution for k = 2. However, we found that model performance varies with the source of data and with hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.
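The Hamming loss is the fraction of documents whose predicted label differs from the true label, so lower values are better. Applying it to clusters requires mapping each cluster to a class label first. How the study performed this mapping is not described in this section; the majority-vote mapping below is an assumption, and the toy data is hypothetical.

```python
from collections import Counter


def map_clusters_to_labels(labels_true, labels_pred):
    """Assign each cluster the most frequent true label among its members."""
    members = {}
    for truth, cluster in zip(labels_true, labels_pred):
        members.setdefault(cluster, []).append(truth)
    return {c: Counter(m).most_common(1)[0][0] for c, m in members.items()}


def hamming_loss(labels_true, labels_assigned):
    """Fraction of documents whose assigned label differs from the true label."""
    wrong = sum(t != a for t, a in zip(labels_true, labels_assigned))
    return wrong / len(labels_true)


# Hypothetical toy data: six documents, two known classes, two clusters.
labels_true = ["hpv", "hpv", "hpv", "hpv", "lynch", "lynch"]
labels_pred = [0, 0, 0, 1, 1, 1]

mapping = map_clusters_to_labels(labels_true, labels_pred)
assigned = [mapping[c] for c in labels_pred]
loss = hamming_loss(labels_true, assigned)
print(mapping, loss)
```

In this toy case one of six documents ends up in a cluster dominated by the other class, so it is the only one counted as an error.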

Conclusions

Researchers hoping to group or classify health-related short-text data need to select the most suitable topic modeling and clustering methods for their specific research questions. Therefore, we presented a comparison of the most commonly used topic modeling and clustering algorithms over two health-related, short-text datasets using both internal and external clustering validation indices. Internal indices suggested Online Twitter LDA and GSDMM as the best, while external indices suggested LSI and k-means with TF-IDF as the best. In summary, our work suggests that researchers can improve their analysis of model performance by using a variety of metrics, since there is no single best metric.

Introduction

Social networking and microblogging platforms have grown rapidly in the last decade. Social networks such as Twitter enable users to interact with each other and share information on a wide range of topics. Twitter is one of the most popular social media platforms and hosts all types of content, including health-related texts. Twitter enables users to write short messages, called “tweets”, of up to 280 characters (140 characters before September 2017). Tweets are often used to share opinions, feelings, thoughts, and personal activities. With over 500 million tweets posted each day, Twitter has become a very valuable data resource for real-world insights. In the healthcare domain, users have also adopted Twitter to share their personal health status and their experience with care and treatment options with other users who have similar conditions, diseases, and symptoms, and more broadly to share and seek health information of interest to them. This has attracted the attention of clinical and biomedical researchers, whose ultimate goal is to improve patients’ outcomes [130], [40], [154], [139]. Various studies have demonstrated the use of Twitter as a low-cost data source for public health surveillance [138], [108], covering topics such as influenza vaccination [61], mental health [32], [155], human papillomavirus (HPV) vaccination [153], tobacco [99], [31], opioids [83], public mood [104], and suicide [23].

Furthermore, email is becoming popular in health care to establish and improve interactions between patients and healthcare professionals [36]. Emails allow patients to participate more actively in their health care, which can improve the quality and accessibility of health services [20], [18]. Indeed, patient-physician email communication has been addressed in various studies [13], such as for detection of depression [137], rural family health practice [27], multiple sclerosis [51], disease prevention [126], coordination of healthcare appointments [18], communication between healthcare professionals [106], among others. Several of these studies found positive effects in the use of emails such as the improvement of clinic efficiency and cost-effectiveness [39], [48], [27].

As a result, Twitter and emails have created a vast amount of short texts. Several natural language processing (NLP) methods, such as topic modeling and clustering, have been adopted to digest and assess these short texts, allowing us to infer patients’ interests, track new health-related stories, and identify emerging health topics. Clustering seeks to split documents into a certain number of groups based on a similarity metric. Topic modeling seeks to discover latent topics that describe the collection of documents. A topic represents a group of words that frequently occur together. Numerous works have used classic clustering methods (e.g., k-means) on short texts such as tweets [120], [86], [90], [148]. Diverse topic modeling methods have also been proposed to analyze short texts from different fields. Two of the most popular methods are latent Dirichlet allocation (LDA) [22] and latent semantic indexing (LSI) [56]. Various LDA-based techniques have been applied to text from different domains, including biomedicine [67]. Also, several recent approaches have adopted the Dirichlet Mixture Model for short text clustering [149], [150], [75]. Despite the abundance of NLP techniques available in the literature, there are several challenges when analyzing tweets [10]: significant noise and the inconsistent tweeting behaviours of users prevent researchers from leveraging the full potential of the information carried in tweets.
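To make the notion of a similarity metric between short documents concrete, here is a minimal sketch that builds TF-IDF vectors and compares them with cosine similarity. The tokenization (lowercased whitespace split), the particular TF-IDF weighting, and the example texts are all assumptions chosen for brevity and are not taken from the study's pipeline.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical short texts standing in for tweets.
docs = [
    "hpv vaccine appointment today",
    "hpv vaccine side effects",
    "lynch syndrome genetic testing",
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # the two texts share "hpv" and "vaccine"
sim_02 = cosine(vectors[0], vectors[2])  # the two texts share no terms
print(sim_01, sim_02)
```

The example also hints at why short texts are hard: with only a few words per document, most pairs share no terms at all, so their similarity is exactly zero and a centroid-based method such as k-means has little signal to work with.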

Moreover, health research using Twitter and emails is difficult to benchmark because of the lack of comparisons between the various existing applications. As of December 2020, we identified two recent studies that compared several topic modeling and clustering methods on several short text datasets. The first study [117] evaluated nine topic modeling methods based on DMM, global word co-occurrence, and self-aggregation. They found that simpler methods such as GSDMM [149] and BTM [147], [30] were the most suitable with respect to effectiveness and efficiency. The second study [33] evaluated the performance of four classic clustering algorithms (with four different feature representations such as TF-IDF and Doc2Vec) and a topic modeling method (LDA). The experiments showed that the best performance was achieved by k-means with the Doc2Vec representation. However, there are several gaps in these two studies: (1) [117] did not consider LSI or any LDA-based method, (2) [117] did not consider any classic clustering algorithm, (3) [33] considered only LDA as a topic modeling method, (4) both used small datasets (≤30K docs), (5) both used external validity indices only (i.e., comparing the results of a cluster analysis to externally provided class labels), and (6) both used a predefined number of topics for the evaluation, since each dataset was previously annotated.

In this paper, we seek to fill the gaps mentioned above in order to discover how effectively several standard topic modeling and clustering methods perform on health-related tweets and emails. Therefore, we evaluate the performance of several state-of-the-art topic modeling and clustering algorithms (including those suggested in [117], [33]) on short texts from two health-related datasets. The first dataset is composed of tweets (≤290K docs) and the second is composed of emails (50K docs). We treat each individual tweet or email as a single document. We include seven topic modeling approaches: LSI, LDA, GibbsLDA [142], Online LDA [57], BTM [147], [30], Online Twitter LDA [76], and GSDMM; as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We use cluster validity indices to evaluate the performance of topic modeling and clustering: two internal (i.e., assessing the goodness of a clustering structure without external information) and five external validity indices.
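An internal index scores a partition using only the data and the cluster assignments. The Conclusions name the Calinski-Harabasz index as one of the two internal indices, so the sketch below computes it in its standard form: the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The two-dimensional points are hypothetical stand-ins for document vectors, and the function assumes at least two clusters and non-zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)).

    B is the between-cluster dispersion and W the within-cluster dispersion.
    Assumes at least two clusters and W > 0.
    """
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical 2-D vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])  # clusters follow the groups
bad = calinski_harabasz(points, [0, 1, 0, 1])   # clusters cut across the groups
print(good, bad)
```

Higher values indicate compact, well-separated clusters, which is why the partition that follows the two groups scores far above the one that cuts across them.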

The remainder of the paper is organized as follows. We will review the literature in the “Related work” section. We will explain our approach in the “Methods” section. The results of the experiments and evaluations of the topic modeling and clustering applications will be presented in the “Experiments and results” section. We will discuss the obtained findings in the “Discussion” section. Finally, we will conclude the current work and present future directions in the “Conclusions” section.

Section snippets

Related work

In this section, we review related work from short text clustering, topic modeling, and validity indices.

Methods

This section describes our study for evaluating state-of-the-art topic modeling and clustering methods to automatically extract relevant topics from health-related tweets and emails. Fig. 1 outlines our approach with the basic steps for this evaluation. In this section we describe: (1) the two datasets used: tweets and emails; (2) the applications based on topic modeling and clustering algorithms; and (3) the validity indices used to assess the clusters defined by the algorithms we studied.

Experiments and results

The focus of this study is to compare the performance of the applications cited below using internal and external indices over the tweets and emails datasets. Thus, the next sections present the results obtained for k={2,5,10,50}.

Tweets dataset

Learning from unbalanced datasets is a popular area of study for supervised algorithms. However, skewed distributions also affect the learning process in unsupervised methods, especially in clustering methods [100] that are based on centroids [140], [73]. Despite the many proposed solutions, effectiveness is reduced when the groups have highly different sizes [74]. However, most of the models we used proved capable of creating one group of tweets larger than the other, which reflected the unbalanced nature of the

Conclusions

In this paper, we conducted a detailed comparison of different topic modeling techniques and a document clustering method on short texts from two health-related datasets, the first composed of tweets and the second of emails. We set up LSI, LDA, GibbsLDA, Online LDA, BTM, Online Twitter LDA, GSDMM, and k-means based on TF-IDF and Doc2Vec document vectorizations. We evaluated our models with two internal indices and five external indices. The two internal indices included Calinski-Harabasz index

Conflict of interest

The authors declare that they have no conflict of interest.

Authors' contributions

JALV conceived and designed the study. JALV, HAS, JM, and SG collected the data, set up the applications, and performed the evaluation. JALV, HAS, JM, and SG wrote the initial draft and revised subsequent versions. JB and THB provided relevant feedback. JB and THB, senior investigators, led the research project. All authors read, revised and approved the final manuscript.

Acknowledgements

A portion of the research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA183962. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References (157)

  • Biterm

    (2019)
  • C.C. Aggarwal et al.

    Doc2Vec

    (2019)
  • GibbsLDA

    (2019)
  • K-means

    (2019)
  • LDA

    (2019)
  • LSI

    (2019)
  • Online LDA

    (2019)
  • Online Twitter LDA

    (2019)
  • TF-IDF

    (2019)
  • C.C. Aggarwal et al.

    A survey of text clustering algorithms

    Mining text data

    (2012)
  • E. Amigó et al.

    A comparison of extrinsic clustering evaluation metrics based on formal constraints

    Inf Retr

    (2009)
  • M.J. Anderson

    A new method for non-parametric multivariate analysis of variance

    Austral Ecol

    (2001)
  • J. Antoun

    Electronic mail communication between physicians and patients: a review of challenges and opportunities

    Family Pract

    (2016)
  • C. Arnold et al.

    A topic model of clinical reports

    Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR’12

    (2012)
  • C.W. Arnold et al.

    Clinical case-based retrieval using latent topic analysis

    AMIA annual symposium proceedings, vol. 2010

    (2010)
  • T. Aso et al.

    Predicting protein-protein relationships from literature using latent topics

    Genome informatics 2009: genome informatics series vol. 23

    (2009)
  • H. Atherton et al.

    Email for clinical communication between patients/caregivers and healthcare professionals

    Cochrane Database Syst Rev

    (2012)
  • S. Banerjee et al.

    Clustering short texts using Wikipedia

    Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’07

    (2007)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    J Mach Learn Res

    (2003)
  • S.R. Braithwaite et al.

    Validating machine learning algorithms for Twitter data against established measures of suicidality

    JMIR Mental Health

    (2016)
  • D. Cai et al.

    Modeling hidden topics on document manifold

    Proceedings of the 17th ACM conference on information and knowledge management, CIKM’08

    (2008)
  • T. Caliński et al.

    A dendrite method for cluster analysis

    Commun Stat-Theory Methods

    (1974)
  • A.E. Cano et al.

    Harnessing linked knowledge sources for topic classification in social media

    Proceedings of the 24th ACM conference on hypertext and social media, HT’13

    (2013)
  • F. Chang et al.

    Patient, staff, and clinician perspectives on implementing electronic communications in an interdisciplinary rural family health practice

    Prim Health Care Res Dev

    (2017)
  • J.H. Chen et al.

    Predicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets

    J Am Med Inform Assoc

    (2017)
  • W.-Y. Chen et al.

    Parallel spectral clustering in distributed systems

    IEEE Trans Pattern Analysis Mach Intell

    (2010)
  • X. Cheng et al.

    BTM: topic modeling over short texts

    IEEE Trans Knowl Data Eng

    (2014)
  • K.-H. Chu et al.

    Diffusion of messages from an electronic cigarette brand to potential users through Twitter

    PLOS ONE

    (2015)
  • G. Coppersmith et al.

    Measuring post traumatic stress disorder in Twitter

  • A.M. Dai et al.

    Document embedding with paragraph vectors

    (2015)
  • Z. Dai et al.

    Crest: cluster-based representation enrichment for short text classification

    Pacific-Asia conference on knowledge discovery and data mining

    (2013)
  • J. Dash et al.

    Use of email, cell phone and text message between patients and primary-care physicians: cross-sectional study in a French-speaking part of Switzerland

    BMC Health Serv Res

    (2016)
  • D.L. Davies et al.

    A cluster separation measure

    IEEE Trans Pattern Anal Mach Intell

    (1979)
  • C.C. de Jong et al.

    The effects on health behavior and health outcomes of internet-based asynchronous communication between health providers and patients with a chronic condition: a systematic review

    J Med Internet Res

    (2014)
  • I. De Martino et al.

    Social media for patients: benefits and drawbacks

    Curr Rev Musculoskelet Med

    (2017)
  • S. Deerwester et al.

    Indexing by latent semantic analysis

    J Am Soc Inf Sci

    (1990)
  • E. Dimitriadou et al.

    An examination of indexes for determining the number of clusters in binary data sets

    Psychometrika

    (2002)
  • R.O. Duda et al.
    (1973)
  • M. Farhadloo et al.

    Associations of topics of discussion on Twitter with survey measures of attitudes, knowledge, and behaviors related to Zika: probabilistic study in the United States

    JMIR Public Health Surveill

    (2018)
  • S. Fodeh et al.

    On ontology-driven document clustering using core semantic features

    Knowl Inf Syst

    (2011)