Evaluation of clustering and topic modeling methods over health-related tweets and emails
Introduction
The number of social networking and microblogging platforms has grown rapidly over the last decade. Social networks such as Twitter enable users to interact with each other and share information on a wide range of topics. Twitter is one of the most popular social media platforms and hosts all types of content, including health-related texts. It lets users write short messages, called “tweets”, of up to 280 characters (140 characters before September 2017). Tweets are often used to share opinions, feelings, thoughts, and personal activities. With over 500 million tweets posted each day, Twitter has become a valuable data resource for real-world insights. In the healthcare domain, users also turn to Twitter to share their personal health status and their experience with care and treatment options with others who have similar conditions, diseases, and symptoms, and more broadly to share and seek health information of interest. This has attracted the attention of clinical and biomedical researchers whose ultimate goal is to improve patients’ outcomes [130], [40], [154], [139]. Various studies have demonstrated the use of Twitter as a low-cost data source for public health surveillance [138], [108], covering topics such as influenza vaccination [61], mental health [32], [155], human papillomavirus (HPV) vaccination [153], tobacco [99], [31], opioids [83], public mood [104], and suicide [23].
Furthermore, email is becoming popular in health care as a way to establish and improve interactions between patients and healthcare professionals [36]. Emails allow patients to participate more actively in their health care, which can improve the quality and accessibility of health services [20], [18]. Indeed, patient-physician email communication has been addressed in various studies [13], in contexts such as the detection of depression [137], rural family health practice [27], multiple sclerosis [51], disease prevention [126], coordination of healthcare appointments [18], and communication between healthcare professionals [106], among others. Several of these studies found positive effects from the use of emails, such as improved clinic efficiency and cost-effectiveness [39], [48], [27].
As a result, Twitter and emails have generated a vast amount of short texts. Several natural language processing (NLP) methods, such as topic modeling and clustering, have been adopted to digest and assess these short texts, allowing us to infer patients’ interests, track new health-related stories, and identify emerging health topics. Clustering seeks to split documents into a certain number of groups based on a similarity metric. Topic modeling seeks to discover latent topics that describe the collection of documents, where a topic represents a group of words that frequently occur together. Numerous works have applied classic clustering methods (e.g., k-means) to short texts such as tweets [120], [86], [90], [148]. Diverse topic modeling methods have also been proposed to analyze short texts from different fields. Two of the most popular are latent Dirichlet allocation (LDA) [22] and latent semantic indexing (LSI) [56]. Various LDA-based techniques have been applied to text from different domains, including biomedicine [67]. Several recent approaches have also adopted the Dirichlet Mixture Model for short text clustering [149], [150], [75]. Despite the abundance of NLP techniques available in the literature, analyzing tweets poses several challenges [10]: significant noise and users’ inconsistent tweeting behaviour prevent researchers from leveraging the full potential of the information carried in tweets.
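To make the notion of a similarity metric concrete, the following minimal sketch computes TF-IDF vectors for three short texts and compares them with cosine similarity. It is an illustration only, not the pipeline used in this study: the example texts are invented, the whitespace tokenizer is a simplification, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented example texts, not taken from the study's datasets.
docs = [
    "flu vaccine appointment today",
    "got my flu vaccine today",
    "quit smoking tobacco this year",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two texts about flu vaccination
sim_02 = cosine(vecs[0], vecs[2])  # texts with no shared vocabulary
print(sim_01, sim_02)
```

A clustering algorithm such as k-means would group the first two texts together because their vectors are close, while the third text shares no terms with them and has zero similarity.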
Moreover, health research using Twitter and emails is difficult to assess because the various existing applications have rarely been compared with one another. As of December 2020, we identified two recent studies that compared several topic modeling and clustering methods on several short text datasets. The first study [117] evaluated nine topic modeling methods based on the Dirichlet Mixture Model (DMM), global word co-occurrence, and self-aggregation. They found that simpler methods such as GSDMM [149] and BTM [147], [30] were the most suitable with respect to effectiveness and efficiency. The second study [33] evaluated the performance of four classic clustering algorithms (with four different feature representations, such as TF-IDF and Doc2Vec) and one topic modeling method (LDA). The experiments showed that the best performance was achieved by k-means with the Doc2Vec representation. However, these two studies leave several gaps: (1) [117] did not consider LSI or any LDA-based method, (2) [117] did not consider any classic clustering algorithm, (3) [33] considered only LDA as a topic modeling method, (4) both used small datasets (≤30K docs), (5) both used external validity indices only (i.e., indices that compare the results of a cluster analysis against externally provided class labels), and (6) both used a predefined number of topics for the evaluation, since each dataset had been previously annotated.
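As an illustration of what an external validity index measures, the sketch below computes two widely used ones, purity and the Rand index, for a toy clustering against known class labels. These two indices are chosen for simplicity and are not necessarily among those used in the studies discussed here; the labels and cluster assignments are invented.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        clusters.setdefault(cluster, []).append(true_label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(labels_true)


def rand_index(labels_true, labels_pred):
    """Fraction of document pairs on which the clustering and the labels agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Invented annotations (gold) and cluster assignments (pred) for six documents.
gold = ["flu", "flu", "flu", "tobacco", "tobacco", "opioid"]
pred = [0, 0, 1, 1, 1, 2]

print(purity(gold, pred))      # 5 of 6 documents match their cluster's majority class
print(rand_index(gold, pred))  # 11 of 15 pairs are treated consistently
```

Both functions need the annotated labels, which is exactly why external indices cannot be applied to unannotated collections.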
In this paper, we seek to fill the gaps mentioned above in order to discover how effectively several standard topic modeling and clustering methods perform on health-related tweets and emails. To that end, we evaluate the performance of several state-of-the-art topic modeling and clustering algorithms (including those suggested in [117], [33]) on short texts from two health-related datasets. The first dataset is composed of tweets (≤290K docs) and the second of emails (50K docs). We treat each individual tweet or email as a single document. We include seven topic modeling approaches: LSI, LDA, GibbsLDA [142], Online LDA [57], BTM [147], [30], Online Twitter LDA [76], and GSDMM; as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We use cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e., indices that assess the goodness of a clustering structure without external information) and five external validity indices.
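The Conclusions name the Calinski-Harabasz index as one of the internal indices. A minimal sketch of its standard definition follows: the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom. The 2-D points and the two labelings are invented stand-ins for document vectors, and this is not presented as the implementation used in the study.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k)); higher is better."""
    n = len(points)
    dim = len(points[0])
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n clusters")

    def mean(rows):
        return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # B: dispersion of cluster centroids around the global centroid
    within = 0.0   # W: dispersion of points around their own cluster centroid
    for members in groups.values():
        centroid = mean(members)
        between += len(members) * squared_distance(centroid, overall)
        within += sum(squared_distance(p, centroid) for p in members)
    if within == 0.0:
        return float("inf")  # every cluster is a single repeated point
    return (between / (k - 1)) / (within / (n - k))


# Invented 2-D points forming two compact, well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ch_good = calinski_harabasz(points, [0, 0, 0, 1, 1, 1])  # matches the groups
ch_bad = calinski_harabasz(points, [0, 1, 0, 1, 0, 1])   # mixes the groups
print(ch_good, ch_bad)
```

No class labels are involved: the index relies only on the geometry of the vectors and the cluster assignment, which is what makes it usable on unannotated data.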
The remainder of the paper is organized as follows. We will review the literature in the “Related work” section. We will explain our approach in the “Methods” section. The results of the experiments and evaluations of the topic modeling and clustering applications will be presented in the “Experiments and results” section. We will discuss the obtained findings in the “Discussion” section. Finally, we will conclude the current work and present future directions in the “Conclusions” section.
Related work
In this section, we review related work on short text clustering, topic modeling, and validity indices.
Methods
This section describes our study for evaluating state-of-the-art topic modeling and clustering methods to automatically extract relevant topics from health-related tweets and emails. Fig. 1 outlines our approach with the basic steps for this evaluation. In this section we describe: (1) the two datasets used: tweets and emails; (2) the applications based on topic modeling and clustering algorithms; and (3) the validity indices used to assess the clusters defined by the algorithms we studied.
Experiments and results
The focus of this study is to compare the performance of the applications cited below using internal and external indices over the tweets and emails datasets. Thus, the next sections present the results obtained for k={2,5,10,50}.
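A sweep over these values of k can be sketched as follows. This is a hedged illustration, not the experimental setup of the study: it uses a plain Lloyd's k-means written from scratch, synthetic 2-D points in place of document vectors, and the within-cluster sum of squares as a stand-in for the validity indices. For topic models, k would instead be the number of topics.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct data points as initial centroids
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centroids), centroids


def within_cluster_ss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


# Synthetic data (an assumption): four Gaussian blobs of 50 points each.
rng = random.Random(42)
centers = [(0, 0), (8, 0), (0, 8), (8, 8)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(50)]

score_by_k = {}
for k in (2, 5, 10, 50):
    labels, centroids = kmeans(points, k)
    score_by_k[k] = within_cluster_ss(points, labels, centroids)

for k, score in score_by_k.items():
    print(k, round(score, 1))
```

In a real evaluation, the scoring line would be replaced by each internal and external validity index, computed once per method and per value of k.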
Tweets dataset
A popular area of study is supervised learning on unbalanced datasets. However, skewed distributions also affect the learning process in unsupervised methods, especially clustering methods [100] that are based on centroids [140], [73]. Despite the many solutions proposed, effectiveness is reduced when the groups have highly different sizes [74]. Nevertheless, most of the models we used proved capable of creating one group of tweets larger than the others, which reflected the unbalanced nature of the
Conclusions
In this paper, we conducted a detailed comparison of different topic modeling techniques and a document clustering method on short texts from two health-related datasets, the first composed of tweets and the second of emails. We set up LSI, LDA, GibbsLDA, Online LDA, BTM, Online Twitter LDA, GSDMM, and k-means based on TF-IDF and Doc2Vec document vectorizations. We evaluated our models with two internal indices and five external indices. The two internal indices included Calinski-Harabasz index
Conflict of interest
The authors declare that they have no conflict of interest.
Authors' contributions
JALV conceived and designed the study. JALV, HAS, JM, and SG collected the data, set up the applications, and performed the evaluation. JALV, HAS, JM, and SG wrote the initial draft and revised subsequent versions. JB and THB provided relevant feedback. JB and THB, senior investigators, led the research project. All authors read, revised and approved the final manuscript.
Acknowledgements
A portion of the research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA183962. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
References (157)
- et al. An extensive comparative study of cluster validity indices. Pattern Recognit (2013)
- et al. Electronic patient-provider communication: will it offset office visits and telephone consultations in primary care? Int J Med Inform (2005)
- et al. A general framework to expand short text for topic modeling. Inform Sci (2017)
- et al. An evaluation of document clustering and topic modelling in two online social networks: twitter and reddit. Inf Process Manag (2020)
- et al. Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognit Lett (2016)
- et al. A density-based cluster validity approach using multi-representatives. Pattern Recognit Lett (2008)
- et al. Ensemble learning for data stream analysis: a survey. Inf Fus (2017)
- et al. An unsupervised multilingual approach for online social media topic identification. Expert Syst Appl (2017)
- et al. A novel framework for biomedical entity sense induction. J Biomed Inform (2018)
- et al. An unsupervised self-organizing learning with support vector ranking for imbalanced datasets. Expert Syst Appl (2010)
Biterm
Doc2Vec
GibbsLDA
K-means
LDA
LSI
Online LDA
Online Twitter LDA
TF-IDF
- A survey of text clustering algorithms. Mining text data
- A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf Retr
- A new method for non-parametric multivariate analysis of variance. Austral Ecol
- Electronic mail communication between physicians and patients: a review of challenges and opportunities. Family Pract
- A topic model of clinical reports. Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR’12
- Clinical case-based retrieval using latent topic analysis. AMIA annual symposium proceedings, vol. 2010
- Predicting protein-protein relationships from literature using latent topics. Genome informatics 2009: genome informatics series vol. 23
- Email for clinical communication between patients/caregivers and healthcare professionals. Cochrane Database Syst Rev
- Clustering short texts using wikipedia. Proceedings of the 30th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR’07
- Latent dirichlet allocation. J Mach Learn Res
- Validating machine learning algorithms for twitter data against established measures of suicidality. JMIR Mental Health
- Modeling hidden topics on document manifold. Proceedings of the 17th ACM conference on information and knowledge management, CIKM’08
- A dendrite method for cluster analysis. Commun Stat-Theory Methods
- Harnessing linked knowledge sources for topic classification in social media. Proceedings of the 24th ACM conference on hypertext and social media, HT’13
- Patient, staff, and clinician perspectives on implementing electronic communications in an interdisciplinary rural family health practice. Prim Health Care Res Dev
- Predicting inpatient clinical order patterns with probabilistic topic models vs conventional order sets. J Am Med Inform Assoc
- Parallel spectral clustering in distributed systems. IEEE Trans Pattern Anal Mach Intell
- Btm: topic modeling over short texts. IEEE Trans Knowl Data Eng
- Diffusion of messages from an electronic cigarette brand to potential users through twitter. PLOS ONE
- Measuring post traumatic stress disorder in twitter
- Document embedding with paragraph vectors
- Crest: cluster-based representation enrichment for short text classification. Pacific-Asia conference on knowledge discovery and data mining
- Use of email, cell phone and text message between patients and primary-care physicians: cross-sectional study in a french-speaking part of Switzerland. BMC Health Serv Res
- A cluster separation measure. IEEE Trans Pattern Anal Mach Intell
- The effects on health behavior and health outcomes of internet-based asynchronous communication between health providers and patients with a chronic condition: a systematic review. J Med Internet Res
- Social media for patients: benefits and drawbacks. Curr Rev Musculoskelet Med
- Indexing by latent semantic analysis. J Am Soc Inf Sci
- An examination of indexes for determining the number of clusters in binary data sets. Psychometrika
- Associations of topics of discussion on twitter with survey measures of attitudes, knowledge, and behaviors related to zika: probabilistic study in the united states. JMIR Public Health Surveill
- On ontology-driven document clustering using core semantic features. Knowl Inf Syst