How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models
Volume 2, Issue 2 (2020)
  • E-ISSN: 2665-9085

Abstract


Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although frequently applied, there has been no systematic inquiry into how the application of these techniques affects the respective models. Using three empirical corpora with different characteristics (news articles, websites, and Tweets), we systematically investigated how different sample sizes and pruning affect the resulting topic models in comparison to models of the full corpora. Our inquiry provides evidence that both techniques are viable tools that will likely not impair the resulting model. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (> 10,000 documents). Moreover, extensive pruning does not compromise the quality of the resultant topics.
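The two techniques studied here can be sketched in a few lines of code. The snippet below is a minimal illustration, not the article's actual pipeline: the function names, thresholds, and toy corpus are our own assumptions, and a real workflow would feed the pruned documents into an LDA implementation afterwards.

```python
import random
from collections import Counter

def sample_documents(docs, n, seed=42):
    """Technique 1: model a random sample of documents instead of the full corpus."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(docs, min(n, len(docs)))

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Technique 2: prune the vocabulary before modeling.

    Drops terms that occur in fewer than `min_df` documents (rare terms)
    or in more than `max_df_ratio` of all documents (near-stopwords).
    Thresholds here are illustrative assumptions.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # document frequency, not term frequency
    n_docs = len(docs)
    keep = {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}
    return [" ".join(t for t in doc.split() if t in keep) for doc in docs]

# Toy corpus (hypothetical, for demonstration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a rare unicorn appeared",
]
sample = sample_documents(corpus, 3)
pruned = prune_vocabulary(corpus, min_df=2, max_df_ratio=0.9)
```

Here "mat", "log", and all terms of the fourth document are pruned because each appears in only one document; either the sampled or the pruned corpus would then be passed to the topic model.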

DOI: 10.5117/CCR2020.2.001.MAIE
Published: 2020-10-01


  • Article Type: Research Article
Keyword(s): latent Dirichlet allocation; model selection; preprocessing; text analysis; topic model