How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models
Volume 2, Issue 2 (2020)
  • E-ISSN: 2665-9085

Abstract


Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. In order to circumvent this problem, two techniques have been suggested: (1) to model random document samples, and (2) to prune the vocabulary of the corpus. Although frequently applied, there has been no systematic inquiry into how the application of these techniques affects the respective models. Using three empirical corpora with different characteristics (news articles, websites, and Tweets), we systematically investigated how different sample sizes and pruning affect the resulting topic models in comparison to models of the full corpora. Our inquiry provides evidence that both techniques are viable tools that will likely not impair the resulting model. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (> 10,000 documents). Moreover, extensive pruning does not compromise the quality of the resultant topics.
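The two techniques studied here can be sketched in a few lines of code. The snippet below is a minimal illustration, not the article's actual pipeline: the function names, thresholds, and toy corpus are our own assumptions, and a real workflow would feed the pruned documents into an LDA implementation afterwards.

```python
import random
from collections import Counter

def sample_documents(docs, n, seed=42):
    """Technique 1: model a random sample of documents instead of the full corpus."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(docs, min(n, len(docs)))

def prune_vocabulary(docs, min_df=2, max_df_ratio=0.9):
    """Technique 2: prune the vocabulary before modeling.

    Drops terms that occur in fewer than `min_df` documents (rare terms)
    or in more than `max_df_ratio` of all documents (near-stopwords).
    Thresholds here are illustrative assumptions.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # document frequency, not term frequency
    n_docs = len(docs)
    keep = {t for t, c in df.items() if c >= min_df and c / n_docs <= max_df_ratio}
    return [" ".join(t for t in doc.split() if t in keep) for doc in docs]

# Toy corpus (hypothetical, for demonstration only)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a rare unicorn appeared",
]
sample = sample_documents(corpus, 3)
pruned = prune_vocabulary(corpus, min_df=2, max_df_ratio=0.9)
```

Here "mat", "log", and all terms of the fourth document are pruned because each appears in only one document; either the sampled or the pruned corpus would then be passed to the topic model.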

DOI: 10.5117/CCR2020.2.001.MAIE
Published: 2020-10-01


  • Article Type: Research Article
Keyword(s): latent Dirichlet allocation; model selection; preprocessing; text analysis; topic model