Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data

2022 | Journal article. A publication affiliated with the University of Göttingen.


Cite this publication

Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data
Weisser, C.; Gerloff, C.; Thielmann, A.; Python, A.; Reuter, A.; Kneib, T. & Säfken, B. (2022)
Computational Statistics. DOI: https://doi.org/10.1007/s00180-022-01246-z

Documents & Media

document.pdf (3.99 MB, Adobe PDF)

License

Published Version

Creative Commons Attribution 4.0 International (CC BY 4.0)

Details

Authors
Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin
Abstract
Abstract Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
Issue Date
2022
Journal
Computational Statistics 
Organization
Campus-Institut Data Science 
ISSN
0943-4062
eISSN
1613-9658
Language
English
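
As a minimal sketch of the pseudo-document simulation idea summarised in the abstract (not the authors' exact procedure; the vocabulary, topic-word probabilities, number of documents and document length below are invented for illustration), short pseudo-documents can be generated from known topic-word distributions, with each document drawn from a single latent topic as assumed by Dirichlet-multinomial mixture models. Re-fitting LDA, GSDMM or GPM on such pseudo-documents, whose true topic labels are known, then allows checking how well each model recovers the generating topics.

```python
# Minimal sketch of simulating pseudo-documents from known topic-word
# distributions (hypothetical values, not taken from the paper).
import numpy as np

rng = np.random.default_rng(42)

vocab = ["covid", "vaccine", "lockdown", "mask", "variant",
         "election", "vote", "poll", "party", "debate"]

# Hypothetical topic-word distributions phi (each row sums to 1); in practice
# these would come from a fitted LDA, GSDMM or GPM model.
phi = np.array([
    [0.30, 0.25, 0.20, 0.15, 0.08, 0.005, 0.005, 0.005, 0.003, 0.002],
    [0.005, 0.005, 0.005, 0.003, 0.002, 0.30, 0.25, 0.20, 0.15, 0.08],
])

def simulate_pseudo_documents(phi, n_docs=100, doc_len=10):
    """Draw one latent topic per document, then sample words from that topic."""
    docs, labels = [], []
    for _ in range(n_docs):
        z = rng.integers(phi.shape[0])            # single topic per short document
        word_ids = rng.choice(len(vocab), size=doc_len, p=phi[z])
        docs.append([vocab[w] for w in word_ids])
        labels.append(z)
    return docs, labels

pseudo_docs, true_topics = simulate_pseudo_documents(phi)
# The pseudo-documents (with known labels in `true_topics`) can now be fitted
# by each candidate topic model to compare how well the topics are recovered.
```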
