Open Access Article

Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy

1 National Research University Higher School of Economics, Soyuza Pechatnikov Street 16, 190121 St Petersburg, Russia
2 Institute for Web Science and Technologies, Universität Koblenz-Landau, Universitätsstrasse 1, 56070 Koblenz, Germany
3 Institute for Parallel and Distributed Systems (IPVS), Universität Stuttgart, Universitätsstraße 32, 70569 Stuttgart, Germany
4 Web and Internet Science Research Group, University of Southampton, University Road, Southampton SO17 1BJ, UK
* Author to whom correspondence should be addressed.
Entropy 2020, 22(4), 394; https://doi.org/10.3390/e22040394
Received: 5 March 2020 / Revised: 24 March 2020 / Accepted: 25 March 2020 / Published: 30 March 2020
(This article belongs to the Special Issue Entropy: The Scientific Tool of the 21st Century)
Topic modeling is a popular technique for clustering large collections of text documents. A variety of regularization types are implemented in topic modeling. In this paper, we propose a novel approach for analyzing the influence of different regularization types on the results of topic modeling. Based on Renyi entropy, this approach is inspired by concepts from statistical physics, where an inferred topical structure of a collection can be considered an informational statistical system residing in a non-equilibrium state. By testing our approach on four models, namely Probabilistic Latent Semantic Analysis (pLSA), Additive Regularization of Topic Models (BigARTM), Latent Dirichlet Allocation (LDA) with Gibbs sampling, and LDA with variational inference (VLDA), we first show that the minimum of Renyi entropy coincides with the "true" number of topics, as determined in two labelled collections. At the same time, we find that the Hierarchical Dirichlet Process (HDP) model, a well-known approach for topic number optimization, fails to detect this optimum. Next, we demonstrate that large values of the regularization coefficient in BigARTM shift the entropy minimum significantly away from the optimal topic number, an effect not observed for the hyper-parameters of LDA with Gibbs sampling. We conclude that regularization may introduce unpredictable distortions into topic models, which calls for further research.
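As a rough illustration of the entropy-based model selection summarized above, the following Python sketch fits LDA over a range of topic numbers and selects the one that minimizes a Rényi-type entropy. This is not the authors' released code: the 1/W probability threshold, the density-of-states and free-energy construction with deformation parameter q = 1/T, and the use of scikit-learn's LDA are assumptions based on the statistical-physics framing described in the abstract.

```python
# A minimal sketch of the Renyi-entropy scan over topic numbers.
# The entropy formula below is an assumed reading of the free-energy
# formulation the authors build on, not their published code.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def renyi_entropy(phi):
    """Renyi-type entropy of a word-topic matrix phi of shape (T, W)."""
    T, W = phi.shape
    mask = phi > 1.0 / W                    # keep only "informative" word-topic pairs
    P = phi[mask].sum() / T                 # assumed normalization by topic count
    rho = mask.sum() / (W * T)              # density of states
    E = -np.log(P)                          # internal energy
    S = np.log(rho)                         # Gibbs-Shannon entropy
    F = E - T * S                           # free energy, temperature set to T
    return F / (T - 1)                      # Renyi entropy with q = 1/T

docs = ["..."]                              # placeholder: substitute a real corpus
X = CountVectorizer().fit_transform(docs)

entropies = {}
for T in range(2, 51):                      # scan candidate topic numbers
    lda = LatentDirichletAllocation(n_components=T, random_state=0).fit(X)
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    entropies[T] = renyi_entropy(phi)

best_T = min(entropies, key=entropies.get)  # entropy minimum ~ "true" topic number
```

On a labelled corpus, best_T can then be compared against the known number of classes, mirroring the validation on two labelled collections reported in the abstract.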
Keywords: topic modeling; Renyi entropy; regularization
MDPI and ACS Style

Koltcov, S.; Ignatenko, V.; Boukhers, Z.; Staab, S. Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi Entropy. Entropy 2020, 22, 394.

