The Number of Topics Optimization: Clustering Approach

1 Gazpromneft STC, 75-79 Moika River Emb., 190000 Saint Petersburg, Russia
2 Faculty of Applied Mathematics and Control Processes, Saint Petersburg State University, 7-9 Universitetskaya Emb., 199034 Saint Petersburg, Russia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2019, 1(1), 416-426; https://doi.org/10.3390/make1010025
Received: 8 January 2019 / Revised: 26 January 2019 / Accepted: 29 January 2019 / Published: 30 January 2019
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)
Although topic models have been used to build clusters of documents for more than ten years, choosing the optimal number of topics remains an open problem. The authors reviewed many of the fundamental studies on the subject from recent years. The main difficulty is the lack of a stable metric for the quality of the topics obtained when a topic model is built. The authors analyzed the internal metrics of the topic model (coherence, contrast, and purity) as a means of determining the optimal number of topics and concluded that they are not suitable for this purpose. The authors then analyzed an approach to choosing the optimal number of topics based on the quality of the clusters. For this purpose, they considered the behavior of three cluster validation metrics: the Davies-Bouldin index, the silhouette coefficient, and the Calinski-Harabasz index. The new method proposed in this paper for determining the optimal number of topics is based on the following principles: (1) setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) using dense vector representations (GloVe, FastText, Word2Vec); (3) using the cosine measure as the distance in the cluster metrics, which works better than Euclidean distance on high-dimensional vectors. The methodology was tested on a collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the proposed method makes it possible to assess the optimal number of topics for a topic model built on a small collection of English documents.
Keywords: clustering; additive regularization topic model; validation metrics; Davies Bouldin Index; ARTM
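
As a rough illustration of the third principle in the abstract, the sketch below computes a mean silhouette coefficient with cosine distance and uses it to compare candidate numbers of topics. It is not the authors' implementation. The two-dimensional `docs` vectors stand in for dense document embeddings, the hand-written `candidates` labelings stand in for topic assignments produced by a topic model, and the helper names `cosine_distance` and `silhouette` are all assumptions made for this example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette(vectors, labels, dist=cosine_distance):
    """Mean silhouette coefficient; labels[i] is the cluster of vectors[i]."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(vectors[i], vectors[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy "document embeddings" (hypothetical data, three visible directions).
docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (-1.0, 0.1), (-0.9, -0.1)]

# Hypothetical topic assignments for two candidate numbers of topics.
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}

# Pick the number of topics whose clustering has the highest silhouette.
best_k = max(candidates, key=lambda k: silhouette(docs, candidates[k]))
```

On this toy data the three-cluster labeling scores higher than the two-cluster one, so `best_k` is 3. The Davies-Bouldin and Calinski-Harabasz indices named in the abstract could be swapped in using the same loop over candidates, keeping in mind that a lower Davies-Bouldin value indicates better clusters.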
MDPI and ACS Style

Krasnov, F.; Sen, A. The Number of Topics Optimization: Clustering Approach. Mach. Learn. Knowl. Extr. 2019, 1, 416-426.
