Article

The Number of Topics Optimization: Clustering Approach

1 Gazpromneft STC, 75-79 Moika River Emb., 190000 Saint Petersburg, Russia
2 Faculty of Applied Mathematics and Control Processes, Saint Petersburg State University, 7-9 Universitetskaya Emb., 199034 Saint Petersburg, Russia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2019, 1(1), 416-426; https://doi.org/10.3390/make1010025
Submission received: 8 January 2019 / Revised: 26 January 2019 / Accepted: 29 January 2019 / Published: 30 January 2019
(This article belongs to the Special Issue Language Processing and Knowledge Extraction)

Abstract

Although topic models have been used to build clusters of documents for more than ten years, the problem of choosing the optimal number of topics remains. The authors analyzed many fundamental studies undertaken on the subject in recent years. The main problem is the lack of a stable metric of the quality of topics obtained during the construction of the topic model. The authors analyzed the internal metrics of the topic model—coherence, contrast, and purity—as candidates for determining the optimal number of topics and concluded that they are not applicable to this problem. The authors then analyzed an approach to choosing the optimal number of topics based on the quality of the clusters, considering the behavior of the cluster validation metrics: the Davies Bouldin index, the silhouette coefficient, and the Calinski-Harabaz index. The new method for determining the optimal number of topics proposed in this paper is based on the following principles: (1) setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) using a dense vector representation (GloVe, FastText, Word2Vec); (3) using a cosine measure of distance in the cluster metric, which works better than Euclidean distance on high-dimensional vectors. The methodology developed by the authors for obtaining the optimal number of topics was tested on a collection of scientific articles from the OnePetro library, selected by specific themes. The experiment showed that the proposed method makes it possible to assess the optimal number of topics for a topic model built on a small collection of English documents.

1. Introduction

Topic models have been used successfully for clustering texts for many years. One of the most common approaches to topic modeling is Latent Dirichlet Allocation (LDA) [1]. It models a fixed number of topics, which is selected as a parameter, based on Dirichlet distributions for words and documents. The result is a flat, soft probabilistic clustering of terms by topics and of documents by topics. All the topics obtained are equal: they carry no characteristic signs that could help the researcher identify the most useful topics, that is, choose a subset of topics best suited for human interpretation. The problem of finding a metric characterizing such interpretability is a subject of study by many researchers [2,3,4,5].
A topic model cannot read the researcher's mind and therefore must be configurable for the task the researcher intends to solve. According to studies [6,7], topic models based on LDA have the following parameters:
  • α: The parameter of the prior Dirichlet distribution for “documents-topics”;
  • β: The parameter of the prior Dirichlet distribution for “topics-words”;
  • t_n: The number of topics;
  • b: The number of discarded initial iterations (burn-in) in Gibbs sampling;
  • n: The number of samples;
  • s_i: The sampling interval.
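To make concrete where these parameters enter an implementation, the following is a minimal collapsed Gibbs sampler for LDA. This is an illustrative sketch only: the function `lda_gibbs`, its signature, and its defaults are our assumptions, not code from the cited studies.

```python
import random

def lda_gibbs(docs, vocab_size, t_n, alpha=0.1, beta=0.01,
              burn_in=50, n_samples=10, interval=5, seed=0):
    """Collapsed Gibbs sampling for LDA over `docs` (lists of word ids).

    The arguments mirror the hyperparameters listed above: alpha and beta
    are the Dirichlet priors, t_n is the number of topics, burn_in (b) the
    number of discarded initial iterations, n_samples (n) the number of
    samples, and interval (s_i) the sampling interval.
    Returns the averaged "topics-words" matrix phi, shape t_n x vocab_size.
    """
    rng = random.Random(seed)
    # z[d][i]: current topic of word i in document d, plus count matrices
    z = [[rng.randrange(t_n) for _ in doc] for doc in docs]
    ndt = [[0] * t_n for _ in docs]               # document-topic counts
    ntw = [[0] * vocab_size for _ in range(t_n)]  # topic-word counts
    nt = [0] * t_n                                # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndt[d][t] += 1; ntw[t][w] += 1; nt[t] += 1

    phi_acc = [[0.0] * vocab_size for _ in range(t_n)]
    taken = 0
    for it in range(burn_in + n_samples * interval):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d][t] -= 1; ntw[t][w] -= 1; nt[t] -= 1
                # full conditional p(topic | everything else)
                weights = [(ndt[d][k] + alpha) * (ntw[k][w] + beta)
                           / (nt[k] + vocab_size * beta) for k in range(t_n)]
                r = rng.random() * sum(weights)
                acc = 0.0
                for k, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t
                ndt[d][t] += 1; ntw[t][w] += 1; nt[t] += 1
        # after burn-in, accumulate phi once every `interval` iterations
        if it >= burn_in and (it - burn_in) % interval == 0:
            taken += 1
            for k in range(t_n):
                denom = nt[k] + vocab_size * beta
                for w in range(vocab_size):
                    phi_acc[k][w] += (ntw[k][w] + beta) / denom
    return [[v / taken for v in row] for row in phi_acc]
```

The sketch makes visible why b, n, and s_i become extra free parameters on top of α, β, and t_n: they control which Gibbs iterations contribute to the estimate of the "topics-words" matrix.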
In the recent study [7], published in 2018, an attempt was made to find the optimal values of the above parameters using the algorithm of differential evolution [8]. The authors chose a modified Jaccard similarity metric as the cost-function. As a result, a new Latent Dirichlet Allocation Differential Evolution (LDADE) algorithm was created, in which free parameters from the differential evolution algorithm appeared and they also need to be optimized.
There is a difference between evaluating a complete set of topics and evaluating individual topics to filter out unwanted information (noise). To evaluate a complete set of topics, researchers usually look at the perplexity [9] of the corpus of documents.
This approach does not work very well according to the results of studies [10,11] because the perplexity does not have an absolute minimum, and with increasing iterations, it becomes asymptotic [12].
The most common use of perplexity is to detect the “elbow effect”, that is, when the pattern of growth in the orderliness of the model changes drastically. Perplexity depends on the power of the dictionary and the frequency distribution of words in the collection, hence we get its drawbacks:
  • It cannot evaluate the quality of deletion of stop words and non-topic words;
  • It cannot compare methods for rarefying the dictionary;
  • It cannot compare unigram and n-gram models.
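The "elbow effect" mentioned above can be detected programmatically. A common heuristic (our illustration, not a procedure from the cited studies) is to pick the interior point where the discrete second difference of the decreasing perplexity curve is largest, i.e., where the rate of improvement changes most sharply:

```python
def elbow_point(ks, values):
    """Return the k at which the decreasing curve `values` bends most
    sharply, measured by the discrete second difference."""
    best_k, best_curv = ks[1], float("-inf")
    for i in range(1, len(values) - 1):
        curv = values[i - 1] - 2 * values[i] + values[i + 1]
        if curv > best_curv:
            best_curv, best_k = curv, ks[i]
    return best_k

# A synthetic, monotonically decreasing perplexity-like curve
ks = list(range(2, 21, 2))
perp = [1000 / k + 50 for k in ks]
chosen = elbow_point(ks, perp)  # the bend of 1000/k + 50 is at the first interior point
```

Because such a curve has no absolute minimum, the choice of elbow depends on the sampled grid of k values and on noise in the perplexity estimates, which is exactly the weakness discussed above.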
The authors of the LDA made a study of the quality of topics using the Bayesian approach in [13]. It is important to note that the Hierarchical Dirichlet process (HDP) [14] solves the problem of the optimal number of topics for the whole collection, but not for a specific document.
Let us pay attention to the difference between LDA, HDP, and hierarchical Latent Dirichlet Allocation (hLDA) [15,16], since these are different topic models. LDA creates a flat, soft probabilistic clustering of terms by topic and documents by topic. In the HDP model, instead of a fixed number of topics per document, the Dirichlet process generates the number of topics, so that the number of topics is itself a random variable. The “hierarchical” part of the name refers to another level added by the Dirichlet process, which creates several topics, while the topics themselves are still flat clusters. The hLDA model is an adaptation of LDA, which models the topics as the distribution of a new, predetermined number of topics taken from the Dirichlet distribution. The hLDA model still treats the number of topics as a hyperparameter, that is, independent of the data. The difference is that clustering is now hierarchical: the hLDA model studies the clustering of the first set of topics, providing more general abstract relationships between topics (and, therefore, words and documents). Note that all three models described (LDA, HDP, hLDA) add a new set of parameters that require optimization, as noted in the study [17].
One of the main requirements for topic models is human interpretability [18]. In other words, whether the topics contain words that, according to a person’s subjective judgment, are representative of a single coherent concept. In [19], Newman et al. showed that human assessment of interpretability correlates well with an automated quality measure called coherence.
Recent research [20] proposed minimizing the Rényi and Tsallis entropies to find the optimal number of topics in topic modeling. In that study, topic models derived from large collections of texts are considered as non-equilibrium complex systems, where the number of topics is treated as the equivalent of temperature. This allows the free energy of such systems to be calculated—the value through which the Rényi and Tsallis entropies are easily expressed. The entropy-based metrics make it possible to find a minimum as a function of the number of topics for large collections, but in practice one often has to deal with small collections of documents.
A study [21] proposed a matrix approach to improving the accuracy of determining topics without using optimization. On the other hand, the study [22] noted that increasing the accuracy of the model is contrary to human interpretability. In particular, a more recent study [23] created the VisArgue framework designed to visualize the model’s learning process to determine the most explainable topics.
The use of the statistical measure of term frequency (TF) divided by inverse document frequency (IDF) as a metric for quantifying the quality of topics was studied in [24]. There is also a series of studies combining the advantages of topic models and dense representations of word-vectors [25,26,27,28].
The motivation for the research conducted by the authors of this paper was the fact that the search for a stable metric of topic quality continues. Moreover, cluster analysis is one of the tools for analyzing the stability of topics [29] and the optimal number of topics [30], but existing work does not exploit the special training capabilities of a topic model with sequential regularization together with a dense representation of word vectors.
To validate the quality of clusters, many metrics have been developed, among them the partition coefficient [31], the Dunn index [32], the Davies Bouldin Index (DBI) [33] and its modifications [34,35], and the silhouette coefficient [36]. Nevertheless, in the case of a topic model we already have clusters of topics and do not need a clustering algorithm; we need only to evaluate the clusters obtained. For cluster validation it is necessary to consider the clusters in a space with notions of proximity and distance. For words, such a space is a vector representation of words; significant results in this direction were obtained in previous research [37,38,39]. Words presented in the form of dense vectors reflect semantic representation and have the properties of proximity and distance. Therefore, by presenting topics in the form of dense vectors, the authors created a new variation of the Davies Bouldin Index metric for topics, which they called the Cosine Davies Bouldin Index (cDBI).
The remainder of the paper is organized as follows: the proposed methodology and research hypothesis are presented in Section 2; the results of testing the new quality metric are explained in Section 3; we conclude the paper in Section 4.

2. Research Methodology

Consider ways to build a topic model for a specific collection of documents. A collection is homogeneous if it contains documents of the same type. For example, a collection of scientific articles from one conference, created from a single template, is homogeneous. In this case each document has a similar structure, postulated by the conference template: all the articles consist of an introduction, a presentation of the research results, and a conclusion. Thus, it is possible to present a document in the form of a distribution over a main topic and auxiliary topics: introduction and conclusion.
Of course, the main topics in different documents may be different. However, the collection of scientific articles may be limited to the choice of certain headings from the thematic rubrics of the conference. Then the number of topics will be known. Figure 1 shows the matrix distribution of topics on the documents.
As we see on the left side of Figure 1, the topic model spreads the topics homogeneously over the documents. Such a picture of the probabilities in the “topics-documents” matrix can be obtained using, e.g., models based on the LDA algorithm [1]. The right side of Figure 1 shows the result of a model with sequential Additive Regularization of Topic Models (ARTM) [2]. The main and auxiliary topics are separated by controlling the learning process of the model. The principle for classifying a topic as auxiliary can be formulated as the presence of that topic in the overwhelming majority of documents; that is, the probabilities of the auxiliary topics are distributed uniformly and densely across the documents. The main topic, in contrast, is a sparse vector for each document, since each document is characterized by one main topic.
We show that the existing internal metrics of the topic model are not suitable for determining the optimal number of topics. To do this, consider the internal automated metrics of the quality of topics. We introduce the concept of core topics:
W_t = { w ∈ W : p(w | t) ≥ threshold }.
The following quality metrics of the topic model can be calculated based on the topic kernel:
Purity of the topic: Purity = Σ_{w ∈ W_t} p(w | t);
Size of the topic kernel: |W_t|;
Contrast of the topic: Contrast = (1 / |W_t|) Σ_{w ∈ W_t} p(t | w);
Coherence of the topic: Coh_t = (2 / (k(k − 1))) Σ_{i=1}^{k−1} Σ_{j=i+1}^{k} PMI(w_i, w_j), where k is the interval within which the joint use of words is counted, point-wise mutual information PMI(w_i, w_j) = log (N · N_{w_i w_j}) / (N_{w_i} · N_{w_j}), N_{w_i w_j} is the number of documents in which the words w_i and w_j appear within interval k at least once, N_{w_i} is the number of documents in which the word w_i appears at least once, and N is the number of words in the dictionary.
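The kernel-based metrics are simple to compute once the distributions p(w|t) and p(t|w) are available. The following sketch illustrates kernel, purity, and contrast for a single toy topic; the threshold value, word lists, and probabilities are illustrative assumptions, not data from the paper's experiment.

```python
def topic_kernel(p_wt, threshold=0.05):
    """Kernel W_t: words whose in-topic probability p(w|t) reaches the threshold."""
    return {w for w, p in p_wt.items() if p >= threshold}

def purity(p_wt, kernel):
    """Purity: total probability mass p(w|t) carried by the kernel words."""
    return sum(p_wt[w] for w in kernel)

def contrast(p_tw, kernel):
    """Contrast: mean p(t|w) over the kernel words."""
    return sum(p_tw[w] for w in kernel) / len(kernel)

# Toy topic over a 5-word dictionary (illustrative numbers)
p_wt = {"well": 0.4, "pressure": 0.3, "oil": 0.2, "the": 0.06, "of": 0.04}
p_tw = {"well": 0.9, "pressure": 0.8, "oil": 0.7, "the": 0.1, "of": 0.05}
k = topic_kernel(p_wt, threshold=0.1)
# kernel = {"well", "pressure", "oil"}; purity ≈ 0.9; contrast = 0.8
```

Note how the normalization constraint drives the behavior discussed below: with more topics, the same total probability mass is spread over more columns, so each kernel, and with it purity and contrast, shrinks.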
As can be seen from the formulas, each of these internal metrics of the topic model can be measured for a different number of topics (t_n). Consider the behavior of the kernel-size metric depending on the number of topics. With an increase in the number of topics, the kernel size will decrease, since the normalization conditions must be satisfied when constructing the “topics-words” and “documents-topics” matrices: the sum of the probabilities must be equal to one. For the purity and contrast metrics, the changes with an increasing number of topics will also be monotonically decreasing, since the sum of the probabilities of the topics included in the kernel will decrease. On the other hand, the coherence metric will be monotonically increasing with the number of topics, as the contribution from PMI grows. The specific nature of the changes in the examined metrics may vary; therefore it is advisable to try to find the extremum using numerical methods, where possible.
The quality of topics in short messages from the cluster point of view was reviewed in [40] using NMF (non-negative matrix factorization) and metrics reflecting the entropy of clusters. A matrix approach (Latent Semantic Indexing + Singular-Value Decomposition) to the selection of clusters of topics from program code was investigated in [41] with a modified vector proximity metric. The study of topic model quality in [30] uses the silhouette coefficient [36] with Euclidean distance on sparse topic vectors. Consequently, clusters in the space of dense vectors of the words constituting the topics, together with non-Euclidean distances, remain unexplored in these works.
In [12,42,43], the instability of topics with respect to the order of processed documents was discovered and investigated. Therefore, to calculate the quality metrics of the topics, it is necessary to perform calculations for the corpus of documents with a random order to eliminate the dependence on the order of documents. The possibility of stabilizing the topic model with the help of regularization was shown in [44]. Based on the analysis, the authors formulated a methodological framework, depicted as a diagram in Figure 2.
Figure 2 shows the sequence of actions repeated for one corpus of documents a significant number of times, a number comparable to the number of documents in the corpus. On the right, the actions performed only once are displayed: the formation of a dictionary, the adjustment of the regularization parameters of the topic model, and the transformation of the sparse presentation space of topics into a dense representation. Based on this methodological framework, numerical experiments were developed and carried out as described in the next section.

3. Experiment

For the experiment, a corpus of scientific and technical articles on topics related to the development of oil and gas fields was chosen. In total, 1695 articles in English were selected in 10 areas of research according to the rubrics. The creation of a dictionary for the selected corpus is described in detail in the authors’ previous study [45]. To build a topic model, the BigARTM library was used, which allows customization of the topic model by sequential regularization. The choice and adjustment of the regularization parameters of the topic model were made by the authors in the previous study [45]. To transform the sparse space of the word vectors that make up the topics, the GloVe library was chosen [37]. To obtain a visual picture of the dense representation of topics, a projection onto a two-dimensional space with distances preserved was made using Multi-dimensional scaling (MDS) [46]. Figure 3 presents the view of the obtained clusters of topics.
In Figure 3, two-dimensional projections of words from topics are highlighted with different markers. Ovals emphasize precise visual grouping of words in the topics.
Figure 4 presents the preliminary calculations of the main metrics behavior of the topic model, set up in accordance with the methodology proposed by the authors, depending on the number of topics.
As we can see from Figure 4, the dependencies are monotonic and do not allow determination of the optimal number of topics. Measurements of the main internal metrics were made for 1000 different random orders of documents; the plotted deviations correspond to one standard deviation. It is evident that for the kernel contrast metric the deviations are minimal. For the kernel purity and coherence metrics, greater values characterize better quality of the topic model.
A characteristic point can be considered to be 12 topics, where the curves of the purity and coherence metrics intersect. Consider the dependencies of the following metrics used to validate the number of clusters: the Calinski-Harabaz index [47] and the silhouette coefficient [36].
According to Figure 5, the Calinski-Harabaz index and silhouette coefficient metrics do not make it possible to determine the optimal number of topics. As the number of topics increases, the values of these metrics decrease, which means that clusters become worse from the point of view of these metrics. The cDBI metric developed by the authors and shown in Figure 6 behaves differently depending on the number of topics.
In Figure 6 the maximum is clearly expressed when the number of topics is equal to 16. The algorithm for calculating the cDBI metric is based on the idea of the Davies Bouldin index metric proposed in [33] and modified in [34,35].
Algorithm 1: Calculation of Cosine Davies Bouldin Index ( cDBI ) Metrics.
In Algorithm 1 above, T denotes the number of selected topics and μ the regularizing coefficients. Thus, using the cDBI metric, it is possible to find the optimal number of topics for a collection of documents.
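Algorithm 1 appears only as an image in the original publication, so as an orientation the following is a sketch of the classic Davies Bouldin construction computed with cosine distance over dense topic vectors. It is our illustrative reconstruction, not the authors' exact cDBI: their variant additionally involves the regularizing coefficients μ, and it is reported with a maximum at the optimum, whereas in the classic index lower values indicate better-separated clusters.

```python
from math import sqrt

def cosine_distance(u, v):
    """1 − cosine similarity; suited to high-dimensional dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def centroid(cluster):
    n = len(cluster)
    return [sum(v[i] for v in cluster) / n for i in range(len(cluster[0]))]

def cosine_dbi(clusters):
    """Davies Bouldin index with cosine distance: for each cluster pair,
    relate intra-cluster scatter to centroid separation, then average the
    worst-case ratio over clusters (lower = better separated)."""
    cents = [centroid(c) for c in clusters]
    scatter = [sum(cosine_distance(v, cents[i]) for v in c) / len(c)
               for i, c in enumerate(clusters)]
    n = len(clusters)
    total = 0.0
    for i in range(n):
        total += max((scatter[i] + scatter[j])
                     / cosine_distance(cents[i], cents[j])
                     for j in range(n) if j != i)
    return total / n

# Two tight, well-separated topic clusters score much lower (better)
# than two loose, overlapping ones (toy 2-D word vectors):
well = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
loose = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.2, 1.0]]]
```

The design choice of cosine rather than Euclidean distance matters here because, for the high-dimensional GloVe vectors used in the experiment, Euclidean norms concentrate and wash out the separation between clusters, while angular distance preserves it.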

4. Conclusions

The authors investigated the question of choosing the optimal number of topics for building a topic model for a given corpus of texts. This question has been relevant since the discovery of the topic modeling technique. The result of this study is a technique that allows one to determine the optimal number of topics for a corpus of texts.
It should be said that the proposed method was experimentally confirmed under the following conditions:
  • A small collection of documents;
  • English language documents (monolingual text corpus);
  • Thematic uniformity.
An important methodological element of the authors' approach is the preparation of a topic model using sequential regularization. In previous studies of this collection of documents [45], the authors obtained numerical estimates of the coefficients for the regularizing components of the topic model (μ).
When forming the collection of texts, conditions were set that limited the number of topics of the scientific articles, according to the topic rubrics, to 10. The essence of the experiment was to confirm the selected number of topics using an optimization approach based on the topic model quality metric developed by the authors: the Cosine Davies Bouldin Index (cDBI).
As a result, the experiment showed that the maximum value of the cDBI metric for the test corpus is achieved at an average number of topics equal to 16, with a standard deviation of 2. The result was obtained over a large number of model training runs to eliminate the influence of the order of documents in the collection.
In conclusion, it is important to emphasize that this study can serve as a methodological groundwork for the creation of software frameworks and provides support for solving one of the fundamental problems of semantic text processing: determining the sense of a text fragment (article).

Author Contributions

Data curation, F.K. and A.S.; Methodology, F.K.; Project administration, A.S.; Supervision, F.K.; Writing—original draft, A.S.; Writing—review & editing, A.S.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  2. Vorontsov, K.; Potapenko, A.; Plavin, A. Additive Regularization of Topic Models for Topic Selection and Sparse Factorization. In Statistical Learning and Data Sciences; Springer International Publishing: Cham, Switzerland, 2015; pp. 193–202. [Google Scholar] [CrossRef]
  3. Koltsov, S.; Pashakhin, S.; Dokuka, S. A Full-Cycle Methodology for News Topic Modeling and User Feedback Research. Social Informatics; Staab, S., Koltsova, O., Ignatov, D.I., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 308–321. [Google Scholar] [CrossRef]
  4. Seroussi, Y.; Bohnert, F.; Zukerman, I. Authorship Attribution with Author-aware Topic Models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers, Jeju Island, Korea, 8–14 July 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; Volume 2, pp. 264–269. [Google Scholar]
  5. Fang, D.; Yang, H.; Gao, B.; Li, X. Discovering research topics from library electronic references using latent Dirichlet allocation. Libr. Hi Tech 2018, 36, 400–410. [Google Scholar] [CrossRef]
  6. Binkley, D.; Heinz, D.; Lawrie, D.; Overfelt, J. Understanding LDA in Source Code Analysis. In Proceedings of the 22nd International Conference on Program Comprehension (ICPC 2014), Hyderabad, India, 31 May–7 June 2014; ACM: New York, NY, USA, 2014; pp. 26–36. [Google Scholar] [CrossRef]
  7. Agrawal, A.; Fu, W.; Menzies, T. What is wrong with topic modeling? And how to fix it using search-based software engineering. Inf. Softw. Technol. 2018, 98, 74–88. [Google Scholar] [CrossRef] [Green Version]
  8. Storn, R.; Price, K. Differential Evolution—A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces. J. Glob. Optim. 1997, 11, 341–359. [Google Scholar] [CrossRef]
  9. Asuncion, A.; Welling, M.; Smyth, P.; Teh, Y.W. On Smoothing and Inference for Topic Models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–21 June 2009; AUAI Press: Arlington, VA, USA, 2009; pp. 27–34. [Google Scholar]
  10. Wallach, H.M.; Murray, I.; Salakhutdinov, R.; Mimno, D. Evaluation Methods for Topic Models. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; ACM: New York, NY, USA, 2009; pp. 1105–1112. [Google Scholar] [CrossRef]
  11. Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; Blei, D.M. Reading Tea Leaves: How Humans Interpret Topic Models. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2009; Curran Associates Inc.: Red Hook, NY, USA, 2009; pp. 288–296. [Google Scholar]
  12. Koltcov, S.; Koltsova, O.; Nikolenko, S. Latent Dirichlet Allocation: Stability and Applications to Studies of User-generated Content. In Proceedings of the 2014 ACM Conference on Web Science, Bloomington, IN, USA, 23–26 June 2014; ACM: New York, NY, USA, 2014; pp. 161–165. [Google Scholar] [CrossRef]
  13. Mimno, D.; Blei, D. Bayesian Checking for Topic Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, 27–31 July 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 227–237. [Google Scholar]
  14. Teh, Y.W.; Jordan, M.I.; Beal, M.J.; Blei, D.M. Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes. In Proceedings of the 17th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2004; MIT Press: Cambridge, MA, USA, 2004; pp. 1385–1392. [Google Scholar]
  15. Blei, D.M.; Griffiths, T.L.; Jordan, M.I. The Nested Chinese Restaurant Process and Bayesian Nonparametric Inference of Topic Hierarchies. J. ACM 2010, 57, 7:1–7:30. [Google Scholar] [CrossRef]
  16. Blei, D.M.; Jordan, M.I.; Griffiths, T.L.; Tenenbaum, J.B. Hierarchical Topic Models and the Nested Chinese Restaurant Process. In Proceedings of the 16th International Conference on Neural Information Processing Systems, Whistler, BC, Canada, 9–11 December 2003; MIT Press: Cambridge, MA, USA, 2003; pp. 17–24. [Google Scholar]
  17. Bryant, M.; Sudderth, E.B. Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Curran Associates Inc.: Red Hook, NY, USA, 2012; Volume 2, pp. 2699–2707. [Google Scholar]
  18. Rossetti, M.; Stella, F.; Zanker, M. Towards Explaining Latent Factors with Topic Models in Collaborative Recommender Systems. In Proceedings of the 2013 24th International Workshop on Database and Expert Systems Applications, Prague, Czech Republic, 26–29 August 2013. [Google Scholar] [CrossRef]
  19. Newman, D.; Lau, J.H.; Grieser, K.; Baldwin, T. Automatic Evaluation of Topic Coherence. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA, 2–4 June 2010; Association for Computational Linguistics: Stroudsburg, PA, USA, 2010; pp. 100–108. [Google Scholar]
  20. Koltcov, S. Application of Rényi and Tsallis entropies to topic modeling optimization. Phys. A Stat. Mech. Its Appl. 2018, 512, 1192–1204. [Google Scholar] [CrossRef]
  21. Bing, X.; Bunea, F.; Wegkamp, M.H. A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics. arXiv, 2018; arXiv:1805.06837. [Google Scholar]
  22. Lipton, Z.C. The Mythos of Model Interpretability. Queue 2018, 16, 30:31–30:57. [Google Scholar] [CrossRef]
  23. El-Assady, M.; Sevastjanova, R.; Sperrle, F.; Keim, D.; Collins, C. Progressive Learning of Topic Modeling Parameters: A Visual Analytics Framework. IEEE Trans. Vis. Comput. Graph. 2018, 24, 382–391. [Google Scholar] [CrossRef]
  24. Nikolenko, S.I.; Koltcov, S.; Koltsova, O. Topic modelling for qualitative studies. J. Inf. Sci. 2016, 43, 88–102. [Google Scholar] [CrossRef]
  25. Batmanghelich, K.; Saeedi, A.; Narasimhan, K.; Gershman, S. Nonparametric Spherical Topic Modeling with Word Embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016. [Google Scholar] [CrossRef]
  26. Law, J.; Zhuo, H.H.; He, J.; Rong, E. LTSG: Latent Topical Skip-Gram for Mutually Improving Topic Model and Vector Representations. In Pattern Recognition and Computer Vision; Springer International Publishing: Cham, Switzerland, 2018; pp. 375–387. [Google Scholar] [CrossRef]
  27. Das, R.; Zaheer, M.; Dyer, C. Gaussian LDA for Topic Models with Word Embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, 26–31 July 2015; pp. 795–804. [Google Scholar] [CrossRef]
  28. Nguyen, D.Q.; Billingsley, R.; Du, L.; Johnson, M. Improving Topic Models with Latent Feature Word Representations. Trans. Assoc. Comput. Linguist. 2015, 3, 299–313. [Google Scholar] [CrossRef]
  29. Mantyla, M.V.; Claes, M.; Farooq, U. Measuring LDA Topic Stability from Clusters of Replicated Runs. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, Oulu, Finland, 11–12 October 2018; ACM: New York, NY, USA, 2018; pp. 49:1–49:4. [Google Scholar] [CrossRef]
  30. Mehta, V.; Caceres, R.S.; Carter, K.M. Evaluating topic quality using model clustering. In Proceedings of the 2014 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Orlando, FL, USA, 9–12 December 2014; pp. 178–185. [Google Scholar] [CrossRef]
  31. Bezdek, J.C. Cluster Validity with Fuzzy Sets. J. Cybern. 1973, 3, 58–73. [Google Scholar] [CrossRef]
  32. Dunn, J.C. Well-Separated Clusters and Optimal Fuzzy Partitions. J. Cybern. 1974, 4, 95–104. [Google Scholar] [CrossRef]
  33. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227. [Google Scholar] [CrossRef]
  34. Halkidi, M.; Batistakis, Y.; Vazirgiannis, M. Clustering Validity Checking Methods: Part II. SIGMOD Rec. 2002, 31, 19–27. [Google Scholar] [CrossRef]
  35. Xie, X.L.; Beni, G. A Validity Measure for Fuzzy Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 841–847. [Google Scholar] [CrossRef]
  36. Rousseeuw, P. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  37. Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 1532–1543. [Google Scholar] [CrossRef]
  38. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
  39. Wu, L.Y.; Fisch, A.; Chopra, S.; Adams, K.; Bordes, A.; Weston, J. StarSpace: Embed All The Things! AAAI: Menlo Park, CA, USA, 2018. [Google Scholar]
  40. Bicalho, P.V.; de Oliveira Cunha, T.; Mourao, F.H.J.; Pappa, G.L.; Meira, W. Generating Cohesive Semantic Topics from Latent Factors. In Proceedings of the 2014 Brazilian Conference on Intelligent Systems, Sao Paulo, Brazil, 18–22 October 2014; pp. 271–276. [Google Scholar] [CrossRef]
  41. Kuhn, A.; Ducasse, S.; Gîrba, T. Semantic clustering: Identifying topics in source code. Inf. Softw. Technol. 2007, 49, 230–243. [Google Scholar] [CrossRef]
  42. Chuang, J.; Roberts, M.E.; Stewart, B.M.; Weiss, R.; Tingley, D.; Grimmer, J.; Heer, J. TopicCheck: Interactive Alignment for Assessing Topic Model Stability. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Denver, CO, USA, 31 May–5 June 2015; pp. 175–184. [Google Scholar] [CrossRef]
  43. Greene, D.; O’Callaghan, D.; Cunningham, P. How Many Topics? Stability Analysis for Topic Models. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2014; pp. 498–513. [Google Scholar] [CrossRef]
  44. Koltcov, S.; Nikolenko, S.I.; Koltsova, O.; Filippov, V.; Bodrunova, S. Stable Topic Modeling with Local Density Regularization. In Internet Science; Springer International Publishing: Cham, Switzerland, 2016; pp. 176–188. [Google Scholar] [CrossRef]
  45. Krasnov, F.; Ushmaev, O. Exploration of Hidden Research Directions in Oil and Gas Industry via Full Text Analysis of OnePetro Digital Library. Int. J. Open Inf. Technol. 2018, 6, 7–14. [Google Scholar]
  46. Borg, I.; Groenen, P. Modern Multidimensional Scaling: Theory and Applications. J. Educ. Meas. 2003, 40, 277–280. [Google Scholar] [CrossRef]
  47. Calinski, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27. [Google Scholar] [CrossRef]
Figure 1. “Topics-documents” scheme.
Figure 2. Research framework.
Figure 3. Projection of a dense presentation of topics with preservation of distances.
Figure 4. Dependencies of the main internal metrics of the quality of the topic model on the number of topics.
Figure 5. Cluster validation metrics.
Figure 6. Cosine Davies Bouldin Index (cDBI) metric.

Share and Cite

MDPI and ACS Style

Krasnov, F.; Sen, A. The Number of Topics Optimization: Clustering Approach. Mach. Learn. Knowl. Extr. 2019, 1, 416-426. https://doi.org/10.3390/make1010025
