Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts

With the rapid proliferation of social networking sites (SNS), automatic topic extraction from the various text messages posted on SNS is becoming an important source of information for understanding current social trends and needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the most popular topic models in the area of Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation in short-text analysis due to a lack of word co-occurrence information within each short text. To cope with this problem, various techniques are evolving for interpretable topic modeling for short texts; combining topic models with word embedding pretrained on an external corpus is one of them. Owing to recent developments in deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging to achieve flexibility and high performance in topic modeling. However, there are very few research works on neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with the original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study with eight neural-topic models has been completed to check the effectiveness of additional fine-tuning and pretrained word embedding in generating interpretable topics, by simulation experiments with several benchmark datasets. The extracted topics are evaluated by different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks.
Our study concludes that though auxiliary word embedding with a large external corpus improves the topic coherency of short texts, an additional fine-tuning stage is needed for generating more corpus-specific topics from short-text data.


Introduction
Due to the rapid developments of computing and communication technologies and the widespread use of the internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc. These messages over the web and social networking sites contain important information regarding current social situations and trends, people's opinions on different products and services, advertisements, announcements of government policies, etc. An efficient text-processing technique is needed to automatically analyze this huge volume of messages for extracting information. In the area of traditional natural language processing, a topic-modeling algorithm is considered an effective technique for the semantic understanding of text documents. Conventional topic models, such as pLSA [1] or LDA [2] and their various variants, are considerably good at extracting latent semantic structures from a text corpus without prior annotations and are widely used in emerging topic detection, document classification, comment summarizing, and event tracking. In these models, documents are viewed as a mixture of topics, while each topic is viewed as a particular distribution over all the words. Statistical tools are used to determine the latent topic distribution of each document, while higher-order word co-occurrence patterns are used to characterize each topic [3]. The efficient capture of document-level word co-occurrence patterns leads to the success of topic modeling.
The messages posted on various social network sites are generally short compared to relatively formal documents such as newspapers or scientific articles. The main characteristics of these short texts are: (1) a limited number of words in one document, (2) the use of new and informal words, (3) meanings and usages of words that may change greatly depending on the posting, (4) spam posts, and (5) the restricted length of posts, such as API restrictions on Twitter. The direct application of traditional topic models to short-text analysis results in poor performance due to a lack of word co-occurrence information in each short-text document, originating from the above characteristics of short texts [4]. Earlier research on topic-modeling for short texts with traditional topic models used external, large-scale datasets such as Wikipedia, or related long-text datasets, for a better estimation of word co-occurrences across short texts [5,6]. However, these methods work well only when the external dataset closely matches the original short-text data.
To cope with the problems of short-text topic-modeling by traditional topic models, three main categories of algorithms are found in the literature [7]. A simple solution is to aggregate a number of short texts into a long pseudo-document before training a standard topic model, to improve word co-occurrence information. In [8], the tweets of an individual user are aggregated in one document. In [9,10], a short text is viewed as sampled from unobserved, long pseudo-documents, and topics are inferred from them. However, the performance of these methods depends on efficient aggregation and data type. When short texts of different semantic contents are aggregated into a long document, non-semantic word co-occurrence information can produce incoherent topics. In the second category, each short-text document is assumed to consist of only a single topic. Based on this assumption, Dirichlet Multinomial Mixture (DMM) model-based topic-modeling methods have been developed for short texts in [11][12][13]. Although this simple assumption eliminates data-sparsity problems to some extent, these methods fail to capture multiple topic elements in a document, which makes the model prone to over-fitting. Moreover, "shortness" is subjective and data-dependent; a single-topic assumption might be too strong for some datasets. A Poisson-based DMM model (PDMM) [14] considers a small number of topics associated with each short text instead of only one. The third category of algorithms considers global word co-occurrence patterns for inferring latent topics. According to the usage, two types of models are developed. In [15], global word co-occurrence is directly used, while in [16], a word co-occurrence network is first constructed using global word co-occurrence, and then latent topics are inferred from this network. In the present work, we explore this category for further improving algorithms that extract interpretable topics from short texts.
Another limitation of the above models for short-text analysis is that context or background information is not used, resulting in the generation of not-so-coherent topics. The statistical information of words in the text document cannot fully capture words that are semantically correlated but that rarely co-occur. Recent advances in word embedding [17] provide an effective way of learning semantic word relations from a large corpus, which can help to develop models for generating more interpretable and coherent topics. Word embedding maps the one-hot representation of words (vocabulary-length vectors of zeroes with a single one) into a lower-dimensional vector space in which semantically similar words lie close together. An embedded topic model (ETM), a combination of LDA and word embedding that enjoys the advantages of both, has been proposed in [18]. Traditional topic models with word embedding for documents are explored in several other research works, cited in [19]. In [20], word embedding is combined with LDA to accelerate the inference process, resulting in enhanced interpretability of topics. For short texts, models incorporating word embedding into DMM are proposed in [21,22]. In [23,24], short texts are merged into long pseudo-documents using word embedding. Word embedding in conjunction with conventional topic models seems to be a better technique for generating coherent topics.
The increasing complexity of inference processes in conventional topic models on large text data, along with the recent developments of deep neural networks, has led to the emergence of neural-topic models (NTM). These models bring performance, efficacy, scalability, and the ease of leveraging parallel computing facilities, such as GPUs, to probabilistic topic-modeling [25]. Neural-topic models are considered computationally simpler and easier to implement than traditional LDA models, and are increasingly used in various natural language processing tasks in which conventional topic models are difficult to use. A systematic study on the performances of several neural-topic models has been reported in [26]. Although various neural-topic models have been proposed, and although the reported experimental results on topic generation seem to be better than those of conventional topic models for long and formal texts, little research has been conducted on neural-topic models for the effective analysis of short texts [27]. Most of the research works on topic modeling for short texts are based on extensions of Bayesian probabilistic topic models (BPTM) such as LDA.
The objective of the present research is to explore computationally easy and efficient techniques for improving the interpretability of topics generated from real-world short texts using neural-topic models. Learning context information is the most challenging issue of topic-modeling for short texts, and incorporating pretrained word embedding into a topic model seems to be one of the most efficient ways of explicitly enriching the content information. Neural-topic models with pretrained word embedding for short-text analysis have not been extensively explored yet, compared to their long-text counterparts. In [28], we presented our preliminary analysis of short-text data (benchmark and real-world) with neural-topic models using pretrained word embedding. We found that although pretrained word embedding enhances the topic coherence of short texts, similar to long and formal texts, the generated topics were often composed of words having common meanings (found in the large external corpus used for pretraining) instead of the particular short-text-specific semantics of the words, which is especially important for real-world datasets. In other words, the learning of topic centroid vectors is influenced by the pretraining text corpus and fails to discover the important words of the particular short text. Our proposal is that this gap can be filled by adding a fine-tuning stage that trains the topic model with the particular short-text corpus to be analysed. In this work, we have completed an extensive study to investigate the performance of recent neural-topic models with and without word embedding, and also with the proposed fine-tuning stage, for generating interpretable topics from short texts, in terms of a number of performance metrics, by simulation experiments on several datasets. We have also studied the performance of the NTM with pretrained word embedding and an added fine-tuning stage on classification and clustering tasks.
As a result of our experiments, we can confirm that the addition of a fine-tuning stage indeed enhances the topic quality of short texts in general, and generates topics with corpus-specific semantics.
In summary, our contributions in this paper are as follows:
• A proposal for fine-tuning with the original short-text corpus, along with pretrained word embedding from a large external corpus, for generating more interpretable and coherent corpus-specific topics from short texts;
• An extensive evaluation of the performance of several neural-topic models, with and without pretrained word embedding and with the added fine-tuning stage, in terms of topic quality, measured by several metrics of topic coherence and topic diversity;
• A performance evaluation of the proposed fine-tuned neural-topic models on classification and clustering tasks.
In the next section, neural-topic models are introduced in brief, followed by a short description of related works on neural-topic models (NTM), especially NTMs for short texts. The following sections contain our proposal, followed by simulation experiments and results. The final section presents the conclusion.

Neural-Topic Models and Related Works
The most popular neural-topic models (NTMs) are based on a variational autoencoder (VAE) [29], a deep generative model, and amortised variational inference (AVI) [30]. The basic framework of VAE-based NTMs is described in the next section, in which the generative and inference processes are modeled by a neural network-based decoder and encoder, respectively. Compared to traditional Bayesian probabilistic topic models (BPTM), inference in neural-topic models is computationally simpler, their implementation is easier due to the many existing deep learning frameworks, and NTMs are easy to integrate with pretrained word embeddings for prior-knowledge acquisition. Several categories of VAE-based NTMs have been proposed. To name a few, there is the Neural Variational Document Model (NVDM) [31], Neural Variational Latent Dirichlet Allocation (NVLDA) [32], the Dirichlet Variational Autoencoder topic model (DVAE) [33], the Dirichlet Variational Autoencoder (DirVAE) [34], the Gaussian Softmax Model (GSM) [35], and iTM-VAE [36]. This list is not exhaustive and is still growing.
In addition to VAE-based NTMs, there are a few other frameworks for NTMs. In [37], an autoregressive NTM, named DocNADE, has been proposed. Subsequently, some extensions of DocNADE are found in the literature. Recently, some attempts have been made to use a GAN (Generative Adversarial Network) framework for topic-modeling [38,39]. Instead of considering a document as a sequence or a bag of words, a graph representation of a corpus of documents can be considered. In [40], a bipartite graph, with documents and words as two separate partitions, connected by word occurrences in documents as the weights, is used. Ref. [41] uses the framework of Wasserstein autoencoders (WAEs), which minimizes the Wasserstein distance between documents reconstructed by the decoder and the real documents, similar to a VAE-based NTM. In [42], an NTM based on optimal transport, which directly minimizes the optimal transport distance between the topic distribution learned by an encoder and the word distribution of a document, has been introduced.

Neural-Topic Models for Short-Text Analysis
In order to generate coherent, meaningful, and interpretable topics from short texts by incorporating semantic and contextual information, a few researchers have used NTMs in lieu of conventional topic models. In [43,44], a combination of an NTM and either a recurrent neural network (RNN) or a memory network has been used, in which topics learned by the NTM are utilized for classification by the RNN or memory network. In both works, the NTM shows better performance than conventional topic models in terms of topic coherence and a classification task. To enhance the distinctness of multiple topic distributions in a short text, the authors of [27] used Archimedean copulas. In [45], the authors introduced a new NTM with a topic-distribution quantization approach, producing peakier distributions, and also proposed a negative sampling decoder that learns to minimize repetitive topics. As a result, the proposed model outperforms conventional topic models. In [46], the authors aggregated short texts into long documents and incorporated document embedding to provide word co-occurrence information. In [47], a variational autoencoder topic model (VAETM) and a supervised version (SVAETM) of it have been proposed by combining embedded representations of words and entities obtained from an external corpus. To enhance contextual information, the authors of [48] proposed a graph neural network as the encoder of the NTM, which accepts a bi-term graph of the words as input and produces the topic distribution of the corpus as output. Ref. [49] proposed a context-reinforced neural-topic model with the assumption of a few salient topics for each short text, informing the word distributions of the topics using pretrained word embedding.

Proposal for Fine-Tuning of Neural-Topic Models for Short-Text Analysis
From the analysis of present research works on neural-topic models for short-text analysis, it seems that incorporating auxiliary information from an external corpus is one of the most popular and effective techniques for dealing with sparsity in short texts. As mentioned in the introduction, in our previous work [28], we found that although pretrained word embedding with a large external corpus helps with generating coherent topics from a short-text corpus, the generated topics lack the semantics expressed by the corpus-specific meanings of words. If the domains of the short-text corpus and the external corpus differ too much, the topic semantics become poor. This fact has also been noted by other researchers [25].
In this work, we propose an additional fine-tuning stage, using the original short-text corpus, along with the pretrained word embedding on a large external corpus. For pretrained word embedding, we decided to use GloVe [50] after some preliminary experiments with two other techniques, namely, Word2Vec and fastText, as GloVe provided consistent results. Here, we have completed an extensive comparative study to evaluate the effect of pretrained embedding with our proposed additional fine-tuning stage, using several short-text corpora and neural-topic models. The proposed study setting is represented in Figure 1. Here, pretrained word embedding is denoted as PWE. We have performed three sets of experiments for topic extraction, using only neural-topic models (NTM), neural-topic models with pretrained word embedding (NTM-PWE), and neural-topic models with pretrained embedding and the proposed fine-tuning step (NTM-PWE/fine-tuning). In all cases, the data corpus is first pre-processed, and in NTM-PWE, the word embedding vectors are replaced by PWE after the model parameters are initialized; the weights of PWE are not updated during the training step. In our proposed PWE/fine-tuning, or simply fine-tuning (as mentioned in the text), the weights are gradually updated in the training step after the word embedding vectors are replaced, as in PWE. In this case, it is possible to update these parameters at the same learning rate that is set for the entire model, but experiments have shown that updating the PWE values at a large learning rate can easily over-fit the training data. Therefore, in the simulation experiments, we have set the learning rate of the word embedding vectors to a smaller value than the learning rate of the entire model.
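As a minimal sketch of the difference between the two settings (sizes, gradients, and rates below are toy assumptions, not the paper's exact values): in the PWE setting the embedding weights receive no updates, while in the fine-tuning setting they are updated at a smaller learning rate than the rest of the model.

```python
import numpy as np

# Hypothetical shapes: vocabulary of |V| words, embedding dimension L.
V, L, K = 1000, 300, 50
rng = np.random.default_rng(0)

# rho: word-embedding matrix, initialised from pretrained vectors
# (random values here stand in for the real GloVe weights).
rho = rng.normal(size=(V, L))
alpha = rng.normal(size=(L, K))   # e.g. topic centroid vectors

lr_model = 5e-3   # learning rate for the rest of the model
lr_embed = 5e-4   # smaller rate for the embeddings (fine-tuning)

def sgd_step(grad_rho, grad_alpha):
    """One update: embeddings move slowly, the rest at the normal rate.
    In the plain PWE setting, the rho update would be skipped entirely."""
    global rho, alpha
    rho -= lr_embed * grad_rho
    alpha -= lr_model * grad_alpha
```
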
We have used popular VAE-based neural-topic models with a few similar WAE (Wasserstein autoencoder)-based models, and ten popular benchmark datasets, for our simulation experiments. The performance of each neural-topic model with no word embedding, pretrained word embedding, and additional fine-tuning has been evaluated by the generated topic quality, using different evaluation metrics of topic coherence and topic diversity. The neural-topic models, datasets, and evaluation metrics used in this study are described below.

Neural-Topic Models for Evaluation
In this section, the neural-topic models used in this study are described briefly. Table 1 describes the meaning of the notations used in the descriptions of the models.
h^(i)  i-th hidden layer's outputs
h  Gaussian random variables, h ∈ R^K
z_n  Latent topic for the n-th word
θ  Topic proportion vector, θ ∈ R^K_+
β  Topic-word distribution, β ∈ R^{|V|×K}
α  Topic centroid vectors, α ∈ R^{L×K}
ρ  Word embedding vectors, ρ ∈ R^{L×|V|}

Figures 2 and 3 describe the generalized architectures of the variational autoencoder (VAE)-based and Wasserstein autoencoder (WAE)-based neural-topic models, respectively. In both models, the part of the network that generates θ is known as the encoder, which maps the input bag-of-words (BoW) to a latent document-topic vector, and the part that receives θ and outputs p(x) is called the decoder, which maps the document-topic vector to a discrete distribution over the words in the vocabulary. They are called autoencoders because the decoder aims to reconstruct the word distribution of the input. In VAE, h is sampled from a Gaussian distribution, and θ is created by applying some transformation to it. WAE, on the other hand, uses the Softmax function directly to create θ, so no sampling is required. The evidence lower bound (ELBO), the objective function of VAE, is defined as [29]:

ELBO = E_{q(h|x)}[log p(x|h)] − D_KL(q(h|x) ‖ p(h))

It is empirically known that maximizing this ELBO alone results in smaller (worse) topic diversity. In order to solve this problem, some NTMs add a regularization term that increases the topic diversity [51], weighted by a hyper-parameter λ that manipulates the influence of the regularization term; λ = 10 was adopted here, a value determined empirically. The VAE-based models in this paper use this regularization term. The particular NTMs used in our study are described in the next subsections.

Neural Variational Document Model (NVDM)

NVDM [31] is, to our knowledge, the first VAE-based document model, with the encoder implemented by a multilayer perceptron. This model uses the sample h from the Gaussian distribution as an input to the decoder, and variational inference is based on minimizing KL divergence. While most of the NTMs proposed after this one transform h to treat θ as a topic proportion vector, NVDM is a general VAE.
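As an illustration of the VAE-based pipeline described above, the following toy NumPy sketch (with assumed sizes, and randomly initialised weights standing in for trained parameters) runs one forward pass from a BoW input to a reconstructed word distribution and evaluates the ELBO:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 500, 64, 10   # vocabulary size, hidden units, topics (assumed)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy encoder/decoder weights (learned by backprop in a real NTM).
W_enc = rng.normal(scale=0.1, size=(V, H))
W_mu = rng.normal(scale=0.1, size=(H, K))
W_logvar = rng.normal(scale=0.1, size=(H, K))
beta = rng.normal(scale=0.1, size=(K, V))   # unnormalised topic-word weights

def forward(x_bow):
    """One VAE-NTM pass: BoW -> (mu, logvar) -> sampled h -> theta -> p(x)."""
    hid = np.tanh(x_bow @ W_enc)                 # encoder hidden layer
    mu, logvar = hid @ W_mu, hid @ W_logvar
    eps = rng.normal(size=K)
    h = mu + np.exp(0.5 * logvar) * eps          # reparameterisation trick
    theta = softmax(h)                           # topic proportion vector
    p_x = softmax(theta @ beta)                  # decoder word distribution
    recon = float(x_bow @ np.log(p_x + 1e-12))   # reconstruction term
    kl = 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))
    return theta, recon - kl                     # ELBO = recon - KL

theta, elbo = forward(rng.integers(0, 3, size=V).astype(float))
```
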

Neural Variational Latent Dirichlet Allocation (NVLDA)
NVLDA [32], another variant of NVDM, is a model that uses neural variational inference to reproduce LDA. Here, the Softmax function is used to convert h to θ. The probability distribution obtained by mapping samples from a Gaussian distribution through the Softmax basis is called the Logistic-Normal distribution, which is used as a surrogate for the Dirichlet distribution; that is, θ = softmax(h) with h ∼ N(μ, Σ) follows a Logistic-Normal distribution. Additionally, the decoder is p(x) = softmax(β) · θ. Unlike NVDM, both the topic proportions and the topic-word distribution here take the form of probability distributions, which makes this model a proper topic model.
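A Logistic-Normal draw is straightforward to sketch: sample from a Gaussian and push the sample through Softmax, giving a vector on the probability simplex that approximates a Dirichlet draw (the diagonal parameters below are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
mu, sigma = np.zeros(K), np.ones(K)   # toy diagonal Gaussian parameters

def logistic_normal_sample():
    """Draw h ~ N(mu, diag(sigma^2)) and return softmax(h):
    the result lies on the simplex, a surrogate for a Dirichlet draw."""
    h = mu + sigma * rng.normal(size=K)
    e = np.exp(h - h.max())
    return e / e.sum()

theta = logistic_normal_sample()
```
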

Product-of-Experts Latent Dirichlet Allocation (ProdLDA)
ProdLDA [32] is an extension of NVLDA in which the decoder is designed following the product-of-experts model, and the topic-word distribution is not normalized.

Gaussian Softmax Model (GSM)
GSM [35] converts h to θ using the Gaussian Softmax construction, defined as

θ = softmax(W_1 h)

where W_1 ∈ R^{K×K} is a trainable linear transformation used as connection weights.

Gaussian Stick-Breaking Model (GSB)
GSB [35] converts h to θ by the Gaussian stick-breaking construction, defined as

θ = f_SB(sigmoid(W_2 h))

where W_2 ∈ R^{K×(K−1)} is the trainable connection-weight matrix, and the stick-breaking function f_SB is described by Algorithm 1. Given breaking fractions η from the sigmoid function, with ∀η_k ∈ [0, 1], f_SB outputs a topic proportion vector θ ∈ R^K_+ with ∑_k θ_k = 1: η_1 is assigned to the first element θ_1, each subsequent element takes the stated fraction of the stick left unbroken, θ_k = η_k ∏_{i<k}(1 − η_i), and the final element θ_K = ∏_{i<K}(1 − η_i) absorbs the remainder.
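As a minimal sketch (assuming K−1 breaking fractions produce K proportions, matching W_2 ∈ R^{K×(K−1)}), the stick-breaking function can be implemented as:

```python
import numpy as np

def f_sb(eta):
    """Stick-breaking: turn K-1 breaking fractions eta_k in [0, 1]
    into a K-dimensional topic proportion vector that sums to one."""
    theta = []
    remaining = 1.0                  # length of the stick still unbroken
    for e in eta:
        theta.append(e * remaining)  # break off the stated fraction
        remaining *= (1.0 - e)
    theta.append(remaining)          # last topic absorbs the remainder
    return np.array(theta)

theta = f_sb([0.5, 0.5, 0.5])        # -> [0.5, 0.25, 0.125, 0.125]
```
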
Recurrent Stick-Breaking Model (RSB)

RSB [35] converts h to θ by a recurrent stick-breaking construction, in which the stick-breaking fractions are considered as a sequential draw from a recurrent neural network (RNN):

θ = f_SB(f_RNN(h))

where f_RNN produces the breaking fractions η sequentially from the RNN, and f_SB(η) is the same as in GSB.

Wasserstein Latent Dirichlet Allocation (WLDA)
WLDA [41] is a topic model based on a Wasserstein autoencoder (WAE). Though various probability distributions can be used as the prior distribution of θ, in this paper, we use the Dirichlet distribution, which we consider the most basic. In WAE, two training methods are available, GAN (Generative Adversarial Network)-based training and MMD (Maximum Mean Discrepancy)-based training; in WLDA, MMD is used because its training loss converges more easily. In VAE, the loss function is composed of the KL divergence, used as the regularization term for θ, and the reconstruction error, while in WLDA, MMD is used as the regularization term.
If P_Θ is the prior distribution of θ and Q_Θ is the distribution of the encoder's samples, the maximum mean discrepancy (MMD) is defined as:

MMD(P_Θ, Q_Θ) = ‖ E_{θ∼P_Θ}[k(θ, ·)] − E_{θ̃∼Q_Θ}[k(θ̃, ·)] ‖_H    (15)

where H denotes the reproducing kernel Hilbert space (RKHS) of real-valued functions mapping Θ to R, and k is the kernel function.
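A minimal NumPy sketch of the biased sample estimator of squared MMD with an RBF kernel (the kernel choice, bandwidth, and sample sizes here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(p_samples, q_samples, gamma=1.0):
    """Biased estimator of squared MMD between two sample sets:
    E[k(p, p')] + E[k(q, q')] - 2 E[k(p, q)] in the RKHS of kernel k."""
    kpp = rbf(p_samples, p_samples, gamma).mean()
    kqq = rbf(q_samples, q_samples, gamma).mean()
    kpq = rbf(p_samples, q_samples, gamma).mean()
    return kpp + kqq - 2.0 * kpq

rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(5), size=64)      # samples from a Dirichlet prior
encoded = rng.dirichlet(np.ones(5), size=64)    # stand-in for encoder outputs
loss = mmd2(prior, encoded)                      # regularization term
```
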

Neural Sinkhorn Topic Model (NSTM)
NSTM [52] is trained using optimal transport [42], as in WLDA. Since we assume that θ encodes x into a low-dimensional latent space while preserving sufficient information about x, the optimal transport distance between θ and x is calculated by the Sinkhorn algorithm. The sum of this optimal transport distance and the negative log likelihood is used as the loss function. Table 2 presents the details of the benchmark datasets used in this work. The first column represents the name of the dataset, followed by the number of documents (|D|), the vocabulary size (|V|), the total number of tokens (∑ X), the average document length (ave dL), the maximum document length (max dL), the sparsity, the number of classes (C), and the source of the data in the respective columns. In the source column, 1 and 2 represent OCTIS and STTM, respectively.

Datasets
1. OCTIS: https://aclanthology.org/2021.eacl-demos.31/ (accessed on 14 January 2022).
2. STTM: https://arxiv.org/pdf/1701.00185.pdf (accessed on 14 January 2022).
The first two datasets fall into the category of long documents, and the other eight datasets can be considered short-text corpora, as their average document length is quite short compared to the long documents. The datasets shown in the table are pre-processed: HTML tags and other symbols have been removed from each dataset, and all words have been lowercased. Then, stopwords were removed and the remaining words lemmatized. From these datasets, 80% of the total documents were used as the training data and the rest as the test data. These pre-processed corpora are then converted into BoW (Bag-of-Words) vectors, whose elements are word frequencies, to be used as input data for the NTMs. However, for the NSTM, the vector corresponding to each document in the BoW is divided by the sum of its elements, as in the original paper.
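The preprocessing steps above can be sketched as follows (the stopword list is a toy stand-in, and lemmatization, performed with a library such as NLTK or spaCy in a real pipeline, is omitted here):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}  # toy list

def preprocess(doc):
    """Strip HTML tags and symbols, lowercase, and drop stopwords."""
    doc = re.sub(r"<[^>]+>", " ", doc)            # remove HTML tags
    tokens = re.findall(r"[a-z]+", doc.lower())   # keep alphabetic tokens only
    return [t for t in tokens if t not in STOPWORDS]

def to_bow(tokens, vocab):
    """Bag-of-words vector: word frequencies over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

docs = ["<p>The topic model is trained on short texts</p>",
        "Short texts lack word co-occurrence information"]
tokenized = [preprocess(d) for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
bows = [to_bow(toks, vocab) for toks in tokenized]
```
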

Evaluation of Topic Quality
It is quite challenging to evaluate the performance of topic models, including NTMs, in terms of the quality of the generated topics. Topics generated by topic models can be considered soft clusters of words. Under the constraints of a topic model, each topic is a probability distribution collecting the generation probabilities of the words for that topic; the same is true for NTMs, although for document models that impose even weaker constraints than topic models, the output may not take the form of a probability distribution. Either way, a topic here is a topic-word distribution, and each distribution has as many dimensions as the number of lexemes occurring in the corpus. It is very difficult to judge the goodness of a topic by directly comparing these distributions with human judgments. Therefore, in practice, analysts check a list of N words characteristic of a topic based on the values of the word distribution. In most cases, the list of the top-N words with the largest probability values in the word distribution is used.
Various metrics have been proposed to evaluate the quality of the top-N words, with two main directions. One is to check whether the meanings of the top-N words are consistent with each other, defined as topic coherence (TC). The other is to measure the diversity of the top-N words across each pair of topics, defined as topic diversity (TD) or topic uniqueness. Topics with high TC may have low TD; in that case, the top-N words of most topics will be nearly the same, which is not desirable. So, to evaluate the quality of a topic for human-like interpretability, it should have high TC as well as high TD.

Topic Coherence (TC)
For computing TC, the general coherence between two sets of words is estimated based on word co-occurrence counts in a reference corpus [53]. The choices of reference corpus are: (1) the training corpus for topic modeling; (2) a large external corpus (e.g., Wikipedia); (3) word embedding vectors trained on a large external corpus (e.g., Wikipedia). The scores may differ according to the computation. Choice 1 is easy, but the results are affected by the size of the training corpus. Choices 2 and 3 are more popular, although choice 2 is computationally costly. However, if the domain gap between the training corpus and the external corpus is large, the evaluation is not proper. In this work, we have used the following metrics for the computation of topic coherence:
• Normalized Point-Wise Mutual Information (NPMI) [54]: NPMI is a measure of the semantic coherence of a group of words. It is considered to have the largest correlation with human ratings and, for the list w of the top-N words of a topic, is defined as:

NPMI(w) = (1 / C(N,2)) ∑_{i<j} [ log( P(w_i, w_j) / (P(w_i) P(w_j)) ) / (−log P(w_i, w_j)) ]

N is usually set to 10. For K topics, the average of NPMI over all topics is used for evaluation;
• Word Embeddings Topic Coherence (WETC) [55]: WETC represents word embedding-based topic coherence, and pair-wise WETC for a particular topic k is defined as:

WETC_PW(k) = (1 / C(N,2)) ∑_{i<j} ⟨e_i^(k), e_j^(k)⟩

where ⟨·, ·⟩ denotes the inner product and e_i^(k) is the (normalized) embedding vector of the i-th top word of topic k. For the calculation of the WETC score, pretrained weights of GloVe [50] have been used, so the embedding vectors are the GloVe vectors corresponding to the top-N words for topic k.
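A small sketch of per-topic NPMI estimated from document-level co-occurrence counts (a toy corpus stands in for the reference corpus; a pair that never co-occurs is assigned the minimum score of −1):

```python
import numpy as np
from itertools import combinations

def topic_npmi(top_words, docs, eps=1e-12):
    """Average NPMI over word pairs of a topic's top-N words, with
    probabilities estimated from document co-occurrence in a corpus."""
    n_docs = len(docs)
    doc_sets = [set(d) for d in docs]

    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in ds for w in words) for ds in doc_sets) / n_docs

    scores = []
    for wi, wj in combinations(top_words, 2):
        p_ij, p_i, p_j = p(wi, wj), p(wi), p(wj)
        if p_ij == 0:
            scores.append(-1.0)        # never co-occur: minimum NPMI
            continue
        pmi = np.log(p_ij / (p_i * p_j + eps))
        scores.append(pmi / (-np.log(p_ij) + eps))
    return float(np.mean(scores))

docs = [["cat", "dog"], ["cat", "dog"], ["car", "road"], ["cat", "road"]]
coherent = topic_npmi(["cat", "dog"], docs)     # co-occurring pair -> high
incoherent = topic_npmi(["dog", "road"], docs)  # never co-occur -> -1
```
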

Topic Diversity
Topic diversity (TD) is defined here as the percentage of unique words in the top 25 words of all topics, following [18]. A diversity close to 0 indicates redundant topics, and a value close to 1 indicates more varied topics. We have also used two other metrics, inverted rank-biased overlap (InvertedRBO) [56] and mean squared cosine deviation among topics (MSCD) [57], as measures of the diversity of the generated topics. InvertedRBO is a measure of the disjointedness between topics weighted on word rankings, based on the top-N words; higher values of TD and InvertedRBO are better. MSCD is based on the cosine similarity of the word distributions of each pair of topics, so it should be lower for better topics. In general, NTM training updates parameters to maximize the ELBO, but such a naive implementation can easily lead to poor TD. In our case, since the topic centroid vectors are trainable parameters, we regularize the NTM to increase the angle formed by each pair of topic centroid vectors in order to increase the TD.
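The top-25-word diversity measure of [18] reduces to a few lines (shown with top_n=3 on toy topic lists for brevity):

```python
def topic_diversity(topics, top_n=25):
    """Fraction of unique words among the top-N words of all topics:
    1.0 means fully distinct topics, values near 0 mean redundant ones."""
    top_words = [w for topic in topics for w in topic[:top_n]]
    return len(set(top_words)) / len(top_words)

# Two toy topic word lists, evaluated with top_n=3 for illustration:
distinct = topic_diversity([["a", "b", "c"], ["d", "e", "f"]], top_n=3)
redundant = topic_diversity([["a", "b", "c"], ["a", "b", "d"]], top_n=3)
```
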

Simulation Experiments and Results
The simulation experiments have been performed with several benchmark datasets, and the performance of the topic models are evaluated by topic coherence and topic diversity measures.

Experimental Configuration
For the purpose of comparison and evaluation, the experimental setting should be similar for all the neural-topic models and all the datasets. We first completed some trial experiments and determined that the optimum number of topics should be set to K = 50, based on topic coherence and perplexity: large enough to provide a sufficient number of topics without becoming excessive, considering the length of short texts. This value is also in accordance with the values used in related experiments in similar research works. The dimensionality of the word embeddings was fixed at L = 300, in accordance with GloVe's Common Crawl-based pretrained word embedding vectors, publicly available at https://nlp.stanford.edu/projects/glove/ (accessed on 14 January 2022), which cover the largest vocabulary.
The other experimental parameters are set as follows: number of units of the encoder's hidden layers: H^(1) = 500, H^(2) = 500; dropout rate: p_dropout = 0.2; minibatch size: 256; maximum epochs: 200; learning rate for the encoder network: 0.005; learning rate for the decoder network: 0.001. We employ Adam as the optimizer and Softplus as the activation function of the encoder networks.
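As an illustration of how these settings fit together, the encoder's forward pass can be sketched in NumPy (the vocabulary size V and the random weights are hypothetical stand-ins; a real NTM would learn the weights with Adam at the stated learning rates):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters from the experimental configuration
V = 2000                    # vocabulary size (illustrative value)
H1, H2 = 500, 500           # encoder hidden units
P_DROPOUT = 0.2

def softplus(x):
    return np.logaddexp(0.0, x)   # numerically stable log(1 + exp(x))

# Untrained, randomly initialized weights (a real model learns these)
W1 = rng.normal(0.0, 0.01, (V, H1)); b1 = np.zeros(H1)
W2 = rng.normal(0.0, 0.01, (H1, H2)); b2 = np.zeros(H2)

def encode(x_bow, train=True):
    """Two hidden layers of 500 units, Softplus activations, dropout 0.2."""
    h = softplus(x_bow @ W1 + b1)
    h = softplus(h @ W2 + b2)
    if train:  # inverted dropout, active only during training
        h *= rng.binomial(1, 1 - P_DROPOUT, h.shape) / (1 - P_DROPOUT)
    return h

batch = rng.random((256, V))  # one minibatch of 256 bag-of-words vectors
hidden = encode(batch)
```

The hidden representation would then be mapped to the K = 50 topic-proportion parameters by the model-specific decoder head.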
For WLDA, we employ a Dirichlet prior for topic-proportion generation and use MMD for this model's training. For NSTM, the Sinkhorn algorithm's maximum number of updates is 2000, the threshold value for the termination condition is 0.05, and the constant is α_Sinkhorn = 20.
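As a sketch of how these NSTM settings enter the optimization (an illustrative Sinkhorn implementation, not the authors' code):

```python
import numpy as np

def sinkhorn(C, a, b, alpha=20.0, max_iter=2000, tol=0.05):
    """Entropy-regularized optimal-transport plan between marginals a and b
    for cost matrix C, with the NSTM settings: regularization constant
    alpha, at most 2000 scaling updates, termination threshold tol."""
    K = np.exp(-alpha * C)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(max_iter):
        v = b / (K.T @ u)            # column scaling
        u_new = a / (K @ v)          # row scaling
        if np.max(np.abs(u_new - u)) < tol:
            u = u_new
            break
        u = u_new
    return u[:, None] * K * v[None, :]
```

A tighter tolerance recovers the marginals almost exactly; the loose threshold of 0.05 trades accuracy for speed during training.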

Results for Topic Coherence
Tables 3-12 present the detailed results of the different topic coherence metrics (NPMI and WETC) for the different neural models and datasets, respectively. Values in bold indicate the best results. We have used two versions of GloVe, differing in the size of the training corpus.
For many datasets, NVLDA-PWE/fine-tuning has the highest TC. One of the challenges of using PWE without fine-tuning is that a large domain gap between the PWE training corpus and the corpus for topic modeling has a negative impact. In many cases, our proposal produces better results, but not for all the datasets or for all the models. The dataset "GoogleNews" often has the best TC with PWE alone and does not show better performance with additional fine-tuning. This is probably because this corpus has a similar domain to the training data for PWE. For a few datasets, the best performance is observed when no pretrained word embedding is used; we verified that for those datasets, the original corpus contains sufficient word co-occurrence information.
However, we noted that the TC value changes significantly depending on the type of word embedding. This result suggests that the quality of the word embeddings may have a significant impact on the training of the topic model. In particular, whether the unique words of the training corpus are covered by the vocabulary of the PWE has a significant impact. If the coverage of the word dictionary is large, the PWE can be used for evaluation, but if many words are missing, the reliability of the evaluation values will be greatly compromised. Figure 4 presents the summary of topic coherence over all the neural-topic models for the long-text corpus (2 datasets) and the short-text corpus (8 datasets), which shows the overall trend. In the case of long texts, the scores of the PWE/fine-tuning metrics for TC are either slightly worse than or the same as the others. One of the reasons for this is that the long-text corpus used in this study is composed of relatively formal documents, which is close to the domain of PWE. In contrast, the short-text corpus shows better performance in all metrics. The overall trend is none < PWE < PWE/fine-tuning.
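The vocabulary-coverage issue described above can be checked with a simple helper (illustrative; the function name is ours):

```python
def pwe_coverage(corpus_vocab, pwe_vocab):
    """Return the fraction of the training corpus's unique words that have a
    pretrained embedding, plus the set of out-of-vocabulary words (which an
    NTM-PWE would initialize with zero vectors)."""
    corpus_vocab = set(corpus_vocab)
    covered = corpus_vocab & set(pwe_vocab)
    return len(covered) / len(corpus_vocab), corpus_vocab - covered
```

A low coverage ratio signals that both training and embedding-based evaluation will rely on many zero-initialized vectors, which is exactly the unreliability discussed above.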


Results for Topic Diversity
Tables 13-22 present the detailed results of the different metrics (TopicDiversity, InvertedRBO, and MSCD) expressing the diversity of topics for the different neural-topic models and datasets, respectively. Values in bold indicate the best results. InvertedRBO, which weights the top-N words by rank, shows the highest values in almost all cases, from which it can be inferred that we were able to construct topics with high diversity. This result shows that adding a regularization term that maximizes the distance between topic-centroid vectors was useful, resulting in highly diverse topics.
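A minimal sketch of InvertedRBO follows (a truncated form without RBO's extrapolation term, which is a common simplification; not the authors' implementation):

```python
import numpy as np
from itertools import combinations

def rbo(s, t, p=0.9):
    """Truncated rank-biased overlap between two ranked word lists; higher
    ranks contribute more through the geometric weights p**(d-1)."""
    score = 0.0
    for d in range(1, min(len(s), len(t)) + 1):
        overlap = len(set(s[:d]) & set(t[:d])) / d
        score += (p ** (d - 1)) * overlap
    return (1.0 - p) * score

def inverted_rbo(topics, p=0.9):
    """1 minus the average pairwise RBO over all topic pairs; a value of
    1.0 means the top-word lists are fully disjoint."""
    pairs = combinations(topics, 2)
    return 1.0 - float(np.mean([rbo(s, t, p) for s, t in pairs]))
```

Because the weights decay geometrically, agreement among the highest-ranked words is penalized far more than agreement deep in the lists.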
Furthermore, WLDA and NSTM show similar results without this regularization term, indicating that these models can learn without compromising topic diversity in their raw form. To check whether this regularization term is working well, we have added TopicCentroidDistance (TCD) to the tables. The larger this metric is, the better, but the values are almost the same in all cases. This metric was evaluated based on two PWEs, and since the values varied, we can infer that the quality of the embedding has a significant impact on the evaluation of the topic model.
Although the results of TopicDiversity varied greatly depending on the model and dataset, when checked individually, the scores were sufficiently good in many cases. However, as in the case of the Biomedical dataset's NVLDA-PWE/fine-tuning results, there were cases where TC showed good scores but TD showed bad scores. In this respect, InvertedRBO also shows a good score, but MSCD, which evaluates the entire topic-word distribution, shows a relatively large value (i.e., a bad score), indicating that the topics are relatively entangled. Metrics such as TopicDiversity and InvertedRBO, which are based on the top-N words, are useful for evaluating topic diversity, but it is also important to evaluate the entire topic-word distribution. Figure 5 presents the summary of topic diversity results over all the neural-topic models for the long-text corpus (2 datasets) and the short-text corpus (8 datasets), which shows the overall trend. Among the metrics related to TD, the InvertedRBO score is almost the highest in all cases. This indicates that there is sufficient diversity under all conditions. However, for the other scores, the performance is slightly worse for PWE and PWE/fine-tuning.

Classification and Clustering Performance
Tables 23-32 present the classification and clustering performance for all models and datasets, respectively. Values in bold indicate the best results. For the TrecTweet dataset, the classification results could not be obtained, possibly due to a technical problem. Figure 6 presents the average classification and clustering performance of the models over the long- and short-text datasets. Classification has been performed by an SVM (Support Vector Machine) with linear and RBF kernels. Classification accuracy, precision, recall, and F1 scores have been used for assessing supervised classification, and NMI (Normalized Mutual Information) and Purity have been used for assessing clustering.
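The two clustering scores can be sketched as follows (an illustrative implementation using arithmetic-mean normalization for NMI; the normalization convention varies across libraries):

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their
    assigned cluster (topic)."""
    t, c = np.asarray(labels_true), np.asarray(labels_pred)
    total = sum(np.bincount(t[c == k]).max() for k in np.unique(c))
    return total / len(t)

def nmi(labels_true, labels_pred, eps=1e-12):
    """Normalized mutual information between class labels and cluster
    assignments (both given as non-negative integer sequences)."""
    t, c = np.asarray(labels_true), np.asarray(labels_pred)
    joint = np.zeros((t.max() + 1, c.max() + 1))
    for ti, ci in zip(t, c):
        joint[ti, ci] += 1.0
    joint /= len(t)
    pt, pc = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pt, pc)[mask]))
    h_t = -np.sum(pt[pt > 0] * np.log(pt[pt > 0]))
    h_c = -np.sum(pc[pc > 0] * np.log(pc[pc > 0]))
    return mi / ((h_t + h_c) / 2.0 + eps)
```

Both scores reach 1.0 for a clustering that perfectly recovers the class labels (up to a relabeling of cluster indices).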
For classification, VAE-based models, such as NVDM and GSM, exhibit good performance, while WAE-based models, such as WLDA and NSTM, show relatively poor performance. NSTM shows good performance in TC and TD, especially in TD, without any added regularization term. However, the application of WAE variants to downstream tasks remains a challenge. Considering the overall trend, for long texts, PWE with fine-tuning improves all the scores, while for short texts, performance is best without embedding; nevertheless, after fine-tuning, the scores are better than those obtained with pretrained embedding alone.
For the clustering results, the large NMI and Purity scores for all models and all datasets, for both long and short texts, indicate that documents with the same label are concentrated around the topic-centroid vectors, showing that the proposed PWE/fine-tuning improves topic cohesion. Therefore, we can see that our proposal of PWE/fine-tuning contributes to narrowing the domain gap between the training corpus and the PWE.

Conclusions
Short-text data are now becoming ubiquitous in the real world through various social networking sites. The importance of analyzing these short messages is also growing day by day. Unlike long texts or documents, short texts suffer from a lack of word co-occurrence information due to their restricted lengths, posing a difficulty in generating coherent and interpretable topics with popular topic-model techniques.
The use of pretrained word embedding in neural-topic models is a good choice to easily increase the generated topic quality as measured by topic coherence and topic diversity. This is effective for both long and short texts, and reduces the number of trainable parameters, thus shortening the training step time. However, to achieve better topic coherence, especially in short texts, or to make the top-N words of a topic more relevant to the real semantic contents of the training corpus, the additional fine-tuning stage proposed in this work is indeed necessary. The extensive study in this work with several neural-topic models and benchmark datasets justifies our proposal.
However, the use of pretrained word embedding (PWE) has its inherent limitations, which may affect the quality of the topics extracted from short texts. The short-text corpus to be analyzed may contain words that are not included in the vocabulary covered by the corpus used for pretrained word embedding. In this case, NTM-PWE uses vectors initialized with zero. As the proportion of such out-of-vocabulary words increases, the performance is likely to deteriorate. Moreover, in the case of NTM-PWE/fine-tuning, the number of parameter updates may increase until the loss function converges, resulting in an increase in training time. If the temporal gap between the corpus used for PWE training and the corpus to be analyzed is too large, the meanings of words may change over time, which may have a negative impact on the production of interpretable topics.
It is also seen that the improvement in topic quality after introducing a fine-tuning stage is not the same for all the datasets and all the models. It is difficult to define the correlation between the structure of neural-topic models and the inherent characteristics of the datasets, which poses a challenge to our study. In this work, we limited our study to benchmark datasets available on the internet. Currently, we are collecting data for the evaluation of our proposal with real-world datasets.
By incorporating the additional training with the original training corpus, along with pretrained word embedding from the external corpus, we can improve the Purity and NMI of the topics evaluated using the class labels of the documents. Thus, we can construct topics that are more suitable for the training corpus. This method can also be expected to improve the performance of downstream tasks, such as classification, for long texts. Even for short texts, the performance on downstream tasks is better than when using pretrained word embedding without fine-tuning.