A well-known fact in the data science community is that the proliferation of the Internet, and subsequently of social media, has generated massive amounts of data [
23]. It is also well understood that it requires proper manipulation and analysis to extract useful knowledge. Textual data represents approximately 75% of all data on the web [
24], which presents its own processing challenges. Textual data are unstructured, in the sense that they do not have an underlying schema. They are more ambiguous—making use of synonyms, slang, and abbreviations—and often require an understanding of context. Spelling errors and lack of grammar further complicate their analysis. When studying textual data extracted from social media, their informal and short-text nature deepens these challenges.
Natural language processing (NLP) is a branch of computer science that aims to overcome these obstacles, enabling computers to understand, interpret, and generate natural, human language. In NLP, text data organized into datasets is called a corpus, while documents are the sources of the data, such as books, news articles, and emails [
25]. NLP has a diverse array of applications, each one contributing to how we interact with and process language data. Document summarization, for instance, involves distilling key information from extensive documents, and providing concise overviews that encapsulate the essence of the text. Information retrieval plays an important role in organizing and categorizing documents based on their central themes, an essential function for the efficiency of search engines and recommendation systems. Additionally, sentiment analysis delves into the emotional undertones within a text, offering insights into the sentiments and perceptions surrounding the discussed topics, which is invaluable in fields ranging from market analysis to social media monitoring. Each of these applications demonstrates the versatile and impactful nature of NLP in extracting, processing, and interpreting language data.
To successfully perform these tasks, several NLP techniques are available. In part-of-speech tagging, each word in a sentence of the corpus is assigned a part of speech (e.g., noun, verb, adjective). Named-Entity Recognition identifies and categorizes entities such as names of persons, organizations, and locations. Another NLP technique is Topic Mining (also known as Topic Extraction and Topic Discovery), which uses algorithms to automatically identify and categorize meaningful topics in a document. The objective is to uncover underlying themes within the corpus without prior knowledge of those topics. Since we wish to explore the subjects of politicians’ tweets, Short-Text Topic Mining (STTM) is the focus of this section. In terms of literature, we build upon the review of Murshed et al. [
26]. We focus on publications from 2022 and 2023 that look into Twitter Topic Mining. In the cases where this was too restrictive, we looked into publications on other social media platforms.
3.1. Topic Modeling
Topic Modeling consists of extracting latent topics from a corpus of unlabeled documents. Latent topics are those that are not immediately perceivable in the document and are instead suggested by a collection of words. Topic Modeling initially emerged as a method for representing text. One of the earliest approaches in this regard was the bag-of-words (BoW) model, where words are treated as features within a document. In the BoW model, the value assigned to each feature (term/word) can vary—it might be binary, indicating the presence or absence of a word, or numerical, reflecting the word’s frequency [
27]. Another significant approach is the term frequency-inverse document frequency (TF-IDF) method. Unlike BoW, which might be primarily used for word frequency, TF-IDF assigns weight to words based on their frequency in a particular document and inversely to their prevalence across the entire document corpus, favoring words that are frequent in a single document but infrequent in the overall dataset [
28]. This distinction allows TF-IDF to highlight words that are uniquely significant to individual documents.
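As a minimal illustration of the difference between the two representations, the sketch below builds BoW and TF-IDF matrices for a small toy corpus using scikit-learn (the corpus and variable names are purely illustrative):

```python
# Minimal sketch of BoW and TF-IDF document representations using scikit-learn.
# The toy corpus below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the economy is growing and jobs are growing",
    "the election results are in",
    "jobs and the economy dominate the election debate",
]

# Bag-of-words: each cell holds the raw frequency of a term in a document.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: terms frequent in one document but rare across the corpus get higher weight.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```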
Topic Modeling, initially based on clustering techniques like Hierarchical Clustering [
29] and K-Means clustering [
30], evolved to incorporate more sophisticated models for representing and interpreting textual data. A notable advancement in this field was the Vector Space Model (VSM), where documents are represented as vectors not just of words, but of terms more broadly. This approach, echoing J.R. Firth’s concept that “You shall know a word by the company it keeps”, enables a nuanced understanding of language by capturing the context in which terms appear. VSM addresses the limitations of earlier models in handling polysemy and synonymy by allowing for different forms of term representation, including semantic forms like WordNet synsets or word senses. However, VSM itself does not inherently reduce dimensionality; this is achieved through additional methods such as principal component analysis or latent semantic analysis. Despite its advancements, VSM’s treatment of words and terms still faced challenges in fully capturing the complexity of language, a subject further explored in the development of topic modeling techniques [
31].
In the VSM, a term-document matrix X is constructed, where rows represent terms (words) and columns represent documents. Each cell represents the frequency of a term in a particular document. Since documents can vary widely in length and terms, this matrix tends to be large and highly sparse—a challenge for efficient analysis. This obstacle is overcome through a matrix factorization technique called singular-value decomposition (SVD). Its goal is to find the lower-dimensional representations of terms and documents that preserve the semantic relationships in X. The Latent Semantic Analysis (LSA) model by Deerwester et al. [
32] has a main foundation in the distributional hypothesis. In essence, the idea is that texts with similar meanings are expected to have similar representations, which translates to being closer to each other in the vector space. This way, LSA reduces the dimensionality of the vector space while capturing the patterns in the data.
In Singular Value Decomposition (SVD) applied to topic modeling, we consider a term-document matrix $X$ of size $m \times n$, where $m$ is the number of terms and $n$ denotes the number of documents. The objective is to decompose $X$ into three matrices: the term-topic matrix $U$ of size $m \times r$, the diagonal matrix of singular values $\Sigma$ of size $r \times r$, and the document-topic matrix $V$ of size $n \times r$, where $r$ is the number of selected topics or latent concepts. In this decomposition, $U$ represents the relationship between terms and topics, with its elements indicating the strength of this relationship. The matrix $V$ (or its transpose $V^{T}$), on the other hand, illustrates the association between documents and topics, with the rows of $V$ specifying the strength of these associations. The matrix $\Sigma$ contains the singular values, which correspond to how much each of the $r$ latent topics explains the variability in the data. The decomposition can be formally represented as:
$$X \approx U \Sigma V^{T} \tag{1}$$
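A minimal LSA sketch along these lines uses scikit-learn's TfidfVectorizer and TruncatedSVD (note that scikit-learn builds a document-term matrix, the transpose of the term-document matrix $X$ above; the corpus and the choice of $r$ are illustrative assumptions):

```python
# Minimal LSA sketch: TF-IDF weighting followed by truncated SVD.
# The corpus and r (number of latent topics) are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "parliament votes on the new budget",
    "the budget debate continues in parliament",
    "the national team wins the football match",
    "fans celebrate the football victory",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)          # document-term matrix (documents x terms)

r = 2                                         # number of latent topics
svd = TruncatedSVD(n_components=r, random_state=0)
doc_topic = svd.fit_transform(X)              # document representations in topic space
term_topic = svd.components_.T                # term loadings on each latent topic

terms = vectorizer.get_feature_names_out()
for k in range(r):
    top = term_topic[:, k].argsort()[::-1][:3]
    print(f"Topic {k}:", [terms[i] for i in top])
```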
LSA was widely adopted at the time of its creation and is still used today, despite the high computational cost of calculating the SVD and the fact that the generated feature space is often hard to interpret. Valdez et al. [33] used LSA to analyze tweets on the 2016 U.S. election, showing topics in parallel with the most frequent policy-related Internet searches at the time. More recently, Sai et al. [
34] used LSA to guide the identification of fake news on Twitter, while Chang et al. [
35] aimed to gain insights into public perception of the Russia–Ukraine conflict on Twitter through LSA. While works have shown that alternative models outperform LSA [
36], others have tried to overcome LSA’s limitations: Karami et al. [
37] proposed Fuzzy LSA Topic Mining in health news tweets; Kim et al. [
38] presented Word2Vec-based LSA for blockchain trend analysis.
In 1999, Hofmann introduced probabilistic LSA (pLSA) as an alternative to LSA under a probabilistic framework [
39]. It assumes that topics are distributions over words, and it aims to find a model of the latent topics that can generate the data in the document-term matrix. Formally, this distribution is defined as:
$$P(d, w) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \tag{2}$$
where $d \in D$ are documents, $w \in W$ are words, and $z \in Z$ are the latent topics.
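For intuition, the following sketch runs the standard EM updates for this pLSA formulation on a synthetic count matrix (the matrix, the number of topics, and the iteration budget are illustrative assumptions, not taken from the cited works):

```python
# Minimal pLSA sketch: EM updates for P(w|z) and P(z|d) on a toy count matrix.
# Sizes, initialization, and iteration count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(10, 8)).astype(float)   # n(d, w): 10 documents, 8 words
D, W = N.shape
K = 3                                                 # number of latent topics z

p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # P(z|d)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # P(w|z)

for _ in range(50):
    # E-step: P(z|d,w) proportional to P(w|z) * P(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]      # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w)
    expected = N[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print("P(z|d) for the first document:", p_z_d[0].round(2))
```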
Despite being a foundational method, the usage of pLSA in NLP tasks has become infrequent, as it has been superseded by more advanced models. Kumar and Vardhan [
40] used pLSA for the topic-based sentiment analysis of tweets, and Shen and Guo [
41] used it to add context to a translation task. Though pLSA represents an advancement over the traditional LSA, it is notably susceptible to overfitting. Consequently, newer models have been developed to address this limitation, seeking to balance model complexity with generalization to broader datasets.
First proposed by Blei et al. [
42] in 2003, Latent Dirichlet Allocation (LDA) is a widely recognized topic modeling framework. Unlike traditional text classification algorithms that categorize documents into predefined classes, LDA identifies a range of topics from the data themselves. It then assigns a distribution of these topics to each document, reflecting the varying degrees to which different topics are represented in the text. Its fundamental idea is that “documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words”. The number of latent topics is user-defined. Given parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topic assignments $Z$ (one per word, drawn from the $K$ user-defined topics), and a set of $N$ words $W$ is given by:
$$p(\theta, Z, W \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \tag{3}$$
Figure 2 (from Anastasiu et al. [
43]) illustrates the LDA algorithm and how each parameter interacts with all others.
The following example helps to better understand the mechanisms:
Let us have a corpus of ten documents, five words, and three topics: M = 10, N = 5, K = 3. The documents in the corpus follow a Dirichlet distribution, Dir($\alpha$), which defines how they relate to the topics. In this example, consider a triangle, where each vertex corresponds to a topic. The triangle shading is darker in the corners and lighter in the center, suggesting that the probability of the documents is higher near the individual topics/vertices than near their combinations. Dir($\beta$) defines how the topics relate to the words. In this example, consider a tetrahedron (which has all four vertices equidistant to the center) where each vertex corresponds to a corpus word. Similarly to the triangle, the shading in this figure is lighter in the center and darker in the corners, suggesting that the probability of the topics is higher near the individual words than near their combinations.
From the Dirichlet distributions, a new document, $D$, can be generated as set out in Equation (3). $D$ will have a mixture, $\theta$, of representations of each topic; for instance, 70% of Topic 1, 20% of Topic 2, and 10% of Topic 3. These percentages create the multinomial distribution of $Z$, $p(z_n \mid \theta)$, giving the topics of the words in $D$. Having the topics, it is necessary to find the words, $W$. Similarly, the multinomial distribution of $W$ is subject to $\beta$: for each topic in $D$, a word is selected with probability $p(w_n \mid z_n, \beta)$.
Having $D$, it is possible to compare it with the original documents and select the parameters that generate the closest ones, i.e., that maximize Equation (3).
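A short sketch of this generative process, with hypothetical symmetric priors since the example does not fix $\alpha$ and $\beta$ numerically, may make the sampling steps concrete:

```python
# Illustrative sketch of the LDA generative process from the example above
# (M = 10 documents, N = 5 words per document, K = 3 topics, V = 5 vocabulary words).
# The alpha and beta values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, N, K, V = 10, 5, 3, 5
alpha = np.full(K, 0.1)             # Dirichlet prior over topics per document
beta = np.full(V, 0.1)              # Dirichlet prior over words per topic

phi = rng.dirichlet(beta, size=K)   # per-topic word distributions, shape (K, V)

corpus, mixtures = [], []
for d in range(M):
    theta = rng.dirichlet(alpha)            # topic mixture of document d (e.g., 70/20/10%)
    words = []
    for n in range(N):
        z = rng.choice(K, p=theta)          # draw a topic from the multinomial over theta
        w = rng.choice(V, p=phi[z])         # draw a word from the chosen topic's distribution
        words.append(w)
    corpus.append(words)
    mixtures.append(theta)

print("theta of the first document:", mixtures[0].round(2))
print("generated word indices:", corpus[:2])
```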
To account for documents of different lengths, a Poisson distribution is attached to the formulation in Equation (3). Gibbs sampling was later suggested as an efficient method for estimating $\alpha$ and $\beta$ [44].
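In practice, LDA is usually fitted with an off-the-shelf implementation. The sketch below uses scikit-learn's LatentDirichletAllocation on a handful of toy tweets (note that this implementation relies on online variational inference rather than Gibbs sampling; the corpus and parameter choices are illustrative):

```python
# Minimal sketch of fitting LDA with scikit-learn's variational implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "new jobs report shows economy growing",
    "vote tomorrow in the local election",
    "hospital waiting times keep rising",
    "election debate focused on the economy",
    "more funding announced for public health",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)   # K = 2 latent topics
doc_topic = lda.fit_transform(X)                                  # per-document topic mixtures

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:4]
    print(f"Topic {k}:", [terms[i] for i in top])
```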
LDA has been a popular and widely adopted model for several tasks. In the case of Topic Mining of Twitter data, it has been applied to market research [
45,
46,
47], health [
48,
49,
50], and specifically COVID-19 [
51,
52,
53], as well as other subjects such as climate change [
54], layoffs [
54], and hate speech [
55].
As the limitations of LDA seem to impact performance in STTM, research has shifted to modifying the original LDA model.
Table 1 lists some LDA variations, with a simple description of each one.
Of the above-listed models, Twitter-LDA has achieved particularly interesting results in performing STTM [
56]. As its name suggests, it has been tailored to perform LDA on Twitter data by including two main differences:
It has expanded the preprocessing steps. Instead of removing all non-alphanumeric characters, it saves those that are relevant in tweets—hashtags, mentions, and URLs—and uses them for further information (e.g., as potential topic labels).
Words in a tweet are either topic words or background words, each with its own underlying word distribution. A given user has a topic distribution: when writing a tweet, the user first picks a topic based on this distribution; then, for each word, a Bernoulli distribution determines whether a topic word or a background word is chosen (a generative sketch follows below).
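A rough generative sketch of this story (all distributions, sizes, and the Bernoulli parameter below are hypothetical, intended only to illustrate the mechanism described above):

```python
# Illustrative sketch of the Twitter-LDA generative story: one topic per tweet,
# and a Bernoulli switch between topic words and background words.
import numpy as np

rng = np.random.default_rng(0)
K, V, words_per_tweet = 3, 8, 6

user_topics = rng.dirichlet(np.full(K, 0.5))            # the user's topic distribution
topic_words = rng.dirichlet(np.full(V, 0.1), size=K)    # per-topic word distributions
background = rng.dirichlet(np.full(V, 0.5))             # background word distribution
p_topic_word = 0.7                                      # Bernoulli parameter (hypothetical)

z = rng.choice(K, p=user_topics)                        # one topic for the whole tweet
tweet = []
for _ in range(words_per_tweet):
    if rng.random() < p_topic_word:                     # Bernoulli switch
        tweet.append(rng.choice(V, p=topic_words[z]))   # topic word
    else:
        tweet.append(rng.choice(V, p=background))       # background word

print("tweet topic:", z, "word indices:", tweet)
```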
3.2. Topic Classification
The evolution of machine learning (ML) and neural networks has significantly advanced Topic Mining, particularly in the realm of Topic Classification. This advancement enables the precise automatic labeling of documents with predefined topics. Unlike Topic Modeling, which identifies latent topics within a corpus without prior definition, Topic Classification focuses on categorizing documents into specific, predetermined categories based on the topics they contain. This process leverages the semantic understanding facilitated by techniques such as word embeddings to accurately match documents with relevant topic labels. The development of sophisticated models and approaches in ML has improved the accuracy and efficiency of Topic Classification, making it possible to effectively handle complex and large-scale datasets.
A key concept applicable in Topic Classification is word embeddings. Word embeddings consist of transforming individual words into numerical representations, i.e., vectors. These vectors aim to describe each word’s characteristics, such as its definition and context. It is then possible to identify similarities (or dissimilarities) between words—for instance, understanding the synonyms of different terms, or ironic uses of the same one. Once the embeddings have been generated, it is possible to position the words in the vector space, with similar words closer to each other and dissimilar words further apart.
Although word embeddings and document-term matrices (such as those generated by TF-IDF) have the same goal of representing words numerically, they do so differently. In document-term matrices, each word from the corpus is represented and organized by its occurrence in individual documents. This results in a matrix that is typically large and sparse, reflecting the presence or absence of words across different documents. However, due to this format of representation, other aspects like the context and the nuanced meaning of words in their natural linguistic environment are not captured in the matrix. Additionally, vectorization is corpus-dependent, which complicates training in different datasets.
Word2Vec by Mikolov et al. [
57] consists of a group of two-layer neural networks used to generate word embeddings. It makes use of two architectures with opposite goals:
Continuous bag-of-words: predicts a word based on its neighbors. The architecture consists of an input layer (the neighbor words), a hidden layer, and an output layer (the predicted word). The hidden layer learns the vector representation of the words.
Skip–Gram: predicts neighbors based on a word. The architecture consists of an input layer (the word), a hidden layer, and an output layer (the neighbors). The hidden layer outputs the vector representation of the individual word.
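Both architectures are available in the gensim implementation of Word2Vec through the sg flag; the sketch below trains each on a tiny pre-tokenized corpus (the sentences and hyperparameters are illustrative assumptions):

```python
# Minimal gensim sketch of the two Word2Vec architectures on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["the", "minister", "announced", "a", "new", "budget"],
    ["parliament", "debated", "the", "budget", "proposal"],
    ["the", "team", "won", "the", "football", "match"],
    ["fans", "watched", "the", "football", "final"],
]

# sg=0 -> continuous bag-of-words (predict a word from its neighbors)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=200, seed=1)
# sg=1 -> skip-gram (predict neighbors from a word)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200, seed=1)

print(cbow.wv["budget"][:5])                          # first entries of the learned word vector
print(skipgram.wv.most_similar("football", topn=3))   # nearest words in the embedding space
```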
Doc2Vec was proposed in 2014 by Le and Mikolov [
58]. It expanded on the ideas of Word2Vec: instead of words, the model learns the vector representations of documents. To do so, it includes an additional vector to represent the document as a whole. Doc2Vec presents two possible approaches:
Distributed Memory: predicts the next word given an input layer consisting of the document vector and a ‘context’ vector of the current words (those surrounding the target word). The latter consists of a sliding window, which traverses the document as the target word changes. These two vectors are concatenated and passed through a hidden layer, where their relationships are learned. The output consists of a softmax layer, predicting the probability distribution of the target word. During training, the hidden and output layers’ weights are adjusted to minimize the prediction error. The document vector is also updated at this stage.
Distributed Bag-of-Words: predicts a set of words given only the document vector. The difference between this approach and distributed memory is that the input layer consists only of the document vector. There is no consideration of word order or context within the document, which simplifies the prediction task.
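Similarly, gensim's Doc2Vec exposes both approaches through the dm flag; a minimal sketch with an illustrative toy corpus:

```python
# Minimal gensim sketch of the two Doc2Vec variants (documents and parameters
# are illustrative assumptions).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words=["budget", "vote", "in", "parliament"], tags=["doc0"]),
    TaggedDocument(words=["football", "final", "tonight"], tags=["doc1"]),
    TaggedDocument(words=["new", "budget", "proposal", "announced"], tags=["doc2"]),
]

# dm=1 -> distributed memory (document vector + context words predict the target word)
dm_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=1, epochs=100)
# dm=0 -> distributed bag-of-words (document vector alone predicts sampled words)
dbow_model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, dm=0, epochs=100)

print(dm_model.dv["doc0"][:5])                        # learned vector of the first document
print(dbow_model.dv.most_similar("doc0", topn=1))     # closest document in the embedding space
```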
While neither Word2Vec nor Doc2Vec are used for Topic Mining, introducing these models is necessary to explain Top2Vec. Top2Vec is an unsupervised model for automatic Topic Mining that does not require an a priori user-defined number of topics [
59]. It does so in five steps:
Creating embeddings: generates embedded document and word vectors, using Word2Vec and Doc2Vec.
Reducing dimensionality: reduces the number of dimensions of the document embeddings using Uniform Manifold Approximation and Projection (UMAP). UMAP is a dimensionality reduction algorithm useful for complex datasets. It also has the advantage of placing similar data points close to each other, helping to identify dense areas. UMAP calculates similarity scores across pairs of data points, depending on their distances and on the number of neighbors considered (a user-defined parameter). It then projects them into a low-dimensional graph and adjusts the distances between the data points according to their cluster [60]. Using this algorithm, this step maintains the variability of the embeddings while reducing the dimensionality of the space, and it also helps to identify clusters in the data.
Segmenting clusters: the model identifies dense areas of documents in the space. If a document belongs to one of these areas, it is given a label; otherwise, it is considered noise. This is performed with HDBSCAN, a hierarchical clustering algorithm that classifies data points as either core points or non-core points. Core points are those with more than a user-defined minimum number of neighbors. After classifying all observations, a core point is assigned to cluster 1. All other points in its neighboring region are added to the cluster, and so are the neighboring points of these. When no other core points can be assigned to this cluster, a new core point is assigned to cluster 2. This process is repeated until all data points have been assigned to a cluster [61].
Computing centroids: For each dense area within the document vectors, the centroid is calculated, which serves as the topic vector.
Finding topic representations: the topic representations consist of the nearest word vectors to the topic vector.
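In terms of usage, the top2vec package wraps these five steps behind a single call; a minimal sketch follows (the load_tweets helper is a hypothetical placeholder for obtaining a list of document strings, and Top2Vec needs a reasonably large corpus in practice to find dense areas):

```python
# Minimal usage sketch of the top2vec package, which runs the five steps above
# (embedding, UMAP, HDBSCAN, centroids, topic words) internally.
from top2vec import Top2Vec

documents = load_tweets()   # hypothetical helper returning a list of strings

model = Top2Vec(documents, speed="learn", workers=4)

print(model.get_num_topics())                        # number of topics found automatically
topic_words, word_scores, topic_nums = model.get_topics()
print(topic_words[0][:10])                           # ten nearest words to the first topic vector
```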
Top2Vec has been applied to short-text data such as social media posts from patients with rare diseases; Karas et al. demonstrated that it outperforms LDA in this application [
62]. Zengul et al. [
63] showed that Top2Vec results are more closely correlated to LDA’s than to LSA’s. Vianna and Silva De Moura [
64] achieved successful results in extracting topics from the abstracts of legal cases (which are much shorter than the complete documents), and Crijns et al. [
65] managed to scrape web texts that identify innovative economic sectors or topics. Bretsko et al. [
66] applied Top2Vec to the abstracts of academic papers.
Although Top2Vec was only introduced in 2020, the approach is already considered dated: it has been surpassed by Transformer-based models, which are now the state of the art in NLP [
67].
BERTopic builds on BERT, a family of language models based on Transformers. Transformers are a deep learning architecture that has attracted significant attention due to its ability to efficiently handle sequential data and capture complex patterns in text.
The fundamental innovation brought forth by Transformers is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence while considering their relationships. This enables Transformers to capture both local and global dependencies in the data, making them particularly effective for tasks involving long-range dependencies, such as language translation, text generation, and language understanding.
The transformer architecture consists of two main components:
Encoder–decoder structure: the encoder takes in the input sequence and processes it through multiple layers of self-attention and feed-forward neural networks, capturing contextual information. The decoder then generates the output sequence based on the encoded representation of the input and its self-attention mechanism.
Self-attention mechanism: self-attention allows each word in a sequence to consider the relationships with all other words in the sequence. This is achieved by calculating the weighted representations of the input words, where the weights are determined by the relevance of each word to the current word being processed. The self-attention mechanism consists of three main components: query, key, and value. These components are used to compute attention scores that determine how much each word contributes to the representation of the current word. Multiple attention heads are used in parallel to capture the different aspects of the relationships.
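A minimal NumPy sketch of scaled dot-product self-attention for a single head (toy dimensions and random projection matrices, purely to illustrate the query/key/value computation):

```python
# Minimal numpy sketch of scaled dot-product self-attention for one attention head.
# Real models use learned projections and many heads in parallel.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8                    # 4 tokens, 8-dimensional embeddings

X = rng.standard_normal((seq_len, d_model))        # token embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                # queries, keys, values
scores = Q @ K.T / np.sqrt(d_k)                    # relevance of every token to every other token
weights = softmax(scores, axis=-1)                 # attention weights (rows sum to 1)
output = weights @ V                               # weighted representations of the input

print(weights.round(2))
```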
Transformers have become the foundation for many state-of-the-art NLP models, such as Bidirectional Encoder Representations from Transformers (BERT), Generative Pre-trained Transformer (GPT), Text-to-Text Transfer Transformer (T5), and more.
In 2020, BERTopic was presented as “a topic model that leverages clustering techniques and a class-based variation of TF-IDF to generate coherent topic representations”. It aims to overcome other models’ limitations, such as the inability to take into account the semantic relationships between words.
BERTopic mines topics in five stages (
Figure 3, from [
68]). Although the default algorithms have been selected for the reasons presented below, BERTopic is highly modular, and users can customize it at each step. It also allows the fine-tuning of each default algorithm through its corresponding hyperparameters.
BERTopic begins by converting documents into embedding representations. The default algorithm for this is Sentence-BERT, which achieves state-of-the-art performance on such tasks [
69]. The dimensionality of these embeddings tends to be quite high, with some models achieving ten thousand dimensions [
70]. UMAP is used to reduce this to a much lower-dimensional space, since it preserves the local and global features of high-dimensional data better than alternatives such as PCA or t-SNE [
60]. With data in a more feasible vector space, it can now be clustered. HDBSCAN allows noise to be considered outliers, does not assume a centroid-based cluster, and therefore does not assume a cluster shape—an advantage relative to other Topic Modeling techniques [
61]. The next step is to perform c-TF-IDF. This is a variation of the classical TF-IDF: firstly, it generates a bag-of-words at the cluster level, concatenating all documents in the same class. From this, TF-IDF is applied to each cluster bag-of-words, resulting in a measure for each cluster, instead of a corpus-wide one.
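A minimal usage sketch of this default pipeline with the bertopic package (the load_tweets helper is a hypothetical placeholder; custom UMAP and HDBSCAN instances can also be passed to the constructor to exploit the modularity mentioned above):

```python
# Minimal usage sketch of the default BERTopic pipeline described above
# (SBERT embeddings -> UMAP -> HDBSCAN -> c-TF-IDF).
from bertopic import BERTopic

docs = load_tweets()            # hypothetical helper returning a list of strings

topic_model = BERTopic(language="english", min_topic_size=10)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # one row per topic, including the outlier topic -1
print(topic_model.get_topic(0))              # c-TF-IDF keywords of the largest topic
```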
BERTopic has been increasingly adopted in STTM. Hägglund et al. [
71] aimed to identify the topics discussed on Twitter by autistic users; Li [
72] applied it to examine how tweets can predict tourist arrivals to a given destination; Strydom and Grobler [
73] analyzed COVID-19 misinformation on Twitter; Turner et al. [
74] searched for concept drift on historical cannabis tweets; and Koonchanok et al. [
75] attempted to track public attitudes towards ChatGPT on the same platform. Grigore and Pintilie [
76] trained BERTopic to help identify potential patients with eating disorders based on their social media. Finally, with regard to hate speech on social media, Mekacher et al. [
77] analyzed the risk of banning malicious accounts, showing that users that were banned from Gettr and not Twitter were more toxic on the latter. Schneider et al. [
78] presented a pipeline for detecting hate speech and its targets on Parler.
While BERTopic is competitive with other models, it presents some disadvantages. It assumes that each document contains only one topic, which does not necessarily hold in practice. It also uses a bag-of-words representation to generate topic representations, meaning it does not consider the relations between those words. As a result, words in a topic might be redundant for interpreting it.