Comparison of Topic Modelling Approaches in the Banking Context

Topic modelling is a prominent task for automatic topic extraction in many applications such as sentiment analysis and recommendation systems. The approach is vital for service industries to monitor their customer discussions. Traditional approaches such as Latent Dirichlet Allocation (LDA) have shown good performance for topic discovery; however, their results are inconsistent because they suffer from data sparseness and cannot model the word order in a document. Thus, this study presents the use of Kernel Principal Component Analysis (KernelPCA) and K-means clustering in the BERTopic architecture. We have prepared a new dataset using tweets from customers of Nigerian banks and we use this to compare the topic modelling approaches. Our findings showed that KernelPCA and K-means in the BERTopic architecture produced coherent topics with a coherence score of 0.8463.


Introduction
Due to the increasing availability of big data, the need for topic modelling (TM) approaches for topic discovery is increasing [1]. For example, service industries such as banks sell different products and services. Thus, they need to know the main topics of their customers' enquiries, feedback, reviews, and discussions to help them target their resources. Blei et al. [2] identified topic modelling as a type of statistical learning approach for detecting coherent topics in a document. TM is considered an important task for many applications such as aspect extraction in sentiment analysis [3][4][5], topic extraction for user preference in a recommender system [6], document summarization [7], topic discovery in a chat box system [8], and topic extraction for fake news detection [9]. TM has been applied successfully in different domains such as health [10], online learning platforms [11], software engineering [12], and legal documents [13] to discover hidden topics.
With the increasing use of supervised learning algorithms to provide automated systems, TM provides an opportunity to apply unsupervised learning algorithms for topic discovery. This approach is beneficial because it limits the use of a labelled dataset for training purposes, which is labour intensive, time consuming, and expensive to obtain. There are popular topic modelling approaches, namely Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (pLSI), and Latent Dirichlet Allocation (LDA). These models, especially LDA, have shown good performance over the years and several variants have been developed. Examples include the correlated topic model [14] and the hierarchical Dirichlet process model [15]. However, these traditional TM approaches have been criticised for not being able to handle data sparsity, which is especially prevalent in short text. The Bi-Term Model (BTM) was developed to address this problem. This model has received praise for its performance when applied to short text. This is because BTM captures biterms from the whole corpus and, thus, the global co-occurrences of words are captured. The model assumes a pair of words can reveal topics much better than a single word occurrence. For example, biterms such as ATM_card and ATM_network are more revealing than unigrams such as card and network. However, Zhen et al. [16] stated that BTM does not consider the comprehensive semantic dependencies of words and, therefore, does not capture contextual semantics. BTM has also been criticised for its strong assumption (that two co-occurring words will be assigned the same topic label), which limits its performance. Thus, variants of this model have also been proposed, such as Twitter-BTM [17] and GraphBTM [18]. More recently, the use of transformer-based language models has been proposed for TM. An example is BERT (Bidirectional Encoder Representations from Transformers) [19]. However, these approaches are complex to use.
Although topic modelling is becoming more popular, very few studies in the literature have compared TM techniques. In addition, no study was found to have applied TM techniques in the Nigerian banking context. The banking industry is an important sector for every nation's economy and banking is considered a daily activity in society [20]. The Nigerian banking sector relies heavily on the cash-based economy and most buying and selling is performed with physical cash [21]. Thus, the banks can benefit from topic modelling, as it helps them monitor their customers' preferences towards their products and services. In addition, this will help them understand what their customers are talking about. Although traditional approaches have shown good performance, their performance is inconsistent. Thus, there is a need to conduct further experimental comparative studies of TM approaches to validate state-of-the-art TM models. To this end, this study aims to compare topic modelling techniques for topic extraction in the Nigerian banking context. The rest of the paper is organised as follows. Section 2 reviews the literature to provide background knowledge to this study. Section 3 presents the methodology. Section 4 presents and discusses the results and Section 5 provides conclusions and recommendations.

Related Work
Topic models (TM) are considered more appropriate to extract topics because of their ability to extract implicit and explicit coherent topics. TM can be dated back to 1990 when Deerwester et al. [22] proposed Latent Semantic Indexing (LSI). LSI uses a singular value decomposition (SVD) of a large term-document matrix to identify a linear subspace such that the relationship between the term and document is captured. LSI was then adopted by various studies, such as Dewangan et al. [23]. However, LSI is limited as the technique does not assign a probability to the topic. In respect to this, Hofmann [24] developed probabilistic latent semantic indexing (pLSI). pLSI was also criticised for not being able to account for generative probabilistic models of the documents. Thus, Latent Dirichlet Allocation (LDA) was developed by Blei et al. [2] as an improvement.
LDA is a generative probabilistic model that captures the important intra-word/document statistical structure via a mixing distribution. LDA assumes that each document is associated with a topic distribution. Thus, topic assignment strongly relies on local co-occurrence. LDA has been used extensively for topic modelling [25][26][27][28][29]. For example, Çalli and Çalli [30] applied LDA to a dataset containing 10,594 airline customer complaints from two Turkish airlines during the COVID-19 pandemic. Their study generated seven latent topics that the customer complaints were about and used a qualitative human interpretation approach to validate the topics. Several studies [31][32][33][34] employed LDA to extract topics from financial policy statement documents. Moro et al. [35] utilised LDA to discover topics from 219 business intelligence (BI) articles within the banking domain and found that the most prominent topic in the BI banking literature is credit. Westerlund et al. [36] used LDA to generate six topics from 2702 comments of signers of the e-petition against the Bank of America. Tabiaa and Madani [37] utilised LDA to extract topics from eight Moroccan mobile banking app reviews and showed security, services, quality, and interface as the topics customers talked about. Damane [38] applied LDA to generate topics from 26 monetary policy statements of Lesotho's central bank. Bastani et al. [39] utilised LDA to extract 40 topics from a dataset containing 86,803 Consumer Financial Protection Bureau (CFPB) consumer complaints. This helps the CFPB to understand and monitor which aspect of financial service each complaint narrative was about. However, the authors did not evaluate how well LDA performed in their experimentation. Additionally, they used trial and error to determine the number of topics K, which does not reliably identify the optimal value of K. It is worth mentioning that several of the studies reviewed determined their number of topics through trial and error during experimentation. Gan and Qi [40] evidenced the need to select an optimal number of topics, as this enhances predictive ability, isolation between topics, and repeatability, and avoids duplicated topics. Thus, an efficient approach to determining the optimal number of topics should be adopted.
More specifically, Hristova [41] extracted topics from Bulgarian bank chat data collected between January 2019 and April 2020. The author utilised 12,439 cleaned chats represented by a term frequency and inverse document frequency (TF-IDF) matrix to fit LDA. The study produced six coherent topics with a coherence score of 0.66. The result was used to understand the main themes that customers discussed, and the themes were profiled as loans, general information, digital banking, currency operations, identification, and cards/transfer. Despite the success achieved by LDA over the years, the unsupervised model has been criticized as it generates topics that contain irrelevant features and noisy topics [42], especially when dealing with short text such as social media text. Thus, LDA has been reported to suffer from the data sparsity problem. Improved extensions of LDA have been proposed, such as the correlated topic model [14], the hierarchical Dirichlet process model [43], constrained-LDA [44], MaxEnt-LDA [45], automated knowledge LDA [46], LDADE [12], and ontology-LDA [42].

To overcome the data sparsity problem, Yan et al. [47] proposed the Bi-term Topic Model (BTM) for topic modelling, specifically for short text. BTM learns topics by modelling word pairs in the whole corpus. For example, "my sure bank" has the biterms "my sure", "sure bank", and "my bank". This means global occurrences of biterms are captured. BTM was developed with the assumption that two words will be assigned the same topic label if they have co-occurred [48]. However, this study considers that assumption too strong. Consider, for example, a tweet such as "My atm card could not work with bank X machine, not sure if it's the atm network or card issue". In the tweet, the key words are "atm, network, machine, card, issue" and these can have equal global co-occurrence. Topic modelling of tweets such as this suggests that the assumption does not apply in all cases. BTM and its variants, for example, Discriminative-BTM [48], Twitter-BTM [17], and GraphBTM [18], have been shown to outperform traditional topic models such as LDA.
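To make the biterm idea concrete, the following is a minimal Python sketch of biterm extraction for the "my sure bank" example above. It assumes whitespace tokenisation and treats every unordered word pair in a short text as a biterm; the function names `extract_biterms` and `biterm_counts` are illustrative and are not taken from any BTM implementation.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return list(combinations(tokens, 2))


def biterm_counts(corpus):
    """Count biterms over the whole corpus (global co-occurrence).

    `corpus` is a list of strings; tokenisation by whitespace is an
    assumption made only for this illustration.
    """
    counts = Counter()
    for text in corpus:
        # Sort each pair so that (a, b) and (b, a) are counted together.
        counts.update(tuple(sorted(pair)) for pair in extract_biterms(text.split()))
    return counts


biterms = extract_biterms("my sure bank".split())
print(biterms)
```

Counting the pairs over the whole corpus, rather than per document, is what lets the model use global co-occurrence even when each individual text is very short.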
Due to the instability in the performance of the conventional TM approaches, the use of language models is evolving. Bidirectional Encoder Representations from Transformers (BERT) is a deep bidirectional unsupervised language representation model developed by Google [19] and has shown good results in topic extraction. For example, Yanuar and Shiramatsu [49] employed BERT for aspect extraction. They collected Indonesian tourism reviews from TripAdvisor and pretrained BERT with 4220 review sentences (using train batch size 32, max sequence length 128, number of epochs 16, learning rate $3 \times 10^{-5}$, and Adam epsilon $3 \times 10^{-8}$). Their approach used 501 Indonesian amusement park reviews for testing and reported an accuracy of 79.9% and an F1 score of 73.8%. Bensoltane and Zaki [50] compared variants of BERT against bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) for aspect extraction using 2265 Arabic news posts (retrieved from Facebook about the 2014 Gaza attack). They showed that the combination of BERT, bidirectional gated recurrent unit (BiGRU), and CRF achieved the highest performance for aspect term extraction with an F1 score of 88%. However, despite the good performance shown by BERT, Liu et al. [51] stated BERT was undertrained and, thus, proposed RoBERTa, a robustly optimized BERT pretraining approach for topic extraction. They showed RoBERTa performed better than BERT on GLUE and SQuAD. Zhu et al. [52] inserted topic layers into fine-tuned RoBERTa and, thus, proposed a topic augmented language model for topic extraction.

Grootendorst [53] presented BERTopic by leveraging clustering techniques and a class-based variation of TF-IDF to generate coherent topic representations. The author created document embeddings using a pretrained transformer-based language model to obtain document-level information. Thereafter, they applied dimensionality reduction to the document embeddings and created semantically similar document clusters. Finally, the class-based version of TF-IDF was used to extract the topic representation from each cluster. Grootendorst [53] showed the approach produced coherent topics in three different datasets, namely, 20Newsgroups (16,309 news articles across 20 categories), BBC News (2225 documents between 2004 and 2005), and Trump's tweets. Abuzayed and Al-Khalifa [54] compared BERTopic, LDA, and Non-Negative Matrix Factorization (NMF) in three different Arabic newspapers, namely, Assabah, Hespress, and Akhbarona. They showed BERTopic outperformed the other models. Silveira et al. [13] successfully applied BERTopic to perform aspect extraction by investigating the stochastic topic modelling approaches for legal documents. Raju et al. [55] compared LDA, LSA, and BERTopic using the Consumer Financial Protection Bureau dataset. They showed BERTopic outperformed the other models with a coherence score of 0.33.
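The class-based TF-IDF step of BERTopic described above can be illustrated with a short Python sketch. It is a simplified reading of the weighting in Grootendorst [53], where the weight of term t in cluster c is the term's frequency in c multiplied by log(1 + A / f_t), with A the average number of words per cluster and f_t the frequency of t across all clusters. The function names and toy clusters are hypothetical, and the BERTopic library's own implementation may differ in details such as normalisation.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Compute class-based TF-IDF weights.

    `clusters` maps a cluster id to a list of tokenised documents.
    Returns {cluster id: {term: weight}}.
    """
    # Term frequencies per cluster: all documents of a cluster are
    # treated as one long document.
    class_tf = {
        c: Counter(tok for doc in docs for tok in doc)
        for c, docs in clusters.items()
    }
    # Term frequencies over all clusters.
    total_tf = Counter()
    for tf in class_tf.values():
        total_tf.update(tf)
    avg_words = sum(total_tf.values()) / len(class_tf)
    return {
        c: {t: f * math.log(1 + avg_words / total_tf[t]) for t, f in tf.items()}
        for c, tf in class_tf.items()
    }


def top_terms(weights, n=3):
    """Return the n highest-weighted terms of one cluster."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy clusters (hypothetical data, not from the study's dataset).
clusters = {
    0: [["atm", "card", "bank"], ["card", "declined"]],
    1: [["app", "login", "bank"], ["app", "crash"]],
}
weights = c_tf_idf(clusters)
print(top_terms(weights[0]), top_terms(weights[1]))
```

Terms shared by every cluster, such as "bank" in the toy data, receive a lower weight than equally frequent terms that are specific to one cluster, which is why the resulting topic representations tend to be distinctive.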
Despite the various approaches to topic modelling, the literature reviewed shows that there are limited studies on TM in the banking context. Secondly, no examples were found in the literature that applied or compared these TM techniques in the Nigerian banking context. The work of Hristova [41] is the closest research work to this study. However, it differs in aim and data, because that study applied LDA to generate topics from Bulgarian bank chat data.

Methodology
We propose the use of Kernel Principal Component Analysis (KernelPCA) and K-means clustering in the BERTopic architecture for topic modelling. We conducted an experimental comparison of our approach to other topic models such as Latent Dirichlet Allocation (LDA), Latent Semantic Indexing (LSI), Hierarchical Dirichlet Process (HDP), and the variations of BERTopic. This will help to identify how the traditional TM techniques differ from the language topic models that utilise word/document embeddings.

Kernel Principal Component Analysis
The principal components of input data X can be obtained by solving the eigenvalue problem of the covariance matrix. Consider a sample of n observations $x_1, x_2, \ldots, x_n$ with $x_i \in \mathbb{R}^N$. Given that PCA operates on centred data, that is $\sum_{i=1}^{n} x_i = 0$, PCA diagonalizes the covariance matrix C as:

$$C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^{T} \qquad (1)$$

The principal components are then obtained by solving the eigenvalue problem:

$$\lambda v = C v \qquad (2)$$

In cases where the data are not linearly separable, PCA is performed in a dot product space F (feature space) instead of the input space $\mathbb{R}^N$ by mapping from $\mathbb{R}^N$ to F. This can be illustrated as $\phi: \mathbb{R}^N \rightarrow F$, where $\phi$ is the feature mapping of the input data to a high-dimensional space. Then, assuming $\sum_{i=1}^{n} \phi(x_i) = 0$, the covariance matrix C in feature space F can be written as:

$$C = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i) \phi(x_i)^{T} \qquad (3)$$

The principal components are then obtained by solving the eigenvalue problem in Equation (2) above, where the eigenvector v lies in the span of the mapped data in F:

$$v = \sum_{i=1}^{n} \alpha_i \phi(x_i) \qquad (4)$$

Using the inner product between two points, we can create the kernel:

$$K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j) \qquad (5)$$

where $K_{ij} = K(x_i, x_j)$ and $\alpha = (\alpha_1, \ldots, \alpha_n)^{T}$. For the kernel component extraction, we compute the projection of each data sample x onto eigenvector v:

$$v^{T} \phi(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x) \qquad (6)$$

For any eigenvalue $\lambda \neq 0$ of C, the formulation of the eigenvalue problem for the kernel matrix K can thus be defined as:

$$n \lambda \alpha = K \alpha \qquad (7)$$
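As an illustration of the procedure described above, the following pure-Python sketch builds a kernel matrix, centres it in feature space, and extracts the leading kernel principal component by power iteration. The RBF kernel, the value of `gamma`, the power-iteration solver, and the toy data are all assumptions made for this example; the study's actual kernel and settings are not restated here, and in practice a library routine would normally be used instead.

```python
import math


def rbf_kernel_matrix(X, gamma=1.0):
    """K[i][j] = exp(-gamma * ||x_i - x_j||^2) (assumed kernel choice)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj))) for xj in X]
        for xi in X
    ]


def centre_kernel(K):
    """Centre the kernel matrix so the mapped data have zero mean in F."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]  # equals column mean (K symmetric)
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def leading_eigenpair(K, iterations=500):
    """Power iteration for the largest eigenvalue and unit eigenvector of K."""
    n = len(K)
    v = [1.0 + 0.1 * i for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(K[i][j] * v[j] for j in range(n)) for i in range(n))
    return eigenvalue, v


def kernel_pca_first_component(X, gamma=1.0):
    """Return (eigenvalue of centred K, coefficients alpha, projections)."""
    K = centre_kernel(rbf_kernel_matrix(X, gamma))
    eigenvalue, v = leading_eigenpair(K)
    # Scale alpha so that the feature-space eigenvector has unit length.
    alpha = [x / math.sqrt(eigenvalue) for x in v]
    # Projection of training point i: sum_j alpha_j * K[i][j].
    n = len(X)
    projections = [sum(alpha[j] * K[i][j] for j in range(n)) for i in range(n)]
    return eigenvalue, alpha, projections


# Toy data: two well-separated groups of points (hypothetical).
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
eigenvalue, alpha, projections = kernel_pca_first_component(X)
print(eigenvalue, projections)
```

The eigenvalue returned here is that of the centred kernel matrix, which corresponds to n times the eigenvalue of the feature-space covariance matrix. On the toy data, the sign of the first projection separates the two groups.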

K-Means Clustering
K-means clustering is a popular clustering algorithm that discovers patterns within objects using their similar attributes. The algorithm requires the number of clusters K to be predefined, and the choice of initial seed points is important. K-means clustering aims to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean. Given a set X of n observations $(x_1, x_2, \ldots, x_n)$, each a P-dimensional vector, the algorithm follows these steps:

• Declare K initial seed points (initial centroids), each a P-dimensional vector $s_k = (s_{k1}, s_{k2}, \ldots, s_{kP})$, and obtain the squared Euclidean distance between the ith object and the kth seed vector:

$$d^{2}(x_i, s_k) = \sum_{j=1}^{P} (x_{ij} - s_{kj})^{2}$$

• Assign all objects to their closest centroids.

• Move each centroid to the centre point (mean) of the objects assigned to it, and reassign the objects to the closest centroid.

• Repeat the last 2 steps until no objects can be moved between clusters.
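The steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming the seed points are supplied by the caller and that an empty cluster simply keeps its previous centroid; the function names and toy points are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two P-dimensional points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, seeds, max_iter=100):
    """Cluster `points` around the given initial `seeds`.

    Returns (assignment, centroids), where assignment[i] is the index of
    the centroid closest to points[i].
    """
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every object to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Stop when no object moves between clusters.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Move each centroid to the mean of the objects assigned to it.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = k_means(points, seeds=[(0, 0), (10, 10)])
print(assignment, centroids)
```

Because the result depends on the initial seeds, practical implementations usually run the algorithm from several initialisations and keep the best partition.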

Dataset
We employed the data collected in the study of Ogunleye [56], which is publicly accessible in the Kaggle repository (https://www.kaggle.com/datasets/batoog/bankcustomer-tweets-10000). Ogunleye [56] collected a total of 959,000 Nigerian bank customer tweets over a duration of nine months (from 12 May 2019 to February 2020). We utilised 10,000 randomly sampled bank customers' tweets for the purpose of this study. This was done because we wanted distinct tweets and avoided retweets to increase topic coverage. The texts are bank customer tweets directed at the handles of 18 commercial banks in Nigeria. The tweets were in Pidgin English and English. The dataset is particularly challenging not only because of the shortness of the text but also because of the difference in phraseology. Pidgin English is an unofficial language widely used across West African countries such as Nigeria, Ghana, Cameroon, Equatorial Guinea, and Sierra Leone. In Nigeria, Pidgin English is the second most used language across all tribes after English. This is because Nigeria has over 500 different languages. Therefore, Pidgin was adopted as a common language across the tribes. The text (tweet) data were cleaned and pre-processed in Python. Python was adopted due to its prevalence in data science. The language is syntactically simple, straightforward, and easy to learn. Additionally, Python is open source, free to use, and has a rich ecosystem of libraries for scientific computing.

Experimental Setup
The dataset went through pre-processing. The natural language toolkit (NLTK) library [57] was used for tokenisation, removal of stop-words, word lemmatisation, and part of speech (POS) tagging. The POS were extracted using n-grams for n = 1, 2, 3. This means the unigrams, bigrams, and trigrams were all considered. It is worth noting that steps such as word lemmatisation and POS tagging were applied to the English tokens in the dataset. The topic models were configured to produce 10 topics with 20 terms per topic. The Gensim package [58] was used to implement the traditional topic models. The LDA model chunksize was set to 1740, with iterations at 1000 and passes at 20. Both alpha and beta Dirichlet priors were set to 'auto' to allow the model to automatically learn the best values for the hyperparameters during training. The LSI model chunksize was set to 1740 and the power iteration at 1000. The HDP model chunksize was set to 256, alpha at 1, and eta at 0.01. The BERTopic architecture [53] comprises the embedding, dimensionality reduction, clustering, and c-TF-IDF modules. In the embedding component, BERT, SBERT [59], and FinBERT [60] were used to create word/document embeddings. Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Isometric Mapping (ISOMAP), Singular Value Decomposition (SVD), and Uniform Manifold Approximation and Projection (UMAP) were used for dimensionality reduction. Table 1 below shows the parameter setup of the dimensionality reduction algorithms. For the third component, the clustering methods used were K-means, Spectral, Agglomerative, MeanShift, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), and Ordering Points to Identify the Clustering Structure (OPTICS). Table 2 below shows the parameter setup of the cluster analysis algorithms.
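To illustrate the n-gram step for n = 1, 2, 3, the following is a small Python sketch that builds unigrams, bigrams, and trigrams from a list of tokens. It assumes the tokens have already been cleaned, and it joins the words of an n-gram with an underscore purely as an illustrative convention; the helper names are hypothetical and this is not the study's pre-processing code.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of `tokens`, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_ngrams(tokens, max_n=3):
    """Return unigrams, bigrams, ..., up to `max_n`-grams in one list."""
    return [gram for n in range(1, max_n + 1) for gram in ngrams(tokens, n)]


print(all_ngrams(["atm", "card", "issue"]))
```

Texts shorter than n tokens simply contribute no n-grams of that order, which matters for very short tweets.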
For evaluation purposes, there is no agreed standard metric; however, the coherence score is used to justify the performance of the topic models. Topic coherence is a common evaluation metric for TM techniques and it is understood that the higher the coherence score, the better [61]. Röder et al. [62] showed that the coherence score has the highest correlation with human judgements of topics. However, Asghari et al. [63] argued that evaluating TM methods using a coherence score only is not enough. This needs to be complemented with topic quality and interpretability. In addition, there is no standard performance cut-off for the coherence score to determine the best performing topic model. In the next section, we present the results of the experiment.
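As an illustration of how a coherence score rewards terms that co-occur, the following Python sketch computes a UMass-style coherence for one topic from document co-occurrence counts. It follows the commonly cited form, the sum over ordered term pairs of log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing the given terms. This is an assumption-laden simplification: library implementations may aggregate or smooth differently, the cv measure reported in this study is computed differently, and the toy documents are hypothetical.

```python
import math


def umass_coherence(topic_terms, documents):
    """UMass-style coherence of one topic.

    `topic_terms` is an ordered list of the topic's top terms and
    `documents` is a list of tokenised documents. Every topic term is
    assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for m in range(1, len(topic_terms)):
        for l in range(m):
            joint = doc_freq(topic_terms[m], topic_terms[l])
            score += math.log((joint + 1) / doc_freq(topic_terms[l]))
    return score


docs = [["atm", "card"], ["atm", "card"], ["app", "login"], ["app", "login"]]
print(umass_coherence(["atm", "card"], docs))
print(umass_coherence(["atm", "app"], docs))
```

In the toy corpus, the topic whose terms appear together in the same documents scores higher than the topic whose terms never co-occur.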

Results
Table 3 below presents the coherence scores of the topic modelling techniques. LDA has a coherence score of 0.3919 (cv) and 9.1174 (umass), and produced interpretable terms as shown in Table 4 below.

The above result showed HDP has the highest coherence score (cv), by far, of 0.6347. However, HDP achieved the highest negative coherence score (umass) of 16.5378. The terms were investigated manually as shown in Table 5 below. The HDP terms contain some unwanted words such as 'la', 'aw', and 'v' despite the corpus filtering, cleaning, and pruning. In addition, we observed some terms were common amongst the topics. For example, the term 'help' appeared in eight different topics. LSI, in contrast, achieved a marginally lower coherence score (cv) and produced low-quality topics, as shown in Table 6 below. This is because the LSI terms are repeated across the topics. For example, terms such as 'transaction' and 'access' appeared in at least three different topics, 'help' appeared in five different topics, and 'money' and 'account' appeared in six different topics. LSI struggled with the distribution of terms. The model produced overlapping semantic terms across the different topics because LSI tried to map the relationship between the term and document to detect contextual semantic terms. In summary, LDA produced terms that are more interpretable. Thus, we tuned the number of topics for LDA to find the optimal value, varying the parameter over the range from 5 to 250. As shown in Figure 1 below, LDA produced coherence scores in a narrow range from 0.3 to 0.65.
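The manual check for terms repeated across topics, such as 'help' appearing in several topics, can be automated with a few lines of Python. The sketch below counts in how many topics each term appears; the function names and the toy topics are hypothetical and do not reproduce the terms in Tables 5 and 6.

```python
from collections import Counter


def term_topic_counts(topics):
    """Count, for every term, the number of topics it appears in."""
    counts = Counter()
    for terms in topics:
        counts.update(set(terms))  # a term counts once per topic
    return counts


def overlapping_terms(topics, min_topics=2):
    """Return the terms that appear in at least `min_topics` topics."""
    return {t: c for t, c in term_topic_counts(topics).items() if c >= min_topics}


toy_topics = [
    ["help", "account", "money"],
    ["help", "transfer", "money"],
    ["atm", "card", "help"],
]
print(overlapping_terms(toy_topics))
```

A large share of overlapping terms is a sign that the topics are poorly separated, even when a coherence score looks acceptable.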
The BERTopic architecture was also employed, and the previously stated components were deployed during the experimentation. Table 7 below shows the approaches adopted in the BERTopic model and presents their results. The experimental comparative result shows that BERTopic with BERT for embeddings, KernelPCA for dimension reduction, and K-means clustering achieved the highest coherence score of 0.8463. The kernel function can deal with non-linear data. The kernel method handles non-linearly separable data by mapping them to a higher-dimensional space in which they can be linearly separated [64]. This kernel trick has been applied in the PCA to help produce five principal components that were fed into the partitioned clusters. In addition, the approach produced coherent terms as shown in Figure 2 below. The topics produced were around USSD code, general enquiry about account, transaction problem, mobile app, and ATMs. Some of the topics generated are very similar to those produced in the study of Hristova [41], for example, general information, digital banking, and cards/transfer. It is worth noting that during the experimentation, FinBERT was also utilised to generate the document embeddings. This is because FinBERT was trained with financial data. Unfortunately, the FinBERT embeddings did not improve the result. Similarly, the SBERT embeddings did not improve the result. In addition, it is worth noting that the number of topics was not specified for the results shown in Table 7. Thus, the number of topics produced was above 100. In practice, this is not usually the case, as it is not economical for service industries such as the banks to have over 100 topics to monitor. Based on that assumption, this study inputted 10 as the number of topics for BERTopic. The result produced showed a significant drop in the coherence score. For example, BERTopic (BERT, UMAP, HDBSCAN, topic = 10, and terms = 20) produced 0.54, which is lower than 0.67 when topics and terms are not specified. Specifically, BERTopic (BERT, KernelPCA, K-means, topic = 10, and terms = 20) produced 0.76 compared to 0.8463 when the topics and terms are not specified. This is in line with the studies of Hristova [41], Abuzayed and Al-Khalifa [54], and Grootendorst [53] that showed BERTopic outperformed other models. Most importantly, the combined approach of KernelPCA and K-means has been shown to outperform other combined dimensionality reduction and clustering techniques. For example, Lyu et al. [65] performed a comparative study for solar irradiance forecasting and showed the combination of KernelPCA and K-means clustering outperforms combined KernelPCA and spectral clustering, KernelPCA and BIRCH, and KernelPCA and agglomerative clustering.

Conclusions
Since labelled datasets are expensive, labour intensive, time consuming, and not readily available for supervised learning algorithms in this context, we validate a totally unsupervised approach for extracting topics. This is vital for service industries to understand emerging topics in their customer discussions. The purpose of the current study was to determine the state-of-the-art topic modelling technique in the Nigerian banking context. To this end, we have compared TM approaches and ascertained the use of kernel principal component analysis and K-means clustering in the BERTopic architecture as the state-of-the-art topic modelling technique. Our experimental results showed the LDA-produced coherence score ranged from 0.3 to 0.65 with the number of topics being set to a range between 5 and 250. BERTopic (BERT, KernelPCA, and K-means clustering) achieved a coherence score of 0.76 when the number of topics is set to 10 and the number of terms is set to 20 (against 0.8463 when the number of topics and terms is left unspecified). Thus, our approach provided a reasonably high performance even when restricting the number of topics to a more manageable level. In summary, the use of BERTopic as a language model is very promising as the topics produced were interpretable and of high-quality terms. LDA with a well-processed corpus performed well too, but not better. The traditional topic models struggled with the different subtleties in Pidgin English, which depend on the native language of the tweeter. Most notably, HDP and LSI produced overlapping terms in this context. Thus, this study proposes the use of BERTopic and recommends the application of BERTopic in various domains to compare against the traditional approaches.

Contributions, Limitation, and Future Work
In this study, we have prepared a new dataset using tweets from Nigerian bank customers. This dataset will enhance the research in topic discovery in the Nigerian banking context. Secondly, we explored the use of a language model (BERT), which requires little or no pre-processing of text (input) data to create document embeddings. Lastly, we demonstrated the use of kernel principal component analysis and K-means clustering for extracting topics. This study provided a comprehensive explanation of the methods deployed. To the best of our knowledge, this is the first study to compare TM approaches in the Nigerian banking context. Our study is useful for service industries to discover emerging topics. Specifically, the banks can adopt our approach to monitor their customers' preferences towards their products and services. For example, our findings showed topics around USSD code, general enquiry, mobile app, and ATMs were the leading themes in the bank customer discussion.
A major limitation is that there are limited lexical resources publicly available to support the pre-processing of Pidgin terms. Most notably, there is a need to pre-train the language models with Pidgin English words/documents to provide opportunities for research in that area. In the future, there is a need to explore the use of the kernel trick with spectral clustering, as the model has a built-in dimensionality reduction component. This is because, in our experiment, the combination of KPCA and spectral clustering also yielded a competitive result. Another area to explore in the future is the use of domain knowledge to generate pre-defined labels such that BERTopic can be deployed as a semi-supervised learning algorithm to generate N coherent topics. In addition, there is a need to create a labelled set to use supervised learning algorithms for topic discovery. This will enhance the evaluation processes by comparing the model prediction against the ground truth.

Figure 1. Plot of topics versus coherence score.