Sentiment Analysis of Digital Banking Reviews Using Machine Learning and Large Language Models

Alawaji, Raghad; Aloraini, Abdulrahman

doi:10.3390/electronics14112125


by Raghad Alawaji * and Abdulrahman Aloraini

Department of Information Technology, College of Computer, Qassim University, Buraydah 52571, Saudi Arabia

* Author to whom correspondence should be addressed.

Electronics 2025, 14(11), 2125; https://doi.org/10.3390/electronics14112125

Submission received: 19 April 2025 / Revised: 17 May 2025 / Accepted: 21 May 2025 / Published: 23 May 2025


Abstract

Sentiment analysis of digital banking reviews aims to assess customer satisfaction and support service enhancement. Despite increasing attention to sentiment analysis across domains, Arabic banking reviews remain underexplored. To bridge this gap, we introduce a dataset of 4922 Arabic reviews from three major Saudi digital banks, annotated with three sentiment categories (positive, negative, and conflict), providing actionable insights for banks. We evaluate the dataset using several machine learning models and four large language models (LLMs), namely GPT 3.5, GPT 4, Llama-3-8B-Instruct, and SILMA, using zero-shot (no labeled examples) and few-shot (a few labeled examples) learning strategies. Our results show that GPT 4 performs best among LLMs in few-shot settings, while traditional models still outperform LLMs, with a Voting Classifier achieving 90.24% accuracy. This study contributes a domain-specific dataset and comparative analysis to support research and practical improvements in Arabic digital banking services.

Keywords: Natural Language Processing; Arabic sentiment analysis; digital banking; machine learning; large language models

1. Introduction

Advances in the digital era and the widespread availability of mobile applications have generated a substantial amount of data in different formats, such as text, audio, and video. People can freely express their opinions and thoughts about daily events or services using these applications. The COVID-19 pandemic brought a noticeable change, encouraging people and institutions to move from physical to online operations to mitigate the risk of virus transmission. This shift opened new doors of connectivity and established online financial service applications as the main technology for financial transactions [1,2]. In February 2025, the number of social media users reached 5.24 billion [3]. Despite privacy concerns and the challenges of unstructured data, the data generated daily by internet users open a research direction: analyzing these data to understand user behavior and opinions [4].

The digital banking sector represents a pivotal and emerging industry that utilizes technological advancements to facilitate the transition of financial processes from physical to digital platforms. This transition enhances convenience and accessibility for users, replacing traditional face-to-face interactions with advanced technology [5]. This line of communication provides fast and comfortable access to banking services. Additionally, customer feedback on these services is shared through online platforms such as mobile banking applications, forms, and blogs, reflecting their satisfaction levels. Considering these comments carefully is essential to enhancing customer loyalty and addressing current pain points during this digital transformation process.

Sentiment analysis (SA) is an important field in Natural Language Processing (NLP) that classifies the polarity of a given text by automatically analyzing individual attitudes about specific matters. This field employs various methods, including lexicon-based, machine learning (ML), and hybrid approaches. The lexicon-based approach is an early technique that uses a predefined dictionary in which each sentiment word is associated with a polarity score representing the opinion expressed by the word [6]. The ML approach can be divided into supervised, semi-supervised, and unsupervised learning depending on the availability of a labeled dataset [7]. In supervised classification, classes are assigned based on the polarity conveyed in opinions (i.e., positive or negative). The classifiers are trained on labeled samples, producing a model that links features with the associated class. In contrast, unsupervised approaches are employed when no annotated dataset is available. The semi-supervised approach is used when a largely unlabeled corpus contains a small amount of labeled data. Lastly, hybrid approaches combine lexicon-based and ML techniques.

Recently, a notable advancement occurred in the deep learning field with the introduction of Bidirectional Encoder Representations from Transformers (BERT) [8]. It employs a transformer-based architecture that is pretrained on a large data corpus and then fine-tuned for downstream tasks with a domain-specific dataset. However, although these models achieve promising results in a range of NLP tasks, they require substantial amounts of data and careful training to achieve optimal performance [9]. More recently, a significant leap occurred with the emergence of LLMs such as ChatGPT (https://openai.com/blog/chatgpt (accessed on 20 May 2025)) and LLaMA [10], which can generate human-like responses and generalize well without additional training. They leverage in-context learning to produce accurate results based only on a well-formulated prompt, with either zero-shot or few-shot examples. The progress of these models enables a wide range of NLP tasks such as mathematical reasoning, machine translation, information extraction, and sentiment analysis, as illustrated in Figure 1 [11].

Many previous studies on this topic, such as [12], provide a comprehensive overview of the techniques and datasets used for SA. Other studies [13,14] present practical applications of SA tasks across different contexts and languages. However, although Arabic is the fourth most popular language online [15], studies on sentiment analysis in Arabic have been scarce, which can be attributed to the unique characteristics of the language. In terms of mobile applications, the number of downloads reached 255 billion in 2022 [16]. Reviews and app ratings are essential factors when people decide whether or not to download an app. Furthermore, application reviews directly reflect customer satisfaction with the service provided, offering a rich source of information that product owners can use widely, especially in opinion mining [17].

Few studies investigate customers’ satisfaction with mobile applications in the Arabic language. In particular, there is limited research on sentiment analysis of Saudi digital banking using various machine learning classifiers. To address this gap, this paper contributes to this area by:

Building an annotated dataset for Arabic user reviews of digital banking applications in Saudi Arabia.
Assessing various ML classifiers for sentiment classification of user reviews into positive, negative, and conflict classes.
Conducting an in-depth analysis of customer feedback regarding banking services to uncover the key concerns associated with each category.
Evaluating the effectiveness of LLMs for Arabic SA tasks with a multi-label, domain-focused dataset.

The rest of this paper is organized as follows. Section 2 highlights the related work. Section 3 describes the proposed methodology. Section 4 explains the results and findings of the proposed method. Finally, Section 5 concludes the work and suggests directions for future work.

2. Related Work

The rapid growth of user-generated data from applications and social networking platforms such as Twitter, YouTube, and Google Maps comments [15,18,19] has drawn significant attention to developing frameworks and models capable of identifying user opinions, which helps gain valuable insights into user perception and enhance user experience. This related work section is divided into three subsections: Arabic Sentiment Analysis, Sentiment Analysis in the Mobile Application Domain, and Sentiment Analysis Using Large Language Models.

2.1. Arabic Sentiment Analysis

This section presents a review of recent research on sentiment analysis in the Arabic language. Ref. [20] aimed to determine the sentiment polarity of reviews gathered from e-commerce magazines and blogs in the Arabic language. The researchers collected the corpus manually from various web resources and analyzed the sentiment using a support vector machine (SVM). In [21], the researcher proposed a hybrid system that aims to enhance the performance of the SA task by combining different classifiers. The proposed approach can be a significant step towards improving the classification accuracy for online movie reviews in the Iraqi dialect. The researcher in [15] addressed the difficulty of assessing video quality by focusing on user interaction in YouTube comments and employing a variety of classifiers to evaluate the sentiment of comments in the Arabic language. They discussed the effects of enabling the term frequency–inverse document frequency (TF-IDF) feature-extraction technique and demonstrated that the Naive Bayes (NB) classifier still performs well even without TF-IDF. Ziani et al. [22] demonstrated the effectiveness of combining three methods, Random Sub Space algorithms, Genetic Algorithms, and SVM, to avoid overfitting in the classification of user review polarity. Furthermore, the Ensemble Learning technique used in [23] shows promise in classifying sentiment into positive and negative classes using a Voting Classifier. In [24], NB, SVM, and Maximum Entropy were trained on an Arabic corpus consisting of 2000 Moroccan tweets. Two experiments were conducted: the first measured the performance of each single classifier, while the second was based on voting and stacking Ensemble Learning approaches. The results showed that the ensemble approach performed better than each individual classifier.

2.2. Sentiment Analysis in Mobile Application Domain

This section provides an overview of the significant concepts and findings related to mobile review SA. Previous research has examined the polarity of user opinions in various domains, including mental health [25] and tourism [26], at either the sentence or the document level [27].

In [28], two SA approaches were used to investigate the major issues affecting the top seven African mobile e-commerce applications. The study utilized both Linguistic Inquiry Word Count (LIWC), a lexicon-based approach that predicts sentiment from a predefined list of words, and ML to classify user reviews. In addition, the authors conducted text analysis to identify the positive and negative factors affecting mobile e-commerce applications. According to their findings, LIWC is the best-performing method, with an F1 score of 86.7%. Furthermore, the authors in [29] employed sentiment analysis through an NB classifier and topic modeling using latent Dirichlet allocation (LDA) to analyze customer reviews of mobile banking applications. LDA is a probabilistic model that generates a summary collection of terms by discovering semantic structures in large text datasets. They used cross-validation with different k-values to assess the performance of NB, which achieved its highest accuracy of 86.76% with k = 5. In [30], the researcher examined users’ perceptions of Saudi Arabian government mobile applications. The dataset, which consisted of 8000 reviews, was acquired from the App Store and the Google Play Store. Common classifiers, such as decision trees, SVM, NB, and K-Nearest Neighbor (KNN), were used. The KNN classifier achieved the highest accuracy with 78.46%. Andrian et al. (2022) [31] studied SA in Indonesian digital banking using ML, gathering data from Twitter and the Google Play Store. Similarly, the authors in [32] collected 55,059 English-language reviews with positive and negative sentiment from the Google Play Store. Furthermore, Ref. [33] evaluated customer satisfaction with banking services’ mobile applications. They used a newly developed dataset of Arabic user reviews collected from eight different Yemeni banking apps on Google Play. The findings indicate the effectiveness of the NB classifier, which achieved higher accuracy, recall, precision, and F1-score than the other applied techniques. A dataset gathered in [34] using a web scraper and Google Forms yielded more than 18,200 reviews in Algerian dialect, Arabic, and French. The different languages were handled separately, following a specific process to distinguish between algD and Arabizi reviews. For the analysis, two approaches were used, ML and a lexicon-based approach, which produced an accuracy of 72% for the SVM and 80% for the lexicon-based approach. The authors in [35] conducted an SA to identify areas for improvement in the food-delivery sector, taking into account the impact of the COVID-19 pandemic. The researchers collected a dataset of reviews from the Talabat application. They evaluated the performance of two classifiers, Decision Tree (DT) and SVM, before and after cleaning and preprocessing the data. The cleaning step included replacing dialectal phrases with Modern Standard Arabic and correcting misspellings. The preprocessing phase involved normalization, stopword removal, tokenization, POS tagging, and stemming. The results showed that the accuracy of DT improved by 4% after applying the suggested techniques, while SVM showed a slight improvement of 2%.

2.3. Sentiment Analysis Using Large Language Models

This section delves into previous research that evaluated LLMs within the context of SA classification.

The researcher in [36] employed the GLUE benchmark to assess ChatGPT’s ability in various understanding tasks, including SA. The results indicated performance comparable to four types of fine-tuned BERT models. Ref. [37] investigated binary and multi-class sentiment classification for 34 languages, including high-, medium-, and low-resource languages. GPT 3.5 achieved an average F score of 67.22% for binary classification and 51.17% for three-way classification in the zero-shot setting. Another study [38] compared the performance of six variants of GPT models in NLP tasks, showing identical performance in aspect-based sentiment analysis with zero-shot evaluation and improved performance in a few-shot setting for a specific variation of the SemEval2014-Restaurant dataset. Furthermore, extracting polarity from sentences was widely investigated in previous research using pre-trained language models [39,40]. While these models excel in capturing syntactic and semantic meaning, achieving optimal performance often demands extensive fine-tuning efforts [41]. With regard to the Arabic language, Ref. [42] evaluated the performance of the ChatGPT turbo model alongside BLOOMs in various Natural Language Understanding tasks, including SA. The findings indicate that ChatGPT outperformed the BLOOMs model in all shot settings. In zero-shot settings, ChatGPT achieved 58% Macro-F1, whereas BLOOMs achieved 43.72%. However, both models were surpassed by two fine-tuned dedicated Arabic models. Furthermore, the study highlights that increasing the number of shots does not guarantee better model performance, as this depends on the specific task and model architecture. Another study [43] evaluated the ability of ChatGPT in topic-based sentiment classification using Arabic tweets. Notably, the study found that GPT 4 outperformed GPT 3.5 with a 16.04% improvement in F1 score, whereas SVM achieved a superior result with an F1 score of 92%.

The literature provides valuable information on sentiment-analysis tasks, especially for mobile banking applications. This review began with Arabic sentiment analysis (ASA), moved to the mobile application sector, and then highlighted current work on the use of LLMs in SA tasks. Although there have been previous attempts to conduct sentiment-analysis experiments in the banking domain, with studies in Indonesian [29,31], English [32], and Arabic [33], significant gaps remain in the examination of reviews within the digital banking sector for the Arabic language.

Our research differs from previous studies by conducting an in-depth analysis of mobile reviews of digital banking in the Arabic language. We analyze 4922 customer reviews for three Saudi banks presenting a domain-specific dataset. Previous datasets, such as those introduced in [30,35], provide insights into governmental mobile applications and food-delivery services with sizes of 7759 and 30,948 reviews, respectively. Both datasets use three sentiment labels: positive, negative, and neutral. Our dataset differs by including three sentiment classes: positive, negative, and conflict. The inclusion of the conflict label enriches the dataset by capturing cases where satisfaction and dissatisfaction occur within the same review. Furthermore, we investigate the effectiveness of LLMs for Arabic SA tasks using zero-shot and few-shot settings to compare their results with those obtained from traditional ML classifiers.

3. Methodology

In this section, we describe the methodology of Arabic multi-label classification for digital banking customer reviews through two experiments. This includes the steps for dataset collection, preprocessing, and the classification models used. The first experiment used five machine learning classifiers as well as an ensemble Voting Classifier to enhance performance. For feature extraction, we used TF-IDF with unigrams. In the second experiment, we evaluated the ability of LLMs in digital banking SA. Finally, we evaluated the performance of both experiments using accuracy, F-score, recall, and precision. Figure 2 shows a flowchart of the proposed framework.

3.1. Data Collection

The dataset used in this research was collected from user reviews of Saudi digital bank applications available on the Google Play platform. The banks included in the study were STC Pay, Alinma Pay, and UrPay. We ensured that only reviews in the Arabic language were included. The collection criteria were based on retrieving the most relevant and newest comments. To achieve this, we used a Google Play scraper library (version 1.2.7) in Python. In total, 5387 reviews were collected from the three different banks. The data-collection steps are presented in Figure 3.
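The Arabic-only criterion can be illustrated with a small filter. This is a minimal sketch and not the collection code used in the study: the `content` field name, the 0.5 threshold, and the use of the basic Arabic Unicode block as a language test are all assumptions made for illustration.

```python
import re

# Characters from the basic Arabic Unicode block (U+0600 to U+06FF).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")
LATIN_CHAR = re.compile(r"[A-Za-z]")


def is_mostly_arabic(text: str, threshold: float = 0.5) -> bool:
    """Return True when Arabic-block characters make up at least
    `threshold` of the Arabic plus Latin characters in the text."""
    arabic = len(ARABIC_CHAR.findall(text))
    latin = len(LATIN_CHAR.findall(text))
    total = arabic + latin
    if total == 0:
        return False  # digits, emoji, or empty text only
    return arabic / total >= threshold


def keep_arabic_reviews(reviews: list[dict]) -> list[dict]:
    """Keep only reviews whose (hypothetical) `content` field is mostly Arabic."""
    return [r for r in reviews if is_mostly_arabic(r.get("content", ""))]
```

A script-based test of this kind cannot separate Arabic from other languages written in the Arabic script, so it would only serve as a first pass before manual review.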

3.2. Data Annotation

The process of annotating a corpus involves the assignment of a label or tag to a given text phrase to provide it with contextual information and prepare it for machine learning models. Accurate annotation is crucial for text classification to perform well. Therefore, three Arabic native speakers annotated the dataset using the following labels: positive, negative, and conflict. Annotators received detailed guidelines outlining the definitions and examples of the three sentiment categories. The positive class contains comments reflecting user satisfaction and appreciation for banking services. Conversely, the negative class encompasses reviews that highlight problems, gaps, or unsatisfied users with current banking services. The conflict label means that both positive and negative opinions are included. In cases of disagreement, the final label was determined by majority vote (in cases of full disagreement, a consensus meeting was held to resolve the label). After reviewing the comments to remove irrelevant reviews, the final dataset consists of 4922 reviews. These reviews were categorized as follows: 2020 are positive, 2688 are negative, and 214 are classified as conflict.
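The label-resolution rule described above can be sketched as follows. The function name and the use of `None` to flag reviews that need a consensus meeting are illustrative choices, not part of the original annotation tooling.

```python
from collections import Counter
from typing import Optional

LABELS = {"positive", "negative", "conflict"}


def resolve_label(votes: list[str]) -> Optional[str]:
    """Return the majority label among the annotators' votes.

    Returns None when no label has a strict majority (for three
    annotators, this means all three disagree), signalling that the
    review should go to a consensus meeting.
    """
    unknown = [v for v in votes if v not in LABELS]
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```

With three annotators, two matching votes are enough to fix the label, and a three-way split returns `None`.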

Additionally, the reviews are distributed among the banks with 1604 reviews from Alinma Bank, 1632 reviews from STC, and 1686 reviews from UrPay. Figure 4 presents a detailed breakdown, showing the number of reviews for each label within each bank. Alinma Bank has the fewest conflict reviews, while UrPay has the highest number with 117 reviews. Additionally, there is a variation in the distribution of positive and negative reviews among the banks, highlighting the need to address class imbalance effectively. Furthermore, Table 1 shows examples of annotated reviews from the corpus.

3.3. PII Anonymization and Censoring

Personally identifiable information (PII) refers to data that can be used to identify or describe an individual on social networks. This includes information such as name, age, photos, phone number, date of birth, and any other information that can be linked to a person’s identity. The widespread availability of such information and the development of new technological tools pose significant challenges to data security and privacy [44]. Malicious actors can exploit these data for nefarious purposes. Additionally, the risk of data exposure has increased with the development of AI-based systems [45]. This issue needs to be addressed to ensure the secure use of digital assets. In our dataset, we ensure user anonymity by removing the following PII: usernames, user images, and user IDs.
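Removing these three fields could look like the sketch below. The key names `userName`, `userImage`, and `userId` are hypothetical and would need to match the schema of the scraped records.

```python
# Hypothetical field names; adjust them to the real record schema.
PII_FIELDS = ("userName", "userImage", "userId")


def anonymize(review: dict) -> dict:
    """Return a copy of the review without the PII fields.

    The input dictionary is left unchanged.
    """
    return {key: value for key, value in review.items() if key not in PII_FIELDS}
```

Dropping fields does not cover PII typed inside the review text itself, such as a phone number, which would require a separate pass.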

3.4. Data Preprocessing

Data preprocessing is a crucial part of any ML pipeline; it converts the raw text into a machine-readable format. Ensuring the quality and reliability of data enhances the performance of a given classifier. In this study, the dataset was written in the Arabic language, which is known for its morphological richness and dialects, indicating the need for further processing. Therefore, we incorporate NLP techniques including cleaning, normalization, stop word removal, and tokenization.

Data cleaning: This process involves eliminating unnecessary elements that do not contribute to polarity classification, such as English letters, numbers, punctuation marks, extra spaces, and URLs.
Normalization: The primary goal of normalization is to standardize Arabic characters that have several written forms by converting each variant into a single representation. Normalization also removes elongation characters and collapses consecutive repetitions of the same character into a single occurrence.
Tokenization: Dividing a sentence of strings into a list of words, called tokens, separated by delimiters.
Stop word removal: Stop words are a group of frequently used, less-meaningful terms that appear regularly in natural language. A substantial list of Arabic stop words can be found in the library of NLTK [46].
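The four steps above can be sketched with regular expressions. This is an illustrative pipeline under stated assumptions rather than the exact implementation: the specific character mappings, the rule of collapsing runs of three or more repeated characters, and the tiny stop-word set (standing in for the NLTK Arabic list) are all assumptions.

```python
import re

# Tiny illustrative stop-word set; the study uses the NLTK Arabic list.
# Entries should be normalized the same way as the text.
STOP_WORDS = {"في", "من", "هذا", "و"}

URL = re.compile(r"https?://\S+")
DIACRITICS = re.compile(r"[\u064B-\u0652]")      # tashkeel marks
TATWEEL = "\u0640"                                # elongation character
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")   # anything but Arabic letters/space
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
REPEATS = re.compile(r"(.)\1{2,}")               # 3+ repeats of one character


def clean(text: str) -> str:
    """Remove URLs, diacritics, tatweel, Latin letters, digits, punctuation."""
    text = URL.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize(text: str) -> str:
    """Map letter variants to one form and collapse elongated runs."""
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return REPEATS.sub(r"\1", text)


def preprocess(text: str) -> list[str]:
    """Clean, normalize, tokenize on whitespace, and drop stop words."""
    tokens = normalize(clean(text)).split()
    return [token for token in tokens if token not in STOP_WORDS]
```

Diacritics are removed before the non-Arabic filter so that they are deleted rather than replaced by spaces, which would split words.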

3.5. Feature Extraction

ML requires numerical input to perform mathematical computations. Therefore, after preprocessing textual data, it is essential to convert them into numerical features. TF-IDF is a statistical model that assigns a weight to each word to reflect its importance to a document within the corpus. TF measures how frequently a word appears in a specific document, while IDF measures how rare the word is across all documents. Words with the highest weights are considered the most significant [47]. For this study, we used TF-IDF with unigrams as the feature-extraction method. In an N-gram, a ‘gram’ refers to an individual word and N is the number of consecutive words treated as a single feature; with unigrams (N = 1), each word is its own feature.
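To make the weighting concrete, the sketch below computes unigram TF-IDF with the textbook formula, tf = count / document length and idf = log(N / document frequency). Library implementations typically add smoothing and normalization, so their values will differ; the function name and input format are illustrative.

```python
import math
from collections import Counter


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute textbook TF-IDF weights for tokenized documents.

    Each document is a list of unigram tokens; the result holds one
    {term: weight} dictionary per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A word that appears in every review receives a weight of zero under this formula, while a word repeated within a single review and absent elsewhere receives a high weight.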

3.6. Handling Class Imbalance

Our dataset consists of three classes, positive, negative, and conflict, with varying class distributions. An imbalanced dataset poses a challenge in multiclass classification and can degrade the performance of ML models. To address this, we use the class weights method to handle the imbalanced distribution in the dataset. This method encourages the model to focus on underrepresented classes by assigning higher weights to minority classes during the training process [48].
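One common way to derive such weights is the inverse-frequency heuristic, weight = n_samples / (n_classes × class count). Whether this exact heuristic was used in the study is an assumption; the sketch applies it to the class counts reported for the dataset.

```python
def balanced_class_weights(counts: dict[str, int]) -> dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Class counts reported for the annotated dataset (4922 reviews in total).
counts = {"positive": 2020, "negative": 2688, "conflict": 214}
weights = balanced_class_weights(counts)
```

Under this heuristic the conflict class, with only 214 reviews, receives a weight several times larger than either majority class.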

3.7. Classification Models

In this section, we provide a brief description of the five common ML classifiers chosen to perform the Arabic SA: XGBoost, DT, Logistic Regression (LR), Random Forest (RF), and SVM. Additionally, to further improve classification performance, we evaluate Ensemble Learning techniques, particularly the Voting Classifier, which combines predictions from multiple base estimators to improve the generalizability and accuracy of the model.

SVM: This is a common type of supervised machine learning algorithm used for classification tasks. It seeks the optimal hyperplane that separates the inputs into distinct categories in feature space. Depending on the kernel type, SVM can also solve nonlinear classification problems.
LR: This is a statistical method that originates from the linear regression model. It models the probability of a categorical output as a function of one or more independent variables, providing an analytical overview of the dataset.
DT: A classification algorithm based on a simple decision rule. It involves decomposing the dataset into smaller subsets, creating a tree-like structure. Through recursive division, the algorithm continues to separate the data until the leaf node is reached, achieving a homogeneous grouping of data.
RF: Random Forest is an Ensemble Learning technique that involves building a large number of decision trees during training. Each tree generates an output for a given input, and the final prediction is based on the majority voting of the results from each tree. It is known for its flexibility and its ability to handle large, multidimensional datasets.
XGBoost Classifier: A gradient boosted decision tree model called XGBoost was introduced by Chen and Guestrin [49]. It builds multiple decision trees sequentially, where at each iteration, a new tree is added to correct the errors of its predecessors.
Voting Classifier: A type of Ensemble technique that learns by combining baseline models and predicting the output based on the class with the highest probability, thereby enhancing overall performance. There are two types of Voting Classifiers: hard voting and soft voting. Hard voting determines the final prediction by selecting the class with the majority vote, while soft voting relies on average probabilities to identify the final class. For this study, we implemented hard voting to aggregate predictions from SVM, RF, and LR.
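The difference between the two voting schemes can be shown with a small sketch. The function names and the probability values are made up for illustration; only the aggregation rules follow the description above.

```python
from collections import Counter


def hard_vote(predictions: list[str]) -> str:
    """Return the most frequent predicted label.

    On a tie, the label that was seen first wins.
    """
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities: list[dict[str, float]]) -> str:
    """Return the label with the highest average predicted probability."""
    labels = probabilities[0].keys()
    average = {
        label: sum(p[label] for p in probabilities) / len(probabilities)
        for label in labels
    }
    return max(average, key=average.get)


# Made-up class probabilities from three base models (e.g., SVM, RF, LR).
svm = {"positive": 0.45, "negative": 0.55, "conflict": 0.00}
rf = {"positive": 0.40, "negative": 0.60, "conflict": 0.00}
lr = {"positive": 0.95, "negative": 0.05, "conflict": 0.00}

models = [svm, rf, lr]
hard = hard_vote([max(m, key=m.get) for m in models])
soft = soft_vote(models)
```

In this example two models lean slightly negative and one is strongly positive, so hard voting returns negative while soft voting returns positive.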

3.8. LLM Selection

To evaluate the ability of LLMs in Arabic sentiment analysis (ASA), four LLMs were used. The models included GPT 3.5 [9] and GPT 4 (https://openai.com/research/gpt-4 (accessed on 20 May 2025)), both developed by OpenAI. GPT 3.5 integrated Reinforcement Learning from Human Feedback (RLHF) to improve its performance and coding abilities. GPT 4, launched in March 2023, is capable of handling both textual and visual data with high precision. Furthermore, we used Llama-3-8B-Instruct, developed by Meta, which gained popularity as an open-source model tailored for instruction-based tasks [50]. Lastly, we included SILMA-9B-Instruct (https://huggingface.co/silma-ai/SILMA-9B-Instruct-v1.0 (accessed on 20 May 2025)), a top-ranked open-source Arabic LLM that has shown promising results, outperforming other models, even those with a larger number of parameters. For our analysis, we selected a subset of 30 reviews along with their corresponding labels from each class, for a total of 90 reviews. We used the OpenAI API for the GPT models, while the Llama and SILMA models were accessed through the Hugging Face library.
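A balanced evaluation subset of this kind could be drawn as in the sketch below. How the 30 reviews per class were actually chosen is not stated, so random sampling with a fixed seed is an assumption, as are the function name and the `(text, label)` input format.

```python
import random
from collections import defaultdict


def sample_per_class(reviews, per_class=30, seed=0):
    """Draw `per_class` reviews at random from each label.

    `reviews` is an iterable of (text, label) pairs. A ValueError is
    raised if a class has fewer than `per_class` reviews.
    """
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    by_label = defaultdict(list)
    for text, label in reviews:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        subset.extend(rng.sample(by_label[label], per_class))
    return subset
```

With three classes and 30 reviews each, the resulting subset contains 90 reviews with equal class representation.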

3.9. Prompt Design

LLM prompting has emerged as a valuable method for enhancing model capabilities and guiding their responses. The term "prompt" refers to the initial instruction given to the model, which facilitates improved interpretation and comprehension of tasks. Zero-shot evaluation means that the model relies solely on its prior training and understanding of language patterns. A few-shot setting means querying the model with a specific task description combined with a limited number of examples of the desired task. These examples help the model generate accurate responses without extensive task-specific training data by showing the relationship between input and expected output [41]. For this study, we assessed each model’s accuracy in predicting sentiment using a well-crafted prompt in both zero-shot and few-shot contexts, where two examples per class were provided in the few-shot case. Following several attempts, we formalized the optimal prompt for each model. Table 2 shows a sample of the prompts used for GPT queries. The prompts used for Llama-3-8B-Instruct and SILMA-9B-Instruct are presented in Table A1 and Table A2 in Appendix A.
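The structural difference between the two settings can be sketched as a prompt builder. The instruction wording and the `Review:`/`Sentiment:` layout below are hypothetical placeholders, not the prompts reported in Table 2 or Appendix A.

```python
from typing import Sequence, Tuple

# Hypothetical instruction text, used only for illustration.
INSTRUCTION = (
    "Classify the sentiment of the following Arabic banking app review as "
    "positive, negative, or conflict. Answer with one word."
)


def build_prompt(review: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    `examples` holds (review text, label) pairs shown before the query.
    """
    parts = [INSTRUCTION]
    for text, label in examples:
        parts.append(f"Review: {text}\nSentiment: {label}")
    parts.append(f"Review: {review}\nSentiment:")
    return "\n\n".join(parts)
```

Calling `build_prompt(review)` yields the zero-shot form, while passing six labeled pairs (two per class) yields a few-shot form matching the setting described above.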

3.10. Model Evaluation

The constructed model’s performance is evaluated using different metrics listed below.

Accuracy: The proportion of correctly labeled data points out of the total number of data points. This statistic offers an overall assessment of the model’s performance in classification tasks.
Precision: The proportion of true positive predictions over all positive predictions.
Recall: The proportion of true positives to the sum of true positives and false negatives.
F1-score: The harmonic mean of precision and recall.
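These definitions can be written out directly for a single class treated as the positive class. The sketch below is illustrative; how the per-class scores are averaged across the three classes is not specified here and is left out.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, F1) with `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


y_true = ["positive", "positive", "negative", "negative", "conflict"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]
```

In this toy example the negative class has perfect recall but a precision of 2/3, because one positive review is misclassified as negative.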

4. Results

In this section, we present the experimental findings that demonstrate the efficacy of the proposed techniques and provide a thorough analysis of the results. Different metrics were employed to assess the models’ performance.

4.1. Evaluating ML Approaches for Digital Banking SA

For this experiment, six ML models were trained to serve as sentiment classifiers: XGBoost, RF, DT, SVM, LR, and the Voting Classifier. After preprocessing, stratified cross-validation was employed to ensure the reliability of the models’ results. It keeps the proportion of instances per class consistent in each fold, making it a suitable choice for handling imbalanced datasets [51]. The dataset was evaluated using 5-fold cross-validation. Table 3 shows the performance scores, including accuracy, F1 score, recall, and precision, for each classifier with and without class weighting. The confusion matrix for each model is presented in Appendix B.
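The idea behind stratification can be sketched by dealing the indices of each class across the folds in turn. This is a simplified, unshuffled illustration of the principle, not the library routine used for the experiments.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, class by class.

    Indices of each class are dealt round-robin across the folds, so
    every fold keeps roughly the same class proportions as the full set.
    """
    folds = [[] for _ in range(k)]
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds
```

Each fold in turn serves as the test set while the remaining four are used for training, so even a small class such as conflict is represented in every test fold.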

On the imbalanced dataset, before class weighting is applied, the SVM achieves the highest result on all metrics, attaining an accuracy of 90.45%, followed by the Voting Classifier with an accuracy of 90.00%. However, both classifiers exhibit slightly lower F1 scores of 89.12% and 88.20%, respectively. The DT classifier shows the lowest recall score of 87.06%. In general, applying class weights helps improve performance on minority classes, improving F1 scores for XGBoost, LR, the Voting Classifier, and SVM.

After applying class weighting, the Voting Classifier was the most proficient classifier, with an accuracy of 90.24% along with notable F1 score, recall, and precision values of 90.20%, 90.24%, and 90.50%, respectively. The increase in F1 score of approximately 2% after applying class weighting reflects an improvement in the Voting Classifier’s ability to classify labels correctly. SVM follows with an accuracy of 89.60%. In contrast, the DT classifier continued to show the lowest performance, achieving an accuracy of 86.47%. In addition, XGBoost and LR show improvements in F1 score and precision.

Quantitative Analysis of Bank Reviews

Data exploration plays a critical role in understanding a use case in depth, as it allows us to uncover insights and build hypotheses. In this section, visualizations are used to explore temporal peaks and variation in sentiment labels. We conducted this analysis for STC Pay, Alinma, and UrPay for the year 2023 only, since that year has the most balanced amount and distribution of data compared with other years. This step helps identify suitable actionable responses and address key customer concerns.

Figure 5 shows how sentiment for STC Pay evolved from January to December 2023. Negative sentiment (red line) dominates in every month and peaks in November and December, with 150 negative reviews compared to 55 positive reviews, indicating user dissatisfaction with the newly released version of the application. Positive sentiment starts below negative at the beginning of the year, with a notable increase in October. The conflict label remains low throughout the year, reflecting little mixed feedback. Figure 6 shows that reviews for Alinma Bank lean positive from January to August. In September and October, negative and positive reviews are evenly distributed, and negative reviews increase in the last month of the year. Notably, none of the reviews received a developer response. Figure 7 shows that UrPay reviews trend positive during the first quarter. However, negative labels increase steadily from the beginning of April because of issues introduced by application updates, such as users being logged out unexpectedly. In September, negative reviews rise significantly and reach their highest point, coinciding with updates that caused issues on Samsung devices such as the A52, A53, and A73. Finally, the last quarter shows continued user dissatisfaction, centered on delays in bank transfers rather than solely on technical issues.
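Monthly trend lines of this kind can be derived by counting labels per calendar month. The sketch below assumes each review is available as a (date, label) pair; the record layout, the function name `monthly_sentiment`, and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical records: (review date, sentiment label).
reviews = [
    (date(2023, 11, 3), "negative"),
    (date(2023, 11, 9), "negative"),
    (date(2023, 11, 21), "positive"),
    (date(2023, 12, 2), "negative"),
    (date(2023, 12, 5), "conflict"),
    (date(2022, 12, 30), "positive"),  # outside the analysed year
]


def monthly_sentiment(records, year):
    """Count sentiment labels per calendar month for one year."""
    counts = defaultdict(Counter)
    for day, label in records:
        if day.year == year:
            counts[day.month][label] += 1
    return counts


trend = monthly_sentiment(reviews, 2023)
for month in sorted(trend):
    print(month, dict(trend[month]))
```

Plotting one line per label over the twelve months would then give a chart of the same form as Figures 5 to 7.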

N-gram analysis identifies words or phrases that appear together in the dataset, which helps capture relationships and broader context in a corpus. We extract the top ten bigrams and trigrams from the reviews. A bigram is a sequence of two adjacent words, whereas a trigram is a sequence of three adjacent words, as shown in Figure 8. The n-grams and their English translations are listed in Table A3 and Table A4 in Appendix C.
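A minimal sketch of top-k n-gram extraction is given below. It assumes whitespace tokenization of already preprocessed text and uses English stand-in sentences; the function name `top_ngrams` is illustrative and the authors' tokenizer may differ.

```python
from collections import Counter


def top_ngrams(texts, n, k=10):
    """Return the k most frequent n-grams of adjacent whitespace tokens."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


# English stand-ins for preprocessed reviews (the dataset itself is Arabic).
reviews = [
    "the app is not working after the update",
    "not working after the last update",
    "very very good app",
]
print(top_ngrams(reviews, 2, k=3))
print(top_ngrams(reviews, 3, k=3))
```

N-grams are counted within each review only, so a phrase never spans the boundary between two reviews.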

4.2. Evaluating LLM Approaches for Digital Banking SA

We conducted a comprehensive evaluation of GPT 3.5, GPT 4, Llama-3-8B-Instruct, and SILMA-9B-Instruct in both zero-shot and few-shot settings, using a consistent set of examples across all models. Each model was assessed with the prompts described in Section 3.9. For SILMA, we ran two experiments: one with an English prompt and the other with an Arabic prompt. When a model produced an invalid response or an empty output instead of one of the three classes, the sample was excluded from the final analysis. If a model generated an appropriate class prediction in an incorrect format, we mapped it back to the correct class. The results in Table 4 indicate that GPT 4 in the few-shot setting achieved the highest accuracy, F1 score, and recall.
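The post-processing described above (mapping loosely formatted replies back to a class and excluding invalid ones) could look like the following sketch. The matching rule, the function name `normalize_prediction`, and the sample replies are assumptions; the exact mapping rules used in the study are not given in this section, and replies written in Arabic would need additional handling.

```python
import re

VALID_LABELS = ("positive", "negative", "conflict")


def normalize_prediction(raw):
    """Map a raw model reply to one valid label, or None if unusable."""
    found = {label for label in VALID_LABELS
             if re.search(rf"\b{label}\b", raw.lower())}
    # Accept only replies that name exactly one valid class.
    return found.pop() if len(found) == 1 else None


# Hypothetical raw replies and gold labels for illustration.
raw_outputs = ["Positive", "Sentiment: negative.", "['conflict']", "", "neutral"]
gold = ["positive", "negative", "conflict", "negative", "positive"]

kept = [(g, normalize_prediction(r)) for g, r in zip(gold, raw_outputs)]
kept = [(g, p) for g, p in kept if p is not None]  # drop invalid replies
accuracy = sum(g == p for g, p in kept) / len(kept)
print(len(kept), accuracy)
```

Excluding invalid replies shrinks the denominator, so scores computed this way can be higher than if those replies were counted as errors; this is worth keeping in mind when comparing against the ML classifiers.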

In the zero-shot setting, GPT 4 reached an accuracy of approximately 80% (79.78%), an improvement of about 26 percentage points over GPT 3.5 (54.12%). SILMA with the Arabic prompt ranked second with 69%. Both LLaMA and SILMA with the English prompt were unable to perform the three-class task: their predictions were limited to the positive and negative classes, yielding F1 scores of 51% and 52%, respectively.

In the few-shot setting, SILMA with the Arabic prompt achieved an accuracy of 81.11%, about 1.7 percentage points below GPT 4 (82.76%). With the English prompt, however, SILMA produced inaccurate results, achieving only 33% accuracy and assigning every review to the negative class. This variation across prompt languages suggests that the model has difficulty handling English prompts. Additionally, LLaMA outperformed GPT 3.5 in the few-shot setting, with accuracies of 77.78% and 63.22%, respectively. However, LLaMA categorizes the reviews into only the positive and negative classes, which suggests a limitation in its ability to predict the conflict class.

Figure A3 shows the confusion matrices of GPT 3.5 and GPT 4 in zero-shot and few-shot settings. GPT 3.5 performs poorly on all classes, particularly the conflict label, where only 9 reviews are correctly classified. GPT 3.5 improves in the few-shot setting for both the positive and negative classes, but it still struggles with the conflict class. The GPT 4 results in Figure A3c show good classification without any in-context examples. The model achieves its best performance in the few-shot setting, where the majority of reviews in every category, including the conflict class, are correctly classified.

Figure A4 shows that Llama-3-8B-Instruct performs well on the positive and negative classes. However, the model fails to recognize conflict labels and tends to misclassify conflict reviews as negative. A similar pattern appears in Figure A4c, where SILMA-9B-Instruct correctly predicts only 6 out of 30 conflict reviews. The models improve in the few-shot setting, which highlights the essential role of in-context examples in helping a model distinguish between sentiments.
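To make the per-class reading of a confusion matrix concrete, the sketch below builds a three-class matrix and derives per-class recall. All predictions are hypothetical; only the "6 of 30 conflict reviews" figure mirrors the text, and the way the remaining 24 conflict reviews are spread over the other classes is invented for illustration.

```python
from collections import Counter

LABELS = ["positive", "negative", "conflict"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def per_class_recall(matrix, labels=LABELS):
    """Diagonal entry divided by the row total, per class."""
    return {label: row[i] / sum(row) if sum(row) else 0.0
            for i, (label, row) in enumerate(zip(labels, matrix))}


# Hypothetical predictions; only the 6-of-30 conflict figure mirrors the text.
y_true = ["positive"] * 30 + ["negative"] * 30 + ["conflict"] * 30
y_pred = (["positive"] * 28 + ["negative"] * 2
          + ["negative"] * 27 + ["positive"] * 3
          + ["conflict"] * 6 + ["negative"] * 20 + ["positive"] * 4)

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
print(per_class_recall(matrix))
```

Six correct predictions out of 30 correspond to a conflict recall of 20%, which shows how a model can post a moderate overall accuracy while largely missing one class.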

5. Discussion

The findings of this research provide valuable insight into Arabic SA through a comprehensive evaluation of LLMs compared with traditional ML classifiers. The results show that the class-weighted Voting Classifier achieves the best overall result, with an accuracy of 90.24%. Conversely, tree-based classifiers such as DT and RF perform below the other classifiers, although their F1 scores remain above 86%. In the second experiment, which employed LLMs, GPT 4 produced the best results in both zero- and few-shot settings, followed by SILMA with the Arabic prompt. Prompt type significantly affects the output produced by LLMs and can enhance their overall performance. All four LLMs showed significant increases in accuracy with few-shot prompting, which is consistent with existing studies [38,41]. These findings suggest that, for Arabic SA, LLMs, including the Arabic-based model, fail to surpass the performance of the ML classifiers, which aligns with findings in the literature [42].

6. Conclusions

This research addresses a gap by investigating Arabic SA in the digital banking sector. A manually annotated dataset was constructed by collecting customer reviews from three well-known Saudi digital banking applications. Two experiments were then conducted: one based on supervised training of six ML classifiers, and a second utilizing LLMs in zero- and few-shot prompt settings. Although LLMs achieve competitive results without training and with minimal labeled data, their overall performance still lags behind that of traditional ML. In the future, we seek to extend this work by investigating the Arabic multi-class sentiment task with fine-tuned LLMs. Moreover, a recent study [52] demonstrates the effectiveness of advanced embedding techniques, including self-attention and multi-head cross-attention, in improving model performance. Employing these techniques in future studies could enhance the ability of models to capture emotional and semantic features in text. Lastly, increasing the size of the dataset will provide a more detailed overview of customer satisfaction.

Author Contributions

Conceptualization, R.A. and A.A.; methodology, R.A. and A.A.; software, R.A.; validation, A.A.; data curation, R.A.; writing—original draft preparation, R.A.; writing—review and editing, A.A.; supervision, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are available in a GitHub repository at https://github.com/Raghad-Alawaji/Arabic_Digital_Banking_Reviews_Dataset (accessed on 20 May 2025).

Acknowledgments

The authors gratefully acknowledge Qassim University, represented by the Deanship of Graduate Studies and Scientific Research, for the financial support for this research under the number (QU-J-PG-2-2025-54146) during the academic year 1446 AH/2024 AD.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BERT	Bidirectional Encoder Representations from Transformers
DL	Deep Learning
DT	Decision Tree
KNN	K-Nearest Neighbor
LLMs	Large Language Models
LR	Logistic Regression
LIWC	Linguistic Inquiry Word Count
ML	Machine Learning
NB	Naive Bayes
NLP	Natural Language Processing
PII	Personal identifiable information
RF	Random Forest
SA	Sentiment Analysis
SVM	Support Vector Machine
TF-IDF	Term frequency-inverse document frequency

Appendix A. Sample of Prompts

Table A1. Sample of prompts used for both zero-shot and few-shot queries for Llama-3-8B-Instruct and SILMA-9B-Instruct—English.

Prompt Setting	Template Prompt
Zero-Shot	Analyse the sentiment of the reviews written in Arabic language. In your output, only return the sentiment for each review as either [‘positive’, ‘conflict’, ‘negative’]. Respond with only one word.
Few-Shot	Analyse the sentiment of the reviews written in Arabic language. In your output, only return the sentiment for each review as either [‘positive’, ‘conflict’, ‘negative’]. Respond with only one word. Here are some examples to guide you:

Review: positive train review ; Sentiment: positive
Review: negative train review ; Sentiment: negative
Review: conflict train review ; Sentiment: conflict

Table A2. Sample of prompts used for both zero-shot and few-shot queries for SILMA-9B-Instruct—Arabic prompt. These prompts represent the Arabic-translated versions of those shown in Table A1.

Prompt Setting	Template Prompt
Zero-Shot
Few-Shot

Appendix B. Confusion Matrices for ML Models

Figure A1. Confusion matrices for (a) XGBoost, (b) Logistic Regression, (c) Decision Tree, and (d) Random Forest.

Figure A2. Confusion matrices for (a) SVM, (b) Voting Classifier.

Appendix C. Translation of the N-Grams Terms

Table A3. Top 10 most frequent Arabic bigrams in the digital banking dataset with their English translations.

Bigram (Arabic)	Frequency	Bigram (English Translation)
	162	Very good
	155	Not working
	139	Very very
	133	More than
	133	Very bad
	113	After the update
	93	The application does not
	80	Solve the problem
	76	Latest update
	72	Customer Service

Table A4. Top 10 most frequent Arabic trigrams in the digital banking dataset with their English translations.

Trigram (Arabic)	Frequency	Trigram (English Translation)
	60	The app is not working
	45	Very very very
	32	After the last update
	26	Very very excellent
	24	More than once
	20	Very bad application
	17	Very very bad
	17	Please solve the problem
	16	STC
	16	The application after the update

Appendix D. Confusion Matrices for LLMs

Figure A3. Confusion matrices for (a) GPT 3.5 in the zero-shot setting, (b) GPT 3.5 in the few-shot setting, (c) GPT 4 in the zero-shot setting, and (d) GPT 4 in the few-shot setting. Samples with invalid or hallucinated outputs were excluded from the evaluation.

Figure A4. Confusion matrices for (a) Llama-3-8B-Instruct in the zero-shot setting, (b) Llama-3-8B-Instruct in the few-shot setting, (c) SILMA-9B-Instruct (Arabic prompt) in the zero-shot setting, and (d) SILMA-9B-Instruct (Arabic prompt) in the few-shot setting. Samples with invalid or hallucinated outputs were excluded from the evaluation.

References

1. Al-Qudah, A.A.; Al-Okaily, M.; Alqudah, G.; Ghazlat, A. Mobile payment adoption in the time of the COVID-19 pandemic. Electron. Commer. Res. 2024, 24, 427–451.
2. Alkhwaldi, A.F.; Alharasis, E.E.; Shehadeh, M.; Abu-AlSondos, I.A.; Oudat, M.S.; Bani Atta, A.A. Towards an understanding of FinTech users’ adoption: Intention and e-loyalty post-COVID-19 from a developing country perspective. Sustainability 2022, 14, 12616.
3. Statista. Digital Population Worldwide as of January 2024. February 2024. Available online: https://www.statista.com/statistics/617136/digital-populationworldwide/ (accessed on 14 May 2025).
4. Persia, F.; D’Auria, D. A survey of online social networks: Challenges and opportunities. In Proceedings of the 2017 IEEE International Conference on Information Reuse and Integration (IRI), San Diego, CA, USA, 4–6 August 2017; pp. 614–620.
5. Indrasari, A.; Nadjmie, N.; Endri, E. Determinants of satisfaction and loyalty of e-banking users during the COVID-19 pandemic. Int. J. Data Netw. Sci. 2022, 6, 497–508.
6. Michailidis, P.D. A Comparative Study of Sentiment Classification Models for Greek Reviews. Big Data Cogn. Comput. 2024, 8, 107.
7. Sharma, H.D.; Goyal, P. An analysis of sentiment: Methods, applications, and challenges. Eng. Proc. 2023, 59, 68.
8. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
9. Kalyan, K.S. A survey of GPT-3 family large language models including ChatGPT and GPT-4. Nat. Lang. Process. J. 2023, 6, 100048.
10. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971.
11. Qin, L.; Chen, Q.; Feng, X.; Wu, Y.; Zhang, Y.; Li, Y.; Li, M.; Che, W.; Yu, P.S. Large language models meet nlp: A survey. arXiv 2024, arXiv:2405.12819.
12. Nandwani, P.; Verma, R. A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 2021, 11, 81.
13. Yang, L.; Li, Y.; Wang, J.; Sherratt, R.S. Sentiment analysis for E-commerce product reviews in Chinese based on sentiment lexicon and deep learning. IEEE Access 2020, 8, 23522–23530.
14. Jain, V.; Kashyap, K.L. Ensemble hybrid model for Hindi COVID-19 text classification with metaheuristic optimization algorithm. Multimed. Tools Appl. 2023, 82, 16839–16859.
15. Musleh, D.A.; Alkhwaja, I.; Alkhwaja, A.; Alghamdi, M.; Abahussain, H.; Alfawaz, F.; Min-Allah, N.; Abdulqader, M.M. Arabic Sentiment Analysis of YouTube Comments: NLP-Based Machine Learning Approaches for Content Evaluation. Big Data Cogn. Comput. 2023, 7, 127.
16. Soumelidou, A.; Tsohou, A. Validation and extension of two domain-specific information privacy competency models. Int. J. Inf. Secur. 2024, 23, 2437–2455.
17. Genc-Nayebi, N.; Abran, A. A systematic literature review: Opinion mining studies from mobile app store user reviews. J. Syst. Softw. 2017, 125, 207–219.
18. Zhao, J.; Gui, X. Comparison research on text preprocessing methods on twitter sentiment analysis. IEEE Access 2017, 5, 2870–2879.
19. Mathayomchan, B.; Sripanidkulchai, K. Utilizing Google translated Reviews from Google maps in sentiment analysis for Phuket tourist attractions. In Proceedings of the 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE), Chonburi, Thailand, 10–12 July 2019; pp. 260–265.
20. Sghaier, M.A.; Zrigui, M. Sentiment analysis for Arabic e-commerce websites. In Proceedings of the 2016 International Conference on Engineering & MIS (ICEMIS), Agadir, Morocco, 22–24 September 2016; pp. 1–7.
21. Nasrullah, H.A.; Nasrullah, M.A.; Flayyih, W.N. Sentiment analysis in Arabic language using machine learning: Iraqi dialect case study. AIP Conf. Proc. 2023, 2651, 060015.
22. Ziani, A.; Azizi, N.; Zenakhra, D.; Cheriguene, S.; Aldwairi, M. Combining RSS-SVM with genetic algorithm for Arabic opinions analysis. Int. J. Intell. Syst. Technol. Appl. 2019, 18, 152–178.
23. Hicham, N.; Karim, S.; Habbat, N. Customer sentiment analysis for Arabic social media using a novel ensemble machine learning approach. Int. J. Electr. Comput. Eng. 2023, 13, 4504–4515.
24. Oussous, A.; Lahcen, A.A.; Belfkih, S. Improving sentiment analysis of moroccan tweets using Ensemble Learning. In Big Data, Cloud and Applications, Proceedings of the Third International Conference, BDCA 2018, Kenitra, Morocco, 4–5 April 2018; Revised Selected Papers 3; Springer International Publishing: Cham, Switzerland, 2018; pp. 91–104.
25. Benrouba, F.; Boudour, R. Emotional sentiment analysis of social media content for mental health safety. Soc. Netw. Anal. Min. 2023, 13, 17.
26. Masrury, R.A.; Alamsyah, A. Analyzing tourism mobile applications perceived quality using sentiment analysis and topic modeling. In Proceedings of the 2019 7th International Conference on Information and Communication Technology (ICoICT), Kuala Lumpur, Malaysia, 24–26 July 2019; pp. 1–6.
27. Rhanoui, M.; Mikram, M.; Yousfi, S.; Barzali, S. A CNN-BiLSTM model for document-level sentiment analysis. Mach. Learn. Knowl. Extr. 2019, 1, 832–847.
28. Olagunju, T.; Oyebode, O.; Orji, R. Exploring key issues affecting african mobile ecommerce applications using sentiment and thematic analysis. IEEE Access 2020, 8, 114475–114486.
29. Permana, M.E.; Ramadhan, H.; Budi, I.; Santoso, A.B.; Putra, P.K. Sentiment analysis and topic detection of mobile banking application review. In Proceedings of the 2020 Fifth International Conference on Informatics and Computing (ICIC), Gorontalo, Indonesia, 3–4 November 2020; pp. 1–6.
30. Hadwan, M.; Al-Hagery, M.; Al-Sarem, M.; Saeed, F. Arabic sentiment analysis of users’ opinions of governmental mobile applications. Comput. Mater. Contin. 2022, 72, 4675–4689.
31. Andrian, B.; Simanungkalit, T.; Budi, I.; Wicaksono, A.F. Sentiment analysis on customer satisfaction of digital banking in Indonesia. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 466–473.
32. Samudera, B.; Nurdin, N.; Aidilof, H. Sentiment analysis of user reviews on BSI Mobile and Action Mobile applications on the Google Play Store using multinomial Naive Bayes algorithm. Int. J. Eng. Sci. Inf. Technol. 2024, 4, 101–112.
33. Al-Hagree, S.; Al-Gaphari, G. Arabic Sentiment Analysis Based Machine Learning for Measuring User Satisfaction with Banking Services’ Mobile Applications: Comparative Study. In Proceedings of the 2022 2nd International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 25–26 October 2022; pp. 1–4.
34. Chader, A.; Hamdad, L.; Belkhiri, A. Sentiment analysis in google play store: Algerian reviews case. In Modelling and Implementation of Complex Systems, Proceedings of the 6th International Symposium, MISC 2020, Batna, Algeria, 24–26 October 2020; Springer International Publishing: Cham, Switzerland, 2021; pp. 107–121.
35. Mustafa, D.; Khabour, S.M.; Shatnawi, A.S.; Taqieddin, E. Arabic Sentiment Analysis of Food Delivery Services Reviews. In Proceedings of the 2023 International Symposium on Networks, Computers and Communications (ISNCC), Doha, Qatar, 23–26 October 2023; pp. 1–6.
36. Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. Can chatgpt understand too? A comparative study on chatgpt and fine-tuned bert. arXiv 2023, arXiv:2302.10198.
37. Koto, F.; Beck, T.; Talat, Z.; Gurevych, I.; Baldwin, T. Zero-shot sentiment analysis in low-resource languages using a multilingual sentiment lexicon. arXiv 2024, arXiv:2402.02113.
38. Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; et al. A comprehensive capability analysis of gpt-3 and gpt-3.5 series models. arXiv 2023, arXiv:2303.10420.
39. Mo, K.; Liu, W.; Xu, X.; Yu, C.; Zou, Y.; Xia, F. Fine-Tuning Gemma-7B for Enhanced Sentiment Analysis of Financial News Headlines. arXiv 2024, arXiv:2406.13626.
40. Xiao, H.; Luo, L. An Automatic Sentiment Analysis Method For Short Texts Based on Transformer-BERT Hybrid Model. IEEE Access 2024, 12, 93305–93317.
41. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
42. Tawkat Islam Khondaker, M.; Waheed, A.; Moatez Billah Nagoudi, E.; Abdul-Mageed, M. GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP. arXiv 2023, arXiv:2305.14976.
43. Alderazi, F.; Algosaibi, A.; Alabdullatif, M.; Ahmad, H.F.; Qamar, A.M.; Albarrak, A. Generative artificial intelligence in topic-sentiment classification for Arabic text: A comparative study with possible future directions. PeerJ Comput. Sci. 2024, 10, e2081.
44. Oreščanin, D.; Hlupić, T.; Vrdoljak, B. Managing Personal Identifiable Information in Data Lakes. IEEE Access 2024, 12, 32164–32180.
45. Olabanji, S.O.; Oladoyinbo, O.B.; Asonze, C.U.; Oladoyinbo, T.O.; Ajayi, S.A.; Olaniyi, O.O. Effect of adopting AI to explore big data on personally identifiable information (PII) for financial and economic data transformation. Asian J. Econ. Bus. Account. 2024, 24, 106–125.
46. Bird, S. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia, 17–18 July 2006; pp. 69–72.
47. Yamamoto, M.; Church, K.W. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Comput. Linguist. 2001, 27, 1–30.
48. Razali, M.N.; Arbaiy, N.; Lin, P.C.; Ismail, S. Optimizing Multiclass Classification Using Convolutional Neural Networks with Class Weights and Early Stopping for Imbalanced Datasets. Electronics 2025, 14, 705.
49. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
50. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783.
51. Kaliappan, J.; Bagepalli, A.R.; Almal, S.; Mishra, R.; Hu, Y.C.; Srinivasan, K. Impact of Cross-validation on Machine Learning models for early detection of intrauterine fetal demise. Diagnostics 2023, 13, 1692.
52. Rasool, A.; Shahzad, M.I.; Aslam, H.; Chan, V.; Arshad, M.A. Emotion-Aware Embedding Fusion in Large Language Models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for Intelligent Response Generation. AI 2025, 6, 56.

Figure 1. Example of utilizing LLMs for various NLP tasks.

Figure 2. A flow diagram for the proposed framework.

Figure 3. The data-collection steps include using the Google Play API, data labeling, PII anonymization, and data preprocessing.

Figure 4. Details of the distribution of review labels across banks.

Figure 5. Monthly sentiment trends in customer feedback for STC Pay Bank (2023).

Figure 6. Monthly sentiment trends in customer feedback for Alinma Bank (2023).

Figure 7. Monthly sentiment trends in customer feedback for UrPay Bank (2023).

Figure 8. Top 10 bigrams and trigrams from the digital banking dataset.

Table 1. Positive, negative, and conflict samples selected randomly from the dataset.

Label	Review (Arabic)	Translation (English)
Positive		A very distinguished bank in its transactions and speed of accomplishments. I recommend dealing with it because of the ease of the matter.
Negative		Bad app. If I enter my information and complete everything correctly, in the end it logs me out of the app more than 20 times.
Conflict		Excellent but a bit heavy

Table 2. Sample of prompts used for both zero-shot and few-shot queries.

Prompt Setting	Template Prompt
Zero-Shot	Analyse the sentiment of the reviews written in Arabic language above and return a JSON array as the result. In your output, only return the sentiment for each review as ‘positive’, ‘negative’, and ‘conflict’. Do not include any other sentiment.
Few-Shot	Analyse the sentiment of the reviews written in Arabic language above and return a JSON array as the result. In your output, only return the sentiment for each review as ‘positive’, ‘negative’, and ‘conflict’. Do not include any other sentiment. Examples of good sentiment-analysis classification are provided between separator “###”.

###
Review: positive train review ; Sentiment: positive
Review: negative train review ; Sentiment: negative
Review: conflict train review ; Sentiment: conflict
###
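The templates in Table 2 can be assembled programmatically. The sketch below is one possible layout: because the template refers to "the reviews ... above", the reviews are placed before the instruction, but the exact ordering, the function name `build_prompt`, and the placeholder reviews are assumptions rather than the authors' code.

```python
INSTRUCTION = (
    "Analyse the sentiment of the reviews written in Arabic language above "
    "and return a JSON array as the result. In your output, only return the "
    "sentiment for each review as 'positive', 'negative', and 'conflict'. "
    "Do not include any other sentiment."
)
FEW_SHOT_NOTE = (
    "Examples of good sentiment-analysis classification are provided "
    'between separator "###".'
)


def build_prompt(reviews, examples=None):
    """Place the reviews first, then the instruction and optional examples."""
    parts = ["\n".join(reviews), INSTRUCTION]
    if examples:
        shots = "\n".join(f"Review: {text} ; Sentiment: {label}"
                          for text, label in examples)
        parts += [FEW_SHOT_NOTE, f"###\n{shots}\n###"]
    return "\n\n".join(parts)


# Placeholder texts stand in for real Arabic reviews.
examples = [
    ("positive train review", "positive"),
    ("negative train review", "negative"),
    ("conflict train review", "conflict"),
]
zero_shot = build_prompt(["review 1", "review 2"])
few_shot = build_prompt(["review 1", "review 2"], examples)
print(few_shot)
```

Keeping the same `examples` list for every model matches the stated use of a consistent set of examples across models, so differences in few-shot scores are not caused by different demonstrations.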

Table 3. Comparison of ML classifiers with and without class-weighted training for Arabic sentiment analysis (in percentages).

Model Name	Class-Weighted	Accuracy	F1 Score	Recall	Precision
XGBoost	Without	89.15	88.68	89.15	88.75
XGBoost	With	89.37	89.47	89.37	89.65
RF	Without	89.23	87.26	89.23	85.40
RF	With	88.89	86.93	88.89	85.10
DT	Without	87.06	86.82	87.06	86.86
DT	With	86.47	86.91	86.47	87.68
LR	Without	89.86	87.92	89.86	87.82
LR	With	89.56	89.66	89.56	90.15
SVM	Without	90.45	89.12	90.45	89.98
SVM	With	89.60	89.98	89.60	90.73
Voting Classifier	Without	90.00	88.20	90.00	89.39
Voting Classifier	With	90.24	90.20	90.24	90.50
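The effect of class weighting can be read off Table 3 with a short arithmetic check. The sketch below copies the accuracy and F1 values from the table and computes the change, in percentage points, for each model; the dictionary layout is only an illustration.

```python
# (accuracy, F1) from Table 3: without vs. with class weighting.
table3 = {
    "XGBoost": ((89.15, 88.68), (89.37, 89.47)),
    "RF": ((89.23, 87.26), (88.89, 86.93)),
    "DT": ((87.06, 86.82), (86.47, 86.91)),
    "LR": ((89.86, 87.92), (89.56, 89.66)),
    "SVM": ((90.45, 89.12), (89.60, 89.98)),
    "Voting Classifier": ((90.00, 88.20), (90.24, 90.20)),
}

# Change in percentage points after class weighting is applied.
deltas = {
    model: (round(w[0] - wo[0], 2), round(w[1] - wo[1], 2))
    for model, (wo, w) in table3.items()
}
for model, (d_acc, d_f1) in deltas.items():
    print(f"{model:18s} accuracy {d_acc:+.2f}  F1 {d_f1:+.2f}")
```

On these figures, class weighting raises the F1 score for every model except RF, while accuracy rises only for XGBoost and the Voting Classifier, which illustrates the trade-off between overall accuracy and minority-class performance.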

Table 4. Results of LLMs with different prompt settings (in percentages).

Model Name	Prompt Type	Accuracy	F1 Score	Precision	Recall
GPT 3.5	Zero-shot	54.12	52.67	54.33	54.00
GPT 3.5	Few-shot	63.22	63.00	63.00	63.00
GPT 4	Zero-shot	79.78	78.00	82.67	80.00
GPT 4	Few-shot	82.76	82.33	83.00	83.00
Llama-3-8B-Instruct	Zero-shot	62.22	51.34	44.30	62.22
Llama-3-8B-Instruct	Few-shot	77.78	78.73	82.59	77.78
SILMA-9B-Instruct—English	Zero-shot	64.44	51.89	43.44	64.44
SILMA-9B-Instruct—English	Few-shot	33.33	16.67	11.11	33.33
SILMA-9B-Instruct—Arabic	Zero-shot	68.89	62.99	73.92	68.89
SILMA-9B-Instruct—Arabic	Few-shot	81.11	80.64	82.31	81.11
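The accuracy gaps between settings and models follow directly from Table 4. The sketch below copies the accuracy column and computes the few-shot gain per model, together with two model-to-model differences, all in percentage points; the shortened dictionary keys are illustrative labels.

```python
# Accuracy (%) from Table 4: (zero-shot, few-shot).
table4 = {
    "GPT 3.5": (54.12, 63.22),
    "GPT 4": (79.78, 82.76),
    "Llama-3-8B-Instruct": (62.22, 77.78),
    "SILMA-9B-Instruct (English prompt)": (64.44, 33.33),
    "SILMA-9B-Instruct (Arabic prompt)": (68.89, 81.11),
}

# Few-shot accuracy minus zero-shot accuracy, in percentage points.
gains = {name: round(few - zero, 2) for name, (zero, few) in table4.items()}
for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:36s} {gain:+.2f} points")

gpt4_vs_gpt35_zero = round(table4["GPT 4"][0] - table4["GPT 3.5"][0], 2)
gpt4_vs_silma_few = round(
    table4["GPT 4"][1] - table4["SILMA-9B-Instruct (Arabic prompt)"][1], 2
)
print(gpt4_vs_gpt35_zero, gpt4_vs_silma_few)
```

On these figures, Llama-3-8B-Instruct gains the most from in-context examples, SILMA with the English prompt is the only configuration that degrades, and GPT 4 leads GPT 3.5 by 25.66 points in the zero-shot setting and SILMA with the Arabic prompt by 1.65 points in the few-shot setting.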

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

