Optimizing Sentiment Analysis in Multilingual Balanced Datasets: A New Comparative Approach to Enhancing Feature Extraction Performance with ML and DL Classifiers

Jakha, Hamza; El Houssaini, Souad; El Houssaini, Mohammed-Alamine; Ajjaj, Souad; Hadir, Abdelali

doi:10.3390/asi8040104

by Hamza Jakha ¹,*, Souad El Houssaini ¹, Mohammed-Alamine El Houssaini ², Souad Ajjaj ³ and Abdelali Hadir ⁴

¹ ELIITS Laboratory, Department of Computer Science, Faculty of Sciences, Chouaib Doukkali University, El Jadida 24000, Morocco

² Higher School of Education and Training, Chouaib Doukkali University, El Jadida 24000, Morocco

³ Department of Computer Science, Faculty of Sciences, Ibn Tofail University, Kenitra 14000, Morocco

⁴ National School of Commerce and Management, Hassan II University, Casablanca 20000, Morocco

* Author to whom correspondence should be addressed.

Appl. Syst. Innov. 2025, 8(4), 104; https://doi.org/10.3390/asi8040104

Submission received: 20 May 2025 / Revised: 24 July 2025 / Accepted: 25 July 2025 / Published: 28 July 2025

(This article belongs to the Topic Social Sciences and Intelligence Management, 2nd Volume)

Abstract

Social network platforms strongly influence the development of companies by shaping clients’ behaviors and sentiments, which directly affect corporate reputations. Analyzing this feedback has become an essential component of business intelligence, supporting the improvement of long-term marketing strategies on a larger scale. The implementation of powerful sentiment analysis models requires a comprehensive and in-depth examination of each stage of the process. In this study, we present a new comparative approach for several feature extraction techniques, including TF-IDF, Word2Vec, FastText, and BERT embeddings. These methods are applied to three multilingual datasets collected from hotel review platforms in the tourism sector, in English, French, and Arabic. The datasets were preprocessed through cleaning, normalization, labeling, and balancing before being used to train various machine learning and deep learning algorithms. The effectiveness of each feature extraction method was evaluated using accuracy, F1-score, precision, recall, ROC AUC, and a new metric that measures the execution time for generating word representations. Our experiments achieve accuracy rates of approximately 99% for the English dataset, 94% for the Arabic dataset, and 89% for the French dataset. These findings confirm the important impact of vectorization techniques on the performance of sentiment analysis models. They also highlight the relationship between balanced datasets, effective feature extraction methods, and the choice of classification algorithms. This study therefore aims to simplify the selection of feature extraction methods and appropriate classifiers for each language, thereby contributing to advancements in sentiment analysis.

Keywords:

sentiment analysis; business intelligence; feature extraction; BERT; machine learning; deep learning; balanced datasets

1. Introduction

The exponential growth of data generated through websites, social media platforms, online forums, and other digital sources has created substantial opportunities for businesses and decision-makers to refine and innovate their marketing strategies. Effectively leveraging this wealth of information, however, demands advanced processing techniques, particularly those offered by natural language processing (NLP), a subfield of artificial intelligence that focuses on enabling machines to interpret and interact with human language [1].

One key application of NLP is sentiment analysis, also known as opinion mining, which involves examining textual, audio, or visual content to determine the underlying sentiment: positive, negative, or neutral. This technique is crucial in business intelligence, customer relationship management, and strategic marketing, especially in industries such as tourism and hospitality, where understanding customer feedback is essential.

Most sentiment analysis research still focuses on English and overlooks the linguistic diversity present in many regions, despite the increasing significance of that diversity. In Morocco, for example, Arabic is the native language, while French is spoken by roughly 13 million people (around 35% of the population; https://www.moroccoworldnews.com/2019/03/268475/international-francophonie-day-moroccan-french, accessed on 24 November 2024), and English usage has expanded to approximately 30% according to the 2021 British Council report (https://www.britishcouncil.ma/sites/default/files/shift_to_english_in_morocco_16042021_v2.pdf, accessed on 24 November 2024). Given this multilingual landscape, it is essential to consider all three languages when developing effective sentiment analysis tools tailored to the Moroccan context.

Sentiment analysis involves several processing stages, among which feature extraction plays a fundamental role. This step converts text into numerical representations that can be interpreted by machine learning models. A wide range of techniques has been developed for this purpose, from traditional methods like bag of words and TF-IDF [2,3,4], which rely on statistical frequency-based models, to more advanced approaches such as word embeddings, which capture semantic relationships between words [5,6]. Recent transformer-based models go even further by generating deep contextual representations of language [7,8].

In this study, we investigate how various feature extraction methods perform across three languages: Arabic, French, and English, using balanced datasets to ensure fair comparative analysis. We apply four feature extraction techniques: TF-IDF, Word2Vec, FastText, and BERT, and assess their effectiveness in combination with multiple machine learning and deep learning classifiers. Specifically, this work aims to:

Use balanced multilingual datasets to ensure comparability across English, French, and Arabic;
Apply three categories of feature extraction methods and combine them with both traditional and deep learning classification models;
Evaluate model performance across several metrics, including a novel consideration of execution time during embedding generation, allowing us to examine the relationship between vectorization method, classifier, dataset type, and computational efficiency;
Provide practical guidance for selecting the most appropriate embedding and classification techniques for multilingual sentiment analysis tasks.

The remainder of this paper is structured as follows: We begin with a review of related work in sentiment analysis across the three languages. We then present the methodology used in our study, followed by a detailed analysis of the experimental results. Finally, we conclude with key findings and perspectives for future work.

2. Literature Review

Sentiment analysis has been applied across various economic sectors, particularly in tourism, where analyzing client reviews of hotel experiences helps make decisions to enhance service quality. Developing effective predictive models for sentiment analysis requires a comprehensive examination of each stage in the process. Feature extraction and artificial intelligence algorithms are important steps that significantly impact the accuracy of the final predictive results. Numerous studies, as summarized in Table 1, rely on a single feature extraction method, such as TF-IDF or Word2Vec, often applied to one or two datasets, and trained with a limited set of machine learning or deep learning algorithms. However, there has been little exploration of the performance of these feature extraction techniques when combined with various machine learning, deep learning, and transformer models across multiple datasets in different languages.

In ref. [9], the authors analyze the performance of various machine learning algorithms for sentiment analysis, focusing on an English language dataset labeled into two categories: positive and negative opinions on news articles. The best model achieves an accuracy of 98.02% by using traditional TF-IDF feature extraction combined with the naïve Bayes (NB) algorithm.

In another study [10], researchers compare three feature extraction techniques: bag of words (BOW), TF-IDF, and Word2Vec across several machine learning algorithms. The results show that TF-IDF with support vector machine (SVM) is more effective than the other techniques, achieving an accuracy of 71%, compared to 56% for Word2Vec and SVM. These techniques were applied to an English language Twitter dataset related to COVID-19.

In ref. [11], the authors classify sentiment using six combined English datasets, leveraging bidirectional encoder representations from transformers (BERT) as an embedding technique and comparing it with another embedding technique, Word2Vec. The results indicate that combining BERT with various deep learning algorithms yields strong performance, with BERT and CNN achieving the highest accuracy of 93%.

Sentiment analysis in the Arabic language has made significant progress in recent years. In ref. [12], various feature extraction techniques, including TF-IDF and AraVec, are compared for Arabic sentiment analysis across both standard and dialectal Arabic datasets. TF-IDF combined with logistic regression achieves an accuracy of 82.10%, outperforming AraVec with SVM, which achieves 71.47%. The lower accuracy of AraVec is attributed to its dataset, which contains a blend of comments written in both styles, complicating the semantic capture of relationships between words.

Another comparison of feature extraction techniques between word embedding methods, Word2Vec, global vector for word representation (GloVe), and FastText, was conducted by [13] on a multi-dialect Arabic dataset to classify positive and negative reviews. The combination of FastText with NuSVC classifiers demonstrated superior performance compared to other combinations.

Ref. [14] proposes a hybrid feature extraction and selection approach that combines term document matrix (TDM), term frequency matrix (TFM), and supervised lexicon weighting, along with the K-nearest neighbors (KNN) classifier. This method demonstrates high accuracy compared to other models in experiments conducted on an Arabic dataset, which includes reviews from a mobile store application.

Focusing on Arabic sentiment analysis, ref. [15] compares several classifier models to their approach, which is based on TF-IDF and ensemble classifiers of machine learning algorithms, to enhance sentiment analysis of hotel customer reviews. This model achieves an accuracy of 89.2%, demonstrating high performance on the hotel datasets used in the study.

Research in French language processing, particularly in sentiment analysis and text classification, has seen limited progress. In ref. [16], researchers use a French dataset of customer reviews from an e-commerce platform to analyze sentiment. This study employs TF-IDF alongside neural network models such as RNN and LSTM, achieving precision between 73% and 77% over two iterations.

In ref. [17], a French dataset containing restaurant and museum reviews was used to analyze sentiments at the aspect level. The study applied various BERT variants for word representation alongside multiple classifiers to compare their effectiveness. The combination of FlauBERT with a pretrained fully connected model (PTM-FC) achieved the highest accuracy, reaching 84.68%.

3. Methodology

Our methodology begins with data extraction, followed by a series of data cleaning and preprocessing steps. Once the data have been prepared, they are labeled. The subsequent stages include feature extraction, dataset balancing, and the application of various machine learning algorithms, including support vector machines (SVM), naive Bayes (NB), and ensemble learning approaches. In parallel, we implement deep learning techniques such as bidirectional long short-term memory (BiLSTM), gated recurrent units (GRU), and hybrid architectures combining BiLSTM with convolutional neural networks (CNN). Furthermore, we incorporate pretrained models, specifically BERT, for word embedding, which are then integrated with the same deep learning architectures to enhance performance (Figure 1).

3.1. Data Collection

In this study, we utilize three datasets in three different languages, obtained from two sources: TripAdvisor for English reviews, and Booking.com for French and Arabic reviews. All three datasets consist of customer reviews related to hotel facilities, which can assist hotel decision-makers in enhancing their services and offerings.

3.1.1. English Dataset

The English dataset for this study was obtained from Hugging Face (https://huggingface.co/datasets/ashraq/hotel-reviews, accessed on 25 October 2024), an open-source platform that supports collaboration on a vast array of models, datasets, and applications. It consists of 93,757 hotel reviews in English and other languages, all obtained from the TripAdvisor platform. We retained only the English reviews, totaling 88,686 entries, or 94.6% of the total.

3.1.2. French Dataset

The second dataset consists of French language reviews obtained from Booking.com using Python 3.12 tools. These reviews were scraped from various IBIS hotels across Moroccan cities, including Rabat, El Jadida, and others, and the results were combined into a single CSV file. Although the dataset contains reviews in multiple languages, we focused on the French reviews, as they represent the majority. In total, we collected 9918 reviews, providing valuable information on customer opinions about IBIS hotels in Morocco.

3.1.3. Arabic Dataset

The third dataset contains Arabic reviews, mostly written in Moroccan Arabic dialect and Standard Arabic. The reviews were collected using the same method as the French dataset, web scraping from the Booking.com website for IBIS hotels. In order to increase the dataset’s size, additional reviews were sourced from various hotels in El Jadida and Marrakech. In total, this dataset contains 4280 Arabic reviews.

3.2. Data Processing

The collected data must go through several preprocessing steps for cleaning and normalization. The process begins with tokenization, followed by the removal of noise such as URLs, HTML tags, stop words, and punctuation. Stemming and lemmatization techniques are then applied to further normalize the text. For both the English and French datasets, as shown in Table 2, we follow the same preprocessing steps: first, we convert all text to lowercase, then perform word-by-word tokenization. Next, we remove stop words, punctuation, URLs, HTML tags, and numbers. Finally, we apply stemming using the Porter Stemmer as the word normalizer.
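The cleaning steps for the English and French reviews can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the stop-word set is a tiny hypothetical sample, the function name `preprocess` is our own, URLs and HTML tags are stripped before tokenization so that their fragments do not leak into the tokens, and the final Porter stemming step is omitted because it would require an external stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word sample; a real run would use a
# complete stop-word list for each language.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "at", "of",
              "le", "la", "les", "et", "est", "de"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_RE = re.compile(r"<[^>]+>")
# Runs of letters only: this drops punctuation, digits, and underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def preprocess(text):
    """Lowercase, strip URLs and HTML tags, tokenize, and remove stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = HTML_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("The room was GREAT!!! <br> Visit https://example.com, 2 nights."))
```

Because the token pattern matches Unicode letters, the same function handles accented French words without extra configuration.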

The Arabic dataset as presented in Table 3 requires special processing due to the unique structure of the language. First, we remove the “Tashkil,” which represents vowel marks in Arabic. For example, the phrase ‘إِنَّ هَذَا نَصٌّ بِالتَّشْكِيلِ’ (meaning “Indeed, this is a text with vowel marks”) becomes ‘إن هذا نص بالتشكيل’ (without vowel marks). We also normalize certain letters by converting variations like ‘إأآا’ into ‘ا’. Additionally, we remove the “Tatwil” (elongation marks), so ‘وشــــــــــــــــكــــــــــــــــراً’ becomes ‘وشكراً’ (meaning "thank you").

We eliminate repeated letters, such as changing ‘راااااااائع’ to ‘رائع’ (meaning ‘wonderful’). Moreover, we remove Arabic stop words while keeping negation terms like {“لا”, “ليس”, “لم”, “لن”, “ما”, “بلا”, “بدون”}. Finally, we remove punctuation from the text. For normalization, we apply stemming techniques. In Arabic, there are several options, such as Khoja, Light10, Tashaphyne, ISRI, FARASA, QARAB, and MADAMIRA. In our work, we use two techniques: Tashaphyne and ISRIStemmer. This choice was motivated by the need to balance computational efficiency and effective preprocessing across all three languages studied. Although these stemmers do not perform deeper morphological analysis such as lemmatization or disambiguation, they have proven sufficient to reduce surface-level variability in Arabic text and ensure consistency with the multilingual approach.
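The Arabic normalization rules described in this subsection can be sketched with a few regular expressions. This is an illustrative sketch, not the authors' implementation: the function names are our own, the Unicode ranges are our assumption of what "Tashkil" and "Tatwil" cover, collapsing is applied only to runs of three or more identical characters so that legitimately doubled letters survive, and stemming is left out.

```python
import re

TASHKIL_RE = re.compile("[\u064B-\u0652]")             # fathatan ... sukun (vowel marks)
TATWEEL = "\u0640"                                      # elongation character
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
REPEAT_RE = re.compile(r"(.)\1{2,}")                    # one character repeated 3+ times

# Negation terms listed in the text; they are kept even if they are stop words.
NEGATIONS = {"لا", "ليس", "لم", "لن", "ما", "بلا", "بدون"}


def normalize_arabic(text):
    """Remove vowel marks and elongation, unify alef forms, collapse long repeats."""
    text = TASHKIL_RE.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS_RE.sub("\u0627", text)
    return REPEAT_RE.sub(r"\1", text)


def remove_stop_words(tokens, stop_words, keep=NEGATIONS):
    """Drop stop words, except those that carry negation."""
    return [tok for tok in tokens if tok not in stop_words or tok in keep]
```

Keeping the negation terms matters for sentiment: dropping a word such as “ليس” would turn a negative statement into an apparently positive one.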

3.3. Data Labeling

Data labeling, also known as data annotation, is a critical step in the sentiment analysis process, essential for enabling accurate decision-making. It involves labeling data to prepare them for supervised learning algorithms. There are several methods available for labeling data; one option is manual annotation by experts, which, while precise, can be both costly and time-consuming. Alternatively, automated tools, such as TextBlob and VADER, or pre-trained models like BERT and RoBERTa, can be used to streamline the process and minimize human error.

The English dataset is labeled using the TextBlob 0.17.1 library in Python 3.12. TextBlob is a simple, powerful tool for processing textual data and offers solutions to various natural language processing tasks, such as translation, classification, sentiment analysis, and more [18]. TextBlob relies on a rule-based lexicon approach called Pattern Analyzer. This method assigns a polarity score to each review based on the presence of predefined positive and negative words and their grammatical structure. Unlike the machine learning and deep learning classifiers used in our study, which are trained to learn features from data distributions, TextBlob does not rely on statistical learning or model training. In our dataset, each review is classified into one of two categories: negative, with 42,906 reviews, and positive, with 45,732 reviews. The categories are well balanced, with 48.40% of the reviews being negative and 51.60% positive (Figure 2). Although this method enables efficient automatic labeling, it does not rely on supervised learning or deep contextual analysis, and the resulting labels may reflect surface-level sentiment patterns rather than nuanced human interpretation. To address this, we manually reviewed a sample of labeled comments to verify the general polarity consistency and observed a satisfactory alignment with the expected sentiment in the reviews.
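To make the idea of rule-based lexicon labeling concrete, the following toy sketch scores a review from a small polarity lexicon. It is not TextBlob's analyzer: the lexicon, the negation handling (a negator flips only the word immediately after it), the function names, and the threshold of 0 for the positive class are all assumptions made for illustration; the text does not state which threshold was used.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1, 1].
LEXICON = {"great": 0.8, "clean": 0.4, "friendly": 0.5,
           "dirty": -0.6, "rude": -0.7, "terrible": -1.0}
NEGATORS = {"not", "never", "no"}


def polarity(tokens):
    """Average the polarity of known words; a negator flips the next word."""
    scores = []
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in LEXICON:
            score = LEXICON[tok]
            scores.append(-score if negate else score)
        negate = False
    return sum(scores) / len(scores) if scores else 0.0


def label(tokens, threshold=0.0):
    """Map a polarity score to a binary label (threshold is an assumption)."""
    return "positive" if polarity(tokens) > threshold else "negative"


print(label(["great", "friendly", "staff"]))
print(label(["room", "not", "clean"]))
```

No training is involved: the output depends only on the lexicon and the rules, which is why such labels capture surface-level sentiment patterns.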

The French and Arabic datasets are annotated based on reviews classified by the users themselves as either “liked” or “unliked” comments, with verification through the associated ratings. Additionally, a manual check was conducted on those datasets to confirm that the assigned labels were generally consistent with the expressed sentiment. In the French dataset, there are 6302 positive reviews, representing 63.54% of the total, compared to 3616 negative reviews, which make up 36.46%. In the Arabic dataset, we have 2686 positive reviews and 1594 negative reviews, accounting for 62.76% and 37.24% (Figure 2), respectively. These results indicate that the distribution of positive and negative categories in both the French and Arabic datasets is not balanced.

3.4. Feature Extraction

Feature extraction is a pivotal process in machine learning and natural language processing, involving the conversion of textual data into numerical vectors that can be efficiently handled by machine learning algorithms. The primary goal of this process is to reduce the dimensionality of high-dimensional data into a set of low-dimensional features. This transformation is essential for enhancing the performance and effectiveness of various algorithms in processing and analyzing textual information. Various techniques are utilized for this purpose, including TF-IDF, word embeddings, and BERT embeddings.

3.4.1. TF-IDF

Term frequency–inverse document frequency (TF-IDF) is a statistical technique for feature extraction that represents an improvement over the bag of words by assessing the importance of a word both within a document and across the entire corpus [19].

TF-IDF is the product of term frequency (TF), which measures how often a word appears in a document, and inverse document frequency (IDF), which measures the importance of a word in a document relative to its occurrence across the entire corpus. TF-IDF is calculated by the following formulas:

w_{x, y} = TF_{x, y} \times IDF_{x}

IDF_{x} = \log\left(\frac{N}{df_{x} + 1}\right)

where w_{x, y} is the weight of term x in document y, TF_{x, y} is the term frequency of x in document y, N is the total number of documents in the corpus, and df_{x} represents the number of documents that contain the term x.
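The weighting defined above can be computed directly. The sketch below is a minimal standard-library illustration, assuming that TF is the raw count of a term in a document and that the logarithm is natural; the function name `tf_idf` and the toy corpus are our own. Library implementations typically use slightly different smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    with weight = TF * log(N / (df + 1))."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)            # raw term counts (assumption)
        weights.append({t: tf[t] * math.log(n_docs / (df[t] + 1)) for t in tf})
    return weights


corpus = [["clean", "room"], ["dirty", "room"],
          ["clean", "clean", "pool"], ["nice", "staff"]]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

With this form of IDF, a term that occurs in every document receives a slightly negative weight, since log(N / (N + 1)) is below zero, and a term that occurs in all but one document receives zero.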

3.4.2. Word Embedding

Word embedding is a widely used method in deep learning for generating vector representations of words. It is especially valued in sentiment analysis and classification tasks for its ability to capture both syntactic and semantic relationships between words [20].

The principle of word embedding is based on the idea that words with similar contexts will be represented by similar vectors in a vector space. For example, the words “man” and “woman” often appear in similar contexts and are represented by closely related vectors.

Similarly, the words “queen” and “king” are also close to each other and share a semantic relationship with the words “man” and “woman”. With this vector representation, arithmetic operations can be performed between word vectors, such as “king − man + woman ≈ queen”.
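The vector arithmetic behind such analogies, in its usual form vec(king) − vec(man) + vec(woman) ≈ vec(queen), can be illustrated with hand-made vectors. The two-dimensional vectors below are hypothetical values chosen for the example (roughly "royalty" and "gender" axes), not learned embeddings, and the helper names are our own.

```python
import math

# Hypothetical 2-D vectors; real embeddings have hundreds of dimensions.
VECTORS = {
    "man": [0.1, 1.0],
    "woman": [0.1, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "hotel": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))
```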

Among the most widely used feature extraction methods are Word2Vec, GloVe, and FastText. In this paper, we are applying two of these techniques, Word2Vec and FastText, to our datasets as shown in Table 4.

Word2Vec

The Word2Vec model was proposed by Mikolov et al. as a vector representation approach that can capture the syntactic and semantic meaning of natural language. It reflects the relationships between words by placing them in the same vector space [21]. Word2Vec offers two architectures for generating word embeddings: continuous bag of words (CBOW) (Figure 3) and skip-gram (Figure 4). Both are based on a three-layer neural network.

Word2Vec was utilized across the three datasets in our study. For the English dataset, we employed the standard pre-trained Word2Vec model based on the Google News corpus, which consists of 300-dimensional English word vectors (https://code.google.com/archive/p/word2vec/, accessed on 13 November 2024). In the case of the French dataset, we used frWac2Vec (https://fauconnier.github.io/, accessed on 13 November 2024) from the frWac corpus, constructed from websites with a .fr domain, featuring 200-dimensional word vectors. For the Arabic dataset, we extracted features using the ArWordVec model [22], which was developed from a substantial collection of tweets covering various subjects, resulting in 300-dimensional vectors.

FastText

FastText is an embedding model developed by Facebook researchers in 2016, with pre-trained vectors available for multiple languages, including English, French, and Arabic. It extends the Word2Vec model by considering subword information, allowing it to better capture the meanings of rare or misspelled words [23].

FastText is an advancement of Word2Vec that produces high-quality word vectors by breaking each word down into character n-grams. Unlike Word2Vec, which treats each word as a single indivisible unit, FastText represents each word as a bag of character n-grams. This approach allows for better capturing of suffixes, prefixes, and other linguistic and lexical forms.

For example, with trigrams, the word “Hello” is decomposed into “<He”, “Hel”, “ell”, “llo”, and “lo>”, where “<” and “>” mark the beginning and end of the word, respectively.
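The trigram decomposition of “Hello” can be reproduced with a few lines of code. This sketch only shows how the character n-grams are generated, with “<” and “>” as boundary markers; the function name is our own, and the FastText library itself combines several n-gram lengths and maps them to vectors, which is not shown here.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("Hello"))  # ['<He', 'Hel', 'ell', 'llo', 'lo>']
```

Because a misspelled or unseen word still shares most of its n-grams with known words, its vector can be built from those shared pieces.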

We applied the FastText model to the three datasets. This model supports a wide range of languages, including English, French, and Arabic, and it is pre-trained on Common Crawl and Wikipedia with a vector dimension of 300 (https://fasttext.cc/docs/en/crawl-vectors.html, accessed on 13 November 2024). The model then generated vector representations, which served as inputs to the neural networks we developed.

3.4.3. BERT Embedding

BERT embeddings represent the words in a sentence as numerical, multidimensional vectors. These vectors capture both the meaning and structure of the words.

BERT is a pre-trained natural language processing model built on the transformer architecture, developed by Google to capture the bidirectional context of words within sentences. Unlike static embedding models such as Word2Vec and GloVe, BERT represents words contextually [24], meaning that each word is represented differently depending on the surrounding words in a sentence.

BERT’s architecture (Figure 5) starts with the tokenization of words into smaller units, or subwords, using a method called WordPiece. This approach is applied when words are not part of BERT’s vocabulary. For example, the word “loving” could be split into “lov” and “##ing” if the model does not recognize the full word. Additionally, BERT uses two special tokens: [CLS] at the beginning of each sentence, which is used for classification tasks, and [SEP], which marks the end of sentences and helps distinguish between them. After tokenization, the text is processed by the BERT model, which has three main components: bidirectional encoding, multi-layered transformers, and a masked language model. Bidirectional encoding considers the context of words from both directions, helping to understand each word’s meaning in its context. The multi-layered transformers apply a self-attention mechanism, allowing each word to connect with every other word in the sentence. Finally, the masked language model predicts masked tokens to learn deep contextual relationships between words. The next stage is fine-tuning, where the model is adapted to our specific dataset. The final output from BERT is a set of contextual embeddings for each word in the sentence.
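The subword splitting step can be illustrated with a greedy longest-match-first tokenizer, which is the usual way WordPiece vocabularies are applied at inference time. The vocabulary below is a hypothetical toy set chosen so that “loving” splits into “lov” and “##ing”; real BERT vocabularies contain tens of thousands of entries, and the function names are our own.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"love", "lov", "##ing", "##e", "hotel", "##s"}


def wordpiece(word, vocab=VOCAB, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:          # no piece fits: the whole word is unknown
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def encode(sentence):
    """Tokenize a whitespace-split sentence and add the [CLS]/[SEP] markers."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word))
    return tokens + ["[SEP]"]


print(encode("loving hotels"))
```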

BERT offers several models in different languages. In this study, we focus on the English language using distilbert-base-uncased (https://huggingface.co/distilbert/distilbert-base-uncased, accessed on 17 November 2024), a distilled version of BERT. For Arabic, we use asafaya-bert-base-arabic (https://huggingface.co/asafaya/bert-base-arabic, accessed on 17 November 2024), a pre-trained BERT base model for Arabic. For French, we work with CamemBERT (https://huggingface.co/almanach/camembert-base, accessed on 17 November 2024), a BERT adaptation designed specifically for the French language. Although BERT models are pre-trained on raw text, we applied the same preprocessing steps used for other feature extraction techniques. This unified preprocessing strategy was adopted to ensure consistency and fairness in the comparative evaluation across all models. We note, however, that this approach may diverge from BERT’s original training conditions and could influence its contextual embedding behavior, particularly in morphologically rich languages.

3.5. Data Balancing

Balancing the classes in a dataset is an important step when building machine learning and deep learning models. Class imbalance occurs when certain classes are over-represented relative to others. The model then tends to focus on these majority classes, often overlooking the minority classes. As a result, it performs well for the majority classes but poorly for the minority classes, leading to biased predictions and affecting the model’s overall accuracy.

Two techniques are commonly used to address this issue and balance datasets: under-sampling the majority class or over-sampling the minority class [25].

Under-sampling involves reducing the number of examples in the majority class to match the size of the minority class. While this technique helps balance the dataset, it has the disadvantage of discarding potentially valuable information from the majority class. A popular method of this kind is random under-sampling (RUS).

Over-sampling is the opposite of under-sampling; it increases the number of examples in the minority class, either by duplicating existing ones or by generating new ones. One widely used method is the synthetic minority over-sampling technique (SMOTE) [26], which creates synthetic minority examples by interpolating between existing minority examples and their nearest neighbors, and which is the technique we applied to balance our French and Arabic datasets (Figure 6). We retained the English dataset with its original distribution of examples, as the positive and negative classes exhibit no significant difference, containing 42,906 examples in the negative class and 45,732 in the positive class.
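The core idea of SMOTE, placing each synthetic example on the line segment between a minority example and one of its nearest minority neighbors, can be sketched as follows. This is a simplified standard-library illustration with our own function name, a small default k, and a fixed seed; it is not the implementation used in the study, which would normally come from a dedicated library.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()         # position along the segment, in [0, 1)
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote(minority_points, n_new=3))
```

As a rough guide, and assuming the minority class were raised to the size of the majority class on the full datasets, the counts reported earlier would call for 6302 − 3616 = 2686 synthetic negative reviews in French and 2686 − 1594 = 1092 in Arabic.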

3.6. Machine and Deep Learning Algorithms

In this study, we employed multiple machine learning and deep learning algorithms combined with word representations based on TF-IDF, word embedding models such as Word2Vec and FastText, or BERT embeddings. We selected TF-IDF for use with machine learning algorithms such as SVM, random forest, and ensemble classifiers, which integrate several algorithms. TF-IDF down-weights terms that appear in most documents and emphasizes terms that are distinctive for a given review, providing discriminative vectors that help algorithms like SVM and random forest achieve higher classification accuracy.

Support vector machine (SVM) is a supervised machine learning algorithm commonly used for both classification and regression tasks [27], handling both linear and nonlinear data distributions. The primary objective of SVM is to identify a hyperplane that best separates data points into distinct classes within a multidimensional space by maximizing the margin between this hyperplane and the closest data points, known as support vectors. In this study, we use SVM for binary classification, where the labels 1 and -1 represent the positive and negative classes, respectively. The SVM algorithm separates these two classes by finding a hyperplane, represented by

w \cdot x + b = 0

that maximizes the margin between them. To achieve this, SVM applies the constraint

y_{i} (w \cdot x_{i} + b) \geq 1

for each data point while minimizing the function

\frac{1}{2} \|w\|^{2}

to ensure an optimal classification. Here, w represents the weight or coefficient vector, x is the feature vector, b is the bias term, and y_{i} is the label for each data point.
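The constraint and objective above can be checked numerically on a toy example. The sketch below does not train an SVM; it only verifies whether a given hyperplane satisfies y_i (w · x_i + b) ≥ 1 for every point and computes the resulting margin width 2 / ‖w‖. The data points, candidate hyperplanes, and function names are made up for illustration.

```python
import math


def decision(w, b, x):
    """Value of w . x + b for one point."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_margin(w, b, points, labels):
    """True if every point meets the constraint y_i (w . x_i + b) >= 1."""
    return all(y * decision(w, b, x) >= 1 for x, y in zip(points, labels))


def margin_width(w):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]

# Two candidate hyperplanes that both separate the data.
for w, b in [((0.5, 0.5), -1.0), ((1.0, 1.0), -2.0)]:
    print(w, b, satisfies_margin(w, b, points, labels), round(margin_width(w), 3))
```

Both candidates satisfy the constraint, but the first has the smaller ‖w‖ and therefore the wider margin, which is why minimizing ½‖w‖² selects it.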

Random forest is a supervised machine learning algorithm that can be used for both classification and regression tasks. It works by creating several independent decision trees from different subsets of the training data [28]. For classification tasks, the random forest model decides if an instance belongs to the positive or negative class by analyzing various data samples and features to build these trees. The final prediction comes from the majority vote among all the trees, which determines the class that the instance belongs to.

Stacking ensemble learning is a robust technique in machine learning that enhances predictive accuracy by combining outputs from multiple models, known as base models, trained on the same dataset. A higher-level model, known as the meta model [29], then aggregates these predictions to produce the final output, improving performance in both classification and regression tasks. In this study, we employed ensemble classifiers (EC) with base models, including a random forest, support vector machine (SVM), and multi-layer perceptron, and combined their predictions using a simple logistic regression model to generate the final prediction.

TF-IDF was combined with three machine learning techniques, SVM, random forest, and stacking (Figure 7), across the three datasets to compare their performance against each other and other embedding and deep learning techniques.

Word embeddings and BERT embeddings are used with deep learning algorithms like BiLSTM and GRU (Figure 8). This combination enhances the interpretation of text by providing contextual understanding and improves the processing of textual data while maintaining the semantic relationships between words. The deep learning algorithms applied in this study show better performance and are highly regarded for their effectiveness in processing textual data.

A bidirectional long short-term memory (BiLSTM) network is an enhancement of the traditional LSTM. It combines two layers: one processes the input data from start to end (forward direction), while the other processes it from end to start (backward direction). This architecture gives the network context from both the past and the future [30], which helps it make effective predictions in natural language tasks.
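The two-direction idea can be sketched in a few lines. To stay short, the example replaces the LSTM cell with a simplified scalar tanh recurrence (an assumption, with arbitrary weights), so it only illustrates how forward and backward states are paired at each position, not the gating inside an LSTM.

```python
import math


def run_direction(inputs, w_in=0.5, w_rec=0.8):
    """Run a simplified recurrent cell over the sequence and return every state."""
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional(inputs):
    """Pair, for each position, the forward state with the backward state."""
    forward = run_direction(inputs)
    # Process the reversed sequence, then re-align the states to the original order.
    backward = run_direction(inputs[::-1])[::-1]
    return list(zip(forward, backward))


sequence = [1.0, 0.0, -1.0]
states = bidirectional(sequence)
print(states)
```

At every position, the first value summarizes the tokens seen so far and the second summarizes the tokens still to come.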

The gated recurrent unit (GRU) is a type of RNN architecture introduced by Cho et al. in 2014. Like the LSTM architecture, it is designed to address the vanishing gradient problem associated with traditional RNNs. GRU simplifies the LSTM architecture by reducing the number of parameters, making it easier to compute and often faster to train. GRUs, like LSTMs, are designed to retain relevant information over long or short periods, which is beneficial for modeling sequence data where context is important. The key difference between GRUs and LSTMs lies in their structure: LSTMs utilize three gates (input, forget, and output), while GRUs operate with just two gates, the update and reset gates [31].
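A minimal scalar sketch of one GRU step may help make the two gates concrete. The parameter names and values are hypothetical, and the final interpolation follows one common convention in which an update gate near 0 keeps the previous state; some formulations swap the roles of z and 1 - z.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h_prev, p):
    """One scalar GRU step; p maps (hypothetical) weight names to floats."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate controls how much of the past is used.
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # The update gate interpolates between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand


params = {"wz": 0.4, "uz": 0.3, "bz": 0.0,
          "wr": 0.2, "ur": 0.1, "br": 0.0,
          "wh": 0.7, "uh": 0.5, "bh": 0.0}
h_next = gru_step(1.0, 0.3, params)
print(h_next)
```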

A convolutional neural network (CNN) is a class of deep learning algorithms originally designed for image recognition and computer vision. It learns to recognize patterns in images by mimicking the human visual system [32]. CNNs have also proven successful in NLP, where they provide significant improvements in tasks such as sentence classification and sentiment analysis. A CNN is built on two principal layer types: convolutional layers and pooling layers. Convolutional layers apply a set of filters (or kernels) to the input data, with each filter detecting specific features, such as edges, textures, or patterns. The pooling layer reduces the spatial dimensions of the feature map while retaining the most significant features.
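For text, the convolution and pooling steps operate along one dimension. The sketch below applies a single hand-picked filter to a toy numeric sequence and then max-pools the result; in a real model the inputs would be embedding vectors and the filter weights would be learned, so the values here are assumptions for illustration. As in most deep learning frameworks, the filter is slid without flipping.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (valid positions only, no padding)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def max_pool1d(features, size):
    """Keep the largest value in each non-overlapping window."""
    return [max(features[i:i + size]) for i in range(0, len(features) - size + 1, size)]


signal = [1, 3, 2, 5, 4, 6, 0]
feature_map = conv1d(signal, [1, -1])  # responds to drops between neighbours
pooled = max_pool1d(feature_map, 2)    # halves the length, keeps the strongest response
print(feature_map, pooled)
```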

In this study, a CNN is combined with a BiLSTM network. This hybrid approach enables the CNN to detect local sentiment-related patterns (positive or negative) within text sequences, while the BiLSTM considers the broader context to interpret the significance of these patterns within the overall text. This combination enhances classification accuracy by leveraging both localized sentiment detection and contextual understanding.

4. Results and Discussion

In this section, we present the outputs of our models, which initially combine the TF-IDF technique with various machine learning algorithms. Following this, we apply multiple word embedding models and BERT embeddings with different deep learning algorithms. To evaluate the results obtained from our models, we use several evaluation metrics that offer an overview of model performance. These metrics quantify the model’s effectiveness in prediction or classification tasks. The models used in this study are assessed using the following metrics: precision, recall, F1-score, accuracy, AUC-ROC, and execution time for generating word representations.

4.1. Evaluation Metrics

A confusion matrix is a structured table that illustrates the performance of a classification model on a dataset with known true values. It classifies predictions into four distinct categories: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).

The classification report provides detailed insights into the precision, recall, F1-score, and support for each class. It is very useful for understanding the performance of the classifier across different classes.

Accuracy measures the overall correctness of a model by determining the ratio of correct predictions to the total number of instances. It is calculated by:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision is the accuracy of positive predictions. It is represented as the ratio of true positive predictions to the total number of positive predictions made. It is defined by:

Precision = \frac{TP}{TP + FP}

Recall, or sensitivity, measures the ability of a model to identify all relevant instances. It is the ratio of true positive predictions to all positive samples. It is represented by:

Recall = \frac{TP}{TP + FN}

F1-score or F1-measure is the harmonic mean of precision and recall, providing a balance between the two metrics. The value of F1-score is a number between 0 and 1, and it is calculated by this formula:

F1\text{-score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}
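The four formulas above can be checked numerically with a small helper. The confusion-matrix counts below are hypothetical and are not taken from the experiments; the function also assumes that none of the denominators is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 100 test reviews.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and the F1-score is 80/95.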

Area under the curve (AUC)–receiver operating characteristics (ROC) is a widely used method for evaluating classification model performance. The ROC curve illustrates the balance between the true positive rate (TPR) and the false positive rate (FPR). The TPR, also called recall, shows the percentage of actual positive cases correctly identified by the model. In contrast, the FPR shows the percentage of negative cases incorrectly classified as positive. The FPR is defined by:

FPR = \frac{FP}{TN + FP}

The AUC quantifies the overall area under the ROC curve, representing the model’s ability to distinguish between classes. An AUC value closer to 1 signals a highly effective model.
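One way to make the AUC concrete is to compute it as the fraction of positive-negative pairs in which the positive example receives the higher score, which equals the area under the ROC curve. The sketch below does this by brute force and also evaluates one ROC point; the labels, scores, and threshold are invented for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def roc_point(labels, scores, threshold):
    """Return (FPR, TPR) when scores at or above the threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return fp / (tn + fp), tp / (tp + fn)


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
fpr, tpr = roc_point(labels, scores, threshold=0.5)
print(auc, fpr, tpr)
```

The pairwise loop is quadratic in the number of samples, so it is suited to small examples; library implementations typically sort the scores instead.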

4.2. Experimental Findings

In this study, the datasets are divided into separate portions in accordance with the requirements of both machine learning and deep learning models, as presented in Table 5. For machine learning classifiers, each dataset is split into 80% for training and 20% for testing. For the deep learning algorithms, however, the division varies by dataset: the English dataset is divided into 80% for training, with 20% of the training set reserved for validation, and 20% for testing. The French dataset uses 85% for training, including 10% for validation, and 15% for testing. The Arabic dataset uses 80% for training, including 10% for validation, and 20% for testing.
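The arithmetic of such a split can be sketched as follows. The example reproduces the English-style split (20% test, then 20% of the remaining training portion for validation) on a hypothetical dataset of 1000 reviews; whether the validation share for the French and Arabic datasets is taken from the training portion or from the whole dataset is not assumed here.

```python
def split_sizes(n_samples, test_fraction, val_fraction_of_train):
    """Return the number of samples in the train, validation, and test portions."""
    n_test = round(n_samples * test_fraction)
    n_train_total = n_samples - n_test
    # The validation set is carved out of the training portion.
    n_val = round(n_train_total * val_fraction_of_train)
    return {"train": n_train_total - n_val, "validation": n_val, "test": n_test}


# Hypothetical dataset size; fractions follow the English split described above.
sizes = split_sizes(1000, test_fraction=0.20, val_fraction_of_train=0.20)
print(sizes)
```

Under these assumptions, 64% of the data is effectively used for fitting, 16% for validation, and 20% for testing.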

The hyperparameters used in this study are influenced by various factors, such as the size of the dataset and the nature of the language. Different parameters are employed for machine learning classifiers, as shown in Table 6, like the kernel for SVM and the number of estimators for RF. For deep learning algorithms, the hyperparameters are tuned for each configuration, such as the learning rate and batch size, as presented in Table 7.

Our research proposes a new comparative approach in which four feature extraction techniques are combined with three machine learning and deep learning algorithms and applied to three different datasets. The aim is to improve the effectiveness of statistical methods without context (TF-IDF), methods based on contextual similarity with fixed word vectors (Word2Vec and FastText), and methods based on global and dynamic context (BERT embeddings).

Table 8 presents the results of combining TF-IDF with SVM, RF, and EC classifiers. The EC model achieved the highest accuracy on both the English and Arabic datasets, with 98.6% and 94.1%, respectively, outperforming the other classifiers. For the French dataset, the SVM and EC classifiers demonstrated strong performance, achieving accuracies of 98.3% and 89.2%, respectively, with the precision of both models reaching 90%. The RF model showed the lowest performance across all three datasets, with accuracies of 96.2% for the English dataset, 93.3% for the Arabic dataset, and 88.8% for the French dataset. These results highlight the effectiveness of combining TF-IDF with a stacking ensemble learning approach, allowing different base models to leverage these features to capture diverse aspects and complex relationships in textual data, thereby achieving improved generalization compared to individual classifiers. As the English dataset was labeled using TextBlob, some alignment with TF-IDF-based models may exist. This does not affect the validity of our results but suggests that human-annotated data could offer a complementary evaluation perspective. Furthermore, the sentiment labels in the French and Arabic datasets were manually reviewed and showed strong face validity, reinforcing the overall reliability of the results.

Table 9 evaluates the performance of two word embedding models, Word2Vec and FastText, in conjunction with various deep learning algorithms. On the English dataset, the BiLSTM model achieved the highest accuracy and F1-score across both embedding techniques, exceeding 98.4% and 99%, respectively. Conversely, the hybrid BiLSTM–CNN model exhibited better compatibility with the embedding models compared to individual algorithms. For the French dataset, the frWac2Vec model delivered superior results with an accuracy of 85.9%, while for the Arabic dataset, ArWordVec achieved an accuracy of 88.3%. In contrast, the FastText model recorded an accuracy of 85% and 87.6% for the French and Arabic datasets, respectively. The study shows that in all three datasets, the Word2Vec approach performed better than the FastText model. This superior performance is likely attributed to the datasets being well cleaned, standardized, and balanced, enabling Word2Vec to effectively capture global relationships between words and the sentiments they express.

BERT, an advanced embedding technique, captures both the semantic and the contextual meanings of words. It was applied to our three datasets, as presented in Table 10. When used as a vectorization method for several deep learning algorithms, BERT embeddings demonstrated superior performance. In the Arabic dataset, the BiLSTM–CNN hybrid approach achieved the highest accuracy of 92%. Similarly, this hybrid approach outperformed the other models on the French dataset, reaching an accuracy of over 87%. For the English dataset, the BiLSTM algorithm delivered the best results among all tested models, achieving an accuracy of 89.3%. In contrast, GRU algorithms showed the lowest overall performance across the three datasets, with average accuracies not exceeding 89%. Although we did not evaluate the CNN component in isolation as we did with BiLSTM, thus presenting a partial ablation, the comparison with the standalone BiLSTM model shows that the hybrid BiLSTM–CNN configuration consistently outperforms it in both the Arabic and French datasets. This suggests that the CNN layer contributes additional local pattern recognition that complements the sequential modeling capability of BiLSTM.

Figure 9 compares the best-performing models of each feature extraction technique, that is, those achieving the highest accuracies in Table 8, Table 9 and Table 10, on the English and French datasets. TF-IDF combined with EC demonstrates the highest performance compared to Word2Vec, FastText, and BERT for both datasets. However, while BERT embeddings outperform Word2Vec and FastText on the French dataset, the reverse is observed for the English dataset, where Word2Vec and FastText show superior performance compared to BERT embeddings. A direct comparison of Word2Vec and FastText shows that Word2Vec slightly outperforms FastText across both datasets in all evaluation metrics used.

This variation in results between the two languages can be attributed to several factors. Firstly, the higher linguistic complexity of the French language compared to English enables models like BERT, which are context- and syntax-sensitive, to better capture these subtleties. Secondly, tasks requiring the identification of individual word importance, such as sentiment classification, allow statistical techniques like TF-IDF to achieve high precision. Also, the dataset size significantly influences model performance, with word embedding models excelling on larger datasets due to their ability to learn lexical and contextual similarities more effectively than on smaller datasets.

Building on the previous comparison of French and English datasets, the performance of the best models from each vectorization technique is compared. The results on the Arabic dataset, as shown in Figure 10, reveal that TF-IDF and BERT embeddings outperform word embedding models, achieving over 92% in both accuracy and F1-score metrics. In contrast, Word2Vec demonstrates higher performance compared to FastText. The superior performance of BERT and TF-IDF can be attributed to the morphological richness, dialectal variations, and contextual complexity of the Arabic language, which make these techniques more effective than others.

The models for each feature extraction technique were evaluated using the ROC AUC metric to measure how well they could distinguish between positive and negative classes, as shown in Figure 11 and Figure 12. The combination of TF-IDF and the EC classifier achieves the highest AUC scores across all three datasets, particularly excelling on the English dataset with an almost perfect score of 1.0. Other models also performed strongly, consistently exceeding an AUC of 0.9, which demonstrates their ability to effectively discriminate between classes.

The time required to generate the vector representations of words, which we use as a novel evaluation metric, varies across the techniques. The TF-IDF, Word2Vec, and FastText experiments were executed on an NVIDIA GeForce MX330 2GB GDDR5 GPU, while the BERT embeddings were generated in the Google Colab environment using an NVIDIA A100 40GB HBM2 GPU. The execution times, shown in Table 11, demonstrate that the BERT model consumes more time and resources than the other models across all three datasets. This is due to the transformer architecture, which is more complex than traditional models. Additionally, the TF-IDF method generates vectors that are more than 20 times smaller than those produced by Word2Vec and FastText because of its dependence on simple statistical calculations that do not require model training or learning processes, unlike the other techniques.
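A simple way to record such a timing is to wrap the vectorization call with a wall-clock timer. The sketch below does this around a bare-bones TF-IDF implementation; the TF-IDF variant (term frequency times log(N / document frequency)), the helper names, and the two sample documents, borrowed from the cleaned English examples in Table 2, are assumptions for illustration and do not reproduce the measurements in Table 11.

```python
import math
import time
from collections import Counter


def tfidf_vectors(documents):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok in vocabulary
        ])
    return vocabulary, vectors


def timed(func, *args):
    """Return the function's result together with its wall-clock duration in seconds."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


docs = ["room well appoint facil good", "stay london one night locat perfect"]
(vocab, vectors), seconds = timed(tfidf_vectors, docs)
print(f"{len(vectors)} vectors of size {len(vocab)} generated in {seconds:.6f} s")
```

Timings obtained this way are only directly comparable when the techniques run on the same hardware, which should be kept in mind when reading measurements taken on different GPUs.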

Although performance results are presented for English, French, and Arabic datasets, this study does not aim to compare the languages directly. Each language uses distinct pre-trained embeddings with different properties, such as vector dimensionality, corpus origin, and training quality. These variations may affect performance and introduce biases in interpretation.

The present study highlights the effectiveness of integrating both classical and deep learning approaches for multilingual sentiment analysis across Arabic, French, and English datasets. Our findings underscore the importance of selecting appropriate feature extraction techniques and classification models when addressing the linguistic diversity inherent in multilingual environments. By employing balanced datasets, this study ensures a fair comparative evaluation and emphasizes the need for language-sensitive sentiment analysis systems, particularly in culturally and linguistically diverse regions such as Morocco.

The experimental results reveal that traditional machine learning models, including SVM and RF, perform adequately when combined with conventional vectorization methods such as TF-IDF. However, more semantically rich embedding techniques such as Word2Vec, FastText, and BERT demonstrate superior classification performance, especially when paired with deep learning architectures like BiLSTM and GRU. Notably, hybrid models (BiLSTM–CNN) and BERT-based pipelines produce better results, confirming the advantage of context-aware representations in capturing nuanced sentiment patterns.

The observed variations in model performance across the three languages suggest that sentiment expression is not only syntactically different but also culturally contextual. These variations influence lexical usage, emotional intensity, and polarity, all of which affect model accuracy. In addition, all models were trained and tested on the same data splits, under identical experimental conditions, ensuring fair comparisons. Moreover, the superior results achieved by certain configurations were not marginal but consistently higher than those of other models across all datasets. This reinforces the practical relevance and reliability of the reported improvements. We also acknowledge the importance of applying formal statistical significance testing in scenarios where performance differences are less pronounced or experimental variability is higher.

The evaluation of execution time reveals that while transformer-based models offer higher accuracy, they demand substantially more computational resources. This trade-off between performance and computational efficiency must be considered when designing real-world applications, especially those requiring real-time analysis or deployment in resource-constrained environments.

For experts and developers working on multilingual sentiment analysis, especially in fields like digital marketing, customer relationship management, and social media monitoring, the research’s findings provide insightful information. The comparative framework presented here can help them choose the best embedding and classifier combinations based on computing limitations and language-specific requirements.

This study provides a foundation for real-time sentiment monitoring systems capable of capturing and interpreting public opinions expressed in multiple languages. Such systems can be particularly beneficial during critical periods marked by intense public engagement, such as political events, public health emergencies, or large-scale commercial campaigns, by offering timely insights into public perception and sentiment dynamics.

5. Conclusions and Future Work

The selection of language, feature extraction techniques, and classification algorithms plays an important role in building robust predictive models for sentiment analysis. In this study, we explore the relationships between these three elements and demonstrate the effectiveness of various combinations. The work is situated within the domain of business intelligence, with a specific focus on the tourism sector, particularly hotel reviews as a practical case study. We focus on the three primary languages used in Morocco, English, French, and Arabic, to target a broad multilingual population. The data were vectorized using multiple word representation techniques, including TF-IDF, Word2Vec, FastText, and BERT embeddings. To resolve class imbalance, the datasets were balanced using the SMOTE technique. These representations were then combined with various machine learning and deep learning classifiers.

The results were evaluated using multiple performance metrics such as accuracy, F1-score, precision, recall, ROC AUC curve, and a novel metric evaluating the execution time required to generate word representations. TF-IDF paired with ensemble classifiers consistently outperformed other models across all three datasets. BERT embeddings combined with hybrid classifiers, specifically BiLSTM and CNN, achieved high performance on French and Arabic datasets. Word2Vec and FastText showed strong results on the English dataset when used with BiLSTM algorithms; moreover, Word2Vec outperformed FastText in all experimental settings. These findings provide useful insights for choosing the optimal combination of feature extraction techniques and classification algorithms to develop effective sentiment analysis models.

Looking ahead, we plan to expand our research by incorporating other widely used languages in our region, such as Standard Moroccan Amazigh and Spanish. We also aim to experiment with alternative hybrid feature extraction techniques. Furthermore, we intend to deploy these models in real-time applications and develop a visual platform for analyzing customer feedback instantaneously. This advancement will not only enrich academic discussions in the field of natural language processing but also enhance the practical application of these technologies in business contexts, where understanding consumer sentiment is essential for making decisions. In addition, to ensure fair cross-language evaluations in future work, we propose using standardized multilingual embeddings, such as multilingual BERT, language-agnostic sentence representations (LASER), or multilingual unsupervised and supervised embeddings (MUSE), or training word embeddings on a unified multilingual corpus. This would allow better isolation of language-specific effects and minimize bias introduced by differences in embedding quality. Finally, benchmarking with expert-annotated sentiment datasets could help validate models’ performance against a more objective labeling standard.

Author Contributions

H.J.: conceptualization, methodology, software, writing—review and editing, writing—original draft, visualization, validation, preparation of datasets, data curation. S.E.H.: writing—review and editing, validation, formal analysis, conceptualization, supervision, and checking the revisions. M.-A.E.H.: visualization, methodology, data curation, and checking the revisions. S.A.: visualization, formal analysis, data curation. A.H.: visualization and validation. All authors have read and agreed to the published version of the manuscript.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Data Availability Statement

The English dataset used in this study is publicly available via Hugging Face at (https://huggingface.co/datasets/ashraq/hotel-reviews, accessed on 25 October 2024). The French and Arabic datasets used in this paper are not publicly available but can be provided by the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Lareyre, F.; Nasr, B.; Chaudhuri, A.; Di Lorenzo, G.; Carlier, M.; Raffort, J. Comprehensive Review of Natural Language Processing (NLP) in Vascular Surgery. EJVES Vasc. Forum 2023, 60, 57–63.
2. Mukherjee, P.; Badr, Y.; Doppalapudi, S.; Srinivasan, S.M.; Sangwan, R.S.; Sharma, R. Effect of Negation in Sentences on Sentiment Analysis and Polarity Detection. Procedia Comput. Sci. 2021, 185, 370–379.
3. Haris, N.A.K.M.; Mutalib, S.; Malik, A.M.A.; Abdul-Rahman, S.; Kamarudin, S.N.K. Sentiment classification from reviews for tourism analytics. Int. J. Adv. Intell. Inform. 2023, 9, 108.
4. Albahli, S. Twitter sentiment analysis: An Arabic text mining approach based on COVID-19. Front. Public Health 2022, 10, 966779.
5. Muhammad, P.F.; Kusumaningrum, R.; Wibowo, A. Sentiment Analysis Using Word2vec and Long Short-Term Memory (LSTM) For Indonesian Hotel Reviews. Procedia Comput. Sci. 2021, 179, 728–735.
6. Alharbi, A.; Kalkatawi, M.; Taileb, M. Arabic Sentiment Analysis Using Deep Learning and Ensemble Methods. Arab. J. Sci. Eng. 2021, 46, 8913–8923.
7. Talaat, A.S. Sentiment analysis classification system using hybrid BERT models. J. Big Data 2023, 10, 110.
8. Murfi, H.; Syamsyuriani Gowandi, T.; Ardaneswari, G.; Nurrohmah, S. BERT-based combination of convolutional and recurrent neural network for indonesian sentiment analysis. Appl. Soft Comput. 2024, 151, 111112.
9. Ahmed, J.; Ahmed, M. Classification, detection and sentiment analysis using machine learning over next generation communication platforms. Microprocess. Microsyst. 2023, 98, 104795.
10. Qi, Y.; Shabrina, Z. Sentiment analysis using Twitter data: A comparative application of lexicon- and machine-learning-based approach. Soc. Netw. Anal. Min. 2023, 13, 31.
11. Bello, A.; Ng, S.C.; Leung, M.F. A BERT Framework to Sentiment Analysis of Tweets. Sensors 2023, 23, 506.
12. Kasri, M.; Birjali, M.; Beni-Hssane, A. A Comparison of Features Extraction Methods for Arabic Sentiment Analysis. In Proceedings of the 4th International Conference on Big Data and Internet of Things, Rabat, Morocco, 23–24 October 2019; ACM: New York, NY, USA, 2019; pp. 1–6. Available online: https://dl.acm.org/doi/10.1145/3372938.3372998 (accessed on 6 November 2024).
13. Kaibi, I.; Nfaoui, E.H.; Satori, H. A Comparative Evaluation of Word Embeddings Techniques for Twitter Sentiment Analysis. In Proceedings of the 2019 International Conference on Wireless Technologies, Embedded and Intelligent Systems (WITS), Fez, Morocco, 3–4 April 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. Available online: https://ieeexplore.ieee.org/document/8723864/ (accessed on 6 November 2024).
14. Hadwan, M.; AAl-Hagery, M.; Al-Sarem, M.; Saeed, F. Arabic Sentiment Analysis of Users’ Opinions of Governmental Mobile Applications. Comput. Mater. Contin. 2022, 72, 4675–4689.
15. Hicham, N.; Karim, S.; Habbat, N. An Efficient Approach for Improving Customer Sentiment Analysis in the Arabic Language Using an Ensemble Machine Learning Technique. In Proceedings of the 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), Marrakech, Morocco, 12–14 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/9993924/ (accessed on 6 November 2024).
16. Mboungou, M.M.B.; Yamin, I.; Zhang, S.; Iqbal, A. Sentiment Analysis of Client Reviews on a French E-Commerce Platform. In Proceedings of the 2023 International Conference on Ambient Intelligence, Knowledge Informatics and Industrial Electronics (AIKIIE), Ballari, India, 2–3 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. Available online: https://ieeexplore.ieee.org/document/10390055/ (accessed on 8 November 2024).
17. Essebbar, A.; Kane, B.; Guinaudeau, O.; Chiesa, V.; Quénel, I.; Chau, S. Aspect Based Sentiment Analysis using French Pre-Trained Models. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, Vienna, Austria, 4–6 February 2021; SCITEPRESS—Science and Technology Publications: Setúbal, Portugal, 2021; pp. 519–525. Available online: https://www.scitepress.org/DigitalLibrary/Link.aspx?doi=10.5220/0010382705190525 (accessed on 8 November 2024).
18. Mas Diyasa, I.G.S.; Marini Mandenni, N.M.I.; Fachrurrozi, M.I.; Pradika, S.I.; Nur Manab, K.R.; Sasmita, N.R. Twitter Sentiment Analysis as an Evaluation and Service Base on Python Textblob. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1125, 012034.
19. Zhou, J.; Ye, Z.; Zhang, S.; Geng, Z.; Han, N.; Yang, T. Investigating response behavior through TF-IDF and Word2vec text analysis: A case study of PISA 2012 problem-solving process data. Heliyon 2024, 10, e35945.
20. Rezaeinia, S.M.; Rahmani, R.; Ghodsi, A.; Veisi, H. Sentiment analysis based on improved pre-trained word embeddings. Expert Syst. Appl. 2019, 117, 139–147.
21. Nawangsari, R.P.; Kusumaningrum, R.; Wibowo, A. Word2Vec for Indonesian Sentiment Analysis towards Hotel Reviews: An Evaluation Study. Procedia Comput. Sci. 2019, 157, 360–366.
22. Fouad, M.M.; Mahany, A.; Aljohani, N.; Abbasi, R.A.; Hassan, S.U. ArWordVec: Efficient word embedding models for Arabic tweets. Soft Comput. 2020, 24, 8061–8068.
23. Abdelhady, N.; Hassan ASoliman, T.; Farghally, M. Stacked-CNN-BiLSTM-COVID: An effective stacked ensemble deep learning framework for sentiment analysis of Arabic COVID-19 tweets. J. Cloud Comp. 2024, 13, 85.
24. Gomes, L.; Da Silva Torres, R.; Côrtes, M.L. BERT- and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: A comparative study. Inf. Softw. Technol. 2023, 160, 107217.
25. Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine Learning with Oversampling and Undersampling Techniques: Overview Study and Experimental Results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 7–9 April 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 243–248. Available online: https://ieeexplore.ieee.org/document/9078901/ (accessed on 26 November 2024).
26. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
27. Çakir, M.; Yilmaz, M.; Oral, M.A.; Kazanci, H.Ö.; Oral, O. Accuracy assessment of RFerns, NB, SVM, and kNN machine learning classifiers in aquaculture. J. King Saud Univ. Sci. 2023, 35, 102754.
28. Islam, M.Z.; Liu, J.; Li, J.; Liu, L.; Kang, W. A Semantics Aware Random Forest for Text Classification. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; ACM: New York, NY, USA, 2019; pp. 1061–1070. Available online: https://dl.acm.org/doi/10.1145/3357384.3357891 (accessed on 21 November 2024).
29. Ghasemieh, A.; Lloyed, A.; Bahrami, P.; Vajar, P.; Kashef, R. A novel machine learning model with Stacking Ensemble Learner for predicting emergency readmission of heart-disease patients. Decis. Anal. J. 2023, 7, 100242.
30. Wang, Y.; Feng, S.; Wang, D.; Zhang, Y.; Yu, G. Context-Aware Chinese Microblog Sentiment Classification with Bidirectional LSTM; Li, F., Shim, K., Zheng, K., Liu, G., Eds.; Web Technologies and Applications; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2016; Volume 9931, pp. 594–606. Available online: http://link.springer.com/10.1007/978-3-319-45814-4_48 (accessed on 31 October 2024).
31. Bibi, I.; Akhunzada, A.; Malik, J.; Iqbal, J.; Musaddiq, A.; Kim, S. A Dynamic DL-Driven Architecture to Combat Sophisticated Android Malware. IEEE Access 2020, 8, 129600–129612.
32. O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458.

Figure 1. The proposed methodology.

Figure 2. Distribution of labeled datasets.

Figure 3. The architecture of the CBOW model.

Figure 4. The architecture of the skip-gram model.

Figure 5. BERT embedding architecture.

Figure 6. Distribution of French and Arabic datasets before and after SMOTE.

Figure 7. The architecture of TF-IDF and ML algorithms.

Figure 8. The architecture of Word and BERT embedding with DL algorithms.

Figure 9. A comparison of model performance across different vectorization techniques on the English and French datasets.

Figure 10. A comparison of model performance across different vectorization techniques on the Arabic dataset.

Figure 11. The ROC AUC scores for all feature extraction techniques applied to the Arabic and French datasets.

Figure 12. The ROC AUC scores for all feature extraction techniques applied to the English dataset.

Table 1. A summary of related works: approaches and results.

Ref.	Dataset	Resampling Techniques	Feature Extraction	Classifiers	Results
[9]	A dataset of English news articles	No	TF-IDF	NB, logistic regression (LR), linear SVC, gradient boosting	Best model = NB; Accuracy = 89.02%
[10]	A dataset of English tweets related to COVID-19	No	Bag of words (BoW), TF-IDF, Word2Vec	RB, multinomial NB, SVC	Best model = TF-IDF with SVM; Accuracy = 71%
[11]	Six English datasets combined from Kaggle	No	Word2Vec, BERT	CNN, RNN, BiLSTM	Best model = BERT with CNN, RNN and BiLSTM; Accuracy = 93%
[12]	A combination of six English datasets from Kaggle	No	BoW, TF-IDF, AraVec	LR, RF, Extra Trees (ETrees), linear SVM, SVM RBF	Best model = TF-IDF with LR; Accuracy = 82.10%
[13]	Arabic dataset of Twitter comments	No	Word2Vec, Glove, FastText	Gaussian NB, linear SVC, NuSVC, LR, SGD, RF	Best model = FastText with NuSVC; Accuracy = 84.89%
[14]	An Arabic dataset of user opinions on a mobile application	No	Supervised lexicon weights, TDM, TFM	Decision tree (DT), KNN, SVM, NB	Best model = hybrid (Supervised lexicon weights, TDM, TFM) with KNN; Accuracy = 78.46%
[15]	Arabic datasets of hotel reviews (Datasets A and C)	No	TF-IDF	Adaboost, KNN, DT, maximum entropy (ME), SVM, ensemble classifier (EC)	Best model = TF-IDF with EC; Accuracy = 89.2%
[16]	A dataset of customer reviews from a French e-commerce platform	No	TF-IDF	RNN, LSTM	Precision metric between 73% and 77%
[17]	SemEval2016 French dataset, including restaurant and museum reviews	No	BERT’s variants	Conventional methods like LSTM; fine-tuned models using FC, SPC, and AEN approaches	Best model = FlauBERT with a pretrained fully connected model (PTM-FC); Best accuracy = 84.68%

Table 2. Text cleaning in English and French datasets.

Text	Language	Text Cleaning
Room was very well appointed and facilities were good	English	room well appoint facil good
Large queues for breakfast Room was very crammed booked a room for 2 adults and 2 children no space to go in with a pushchair Not enough lifts to cater for guests at times it took us 10 min to go down to the lobby restaurant	English	larg queue breakfast room cram book room adult children space go pushchair enough lift cater guest time took us minut go lobbi restaur
Were staying in london for just one night and the location was perfect we paid around 200 for one night for a family suite and it was very well worth the money	English	stay london one night locat perfect paid around one night famili suit well worth money
Chambre de petite taille Le petit-déjeuner à améliorer Manque d’équipement	French	chambr petit taill petit-déjeun amélior manqu d’équipement
Le service, le personnel et la propreté des lieux.	French	servic personnel propreté lieux
Hôtel de qualité à proximité immédiate de la gare.	French	Hôtel qualité proximité immédiat gare
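
The English rows of Table 2 can be approximated with a short cleaning routine. The sketch below is illustrative only: it lowercases the text, drops digits and punctuation, and removes stopwords, but it omits the stemming step that produces forms such as "appoint" and "facil" in the table. The stopword list and the function name `clean_text` are hypothetical and far smaller than any list a real pipeline would use.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# it is not the resource used by the authors.
STOPWORDS = {"was", "were", "very", "and", "the", "for", "a", "in", "it", "to", "of"}


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, keep ASCII letters only, and drop stopwords (no stemming)."""
    text = text.lower()
    # Replace digits, punctuation, and any other non-letter with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)


cleaned = clean_text("Room was very well appointed and facilities were good")
print(cleaned)
```

Applying a stemmer to the resulting tokens would be the remaining step needed to reach the forms shown in the "Text Cleaning" column; the ASCII-only filter also means this sketch is not suitable for the French or Arabic rows.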

Table 3. Text cleaning in Arabic dataset.

Text	Text in English	Text Cleaning	Tashaphyne Stemmer	ISRI Stemmer
لم يعجبني اي شيء، حجزنا في الصور شيء لكن في الحقيقة هذا المكان حتى الكلاب لا تستطيع ان تقعد فيه ، رائحة الجيفة في كل مكان الذباب و الناموس و التعامل 0 و صاحب الفندق رجل لا يمكن وصفه إلا بأنه نصاب ، دفعنا ف بوكينج و لما وصلنا للمكان الدي يوجد في قرية نائية و الفندق لا توجد فيه حتى قنينة ماء ،	I didn’t like anything. What we booked in the pictures was one thing, but in reality, this place is so bad that not even dogs can sit there. The smell of rotting flesh is everywhere, flies and mosquitoes are all over, the service is zero, and the hotel owner is nothing but a scammer...	لم يعجبني اي شيء حجزنا الصور شيء الحقيقه المكان حتي الكلاب لا تستطيع ان تقعد رايحه الجيفه مكان الذباب الناموس التعامل 0 صاحب الفندق رجل لا يمكن وصفه الا بانه نصاب دفعنا بوكينج وصلنا لمكان الدي يوجد قريه نايه الفندق لا توجد حتي قنينه ماء	لم عجب اي شيء حجز صور شيء حقيقه مك حت كلاب لا تستطيع ان قعد رايح جيفه مك ذباب ناموس تعامل 0 صاحب فندق رجل لا مك صف لا ان صاب دفع وكينج صل مكا دي وجد قر نا فندق لا وجد حت قني ماء	لم عجب اي شيء حجز صور شيء حقق كان حتي كلب لا تطع ان قعد ريح جيف كان ذبب نمس عمل 0 صحب ندق رجل لا يمكن وصف الا بنه نصب دفع كينج وصل لمك الد وجد قره نيه ندق لا وجد حتي قنن ماء

Table 4. Configurations of word embedding models.

	English Word Embedding		French Word Embedding		Arabic Word Embedding
Configuration	Word2Vec	FastText	frWac2Vec	FastText	ArWordVec	FastText
Vector size	300	300	200	300	300	300
Approaches	CBOW	CBOW	CBOW	CBOW	CBOW	CBOW
Window size	5	5	Not Specified	5	3	5

Table 5. Dataset splitting (training, validation, and testing sets).

	English Dataset			French Dataset			Arabic Dataset
	Training	Validation	Testing	Training	Validation	Testing	Training	Validation	Testing
ML classifiers	70,911	-	17,727	10,084	-	2520	4298	-	1074
DL Classifiers	56,729	14,182	17,727	9662	1074	1890	3868	430	1074
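
The proportions behind Table 5 can be recovered with simple arithmetic. The sketch below copies the sizes as printed and reports each row's total and its validation and testing fractions; the names `SPLITS` and `summarize` are illustrative. Note that, as printed, the French DL row sums to 12,626 while the French ML row sums to 12,604; the code reports the figures as given and does not attempt to reconcile them.

```python
# Split sizes copied from Table 5 as (training, validation, testing).
# Validation is 0 for the ML classifiers, which the table marks with "-".
SPLITS = {
    "English": {"ML": (70911, 0, 17727), "DL": (56729, 14182, 17727)},
    "French": {"ML": (10084, 0, 2520), "DL": (9662, 1074, 1890)},
    "Arabic": {"ML": (4298, 0, 1074), "DL": (3868, 430, 1074)},
}


def summarize(train, val, test):
    """Return the row total plus validation and testing fractions of that total."""
    total = train + val + test
    return {
        "total": total,
        "val_fraction": round(val / total, 3),
        "test_fraction": round(test / total, 3),
    }


summary = {
    language: {kind: summarize(*sizes) for kind, sizes in rows.items()}
    for language, rows in SPLITS.items()
}

for language, rows in summary.items():
    for kind, stats in rows.items():
        print(language, kind, stats)
```

For the English and Arabic datasets, and for the French ML row, the testing set works out to roughly 20% of the row total.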

Table 6. Parameters of machine learning classifiers.

Classifiers		English Dataset	French Dataset	Arabic Dataset
SVM	C	1	10	10
	Kernel	RBF	RBF	RBF
	Degree, Gamma	3, 1	2, 1	2, 1
RF	N_estimators	200	200	200
	min_samples_split	5	5	2
EC	SVM	C = 10; Kernel = RBF; Gamma = 1
	KNN	N estimators = 200; min samples split = 5
	MLP	Hidden layers = 50; max iterations = 1000

Table 7. Hyperparameters of DL algorithms.

Parameters	Values
Learning rate	1 × 10⁻³; 5 × 10⁻⁴; 1 × 10⁻⁴
Epoch	10
Optimizer	Adam; RMSprop
Batch size	64; 32; 16
Activation	ReLU; Softmax
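
To make the size of the search space in Table 7 concrete, the sketch below enumerates every combination of the listed learning rates, optimizers, and batch sizes. It assumes an exhaustive grid, which the table itself does not state, and it treats the epoch count as fixed and leaves the activations out of the grid (also an assumption). The function name `candidate_configs` is hypothetical.

```python
from itertools import product

# Values copied from Table 7.
LEARNING_RATES = (1e-3, 5e-4, 1e-4)
OPTIMIZERS = ("Adam", "RMSprop")
BATCH_SIZES = (64, 32, 16)
EPOCHS = 10  # single value in the table, so treated as fixed here


def candidate_configs():
    """Yield one dict per combination (assumes an exhaustive grid search)."""
    for learning_rate, optimizer, batch_size in product(
        LEARNING_RATES, OPTIMIZERS, BATCH_SIZES
    ):
        yield {
            "learning_rate": learning_rate,
            "optimizer": optimizer,
            "batch_size": batch_size,
            "epochs": EPOCHS,
        }


configs = list(candidate_configs())
print(len(configs))
print(configs[0])
```

Under these assumptions there are 3 × 2 × 3 = 18 candidate configurations per model and embedding.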

Table 8. A comparison of ML algorithm performance using TF-IDF.

Models	English Dataset				French Dataset				Arabic Dataset
Models	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score
TF-IDF
SVM	0.978	0.98	0.98	0.98	0.893	0.90	0.89	0.89	0.938	0.94	0.94	0.94
RF	0.962	0.96	0.96	0.96	0.888	0.89	0.89	0.89	0.933	0.94	0.93	0.93
EC	0.986	0.99	0.99	0.99	0.892	0.90	0.89	0.89	0.941	0.94	0.94	0.94

Table 9. A comparison of DL algorithm performance using Word2Vec and FastText.

Models	English Dataset				French Dataset				Arabic Dataset
Models	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score
Word2Vec
BiLSTM	0.985	0.99	0.99	0.99	0.848	0.85	0.85	0.85	0.872	0.87	0.87	0.87
GRU	0.983	0.98	0.98	0.98	0.847	0.85	0.85	0.85	0.866	0.87	0.87	0.87
BiLSTM–CNN	0.983	0.98	0.98	0.98	0.859	0.86	0.86	0.86	0.883	0.89	0.89	0.89
FastText
BiLSTM	0.984	0.98	0.98	0.98	0.840	0.84	0.84	0.84	0.859	0.86	0.86	0.86
GRU	0.980	0.98	0.98	0.98	0.830	0.83	0.83	0.83	0.862	0.86	0.86	0.86
BiLSTM–CNN	0.983	0.98	0.98	0.98	0.850	0.85	0.85	0.85	0.876	0.88	0.88	0.88

Table 10. A comparison of DL algorithm performance using BERT embedding.

Models	English Dataset				French Dataset				Arabic Dataset
Models	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score	Accuracy	Precision	Recall	F1-Score
BERT
BiLSTM	0.893	0.890	0.890	0.890	0.870	0.870	0.870	0.870	0.911	0.920	0.910	0.910
GRU	0.892	0.890	0.890	0.890	0.868	0.870	0.870	0.870	0.906	0.910	0.910	0.910
BiLSTM–CNN	0.891	0.890	0.890	0.890	0.871	0.880	0.870	0.870	0.920	0.920	0.920	0.920
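
The four metrics reported in Tables 8 to 10 can be computed from true and predicted labels as in the sketch below. It uses macro averaging across classes, which is an assumption: the tables do not state which averaging scheme was used. The function name `classification_metrics` and the toy labels are illustrative and are not drawn from the study's data.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    accuracy = sum(pairs[(label, label)] for label in labels) / len(y_true)

    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        true_positive = pairs[(label, label)]
        predicted = sum(pairs[(other, label)] for other in labels)
        actual = sum(pairs[(label, other)] for other in labels)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    count = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / count,
        "recall": sum(recalls) / count,
        "f1": sum(f1_scores) / count,
    }


# Toy labels for illustration only (1 = positive, 0 = negative).
metrics = classification_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
print(metrics)
```

On a balanced test set, accuracy and the macro-averaged scores tend to lie close together, which is consistent with the near-identical values within each row of the tables.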

Table 11. Execution time comparison of vectorization methods: TF-IDF, Word2Vec, FastText, and BERT.

		Execution Time of Vectorization
Vectorization Techniques		English Dataset	French Dataset	Arabic Dataset
MX330 GPU	TF-IDF	2.919 s	0.159 s	0.142 s
	Word2Vec	47.679 s	1.597 s	1.229 s
	FastText	15.188 s	39.534 s	13.332 s
A100 GPU	BERT	1029.873 s	470.829 s	106.331 s
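
Execution times such as those in Table 11 can be collected by wrapping the vectorization call in a wall-clock timer. The sketch below is a minimal illustration, not the authors' measurement code: it times a small pure-Python TF-IDF using one common weighting (term frequency times log(N / document frequency)), which may differ from the variant used in the study. The helper names `tfidf_vectorize` and `timed` are hypothetical, and the third sample document is invented for the example.

```python
import math
import time
from collections import Counter


def tfidf_vectorize(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [document.split() for document in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


def timed(function, *args):
    """Run function(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = function(*args)
    return result, time.perf_counter() - start


docs = [
    "room well appoint facil good",     # cleaned English row from Table 2
    "servic personnel propreté lieux",  # cleaned French row from Table 2
    "room good locat perfect",          # invented for this example
]
vectors, seconds = timed(tfidf_vectorize, docs)
print(f"TF-IDF vectorization took {seconds:.6f} s")
```

Because the table reports BERT on an A100 GPU and the other techniques on an MX330 GPU, timings gathered this way are only directly comparable when they are measured on the same hardware.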

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jakha, H.; El Houssaini, S.; El Houssaini, M.-A.; Ajjaj, S.; Hadir, A. Optimizing Sentiment Analysis in Multilingual Balanced Datasets: A New Comparative Approach to Enhancing Feature Extraction Performance with ML and DL Classifiers. Appl. Syst. Innov. 2025, 8, 104. https://doi.org/10.3390/asi8040104

