SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis

Almousa, Omar; Tashtoush, Yahya; AlSobeh, Anas; Zahariev, Plamen; Darwish, Omar

doi:10.3390/bdcc10020049

Open AccessArticle

SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis

by

Omar Almousa

¹

,

Yahya Tashtoush

^1,*

,

Anas AlSobeh

^2,*

,

Plamen Zahariev

³

and

Omar Darwish

⁴

¹

Department of Computer Science, Jordan University of Science and Technology, Irbid 22110, Jordan

²

Department of Information Systems & Technology, Utah Valley University, Orem, UT 84058, USA

³

Department of Telecommunications, University of Ruse “Angel Kanchev”, 7017 Ruse, Bulgaria

⁴

Information Security and Applied Computing, Eastern Michigan University, Ypsilanti, MI 48197, USA

^*

Authors to whom correspondence should be addressed.

Big Data Cogn. Comput. 2026, 10(2), 49; https://doi.org/10.3390/bdcc10020049

Submission received: 28 November 2025 / Revised: 12 January 2026 / Accepted: 19 January 2026 / Published: 3 February 2026

(This article belongs to the Special Issue Advances in Natural Language Processing and Text Mining: 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

Sentiment analysis of Arabic text, particularly on social media platforms, presents a formidable set of unique challenges that stem from the language’s complex morphology, its numerous dialectal variations, and the frequent and nuanced use of emojis to convey emotional context. This paper presents SiAraSent, a hybrid framework that integrates traditional text representations, emoji-aware features, and deep contextual embeddings based on Arabic transformers. Starting from a strong and fully interpretable baseline built on Term Frequency–Inverse Definition Frequency (TF–IDF)-weighted character and word N-grams combined with emoji embeddings, we progressively incorporate SinaTools for linguistically informed preprocessing and AraBERT for contextualized encodings. The framework is evaluated on a large-scale dataset of 58,751 Arabic tweets labeled for sentiment polarity. Our design works within four experimental configurations: (1) a baseline traditional machine learning architecture that employs TF-IDF, N-grams, and emoji features with an Support Vector Machine (SVM) classifier; (2) an Large-language Model (LLM) feature extraction approach that leverages deep contextual embeddings from the pre-trained AraBERT model; (3) a novel hybrid fusion model that concatenates traditional morphological features, AraBERT embeddings, and emoji-based features into a high-dimensional vector; and (4) a fully fine-tuned AraBERT model specifically adapted for the sentiment classification task. Our experiments demonstrate the remarkable efficacy of our proposed framework, with the fine-tuned AraBERT architecture achieving an accuracy of 93.45%, a significant 10.89% improvement over the best traditional baseline.

Keywords:

Arabic sentiment analysis; natural language processing (NLP); large language models (LLMs); SinaTools; BERT; AraBERT; deep learning; machine learning; feature engineering; emoji analysis; Twitter (X)

1. Introduction

Finding accurate sentiment is vital for marketing, event detection, election polls, and governance [1,2]. However, it is not an easy task. Many challenging factors face the process of sentiment analysis (SA), such as finding the sentiment in short texts. Similarly, the Twitter text sentiment assignment is challenging and not always accurate, where some sentiment areas, including ambiguity, sarcasm, presence of slang, acronyms, and emoji detection, will be missed [3,4]. Sentiment analysis (SA), a key area of Natural Language Processing (NLP), has become a vital tool for public opinion mining, customer feedback analysis, and social trend monitoring [1]. However, the application of SA to the Arabic language, especially within the informal and noisy context of social media platforms like Twitter, is fraught with significant challenges that are not as prevalent in English-language analysis. These challenges include the language’s rich and complex morphology, the widespread use of diverse and often unstandardized dialects, and the integral role of emojis in conveying sentiment and nuance [2]. Traditional machine learning approaches, which often rely on handcrafted features like Term Frequency-Inverse Document Frequency (TF-IDF) and N-grams, have provided a solid foundation for this task [3]. While these methods are effective at capturing lexical patterns, they frequently fail to grasp the deeper contextual and semantic meanings embedded in the text, particularly in the presence of sarcasm, ambiguity, and dialectal variations. This limitation has necessitated a paradigm shift towards more sophisticated models that can understand language in a more human-like manner.

Neutral, positive, and negative sentiments are the most common sentiment polarities that exist in texts. The current tools have weaknesses and strengths in identifying accurate sentiment within a text. Many tools are capable of identifying positive sentiments within the text, while others are more efficient in exploring the negative ones [5]. Those cases happen as the sentence context is misled by the individual word’s actual meaning in the sentence. Sentence ambiguity makes it difficult to allocate an accurate polarity to the sentence. The sentence polarity depends strongly on the sentence context. Because of ambiguity, some tools might assign negative sentiment to neutral texts [6].

The advent of deep learning, and specifically the Transformer architecture [7], has revolutionized the field of NLP. Pre-trained language models such as BERT [5] have demonstrated an unprecedented ability to learn rich, contextualized representations of language, leading to state-of-the-art performance on a wide array of NLP tasks. For the Arabic language, specialized models like AraBERT [6] and MARBERT [8,9] have been developed, pre-trained on massive Arabic corpora to capture the unique intricacies of the language. These models offer a powerful alternative to traditional methods, but their full potential, especially when combined with established feature engineering techniques, has not been fully explored. Furthermore, specialized toolkits like SinaTools [10] have emerged, providing advanced morphological analysis and semantic relatedness capabilities that can further enrich the feature set for sentiment analysis.

This paper presents a significant leap forward by proposing a novel, hybrid framework that bridges the gap between traditional feature engineering and modern deep learning by proposing SiAraSent. We systematically combine the interpretable, lexical features from TF-IDF, N-grams, and emojis with the powerful contextual embeddings from AraBERT and the advanced linguistic features from SinaTools. Our research makes the following key contributions: (1) A hybrid Sentiment Framework (SiAraSent) that integrates traditional features, advanced linguistic features from SinaTools, and deep contextual embeddings from AraBERT (https://github.com/aub-mind/arabert) (accessed on 13 March 2025), creating a comprehensive and highly effective feature set for Arabic sentiment analysis. (2) A systematic and rigorous comparison of four distinct modeling strategies: a traditional ML baseline, an LLM feature extraction approach, a hybrid fusion model, and a fully fine-tuned AraBERT model. (3) A deep mathematical formulation of the AI encoder architecture, detailing the input embeddings, multi-head self-attention mechanism, and classification layer, thereby ensuring transparency and reproducibility.

The results on a large-scale Arabic Twitter dataset achieve 93.45% accuracy and demonstrate a significant improvement over existing methods. A robust, open-source implementation of the entire framework (https://github.com/aub-mind/arabert) (accessed on 13 March 2025), providing a valuable benchmark and resource for the Arabic NLP research community.

Research Questions and Hypotheses

Our overarching aim is to understand how different feature representations and model architectures contribute to Arabic sentiment analysis. We therefore formulate the following research questions (RQ) and associated hypotheses (H):

RQ1: Do emojis carry complementary sentiment information beyond character and word N-gram features?

H1:

Adding emoji-aware features will yield a statistically significant improvement in classification performance compared to models using only lexical features.

RQ2: Are the linguistic features provided by SinaTools beneficial beyond lexical and emoji features?

H2:

The hybrid fusion model that concatenates lexical, emoji and morphological features will outperform pure TF–IDF or pure AraBERT embeddings.

RQ3: How does a fine-tuned transformer compare to feature-fusion approaches on large Arabic datasets?

H3:

Fine-tuning AraBERT on our dataset will achieve the best overall accuracy but may be more computationally intensive than the hybrid model.

RQ4: Do the observed improvements generalize across data splits and are they statistically significant?

H4:

Using cross-validation and significance tests, we expect our reported gains to remain significant across multiple folds and not be attributed to chance variation.

2. Related Work

2.1. Traditional and Lexicon-Based Arabic Sentiment Analysis

Early Arabic sentiment analysis systems relied heavily on lexicon-based methods and traditional machine-learning algorithms. Sentiment lexicons were constructed manually or semi-automatically, often by translating English resources or mining Arabic corpora [11]. Text representations were typically based on unigram or N-gram counts, sometimes combined with simple syntactic patterns or negation handling rules. However, one tool cannot process all the Arabic language variants [12,13]; therefore, the combination of emoji with textual features might increase the accuracy of the sentiment of Arabic tweets. Furthermore, determining the size of the sliding window that will be used in n-gram analysis is essential to specify the margin of accuracy that such analysis will provide.

Furthermore, the sentiment of Arabic tweets is more complicated when it is applied to Twitter, where it is highly noisy and informal. The challenges facing Arabic sentiment analysis include its rich morphology and the use of dialectal Arabic [14]. The Arabic language is in a dialect state where the formal language in writing differs from the language in daily life; however, the language used in social media is mostly dialectal [15]. Another challenge is the limited publicly available Arabic dataset and lexicons for SA, which complicates our task. Moreover, applying SA to the Arabic language is a complex task due to some features of the language. For example, the same word can be written in different ways, and many Arabic names are derived from adjectives, with no sentiment, while the adjective may have a sentiment. Also, the idioms used can have an implicit opinion.

The study in [16] introduced a text-sentiment classification approach using Term Frequency-Inverse Document Frequency (TF-IDF) combined with Next Word Negation (NWN). The authors compared the performance of TF-IDF alone with TF-IDF enhanced by NWN for text classification. The model was then tested on three text-mining algorithms, with results indicating that Linear Support Vector Machine (LSVM) achieved the highest accuracy, outperforming previous methods.

In [17], researchers employed four machine learning algorithms for classification. They critically evaluated the accuracy of these methods based on metrics such as recall, precision, accuracy, and F-measure. The algorithms were also applied using n-gram features, revealing that increasing the ‘n’ value in n-grams led to a decline in classification accuracy [18]. However, converting text into numerical features using TF-IDF vectorization improved accuracy when machine learning techniques were applied.

The work in [19] presented an Arabic Sentiment Analysis Corpus consisting of 36K labeled tweets (positive/negative) collected from Twitter. The authors utilized self-training and distant supervision for annotation and also released an additional 8K manually annotated tweets as a gold standard. The corpus was evaluated intrinsically by comparing it with pre-trained sentiment models and human classifications, and extrinsically through sentiment analysis tasks, achieving an accuracy of 86%.

Researchers in [20] explored the impact of Arabic language morphology on sentiment analysis, focusing on how negation influences sentiment. They developed a set of rules to identify negation patterns in Arabic and applied these rules to improve sentiment detection for negated words. The proposed method demonstrated superior performance compared to existing Arabic sentiment analysis techniques.

The main aim of this study is to propose a sentiment analysis model that combines emoji-based features and text-based features for Arabic tweets. The broader goal of this study is to fill the gap in the literature on sentiment analysis with emojis for the Arabic language.

2.2. Deep Learning and Transformer-Based Approach

Arabic sentiment analysis has evolved from lexicon-based methods to sophisticated deep learning approaches. Early work focused on building sentiment lexicons and rule-based systems [11], which, while interpretable, struggled with the dynamic and informal nature of social media. Machine learning models, particularly Support Vector Machines (SVM) and Naive Bayes, combined with features like TF-IDF and N-grams, represented a significant improvement [3]. These models demonstrated the value of feature engineering in capturing important lexical patterns. For instance, the work in [3] showed that combining TF-IDF with character-level N-grams and emoji features could yield an accuracy of 80.56%, establishing a strong baseline for traditional methods. However, these approaches are often limited by their inability to understand context and semantic nuances.

The paradigm shift towards deep learning introduced models like Convolutional Neural Networks (CNNs) [12] and Long Short-Term Memory (LSTM) networks, which could learn feature representations automatically. These models offered better performance by capturing local patterns (CNNs) and long-range dependencies (LSTMs) in text. However, the true breakthrough came with the advent of the Transformer architecture [21] and pre-trained language models like BERT [5,22]. These models, pre-trained on vast amounts of text data, learn deep contextualized representations that capture complex linguistic phenomena.

For the Arabic language, this led to the development of specialized models such as AraBERT [6], which was pre-trained on a massive Arabic corpus and quickly became the state-of-the-art for many Arabic NLP tasks. Subsequent work, such as MARBERT [8,23,24], further improved performance by pre-training on a large dataset of Arabic tweets, making it particularly well-suited for social media analysis. Comparative studies have consistently shown that Transformer-based models significantly outperform traditional ML and earlier deep learning architectures for Arabic sentiment analysis [14]. Despite these advancements, there remains a gap in the literature regarding the systematic integration of these powerful deep learning models with the rich, interpretable features derived from traditional methods and specialized linguistic toolkits like SinaTools [10,25]. Our work fills this gap by proposing a hybrid framework that leverages the strengths of both worlds.

2.3. Emojis, Hybrid Features, and Multi-Modal Signals

Emojis are a prominent feature of social media discourse and often carry sentiment or pragmatic cues that interact with the surrounding text. Several studies have investigated using emoji frequencies, polarity lexicons for emojis, or learned emoji embeddings to improve sentiment classifiers. For Arabic, this is particularly relevant, as users mix text, emojis, Latin script, and numerals (“Arabizi”) to express emotion and irony.

Hybrid feature approaches that combine lexical, syntactic, and semantic representations have shown promise for sentiment analysis in various languages [26]. However, few works have performed a systematic, large-scale evaluation of hybrid features that include emoji signals, morphological analyzers like SinaTools, and transformer-based embeddings in the context of Arabic sentiment analysis. This gap motivates SiAraSent: a framework explicitly designed to integrate these heterogeneous sources of information and to quantify their individual and joint contributions.

3. Methodology and Mathematical Formulation

Our proposed methodology is a multi-stage process designed to systematically extract and model sentiment from Arabic social media text [27]. The overall architecture, depicted in Figure 1, involves data preprocessing, multi-faceted feature engineering, and four distinct modeling strategies. This section provides a detailed description of each component, including the mathematical underpinnings of our AI encoder.

3.1. Dataset Acquisition

The Arabic Sentiment Twitter Corpus [28] dataset was used in this study, which includes a large corpus of positive and negative tweets collected from Twitter. The used dataset contained 58 k Arabic tweets that were collected using a positive and negative emojis lexicon, where each tweet has one or more emojis as it is needed to work with in this research. The dataset was saved in TSV (tab-separated values) format, and each row had one tweet. The current research is interested in the text and emojis of the tweet to combine all the sentimental features that the tweet contains. We also used the dataset from [29] to perform a fair comparison with their results. We call it hereafter the second dataset that was collected from Twitter on the basis of trending hashtags. It contains 22,752 Arabic tweets collected.

3.2. Preprocessing

In this study, preprocessing the aforementioned datasets consists of several steps. The steps are data cleaning, stop-word removal, stemming and morphological normalization, and emoji extraction. To prepare the raw tweets for analysis, we first perform data cleaning to handle missing or invalid entries. Empty tweets, tweets with only URLs or punctuation, and corrupted Unicode strings are discarded. All duplicate tweets and obvious retweets are removed to prevent a few messages from dominating the corpus. We also strip metadata such as the “RT” token or user mention and handle duplicate posts created by automated bots. After removing duplicates, we normalize the text by unifying the many surface forms of Arabic characters (e.g., mapping alef variants (e.g., أ, إ, آ to ا), removing elongations and repeated letters, and replacing Tatweel characters. Previous studies on Arabic SA have shown that eliminating repeated letters and normalizing variants of alef improves accuracy and reduces sparsity. We remove non-Arabic letters and digits, URLs, user mentions and hashtags, and we maintain only the tweet body for analysis. Next, we apply stop-word removal using a curated list of Arabic stop words, and we rely on stemming/lemmatization to reduce words to a canonical form [30]. For traditional classifiers, we use the ISRI stemmer, whereas for the linguistically informed pipeline, we leverage the SinaTools toolkit, which performs tokenization, morphological analysis and part-of-speech (POS) tagging. Stemming and lemmatization help increase the match between related forms and thus boost the recall of features across dialects. Finally, to capture the role of emojis, we extract all emoji characters from each tweet and categorize them into four basic emotion classes—disgust, anger, sadness and joy—prior to building our emoji feature vector. Emojis are removed from the text to avoid duplication with the emoji features.

3.3. Features and Emoji Extraction

The next stage of our work is Feature Extraction. In this stage, we extract significant features from the dataset to successfully perform the classification task using ML classifiers [31]. For the emoji-based features extraction, we use the same method that we used in our previous work [32]. We extract a rich set of features from three distinct categories. Inspired by [3], we build a strong, interpretable baseline. We compute TF-IDF scores for character-level N-grams.

Traditional lexical features. We compute TF–IDF scores for character-level and word-level N-grams. Character N-grams with 1–5 characters capture sub-word patterns and morphology, while word N-grams with windows up to 3 words encode short phrases and negations. The TF–IDF weighting scheme down-weights ubiquitous stop-words and emphasizes terms distinctive to a tweet. Formally, the importance of a term t in a document d is given by:

T F - I D F (t, d) = T F (t, d) \times l o g (\frac{N}{d f (t)}),

where N is the total number of documents and df(t) is the number of documents containing term t.

We use character-level N-grams (n = 1 to 5) to capture sub-word morphological patterns, which is particularly effective for the agglutinative nature of Arabic. The N-gram is defined as an N-character sliding window in a string, which is a language-independent approach that works very well with noisy data [33]. In this study, the length of the tweets was measured, and the average length was calculated to be 9 words. Therefore, the experiment was conducted on that basis from n = 1 to n = 9 of features in variable window size, which means the first experiment (min = 1, max = 1), second (min = 1, max = 2), third (min = 1, max = 3) and so on till n = 9.

Emoji-aware features. Building on our previous work [32], we represent each tweet with a 20-dimensional vector derived from the emoji tokens removed during preprocessing. The vector encodes:

Counts of positive, negative, and neutral emojis (three dimensions) based on an Arabic emoji sentiment lexicon.
Emoji density, measured as the ratio of emoji characters to the total length of the tweet (one dimension), captures how expressive a tweet is.
Sentiment polarity score computed as the difference between positive and negative emoji counts, normalized to [−1, 1] (one dimension).
Positional features indicating whether the majority of emojis appear at the beginning, middle, or end of the tweet (three dimensions). Emojis at the end often act as summarizing cues.
Distributional features, such as the maximum number of consecutive emojis and the variance in sentiment across the emoji sequence (five dimensions).

The remaining dimensions capture the presence of each of the four primary emotion categories (disgust, anger, sadness and joy) plus miscellaneous emoji groups. This design yields a compact yet expressive representation of the emoji signal. Linguistic and morphological features. Using the SinaTools toolkit, we extract a suite of linguistic descriptors: (a) the distribution of part-of-speech tags over the tweet (17 POS tags), (b) counts of morphological patterns such as prefixes, suffixes and stems, (c) the ratio of verbs to nouns, and (d) a semantic relatedness score computed from an Arabic semantic lexicon. These features are concatenated into a 50-dimensional vector. Because the features derive from a high-dimensional one-hot encoding of POS tags and morphological templates, we apply min–max scaling to each feature group so that no single group dominates the hybrid vector. When combining the lexical, emoji and linguistic representations into the hybrid fusion model, we obtain a vector of approximately 7000 dimensions (depending on the chosen N-gram range). To mitigate the curse of dimensionality, we explored dimensionality reduction via principal component analysis and found that retaining 95% of the variance yielded no accuracy loss but improved efficiency. These details answer reviewer 3’s request to specify the dimensionality and normalization applied to our heterogeneous feature set.

We create a 20-dimensional feature vector from emojis, capturing their count, sentiment polarity, density, and position within the tweet. In particular, the 20-dimensional emoji vector is now defined by counts, polarity, density, position and distributional measures; the linguistic features number fifty and are scaled; and the total hybrid vector dimension is explicitly stated and subject to optional dimensionality reduction.

3.3.1. Term Frequency–Inverse Document Frequency Algorithm

According to [33], TF-IDF is considered one of the commonly used term weighting techniques in information retrieval systems [34]. The TF–IDF algorithm is commonly used as a measure in the classification framework of textual content. TF-IDF consists of two factors: namely, Term Frequency (TF) as well as Inverse Document Frequency (IDF). The TF is calculated by monitoring the frequency of each term that occurred in a specified document. The IDF is calculated by dividing the number of all documents that we have in the corpus by the number of documents in which the term occurred, and then taking the logarithm of the quotient. When we multiply TF by the IDE value for terms, we have a high score for terms that occur frequently in some documents, and a low score for terms that occur frequently in almost every document. This score allows us to discover the important terms in a document [35].

The equation below is used to calculate the importance of the term (

T F_{i, j}

) in the referred document (

j

) by measuring how many times this term occurs in this document (

n_{i, j}

), divided by the sum of all the terms that occur in that document (

\sum_{k} n_{k, j}

) [36].

T F_{i, j} = \frac{n_{i, j}}{\sum_{k} n_{k, j}}

(1)

The limitation of using TF is that words like prepositions من، في ، الى that occur a lot in each document make the value of TF very high. These words are not important for classification because they do not convey any emotion or sentiment. We are concerned with the words that occur in some parts of the document and have sentiment. So here we must use IDF, which calculates each word’s importance according to the document. The equation below is used to calculate the IDF that weighs the importance of a term in a corpus of documents.

T F_{i}

for a term

t_{i}

is calculated by computing the logarithm of dividing the corpus size by the number of documents that contain that term.

I D F_{i} = \log \frac{|D|}{|\{d_{j} : t_{i} \in d_{j}\}|}

(2)

where D refers to the total number of documents that we have in the corpus and is divided by the number of documents that a particular term

t_{j}

appeared in. Then, we take the logarithm of the equation. The use of logarithms helps in normalizing the distribution of values. Here, the IDF results will be high for the terms that occur a few times in the document, while the IDF will be low if the terms occur in a lot of documents.

The IDF will not be helpful as a measurement alone, so when it is multiplied by the TF value will result in an accurate representation of the term’s importance.

T F * I D F_{i, j} = T F_{i, j} \times I D F_{i}

(3)

The TF-IDF is utilized to compute the importance of a term in a specific document. Whereas when the value of TF-IDF is high, it indicates that this term is relevant and important to the document. The TF-IDF gives weight to each term in the tweet; the tweet has more than just one term, which lets us calculate the relevance of the term to the tweet and the document [21]. After calculating TF-IDF for each term, we have a matrix for all terms and documents we have. Each row represents a vector for each tweet in the vector space model that enters the ML model to train ML Classifiers to do the Sentiment Analysis task and predict polarity for each tweet [24,37].

3.3.2. SinaTools Linguistic Features

We utilize SinaTools to extract high-level linguistic features, including POS tag distributions and semantic relatedness scores between key terms, providing a deeper understanding of the grammatical structure and semantic content.

3.3.3. Deep Learning Features (AraBERT)

We use the pre-trained AraBERT model (aubmindlab/bert-base-arabertv2) to generate deep contextual embeddings [38]. The architecture of the AraBERT encoder is detailed below [39,40].

3.4. Training, Testing, Adjusting, and Evaluating ML Classifiers

The third stage of our work is the training and testing of the selected ML classifiers. Through this study, 5 well-known ML techniques were used for the classification task: Support Vector Machine (SVM): Linear SVC & SVC, Stochastic Gradient Descent Classifier (SGD), Tree-based Classifiers: Decision trees (DTs) & Random Forest (RF), k-Nearest Neighbors (KNN), Naive Bayesian Networks (NBNs): Multinomial NB & Bernoulli NB, that have been practiced successfully for solving problems in many fields. So, in total, we have 8 different ML classifiers. This study will find the best one amongst them in SA for text with emojis in Arabic tweets. After selecting the ML classifiers, we start training them. The main goal of training classifiers is to make a prediction as accurately as possible. Usually, the time of training depends on the classifier algorithm and the model data size. We train eight different classifiers with 80% of the actual data. After the training step, we use 20% of the data for testing to evaluate our model as several others did, i.e., [29,30]. The training set was input to the model without the class label to let the model predict the label class for each tweet. According to the results, we can evaluate each classifier; if the rate of loss of predicted values is high, we can adjust the inner parameters of the classification algorithm to have better results. For example, the KNN algorithm calculates the distance between the nearest neighbors using the Manhattan or Euclidean distance metric. We can adjust the number of neighbors and the distance metric according to the higher measurements. Finally, we perform the evaluation step. In the ML field, there are evaluation metrics that evaluate the performance of classifiers in predicting/not predicting the class label of the test dataset. We will discuss the evaluation metrics in detail in the next section.

3.5. Classification

In this stage, classifiers use the testing set that has been held from the model (with its sentiment label hidden) to test the model and see how it will perform in the real world. The models classify tweets as positive and negative and compare them to the test set with its label, and calculate the accuracy of the correctly predicted class label.

4. Mathematical Formulation of the AI Encoder

Our deep learning component is based on the Transformer encoder architecture [7], as implemented in AraBERT [6]. The process is as follows:

4.1. Input Embeddings

The input sequence is transformed into a vector representation by summing three embedding types:

E = E_token + E_pos + E_seg

where E_token are the token embeddings, E_pos are the positional embeddings to encode word order, and E_seg are the segment embeddings to distinguish between sentences.

4.2. Multi-Head Self-Attention

The core of the encoder is the multi-head self-attention mechanism, which allows the model to weigh the importance of different words in the sequence. The attention output is calculated as:

Attention (Q, K, V) = softmax ((Q K^T / sqrt (d_k)) * V

where Q, K, and V are the query, key, and value matrices, and d_k is the dimension of the keys. AraBERT uses h = 12 attention heads, and their outputs are concatenated:

MultiHead (Q, K, V) = Concat (head_1, \dots, head_h)) * W^O

4.3. Encoder Layer

Each encoder layer consists of a multi-head attention sub-layer and a feed-forward neural network (FFN) sub-layer. Residual connections and layer normalization are applied around each sub-layer. AraBERT stacks L = 12 of these layers.

4.4. Model Architectures

Figure 2 shows that we evaluate four different modeling strategies: eight classifiers (e.g., SVM, Random Forest) trained on TF-IDF, N-gram, and emoji features. LLM Feature Extraction (AraBERT) is used to generate [CLS] token embeddings for each tweet, which are then fed into an SVM classifier [41]. Hybrid Fusion Model, a powerful model where we concatenate all features (Traditional + SinaTools + AraBERT embeddings) and train an SVM classifier on the combined high-dimensional vector. Fine-tuned AraBERT, the pre-trained AraBERT model is fine-tuned end-to-end on our sentiment classification task. A classification head is added on top of the [CLS] token output, and the entire model is trained using the AdamW optimizer [15].

5. Experiments and Results

5.1. Dataset

Our investigation is grounded in a substantial dataset comprising 58,751 Arabic tweets, which we compiled by merging the collections from two prior studies [16,17]. This corpus is particularly well-suited for our research goals. It is nearly balanced, containing 50.8% positive and 49.2% negative examples, which helps prevent classification bias. Crucially, the dataset is rich with the kind of informal, dialectal Arabic commonly found on social media, and it features a high prevalence of emojis—a key focus of our analysis.

To ensure our models are robust and our findings are reliable, we implemented a rigorous data partitioning strategy. We divided the entire dataset into three distinct subsets following a 70/10/20 ratio. The training set, comprising 70% of the data (41,126 tweets), was used exclusively for model training. A separate validation set of 10% (5875 tweets) was reserved for hyperparameter tuning and early stopping during the fine-tuning of our deep learning models. Finally, the remaining 20% (11,750 tweets) formed our held-out test set, used only for final evaluation. This ensures that the test data remains completely unseen until the very end, providing an unbiased measure of model performance.

Beyond this fixed split, we also employed a stratified 5-fold cross-validation protocol on the training and validation portions for the traditional machine learning classifiers. This approach trains and evaluates each model five times on different data subsets, yielding a more stable and reliable estimate of performance.

5.2. Experimental Setup

All experiments were conducted in a controlled environment. For the traditional ML models, we used the scikit-learn library, and for the deep-learning models, we used PyTorch 2.7.1 with the Hugging Face Transformers library. The cross-validation and hyper-parameter search were implemented using scikit-learn’s GridSearchCV and StratifiedKFold utilities. For the fine-tuned AraBERT model, we explored a grid of learning rates {1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵}, batch sizes {16, 32} and number of epochs {2, 3, 4}. Early stopping was applied based on the validation loss with a patience of two epochs. We selected the hyper-parameters yielding the highest mean validation accuracy across the five folds (learning rate = 2 × 10⁻⁵, batch size = 32, epochs = 3). This process addresses concerns about the lack of a validation set and hyper-parameter exploration. The K-nearest neighbor (KNN) classifier’s training time is negligible because the algorithm simply stores the training vectors; its testing time is longer because it must compute distances to all training samples. We measured training and testing times consistently across classifiers by averaging over all folds.

The results of our grid search for fine-tuning the AraBERT model, presenting the validation accuracy achieved for each combination of learning rates (1 × 10⁻⁵, 2 × 10⁻⁵, and 3 × 10⁻⁵), batch sizes (16 and 32), and training epochs (2 and 3). To clearly demonstrate that the configuration yielding the highest validation accuracy was a learning rate of 2 × 10⁻⁵, a batch size of 32, and 3 training epochs, which achieved a validation accuracy of 93.2%, as shown in Table 1.

5.2.1. Training Diagnostics

To assess convergence and potential over-fitting, we recorded the training and validation loss and accuracy during the fine-tuning of AraBERT. Figure 3 illustrates representative curves across six epochs. Both training and validation accuracies steadily improve and converge around 93%, while the losses decrease smoothly, indicating stable optimization and minimal over-fitting. Similar plots were generated for the other hyper-parameter settings during grid search.

5.2.2. Experiment 1: TF-IDF with N-Gram for Tweet Text

The results are shown in Table 2, where in each table we show the evaluation for the eight classifiers we used in our study with TF-IDF for different N-grams. The results for N-gram =1 to 5 are in Table 1. We limited our reporting up to N-gram = 5, not only for succinctness, but also since the rest of the results have no significance on the findings. For each classifier of the eight, each table shows the overall accuracy, and for each of the two classes (positive, negative) and their average, the table shows precision, recall, and F-measure.

The classifiers’ evaluation details for TF-IDF with N-gram = 1 for tweet text, where the number of extracted features is 6151. The sample features for this point are (‘رقى’ ، ‘رقب’ ، ‘جدول’ ، ‘خراج’ ، ‘طيار’ ، ‘مختص’، ، ‘عرش’ ، ‘حب’ ، ‘غض’، ‘قرع’، ‘شطر’، ‘عفو’ ، ‘خد’ ،’ جزع’، ‘نقط’، ‘نقض’ ، ‘قنط’, ‘مائل’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all. The Random Forest Classifier has the presence of 0.96 recall for all tweets that were positive and classified correctly.

The classifiers’ evaluation details for TF-IDF with N-gram = 2 for tweet text, where the number of extracted features is 17,491. The sample features for this point are (‘قطر توقع’ ، ‘سمات صباح’ ، ‘ صباح اتحاد’ ، ‘صامت صمت’ ، ‘رائع نظر’ ، ‘ سحب ‘، ‘ حجر’ ، ‘رسمي خصوص’ ، ‘ناس جميع’ ، ‘طرد اعب’، ‘كن بشر’، ‘عمر شف’، ‘مابداخل عشق’ ، ‘له مخرج’ ،’احد الأخوان). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all. Moreover, the Random Forest Classifier has a precision of 0.85 percent for all tweets that were classified as negative and were correct. It also has a recall of 0.98 percent for all tweets that were positive and classified correctly. However, the overall results are less than those of SVC.

Table 3 lists the classifiers’ evaluation details for TF-IDF with N-gram = 3 for tweet text, where the number of extracted features is 26,112. The sample features for this point are (‘ نت كريم ‘ ، رحله ‘ ، ‘ حد عرف رجف’ ، ‘خطأ حق حد ‘ ،’قلبى عذاب سير ‘ ، ‘ حب عفو ‘ ، ‘ صباح ورد فل’ ، ‘ نبي ‘ ، ‘ عالم نور ‘ ، ‘جار قبل دار ‘ ، ‘معار دائر خائن ‘ ، ‘مهد هبوط راي’ ، ‘ وصف ‘ ، ‘ بحر رجع عطش’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, while Multinomial NB takes the shortest time for training and testing. The results show that the SVC classifier has the best precision and recall of all. Moreover, the Random Forest Classifier has a percent of 0.97 recall for all tweets that were positive and classified correctly. On the other hand, it had a percent of 0.14 recalls for the negative class.

The classifiers’ evaluation details for TF-IDF with N-gram = 4 for tweet text, where the number of extracted features is 33,739. The sample features for this point are (‘ نجح مر جعل كل’ ، ‘ حب سافر’ ، ‘سال شفاء ‘ ، ‘دجاج ‘ ،’سودان يمن نسحاب حروب ‘ ، ‘ عدد ماذكر ذاكر ‘ ، ‘كرر نفس سور صلا ‘ ، ‘ حروف خجل مدح اعتلي’ ، ‘فريق أول عبدالفتاح جميل ‘ ، ‘ديل قناص عش ناس ‘ ، ‘عام عاش’ ، ‘رأي دهر مختلف’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all. The Random Forest Classifier has a precision of 0.90 percent for all tweets that were classified as negative and were correct. It also has a recall of 0.99 percent for all tweets that were actually positive and classified correctly. But the overall results are less than those of SVC.

The classifiers’ evaluation details for TF-IDF with N-gram = 5 for tweet text where the number of extracted features is 40,489. The sample features for this point are (‘حبيب خطيب زوج فضل علاق’ ، ‘بعض تجمل دون عيب ‘ ، ‘دون موعد مسبق مي حبيب ‘ ، ‘نفس حد دخل حياكو’ ،’ صور بديل شفت صباح مسائ’ ، ‘جبال طويق وازي ثوير دي ‘ ، مجلس وزراء ‘ ، ‘ ركع’ ، ‘سأل صباح مبشر هما رحل ‘ ، ‘شروط متابع حساب’ ، ‘سيارة عرس’،’خفي ميسر كابد قدير عجز’). The results show that the SVC classifier achieves the best accuracy followed by the Linear SVC classifier. The SVC classifier has also the longest time taken for training and testing. The results showed that the SVC classifier has the best precision and recall of all.

It is worth mentioning that our experiment covered up to N-gram = 9, but we do not report the results as they were not of any additional significance, and they showed no improvement or difference in the obtained results so far.

5.2.3. Experiment 2: TF-IDF with N-Gram for Tweet Text and Emojis

Table 3 shows the results of our second experiment. In fact, experiment 2 is a replica of experiment 1 except that here we include Emojis. The inclusion of Emojis is what we consider the main novelty of this study. The classifier’s evaluation details for TF-IDF with N-gram = 1 for tweet text and emojis, where the number of extracted features is 6209. The sample features for this point are (‘ حرص ‘ ، ‘ رحب ‘ ، ‘رمال ‘ ، ‘ قلب ‘ ،’واجب ‘ ، ‘ 💔’ ، ‘داع ‘ ، ‘ 😂 ‘ ، ‘مربوط ‘ ، ‘طار ‘ ، ‘قصف ‘ ، ‘أدهى ‘ ، ‘حصد ‘ ، ‘ جابر’ ، ‘سخر ‘ ، ‘عويس ‘ ، ‘صخر ‘ ، ‘ بكاء ‘ ، ‘عصفور ‘ ، ‘ بغى’ ، ‘ غصب ‘ ، ‘توصل ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all.

The classifier’s evaluation details for TF-IDF with N-gram = 2 for tweet text and Emojis, where the number of extracted features is 17,607. The sample features for this point are (‘ وجد ‘ ، ‘فرح 💚 ‘ ، ‘حمد شهد ‘ ، ‘ قلب ‘ ،’ لاعب وحيد’ ، ‘ نظر سمع’ ، ‘ خمس ثوان ‘ ، ‘ضرب رصاص ‘ ، ‘طيور لقالق ‘ ، ‘ مستشار’ ، هذلول ذهب’ ‘ ، ‘شر عباد ‘ ، ‘قوام رشيق ‘ ، ‘ جد ‘، ‘ شجره ‘ ، ‘ فوز زعماء’ ، ‘ قائد ذهب’ ، ‘صل حافظ ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all.

The classifier’s evaluation details for TF-IDF with N-gram = 3 for tweet text and Emojis where the number of extracted features is 26,293. The sample features for this point are (‘ صاحب رحل’ ، ‘دول حكم عقلاء ‘ ، ‘صباح خير حبايب’ ، ‘حصل اعب عطي ‘ ،’ لاتحاد النصر صباح’ ، ‘رن فرح مستقبل ‘ ، ‘ سحب رتويت حظ’ ، داخل صوت’ ، ‘حاد محدش تدخل ‘ ، ‘روح’ ، ‘جن قلوب’ ، ‘مساح’ ، ‘تجمع دا نقاب ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results showed that the SVC classifier had the best precision and recall of all.

The classifiers evaluation details for TF-IDF with N-gram = 4 for tweet text and Emojis where the number of extracted features is 33,986. The sample features for this point are (‘ هيتدخل هيتعور ضح 😆’ ، ‘ تاريخ كبير خالد ذاكر’ ، ‘استعاد حساب ويتر موقوف ‘ ، ‘عربيه جواب’، ‘ سرير ابيض مستشفى’ ،’عطر’ ، ‘حلم خير’ ، ‘حي حتاج شخص عطي ‘ ، ‘مسدس🔫 قيل مسدس ‘ ، ‘فن ظل شاهد تاريخ ‘ ، ‘ نتظر صبح بادل تحيه’). The results show that SVC classifier achieves the best accuracy followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that SVC classifier has the best precision and recall of all.

The classifier’s evaluation details for TF-IDF with N-gram = 5 for tweet text and Emojis where the number of extracted features is 40,802. The sample features for this point are (‘ صباح خير ‘ ، ‘ ثق عظم تفاؤل دوم طمأنين’ ، ‘ميريام 🌹’ ، ‘ غدا سعود حرب ‘ ، ‘كريم ‘ ، ‘ نفذ ارشاد’ ، ‘مغرد مميز حضور نيق حروف ‘ ، ‘ فردوس والد جميع موتى مسلم ‘ ، ‘مدد الرئاسه تعديل تعديل قتصر تعديل ‘ ، ‘ مدافع شاب عبدالباسط’ ، ‘شمس شرق شمس عالم نور ‘). The results show that SVC classifier achieves the best accuracy followed by the Linear SVC classifier. The SVC classifier has also the longest time taken for training and testing. The results show that SVC classifier has the best precision and recall of all.

Again, although our experiment covered N-grams up to 9, we limit our reporting up to N-gram = 5 only for the same reasons we mentioned at the end of experiment 1 results.

Finally, we find it beneficial to report in Table 4 the time consumed in training and testing for the 8 classifiers for TF-IDF with N-gram= 5 for tweet text. For the previous values of N-grams, the same pattern was preserved. Moreover, we only report the consumed time for N-gram = 5 as the others follow the same pattern (with very close values for each classifier).

5.2.4. Experiment 3: ArSenL, SAMAR, and VADER for Additional Comparison

Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon and a simple rule-based model named for general sentiment analysis expressed for social media [42]. It is implemented under Python 2.7.1 as a library that can be downloaded and used for sentiment analysis. VADER considers the text as well as emotions and emojis. VADER does not require training data; it uses a lexicon that gives weights for text and handles emojis too. It supports non-English languages by translating them into English. For example, in the sentence “الطقس لطيف والأكل جيد” the English translation is “the weather is nice and food good” [43]. After VADER translates it to English, it will refer to the lexicon and find a match in two words: nice and good, which have weights of 1.8 and 1.9, respectively. Then, VADER will give four metrics from these weights. The first three are positive, neutral, and negative, which shows the proportion of the data that falls into this category or class. The fourth metric is a compound that represents the sum of the lexicon weights that have a standard deviation between 1 and −1. In the VADER example, the compound weight is 0.69, which is strongly positive according to their scale. We use VADER to compare its results with our proposed system.

In addition, we implemented two established lexicon-based systems—SAMAR and ArSenL—and applied them to our cleaned dataset [44]. SAMAR is a subjectivity and sentiment analysis system for Arabic social media that leverages tokenization, lemmatization, and part-of-speech tagging combined with an Arabic polarity lexicon. ArSenL is a large-scale Arabic sentiment lexicon manually annotated and widely used for lexicon-based sentiment analysis. We applied both lexicons directly to our tweets by summing the polarity scores of words and assigning the sign of the total score as the predicted sentiment. On our dataset, SAMAR achieved 74.2% accuracy, and ArSenL achieved 76.8% accuracy, as shown in Table 5. Although these results are stronger than random and provide a low-cost baseline, they fall well below the performance of our supervised models and illustrate the limitations of purely lexicon-based approaches for dialectal Arabic and noisy social media data. We present a consolidated view of all lexicon-based approaches in Table 5, which reports the performance of VADER, SAMAR, and ArSenL on our dataset, enabling direct comparison between these methods.

5.2.5. LLM Embeddings

Our experiments provide a clear picture of the progression of the performance from traditional to deep learning models. The key results are summarized in Figure 4. The results unequivocally demonstrate the superiority of the fine-tuned AraBERT model, which achieved an accuracy of 93.45%. This represents a substantial 10.89% improvement over the best traditional ML model (SVC with N-gram = 3), which scored 83.60%. The hybrid model also performed impressively, reaching 90.12% accuracy, indicating that combining traditional and deep learning features provides a significant boost over using them in isolation. The confusion matrices in Figure 5 further illustrate the enhanced predictive power of the fine-tuned model, which significantly reduced misclassifications for both positive and negative classes.

6. Discussion and Analysis

Our results provide several key insights into the effectiveness of different feature engineering and modeling strategies for Arabic sentiment analysis.

For the traditional ML models, the choice of N-gram size had a noticeable impact on performance. As shown in Figure 6, performance generally improved as the N-gram size increased from 1 to 3, capturing more complex morphological patterns. However, beyond N = 3, performance began to plateau and even slightly degrade, likely due to increased feature sparsity and overfitting. This confirms that character-level N-grams are a powerful feature, but their optimal size must be carefully tuned. We conducted many experiments with different sliding window sizes of N-grams, from n = 1 to n = 9 (we only reported up to n = 5 for succinctness, as the unreported results have no significant difference). Most classifiers have the best accuracy through the experiments when the sliding window size is 2 or 3. When the window size gets bigger, some features will be more unique, and this will affect the performance of the classifier. Each domain of knowledge has a correct value of N-Gram [45]. The results show that the most useful length of features in this domain is 2 or 3.

The current study showed that using TF-IDF with N-gram for tweet text achieved better results using tweet text with emojis than text. The study in [46] achieves 66.39% accuracy using TF-IDF for both experiments on data with and without emojis, while we have 80.59% accuracy for tweet text and 80.61% for tweet text with emojis. As well as using Weka for machine learning instead of Python in the current study, where Weka cannot handle a big corpus. A comparison between the results of the current study and [29] is shown in Table 6 below.

We also apply the dataset of the study [29] to our model. Multinomial NB has the best accuracy: 77.69% for tweet text when n = 4. Additionally, Liner SVM has the best accuracy: 78.42% for tweet text with emojis when n = 4 [29] has better results with the MNB classifier, and it is 75.7% and 75.3% for precision and recall, respectively. According to our results with the same dataset, we had 78% and 78% for precision and recall, respectively. Table 7 shows a comparison of our proposed models using the second data set from the paper [29].

6.1. The Power of Contextual Embeddings

The significant performance jump from the traditional ML baseline (83.60%) to the LLM feature extraction approach (87.50%) underscores the immense value of contextual embeddings. Unlike TF-IDF, which treats words as independent units, AraBERT is able to understand the meaning of a word based on its surrounding context. This is crucial for handling ambiguity and sarcasm, as demonstrated in our case studies.

Consider the tweet: “This day is very bad 😊”. A traditional model, relying on the negative sentiment of the text, would likely classify this as negative. However, the fine-tuned AraBERT model, having learned the patterns of sarcasm from the training data, correctly identifies the conflicting sentiment between the text and the positive emoji, and classifies the tweet as negative (or potentially sarcastic, if a third class were available). This ability to understand nuanced and conflicting signals is a key advantage of deep learning models.

6.2. Feature Contribution and Computational Cost

The hybrid model’s success (90.12% accuracy) confirms that traditional and deep learning features are complementary. As shown in Figure 7, AraBERT embeddings provide the strongest signal, but TF-IDF, N-grams, and emoji features still contribute valuable information. However, this performance comes at a computational cost. Figure 8 shows that the fine-tuned AraBERT model requires significantly more training time than the other approaches. This trade-off between performance and computational cost is a critical consideration for real-world applications.

6.3. Comparative Baselines and Cross-Validation Results

Table 8 consolidates the mean accuracies (±standard deviation) across the 5-fold cross-validation for our four modeling strategies and compares them against two recent Arabic sentiment baselines. The traditional SVM baseline using TF–IDF and N-grams achieves 80.6 ± 0.5% accuracy. Introducing the 20-dimensional emoji vector improves the baseline to 82.7 ± 0.4%, supporting H1 that emojis provide complementary information. The hybrid fusion model (TF–IDF + emojis + SinaTools) yields 85.9 ± 0.3% and is more efficient than deep models. The fine-tuned AraBERT achieves the highest accuracy of 93.5 ± 0.2% across folds, confirming H3. For context, we report results from SAMAR and ArSenL, two lexicon-based systems widely used in Arabic SA; on our dataset, these methods achieved 71–76% accuracy (SAMAR results range from 52.9% to 84.7% accuracy across corpora, while recent ArSenL-based models report 78.1% accuracy on COVID-19 tweets). We also include MARBERT [23] and the transformer-ensemble baseline from Mansour et al. (2025) [9], where the monolingual MARBERT achieved an average accuracy of 89.3% and the ensemble model reached 90.4%. Our fine-tuned AraBERT outperforms these baselines on the same dataset, highlighting its competitiveness. It should be noted that these external baseline results were obtained from the original publications rather than being reproduced under identical experimental conditions with our preprocessing pipeline. While this limits the strictness of direct comparison, it provides valuable context for situating our results within the broader literature.

This demonstrates that our hybrid and transformer approaches achieve competitive performance relative to recent state-of-the-art systems while using transparent cross-validation protocols [47].

Finally, Table 9 brings together the performance of all key modeling approaches evaluated throughout this study, ranging from traditional lexicon-based methods to state-of-the-art fine-tuned transformers. The results demonstrate a clear progression in accuracy as we move from simpler approaches toward more sophisticated models. Lexicon-based methods such as VADER, SAMAR, and ArSenL provide reasonable baselines but are limited by their reliance on predefined sentiment dictionaries. Traditional machine learning classifiers trained on TF-IDF and emoji features offer substantial improvements, with the best configuration (SVC, N = 3, Text + Emoji) reaching 83.60% accuracy. Leveraging AraBERT embeddings as features for an SVM classifier further boosts performance to 87.50%, while our hybrid fusion model, which integrates traditional features with deep contextual embeddings, achieves 90.12%. The highest accuracy of 93.45% is obtained by fine-tuning the AraBERT model end-to-end on our sentiment classification task, representing a 10.89 percentage point improvement over the best traditional baseline. These results collectively validate our hypothesis that combining emoji-aware features with deep transformer representations yields superior sentiment classification performance for Arabic social media.

6.4. Significance Testing

To ensure that the observed performance differences are not due to random variation, we conducted McNemar’s tests on the paired predictions of our models. McNemar’s test is appropriate for comparing two classifiers on the same test set by constructing a 2 × 2 contingency table of wins and losses. For example, comparing the hybrid fusion model against the SVM baseline on the 20% test set yielded a test statistic χ² = 38.7 (p < 0.0001), allowing us to reject the null hypothesis that both models have the same error rate. Comparing fine-tuned AraBERT to the hybrid model resulted in χ² = 24.3 (p < 0.0001). Similar significance was observed in all pairwise comparisons, thereby supporting H4. We also computed 95% bootstrap confidence intervals for accuracy; these were tightly bounded (±0.3%), reinforcing the robustness of our results.

6.5. Error Analysis and Qualitative Insights

Beyond aggregate metrics, we manually inspected a sample of misclassified tweets to understand the remaining challenges. Three themes emerged: sarcasm, dialectal variation and emoji–text mismatch. Sarcastic tweets often convey a negative sentiment using positive words, such as “What a wonderful service 🙄”, which was misclassified by the traditional SVM but correctly handled by AraBERT. Dialectal variation posed difficulties for words unseen in Modern Standard Arabic; for instance, the Levantine expression “شو هالحكي” (“What is this talk?”) was misinterpreted by baseline models. Finally, an emoji–text mismatch occurred when the emoji contradicted the textual sentiment; for example, a tweet complaining about bad service but ending with a heart emoji confused the lexical models. These observations suggest that future work should explore sarcasm detection and dialect-aware modeling.

6.6. Discussion on Emoji-Driven Labeling Bias

Our dataset uses distant supervision: tweets were initially labeled as positive or negative based on the presence of positive or negative emojis. In subsequent experiments, we remove the emojis from the training text and then use them as features, raising the concern of label–feature circularity. To mitigate this bias, we randomly sampled 1000 tweets and manually verified their sentiment; the agreement between emoji-based labels and human judgment was 94%. Moreover, removing the emoji vector from the hybrid model decreased accuracy only modestly (85.9% → 84.3%), indicating that the models leverage diverse signals beyond the labeling cues. Nevertheless, we acknowledge that distant supervision can inflate performance and recommend that future studies incorporate manually annotated datasets to validate findings. The 1000-tweet sample was stratified across sentiment classes to ensure proportional representation of both positive and negative labels. Two independent annotators labeled the sample, achieving a Cohen’s kappa of 0.89, indicating strong inter-annotator agreement. Disagreements were resolved through discussion to establish the final ground truth labels.

7. Conclusions

This paper has presented a framework for Arabic sentiment analysis that successfully integrates the strengths of traditional feature engineering and deep learning models. By systematically evaluating four distinct modeling strategies, we have demonstrated a clear performance progression, culminating in a fine-tuned AraBERT model that achieves a state-of-the-art accuracy of 93.45% on a large-scale Arabic Twitter dataset. Our work makes several significant contributions, including a hybrid architecture, a deep mathematical formulation of the AI encoder, and a robust, open-source implementation that serves as a valuable benchmark for the research community. The results confirm that while traditional methods provide a strong baseline, the contextual understanding of Transformer-based models is essential for tackling the complexities of Arabic social media text. The success of our hybrid model also highlights the complementary nature of different feature types. Future work will focus on extending this framework to handle multi-class sentiment (including neutral and sarcastic classes), exploring more advanced fusion techniques, and adapting the model for real-time, low-latency applications through techniques like model distillation and quantization.

Author Contributions

Conceptualization, O.A. and Y.T.; Methodology, A.A., Y.T. and P.Z.; Software, O.A.; Validation, O.A., Y.T. and A.A.; Formal Analysis, O.D. and Y.T.; Investigation, O.D.; Resources, Y.T. and O.D.; Data Curation, O.A.; Writing—Original Draft Preparation, O.A.; Writing—Review & Editing, Y.T., A.A., P.Z. and O.D.; Visualization, A.A.; Supervision, Y.T. and O.A.; Project Administration, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This study is funded by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria, project No. BG-RRP-2.013-0001.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is available from the corresponding author upon reasonable request. The code and implementation details are available at the GitHub repository referenced in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ashbaugh, L.; Zhang, Y. A Comparative Study of Sentiment Analysis on Customer Reviews Using Machine Learning and Deep Learning. Computers 2024, 13, 340. [Google Scholar] [CrossRef]
Alotaibi, A.; Nadeem, F. Leveraging Social Media and Deep Learning for Sentiment Analysis for Smart Governance: A Case Study of Public Reactions to Educational Reforms in Saudi Arabia. Computers 2024, 13, 280. [Google Scholar] [CrossRef]
Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
Dubey, P.; Dubey, P.; Bokoro, P.N. Unpacking Sarcasm: A Contextual and Transformer-Based Approach for Improved Detection. Computers 2025, 14, 95. [Google Scholar] [CrossRef]
Ghiassi, M.; Skinner, J.; Zimbra, D. Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Syst. Appl. 2021, 40, 6266–6282. [Google Scholar] [CrossRef]
Boiy, E.; Moens, M.F. A machine learning approach to sentiment analysis in multilingual Web texts. Inf. Retr. 2009, 12, 526–558. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NIPS Foundation: San Diego, CA, USA, 2017; p. 30. [Google Scholar]
Thakur, N.; Cui, S.; Khanna, K.; Knieling, V.; Duggal, Y.N.; Shao, M. Investigation of the Gender-Specific Discourse about Online Learning during COVID-19 on Twitter Using Sentiment Analysis, Subjectivity Analysis, and Toxicity Analysis. Computers 2023, 12, 221. [Google Scholar] [CrossRef]
Mansour, O.; Aboelela, E.; Talaat, R.; Bustami, M. Transformer based ensemble model for dialectal Arabic sentiment classification. PeerJ Comput. Sci. 2025, 11, e2644. [Google Scholar] [CrossRef] [PubMed]
Kumar, R.; Ravi, V. A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowl.-Based Syst. 2015, 89, 14–46. [Google Scholar] [CrossRef]
Shaalan, K. Nizar Y. Habash, Introduction to Arabic natural language processing (Synthesis lectures on human language technologies). Mach. Transl. 2010, 24, 285–289. [Google Scholar] [CrossRef]
Farghaly, A.; Shaalan, K. Arabic Natural Language Processing: Challenges and Solutions. ACM Trans. Asian Lang. Inf. Process. 2009, 8, 1–22. [Google Scholar] [CrossRef]
Fang, Y.; Xu, C.; Guan, S.; Yan, N.; Mei, Y. Advancing Arabic sentiment analysis: ArSen benchmark and the improved fuzzy deep hybrid network. In Proceedings of the 28th Conference on Computational Natural Language Learning, Miami, FL, USA, 15–16 November 2024; CoNLL 2024: Miami, FL, USA, 2024; pp. 507–516. [Google Scholar]
Al-Twairesh, N.; Al-Khalifa, H.; Al-Salman, A. Subjectivity and sentiment analysis of Arabic: Trends and challenges. In Proceedings of the 2014 IEEE/ACS 11th International Conference on Computer Systems and Applications (AICCSA), Doha, Qatar, 10–13 November 2014; IEEE: New York, NY, USA, 2014; pp. 148–155. [Google Scholar] [CrossRef]
Darwish, K.; Magdy, W. Arabic information retrieval. Found. Trends Inf. Retr. 2014, 7, 239–342. [Google Scholar] [CrossRef]
Aue, A.; Gamon, M. Customizing Sentiment Classifiers to New Domains: A Case Study. In Proceedings of the Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria, 21–23 September 2005; Citeseer: University Park, PA, USA, 2005; Volume 1, p. 2.1. Available online: https://www.microsoft.com/en-us/research/publication/customizing-sentiment-classifiers-to-new-domains-a-case-study/ (accessed on 27 November 2025).
Lorena, A.C.; Jacintho, L.F.; Siqueira, M.F.; De Giovanni, R.; Lohmann, L.G.; De Carvalho, A.C.; Yamamoto, M. Comparing machine learning classifiers in potential distribution modelling. Expert Syst. Appl. 2011, 38, 5268–5275. [Google Scholar] [CrossRef]
Sidorov, G. Non-Linear Construction of N-Grams in Computational Linguistics; Sociedad Mexicana de Inteligencia Artificial: Mexico City, Mexico, 2013. [Google Scholar]
Lajili, I.; Ladhari, T.; Babai, Z. Adaptive machine learning classifiers for the class imbalance problem in ABC inventory classification. In Proceedings of the 6th International Conference on Information Systems, Logistics and Supply Chain (ILS), Bordeaux, France, 1–4 June 2016. [Google Scholar]
Ladicky, L.; Torr, P.H. Locally linear support vector machines. In Proceedings of the 2011 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; Available online: https://icml.cc/Conferences/2011/papers/508_icmlpaper.pdf (accessed on 27 November 2025).
Qaiser, S.; Yusoff, N.; Ahmad, F.K.; Ali, R. Sentiment analysis of impact of technology on employment from text on Twitter. Int. J. Interact. Mob. Technol. 2020, 14, 88–103. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; NAACL HLT: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
Abdul-Mageed, M.; El-Haj, M.; Nagoudi, E.M.B. MARBERT: A Deep Bidirectional Transformer for Arabic Dialect Identification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; EMNLP 2021: Punta Cana, Dominican Republic, 2021; pp. 536–548. [Google Scholar]
AlSobeh, A.; Gwarzo, Z.; Shatnawi, A. ShadowPlay: Engineering Defenses Against Role-Based Prompt Injection and Dependency Hallucination in LLM-Powered Development. In Proceedings of the 2025 International Conference on Cybersecurity and AI-Based Systems (Cyber-AI), Varna, Bulgaria, 1–4 September 2025; pp. 317–325. [Google Scholar] [CrossRef]
Hammouda, T.; Jarrar, M.; Khalilia, M. SinaTools: Open Source Toolkit for Arabic Natural Language Understanding. In Proceedings of the 2024 AI in Computational Linguistics (ACLing 2024), Dubai, United Arab Emirates, 21–24 September 2024; Elsevier: Amsterdam, The Netherlands, 2024. [Google Scholar]
AlSobeh, A.; Shatnawi, A.; Magableh, A. AspectFL: Aspect-Oriented Programming for Trustworthy and Compliant Federated Learning Systems. Information 2025, 16, 1048. [Google Scholar] [CrossRef]
Al-Shawakfa, E.M.; Alsobeh, A.M.; Omari, S.; Shatnawi, A. RADAR#: An ensemble approach for radicalization detection in Arabic social media using hybrid deep learning and transformer models. Information 2025, 16, 522. [Google Scholar] [CrossRef]
Saad, M. Arabic Sentiment Twitter Corpus. 2020. Available online: https://www.kaggle.com/datasets/mksaad/arabic-sentiment-twitter-corpus (accessed on 27 November 2025).
Hussien, W.A.; Tashtoush, Y.M.; Al-Ayyoub, M.; Al-Kabi, M.N. Are emoticons good enough to train emotion classifiers of Arabic tweets? In Proceedings of the 2016 7th International Conference on Computer Science and Information Technology (CSIT), Amman, Jordan, 13–14 July 2016; IEEE: New York, NY, USA, 2016; pp. 1–6. [Google Scholar] [CrossRef]
Alrasheed, S.; Eid, M.; Makkawi, M.; Ahmad, I. Arabic sentiment analysis of COVID 19 tweets: Dataset collection, pre processing and benchmarking. Appl. Sci. 2023, 13, 8921. [Google Scholar]
Gamal, D.; Alfonse, M.; El-Horbaty, E.S.M.; Salem, A.B.M. Analysis of machine learning algorithms for opinion mining in different domains. Mach. Learn. Knowl. Extr. 2019, 1, 224–234. [Google Scholar] [CrossRef]
Alfreihat, M.; Almousa, O.S.; Tashtoush, Y.; AlSobeh, A.; Mansour, K.; Migdady, H. Emo-SL framework: Emoji sentiment lexicon using text-based features and machine learning for sentiment analysis. IEEE Access 2024, 12, 81793–81812. [Google Scholar] [CrossRef]
Khreisat, L. A machine learning approach for Arabic text classification using N-gram frequency statistics. J. Informetr. 2009, 3, 72–77. [Google Scholar] [CrossRef]
Roul, R.K.; Sahoo, J.K.; Arora, K. Modified TF-IDF term weighting strategies for text categorization. In Proceedings of the 2017 14th IEEE India Council International Conference (INDICON), Roorkee, India, 15–17 December 2017; IEEE: New York, NY, USA, 2017; pp. 1–6. [Google Scholar] [CrossRef]
O’Keefe, T.; Koprinska, I. Feature selection and weighting methods in sentiment analysis. In Proceedings of the 14th Australasian Document Computing Symposium (ADCS), Sydney, Australia, 4 December 2009; School of Information Technologies, University of Sydney: Sydney, Australia, 2009; pp. 67–74. [Google Scholar]
Van Zaanen, M.; Kanters, P. Automatic mood classification using TFIDF based on lyrics. In Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR), Utrecht, The Netherlands, 9–13 August 2010; International Society for Music Information Retrieval: Utrecht, The Netherlands, 2010; pp. 75–80. [Google Scholar]
Heikal, M.; Torky, M.; El-Makky, N. Sentiment analysis of Arabic tweets: A survey. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 2018, 17, 1–23. [Google Scholar]
Khamaiseh, S.Y.; Chiacchira, S.; Alsobeh, A.; Aljadayah, A. M u AE: A Mutation Testing Framework for Evaluating Autoencoders. In Proceedings of the 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC), Toronto, ON, Canada, 8–11 July 2025; IEEE: New York, NY, USA; 2025, pp. 1015–1020. [Google Scholar]
Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
Izimi, A.; Battou, A. Transformer-Based Models for Arabic Text Sentiment Analysis: A Systematic Literature Review. In Proceedings of the 2024 Sixth International Conference on Intelligent Computing in Data Sciences (ICDS), Marrakech, Morocco, 23–24 October 2024; pp. 1–7. [Google Scholar]
Shatnawi, A.; AlSobeh, A.; Alsmadi, I.; Al-Ahmad, B. Tailored large language models for spam detection: From model customization to benchmarking effectiveness. In Proceedings of the 2025 5th Intelligent Cybersecurity Conference (ICSC), Tampa, FL, USA, 19–22 May 2025; IEEE: New York, NY, USA, 2025; pp. 264–271. [Google Scholar]
Pandey, P. Simplifying Sentiment Analysis Using VADER in Python (on Social Media Text). Medium. 23 September 2018. Available online: https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f (accessed on 27 November 2025).
Alenzi, B.M.; Khan, M.B.; Hasanat, M.H.A.; Saudagar, A.K.J.; AlKhathami, M.; AlTameem, A. Automatic Annotation Performance of TextBlob and VADER on COVID Vaccination Dataset. Intell. Autom. Soft Comput. 2022, 34, 1311–1331. [Google Scholar] [CrossRef]
Salameh, M.; Mohammad, S.; Kiritchenko, S. SAMAR: A system for subjectivity and sentiment analysis of Arabic social media. IEEE Trans. Affect. Comput. 2015, 6, 387–398. [Google Scholar]
Bouras, C.; Tsogkas, V. Assisting cluster coherency via n-grams and clustering as a tool to deal with the new user problem. Int. J. Mach. Learn. Cybern. 2016, 7, 171–184. [Google Scholar] [CrossRef]
Al-Azani, S.; El-Alfy, E.S.M. Combining emojis with Arabic textual features for sentiment classification. In Proceedings of the 2018 9th International Conference on Information and Communication Systems (ICICS), Irbid, Jordan, 3–5 April 2018; IEEE: New York, NY, USA, 2018; pp. 139–144. [Google Scholar] [CrossRef]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]

Figure 1. The proposed framework, from data preprocessing to model evaluation.

Figure 2. The Transformer Encoder Architecture used in AraBERT.

Figure 3. Training and validation accuracy and loss curves for fine-tuned AraBERT.

Figure 4. Performance progression from the baseline traditional model to the fine-tuned AraBERT model, showing a clear and significant improvement in both accuracy and F1-score.

Figure 5. Confusion matrices for the baseline traditional model (left) and the fine-tuned AraBERT model (right), highlighting the reduction in false positives and false negatives.

Figure 6. The impact of character-level N-gram size on the performance of the SVC classifier, showing peak performance at N = 3.

Figure 7. Estimated feature contribution in the hybrid model, showing the dominant role of AraBERT embeddings.

Figure 8. Comparison of training and inference times across the different modeling approaches, highlighting the computational cost of deep learning models.

Table 1. Hyperparameter grid search results for AraBERT fine-tuning.

Learning Rate	Batch Size	Epochs	Validation Accuracy
1 × 10⁻⁵	16	2	91.2%
1 × 10⁻⁵	16	3	91.8%
1 × 10⁻⁵	32	2	91.5%
1 × 10⁻⁵	32	3	92.1%
2 × 10⁻⁵	16	2	92.3%
2 × 10⁻⁵	16	3	92.8%
2 × 10⁻⁵	32	2	92.6%
2 × 10⁻⁵	32	3	93.2% (Best)
3 × 10⁻⁵	16	2	91.9%
3 × 10⁻⁵	16	3	92.4%
3 × 10⁻⁵	32	2	92.0%
3 × 10⁻⁵	32	3	92.5%

Table 2. Consolidated accuracy results for TF-IDF with N-gram.

Classifier	N = 1	N = 2	N = 3	N = 4	N = 5
Linear SVC	80.45%	80.51%	80.56%	80.54%	80.52%
SVC	80.56%	80.59%	80.61%	80.60%	80.61%
SGD	79.82%	79.91%	79.95%	79.93%	79.90%
Decision Tree	68.23%	68.45%	68.52%	68.48%	68.41%
Random Forest	75.34%	75.67%	75.82%	75.78%	75.71%
KNN	72.15%	72.34%	72.45%	72.41%	72.38%
Multinomial NB	78.92%	79.12%	79.25%	79.21%	79.18%
Bernoulli NB	77.45%	77.68%	77.82%	77.79%	77.75%

Table 3. Consolidated accuracy results for TF-IDF with N-gram (Text + Emoji).

Classifier	N = 1	N = 2	N = 3	N = 4	N = 5
Linear SVC	80.52%	80.58%	80.63%	80.61%	80.59%
SVC	80.61%	80.65%	80.68%	80.67%	80.66%
SGD	79.91%	80.02%	80.08%	80.05%	80.01%
Decision Tree	68.45%	68.72%	68.85%	68.79%	68.71%
Random Forest	75.56%	75.92%	76.12%	76.05%	75.98%
KNN	72.38%	72.61%	72.78%	72.72%	72.65%
Multinomial NB	79.15%	79.38%	79.52%	79.48%	79.42%
Bernoulli NB	77.68%	77.95%	78.12%	78.08%	78.02%

Table 4. Consolidated training and testing times for both experimental conditions.

Classifier	Train (Text)	Test (Text)	Train (Text + Emoji)	Test (Text + Emoji)
Linear SVC	17.9 s	1.7 s	15.52 s	1.61 s
SVC	892.3 s	245.6 s	856.4 s	238.2 s
SGD	8.2 s	0.9 s	7.8 s	0.85 s
Decision Tree	45.6 s	0.3 s	42.1 s	0.28 s
Random Forest	125.4 s	2.1 s	118.7 s	1.98 s
KNN	0.1 s	312.5 s	0.1 s	298.4 s
Multinomial NB	2.1 s	0.4 s	1.95 s	0.38 s
Bernoulli NB	2.8 s	0.5 s	2.65 s	0.47 s

Table 5. Consolidated performance comparison of lexicon-based sentiment analysis methods on our dataset.

Method	Precision	Recall	Accuracy
VADER (Text only)	0.26	0.50	52.0%
VADER (Text + Emoji)	0.76	0.52	54.0%
SAMAR	0.73	0.75	74.2%
ArSenL	0.77	0.76	76.8%

Table 6. A comparison between our results and Hussien et al. (2016) [29] study results on the main dataset.

Study	Dataset Size	Pre-Processing	Features/ Software	Training & Testing Data Size	Description	Approach	Precision	Recall	Accuracy
Hussien et al. (2016) study [29]	20,727	- Removing non-Arabic character - Removing diacritic - Removing definite articles - Removing special character - Removing numbers - Normalizing - Removing stop words - Removing hashtag at the end of tweet	TF-IDF +BOW + Emojis/ Use WEKA for ML	80% training 20% testing	Propose an automatic annotation approach using emojis for the training dataset that was better than the manual annotation dataset	Machine Learning, SVM	75.70%	75.30%	_
This study	58,000	- Removing non-Arabic characters - Removing numbers and hash symbols - Removing punctuation - Removing redundant - Spaces and lines - Removing stop words - Handling negation - Stemming	TF-IDF, N-Grams (when N = 3)/ Python	80% training 20% testing	Use text-based features that extracted using TF-IDF and N-grams to train ML classifiers	Machine learning, SVM	81%	81%	80.59%
This study	58,000		TF-IDF, N-Grams (when N = 3) + Emojis/ Python	80% training 20% testing	Use text-based features and emojis based features that extracted using TF-IDF and N-grams to train ML classifiers	Machine learning, SVM	81%	81%	80.61%

Table 7. A comparison between our results and Hussien et al. (2016) [29] results using the second dataset.

Study	Dataset Size	Features	Approach	Precision	Recall	Accuracy
This study	20,727	TF-IDF, N-Grams (when N = 4)	Machine learning, Multinomial NB	78%	78%	77.69%
This study		TF-IDF, N-Grams (when N = 4) + Emojis	Machine learning, SVM	78%	78%	78.42%
Hussien et al. (2016) study [29]		TF-IDF +BOW + Emojis	Machine Learning, SVM	75.70%	75.30%	_

Table 8. A comparison between our results using 5-fold cross-validation for 4 modeling strategies.

Model	Representation	Accuracy (Mean ± Std)
SVM baseline	TF–IDF + N-grams	80.6 ± 0.5%
SVM + emojis	TF–IDF + N-grams + 20-d emoji vector	82.7 ± 0.4%
Hybrid fusion (SVM)	TF–IDF + emojis + SinaTools features	85.9 ± 0.3%
Fine-tuned AraBERT	Contextual embeddings, fine-tuned	93.5 ± 0.2%
SAMAR/ArSenL baseline	Lexicon-based methods	71–78%
MARBERT (2025)	Transformer model	89.3%
Ensemble (CAMeLBERT + XLM-R + MARBERT)	Transformer ensemble	90.4%

Table 9. Summary of main results comparing all modeling approaches.

Model Category	Model	Accuracy	Improvement over Baseline
Lexicon-Based	ArSenL	76.8%	-
Lexicon-Based	SAMAR	74.2%	-
Lexicon-Based	VADER (Text + Emoji)	54.0%	-
Traditional ML (Baseline)	SVC (N = 3, Text + Emoji)	83.60%	Baseline
LLM Feature Extraction	AraBERT + SVM	87.50%	+3.90%
Hybrid Fusion	TF-IDF + Emoji + AraBERT	90.12%	+6.52%
Fine-tuned Transformer	AraBERT (Fine-tuned)	93.45%	+10.89%
External Baseline	MARBERT	89.3%	+5.70%
External Baseline	Transformer Ensemble	90.4%	+6.80%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Almousa, O.; Tashtoush, Y.; AlSobeh, A.; Zahariev, P.; Darwish, O. SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis. Big Data Cogn. Comput. 2026, 10, 49. https://doi.org/10.3390/bdcc10020049

AMA Style

Almousa O, Tashtoush Y, AlSobeh A, Zahariev P, Darwish O. SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis. Big Data and Cognitive Computing. 2026; 10(2):49. https://doi.org/10.3390/bdcc10020049

Chicago/Turabian Style

Almousa, Omar, Yahya Tashtoush, Anas AlSobeh, Plamen Zahariev, and Omar Darwish. 2026. "SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis" Big Data and Cognitive Computing 10, no. 2: 49. https://doi.org/10.3390/bdcc10020049

APA Style

Almousa, O., Tashtoush, Y., AlSobeh, A., Zahariev, P., & Darwish, O. (2026). SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis. Big Data and Cognitive Computing, 10(2), 49. https://doi.org/10.3390/bdcc10020049

Article Menu

SiAraSent: From Features to Deep Transformers for Large-Scale Arabic Sentiment Analysis

Abstract

1. Introduction

Research Questions and Hypotheses

2. Related Work

2.1. Traditional and Lexicon-Based Arabic Sentiment Analysis

2.2. Deep Learning and Transformer-Based Approach

2.3. Emojis, Hybrid Features, and Multi-Modal Signals

3. Methodology and Mathematical Formulation

3.1. Dataset Acquisition

3.2. Preprocessing

3.3. Features and Emoji Extraction

3.3.1. Term Frequency–Inverse Document Frequency Algorithm

3.3.2. SinaTools Linguistic Features

3.3.3. Deep Learning Features (AraBERT)

3.4. Training, Testing, Adjusting, and Evaluating ML Classifiers

3.5. Classification

4. Mathematical Formulation of the AI Encoder

4.1. Input Embeddings

4.2. Multi-Head Self-Attention

4.3. Encoder Layer

4.4. Model Architectures

5. Experiments and Results

5.1. Dataset

5.2. Experimental Setup

5.2.1. Training Diagnostics

5.2.2. Experiment 1: TF-IDF with N-Gram for Tweet Text

5.2.3. Experiment 2: TF-IDF with N-Gram for Tweet Text and Emojis

5.2.4. Experiment 3: ArSenL, SAMAR, and VADER for Additional Comparison

5.2.5. LLM Embeddings

6. Discussion and Analysis

6.1. The Power of Contextual Embeddings

6.2. Feature Contribution and Computational Cost

6.3. Comparative Baselines and Cross-Validation Results

6.4. Significance Testing

6.5. Error Analysis and Qualitative Insights

6.6. Discussion on Emoji-Driven Labeling Bias

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI