1. Introduction
Finding accurate sentiment is vital for marketing, event detection, election polls, and governance [
1,
2]. However, it is not an easy task. Sentiment analysis (SA) faces many challenging factors, such as detecting sentiment in short texts. Assigning sentiment to Twitter text is similarly difficult and not always accurate: ambiguity, sarcasm, slang, acronyms, and emojis are easily missed [
3,
4]. Sentiment analysis (SA), a key area of Natural Language Processing (NLP), has become a vital tool for public opinion mining, customer feedback analysis, and social trend monitoring [
2]. Traditional machine learning approaches, which often rely on handcrafted features like Term Frequency-Inverse Document Frequency (TF-IDF) and N-grams, have provided a solid foundation for this task [
3]. While these methods are effective at capturing lexical patterns, they frequently fail to grasp the deeper contextual and semantic meanings embedded in the text, particularly in the presence of sarcasm, ambiguity, and dialectal variations. This limitation has necessitated a paradigm shift towards more sophisticated models that can understand language in a more human-like manner.
Neutral, positive, and negative sentiments are the most common sentiment polarities that exist in texts. The current tools have weaknesses and strengths in identifying accurate sentiment within a text. Many tools are capable of identifying positive sentiments within the text, while others are more efficient in exploring the negative ones [
5]. This happens when the literal meaning of an individual word conflicts with the sentiment of the surrounding context. Sentence ambiguity makes it difficult to assign an accurate polarity, since polarity depends strongly on context. Because of this ambiguity, some tools assign negative sentiment to neutral texts [
6].
The advent of deep learning, and specifically the Transformer architecture [
7], has revolutionized the field of NLP. Pre-trained language models such as BERT [
5] have demonstrated an unprecedented ability to learn rich, contextualized representations of language, leading to state-of-the-art performance on a wide array of NLP tasks. For the Arabic language, specialized models like AraBERT [
6] and MARBERT [
8,
9] have been developed, pre-trained on massive Arabic corpora to capture the unique intricacies of the language. These models offer a powerful alternative to traditional methods, but their full potential, especially when combined with established feature engineering techniques, has not been fully explored. Furthermore, specialized toolkits like SinaTools [
10] have emerged, providing advanced morphological analysis and semantic relatedness capabilities that can further enrich the feature set for sentiment analysis.
This paper presents a significant step forward by proposing SiAraSent, a novel hybrid framework that bridges the gap between traditional feature engineering and modern deep learning. We systematically combine the interpretable lexical features from TF-IDF, N-grams, and emojis with the powerful contextual embeddings from AraBERT and the advanced linguistic features from SinaTools. Our research makes the following key contributions: (1) A hybrid sentiment framework (SiAraSent) that integrates traditional features, advanced linguistic features from SinaTools, and deep contextual embeddings from AraBERT (
https://github.com/aub-mind/arabert) (accessed on 13 March 2025), creating a comprehensive and highly effective feature set for Arabic sentiment analysis. (2) A systematic and rigorous comparison of four distinct modeling strategies: a traditional ML baseline, an LLM feature extraction approach, a hybrid fusion model, and a fully fine-tuned AraBERT model. (3) A deep mathematical formulation of the AraBERT encoder architecture, detailing the input embeddings, multi-head self-attention mechanism, and classification layer, thereby ensuring transparency and reproducibility.
The results on a large-scale Arabic Twitter dataset achieve 93.45% accuracy, a significant improvement over existing methods. We also release a robust, open-source implementation of the entire framework (
https://github.com/aub-mind/arabert) (accessed on 13 March 2025), which provides a valuable benchmark and resource for the Arabic NLP research community.
Research Questions and Hypotheses
Our overarching aim is to understand how different feature representations and model architectures contribute to Arabic sentiment analysis. We therefore formulate the following research questions (RQ) and associated hypotheses (H):
RQ1: Do emojis carry complementary sentiment information beyond character and word N-gram features?
H1: Adding emoji-aware features will yield a statistically significant improvement in classification performance compared to models using only lexical features.
RQ2: Are the linguistic features provided by SinaTools beneficial beyond lexical and emoji features?
H2: The hybrid fusion model that concatenates lexical, emoji and morphological features will outperform pure TF–IDF or pure AraBERT embeddings.
RQ3: How does a fine-tuned transformer compare to feature-fusion approaches on large Arabic datasets?
H3: Fine-tuning AraBERT on our dataset will achieve the best overall accuracy but may be more computationally intensive than the hybrid model.
RQ4: Do the observed improvements generalize across data splits and are they statistically significant?
H4: Using cross-validation and significance tests, we expect our reported gains to remain significant across multiple folds and not be attributed to chance variation.
3. Methodology and Mathematical Formulation
Our proposed methodology is a multi-stage process designed to systematically extract and model sentiment from Arabic social media text [
27]. The overall architecture, depicted in
Figure 1, involves data preprocessing, multi-faceted feature engineering, and four distinct modeling strategies. This section provides a detailed description of each component, including the mathematical underpinnings of our AraBERT encoder.
3.1. Dataset Acquisition
The Arabic Sentiment Twitter Corpus [
28] was used in this study; it is a large corpus of positive and negative tweets collected from Twitter. The dataset contains 58 k Arabic tweets gathered using a lexicon of positive and negative emojis, so each tweet includes at least one emoji, which this research requires. The dataset is stored in TSV (tab-separated values) format, with one tweet per row. This research uses both the text and the emojis of each tweet in order to combine all the sentiment features a tweet contains. We also used the dataset from [
29] to perform a fair comparison with their results. Hereafter we call it the second dataset; it was collected from Twitter on the basis of trending hashtags and contains 22,752 Arabic tweets.
3.2. Preprocessing
In this study, preprocessing the aforementioned datasets consists of several steps: data cleaning, stop-word removal, stemming and morphological normalization, and emoji extraction. To prepare the raw tweets for analysis, we first perform data cleaning to handle missing or invalid entries. Empty tweets, tweets containing only URLs or punctuation, and corrupted Unicode strings are discarded. All duplicate tweets and obvious retweets are removed to prevent a few messages from dominating the corpus. We also strip metadata such as the “RT” token and user mentions, and handle duplicate posts created by automated bots. After removing duplicates, we normalize the text by unifying the many surface forms of Arabic characters: mapping the alef variants
أ, إ, آ to
ا, removing elongations and repeated letters, and replacing Tatweel characters. Previous studies on Arabic SA have shown that eliminating repeated letters and normalizing variants of
alef improve accuracy and reduce sparsity. We remove non-Arabic letters and digits, URLs, user mentions, and hashtags, and keep only the tweet body for analysis. Next, we apply stop-word removal using a curated list of Arabic stop words, and we rely on stemming/lemmatization to reduce words to a canonical form [
30]. For traditional classifiers, we use the ISRI stemmer, whereas for the linguistically informed pipeline, we leverage the SinaTools toolkit, which performs tokenization, morphological analysis and part-of-speech (POS) tagging. Stemming and lemmatization help increase the match between related forms and thus boost the recall of features across dialects. Finally, to capture the role of emojis, we extract all emoji characters from each tweet and categorize them into four basic emotion classes—disgust, anger, sadness and joy—prior to building our emoji feature vector. Emojis are removed from the text to avoid duplication with the emoji features.
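The cleaning and emoji-extraction steps above can be sketched in a few lines of Python. This is a minimal sketch, not our exact implementation: the regexes, the emoji Unicode ranges, and the helper name `preprocess` are simplifying assumptions.

```python
import re

# Illustrative patterns for the preprocessing steps described above.
ALEF_VARIANTS = re.compile("[أإآ]")          # alef variants unified to ا
TATWEEL = "\u0640"                            # Arabic elongation character
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")  # rough emoji range

def preprocess(tweet: str):
    """Return (normalized_text, extracted_emojis) for one raw tweet."""
    emojis = EMOJI_PATTERN.findall(tweet)                 # emoji extraction
    text = EMOJI_PATTERN.sub("", tweet)                   # strip emojis from the body
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)   # URLs, mentions, hashtags
    text = text.replace(TATWEEL, "")                      # remove Tatweel elongation
    text = ALEF_VARIANTS.sub("ا", text)                   # unify alef variants
    text = re.sub(r"(.)\1{2,}", r"\1", text)              # collapse repeated letters
    text = re.sub(r"[^\u0600-\u06FF\s]", " ", text)       # keep Arabic letters only
    return re.sub(r"\s+", " ", text).strip(), emojis
```

For example, `preprocess("أهلاااا 😂 http://t.co/x")` normalizes the alef, collapses the repeated letters, drops the URL, and returns the emoji separately for the emoji feature vector.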
3.3. Features and Emoji Extraction
The next stage of our work is Feature Extraction. In this stage, we extract significant features from the dataset to successfully perform the classification task using ML classifiers [
31]. For the emoji-based features extraction, we use the same method that we used in our previous work [
32]. We extract a rich set of features from three distinct categories. Inspired by [
3], we build a strong, interpretable baseline.
Traditional lexical features. We compute TF–IDF scores for character-level and word-level N-grams. Character N-grams with 1–5 characters capture sub-word patterns and morphology, while word N-grams with windows up to 3 words encode short phrases and negations. The TF–IDF weighting scheme down-weights ubiquitous stop-words and emphasizes terms distinctive to a tweet. Formally, the importance of a term
t in a document
d is given by:

TF–IDF(t, d) = TF(t, d) × log(N / df(t))

where TF(t, d) is the frequency of term t in document d, N is the total number of documents, and df(t) is the number of documents containing term t.
We use character-level N-grams (n = 1 to 5) to capture sub-word morphological patterns, which is particularly effective for the agglutinative nature of Arabic. The N-gram is defined as an N-character sliding window in a string, which is a language-independent approach that works very well with noisy data [
33]. In this study, the length of the tweets was measured, and the average length was found to be 9 words. The experiments were therefore conducted with variable window sizes from n = 1 to n = 9: the first experiment used (min = 1, max = 1), the second (min = 1, max = 2), the third (min = 1, max = 3), and so on up to n = 9.
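The variable-window sweep can be sketched with scikit-learn's `TfidfVectorizer`; the four toy tweets below stand in for the preprocessed corpus, and only the first five window sizes are shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the preprocessed tweet bodies.
docs = ["صباح الخير يا اصدقاء", "يوم سيء جدا", "احب هذا المكان", "لا احب الانتظار"]

# Sweep the variable window sizes: (min = 1, max = 1) up to (min = 1, max = 5).
feature_counts = []
for n_max in range(1, 6):
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, n_max))
    X = vec.fit_transform(docs)
    feature_counts.append(X.shape[1])
print(feature_counts)  # the feature count grows with the window size
```

The growing feature count with each wider window mirrors the pattern reported in the experiments (6151 features at n = 1 up to 40,489 at n = 5 on the full corpus).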
Emoji-aware features. Building on our previous work [
32], we represent each tweet with a 20-dimensional vector derived from the emoji tokens removed during preprocessing. The vector encodes:
Counts of positive, negative, and neutral emojis (three dimensions) based on an Arabic emoji sentiment lexicon.
Emoji density, measured as the ratio of emoji characters to the total length of the tweet (one dimension), captures how expressive a tweet is.
Sentiment polarity score computed as the difference between positive and negative emoji counts, normalized to [−1, 1] (one dimension).
Positional features indicating whether the majority of emojis appear at the beginning, middle, or end of the tweet (three dimensions). Emojis at the end often act as summarizing cues.
Distributional features, such as the maximum number of consecutive emojis and the variance in sentiment across the emoji sequence (five dimensions).
The remaining dimensions capture the presence of each of the four primary emotion categories (disgust, anger, sadness and joy) plus miscellaneous emoji groups. This design yields a compact yet expressive representation of the emoji signal.
Linguistic and morphological features. Using the SinaTools toolkit, we extract a suite of linguistic descriptors: (a) the distribution of part-of-speech tags over the tweet (17 POS tags), (b) counts of morphological patterns such as prefixes, suffixes and stems, (c) the ratio of verbs to nouns, and (d) a semantic relatedness score computed from an Arabic semantic lexicon. These features are concatenated into a 50-dimensional vector. Because the features derive from a high-dimensional one-hot encoding of POS tags and morphological templates, we apply min–max scaling to each feature group so that no single group dominates the hybrid vector. When combining the lexical, emoji and linguistic representations into the hybrid fusion model, we obtain a vector of approximately 7000 dimensions (depending on the chosen N-gram range). To mitigate the curse of dimensionality, we explored dimensionality reduction via principal component analysis and found that retaining 95% of the variance caused no accuracy loss while improving efficiency.
In summary, the 20-dimensional emoji vector encodes counts, polarity, density, positional and distributional measures; the 50 linguistic features are min–max scaled; and the dimensionality of the full hybrid vector is stated explicitly and can optionally be reduced via PCA.
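The first few dimensions of the emoji vector can be sketched as follows. The tiny sentiment sets are placeholders standing in for the Arabic emoji sentiment lexicon, so the helper `emoji_features` is illustrative only.

```python
# Placeholder lexicon; the real resource is an Arabic emoji sentiment lexicon.
POS_EMOJIS = {"😂", "❤", "💚", "🌹"}
NEG_EMOJIS = {"💔", "😡", "😢", "🔫"}

def emoji_features(emojis, tweet_len):
    """First five of the 20 dimensions: pos/neg/neutral counts, density, polarity."""
    pos = sum(e in POS_EMOJIS for e in emojis)
    neg = sum(e in NEG_EMOJIS for e in emojis)
    neu = len(emojis) - pos - neg
    density = len(emojis) / max(tweet_len, 1)      # expressiveness of the tweet
    polarity = (pos - neg) / max(pos + neg, 1)     # normalized to [-1, 1]
    return [pos, neg, neu, density, polarity]
```

For a 10-character tweet carrying one positive and one negative emoji, `emoji_features(["😂", "💔"], 10)` yields counts (1, 1, 0), a density of 0.2, and a neutral polarity of 0.0; the positional, distributional, and emotion-category dimensions would be appended the same way.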
3.3.1. Term Frequency–Inverse Document Frequency Algorithm
According to [
33], TF-IDF is considered one of the commonly used term weighting techniques in information retrieval systems [
34]. The TF–IDF algorithm is commonly used as a measure in the classification of textual content. TF-IDF consists of two factors: Term Frequency (TF) and Inverse Document Frequency (IDF). TF is calculated by counting the frequency of each term in a given document. IDF is calculated by dividing the total number of documents in the corpus by the number of documents in which the term occurs, and then taking the logarithm of the quotient. When we multiply TF by the IDF value for a term, we obtain a high score for terms that occur frequently in a few documents, and a low score for terms that occur frequently in almost every document. This score allows us to discover the important terms in a document [
35].
The equation below is used to calculate the importance of a term t in a document d by measuring how many times the term occurs in the document, divided by the count of all terms occurring in that document [
36]:

TF(t, d) = f(t, d) / Σ_{t′ ∈ d} f(t′, d)

where f(t, d) is the number of occurrences of term t in document d.
The limitation of using TF is that words like prepositions
من، في، الى, which occur frequently in every document, make the TF value very high. These words are not important for classification because they do not convey any emotion or sentiment; we are concerned with words that occur in only parts of the corpus and carry sentiment. So here we must use IDF, which weighs each word’s importance across documents. The equation below is used to calculate the IDF, which weighs the importance of a term in a corpus of documents.
The IDF for a term t is calculated by computing the logarithm of the corpus size divided by the number of documents that contain the term:

IDF(t) = log(|D| / df(t))

where |D| is the total number of documents in the corpus and df(t) is the number of documents in which term t appears. Taking the logarithm helps normalize the distribution of values. The IDF is high for terms that occur in only a few documents and low for terms that occur in many documents.
The IDF is not helpful as a measurement on its own, so it is multiplied by the TF value to produce an accurate representation of the term’s importance:

TF–IDF(t, d) = TF(t, d) × IDF(t)
The TF-IDF is utilized to compute the importance of a term in a specific document: a high TF-IDF value indicates that the term is relevant and important to that document. TF-IDF assigns a weight to each term in the tweet; since a tweet has more than one term, this lets us calculate the relevance of each term to the tweet and the document [
21]. After calculating TF-IDF for each term, we obtain a term-document matrix. Each row is a vector representing one tweet in the vector space model; these vectors are used to train the ML classifiers to perform the sentiment analysis task and predict the polarity of each tweet [
24,
37].
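The TF-IDF-matrix-to-classifier pipeline described above can be sketched with scikit-learn; the six labelled toy tweets are illustrative stand-ins for the corpus, not our data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled tweets; 1 = positive, 0 = negative.
texts = ["صباح الخير", "يوم جميل", "احب هذا", "يوم سيء", "اكره هذا", "خبر محزن"]
labels = [1, 1, 1, 0, 0, 0]

# TfidfVectorizer builds the term-document matrix (one row per tweet in the
# vector space model); the classifier is then trained on those rows.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["يوم جميل"]))
```

Swapping `LinearSVC` for any of the other classifiers studied later leaves the vectorization step unchanged, which is what makes the TF-IDF matrix a reusable representation.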
3.3.2. SinaTools Linguistic Features
We utilize SinaTools to extract high-level linguistic features, including POS tag distributions and semantic relatedness scores between key terms, providing a deeper understanding of the grammatical structure and semantic content.
3.3.3. Deep Learning Features (AraBERT)
We use the pre-trained AraBERT model (aubmindlab/bert-base-arabertv2) to generate deep contextual embeddings [
38]. The architecture of the AraBERT encoder is detailed below [
39,
40].
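At the core of each encoder layer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / √d_k) V, applied once per head. The sketch below shows a single head with random toy matrices (not AraBERT's trained weights) purely to illustrate the computation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # rows sum to 1: attention over tokens
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # 4 tokens, toy model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

In the real encoder, several such heads run in parallel on learned projections and their outputs are concatenated before the feed-forward sublayer; the [CLS] token's final representation feeds the classification layer.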
3.4. Training, Testing, Adjusting, and Evaluating ML Classifiers
The third stage of our work is the training and testing of the selected ML classifiers. In this study, five well-known families of ML techniques were used for the classification task: Support Vector Machines (Linear SVC and SVC), the Stochastic Gradient Descent classifier (SGD), tree-based classifiers (Decision Trees (DTs) and Random Forest (RF)), k-Nearest Neighbors (KNN), and Naive Bayes (Multinomial NB and Bernoulli NB), all of which have been applied successfully in many fields. In total, we therefore have 8 different ML classifiers, and this study identifies the best among them for SA of Arabic tweets containing emojis. After selecting the ML classifiers, we train them; the goal is to make predictions as accurately as possible, and training time depends on the classifier algorithm and the size of the data. We train the eight classifiers on 80% of the data and use the remaining 20% for testing to evaluate our model, as several others did, i.e., [
29,
30]. The test set is input to the model without the class labels so that the model predicts a label for each tweet. Based on the results, we evaluate each classifier; if the loss on the predicted values is high, we adjust the inner parameters of the classification algorithm to obtain better results. For example, the KNN algorithm computes the distance between nearest neighbors using the Manhattan or Euclidean distance metric, and we can tune the number of neighbors and the distance metric according to which settings score higher. Finally, we perform the evaluation step. In the ML field, evaluation metrics assess how well classifiers predict the class labels of the test dataset; we discuss these metrics in detail in the next section.
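The train/test procedure above can be sketched as follows. The toy data, the 80/20 split, and the three classifiers shown (out of the eight studied) are illustrative; note how the KNN neighbor count and distance metric are exposed as tunable parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy data standing in for the TF-IDF tweet vectors; 1 = positive, 0 = negative.
texts = ["يوم جميل", "احب هذا", "خبر سعيد", "يوم سيء", "اكره هذا", "خبر محزن"] * 10
labels = [1, 1, 1, 0, 0, 0] * 10

X = TfidfVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42)  # 80/20 split

results = {}
for name, clf in [
    ("LinearSVC", LinearSVC()),
    ("MultinomialNB", MultinomialNB()),
    # KNN: n_neighbors and metric (e.g. "euclidean" or "manhattan") are tunable.
    ("KNN", KNeighborsClassifier(n_neighbors=3, metric="euclidean")),
]:
    clf.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te))
print(results)
```

On real data the accuracies diverge (which is the point of the comparison); on this tiny duplicated corpus all three classifiers score perfectly.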
3.5. Classification
In this stage, the classifiers use the held-out test set (with its sentiment labels hidden) to assess how the model will perform in the real world. The models classify tweets as positive or negative, compare the predictions against the true labels, and compute the accuracy of correctly predicted class labels.
5. Experiments and Results
5.1. Dataset
Our investigation is grounded in a substantial dataset comprising 58,751 Arabic tweets, which we compiled by merging the collections from two prior studies [
16,
17]. This corpus is particularly well-suited for our research goals. It is nearly balanced, containing 50.8% positive and 49.2% negative examples, which helps prevent classification bias. Crucially, the dataset is rich with the kind of informal, dialectal Arabic commonly found on social media, and it features a high prevalence of emojis—a key focus of our analysis.
To ensure our models are robust and our findings are reliable, we implemented a rigorous data partitioning strategy. We divided the entire dataset into three distinct subsets following a 70/10/20 ratio. The training set, comprising 70% of the data (41,126 tweets), was used exclusively for model training. A separate validation set of 10% (5875 tweets) was reserved for hyperparameter tuning and early stopping during the fine-tuning of our deep learning models. Finally, the remaining 20% (11,750 tweets) formed our held-out test set, used only for final evaluation. This ensures that the test data remains completely unseen until the very end, providing an unbiased measure of model performance.
Beyond this fixed split, we also employed a stratified 5-fold cross-validation protocol on the training and validation portions for the traditional machine learning classifiers. This approach trains and evaluates each model five times on different data subsets, yielding a more stable and reliable estimate of performance.
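The 70/10/20 partitioning and the stratified 5-fold protocol can be sketched as follows; the toy labels are illustrative, and the index arithmetic (hold out 20% first, then 1/8 of the remainder) reproduces the stated ratios.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy balanced labels standing in for the 58,751 tweets.
y = np.array([0, 1] * 50)
idx = np.arange(len(y))

# 70/10/20 split: hold out 20% as the test set first, then carve 1/8 of the
# remaining 80% (i.e. 10% of the total) out as the validation set.
train_val, test = train_test_split(idx, test_size=0.20, stratify=y, random_state=7)
train, val = train_test_split(train_val, test_size=0.125,
                              stratify=y[train_val], random_state=7)
print(len(train), len(val), len(test))  # 70 10 20

# Stratified 5-fold CV on the training + validation portion, as used for the
# traditional ML classifiers.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
folds = list(skf.split(train_val, y[train_val]))
```

Stratifying every split keeps the near 50/50 class balance of the corpus intact in each subset and fold.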
5.2. Experimental Setup
All experiments were conducted in a controlled environment. For the traditional ML models, we used the scikit-learn library, and for the deep-learning models, PyTorch 2.7.1 with the Hugging Face Transformers library. The cross-validation and hyper-parameter search were implemented using scikit-learn’s GridSearchCV and StratifiedKFold utilities. For the fine-tuned AraBERT model, we explored a grid of learning rates {1 × 10−5, 2 × 10−5, 3 × 10−5}, batch sizes {16, 32}, and numbers of epochs {2, 3, 4}. Early stopping was applied based on the validation loss with a patience of two epochs. We selected the hyper-parameters yielding the highest mean validation accuracy across the five folds (learning rate = 2 × 10−5, batch size = 32, epochs = 3). The K-nearest neighbors (KNN) classifier’s training time is negligible because the algorithm simply stores the training vectors; its testing time is longer because it must compute distances to all training samples. We measured training and testing times consistently across classifiers by averaging over all folds.
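The GridSearchCV-with-StratifiedKFold setup for the traditional classifiers can be sketched as follows; the toy tweets and the parameter grid are illustrative, not our actual search space.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy labelled tweets; 1 = positive, 0 = negative.
texts = ["يوم جميل", "احب هذا", "خبر سعيد", "يوم سيء", "اكره هذا", "خبر محزن"] * 10
labels = [1, 1, 1, 0, 0, 0] * 10

# Pipeline + grid: every candidate is scored by stratified 5-fold CV, and the
# best mean validation score wins, mirroring the protocol described above.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern of enumerating a grid and keeping the configuration with the highest mean validation accuracy is what we applied, by hand, to the AraBERT learning-rate/batch-size/epoch grid.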
Table 1 reports the results of our grid search for fine-tuning the AraBERT model, presenting the validation accuracy achieved for each combination of learning rate (1 × 10−5, 2 × 10−5, and 3 × 10−5), batch size (16 and 32), and number of training epochs (2, 3, and 4). The configuration yielding the highest validation accuracy was a learning rate of 2 × 10−5, a batch size of 32, and 3 training epochs, which achieved a validation accuracy of 93.2%.
5.2.1. Training Diagnostics
To assess convergence and potential over-fitting, we recorded the training and validation loss and accuracy during the fine-tuning of AraBERT.
Figure 3 illustrates representative curves across six epochs. Both training and validation accuracies steadily improve and converge around 93%, while the losses decrease smoothly, indicating stable optimization and minimal over-fitting. Similar plots were generated for the other hyper-parameter settings during grid search.
5.2.2. Experiment 1: TF-IDF with N-Gram for Tweet Text
The results are shown in
Table 2, where we show the evaluation of the eight classifiers used in our study with TF-IDF for different N-grams (N-gram = 1 to 5). We limited our reporting to N-gram = 5, not only for succinctness, but also because the remaining results have no significance for the findings. For each of the eight classifiers, the table shows the overall accuracy and, for each of the two classes (positive, negative) and their average, the precision, recall, and F-measure.
For TF-IDF with N-gram = 1 on tweet text, the number of extracted features is 6151. The sample features for this point are (‘رقى’ ، ‘رقب’ ، ‘جدول’ ، ‘خراج’ ، ‘طيار’ ، ‘مختص’ ، ‘عرش’ ، ‘حب’ ، ‘غض’، ‘قرع’، ‘شطر’، ‘عفو’ ، ‘خد’ ،’ جزع’، ‘نقط’، ‘نقض’ ، ‘قنط’، ‘مائل’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all. The Random Forest Classifier achieves a recall of 0.96 on tweets that were positive and classified correctly.
For TF-IDF with N-gram = 2 on tweet text, the number of extracted features is 17,491. The sample features for this point are (‘قطر توقع’ ، ‘سمات صباح’ ، ‘ صباح اتحاد’ ، ‘صامت صمت’ ، ‘رائع نظر’ ، ‘ سحب ‘، ‘ حجر’ ، ‘رسمي خصوص’ ، ‘ناس جميع’ ، ‘طرد اعب’، ‘كن بشر’، ‘عمر شف’، ‘مابداخل عشق’ ، ‘له مخرج’ ،’احد الأخوان). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all. Moreover, the Random Forest Classifier has a precision of 0.85 on tweets classified as negative that were correct, and a recall of 0.98 on tweets that were positive and classified correctly. However, its overall results are lower than those of SVC.
Table 3 lists the classifiers’ evaluation details for TF-IDF with N-gram = 3 for tweet text, where the number of extracted features is 26,112. The sample features for this point are (‘
نت كريم ‘ ، رحله ‘ ، ‘ حد عرف رجف’ ، ‘خطأ حق حد ‘ ،’قلبى عذاب سير ‘ ، ‘ حب عفو ‘ ، ‘ صباح ورد فل’ ، ‘ نبي ‘ ، ‘ عالم نور ‘ ، ‘جار قبل دار ‘ ، ‘معار دائر خائن ‘ ، ‘مهد هبوط راي’ ، ‘ وصف ‘ ، ‘ بحر رجع عطش’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier takes the longest time for training and testing, while Multinomial NB takes the shortest, and SVC has the best precision and recall of all. Moreover, the Random Forest Classifier achieves a recall of 0.97 on tweets that were positive and classified correctly, but only a recall of 0.14 on the negative class.
For TF-IDF with N-gram = 4 on tweet text, the number of extracted features is 33,739. The sample features for this point are (‘ نجح مر جعل كل’ ، ‘ حب سافر’ ، ‘سال شفاء ‘ ، ‘دجاج ‘ ،’سودان يمن نسحاب حروب ‘ ، ‘ عدد ماذكر ذاكر ‘ ، ‘كرر نفس سور صلا ‘ ، ‘ حروف خجل مدح اعتلي’ ، ‘فريق أول عبدالفتاح جميل ‘ ، ‘ديل قناص عش ناس ‘ ، ‘عام عاش’ ، ‘رأي دهر مختلف’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all. The Random Forest Classifier has a precision of 0.90 on tweets classified as negative that were correct, and a recall of 0.99 on tweets that were actually positive and classified correctly. However, its overall results are lower than those of SVC.
For TF-IDF with N-gram = 5 on tweet text, the number of extracted features is 40,489. The sample features for this point are (‘حبيب خطيب زوج فضل علاق’ ، ‘بعض تجمل دون عيب ‘ ، ‘دون موعد مسبق مي حبيب ‘ ، ‘نفس حد دخل حياكو’ ،’ صور بديل شفت صباح مسائ’ ، ‘جبال طويق وازي ثوير دي ‘ ، مجلس وزراء ‘ ، ‘ ركع’ ، ‘سأل صباح مبشر هما رحل ‘ ، ‘شروط متابع حساب’ ، ‘سيارة عرس’،’خفي ميسر كابد قدير عجز’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all.
It is worth mentioning that our experiment covered up to N-gram = 9, but we do not report the results as they were not of any additional significance, and they showed no improvement or difference in the obtained results so far.
5.2.3. Experiment 2: TF-IDF with N-Gram for Tweet Text and Emojis
Table 3 shows the results of our second experiment. Experiment 2 replicates experiment 1 except that here we include emojis; their inclusion is what we consider the main novelty of this study. For TF-IDF with N-gram = 1 on tweet text and emojis, the number of extracted features is 6209. The sample features for this point are (‘
حرص ‘ ، ‘ رحب ‘ ، ‘رمال ‘ ، ‘ قلب ‘ ،’واجب ‘ ، ‘ 💔’ ، ‘داع ‘ ، ‘ 😂 ‘ ، ‘مربوط ‘ ، ‘طار ‘ ، ‘قصف ‘ ، ‘أدهى ‘ ، ‘حصد ‘ ، ‘ جابر’ ، ‘سخر ‘ ، ‘عويس ‘ ، ‘صخر ‘ ، ‘ بكاء ‘ ، ‘عصفور ‘ ، ‘ بغى’ ، ‘ غصب ‘ ، ‘توصل ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing. The results show that the SVC classifier has the best precision and recall of all.
For TF-IDF with N-gram = 2 on tweet text and emojis, the number of extracted features is 17,607. The sample features for this point are (‘ وجد ‘ ، ‘فرح 💚 ‘ ، ‘حمد شهد ‘ ، ‘ قلب ‘ ،’ لاعب وحيد’ ، ‘ نظر سمع’ ، ‘ خمس ثوان ‘ ، ‘ضرب رصاص ‘ ، ‘طيور لقالق ‘ ، ‘ مستشار’ ، هذلول ذهب’ ‘ ، ‘شر عباد ‘ ، ‘قوام رشيق ‘ ، ‘ جد ‘، ‘ شجره ‘ ، ‘ فوز زعماء’ ، ‘ قائد ذهب’ ، ‘صل حافظ ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all.
For TF-IDF with N-gram = 3 on tweet text and emojis, the number of extracted features is 26,293. The sample features for this point are (‘ صاحب رحل’ ، ‘دول حكم عقلاء ‘ ، ‘صباح خير حبايب’ ، ‘حصل اعب عطي ‘ ،’ لاتحاد النصر صباح’ ، ‘رن فرح مستقبل ‘ ، ‘ سحب رتويت حظ’ ، داخل صوت’ ، ‘حاد محدش تدخل ‘ ، ‘روح’ ، ‘جن قلوب’ ، ‘مساح’ ، ‘تجمع دا نقاب ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all.
For TF-IDF with N-gram = 4 on tweet text and emojis, the number of extracted features is 33,986. The sample features for this point are (‘ هيتدخل هيتعور ضح 😆’ ، ‘ تاريخ كبير خالد ذاكر’ ، ‘استعاد حساب ويتر موقوف ‘ ، ‘عربيه جواب’، ‘ سرير ابيض مستشفى’ ،’عطر’ ، ‘حلم خير’ ، ‘حي حتاج شخص عطي ‘ ، ‘مسدس🔫 قيل مسدس ‘ ، ‘فن ظل شاهد تاريخ ‘ ، ‘ نتظر صبح بادل تحيه’). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all.
For TF-IDF with N-gram = 5 on tweet text and emojis, the number of extracted features is 40,802. The sample features for this point are (‘ صباح خير ‘ ، ‘ ثق عظم تفاؤل دوم طمأنين’ ، ‘ميريام 🌹’ ، ‘ غدا سعود حرب ‘ ، ‘كريم ‘ ، ‘ نفذ ارشاد’ ، ‘مغرد مميز حضور نيق حروف ‘ ، ‘ فردوس والد جميع موتى مسلم ‘ ، ‘مدد الرئاسه تعديل تعديل قتصر تعديل ‘ ، ‘ مدافع شاب عبدالباسط’ ، ‘شمس شرق شمس عالم نور ‘). The results show that the SVC classifier achieves the best accuracy, followed by the Linear SVC classifier. The SVC classifier also takes the longest time for training and testing, and it has the best precision and recall of all.
Again, although our experiments covered N-grams up to 9, we limit our reporting to N-gram = 5 for the same reasons mentioned at the end of the Experiment 1 results.
Finally, we find it beneficial to report in
Table 4 the time consumed in training and testing by the 8 classifiers for TF-IDF with N-gram = 5 on tweet text. We report timings only for N-gram = 5, as the other N-gram settings follow the same pattern, with very close values for each classifier.
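To make the evaluation and timing procedure concrete, the following minimal sketch reproduces the TF-IDF + classifier loop with wall-clock timing. The tiny English corpus and its labels are hypothetical stand-ins for the preprocessed Arabic tweets, so the absolute times and predictions are illustrative only.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

# Hypothetical toy corpus standing in for the preprocessed tweets.
train_texts = ["good day happy", "bad sad loss", "great win joy", "terrible awful pain"]
train_labels = [1, 0, 1, 0]
test_texts = ["happy win", "awful loss"]

# Word-level unigrams and bigrams, mirroring the N-gram configurations above.
vec = TfidfVectorizer(ngram_range=(1, 2))
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

# Time training and testing separately for each classifier, as in Table 4.
for clf in (SVC(), LinearSVC()):
    t0 = time.perf_counter()
    clf.fit(X_train, train_labels)
    train_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    preds = clf.predict(X_test)
    test_time = time.perf_counter() - t0
    print(type(clf).__name__, preds, round(train_time, 4), round(test_time, 4))
```

The same loop extends directly to the remaining classifiers by appending them to the tuple.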
5.2.4. Experiment 3: ArSenL, SAMAR, and VADER for Additional Comparison
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a lexicon- and rule-based model designed for general sentiment analysis of social media text [
42]. It is implemented as a Python library (originally under Python 2.7.1) that can be downloaded and used for sentiment analysis. VADER considers the text as well as emoticons and emojis. It does not require training data; instead, it uses a lexicon that assigns weights to words and handles emojis too. It supports non-English languages by first translating them into English. For example, for the sentence “
الطقس لطيف والأكل جيد” the English translation is “the weather is nice and food good” [
43]. After VADER translates the sentence into English, it looks up the lexicon and finds matches for two words, nice and good, which have weights of 1.8 and 1.9, respectively. From these weights, VADER produces four metrics. The first three (positive, neutral, and negative) give the proportion of the text that falls into each category. The fourth metric, the compound score, is a normalized sum of the lexicon weights that lies between −1 and 1. In this example, the compound score is 0.69, which is strongly positive on VADER's scale. We use VADER to compare its results with those of our proposed system.
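The compound score in this example can be reproduced from the raw lexicon weights. The sketch below assumes VADER's published normalization constant (α = 15); the weights 1.8 and 1.9 are the ones matched in the translated sentence above.

```python
import math

# Lexicon weights matched in the translated example: "nice" (1.8) and "good" (1.9).
weights = [1.8, 1.9]

def compound(scores, alpha=15):
    """VADER-style normalization: maps the raw weight sum into (-1, 1)."""
    s = sum(scores)
    return s / math.sqrt(s * s + alpha)

print(round(compound(weights), 2))  # → 0.69
```

The normalization saturates for long texts: as the raw sum grows, the score approaches ±1 without ever reaching it.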
In addition, we implemented two established lexicon-based systems—SAMAR and ArSenL—and applied them to our cleaned dataset [
44]. SAMAR is a subjectivity and sentiment analysis system for Arabic social media that leverages tokenization, lemmatization, and part-of-speech tagging combined with an Arabic polarity lexicon. ArSenL is a large-scale Arabic sentiment lexicon manually annotated and widely used for lexicon-based sentiment analysis. We applied both lexicons directly to our tweets by summing the polarity scores of words and assigning the sign of the total score as the predicted sentiment. On our dataset, SAMAR achieved 74.2% accuracy, and ArSenL achieved 76.8% accuracy, as shown in
Table 5. Although these results are stronger than random and provide a low-cost baseline, they fall well below the performance of our supervised models and illustrate the limitations of purely lexicon-based approaches for dialectal Arabic and noisy social media data. We present a consolidated view of all lexicon-based approaches in
Table 5, which reports the performance of VADER, SAMAR, and ArSenL on our dataset, enabling direct comparison between these methods.
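The sign-of-sum scoring rule we applied with SAMAR and ArSenL can be sketched as follows. The lexicon entries here are hypothetical placeholders, not actual SAMAR or ArSenL scores.

```python
# Hypothetical polarity lexicon; real entries come from SAMAR or ArSenL.
lexicon = {"فرح": 0.8, "خير": 0.6, "شر": -0.7, "حزن": -0.9}

def lexicon_sentiment(tokens, lex):
    """Sum per-word polarities; the sign of the total is the prediction.
    Words absent from the lexicon contribute zero."""
    total = sum(lex.get(t, 0.0) for t in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment(["صباح", "خير", "فرح"], lexicon))  # → positive
```

Out-of-vocabulary dialectal words silently contribute nothing, which is precisely the weakness of this family of methods on noisy social media text.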
5.2.5. LLM Embeddings
Our experiments provide a clear picture of the progression of the performance from traditional to deep learning models. The key results are summarized in
Figure 4. The results unequivocally demonstrate the superiority of the fine-tuned AraBERT model, which achieved an accuracy of 93.45%. This represents a substantial improvement of 9.85 percentage points over the best traditional ML model (SVC with N-gram = 3), which scored 83.60%. The hybrid model also performed impressively, reaching 90.12% accuracy, indicating that combining traditional and deep learning features provides a significant boost over using them in isolation. The confusion matrices in
Figure 5 further illustrate the enhanced predictive power of the fine-tuned model, which significantly reduced misclassifications for both positive and negative classes.
6. Discussion and Analysis
Our results provide several key insights into the effectiveness of different feature engineering and modeling strategies for Arabic sentiment analysis.
For the traditional ML models, the choice of N-gram size had a noticeable impact on performance. As shown in
Figure 6, performance generally improved as the N-gram size increased from 1 to 3, capturing more complex morphological patterns. Beyond N = 3, however, performance began to plateau and even degrade slightly, likely due to increased feature sparsity and overfitting. This confirms that N-grams are a powerful feature whose optimal size must be carefully tuned. We conducted experiments with sliding window sizes from n = 1 to n = 9 (we report only up to n = 5 for succinctness, as the unreported results show no significant differences). Most classifiers achieve their best accuracy when the sliding window size is 2 or 3. As the window grows, more features occur in only a handful of tweets, and this sparsity degrades classifier performance. Each domain of knowledge has its own optimal N-gram size [
45]. Our results show that the most useful feature length in this domain is 2 or 3.
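The N-gram sweep described above can be sketched as follows. The corpus here is a small hypothetical English stand-in, so the absolute accuracies are meaningless, but the growth in feature count with n and the cross-validated evaluation mirror the mechanics of our experiments.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Hypothetical mini-corpus (5 tokens per text) standing in for the tweets.
pos = ["good happy day nice fun", "great win joy love best",
       "nice joy good fun win", "happy love great day best"]
neg = ["bad sad loss awful pain", "terrible worst hate angry cry",
       "sad bad awful cry pain", "worst hate terrible loss angry"]
texts = (pos + neg) * 3
labels = ([1] * 4 + [0] * 4) * 3

# Sweep the sliding window size n, as in our N-gram experiments.
for n in range(1, 6):
    vec = TfidfVectorizer(ngram_range=(n, n))
    X = vec.fit_transform(texts)
    acc = cross_val_score(LinearSVC(), X, labels, cv=3).mean()
    print(f"N-gram = {n}: {X.shape[1]} features, CV accuracy = {acc:.2f}")
```

On a real corpus, the feature count climbs steeply with n while most long N-grams occur only once, which is the sparsity effect discussed above.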
The current study shows that TF-IDF with N-grams achieves better results on tweet text with emojis than on tweet text alone. The study in [
46] achieves 66.39% accuracy using TF-IDF in both experiments, on data with and without emojis, whereas we achieve 80.59% accuracy for tweet text and 80.61% for tweet text with emojis. In addition, [46] used Weka for machine learning rather than Python, and Weka cannot handle a large corpus. A comparison between the results of the current study and [
29] is shown in
Table 6 below.
We also applied the dataset of the study [
29] to our models. Multinomial NB achieves the best accuracy on tweet text, 77.69%, when n = 4, while Linear SVM achieves the best accuracy on tweet text with emojis, 78.42%, when n = 4. The study [
29] reports its best results with the MNB classifier: 75.7% precision and 75.3% recall. With the same dataset, our models achieve 78% for both precision and recall.
Table 7 shows a comparison of our proposed models using the second data set from the paper [
29].
6.1. The Power of Contextual Embeddings
The significant performance jump from the traditional ML baseline (83.60%) to the LLM feature extraction approach (87.50%) underscores the immense value of contextual embeddings. Unlike TF-IDF, which treats words as independent units, AraBERT is able to understand the meaning of a word based on its surrounding context. This is crucial for handling ambiguity and sarcasm, as demonstrated in our case studies.
Consider the tweet “This day is very bad 😊”. A traditional model that leans on the positive emoji (the same surface cue used to derive the distant labels) would likely classify this tweet as positive. The fine-tuned AraBERT model, having learned the patterns of sarcasm from the training data, correctly recognizes the conflict between the negative text and the positive emoji and classifies the tweet as negative (or potentially sarcastic, if a third class were available). This ability to resolve nuanced and conflicting signals is a key advantage of deep learning models.
6.2. Feature Contribution and Computational Cost
The hybrid model’s success (90.12% accuracy) confirms that traditional and deep learning features are complementary. As shown in
Figure 7, AraBERT embeddings provide the strongest signal, but TF-IDF, N-grams, and emoji features still contribute valuable information. However, this performance comes at a computational cost.
Figure 8 shows that the fine-tuned AraBERT model requires significantly more training time than the other approaches. This trade-off between performance and computational cost is a critical consideration for real-world applications.
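At the feature level, the hybrid fusion amounts to concatenating the blocks before classification. The sketch below uses randomly generated stand-ins: the 768-dimensional width matches AraBERT's hidden size and the 20-dimensional emoji vector follows the setup described in this paper, while the TF-IDF width of 100, the labels, and the values themselves are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 40  # number of (hypothetical) tweets

# Random stand-ins for the three feature blocks; real values come from the
# TF-IDF vectorizer, the emoji encoder, and AraBERT [CLS] embeddings.
tfidf_feats = rng.random((n, 100))   # lexical features (width arbitrary)
emoji_feats = rng.random((n, 20))    # 20-dimensional emoji vector
bert_feats = rng.random((n, 768))    # contextual embeddings (AraBERT hidden size)

# Hybrid fusion: horizontal concatenation of all blocks, then a linear classifier.
X = np.hstack([tfidf_feats, emoji_feats, bert_feats])
y = rng.integers(0, 2, n)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # (40, 888)
```

Because the blocks live on different scales in practice, standardizing each block before concatenation is a common refinement.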
6.3. Comparative Baselines and Cross-Validation Results
Table 8 consolidates the mean accuracies (± standard deviation) across the 5-fold cross-validation for our four modeling strategies and compares them against two recent Arabic sentiment baselines. The traditional SVM baseline using TF-IDF and N-grams achieves 80.6 ± 0.5% accuracy. Introducing the 20-dimensional emoji vector improves the baseline to 82.7 ± 0.4%, supporting H1 that emojis provide complementary information. The hybrid fusion model (TF-IDF + emojis + SinaTools) yields 85.9 ± 0.3% and is more efficient than the deep models. The fine-tuned AraBERT achieves the highest accuracy, 93.5 ± 0.2%, across folds, confirming H3. For context, we report results from SAMAR and ArSenL, two lexicon-based systems widely used in Arabic SA; on our dataset, these methods achieved roughly 74–77% accuracy (across other corpora, SAMAR results range from 52.9% to 84.7%, while recent ArSenL-based models report 78.1% accuracy on COVID-19 tweets). We also include MARBERT [
23] and the transformer-ensemble baseline from Mansour et al. (2025) [
9], where the monolingual MARBERT achieved an average accuracy of 89.3% and the ensemble model reached 90.4%. Our fine-tuned AraBERT outperforms these reported baselines, highlighting its competitiveness. It should be noted that these external baseline results were taken from the original publications rather than reproduced under identical experimental conditions with our preprocessing pipeline. While this limits the strictness of direct comparison, it provides valuable context for situating our results within the broader literature.
This demonstrates that our hybrid and transformer approaches achieve competitive performance relative to recent state-of-the-art systems while using transparent cross-validation protocols [
47].
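The mean ± standard deviation entries in Table 8 follow the standard cross-validation protocol, which can be sketched with scikit-learn. The data here is synthetic and stands in for our tweet feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the tweet feature matrix (hypothetical data).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# 5-fold cross-validation; report mean ± standard deviation across folds.
scores = cross_val_score(LinearSVC(), X, y, cv=5)
print(f"accuracy = {scores.mean():.3f} ± {scores.std():.3f}")
```

For classification targets, `cross_val_score` uses stratified folds by default, so each fold preserves the class proportions.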
Finally,
Table 9 brings together the performance of all key modeling approaches evaluated throughout this study, ranging from traditional lexicon-based methods to state-of-the-art fine-tuned transformers. The results demonstrate a clear progression in accuracy as we move from simpler approaches toward more sophisticated models. Lexicon-based methods such as VADER, SAMAR, and ArSenL provide reasonable baselines but are limited by their reliance on predefined sentiment dictionaries. Traditional machine learning classifiers trained on TF-IDF and emoji features offer substantial improvements, with the best configuration (SVC, N = 3, Text + Emoji) reaching 83.60% accuracy. Leveraging AraBERT embeddings as features for an SVM classifier further boosts performance to 87.50%, while our hybrid fusion model, which integrates traditional features with deep contextual embeddings, achieves 90.12%. The highest accuracy of 93.45% is obtained by fine-tuning the AraBERT model end-to-end on our sentiment classification task, representing a 10.89 percentage point improvement over the best traditional baseline. These results collectively validate our hypothesis that combining emoji-aware features with deep transformer representations yields superior sentiment classification performance for Arabic social media.
6.4. Significance Testing
To ensure that the observed performance differences are not due to random variation, we conducted McNemar’s tests on the paired predictions of our models. McNemar’s test is appropriate for comparing two classifiers on the same test set: it builds a 2 × 2 contingency table of the discordant pairs, i.e., the cases where exactly one of the two models is correct. For example, comparing the hybrid fusion model against the SVM baseline on the 20% test set yielded a test statistic χ2 = 38.7 (p < 0.0001), allowing us to reject the null hypothesis that both models have the same error rate. Comparing fine-tuned AraBERT to the hybrid model resulted in χ2 = 24.3 (p < 0.0001). Similar significance was observed in all pairwise comparisons, thereby supporting H4. We also computed 95% bootstrap confidence intervals for accuracy; these were tightly bounded (±0.3%), reinforcing the robustness of our results.
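The test statistic only needs the two discordant counts. A minimal sketch with the continuity correction (the counts below are illustrative, not our actual test-set counts; the p-value uses the chi-square survival function with one degree of freedom):

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square with continuity correction.
    b: pairs where model A is correct and model B is wrong; c: the reverse."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Illustrative discordant counts (hypothetical).
chi2, p = mcnemar(10, 40)
print(round(chi2, 2), p)  # chi2 → 16.82
```

Concordant pairs (both models right or both wrong) never enter the statistic, which is why the test is well suited to paired predictions on a shared test set.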
6.5. Error Analysis and Qualitative Insights
Beyond aggregate metrics, we manually inspected a sample of misclassified tweets to understand the remaining challenges. Three themes emerged: sarcasm, dialectal variation, and emoji–text mismatch. Sarcastic tweets often convey a negative sentiment using positive words, such as “What a wonderful service 🙄”, which was misclassified by the traditional SVM but correctly handled by AraBERT. Dialectal variation posed difficulties for words unseen in Modern Standard Arabic; for instance, the Levantine expression “شو هالحكي” (“What is this talk?”) was misinterpreted by baseline models. Finally, an emoji–text mismatch occurred when the emoji contradicted the textual sentiment; for example, a tweet complaining about bad service but ending with a heart emoji confused the lexical models. These observations suggest that future work should explore sarcasm detection and dialect-aware modeling.
6.6. Discussion on Emoji-Driven Labeling Bias
Our dataset uses distant supervision: tweets were initially labeled as positive or negative based on the presence of positive or negative emojis. In subsequent experiments, we remove the emojis from the training text and then use them as features, raising the concern of label–feature circularity. To mitigate this bias, we randomly sampled 1000 tweets and manually verified their sentiment; the agreement between emoji-based labels and human judgment was 94%. Moreover, removing the emoji vector from the hybrid model decreased accuracy only modestly (85.9% → 84.3%), indicating that the models leverage diverse signals beyond the labeling cues. Nevertheless, we acknowledge that distant supervision can inflate performance and recommend that future studies incorporate manually annotated datasets to validate findings. The 1000-tweet sample was stratified across sentiment classes to ensure proportional representation of both positive and negative labels. Two independent annotators labeled the sample, achieving a Cohen’s kappa of 0.89, indicating strong inter-annotator agreement. Disagreements were resolved through discussion to establish the final ground truth labels.
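The inter-annotator agreement statistic reported above can be computed directly from the two label sequences. The annotator labels below are hypothetical (1 = positive, 0 = negative), not our actual 1000-tweet sample.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical annotator labels over the same eight tweets.
ann1 = [1, 1, 0, 0, 1, 0, 1, 1]
ann2 = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(ann1, ann2), 2))  # → 0.47
```

Values above roughly 0.8, such as the 0.89 reported for our verification sample, are conventionally read as strong agreement.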
7. Conclusions
This paper has presented a framework for Arabic sentiment analysis that successfully integrates the strengths of traditional feature engineering and deep learning models. By systematically evaluating four distinct modeling strategies, we have demonstrated a clear performance progression, culminating in a fine-tuned AraBERT model that achieves a state-of-the-art accuracy of 93.45% on a large-scale Arabic Twitter dataset. Our work makes several significant contributions, including a hybrid architecture, a deep mathematical formulation of the AI encoder, and a robust, open-source implementation that serves as a valuable benchmark for the research community. The results confirm that while traditional methods provide a strong baseline, the contextual understanding of Transformer-based models is essential for tackling the complexities of Arabic social media text. The success of our hybrid model also highlights the complementary nature of different feature types. Future work will focus on extending this framework to handle multi-class sentiment (including neutral and sarcastic classes), exploring more advanced fusion techniques, and adapting the model for real-time, low-latency applications through techniques like model distillation and quantization.