1. Introduction
In recent years, the proliferation of social media platforms has revolutionized the way individuals communicate, share opinions, and engage in public discourse [1]. Social media platforms such as Twitter (now X) and Reddit produce enormous amounts of user-generated textual content reflecting individuals' sentiments toward social events, products, political developments, and public policies [1,2,3]. This content is characterized by brevity, informality, and rapid evolution, often using slang, abbreviations, emojis, rhetorical expressions, and context-dependent language [3]. These properties set social media text apart from traditional written sources and add complexity to automated text analysis. The process of extracting meaningful information from these data is known as sentiment analysis (SA), a core task in natural language processing (NLP) that classifies user expressions into sentiment categories such as positive, negative, and neutral [4,5,6]. SA is an essential part of modern computational social science, especially in applications such as consumer behavior modeling, political SA, mental health assessment, and personalized recommender systems [6]. In essence, SA identifies the polarity or emotional orientation of a text. Within social media analytics, it has been widely adopted to study opinion trends, public engagement, and behavioral dynamics at scale.
The interpretation of sentiments on social media remains inherently difficult due to linguistic phenomena such as negation, contrastive constructions, implicit evaluations, and figurative language, which are common in informal communication [7]. The sentiment of a sentence tends to rely on contextual relationships rather than individual lexical clues. Consequently, to classify sentiment accurately, models need to capture subtle semantic interactions and discourse-level meaning [7]. Interpretability methods have emerged to analyze and explain neural model behavior, offering insights into token-level contributions to predictions. Additionally, symbolic sentiment resources, for example polarity lexicons, have been used to incorporate structured linguistic knowledge into the sentiment analysis pipeline [7]. These developments collectively form the methodological basis of modern SA systems for social media content.
Hallucinations are non-existent text additions or inferences [7,8]. In the context of SA, a hallucination is a systematic polarity error of a classifier. For instance, the sentence "I expected better service" may be misinterpreted as positive due to over-emphasis on the token "better" while the underlying criticism is neglected. Likewise, sarcastic remarks using positive lexemes, such as "Great, another delay!", may be misinterpreted. Inaccurate results degrade the performance of sentiment-sensitive systems and increase risk in areas where interpretability and accuracy are critical, such as healthcare, finance, and crisis communication [8]. Misclassifications may trigger incorrect risk assessments in medical triage chatbots and financial advisory tools [9]. Consequently, reducing hallucinations is a primary objective of SA research.
Lexicon-based rule systems and traditional machine learning techniques such as Naive Bayes, Support Vector Machines, and Logistic Regression have been used to build SA models [10]. These methods rely on surface-level lexical statistics and hand-crafted features [10]. They struggle to model long-range contextual dependencies and cannot process linguistic constructs such as negation, sarcasm, and context-dependent semantic drift, which causes them to misinterpret subtle polarity inversions and overlook implicit emotional cues [11]. Advanced neural architectures addressed these challenges by learning distributed representations from raw data. Nevertheless, such models tend to focus on sentiment-bearing phrases without accessing broader context, leading to hallucinated polarity assignments.
Recently, transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), Robustly Optimized BERT Pre-training Approach (RoBERTa), Generalized Autoregressive Pre-training for Language Understanding (XLNet), Generative Pre-trained Transformer (GPT), and Open Pre-trained Transformer (OPT) have performed well across a wide range of NLP tasks [12]. However, these models mostly learn statistical token associations and lack mechanisms to validate whether predicted sentiment labels are semantically justified by the input text [12,13,14]. This limitation is especially problematic in SA, where interpretation of polarity is inherently subjective and context-dependent. While hallucination has been extensively examined in text generation and question answering, its effect on sentiment classification, particularly in informal and dynamic domains, remains largely unexplored [14].
Current SA approaches lack an effective mechanism for addressing the challenges of analyzing social media content, which often involves sarcasm, negation, contrastive discourse, implicit evaluative information, and rapidly changing domain-specific expressions [14]. In such settings, transformer models can attribute sentiment based on isolated lexical cues rather than grounded semantic reasoning, leading to systematic polarity errors [15]. Post hoc interpretability methods such as Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) provide explanatory insights but are purely diagnostic tools and cannot control the learning or inference process [16]. Similarly, hybrid models combining a lexicon and a neural network usually rely on static fusion strategies that cannot adapt to contextual ambiguity.
Existing SA frameworks treat hallucination as an afterthought, addressing it either retrospectively or in an abstract manner, so that semantic grounding and interpretability are not inherent to the model structure [17]. They lack integrated frameworks that embed semantic grounding and interpretability directly into model optimization, especially for social media data. To address this gap, we propose a hallucination-aware, interpretable SA framework that integrates contextual representation learning, interpretability, and linguistic grounding into a single end-to-end transformer-based architecture, explicitly monitoring predictions via semantic grounding, attribution-guided supervision, and adaptive neuro-symbolic fusion. The proposed framework is a coherent workflow in which each component supports the others, so that sentiment prediction is not only accurate but also semantically grounded, interpretable, and resistant to hallucination, thereby promoting reliability and transparency in social media SA.
The Sentiment Integrity Filter (SIF) is the initial step in the workflow and performs semantic grounding by testing model predictions against lexicon polarity cues and expert-verified contextual sentiment cues. In contrast to traditional sentiment classifiers, the SIF acts as a built-in auditing mechanism that prevents the assignment of unsupported polarity in contexts involving negation, sarcasm, or other subtle evaluative language. Extending this grounding layer, the framework integrates a SHAP-directed regularization mechanism that turns interpretability into an active supervisory signal. SHAP-guided learning encourages the model to align its internal attention with linguistically meaningful sentiment cues by penalizing reliance on sentiment-irrelevant tokens, thereby mitigating hallucinations arising from spurious correlations. Lastly, a confidence-based lexicon-deep fusion module integrates symbolic and neural sentiment reasoning during inference, regulating neural confidence according to semantic agreement and thereby strengthening well-supported predictions and attenuating overconfident or poorly supported ones.
A detailed evaluation plan is used to assess the effectiveness of the proposed SA framework using standard classification metrics and hallucination-related evaluation. Model performance is quantitatively evaluated using accuracy, precision, recall, and F1-score to determine the effectiveness of sentiment classification across all classes. Moreover, a hallucination rate measure, based on expert-labeled test cases, is used to measure the rate of semantically unsupported sentiment predictions. Together, these evaluation measures allow the proposed hallucination-aware sentiment analysis system to be rigorously validated on performance and trustworthiness.
Quantitative analysis shows that the hallucination-aware sentiment analysis framework gains a consistent advantage over robust transformer baselines. In the benchmark social media datasets, the model has shown a 5–6% accuracy improvement over the base OPT architecture and an average of 3–5% over the recent transformer and hybrid sentiment models. Additionally, the rate of hallucinations declined by more than 40%, highlighting the importance of incorporating semantic grounding and interpretability into model training and inference. These are statistically significant gains in paired-comparison settings, indicating that the performance gains cannot be attributed to random variation. In addition to the numerical measures, the qualitative analysis shows significant improvements in the analysis of complex social media content. The suggested framework is stronger against negation, sarcasm, and the expression of implicit sentiment, and it correctly interprets contrastive and context-dependent statements, which tend to trigger polarity hallucination with traditional transformers. The model provides credible, semantically accurate sentiment classification in real-time social media settings by aligning predictions with grounded sentiment cues and attribution-consistent token importance.
The rest of this study is organized into five major parts to present the research in a comprehensible, coherent, and academically sound manner. Section 2 reviews related work on transformer-based and hybrid sentiment analysis models. Section 3 gives a detailed description of the datasets used, preprocessing strategies, model architecture, hallucination mitigation components, training procedures, and evaluation protocols. Section 4 reports the performance of the proposed model, including standard sentiment classification metrics and hallucination-oriented measures, and offers comparative analyses against baseline and existing models. Section 5 provides a contextual interpretation of the study's findings, including achievements, contributions to the literature, theoretical implications, practical relevance, limitations, and future research directions. Finally, Section 6 summarizes the study's findings, reinforces the importance of hallucination-aware sentiment modeling, and presents the broader implications of the proposed study for trustworthy natural language understanding.
2. Related Works
Numerous studies have shown that fine-tuning pre-trained language models such as BERT, RoBERTa, and XLNet yields significant performance gains over traditional machine learning methods when applied to short, informal text streams. Studies by Bello et al. [18], Mandal et al. [19], Naseem et al. [20], Hoque et al. [21], and Vinan-Ludena et al. [22] report high classification accuracy across various social media datasets by leveraging deep bidirectional contextual embeddings.
AlBadani et al. [23], Jahin et al. [24], and Kerasiotis et al. [25] show that combining transformers with convolutional, recurrent, graph-based, or attention-driven neural networks can improve feature interactions, structural modeling, and robustness in sentiment classification tasks. Another prominent research direction involves the inclusion of external knowledge sources and sentiment lexicons to enhance polarity consistency and interpretability. Knowledge-enriched models [10,26,27] employ sentic computing frameworks and feature-fusion transformer models that combine symbolic sentiment data with neural representations.
Data augmentation strategies [28,29] have been explored to improve generalization performance. Multilingual and cross-lingual SA models [30,31] introduce code-mixed and enhanced transformer architectures for social media content analysis. In addition, explanation-oriented research uses SHAP and attention visualization to interpret model behavior [32], and task-specific frameworks [33] build SA models for particular topics or application-driven contexts. A structured comparison of these approaches based on methodology, strengths, and design characteristics is presented in Table 1, which summarizes representative studies and their fundamental modeling strategies.
Despite the considerable progress made by transformer-based SA architectures, several critical research gaps remain, especially in hallucination-aware social media content analysis. Existing models rely mainly on statistical token associations learned during fine-tuning and lack explicit mechanisms to validate whether predicted sentiment labels are semantically grounded in the input text. As a consequence, systematic polarity errors are common in the context of negation, sarcasm, implicit evaluative cues, and contrastive discourse, which are pervasive in social media communication. While hybrid architectures and knowledge-enriched models aim to enhance robustness by integrating multiple feature extractors or external sentiment resources, they usually adopt static fusion strategies that cannot dynamically regulate neural confidence in the face of contextual ambiguity. Moreover, interpretability techniques such as SHAP and attention visualization are mostly used in a post hoc setting, offering diagnostic information without affecting the learning or inference process, leaving hallucination unmitigated. Although hallucination has been explored in the context of large language models, its role in sentiment classification has not been formally defined, measured, or addressed in existing models. Furthermore, the majority of existing studies rely on conventional performance measures, failing to assess whether predictions are linguistically justified or textually supported. Collectively, these limitations underscore the lack of integrated SA frameworks incorporating semantic grounding, interpretability-aware supervision, and confidence regulation directly into model optimization, particularly for informal, ambiguous, and rapidly evolving social media discourse.
3. Materials and Methods
Transformer architectures have the potential to capture contextual semantics and adapt to diverse linguistic domains. Nonetheless, these models lack transparency in their decision-making process, posing fundamental challenges for the development of hallucination-aware SA models. In contrast, OPT models [34] offer comprehensive architectural transparency and enable access to parameter-efficient fine-tuning techniques. Consequently, the choice of the OPT model as the base architecture for the proposed hallucination-aware SA framework is driven by a variety of methodological, practical, and scientific considerations. The open-source nature of OPT allows researchers to fine-tune attention flows and loss components internally, and thus integrate hallucination mitigation techniques such as SHAP-guided regularization and the SIF into the training process. OPT's decoder-only design aligns with the autoregressive reasoning required for SA tasks in which context changes over time. Moreover, the model's balanced parameter scales enable experimentation with limited resources while preserving robustness. Within the proposed study, OPT supports domain-aware fine-tuning and hybrid lexicon-deep fusion.
Figure 1 shows the significant role of OPT in the proposed sentiment analysis pipeline, enabling grounding, interpretability, and hallucination-aware modules.
3.1. Dataset Description
In this study, we use two publicly available large-scale SA datasets: Dataset 1 [35] (a collection of 241,000+ comments) and Dataset 2 (200,000+ comments from the X and Reddit platforms). Dataset 1 is a collection of user comments aggregated from various platforms, such as HuggingFace, Twitter entity datasets, and the CrowdFlower airline sentiment data. The comments were labeled on a three-class scale: 0 (negative), 1 (neutral), and 2 (positive). This dataset contains product reviews and general opinions, giving it wide topical coverage. Its classes are roughly balanced, which reduces bias during model optimization. These characteristics favor stable optimization of the modules of the proposed hallucination-aware SA framework, including the SIF and SHAP-guided attribution correction mechanisms. Dataset 2 [36] comprises two corpora with 163,000 tweets and 37,000 Reddit comments. The comments were labeled on a different sentiment scale: −1 (negative), 0 (neutral), and 1 (positive). The data were extracted via application programming interfaces using Tweepy (Twitter) and PRAW (Reddit), thus exhibiting platform-dependent linguistic styles. We reserve Dataset 2 for external validation, enabling rigorous assessment of model performance and hallucination behavior in a diverse linguistic environment. The labels of Dataset 2 were remapped to match the three-class structure of Dataset 1, providing a linguistically and contextually distinct testbed for model evaluation.
3.2. Hallucination-Aware Preprocessing
Preprocessing is a key phase in the SA process and plays an essential role in hallucination mitigation. Unlike conventional NLP pipelines, this study adopts a hallucination-aware viewpoint focusing on preserving sentiment-bearing constructs and making the artifact removal optional depending on its impact on semantic interpretation. The preprocessing pipeline begins with basic text normalization, such as lowercase conversion or whitespace correction, and optional transformations of platform-specific tokens into placeholders for tokenizers’ stability. Nonetheless, features contributing to interpreting sentiment, such as emojis, elongated expressions, intensifiers, repeated punctuation, uppercase emphasis, and hashtags, are intentionally preserved. The removal of these features may obscure polarity cues or mask context information, thereby hindering the detection of hallucination. The adaptation of preprocessing strategies balances normalization with preservation of sentiment-rich constructs, enabling fine-grained hallucination-aware evaluation.
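As an illustration of this preservation-first normalization, the following Python sketch lowercases and cleans text while leaving emojis, elongation, repeated punctuation, and hashtags intact; the function name, the placeholder tokens, and the trick of marking all-caps words before casefolding (so that uppercase emphasis survives lowercase conversion) are our own illustrative choices, not the paper's implementation.

```python
import re

def normalize_preserving_sentiment(text: str) -> str:
    """Normalize social media text while preserving sentiment-bearing cues."""
    text = re.sub(r"https?://\S+", "<url>", text)   # platform token -> placeholder
    text = re.sub(r"@\w+", "<user>", text)          # user mentions -> placeholder
    # mark uppercase emphasis before lowercasing so the cue survives
    text = re.sub(r"\b([A-Z]{2,})\b", r"<caps> \1", text)
    text = text.lower()                             # basic normalization
    text = re.sub(r"\s+", " ", text).strip()        # whitespace correction
    # NOTE: emojis, elongated words ("soooo"), repeated punctuation ("!!!"),
    # and #hashtags are intentionally left untouched -- they carry polarity cues.
    return text
```

The key design point is that destructive steps (stripping emojis, squashing punctuation) are simply absent, rather than toggled off, which keeps the pipeline's behavior auditable.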
Following normalization, a controlled augmentation strategy is used to improve model robustness and reduce overfitting while preserving overall sentiment. Techniques including safe paraphrasing, synonym replacement, and structural rephrasing are introduced under strict semantic constraints. The objective of augmentation is not to artificially expand the dataset, but to expose the model to linguistic variations that reflect real-world conditions. As augmentations may modify sentiment cues or introduce ambiguous wording, the entire set of preprocessed and augmented samples is forwarded to the subsequent phase for expert validation.
3.3. Expert-Assisted Dataset Transformation
Following preprocessing and controlled augmentation, the dataset undergoes an expert-assisted transformation phase intended to ensure semantic correctness and prepare the data for hallucination evaluation. This stage is critical because social media contains many sentiment-bearing phenomena, such as negation, contrastive transitions, sarcasm, discourse-level polarity shifts, and non-standard affective markers, that cannot be interpreted reliably by automated preprocessing methods. The objective of expert intervention is two-fold: (i) to validate the semantic correctness of the preprocessed data, and (ii) to construct a reference set of linguistically grounded data supporting hallucination detection.
To support SIF computation, each sentence undergoes lexicon polarity composition. Using expert-curated polarity dictionaries, a lexicon-derived sentiment weight \(w_{\text{lex}}(t)\) is assigned to each token. Equation (1) gives the computation of the base polarity score:

\[ S_{\text{base}}(x_i) = \sum_{t \in x_i} w_{\text{lex}}(t) \quad (1) \]

where \(S_{\text{base}}(x_i)\) represents the base polarity score and \(w_{\text{lex}}(t)\) returns the sentiment weight of token \(t\) in sample \(x_i\).
To determine the true semantic scope and prevent systematic misrepresentation of polarity, experts manually validate the negation scope of each sample, including negation cues. Equation (2) models negation effects through polarity inversion within the negation window:

\[ N(x_i) = \sum_{t \in \mathcal{G}(x_i)} f_{\text{inv}}\big(w_{\text{lex}}(t)\big) \quad (2) \]

where \(N(x_i)\) represents the contribution of tokens affected by negation, \(f_{\text{inv}}(\cdot)\) denotes the polarity inversion function, and \(\mathcal{G}(x_i)\) is the set of tokens under the true negation scope.
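A minimal Python sketch of Equations (1) and (2) follows. The lexicon weights, the negator list, and the fixed three-token negation window are toy assumptions for illustration; the paper relies on expert-curated dictionaries and expert-validated negation scopes.

```python
# Toy expert-curated polarity dictionary (illustrative weights).
LEXICON = {"great": 1.0, "love": 1.0, "better": 0.5,
           "delay": -1.0, "terrible": -1.0, "bad": -0.8}
NEGATORS = {"not", "never", "no"}
NEGATION_WINDOW = 3  # assumed fixed scope; the paper uses expert validation

def base_polarity(tokens):
    # Equation (1): sum of lexicon sentiment weights over the sample
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def negation_adjustment(tokens):
    # Equation (2): polarity inversion inside the negation scope; the
    # inverted weight (-w) minus the original weight (w) gives a -2w
    # correction, so base + adjustment flips the scoped tokens
    scope = set()
    for i, tok in enumerate(tokens):
        if tok in NEGATORS:
            scope.update(range(i + 1, min(i + 1 + NEGATION_WINDOW, len(tokens))))
    return sum(-2.0 * LEXICON.get(tokens[j], 0.0) for j in scope)

tokens = "the food was not bad".split()
score = base_polarity(tokens) + negation_adjustment(tokens)  # "not bad" reads positive
```

The adjustment is additive so that \(S_{\text{base}} + N\) yields the negation-corrected score without recomputing the base sum.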
Equation (3) gives the construction of the lexicon-polarity vector \(L(x_i)\), facilitating hallucination-aware evaluation:

\[ L(x_i) = \big[\, l_{\text{neg}}(x_i),\; l_{\text{neu}}(x_i),\; l_{\text{pos}}(x_i) \,\big] \quad (3) \]

where \(l_{\text{neg}}\), \(l_{\text{neu}}\), and \(l_{\text{pos}}\) are the class-specific polarity weights derived from the base polarity \(S_{\text{base}}(x_i)\) and the negation adjustment \(N(x_i)\).
The expert-guided lexicon-polarity vector \(L(x_i)\) is used as a symbolically grounded reference in model evaluation. The preprocessed and augmented instances are used to train the base OPT model with the SHAP explainer (DeepSHAP) on Dataset 1 in order to obtain its sentiment decisions with SHAP-based token attributions. However, without ground-truth labels, the inference step is conducted on the held-out test portion of Dataset 1 and on Dataset 2 (generalization set). Using this preliminary forward pass, the model's behavior under realistic deployment conditions is identified. The predicted label \(\hat{y}_i\) serves as a diagnostic signal for hallucination assessment. To quantify hallucination risk in the datasets, a hybrid strategy combining computational screening and expert validation is employed, ensuring scalability and semantic rigor. During this phase, a computational hallucination-risk pre-screening processes each sample using multiple indicators: the model's predicted label \(\hat{y}_i\), the base polarity \(S_{\text{base}}(x_i)\), the negation adjustment \(N(x_i)\), and the SHAP-based attributions \(\phi(x_i)\). Equation (4) shows the computation of the hallucination-risk score for a sample:

\[ R(x_i) = g\big(\hat{y}_i,\; S_{\text{base}}(x_i),\; N(x_i),\; \phi(x_i)\big) \quad (4) \]

where \(g(\cdot)\) is the mapping function that assigns each instance to one of three hallucination-risk strata (low, medium, and high).
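One way to realize the mapping \(g(\cdot)\) computationally is sketched below. The equal weighting of the two indicators and the strata cut-offs (0.25 and 0.6) are assumptions for demonstration, not values from the paper.

```python
def risk_score(pred_label, lexicon_label, shap_mass_on_sentiment_tokens):
    """Illustrative g(.): combine disagreement between the model's label
    and the lexicon-derived label with how little SHAP attribution mass
    falls on sentiment-bearing tokens."""
    disagreement = 1.0 if pred_label != lexicon_label else 0.0
    attribution_gap = 1.0 - shap_mass_on_sentiment_tokens  # in [0, 1]
    return 0.5 * disagreement + 0.5 * attribution_gap

def stratify(score):
    # assign each instance to one of three hallucination-risk strata
    if score < 0.25:
        return "low"
    if score < 0.6:
        return "medium"
    return "high"
```

An instance whose predicted label contradicts the lexicon while its attributions concentrate on non-sentiment tokens lands in the high-risk stratum and is prioritized for expert review.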
The suggested stratification minimizes the need for exhaustive manual inspection of the entire dataset. Through stratified sampling, representative subsets are selected to maintain a trade-off between sentiment classes, linguistic constructs, and data sources. In subsequent phases, experts assess each selected instance and assign a hallucination indicator: 0 (prediction is semantically grounded) or 1 (hallucinated sentiment). These annotations are used to determine the hallucination probability. Equation (5) presents the computation of the hallucination probability.
\[ P_{\text{hall}} = \frac{1}{|E|} \sum_{i \in E} h_i \quad (5) \]

where \(E\) denotes the set of expert-evaluated samples (independent of sentiment class), \(P_{\text{hall}}\) represents the hallucination probability, and \(h_i\) is the binary hallucination indicator for sample \(i\).
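Equation (5), together with a weighted propagation across risk strata, reduces to a few lines of Python; the stratum names and sizes used in the example are illustrative.

```python
def hallucination_probability(indicators):
    # Equation (5): mean of the binary indicators h_i over the
    # expert-evaluated subset E
    return sum(indicators) / len(indicators)

def propagate_by_strata(stratum_probs, stratum_sizes):
    """Weight each stratum's expert-estimated probability by that
    stratum's share of the full dataset (illustrative propagation)."""
    total = sum(stratum_sizes.values())
    return sum(stratum_probs[s] * stratum_sizes[s] / total for s in stratum_probs)
```

For example, if experts find hallucination rates of 2%, 10%, and 50% in the low, medium, and high strata, the dataset-level estimate is their size-weighted average.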
To operationalize the expert-assisted scoring approach used in Equation (5), a structured annotation protocol was employed to ensure consistency, reliability, and statistical validity of hallucination-related labels. A panel of three domain experts participated in the annotation process. All experts hold doctoral qualifications in data science, with demonstrated research experience in NLP, SA, or explainable artificial intelligence applications, and prior involvement in annotation-based empirical studies. Due to the scale of the datasets, expert evaluation was conducted on a stratified subset rather than the entire corpus.
From Dataset 1 and Dataset 2, a total of 1200 instances were selected using hallucination-risk stratification (low, medium, and high), ensuring balanced coverage across sentiment classes, linguistic phenomena (negation, sarcasm, implicit sentiment), and data sources. Each expert independently annotated every selected instance, assessing whether the input text semantically supported the model's predicted sentiment and assigning a binary hallucination indicator (hallucinated vs. grounded). Disagreements were resolved through majority voting. To quantify annotation consistency, inter-rater reliability was measured using Fleiss' Kappa, yielding κ = 0.81, which indicates strong agreement. The hallucination probabilities derived from expert annotations were subsequently propagated to the entire dataset using weighted stratification. This hybrid evaluation strategy ensures that hallucination rate estimates are both human-validated and statistically representative, while remaining scalable for large datasets.
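The majority-vote resolution and the Fleiss' Kappa agreement check can be reproduced with standard formulas; the sketch below implements Kappa from its definition rather than from any particular library.

```python
def majority_vote(labels):
    # disagreements are resolved by the most frequent label
    return max(set(labels), key=labels.count)

def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # mean per-item agreement P-bar
    p_bar = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1)) for row in counts) / n_items
    # chance agreement P_e from marginal category proportions
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(counts[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three raters per item, perfect but class-balanced agreement yields κ = 1, while agreement no better than chance yields κ ≈ 0; values above 0.8 are conventionally read as strong agreement.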
The primary preprocessing, polarity computation, negation handling, and hallucination-risk stratification are entirely automated. Expert validation is applied selectively to calibrate symbolic polarity scores, verify augmentation consistency, and estimate hallucination probabilities for evaluation, ensuring scalability and enabling reproducibility. The algorithmic components, including lexicon construction, risk stratification, and hallucination propagation, are deterministic and fully specified. Expert involvement ensures that the polarity weights convey the intended sentiment, especially in linguistically complex scenarios such as idiomatic negation, sarcasm, contrastive discourse, or affective amplification. Consequently, the resultant vector yields a semantically grounded target distribution, which serves as a reference signal in subsequent stages of hallucination mitigation.
3.4. Proposed Hallucination Mitigation Mechanisms
The proposed framework integrates a set of complementary hallucination mitigation mechanisms directly into the learning and decision-making processes. These mechanisms prevent unsupported inferences and control neural overconfidence. In particular, the framework encompasses a SIF that enforces semantic alignment, SHAP-guided attribution regularization that facilitates interpretability-aware learning, and a confidence-based lexicon-deep fusion mechanism for adaptive decision control.
In contrast to post hoc correction strategies, the SIF monitors the optimization trajectory. Integrating the SIF into the learning phase helps the proposed model produce sentiment distributions that align with lexicon- and expert-validated polarity signals. Equation (6) gives the integrity score, which applies cosine similarity to quantify the degree of semantic alignment between neural predictions and symbolic evidence:

\[ I_i = \cos\big(p_i,\, L(x_i)\big) = \frac{p_i \cdot L(x_i)}{\lVert p_i \rVert\, \lVert L(x_i) \rVert} \quad (6) \]

where \(I_i\) is the integrity score for instance \(i\), quantifying the agreement between the predicted probability distribution \(p_i\) and the lexicon-polarity vector \(L(x_i)\); a lower value indicates a higher risk of hallucinated sentiment.
By applying the integrity score, the consistency between the model's prediction and linguistically grounded polarity cues is assessed. During training, a threshold \(\tau\) is introduced to mitigate the likelihood of hallucination by identifying insufficiently grounded predictions via the condition \(I_i < \tau\).
Instead of discarding hallucinated predictions, we impose a differentiable penalty as shown in Equation (7):

\[ \mathcal{L}_{\text{SIF}} = \frac{1}{N} \sum_{i=1}^{N} \max\big(0,\; \tau - I_i\big) \quad (7) \]

where \(\mathcal{L}_{\text{SIF}}\) denotes the sentiment integrity filter loss, which penalizes semantic misalignment and improves the model's classification performance.
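The integrity score and its penalty can be sketched in plain Python as below. The cosine form follows the description in the text; the mean-hinge shape of the penalty and the threshold value tau = 0.5 are illustrative assumptions.

```python
import math

def integrity_score(p, lexicon_vec):
    # cosine similarity between the predicted class distribution p_i
    # and the lexicon-polarity vector L(x_i)
    dot = sum(a * b for a, b in zip(p, lexicon_vec))
    norms = (math.sqrt(sum(a * a for a in p))
             * math.sqrt(sum(b * b for b in lexicon_vec)))
    return dot / norms

def sif_loss(integrity_scores, tau=0.5):
    # differentiable hinge penalty on predictions whose integrity
    # falls below the grounding threshold tau
    return sum(max(0.0, tau - s) for s in integrity_scores) / len(integrity_scores)
```

A prediction that agrees with the lexicon vector contributes nothing to the loss; only insufficiently grounded predictions (\(I_i < \tau\)) are penalized, and the penalty grows linearly with the shortfall.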
Transformer-based sentiment analysis models tend to focus on statistical co-occurrence rather than true sentiment relevance due to how attention is distributed across tokens. Incorporating a SHAP-guided regularization mechanism as an internal supervisory signal allows the model's attention behavior to be controlled. SHAP determines the marginal contribution of each input token to the predicted probability, revealing the role of linguistically meaningful cues in sentiment predictions. Implementing this attribution feedback as part of the training objective promotes accurate and interpretable predictions and minimizes the likelihood of hallucination. The proposed SHAP-guided regularization mechanism quantifies the contribution of each input token to the model's outcome using an attribution value \(\phi_t\). To manage computational overhead, regularization is applied selectively rather than exhaustively during training: SHAP computations are triggered only for samples with low semantic integrity scores. This strategy concentrates attribution feedback on instances where corrective supervision is most beneficial, limiting additional training costs. The SHAP regularization is added as a bounded auxiliary loss with a small weighting coefficient, enabling it to guide the learning process without dominating the optimization. Empirically, this design yields stable convergence with limited computational overhead. Equation (8) shows the mathematical form of the regularization loss calculation.
\[ \mathcal{L}_{\text{SHAP}} = \frac{\phi_{\mathcal{N}}}{\phi_{\mathcal{S}} + \phi_{\mathcal{N}}} \quad (8) \]

where \(\mathcal{L}_{\text{SHAP}}\) denotes the regularization loss, and \(\phi_{\mathcal{S}}\) and \(\phi_{\mathcal{N}}\) are the aggregated absolute attribution values for the partition sets \(\mathcal{S}\) (sentiment-relevant tokens) and \(\mathcal{N}\) (sentiment-irrelevant tokens).
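Under one plausible reading of a bounded attribution-partition loss, the penalty is the share of absolute SHAP attribution mass that falls on sentiment-irrelevant tokens, which is naturally bounded in [0, 1]. This is a hedged interpretation for illustration; in practice the relevance mask would come from the lexicon and expert annotations.

```python
def shap_regularization_loss(attributions, relevant_mask):
    """Fraction of absolute attribution mass on sentiment-irrelevant
    tokens (illustrative reading; the paper's exact form may differ)."""
    phi_s = sum(abs(a) for a, rel in zip(attributions, relevant_mask) if rel)
    phi_n = sum(abs(a) for a, rel in zip(attributions, relevant_mask) if not rel)
    return phi_n / (phi_s + phi_n)
```

Minimizing this quantity pushes the model to concentrate its attribution mass on sentiment-bearing tokens, which is exactly the behavior the regularizer is meant to encourage.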
Furthermore, a confidence-based lexicon-deep fusion mechanism is introduced in the hallucination mitigation framework to integrate symbolic and neural evidence at the decision level. Equation (9) presents the fused output distribution:

\[ p_{\text{fused}}(y \mid x) = \alpha\, p_{\theta}(y \mid x) + (1 - \alpha)\, L(x) \quad (9) \]

where \(p_{\text{fused}}\) is the fused output distribution, \(\alpha\) is the gating coefficient, \(p_{\theta}(y \mid x)\) denotes the neural sentiment distribution, and \(L(x)\) indicates the expert-guided lexicon-polarity vector.
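The fusion step can be sketched as a confidence-gated convex combination. Using the neural confidence directly as the gating coefficient and renormalizing the fused vector are our illustrative choices, not necessarily the paper's exact calibration.

```python
def fuse(p_neural, lexicon_vec, confidence):
    """Confidence-gated convex combination of neural and symbolic
    evidence: a confident model dominates, an uncertain one defers
    to the lexicon-polarity vector."""
    alpha = confidence  # gating coefficient, e.g. the max class probability
    fused = [alpha * pn + (1.0 - alpha) * pl
             for pn, pl in zip(p_neural, lexicon_vec)]
    total = sum(fused)
    return [f / total for f in fused]  # renormalize to a distribution
```

With high confidence the fused argmax follows the neural distribution, while with low confidence the lexicon evidence can flip the decision, which is the intended suppression of poorly supported predictions.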
Through this fusion mechanism, the proposed hallucination-aware SA model dynamically regulates neural confidence. To suppress hallucinated predictions without interfering with the core learning dynamics, we apply this strategy at inference time as a confidence calibration step.
3.5. Proposed Hallucination-Aware SA Model
Hyperparameter selection was performed using validation-based tuning to ensure reproducibility and efficiency. We use parameter-efficient fine-tuning (PEFT) to make the model faster to train and highly efficient at inference time. Under this strategy, the base OPT parameters are frozen and training focuses on a reduced set of trainable components. Key parameters, including the learning rate, batch size, dropout probability, semantic integrity threshold, and loss weighting coefficients, were optimized on the validation set. To balance semantic robustness and computational efficiency, a parameter-efficient optimization strategy is followed, controlling the model's learning process with task-specific parameters. Performance was assessed in terms of classification accuracy and hallucination rate to avoid bias towards strictly predictive measures. Robustness was further evaluated by repeating training with multiple random seeds and reporting variance across runs. Sensitivity analysis confirmed the model's stability across reasonable parameter ranges, indicating that the proposed framework does not depend on fragile hyperparameter configurations. Equations (10) and (11) outline the final set of trained parameters under PEFT.
\[ p_{\theta}(y \mid x) = \mathrm{softmax}\big(W\, h_{\theta_{\text{OPT}}}(x) + b\big) \quad (10) \]

\[ \theta^{*} = \arg\min_{\theta_{\text{trainable}}} \Big( \mathcal{L}_{\text{CE}} + \lambda_{1}\, \mathcal{L}_{\text{SIF}} + \lambda_{2}\, \mathcal{L}_{\text{SHAP}} \Big), \qquad \theta_{\text{trainable}} = \{W, b\} \quad (11) \]

where \(\theta_{\text{OPT}}\) denotes the frozen OPT backbone parameters, \(W\) and \(b\) are the trainable weight matrix and bias of the classification head, \(\mathcal{L}_{\text{CE}}\) indicates the standard cross-entropy loss, \(\lambda_{1}\) and \(\lambda_{2}\) are the hyperparameters controlling the influence of the hallucination mitigation constraints, \(\theta^{*}\) is the optimized parameter set, \(p_{\theta}(y \mid x)\) is the neural sentiment distribution over labels \(y\) given input \(x\), \(\theta_{\text{trainable}}\) is the small set of trainable parameters, and the \(\arg\max\) operator used at inference selects the sentiment class with the highest fused probability.
Equation (12) presents the final inference rule that predicts the sentiment for an input instance $x$:
$\hat{y} = \arg\max_{y} P_{\text{fused}}(y \mid x; \theta^{*})$ (12)
where $\hat{y}$ is the final prediction made by the fine-tuned OPT model.
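As a concrete illustration of this inference step, the numpy sketch below blends the neural distribution with a lexicon-based prior under a confidence gate and then takes the argmax. The specific gating rule and the threshold `tau` are illustrative assumptions, not the paper’s exact formulation.

```python
# Confidence-calibrated fused inference sketch: a gate derived from neural
# confidence blends the neural distribution with a lexicon prior; argmax
# then selects the sentiment class (cf. Eq. (12)). Gating rule is assumed.
import numpy as np

LABELS = ["negative", "neutral", "positive"]

def fused_predict(p_neural, p_lexicon, tau=0.9):
    """Blend neural and lexicon distributions, gated by neural confidence."""
    p_neural = np.asarray(p_neural, dtype=float)
    p_lexicon = np.asarray(p_lexicon, dtype=float)
    conf = p_neural.max()                      # neural confidence
    gate = min(conf / tau, 1.0)                # low confidence -> lean on lexicon
    p_fused = gate * p_neural + (1.0 - gate) * p_lexicon
    p_fused /= p_fused.sum()                   # renormalize to a distribution
    return LABELS[int(np.argmax(p_fused))], p_fused

label, dist = fused_predict([0.2, 0.5, 0.3], [0.1, 0.2, 0.7], tau=0.9)
print(label)  # the lexicon prior pulls a low-confidence neural guess toward "positive"
```

Because the gate saturates at 1 for confident predictions, the lexicon prior only intervenes when the neural distribution is uncertain, matching the intent of suppressing hallucinations without disturbing well-supported predictions.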
3.6. Experimental Settings
We performed the experiments on a Windows-based workstation.
Table 2 provides an overview of the model implementation details. We used Python 3.10 and the HuggingFace Transformers library to accelerate the training of the transformer architecture and the SHAP-based attribution computation. The controlled environment further included scikit-learn, NumPy, Pandas, and Seaborn. The OPT version is configured with a moderate parameter budget to maintain expressiveness while remaining computationally feasible. The base model is fine-tuned on the primary dataset (Dataset 1) with a maximum sequence length of 64 tokens and the AdamW optimizer. Performance was measured in terms of accuracy, precision, recall, and F1-score, computed separately on the training (70%), validation (15%), and test (15%) splits. In addition to standard classification metrics, the hallucination rate [
37] was employed to quantify the extent of unsupported or semantically unjustified sentiment predictions, as identified through an expert-assisted evaluation protocol. Training and validation curves were monitored to evaluate convergence stability and generalization behavior, and statistical consistency was assessed using mean values and confidence intervals. Inference efficiency was estimated as the mean inference time per sample on the GPU. External robustness was additionally assessed on Dataset 2, an unseen test set.
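The evaluation protocol above can be sketched in a few lines: standard classification metrics plus a hallucination rate, here taken as the fraction of predictions flagged as semantically unsupported. The flags themselves would come from the expert-assisted protocol; the toy labels and flags below are made up for illustration.

```python
# Hedged sketch of the evaluation: accuracy, macro-averaged metrics, and a
# hallucination rate computed from expert-provided "unsupported" flags.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neu", "pos", "pos", "neu", "pos", "neg"]
unsupported = [0, 0, 0, 0, 1, 0, 0, 0]   # expert flags: 1 = hallucinated prediction

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
halluc_rate = sum(unsupported) / len(unsupported)
print(f"acc={acc:.3f} macro-F1={f1:.3f} hallucination_rate={halluc_rate:.3f}")
```

The same aggregation would be applied per split (train/validation/test) and per class when building the result tables.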
4. Results
Figure 2 shows that the proposed hallucination-aware SA model achieves stable and efficient learning over the 36-epoch training period. Both accuracy trajectories show a consistent upward trend, whereas the corresponding loss trajectories exhibit a smooth downward decline, indicating that the model internalizes sentiment-relevant patterns without overfitting. The strong agreement between the training and validation curves further supports robust generalization, indicating that the model interprets sentiment reliably even in noisy, ambiguous, or contextually complex social media content. This performance is a direct result of the architectural innovations embedded in the system. The SIF constrains the model, preventing unsupported sentiment inferences and directing it toward lexically and contextually grounded evidence. The SHAP-guided attribution penalty reinforces this behavior by penalizing predictions affected by non-salient or irrelevant tokens.
Furthermore, the confidence-based lexicon deep-fusion mechanisms help overcome neural overconfidence and stabilize predictions in ambiguous contexts. Collectively, these mechanisms foster semantic fidelity while reducing hallucination, thereby allowing the model to achieve high accuracy with a small generalization gap. Consequently, the curves establish the usefulness of the training strategy and the robustness of the proposed hallucination-aware framework.
Over 10 independent runs (
N = 10), we report (
Table 3) the outcomes with confidence intervals to provide a robust estimate of run-to-run variability. Each run is initialized with a different random seed and evaluated on the same test set. Randomness is introduced through model initialization and mini-batch shuffling. The choice of N = 10 runs aligns with common practice in NLP evaluation, particularly for transformer-based models.
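The run-to-run aggregation can be sketched as follows: mean accuracy and a 95% confidence interval over N = 10 seeds, using the t-distribution, which is appropriate for small N. The accuracy values below are made up for illustration and are not the paper’s results.

```python
# Mean and 95% confidence interval over N = 10 independent runs (toy values).
import statistics
from scipy import stats

accs = [0.974, 0.976, 0.975, 0.977, 0.973, 0.976, 0.975, 0.978, 0.974, 0.976]
n = len(accs)
mean = statistics.mean(accs)
sem = statistics.stdev(accs) / n ** 0.5          # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)            # two-sided 95% critical value
half_width = t_crit * sem
print(f"accuracy = {mean:.4f} ± {half_width:.4f} (95% CI, N={n})")
```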
Table 3 shows that the hallucination-aware OPT model achieves strong, stable performance across all sentiment classes in Dataset 1. The narrow confidence intervals can be attributed to several factors, including training stabilization through PEFT fine-tuning, the hallucination-aware constraints, and the large-scale training data. Notably, the improved balance among the negative, neutral, and positive classes marks a significant advance for social media content analysis, where the neutral and negative classes are traditionally prone to misclassification and hallucinated polarity. The positive class yields the highest accuracy (98.2%) and the lowest hallucination rate (1.9%), reflecting the model’s ability to correctly reinforce sentiment when lexical and contextual evidence are consistent. Importantly, the negative and neutral classes also show low hallucination rates, implying that the SIF and SHAP-guided regularization are effective in suppressing unsupported polarity amplification in ambiguous or weakly expressed text. Ultimately, these results support the proposed hallucination-mitigation strategies, which enhance both prediction accuracy and semantic reliability.
Figure 3 shows the class-wise performance of the hallucination-aware sentiment analysis model on Dataset 2, which is an external and unseen evaluation benchmark. Unlike Dataset 1, Dataset 2 combines content from diverse platforms and exhibits a higher degree of linguistic variation, making it a realistic approximation of real deployment. A critical observation is a controlled hallucination rate below 4% for each class, even in the neutral class, which is traditionally the most challenging due to weak or implicit sentiment signals. The high accuracy, precision, recall, and F1 scores across the negative, neutral, and positive classes show that the model generalizes effectively outside the training distribution. This suggests that the learned representations are not overfitting to dataset-specific patterns.
All ablation results are obtained under identical experimental conditions. Paired tests are used to assess the statistical significance of results across multiple randomized test splits.
Table 4 provides conclusive evidence that a substantial reduction in the rate of hallucinations occurs with the combined application of the grounding, interpretability, and confidence-gating mechanisms. Although the base OPT model exhibits exceptional baseline performance (accuracy of 92.5%), the objective of the proposed model is not simply to enhance the raw classification accuracy, but to systematically reduce hallucinated sentiment predictions. Each ablation component addresses a different failure mode: semantic misalignment, attribution drift, and neural overconfidence.
While the individual improvements (ranging from 0.4% to 2.1%) may appear modest in isolation, paired statistical testing shows that their cumulative impact is statistically significant (p < 0.01). Importantly, these improvements are accompanied by a consistent and substantial decrease in the rate of hallucinations, reflecting improved semantic grounding rather than random variation.
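The paired testing described above can be sketched directly: accuracies of the full model and an ablated variant are compared on the same randomized test splits with a paired t-test. The numbers below are illustrative placeholders, not the paper’s measurements.

```python
# Paired t-test sketch for ablation significance: each index is the same
# randomized test split evaluated under both configurations (toy values).
from scipy import stats

full_model = [0.976, 0.974, 0.977, 0.975, 0.976, 0.973, 0.978, 0.975]
ablated    = [0.962, 0.959, 0.964, 0.960, 0.963, 0.958, 0.965, 0.961]

t_stat, p_value = stats.ttest_rel(full_model, ablated)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")  # small p ⇒ significant paired difference
```

A paired design is essential here: because both configurations see identical splits, split-to-split variance cancels and even small consistent gains become detectable.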
Incorporating lexicon information without confidence gating yields mild improvements in predictive performance, underscoring the importance of employing the SIF, confidence gating, and SHAP-guided correction learning mechanisms. The results therefore confirm the complementary and synergistic nature of the proposed model, indicating that the reduction in hallucinations should be an integral part of the learning and inference pipeline rather than a post hoc diagnosis.
Table 5 presents findings that clearly demonstrate the effectiveness of the proposed hallucination-aware SA framework, which outperforms well-established transformer-based SA models. Compared to BERT-base, RoBERTa-base, and XLNet-base, the proposed model achieves a considerable improvement in accuracy and F1-score and, most importantly, a significant reduction in the hallucination rate. This outcome confirms the importance of incorporating semantic accountability directly into the learning and inference processes. The SHAP-guided attribution regularization imposes additional constraints on the model by penalizing decisions driven by sentiment-irrelevant tokens, resulting in consistent improvements in classification accuracy and reduced hallucination without impairing convergence stability. Across all experiments, training was stable with minimal computational overhead, resulting in improved inference efficiency.
Figure 4 shows that the proposed hallucination-aware framework not only surpasses current transformer models in predictive power but is also fundamentally more reliable in its decisions. The visualization highlights a uniform and significant performance gain for the proposed model across all evaluation metrics, alongside a lower hallucination rate. By contrast, the baseline models produce unsupported polarity inferences, resulting in high hallucination rates. This confirms that strong classification performance alone does not guarantee semantic reliability, especially under domain shift. The proposed architecture overcomes this limitation by incorporating semantic grounding and attribution regularization into both training and inference.
In order to ensure a fair and unbiased comparison, all the models presented in
Table 5 were tested under identical experimental conditions. Each baseline and state-of-the-art transformer model was re-implemented or fine-tuned using Dataset 1 with the same preprocessing pipeline, label mappings, training-validation splits, and hardware environment. Hyperparameters, including the maximum sequence length, optimizer type, batch size, and stopping criteria, were matched as closely as possible to those used for the proposed model while maintaining the architectural integrity of each model. Model inference was performed on the same workstation with identical execution settings to obtain comparable latency readings. This controlled evaluation procedure ensures that the observed performance improvements and the decrease in hallucination rate correspond to methodological advances in hallucination-aware learning rather than to environmental or implementation-related factors. The proposed hallucination-aware OPT model showed consistent overall performance relative to state-of-the-art transformer and hybrid SA models, including Jain et al. [
10] and Albladi et al. [
33], and it achieved a higher classification accuracy (97.6%) with a lower hallucination rate (9.2%), a reduction of over 40%, as shown in
Table 6. This observation shows that a combination of semantic grounding and interpretability-driven optimization is necessary to obtain consistent sentiment analysis beyond standard architectural improvements.
Figure 5 clearly shows that the proposed hallucination-aware model outperforms state-of-the-art transformer-based SA models with respect to accuracy, precision, recall, and F1-score on Dataset 2. Existing models, including Mandal et al. [
19], Naseem et al. [
20], Tiwari et al. [
26], Jain et al. [
10], Gong et al. [
28], and Albladi et al. [
33], focus mainly on improving contextual representation through attention mechanisms, data augmentation, or external knowledge injection. Although these strategies improve classification accuracy, they lack semantic justification and assign confident sentiment labels even when polarity cues are weak, implicit, or misleading. In contrast, the proposed model induces an inductive bias toward hallucination awareness through semantic grounding (SIF), attribution regularization (SHAP-guided regularization), and confidence-controlled lexicon deep fusion, supporting the model’s reliability. Lacking explicit hallucination reduction mechanisms, the existing hybrid and transformer-based SA models fail to suppress hallucinated sentiments, resulting in elevated hallucination rates. These methodological advances enable the proposed SA model to achieve superior generalization and reliability in unseen and noisy social media settings.
The AUROC values indicate the effectiveness of hallucination-mitigation mechanisms in Dataset 1, outlining the stabilization of the model’s decision-making process.
Figure 6 shows the excellent discriminative power of the proposed hallucination-aware SA model, in which a uniformly high AUROC value indicates that the model can differentiate between correct and incorrect sentiment assignment, with no overlap in the score distribution. Existing SA models use a data-driven learning technique without semantic grounding or hallucination control, which limits their classification performance. Although these models adequately represent surface-level contextual patterns, they confront challenges in reliably distinguishing true sentiment signals from confounding linguistic phenomena, including negation, implicit criticism, and subtle sentiment indicators, affecting their ability to prioritize positive instances over negative ones across various decision thresholds. By using SIF- and SHAP-guided correction layers, the proposed model reduces the dependence on non-informative tokens. Taking advantage of the lexicon-deep fusion module, the proposed model classifies social media content based on evidence-based polarity cues, thereby reducing unsupported sentiment predictions.
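The AUROC analysis above amounts to asking whether the model’s scores rank correct sentiment assignments above incorrect ones. A minimal sketch, with toy correctness labels and confidences standing in for the real per-instance values:

```python
# AUROC sketch: 1 = prediction was correct, 0 = incorrect; the score is the
# model's confidence. AUROC = 1.0 means every correct case outranks every error.
from sklearn.metrics import roc_auc_score

correct    = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
confidence = [0.97, 0.92, 0.55, 0.88, 0.61, 0.95, 0.90, 0.48, 0.86, 0.93]

auroc = roc_auc_score(correct, confidence)
print(f"AUROC = {auroc:.3f}")  # 1.0 here: perfect separation of correct vs. incorrect
```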
The consistent AUPRC value demonstrates the model’s discriminative power in the sparse, contextually subtle sentiment-cue setting.
Figure 7 offers empirical evidence of the effectiveness and reliability of the proposed model. The AUPRC emphasizes the model’s ability to maintain high precision while achieving better recall. It is evident that the SIF prevents the model from generating high-confidence predictions without sufficient textual evidence. The stabilization of the PR curves at higher recall levels shows the significance of the proposed confidence-based lexicon-deep fusion mechanism. For sentiment classes with overlapping or weak polarity cues, the existing models’ lower AUPRC values suggest that they are incapable of maintaining high precision across recall levels. These models frequently produce overconfident predictions in ambiguous contexts, thereby increasing false positives. Models driven by contextual embeddings or static feature fusion are unable to prevent hallucinated sentiments. These findings reinforce the suitability of the proposed model for real-world social media SA.
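As a companion to the PR-curve discussion, average precision (the usual AUPRC estimator) summarizes how well precision holds up as recall increases. The per-class labels and scores below are toy values for a hypothetical "positive" class, not the paper’s data.

```python
# AUPRC sketch for one class: average_precision_score integrates precision
# over recall steps; one mid-ranked false positive slightly lowers the value.
from sklearn.metrics import average_precision_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]        # 1 = positive class
y_score = [0.95, 0.30, 0.88, 0.91, 0.45, 0.80, 0.84, 0.20, 0.77, 0.51]

auprc = average_precision_score(y_true, y_score)
print(f"AUPRC = {auprc:.3f}")  # high value: precision stays high across recall
```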
A qualitative error analysis is conducted to understand the model behavior.
Table 7 shows the model’s ability to handle negation and contrastive sentiment constructions, indicating a reduced hallucination rate in mildly sarcastic constructions when explicit sentiment cues are present. However, failure cases involve instances, such as deep irony, culturally specific idiomatic expressions, and domain-specific slang, that require external contextual or pragmatic knowledge, underscoring the inherent difficulty of interpreting sentiment in highly informal, context-dependent social media discourse. The observed failure patterns provide clear directions for future extensions through the incorporation of pragmatic knowledge bases or user-context modeling, reinforcing the transparency of the proposed study.
5. Discussion
As the scope and societal applicability of SA models continue to broaden, the need to focus on semantic accountability and evidence-based reasoning becomes paramount for developing reliable natural language processing technology. The proposed hallucination-aware SA model offers quantifiable benefits in terms of predictive fidelity, semantic grounding, and interpretability, which are essential for analyzing social media content. It achieves reliability and robustness by mitigating neural overconfidence.
Unlike the existing SA models [
10,
18,
19,
20,
21,
22,
23,
24,
25,
26,
27,
28,
29,
30,
31,
32,
33], the proposed model emphasizes semantic accountability, in which each prediction is supported by verifiable textual evidence. This shift addresses the challenges of transformer-based SA models. A key finding emerging from the evaluation is that the model maintains high discriminative power under ambiguous conditions. The AUROC scores indicate effective classifier performance across the sentiment categories. Existing models achieve considerable classification accuracy; however, their hallucination rates are higher than those of the proposed model when polarity cues are subtle or embedded in complex discourse.
The empirical stability of the proposed model can be explained by its internal regulatory mechanisms, which limit overgeneralization in the neural network. The SIF improves the model’s prediction reliability: it cross-validates predictions against sentiment-bearing lexical cues and contextual markers, mitigating the primary sources of hallucination. Moreover, it suppresses unsupported polarity outputs and acts as a preventive mechanism that shapes the decision path of the proposed model before misclassification occurs. The robustness of the model also derives from SHAP-guided regularization, which serves as an explicit internal supervisory signal. Although SHAP was initially developed as a post hoc interpretability mechanism, its incorporation into the training loop enforces attribution regularization. By penalizing predictions that rely heavily on sentiment-irrelevant tokens, the model assigns high interpretability scores to sentences in a way that reflects human conceptualizations of sentiment-bearing cues. The SHAP-guided constraint thus turns interpretability from a purely diagnostic function into a proactive regulator of model behavior.
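The core idea of the attribution penalty can be illustrated with a toy computation: charge the model in proportion to the attribution mass it places on tokens outside a set of sentiment-salient positions. Real SHAP values would come from an explainer over the trained model; here they are supplied directly, and the penalty form and weight `lam` are illustrative assumptions.

```python
# Toy attribution-regularization penalty: the fraction of |attribution| mass
# on non-salient tokens, scaled by a weight lam. Not the authors' exact loss.
import numpy as np

def attribution_penalty(shap_values, salient_mask, lam=0.5):
    """Penalize |attribution| assigned to non-salient tokens, scaled by lam."""
    shap_values = np.abs(np.asarray(shap_values, dtype=float))
    salient_mask = np.asarray(salient_mask, dtype=bool)
    off_target = shap_values[~salient_mask].sum()
    total = shap_values.sum() + 1e-12           # guard against all-zero attributions
    return lam * off_target / total             # value lies in [0, lam]

# Suppose tokens 2 and 4 (e.g., "great", "disappointing") are sentiment-bearing.
shap_vals = [0.05, 0.60, 0.02, 0.30, 0.03]     # per-token attributions
salient   = [False, True, False, True, False]
print(attribution_penalty(shap_vals, salient))
```

Added to the task loss during training, a term of this shape discourages predictions whose explanation leans on irrelevant tokens, which is the mechanism the paragraph above describes as turning interpretability into a regulator.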
A comparative analysis with the existing literature shows the unique methodological contribution of this study. Baseline models, including BERT, RoBERTa, XLNet, and OPT, lack the ability to understand sarcasm, model long-range dependencies, or interpret the sentiment in the absence of polarity cues. The existing SA approaches overcome these limitations through data augmentation, fine-tuning domain-specific methods, or post hoc interpretability frameworks. However, they are unable to address hallucination. By directly introducing hallucination mitigation into the model architecture via SIF, SHAP-guided supervision, and lexicon deep fusion, this model presents an inductive bias for evidence-grounded reasoning, providing an effective model addressing neural overconfidence and semantic drift.
The practical implications of this research are significant. In real-time settings, hallucinated sentiment predictions may have severe consequences. In brand reputation analytics, falsely negative sentiment can mask real customer dissatisfaction, leading to unwanted escalation. Incorrect sentiment categorization in public policy and social behavior monitoring can result in erroneous perceptions of issues or crisis signals. In online safety and mental health, hallucinated sentiment predictions might override signals of distress or accentuate non-distress discourse. The broader implications of hallucination-aware SA are especially consequential in critical sectors such as healthcare and finance, where inaccurate sentiment interpretation can lead to unintended consequences. Healthcare monitoring systems employ SA to identify patient discomfort, treatment discontent, or mental health risks from online forums and social media posts. Negative hallucinations may cause needless clinical warnings, whereas positive hallucinations can mask symptoms of sadness or therapy failure. Similarly, sentiment models are used in financial analytics to evaluate investor confidence, market reactions, or customer feedback. Hallucinated sentiments within earnings-related discussions or customer complaints can lead to inaccurate risk evaluations, misguided trading decisions, or unwarranted escalation of customer service procedures. The proposed framework improves decision reliability in economically sensitive and safety-critical applications by explicitly grounding predictions in linguistic evidence and suppressing unsupported polarity assignments, thereby reducing the likelihood of such errors. Moreover, its internal justification mechanisms are consistent with emerging artificial intelligence regulatory frameworks that emphasize transparency, accountability, and auditable decision-making processes.
While the proposed study exhibits remarkable performance in the internal and external datasets, it has some limitations. The inclusion of interpretability-focused training and lexicon neural integration adds a significant amount of computational overhead that may limit model deployment in resource-constrained settings. Although confidence-based fusion is effective at moderating overconfident predictions, threshold selection is empirically driven and may need adjustment when transferring to unseen domains with different sentiment distributions. Expert-assisted annotation is effective in determining semantically grounded sentiments. However, it may introduce scalability constraints, particularly for real-time social media monitoring. Expert involvement may make the process of annotation expensive and restrict deployment to new domains. The introduction of multiple integrating components may demand substantial implementation effort compared to regular transformer fine-tuning. In addition, hallucination detection is assessed at the sentence level, and there is a lack of explicit modeling of the discourse level or conversational level.
The proposed model performs exceptionally well on English data; however, it may be less effective for languages characterized by morphological richness, culturally embedded metaphors, or idiomatic sentiment expressions. Thus, a deeper investigation into language-specific grounding strategies is needed to extend the proposed model to multilingual environments. Although the architecture drastically reduces the rate of hallucinations, it struggles to process idiomatic constructions and dense sarcasm. These limitations motivate several directions for future research. Integrating complex, structured semantic knowledge, such as sentiment ontologies and knowledge graphs, could improve the depth and precision of semantic grounding. Transforming the SIF into a more general semantic validation mechanism might enable verification of stance, pragmatic intent, and factual consistency.
A critical factor in assessing the external validity of the suggested model is the intrinsic bias within social media datasets, especially when integrating platforms like Twitter (now X) and Reddit. Twitter data reflects concise, contextually rich, and frequently emotive statements, through sarcasm, hashtags, and informal abbreviations. In contrast, Reddit posts are lengthy, argumentative, and topic-focused, expressing thoughtful ideas rather than impulsive emotions. Platform-specific sentiment distributions and linguistic standards are introduced as a result of these structural and behavioral differences, which may add bias to the model’s learning process. The residual bias may persist when platform-specific cues predominate sentiment expression. The incorporation of lexicon grounding and attribution regularization mitigates this risk by limiting predictions to linguistically validated evidence instead of platform-specific statistical trends. Subsequent evaluations across diverse platforms and domains are essential to enhance and validate cross-platform resilience. These factors indicate that careful adaptation is required in heterogeneous real-time settings.
Additionally, incorporating reinforcement learning from human feedback may help the model internalize human perceptual cues related to justification, coherence, and groundedness, thereby achieving a holistic reduction in hallucinations. Future research can build on this framework by developing more systematic and scalable approaches to automating hallucination detection. While the current study relies on expert-guided assessment and token-level rationales to identify unsupported or inconsistent reasoning, future work may incorporate fully automated hallucination detection modules. One promising direction is to model hallucination as a form of evidence inconsistency, in which predictions are validated against extracted rationales, semantic entailment scores, or contradiction signals derived from natural language inference models. Confidence-aware calibration techniques and uncertainty estimation can further be applied to flag predictions with weak or diffuse evidence, enabling automatic identification of potentially hallucinatory outputs without human intervention. In addition, self-consistency checks among perturbed inputs or paraphrased variants may serve as an effective mechanism for identifying unstable or ungrounded reasoning patterns at inference time.
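The self-consistency check sketched above can be made concrete: run the classifier on paraphrased variants of an input and flag the prediction as potentially hallucinated when the variants disagree too often. The classifier below is a trivial stand-in keyed off a single cue word, and the agreement threshold is an illustrative assumption.

```python
# Self-consistency sketch: low agreement across paraphrases flags a
# potentially hallucinated prediction. The "model" is a toy stand-in.
from collections import Counter

def is_consistent(predict, variants, min_agreement=0.8):
    """Flag potential hallucination when paraphrase-level agreement is low."""
    preds = [predict(v) for v in variants]
    top_label, top_count = Counter(preds).most_common(1)[0]
    agreement = top_count / len(preds)
    return agreement >= min_agreement, top_label, agreement

fake_model = lambda text: "positive" if "love" in text else "neutral"
variants = ["I love this", "love it a lot", "this is fine", "I love it", "love this one"]
ok, label, agree = is_consistent(fake_model, variants)
print(ok, label, agree)  # 4/5 variants agree on "positive" → consistent at 0.8
```

In a deployed pipeline, the paraphrases would come from a perturbation or paraphrasing model, and low-agreement inputs would be routed to the fallback or human-review path.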
Another critical path is to broaden the proposed framework beyond its current English-language setting to multilingual contexts. The framework’s modular design renders it readily extensible to additional languages, especially low-resource or morphologically rich ones. Future research could incorporate multilingual pre-trained models covering a broader range of languages and explore language-agnostic sentiment representations that would allow transfer across languages. Expanding the training regimen to incorporate multilingual topic alignment, typology-aware contrastive objectives, or regional discourse patterns would further increase generalization. In addition, continual learning strategies may make the model increasingly adaptable to emerging languages, topics, and discourse styles. Ultimately, these directions would lead the framework toward a scalable, multicultural, and self-monitoring sentiment analysis system that can support real-world applications in social media analysis, policy monitoring, and misinformation research at a global scale.