1. Introduction
Social media serves as a platform for political discourse, shaping individuals' perceptions of societal well-being and facilitating real-time civic engagement [1,2,3]. Individuals express emotions through direct or indirect stances toward specific targets/topics, such as policies, individuals, organizations, or events. Stance detection enables automated processing of individual perceptions of specific issues on social media, offering systematic insights into how opinions are formed, polarized, and propagated within ideological and socio-political discourse [4,5,6]. It determines support, opposition, or neutrality towards a specific target. Additionally, it supports fine-grained analysis of public discourse and misinformation, and enables evidence-based decision-making in a highly dynamic digital setting.
Although English dominates social media discourse, Arabic remains one of the most influential languages in the Middle East and North Africa, reflecting the region's cultural, political, and social dynamics [7,8,9]. On social media, Arabic and English are frequently mixed, producing code-switched content that limits the applicability of monolingual modelling. As a result, bilingual stance understanding is crucial for identifying the viewpoints that emerge in these digitally connected, linguistically diverse communities [10,11,12]. The development of transformer-based architectures and large language models (LLMs) has advanced stance detection, capturing nuanced linguistic phenomena, including negation, sarcasm, and implicit disagreement [13]. Through cross-lingual reasoning, these models can process Arabic content, and by leveraging generalized semantic representations, LLMs extend the capabilities of transformers across domains and topics. Compared with traditional feature-driven methods, transformer-based methods exhibit better performance and adaptability, and now dominate stance-detection research [13].
The majority of existing studies focus on English-language stance-detection models, paying minimal attention to Arabic or bilingual contexts [14,15]. Arabic stance detection typically relies on topic-specific datasets, and the resulting models generalize poorly across dialects, domains, and targets [15]. Bilingual and code-switched stance detection remains underexplored. From a methodological perspective, existing methods rely on conventional fine-tuning of pre-trained models with limited architectural customization for target-aware stance modeling, making them less suitable for addressing challenges posed by unbalanced label distributions and unstructured data [16,17,18]. Despite the exceptional performance of LLMs, they face challenges such as computational cost, data efficiency, and explainability, especially in socially sensitive use cases. These constraints underscore the need for data-oriented, structure-sensitive stance-detection systems that can process Arabic-English social media discourse with limited computational resources.
This study is motivated by social media's expanding role in influencing public attitudes, beliefs, and opinions on social, political, and healthcare topics. Existing target-aware stance-detection models are typically single-encoder concatenation architectures, where the target and text are concatenated and processed as a single sequence (e.g., BiCond, CrossNet, and concatenation-BERT variants). While useful in monolingual contexts, these approaches are hindered by representational entanglement, which makes it challenging to generalize across heterogeneous stance types (topic stance vs. claim stance vs. conversational stance) and across languages. Similarly, cross-lingual natural language inference (NLI)/stance models are mostly label-space transfer models, lacking a clear model of the relational structure between the target and stance-bearing texts. They typically use translate-train or translate-test heuristics to bridge languages, resulting in translation noise, domain mismatch, and stylistic drift, which are major challenges for Arabic social media content with its dialects, code-switching, spelling variation, and informal morphology. Addressing these limitations is crucial for developing robust, transparent, and comprehensive stance models that capture the complex dynamics of individuals' expressions across varied, linguistically rich online contexts. Therefore, in this study, a target-aware bilingual stance-detection model is introduced that generalizes across languages, domains, and discourse forms while remaining robust, interpretable, and resource-efficient. The proposed framework makes four substantive innovations over the existing NLI/stance-prediction paradigms:
- 1.
Target-Text Dual Encoding with Late Fusion
Instead of incorporating target and text in a single encoder, the framework encodes them separately and combines them with late fusion. This structural separation ensures that stance prediction is driven by the semantic relationship between two well-formed representations rather than surface co-occurrence patterns, thereby enabling the model to generalize across cross-format stance scenarios.
- 2.
Cross-Lingual Contrastive Alignment at Representation Level
Unlike existing cross-lingual NLI or stance models based on translation, parallel data, or shared label semantics, the proposed method aligns target-text pairs between English and Arabic using a contrastive objective. This produces a shared bilingual stance space without the need for parallel corpora, making the proposed model more robust in real-world bilingual social media settings, where parallel data are limited.
- 3.
Robustness and Explainability as Integrated Components of Stance Modeling
The framework includes perturbation-based robustness regularization, which stabilizes performance under noisy, informal, or adversarial social media text, a factor often ignored in target-aware stance studies. Additionally, token-level rationale extraction supports bilingual interpretation and enables the analysis of hallucinations, which existing target-aware or cross-lingual stance models do not address.
- 4.
Efficient Fine-Tuning for Computationally Lightweight Multilingual Deployment
Parameter-efficient fine-tuning reduces training cost while preserving performance, offering a practical, resource-aware approach for deploying stance-detection models in low-resource bilingual settings.
Together, these innovations position the proposed system not as a mere combination of existing methods but as a systematic, extensible architecture explicitly designed for target-conditioned, bilingual, and robustness-aware stance detection, a setting insufficiently supported by existing models.
The remainder of this study is organized as follows:
Section 2 outlines the features and limitations of existing stance-detection approaches, covering monolingual, multilingual, and target-aware transformer-based models, as well as widely used datasets, highlighting the theoretical and empirical background of the proposed study.
Section 3 explains the details of the datasets, preprocessing and augmentation methods, model architecture, training procedure, and evaluation procedure. It describes the proposed target-aware bilingual framework, including cross-lingual contrastive alignment, robustness mechanisms, and explainability components.
Section 4 presents results based on standard performance metrics, AUROC and AUPRC analyses, confusion matrices, and zero-shot generalization across datasets.
Section 5 presents a detailed interpretation of the results, considers generalizability, and discusses study implications. Finally,
Section 6 summarizes the main contributions, discusses the practical implications and limitations, and outlines directions for future research in bilingual stance detection.
2. Literature Review
Existing transformer-based stance-detection approaches are typically classified into monolingual, bilingual, or multilingual frameworks. Using target and text concatenation, English-language stance-detection models, including BERT, RoBERTa, and DistilBERT, are fine-tuned on benchmark datasets such as SemEval-2016, FNC-1, and VAST [19,20,21,22]. In topic-stance and claim-stance tasks, these models are used because they produce strong in-domain representations. With the introduction of explicit target encoding, target-aware architectures such as BiCond, CrossNet, and topic-injected BERT variants achieve improved stance detection [23,24]. In the Arabic domain, transformer models, such as AraBERT, MARBERT, QARiB, and CAMeLBERT, have been adapted using datasets such as AraStance, ArabicStanceX, and corpora on coronavirus vaccines [25,26]. They incorporate key aspects of Arabic, such as dialects, code-mixing, and noisy social media content, into pre-trained transformers to achieve optimal performance. Bilingual and multilingual approaches are based on mBERT, XLM-R, RemBERT, and other multilingual models [27,28]. To enhance generalization, recent studies use cross-lingual alignment, adversarial adaptation, and translation-based augmentation. In addition, contrastive objectives and shared embedding spaces enable zero-shot transfer between linguistically diverse targets and texts.
Table 1 outlines the features of the existing stance-detection models.
Despite the rapid improvement of transformer-based stance-identification models, numerous key gaps remain in the literature, especially in low-resource and bilingual contexts involving Arabic and Arabic-English. Limited incorporation of contrastive learning approaches is one of the most significant limitations. Existing stance-detection models based on MARBERT, AraBERT, or XLM-R depend on cross-entropy loss and fail to employ supervised or unsupervised contrastive objectives that may improve inter-class separability. In the context of NLP, contrastive learning has improved feature representations and reduced semantic ambiguity across closely related stance classes.
The lack of robustness limits the potential of existing models to handle noisy, informal, or dialect-rich text, a characteristic prevalent in Arabic and code-switched social media data. Transformer models trained on controlled news corpora or high-resource English datasets frequently underperform on noisy, real-world inputs that include misspellings, emojis, dialectal variations, and non-standard grammar. Although pre-trained Arabic-specific models, including MARBERT and AraBERT, have partially addressed this challenge, a research gap persists in developing noise-aware pre-training methodologies, effective data augmentation techniques, and adversarial training approaches to enhance stance-detection performance under low-quality text conditions. The explainability of stance-detection models also remains largely unexplored. Transformers are opaque, lacking a rationale for their predictions. Although attention weights can be visualized, they fail to provide properly grounded or causal explanations, underscoring the need for explainability frameworks such as SHapley Additive exPlanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), or attention-flow tracking to identify reasoning spans and increase user confidence.
Recent studies overlook target-aware modeling and treat stance detection as a generic text classification problem. The majority of existing models are unable to explicitly encode or condition on the stance target, resulting in reduced performance in complex social media settings. Another challenge stems from the absence of effective hallucination mitigation strategies within cross-lingual contexts. Despite increased attention to hallucination in generative tasks, its effects on discriminative stance models remain understudied. Current approaches often mis-tokenize Arabic words or conflate non-standard forms, compromising semantic encoding and downstream stance categorization.
3. Materials and Methods
The methodological design of this study focuses on mDeBERTa-v3 [49], a multilingual transformer known for robust contextual encoding and suitable for modeling stance in typologically distinct languages such as English and Arabic. As stance detection relies on subtle interactions between targets and stance-bearing expressions, mDeBERTa-v3 serves as a strong foundational model. However, traditional fine-tuning remains insufficient to enable cross-language transfer, robustness against noisy social media text, or adaptation to heterogeneous stance formats. To overcome these limitations, the proposed framework includes architectural customizations with parameter-efficient fine-tuning. First, a dual-encoder architecture with late fusion explicitly differentiates target and text representation pathways before fusion, thereby strengthening target conditioning and enhancing generalization across topic-level, conversational, and claim-article scenarios. Second, a cross-lingual contrastive alignment module is proposed to encourage the model to learn shared stance-relevant semantics between English and Arabic. Third, noise-aware, perturbation-resilient components are integrated to provide greater resilience against dialectal variation, code-switching, sarcasm, and spelling inconsistencies. Finally, a token-level rationale extraction layer provides interpretable evidence for each prediction, thereby enriching transparency and reliability in a bilingual setting. Collectively, these architectural and training-level improvements yield a consistent, extensible methodology for bilingual stance detection.
Figure 1 visualizes the proposed stance-detection pipeline and the interactions among its components.
3.1. Dataset Description
The existing stance-detection benchmarks, though widely used in the research community, have several limitations that restrict their suitability for the study of robust and bilingual stance modelling. SemEval-2016 is limited in scope and topical diversity. Moreover, its reliance on English Twitter data makes it less suitable for evaluating generalization to novel domains or languages. X-Stance, despite being multilingual, focuses primarily on political topics and uses structured, parallel annotation. This narrow domain focus reduces linguistic variability and undermines the applicability to informal or noisy social media discourse. EZ-Stance, while concentrating on zero-shot target transfer, is monolingual and topic-centric, and lacks conversational or document-level stance reasoning capabilities.
To overcome these limitations, this study utilizes the MT-CSD [34], VAST [35], ArabicStanceX [48], and AraStance [46] datasets. VAST enables large-scale, target-aware training in English; MT-CSD introduces conversational stance reasoning, enabling testing of cross-format and cross-lingual generalization. ArabicStanceX covers informal Arabic social media stance across diverse topics, while AraStance includes claim-article reasoning at the document level. Together, these datasets enable a thorough assessment across languages, targets, and discourse structures, aligning with the study's objective of bilingual and explainable stance detection.
Table 2 outlines the features of the datasets, including the number of samples, labels, and stance types.
3.2. Data Preprocessing and Expert-Guided Hallucination Assessment
The datasets are processed through a unified preprocessing pipeline, ensuring linguistic consistency between English and Arabic while retaining stance-relevant semantic information. For English texts, preprocessing involves eliminating non-informative information such as URLs, user mentions, hashtags, and unnecessary symbols, and normalizing excessive punctuation and repeated characters. For Arabic texts, language-specific normalization is applied to reduce orthographic and morphological variability typical of informal writing, including normalization of letter forms, removal of optional diacritics, and normalization of elongations and spelling patterns. These are measures to control surface-level noise without altering stance polarity and semantic intent. Subsequent to normalization, the data are split into two non-overlapping sets: an externally held-out test set and a development set. The held-out test set is used to evaluate the model’s generalization and is not used in the training, cross-validation, augmentation, or model selection process, ensuring impartial evaluation of generalization capability. The development set is used in a five-fold cross-validation setting. In each fold, the development data are split into fold-specific training and validation data. Robustness-oriented data augmentation strategies are applied only to the training data subset in each fold, whereas the validation data subset is not. This design eliminates the risk of information leakage and ensures that validation performance reflects generalization to unseen, unmodified data.
The augmentation strategies are label-preserving and are intended to model realistic linguistic variability found in user-generated content, including controlled lexical substitutions, minimal paraphrasing, and character-level perturbations that simulate informal linguistic writing patterns. Augmentation is not applied to validation subsets or the held-out test set.
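The normalization steps described above can be sketched with simple regular-expression rules. This is an illustrative sketch only; the function names and the exact rule set are assumptions, not the authors' implementation:

```python
import re

def normalize_english(text):
    """Remove non-informative tokens and squeeze repeated characters/punctuation."""
    text = re.sub(r"https?://\S+", " ", text)               # URLs
    text = re.sub(r"[@#]\w+", " ", text)                    # mentions and hashtags
    text = re.sub(r"([!?.])\1+", r"\1", text)               # excessive punctuation
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)             # sooooo -> soo
    return re.sub(r"\s+", " ", text).strip()

def normalize_arabic(text):
    """Unify letter variants, strip optional diacritics and elongation (tatweel)."""
    text = re.sub(r"[\u064B-\u0652]", "", text)             # optional diacritics
    text = text.replace("\u0640", "")                       # tatweel elongation
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> ya
    return re.sub(r"\s+", " ", text).strip()
```

Both functions leave content words untouched, consistent with the goal of controlling surface-level noise without altering stance polarity.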
To facilitate cross-dataset training and bilingual generalization, all datasets were mapped to a unified three-class stance scheme (Agree, Disagree, Neutral). This harmonization is necessary, as the current corpora of English and Arabic stance differ significantly in the granularity of their labels and their semantic scope. For example, Discuss and Unrelated are included in AraStance, and these are not stance categories but discourse-relation categories. These labels do not explicitly state polarity and were thus merged into Neutral in accordance with previous practices of stance normalization in multilingual environments. While this mapping is practical, it introduces three methodological considerations. First, there is an inevitable increase in neutral class heterogeneity, since the merged class covers more semantic space, from truly neutral opinions to topical commentary without polarity. Second, comparability with existing approaches becomes approximate rather than exact, especially for models assessed on the original multi-label AraStance taxonomy. We address this by presenting results strictly within the three-class framework and not directly comparing them with studies using the five-label scheme in terms of metrics. Third, for zero-shot performance interpretation, the simplification mitigates target ambiguity by excluding discourse-relation labels that do not express stance; however, it may artificially boost predictions of neutral in datasets where the "discuss" class is predominant. These caveats are explicitly recognized, and harmonized labelling is presented as a requirement for bilingual, cross-format stance modelling rather than an intrinsic property of the datasets themselves.
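The harmonization step reduces to a lookup table. A minimal sketch follows; the exact source label strings used in each corpus are assumptions for illustration:

```python
# Illustrative mapping from corpus-native labels to the unified
# three-class scheme (Agree, Disagree, Neutral).
UNIFIED_LABELS = {"Agree", "Disagree", "Neutral"}

LABEL_MAP = {
    # Claim-article style labels (AraStance-like)
    "agree": "Agree",
    "disagree": "Disagree",
    "discuss": "Neutral",    # discourse relation, not a stance
    "unrelated": "Neutral",  # no polarity toward the claim
    # Twitter topic-stance style labels
    "favor": "Agree",
    "against": "Disagree",
    "none": "Neutral",
}

def harmonize(label):
    """Map a corpus-native stance label to the unified scheme."""
    mapped = LABEL_MAP[label.strip().lower()]
    assert mapped in UNIFIED_LABELS
    return mapped
```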
In this study, hallucination refers to any lexical, semantic, or inferential rationale emphasized by the model that is not based on the input text and cannot be rationalized by explicit or implicit evidence in the original stance-bearing statement. A rationale is observed as hallucinated if it provides unsupported evidence (i.e., a highlighted token that does not exist in the text, or that is semantically irrelevant to the label on the gold stance), incorrect attribution (highlighting text that contradicts the label on the gold stance), fabricated inference (adding causal or sentiment cues not present in the content), or cross-lingual drift (misinterpreting neutral cues across languages).
In addition to preprocessing and augmentation, the study applies an expert-guided hallucination assessment framework that operates independently of primary stance datasets. Due to the lack of hallucination annotations in the datasets, hallucination is treated as an evaluation metric rather than a supervision target for model-generated explanations. Following model inference, token-level rationales generated by the model are evaluated by domain experts on a representative subset to determine whether the explanations are supported by the input text and whether the reasoning is target or semantically inconsistent.
To systematically evaluate hallucination, a stratified subset of 300 instances was selected from the three stance classes: 100 from the combined bilingual test set, 100 from MT-CSD, and 100 from AraStance, and evaluated through expert-guided annotation. Three senior NLP researchers, proficient in English and/or Arabic, independently reviewed token-level saliency maps in a blinded condition, without access to model predictions or labels assigned by other annotators. Experts labeled each rationale as either Grounded or Hallucinated based on a standardized guideline. Inter-annotator agreement was high (Fleiss' κ = 0.82; Krippendorff's α = 0.79), and disagreements were resolved by majority vote. Since hallucination assessment is based on deterministic saliency maps rather than generative outputs, the evaluation is completely reproducible in the controlled inference environment, where the same inputs will always generate the same rationales and hallucination assessments.
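For reference, the reported agreement statistic can be computed from per-item category counts. The following is a standard Fleiss' kappa implementation, not code from the study:

```python
def fleiss_kappa(rating_counts):
    """Fleiss' kappa for fixed-rater categorical annotation.

    rating_counts: list of per-item category counts, e.g. [2, 1] means
    2 annotators chose 'Grounded' and 1 chose 'Hallucinated' for that item.
    Every row must sum to the same number of raters.
    """
    n_items = len(rating_counts)
    n_raters = sum(rating_counts[0])
    n_cats = len(rating_counts[0])
    # Observed agreement per item
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from overall category proportions
    totals = [sum(row[j] for row in rating_counts) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```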
The involvement of multiple experts and the application of inter-rating protocols strengthen the methodological rigor of the hallucination analysis by reducing subjectivity and ensuring that qualitative judgments reflect consistent interpretative standards. This expert-driven validation is particularly significant for the study, as hallucination assessment in stance detection extends beyond numerical accuracy and requires nuanced evaluation of semantic coherence, thereby reinforcing the trustworthiness and interpretability of the proposed framework. These expert annotations are used solely to validate the models’ outcomes and are not used for training, dataset composition, or model parameters.
3.3. The Proposed Target-Aware Stance-Detection Approach
3.3.1. Target-Text Dual-Encoder Architecture
Traditional stance-detection approaches rely on single-encoder architectures, in which the target and stance texts are concatenated and encoded. While efficient within controlled settings, this formulation introduces several limitations. Firstly, target information is often poorly represented because it competes with adjacent textual context within a single sequence. Consequently, stance predictions rely on surface lexical overlap rather than true target-conditioned reasoning. Secondly, such models have limited generalization capability when deployed to unseen targets, novel discourse formats, or cross-lingual scenarios, as the learned representations entangle target semantics with domain-specific textual patterns. Lastly, concatenation-based encoding struggles to scale across heterogeneous stance formats, such as topic-level, conversational, and claim-article stances.
Furthermore, target expressions in stance detection are usually short, noun-phrase-level units with stable and context-independent semantics (e.g., “Electric Vehicles”, “Climate Policy”). In contrast, stance-bearing text is in the form of longer, heterogeneous, and stylistically diverse discourse segments that carry sentiment, argumentation, and pragmatic cues. Using a single shared encoder to handle both inputs forces the model to fit fundamentally different distributions into a common representational space, leading to semantic entanglement.
To overcome these limitations, the stance-detection problem is redefined as a target-conditioned representation-learning problem. The proposed framework uses completely independent parameters for the target encoder and the text encoder, motivated by the inherently asymmetric linguistic structures of targets and stance-bearing texts. By encoding the target and text independently within a dual-encoder framework, the model preserves explicit target semantics and enables controlled interaction through late fusion, ensuring that stance reasoning is explicitly conditioned on the target rather than implicitly inferred from it, thereby enhancing robustness against target variation and discourse structure.
Figure 2 illustrates the architecture of the proposed target-text dual-encoder.
The dual-encoder formulation naturally supports cross-lingual and cross-format transfer by aligning independent target and text representations. Through this formulation, the target is explicitly conditioned, reducing representational interference and leading to stable generalization. Equation (1) outlines the computation of independent contextual representations:

h_t = E_t(t; θ_t),  h_x = E_x(x; θ_x)
(1)

where h_t is the target's standalone semantics, h_x is the contextual meaning of the text, and E_t and E_x represent mDeBERTa-v3's parameterized encoder functions.
To preserve semantic disentanglement between the target and the stance-bearing text, the parameter sets θ_t and θ_x are maintained independently. Equation (2) shows the integration of the resulting representations using a late-fusion interaction operator Φ:

z = Φ(h_t, h_x) = [h_t ⊕ h_x ⊕ |h_t − h_x|]
(2)

where ⊕ denotes vector concatenation and |h_t − h_x| represents the element-wise absolute difference.
The fusion operation allows the model to comprehend the relationship between the text and the target. Through the structured comparison at the decision level, it preserves clean target and text representations. This mechanism enables explicit target-aware stance detection and enhances the model’s robustness across diverse discourse formats and languages.
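The fusion operator and a softmax stance head over the fused vector can be sketched in a few lines of NumPy. Encoder outputs are represented here by plain vectors, since the actual mDeBERTa-v3 encoders are omitted; this is a structural sketch, not the deployed model:

```python
import numpy as np

def late_fusion(h_t, h_x):
    """Late fusion: concatenate target, text, and their element-wise
    absolute difference into a single fused stance representation."""
    return np.concatenate([h_t, h_x, np.abs(h_t - h_x)])

def classify(z, W, b):
    """Softmax stance head over the fused representation."""
    logits = W @ z + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

With two 2-dimensional encoder outputs, the fused vector is 6-dimensional, and any linear head over it sees the target and text representations as separate, comparable blocks.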
3.3.2. Cross-Lingual Contrastive Alignment
While the dual-encoder architecture provides explicit target conditioning, bilingual stance detection also requires the model to learn language-invariant representations that are generalizable across English and Arabic. Standard multilingual fine-tuning typically relies on shared vocabulary or implicit parameter sharing, which is inadequate for aligning stance-expressive cues. To overcome this limitation, a cross-lingual contrastive alignment layer is proposed, a representation-level training objective that does not affect the inference architecture. The core idea is to shape the embedding space so that stance instances with the same polarity lie close together, regardless of language, while semantically unrelated stances are separated. Equation (3) describes the fused representations of the two stance instances with equivalent stance polarity:

z_i = Φ(h_t^i, h_x^i),  z_j = Φ(h_t^j, h_x^j)
(3)

where z_i and z_j represent the fused representations covering the English instance i and the Arabic instance j, and Φ is the late-fusion operator. The enforcement of contrastive alignment is presented in Equation (4):

L_align = −log ( exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ) )
(4)

where sim(·, ·) denotes cosine similarity, τ is a temperature parameter controlling the sharpness of the similarity distribution, L_align is the contrastive alignment loss, and the sum over k ranges over the positive instance j and the negative samples. This objective increases similarity between positive cross-lingual stance pairs while suppressing similarity to negative samples, effectively regularizing the shared embedding space. By encouraging stance-consistent alignment rather than lexical overlap, the contrastive loss reduces language-specific bias and enhances zero-shot transfer.
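The alignment objective can be sketched as a standard InfoNCE-style loss over fused representations; the temperature value and the toy vectors below are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_alignment_loss(z_anchor, z_positive, z_negatives, tau=0.1):
    """InfoNCE-style loss: pull the cross-lingual positive pair together
    and push negatives apart in the shared stance space."""
    pos = np.exp(cosine(z_anchor, z_positive) / tau)
    neg = sum(np.exp(cosine(z_anchor, z_n) / tau) for z_n in z_negatives)
    return float(-np.log(pos / (pos + neg)))
```

When the anchor and positive point in the same direction and negatives are orthogonal, the loss is near zero; swapping the roles makes it large, which is exactly the gradient signal that shapes the shared bilingual space.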
The effectiveness of cross-lingual contrastive learning is critically dependent on the definition of positive and negative sample pairs. In this framework, pair construction is grounded in stance-label alignment, enabling the model to learn cross-lingual stance representations without requiring semantically parallel English-Arabic data. Pairs require only stance-label alignment. Semantic, topical, or contextual similarity is not required. A positive pair is created by choosing one instance from each language for which the same stance category (Agree, Disagree, or Neutral) is assigned. This helps ensure that the contrastive layer attends to polarity-level correspondences rather than surface-level lexical similarities. In contrast, a negative pair consists of two instances, again drawn cross-linguistically, that belong to different stance categories. This drives the embeddings of opposing stances further apart in the shared representation space, allowing the model to learn language-invariant stance cues, including polarity markers, modal expressions, and opinion indicators. This label-centric design is intentionally chosen because topic-conditioned or semantic-similarity-based pairing strategies introduce unwanted biases, such as reliance on topic overlap or susceptibility to dialectal variation. By removing the focus on topical content and focusing only on stance categories, the model can learn more stable and generalizable cross-lingual polarity features. This approach guarantees scalability without the need for parallel corpora or manual semantic alignment. Thus, the positive/negative pairing strategy is a principled and resource-efficient mechanism for enforcing bilingual stance coherence.
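The label-centric pairing strategy can be sketched as follows. The 50/50 sampling ratio of positive to negative pairs is an assumption for illustration, as the study does not specify it:

```python
import random

def make_pairs(en_examples, ar_examples, n_pairs, seed=0):
    """Sample cross-lingual (anchor, partner, is_positive) triples using only
    stance labels; no parallel or topically matched data is required.
    Each example is a (text, label) tuple."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        en = rng.choice(en_examples)
        same = rng.random() < 0.5  # assumed 50/50 positive/negative mix
        candidates = [a for a in ar_examples if (a[1] == en[1]) == same]
        if not candidates:
            continue  # no partner with the required label relation
        pairs.append((en, rng.choice(candidates), same))
    return pairs
```

Because candidates are filtered only by label agreement or disagreement, topical overlap plays no role, matching the design rationale above.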
3.3.3. Robust Stance Modeling Under Noisy and Adversarial Text
Stance detection in real-world applications is often hindered by noisy, irregular language, particularly in social media and conversational corpora, where spelling variations, informal language, code-switching, and intentional obfuscation are prevalent. Models trained on unperturbed text tend to overfit on shallow lexical patterns and make unstable predictions when inputs deviate from standard forms. To overcome this limitation, the proposed framework incorporates a robustness-regularization layer that explicitly encourages prediction stability against controlled input perturbations during training. This approach ensures that small label-preserving changes to the input do not lead to excessive changes in stance predictions. Equation (5) computes the robustness enforcement function, minimizing the discrepancy between the predicted stance distributions:

L_rob = KL( p(y | x) ∥ p(y | x̃) )
(5)

where p(y | x) represents the model's predicted probability distribution over stance classes, L_rob is the robustness enforcement function, x denotes the original input text, and x̃ represents a perturbed version obtained through controlled noise injection. The regularization encourages the model to learn smoother decision boundaries and to base its predictions on semantically salient signals. Implemented through controlled perturbations, the robustness loss acts as a constraint on the classifier, stabilizing stance predictions under linguistic noise such as misspellings, informal orthography, dialectal variation, and lightweight perturbations. By penalizing prediction inconsistency, the model becomes less sensitive to orthographic noise, informal language, and adversarial perturbations, maintaining computational efficiency while achieving improved generalization on noisy, real-world inputs.
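A minimal sketch of this consistency penalty, assuming KL divergence as the discrepancy measure between the clean and perturbed stance distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors, with smoothing for zeros."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def robustness_loss(p_clean, p_perturbed):
    """Consistency penalty: discrepancy between the stance distribution for
    an input and that of its label-preserving perturbation."""
    return kl_divergence(p_clean, p_perturbed)
```

Identical distributions incur zero penalty; the more a perturbation flips the predicted stance, the larger the penalty, which pushes the classifier toward smoother decision boundaries.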
3.3.4. Token-Level Rationale Extraction
In order to improve transparency and interpretability, the proposed framework combines a token-level rationale extraction layer that operates post-prediction. This component uncovers the internal decision-making process by identifying the input text that influences a specific stance prediction. Such fine-grained interpretability is particularly important in the context of bilingual and cross-format stance detection, where predictions may otherwise appear opaque or hard to justify. Equation (6) shows the estimation of token importance using a gradient-based saliency approach:

s_i = | ∂ŷ / ∂x_i |
(6)

where |·| denotes the absolute magnitude of the gradient, x_i is the input token, ŷ is the predicted stance, and a higher value of s_i indicates greater influence on the model's prediction.
The extracted rationales provide interpretable evidence for the predictions in English and Arabic, enabling qualitative inspection and comparative analysis across languages. Crucially, these token-level explanations form the foundation for evaluating hallucinations, allowing evaluators to check whether the highlighted evidence is based on the input text and related to the target, and to hold the model accountable and ensure the reliability and trustworthiness of its predictions.
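Since gradient saliency requires autograd, the same idea can be conveyed with a gradient-free occlusion variant: score each token by how much masking it changes the predicted stance probability. This is a stand-in sketch, not the gradient-based method of Equation (6), and `toy_predict` is a hypothetical stand-in for a real stance classifier:

```python
def occlusion_saliency(predict, tokens, target_class):
    """Score each token by the drop in the target-class probability
    when that token is replaced with a mask symbol.
    `predict` maps a token list to a class-probability list."""
    base = predict(tokens)[target_class]
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores.append(base - predict(masked)[target_class])
    return scores

def toy_predict(tokens):
    """Hypothetical two-class stance model used only to demonstrate the API."""
    p = 0.9 if "great" in tokens else 0.2
    return [p, 1.0 - p]
```

On the toy model, masking the opinion-bearing token produces a large saliency score, while masking a neutral token leaves the prediction unchanged.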
3.3.5. End-to-End Proposed Stance-Detection Pipeline
The proposed stance-detection framework is conceptualized as a unified end-to-end pipeline that combines architectural modifications and auxiliary training objectives while maintaining a single, fixed inference pathway. During inference, the model maps a target-text pair to a stance prediction via a series of compositional operations. Equation (7) outlines the model’s prediction using the proposed approach.
where $W$ and $b$ are the classifier parameters and $\|$ denotes the parallel encoding of the target and text representations prior to late fusion.
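A minimal sketch of the late-fusion classification step, with toy dimensions and randomly initialized parameters standing in for the trained encoder outputs and classifier:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion_predict(target_vec, text_vec, W, b):
    # parallel encoding: concatenate the two encoder outputs, then classify
    h = target_vec + text_vec  # list concatenation, i.e. [E_target(t) || E_text(x)]
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)

rng = random.Random(0)
d = 4  # toy embedding size; the real encoders emit much larger vectors
W = [[rng.uniform(-1.0, 1.0) for _ in range(2 * d)] for _ in range(3)]
b = [0.0, 0.0, 0.0]

probs = late_fusion_predict([0.2] * d, [0.5] * d, W, b)  # Agree/Disagree/Neutral
assert len(probs) == 3 and abs(sum(probs) - 1.0) < 1e-9
```

Keeping the two encoders separate until this fusion step is what lets the target representation stay stable while the text representation varies with input length and noise.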
In addition, the stance prediction is implemented using a fixed dual-encoder and late-fusion architecture. Contrastive alignment and robustness objectives are used during training, influencing representation learning without affecting the deployed inference pipeline. Equation (8) shows the composite loss function.
where $\mathcal{L}_{\text{total}}$ is the total loss, $\lambda_1$ and $\lambda_2$ are the weighting coefficients, and $\mathcal{L}_{\text{CE}}$ supervises the stance classification accuracy.
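The composite objective can be sketched directly; the weighting values and auxiliary loss magnitudes below are illustrative, not the tuned hyperparameters:

```python
import math

def cross_entropy(probs, label, eps=1e-9):
    # supervises stance classification accuracy
    return -math.log(probs[label] + eps)

def composite_loss(probs, label, contrastive_loss, robustness_loss,
                   lam1=0.5, lam2=0.3):
    # lam1 and lam2 are the weighting coefficients of the auxiliary objectives
    return (cross_entropy(probs, label)
            + lam1 * contrastive_loss
            + lam2 * robustness_loss)

total = composite_loss([0.7, 0.2, 0.1], label=0,
                       contrastive_loss=0.4, robustness_loss=0.1)
# auxiliary terms only add to the supervised loss
assert total > cross_entropy([0.7, 0.2, 0.1], 0)
```

Because the auxiliary terms shape only the learned representations, they can be dropped at inference time without changing the deployed pipeline.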
The proposed stance-detection model relies on a single, coherent prediction architecture, and auxiliary learning signals serve as enhancements during training. This design ensures computational efficiency at deployment while enabling improved target awareness, cross-lingual generalization, and robustness during learning.
3.4. Experimental Setting
The stance corpora used in this study were harmonized into a unified target-text format and merged into a single bilingual pool combining the English VAST instances with the Arabic ArabicStanceX instances. From this combined dataset, 80% of the data formed the development set, and the remaining 20% were held out exclusively for final generalization testing. The development set was partitioned into five folds for cross-validation, with each fold representative of the English and Arabic samples and balanced across stance labels. In each cross-validation run, four folds (64% of the entire dataset) were used to train a target-aware bilingual model, and the remaining fold (16%) served as the validation set, yielding five independently trained bilingual models. To avoid double exposure, model overfitting, and unnecessary computational cost, no additional model was retrained on the complete 80% development set. Instead, all five cross-validated models were evaluated directly on the held-out 20% test set, which preserved the original English-Arabic composition. Final performance was computed by averaging the prediction probabilities of the five models for each test instance, yielding a robust estimate of bilingual generalization to unseen data.
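The split-and-ensemble protocol can be sketched as follows; the dataset size is a toy value, and interleaved fold assignment stands in for the stratified, label-balanced split described above:

```python
import random

rng = random.Random(42)
n = 1000
indices = list(range(n))
rng.shuffle(indices)

# 80% development pool, 20% held-out test set
dev, test = indices[:800], indices[800:]

# five folds over the development pool: each fold is 16% of the full dataset
folds = [dev[i::5] for i in range(5)]
assert all(abs(len(f) - 0.16 * n) < 1 for f in folds)

# ensemble: average the predicted probabilities of the five fold-models
def average_probs(per_model_probs):
    k = len(per_model_probs)
    return [sum(p[c] for p in per_model_probs) / k
            for c in range(len(per_model_probs[0]))]

avg = average_probs([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1],
                     [0.7, 0.2, 0.1], [0.65, 0.25, 0.1]])
assert abs(sum(avg) - 1.0) < 1e-9
```

Each of the five fold-models trains on the other four folds (64% of the data), and the averaged probabilities on the held-out 20% give the final reported scores.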
Table 3 highlights the key experimental settings for the model implementation.
To achieve computational efficiency and practical deployability, the proposed framework follows a parameter-efficient fine-tuning approach based on mDeBERTa-v3. Rather than updating all transformer parameters during fine-tuning, the process selectively updates the task-specific layers of the dual encoders and the late-fusion classifier, while the lower-level shared representations remain largely frozen. This strategy greatly reduces training overhead and memory usage without sacrificing performance. Importantly, auxiliary objectives such as contrastive alignment and robustness regularization act on the learned representations without adding extra parameters at inference time. As a result, the proposed model achieves strong bilingual stance performance while remaining suitable for low-resource multilingual deployment scenarios.
The proposed bilingual stance-detection framework is evaluated using a combination of traditional classification metrics and reliability indicators to provide a balanced assessment of predictive performance and trustworthiness. Accuracy measures overall correctness across stance classes, while precision, recall, and F1-score assess class-specific performance and address imbalances in stance labels. In addition to predictive performance, the hallucination rate is included to determine how faithful model explanations are. To ensure the robustness of the conclusions, all metrics are reported as mean ± standard deviation across cross-validation folds, and 95% confidence intervals are computed to quantify performance stability. This strategy for reporting results ensures that observed enhancements reflect consistent model behavior rather than random variations.
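The fold-level reporting convention (mean ± standard deviation with a 95% confidence interval) can be reproduced with a short helper; the fold scores below are hypothetical:

```python
import math

def mean_std_ci(fold_scores, z=1.96):
    # mean, sample standard deviation, and normal-approximation 95% CI
    n = len(fold_scores)
    mean = sum(fold_scores) / n
    var = sum((s - mean) ** 2 for s in fold_scores) / (n - 1)
    std = math.sqrt(var)
    half = z * std / math.sqrt(n)  # CI half-width
    return mean, std, (mean - half, mean + half)

f1_per_fold = [0.872, 0.881, 0.876, 0.879, 0.874]  # hypothetical fold F1-scores
mean, std, (lo, hi) = mean_std_ci(f1_per_fold)
assert lo < mean < hi
```

A narrow interval, as reported for the proposed model, indicates that performance varies little across folds and is not driven by a favorable partition.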
4. Results
The results of five-fold cross-validation in
Table 4 show that the proposed bilingual stance-detection framework achieves consistently high performance across all evaluation metrics, with limited variance across folds and narrow confidence intervals. This consistency demonstrates that the learned representations are not overly biased by any particular data partition and that the model generalizes stably on the development set. Such behavior is especially relevant to stance detection, where instability often arises from linguistic variation, topic diversity, and label ambiguity.
The observed performance can be attributed to the structural and learning mechanisms adopted in the proposed approach. The target-text dual-encoder formulation explicitly separates target semantics from stance-bearing content, thereby limiting representational interference and enabling more precise target conditioning. Cross-lingual contrastive alignment constrains the space of representations to maintain stance polarity across languages, thereby facilitating consistent predictions in bilingual contexts. Robustness-oriented regularization further reduces the sensitivity to surface-level noise and informal linguistic patterns, which are common in social media and conversational data. Additionally, averaging probabilities across cross-validation folds provides a more reliable estimate of model behavior by reducing the effects of variance in individual-fold predictions. Collectively, these characteristics are the source of robust, reproducible performance across all cross-validation folds.
The radar chart for the combined bilingual test set in
Figure 3 offers a concise visual summary of class-wise behavior for the proposed stance-detection model, complementing the reported overall accuracy of 88.1%. This suggests that the target-aware representation effectively captures explicit supportive and opposing cues in English and Arabic inputs. The similarly large, regular polygons for these two classes indicate stable decision boundaries and a limited precision-recall trade-off, which is desirable in stance-sensitive applications such as opinion analysis and policy monitoring.
The Neutral class has slightly lower values for all three metrics and thus a comparatively smaller polygon. This behavior is expected and reflects the inherent ambiguity of neutral stance expressions containing mixed or hedging language and weaker lexical signals. Importantly, the decrease in performance for the Neutral class does not significantly affect overall accuracy, suggesting that the model maintains its ability to discriminate without bias towards polarized stances. The smooth, non-fragmented contours of the radar plot provide further indication that performance degradation is controlled and systematic rather than unstable or highly variable.
Table 5 provides critical insights into the behavior of each architectural and training component. The single-encoder baseline underperforms due to its limited capacity to learn target-conditioned representations. Adopting a dual-encoder design yields a substantial gain, empirically validating the necessity of disentangling target and text semantics for stance reasoning. Cross-lingual contrastive alignment further promotes bilingual generalization by aligning stance-relevant cues across languages, while robustness regularization promotes stability under noisy inputs. Their combined optimization yields consistent additive benefits, evidence of complementary effects. Comparative experiments with alternative pairing strategies, including topic-conditioned and embedding-based semantic pairing, revealed that these strategies yield weaker or unstable transfer due to topical bias and dialectal variability. By contrast, label-aligned pairing was consistently more stable during optimization and achieved higher zero-shot performance, resulting in 2–4% improvements. This construction strategy is therefore an effective, computationally practical way to impose cross-lingual stance coherence without aligned bilingual resources. A controlled ablation comparison (A4 and A5 vs. A6) provides empirical support for the effectiveness of the robustness regularization function. Without it, the variant achieves strong performance but exhibits higher sensitivity to noisy or orthographically inconsistent inputs. Introducing the robustness loss (Dual-encoder + Contrastive + Robustness) yields consistent improvements across metrics and reduces prediction volatility, demonstrating the model’s ability to generalize across noisy textual environments.
Finally, ensemble averaging helps consolidate these improvements by reducing variance, thereby establishing that these final performance gains are due to the integrated framework, not the ensemble process alone.
Furthermore, to test whether separate parameter sets for the target and text encoders introduce redundancy or instability, we implemented a partial-sharing variant in which the lower transformer blocks were shared and the upper blocks remained decoupled. This design is less parameter-rich but more prone to representational interference, since the early layers must represent mixed distributions of short target phrases and longer stance-bearing text. Empirically, partial sharing produced mild reductions (≈0.7–1.1%) in precision and F1-score, and convergence became less stable in cross-lingual transfer settings, especially for Arabic-English zero-shot predictions, indicating that shared early layers could not achieve a clean semantic separation. Accordingly, full decoupling offers a clear methodological advantage: it preserves disentangled representations and produces more stable and discriminative target–text interactions under the dual-encoder fusion mechanism. These results are consistent with the overall ablation findings, indicating that architectural choices (rather than mere scale or classifier depth) drive the performance gains.
The Area Under the Receiver Operating Characteristic (AUROC) curves in
Figure 4 demonstrate the strong discriminative capability of the proposed stance-detection framework across the Agree, Disagree, and Neutral classes. In contrast to accuracy, AUROC evaluates performance across all conceivable decision thresholds and is thus particularly suitable for stance detection, where class imbalance and heterogeneous confidence levels are prevalent. The sharp upward bends of all three ROC curves indicate that the model consistently achieves high true-positive rates while maintaining low false-positive rates. Notably, the Disagree class achieves the highest AUROC (0.991), suggesting that oppositional stances have distinctive semantic and lexical signatures that the model captures well. The Agree class also shows strong separability, with an AUROC of 0.985.
Although the Neutral class reports a slightly lower AUROC (0.974), this outcome is due to the inherently ambiguous nature of neutral expressions, which lack explicit stance markers. Several factors contribute to this strong performance. The target-aware dual-encoder architecture helps ensure that stance predictions are explicitly conditioned on the target, enabling more accurate modeling of target-text interactions. Cross-lingual contrastive learning further improves performance by aligning stance-relevant semantics across English and Arabic, thereby reducing reliance on surface-level lexical cues. Moreover, robustness-oriented training and controlled data augmentation promote stability under noisy, informal textual conditions. Collectively, these design choices lead to well-calibrated ranking behavior, as evidenced by the consistently high AUROC values across stance categories.
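AUROC itself reduces to a simple ranking statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch with hypothetical one-vs-rest scores:

```python
def auroc(scores, labels):
    # Mann-Whitney formulation: probability a random positive is ranked
    # above a random negative (ties count half)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical scores for a one-vs-rest "Disagree" detector
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
a = auroc(scores, labels)
assert 0.0 <= a <= 1.0
```

Because the statistic depends only on ranking, it is insensitive to the decision threshold, which is why it complements the thresholded accuracy and F1 figures reported elsewhere.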
The precision-recall (PR) curves in
Figure 5 highlight the model’s ability to maintain high precision as recall increases, which is especially important in stance detection, where false positives can affect subsequent analytical steps. The consistently high precision at moderate to high recall levels indicates that the model rarely assigns definitive stance labels when corroborating evidence is lacking. The Disagree class achieves the highest area under the PR curve (AUC-PR) of 0.982, corresponding to effective identification of explicit rejection cues, while the Agree class follows closely with AUC-PR = 0.973. The relatively lower AUC-PR for the Neutral class (0.958) indicates natural overlap between neutral expressions and weakly polarized statements.
Three precision-specific factors contribute to this strong PR performance. First, token-level rationale extraction ties predictions to linguistically grounded evidence, reducing overconfident false positives and improving precision. Second, the late-fusion decision mechanism defers uncertain predictions until sufficient stance evidence has been aggregated, maintaining recall without sacrificing precision. Third, calibration-aware training produces smoother confidence distributions, helping the model retrieve true stance instances at higher recall thresholds. Collectively, these mechanisms ensure high recall without the precision loss observed in generically fine-tuned transformer models.
Figure 6 shows that the proposed target-aware bilingual framework achieves 85.0% accuracy under zero-shot conditions, with balanced performance across classes. The error distribution further suggests that the remaining limitations stem from natural linguistic ambiguity rather than model instability. The dominant diagonal entries confirm that the model is correct for the majority of instances across all three stance categories. In particular, the Agree class has the highest number of accurate predictions, showing that the model recognizes supportive stance cues even though it was not trained on conversational data.
Similarly, the model achieved strong performance in identifying correct Disagree instances, demonstrating effective capture of explicit oppositional expressions. Misclassifications are limited and follow systematic patterns rather than random noise. Most errors occur between the neutral and polarized classes, as expected given the semantic overlap between weakly expressed stances and neutrality in conversational contexts. For example, a few neutral cases were misclassified as Agree or Disagree, reflecting borderline instances in which implicit stance cues are present but not strongly articulated. Notably, there is minimal direct confusion between Agree and Disagree, indicating that the model maintains clear polarity separation.
Figure 7 shows that the proposed target-aware bilingual framework generalizes well to the claim-article stance setting, achieving an accuracy of 86.8% under zero-shot conditions. The errors are few and semantically plausible, suggesting that the remaining misclassifications stem from linguistic ambiguity rather than model instability or overfitting. They follow a stance-proximity pattern, occurring predominantly between the neutral and polarized classes. For example, a small fraction of neutral cases is incorrectly classified as either “Agree” or “Disagree”, reflecting the inherent ambiguity of neutral claims that may provide only weak evaluative cues. Notably, direct confusion between “Agree” and “Disagree” is virtually absent, indicating that the model maintains a clear distinction between the two polarities.
Figure 8 shows that the proposed stance-detection framework outperforms existing bilingual and multilingual transformer baselines on the combined bilingual test set. This performance improvement is primarily due to the model’s target-aware dual-encoder architecture, which explicitly models the interaction between stance targets and textual evidence, rather than the implicit concatenation of sequences. Consequently, the model learns stance as a relational phenomenon, thus improving its robustness to topic variation and cross-lingual semantic shifts.
In terms of efficiency, the framework remains computationally feasible. Built on an mDeBERTa-v3 backbone with ~276 million parameters, it is comparable in scale to XLM-R-base while substantially lighter than instruction-tuned or generative models such as FLAN-T5. The use of parameter-efficient fine-tuning and a discriminative classification head allows faster inference, with approximately 15–20% lower inference time than generative baselines under identical hardware conditions. These efficiency gains underscore the roles of architectural conditioning and representation learning in achieving strong bilingual stance-detection performance.
Figure 9 shows that the proposed model achieves better precision, recall, F1-score, and accuracy on the MT-CSD test set than both pre-trained multilingual transformers and state-of-the-art approaches. This performance advantage is mainly due to the target-aware formulation adopted in the proposed framework. Unlike standard pre-trained models such as XLM-R, mBERT, or mRoBERTa [
33,
41,
42,
43] that tend to encode the target and the stance-bearing text using simple concatenation, the proposed dual-encoder architecture explicitly disentangles the target semantics from the conversational content, allowing the model to reason about stance as a relational dependency, enhancing its generalization ability.
Furthermore, the addition of cross-lingual contrastive alignment enables the model to learn representations relevant to stance while remaining robust to lexical and contextual variation. In contrast, many existing state-of-the-art methods are tailored for topic-level stance and exhibit reduced transferability in the conversational domain. Additionally, the robustness regularization used in the proposed approach reduces sensitivity to informal language, pragmatic cues, and discourse noise, maintaining superior generalization.
Figure 10 shows that the proposed model achieves the best and most balanced performance across precision, recall, F1-score, and accuracy on the AraStance test set, outperforming both general-purpose multilingual transformers and Arabic-specific baselines. In AraStance, the claim-article relationships involve implicit stance cues distributed across longer contexts that are not always lexically aligned with the target. Standard pre-trained models such as XLM-R, mBERT, and mRoBERTa represent such inputs using simple sequence concatenation, which can obfuscate the boundary between target semantics and supporting evidence, leading to poor stance discrimination.
Arabic-specific transformers, such as AraBERT and MARBERT-based methods [
44,
45,
46,
47], benefit from language adaptation. However, these models are monolingual and lack structured target-conditioning mechanisms, limiting their ability to generalize across domains and discourse styles. In contrast, the proposed dual-encoder architecture explicitly differentiates target and text representations and integrates them via late fusion, allowing for precise relational reasoning. Additionally, robustness-aware training reduces sensitivity to stylistic variation and implicit negation, which are prevalent in Arabic news and fact-checking content.
Figure 11 provides qualitative evidence that the proposed bilingual stance-detection framework makes linguistically grounded and semantically appropriate decisions across English and Arabic inputs. In the “Agree” English example, the model correctly elevates the saliency of stance-bearing tokens (e.g., support, climate, and policy), while assigning negligible weight to syntactic fillers (e.g., I, the). This behavior indicates that the system is not relying on superficial patterns but is instead attending to opinion-expressive lexical cues relevant to stance prediction. A similar pattern emerges in the Arabic “Disagree” instance, where key polarity indicators such as “غير عادل” (unfair) and “سلبًا” (negatively) receive the highest importance scores. The ability to highlight morphologically complex, sentiment-rich Arabic tokens demonstrates that the model effectively captures semantic nuance in a morphologically rich language.
The neutral examples further reinforce the reliability of the rationales. Rather than assigning high salience to sentiment-laden words, the model distributes importance across informational tokens, such as discussion and essential points, accurately reflecting the absence of explicit polarity. The Arabic neutral input shows a similar distribution across آراء مختلفة (“different views”), aligning with the non-committal stance.
Hallucination analysis is an important step in ensuring the trustworthiness of stance-prediction models, which are often deployed in bilingual and cross-lingual contexts where polarity frequently hinges on subtle variations in lexical meaning. As shown in
Table 6, hallucinated rationales often occur when the model attributes stance decisions to tokens that do not exist in the input, are not semantically relevant to it, or contradict it. Such attribution errors undermine interpretability and may bias downstream analyses in sensitive domains such as political discourse or misinformation monitoring. By systematically identifying and quantifying these errors, the study ensures that the model’s predictions are anchored in textual evidence, thereby enhancing transparency, user trust, and expert verifiability.
Evaluating hallucination rates is critical to stance detection, especially in applications involving social discourse, political narratives, or fact-checking, where misleading explanations may distort downstream interpretations. The hallucination rate analysis in
Table 7 provides strong evidence for the reliability and interpretability of the proposed stance-detection framework. Across all evaluated datasets, the hallucination rate remains consistently low, with values below 4%, indicating that the model very rarely supports its stance predictions with fabricated or semantically inconsistent rationales. The lowest hallucination rate, 2.9%, is observed on the combined bilingual test set, indicating stable reasoning. Notably, even in zero-shot generalization settings such as MT-CSD and AraStance, which have varied discourse structures and input lengths, the hallucination rate increases only marginally, highlighting the robustness of the learned representations.
The reduced hallucination can primarily be attributed to the target-aware dual-encoder model, which explicitly conditions predictions on target-text relationships and discourages reliance on loosely correlated contextual cues. In addition, token-level rationale extraction requires grounding decisions in identifiable lexical evidence, thereby limiting unsupported inference. The robustness-oriented training also makes the model more stable to noisy or informal inputs, a common cause of spurious reasoning in stance detection. By incorporating hallucination analysis, the study shows that the proposed model is not only competitive in predictive performance but also reliable and evidence-consistent in reasoning across diverse bilingual and cross-format stance datasets.
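The grounding check underlying such a hallucination rate can be sketched simply; the token-matching criterion and the (input, rationale) examples below are illustrative simplifications of the evaluators' protocol:

```python
def is_hallucinated(rationale_tokens, input_tokens):
    # a rationale is flagged if any cited token is absent from the input
    vocab = set(input_tokens)
    return any(t not in vocab for t in rationale_tokens)

def hallucination_rate(examples):
    # fraction of predictions whose rationale is not grounded in the input
    flagged = sum(is_hallucinated(r, x) for x, r in examples)
    return flagged / len(examples)

# hypothetical (input_tokens, rationale_tokens) pairs
examples = [
    (["i", "support", "the", "policy"], ["support", "policy"]),  # grounded
    (["the", "plan", "is", "unfair"], ["unfair"]),               # grounded
    (["prices", "rose", "today"], ["tax"]),                      # fabricated token
]
rate = hallucination_rate(examples)
assert abs(rate - 1 / 3) < 1e-9
```

In practice the check also covers semantic relevance to the target, which exact token matching alone cannot capture, so evaluator judgment remains part of the protocol.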
To assess the practical deployability of the proposed bilingual stance-detection framework, we conducted a thorough analysis of its computational cost, inference efficiency, and accuracy-efficiency trade-off.
Table 8 highlights latency, throughput, and GPU usage for the key model variants. Compared to the single-encoder model, the dual-encoder introduces a small latency increase (+0.13 ms) but yields absolute gains of +10.5% to +12.4% in accuracy and F1-score. At deployment, the proposed model has a mean inference latency of 1.22 ms per input, corresponding to a throughput of 820 samples/s, making it suitable for high-volume or real-time social media monitoring. The proposed configuration therefore represents a favorable trade-off between accuracy and computational cost. Enhancements such as contrastive alignment and robustness regularization improve cross-lingual generalization and stability with minimal runtime overhead, and shared frozen layers keep the memory overhead at inference minimal. The only marginal throughput degradation confirms the architecture’s operational efficiency. Overall, the empirical results demonstrate that the proposed architecture is computationally efficient while achieving significant improvements in bilingual stance-prediction performance, providing a practical, scalable solution for real-world deployment.
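As a quick consistency check on the reported figures (assuming single-stream, per-sample inference, where throughput is simply the reciprocal of latency):

```python
# throughput (samples/s) ≈ 1000 / latency_ms for single-stream inference
latency_ms = 1.22              # reported mean per-input latency
throughput = 1000.0 / latency_ms
assert 810 < throughput < 830  # ≈ 820 samples/s, matching the reported value
```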
The sample instances outlined in
Table 9 provide qualitative evidence that supports the credibility and practical effectiveness of the proposed stance-detection framework. Across both English and Arabic examples, the model consistently produces predictions that align with the gold labels, suggesting that the learned representations generalize well across languages and topical domains. Importantly, the examples cover the three stance categories: Agree, Disagree, and Neutral, which shows balanced performance rather than bias toward polarized opinions. This is especially apparent in the Neutral cases, where the model correctly identifies hedging or contrastive clues, such as concessive expressions, which are often difficult for stance classifiers. The addition of token-level rationales enhances the model’s credibility by demonstrating that its predictions are grounded in semantically meaningful lexical evidence. In English and Arabic cases, highlighted tokens correspond directly to stance-bearing expressions rather than to incidental or spurious words. This alignment aids the quantitative analysis of hallucinations by demonstrating that the model’s decisions are explainable and evidence-consistent. Additionally, the Arabic examples validate that the model captures stance cues in Modern Standard Arabic without relying on translation artefacts or language-specific heuristics.
5. Discussion
In this research, a framework for target-aware bilingual stance detection is proposed. It incorporates contrastive learning, robustness mechanisms, token-level rationales, a dual-encoder fusion architecture, and efficient fine-tuning. Firstly, a contrastive learning mechanism is used to achieve cross-lingual semantic alignment, allowing the model to map stance-relevant cues from English and Arabic into a common latent space. Secondly, a robustness-enhanced stance learner is developed through noise-resistant encoding and normalization strategies to process noisy, adversarial, and code-switched user-generated text. Thirdly, token-level rationale extraction is integrated to provide comprehensive, bilingual explanations of the stance decision, thereby enhancing interpretability and trustworthiness. Lastly, a dual-encoder architecture with late fusion is proposed to explicitly condition stance predictions on the target while maintaining distinct semantic representations of the target and the text, thereby improving generalizability to novel topics and claim types. These contributions collectively address ongoing issues in stance modeling across languages, topics, target types, and discourse contexts. In contrast to traditional stance classifiers, the proposed study builds a single bilingual model capable of reasoning over English and Arabic. The experimental results show that bilingual semantic alignment, explicit target conditioning, and interpretability improve generalization and cross-domain transferability.
The proposed model outperforms the state-of-the-art methods by addressing cross-lingual transferability, robustness to noisy input text, and target-aware reasoning. While current architectures [
41,
42,
43,
44,
45,
46,
47,
48], including AraBERT, MARBERT, XLM-R, and mBERT, are built on monolithic encoder representations or target-text concatenations, the dual-encoder model distinguishes target semantics from stance-bearing content, yielding highly stable decision boundaries across topic and claim variations. Contrastive learning enables the model to reconcile stance representations across English and Arabic, achieving better zero-shot generalization where existing stance-detection models have failed due to cultural and lexical differences. Robustness-aware processing mitigates noise, dialectal variation, and adversarial perturbations that degrade performance in baseline systems. Token-level rationale extraction improves decision quality by uncovering core stance indicators and mitigating the spurious correlations common in traditionally fine-tuned transformers. Collectively, these architectural advancements enable the proposed model to achieve better accuracy, cross-lingual consistency, and interpretability than existing transformer-based stance-detection systems.
Although trained on topic-level bilingual data, the strong performance of the proposed model on MT-CSD and AraStance compared to the existing approaches [
32,
42,
43,
44,
45,
46,
47,
48] can be attributed to its target-conditioned representation design rather than the learning of dataset-specific patterns. MT-CSD and AraStance differ substantially from the combined bilingual dataset: MT-CSD involves conversational stance, while AraStance focuses on claim-article relations. Conventional stance models entangle target semantics with surface textual cues learned during training, resulting in poor generalization. In contrast, the dual-encoder architecture makes an explicit distinction between target and text representations, allowing the proposed model to express stance as a relational property rather than as a dataset-dependent label association. Moreover, the cross-lingual contrastive alignment objective encourages the model to learn stance-salient semantic dimensions that are independent of language, context length, and discourse form, enabling the transfer of stance cues extracted from short topic-centric inputs to longer conversational or document-level contexts. The robustness regularization also stabilizes predictions by reducing sensitivity to informal phrasing, implicit disagreement, and contextual noise. Ultimately, these mechanisms allow the model to maintain consistent decision boundaries across structurally diverse datasets, resulting in reliable zero-shot generalization without task-specific retraining.
Although recent text-detection models such as SwinTextSpotter, CM-Net, and Text-Pass Filter have achieved reliable performance in visual text localization and recognition, these systems operate in a fundamentally different problem space from the proposed study. Such models aim to extract text from images, whereas in the proposed framework the text is assumed to be available and the problem is addressed at a higher linguistic level: target-aware stance prediction in English and Arabic. Accordingly, the contribution of this study lies not in text detection but in bilingual stance reasoning, target conditioning, and robust cross-format generalization, which are not handled by conventional scene-text approaches.
Similarly, previous studies [
50,
51] that interpret visual targets with a single encoder, such as models used for traffic sign interpretation or first-person scene understanding, rely on joint encoding of target and context. In stance detection, however, this single-encoder formulation leads to entangled representations in which the short, stable target phrase competes with the longer, noisier stance text for representational space. The proposed dual-encoder architecture overcomes this limitation by keeping the target and text encoders independent, preserving clear target semantics and enabling explicit late-fusion interaction. As the ablation results confirm, this separation supports stronger cross-lingual transfer, improved interpretability, and better generalization across heterogeneous stance formats.
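The late-fusion dual-encoder idea can be sketched in a few lines. Here a toy hashed bag-of-words stands in for the two transformer encoders, and a random linear head stands in for the trained classifier; the function names, the fusion via concatenation with an elementwise product, and the three-way label set are illustrative assumptions, not the paper's exact design.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

def encode(text, dim=16):
    """Stand-in encoder: a hashed bag-of-words (a transformer in practice)."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def dual_encoder_stance(target, text, head):
    """Late fusion: target and text are encoded independently and
    interact only in the fused feature vector fed to the classifier."""
    t = encode(target)                       # short, stable target phrase
    x = encode(text)                         # longer, noisier stance text
    fused = np.concatenate([t, x, t * x])    # explicit late-fusion interaction
    return ["favor", "against", "neutral"][int(np.argmax(fused @ head))]

head = rng.normal(size=(48, 3))              # toy linear classification head
pred = dual_encoder_stance("climate policy",
                           "we must act now to cut emissions", head)
print(pred)
```

Because the target never shares an encoder with the stance text, its representation cannot be crowded out by the longer input, which is the separation the ablation results credit for the improved cross-lingual transfer.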
The study’s findings have broader implications for bilingual stance analysis and cross-lingual conversation monitoring. By demonstrating that effective bilingual stance detection can be achieved without extensive retraining, the proposed framework offers a feasible, resource-efficient solution for analyzing linguistically diverse content. The cross-lingual alignment enables real-time monitoring of narratives that develop in one language and spread to another, which is crucial for applications in crisis response, digital diplomacy, and multilingual content moderation. The interpretability provided by token-level rationales enhances trust and accountability, making the system suitable for high-stakes settings such as fact-checking programs, regulatory oversight, and social media governance. Furthermore, the parameter-efficient fine-tuning approach makes advanced bilingual stance detection accessible to institutions with limited computational resources.
Despite the study’s strengths, it has several limitations. The mixed training strategy is based on topic-level stance data and does not include conversational turns or long-form document structures. Although the model generalizes well to structurally different datasets such as MT-CSD and AraStance in zero-shot settings, the lack of explicit supervision for these discourse formats may affect the model’s real-time performance. The effectiveness of contrastive alignment depends on the range and representativeness of bilingual target-text pairs, and domains with sparser cross-lingual overlap may yield weaker alignment. Token-level rationales, though informative, fail to capture deeper aspects of conversational pragmatics such as sarcasm, idiomatic expressions, and culturally embedded stance cues. Moreover, the limited use of user-level metadata, temporal patterns, and social-interaction dynamics may constrain classification performance. Finally, the study covers only English and Modern Standard Arabic, restricting its generalizability.
Future studies may consider conversational and thread-level models to build stance reasoning that accounts for discourse flow, reply chains, and accumulated context. Developing multimodal stance-detection methods that combine text with images, video, or speech could capture the varied ways in which stance is expressed online. Extending the cross-lingual alignment framework to additional languages, especially those with sociopolitical or cultural links to Arabic, would increase multilingual transferability. Combining token-level rationales with evidence-retrieval or sentence-level justification mechanisms could yield more robust explanations. Incorporating continual learning or domain adaptation techniques may help the model adapt to temporal drift and evolving public discourse. Finally, distilling the ensemble into a single model may yield a reliable stance detector with a smaller computational footprint.