1. Introduction
Social media serves as a platform for political discourse, shaping individuals' perceptions of societal well-being and facilitating real-time civic engagement [1,2,3]. Individuals express emotions through direct or indirect stances toward specific targets/topics, such as policies, individuals, organizations, or events. Stance detection enables automated processing of individual perceptions of specific issues on social media, offering systematic insights into how opinions are formed, polarized, and propagated within ideological and socio-political discourse [4,5,6]. It determines support, opposition, or neutrality towards a specific target. Additionally, it supports fine-grained analysis of public discourse and misinformation, and enables evidence-based decision-making in a highly dynamic digital setting.
Although English dominates social media discourse, Arabic remains one of the most influential languages in the Middle East and North Africa, reflecting the region's cultural, political, and social dynamics [7,8,9]. On social media, Arabic and English are frequently mixed, producing code-switched content that limits the applicability of monolingual modelling. As a result, bilingual stance understanding is crucial for identifying the viewpoints that emerge in these digitally connected, linguistically diverse communities [10,11,12]. The development of transformer-based architectures and large language models (LLMs) has advanced stance detection, capturing nuanced linguistic phenomena, including negation, sarcasm, and implicit disagreement [13]. Through cross-lingual reasoning, these models can process Arabic content, and by leveraging generalized semantic representations, LLMs extend the capabilities of transformers across domains and topics. Compared with traditional feature-driven methods, transformer-based methods exhibit better performance and adaptability, and now dominate stance-detection research [13].
The majority of existing studies focus on English-language stance-detection models, paying minimal attention to Arabic or bilingual contexts [14,15]. Arabic stance detection typically relies on topic-specific datasets, and the resulting models generalize poorly across dialects, domains, and targets [15]. Bilingual and code-switched stance detection remains underexplored. From a methodological perspective, existing methods rely on conventional fine-tuning of pre-trained models with limited architectural customization for target-aware stance modeling, making them less suitable for addressing challenges posed by unbalanced label distributions and unstructured data [16,17,18]. Despite the exceptional performance of LLMs, they face challenges such as computational cost, data efficiency, and explainability, especially in socially sensitive use cases. These constraints underscore the need for data-oriented, structure-sensitive stance-detection systems that can process Arabic-English social media discourse with limited computational resources.
This study is motivated by social media's expanding role in influencing public attitudes, beliefs, and opinions on social, political, and healthcare topics. Existing target-aware stance-detection models are typically single-encoder concatenation architectures, where the target and text are concatenated and processed as a single sequence (e.g., BiCond, CrossNet, and concatenation-BERT variants). While useful in monolingual contexts, these approaches are hindered by representational entanglement, which makes it challenging to generalize across heterogeneous stance types (topic stance vs. claim stance vs. conversational stance) and across languages. Similarly, cross-lingual natural language inference (NLI)/stance models are mostly label-space transfer models, lacking a clear model of the relational structure between the target and stance-bearing texts. They typically use translate-train or translate-test heuristics to bridge languages, resulting in translation noise, domain mismatch, and stylistic drift, which are major challenges for Arabic social media content with its dialects, code-switching, spelling variation, and informal morphology. Addressing these limitations is crucial for developing robust, transparent, and comprehensive stance models that capture the complex dynamics of individuals' expressions across varied, linguistically rich online contexts. Therefore, in this study, a target-aware bilingual stance-detection model is introduced that generalizes across languages, domains, and discourse forms while remaining robust, interpretable, and resource-efficient. The proposed framework makes four substantive innovations over the existing NLI/stance-prediction paradigms:
- 1.
Target-Text Dual Encoding with Late Fusion
Instead of incorporating target and text in a single encoder, the framework encodes them separately and combines them with late fusion. This structural separation ensures that stance prediction is driven by the semantic relationship between two well-formed representations rather than surface co-occurrence patterns, thereby enabling the model to generalize across cross-format stance scenarios.
- 2.
Cross-Lingual Contrastive Alignment at Representation Level
Unlike existing cross-lingual NLI or stance models based on translation, parallel data, or shared label semantics, the proposed method aligns target-text pairs between English and Arabic using a contrastive objective. This produces a shared bilingual stance space without the need for parallel corpora, making the proposed model more robust in real-world bilingual social media settings, where parallel data are limited.
- 3.
Robustness and Explainability as Integrated Components of Stance Modeling
The framework includes perturbation-based robustness regularization, which stabilizes performance under noisy, informal, or adversarial social media text, a factor often ignored in target-aware stance studies. Additionally, token-level rationale extraction supports bilingual interpretation and enables the analysis of hallucinations, which existing target-aware or cross-lingual stance models do not address.
- 4.
Efficient Fine-Tuning for Computationally Lightweight Multilingual Deployment
Parameter-efficient fine-tuning reduces training cost while preserving performance, offering a practical, resource-aware approach for deploying stance-detection models in low-resource bilingual settings.
Together, these innovations position the proposed system not as a mere combination of existing methods but as a systematic, extensible architecture explicitly designed for target-conditioned, bilingual, and robustness-aware stance detection, a setting insufficiently supported by existing models.
The remainder of this study is organized as follows:
Section 2 outlines the features and limitations of existing stance-detection approaches, covering monolingual, multilingual, and target-aware transformer-based models, as well as widely used datasets, highlighting the theoretical and empirical background of the proposed study.
Section 3 explains the details of the datasets, preprocessing and augmentation methods, model architecture, training procedure, and evaluation procedure. It describes the proposed target-aware bilingual framework, including cross-lingual contrastive alignment, robustness mechanisms, and explainability components.
Section 4 presents results based on standard performance metrics, AUROC and AUPRC analyses, confusion matrices, and zero-shot generalization across datasets.
Section 5 presents a detailed interpretation of the results, considers generalizability, and discusses study implications. Finally,
Section 6 summarizes the main contributions, discusses the practical implications and limitations, and outlines directions for future research in bilingual stance detection.
2. Literature Review
Existing transformer-based stance-detection approaches are typically classified into monolingual, bilingual, or multilingual frameworks. Using target and text concatenation, English-language stance-detection models, including BERT, RoBERTa, and DistilBERT, are fine-tuned on benchmark datasets such as SemEval-2016, FNC-1, and VAST [19,20,21,22]. In topic-stance and claim-stance tasks, these models are used because they produce strong in-domain representations. With the introduction of explicit target encoding, target-aware architectures such as BiCond, CrossNet, and topic-injected BERT variants achieve improved stance detection [23,24]. In the Arabic domain, transformer models, such as AraBERT, MARBERT, QARiB, and CAMeLBERT, have been adapted using datasets such as AraStance, ArabicStanceX, and corpora on coronavirus vaccines [25,26]. They incorporate key aspects of Arabic, such as dialects, code-mixing, and noisy social media content, into pre-trained transformers to achieve optimal performance. Bilingual and multilingual approaches are based on mBERT, XLM-R, RemBERT, and other multilingual models [27,28]. To enhance generalization, recent studies use cross-lingual alignment, adversarial adaptation, and translation-based augmentation. In addition, contrastive objectives and shared embedding spaces enable zero-shot transfer between linguistically diverse targets and texts.
Table 1 outlines the features of the existing stance-detection models.
Despite the rapid improvement of transformer-based stance-identification models, numerous key gaps remain in the literature, especially in low-resource and bilingual contexts involving Arabic and Arabic-English. Limited incorporation of contrastive learning approaches is one of the most significant limitations. Existing stance-detection models based on MARBERT, AraBERT, or XLM-R depend on cross-entropy loss and fail to employ supervised or unsupervised contrastive objectives that may improve inter-class separability. In the context of NLP, contrastive learning has improved feature representations and reduced semantic ambiguity across closely related stance classes.
The lack of robustness limits the potential of existing models to handle noisy, informal, or dialect-rich text, a characteristic prevalent in Arabic and code-switched social media data. Transformer models trained on controlled news corpora or high-resource English datasets frequently underperform on noisy, real-world inputs that include misspellings, emojis, dialectal variations, and non-standard grammar. Although pre-trained Arabic-specific models, including MARBERT and AraBERT, have partially addressed this challenge, a research gap persists in developing noise-aware pre-training methodologies, effective data augmentation techniques, and adversarial training approaches to enhance stance-detection performance under low-quality text conditions. The explainability of stance-detection models also remains largely unexplored. Transformers are opaque, lacking a rationale for their predictions. Although attention weights can be visualized, they fail to provide properly grounded or causal explanations, underscoring the need for explainability frameworks such as SHapley Additive exPlanations (SHAP), Local Interpretable Model-agnostic Explanations (LIME), or attention-flow tracking to identify reasoning spans and increase user confidence.
Recent studies overlook target-aware modeling and treat stance detection as a generic text classification problem. The majority of existing models are unable to explicitly encode or condition on the stance target, resulting in reduced performance in complex social media settings. Another challenge stems from the absence of effective hallucination mitigation strategies within cross-lingual contexts. Despite increased attention to hallucination in generative tasks, its effects on discriminative stance models remain understudied. Current approaches often mis-tokenize Arabic words or conflate non-standard forms, compromising semantic encoding and downstream stance categorization.
3. Materials and Methods
The methodological design of this study focuses on mDeBERTa-v3 [49], a multilingual transformer known for robust contextual encoding and suitable for modeling stance in typologically distinct languages such as English and Arabic. As stance detection relies on subtle interactions between targets and stance-bearing expressions, mDeBERTa-v3 serves as a strong foundational model. However, traditional fine-tuning remains insufficient to enable cross-language transfer, robustness against noisy social media text, or adaptation to heterogeneous stance formats. To overcome these limitations, the proposed framework includes architectural customizations with parameter-efficient fine-tuning. First, a dual-encoder architecture with late fusion explicitly differentiates target and text representation pathways before fusion, thereby strengthening target conditioning and enhancing generalization across topic-level, conversational, and claim-article scenarios. Second, a cross-lingual contrastive alignment module is proposed to encourage the model to learn shared stance-relevant semantics between English and Arabic. Third, noise-aware, perturbation-resilient components are integrated to provide greater resilience against dialectal variation, code-switching, sarcasm, and spelling inconsistencies. Finally, a token-level rationale extraction layer provides interpretable evidence for each prediction, thereby enriching transparency and reliability in a bilingual setting. Collectively, these architectural and training-level improvements yield a consistent, extensible methodology for bilingual stance detection.
Figure 1 visualizes the proposed stance-detection pipeline and the interactions among its components.
3.1. Dataset Description
The existing stance-detection benchmarks, though widely used in the research community, have several limitations that restrict their suitability for the study of robust and bilingual stance modelling. SemEval-2016 is limited in scope and topical diversity. Moreover, its reliance on English Twitter data makes it less suitable for evaluating generalization to novel domains or languages. X-Stance, despite being multilingual, focuses primarily on political topics and uses structured, parallel annotation. This narrow domain focus reduces linguistic variability and undermines the applicability to informal or noisy social media discourse. EZ-Stance, while concentrating on zero-shot target transfer, is monolingual and topic-centric, and lacks conversational or document-level stance reasoning capabilities.
To overcome these limitations, this study utilizes the MT-CSD [34], VAST [35], ArabicStanceX [48], and AraStance [46] datasets. VAST enables large-scale, target-aware training in English; MT-CSD introduces conversational stance reasoning, enabling testing of cross-format and cross-lingual generalization. ArabicStanceX covers informal Arabic social media stance across diverse topics, while AraStance includes claim-article reasoning at the document level. Together, these datasets enable a thorough assessment across languages, targets, and discourse structures, aligning with the study's objective of bilingual and explainable stance detection.
Table 2 outlines the features of the datasets, including the number of samples, labels, and stance types.
3.2. Data Preprocessing and Expert-Guided Hallucination Assessment
The datasets are processed through a unified preprocessing pipeline, ensuring linguistic consistency between English and Arabic while retaining stance-relevant semantic information. For English texts, preprocessing involves eliminating non-informative information such as URLs, user mentions, hashtags, and unnecessary symbols, and normalizing excessive punctuation and repeated characters. For Arabic texts, language-specific normalization is applied to reduce orthographic and morphological variability typical of informal writing, including normalization of letter forms, removal of optional diacritics, and normalization of elongations and spelling patterns. These are measures to control surface-level noise without altering stance polarity and semantic intent. Subsequent to normalization, the data are split into two non-overlapping sets: an externally held-out test set and a development set. The held-out test set is used to evaluate the model’s generalization and is not used in the training, cross-validation, augmentation, or model selection process, ensuring impartial evaluation of generalization capability. The development set is used in a five-fold cross-validation setting. In each fold, the development data are split into fold-specific training and validation data. Robustness-oriented data augmentation strategies are applied only to the training data subset in each fold, whereas the validation data subset is not. This design eliminates the risk of information leakage and ensures that validation performance reflects generalization to unseen, unmodified data.
The augmentation strategies are label-preserving and are intended to model realistic linguistic variability found in user-generated content, including controlled lexical substitutions, minimal paraphrasing, and character-level perturbations that simulate informal linguistic writing patterns. Augmentation is not applied to validation subsets or the held-out test set.
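The normalization steps described above can be sketched with simple regular-expression rules. This is an illustrative sketch only; the function names and the exact rule set are assumptions, not the authors' implementation:

```python
import re

def normalize_english(text):
    """Remove non-informative tokens and squeeze repeated characters/punctuation."""
    text = re.sub(r"https?://\S+", " ", text)               # URLs
    text = re.sub(r"[@#]\w+", " ", text)                    # mentions and hashtags
    text = re.sub(r"([!?.])\1+", r"\1", text)               # excessive punctuation
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)             # sooooo -> soo
    return re.sub(r"\s+", " ", text).strip()

def normalize_arabic(text):
    """Unify letter variants, strip optional diacritics and elongation (tatweel)."""
    text = re.sub(r"[\u064B-\u0652]", "", text)             # optional diacritics
    text = text.replace("\u0640", "")                       # tatweel elongation
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                 # alef maqsura -> ya
    return re.sub(r"\s+", " ", text).strip()
```

Both functions leave content words untouched, consistent with the goal of controlling surface-level noise without altering stance polarity.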
To facilitate cross-dataset training and bilingual generalization, all datasets were mapped to a unified three-class stance scheme (Agree, Disagree, Neutral). This harmonization is necessary, as the current corpora of English and Arabic stance differ significantly in the granularity of their labels and their semantic scope. For example, Discuss and Unrelated are included in AraStance, and these are not stance categories but discourse-relation categories. These labels do not explicitly state polarity and were thus merged into Neutral in accordance with previous practices of stance normalization in multilingual environments. While this mapping is practical, it introduces three methodological considerations. First, there is an inevitable increase in neutral class heterogeneity, since the merged class covers more semantic space, from truly neutral opinions to topical commentary without polarity. Second, comparability with existing approaches becomes approximate rather than exact, especially for models assessed on the original multi-label AraStance taxonomy. We address this by presenting results strictly within the three-class framework and not directly comparing them with studies using the five-label scheme in terms of metrics. Third, for zero-shot performance interpretation, the simplification mitigates target ambiguity by excluding discourse-relation labels that do not express stance; however, it may artificially boost predictions of neutral in datasets where the "discuss" class is predominant. These caveats are explicitly recognized, and harmonized labelling is presented as a requirement for bilingual, cross-format stance modelling rather than an intrinsic property of the datasets themselves.
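The harmonization step reduces to a lookup table. A minimal sketch follows; the exact source label strings used in each corpus are assumptions for illustration:

```python
# Illustrative mapping from corpus-native labels to the unified
# three-class scheme (Agree, Disagree, Neutral).
UNIFIED_LABELS = {"Agree", "Disagree", "Neutral"}

LABEL_MAP = {
    # Claim-article style labels (AraStance-like)
    "agree": "Agree",
    "disagree": "Disagree",
    "discuss": "Neutral",    # discourse relation, not a stance
    "unrelated": "Neutral",  # no polarity toward the claim
    # Twitter topic-stance style labels
    "favor": "Agree",
    "against": "Disagree",
    "none": "Neutral",
}

def harmonize(label):
    """Map a corpus-native stance label to the unified scheme."""
    mapped = LABEL_MAP[label.strip().lower()]
    assert mapped in UNIFIED_LABELS
    return mapped
```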
In this study, hallucination refers to any lexical, semantic, or inferential rationale emphasized by the model that is not based on the input text and cannot be rationalized by explicit or implicit evidence in the original stance-bearing statement. A rationale is observed as hallucinated if it provides unsupported evidence (i.e., a highlighted token that does not exist in the text, or that is semantically irrelevant to the label on the gold stance), incorrect attribution (highlighting text that contradicts the label on the gold stance), fabricated inference (adding causal or sentiment cues not present in the content), or cross-lingual drift (misinterpreting neutral cues across languages).
In addition to preprocessing and augmentation, the study applies an expert-guided hallucination assessment framework that operates independently of primary stance datasets. Due to the lack of hallucination annotations in the datasets, hallucination is treated as an evaluation metric rather than a supervision target for model-generated explanations. Following model inference, token-level rationales generated by the model are evaluated by domain experts on a representative subset to determine whether the explanations are supported by the input text and whether the reasoning is target or semantically inconsistent.
To systematically evaluate hallucination, a stratified subset of 300 instances was selected from the three stance classes: 100 from the combined bilingual test set, 100 from MT-CSD, and 100 from AraStance, and evaluated through expert-guided annotation. Three senior NLP researchers, proficient in English and/or Arabic, independently reviewed token-level saliency maps in a blinded condition, without access to model predictions or labels assigned by other annotators. Experts labeled each rationale as either Grounded or Hallucinated based on a standardized guideline. Inter-annotator agreement was high (Fleiss' κ = 0.82; Krippendorff's α = 0.79), and disagreements were resolved by majority vote. Since hallucination assessment is based on deterministic saliency maps rather than generative outputs, the evaluation is completely reproducible in the controlled inference environment, where the same inputs will always generate the same rationales and hallucination assessments.
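For reference, the reported agreement statistic can be computed from per-item category counts. The following is a standard Fleiss' kappa implementation, not code from the study:

```python
def fleiss_kappa(rating_counts):
    """Fleiss' kappa for fixed-rater categorical annotation.

    rating_counts: list of per-item category counts, e.g. [2, 1] means
    2 annotators chose 'Grounded' and 1 chose 'Hallucinated' for that item.
    Every row must sum to the same number of raters.
    """
    n_items = len(rating_counts)
    n_raters = sum(rating_counts[0])
    n_cats = len(rating_counts[0])
    # Observed agreement per item
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from overall category proportions
    totals = [sum(row[j] for row in rating_counts) for j in range(n_cats)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)
```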
The involvement of multiple experts and the application of inter-rating protocols strengthen the methodological rigor of the hallucination analysis by reducing subjectivity and ensuring that qualitative judgments reflect consistent interpretative standards. This expert-driven validation is particularly significant for the study, as hallucination assessment in stance detection extends beyond numerical accuracy and requires nuanced evaluation of semantic coherence, thereby reinforcing the trustworthiness and interpretability of the proposed framework. These expert annotations are used solely to validate the models’ outcomes and are not used for training, dataset composition, or model parameters.
3.3. The Proposed Target-Aware Stance-Detection Approach
3.3.1. Target-Text Dual-Encoder Architecture
Traditional stance-detection approaches rely on single-encoder architectures, in which the target and stance texts are concatenated and encoded. While efficient within controlled settings, this formulation introduces several limitations. Firstly, target information is often poorly represented because it competes with adjacent textual context within a single sequence. Consequently, stance predictions rely on surface lexical overlap rather than true target-conditioned reasoning. Secondly, such models have limited generalization capability when deployed to unseen targets, novel discourse formats, or cross-lingual scenarios, as the learned representations entangle target semantics with domain-specific textual patterns. Lastly, concatenation-based encoding struggles to scale across heterogeneous stance formats, such as topic-level, conversational, and claim-article stances.
Furthermore, target expressions in stance detection are usually short, noun-phrase-level units with stable and context-independent semantics (e.g., “Electric Vehicles”, “Climate Policy”). In contrast, stance-bearing text is in the form of longer, heterogeneous, and stylistically diverse discourse segments that carry sentiment, argumentation, and pragmatic cues. Using a single shared encoder to handle both inputs forces the model to fit fundamentally different distributions into a common representational space, leading to semantic entanglement.
To overcome these limitations, the stance-detection problem is redefined as a target-conditioned representation-learning problem. The proposed framework uses completely independent parameters for the target encoder and the text encoder, motivated by the inherently asymmetric linguistic structures of targets and stance-bearing texts. By encoding the target and text independently within a dual-encoder framework, the model preserves explicit target semantics and enables controlled interaction through late fusion, ensuring that stance reasoning is explicitly conditioned on the target rather than implicitly inferred from it, thereby enhancing robustness against target variation and discourse structure.
Figure 2 illustrates the architecture of the proposed target-text dual-encoder.
The dual-encoder formulation naturally supports cross-lingual and cross-format transfer by aligning independent target and text representations. Through this formulation, the target is explicitly conditioned, reducing representational interference and leading to stable generalization. Equation (1) outlines the computation of independent contextual representations:

h_t = E_t(t; θ_t),  h_x = E_x(x; θ_x)
(1)

where h_t is the target's standalone semantics, h_x is the contextual meaning of the text, and E_t and E_x represent mDeBERTa-v3's parameterized encoder functions.
To preserve semantic disentanglement between the target and the stance-bearing text, the parameter sets θ_t and θ_x are maintained independently. Equation (2) shows the integration of the resulting representations using a late-fusion interaction operator Φ:

z = Φ(h_t, h_x) = [h_t ⊕ h_x ⊕ |h_t − h_x|]
(2)

where ⊕ denotes vector concatenation and |h_t − h_x| represents the element-wise absolute difference.
The fusion operation allows the model to comprehend the relationship between the text and the target. Through the structured comparison at the decision level, it preserves clean target and text representations. This mechanism enables explicit target-aware stance detection and enhances the model’s robustness across diverse discourse formats and languages.
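The fusion operator and a softmax stance head over the fused vector can be sketched in a few lines of NumPy. Encoder outputs are represented here by plain vectors, since the actual mDeBERTa-v3 encoders are omitted; this is a structural sketch, not the deployed model:

```python
import numpy as np

def late_fusion(h_t, h_x):
    """Late fusion: concatenate target, text, and their element-wise
    absolute difference into a single fused stance representation."""
    return np.concatenate([h_t, h_x, np.abs(h_t - h_x)])

def classify(z, W, b):
    """Softmax stance head over the fused representation."""
    logits = W @ z + b
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```

With two 2-dimensional encoder outputs, the fused vector is 6-dimensional, and any linear head over it sees the target and text representations as separate, comparable blocks.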
3.3.2. Cross-Lingual Contrastive Alignment
While the dual-encoder architecture provides explicit target conditioning, bilingual stance detection also requires the model to learn language-invariant representations that are generalizable across English and Arabic. Standard multilingual fine-tuning typically relies on shared vocabulary or implicit parameter sharing, which is inadequate for aligning stance-expressive cues. To overcome this limitation, a cross-lingual contrastive alignment layer is proposed, a representation-level training objective that does not affect the inference architecture. The core idea is to shape the embedding space so that stance instances with the same polarity lie close together, regardless of language, while semantically unrelated stances are separated. Equation (3) describes the fused representations of the two stance instances with equivalent stance polarity:

z_i = Φ(h_t^i, h_x^i),  z_j = Φ(h_t^j, h_x^j)
(3)

where z_i and z_j represent the fused representations covering the English instance i and the Arabic instance j, and Φ is the late-fusion operator. The enforcement of contrastive alignment is presented in Equation (4):

L_align = −log ( exp(sim(z_i, z_j)/τ) / Σ_k exp(sim(z_i, z_k)/τ) )
(4)

where sim(·, ·) denotes cosine similarity, τ is a temperature parameter controlling the sharpness of the similarity distribution, L_align is the contrastive alignment loss, and the sum over k ranges over the positive instance j and the negative samples. This objective increases similarity between positive cross-lingual stance pairs while suppressing similarity to negative samples, effectively regularizing the shared embedding space. By encouraging stance-consistent alignment rather than lexical overlap, the contrastive loss reduces language-specific bias and enhances zero-shot transfer.
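The alignment objective can be sketched as a standard InfoNCE-style loss over fused representations; the temperature value and the toy vectors below are illustrative assumptions:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_alignment_loss(z_anchor, z_positive, z_negatives, tau=0.1):
    """InfoNCE-style loss: pull the cross-lingual positive pair together
    and push negatives apart in the shared stance space."""
    pos = np.exp(cosine(z_anchor, z_positive) / tau)
    neg = sum(np.exp(cosine(z_anchor, z_n) / tau) for z_n in z_negatives)
    return float(-np.log(pos / (pos + neg)))
```

When the anchor and positive point in the same direction and negatives are orthogonal, the loss is near zero; swapping the roles makes it large, which is exactly the gradient signal that shapes the shared bilingual space.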
The effectiveness of cross-lingual contrastive learning is critically dependent on the definition of positive and negative sample pairs. In this framework, pair construction is grounded in stance-label alignment, enabling the model to learn cross-lingual stance representations without requiring semantically parallel English-Arabic data. Pairs require only stance-label alignment. Semantic, topical, or contextual similarity is not required. A positive pair is created by choosing one instance from each language for which the same stance category (Agree, Disagree, or Neutral) is assigned. This helps ensure that the contrastive layer attends to polarity-level correspondences rather than surface-level lexical similarities. In contrast, a negative pair consists of two instances, again drawn cross-linguistically, that belong to different stance categories. This drives the embeddings of opposing stances further apart in the shared representation space, allowing the model to learn language-invariant stance cues, including polarity markers, modal expressions, and opinion indicators. This label-centric design is intentionally chosen because topic-conditioned or semantic-similarity-based pairing strategies introduce unwanted biases, such as reliance on topic overlap or susceptibility to dialectal variation. By removing the focus on topical content and focusing only on stance categories, the model can learn more stable and generalizable cross-lingual polarity features. This approach guarantees scalability without the need for parallel corpora or manual semantic alignment. Thus, the positive/negative pairing strategy is a principled and resource-efficient mechanism for enforcing bilingual stance coherence.
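The label-centric pairing strategy can be sketched as follows. The 50/50 sampling ratio of positive to negative pairs is an assumption for illustration, as the study does not specify it:

```python
import random

def make_pairs(en_examples, ar_examples, n_pairs, seed=0):
    """Sample cross-lingual (anchor, partner, is_positive) triples using only
    stance labels; no parallel or topically matched data is required.
    Each example is a (text, label) tuple."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        en = rng.choice(en_examples)
        same = rng.random() < 0.5  # assumed 50/50 positive/negative mix
        candidates = [a for a in ar_examples if (a[1] == en[1]) == same]
        if not candidates:
            continue  # no partner with the required label relation
        pairs.append((en, rng.choice(candidates), same))
    return pairs
```

Because candidates are filtered only by label agreement or disagreement, topical overlap plays no role, matching the design rationale above.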
3.3.3. Robust Stance Modeling Under Noisy and Adversarial Text
Stance detection in real-world applications is often hindered by noisy, irregular language, particularly in social media and conversational corpora, where spelling variations, informal language, code-switching, and intentional obfuscation are prevalent. Models trained on unperturbed text tend to overfit on shallow lexical patterns and make unstable predictions when inputs deviate from standard forms. To overcome this limitation, the proposed framework incorporates a robustness-regularization layer that explicitly encourages prediction stability against controlled input perturbations during training. This approach ensures that small label-preserving changes to the input do not lead to excessive changes in stance predictions. Equation (5) computes the robustness enforcement function, minimizing the discrepancy between the predicted stance distributions:

L_rob = KL( p(y | x) ∥ p(y | x̃) )
(5)

where p(y | x) represents the model's predicted probability distribution over stance classes, L_rob is the robustness enforcement function, x denotes the original input text, and x̃ represents a perturbed version obtained through controlled noise injection. The regularization encourages the model to learn smoother decision boundaries and to base its predictions on semantically salient signals. Implemented through controlled perturbations, the robustness loss acts as a constraint on the classifier, stabilizing stance predictions under linguistic noise such as misspellings, informal orthography, dialectal variation, and lightweight perturbations. By penalizing prediction inconsistency, the model becomes less sensitive to orthographic noise, informal language, and adversarial perturbations, maintaining computational efficiency while achieving improved generalization on noisy, real-world inputs.
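A minimal sketch of this consistency penalty, assuming KL divergence as the discrepancy measure between the clean and perturbed stance distributions:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors, with smoothing for zeros."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def robustness_loss(p_clean, p_perturbed):
    """Consistency penalty: discrepancy between the stance distribution for
    an input and that of its label-preserving perturbation."""
    return kl_divergence(p_clean, p_perturbed)
```

Identical distributions incur zero penalty; the more a perturbation flips the predicted stance, the larger the penalty, which pushes the classifier toward smoother decision boundaries.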
3.3.4. Token-Level Rationale Extraction
In order to improve transparency and interpretability, the proposed framework combines a token-level rationale extraction layer that operates post-prediction. This component uncovers the internal decision-making process by identifying the input text that influences a specific stance prediction. Such fine-grained interpretability is particularly important in the context of bilingual and cross-format stance detection, where predictions may otherwise appear opaque or hard to justify. Equation (6) shows the estimation of token importance using a gradient-based saliency approach:

s_i = | ∂ŷ / ∂x_i |
(6)

where |·| denotes the absolute magnitude of the gradient, x_i is the input token, ŷ is the predicted stance, and a higher value of s_i indicates greater influence on the model's prediction.
The extracted rationales provide interpretable evidence for the predictions in English and Arabic, enabling qualitative inspection and comparative analysis across languages. Crucially, these token-level explanations form the foundation for evaluating hallucinations, allowing evaluators to check whether the highlighted evidence is based on the input text and related to the target, and to hold the model accountable and ensure the reliability and trustworthiness of its predictions.
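Since gradient saliency requires autograd, the same idea can be conveyed with a gradient-free occlusion variant: score each token by how much masking it changes the predicted stance probability. This is a stand-in sketch, not the gradient-based method of Equation (6), and `toy_predict` is a hypothetical stand-in for a real stance classifier:

```python
def occlusion_saliency(predict, tokens, target_class):
    """Score each token by the drop in the target-class probability
    when that token is replaced with a mask symbol.
    `predict` maps a token list to a class-probability list."""
    base = predict(tokens)[target_class]
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores.append(base - predict(masked)[target_class])
    return scores

def toy_predict(tokens):
    """Hypothetical two-class stance model used only to demonstrate the API."""
    p = 0.9 if "great" in tokens else 0.2
    return [p, 1.0 - p]
```

On the toy model, masking the opinion-bearing token produces a large saliency score, while masking a neutral token leaves the prediction unchanged.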
3.3.5. End-to-End Proposed Stance-Detection Pipeline
The proposed stance-detection framework is conceptualized as a unified end-to-end pipeline that combines architectural modifications and auxiliary training objectives while maintaining a single, fixed inference pathway. During inference, the model maps a target-text pair to a stance prediction via a series of compositional operations. Equation (7) outlines the model’s prediction using the proposed approach.
where $W$ and $b$ are the classifier parameters and $\|$ denotes the parallel encoding of the target and text representations prior to late fusion.
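A minimal sketch of the late-fusion classification step, with toy dimensions and randomly initialized parameters standing in for the trained encoder outputs and classifier:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion_predict(target_vec, text_vec, W, b):
    # parallel encoding: concatenate the two encoder outputs, then classify
    h = target_vec + text_vec  # list concatenation, i.e. [E_target(t) || E_text(x)]
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)

rng = random.Random(0)
d = 4  # toy embedding size; the real encoders emit much larger vectors
W = [[rng.uniform(-1.0, 1.0) for _ in range(2 * d)] for _ in range(3)]
b = [0.0, 0.0, 0.0]

probs = late_fusion_predict([0.2] * d, [0.5] * d, W, b)  # Agree/Disagree/Neutral
assert len(probs) == 3 and abs(sum(probs) - 1.0) < 1e-9
```

Keeping the two encoders separate until this fusion step is what lets the target representation stay stable while the text representation varies with input length and noise.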
In addition, the stance prediction is implemented using a fixed dual-encoder and late-fusion architecture. Contrastive alignment and robustness objectives are used during training, influencing representation learning without affecting the deployed inference pipeline. Equation (8) shows the composite loss function.
where $\mathcal{L}_{\text{total}}$ is the total loss, $\lambda_1$ and $\lambda_2$ are the weighting coefficients, and $\mathcal{L}_{\text{CE}}$ supervises the stance classification accuracy.
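The composite objective can be sketched directly; the weighting values and auxiliary loss magnitudes below are illustrative, not the tuned hyperparameters:

```python
import math

def cross_entropy(probs, label, eps=1e-9):
    # supervises stance classification accuracy
    return -math.log(probs[label] + eps)

def composite_loss(probs, label, contrastive_loss, robustness_loss,
                   lam1=0.5, lam2=0.3):
    # lam1 and lam2 are the weighting coefficients of the auxiliary objectives
    return (cross_entropy(probs, label)
            + lam1 * contrastive_loss
            + lam2 * robustness_loss)

total = composite_loss([0.7, 0.2, 0.1], label=0,
                       contrastive_loss=0.4, robustness_loss=0.1)
# auxiliary terms only add to the supervised loss
assert total > cross_entropy([0.7, 0.2, 0.1], 0)
```

Because the auxiliary terms shape only the learned representations, they can be dropped at inference time without changing the deployed pipeline.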
The proposed stance-detection model relies on a single, coherent prediction architecture, and auxiliary learning signals serve as enhancements during training. This design ensures computational efficiency at deployment while enabling improved target awareness, cross-lingual generalization, and robustness during learning.
3.4. Experimental Setting
The stance corpora used in this study were harmonized into a unified target-text format and merged into a single bilingual pool combining the English VAST instances with the Arabic ArabicStanceX instances. From this combined dataset, 80% of the data formed the development set, and the remaining 20% were held out exclusively for final generalization testing. The development set was partitioned into five folds for cross-validation, with each fold representative of the English and Arabic samples and balanced across stance labels. In each cross-validation run, four folds (64% of the entire dataset) were used to train a target-aware bilingual model, and the remaining fold (16%) served as the validation set, yielding five independently trained bilingual models. To avoid double exposure, model overfitting, and unnecessary computational cost, no additional model was retrained on the complete 80% development set. Instead, all five cross-validated models were evaluated directly on the held-out 20% test set, which preserved the original English-Arabic composition. Final performance was computed by averaging the prediction probabilities of the five models for each test instance, yielding a robust estimate of bilingual generalization to unseen data.
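The split-and-ensemble protocol can be sketched as follows; the dataset size is a toy value, and interleaved fold assignment stands in for the stratified, label-balanced split described above:

```python
import random

rng = random.Random(42)
n = 1000
indices = list(range(n))
rng.shuffle(indices)

# 80% development pool, 20% held-out test set
dev, test = indices[:800], indices[800:]

# five folds over the development pool: each fold is 16% of the full dataset
folds = [dev[i::5] for i in range(5)]
assert all(abs(len(f) - 0.16 * n) < 1 for f in folds)

# ensemble: average the predicted probabilities of the five fold-models
def average_probs(per_model_probs):
    k = len(per_model_probs)
    return [sum(p[c] for p in per_model_probs) / k
            for c in range(len(per_model_probs[0]))]

avg = average_probs([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1],
                     [0.7, 0.2, 0.1], [0.65, 0.25, 0.1]])
assert abs(sum(avg) - 1.0) < 1e-9
```

Each of the five fold-models trains on the other four folds (64% of the data), and the averaged probabilities on the held-out 20% give the final reported scores.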
Table 3 highlights the key experimental settings for the model implementation.
To achieve computational efficiency and practical deployability, the proposed framework follows a parameter-efficient fine-tuning approach based on mDeBERTa-v3. Rather than updating all transformer parameters during fine-tuning, the process selectively updates the task-specific layers of the dual encoders and the late-fusion classifier, while the lower-level shared representations remain largely frozen. This strategy greatly reduces training overhead and memory usage without sacrificing performance. Importantly, auxiliary objectives such as contrastive alignment and robustness regularization act on the learned representations without adding extra parameters at inference time. As a result, the proposed model achieves strong bilingual stance performance while remaining suitable for low-resource multilingual deployment scenarios.
The proposed bilingual stance-detection framework is evaluated using a combination of traditional classification metrics and reliability indicators to provide a balanced assessment of predictive performance and trustworthiness. Accuracy measures overall correctness across stance classes, while precision, recall, and F1-score assess class-specific performance and address imbalances in stance labels. In addition to predictive performance, the hallucination rate is included to determine how faithful model explanations are. To ensure the robustness of the conclusions, all metrics are reported as mean ± standard deviation across cross-validation folds, and 95% confidence intervals are computed to quantify performance stability. This strategy for reporting results ensures that observed enhancements reflect consistent model behavior rather than random variations.
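The fold-level reporting convention (mean ± standard deviation with a 95% confidence interval) can be reproduced with a short helper; the fold scores below are hypothetical:

```python
import math

def mean_std_ci(fold_scores, z=1.96):
    # mean, sample standard deviation, and normal-approximation 95% CI
    n = len(fold_scores)
    mean = sum(fold_scores) / n
    var = sum((s - mean) ** 2 for s in fold_scores) / (n - 1)
    std = math.sqrt(var)
    half = z * std / math.sqrt(n)  # CI half-width
    return mean, std, (mean - half, mean + half)

f1_per_fold = [0.872, 0.881, 0.876, 0.879, 0.874]  # hypothetical fold F1-scores
mean, std, (lo, hi) = mean_std_ci(f1_per_fold)
assert lo < mean < hi
```

A narrow interval, as reported for the proposed model, indicates that performance varies little across folds and is not driven by a favorable partition.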
4. Results
The results of five-fold cross-validation in
Table 4 show that the proposed bilingual stance-detection framework achieves consistently high performance across all evaluation metrics, with limited variance across folds and narrow confidence intervals. This consistency demonstrates that the learned representations are not overly biased by any particular data partition and that the model generalizes stably on the development set. Such behavior is especially relevant to stance detection, where instability often arises from linguistic variation, topic diversity, and label ambiguity.
The observed performance can be attributed to the structural and learning mechanisms adopted in the proposed approach. The target-text dual-encoder formulation explicitly separates target semantics from stance-bearing content, thereby limiting representational interference and enabling more precise target conditioning. Cross-lingual contrastive alignment constrains the space of representations to maintain stance polarity across languages, thereby facilitating consistent predictions in bilingual contexts. Robustness-oriented regularization further reduces the sensitivity to surface-level noise and informal linguistic patterns, which are common in social media and conversational data. Additionally, averaging probabilities across cross-validation folds provides a more reliable estimate of model behavior by reducing the effects of variance in individual-fold predictions. Collectively, these characteristics are the source of robust, reproducible performance across all cross-validation folds.
The radar chart for the combined bilingual test set in
Figure 3 offers a concise visual summary of class-wise behavior for the proposed stance-detection model, complementing the reported overall accuracy of 88.1%. This suggests that the target-aware representation effectively captures explicit supportive and opposing cues in English and Arabic inputs. The similarly large, regular polygons for these two classes indicate stable decision boundaries and a limited precision-recall trade-off, which is desirable in stance-sensitive applications such as opinion analysis and policy monitoring.
The Neutral class has slightly lower values for all three metrics and thus a comparatively smaller polygon. This behavior is expected and reflects the inherent ambiguity of neutral stance expressions containing mixed or hedging language and weaker lexical signals. Importantly, the decrease in performance for the Neutral class does not significantly affect overall accuracy, suggesting that the model maintains its ability to discriminate without bias towards polarized stances. The smooth, non-fragmented contours of the radar plot provide further indication that performance degradation is controlled and systematic rather than unstable or highly variable.
Table 5 provides critical insights into the behavior of each architectural and training component. The single-encoder baseline underperforms due to its limited capacity to learn target-conditioned representations. Adopting a dual-encoder design yields a substantial gain, empirically validating the necessity of disentangling target and text semantics for stance reasoning. Cross-lingual contrastive alignment further promotes bilingual generalization by aligning stance-relevant cues across languages, while robustness regularization promotes stability under noisy inputs. Their combined optimization yields consistent additive benefits, evidence of complementary effects. Comparative experiments with alternative pairing strategies, including topic-conditioned and embedding-based semantic pairing, revealed that these strategies yield weaker or unstable transfer due to topical bias and dialectal variability. By contrast, label-aligned pairing was consistently more stable during optimization and achieved higher zero-shot performance, resulting in 2–4% improvements. This construction strategy is therefore an effective, computationally practical way to impose cross-lingual stance coherence without aligned bilingual resources. A controlled ablation comparison (A4 and A5 vs. A6) provides empirical support for the effectiveness of the robustness regularization function. Without it, the variant achieves strong performance but exhibits higher sensitivity to noisy or orthographically inconsistent inputs. Introducing the robustness loss (Dual-encoder + Contrastive + Robustness) yields consistent improvements across metrics and reduces prediction volatility, demonstrating the model’s ability to generalize across noisy textual environments.
Finally, ensemble averaging helps consolidate these improvements by reducing variance, thereby establishing that these final performance gains are due to the integrated framework, not the ensemble process alone.
Furthermore, to test whether separate parameter sets for the target and text encoders introduce redundancy or instability, we implemented a partial-sharing variant in which the lower transformer blocks were shared and the upper blocks remained decoupled. This design is less parameter-rich but more prone to representational interference, since the early layers must represent mixed distributions of short target phrases and longer stance-bearing text. Empirically, partial sharing produced mild reductions (≈0.7–1.1%) in precision and F1-score, and convergence became less stable in cross-lingual transfer settings, especially for Arabic-English zero-shot predictions, indicating that shared early layers could not achieve a clean semantic separation. Accordingly, full decoupling offers a clear methodological advantage: it preserves disentangled representations and produces more stable and discriminative target–text interactions under the dual-encoder fusion mechanism. These results are consistent with the overall ablation findings, indicating that architectural choices (rather than mere scale or classifier depth) drive the performance gains.
The Area Under the Receiver Operating Characteristic (AUROC) curves in
Figure 4 demonstrate the strong discriminative capability of the proposed stance-detection framework across the Agree, Disagree, and Neutral classes. In contrast to accuracy, AUROC evaluates performance across all conceivable decision thresholds and is thus particularly suitable for stance detection, where class imbalance and heterogeneous confidence levels are prevalent. The sharp upward bends of all three ROC curves indicate that the model consistently achieves high true-positive rates while maintaining low false-positive rates. Notably, the Disagree class achieves the highest AUROC (0.991), suggesting that oppositional stances have distinctive semantic and lexical signatures that the model captures well. The Agree class also shows strong separability, with an AUROC of 0.985.
Although the Neutral class reports a slightly lower AUROC (0.974), this outcome is due to the inherently ambiguous nature of neutral expressions, which lack explicit stance markers. Several factors contribute to this strong performance. The target-aware dual-encoder architecture helps ensure that stance predictions are explicitly conditioned on the target, enabling more accurate modeling of target-text interactions. Cross-lingual contrastive learning further improves performance by aligning stance-relevant semantics across English and Arabic, thereby reducing reliance on surface-level lexical cues. Moreover, robustness-oriented training and controlled data augmentation promote stability under noisy, informal textual conditions. Collectively, these design choices lead to well-calibrated ranking behavior, as evidenced by the consistently high AUROC values across stance categories.
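AUROC itself reduces to a simple ranking statistic: the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch with hypothetical one-vs-rest scores:

```python
def auroc(scores, labels):
    # Mann-Whitney formulation: probability a random positive is ranked
    # above a random negative (ties count half)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical scores for a one-vs-rest "Disagree" detector
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]
a = auroc(scores, labels)
assert 0.0 <= a <= 1.0
```

Because the statistic depends only on ranking, it is insensitive to the decision threshold, which is why it complements the thresholded accuracy and F1 figures reported elsewhere.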
The precision-recall (PR) curves in
Figure 5 highlight the model’s ability to maintain high precision as recall increases, which is especially important in stance detection, where false positives can affect subsequent analytical steps. The consistently high precision at moderate to high recall levels indicates that the model rarely assigns definitive stance labels when corroborating evidence is lacking. The Disagree class achieves the highest area under the PR curve (AUC-PR) of 0.982, corresponding to effective identification of explicit rejection cues, while the Agree class follows closely with AUC-PR = 0.973. The relatively lower AUC-PR for the Neutral class (0.958) indicates natural overlap between neutral expressions and weakly polarized statements.
Three precision-specific factors contribute to this strong PR performance. First, token-level rationale extraction ties predictions to linguistically grounded evidence, reducing overconfident false positives and improving precision. Second, the late-fusion decision mechanism defers uncertain predictions until sufficient stance evidence has been aggregated, maintaining recall without sacrificing precision. Third, calibration-aware training produces smoother confidence distributions, helping the model retrieve true stance instances at higher recall thresholds. Collectively, these mechanisms ensure high recall without the precision loss observed in generically fine-tuned transformer models.
Figure 6 shows that the proposed target-aware bilingual framework achieves 85.0% accuracy under zero-shot conditions, with balanced performance across classes. The error distribution further suggests that the remaining limitations stem from natural linguistic ambiguity rather than model instability. The dominant diagonal entries confirm that the model is correct for the majority of instances across all three stance categories. In particular, the Agree class has the highest number of accurate predictions, showing that the model recognizes supportive stance cues even though it was not trained on conversational data.
Similarly, the model achieved strong performance in identifying correct Disagree instances, demonstrating effective capture of explicit oppositional expressions. Misclassifications are limited and follow systematic patterns rather than random noise. Most errors occur between the neutral and polarized classes, as expected given the semantic overlap between weakly expressed stances and neutrality in conversational contexts. For example, a few neutral cases were misclassified as Agree or Disagree, reflecting borderline instances in which implicit stance cues are present but not strongly articulated. Notably, there is minimal direct confusion between Agree and Disagree, indicating that the model maintains clear polarity separation.
Figure 7 shows that the proposed target-aware bilingual framework generalizes well to the claim-article stance setting, achieving an accuracy of 86.8% under zero-shot conditions. The errors are few and semantically plausible, suggesting that the remaining misclassifications stem from linguistic ambiguity rather than model instability or overfitting. They follow a stance-proximity pattern, occurring predominantly between the neutral and polarized classes. For example, a small fraction of neutral cases is incorrectly classified as either “Agree” or “Disagree”, reflecting the inherent ambiguity of neutral claims that may provide only weak evaluative cues. Notably, direct confusion between “Agree” and “Disagree” is virtually absent, indicating that the model maintains a clear distinction between the two polarities.
Figure 8 shows that the proposed stance-detection framework outperforms existing bilingual and multilingual transformer baselines on the combined bilingual test set. This performance improvement is primarily due to the model’s target-aware dual-encoder architecture, which explicitly models the interaction between stance targets and textual evidence, rather than the implicit concatenation of sequences. Consequently, the model learns stance as a relational phenomenon, thus improving its robustness to topic variation and cross-lingual semantic shifts.
In terms of efficiency, the framework remains computationally feasible. Built on an mDeBERTa-v3 backbone with ~276 million parameters, it is comparable in scale to XLM-R-base while substantially lighter than instruction-tuned or generative models such as FLAN-T5. The use of parameter-efficient fine-tuning and a discriminative classification head allows faster inference, with approximately 15–20% lower inference time than generative baselines under identical hardware conditions. These efficiency gains underscore the roles of architectural conditioning and representation learning in achieving strong bilingual stance-detection performance.
Figure 9 shows that the proposed model achieves better precision, recall, F1-score, and accuracy on the MT-CSD test set than both pre-trained multilingual transformers and state-of-the-art approaches. This performance advantage is mainly due to the target-aware formulation adopted in the proposed framework. Unlike standard pre-trained models such as XLM-R, mBERT, or mRoBERTa [
33,
41,
42,
43] that tend to encode the target and the stance-bearing text using simple concatenation, the proposed dual-encoder architecture explicitly disentangles the target semantics from the conversational content, allowing the model to reason about stance as a relational dependency, enhancing its generalization ability.
Furthermore, the addition of cross-lingual contrastive alignment enables the model to learn representations relevant to stance while remaining robust to lexical and contextual variation. In contrast, many existing state-of-the-art methods are tailored for topic-level stance and exhibit reduced transferability in the conversational domain. Additionally, the robustness regularization used in the proposed approach reduces sensitivity to informal language, pragmatic cues, and discourse noise, maintaining superior generalization.
Figure 10 shows that the proposed model achieves the best and most balanced performance across precision, recall, F1-score, and accuracy on the AraStance test set, outperforming both general-purpose multilingual transformers and Arabic-specific baselines. In AraStance, the claim-article relationships involve implicit stance cues distributed across longer contexts that are not always lexically aligned with the target. Standard pre-trained models such as XLM-R, mBERT, and mRoBERTa represent such inputs using simple sequence concatenation, which can obfuscate the boundary between target semantics and supporting evidence, leading to poor stance discrimination.
Arabic-specific transformers, such as AraBERT and MARBERT-based methods [
44,
45,
46,
47], benefit from language adaptation. However, these models are monolingual and lack structured target-conditioning mechanisms, limiting their ability to generalize across domains and discourse styles. In contrast, the proposed dual-encoder architecture explicitly differentiates target and text representations and integrates them via late fusion, allowing for precise relational reasoning. Additionally, robustness-aware training reduces sensitivity to stylistic variation and implicit negation, which are prevalent in Arabic news and fact-checking content.
Figure 11 provides qualitative evidence that the proposed bilingual stance-detection framework makes linguistically grounded and semantically appropriate decisions across English and Arabic inputs. In the “Agree” English example, the model correctly elevates the saliency of stance-bearing tokens (e.g., support, climate, and policy), while assigning negligible weight to syntactic fillers (e.g., I, the). This behavior indicates that the system is not relying on superficial patterns but is instead attending to opinion-expressive lexical cues relevant to stance prediction. A similar pattern emerges in the Arabic “Disagree” instance, where key polarity indicators such as “غير عادل” (unfair) and “سلبًا” (negatively) receive the highest importance scores. The ability to highlight morphologically complex, sentiment-rich Arabic tokens demonstrates that the model effectively captures semantic nuance in a morphologically rich language.
The neutral examples further reinforce the reliability of the rationales. Rather than assigning high salience to sentiment-laden words, the model distributes importance across informational tokens, such as discussion and essential points, accurately reflecting the absence of explicit polarity. The Arabic neutral input shows a similar distribution across آراء مختلفة (“different views”), aligning with the non-committal stance.
Hallucination analysis is an important step in ensuring the trustworthiness of stance-prediction models, which are often deployed in bilingual and cross-lingual contexts where polarity frequently hinges on subtle variations in lexical meaning. As shown in
Table 6, hallucinated rationales often occur when the model attributes stance decisions to tokens that do not exist in the input, are not semantically relevant to it, or contradict it. Such attribution errors undermine interpretability and may bias downstream analyses in sensitive domains such as political discourse or misinformation monitoring. By systematically identifying and quantifying these errors, the study ensures that the model’s predictions are anchored in textual evidence, thereby enhancing transparency, user trust, and expert verifiability.
Evaluating hallucination rates is critical to stance detection, especially in applications involving social discourse, political narratives, or fact-checking, where misleading explanations may distort downstream interpretations. The hallucination rate analysis in
Table 7 provides strong evidence for the reliability and interpretability of the proposed stance-detection framework. Across all evaluated datasets, the hallucination rate remains consistently low, with values below 4%, indicating that the model very rarely supports its stance predictions with fabricated or semantically inconsistent rationales. The lowest hallucination rate, 2.9%, is observed on the combined bilingual test set, indicating stable reasoning. Notably, even in zero-shot generalization settings such as MT-CSD and AraStance, which have varied discourse structures and input lengths, the hallucination rate increases only marginally, highlighting the robustness of the learned representations.
The reduced hallucination can primarily be attributed to the target-aware dual-encoder model, which explicitly conditions predictions on target-text relationships and discourages reliance on loosely correlated contextual cues. In addition, token-level rationale extraction requires grounding decisions in identifiable lexical evidence, thereby limiting unsupported inference. The robustness-oriented training also makes the model more stable to noisy or informal inputs, a common cause of spurious reasoning in stance detection. By incorporating hallucination analysis, the study shows that the proposed model is not only competitive in predictive performance but also reliable and evidence-consistent in reasoning across diverse bilingual and cross-format stance datasets.
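The grounding check underlying such a hallucination rate can be sketched simply; the token-matching criterion and the (input, rationale) examples below are illustrative simplifications of the evaluators' protocol:

```python
def is_hallucinated(rationale_tokens, input_tokens):
    # a rationale is flagged if any cited token is absent from the input
    vocab = set(input_tokens)
    return any(t not in vocab for t in rationale_tokens)

def hallucination_rate(examples):
    # fraction of predictions whose rationale is not grounded in the input
    flagged = sum(is_hallucinated(r, x) for x, r in examples)
    return flagged / len(examples)

# hypothetical (input_tokens, rationale_tokens) pairs
examples = [
    (["i", "support", "the", "policy"], ["support", "policy"]),  # grounded
    (["the", "plan", "is", "unfair"], ["unfair"]),               # grounded
    (["prices", "rose", "today"], ["tax"]),                      # fabricated token
]
rate = hallucination_rate(examples)
assert abs(rate - 1 / 3) < 1e-9
```

In practice the check also covers semantic relevance to the target, which exact token matching alone cannot capture, so evaluator judgment remains part of the protocol.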
To assess the practical deployability of the proposed bilingual stance-detection framework, we conducted a thorough analysis of its computational cost, inference efficiency, and accuracy-efficiency trade-off.
Table 8 highlights latency, throughput, and GPU usage for the key model variants. Compared to the single-encoder model, the dual-encoder introduces a small latency increase (+0.13 ms) but yields absolute gains of +10.5% to +12.4% in accuracy and F1-score. At deployment, the proposed model has a mean inference latency of 1.22 ms per input, corresponding to a throughput of 820 samples/s, making it suitable for high-volume or real-time social media monitoring. The proposed configuration therefore represents a favorable trade-off between accuracy and computational cost. Enhancements such as contrastive alignment and robustness regularization improve cross-lingual generalization and stability with minimal runtime overhead, and shared frozen layers keep the memory overhead at inference minimal. The only marginal throughput degradation confirms the architecture’s operational efficiency. Overall, the empirical results demonstrate that the proposed architecture is computationally efficient while achieving significant improvements in bilingual stance-prediction performance, providing a practical, scalable solution for real-world deployment.
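As a quick consistency check on the reported figures (assuming single-stream, per-sample inference, where throughput is simply the reciprocal of latency):

```python
# throughput (samples/s) ≈ 1000 / latency_ms for single-stream inference
latency_ms = 1.22              # reported mean per-input latency
throughput = 1000.0 / latency_ms
assert 810 < throughput < 830  # ≈ 820 samples/s, matching the reported value
```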
The sample instances outlined in
Table 9 provide qualitative evidence that supports the credibility and practical effectiveness of the proposed stance-detection framework. Across both English and Arabic examples, the model consistently produces predictions that align with the gold labels, suggesting that the learned representations generalize well across languages and topical domains. Importantly, the examples cover the three stance categories: Agree, Disagree, and Neutral, which shows balanced performance rather than bias toward polarized opinions. This is especially apparent in the Neutral cases, where the model correctly identifies hedging or contrastive clues, such as concessive expressions, which are often difficult for stance classifiers. The addition of token-level rationales enhances the model’s credibility by demonstrating that its predictions are grounded in semantically meaningful lexical evidence. In English and Arabic cases, highlighted tokens correspond directly to stance-bearing expressions rather than to incidental or spurious words. This alignment aids the quantitative analysis of hallucinations by demonstrating that the model’s decisions are explainable and evidence-consistent. Additionally, the Arabic examples validate that the model captures stance cues in Modern Standard Arabic without relying on translation artefacts or language-specific heuristics.
5. Discussion
In this research, a framework for target-aware bilingual stance detection is proposed. It incorporates contrastive learning, robustness mechanisms, token-level rationales, a dual-encoder fusion architecture, and efficient fine-tuning. Firstly, a contrastive learning mechanism is used to achieve cross-lingual semantic alignment, allowing the model to map stance-relevant cues from English and Arabic into a common latent space. Secondly, a robustness-enhanced stance learner is developed through noise-resistant encoding and normalization strategies to process noisy, adversarial, and code-switched user-generated text. Thirdly, token-level rationale extraction is integrated to provide comprehensive, bilingual explanations of the stance decision, thereby enhancing interpretability and trustworthiness. Lastly, a dual-encoder architecture with late fusion is proposed to explicitly condition stance predictions on the target while maintaining distinct semantic representations of the target and the text, thereby improving generalizability to novel topics and claim types. These contributions collectively address ongoing issues in stance modeling across languages, topics, target types, and discourse contexts. In contrast to traditional stance classifiers, the proposed study builds a single bilingual model capable of reasoning over English and Arabic. The experimental results show that bilingual semantic alignment, explicit target conditioning, and interpretability improve generalization and cross-domain transferability.
The proposed model outperforms the state-of-the-art methods by addressing cross-lingual transferability, robustness to noisy input text, and target-aware reasoning. While current architectures [
41,
42,
43,
44,
45,
46,
47,
48], including AraBERT, MARBERT, XLM-R, and mBERT, are built on monolithic encoder representations or target-text concatenations, the dual-encoder model distinguishes target semantics from stance-bearing content, yielding highly stable decision boundaries across topic and claim variations. Contrastive learning enables the model to reconcile stance representations across English and Arabic, achieving better zero-shot generalization where existing stance-detection models have failed due to cultural and lexical differences. Robustness-aware processing mitigates noise, dialectal variation, and adversarial perturbations that degrade performance in baseline systems. Token-level rationale extraction improves decision quality by uncovering core stance indicators and mitigating the spurious correlations common in traditionally fine-tuned transformers. Collectively, these architectural advancements enable the proposed model to achieve better accuracy, cross-lingual consistency, and interpretability than existing transformer-based stance-detection systems.
Although trained on topic-level bilingual data, the strong performance of the proposed model on MT-CSD and AraStance compared to the existing approaches [
32,
42,
43,
44,
45,
46,
47,
48] can be attributed to its target-conditioned representation design rather than the learning of dataset-specific patterns. MT-CSD and AraStance differ substantially from the combined bilingual dataset: MT-CSD involves conversational stance, while AraStance focuses on claim-article relations. Conventional stance models entangle target semantics with surface textual cues learned during training, resulting in poor generalization. In contrast, the dual-encoder architecture makes an explicit distinction between target and text representations, allowing the proposed model to express stance as a relational property rather than as a dataset-dependent label association. Moreover, the cross-lingual contrastive alignment objective encourages the model to learn stance-salient semantic dimensions that are independent of language, context length, and discourse form, enabling the transfer of stance cues extracted from short topic-centric inputs to longer conversational or document-level contexts. The robustness regularization also stabilizes predictions by reducing sensitivity to informal phrasing, implicit disagreement, and contextual noise. Ultimately, these mechanisms allow the model to maintain consistent decision boundaries across structurally diverse datasets, resulting in reliable zero-shot generalization without task-specific retraining.
Although recent text-detection models such as SwinTextSpotter, CM-Net, and Text-Pass Filter have achieved reliable performance in visual text localization and recognition, these systems operate in a fundamentally different problem space from the proposed study. Such models aim to extract text from images, whereas in the proposed framework the text is assumed to be available and the problem is addressed at a higher linguistic level: target-aware stance prediction in English and Arabic. Accordingly, the contribution of this study lies not in text detection but in bilingual stance reasoning, target conditioning, and robust cross-format generalization, which are not handled by conventional scene-text approaches.
Similarly, previous studies [
50,
51] that interpret visual targets with a single encoder, such as models used for traffic sign interpretation or first-person scene understanding, rely on joint encoding of target and context. In stance detection, however, this single-encoder formulation leads to entangled representations in which the short, stable target phrase competes with the longer, noisier stance text for representational space. The proposed dual-encoder architecture overcomes this limitation by keeping the target and text encoders independent, preserving clear target semantics and enabling explicit late-fusion interaction. As the ablation results confirm, this separation supports stronger cross-lingual transfer, improved interpretability, and better generalization across heterogeneous stance formats.
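The late-fusion dual-encoder idea can be sketched in a few lines. Here a toy hashed bag-of-words stands in for the two transformer encoders, and a random linear head stands in for the trained classifier; the function names, the fusion via concatenation with an elementwise product, and the three-way label set are illustrative assumptions, not the paper's exact design.

```python
import zlib
import numpy as np

rng = np.random.default_rng(1)

def encode(text, dim=16):
    """Stand-in encoder: a hashed bag-of-words (a transformer in practice)."""
    vec = np.zeros(dim)
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode()) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def dual_encoder_stance(target, text, head):
    """Late fusion: target and text are encoded independently and
    interact only in the fused feature vector fed to the classifier."""
    t = encode(target)                       # short, stable target phrase
    x = encode(text)                         # longer, noisier stance text
    fused = np.concatenate([t, x, t * x])    # explicit late-fusion interaction
    return ["favor", "against", "neutral"][int(np.argmax(fused @ head))]

head = rng.normal(size=(48, 3))              # toy linear classification head
pred = dual_encoder_stance("climate policy",
                           "we must act now to cut emissions", head)
print(pred)
```

Because the target never shares an encoder with the stance text, its representation cannot be crowded out by the longer input, which is the separation the ablation results credit for the improved cross-lingual transfer.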
The study’s findings have broader implications for bilingual stance analysis and cross-lingual conversation monitoring. By demonstrating that effective bilingual stance detection can be achieved without extensive retraining, the proposed framework offers a feasible, resource-efficient solution for analyzing linguistically diverse content. The cross-lingual alignment enables real-time monitoring of narratives that develop in one language and spread to another, which is crucial for applications in crisis response, digital diplomacy, and multilingual content moderation. The interpretability provided by token-level rationales enhances trust and accountability, making the system suitable for high-stakes settings such as fact-checking programs, regulatory oversight, and social media governance. Furthermore, the parameter-efficient fine-tuning approach makes advanced bilingual stance detection accessible to institutions with limited computational resources.
Despite the study’s strengths, it has several limitations. The mixed training strategy is based on topic-level stance data and does not include conversational turns or long-form document structures. Although the model generalizes well to structurally different datasets such as MT-CSD and AraStance in zero-shot settings, the lack of explicit supervision for these discourse formats may affect the model’s real-time performance. The effectiveness of contrastive alignment depends on the range and representativeness of bilingual target-text pairs, and domains with sparser cross-lingual overlap may yield weaker alignment. Token-level rationales, though informative, fail to capture deeper aspects of conversational pragmatics such as sarcasm, idiomatic expressions, and culturally embedded stance cues. Moreover, the limited use of user-level metadata, temporal patterns, and social-interaction dynamics may constrain classification performance. Finally, the study covers only English and Modern Standard Arabic, restricting its generalizability.
Future studies may consider conversational and thread-level models to build stance reasoning that accounts for discourse flow, reply chains, and accumulated context. Developing multimodal stance-detection methods that combine text with images, video, or speech could capture the varied ways in which stance is expressed online. Extending the cross-lingual alignment framework to additional languages, especially those with sociopolitical or cultural links to Arabic, would increase multilingual transferability. Combining token-level rationales with evidence-retrieval or sentence-level justification mechanisms could yield more robust explanations. Incorporating continual learning or domain adaptation techniques may help the model adapt to temporal drift and evolving public discourse. Finally, distilling the ensemble into a single model may yield a reliable stance detector with a smaller computational footprint.