Enhancing Propaganda Detection in Arabic News Context Through Multi-Task Learning

Al-Henaki, Lubna; Al-Khalifa, Hend; Al-Salman, Abdulmalik

doi:10.3390/app15158160


by Lubna Al-Henaki ^1,2,3,*, Hend Al-Khalifa ^1,4 and Abdulmalik Al-Salman ^1,2

¹ STC’s Artificial Intelligence Chair, Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh 11451, Saudi Arabia
² Department of Computer Science, College of Computer and Information Sciences, King Saud University, P.O. Box 2614, Riyadh 13312, Saudi Arabia
³ Department of Computer Science, College of Computer and Information Sciences, Majmaah University, Al-Majmaah 11952, Saudi Arabia
⁴ Department of Information Technology, College of Computer and Information Sciences, King Saud University, P.O. Box 2614, Riyadh 13312, Saudi Arabia
* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(15), 8160; https://doi.org/10.3390/app15158160

Submission received: 22 June 2025 / Revised: 16 July 2025 / Accepted: 17 July 2025 / Published: 22 July 2025

(This article belongs to the Special Issue New Trends in Natural Language Processing)


Abstract

Social media has become a platform for the rapid spread of persuasion techniques that can negatively affect individuals and society. Propaganda detection, a crucial task in natural language processing, aims to identify manipulative content in texts, particularly in news media, by assessing propagandistic intent. Although extensively studied in English, Arabic propaganda detection remains challenging because of the language’s morphological complexity and limited resources. Furthermore, most research has treated propaganda detection as an isolated task, neglecting the influence of sentiments and emotions. The current study addresses this gap by introducing the first multi-task learning (MTL) models for Arabic propaganda detection, integrating sentiment analysis and emotion detection as auxiliary tasks. Three MTL models are introduced: (1) MTL combining all tasks, (2) PSMTL (propaganda and sentiment), and (3) PEMTL (propaganda and emotion) based on transformer architectures. Additionally, seven task-weighting schemes are proposed and evaluated. Experiments demonstrated the superiority of our framework over state-of-the-art methods, achieving a Macro-F1 score of 0.778 and 79% accuracy. The results highlight the importance of integrating sentiment and emotion for enhanced propaganda detection; demonstrate that MTL improves model performance; and provide valuable insights into the interaction among sentiment, emotion, and propaganda.

Keywords: propaganda detection; multi-task learning (MTL); natural language processing (NLP); sentiment analysis; emotion detection

1. Introduction

The vast growth of social media platforms, online news outlets, and digital communication, with over five billion users globally, has transformed the online sphere into an ideal breeding ground for propaganda dissemination [1]. Propaganda is defined as “information, ideas, opinions, or images, often only giving one part of an argument, that are broadcast, published, or in some other way spread with the intention of influencing people’s opinions.” [2]. Consequently, developing automated tools and techniques capable of accurately extracting and analyzing opinions and attitudes embedded within these vast text streams is essential. Propaganda detection, a pivotal task in natural language processing (NLP), focuses on understanding how information is presented and interpreted, particularly within news media, and involves predicting whether a given text constitutes propagandistic content intended to influence public opinion.

Multi-task learning (MTL) is a form of transfer learning in which related tasks are learned jointly through a shared representation, so that knowledge acquired from one task can be leveraged to improve performance on another. MTL has contributed to progress in a wide range of applications, especially in the era of deep learning, such as speech recognition, computer vision [3], and more recently, NLP [4]. In the NLP field, MTL can jointly handle related problems, working toward a more comprehensive understanding of language [5]. Further, MTL has proven effective in a wide range of NLP tasks, including abusive content detection, sentiment analysis, and sarcasm detection [6,7,8]. In general, MTL is applicable when optimization involves more than one loss function, and it can yield more effective results than single-task learning (STL). When certain features are difficult to learn from the primary task alone, one solution is to predict them as an auxiliary task. Overall, designing an effective MTL model involves several challenges, including choosing appropriate related tasks, balancing their impact, and determining the shared representation level [9].

MTL has several significant advantages in the NLP field. One of its main benefits is its ability to enhance task performance through shared information use. For instance, when a model is trained on tasks such as machine translation, part-of-speech tagging, and named entity recognition, it can leverage shared representations from their training to boost performance on all tasks [10]. Another major advantage of MTL is the reduction in the amount of labelled data required for training [11]. By utilizing labelled data from one task, the model can enhance its performance on other tasks. This is particularly useful in NLP, where acquiring labelled data is often costly and time-consuming. Furthermore, MTL strengthens model generalization and robustness to previously unseen data by applying the same feature representations across multiple tasks [12].

Although research on Arabic propaganda is still in its early stages, English propaganda detection has yielded remarkable results. Arabic is the fourth most used language on the internet, with around 237 million Arab users [13]. However, detecting propaganda in Arabic poses unique challenges owing to its complex morphology, orthographic ambiguity, limited availability of linguistic resources (e.g., lexicons), and dialectal variances [14]. Text news provides several indicators that can be used to assess its content’s credibility.

Previous studies on propaganda detection have primarily focused on training models solely for propaganda detection without incorporating auxiliary tasks. Some existing studies have shown a positive interaction between propaganda and sentiment, whereas others have found emotion to be useful in propaganda detection models. Moreover, MTL for propaganda detection has been attempted in only a few studies, in English [15,16] and Arabic [17,18,19], and, to the best of our knowledge, none of them has treated sentiment and emotion as integral auxiliary tasks. To date, no studies have focused on the emotional and sentiment dimensions of Arabic propaganda detection. Building on this, we aim to enhance the propaganda detection model by integrating knowledge of other opinion dimensions (sentiment and emotion tasks). We hypothesize that MTL combining propaganda and multiple opinion dimensions will contribute to the development of more comprehensive and accurate propaganda detection systems.

This study introduces MTL models designed to jointly address three interrelated tasks: propaganda detection, sentiment analysis, and emotion detection. Furthermore, the proposed models utilize various task-weighting schemes to optimize performance, with the primary objective of enhancing propaganda detection. Previous studies have demonstrated MTL’s efficacy in improving machine learning techniques’ performance [11,12,20]. However, MTL frameworks often require more complex task management and data preprocessing than STL approaches. Additionally, while transformer-based architectures have become a cornerstone of modern NLP, their adaptation to MTL scenarios presents unique challenges that are not as prevalent in other deep learning architectures [21]. Therefore, to address these challenges, our work aims to simplify the process of developing MTL models, reduce their implementation complexity, and make the process as straightforward as building STL models. Additionally, the effects of several task-weighting schemes on model performance are investigated, providing empirical insights into their relative effectiveness. This study’s main contributions are summarized as follows:

We introduce the first MTL models that effectively enhance Arabic propaganda detection by incorporating sentiment analysis and emotion detection tasks.
We develop two MTL models: PSMTL, which combines propaganda and sentiment tasks, and PEMTL, which integrates propaganda and emotion tasks. The main objective is to investigate how incorporating sentiment and emotion can enhance propaganda detection models’ effectiveness.
We propose utilizing seven distinct task-weighting techniques and provide empirical evidence supporting their effective implementation in MTL models.
We conduct a comprehensive comparison of the proposed method with state-of-the-art (SOTA) and recent studies utilizing STL and MTL, such as [6,22,23,24,25]. The experimental results demonstrate that the proposed MTL framework outperforms all compared methods.

The remainder of this article is structured as follows: Section 2 provides a review of related SOTA studies in the field, evaluates previous research, and identifies gaps that our study seeks to fill. Section 3 details the methodology employed in our study and outlines the framework and analysis. Section 4 and Section 5 describe the experiments and results, accompanied by in-depth discussion and interpretation, respectively. Section 6 presents the error analysis of the results. Finally, Section 7 provides the conclusions and limitations and outlines future directions.

2. Related Work

This section presents a survey and discussion of related work on propaganda detection viewed as a form of applied persuasion. Although it is not exhaustive, this review organizes previous studies into two subsections: the first examines propaganda detection using single-task models, and the second explores MTL approaches for propaganda detection. Finally, a detailed comparison outlining the criteria for evaluating these studies is presented in the discussion.

2.1. STL for Propaganda Detection

Recently, the research community has made significant efforts to detect propaganda, a form of persuasion, in online news from various news sources and social media sites. Early approaches to propaganda detection relied heavily on feature engineering combined with classical machine learning algorithms. Studies employed Naive Bayes classifiers for probabilistic propaganda identification in news and fact-checking contexts, such as [26,27], whereas support vector machines were implemented in systems such as [28] and dedicated propaganda detection tasks such as [27,29]. Logistic regression demonstrated the effectiveness of fine-grained propaganda detection in news corpora [30,31]. Furthermore, hybrid approaches incorporating decision trees, such as the HAPI system [32], were developed for propaganda identification on social media. K-nearest neighbors algorithms were applied in Arabic propaganda detection shared tasks [33]. While these conventional ML techniques achieved reasonable performance, their reliance on handcrafted linguistic and syntactic features, as well as their failure to account for words’ contextual meaning, limited their adaptability across different languages and evolving propaganda techniques. Therefore, there is a critical need for more sophisticated approaches.

With the rise of deep learning, several researchers have proposed supervised models for propaganda detection that leverage deep learning techniques to automatically learn feature representations from text, including convolutional neural networks [30,34], long short-term memory (LSTM) networks [26,35,36], and bidirectional LSTM (Bi-LSTM) networks [37]. However, these models require a substantial amount of annotated data that are specifically tailored to the task at hand. Acquiring such data presents a significant challenge in real-world NLP applications owing to the inherent linguistic diversity and complexity of natural languages. Consequently, a lack of suitable annotated data can impede supervised learning’s effectiveness in these contexts.

Recently, the transfer learning field in NLP has undergone a significant transformation with the advent of pretrained language models (PLMs). This approach involves leveraging knowledge from related domains, tasks, or languages by maximizing the use of unlabeled data in either the source or the target domain. Several researchers in the field of propaganda detection have embraced transfer learning by utilizing PLMs trained on extensive unlabeled data, followed by the fine-tuning of the models for the classification task. The best systems in SemEval-2020 leveraged pretrained transformers (BERT-based models) and ensemble methods, achieving strong results in identifying propaganda spans and categories [38]. PLMs utilized for propaganda detection include RoBERTa [39] and DeBERTa [40]. Likewise, transformer-based language models specifically trained for Arabic, such as AraBERT, have proven to be highly effective [17,18,23,33,41,42,43,44]. Furthermore, large language models (LLMs), including GPT-4 [45] and LlamaLens [24], have been utilized.

2.2. MTL for Propaganda Detection

MTL is a paradigm in which a single model is trained on multiple related tasks simultaneously, with the goal of improving generalization by sharing knowledge between tasks. In an MTL setup, it is common to have a shared encoder (shared layers) that learns representations used by all tasks and then task-specific decoders (output layers) that produce predictions for each task. This is often referred to as hard parameter sharing, and it has been a successful framework for many NLP problems.

In the propaganda detection field, several recent studies have explored the use of MTL for detecting propaganda in both English and Arabic texts, often through shared-task competitions. These studies vary in the MTL architectures they adopt, the auxiliary tasks they incorporate, and the datasets they use for training. The earliest among the previous studies is [15], which investigated propaganda detection in English memes. It used three datasets, SemEval-2021 Task 6, SemEval-2020 Task 11, and a Fake News corpus, as auxiliary tasks in two MTL setups: sequential (pretraining on the auxiliary tasks, then fine-tuning on the main task) and parallel (joint training). This hierarchical approach aimed to enrich token-level, multi-label classification by first grounding the model in related propaganda and fake news tasks. For the 2023 SemEval shared task, the researchers in [16] proposed a hierarchical MTL architecture using BERT. The model first identified the start index of a persuasive span and then used the embedding of that token for multi-label classification. This structure was an example of a task hierarchy in which span detection feeds directly into persuasion classification. The model was trained using a global loss function that combined the two subtasks’ objectives.

Regarding the Arabic language, the researchers in [17] introduced a multi-label classification setup using AraBERT. The MTL approach decomposed the problem into N binary classification tasks, one for each propaganda technique. This enabled the model to learn distinct representations per technique and fine-tune separate binary heads along with a shared encoder. Similarly, [18] incorporated MTL to jointly learn the genre of the text (as an auxiliary task) and a binary classification task for persuasion detection using a PLM called MARBERT. By injecting genre awareness, the model presumably obtained contextual cues that could influence the identification of persuasive strategies. Furthermore, the authors in [19] addressed Arabic propaganda detection but employed an LLM approach, specifically GPT-3. The architecture routed the [CLS] token embedding to separate binary and multi-label classification heads, one for each of the two subtasks. Although not explicitly referred to as MTL, the dual-headed structure functioned like one, sharing a base representation while optimizing multiple related objectives.

2.3. Discussion

Table 1 summarizes and compares the reviewed studies based on several categories, including reference numbers, the languages used, the models employed, MTL utilization, the best achieved scores, and noted limitations. An analysis of the surveyed studies in propaganda detection, regarded as an application of persuasion, shows that there are still many limitations in previous research:

MTL vs. STL: MTL has demonstrably advanced propaganda detection in both English and Arabic through diverse methodological innovations, including parallel binary classifiers for technique identification, span-based hierarchical pipelines, auxiliary genre classification tasks, and joint training paradigms. By contrast, most existing studies rely on STL for propaganda detection, which lacks the benefits of shared knowledge across related tasks. This isolated approach often limits model generalization, particularly in low-resource settings such as Arabic.
Sentiment and emotions feature: A critical limitation persists across the reviewed MTL frameworks, whether designed in English or Arabic. No framework has yet integrated sentiment or emotion analysis as auxiliary learning objectives, despite affective manipulation’s well-documented role in propaganda. This oversight is particularly notable because propaganda frequently relies on emotional appeal and polarized sentiment to influence audiences. Incorporating such affective dimensions can provide valuable supervisory signals to regularize shared representations and improve model discriminability.
Language: English was the dominant language in most of the previous studies. Although Arabic research has made remarkable progress, especially in shared tasks, to the best of our knowledge, few studies in Arabic have explored propaganda detection. This research area is in its early stages, and Arabic researchers have only recently started to explore it. The MTL framework is particularly valuable for Arabic propaganda detection, where data scarcity and linguistic complexity pose significant challenges. By leveraging shared knowledge across tasks, MTL can enhance generalization, reduce overfitting, and improve model robustness in low-resource scenarios. For Arabic specifically, MTL addresses two critical gaps. The first is data efficiency: parameter sharing across tasks mitigates the need for large labeled datasets, which helps offset the scarcity of annotated Arabic propaganda corpora. The second is contextual awareness: auxiliary tasks, such as emotion or intent prediction, help decode culturally embedded persuasion tactics. However, challenges remain, including task imbalance (e.g., subtasks with noisier labels) and the need for architectures that better balance shared and task-specific learning.

3. Methodology

This section presents the framework of the MTL model proposed in the current study. The model integrated seven distinct task-weighting schemes to optimize the learning process across three interrelated tasks: propaganda detection (the primary task) and sentiment analysis and emotion detection (the auxiliary tasks), with the aim of improving propaganda detection. Further, two models, PSMTL and PEMTL, were developed. Although the primary focus was propaganda detection, the inclusion of the auxiliary tasks (sentiment analysis and emotion detection) enhanced the model’s understanding of textual data, consequently improving its primary task performance. Figure 1 provides a high-level workflow for the proposed MTL model. The workflow was divided into three main parts: the input layer, the shared layer, and the task-specific layer. Section 3.2, Section 3.3, Section 3.4 and Section 3.5 explain each part in detail.

The parallel architecture, in which all tasks shared a backbone model but had separate output layers, was the proposed MTL models’ core architecture. The backbone model involved fine-tuning the AraBERT model [46], which encoded text news and labels as hidden representations. This model was selected because it was the best-performing pretrained model in [22]. Although the Hugging Face Transformers library, designed for training BERT-based models, currently supports single-task models, it does not natively accommodate modular task heads. To address this limitation, a hard-parameter sharing strategy was implemented. In this approach, the parameters of the shared layers were shared between all related tasks. Each task had its own output layers, referred to as task heads. This approach enabled the model to learn a shared feature representation, which effectively supported the simultaneous modeling of all tasks [9]. As Figure 1 illustrates, the proposed model followed a pipeline consisting of three components: input, shared, and task-specific layers. Algorithm 1 shows the main design steps of the proposed MTL for Arabic propaganda detection. The following subsections provide a comprehensive description of each component and the mathematical formulation of the problem.

Algorithm 1: The Proposed MTL for Arabic Propaganda Detection

Input:  D: a raw dataset containing Arabic documents,
        Tasks: the set {Propaganda Detection, Sentiment Detection, Emotion Detection},
        w: initial weights for task losses
Output: Preds: predictions for propaganda labels

Preds ← ∅
for all d ∈ D do
    cleaned_d ← preprocess(d)                          // Cleaning and normalization
    tokens_d ← tokenize(cleaned_d)                     // Tokenization via AraBERT tokenizer
    input_ids, attention_masks, token_type_ids ← tokens_d
    for task ∈ Tasks do
        Y_task ← generate_labels(d, task)              // Propaganda, Sentiment, Emotion labels
        Task_ID ← assign_taskID(task)                  // e.g., Propaganda:0, Sentiment:1, Emotion:2
    end for
    batch ← create_batches(input_ids, attention_masks, token_type_ids, Task_ID, Y_task)
    encoded_batch ← AraBERT_encoder(batch)             // Shared AraBERT encoder
    θ_shared ← extract_pooled_output(encoded_batch)    // Shared features extraction
    for task ∈ Tasks do
        task_embedding ← get_task_embedding(Task_ID)
        combined_representation ← θ_shared ⊕ task_embedding
        task_preds ← task_specific_head(combined_representation, task)
        L_task ← compute_task_loss(task_preds, Y_task)
    end for
    L_global ← w1 × L_propaganda + w2 × L_sentiment + w3 × L_emotion   // Weighted sum
    update_weights(encoder, task_heads, L_global)      // Backpropagation
    preds ← predict(encoded_batch, propaganda_head)
    Preds ← Preds ∪ preds
end for
return Preds

3.1. Mathematical Notations of the Problem

For a better description of the proposed MTL model for Arabic propaganda detection, the following mathematical notation can be introduced:

Let $\{D_t\}_{t=1}^{T}$ be the data from the task set. Specifically,

$$D_t = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\} = \{(x_i, y_i)\}_{i=1}^{N} \qquad (1)$$

where

$D_t$ represents the training data for task $t$.
$N$ represents the number of examples in $D_t$.
$x_i$ represents the input text (news text).
$y_i$ represents the corresponding label of task $t$. The label for each task is associated with the input $x_i$, where $y_{it}$ denotes the label for the $t^{th}$ task. For example, $y_{i1}$ may represent the label for propaganda detection (1 = propaganda, 2 = nonpropaganda), $y_{i2}$ may represent sentiment analysis labels (1 = positive, 2 = negative, 3 = neutral), and $y_{i3}$ may represent emotion detection labels (1 = happiness, 2 = sadness, 3 = anger, 4 = fear, 5 = none).
$T$ represents the total number of tasks. Each task $t$ corresponds to a specific task such as propaganda detection, sentiment analysis, or emotion detection.
Each pair $(x_i, y_i)$ is used for training.

$$H = \{h_1, h_2, \dots, h_M\} \qquad (2)$$

where

H represents the set of shared hidden layers across all tasks. These layers are responsible for extracting generalized features from the input text that are relevant to all tasks.

$$\{Z_t\}_{t=1}^{T} \qquad (3)$$

where

$Z_{t}$ represents the task descriptor, or task-specific dictionary, generated in the shared layers for task t, where $t$ ∈ {1, 2, 3, …, T} denotes each task in the MTL framework, as described earlier. Specifically, $Z_{t}$ is a task-specific feature vector that captures the relevant information for task $t$ derived from the shared representations learned in the shared layers. This descriptor is then passed to the task-specific layers, where it guides the model’s learning process for the corresponding task.

$$L_t(\hat{y}_i, y_i) \qquad (4)$$

where

$L_t$ represents the loss function for task $t$. This loss measures the discrepancy between the predicted label $\hat{y}_i$ and the true label $y_i$ for task $t$.

$$\theta_{sh} = \{\theta_{sh,1}, \theta_{sh,2}, \dots, \theta_{sh,n}\} \qquad (5)$$

where

$\theta_{sh}$ represents the set of parameters shared across the MTL model during the encoding process.
$n$ represents the total number of shared parameters in the encoding stage.

$$\theta_t = \{\theta_{t,1}, \theta_{t,2}, \dots, \theta_{t,m}\} \qquad (6)$$

where

$\theta_t$ represents the task-specific parameters for the output decoder heads corresponding to task $t$. These parameters are unique to each task and are used to generate final predictions for each task.
$m$ represents the total number of parameters specific to task $t$ in the output decoder.

Table 2 provides summary definitions of the symbols employed throughout the description of the proposed framework.

3.2. Input Layer

The MTL model’s input layer was responsible for preparing the raw data for further processing. After the MultiProSE dataset was constructed to make it suitable for use in the proposed MTL, as explained in our previous study [22], the input text ($x_i$) was preprocessed. Generally, the data preprocessing process was divided into two main phases, as Figure 2 shows. The first phase involved handling the raw news text, whereas the second phase focused on formatting the data to meet the input requirements of transformer-based models. The first phase included character normalization, diacritical mark elimination, stop word elimination, and the exclusion of non-Arabic text. Subsequently, the input text was tokenized using the WordPiece tokenizer [47], which was compatible with the BERT backbone model. Tokenization enabled the creation of word vectors and effectively addressed the challenge of out-of-vocabulary words by breaking them down into root words and sub-words.
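The first preprocessing phase can be sketched in a few lines of Python. This is a minimal illustration only: the specific normalization rules (unifying Alef variants, removing tashkeel and tatweel), the Unicode ranges, and the three-word stop list are assumptions chosen for the example, since the exact rules applied to MultiProSE are described in [22] rather than here.

```python
import re

# Assumed rules, common in Arabic NLP pipelines; not taken from the paper.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")    # tashkeel marks and tatweel
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # Alef with madda/hamza forms
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # anything but Arabic letters/space
STOP_WORDS = {"في", "من", "على"}                     # hypothetical, tiny stop-word list


def preprocess(text: str) -> str:
    """Normalize characters, drop diacritics, non-Arabic text, and stop words."""
    text = DIACRITICS.sub("", text)           # remove diacritics before splitting
    text = ALEF_VARIANTS.sub("\u0627", text)  # map Alef variants to bare Alef
    text = NON_ARABIC.sub(" ", text)          # Latin letters, digits, punctuation
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("ذَهَبَ أحمد إلى المدرسة في 2024"))
```

Diacritics are removed before the non-Arabic filter so that they are deleted rather than replaced by spaces, which would otherwise split words apart.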

Once the data preprocessing and tokenization were completed, the multi-task dataset was created by merging samples from three task-specific datasets (i.e., propaganda, sentiment, and emotion). Each sample of the multi-task dataset consisted of the news text, a label, the task type, and the task ID. All three tasks were assigned the “seq_classification” task type because they involved sentence classification. Furthermore, a unique task ID was added to each sample as a new token, labeled “task_ids,” which assisted the model in properly handling the samples for each task. Next, the tokenized batch was fed into the shared layer, as the next subsection explains.
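As a rough sketch of the merging step described above, the following Python snippet builds one multi-task list from three task-specific lists. The `Sample` class, the `build_multitask_dataset` helper, and the shuffling step are hypothetical names and choices made for illustration; only the four fields per sample, the "seq_classification" task type, and the task ID mapping from Algorithm 1 come from the text.

```python
import random
from dataclasses import dataclass

# Task ID mapping as given in Algorithm 1.
TASK_IDS = {"propaganda": 0, "sentiment": 1, "emotion": 2}


@dataclass
class Sample:
    text: str        # news text
    label: int       # label for the sample's own task
    task_type: str   # always "seq_classification" here
    task_id: int     # tells the model which head to use


def build_multitask_dataset(task_data, seed=0):
    """Merge {task_name: [(text, label), ...]} into one shuffled list of Samples."""
    merged = [
        Sample(text, label, "seq_classification", TASK_IDS[task])
        for task, pairs in task_data.items()
        for text, label in pairs
    ]
    random.Random(seed).shuffle(merged)  # assumption: tasks are interleaved
    return merged


if __name__ == "__main__":
    toy = {
        "propaganda": [("text a", 1), ("text b", 0)],
        "sentiment": [("text c", 2)],
        "emotion": [("text d", 4)],
    }
    for sample in build_multitask_dataset(toy):
        print(sample)
```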

3.3. Shared Layer

The shared layer of the MTL model was responsible for learning generalizable representations of each token in the input. Task-specific layers then utilized these representations to enhance each task’s performance. The shared layer consisted of two primary components: a shared encoder and a task-specific dictionary (task descriptor) for each individual task model.

First, the shared encoder processed the tokenized inputs using a pretrained 12-layer AraBERT model, transforming them into three distinct representations: token embeddings, segment embeddings, and position embeddings. These representations were then combined element-wise (summed, as in the standard BERT embedding layer) into a unified feature representation of size 128 × 768, which was subsequently passed through the AraBERT PLM for fine-tuning. During the fine-tuning process, the model adapted its learned contextual embeddings to each specific task within the multi-task framework.
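In BERT-style encoders, the token, segment, and position embeddings are combined by element-wise addition (followed by layer normalization and dropout, omitted here). Assuming the same holds for the AraBERT backbone, the combination step can be illustrated with toy sizes in plain Python; `make_table` and `embed` are hypothetical helper names, and the random tables stand in for learned embedding matrices.

```python
import random

SEQ_LEN, HIDDEN = 4, 8  # toy sizes; the paper reports 128 x 768


def make_table(rows, cols, rng):
    """Random stand-in for a learned embedding table."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def embed(input_ids, token_type_ids, tok_table, seg_table, pos_table):
    """Return a len(input_ids) x HIDDEN matrix: token + segment + position."""
    return [
        [t + s + p for t, s, p in zip(tok_table[tid], seg_table[sid], pos_table[pos])]
        for pos, (tid, sid) in enumerate(zip(input_ids, token_type_ids))
    ]


rng = random.Random(0)
tok_table = make_table(10, HIDDEN, rng)       # vocabulary of 10 toy tokens
seg_table = make_table(2, HIDDEN, rng)        # segment A / segment B
pos_table = make_table(SEQ_LEN, HIDDEN, rng)  # one row per position

features = embed([3, 1, 4, 1], [0, 0, 0, 0], tok_table, seg_table, pos_table)
print(len(features), len(features[0]))
```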

Second, the task descriptor facilitated the association of each input with its corresponding task during the joint training. This task descriptor helped guide the model to distinguish between different tasks, ensuring each input was processed correctly. Finally, the task descriptor was passed to the task-specific layers, which were responsible for generating predictions for the task, as explained in the following section.

3.4. Task-Specific Layer

This section explains the last component of the proposed MTL framework: the task-specific layer, specifically, the MTL objective. The task-specific layer defined the MTL training objective (global loss function) by jointly minimizing the loss of each task and applying task-specific transformations to generate the output relevant to its particular task. For example, a propaganda detection task classified news text as propaganda or nonpropaganda. For clarity, the term “MTL training objective” referred to the final learning objective, namely, the global loss function, whereas the term “loss” referred to the individual loss functions used for each specific task within this objective. Table 3 explains the mathematical formulation for calculating the objective in both the single-task model and the proposed MTL model. Table 2 defines the notations for all symbols.

In a single-task model, the training process focuses on a single dataset and a unified objective that optimizes one model to minimize a single loss function. This approach treats tasks independently and does not share knowledge across tasks. In contrast, a multi-task model introduces a shared representation space using a common encoder. Each task is associated with a specific descriptor and a dedicated head. The model simultaneously learns shared and task-specific parameters by minimizing a global loss function, which is the weighted sum of individual task losses. This enables the model to generalize better by leveraging task interdependencies and shared linguistic patterns. The next subsection discusses the MTL architecture.
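Using the notation of Section 3.1, the two objectives can be sketched as follows. This is an illustrative reading of the description above rather than a reproduction of Table 3; the encoder $f$, the heads $g$ and $g_t$, the per-task sample count $N_t$, and the weights $w_t$ are notational assumptions introduced for the sketch. Setting every $w_t = 1$ recovers a plain sum of task losses.

```latex
% Single-task learning: one dataset, one loss, one parameter set \theta
\min_{\theta} \; \frac{1}{N} \sum_{i=1}^{N} L\bigl(\hat{y}_i, y_i\bigr),
\qquad \hat{y}_i = g\bigl(f(x_i; \theta)\bigr)

% Multi-task learning: shared parameters \theta_{sh}, task-specific parameters \theta_t
\min_{\theta_{sh},\, \theta_1, \dots, \theta_T} \;
\sum_{t=1}^{T} w_t \, \frac{1}{N_t} \sum_{i=1}^{N_t}
L_t\bigl(g_t(f(x_i; \theta_{sh}); \theta_t),\; y_{it}\bigr)
```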

3.5. MTL Architecture

In general, there are several MTL model architectures. One such architecture is the parallel model, where all tasks share a common backbone model and are trained simultaneously, and each task has its own separate output layer. Figure 3 illustrates the high-level architecture of the proposed MTL. Generally, the encoded batch output was fed into a task ID embedding lookup module, which routed shared representations to three task-specific heads: propaganda, sentiment, and emotion. Each head produced task-specific outputs with individual loss functions weighted and combined into a global loss used for joint training. The final propaganda prediction was generated via a softmax layer.

As Figure 3 illustrates, the task-specific head layer in the proposed MTL framework was designed to handle multiple tasks simultaneously by allocating a dedicated decoder head for each task. After the shared encoder processed the input and transformed it into a pooled output, it was passed to a task identification module that performed an embedding lookup based on the task ID. This embedding guided the model in selecting the appropriate task-specific head: propaganda, sentiment, or emotion.
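The routing described above can be illustrated with a small, framework-free Python sketch. It assumes that the ⊕ operator in Algorithm 1 denotes concatenation of the pooled output with the task embedding and that each head is a single linear layer followed by softmax; the toy sizes, random weights, and the `forward` helper are hypothetical. The class counts (2, 3, 5) follow the label sets listed in Section 3.1.

```python
import math
import random

HIDDEN, TASK_DIM = 6, 2           # toy sizes (assumed)
NUM_CLASSES = {0: 2, 1: 3, 2: 5}  # task ID -> classes: propaganda, sentiment, emotion

rng = random.Random(0)
# One embedding vector per task ID (the "embedding lookup" table).
task_embeddings = {
    tid: [rng.uniform(-1, 1) for _ in range(TASK_DIM)] for tid in NUM_CLASSES
}
# One linear head per task: a (num_classes x (HIDDEN + TASK_DIM)) weight matrix.
heads = {
    tid: [[rng.uniform(-1, 1) for _ in range(HIDDEN + TASK_DIM)] for _ in range(n)]
    for tid, n in NUM_CLASSES.items()
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(pooled, task_id):
    """Route a pooled encoder output to the head selected by task_id."""
    combined = pooled + task_embeddings[task_id]  # list concatenation stands in for ⊕
    logits = [sum(w * x for w, x in zip(row, combined)) for row in heads[task_id]]
    return softmax(logits)


pooled_output = [0.1] * HIDDEN  # stand-in for the shared encoder's pooled output
print(forward(pooled_output, task_id=0))
```

Because the same `pooled_output` feeds every head, only the small head matrices and task embeddings differ between tasks, which is the essence of hard parameter sharing.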

Each of these heads was parameterized by its own set of weights $\theta_t$, enabling the model to learn specialized representations for each task while still leveraging the shared knowledge encoded in $\theta_{sh}$. Furthermore, the proposed MTL model dynamically optimized task performance through learnable loss weights $w_t$ for each task-specific loss $L_t$. This weighting scheme, as detailed in Section 3.6, ensured balanced gradient propagation and optimal task-specific representation learning. Subsequently, these weighted losses were aggregated to form the global loss $L$.

3.6. Task-Weighting

Section 3.5 discusses one of the primary challenges of MTL, which is the construction of architectures capable of learning multiple tasks simultaneously. In this section, another important challenge in MTL was addressed: the task-weighting problem, commonly considered an optimization challenge called multi-task optimization (MTO). It aims to resolve the imbalances and conflicts of tasks during MTL. It is particularly important to assign appropriate weights to each task to carefully manage the joint learning process. This prevents any single task from exerting a dominant influence on the MTL network parameters and ensures a balanced focus between improving the main task’s performance and utilizing the contributions of auxiliary tasks. Each task may have a different MTL objective. For example, classification tasks commonly utilize the cross-entropy loss function, whereas regression tasks typically employ the mean squared error. The following subsections present a comprehensive explanation of task-weighting approaches, followed by a detailed description of the various methods proposed and implemented in the context of the proposed MTL framework.

3.6.1. MTO Weighting Schemes

Optimization techniques for training MTL models are as crucial as the design of the model architectures, as Section 3.5 describes. Several task-weighting schemes have been proposed to address challenges, such as task imbalance and the need to prioritize specific tasks [11,48,49]. These task-weighting schemes can be broadly categorized into two main classes: static weighting and learning (adaptive) weighting. In static weighting, the weights are fixed before training and can be categorized as follows [50]:

Equal weighting: This approach involves assigning uniform and constant weights to each task’s loss function throughout the training process. Although conceptually straightforward and easy to implement, it inherently assumes that all tasks hold equal importance.
Proportional weighting: This approach assigns a weight to each task loss in proportion to the task’s relative importance, which is typically set by a domain expert or a heuristic.

In contrast, learning weighting treats the coefficients as learnable variables by learning the optimal weight for each task loss during training. Unlike static weighting schemes, adaptive methods dynamically update task weights based on real-time signals, potentially leading to the MTL model’s improved overall performance. These adaptive weighting techniques can be broadly categorized as follows:

Uncertainty weighting (UW): This approach involves learning to estimate homoscedastic noise terms (uncertainty) associated with each task’s loss and rescaling each loss inversely to its estimated noise. Therefore, tasks with higher uncertainty (i.e., those that are more difficult to learn) are assigned lower weights, and tasks with lower uncertainty (i.e., those that are easier to learn) are assigned higher weights. This scheme aims to minimize the overall uncertainty in the learning process, effectively prioritize more stable tasks, and down-weight those that contribute more noise or variance.
Dynamic weighting: This approach involves reweighting tasks according to recent loss function values. Tasks that improve rapidly (i.e., those whose loss decreases quickly) may be down-weighted to prevent them from dominating the learning process, whereas tasks that improve more slowly (i.e., those whose loss decreases slowly) may be up-weighted to accelerate their learning. This scheme aims to dynamically balance each task’s contribution, ensuring that the model gives more attention to still-underperforming tasks, thus enhancing the overall task learning efficiency.

3.6.2. MTL Training Objective

As Figure 3 illustrates, the global loss is computed by combining multiple loss functions corresponding to the various tasks involved in the model training. It is essential to carefully balance all tasks’ joint learning to prevent a situation in which one or more tasks have a significant influence on network weights, a phenomenon known as task imbalance. Specifically, if one loss function is significantly larger than the other, the corresponding task may dominate the training process, exacerbating the task imbalance problem. Furthermore, some loss functions may converge more quickly or become more crucial to the overall system objective.

Additionally, the optimization approach does not explicitly account for the losses associated with each task. Consequently, the relative weights assigned to each task heavily influence the MTL model’s performance. For example, when the weights for all tasks, except one, are set to zero, the optimization concentrates solely on the task with a nonzero weight, effectively disregarding the others. In the developed MTL model, our focus was on the main task of propaganda detection. Thus, the propaganda detection task must be prioritized during training, with sentiment analysis and emotion detection treated as auxiliary tasks. To achieve this goal, the MTL model’s objective function, as presented in Table 2, was modified by introducing a task importance coefficient (i.e., weight $w_{t}$) for task $t$, as seen in Equation (7):

$$\mathrm{Obj}(\mathrm{MTL}) = \min_{\theta_{sh}, \theta_{1}, \dots, \theta_{T}} \sum_{t = 1}^{T} w_{t} L_{t}(\{\theta_{sh}, \theta_{t}\}, D_{t}) \tag{7}$$

It is a widely adopted practice to treat weights as hyperparameters, which are often determined using methods such as grid search or expert judgment. Furthermore, some weight adaptation techniques modify the MTL optimization framework by adjusting the task weights during training, as Section 3.6.3 explains. In the next subsection, various weighting schemes proposed for the MTL models are explained.

3.6.3. Task-Weighting Schemes

As MTO plays a critical role in the performance of MTL, as discussed in Section 3.6.1, various weighting schemes were proposed and developed in the experimental study conducted, as will be detailed in Section 5. Specifically, the following seven loss-balanced task-weighting schemes were proposed for inclusion in the MTL training objective.

1. Uniform Averaging Weighting (UAW)

This approach, commonly known as equal weighting or average loss aggregation, allocated equal importance to all tasks by averaging their loss values. It was considered one of the simplest and most direct methods. Equation (8) demonstrates this calculation:

$$L_{global} = \frac{1}{T} \sum_{t = 1}^{T} L_{t} \tag{8}$$

where the loss function for the $t^{th}$ task, $L_{t}$, is calculated as in Equation (9):

$$L_{t} = L_{Pro} + L_{Sen} + L_{Emo} \tag{9}$$

where

$T$ represents the total number of tasks.
$t$ represents each individual task in the summation.
$L_{t}$ represents the loss function for the $t^{th}$ task.
$\frac{1}{T}$ represents the normalization of the total loss across tasks.
$L_{Pro}$, $L_{Sen}$, and $L_{Emo}$ represent the propaganda, sentiment, and emotion losses, respectively, and are calculated as in Equation (10):

$$L_{t} = \begin{cases} \mathrm{BCE}(\mathrm{logits}, y_{t}), & \text{if } t = \text{propaganda} \\ \mathrm{CCE}(\mathrm{logits}, y_{t}), & \text{otherwise} \end{cases} \tag{10}$$

Binary cross-entropy (BCE) was employed to calculate the loss for the propaganda classification task. The main goal of BCE loss was to measure the difference between the predicted probability and the true label for each class. Moreover, BCE treated each class independently and calculated the loss separately; it is formally defined in Equation (11). For the sentiment and emotion classification tasks, categorical cross-entropy (CCE) was calculated as seen in Equation (13).

$$L_{BCE} = -\left(t \log(p) + (1 - t) \log(1 - p)\right) \tag{11}$$

where

$t$ is the true label (either 0 or 1).
$p$ is the sigmoid probability output of the positive class. The sigmoid activation function applies an independent probability estimation to each class by mapping the output scores to the range (0, 1). Thus, it allows each class to be predicted independently. The probability $P(c_{i})$ for class $c_{i}$ is computed as seen in Equation (12):

$$P(c_{i}) = \mathrm{Sigmoid}(y_{i}) = \frac{1}{1 + e^{-y_{i}}} \tag{12}$$

The CCE loss for the sentiment and emotion tasks is given in Equation (13):

$$L_{\text{Emotion or Sentiment}} = -\sum_{i = 1}^{K} t_{i} \log(p(c_{i})) \tag{13}$$

where

$K$ is the total number of classes.
$t_{i}$ represents the true category distribution in one-hot encoding, where 1 represents the correct class and 0 represents the others.
$p(c_{i})$ is the softmax probability distribution for class $i$.
The negative sum ensures that incorrect predictions are penalized by minimizing the negative log-likelihood of the correct prediction.
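The losses in Equations (8) and (11)–(13) can be illustrated with a minimal pure-Python sketch for a single example. The logits below are illustrative placeholders rather than values from the experiments, and the function names are hypothetical; an actual implementation would operate on batched tensors.

```python
import math

def sigmoid(y):
    """Map a raw score to a probability in (0, 1), as in Equation (12)."""
    return 1.0 / (1.0 + math.exp(-y))

def softmax(scores):
    """Turn raw scores into a probability distribution over K classes."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bce_loss(logit, target):
    """Binary cross-entropy of Equation (11) for one example."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def cce_loss(logits, true_index):
    """Categorical cross-entropy of Equation (13) with a one-hot target."""
    return -math.log(softmax(logits)[true_index])

def uniform_average(task_losses):
    """Equation (8): mean of the per-task losses."""
    return sum(task_losses) / len(task_losses)

# Illustrative logits only; they are not taken from the paper.
loss_pro = bce_loss(logit=0.0, target=1)                       # propaganda (binary)
loss_sen = cce_loss([0.0, 0.0, 0.0], true_index=0)             # sentiment (3 classes)
loss_emo = cce_loss([0.0, 0.0, 0.0, 0.0, 0.0], true_index=4)   # emotion (5 classes)
loss_global = uniform_average([loss_pro, loss_sen, loss_emo])
```

With all-zero logits every class is equally likely, so each loss reduces to the logarithm of the number of classes, which makes the sketch easy to check by hand.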

2. Linear Scalarization (LS)

This approach, commonly referred to as unweighted sum loss aggregation, involves summing the losses of all tasks without any normalization. The calculation is represented by Equation (14):

$$L_{global} = \sum_{t = 1}^{T} L_{t} \tag{14}$$

where the loss function for the $t^{th}$ task, $L_{t}$, is calculated as seen in Equation (15):

$$L_{t} = L_{Pro} + L_{Sen} + L_{Emo} \tag{15}$$
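Comparing Equation (14) with Equation (8) shows that the two aggregates differ only by the constant factor $T$, which rescales the magnitude of the global loss and of its gradient. The following sketch makes this relationship concrete; the loss values and function names are hypothetical.

```python
def linear_scalarization(task_losses):
    """Equation (14): plain sum of the per-task losses."""
    return sum(task_losses)

def uniform_average(task_losses):
    """Equation (8): mean of the per-task losses, repeated here for comparison."""
    return sum(task_losses) / len(task_losses)

# Hypothetical per-task losses (propaganda, sentiment, emotion).
losses = [0.6, 1.0, 1.4]
ls = linear_scalarization(losses)
uaw = uniform_average(losses)
ratio = ls / uaw  # equals the number of tasks, T = 3
```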

3. Dynamic Difficulty Weighting (DDW)

Instead of traditional weighting schemes, we introduced a DDW scheme, a hybrid approach combining both static and dynamic elements. Based on the intuition that tasks with higher training loss should receive more attention, and inspired by the work in [51], which utilized dynamic weight averaging, DDW uniquely prioritized the primary task, propaganda detection, by training the network to learn a single parameter ($w$), while assigning relatively smaller weights to the sentiment and emotion tasks. Simultaneously, it dynamically learned a task-specific difficulty score that reflected the difficulty of a task at any given moment during training. The difficulty score was adjusted based on task-specific losses. This method represented a key innovation by integrating the priority for the main task and dynamic weighting for auxiliary tasks based on real-time loss trends. The following Equation (16) demonstrates this calculation:

$$L_{global} = w_{prop} \cdot \mathrm{DS}_{prop} + \frac{w_{Sen}}{4} \cdot \mathrm{DS}_{Sen} + \frac{w_{Emo}}{3} \cdot \mathrm{DS}_{Emo} \tag{16}$$

where

$w_{prop}$ represents a constant propaganda weighting that ensures that the propaganda task remains prioritized even if its loss decreases over time.
$\mathrm{DS}_{prop}$ represents the difficulty score of the propaganda task, which is based on its loss.
$w_{Sen}$ and $w_{Emo}$ represent constant scaling factors for the sentiment and emotion tasks, respectively.
$\mathrm{DS}_{Sen}$ and $\mathrm{DS}_{Emo}$ represent difficulty scores for the sentiment and emotion tasks, respectively.

$w_{prop}$ was assigned to prioritize the propaganda detection task, while relatively smaller weights were assigned to the sentiment and emotion tasks. Based on our empirical analysis and the observation of loss values during model training, we concluded that setting these parameters to specified values, as shown in Equation (16), yielded the best performance for the developed MTL models. Section 4.3 provides a detailed explanation of how these parameters were selected.
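The aggregation in Equation (16) can be sketched as follows. Because the text states only that each difficulty score is based on the task’s loss, the scores are passed in as plain numbers here, and the default weights of 1.0 are placeholders rather than the values used in the experiments; the function name is hypothetical.

```python
def ddw_global_loss(ds_prop, ds_sen, ds_emo, w_prop=1.0, w_sen=1.0, w_emo=1.0):
    """Equation (16): prioritized propaganda term plus down-scaled auxiliary terms.

    The difficulty scores are supplied by the caller, and the default weights
    are placeholders, not the values selected in the paper.
    """
    return w_prop * ds_prop + (w_sen / 4) * ds_sen + (w_emo / 3) * ds_emo

# Equal difficulty scores make the built-in priority visible.
total = ddw_global_loss(ds_prop=1.0, ds_sen=1.0, ds_emo=1.0)
share_prop = 1.0 / total  # fraction of the global loss owed to the propaganda term
```

Under these placeholder settings, the fixed divisors 4 and 3 alone leave the propaganda term with more than half of the global loss.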

4. Priority-Guided Random Weighting (PGRW)

This scheme was motivated by random loss weighting (RLW) [52], which in its original form assigns random weights to the task losses and treats all tasks equally. However, this approach may not be ideal when certain tasks, such as propaganda detection, are more important. To address this problem, we introduced a priority-guided parameter (α) that assigned a higher priority to propaganda detection while maintaining a controlled degree of randomness in the learning process. The introduced scheme balanced the importance of tasks while maintaining the flexibility of random weighting, which helped in generalization and prevented overfitting.

In the developed MTL model, several strategies for dynamically assigning task weights were employed, utilizing three distinct sampling distributions: normal, uniform, and Bernoulli. Each distribution introduced a unique behavior into how the task importance was stochastically determined during training. The normal distribution provided unbounded values centered around zero, introducing high variability. The uniform distribution generated values strictly within the range of [0.01, 1.0], ensuring that all weights remained positive and nonzero. In contrast, the Bernoulli distribution simulated a more binary decision process in which each task was assigned a high weight (0.99) or near-zero weight (0.01) with equal probability. Moreover, the priority-guided parameter (α) was sampled from a uniform distribution. By systematically comparing these strategies, we sought to gain a deeper understanding of how varying levels of randomness and task prioritization influenced the overall model performance. This was particularly relevant when certain tasks, such as propaganda detection, were considered more critical and therefore required a higher priority in the learning process. Equation (17) shows the calculation:

$$L_{global} = \sum_{t = 1}^{T} w_{t} L_{t} \tag{17}$$

where the learnable generated weight $w_{t}$ is dynamically updated as seen in Equation (18):

$$w_{t} = \frac{e^{\alpha \cdot I_{pro}(t) \cdot \lambda_{t} + \lambda_{t}}}{\sum_{j \in T} e^{\alpha \cdot I_{pro}(j) \cdot \lambda_{j} + \lambda_{j}}} \tag{18}$$

The simplified form of the above equation yields the following:

$$w_{t} = \frac{e^{\lambda_{t}(1 + \alpha \cdot I_{pro}(t))}}{\sum_{j \in T} e^{\lambda_{j}(1 + \alpha \cdot I_{pro}(j))}} \tag{19}$$

where

$T$ represents the total number of tasks.
$t$ represents each individual task in the summation.
$L_{t}$ represents the loss function for the $t^{th}$ task.
$\alpha$ represents the priority-guided parameter.
$\lambda_{t}$ represents the randomly sampled task weight for task $t$.
$I_{pro}(t)$ represents the indicator function (1 if $t$ = propaganda, and 0 otherwise).
$w_{t}$ represents the final normalized task weight.
$j$ represents a variable that iterates through all the tasks in $T$ when computing the sum (softmax denominator).
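A minimal sketch of the priority-guided weights is given below for the uniform sampling variant, computed as a softmax over the tasks. The sampling range for α and the fixed seed are assumptions made for illustration (the text states only that α is drawn from a uniform distribution), and the function names are hypothetical.

```python
import math
import random

TASKS = ["propaganda", "sentiment", "emotion"]

def sample_lambdas(rng, low=0.01, high=1.0):
    """Draw one random weight per task from the uniform range given in the text."""
    return {task: rng.uniform(low, high) for task in TASKS}

def pgrw_weights(lambdas, alpha, main_task="propaganda"):
    """Softmax over lambda_t * (1 + alpha * I_pro(t)), following Equation (19)."""
    scores = {
        task: lam * (1.0 + alpha * (1.0 if task == main_task else 0.0))
        for task, lam in lambdas.items()
    }
    peak = max(scores.values())  # stabilise the exponentials
    exps = {task: math.exp(s - peak) for task, s in scores.items()}
    total = sum(exps.values())
    return {task: e / total for task, e in exps.items()}

rng = random.Random(0)           # fixed seed so the sketch is repeatable
lambdas = sample_lambdas(rng)
alpha = rng.uniform(0.0, 1.0)    # assumed range; the text only says "uniform"
weights = pgrw_weights(lambdas, alpha)
baseline = pgrw_weights(lambdas, alpha=0.0)  # plain random weighting, no priority
```

Because the sampled values are positive, a nonnegative α can only raise the propaganda score relative to the others, so the propaganda weight is never smaller than under plain random weighting.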

5. Hierarchical Weighting (HW)

The researchers in [53] employed a hierarchical MTL framework with a structured weighting scheme for task prioritization. The model jointly trained three hierarchically related tasks: sentence function identification, paragraph function identification, and organization evaluation. Organization evaluation was the primary essay-scoring objective. These tasks were organized hierarchically, with higher-level tasks assessing the overall organization of the essay being prioritized over lower-level tasks that examined more granular features such as grammar or coherence. Inspired by this, we employed a dynamic weight assignment strategy that allocated a higher weight to lower-level tasks, such as sentiment and emotion, during the initial phases of training. We then progressively shifted the weight toward the target task, propaganda detection, in the later stages of training. The scheme operated under the hypothesis that optimal performance on the target task requires initial proficiency in foundational tasks because these tasks provide the necessary scaffolding for subsequent learning. For instance, the sentiment task is critical for propaganda detection because the propagandistic content of a text is often closely linked to its sentiment. Equation (20) represents the calculation:

$$L_{global} = w_{prop} \cdot L_{Pro} + L_{Sen} + L_{Emo} \tag{20}$$

where the learnable generated weight $w_{prop}$ is dynamically updated as follows:

$$w_{prop} = \max\left(\min\left(w_{prop} \cdot \frac{L_{Pro}}{L_{Sen}}, 0.5\right), 1\right) \tag{21}$$

The learnable weight, $w_{prop}$, was employed to regulate the relative importance of $L_{Pro}$ based on the empirical assumption that $L_{Sen}$ and $L_{Emo}$ are of equal significance. Initially, it was set to 0.1. Additionally, $w_{prop}$ ensured that all tasks were given equal emphasis during the early stages of training. As $L_{Sen}$ became smaller than $L_{Pro}$, $w_{prop}$ gradually shifted, allowing the model to progressively prioritize the propaganda detection task as the ratio $L_{Pro}/L_{Sen}$ increased. Section 4.3 provides a comprehensive explanation of the selection process for these parameters.
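The update rule can be sketched as a bounded multiplicative step of the same form as Equation (21). The bounds below (0.01 and 2.0) are illustrative values taken from the search grid in Section 4.3, the loss trajectory is hypothetical, and the function names are assumptions introduced for this sketch.

```python
def update_w_prop(w_prop, loss_pro, loss_sen, lower, upper):
    """One update in the form of Equation (21): scale by L_Pro / L_Sen, then bound."""
    return max(min(w_prop * loss_pro / loss_sen, upper), lower)

def hw_global_loss(w_prop, loss_pro, loss_sen, loss_emo):
    """Equation (20): weighted propaganda loss plus unweighted auxiliary losses."""
    return w_prop * loss_pro + loss_sen + loss_emo

# Hypothetical loss trajectory in which sentiment is learned faster than propaganda.
w_prop = 0.1  # initial value stated in the text
history = []
for loss_pro, loss_sen in [(0.70, 0.70), (0.65, 0.50), (0.60, 0.30), (0.55, 0.20)]:
    w_prop = update_w_prop(w_prop, loss_pro, loss_sen, lower=0.01, upper=2.0)
    history.append(w_prop)
```

In this trajectory the weight grows as the ratio of the propaganda loss to the sentiment loss grows. Because the outer `max` is applied last, the call returns `lower` regardless of the losses whenever `lower` is at least as large as `upper`.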

6. Uncertainty Weighting (UW)

Generally, uncertainty in learning models can be classified into two primary types: epistemic and aleatoric uncertainty. Epistemic uncertainty refers to a model’s lack of knowledge owing to limited training data. In contrast, aleatoric uncertainty arises from the inherent variability in the data, which the model cannot explain. Moreover, aleatoric uncertainty can be further divided into two categories: heteroscedastic and homoscedastic uncertainty. Heteroscedastic uncertainty is data dependent, meaning that its variance varies across the input data.

In contrast, homoscedastic uncertainty is task dependent, assuming consistent uncertainty across different instances of the same task. Consequently, homoscedastic uncertainty is suitable for tasks with similar difficulty levels, whereas heteroscedastic uncertainty is more appropriate when task difficulties vary, leading to different model performance and uncertainty levels. This approach is based on the principle that tasks exhibiting higher uncertainty should be assigned lower weights, whereas tasks exhibiting lower uncertainty should be assigned higher weights [49].

The authors of [54] used homoscedastic uncertainty as a dynamic weighting mechanism for heterogeneous vision tasks, specifically per-pixel depth regression, semantic segmentation, and instance segmentation within a single convolutional network. Building on their approach, we applied homoscedastic uncertainty for task-weighting and adapted their formulation to make it suitable for classification problems, because their original work primarily focused on regression. Equation (22) shows the calculation:

$$L_{global} = \sum_{t = 1}^{T} \left( \frac{1}{\sigma_{t}^{2}} L_{t} + \log \sigma_{t} \right) \tag{22}$$

where $\sigma_{t}$ represents the learnable homoscedastic uncertainty associated with task $t$. To ensure numerical stability during optimization, we trained the network to learn the log-variance, $\log \sigma_{t}$, which acted as a regularization term to prevent the model from converging to a trivial solution. It was clear from Equation (22) that, if $\sigma_{t}$ increased, the weight of $L_{t}$ decreased. Thus, an increase in the uncertainty value resulted in a smaller contribution of the task to the overall loss. Overall, this approach allowed the model to adaptively adjust the weighting of each task’s loss based on its estimated uncertainty during training, thereby improving the overall MTL performance.
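Equation (22) can be evaluated with a short sketch in which each task is parameterized by $s_{t} = \log \sigma_{t}$. In an actual model these values would be trainable parameters updated by the optimizer; here they are fixed numbers chosen for illustration, and the function name is hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Equation (22): sum over tasks of L_t / sigma_t**2 + log(sigma_t).

    Each task is parameterised by s_t = log(sigma_t), so 1 / sigma_t**2 is
    exp(-2 * s_t) and sigma_t stays positive without any constraint.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_sigmas):
        total += math.exp(-2.0 * s) * loss + s
    return total

# Hypothetical losses for propaganda, sentiment, and emotion.
losses = [0.6, 1.0, 1.4]
neutral = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])  # sigma_t = 1 for all tasks
damped = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.5])   # larger sigma on emotion
```

Raising the uncertainty of the emotion task shrinks its effective weight from 1 to $e^{-1}$, while the added $\log \sigma_{t}$ term penalizes the increase so that the uncertainty cannot grow without cost.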

7. Priority-Aware Softmax-Based UW (PASUW)

To the best of our knowledge, this is the first work to propose a softmax-normalized, uncertainty-based task-weighting scheme with a priority bias term, $\alpha_{Pro}$, that explicitly emphasizes the main task in an MTL setting. This approach extended the uncertainty-based loss weighting of [54], adapting it to cases in which one task, such as propaganda detection, was of primary importance. The use of softmax-based weighting ensured that the task weights summed to one, preventing extreme disparities between tasks. Equation (23) gives the calculation:

$$L_{global} = w_{prop} \cdot L_{Pro} + w_{Sen} \cdot L_{Sen} + w_{Emo} \cdot L_{Emo} \tag{23}$$

where the normalized uncertainty-based weights $w_{t}$ for each task $t \in \{Pro, Sen, Emo\}$ are computed as in Equation (24):

$$w_{t} = \frac{e^{-\sigma_{t}}}{\sum_{j \in T} e^{-\sigma_{j}}} \tag{24}$$

In the case of the propaganda task, a priority bias term, $\alpha_{Pro} > 1$, was applied before normalization to increase its contribution, and it was dynamically computed as follows:

$$w_{Pro} = \alpha_{Pro} \cdot e^{-\sigma_{Pro}} \tag{25}$$

Based on empirical evaluation, the priority bias term, $\alpha_{Pro}$, was set to 1.5 to achieve optimal performance. This mechanism enabled the model to dynamically prioritize the main task while still benefiting from auxiliary signals, thereby achieving balanced uncertainty-aware optimization across tasks.
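One reading of this scheme is sketched below: each task receives the score $e^{-\sigma_{t}}$, the propaganda score is multiplied by the priority bias, and the scores are then normalized to sum to one. The uncertainty and loss values are hypothetical, and the function names are assumptions introduced for illustration.

```python
import math

def pasuw_weights(sigmas, alpha_pro=1.5, main_task="Pro"):
    """Equations (24)-(25): exp(-sigma_t) per task, with the main task scaled by
    alpha_pro before the scores are normalised to sum to one."""
    scores = {task: math.exp(-sigma) for task, sigma in sigmas.items()}
    scores[main_task] *= alpha_pro
    total = sum(scores.values())
    return {task: score / total for task, score in scores.items()}

def pasuw_global_loss(task_losses, weights):
    """Equation (23): weighted sum of the three task losses."""
    return sum(weights[task] * task_losses[task] for task in task_losses)

# Hypothetical uncertainties and losses; equal sigmas isolate the effect of alpha_pro.
sigmas = {"Pro": 0.5, "Sen": 0.5, "Emo": 0.5}
weights = pasuw_weights(sigmas)
loss = pasuw_global_loss({"Pro": 0.6, "Sen": 1.0, "Emo": 1.4}, weights)
```

With equal uncertainties and a bias of 1.5, the propaganda weight is 1.5/3.5 and each auxiliary weight is 1/3.5, so the priority is visible while the weights still sum to one.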

4. Experiments

This section outlines the datasets, models, experimental setups, and evaluation measurements. First, a description of the dataset is provided. Subsequently, various proposed model variations, followed by the experimental setting, are explained. Finally, the evaluation measurements are listed.

4.1. Dataset

The primary objective of this study was to propose and develop a model for detecting propaganda in Arabic by leveraging the MultiProSE dataset [22]. The primary reason for selecting this dataset was that it is the first and only publicly available resource designed to support research and development in propaganda detection, making it well-suited for our proposed models in the Arabic language. The MultiProSE dataset comprises 8000 paragraphs written in Modern Standard Arabic, each of which has been manually annotated across three complementary tasks. For the propaganda detection task, each paragraph was assigned a binary label (1 = True, 2 = False); for sentiment analysis, each paragraph was assigned a ternary label (1 = Positive, 2 = Negative, 3 = Neutral); and for emotion recognition, each paragraph was assigned a five-way label (1 = Happiness, 2 = Sadness, 3 = Anger, 4 = Fear, 5 = None). Table 4 provides detailed statistics for the MultiProSE dataset.

4.2. Models

This section presents the various model variations proposed in the current study. Model variations can be classified into three categories, as outlined in Table 5, including the model categories, model names, and model descriptions.

4.3. Experimental Setup

All experiments were conducted on a single NVIDIA Tesla T4 GPU equipped with 15 GB of memory and utilizing Google Colab Pro+ (a robust Python 3.11 programming environment). As in [23], the dataset was split into 75%, 8.5%, and 16.5% for training, development, and testing, respectively. Transformer toolkits were employed to develop a pipeline for propaganda, sentiment, and emotion analysis [55]. For the PLM, AraBERT served as the backbone model for training on the MultiProSE dataset, which encoded both news text and labels as hidden representations. This model was selected because of its SOTA performance, as demonstrated in [22]. The proposed system initiated the preprocessing of Arabic texts, as described in Section 3. Based on the training data visualization, the maximum sequence length was set to 256 tokens. After conducting multiple empirical experiments, the batch size and number of epochs were set to eight and five, respectively. An AdamW optimizer was used with a learning rate of 2 × 10⁻⁵ [56].

The hyperparameter, known as “patience,” was set to five, which represented the maximum number of epochs that could elapse without observable improvement, at which point the training process was terminated. To prevent overfitting, the weight decay was set to 1 × 10⁻⁵. Additionally, for each experiment, five runs with distinct random seeds were performed, and the average performance over the test subset was reported.

Furthermore, to optimize the performance of the proposed models, an exhaustive grid search was employed to tune the hyperparameters of every task-weighting scheme. A grid search enumerates the Cartesian product of predefined value sets, trains a model for each resulting combination, and retains the configuration that maximizes performance on the development split [57]. Specifically, each configuration was scored by its macro-F1 on the development set, and the configuration that achieved the highest macro-F1 was subsequently retrained and evaluated on the held-out test set. For example, in the Hierarchical Weighting scheme, the search space was defined as λ ∈ {0.1, 0.5, 1}, upper bound ∈ {0.5, 1, 2}, and lower bound ∈ {0.001, 0.01, 1}, yielding 3 × 3 × 3 = 27 configurations. Each configuration was trained and validated under the same fixed schedule to isolate the influence of the weighting parameters. The selected optimal triple was λ = 0.5, upper_bound = 0.5, and lower_bound = 1, achieving a macro-F1 of 0.7958 on the development set. Algorithm 2 summarizes the grid-search tuning of weighting-scheme hyperparameters. Table 6 summarizes the hyperparameter values used in this experiment.

Algorithm 2: Grid-search tuning of weighting-scheme hyperparameters

Input: 𝒟_train: training set (75%); 𝒟_dev: development set (8.5%); G: search grids G1 … Gn
Output: Optimal hyperparameter vector θ⋆

bestScore ← −∞; θ⋆ ← ∅
for θ in G1 × … × Gn do    // Cartesian product
    train model Mθ on 𝒟_train for ≤ 5 epochs
    apply early stopping (patience = 2)
    score ← macroF1(Mθ(𝒟_dev))
    if score > bestScore then
        bestScore ← score
        θ⋆ ← θ
    end if
end for
return θ⋆
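The loop in Algorithm 2 can be sketched with `itertools.product`. The scoring function below is a placeholder that stands in for training a model and measuring macro-F1 on the development set; it is included only so that the sketch runs and does not reproduce the reported search outcome. The names `grid_search` and `toy_evaluate` are hypothetical.

```python
from itertools import product

def grid_search(grids, evaluate):
    """Return the best configuration and its score over the Cartesian product.

    `grids` maps a hyperparameter name to its candidate values; `evaluate`
    stands in for "train the model, then return macro-F1 on the development set".
    """
    names = list(grids)
    best_score, best_config = float("-inf"), None
    for values in product(*(grids[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:  # strict comparison keeps the first of any ties
            best_score, best_config = score, config
    return best_config, best_score

# Search space quoted in the text for the Hierarchical Weighting scheme.
hw_grids = {
    "lam": [0.1, 0.5, 1],
    "upper_bound": [0.5, 1, 2],
    "lower_bound": [0.001, 0.01, 1],
}

def toy_evaluate(config):
    """Placeholder score so the sketch runs; a real run would train the model here."""
    return -abs(config["lam"] - 0.5) - abs(config["upper_bound"] - 1)

n_configs = len(list(product(*hw_grids.values())))
best_config, best_score = grid_search(hw_grids, toy_evaluate)
```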

4.4. Evaluation Measurements

The macro- and micro-averaged F1 scores were computed to evaluate the models’ performance on the propaganda, sentiment, and emotion tasks. These measures are particularly suitable for unbalanced datasets and have been widely used in previous studies to report the effectiveness of models for these tasks [23]. Accuracy was also reported. This section explains how accuracy, F1-macro, and F1-micro were calculated. For this purpose, a confusion matrix was used to analyze the performance of a binary classifier. In this matrix, each row contains information about the actual class, whereas each column contains information about the predicted class. Accordingly, the confusion matrix shows how well a classifier recognizes the instances of different classes. Table 7 presents the confusion matrix [58].

In the context of the propaganda classification problem, true positives (TPs) referred to instances where news text containing propaganda was correctly classified as propaganda. True negatives (TNs) referred to instances where news text not containing propaganda was correctly classified as nonpropaganda (i.e., these were the correct decisions, represented along the diagonal of the confusion matrix). False positives (FPs) occurred when news text that did not contain propaganda was incorrectly classified as propaganda. Meanwhile, false negatives (FNs) arose when news text containing propaganda was incorrectly classified as nonpropaganda.
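The three measures can be derived from the four confusion-matrix counts using the standard definitions, as in the following sketch. The counts are hypothetical and are not results from this study; the function names are introduced for illustration.

```python
def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, macro-F1, and micro-F1 for a two-class confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    f1_pos = f1(tp, fp, fn)   # propaganda as the positive class
    f1_neg = f1(tn, fn, fp)   # nonpropaganda: the two error types swap roles
    macro_f1 = (f1_pos + f1_neg) / 2
    # Micro-averaging pools the counts of both classes before computing F1.
    micro_f1 = f1(tp + tn, fp + fn, fn + fp)
    return accuracy, macro_f1, micro_f1

# Hypothetical counts, not results from the paper.
accuracy, macro_f1, micro_f1 = binary_metrics(tp=30, tn=50, fp=10, fn=10)
```

For single-label classification, pooling the counts makes micro-F1 equal to accuracy, whereas macro-F1 averages the per-class scores and is therefore more sensitive to the minority class.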

5. Results and Discussion

This section presents the experimental results obtained with the proposed models on the MultiProSE dataset. The aim was to identify the most effective approach for propaganda detection. The results are organized into three subsections: MTL models, proposed task-weighting models, and MTL models with the optimal proposed task-weighting. All models were evaluated on the MultiProSE test set using three primary performance metrics: accuracy, F1-macro, and F1-micro. The outcome of this analysis informed the selection of the most effective approach for propaganda detection among the proposed MTL models.

5.1. Ablation Experiments on the Proposed MTL Models

Several ablation experiments were conducted to assess the effectiveness of the main MTL components and of each auxiliary task. The experiments involved cumulatively adding components to obtain the proposed solution. The experiments started by testing the propaganda detection task alone, as in our previous work [22], which will be discussed in Section 5.3. Then, we trained the base MTL model with each auxiliary task separately, namely sentiment detection (MTL-Sent) and emotion detection (MTL-Emo), followed by training the model with both auxiliary tasks combined.

As Table 8 summarizes, the MTL models, comprising several configurations, demonstrated consistent performance within a narrow range, with accuracy scores between 77% and 79% and Macro-F1 scores between 0.763 and 0.778. Among these, the MTL model employing LS (MTL-LS) achieved the highest accuracy of 79% and a Macro-F1 score of 0.775, whereas MTL-UAW yielded 77% accuracy and a Macro-F1 score of 0.763. Regarding the incorporation of sentiment and emotion tasks into propaganda detection models, the MTL-Sent-LS (PSMTL-LS) and MTL-Emo-UAW (PEMTL-UAW) models achieved 79% accuracy. Their Macro-F1 scores were 0.778 for MTL-Sent-LS and 0.775 for MTL-Emo-UAW. In contrast, MTL-Sent-UAW and MTL-Emo-LS exhibited slightly lower performance, with accuracy scores of 78% and Macro-F1 scores of 0.770 and 0.773, respectively. Overall, these results suggest that the inclusion of auxiliary tasks, such as sentiment and emotion (e.g., MTL-Sent-LS and MTL-Emo-UAW), can enhance the main task’s performance.

5.2. Experimental Results on the Proposed Task-Weighting

As Section 3.6.3 outlines, several task-weighting schemes were proposed. The primary objective was to analyze their impact on the proposed MTL models’ performance. The evaluation results, presented in Table 9, demonstrated that MTL-UW offers an advantage over all other task-weighting methods, achieving an accuracy of 79% and a Macro-F1 score of 0.776. This Macro-F1 score was 0.13% higher than that of MTL-HW, which achieved a Macro-F1 score of 0.775, and 0.52% higher than that of MTL-DDW, which achieved a Macro-F1 score of 0.772. In contrast, the three MTL-PGRW variants employing normal, Bernoulli, and uniform sampling achieved Macro-F1 scores of 0.767, 0.765, and 0.764, respectively. These results corresponded to relative differences of 1.17%, 1.44%, and 1.57%, respectively, compared with the MTL-UW scheme. Furthermore, the top three schemes (MTL-UW, MTL-HW, and MTL-DDW) achieved the highest accuracy of 79%, outperforming the MTL-PGRW methods (77–78%) by 1–2 percentage points.
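The percentages quoted above appear to be relative differences measured against the lower of the two Macro-F1 scores. A short check under that assumption, using only the scores reported in the text (the function name is hypothetical):

```python
def relative_gain_percent(reference, other):
    """Percentage by which `reference` exceeds `other`, relative to `other`."""
    return (reference - other) / other * 100

uw = 0.776  # Macro-F1 of MTL-UW
others = {
    "MTL-HW": 0.775,
    "MTL-DDW": 0.772,
    "MTL-PGRW (normal)": 0.767,
    "MTL-PGRW (Bernoulli)": 0.765,
    "MTL-PGRW (uniform)": 0.764,
}
gains = {name: round(relative_gain_percent(uw, score), 2) for name, score in others.items()}
```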

Overall, the experimental results consistently highlighted the positive impact of task-weighting on MTL performance. Additionally, as Table 9 shows, certain weighting schemes are more effective than others. To test whether the most promising weighting rule based on Macro-F1, namely UW, could further benefit related-task MTL settings, experiments were conducted using MTL models with the best proposed task-weighting. The MTL-Sent-UW and MTL-Emo-UW models, as detailed in Table 9, both achieved an accuracy score of 79% but differed in overall Macro-F1. Specifically, MTL-Sent-UW achieved a Macro-F1 of 0.778, whereas MTL-Emo-UW remained at 0.777. In general, the MTL-UW framework enhanced the efficacy of propaganda detection through the integration of auxiliary tasks. Notably, the incorporation of sentiment analysis yielded a marginally higher Macro-F1 score (0.778) than emotion detection (0.777), although this difference was small. This subtle variation suggests that sentiment features may provide a slight advantage in optimizing class-level performance metrics, potentially owing to their broader contextual relevance within textual data. Nonetheless, both auxiliary tasks contributed comparably to the primary objective of propaganda detection, highlighting their utility in improving model generalizability.

5.3. Comparisons with Previous Studies

To conduct a comprehensive evaluation and gain insights into the advancements of our proposed approach in comparison with those of existing research, this section compares our model’s performance with the results of previous studies. Table 8 summarizes the comparative results of previous research on propaganda detection. Specifically, we selected SOTA systems along with all systems that were evaluated on the ArPro dataset [23]. Regarding the MultiProSE dataset, no prior systems have been developed for it because it was only recently released [22]. Nevertheless, we assessed the performance of our top-performing model by comparing it with the best model proposed in the dataset paper [23] and all published studies that utilized the same dataset.

All models proposed for the ArPro dataset employed STL. To the best of our knowledge, there is no prior public benchmark for propaganda detection that is suitable for MTL settings, even in English, a more widely studied language. Therefore, we also compared our model with other models that applied MTL in the NLP field using the same MultiProSE dataset and model configuration, as detailed in Section 4.1, to ensure a fair comparison [6]. The models selected for comparison were as follows:

AraBERT [22]: AraBERT is a single-task model that extends the pretrained BERT language model by incorporating a linear classification layer into the hidden representation of the special [CLS] token. Notably, this model follows the same approach as the backbone model, making it an appropriate point for comparison.
AraBERT with Features [25]: This single-task model extends the pretrained BERT language model by adding a linear classification layer to the hidden representation of the [CLS] token, along with several additional features. These features enhance the model’s propaganda detection performance.
ML with Features [25]: This is a single-task model that integrates multiple ML classifiers, each utilizing various features for improved performance in propaganda detection.
AraBERT [23]: AraBERT is a single-task model that extends the pretrained BERT language model by appending a linear classification layer to the hidden representation of the [CLS] token. This model serves as a baseline for propaganda detection.
LlamaLens [24]: LlamaLens is a specialized multilingual LLM designed for the analysis of propaganda content in news. This single-task model utilizes multilingual capabilities to assess and detect propaganda in diverse linguistic contexts.
MARBERT [6]: MARBERT is a multi-task model that extends the pretrained BERT language model by incorporating a linear classification layer into the hidden representation of the [CLS] token. This model was specifically designed for the detection of abusive content in Arabic tweets, showcasing its versatility in MTL settings.

As Table 10 shows, the proposed MTL models consistently outperformed all STL and MTL models in terms of Macro-F1 score, Micro-F1 score, and accuracy. This improvement was attributed to MTL’s ability to reduce overfitting during gradient descent, leading to more effective learning. Moreover, sentiment analysis contributed to the MTL models’ enhanced performance in propaganda detection. Specifically, MTL-Sent-LS and MTL-Sent-UW exhibited a 2.9% improvement in Macro-F1 scores and a 2.6% improvement in accuracy compared with the STL-AraBERT model from [22] and [23]. Similarly, MTL-Sent-LS and MTL-Sent-UW showed improvements of 1.3% and 9.8% in Macro-F1 scores compared with the STL-AraBERT and STL-ML with features models, respectively [25]. Although existing MTL models consider sentiment information, they still score 4.8% lower in terms of the Macro-F1 score [6]. These results highlight the effectiveness of the main components incorporated into the proposed models.

To demonstrate the effectiveness of the proposed propaganda detection model, a paired t-test was conducted using the scipy.stats library in Python. This statistical analysis compared the means of two groups to determine whether the observed differences were statistically significant. Table 11 summarizes the results. Specifically, we performed paired t-tests across five random seeds for each evaluation metric, comparing our model against MTL-MARBERT [6]. As Table 11 shows, our model consistently achieved statistically significant improvements (p < 0.05) across all metrics: accuracy, Macro-F1, and Micro-F1. For instance, the Macro-F1 score increased from 0.742 to 0.778 (proposed model), with a p-value of 0.0001. These findings confirm that the performance gains were both consistent and statistically meaningful, thereby supporting the robustness and effectiveness of our MTL approach.
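As a rough illustration of the paired test described above, the following sketch computes the paired t statistic over per-seed scores using only the Python standard library. The per-seed values are hypothetical placeholders, not results from this study, and the function name `paired_t_statistic` is an assumption made for the example. In SciPy, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.

```python
import math
import statistics

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (five paired runs), taken from standard t tables.
T_CRIT_DF4 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples scores_a vs. scores_b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical Macro-F1 scores over five seeds (illustrative only).
proposed = [0.776, 0.779, 0.778, 0.780, 0.777]
baseline = [0.741, 0.744, 0.740, 0.745, 0.740]

t_stat, df = paired_t_statistic(proposed, baseline)
significant = abs(t_stat) > T_CRIT_DF4
print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {significant}")
```

Pairing the runs by seed removes seed-to-seed variance that both models share, which is why a paired test is usually preferred over an independent two-sample test in this setting.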

5.4. Discussion

This study’s primary objective was to enhance propaganda detection’s effectiveness in Arabic news texts through the development of advanced MTL models. Building on recent advancements in MTL and the integration of auxiliary tasks, the proposed models explored the impact of various task-weighting strategies and auxiliary tasks, particularly sentiment and emotion analyses, on model performance. The experimental results presented in Section 5.1, Section 5.2 and Section 5.3 demonstrate substantial advancements in propaganda detection through MTL frameworks, outperforming both existing STL models and prior MTL models.

The proposed MTL models consistently outperformed the STL baselines across all metrics, that is, accuracy, Macro-F1, and Micro-F1, validating the hypothesis that auxiliary tasks, such as sentiment and emotion analyses, significantly enhance propaganda detection. Furthermore, the observed improvements in the MTL models incorporating sentiment and emotion tasks aligned with the findings presented in the related work in Section 2, according to which auxiliary tasks contribute to improved task generalization and reduced overfitting during training. Specifically, the MTL-UAW model, incorporating sentiment and emotion tasks, achieved the highest Macro-F1 score (0.763), surpassing the STL benchmark, AraBERT [23], by 1.7%. Consistent with prior research, incremental improvements of 1–2% in Macro-F1 and accuracy are considered significant within the context of Arabic NLP, owing to the linguistic complexity and the nuanced nature of propaganda detection. The statistically significant results presented in Table 11 (p < 0.05) substantiate that these improvements are robust and not attributable to random variation. Moreover, qualitative analyses, elaborated in Section 6, indicate that the proposed multi-task learning framework more effectively captures subtle instances of propaganda, particularly in texts exhibiting neutral or mixed sentiments, where conventional baseline models frequently demonstrate limited performance. Additionally, the integration of auxiliary sentiment and emotion tasks contributes to the enhancement of feature representations, thereby improving model robustness and providing avenues for further advancement in detection performance.

The comparative analysis of two-task models, such as propaganda and sentiment and propaganda and emotion, revealed a subtle yet informative distinction between sentiment- and emotion-oriented supervision, influencing propaganda detection’s overall effectiveness. Moreover, although MTL-Sent-LS and MTL-Sent-UW outperformed MTL-Emo-UW and MTL-Emo-LS in terms of the Macro-F1 score, the differences of 0.39% and 0.65%, respectively, were marginal. This suggests that both sentiment and emotion features are equally valuable for optimizing performance, with sentiment analysis offering a slightly broader contextual advantage in terms of model generalization. However, in agreement with previous studies [5,12], the superior performance observed when using sentiment as the sole auxiliary task can be attributed to a stronger alignment between the auxiliary and main tasks as well as a reduced likelihood of negative transfer. In contrast, integrating emotion, which is considered a task characterized by greater subjectivity and noise, likely introduces optimization difficulties, resulting in task interference and weaker shared representations [59,60].

Among the proposed loss-weighting schemes that were evaluated, UW emerged as the most reliable, narrowly outranking HW and dynamic-difficulty weighting. This superiority can be attributed to two factors. First, the homoscedastic uncertainty parameters adaptively down-weight noisy gradients from the auxiliary heads, thereby stabilizing the joint optimization. Second, unlike PGRW variants, UW maintains continuous, differentiable control over the weight simplex, allowing for finer adjustments as training progresses rather than being sensitive to sampling distributions. This underscores the need for stable weighting mechanisms in MTL. Nevertheless, the marginal gap between UW and HW, 0.13% based on Macro-F1, indicates that hierarchical loss ordering, where the main task is always given precedence, remains a competitive, computationally cheaper alternative for scenarios in which uncertainty estimation is infeasible. Even feature-enhanced STL models, such as AraBERT with features, were outperformed by 1.04% in Macro-F1, highlighting MTL’s capacity to implicitly capture feature interactions without manual engineering. Furthermore, our proposed model surpassed prior MTL approaches such as MARBERT by 4.58%, based on Macro-F1, likely owing to optimized task-weighting and the integration of sentiment/emotion. This is a novel contribution to Arabic propaganda detection.
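The uncertainty-based weighting discussed above can be sketched as follows. The example uses the classification form popularized by Kendall et al., in which each task loss is scaled by the inverse of a task variance and a log-variance penalty keeps the weights from collapsing to zero. It is an assumption that this form matches the exact variant used in the experiments; the function name and the loss values are hypothetical, and only the forward computation is shown (in training, the log-variances would be learnable parameters updated by gradient descent).

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + 0.5 * s_i.

    s_i = log(sigma_i ** 2) is the log-variance of task i, so
    exp(-s_i) equals 1 / sigma_i ** 2 and 0.5 * s_i equals log(sigma_i).
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("one log-variance is required per task loss")
    return sum(math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(task_losses, log_vars))


# Hypothetical batch losses: propaganda (main task), sentiment (auxiliary).
losses = [0.40, 0.90]

# Equal uncertainty: both tasks receive weight exp(0) = 1.
equal = uncertainty_weighted_loss(losses, [0.0, 0.0])

# A larger log-variance on the noisier auxiliary task shrinks its weight
# to exp(-1), at the cost of a 0.5 * 1.0 penalty term.
aux_down_weighted = uncertainty_weighted_loss(losses, [0.0, 1.0])

print(equal, aux_down_weighted)
```

Because the weights are smooth functions of the log-variances, they can change gradually during training, which is the continuous control contrasted above with sampling-based schemes.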

6. Error Analysis

In this section, we present an in-depth analysis of the errors made in propaganda detection. We aimed to identify key factors contributing to the model’s successes and limitations. Both quantitative and qualitative error analyses were undertaken to elucidate each model’s strengths and weaknesses. Our study focused on the MTL models trained with the best proposed task-weighting scheme: (a) MTL-Sent-UW and (b) MTL-Emo-UW.

6.1. Per-Class Performance

Figure 4 and Figure 5 show the classification report and confusion matrix for each model. In Figure 4 and Figure 5, matrix (a) corresponds to the MTL-Sent-UW model, whereas matrix (b) presents the results for MTL-Emo-UW. The diagonal elements of each confusion matrix indicate correct predictions, whereas the off-diagonal elements represent misclassified instances. In general, both MTL variants achieved competitive results. Matrix (a) showed that the model tended to under-predict propaganda. The comparatively dark shading in the lower-left cell (FN = 142) indicated that many persuasive texts were misclassified as nonpropaganda. In matrix (b), the lower-left cell lightens, confirming the recall improvement, whereas the upper-right cell darkens slightly (FP = 138). Emotion cues helped the model retrieve alarmist or fear-laden propaganda that sentiment alone missed, but the same cues occasionally caused neutral crisis reports to be labeled as propaganda.

We observed that the propaganda class was consistently easier to learn than the nonpropaganda class. However, the two models differed in how they reached that result. In matrix (a), the propaganda F1-score was 0.8338, but its recall stalled at 0.8293, leaving 142 propaganda texts undetected (FNs). In contrast, with the emotion auxiliary task, recall climbed to 0.8425 and the F1-score to 0.8390, a relative gain of about 0.6%, with 11 fewer FNs (only 131). The price was a marginal increase of five FPs (from 133 to 138); with a net reduction of six errors, the overall accuracy rose from 79.26% to 79.71%. Overall, the MTL-Emo-UW model achieved the strongest performance, as shown in Table 12, with 269 misclassified texts (138 FPs + 131 FNs). In contrast, MTL-Sent-UW attained a slightly lower accuracy of 79.26% and produced 275 misclassifications (133 FPs + 142 FNs). Notably, MTL-Emo-UW exhibited more TPs for propaganda and fewer FNs, indicating improved sensitivity to propaganda cues compared to MTL-Sent-UW.
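The relationship between the confusion-matrix counts and the scores quoted above can be checked with a short sketch. The FP and FN counts are those reported in the text; the TP and TN counts are back-calculated assumptions (they are not stated in this section) chosen so that the reported recall and accuracy are reproduced. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, and F1 for the positive class, plus overall accuracy."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# FP and FN are the reported counts. TP and TN are back-calculated
# assumptions consistent with the reported recall and accuracy.
sent_uw = binary_metrics(tp=690, fp=133, fn=142, tn=361)
emo_uw = binary_metrics(tp=701, fp=138, fn=131, tn=356)

# Relative F1 gain of the emotion variant over the sentiment variant.
relative_f1_gain = (emo_uw["f1"] - sent_uw["f1"]) / sent_uw["f1"]

for name, scores in (("MTL-Sent-UW", sent_uw), ("MTL-Emo-UW", emo_uw)):
    print(name, {key: round(value, 4) for key, value in scores.items()})
print(f"relative F1 gain: {relative_f1_gain:.2%}")
```

Under these assumed counts, the test set would contain 1326 texts, 832 of them propaganda, so a net change of six errors corresponds to roughly 0.45 percentage points of accuracy.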

Sentiment is a useful cue for spotting propaganda. Messages that try to persuade people often carry clear feelings. Harsh attacks sound very negative, whereas patriotic slogans sound very positive. By marking sentences with strong positive or negative tones, a sentiment-aware model can focus on parts of the text where propaganda is more likely and avoid many neutral factual statements. While sentiment alone cannot find every subtle case, it offers a simple, low-cost signal that improves the early detection of persuasive content. In contrast, emotion introduces nuanced affective signals such as anger and fear that exhibit a strong association with persuasive messages. When these affective cues coincide with propaganda indicators, the model effectively identifies challenging instances, particularly neutral-toned statements that implicitly convey fear appeals or employ glittering generalities. However, this approach also leads to a modest increase in FPs, as highly emotional yet nonpropagandistic content like pandemic updates and disaster statistics tends to be occasionally misclassified as propaganda. The remaining challenge is to distinguish genuine persuasive intent from emotionally charged but factual reporting. This is a task that may require discourse-level context or a dedicated rumor detection auxiliary task in future work. Overall, incorporating emotion and sentiment tasks significantly enhanced the model’s ability to distinguish between propaganda and nonpropaganda text, thereby improving the overall performance of the detection models. These findings suggest that integrating auxiliary tasks such as emotion and sentiment analysis can substantially boost propaganda detection effectiveness in the Arabic language context.

6.2. Correct and Error Distribution

To investigate the two models further, we manually inspected the first 50 sentences from each model’s output. Following standard classification taxonomy, we identified four types of classification outcomes in the propaganda detection task:

Type-1: TP—A news text containing propaganda is correctly classified as propaganda.
Type-2: TN—A news text not containing propaganda is correctly classified as nonpropaganda.
Type-3: FP—A news text not containing propaganda is incorrectly classified as propaganda.
Type-4: FN—A news text containing propaganda is incorrectly classified as nonpropaganda.

Table 13 shows the distribution of correct and erroneous results for the models tested, and Figure 6 visualizes these statistics. For both assessed models, the vast majority of cases were correctly classified, with Type-1 and Type-2 together representing over 75% of the outputs. In contrast, the error types, Type-3 and Type-4, together accounted for less than a quarter of the total. This highlighted the models’ strong ability to achieve correct classification, while errors remained relatively infrequent and consistently distributed across both models.
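The four outcome types listed above can be tallied with a few lines of code. The sketch below is a minimal illustration with hypothetical gold labels and predictions (1 for propaganda, 0 for nonpropaganda); the function names are assumptions made for the example and the numbers do not correspond to Table 13.

```python
from collections import Counter

PROPAGANDA, NON_PROPAGANDA = 1, 0


def outcome_type(gold, pred):
    """Map one (gold, predicted) label pair to its outcome type."""
    if gold == PROPAGANDA:
        return "Type-1 (TP)" if pred == PROPAGANDA else "Type-4 (FN)"
    return "Type-3 (FP)" if pred == PROPAGANDA else "Type-2 (TN)"


def outcome_distribution(gold_labels, predictions):
    """Return the share of each outcome type among the inspected texts."""
    counts = Counter(outcome_type(g, p) for g, p in zip(gold_labels, predictions))
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical labels for ten inspected texts (illustrative only).
gold = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
pred = [1, 1, 0, 0, 1, 1, 0, 1, 0, 1]

dist = outcome_distribution(gold, pred)
correct_share = dist.get("Type-1 (TP)", 0.0) + dist.get("Type-2 (TN)", 0.0)
print(dist, correct_share)
```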

Table 14 presents five example errors produced by the MTL models, including both false positives and false negatives, together with the justifications for these classification errors. The table shows that the FPs arose from neutral or factual statements containing terms associated with propaganda, whereas the FNs involved subtle propaganda lacking explicit cues. These findings highlight the inherent challenges in accurately distinguishing propaganda within Arabic text and emphasize the need for improved contextual and linguistic modeling in future work.

7. Conclusions and Future Work

This study contributes to advancing Arabic propaganda detection by developing the first MTL models that integrate sentiment and emotion tasks with the primary task of propaganda detection. By leveraging recent advancements in MTL and auxiliary task integration, the proposed models explore various task-weighting schemes to optimize performance. The extensive experimental results presented in Section 5 demonstrate the effectiveness of the proposed MTL models in improving propaganda detection, showing significant enhancements over both STL models and prior MTL approaches. These findings emphasize the critical role of task-weighting in enhancing MTL performance for propaganda detection. This study makes an important contribution to MTL propaganda detection, particularly in highlighting the importance of task prioritization when integrating multiple dimensions of analysis.

Among the seven proposed loss-weighting schemes, homoscedastic UW (MTL-UW) consistently delivers the most significant improvements, marginally outperforming HW (MTL-HW) and dynamic-difficulty weighting (MTL-DDW), and decisively surpassing PGRW (MTL-PGRW). These results emphasize the advantages of principled data-driven weighting strategies over heuristic or stochastic methods, especially when working with noisy or weakly supervised auxiliary labels. Exploration of the interplay between sentiment and emotion addresses a notable research gap, offering valuable insights into how these auxiliary tasks can enhance the performance of propaganda detection systems. Additionally, the findings underscore the benefits of incorporating auxiliary tasks to address data scarcity, thereby improving the model’s generalization and robustness.

Although this study offers valuable contributions, several limitations should be acknowledged. First, handling a multi-aspect annotation dataset introduces considerable complexity, often resulting in noisy or inconsistent labels that may affect model performance. Second, the scope of this study is limited to the Arabic language. Given that propaganda techniques and linguistic patterns can vary significantly across contexts, the observed performance improvements may not generalize well to other domains (e.g., news articles versus social media) or language varieties (e.g., Modern Standard Arabic versus dialectal Arabic). Third, the computational demands and increased model complexity may limit the practicality of the proposed approach in real-world, resource-constrained environments. Fourth, the multi-head MTL architecture demands elevated GPU time and memory allocation, which may hinder reproducibility for researchers working with limited hardware resources. Finally, while the task-weighting strategies—particularly homoscedastic UW—yield notable improvements, selecting optimal hyperparameters remains a challenging and time-consuming process.

In future work, the model can be extended by incorporating LLMs, such as GPT-4 or LLaMA, with prompt-based or in-context learning techniques. In addition, deeper linguistic cues such as argumentation structures can be explored. Developing dialect-specific resources and evaluating the generalizability across different Arabic variants is another promising direction. Finally, applying this methodology in cross-lingual and multimodal settings, such as meme or video propaganda detection, would further extend the scope and utility of this study.

Author Contributions

L.A.-H. conceived and designed the analysis, developed the code, conducted the data analysis, and wrote the paper. H.A.-K. and A.A.-S. supervised the project and contributed to the review and editing of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Deanship of Scientific Research, King Saud University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study analyzes existing datasets, which have been appropriately cited in the manuscript. No new datasets were created. The analyzed data are publicly available or accessible through the referenced sources. The source code is publicly available at https://github.com/LubnaAlhenaki/MTL-ArabicPropagandaDetection (accessed on 30 May 2025).

Acknowledgments

The authors are grateful to the Deanship of Scientific Research, King Saud University, for funding through the Vice Deanship of Scientific Research Chairs.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Digital Around the World. DataReportal—Global Digital Insights. Available online: https://datareportal.com/global-digital-overview (accessed on 13 February 2024).
PROPAGANDA|English Meaning—Cambridge Dictionary. Available online: https://dictionary.cambridge.org/dictionary/english/propaganda (accessed on 13 February 2024).
Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. Available online: http://arxiv.org/abs/1504.08083 (accessed on 17 February 2024).
Collobert, R.; Weston, J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008.
Liu, X.; He, P.; Chen, W.; Gao, J. Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Vienna, Austria, 2019; pp. 4487–4496.
Alrashidi, B.; Jamal, A.; Alkhathlan, A. Abusive Content Detection in Arabic Tweets Using Multi-Task Learning and Transformer-Based Models. Appl. Sci. 2023, 13, 5825.
Fadel, A.; Saleh, M.; Salama, R.; Abulnaja, O. MTL-AraBERT: An Enhanced Multi-Task Learning Model for Arabic Aspect-Based Sentiment Analysis. Computers 2024, 13, 98.
Mahdaouy, A.E.; Mekki, A.E.; Essefar, K.; Mamoun, N.E.; Berrada, I.; Khoumsi, A. Deep Multi-Task Model for Sarcasm Detection and Sentiment Analysis in Arabic Language. arXiv 2021, arXiv:2106.12488.
Ruder, S. Neural Transfer Learning for Natural Language Processing. Ph.D. Thesis, NUI Galway, Galway, Ireland, 2019.
Niehues, J.; Cho, E. Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, 7–8 September 2017; Association for Computational Linguistics: Vienna, Austria, 2017; pp. 80–89.
Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. arXiv 2021, arXiv:1707.08114.
Ruder, S. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv 2017, arXiv:1706.05098.
Top Ten Internet Languages in The World—Internet Statistics. Available online: https://www.internetworldstats.com/stats7.htm#google_vignette (accessed on 17 February 2024).
Darwish, K.; Habash, N.; Abbas, M.; Al-Khalifa, H.; Al-Natsheh, H.T.; Bouamor, H.; Bouzoubaa, K.; Cavalli-Sforza, V.; El-Beltagy, S.R.; El-Hajj, W.; et al. A Panoramic Survey of Natural Language Processing in the Arab World. arXiv 2021, arXiv:2011.12631. Available online: http://arxiv.org/abs/2011.12631 (accessed on 5 February 2024).
Kaczyński, K.; Przybyła, P. HOMADOS at SemEval-2021 Task 6: Multi-Task Learning for Propaganda Detection. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Bangkok, Thailand, 5–6 August 2021; Association for Computational Linguistics: Vienna, Austria, 2021; pp. 1027–1031.
Baraniak, K.; Sydow, M. Kb at SemEval-2023 Task 3: On Multitask Hierarchical BERT Base Neural Network for Multi-label Persuasion Techniques Detection. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Toronto, ON, Canada, 13–14 July 2023; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 1395–1400.
Attieh, J.; Hassan, F. Pythoneers at WANLP 2022 Shared Task: Monolingual AraBERT for Arabic Propaganda Detection and Span Extraction. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates (Hybrid), 8 December 2022; Association for Computational Linguistics: Vienna, Austria, 2022; pp. 534–540.
Hadjer, K.; Bouklouha, T. HTE at ArAIEval Shared Task: Integrating Content Type Information in Binary Persuasive Technique Detection. In Proceedings of the ArabicNLP 2023, Singapore, 23 November 2023; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 502–507.
Shukla, U.; Vyas, M.; Tiwari, S. Raphael at ArAIEval Shared Task: Understanding Persuasive Language and Tone, an LLM Approach. In Proceedings of the ArabicNLP 2023, Singapore, 23 November 2023; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 589–593.
Li, Y.; Tian, X.; Liu, T.; Tao, D. Multi-Task Model and Feature Joint Learning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, 25–31 July 2015.
Mahabadi, R.K.; Ruder, S.; Dehghani, M.; Henderson, J. Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Association for Computational Linguistics: Vienna, Austria, 2021; pp. 565–576.
Al-Henaki, L.; Al-Khalifa, H.; Al-Salman, A.; Alqubayshi, H.; Al-Twailay, H.; Alghamdi, G.; Aljasim, H. MultiProSE: A Multi-label Arabic Dataset for Propaganda, Sentiment, and Emotion Detection. arXiv 2025, arXiv:2502.08319.
Hasanain, M.; Ahmad, F.; Alam, F. Can GPT-4 Identify Propaganda? Annotation and Detection of Propaganda Spans in News Articles. arXiv 2024, arXiv:2402.17478.
Kmainasi, M.B.; Shahroor, A.E.; Hasanain, M.; Laskar, S.R.; Hassan, N.; Alam, F. LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content. arXiv 2024, arXiv:2410.15308.
Al-Henaki, L.; Al-Khalifa, H.; Al-Salman, A. Enhancing Arabic Propaganda Detection through Hybrid Learning and Lexicon-Based Feature Engineering. Computation 2025. submitted.
Rashkin, H.; Choi, E.; Jang, J.Y.; Volkova, S.; Choi, Y. Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Vienna, Austria, 2017; pp. 2931–2937.
Khanday, A.M.U.D.; Khan, Q.R.; Rabani, S.T. Detecting Textual Propaganda Using Machine Learning Techniques. Baghdad Sci. J. 2021, 18, 0199.
Barrón-Cedeño, A.; Jaradat, I.; Da San Martino, G.; Nakov, P. Proppy: Organizing the news based on their propagandistic content. Inf. Process. Manag. 2019, 56, 1849–1864.
Khanday, A.M.U.D.; Khan, Q.R.; Rabani, S.T. SVMBPI: Support Vector Machine-Based Propaganda Identification. In Cognitive Informatics and Soft Computing; Mallick, P.K., Bhoi, A.K., Marques, G., de Albuquerque, V.H.C., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2021; Volume 1317, pp. 445–455.
Gupta, P.; Saxena, K.; Yaseen, U.; Runkler, T.; Schütze, H. Neural Architectures for Fine-Grained Propaganda Detection in News. In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, Hong Kong, China, 4 November 2019; Association for Computational Linguistics: Vienna, Austria, 2019; pp. 92–97.
Li, J.; Ye, Z.; Xiao, L. Detection of Propaganda Using Logistic Regression. In Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, Hong Kong, China, 4 November 2019; Association for Computational Linguistics: Vienna, Austria, 2019; pp. 119–124.
Khanday, A.M.U.D.; Wani, M.A.; Rabani, S.T.; Khan, Q.R.; El-Latif, A.A.A. HAPI: An efficient Hybrid Feature Engineering-based Approach for Propaganda Identification in social media. PLoS ONE 2024, 19, e0302583.
Samir, A.; Soliman, A.B.; Ibrahim, M.; Hesham, L.; El-Beltagy, S.R. NGU_CNLP at WANLP 2022 Shared Task: Propaganda Detection in Arabic. In Proceedings of the Seventh Arabic Natural Language Processing; WANLP: Abu Dhabi, United Arab Emirates, 2022.
Kausar, S.; Tahir, B.; Mehmood, M.A. ProSOUL: A Framework to Identify Propaganda From Online Urdu Content. IEEE Access 2020, 8, 186039–186054.
Wang, L.; Shen, X.; de Melo, G.; Weikum, G. Cross-Domain Learning for Classifying Propaganda in Online Contents. arXiv 2020, arXiv:2011.06844.
Polonijo, B.; Suman, S.; Simac, I. Propaganda Detection Using Sentiment Aware Ensemble Deep Learning. In Proceedings of the 44th International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 27 September–1 October 2021; IEEE: New York City, NY, USA, 2021; pp. 199–204.
Tiwari, P.; Eswari, R. An LSTM based Propaganda Detection System for News Articles. In Proceedings of the 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN), Ghaziabad, India, 20–21 April 2023; pp. 728–733.
Da San Martino, G.; Barrón-Cedeño, A.; Wachsmuth, H.; Petrov, R.; Nakov, P. SemEval-2020 Task 11: Detection of Propaganda Techniques in News Articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, Barcelona, Spain, 12–13 December 2020; International Committee for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1377–1414.
Pauli, A.; Derczynski, L.; Assent, I. Modelling Persuasion through Misuse of Rhetorical Appeals. In Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI), Abu Dhabi, United Arab Emirates, 7 December 2022; Association for Computational Linguistics: Vienna, Austria, 2022; pp. 89–100.
Abdullah, M.; Abujaber, D.; Al-Qarqaz, A.; Abbott, R.; Hadzikadic, M. Combating propaganda texts using transfer learning. IAES Int. J. Artif. Intell. IJ-AI 2023, 12, 956.
Chavan, T.; Kane, A.M. ChavanKane at WANLP 2022 Shared Task: Large Language Models for Multi-label Propaganda Detection. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; Association for Computational Linguistics: Vienna, Austria, 2022; pp. 515–519.
Gaanoun, K.; Benelallam, I. SI2M & AIOX Labs at WANLP 2022 Shared Task: Propaganda Detection in Arabic, A Data Augmentation and Named Entity Recognition Approach. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022.
Laskar, S.R.; Singh, R.; Khilji, A.F.U.R.; Manna, R.; Pakray, P.; Bandyopadhyay, S. CNLP-NITS-PP at WANLP 2022 Shared Task: Propaganda Detection in Arabic using Data Augmentation and AraBERT Pre-trained Model. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; Association for Computational Linguistics: Vienna, Austria, 2022; pp. 541–544.
Lamsiyah, S.; Mahdaouy, A.; Alami, H.; Berrada, I.; Schommer, C. UL & UM6P at ArAIEval Shared Task: Transformer-based model for Persuasion Techniques and Disinformation detection in Arabic. In Proceedings of the ArabicNLP 2023, Singapore, 23 November 2023; Association for Computational Linguistics: Vienna, Austria, 2023; pp. 558–564.
Nabhani, S.; Borg, C.; Micallef, K.; Al-Khatib, K. Integrating Argumentation Features for Enhanced Propaganda Detection in Arabic Narratives on the Israeli War on Gaza. In Proceedings of the first International Workshop on Nakba Narratives as Language Resources, Online, 20 January 2025; Jarrar, M., Habash, H., El-Haj, M., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 127–149. Available online: https://aclanthology.org/2025.nakbanlp-1.14/ (accessed on 3 May 2025).
Antoun, W.; Baly, F.; Hajj, H. AraBERT: Transformer-based Model for Arabic Language Understanding. arXiv 2021, arXiv:2003.00104. Available online: http://arxiv.org/abs/2003.00104 (accessed on 1 May 2022).
Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar] [CrossRef]
Zhang, Y.; Yang, Q. An overview of multi-task learning. Natl. Sci. Rev. 2018, 5, 30–43. [Google Scholar] [CrossRef]
Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Dai, D.; Gool, L. Revisiting Multi-Task Learning in the Deep Learning Era. arXiv 2020. Available online: https://www.semanticscholar.org/paper/6a248e075035cc6f17a64ed4336a507faad1f72e (accessed on 11 May 2025).
Chen, S.; Zhang, Y.; Yang, Q. Multi-Task Learning in Natural Language Processing: An Overview. arXiv 2024, arXiv:2109.09138. [Google Scholar] [CrossRef]
Liu, S.; Johns, E.; Davison, A.J. End-to-End Multi-Task Learning with Attention. arXiv 2019, arXiv:1803.10704. [Google Scholar] [CrossRef]
Lin, B.; Ye, F.; Zhang, Y.; Tsang, I.W. Reasonable Effectiveness of Random Weighting: A Litmus Test for Multi-Task Learning. arXiv 2021, arXiv:2111.10603. [Google Scholar]
Song, W.; Song, Z.; Liu, L.; Fu, R. Hierarchical Multi-task Learning for Organization Evaluation of Argumentative Student Essays. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Yokohama, Japan, 11–17 July 2020; pp. 3875–3881. [Google Scholar] [CrossRef]
Kendall, A.; Gal, Y.; Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. arXiv 2018, arXiv:1705.07115. [Google Scholar] [CrossRef]
Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; Association for Computational Linguistics: Vienna, Austria, 2020; pp. 38–45. [Google Scholar] [CrossRef]
Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2019, arXiv:1711.05101. Available online: http://arxiv.org/abs/1711.05101 (accessed on 10 September 2024). [CrossRef]
Liashchynskyi, P.; Liashchynskyi, P. Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS. arXiv 2019, arXiv:1912.06059. [Google Scholar] [CrossRef]
Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2009; p. 569. [Google Scholar]
Sanh, V.; Wolf, T.; Ruder, S. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. arXiv 2018, arXiv:1811.06031. [Google Scholar] [CrossRef]
Yu, T.; Kumar, S.; Gupta, A.; Levine, S.; Hausman, K.; Finn, C. Gradient Surgery for Multi-Task Learning. arXiv 2020, arXiv:2001.06782. [Google Scholar] [CrossRef]

Figure 1. The workflow of the proposed MTL framework.

Figure 2. Data preprocessing pipeline.

Figure 3. The high-level architecture of the proposed MTL.

Figure 4. Confusion matrices for (a) MTL-Sent-UW and (b) MTL-Emotion-UW models.

Figure 5. Classification reports for (a) MTL-Sent-UW and (b) MTL-Emotion-UW models.

Figure 6. Distribution of correct predictions and errors for each model.

Table 1. Summary of conducted studies in propaganda detection.

Reference	Language	Model Name	MTL Used?	Best Score	Limitation
[15]	English	BERT	✓	0.4074 F1-Score	Marginal benefit from MTL due to insufficiently strong relationships between the main and auxiliary tasks.
[16]	English	BERT	✓	0.4080 F1-Micro	Limited performance gains from MTL, constrained by dataset size and the weak interdependence between tasks; avoids considering the relationship between propaganda and other related tasks.
[17]	Arabic	AraBERT with CRF	✓	0.396 F1-Score	Marginal performance due to limited data size, affecting span extraction and classification accuracy. Also, the relationship with other opinion dimensions is not taken into account.
[18]	Arabic	AraBERT, MARBERT	✓	0.7634 F1-Micro	Limited by the reliance on content-type integration, which may not generalize well across domains. In addition, the analysis disregards any associations with additional opinion dimensions.
[19]	Arabic	GPT-3	✓	0.5347 F1-Micro	Performance is limited by dependence on prompt engineering and few-shot learning in the LLM-based approach. Moreover, interactions with other opinion dimensions are excluded from consideration.
[23]	Arabic	AraBERT	×	0.750 F1-Macro	Considers only individual tasks without multi-task relationships.
[26]	English	Maximum entropy, Naive Bayes, LSTM	×	0.5600 F1-Score	Requires significant time and effort for feature extraction.
[27]	English	Naive Bayes, SVMs	×	0.765 F1-Score	Generates a static embedding for each word without capturing the entire context of words used.
[28]	English	Maximum entropy, SVM	×	97.15% accuracy	Requires explicit feature extraction methods and handles only single-task scenarios, missing multi-task learning advantages.
[29]	English	SVM	×	0.81 F1-Score	Handles only individual tasks without considering related tasks.
[30]	English	Logistic Regression, CNN, BERT, LSTM-CRF	×	0.6231 F1-Score	Avoids integrating related tasks, limiting performance due to task isolation.
[31]	English	Logistic regression	×	0.6616 F1-Score	Limited by the simplicity of logistic regression, restricting its ability to model complex textual features.
[32]	English	SVM, Naïve Bayes, Decision Tree, Logistic Regression	×	0.647 F1-Score	Limited by dependency on manual hybrid feature engineering, impacting generalization to other domains.
[33]	Arabic	SVM, Naïve Bayes, Stochastic Gradient Descent, Logistic Regression, Random Forests, K-nearest Neighbor, AraBERT	×	0.649 F1-Score	Demands an extensive amount of data to effectively train the deep model. Requires feature extraction methods; considers only individual tasks without multi-task relationships.
[34]	Urdu	CNN and Logistic Regression	×	91% accuracy	Limited by single-task focus without task interactions.
[35]	English	LSTM	×	0.687 F1-Score	Demands an extensive amount of data to effectively train the deep model.
[36]	English	Gradient Boosting, LSTM	×	96.00% Accuracy	Considers only individual tasks without multi-task relationships.
[37]	English	Bi-LSTM	×	0.7500 F1-Score	Limited scalability due to the training complexity of Bi-LSTM, especially with larger datasets.
[39]	English	RoBERTa	×	88.32% Accuracy	Considers only individual tasks without multi-task relationships.
[40]	English	RoBERTa, BERT, DeBERTa	×	0.6000 F1-Score
[41]	Arabic	AraBERT, MARBERT, ARBERT, XLMRoBERTa, AraELECTRA, DeHateBERT	×	0.565 F1-Score	Considers only individual tasks without multi-task relationships.
[42]	Arabic	AraBERT	×	0.5850 F1-Micro	Relies on named entity recognition and data augmentation, limiting effectiveness across varying propaganda techniques.
[43]	Arabic	AraBERT	×	0.6020 F1-Micro	Data augmentation may introduce noise, limiting the robustness of AraBERT in propaganda detection.
[44]	Arabic	AraBERT-Twitter-v2	×	0.5666 F1-Micro	Considers only individual tasks without multi-task relationships.
[45]	Arabic	GPT-4	×	0.4025 F1-Micro	Limited by GPT-4’s sensitivity to prompts, affecting performance consistency across varied narratives.
[24]	Arabic	LlamaLens	×	0.747 F1-Micro	Performance is limited by multilingual capability constraints and domain-specific fine-tuning requirements.

Table 2. Symbols’ definitions in the proposed MTL.

Symbol	Explanation
$T$	Total number of tasks, $t = 1, \dots, T$
$D_{t}$	Training data for task $t$
$N$	Number of examples in $D_{t}$
$x_{i}$	Input text, $x = (x_{1}, \dots, x_{N})$
$y_{i}$	Label set for input text $x_{i}$
$Z_{t}$	Task descriptor generated in the shared layers
$L_{t}$	Loss function for task $t$
$θ_{sh}$	Shared parameters during the encoding stage
$θ_{t}$	Task-specific parameters for output decoder heads

Table 3. Comparative overview of single-task vs. multi-task learning objectives.

Feature	Single-Task Model	MTL Model
Dataset	$D = \{(x_{i}, y_{i})\}_{i=1}^{N}$	$D_{t} = \{\{(x_{i}, y_{i})\}_{i=1}^{N}\}_{t=1}^{T}$
Task Descriptor	Not used	$\{Z_{t}\}_{t=1}^{T}$
Loss Function	$L$	$L_{t}$ for each task $t$
Objective Function	$\min_{θ} L(θ, D)$	$\min_{θ_{sh}, θ_{1}, \dots, θ_{T}} \sum_{t=1}^{T} L_{t}(\{θ_{sh}, θ_{t}\}, D_{t})$
Model	$f_{θ}(y_{i} \mid x_{i})$	$f_{θ}(y_{i} \mid x_{i}, Z_{t})$
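
The multi-task objective in Table 3 sums one loss per task, each computed from the shared parameters and that task's own parameters. The following is a minimal pure-Python sketch of that sum, assuming cross-entropy as each $L_{t}$ and mean reduction over examples. The toy encoder `shared_encode`, the linear heads in `task_head`, and the example data are hypothetical placeholders, not the transformer-based components used in the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-likelihood of the gold label.
    return -math.log(softmax(logits)[label])

def shared_encode(x, theta_sh):
    # Hypothetical shared layer: element-wise scaling of the input features.
    return [w * xi for w, xi in zip(theta_sh, x)]

def task_head(z, theta_t):
    # Hypothetical task-specific head: one weight row per class.
    return [sum(w * zi for w, zi in zip(row, z)) for row in theta_t]

def mtl_objective(theta_sh, task_params, task_data):
    # Sum over tasks t of the mean loss L_t({theta_sh, theta_t}, D_t).
    total = 0.0
    for t, data in task_data.items():
        losses = [
            cross_entropy(task_head(shared_encode(x, theta_sh), task_params[t]), y)
            for x, y in data
        ]
        total += sum(losses) / len(losses)
    return total

# Toy setup: 2 features; 2, 3, and 5 classes as in Table 4.
theta_sh = [0.0, 0.0]
task_params = {
    "propaganda": [[0.0, 0.0]] * 2,
    "sentiment": [[0.0, 0.0]] * 3,
    "emotion": [[0.0, 0.0]] * 5,
}
task_data = {
    "propaganda": [([1.0, 2.0], 1)],
    "sentiment": [([1.0, 2.0], 0)],
    "emotion": [([1.0, 2.0], 3)],
}
objective = mtl_objective(theta_sh, task_params, task_data)
```

With all-zero parameters every head predicts a uniform distribution, so the objective reduces to $\ln 2 + \ln 3 + \ln 5$, which makes the sketch easy to sanity-check.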

Table 4. Data split statistics of the MultiProSE dataset.

Task	Label	Train	Test
Propaganda	True	62.99%	62.9%
Propaganda	False	37.01%	37.1%
Sentiment	Positive	35.60%	37.4%
	Negative	42.70%	40.4%
	Neutral	21.70%	22.2%
Emotion	Happiness	29.80%	31.5%
	Sadness	20.40%	24.5%
	Anger	16.50%	14.4%
	Fear	4.60%	3.4%
	None	28.60%	26.2%

Table 5. Overview of proposed model variations.

Model Categories	Model Name	Model Description
MTL Models	MTL-UAW	A parallel multi-task learning model that leverages three tasks, propaganda, sentiment, and emotion, and utilizes uniform averaging weighting. This model is shown in Figure 1.
	MTL-LS	A parallel multi-task learning model that leverages three tasks, propaganda, sentiment, and emotion, and utilizes linear scalarization weighting. This model is shown in Figure 1.
	MTL-Sent-UAW (PSMTL-UAW)	An MTL model that leverages two tasks, propaganda and sentiment, and utilizes uniform averaging weighting.
	MTL-Sent-LS (PSMTL-LS)	An MTL model that leverages two tasks, propaganda and sentiment, and utilizes linear scalarization weighting.
	MTL-Emo-UAW (PEMTL-UAW)	An MTL model that leverages two tasks, propaganda and emotion, and utilizes uniform averaging weighting.
	MTL-Emo-LS (PEMTL-LS)	An MTL model that leverages two tasks, propaganda and emotion, and utilizes linear scalarization weighting.
Proposed Task-Weighting	MTL-DDW	An MTL setting that utilizes dynamic difficulty weighting.
	MTL-PGRW (NormalDist)	An MTL setting that utilizes priority-guided random weighting, based on a normal sampling distribution.
	MTL-PGRW (BernoulliDist)	An MTL setting that utilizes priority-guided random weighting, based on a Bernoulli sampling distribution.
	MTL-PGRW (UniformDist)	An MTL setting that utilizes priority-guided random weighting, based on a uniform sampling distribution.
	MTL-HW	An MTL setting that utilizes hierarchical weighting.
	MTL-UW	An MTL setting that utilizes uncertainty weighting.
	MTL-PASUW	An MTL setting that utilizes priority-aware softmax-based uncertainty weighting.
MTL Models with the Best Proposed Task-Weighting	MTL-Sent-UW (PSMTL-UW)	An MTL model that leverages two tasks, propaganda and sentiment, and utilizes uncertainty weighting.
MTL Models with the Best Proposed Task-Weighting	MTL-Emo-UW (PEMTL-UW)	An MTL model that leverages two tasks, propaganda and emotion, and utilizes uncertainty weighting.
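
Several of the variants in Table 5 differ only in how the per-task losses are combined. The sketch below illustrates three of them in their commonly used textbook forms; the exact formulas used in the paper are given in Section 3.6, so the functions here are assumptions. In particular, `uncertainty_weighting` uses the learned log-variance simplification often adopted when implementing the method of Kendall et al., and the loss values and weights are hypothetical.

```python
import math

def uniform_average(losses):
    # Uniform averaging: every task gets weight 1/T.
    return sum(losses) / len(losses)

def linear_scalarization(losses, weights):
    # Linear scalarization: fixed, hand-chosen weight per task.
    return sum(w * l for w, l in zip(weights, losses))

def uncertainty_weighting(losses, log_vars):
    # Uncertainty weighting (assumed log-variance form):
    # each loss is scaled by exp(-s_t) and regularized by s_t,
    # where s_t = log(sigma_t^2) is a learnable scalar per task.
    return sum(math.exp(-s) * l + s for l, s in zip(losses, log_vars))

# Hypothetical per-task losses: propaganda, sentiment, emotion.
losses = [0.6, 0.9, 1.5]
uaw = uniform_average(losses)
ls = linear_scalarization(losses, [0.5, 0.25, 0.25])
uw = uncertainty_weighting(losses, [0.0, 0.0, 0.0])
```

With all log-variances at zero, uncertainty weighting reduces to the plain sum of the task losses; during training the log-variances would be updated by gradient descent together with the model parameters.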

Table 6. Hyperparameter values.

Hyperparameter	Value
Max. sequence length	256
Batch size	8
Number of epochs	5
Early stop patience	2
Optimizer	AdamW
Learning rate	2 × 10⁻⁵
Weight decay	0.001
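
The values in Table 6 can be collected in a small configuration object, and the interaction between the epoch budget and the early-stopping patience can be sketched as follows. The class name `TrainConfig`, the function `epochs_run`, and the choice of a validation score as the monitored quantity are assumptions for illustration; the section does not state which metric drives early stopping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values taken from Table 6.
    max_seq_length: int = 256
    batch_size: int = 8
    num_epochs: int = 5
    early_stop_patience: int = 2
    optimizer: str = "AdamW"
    learning_rate: float = 2e-5
    weight_decay: float = 0.001

def epochs_run(val_scores, cfg):
    # Return the number of epochs actually trained, given one
    # validation score per epoch (higher is better, an assumption).
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores[: cfg.num_epochs], start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= cfg.early_stop_patience:
                return epoch
    return min(len(val_scores), cfg.num_epochs)

cfg = TrainConfig()
stopped_early = epochs_run([0.70, 0.75, 0.74, 0.73, 0.80], cfg)
full_budget = epochs_run([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], cfg)
```

In the first hypothetical run the score fails to improve for two consecutive epochs, so training stops after epoch 4; in the second it improves every epoch and the five-epoch budget is the limit.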

Table 7. Confusion matrix.

Actual/Predicted	Propaganda (True)	Nonpropaganda (False)
Propaganda (True)	True Positive (TP)	False Negative (FN)
Nonpropaganda (False)	False Positive (FP)	True Negative (TN)
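
The evaluation measures reported in the result tables follow directly from the four cells of Table 7. The sketch below computes accuracy, per-class F1, and Macro-F1 from hypothetical counts; the function names and the example counts are illustrative and are not taken from the paper's experiments.

```python
def f1_score(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def binary_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    # For the Nonpropaganda class the roles of the cells swap:
    # its true positives are TN, its false positives are FN,
    # and its false negatives are FP.
    f1_propaganda = f1_score(tp, fp, fn)
    f1_nonpropaganda = f1_score(tn, fn, fp)
    macro_f1 = (f1_propaganda + f1_nonpropaganda) / 2
    return {
        "accuracy": accuracy,
        "f1_propaganda": f1_propaganda,
        "f1_nonpropaganda": f1_nonpropaganda,
        "macro_f1": macro_f1,
    }

# Hypothetical counts, laid out as in Table 7.
metrics = binary_metrics(tp=40, fn=10, fp=20, tn=30)
```

For single-label binary classification, Micro-F1 pools the counts over both classes and therefore coincides with accuracy, whereas Macro-F1 gives the minority class equal weight.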

Table 8. Ablation experiments on the proposed MTL models.

Model Categories	Model Name	Accuracy	Macro F1	Micro F1
MTL Models	MTL-LS	79%	0.775	0.790
	MTL-UAW	77%	0.763	0.777
	MTL-Sent-LS	79%	0.778	0.792
	MTL-Sent-UAW	78%	0.770	0.784
	MTL-Emo-LS	78%	0.773	0.789
	MTL-Emo-UAW	79%	0.775	0.791

Table 9. Experimental results on the proposed task-weighting.

Model Categories	Model Name	Accuracy	Macro F1	Micro F1
Proposed Task-Weighting	MTL-DDW	79%	0.772	0.790
	MTL-PGRW (NormalDist)	78%	0.767	0.784
	MTL-PGRW (BernoulliDist)	77%	0.765	0.778
	MTL-PGRW (UniformDist)	78%	0.764	0.782
	MTL-HW	79%	0.775	0.791
	MTL-UW	79%	0.776	0.790
	MTL-PASUW	78%	0.765	0.782
MTL Models with best Proposed Task-Weighting	MTL-Sent-UW	79%	0.778	0.791
MTL Models with best Proposed Task-Weighting	MTL-Emo-UW	79%	0.777	0.795

Table 10. A comparison of the best models with SOTA models.

Model	Accuracy	Macro-F1	Micro-F1
STL-AraBERT [22]	77%	0.756	0.769
STL-AraBERT with Features [25]	78%	0.768	0.785
STL-ML with Features [25]	75%	0.708	0.755
STL-AraBERT [23]	NA	0.750	0.767
STL-LlamaLens: Specialized Multilingual LLM [24]	NA	NA	0.747
MTL-MARBERT [6]	76%	0.742	0.757
MTL-Sent-LS (ours)	79%	0.778	0.792
MTL-Sent-UW (ours)	79%	0.778	0.791

Table 11. Paired t-test results.

Metric	MTL-Sent-LS Mean	MTL-MARBERT Mean	t-Statistic	p-Value	Significant (p < 0.05)
Accuracy	79%	76%	9.285295	0.000748	True
Macro F1	0.778	0.742	13.132868	0.000194	True
Micro F1	0.791	0.757	9.285295	0.000748	True
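
The t-statistics and p-values in Table 11 can be related with a short calculation. The sketch below computes a paired t-statistic from per-run scores and a two-sided p-value from the closed-form Student's t distribution with 4 degrees of freedom. The number of paired runs is not stated in this section; 4 degrees of freedom (five paired runs) is an assumption, chosen because it reproduces the reported p-values. The per-run scores in the example are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    # t = mean(d) / (sd(d) / sqrt(n)) over the paired differences d.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def two_sided_p_df4(t):
    # Closed-form CDF of Student's t with 4 degrees of freedom:
    # F(t) = 1/2 + (3/8) * x * (1 - x^2 / 12), x = t / sqrt(1 + t^2 / 4).
    x = abs(t) / math.sqrt(1 + t * t / 4)
    cdf = 0.5 + 0.375 * x * (1 - x * x / 12)
    return 2 * (1 - cdf)

# t-statistics reported in Table 11.
p_accuracy = two_sided_p_df4(9.285295)
p_macro_f1 = two_sided_p_df4(13.132868)

# Hypothetical per-run scores for two models (five paired runs).
t_example = paired_t_statistic([3, 5, 7, 9, 11], [2, 3, 4, 5, 6])
```

Under this assumption the two computed p-values agree with the 0.000748 and 0.000194 listed in the table.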

Table 12. Error counts extracted from the confusion matrices. Bold indicates the best results.

Model	FP	FN	Total Errors	Accuracy	Recall (Class Propaganda)
MTL-Sent-UW	133	142	275	0.7926	0.8293
MTL-Emotion-UW	138	131	269	0.7971	0.8425
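
The columns of Table 12 are linked by simple arithmetic: total errors equal FP plus FN, and accuracy equals one minus the error rate. The sketch below checks the first relation and searches for test-set sizes consistent with the rounded accuracies. The test-set size is not stated in this section, so any value returned by the hypothetical helper `consistent_sizes` is an inference from the rounded figures, not a reported number.

```python
rows = {
    "MTL-Sent-UW": {"fp": 133, "fn": 142, "total": 275, "accuracy": 0.7926},
    "MTL-Emotion-UW": {"fp": 138, "fn": 131, "total": 269, "accuracy": 0.7971},
}

def consistent_sizes(errors, accuracy, max_n=10000):
    # All test-set sizes n for which 1 - errors/n rounds to the
    # reported four-decimal accuracy.
    return [
        n for n in range(errors, max_n)
        if abs((1 - errors / n) - accuracy) < 5e-5
    ]

# Check that the error columns add up, then infer candidate sizes.
totals_match = all(r["fp"] + r["fn"] == r["total"] for r in rows.values())
sizes = {name: consistent_sizes(r["total"], r["accuracy"]) for name, r in rows.items()}
```

Both rows admit the same single candidate size, which suggests the two models were evaluated on the same test split.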

Table 13. Percentage of correct and error types produced by each model.

Model	Type-1	Type-2	Type-3	Type-4
MTL-Sent-UW	52%	25%	10%	13%
MTL-Emotion-UW	53%	27%	9%	11%

Table 14. Examples of errors produced by MTL models on MultiProSE dataset.

Arabic Text	English Text	Predicted Label	True Label	Justification
وجاء إعلان السلطات السورية لينفي معلومات نشرها موقع البعث ميديا نقل تصريحات لمدير صحة السويداء الدكتور نزار مهنا قال إن مواطنين دخلوا البلاد بطرق شرعية يلتزمون بالحجر الصحي في منازلهم المراقبة الصحية	The announcement by the Syrian authorities came to deny information published by the Al-Baath Media website, which reported statements by the Director of Health in Al-Suwayda, Dr. Nizar Muhanna, who said that citizens who entered the country through legal means are adhering to home quarantine and health monitoring.	Propaganda	Nonpropaganda	Neutral official denial misclassified as propaganda, likely due to formal tone and presence of official entities triggering propaganda cues.
واعتبرت جبهة بوليساريو أن العملية المغربية أنهت وقف إطلاق النار الموقع عام برعاية الأمم المتحدة، عاماً القتال.	The Polisario Front considered that the Moroccan operation ended the ceasefire agreement signed under the auspices of the United Nations, resuming the years of fighting.	Propaganda	Nonpropaganda	Factual political statement misclassified due to conflict-related terminology, causing false alarms.
التقى رئيس وزراء الاحتلال بنيامين نتنياهو الثلاثاء برئيس المجلس الوزاري التشادي عبد الكريم ديبي نجل الرئيس إدريس ديبي ورئيس المخابرات التشادية أحمد كغاري	The Israeli Prime Minister Benjamin Netanyahu met on Tuesday with the Chadian Ministerial Council President Abdel Karim Déby, the son of President Idriss Déby, and the Chadian Intelligence Chief Ahmed Kogri.	Nonpropaganda	Propaganda	Propaganda missed due to neutral diplomatic reporting and subtle framing, lacking explicit manipulative cues.
قالت المنظمة الكندية الحكومية إن ثمة تقارير حكومية مسربة تظهر بأن استشراء المرض يشمل انتشار صنف بلازموديوم فيفاكس العصِيّ على العلاج.	The Canadian governmental organization stated that there are leaked government reports indicating that the spread of the disease includes the proliferation of the Plasmodium vivax strain, which is resistant to treatment.	Propaganda	Nonpropaganda	Factual health news misclassified as propaganda because of alarming language (“leaked,” “spread”), resulting in false positives.
يملك لبنان تجهيزات لإدارة الكوارث وإمكانات تقنية متقدمة، وسارعت دول عدة إلى إرسال فرق إغاثة ومساعدات تقنية لمساعدته بعد الانفجار.	Lebanon possesses disaster management equipment and advanced technical capabilities, and several countries quickly sent relief teams and technical assistance to help it after the explosion.	Nonpropaganda	Propaganda	Positive news on disaster response may contain subtle persuasive framing missed by the model, leading to false negatives.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Al-Henaki, L.; Al-Khalifa, H.; Al-Salman, A. Enhancing Propaganda Detection in Arabic News Context Through Multi-Task Learning. Appl. Sci. 2025, 15, 8160. https://doi.org/10.3390/app15158160

