GTHL-Emo: Adaptive Imbalance-Aware and Correlation-Aligned Training for Arabic Multi-Label Emotion Detection

Alrasheedy, Mashary N.; Tiun, Sabrina; Fauzi, Fariza

doi:10.3390/electronics15061169


by Mashary N. Alrasheedy ^1,2,*, Sabrina Tiun ^1,* and Fariza Fauzi

¹ Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia

² Department of Computer Science, Applied College, University of Ha’il, Ha’il 55424, Saudi Arabia

* Authors to whom correspondence should be addressed.

Electronics 2026, 15(6), 1169; https://doi.org/10.3390/electronics15061169

Submission received: 8 January 2026 / Revised: 20 February 2026 / Accepted: 4 March 2026 / Published: 11 March 2026

(This article belongs to the Special Issue Deep Learning Approaches for Natural Language Processing)


Abstract

Multi-label emotion detection (MLED) suffers from long-tailed label distributions and structured inter-label correlations, which jointly suppress rare label recall and yield incoherent predictions. We present Graph Neural Network-Enhanced Transformer with Hybrid Loss Weighting (GTHL-Emo), a unified framework that addresses both challenges without heavy additional machinery. First, an adaptive imbalance-aware training scheme combines binary cross-entropy, asymmetric focal, and pairwise ranking losses under a learned batch-wise controller, emphasizing rare labels while stabilizing thresholding. Second, a lightweight correlation alignment module learns transformer-based label embeddings and aligns their predicted affinities with empirical co-occurrence via Kullback–Leibler (KL) regularization, smoothing rare label predictions through correlated frequent labels. A transformer encoder with learnable attention pooling provides semantic representations, and a dynamic GraphSAGE layer captures inter-instance structural dependencies. Comprehensive evaluation across three Arabic benchmarks—SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025 (Track A, Arq)—demonstrates competitive or leading performance. On SemEval-2018-Ec-Ar, GTHL-Emo attained a Jaccard accuracy of 58.70%, micro-F1 score of 71.02%, and macro-F1 score of 60.48%. On ExaAEC, it achieved a Jaccard accuracy of 65.99%, micro-F1 score of 70.72%, and macro-F1 score of 68.71%. On SemEval-2025-Arq, it obtained a Jaccard accuracy of 41.47%, micro-F1 score of 56.78%, and macro-F1 score of 56.69%. Ablation studies revealed that the GraphSAGE structure and ranking loss contributed most significantly (1.45% and 1.46% Jaccard accuracy drops, respectively), while label correlation alignment provided consistent improvements across the scales. These findings demonstrate that jointly optimizing imbalance-aware objectives and label dependencies yields robust Arabic MLED with minimal overhead.

Keywords:

Arabic natural language processing; multi-label emotion detection; graph neural networks; label correlation; class imbalance; adaptive hybrid loss; transformer models

1. Introduction

Emotion detection in text is a fundamental task in affective computing and natural language processing (NLP), with applications spanning social media monitoring, mental health analytics, and intelligent dialogue systems [1]. Unlike traditional sentiment analysis, which typically predicts a single polarity [2], multi-label emotion detection (MLED) acknowledges that human expressions often convey multiple emotions simultaneously, reflecting the complex and nuanced nature of affective communication in real-world scenarios [3]. This complexity is particularly pronounced for morphologically rich languages such as Arabic, which presents unique challenges including dialectal variation, orthographic inconsistency, and limited annotated resources [4,5].

Multi-label emotion detection in Arabic social media texts presents several interconnected challenges that jointly impede model performance. First, the morphological complexity of Arabic, coupled with prevalent dialectal variations, complicates semantic understanding and feature extraction. Recent Arabic emotion detection advances include span-level approaches [6] that identify emotion-bearing text segments, complementing sentence-level classification tasks. Second, emotion datasets exhibit severe class imbalance, where frequent emotions (e.g., joy or sadness) dominate training while underrepresented emotions (e.g., optimism or trust) receive insufficient learning signals, leading to biased predictions [5,7]. Third, and critically, emotional states manifest intricate interdependencies, and the co-occurrence patterns between emotions create structured correlations that, when ignored, result in semantically inconsistent predictions. Recent advances in label correlation modeling [8] have demonstrated the importance of explicit dependency capture through regularization constraints and pairwise emotion relationships, motivating unified approaches that jointly optimize correlation-aware objectives with class imbalance mitigation [9]. These challenges are compounded in Arabic by the scarcity of large-scale, high-quality multi-label emotion corpora.

Recent transformer-based architectures have achieved remarkable progress in text representation and classification tasks. Models such as BERT [10], RoBERTa [11], and multilingual variants (e.g., XLM-R [12]) have established new benchmarks across diverse NLP applications. For Arabic, specialized models including AraBERT [13] and MARBERT [14] have demonstrated superior performance in sentiment analysis and emotion recognition tasks. In multi-label emotion classification, frameworks such as GoEmotions [15] have leveraged large-scale datasets with transformer architectures to capture multi-faceted emotional expressions. Recent Arabic-specific studies have explored transformer fine-tuning strategies [5,16], but these approaches primarily address single-label classification or fail to jointly model class imbalance and label correlations as critical bottlenecks for robust MLED. To address label correlation modeling, several works have explored explicit dependency capture mechanisms. Traditional approaches include classifier chains [17] and attention-based label embeddings [18], while recent advances leverage graph neural networks (GNNs) to model both inter-sample and inter-label relationships [19]. Notable contributions include label-dependent graph networks [20], which utilize graph convolutional networks (GCNs) to capture label dependencies, and GraphSAGE-based frameworks for contextual relationship modeling in multi-label settings [21]. Simultaneously, class imbalance mitigation has been addressed through adaptive loss functions, including focal loss variants [22] and ranking-based objectives [23], with recent work exploring weighted loss strategies for Arabic emotion detection [24]. However, existing methods typically operate on balanced datasets and lack unified frameworks that jointly optimize imbalance-aware objectives with correlation modeling.

Recent advances in graph-based multi-label text classification have demonstrated the effectiveness of modeling label dependencies through GNNs. Chen et al. [25] pioneered the use of graph convolutional networks for learning label embeddings that capture inter-label correlations in image recognition, showing that structured label relationships significantly improve classification performance. In text classification, Vu et al. [26] extended this concept by introducing correlation matrices based on label co-occurrence probabilities, demonstrating superior performance in multi-label text classification through graph-enhanced label representations. Similarly, Huang et al. [27] explored hybrid attention mechanisms for extreme multi-label text classification, incorporating both document content and label structure through adaptive fusion strategies. While these approaches have shown success in their respective domains (image recognition, English text classification, and extreme classification), they have not been adapted for Arabic emotion detection, nor do they address the specific challenges of class imbalance and dynamic label correlation alignment that are critical for Arabic MLED.

This work presents Graph Neural Network-Enhanced Transformer with Hybrid Loss Weighting (GTHL-Emo), a unified framework that synergistically addresses both class imbalance and label correlation challenges in Arabic MLED. Our architecture combines a transformer-based semantic encoder with GraphSAGE for inter-instance relationship modeling, while a dedicated transformer-based label embedding module captures higher-order label correlations. Most critically, we introduce an adaptive hybrid loss weighting scheme that dynamically balances binary cross-entropy, focal, ranking, and graph regularization losses using batch-level statistics (Jaccard similarity and entropy), enabling responsive optimization across imbalanced label distributions.

Our contributions are summarized as follows:

We present a unified architecture that jointly leverages transformer encoders, graph neural networks, and correlation-aware label embeddings for robust Arabic multi-label emotion detection.
We propose an adaptive hybrid loss weighting mechanism driven by batch-level statistics that dynamically mitigates class imbalance while preserving the label correlation structure.
We introduce a transformer-based label embedding module with KL-divergence alignment that captures complex label dependencies and smooths predictions for rare emotions through correlated frequent labels.
We conduct comprehensive evaluation across three Arabic benchmarks (SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq) with detailed ablation studies demonstrating the necessity of each architectural component.

To the best of our knowledge, this is the first work to propose a unified framework that synergistically combines graph-based label correlation modeling, transformer-enhanced text representation, and adaptive batch statistics-driven hybrid loss weighting specifically for Arabic MLED. While previous works have explored label correlations through graph neural networks [25,26] in image recognition and English text classification, and adaptive attention mechanisms [27] in extreme multi-label settings, our approach uniquely adapts and unifies these techniques with novel extensions tailored for Arabic emotion detection: (1) dynamic correlation alignment through Kullback–Leibler (KL) divergence rather than static adjacency matrices, (2) batch-level adaptive loss weighting rather than fixed multi-objective combinations, and (3) comprehensive validation across diverse Arabic language varieties and social media contexts with severe class imbalance.

Our findings establish new state-of-the-art performance for Arabic MLED while providing interpretable insights into the relative importance of semantic modeling, structural relationships, and adaptive training objectives.

The remainder of this paper is structured as follows. Section 2 reviews existing literature and related work relevant to our study. Section 3 introduces our proposed methodology, detailing the problem formulation and transformer-based text representation. It subsequently describes the core components of the framework, including dynamic graph construction, label dependency modeling, feature fusion, and the classification head. Furthermore, this section elaborates on the adaptive hybrid loss function and the inference and thresholding procedures. Section 4 presents the experimental setup, description of datasets, and a comprehensive analysis of the results. Finally, Section 5 concludes the paper and outlines potential directions for future research.

2. Related Work

Arabic multi-label emotion detection has evolved from traditional machine learning (ML) to transformer-based approaches, yet it remains significantly challenged by class imbalance, label correlations, and resource scarcity. Early Arabic emotion classification efforts focused on lexicon-based methods and single-label sentiment analysis [3,28], with the SemEval-2018 Arabic emotion detection task establishing the first major benchmark [29]. The winning EMA team [30] achieved 48.9% accuracy through SVC-L1 with AraVec embeddings and extensive feature engineering, while subsequent competition participants explored binary relevance transformations and traditional ML approaches [31,32]. Deep learning advances introduced contextual modeling. Samy et al. [33] proposed context-aware Gated Recurrent Units (GRUs) achieving 53.2% accuracy with implicit label correlation modeling, while subsequent work explored Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) hybrids and attention mechanisms [34,35]. The transformer era brought significant breakthroughs, with Elfaik and Nfaoui [36] establishing the current state of the art at 60% accuracy through AraBERT + BabelSenticNet + AraVec fusion with attentional BiGRU-BiLSTM-CNN architectures. Recent SemEval-2025 Arabic (Arq) results demonstrated continued progress; the HTU team achieved a 51.2% macro-F1 score using label-fused iterative mask filling with six DziriBERT models [37], while LATE-GIL-NLP demonstrated the effectiveness of optimized loss function combinations including focal loss, cross-entropy, and label smoothing [38]. Meanwhile, Sarbazi-Azad et al. [39] introduced ExaAEC, a 20,000-tweet Arabic multi-label emotion corpus with 10 emotion classes based on Plutchik’s model, with their baseline BiLSTM + ELMo achieving a 65.7% F1 score. 
Limited work addresses explicit label correlations in Arabic contexts, while general multi-label classification has explored classifier chains [17], label-dependent graph networks [20], and semantic-sensitive graph convolutions [19], techniques that remain largely unexplored for Arabic emotion detection. Recent research has explored information-theoretic regularization to explicitly control statistical dependence in deep neural models. Incremona et al. [40] introduced a differentiable, uncertainty-aware mutual information (MI) regularizer for bias mitigation, demonstrating how MI constraints can regulate representation-label dependencies under uncertainty. Unlike global MI-based objectives, our KL divergence alignment focuses specifically on preserving structured empirical emotion co-occurrence patterns rather than minimizing generic dependence. Beyond NLP, the graph learning literature emphasizes adaptive edge construction and self-supervision. MEGA [41] and self-supervised edge feature modeling for COVID-19 severity prediction [42] illustrate how learned edge representations enhance relational reasoning. Similarly, Chen et al. [25] and Vu et al. [26] demonstrated that structured label graphs significantly improve multi-label classification, reinforcing the theoretical foundation of our correlation-aware graph modeling framework. Class imbalance mitigation has been addressed through focal loss variants [24] and recent SemEval-2025 studies demonstrating adaptive loss combinations [38], yet dynamic batch statistics-driven loss weighting remains unexplored. 
Existing Arabic multi-label emotion approaches exhibit critical limitations: (1) independent label treatment ignoring emotional co-occurrence patterns, (2) a lack of inter-instance relationship modeling through graph structures, (3) static loss functions inadequate for dynamic class distributions, and (4) the absence of unified architectures jointly optimizing semantic representation, structural relationships, and adaptive training objectives. Our GTHL-Emo framework addresses these gaps through a unified transformer-GraphSAGE architecture with adaptive hybrid loss weighting, representing the next evolutionary step in Arabic MLED.

3. Proposed Methodology

This section presents the proposed GTHL-Emo framework for Arabic multi-label emotion classification. Our methodology addresses the challenges of semantic representation, inter-sample and inter-label dependencies, class imbalance, and interpretability in emotion classification from social media text. The architecture integrates (1) transformer-based text encoding, (2) dynamic graph-based aggregation using GraphSAGE, (3) label dependency modeling via a label transformer, and (4) an adaptive hybrid loss function. Figure 1 provides an overview.

3.1. Problem Formulation

Multi-label emotion classification aims to identify all relevant emotions expressed within a given text, allowing for the simultaneous detection of multiple affective states. In the context of Arabic social media analysis, this task is especially challenging due to linguistic variability, frequent label co-occurrence, and significant class imbalance.

Let $D = \{(x_i, y_i)\}_{i=1}^{N}$ denote a dataset of $N$ Arabic text samples, where each $x_i$ is a raw text instance (e.g., a tweet) and $y_i \in \{0, 1\}^{C}$ is a binary vector indicating the presence or absence of each of the $C$ possible emotion labels. In the multi-label setting, a single instance may be associated with multiple emotions such that $\sum_{c=1}^{C} y_{ic} \geq 1$. The objective is to learn a function $f_{\theta} : x \to \hat{y}$, parameterized by $\theta$, which maps each input text $x$ to a predicted vector $\hat{y} \in [0, 1]^{C}$, where $\hat{y}_c$ denotes the predicted probability of the $c$th emotion being present. Binary predictions are obtained by thresholding, through which $\tilde{y}_c = 1$ if $\hat{y}_c \geq \tau_c$, and $\tilde{y}_c = 0$ otherwise, where $\tau_c$ denotes a (possibly label-specific) threshold that can be tuned on the validation set.

Given the morphological complexity of Arabic, label co-occurrence, and class imbalance, $f_{\theta}$ must capture rich contextual features, model label dependencies, and mitigate bias toward frequent classes. We frame model training as minimizing a composite multi-label loss function $L(y, \hat{y})$, which combines binary cross-entropy (BCE) (for multi-label classification), focal loss (to address class imbalance), ranking loss (for improved ranking consistency), and graph-based regularization (to align predicted label affinities with empirical label co-occurrence, as detailed in Section 3.6.2), as defined in Equation (1):

$$L = L_{\mathrm{BCE}} + \lambda_f L_{\mathrm{focal}} + \lambda_r L_{\mathrm{ranking}} + \lambda_g L_{\mathrm{graph}}, \tag{1}$$

where $\lambda_f$, $\lambda_r$, and $\lambda_g$ are adaptive coefficients, dynamically predicted by a learned network, that control the contribution of each loss term. We evaluated model performance using established multi-label metrics, including the Jaccard score, micro- and macro-F1, and Hamming loss.
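For concreteness, the four evaluation metrics named above can be computed from binary label matrices as in the following minimal sketch. It uses plain Python lists; the function names and the convention that an empty union counts as a perfect Jaccard match are illustrative assumptions, not details taken from the paper.

```python
def jaccard_score(y_true, y_pred):
    """Sample-averaged Jaccard index over binary label vectors."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        # Assumption: no true and no predicted labels counts as a perfect match.
        total += inter / union if union else 1.0
    return total / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots that are predicted incorrectly."""
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / (len(y_true) * len(y_true[0]))


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) from per-label TP/FP/FN counts."""
    counts = []
    for c in range(len(y_true[0])):
        tp = sum(t[c] == 1 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[c] == 0 and p[c] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[c] == 1 and p[c] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    micro = _f1(*(sum(col) for col in zip(*counts)))  # pool counts, then F1
    macro = sum(_f1(*c) for c in counts) / len(counts)  # F1 per label, then mean
    return micro, macro
```

Micro-F1 pools counts over all labels and is therefore dominated by frequent emotions, whereas macro-F1 weights every label equally, which is why the two are reported together under class imbalance.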

3.2. Transformer-Based Text Representation

Given an input text $x^{(i)}$, we first tokenize it into a sequence of $T$ tokens as $x^{(i)} = [w_1, w_2, \dots, w_T]$. This sequence is fed into a pretrained transformer encoder, specifically MARBERT, to obtain contextualized representations for each token. The output of the transformer is a matrix of hidden states $H^{(i)}$, as shown in Equation (2):

$$H^{(i)} = [h_1, h_2, \dots, h_T] \in \mathbb{R}^{T \times d} \tag{2}$$

where $h_j \in \mathbb{R}^{d}$ denotes the contextualized embedding of the $j$th token and $d$ is the transformer’s hidden size. Unlike simple pooling operations, we employed a learnable self-attention mechanism to construct a sentence-level embedding that emphasizes the tokens most relevant to emotion recognition. For each token $h_j$, we first computed an unnormalized attention score $e_j$, as defined in Equation (3):

$$e_j = h_j^{\top} a + b \tag{3}$$

where $a \in \mathbb{R}^{d}$ and $b \in \mathbb{R}$ are trainable parameters. This score is then transformed into a normalized attention weight $\alpha_j$ using the softmax operation, as given in Equation (4):

$$\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{T} \exp(e_k)} \tag{4}$$

Using these attention weights, we computed the attended sentence (span) embedding $s^{(i)}$ as a weighted sum of the token representations, as formulated in Equation (5):

$$s^{(i)} = \sum_{j=1}^{T} \alpha_j h_j \tag{5}$$

This operation effectively summarizes the sequence into a fixed-size vector, allowing the model to focus on salient or emotion-bearing words while suppressing irrelevant context. The attended sentence embedding $s^{(i)}$ serves as the core representation for each input, capturing both local and global semantic information.
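Equations (3)–(5) can be traced step by step in a few lines. The sketch below is illustrative only: it operates on plain Python lists rather than framework tensors, and the function name `attention_pool` is an assumption introduced here.

```python
import math


def attention_pool(H, a, b):
    """Learnable attention pooling over token vectors.

    H: list of T token vectors (each of length d)
    a: length-d scoring vector, b: scalar bias (both trainable in the model)
    Returns the pooled vector s and the attention weights alpha.
    """
    # Eq. (3): unnormalized score e_j = h_j^T a + b
    scores = [sum(h_k * a_k for h_k, a_k in zip(h, a)) + b for h in H]
    # Eq. (4): softmax over tokens (max subtracted for numerical stability)
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    alphas = [x / z for x in exps]
    # Eq. (5): attention-weighted sum of token vectors
    d = len(H[0])
    s = [sum(alpha * h[k] for alpha, h in zip(alphas, H)) for k in range(d)]
    return s, alphas
```

One observation that follows directly from Equation (4): because the same bias $b$ is added to every token score, it cancels in the softmax and does not change the attention weights.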

3.3. Dynamic Graph Construction and GraphSAGE Aggregation

To effectively capture the relationships between the text samples within each mini-batch, we dynamically constructed a graph whose structure is updated in every training iteration. Each sample $i$ in the batch is represented as a node, utilizing its sentence-level (span) embedding $s^{(i)}$ derived from the transformer-based self-attention mechanism as node features. We employed GraphSAGE for neighbor aggregation due to its inductive learning capability and sampling-based efficiency for dynamic batch graphs, an architectural choice empirically validated in our subsequent experimental analysis.

3.3.1. Mixed Similarity Metric via Adaptive Lambda Network

To define the relationship between each pair of nodes $(i, j)$, we introduce a mixed similarity measure $S_{ij}$ that integrates (1) the Jaccard similarity $S_{ij}^{\mathrm{jac}}$ between their ground-truth multi-label vectors $y^{(i)}$ and $y^{(j)}$, reflecting label-level relationships, and (2) the cosine similarity $S_{ij}^{\cos}$ between their semantic embeddings $s^{(i)}$ and $s^{(j)}$, capturing semantic-level connections, as defined in Equation (6):

$$S_{ij} = \lambda_1 S_{ij}^{\mathrm{jac}} + \lambda_2 S_{ij}^{\cos} \tag{6}$$

The coefficients $\lambda_1$ and $\lambda_2$ are non-negative, dynamically predicted for each batch, and satisfy $\lambda_1 + \lambda_2 = 1$. These coefficients are determined by a neural network known as LambdaNet, conditioned upon batch-level statistics including mean cosine similarity, label entropy, and mean Jaccard similarity, as computed in Equation (7):

$$[\lambda_1, \lambda_2] = \mathrm{LambdaNet}(s^{(b)}) \tag{7}$$

where $s^{(b)}$ denotes the vector of batch-level statistics used to dynamically adjust the importance of semantic and label similarities according to the evolving data distribution.
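The following sketch illustrates Equations (6) and (7) under stated assumptions. The paper does not specify the architecture of LambdaNet, so `lambda_net` below is a hypothetical single linear layer followed by a softmax, chosen only because a softmax guarantees non-negative coefficients that sum to one; all function names are illustrative.

```python
import math


def jaccard_sim(y_i, y_j):
    """Jaccard similarity between two binary label vectors."""
    inter = sum(1 for a, b in zip(y_i, y_j) if a and b)
    union = sum(1 for a, b in zip(y_i, y_j) if a or b)
    return inter / union if union else 0.0


def cosine_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def lambda_net(stats, W, bias):
    """Hypothetical LambdaNet: linear layer + softmax -> [lambda_1, lambda_2]."""
    logits = [sum(w * s for w, s in zip(row, stats)) + b0 for row, b0 in zip(W, bias)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mixed_similarity(labels, embeddings, lam1, lam2):
    """Eq. (6): S_ij = lam1 * Jaccard(y_i, y_j) + lam2 * cosine(s_i, s_j)."""
    n = len(labels)
    return [
        [
            lam1 * jaccard_sim(labels[i], labels[j])
            + lam2 * cosine_sim(embeddings[i], embeddings[j])
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Note that the Jaccard term relies on ground-truth labels, which are available for training batches only; how the graph is built at inference time is not covered by this sketch.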

3.3.2. Adjacency Matrix Construction

Utilizing the calculated mixed similarity scores $S_{ij}$, we constructed the adjacency matrix $A$ through thresholding. The adjacency matrix is binary, where connections between nodes are defined based on whether the similarity exceeds a threshold $\tau$, according to Equation (8):

$$A_{ij} = \begin{cases} 1, & \text{if } S_{ij} > \tau \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

The threshold $\tau$ can be statically set or adaptively determined, for instance, by selecting the 75th percentile of $S_{ij}$ scores within the batch. This yields a sparse graph structure emphasizing significant inter-sample relationships while reducing noise.
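A minimal sketch of Equation (8) with the percentile-based threshold is given below. Two choices are assumptions made here for illustration rather than details from the paper: the percentile is taken over off-diagonal pairs only, and self-loops are excluded.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def build_adjacency(S, q=75.0):
    """Eq. (8): binary adjacency from a similarity matrix S.

    Returns (A, tau) where tau is the q-th percentile of the
    off-diagonal similarities (assumption: diagonal is ignored).
    """
    n = len(S)
    off_diag = [S[i][j] for i in range(n) for j in range(n) if i != j]
    tau = percentile(off_diag, q)
    A = [
        [1 if i != j and S[i][j] > tau else 0 for j in range(n)]
        for i in range(n)
    ]
    return A, tau
```

With a strict inequality and a high percentile, some nodes can end up with no neighbors at all, so any downstream aggregation step needs a defined behavior for isolated nodes.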

3.3.3. GraphSAGE Layer with Rich Aggregation

We utilized GraphSAGE layers [43] to propagate information across nodes based on the dynamically constructed graph. For each node $i$, the representation $h_i$ is updated by aggregating features from its neighbors, as detailed in Equation (9):

$$h_i' = \mathrm{LayerNorm}\big(\mathrm{ReLU}\big(W_{\mathrm{self}} h_i + W_{\mathrm{neigh}} \, \mathrm{AGG}(\{h_j : j \in N_i\})\big)\big) \tag{9}$$

where $N_i = \{j : A_{ij} = 1\}$ denotes the neighbors of node $i$ and $W_{\mathrm{self}}, W_{\mathrm{neigh}} \in \mathbb{R}^{d \times d}$ are learnable transformation matrices. Let $g^{(i)} = h_i'$ denote this final graph-enhanced embedding for sample $i$.

The aggregator function $\mathrm{AGG}$ is a composite operation that combines mean pooling, max pooling, and attention-based aggregation, capturing different aspects of the neighborhood information, as shown in Equation (10):

$$\mathrm{AGG}(\{h_j\}) = w_1 \cdot \mathrm{Mean}(\{h_j\}) + w_2 \cdot \mathrm{Max}(\{h_j\}) + w_3 \cdot \mathrm{AttnAgg}(\{h_j\}) \tag{10}$$

where the weights $w_1, w_2, w_3$ are learnable parameters normalized through a softmax operation and $\mathrm{AttnAgg}$ is an attention-based aggregation function, allowing the model to selectively emphasize the most informative neighbors for each node.

While the current node representations are derived from transformer-based sentence embeddings, the aggregation design allows flexible incorporation of richer structural features. In the broader graph learning literature, learned edge representations and self-supervised relational objectives have been shown to enhance graph expressiveness [41,42]. Such approaches augment static similarity-based adjacency with adaptive edge features, improving robustness and interpretability. In our framework, the composite aggregation in Equation (10) partially captures heterogeneous neighborhood signals through mean, max, and attention pooling. Nevertheless, future extensions could incorporate explicit edge embeddings or contrastive self-supervised objectives to further enrich relational modeling, particularly in low-resource or dialectally diverse Arabic settings.
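The update in Equations (9) and (10) can be sketched as follows. Several details are assumptions introduced for illustration because the text does not fix them: AttnAgg scores each neighbor by its dot product with the center node, isolated nodes receive a zero message, and LayerNorm is applied without learnable scale and shift.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def layer_norm(x, eps=1e-5):
    # Assumption: no learnable affine parameters.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def aggregate(h_i, neigh, mix_logits):
    """Eq. (10): softmax-weighted mix of mean, max and attention pooling."""
    d = len(h_i)
    if not neigh:  # assumption: isolated nodes receive a zero message
        return [0.0] * d
    mean = [sum(h[k] for h in neigh) / len(neigh) for k in range(d)]
    mx = [max(h[k] for h in neigh) for k in range(d)]
    # Assumption: attention scores are dot products with the center node.
    att = softmax([sum(a * b for a, b in zip(h_i, h)) for h in neigh])
    attn = [sum(w * h[k] for w, h in zip(att, neigh)) for k in range(d)]
    w1, w2, w3 = softmax(mix_logits)
    return [w1 * mean[k] + w2 * mx[k] + w3 * attn[k] for k in range(d)]


def graphsage_layer(H, A, W_self, W_neigh, mix_logits):
    """Eq. (9) applied to every node of the batch graph."""
    out = []
    for i, h_i in enumerate(H):
        neigh = [H[j] for j in range(len(H)) if A[i][j] == 1]
        msg = aggregate(h_i, neigh, mix_logits)
        pre = [s + n for s, n in zip(matvec(W_self, h_i), matvec(W_neigh, msg))]
        out.append(layer_norm([max(0.0, v) for v in pre]))  # ReLU, then LayerNorm
    return out
```

In this sketch `mix_logits` plays the role of the unnormalized learnable parameters behind $w_1, w_2, w_3$; equal logits give each pooling operator a weight of one third.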

3.4. Label Dependency Modeling with Transformer Encoders

Emotion labels in multi-label settings often exhibit strong interdependencies. For example, emotions such as sadness and pessimism frequently co-occur in social media text, while emotions like joy and trust may be positively correlated. Accurately modeling these correlations can substantially improve multi-label classification performance.

Our approach is inspired by and extends prior work on label attention mechanisms and label graph modeling in multi-label classification [25,26,27], and it enables the classifier to leverage learned label relationships for more accurate, coherent, and interpretable multi-label predictions. Unlike static correlation matrices used in previous GCN-based approaches [26], we employ transformer encoders for dynamic label dependency modeling and incorporate KL divergence alignment to ensure prediction consistency with empirical co-occurrence patterns.

To capture such dependencies, our model introduces a learnable label embedding matrix $L = [\ell_1, \dots, \ell_C] \in \mathbb{R}^{C \times d}$, where each $\ell_c$ represents an initial embedding vector for the $c$th emotion class, $C$ is the total number of classes, and $d$ is the embedding dimension. Unlike traditional multi-label classifiers, which treat labels independently, we explicitly refined these label embeddings by processing them through a stack of transformer encoder layers, formulated in Equation (11):

$$[\ell_1', \dots, \ell_C'] = \mathrm{Transformer}([\ell_1, \dots, \ell_C]) \tag{11}$$

where the output $L' = [\ell_1', \dots, \ell_C']$ contains contextually enriched embeddings for each label. Each encoder layer enables information flow among label vectors, facilitating the learning of complex and higher-order inter-label relationships.

These refined label embeddings are integrated into two key components of our system. First, they serve as trainable “queries” or attention vectors for label-aware pooling of text features, allowing the model to align specific semantic regions in the text with each emotion class. Second, they are used to compute a predicted label affinity (co-occurrence) matrix, which is then regularized using a graph-based loss.
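The first use, label-aware pooling, can be illustrated with a short sketch in which each refined label embedding acts as a query over the token representations. The scoring rule is not specified in the text, so scaled dot-product attention is assumed here, and the function name is hypothetical.

```python
import math


def label_aware_pool(label_emb, H):
    """Pool token vectors once per label, using the label embedding as query.

    label_emb: C refined label vectors (length d each)
    H: T token vectors (length d each)
    Returns C pooled vectors, one per emotion class.
    """
    d = len(H[0])
    pooled = []
    for q in label_emb:
        # Assumption: scaled dot-product score between label query and token.
        scores = [sum(a * b for a, b in zip(q, h)) / math.sqrt(d) for h in H]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        pooled.append(
            [sum((e / z) * h[k] for e, h in zip(exps, H)) for k in range(d)]
        )
    return pooled
```

Each label thus receives its own view of the sentence, concentrated on the tokens whose representations align most closely with that label's embedding.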

3.5. Feature Fusion and Classification

Once the model has obtained the transformer-derived span embedding $s^{(i)}$ for sample $i$ and the graph-enhanced embedding $g^{(i)}$ via dynamic GraphSAGE aggregation, these complementary feature representations are combined through a dedicated fusion module. Specifically, we concatenate the span and graph-based vectors for each sample and feed the result into a fully connected fusion layer, followed by a rectified linear unit (ReLU) activation and layer normalization to improve training stability, as shown in Equation (12):

$$z^{(i)} = \mathrm{LayerNorm}\big(\mathrm{ReLU}\big(W_f [s^{(i)}; g^{(i)}] + b_f\big)\big) \tag{12}$$

where $[\cdot; \cdot]$ denotes vector concatenation, $W_f$ and $b_f$ are learnable fusion weights and bias, and $z^{(i)}$ is the final fused representation for sample $i$.

To produce multi-label predictions, the fused representation is projected via a linear classifier followed by a sigmoid activation function, yielding predicted probabilities for each emotion class, as computed in Equation (13):

$$\hat{y}^{(i)} = \sigma\big(W_{\mathrm{cls}} z^{(i)} + b_{\mathrm{cls}}\big) \tag{13}$$

where $W_{\mathrm{cls}} \in \mathbb{R}^{C \times d_z}$ and $b_{\mathrm{cls}} \in \mathbb{R}^{C}$ are learnable classifier weights and bias, respectively, $d_z$ is the dimensionality of the fused feature, and $\sigma(\cdot)$ denotes the elementwise sigmoid function.
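Equations (12) and (13) amount to a small feed-forward head, sketched below for a single sample. The helper names are illustrative, and LayerNorm is again assumed to have no learnable scale or shift.

```python
import math


def linear(W, x, b):
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def layer_norm(x, eps=1e-5):
    # Assumption: no learnable affine parameters.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def sigmoid(v):
    # Numerically stable form for large negative inputs.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def fuse_and_classify(s, g, W_f, b_f, W_cls, b_cls):
    """Return the fused vector z (Eq. 12) and class probabilities (Eq. 13)."""
    concat = s + g  # list concatenation implements [s; g]
    z = layer_norm([max(0.0, v) for v in linear(W_f, concat, b_f)])
    y_hat = [sigmoid(v) for v in linear(W_cls, z, b_cls)]
    return z, y_hat
```

Because each class has its own sigmoid output, the probabilities are independent per label and need not sum to one, which is what permits several emotions to be predicted for the same text.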

3.6. Adaptive Hybrid Loss Function

To robustly optimize the model for multi-label emotion detection, we employed an adaptive hybrid loss that not only combines several complementary loss components but also dynamically tunes their relative importance throughout training. Our loss design integrates theoretically complementary objectives: BCE provides standard multi-label optimization, focal loss addresses class imbalance through confidence-based weighting, ranking loss improves multi-label calibration, and KL divergence ensures label correlation consistency. The adaptive weighting mechanism, driven by batch-level statistics, enables responsive optimization across imbalanced distributions, in contrast to the static loss combinations used in previous work [27]. The overall objective is formulated in Equation (14):

$$L = L_{\mathrm{BCE}} + \lambda_f L_{\mathrm{focal}} + \lambda_r L_{\mathrm{ranking}} + \lambda_g L_{\mathrm{graph}} \tag{14}$$

where $\lambda_f$, $\lambda_r$, and $\lambda_g$ are adaptive coefficients that balance the focus among the loss terms during training.

3.6.1. Adaptive Weight Computation

Unlike static coefficients, the weights $\lambda_f$, $\lambda_r$, and $\lambda_g$ are dynamically predicted for each batch, allowing the loss function to adapt its focus based on the evolving learning dynamics. Specifically, we employ a small neural network, $f_{\mathrm{adapt}}(\cdot)$ (implemented as a multi-layer perceptron), which receives as input a vector of batch-level summary statistics $s^{(b)}$ (such as the batchwise Jaccard score, average label entropy, or mean class frequencies) and outputs a set of adaptive loss weights, as given in Equation (15):

$$\lambda^{(b)} = f_{\mathrm{adapt}}(s^{(b)}; \theta_{\lambda}) \tag{15}$$

where $f_{\mathrm{adapt}}(\cdot)$ is parameterized by $\theta_{\lambda}$ and $\lambda^{(b)} = [\lambda_f, \lambda_r, \lambda_g]$ are the resulting batch-specific coefficients.
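A minimal sketch of Equations (14) and (15) follows. The layer sizes, the ReLU hidden activation, and the softplus output that keeps the coefficients positive are all assumptions made for illustration; the text specifies only that the controller is a multi-layer perceptron over batch statistics.

```python
import math


def softplus(v):
    """Numerically stable log(1 + exp(v)); always positive."""
    return math.log1p(math.exp(-abs(v))) + max(v, 0.0)


def f_adapt(stats, W1, b1, W2, b2):
    """Hypothetical two-layer MLP: batch statistics -> [lam_f, lam_r, lam_g]."""
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, stats)) + b)  # ReLU hidden layer
        for row, b in zip(W1, b1)
    ]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    # Assumption: softplus keeps every coefficient strictly positive.
    return [softplus(o) for o in out]


def hybrid_loss(l_bce, l_focal, l_rank, l_graph, lambdas):
    """Eq. (14): BCE plus adaptively weighted focal, ranking and graph terms."""
    lam_f, lam_r, lam_g = lambdas
    return l_bce + lam_f * l_focal + lam_r * l_rank + lam_g * l_graph
```

In this form the BCE term always enters with unit weight, so the three predicted coefficients express the importance of the auxiliary losses relative to BCE for the current batch.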

3.6.2. Loss Components

Binary Cross-Entropy Loss

The standard binary cross-entropy loss for multi-label classification computes the average negative log-likelihood over all N samples and C classes, as defined in Equation (16):

$$L_{\mathrm{BCE}} = -\frac{1}{NC} \sum_{i=1}^{N} \sum_{c=1}^{C} \left[ y_{ic} \log \hat{y}_{ic} + (1 - y_{ic}) \log (1 - \hat{y}_{ic}) \right] \tag{16}$$
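As a concrete reading of Equation (16), the sketch below averages the negative log-likelihood over every (sample, class) pair. The clipping constant `eps` is an implementation assumption added to avoid evaluating the logarithm at zero.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-7):
    """Eq. (16): mean binary cross-entropy over N samples and C classes."""
    n, c = len(y_true), len(y_true[0])
    total = 0.0
    for t_row, p_row in zip(y_true, y_prob):
        for t, p in zip(t_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / (n * c)
```

A maximally uncertain prediction of 0.5 for every label yields a loss of $\log 2 \approx 0.693$, which is a useful reference value when inspecting training curves.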

Focal Loss

To address the significant class imbalance present in Arabic multi-label emotion datasets, we incorporated focal loss. Unlike standard cross-entropy, the focal loss introduces a modulating factor $(1 - \hat{y}_{ic})^{\gamma}$ that down-weights the loss assigned to well-classified examples, as shown in Equation (17):

$$L_{\mathrm{focal}} = -\frac{1}{NC} \sum_{i=1}^{N} \sum_{c=1}^{C} \alpha_c \, (1 - \hat{y}_{ic})^{\gamma} \, y_{ic} \log \hat{y}_{ic} \tag{17}$$

where $\gamma > 0$ is the focusing parameter and $\alpha_c$ is a class-balancing weight.
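The effect of the modulating factor is easiest to see numerically. The sketch below implements Equation (17) exactly as written, that is, with the positive-label term only; the default value of `gamma` and the clipping constant are illustrative assumptions, since this section does not report the values used.

```python
import math


def focal_loss(y_true, y_prob, alpha, gamma=2.0, eps=1e-7):
    """Eq. (17): focal term over positive labels with per-class weights.

    alpha: list of C class-balancing weights
    gamma: focusing parameter (default is illustrative, not from the paper)
    """
    n, c = len(y_true), len(y_true[0])
    total = 0.0
    for t_row, p_row in zip(y_true, y_prob):
        for k, (t, p) in enumerate(zip(t_row, p_row)):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += alpha[k] * (1.0 - p) ** gamma * t * math.log(p)
    return -total / (n * c)
```

With `gamma = 0` and unit weights, the expression reduces to the positive half of the BCE loss; increasing `gamma` shrinks the contribution of confidently correct positives much faster than that of hard ones.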

Ranking Loss

To guide the model toward correctly prioritizing true positive emotions over irrelevant ones, we introduced a pairwise ranking loss, defined in Equation (18):

$$L_{\mathrm{ranking}} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P_i| \, |N_i|} \sum_{p \in P_i} \sum_{n \in N_i} \max\big(0, \, m - (\hat{y}_{ip} - \hat{y}_{in})\big) \tag{18}$$

where $P_i$ and $N_i$ denote the positive and negative label sets for instance $i$, respectively, and $m$ is a margin hyperparameter.
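Equation (18) can be read as a hinge penalty over every (positive, negative) label pair of a sample. In the sketch below, samples whose positive or negative set is empty contribute zero; that convention is an assumption, since the normalizer $|P_i||N_i|$ is undefined in that case.

```python
def ranking_loss(y_true, y_prob, margin):
    """Eq. (18): mean pairwise hinge loss between positive and negative labels."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_prob):
        pos = [p for t, p in zip(t_row, p_row) if t == 1]
        neg = [p for t, p in zip(t_row, p_row) if t == 0]
        if not pos or not neg:
            continue  # assumption: such samples contribute zero loss
        hinge = sum(max(0.0, margin - (p - n)) for p in pos for n in neg)
        total += hinge / (len(pos) * len(neg))
    return total / len(y_true)
```

Since the predicted probabilities lie in $[0, 1]$, the gap $\hat{y}_{ip} - \hat{y}_{in}$ can never exceed one, so a margin of one or more keeps every pair active unless the positive is scored exactly 1 and the negative exactly 0.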

Graph KL Loss

To explicitly model the dependencies and co-occurrence patterns among emotion labels, we incorporated a graph-based Kullback–Leibler (KL) divergence loss, formulated in Equation (19):

$$L_{\mathrm{graph}} = \mathrm{KL}(Q \,\|\, A) \tag{19}$$

where $Q \in \mathbb{R}^{C \times C}$ represents the model’s predicted label affinity matrix computed from the learned label embeddings, and $A \in \mathbb{R}^{C \times C}$ denotes the empirical co-occurrence matrix of the ground-truth labels in the current batch.

While aligning predicted affinities with empirical co-occurrence enhances structural consistency, excessive reliance on dataset-level correlations may amplify annotation biases or spurious co-occurrence artifacts. To mitigate this risk, correlation alignment is softly regularized through a weighted KL term rather than hard constraints, allowing the model to deviate when semantic evidence contradicts empirical priors.
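One way to make Equation (19) concrete is sketched below. Several details are assumptions, because the text does not specify them: both matrices are row-normalized into distributions, the affinities come from a row-wise softmax over label-embedding dot products, the co-occurrence counts are additively smoothed, and the divergence is averaged over rows.

```python
import numpy as np

def cooccurrence(y_true, smooth=1e-3):
    """Row-normalised empirical label co-occurrence A (smoothing is an assumption)."""
    counts = y_true.T @ y_true + smooth
    return counts / counts.sum(axis=1, keepdims=True)

def label_affinity(label_emb, temperature=1.0):
    """Row-wise softmax over embedding dot products, giving Q (assumed construction)."""
    logits = label_emb @ label_emb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)

def graph_kl_loss(Q, A, eps=1e-12):
    """KL(Q || A), averaged over label rows."""
    return float(np.sum(Q * (np.log(Q + eps) - np.log(A + eps))) / Q.shape[0])

rng = np.random.default_rng(0)
y_batch = rng.integers(0, 2, size=(8, 4)).astype(float)  # toy batch: 8 samples, 4 labels
A = cooccurrence(y_batch)
Q = label_affinity(rng.normal(size=(4, 16)))
loss = graph_kl_loss(Q, A)
```

Because the term enters the total objective with a weight, the model can still depart from `A` when the text evidence disagrees, which matches the soft regularization described above.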

3.6.3. Exponential Moving Average Stabilization

To improve training stability and generalization, we maintained an exponential moving average (EMA) of model parameters during optimization. At each training step t, the EMA parameters $θ_{ema}$ are updated as follows in Equation (20):

θ_{ema}^{(t)} = α θ_{ema}^{(t - 1)} + (1 - α) θ^{(t)},

(20)

where $α$ is the decay rate and $θ^{(t)}$ denotes the current model parameters. During evaluation, predictions are computed using the EMA weights. This strategy reduces variance induced by batch-level fluctuations and is particularly beneficial under severe class imbalance.
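The update in Equation (20) amounts to one line per parameter tensor. The sketch below stores parameters in a plain dictionary; the function name `ema_update` and the decay value in the example are illustrative assumptions.

```python
import numpy as np

def ema_update(ema_params, params, alpha):
    """Apply Equation (20) to every named parameter and return the new EMA copy."""
    return {
        name: alpha * ema_params[name] + (1.0 - alpha) * params[name]
        for name in params
    }

ema = {"w": np.array([0.0, 0.0])}
current = {"w": np.array([1.0, 2.0])}
ema = ema_update(ema, current, alpha=0.9)  # alpha chosen for illustration only
```

With a decay close to 1, the averaged weights move slowly, which is what damps the batch-level fluctuations mentioned above.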

3.7. Inference and Thresholding

During inference, the trained model produces, for each input sample, a probability vector $\hat{y} \in [0, 1]^{C}$ corresponding to the likelihood of each emotion class. Unlike single-label classification, multi-label classification requires converting these continuous outputs into a binary vector. Due to class imbalance and the varying label difficulty, we employed class-specific thresholds, denoted by $τ_{c}^{*}$, which were selected to maximize the multi-label validation metric for each class c. The final binary prediction for each sample i and class c is then given by Equation (21):

\hat{y}_{i c}^{final} = \begin{cases} 1, & \text{if } \hat{y}_{i c} \geq τ_{c}^{*} \\ 0, & \text{otherwise} \end{cases}

(21)

where $τ_{c}^{*}$ is the optimized threshold for class c, typically determined by a grid search on the validation set. The complete end-to-end training procedure for the proposed GTHL-Emo framework, integrating the dynamic graph construction, label refinement, and adaptive hybrid loss optimization, is summarized in Algorithm 1.
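The per-class grid search and the decision rule of Equation (21) can be sketched as follows. The selection metric (per-class F1) and the grid from 0.05 to 0.95 are assumptions of this sketch; the text only states that thresholds maximize a validation metric.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for one class; returns 0.0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def select_thresholds(y_val, p_val, grid=None):
    """Pick, for each class, the grid value with the best validation F1 (assumed metric)."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    thresholds = np.empty(y_val.shape[1])
    for c in range(y_val.shape[1]):
        scores = [f1_binary(y_val[:, c], (p_val[:, c] >= t).astype(int)) for t in grid]
        thresholds[c] = grid[int(np.argmax(scores))]  # first best value on ties
    return thresholds

def apply_thresholds(p, thresholds):
    """Equation (21): predict 1 where the probability reaches the class threshold."""
    return (p >= thresholds[None, :]).astype(int)

y_val = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
p_val = np.array([[0.80, 0.11], [0.32, 0.37], [0.70, 0.21], [0.22, 0.16]])
tau = select_thresholds(y_val, p_val)
y_hat = apply_thresholds(p_val, tau)
```

Thresholds are fitted on validation data only and then frozen for the test set, so no test labels influence the decision rule.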

Algorithm 1: Training Algorithm for GTHL-Emo with adaptive hybrid loss.

Input: Training set $D_{train} = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, validation set $D_{val}$, maximum epochs E, patience P, batch size B, learning rate $η$

Output: Trained parameters $θ^{*}$

Initialize model parameters $θ$ and optimizer;

Initialize best validation Jaccard $J^{*} \leftarrow 0$, patience counter $c \leftarrow 0$

4. Experiment and Result Analysis

To comprehensively evaluate the effectiveness of our proposed GTHL-Emo framework for Arabic multi-label emotion classification, we conducted experiments on three publicly available, human-annotated Arabic emotion datasets. (1) The SemEval-2018-Ec-Ar dataset [29] contains 4381 Arabic tweets annotated with 11 Ekman-inspired emotions [44] (anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise, and trust), providing the first major multi-label benchmark for Arabic emotion detection in morphologically rich social media text. (2) The ExaAEC corpus [39] contains 20,050 Arabic tweets annotated with 10 Plutchik-based emotions [45] (anger, anticipation, disgust, fear, joy, love, sadness, surprise, trust, and other), where Sarbazi et al. achieved a 65.7% mean F1 score using a BiLSTM + ELMo architecture. We utilized this dataset as a scalability benchmark to validate our graph-enhanced transformer architecture against contemporary transformer baselines, enabling assessment of architectural advancement from Recurrent Neural Network (RNN)-based to transformer + GraphSAGE approaches, training methodology impact through our adaptive hybrid loss weighting, and performance consistency across larger dataset scales. Our data splits maintained the multi-label structure while ensuring consistent preprocessing across our experimental framework. (3) The SemEval-2025 Shared Task 11 dataset [46] represents the latest multi-label emotion benchmark with 1903 Algerian Arabic instances across six emotions (anger, disgust, fear, joy, sadness, and surprise), featuring perceived emotion annotations with intensity levels by fluent native speakers for multilingual emotion detection research. We focused solely on the Arabic (Arq) portion, specifically Track A for the multi-label emotion detection task in our study. Table 1 summarizes the data splits for the three datasets: training, validation, and testing.

These datasets provide comprehensive evaluation across different Arabic varieties (Modern Standard Arabic in SemEval-2018-Ec-Ar and ExaAEC and Algerian Arabic in SemEval-2025-Arq), emotion taxonomies (Ekman vs. Plutchik vs. Basic Six), dataset scales (1900–20,000 samples), and annotation approaches (direct labeling vs. perceived emotions), ensuring robust generalizability assessment of our GTHL-Emo framework. The inclusion of ExaAEC as a large-scale benchmark enabled thorough validation of architectural innovations beyond traditional baselines, while SemEval competitions provided standardized evaluation protocols for fair comparison with established state-of-the-art approaches.

4.1. Implementation Details

4.1.1. Data Preparation

All datasets were preprocessed following a unified pipeline to ensure comparability across sources. Each sample underwent standard text cleaning, emoji normalization, and removal of URLs, hashtags, and special characters. We mapped common Arabic emojis to their corresponding emotion words and tokenized using the MARBERT tokenizer. For each dataset, we removed samples with fewer than three valid tokens or missing labels and ensured that all target emotion columns were binary and consistently ordered. For the training, validation, and testing splits, we preserved the original data splits where available (as in SemEval-2018-Ec-Ar and SemEval-2025-Arq) and held out 10% of the training data as a validation set where only training and test sets were provided (as in ExaAEC).

4.1.2. Model Configuration and Training

We used the MARBERT model as our text encoder, with token embeddings of dimension 768. Span representations were computed with a learnable self-attention mechanism, followed by a stack of GraphSAGE layers to model the inter-sample relationships within each batch. The label dependency module was realized with two transformer encoder layers, each with eight attention heads, applied to learnable label embeddings. Feature fusion, final classification, and all loss terms were implemented as described in Section 3.6. For optimization, we used AdamW with a learning rate of $1 \times 10^{- 5}$, weight decay of $0.02$, and batch size of 2 (dataset-specific deviations are listed in Section 4.2). Early stopping with a patience of 7 was used, and the best model checkpoint was selected based on the highest validation Jaccard score. Each training run was performed for up to 20 epochs, though training often ended sooner due to early stopping. The dropout was set to 0.1 for regularization. For all models, we used mixed precision training (torch.amp) for memory and speed efficiency.

4.2. Hyperparameter Settings

To ensure robust and optimal performance, a comprehensive grid search was conducted on the validation sets of each dataset to fine-tune the hyperparameters. The final selected configurations, representing the best performance achieved by GTHL-Emo, are summarized in Table 2.

Most hyperparameters were found to be consistent and optimal across all three datasets, highlighting the generalizability of our architectural and optimization choices. These uniform settings included the embedding dimension ($d = 768$), max sequence length (128), weight decay ($0.02$), GraphSAGE layers (2), GraphSAGE neighbors (5), label transformer layers (2), label transformer heads (8), margin (ranking loss) ($0.5$), graph similarity temperature ($0.07$), EMA alpha (loss weights) ($0.95$), and Top-k spans (attention) (5). The core coefficients of the adaptive hybrid loss, namely the focal loss $λ$ ($0.1$), graph loss $λ$ ($0.1$), and ranking loss $λ$ ($0.1$), were also set uniformly. The adaptive nature of the focal loss $γ$ and $α_{c}$ was maintained across the datasets.

However, certain parameters were fine-tuned per dataset to account for their distinct characteristics and scales:

Learning Rate: A learning rate of $1 \times 10^{- 5}$ was selected for SemEval-2018-Ec-Ar and SemEval-2025-Arq. For the larger ExaAEC dataset, a slightly higher learning rate of $2 \times 10^{- 5}$ performed better.
Batch Size: A batch size of 2 was chosen for SemEval-2018-Ec-Ar and SemEval-2025-Arq due to their relatively smaller dataset sizes, allowing for more frequent gradient updates. For the larger ExaAEC dataset, a batch size of 8 provided a better balance between memory efficiency and training stability.
Dropout Rate: This regularization parameter was adjusted from $0.1$ for SemEval-2018-Ec-Ar to $0.3$ for ExaAEC and SemEval-2025-Arq, reflecting different levels of model complexity and data volume.
Epochs and Early Stopping: Training was conducted for a maximum of 20 epochs across all datasets. To efficiently conserve computational resources and prevent overfitting, early stopping with a patience of 7 was consistently applied based on a validation Jaccard score [47,48]. On SemEval-2018-Ec-Ar, training typically concluded at around 13 epochs; for ExaAEC, it was around 18 epochs; and for SemEval-2025-Arq, it was around 15 epochs.

All models were trained using the AdamW [49] optimizer. Gradient clipping with a threshold of $1.0$ was used to prevent exploding gradients [50].

4.2.1. Evaluation Metrics

To comprehensively assess the performance of our multi-label emotion classification models, we followed the evaluation protocols established in SemEval 2018 Task 1 E-C [29], as well as more recent studies [39]. Specifically, we reported the sample-based Jaccard score (also known as the Jaccard accuracy or intersection over union), which measures the average overlap between the predicted and ground-truth label sets for each test instance. In addition, we computed standard multi-label metrics, including the precision, recall, and F1 scores, which were evaluated in micro-, macro-, and sample-averaged formats. Micro-averaged metrics aggregate predictions and true labels across all classes before computing scores, favoring performance on frequent classes. Macro-averaged metrics, in contrast, compute scores independently for each class and then average across classes, ensuring that all emotion labels are weighted equally regardless of prevalence. We further reported the class-wise precision, recall, and F1 score for each emotion label, providing a detailed breakdown of model performance across the label space. Finally, we included the mean average precision (MAP) to reflect the ranking quality of the predicted scores for the multi-label outputs. All metrics were evaluated on the held-out test set, with class-specific optimal thresholds selected on the validation set as described in Section 3.7. This comprehensive evaluation framework ensured a robust and nuanced assessment of MLED performance.
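The difference between the sample-based Jaccard score and the micro- and macro-averaged F1 scores is easiest to see on a toy example. The following NumPy sketch is for illustration and is not the evaluation code used in the experiments; the convention that an instance with empty true and predicted label sets scores 1.0 is an assumption.

```python
import numpy as np

def jaccard_samples(y_true, y_pred):
    """Mean per-instance |intersection| / |union| of the label sets."""
    inter = np.sum((y_true == 1) & (y_pred == 1), axis=1)
    union = np.sum((y_true == 1) | (y_pred == 1), axis=1)
    # Assumption: two empty label sets count as a perfect match.
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0).mean()

def micro_macro_f1(y_true, y_pred):
    """Return (micro-F1, macro-F1) from per-class confusion counts."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0).astype(float)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0).astype(float)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0).astype(float)
    micro_denom = 2 * tp.sum() + fp.sum() + fn.sum()
    micro = 2 * tp.sum() / micro_denom if micro_denom > 0 else 0.0  # pool counts, then score
    class_denom = 2 * tp + fp + fn
    per_class = np.where(class_denom > 0, 2 * tp / np.maximum(class_denom, 1), 0.0)
    return micro, per_class.mean()                                   # score per class, then average

y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
jac = jaccard_samples(y_true, y_pred)
micro, macro = micro_macro_f1(y_true, y_pred)
```

In this example each instance has one correct label out of a union of two, so the Jaccard score is 0.5, while both F1 averages equal 2/3.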

4.2.2. Hardware and Software

All experiments were conducted on Google Colab Pro with an NVIDIA A100 GPU (80 GB of memory) running PyTorch 2.8.0+cu126, Transformers 4.55.2, and Scikit-learn 1.6.1.

4.3. Baselines

We evaluated our proposed GTHL-Emo model on three evaluation benchmarks, comparing it to both strong transformer-based models and existing baselines to verify its efficacy.

4.3.1. SemEval-2018-Ec-Ar

For this dataset, we included official competition baselines and state-of-the-art methods from the literature:

Random [29]: A naïve baseline that assigns emotion labels to tweets by random sampling, serving as a lower-bound reference for multi-label performance.
MEDIAN Team [29]: Utilizes a median-value heuristic where for each emotion, it predicts the median label frequency observed in the training data, acting as an official shared task baseline.
SVM-Unigrams [29]: Trains a one-versus-rest Support Vector Machine (SVM) on Term Frequency-Inverse Document Frequency (TF-IDF) unigram features (no additional embeddings or lexicons), establishing a strong lexical baseline for multi-label emotion classification.
UNCC Team [32]: Employs a fully connected neural network that fuses pretrained Word2Vec and Doc2Vec embeddings with psycholinguistic features (e.g., affective lexicons), using the same architecture across all subtasks for English and Arabic.
Tw-StAR Team [31]: Implements a binary relevance scheme with TF-IDF-based features for each emotion label and standard machine learning classifiers (e.g., SVM), placing an emphasis on careful preprocessing for Arabic, English, and Spanish tweets.
PARTNA Team [29]: Combines lexical resources (emotion lexicons and character-level features) with distributional (embedding-based) features, feeding them into logistic regression-based classifiers in a multi-label set-up.
EMA Team [30]: Applies extensive Arabic-specific preprocessing (diacritics removal, elongation normalization, emoji→word replacement, and light stemming), followed by the extraction of word embedding and handcrafted features. These features were then fed into various classification and regression models for each subtask.
Khalil et al. [34]: Uses AraVec (Word2Vec embeddings trained on Arabic corpora) as input to a BiLSTM network with attention to capture contextual and sequential patterns, achieving robust gains on SemEval-2018-Ec-Ar.
Alswaidan and Menai [35]: Introduces a low-resource Arabic emotion recognition framework that stacks diverse classifiers (e.g., SVM or Random Forest (RF)) with self-training on unlabeled tweets, leveraging hybrid lexical and distributional features to augment scarce annotated data.
Samy et al. [33]: Proposes GRU and C-GRU models to extract contextual information from Arabic tweets; this information is then used as an extra layer to capture the emotional states expressed in the input tweet.
Elfaik et al. [5]: Proposes combining AraBERT for contextual embeddings with an attention-based LSTM-Bi-LSTM model to classify emotions in a given Arabic tweet.
Mansy et al. [51]: Proposes an ensemble of deep learning architectures, combining MARBERT, BiLSTM, and Bi-GRU for emotion detection in Arabic tweets, using a weighted sum equation to aggregate predictions from the three models.
Elfaik et al. [36]: Employs feature-level fusion of contextual embeddings (e.g., AraBERT or ArabicBERT) and attentional CNNs to jointly capture global and local cues in Arabic tweets, demonstrating improved generalization on SemEval-2018-Ec-Ar.

4.3.2. ExaAEC Dataset

The ExaAEC corpus [39] represents a significant contribution to Arabic multi-label emotion classification, where Sarbazi et al. achieved a 65.7% mean F1 score using the BiLSTM + ELMo architecture. Rather than direct comparison with their specific implementation, our evaluation focused on benchmarking GTHL-Emo against multiple transformer baselines to demonstrate the effectiveness of our graph-enhanced architecture and adaptive loss weighting mechanism. This approach enabled comprehensive assessment of architectural advancement from RNN-based to transformer + GraphSAGE approaches, training methodology impact through our adaptive hybrid loss weighting, and scalability validation across different dataset scales. We evaluated popular Arabic transformer models pretrained on relevant corpora:

asafaya-BERT [52]: A general-purpose Arabic transformer trained on OSCAR and other web-scale corpora.
Qarib-BERT [53]: A BERT variant adapted for Arabic news and Wikipedia content.
MARBERT [54]: A Twitter-centric transformer model designed for Arabic dialects and informal language.
AraBERTv0.2 [13]: A robust Arabic language model trained on a mix of formal and social media text.

These transformer models provide strong comparative baselines for evaluating our unified framework’s architectural innovations beyond traditional RNN approaches.

4.3.3. SemEval-2025 Track A Arabic (Arq) Dataset

For this dataset, we chose to highlight the approaches of the teams below, as their work directly used transformer models and various loss functions to address critical challenges in multi-label emotion detection, such as data and class imbalance, on the dialectal Arabic (Arq) dataset of Track A. While some of these baseline systems approached the multi-label task by treating emotions as independent binary classification problems, their foundational work with transformer architectures and innovative loss function applications provides a crucial comparative context for our research. Our proposed methodology also centers on leveraging transformer models and custom loss functions to address these challenges, including the explicit modeling of label correlations. Therefore, this comparison excluded teams whose primary contributions involved large language models (LLMs), prompt engineering, or zero-shot learning, as their core methodologies diverge from the specific technical scope of our current study:

HTU Team [37]: Proposed a Label-Fused Iterative Mask Filling (L-IMF) technique and implemented six DziriBERT-based classifiers for multi-label emotion classification in the Arabic dialect (Arq). Their approach addressed class imbalance and label dependencies.
INFOTEC-NLP Team [55]: Proposed a hybrid model combining XLM-RoBERTa embeddings with Bi-LSTM and multi-head attention for multi-label emotion classification in the same Arabic dialect (Arq). This approach captures sequential dependencies and addresses class imbalance.
LATE-GIL-NLP Team [38]: Proposed fine-tuning mDeBERTa-v3-base with optimized loss combinations for multi-label emotion classification in the same dataset (Arq). This approach focuses on handling imbalanced data without augmentation.
YNU-HPCC Team [56]: Proposed translating the dataset to English and using DeBERTa with modified prediction heads for multi-label emotion classification in the same dataset. This approach addresses class imbalance and biases toward dominant emotions.

4.4. Main Results

4.4.1. Baseline Comparison

SemEval-2018-Ec-Ar

Table 3 compares the performance of GTHL-Emo with a broad set of established baselines and recent state-of-the-art models on SemEval-2018-Ec-Ar, which remains the most widely used benchmark for Arabic multi-label emotion classification. The included baselines cover a spectrum from traditional statistical models, such as SVM-Unigrams and random baselines, to top-performing competition systems (e.g., MEDIAN, EMA, UNCC, PARTNA, and Tw-StAR) and more recent neural network-based approaches. In addition, several recent works in the literature [5,30,33,34,35,36,51] provide a strong point of reference for comparison. We evaluated each system using the sample-based Jaccard accuracy and, if available, micro- and macro-F1 scores. Our GTHL-Emo model achieved a micro-F1 score of 71.02%, exceeding the highest reported baseline. While the Jaccard accuracy (58.70%) and macro-F1 (60.48%) scores were marginally lower than the strongest reported baseline for each metric, the results are nonetheless competitive and demonstrate the model’s ability to generalize across diverse label distributions and challenging data. The increase in micro-F1 score is particularly relevant for multi-label settings, as it reflects robust overall label assignment, especially for the most frequent emotions. This micro-F1 result highlights the effectiveness of our approach in leveraging both global contextual information and graph-based relational reasoning, in addition to dynamically balancing loss components to account for class imbalance and label dependencies. Collectively, these results indicate that GTHL-Emo is competitive with or ahead of the state of the art for Arabic multi-label emotion detection and support the core contributions of our proposed methodology.

Table 4 presents a detailed comparison of the precision, recall, and F1 scores for each emotion class between our proposed GTHL-Emo model and the approach of Elfaik et al. [36] on the SemEval-2018-Ec-Ar dataset. The results clearly demonstrate that GTHL-Emo achieved superior performance across nearly all emotion categories, often with substantial margins. For challenging emotions such as anger, disgust, fear, love, pessimism, sadness, surprise, and trust, our model consistently delivered higher precision, recall, and F1 scores compared with the baseline, reflecting both improved sensitivity and reliability in multi-label prediction. Notably, for rare or difficult-to-predict classes such as pessimism, surprise, and trust, GTHL-Emo demonstrated meaningful improvements in the F1 score, suggesting that the integration of adaptive loss weighting and explicit modeling of label dependencies is effective at overcoming class imbalance and capturing nuanced affective states. While the approach of Elfaik et al. [36] outperformed our model for anticipation, our model provided generally stronger and more balanced performance across the majority of emotion categories, underscoring its robustness and generalizability for MLED in complex real-world Arabic texts.

ExaAEC

Table 5 reports a comprehensive evaluation of GTHL-Emo against a suite of strong transformer-based models on the larger ExaAEC dataset of about 20,000 tweets. The baselines included those of Sarbazi et al. [39], asafaya-BERT, Qarib-BERT, MARBERT, and AraBERTv0.2, as well as fine-tuned MARBERT. Performance was assessed using the Jaccard accuracy, micro-F1 score, and macro-F1 score, allowing for a holistic comparison across both overall and class-level performance. Our proposed GTHL-Emo consistently surpassed all baselines, establishing new state-of-the-art results for Arabic MLED. It achieved a 65.99% Jaccard accuracy, 70.72% micro-F1 score, and 68.71% macro-F1 score, improving over the strongest baseline by +5.34%, +5.98%, and +7.84%, respectively. These improvements underscore the impact of incorporating both transformer-based semantic encoding and dynamic graph neural reasoning within a unified framework. The hybrid loss weighting mechanism further enhanced learning for minority emotions and complex label correlations, leading to more balanced and accurate multi-label predictions. The consistent outperformance of GTHL-Emo on this benchmark demonstrates its capacity to address the key challenges of class imbalance, linguistic diversity, and intricate emotional expression in Arabic, thereby setting a new performance bar for the field.

Table 6 presents the class-wise F1 score comparison between GTHL-Emo and strong pretrained Arabic baselines. The proposed model achieved the best performance in the majority of the emotion categories, with particularly notable gains for anger, disgust, fear, love, and sadness. The largest improvement was observed for love (+8.89 over MARBERT-Finetuned), indicating that explicit label-correlation modeling and structured reasoning are especially beneficial for emotions that frequently co-occur with others. Similarly, improvements in fear and disgust suggest that graph-based relational modeling helps disentangle overlapping affective signals. While some competitive baselines achieved strong results in isolated categories (e.g., neutral and surprise), GTHL-Emo demonstrated more balanced performance across classes, particularly for minority or structurally dependent emotions. Overall, these results confirm that incorporating graph reasoning and correlation alignment enhances fine-grained emotion discrimination beyond standard transformer fine-tuning.

SemEval-2025 Track A Arabic (Arq)

Table 7 presents GTHL-Emo’s performance on the latest SemEval-2025 Track A Arabic (Arq) dataset, compared against recent competition systems that utilized transformer architectures and advanced loss functions. Our model achieved a 56.69% macro-F1 score, a relative improvement of 9.65% (4.99 percentage points) over the strongest baseline (INFOTEC-NLP Team: 51.7%). This performance gain is particularly notable given the challenging characteristics of this dataset: (1) dialectal complexity (Algerian Arabic presents unique morphological and lexical challenges compared to Modern Standard Arabic); (2) limited training data (with only 901 training samples, the dataset requires robust generalization capabilities); and (3) class imbalance (the six-emotion taxonomy exhibits significant label frequency variations). We also reported the Jaccard accuracy and micro-F1 scores to provide a comprehensive evaluation of multi-label predictive consistency. While the official macro-F1 metric emphasizes minority class performance, the inclusion of the micro-F1 score demonstrates global robustness across frequent labels, and the Jaccard accuracy confirms the model’s capability to correctly identify overlapping emotion subsets at the instance level. Additional results for this dataset are presented in Table 8.

The superior performance of GTHL-Emo over recent competition systems validates several key contributions. For graph-enhanced learning, unlike HTU’s binary relevance approach with L-IMF augmentation or LATE-GIL-NLP’s simple loss combinations, our GraphSAGE integration captures inter-sample relationships crucial for low-resource settings. Regarding adaptive loss weighting, while INFOTEC-NLP combines XLM-RoBERTa with BiLSTM and attention, our adaptive hybrid loss dynamically adjusts to batch-level statistics, providing more nuanced optimization than static loss combinations used by competing teams. For label correlation modeling, our explicit transformer-based label dependency modeling surpasses the independent binary classification approaches employed by most competition systems, enabling better handling of emotion co-occurrence patterns in dialectal Arabic text.

The consistent improvements across all three datasets (SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq), spanning different Arabic varieties, emotion taxonomies, and data scales, demonstrate the robustness and generalizability of our GTHL-Emo framework for Arabic MLED.

4.4.2. Component-Wise Ablation Study

Table 8 presents a component-wise ablation of GTHL-Emo across three benchmarks, revealing that each module contributed differently depending on the dataset characteristics. On SemEval-2018-Ec-Ar, removing GraphSAGE or the label transformer led to noticeable drops in the macro-F1 score and Jaccard accuracy, confirming the importance of structured instance-level and label-level reasoning for modeling emotion co-occurrence. On ExaAEC, the full model achieved the strongest overall performance, and the removal of correlation alignment or imbalance handling slightly reduced the macro-F1 score, indicating that both structural and class-distribution modeling are beneficial in moderately imbalanced settings. In contrast, on SemEval-2025-Arq, disabling adaptive loss weighting resulted in the largest degradation, suggesting that dynamic balancing is particularly important under higher label sparsity and domain variability. Overall, while the full model consistently achieved the best or near-best micro-F1 scores across the datasets, the relative contribution of each component varied, highlighting the dataset-dependent trade-offs between structural correlation modeling, ranking supervision, and robustness to imbalance and noise.

4.4.3. Sensitivity and Stability Analysis of Adaptive Weights

Graph Construction Sensitivity

Table 9 evaluates the impact of different similarity mechanisms used to construct the batch-level graph, highlighting the role of structured instance connectivity in GTHL-Emo. Across the datasets, cosine-based similarity provided stable and competitive performance, reflecting the effectiveness of representation-driven relational modeling. The Jaccard (train-only) variant yielded comparable results on EC but showed mild degradation on ExaAEC and Arq, suggesting that label-informed similarity may not always generalize optimally beyond training distributions, even when replaced by cosine similarity at inference to avoid leakage. The k-Nearest Neighbors (kNN) graph exhibited slightly lower robustness on Arq, while the random graph consistently produced weaker macro-F1 scores and higher Hamming loss values, especially under domain variability, confirming that a meaningful similarity structure is essential for reliable relational reasoning. Overall, the full model, which adaptively integrates similarity information, achieved the best balance between the micro-F1 score, macro-F1 score, and stability, demonstrating that structured graph construction enhances robustness beyond naive or random connectivity schemes.
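To make the cosine-based construction concrete, the sketch below connects each instance in a batch to its most similar neighbors by cosine similarity. It is an illustrative reading, not the construction used in the experiments: the top-k rule, the exclusion of self-loops, and the function name `cosine_knn_graph` are assumptions, with `k=5` mirroring the "GraphSAGE neighbors" setting reported earlier.

```python
import numpy as np

def cosine_knn_graph(features, k=5):
    """Return, for each instance, the indices of its k most cosine-similar batch neighbours."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)      # guard against zero vectors
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)                   # assumption: no self-loops
    k = min(k, features.shape[0] - 1)                # a batch of n has at most n - 1 neighbours
    return np.argsort(-sim, axis=1)[:, :k]

# Toy batch of four 2-D representations forming two obvious pairs.
batch = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbors = cosine_knn_graph(batch, k=1)
```

Because the neighbors depend only on the encoder representations, the same rule can be applied at inference without access to labels, unlike a label-informed Jaccard graph.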

Loss Component Configuration Analysis

In addition, we performed a focused analysis of the effect of varying loss weight configurations in the hybrid objective. Table 10 summarizes the results of different combinations of focal, graph, and ranking loss coefficients ($λ$), as well as the exponential moving average ($α$) for the optimizer, across the three benchmark datasets. The results reveal a clear trend: balanced, moderately sized loss weights ($λ_{focal} = λ_{graph} = λ_{rank} = 0.10$, $α = 0.95$) yielded the best performance for all key metrics, with GTHL-Emo achieving a 58.70% Jaccard accuracy and 71.02% micro-F1 score on SemEval-2018-Ec-Ar, with similar improvements on ExaAEC and SemEval-2025-Arq. Both lower and imbalanced loss weight settings resulted in decreased performance, particularly in terms of the macro-F1 score and Jaccard accuracy, indicating the importance of simultaneously optimizing for label imbalance, relational regularization, and ranking objectives. These findings demonstrate the effectiveness of our adaptive hybrid loss, which dynamically balances multiple learning objectives for robust multi-label classification in highly imbalanced and correlated emotion datasets.

Together, these ablation studies establish the necessity of each GTHL-Emo component and the importance of carefully calibrated loss weighting, providing empirical evidence for the effectiveness and design choices of our proposed architecture.

The loss configuration comparison (Figure 2) analyzes the contribution of individual objective components to the overall SemEval-2018-Ec-Ar validation performance. Removing complementary terms (e.g., focal-only, BCE-only, or ranking-only configurations) led to noticeable degradations across the Jaccard accuracy, micro-F1 score, and macro-F1 score, confirming that no single loss component was sufficient to model imbalance and structured label dependencies simultaneously. In particular, configurations that excluded correlation alignment or ranking regularization exhibited reduced macro-level consistency, suggesting weaker performance on minority or co-occurring emotion classes. The full objective, which integrated imbalance-aware learning, structural alignment, and ranking constraints, achieved the most balanced performance across all metrics. These results justify the architectural complexity of the hybrid loss design and demonstrate that each component contributes non-trivially to final model robustness.

Coefficient Sensitivity and Stability

Figure 3 presents a sensitivity analysis over a wide range of fixed coefficient values for the focal, graph-based correlation, and ranking terms. The curves exhibit smooth performance variations without abrupt instabilities, indicating that the training process was well conditioned and not overly sensitive to precise hyperparameter tuning. While moderate peaks can be observed at specific coefficient values, performance remained relatively stable across broad intervals, supporting the claim that the adaptive weighting mechanism operated within meaningful regimes rather than exploiting narrow hyperparameter optima. This analysis provides empirical evidence that the proposed hybrid objective is stable, generalizable across configurations, and robust to reasonable variations in coefficient selection.

Adaptive Weight Dynamics

Figure 4 illustrates the evolution of the adaptive loss coefficients during training. We observed a stable and progressively calibrated adjustment of the three components. The focal loss weight $\lambda_{focal}$ decreased slightly from its initial value and then stabilized, indicating that class imbalance correction remained important but did not dominate optimization. In contrast, the correlation alignment weight $\lambda_{corr}$ exhibited a steady downward trend, suggesting that as training progressed and label representations became better aligned, the model gradually reduced its reliance on the graph-based regularization term. The ranking loss weight $\lambda_{rank}$ remained comparatively high and stable throughout training, highlighting the persistent importance of label ordering and relative score separation in the multi-label setting. Importantly, no abrupt oscillations or instability were observed, confirming that the exponential moving average mechanism effectively smoothed the adaptive updates. Overall, the smooth trajectories demonstrate that the adaptive weighting scheme converged to a balanced configuration rather than overfitting to any single objective component.
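The damping effect of an exponential moving average on batch-level coefficient updates can be illustrated with a short sketch. The momentum value, the initial weight, and the oscillating targets below are hypothetical and are not taken from the GTHL-Emo controller; the sketch only shows why an EMA suppresses abrupt per-batch swings.

```python
def ema_update(previous, target, momentum=0.9):
    """Blend the previous coefficient with a new batch-level target."""
    return momentum * previous + (1.0 - momentum) * target

def smooth_trajectory(targets, initial, momentum=0.9):
    """Apply the EMA update once per batch and record the coefficient after each step."""
    weights = []
    current = initial
    for target in targets:
        current = ema_update(current, target, momentum)
        weights.append(current)
    return weights

# Hypothetical noisy batch-level targets that alternate between 0.8 and 0.2
noisy_targets = [0.8 if step % 2 == 0 else 0.2 for step in range(50)]
smoothed = smooth_trajectory(noisy_targets, initial=0.5)

raw_range = max(noisy_targets) - min(noisy_targets)
smoothed_range = max(smoothed) - min(smoothed)
```

With a momentum of 0.9, the raw targets swing across a range of 0.6, whereas the smoothed coefficient stays within a narrow band around their mean.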

4.4.4. Findings

Effect of Different Graph Neural Networks

Figure 5 shows how alternative GNN layer types (GCN, GraphSAGE, and graph attention networks (GATs)) affected the performance of the multi-label emotion classification model in terms of the Jaccard accuracy, micro-F1 score, and macro-F1 score. The results reveal that the GraphSAGE layer consistently outperformed both the GCN and GAT in all reported metrics, indicating its superior ability to leverage local neighborhood information through inductive aggregation. This suggests that the neighbor sampling and aggregation strategy of GraphSAGE more effectively captured the nuanced inter-sample relationships inherent in complex, multi-label Arabic emotion datasets. In contrast, while the GCN and GAT also provided substantial improvements over simpler baselines, their performance lagged slightly behind, possibly owing to fixed aggregation (GCN) or to sensitivity to graph connectivity and attention weights (GAT). Overall, these findings highlight the critical role of selecting an appropriate GNN backbone, with GraphSAGE offering the best trade-off between expressiveness and generalization in this context.
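For reference, the three metrics used in this comparison can be computed from binary label matrices as in the sketch below. The function names and the toy data are illustrative; treating a sample with no gold and no predicted labels as a perfect match is an assumption of this sketch.

```python
def jaccard_accuracy(gold, pred):
    """Mean over samples of |gold AND pred| / |gold OR pred|."""
    total = 0.0
    for g, p in zip(gold, pred):
        inter = sum(1 for a, b in zip(g, p) if a and b)
        union = sum(1 for a, b in zip(g, p) if a or b)
        # Assumption: an empty gold set matched by an empty prediction scores 1.
        total += inter / union if union else 1.0
    return total / len(gold)

def label_counts(gold, pred, label):
    """True positives, false positives, and false negatives for one label column."""
    tp = sum(1 for g, p in zip(gold, pred) if g[label] and p[label])
    fp = sum(1 for g, p in zip(gold, pred) if not g[label] and p[label])
    fn = sum(1 for g, p in zip(gold, pred) if g[label] and not p[label])
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(gold, pred):
    """F1 computed from counts pooled over all labels."""
    counts = [label_counts(gold, pred, k) for k in range(len(gold[0]))]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1, so rare labels count as much as frequent ones."""
    n_labels = len(gold[0])
    return sum(f1(*label_counts(gold, pred, k)) for k in range(n_labels)) / n_labels

# Hypothetical gold and predicted label vectors for three samples and three labels
gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
scores = (jaccard_accuracy(gold, pred), micro_f1(gold, pred), macro_f1(gold, pred))
```

Because macro-F1 averages per-label scores without weighting by frequency, a label that is never predicted correctly pulls it down more than it does micro-F1, which is why it is the more sensitive indicator for minority emotions.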

Effect of GraphSAGE Depth

Figure 6 illustrates the impact of varying the depth of the GraphSAGE component on the Jaccard accuracy across three Arabic multi-label emotion classification datasets: SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq. As shown, increasing the number of GraphSAGE layers from one to two led to consistent improvements in the Jaccard accuracy on all datasets, indicating that the model benefitted from aggregating information from a broader neighborhood. However, as the depth increased beyond two or three layers, the gains either plateaued or reversed, with performance dropping slightly for deeper architectures. This trend was particularly evident on the SemEval-2025-Arq dataset, where deeper models showed diminishing returns. The results suggest that while stacking multiple GraphSAGE layers allows the model to better capture inter-sample dependencies and higher-order graph structures, excessive depth can introduce oversmoothing and hamper the discriminative power of the learned representations. Therefore, careful tuning of the GNN depth is critical to maximizing performance in MLED tasks.
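The oversmoothing effect mentioned above can be demonstrated with a toy example. The sketch below applies a parameter-free mean aggregation repeatedly on a hypothetical four-node ring graph; a real GraphSAGE layer also includes learned weights and a nonlinearity, so this only isolates the averaging step that drives node features toward one another.

```python
def mean_aggregate(features, neighbors):
    """One averaging step: each node becomes the mean of itself and its neighbors."""
    updated = []
    for node, feat in enumerate(features):
        group = [feat] + [features[n] for n in neighbors[node]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated

def spread(features):
    """Mean per-dimension variance across nodes; values near 0 mean nodes are indistinguishable."""
    n_nodes = len(features)
    total = 0.0
    for col in zip(*features):
        mean = sum(col) / n_nodes
        total += sum((x - mean) ** 2 for x in col) / n_nodes
    return total / len(features[0])

# Hypothetical 4-node ring graph with 2-dimensional node features
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

spreads = [spread(features)]
for _ in range(4):  # simulate stacking four aggregation layers
    features = mean_aggregate(features, neighbors)
    spreads.append(spread(features))
```

The recorded spread shrinks with every additional aggregation step, which mirrors the loss of discriminative power that deeper stacks can suffer.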

Label Correlations

Figure 7 provides a comparative analysis between the ground-truth and model-predicted correlation matrices for emotion labels. Figure 7a visualizes the empirical Pearson correlations computed from the gold emotion annotations in the test set, highlighting which emotions tended to co-occur or be mutually exclusive in real-world Arabic social media texts. For example, strong positive correlations between joy and love or negative correlations between anger and joy reflect intuitive affective relationships underlying human expression. Figure 7b presents the corresponding matrix derived from the model’s predicted multi-label outputs, summarizing how effectively the model captured and reproduced these empirical dependencies.

The close alignment between the two matrices demonstrates the model’s ability to internalize complex label interactions and preserve consistency with observed emotion co-occurrence patterns. The Pearson correlation coefficient between the ground-truth and predicted correlation matrices was 0.83 (p < 0.001), indicating statistically significant label dependency learning.
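A matrix-level agreement score of this kind can be computed as sketched below. This is one plausible procedure, not necessarily the one used for the reported value: it builds a Pearson correlation matrix over label columns for the gold and the predicted labels, and then correlates their off-diagonal entries. Restricting the comparison to the upper triangle is an assumption, and the toy label matrices are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def label_correlation_matrix(label_rows):
    """Pearson correlation between every pair of label columns."""
    columns = list(zip(*label_rows))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def upper_triangle(matrix):
    """Off-diagonal entries only; the diagonal is trivially 1 in both matrices."""
    size = len(matrix)
    return [matrix[i][j] for i in range(size) for j in range(i + 1, size)]

def matrix_agreement(gold_rows, pred_rows):
    """Pearson correlation between the two label correlation matrices."""
    gold = upper_triangle(label_correlation_matrix(gold_rows))
    pred = upper_triangle(label_correlation_matrix(pred_rows))
    return pearson(gold, pred)

# Hypothetical gold and predicted labels for six samples and three emotions
gold_rows = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1]]
pred_rows = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
agreement = matrix_agreement(gold_rows, pred_rows)
```

A label that is never predicted (or always predicted) yields a constant column and an undefined correlation, so such labels would need separate handling in practice.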

Modeling such correlations is essential for robust multi-label emotion classification, as it improves predictive coherence and supports more interpretable, context-aware emotion attribution in morphologically rich and low-resource languages such as Arabic.

4.5. Complexity vs. Performance

Table 11 presents the complexity-performance trade-off on the SemEval-2025-Arq dataset, using the macro-F1 score as the evaluation metric. Notably, most variants maintained nearly identical parameter counts (≈170 M), indicating that improvements were not driven by parameter scaling but by structural and objective-level enhancements. The full GTHL-Emo model achieved the highest macro-F1 score (56.69%) while maintaining a competitive training time (14.86 s per epoch), demonstrating that the additional components did not incur prohibitive computational overhead. Among graph construction strategies, the kNN-based graph yielded stronger performance than cosine or random variants, confirming the importance of meaningful neighborhood selection. Removing adaptive weighting resulted in a clear degradation, highlighting the benefit of dynamic objective balancing. Although the removal of correlation alignment slightly improved performance in this specific setting, it reduced the structural regularization and consistency across datasets, suggesting that its contribution is stability-oriented rather than purely metric-driven. Overall, the results show that GTHL-Emo achieved the best accuracy-efficiency balance without excessive computational cost, supporting the necessity of its integrated design.

4.6. Model Analysis and Interpretability

4.6.1. Interpretability

Figure 8 provides qualitative evidence of the interpretability of the proposed GTHL-Emo model through token-level attention heat maps for representative Arabic social media texts. Panels (a–d) visualize the distribution of normalized self-attention weights across input tokens for different emotion predictions, illustrating how the model allocates importance when aggregating contextual information into sentence-level representations. To preserve the original linguistic context, the figure retains the source-language tokens, while their functional meanings are explained in the text.

Figure 8a shows the attention distribution for the fear emotion in Sample 1, where the model assigns the dominant weight to the global [CLS] token (1.00), followed by [SEP] (0.88), while lower but still meaningful attention is given to contextual tokens glossed as “and you” (0.75) and “and a half” (0.62) and to the explicit affective word “sad” (0.50). Figure 8b,c present attention maps for the love emotion across two distinct samples. In Sample 1, the model emphasizes tokens glossed as “and a half” (1.00), “on” (0.88), “and you” (0.75), and “that” (0.62), as well as the proper noun Abha (0.50). In Sample 3, the attention is distributed over conversational and affect-related cues glossed as “just/but” (1.00), “honestly” (0.88), “without” (0.75), and “I know” (0.62), and over a tokenized fragment corresponding to “annoyance” (0.50). Figure 8d illustrates the attention pattern for the trust emotion, where the model again assigns high weight to the structural tokens [SEP] (1.00) and [CLS] (0.88), while also emphasizing the contextual token glossed as “the seventh” (0.75), likely indicating a temporal reference, together with “and a half” (0.62) and “and you” (0.50).

Across all panels, the model systematically highlighted emotion-relevant lexical and contextual cues while attenuating less informative content, even in morphologically complex and dialectally varied Arabic text. These patterns indicate that GTHL-Emo captures semantically meaningful token-level signals aligned with human intuition about affect expression. Such interpretability complements the strong quantitative results and strengthens the practical applicability of the model in explainable multi-label emotion classification.
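The token-level weights visualized in such heat maps can be thought of as the output of an attention pooling step. The sketch below is a generic illustration with hypothetical vectors: it scores each token against a single scoring vector, applies a softmax, and pools the token vectors into one sentence vector. Rescaling the weights by their maximum for display is an assumption about how a top score of 1.00 could arise, not a description of how Figure 8 was produced.

```python
import math

def attention_pool(token_vectors, scoring_vector):
    """Softmax-weighted sum of token vectors; returns (weights, pooled vector)."""
    scores = [sum(w * h for w, h in zip(scoring_vector, vec)) for vec in token_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, token_vectors)) for d in range(dim)]
    return weights, pooled

def normalize_for_display(weights):
    """Rescale so the most attended token has weight 1.0 (assumed display convention)."""
    top = max(weights)
    return [w / top for w in weights]

# Hypothetical 2-dimensional token representations and scoring vector
token_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
scoring_vector = [1.0, 0.5]

weights, sentence_vector = attention_pool(token_vectors, scoring_vector)
display_weights = normalize_for_display(weights)
```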

4.6.2. Nearest-Neighbor Explanations

Table 12 illustrates nearest-neighbor explanations based on cosine similarity in the learned embedding space. For each development-set query, the most similar training instance (Rank 1) was retrieved along with its gold labels. The high similarity scores (ranging from approximately 0.85 to 0.99) indicate that the learned representations effectively clustered semantically and emotionally aligned instances. In many cases, retrieved neighbors shared both a lexical structure and emotional composition, particularly for strongly co-occurring emotions such as joy, love, and optimism or anger and disgust. Notably, even when lexical overlap was limited, the model retrieved instances with a comparable emotional tone, suggesting that the embedding space captured higher-level affective semantics beyond surface similarity. These results provide interpretability evidence that GTHL-Emo organizes texts according to structured emotional relationships, supporting the claim that correlation-aware training promotes semantically meaningful clustering in representation space.
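The retrieval step behind such explanations is simple to sketch. The embeddings and labels below are hypothetical placeholders; in practice the vectors would come from the trained model's embedding space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbor(query, train_embeddings, train_labels):
    """Return (index, similarity, gold labels) of the Rank-1 training instance."""
    scores = [cosine(query, emb) for emb in train_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best], train_labels[best]

# Hypothetical training embeddings with their gold emotion labels
train_embeddings = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
train_labels = [["joy"], ["joy", "love"], ["anger"]]

# Hypothetical development-set query embedding
query = [0.8, 0.2, 0.0]
index, similarity, labels = nearest_neighbor(query, train_embeddings, train_labels)
```

Comparing the retrieved gold labels with the query's predicted labels gives a quick, instance-level check on whether the embedding space groups texts by emotional content.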

4.6.3. Error Analysis and Failure Cases

Table 13 presents a qualitative error analysis across representative true positives (TPs), false positives (FPs), and false negatives (FNs) for multiple emotion categories. The analysis reveals that the model performed reliably when emotions were expressed explicitly with strong lexical cues, particularly for high-frequency emotions such as anger, joy, love, and pessimism. However, false positives frequently occurred in semantically complex or metaphorical contexts, where surface-level cues overlapped across correlated emotions (e.g., confusion between sadness and pessimism or between joy and optimism). False negatives often arose in cases of implicit emotional expression, dialectal variation, or figurative language, where the emotional signal was subtle and not lexically dominant. These findings suggest that while the model effectively captured strong emotional polarity and structured label dependencies, performance degraded under nuanced contextual ambiguity or overlapping affective states. This analysis highlights both the robustness of correlation modeling and the limitations of relying on distributional similarity for fine-grained emotional nuance.

5. Conclusions and Future Work

In this work, we presented GTHL-Emo, a unified framework designed to address two persistent challenges in Arabic multi-label emotion detection: severe class imbalance and structured label correlations. By integrating a transformer-based semantic encoder, a dynamic GraphSAGE layer for inter-sample relational modeling, and a correlation-aware label embedding module, the proposed architecture provides a structured and adaptive solution to multi-label prediction under imbalanced conditions. The core contribution lies in the adaptive hybrid loss mechanism, which dynamically balances complementary objectives, namely BCE, focal, and ranking losses, using batch-level statistics to avoid manual tuning while improving minority label sensitivity.

Extensive experiments across three diverse Arabic benchmarks demonstrated that GTHL-Emo achieved highly competitive performance. Ablation studies confirmed that each component contributed complementary improvements rather than redundant complexity. Importantly, although the framework integrates multiple modules, inference complexity remains dominated by the transformer encoder, and the dynamic graph operates at the mini-batch level without requiring global graph construction, supporting practical deployability. To ensure realistic deployment, label-based Jaccard similarity was restricted strictly to training-time graph construction and replaced with cosine-based feature similarity during inference, preventing information leakage. Sensitivity analysis further confirmed that structural reasoning does not depend on ground-truth labels at test time. While correlation alignment enhances structural consistency, enforcing empirical co-occurrence patterns may risk amplifying dataset-specific annotation artifacts. Our use of soft KL regularization rather than hard constraints allows the model to deviate when semantic evidence contradicts empirical priors.
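The separation between training-time and inference-time graph construction can be sketched as follows. The convex combination controlled by `lam`, the function names, and the toy inputs are assumptions made for illustration; the sketch only shows that label information enters the edge weight when gold labels are supplied and is absent otherwise.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def jaccard(a, b):
    """Jaccard similarity between two binary label vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def edge_weight(feat_i, feat_j, labels_i=None, labels_j=None, lam=0.5):
    """Edge weight between two mini-batch items.

    With gold labels (training), mix label and feature similarity.
    Without them (inference), fall back to feature similarity alone.
    The mixing rule is a hypothetical convex combination.
    """
    feature_sim = cosine(feat_i, feat_j)
    if labels_i is None or labels_j is None:
        return feature_sim
    return lam * jaccard(labels_i, labels_j) + (1.0 - lam) * feature_sim

# Hypothetical features and gold labels for two items in a mini-batch
feat_a, feat_b = [1.0, 0.0], [1.0, 1.0]
labels_a, labels_b = [1, 0, 1], [1, 0, 0]

train_weight = edge_weight(feat_a, feat_b, labels_a, labels_b, lam=0.5)
infer_weight = edge_weight(feat_a, feat_b)
```

Keeping the label-dependent branch behind an explicit check makes it straightforward to verify that no gold labels are consulted on the inference path.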

Despite its strengths, several limitations motivate further investigation. First, the current framework does not explicitly incorporate predictive uncertainty when refining label correlations. Structural alignment is applied uniformly across samples, even when model confidence varies substantially. Under ambiguous conditions, correlation alignment may either over-regularize or insufficiently adapt to uncertainty. Second, structural reasoning is performed using dataset-level affinity patterns rather than sample-wise adaptive modulation, limiting flexibility under heterogeneous or domain-shifted inputs. Third, although adaptive loss coefficients are dynamically learned, the framework does not explicitly quantify uncertainty propagation or structural ambiguity within the label graph. Finally, cross-regional and dialectal generalization of learned label dependencies warrants deeper investigation, particularly where emotional co-occurrence structures differ across domains.

Beyond emotion detection, the modular design of GTHL-Emo offers broader applicability. The adaptive hybrid loss mechanism and correlation-aware graph modeling can be transferred to other Arabic multi-label tasks such as topic tagging, intent detection, or stance classification, where imbalance and label dependency similarly arise. Future work will explore uncertainty-aware structural modeling in which predictive confidence dynamically modulates the label graph propagation strength. Integrating calibrated uncertainty estimation with adaptive graph diffusion mechanisms may enable more robust and interpretable multi-label reasoning under ambiguity.

By addressing imbalance, structured dependencies, and adaptive optimization within a unified architecture, GTHL-Emo establishes a strong foundation for principled Arabic multi-label modeling while highlighting important theoretical and practical challenges that motivate subsequent advances.

Author Contributions

Conceptualization, M.N.A. and S.T.; methodology, M.N.A., S.T. and F.F.; validation, M.N.A. and F.F.; formal analysis, M.N.A.; investigation, M.N.A.; data curation, M.N.A.; visualization, M.N.A.; writing—original draft preparation, M.N.A.; writing—review and editing, S.T. and F.F.; supervision, S.T. and F.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The authors would like to express their sincere gratitude to the organizers of the SemEval-2018 Task 1 and SemEval-2025 Task 11 shared tasks for providing the publicly available datasets, as well as the developers of the ExaAEC corpus for making their annotated dataset accessible to the research community. These valuable resources have been instrumental in enabling and advancing this study. The datasets are accessible via the following links: SemEval-2018 Task 1: https://huggingface.co/datasets/SemEvalWorkshop/sem_eval_2018_task_1 (accessed on 5 January 2026); SemEval-2025 Task 11: https://github.com/emotion-analysis-project/SemEval2025-Task11 (accessed on 5 January 2026); ExaAEC dataset: https://github.com/exaco/exaaec (accessed on 5 January 2026).

Acknowledgments

The first author, M.N.A., would like to express his gratitude to the University of Ha’il, Saudi Arabia, and Universiti Kebangsaan Malaysia (UKM) for their continuous support and the valuable resources provided during this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Representative Arabic query texts and their nearest-neighbor Arabic texts for the examples reported in Table 12.

ID	Arabic Query	Nearest Neighbor Texts
E1
E2
E3
E4

Table A2. Representative Arabic texts used in the qualitative analysis examples reported in Table 13.

ID	Arabic Text (Original)
E5
E6
E7
E8
E9

Figure 1. Architecture of GTHL-Emo showing MARBERT-based text encoding with attention pooling, dynamic graph construction using mixed similarities with $\lambda(b)$ weighting, GraphSAGE feature propagation with label-aware transformer processing, and adaptive hybrid loss computation with four weighted components driven by batch-level statistics. (BCE: Binary Cross-Entropy; FL: Focal Loss; RL: Ranking Loss; KL: Kullback–Leibler).

Figure 2. Loss function configuration comparison on the SemEval-2018-Ec-Ar validation set. We evaluated partial objectives (e.g., BCE-only, focal-only, and hybrids) and observed that combining imbalance-aware, correlation-alignment, and ranking terms yielded the most balanced performance across metrics, supporting the complementary role of each loss component. (BCE: Binary Cross-Entropy; AP: Average Precision).

Figure 3. Sensitivity analysis of fixed loss coefficients on the SemEval-2018-Ec-Ar validation set (macro-F1 score). Performance varied smoothly over a broad range of values for $\lambda_{imb}$ (focal), $\lambda_{corr}$ (graph or correlation), and $\lambda_{rank}$ (ranking), indicating that the objective was not overly sensitive to precise coefficient tuning.

Figure 4. Evolution of adaptive loss coefficients during training on the SemEval-2018-Ec-Ar development set. The weights converged smoothly and remained stable, indicating that the learned hybrid objective did not collapse to a single dominant term.

Figure 5. Comparison of GNN layer architectures on multi-label emotion classification performance. The bar chart shows the impact of three graph neural network (GNN) layer types, namely graph convolutional networks (GCNs), GraphSAGE, and graph attention networks (GATs), on the main evaluation metrics: the Jaccard accuracy, micro-F1 score, and macro-F1 score (all shown as percentages). The results reveal that GraphSAGE outperformed the GCN and GAT in all metrics. This highlights the effectiveness of inductive, local aggregation in capturing semantic and relational structure for accurate multi-label emotion identification in Arabic.

Figure 6. Effect of GraphSAGE depth on Jaccard accuracy across three Arabic emotion classification datasets (SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq). The results demonstrate that model performance improved as the number of GraphSAGE layers increased from 1 to 2, reaching an optimal depth at 2 or 3 layers for most datasets. Further increasing the layer depth beyond this point led to diminishing returns or even slight performance degradation, likely due to oversmoothing or increased difficulty in learning meaningful representations. This analysis highlights the importance of tuning the GNN depth to balance representational power and generalization.

Figure 7. Emotion correlation matrices on the SemEval-2018-Ec-Ar benchmark: (a) ground-truth matrix computed from label co-occurrences in the test set and (b) predicted matrix derived from model multi-label outputs. The similarity between the two matrices indicates the model’s ability to capture empirical label dependencies.
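As an illustration of how such matrices can be obtained, the sketch below computes a label-by-label correlation matrix from a binary label matrix and then compares two matrices by their mean absolute difference. The choice of Pearson correlation and of the comparison measure are assumptions made for this example; the caption only states that the matrices are derived from label co-occurrences. Function names are hypothetical.

```python
from math import sqrt

def label_correlation(Y):
    """Pearson correlation between the label columns of a binary matrix Y
    (rows = samples, columns = emotion labels)."""
    n, k = len(Y), len(Y[0])
    means = [sum(row[j] for row in Y) / n for j in range(k)]
    var = [sum((row[j] - means[j]) ** 2 for row in Y) / n for j in range(k)]
    corr = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(k):
            cov = sum((row[a] - means[a]) * (row[b] - means[b]) for row in Y) / n
            denom = sqrt(var[a] * var[b])
            # A label that is constant across samples has no defined correlation.
            corr[a][b] = cov / denom if denom > 0 else 0.0
    return corr

def mean_abs_difference(A, B):
    """Average entry-wise gap between two square matrices of equal size."""
    k = len(A)
    return sum(abs(A[i][j] - B[i][j]) for i in range(k) for j in range(k)) / (k * k)

if __name__ == "__main__":
    # Toy data: three labels, four samples (not taken from the paper).
    y_true = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]]
    y_pred = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
    gold = label_correlation(y_true)
    pred = label_correlation(y_pred)
    print(mean_abs_difference(gold, pred))
```

A small mean absolute difference corresponds to the visual similarity between panels (a) and (b) that the caption describes.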

Figure 8. Token-level attention heat maps for selected Arabic social media samples in GTHL-Emo: (a) fear (Sample 1), (b) love (Sample 1), (c) love (Sample 3), and (d) trust (Sample 1). Darker regions indicate higher normalized attention weights assigned by the self-attention pooling mechanism, illustrating label-specific interpretability.

Table 1. Data distribution across training, validation, and testing sets for the three Arabic multi-label emotion classification datasets: SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq.

	SemEval-2018-Ec-Ar	ExaAEC	SemEval-2025-Arq
Train	2278	16,031	901
Validation	585	2005	100
Test	1518	2014	902
Total	4381	20,050	1903

Table 2. Final selected hyperparameters for the GTHL-Emo model. Unless specified, these values were consistently optimal across all three datasets.

Hyperparameter	Selected Value
Embedding Dimension (d)	768
Max Sequence Length	128
Learning Rate (SemEval-2018-Ec-Ar, SemEval-2025-Arq)	$1 \times 10^{-5}$
Learning Rate (ExaAEC)	$2 \times 10^{-5}$
Weight Decay	0.02
Class Weights	Weighted
Epochs (Max)	20
Dropout Rate (SemEval-2018-Ec-Ar)	0.1
Dropout Rate (ExaAEC, SemEval-2025-Arq)	0.3
GraphSAGE Layers	2
GraphSAGE Neighbors	5
Focal Loss $λ$	0.1
Graph Loss $λ$	0.1
Ranking Loss $λ$	0.1
Margin (Ranking Loss)	0.5
Focal Loss $γ$	Adaptive
Focal Loss $α_{c}$	Adaptive
Early Stopping Patience	7
EMA Alpha (Loss Weights)	0.95
Top-k Spans (Attention)	5
Label Transformer Layers	2
Label Transformer Heads	8
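For readers who want to reuse these settings, the values in Table 2 could be captured in a small configuration object such as the sketch below. The class name, field names, and the `config_for` helper are hypothetical and chosen only for illustration; the numeric values are those listed in the table, and the adaptive focal parameters ($γ$, $α_{c}$) are left out because they are learned rather than fixed.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GTHLEmoConfig:
    # Shared values from Table 2 (field names are illustrative).
    embedding_dim: int = 768
    max_seq_len: int = 128
    learning_rate: float = 1e-5
    weight_decay: float = 0.02
    class_weights: str = "weighted"
    max_epochs: int = 20
    dropout: float = 0.1
    graphsage_layers: int = 2
    graphsage_neighbors: int = 5
    lambda_focal: float = 0.1
    lambda_graph: float = 0.1
    lambda_rank: float = 0.1
    ranking_margin: float = 0.5
    early_stopping_patience: int = 7
    ema_alpha: float = 0.95
    top_k_spans: int = 5
    label_transformer_layers: int = 2
    label_transformer_heads: int = 8

# Dataset-specific entries from Table 2.
_OVERRIDES = {
    "SemEval-2018-Ec-Ar": {"learning_rate": 1e-5, "dropout": 0.1},
    "ExaAEC": {"learning_rate": 2e-5, "dropout": 0.3},
    "SemEval-2025-Arq": {"learning_rate": 1e-5, "dropout": 0.3},
}

def config_for(dataset: str) -> GTHLEmoConfig:
    """Return the shared configuration with the dataset-specific overrides applied."""
    return replace(GTHLEmoConfig(), **_OVERRIDES[dataset])

if __name__ == "__main__":
    print(config_for("ExaAEC"))
```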

Table 3. Performance comparison on SemEval-2018-Ec-Ar. Best results in each column are shown in bold.

Model	Jaccard Acc (%)	Micro-F1 (%)	Macro-F1 (%)
Baseline: Random [29]	17.70	–	–
MEDIAN Team [29]	25.40	–	–
Baseline: SVM-Unigrams [29]	38.00	–	–
UNCC Team [32]	44.60	–	–
Tw-StAR Team [31]	46.50	–	–
PARTNA Team [29]	48.40	–	–
EMA Team [30]	48.90	61.80	46.10
Khalil et al. [34]	49.80	61.50	44.00
Alswaidan and Menai [35]	51.20	63.10	50.20
Samy et al. [33]	53.20	49.50	64.80
Elfaik et al. [5]	53.82	–	–
Mansy et al. [51]	54.00	52.70	70.10
Elfaik et al. [36]	60.00	52.00	35.00
GTHL-Emo (Proposed)	58.70	71.02	60.83

Table 4. Class-wise comparison of precision, recall, and F1 scores between our GTHL-Emo model and the approach of Elfaik et al. [36] on SemEval-2018-Ec-Ar. The bold values indicate the best metric per row.

Emotion	GTHL-Emo (Proposed)			Elfaik et al. [36]
Emotion	Precision	Recall	F1	Precision	Recall	F1
anger	0.78	0.86	0.82	0.67	0.62	0.70
anticipation	0.28	0.28	0.28	0.25	0.66	0.52
disgust	0.55	0.63	0.59	0.45	0.35	0.39
fear	0.79	0.72	0.75	0.38	0.11	0.17
joy	0.83	0.85	0.84	0.83	0.54	0.65
love	0.80	0.82	0.81	0.75	0.46	0.57
optimism	0.76	0.80	0.78	0.77	0.52	0.62
pessimism	0.45	0.72	0.56	0.33	0.16	0.22
sadness	0.70	0.84	0.76	0.69	0.45	0.55
surprise	0.35	0.32	0.33	0.00	0.00	0.00
trust	0.22	0.14	0.18	0.00	0.00	0.00

Table 5. Comparison of GTHL-Emo and transformer-based baselines on the ExaAEC dataset. Best overall results are in bold; best baseline results are underlined. Values in parentheses give the change over the best baseline in percentage points (pp); ↑ indicates improved performance.

Model	Jaccard Acc. (%)	Micro-F1 (%)	Macro-F1 (%)
Sarbazi et al. [39] (BiLSTM + ELMo)	–	65.00	–
asafaya-BERT	56.66	61.06	59.02
Qarib-BERT	59.53	64.15	60.88
MARBERT	60.93	65.84	63.00
AraBERTv0.2	59.42	63.72	58.97
MARBERT Fine-tuned	62.64	66.73	63.71
GTHL-Emo (Proposed)	65.99 (+3.35 pp ↑)	70.72 (+3.99 pp ↑)	68.71 (+5.00 pp ↑)

Table 6. Class-wise F1 score (%) comparison across baseline models and the proposed GTHL-Emo model on ExaAEC. Best results per column are in bold; second-best results are underlined.

Model	Anger	Ant.	Disgust	Fear	Joy	Love	Accept.	Sadness	Surprise	Neutral
MARBERT (base)	42.91	54.84	65.73	57.14	65.47	64.22	74.92	69.91	61.54	73.36
MARBERT Fine-tuned	49.70	56.12	64.16	66.67	60.58	69.37	75.45	66.76	53.93	74.37
AraBERTv0.2	45.89	52.73	63.48	60.24	40.00	62.40	72.78	60.47	57.21	74.48
Qarib-BERT	42.80	51.85	62.46	56.47	49.47	66.36	69.91	56.83	56.02	71.39
asafaya-BERT	37.32	45.54	60.69	59.52	65.69	54.32	69.91	56.01	54.59	70.98
GTHL-Emo (Proposed)	55.66	56.09	68.24	69.66	66.33	78.26	75.60	71.29	59.10	73.77

Note: Ant. = anticipation; Accept. = acceptance.

Table 7. Performance comparison of closely related works on the SemEval-2025 Track A in Arabic (Arq) dataset. The best scores are highlighted in bold, and the highest baseline results are underlined. Arrows indicate the change relative to the strongest baseline: ↑ for improved performance.

Model	Macro-F1 (%)
HTU Team [37]	51.2
INFOTEC-NLP Team [55]	51.7
LATE-GIL-NLP Team [38]	48.6
YNU-HPCC Team [56]	44.4
GTHL-Emo (Proposed)	56.69 (+9.65% relative ↑)
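The improvement of 9.65% reported for GTHL-Emo in Table 7 is consistent with a relative change over the strongest baseline (51.7), which differs from the absolute gap in percentage points. The short sketch below makes the distinction explicit; the helper names are hypothetical.

```python
def absolute_gain_pp(score: float, baseline: float) -> float:
    """Difference between two percentage scores, in percentage points."""
    return score - baseline

def relative_gain_pct(score: float, baseline: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (score - baseline) / baseline

if __name__ == "__main__":
    ours, best_baseline = 56.69, 51.7  # macro-F1 values from Table 7
    print(round(absolute_gain_pp(ours, best_baseline), 2))   # 4.99
    print(round(relative_gain_pct(ours, best_baseline), 2))  # 9.65
```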

Table 8. Component-wise ablation of GTHL-Emo across three benchmark datasets. Each row removes one architectural module or loss component. J = Jaccard accuracy (%), µF1 = micro-F1 score (%), mF1 = macro-F1 score (%), HL = Hamming loss (lower is better). The best scores are highlighted in bold, and the highest baseline results are underlined.

Ablation Setting	SemEval-2018-Ec-Ar				ExaAEC				SemEval-2025-Arq
Ablation Setting	J	µF1	mF1	HL	J	µF1	mF1	HL	J	µF1	mF1	HL
w/o GraphSAGE	57.60	69.92	60.43	–	63.69	68.61	66.20	0.0874	40.70	58.42	55.48	0.2905
w/o Label Transformer	57.25	70.43	59.11	0.1404	64.03	68.40	66.31	0.0868	40.68	57.29	54.86	0.2664
w/o Correlation Alignment	58.43	70.38	59.67	0.1400	64.08	68.47	65.91	0.0868	41.07	58.23	56.04	0.2714
w/o Ranking Loss	58.01	69.67	57.51	0.1382	63.53	68.29	65.76	0.0849	40.12	57.88	55.43	0.2762
w/o Imbalance Loss	57.79	70.64	58.67	0.1383	63.53	68.11	65.43	0.0875	41.02	57.58	55.44	0.2631
w/o Adaptive Weights	59.15	70.19	57.53	0.1320	63.17	67.91	65.52	0.0891	39.11	56.83	54.60	0.2633
GTHL-Emo (Full Model)	58.70	71.02	60.83	0.1259	65.99	70.72	68.71	0.0887	41.83	58.52	56.69	0.2605

Note: Removing GraphSAGE significantly degraded structured modeling on EC, while adaptive loss weighting notably affected robustness on SemEval-2018. The impact of components varied across the datasets, highlighting the dataset-dependent trade-offs between structured reasoning and label noise robustness.
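The four metrics reported in the ablation can be computed from binary label matrices as in the sketch below. This is a plain-Python illustration of the standard definitions, not the authors' evaluation script; function names are hypothetical, and scoring a sample with empty gold and predicted label sets as a Jaccard of 1 is an assumption.

```python
def jaccard_accuracy(y_true, y_pred):
    """Mean over samples of |gold ∩ pred| / |gold ∪ pred|."""
    scores = []
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a and b)
        union = sum(1 for a, b in zip(t, p) if a or b)
        scores.append(inter / union if union else 1.0)  # assumption for empty sets
    return sum(scores) / len(scores)

def _counts(y_true, y_pred, j):
    """True positives, false positives, and false negatives for label j."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] and p[j])
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t[j] and p[j])
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] and not p[j])
    return tp, fp, fn

def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(y_true, y_pred):
    """F1 computed from counts pooled over all labels."""
    k = len(y_true[0])
    totals = [sum(c) for c in zip(*(_counts(y_true, y_pred, j) for j in range(k)))]
    return _f1(*totals)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    k = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(k)) / k

def hamming_loss(y_true, y_pred):
    """Fraction of label entries that are predicted incorrectly."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) for a, b in zip(t, p) if a != b)
    return wrong / (len(y_true) * len(y_true[0]))

if __name__ == "__main__":
    # Toy example with three samples and three labels (not from the paper).
    gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
    pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
    print(jaccard_accuracy(gold, pred), micro_f1(gold, pred),
          macro_f1(gold, pred), hamming_loss(gold, pred))
```

Because macro-F1 weights every label equally, it reacts more strongly than micro-F1 to changes on rare emotions, which is why the two columns can move in different directions across ablation rows.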

Table 9. Graph construction sensitivity analysis. Comparison of different similarity mechanisms used for batch-level graph construction. The best scores are highlighted in bold, and the highest baseline results are underlined.

Graph Setting	SemEval-2018-Ec-Ar				ExaAEC				SemEval-2025-Arq
Graph Setting	J	µF1	mF1	HL	J	µF1	mF1	HL	J	µF1	mF1	HL
Cosine Only	57.84	70.34	60.76	0.1341	64.11	68.75	66.30	0.0864	41.99	58.48	54.71	0.2799
Jaccard (Train Only)	57.70	70.02	60.83	0.1359	62.83	67.51	65.16	0.0887	41.83	58.52	55.69	0.2905
kNN Graph	57.05	70.04	59.28	0.1321	62.86	68.65	66.32	0.0871	40.77	57.75	56.08	0.3064
Random Graph	58.38	70.55	59.15	0.1367	63.83	68.72	66.08	0.0841	40.83	56.27	54.74	0.3378
GTHL-Emo (Proposed)	58.70	71.02	60.83	0.1259	65.99	70.72	68.71	0.0887	41.83	58.52	56.69	0.2605

Note: The Jaccard-based similarity was used only during training and replaced by the cosine similarity at inference time, preventing information leakage. Performance differences across graph strategies indicate that structured similarity modeling improved robustness beyond random connectivity. kNN: k-Nearest Neighbors.
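The train-time versus inference-time distinction in the note above can be sketched as follows. This is a simplified illustration, not the paper's construction: the mixing coefficient `lam` is a fixed constant here (the outline of Section 3.3.1 indicates that the paper learns it with an adaptive lambda network), and the function names are hypothetical. The default of five neighbors follows Table 2.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def label_jaccard(a, b):
    """Jaccard overlap between two binary label vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

def build_knn_graph(embeddings, k=5, labels=None, lam=0.5):
    """Return, for each instance in the batch, the indices of its k most similar peers.

    With labels (training), similarity mixes cosine and label Jaccard overlap.
    Without labels (inference), only cosine similarity is used, so no gold
    label information can leak into the graph.
    """
    neighbors = []
    for i, emb_i in enumerate(embeddings):
        scored = []
        for j, emb_j in enumerate(embeddings):
            if i == j:
                continue
            sim = cosine(emb_i, emb_j)
            if labels is not None:
                sim = lam * sim + (1.0 - lam) * label_jaccard(labels[i], labels[j])
            scored.append((sim, j))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # highest similarity first
        neighbors.append([j for _, j in scored[:k]])
    return neighbors

if __name__ == "__main__":
    emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    lab = [[1, 0], [0, 1], [1, 0], [0, 1]]
    print(build_knn_graph(emb, k=1))                       # inference: cosine only
    print(build_knn_graph(emb, k=1, labels=lab, lam=0.2))  # training: mixed similarity
```

The resulting neighbor lists would then feed a GraphSAGE-style aggregation step, which is not shown here.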

Table 10. Comparison of GTHL-Emo model performance using different configurations of the adaptive hybrid loss weighting scheme across three benchmark datasets. Columns indicate the coefficients for focal, graph, and ranking loss terms ($λ$), as well as the exponential moving average parameter ($α$). Metrics are the Jaccard accuracy, micro-F1 score, and macro-F1 score, each reported on SemEval-2018-Ec-Ar, ExaAEC, and SemEval-2025-Arq. Boldface highlights the best overall results.

Loss Weights				SemEval-2018-Ec-Ar			ExaAEC			SemEval-2025-Arq
Focal $λ$	Graph $λ$	Rank $λ$	EMA $α$	Jaccard Acc	Micro-F1	Macro-F1	Jaccard Acc	Micro-F1	Macro-F1	Jaccard Acc	Micro-F1	Macro-F1
0.05	0.05	0.05	0.90	57.80	67.12	60.05	62.14	68.32	64.50	39.30	55.42	52.08
0.10	0.10	0.05	0.95	58.40	68.55	61.10	64.50	69.78	66.82	40.65	56.20	53.80
0.10	0.10	0.10	0.95	58.70	71.02	60.48	65.99	70.72	68.71	41.47	56.78	56.69
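To show what the EMA parameter $α$ in Table 10 controls, the sketch below applies a standard exponential moving average to a vector of loss weights. The update form $w_{t} = α\,w_{t-1} + (1 - α)\,\hat{w}_{t}$ is the conventional definition and is assumed here; the way the proposed weights $\hat{w}_{t}$ are produced by the batch-wise controller is not reproduced, and the function name is hypothetical.

```python
def ema_update(previous, proposed, alpha=0.95):
    """Blend the previous loss weights with newly proposed ones.

    A larger alpha keeps the weights closer to their previous values,
    which damps batch-to-batch fluctuations.
    """
    return [alpha * w_prev + (1.0 - alpha) * w_new
            for w_prev, w_new in zip(previous, proposed)]

if __name__ == "__main__":
    weights = [0.1, 0.1, 0.1]       # initial focal, graph, and ranking coefficients
    proposal = [0.3, 0.1, 0.0]      # hypothetical controller output for one batch
    for step in range(3):
        weights = ema_update(weights, proposal)
        print(step, [round(w, 4) for w in weights])
```

With $α = 0.95$, each batch moves the weights only 5% of the way toward the new proposal, so a single noisy batch cannot shift the balance between loss terms abruptly.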

Table 11. Complexity vs. performance trade-off on the SemEval-2025-Arq dataset using macro-F1 score. Parameters are in millions; macro-F1 is reported as a percentage. Best results are in bold.

Run	Parameters (M)	Enabled Modules	Macro-F1 (%)	Avg. Seconds per Epoch
Graph Cosine Only	170.340	6.00	54.71	14.77
Graph Jaccard (Train Only)	170.340	6.00	55.69	15.15
kNN Graph	170.340	6.00	56.08	15.40
Random Graph	170.340	6.00	54.74	15.26
w/o Adaptive Weights	170.340	5.00	54.60	15.43
w/o Correlation Alignment	170.340	5.00	56.04	14.26
w/o GraphSAGE	169.160	5.00	55.48	14.83
w/o Imbalance Loss	170.340	5.00	55.44	14.96
w/o Label Transformer	164.030	5.00	54.86	14.59
w/o Ranking Loss	170.340	5.00	55.43	14.86
GTHL-Emo	170.340	6.00	56.69	14.86

Table 12. Qualitative nearest-neighbor analysis (English translation only). Original Arabic texts corresponding to each example ID are provided in Appendix A (Table A1).

ID	Query (English)	Query Labels	Rank	Sim.	Nearest Neighbor (English)	NN Labels
E1	My heart adores him; don’t read and torture yourself.	love	1	0.9626	I love Yasser and wish he dominates the field.	love
E2	A very happy day, praise be to God; it’s the leader’s wedding.	joy, love, optimism	1	0.9247	O God, make its ending joyful and blessed.	joy, love, optimism
E3	Why wouldn’t I miss you? Stay safe and well.	joy, love, sadness	1	0.9371	Such a rush of feeling, like filling a car to the brim.	joy, love, optimism
E4	Fear and anxiety after this emotional breakdown.	anger, disgust, fear, sadness	1	0.9284	I’m honestly scared in a way I’ve never been before.	fear

Table 13. Qualitative error analysis on SemEval-2018-Ec-Ar (English translations only). Original Arabic texts corresponding to each example ID are provided in Appendix A (Table A2). TP = true positive; FN = false negative.

ID	Label	Case	Prob	Text (English Translation)	Gold	Pred
E5	anger	TP	1.0000	Give everyone what they deserve; if you keep exaggerating for everyone, you will end up exhausted.	pessimism, sadness	anger, disgust, pessimism, sadness
E6	anger	FN	0.3542	Yesterday we were love letters; today we are falling pages that hurt each other.	anger, love, pessimism, sadness	anticipation, pessimism, sadness
E7	anticipation	TP	0.9983	Thursday is here; I am going to check the site now. Pray for me in the next two minutes.	anticipation, fear, surprise	anticipation, joy, optimism
E8	anticipation	FN	0.1454	I hope my fears and expectations never come true for you and that they stay far away.	anticipation, fear	fear, pessimism
E9	trust	FN	0.0962	Asking a question does not imply ignorance; it reflects a need for reassurance and certainty.	trust	disgust
