Context-Aware Emotion Gating and Modulation for Fine-Grained Sentiment Classification

Thennakoon Mudiyanselage, Anupama Udayangani Gunathilaka; Zhang, Jinglan; Li, Yeufeng

doi:10.3390/make8010009

by Anupama Udayangani Gunathilaka Thennakoon Mudiyanselage ¹,*, Jinglan Zhang ² and Yeufeng Li ²

¹ School of Computer Science, Queensland University of Technology, George Street, Brisbane 4000, Australia

² Centre for Data Science, Queensland University of Technology, George Street, Brisbane 4000, Australia

* Author to whom correspondence should be addressed.

Mach. Learn. Knowl. Extr. 2026, 8(1), 9; https://doi.org/10.3390/make8010009 (registering DOI)

Submission received: 20 October 2025 / Revised: 6 December 2025 / Accepted: 24 December 2025 / Published: 31 December 2025

(This article belongs to the Section Data)

Abstract

Fine-grained sentiment analysis requires a deep understanding of emotional intensity in the text to distinguish subtle shifts in polarity, such as moving from positive to more positive or from negative to more negative, and to clearly separate emotionally neutral statements from polarized expressions, especially in short or contextually sparse texts such as social media posts. While recent advances combine deep semantic encoding with context-aware architectures, such as Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNNs), many models still struggle to detect nuanced emotional cues, particularly in short texts, due to the limited contextual information, subtle polarity shifts, and overlapping affective expressions, which ultimately hinder performance and reduce a model’s ability to make fine-grained sentiment distinctions. To address this challenge, we propose an Emotion-Aware Bidirectional Gating Network (Electra-BiG-Emo) that improves sentiment classification and subtle sentiment differentiation by learning contextual emotion representations and refining them with auxiliary emotional signals. Our model employs an asymmetric gating mechanism within a BiLSTM to dynamically capture both early and late contextual semantics. The gates are temperature-controlled, enabling adaptive modulation of emotion priors derived from Reddit post datasets to enhance context-aware emotion representation. These soft emotional signals are reweighted based on context, enabling the model to amplify or suppress emotions in the presence of an ambiguous context. This approach advances fine-grained sentiment understanding by embedding emotional awareness directly into the learning process. Ablation studies confirm the complementary roles of semantic encoding, context modeling, and emotion modulation. Furthermore, our approach achieves competitive performance on the SemEval 2017 Task 4C, Twitter US Airline, and SST-5 datasets compared with state-of-the-art methods, particularly excelling in detecting subtle emotional variations and classifying short, semantically sparse texts. Gating and modulation analyses reveal that emotion-aware gating enhances interpretability and reinforces the value of explicit emotion modeling in fine-grained sentiment tasks.

Keywords: emotion recognition; fine-grained sentiments; social media posts; textual context; BiLSTM; gating; modulating

1. Introduction

Fine-grained sentiment classification requires distinguishing subtle differences in sentiment intensity, such as mildly positive vs. strongly positive, negative vs. strongly negative, or clearly neutral vs. weakly negative. This task is particularly challenging when sentiment classes are closely confusable, especially in short texts where semantic signals are sparse or ambiguous [1,2]. Importantly, emotion and sentiment are related but distinct concepts: sentiment reflects an overall evaluative polarity (positive, negative, or neutral), whereas emotions emerge from internal mental and psychological states and represent specific affective states, such as joy, anger, or sadness [3,4]. These emotions significantly influence how sentiment is expressed, particularly in terms of intensity. For instance, a strongly felt emotion, like rage, may signal a more intense negative sentiment than mild annoyance. Therefore, capturing emotional nuances is essential for accurately distinguishing fine-grained sentiment differences, where the boundary between closely related sentiment classes hinges on variations in emotional intensity.

However, existing approaches largely focus on syntactic or contextual cues and lack explicit modeling of emotional dynamics. This limitation is particularly evident in sentiment analysis of short texts, such as social media posts, where the scarcity of contextual information makes it difficult to capture nuanced emotional signals [5]. While Large Language Models (LLMs) and their hybrid variants (e.g., LLMs combined with BiLSTM–CNNs) provide rich global representations [6,7,8,9,10,11], they often struggle to resolve such fine-grained distinctions because the inherent data sparsity of the text provides an insufficient signal from which to learn a robust and nuanced contextual understanding [12,13,14]. Recent methods like Label-aware Contrastive Loss (LCL) explicitly model inter-class relationships by pulling similar sentiment classes closer in embedding space and pushing dissimilar ones apart, aiming to improve fine-grained discrimination through the learned context [1]. However, they rely heavily on label structure and may struggle when class boundaries are inherently fuzzy or when data is sparse. A study on orthogonally constrained Long Short-term Memory (LSTM) states with multi-scale windows aims to disentangle redundant features and capture semantic diversity by enforcing independence across learned representations [15]. While this improves representational clarity, it often lacks adaptability to specific contextual shifts, especially in short or noisy texts. Deformable self-attention for text classification (DLAWG) [13] enhances token-level representation by dynamically learning window sizes and applying variable-size self-attention, improving context granularity. Still, it depends on optimal window learning and may miss global emotional cues essential for distinguishing sentiment intensity.

Overall, while these methods make progress by improving representation structure or attention flexibility, they often lack the ability to modulate emotional intensity in a context-sensitive manner, which limits their performance in fine-grained sentiment classification when classes are closely confusable or linguistic cues are sparse.

To address these limitations, we propose an emotion-aware bidirectional gating mechanism that leverages both contextual and emotional understanding for fine-grained sentiment classification, inspired by gating mechanisms in sequential and multimodal fusion methods [16,17], where modality-specific features are weighted to emphasize the most informative signals. We extract early and late hidden states from the BiLSTM since emotional cues tend to appear predominantly at the beginning and end of text, especially in short texts [18,19]. These positions provide strong sentiment signals that, when modulated with auxiliary emotion representations, help in identifying fine-grained sentiment more effectively.

These contextual features are processed through a gating layer to produce emotionally sensitive intermediate representations. Unlike methods that rely solely on implicit emotional understanding, we explicitly modulate these representations using auxiliary emotion signals learned from an external, emotion-labeled domain (e.g., Reddit posts). This cross-domain emotional knowledge transfer enhances the model’s ability to recognize subtle emotional cues that are critical for distinguishing fine-grained sentiment classes. By combining deep contextual encoding with explicitly guided emotional modulation, our model gains a significant advantage in interpreting nuanced sentiment variations that may be under-represented or ambiguous in the primary task data.

The key contributions of our approach include the following:

Emotion-Aware Bidirectional Gating Mechanism: We introduce a novel gating architecture leveraging both early (forward) and late (backward) BiLSTM contexts to generate emotionally sensitive representations, enabling adaptive modulation of contextual semantics for fine-grained sentiment classification.
Cross-Domain Emotion Modulation via Auxiliary Signals: Our method incorporates auxiliary emotion signals learned from an external emotion-labeled dataset, enabling effective emotional knowledge transfer that enhances discrimination between closely related sentiment classes.
Model Interpretability: We evaluate the asymmetric gating strategy, which dynamically adjusts emotional influence based on linguistic context and is modulated through auxiliary emotions. Comprehensive ablation studies and emotion weight analyses offer interpretable insights into how emotional cues are amplified or suppressed across sentiment classes.

We conducted comprehensive experiments to evaluate our model’s effectiveness in fine-grained sentiment classification. Ablation studies, including base model isolation, gating mechanism removal, and emotion feature exclusion, were performed to assess the contribution of each component. We further investigated the impact of different pretrained embeddings and fine-tuning strategies. Our approach was benchmarked against state-of-the-art methods and LLMs using metrics such as precision, recall, F1-score, Mean Absolute Error (MAE), and accuracy. Additionally, confusion matrix analysis provided insights into classification challenges among closely related classes. For model interpretability, we analyzed feature importance and examined how emotional signals influence sentiment predictions across various text contexts, including illustrative example posts, highlighting the interplay between emotion and sentiment.

The remainder of this paper is structured as follows: Section 2 reviews the related work on fine-grained sentiment classification. Section 3 details the proposed Electra-BiG-Emo method. Section 4 presents the experimental design, and Section 5 reports the results, including comparisons with state-of-the-art models, multi-step ablation studies, and interpretability analysis. Section 6 provides a discussion, and Section 7 concludes the paper.

2. Related Work

Fine-grained sentiment classification is challenging due to subtle polarity shifts, limited context in short texts, and overlapping emotional cues. To address these issues, models have evolved to combine deep semantic encoding with context-aware architectures, like BiLSTM or CNNs, while integrating attention mechanisms [20,21] to highlight sentiment-relevant features. Techniques such as emotion-aware representations, dynamic context adaptation, and data augmentation help improve inter-class discrimination and enhance the model’s ability to detect nuanced emotional differences, leading to more accurate and robust fine-grained sentiment predictions.

LLMs have shown strong performance in binary sentiment classification due to their ability to model general semantic patterns, especially when fine-tuned on sentiment-labeled datasets [11,20,22,23]. However, in fine-grained sentiment classification, which requires distinguishing between closely related sentiment classes, there is still substantial room for improvement [6]. To bridge this gap, hybrid models that combine LLMs with sequential architectures (e.g., BiLSTM or CNNs) have been explored to jointly capture long-range dependencies and local syntactic cues. For example, ref. [7] leverages Bidirectional Encoder Representations from Transformers (BERT)-pretrained word embedding vectors to aid model fine-tuning and investigates the impact of hybridizing Bidirectional Gated Recurrent Unit (BiGRU) and BiLSTM layers on BERT variants, aiming to enhance classification accuracy by considering emotions in the textual context. Meanwhile, ref. [8] utilizes FastText embeddings trained alongside CNNs to learn local patterns for improved text classification. To capture long-distance dependencies within textual data, the model in ref. [13] integrates the Robustly Optimized BERT Pretraining Approach (RoBERTa) with an LSTM network. RoBERTa encodes words into dense, semantically rich embeddings, while the LSTM component captures extended contextual dependencies, enabling more effective sentiment representation through a combination of global semantic encoding and sequential contextual modeling. Nevertheless, these hybrid approaches still struggle with short texts, where limited context makes it difficult to balance local and global representations [2]. Moreover, their static architectures lack the flexibility to adapt dynamically to context-sensitive requirements, which restricts their effectiveness in handling subtle distinctions in fine-grained sentiment tasks.

Short texts present another major challenge due to their limited n-gram coverage, which reduces a model’s ability to capture nuanced semantics. Ref. [8] addresses this by disentangling LSTM states with orthogonal constraints and employing multi-scale temporal windows. This design enables efficient extraction of non-redundant semantic features and better represents fine-grained sentiments even in sparse contexts. Although this parameter-efficient architecture improves sentence representations, it can still struggle when long-range dependencies play a dominant role.

Fixed-window models often fail to distinguish when short- or long-range context is more informative. To address this, ref. [13] dynamically determines optimal context sizes via a learned network, applying variable-size local self-attention and integrating multi-range features through a multi-range fusion interface. This adaptive mechanism significantly improves performance in identifying subtle sentiment shifts by overcoming static window constraints.

Several studies aim to enhance inter-class distinction through tailored objectives. For instance, Label-aware Contrastive Loss [1] enforces class-specific margin constraints, improving discrimination among similar sentiment categories. Meanwhile, ref. [8] also applies orthogonal constraints to disentangle LSTM states, thereby reducing redundancy and promoting the capture of class-relevant signals.

Fine-grained classification benefits from expanded training data that captures within-class semantic diversity. Ref. [24] addresses this by generating class-preserving paraphrases using LLMs like GPT-3 and OPT-175B, which enhances semantic understanding and data diversity. This augmentation improves generalization by training models on richer, syntactically varied inputs. Similarly, the RoBERTa-LSTM hybrid model utilizes such augmented data for better token-level and contextual learning.

Emotion-based representation learning has been increasingly used to guide sentiment understanding [20,25,26,27]. Ref. [27] introduces an attention embedding module that assigns contextual importance to emotionally salient words after extracting embeddings from LLMs, enabling downstream deep learning models (e.g., recurrent CNNs) to better learn local semantic patterns. This aids in differentiating sentiment intensities, particularly in cases where emotional cues overlap across fine-grained sentiment classes. Additionally, ref. [12] adopts a feature fusion strategy using handcrafted patterns, such as sentiment shift metrics and consecutive flips, which provide domain-specific cues for modeling nuanced transitions in sentiment, on the assumption that linguistically meaningful patterns capture subtle sentiment shifts. Furthermore, ref. [28] introduces a dual-perspective attention mechanism designed to enhance short text classification. It captures semantic dependencies from both feature-level and token-level views using two distinct attention modules, which are then fused via a woven attention mechanism to form a comprehensive text-level representation. To address semantic sparsity, which is common in short texts, the model integrates a question-answering (QA)-based label prompt, which is concatenated with the input and processed jointly through the attention framework. This prompt-augmented input enables the model to better align with target labels, improving the granularity and robustness of classification decisions.

Despite these advancements, there remains a critical gap in effectively capturing emotional subtleties that define fine-grained sentiment, especially in short or contextually ambiguous texts. While prior models leverage global semantics and local patterns, they often overlook the nuanced emotional context that underpins sentiment expression. To address this, our research proposes an emotion-aware bidirectional gating mechanism that not only learns emotions directly from contextual embeddings but also refines these emotional cues through auxiliary emotional signals. By integrating affective awareness into a BiLSTM-based framework, the model dynamically emphasizes emotionally salient features, enabling more precise differentiation between closely related sentiment classes. This context-sensitive emotional modeling enhances the granularity and robustness of fine-grained sentiment classification, particularly in emotionally rich or semantically sparse texts.

3. Methodology

We propose a novel deep learning model, Electra-BiG-Emo. Our model introduces an emotion-aware bidirectional gating mechanism that effectively integrates contextual representations with external emotional cues learned through transfer learning. By dynamically modulating emotion signals using bidirectional LSTM states and asymmetric gating, the model enhances both sentiment classification performance and the interpretability of the decision-making process. This approach enables more precise differentiation between closely related sentiment classes, particularly in short or ambiguous texts where emotional intensity plays a crucial role.

This section describes the data, the pre-processing steps, and the architecture of the proposed method (Figure 1), including the transfer learning approach used to learn emotional cues from the text.

3.1. Data Description

We utilised the GoEmotions dataset to fine-tune the Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA) LLM to obtain emotional representation for the target sentiment datasets. GoEmotions is a dataset of 58k Reddit comments annotated with 27 emotions and neutral, curated from diverse subreddits spanning mental health, relationships, technical, humorous, and general topics to minimize domain bias [29]. To extract emotion features for our target fine-grained sentiment datasets, we fine-tuned multiple LLMs (BERT, RoBERTa, ELECTRA) on GoEmotions using a semi-supervised approach. Emotion predictions were retained based on strong sentiment–emotion correlation ($\rho \geq 0.5$). By analyzing the correlation between emotions and sentiments, emotional features can be integrated into sentiment analysis models for more precise, fine-grained sentiment identification.
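The retention rule above can be illustrated with a minimal sketch. The text states only the threshold $\rho \geq 0.5$; the use of Pearson correlation, the comparison on the absolute value (so that a strongly negative link such as anger versus a numeric sentiment scale is also kept), the function names, and the toy values are all assumptions made for illustration and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def retain_emotions(emotion_probs, sentiment_labels, threshold=0.5):
    """Keep emotions whose |rho| with the sentiment labels meets the threshold.

    emotion_probs: dict mapping emotion name -> list of per-text probabilities.
    sentiment_labels: list of numeric sentiment labels (e.g. -2..+2).
    """
    kept = {}
    for name, probs in emotion_probs.items():
        rho = pearson(probs, sentiment_labels)
        if abs(rho) >= threshold:  # assumption: sign-agnostic threshold
            kept[name] = rho
    return kept

# Toy, made-up values purely for illustration.
labels = [-2, -1, 0, 1, 2]
probs = {
    "joy": [0.05, 0.10, 0.30, 0.70, 0.90],
    "anger": [0.90, 0.60, 0.20, 0.10, 0.05],
    "curiosity": [0.20, 0.10, 0.20, 0.10, 0.20],
}
kept = retain_emotions(probs, labels)
```

In this toy setting, joy and anger track the sentiment scale closely (in opposite directions) and are retained, while the weakly related third emotion is dropped.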

To further broaden coverage of emotional nuances, we follow prior work [30], which clusters highly correlated fine-grained emotions into broader affective dimensions (e.g., disappointment and disgust align with anger; surprise and excitement align with joy), alongside a neutral category. This grouping preserves nuanced affective signals while preventing sparsity from under-represented categories. For fine-tuning, we primarily focused on Reddit comments belonging to these classes, as they showed the strongest correlations with the sentiment labels. For instance, if anger strongly correlates with negative sentiment, its detection can guide the model toward better sentiment interpretation. These emotion-based cues serve as an additional information layer, improving classification accuracy, particularly in ambiguous cases such as sarcasm or mixed sentiments.

To comprehensively evaluate the proposed Electra-BiG-Emo (ELECTRA-Bidirectional Gated Emotion) model, we conducted experiments on three publicly available fine-grained sentiment datasets. These datasets were selected to represent diverse domains, text lengths, and annotation granularities, ensuring robust evaluation across different sentiment analysis scenarios. Additionally, we employed an auxiliary emotion dataset for emotion feature extraction. The Stanford Sentiment Treebank (SST-5) comprises 11,855 sentences extracted from movie reviews on Rotten Tomatoes. Each sentence is annotated with fine-grained sentiment labels across five categories: very negative, negative, neutral, positive, and very positive. With an average length of 19.3 tokens and a relatively balanced class distribution (21.3% very negative, 23.2% negative, 22.3% neutral, 20.1% positive, 13.1% very positive), SST-5 serves as a primary benchmark for evaluating fine-grained sentiment classification performance.

The dataset from SemEval 2017 Task 4 [31] contains 28,630 tweets annotated on a five-point sentiment scale toward specific topics (−2: very negative, −1: negative, 0: neutral, +1: positive, +2: very positive). The tweets average 15.7 tokens, including social media elements such as hashtags and mentions. The class distribution is naturally imbalanced (8.2% very negative, 18.5% negative, 40.1% neutral, 24.3% positive, 8.9% very positive), reflecting real-world Twitter sentiment patterns.

The Twitter dataset contains 14,640 tweets directed at US airlines, categorized into three sentiment classes: positive, negative, and neutral. With an average length of 14.2 tokens, the dataset exhibits severe class imbalance (positive: 16.3%, negative: 62.7%, neutral: 21.0%), predominantly featuring negative sentiment due to the complaint-heavy nature of airline-related social media discourse.

Dataset Pre-Processing

For the SemEval 2017 Task 4C dataset and US Airline Twitter dataset, the popular Ekphrasis library [31,32] was used for pre-processing, including normalization of URLs, dates, times, usernames, and currency. Annotation steps addressed hashtags, all-caps text, elongated and repeated words, emphasis (e.g., asterisks), and censored words. Contractions were also unpacked to further enhance data quality and relevance.

For the GoEmotions dataset, emojis were removed using a regular expression that matches a wide range of Unicode emoji characters. URLs in the text were identified using regex patterns and removed to eliminate noisy, irrelevant content. Further, all special characters and punctuation (except @ and #) were stripped from the text. Words beginning with @ (mentions) or # (hashtags) were then removed to focus on the core text.
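A minimal sketch of the cleaning steps described for the Reddit comments is given below. The exact regular expressions used by the authors are not given in the text, so the emoji ranges, the URL pattern, the choice to replace removed spans with a space, and the function name `clean_comment` are illustrative assumptions.

```python
import re

# Assumption: a compact emoji pattern covering common Unicode emoji blocks;
# the authors' exact expression is not given in the text.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that follows some emoji
    "]+"
)
URL_RE = re.compile(r"(https?://\S+|www\.\S+)")
SPECIAL_RE = re.compile(r"[^\w\s@#]")      # everything except word chars, spaces, @ and #
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")  # words beginning with @ or #

def clean_comment(text: str) -> str:
    """Apply the four cleaning steps in the order given in the text."""
    text = EMOJI_RE.sub(" ", text)            # 1. remove emojis
    text = URL_RE.sub(" ", text)              # 2. remove URLs
    text = SPECIAL_RE.sub(" ", text)          # 3. strip punctuation except @ and #
    text = MENTION_HASHTAG_RE.sub(" ", text)  # 4. drop mentions and hashtags
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

cleaned = clean_comment("Thanks @bob! See https://example.com #happy \U0001F600 day")
```

Keeping @ and # through step 3 is what lets step 4 find and drop whole mention and hashtag tokens rather than leaving their bare words behind.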

For the SST-5 dataset, we applied lower casing and handled contractions for text normalization. Further, we filtered out very short or empty sentences after cleaning.

3.2. Extracting Emotional Features

We fine-tuned BERT, RoBERTa, and ELECTRA on the GoEmotions dataset to generate emotion probability distributions and evaluated their correlation with sentiment labels on the SST-5, SemEval 2017 Task 4C, and Twitter sentiment datasets. We tested multiple LLMs because each captures emotional nuances differently due to its distinct pretraining strategy. This comparative evaluation ensured robustness and reduced model bias. Figure 2 shows the correlation between the emotions predicted by the different LLMs and the sentiment labels.

Based on emotion–sentiment correlation analysis, ELECTRA demonstrated the strongest alignment with sentiment labels across datasets. Therefore, we selected ELECTRA as the optimal choice for generating emotion probabilities used in the final sentiment classification model. Here we considered highly correlated emotions, assuming they are the most strongly linked to specific sentiment labels. Using these emotions helps the model better understand and predict the correct sentiment, especially in cases where the sentiment is subtle or mixed. This improves the accuracy and reliability of sentiment classification.

To obtain emotion representations, we apply softmax activation over the output logits of a fine-tuned Electra emotion classifier. This yields a probability distribution over predefined emotion classes, which serves as a dense emotional feature vector for each input sample. Formally, given an input text, the classifier outputs a logit vector $\ell = [\ell_1, \dots, \ell_k]$ with one score per emotion category. The probability of emotion category $i$ is computed using the softmax function as:

$$e_i = \frac{\exp(\ell_i)}{\sum_{j=1}^{k} \exp(\ell_j)}, \quad \text{for } i = 1, \dots, k \tag{1}$$

Here, $e_i$ represents the emotion probability of category $i$, and $k$ is the number of emotion categories (in our study, $k = 3$, where the categories are joy, neutral, and anger).

The emotion representation vector $e$ for a text then becomes

$$e = [e_1, e_2, \dots, e_k] \tag{2}$$

where

$$e \in [0, 1]^k \quad \text{and} \quad \sum_{i=1}^{k} e_i = 1 \tag{3}$$
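Equations (1) to (3) can be checked numerically with a short sketch. The logit values below are made up, and subtracting the maximum logit before exponentiating is a standard numerical-stability step that is assumed here rather than described in the text; it does not change the resulting probabilities.

```python
import math

EMOTIONS = ("joy", "neutral", "anger")  # the k = 3 categories named in the text

def softmax(logits):
    """Softmax over a list of logits; shifting by the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

logits = [2.0, 0.5, -1.0]  # made-up classifier outputs for one text
e = softmax(logits)        # emotion representation vector of Equation (2)
emotion_probs = dict(zip(EMOTIONS, e))
```

The resulting vector satisfies the constraints in Equation (3): every entry lies in [0, 1] and the entries sum to 1.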

3.3. Textual Encoding

We employ the $\text{ELECTRA}_{base}$ (electra-base-discriminator) model from HuggingFace’s Transformers library [1] to initialize both the encoder and weighting network. The model comprises 12 Transformer layers with a hidden size of 768. Rather than relying solely on the [CLS] token, we utilize the final-layer token embeddings as input to the BiLSTM, enabling effective sequence-level processing, as in the formula below.

$$(\overrightarrow{h}_{forward}, \overleftarrow{h}_{backward}) = \text{BiLSTM}(\text{ELECTRA}(X, M))$$

where $\text{ELECTRA}(X, M)$ yields the last-layer token embeddings of the Electra model for the input $X$, $M$ is the attention mask, and $\overrightarrow{h}_{forward}$ and $\overleftarrow{h}_{backward}$ are the BiLSTM forward and backward hidden states.

3.4. Extracting Contextual Features

Large Language Models (LLMs) encode words into a compact and semantically rich embedding space, capturing nuanced meanings based on vast contextual knowledge. In contrast, recurrent neural network architectures, such as LSTM, are specifically designed to process sequential data, effectively modeling long-range dependencies. Therefore, recent studies have implemented hybrid deep learning methods that combine the strengths of sequence models and Transformer models [33]. LSTM networks use three primary gates (the input gate, forget gate, and output gate), which regulate the flow of information by deciding what to store, forget, and output at each time step, enabling effective learning of long-term dependencies in sequential data [33].

Extracting BiLSTM Textual Contexts

BiLSTM is a bidirectional extension of LSTM that processes sequences in both forward and backward directions to capture context from past and future time steps [33]. The proposed model uses a BiLSTM that processes text embeddings in both forward (left-to-right) and backward (right-to-left) directions, and we use the early and late textual contexts to extract emotional cues. For a sequence $X = (x_1, x_2, \dots, x_T)$, where $x_t \in \mathbb{R}^{d_{text}}$, the hidden state updates in the forward direction at time step $t$ are:

$$\begin{aligned}
f_t^{\to} &= \sigma(W_f^{\to} [h_{t-1}^{\to}, x_t] + b_f^{\to}) && \text{(Forget gate)} \\
i_t^{\to} &= \sigma(W_i^{\to} [h_{t-1}^{\to}, x_t] + b_i^{\to}) && \text{(Input gate)} \\
o_t^{\to} &= \sigma(W_o^{\to} [h_{t-1}^{\to}, x_t] + b_o^{\to}) && \text{(Output gate)} \\
\tilde{c}_t^{\to} &= \tanh(W_c^{\to} [h_{t-1}^{\to}, x_t] + b_c^{\to}) && \text{(Candidate)} \\
c_t^{\to} &= f_t^{\to} \odot c_{t-1}^{\to} + i_t^{\to} \odot \tilde{c}_t^{\to} && \text{(Update)} \\
h_t^{\to} &= o_t^{\to} \odot \tanh(c_t^{\to}) && \text{(Hidden)}
\end{aligned}$$

where $W$ and $b$ represent the weight and bias values for each gate, $h_t$ represents the hidden state at time $t$, and the forward arrow denotes the forward direction.

In our model we take the final forward state (last_forward) since it encodes information from the beginning to the end of the sequence, and we want the emotional cues from the early context.

$$h_{forward} = h_T^{\to} \in \mathbb{R}^{d_{hidden}}.$$

Similarly, the BiLSTM processes the sequence in reverse, $X^{rev} = (x_T, x_{T-1}, \dots, x_2, x_1)$, and the hidden state updates in the backward direction at time step $t$ are:

$$\begin{aligned}
f_t^{\leftarrow} &= \sigma(W_f^{\leftarrow} [h_{t+1}^{\leftarrow}, x_t] + b_f^{\leftarrow}) && \text{(Forget gate)} \\
i_t^{\leftarrow} &= \sigma(W_i^{\leftarrow} [h_{t+1}^{\leftarrow}, x_t] + b_i^{\leftarrow}) && \text{(Input gate)} \\
o_t^{\leftarrow} &= \sigma(W_o^{\leftarrow} [h_{t+1}^{\leftarrow}, x_t] + b_o^{\leftarrow}) && \text{(Output gate)} \\
\tilde{c}_t^{\leftarrow} &= \tanh(W_c^{\leftarrow} [h_{t+1}^{\leftarrow}, x_t] + b_c^{\leftarrow}) && \text{(Candidate)} \\
c_t^{\leftarrow} &= f_t^{\leftarrow} \odot c_{t+1}^{\leftarrow} + i_t^{\leftarrow} \odot \tilde{c}_t^{\leftarrow} && \text{(Update)} \\
h_t^{\leftarrow} &= o_t^{\leftarrow} \odot \tanh(c_t^{\leftarrow}) && \text{(Hidden)}
\end{aligned}$$

where $W$ and $b$ represent the weight and bias values for each gate, $h_t$ represents the hidden state at time $t$, and the backward arrow denotes the backward direction.

In our model we take the last_backward state since it encodes information from the end to the beginning, and we want the emotional cues from the later context.

$$h_{backward} = h_1^{\leftarrow} \in \mathbb{R}^{d_{hidden}}.$$

Finally, we concatenate $h_{forward}$ and $h_{backward}$ to obtain a summary of the entire sequence from both directions.
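The forward and backward recurrences above can be traced with a small pure-Python sketch. It is an illustration only: the dimensions, the random weight initialisation, and all function names are assumptions, and a practical implementation would use a deep learning framework with trained weights rather than this hand-written loop.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_params(d_in, d_hidden, rng):
    """Random (untrained) weights for the forget, input, output, candidate transforms."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden + d_in)]
                for _ in range(d_hidden)]
    return {g: (mat(), [0.0] * d_hidden) for g in "fioc"}

def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM update: gates, candidate, cell update, hidden state."""
    hx = h_prev + x_t  # list concatenation, i.e. [h_prev, x_t]
    def affine(g):
        W, b = params[g]
        return [a + bi for a, bi in zip(matvec(W, hx), b)]
    f = [sigmoid(v) for v in affine("f")]
    i = [sigmoid(v) for v in affine("i")]
    o = [sigmoid(v) for v in affine("o")]
    c_tilde = [math.tanh(v) for v in affine("c")]
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def run_direction(params, xs, d_hidden):
    """Run the recurrence over xs in the given order and return the last hidden state."""
    h, c = [0.0] * d_hidden, [0.0] * d_hidden
    for x_t in xs:
        h, c = lstm_step(params, h, c, x_t)
    return h

def bilstm_summary(xs, fwd_params, bwd_params, d_hidden):
    h_forward = run_direction(fwd_params, xs, d_hidden)         # state after x_T
    h_backward = run_direction(bwd_params, xs[::-1], d_hidden)  # state after x_1
    return h_forward, h_backward, h_forward + h_backward        # concatenated summary

rng = random.Random(0)
d_text, d_hidden, T = 4, 3, 5
xs = [[rng.uniform(-1, 1) for _ in range(d_text)] for _ in range(T)]
fwd = init_params(d_text, d_hidden, rng)
bwd = init_params(d_text, d_hidden, rng)
h_forward, h_backward, h_cat = bilstm_summary(xs, fwd, bwd, d_hidden)
```

Reversing the input list and reusing the same step function is equivalent to the backward recurrence, since the state that follows $x_t$ in the reversed order is the one indexed $t+1$ in the equations.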

3.5. Dynamic Gating and Modulation of Emotional Features

Inspired by the study [34] on gated units and the adaptive gated fusion mechanism for dynamically weighting multi-source auxiliary features [17], we developed an asymmetric gating mechanism to dynamically modulate auxiliary emotion features using bidirectional textual context. Unlike traditional approaches that statically fuse or attend to emotion representations, we propose directional gating in which separate temperature-scaled sigmoid gates, conditioned on the forward and backward states of a BiLSTM, multiplicatively reweight the auxiliary emotion features. The forward-state gate captures early-context sentiment trends, while the backward-state gate detects late context where ambiguity exists; together they enable fine-grained emotion suppression or amplification. This design enforces context-aware emotion calibration. The temperature parameters serve as tunable confidence thresholds, making the model adaptable to varying emotion-label noise levels. By concatenating these gated emotion features with the BiLSTM’s final states, we create a representation in which emotions are explicitly vetted against bidirectional linguistic evidence, shifting from static fusion to interpretable, context-conditioned emotion refinement.

We first project the LSTM hidden states ($h_{forward}$ and $h_{backward}$) to the emotion dimension with a linear transform to create context-aware gates. For the forward direction:

$$\begin{aligned}
g_{logits}^{forward} &= W_f h_T^{\to} + b_f \\
g_f &= \sigma\left(\frac{g_{logits}^{forward}}{\tau_f}\right)
\end{aligned}$$

where $W_f$ and $b_f$ are the weights and bias applied to the forward state $h_{forward}$ (that is, $h_T^{\to}$), and $\tau_f$ is the forward temperature.

We apply temperature-scaled activation to dynamically control the gates’ weighting of emotion features based on training progression. The temperature parameter scales gate logits before the sigmoid, controlling the sharpness of gating decisions [12,35]. Temperatures above 1 soften outputs toward 0.5 for conservative modulation, which is useful in noisy or ambiguous contexts. Temperatures below 1 sharpen outputs toward 0 or 1, enabling stronger suppression or amplification when context clearly supports or contradicts auxiliary emotions. In the forward gate we use a higher temperature (τ_{f} = 1.5) to maintain softer, more exploratory gating during initial learning, preventing premature overconfidence in emotion–sentiment correlations. Conversely, the backward gate employs a lower temperature (τ_{b} = 0.7) to enforce sharper, more selective feature weighting in later stages, enabling precise emotion modulation for fine-grained sentiment analysis. In the formula below, W_{b} and b_{b} are the weights and bias applied to the backward state h_{backward}, represented as h_{1}^{\leftarrow}, and τ_{b} represents the backward temperature.

\begin{matrix} g_{logits}^{backward} & = W_{b} h_{1}^{\leftarrow} + b_{b} \\ g_{b} & = σ (\frac{g_{logits}^{backward}}{τ_{b}}) \end{matrix}
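The effect of the two temperatures can be checked numerically. The following is a minimal sketch, assuming scalar gate logits chosen purely for illustration (they are not model outputs); the function name `gate` is hypothetical. It evaluates σ(logit / τ) at the two temperatures stated in the text.

```python
import math


def gate(logit: float, temperature: float) -> float:
    """Temperature-scaled sigmoid: sigma(logit / temperature)."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


TAU_FORWARD = 1.5   # forward-gate temperature stated in the text
TAU_BACKWARD = 0.7  # backward-gate temperature stated in the text

# Illustrative logits only; real logits come from W h + b.
for logit in (-2.0, -0.5, 0.0, 0.5, 2.0):
    soft = gate(logit, TAU_FORWARD)
    sharp = gate(logit, TAU_BACKWARD)
    print(f"logit={logit:+.1f}  tau=1.5 -> {soft:.3f}  tau=0.7 -> {sharp:.3f}")
```

For any nonzero logit, the τ = 1.5 gate stays closer to 0.5 than the τ = 0.7 gate, which is the softer versus sharper behavior described above.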

For context-aware emotion modulation in the fine-grained sentiment analysis model, the gates act as soft switches that amplify or suppress the auxiliary emotion representation e. The formulas below give the gated emotions E_{f} and E_{b} in the forward and backward directions, where ⊙ denotes element-wise multiplication.

\begin{matrix} E_{f} = g_{f} ⊙ e \\ E_{b} = g_{b} ⊙ e \end{matrix}
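The gating equations can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the hidden size (2), the three emotion dimensions, and all weight and state values are toy numbers, and the helper names (`linear`, `emotion_gate`, `modulate`) are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Compute W x + b for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def emotion_gate(h, weights, bias, tau):
    """g = sigma((W h + b) / tau), one gate value per emotion dimension."""
    return [sigmoid(v / tau) for v in linear(weights, bias, h)]


def modulate(gate_values, e):
    """Element-wise product g * e."""
    return [g * ei for g, ei in zip(gate_values, e)]


# Toy values (assumptions): 2-dim hidden states, 3 emotion probabilities.
h_forward = [0.8, -0.3]
h_backward = [-0.5, 0.9]
W_f = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_b = [[0.5, -0.5], [1.0, 1.0], [0.0, 1.0]]
b_f = [0.0, 0.0, 0.0]
b_b = [0.0, 0.0, 0.0]
e = [0.7, 0.1, 0.2]

g_f = emotion_gate(h_forward, W_f, b_f, tau=1.5)
g_b = emotion_gate(h_backward, W_b, b_b, tau=0.7)
E_f = modulate(g_f, e)
E_b = modulate(g_b, e)
print(E_f, E_b)
```

Because each gate value lies strictly between 0 and 1, every gated emotion is a scaled-down copy of its input; relative amplification comes from one emotion receiving a larger gate value than another.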

3.6. Feature Fusion and Classification

The model employs concatenation-based feature fusion that combines bidirectional contextual representations with asymmetrically gated emotion features. Specifically, it merges four components: (1) forward context embedding from the final BiLSTM forward state, (2) backward context embedding from the initial BiLSTM backward state, (3) forward-gated emotion features modulated by the forward context, and (4) backward-gated emotion features modulated by the backward context.

The fused feature vector z (specified in the formula below) is passed through a softmax classifier, where W_{c} and b_{c} are the weights and bias, to produce the final prediction. A linear transformation layer projects the concatenated features into the output class space, followed by a softmax activation that converts logits into probability distributions over target classes. This enables multi-class classification while providing interpretable confidence scores for each predicted class.

\begin{matrix} z = [h_{forward} ‖ h_{backward} ‖ E_{f} ‖ E_{b}] \\ \hat{y} = softmax (W_{c} z + b_{c}) \end{matrix}
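The fusion and classification step can be sketched as below. This is a minimal illustration under assumptions: the vector sizes, the five output classes, and the weight values are toy choices, and `classify` is a hypothetical helper name rather than a function from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def classify(h_forward, h_backward, E_f, E_b, W_c, b_c):
    """Concatenate the four components into z, then apply softmax(W_c z + b_c)."""
    z = h_forward + h_backward + E_f + E_b  # list concatenation
    if any(len(row) != len(z) for row in W_c):
        raise ValueError("each row of W_c must match the length of z")
    logits = [sum(w * zi for w, zi in zip(row, z)) + b
              for row, b in zip(W_c, b_c)]
    return z, softmax(logits)


# Toy inputs (assumptions): 2-dim states, 3-dim gated emotions, 5 classes.
h_forward = [0.8, -0.3]
h_backward = [-0.5, 0.9]
E_f = [0.44, 0.05, 0.11]
E_b = [0.10, 0.06, 0.16]
W_c = [[0.01 * (r + c) for c in range(10)] for r in range(5)]
b_c = [0.0] * 5

z, probs = classify(h_forward, h_backward, E_f, E_b, W_c, b_c)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(len(z), probs, predicted_class)
```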

Algorithm 1 outlines the core methodology detailed throughout this section.

Algorithm 1: Asymmetric Context-Aware Emotion Gating and Modulating

Input:
● Input tokens: X = {x_{1}, \dots, x_{n}}
● Attention mask: M
● Emotion probabilities: E \in R^{d}

Output:
● Prediction: \hat{y}
● Gate values: g_{f}, g_{b}

H \leftarrow Electra (X, M) // H \in R^{n \times h}
(\vec{h}, \overset{\leftarrow}{h}) \leftarrow BiLSTM (H) // Forward/backward
h_{forward} \leftarrow {\vec{h}}_{T} // Last forward state
h_{backward} \leftarrow {\overset{\leftarrow}{h}}_{1} // First backward state
\tilde{E} \leftarrow W_{e} E + b_{e} // Emotion projection
g_{f} \leftarrow σ ((W_{1} h_{forward} + b_{1}) / τ_{1}) // Forward gate
g_{b} \leftarrow σ ((W_{2} h_{backward} + b_{2}) / τ_{2}) // Backward gate
{\hat{E}}_{f} \leftarrow \tilde{E} ⊙ g_{f} // Forward-gated
{\hat{E}}_{b} \leftarrow \tilde{E} ⊙ g_{b} // Backward-gated
z \leftarrow [h_{forward} ∥ h_{backward} ∥ {\hat{E}}_{f} ∥ {\hat{E}}_{b}] // Fusion
\hat{y} \leftarrow softmax (W_{c} z + b_{c})
return \hat{y}, g_{f}, g_{b}

4. Experimental Design

To ensure robust and unbiased evaluation, we employed a rigorous 10-fold stratified cross-validation approach across all three datasets. This methodology minimizes variance in performance estimates and provides statistically reliable results. Each fold preserves the original class distribution of the entire dataset, preventing disproportionate representation of minority classes in any single fold.
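The stratification idea can be sketched as follows. This is an illustrative round-robin assignment, not the authors' code (a library routine such as scikit-learn's `StratifiedKFold` would normally be used); the function name, the seed, and the toy label counts are assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample a fold index in 0..k-1 so that every class is
    spread as evenly as possible across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # randomize which samples land in which fold
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k  # deal samples out round-robin
    return fold_of


# Toy label distribution (assumption), not taken from any of the datasets.
labels = ["negative"] * 40 + ["neutral"] * 30 + ["positive"] * 20
folds = stratified_folds(labels, k=10)
fold_0 = Counter(lab for lab, f in zip(labels, folds) if f == 0)
print(fold_0)
```

Each fold then serves once as the test split while the remaining nine are used for training.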

For comprehensive comparison against standalone LLMs, we conducted statistical significance testing for each evaluation metric (accuracy, precision, recall, and F1-score) using paired t-tests across the 10 folds. This analysis quantifies whether the performance improvements of our proposed Electra-BiG-Emo architecture over standalone models are statistically significant.

4.1. Evaluation Metrics

To comprehensively evaluate the performance of the proposed Electra-BiG-Emo model, we employ four standard classification metrics, accuracy, precision, recall, and F1-score, defined as follows:

\begin{matrix} Accuracy = \frac{T P + T N}{T P + T N + F P + F N} \end{matrix}

(4)

\begin{matrix} Precision = \frac{T P}{T P + F P} \end{matrix}

(5)

\begin{matrix} Recall = \frac{T P}{T P + F N} \end{matrix}

(6)

\begin{matrix} F_{1} = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{2 \times T P}{2 \times T P + F P + F N} \end{matrix}

(7)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

For multi-class classification with C classes, we compute the macro-averaged (equal weight to all classes) precision (5), recall (6), and F1-score (7), treating each class in turn as the positive class. Accuracy (4) is reported as the overall classification rate. All experimental results are reported as averages across 10-fold stratified cross-validation.
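A short sketch of the macro-averaged metrics follows. It rests on two assumptions that the text does not state: macro F1 is taken as the mean of per-class F1 scores, and a metric with a zero denominator is set to 0. The function name and the toy labels are illustrative.

```python
def macro_metrics(y_true, y_pred, classes):
    """One-vs-rest precision, recall, and F1 per class, averaged with equal weight."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)

    n = len(classes)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels (assumption) for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
scores = macro_metrics(y_true, y_pred, classes=[0, 1, 2])
print(scores)
```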

4.2. Implementation Environment

The proposed Electra-BiG-Emo model was implemented using PyTorch 2.0.1 and HuggingFace Transformers 4.31.0. Experiments were conducted on a high-performance computing cluster equipped with 2× NVIDIA V100 GPUs (32 GB VRAM each), four CPU cores, and 16 GB RAM, managed via PBS job scheduling with a 2-hour runtime allocation. We employed a comprehensive software stack including NumPy 1.23.5, scikit-learn 1.3.0, and specialized NLP libraries. All experiments followed a strict 10-fold stratified cross-validation protocol with fixed random seeds to ensure reproducibility.

4.3. Hyperparameter Search

We conducted an exhaustive grid search across parameter ranges established through empirical transformer research and architectural considerations. For LSTM dimensions, we searched over 64, 128, and 256, as they correspond to fractions of the 768-dimensional ELECTRA embeddings. Learning rates of 5 × 10⁻⁶, 1 × 10⁻⁵, 2 × 10⁻⁵, 5 × 10⁻⁵, and 1 × 10⁻⁴ were examined, following established fine-tuning recommendations. Batch sizes of 8, 16, 32, and 64 were considered to address memory–performance trade-offs. Dropout rates of 0.1, 0.2, 0.3, 0.4, and 0.5 span typical regularization strengths. The temperature ranges, τ_{first} ∈ {0.5, 1.0, 1.5, 2.0, 2.5} for the forward gate and τ_{last} ∈ {0.3, 0.6, 0.9, 1.2, 1.5} for the backward gate, reflect the asymmetric gating requirements identified in preliminary experiments. This comprehensive search across different configurations ensures robust optimization while maintaining computational feasibility through parallel processing. All hyperparameters for the datasets are listed in Table 1.
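To give a sense of the size of this search space, the sketch below enumerates the grid. It assumes all six axes are crossed jointly, which the text does not state explicitly, and the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# Value lists are taken from the text; the key names are assumptions.
SEARCH_SPACE = {
    "lstm_hidden": [64, 128, 256],
    "learning_rate": [5e-6, 1e-5, 2e-5, 5e-5, 1e-4],
    "batch_size": [8, 16, 32, 64],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "tau_first": [0.5, 1.0, 1.5, 2.0, 2.5],
    "tau_last": [0.3, 0.6, 0.9, 1.2, 1.5],
}


def iter_configs(space):
    """Yield one dict per point of the full Cartesian grid."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(SEARCH_SPACE))
print(len(configs))  # 3 * 5 * 4 * 5 * 5 * 5 = 7500
```

Under that assumption the full grid contains 7500 configurations per dataset, which is consistent with the need for parallel processing mentioned above.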

4.4. Experiments

To evaluate the contribution of each component in our Electra-BiG-Emo model, we conducted a comprehensive series of ablation studies across all datasets. These included isolating the base model, removing the gating mechanism, fusing different emotion representations, experimenting with alternative emotion groupings derived from the Reddit dataset, and comparing the effects of using pretrained versus fine-tuned embeddings within the proposed architecture. In addition to these ablations, we further benchmarked our method against strong state-of-the-art approaches, including standalone Large Language Models as well as LLM-based embedding representations combined with deep learning classifiers.

5. Results and Analysis

This section evaluates the proposed model. It begins with an ablation study that assesses the contribution of each architectural component and the impact of different emotional features on fine-grained sentiment classification through multiple ablation variants. Then, to test generalizability, it examines the impact of using different Large Language Model (LLM) embeddings on performance. The model is then benchmarked against state-of-the-art methods and LLMs to validate its effectiveness. A detailed confusion matrix analysis is conducted to understand how emotional cues in the text contribute to fine-grained sentiment classification. Finally, the section explores model interpretability, highlighting how emotional signals interact with textual context for fine-grained classification, feature importance in overall prediction, emotional bias, and contextual emotion gating, supported by visualization and examples. This helps reveal which emotional features the model relies on most during decision-making.

5.1. Ablation Study

To assess the contribution of each component in our Electra-BiG-Emo model, we conducted multiple ablation studies across all datasets. The ablations were performed sequentially, and for every variant, we measured and reported performance using standard evaluation metrics (averaged over 10 folds). The different ablation setups and their justifications are detailed below:

Base Model Isolation: To evaluate the individual impact of ELECTRA, BiLSTM, and emotion features by testing simplified versions, and to confirm whether the more complex additions and the emotion influence are necessary.
Gating Mechanism Ablation: To verify whether emotion–text fusion via gating improves performance.

Each ablation study and the compared model components/variants, their descriptions, and justification for the selection are listed in Table 2.

Base Model Isolation: The ablation study (Table 3) reveals stark differences in how core components (BiLSTM, emotions) interact across datasets when added with ELECTRA embeddings.

For the SemEval dataset, the baseline ELECTRA achieves the best overall balance (F1 = 0.65). Adding BiLSTM significantly reduces performance (F1 = 0.49), while adding Emotion also lowers F1 to 0.53, though with a slight gain in accuracy (0.66). On SST-5, ELECTRA alone again leads (F1 = 0.56). Both BiLSTM (F1 = 0.53) and Emotion (F1 = 0.54) degrade results across all metrics, indicating limited benefit from either additional sequence modeling or explicit emotional features in this setting. For Twitter, the baseline is strong (F1 = 0.85). BiLSTM sharply decreases performance (F1 = 0.80), whereas Emotion maintains competitive results (F1 = 0.81) and yields a modest accuracy improvement (0.86).

Overall, ELECTRA-only is consistently the strongest baseline across datasets in terms of F1. BiLSTM consistently harms performance, while Emotion features show dataset-dependent effects: slight accuracy gains on SemEval and Twitter but lower F1 on all three datasets. These results highlight the robustness of ELECTRA and suggest that auxiliary modules must be carefully tailored to dataset characteristics to deliver reliable gains.

Gating Mechanism Ablation: The results of the gating mechanism ablation are shown in Table 4.

For the SemEval dataset, gating proves essential. Simple concatenation (ELECTRA + BiLSTM + Emo) achieves only F1 = 0.53 (P/R = 0.52/0.54). Single-direction gating does not improve on this and remains unbalanced: FG reaches F1 = 0.51 with higher precision (0.57) but lower recall (0.49), while LG yields F1 = 0.52 (P/R = 0.53/0.52). In contrast, bidirectional gating (Electra-BiG-Emo) substantially improves performance, achieving balanced precision and recall of 0.67 and lifting F1 to 0.67, a clear gain of +0.14 over concatenation.

On SST-5, improvements are smaller but consistent. Concatenation reaches F1 = 0.53, while FG and LG are stable (0.54 and 0.53). Electra-BiG-Emo achieves F1 = 0.59 with P = R = 0.60, providing a modest but steady advantage.

For Twitter, bidirectional emotion-aware gating delivers the highest absolute scores. Concatenation attains F1 = 0.79 (Acc = 0.84), FG reaches 0.80, and LG slightly improves to 0.81. Electra-BiG-Emo, however, raises all metrics to 0.88 (+0.09 F1 over concatenation), demonstrating that gating both directions with emotional modulation is far more effective than simple concatenation or single-direction gating.

Overall, bidirectional gating with emotion modulation consistently balances precision and recall and achieves the best performance across all datasets. Its impact is most pronounced on SemEval and Twitter, with smaller yet reliable gains on SST-5. These results underscore that while concatenation or single-gate models can be viable for efficiency, the proposed Electra-BiG-Emo configuration offers the optimal trade-off between complexity and accuracy, particularly for tasks requiring fine-grained sentiment–emotion reasoning.

Across both baseline and gating studies, Electra-BiG-Emo (bidirectional gating + emotion) consistently emerges as the most effective configuration, achieving the top precision, recall, F1, and accuracy scores on SemEval and Twitter and delivering steady improvements on SST-5. In contrast, single-direction gates (FG/LG), ungated concatenation, and baseline extensions with BiLSTM or Emotion alone underperform across datasets, often leading to substantial drops. These findings highlight that full bidirectional gating with emotion modulation is not redundant but crucial for balancing precision and recall in complex or noisy domains. Simpler variants may be considered only when computational efficiency or latency outweighs the need for accuracy.

Fusing Different Emotions: To address the issue of potential misalignment of emotion features from GoEmotions with datasets like SST-5 or Twitter and to enhance model generalizability, we experimented with fusing emotion probabilities derived from GoEmotions, SemEval emotions, and ISEAR into the same Electra-BiG-Emo architecture. For each sentiment dataset, we evaluated the performance using accuracy, precision, recall, and F1, reporting the mean and standard error over 10 folds (Table 5).

The results in Table 5 show that, on SST-5, GoEmotions fusion slightly outperforms SemEval and ISEAR emotions (F1: 0.593 ± 0.018 vs. 0.576–0.574), indicating that some cross-domain emotion signals can still be beneficial. On Twitter, all emotion sources perform similarly (F1: 0.864–0.868), suggesting that simpler sentiment datasets are less sensitive to the choice of emotion features. On SemEval, GoEmotions again yields slightly higher performance (F1: 0.666 ± 0.007) compared with SemEval emotions (0.659 ± 0.015) and ISEAR (0.665 ± 0.008), showing that cross-dataset emotion fusion can be competitive or slightly better than in-domain features.

Overall, these experiments demonstrate that the model remains robust and generalizes when emotion features are drawn from different sources. While domain-specific emotion features are ideal, the model can still benefit from external emotion datasets, and the small reported variation across folds confirms the stability of the results.

Different Emotion Groupings from the Reddit Dataset: The 3-category emotion grouping was selected based on preliminary dataset analysis, which indicated this approach optimally balances emotional specificity with data coverage. Broader categorizations risk oversimplifying affective distinctions, while finer categorizations face data sparsity challenges. This grouping strategy aligns with established affective computing practices that cluster related emotions to enhance model robustness.

To enhance affective coverage, we adopted the aligning approach proposed by [30], which groups highly correlated emotions into broader affective dimensions in the Reddit dataset. Specifically, anger encompasses closely related emotions like disappointment and disgust, joy incorporates aligned affective states such as surprise and excitement, and neutral serves as the baseline emotional state.

This grouping strategy preserves nuanced affective signals under-represented in fine-grained emotion categories. During fine-tuning, we primarily focused on Reddit comments belonging to these three classes, with emotion grouping, as they demonstrated the strongest correlations with sentiment labels. For instance, the detection of the anger group strongly correlates with negative sentiment, thereby guiding the model toward more accurate sentiment interpretation. Further, to empirically validate the effectiveness of our emotion grouping approach in three categories, we conducted extensive experiments comparing different grouping schemas. Table 6 provides the detailed grouping definitions, while Table 7 presents the performance metrics across different emotion categorizations.

Our analysis revealed that the fourth grouping schema (aligning with ref. [30]’s affective dimensions) consistently achieved the best performance across all datasets, thereby justifying our adoption of this particular emotion clustering for the current study.

The proposed emotion-aware framework supports flexible customization of emotion taxonomies to accommodate diverse research needs. Researchers can integrate their own emotion grouping schemas by (1) defining a mapping from fine-grained emotions to custom categories, (2) adapting the model’s output layer to match the number of custom groups, and (3) training with the re-mapped emotion labels while preserving the core gating architecture.
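Steps (1) to (3) can be sketched as a simple label re-mapping. The grouping below follows the three categories described earlier in this section (anger with disappointment and disgust, joy with surprise and excitement, and neutral). The function name, the choice to drop unmapped labels, and the sample labels are assumptions made for illustration.

```python
# Step (1): mapping from custom groups to the fine-grained emotions they absorb.
EMOTION_GROUPS = {
    "anger": ["anger", "disappointment", "disgust"],
    "joy": ["joy", "surprise", "excitement"],
    "neutral": ["neutral"],
}

# Step (2): the emotion output layer needs one unit per custom group.
NUM_EMOTION_OUTPUTS = len(EMOTION_GROUPS)


def remap_labels(fine_labels, groups):
    """Step (3): replace fine-grained labels with their group names.

    Labels outside every group are dropped here (an assumption; they could
    instead be assigned to a catch-all group).
    """
    fine_to_group = {fine: group
                     for group, fines in groups.items()
                     for fine in fines}
    return [fine_to_group[label] for label in fine_labels
            if label in fine_to_group]


sample = ["disgust", "surprise", "neutral", "fear"]
print(remap_labels(sample, EMOTION_GROUPS), NUM_EMOTION_OUTPUTS)
```

Swapping in a different taxonomy only requires editing `EMOTION_GROUPS`; the gating architecture itself is unaffected as long as the emotion dimension matches the number of groups.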

5.2. Impact of Pretrained Language Model Embeddings on the Proposed Emotion Fusion Approach

To comprehensively evaluate the generalization capability of the proposed model, we rigorously tested its performance across multiple datasets using different pretrained embeddings (Table 8). For this purpose the ELECTRA-based model used in Section 3.3 can be replaced with other Large Language Models for broader applicability. This analysis aimed to determine how effectively the model adapts to varying linguistic patterns and emotional cues when initialized with diverse foundational representations.

Across all datasets, models using ELECTRA-pretrained embeddings consistently outperform BERT and RoBERTa. On SemEval, ELECTRA achieves 0.67 F1 and balanced precision/recall (0.67 each), outperforming BERT (0.58 F1) and slightly exceeding RoBERTa (0.66 F1). On SST-5, ELECTRA also leads (0.59 F1, 0.60 precision/recall) compared with BERT (0.55 F1) and RoBERTa (0.55 F1). For Twitter, ELECTRA achieves the highest F1 (0.88), surpassing BERT (0.86) and RoBERTa (0.86).

These results suggest that ELECTRA’s pretraining objective, detecting replaced tokens, yields more robust token-level contextual representations, which are especially beneficial for emotion-text fusion. Despite being similar in size to BERT, ELECTRA often generalizes better due to sample-efficient pretraining. RoBERTa performs closer to ELECTRA but lacks the fine-grained token discrimination that benefits nuanced emotion–text interactions. BERT underperforms in complex tasks such as SemEval, with low recall indicating difficulty capturing ambiguous or subtle sentiment cases.

Overall, ELECTRA embeddings provide cleaner token-level features, enabling more effective emotional gating and improved performance across both complex and simpler sentiment tasks.

5.3. Comparison with Fine-Tuned LLMs

Pretrained language models (PLMs) offer strong general-purpose representations, but task-specific fine-tuning can enhance their ability for domain-specific nuance (Table 9). Therefore we measured the performance of our proposed model with fine-tuned embeddings from different LLMs.

Fine-tuning improves performance across all datasets, with ELECTRA leading in most cases. On SST-5, fine-tuned ELECTRA achieves 0.62 F1 (up from 0.59 with frozen embeddings) and 0.64 precision, outperforming BERT (0.61 F1) and RoBERTa (0.61 F1). Fine-tuned ELECTRA best captures the interplay between sentiment and emotion in this dataset. For SemEval, fine-tuned ELECTRA achieves 0.70 F1 and 0.71 recall, slightly surpassing RoBERTa (0.69 F1/recall), while BERT lags behind (0.62 F1). RoBERTa’s larger pretraining helps it handle ambiguity in SemEval, but ELECTRA maintains better overall balance. On Twitter, RoBERTa trails ELECTRA in F1 (0.88 vs. 0.91) and matches it in precision (0.90 vs. 0.90), and both models perform comparably in accuracy. ELECTRA remains more balanced across precision and recall, demonstrating robust performance even with simpler datasets.

Overall, fine-tuning unlocks the full potential of the proposed Electra-BiG-Emo model, with ELECTRA as the default choice for most tasks. The gating mechanism synergizes best with task-aligned embeddings, reducing noise in emotion–text fusion and maximizing performance across both complex and simpler datasets.

5.4. Comparison with State-of-the-Art Methods

To validate the effectiveness of the proposed Electra-BiG-Emo model, we compared it against the fine-tuned ELECTRA, BERT, and RoBERTa LLMs and a variety of strong state-of-the-art methods listed below across the three benchmark datasets. The purpose of this comparison is to assess whether incorporating gated fusion of emotional features with contextual embeddings leads to tangible improvements in sentiment classification performance.

To address concerns regarding whether performance improvements stem from the system design or chance, we conducted two-tailed paired t-tests comparing the proposed Electra-BiG-Emo model against baseline stand-alone LLMs (BERT, RoBERTa, and ELECTRA) across all evaluation metrics. The average performance over 10 folds and the mean and standard deviation of scores were recorded. Specifically, for each fold i and each metric, we compute the performance difference d_{i} = m_{proposed}^{(i)} - m_{baseline}^{(i)} (refer to column 3 of Table 10, Table 11 and Table 12), then test the null hypothesis that the mean difference is zero against the two-sided alternative. We report p-values at a significance level of α = 0.05 (refer to column 4 in Table 10, Table 11 and Table 12).
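The paired test statistic can be sketched with the standard library alone. The fold scores below are made-up placeholders, not results from the paper, and the function name is hypothetical. Since the standard library has no t-distribution, the sketch compares |t| against the two-tailed critical value for 9 degrees of freedom at α = 0.05 (approximately 2.262) instead of computing a p-value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(proposed, baseline):
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    diffs = [p - b for p, b in zip(proposed, baseline)]
    n = len(diffs)
    d_bar = mean(diffs)
    s_d = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if s_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return d_bar, d_bar / (s_d / math.sqrt(n)), n - 1


# Placeholder per-fold F1 scores (assumption, for illustration only).
proposed_f1 = [0.60, 0.61, 0.59, 0.62, 0.60, 0.61, 0.58, 0.60, 0.62, 0.61]
baseline_f1 = [0.57, 0.58, 0.57, 0.58, 0.57, 0.59, 0.56, 0.57, 0.58, 0.58]

d_bar, t_stat, dof = paired_t_statistic(proposed_f1, baseline_f1)
T_CRITICAL_DF9 = 2.262  # two-tailed, alpha = 0.05, 9 degrees of freedom
print(f"mean diff={d_bar:.4f}, t={t_stat:.2f}, df={dof}, "
      f"significant={abs(t_stat) > T_CRITICAL_DF9}")
```

In practice a library routine such as `scipy.stats.ttest_rel` returns the exact p-value directly.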

Table 10 shows that the improvements in the proposed model are statistically significant across all metrics on the SST-5 dataset. Accuracy, precision, recall, and F1 all show p-values \leq 0.01, with most comparisons against BERT and RoBERTa yielding p \leq 0.001. For ELECTRA, differences remain significant (p \leq 0.0012 for most metrics), indicating that the performance gains are unlikely to be due to random chance and instead result from the proposed bidirectional gating and emotion fusion design. These tests support the robustness of the proposed architecture and provide strong evidence that the observed improvements are consistent across 10 folds.

According to Table 11, on the SemEval dataset, compared with BERT, the proposed model achieves highly significant improvements across all metrics (accuracy, precision, recall, F1; p \leq 0.001), confirming substantial gains. Compared with RoBERTa, the improvements are not statistically significant (p \geq 0.45 for all metrics), suggesting that for SemEval, RoBERTa performs similarly to the proposed model. Compared with ELECTRA, the proposed model shows moderate but significant improvements in accuracy (p = 0.016), recall (p = 0.014), and F1 (p = 0.013), with precision showing no significant difference (p = 0.458). These results confirm that the proposed model’s bidirectional gating and emotion fusion mechanisms provide statistically significant benefits over weaker baselines (BERT) and modest gains over strong pretrained models (ELECTRA), while performance is comparable to RoBERTa on this dataset.

On the Twitter dataset (Table 12), compared with BERT, the proposed model shows highly significant improvements across all metrics (accuracy, precision, recall, F1; p \leq 0.001), confirming consistent gains. Compared with RoBERTa, improvements are statistically significant, though less pronounced (accuracy p = 0.018, precision p = 0.009, F1 p = 0.007), indicating that RoBERTa is a strong baseline but the proposed model still provides meaningful enhancements. Compared with ELECTRA, all metrics show highly significant improvements (p \leq 0.0008), demonstrating that the proposed model reliably outperforms strong pretrained embeddings on Twitter. These results confirm that the Electra-BiG-Emo architecture consistently delivers statistically significant performance gains over all tested baselines, validating its robustness even on simpler sentiment datasets like Twitter.

We further compared our proposed method with other strong state-of-the-art methods, as listed below, across the three benchmark datasets:

Single_Domain_Tweets [22]: A RoBERTa-large model fine-tuned on tweet data using a [CLS] token embedding for classification.
Fast_Textcnn [8]: A baseline model that combines FastText word embeddings with a CNN to capture local semantic features for text classification.
DeepFusionSent [12]: A method that uses handcrafted sentiment shift patterns to model subtle transitions and enhance fine-grained sentiment detection.
RoBERTa_HYBRID_Emoji: A method enhancing sentiment classification by fine-tuning RoBERTa embeddings and integrating BiGRU-BiLSTM layers, with a focus on handling emoticons in text.
Senti_Twitter_BERT [22]: A BERT transformer-based architecture for solving task 4A.
BERT-Large [6]: Uses BERT Large for SST-5 classification.
DLAWG [13]: A context-adaptive model that dynamically learns optimal window sizes and applies variable-size self-attention to enhance token-level representation and capture fine-grained contextual cues.
LM-CPPF [24]: A method that uses LLM-generated, class-preserving paraphrases to enrich training data and improve semantic understanding within classes for classification.
MP-TFWA [28]: A model that uses a sentence-level representation learning mechanism to capture both token-level (words/phrases) and feature-level (latent semantic) dependencies for enhanced sentiment understanding.
LACL [1]: A model that introduces Label-aware Contrastive Loss (LCL) to explicitly model inter-class relationships, thereby enhancing fine-grained sentiment discrimination.
Mode LSTM [15]: A model that disentangles LSTM states using orthogonal constraints and employs multi-scale temporal windows to efficiently extract non-redundant semantic features for fine-grained sentiment classification.
GPT4 [36]: We compared our approach with the reported zero-shot prompting results of GPT4 on the SST-5 dataset.

Table 13 reports recall, F1-score, accuracy, and MAE (where applicable), with our results compared directly against those reported in the respective reference studies.

In the Twitter dataset, the proposed Electra-BiG-Emo model achieves the highest accuracy (88) and balanced F1/recall, suggesting that the gating mechanism suppresses noise in Twitter data more effectively than the other models. The Fast_Textcnn baseline matches its recall but performs weakly in terms of accuracy.

On SemEval, the proposed method outperforms the state-of-the-art models in terms of complex emotion detection, achieving +11 F1 over Single_Domain_Tweets and a lower MAE (0.26) than Senti_Twitter_BERT, indicating superior recall, accuracy, and ordinal emotion scoring.

For SST-5, the proposed model achieves the highest accuracy among the compared methods (59.7), likely because BiLSTM + gating captures subtle sentiment shifts.

Overall, the proposed Electra-BiG-Emo performs strongly on fine-grained sentiment classification tasks with emotion awareness, especially where precision–recall balance (SemEval) or fine-grained discrimination (SST-5) matters. The results validate the importance of dynamic gating with auxiliary emotions and task-aligned embeddings.

5.5. Evaluating Emotional Contributions via Confusion Matrix Analysis for Fine-Grained Sentiment Classification

To evaluate how emotional features contribute to correctly distinguishing between fine-grained sentiment classes (e.g., more_negative, negative, neutral, positive, more_positive), we analyze the confusion matrices of our proposed model and compare them with those of other LLM-based baselines on the same test dataset. This helps us assess how well each model resolves the “fuzziness” or ambiguity in classifying closely related sentiment categories, especially when distinguishing subtle variations within the broader positive, negative, or neutral classes. See Figure 3, Figure 4 and Figure 5.

The confusion matrix analysis for the SST-5 dataset (Figure 3) reveals that the proposed model consistently achieves higher diagonal values compared to the other three LLM-based models, indicating superior accuracy in correctly classifying each fine-grained sentiment class through emotional cues.

The confusion matrix analysis for the Twitter dataset (Figure 4) shows that the proposed model outperforms the other three LLM-based models in correctly classifying the neutral and negative classes. For the positive class, while RoBERTa achieves slightly better performance than the others, the proposed model still performs competitively and surpasses the remaining LLMs, demonstrating robust handling of fine-grained sentiment, particularly for challenging neutral and negative distinctions.

In SemEval (Figure 5), the proposed model outperforms all other LLM-based models in accurately classifying the neutral, positive, and more negative classes. While RoBERTa slightly outperforms it in the more positive class, and ELECTRA performs marginally better for the negative class, the proposed model still achieves competitive or superior results across these categories compared to the remaining LLMs, highlighting its overall effectiveness in fine-grained sentiment classification on the SemEval dataset.

Across both fine-grained (SST-5, SemEval) and coarse-grained (Twitter) sentiment classification tasks, the proposed model consistently demonstrates stronger performance than the LLM-based baselines, particularly in handling difficult and ambiguous sentiment classes. This highlights its robustness and effectiveness in leveraging emotional cues for accurate sentiment prediction.

5.6. Interpretation of Fine-Grained Sentiment Analysis Based on Emotions in Different Text Contexts

5.6.1. Emotional Bias with Text Context

To interpret the model’s decision-making, we visualized its gating mechanism across modulated emotions. The forward (left-to-right) and backward (right-to-left) gates dynamically weigh emotion features based on text context and modulated with auxiliary emotions. Forward LSTM captures early context and enhances dominant emotional cues while suppressing misleading ones using auxiliary features. Similarly, backward LSTM emphasizes later emotional signals. Comparing these modulated gates reveals how the model prioritizes early or late emotional context, refining fine-grained sentiment predictions. We visualized forward (first_gate)- and backward (last_gate)-modulated gate values side by side to reveal emotion prioritization (higher values = stronger influence) for fine-grained classification.

5.6.2. Feature Importance

We visualize feature importance in the gated fusion model by extracting weights from the final linear layer (model.fc.weight) to examine the importance of each feature in fine-grained classification. These weights correspond to LSTM hidden states (text features), forward-gate-weighted emotions, and backward-gate-weighted emotions. For each class, importance is computed as the mean absolute weight and normalized using min–max scaling (0 = least important, 1 = most important). The scaled feature importance per sentiment class for two datasets (SST-5 and SemEval) is shown in Table 14 and Table 15.
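The computation described above can be sketched as follows. Several details are assumptions: the min–max scaling is applied within each class across the three feature groups (the text does not say whether scaling is per class or per feature), the column ranges for each group are hypothetical, and the weight row is a toy example rather than values from model.fc.weight.

```python
def feature_importance(class_weights, groups):
    """class_weights: class name -> row of the final linear layer.
    groups: feature-group name -> (start, stop) column range in that row.
    Returns the min-max scaled mean absolute weight per group for each class."""
    table = {}
    for class_name, row in class_weights.items():
        raw = {
            name: sum(abs(w) for w in row[start:stop]) / (stop - start)
            for name, (start, stop) in groups.items()
        }
        low, high = min(raw.values()), max(raw.values())
        span = high - low
        table[class_name] = {
            name: (value - low) / span if span else 0.0
            for name, value in raw.items()
        }
    return table


# Hypothetical layout: 2 LSTM columns, then 2 columns per gated-emotion block.
GROUPS = {"lstm": (0, 2), "first_emo_gate": (2, 4), "last_emo_gate": (4, 6)}
toy_weights = {"negative": [0.1, -0.1, 0.5, -0.5, 0.3, 0.3]}

importance = feature_importance(toy_weights, GROUPS)
print(importance)
```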

Table 14, for the SST-5 dataset, shows that text context (LSTM) contributes consistently across sentiments, with a slight emphasis on positivity. Modulated initial emotional cues (first_emo_gate) are heavily relied on for positive and more_positive predictions, while ending cues (last_emo_gate) are dominant overall, being especially critical for extreme sentiments (more_negative and more_positive).

For the SemEval dataset (Table 15), the model exhibits distinct gating behaviors across sentiment subtypes. Standard negative sentiment relies solely on the initial emotional signals (first_emo_gate), whereas more_negative primarily depends on the final emotional cues (last_emo_gate), suggesting the presence of different negative categories in the data. Similarly, neutral sentiment also relies only on the modulated early context, potentially reflecting ambiguous or uncertain emotional content. In contrast, positive sentiment requires strong signals refined at the end (last_emo_gate), while more_positive effectively integrates both modulated initial and final emotional cues, indicating a more robust emotional buildup.

5.6.3. Interpretation of Emotion Influence of Text Context in Overall Prediction with Example Posts

To illustrate how the gated fusion model weighs emotions across the text, we visualized the forward- and backward-gated values for each emotion (Table 16). This side-by-side comparison reveals how the model modulates emotional information flow at different stages, aiding interpretation and debugging. In the charts in column 2, the blue bars on the left show first_gate, the modulated weights (scaled for visualization) from the early context, and the red bars show last_gate for the later context. In both charts, the y axis shows the gate value (0–1 range, at 0.2 intervals) and the x axis shows the respective fused emotion. The actual and predicted sentiment labels are in columns 3 and 4. Overall, the modulated values illustrate how the model balances contextual emotional cues, highlighting which features dominate and influence fine-grained sentiment prediction.

5.6.4. Emotion vs. Sentiment

We analyze how gated emotional features vary across sentiment classes by averaging modulated values per class (Table 17 and Table 18). This reveals which emotions are modulated in the forward and backward directions for each sentiment, showing how emotional cues influence classification and whether gating behavior differs by sentiment. It also helps assess if the gates exhibit interpretable emotional preferences.
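The per-class averaging described above amounts to a grouped mean over the gate vectors. The sketch below is a minimal illustration with made-up gate values and labels; the function name mean_gate_by_class and the emotion order (anger, neutral, joy) are assumptions for this example.

```python
from collections import defaultdict


def mean_gate_by_class(gate_rows, labels):
    """Average each emotion's gate value over all samples sharing a sentiment label."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(gate_rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


# Hypothetical forward-gate values per sample, ordered as (anger, neutral, joy).
first_gate_values = [
    [0.6, 0.6, 0.5],
    [0.4, 0.4, 0.4],
    [0.8, 0.6, 0.5],
]
sentiments = ["negative", "positive", "negative"]

averages = mean_gate_by_class(first_gate_values, sentiments)
for label, values in averages.items():
    print(label, [round(v, 3) for v in values])
```

Running the same function once on the first_gate values and once on the last_gate values would give the forward and backward blocks of a table shaped like Table 17 or Table 18.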

For the SST-5 dataset, negative predictions rely heavily on modulated early anger and neutral signals, while positive classification depends less on initial cues. The neutral class shows mid-range emotion values, suggesting that it captures ambiguous cases. Backward modulation reveals that the ending context is crucial for positives, with more_negative emphasizing anger. Inconsistent patterns in negative cases suggest that late positive cues can override earlier signals.

In the SemEval dataset, negative-oriented sentiments (negative, more_negative) assign higher importance to anger and neutral emotions, indicating that these emotions are more influential in the forward LSTM direction. Positive-oriented sentiments (positive, more_positive) show lower forward attention to anger and neutral and slightly lower joy, suggesting that they rely less on forward-gated emotional cues, especially anger; in the backward LSTM direction, they instead emphasize joy and neutral emotions more. Neutral and positive sentiments display higher backward joy values than negative ones, with neutral showing the highest last_joy (0.626).

Interestingly, more_positive has the highest backward anger (0.624), indicating that backward modulation may pick up subtle contrasting emotional cues. Overall, this implies that the model uses directional gating to focus on emotionally relevant cues depending on the sentiment polarity of the textual context.

6. Discussion

This study presents a comprehensive exploration of the proposed Electra-BiG-Emo model, designed for fine-grained sentiment classification by leveraging emotional cues through a gated fusion mechanism. Through extensive ablation studies, the model demonstrates the critical contributions of its architectural components, particularly the synergistic integration of contextual embeddings and emotional features. The experiments reveal that while BiLSTM enhances sequence modeling and context retention, emotional features, when effectively gated, substantially improve classification precision and recall, especially in complex sentiment scenarios. The gating mechanism emerges as an essential element, enabling the model to dynamically modulate emotional information based on contextual relevance, which is particularly impactful in nuanced datasets such as SST-5 and SemEval.

The robustness and adaptability of the model are further validated by its performance across various pretrained language model embeddings. ELECTRA, in particular, consistently outperforms both BERT and RoBERTa due to its token-level discrimination capabilities and efficient pretraining strategy. Fine-tuning further enhances performance, with ELECTRA-based embeddings offering a significant boost in accuracy and F1 scores across all tested datasets. Notably, the proposed model maintains competitive results even in domains where other models traditionally perform well, such as Twitter sentiment classification, indicating its generalizability across both coarse- and fine-grained sentiment tasks.

Crucially, paired t-tests over 10 folds show that our gated fusion model with pretrained embeddings achieves statistically significant improvements over fine-tuned versions of the LLMs on every metric (accuracy, precision, recall, F1) for both Twitter and SST-5. On SemEval, our model is significantly better than two of them (BERT and ELECTRA) on at least three metrics, and it matches fine-tuned RoBERTa, with differences that are small and not statistically significant, underscoring that emotion-aware gating, not backbone fine-tuning alone, drives the gains. Qualitatively, confusion matrix analyses confirm that the model more reliably separates adjacent sentiment categories under ambiguous emotional cues, reflecting its ability to surface and weight subtle affective signals via the first/last gates. Although the model may appear complex, the necessity of the gated fusion mechanism is supported by the ablation studies. Removing or simplifying the gates results in consistent performance drops, especially on challenging datasets like SST-5 and SemEval, where subtle emotional cues are critical. This indicates that the added complexity is justified by clear gains in accuracy and robustness, as simpler networks fail to capture the same level of fine-grained emotional integration.
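For readers who want to reproduce this kind of comparison, the paired t-test over folds can be sketched with the standard library alone. The fold scores below are hypothetical placeholders, not the paper's results. The sketch also stops at the t statistic and compares it with the two-sided 5% critical value for 9 degrees of freedom (2.262) instead of computing the exact p-values reported in Tables 10 to 12.

```python
import math
from statistics import mean, stdev

# Two-sided critical value of Student's t at alpha = 0.05 with 9 degrees of freedom.
T_CRIT_DF9_ALPHA_05 = 2.262


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences (scores_a - scores_b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all per-fold differences are identical")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-fold accuracies for 10-fold cross-validation.
proposed = [0.60, 0.58, 0.61, 0.59, 0.62, 0.57, 0.60, 0.61, 0.59, 0.60]
baseline = [0.57, 0.56, 0.58, 0.57, 0.58, 0.55, 0.57, 0.59, 0.56, 0.58]

t_stat = paired_t_statistic(proposed, baseline)
significant = abs(t_stat) > T_CRIT_DF9_ALPHA_05
print(f"t = {t_stat:.2f}, significant at 0.05: {significant}")
```

The test is paired because both models are scored on the same folds, so fold-to-fold difficulty cancels out in the differences.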

Despite the overall effectiveness, performance discrepancies are observed with the SemEval dataset, showing comparatively weaker results. These inconsistencies can be partly attributed to domain misalignment, as the emotional features were learned from the GoEmotions dataset, which contains diverse multi-domain annotations. While this diversity sometimes limits alignment with specific tasks, ablation studies reveal that GoEmotions-based features consistently outperform emotions derived from SemEval or user annotations. This suggests that although task-specific emotional learning on SemEval is less effective, the broader emotional representations from GoEmotions still provide stronger and more transferable signals.

Further interpretability analysis offers deeper insight into the model's decision-making process. Visualizations of the bidirectional gating mechanism illustrate how early and late emotional cues, modulated through auxiliary emotions, influence predictions differently, with first-gate and last-gate weighting highlighting emotion-specific dependencies for the various sentiment categories. Feature importance assessments reveal that while textual context contributes consistently across sentiments, emotional signals, particularly those captured through backward gating and modulation, play a decisive role in accurately identifying extreme sentiment classes. Case-based interpretations further validate the model's nuanced understanding of the emotion–sentiment interplay, while aggregated gate weighting patterns confirm interpretable and sentiment-aligned emotional preferences.

Beyond sentiment classification, the model's ability to decode fine-grained emotional cues also offers indirect insights into users' mental states as expressed through language. The dynamic fusion of emotional and contextual signals allows the model to capture not just surface-level polarity but also deeper affective patterns. Particularly on datasets like SemEval and Twitter, where users often express raw and spontaneous thoughts, the model's sensitivity to subtle emotional shifts can reflect transient mental conditions.

7. Conclusions

In conclusion, the proposed model represents a significant advancement in fine-grained sentiment classification by effectively combining pretrained contextual embeddings with an emotion-aware gating mechanism. Its ability to dynamically balance emotional and textual signals not only enhances predictive performance but also contributes to meaningful interpretability, especially in emotionally nuanced or ambiguous cases. Notably, the model’s sensitivity to subtle emotional shifts across text positions provides insights into the underlying mental states expressed on social media or in user-generated content, suggesting its potential beyond sentiment classification, including early indicators of emotional distress or affective instability.

However, while we explored external emotional cues from multiple datasets, GoEmotions consistently provided the most transferable and reliable signals. In contrast, emotions learned directly within certain sentiment tasks, such as SemEval, proved less effective, suggesting that task-specific emotional supervision may not always capture rich affective patterns. As future work, we plan to investigate domain-adaptive emotional transfer that leverages the strengths of broad, multi-domain resources like GoEmotions while incorporating task-level refinements. Developing sentiment-aware and psychologically grounded emotional representations is expected to further improve performance and broaden the model’s applications in areas such as mental health monitoring and behavioral insight extraction.

Author Contributions

A.U.G.T.M.: conceptualization, methodology, writing—original draft, writing—editing, data curation. Y.L.: supervision, writing—reviewing and editing. J.Z.: supervision, writing—reviewing and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Queensland University of Technology.

Data Availability Statement

SemEval Dataset: https://alt.qcri.org/semeval2017/task4/?id=download-the-full-training-data-for-semeval-2017-task-4 (accessed on 23 December 2025); SST-5 Dataset: https://www.kaggle.com/datasets/haoshaoyang/sst5-data (accessed on 23 December 2025); Twitter Airline Dataset: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment (accessed on 23 December 2025).

Acknowledgments

I, Anupama Udayangani Gunathilaka Thennakoon Mudiyanselage, acknowledge my supervisors for their expertise and assistance in all aspects of this study and for their help in writing the manuscript.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Suresh, V.; Ong, D.C. Not all negatives are equal: Label-aware contrastive loss for fine-grained text classification. arXiv 2021, arXiv:2109.05427.
Chakraborty, K.; Bhattacharyya, S.; Bag, R. A survey of sentiment analysis from social media data. IEEE Trans. Comput. Soc. Syst. 2020, 7, 450–464.
Gandhi, A.; Adhvaryu, K.; Poria, S.; Cambria, E.; Hussain, A. Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions. Inf. Fusion 2023, 91, 424–444.
Zhao, S.; Hong, X.; Yang, J.; Zhao, Y.; Ding, G. Toward label-efficient emotion and sentiment analysis. Proc. IEEE 2023, 111, 1159–1197.
dos Santos, C.; Gatti, M. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland, 23–29 August 2014; pp. 69–78.
Munikar, M.; Shakya, S.; Shrestha, A. Fine-grained sentiment classification using BERT. In Proceedings of the 2019 Artificial Intelligence for Transforming Business and Society (AITB), Kathmandu, Nepal, 5 November 2019; Volume 1, pp. 1–5.
Talaat, A.S. Sentiment analysis classification system using hybrid BERT models. J. Big Data 2023, 10, 110.
Umer, M.; Imtiaz, Z.; Ahmad, M.; Nappi, M.; Medaglia, C.; Choi, G.S.; Mehmood, A. Impact of convolutional neural network and FastText embedding on text classification. Multimed. Tools Appl. 2023, 82, 5569–5585.
Nguyen, D.Q.; Vu, T.; Nguyen, A.T. BERTweet: A pre-trained language model for English Tweets. arXiv 2020, arXiv:2005.10200.
Eang, C.; Lee, S. Improving the accuracy and effectiveness of text classification based on the integration of the BERT model and a recurrent neural network (RNN_BERT_Based). Appl. Sci. 2024, 14, 8388.
Ghanee, A.; Jahanbin, K.; Yamchi, A.R. ElHBiAt: Electra pre-training network hybrid of BiLSTM and an attention layer for aspect-based sentiment analysis. IEEE Access 2025, 13, 88342–88370.
Thakkar, A.; Pandya, D. DeepFusionSent: A novel feature fusion approach for deep learning-enhanced sentiment classification. Inf. Fusion 2025, 118, 103000.
Tan, K.L.; Lee, C.P.; Anbananthen, K.S.M.; Lim, K.M. RoBERTa-LSTM: A hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access 2022, 10, 21517–21525.
Gunathilaka, T.M.A.U.; Zhang, J.; Li, Y. Fine-grained feature extraction in key sentence selection for explainable sentiment classification using BERT and CNN. IEEE Access 2025, 13, 68462–68480.
Ma, Q.; Lin, Z.; Yan, J.; Chen, Z.; Yu, L. Mode-LSTM: A parameter-efficient recurrent network with multi-scale for sentence classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6705–6715.
Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 1422–1432.
Mena, F.; Pathak, D.; Najjar, H.; Sanchez, C.; Helber, P.; Bischke, B.; Habelitz, P.; Miranda, M.; Siddamsetty, J.; Nuske, M.; et al. Adaptive fusion of multi-modal remote sensing data for optimal sub-field crop yield prediction. Remote Sens. Environ. 2025, 318, 114547.
Liu, C.; Chen, C. Text mining and sentiment analysis: A new lens to explore the emotion dynamics of mother-child interactions. Soc. Dev. 2024, 33, e12733.
Choi, G.; Oh, S.; Kim, H. Improving document-level sentiment classification using importance of sentences. Entropy 2020, 22, 1336.
Ding, K.; Fan, C.; Ding, Y.; Wang, Q.; Wen, Z.; Li, J.; Xu, R. LCSEP: A large-scale Chinese dataset for social emotion prediction to online trending topics. IEEE Trans. Comput. Soc. Syst. 2024, 11, 3362–3375.
Yu, Z.; Li, H.; Feng, J. Enhancing text classification with attention matrices based on BERT. Expert Syst. 2024, 41, e13512.
Fiok, K.; Karwowski, W.; Gutierrez, E.; Wilamowski, M. Analysis of sentiment in tweets addressed to a single domain-specific Twitter account: Comparison of model performance and explainability of predictions. Expert Syst. Appl. 2021, 186, 115771.
Galal, O.; Abdel-Gawad, A.H.; Farouk, M. Rethinking of BERT sentence embedding for text classification. Neural Comput. Appl. 2024, 36, 20245–20258.
Abaskohi, A.; Rothe, S.; Yaghoobzadeh, Y. LM-CPPF: Paraphrasing-guided data augmentation for contrastive prompt-based few-shot fine-tuning. arXiv 2023, arXiv:2305.18169.
Yu, J.; Jiang, J.; Xia, R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2019, 28, 429–439.
Du, Y.; Li, T.; Pathan, M.S.; Teklehaimanot, H.K.; Yang, Z. An effective sarcasm detection approach based on sentimental context and individual expression habits. Cogn. Comput. 2022, 14, 78–90.
Hossain, M.S.; Hossain, M.M.; Hossain, M.S.; Mridha, M.F.; Safran, M.; Alfarhood, S. EmoNet: Deep Attentional Recurrent CNN for X (formerly Twitter) Emotion Classification. IEEE Access 2025, 13, 37591–37610.
Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
Demszky, D.; Movshovitz-Attias, D.; Ko, J.; Cowen, A.; Nemade, G.; Ravi, S. GoEmotions: A dataset of fine-grained emotions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020.
Wang, K.; Jing, Z.; Su, Y.; Han, Y. Large language models on fine-grained emotion detection dataset with data augmentation and transfer learning. arXiv 2024, arXiv:2403.06108.
Baziotis, C.; Pelekis, N.; Doulkeridis, C. DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, BC, Canada, 3–4 August 2017; pp. 747–754.
Ameer, I.; Bölücü, N.; Siddiqui, M.H.F.; Can, B.; Sidorov, G.; Gelbukh, A. Multi-label emotion classification in texts using transfer learning. Expert Syst. Appl. 2023, 213, 118534.
Kumar, A.; Singh, J.P.; Singh, A.K. Explainable BERT-LSTM stacking for sentiment analysis of COVID-19 vaccination. IEEE Trans. Comput. Soc. Syst. 2023, 12, 1296–1306.
Arevalo, J.; Solorio, T.; Montes-y-Gómez, M.; González, F.A. Gated multimodal networks. Neural Comput. Appl. 2020, 32, 10209–10228.
Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with Gumbel-Softmax. arXiv 2016, arXiv:1611.01144.
Liu, Z.; Yang, K.; Xie, Q.; Zhang, T.; Ananiadou, S. Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Barcelona, Spain, 25–29 August 2024; pp. 5487–5496.

Figure 1. Proposed Electra_BiG_Emo model.

Figure 2. Correlation between sentiments and different LLM-predicted emotions.

Figure 3. Confusion matrices for the SST-5 dataset by different model predictions. (a) BERT. (b) ELECTRA. (c) RoBERTa. (d) Proposed.

Figure 4. Confusion matrices for the Twitter dataset by different model predictions. (a) BERT. (b) ELECTRA. (c) RoBERTa. (d) Proposed.

Figure 5. Confusion matrices for the SemEval dataset by different model predictions. (a) BERT. (b) ELECTRA. (c) RoBERTa. (d) Proposed.

Table 1. Hyperparameter configurations for all datasets.

Parameter	SST-5	SemEval 2017	Twitter Airline
LSTM Hidden Dim	256	128	128
Learning Rate	2 × 10⁻⁵	2 × 10⁻⁵	2 × 10⁻⁵
Batch Size	32	16	32
Dropout Rate	0.2	0.2	0.25
$\tau_{\mathrm{first}}$	1.5	1.2	1.5
$\tau_{\mathrm{last}}$	0.8	0.7	0.7
Epochs	15	12	10
Early Stopping Patience	5	5	3

Table 2. Ablation study of the ELECTRA-BiG-Emo model.

Ablation Study	Compared Models/Variants	Description and Justification
Base Models	ELECTRA-only	Remove LSTM and emotion gates; fine-tune ELECTRA only to test if LSTM/gating is needed.
	ELECTRA + BiLSTM	Remove emotion_probs; ELECTRA → BiLSTM → classifier. Tests impact of emotion fusion.
	ELECTRA + Emotion	Remove BiLSTM; concatenate ELECTRA [CLS] + emotion_probs → FC. Tests if LSTM adds value.
Gating Mechanism	ELECTRA + BiLSTM + EMO	Replace gates with concatenation: [BiLSTM_out; emotion_probs] → FC. Tests if gating outperforms naive fusion.
	ELECTRA + BiLSTM + FG	Use only first gate (forward LSTM) to check necessity of bidirectional gating.
	ELECTRA + BiLSTM + LG	Use only last gate (backward LSTM) to check necessity of bidirectional gating.
	ELECTRA-BiG-Emo (Proposed)	Incorporates bidirectional gated emotion fusion and modulation for optimal emotion–text interaction.

Each ablation variant was evaluated using 10-fold cross-validation, and average metrics were reported across datasets.

Table 3. Performance of base models across datasets.

Dataset	Base Model	Precision	Recall	F1-Score	Accuracy
SemEval	ELECTRA-only	0.65	0.66	0.65	0.65
SemEval	ELECTRA + BiLSTM	0.48	0.55	0.49	0.61
SemEval	ELECTRA + Emotion	0.54	0.52	0.53	0.66
SST-5	ELECTRA-only	0.56	0.57	0.56	0.57
SST-5	ELECTRA + BiLSTM	0.54	0.53	0.53	0.55
SST-5	ELECTRA + Emotion	0.54	0.55	0.54	0.55
Twitter	ELECTRA-only	0.85	0.85	0.85	0.85
Twitter	ELECTRA + BiLSTM	0.79	0.82	0.80	0.85
Twitter	ELECTRA + Emotion	0.81	0.82	0.81	0.86

The best results per dataset are highlighted in bold.

Table 4. Performance of gating mechanism variants across datasets.

Dataset	Gated Model	Precision	Recall	F1-Score	Accuracy
SemEval	ELECT. + BiLSTM + Emo	0.52	0.54	0.53	0.64
	ELECT. + BiLSTM + FG	0.57	0.49	0.51	0.66
	ELECT. + BiLSTM + LG	0.53	0.52	0.52	0.64
	ELECTRA-BiG-Emo	0.67	0.67	0.67	0.67
SST-5	ELECT. + BiLSTM + Emo	0.53	0.55	0.53	0.54
	ELECT. + BiLSTM + FG	0.53	0.55	0.54	0.54
	ELECT. + BiLSTM + LG	0.52	0.55	0.53	0.54
	ELECTRA-BiG-Emo	0.60	0.60	0.59	0.60
Twitter	ELECT. + BiLSTM + Emo	0.78	0.82	0.79	0.84
	ELECT. + BiLSTM + FG	0.82	0.79	0.80	0.86
	ELECT. + BiLSTM + LG	0.80	0.81	0.81	0.85
	ELECTRA-BiG-Emo	0.88	0.88	0.88	0.88

The best results for each dataset are highlighted in bold.

Table 5. Performance comparison of fusing different emotion sources into ELECTRA-BiG-Emo (mean ± standard deviation).

Dataset	Emotion Source	Accuracy	Precision	Recall	F1-Score
SST-5	GoEmotion	0.597 ± 0.018	0.598 ± 0.018	0.597 ± 0.018	0.593 ± 0.018
SST-5	SemEval	0.581 ± 0.015	0.585 ± 0.013	0.581 ± 0.015	0.576 ± 0.017
SST-5	ISEAR	0.579 ± 0.014	0.581 ± 0.017	0.579 ± 0.018	0.574 ± 0.019
Twitter	GoEmotion	0.870 ± 0.010	0.870 ± 0.010	0.870 ± 0.010	0.868 ± 0.013
Twitter	SemEval	0.869 ± 0.009	0.870 ± 0.010	0.869 ± 0.009	0.868 ± 0.012
Twitter	ISEAR	0.866 ± 0.011	0.867 ± 0.012	0.866 ± 0.011	0.864 ± 0.014
SemEval	GoEmotion	0.668 ± 0.007	0.668 ± 0.007	0.668 ± 0.007	0.666 ± 0.007
SemEval	SemEval	0.660 ± 0.017	0.665 ± 0.014	0.660 ± 0.017	0.659 ± 0.015
SemEval	ISEAR	0.664 ± 0.008	0.669 ± 0.006	0.664 ± 0.008	0.665 ± 0.008

Bold values indicate the best performance for each dataset.

Table 6. Emotion categories in each group.

Group	Emotion Categories
Group 1	Joy (admiration, joy, love), anger (sadness, anger, annoyance), neutral
Group 2	Joy (joy, excitement, desire), anger (disgust, annoyance, anger), neutral
Group 3	Joy (joy, surprise, pride), anger (embarrassment, disappointment, anger), neutral
Group 4	Joy (joy, surprise, excitement), anger (disappointment, disgust, anger), neutral

Table 7. Performance with Different Groupings across datasets.

Group	Dataset	Precision	Recall	F1-Score	Accuracy
Group 1	SemEval	0.65	0.65	0.65	0.65
	SST-5	0.56	0.55	0.54	0.55
	Twitter	0.85	0.85	0.85	0.85
Group 2	SemEval	0.66	0.65	0.65	0.65
	SST-5	0.47	0.50	0.47	0.50
	Twitter	0.86	0.85	0.85	0.85
Group 3	SemEval	0.66	0.65	0.65	0.65
	SST-5	0.56	0.57	0.56	0.57
	Twitter	0.85	0.85	0.85	0.85
Group 4	SemEval	0.67	0.67	0.67	0.67
	SST-5	0.60	0.60	0.59	0.60
	Twitter	0.88	0.88	0.88	0.88

Table 8. Performance of the proposed ELECTRA-BiG-Emo model with different pretrained LLM embeddings.

Dataset	Pretrained LLM	Precision	Recall	F1-Score	Accuracy
SST-5	ELECTRA	0.60	0.60	0.59	0.60
SST-5	BERT	0.55	0.56	0.55	0.56
SST-5	RoBERTa	0.56	0.56	0.55	0.56
SemEval	ELECTRA	0.67	0.67	0.67	0.67
SemEval	BERT	0.60	0.59	0.58	0.60
SemEval	RoBERTa	0.67	0.66	0.66	0.67
Twitter	ELECTRA	0.88	0.88	0.88	0.88
Twitter	BERT	0.86	0.86	0.86	0.86
Twitter	RoBERTa	0.87	0.86	0.86	0.87

Bold values indicate the best performance for each dataset.

Table 9. Performance of the proposed ELECTRA-BiG-Emo model with different fine-tuned LLM embeddings.

Dataset	Fine-Tuned LLM	Precision	Recall	F1-Score	Accuracy
SST-5	ELECTRA	0.64	0.62	0.62	0.64
SST-5	BERT	0.62	0.61	0.61	0.62
SST-5	RoBERTa	0.61	0.64	0.61	0.62
SemEval	ELECTRA	0.70	0.71	0.70	0.71
SemEval	BERT	0.62	0.63	0.62	0.63
SemEval	RoBERTa	0.69	0.69	0.69	0.69
Twitter	ELECTRA	0.90	0.93	0.91	0.91
Twitter	BERT	0.87	0.89	0.87	0.89
Twitter	RoBERTa	0.90	0.88	0.88	0.91

Bold values indicate the best performance for each dataset.

Table 10. Statistical significance of the proposed model improvements on the SST-5 dataset (paired t-test, n = 10 folds).

Metric	Baseline	Score (Mean ± SD)	p-Value
Accuracy	BERT	$0.597 \pm 0.018$ vs. $0.535 \pm 0.013$	<0.001 ***
Accuracy	RoBERTa	$0.597 \pm 0.018$ vs. $0.557 \pm 0.018$	<0.001 ***
Accuracy	ELECTRA	$0.597 \pm 0.018$ vs. $0.572 \pm 0.014$	0.0012 **
Precision	BERT	$0.598 \pm 0.018$ vs. $0.532 \pm 0.014$	<0.001 ***
Precision	RoBERTa	$0.598 \pm 0.018$ vs. $0.552 \pm 0.019$	<0.001 ***
Precision	ELECTRA	$0.598 \pm 0.018$ vs. $0.569 \pm 0.014$	0.00095 ***
Recall	BERT	$0.597 \pm 0.018$ vs. $0.535 \pm 0.013$	<0.001 ***
Recall	RoBERTa	$0.597 \pm 0.018$ vs. $0.557 \pm 0.018$	<0.001 ***
Recall	ELECTRA	$0.597 \pm 0.018$ vs. $0.572 \pm 0.014$	0.0012 **
F1-score	BERT	$0.593 \pm 0.018$ vs. $0.533 \pm 0.014$	<0.001 ***
F1-score	RoBERTa	$0.593 \pm 0.018$ vs. $0.551 \pm 0.018$	<0.001 ***
F1-score	ELECTRA	$0.593 \pm 0.018$ vs. $0.568 \pm 0.014$	0.00087 ***

*** p < 0.001, ** p < 0.01.

Table 11. Statistical significance of proposed model improvements on the SemEval dataset (paired t-test, n = 10 folds).

Metric	Baseline	Score (Mean ± SD)	p-Value
Accuracy	BERT	$0.668 \pm 0.007$ vs. $0.591 \pm 0.008$	<0.001 ***
Accuracy	RoBERTa	$0.668 \pm 0.007$ vs. $0.667 \pm 0.007$	$0.621$
Accuracy	ELECTRA	$0.668 \pm 0.007$ vs. $0.652 \pm 0.014$	0.016 *
Precision	BERT	$0.668 \pm 0.007$ vs. $0.573 \pm 0.016$	<0.001 ***
Precision	RoBERTa	$0.668 \pm 0.007$ vs. $0.666 \pm 0.007$	$0.458$
Precision	ELECTRA	$0.668 \pm 0.007$ vs. $0.660 \pm 0.010$	$0.458$
Recall	BERT	$0.668 \pm 0.007$ vs. $0.591 \pm 0.008$	<0.001 ***
Recall	RoBERTa	$0.668 \pm 0.007$ vs. $0.667 \pm 0.007$	$0.621$
Recall	ELECTRA	$0.668 \pm 0.007$ vs. $0.652 \pm 0.014$	0.014 *
F1-score	BERT	$0.666 \pm 0.007$ vs. $0.573 \pm 0.011$	<0.001 ***
F1-score	RoBERTa	$0.666 \pm 0.007$ vs. $0.666 \pm 0.007$	$0.892$
F1-score	ELECTRA	$0.666 \pm 0.007$ vs. $0.653 \pm 0.011$	0.013 *

*** p < 0.001, * p < 0.05.

Table 12. Statistical significance of proposed model improvements on the Twitter dataset (paired t-test, n = 10 folds).

Metric	Baseline	Score (Mean ± SD)	p-Value
Accuracy	BERT	$0.876 \pm 0.009$ vs. $0.857 \pm 0.010$	0.00027 ***
Accuracy	RoBERTa	$0.876 \pm 0.009$ vs. $0.865 \pm 0.010$	0.018 *
Accuracy	ELECTRA	$0.876 \pm 0.009$ vs. $0.853 \pm 0.011$	0.0008 ***
Precision	BERT	$0.875 \pm 0.007$ vs. $0.855 \pm 0.010$	0.00011 ***
Precision	RoBERTa	$0.875 \pm 0.007$ vs. $0.864 \pm 0.011$	0.009 **
Precision	ELECTRA	$0.875 \pm 0.007$ vs. $0.852 \pm 0.010$	0.0003 ***
Recall	BERT	$0.876 \pm 0.009$ vs. $0.857 \pm 0.010$	0.00027 ***
Recall	RoBERTa	$0.876 \pm 0.009$ vs. $0.865 \pm 0.010$	0.018 *
Recall	ELECTRA	$0.876 \pm 0.009$ vs. $0.853 \pm 0.011$	0.0008 ***
F1-score	BERT	$0.875 \pm 0.008$ vs. $0.856 \pm 0.010$	0.00010 ***
F1-score	RoBERTa	$0.875 \pm 0.008$ vs. $0.864 \pm 0.011$	0.007 **
F1-score	ELECTRA	$0.875 \pm 0.008$ vs. $0.852 \pm 0.010$	0.0003 ***

*** p < 0.001, ** p < 0.01, * p < 0.05.

Table 13. Performance comparison of the proposed GEF model and baseline models across datasets.

Dataset	Model	Recall	F1	Accuracy	MAE
Twitter	Proposed ELECTRA-BiG-Emo	0.88	0.88	0.88	–
Twitter	RoBERTa_HYBRID_Emoji	–	–	0.86	–
Twitter	DeepFusionSent	0.85	–	0.86	–
Twitter	FAST_TEXTcnn	0.86	–	0.86	–
SemEval Task 4	Proposed ELECTRA-BiG-Emo	0.67	0.67	0.67	0.26
SemEval Task 4	Single_Domain_Tweets	–	0.43	–	0.45
SemEval Task 4	Senti_Twitter_BERT	0.54	–	0.54	0.45
SST-5	Proposed ELECTRA-BiG-Emo	–	–	0.597	–
SST-5	LM-CPPF	–	–	0.55	–
SST-5	Mode LSTM	–	–	0.55	–
SST-5	DLAWG	–	–	0.54	–
SST-5	MP-TFWA	–	–	0.56	–
SST-5	BERT Large	–	–	0.56	–
SST-5	LACL	–	–	0.59	–
SST-5	GPT-4	–	0.50	0.54	–

Table 14. Average importance scores of LSTM, modulated forward gate, and modulated backward gate features across sentiment classes for the SST-5 dataset.

Sentiment	LSTM	First Emo Gate	Last Emo Gate
More_negative	3.8339	2.7648	4.2156
Negative	3.8489	3.4404	3.8397
Neutral	3.8787	0.0000	3.6235
Positive	3.9543	4.5038	3.5901
More_positive	3.9583	4.6151	4.3517

Table 15. Average importance scores of LSTM, modulated forward gate, and modulated backward gate features across sentiment classes for the SemEval dataset.

Sentiment	LSTM	First Emo Gate	Last Emo Gate
More_negative	3.9006	0.9816	4.0604
Negative	3.8200	4.2171	0.0000
Neutral	3.8624	4.2111	3.6614
Positive	3.7977	2.6933	4.6151
More_positive	3.8124	4.4493	4.3204

Table 16. Illustrative examples of emotional bias with textual context.

Sample Post	Actual	Pred.	Interpretation
All in all, there’s only one thing to root for: expulsion for everyone.	Neg.	Neg.	Positive opener (“all in all”) overridden by late anger cue (“expulsion”); late context dominates.
Amazon Prime Day will be like Black Friday, I guess, because I’m just as disappointed.	More Neg.	More Neg.	Mixed early cues; “disappointed” in late context suppresses mixed signals → strong negative.
Can you bear the laughter?	Pos.	Pos.	Early positivity; late context tempers intensity but remains Positive overall.
Creepy, authentic, and dark	Neu.	Neu.	Early gate over-weights “authentic”; late balanced cues cancel it → Neutral.
You are my early frontrunner for best airline! Oscars 2016.	More Pos.	More Pos.	Early/late cues both reinforce joy → strong positive.

Table 17. Forward- and backward-weighted emotions by sentiment classes (SST-5).

Gate Type	Sentiment	Anger	Neutral	Joy
Forward	More_negative	0.612	0.629	0.485
	More_positive	0.398	0.385	0.438
	Negative	0.614	0.604	0.488
	Neutral	0.519	0.496	0.485
	Positive	0.394	0.393	0.450
Backward	More_negative	0.520	0.453	0.507
	More_positive	0.624	0.531	0.531
	Negative	0.522	0.467	0.569
	Neutral	0.489	0.508	0.626
	Positive	0.572	0.569	0.572

Table 18. Forward- and backward-weighted emotions by sentiment classes (SemEval).

Gate Type	Sentiment	Anger	Neutral	Joy
Forward	More_negative	0.537	0.490	0.447
	More_positive	0.454	0.502	0.557
	Negative	0.534	0.522	0.499
	Neutral	0.489	0.519	0.566
	Positive	0.456	0.502	0.574
Backward	More_negative	0.578	0.464	0.393
	More_positive	0.616	0.508	0.450
	Negative	0.650	0.558	0.372
	Neutral	0.760	0.621	0.355
	Positive	0.748	0.574	0.343

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
