Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning

Xia, Tian; Liu, Xuan; Deng, Yuancheng; Qiu, Feng

doi:10.3390/app16094515

Open AccessArticle

Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning

¹

School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai 201209, China

²

Academy of Basic Education Development, Shanghai Normal University, Shanghai 200233, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4515; https://doi.org/10.3390/app16094515

Submission received: 3 March 2026 / Revised: 19 April 2026 / Accepted: 30 April 2026 / Published: 4 May 2026

Download

Browse Figures

Versions Notes

Abstract

Sentiment analysis is significant for exploiting countless opinion-rich data from social media. However, it faces known challenges, such as informal expression and data sparsity. While large language models (LLMs) excel at dynamic contextual disambiguation, such deep learning models neglect to leverage prior knowledge calculated statistically and globally based on dataset-wide category tendencies to guide the encoding process. To overcome these challenges, this study proposes a pre-attention framework that includes a modified gated recurrent unit hierarchically integrated with a dual-level pre-attention mechanism (PA-GRU) and an LLM-PA-GRU model equipped with a strategic LLM fine-tuning method. The pre-attention mechanism extracts inter-category and intra-text statistical priors from the training set. Trainable coefficient matrices are proposed to adaptively fuse these priors within a unified formulation, enabling flexible allocation of global and local statistical signals. The proposed PA-GRU cell features a modified gating structure and a computation flow that integrates these extracted priors as explicit statistical anchors during sequence encoding. Moreover, we implement an LLM-PA-GRU model that connects the LLM’s deep feature representations to the PA-GRU. This strategic LLM fine-tuning approach is intended to mitigate gradient instability frequently encountered in models with heterogeneous architectures. Finally, a prototype-based alignment loss is employed to enforce feature consistency across modules. Extensive experiments on benchmark datasets demonstrate that the proposed approach achieves competitive performance compared with recent literature-reported models.

Keywords:

pre-attention mechanism; attention mechanism; pre-attention gated recurrent unit; LLM fine-tuning; sentiment analysis

1. Introduction

Short-text sentiment classification plays a vital role in natural language processing (NLP), with widespread applications across healthcare, finance, education, and commerce. In today’s information age, social media platforms have become deeply integrated into human life, generating vast amounts of emotional data. These data, often in the form of short texts, contain rich emotional insights encompassing stances, opinions, views, and moods. However, the inherent characteristics of short texts, i.e., short length, sparse features, and informal expression, pose significant challenges to accurate sentiment classification. This paradigm requires robust methodologies capable of distilling latent emotional value from noisy, short-form environments to facilitate precise and efficient classification. Meanwhile, with the rapid development of deep learning and large language models (LLMs), existing methods still have limitations in leveraging corpus-derived statistical priors during encoding and in optimizing architectural integration, which motivates this study to explore more effective solutions for short-text sentiment classification.

Researchers have extensively investigated sentiment classification using both traditional machine learning techniques and deep learning methods. Traditional machine learning methods typically represent short texts as high-dimensional matrices, which often exhibit extreme sparsity due to the brevity of short-text content, thereby severely degrading classification performance. This process exhibits the typical difficulties of short-text sentiment classification, including shortness [1], sparsity [2], and informal writing [3]. Nevertheless, the core idea of leveraging statistical prior knowledge embedded in the training corpus, which is widely adopted in traditional machine learning, still provides valuable insights for the design of deep learning models.

In contrast, the application of deep learning models in recent years has significantly advanced research in sentiment classification. This includes non-sequential models primarily based on convolutional neural networks (CNNs) [4] and sequential models represented by recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) [5]. Notably, since natural language is inherently sequential, sequential deep learning models have demonstrated exceptional performance in short-text sentiment classification. Among various sequential models, the gated recurrent unit (GRU) [6] remains highly useful and competitive in short-text sentiment classification due to its parameter efficiency compared to RNN and LSTM architectures. This structural simplicity, characterized by a streamlined gating mechanism, effectively mitigates the vanishing or exploding gradient problems common in RNNs. It better captures long-term feature dependencies in short texts and makes it a suitable choice as the encoder for short-text sentiment classification tasks.

Simultaneously, attention mechanisms have been widely adopted in deep learning sequential models to enhance sentiment classification. Initially introduced to address machine translation tasks in NLP [7], attention mechanisms allocate attention by measuring relationships between vectors generated by the encoder. They assign higher weights to critical information and reduce weights for irrelevant data, thereby improving the performance of deep learning models [8]. These weights continuously evolve and adapt during training, ultimately enhancing model scalability and robustness. However, an issue (Issue 1) exists: conventional attention mechanisms perform QKV computation based on the hidden states output by the encoder. They neglect the corpus-derived statistical priors that can be extracted before encoding. Thus, the encoding process lacks the guidance of category-sensitive statistical tendencies. This limitation leads to insufficient utilization of category-related statistical information and affects the stability of encoding results. To address this gap, this study proposes a pre-attention mechanism designed specifically to extract corpus-derived statistical priors from the training corpus. This mechanism computes attention scores by analyzing the distribution of features at the category and text levels in the training corpus. Specifically, the pre-attention mechanism comprises two parts: category-level pre-attention (C-PA) and text-level pre-attention (T-PA), respectively.

Furthermore, another key issue (Issue 2) arises: how to effectively utilize the extracted statistical priors to guide the encoding process. As the GRU has been proven to be a suitable encoder for short-text sentiment classification, we modify its gating mechanism and propose the pre-attention GRU, abbreviated as PA-GRU. The statistical priors extracted by the pre-attention mechanism act as weights to guide the encoding process of the PA-GRU. It is designed to address the instability of encoding in sparse and informal short-text environments by integrating corpus-derived statistical priors into the internal gating manifold.

Parallel to these developments, the emergence of large language models (LLMs) has redefined the state-of-the-art. They leverage expansive linguistic representations acquired through unsupervised pre-training, yielding exceptional efficacy across diverse downstream applications. For sentiment classification tasks, while full-parameter fine-tuning is a common method to enhance task performance, it faces challenges of high computational demands. To address this issue, some studies have adopted parameter-efficient fine-tuning (PEFT) methods [9]. However, when an LLM is coupled with a heterogeneous downstream architecture such as an RNN, global parameter updates or uniform low-rank adaptation (LoRA) across all layers may still induce severe gradient instability and may disturb foundational linguistic priors, making such methods less suitable in such heterogeneous settings [10]. This leads to a third issue (Issue 3): how to design a fine-tuning strategy that allows the LLM to work stably with the downstream PA-GRU in short-text sentiment classification. To address this issue, this study proposes the LLM-PA-GRU model together with its strategic LLM fine-tuning method. Specifically, it selectively unfreezes the terminal layers of the pre-trained LLM, so that high-level semantic representations can be adapted to the downstream PA-GRU while reducing the instability caused by global adaptation.

The main contributions of this study are structured as follows:

The pre-attention Mechanism: This paper introduces and validates a pre-attention mechanism acting as corpus-derived, class-conditioned statistical priors extracted from the training set. The C-PA derives attention weights by analyzing differences in word co-occurrence tendencies, while T-PA captures intra-text mutual dependencies. Extensive ablation studies verify that these attention weights, combined via trainable coefficient matrices, resolve statistical scale mismatches for adaptive prior allocation.
The PA-GRU Cell and its gating mechanism: This study proposes the PA-GRU cell that improves the traditional GRU by integrating global pre-attention statistical priors to explicitly guide the gating mechanism. Experimental results confirm that this modified internal gating structure provides the necessary structural anchors to stabilize the encoding of informal short texts.
The LLM-PA-GRU Model and Strategic LLM Fine-tuning: This paper proposes the LLM-PA-GRU model for sentiment classification. It integrates the proposed pre-attention mechanism, the PA-GRU cell, and a strategic LLM fine-tuning scheme. By selectively unfreezing only the highest layers of a pre-trained LLM, the proposed method enables the LLM to provide high-level semantic representations that can be aligned with the downstream PA-GRU while reducing gradient instability caused by global adaptation methods.

Extensive experiments are conducted on four widely used benchmark datasets for short-text sentiment classification to evaluate the performance of the proposed LLM-PA-GRU model. Experimental results show that the proposed LLM-PA-GRU model achieves competitive or better performance compared with many literature-reported models on most datasets and matches the best result on SUBJ. Specifically, the model achieved peak accuracies of 99.37% on the SMSSpamCollection dataset, 94.05% on the MR dataset, 95.50% on the CR dataset, and 97.60% on the SUBJ dataset, verifying the rationality and effectiveness of the proposed mechanisms and integration strategies.

The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 details the proposed LLM-PA-GRU model and the fine-tuning method. Section 4 presents the experimental results. Section 5 discusses the mechanisms, efficiency, and limitations of the model. Finally, Section 6 concludes the study.

2. Related Work

2.1. The Limitations in Conventional Attention Mechanisms

Attention mechanisms have gradually become key components in sentiment classification due to their high adaptability, low computational cost, good interpretability, and excellent information filtering capabilities. Researchers often combine them with deep learning models to enhance semantic feature extraction. For example, Zhang et al. [11] propose a Sliced Bi-GRU model combining a multi-head self-attention mechanism and BERT embeddings. Similarly, Su et al. [12] propose a model based on an RNN combined with a CNN attention mechanism. Additionally, Sivakumar et al. [13] propose an Attention-based Convolutional Bidirectional Recurrent Neural Network model (ACBRNN), and Xiang et al. [14] propose a hierarchical neural network employing sequential attention mechanisms at the sentence level. Such hybrid methods excel in identifying emotional keywords and highlighting important information.

Nevertheless, a critical research gap persists: common attention mechanisms rely almost exclusively on dynamically encoded features, lacking a pathway to inject robust global statistical priors. Although co-occurrence-based methods [15] attempt to bridge this gap, they frequently overlook how varying word-category distributions impact discriminative precision. Consequently, to address this issue, this paper proposes a pre-attention mechanism, which identifies word-level category-sensitive co-occurrence tendencies at both the category and text levels. Furthermore, this study utilizes trainable coefficient matrices to adaptively fuse their statistical contributions.

2.2. The Limitations in Deep Learning Methods, Especially GRU

In recent years, sentiment classification has continued to attract widespread attention. Deep learning methods are widely applied to sentiment classification. CNNs are widely adopted due to their powerful local feature extraction capabilities [4]. Researchers combine CNN-based models with sentiment lexicons, word embeddings, and knowledge rules to extract emotional features [16]. However, CNNs are inadequate in modeling long-distance dependencies [17]. In contrast, RNNs and their variants, such as LSTMs and GRUs, are more adept at handling variable-length sequential data. Among these, GRUs are favored in short-text sentiment classification due to their parameter efficiency, fast convergence, and good generalization capability [18]. Furthermore, enhanced GRU-based variants have emerged, such as tree-structured GRU (TS-GRU) and bidirectional enhancements [19,20].

Despite these advances, a major limitation of GRU remains. Most existing models utilize the architecture’s intrinsic properties without fundamentally innovating the internal gating manifold. Specifically, the closest architectural analogues focus on local recursive structures. They neglect to integrate explicit global statistical priors into the gating cycle, thereby limiting the model’s capacity to anchor contextual category-related information against training-set-wide tendencies. Therefore, this paper improves the standard GRU gating mechanism by constructing a PA-GRU cell. This new architecture utilizes the extracted global priors as explicit statistical anchors. By embedding these statistical anchors directly into the internal gating cycle, this approach stabilizes the encoding process for informal short texts.

2.3. The Limitations in LLM Fine-Tuning Methods

Recently, with the rise of large-scale pre-trained language models like BERT and GPT, full-parameter fine-tuning is the mainstream method for enhancing downstream task performance. However, Rusyn et al. [21] point out that while full fine-tuning brings significant performance improvements, it is accompanied by high computational costs. To address this, parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are widely adopted.

However, these methods exhibit significant drawbacks in hybrid systems. The closest adaptation analogue, LoRA, applies updates across all layers, which can inadvertently erode the foundational linguistic priors stored in shallower transformer layers. To overcome this issue, this study proposes the LLM-PA-GRU model and its strategic LLM fine-tuning method. By capitalizing on the principle that terminal transformer layers encapsulate highly abstract, task-specific semantics rather than basic syntax, this approach unfreezes only the highest layers of the LLM to establish a more task-adapted semantic method. Additionally, a prototype-based alignment loss is applied for end-to-end joint fine-tuning. This configuration is intended to mitigate the gradient instability exacerbated by global adaptation methods.

In summary, existing methods still have clear room for improvement. Firstly, attention mechanisms take effect after encoding and thus do not make full use of corpus-derived statistical cues; secondly, recurrent gating mechanisms lack global statistical prior guidance; thirdly, LLM fine-tuning methods often lead to instability when the LLM is coupled with heterogeneous downstream modules. To address all these issues comprehensively, this paper proposes a hierarchical framework. It introduces the pre-attention mechanism, the PA-GRU cell, the LLM-PA-GRU model and the strategic LLM fine-tuning method.

The research gaps identified in this section and the corresponding architectural responses are summarized in Table 1.

3. Method

This section elucidates the structural components of the proposed framework. The framework comprises the innovation of the gating manifold to stabilize encoding in sparse environments and the fine-tuning method designed to bridge semantic spaces of the PA-GRU and the pre-trained LLM for optimized sentiment classification performance.

3.1. Pre-Attention Mechanism

Traditional attention mechanisms typically derive weights from dynamic hidden states generated during the encoding phase. However, such approaches often overlook the static statistical signals inherent in the training corpus. Specifically, they fail to account for the variance in word-pair co-occurrence across discrete sentiment categories or the localized dependency structures within individual texts.

We propose a dual-level pre-attention mechanism consisting of category-level pre-attention (C-PA) and text-level pre-attention (T-PA), as illustrated in Figure 1. The C-PA module measures the category-sensitive statistical variation of word pairs by computing the standard deviation of their co-occurrence tendencies across categories, yielding the category-level pre-attention matrix

α^{C}

. In parallel, T-PA captures intra-text dependencies through the matrix of co-occurrence tendency ratios

P

, from which the text-level pre-attention weight matrix

α^{T}

is derived. A core design of this mechanism involves the offline computation of these matrices exclusively from the training partition, thereby establishing a global anchor of weights to guide the encoding process. In this framework, the pre-computed matrices serve as a corpus-level statistical lookup table, providing explicit structural guidance for the subsequent GRU encoder. These measures are used as corpus-derived, class-conditioned statistical cues rather than direct semantic or probabilistic representations. Because C-PA and T-PA are drawn from heterogeneous statistical distributions, we further introduce trainable coefficient matrices to adaptively reweight and fuse them within a unified formulation, enabling the model to adaptively calibrate the relative influence of category-wide co-occurrence tendencies and text-specific dependency patterns during backpropagation.

3.1.1. Category-Level Pre-Attention

In short-text sentiment classification tasks, assuming any two words appear independently is inappropriate because their occurrences often follow class-sensitive patterns. Statistically, certain words are more common in specific sentiment categories and less common in others. This distribution difference reflects category-dependent statistical tendencies of words across the training corpus. The greater the distribution difference across categories, the more category-sensitive the associated statistical cue becomes. Thus, the category-level pre-attention mechanism aims to characterize class-conditioned co-occurrence tendencies of words based on the co-occurrence tendency of any two words, i.e., a word pair. Specifically, it first calculates the co-occurrence ratio of word pairs in each category, then uses standard deviation to compare the differences in co-occurrence tendencies of word pairs across n categories. Finally, the normalized value is used as the category-level pre-attention weight for the word pair.

Assume a short text’s word sequence consists of m words, the i-th word is denoted as

t_{i}

, and the text can be represented as

T = (t_{1}, t_{2}, \dots, t_{i}, \dots, t_{m})

. Furthermore, let the presence of word

t_{i}

and word

t_{j}

in the k-th category be marked as

o_{i}^{k}

and

o_{j}^{k}

respectively. Assume the document frequency of word

t_{i}

in texts of category k strictly within the training set is denoted as

f_{i}^{k} (1 \leq k \leq n)

, and its document frequency across all documents in the training corpus is denoted as

f_{i}

. Therefore, the document frequency ratio of word

t_{i}

in the k-th category based on the training prior can be calculated using Equation (1):

R (o_{i}^{k}) = \frac{f_{i}^{k}}{f_{i}}

(1)

We denote the co-occurrence frequency of word pair

t_{i}

and

t_{j}

in texts of category k as

f_{i, j}^{k}

. Their total occurrence across all categories, i.e., total document frequency, is denoted as

f_{i, j}

, satisfying

f_{i, j} = \sum_{k = 1}^{n} f_{i, j}^{k}

. Then, the conditional co-occurrence ratio of word

t_{j}

appearing in the k-th category when word

t_{i}

appears in the k-th category is calculated by Equation (2):

R (o_{j}^{k} ∣ o_{i}^{k}) = \frac{f_{i, j}^{k}}{f_{i}^{k}}

(2)

It can be derived that the co-occurrence tendency ratio of words

t_{i}

and

t_{j}

in the k-th category is shown as Equation (3):

r_{i, j}^{k} = R (o_{i}^{k}) \times R (o_{j}^{k} ∣ o_{i}^{k})

(3)

Therefore, for n categories, the co-occurrence tendency of words

t_{i}

and

t_{j}

can be represented as a vector

r = [r_{i, j}^{1}, r_{i, j}^{2}, \dots, r_{i, j}^{n}]

, where the mean of vector

r

is denoted as

{\bar{r}}_{i, j} = \frac{\sum_{k = 1}^{n} r_{i, j}^{k}}{n}

. These quantities are used as corpus-derived, class-conditioned statistical cues that may help identify category-relevant co-occurrence tendencies during encoding.

This paper uses the standard deviation

D_{i, j}

of the co-occurrence tendency ratios of words

t_{i}

and

t_{j}

across n categories to quantify the distribution difference of these two words across n categories. The calculation formula is Equation (4):

D_{i, j} = \sqrt{\frac{\sum_{k = 1}^{n} {(r_{i, j}^{k} - {\bar{r}}_{i, j})}^{2}}{n}}

(4)

Calculating the standard deviation of the co-occurrence tendency ratios for all word pairs composed of the words in the input text T, the co-occurrence tendency standard deviation matrix

D

(with size of $m \times m$ ) is obtained by Equation (5):

D \in R^{m \times m} = [\begin{matrix} D_{1, 1} & D_{1, 2} & \dots & D_{1, m} \\ D_{2, 1} & D_{2, 2} & \dots & D_{2, m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ D_{m, 1} & D_{m, 2} & \dots & D_{m, m} \end{matrix}]

(5)

The weight matrix for the category-level pre-attention of input text T can be represented as Equation (6):

α^{C} \in R^{m \times m} = softmax (D) = [α_{1}^{C}, α_{2}^{C}, \dots, α_{m}^{C}]

(6)

where

α_{i}^{C} = [α_{i, 1}^{C}, α_{i, 2}^{C}, \dots, α_{i, m}^{C}] \in R^{1 \times m}

represents the category-level pre-attention weight vector for word

t_{i}

in input text T.

3.1.2. Text-Level Pre-Attention

From a linguistic perspective, the co-occurrence of a word pair within a single sequence frequently mirrors their underlying semantic or syntactic dependency. While word pairs exhibiting strong dependency tend to co-occur consistently, those with weaker associations appear together less frequently—a distinction that serves as a critical signal for robust text sequence modeling. Text-level pre-attention is designed to quantify these latent relationships by computing the co-occurrence tendency ratio of word pairs across the broader linguistic context of the training set.

Consider a word pair

(t_{i}, t_{j})

within the sequence of text T. To establish a statistically grounded prior, we denote

d_{i}

and

d_{j}

as the document frequencies of words

t_{i}

and

t_{j}

, respectively, across the entire training corpus. Furthermore, the frequency with which this specific pair co-occurs within the same document across the training set is represented as the prior co-occurrence document frequency,

d_{i, j}

. By leveraging these corpus-wide statistics, the joint co-occurrence ratio of

t_{i}

and

t_{j}

appearing within a unified textual context is formally derived via Equation (7):

r_{i, j} = \frac{d_{i, j}}{d_{i} + d_{j}}

(7)

The co-occurrence tendency ratio matrix

P

with size of

m \times m

for all words in the input text T co-occurring within a single text is represented by Equation (8):

P \in R^{m \times m} = [\begin{matrix} r_{1, 1} & r_{1, 2} & \dots & r_{1, m} \\ r_{2, 1} & r_{2, 2} & \dots & r_{2, m} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ r_{m, 1} & r_{m, 2} & \dots & r_{m, m} \end{matrix}]

(8)

The text-level pre-attention weight matrix for input text T is computed using Equation (9):

α^{T} \in R^{m \times m} = softmax (P) = [α_{1}^{T}, α_{2}^{T}, \dots, α_{m}^{T}]

(9)

where

α_{i}^{T} = [α_{i, 1}^{T}, α_{i, 2}^{T}, \dots, α_{i, m}^{T}] \in R^{1 \times m}

represents the text-level pre-attention weight vector for word

t_{i}

in input text T.

3.1.3. Trainable Coefficient Matrices for Prior Allocation

Category-level and text-level pre-attention represent the dependency relationships of a word pair from distinct statistical dimensions. Assigning higher attention weights to word pairs with stronger dependencies guides the model to extract discriminative statistical information. However, from a statistical perspective,

α^{C}

and

α^{T}

operate on heterogeneous statistical distributions. Directly combining them would create a feature mismatch and inject statistical noise into the sequential model. To resolve this statistical incompatibility, this paper introduces

β

and

γ

not merely as attention weights, but as trainable coefficient matrices. During backpropagation, these matrices adaptively adjust the contributions of the two distinct statistical distributions through element-wise multiplication before their linear combination

F_{combine} (\cdot)

, ensuring adaptive prior fusion.

Specifically, the combination process of category-level and text-level pre-attention is shown in Figure 2. Matrices

D

and

P

are obtained from text T, transformed through the softmax function and the

F_{combine} (\cdot)

function, and the output is the pre-attention weight matrix

α

. In this same figure, the variable

V

represents the Value matrix, which in the context of the PA-GRU cell, corresponds to the current input embeddings

x_{i}

and the previous hidden states

h_{i - 1}

that are subsequently weighted by

α

.

For each pair

α_{i}^{C}

and

α_{i}^{T}

(where

α_{i}^{C} \in R^{1 \times m}

and

α_{i}^{T} \in R^{1 \times m}

are the i-th row weight vectors of

α^{C}

and

α^{T}

respectively), the combination formula

F_{combine} (\cdot)

is given in Equation (10). Here,

β

and

γ

are

m \times m

trainable coefficient matrices (

β \in R^{m \times m}

,

γ \in R^{m \times m}

), consistent with the dimensions of

α^{C}

and

α^{T}

(

α^{C} \in R^{m \times m}

,

α^{T} \in R^{m \times m}

), ensuring legal element-wise multiplication. To promote a balanced fusion of the two pre-attention sources, the corresponding elements of

β

and

γ

(i.e.,

β_{i, j}

and

γ_{i, j}

for the j-th element of

α_{i}^{C}

and

α_{i}^{T}

) are constrained to sum to 1 (i.e.,

β_{i, j} + γ_{i, j} = 1

for all

i, j \in [1, m]

), which helps prevent weight distortion and supports a stable linear fusion process.

α_{i} = F_{combine} (β, γ, α_{i}^{C}, α_{i}^{T}) = β_{i} ⊙ α_{i}^{C} + γ_{i} ⊙ α_{i}^{T}

(10)

where

β_{i}

and

γ_{i}

are the i-th row vectors of

β

and

γ

respectively (i.e.,

β_{i} \in R^{1 \times m}

,

γ_{i} \in R^{1 \times m}

), ⊙ denotes the element-wise multiplication (Hadamard product), and the matrix elements of

β

and

γ

satisfy

β_{i, j} \in (0, 1)

,

γ_{i, j} \in (0, 1)

and

β_{i, j} + γ_{i, j} = 1

.

Therefore, for input text T, its final pre-attention weight is the concatenation of all individual weights, defined in Equation (11):

α = α_{1} \oplus α_{2} \oplus \dots \oplus α_{m}

(11)

where

α_{i} = [α_{i, 1}, α_{i, 2}, \dots, α_{i, m}] \in R^{1 \times m}

, representing the pre-attention weight vector for word

t_{i}

in input text T, and the final pre-attention weight matrix

α \in R^{m \times m}

.

3.2. The PA-GRU Cell

In this study, a pre-attention mechanism GRU, PA-GRU, is designed by using the pre-attention mechanism to improve the gating mechanism of a GRU cell. The core design of the PA-GRU is redesigning the candidate state calculation to integrate prior anchors as a structural constraint.

Let

x_{i} \in R^{d}

be the d-dimensional word embedding vector of the i-th word

t_{i}

in the input text sequence T of total m words, and let the previous hidden state be denoted as

h_{i - 1} \in R^{d}

, which is defined in the same d-dimensional feature space as the word embedding. The input text sequence is then encoded as an embedding in Equation (12):

x_{1 : m} = x_{1} \oplus x_{2} \oplus \dots \oplus x_{m} .

(12)

Compared with the conventional GRU, as shown in Figure 3a, the architecture and computations within a PA-GRU cell are illustrated in Figure 3b. The pre-attention mechanism yields

α_{i} \in R^{1 \times m}

, which acts as an explicit global statistical prior rather than dynamically encoded local attention. In addition, the linear combination of

h_{i - 1}

and

x_{i}

undergoes activation through the sigmoid activation function

σ

(with bias integrated internally to guarantee gate values within

[0, 1]

), to form the reset gate

r_{i}

and the update gate

z_{i}

. On the one hand,

r_{i}

resets the previous hidden layer output

h_{i - 1}

; concurrently,

h_{i - 1}

and

x_{i}

are anchored using the extracted global prior

α_{i}

. While the standard GRU operations dynamically resolve local context and polysemy, the injection of

α_{i}

ensures the encoding process remains structurally guided by the corpus-wide category-sensitive statistical tendencies. The reset gate’s output, weighted results, and input

x_{i}

are then summed, yielding candidate states via the activation function tanh. On the other hand, the update gate

z_{i}

modifies the candidate state, while the update gate

1 - z_{i}

updates the previous hidden layer output

h_{i - 1}

.

To map the attention weights (

α_{i} \in R^{1 \times m}

) into the shared d-dimensional feature space of

x_{i}

and

h_{i - 1}

, we introduce a feature-space gating operator. Specifically, learnable projection matrices

W_{α x} \in R^{m \times d}

and

W_{α h} \in R^{m \times d}

project

α_{i}

into feature-space vectors. At time step i, the reset gate

r_{i}

first filters the previous hidden state

h_{i - 1}

to obtain

h_{i - 1}^{'}

. Then, the PA-GRU anchors

h_{i - 1}

and

x_{i}

using the projected priors. This is achieved by the element-wise Hadamard product (⊙) between the projected priors and the feature vectors, resulting in

{h_{i - 1}^{'}}^{a t t}

and

x_{i}^{a t t}

. These operations are mathematically defined by Equations (13)–(16):

r_{i} = σ (x_{i} \cdot W_{x r} + h_{i - 1} \cdot W_{h r} + b_{r}),

(13)

h_{i - 1}^{'} = r_{i} ⊙ (h_{i - 1} \cdot W_{h h}),

(14)

{h_{i - 1}^{'}}^{a t t} = (1 - α_{i} \cdot W_{α h}) ⊙ (h_{i - 1} \cdot W_{h a t t}),

(15)

x_{i}^{a t t} = (α_{i} \cdot W_{α x}) ⊙ (x_{i} \cdot W_{x a t t}) .

(16)

Combining the current input

x_{i}

with the outputs of Equations (14)–(16), a refined candidate state

{\hat{h}}_{i}

is generated using the tanh activation function. This candidate state

{\hat{h}}_{i}

integrates the advantages of both the dynamic gating mechanism and global pre-attention priors. To prevent information redundancy from these multiple additive paths, each component is regulated by an independent, learnable weight matrices (

W_{h h}

,

W_{h a t t}

,

W_{x a t t}

, and

W_{x h}

). These matrices act as balancing coefficients, allowing the network to adaptively fuse dynamic recurrence and static priors during backpropagation. The tanh activation further constrains the candidate state within

[- 1, 1]

, effectively suppressing information redundancy and gradient explosion. The refined candidate state

{\hat{h}}_{i}

is computed using Equation (17):

{\hat{h}}_{i} = tanh (h_{i - 1}^{'} + {h_{i - 1}^{'}}^{a t t} + x_{i}^{a t t} + x_{i} \cdot W_{x h} + b_{h}) .

(17)

Similarly, the update gate and final hidden state are computed using Equations (18) and (19):

z_{i} = σ (x_{i} \cdot W_{x z} + h_{i - 1} \cdot W_{h z} + b_{z}),

(18)

h_{i} = (1 - z_{i}) ⊙ h_{i - 1} + z_{i} ⊙ {\hat{h}}_{i} .

(19)

where

W_{x r}, W_{h r}, W_{x h}, W_{h h}, W_{x z}, W_{h z}, W_{h a t t}, W_{x a t t} \in R^{d \times d}

and

W_{α x}, W_{α h} \in R^{m \times d}

denote trainable parameter matrices,

1 \in R^{1 \times d}

denotes a vector of ones, and

b_{r}, b_{z}, b_{h} \in R^{d}

denote bias quantities.

In summary, through the training process of PA-GRU, the gating mechanism filters and updates information based on the sequential inputs and the pre-attention weights calculated by the combination of T-PA and C-PA. In contrast to GRU, because the pre-attention mechanism calculates weights strictly based on the training dataset, it provides the PA-GRU with a global statistical perspective for controlling information through explicit structural anchors that help address some limitations of GRU. Furthermore, the trainable coefficient matrices of the combination are adaptively updated through the training process.

3.3. The LLM-Based PA-GRU Model and Its Fine-Tuning Method

The architecture of the LLM-based PA-GRU model (LLM-PA-GRU) is depicted in Figure 4. At its core, the PA-GRU module comprises sequential cells deployed across consecutive time steps, flanked by a dual-branch feature extraction network.

On the left branch, we implement a strategic LLM fine-tuning protocol. In contrast to LoRA, which risks diluting foundational linguistic priors by distributing parameter plasticity indiscriminately, our approach freezes the majority of the LLM while selectively unfreezing only the final two transformer layers. This targeted unfreezing strategy creates a more task-adapted semantic method. It is intended to reduce gradient instability between the LLM and the PA-GRU. Thus, the high-level semantic representations from the LLM can be transferred more effectively to the downstream PA-GRU. Under this paradigm, an input text sequence is projected by the LLM into a sequence of word embeddings

x_{1} \oplus x_{2} \oplus \dots \oplus x_{m}

.

Simultaneously, the right branch utilizes the pre-attention mechanism to derive global prior vectors

α_{1} \oplus α_{2} \oplus \dots \oplus α_{m}

from the same input. These embeddings and statistical priors are fed synchronously into the PA-GRU module, where the hidden state

h_{i}

maintains temporal continuity. This integration uses explicit pre-attention weights to anchor each PA-GRU cell with word-level statistical priors.

During training, the model undergoes end-to-end joint fine-tuning encompassing the terminal LLM layers, the trainable coefficient matrices, and the PA-GRU parameters.

Conventional optimization mostly relies on task-specific cross-entropy loss. However, this loss often ignores the structural alignment between different modules. To solve this, a multi-objective optimization strategy is utilized. It combines the cross-entropy loss with a prototype-based alignment loss based on mean squared error. This design improves classification accuracy and maintains structural consistency between the LLM semantic space and the PA-GRU feature space.

Specifically, assume all short texts in the task setting can be classified into n categories, the k-th classification label is denoted as

y_{k}

, and all classification labels are represented as

Y = (y_{1}, y_{2}, \dots, y_{k}, \dots, y_{n})

. Assume the input text

T = (t_{1}, t_{2}, \dots, t_{m})

. After passing through the pre-trained large language model, the text obtains the hidden state matrix of the last layer

H_{d} \in R^{m \times d}

. This matrix, as a sequence of word embedding vectors, is then input into the PA-GRU module for sequence modeling, ultimately generating the sequence representation

h_{m} \in R^{d}

. It is further mapped through a classifier to obtain the predicted probability distribution

\hat{y}

. The end-to-end fine-tuning process for the LLM-PA-GRU parameters is defined as follows.

Step 1: based on the prediction result

\hat{y}

and the true labels, first compute the classic cross-entropy loss as Equation (20):

L_{C E} = - \sum_{i = 1}^{N} \sum_{c = 1}^{C} y_{i, c} log ({\hat{y}}_{i, c})

(20)

where N is the batch size, C is the number of categories, and

y_{i, c}

is the true label of sample i for category c.

Step 2: to enforce structural consistency across the heterogeneous network, we introduce a prototype-based alignment loss computed by mean squared error, as shown in Equation (21):

L_{align} = \frac{1}{N} \sum_{i = 1}^{N} {∥h_{m}^{(i)} - e_{y_{i}}∥}_{2}^{2}

(21)

where

e_{y_{i}}

is the class prototype vector comprising learnable parameters corresponding to label

y_{i}

, and

h_{m}^{(i)}

represents the sequence-ending hidden state of the i-th sample.

Step 3: the final total loss is obtained by combining the classification loss and the alignment loss, as defined in Equation (22):

L_{total} = L_{C E} + L_{align}

(22)

By jointly optimizing

L_{total}

, the model improves both classification accuracy and feature alignment. During fine-tuning, the unfrozen parameters are updated synchronously using backpropagation. Finally, the model extracts discriminative features to output the sentiment prediction.

4. Experiments and Results

4.1. Experimental Setup

4.1.1. Datasets

To validate the effectiveness of the pre-attention mechanism, we employ the official DeepSeek-LLM-7B-Base model [22] as the foundational language backbone. This model possesses 30 transformer layers and utilizes a hidden embedding dimension of 4096. The evaluation was conducted across four diverse benchmark datasets: the SMSSpamCollection dataset [23] from the UCI repository, and three sentiment/subjectivity benchmarks: MR (movie reviews) [24], CR (customer reviews) [25], and SUBJ (subjectivity dataset) [26].

SMSSpamCollection: Comprises 5574 English short messages, including 747 spam and 4827 non-spam instances.
MR: Consists of 5331 positive and 5331 negative movie review snippets extracted from Rotten Tomatoes.
CR: A product review dataset used for fine-grained customer feedback analysis.
SUBJ: Contains 5000 subjective reviews and 5000 objective plot summaries, focusing on subjectivity classification.

These benchmarks have different task definitions. However, they all involve short-text classification tasks. The model must capture affect, polarity, or subjectivity cues under sparse conditions. For example, SMSSpamCollection is an imbalanced spam detection dataset. Yet, spam messages often contain strong persuasive or affective language. We include it to test the model on short, sparse, and anomalous texts. In addition, although SUBJ is a subjectivity benchmark rather than a pure polarity benchmark, it is still closely related to opinion- and sentiment-related text analysis. Therefore, it is included to evaluate the proposed framework on subjectivity-oriented short-text classification under sparse conditions. In this evaluation setting, MR and CR are treated as sentiment benchmarks, SUBJ is treated as a subjectivity benchmark within the broader opinion- and sentiment-related evaluation spectrum, and SMSSpamCollection is included more cautiously as a sparse and anomalous short-text classification benchmark to further examine the robustness of the proposed framework.

Quantitative statistics of these datasets, including class distribution and average word counts, are summarized in Table 2. To ensure rigorous evaluation, all global statistical priors for the pre-attention mechanism were computed strictly from the designated training split of each dataset.

4.1.2. Performance Evaluation Metrics

Model performance is quantitatively assessed using four standard metrics: Accuracy, Precision, Recall, and F1-score [27]. These metrics provide a multidimensional view of the model’s discriminative capability across different class distributions.

4.1.3. Baselines

The LLM-PA-GRU model is compared against several state-of-the-art architectures. For SMSSpamCollection, baselines include CNN-LSTM [28], LGRU [29], Bi-GRU [30], CNN-GRU [31], GA + GXBoost [32], and LSTM + BERT-embedding [33]. For sentiment benchmarks (MR, CR, SUBJ), we compare against Bi-GRU [30], RCNNGWE [34], SCD [35], FFT-TIFS [36], DK-HDNN [37], EMW-CNN-BiLSTM [38], CICWE-QNN [39], and QPFE-ERNIE [40]. All baseline results in this study are based on the original publications, and the corresponding comparisons are presented as literature-based references.

4.1.4. Experimental Protocol and Hyperparameters

For completeness, the detailed training configuration is reported in Table 3. The experiments were implemented using the LlamaFactory framework. DeepSeek-LLM-7B-Base was adopted as the backbone model. During fine-tuning, most transformer blocks were frozen, and only the final two transformer layers were unfrozen.

During training, the checkpoint with the lowest validation loss was selected for evaluation. Each experiment was repeated 10 times, and the average results are reported.

4.2. Performance Results and Comparison

The performance results of the proposed LLM-PA-GRU model on the SMSSpamCollection dataset are summarized in Table 4. As indicated by the data, the model achieved an accuracy of 99.37%, which was higher than the literature-reported baseline results listed for comparison.

Table 5 presents the experimental results across the MR, CR, and SUBJ datasets. The LLM-PA-GRU model recorded classification accuracies of 94.05%, 95.50%, and 97.60% on these benchmarks, respectively.

Figure 5 visualizes these accuracy metrics against the baselines. It highlights the overall performance pattern of the proposed framework.

Detailed class-wise evaluation metrics for the proposed model, including precision, recall, and F1-score for each category across all four datasets, are provided in Table 6.

4.3. Ablation Study

In this section, we conduct ablation studies across the four benchmark datasets to evaluate the individual contribution of the proposed pre-attention components. The following three variant models are implemented for comparison:

GRU (Baseline): This model utilizes standard GRU layers to extract temporal dependencies, followed by a fully connected layer for final classification.
CPA-GRU: This variant incorporates only the category-level explicit global priors (C-PA) within the PA-GRU architecture to extract category-sensitive statistical tendencies.
TPA-GRU: This variant utilizes only the text-level global priors (T-PA) to capture word-level dependencies before the GRU sequence modeling.

Table 7 summarizes the performance comparison between the full LLM-PA-GRU model and these ablated versions.

The data indicate that the integrated LLM-PA-GRU model recorded the highest accuracy among the tested variants on these datasets. Specifically, the inclusion of either C-PA or T-PA results in a performance increase relative to the baseline GRU model, with the C-PA component showing a more pronounced improvement on the benchmarks.

5. Discussion

5.1. Mechanism of Hierarchical Pre-Attention and Gating Manifold

Evaluations on the SMSSpamCollection dataset reveal that the LLM-PA-GRU model achieves strong performance compared with conventional deep learning architectures, particularly in terms of classification accuracy. The core of this improvement lies in the pre-attention mechanism’s ability to extract category-level class tendencies and intra-text dependencies as explicit statistical priors. By embedding these structural anchors directly into the PA-GRU cells, the model addresses the limitations of standard sequential encoders. The results suggest that the LLM-PA-GRU model alleviates the lack of global guidance in conventional transformers.

In the context of the four benchmark sentiment datasets, the LLM-PA-GRU model achieves competitive or superior performance across the four datasets by consistently isolating more discriminative category-relevant statistical features. This advantage stems from the deep integration of pre-attention priors within the internal gating manifold of the PA-GRU. Conventional attention mechanisms rely on localized text relationships. They do not account for corpus-wide category distributions or integrate them into the encoding process. In contrast, the PA-GRU architecture ensures that at each temporal step, the hidden state transition is guided by explicit category-based and text-based priors via C-PA and T-PA. These statistical priors anchor the previous hidden state and the current input. This effectively filters out noise and preserves critical classification-related features. Furthermore, the use of trainable coefficient matrices resolves the inherent statistical mismatches between C-PA and T-PA, adaptively reweighting and fusing heterogeneous distributions for adaptive allocation. As a result, these explicit anchors guide the gating mechanism to emphasize category-informative features and improve classification precision.

Finally, the performance advantage of LLM-PA-GRU highlights the effectiveness of our fine-tuning protocol. By selectively unfreezing only the terminal transformer layers, we construct a more task-adapted semantic method. This configuration is intended to mitigate the gradient instability common in heterogeneous architectures. Combined with the prototype-based alignment loss, this strategy encourages consistency between the semantic space of the LLM representations and the temporal feature space of the PA-GRU.

5.2. Empirical Analysis of Computational Efficiency and Complexity

We provide empirical data on the model’s parameter scale and training dynamics to demonstrate its optimization efficiency. Unlike standard full-parameter updating, our protocol uses a selective layer-wise unfreezing strategy. We freeze the majority of the DeepSeek-LLM-7B-Base backbone. We only optimize the terminal two transformer layers and the GRU-based classification head. Thus, the trainable parameter count is approximately 420 M. This represents roughly 6.0% of the total 7 B backbone parameters.

The model also converges rapidly during training. On an NVIDIA A100 GPU, the LLM-PA-GRU reaches convergence within 3 epochs. Each epoch requires significantly less training time than full-parameter fine-tuning of 7B-scale models. This efficiency suggests the practical scalability of the proposed framework.

5.3. Limitations

This study has limitations. Although the PA-GRU gating logic is highly effective for anchoring dense category-related statistical features in short sequences, the GRU architecture inherently struggles with long-range dependencies. Consequently, the model’s performance may degrade on complex, long-form texts. This is why our framework focuses on short-text scenarios. In addition, the deployment of this framework in a production environment involves memory footprint trade-offs during the inference phase. While parameter-efficient, retaining the LLM backbone alongside the GRU head still requires substantial memory capacity and bandwidth for real-time applications.

6. Conclusions

This research developed a comprehensive framework for short-text sentiment analysis, integrating a dual-level pre-attention mechanism, the PA-GRU architecture, and a strategic LLM fine-tuning protocol.

The proposed pre-attention mechanism extracted corpus-derived, class-conditioned statistical priors from both inter-category and intra-text dimensions derived from the training corpus. This approach overcame the limitations of conventional attention mechanisms, which lacked a corpus-wide view of features. In addition, the proposed trainable coefficient matrices provided a solution to adaptively fuse these priors within a unified formulation for adaptive prior allocation. This methodology effectively addressed the neglect of corpus-wide category-dependent statistical tendencies and word-distribution variances that often hindered conventional encoding processes.

Furthermore, this study revised the internal gating manifold of the GRU by embedding global statistical priors directly, creating the PA-GRU cell. This structural innovation ensured that sequence encoding was no longer a localized, purely dynamic operation but was instead grounded by explicit dataset-wide anchors. The PA-GRU enabled the model to maintain consistent category-sensitive statistical guidance even in the presence of linguistic noise.

The LLM-PA-GRU model, supported by our strategic fine-tuning method, improved overall performance. In contrast to global adaptation techniques, we selectively unfroze the terminal transformer layers. This constructed a more task-adapted semantic method. The fine-tuning of the DeepSeek-LLM-7B-Base backbone restricted the trainable parameters to approximately 420 M. This represented roughly 6.0% of the total backbone parameters. This configuration was optimized through a prototype-based alignment loss. It reduced gradient instability and enforced structural consistency across heterogeneous feature spaces. As a result, the model reached convergence within 3 epochs on an NVIDIA A100 GPU.

Extensive benchmarking confirmed that this approach achieved competitive or better performance compared with many literature-reported models, particularly in sparse and informal short-text environments where traditional deep learning models often struggled. Specifically, the model achieved peak accuracies of 99.37% on SMSSpamCollection, 94.05% on MR, 95.50% on CR, and 97.60% on SUBJ.

Future research will focus on extending the integration of explicit statistical priors to other heterogeneous architectures. Exploring these corpus-derived benchmarks within diverse neural frameworks will refine internal computational mechanisms and foster the development of more robust, architecturally-grounded variants for a broader spectrum of NLP challenges.

Author Contributions

Conceptualization, T.X.; methodology, T.X., X.L. and Y.D.; software, X.L. and Y.D.; validation, T.X., X.L. and Y.D.; investigation, X.L. and Y.D.; writing—original draft preparation, X.L. and Y.D.; writing—review and editing, T.X. and F.Q.; visualization, X.L. and Y.D.; supervision, F.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Research Base of Online Education for Shanghai Middle and Primary Schools (Shanghai, China).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the UCI Machine Learning Repository and respective public archives cited in the text.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped improve the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

LLM	Large Language Model
GRU	Gated Recurrent Unit
PA-GRU	Pre-Attention Gated Recurrent Unit
C-PA	Category-level Pre-Attention
T-PA	Text-level Pre-Attention
LoRA	Low-Rank Adaptation
PEFT	Parameter-Efficient Fine-Tuning
NLP	Natural Language Processing
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory

References

Lee, Y.; Chen, A.S.; Tajwar, F.; Kumar, A.; Yao, H.; Liang, P.; Finn, C. Surgical Fine-Tuning Improves Adaptation to Distribution Shifts. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Zuo, Y.; Li, C.R.; Lin, H.; Wu, J.J. Topic Modeling of Short Texts: A Pseudo-Document View With Word Embedding Enhancement. IEEE Trans. Knowl. Data Eng. 2023, 35, 972–985. [Google Scholar] [CrossRef]
Saranya, S.; Usha, G. A Machine Learning-Based Technique with Intelligent WordNet Lemmatizer for Twitter Sentiment Analysis. Intell. Autom. Soft Comput. 2023, 36, 339–352. [Google Scholar] [CrossRef]
Kim, H.; Jeong, Y.S. Sentiment classification using convolutional neural networks. Appl. Sci. 2019, 9, 2347. [Google Scholar] [CrossRef]
Hameed, Z.; Garcia-Zapirain, B. Sentiment classification using a single-layered bilstm model. IEEE Access 2020, 8, 73992–74001. [Google Scholar] [CrossRef]
Al Wazrah, A.; Allumoud, S. Sentiment analysis using stacked gated recurrent unit for arabic tweets. IEEE Access 2021, 9, 137176–137187. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 74–80. [Google Scholar] [CrossRef]
Zhu, Y.; Zhou, R.; Chen, G.; Zhang, B. Enhancing sentiment analysis of online comments: A novel approach integrating topic modeling and deep learning. PeerJ Comput. Sci. 2024, 10, e2542. [Google Scholar] [CrossRef] [PubMed]
Patil, R.; Khot, P.; Gudivada, V. Analyzing LLAMA3 performance on classification task using LoRA and QLoRA techniques. Appl. Sci. 2025, 15, 3087. [Google Scholar] [CrossRef]
Liu, J.; Hu, T.; Zhang, Y.; Feng, Y.; Hao, J.; Lv, J.; Liu, Z. Parameter-efficient transfer learning for medical visual question answering. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2816–2826. [Google Scholar] [CrossRef]
Zhang, X.; Wu, Z.; Liu, K.; Zhao, Z.; Wang, J.; Wu, C. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU. Sensors 2023, 23, 1481. [Google Scholar] [CrossRef] [PubMed]
Su, B.; Peng, J. Sentiment analysis of comment texts on online courses based on hierarchical attention mechanism. Appl. Sci. 2023, 13, 4204. [Google Scholar] [CrossRef]
Sivakumar, S.; Haritha, D.; Ram, N.S.; Kumar, N.; Krishna, G.R.; Kumar, D.A. Attention-based convolution bidirectional recurrent neural network for sentiment analysis. Int. J. Decis. Support Syst. Technol. 2022, 14, 219–239. [Google Scholar] [CrossRef]
Xiang, H.; Jinjin, Z.; Peng, Z.; Xiaodong, M. Hierarchical neural network with serial attention mechanism for review sentiment classification. Neural Process. Lett. 2023, 55, 8269–8283. [Google Scholar] [CrossRef]
Feng, Y.; Wei, R. Method of multi-label visual emotion recognition fusing fore-background features. Appl. Sci. 2024, 14, 8564. [Google Scholar] [CrossRef]
Zhou, M.; Liu, D.; Zheng, Y.; Zhu, Q.; Guo, P. A text sentiment classification model using double word embedding methods. Multimed. Tools Appl. 2022, 81, 18993–19012. [Google Scholar] [CrossRef]
Esmi, N.; Shahbahrami, A.; Gaydadjiev, G.; de Jonge, P. Suicide ideation detection based on documents dimensionality expansion. Comput. Biol. Med. 2025, 192, 110266. [Google Scholar] [CrossRef]
Degife, W.A.; Lin, B.S. A multi-aspect informed GRU: A hybrid model of flight fare forecasting with sentiment analysis. Appl. Sci. 2024, 14, 4221. [Google Scholar] [CrossRef]
Low, H.Q.; Keikhosrokiani, P.; Asl, M.P. Decoding violence against women: Analysing harassment in middle eastern literature with machine learning and sentiment analysis. Humanit. Soc. Sci. Commun. 2024, 11, 497. [Google Scholar] [CrossRef]
Soloman, S.S.J.; Seb, B.; Baydeti, N.; Das, D.K. A CTO-based GRU model for identifying emotions from textual data. Knowl. Inf. Syst. 2025, 67, 4967–4990. [Google Scholar] [CrossRef]
Rusyn, V.; Boichuk, A.; Mochurad, L. Cross-language transfer-learning approach via a pretrained Preact ResNet-18 architecture for improving kanji recognition accuracy and enhancing a number of recognizable kanji. Appl. Sci. 2025, 15, 4894. [Google Scholar] [CrossRef]
DeepSeek-AI. DeepSeek-LLM-7B-Base. 2024. Available online: https://huggingface.co/deepseek-ai/deepseek-llm-7b-base (accessed on 13 April 2026).
Almeida, T.; Hidalgo, J.; Yamakami, A. Contributions to the study of SMS spam filtering: New collection and results. In Proceedings of the 11th ACM Symposium on Document Engineering, Mountain View, CA, USA, 19–22 September 2011; pp. 259–262. [Google Scholar] [CrossRef]
Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, MI, USA, 25–30 June 2005; pp. 115–124. [Google Scholar] [CrossRef]
Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 168–177. [Google Scholar] [CrossRef]
Pang, B.; Lee, L. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, 21–26 July 2004; pp. 271–278. [Google Scholar] [CrossRef]
Raghunathan, N.; Kandasamy, S. Challenges and Issues in Sentiment Analysis: A Comprehensive Survey. IEEE Access 2023, 11, 69626–69642. [Google Scholar] [CrossRef]
Hossain, S.; Sumon, J.; Sen, A.; Alam, M.; Kamal, K.; Alqahtani, H.; Sarker, I. Spam filtering of mobile SMS using CNN-LSTM based deep learning model. In Proceedings of the International Conference on Hybrid Intelligent Systems; Springer: Berlin/Heidelberg, Germany, 2021; pp. 106–116. [Google Scholar] [CrossRef]
Wei, F.; Nguyen, T. A lightweight deep neural model for sms spam detection. In Proceedings of the 2020 International Symposium on Networks, Computers and Communications (ISNCC); IEEE: New York, NY, USA, 2020; pp. 1–6. [Google Scholar] [CrossRef]
Kumar, A.; Rastogi, R. Attentional recurrent neural networks for sentence classification. In Proceedings of the Innovations in Infrastructure: Proceedings of ICIIF 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 549–559. [Google Scholar] [CrossRef]
Altunay, H.; Albayrak, Z. SMS spam detection system based on deep learning architectures for Turkish and English messages. Appl. Sci. 2024, 14, 11804. [Google Scholar] [CrossRef]
Ghatasheh, N.; Altaharwa, I.; Aldebei, K. Modified genetic algorithm for feature selection and hyper-parameter optimization: Case of XGBoost in spam prediction. IEEE Access 2022, 10, 84365–84383. [Google Scholar] [CrossRef]
Ghourabi, A.; Alohaly, M. Enhancing spam message classification and detection using transformer-based embedding and ensemble learning. Sensors 2023, 23, 3861. [Google Scholar] [CrossRef] [PubMed]
Onan, A. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 2098–2117. [Google Scholar] [CrossRef]
Klein, T.; Nabi, M. SCD: Self-contrastive decorrelation for sentence embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 394–400. [Google Scholar] [CrossRef]
Nandi, B.P.; Jain, A.; Tayal, D.K.; Narang, P.A. High performing sentiment analysis based on fast Fourier transform over temporal intuitionistic fuzzy value. Soft Comput. 2022, 26, 3059–3073. [Google Scholar] [CrossRef]
Khan, J.; Ahmad, N.; Lee, Y.; Khalid, S.; Hussain, D. Hybrid Deep Neural Network with Domain Knowledge for Text Sentiment Analysis. Mathematics 2025, 13, 1456. [Google Scholar] [CrossRef]
Gong, J.; Zhang, J.; Guo, W.; Ma, Z.; Lv, X. Short text classification based on explicit and implicit multiscale weighted semantic information. Symmetry 2023, 15, 2008. [Google Scholar] [CrossRef]
Shi, J.; Li, Z.; Lai, W.; Li, F.; Shi, R. Two end-to-end quantum-inspired deep neural networks for text classification. IEEE Trans. Knowl. Data Eng. 2023, 35, 4335–4345. [Google Scholar] [CrossRef]
Shi, J.; Chen, T.; Lai, W.; Zhang, S.; Li, X. Pretrained quantum-inspired deep neural network for natural language processing. IEEE Trans. Cybern. 2024, 54, 5973–5985. [Google Scholar] [CrossRef]

Figure 1. Category and Text Pre-Attention.

Figure 2. Combination process of C-PA and T-PA.

Figure 3. Architecture comparison between (a) the standard GRU and (b) the proposed PA-GRU mechanism.

Figure 4. The LLM-PA-GRU Model.

Figure 5. Accuracy (%) on three benchmark datasets.

Table 1. Concise comparison with architectural analogues.

Analogue	Limitation	Our Response
TS-GRU/Bi-GRU	No global statistical anchors.	PA-GRU: Modified GRU with gating mechanisms for pre-attention priors.
Dynamic Attention	Dependent on dynamic input.	Pre-attention: Additional extracted statistical priors.
LoRA/Full Tuning	Unstable in hybrid models.	Selective Unfreezing: Constructs a more task-adapted semantic method to mitigate instability.

Table 2. Quantitative Statistics of Experimental Datasets.

Dataset		SMSSpamCollection	MR	CR	SUBJ
Classes		2	2	2	2
Avg. Words		20	20	22	23
Quantity	Pos/Subj	4827	5331	2406	5000
Quantity	Neg/Obj	747	5331	1367	5000

Table 3. Detailed Hyperparameter Settings.

Hyperparameter	Value
Optimizer	AdamW
Learning Rate	$1.0 \times 10^{- 4}$
Batch Size (Per Device)	4
Gradient Accumulation Steps	8
Total Effective Batch Size	32
Training Epochs	3.0
Warmup Ratio	0.1
Dropout Rate	0.0
Repeated Runs	10
Model Selection Criterion	Lowest Validation Loss
Precision	FP16 (Mixed Precision)

Table 4. Results of Different Models on SMSSpamCollection.

Model	Precision	Recall	F1-Score	Accuracy
CNN-LSTM	0.9800	1.000	0.9900	0.9840
LGRU	0.9887	0.9977	0.9926	0.9904
Bi-GRU	0.9823	0.9476	0.9682	0.9907
CNN-GRU	0.9906	0.9937	0.9922	0.9907
LSTM + BERT-Embedding	0.9791	0.9527	0.9658	0.9910
GA + GXBoost	0.9835	0.9733	0.9669	0.9912
LLM-PA-GRU	0.9959	0.9969	0.9964	0.9937

Note: Bold values indicate the best performance for each metric.

Table 5. Results of Different Models on MR, CR, SUBJ Datasets.

Model	MR	CR	SUBJ
Bi-GRU	81.24	84.88	92.80
RCNNGWE	84.19	N/A	N/A
SCD	82.17	87.76	93.67
FFT-TIFS	79.40	86.60	N/A
DK-HDNN	93.80	95.25	N/A
EMW-CNN-BiLSTM	85.70	N/A	96.90
CICWE-QNN	78.31	83.33	93.23
QPFE-ERNIE	89.96	94.71	97.60
LLM-PA-GRU	94.05	95.50	97.60

Note: “N/A” indicates that the experimental results were not reported in the original publications. Bold values indicate the best performance for each dataset. All LLM-PA-GRU results are reported as the mean of ten independent runs under the same experimental configuration, and the observed standard deviations of the accuracy were 0.05% on MR, 0.03% on CR, and 0.07% on SUBJ, indicating low variability across repeated trials.

Table 6. Class-wise Performance Metrics Using LLM-PA-GRU.

Dataset	Class	Precision	Recall	F1-Score	Accuracy
SMSSpamCollection	Non-Spam	0.9959	0.9969	0.9964	0.9937
SMSSpamCollection	Spam	0.9789	0.9720	0.9754	0.9937
MR	Pos	0.9350	0.9431	0.9390	0.9405
MR	Neg	0.9457	0.9380	0.9418	0.9405
CR	Pos	0.9706	0.9585	0.9645	0.9550
CR	Neg	0.9283	0.9487	0.9384	0.9550
SUBJ	Obj	0.9746	0.9784	0.9765	0.9760
SUBJ	Subj	0.9775	0.9735	0.9755	0.9760

Table 7. Performance Comparison of LLM-PA-GRU and Ablated Variants.

Model	SMSSpamCollection	MR	CR	SUBJ
GRU	99.03	80.79	83.84	91.23
CPA-GRU	99.25	84.01	84.42	95.39
TPA-GRU	99.17	81.32	84.37	93.68
LLM-PA-GRU	99.37	94.05	95.50	97.60

Note: Bold values indicate the best performance for each dataset.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xia, T.; Liu, X.; Deng, Y.; Qiu, F. Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning. Appl. Sci. 2026, 16, 4515. https://doi.org/10.3390/app16094515

AMA Style

Xia T, Liu X, Deng Y, Qiu F. Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning. Applied Sciences. 2026; 16(9):4515. https://doi.org/10.3390/app16094515

Chicago/Turabian Style

Xia, Tian, Xuan Liu, Yuancheng Deng, and Feng Qiu. 2026. "Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning" Applied Sciences 16, no. 9: 4515. https://doi.org/10.3390/app16094515

APA Style

Xia, T., Liu, X., Deng, Y., & Qiu, F. (2026). Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning. Applied Sciences, 16(9), 4515. https://doi.org/10.3390/app16094515

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Augmenting Sentiment Analysis with a Hierarchical Pre-Attention Framework and Strategic LLM Fine-Tuning

Abstract

1. Introduction

2. Related Work

2.1. The Limitations in Conventional Attention Mechanisms

2.2. The Limitations in Deep Learning Methods, Especially GRU

2.3. The Limitations in LLM Fine-Tuning Methods

3. Method

3.1. Pre-Attention Mechanism

3.1.1. Category-Level Pre-Attention

3.1.2. Text-Level Pre-Attention

3.1.3. Trainable Coefficient Matrices for Prior Allocation

3.2. The PA-GRU Cell

3.3. The LLM-Based PA-GRU Model and Its Fine-Tuning Method

4. Experiments and Results

4.1. Experimental Setup

4.1.1. Datasets

4.1.2. Performance Evaluation Metrics

4.1.3. Baselines

4.1.4. Experimental Protocol and Hyperparameters

4.2. Performance Results and Comparison

4.3. Ablation Study

5. Discussion

5.1. Mechanism of Hierarchical Pre-Attention and Gating Manifold

5.2. Empirical Analysis of Computational Efficiency and Complexity

5.3. Limitations

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI