1. Introduction
Short-text sentiment classification plays a vital role in natural language processing (NLP), with widespread applications across healthcare, finance, education, and commerce. In today’s information age, social media platforms have become deeply integrated into human life, generating vast amounts of emotional data. These data, often in the form of short texts, contain rich emotional insights encompassing stances, opinions, views, and moods. However, the inherent characteristics of short texts, i.e., short length, sparse features, and informal expression, pose significant challenges to accurate sentiment classification. Accurate classification therefore requires robust methods that can distill latent emotional signals from noisy, short-form text. Despite the rapid development of deep learning and large language models (LLMs), existing methods still underuse corpus-derived statistical priors during encoding and leave room for tighter architectural integration, which motivates this study to explore more effective solutions for short-text sentiment classification.
Researchers have extensively investigated sentiment classification using both traditional machine learning techniques and deep learning methods. Traditional machine learning methods typically represent short texts as high-dimensional matrices, which often exhibit extreme sparsity due to the brevity of short-text content, thereby severely degrading classification performance. This process exhibits the typical difficulties of short-text sentiment classification, including shortness [1], sparsity [2], and informal writing [3]. Nevertheless, the core idea of leveraging statistical prior knowledge embedded in the training corpus, which is widely adopted in traditional machine learning, still provides valuable insights for the design of deep learning models.
In contrast, the application of deep learning models in recent years has significantly advanced research in sentiment classification. These models include non-sequential architectures primarily based on convolutional neural networks (CNNs) [4] and sequential architectures represented by recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) [5]. Notably, since natural language is inherently sequential, sequential deep learning models have demonstrated exceptional performance in short-text sentiment classification. Among them, the gated recurrent unit (GRU) [6] remains highly competitive in short-text sentiment classification owing to its parameter efficiency compared with RNN and LSTM architectures. Its structural simplicity, characterized by a streamlined gating mechanism, effectively mitigates the vanishing and exploding gradient problems common in RNNs, captures long-term feature dependencies in short texts, and makes the GRU a suitable encoder for short-text sentiment classification tasks.
Simultaneously, attention mechanisms have been widely adopted in sequential deep learning models to enhance sentiment classification. Initially introduced for machine translation in NLP [7], attention mechanisms allocate attention by measuring relationships between vectors generated by the encoder. They assign higher weights to critical information and lower weights to irrelevant information, thereby improving the performance of deep learning models [8]. These weights continuously adapt during training, ultimately enhancing model scalability and robustness. However, an issue (Issue 1) exists: conventional attention mechanisms perform QKV computation on the hidden states output by the encoder and neglect the corpus-derived statistical priors that can be extracted before encoding. The encoding process therefore lacks the guidance of category-sensitive statistical tendencies, which leads to insufficient utilization of category-related statistical information and affects the stability of encoding results. To address this gap, this study proposes a pre-attention mechanism designed specifically to extract corpus-derived statistical priors from the training corpus. The mechanism computes attention scores by analyzing the distribution of features at the category and text levels in the training corpus and comprises two parts: category-level pre-attention (C-PA) and text-level pre-attention (T-PA).
Furthermore, another key issue (Issue 2) arises: how to effectively utilize the extracted statistical priors to guide the encoding process. As the GRU has proven to be a suitable encoder for short-text sentiment classification, we modify its gating mechanism and propose the pre-attention GRU, abbreviated as PA-GRU. The statistical priors extracted by the pre-attention mechanism act as weights that guide the encoding process of the PA-GRU, which is designed to address the instability of encoding in sparse and informal short-text environments by integrating corpus-derived statistical priors into the internal gating manifold.
Parallel to these developments, the emergence of large language models (LLMs) has redefined the state of the art. LLMs leverage expansive linguistic representations acquired through unsupervised pre-training, yielding exceptional efficacy across diverse downstream applications. For sentiment classification, full-parameter fine-tuning is a common way to enhance task performance, but it imposes high computational demands. To address this issue, some studies have adopted parameter-efficient fine-tuning (PEFT) methods [9]. However, when an LLM is coupled with a heterogeneous downstream architecture such as an RNN, global parameter updates or uniform low-rank adaptation (LoRA) across all layers may still induce severe gradient instability and disturb foundational linguistic priors, making such methods less suitable in heterogeneous settings [10]. This leads to a third issue (Issue 3): how to design a fine-tuning strategy that allows the LLM to work stably with the downstream PA-GRU in short-text sentiment classification. To address this issue, this study proposes the LLM-PA-GRU model together with its strategic LLM fine-tuning method. Specifically, the method selectively unfreezes the terminal layers of the pre-trained LLM, so that high-level semantic representations can be adapted to the downstream PA-GRU while reducing the instability caused by global adaptation.
The main contributions of this study are as follows:
The Pre-Attention Mechanism: This paper introduces and validates a pre-attention mechanism that supplies corpus-derived, class-conditioned statistical priors extracted from the training set. C-PA derives attention weights by analyzing differences in word co-occurrence tendencies across categories, while T-PA captures intra-text mutual dependencies. Extensive ablation studies verify that these attention weights, combined via trainable coefficient matrices, resolve statistical scale mismatches and enable adaptive prior allocation.
The PA-GRU Cell and its gating mechanism: This study proposes the PA-GRU cell that improves the traditional GRU by integrating global pre-attention statistical priors to explicitly guide the gating mechanism. Experimental results confirm that this modified internal gating structure provides the necessary structural anchors to stabilize the encoding of informal short texts.
The LLM-PA-GRU Model and Strategic LLM Fine-tuning: This paper proposes the LLM-PA-GRU model for sentiment classification. It integrates the proposed pre-attention mechanism, the PA-GRU cell, and a strategic LLM fine-tuning scheme. By selectively unfreezing only the highest layers of a pre-trained LLM, the proposed method enables the LLM to provide high-level semantic representations that can be aligned with the downstream PA-GRU while reducing gradient instability caused by global adaptation methods.
Extensive experiments are conducted on four widely used benchmark datasets for short-text sentiment classification to evaluate the performance of the proposed LLM-PA-GRU model. Experimental results show that the proposed LLM-PA-GRU model achieves competitive or better performance compared with many literature-reported models on most datasets and matches the best result on SUBJ. Specifically, the model achieved peak accuracies of 99.37% on the SMSSpamCollection dataset, 94.05% on the MR dataset, 95.50% on the CR dataset, and 97.60% on the SUBJ dataset, verifying the rationality and effectiveness of the proposed mechanisms and integration strategies.
The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 details the proposed LLM-PA-GRU model and the fine-tuning method. Section 4 presents the experimental results. Section 5 discusses the mechanisms, efficiency, and limitations of the model. Finally, Section 6 concludes the study.
3. Method
This section details the structural components of the proposed framework: the pre-attention mechanism and the redesigned gating manifold that stabilize encoding in sparse environments, and the fine-tuning method designed to bridge the semantic spaces of the PA-GRU and the pre-trained LLM for optimized sentiment classification performance.
3.1. Pre-Attention Mechanism
Traditional attention mechanisms typically derive weights from dynamic hidden states generated during the encoding phase. However, such approaches often overlook the static statistical signals inherent in the training corpus. Specifically, they fail to account for the variance in word-pair co-occurrence across discrete sentiment categories or the localized dependency structures within individual texts.
We propose a dual-level pre-attention mechanism consisting of category-level pre-attention (C-PA) and text-level pre-attention (T-PA), as illustrated in Figure 1. The C-PA module measures the category-sensitive statistical variation of word pairs by computing the standard deviation of their co-occurrence tendencies across categories, yielding the category-level pre-attention matrix $A^{C}$. In parallel, T-PA captures intra-text dependencies through the matrix of co-occurrence tendency ratios $R^{T}$, from which the text-level pre-attention weight matrix $A^{T}$ is derived. A core design of this mechanism involves the offline computation of these matrices exclusively from the training partition, thereby establishing a global anchor of weights to guide the encoding process. In this framework, the pre-computed matrices serve as a corpus-level statistical lookup table, providing explicit structural guidance for the subsequent GRU encoder. These measures are used as corpus-derived, class-conditioned statistical cues rather than direct semantic or probabilistic representations. Because C-PA and T-PA are drawn from heterogeneous statistical distributions, we further introduce trainable coefficient matrices to reweight and fuse them within a unified formulation, enabling the model to adaptively calibrate the relative influence of category-wide co-occurrence tendencies and text-specific dependency patterns during backpropagation.
3.1.1. Category-Level Pre-Attention
In short-text sentiment classification tasks, assuming any two words appear independently is inappropriate because their occurrences often follow class-sensitive patterns. Statistically, certain words are more common in specific sentiment categories and less common in others. This distribution difference reflects category-dependent statistical tendencies of words across the training corpus. The greater the distribution difference across categories, the more category-sensitive the associated statistical cue becomes. Thus, the category-level pre-attention mechanism aims to characterize class-conditioned co-occurrence tendencies of words based on the co-occurrence tendency of any two words, i.e., a word pair. Specifically, it first calculates the co-occurrence ratio of word pairs in each category, then uses standard deviation to compare the differences in co-occurrence tendencies of word pairs across n categories. Finally, the normalized value is used as the category-level pre-attention weight for the word pair.
Assume a short text’s word sequence consists of $m$ words, the $i$-th word is denoted as $w_i$, and the text can be represented as $T = \{w_1, w_2, \ldots, w_m\}$. Furthermore, consider the occurrences of word $w_i$ and word $w_j$ in texts of the $k$-th category. Assume the document frequency of word $w_i$ in texts of category $k$ strictly within the training set is denoted as $\mathrm{df}_k(w_i)$, and its document frequency across all documents in the training corpus is denoted as $\mathrm{df}(w_i)$. Therefore, the document frequency ratio of word $w_i$ in the $k$-th category based on the training prior can be calculated using Equation (1):
$$r_k(w_i) = \frac{\mathrm{df}_k(w_i)}{\mathrm{df}(w_i)} \tag{1}$$
We denote the co-occurrence frequency of word pair $w_i$ and $w_j$ in texts of category $k$ as $\mathrm{df}_k(w_i, w_j)$. Their total occurrence across all categories, i.e., total document frequency, is denoted as $\mathrm{df}(w_i, w_j)$, satisfying $\mathrm{df}(w_i, w_j) = \sum_{k=1}^{n} \mathrm{df}_k(w_i, w_j)$. Then, the conditional co-occurrence ratio of word $w_i$ appearing in the $k$-th category when word $w_j$ appears in the $k$-th category is calculated by Equation (2):
$$r_k(w_i \mid w_j) = \frac{\mathrm{df}_k(w_i, w_j)}{\mathrm{df}_k(w_j)} \tag{2}$$
It can be derived that the co-occurrence tendency ratio of words $w_i$ and $w_j$ in the $k$-th category is shown as Equation (3):
$$t_k(w_i, w_j) = \frac{r_k(w_i \mid w_j)}{r_k(w_i)} \tag{3}$$
Therefore, for $n$ categories, the co-occurrence tendency of words $w_i$ and $w_j$ can be represented as a vector $\boldsymbol{t}(w_i, w_j) = \left(t_1(w_i, w_j), \ldots, t_n(w_i, w_j)\right)$, where the mean of vector $\boldsymbol{t}(w_i, w_j)$ is denoted as $\bar{t}(w_i, w_j)$. These quantities are used as corpus-derived, class-conditioned statistical cues that may help identify category-relevant co-occurrence tendencies during encoding.
This paper uses the standard deviation $\sigma(w_i, w_j)$ of the co-occurrence tendency ratios of words $w_i$ and $w_j$ across $n$ categories to quantify the distribution difference of these two words across the $n$ categories. The calculation formula is Equation (4):
$$\sigma(w_i, w_j) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( t_k(w_i, w_j) - \bar{t}(w_i, w_j) \right)^2} \tag{4}$$
Calculating the standard deviation of the co-occurrence tendency ratios for all word pairs composed of the words in the input text $T$, the co-occurrence tendency standard deviation matrix $D$ (with size of $m \times m$) is obtained by Equation (5):
$$D = \left[ \sigma(w_i, w_j) \right]_{m \times m} \tag{5}$$
The weight matrix for the category-level pre-attention of input text $T$ can be represented as Equation (6):
$$A^{C} = \operatorname{softmax}(D) \tag{6}$$
where the softmax normalization is applied row-wise and $a^{C}_i$, the $i$-th row of $A^{C}$, represents the category-level pre-attention weight vector for word $w_i$ in input text $T$.
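To make the C-PA computation concrete, the following minimal Python sketch derives the weight matrix for one text from a toy training corpus, following Equations (1)–(6) as reconstructed above. All identifiers (`category_pre_attention`, `df_k`, `co_df_k`) are illustrative rather than taken from the original implementation, and the division-by-zero guards are our assumption:

```python
import numpy as np
from collections import defaultdict

def category_pre_attention(text, corpus, n_categories):
    """Compute the m x m C-PA weight matrix A_C for one tokenized text.

    `corpus` is a list of (tokens, label) pairs from the training split.
    """
    df = defaultdict(int)       # df(w): corpus-wide document frequency
    df_k = defaultdict(int)     # df_k(w): per-category document frequency
    co_df_k = defaultdict(int)  # df_k(wi, wj): per-category pair frequency
    for tokens, label in corpus:
        vocab = set(tokens)
        for w in vocab:
            df[w] += 1
            df_k[(w, label)] += 1
        for wi in vocab:
            for wj in vocab:
                if wi != wj:
                    co_df_k[(wi, wj, label)] += 1

    m = len(text)
    D = np.zeros((m, m))
    for i, wi in enumerate(text):
        for j, wj in enumerate(text):
            t = []  # co-occurrence tendency ratios across categories, Eq. (3)
            for k in range(n_categories):
                r_cond = co_df_k[(wi, wj, k)] / max(df_k[(wj, k)], 1)  # Eq. (2)
                r_marg = df_k[(wi, k)] / max(df[wi], 1)                # Eq. (1)
                t.append(r_cond / max(r_marg, 1e-8))
            D[i, j] = np.std(t)  # distribution difference across categories, Eq. (4)

    # Row-wise softmax turns the std-dev matrix D (Eq. 5) into C-PA weights (Eq. 6).
    e = np.exp(D - D.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

In line with the offline design described above, these statistics would be computed once over the training partition and cached, so that encoding incurs only a lookup.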
3.1.2. Text-Level Pre-Attention
From a linguistic perspective, the co-occurrence of a word pair within a single sequence frequently mirrors their underlying semantic or syntactic dependency. While word pairs exhibiting strong dependency tend to co-occur consistently, those with weaker associations appear together less frequently—a distinction that serves as a critical signal for robust text sequence modeling. Text-level pre-attention is designed to quantify these latent relationships by computing the co-occurrence tendency ratio of word pairs across the broader linguistic context of the training set.
Consider a word pair $(w_i, w_j)$ within the sequence of text $T$. To establish a statistically grounded prior, we denote $\mathrm{df}(w_i)$ and $\mathrm{df}(w_j)$ as the document frequencies of words $w_i$ and $w_j$, respectively, across the entire training corpus. Furthermore, the frequency with which this specific pair co-occurs within the same document across the training set is represented as the prior co-occurrence document frequency $\mathrm{df}(w_i, w_j)$. By leveraging these corpus-wide statistics, the joint co-occurrence ratio of $w_i$ and $w_j$ appearing within a unified textual context is formally derived via Equation (7):
$$r(w_i, w_j) = \frac{\mathrm{df}(w_i, w_j)}{\mathrm{df}(w_i) + \mathrm{df}(w_j) - \mathrm{df}(w_i, w_j)} \tag{7}$$
The co-occurrence tendency ratio matrix $R^{T}$ with size of $m \times m$ for all words in the input text $T$ co-occurring within a single text is represented by Equation (8):
$$R^{T} = \left[ r(w_i, w_j) \right]_{m \times m} \tag{8}$$
The text-level pre-attention weight matrix for input text $T$ is computed using Equation (9):
$$A^{T} = \operatorname{softmax}(R^{T}) \tag{9}$$
where the softmax normalization is applied row-wise and $a^{T}_i$, the $i$-th row of $A^{T}$, represents the text-level pre-attention weight vector for word $w_i$ in input text $T$.
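Analogously, a minimal sketch of the T-PA weights follows, using a Jaccard-style reading of the joint co-occurrence ratio in Equation (7); the document-frequency dictionaries are assumed to be precomputed on the training split, and the function name is ours:

```python
import numpy as np

def text_pre_attention(text, df, co_df):
    """Compute the m x m T-PA weight matrix A_T for one tokenized text.

    `df[w]` and `co_df[(wi, wj)]` are document frequencies precomputed
    on the training split only.
    """
    m = len(text)
    R = np.zeros((m, m))
    for i, wi in enumerate(text):
        for j, wj in enumerate(text):
            pair = co_df.get((wi, wj), 0)
            union = df.get(wi, 0) + df.get(wj, 0) - pair
            R[i, j] = pair / max(union, 1)            # Eq. (7) entries of Eq. (8)
    e = np.exp(R - R.max(axis=1, keepdims=True))      # row-wise softmax, Eq. (9)
    return e / e.sum(axis=1, keepdims=True)
```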
3.1.3. Trainable Coefficient Matrices for Prior Allocation
Category-level and text-level pre-attention represent the dependency relationships of a word pair from distinct statistical dimensions. Assigning higher attention weights to word pairs with stronger dependencies guides the model to extract discriminative statistical information. However, from a statistical perspective, $A^{C}$ and $A^{T}$ operate on heterogeneous statistical distributions. Directly combining them would create a feature mismatch and inject statistical noise into the sequential model. To resolve this statistical incompatibility, this paper introduces $\Theta^{C}$ and $\Theta^{T}$ not merely as attention weights, but as trainable coefficient matrices. During backpropagation, these matrices adaptively adjust the contributions of the two distinct statistical distributions through element-wise multiplication before their linear combination in Equation (10), ensuring adaptive prior fusion.
Specifically, the combination process of category-level and text-level pre-attention is shown in Figure 2. Matrices $D$ and $R^{T}$ are obtained from text $T$, transformed through the softmax function into $A^{C}$ and $A^{T}$, and fused via the combination of Equation (10); the output is the pre-attention weight matrix $A$. In the same figure, the variable $V$ represents the Value matrix, which, in the context of the PA-GRU cell, corresponds to the current input embeddings $x_i$ and the previous hidden states $h_{i-1}$ that are subsequently weighted by $A$.
For each pair $a^{C}_i$ and $a^{T}_i$ (where $a^{C}_i$ and $a^{T}_i$ are the $i$-th row weight vectors of $A^{C}$ and $A^{T}$, respectively), the combination formula for $a_i$ is given in Equation (10). Here, $\Theta^{C}$ and $\Theta^{T}$ are trainable coefficient matrices ($\Theta^{C}, \Theta^{T} \in \mathbb{R}^{m \times m}$), consistent with the dimensions of $A^{C}$ and $A^{T}$ ($A^{C}, A^{T} \in \mathbb{R}^{m \times m}$), ensuring legal element-wise multiplication. To promote a balanced fusion of the two pre-attention sources, the corresponding elements of $\Theta^{C}$ and $\Theta^{T}$ (i.e., $\theta^{C}_{ij}$ and $\theta^{T}_{ij}$ for the $j$-th element of $\theta^{C}_i$ and $\theta^{T}_i$) are constrained to sum to 1 (i.e., $\theta^{C}_{ij} + \theta^{T}_{ij} = 1$ for all $i, j$), which helps prevent weight distortion and supports a stable linear fusion process.
$$a_i = \theta^{C}_i \odot a^{C}_i + \theta^{T}_i \odot a^{T}_i \tag{10}$$
where $\theta^{C}_i$ and $\theta^{T}_i$ are the $i$-th row vectors of $\Theta^{C}$ and $\Theta^{T}$, respectively (i.e., $\theta^{C}_i, \theta^{T}_i \in \mathbb{R}^{m}$), $\odot$ denotes the element-wise multiplication (Hadamard product), and the matrix elements of $\Theta^{C}$ and $\Theta^{T}$ satisfy $\theta^{C}_{ij} \in [0, 1]$, $\theta^{T}_{ij} \in [0, 1]$, and $\theta^{C}_{ij} + \theta^{T}_{ij} = 1$.
Therefore, for input text $T$, its final pre-attention weight is the concatenation of all individual weights, defined in Equation (11):
$$A = \left[ a_1; a_2; \ldots; a_m \right] \tag{11}$$
where $a_i \in \mathbb{R}^{m}$ represents the pre-attention weight vector for word $w_i$ in input text $T$, and the final pre-attention weight matrix satisfies $A \in \mathbb{R}^{m \times m}$.
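A minimal PyTorch sketch of this fusion follows. The paper states the sum-to-1 constraint but not how it is imposed; parameterizing one coefficient matrix as a sigmoid of free logits and the other as its complement is our assumption:

```python
import torch
import torch.nn as nn

class PreAttentionFusion(nn.Module):
    """Fuse C-PA and T-PA weights with trainable coefficient matrices, Eq. (10)."""

    def __init__(self, max_len):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, A_C, A_T):
        # A_C, A_T: (m, m) pre-attention matrices, precomputed offline.
        m = A_C.size(0)
        theta_C = torch.sigmoid(self.logits[:m, :m])
        theta_T = 1.0 - theta_C  # element-wise sum-to-1 holds by construction
        return theta_C * A_C + theta_T * A_T  # rows are the fused vectors a_i
```

With zero-initialized logits, $\theta^{C}_{ij} = \theta^{T}_{ij} = 0.5$, so the fusion starts evenly balanced and training then shifts it element by element.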
3.2. The PA-GRU Cell
In this study, a pre-attention GRU (PA-GRU) is designed by using the pre-attention mechanism to improve the gating mechanism of a GRU cell. The core design of the PA-GRU is a redesigned candidate-state calculation that integrates the prior anchors as a structural constraint.
Let $x_i \in \mathbb{R}^{d}$ be the $d$-dimensional word embedding vector of the $i$-th word $w_i$ in the input text sequence $T$ of total $m$ words, and let the previous hidden state be denoted as $h_{i-1} \in \mathbb{R}^{d}$, which is defined in the same $d$-dimensional feature space as the word embedding. The input text sequence is then encoded as an embedding in Equation (12):
$$X = (x_1, x_2, \ldots, x_m) \tag{12}$$
Compared with the conventional GRU, as shown in Figure 3a, the architecture and computations within a PA-GRU cell are illustrated in Figure 3b. The pre-attention mechanism yields $a_i$, which acts as an explicit global statistical prior rather than dynamically encoded local attention. In addition, the linear combination of $x_i$ and $h_{i-1}$ undergoes activation through the sigmoid activation function $\sigma(\cdot)$ (with bias integrated internally to guarantee gate values within $(0, 1)$), to form the reset gate $r_i$ and the update gate $z_i$. On the one hand, $r_i$ resets the previous hidden layer output $h_{i-1}$; concurrently, $x_i$ and the reset hidden state are anchored using the extracted global prior $a_i$. While the standard GRU operations dynamically resolve local context and polysemy, the injection of $a_i$ ensures the encoding process remains structurally guided by the corpus-wide category-sensitive statistical tendencies. The reset gate’s output, the weighted results, and the input $x_i$ are then summed, yielding the candidate state via the activation function tanh. On the other hand, the update gate $z_i$ modifies the candidate state and updates the previous hidden layer output $h_{i-1}$.
To map the attention weights ($a_i \in \mathbb{R}^{m}$) into the shared $d$-dimensional feature space of $x_i$ and $h_{i-1}$, we introduce a feature-space gating operator. Specifically, learnable projection matrices $W_{p1} \in \mathbb{R}^{d \times m}$ and $W_{p2} \in \mathbb{R}^{d \times m}$ project $a_i$ into feature-space vectors. At time step $i$, the reset gate $r_i$ first filters the previous hidden state $h_{i-1}$ to obtain $\tilde{h}_{i-1}$. Then, the PA-GRU anchors $x_i$ and $\tilde{h}_{i-1}$ using the projected priors. This is achieved by the element-wise Hadamard product ($\odot$) between the projected priors and the feature vectors, resulting in $p^{x}_i$ and $p^{h}_i$. These operations are mathematically defined by Equations (13)–(16):
$$r_i = \sigma(W_r x_i + U_r h_{i-1} + b_r) \tag{13}$$
$$\tilde{h}_{i-1} = r_i \odot h_{i-1} \tag{14}$$
$$p^{x}_i = (W_{p1} a_i) \odot x_i \tag{15}$$
$$p^{h}_i = (W_{p2} a_i) \odot \tilde{h}_{i-1} \tag{16}$$
Combining the current input $x_i$ with the outputs of Equations (14)–(16), a refined candidate state $\tilde{h}_i$ is generated using the tanh activation function. This candidate state integrates the advantages of both the dynamic gating mechanism and the global pre-attention priors. To prevent information redundancy from these multiple additive paths, each component is regulated by an independent, learnable weight matrix ($W_h$, $U_h$, $V_x$, and $V_h$). These matrices act as balancing coefficients, allowing the network to adaptively fuse dynamic recurrence and static priors during backpropagation. The tanh activation further constrains the candidate state within $(-1, 1)$, effectively suppressing information redundancy and gradient explosion. The refined candidate state $\tilde{h}_i$ is computed using Equation (17):
$$\tilde{h}_i = \tanh\left(W_h x_i + U_h \tilde{h}_{i-1} + V_x p^{x}_i + V_h p^{h}_i + b_h\right) \tag{17}$$
Similarly, the update gate and final hidden state are computed using Equations (18) and (19):
$$z_i = \sigma(W_z x_i + U_z h_{i-1} + b_z) \tag{18}$$
$$h_i = (\mathbf{1} - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i \tag{19}$$
where $W_{\ast}$ and $U_{\ast}$ denote trainable parameter matrices, $\mathbf{1}$ denotes a vector of ones, and $b_{\ast}$ denote bias quantities.
In summary, through the training process of PA-GRU, the gating mechanism filters and updates information based on the sequential inputs and the pre-attention weights calculated by the combination of T-PA and C-PA. In contrast to GRU, because the pre-attention mechanism calculates weights strictly based on the training dataset, it provides the PA-GRU with a global statistical perspective for controlling information through explicit structural anchors that help address some limitations of GRU. Furthermore, the trainable coefficient matrices of the combination are adaptively updated through the training process.
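The following PyTorch sketch of a PA-GRU cell mirrors Equations (13)–(19) as reconstructed above; the layer names, bias placement, and batching conventions are our assumptions rather than the reference implementation:

```python
import torch
import torch.nn as nn

class PAGRUCell(nn.Module):
    """Sketch of a PA-GRU cell for d-dimensional states and length-m priors."""

    def __init__(self, d, m):
        super().__init__()
        self.Wr, self.Ur = nn.Linear(d, d), nn.Linear(d, d, bias=False)  # reset gate
        self.Wz, self.Uz = nn.Linear(d, d), nn.Linear(d, d, bias=False)  # update gate
        self.W, self.U = nn.Linear(d, d), nn.Linear(d, d, bias=False)    # candidate
        self.Vx = nn.Linear(d, d, bias=False)   # weighs the anchored input path
        self.Vh = nn.Linear(d, d, bias=False)   # weighs the anchored state path
        self.P1 = nn.Linear(m, d, bias=False)   # projects a_i into feature space
        self.P2 = nn.Linear(m, d, bias=False)

    def forward(self, x, h_prev, a):
        # x: (B, d) embedding; h_prev: (B, d) hidden state; a: (B, m) prior row
        r = torch.sigmoid(self.Wr(x) + self.Ur(h_prev))       # Eq. (13)
        h_reset = r * h_prev                                  # Eq. (14)
        px = self.P1(a) * x                                   # Eq. (15): anchored input
        ph = self.P2(a) * h_reset                             # Eq. (16): anchored state
        h_cand = torch.tanh(self.W(x) + self.U(h_reset)
                            + self.Vx(px) + self.Vh(ph))      # Eq. (17)
        z = torch.sigmoid(self.Wz(x) + self.Uz(h_prev))       # Eq. (18)
        return (1 - z) * h_prev + z * h_cand                  # Eq. (19)
```

A full PA-GRU layer would iterate this cell over the $m$ time steps, feeding the $i$-th row of the fused pre-attention matrix $A$ at each step.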
3.3. The LLM-Based PA-GRU Model and Its Fine-Tuning Method
The architecture of the LLM-based PA-GRU model (LLM-PA-GRU) is depicted in Figure 4. At its core, the PA-GRU module comprises sequential cells deployed across consecutive time steps, flanked by a dual-branch feature extraction network.
On the left branch, we implement a strategic LLM fine-tuning protocol. In contrast to LoRA, which risks diluting foundational linguistic priors by distributing parameter plasticity indiscriminately, our approach freezes the majority of the LLM while selectively unfreezing only the final two transformer layers. This targeted unfreezing strategy yields a more task-adapted semantic representation and is intended to reduce gradient instability between the LLM and the PA-GRU. Thus, the high-level semantic representations from the LLM can be transferred more effectively to the downstream PA-GRU. Under this paradigm, an input text sequence is projected by the LLM into a sequence of word embeddings $X = (x_1, x_2, \ldots, x_m)$.
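A hedged sketch of this selective unfreezing with Hugging Face Transformers follows; the checkpoint id and the `.layers` attribute path assume the Llama-style layout used by DeepSeek-LLM-7B-Base and should be verified against the actual backbone:

```python
from transformers import AutoModel

llm = AutoModel.from_pretrained("deepseek-ai/deepseek-llm-7b-base")

for p in llm.parameters():
    p.requires_grad = False            # freeze the entire backbone first

for layer in llm.layers[-2:]:          # then unfreeze only the final two blocks
    for p in layer.parameters():
        p.requires_grad = True

trainable = sum(p.numel() for p in llm.parameters() if p.requires_grad)
print(f"trainable LLM params: {trainable / 1e6:.0f}M")
# The ~420M figure reported in Section 5.2 also counts the GRU-based head.
```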
Simultaneously, the right branch utilizes the pre-attention mechanism to derive global prior vectors from the same input. These embeddings and statistical priors are fed synchronously into the PA-GRU module, where the hidden state maintains temporal continuity. This integration uses explicit pre-attention weights to anchor each PA-GRU cell with word-level statistical priors.
During training, the model undergoes end-to-end joint fine-tuning encompassing the terminal LLM layers, the trainable coefficient matrices, and the PA-GRU parameters.
Conventional optimization mostly relies on task-specific cross-entropy loss. However, this loss often ignores the structural alignment between different modules. To solve this, a multi-objective optimization strategy is utilized. It combines the cross-entropy loss with a prototype-based alignment loss based on mean squared error. This design improves classification accuracy and maintains structural consistency between the LLM semantic space and the PA-GRU feature space.
Specifically, assume all short texts in the task setting can be classified into $n$ categories, the $k$-th classification label is denoted as $y_k$, and all classification labels are represented as $Y = \{y_1, y_2, \ldots, y_n\}$. Assume the input text $T = \{w_1, w_2, \ldots, w_m\}$. After passing through the pre-trained large language model, the text obtains the hidden state matrix of the last layer, $X \in \mathbb{R}^{m \times d}$. This matrix, as a sequence of word embedding vectors, is then input into the PA-GRU module for sequence modeling, ultimately generating the sequence representation $h_m$. It is further mapped through a classifier to obtain the predicted probability distribution $\hat{y}$. The end-to-end fine-tuning process for the LLM-PA-GRU parameters is defined as follows.
Step 1: based on the prediction result $\hat{y}$ and the true labels, first compute the classic cross-entropy loss as Equation (20):
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c} \tag{20}$$
where $N$ is the batch size, $C$ is the number of categories, and $y_{i,c}$ is the true label of sample $i$ for category $c$.
Step 2: to enforce structural consistency across the heterogeneous network, we introduce a prototype-based alignment loss computed by mean squared error, as shown in Equation (21):
$$\mathcal{L}_{\mathrm{align}} = \frac{1}{N} \sum_{i=1}^{N} \left\lVert h_m^{(i)} - p_{y^{(i)}} \right\rVert_2^2 \tag{21}$$
where $p_{y^{(i)}}$ is the class prototype vector comprising learnable parameters corresponding to label $y^{(i)}$, and $h_m^{(i)}$ represents the sequence-ending hidden state of the $i$-th sample.
Step 3: the final total loss is obtained by combining the classification loss and the alignment loss, as defined in Equation (22):
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda \mathcal{L}_{\mathrm{align}} \tag{22}$$
where $\lambda$ is a balancing coefficient. By jointly optimizing $\mathcal{L}_{\mathrm{total}}$, the model improves both classification accuracy and feature alignment. During fine-tuning, the unfrozen parameters are updated synchronously using backpropagation. Finally, the model extracts discriminative features to output the sentiment prediction.
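A compact sketch of this joint objective is given below; `lam` and the prototype lookup are assumptions (the balancing weight is not specified in this section), with `prototypes` held as a learnable $(C \times d)$ parameter:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, h_final, prototypes, lam=0.5):
    """Joint objective of Eq. (22).

    logits: (N, C) classifier outputs; labels: (N,) class indices;
    h_final: (N, d) sequence-ending hidden states; prototypes: (C, d)
    learnable class prototypes (e.g., an nn.Parameter).
    """
    ce = F.cross_entropy(logits, labels)              # Eq. (20)
    align = F.mse_loss(h_final, prototypes[labels])   # Eq. (21), prototype MSE
    return ce + lam * align                           # Eq. (22)
```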
5. Discussion
5.1. Mechanism of Hierarchical Pre-Attention and Gating Manifold
Evaluations on the SMSSpamCollection dataset reveal that the LLM-PA-GRU model achieves strong performance compared with conventional deep learning architectures, particularly in terms of classification accuracy. The core of this improvement lies in the pre-attention mechanism’s ability to extract category-level class tendencies and intra-text dependencies as explicit statistical priors. By embedding these structural anchors directly into the PA-GRU cells, the model addresses the limitations of standard sequential encoders. The results suggest that the LLM-PA-GRU model alleviates the lack of global guidance in conventional transformers.
In the context of the four benchmark sentiment datasets, the LLM-PA-GRU model achieves competitive or superior performance across the four datasets by consistently isolating more discriminative category-relevant statistical features. This advantage stems from the deep integration of pre-attention priors within the internal gating manifold of the PA-GRU. Conventional attention mechanisms rely on localized text relationships. They do not account for corpus-wide category distributions or integrate them into the encoding process. In contrast, the PA-GRU architecture ensures that at each temporal step, the hidden state transition is guided by explicit category-based and text-based priors via C-PA and T-PA. These statistical priors anchor the previous hidden state and the current input. This effectively filters out noise and preserves critical classification-related features. Furthermore, the use of trainable coefficient matrices resolves the inherent statistical mismatches between C-PA and T-PA, adaptively reweighting and fusing heterogeneous distributions for adaptive allocation. As a result, these explicit anchors guide the gating mechanism to emphasize category-informative features and improve classification precision.
Finally, the performance advantage of LLM-PA-GRU highlights the effectiveness of our fine-tuning protocol. By selectively unfreezing only the terminal transformer layers, we construct a more task-adapted semantic representation. This configuration is intended to mitigate the gradient instability common in heterogeneous architectures. Combined with the prototype-based alignment loss, this strategy encourages consistency between the semantic space of the LLM representations and the temporal feature space of the PA-GRU.
5.2. Empirical Analysis of Computational Efficiency and Complexity
We provide empirical data on the model’s parameter scale and training dynamics to demonstrate its optimization efficiency. Unlike standard full-parameter updating, our protocol uses a selective layer-wise unfreezing strategy: we freeze the majority of the DeepSeek-LLM-7B-Base backbone and optimize only the terminal two transformer layers and the GRU-based classification head. The trainable parameter count is thus approximately 420 M, roughly 6.0% of the total 7 B backbone parameters.
The model also converges rapidly during training. On an NVIDIA A100 GPU, the LLM-PA-GRU reaches convergence within 3 epochs. Each epoch requires significantly less training time than full-parameter fine-tuning of 7B-scale models. This efficiency suggests the practical scalability of the proposed framework.
5.3. Limitations
This study has limitations. Although the PA-GRU gating logic is highly effective for anchoring dense category-related statistical features in short sequences, the GRU architecture inherently struggles with long-range dependencies. Consequently, the model’s performance may degrade on complex, long-form texts. This is why our framework focuses on short-text scenarios. In addition, the deployment of this framework in a production environment involves memory footprint trade-offs during the inference phase. While parameter-efficient, retaining the LLM backbone alongside the GRU head still requires substantial memory capacity and bandwidth for real-time applications.
6. Conclusions
This research developed a comprehensive framework for short-text sentiment analysis, integrating a dual-level pre-attention mechanism, the PA-GRU architecture, and a strategic LLM fine-tuning protocol.
The proposed pre-attention mechanism extracted corpus-derived, class-conditioned statistical priors from both inter-category and intra-text dimensions derived from the training corpus. This approach overcame the limitations of conventional attention mechanisms, which lacked a corpus-wide view of features. In addition, the proposed trainable coefficient matrices provided a solution to adaptively fuse these priors within a unified formulation for adaptive prior allocation. This methodology effectively addressed the neglect of corpus-wide category-dependent statistical tendencies and word-distribution variances that often hindered conventional encoding processes.
Furthermore, this study revised the internal gating manifold of the GRU by embedding global statistical priors directly, creating the PA-GRU cell. This structural innovation ensured that sequence encoding was no longer a localized, purely dynamic operation but was instead grounded by explicit dataset-wide anchors. The PA-GRU enabled the model to maintain consistent category-sensitive statistical guidance even in the presence of linguistic noise.
The LLM-PA-GRU model, supported by our strategic fine-tuning method, improved overall performance. In contrast to global adaptation techniques, we selectively unfroze the terminal transformer layers, constructing a more task-adapted semantic representation. The fine-tuning of the DeepSeek-LLM-7B-Base backbone restricted the trainable parameters to approximately 420 M, roughly 6.0% of the total backbone parameters. This configuration was optimized through a prototype-based alignment loss, which reduced gradient instability and enforced structural consistency across heterogeneous feature spaces. As a result, the model reached convergence within 3 epochs on an NVIDIA A100 GPU.
Extensive benchmarking confirmed that this approach achieved competitive or better performance compared with many literature-reported models, particularly in sparse and informal short-text environments where traditional deep learning models often struggled. Specifically, the model achieved peak accuracies of 99.37% on SMSSpamCollection, 94.05% on MR, 95.50% on CR, and 97.60% on SUBJ.
Future research will focus on extending the integration of explicit statistical priors to other heterogeneous architectures. Exploring these corpus-derived priors within diverse neural frameworks will refine internal computational mechanisms and foster the development of more robust, architecturally grounded variants for a broader spectrum of NLP challenges.