Article

A Multi-Scale Feature Fusion Linear Attention Model for Movie Review Sentiment Analysis

1 School of Art and Design, Shandong Women’s University, Jinan 250300, China
2 School of Artificial Intelligence, Jiangxi Normal University, Nanchang 330022, China
3 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2025, 9(12), 325; https://doi.org/10.3390/bdcc9120325
Submission received: 31 October 2025 / Revised: 30 November 2025 / Accepted: 13 December 2025 / Published: 18 December 2025

Abstract

Sentiment classification is a key technique for analyzing the emotional tendency of user reviews and is of great significance to movie recommendation systems. However, existing methods often face challenges in practical applications due to complex model structures, low computational efficiency, or difficulties in balancing local details with global contextual features. To address these issues, this paper proposes a Multi-Scale Feature Fusion Linear Attention model (MSFFLA). The model consists of three core modules: the BERT Encoder module for extracting basic semantic features; the Parallel Multi-scale Feature Extraction module (PMFE), which employs multi-branch dilated convolutions to accurately capture local fine-grained features; and the Global Multi-scale Linear Feature Extraction module (MGLFE), which introduces a Multi-Scale Linear Attention mechanism (MSLA) to efficiently model global contextual dependencies with approximately linear computational complexity. Extensive experiments were conducted on three public datasets: SST-2, Amazon Reviews, and MR. The results show that compared to the state-of-the-art BERT-CondConv model, our model achieves improvements in accuracy and F1-Score by 1.8% and 0.4%, respectively, on the SST-2 dataset, and by 1.5% and 0.3% on the Amazon Reviews dataset. This study not only validates the effectiveness of the proposed model but also provides an efficient and lightweight solution for sentiment classification tasks in movie recommendation systems, demonstrating promising practical application prospects.

1. Introduction

Sentiment Classification (SC) primarily involves the automatic identification of the sentiment conveyed in text, providing key technological support for understanding and utilizing subjective emotional information within massive amounts of textual data [1]. Sentiment classification is widely applied across various domains, such as analyzing customer feedback in e-commerce [2], music recognition and genre classification [3], and analyzing user comments on social media [4,5]. Within the film industry, SC holds profound application value. It enables the automated, fine-grained sentiment analysis of vast quantities of movie reviews, social media discussions, and short comments. This allows industry practitioners to move beyond simple metrics like box office numbers or average ratings, and instead capture the complex and nuanced emotional pulse of public opinion in real-time and at scale. For movie recommendation systems, successful sentiment classification facilitates the crucial leap from “knowing what a user has watched” to “understanding what a user enjoys.” By analyzing the sentiment in a user’s historical reviews, the system can accurately identify their preferred movie elements, thereby recommending films that are more likely to resonate emotionally. This significantly enhances the personalization of recommendations and improves user satisfaction.
In terms of marketing strategy formulation, this technology provides a data-driven basis for decision-making. From the film’s promotion phase to its post-release period, marketing teams can monitor sentiment dynamics to evaluate the effectiveness of trailers, posters, and promotional campaigns. This enables them to promptly identify key topics in public discourse, achieve precise allocation of marketing resources, and implement proactive risk management, ultimately optimizing the film’s market performance.
Methods for SC in existing movie recommendation systems are commonly classified into four major types [6]: (1) conventional manual-driven methods, (2) traditional machine learning-based approaches, (3) deep learning-driven techniques, (4) hybrid model-based approaches. Traditional manual methods primarily rely on handcrafted lexicons and grammatical rules. For instance, Wu et al. [7] put forward an unsupervised methodological framework that automatically extracts domain-specific sentiment word pairs for target terms by analyzing syntactic dependencies and semantic features, thereby constructing a fine-grained sentiment lexicon to address the limited coverage of conventional lexicons for domain-specific targets. Zhang et al. [8] manually constructed three types of rule-based dictionaries and designed sentiment composition rules, which markedly enhanced the accuracy of short-text sentiment classification, particularly for platforms with concise expressions, such as Weibo. However, the knowledge bases used in such manual approaches often struggle to handle complex semantic expressions. To overcome these limitations, researchers have introduced traditional machine learning-based techniques using features such as Bag-of-Visual-Words (BoVW) and Term Frequency-Inverse Document Frequency (TF-IDF). For example, Widyantoro et al. [9] proposed the User Profile Correlation-based Similarity (UPCSim) algorithm, which integrates user ratings and behavior data to compute similarity weights and employs K-Nearest Neighbors (KNN) to classify user preferences, effectively diminishing prediction errors—though it comes at the expense of high computational overhead. Pavirha et al. [10] applied TF-IDF weighting to extract keywords from movie reviews and used Support Vector Machine (SVM) for sentiment classification, achieving high accuracy. 
This method helps filter films with strongly negative reviews to improve recommendation reliability; however, it relies on explicit sentiment words and fails to interpret ironic or sarcastic comments.
Due to the wide application of deep learning in various fields, scholars have introduced it into SC. For instance, Dashtipour et al. [11] proposed integrating Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) for sentiment analysis of Persian movie reviews, replacing manual feature extraction with automated feature engineering to capture contextual information effectively. Prasath et al. [12] explored the use of deep learning models to analyze audience sentiment in movie-related tweets. They collected data via Tweepy and Beautiful Soup, preprocessed it using SpaCy, and employed the Robustly Optimized BERT Pretraining Approach (RoBERTa) for sentiment classification, emotion detection, and aspect-based analysis. Combined with Plotly for visualization, their approach provides comprehensive insights into audience reception, thereby supporting decision-making in the entertainment industry. Since CNNs primarily focus on local features and struggle to capture long-range dependencies, researchers have introduced Transformer-based self-attention mechanisms for sentiment analysis, leveraging large-scale pre-trained language models to improve accuracy significantly. Acheampong et al. [13] systematically demonstrated how Transformer models facilitate the transition from coarse-grained positive/negative classification to fine-grained sentiment detection, emphasizing context modeling and transfer learning as core drivers in this field. Saad et al. [14] proposed a deep learning architecture that integrates the strengths of Transformers and sequential models. Experimental validation on the IMDb Large Movie Review Dataset demonstrated that the RoBERTa model achieved an accuracy of 95.02%, outperforming BERT, DistilBERT, and existing methods, thereby significantly enhancing the efficacy of sentiment analysis and providing a reliable basis for emotional assessment in movie reviews.
However, these models face challenges, including high computational costs and limited interpretability.
To address more complex sentiment analysis tasks, researchers have integrated the strengths of different models and proposed hybrid model-based approaches. For instance, Ruan et al. [15] proposed an Attention-based LSTM (ATT-LSTM) framework tailored for user review analysis. This framework enhances the efficacy of sentiment feature extraction and attains superior accuracy relative to conventional Recurrent Neural Network (RNN)/LSTM models. This method further integrates behavioral and emotional attributes to enable feature-level information mining. To tackle challenges in textual feature extraction, Wang et al. [16] developed the BERT-CondConv model, which adaptively optimizes BERT’s hidden-state features using conditional parameterized convolution. Experimental validation on movie review data confirmed its effectiveness. Sun et al. [17] introduced the Enhanced Representation through kNowledge IntEgration—Multi-Channel Bidirectional Multi-head Attention (ERNIE-MCBMA) model, which extracts fused features through a multi-channel CNN, bidirectional LSTM, multi-head attention mechanism, and cross-layer fusion. Their model demonstrated superior accuracy over baseline models in the six-category sentiment classification task of Social Media Processing 2020—Evaluation of Weibo Emotion Classification Task (SMP2020-EWECT).
While the aforementioned models have achieved promising results, the following limitations and research gaps persist:
  • Existing models fail to adequately extract relevant feature information. Some models primarily focus on local features while neglecting global contextual information. For instance, CNN models excel at extracting local features but are limited by their receptive field, hindering effective modeling of long-range dependencies. Other models predominantly concentrate on global information while overlooking fine-grained details. Furthermore, these models often operate with high feature dimensions, which constrains their computational performance. For example, while standard BERT can effectively model global dependencies through self-attention mechanisms, it struggles to capture subtle patterns in local phrases, and its quadratic computational complexity makes it computationally expensive for long sequences.
  • To extract more comprehensive features, existing models typically employ enlarged convolutional kernels. However, this approach leads to a substantial increase in parameters, which directly constrains computational efficiency. Furthermore, the attention mechanisms in some current models exhibit high computational complexity, further exacerbating the computational burden. Hybrid models such as BERT-CNN and BERT-LSTM, while integrating advantages from both architectures, often suffer from complex structures, inefficient feature fusion, and parameter redundancy.
  • Existing models are plagued by several shortcomings, including high computational complexity, excessive parameter counts, and low computational efficiency. These limitations hinder the widespread adoption of the models and make them unsuitable for real-time detection scenarios.
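The complexity gap noted above can be made concrete with a toy comparison. The sketch below (assuming NumPy; the ReLU-plus-epsilon feature map is one common illustrative choice, not necessarily the kernel used by MSLA) shows how reassociating the attention product as φ(Q)(φ(K)ᵀV) avoids the n × n score matrix, so cost grows linearly in sequence length n rather than quadratically:

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: materializes an n x n score matrix, O(n^2 * d).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def linear_attention(Q, K, V):
    # Kernelized attention: associativity lets us form phi(K)^T V, a
    # d x d summary, once. Total cost is O(n * d^2), linear in n.
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # positive feature map (illustrative)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                               # d x d summary
    Z = Qp @ Kp.sum(axis=0)                     # per-row normalizer, O(n * d)
    return (Qp @ KV) / Z[:, None]

n, d = 128, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out_soft = softmax_attention(Q, K, V)
out_lin = linear_attention(Q, K, V)
print(out_soft.shape, out_lin.shape)  # both (128, 16)
```

The two variants produce outputs of identical shape but trade the exact softmax weighting for a kernel approximation, which is the efficiency/fidelity trade-off linear attention mechanisms accept.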
This study aims to investigate the following question: How can a model effectively capture both local fine-grained features and global contextual dependencies in text without significantly increasing computational complexity? Our core hypothesis is that, through the synergistic design of parallel multi-scale convolution (PMFE) and efficient linear attention modules (MGLFE), it is possible to extract local detailed features and contextual features while reducing the model’s computational complexity.
In response to the limitations and research gaps in existing sentiment analysis models for movie reviews, this paper proposes optimizations and improvements to the model. The main contributions are summarized as follows:
  • We propose a novel PMFE module and an MGLFE module. These two modules can effectively capture multi-scale fine-grained information and global contextual features. This design enables the model to simultaneously benefit from both global contextual understanding and local detailed characteristics while maintaining relatively low computational complexity, which helps reduce the number of model parameters. By incorporating the PMFE module, which utilizes parallel multi-kernel dilated convolutions to extract local features at different granularities, the model precisely captures key words and phrasal patterns, thereby compensating for BERT’s limitations in local semantic understanding. Through the Multi-scale Linear Attention (MSLA) mechanism in the MGLFE module, global contextual information is modeled with approximate linear computational complexity, overcoming the inherent limitations of traditional CNNs regarding restricted receptive fields.
  • We propose a Multi-Scale Feature Fusion Linear Attention (MSFFLA) model, which primarily adopts a symmetric lightweight design. In the PMFE module, we employ parallel dilated convolutions that effectively reduce the number of parameters compared to traditional convolutional operations while expanding the receptive field. In the MGLFE module, we incorporate an MSLA mechanism that significantly reduces computational complexity. Through residual connections for efficient feature fusion, our design achieves a “1 + 1 > 2” effect: the architecture not only enhances computational performance but also reduces the number of model parameters, thereby substantially improving the computational efficiency of the model.
  • Our model not only achieves significant improvements in both accuracy and F1-Score on three public benchmarks, but also demonstrates superior parameter efficiency and computational performance, thereby delivering an efficient and lightweight solution suitable for resource-constrained practical application scenarios.
The structure of this paper is organized as follows: Section 2 reviews and analyzes related methods in sentiment analysis. Section 3 provides a detailed introduction to the proposed MSFFLA model. Section 4 presents the experimental results and corresponding analysis. Finally, Section 5 summarizes the full text and outlines future research directions.

2. Related Work

2.1. Methods Based on Traditional Manual Approaches

In the developmental trajectory of computer-based sentiment classification, traditional manual approaches primarily relied on rule design and feature engineering. Methods based on sentiment lexicons achieve classification by constructing manually annotated word banks and formulating rule systems. For instance, Louati et al. [18] proposed an SVM-based sentiment analysis system for Arabic course evaluations (SVM-SAA-SCR) to analyze student feedback from Saudi universities. Using a real-world dataset provided by PSUA, the SVM model achieved an accuracy of 84.7% in classification tasks after preprocessing and feature extraction. When compared with the advanced CAMeLBERT model, the results were comparable. This system aids in identifying areas for teaching improvement and enhancing the quality of higher education. Similarly, Hu et al. [19] developed a product review analysis system validated on real e-commerce review datasets (Amazon, CNet). The final accuracy for feature extraction was 72%, with a recall of 80%. Sentence-level sentiment classification achieved a high accuracy of 84.2%. However, such methods face inherent limitations of static lexicons, struggling to adapt to semantic ambiguities.
To enhance flexibility, rule-based template methods emerged. These define grammatical patterns, such as matching a noun with an optional negation word and an adjective to identify a negative evaluation; the contextual polarity rules proposed by Wilson [20] are a typical representative. On the MPQA corpus, the two-stage classifier achieved an accuracy of 75.9% for neutral-versus-polar classification and 65.7% for polarity classification. Nevertheless, these methods exhibit weak domain generalization capability; rules developed for film reviews require complete reconstruction when applied to the medical domain, and they fail to interpret complex semantics such as negation or sarcasm.
Muddiman et al. [21] proposed a method combining manual validation with corpus statistics to construct domain-customized dictionaries. This involves manually screening keywords relevant to the target domain and expanding the vocabulary using corpus statistics and semantic similarity, emphasizing the avoidance of using validation set data for dictionary construction to prevent overfitting. Amsler et al. [22] employed data-driven semantic computation models to extend traditional human-dominated text analysis paradigms, suggesting the use of corpus statistics—such as high-frequency correlated words and semantic similarity—to optimize dictionary building.
While these methods can achieve high accuracy within specific domains, they generally depend on manually constructed sentiment lexicons or rule templates. Consequently, they face fundamental limitations, including difficulties in domain adaptation and challenges in handling semantic ambiguities and complex linguistic phenomena.

2.2. Approaches Founded on Traditional Machine Learning

With the rise of machine learning, feature engineering-based approaches became mainstream. For instance, manual extraction of bag-of-words features and N-gram features captured sentiment in phrases like “must_watch,” while syntactic dependency parsing calculated the dependency distance between sentiment words and targets. Statistical analysis of part-of-speech distributions yielded multidimensional features such as adjective and adverb densities, which were then fed into classifiers like SVM or Naïve Bayes. Pang et al. [23] achieved 82.9% accuracy using this framework on curated IMDb movie review data, though feature design heavily relied on expert knowledge. These traditional methods shared three major limitations: high manual labor costs, poor domain transferability, and weak semantic comprehension—particularly struggling with metaphors and implicit sentiments. These constraints ultimately drove the shift toward adaptive learning paradigms based on deep models like BERT.
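As an illustration of this classic pipeline (handcrafted bag-of-words features fed to a classical classifier), the sketch below trains a tiny Naive Bayes sentiment classifier with add-one smoothing. The four-review corpus is invented purely for demonstration and is not drawn from any dataset cited here:

```python
import math
from collections import Counter

# Toy training corpus: bag-of-words counts per class (illustrative only).
train = [
    ("a must watch brilliant film", "pos"),
    ("great acting and a moving story", "pos"),
    ("dull plot and terrible pacing", "neg"),
    ("a boring waste of time", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
class_docs = Counter()
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    # Score each class by log P(c) + sum log P(w|c), with add-one
    # (Laplace) smoothing so unseen words do not zero out the product.
    best, best_lp = None, -math.inf
    for c in word_counts:
        lp = math.log(class_docs[c] / len(train))
        total = sum(word_counts[c].values())
        for w in text.split():
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

print(predict("brilliant moving film"))   # pos
print(predict("terrible boring plot"))    # neg
```

Even this minimal setup exhibits the limitation discussed above: the classifier only matches surface tokens, so sarcasm, metaphor, and implicit sentiment are invisible to it.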
Innovative combinations of traditional NLP techniques and machine learning models (e.g., VADER + SVM) enabled high-precision classification and collectively advanced recommendation systems toward fine-grained sentiment understanding. Sarhan et al. [24] validated a hybrid approach integrating lexical rules with machine learning. Using both TMDB 5k/10k and Reviews.txt datasets, they combined the VADER sentiment analyzer with an SVM classifier to achieve an exceptional sentiment classification accuracy of 99.28%, significantly improving recommendation relevance. Jassim et al. [25] compared multiple machine learning models for sentiment analysis on movie reviews and proposed a new rating prediction scheme based on textual reviews and FDOSM decision-making methodology. Their experiments on the IMDb movie review dataset confirmed that the SVM classifier delivered optimal performance with an accuracy of 88.333%.
While feature engineering-based machine learning methods demonstrated powerful sentiment classification capabilities on movie review datasets, achieving up to 99.28% accuracy, their reliance on meticulous feature design and domain-specific data exposed inherent limitations. This recognition ultimately catalyzed the paradigm shift toward end-to-end deep learning models.

2.3. Approaches Founded on Deep Learning

In the domain of deep learning, BERT-based sentiment classification models have achieved significant breakthroughs in recent years, particularly in target-dependent and fine-grained analysis. Gardazi et al. [26] highlighted in their survey that BERT’s contextual awareness substantially enhances Aspect-Based Sentiment Analysis (ABSA) performance. Subsequent studies further optimized this through domain adaptation [27] and hybrid architectures integrated with XGBoost [28], though challenges remain in handling implicit sentiments. Gao et al. [29] pioneered the TD-BERT model, which locates target words and constructs target-related sentences, outperforming traditional feature engineering and embedding models in aspect-level sentiment analysis and establishing a new baseline. Evaluated on SemEval-2014 Task 4 and Twitter datasets, TD-BERT achieved state-of-the-art performance across all three benchmarks, improving accuracy by 2–3% over previous best results on the Twitter dataset with a final accuracy of 85%. For multi-class tasks, Almufareh et al. [30] developed BertSent, employing oversampling and undersampling techniques to address five-class imbalance and achieving 75.3% accuracy with limited data. In commercial applications, Rahman et al. [31] demonstrated BERT’s superiority in customer feedback analysis, where its 95% accuracy substantially enhanced service quality and corporate revenue.
With the advancement of pre-trained models, researchers have begun integrating BERT with neural networks to enhance downstream task performance. Chen et al. [32] proposed the CNN-TE model by combining BERT with CNN, conducting experiments on two public datasets, GTZAN and FMA-small. The model achieved an accuracy of 87.41% on GTZAN and 89.09% on FMA-small, thereby improving the accuracy of music genre classification. Li et al. [33] adopted a BERT-BiGRU architecture to optimize sentiment polarity analysis of e-commerce reviews, where the BERT-BiGRU algorithm achieved the highest accuracy of 95.51% among seven compared algorithms. Prottasha et al. [34] further improved sentiment classification decisions through an integrated BERT-CNN-BiLSTM model. However, the unique characteristics of multi-scale emotional expressions in movie reviews pose new challenges to existing techniques: traditional BERT struggles to simultaneously capture local details and global tendencies, while CNN-Transformer hybrid architectures suffer from inefficient feature fusion. Although large language models (e.g., GPT-3, LLaMA-2) demonstrate potential in zero-shot sentiment analysis [35,36], their high computational costs [37] and domain adaptation limitations [38] hinder deployment in real-time recommendation systems.

2.4. Hybrid Model-Based Methods

In the domain of hybrid models, research on movie recommendation systems based on textual sentiment analysis has demonstrated diversified technical approaches. The CNN-Transformer hybrid architecture has shown significant advantages in cross-modal tasks, particularly in text-related domains. Tennakoon et al. [39] employed a BERT + CNN model for sentiment classification of user texts, achieving 78% accuracy on the benchmark dataset (FER2013), and combined it with collaborative filtering to realize emotion-driven personalized recommendations. Chen et al. [32] proposed the CNN-TE model, which extracts local features from mel-spectrograms through CNN and then models global temporal dependencies via a Transformer encoder, significantly improving the efficiency of music genre classification. Hossain et al. [40] designed the TransNet text classification system, integrating the sequential learning capabilities of GRU/LSTM with the multi-head self-attention mechanism of Transformer, combined with BERT tokenizers to handle linguistic characteristics. Validated on the Arabic Asthma Tweets dataset, the system achieved 97.87% accuracy in sentiment analysis.
Recently, attention mechanisms have manifested prominent advantages in the fields of affective computing and text analysis. In the context of EEG-based emotion recognition, Jiang et al. [41] introduced an Attention-fused Multi-Scale Feature Fusion Network (AM-MSFFN), which extracts features via spatial-temporal convolution before incorporating a Convolutional Block Attention Module (CBAM) for channel and spatial attention to focus on key physiological signals, substantially improving cross-subject generalization and achieving accuracy rates exceeding 99% on the Database for Emotion Analysis using Physiological Signals (DEAP) and the SJTU Emotion EEG Dataset (SEED). In textual sentiment analysis, Jia [42] combined BERT with CNN and utilized attentive pooling to mine emotional features in microblog texts, while Guo et al. [43] designed a multi-scale self-attention Transformer that enhances text classification performance through hierarchical control of attention granularity. Addressing multi-label sentiment classification, Ameer et al. [44] innovatively employed a multi-head attention mechanism (e.g., RoBERTa-MA), assigning independent attention heads to each emotion type, which improved accuracy by 3.6% in the SemEval-2018 task. In multimodal analysis, Subbaiah et al. [45] developed a Hybrid Optimized Multi-Scale Residual Attention Network (HOMRA-Net), which integrates attention-weighted features from text, audio, and visual modalities, enabling efficient emotion recognition on the Multimodal EmotionLines Dataset (MELD).
Recent sentiment classification research demonstrates a trend toward integrating multiple technical pathways, with notable progress in addressing critical challenges such as data imbalance, label dependency, and model efficiency. Mirzaee et al. [46] proposed a multi-label classification framework based on self-supervised contrastive learning and adaptive data augmentation. By designing a Focal Weighted Cross-Entropy loss function and an inter-class correlation learning module, their approach effectively handles class imbalance and label interdependence, demonstrating strong generalization capabilities in tasks like chest X-ray disease detection. Collectively, current sentiment classification research is progressively integrating self-supervised learning, sequential modeling [47], and large language model technologies [48], striving to enhance performance while balancing practicality, interpretability, and resource consumption.

3. Methodology

3.1. Overall Structure

The overall framework of our model is illustrated in Figure 1, and the corresponding procedure is summarized in Algorithm 1. It primarily consists of three modules: the BERT Encoder, the PMFE, and the MGLFE. First, the input data is encoded by the BERT Encoder module to obtain features at different time steps. Subsequently, we employ two PMFE modules and two MGLFE modules to capture local detailed features and global contextual information from the data. Additionally, a multi-scale linear attention mechanism is adopted to effectively model long-range dependencies within the global context. Finally, the outputs are aggregated via average pooling, then passed through a 1 × 1 convolution layer and a Softmax classifier to yield the final result of sentiment classification.
The overall architectural design of the MSFFLA model is grounded in a profound understanding of the multi-scale nature of natural language and the limitations of existing deep learning models. Emotional expressions manifest simultaneously through local keywords and global contextual cues, whereas conventional CNNs and BERT models are constrained by their limited receptive fields and deficient capabilities in capturing fine-grained local patterns, respectively. To address these issues, we construct a hierarchical semantic understanding architecture capable of capturing both fine-grained features and long-range dependencies through the synergistic integration of PMFE and MGLFE. To balance performance and efficiency, we incorporate a linear attention mechanism atop the semantic foundation provided by BERT, maintaining global modeling capability with approximately linear computational complexity, while employing lightweight components such as depthwise separable convolutions to control parameter growth. Furthermore, multi-layer residual connections ensure smooth gradient propagation and lossless feature integration, combined with feed-forward networks and layer normalization to achieve stable refinement of multi-scale features. This architectural design achieves an organic unification of multi-scale semantic comprehension, computational efficiency, and training stability.
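The final aggregation step described above (average pooling over tokens, a 1 × 1 convolution acting as a per-position linear projection, then Softmax) can be sketched as follows. Here random tensors stand in for the features produced by the BERT/PMFE/MGLFE stack, and the dimensions are illustrative:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over class logits.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
n_tokens, d, n_classes = 32, 64, 2
H = rng.standard_normal((n_tokens, d))         # stand-in token-level features
W = rng.standard_normal((d, n_classes)) * 0.1  # 1x1 conv == linear projection

pooled = H.mean(axis=0)        # average pooling across the sequence
probs = softmax(pooled @ W)    # class probabilities for sentiment labels
print(probs, probs.sum())      # two probabilities summing to 1
```

Treating the 1 × 1 convolution as a plain matrix multiply is exact for sequence features with a single spatial position per token, which is why such heads are often described interchangeably as convolutions or linear layers.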

3.2. BERT Encoder BLOCK

We employ the BERT Encoder pre-trained model as the foundational module. Relative to traditional models (including CNN, RNN, and BERT), the incorporation of the Transformer architecture avoids the information loss caused by pooling operations, while residual connections help preserve low-level features [49]. Leveraging the general linguistic representations provided by its pre-trained weights, the BERT model bypasses the need for feature learning from scratch, thereby acquiring stronger generalization capabilities [50]. Through the employment of multi-head self-attention, BERT is capable of comprehensively and concurrently aggregating contextual information from both preceding and succeeding contextual tokens throughout the entire input sequence. This powerful information fusion capability enables it to excel in modeling long-range semantic dependencies [51]. Consequently, a key advantage of this module lies in its effectiveness for long-text feature classification, as it captures multi-level semantic features and extracts hidden state representations across different time steps. This approach mitigates the issue of feature homogenization and enhances the overall performance of the model.
BERT was selected as the backbone architecture due to its comprehensive advantages in semantic representation, knowledge inheritance, scalability, and efficiency. Leveraging Transformer and multi-head self-attention mechanisms, BERT enables bidirectional contextual modeling, allowing it to accurately capture complex emotional expressions such as negations and contrastive evaluations. Its extensive pre-trained weights encapsulate rich linguistic knowledge, providing high-quality word embeddings for downstream tasks, thereby avoiding the need for learning from scratch and enhancing generalization in few-shot scenarios. Furthermore, BERT-base strikes an effective balance between model scale, computational efficiency, and deployment feasibility, establishing a robust semantic foundation for the subsequent integration of PMFE and MGLFE modules. This forms a synergistic “powerful-backbone and specialized-enhancement” architecture, fully aligned with the goal of building an efficient and lightweight sentiment analysis system.
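A minimal, self-contained sketch of the bidirectional multi-head self-attention underlying this contextual aggregation is given below. The projection weights are random placeholders rather than BERT’s pre-trained parameters, and a real encoder layer would add residual connections, layer normalization, and a feed-forward block:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    # Every token attends to every other token (no causal mask), so the
    # output at each position mixes context from the whole sequence.
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)   # scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)           # softmax per query
        heads.append(w @ V[:, s])
    return np.concatenate(heads, axis=-1)            # concat head outputs

rng = np.random.default_rng(1)
n, d = 8, 32
X = rng.standard_normal((n, d))                      # toy token embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = multi_head_self_attention(X, Wq, Wk, Wv, n_heads=4)
print(out.shape)  # (8, 32)
```

The per-query softmax makes each output row a convex combination of value vectors from all positions, which is the mechanism that lets BERT aggregate both preceding and succeeding context.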

3.3. PMFE and MGLFE BLOCK

3.3.1. PMFE BLOCK

The PMFE module serves as an efficient local feature extractor. Its operating principle involves capturing fine-grained semantic features of text through a parallel multi-scale convolutional architecture. The module first preprocesses the input features using batch normalization and depthwise separable convolution. Subsequently, three parallel dilated convolutional branches capture local features with different receptive fields. This is followed by feature concatenation, channel compression via 1 × 1 convolution, and information fusion through multi-layer residual connections. The refined features are ultimately output via a feed-forward structure. This design significantly enhances the model’s ability to capture key local patterns, such as negation phrases and intensity adverbs, through multi-scale parallel processing. The use of dilated convolutions expands the receptive field while controlling the number of parameters, and depthwise separable convolution ensures computational efficiency. Consequently, the model effectively improves the accuracy and robustness of local semantic understanding while maintaining a lightweight design.
The overall architectural framework of PMFE is illustrated in Figure 2, while its specific formulas are presented as follows:
$$Z_1 = \mathrm{DWConv}_{3\times3}(\mathrm{BN}(X)) + X$$
$$Z_2 = \mathrm{Conv}_{1\times1}(\mathrm{BN}(Z_1))$$
$$f_1 = \mathrm{SeLU}(\mathrm{PDConv}_{3\times3}(Z_2) + Z_2)$$
$$f_2 = \mathrm{SeLU}(\mathrm{PDConv}_{5\times5}(Z_2) + Z_2)$$
$$f_3 = \mathrm{SeLU}(\mathrm{PDConv}_{7\times7}(Z_2) + Z_2)$$
$$f = f_1 + f_2 + f_3$$
$$Z_3 = \mathrm{Conv}_{1\times1}(f) + Z_1$$
$$F = \mathrm{Conv}_{1\times1}(\mathrm{GeLU}(\mathrm{Conv}_{1\times1}(\mathrm{BN}(Z_3)))) + Z_3$$
Here, $X \in \mathbb{R}^{H \times W \times C}$ denotes the input tensor, where H, W, and C are the height, width, and channel count of the feature map, respectively. BN refers to Batch Normalization, and DWConv$_{3\times3}$ denotes a 3 × 3 depthwise separable convolution. PDConv$_{3\times3}$, PDConv$_{5\times5}$, and PDConv$_{7\times7}$ denote parallel dilated convolutions with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, respectively. SeLU is the Scaled Exponential Linear Unit activation function, and GeLU is the Gaussian Error Linear Unit, an activation function widely adopted in deep learning.
First, Batch Normalization (BN) is applied to the input sample $X \in \mathbb{R}^{H \times W \times C}$ to improve training stability and accelerate convergence [52]. This is followed by a lightweight 3 × 3 depthwise separable convolution (DWConv$_{3\times3}$) for preliminary feature extraction with reduced parameters. The original input features are then combined with the initially extracted features via a residual connection, preserving low-level information and producing an intermediate feature map Z1. Next, Z1 undergoes a second BN operation and is transformed through a 1 × 1 convolution to obtain Z2, which is split along the channel dimension into three parallel branches. These branches are processed by PDConv$_{3\times3}$, PDConv$_{5\times5}$, and PDConv$_{7\times7}$, respectively, to capture multi-scale contextual information. Each branch employs the SeLU activation function to enhance nonlinear modeling capacity while preserving local characteristics. Independent residual connections integrate the original features of Z2 with the multi-scale outputs, resulting in feature groups f1, f2, and f3. These features are combined into a joint representation f, which is then compressed and integrated across channels via a 1 × 1 convolution. The result is fused with Z1 through another residual connection, yielding the refined feature Z3. Finally, Z3 undergoes BN, followed by a 1 × 1 convolution and the GeLU activation function to optimize the feature distribution. A residual connection is applied again to combine Z3 with its processed version, producing the final high-representation feature F. The architecture reduces computational load through depthwise separable convolution, enhances contextual perception via multi-scale parallel dilated convolutions, ensures lossless information flow through hierarchical residual connections, and strengthens complex pattern modeling with SeLU and GeLU activations, thereby achieving efficient local feature extraction.
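The efficiency argument above (dilated convolutions widen the receptive field without extra parameters; depthwise separable convolutions cut the parameter count) can be illustrated with a quick back-of-the-envelope calculation. The kernel and channel sizes below are illustrative, not the paper's exact configuration:

```python
# Sketch: why dilated + depthwise separable convolutions keep PMFE lightweight.
# Kernel and channel sizes are illustrative assumptions.

def dilated_receptive_field(kernel: int, dilation: int) -> int:
    """Effective receptive field of a single dilated convolution layer."""
    return kernel + (kernel - 1) * (dilation - 1)

def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    return k * k * c_in * c_out

def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    # k x k depthwise filter per input channel, then a 1 x 1 pointwise projection
    return k * k * c_in + c_in * c_out

# A 3 x 3 kernel with dilation 3 covers the same 7 x 7 span as a dense 7 x 7 kernel.
assert dilated_receptive_field(3, 3) == 7

c = 256
ratio = standard_conv_params(3, c, c) / depthwise_separable_params(3, c, c)
print(f"standard vs. depthwise separable parameters: {ratio:.1f}x")
```

At 256 channels, the depthwise separable variant needs roughly an order of magnitude fewer parameters than a dense 3 × 3 convolution, which is the basis of PMFE's lightweight claim.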

3.3.2. MGLFE BLOCK

The MGLFE module is designed to efficiently model global contextual dependencies in text. Its working principle begins with enhancing local feature representation through depthwise separable convolution. The core MSLA mechanism then employs multi-branch depthwise convolutions to generate multi-scale features and utilizes linear attention to reduce computational complexity from quadratic to linear, enabling efficient long-range dependency modeling across the entire sequence. Finally, the features are further refined through feed-forward networks and residual connections. This module enables the model to accurately capture the global emotional tone of the text. While achieving global long-range dependency modeling with linear complexity, it forms an effective functional complement to the PMFE module, collectively enhancing the model’s accurate understanding of complex emotional expressions and computational efficiency.
The overall architectural framework of MGLFE is depicted in Figure 3, while the corresponding mathematical formulations are presented as follows:
$$U_1 = \mathrm{DWConv}_{3\times3}(\mathrm{BN}(X)) + X$$
$$U_2 = \mathrm{MSLA}(\mathrm{LN}(U_1)) + U_1$$
$$U_3 = \mathrm{Linear}(\mathrm{GeLU}(\mathrm{Linear}(\mathrm{LN}(U_2)))) + U_2$$
Herein, LN denotes Layer Normalization, MSLA represents the Multi-Scale Linear Attention module, Linear indicates a linear (fully connected) operation, and other symbols retain the same meanings as defined above. The additive term in each equation denotes a residual connection. Since convolutional operations in the model may lead to the loss of critical information, we employ residual connections to supplement the features.
To further enhance the model’s capability to locate regions of interest and suppress irrelevant information, we employ Efficient Attention to capture the contextual information of multi-scale features in each branch.
First, we reshape $\tilde{Y}_j \in \mathbb{R}^{N \times N \times \eta/4}$ into $\tilde{Y}_j \in \mathbb{R}^{N \times \eta/4}$, where $\tilde{Y}_j$ represents the multi-scale feature tensor of the $j$th branch, $N \times N$ denotes the spatial dimensions of the feature, $\eta/4$ is the number of feature channels, and, after flattening, $N$ is the square of the spatial dimension. To extract global representations, we apply Efficient Attention to the multi-scale token $\tilde{Y}_j$. Specifically, for the $j$th branch and the $h$th head, we use specific linear projections to obtain the query, key, and value tensors $P_{j,h}, R_{j,h}, S_{j,h} \in \mathbb{R}^{N \times d}$, where $N$ is the number of flattened spatial elements and $d$ is the feature dimension of a single attention head:
$$P_{j,h} = \tilde{Y}_j U^{p}_{j,h}, \quad R_{j,h} = \tilde{Y}_j U^{r}_{j,h}, \quad S_{j,h} = \tilde{Y}_j U^{s}_{j,h}$$
where $U^{p}_{j,h}, U^{r}_{j,h}, U^{s}_{j,h} \in \mathbb{R}^{\eta/4 \times d}$ are the projection weights for the query, key, and value, respectively. Next, we perform attention computation for each head separately using Efficient Attention:
$$\mathrm{output}_{j,h} = \mathrm{EfficientAtt}_{j,h}(P_{j,h}, R_{j,h}, S_{j,h})$$
$$T_j = \mathrm{Concat}(\mathrm{output}_{j,0}, \mathrm{output}_{j,1}, \ldots, \mathrm{output}_{j,h-1})\, U^{T}_{j}$$
where $\mathrm{output}_{j,h}$ is the output of the $h$th attention head in the $j$th branch. An additional linear transformation is applied using the weight matrix $U^{T}_{j} \in \mathbb{R}^{\eta/4 \times \eta/4}$ to combine the outputs of all heads. $T_j$ denotes the feature tensor of the $j$th branch after attention computation and merging all heads. $\mathrm{Concat}(\cdot)$ represents the concatenation of the outputs of all $h$ attention heads in the $j$th branch along the channel dimension.
Finally, $T_j \in \mathbb{R}^{N \times \eta/4}$ is reshaped along the spatial dimension into the image representation $T_j \in \mathbb{R}^{N \times N \times \eta/4}$. This transformation facilitates subsequent convolutional operations, thereby enhancing the fusion of multi-scale features. The fusion process can be described as:
$$T = f_{1\times1}([v_1 T_1, v_2 T_2, v_3 T_3, v_4 T_4])$$
where $v_j$ are learnable weight parameters, $[\cdot]$ denotes concatenation along the channel dimension, and $f_{1\times1}$ represents a 1 × 1 convolution. Subsequently, we reshape the fused feature map $T \in \mathbb{R}^{N \times N \times C}$ into $T \in \mathbb{R}^{N \times C}$ to obtain the final output tokens.
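The text does not spell out the internals of the EfficientAtt operator; the sketch below follows the widely used Efficient Attention factorization, in which softmax is applied to queries over the feature axis and to keys over the position axis, so a small $d \times d$ context matrix replaces the $N \times N$ attention map and the cost becomes linear in $N$. Shapes are illustrative:

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(Q, K, V):
    """Linear-complexity attention sketch: the d x d context K^T V replaces
    the N x N attention map, so cost is O(N d^2) instead of O(N^2 d)."""
    q = softmax(Q, axis=1)      # (N, d): each position's query normalized over features
    k = softmax(K, axis=0)      # (N, d): each key feature normalized over positions
    context = k.T @ V           # (d, d) global context, linear in sequence length N
    return q @ context          # (N, d)

rng = np.random.default_rng(0)
N, d = 64, 16                   # illustrative sizes for one branch and one head
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = efficient_attention(Q, K, V)
assert out.shape == (N, d)
```

Doubling the sequence length doubles the cost of the `k.T @ V` step rather than quadrupling it, which is the source of the module's approximately linear complexity.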
In the MGLFE module, the input sample $X \in \mathbb{R}^{H \times W \times C}$ is first processed by BN, followed by a 3 × 3 depthwise convolution that incorporates relative position encoding. The result is combined with the original input via a residual connection to enhance spatial awareness, yielding feature U1. Applying BN prior to convolution improves convergence speed, stabilizes the data distribution, and accelerates subsequent training [52]. The fused output is then normalized using LN to stabilize feature statistics, after which it is fed into the MSLA module (detailed in the next section). This module captures multi-scale local features through parallel depthwise convolution branches and efficiently models global long-range dependencies via a linear attention mechanism. The MSLA output is again fused with the pre-normalized features through a residual connection, producing feature U2. Finally, the current features undergo LN and are passed to a Feed-Forward Network (FFN) [53], which expands and contracts the channel dimensions to introduce nonlinear transformation and refine high-level semantics. A final residual connection integrates the processed features, outputting the refined feature U3. The entire pipeline employs three residual connections to preserve the original information flow, uses depthwise convolutions to reduce parameters, and applies layer normalization to enhance training stability, thereby achieving effective joint modeling of local details and global context.
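The pre-norm residual pipeline for U2 and U3 can be sketched structurally as follows. This is an illustration only: the MSLA operator is replaced by an identity stub, the DWConv step for U1 is omitted, and the weights and sizes are arbitrary:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mglfe_block(x, msla, W1, W2):
    """Structural sketch of U2 = MSLA(LN(U1)) + U1 and
    U3 = Linear(GeLU(Linear(LN(U2)))) + U2."""
    u2 = msla(layer_norm(x)) + x                 # attention sublayer + residual
    u3 = gelu(layer_norm(u2) @ W1) @ W2 + u2     # FFN expand/contract + residual
    return u3

rng = np.random.default_rng(1)
N, C = 32, 8
x = rng.standard_normal((N, C))
W1 = rng.standard_normal((C, 4 * C)) * 0.1       # FFN expansion
W2 = rng.standard_normal((4 * C, C)) * 0.1       # FFN contraction
out = mglfe_block(x, msla=lambda t: t, W1=W1, W2=W2)  # identity stub for MSLA
assert out.shape == x.shape
```

The residual additions at each sublayer are what keep the original information flow intact when the convolutional and attention transforms discard detail.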

4. Experiments

4.1. Datasets

To verify the feasibility and effectiveness of the proposed model, this study employs three publicly accessible and challenging benchmark datasets, namely the SST-2 dataset [54], the Amazon Reviews dataset [55], and the MR dataset [56]. The detailed information and key attributes of these datasets are summarized in Table 1.
Across all three datasets, the data were split into 80% for training, 10% for validation, and the remaining 10% for testing.
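A minimal sketch of such an 80/10/10 split (the helper name and random seed are our choices, not the paper's):

```python
import random

def train_val_test_split(items, seed=42):
    """Shuffle and split a dataset 80/10/10; hypothetical helper."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = train_val_test_split(range(1000))
assert (len(train), len(val), len(test)) == (800, 100, 100)
```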
To ensure the consistency of the evaluation process and the uniformity of the data labels, fundamental data preprocessing was performed in this study; the relevant information is compiled in Table 2.

4.2. Experimental Setup and Evaluation Metrics

The sentiment analysis model proposed in this study was implemented on the TensorFlow deep learning platform [57], with the pre-trained BERT-base-uncased model employed as its backbone architecture. Data preprocessing included the removal of irrelevant features such as Uniform Resource Locators (URLs), stop words, and consecutive punctuation, as well as the elimination of semantically inconsistent or incomplete samples, to ensure high-quality input for model training. Considering the characteristics of the different datasets, varying maximum sequence lengths were applied: for the SST-2 and MR datasets, which are primarily composed of short-to-medium-length text samples, a maximum length of 64 tokens was set, whereas for the Amazon Reviews dataset, which contains longer texts, a maximum length of 128 tokens was adopted. Inputs were uniformly adjusted via padding or truncation to facilitate efficient text feature extraction by the model.
The configuration of the experimental environment is detailed in Table 3. When the validation loss decreased only slowly, the learning rate was dynamically reduced by a factor of 10−3 to optimize convergence. To ensure experimental fairness, comparisons and evaluations were conducted in accordance with the standards outlined in reference [16], with overall Accuracy and the macro-average F1-Score adopted as the evaluation metrics. To ensure the reliability of the experimental results, all three datasets were randomly split in the same ratio and each experiment was repeated 10 times, with conclusions derived from the resulting means and standard deviations. The entire training process was completed on a dual Titan X GPU platform. The weight decay mechanism of the AdamW optimizer [58], with the learning rate set to 2 × 10−5, effectively improved the generalization capability and training efficiency of the model.
Algorithm Outline.
Algorithm 1. Emotion Classification
Input: Raw input data $X$
Output: Emotion classification result $Y$
Step 1: BERT Encoding. Encode the input data using BERT to obtain time-step features: $F_b = \mathrm{BERTEncoder}(X)$
Step 2: Parallel Multi-scale Feature Extraction. Extract local detailed features using the PMFE modules: $F^{(i)}_{pmfe} = \mathrm{PMFE}_i(F_b),\ i \in \{1, 2\}$
Step 3: Global Multi-scale Linear Feature Extraction. Extract global context information using the MGLFE modules: $F^{(i)}_{mglfe} = \mathrm{MGLFE}_i(F_b),\ i \in \{1, 2\}$
Step 4: Feature Integration and Global Dependency Modeling. Build global long-range dependencies: $F_{integrated} = \mathrm{MSLA}(F_{pmfe}, F_{mglfe})$
Step 5: Classification. Apply average pooling, $F_{pool} = \mathrm{AveragePooling}(F_{integrated})$; apply a 1 × 1 convolution, $F_{conv} = \mathrm{Conv}_{1\times1}(F_{pool})$; obtain the final emotion classification result, $Y = \mathrm{Softmax}(F_{conv})$
return $Y$
The input data $X$ is encoded by $\mathrm{BERTEncoder}$, yielding the feature $F_b$. Two PMFE modules and two MGLFE modules are initialized, and $F_b$ is processed separately by each, producing two PMFE features and two MGLFE features. These four features are fused, and a 1 × 1 convolution is applied for dimensionality reduction, resulting in the feature $F_{integrated}$. Average pooling is performed on $F_{integrated}$ to obtain $F_{pool}$, and a 1 × 1 convolution is applied to $F_{pool}$, yielding $F_{conv}$. Finally, Softmax is applied to $F_{conv}$ to obtain the final emotion classification probability $Y$.
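The classification head of Step 5 reduces to average pooling over sequence positions, a linear projection (a 1 × 1 convolution on pooled features), and a softmax. A minimal sketch with arbitrary feature sizes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(F_integrated, W):
    """Sketch of Step 5: pool fused token features, project, normalize."""
    f_pool = F_integrated.mean(axis=0)   # average pooling over sequence positions
    f_conv = W @ f_pool                  # 1 x 1 convolution acts as a linear map here
    return softmax(f_conv)               # class probabilities

rng = np.random.default_rng(2)
feats = rng.standard_normal((64, 32))    # (sequence length, channels), illustrative
W = rng.standard_normal((2, 32)) * 0.1   # binary positive/negative sentiment head
probs = classify(feats, W)
assert probs.shape == (2,) and abs(probs.sum() - 1.0) < 1e-9
```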

4.3. Results

Table 4 presents the comparative results.
As illustrated in Figure 4, which depicts the training and validation loss curves along with accuracy curves, the model demonstrates strong performance in the initial training phase: the training loss decreases rapidly, accompanied by a synchronous reduction in validation loss, while both training and validation accuracy show a steady upward trend, indicating robust learning capability early in the training process. As the number of training epochs increases, the training loss continues to decline and gradually stabilizes, with training accuracy approaching a high level, reflecting the model’s improving capacity to fit the training data. However, in the later stages of training, the rate of decrease in validation loss slows noticeably and exhibits minor fluctuations, while the improvement in validation accuracy also plateaus, maintaining a discernible gap compared to the training accuracy. This pattern suggests that the model may be experiencing mild overfitting, wherein excessive optimization on the training data leads to a slight degradation in generalization performance on unseen validation data. Overall, the model exhibits relatively stable behavior throughout the training process, with no severe overfitting or underfitting observed. The validation accuracy eventually stabilizes at a reasonably satisfactory level, indicating that the model possesses good usability and generalization capability.
On the SST-2 and Amazon Reviews datasets with an 80% training ratio, our proposed model achieved accuracies of 97.9% and 95.7%, respectively, representing improvements of 1.8% and 1.5% over BERT-CondConv [16]. For further analysis, BERT-CondConv [16] leverages all hidden layers of BERT and employs multi-scale dynamic convolution (CondConv) for feature fusion and extraction. However, this model suffers from high computational complexity and resource consumption, which stems from its integration of all BERT hidden layer outputs and the introduction of dynamic conditional convolution parameters, resulting in prolonged training and inference times. This poses challenges for deployment in resource-constrained or real-time application scenarios. In contrast, our model is based on fine-tuning BERT. Through supervised learning on annotated data for downstream tasks, we adjust the pre-trained weights via gradient updates to adapt the model to specific tasks and maximize its performance. We use BERT-base-uncased as the pre-trained model to generate high-quality universal contextualized word representations, providing a strong feature foundation for transfer learning in downstream NLP tasks. Furthermore, we propose two novel modules: PMFE and MGLFE. The PMFE module excels at accurately capturing local keywords and patterns that determine semantics, while the MGLFE module is designed to understand global contextual relationships and assign higher weights to important information. The combination of these two modules enables comprehensive and efficient extraction of deep semantic features from short texts, laying the groundwork for improved prediction accuracy.
Our model also achieves improvements of 0.4% and 0.3% in F1-Score on these two datasets, respectively. Further analysis reveals that the BERT-CondConv [16] architecture is relatively complex, where the tight coupling of multiple modules increases the difficulty of hyperparameter tuning, and its performance heavily depends on the quality of the BERT pre-trained model. When processing long texts, BERT-CondConv relies on truncation operations, which may lead to the loss of critical information and limit its ability to model long-range dependencies. We observe that constructing effective long-range textual relationships can enhance accuracy. Therefore, our model incorporates an MSLA mechanism, which enhances expressive capacity by computing multiple independent attention heads in parallel. This allows the model to simultaneously focus on various dependency relationships at different positions in the input sequence. By projecting the input into different representation subspaces, computing attention weights separately, and then integrating the results, the module captures richer contextual information. As a result, our approach not only improves the model's ability to learn complex patterns compared to BERT-CondConv [16] but also enhances its generalization performance and interpretability, leading to certain advantages in predictive accuracy.
Table 5 presents the results on the MR dataset.
Compared to the SST-2 and Amazon datasets, the MR dataset contains more complete reviews and exhibits more nuanced emotional expressions. With an 80% training ratio, our proposed MSFFLA model yields an accuracy of 85.2%, outperforming the ATT-Pooling [66] model by 1.6%. Further analysis indicates that while the ATT-Pooling [66] approach effectively captures key emotional features in text and improves classification accuracy, it suffers from a large number of parameters, high computational complexity, a tendency to overfit, and a strong reliance on extensive training data. Our model, by contrast, presents a novel lightweight architectural design grounded in the BERT pre-trained model. Through the synergistic design of the PMFE and MGLFE modules, it significantly reduces parameter complexity while maintaining strong expressive power. The model exhibits excellent generalization performance in data-scarce scenarios, as its feature decoupling encoding strategy reduces reliance on large-scale annotated data, and the dual-module parallel computing structure effectively decreases inference latency. In cross-domain text classification tasks, it significantly surpasses baseline models in both accuracy and stability, demonstrating robust domain adaptation capability. This design offers an effective solution for natural language processing tasks in resource-constrained environments.
On the MR dataset, an improvement of 0.3% in F1-Score is also observed. The ATT-Pooling [66] model is an elaborately designed hybrid architecture integrating both CNN and RNN, which effectively incorporates an attention mechanism to combine the strengths of CNNs in local feature extraction and RNNs in sequence modeling. While it has demonstrated effectiveness across multiple benchmark tests, the model exhibits several limitations: its attention mechanism remains relatively static, suffers from feature redundancy, and heavily relies on external pre-trained word embeddings. These factors impose significant constraints on its practical deployment and application. Therefore, the proposed MSLA mechanism, incorporated into our model, significantly enhances the capability to capture features at varying structural scales in movie reviews by combining parallel multi-scale convolution with linear attention computation. This approach effectively models global contextual dependencies while maintaining low computational complexity.
On the SST-2 and Amazon Review datasets, the total number of parameters of our model is 23.73M, representing a reduction of 88.01M compared to BERT-CondConv [16]. On the MR dataset, our model contains 23.73M parameters, which is 87.35M fewer than the existing ATT-Pooling model [66], demonstrating significant parameter efficiency. This lightweight characteristic can be attributed to two key design elements: first, the adopted parallel dilated convolutions effectively expand the receptive field without significantly increasing the number of parameters; second, the efficient attention mechanism enhances the flow of critical information by suppressing irrelevant features, thereby reducing parameter redundancy while maintaining the model’s representational capacity.
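The reported reductions imply baseline totals of roughly 111.74M parameters for BERT-CondConv and 111.08M for ATT-Pooling. A quick consistency check on the figures given in the text:

```python
# Consistency check on the parameter counts reported above (values in millions,
# taken from the text; the baseline totals are implied, not independently measured).
ours = 23.73
bert_condconv = ours + 88.01   # implied BERT-CondConv total
att_pooling = ours + 87.35     # implied ATT-Pooling total
assert round(bert_condconv, 2) == 111.74
assert round(att_pooling, 2) == 111.08
print(f"our model uses {ours / bert_condconv:.0%} of the BERT-CondConv parameter budget")
```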

4.4. Ablation Study

With the Amazon dataset employed as an example, a series of experimental trials were performed to examine the contribution of each module within the proposed model.

4.4.1. Effects of PMFE in the Model

To assess the effect of the PMFE module, a comparative analysis was performed between the model with an integrated PMFE module and its ablated counterpart (without the PMFE module). The experimental findings are shown in Table 6. These results demonstrate that the PMFE-free model reached an accuracy of 93.3% and an F1-Score of 91.9%, while the model with the PMFE module obtained significantly higher performance: 95.7% in accuracy and 94.5% in F1-Score. These findings demonstrate the critical role of the PMFE module in feature extraction. The improvement is attributed to its architectural design, which features depthwise separable convolution and parallel multi-scale branches that ensure computational efficiency. Additionally, the multi-scale convolutions capture features across varying receptive fields. In addition, Batch Normalization, together with the three residual connections, boosts the stability during training and the smoothness of gradient flow, while the combination of SeLU and GeLU activation functions, along with the FFN, strengthens nonlinear feature representation. Collectively, these components strike an effective balance between computational efficiency and expressive power.

4.4.2. Effects of MGLFE in the Model

To evaluate the impact of the MGLFE module, a comparative analysis was carried out between the MGLFE module-incorporated model and its ablated variant (without the MGLFE module). The experimental results are displayed in Table 7. These results indicate that the model lacking the MGLFE module reached an accuracy of 94.0% and an F1-Score of 92.1%, whereas the model with the MGLFE module obtained higher scores: 95.7% in accuracy and 94.5% in F1-Score. This outcome demonstrates the significant contribution of the MGLFE module to feature extraction. The improvement can be attributed to the module’s use of multi-scale convolutions with varying kernel sizes, which effectively capture information at different granularities. This design diminishes computational overhead, and in the process boosts the comprehensive expressiveness of the learned features.
The significant performance improvement of our model primarily stems from the synergistic innovation of the newly proposed PMFE and MGLFE modules. Specifically, while BERT-CondConv relies on conditionally parameterized convolutions for dynamic feature fusion, it introduces additional complexity and fails to optimally balance local and global feature extraction. To address this, our PMFE module adopts a parallel multi-branch architecture with dilated convolutions, which efficiently and precisely captures multi-scale local features without significantly increasing the parameter count. Its lightweight design leverages depthwise separable convolutions and residual connections.
Simultaneously, the core multi-scale linear attention mechanism (MSLA) in the MGLFE module overcomes the quadratic computational complexity bottleneck of the original BERT self-attention. It efficiently models full-sequence contextual dependencies with approximately linear complexity, thereby accurately capturing the overall sentiment tone of contrastive sentences—such as “the action effects are great, but the plot is mediocre.”
Ultimately, the PMFE and MGLFE modules are deeply integrated via a symmetric lightweight architecture, enabling effective complementarity between local details and global context. The local features captured by PMFE provide foundational cues for global judgment, while the contextual relationships established by MGLFE help disambiguate local semantics and identify ironic or contrastive expressions. This tight, efficient, and multi-scale feature fusion synergy is the key to our model’s breakthrough in both accuracy and efficiency.

4.4.3. Effects of MSLA in the Model

To evaluate the impact of the MSLA module, a comparative analysis was conducted between models with and without its incorporation. As summarized in Table 8, the model without the MSLA module achieved an accuracy of 92.6% and an F1-Score of 91.7%, while the model integrated with the MSLA module attained significantly higher performance, reaching an accuracy of 95.7% and an F1-Score of 94.5%. These results confirm the substantial contribution of the MSLA unit within the MGLFE module to feature extraction. The MSLA mechanism utilizes parallel multi-branch depthwise convolutions with diverse kernel sizes, specifically 3 × 3, 5 × 5, 7 × 7, and 9 × 9, to acquire multi-scale features. This design empowers the model to extract fine-grained local details while simultaneously integrating wider contextual information, thereby facilitating the accurate modeling of the target. Furthermore, the module effectively captures global contextual dependencies while maintaining low computational complexity and parameter efficiency.

4.4.4. Effects of BERT in the Model

To evaluate the impact of the BERT module, a comparative analysis was conducted between models with and without its integration. As shown in Table 9, the BERT-free model achieved an accuracy of 92.7% and an F1-Score of 91.3%, whereas the model with integrated BERT delivered significantly better performance, recording an accuracy of 95.7% and an F1-Score of 94.5%. These results confirm the critical advantage of BERT: its pre-trained architecture serves as a robust backbone network, where an integrated tokenizer first processes input data into tokens, generating contextualized representations of these tokens. This process strengthens the model’s capability of capturing semantic nuances, thus improving accuracy and accelerating the convergence rate.
Based on 10 repeated experiments, we calculated 95% confidence intervals to evaluate the stability of the results. On the SST-2 dataset, our model achieved an accuracy of 97.9% [97.56%, 98.24%], significantly outperforming BERT-CondConv’s 96.1% [95.68%, 96.52%]. The performance differences observed in all module ablation studies were statistically significant (p < 0.05), confirming the important contribution of each component to the model’s performance.
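A normal-approximation 95% interval over 10 repeated runs can be computed as follows. The per-run accuracies below are illustrative, not the paper's raw data, and the paper does not state whether a t- or normal-based interval was used:

```python
import math

def mean_ci_95(samples):
    """95% confidence interval for the mean (normal approximation, z = 1.96)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)   # sample variance
    half = 1.96 * math.sqrt(var / n)                        # half-width of the CI
    return mean - half, mean + half

# Illustrative accuracies over 10 runs (NOT the paper's raw per-run results).
runs = [97.6, 98.1, 97.9, 98.0, 97.7, 98.2, 97.8, 97.9, 98.0, 97.8]
lo, hi = mean_ci_95(runs)
assert lo < sum(runs) / len(runs) < hi
```

With only 10 runs, a t-distribution critical value (about 2.262 for 9 degrees of freedom) would give a slightly wider interval than z = 1.96.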

5. Conclusions

In this study, we propose the MSFFLA model. The model first captures contextual relationships and sequential features through a BERT encoding layer, then hierarchically refines local and global features via the multi-stage feature extraction layers (PMFE and MGLFE), and finally accomplishes the classification task through a classification head. The "BERT + multi-stage CNN" hybrid architecture effectively merges BERT's sequence modeling capability with the hierarchical feature extraction advantage of CNNs, making it suitable for tasks requiring simultaneous understanding of sequential dependencies and hierarchical semantics. To validate the model's effectiveness, extensive experiments were conducted on the SST-2, Amazon Reviews, and MR datasets. Relative to the BERT-CondConv model, our model attained accuracy and F1-Score gains of 1.8% and 0.4%, respectively, on the SST-2 dataset, and 1.5% in accuracy and 0.3% in F1-Score on the Amazon Reviews dataset. These experimental results illustrate that the proposed model demonstrates certain advantages compared with existing mainstream models.
Although our model offers advantages in accuracy and parameter count, its computational performance and accuracy may still fall short in real-world scenarios, mainly because we used standardized, cleaned data; detection accuracy on real datasets may therefore decline. In future work, we will apply cleaning, word segmentation, truncation-length control, and category-distribution balancing to real datasets before applying the model, and will consider introducing additional data types, such as video or audio content, to improve the sentiment analysis of movie reviews. We will also continue to optimize the model's computational performance while maintaining high prediction accuracy, so that it can be better applied in real scenarios.
The practical value of this study lies in the model’s high-precision capability to provide nuanced user sentiment insights for movie recommendation systems, thereby fostering a shift in personalized services from “rating-based” to “sentiment-aware” paradigms. Furthermore, its lightweight architecture offers potential for deployment in mobile or edge computing environments, enabling real-time applications such as public opinion analysis and online recommendation. Additionally, the multi-scale feature fusion methodology employed in our model provides a valuable technical reference for handling complex semantic structures.
In future research, we will explore multimodal data fusion by incorporating multimedia information such as video trailers, poster images, and audio reviews to construct a cross-modal sentiment analysis framework, thereby comprehensively capturing the comprehensive appeal of film productions. We will validate the model’s generalization capability across domains by applying it to scenarios such as product reviews and social media short texts, while investigating domain adaptation techniques to enhance its universality. Additionally, we will optimize the model’s computational efficiency through research on model compression techniques like knowledge distillation and conduct systematic deployment validation in real industrial settings to further improve its practical value.

Author Contributions

Conceptualization, Z.J. and C.X.; methodology, Z.J. and C.X.; software, C.X.; validation, C.X.; formal analysis, Z.J.; investigation, C.X.; resources, Z.J.; data curation, C.X.; writing—original draft preparation, Z.J.; writing—review and editing, C.X.; visualization, Z.J.; supervision, C.X.; project administration, Z.J.; funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Natural Science Foundation of China (42261068), the Natural Science Foundation of Jiangxi Province (20242BAB25112), National Natural Science Foundation of China (62363015), Research Project of Social Sciences Planning of Shandong Province (23DWYJ13).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data associated with this research are available online. The SST-2 dataset is available for download at https://gluebenchmark.com/tasks/ (accessed on 12 November 2023). The Amazon Review is available for download at https://amazon-reviews-2023.github.io/ (accessed on 15 December 2023). The Movie Review (MR) dataset is available for download at https://zenodo.org/records/8276786 (accessed on 16 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MSFFLA: Multi-Scale Feature Fusion Linear Attention Model
BERT: Bidirectional Encoder Representations from Transformers
PMFE: Parallel Multi-scale Feature Extraction Module
MGLFE: Multi-scale Global Linear Feature Extraction Module
MSLA: Multi-Scale Linear Attention
SST-2: Stanford Sentiment Treebank (2 classes)
MR: Movie Review
SC: Sentiment Classification
BoVW: Bag-of-Visual-Words
TF-IDF: Term Frequency-Inverse Document Frequency
UPCSim: User Profile Correlation-based Similarity
KNN: K-Nearest Neighbors
CNN: Convolutional Neural Network
LSTM: Long Short-Term Memory
BCAT: Bidirectional Context-Aware Transformer
IMDB: IMDb Large Movie Review Dataset
RNN: Recurrent Neural Network
ERNIE-MCBMA: Enhanced Representation through kNowledge IntEgration with Multi-Channel Bidirectional Multi-head Attention
SMP2020-EWECT: Social Media Processing 2020 Evaluation of Weibo Emotion Classification Task
SVM: Support Vector Machine
NLP: Natural Language Processing
VADER: Valence Aware Dictionary and sEntiment Reasoner
ABSA: Aspect-Based Sentiment Analysis
XGBoost: eXtreme Gradient Boosting
TD-BERT: Target-Dependent BERT
BERT-BiGRU: BERT Bidirectional Gated Recurrent Unit
BiLSTM: Bidirectional Long Short-Term Memory
GPT-3: Generative Pre-trained Transformer 3
LLaMA-2: Large Language Model Meta AI 2
CNN-TE: CNN Transformer Encoder
GRU: Gated Recurrent Unit
AM-MSFFN: Attention-fused Multi-Scale Feature Fusion Network
CBAM: Convolutional Block Attention Module
DEAP: Database for Emotion Analysis using Physiological Signals
SEED: SJTU Emotion EEG Dataset
RoBERTa-MA: Robustly Optimized BERT Pretraining Approach with Multi-head Attention
HOMRA-Net: Hybrid Optimized Multi-Scale Residual Attention Network
MELD: Multimodal EmotionLines Dataset
BN: Batch Normalization
LN: Layer Normalization
FFN: Feed-Forward Network
URL: Uniform Resource Locator
SGD: Stochastic Gradient Descent

References

  1. Davoodi, L.; Mezei, J.; Heikkilä, M. Aspect-based sentiment classification of user reviews to understand customer satisfaction of e-commerce platforms. In Electronic Commerce Research; Springer Nature: Berlin/Heidelberg, Germany, 2025; pp. 1–43. [Google Scholar] [CrossRef]
  2. Prova, N. Multilingual Emotion Classification in E-Commerce Customer Reviews Using GPT and Deep Learning-Based Meta-Ensemble Model. 2025. Available online: https://ssrn.com/abstract=5161505 (accessed on 1 January 2025).
  3. Shi, Y. A CNN-Based Approach for Classical Music Recognition and Style Emotion Classification. IEEE Access 2025, 13, 20647–20666. [Google Scholar] [CrossRef]
  4. Nabiilah, G.Z. Effectiveness analysis of roberta and distilbert in emotion classification task on social media text data. Eng. Math. Comput. Sci. J. (EMACS) 2025, 7, 45–50. [Google Scholar] [CrossRef]
  5. Rout, J.K.; Choo, K.K.R.; Dash, A.K.; Bakshi, S.; Jena, S.K.; Williams, K.L. A model for sentiment and emotion analysis of unstructured social media text. Electron. Commer. Res. 2018, 18, 181–199. [Google Scholar] [CrossRef]
  6. Sharma, N.A.; Ali, A.S.; Kabir, M.A. A review of sentiment analysis: Tasks, applications, and deep learning techniques. Int. J. Data Sci. Anal. 2025, 19, 351–388. [Google Scholar] [CrossRef]
  7. Wu, S.; Wu, F.; Chang, Y. Automatic construction of target-specific sentiment lexicon. Expert Syst. Appl. 2019, 116, 285–298. [Google Scholar] [CrossRef]
  8. Zhang, S.X.; Wei, Z.L.; Wang, Y. Sentiment analysis of Chinese micro-blog text based on extended emotion dictionary. Future Gener. Comput. Syst. 2018, 81, 395–403. [Google Scholar] [CrossRef]
  9. Widiyaningtyas, T.; Hidayah, I.; Adji, T.B. User profile correlation-based similarity (UPCSim) algorithm in movie recommendation system. J. Big Data 2021, 8, 52. [Google Scholar] [CrossRef]
  10. Pavirha, N.; Pungliya, V.; Raut, A.; Bhonsle, R.; Purohit, A.; Patel, A.; Shashidhar, R. Movie recommendation and sentiment analysis using machine learning. Glob. Transit. Proc. 2022, 3, 279–284. [Google Scholar] [CrossRef]
  11. Dashtipour, K.; Gogate, M.; Adeel, A.; Larijani, H.; Hussain, A. Sentiment analysis of persian movie reviews using deep learning. Entropy 2021, 23, 596. [Google Scholar] [CrossRef]
  12. Prasath, S.R.; Jeevitha, J.K.; Margret, S.; Krishnan, R.S.; Raj, J.R.F.; Nithila, E.E. Deep Learning Models for Understanding Audience Sentiments in Movie Tweets. In Proceedings of the 2024 5th International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 7–9 August 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1745–1753. [Google Scholar] [CrossRef]
  13. Acheampong, F.A.; Nunoo-Mensah, H.; Chen, W. Transformer models for text-based emotion detection: A review of BERT-based approaches. Artif. Intell. Rev. 2021, 54, 5789–5829. [Google Scholar] [CrossRef]
  14. Saad, T.B.; Ahmed, M.; Ahmed, B.; Sazan, S.A. A Novel Transformer Based Deep Learning Approach of Sentiment Analysis for Movie Reviews. In Proceedings of the 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), Dhaka, Bangladesh, 2–4 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1228–1233. [Google Scholar]
  15. Ruan, T.; Liu, Q.; Chang, Y. Digital media recommendation system design based on user behavior analysis and emotional feature extraction. PLoS ONE 2025, 20, e0322768. [Google Scholar] [CrossRef] [PubMed]
  16. Wang, X.; Liu, W. Sentiment classification method based on BERT-CondConv multi-moment state fusion. Comput. Speech Lang. 2026, 95, 101855. [Google Scholar] [CrossRef]
  17. Sun, Y.; Yu, Z.; Sun, Y.; Xu, Y.; Song, B. A novel approach for multiclass sentiment analysis on Chinese social media with ERNIE-MCBMA. Sci. Rep. 2025, 15, 18675. [Google Scholar] [CrossRef] [PubMed]
  18. Louati, A.; Louati, H.; Kariri, E.; Alaskar, F.; Alotaibi, A. Sentiment analysis of Arabic course reviews of a Saudi university using support vector machine. Appl. Sci. 2023, 13, 12539. [Google Scholar] [CrossRef]
  19. Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 168–177. [Google Scholar] [CrossRef]
  20. Wilson, T.; Wiebe, J.; Hoffmann, P. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, Canada, 6–8 October 2005; pp. 347–354. [Google Scholar] [CrossRef]
  21. Muddiman, A.; McGregor, S.C.; Stroud, N.J. (Re) claiming our expertise: Parsing large text corpora with manually validated and organic dictionaries. Political Commun. 2019, 36, 214–226. [Google Scholar] [CrossRef]
  22. Amsler, M. Using Lexical-Semantic Concepts for Fine-Grained Classification in the Embedding Space. Ph.D. Thesis, University of Zurich, Zürich, Switzerland, 2020. [Google Scholar] [CrossRef]
  23. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, PA, USA, 6–7 July 2002; pp. 79–86. [Google Scholar] [CrossRef]
  24. Sarhan, A.M.; Ayman, H.; Wagdi, M.; Ali, B.; Adel, A.; Osama, R. Integrating machine learning and sentiment analysis in movie recommendation systems. J. Electr. Syst. Inf. Technol. 2024, 11, 53. [Google Scholar] [CrossRef]
  25. Jassim, M.A.; Abd, D.H.; Omri, M.N. Machine learning-based new approach to films review. Soc. Netw. Anal. Min. 2023, 13, 40. [Google Scholar] [CrossRef]
  26. Gardazi, N.M.; Daud, A.; Malik, M.K.; Bukhari, A.; Alsahfi, T.; Alshemaimri, B. BERT applications in natural language processing: A review. Artif. Intell. Rev. 2025, 58, 1–49. [Google Scholar] [CrossRef]
  27. Chen, Z.; Qian, T. Relation-aware collaborative learning for unified aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5 July 2020; pp. 3685–3694. [Google Scholar] [CrossRef]
  28. He, X. Sentiment Classification of Social Media User Comments Using SVM Models. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 29–31 March 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1755–1759. [Google Scholar] [CrossRef]
  29. Gao, Z.; Feng, A.; Song, X.; Wu, X. Target-dependent sentiment classification with BERT. IEEE Access 2019, 7, 154290–154299. [Google Scholar] [CrossRef]
  30. Almufareh, M.F.; Jhanjhi, N.Z.; Khan, N.A.; Almuayqil, S.N.; Humayun, M.; Javed, D. BertSent: Transformer-based model for sentiment analysis of penta-class tweet classification. IEEE Access 2024, 12, 196803–196817. [Google Scholar] [CrossRef]
  31. Rahman, B. Optimizing customer satisfaction through sentiment analysis: A BERT-based machine learning approach to extract insights. IEEE Access 2024, 12, 196803–196817. [Google Scholar] [CrossRef]
  32. Chen, J.; Ma, X.; Li, S.; Ma, S.; Zhang, Z.; Ma, X. A hybrid parallel computing architecture based on CNN and transformer for music genre classification. Electronics 2024, 13, 3313. [Google Scholar] [CrossRef]
  33. Li, P.; Duan, C.; Jiang, H.; Liu, E. Research on emotion analysis of e-commerce product reviews based on deep learning. In Proceedings of the Second International Conference on Statistics, Applied Mathematics, and Computing Science (CSAMCS 2022), Nanjing, China, 28 March 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12597, pp. 960–965. [Google Scholar] [CrossRef]
  34. Prottasha, N.J.; Sami, A.A.; Kowsher, M.; Murad, S.A.; Bairagi, A.K.; Masud, M.; Baz, M. Transfer learning for sentiment analysis using BERT based supervised fine-tuning. Sensors 2022, 22, 4157. [Google Scholar] [CrossRef] [PubMed]
  35. Yang, H.; Zi, Y.; Qin, H.; Zheng, H.; Hu, Y. Advancing emotional analysis with large language models. J. Comput. Sci. Softw. Appl. 2024, 4, 8–15. [Google Scholar] [CrossRef]
  36. Garg, S.; Torra, V. Exploring Distribution Learning of Synthetic Data Generators for Manifolds. In Proceedings of the European Symposium on Research in Computer Security, Bydgoszcz, Poland, 16–20 September 2024; Springer Nature: Cham, Switzerland, 2024; pp. 65–76. [Google Scholar] [CrossRef]
  37. Zhang, W.; Deng, Y.; Liu, B.; Pan, S.; Bing, L. Sentiment analysis in the era of large language models: A reality check. In Findings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, 16–21 June 2024; pp. 3881–3906. [Google Scholar] [CrossRef]
  38. Kumar, A.; Sharma, R.; Bedi, P. Towards optimal NLP solutions: Analyzing GPT and LLaMA-2 models across model scale, data set size, and task diversity. Eng. Technol. Appl. Sci. Res. 2024, 14, 14219–14224. [Google Scholar] [CrossRef]
  39. Tennakoon, N.; Senaweera, O.; Dharmarathne, H.A.S.G. Emotion-based movie recommendation system. Int. J. Adv. ICT Emerg. Reg. (ICTer) 2024, 17, 34–39. [Google Scholar] [CrossRef]
  40. Hossain, M.M.; Hossain, M.S.; Hossain, M.S.; Mridha, M.F.; Safran, M.; Alfarhood, S. TransNet: Deep attentional hybrid transformer for Arabic posts classification. IEEE Access 2024, 2, 111070–111096. [Google Scholar] [CrossRef]
  41. Jiang, Y.; Xie, S.; Xie, X.; Cui, Y.; Tang, H. Emotion recognition via multiscale feature fusion network and attention mechanism. IEEE Sens. J. 2023, 23, 10790–10800. [Google Scholar] [CrossRef]
  42. Jia, K. Sentiment classification of microblog: A framework based on BERT and CNN with attention mechanism. Comput. Electr. Eng. 2022, 101, 108032. [Google Scholar] [CrossRef]
  43. Guo, Q.; Qiu, X.; Liu, P.; Xue, X.; Zhang, Z. Multi-scale self-attention for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7847–7854. [Google Scholar] [CrossRef]
  44. Ameer, I.; Bölücü, N.; Siddiqui, M.H.F.; Can, B.; Sidorov, G.; Gelbukh, A. Multi-label emotion classification in texts using transfer learning. Expert Syst. Appl. 2023, 213, 118534. [Google Scholar] [CrossRef]
  45. Subbaiah, B.; Murugesan, K.; Saravanan, P.; Marudhamuthu, K. An efficient multimodal sentiment analysis in social media using hybrid optimal multi-scale residual attention network. Artif. Intell. Rev. 2024, 57, 1–27. [Google Scholar] [CrossRef]
  46. Mirzaee, G.; Doretto, G.; Adjeroh, D. Multi-label Classification using Self-Supervised Learning: Addressing Class Inter-Dependency and Data Imbalance. In Proceedings of the 2024 International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 18–20 December 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1271–1276. [Google Scholar] [CrossRef]
  47. Foumani, N.M.; Tan, C.W.; Webb, G.I.; Salehi, M. Improving position encoding of transformers for multivariate time series classification. Data Min. Knowl. Discov. 2024, 38, 22–48. [Google Scholar] [CrossRef]
  48. Kallstenius, T.; Capusan, A.J.; Andersson, G.; Williamson, A. Comparing traditional natural language processing and large language models for mental health status classification: A multi-model evaluation. Sci. Rep. 2025, 15, 24102. [Google Scholar] [CrossRef] [PubMed]
  49. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A survey of visual transformers. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 7478–7498. [Google Scholar] [CrossRef]
  50. Zhuang, L.; Wayne, L.; Ya, S.; Jun, Z. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th Chinese National Conference on Computational Linguistics, Huhhot, China, 13–15 August 2021; pp. 1218–1227. Available online: https://aclanthology.org/2021.ccl-1.108/ (accessed on 1 January 2025).
  51. Malik, S.Z.; Iqbal, K.; Sharif, M.; Shah, Y.A.; Khalil, A.; Irfan, M.A.; Rosak-Szyrocka, J. Attention-aware with stacked embedding for sentiment analysis of student feedback through deep learning techniques. PeerJ Comput. Sci. 2024, 10, e2283. [Google Scholar] [CrossRef]
  52. Xu, C.; Zhu, G.; Shu, J. A combination of lie group machine learning and deep learning for remote sensing scene classification using multi-layer heterogeneous feature extraction and fusion. Remote Sens. 2022, 14, 1445. [Google Scholar] [CrossRef]
  53. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366. [Google Scholar] [CrossRef]
  54. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 353–355. [Google Scholar] [CrossRef]
  55. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, China, 12–16 October 2013; pp. 165–172. [Google Scholar] [CrossRef]
  56. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075. [Google Scholar] [CrossRef]
  57. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467. [Google Scholar] [CrossRef]
  58. Llugsi, R.; El Yacoubi, S.; Fontaine, A.; Lupera, P. Comparison between Adam, AdaMax and Adam W optimizers to implement a Weather Forecast based on Neural Networks for the Andean city of Quito. In Proceedings of the 2021 IEEE Fifth Ecuador Technical Chapters Meeting (ETCM), Cuenca, Ecuador, 12–15 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
  59. Gers, F.A.; Schmidhuber, J.; Cummins, F. Learning to forget: Continual prediction with LSTM. Neural Comput. 2000, 12, 2451–2471. [Google Scholar] [CrossRef]
  60. Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef]
  61. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  62. Wang, Y.; Feng, L.; Liu, A.; Wang, W.; Hou, Y. Dual BIGRU-CNN-based sentiment classification method combining global and local attention. J. Supercomput. 2024, 80, 2799–2837. [Google Scholar] [CrossRef]
  63. Zhang, X.; Wu, Z.; Liu, K.; Zhao, Z.; Wang, J.; Wu, C. Text sentiment classification based on BERT embedding and sliced multi-head self-attention Bi-GRU. Sensors 2023, 23, 1481. [Google Scholar] [CrossRef]
  64. Wu, P.; Li, X.; Ling, C.; Ding, S.; Shen, S. Sentiment classification using attention mechanism and bidirectional long short-term memory network. Appl. Soft Comput. 2021, 112, 107792. [Google Scholar] [CrossRef]
  65. Li, W.; Qi, F.; Tang, M.; Yu, Z. Bidirectional LSTM with self-attention mechanism and multi-channel features for sentiment classification. Neurocomputing 2020, 387, 63–77. [Google Scholar] [CrossRef]
  66. Usama, M.; Ahmad, B.; Song, E.; Hossain, M.S.; Alrashoud, M.; Muhammad, G. Attention-based sentiment analysis using convolutional and recurrent neural network. Future Gener. Comput. Syst. 2020, 113, 571–578. [Google Scholar] [CrossRef]
Figure 1. MSFFLA Model. The BERT Encoder comprises four BERT Blocks that generate features at different temporal scales. The model then proceeds through four stages: the first two stages each combine Patch Embedding with the Parallel Multi-scale Feature Extraction module, and the last two stages each combine Patch Embedding with the Global Multi-scale Linear Feature Extraction module.
Figure 2. PMFE Module.
Figure 3. MGLFE Module.
Figure 4. Training and Validation Loss Curve and Accuracy Curve.
Table 1. The summary of statistical results for all datasets.
| Dataset | Positive | Negative | Total | Training, Validation, Testing Ratio | Dataset Description |
|---|---|---|---|---|---|
| SST-2 | 31,120 | 38,922 | 70,042 | 80%, 10%, 10% | Movie reviews with fine-grained, phrase-level sentiment labels for precise analysis. |
| Amazon Review | 35,150 | 34,850 | 70,000 | 80%, 10%, 10% | Large-scale, diverse labeled user reviews with train-test splits for robust model training. |
| MR | 5331 | 5331 | 10,662 | 80%, 10%, 10% | IMDb movie reviews with sentence-level sentiment labels; authentic, full-length reviews with clear binary classification. |
Table 2. Data Preparation and Preprocessing.
| Dataset | Description |
|---|---|
| SST-2 | Label 0 denotes negative sentiment; label 1 denotes positive sentiment. |
| Amazon Review | Remap the sentiment labels from 1 (negative) and 2 (positive) to 0 (negative) and 1 (positive). |
| MR | Label 0 denotes negative sentiment; label 1 denotes positive sentiment. |
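The Amazon Review remapping described in Table 2 can be sketched as a small helper. This is illustrative only; the function name and dataset tags below are hypothetical and not taken from the paper’s released code.

```python
# Hypothetical helper illustrating the Table 2 label normalization; the
# function name and dataset tags are illustrative, not from the paper.
def normalize_label(raw_label: int, dataset: str) -> int:
    """Map every dataset's labels onto 0 = negative, 1 = positive."""
    if dataset == "amazon":
        # Amazon Review ships labels as 1 = negative, 2 = positive.
        return raw_label - 1
    # SST-2 and MR already use 0 = negative, 1 = positive.
    return raw_label
```

Applying this once during preprocessing lets all three datasets share a single binary-classification head.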
Table 3. Parameters of the Experimental Environment.
| Item | Content |
|---|---|
| Processor | Intel Core i7-4700 CPU @ 2.70 GHz × 12 |
| Memory | 32 GB |
| Operating system | CentOS 7.8, 64-bit |
| Hard disk | 1 TB |
| GPU | Titan-X × 2 |
| Python | 3.7.2 |
| PyTorch | 1.4.0 |
| CUDA | 10.0 |
| Learning rate | 1 × 10^-3 |
| Momentum | 0.9 |
| Weight decay | 5 × 10^-4 |
| Batch size | 16 |
| Saturation | 1.5 |
| Subdivisions | 64 |
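The optimization hyperparameters in Table 3 (learning rate 1 × 10^-3, momentum 0.9, weight decay 5 × 10^-4) correspond to a standard SGD-with-momentum update. The scalar sketch below mirrors the update rule of PyTorch’s `torch.optim.SGD`; it is a dependency-free illustration, not the authors’ training loop.

```python
# Scalar sketch of one SGD-with-momentum update using the Table 3
# hyperparameters; mirrors PyTorch's torch.optim.SGD rule but is
# illustrative, not the authors' training code.
LR, MOMENTUM, WEIGHT_DECAY = 1e-3, 0.9, 5e-4

def sgd_step(param: float, grad: float, velocity: float):
    """v <- momentum * v + (grad + wd * param); p <- p - lr * v."""
    grad = grad + WEIGHT_DECAY * param  # L2 weight decay folded into the gradient
    velocity = MOMENTUM * velocity + grad
    return param - LR * velocity, velocity
```

In practice each model parameter carries its own velocity buffer; the single-scalar form above just makes the arithmetic visible.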
Table 4. Results of various network models on the SST-2 and Amazon Review datasets (Accuracy and F1-Score).
| Model | SST-2 Accuracy | SST-2 F1-Score | Amazon Review Accuracy | Amazon Review F1-Score | Parameters (M) |
|---|---|---|---|---|---|
| BERT [16] | 0.949 | 0.949 | 0.934 | 0.934 | 110.23 |
| BERT-CNN [16] | 0.951 | 0.951 | 0.933 | 0.933 | 113.54 |
| BERT-LSTM [59] | 0.944 | 0.944 | 0.931 | 0.931 | 115.56 |
| BERT-BiLSTM [60] | 0.948 | 0.948 | 0.933 | 0.933 | 117.83 |
| BERT-GRU [61] | 0.950 | 0.950 | 0.927 | 0.927 | 116.27 |
| BERT-CondConv [16] | 0.961 | 0.961 | 0.942 | 0.942 | 111.34 |
| Bi-LSTM [39] | 0.823 | 0.816 | 0.809 | 0.786 | 49.76 |
| LSTM [39] | 0.735 | 0.711 | 0.715 | 0.709 | 30.65 |
| CNN with BERT Embeddings [39] | 0.763 | 0.751 | 0.742 | 0.729 | 2110.26 |
| Our model | 0.979 | 0.965 | 0.957 | 0.945 | 23.73 |
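The Accuracy and F1-Score values reported in Tables 4 and 5 follow the standard binary-classification definitions (positive class = 1). The sketch below is a dependency-free illustration of those definitions, not the authors’ evaluation script.

```python
# Standard binary Accuracy and F1 (positive class = 1); an illustrative
# sketch of the metrics in Tables 4-9, not the paper's evaluation code.
def accuracy_and_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1
```

For balanced binary datasets such as these, accuracy and F1 tend to track each other closely, which is consistent with the paired values in the tables.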
Table 5. Results of various network models on the MR.
| Model | MR Accuracy | MR F1-Score | Parameters (M) |
|---|---|---|---|
| BCAT [62] | 0.802 | 0.798 | - |
| BERT-MS-BiGRU [63] | 0.828 | 0.829 | 110.43 |
| SC-ABiLSTM [64] | 0.771 | 0.768 | 111.55 |
| SAMF-BiLSTM [65] | 0.833 | 0.832 | 112.64 |
| ATT-Pooling [66] | 0.836 | 0.836 | 111.26 |
| BERT-CondConv [16] | 0.834 | 0.833 | 110.35 |
| Our model | 0.852 | 0.839 | 23.73 |
Table 6. Influence of PMFE.
| Modules | Accuracy | F1-Score |
|---|---|---|
| W/O PMFE | 0.933 | 0.919 |
| Ours | 0.957 | 0.945 |
Table 7. Influence of MGLFE.
| Modules | Accuracy | F1-Score |
|---|---|---|
| W/O MGLFE | 0.940 | 0.921 |
| Ours | 0.957 | 0.945 |
Table 8. Influence of MSLA.
| Modules | Accuracy | F1-Score |
|---|---|---|
| W/O MSLA | 0.926 | 0.917 |
| Ours | 0.957 | 0.945 |
Table 9. Influence of BERT.
| Modules | Accuracy | F1-Score |
|---|---|---|
| W/O BERT | 0.927 | 0.913 |
| Ours | 0.957 | 0.945 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Jiang, Z.; Xu, C. A Multi-Scale Feature Fusion Linear Attention Model for Movie Review Sentiment Analysis. Big Data Cogn. Comput. 2025, 9, 325. https://doi.org/10.3390/bdcc9120325
