1. Introduction
Sentiment analysis (SA), also known as opinion mining, is a dynamic field within natural language processing (NLP). It employs computational methods to analyze and evaluate subjective opinions, sentiments, emotions, and attitudes expressed in textual data. SA seeks to categorize user-generated reviews or comments into positive and negative classes, offering useful insights from online opinions and customer feedback. This improves the user experience and supports decision-making in areas such as marketing, customer support, and public opinion tracking.
Two main approaches dominate SA: lexicon-based and machine learning-based [1]. Lexicon-based methods rely on pre-defined lists of sentiment words; the sentiment of a text is determined by counting these words and their assigned values. They are fast and need no training data but can be rigid and inflexible. Machine learning methods are trained on labeled data to learn the complex relationships between words and sentiments. They are more versatile but require extensive data and computation. Either way, the goal is the same: accurately determining the sentiment orientation of a given text.
Deep neural networks (DNNs) have shown remarkable performance in textual SA, yet recent research highlights shortcomings in DNN-based methods. Despite their effectiveness, these models can extract irrelevant or redundant features while missing essential sentiment cues in words, leading to reduced classification accuracy [2]. DNNs lead in SA performance, but traditional methods remain strong in interpretability and efficiency. Can we build a hybrid model that leverages both strengths? Our research explores a hybrid model that synergizes DNNs and classic feature-based methods for enhanced SA performance. To address the aforementioned issues, we propose an SA model specifically designed for semantic and sentiment awareness. The goal of this model is to improve the accuracy of sentiment analysis and solve its current problems. We introduce a fusion approach that substantially improves text SA performance and reduces the high-dimensional feature space by combining conventional feature creation with DNN-based techniques.
The review text is first parsed, including through tokenization and the methodical application of linguistic rules. The parsing process identifies sentences that contain contradictory positive and negative clauses or phrases within the same structure as expressing mixed views. Linguistic rules (LR) aim to choose the most representative sentences/clauses from mixed opinions for the training set, with the primary goal of preserving (or enhancing) the effectiveness of the classification model [1,3]. We then apply part-of-speech (POS) tags to specific words, including adjectives, adverbs, verbs, and nouns, using the Stanford POS tagger [4]. This tagging process lays the foundation for our incorporation of the wide coverage integrated sentiment lexicon (WCISL) [1].
We use the WCISL as a valuable asset to acquire semantic and sentiment information. This helps in extracting sentiment-related words and features from textual data, including verbs, nouns, adjectives, and adverbs. Next, we employ the RoBERTa transformer model as an encoder to tokenize the input sequence and encode it into a discriminative, sentiment-enhanced word embedding. To further enrich word embeddings with contextual understanding and highlight the most significant features, we employ a bidirectional GRU model augmented by an attention mechanism.
The dimensionality of the feature space is then reduced by applying a statistical feature reduction approach, namely Principal Component Analysis (PCA). PCA is preferred for its speed, ease of use, and ability to preserve most of the information in the original data. This reduction step is essential for selecting features optimally and increasing the overall effectiveness of our method. After feature reduction, the sigmoid function is applied to produce the class probabilities for the sentiment analysis dataset. In particular, our integrated model fulfills three primary purposes: first, it reveals the underlying connections in the data; second, it effectively extracts sentiment words and assigns the important sentiment words appropriate weights; third, it lowers the dimensionality of the feature space and eliminates superfluous features.
The following is a summary of our primary contributions:
We propose DK-HDNN, a novel hybrid deep neural network-based sentiment analysis model that combines semantic and sentiment awareness. The proposed model effectively finds and extracts significant contextual sentiment features from the review text by combining traditional DNNs with contextual semantics, linguistic knowledge, and sentiment information. In particular, the proposed hybrid sentiment analysis model is significantly (statistically) improved by the incorporation of linguistic semantics and sentiment knowledge with standard deep neural networks (RoBERTa, BiGRU, attention mechanism), as well as by PCA.
We use linguistic semantic rules to classify review texts with mixed opinions composed of positive and negative words.
We employ the WCISL and the Stanford POS tagger to tap into semantic and sentiment knowledge, which facilitates the identification and extraction of sentiment features in the review text for SA.
We utilize the RoBERTa transformer model as the encoder to tokenize the input sequence and encode it into a distinct sentiment-enhanced word embedding.
We weigh the important aspects and investigate the deeper internal relationships in the data using BiGRU with the attention mechanism.
We employ PCA to efficiently reduce the dimensionality of features in our data while preserving the most important information.
In comparison to a number of earlier baseline techniques in sentiment analysis, our suggested model shows a notable boost in performance across four real-world benchmark datasets.
The remainder of this paper is structured as follows:
Section 2 describes related work,
Section 3 presents the detailed architectures and proposed methodology of the proposed system,
Section 4 shows experimental results,
Section 5 concludes the paper, and
Section 6 discusses challenges and future directions.
2. Related Work
This section reviews the state-of-the-art methods for text sentiment analysis, which are feature extraction and selection, pre-trained large language models, and deep neural network models.
2.1. Feature Extraction and Selection
Numerous studies have explored diverse feature representation schemes to enhance sentiment classification performance in textual sentiment analysis (SA). For instance, Mohammed et al. [5] developed a sentiment analysis framework that utilizes Count Vectorizer for feature extraction, combining five machine learning models and three deep learning models, including LSTM, MLP, and CNN. When tested on Facebook and Twitter datasets, the LSTM model achieved the highest accuracy of 0.99, particularly excelling with Facebook data. The proposed method outperformed existing models, improving accuracy by up to 20.9% and showing significant gains in precision, recall, and F1-score. Their study demonstrates that integrating diverse algorithms can significantly enhance sentiment detection on social media platforms.
Chang et al. [6] introduced two innovative feature selection techniques, Modified Categorical Proportional Difference (MCPD) and Balanced Category Feature (BCF), designed to ensure equal representation of attributes from text reviews. Their experiments demonstrated that integrating MCPD and BCF not only significantly reduces the dimensionality of the feature space but also enhances the accuracy of sentiment classification.
Khan et al. [7] introduced a novel ensemble approach for text SA named EnSWF, which leverages POS tagging and n-gram patterns. This method incorporates semantic context, sentiment cues, and word order to enhance performance. Relevant features for SA were effectively extracted and selected using POS patterns and POS-based n-gram structures, combined through ensemble learning techniques to improve overall accuracy.
Noura et al. [8] highlighted the importance of feature extraction in SA by evaluating several methods (BoW, Word2Vec, N-gram, TF-IDF, HV, and GloVe) on the Twitter US Airlines and Amazon Musical Instruments datasets. Using a Random Forest classifier, their results showed that TF-IDF delivered the best performance, achieving 99% accuracy on Amazon reviews and 96% on Twitter data. This study emphasizes how selecting the right feature extraction technique can significantly enhance SA outcomes.
Sen et al. [9] proposed a novel method to improve text classification by combining traditional n-gram features with graph-based deep learning. They transformed text into graphs using discriminative n-gram sequences to capture long-range word dependencies and trained a graph convolutional network (GCN) to produce enriched word embeddings. When these embeddings were integrated into an LSTM model, they achieved a 2% performance gain across various datasets, demonstrating the effectiveness of merging n-gram features with deep learning techniques.
The extraction and selection of relevant attributes conveying sentiment information are pivotal in SA. In this procedure, irrelevant and noisy attributes are excised from the feature space, resulting in a leaner, more accurate representation of sentiment-carrying elements. This refined set of features contributes to improved classification accuracy in SA and related tasks. In the realm of text categorization and SA, diverse feature selection techniques have been explored, with Document Frequency (DF), Chi-square (CHI), Mutual Information (MI), and Information Gain (IG) emerging as prominent approaches for refining feature sets and optimizing classification performance [7].
Principal Component Analysis (PCA) [10] is a statistical technique that streamlines complex datasets by identifying a compact set of principal components that capture the majority of the original data's variance, enabling effective dimensionality reduction without sacrificing significant information.
Despite the application of traditional feature selection techniques for dimensionality reduction, classifiers continue to face significant challenges related to data sparsity. This issue arises because traditional feature selection methods, such as filter-based, wrapper-based, or embedded approaches, primarily focus on reducing the number of features by selecting the most relevant ones. However, they do not necessarily enhance the quality of data representation.
To address these challenges, advanced representation learning techniques like deep learning-based feature extraction, word embeddings (e.g., Word2Vec, GloVe, transformer models), and autoencoders can be used.
2.2. Pre-Trained Large Language Models
Pre-trained large language models (PLLMs) are revolutionizing the field of SA, bringing an unprecedented level of sophistication, accuracy, and contextual understanding to this domain. These models, such as BERT, GPT, XLNet, T5, and RoBERTa [11,12,13,14], have been trained on vast amounts of textual data, enabling them to understand language in a highly nuanced manner. Unlike traditional machine learning models that rely on handcrafted features and shallow text representations, PLLMs can capture complex linguistic structures, contextual dependencies, and sentiment nuances with remarkable efficiency.
The transformer-based architectures of these models allow them to process text bidirectionally (e.g., BERT, RoBERTa, and XLNet) or autoregressively (e.g., GPT). This bidirectional processing is particularly beneficial for sentiment analysis, as it enables models to consider both past and future words in a sentence when determining sentiment. For instance, the meaning of a word like “great” in “This movie is great!” versus “This movie is not great.” depends heavily on context, which these models effectively capture.
Advantages of LLMs in Sentiment Analysis:
Contextual Understanding—Unlike traditional word embedding techniques (e.g., Word2Vec, GloVe), PLLMs generate dynamic word representations that change based on the surrounding words, allowing for context-sensitive sentiment classification.
Handling Ambiguity and Sarcasm—Sentiment in text is often complex, especially in informal settings like social media. LLMs improve the detection of subtle linguistic cues such as sarcasm, negations, and implicit sentiments.
Transfer Learning Capabilities—Since these models are pre-trained on massive datasets, they require less labeled training data and can be fine-tuned for domain-specific sentiment tasks with improved generalization.
Multilingual Capabilities—Many LLMs, including XLM-R and mBERT, are multilingual, enabling SA across different languages without the need for separate models.
Why RoBERTa for Sentiment Analysis?
In this research, we employ RoBERTa for dynamic word embedding and vectorization, leveraging its optimized training strategies and enhanced performance over BERT. RoBERTa, short for Robustly Optimized BERT Pretraining Approach, retains the original transformer-based bidirectional architecture of BERT but incorporates key modifications that improve its efficiency and accuracy in NLP tasks.
Improved Training Strategies—RoBERTa removes the Next Sentence Prediction (NSP) objective from BERT, focusing entirely on masked language modeling (MLM), which enhances its contextual learning capabilities.
Larger Training Data and Batch Sizes—RoBERTa is trained on much larger datasets with increased batch sizes, enabling it to capture richer linguistic patterns.
Dynamic Masking—Unlike BERT’s static masking approach, RoBERTa uses dynamic masking, meaning the masked words in training change across epochs, leading to better generalization.
Faster and More Efficient Fine-Tuning—With no NSP and an optimized training procedure, RoBERTa achieves comparable or superior performance to BERT while requiring less training time and hyperparameter tuning.
Due to these advantages, RoBERTa excels in sentiment analysis tasks, providing better contextual embeddings and improved sentiment classification accuracy. By leveraging RoBERTa in our study, we aim to capture fine-grained sentiment variations and enhance sentiment classification performance in complex textual datasets.
2.3. DNN Paradigms
The development of architectures such as convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM), and bidirectional gated recurrent unit (BiGRU) networks, alongside word embeddings and attention mechanisms, has had a major impact on SA and related areas, sparking significant research interest [4,15,16,17,18,19,20,21].
Kim [22] developed a CNN model with multiple filters and max-pooling to extract key features, which were then classified using a fully connected layer. Rezaeinia et al. [23] enhanced word embedding in a CNN-based SA model by incorporating lexical, positional, and syntactical features, followed by sequential CNN modules for feature selection. Yang et al. [24] introduced a dual-channel DNN using pre-trained Word2Vec for text feature extraction and intent classification, achieving strong results in multi-intent tasks. However, CNNs primarily capture local features and neglect sequence information in text analysis.
LSTM and its variants are effective for sequential modeling tasks and capturing long-range dependencies between words in a sentence [25,26,27,28,29]. Wang et al. [30] utilized the Word2Vec model for word embedding and introduced an LSTM-based sentiment classification method for short social media texts. Yang et al. [31] proposed an attention-enhanced bidirectional LSTM to improve target-dependent sentiment classification.
The synergistic integration of deep learning architectures, word embeddings, and attention mechanisms demonstrates promising predictive potential in text classification and SA [4,19,32,33].
Li et al. [34] employed the CBOW model for word embedding in an LSTM-CNN hybrid framework for Chinese news text classification, achieving promising results. Similarly, Zhang et al. [35] developed an LSTM-CNN hybrid model for sentiment classification of movie reviews.
Additionally, several studies have explored hybrid models that integrate deep neural networks (DNNs) with attention mechanisms to enhance SA [32,36,37]. Liu et al. [36] introduced AC-BiLSTM, a hybrid model combining bidirectional LSTM and CNN with an attention mechanism for SA and question answering. Their approach leveraged BiLSTM for capturing contextual dependencies while using attention to focus on the most relevant text segments.
Basiri et al. [37] proposed ABCDM, an SA model integrating BiLSTM, BiGRU, and CNN with an attention mechanism. The model was evaluated on five English comment datasets and three Twitter datasets, achieving state-of-the-art performance across multiple benchmarks.
However, recent studies suggest that DNN-based methods tend to select irrelevant and redundant features [38], overlooking the sentiment cues associated with each sentiment word. This can impact their performance in terms of classification accuracy.
Although conventional feature extraction methods offer interpretability and efficiency, recent studies show that Deep Neural Networks (DNNs) consistently outperform them in SA tasks. This paper proposes DK-HDNN, a novel model that leverages both approaches. DK-HDNN combines conventional feature extraction with deep learning techniques that incorporate linguistic semantics and sentiment information.
DK-HDNN stands out from existing models in several ways. First, it integrates linguistic semantics and sentiment information from sources such as sentiment lexicons and linguistic rules. This allows the extraction of context-rich sentiment-bearing words and accurate classification of opinions. Second, DK-HDNN utilizes RoBERTa to generate sentiment-enhanced word embeddings. These embeddings are then processed by a BiGRU with an attention mechanism, further capturing nuanced sentiment features. Finally, DK-HDNN employs PCA for dimensionality reduction, ensuring an efficient model.
3. Proposed Technical Approach
Conventional SA methods often struggle with limitations such as sparse data and the intricate nature of natural language. DK-HDNN bridges this gap by combining feature engineering's interpretability with DNNs' power to capture language nuances. More specifically, we introduce DK-HDNN, a novel hybrid approach that integrates DNNs and feature engineering for accurate text SA. To create the initial sentiment feature space and clean up the review text, the approach starts with text preprocessing. Subsequently, the RoBERTa pre-trained language model is utilized to extract dynamic semantic and sentiment feature representations from the input review text's sequence. The extracted word features are then fed into the BiGRU network, enhancing the model's ability to understand intricate details within the text. The attention mechanism is incorporated to highlight key features, and PCA is applied to reduce dimensionality and overcome feature redundancy. Finally, the reduced feature vector is fed into the sigmoid layer for classification, completing the SA of the review text. The detailed framework is shown in Figure 1.
3.1. Preprocessing the Text and Extracting Features
We systematically generate the initial feature space by preprocessing the text. It involves loading the review dataset, employing a sentence parser to segment text into manageable units, and applying a tokenizer to break down sentences into individual tokens, preparing the data for feature extraction.
To enhance the signal-to-noise ratio in the dataset, we employ a noise reduction and text modification module to eliminate unnecessary elements such as stop words, URLs, and numerical symbols. Additionally, we incorporate spell correction using the spellchecker library, handle contractions with regular expressions, and improve irrelevant content filtering by removing special characters and redundant text.
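As a minimal illustration of this cleaning step, the Python sketch below strips URLs, digits, and stop words, expands contractions, and applies spell correction. The specific libraries (NLTK stop words, the pyspellchecker package) and the small contraction list are illustrative assumptions rather than the exact implementation used here.

import re
from nltk.corpus import stopwords          # assumes the NLTK stopwords corpus is downloaded
from spellchecker import SpellChecker      # pyspellchecker package

# keep contrast/negation cues needed later by the linguistic rules
STOP_WORDS = set(stopwords.words("english")) - {"but", "not", "no"}
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "n't": " not", "'re": " are"}
spell = SpellChecker()

def clean_review(text):
    """Remove URLs, numbers, special characters, and stop words; expand contractions; fix typos."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # strip URLs
    for short, full in CONTRACTIONS.items():                  # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)                     # drop digits and special symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [spell.correction(t) or t for t in tokens]         # simple spell correction

print(clean_review("The movi was GREAT!!! see http://example.com, 10/10 but I can't rewatch it"))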
After that, the POS tagger is used to assign POS tags to words, paying special attention to recognizing verbs, nouns, adjectives, and adverbs. The WCISL is then consulted for the sentiment orientation of these words in order to identify and extract sentiment features.
Linguistic Semantic Rules
To capture the subtle nuances of language and track shifts in sentiment, we rely on linguistic semantic rules. Inspired by [39,40,41], these rules help us navigate the complexities of context and differentiate between synonyms and antonyms. Linguistic semantic rules examine the patterns and connections between clauses to reveal the sentiment a sentence actually expresses. Consider this example: "The movie maker is well-known but the movie is uninteresting". Without linguistic guidance, we might misinterpret the overall sentiment; the rules focus attention on the pivotal clause following "but", leading us to the heart of the matter: the movie's disappointing quality. Specific words such as "but", "despite", "while", and "unless" act as signposts, signaling potential shifts in sentiment and often transforming a sentence's polarity. Expanding on the groundwork established by earlier studies [3,39,40], we choose a carefully curated set of linguistic rules, as shown in Table 1, to ensure comprehensive and context-sensitive SA.
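As an illustration of how such rules can be operationalized, the sketch below implements a simplified contrast rule for "but"-type connectives: when a mixed-opinion sentence contains a contrast marker, the clause following the marker is kept as the representative clause. The marker list and the clause-selection policy are simplified assumptions, not the full rule set of Table 1.

CONTRAST_MARKERS = ("but", "however", "although", "despite", "while", "unless")

def select_dominant_clause(sentence):
    """For mixed-opinion sentences, keep the clause after a contrast marker such as 'but',
    since it usually carries the polarity that dominates the overall sentiment."""
    lowered = sentence.lower()
    for marker in CONTRAST_MARKERS:
        token = " " + marker + " "
        if token in lowered:
            # keep the text following the last occurrence of the contrast marker
            return sentence[lowered.rfind(token) + len(token):].strip()
    return sentence  # no contrast marker: use the whole sentence

print(select_dominant_clause("The movie maker is well-known but the movie is uninteresting"))
# -> "the movie is uninteresting"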
3.2. Enhanced Sentiment and Semantic Word Embedding
We utilize the extensive expertise of wide coverage integrated sentiment lexicons (WCISLs), which include sentiment scores and related words in conjunction with refined linguistic rules. This potent combination enables the extraction and curation of highly relevant sentiment features. These serve as valuable inputs for both word embedding, which enriches understanding of individual words in context, and sentiment classification, which determines the overall emotional stance of the text. This phase entails the execution of the subsequent steps, as detailed below.
3.3. Sentiment Lexicons with Integrated Wide Coverage
A sentiment lexicon is a collection of words used to express either positive or negative sentiments. Such lexicons play a crucial role in recognizing sentiment-bearing features and phrases. In the literature, numerous sentiment lexicons have been created at different scales, including AFINN, OL, SO-CAL, WordNet-Affect, GI, SentiSense, the MPQA Subjectivity Lexicon, the NRC Hashtag Sentiment Lexicon, SenticNet5, and SentiWordNet, among others [42].
We standardize various sentiment lexicons by assigning scores of +1 (positive), −1 (negative), and 0 (neutral) to words. This allows us to seamlessly integrate information from various sources, culminating in the creation of an extensive merged lexicon, denoted as WCISL. This lexicon encompasses a significantly larger repertoire of sentiment-bearing words compared to individual sources.
In this research, we leverage the matching process between the review sentence words and those present in the WCISL to conduct accurate SA. Formally, let $L = \{w_1, w_2, \ldots, w_m\}$ denote the wide coverage integrated sentiment lexicon ($m$ words) and $S = \{x_1, x_2, \ldots, x_n\}$ an input sentence ($n$ features). $p(w_i)$ represents the sentiment polarity of word $w_i$ in $L$, and $\delta(w_i, S) = 1$ indicates the presence of $w_i$ in $S$.
Table 2 offers comprehensive details on the dimensions and composition of these sentiment lexicons, highlighting their individual word coverage and their contribution to the richness of the consolidated WCISL.
The main difficulty with sentiment lexicons is their limited vocabulary coverage. If a sentiment word is not present in the existing sentiment lexicons, it is typically omitted, which can impact the sentiment of the review text. To address this issue, we utilize semantic knowledge from WordNet to identify synonyms of out-of-vocabulary sentiment words and determine their sentiment orientation. Specifically, for a given word W, we retrieve its synonym set from WordNet. Each synonym is then checked in the WCISL. If a synonym exists in WCISL, its sentiment score is used.
The sentiment score of $W$ is computed as follows:

$$score(W) = \sum_{s \in Syn(W)} p(s),$$

where $Syn(W)$ is the WordNet synonym set of $W$ and $p(s)$ represents the sentiment polarity of synonym $s$ in the WCISL. If $score(W) > 0$, the sentiment is classified as positive; otherwise, it is negative. Furthermore, the WCISL is utilized for sentiment-enhanced word embedding and classification. Given an input review sentence $S = \{x_1, x_2, \ldots, x_n\}$, our method applies the WCISL to extract sentiment words from $S$ and form a sentiment-enhanced word vector $V_S = \{v_1, v_2, \ldots, v_k\}$, which is then used for improved SA.
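A minimal sketch of this lookup is shown below, assuming the merged WCISL is available as a word-to-polarity dictionary and using NLTK's WordNet interface for the synonym fallback; the tiny lexicon slice here is purely illustrative.

from nltk.corpus import wordnet as wn   # assumes the NLTK WordNet corpus is downloaded

# hypothetical slice of the merged lexicon: word -> polarity in {+1, -1, 0}
WCISL = {"uninteresting": -1, "boring": -1, "excellent": +1, "good": +1}

def sentiment_score(word):
    """Return the WCISL polarity of a word, falling back to WordNet synonyms
    when the word itself is out of vocabulary."""
    if word in WCISL:
        return WCISL[word]
    total = 0
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            total += WCISL.get(lemma.lower(), 0)   # sum polarities of known synonyms
    return +1 if total > 0 else (-1 if total < 0 else 0)

def sentiment_words(tokens):
    """Extract the sentiment-bearing words of a tokenized sentence with their polarities."""
    return [(t, sentiment_score(t)) for t in tokens if sentiment_score(t) != 0]

print(sentiment_words(["the", "plot", "was", "dull", "but", "the", "acting", "was", "excellent"]))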
3.4. Word Vector Representation Based on RoBERTa
The proposed model leverages RoBERTa, a pre-trained transformer-based large language model (PLLM), to generate numerical feature vectors that capture the semantic and syntactic nuances of input text. RoBERTa, a refined descendant of BERT, stands on the same transformer foundation while offering enhanced language understanding capabilities. RoBERTa leverages byte-level BPE (Byte Pair Encoding) for tokenization, dynamic masking, and training on massive datasets (BookCorpus+Wikipedia, CC-News, OpenWebText, Stories) to achieve superior accuracy and efficiency compared to BERT's WordPiece tokenization and static masking. The RoBERTa base model consists of 12 layers, a hidden state size of 768, and 125 million parameters. These base layers are designed to generate a meaningful word embedding as the feature representation, facilitating the subsequent layers in capturing valuable information from this embedding.
To effectively process an input sequence/sentence, we use special boundary tokens: "<s>" to indicate the beginning of a sentence and "</s>" to indicate the end of a sentence. These tokens help the model recognize sentence boundaries and structure, enabling it to interpret the input as a coherent sentence rather than a disordered collection of words. This is crucial for preserving the contextual integrity of the sentence and for allowing the model to accurately learn and represent the relationships between words.
For each training instance $S$, we first locate all sentiment words within $S$ using the WCISL dictionary. Next, we tokenize $S$ into a sequence of subwords and retrieve the corresponding embeddings for each subword. This sequence of embeddings is then passed into the PLLM. From the final layer of the PLLM, we take the representation of the first token as the context representation for $S$.
At the same time, we also extract the embeddings of the sentiment words present in $S$. As illustrated in Figure 1, the pre-trained RoBERTa model generates embeddings based on individual words ($W$), positional information ($P$) indicating each word's position within the sentence, and token-type IDs ($T$), which reflect the segment type within the training input.
Word Embedding (W): This is the representation of individual words in the input sequence. Each word in the sequence ($x_1, x_2, \ldots, x_n$) is mapped to a high-dimensional vector ($w_1, w_2, \ldots, w_n$) based on pre-trained or learned word vectors. These word vectors capture semantic meaning and the inherent relationships between words.
Position Embedding (P): Position embeddings encode the relative or absolute position of a word in the sequence. Since transformers, like RoBERTa, do not inherently process sequential data (unlike recurrent neural networks), position embeddings are added to the word embeddings to inform the model about the order of words in the sentence.
Token-Type Embedding (T): Token-type embeddings are helpful in distinguishing between the sentence that provides the main content (such as opinions or sentiments) and other structural tokens like punctuation or special markers. This can guide the model in focusing on the most relevant words for sentiment classification.
Combined Input Representation: These embeddings (W, P, T) are combined through superimposition, forming the input for the next network layer:

$$E = W + P + T$$

RoBERTa then transforms the input $E$ into a low-dimensional vector $x_t$, capturing contextual information at time $t$.
Key Process Flow:
Raw Text to Tokenization: The sentence is tokenized into individual words, adding special tokens for the sentence boundaries.
Embedding Combination: Word embeddings (W), position embeddings (P), and token-type embeddings (T) are added together for each token.
Transformer Layers: The combined input representation is passed through 12 transformer layers, where each word’s contextual meaning is refined and updated based on its interaction with other words in the sequence.
This process ultimately generates rich vector representations that encode both semantic (word meanings) and sentiment (contextual sentiment based on word interactions) information, enabling effective sentiment classification.
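A minimal sketch of this encoding step with the Hugging Face transformers library is given below; the checkpoint name and the choice of pooling over the first token are assumptions made for illustration rather than a prescription of the exact setup.

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

sentence = "The movie maker is well-known but the movie is uninteresting"
# the tokenizer adds the <s> ... </s> boundary tokens automatically
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = encoder(**inputs)

token_embeddings = outputs.last_hidden_state   # (1, seq_len, 768): one vector per subword
sentence_vector = token_embeddings[:, 0, :]    # representation of the first (<s>) token
print(token_embeddings.shape, sentence_vector.shape)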
To enhance the RoBERTa model’s output, a BiGRU is utilized as a feature extractor. This incorporation allows the model to benefit from both the contextual information captured by RoBERTa and the long-range dependencies between tokens, leading to more precise predictions.
3.5. BiGRU Deep Neural Network
Bidirectional GRUs (BiGRUs) are specialized neural networks that build on traditional GRUs. BiGRUs often achieve performance comparable to BiLSTMs while using fewer parameters, leading to faster training, reduced memory requirements, and lower computational costs. Unlike simpler RNNs, BiGRUs can effectively learn from both past and future information within lengthy sequences while mitigating the vanishing gradient problem. This allows BiGRUs to capture subtle contextual nuances and ultimately achieve deeper linguistic understanding.
In a standard GRU, input sequences are processed sequentially. At each time step, the GRU unit takes the current input and the previous hidden state to generate a new hidden state and output. The GRU incorporates a reset gate $r_t$ and an update gate $z_t$ that regulate information flow. The update gate ensures the network holds onto valuable insights, while the reset gate prevents it from becoming bogged down by outdated information, fostering a dynamic and adaptive understanding. Furthermore, $\tilde{h}_t$ and $h_t$ carry the candidate and updated information, respectively. The calculation of $z_t$, $r_t$, $\tilde{h}_t$, and $h_t$ at time step $t$ is summarized as follows:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where the symbol $\odot$ represents element-wise multiplication, $\sigma$ is the sigmoid function, $W_z$, $W_r$, $W_h$ and $U_z$, $U_r$, $U_h$ are weights to be learned, and $b_z$, $b_r$, $b_h$ denote the biases of the update gate, the reset gate, and the memory cell, respectively.
In BiGRU, features are extracted using both a forward GRU and a backward GRU. To comprehensively capture information from both directions, we concatenate the two extracted features:

$$h_t = \overrightarrow{h_t} \oplus \overleftarrow{h_t}$$

where $\overrightarrow{h_t}$ represents the forward sequence, $\overleftarrow{h_t}$ represents the backward sequence, and $\oplus$ denotes the concatenation operation. BiGRU processes the text matrix, producing an output $H$ that encapsulates deep semantic features:

$$H = [h_1, h_2, \ldots, h_n]$$
In our model, the BiGRU operates on RoBERTa's encoded text, uncovering long-range relationships that might otherwise remain hidden. This long-range comprehension helps the model grasp the text's overall coherence, including connections between distant words and concepts.
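A minimal Keras sketch of such a BiGRU layer stacked on the RoBERTa token embeddings is shown below; the 128-unit setting follows the configuration reported in Section 4, while the sequence length is an illustrative assumption.

import tensorflow as tf
from tensorflow.keras import layers

seq_len, emb_dim = 128, 768          # RoBERTa-base token embeddings
inputs = layers.Input(shape=(seq_len, emb_dim))

# forward and backward GRUs with 128 hidden units each; their outputs are concatenated,
# so every time step yields a 256-dimensional context-aware feature vector
bigru_out = layers.Bidirectional(
    layers.GRU(128, return_sequences=True), merge_mode="concat"
)(inputs)

model = tf.keras.Model(inputs, bigru_out)
model.summary()   # output shape: (None, 128, 256)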
3.6. Enhancing BiGRU with Attention
Although BiGRU networks excel at capturing contextual relationships within text, they sometimes struggle to pinpoint the words most crucial for accurate classification. Additionally, their hidden layers might inadvertently discard valuable information from earlier stages. To overcome these challenges, an attention mechanism can be integrated to selectively focus on keywords and preserve important information throughout the network, enhancing overall performance.
Derivation of the Attention Weight Formula:
Contextualizing the Hidden States: The attention mechanism is incorporated into the hidden layer of the BiGRU. It assigns each word a weight, $\alpha_t$, using a softmax function; words with stronger sentiment signals receive higher attention weights, highlighting their importance in the analysis. Contextually rich features, which are essential for enhancing the text's contextual meaning for sentiment prediction, are expected to receive the majority of this attention weight. Attention weights are computed over the hidden state outputs $h_t$ of the BiGRU, as shown in Figure 1, as follows.
Calculating Attention Scores: Each hidden state $h_t$ is assigned an attention score $u_t$:

$$u_t = \tanh(W_h h_t + b_h)$$

Here $h_t$ is formed by concatenating the representations from the forward and backward GRU, $W_h$ is a learned weight matrix, and $b_h$ is a bias term. The hyperbolic tangent introduces non-linearity and keeps the attention scores within a suitable range for further calculation.
Applying the Softmax Function: The attention scores $u_t$ are passed through a softmax function to normalize them and produce attention weights $\alpha_t$:

$$\alpha_t = \frac{\exp(u_t)}{\sum_{k} \exp(u_k)}$$

This ensures that the attention weights sum to 1, making them interpretable as probabilities. Words that are more important for sentiment analysis receive higher attention weights.
Computing the Context Vector: Each word contributes in proportion to its attention weight $\alpha_t$, which indicates its importance in the text:

$$D = \sum_{t} \alpha_t h_t$$

The whole input text vector, including sentiment information collected word by word, is represented by the variable $D$. This context vector $D$ captures the most relevant features of the text, weighted by the attention mechanism.
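The attention step above can be realized as a small custom Keras layer. The sketch below follows the score / softmax / weighted-sum formulation, with one common addition: a learned projection vector v that reduces each tanh-transformed hidden state to a scalar score before the softmax. Layer and variable names are chosen for illustration only.

import tensorflow as tf
from tensorflow.keras import layers

class WordAttention(layers.Layer):
    """Additive attention over BiGRU hidden states:
    u_t = tanh(W h_t + b), alpha_t = softmax(u_t . v), D = sum_t alpha_t * h_t."""

    def build(self, input_shape):
        hidden = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(hidden, hidden), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(hidden,), initializer="zeros")
        self.v = self.add_weight(name="v", shape=(hidden, 1), initializer="glorot_uniform")

    def call(self, h):                                            # h: (batch, steps, hidden)
        u = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)     # attention scores per step
        scores = tf.tensordot(u, self.v, axes=1)                  # (batch, steps, 1)
        alpha = tf.nn.softmax(scores, axis=1)                     # normalized attention weights
        return tf.reduce_sum(alpha * h, axis=1)                   # context vector D

# usage on the BiGRU output from the previous sketch:
# context_vector = WordAttention()(bigru_out)    # shape (batch, 256)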
Before feeding the key feature vector into the sigmoid classifier for final decision-making, we first apply PCA, a dimensionality reduction technique. This process effectively condenses the feature vector into its most informative components, removing irrelevant and redundant features and enhancing the classifier’s accuracy.
3.7. Optimizing Feature Representation with Reduced Dimensionality
Traditional attention mechanisms attempt to determine importance by averaging weights across all input information. However, this can lead to unnecessary attention being given to less relevant words, potentially hindering performance. To address this, this paper employs PCA [43], a technique that strategically reduces dimensionality, ensuring focus on the most meaningful aspects of the data for SA. PCA is a widely used statistical method for dimensionality reduction. It employs an orthogonal transformation to convert potentially correlated input attributes into a set of uncorrelated variables called principal components. The first principal component captures the largest amount of variance in the data, and subsequent components capture progressively smaller amounts of the remaining variance. PCA reduces the number of dimensions while retaining most of the original data's information, as the number of principal components is always less than or equal to the original number of attributes.
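A minimal sketch of this reduction with scikit-learn is given below; the 300-component target matches the dimensionality reported in Section 4, and the randomly generated matrix is a stand-in for the attention-weighted document vectors.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 768))     # stand-in for attention-weighted document vectors

pca = PCA(n_components=300)                    # keep the 300 leading principal components
reduced = pca.fit_transform(doc_vectors)

print(reduced.shape)                                  # (1000, 300)
print(pca.explained_variance_ratio_.cumsum()[-1])     # cumulative variance retained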
3.8. Sentiment Classification with Sigmoid
To determine the final output for the binary classes, we use the sigmoid function shown in Equation (11) to predict sentiment polarity:

$$y = \sigma(wx + b) = \frac{1}{1 + e^{-(wx + b)}} \quad (11)$$

This function maps the input to the range [0, 1], generating a probability distribution over the labels. The goal of the two-class SA problem is to find the sentiment label of the input text. An output $y$ near 0 indicates the negative polarity class, while $y$ near 1 indicates the positive polarity class. The equation includes the learned parameters $w$ (the parameter matrix) and $b$ (the bias), with $x$ representing the strong sentiment state (the reduced feature vector).
3.9. Model Compilation
The model compilation utilizes categorical cross entropy as the loss function, the Adam optimization algorithm, and accuracy as the evaluation metric. The objective of the model is to predict the correct class for each input sequence.
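A minimal Keras compilation sketch under these settings is shown below. Because the classification head described above emits a single sigmoid probability, this sketch uses the binary (two-class) form of cross-entropy; the 300-dimensional input matches the PCA-reduced feature vector, and the head itself is a hypothetical stand-in for the full model.

from tensorflow.keras import layers, models, optimizers

# hypothetical classification head on top of the 300-dimensional PCA features
clf = models.Sequential([
    layers.Input(shape=(300,)),
    layers.Dense(1, activation="sigmoid"),   # probability of the positive class
])

clf.compile(
    optimizer=optimizers.Adam(),             # Adam optimization algorithm
    loss="binary_crossentropy",              # two-class cross-entropy loss
    metrics=["accuracy"],                    # evaluation metric
)
clf.summary()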
Algorithm 1 presents detailed step-by-step pseudocode for implementing a hybrid deep learning model that integrates the wide coverage integrated sentiment lexicon (WCISL), RoBERTa, BiGRU, an attention mechanism, and PCA for sentiment classification tasks. This architecture combines traditional feature engineering with advanced deep learning methods to enhance SA performance. Additionally, the algorithm employs linguistic semantic rules to effectively classify mixed opinions, providing a nuanced approach to SA. The pseudocode starts with preprocessing the input text and extracting features using the WCISL and pre-trained word embeddings. It then leverages RoBERTa to capture contextual representations and applies BiGRU to model sequential dependencies in the text. The attention mechanism identifies the most relevant features, emphasizing critical parts of the input, while PCA reduces dimensionality and eliminates redundant features. Finally, the algorithm uses a sigmoid layer for classification. The pseudocode provides a high-level overview of the DK-HDNN model, encompassing the key steps in its construction, compilation, training, and evaluation, including the essential operations and decision points.
The comprehensive design ensures the extraction of both semantic and sentiment features, making the model suitable for various SA tasks on benchmark datasets.
Algorithm 1 DK-HDNN for Text Sentiment Analysis

Input:  D = {d_1, ..., d_N}   /* review text documents */
        E                     /* pre-trained embedding model (e.g., RoBERTa) */
        WCISL                 /* wide coverage integrated sentiment lexicon */
        alpha                 /* attention weights */
Output: Y = {y_1, ..., y_N}   /* sentiment labels for documents */

 0:  Begin
 1:  for each document d_i in D do
 2:      /* Preprocessing */
 3:      Tokenize d_i into words {w_1, ..., w_n};
 4:      Remove stop words and punctuation; apply part-of-speech tagging and spell correction;
 5:      /* Feature Extraction */
 6:      Obtain word embeddings W_j = E(w_j);
 7:      Extract sentiment scores s_j = WCISL(w_j);
 8:      Concatenate embeddings and sentiment features into X, where X = [W; s];
 9:      /* Deep Learning */
10:      Pass X through RoBERTa to obtain contextual representations C;
11:      Use BiGRU to process C:
12:          h_forward = GRU_fwd(C);
13:          h_backward = GRU_bwd(C);
14:          H = h_forward ⊕ h_backward.
15:      Apply attention mechanism to compute attention weights alpha_t:
16:          alpha_t = softmax(tanh(W_h h_t + b_h)).
17:      Compute weighted document representation:
18:          r = Σ_t alpha_t h_t.
19:      Reduce dimensionality of r using PCA: r' = PCA(r).
20:      Predict sentiment class:
21:          y_i = sigmoid(w r' + b).
22:      Store y_i in Y.
23:  end for
24:  return Y;   /* predicted sentiment labels */
25:  End
4. Experimental Results and Discussion
4.1. Evaluation
We evaluated our system's performance using four real-world benchmark datasets: (1) the Movie Reviews (MR) dataset [44]; (2) the Customer Review (CR) datasets [45]; (3) the Large Movie Review (IMDB) dataset [46]; and (4) the SemEval 2013 dataset [47]. The MR dataset contains 5331 positive and 5331 negative review samples. The CR dataset is composed of reviews of 14 products collected from Amazon, as detailed in [45]. The IMDB dataset features a rich collection of 50,000 movie review texts. SemEval 2013 has a standard training-test split; MR, CR, and IMDB do not have comparable splits. Following prior work [48], we use 10-fold cross-validation for evaluation. For MR, CR, and SemEval 2013, we also set aside 10% of the training data for development purposes, such as early stopping.
We used 300-dimensional word vectors for word embedding, implemented with Keras v2.11.0 and TensorFlow v2.11.0, and utilized RapidMiner Studio v10.1 for visual workflow design. We used the RoBERTa-base model for word vectorization, employing an Adam optimizer with a learning rate of . The RoBERTa-base model's architecture consisted of 12 layers, a hidden layer dimension of 768, 12 attention heads, and over 110 million parameters. Additionally, we used the wide coverage integrated sentiment lexicon (WCISL) to extract sentiment information from the review texts. In the BiGRU, 128 hidden neurons are allocated to each layer. In the output layer, the sigmoid function is used to calculate the probabilities related to class labels.
The total number of training epochs in this architecture was set at 10. We employed PCA to reduce the dimensionality of each embedding vector, projecting it into a 300-dimensional vector space. We assessed the effectiveness of our proposed hybrid method by analyzing its predictive performance using metrics such as Accuracy (ACC), Precision (PRE), Recall (REC), and F-measure. We used a paired t-test to assess the statistical significance of our model's improvements, with the significance level set at p < 0.05.
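A sketch of this evaluation protocol is shown below, assuming stratified 10-fold splits, fold-wise accuracy/precision/recall/F1, and a paired t-test between our model's and a baseline's per-fold accuracies; the model builders are hypothetical placeholders for any classifier exposing fit/predict.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def cross_validate(build_model, X, y, n_splits=10, seed=42):
    """Run stratified k-fold CV; return per-fold accuracy and averaged precision/recall/F1."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_acc, fold_prf = [], []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        fold_acc.append(accuracy_score(y[test_idx], pred))
        fold_prf.append(precision_recall_fscore_support(y[test_idx], pred, average="binary")[:3])
    return np.array(fold_acc), np.mean(fold_prf, axis=0)

# paired t-test between our model's and a baseline's per-fold accuracies
# acc_ours, _ = cross_validate(build_dk_hdnn, X, y)      # hypothetical model builders
# acc_base, _ = cross_validate(build_baseline, X, y)
# t_stat, p_value = ttest_rel(acc_ours, acc_base)        # improvement significant if p < 0.05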
4.2. Model Variants
We evaluate the effectiveness of the proposed method using four distinct model variants (one baseline and three proposed variants). First, we employed the state-of-the-art baseline DNN model (RoBERTa+BiGRU), which we call HDNN-1. For the second variant (DK-HDNN-2), we added SSK (linguistic semantics and sentiment knowledge) to RoBERTa and BiGRU. Similarly, for the third and fourth variants (DK-HDNN-3, DK-HDNN), we added the attention mechanism and PCA, respectively, in order to determine which component contributes most.
HDNN-1 (RoBERTa+BiGRU): The model utilizes RoBERTa's pre-trained word embedding in tandem with BiGRU, excluding linguistic semantics and sentiment knowledge (SSK), PCA, and the attention mechanism.
DK-HDNN-2 (SSK+RoBERTa+BiGRU): The model leverages linguistic, semantic, and sentiment knowledge, employing RoBERTa's pre-trained word embedding in conjunction with BiGRU. Notably, it does not incorporate PCA or the attention mechanism.
DK-HDNN-3 (SSK+RoBERTa+BiGRU+Attention): The model incorporates linguistic semantics and sentiment knowledge and utilizes RoBERTa's pre-trained word embedding in conjunction with BiGRU and an attention mechanism, while PCA is excluded from the model's configuration.
DK-HDNN (SSK+RoBERTa+BiGRU+Attention+PCA): The model integrates linguistic semantics and sentiment knowledge, utilizes RoBERTa's pre-trained word embedding, BiGRU, and the attention mechanism, and employs PCA.
4.3. Influence of Sentiment Knowledge, Attention Mechanism, and Dimensionality Reduction on DK-HDNN Performance
The accuracy of the ablation results for each variant model and the proposed model on different datasets is shown in Figure 2 and Table 3. As shown in Figure 2, the proposed DK-HDNN model and its variants showcase their accuracy across multiple datasets. Notably, DK-HDNN demonstrates superior performance on four distinct multi-domain datasets, outperforming all variant models. The results clearly indicate that the integration of linguistic semantics, sentiment knowledge, attention mechanisms, and dimensionality reduction (PCA) plays a pivotal role in enhancing model performance. The sentiment knowledge utilized in DK-HDNN, including meaningful words obtained through word conversion and typo correction, linguistic rules with POS tagging, and the wide coverage integrated sentiment lexicon (WCISL) (see Figure 1), contributes significantly to its effectiveness in sentiment classification. Additionally, the attention mechanism enables the model to focus on the most relevant features within the data, while PCA reduces feature dimensionality, effectively preserving critical information and mitigating overfitting. Each of these components synergistically enhances the model's overall performance, with the full DK-HDNN achieving the highest accuracy, solidifying its position as the most effective configuration.
4.4. Cross Model Comparison
We compare our proposed approach with previously proposed DNN-based baseline models on the binary sentiment classification problem [20,48,49,50,51,52,53,54,55], as shown in Table 4. More specifically, this section evaluates the classification accuracy of the proposed DK-HDNN model against previously established baseline models on the binary sentiment classification task using multi-domain datasets. The comparative results, presented in Table 4, highlight the performance of various DNN-based models, with the best outcomes emphasized in bold. As shown in Table 4, DK-HDNN consistently outperforms competing approaches on most benchmark datasets, demonstrating its superior effectiveness. Specifically, DK-HDNN achieves notable classification accuracies of 93.80%, 95.90%, 96.88%, and 93.65% on the MR, CR, IMDB, and SemEval 2013 datasets, respectively. These findings underscore the robustness and efficiency of DK-HDNN compared to conventional DNN models in addressing sentiment classification challenges.
Furthermore, the results highlight DK-HDNN (our proposed model) as a highly sentiment- and context-aware model that effectively integrates linguistic semantics and sentiment knowledge. By leveraging wide-coverage domain sentiment lexicons, linguistic rules, POS tagging, and sentiment clues, the model attains a deeper understanding of textual nuances. Additionally, the incorporation of RoBERTa embeddings, the attention mechanism, and PCA further enhances the model's ability to focus on critical features while reducing dimensionality. Together, these elements, combined with hybrid DNNs, validate the superior classification accuracy of DK-HDNN for textual SA, establishing it as a robust and efficient solution in this domain.
4.5. Loss Function Analysis
The lower loss values observed for both training and validation indicate that the model's predictions are, on average, closer to the actual targets. This demonstrates that the model is effectively learning the underlying patterns in the data, as shown in Figure 3.
4.6. Effect of Learning Rate on DK-HDNN
Choosing the right learning rate is crucial for efficient training with gradient descent. A learning rate that is too small can lead to excessive iterations or convergence to local minima; in the worst case, it might even cause the optimization process to become trapped in an infinite loop. Conversely, a learning rate that is too large can cause cost function instability and hinder convergence to the global minimum, significantly slowing down the optimization process. Figure 4 illustrates the sensitivity of the DK-HDNN model's performance to the learning rate. As observed, a learning rate of 0.01 yields peak model accuracy, and further increases in the learning rate lead to a decrease in accuracy. This empirical evaluation suggests that a learning rate of 0.01 is optimal for this specific configuration.
4.7. Impact of Cumulative Contribution Rate of Eigenvalues on Classification Performance
This study examines how dimensionality reduction through Principal Component Analysis (PCA) affects the classification performance of the DK-HDNN model. Specifically, it investigates the impact of the Cumulative Contribution Rate of Eigenvalues (CCRE) on classification accuracy using the MR dataset. The analysis varies the CCRE from 100% to 95%. Text features extracted by the DK-HDNN model undergo PCA to progressively eliminate redundant information. The findings, depicted in Figure 5, indicate enhanced classification accuracy when the CCRE is set to 95%. This implies that removing a moderate amount of redundancy results in a more informative feature representation for the classification task. However, reducing dimensionality further by lowering the CCRE below 95% appears to eliminate not only redundant information but also partially relevant text features, leading to a decline in classification performance.
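A sketch of how such a CCRE sweep can be run with scikit-learn is shown below; passing a fraction as n_components keeps just enough components to reach that cumulative variance. The random feature matrix is a placeholder for the text features extracted by the model.

import numpy as np
from sklearn.decomposition import PCA

features = np.random.default_rng(1).normal(size=(2000, 768))   # stand-in for DK-HDNN text features

for ccre in (1.00, 0.99, 0.97, 0.95):
    pca = PCA(n_components=ccre, svd_solver="full") if ccre < 1.0 else PCA(svd_solver="full")
    reduced = pca.fit_transform(features)
    retained = pca.explained_variance_ratio_.cumsum()[-1]
    print("CCRE target %.2f: %d components, %.3f variance kept" % (ccre, reduced.shape[1], retained))
    # classification accuracy would then be measured on `reduced` for each CCRE setting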
4.8. Comparative Analysis of Execution Time Across Model Variation
We evaluated the computational cost of all model variants by running experiments for up to 10 epochs. Figure 6 presents the execution time of the proposed DK-HDNN model alongside the other model variants on the Movie Reviews (MR) dataset. The DK-HDNN model shows slightly higher execution time than the baseline model HDNN-1, primarily due to the integration of both semantic and sentiment knowledge, as well as the attention mechanism.
4.9. Model Performance
We investigated the performance of our proposed domain knowledge-based hybrid deep neural network model, which incorporates linguistic semantics and sentiment knowledge for sentiment analysis. Our model's increased accuracy over the most advanced baseline DNN-based models can be attributed to several factors. First, our model performs exceptionally well in the preprocessing phase by removing unnecessary and noisy elements from the review text and fixing typos. Second, it is effective at identifying and extracting sentiment elements that are pertinent to the text at hand. Third, our approach successfully classifies mixed opinions by using linguistic semantic rules. Fourth, it integrates the wide coverage sentiment lexicon for accurate recognition of sentiment-related features. Fifth, by representing sentiment features as high-density real-valued vectors, it captures long-range relationships within the word sequence as well as contextual semantic information inside the text. Sixth, it makes use of a powerful blend of techniques, namely RoBERTa, a Bidirectional Gated Recurrent Unit (BiGRU), an attention mechanism, and Principal Component Analysis (PCA). Together with linguistic semantics and sentiment knowledge, this combination effectively extracts highly informative sentiment features, which helps the model classify sentiment accurately.
6. Challenges and Future Directions
The proposed approach faces some challenges in classifying reviews, as outlined below:
Idiomatic expressions: Culturally specific or non-literal phrases may be overlooked, limiting the model’s ability to interpret nuanced sentiments.
Figurative language: Metaphors, irony, and sarcasm often rely on contextual or tonal subtleties (e.g., exaggerated praise masking criticism) that literal textual analysis may misinterpret.
Subtle or layered sentiments: Implicit opinions, indirect critiques, or humor-laced dissatisfaction require deeper contextual inference beyond surface-level keyword detection.
To address these limitations, our upcoming work will focus on the following:
Enhanced contextual markers: Integration of punctuation-based cues (e.g., exclamation marks), laugh indicators (e.g., “haha”, “lol”), and explicit sarcasm markers (e.g., exaggerated positive terms in negative contexts) to better detect irony and mixed sentiments.
Expanded linguistic rules: Development of a dynamic rule set incorporating syntactic patterns, sentiment polarity shifts, and context-aware phrase disambiguation to decode ambiguous or contradictory expressions.
Self-learning mechanisms: Implementation of adaptive feedback loops to iteratively refine sentiment predictions based on user corrections and evolving linguistic trends, improving adaptability across diverse textual contexts (e.g., social media, product reviews).
By bridging these gaps, we aim to enhance the model's robustness in identifying complex sentiment expressions, particularly in scenarios where literal and figurative meanings diverge, and to improve its adaptability in handling complex sentiment expressions across diverse textual contexts such as social media and product reviews.