1. Introduction
Nowadays, the Internet has become the primary source of information for almost everything [
1], and approximately 87% of its users utilize it as a research tool [
2]. The Web offers a variety of thoughts, comments, and views, and online users constantly publish reviews on blogs and social networks, which generates content at an impressive speed [
Social media platforms have become an enormous source of knowledge, characterized by the rapid spread of information, as many public figures convey their views through social networks [
1,
4]. In 2024, for instance, Twitter had more than 500 million monthly active users, with over 200 million daily tweets on a wide range of topics [
Text mining has become a powerful tool for extracting valuable information from social networks [
6] and a viable means of analyzing public opinion on various issues across social media outlets [
7].
Sentiment analysis using deep learning and machine learning approaches has been widely employed by researchers in various projects, including movie reviews [
8,
9], fake tweet detection [
10,
11], medicine intake detection [
12], and others [
13,
14].
Text sentiment analysis practices are generally categorized into statistical techniques, lexicon-based methods, and hybrid processes [
15]. Prior to performing sentiment analysis, researchers preprocess the text, including removing URLs and punctuation, tokenization, normalization, removing stop words, and stemming and/or lemmatization [
16,
17,
18,
19,
20]. Text preprocessing significantly improves sentiment analysis model accuracy [
20,
21] by removing unnecessary noise, such as punctuation, stop words, and irrelevant characters.
There are 179 words in the NLTK English stop words list, including the word “not” and its contraction forms [
22]. Removing stop words during text sentiment analysis can alter the sentiment value. Consider the following negative review: “The product was not good”. After stop word removal, the text becomes “product good”, which reads as a positive sentiment.
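This pitfall can be reproduced in a few lines. The sketch below uses a tiny hand-picked subset of the NLTK English stop word list (an assumption made so the example is self-contained; the real list is loaded via nltk.corpus.stopwords) that, like the full 179-word list, contains the negation cue “not”:

```python
# Minimal sketch of how stop word removal can flip sentiment.
# STOP_WORDS is a tiny hand-picked subset of the 179-word NLTK English
# list; like the full list, it includes the negation cue "not".
STOP_WORDS = {"the", "was", "not", "a", "is", "it"}

def remove_stop_words(text):
    """Drop stop words, keeping the original token order."""
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)

review = "The product was not good"
print(remove_stop_words(review))  # -> "product good": the negation is lost
```

Because “not” is filtered out along with the other stop words, the negative review is reduced to a positive-looking phrase before any classifier sees it.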
Performing text sentiment analysis with negation handling is therefore critical. Correctly identifying and addressing negative words such as “not” or “no” helps compute a sentence’s sentiment accurately. These negative words can completely flip the polarity of a sentence, which significantly affects the overall sentiment classification if not adequately handled. Ignoring negation might lead to misjudging negative sentiment as positive and vice versa, and improperly handling negation in text sentiment analysis might lead to incorrect classification [
23].
Inverting the polarity of the negated term [
3,
24] is a typical way to handle the negation, which sometimes neglects the scope and impact of negation [
25,
26]. Other techniques combine the scope of negation with enhanced XLNet to improve the effectiveness of negation handling [
About 30% of reviews express their sentiment implicitly [
28], so combining an implicit-sentiment component with rule-based methods might improve negation handling in text sentiment analysis [
29].
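The polarity-inversion strategy mentioned above can be sketched in a few lines. The toy lexicon and cue list below are illustrative assumptions, and the sketch deliberately ignores negation scope, which is exactly the weakness noted in [25,26]:

```python
# Illustrative sketch of the simple "invert the polarity of the negated
# term" strategy; the tiny lexicon and cue list are assumptions for the demo.
LEXICON = {"good": 1.0, "bad": -1.0, "great": 1.0, "terrible": -1.0}
NEGATION_CUES = {"not", "no", "never"}

def sentence_score(tokens):
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATION_CUES:
            negate = True            # flip the next sentiment-bearing word
            continue
        value = LEXICON.get(tok, 0.0)
        if negate and value != 0.0:
            value = -value           # invert polarity of the negated term
            negate = False
        score += value
    return score

print(sentence_score("the product was not good".split()))  # -> -1.0
print(sentence_score("the product was good".split()))      # -> 1.0
```

Because the flag is consumed by the first sentiment-bearing word, the sketch cannot bound how far a negation reaches, illustrating why scope-aware methods were proposed.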
Simple or traditional rule-based negation systems typically fail to capture contextual dependencies within sentence structures [
30]. They may also struggle with the inherent complexity of human language and its nuanced expressions [
31]. Accurate negation handling is critical, as it can significantly impact sentiment interpretation and classification results [
32], while failure to handle negation correctly can distort polarity and cause systematic errors in sentiment classification [
24]. Although various methods have been introduced, effectively managing negation continues to be an obstacle in text-based sentiment analysis [
33,
34,
35]. This study introduces a hybrid negation approach (integrating multiple components) to address the persistent challenge of handling negation in complex sentences. Furthermore, the hybrid negation approach is expected to outperform single-method techniques, delivering more accurate and reliable sentiment predictions for complex textual data.
This work is organized as follows:
Section 2 reviews the existing literature on negation handling,
Section 3 covers the research method, and
Section 4 explains the results. Furthermore,
Section 5,
Section 6 and
Section 7 present the discussion, contributions, and conclusions, respectively. Future work is presented in
Section 8.
2. Literature Review
In the literature, most negation handling methods involve determining the polarity of sentences [
3,
24,
36,
37]. Several methods detect negated statements using static windows or punctuation marks [
34], while others mark negated terms with the “_NEG” suffix [
25,
26]. Other approaches use BERT memorization [
38] or apply similarity (synonyms) and antonyms to negated words [
34]. Other studies have employed neural networks solely for negation detection [
29] and grouped negation based on classes [
39].
Table 1 presents previous research projects that have worked on negation handling for sentiment analysis.
Older models such as BiLSTM commonly underperform on complex sentences because their sequential nature struggles with very long-range dependencies and complex relationships [
40]. Moreover, a BiLSTM is composed of a forward and a backward LSTM, so for extremely long sequences, maintaining and updating the hidden states becomes computationally and memory intensive [
41].
The research projects [
29,
39] did not perform sentiment analysis. They classified text documents into negation or non-negation to identify whether expressions were negated. The study in [
29] employed neural-network-based detection to capture both explicit and implicit negation cues. Using a BiLSTM, the project achieved the highest F1-score of 93.09%, while [
39] applied scope, diminisher, and morphological negation. Diminisher negation applied a 0.2 multiplier, while morphological (implicit) negation multiplied the score by −1. This study attained the highest accuracy of 83.3%.
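The multipliers reported for [39] can be illustrated with a short sketch. The diminisher list, negating-prefix set, and lexicon scores below are illustrative assumptions, not the study’s actual resources:

```python
# Sketch of the negation classes described in [39]: a diminisher scales the
# sentiment score by 0.2, while morphological (implicit) negation multiplies
# it by -1. All word lists and base scores below are illustrative assumptions.
DIMINISHERS = {"hardly", "barely", "scarcely"}
MORPHOLOGICAL_PREFIXES = ("un", "in", "dis")
LEXICON = {"happy": 1.0, "pleasant": 1.0, "interesting": 1.0}

def score_token(prev, token):
    base = LEXICON.get(token)
    if base is None:
        # Strip a negating prefix: "unhappy" -> -1 * score("happy")
        for prefix in MORPHOLOGICAL_PREFIXES:
            if token.startswith(prefix) and token[len(prefix):] in LEXICON:
                base = -1 * LEXICON[token[len(prefix):]]
                break
    if base is None:
        return 0.0
    if prev in DIMINISHERS:
        base *= 0.2                  # diminisher negation
    return base

tokens = "she was hardly happy and clearly unhappy".split()
scores = [score_token(tokens[i - 1] if i else None, t) for i, t in enumerate(tokens)]
print(scores)  # -> [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, -1.0]
```

Here “hardly happy” is diminished to 0.2 rather than fully inverted, while the morphological negation in “unhappy” flips the sign of the base score.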
A research study [
38] focused on BERT memorization to detect and understand negation using specific training data. The goal was to identify the sources of BERT’s errors in order to improve sentiment analysis. The project achieved the highest precision of 88%, recall of 89%, and F1-score of 89%.
The work study in [
34] implemented rule-based negation, similarity (synonym), and antonyms of negated expressions. This technique uses
synsets from the NLTK library, providing a more straightforward way to find antonyms or synonyms of negated words in the
WordNet lexical database rather than training a complex model. This technique may be the most cost-effective way to handle negation. Applying logistic regression with word embeddings, lemmatization, and negation, the project earned an accuracy of 91.79%.
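The antonym-substitution idea can be sketched as follows. In [34] the antonyms come from WordNet synsets via NLTK (e.g., lemma antonyms of wordnet.synsets(word)); to keep the example self-contained and avoid a corpus download, a tiny hand-built antonym map stands in here:

```python
# Sketch of antonym substitution for negated words. In practice the antonym
# would come from WordNet synsets via NLTK; a tiny hand-built map stands in
# here so the example is self-contained.
ANTONYMS = {"good": "bad", "happy": "sad", "fast": "slow"}
NEGATION_CUES = {"not", "no", "never"}

def resolve_negation(text):
    """Replace 'not X' with the antonym of X when one is known."""
    tokens = text.lower().split()
    out, skip = [], False
    for i, tok in enumerate(tokens):
        if skip:            # the negated word was already replaced
            skip = False
            continue
        if tok in NEGATION_CUES and i + 1 < len(tokens) and tokens[i + 1] in ANTONYMS:
            out.append(ANTONYMS[tokens[i + 1]])  # "not good" -> "bad"
            skip = True
        else:
            out.append(tok)
    return " ".join(out)

print(resolve_negation("The product was not good"))  # -> "the product was bad"
```

After substitution, the sentence no longer contains a negation cue, so downstream preprocessing (including stop word removal) can no longer discard the negation.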
The following projects [
36,
37] applied a rule-based approach to switch the polarity or sentiment of the expressions: the work study in [
36] implemented naïve Bayes with negation handling, which gained the best accuracy of 77.57%, while Ref. [
37] achieved the highest accuracy of 77.3% using SentiTFIDF.
Another negation-handling project [
3] implemented a rule-based approach using a dependency parse tree to enhance the negation technique. Its goal was to analyze grammatical relations by dependency type, such as nsubj, aux, neg, dobj, and cc. Using word sense disambiguation (WSD), the project achieved a 67% accuracy and an F1-score of 72%.
Work studies [
25,
26] implemented _NEG to identify negated terms and employed the scope of negation to measure the impact of word relations on the final sentiment analysis. The project in [
25] achieved the highest accuracy of 95.67% by utilizing an ANN. On the other hand, the work study in [
26] applied _NEG identification and double negation with scope, achieving the highest F1-score of 69.5% using the SVM classifier.
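The “_NEG” marking with a punctuation-bounded scope can be sketched as follows. The cue and punctuation sets are simplified assumptions, and toggling the scope flag on a second cue gives a crude form of the double-negation handling used in [26]:

```python
# Sketch of "_NEG" marking with a punctuation-bounded scope, in the spirit
# of [25,26]: every token after a negation cue gets the _NEG suffix until
# the next punctuation mark closes the scope. Cue and punctuation sets are
# simplified assumptions.
NEGATION_CUES = {"not", "no", "never", "n't"}
SCOPE_ENDERS = {".", ",", ";", "!", "?"}

def mark_negation(tokens):
    marked, in_scope = [], False
    for tok in tokens:
        if tok in SCOPE_ENDERS:
            in_scope = False         # punctuation closes the negation scope
            marked.append(tok)
        elif tok in NEGATION_CUES:
            in_scope = not in_scope  # a second cue cancels (double negation)
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if in_scope else tok)
    return marked

print(mark_negation(["i", "did", "not", "like", "the", "plot", ",",
                     "but", "the", "cast", "was", "great"]))
# -> ['i', 'did', 'not', 'like_NEG', 'the_NEG', 'plot_NEG', ',',
#     'but', 'the', 'cast', 'was', 'great']
```

The suffixed tokens become distinct features for the classifier, so “like” and “like_NEG” can receive opposite weights during training.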
Next, Ref. [
24] identified explicit and implicit negations with the SentiWordNet (SWN) lexicon. This method improved sentiment analysis by 2–6% compared to the traditional method. Using a hybrid RBF-SVM, the project achieved the best accuracy of 58.67%.
5. Discussion
According to our experimental results, the highest model accuracy of 98.582% was obtained when hybrid negation handling was combined with a BERT classifier for the final prediction. The same configuration also produced the highest precision (98.196%), recall (98.189%), and F1-score (98.193%). Overall, hybrid negation consistently enabled classification models (particularly those using BERT) to outperform all other negation handling techniques.
As in the SMOTE experiments, the highest accuracy in the control group (98.262%), as well as its highest precision (97.750%), recall (97.816%), and F1-score (97.783%), were all achieved by applying hybrid negation with BERT as the final prediction model.
Hybrid negation performed much better than the other negation handling techniques in the control experiment group. Fifteen of the twenty best model accuracies in our experiment (75%) were achieved with the hybrid negation technique, two of twenty (10%) with the TextBlob negation handling method, one of twenty (5%) with the Negex method, and two of twenty (10%) with antonym–synonym combined with a second rule-based negation handling method.
As in the control group, hybrid negation consistently outperformed the other negation handling techniques when we applied the SMOTE imbalance method: sixteen of the twenty best model accuracies (80%) were achieved with the hybrid approach, two of twenty (10%) with antonym–synonym plus a second rule-based method, and two of twenty (10%) with the TextBlob negation handling method.
Table 6 presents an ablation study of the hybrid negation handling technique. Removing any component of hybrid negation decreased all model performances. Excluding the “scope” component decreased accuracy the most (by 0.402%), while excluding the “double negation” component caused the largest drop in precision (0.441%). The greatest drops in recall (0.594%) and F1-score (0.515%) both occurred when the “scope” component was removed. Consequently, the scope element appears to affect the system the most, because misjudging the scope of a sentence can lead to misinterpreting its information and thus to inaccurate sentiment analysis.
Our hybrid model may still struggle to recognize linguistic ambiguity, such as ambiguous sentences or sarcasm, because it relies on predefined rules or features that may not account for nuanced human language use. The presence of sarcasm might lead to incorrect sentiment analysis. For example, “Oh, fantastic! It was wonderful that the customer service hung up on me three times” might be misclassified as positive sentiment because of the words “fantastic” and “wonderful”, although the expression conveys anger or dissatisfaction and is therefore negative. The hybrid model may also misclassify complex metaphors. For instance, “The team’s performance was a slow-motion train wreck” might be labeled neutral if “slow-motion train wreck” is interpreted as a plain noun phrase, whereas the metaphor actually describes a gradual, drawn-out poor performance and thus conveys negative sentiment.
For evaluation purposes, we conducted a comparative analysis of our proposed hybrid method against large language models (LLMs). We applied the “cardiffnlp/twitter-roberta-base-sentiment” model with the transformers library’s pipeline function to perform sentiment analysis.
Table 7 compares the performance of our suggested methods with that of the LLMs. On all metrics, the hybrid methods surpass the LLM technique. The best LLM performance (accuracy of 93.855%, precision of 93.808%, recall of 93.833%, and F1-score of 93.807%) was achieved with the MLP-LinearSVC classifier, whereas our proposed methods with the same model (MLP-LinearSVC) achieved 98.067% accuracy, 98.072% precision, 98.066% recall, and 98.066% F1-score.
The descriptive statistics in
Table 8 show that applying SMOTE improved overall model performance. The average accuracy increased from 90.379% to 92.764%, the median rose from 92.254% to 93.197%, the minimum grew from 69.059% to 69.121%, and the maximum advanced from 98.262% to 98.582%. In addition, SMOTE reduced the standard error from 0.569 to 0.500, the standard deviation from 6.074 to 5.340, and the sample variance from 36.896 to 28.514. Consequently, applying SMOTE improved model performance in this research project.
We utilized classification models without SMOTE as a control experiment while applying SMOTE as a treatment group. It seems that applying the SMOTE imbalance technique has improved the classification model accuracy (as shown in
Table 8). To validate this assumption, we used a one-tailed
Wilcoxon Signed Rank Test to examine our hypothesis.
H0 denotes the null hypothesis, and H1 the alternative hypothesis. P1 denotes the performance of the SMOTE model, and P0 the performance of the control group (without the SMOTE technique).
Our null hypothesis asserts that the metric performance of the control group (without the SMOTE technique) is the same as that of the treatment group (with the SMOTE). On the other hand, the alternative hypothesis posits that the treatment class outperformed the control group. We used 99% confidence intervals with a 1% significance level (a z-score of 2.33 as the cutoff).
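The test procedure can be sketched in code. The z-statistic below uses the normal approximation of the Wilcoxon signed-rank test (ties and zero differences are handled naively, unlike, e.g., scipy.stats.wilcoxon), the paired scores are made-up illustrative numbers, and the 2.33 cutoff is recovered from the standard normal quantile:

```python
# Sketch of a one-tailed Wilcoxon signed-rank test via the normal
# approximation: rank the absolute paired differences, sum the ranks of
# positive differences (W+), and standardize. Tie-averaging of ranks is
# omitted for brevity.
from statistics import NormalDist

def wilcoxon_z(treatment, control):
    diffs = [t - c for t, c in zip(treatment, control) if t != c]
    n = len(diffs)
    ranked = sorted(range(n), key=lambda i: abs(diffs[i]))  # 1 = smallest |diff|
    ranks = [0] * n
    for rank, i in enumerate(ranked, start=1):
        ranks[i] = rank
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    return (w_plus - mean) / sd

# Illustrative paired accuracies (made-up numbers): treatment consistently higher.
control = [70.0 + i for i in range(20)]
treatment = [c + 0.1 * (i + 1) for i, c in enumerate(control)]
z = wilcoxon_z(treatment, control)
z_crit = NormalDist().inv_cdf(0.99)   # one-tailed 1% cutoff, ~2.33
print(round(z, 2), round(z_crit, 2))  # -> 3.92 2.33
```

When every treatment score exceeds its paired control score, W+ takes its maximum value and z lands well above the 2.33 cutoff, so the null hypothesis is rejected.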
Z-scores (Table 9) were computed for accuracy, precision, recall, and F1-score from the SMOTE model performance. The z-score for accuracy (8.186723131) is extremely high, with a p-value close to zero. The z-scores for precision (7.892674018), recall (8.729583031), and F1-score (8.636278986) are also exceptionally high, with p-values likewise close to zero. At a 99% confidence interval (1% significance level), the SMOTE z-scores exceed the 2.33 cutoff; therefore, we rejected the null hypothesis. The Wilcoxon test results suggest that the SMOTE imbalance technique improves model performance for negation handling in text sentiment analysis. This result corresponds to research studies [
57,
58,
59], which show that oversampling enhances model performance not simply by generating synthetic minority samples but also by removing misclassified samples [
57] and reduce noisy samples [
59], thereby improving predictive accuracy and mitigating the risk of overfitting [
57].
Next, to examine whether the hybrid negation handling performance surpassed that of the other negation techniques, we applied a one-tailed Wilcoxon Signed Rank Test with a 99% confidence interval (significance level 1%). The null hypothesis stated that the metric performance of hybrid negation was the same as that of the other negation techniques. In contrast, the alternative hypothesis stated that hybrid negation performed better than the other negation handling methods.
Table 10 presents z-scores for the hybrid negation technique compared with the other negation handling methods (Negex, synonym with second rule-based, synonym-only, TextBlob, and zero-shot). All z-scores exceeded the 2.33 critical value; consequently, we rejected the null hypothesis. At a 1% significance level (99% confidence interval), the
Wilcoxon test suggests that the hybrid negation technique significantly outperforms the other negation techniques.
Our research project surpassed previous studies. We achieved the highest model performance (both with and without SMOTE) when applying hybrid negation with a BERT classification model for the final prediction. Our best accuracy of 98.582% outperformed the research study [
25] (accuracy of 95.67%). The previous study employed Negex negation handling and SentiWordNet (SWN) sentiment computation in conjunction with the ANN classification model.
Additionally, without applying the SMOTE imbalance technique, our project with three classes achieved an accuracy of 98.262%, exceeding previous research studies. A study [
25], having two classes (positive and negative), achieved the highest accuracy of 95.67%. Project [
26] used the same three classes and achieved a best F1-score of 69.5%. Project [
36] achieved an accuracy of 77.57%, experiment [
37] achieved an accuracy of 77.3%, and study [
34] achieved an accuracy of 91.79%. A research study [
3] with three classes (positive, negative, and neutral) achieved the best accuracy of 67%.
The previous project [
34] utilized antonym–synonym and SWN sentiment computation with a Logistic Regression model, achieving the highest model accuracy of 91.79%. Nevertheless, we outperformed that project when we used the same negation handling method (antonym–synonym) with a different sentiment computation (Vader SentimentIntensityAnalyzer) and a BERT model classifier, achieving 94.822% with SMOTE and 94.666% without. Furthermore, when we extended the antonym–synonym method with a second rule-based step, we achieved much higher model performance (97.796% with SMOTE and 97.723% without).
Even our lowest model accuracy of 69.059% outperformed the previous negation handling project. A research project [
24], for example, achieved an accuracy of 58.67% by applying the hybrid RBF (radial basis function)-SVM (support vector machine) model with three classes (positive, negative, and neutral).
Most previous research projects [
3,
24,
25,
34] utilized SentiWordNet (SWN) to compute sentiment scores. Project [
3] (an F1-score of 72%) applied rule-based negation by inverting the SWN sentiment score and using a word sense disambiguation model. In addition, the study [
24] achieved the highest accuracy of 58.67% by inverting the SWN polarity for negated expressions and utilizing a hybrid RBF-SVM model classifier. On the other hand, when we applied a different sentiment computation (TextBlob) and model classifier (BERT), we achieved better model performance, with SMOTE at 97.438% and without SMOTE at 97.368%. Consequently, applying a suitable sentiment computation improved classification model performance in text sentiment analysis.