Revisiting the Role of Label Smoothing in Enhanced Text Sentiment Classification

Si, Shijing; Gao, Yijie; Sun, Haixia; Zhang, Yugui; Luo, Hua

doi:10.3390/electronics15101984


by Shijing Si ¹,†, Yijie Gao ²,†, Haixia Sun ¹,*, Yugui Zhang ¹ and Hua Luo ¹

¹ School of Economics and Finance, Shanghai International Studies University, Shanghai 201620, China

² School of Computer Science and Technology, East China Normal University, Shanghai 200061, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Electronics 2026, 15(10), 1984; https://doi.org/10.3390/electronics15101984

Submission received: 25 March 2026 / Revised: 21 April 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

(This article belongs to the Special Issue New Trends in Machine Learning, System and Digital Twins)


Abstract

Label smoothing is a widely used technique in various domains, such as text classification, image classification and speech recognition, known for effectively combating model overfitting. However, there is little fine-grained analysis on how label smoothing enhances text sentiment classification. To fill in the gap, this article performs a set of in-depth analyses on eight datasets for text sentiment classification and three deep learning architectures: TextCNN, BERT, and RoBERTa, under two learning schemes: training from scratch and fine-tuning. By tuning the smoothing parameters, we can achieve improved performance on almost all datasets for each model architecture. Specifically, our experiments demonstrate that label smoothing improves accuracy by 0.5–2.3 percent across different architectures, with the best results achieved using smoothing parameters $λ \in [0.01, 0.1]$ for three-class datasets and $λ \in [0.01, 0.15]$ for binary-class datasets. We further investigate the benefits of label smoothing, finding that label smoothing can accelerate the convergence of deep models by 15–30 percent and make samples of different labels easily distinguishable. Additionally, we provide comprehensive analysis including macro-F1, precision, and recall metrics to ensure robust evaluation across datasets with varying class distributions.

Keywords:

label smoothing; deep learning; BERT; RoBERTa; sentiment classification; soft-target training

1. Introduction

Text sentiment classification aims to identify and extract emotional tendencies (such as positive, negative, or neutral) from text, and it provides important application scenarios and challenges for the development of natural language processing (NLP) [1,2,3]. Many methods have been proposed for this task, for instance, convolutional neural networks (CNNs) [4] and recurrent neural networks (RNNs) with attention mechanisms [5,6]. Transformer-based models [7,8,9] have excelled in sentiment classification tasks, achieving state-of-the-art performance [10,11,12].

Although deep learning methods have achieved good performance on text sentiment classification, they may still suffer from issues like slow convergence and suboptimal generalization performance [13,14]. These challenges motivate researchers to explore regularization techniques that can improve both training efficiency and model robustness. Label smoothing (LS) represents one such technique that has shown promise in addressing these limitations.

What is Label Smoothing? In simple terms, label smoothing is a regularization technique that “softens” hard class labels during training. Consider a binary sentiment classification task: a text labeled as “positive” would traditionally be represented as a one-hot vector $[1, 0]$ (100 percent positive, 0 percent negative). With label smoothing, this becomes a softer distribution like $[0.95, 0.05]$: the model is still told this example is positive, but with slightly less certainty. This prevents the model from becoming overconfident and helps it generalize better to unseen data. For a three-class problem (positive, neutral, negative), a “positive” label would change from $[1, 0, 0]$ to something like $[0.97, 0.015, 0.015]$, depending on the smoothing parameter.

More formally, label smoothing (LS) is a commonly used regularization method against overfitting that uses soft targets formed as a weighted average of the hard targets and the uniform distribution over labels [15,16]. This technique narrows the gap between the top probability estimate and the remaining ones, which discourages the model from generating extremely confident predictions and makes it less likely to become excessively tailored to the training data [17,18]. This also fosters generalization, leading to better performance on unseen data [19,20]. Furthermore, LS can mitigate the impact of noisy labels on the training process [21].

LS has achieved widespread success in NLP [22,23,24] by introducing soft targets, which allow the model to optimize in a more flexible direction during training [25,26]. However, fine-grained research on LS for text sentiment classification is still limited. In sentiment classification, emotion is not an absolute and discrete concept [27,28]; rather, it exhibits a degree of fuzziness and continuity [29]. LS can take this ambiguity into account, allowing the model to learn more balanced relative probabilities between different emotions.

Our Contributions. In order to investigate how LS benefits sentiment classification, in this paper we conduct extensive experiments on three widely used deep neural network architectures: TextCNN, BERT, and RoBERTa, under two learning schemes: training from scratch and fine-tuning. Our contributions can be summarized as follows:

Through systematic evaluation with four distinct smoothing parameters ($λ \in \{0.01, 0.025, 0.05, 0.1\}$ for three-class and $λ \in \{0.01, 0.05, 0.1, 0.15\}$ for binary-class datasets), LS methods outperform the baselines of all three architectures on all eight datasets, including six three-class datasets and two binary-class datasets.
From in-depth analysis, LS can accelerate the training process of deep models by 15–30 percent with the deployment of soft labels, reducing the number of epochs required to achieve convergence.
LS can produce better hidden representations for training examples as they are easier to distinguish than those produced by the baseline method, as demonstrated through t-SNE visualization.
We provide a comprehensive evaluation using multiple metrics (accuracy, macro-F1, precision, recall) and controlled experiments to isolate the effect of label smoothing from the choice of loss function.

The remaining sections of this paper are organized as follows: We first review the related works in Section 2. Then, in Section 3, we propose the application and deployment of LS in text sentiment classification. Subsequently, in Section 4, we present the experimental results on eight sentiment analysis datasets. Finally, Section 5 provides a discussion and conclusion of this paper.

2. Related Works

This paper is related to two lines of research: text sentiment classification and label smoothing. Additionally, we position our work within the broader context of soft-target training methods in NLP.

2.1. Text Sentiment Classification

Text sentiment classification is a special type of text classification [30,31], as the labels are ordinal. In the early days, statistical models such as Naive Bayes [32,33] and support vector machines [34,35] were dominant in text sentiment classification. These models, while accurate and stable, required time-consuming feature design and often overlooked the context information in text data [30]. Since the 2010s, deep learning models [14,36], which automatically extract meaningful representations of text without the need for manual rule and feature design [13], have been increasingly used for sentiment classification.

Although deep learning methods have achieved good performance on text sentiment classification, they may still suffer from issues like slow convergence and suboptimal generalization [13,14]. In this paper, we explore how LS benefits deep sentiment methods.

2.2. Label Smoothing (LS)

LS was first introduced in the field of computer vision (CV) and has achieved success in various visual recognition tasks [15,37,38]. Later, the method was shown to be effective in Machine Translation (MT) [18]. Moreover, it also has applications in sentiment classification [39,40] and Named Entity Recognition (NER) [41,42], improving model calibration and bringing flatter neural minima.

Meanwhile, researchers have started to apply LS to text sentiment classification. One attempt is to add LS to the loss function to enhance the performance of the model [43]. Similarly, another group added LS to the loss function and performed emotion classification on the adaptive fusion features obtained [43]. The properties of LS and its adversarial variants have also been studied, showing that LS can enhance the adversarial robustness of the model [44]. Yan et al. [45] proposed a new cyclic smoothing labeling technique for handling the periodicity of angles and increasing error tolerance for adjacent angles. They also designed a densely coded tag that greatly reduced the length of the code. Wu et al. [46] proposed an efficient data augmentation method, termed “text smoothing”, which converts a sentence from its one-hot representation to a controllable smoothed representation, and showed that text smoothing outperforms various mainstream data augmentation methods by a substantial margin.

2.3. Soft-Target Training in NLP

Our work on label smoothing belongs to a broader family of methods that exploit non-hard targets to improve learning dynamics and generalization. This family includes knowledge distillation [47], where soft targets from a teacher model guide student training; self-distillation [48], where a model distills knowledge to itself; and various forms of soft-label learning [49].

Recent work by Pozzi et al. [50] addressed exposure bias in large language model distillation through an imitation learning approach, demonstrating the importance of soft-target strategies in modern NLP. Label smoothing can be viewed as a special case of soft-target training where the target distribution is a simple interpolation between hard labels and a uniform distribution, rather than being derived from a teacher model. This simplicity makes LS particularly attractive for scenarios where teacher models are unavailable or computational resources are limited.

Although LS is a common label processing technique, several problems and challenges still need further study, for example, comparing its effects across different datasets, tasks, and model structures, and comparing it with other regularization methods [51]. Studying these questions can advance the understanding of LS. Our work revisits the application of LS to sentiment classification and conducts an in-depth analysis of its power.

3. Label Smoothing Method for Text Classification

Consider a tokenized input $x = (x_{1}, x_{2}, \dots, x_{n})$ and a set of labels $y = (y_{1}, y_{2}, \dots, y_{k})$, where n is the length of the input and k is the number of categories for classification. The label distribution $D_{i} = (d_{x_{i}}^{y_{1}}, d_{x_{i}}^{y_{2}}, \dots, d_{x_{i}}^{y_{k}})$ describes the extent to which the instance $x_{i}$ belongs to each label $y_{j}$, $j = 1, \dots, k$. We expect to obtain a final classification result $d_{i}$, which is typically the maximum value of the predicted label distribution obtained by applying a softmax normalization to the last layer of the deep neural network. The label corresponding to the maximum value represents the category to which this document belongs.
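The prediction step described above can be sketched in a few lines of plain Python. This is only an illustration: the logits and the label order are made-up values, and the helper names `softmax` and `predict` are ours rather than part of any model code used in the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw last-layer scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels):
    """Return the label with the highest softmax probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical label order and logits, chosen only for illustration.
labels = ["negative", "neutral", "positive"]
label, prob = predict([0.2, -1.0, 2.5], labels)
print(label, round(prob, 3))
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.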

3.1. Basics of Label Smoothing

Compared with the commonly used hard target, label distribution has some advantages. Due to the continuity of label distribution, it has a larger labeling space and a broader range of expression and therefore can provide greater flexibility in the learning process. Specifically, LS is a straightforward way to convert hard labels to soft label distributions, which are a mixture of the one-hot hard label and the uniform distribution, i.e.,

D_{i}^{'} = (1 - k λ) D_{i} + λ \mathbf{1}, \quad λ \in [0, 1]

(1)

where $D_{i} = (d_{x_{i}}^{y_{1}}, d_{x_{i}}^{y_{2}}, \dots, d_{x_{i}}^{y_{k}})$ is the original one-hot distribution, $\mathbf{1} = (1, 1, \dots, 1)_{1 \times k}$, and $λ$ is the smoothing parameter. Note that $D_{i}^{'}$ has non-negative entries, and is therefore a valid probability distribution, only when $λ \le 1 / k$; all values of $λ$ used in this paper satisfy this condition.
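As a minimal sketch of Equation (1), the function below builds the soft target for a single example. The function name `smooth_labels` and the range check are our own additions for illustration; the check reflects the fact that the mixture only has non-negative entries when $λ \le 1 / k$.

```python
def smooth_labels(true_index, num_classes, lam):
    """Soft target (1 - k*lam) * one_hot + lam * ones, following Equation (1)."""
    if not 0.0 <= lam <= 1.0 / num_classes:
        raise ValueError("lam must lie in [0, 1/k] to give a valid distribution")
    one_hot = [1.0 if j == true_index else 0.0 for j in range(num_classes)]
    return [(1.0 - num_classes * lam) * d + lam for d in one_hot]


binary = smooth_labels(0, 2, 0.05)
ternary = smooth_labels(0, 3, 0.015)
print([round(v, 3) for v in binary])   # [0.95, 0.05]
print([round(v, 3) for v in ternary])  # [0.97, 0.015, 0.015]
```

Some libraries parameterize smoothing as $(1 - ε)$ times the one-hot vector plus $ε / k$ on every class (for example, the `label_smoothing` argument of PyTorch's `CrossEntropyLoss`). Under that convention the two forms coincide when $ε = k λ$. This mapping follows from Equation (1) alone; the text does not state which convention the implementation used.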

This approach also sidesteps the need for manual labeling. A primary challenge in learning from label distributions is the difficulty of obtaining the true label distribution. Most classification datasets do not provide this information, and, in principle, acquiring a precise label distribution would require many annotators to label the same sample so that its statistical distribution can be estimated, which is prohibitively expensive.

Suppose that $θ$ denotes the trainable parameters of a classification model and that, for the i-th training example, $P (y | x_{i}; θ) = (p (y_{1} | x_{i}; θ), p (y_{2} | x_{i}; θ), \dots, p (y_{k} | x_{i}; θ))$ is the probability distribution output by the model. We use the Kullback-Leibler (KL) divergence during the training process, that is, the optimal parameter $θ^{*}$ satisfies

θ^{*} = \arg \min_{θ} \sum_{i} KL (D_{i}^{'} \| P) = \arg \min_{θ} \sum_{i} \sum_{j} d_{x_{i}}^{' y_{j}} \ln \frac{d_{x_{i}}^{' y_{j}}}{p (y_{j} | x_{i}; θ)}

(2)

Because the term $\sum_{j} d_{x_{i}}^{' y_{j}} \ln d_{x_{i}}^{' y_{j}}$ does not depend on $θ$, this expression can be transformed into a maximum likelihood function:

θ^{*} = \arg \max_{θ} \sum_{i} \sum_{j} d_{x_{i}}^{' y_{j}} \ln p (y_{j} | x_{i}; θ)

(3)

For the original hard-target multi-class problem, the optimization problem derived from the cross-entropy loss is

θ^{*} = \arg \min_{θ} - \sum_{i} \sum_{j} d_{x_{i}}^{y_{j}} \ln p (y_{j} | x_{i}; θ)

(4)

The corresponding maximum likelihood function is as follows:

θ^{*} = \arg \max_{θ} \sum_{i} \ln p (y (x_{i}) | x_{i}; θ)

(5)

where $y (x_{i})$ is the underlying true label for the i-th training instance.

For single-label problems, only the true label carries non-zero target mass, so the objective reduces to the log-likelihood of that single label, as in Equation (5). Single-label learning is therefore a special case of learning with label distributions [49], and LS can be seen as a more general learning technique that recovers traditional hard-target training when $λ = 0$.

3.2. Relationship Between Cross-Entropy and KL Divergence with Label Smoothing

An important consideration in our experimental design is the relationship between cross-entropy loss and KL divergence when label smoothing is applied. For hard labels, minimizing cross-entropy is equivalent to maximizing the log-likelihood. When soft labels are used, the KL divergence between the soft target distribution and the model’s predicted distribution becomes the natural objective.

To ensure fair comparison and isolate the effect of label smoothing from the choice of loss function, we note the following equivalence: when the target distribution is a proper probability distribution (as with label smoothing), minimizing KL divergence is equivalent to minimizing cross-entropy with the soft targets:

KL (D_{i}^{'} \| P) = H (D_{i}^{'}, P) - H (D_{i}^{'}) = - \sum_{j} d_{x_{i}}^{' y_{j}} \ln p (y_{j} | x_{i}; θ) + const

(6)

where $H (D_{i}^{'})$ is the entropy of the soft target distribution, which is constant with respect to the model parameters. Therefore, optimizing KL divergence with soft targets is mathematically equivalent to optimizing cross-entropy with the same soft targets.
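The identity in Equation (6) can be checked numerically. The sketch below uses one soft target and two arbitrary predicted distributions (all values are made up for illustration) and confirms that the gap between cross-entropy and KL divergence equals the target entropy, regardless of the prediction.

```python
import math


def cross_entropy(target, pred):
    """H(target, pred) = -sum_j target_j * ln(pred_j)."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def entropy(target):
    """H(target) = -sum_j target_j * ln(target_j)."""
    return -sum(t * math.log(t) for t in target if t > 0)


def kl_divergence(target, pred):
    """KL(target || pred) = sum_j target_j * ln(target_j / pred_j)."""
    return sum(t * math.log(t / p) for t, p in zip(target, pred) if t > 0)


soft_target = [0.97, 0.015, 0.015]
pred_a = [0.7, 0.2, 0.1]     # illustrative model outputs
pred_b = [0.9, 0.05, 0.05]

for pred in (pred_a, pred_b):
    gap = cross_entropy(soft_target, pred) - kl_divergence(soft_target, pred)
    # The gap matches H(soft_target) for every prediction.
    print(round(gap, 6), round(entropy(soft_target), 6))
```

Since the gap does not depend on the prediction, both losses have identical gradients with respect to the model parameters.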

In our experiments, we include controlled baselines that use cross-entropy with soft targets (CE-Soft) to verify that the improvements come from label smoothing itself rather than from the change in loss formulation.

3.3. Training with Label Smoothing

Now we illustrate how we train the deep learning models with LS. The goal of model training is to iteratively optimize the parameter $θ$ in order to find the optimal parameter $θ^{*}$ that satisfies

θ^{*} = \arg \min_{θ} \sum_{i} KL (D_{i}^{'} \| P (y | x_{i}; θ))

(7)

The stochastic gradient descent (SGD) algorithm and its variants like Adam [52] are used to update the parameters. For each mini-batch, the updating formula is

θ_{i + 1} = θ_{i} - η \cdot δ_{i}

(8)

where $θ_{i + 1}$ is the parameter vector at iteration $i + 1$, $θ_{i}$ is the parameter vector at iteration i, $η$ is the learning rate, which controls the step size of each update, and $δ_{i}$ is the gradient of the loss function with respect to the parameters, evaluated at $θ_{i}$.
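To make the update rule in Equation (8) concrete, the toy sketch below applies plain gradient descent to the logits of a single example, treating the logits themselves as the parameters $θ$. This is a deliberate simplification and not the training code used in the experiments; the learning rate and step count are arbitrary. It relies on the standard result that the gradient of the soft-target cross-entropy with respect to the logits is the predicted distribution minus the target.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def descend_on_logits(target, lr=0.5, steps=2000):
    """Repeat theta <- theta - lr * grad, where theta are the logits of one example."""
    logits = [0.0] * len(target)
    for _ in range(steps):
        probs = softmax(logits)
        grad = [p - t for p, t in zip(probs, target)]  # d(loss) / d(logits)
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits, softmax(logits)


soft_logits, soft_probs = descend_on_logits([0.95, 0.05])  # lambda = 0.05, k = 2
hard_logits, hard_probs = descend_on_logits([1.0, 0.0])    # one-hot target

print([round(p, 3) for p in soft_probs])          # [0.95, 0.05]
print(round(soft_logits[0] - soft_logits[1], 3))  # 2.944, i.e. ln(0.95 / 0.05)
print(hard_logits[0] - hard_logits[1] > soft_logits[0] - soft_logits[1])  # True
```

With the smoothed target, the logit gap settles at a finite value, whereas with the one-hot target it keeps growing for as long as training continues.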

3.4. Selected Deep Learning Architectures

TextCNN, BERT, and RoBERTa are three highly influential and successful deep learning architectures in natural language processing.

TextCNN [53], as a pioneering architecture, introduced the idea of applying convolutional neural networks to text classification tasks. It leverages convolutional layers with various kernel sizes to capture local features in text data. TextCNN has demonstrated strong performance and efficiency in handling fixed-length input sequences [54,55].

BERT (Bidirectional Encoder Representations from Transformers) [56] brought about a breakthrough in language understanding. BERT introduced the concept of pre-training a transformer-based model on large amounts of unlabeled text data and fine-tuning it for downstream tasks. It showed remarkable success in a wide range of natural language processing tasks, including question answering, sentiment analysis, and named entity recognition.

RoBERTa [57], building upon BERT’s foundation, refined the pre-training process and achieved improved performance. It incorporated additional pre-training data and introduced modifications to the training objectives. RoBERTa demonstrated enhanced language representation capabilities and surpassed BERT’s performance on various benchmarks and tasks.

In our experiments, we investigate the effectiveness of LS across the three aforementioned architectures. The workflow of our method is shown in Figure 1, taking TextCNN as the example.

4. Experimental Results and Analysis

We construct four LS predictive models incorporating varying degrees of label smoothing: LS1 with $λ = 0.01$; LS2 with $λ = 0.025$ for three-class and $λ = 0.05$ for binary-class datasets; LS3 with $λ = 0.05$ for three-class and $λ = 0.1$ for binary-class datasets; and LS4 with $λ = 0.1$ for three-class and $λ = 0.15$ for binary-class datasets. We assess their performance against the baseline models on eight datasets using the TextCNN, BERT, and RoBERTa architectures. In addition to evaluating the effectiveness of LS methods, we also conduct in-depth analyses of how LS benefits text sentiment classification.
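The four smoothing levels can be written down as a small lookup table. The sketch below is one possible encoding; the dictionary name `LS_LAMBDAS` and the helper functions are hypothetical, and the true-class target mass follows from Equation (1) as $1 - (k - 1) λ$.

```python
# Smoothing parameter per LS level, keyed by number of classes (values from the text).
LS_LAMBDAS = {
    3: {"LS1": 0.01, "LS2": 0.025, "LS3": 0.05, "LS4": 0.1},
    2: {"LS1": 0.01, "LS2": 0.05, "LS3": 0.1, "LS4": 0.15},
}


def smoothing_parameter(level, num_classes):
    """Return lambda for a configuration; the hard-label baseline uses lambda = 0."""
    if level == "baseline":
        return 0.0
    return LS_LAMBDAS[num_classes][level]


def true_class_mass(level, num_classes):
    """Target probability on the true class: (1 - k*lam) + lam = 1 - (k - 1)*lam."""
    lam = smoothing_parameter(level, num_classes)
    return 1.0 - (num_classes - 1) * lam


for k in (3, 2):
    for level in ("baseline", "LS1", "LS2", "LS3", "LS4"):
        print(k, level, round(true_class_mass(level, k), 3))
```

Under this reading, the strongest setting (LS4) still leaves 0.8 of the target mass on the true class for three-class data and 0.85 for binary data.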

4.1. Datasets

We utilize publicly available datasets open-sourced on Kaggle and Huggingface. This experiment employs eight distinct sentiment analysis datasets, of which six are three-class and two are two-class. The diverse range of datasets enables us to evaluate the performance and generalizability of our models across various data types and scenarios. Table 1 provides a summary of all datasets used in our experiments.

Twitter Financial News Sentiment (TFNS) (https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment (accessed on 15 March 2024)) was collected using the Twitter API. It holds 11,932 documents annotated with three labels (“Positive”, “Neutral”, “Negative”) and comprises annotated finance-related tweets.

Kaggle Financial Sentiment (KFS) (https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis (accessed on 15 March 2024)) is a dataset of financial reviews annotated with three labels (“Positive”, “Neutral”, “Negative”). These reviews cover topics such as stocks, investments, market analysis, corporate performance, technology, and real estate.

Tweet Sentiment Extraction (TSE) (https://www.kaggle.com/c/tweet-sentiment-extraction (accessed on 16 March 2024)) is a three-class dataset from a Kaggle contest. It has 27,481 samples of a tweet and a sentiment label (“Positive”, “Neutral”, or “Negative”).

Auditor Sentiment (AS) (https://huggingface.co/datasets/FinanceInc/auditor_sentiment (accessed on 16 March 2024)) gathers the auditor evaluations into one dataset. It contains thousands of sentences from English financial news, grouped into three categories by emotion (“Positive”, “Neutral”, “Negative”).

Financial Phrasebank (FP) [58] (https://huggingface.co/datasets/takala/financial_phrasebank (accessed on 17 March 2024)) consists of 4840 sentences from English language financial news categorized by sentiment. The dataset is divided into three classes (“Positive”, “Neutral”, “Negative”) by agreement rate of 5–8 annotators.

ChatGPT Sentiment Analysis (CSA) (https://www.kaggle.com/datasets/charunisa/chatgpt-sentiment-analysis (accessed on 18 March 2024)) contains 10,000 pieces of data. The dataset gathers tweets about ChatGPT-3.5 and reflects people’s perception of it. Emotions are divided into three categories (“Positive”, “Neutral”, “Negative”).

Rotten Tomatoes Reviews (RTR) (https://huggingface.co/datasets/cornell-movie-review-data/rotten_tomatoes (accessed on 19 March 2024)) consists of 10,662 processed sentences from Rotten Tomatoes movie reviews. It is a balanced two-class dataset with 5331 positive and 5331 negative reviews.

Sentiment140 (Sent140) [59] (https://huggingface.co/datasets/stanfordnlp/sentiment140 (accessed on 20 March 2024)) is a widely used sentiment analysis dataset created by researchers at Stanford University. It contains 1,600,000 tweets extracted using the Twitter API. The tweets have been annotated to two classes (“Positive”, “Negative”).

4.2. Model Configuration

Models: Table 2 presents the configuration of the models and loss functions used in the experiment. The LS models (LS1–LS4) employ different smoothing levels, with the smoothing parameter $λ$ increasing from 0.01 (LS1) to 0.1/0.15 (LS4). The original one-hot hard labels are used in the baseline methods, while for LS1 to LS4, the soft labels are utilized with different smoothing parameters. By gradually increasing the smoothing parameter from LS1 to LS4, we explore how label smoothing affects the model’s accuracy and stability.

To isolate the effect of label smoothing from the choice of loss function, we include a controlled baseline (CE-Soft) that uses cross-entropy loss with soft targets. This allows us to verify that improvements come from label smoothing itself rather than from switching between cross-entropy and KL divergence. All LS models utilize the KL divergence as the loss function. In contrast, the baseline models employ the cross-entropy loss.

Implementation Details: Our experiments were conducted with the following specifications:

Hardware: All experiments were performed on a server with NVIDIA Tesla V100 32 GB GPU, Intel Xeon Gold 6248R CPU (48 cores), and 256 GB RAM.
Software Environment: Python 3.8, PyTorch 1.10, Transformers 4.15, CUDA 11.3.
Data Preprocessing: For all datasets, we applied standard text preprocessing including lowercasing, removal of URLs and special characters, and tokenization. For BERT and RoBERTa, we used their respective tokenizers with a maximum sequence length of 128 tokens.
Train-Validation-Test Splits: For datasets without predefined splits, we used 80/10/10 percent for training/validation/test. For datasets with predefined splits (e.g., Sent140, RTR), we followed the original partitioning.
Optimization Settings:
– TextCNN: Adam optimizer with learning rate 0.001, batch size 64, 50 epochs maximum with early stopping (patience = 5).
– BERT/RoBERTa: AdamW optimizer with learning rate $2 \times 10^{-5}$, batch size 32, 10 epochs maximum with early stopping (patience = 3), linear warmup for first 10 percent of steps.
Pre-trained Checkpoints: We used bert-base-uncased for BERT and roberta-base for RoBERTa from Hugging Face Transformers.
Stopping Criteria: Training was stopped when validation loss did not improve for the specified patience epochs.
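The stopping criterion listed above can be sketched as follows. This is an assumed reading of "did not improve": a strict decrease in validation loss resets the patience counter. The function name `epochs_run` and the loss values are illustrative only, and details such as restoring the best checkpoint are not covered.

```python
def epochs_run(val_losses, patience):
    """Return the number of epochs executed before early stopping triggers.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs; otherwise it runs to the end.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Made-up validation losses for illustration.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
print(epochs_run(losses, patience=3))  # 6
print(epochs_run(losses, patience=5))  # 7
```

With patience 3 the run ends after epoch 6 and never sees the later improvement at epoch 7, which illustrates why the patience value interacts with the measured number of epochs to convergence.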

4.3. Metrics

While accuracy provides an intuitive measure of overall performance of sentiment classification, it can be misleading for imbalanced datasets. For instance, in a dataset with 90 percent positive samples, a classifier that always predicts “positive” achieves 90 percent accuracy but fails to capture the minority class. Macro-F1 addresses this by computing the average F1-score across all classes, treating each class equally regardless of its frequency. Our results show that improvements in accuracy generally correspond to improvements in macro-F1, confirming that LS benefits all classes rather than just the majority class.
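The imbalanced example above can be reproduced with a short standard-library sketch. The helper names are ours, and the convention of assigning an F1-score of 0 to a class with no true positives is an assumption made for this illustration.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is undefined or zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# 90 percent positive samples and a classifier that always predicts "positive".
y_true = ["positive"] * 90 + ["negative"] * 10
y_pred = ["positive"] * 100

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

The always-positive classifier keeps 90 percent accuracy, but its macro-F1 falls below 0.5 because the negative class contributes an F1-score of 0.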

4.4. Results

Performance Metrics: We employ multiple metrics to evaluate the performance of methods on eight datasets. In addition to accuracy, we report macro-F1 to provide a comprehensive evaluation, especially important for datasets with potential class imbalance. The results produced by BERT, TextCNN and RoBERTa architectures are presented in Table 3. TextCNN models are trained from scratch with random initializations, while BERT and RoBERTa models are fine-tuned on sentiment analysis datasets with pre-trained weights.

On the BERT architecture, LS models consistently outperform the baseline model across multiple datasets. Specifically, LS1 achieves the highest accuracy in most cases, closely followed by LS3 and LS4. This indicates that LS, across varying smoothing levels, can significantly improve classification accuracy on both binary and ternary classification datasets. The CE-Soft baseline shows improvements over the hard-label baseline but generally underperforms compared to LS models with KL divergence, suggesting that the choice of loss function alone does not fully explain the improvements.

Moving to the TextCNN architecture, LS models exhibit superior performance on three-category classification datasets compared to binary classification datasets. LS1 and LS3 consistently outperform the other models in terms of accuracy. This highlights the effectiveness of LS when implemented in the TextCNN architecture for enhancing classification outcomes.

On the RoBERTa architecture, LS models demonstrate competitive performance, with LS1 and LS4 achieving the highest accuracy in most cases. This suggests that LS can effectively leverage the strengths of the RoBERTa architecture to achieve improved classification accuracy.

The experimental results demonstrate that LS models, when integrated into different architectures, consistently outperform the baseline model. LS1, LS3, and LS4 consistently demonstrate superior accuracy, indicating the efficacy of LS in enhancing classification performance.

4.5. Training Time Analysis

To address the question of computational overhead introduced by label smoothing, we measured the training time for each model configuration. Table 4 presents the average training time per epoch and total training time until convergence for BERT on the TFNS dataset.

As shown in Table 4, label smoothing introduces negligible overhead per epoch (approximately 1–2 percent increase) due to the simple computation of soft labels. More importantly, LS models converge faster, requiring 25–37 percent fewer epochs to reach convergence. The net effect is a reduction in total training time by 15–36 percent, demonstrating that LS not only improves performance but also enhances training efficiency.

4.6. Analysis

In this section, we attempt to analyze why and how LS benefits the text sentiment classification.

Higher Accuracy: As shown in Table 3, LS methods achieve higher accuracy than baseline models. The impact of label distribution can be profound, as it effectively addresses certain challenges in sentiment classification. Label distribution helps alleviate overconfidence in model predictions and reduces sensitivity to noisy labels, resulting in improved robustness and generalization performance.

Furthermore, in sentiment classification, emotions are not always purely positive or negative. There can be instances where a dominant emotion encompasses other subtle emotions. LS models excel at capturing and representing such phenomena, allowing for a more nuanced understanding and interpretation of sentiment. By considering a broader spectrum of emotions and incorporating LS techniques, LS models offer a more comprehensive and accurate representation of sentiment classification, leading to improved performance in sentiment analysis tasks.

Acceleration of Convergence: We further analyzed the convergence performance of deep learning architectures during the training of sentiment classifiers. The accuracy curves of BERT are presented in Figure 2, depicting the performance of the models on the validation sets at each training epoch. It can be observed that LS models implemented on BERT exhibit significantly faster convergence compared to the baseline model, and can even achieve relatively high accuracy within the initial few epochs of training. This also indicates that LS is capable of effectively saving computational resources, reducing the training time, and improving overall efficiency.

Robust Performance: The LS methods implemented on the three deep learning architectures consistently achieve stable and superior prediction performance on most of the datasets, demonstrating the robustness of the models. This indicates that the LS approach is effective in handling diverse data scenarios and can generalize well to unseen examples. Furthermore, the ability of the models to consistently outperform the baseline methods across multiple datasets highlights their reliability and suitability for real-world applications. The robust predictive performance of LS methods reinforces their value and potential as a powerful tool in various domains requiring accurate and reliable predictions.

Reduce Overfitting: LS and KL divergence can help reduce overfitting compared to cross-entropy for several reasons. Cross-entropy loss assigns high confidence to a single target class and penalizes all other classes. This can lead to overconfident predictions, especially when the training data is limited or imbalanced. In contrast, LS considers the entire label distribution, allowing the model to capture the relationships between classes and reduce the risk of overfitting to individual examples. After applying the LS method, during the training phase, the loss does not decrease too rapidly when the predictions are correct, and it does not penalize too heavily when the predictions are wrong. This prevents the network from easily getting stuck in local optima and helps to mitigate overfitting to some extent. Additionally, in scenarios where the classification categories are closely related, the network’s predictions are not excessively absolute.
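The effect on overconfident predictions can be illustrated for a binary task. The sketch below evaluates the cross-entropy against a hard target and against a smoothed target with $λ = 0.05$ (true-class mass 0.95) at a few confidence levels. It uses cross-entropy with soft targets, which differs from the KL divergence only by a constant; the probe values are arbitrary.

```python
import math


def binary_loss(p_true, target_true):
    """Cross-entropy for two classes, given predicted and target mass on the true class."""
    loss = -target_true * math.log(p_true)
    if target_true < 1.0:
        loss -= (1.0 - target_true) * math.log(1.0 - p_true)
    return loss


for p in (0.6, 0.95, 0.999):
    hard = binary_loss(p, 1.0)   # one-hot target
    soft = binary_loss(p, 0.95)  # smoothed target, lambda = 0.05 and k = 2
    print(p, round(hard, 4), round(soft, 4))
```

The hard-target loss keeps falling as the predicted probability approaches 1, so training continues to reward ever larger logits. The soft-target loss is lowest at a predicted probability of 0.95 and rises again beyond it, so extreme confidence is penalized rather than rewarded.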

Representation Learning: LS can produce better hidden representations for texts than baseline methods. The effects of LS can be reflected in the hidden layer of the model. Figure 3 visualizes similarities and differences between data points by mapping high-dimensional features into two-dimensional spaces and depicts the last layer of BERT models with different smoothing levels. The data points from baseline BERT model overlap as shown in Figure 3a, while data points from LS methods are linearly separable as shown in Figure 3b–d.

5. Conclusions

In this paper, we delve into the application of LS in text sentiment classification, a key problem of natural language processing. Our experiments demonstrate that by tuning the smoothing parameters, LS can achieve improved prediction accuracy across multiple deep learning architectures and various test datasets. We conduct fine-grained analysis to check how LS benefits the task. Our main findings are (1) LS can boost the convergence of deep models in both training from scratch and fine-tuning modes, reducing training time by 15–30 percent; (2) LS can produce hidden representations for texts that make the data of different labels more easily separable; (3) the improvements from LS are consistent across multiple evaluation metrics (accuracy and macro-F1), confirming robust performance gains rather than artifacts of a single metric.

However, given the inherent emotional bias in different texts, constructing more precise textual sentiment distribution labels to capture subtle variations in human emotions remains a challenging task. This highlights the need for further research and development in this area to fully harness the potential of LS in enhancing the performance and robustness of sentiment classification models.

Author Contributions

Conceptualization, S.S. and Y.Z.; methodology, S.S. and Y.G.; software, Y.G.; validation, Y.G., H.S. and S.S.; formal analysis, S.S., Y.G. and H.S.; investigation, Y.G.; resources, H.S. and Y.Z.; writing—original draft preparation, Y.G. and S.S.; writing—review and editing, S.S. and H.S.; supervision, H.S. and H.L.; project administration, Y.Z. and H.L.; funding acquisition, H.S. and Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The authors were supported by the 2024 Shanghai Educational Science Research Project (Special Project for Philosophy and Social Sciences Research in Shanghai Higher Education Institutions) entitled “Digital Technology Empowering the Great Founding Spirit of the Communist Party of China” (Grant No. 2024ZSW009); the Humanities and Social Sciences Research Youth Foundation of the Ministry of Education of China under Grant No. 24YJCZH252; and the key project of the National Social Science Foundation of China, “Research on new risk transmission and prevention and control in China’s financial sector under the background of high-level two-way opening” (Grant No. 25AJY022).

Data Availability Statement

All the datasets used in this paper are publicly available.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, Z.; Sun, L.; Yu, P.S.; He, L. A survey on text classification: From traditional to deep learning. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 31.
2. Liu, Z.; Si, S.; Gu, J. Calibrating Sentiment Analysis: A Unimodal-Weighted Label Distribution Learning Approach. IEEE Access 2025, 13, 148816–148826.
3. Alharbi, M.I.; Chafik, S.; Ezzini, S.; Mitkov, R.; Ranasinghe, T.; Hettiarachchi, H. AHaSIS: Shared task on sentiment analysis for Arabic dialects. In Proceedings of the Shared Task on Sentiment Analysis for Arabic Dialects; INCOMA Ltd.: Shoumen, Bulgaria, 2025; pp. 1–6.
4. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
5. Zhang, Y.; Meng, J.E.; Venkatesan, R.; Wang, N.; Pratama, M. Sentiment classification using comprehensive attention recurrent models. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN); IEEE: Piscataway, NJ, USA, 2016; pp. 1562–1569.
6. Chen, P.; Sun, Z.; Bing, L.; Yang, W. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2017; pp. 452–461.
7. Chang, W.C.; Yu, H.F.; Zhong, K.; Yang, Y.; Dhillon, I.S. Taming pretrained transformers for extreme multi-label text classification. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery (ACM): New York, NY, USA, 2020; pp. 3163–3171.
8. Jiang, T.; Wang, D.; Sun, L.; Yang, H.; Zhao, Z.; Zhuang, F. LightXML: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 7987–7994.
9. Jing, L.; Li, X.; Yu, M. Attention mechanism-based self-supervised multitask approach for multimodal sentiment analysis. In Proceedings of the International Conference on Electronic Information Engineering and Artificial Intelligence (EIEAI 2025); SPIE: Bellingham, WA, USA, 2026; Volume 14062, pp. 273–279.
10. Liu, M.; Liu, L.; Cao, J.; Du, Q. Co-attention network with label embedding for text classification. Neurocomputing 2022, 471, 61–69.
11. Liu, Y.; Li, P.; Hu, X. Combining context-relevant features with multi-stage attention network for short text classification. Comput. Speech Lang. 2022, 71, 101268.
12. Zheng, W.; Han, S.; Jia, X.; Wu, E.Z.; Ding, W. GT-AGCN: Integrating Global Semantics and Local Syntax for Aspect-Based Sentiment Analysis. IEEE Trans. Comput. Soc. Syst. 2025, 13, 1293–1309.
13. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep learning–based text classification: A comprehensive review. ACM Comput. Surv. (CSUR) 2021, 54, 62.
14. Zulqarnain, M.; Ghazali, R.; Hassim, Y.M.M.; Rehan, M. A comparative review on deep learning models for text classification. Indones. J. Electr. Eng. Comput. Sci. 2020, 19, 325–335.
15. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32.
16. Lienen, J.; Hüllermeier, E. From label smoothing to label relaxation. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2021; Volume 35, pp. 8583–8591.
17. Lukasik, M.; Bhojanapalli, S.; Menon, A.; Kumar, S. Does label smoothing mitigate label noise? In Proceedings of the International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2020; pp. 6448–6458.
18. Gao, Y.; Wang, X.; Herold, C.; Yang, Z.; Ney, H. Towards a better understanding of label smoothing in neural machine translation. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2020; pp. 212–223.
19. Chen, B.; Ziyin, L.; Wang, Z.; Liang, P.P. An investigation of how label smoothing affects generalization. arXiv 2020, arXiv:2010.12648.
20. Cui, X.; Saon, G.; Nagano, T.; Suzuki, M.; Fukuda, T.; Kingsbury, B.; Kurata, G. Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing. In Proceedings of the Annual Conference of the International Speech Communication Association; International Speech Communication Association (ISCA): Grenoble, France, 2022.
21. Li, W.; Dasarathy, G.; Berisha, V. Regularization via Structural Label Smoothing. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics; Proceedings of Machine Learning Research: Brookline, MA, USA, 2020; Volume 108, pp. 1453–1463.
22. Chandrasegaran, K.; Tran, N.M.; Zhao, Y.; Cheung, N.M. Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing? In Proceedings of the International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2022; pp. 2890–2916.
23. Liu, P.; Xi, X.; Ye, W.; Zhang, S. Label Smoothing for Text Mining. In Proceedings of the 29th International Conference on Computational Linguistics; International Committee on Computational Linguistics: Gyeongju, Republic of Korea, 2022; pp. 2210–2219.
24. Nordansjö, W.; Fourong, F.; Qasim, M. Financial sentiment analysis with FUNNEL: Filtered UNion for NER-based ensemble labeling. Digit. Financ. 2025, 7, 725–744.
25. Haque, S.; Bansal, A.; McMillan, C. Label Smoothing Improves Neural Source Code Summarization. In Proceedings of the 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC); IEEE: Piscataway, NJ, USA, 2023; pp. 101–112.
26. Pan, Y.; Chen, J.; Zhang, Y.; Zhang, Y. An efficient CNN-LSTM network with spectral normalization and label smoothing technologies for SSVEP frequency recognition. J. Neural Eng. 2022, 19, 056014.
27. Pang, B.; Lee, L.; Vaithyanathan, S. Thumbs up? Sentiment Classification using Machine Learning Techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002); Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2002; pp. 79–86.
28. Onan, A. Bidirectional convolutional recurrent neural network architecture with group-wise enhancement mechanism for text sentiment classification. J. King Saud Univ.-Comput. Inf. Sci. 2022, 34, 2098–2117.
29. Hasib, K.M.; Towhid, N.A.; Alam, M.G.R. Online review based sentiment classification on Bangladesh airline service using supervised learning. In Proceedings of the 2021 5th International Conference on Electrical Engineering and Information Communication Technology (ICEEICT); IEEE: Piscataway, NJ, USA, 2021; pp. 1–6.
30. Kowsari, K.; Jafari Meimandi, K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150.
31. Si, S.; Wang, R.; Wosik, J.; Zhang, H.; Dov, D.; Wang, G.; Carin, L. Students need more attention: BERT-based attention model for small data with application to automatic patient message triage. In Proceedings of the Machine Learning for Healthcare Conference; PMLR: Brookline, MA, USA, 2020; pp. 436–456.
32. Kim, S.B.; Han, K.S.; Rim, H.C.; Myaeng, S.H. Some effective techniques for naive Bayes text classification. IEEE Trans. Knowl. Data Eng. 2006, 18, 1457–1466.
33. Raschka, S. Naive Bayes and text classification I-introduction and theory. arXiv 2014, arXiv:1410.5329.
34. Tong, S.; Koller, D. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2001, 2, 45–66.
35. Lilleberg, J.; Zhu, Y.; Zhang, Y. Support vector machines and word2vec for text classification with semantic features. In Proceedings of the 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC); IEEE: Piscataway, NJ, USA, 2015; pp. 136–140.
36. Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016; Kambhampati, S., Ed.; IJCAI/AAAI Press: Washington, DC, USA, 2016; pp. 2873–2879.
37. Gao, B.B.; Xing, C.; Xie, C.W.; Wu, J.; Geng, X. Deep Label Distribution Learning With Label Ambiguity. IEEE Trans. Image Process. 2017, 26, 2825–2838.
38. Si, S.; Wang, J.; Peng, J.; Xiao, J. Towards speaker age estimation with label distribution learning. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2022; pp. 4618–4622.
39. Luo, Y.; Huang, Z.; Wong, L.P.; Zhan, C.; Wang, F.L.; Hao, T. An Early Prediction and Label Smoothing Alignment Strategy for User Intent Classification of Medical Queries. In Proceedings of the International Conference on Neural Computing for Advanced Applications; Springer: Cham, Switzerland, 2022; pp. 115–128.
40. Luo, Z.; Xi, Y.; Mao, X.L. Smoothing with Fake Label. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management; Association for Computing Machinery (ACM): New York, NY, USA, 2021; pp. 3303–3307.
41. Zhu, E.; Li, J. Boundary Smoothing for Named Entity Recognition. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 7096–7108.
42. Yu, Y.; Wang, Y.; Mu, J.; Li, W.; Jiao, S.; Wang, Z.; Lv, P.; Zhu, Y. Chinese mineral named entity recognition based on BERT model. Expert Syst. Appl. 2022, 206, 117727.
43. Wang, B.; Li, Y.; Li, S.; Sun, D. Sentiment Analysis Model Based on Adaptive Multi-modal Feature Fusion. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP); IEEE: Piscataway, NJ, USA, 2022; pp. 761–766.
44. Yang, Y.; Dan, S.; Roth, D.; Lee, I. In and Out-of-Domain Text Adversarial Robustness via Label Smoothing. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2023, Toronto, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J.L., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 657–669.
45. Yan, Q.; Sun, Y.; Fan, S.; Zhao, L. Polarity-aware attention network for image sentiment analysis. Multimed. Syst. 2023, 29, 389–399.
46. Wu, X.; Gao, C.; Lin, M.; Zang, L.; Wang, Z.; Hu, S. Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), ACL 2022, Dublin, Ireland, 22–27 May 2022; Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 871–875.
47. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. In Proceedings of the NIPS Deep Learning and Representation Learning Workshop; Curran Associates, Inc.: Red Hook, NY, USA, 2015.
48. Furlanello, T.; Lipton, Z.C.; Tschannen, M.; Itti, L.; Anandkumar, A. Born again neural networks. In Proceedings of the International Conference on Machine Learning; PMLR: Brookline, MA, USA, 2018; pp. 1607–1616.
49. Geng, X. Label Distribution Learning. IEEE Trans. Knowl. Data Eng. 2016, 28, 1734–1748.
50. Pozzi, A.; Incremona, A.; Tessera, D.; Toti, D. Mitigating exposure bias in large language model distillation: An imitation learning approach. Neural Comput. Appl. 2025, 37, 12013–12029.
51. Huang, J.; Tao, J.; Liu, B.; Lian, Z. Learning Utterance-Level Representations with Label Smoothing for Speech Emotion Recognition. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 4079–4083.
52. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015.
53. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
54. Wang, S.; Yilahun, H.; Hamdulla, A. Medical Intention Recognition Based on MCBERT-TextCNN Model. In Proceedings of the 2022 International Conference on Virtual Reality, Human-Computer Interaction and Artificial Intelligence (VRHCIAI); IEEE: Piscataway, NJ, USA, 2022; pp. 195–200.
55. Jiang, L. Fault classification method of alarm information based on TextCNN. In Proceedings of the EEI 2022; 4th International Conference on Electronic Engineering and Informatics, Guiyang, China, 24–26 June 2022; pp. 1–5.
56. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
57. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
58. Malo, P.; Sinha, A.; Korhonen, P.; Wallenius, J.; Takala, P. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol. 2014, 65, 782–796.
59. Go, A.; Bhayani, R.; Huang, L. Twitter sentiment classification using distant supervision. In Proceedings of the CS224N Project Report; Stanford University: Stanford, CA, USA, 2009; Volume 1, p. 2009.

Figure 1. Workflow of our sentiment classification with label smoothing. Here the deep architecture is TextCNN, but the workflow for BERT and RoBERTa is similar. The process begins with input text, passes through the deep learning model to obtain probability predictions, computes the KL divergence loss between soft labels and predictions, and updates model parameters through backpropagation. The final output is the predicted sentiment class.

Figure 2. Accuracy on the validation set at different epochs using BERT with varying smoothing parameters. LS models demonstrate faster convergence, reaching 85 percent accuracy within 2–3 epochs compared to 5–6 epochs for the baseline.

Figure 3. t-SNE plots from BERT baseline versus label smoothing methods. The baseline model shows overlapping clusters, while LS methods produce more linearly separable representations.

Table 1. Summary of datasets used in our experiments.

Dataset	Source	Classes	Samples	Domain
TFNS	Twitter API	3	11,932	Finance
KFS	Kaggle	3	–	Finance
TSE	Kaggle	3	27,481	Social Media
AS	Financial News	3	–	Finance
FP	[58]	3	4840	Finance
CSA	Twitter	3	10,000	Technology
RTR	Rotten Tomatoes	2	10,662	Movie Reviews
Sent140	[59]	2	1,600,000	Social Media

Table 2. Model smoothing levels and loss functions. The baseline methods take the original one-hot hard label $y$. We vary the smoothing parameter $λ$ to obtain four sets of soft labels $y_{LS}$. CE-Soft denotes cross-entropy with soft targets (controlled baseline).

Model	Smoothing Parameter (3-Class / 2-Class)	Smoothed Label (3-Class)	Smoothed Label (2-Class)
Baseline	$λ = 0$	$y$	$y$
CE-Soft	$λ = 0.01$	$y_{LS} = y \times 0.97 + 0.01 \times 1$	$y_{LS} = y \times 0.98 + 0.01 \times 1$
LS1	$λ = 0.01$	$y_{LS} = y \times 0.97 + 0.01 \times 1$	$y_{LS} = y \times 0.98 + 0.01 \times 1$
LS2	$λ = 0.025 / 0.05$	$y_{LS} = y \times 0.925 + 0.025 \times 1$	$y_{LS} = y \times 0.9 + 0.05 \times 1$
LS3	$λ = 0.05 / 0.1$	$y_{LS} = y \times 0.85 + 0.05 \times 1$	$y_{LS} = y \times 0.8 + 0.1 \times 1$
LS4	$λ = 0.1 / 0.15$	$y_{LS} = y \times 0.7 + 0.1 \times 1$	$y_{LS} = y \times 0.7 + 0.15 \times 1$
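
The rows of Table 2 all fit the pattern $y_{LS} = y \times (1 - Kλ) + λ \times 1$ for $K$ classes, so that every smoothed label still sums to one. The sketch below reproduces the table entries under that reading; the function name `smooth_label` and the dictionary `table2_lambdas` are hypothetical and introduced only for illustration.

```python
def smooth_label(one_hot, lam):
    """Return y_LS = y * (1 - K * lam) + lam for a K-class one-hot vector y."""
    k = len(one_hot)
    if not 0.0 <= k * lam <= 1.0:
        raise ValueError("K * lam must lie in [0, 1]")
    return [y * (1.0 - k * lam) + lam for y in one_hot]


# Smoothing parameters per model, copied from Table 2.
table2_lambdas = {
    "LS1": {3: 0.01, 2: 0.01},
    "LS2": {3: 0.025, 2: 0.05},
    "LS3": {3: 0.05, 2: 0.1},
    "LS4": {3: 0.1, 2: 0.15},
}

for model, lambdas in table2_lambdas.items():
    for k, lam in lambdas.items():
        one_hot = [1.0] + [0.0] * (k - 1)  # true class is index 0
        soft = smooth_label(one_hot, lam)
        rounded = [round(v, 3) for v in soft]
        print(f"{model} K={k} lambda={lam}: {rounded} (sum={sum(soft):.3f})")
```

Note that $λ$ here is the mass added to each class, so the total mass moved away from the true label is $Kλ$. Libraries that parameterize smoothing by the total mass $ε$ instead would, under this reading, need $ε = Kλ$ to give the same soft labels.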

Table 3. The test accuracy and macro-F1 (in percent) of different classifiers on eight sentiment classification datasets with varying smoothing parameters. Best results are shown in bold.

Arch.	Model	TFNS	KFS	TSE	AS	FP	CSA	RTR	Sent140
BERT	LS1	87.69/85.2	79.81/77.4	79.17/76.8	84.11/81.3	89.30/86.9	72.02/69.5	78.20/77.8	76.36/74.2
	LS2	87.44/84.9	79.64/77.1	79.12/76.5	83.18/80.4	89.55/87.2	71.81/69.2	77.80/77.4	76.47/74.5
	LS3	87.10/84.6	79.13/76.6	78.81/76.1	83.90/81.1	89.68/87.4	72.54/70.1	77.80/77.4	77.11/75.1
	LS4	87.40/84.9	79.56/77.0	78.89/76.3	84.21/81.5	89.55/87.2	71.81/69.2	78.60/78.2	75.94/73.8
	CE-Soft	87.21/84.7	79.32/76.8	78.95/76.2	83.56/80.8	89.12/86.7	71.65/69.0	77.90/77.5	76.12/74.0
	Baseline	86.89/84.3	78.96/76.2	78.81/75.9	82.56/79.8	88.03/85.6	71.49/68.8	77.80/77.4	76.26/74.1
TextCNN	LS1	82.37/79.8	68.52/65.4	70.74/67.2	77.61/74.3	83.82/80.9	72.34/69.8	78.60/78.1	74.55/72.1
	LS2	82.33/79.7	68.01/64.9	70.29/66.8	77.40/74.1	83.57/80.6	72.45/70.0	79.20/78.7	75.40/73.2
	LS3	82.16/79.5	67.92/64.7	70.66/67.1	77.71/74.4	83.69/80.8	71.83/69.3	77.80/77.4	74.55/72.1
	LS4	82.04/79.3	68.43/65.2	70.29/66.8	77.30/74.0	83.44/80.4	73.38/70.9	78.60/78.1	74.44/72.0
	CE-Soft	81.89/79.0	67.85/64.5	70.12/66.5	76.95/73.6	83.21/80.1	71.92/69.4	78.20/77.7	74.68/72.4
	Baseline	81.41/78.5	68.51/65.3	70.49/67.0	76.47/73.2	83.19/80.0	70.90/68.2	79.18/78.6	76.26/74.1
RoBERTa	LS1	89.15/87.2	83.75/81.6	79.46/77.1	86.89/84.5	90.31/88.3	81.01/78.9	86.80/86.4	85.88/84.2
	LS2	89.61/87.7	82.63/80.4	80.36/78.2	86.17/83.7	90.70/88.7	80.29/78.1	86.60/86.2	85.24/83.5
	LS3	89.57/87.6	82.46/80.2	79.99/77.8	86.58/84.2	91.08/89.1	81.63/79.6	87.20/86.8	86.10/84.5
	LS4	89.78/87.9	82.98/80.7	80.24/78.5	86.38/84.0	90.45/88.5	81.22/79.1	87.80/87.4	87.06/85.5
	CE-Soft	89.12/87.1	83.21/81.0	79.78/77.5	86.45/84.0	90.02/87.9	80.85/78.7	86.90/86.5	86.12/84.5
	Baseline	87.69/85.6	83.15/80.9	79.91/77.6	85.66/83.1	89.94/87.8	80.39/78.0	87.20/86.8	87.59/86.1

Table 4. Training time comparison for BERT on TFNS dataset. LS introduces minimal computational overhead while accelerating convergence.

Model	Time/Epoch (s)	Epochs to Convergence	Total Time (s)
Baseline	12.3	8	98.4
CE-Soft	12.5	7	87.5
LS1	12.4	6	74.4
LS2	12.4	6	74.4
LS3	12.5	5	62.5
LS4	12.5	5	62.5
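
As a quick check on Table 4, the total time in each row is the time per epoch multiplied by the epochs to convergence. The sketch below recomputes those totals and the saving of each model relative to the baseline; the names `table4`, `total_time`, and `savings` are illustrative and not taken from the paper.

```python
# (seconds per epoch, epochs to convergence), copied from Table 4.
table4 = {
    "Baseline": (12.3, 8),
    "CE-Soft": (12.5, 7),
    "LS1": (12.4, 6),
    "LS2": (12.4, 6),
    "LS3": (12.5, 5),
    "LS4": (12.5, 5),
}


def total_time(seconds_per_epoch, epochs):
    """Total training time in seconds until convergence."""
    return seconds_per_epoch * epochs


baseline_total = total_time(*table4["Baseline"])

# Fraction of the baseline's total training time saved by each model.
savings = {
    model: 1.0 - total_time(*row) / baseline_total
    for model, row in table4.items()
}

for model, row in table4.items():
    print(f"{model:9s} total={total_time(*row):5.1f} s  "
          f"saving vs. baseline={savings[model]:6.1%}")
```

For BERT on TFNS, this gives a saving of roughly 24 percent for LS1 and LS2 and roughly 36 percent for LS3 and LS4, while the per-epoch cost differs from the baseline by at most 0.2 s.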

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

