Article

Dual-Path Short Text Classification with Data Optimization

School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 11015; https://doi.org/10.3390/app152011015
Submission received: 9 September 2025 / Revised: 1 October 2025 / Accepted: 11 October 2025 / Published: 14 October 2025
(This article belongs to the Special Issue Natural Language Processing in the Era of Artificial Intelligence)

Abstract

To address the problems of fragmented information, missing context, and hard-to-capture features in short texts, this paper proposes a dual-path classification model that combines word-level and sentence-level feature information. Our method employs the BERT pre-trained model to obtain word vectors, and uses an attention mechanism and a BiGRU network to extract local key information and global semantic information, respectively. To encourage the model to focus on hard-to-learn samples during training, a novel hybrid loss function is constructed as the optimization objective, and to address common quality issues in training data, a text data optimization method that integrates data filtering and augmentation techniques is proposed. This method aims to further enhance model performance by improving the quality of the input data. Experimental results on three different short text datasets show that our proposed model outperforms existing models (such as Att + BiGRU and BERT + Att), with an average F1 score exceeding 90%. Moreover, the performance metrics of the model improved on the datasets optimized with the proposed data optimization method compared with the original datasets, demonstrating the effectiveness of this method in enhancing training data quality and improving model performance.

1. Introduction

With the continuous development of the Internet and the rise of various social media platforms, the public can freely express their opinions and viewpoints online, resulting in a massive amount of textual information. However, the vast scale of internet text data and the presence of a large amount of false and useless information make it crucial to quickly filter out irrelevant information and efficiently organize, understand, and mine the valuable information. To achieve efficient processing and rapid application of large-scale text data, text classification technology has emerged, becoming a core research focus in the field of deep learning. Among various types of text data, short texts have gradually become the most mainstream form of information exchange in daily life due to their concise and direct nature. Therefore, short text classification technology has also become a current research hotspot with applications in information retrieval and filtering, public opinion monitoring, market analysis, and more.
However, unlike conventional types of text, short texts often exhibit characteristics such as brevity, ambiguous context, and sparse contextual information [1]. As a result, conventional classification models struggle to effectively capture the intrinsic semantic information and understand the deeper meaning of these texts. Additionally, short texts, especially those collected from the internet or social media, may contain numerous abbreviations, internet slang, typos, and polysemous words [2]. Without preprocessing, traditional text analysis models are likely unable to recognize these unique usages. These inherent attributes of short texts significantly increase the difficulty of data preprocessing and feature extraction, leading to the unsatisfactory performance of many popular algorithms that work well on conventional texts in short text classification tasks. Therefore, constructing accurate and efficient classification models to rapidly extract genuinely valuable information from massive short text data is a critical issue in the field of Natural Language Processing (NLP) and holds profound significance.
Based on the characteristics of short texts and the shortcomings of existing models, our main contributions are as follows:
(1)
We construct a dual-path short text classification model that can fully extract semantic information of different granularities from short texts with limited information, achieving precise classification and efficient management of short text data.
(2)
To address issues such as poor training data quality and weak model generalization ability, an input data optimization method is proposed on the basis of the constructed dual-path classification model to improve model performance and enhance classification accuracy and generalization ability.
(3)
We compare our model with existing models on multiple datasets; the experiments in Section 5 show that our model achieves better performance.

2. Related Work

In recent years, with the rapid development of deep learning technology, various deep learning-based text classification models have evolved quickly, becoming standard solutions in the field of NLP. Additionally, numerous optimization methods for text classification have continued to emerge. This section introduces current research on deep learning-based text classification and data optimization methods. We review only the most relevant work here; many other important contributions are necessarily omitted.

2.1. Text Classification

In 2014, TextCNN [3], a model based on Convolutional Neural Networks (CNNs) [4], was proposed, becoming a foundational work for CNN-based models in text classification tasks. This work was the first to demonstrate how to use one-dimensional CNN to classify sentence-level text. By extracting N-gram features, it effectively utilized convolutional layers and max-pooling layers to capture both local and global features of the text, achieving unexpectedly excellent results. Due to the strong performance of TextCNN in text classification tasks, more researchers in the NLP field began to apply CNN-based models to solve text classification tasks, improving the models based on the characteristics of text data. Unlike CNN-based models, Recurrent Neural Networks (RNN) [5] can relate the input at the current time step to previous states and capture long-distance dependencies, making them very suitable for processing various sequential data. Therefore, RNNs are more widely used in text classification tasks compared to CNN-based models. A study [6] explored how to apply RNNs to text classification tasks and combined them with Multi-Task Learning (MTL) strategies to improve classification performance, establishing the classic TextRNN model. Variants of RNN, such as LSTM [7] and GRU [8], which effectively mitigate the gradient vanishing and exploding issues of traditional RNN models, are also widely applied in text classification tasks. Another study [9] compared the performance of LSTM and GRU networks in text classification tasks, finding that GRU outperforms LSTM in overall encoding motivation and can achieve better results in most cases. Consequently, GRU is more widely used in the field of text classification.
Although RNN-based models can effectively capture and process dependency information in sequential data, they struggle to focus on all important information in long sequences simultaneously. As a result, researchers proposed the attention mechanism [10] as an improvement, allowing the model to dynamically allocate weights and focus on information at different positions in the sequence, thereby significantly enhancing the performance of sequential tasks. The attention mechanism in deep learning is similar to human attention, enabling the computer to focus more on critical information, allowing the model to capture important details.
The earliest application of the attention mechanism to text classification was presented in a study [11], which introduced sentence-level and document-level attention mechanisms to distinguish between key information and less important content. This helped the model capture the hierarchical structure within the document and improve classification performance. Inspired by this, subsequent researchers have continually expanded and refined this idea. More works [12,13] have optimized text feature extraction and classification performance by incorporating attention mechanisms into RNN-based models, achieving excellent results.
In 2017, the Google team proposed the Transformer model [14], a revolutionary architecture that processes sequential data entirely based on attention mechanisms rather than recurrent structures, significantly improving computational efficiency and model performance. This model marked a milestone in the development of sequence modeling technology in the NLP field. The successful application of Transformers also spurred revolutionary progress in various pre-trained models based on this architecture within NLP. In 2018, the Google team introduced the BERT (Bidirectional Encoder Representations from Transformers) [15] pre-trained model, achieving state-of-the-art results in various NLP tasks and marking a significant breakthrough for pre-trained models in the NLP field. The BERT pre-trained model utilizes unsupervised learning for pre-training, followed by fine-tuning on specific downstream tasks, demonstrating excellent transfer learning capabilities. This approach greatly reduces the amount of data and computational resources required to train models from scratch for specific tasks. Currently, many researchers in the NLP field have proposed numerous improvements to BERT [16,17,18,19,20]. Ref. [20] introduces nBERT, a fine-tuned BERT model integrated with the NRC Emotion Lexicon, to elevate emotion recognition in psychotherapy transcripts; Ref. [21] proposes a novel framework for emotion-based video content analysis. These pre-trained models have achieved significant success in text classification, question-answering systems, named entity recognition, sentiment analysis, semantic parsing, and many other areas. They have significantly advanced NLP technology and become standard tools in contemporary NLP research and industry.
Compared to traditional machine learning models and RNN- or CNN-based text classification models, pre-trained models can better understand and handle complex language structures, thus performing better in classification tasks. However, traditional deep learning models still have unique advantages in feature extraction. Therefore, combining pre-trained models with conventional deep learning models is a current research focus in the NLP field. Existing work often integrates pre-trained models as the word embedding layer with sequence models like RNNs to enhance feature extraction effectiveness. Although this approach improves the semantic representation capability of the models, there are still inherent limitations in the actual feature extraction process, such as the difficulty in effectively integrating global semantic information and local detailed features simultaneously. Additionally, various deep learning architectures and their combined models still face critical issues, including sensitivity to noise, poor generalization performance, and lack of robustness. These challenges necessitate further exploration.

2.2. Data Optimization

Although current deep learning models can achieve high accuracy, the ultimate performance and generalization ability of these models largely depend on the quality of the data used. Even the most advanced models and algorithms cannot perform well without high-quality data. This is why 80–90% of the time in solving various machine learning tasks is spent on data preparation [22]. Therefore, when addressing machine learning tasks, besides considering the rationality and applicability of the model structure, the quality of the data is also an important factor to consider. Consequently, researchers in the field of NLP are increasingly focusing on data preprocessing and optimization. These studies primarily involve data filtering, data augmentation, and other related methods.

2.2.1. Data Filtering

Data filtering refers to establishing a process to verify data quality, promptly identifying issues in the data, and taking measures to address them. It is one of the methods for data optimization. A study [23] designed a data selection method based on adaptive thresholds and noise scores. This method adaptively defines the noise score of samples according to the characteristics of different datasets and different density regions. Through an integrated filtering and iterative process, it detects and filters potential noisy samples from the original training set, significantly improving model performance.
Ref. [24] introduced the data map method, which calculates the confidence of each sample in its true category and its variability over different iterations based on the model’s performance on individual samples during each training iteration. This method selects samples with different characteristics for training, effectively enhancing the model’s robustness and out-of-distribution generalization ability. Additionally, Ref. [25] enumerated various sources of dataset bias and their impact on model performance and robustness. Based on this, the researchers created a general machine learning data quality index formula to help filter out high-quality data, enabling the model to exhibit better generalization ability and perform well on out-of-distribution data.
In addition to quantitative filtering methods for sample quality, filtering data while training the model is also a common strategy. These methods aim to select high-quality samples from all data and use them to update the model, featuring real-time processing and dynamic adjustment. Ref. [26] proposed MentorNet, which involves pre-training an additional network and then using this network to select clean instances to guide training. By learning data-driven curricula, the basic deep neural network focuses more on samples likely to have correct labels, effectively improving model robustness and generalization performance. Ref. [27] introduced the Co-teaching framework based on the memory effect of deep neural networks, where two deep neural networks are trained simultaneously. Because the two networks have different learning capabilities, they can filter out different types of errors introduced by incorrect labels. They teach each other during each mini-batch, achieving significant success in filtering erroneous samples. Ref. [28] used fuzzy clustering to divide the dataset into several similar regions and independently applied genetic algorithms within each cluster for sample selection. The results of overlapping clusters were aggregated through ensemble voting, effectively removing mislabeled data and improving model performance.

2.2.2. Data Augmentation

Due to the inherently discrete and abstract nature of natural language, even small changes can lead to significant shifts in meaning. Consequently, effective data augmentation methods are less prevalent in the NLP field compared to the computer vision field, making them a subject worthy of in-depth research. Ref. [29] proposed a simple data augmentation technique called EDA (Easy Data Augmentation), which includes four basic but powerful operations: synonym replacement, random insertion, random swap, and random deletion. These techniques can effectively enhance the performance of text classification models and have been widely applied. Although EDA is simple and efficient, its randomness can easily lead to the loss of textual information. To address this issue, Ref. [30] introduced a simpler data augmentation technique based on EDA called AEDA (An Easier Data Augmentation). This method randomly inserts punctuation marks into the original text, making it simpler than EDA while preserving the original text’s information and offering better generalization performance. To ensure the preservation of the original sentence meaning, Ref. [31] proposed the context augmentation method. This method uses the context of the words to be replaced and employs a language model to predict the replacement word, resulting in a reconstructed sentence. By comprehensively considering the context information of the sentence, this method improves the effectiveness of text classification. Ref. [32] introduced the back-translation method to generate diverse sentences. This method translates the original English text into French and then translates it back into English, ensuring that the sentences are semantically similar but structurally different, thus generating richer augmented data.
Currently, most data optimization efforts in NLP focus on a single method, such as data filtering or data augmentation. Data filtering methods aim to select high-quality and representative instances from the original dataset while removing mislabeled or ambiguous data. However, this approach can significantly reduce the dataset size and decrease data diversity, which may limit further improvements in model performance. On the other hand, data augmentation methods can automatically generate large amounts of data, but they typically make improvements based on the original data. Consequently, they may expand datasets with mislabeled data, lowering the overall quality of the dataset. Therefore, effectively combining the advantages of these two data optimization methods to improve the quality of datasets for NLP tasks is a current research focus.

3. Construction of a Dual-Path Short Text Classification Model with Hybrid Loss

In this section, we first introduce how to construct the model. To address the challenges of sparse contextual information and difficult feature extraction in short texts, this paper proposes a dual-path short text classification model from both word and sentence perspectives. This model aims to improve information extraction and context understanding in short text classification tasks.
On one hand, through the word-level path, the model can capture the semantics of each word in the text, helping it to better understand subtle differences and important content. On the other hand, through the sentence-level path, the model integrates word-level information to obtain sentence-level features, effectively capturing the global semantic information of the text, leading to a more comprehensive understanding of short texts. Figure 1 illustrates the overall structure of the dual-path model.
The dual-path model constructed in this paper consists of four layers: the embedding layer, the feature extraction layer, the feature fusion layer, and the classification layer. The following sections introduce the modules and corresponding principles of each layer in the model.

3.1. Embedding Layer

This layer uses the BERT pre-trained model to convert the original short text data into word vectors. Due to the massive corpus used during the pre-training phase, various pre-trained models, including BERT, can effectively understand the complex language structures in texts, capturing rich contextual and semantic information. Consequently, they perform excellently in different NLP tasks.
The input of BERT is composed of three parts: Token Embedding, Segment Embedding, and Position Embedding, which, respectively, represent word information, sentence information, and position information, as shown in Figure 2.
The model first preprocesses the given input text sequence $X = \{X_1, X_2, \ldots, X_N\}$ of length $N$ by inserting a [CLS] token at the beginning to indicate the start of the sentence, which contains the semantic information of the entire text, and a [SEP] token at the end to mark the end of the sentence, used to distinguish sentence boundaries during training. Then, based on the reconstructed input sequence, the model generates the token embedding $E_i^{\text{token}}$, segment embedding $E_i^{\text{segment}}$, and position embedding $E_i^{\text{position}}$ for each character. These three components are combined to obtain the final vector representation $E_i$ of each word:
$$E_i = E_i^{\text{token}} + E_i^{\text{segment}} + E_i^{\text{position}}$$
Character encoding converts each character in the text into a vector representation. Sentence encoding adds markers to each sentence, enabling the BERT model to distinguish between different sentences or paragraphs. Positional encoding provides a vector representation for the position of each word in a sentence, ensuring that the model understands word order information and the structure of the text. By integrating these three components, the input to the BERT model is formed.
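The composition of these three embeddings can be illustrated with the following minimal PyTorch sketch. It is not the actual BERT implementation (the transformers library performs this step internally); the vocabulary size, maximum length, and hidden dimension roughly correspond to bert-base-chinese and are shown only for illustration.

```python
import torch
import torch.nn as nn

class BertInputEmbedding(nn.Module):
    """Sketch of BERT's input representation: token + segment + position embeddings."""
    def __init__(self, vocab_size=21128, max_len=512, num_segments=2, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(num_segments, hidden)
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # E_i = E_i^token + E_i^segment + E_i^position
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
```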

3.2. Feature Extraction Layer

In the feature extraction layer, two paths are constructed to extract word-level and sentence-level feature information, respectively. In the word-level feature extraction path, an attention mechanism is applied to the output of the BERT model, allowing the model to dynamically adjust the focus on different words in the input sequence, thereby highlighting the key information in the text. In the sentence-level feature extraction path, a BiGRU model is used for temporal information modeling, capturing the contextual semantics and contextual information within sentences, thus providing a more comprehensive understanding of short texts.
(1)
Word-Level Feature Extraction
Since the BERT model has been pre-trained on a large corpus and has a rich and accurate understanding of word semantics, the main task in the word-level feature extraction path is to input the word vectors obtained from the embedding layer into the attention module to assign weights, further highlighting the key information in the text.
The attention mechanism is a model that simulates human brain attention. When humans observe things, they first notice the most attention-grabbing parts. The attention mechanism applies this pattern by assigning higher weights to more important information, thereby highlighting the content that needs focus. This not only effectively enhances the model’s understanding of text semantics but also alleviates the issue of information overload. The model is shown in Figure 3.
For the word vectors $T_1, T_2, \ldots, T_N$ output by the embedding layer, an attention score function is used to calculate the relevance between each input vector and the query vector $\text{Query}$, yielding an attention score for each input vector:
$$\text{Score}(\text{Query}, T_i) = \text{ReLU}(W T_i + b)$$
where $W$ is a weight matrix learned by the model and $b$ is a bias. After obtaining the attention score of each word vector, the Softmax function is used for normalization to obtain the attention weight of each word vector:
$$\alpha_i = \text{Softmax}\!\left(\text{Score}(\text{Query}, T_i)\right) = \frac{\exp\!\left(\text{Score}(\text{Query}, T_i)\right)}{\sum_{j=1}^{M} \exp\!\left(\text{Score}(\text{Query}, T_j)\right)}$$
where $M$ is the length of the entire input text.
Finally, the weighted sum of the input vectors with their attention weights gives the final output vector of the word-level path:
$$W = \sum_{i=1}^{N} \alpha_i T_i$$
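A minimal sketch of this scoring and weighting step is shown below, assuming PyTorch and treating the learnable linear projection as the query interaction; the class name and dimensions are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Score each token vector, normalize with Softmax, and return the weighted sum."""
    def __init__(self, hidden=768):
        super().__init__()
        self.proj = nn.Linear(hidden, 1)  # realizes Score(Query, T_i) = ReLU(W T_i + b)

    def forward(self, T):                              # T: (batch, seq_len, hidden)
        scores = F.relu(self.proj(T)).squeeze(-1)      # (batch, seq_len)
        alpha = torch.softmax(scores, dim=-1)          # attention weights alpha_i
        W_out = torch.einsum("bs,bsh->bh", alpha, T)   # sum_i alpha_i * T_i
        return W_out, alpha
```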
(2)
Sentence-Level Feature Extraction
In the sentence-level feature extraction path, the output vectors from the embedding layer are input into the BiGRU model for temporal information modeling and feature extraction, aiding the model in understanding global semantic information. GRU, a simplified version of the Long Short-Term Memory (LSTM) network, mitigates issues such as the complex structure and tendency to overfit inherent in LSTM, while still maintaining excellent sequence modeling capabilities. Therefore, GRU is widely used in NLP, speech recognition, and other fields. The structure of the GRU model is shown in Figure 4. It mainly includes two gating units: the Update Gate and the Reset Gate, which control how much information from the previous state should be retained and forgotten, helping the model capture long-term dependencies.
The Bidirectional Gated Recurrent Unit (BiGRU) combines the forward GRU and backward GRU, allowing the relationships between earlier and later inputs to be fully reflected in the current state’s output. This generates text feature vectors with contextual semantic information, leading to a more accurate understanding of the text. The basic structure of the BiGRU model is shown in Figure 5.
The input to the BiGRU module is again the word vector representation $T = \{T_1, T_2, \ldots, T_N\}$ output from the embedding layer. At each time step $t$, the forward GRU receives the hidden state from the previous time step $h_{t-1}^{(f)}$ and the input at the current time step $T_t$, producing the forward hidden state $h_t^{(f)}$. The calculation of the forward hidden state $h_t^{(f)}$ is given by Equations (5)–(8):
$$z_t^{(f)} = \sigma\!\left(W_z^{(f)} \left[h_{t-1}^{(f)}, T_t\right]\right)$$
$$r_t^{(f)} = \sigma\!\left(W_r^{(f)} \left[h_{t-1}^{(f)}, T_t\right]\right)$$
$$\tilde{h}_t^{(f)} = \tanh\!\left(W_h^{(f)} \left[r_t^{(f)} \odot h_{t-1}^{(f)}, T_t\right]\right)$$
$$h_t^{(f)} = \left(1 - z_t^{(f)}\right) \odot h_{t-1}^{(f)} + z_t^{(f)} \odot \tilde{h}_t^{(f)}$$
Equation (5) shows the calculation of the update gate, where $W_z^{(f)}$ is the weight matrix of the update gate and $\left[h_{t-1}^{(f)}, T_t\right]$ denotes the concatenation of the previous time step's forward hidden state and the current time step's input. Equation (6) shows the calculation of the reset gate. Equation (7) gives the candidate hidden state at the current time step, and Equation (8) gives the forward hidden state $h_t^{(f)}$ at the current time step. In contrast to the forward pass, the backward pass uses the reversed sequence of word vectors $T = \{T_N, T_{N-1}, \ldots, T_1\}$. For each time step $t$, the backward GRU receives the hidden state of the next time step $h_{t+1}^{(b)}$ and the current time step's input $T_t$, producing the backward hidden state $h_t^{(b)}$; the calculation is analogous to that of the forward hidden state $h_t^{(f)}$. Finally, the forward and backward hidden states are concatenated at each time step to form the hidden state sequence of the entire sequence:
$$S = \left[\left[h_1^{(f)}, h_N^{(b)}\right], \left[h_2^{(f)}, h_{N-1}^{(b)}\right], \ldots, \left[h_N^{(f)}, h_1^{(b)}\right]\right]$$
Finally, the hidden state $S_N$ at the last time step output by the BiGRU network is taken as the representation of the entire sequence and serves as the output of the sentence-level feature extraction path. This, along with the word-level feature information, is input into the feature fusion layer, where the word-level and sentence-level feature information are combined.
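The sentence-level path can be sketched with PyTorch's built-in bidirectional GRU, which implements the gate computations of Equations (5)–(8) internally; here the last forward and backward hidden states are concatenated as the sequence representation. This is an illustrative sketch, and the layer sizes are assumptions taken from the parameter settings in Section 5.3.

```python
import torch
import torch.nn as nn

class SentenceBiGRU(nn.Module):
    """Bidirectional GRU over BERT word vectors; returns the sentence-level representation."""
    def __init__(self, hidden_in=768, hidden_out=128):
        super().__init__()
        self.bigru = nn.GRU(hidden_in, hidden_out, batch_first=True, bidirectional=True)

    def forward(self, T):                         # T: (batch, seq_len, hidden_in)
        outputs, h_n = self.bigru(T)              # h_n: (2, batch, hidden_out)
        # concatenate the last forward and last backward hidden states
        S = torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_out)
        return S
```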

3.3. Feature Fusion Layer

In the feature fusion layer, the attention mechanism is used to integrate the feature vectors from different paths, enabling the model to consider both local key information and the global semantic information of the entire sentence, thereby better understanding the text’s semantics and context.
First, the similarity matrix $U$ between the word-level feature vector $W$ and the sentence-level feature vector $S$ is calculated. Then, an attention weight vector $A$ is computed to assign a weight to each feature, indicating its importance in the fusion. The attention weights are obtained with the Softmax function:
$$A = \text{Softmax}(U)$$
The final fused feature vector, which integrates the word-level and sentence-level information, is obtained as a weighted sum of the two feature vectors using the attention weights $A$:
$$F = A W + (1 - A) S$$
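One possible realization of this fusion step is sketched below. The exact form of the similarity computation is not fully specified above, so a bilinear interaction producing per-dimension weights is assumed here purely for illustration; the projection that maps the sentence vector into the word-vector space is likewise an assumption.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse word-level vector W and sentence-level vector S with attention weights A."""
    def __init__(self, dim_w=768, dim_s=256):
        super().__init__()
        self.proj_s = nn.Linear(dim_s, dim_w)    # map S into the same space as W (assumption)
        self.bilinear = nn.Linear(dim_w, dim_w)  # bilinear similarity interaction (assumption)

    def forward(self, W, S):                     # W: (batch, dim_w), S: (batch, dim_s)
        S = self.proj_s(S)
        U = self.bilinear(W) * S                 # per-dimension similarity scores U
        A = torch.softmax(U, dim=-1)             # attention weights A = Softmax(U)
        return A * W + (1 - A) * S               # F = A * W + (1 - A) * S
```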

3.4. Classification Layer

In the classification layer, the fused feature vector is fed into a fully connected layer and then passed through the Softmax activation function to obtain the class corresponding to the sentence. In classification tasks, we aim for the predicted probability distribution of the model to closely match the true label distribution. Therefore, cross-entropy loss is typically used as the loss function, and optimization methods like gradient descent are employed to adjust the model parameters to minimize the loss function. However, such an optimization objective may cause the model to “overfit” easy-to-learn samples in an effort to reduce the loss, leading to overfitting issues. To address this, this paper constructs a hybrid loss function combining cross-entropy loss with Hinge Loss [33] to improve the performance of the classification model.
Cross-entropy loss is one of the most commonly used loss functions in classification problems, measuring the difference between the predicted probability distribution and the actual labels. For a given sample, let the model output the class probability distribution $p$ and let the true label distribution (one-hot encoded) be $q$; the cross-entropy loss is defined as follows:
$$\text{CrossEntropyLoss}(p, q) = -\sum_i q_i \log p_i$$
Here, $i$ is the class index, $q_i$ is the $i$-th element of the true label, and $p_i$ is the predicted probability of the $i$-th class.
Hinge Loss is also a commonly used loss function in classification problems, typically applied in learning algorithms such as support vector machines. The core design of Hinge Loss is to ensure that for correctly classified samples, the difference between the model’s predicted value and the true value is greater than a preset margin; for misclassified samples, the difference between the predicted value and the true value exceeds this margin plus an additional interval. This optimization objective makes the model focus more on samples near the decision boundary that are difficult to classify, thus achieving a more reasonable decision boundary.
For a given sample $i$ with true label $y_i$, the true class score, i.e., the probability for the true class obtained from the model's forward propagation, is denoted $z_{i, y_i}$, and the score for class $k$ is denoted $z_{i, k}$. The Hinge Loss is defined as follows:
$$\text{HingeLoss} = \frac{1}{N} \sum_{i=1}^{N} \max\!\left(0, \; \max_{k \neq y_i} z_{i,k} - z_{i, y_i} + \text{margin}\right)$$
where $N$ is the number of samples, $i$ is the sample index, and $\text{margin}$ is a hyperparameter specifying the margin between classes.
In the dual-path model constructed in this paper, cross-entropy loss is combined with Hinge Loss to obtain the hybrid loss as the optimization objective:
$$\text{CombinedLoss} = \text{CrossEntropyLoss} + \text{HingeLoss}$$
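A sketch of the hybrid objective in PyTorch, following Equations (12)–(14), is given below. The hinge term operates on the predicted class probabilities as described above, and the default margin value is illustrative only.

```python
import torch
import torch.nn as nn

class CombinedLoss(nn.Module):
    """Hybrid objective: CrossEntropyLoss plus a multi-class Hinge Loss term."""
    def __init__(self, margin=0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.margin = margin

    def forward(self, logits, labels):                     # logits: (batch, C), labels: (batch,)
        ce_loss = self.ce(logits, labels)
        probs = torch.softmax(logits, dim=1)               # z_{i,k}: predicted class probabilities
        true_scores = probs.gather(1, labels.unsqueeze(1))             # z_{i, y_i}
        masked = probs.scatter(1, labels.unsqueeze(1), float("-inf"))  # exclude the true class
        max_wrong = masked.max(dim=1, keepdim=True).values             # max_{k != y_i} z_{i,k}
        hinge = torch.clamp(max_wrong - true_scores + self.margin, min=0).mean()
        return ce_loss + hinge
```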

4. Research on Data Optimization Methods Combining Data Filtering and Augmentation

To address issues such as annotation errors, uneven distribution, and semantic ambiguity in current NLP datasets, this paper proposes a data optimization method that integrates both data filtering and data augmentation techniques. This method aims to improve the quality of datasets for short text classification tasks, thereby effectively enhancing the performance of classification models.
Firstly, the method calculates the confidence score for each sample in the original data to filter out samples with annotation errors or semantic ambiguities. Then, a certain number of sentences are randomly sampled from different classes to obtain samples with various features as the seed set for text generation. Using an improved similar sentence generation model, RoFormer-Sim [34], corresponding similar sentences are generated as candidate samples for text augmentation. Considering the quality and diversity of the generated similar sentences, the candidate samples are evaluated through feedback from the classification model. The most suitable augmented samples are selected to expand the training set. Finally, the generated samples are mixed with the data filtered based on confidence scores to obtain the optimized dataset. The technical roadmap of the data optimization strategy is shown in Figure 6.

4.1. Confidence Calculation Method

The data filtering method adopted in this paper calculates the confidence score of each sentence based on the feedback from the classification model [24]. Compared to other quality quantification metrics calculated based on the linguistic characteristics of the text, the sample confidence obtained from model feedback assesses data quality through direct interaction with the model. This approach can more accurately identify and filter out data that negatively impacts model training, providing strong dynamism and specificity. Therefore, this paper chooses this more intuitive and easily interpretable confidence calculation method for data filtering.
Intuitively, in a given dataset, there is a noticeable difference between samples that the model can always predict correctly and those where the model’s predictions are always wrong or unstable. Confidence filtering based on model feedback leverages this characteristic, calculating confidence scores based on the model’s performance on individual instances and using these scores for data filtering.
Consider a dataset $D = \{(x_i, y_i^*)\}_{i=1}^{N}$ with $N$ samples, where the $i$-th instance consists of text $x_i$ and its corresponding true label $y_i^*$. Assume that an optimization process based on stochastic gradient descent is used and that the training instances are randomly shuffled in each iteration. The final model outputs a probability distribution over the labels for the sentence in each instance. The confidence of instance $i$ is defined as the average of the model probabilities assigned to the true label of that instance over $E$ iterations:
$$\bar{\mu}_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}\!\left(y_i^* \mid x_i\right)$$
where $p_{\theta^{(e)}}\!\left(y_i^* \mid x_i\right)$ denotes the probability of the true label $y_i^*$ given by the model with parameters $\theta^{(e)}$ at the end of the $e$-th iteration. It can be observed that even with different initial parameter values, the confidence of each instance remains quite stable.
By calculating the confidence, it is possible to clearly distinguish between samples that the model can consistently and correctly predict and those for which the model’s predictions are prone to errors or instability. For samples that, even after multiple training sessions, still fail to yield stable and correct predictions, there is ample reason to believe that these samples may have annotation errors or ambiguous semantics.
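A minimal sketch of this confidence-based filtering is shown below, assuming the probability assigned to each instance's true label has been recorded at the end of every training epoch; the function names and the default threshold are illustrative.

```python
import numpy as np

def compute_confidence(true_label_probs):
    """
    true_label_probs: array of shape (E, N), where entry (e, i) is the probability the model
    assigns to instance i's true label at the end of epoch e, i.e. p_{theta^(e)}(y_i* | x_i).
    Returns the mean confidence mu_i for each instance (shape (N,)).
    """
    return np.asarray(true_label_probs).mean(axis=0)

def filter_by_confidence(texts, labels, confidence, threshold=0.95):
    """Keep only instances whose confidence meets the threshold (e.g. 0.95, as in Section 5.4.2)."""
    keep = confidence >= threshold
    return [t for t, k in zip(texts, keep) if k], [l for l, k in zip(labels, keep) if k]
```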

4.2. Sampling Method

After filtering out samples with annotation errors or ambiguous semantics from the original data, it is necessary to select some samples as a seed set for subsequent text augmentation operations. Since the experimental data chosen in this paper have relatively balanced class distributions, stratified sampling is used to extract augmented samples. This step ensures that the seed set for generating similar sentences covers samples with different characteristics, and each selected sample has high quality and clear annotations. This allows the augmented data to reflect the characteristics of each class, enhancing the model’s ability to understand and handle complex linguistic phenomena.

4.3. Text Augmentation Method

This paper uses the RoFormer-Sim model in combination with a sample evaluation and selection module as a text data augmentation method to expand the dataset. First, similar sentences are generated using the improved RoFormer-Sim model as candidate samples for augmentation. Then, these candidate samples are evaluated based on feedback from the classification model, selecting the most reasonable augmented samples. The flowchart of the text augmentation method is shown in Figure 7.

4.3.1. Similar Sentence Generation Method

In the initial stage of the augmentation method, this paper uses the improved RoFormer-Sim model to generate similar sentences. RoFormer-Sim is a language generation model based on the RoFormer [35] pre-trained model with similar sentence retrieval capabilities. It integrates various training techniques, demonstrating powerful natural language understanding and generation abilities, making it widely applicable to various NLP tasks.
RoFormer-Sim collects a large amount of question-answer pairs and sentence passages. It constructs unsupervised pre-training corpora by calculating inter-sentence similarity, introducing a random masking mechanism in the process to further enhance the model’s natural language understanding. Additionally, to achieve the goals of MTL and cross-task transfer, RoFormer-Sim employs a training method similar to UniLM [36] to accomplish the task of similar sentence generation. The model’s loss function consists of two parts: first, the Seq2Seq task, which uses the input text to predict the corresponding similar text; second, the construction of the semantic similarity task, evaluating semantic similarity between texts by calculating the CLS vector of the texts. The training method of the model is shown in Figure 8.
After the language generation model processes the input text, it usually calculates the likelihood scores for each token in the vocabulary to represent the probability of that token being the next word in the sentence. The traditional greedy strategy directly selects the word with the highest probability, which can lead to the generated text being too monotonous and repetitive. On the other hand, the random sampling method selects a word according to the probability distribution, which can result in incoherent or meaningless text. The RoFormer-Sim model adopts the Top-p strategy to select the output token. This strategy dynamically adjusts the range of candidate words during the generation process by selecting the smallest set of words whose cumulative probability exceeds the threshold p . It is currently one of the most commonly used decoding methods in various language generation models.
The choice of the threshold p in the Top-p strategy is crucial. A higher threshold can increase the diversity and creativity of the text but may affect the coherence and consistency of the sentences. Conversely, a lower threshold helps maintain thematic consistency and logical coherence but may limit the range of variation in the text, resulting in overly monotonous generated text. In the original RoFormer-Sim model, the threshold in the Top-p strategy is a manually specified hyperparameter that needs to be adjusted according to specific tasks and requirements, making it sometimes difficult to balance the diversity and coherence of the generated text.
Therefore, we propose an adaptive threshold adjustment method based on sentence complexity to balance the logical coherence and lexical diversity of the generated text.
Assume the length of a given sentence is $L$ and the maximum length is $L_{\max}$; the sentence length score is $\text{length\_score} = L / L_{\max}$. If the number of distinct words in the sentence is $V$ and the total number of words is $V_{\text{total}}$, the sentence diversity score is $\text{diversity\_score} = V / V_{\text{total}}$. Combining the length and lexical diversity of the sentence, the complexity score of the sentence is calculated as follows:
$$cs = \frac{\text{length\_score} + \text{diversity\_score}}{2}$$
Given the preset minimum Top-p threshold $p_{\min}$ and maximum threshold $p_{\max}$, a Top-p threshold dynamically adjusted according to the current sentence complexity is obtained as follows:
$$\text{top-}p = p_{\min} + \left(p_{\max} - p_{\min}\right) \times cs$$
During the process of generating similar sentences, shorter sentences with lower lexical richness have limited information content. For such sentences, we focus more on the logical coherence and linguistic accuracy, thus assigning a lower threshold to ensure semantic correctness in the generated sentences. In contrast, for longer sentences with more diverse vocabulary, which provide richer context and expressive space, we prioritize increasing sentence diversity and creativity. Therefore, we assign a higher threshold to generate more varied text.
By adaptively adjusting the threshold in the Top-p strategy, the text generation process can be more finely controlled. This approach ensures that while maintaining the logical coherence of the sentences, the generated text also offers richer and more diverse expressions, thus adapting to different scenarios and requirements.
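A sketch of the adaptive threshold of Equations (16) and (17) is given below; the character-level tokenization, the maximum length of 50 (matching the dataset description in Section 5.1), and the default values of $p_{\min}$ and $p_{\max}$ are assumptions made for illustration.

```python
def adaptive_top_p(sentence, l_max=50, p_min=0.7, p_max=0.95):
    """Top-p threshold that grows with sentence length and lexical diversity (Eqs. (16)-(17))."""
    tokens = list(sentence)                                    # character-level split (assumption)
    length_score = len(tokens) / l_max                         # L / L_max
    diversity_score = len(set(tokens)) / max(len(tokens), 1)   # V / V_total
    cs = (length_score + diversity_score) / 2                  # sentence complexity score
    return p_min + (p_max - p_min) * cs                        # top-p = p_min + (p_max - p_min) * cs
```

A short, repetitive sentence thus receives a lower threshold (more conservative decoding), while a long, lexically varied sentence receives a higher one (more diverse decoding).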

4.3.2. Method of Selecting Augmented Samples

The previous sections have described the process of generating similar samples using the similar sentence generation model RoFormer-Sim. However, not all generated similar sentences are effective in enhancing model performance. While pursuing diverse expressions, there may be a loss of key information or a change in semantics, which can reduce the model’s effectiveness. Inspired by the EPiDA [37] method, after obtaining similar sentences generated by the model, this paper selects enhanced samples that are semantically similar to the original sentences and diverse based on feedback from the classification model. This approach effectively combines the characteristics of the model and the data, achieving a dynamic augmentation strategy.
Given a labeled dataset consisting of pairs $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i$ denotes the text, $y_i$ denotes the label, $y_i \in \{0,1\}^C$ is a one-hot vector, and $C$ is the number of classes, let $p_\theta(y \mid x)$ denote the probability of predicting label $y$ given input $x$, with $\theta$ being the model parameters.
When first considering the diversity of generated samples, the augmented samples should differ significantly from the original ones. Consequently, for the $j$-th candidate sample $t_i^j$ produced from the original text $x_i$, the objective is to maximize the loss incurred by this sample when evaluated by the model. The objective function can be expressed as follows:
$$\max \; L_g\!\left(\omega, \phi(t_i^j)\right) = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ D\!\left(p\!\left(\omega^T \phi(t_i^j)\right), p(y_i)\right) + H\!\left(p(y_i)\right) \right]$$
Here, $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^D$ is a finite feature mapping, $\omega \in \mathbb{R}^D$ denotes learnable parameters, $p$ is a probability distribution, $H$ is the Shannon entropy, and $D$ is the relative entropy, also known as the Kullback–Leibler (KL) divergence. Since $p(y_i)$ is a one-hot vector, $H(p(y_i)) = 0$, and the objective reduces to maximizing the KL divergence $D\!\left(p\!\left(\omega^T \phi(t_i^j)\right), p(y_i)\right)$, which gauges the diversity of the candidate samples. Consequently, a diversity score is defined as follows:
$$\text{score}_1 = D\!\left(p\!\left(\omega^T \phi(t_i^j)\right), p(y_i)\right)$$
Secondly, considering the aspect of the quality of generated samples, it is essential to ensure that the synthesized examples are semantically as similar as possible to the original samples. This is achieved by minimizing the conditional entropy, which constrains the semantic discrepancy between the candidate samples and the original ones. Formally, this can be expressed as follows:
$$\min \; H\!\left(p\!\left(\omega^T \phi(t_i^j)\right) \,\middle|\, p\!\left(\omega^T \phi(x_i)\right)\right)$$
Consequently, the quality score of the candidate samples can be defined as follows:
$$\text{score}_2 = -H\!\left(p\!\left(\omega^T \phi(t_i^j)\right) \,\middle|\, p\!\left(\omega^T \phi(x_i)\right)\right)$$
Finally, combining the normalized diversity score $\text{score}_1'$ and the normalized quality score $\text{score}_2'$, we obtain the final score of a candidate augmented sample:
$$s = \text{score}_1' + \text{score}_2'$$
Based on these final scores, the candidate samples generated by the RoFormer-Sim model are evaluated, the most suitable augmented samples are selected, and they are merged with the previously filtered high-confidence data to expand the training set.
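A sketch of this scoring step under the definitions above is given below. The diversity score is computed as the KL divergence between the candidate's predicted distribution and the one-hot label; the quality score is approximated here from the discrepancy between the original and candidate predictions (a proxy for the conditional-entropy term); and min-max normalization is assumed. All function names are illustrative.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for discrete distributions, with clipping to avoid log(0)."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def score_candidates(cand_probs, orig_probs, onehot_label):
    """
    cand_probs: (m, C) predicted distributions for the m candidates of one original sample.
    orig_probs: (C,) predicted distribution for the original sample.
    onehot_label: (C,) one-hot true label.
    """
    # Diversity: KL divergence between each candidate's prediction and the label distribution.
    score1 = np.array([kl_divergence(p, onehot_label) for p in cand_probs])
    # Quality: negative cross-entropy between the original and candidate predictions
    # (proxy for minimizing the conditional entropy; smaller discrepancy -> higher score).
    score2 = np.array([np.sum(orig_probs * np.log(np.clip(p, 1e-12, 1.0))) for p in cand_probs])
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)  # min-max normalization
    return norm(score1) + norm(score2)   # final score s = score1' + score2'
```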

5. Experimental Process and Results Analysis

In this section, we first introduce the dataset sources, then describe the evaluation metrics and model parameter settings, and finally present the analysis of the experimental results.

5.1. Dataset Sources

To verify the effectiveness of the proposed model, this paper selects three short text datasets from different fields and of different scales for experiments: OnlineShopping [38], KUAKE-QIC [39], and TNews [40].
OnlineShopping is a product review dataset collected from various e-commerce platforms, containing two categories: positive and negative. KUAKE-QIC is a medical search intent classification dataset released by the Chinese Biomedical Language Understanding Evaluation (CBLUE). The data source is user queries from medical search engines, from which seven categories were selected for the experiment: Precautions, Disease Diagnosis, Efficacy, Consequence Description, Disease Description, Medical Advice, and Treatment Plan. TNews is a news headline dataset released by the Chinese Language Understanding Evaluation (CLUE) from the Toutiao news app. The data source is news headlines from the Toutiao app, from which thirteen categories were selected for the experiment: Culture, Entertainment, Sports, Finance, Real Estate, Automobile, Education, Technology, Military, Travel, International, Agriculture, and eSports. A certain number of texts were randomly selected from each dataset and divided into training, validation, and test sets, with all texts in these datasets having fewer than 50 characters. The statistics and partitioning information of the datasets are shown in Table 1.

5.2. Evaluation Metrics

In this section, precision, recall, and F1 score are used as evaluation metrics for the model performance. In classification models, the confusion matrix is an indicator used to assess model performance. It compares the predicted results of the model with the actual labels, categorizing each sample in the dataset as one of TP, FP, TN, or FN. Here, TP (True Positive) represents the number of positive instances correctly predicted as positive by the classifier, TN (True Negative) represents the number of negative instances correctly predicted as negative, FP (False Positive) represents the number of negative instances incorrectly predicted as positive, and FN (False Negative) represents the number of positive instances incorrectly predicted as negative. The confusion matrix is shown in Table 2.
Using the four elements of the confusion matrix, classification performance metrics such as precision, recall, and F1 score can be calculated. Precision indicates the proportion of true positive predictions out of all positive predictions made by the classifier, i.e., how many of the predicted positive samples are actually positive. A higher precision value indicates better model performance. The calculation method is as follows:
$$\text{precision} = \frac{TP}{TP + FP}$$
Recall represents the proportion of actual positive samples that are correctly predicted as positive by the classifier. A higher recall value indicates better model performance. The calculation method is as follows:
$$\text{recall} = \frac{TP}{TP + FN}$$
The F1 score is the weighted harmonic mean of precision and recall, providing a comprehensive measure of these two metrics. It is the most commonly used and important metric in classification tasks. A higher F1 score indicates that the experimental method is more ideal. The calculation method is as follows:
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
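These metrics can be computed, for example, with scikit-learn; macro averaging is shown here as one option for the multi-class datasets, since the averaging scheme is not restated above, and the label values are purely illustrative.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]   # illustrative ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0]   # illustrative predictions

precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```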

5.3. Model Parameter Settings

In the embedding layer, the Google team’s pre-trained model, bert-base-chinese, is used. It has 12 layers, employs a 12-head attention mechanism, and has a hidden layer dimension of 768, with a total of 110 million parameters.
In the feature extraction layer, the basic number of neurons is set to 128, meaning the hidden layer dimension is 128. The batch size for training is set to 32, and the initial learning rate is set to 2 × 10−5. To optimize model parameters, AdamW is chosen as the optimizer, and ReLU is selected as the activation function.

5.4. Experimental Results Analysis

5.4.1. Effectiveness Analysis of Short Text Classification Models

To verify the effectiveness of the proposed model for short text classification tasks, multiple experiments were conducted using different models on three short text datasets of varying scales. Model 1 is a traditional BiGRU network. Model 2 is a BiGRU model with an added attention mechanism, and both models use Word2Vec for text vectorization. Model 3 is a fine-tuned BERT model. Model 4 is a BiGRU network with a BERT embedding module, handling only sentence-level features. Model 5 is a fine-tuned BERT model with an added attention mechanism, handling only word-level features. Model 6 is a dual-path model using the original cross-entropy loss. Model 7 is the proposed dual-path classification model with a hybrid loss. These seven sets of experiments were conducted to verify the impact of different modules on the performance of short text classification. The results are shown in Table 3.
To more intuitively compare the effects of different models, the three metrics of prediction results for different models on each dataset were visualized. The resulting bar charts are shown in Figure 9.
The experimental results show that the proposed dual-path classification model with hybrid loss achieved the best classification performance across all three short text datasets. On the binary OnlineShopping dataset, the F1 score reached 95.02%. On the 7-class KUAKE-QIC dataset, the F1 score reached 92.17%. On the 13-class TNews dataset, the F1 score reached 87.70%.
Comparing Model 2 with Model 1, and Models 4, 5, 6, and 7 with Model 3, it can be observed that the F1 scores have significantly improved across all datasets. This indicates that the combination of multiple models outperforms single models for short text classification tasks, demonstrating the effectiveness of the proposed integrated model.
Comparing Models 3, 4, 5, 6, and 7 with Models 1 and 2, it can be observed that using the BERT model for word vector representation, as opposed to Word2Vec, significantly improves the performance of short text classification tasks, with F1 scores improving by about 2% across all datasets. While Models 1 and 2 perform well, they are slightly lacking in all metrics compared to BERT and its derivative models. This highlights the strong performance of the BERT pre-trained model in NLP tasks and further amplifies its advantages when combined with other models.
The comparison between Model 2 and Model 1, as well as between Model 5 and Model 3, shows that adding an attention mechanism generally improves metrics across several datasets. The attention mechanism helps the model focus more on important information in the input sequence, increasing its sensitivity to critical context. This focus is crucial for capturing key information and enhancing performance in short text classification tasks.
Models 4 and 5, which, respectively, focus only on sentence-level features and only on word-level features, have achieved good results. However, the dual-path models 6 and 7, which consider both granularities of feature information, show approximately 1% improvement over Models 4 and 5. This indicates that the dual-path architecture, which integrates both word-level and sentence-level information, effectively enhances the performance of the classification model by simultaneously focusing on local key information and global semantic information.
Compared to Model 6, Model 7 shows slight accuracy improvements across several datasets. This indicates that the hybrid loss function, which combines cross-entropy and HingeLoss, not only enhances the model’s classification performance but also helps the model focus more on learning ambiguous and hard-to-distinguish samples, thereby improving its classification performance and generalization ability.
Overall, comparing Model 7 with Models 4, 5, and 6 shows that the combination of multiple models and the choice of loss function have a significant impact on the final classification results. Model 7 achieves the best classification performance. This result demonstrates that among various multi-model combination strategies involving the BERT model, attention mechanism, and BiGRU network, the proposed dual-path model with hybrid loss is the optimal combination strategy.
Additionally, on the KUAKE-QIC dataset, the classification results of the proposed model are compared with the following widely used advanced models:
FastText [41]: A fast and simple text classification method with a simple model structure and fast training speed.
TextCNN [3]: A model that uses CNN for NLP tasks, efficiently extracting important features.
LSTM [7]: An improved version of RNN that addresses the issues of gradient vanishing and explosion during long sequence training.
GRU [8]: A simplified version of LSTM, which achieves comparable modeling performance with fewer parameters and a simpler structure.
The experimental results are shown in Table 4.
These experimental results indicate that, compared to other models, the proposed dual-path text classification model with hybrid loss achieves the best recognition performance for short texts. This outcome highlights the importance of fully leveraging pre-trained models, combining different models to capture information at various granularities, and using a hybrid loss function that pays more attention to sample boundaries when handling short text classification tasks.

5.4.2. Effectiveness Analysis of Data Optimization Methods

To verify the effectiveness of the data optimization methods proposed in this paper, the three short text datasets introduced earlier were optimized, and experiments were conducted using the proposed dual-path classification model. The confidence distribution levels of the three datasets were obtained based on the confidence calculation method presented in Section 4. In the OnlineShopping dataset, 0.35% of the samples had a confidence below 0.90. In the KUAKE-QIC dataset, 0.40% of the samples had a confidence below 0.90. In the TNews dataset, 3.72% of the samples had a confidence below 0.90.
Since the vast majority of samples in each dataset have confidence levels concentrated in the 0.90–1.0 range, this section provides a detailed comparison of the confidence distribution levels within the 0.90–1.0 interval for different datasets. The number of samples and corresponding proportions within this confidence range are listed in detail. The specific results are shown in Table 5.
The confidence filtering threshold for the OnlineShopping dataset was set to 0.95, retaining a total of 24,021 samples. From each category in this dataset, 30% of the samples were randomly sampled as the text augmentation seed set. For the KUAKE-QIC dataset, the confidence filtering threshold was set to 0.95, retaining a total of 4,664 samples. From each category, 100% of the samples were randomly sampled as the text augmentation seed set. For the TNews dataset, the confidence filtering threshold was set to 0.90, retaining a total of 96,280 samples. From each category, 20% of the samples were randomly sampled as the text augmentation seed set.
In terms of text data augmentation, the chinese_roformer-sim-char_L-12_H-768_A-12 model was used for similar sentence generation, with Create_Num set to 1, meaning that one similar sentence was generated for each input sentence and the sentence with the highest similarity was selected. Taking the KUAKE-QIC dataset as an example, a sample of RoFormer-Sim augmented data is shown in Table 6.
After generating similar sentences using the improved RoFormer-Sim model, the scores of the candidate samples were calculated based on feedback from the classification model trained on the confidence-filtered data. The top 50% of the samples by score were selected as the final augmented samples. After optimization, the final training set sizes were as follows: the OnlineShopping dataset had 27,624 samples, the KUAKE-QIC dataset had 6,996 samples, and the TNews dataset had 105,908 samples. The final experimental results are shown in Table 7.
The experimental results show that the models trained on datasets filtered by confidence scores exhibit improved performance metrics compared to those trained on the original datasets. Furthermore, after applying text data augmentation on top of confidence filtering, the performance metrics further improved. This indicates that confidence filtering effectively selects correctly labeled and semantically clear sentences from the original data, while data augmentation compensates for the potential reduction in training samples caused by confidence filtering, thereby enhancing classification performance to some extent. The experiments in this paper demonstrate that the combination of these two methods yields good results, improving data quality to some extent and providing new insights for other researchers addressing similar issues.

5.4.3. Ablation Experiment

To verify the effectiveness of each step in the proposed data optimization method, ablation experiments were conducted. Taking the KUAKE-QIC dataset as an example, the augmented sample selection, text data augmentation, and confidence filtering steps were removed separately. The specific experimental results are shown in Table 8.
Table 8 shows that removing any of these steps degrades all metrics to varying degrees, which demonstrates the effectiveness of the proposed method. For the KUAKE-QIC dataset in particular, performance drops markedly when the text augmentation step is removed: because this dataset is very small, generating appropriate augmented samples to expand it effectively improves model performance. For the TNews dataset, the largest drop occurs when confidence filtering is removed, indicating that for large-scale datasets it is more important to filter out mislabeled and semantically ambiguous sentences via confidence calculation, which improves dataset quality and saves computational resources.

6. Summary and Discussion

With the development of the internet and the proliferation of social media, public expressions online have generated massive amounts of text information, rich in human thoughts, emotions, and social trends, forming an integral part of the information age. However, the vast and mixed nature of online text data necessitates efficient text classification techniques to filter out genuine and useful information for processing. Short texts, in particular, have become the mainstream form of information exchange due to their concise and direct format, and classification techniques targeting short texts are therefore a current research hotspot. Nevertheless, short texts pose challenges due to their limited length, lack of context, and sparse information, making it difficult for classification models to understand and capture deep semantics. Additionally, phenomena such as internet slang, typos, and polysemy further complicate data preprocessing and feature extraction. This limits the effectiveness of conventional text analysis models on short texts and calls for more precise and efficient classification models. The main work of this paper is summarized in Table 9.
The proposed short text classification model and corresponding data optimization method were experimentally validated on three short text datasets from different domains and of different scales. The results show that the proposed dual-path classification model achieves higher classification accuracy compared to other popular deep learning models. Moreover, its performance further improves after data optimization. This demonstrates the effectiveness of the proposed short text classification model and data optimization method, providing valuable references and insights for future researchers in related fields.
Although this research has achieved significant results, there are still many areas that need improvement and refinement. Future work will focus on the following aspects:
1. Although the dual-path short text classification model with hybrid loss can extract multi-granularity semantic features, it correspondingly increases the number of parameters and computational complexity. Therefore, exploring methods such as model compression, knowledge distillation, or lightweight model design to balance model performance and efficiency is an important direction for future research.
2. The similar sentence generation model used in the data augmentation method, RoFormer-Sim, is a generative language model pre-trained on similar sentence pairs collected from the internet. Because of the unstructured nature of its pre-training data, the quality of the generated similar sentences may be limited, potentially introducing semantic bias and logical inconsistencies. Future work can therefore explore more advanced data augmentation schemes and models to improve data quality and optimize model performance. In addition, comparisons with more recent Transformer-based models, such as RoBERTa and XLNet, will be considered in future work.

Author Contributions

Conceptualization, W.L. and Y.H.; methodology, G.L. and Y.H.; software, Y.H.; validation, W.L., G.L. and Y.H.; formal analysis, W.L.; investigation, G.L.; resources, Y.H.; data curation, Y.H.; writing—original draft preparation, W.L. and Y.H.; writing—review and editing, G.L.; visualization, W.L.; supervision, G.L.; project administration, G.L.; funding acquisition, G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under grant 12171247.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors are grateful to the referees for their valuable suggestions and comments on the original manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Cheng, X.; Yan, X.; Lan, Y.; Guo, J. BTM: Topic modeling over short texts. IEEE Trans. Knowl. Data Eng. 2014, 26, 2928–2941. [Google Scholar] [CrossRef]
  2. Song, G.; Ye, Y.; Du, X.; Huang, X.; Bie, S. Short text classification: A survey. J. Multimed. 2014, 9, 635–643. [Google Scholar] [CrossRef]
  3. Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
  4. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  5. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  6. Liu, P.; Qiu, X.; Huang, X. Recurrent neural network for text classification with multi-task learning. arXiv 2016, arXiv:1605.05101. [Google Scholar] [CrossRef]
  7. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  8. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  9. Gruber, N.; Jockisch, A. Are GRU Cells More Specific and LSTM Cells More Sensitive in Motive Classification of Text? Front. Artif. Intell. 2020, 3, 40. [Google Scholar] [CrossRef]
  10. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  11. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489. [Google Scholar]
  12. Jang, B.; Kim, M.; Harerimana, G.; Kang, S.U.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841. [Google Scholar] [CrossRef]
  13. Deng, J.; Cheng, L.; Wang, Z. Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput. Speech Lang. 2021, 68, 101182. [Google Scholar] [CrossRef]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  15. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  16. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  17. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2020, arXiv:1909.11942. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced language representation with informative entities. arXiv 2019, arXiv:1905.07129. [Google Scholar] [CrossRef]
  19. Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  20. Rasool, A.; Aslam, S.; Hussain, N.; Imtiaz, S.; Riaz, W. nBERT: Harnessing NLP for Emotion Recognition in Psychotherapy to Transform Mental Health Care. Information 2025, 16, 301. [Google Scholar] [CrossRef]
  21. Akbar, Z.; Ghani, M.U.; Aziz, U. Boosting Viewer Experience with Emotion-Driven Video Analysis: A BERT-based Framework for Social Media Content. J. Artif. Intell. Bioinform. 2025, 1, 3–11. [Google Scholar] [CrossRef]
  22. Whang, S.E.; Lee, J.G. Data collection and quality challenges for deep learning. Proc. VLDB Endow. 2020, 13, 3429–3432. [Google Scholar] [CrossRef]
  23. Li, C.; Mao, Z. A label noise filtering method for regression based on adaptive threshold and noise score. Expert Syst. Appl. 2023, 228, 120422. [Google Scholar] [CrossRef]
  24. Swayamdipta, S.; Schwartz, R.; Lourie, N.; Wang, Y.; Hajishirzi, H.; Smith, N.A.; Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. arXiv 2020, arXiv:2009.10795. [Google Scholar] [CrossRef]
  25. Mishra, S.; Arunkumar, A.; Sachdeva, B.; Bryan, C.; Baral, C. Dqi: Measuring data quality in nlp. arXiv 2020, arXiv:2005.00816. [Google Scholar] [CrossRef]
  26. Jiang, L.; Zhou, Z.; Leung, T.; Li, L.-J.; Fei-Fei, L. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2304–2313. [Google Scholar]
  27. Han, B.; Yao, Q.; Yu, X.; Niu, G.; Xu, M.; Hu, W.; Tsang, I.; Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31, 1–11. [Google Scholar]
  28. Kordos, M.; Blachnik, M.; Scherer, R. Fuzzy clustering decomposition of genetic algorithm-based instance selection for regression problems. Inf. Sci. 2022, 587, 23–40. [Google Scholar] [CrossRef]
  29. Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar] [CrossRef]
  30. Karimi, A.; Rossi, L.; Prati, A. AEDA: An easier data augmentation technique for text classification. arXiv 2021, arXiv:2108.13230. [Google Scholar] [CrossRef]
  31. Kobayashi, S. Contextual augmentation: Data augmentation by words with paradigmatic relations. arXiv 2018, arXiv:1805.06201. [Google Scholar] [CrossRef]
  32. Yu, A.W.; Dohan, D.; Luong, M.T.; Zhao, R.; Chen, K.; Norouzi, M.; Le, Q.V. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv 2018, arXiv:1804.09541. [Google Scholar] [CrossRef]
  33. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  34. Xie, W. Entity Linking based on RoFormer-Sim for Chinese Short Texts. Front. Comput. Intell. Syst. 2023, 4, 46–50. [Google Scholar] [CrossRef]
  35. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  36. Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. Adv. Neural Inf. Process. Syst. 2019, 32, 1–13. [Google Scholar]
  37. Zhao, M.; Zhang, L.; Xu, Y.; Ding, J.; Guan, J.; Zhou, S. EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification. arXiv 2022, arXiv:2204.11205. [Google Scholar]
  38. Onlineshopping. Available online: https://www.heywhale.com/mw/dataset/63345cb433bf6e92aabb889b/file (accessed on 9 September 2022).
  39. KUAKE-QIC. Available online: https://tianchi.aliyun.com/dataset/211461 (accessed on 29 September 2025).
  40. TNews. Available online: https://github.com/ShannonAI/ChineseBert/blob/main/tasks/TNews/README.md (accessed on 10 May 2024).
  41. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar] [CrossRef]
Figure 1. Dual-path classification model architecture diagram.
Figure 2. BERT model input.
Figure 3. Attention mechanism model diagram.
Figure 4. GRU model architecture diagram.
Figure 5. BiGRU model architecture diagram.
Figure 6. Roadmap for Data Optimization Methods.
Figure 7. Text Augmentation Method Flowchart.
Figure 8. Training process of RoFormer-Sim.
Figure 9. Histogram of Experimental Results for Different Models.
Table 1. Dataset Statistics and Partitioning.
Dataset | Specific Application | Number of Categories | Training Set | Validation Set | Test Set
OnlineShopping | Sentiment Analysis | 2 | 25,000 | 5000 | 5000
KUAKE-QIC | Intent Recognition | 7 | 4770 | 600 | 600
TNews | Topic Classification | 13 | 100,000 | 20,000 | 20,000
Table 2. Confusion Matrix of Classification Metrics.
Actual Situation | Predicted Positive Class | Predicted Negative Class
Positive Class | TP | FN
Negative Class | FP | TN
Table 3. Comparison of Experimental Results for Different Models.
OnlineShopping
Model | Precision | Recall | F1-Score
1. BiGRU | 0.9228 | 0.9228 | 0.9227
2. Att+BiGRU | 0.9275 | 0.9272 | 0.9271
3. BERT | 0.9452 | 0.9446 | 0.9445
4. BERT+BiGRU | 0.9446 | 0.9444 | 0.9443
5. BERT+Att | 0.9480 | 0.9480 | 0.9479
6. Dual-Path Model with Cross-Entropy Loss | 0.9511 | 0.9473 | 0.9498
7. Dual-Path Model with Hybrid Loss | 0.9503 | 0.9502 | 0.9502
KUAKE-QIC
Model | Precision | Recall | F1-Score
1. BiGRU | 0.8955 | 0.8950 | 0.8948
2. Att+BiGRU | 0.9027 | 0.9019 | 0.9022
3. BERT | 0.9175 | 0.9062 | 0.9079
4. BERT+BiGRU | 0.9150 | 0.9087 | 0.9091
5. BERT+Att | 0.9179 | 0.9155 | 0.9143
6. Dual-Path Model with Cross-Entropy Loss | 0.9198 | 0.9171 | 0.9175
7. Dual-Path Model with Hybrid Loss | 0.9279 | 0.9198 | 0.9217
TNews
Model | Precision | Recall | F1-Score
1. BiGRU | 0.8471 | 0.8460 | 0.8460
2. Att+BiGRU | 0.8515 | 0.8494 | 0.8499
3. BERT | 0.8685 | 0.8639 | 0.8641
4. BERT+BiGRU | 0.8723 | 0.8699 | 0.8699
5. BERT+Att | 0.8684 | 0.8662 | 0.8665
6. Dual-Path Model with Cross-Entropy Loss | 0.8773 | 0.8763 | 0.8763
7. Dual-Path Model with Hybrid Loss | 0.8777 | 0.8771 | 0.8770
Table 4. Comparison of Experimental Results for Different Models.
KUAKE-QIC
Model | Precision | Recall | F1-Score
FastText | 0.7963 | 0.7861 | 0.7788
TextCNN | 0.8877 | 0.8869 | 0.8869
LSTM | 0.8295 | 0.8393 | 0.8333
GRU | 0.8817 | 0.8801 | 0.8803
Ours | 0.9279 | 0.9198 | 0.9217
Table 5. Confidence Level Distribution Across Datasets.
Confidence Interval | OnlineShopping Number | Proportion | KUAKE-QIC Number | Proportion | TNews Number | Proportion
0–0.90 | 87 | 0.35% | 19 | 0.40% | 3720 | 3.72%
0.90–0.91 | 175 | 0.70% | 25 | 0.52% | 2278 | 2.28%
0.91–0.92 | 162 | 0.65% | 10 | 0.21% | 1999 | 2.00%
0.92–0.93 | 176 | 0.70% | 15 | 0.31% | 2057 | 2.06%
0.93–0.94 | 180 | 0.72% | 17 | 0.36% | 2311 | 2.31%
0.94–0.95 | 199 | 0.80% | 20 | 0.42% | 2555 | 2.56%
0.95–0.96 | 277 | 1.11% | 26 | 0.55% | 3299 | 3.30%
0.96–0.97 | 347 | 1.39% | 51 | 1.07% | 4523 | 4.52%
0.97–0.98 | 622 | 2.49% | 92 | 1.93% | 6860 | 6.86%
0.98–0.99 | 1340 | 5.36% | 198 | 4.15% | 13,698 | 13.70%
0.99–1.0 | 21,435 | 85.74% | 4297 | 90.08% | 56,700 | 56.70%
Table 6. Examples of RoFormer-Sim Augmented Data.
Original Sentence | Similar Sentence | Category
What medicine should be taken for oral ulcers | What medicine is best for oral ulcers? | Treatment plan
Which is a good specialized hospital for liver cirrhosis in Shanxi Province | Which is a good hospital for liver cirrhosis in Shanxi Province | Medical advice
Can rubbing ginger prevent hair loss? | Can using ginger to wipe hair prevent hair loss? | Efficacy and effects
Table 7. Comparison of Experimental Results of Data Optimization Methods Across Datasets.
OnlineShopping
Training Data | Precision | Recall | F1-Score
Original Data | 0.9503 | 0.9502 | 0.9502
Confidence Filtered Data | 0.9520 | 0.9520 | 0.9520
Text Augmented Data | 0.9548 | 0.9548 | 0.9548
KUAKE-QIC
Training Data | Precision | Recall | F1-Score
Original Data | 0.9279 | 0.9198 | 0.9217
Confidence Filtered Data | 0.9296 | 0.9266 | 0.9268
Text Augmented Data | 0.9400 | 0.9361 | 0.9369
TNews
Training Data | Precision | Recall | F1-Score
Original Data | 0.8777 | 0.8771 | 0.8770
Confidence Filtered Data | 0.8808 | 0.8800 | 0.8801
Text Augmented Data | 0.8838 | 0.8832 | 0.8833
Table 8. Comparison of Ablation Experimental Effects.
Operations on the Dataset | Precision | Recall | F1-Score
Confidence screening + similar sentence generation + enhanced sample selection | 0.9400 | 0.9361 | 0.9369
Confidence screening + similar sentence generation | 0.9346 | 0.9346 | 0.9340
Similar sentence generation + enhanced sample selection | 0.9370 | 0.9348 | 0.9353
Similar sentence generation | 0.9352 | 0.9348 | 0.9348
Confidence screening | 0.9296 | 0.9266 | 0.9268
Table 9. Summary of Contributions.
Problems | Our Solutions
The need for accurate and efficient short text classification | A novel architecture for short text classification is proposed, and the model's performance is further optimized by improving the quality of the input data.
Current classification models struggle to capture meaningful features from short texts with limited information | A dual-path short text classification model combining the BERT pre-trained model, an attention mechanism, and the BiGRU model. The model considers both word-level and sentence-level feature information, allowing it to more fully extract deep semantic features and contextual information from limited data.
Difficulties caused by the traditional cross-entropy loss during training, such as insufficient focus on hard-to-learn samples | A hybrid loss function that combines cross-entropy loss with Hinge Loss, which helps the model better optimize the classification boundary during training and thereby improves classification accuracy.
Annotation errors, semantic ambiguity, and uneven distribution in the training data | A short text data optimization method that integrates data filtering and augmentation techniques, built on the constructed dual-path model, enhancing model performance from the perspective of the input data.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
