Article

Short Text Classification Based on Enhanced Word Embedding and Hybrid Neural Networks

Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao 266580, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5102; https://doi.org/10.3390/app15095102
Submission received: 7 April 2025 / Revised: 24 April 2025 / Accepted: 29 April 2025 / Published: 4 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In recent years, text classification has found wide application in diverse real-world scenarios. In Chinese news classification tasks, title text suffers from limitations such as sparse contextual information and semantic ambiguity. To improve the performance of short text classification, this paper proposes a Word2Vec-based enhanced word embedding method and presents the design of a dual-channel hybrid neural network architecture to effectively extract semantic features. Specifically, we introduce a novel weighting scheme, Term Frequency-Inverse Document Frequency Category Distribution Weight (TF-IDF-CDW), where Category Distribution Weight (CDW) reflects the distribution pattern of words across different categories. By weighting the pretrained Word2Vec vectors with TF-IDF-CDW and concatenating them with part-of-speech (POS) feature vectors, semantically enriched and more discriminative word embedding vectors are generated. Furthermore, we propose a dual-channel hybrid model based on a Gated Convolutional Neural Network (GCNN) and Bidirectional Long Short-Term Memory (BiLSTM), which jointly captures local features and long-range global dependencies. To evaluate the overall performance of the model, experiments were conducted on the Chinese short text datasets THUCNews and TNews. The proposed model achieved classification accuracies of 91.85% and 87.70%, respectively, outperforming several comparative models and demonstrating the effectiveness of the proposed method.

1. Introduction

With the rapid advancement of internet technologies, the volume of news text information has increased exponentially, characterized by real-time dissemination and highly fragmented content. The automatic classification of large-scale text data has emerged as a prominent research direction [1]. Text classification, a fundamental component of natural language processing (NLP), focuses on mapping textual data to a set of predefined labels [2]. It has found wide applications in the field of sentiment classification, spam detection, and public opinion monitoring [3,4,5].
The primary task of text classification is to convert text into a vector representation that is recognizable to computers. The quality of text representation directly influences the final classification results [6]. Traditional text representation methods, such as One-Hot Encoding, overlook grammatical and semantic relationships between words, resulting in high dimensionality and sparse matrices. Static word embedding models like Word2Vec are trained on a corpus to generate richer semantic representations by capturing the relationships between words within a contextual window [7]. However, Word2Vec overlooks the importance of part-of-speech (POS), word frequency, and category distribution information for classification results. Specifically, the model should focus more on category-specific terms, i.e., words that are concentrated in specific categories. To address this, we propose an enhanced word embedding method that weights Word2Vec vectors with the Term Frequency-Inverse Document Frequency Category Distribution Weight (TF-IDF-CDW) and integrates POS features to obtain more semantically enriched representation vectors. The TF-IDF-CDW weighting method builds on Term Frequency-Inverse Document Frequency (TF-IDF) by taking category distribution into consideration, allowing higher weights to be assigned to category-specific terms.
After converting the text into a computer-recognizable embedding vector, machine learning or deep learning models are needed to further learn the vector features and extract the deep semantic information within the text vector. Traditional machine learning methods perform adequately on simple tasks but are often insufficient for more complex scenarios. Deep learning approaches, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs), have shown superior capabilities in this domain [8,9,10]. CNNs effectively capture local patterns through convolutional filters, while RNNs are well suited for modeling sequential dependencies. To address the issues of gradient vanishing and explosion in traditional RNNs, researchers often use more-effective models such as Long Short-Term Memory (LSTM) networks or Gated Recurrent Units (GRUs) for text classification tasks [11,12]. To simultaneously capture local key features and global contextual dependencies, we propose a hybrid neural model based on a Gated Convolutional Neural Network (GCNN) and two-layer Bidirectional Long Short-Term Memory (BiLSTM).
The primary contributions of this paper can be summarized as follows:
  • An enhanced word embedding model based on pretrained Word2Vec is proposed. By incorporating Category Distribution Weight (CDW) into the refinement of TF-IDF, higher weights are assigned to category-specific terms. The TF-IDF-CDW-weighted Word2Vec embeddings, combined with POS features, allow the embedding vectors to emphasize key terms while minimizing the impact of noise.
  • A hybrid model is designed by combining a GCNN with two BiLSTM networks. The GCNN captures local semantic patterns through multiscale gated convolutions. The BiLSTM layer captures global contextual dependencies, and a residual connection mechanism is incorporated to preserve the integrity of semantic representations. Finally, an attention mechanism dynamically adjusts the weights of the dual channels, enhancing classification performance.
  • Experiments on the THUCNews and TNews datasets demonstrate that our model achieves accuracies of 91.85% and 87.70%, respectively, consistently outperforming several baseline models, thereby validating the effectiveness of both the enhanced embedding strategy and the hybrid neural architecture.
The structure of this paper is as follows: Section 2 provides an overview of prior studies on text representation and classification methods. Section 3 introduces the proposed embedding enhancement strategy and hybrid neural framework. Section 4 presents the experimental details and discusses the results. Section 5 summarizes the conclusions and outlines potential directions for future work.

2. Related Work

2.1. Text Representation

Text representation is a fundamental step in text classification, aiming to transform unstructured textual data into machine-interpretable semantic vectors. Research in this area has evolved from traditional statistical and rule-based approaches to advanced pretrained language models. While conventional methods are computationally efficient, they often fail to capture syntactic rules and deep semantic relationships. In contrast, pretrained models built on large-scale text corpora yield richer context-aware features that strengthen model generalization. Depending on whether the embeddings change with context, word embeddings are typically categorized into static and dynamic variants [13].
Mikolov et al. [7] introduced Word2Vec, an influential static embedding model comprising two learning schemes: Continuous Bag-of-Words (CBOW), which infers a word from its context, and Skip-gram, which does the reverse. However, both are limited to local windows and cannot capture global co-occurrence features. To address this limitation, Pennington et al. [14] proposed GloVe, which incorporates global co-occurrence statistics into the embedding process to enhance semantic representation.
Since static embeddings map words to fixed vectors regardless of context, they can sometimes lead to ambiguity. Peters et al. [15] proposed ELMo, which uses a two-layer BiLSTM trained with forward and backward language models to generate context-sensitive word representations. Although ELMo improves performance on various NLP tasks, its shallow output fusion constrains deeper contextual interaction. Devlin et al. [16] later introduced BERT, based on bidirectional Transformer encoders, enabling simultaneous modeling of left and right contexts. BERT adopted a pretraining–fine-tuning paradigm and attained top results on various NLP benchmarks. Building on this, Onan [17] proposed a dynamic BERT-based fusion framework that integrates BERT’s global semantics with graph-based local topological features through a seven-stage heterogeneous interaction process. This method achieves notable improvements in text classification by combining structural awareness and context modeling.
Despite the advantages of dynamic embeddings, static embeddings remain valuable in real-world applications due to their lower computational cost and faster inference. Numerous strategies have recently been explored to improve static vector representations. Zheng et al. [18] developed the Context-to-Vec framework, which infuses contextual information into static vectors and refines them using graph topology and synonym knowledge. Sun et al. [19] proposed a six-granularity modeling approach for Chinese characters to enrich semantic information in Word2Vec vectors. Li et al. [20] integrated word, character, and N-gram embeddings to improve fault classification in the aviation domain using CNN. Zhang et al. [21] introduced LDA2Vec, which uses Word2Vec to model topic–context relations and adjusts embeddings via topic probabilities to better reflect domain semantics. Empirical evidence supports the effectiveness of such enhancements. For instance, Wang et al. [22] designed LogUAD, an anomaly detection model that utilizes TF-IDF-weighted Word2Vec embeddings and achieved a significant 67.25% F1-score improvement over LogCluster. George et al. [23] showed that combining Word2Vec with TF-IDF outperformed both standalone Word2Vec and Doc2Vec in sentiment classification tasks.
It is worth noting that traditional TF-IDF does not account for the distribution of words across categories, resulting in lower weights being assigned to words that frequently appear in specific categories. To address this limitation, we first introduced the concept of CDW and combined it with TF-IDF to form a new weighting calculation method, TF-IDF-CDW. We then used TF-IDF-CDW to weight Word2Vec vectors and concatenate POS features to enhance the word embedding vectors. Table 1 shows a comparison between the improved word embedding method proposed in this paper and the text representation methods from the aforementioned literature.

2.2. Text Classification

The evolution of text classification algorithms highlights a clear paradigm shift from shallow models to deep learning approaches. Conventional machine learning approaches depend on manually engineered features and statistical representations, which are subsequently fed into classifiers [24]. Although these methods performed well in early applications, their ability to handle high-dimensional, sparse, and semantically complex data has proven to be increasingly limited with the advent of large-scale data. In response, deep learning has since emerged as the dominant paradigm for text classification, offering end-to-end hierarchical semantic modeling capabilities [25].
Johnson et al. [26] designed the Deep Pyramid Convolutional Neural Network (DPCNN), which introduces a novel deep pyramid structure to expand the receptive field through equal-length convolutions and residual connections, improving semantic abstraction while maintaining computational efficiency. Tan et al. [27] developed an adaptive convolutional model incorporating label embeddings to dynamically generate convolutional filters aligned with semantic labels, enhancing feature alignment and representation. Soni et al. [28] proposed a convolutional framework that encodes text into three-dimensional matrices and employs dual convolutional filters to capture semantic information at both intra-sentence and inter-sentence levels, achieving superior performance over other CNN models on five benchmark datasets. While CNNs excel at capturing local patterns, they are limited in modeling sequential dependencies and contextual semantics, which are better handled by recurrent models such as LSTM and GRUs. To further improve sequential modeling, Khataei et al. [29] proposed an SHO-LSTM model for multi-label classification, employing a Spotted Hyena Optimizer to tune LSTM weights for better convergence. Dou et al. [30] designed a memristor-based hardware acceleration solution for LSTM networks, significantly improving training efficiency. In sentiment analysis, Zulqarnain et al. [31] introduced a TS-GRU architecture that integrates feature-level attention into a two-stage GRU framework, enabling it to model intricate dependencies and improve prediction via sequential attention and decoding mechanisms.
To leverage both local and global semantic cues, hybrid deep learning models have been explored. Prabhakar et al. [32] introduced CABO, a composite architecture that integrates CNNs for local feature extraction with BiLSTM and attention mechanisms for contextual comprehension, achieving superior results over single-stream architectures. Duan et al. [33] presented a Transformer-based encoder–decoder framework for multi-label classification, integrating word embeddings, BiLSTM encoding, and Transformer modules to jointly model document relationships and label queries. Wu et al. [34] developed an XLNet–CNN–GRU hybrid model, combining static GloVe and contextual XLNet embeddings, which are processed by parallel CNN and GRU modules with attention mechanisms to improve classification. Zeng et al. [35] proposed an ensemble framework combining a GCN and a CNN, where a simplified boosting strategy retrains the CNN on GCN-misclassified samples, and classification is finalized via weighted voting. Chen et al. [36] presented a capsule-based architecture incorporating label embeddings and graph convolutional layers, effectively capturing inter-label correlations and spatial features. Sun et al. [37] developed a multi-channel convolution–capsule network hybrid that replaces pooling operations with capsule routing, preserving information integrity while extracting local features. Their model outperformed BERT in both long- and short-text classification tasks.
In summary, hybrid architectures are increasingly favored for their ability to capture complementary semantic representations. This paper proposes a dual-branch hybrid framework that combines a GCNN with a two-layer BiLSTM network. The GCNN branch performs dynamic local feature selection using multi-scale gated convolutions to suppress noise, while the BiLSTM branch incorporates residual connections to maintain semantic completeness during global sequence modeling. An attention-based fusion mechanism further integrates local and global features.

3. Methodology

This paper proposes a text classification method based on enhanced word embeddings and a hybrid neural network architecture, as illustrated in Figure 1. In the embedding layer, the words in each news headline are initially transformed into dense vector representations. Specifically, initial word embeddings are obtained using a pretrained Word2Vec model. To address the limitation of traditional TF-IDF, which does not consider the distribution of words across categories and thus assigns insufficient weights to category-specific terms, we introduce the concept of CDW. The TF-IDF-CDW value is computed for each vocabulary term to reflect its importance across different documents. Additionally, POS encoding is assigned to each word in the vocabulary to capture syntactic features. The final text representation is constructed by weighting the Word2Vec embedding with the TF-IDF-CDW value and concatenating it with the corresponding POS vector.
In the feature learning layer, we have constructed a hybrid neural network comprising a GCNN and a two-layer BiLSTM network. The word embedding vectors are fed into the classification model to further extract deep hidden features. The resulting feature vectors are subsequently passed through an attention mechanism to assign weights, then fed into a fully connected (FC) layer followed by a softmax activation to produce the final classification output.

3.1. Enhanced Word Embedding Based on Word2Vec

3.1.1. Word2Vec

Word2Vec is a widely used static embedding method that encodes words as dense vectors in a continuous semantic space, such that semantically related terms are positioned nearby. It employs two distinct training mechanisms: Skip-gram and CBOW. The Skip-gram approach aims to infer context words from a given center word, whereas CBOW predicts the central term by aggregating information from its surrounding words. In this work, we utilize the CBOW variant to construct word embeddings.
As shown in Figure 2, CBOW computes the probability of a center word $w_n$ given its surrounding context words $w_c$, based on the averaged embeddings of those context words. The objective function is formulated as in [7]:
$$p(w_n \mid w_c) = \frac{\exp(w_n \cdot h_n)}{\sum_{w \in corpus} \exp(w \cdot h_n)}$$
where $w_c$ denotes the context words, and $h_n$ represents the average embedding vector of the surrounding context window. The model is trained to predict the center word based on its surrounding context.
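For readers who want to reproduce this step, the sketch below shows how CBOW vectors of this kind can be trained with the gensim library. The toy corpus, window size, and epoch count are illustrative assumptions; the paper itself uses 300-dimensional vectors pretrained on the Chinese Wikipedia corpus (Section 4.2).

```python
# Minimal sketch: training CBOW word vectors with gensim on a toy segmented corpus.
# Only the 300-dimensional vector size matches the paper; other settings are assumed.
from gensim.models import Word2Vec

corpus = [
    ["股市", "大幅", "上涨"],   # each sentence is a list of already-segmented tokens
    ["球队", "赢得", "比赛"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=300,   # embedding dimension used in the paper
    window=5,          # context window size (assumed)
    sg=0,              # 0 = CBOW, 1 = Skip-gram
    min_count=1,
    epochs=10,
)

vec = model.wv["股市"]   # 300-dimensional dense vector for a word
```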

3.1.2. TF-IDF-CDW

TF-IDF is a commonly adopted term-weighting method in natural language processing, designed to evaluate the significance of a word in a specific document [38]. Term Frequency (TF) reflects how often a term $t_i$ occurs within a document $d_j$ and is computed as follows:
$$TF(t_i, d_j) = \frac{n_{i,j}}{|d_j|}$$
where $n_{i,j}$ denotes the frequency of term $t_i$ in document $d_j$, and $|d_j|$ is the total number of words in $d_j$. IDF evaluates the rarity of a term across the entire corpus. The formula is given as follows:
$$IDF(t_i) = \log\frac{|D|}{1 + |\{d_j : t_i \in d_j\}|}$$
where $|D|$ represents the overall count of documents within the corpus, and $|\{d_j : t_i \in d_j\}|$ is the count of documents containing term $t_i$. By multiplying TF and IDF, the TF-IDF score is obtained, indicating the significance of term $t_i$ in document $d_j$:
$$TF\text{-}IDF(t_i, d_j) = \frac{n_{i,j}}{|d_j|} \times \log\frac{|D|}{1 + |\{d_j : t_i \in d_j\}|}$$
However, conventional TF-IDF does not account for the distribution of terms across different categories. As a result, some terms that appear frequently within specific categories but rarely across others may receive inappropriately low weights. To address this limitation, we propose an enhanced metric called CDW. It is defined as follows:
$$CDW(t_i) = 1 + \frac{\sum_{c \in C} P(c \mid t_i) \log P(c \mid t_i)}{\log|C| + 1}$$
where $|C|$ is the total number of categories, and $P(c \mid t_i)$ represents the proportion of documents containing $t_i$ that belong to category $c$. This method is designed to suppress terms that are uniformly distributed across categories, while emphasizing those with category-specific preferences. CDW is particularly effective at identifying terms strongly associated with specific categories. By incorporating CDW, the model assigns greater weight to category-specific terms.
Finally, the TF-IDF-CDW score is computed by multiplying the TF-IDF value with the corresponding CDW value, thereby capturing both document-level and category-level term importance:
$$TF\text{-}IDF\text{-}CDW(t_i, d_j) = TF(t_i, d_j) \cdot IDF(t_i) \cdot CDW(t_i)$$
This formulation allows the model to assign greater importance to terms that contribute more significantly to class distinction, thereby improving the semantic representation quality and classification performance.
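The following sketch implements the TF, IDF, CDW, and TF-IDF-CDW definitions above on a toy labeled corpus; the corpus, labels, and variable names are ours and serve only to illustrate the computation.

```python
# Sketch of the TF-IDF-CDW weighting on a toy labeled corpus.
import math
from collections import Counter, defaultdict

docs = [["股市", "上涨", "新闻"], ["股市", "震荡"], ["球队", "比赛", "新闻"], ["比赛", "精彩"]]
labels = ["finance", "finance", "sports", "sports"]
categories = sorted(set(labels))

df = Counter()                 # document frequency of each term
cat_df = defaultdict(Counter)  # per-category document counts for each term
for tokens, c in zip(docs, labels):
    for t in set(tokens):
        df[t] += 1
        cat_df[t][c] += 1

def cdw(term):
    # CDW(t) = 1 + sum_c P(c|t) log P(c|t) / (log|C| + 1)
    total = df[term]
    entropy = sum((n / total) * math.log(n / total) for n in cat_df[term].values())
    return 1.0 + entropy / (math.log(len(categories)) + 1.0)

def tf_idf_cdw(term, tokens):
    tf = tokens.count(term) / len(tokens)          # TF(t_i, d_j)
    idf = math.log(len(docs) / (1 + df[term]))     # IDF(t_i)
    return tf * idf * cdw(term)                    # TF-IDF-CDW(t_i, d_j)

# "股市" occurs only in finance documents, while "新闻" is spread evenly across
# categories, so "股市" receives a higher CDW (1.0 vs. roughly 0.59 here).
print(cdw("股市"), cdw("新闻"))
print(tf_idf_cdw("股市", docs[0]))
```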

3.1.3. Part-of-Speech Encoding

POS serves as a core syntactic indicator in linguistic analysis, representing the grammatical role of a word within a sentence. Incorporating POS information can improve the model’s comprehension of grammatical structure and semantic context. In particular, in text classification tasks, certain POS tags—such as nouns and verbs—tend to be more helpful for classification results. Thus, incorporating POS features enhances the model’s syntactic awareness and improves classification performance.
In this paper, we adopt a random vector-based POS encoding strategy. Specifically, each distinct POS category is assigned a random 10-dimensional vector, which is fine-tuned through training. For every word in the text, its corresponding POS category is mapped to this fixed-dimensional vector, enabling the model to capture latent syntactic and semantic cues associated with different POS categories.
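A minimal sketch of this encoding with PyTorch is shown below: each of the 12 retained POS categories (see Section 4.1) is mapped to a randomly initialized, trainable 10-dimensional vector via nn.Embedding. The tag names are our shorthand for the categories listed later.

```python
# Sketch: trainable 10-dimensional POS embeddings, initialized randomly and
# fine-tuned during training. Tag names are shorthand for the 12 retained categories.
import torch
import torch.nn as nn

POS_TAGS = ["noun", "quantifier", "adverb", "verb", "geo_name", "adjective",
            "english", "other_proper_noun", "person_name", "idiom",
            "set_phrase", "other"]
pos2id = {tag: i for i, tag in enumerate(POS_TAGS)}

pos_embedding = nn.Embedding(num_embeddings=len(POS_TAGS), embedding_dim=10)

tags = torch.tensor([pos2id["noun"], pos2id["verb"]])
pos_vectors = pos_embedding(tags)   # shape: (2, 10)
```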

3.1.4. Embedding Vector Generation

We first apply the Jieba tool to perform word segmentation on short text inputs, resulting in a token sequence $W_{1:n} = \{w_1, w_2, \ldots, w_n\}$. Each word is then mapped to a 300-dimensional initial embedding vector $PV_{1:n} = \{pv_1, pv_2, \ldots, pv_n\}$ using a pretrained Word2Vec model. Subsequently, we compute the TF-IDF-CDW weight for each word, $TIC_{1:n} = \{tic_1, tic_2, \ldots, tic_n\}$, as described in Equation (6). POS encoding vectors $POS_{1:n} = \{pos_1, pos_2, \ldots, pos_n\}$ are derived using the approach in Section 3.1.3.
Ultimately, the enhanced embedding vector $V_{1:n}$ for each word is constructed by scaling the Word2Vec vector $pv_i$ with its corresponding TF-IDF-CDW weight $tic_i$ and concatenating it with the 10-dimensional POS encoding vector $pos_i$, yielding a 310-dimensional representation:
$$V_{1:n} = \{[pv_1 \cdot tic_1; pos_1], [pv_2 \cdot tic_2; pos_2], \ldots, [pv_n \cdot tic_n; pos_n]\}$$
The newly generated enhanced word embedding vectors reflect the importance differences of the same word across different texts and also capture POS information. The generated embedding matrix highlights key semantic elements while reducing noise interference in the classification process.
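Putting the pieces together, the sketch below builds the 310-dimensional enhanced embeddings for one headline. It assumes the `w2v` model, `tf_idf_cdw` function, and `pos_embedding`/`pos2id` objects sketched in the previous subsections; the Jieba-flag-to-category mapping is deliberately simplified and OOV handling is omitted.

```python
# Sketch: scale each Word2Vec vector by its TF-IDF-CDW weight and concatenate the
# 10-dimensional POS vector, yielding one 310-dimensional vector per token.
# Assumes `w2v`, `tf_idf_cdw`, `pos_embedding`, and `pos2id` from the sketches above.
import jieba.posseg as pseg
import torch

def encode_headline(text):
    pairs = [(p.word, p.flag) for p in pseg.cut(text)]   # Jieba word/POS-flag pairs
    words = [w for w, _ in pairs]
    vectors = []
    for word, flag in pairs:
        w_vec = torch.tensor(w2v.wv[word])                   # 300-d vector (OOV handling omitted)
        weight = tf_idf_cdw(word, words)                     # scalar TF-IDF-CDW weight
        tag = "noun" if flag.startswith("n") else "other"    # simplified flag-to-category mapping
        p_vec = pos_embedding(torch.tensor([pos2id[tag]])).squeeze(0)  # 10-d POS vector
        vectors.append(torch.cat([weight * w_vec, p_vec]))   # 310-d enhanced embedding
    return torch.stack(vectors)                              # shape: (seq_len, 310)
```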

3.2. Hybrid Neural Network Based on a GCNN and Two-Layer BiLSTM

3.2.1. Gated Convolutional Neural Network

The core objective of the GCNN is to capture local features between words by applying convolutional kernels of varying sizes [39]. The gating mechanism is then used to regulate feature activation, enabling the model to effectively suppress noise and extract salient patterns from the text. The GCNN consists of convolutional kernels, a gating module, a pooling operation, and an FC layer, with its architecture illustrated in Figure 3.
Let $X_i$ denote the local word embedding matrix starting at position $i$, formed by a sliding window of size $h$ across the input sequence:
$$X_i = [V_i; V_{i+1}; \ldots; V_{i+h-1}]$$
The gated convolutional layer conducts two parallel convolution operations: one branch applies a standard convolution to extract raw features, while the other generates gate values using a sigmoid-activated convolution. The outputs from both branches are defined as follows:
$$F_i = f(W_f X_i + b_f)$$
$$G_i = \sigma(W_g X_i + b_g)$$
where $W_f$ and $W_g$ are weight matrices, $b_f$ and $b_g$ are bias terms, $f(\cdot)$ is a nonlinear activation function, and $\sigma(\cdot)$ denotes the sigmoid function. The final gated output is produced by element-wise multiplication between the two paths:
$$C_i = F_i \odot G_i$$
Since feature maps produced by different convolutional kernels vary in length, max pooling is applied over each feature map to obtain a fixed-size representation. This operation extracts the most prominent feature value from each map, thereby facilitating robust multi-scale feature fusion:
$$P_h = \mathrm{Max}(C_1, C_2, \ldots, C_{n-h+1})$$
Finally, outputs from multiple convolution kernels with different widths are concatenated to form a unified fixed-length vector, which constitutes the final output of the GCNN branch:
$$Out_{GCNN} = [P_{h_1}, P_{h_2}, \ldots, P_{h_k}]$$
where $h_1, h_2, \ldots, h_k$ represent the different convolution kernel widths.
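A PyTorch sketch of this branch is given below, using the settings from Table 4 (kernel widths {2, 3}, 128 kernels each); the choice of ReLU for the nonlinear activation $f(\cdot)$ and the class name are our assumptions.

```python
# Sketch of the GCNN branch: parallel gated convolutions at two kernel widths,
# max pooling over time, and concatenation of the pooled features.
import torch
import torch.nn as nn

class GCNNBranch(nn.Module):
    def __init__(self, embed_dim=310, num_kernels=128, widths=(2, 3)):
        super().__init__()
        self.feature_convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_kernels, kernel_size=h) for h in widths])
        self.gate_convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_kernels, kernel_size=h) for h in widths])
        self.act = nn.ReLU()                  # nonlinear activation f(.), choice assumed

    def forward(self, x):                     # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, seq_len)
        pooled_features = []
        for conv_f, conv_g in zip(self.feature_convs, self.gate_convs):
            feat = self.act(conv_f(x))        # F = f(W_f X + b_f)
            gate = torch.sigmoid(conv_g(x))   # G = sigmoid(W_g X + b_g)
            gated = feat * gate               # C = F ⊙ G
            pooled_features.append(gated.max(dim=2).values)   # P_h: one value per kernel
        return torch.cat(pooled_features, dim=1)              # Out_GCNN: (batch, 256)
```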

3.2.2. Two-Layer BiLSTM with Residual Connections

LSTM networks extend a traditional RNN by introducing specialized gates that manage the flow and preservation of information across time steps, thereby improving the modeling of long-term dependencies in sequential data. Each LSTM cell comprises an input gate, forget gate, output gate, and a memory cell that together coordinate what information to retain, discard, or expose. The input gate filters incoming signals, the forget gate removes irrelevant historical data, and the output gate determines the exposure of the updated memory content. This gated architecture enables LSTM to better address issues like vanishing or exploding gradients and significantly enhances its ability to learn semantic relationships over extended text sequences.
To strengthen the modeling of bidirectional context, we employ a BiLSTM architecture composed of two LSTM layers operating in opposite temporal directions: one encoding the input sequence from left to right, and the other from right to left, as illustrated in Figure 4. The forward and backward hidden states are combined to produce a rich representation encompassing both historical and future context. Given an input sequence $v_t$, the bidirectional hidden representations at time step $t$ are obtained through the following computation [40]:
$$h_f = \overrightarrow{LSTM}(v_t, h_{t-1})$$
$$h_b = \overleftarrow{LSTM}(v_t, h_{t+1})$$
The final hidden representation is formed by merging the outputs from both the forward and backward passes:
$$H_t = [h_f, h_b]$$
The output sequence $H_t$ produced by the BiLSTM captures bidirectional contextual features, thereby enabling richer semantic representation. To further extract higher-level, more abstract semantic features, a second BiLSTM layer is stacked on top of the first. To preserve the original embedding information and mitigate degradation caused by deep stacking, we introduce a residual connection by concatenating the original input $V$ with the output $H$ of the first BiLSTM layer [41]:
$$R = [H, V]$$
The concatenated result $R$ is then passed to the second BiLSTM layer. The final global feature vector, denoted as $Out_{BiLSTM}$, is formed by merging the terminal hidden states generated from the forward and backward directions of the second BiLSTM layer.
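The sketch below mirrors this design in PyTorch: the first BiLSTM's output $H$ is concatenated with the original input $V$ as a residual connection, fed to a second BiLSTM, and the final forward and backward hidden states are merged. The hidden size of 128 follows Table 4; the class name is ours.

```python
# Sketch of the two-layer BiLSTM branch with a residual concatenation R = [H, V].
import torch
import torch.nn as nn

class BiLSTMBranch(nn.Module):
    def __init__(self, embed_dim=310, hidden=128):
        super().__init__()
        self.bilstm1 = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.bilstm2 = nn.LSTM(2 * hidden + embed_dim, hidden,
                               batch_first=True, bidirectional=True)

    def forward(self, v):                          # v: (batch, seq_len, embed_dim)
        h, _ = self.bilstm1(v)                     # H: (batch, seq_len, 2*hidden)
        r = torch.cat([h, v], dim=2)               # residual concatenation R = [H, V]
        _, (h_n, _) = self.bilstm2(r)              # h_n: (2, batch, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=1)  # Out_BiLSTM: (batch, 2*hidden)
```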

3.2.3. Fully Connected Layer

In order to capture and combine fine-grained and contextual semantic information, we concatenate the local representations $Out_{GCNN}$ learned by the GCNN branch and the global contextual features $Out_{BiLSTM}$ obtained from the BiLSTM branch:
$$Out = [Out_{GCNN}, Out_{BiLSTM}]$$
To further emphasize informative components, we utilize the Bahdanau attention mechanism to adaptively reweight the concatenated feature representation. This attention mechanism first applies a non-linear transformation using the tanh function to compute the importance score of each feature dimension, followed by a softmax operation to normalize the scores into attention weights [42]:
$$\alpha = \mathrm{softmax}(V^{\mathrm{T}} \tanh(W \cdot Out + b))$$
where $W$ and $V$ are learnable weight matrices, and $b$ is a bias term. The final semantic representation of the input text is computed as follows:
$$V_{out} = Out \odot \alpha$$
Finally, the softmax function is applied to generate the final classification output:
$$label = \arg\max(\mathrm{softmax}(V_{out}))$$
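A sketch of this fusion head is shown below. It treats the attention as a per-dimension reweighting of the concatenated feature vector, as described above, and adds a fully connected layer that produces the class logits; the attention dimension and class name are our assumptions.

```python
# Sketch of the fusion head: Bahdanau-style attention reweights the concatenated
# GCNN/BiLSTM features, then an FC layer produces class logits.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes, attn_dim=128):
        super().__init__()
        self.W = nn.Linear(feat_dim, attn_dim)               # W·Out + b
        self.v = nn.Linear(attn_dim, feat_dim, bias=False)   # V^T tanh(...)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, out_gcnn, out_bilstm):
        out = torch.cat([out_gcnn, out_bilstm], dim=1)       # Out = [Out_GCNN, Out_BiLSTM]
        alpha = torch.softmax(self.v(torch.tanh(self.W(out))), dim=1)  # attention weights
        v_out = out * alpha                                  # V_out = Out ⊙ alpha
        return self.fc(v_out)                                # class logits

# Inference: label = argmax(softmax(logits)); training applies cross-entropy to the logits.
```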

4. Experiment

4.1. Datasets and Preprocessing

To assess how well the proposed model performs, we utilized two publicly available Chinese news headline datasets:
  • THUCNews [43]: This dataset was sourced from news headlines published on the Sina News platform. To ensure uniform input length, samples with character lengths between 10 and 30 were selected, resulting in a dataset of 100,000 entries evenly distributed across 10 categories.
  • TNews [44]: This dataset, collected from the Toutiao news app, was also filtered using the same length criterion. A total of 130,000 samples were selected, evenly covering 13 categories.
Table 2 presents the descriptive statistics of the two datasets. Each dataset was partitioned into training, validation, and testing subsets following an 8:1:1 ratio.
During data preprocessing, irrelevant special characters were removed, while all Chinese characters and English letters were retained. Word segmentation and POS tagging were performed using the Jieba tool. Jieba can recognize over 40 POS categories. We retained 11 commonly used types, while infrequent POS tags were grouped into a unified #other category. The selected POS categories included: #noun, #quantifier, #adverb, #verb, #geographical name, #adjective, #English word, #other proper noun, #person name, #idiom, #set phrase, and #other.
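A sketch of this preprocessing step is given below; the regular expression and the Jieba-flag-to-category mapping are illustrative reconstructions rather than the exact rules used in the paper.

```python
# Sketch of the preprocessing step: keep Chinese characters and English letters,
# segment and POS-tag with Jieba, and fold all other flags into "#other".
import re
import jieba.posseg as pseg

KEPT_FLAGS = {"n": "noun", "q": "quantifier", "d": "adverb", "v": "verb",
              "ns": "geographical name", "a": "adjective", "eng": "English word",
              "nz": "other proper noun", "nr": "person name", "i": "idiom",
              "l": "set phrase"}

def preprocess(headline):
    cleaned = re.sub(r"[^\u4e00-\u9fa5A-Za-z]", "", headline)   # keep Chinese + English letters
    pairs = []
    for p in pseg.cut(cleaned):
        category = KEPT_FLAGS.get(p.flag, "other")              # 11 kept tags + #other
        pairs.append((p.word, category))
    return pairs

print(preprocess("皇马3-0大胜巴萨!"))
```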

4.2. Experimental Environment and Hyperparameter Settings

Table 3 shows the experimental environment of this study.
Table 4 shows the hyperparameter settings for the model proposed in this paper. The pretrained Word2Vec word embeddings had a dimension of 300, which were trained on the Chinese Wikipedia corpus. The POS embedding dimension was set to 10, which is sufficient to distinguish 12 different parts of speech. A larger POS dimension would likely introduce noise that could interfere with the 300-dimensional word embeddings. Additionally, the hyperparameters for the GCNN and BiLSTM models were determined through a grid search method. The final model parameters, obtained after multiple iterations, are listed in Table 4.
During training, the discrepancy between predicted and actual labels was measured for each sample, and model parameters were updated via backpropagation. In this study, cross-entropy loss was employed as the optimization criterion due to its effectiveness in classification tasks. The formulation is given as follows:
$$loss = \sum_i q(i) \times \log\frac{1}{p(i)}$$
where $p(i)$ denotes the predicted probability distribution, and $q(i)$ is the true label distribution. For optimization, the Adam algorithm was applied due to its effectiveness in numerous deep learning applications. Adam integrates the benefits of Momentum and RMSProp, dynamically adjusting learning rates to enhance convergence over traditional stochastic gradient descent. This helps to accelerate convergence while mitigating oscillations and instabilities.
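The corresponding training step is sketched below with PyTorch's built-in cross-entropy loss and Adam optimizer; `model` and `train_loader` stand in for the full network and data pipeline described above, and the learning rate is an assumed value not reported in Table 4.

```python
# Sketch of one training loop: cross-entropy loss optimized with Adam.
# `model` (producing class logits) and `train_loader` are assumed to exist.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                            # combines log-softmax and NLL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # lr is an assumed value

for epoch in range(10):                                # 10 epochs, as in Table 4
    for embeddings, labels in train_loader:            # embeddings: (batch, seq_len, 310)
        logits = model(embeddings)                     # (batch, num_classes)
        loss = criterion(logits, labels)               # loss = sum_i q(i) log(1/p(i))
        optimizer.zero_grad()
        loss.backward()                                # backpropagation
        optimizer.step()                               # Adam parameter update
```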

4.3. Experimental and Result Analysis

4.3.1. Comparative Experiments on News Datasets

To assess the effectiveness of the proposed model, we conducted comparative experiments against seven representative baseline models: TextCNN [8], BiLSTM-Attention [45], GCNN [39], DPCNN [26], FastText [46], CNN-BiLSTM [47], and CNN-BiGRU-Attention [48]. All models were trained under identical conditions, including the training parameters and data preprocessing strategies, to ensure a fair comparison. For consistency, the hidden size of the BiGRU layers was set to 128, the same as that used in BiLSTM. Accuracy and F1-score were adopted as evaluation metrics to comprehensively measure classification performance. Table 5 reports the classification outcomes of our model versus baseline approaches on the two benchmark datasets.
Table 5 demonstrates that the model proposed in this paper achieved the highest accuracy and F1-score on both datasets, surpassing all baseline methods. Specifically, it obtained 91.85% accuracy and an F1-score of 91.84% for THUCNews, while reaching 87.70% accuracy and an F1-score of 87.72% for TNews. Based on these outcomes, several key insights can be derived:
  • Our model outperforms other baseline models in terms of both accuracy and F1-score. The experimental results validate that word representations enriched with TF-IDF-CDW weighting and POS features provide more comprehensive semantic cues, and that the hybrid model based on a GCNN and BiLSTM can learn more comprehensive semantic information.
  • The hybrid models CNN-BiLSTM and CNN-BiGRU-Attention outperform other single-architecture models on both datasets. CNN-BiGRU-Attention achieves an accuracy of 91.38% for THUCNews, 1.75% higher than the TextCNN model. CNN-BiLSTM attains an accuracy of 87.22% for TNews, which is 2.05% higher than BiLSTM-Attention. This verifies that the hybrid model’s ability to simultaneously capture local features and contextual information can improve classification performance.
  • The GCNN achieves higher accuracy and F1 scores than TextCNN across both datasets, confirming that the introduction of a gating mechanism in the convolutional layers is more effective for extracting key features.
The bar charts in Figure 5 further illustrate the performance disparities across categories for the THUCNews and TNews datasets. For the THUCNews dataset, the classification accuracy ranges from 85.3% to 96.8%. Among them, the entertainment category achieved the highest accuracy of 96.8%, whereas the science category had the lowest accuracy of 85.3%. For the TNews dataset, the classification accuracy varies from 81.2% to 93.2%, with the sports and real estate categories achieving the highest classification accuracy at 93.2%. The finance and science categories showed the lowest accuracy rates, at 81.3% and 81.2%, respectively. The science category in both datasets exhibits relatively low classification accuracy due to the prevalence of specialized and emerging terms in the field of science, which hinder the model's ability to perform accurate word segmentation or retrieve corresponding representations from pretrained embeddings.
To further clarify the classification results, Figure 6 illustrates the confusion matrix generated by the model for the THUCNews testing set. From the matrix, it is evident that the model performed well in classifying the sports and entertainment categories, achieving the highest accuracy in these. Moreover, a total of 52 samples from the finance category were misclassified as stock, and 43 stock samples were incorrectly labeled as finance. This is mainly due to the high semantic similarity and intersection between the finance and stock categories. In some short texts, the model failed to capture sufficient information to effectively distinguish between finance and stock.
Figure 7 displays the confusion matrix for the TNews test set. From the confusion matrix, it can be observed that the finance and science categories are frequently misclassified as each other. At the same time, these two categories are also the ones with the lowest accuracy. In addition, the military and world categories also exhibit a high degree of misclassification between them.

4.3.2. Comparative Experiments of Different Embedding Methods

To compare the performance differences between static and dynamic word embeddings, we conducted a detailed comparison of our model with DPCNN and BERT using the THUCNews dataset. The BERT model used 768 hidden units. The evaluation metrics were accuracy, F1 score, training time, inference time, and model size. The experimental results are shown in Table 6.
The experimental results show that although the accuracy and F1-score of the model presented in this paper are slightly lower than those of the BERT model, BERT has significantly longer training and inference times and a larger model size. This makes BERT less optimal when computational resources are limited, and its deployment cost is higher than that of static word embedding models. In contrast, the model proposed in this paper demonstrates significant advantages in terms of cost and efficiency. Specifically, the training time of the proposed model is 348 s, much lower than the 1413 s required by BERT, and the inference speed is approximately five times faster than that of BERT. Additionally, the model size is greatly reduced to only 49 MB, making it more suitable for rapid deployment and efficient computation in practical applications.

4.3.3. Cross-Domain Performance on Medical Dataset

To further evaluate the cross-domain and generalization performance of the proposed model, we extended the experiment to the Chinese Medical Dialogue Data (CMDD) dataset [49]. The dataset contains question-and-answer short texts from six categories of departments, with each category including 10,000 short titles. The same hyperparameters and preprocessing methods were used in the experiments. Table 7 presents the classification results of various models for the medical dataset.
The proposed model continues to demonstrate excellent performance when applied to the medical domain. For the CMDD dataset, the model achieved a classification accuracy of 88.22% and an F1-score of 88.27%, outperforming other baseline models. Through this experiment, we validated the effectiveness of the proposed model for cross-domain datasets. Even for classification tasks in the medical field, the model based on enhanced word embeddings and hybrid neural networks still performs effectively in text classification.

4.4. Ablation Experiment

To evaluate the individual impact of each module on the overall performance, ablation studies were carried out on the THUCNews and TNews datasets. In the ablation study, specific components were selectively removed while keeping the rest intact, allowing us to examine their individual influence on classification performance. The outcomes, summarized in Table 8, highlight the contribution of each module to the overall accuracy of the model.
As shown in Table 8, any modification to the word embedding module results in performance degradation. When both the TF-IDF-CDW and POS encoding modules are removed, using only the pretrained Word2Vec embeddings causes accuracy to drop by 0.33% and 0.50% for the two datasets, respectively. These results confirm that incorporating TF-IDF-CDW-weighted Word2Vec vectors with POS information improves the semantic quality of the text representations. Moreover, removing either the GCNN or BiLSTM layer results in a notable decline in classification performance. The experimental results show that, by introducing a gating mechanism, the GCNN layer significantly strengthens the model's capacity to recognize local semantic dependencies, while the BiLSTM layer captures long-distance dependencies in short texts.

5. Conclusions

This paper focuses on optimizing word embedding semantic representations and innovating classification model architectures in text classification tasks. We propose a classification model based on improved word embeddings and a hybrid neural network. To address the limitations of traditional Word2Vec embeddings, which do not consider word frequency, POS, or the varying importance of the same word in different documents, we have designed a novel weighted word vector generation algorithm. By weighting Word2Vec with TF-IDF-CDW, the word embeddings can better highlight key terms in the text, improving the category discrimination of feature words. In the feature learning stage, we have constructed a hybrid dual-channel architecture combining a GCNN and a two-layer BiLSTM, which realizes the complementary advantages of local features and contextual dependencies and assigns different weights to dual-channel semantic information using an attention mechanism.
The effectiveness of our method was validated using two publicly available datasets. Experimental findings demonstrate that it consistently surpasses several baselines across both datasets. Specifically, for the THUCNews dataset, the model achieved 91.85% accuracy and an F1-score of 91.84%, indicating that the integration of enhanced word embeddings and a hybrid neural structure can significantly boost classification performance. For the TNews dataset, it obtained 87.70% accuracy and an F1-score of 87.72%, further confirming the model’s robustness and adaptability across different textual domains.
Although the proposed model performs well overall, the confusion matrices in the experimental results show higher misclassification rates for categories with substantial semantic overlap, such as finance and stocks. Future research may explore constructing a hierarchical classification framework to reduce misclassification between adjacent domains. Additionally, the number of samples in some specialized domains is limited. Future research could explore the model’s performance in specialized fields by combining data augmentation strategies and adjusting the model to adapt to different domains.

Author Contributions

Conceptualization, Z.X. and C.L.; methodology, Z.X. and C.L.; software, Z.X.; validation, Z.X.; formal analysis, Z.X. and H.W.; data curation, Z.X. and H.W.; writing—original draft preparation, Z.X.; writing—review and editing, Z.X. and C.L.; visualization, Z.X.; supervision, C.L.; project administration, C.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This paper uses public datasets for experiments. THUCNews is available at https://github.com/thunlp/THUCTC (accessed on 3 March 2025). TNews is available at https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset (accessed on 3 March 2025). Chinese Medical Dialogue Data is available at https://github.com/Toyhom/Chinese-medical-dialogue-data (accessed on 24 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Duarte, J.M.; Berton, L. A Review of Semi-Supervised Learning for Text Classification. Artif. Intell. Rev. 2023, 56, 9401–9469. [Google Scholar] [CrossRef]
  2. Wang, K.; Ding, Y.; Han, S.C. Graph Neural Networks for Text Classification: A Survey. Artif. Intell. Rev. 2024, 57, 190. [Google Scholar] [CrossRef]
  3. Huang, Y.; Liu, Q.; Peng, H.; Wang, J.; Yang, Q.; Orellana-Martín, D. Sentiment Classification Using Bidirectional LSTM-SNP Model and Attention Mechanism. Expert Syst. Appl. 2023, 221, 119730. [Google Scholar] [CrossRef]
  4. Bountakas, P.; Xenakis, C. HELPHED: Hybrid Ensemble Learning PHishing Email Detection. J. Netw. Comput. Appl. 2023, 210, 103545. [Google Scholar] [CrossRef]
  5. Zhou, Z.; Zhou, X.; Qian, L. Online Public Opinion Analysis on Infrastructure Megaprojects: Toward an Analytical Framework. J. Manag. Eng. 2021, 37, 04020105. [Google Scholar] [CrossRef]
  6. Xiao, L.; Li, Q.; Ma, Q.; Shen, J.; Yang, Y.; Li, D. Text Classification Algorithm of Tourist Attractions Subcategories with Modified TF-IDF and Word2Vec. PLoS ONE 2024, 19, e0305095. [Google Scholar] [CrossRef]
  7. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  8. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1746–1751. [Google Scholar]
  9. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  10. Yao, L.; Mao, C.; Luo, Y. Graph Convolutional Networks for Text Classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7370–7377. [Google Scholar] [CrossRef]
  11. Pavan Kumar, M.R.; Jayagopal, P. Context-Sensitive Lexicon for Imbalanced Text Sentiment Classification Using Bidirectional LSTM. J. Intell. Manuf. 2023, 34, 2123–2132. [Google Scholar] [CrossRef]
  12. Zhang, X.; Wu, Z.; Liu, K.; Zhao, Z.; Wang, J.; Wu, C. Text Sentiment Classification Based on BERT Embedding and Sliced Multi-Head Self-Attention Bi-GRU. Sensors 2023, 23, 1481. [Google Scholar] [CrossRef]
  13. Wang, Y.; Hou, Y.; Che, W.; Liu, T. From Static to Dynamic Word Representations: A Survey. Int. J. Mach. Learn. Cybern. 2020, 11, 1611–1630. [Google Scholar] [CrossRef]
  14. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543. [Google Scholar]
  15. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Walker, M., Ji, H., Stent, A., Eds.; Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 2227–2237. [Google Scholar]
  16. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
  17. Onan, A. Hierarchical Graph-Based Text Classification Framework with Contextual Node Embedding and BERT-Based Dynamic Fusion. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101610. [Google Scholar] [CrossRef]
  18. Zheng, J.; Wang, Y.; Wang, G.; Xia, J.; Huang, Y.; Zhao, G.; Zhang, Y.; Li, S. Using Context-to-Vector with Graph Retrofitting to Improve Word Embeddings. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Muresan, S., Nakov, P., Villavicencio, A., Eds.; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 8154–8163. [Google Scholar]
  19. Sun, X.; Liu, Z.; Huo, X. Six-Granularity Based Chinese Short Text Classification. IEEE Access 2023, 11, 35841–35852. [Google Scholar] [CrossRef]
  20. Li, H.; Li, X.; Yang, P.; Zhou, Z.; Cheng, L.; Hong, D. Avionics Fault Classification Based on Improved Word2Vec Word Embedding. In Proceedings of the Third International Symposium on Computer Applications and Information Systems (ISCAIS 2024), Wuhan, China, 22–24 March 2024; Volume 13210, pp. 842–849. [Google Scholar]
  21. Zhang, T.; Cui, W.; Liu, X.; Jiang, L.; Li, J. Research on Topic Evolution Path Recognition Based on LDA2vec Symmetry Model. Symmetry 2023, 15, 820. [Google Scholar] [CrossRef]
  22. Wang, J.; Zhao, C.; He, S.; Gu, Y.; Alfarraj, O.; Abugabah, A. LogUAD: Log Unsupervised Anomaly Detection Based on Word2Vec. Comput. Syst. Sci. Eng. 2022, 41, 1207. [Google Scholar] [CrossRef]
  23. George, M.; Murugesan, R. Improving Sentiment Analysis of Financial News Headlines Using Hybrid Word2Vec-TFIDF Feature Extraction Technique. Procedia Comput. Sci. 2024, 244, 1–8. [Google Scholar] [CrossRef]
  24. Minaee, S.; Kalchbrenner, N.; Cambria, E.; Nikzad, N.; Chenaghlu, M.; Gao, J. Deep Learning-Based Text Classification: A Comprehensive Review. ACM Comput. Surv. 2022, 54, 1–40. [Google Scholar] [CrossRef]
  25. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–41. [Google Scholar] [CrossRef]
  26. Johnson, R.; Zhang, T. Deep Pyramid Convolutional Neural Networks for Text Categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 562–570. [Google Scholar]
  27. Tan, C.; Ren, Y.; Wang, C. An Adaptive Convolution with Label Embedding for Text Classification. Appl. Intell. 2023, 53, 804–812. [Google Scholar] [CrossRef]
  28. Soni, S.; Chouhan, S.S.; Rathore, S.S. TextConvoNet: A Convolutional Neural Network Based Architecture for Text Classification. Appl. Intell. 2023, 53, 14249–14268. [Google Scholar] [CrossRef] [PubMed]
  29. Khataei Maragheh, H.; Gharehchopogh, F.S.; Majidzadeh, K.; Sangar, A.B. A New Hybrid Based on Long Short-Term Memory Network with Spotted Hyena Optimization Algorithm for Multi-Label Text Classification. Mathematics 2022, 10, 488. [Google Scholar] [CrossRef]
  30. Dou, G.; Zhao, K.; Guo, M.; Mou, J. Memristor-Based LSTM Network for Text Classification. Fractals 2023, 31, 2340040. [Google Scholar] [CrossRef]
  31. Zulqarnain, M.; Ghazali, R.; Aamir, M.; Hassim, Y.M.M. An Efficient Two-State GRU Based on Feature Attention Mechanism for Sentiment Analysis. Multimed. Tools Appl. 2024, 83, 3085–3110. [Google Scholar] [CrossRef]
  32. Prabhakar, S.K.; Rajaguru, H.; Won, D.-O. Performance Analysis of Hybrid Deep Learning Models with Attention Mechanism Positioning and Focal Loss for Text Classification. Sci. Program. 2021, 2021, 2420254. [Google Scholar] [CrossRef]
  33. Duan, L.; You, Q.; Wu, X.; Sun, J. Multilabel Text Classification Algorithm Based on Fusion of Two-Stream Transformer. Electronics 2022, 11, 2138. [Google Scholar] [CrossRef]
  34. Wu, D.; Wang, Z.; Zhao, W. XLNet-CNN-GRU Dual-Channel Aspect-Level Review Text Sentiment Classification Method. Multimed. Tools Appl. 2024, 83, 5871–5892. [Google Scholar] [CrossRef]
  35. Zeng, F.; Chen, N.; Yang, D.; Meng, Z. Simplified-Boosting Ensemble Convolutional Network for Text Classification. Neural Process. Lett. 2022, 54, 4971–4986. [Google Scholar] [CrossRef]
  36. Chen, Z.; Li, S.; Ye, L.; Zhang, H. Multi-Label Classification of Legal Text Based on Label Embedding and Capsule Network. Appl. Intell. 2023, 53, 6873–6886. [Google Scholar] [CrossRef]
  37. Sun, G.; Cheng, Y.; Zhang, Z.; Tong, X.; Chai, T. Text Classification with Improved Word Embedding and Adaptive Segmentation. Expert Syst. Appl. 2024, 238, 121852. [Google Scholar] [CrossRef]
  38. Salton, G.; Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef]
  39. Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 933–941. [Google Scholar]
  40. Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; Xu, B. Text Classification Improved by Integrating Bidirectional LSTM with Two-Dimensional Max Pooling. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; Matsumoto, Y., Prasad, R., Eds.; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 3485–3495. [Google Scholar]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  43. THUCNews. Available online: https://github.com/thunlp/THUCTC (accessed on 3 March 2025).
  44. TNews. Available online: https://github.com/aceimnorstuvwxz/toutiao-text-classfication-dataset (accessed on 3 March 2025).
  45. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Berlin, Germany, 2016; pp. 207–212. [Google Scholar]
  46. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers; Association for Computational Linguistics: Valencia, Spain, 2017; pp. 427–431. [Google Scholar]
  47. Mutabazi, E.; Ni, J.; Tang, G.; Cao, W. An Improved Model for Medical Forum Question Classification Based on CNN and BiLSTM. Appl. Sci. 2023, 13, 8623. [Google Scholar] [CrossRef]
  48. Li, X.; Zhang, Y.; Jin, J.; Sun, F.; Li, N.; Liang, S. A Model of Integrating Convolution and BiGRU Dual-Channel Mechanism for Chinese Medical Text Classifications. PLoS ONE 2023, 18, e0282824. [Google Scholar] [CrossRef]
  49. Chinese Medical Dialogue Data. Available online: https://github.com/Toyhom/Chinese-medical-dialogue-data (accessed on 24 April 2025).
Figure 1. Overall architecture of the proposed model.
Figure 2. The structure of CBOW.
Figure 3. The structure of the GCNN.
Figure 4. The structure of the BiLSTM.
Figure 5. The classification accuracy for each category across two datasets: (a) the accuracy for the THUCNews; (b) the accuracy for the TNews.
Figure 6. The confusion matrix of the experimental results of the proposed model for the THUCNews dataset.
Figure 7. The confusion matrix of the experimental results of the proposed model for the TNews dataset.
Table 1. Comparison of text representation methods.
Method | Description | Main Advantages
Word2Vec [7] | Generates low-dimensional, dense word vectors based on contextual window information. | Embedding vectors provide rich semantic representations.
BERT [16] | Generates context-sensitive word vectors using a deep bidirectional Transformer network architecture. | Dynamically adapts embeddings based on context.
Context-to-Vec [18] | Enhances Word2Vec embeddings by integrating contextual information and refining it through graph-based topology, while improving semantic representations with synonym knowledge. | Addresses the limitations of static embeddings in semantic representation and context handling, significantly enhancing semantic expressiveness.
SGCSTC [19] | Fuses multiple granularities of information, such as word-jieba, word-jieba-radical, word-ngram, word-ngram-radical, and character embeddings. | Enriches the semantic representation of the text and alleviates the context sparsity issue in short texts.
LDA2Vec [21] | Combines Word2Vec with topic models to optimize domain-specific semantic representations through topic probabilities. | Improves word embedding by combining the strengths of LDA and Word2Vec.
Word2Vec-TFIDF [22] | Calculates the importance of each word using TF-IDF, then adjusts Word2Vec embeddings by applying these weights. | Retains the powerful semantic representation capability of Word2Vec while enhancing the distinguishability of words in specific contexts.
This paper | Quantifies the category distinction of words by using the improved TF-IDF-CDW weighting, dynamically adjusting Word2Vec embeddings, and incorporating part-of-speech encoding. | Enhances the semantic weight of key terms by considering category distribution and part-of-speech and assigning higher weights to category-specific words.
Table 2. The datasets used to evaluate the proposed model.
Dataset | Categories | Train | Validate | Test
THUCNews | 10 | 80,000 | 10,000 | 10,000
TNews | 13 | 104,000 | 13,000 | 13,000
Table 3. Experimental environment.
Environment Component | Configuration
operating system | Windows 10
CPU | Intel Core i7-13620H
GPU | RTX 4060
RAM | 32 GB
programming language | Python 3.10.8
framework | PyTorch 2.3.1
Table 4. Hyperparameter settings.
Parameter | Value
Word2Vec embedding dimension | 300
POS embedding dimension | 10
convolution kernel widths | {2, 3}
number of convolution kernels | 128
BiLSTM hidden units | 128
maximum sequence length | 16
batch size | 64
number of epochs | 10
Table 5. Comparative experimental results.
Model | THUCNews Accuracy (%) | THUCNews F1-Score (%) | TNews Accuracy (%) | TNews F1-Score (%)
TextCNN | 89.63 | 89.60 | 85.96 | 85.99
BiLSTM-Attention | 90.21 | 90.23 | 85.17 | 85.09
GCNN | 90.48 | 90.51 | 86.12 | 86.05
DPCNN | 90.69 | 90.80 | 86.47 | 86.49
FastText | 90.84 | 90.83 | 86.75 | 86.59
CNN-BiLSTM | 91.26 | 91.34 | 87.22 | 87.23
CNN-BiGRU-Attention | 91.38 | 91.36 | 87.15 | 87.06
Our model | 91.85 | 91.84 | 87.70 | 87.72
Table 6. Comparative experimental results with BERT.
Model | Accuracy (%) | F1-Score (%) | Training Time (s) | Inference Time (s) | Model Size (MB)
DPCNN | 90.69 | 90.80 | 313 | 3.32 | 48
BERT | 92.03 | 91.99 | 1413 | 19.52 | 394
Our model | 91.85 | 91.84 | 348 | 3.66 | 49
Table 7. The experimental results for the medical dataset.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
TextCNN | 85.55 | 85.75 | 85.55 | 85.63
BiLSTM-Attention | 85.78 | 85.68 | 85.78 | 85.71
GCNN | 86.35 | 86.27 | 86.35 | 86.28
DPCNN | 86.87 | 86.86 | 86.87 | 86.72
FastText | 87.07 | 87.05 | 87.07 | 87.05
CNN-BiLSTM | 87.82 | 87.71 | 87.82 | 87.74
CNN-BiGRU-Attention | 87.73 | 87.82 | 87.73 | 87.72
Our model | 88.22 | 88.35 | 88.22 | 88.27
Table 8. Ablation experimental results.
Model Variation | THUCNews Accuracy (%) | THUCNews Decrease (%) | TNews Accuracy (%) | TNews Decrease (%)
Remove the TF-IDF-CDW | 91.56 | 0.29 | 87.29 | 0.41
Remove the POS encoding | 91.71 | 0.14 | 87.52 | 0.18
Only keep Word2Vec | 91.52 | 0.33 | 87.20 | 0.50
Remove the GCNN layer | 90.51 | 1.34 | 85.66 | 2.04
Remove the BiLSTM layer | 90.82 | 1.03 | 86.22 | 1.58
Our model | 91.85 | - | 87.70 | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
