A Sequential Graph Neural Network for Short Text Classification

Abstract: Short text classification is an important problem in natural language processing (NLP), and graph neural networks (GNNs) have been successfully used to solve different NLP problems. However, few studies employ GNNs for short text classification, and most of the existing graph-based models ignore sequential information (e.g., word order) in each document. In this work, we propose an improved sequence-based feature propagation scheme, which fully uses word representations and document-level word interactions and overcomes the limitations of textual features in short texts. On this basis, we utilize this propagation scheme to construct a lightweight model, the sequential GNN (SGNN), and its extended model, ESGNN. Specifically, we build an individual graph for each document in the short text corpus based on word co-occurrence and use a bidirectional long short-term memory network (Bi-LSTM) to extract the sequential features of each document; therefore, word nodes in the document graph retain contextual information. Furthermore, two different simplified graph convolutional networks (GCNs) are used to learn word representations based on their local structures. Finally, word nodes combined with sequential information and local information are incorporated as the document representation. Extensive experiments on seven benchmark datasets demonstrate the effectiveness of our method.


Introduction
With the rapid development of network information technology, a large amount of short text data, such as book/movie reviews, online news, and product introductions, is increasingly generated on the Internet [1][2][3]. Such unstructured data provide huge resources for data processing and management to mine useful information [4]. Automatic classification of these short text data is one of the most important tasks in NLP and a key prerequisite for applications in different domains, such as news categorization, sentiment analysis, question-answering systems, dialogue systems, and query intent classification [5][6][7][8].
Traditional machine learning methods were initially leveraged to solve the problem of short text classification [9]. Compared with long texts, short texts have fewer words and less descriptive information and are therefore sparse [10]. Moreover, the text representations obtained by feature engineering in these methods are high-dimensional and highly sparse: each word is treated as independent, the contextual relationships in the text are ignored, and the feature expression ability is very weak [11][12][13], which greatly affects the accuracy of short text classification. Traditional machine learning methods therefore cannot meet the needs of short text classification.
To obtain better features of textual data, distributed representation models [14] and deep learning models (e.g., convolutional and recurrent neural networks) [15,16] have been used to learn text data representations [17,18]. The word embeddings obtained from distributed representation models [19,20], such as Word2Vec [21,22] and GloVe [23], have strong feature expression ability, which helps existing linear classifier models significantly improve performance. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are typical representatives of deep learning models. CNNs are a variation on the multilayer perceptron that operate on two-dimensional matrices and are very effective in computer vision, for example in processing electroencephalogram signals in the medical area [24,25]. RNNs are suitable for processing sequential data and have been widely used in maximum power point tracking, parameter estimation for induction motors, and so on [26,27]. Moreover, both can learn good sentence and document representations. TextCNN uses a one-dimensional convolution layer and a k-max-pooling layer to capture key information similar to n-gram features in the text; its key point is capturing local correlations in the text [28]. RNN-based models regard text as a word sequence, aiming to capture word correlations and text structure and to better express contextual information [29]. Because CNNs and RNNs prioritize locality and sequentiality, they can capture the semantic information of locally continuous word sequences in a document [30], and these deep learning models have been widely used in short text classification. Compared with traditional text classification models, they provide better results and achieve significant improvements [17,31]. Additionally, in recent years, with the development of pretrained language models, large-scale pre-trained models have been used for text classification; for example, many studies use pre-trained Bert to improve text classification [32,33].
In recent years, GNNs have attracted wide attention [30,34]. GNNs can effectively deal with tasks that have rich relational structures and preserve the global structural information of graphs [35]. Moreover, GNNs have recently been used in text classification since they can model complex semantic structures and perform well on complex network structures [36,37]. TextRank [38] was the earliest graph-based model to apply graph structures to NLP, representing a natural language text as a graph. Nodes in the graph can represent various types of text units, such as words and collocations, whereas edges can represent different types of relationships between nodes, such as lexical or semantic relationships. There are two main methods to generate graph structures from complex corpora. One method builds a single large text graph for the corpus according to the word co-occurrence and document-word relationships in the whole corpus; the graph includes word nodes and document nodes. Then, under the supervision of known document node labels, the text classification problem is transformed into a document node classification problem on the large graph, as in TextGCN [35], HyperGAT [39], and TensorGCN [40]. The other method generates a small individual graph for each document in the corpus, such as a semantic or syntactic dependency graph. The words of each document are the nodes of the graph, converting the text classification problem into a graph classification problem, as in S-LSTM [37], the model of [36], and TextING [41].
However, most of these graph-based studies focus on the classification of long texts, and few apply GNNs to short text classification. Graph-based methods outperform traditional models on long texts because GNNs can capture the global word co-occurrence relationships of nonconsecutive and long-distance semantics in a corpus [42]. However, due to the short length of short texts and the limitations of their textual features, extracting only the structural features of text graphs limits the ability of text representation; for example, the performance of TextGCN is worse than that of a CNN or RNN on MR [35]. In addition, most graph-based methods ignore the continuous semantic information in each document of the corpus, which is very important for NLP tasks such as sentiment analysis [43]. Specifically, graph-based methods update node representations by aggregating the features of neighbor nodes in parallel [44], which only extracts the local features of document or word nodes, so the contextual information and sequential features of the document are often ignored.
In this work, to address the above issues, we aim to build a GNN model based on a sequential feature propagation scheme that captures both the sequential information and the structural information of each document in the corpus, obtaining a more accurate text representation for short text classification. Towards this end, we propose an improved sequence-based feature propagation scheme that can better analyze textual features, and we propose a new GNN-based method for short text classification, termed SGNN. First, we build an individual graph for each document in the short text corpus, using sliding windows to model the contextual structure of words, and transform text classification into graph classification. Meanwhile, pre-trained word embeddings are used as the semantic features of words. Second, according to the distinctive sequential information of the document, we use Bi-LSTM to extract the contextual feature of each word in the document and update the word node representations of each document graph. Compared with previous graph-based models, the sequential information of the document is thus considered in the feature matrix of each document graph. Third, a simplified GCN is used to aggregate the neighbor features of each word node to learn word representations based on their local structures. Finally, the sufficiently updated word nodes are incorporated as document representations. In addition, we extend the model, termed ESGNN, which retains some initial contextual features in the aggregation process of the simplified GCN and effectively alleviates the problem of over-smoothing. In total, our method uses the semantic features of pretrained word embeddings to extract the sequential features and structural features of each document in turn, which increases the feature exchange between words in the document and overcomes the limitations of textual features in short texts. Moreover, since test documents are not required during training, our method is inductive: text representations of new documents can be easily obtained with the trained model [41]. We also conduct extensive experiments on seven benchmark datasets, and the results show the effectiveness of our method for short text classification. The overall structure of the model is shown in Figure 1, and the novelty of our work compared with other proposals is shown in Table 1. The main contributions of this paper are summarized as follows:

1.
We propose an improved sequence-based feature propagation scheme. Each document in the corpus is trained as an individual graph, and the sequential features and local features of words in each document are learned, which contributes to the analysis of textual features.

2.
We propose new GNN-based models, SGNN and ESGNN, for short text classification, combining the Bi-LSTM network and simplified GCNs, which can better understand document semantics and generate more accurate text representation.

3.
We conduct extensive experiments on seven short text classification datasets with different sentence lengths, and the results show that our approach outperforms state-of-the-art text classification methods.
The rest of the paper is organized as follows. First, Section 2 introduces our method in detail, including graph construction and our proposed models. Second, Section 3 describes the seven short text classification datasets, baseline models, and experimental settings in detail.
Section 4 shows the overall test performance of our models and baseline models and reports the experimental results in detail. Finally, we summarize this research and discuss the prospects of future research in Section 5.

Methods
In this section, the detailed method is introduced. First, we detail how to construct individual document graphs for each document in the short text corpus. Second, we describe our proposed SGNN model and its extended ESGNN model in detail. Third, we detail how to predict the label for a given text according to the learned representations of documents.

Graph Construction
We constructed individual graphs for each document in the short text corpus, representing words as nodes and the co-occurrence relationships between words as edges, denoted as G = (V, E), where V is the set of nodes and E is the set of edges. First, we preprocessed the text, including cleaning and tokenizing [28], to obtain the word sequence S1. Second, we removed the stop words, including the stop words of NLTK (http://www.nltk.org/) (accessed on 30 November 2021) and the words with frequency less than 5 in the corpus, to obtain the word sequence S2. Third, a fixed-size sliding window (length = 4 by default) was used to generate edges according to word co-occurrence over word sequence S1; if a word in the sliding window does not appear in S2, its node and corresponding edges are deleted from the graph. Finally, the embeddings of the nodes in each document graph were initialized with word embeddings, denoted as X ∈ R^{|V|×d}, where d is the embedding dimension. An example of constructing a document graph is shown in Figure 2. Figure 2: An illustration of constructing a document graph for a real document. S1 and S2 represent the word sequence after preprocessing and after removing the stop words, respectively; we set the sliding window size k = 2 in the figure for convenience. A and X represent the adjacency matrix and feature matrix, respectively.
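To make the construction concrete, the following is a minimal Python sketch of the procedure above, assuming the default window size of 4; the function and variable names (e.g., build_document_graph) are ours, not the paper's implementation.

```python
# A minimal sketch of the graph construction above, assuming window = 4;
# names such as build_document_graph are ours, not the paper's code.
def build_document_graph(tokens, stop_words, kept_vocab, window=4):
    """tokens is S1; removing stop words / rare words from it yields S2."""
    # S2: S1 without stop words and without corpus words of frequency < 5
    s2 = [w for w in tokens if w not in stop_words and w in kept_vocab]
    nodes = sorted(set(s2))
    index = {w: i for i, w in enumerate(nodes)}

    edges = set()
    # Slide the window over S1 so word order drives co-occurrence, but only
    # keep edges whose endpoints both survive in S2.
    for start in range(max(len(tokens) - window + 1, 1)):
        span = tokens[start:start + window]
        for i, a in enumerate(span):
            for b in span[i + 1:]:
                if a in index and b in index and a != b:
                    edges.add(tuple(sorted((index[a], index[b]))))
    return nodes, edges
```

The node features X would then be looked up from the pretrained word embeddings for each entry of `nodes`.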

SGNN Model and ESGNN Model
With the continuous development of GNNs, GCNs have become a simple and widely used message-passing algorithm for semi-supervised classification [45]. In one message-passing layer, a GCN propagates node features through average aggregation and transforms them by a linear mapping. Its equation is

X′ = σ(Â X W),

where Â is the symmetrically normalized adjacency matrix (with self-loops), W ∈ R^{d×d′} is the trainable weight matrix, σ is the activation function, such as ReLU, and X ∈ R^{|V|×d} and X′ ∈ R^{|V|×d′} are the feature matrices of the current layer and the next layer, respectively [41].
Recently, it has been found that by decoupling the GCN's feature transformation and propagation and removing the nonlinearities between GCN layers, the improved GCN is more efficient than the traditional GCN in many tasks [46,47]. Inspired by the above research, we propose the SGNN and ESGNN models for short text classification. The model architecture is shown in Figure 3. Since Bi-LSTM can capture bidirectional semantic dependencies, it can model the sequential information of documents well. Therefore, for each document in the short text corpus, Bi-LSTM is used to extract the contextual information between words, learn document-specific word representations, and update the feature matrix of the document graph. Then, we use the simplified GCN to aggregate the features of neighboring nodes on average and update the word node representations. The formulas of the SGNN model are as follows:

H = Bi-LSTM(X),  X^{(l+1)} = Â X^{(l)},  X^{(0)} = H,

where A and Â represent the adjacency matrix and the symmetrically normalized adjacency matrix, respectively, and H is the output of the model's hidden (Bi-LSTM) layer. In addition, we extend our model with a branch, ESGNN, where in the process of node aggregation the initial contextual features of words are preserved by setting α = 0.1, 0.2, …. The formula of the ESGNN is as follows:

X^{(l+1)} = (1 − α) Â X^{(l)} + α H.
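Under our reading of these formulas, the forward pass can be sketched as follows; the mapping from sequence positions back to unique word nodes is glossed over, and all names are illustrative.

```python
# Sketch of the SGNN/ESGNN forward pass under our reading of the formulas;
# the mapping from sequence positions to unique word nodes is glossed over.
import torch

class SGNN(torch.nn.Module):
    def __init__(self, d_embed=300, d_hidden=128, propagation_steps=1,
                 alpha=0.0):
        super().__init__()
        self.alpha = alpha                 # alpha = 0 -> SGNN; alpha > 0 -> ESGNN
        self.propagation_steps = propagation_steps
        self.bilstm = torch.nn.LSTM(d_embed, d_hidden, num_layers=2,
                                    bidirectional=True, batch_first=True)

    def forward(self, A_hat, X):
        # X: (num_nodes, d_embed) pretrained embeddings in document order.
        H, _ = self.bilstm(X.unsqueeze(0))   # sequential (contextual) features
        H = H.squeeze(0)                     # (num_nodes, 2 * d_hidden)
        Z = H
        for _ in range(self.propagation_steps):
            # Parameter-free propagation; ESGNN keeps a share alpha of the
            # initial contextual features to resist over-smoothing.
            Z = (1 - self.alpha) * (A_hat @ Z) + self.alpha * H
        return Z
```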

Document Classification
After the word nodes of each document were fully updated, we used global maximum pooling to extract features from the output of the last SGNN or ESGNN layer and obtain the graph-level representation of each document:

X_doc = max-pooling(X^{(L)}),

where X_doc ∈ R^{d_L} and L is the number of layers of the SGNN or ESGNN model. Finally, the label of the document is predicted by feeding the graph-level embedding X_doc into a softmax layer:

ŷ = softmax(W_linear X_doc + b),

where W_linear and b are the weights and bias, respectively. The goal of training is to minimize the cross-entropy loss between the ground truth label y and the predicted label ŷ.
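A minimal sketch of this readout and objective is given below, assuming a one-hot ground truth vector; in practice one would compute cross-entropy from logits (e.g., F.cross_entropy) for numerical stability.

```python
# Minimal sketch of the readout and objective, assuming one-hot labels;
# real code would use F.cross_entropy on logits for numerical stability.
import torch
import torch.nn.functional as F

def classify_document(Z_final, W_linear, b):
    # Z_final: (num_nodes, d_L) node states from the last model layer.
    x_doc = Z_final.max(dim=0).values          # global max pooling
    return F.softmax(W_linear @ x_doc + b, dim=-1)

def cross_entropy(y_true, y_pred):
    # Cross-entropy between one-hot ground truth and predicted distribution.
    return -(y_true * y_pred.clamp_min(1e-12).log()).sum()
```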

Materials and Experiments
In this section, we describe our datasets, baseline models, and experimental settings in detail.

Datasets
We conducted experiments on seven short text datasets: R8, R52, MR, SearchSnippets, SMS, Biomedical, and Ag News Sub. The detailed description of each dataset is listed below. We first preprocessed all the datasets by cleaning and tokenizing the text as in [28]. Then, we deleted the stop words defined in NLTK and the low-frequency words appearing fewer than 5 times for R8, R52, and Ag News Sub. For the other four datasets, we did not delete any words after cleaning and tokenizing the raw text because the documents were very short, so word sequences S1 and S2 were identical. The statistics of the datasets are shown in Table 2.

Baselines
In the experiment, we compared our methods with different state-of-the-art models. The models used in the experiment are as follows.

•
Fasttext [52]: A simple and efficient text classification method that takes the average of all word embeddings as the document representation and then feeds it into a linear classifier. We evaluated it without bigrams.

•
SWEM: A simple word embedding model proposed by [53]; in our experiments, we used SWEM-concat and obtained the final text representation through two fully connected layers.

•
TextGCN: A graph-based text classification model proposed by [35], which builds a single large graph for the whole corpus and converts text classification into a node classification task based on GCN.

•
TensorGCN: A graph-based text classification model in [40], which uses semantic and syntactic contextual information.

•
HeteGCN [54]: A model that unites the best aspects of predictive text embedding and TextGCN.

•
S-LSTM [37]: A model that treats each sentence or document as a graph and uses repeated steps to simultaneously exchange local and global information between word nodes.

•
TextING [41]: A model that builds individual graphs for each document and uses a gated graph neural network to learn word interactions at the text level.

Experiment Settings
For all the datasets, we randomly split the training set at a ratio of 9:1 for actual training and validation. We used pretrained 300-dimensional GloVe (http://nlp.stanford.edu/data/glove.6B.zip) (accessed on 30 November 2021) word embeddings as the input features, whereas out-of-vocabulary (OOV) words were set to 0. Following [55], we truncated SearchSnippets, Biomedical, and MR at their maximum sentence lengths, whereas the truncation lengths of Ag News Sub and SMS were set to 50, and those of R8 and R52 to 100. Empirically, the batch size of our model was 32, the number of layers and the hidden size of the Bi-LSTM were 2 and 128, respectively, the learning rate was 0.002 with the Adam [4,56] optimizer, and the dropout rate was 0.5. For learning SGNN and ESGNN, we trained the model for 100 epochs with an early stopping strategy. For baseline models, we used the default parameter settings as in their original papers or implementations. For models using pretrained word embeddings, we used 300-dimensional GloVe word embeddings.
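For convenience, the reported hyperparameters can be collected as follows; the dictionary layout and the stand-in model are ours, while the values come from the text above.

```python
# The reported training configuration gathered in one place; the dict
# layout and the stand-in model are ours, the values are from the paper.
import torch

config = {
    "embedding": "glove.6B.300d",    # OOV words initialized to zeros
    "batch_size": 32,
    "bilstm_layers": 2,
    "bilstm_hidden": 128,
    "learning_rate": 2e-3,
    "dropout": 0.5,
    "max_epochs": 100,               # with early stopping
    "train_val_split": (0.9, 0.1),
}

model = torch.nn.Linear(300, 8)      # stand-in for SGNN/ESGNN
optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
```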

Results and Discussion
In this section, we describe the experimental results in detail and, to further analyze our models, explore the influence of different parameters on model performance. Tables 3 and 4 show the test accuracy and test macro-F1 of all baselines and our two models on all datasets, respectively. Our models achieve optimal results on all datasets, especially those with short average text lengths, such as Ag News Sub, MR, Searchsnippets, SMS, and Biomedical. SGNN performs better than the other baselines, which proves the effectiveness of our method on short text datasets. In addition, ESGNN performs better than SGNN, which shows that its feature extraction ability is more powerful for short text classification. Moreover, we also evaluate model efficiency by the training time per epoch, as shown in Table 5; apart from TextGCN, our method compares favorably with the other GNN-based methods, and TextGCN is faster perhaps because it builds one large corpus graph and only captures the structural information of each document. We note that TextGCN, based on a large corpus graph, performs better than traditional models such as CNN and RNN on R8 and R52. This may be because the average text length of R8 and R52 is relatively long: TextGCN can capture global word co-occurrence over long distances in the corpus by constructing a single large graph and exploit the advantage of GCNs in dealing with complex network structures to learn more accurate representations of document nodes. We also note that S-LSTM and TextING, based on small graphs, perform better than traditional models, which may be because traditional models lack long-distance and non-consecutive word interactions [41]. In addition, they also perform better than the large-corpus-graph TextGCN. This may be because a small graph excludes the many words that are far away in the text and have little relationship with the current words [36], learning more accurate text representations in a specific context, so the generalization ability of the model is further improved. Additionally, both make use of pretrained word embeddings and achieve better results.

Test Performance
Our models perform better than the traditional models and also better than the most advanced graph models, such as S-LSTM and TextING. This may be because our model first captures the continuous semantic information of each document, so that its sequential information is well modeled and reflected in the feature matrix of each document graph. In addition, taking advantage of small graphs, the local structural features of word nodes are extracted using the dependency relationships between word nodes in the document. In summary, our method uses the semantic features of pretrained word embeddings and document-level word interactions, extracting the sequential information and structural information of each document in turn, to improve classification accuracy. It has powerful feature extraction and text representation abilities and achieves better performance in short text classification.

Combine with Bert
One of the advantages of a pre-trained model is that it can provide contextual dynamic word embeddings, which show better results than static word embeddings in NLP tasks [32]. Therefore, we use Bert instead of Bi-LSTM as the input of the model to explore the combination ability of our method with Bert, which is called C-Bert. The formulas of the C-Bert model are as follows:

H = Bert(S),  X^{(l+1)} = Â X^{(l)},  X^{(0)} = H,

where S is the word sequence of each document in the short text corpus.
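A sketch of how Bert could supply the contextual features H in this variant, using the Hugging Face transformers API, is given below; the checkpoint name and the handling of subword pieces are assumptions, since the paper does not specify them.

```python
# Sketch of how Bert could supply the contextual features H in C-Bert,
# via the Hugging Face transformers API; the checkpoint name and subword
# handling are assumptions, as the paper does not specify them.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def contextual_features(document: str) -> torch.Tensor:
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    # Token-level hidden states act as H; graph propagation (X = Â X, as in
    # SGNN/ESGNN) is then applied on top of these features.
    return out.last_hidden_state.squeeze(0)    # (seq_len, 768)
```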
The experimental results are shown in Table 6. Bert performs better than our model on five datasets, the exceptions being Searchsnippets and Biomedical, which may be due to out-of-vocabulary words. Moreover, on every dataset except SMS, the C-Bert model performs better than Bert, which demonstrates the effectiveness of our feature propagation scheme.

Figure 4 shows the test performance of our two models using different numbers of graph layers on MR and Searchsnippets. The results reveal that when L = 1, the test results on both datasets are optimal, which shows that the models capture the textual features of each document in the short text corpus well. However, as the number of layers increases, both models decline to different degrees on different datasets. This may be because of the short average text length in a short text corpus; for word nodes in each document graph, too much information received from high-order neighbors makes the word nodes overly smooth, which inhibits the generalization ability of the model. In addition, ESGNN performs better than SGNN as the number of layers increases, which indicates that preserving a proper amount of initial contextual feature information during node aggregation helps alleviate the over-smoothing problem.

Figure 5 shows the effect of different window sizes on the performance of the SGNN model on MR and Searchsnippets. The generalization ability of the model increases with the window size, as each word node gains more neighbors, widening the scope of feature exchange and letting the model learn word node representations more accurately. For both datasets, the model obtains the best results when the window size is equal to 4. However, when the window size is larger than 4, performance on the two datasets declines to different degrees. This may be because large windows introduce edges between word nodes that are not closely related, resulting in excessive feature exchange.

The results also show that as the α value increases, the trend is similar to that of window sizes; the optimum typically lies within α ∈ [0.2, 0.3] but varies slightly across datasets.
The α value should be adjusted according to the dataset, since the average text lengths of different datasets differ and different document graphs exhibit different neighborhood structures [57,58]. In addition, compared with SGNN (the dotted orange line in the figure), we note that too small or too large α values affect the structural information during aggregation, which reduces the feature extraction ability of the ESGNN model.

Proportions of Training Data
GCNs can perform well with a low label rate during training [45]; therefore, to test the robustness of our models on semi-supervised tasks [59], we test the graph-based models with different proportions of the training data. For MR and SMS, we reduce the training data to 1%, 2.5%, 5%, 10%, 15%, and 20%. Figure 7 shows the test results of ESGNN, SGNN, TextING, S-LSTM, and TextGCN. As the amount of training data increases, our ESGNN and SGNN models perform better than TextING, S-LSTM, and TextGCN in most cases. The extraction of sequential features also helps generate more accurate representations for word nodes of the test set that do not appear in the training set, giving our models stronger feature extraction and text representation abilities when labeled training documents are limited.
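A sketch of this subsampling protocol is shown below, assuming a fixed random seed; the helper function is ours.

```python
# Sketch of the low-label-rate protocol: subsample the training set at each
# fraction before training; the helper and seed are ours.
import random

def subsample(train_examples, fraction, seed=42):
    rng = random.Random(seed)
    k = max(1, int(len(train_examples) * fraction))
    return rng.sample(train_examples, k)

for fraction in (0.01, 0.025, 0.05, 0.10, 0.15, 0.20):
    reduced = subsample(list(range(1000)), fraction)  # toy stand-in data
    # ... train SGNN/ESGNN on `reduced`, evaluate on the full test set
```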

Conclusions
In this work, we propose an improved sequence-based feature propagation scheme that can better analyze textual features. We also propose a new GNN-based method for short text classification, termed SGNN, and its extended model, ESGNN. Each document in the short text corpus is trained as an individual graph; our two models extract the sequential features and structural features of each document in turn from the semantic features of words, which increases the feature exchange between words in the document, overcomes the limitations of textual features in short texts, and improves the accuracy of short text classification. Moreover, the experimental results suggest that our models are more robust to scarce training data than other graph-based models. In future work, we will explore more effective feature propagation schemes and propose more effective models to improve adaptability to different classification tasks in NLP.