Attention-Based Transformer-BiGRU for Question Classification

Abstract: A question answering (QA) system is a research direction in artificial intelligence and natural language processing (NLP) that has attracted much attention and has broad development prospects. Question classification is one of the main components of a QA system, and its accuracy plays a key role in the entire QA task. Therefore, both traditional machine learning methods and today's deep learning methods are widely used and studied in question classification tasks. This paper introduces our work on two aspects of Chinese question classification. The first is an answer-driven method for building a richer Chinese question classification dataset, which addresses the small scale of existing experimental datasets and offers a reference for dataset expansion, especially for constructing datasets in low-resource languages. The second is a deep learning model for question classification with a Transformer + Bi-GRU + Attention structure. Transformer has strong learning and encoding ability, but it adopts a fixed encoding length: a long text is divided into multiple segments, each segment is encoded separately, and no interaction occurs between segments. Here, we achieve information interaction between segments through Bi-GRU so as to improve the encoding of long sentences. We add the Attention mechanism to highlight the key semantics in questions that contain answers. The experimental results show that the proposed model significantly improves the accuracy of question classification.


Introduction
According to the Data Age 2025 white paper released by IDC, the amount of global data will reach an unprecedented 163 ZB in 2025 [1]. All walks of life generate data every day: Mobike generates 25 million orders per day, Twitter carries 50 million messages per day, YouTube receives more than 400 h of uploaded video per minute, Taobao generates 20 TB of data every day, Facebook generates 300 TB of data every day, and Google processes 24 PB of data every day. In this age of information explosion, people are often dissatisfied with search engines that simply return related pages, especially in specific areas such as law and health care. The more web pages a traditional search engine returns, the harder it is for users to find the key information they need. A QA system, however, can better identify users' intentions and meet their need to obtain information quickly and accurately, which has made it one of the current research hotspots.
A question answering system is an information retrieval system that accepts questions from users in natural language (e.g., what is the longest river in the world?) and finds accurate, concise answers to those questions (e.g., the Nile) from a large amount of heterogeneous data. This is fundamentally different from traditional search engines: the goal of a question answering system is to accurately answer questions that users ask in natural language, whereas traditional search engines search by keywords and return a collection of relevant documents. In addition, question classification is a special form of text classification, so its ideas and methods can be shared with other text classification subtasks such as sentiment analysis, label classification, and news text classification. Therefore, research on question classification is of great importance both for improving the performance of question answering systems and for understanding how text classification can extract valuable information more efficiently.
Reviewing and summarizing the history and current state of research in question and answering systems will help to promote the development of a question and answering system as well as question classification technology.
In the early 1960s, researchers tried to build intelligent systems that could answer people's questions, in step with the development of artificial intelligence (AI). This period is known as the AI period. It was mainly devoted to AI systems and expert systems, represented by systems such as BASEBALL [2] and LUNAR [3], which are mainly domain-limited question answering systems that deal with structured data.
In the 1970s and 1980s, due to the rise of computational linguistics, a large amount of research focused on how to use computational linguistics technology to reduce the cost and difficulty of constructing QA. This period is known as the computational linguistics period, which mainly focuses on defining the field and processing structural data, and the representative system is Unix Consultant [4].
In the 1990s, QA entered a new period of open-domain, text-based systems. Along with the rapid development of the Internet, a large number of electronic documents were generated, which provided the conditions for QA to enter the open-domain, text-based period. In particular, the establishment of the QA track of TREC (Text Retrieval Conference) in 1999 greatly promoted the development of question answering systems. Subsequently, frequently asked questions (FAQ) data appeared on the Web, and, since the end of 2005, a large amount of community-based question answering (CQA) data (e.g., Yahoo! Answers) has appeared on the Web. With a large amount of question-answer pair data available, QA entered the open-domain, question-answer pair-based period.
In recent years, with the development of deep learning technology, research on question answering systems based on deep learning methods has emerged, and a great deal of research has been done on open-domain QA. In 2014, the Google Brain [5] team and the Yoshua Bengio [6] team each published an article, and both proposed the same idea for machine translation, namely the seq2seq model. In 2015, Bahdanau, Cho, and Bengio proposed an attention mechanism on the basis of seq2seq [7], which improved translation accuracy. Some mature QA systems already exist in industry and academia, for example Microsoft's "Xiaobing" and Apple's "Siri".
QA is a computer system that interacts with the user in natural language, which can automatically process questions asked by users and give users concise and correct answers. A general QA system consists of three parts: question analysis, information retrieval, and answer selection [8]. The structure of a typical QA system is shown in Figure 1.
It can be seen from Figure 1 that question classification is the initial part of the QA task, which has an important influence on the subsequent answer extraction and the overall performance of the QA system. In short, question classification plays an important role in QA, mainly manifested in two aspects: (1) Assign the corresponding label to the question according to the expected answer type, thereby narrowing the range of candidate answers. For example, in the question "Who was the first Chinese to enter space?", the answer that users really want to know is "Yang Liwei" instead of searching for too many materials containing content related to "first" or "space". After question classification, it can be learned that this is a question asking for a person's name. Therefore, the candidate statement outside the person's name will be screened out in the answer extraction stage, and it is only necessary to focus on some answers related to the person's name without having to pay too much attention to the candidate answer statement unrelated to the person's name. This helps to improve the accuracy of answer selection and reduce the amount of calculation.
(2) For different question types, QA will adopt different strategies in subsequent operations. For example, for the question "What kind of food is there in Sichuan?", the answer belongs to the food category, so answer extraction should focus on food-related strategies.
Researchers have reached a consensus that the accuracy of question classification results plays a key role in answer selection, and even in the efficiency and performance of the entire QA system. The experimental results of Moldovan et al. [9] indicate that wrong answers are often caused by inaccurate question classification.
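The first role described above, narrowing the range of candidate answers by the expected answer type, can be illustrated with a minimal sketch. The label names, the mapping from labels to entity types, and the candidate list are hypothetical and chosen only to mirror the "Yang Liwei" example; they are not part of the system described in this paper.

```python
# Hypothetical mapping from coarse question labels to acceptable answer entity types.
LABEL_TO_ENTITY_TYPES = {
    "HUM": {"PERSON"},
    "LOC": {"LOCATION"},
    "NUM": {"DATE", "NUMBER"},
}


def filter_candidates(question_label, candidates, label_to_types=LABEL_TO_ENTITY_TYPES):
    """Keep only candidates whose entity type matches the predicted question label."""
    allowed = label_to_types.get(question_label)
    if allowed is None:
        # Unknown label: no narrowing is possible, so keep every candidate.
        return list(candidates)
    return [(text, etype) for text, etype in candidates if etype in allowed]


# Illustrative candidates for "Who was the first Chinese to enter space?"
candidates = [
    ("Yang Liwei", "PERSON"),
    ("Shenzhou 5", "SPACECRAFT"),
    ("2003", "DATE"),
]
person_answers = filter_candidates("HUM", candidates)
```

Once the question is labeled as asking for a person, only the person-typed candidate remains, so the answer extraction stage has fewer statements to score.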
From the perspective of the task category, question classification also belongs to text classification. Therefore, the main methods of question classification usually refer to and quote some ideas of text classification [10], but they are different in some details. For example, common words (stop words) such as "what" and "is" are usually filtered in text classification, but these words are often very important in question classification. In addition, questions are all natural language questions randomly posed by users rather than traditional normative texts. Therefore, compared with conventional text classification, a question classification task mainly faces two major challenges.
The first challenge is that the user's question is too short, has a small vocabulary, and contains little information. For example, the Chinese question "007?" should be classified as an entity, and it is difficult to determine whether "007" is a number or a movie in this question.
The second challenge is that the questions are too long, such as the English question "why do people get goosebumps when they have something emotional happen to them, like when they hear a beautiful piece or see something beautiful, or get aroused by someone they love ?" The question should be classified as DESC (Description category), but, because the question is long and has a lot of entity words, it is easy to be misclassified in other categories, which is one of the difficulties.
Early question classification mainly used rule-based matching methods [11], which required the manual formulation of a large number of rules and establishment of a rule base. In recent years, question classification methods mainly develop lexical, syntactic, semantic, and other feature extraction strategies for question sentences [12] and then classify question sentences with the help of machine learning (e.g., K-nearest neighbor, SVM, Bayesian, etc.) methods. The accuracy of its classification results is determined by the merit of feature extraction, and, the richer the extracted features, the higher the accuracy of classification. The rule-based and feature extraction methods have the following three shortcomings.
(1) The manually developed feature extraction strategy is somewhat subjective and cannot comprehensively understand the interrogative sentences.
(2) In order to achieve better classification results, the feature extraction strategy needs to be constantly adjusted and optimized to better represent the question sentence in syntactic and semantic aspects, which is not very flexible.
(3) When the syntactic complexity of the question is high or the category granularity of the question is small, it is more difficult to develop feature rules and the classification effect is not good.
Compared with the above methods, deep learning is a new research hotspot within machine learning and has become a powerful force promoting its rapid development. The vigorous development of deep learning provides technical support for research on question classification and is expected to play an important role in advancing QA in the future.
Based on the above summary, the main work of this article is as follows:
• A brief introduction to traditional machine learning question classification models based on statistical methods is presented.
• Unlike the English question and answer corpus, the Chinese question and answer corpus has no unified public dataset, and the small size of corpora for Chinese question classification is one of the main factors restricting classification accuracy. To address this, thousands of question and answer sentences were collected from Chinese community question answering platforms, including Baidu Know and Sogou Q&A; the categories were manually labeled, missing and noisy samples were processed, and the result was manually verified. In addition, the difficulty of short questions carrying little information can be mitigated by combining the answer to a question with the question itself during classification. Experimental comparison shows that question sentences enriched with answer information are classified better than question sentences alone.
• For longer question sentences that contain answer information, a hybrid neural network (TBGA) model that combines Transformer and Bi-GRU and includes the Attention mechanism is used. The input of the Transformer encoder is the sum of the word vector and the position vector, which captures the relationships between words and the internal characteristics of question sentences; Bi-GRU considers context along the time series and models long-sequence dependencies well, so the information in longer questions can also be captured. Attention then highlights the key feature information among the features extracted by the preceding network, which keeps complicated words and lengthy questions from affecting the classification results. In addition, the answer content of question sentences is introduced to enrich the information of question sentences, which ultimately improves the efficiency and accuracy of question classification. The experimental results show that the question classification method used in this paper helps to improve the accuracy of question classification.
• Multiple deep models are evaluated on TREC, a classic publicly available open-domain dataset, and their fine-grained category accuracies are compared to identify which classification models are better suited to different category domains.

Materials and Methods
The rest of this article is organized as follows. The second section introduces related work, including Chinese and English question classification standards, the machine learning classification methods mainly used in the past few years, and the current research status of popular deep learning technologies in this field. The third section introduces the proposed method and the dataset used in the experiments. The fourth section presents the experiments and results. Finally, the fifth section concludes the paper.

Question Classification
Question classification is a very important part of the QA. It needs to classify the question into a certain category according to its answer type. The subsequent retrieval and extraction will adopt different measures according to the question category.
Firstly, question classification must determine the classification standard, which is the basis and premise of question classification. As shown in Table 1 [13], English mainly adopts the UIUC question classification standard. In this classification system, questions are divided into 6 coarse categories and 50 fine categories, and the category of a question is determined by the type of its answer. No unified Chinese question classification standard has yet been established. As shown in Table 2, Wen Xu et al. [14] of Harbin Institute of Technology built on the existing English standard and, in view of the complexity of Chinese, formulated a Chinese question classification standard with 7 coarse categories and 60 fine categories, each coarse category containing several fine categories.

Traditional Question Classification Method
After the question categories are determined, the next step is to choose the methods and models used for classification. In the past decades, machine learning has been the mainstream approach to question classification research, and the quality of question feature extraction is the decisive factor in classification performance. Question classification methods fall mainly into two categories: methods based on empirical rules and machine learning methods based on statistics [15]. The former was more common in the early stage and relies on preset empirical rules and templates to distinguish question types [16]. Statistics-based machine learning methods have become more common in recent years and offer strong versatility and ease of porting and extension. Such a method first extracts feature vectors that represent the question, then learns from accurately labeled questions through statistics and analysis to build a classifier automatically, and finally uses the classifier to assign a category to each question. The core of the method is extracting the feature vector of the question sentence, and the classifier's accuracy usually depends on the quality of that feature vector.
Zhang [17] uses bag of words and multi-word chunks as the main features and adopts a Bayesian model to classify English training sets of different sizes; Silva et al. [13] use only bag of words as the classification feature and apply a Support Vector Machine (SVM) to the UIUC English question set; Lee et al. [17] use bag of words and word blocks (all consecutive word sequences in the question) as the main features and apply K-Nearest Neighbor (KNN) to the UIUC10 English question set; Sundblad [18] uses bag of words as the classification feature on the TREC10 English question set, reaching a classification accuracy of 67.2% for coarse categories and 60.0% for fine categories; Li et al. [19] proposed a classification method based on Transformation-Based Error-Driven Learning (TBL) rules that uses English synonym sets from WordNet, noun hypernyms, Minipar dependency relations, and other basic linguistic knowledge as features, achieving 91.4% classification accuracy on the English public question set they adopted; Thint et al. [20] proposed using the head word of the sentence and its hypernyms as features and applied the Maximum Entropy (ME) model to the UIUC English question set.
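The statistics-based pipeline described above (extract features, learn from labeled questions, then label new questions) can be sketched with a bag-of-words multinomial Naive Bayes classifier. This is a minimal illustration, not a reproduction of any cited system: the whitespace tokenizer, the add-one smoothing, and the six training questions with UIUC-style coarse labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(question):
    # Lowercase whitespace tokenization; question words such as "who" are kept
    # because they carry most of the signal in question classification.
    return question.lower().replace("?", " ").split()


class NaiveBayesQuestionClassifier:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # add-alpha smoothing constant
        self.class_counts = Counter()           # number of questions per label
        self.word_counts = defaultdict(Counter) # bag-of-words counts per label
        self.vocab = set()

    def fit(self, questions, labels):
        for question, label in zip(questions, labels):
            self.class_counts[label] += 1
            for word in tokenize(question):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, question):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            for word in tokenize(question):
                if word in self.vocab:       # words never seen in training are ignored
                    score += math.log((self.word_counts[label][word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training questions, for illustration only.
TRAIN = [
    ("Who wrote Hamlet ?", "HUM"),
    ("Who was the first president ?", "HUM"),
    ("Where is the Nile ?", "LOC"),
    ("Where is Mount Everest located ?", "LOC"),
    ("How many moons does Mars have ?", "NUM"),
    ("How many people live in Tokyo ?", "NUM"),
]
clf = NaiveBayesQuestionClassifier().fit(*zip(*TRAIN))
```

With this toy data, the decision rests almost entirely on words such as "who", "where", and "how many", which is why such words are not filtered as stop words in question classification.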

Deep Learning Technology
In recent years, deep learning has become a new research hotspot within machine learning. Question classification methods based on deep learning have strong adaptive learning capabilities and relatively high fault tolerance, and they are also more robust to larger noise and complex deformations. In the field of natural language processing, more and more researchers use deep learning methods to solve problems.
Commonly used deep learning classification methods are CNN, RNN, and Attention, and many deep learning classification models are improvements on these methods. For example, DPCNN is an improvement on CNN, BiLSTM and BiGRU are improvements on RNN, and the Transformer, the core component of the BERT pre-trained model, builds on Attention through its multi-head attention mechanism.

Question Classification Method Based on Deep Neural Network
Traditional machine learning classification algorithms applied to question classification mainly rely on manually extracting features of the question from the dataset, or on combining certain features to represent the question, so they are strongly subjective. Moreover, natural language expression is highly diverse, which makes it costly to manually formulate an accurate feature extraction method.
Deep learning has become a powerful force driving the rapid development of machine learning, and deep learning methods play a major role in fields such as image processing, speech recognition, and natural language processing. Deep learning methods do not require many rules to be defined in advance and need no complicated feature engineering to obtain feature representations. This is especially true since the emergence of word vector techniques such as word2vec, GloVe, and fastText [21], which treat word vectors as network parameters and train randomly initialized vectors through the network, making deeper extensions of deep neural networks possible. The resulting representations also carry richer semantic and other information about the text.
The development of deep learning techniques is driving research on question classification. For example, Kim et al. [22] classified English questions by converting question sentences into word vectors and applying a Convolutional Neural Network (CNN); Komninos et al. [23] studied the effect of word embeddings on deep neural networks, and their results show that context-based word embeddings achieve better results in question classification tasks. Xu Jian et al. [24] constructed a two-channel LSTM question classification model on both Chinese and English corpora, so that richer information can be obtained through text translation, thereby improving the accuracy of question classification; Shi et al. [25] introduced the Attention mechanism to extract question features and made full use of the answer information to enhance the representation of the question, extract its effective information more efficiently, and grasp its semantic key points. Their experimental results show that introducing the attention mechanism improves results on the YahooAns and CQA question sets by 4% and 7%, respectively, compared with the model without attention.
As introduced at the end of the previous section, some neural networks have their own defects and shortcomings and are not able to fully learn the feature information and semantic information of question sentences. Therefore, this paper tries to explore a deep learning framework that is more effective and suitable for question classification by combining multiple deep learning methods. In addition, the Chinese open question corpus is relatively small, and experiments and studies are usually conducted based on small samples of interrogative sentences, such as the dataset size of 1500 used by Yu Zhengtao et al. [26], the dataset size of 4280 used by Tian Weidong et al. [27], and Li Ru et al. [28]. All of the above were extended on the question set of Harbin Institute of Technology.
BERT is a pre-trained model that has been very popular in the NLP field in recent years. BERT achieved record-breaking results on many NLP tasks, as shown in the GLUE benchmark [29]. Many other Transformer architectures followed BERT, such as RoBERTa [30], DistilBERT [31], OpenAI Transformer, and XLNet [32], achieving incremental improvements.
Some deep learning models view the question from a relatively narrow perspective, and, as introduced at the end of the previous section, some neural networks have their own shortcomings and cannot fully learn the feature information and semantic information of the question. Therefore, we combine a variety of deep learning methods to find a more effective and more suitable deep learning framework for question classification.
Since this paper uses a small-sample dataset, the currently popular pre-trained models, such as BERT and XLNet, are not applicable: these models place high requirements on the lower bound of the data volume and do not work well on small-sample data. Experiments in Aysu [33] showed that, with a small number of data samples, BERT is less effective than simpler neural networks such as LSTM: LSTM achieves better accuracy, while BERT takes more time and is prone to overfitting on small task-specific datasets. For this reason, this paper uses the Transformer, the core component of BERT, with fewer parameters to perform feature extraction. As the core component of BERT, the Transformer is a powerful feature extractor that has been widely validated in image, audio, and text research. The Transformer Encoder can efficiently extract features from the input question text and capture semantic and other information using the multi-head attention mechanism.
Finally, an attention mechanism is added after the two-layer network. On the one hand, it aggregates the feature information of the two preceding layers while reducing the output dimension. On the other hand, it further highlights key information, enhances semantic feature capture, and improves the accuracy of question classification. The overall structure of the model is shown in Figure 2.
Figure 2. The structure of the TBGA model.
According to the above analysis, the combination of Transformer Encoder and Bi-GRU is used to better enhance the representation of question features and to perform deep feature extraction. On the one hand, GRU, a simplified variant of LSTM, merges the original three gating units into two, reducing the number of parameters and improving efficiency without degrading performance. On the other hand, Bi-GRU extends the range of long-distance dependencies captured by the Transformer, which helps prevent overfitting on small-sample datasets and increases stability. At the same time, the obvious benefit of the bidirectional GRU is that it extracts the features of the hidden layer units from both the forward and backward directions of the question, which is more effective for obtaining contextual information. The attention mechanism added after the two-layer network also enriches the representation by combining the answer information with the question sentence, strengthening the learning of word-order semantics and deep features and thus improving the accuracy of question classification.
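The two-gate structure and the bidirectional reading described above can be made concrete with a small pure-Python sketch. It assumes the standard GRU formulation (update gate z, reset gate r, candidate state), omits bias terms, and uses small random weights; the dimensions and the class and function names are illustrative and do not reflect the configuration used in the experiments.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class GRUCell:
    """Standard GRU cell with two gates (update z, reset r); biases omitted for brevity."""

    def __init__(self, input_dim, hidden_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.Wz, self.Uz = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.Wr, self.Ur = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)
        self.Wh, self.Uh = mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim)

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        reset_h = [r_i * h_i for r_i, h_i in zip(r, h)]
        cand = [math.tanh(a + b) for a, b in zip(matvec(self.Wh, x), matvec(self.Uh, reset_h))]
        # The update gate interpolates between the previous state and the candidate.
        return [(1.0 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h, cand)]


def run_gru(cell, sequence):
    h = [0.0] * cell.hidden_dim
    outputs = []
    for x in sequence:
        h = cell.step(x, h)
        outputs.append(h)
    return outputs


def bi_gru(forward_cell, backward_cell, sequence):
    """Concatenate forward and backward hidden states at every position."""
    forward = run_gru(forward_cell, sequence)
    backward = run_gru(backward_cell, sequence[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd, bwd = GRUCell(3, 4, rng), GRUCell(3, 4, rng)
sequence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [0.4, 0.4, -0.2]]
states = bi_gru(fwd, bwd, sequence)
```

Each position receives a vector of twice the hidden size: the first half summarizes the words before it, and the second half summarizes the words after it, which is how segments encoded separately by the Transformer can exchange information.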
As shown in Figure 2, all the question sentences are first represented by word vectors and input to the embedding layer. They then pass successively through the Transformer and Bi-GRU two-layer network, which exploits the complementary advantages of Transformer and Bi-GRU to preserve and extract the features of the question. Next, attention focuses on and identifies the features of important words and sentences and aggregates the features of the two preceding layers. Finally, the classification result is obtained through the Softmax classifier.
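The last two stages of this pipeline, attention-based aggregation followed by a Softmax classifier, might look like the following sketch. The source does not specify the scoring function, so dot-product scoring against a context vector is an assumption here, and the hidden states, context vector, and classifier weights are fixed illustrative numbers rather than trained parameters.

```python
import math


def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, context):
    """Collapse a sequence of hidden states into one vector using attention weights."""
    scores = [sum(h_j * c_j for h_j, c_j in zip(h, context)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[j] for w, h in zip(weights, states)) for j in range(dim)]
    return pooled, weights


def classify(pooled, W, b):
    """Linear layer followed by Softmax over the question categories."""
    logits = [sum(w * p for w, p in zip(row, pooled)) + b_k for row, b_k in zip(W, b)]
    return softmax(logits)


# Three positions with 2-dimensional hidden states (illustrative values only).
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [2.0, 0.0]  # hypothetical context vector; learned in a real model
pooled, weights = attention_pool(states, context)
probs = classify(pooled, W=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], b=[0.0, 0.0, 0.0])
```

The pooled vector has the size of a single hidden state regardless of the question length, which is the dimension reduction mentioned above, and positions that align with the context vector receive larger weights.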
We denote a question as
$$Q = \{x_1, x_2, \ldots, x_n\} \quad (1)$$
where the i-th word in the question sentence is represented by $x_i$. If the input is a question sentence that contains answer information, the word vectors of the answer are spliced behind the word vectors of the question. For example, if the question Q "Which team has won the most championships in La Liga?" corresponds to the answer A "Manchester United Club.", then Q and A together represent the question. This article initializes the word vectors randomly and updates them continuously during training. After this step, the question sentence is input into the next layer of the network in the form of word vectors, as shown in the following formula:
$$S_{1:n} = \{S_1, S_2, \ldots, S_{n-1}, S_n\} \quad (2)$$
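The splicing of answer tokens behind the question tokens, followed by a lookup in a randomly initialized embedding table, can be sketched as below. The vocabulary handling, the `<pad>` and `<unk>` entries, the embedding dimension, and the example tokens are assumptions made for illustration; in a real model the table would be updated by training.

```python
import random


def build_vocab(token_lists):
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    # Random initialization; training would update these values.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def encode(question_tokens, answer_tokens, vocab, table):
    """Return S_1..S_n: the question vectors followed by the answer vectors."""
    tokens = question_tokens + answer_tokens  # answer spliced behind the question
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens]
    return [table[i] for i in ids]


question = ["which", "river", "is", "the", "longest"]
answer = ["the", "nile"]
vocab = build_vocab([question, answer])
table = init_embeddings(len(vocab), dim=8)
S = encode(question, answer, vocab, table)
```

Here `S` has one vector per token of the combined question and answer, so a short question gains additional tokens for the later layers to work with.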

Word Vector Representation Layer
First, jieba is used to segment the Chinese questions, and stop words are removed; English does not require a word segmentation step. The word vector of each word in the question is then obtained through training with word2vec [34]. Word2vec was proposed by Tomas Mikolov et al. in 2013 on the basis of the NNLM model [35], and Google open-sourced an efficient tool for generating word vectors in the same year. The Chinese training corpora for the word vectors used in this article are Sogou News [36] and Tencent News [37], and the English training corpus is Google News.

Transformer Layer
This layer performs feature extraction on the input word vectors using the encoder of the Transformer [38], a seq2seq model; the word embedding of the question sentence and the corresponding positional embedding are added together as input. Through the multi-head self-attention mechanism, the Transformer can capture dependencies over a certain distance while computing in parallel, thereby effectively learning the semantic information of the input text. After a series of operations, such as positional encoding, residual connection, normalization, and the feedforward network, the input question sentence is encoded into a fixed-length semantic vector.
The structure of the Transformer encoder is shown in Figure 3. First, positional encoding is introduced to represent the order of the words in the input sequence and to determine their positional information. To enable parallel computation, the attention mechanism discards the order information that is important in a temporal sequence; because the semantics change if the order of a sequence is disrupted, positional encoding is introduced to restore this information. The dimensionality of the positional encoding is the same as that of the embedding.

The core of the encoder is the multi-head attention sublayer, which calculates the degree of association between each word and the other words in the question. Because the calculation for each word does not depend on the output of the previous word, it can be carried out in parallel. The general form of the self-attention calculation can be expressed as:

$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

The matrices Q, K, and V represent Query, Key, and Value, respectively. They are obtained from the input vectors through three different linear transformation layers; because all three are computed from the same input, the mechanism is called self-attention. $d_k$ is the dimension of the word embedding layer; scaling the inner product of Q and the transpose of K by $d_k$ prevents the inner products from becoming too large, which would make the distribution after Softmax uneven.
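As a small worked illustration of the scaled dot-product form, the following pure-Python sketch computes the attention weights and outputs for three toy word vectors. For simplicity it assumes that Q, K, and V are the input itself (identity projections); in the encoder they would come from three learned linear transformations. The function names are illustrative only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def attention(Q, K, V):
    """Att(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, applied row by row."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Three toy word vectors of dimension 2 (assumed values, for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
```

Each row of `weights` sums to one and expresses how strongly one word attends to every word in the sentence; no row depends on the result of another, which is why the rows can be computed in parallel.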
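The outputs of the attention heads are concatenated to form the output of the multi-head attention sublayer:

$$\mathrm{sublayer}(x) = [\mathrm{Att}_1, \mathrm{Att}_2, \ldots, \mathrm{Att}_n]$$

where $\mathrm{Att}_i$ is the output of the $i$-th attention head. The feedforward sublayer consists of two linear transformations (with weights $W_1$, $W_2$ and bias terms $b_1$, $b_2$) with the ReLU activation function in between, as given by the following equation:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$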
where max is the ReLU activation function, and $W_1$ and $W_2$ are the weights of the linear transformations. Considering that the attention mechanism alone may not fit complex processes sufficiently, the learning ability of the model is enhanced by adding these two linear transformations. The general form of the output of each sublayer can be expressed as:

$$\mathrm{output} = \mathrm{LN}(x + \mathrm{sublayer}(x))$$

where LN denotes layer normalization applied to the residual connection, and sublayer(x) is the function implemented by the sublayer itself. This operation is applied after both the multi-head attention sublayer and the feedforward sublayer. Normalization improves the convergence speed of the algorithm, and the residual connection guards against the case in which the current network layer learns poorly.
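The feedforward sublayer and the Add & Layer norm step can be sketched for a single position vector as follows. The weights are hand-picked toy values, and the layer normalization here omits the learned gain and bias that practical implementations usually include; both are simplifying assumptions.

```python
import math

def relu_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one position vector x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def layer_norm(x, eps=1e-5):
    """Normalise one sample across its feature dimension (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """LN(x + sublayer(x)): residual connection followed by layer normalisation."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, -2.0]
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]    # 2 x 3 toy weights
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 x 2 toy weights
b2 = [0.0, 0.0]

ffn_out = relu_ffn(x, W1, b1, W2, b2)   # hidden activations are [1, 0, 0]
y = add_and_norm(x, ffn_out)            # normalised version of [2, -2]
```

Because the statistics are computed from the single vector `x + sublayer(x)`, no other samples are needed, which is the property that makes layer normalization convenient for variable-length text.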
The purpose of normalization is to map the data into a fixed interval so that later inputs do not fall into the saturation zone of the activation function, which would cause vanishing or exploding gradients; this allows activation functions such as ReLU to work better. Batch normalization calculates the mean and variance of each layer over each mini-batch; i.e., the data are normalized to a mean of 0 and a standard deviation of 1 along the batch dimension. Layer normalization, on the other hand, computes the mean and variance of each sample in each layer independently; i.e., it normalizes across the features of a single sample. Layer normalization is chosen here because, for models such as the Transformer that process text sequences, batch normalization becomes very complicated, whereas layer normalization can be applied to a single sample without calculating the global mean and variance, thus improving the convergence speed of the model.
The outputs of the multi-head attention sublayer and the feedforward sublayer of the encoder are passed to the next layer of the neural network after the residual connection and normalization (Add & Layer norm, LN) operations described above.

BiGRU Layer
The Gated Recurrent Unit (GRU) [6] is an improved and simplified variant of the Long Short-Term Memory (LSTM) network, which effectively alleviates the vanishing and exploding gradient problems of traditional RNNs. For many sentence-level processing tasks, it is very important to consider the context. However, a traditional unidirectional LSTM considers only the preceding information in the sequence and ignores the following information. BiGRU extends the unidirectional network with a second, backward layer; in the process of mapping between the input and output sequences, both the past and the future information of the question are fully used, so that the context before and after each position in the sentence is captured. Its significant advantages are higher question classification accuracy, low dependence on word vectors, good capture of long-distance dependencies, low complexity, and relatively fast response time.
In Figure 4, x and h are the input data and the output of the GRU unit, respectively; c is the reset gate and g is the update gate. Together, c and g control the calculation of the new hidden state $h_t$ from the previous hidden state $h_{t-1}$. Compared with the three gating units of LSTM (input gate, forget gate, and output gate), GRU combines the input gate and the forget gate into an update gate, and the output gate serves as a reset gate, which reduces the number of parameters. In addition, the linear self-update does not need to be established in an additional memory state; it is accumulated linearly and directly on the hidden state under the control of the gate structure, which is more flexible and efficient. The calculation formulas for the update gate and reset gate are as follows:

$$g_t = \phi(w_g[h_{t-1}, x_t] + b_g) \tag{11}$$

$$c_t = \phi(w_c[h_{t-1}, x_t] + b_c) \tag{12}$$

In Formula (11), $x_t$ is the data input by the upper layer, $\phi$ is the sigmoid function, $w_g$ is the weight of the update gate, and $b_g$ is the bias term. The previous hidden state $h_{t-1}$ and the current input $x_t$ are both controlled by the update gate, whose output $g_t$ is a value between 0 and 1 (0 means discard, 1 means retain); $g_t$ decides whether the previous state is transferred to the next time step.

In Formula (12), the reset gate determines the importance of the previous state $h_{t-1}$ to the result $h_t$; $w_c$ is the weight parameter, and $b_c$ is the bias term.

$$\tilde{h}_t = \tanh(w_h[c_t \odot h_{t-1}, x_t] + b_h) \tag{13}$$

In Formula (13), the new memory information $\tilde{h}_t$ is generated, where $w_h$ is the weight and $b_h$ is the bias term.

The output at the current moment, $h_t$, combines the previous hidden state $h_{t-1}$ and the new memory $\tilde{h}_t$ under the control of the update gate $g_t$. The final result $h_i$ output by this layer is a fusion of the forward and backward outputs.
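The gate computations and the bidirectional fusion can be sketched in pure Python as follows. Three points are assumptions rather than statements from the paper: the state update is taken to be $h_t = g_t \odot h_{t-1} + (1 - g_t) \odot \tilde{h}_t$ (so that a gate value of 1 retains the previous state), the forward and backward outputs are fused by concatenation, and the weights are small random values produced by the hypothetical helper `make_params`.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, v, b):
    """Return W v + b for a weight matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def make_params(hidden, inp, seed):
    """Small random weights for one direction (hypothetical initialisation)."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inp)]
                for _ in range(hidden)]
    return {"w_g": mat(), "b_g": [0.0] * hidden,
            "w_c": mat(), "b_c": [0.0] * hidden,
            "w_h": mat(), "b_h": [0.0] * hidden}

def gru_step(h_prev, x, p):
    hx = h_prev + x                                            # [h_{t-1}, x_t]
    g = [sigmoid(v) for v in affine(p["w_g"], hx, p["b_g"])]   # update gate
    c = [sigmoid(v) for v in affine(p["w_c"], hx, p["b_c"])]   # reset gate
    gated = [ci * hi for ci, hi in zip(c, h_prev)] + x         # [c_t * h_{t-1}, x_t]
    h_new = [math.tanh(v) for v in affine(p["w_h"], gated, p["b_h"])]
    # Assumed update convention: a gate value of 1 keeps the previous state.
    return [gi * hp + (1.0 - gi) * hn for gi, hp, hn in zip(g, h_prev, h_new)]

def run_gru(xs, p, hidden):
    h, outputs = [0.0] * hidden, []
    for x in xs:
        h = gru_step(h, x, p)
        outputs.append(h)
    return outputs

def bigru(xs, p_fwd, p_bwd, hidden):
    forward = run_gru(xs, p_fwd, hidden)
    backward = run_gru(xs[::-1], p_bwd, hidden)[::-1]   # re-align with positions
    return [f + b for f, b in zip(forward, backward)]   # fusion by concatenation

xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
H = bigru(xs, make_params(2, 3, seed=1), make_params(2, 3, seed=2), hidden=2)
```

Each element of `H` has twice the hidden size, because it holds the forward state (which has seen the words before the position) next to the backward state (which has seen the words after it).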

Feature Aggregation Layer
In question classification, each clause and each word in the question contributes differently to the classification. Some words or clauses are particularly important, while the contribution of others is insignificant. To capture the effective information in the question and grasp its semantic key points, an attention mechanism [39] is added after the BiGRU in this paper. It highlights key semantic features and evaluates the contribution of each word to the classification of the entire question, so as to retain the most critical information, filter out redundant information, and improve the efficiency and performance of question classification. This layer takes the output of the preceding layer as its input and obtains the vector representation of each sentence from the BiGRU network with the attention mechanism.
The structure of the attention mechanism is shown in Figure 5. M represents the matrix composed of the word vectors output by the preceding BiGRU network, $M \in \mathbb{R}^{d \times n}$, where d is the dimension of the word vectors and n is the length of the sentence; w is a trainable parameter vector, and $w^{T}$ is its transpose. The dimensions of w, α, and r are d, n, and d, respectively, and q is the final representation of the question sentence used for classification.
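One form of attention pooling that is consistent with the stated dimensions of w, α, and r is sketched below. It assumes the common tanh-scored variant, $\alpha = \mathrm{softmax}(w^{T}\tanh(M))$, $r = M\alpha^{T}$, $q = \tanh(r)$; this exact form is an assumption and may differ from the one used in the paper. The matrix M is stored as a list of its n columns, and the toy values are illustrative.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, w):
    """H: the n columns of M (BiGRU outputs, each of dimension d); w: d-dim vector."""
    # One score per word: w^T tanh(h_t).
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in H]
    alpha = softmax(scores)                                   # n attention weights
    d = len(w)
    r = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(d)]   # r = M alpha^T
    q = [math.tanh(v) for v in r]                             # final representation
    return q, alpha

H = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]   # three words, d = 2 (toy values)
w = [1.0, 0.0]                             # toy parameter vector
q, alpha = attention_pool(H, w)
```

With this toy `w`, the third word receives the largest weight and the second the smallest, which mirrors the idea that words contribute unequally to the classification.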
Figure 5. The structure of attention mechanism.

Softmax Layer
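Finally, classification is performed in this layer: the Softmax classifier predicts the label of the question Q from a set of discrete categories Y. The classifier takes the final hidden state q of the preceding layer as its input and is calculated as follows:

$$\hat{p}(y \mid Q) = \mathrm{softmax}(Wq + b), \qquad \hat{y} = \arg\max_{y \in Y} \hat{p}(y \mid Q)$$

where W and b are the weight matrix and bias of the classifier. The loss function is as follows:

$$J(\theta) = -\sum_{i=1}^{m} j_i \log(x_i) + \lambda \lVert \theta \rVert_2^2$$

where $j \in \mathbb{R}^m$ is the one-hot encoding of the true category, $x \in \mathbb{R}^m$ represents the estimated probability of each category, m is the number of target categories, and $\lambda$ is the coefficient of the L2 regularization term, which is used to constrain the weight vector.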
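A minimal sketch of this layer, assuming a single linear map followed by Softmax and a cross-entropy loss with an L2 penalty on the classifier weights only, is given below. The names `classify` and `loss`, the toy weights, and the three-class setup are illustrative assumptions.

```python
import math

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def classify(q, W, b):
    """Return the class probabilities softmax(W q + b) and the predicted label."""
    logits = [sum(wi * qi for wi, qi in zip(row, q)) + bi for row, bi in zip(W, b)]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return probs, label

def loss(j, x, W, lam):
    """Cross-entropy between one-hot j and probabilities x, plus an L2 penalty on W."""
    cross_entropy = -sum(ji * math.log(xi) for ji, xi in zip(j, x))
    l2 = lam * sum(w * w for row in W for w in row)
    return cross_entropy + l2

q = [1.0, 2.0]                                   # toy sentence representation
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]       # three classes, toy weights
b = [0.0, 0.0, 0.0]
probs, label = classify(q, W, b)
j = [0, 1, 0]                                    # one-hot true category
value = loss(j, probs, W, lam=1e-2)
```

Here the logits are 1, 2, and -3, so the second class is predicted; the penalty term grows with the squared magnitude of the weights and therefore discourages large weight vectors.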

Algorithm Steps
The question and answer data are input into the model, features are extracted for classification, and the category is finally output; the steps are shown in Algorithm 1.
Question and answer datasets containing answer information are collected from Chinese community question and answer websites, such as "Baidu Know" and "Sogou QA".
TREC: The TREC English question collection [40] follows the UIUC question classification standard, and all of its questions are factoid questions. There are two versions of the question set, "TREC-6" and "TREC-50", with six coarse categories (ABBR, DESC, ENTY, HUM, LOC, NUM) and 50 fine categories. The dataset is classic and widely used, so it can effectively demonstrate the performance of the method.
YAHOO: The YAHOO dataset [41] is a batch of question-and-answer sentences collected from the English community question and answer platform YAHOO QA. Each question has a corresponding answer, which has been manually verified. There are four categories: information, advice, opinion, and polling.
OQA (Open domain question and answer set): This dataset is a batch of data crawled from the Chinese community question and answer platforms Baidu Know and Sogou QA. Each question has a corresponding best answer, and some questions were added in this work and processed by the sentence expansion method. There are seven categories in total, namely description, character, location, number, time, entity, and unknown, all of which were manually labeled and verified. Because the data are noisy, data cleaning was carried out, including the handling of missing and invalid data and the checking of data consistency.
The structures of the three datasets are shown in Table 3.

The Validity Threats
The datasets created for the experiments in this paper are all manually annotated using the classification standards commonly used for Chinese, so judgment errors may occur for personal, subjective reasons. To reduce this threat, we selected the most commonly used open-source question classification datasets, TREC and YAHOO, for comparative experiments, expecting that repeatedly verified datasets would more effectively avoid the impact of subjective annotation. In addition, because this paper targets a deep learning question classification model implemented in Python, the templates are set for specific cases of this method, so the results may not generalize to other Python programs or other languages.

Experimental Setup
The Chinese dataset is segmented with the Jieba toolkit, and the word vectors are obtained by training the CBOW model of Word2Vec. The word vectors used on the English datasets TREC and YahooAns are pre-trained by word2vec on the Google News corpus, which contains 100 billion words, and the dimension is set to 200. The word vectors used on the Chinese dataset OQA are trained on the Sogou News and Tencent News corpora. The Sogou News corpus is 711 megabytes and the Tencent News corpus is 800 megabytes; the word vector dimension is 200 in both cases.
On all three datasets, the parameters used in the model are the same. To allow comparison with the previous work of Kim et al., the experiments in this paper use some of the same basic parameters. To prevent over-parameterization and over-fitting during training, and to avoid occasionally falling into a bad local minimum, the Dropout rate is set to 0.5, the l2 constraint (s) is 3, the learning rate is 1 × 10^−3, the L2 regularization coefficient is 1 × 10^−2, the number of attention heads is 2, the batch size is 32, and the maximum sequence length is 200. In addition, the optimizer decays the learning rate exponentially: after each epoch, learning rate = gamma × learning rate, which keeps training efficient while achieving the best results. Finally, training ends if the validation loss has not decreased for more than 1000 batches.
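The learning-rate schedule and the stopping rule can be sketched as follows. The value gamma = 0.9 is an assumption, since the text does not state it, and `EarlyStopper` is a hypothetical helper that is assumed to be called once per batch with the current validation loss.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` epochs of the rule lr = gamma * lr."""
    return initial_lr * gamma ** epoch

class EarlyStopper:
    """Signal a stop once the validation loss has not improved for `patience` batches."""

    def __init__(self, patience=1000):
        self.patience = patience
        self.best = float("inf")
        self.batches_since_best = 0

    def update(self, val_loss):
        """Record one validation loss; return True when training should end."""
        if val_loss < self.best:
            self.best = val_loss
            self.batches_since_best = 0
        else:
            self.batches_since_best += 1
        return self.batches_since_best >= self.patience

# Initial learning rate from the text; gamma is an assumed value.
schedule = [exponential_lr(1e-3, 0.9, epoch) for epoch in range(3)]

# A patience of 3 is used here only to keep the demonstration short.
stopper = EarlyStopper(patience=3)
decisions = [stopper.update(v) for v in [1.0, 0.9, 0.95, 0.95, 0.95]]
```

In the demonstration the loss improves twice and then stalls for three consecutive updates, so only the last call returns `True`.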

Evaluation Indicators
The evaluation indicators of the experiment are Accuracy (Acc), Precision (Prec), Recall (Rec), and F1 value. The specific calculation formulas are as follows:

$$\mathrm{Acc} = \frac{\mathrm{AccNum}}{\mathrm{Num}_{total}}$$

where AccNum represents the number of question samples correctly classified, and $\mathrm{Num}_{total}$ represents the total number of questions in the test set. Accuracy, the ratio of the number of samples whose predicted category is consistent with the actual category to the number of questions in the test set, is used as an important measure.
$$\mathrm{Recall}(a) = \frac{\mathrm{RecNum}(a)}{\mathrm{Num}_{total}(a)}$$

where Recall(a) represents the recall rate of category a, RecNum(a) represents the number of samples in the test set that are predicted to be category a and actually belong to category a, and $\mathrm{Num}_{total}(a)$ is the true number of samples of category a in the test set. The recall rate of category a is thus the ratio of the number of correctly classified samples of category a to the true number of samples of category a.
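The indicators can be computed from lists of true and predicted labels as in the sketch below. Accuracy and recall follow the definitions above; precision and F1 use their standard per-category definitions, which is an assumption because their formulas are not spelled out in the text. The label lists are toy data.

```python
def accuracy(y_true, y_pred):
    """Correctly classified samples divided by all samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def recall(y_true, y_pred, a):
    """Samples of category a predicted as a, divided by all true samples of a."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == a and p == a)
    return hits / sum(1 for t in y_true if t == a)

def precision(y_true, y_pred, a):
    """Samples of category a predicted as a, divided by all predictions of a."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == a and p == a)
    return hits / sum(1 for p in y_pred if p == a)

def f1(y_true, y_pred, a):
    """Harmonic mean of precision and recall for category a."""
    p, r = precision(y_true, y_pred, a), recall(y_true, y_pred, a)
    return 2 * p * r / (p + r)

# Toy labels using TREC category names (illustrative data only).
y_true = ["LOC", "LOC", "NUM", "HUM", "NUM"]
y_pred = ["LOC", "NUM", "NUM", "HUM", "NUM"]

acc = accuracy(y_true, y_pred)
rec_loc = recall(y_true, y_pred, "LOC")
f1_num = f1(y_true, y_pred, "NUM")
```

The sketch does not guard against a category with no true or predicted samples, in which case the corresponding ratio is undefined.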

Hardware and Software Environment
The software and hardware environment used in the experiments in this paper are shown in Table 4.

Experimental Results
As shown in Figure 6, we first use the Text-CNN model on the YAHOO dataset to compare 600 single question sentences with questions containing answer information. The results show that accuracy with the answer added is two to three percentage points higher than without it. This indicates that the classification effect is better after the answer information is added to enrich the question features. It also shows that the Chinese question and answer dataset constructed in this paper is meaningful: on the one hand, it can improve the classification accuracy and, on the other hand, it can facilitate subsequent answer extraction tasks. Figures 7-9 show the comparison results of different models on the validation sets of the three datasets.
As shown in Table 5, LSTM has the worst effect on the TREC dataset, and, after the attention mechanism is added, it is basically on par with the Transformer. In general, the classification effect on TREC is better than on YahooAns and OQA, mainly because the TREC dataset consists of question sentences only, without answer information, and is therefore relatively simple. The classification effect on OQA is the lowest because the questions on community question and answer platforms, such as Baidu Know and Sogou QA, are more complex, with many abbreviated and colloquial words and sentences, and they also contain complex answer information. The TBGA model achieves ideal results on questions containing answer information. In addition, the attention mechanism is indeed effective, as it captures the text characteristics to be represented. On the OQA dataset in particular, the accuracy is improved by 4.75% compared with the traditional CNN method. Transformer has the worst effect; the main reason is that the amount of data offline is relatively high, and its effect on longer question-and-answer sentences containing answers is not good, which is also caused by insufficient long-distance dependence. LSTM is not effective either, mainly because of insufficient feature extraction.
In the proposed model, the Transformer is used as a strong feature extractor, and its long-distance dependence is supplemented by the improved Bi-GRU. At the same time, more features are extracted through the hidden layer, and, finally, the attention mechanism is used to highlight important question information and fuse the features, so that the components play to their respective advantages and complement each other to improve the classification effect.
In summary, the method proposed in this paper achieves significant improvement on both the English and Chinese datasets, which not only verifies the effectiveness of the method but also indirectly reflects its generalization ability.
In addition, Figure 10 compares the F1 values of each category on the TREC dataset for commonly used text classification models: a convolutional network, an RNN with an attention mechanism, and a multi-head attention network. For abbreviation (ABB) questions, the best model is the Transformer. For description (DES) questions, the CNN model is relatively effective, and the CNN is also the best model for entity (ENT) questions. For human (HUM) questions, the Transformer has the best effect; for location (LOC) questions, the RNN with attention has the best effect; and for number (NUM) questions, the Transformer again has the best effect. The experiments thus show that different networks perform differently on different categories, which offers some guidance for choosing a model when classifying questions in a specific domain category.

Conclusions and Future Work
In recent years, the rapid development of the Internet has made more intelligent information technology an urgent need. As a hot research direction in recent years, question answering systems can satisfy users' needs for accurate information acquisition against the background of growing information demand, and combining deep learning with question answering tasks is the current trend. Question classification, as the primary link of the question answering system, can narrow the range of candidate answers, which guides the direction of the subsequent processing. Deep learning is undoubtedly a new research hotspot in machine learning, and it has become a powerful force promoting the rapid development of the field. The development of deep learning provides strong technical support for research on question classification and also plays a guiding role in the development of future question answering systems. Therefore, the focus of this paper was the combination of deep learning technology and question classification tasks.