ALBERT over Match-LSTM Network for Intelligent Questions Classiﬁcation in Chinese

Abstract: This paper introduces a series of experiments with an ALBERT over match-LSTM network on top of pre-trained word vectors, for accurate classification of intelligent question answering and thus the guarantee of a precise information service. To improve the performance of data classification, a short text classification method based on an ALBERT and match-LSTM model was proposed to overcome the limitations of the classification process, such as small vocabularies, sparse features, large amounts of data, heavy noise and poor normalization. In the model, the Jieba word segmentation tool and an agricultural dictionary were selected for text segmentation, the GloVe algorithm was then adopted to expand the text features and weight the word vectors according to the key vectors of the text, a bi-directional gated recurrent unit was applied to capture contextual feature information and multi-convolutional neural networks were finally established to obtain local multidimensional characteristics of the text. Batch normalization, dropout, global average pooling and global max pooling were utilized to solve the overfitting problem. The results showed that the model could classify questions accurately, with a precision of 96.8%. Compared with other classification models, such as the multi-SVM model and the CNN model, ALBERT+match-LSTM has obvious advantages in classification performance in intelligent Agri-tech information service.


Introduction
It has been almost 60 years since the first successful knowledge-based question answering system, for baseball, was developed in 1963, but an intelligent QA machine for Chinese agriculture datasets is still a blue ocean.

Goals of the Paper
Our intelligent question answering machine constrains answers to cloze-style reading comprehension datasets [1,2] (also called domain-specific), rather than the space of all possible spans. This work aims to improve the performance of question answering on the National Agricultural Technology and Education Cloud Platform (NJTG, http://njtg.nercita.org.cn/ (accessed 20 August 2017)) by applying classification technology. Our model is evaluated on data from NJTG, which is based on big data, cloud computation and mobile technology and hosts all kinds of agricultural technology educational resources; we call it the wing of Chinese agricultural technology. Agricultural administration departments at all levels, as well as agricultural experts, agricultural technology-extension officers and farmers, can access online education, online consultation, achievement promotion and marketing easily, which will hasten the development of Chinese intelligent agriculture.

Related Work
Word embedding technology derives from the Google open-source project word2vec [11,12], released in 2013. Currently, word2vec and GloVe are the most common technologies for vectorization. Both are fundamentally similar: they capture co-occurrence statistics and the distance between embedding vectors. GloVe is count-based; it captures global co-occurrence statistics and requires an upfront pass through the entire dataset. A large corpus can be used in many research domains, such as information retrieval [13] and question answering [14][15][16][17]. We loaded the Sogou agricultural vocabulary as a word segmentation corpus instead of the main word base, which greatly improved agricultural vocabulary recognition.
The Learning to Rank Model uses the machine learning method to solve rank targets. There are three kinds of learning to rank models: the first one is a point-wise model which treats single documents as training objects, the second one is a pair-wise model which uses document pairs to train objects and the last one is a list-wise model, which maintains the whole document list as an optimized object, using the listNet model to optimize the performance.
In recent years, notable results have been achieved by deep learning models in RC and QA. Within natural language processing, it is common practice for word vector representations to be involved in much of the work through neural language models, since computers cannot take words, either Chinese [18] or non-Chinese, directly as the input for the ALBERT model.
The QA system comprises question classification, search, text retrieval, answer extraction, answer ranking and output. Text classification technology will be employed to classify the questions into different types based on their answers, which is one task of this paper.
Many NLP systems take words as atomic units, such as the N-gram model, and simple systems trained on a massive amount of data outperform complex models trained on a reduced amount of data. However, the simple model is not sufficient for many tasks. Google has published ALBERT [30] with parameter-reduction techniques, which outperforms BERT. For example, the amount of relevant in-domain data for automatic speech recognition is limited: the performance is usually dominated by the size of high-quality transcribed speech data (often just millions of words). Thus, to achieve significant progress, we should focus on more advanced methods.

Model Architectures
Our end-to-end architecture for text classification is illustrated in Figure 1. The model consists of four parts: a text preprocessing layer, an ALBERT layer, a match-LSTM layer and an interactive classification layer. For pre-processing, we employ an efficient and precise model proposed by Duan et al. [31] for Chinese word segmentation (CWS). As is widely accepted in most NLP tasks, attention is strong at capturing the "important" features and the distribution of the input, so we employ multi-head attention. A recurrent network encoder is engaged to build representations for questions and answers separately; thus, we pre-train the data with the ALBERT model, a lighter and faster version of BERT that cuts down the parameters to reduce memory consumption and speed up training. After that, the match-LSTM model is used to find the correct category of the question (as illustrated in more detail in Figure 1), by collecting information from questions and answers.

Chinese Word Segmentation
As Chinese text cannot be taken directly as the input for the classification model, it is necessary to convert it into vectors. To preserve as much as possible the integrity and comprehensiveness of the semantic meaning of a sentence, we first preprocess the sentence with de-noising, word segmentation and vectorization, and then use the GloVe method to transform the word segmentation results into word vectors.
Duan et al. report a Chinese word segmentation (CWS) model, called attention is all you need for CWS (AAY_CWS), which achieves state-of-the-art performance. Compared with Python's Jieba word base, Duan's CWS model is a more advanced greedy decoding segmentation algorithm, which employs a transformer-like method, the Gaussian-masked Directional (GD) transformer. For smoother training, it uses two highway connections: one is GD multi-head attention and the other is GD attention. The Python implementation of AAY_CWS can be found at https://github.com/akibcmi/SAMS (accessed on 6 October 2020).
The segmentation results of Chinese sentences are greatly influenced by semantics and context. To improve the precision of segmentation, a stop-words table is loaded before segmentation, which reduces the adverse effect of stop words, special characters and spaces that contribute little or nothing to feature extraction in the sentence.
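This cleanup step can be sketched as follows (the stop words, token list and regular expression are illustrative assumptions; the segmentation itself would come from Jieba or the AAY_CWS model):

```python
# Sketch of the pre-segmentation cleanup described above: given tokens
# produced by a segmenter, drop stop words, special characters and
# whitespace that contribute little to feature extraction.
import re

def load_stopwords(lines):
    """Build a stop-word set from the lines of a stop-words table."""
    return {line.strip() for line in lines if line.strip()}

def clean_tokens(tokens, stopwords):
    """Remove stop words, punctuation/special characters and spaces."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok or tok in stopwords:
            continue
        if re.fullmatch(r"[\W_]+", tok):  # punctuation / special chars only
            continue
        kept.append(tok)
    return kept

# Toy example with a hypothetical segmentation of a Chinese question.
stopwords = load_stopwords(["的", "了", "吗"])
tokens = ["桃树", "的", "病虫害", "怎么", "防治", "？", " "]
print(clean_tokens(tokens, stopwords))  # ['桃树', '病虫害', '怎么', '防治']
```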
The attention mechanism has achieved great success in many fields. Most competitive neural sequence transduction models have an encoder and decoder layer. Using attention to connect the encoder and decoder performs well, since attention represents the relationship between words. The encoder maps the input sequence denoted as X to a sequence Z; given Z, the decoder then produces an output sequence Y.
An attention model maps a query vector Q and a set of key-value (K, V) vector pairs as (Q, K, V) to an output vector. As shown in Figure 2, we can visualize the attention relationship between words with colors. The deeper the color, the closer the relationship between the word "wishes" and the others. As shown in the figure, "they" has the deepest color with "wishes", which means the model finds that "they" refers to "wishes".

Global Word Vector
Semantic vector space models of language represent each word with a real-valued vector, and these vectors are useful as features in a variety of applications, such as information retrieval [12], document classification [32], question answering and named entity recognition [33,34].
Therefore, based on our wide-coverage agricultural corpus, we employ GloVe for the global log bilinear regression model properties needed for such regularities. GloVe stands for Global Vectors, which uses global words statistics information.
First, GloVe builds a word-word co-occurrence count matrix X, whose entry X_ik represents the number of times word i occurs in the context of probe word k. We take the formulas directly from the GloVe source. The co-occurrence probability is

P_ik = X_ik / X_i, where X_i = Σ_k X_ik,

and P_ik represents the probability of word k occurring in the context of probe word i. Noting that the ratio P_ik / P_jk depends on three words i, j and k, the most general model takes the form

F(w_i, w_j, w̃_k) = P_ik / P_jk.

After a few variations, Pennington et al. proposed a new weighted least squares regression model that addresses these problems. Casting Equation (3) as a least squares problem and introducing a weighting function f(X_ij) into the cost function gives us the model

J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)²

The higher the co-occurrence X_ij, the bigger the weight f(X_ij). This assumption is reasonable, except for high-frequency words: some high-frequency auxiliary words without actual meanings should be down-weighted. Therefore

f(x) = (x / x_max)^α, if x < x_max; f(x) = 1, otherwise,

where α = 3/4 and x_max = 100. Take sheep and peach from our dataset as examples. Table 1 shows these probabilities and their ratios for a large corpus, and the numbers confirm these expectations. Compared to the raw probabilities, the ratio is better able to distinguish relevant words (raise and pick) from irrelevant words (disease and radio), and it is also better able to discriminate between the two relevant words. This co-occurrence result shows that our word vectors are good word vectors, preserving the relevant features of words. Table 1. Two non-discriminative words in our dataset.
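The weighting function and weighted least squares cost can be sketched as follows (a minimal NumPy sketch on an invented toy co-occurrence matrix, not the trained GloVe model):

```python
import numpy as np

ALPHA, X_MAX = 3 / 4, 100.0

def glove_weight(x):
    """GloVe weighting f(x): down-weights rare pairs, caps frequent ones."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def glove_cost(W, W_tilde, b, b_tilde, X):
    """Weighted least squares cost J over non-zero co-occurrence counts."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += glove_weight(X[i, j]) * err ** 2
    return J

# Toy 3-word vocabulary with made-up co-occurrence counts.
rng = np.random.default_rng(0)
X = np.array([[0., 50., 120.], [50., 0., 3.], [120., 3., 0.]])
W, Wt = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))
b, bt = np.zeros(3), np.zeros(3)
print(glove_cost(W, Wt, b, bt, X))  # non-negative scalar cost
```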

ALBERT
In 2018, Google proposed the BERT model [35], which set a record in the 11 task tests at that time, bringing a landmark change to the development of natural language processing.
BERT is a seq2seq [36] model with an encoder-decoder structure, while ALBERT is an improvement based on it. There are four methods to turn BERT into ALBERT. Figure 3 shows the structure of factorized embedding parameterization. It can be seen from Figure 3 that the encoder of the model contains two layers. The first layer is the multi-head attention network layer, which can extract different features of the model, and the second layer is the feedforward network layer. Each layer contains a function for concatenation and standardization of the input and output information.
The calculation formula of multi-head attention is as follows:

MultiHead(Q, K, V) = Concat(h_1, ..., h_M)·W_O

In the above formula, W_O represents the additional weight matrix for the M heads, Concat represents logical concatenation and h_i represents the i-th attention head, as follows:

h_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

where Q, K, V represent a query vector Q and a set of key-value (K, V) vector pairs as in the Attention model and W_i^Q, W_i^K, W_i^V represent the corresponding weight matrices. The formula of the Attention mechanism is as follows:

Attention(Q, K, V) = softmax(Q·K^T / √d_q)·V

where K^T denotes the transpose of the K vector and d_q is the vector dimension of Q. Softmax is a normalization function as follows:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)

As a lighter version of BERT, the ALBERT model transformed Next Sentence Prediction (NSP, ref. [35]) into Sentence Order Prediction (SOP), which improves the downstream effect of multi-sentence input. Compared to BERT, the ALBERT model reduces the number of parameters and enhances the ability of semantic comprehension.
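The multi-head attention and softmax formulas above can be sketched in NumPy as follows (the dimensions, number of heads and random matrices are illustrative; this is not the ALBERT implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax: e^z_i / sum_j e^z_j."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_q)) V."""
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)
    return softmax(scores) @ V

def multi_head(Q, K, V, heads, W_O):
    """Concat the per-head outputs and apply the extra weight W_O."""
    parts = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(parts, axis=-1) @ W_O

rng = np.random.default_rng(1)
d, n_heads, d_head = 8, 2, 4
Q = K = V = rng.normal(size=(5, d))  # 5 tokens, dimension d
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_head, d))
out = multi_head(Q, K, V, heads, W_O)
print(out.shape)  # (5, 8)
```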

Decoder Layer
For the decoder layer, match-LSTM is employed; to help understand the model, we first explain LSTM. The long short-term memory network (LSTM) [37] is a special type of recurrent neural network (RNN), which solves problems of the RNN such as the vanishing gradient and the long intervals and delays in prediction sequences. The LSTM model includes three gate structures: a forget gate, an input gate and an output gate. The three gates act on the cell unit to form the hidden middle layer of the LSTM model. The so-called gate is actually a mapping representation of the data model, which decides whether to open or close through the combination of a sigmoid function and matrix multiplication, and controls whether the information at the current time step is added to the cell unit.
The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.
The working mode of LSTM is basically the same as that of an RNN, and its network structure is shown in Figure 4. (All the annotation used in Figure 4 is explained in Figure 5.) As can be seen from Figure 4, for a listed input sequence x, the hidden state h_t is calculated as follows:

h_t = f(h_(t−1), x_t)

The "model" in Figure 4 represents f, which is a non-linear function, and h_(t−1) represents the hidden state at time t − 1. Taking the sample X as input of the model, a set of hidden states {h_1, h_2, . . . , h_t} will be calculated, where h_t represents the final state of the sequence transferred to the end of the neural network. Any input x_t in the middle will interact with the hidden state at time t − 1, and thus the self-loop form on the left side of Figure 4 is equivalent to the unrolled expression on the right side.
The first step in our LSTM is to decide what information we are going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer":

f_t = σ(W_f·[h_(t−1), x_t] + b_f)

where W is the weight matrix and b is the bias unit.
The input gate layer is as follows:

i_t = σ(W_i·[h_(t−1), x_t] + b_i)

The output gate layer is calculated as follows:

o_t = σ(W_o·[h_(t−1), x_t] + b_o)

The cell state is calculated as follows, where C̃_t represents the new memory cell or temporary memory cell and C_t represents the final cell state at time t (∗ denotes element-wise multiplication):

C̃_t = tanh(W_C·[h_(t−1), x_t] + b_C)
C_t = f_t ∗ C_(t−1) + i_t ∗ C̃_t

The hidden state at time t is as follows:

h_t = o_t ∗ tanh(C_t)

In 2016, Wang and Jiang proposed match-LSTM, a sequence-to-sequence model for predicting textual entailment, which is a widely used and well-explored method for RC tasks. Based on different answer pointer layers, Wang proposed two types of match-LSTM models: the sequence model and the boundary model. In the case of textual entailment, two sentences, a premise-hypothesis pair, are specified. In our case, a question-answer pair is given. To find whether the answer matches the question, we use the boundary match-LSTM model. After the output of ALBERT, the results are sent to the match-LSTM to obtain multiple candidates, and an n-best re-rank model is used. Among them, the top four results (or fewer, possibly only one) are sent to users, and the users decide which result is the correct one.
The boundary model consists of an LSTM preprocessing layer, a match-LSTM layer and an answer pointer layer; the detailed formulas can be found in the paper of Wang and Jiang.
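The LSTM gate equations above can be sketched as a single NumPy time step (toy dimensions and random weights, not the trained model):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step following the gate equations above.

    W and b hold the weights/biases for the forget (f), input (i),
    output (o) and candidate-cell (C) transforms.
    """
    z = np.concatenate([h_prev, x_t])        # [h_(t-1), x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # temporary memory cell
    C_t = f_t * C_prev + i_t * C_tilde       # final cell state
    h_t = o_t * np.tanh(C_t)                 # hidden state
    return h_t, C_t

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fioC"}
b = {k: np.zeros(n_hid) for k in "fioC"}
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):       # run a length-5 sequence
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape)  # (4,)
```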

Data: 20,000 Questions
We use NQuAD to conduct our experiments. Python regular expressions are employed to clean and filter the obtained text data and remove useless information.
We used a random sample of twenty thousand QA pairs of NQuAD, and these 20,000 peach-related questions were classified into 12 categories (shown in Table 2): marketing, plant diseases and pests, animal disease, cultivation management, breeding management, fertilizer science, nutrition, harvest process, agricultural equipment, storage and transportation, slaughtering process and "OTHERS". Sample questions are shown in Table 3. From Table 2, we can see that the distribution is not uniform: there are more than 6000 questions for plant diseases and pests, while only 28 for slaughtering process, which is a challenge for our classification task. The data are sampled as follows: for each category, 80% of the questions are sampled as the training set (16,000 questions in total), 10% as the test set (2000 questions) and the rest as the validation set (2000 questions).
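The per-category 80/10/10 sampling can be sketched as follows (the data structure, category names and counts here are illustrative assumptions, not the actual NQuAD format):

```python
import random

def split_by_category(questions, train=0.8, test=0.1, seed=42):
    """Per-category train/test/validation split, as described above.

    `questions` maps category -> list of question strings (an assumed
    structure for illustration).
    """
    rng = random.Random(seed)
    train_set, test_set, val_set = [], [], []
    for cat, qs in questions.items():
        qs = qs[:]                      # copy before shuffling
        rng.shuffle(qs)
        n_tr = int(len(qs) * train)
        n_te = int(len(qs) * test)
        train_set += [(q, cat) for q in qs[:n_tr]]
        test_set += [(q, cat) for q in qs[n_tr:n_tr + n_te]]
        val_set += [(q, cat) for q in qs[n_tr + n_te:]]
    return train_set, test_set, val_set

# Toy dataset with two invented categories.
toy = {"marketing": [f"q{i}" for i in range(10)],
       "animal disease": [f"a{i}" for i in range(20)]}
tr, te, va = split_by_category(toy)
print(len(tr), len(te), len(va))  # 24 3 3
```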

Statistics of Our Data
We divide our NQuAD questions into six types, which are common in search logs, as shown in Table 4. The statistics are illustrated in Table 5. We set the dimension of word vector as 120, and max sentence length as 100 words.

Flow Chart of Data Set Construction
The flow chart of our classification model based on ALBERT+match-LSTM is as follows (Figure 6), which contains input layer, feature extraction and output layer.


•
Input layer
CWS + GloVe → N × 512-dimension word vectors, which is the input for our ALBERT+match-LSTM model.

Feature extraction layer
In this layer, parameters are trained to extract the significant features. We set the feature dimension as 512 and N = 128, and the initial weights follow a Gaussian distribution X ∼ N(0, 0.01). K-fold cross validation is employed to train our model and monitor its performance; 16,000 questions were selected as our training set and 2000 as the validation set. The batch size is set as 1000 (16 batches in total) and the epoch number as 500; the sample is validated every 400 epochs. The Adam optimization algorithm is used with α = 10^−3, β_1 = 0.9, β_2 = 0.999. Dropout regularization is employed to avoid overfitting and the gradients are clipped to a maximum norm of 10.0.
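The optimizer configuration above can be illustrated with a hand-rolled Adam update step with norm clipping (a didactic NumPy sketch; the paper's actual training uses TensorFlow, and the toy gradient below is invented):

```python
import numpy as np

def clip_by_norm(g, max_norm=10.0):
    """Clip a gradient to a maximum L2 norm, as in our training setup."""
    norm = np.linalg.norm(g)
    return g if norm <= max_norm else g * (max_norm / norm)

def adam_step(theta, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8):
    """One Adam update with the hyperparameters stated above."""
    g = clip_by_norm(g)
    m = beta1 * m + (1 - beta1) * g           # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = v = np.zeros(2)
g = np.array([100.0, 0.0])  # large gradient gets clipped to norm 10 first
theta, m, v = adam_step(theta, g, m, v, t=1)
print(theta)  # first weight moves by about alpha; second is unchanged
```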

Output layer
After the features are extracted, all the feature vectors are concatenated into a one-dimensional vector, and finally, the SoftMax function is applied to the feature vector to obtain the output.

Hardware, Software Environment and Evaluation Indicators
The software environment for this experiment is Python 3.6.2 and TensorFlow 1.13.1; the server's hardware is an NVIDIA Corporation device 1e04 (Rev A1) with an NVIDIA GeForce RTX 2080 Ti GPU. In this study, the TensorFlow neural network framework is used to construct the neural network.
In the experiment, the 20,000 questions are divided into the training set, validation set and test set according to a ratio of 8:1:1. The precision, recall rate and F1 value are used as evaluation indexes in this paper. The formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives for a category.
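These evaluation indexes can be computed as follows (the counts below are invented for illustration):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and
    false negative counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy counts for one hypothetical question category.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, f)  # 0.9, 0.75, ~0.818
```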


Chinese Word Segmentation Results
We first tokenize all the questions and answers. The resulting vocabulary contains 100k unique words. We use "attention is all you need" for Chinese word segmentation (CWS)-the result is illustrated in Table 6-and then use word embedding from GloVe to initialize the model.

Test Results and Analysis
Attention is used to measure the importance of different labels. From Table 7, the proposed ALBERT+match-LSTM model has a slightly lower recall than Attention+match-LSTM, but it outperforms it in precision. Compared with other well-known and accepted multi-label classification methodologies, such as multi-SVM, KNN, CNN, LSTM, match-LSTM, attention + LSTM and ALBERT+LSTM, our proposed model has the highest matching performance on NQuAD. Those methods can be classified into classic machine learning (ML) algorithms (SVM, KNN) and deep learning (DL) algorithms, and the DL algorithms can be further divided into convolution kernel-based (CNN, LSTM, match-LSTM) and attention-based (attention + LSTM, ALBERT+LSTM, ALBERT+match-LSTM) methods. Table 7 shows the evaluation of different models on NQuAD. Table 8 shows that our model has better performance in the plant diseases and pests, animal disease, cultivation management, storage and transportation and "OTHER" categories (highlighted in bold); those five precisions, recall rates and F1 values of matching question pairs are greater than 96.8%, 97.6% and 96.9%, respectively, and the overall classification effect is better than that of other models. The F1 value of this model is also significantly higher than that of other models in the categories with less data, such as marketing, nutrition, slaughtering process and four other categories, which indicates that the ALBERT+match-LSTM model can still effectively extract the features of a short text in the case of insufficient data. Figure 7 shows the relation between training and validation precision and the number of training epochs, a clear correlation. As cross validation is used, the training precision is close to the validation precision and just slightly higher, which indicates the epoch number was good and our model was not overfitting. Before epoch = 4800, these two values are competing. As observed, epoch = 5200 is a turning point.
Before the turning point, the precision rate rises steadily with the epochs, while after it, the precision curve tends to be flat. The training set has the highest precision rate of 97.1% at the turning point, and the validation set has the highest precision rate of 91.9% at epoch = 4800. Table 9 shows the response time and precision of four attention-based neural network models on the 2000 test questions, which meets the requirements for quick classification of question sentences. CNN is the fastest in response time due to the simple structure of the CNN model, fewer training layers and fewer model parameters. The model proposed in this paper, ALBERT+match-LSTM, can accurately classify question sentence categories in the test set in 14 s, and its precision rate reaches 96.8%, which is much higher than that of the other models. More interestingly, it is observed that our ALBERT+match-LSTM model is faster than ALBERT+LSTM, with a slight advantage of 2 s in response time; the difference is the decoder layer, match-LSTM vs. LSTM. Our explanation is that the boundary model of match-LSTM performs well because it only needs to predict two indices, the start and end indices, which is more efficient than LSTM.
[Figure 7. Training precision and validation precision vs. number of training epochs (x400).]

Discussion and Conclusions
In this paper, to improve the performance of QA machines, we propose to preprocess our data with Duan's Chinese word segmentation method, to represent words with GloVe instead of word2vec, and then to employ ALBERT as the encoder and match-LSTM as the decoder, which treats the question as a premise and the answer as a hypothesis over multi-head attention (attention scores).
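The premise/hypothesis matching step can be sketched as a soft attention of each answer (hypothesis) token over the question (premise) tokens. This is a generic single-head attention sketch with toy vectors, not the paper's exact multi-head formulation:

```python
# Soft attention of one hypothesis token over all premise tokens.
import numpy as np

def attend(premise, h_token):
    scores = premise @ h_token                  # one score per premise token
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights = weights / weights.sum()
    return weights, weights @ premise           # weights and weighted premise summary

premise = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 question tokens
h_token = np.array([1.0, 0.0])                            # one answer token
weights, summary = attend(premise, h_token)
```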
Word2vec is predictive, while GloVe is count-based. GloVe can be implemented in parallel, which means it is faster than word2vec at reaching a set precision. Additionally, GloVe uses global corpus statistics, which contain more information.
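"Count-based" means GloVe starts from a global word-word co-occurrence matrix rather than predicting context words one window at a time. A minimal sketch of the counting step on a toy English corpus (the full GloVe algorithm additionally weights counts by distance and fits vectors to the log counts):

```python
# Count symmetric-window word-word co-occurrences over a toy corpus.
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

corpus = "rice blast is a fungal disease of rice".split()
counts = cooccurrence_counts(corpus, window=2)
```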
In this paper, our attention-based method is compared with classic machine learning models (Multi-SVM and KNN), as well as convolution-kernel models (CNN, LSTM and match-LSTM).
SVM (here, multi-SVM is used) has been one of the most efficient machine learning algorithms since its introduction in the 1990s. However, SVM algorithms for text classification are limited by the lack of transparency in their results caused by the high number of dimensions, while KNN is limited by data-storage constraints for large search problems when finding nearest neighbors. Additionally, the performance of KNN depends on the function used to find neighbors, making this technique a very data-dependent algorithm.
CNN, LSTM (a modified RNN) and match-LSTM use multiple convolution kernels to extract text features, which are then fed to a linear transformation layer followed by a sigmoid function that outputs the probability distribution over the label space.
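The pipeline described above (convolution kernels over word vectors, pooling, then a linear layer with a sigmoid per label) can be sketched in numpy. The shapes and random weights below are illustrative, not the paper's trained parameters:

```python
# Convolution kernels -> global max pooling -> linear layer -> sigmoid.
import numpy as np

rng = np.random.default_rng(0)
seq_len, emb_dim, n_kernels, kernel_w, n_labels = 10, 8, 4, 3, 5

x = rng.normal(size=(seq_len, emb_dim))             # one sentence of word vectors
kernels = rng.normal(size=(n_kernels, kernel_w, emb_dim))
W = rng.normal(size=(n_kernels, n_labels))
b = np.zeros(n_labels)

# slide each kernel over the sequence (valid positions only)
feats = np.array([
    [np.sum(x[t:t + kernel_w] * k) for t in range(seq_len - kernel_w + 1)]
    for k in kernels
])                                                  # (n_kernels, n_positions)
pooled = feats.max(axis=1)                          # global max pooling
probs = 1 / (1 + np.exp(-(pooled @ W + b)))         # one probability per label
```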
ALBERT takes advantage of both attention mechanisms and convolution kernels, thus reaching the highest performance. Extensive experimental results show that the proposed method outperforms the other models by a substantial margin (3.9%). Further analysis of the experimental results demonstrates that our proposed method not only finds adequate answers to the questions but also automatically selects the most informative words when predicting different answers.
This model can be quite valuable in practice; as mentioned in the introduction, this agri-intelligent QA system is built on the NJTG platform. Currently, more than 10 million QA pairs are collected and stored in the NJTG QA dataset (20,000 were used in this paper); moreover, end-users of the NJTG platform submit thousands of questions every day. With this model applied, our QA system can identify the most adequate answers from the exact category (about 100 thousand QA pairs) instead of from the whole NJTG QA dataset (10 million QA pairs). In other words, the search cost decreases by a factor of 100.
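The factor-of-100 claim is simple arithmetic on the figures above:

```python
# Classifying first narrows answer retrieval from the whole dataset
# to a single category.
total_pairs = 10_000_000    # QA pairs in the NJTG dataset
per_category = 100_000      # approximate QA pairs per category
reduction = total_pairs // per_category
```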
Regarding the drawback of this model: a question sentence such as "what vitamin can be used for pig's night blindness" could reasonably be classified into two categories (nutrition or animal disease), but under our current model it can belong to only one; there is therefore a chance that the adequate answer cannot be identified. Handling such multi-category questions is a direction for constructive future work.
For future work, there is still a gap between English and non-English (such as Chinese) intelligent QA systems. Given the rapid development and high precision of current English QA models and machine translation (MT) models, can we take a detour? To bridge this gap, first, an English-based dataset should be built from the non-English data using MT models; there are already plenty of well-built English-based datasets. Second, non-English questions should be translated into English using high-precision MT models. Third, the adequate answers should be identified with a high-precision method and translated back into the original language. This sounds like extra work, but with many ready-to-use models available for both tasks, it might be worth the effort. Furthermore, this general method could be applied to all languages, not only Chinese. Under this grand unification scheme, English plays the role of a bridge: all we need are MT models and English-based models.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the privacy policy of the authors' institution.