3.1. FastText Model Design
The FastText [26] model performs both word vector computation and text classification. Whereas many models enhance their classification capability by adding network layers, FastText utilizes a shallow network composed of an input, a hidden and an output layer. Figure 1 shows its basic structure.
The input layer is the feature vector obtained after word embedding, composed of the words and phrases in the text or sentence, and the model's output is the probability that the sentence or text belongs to each prediction category. The hidden layer computes the mean of the input vectors according to Formula (14):

$$h = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{14}$$

where the mean $h$ of the $n$ word vectors $x_i$ in the input layer represents the sentence information; the value calculated in the hidden layer is then fed into the softmax multiclassifier to output the predicted class information.
Additionally, the FastText model boosts its categorization capability through two main training tricks. The first is to improve operational efficiency with a hierarchical softmax layer. Hierarchical softmax replaces the flat softmax with a Huffman tree structure: the labels are Huffman-coded, so instead of computing the values of all leaf nodes, as the original flat softmax does, the model computes only the values along the path from the root node to one leaf node, which greatly reduces the complexity of model training and the test time on the test set. The second is to apply N-grams to extract features and use a hashing algorithm to map the 2-gram and 3-gram vocabulary information into two tables. Because the 2-gram and 3-gram vocabularies are much larger than the word-embedding vocabulary, the hash-bucket method maps 2-grams and 3-grams to buckets, and the 2-grams and 3-grams in the same bucket share the same word vector. The matrix composed of word embeddings and N-grams is depicted in Figure 2.
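As a small illustration of the Huffman-coding step behind hierarchical softmax, the sketch below (plain Python; the label frequencies are made-up example numbers) builds a prefix code in which frequent labels receive short codes, so scoring a prediction only requires walking the root-to-leaf path instead of evaluating every class.

```python
import heapq

def huffman_codes(freqs):
    """Build Huffman codes from label frequencies; frequent labels
    end up closer to the root, i.e., with shorter codes."""
    heap = [[weight, [label, ""]] for label, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)  # two least-frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heap[0][1:])

# Made-up class frequencies for illustration.
print(huffman_codes({"sports": 45, "world": 30, "finance": 13, "tech": 12}))
# {'sports': '0', 'world': '11', 'finance': '101', 'tech': '100'}
```

With m classes, the expected path length is O(log m) rather than the O(m) cost of evaluating a flat softmax.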
In the hidden layer, the word embeddings, 2-grams and 3-grams of the input sentence are concatenated, and the mean value over the sequence is computed. The calculated mean is then sent through a nonlinear activation function in the fully connected layer to the softmax layer for normalization, which finally outputs the probability of each predicted category. The architecture of FastText is depicted in Figure 3.
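To make the pipeline concrete, here is a minimal sketch (in PyTorch, with illustrative vocabulary, bucket and dimension sizes that are assumptions rather than the paper's settings) of a FastText-style classifier: 2-grams and 3-grams are hashed into shared buckets, their embeddings are concatenated with the word embeddings, averaged as in Formula (14), and passed through a linear layer with softmax.

```python
import zlib
import torch
import torch.nn as nn

NUM_BUCKETS = 2_000_000  # assumed bucket count for hashed n-grams

def ngram_bucket_ids(tokens, n, num_buckets=NUM_BUCKETS):
    """Hash each n-gram into a bucket; n-grams falling into the same
    bucket share one embedding vector."""
    return [zlib.crc32(" ".join(tokens[i:i + n]).encode()) % num_buckets
            for i in range(len(tokens) - n + 1)]

class FastTextSketch(nn.Module):
    def __init__(self, vocab_size, num_buckets, embed_dim, num_classes):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.ngram_emb = nn.Embedding(num_buckets, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, word_ids, ngram_ids):
        # Concatenate word and hashed n-gram embeddings along the
        # sequence axis, then average them (Formula (14)).
        vecs = torch.cat([self.word_emb(word_ids),
                          self.ngram_emb(ngram_ids)], dim=1)
        h = vecs.mean(dim=1)
        return torch.log_softmax(self.fc(h), dim=-1)

# Example with a batch of one sentence (ids are illustrative).
words = torch.tensor([[3, 17, 52, 9]])
ngrams = torch.tensor([ngram_bucket_ids(["the", "quick", "brown", "fox"], 2)])
probs = FastTextSketch(50_000, NUM_BUCKETS, 100, 10)(words, ngrams)
```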
3.2. Shallow Bidirectional GRU Network Design
In the traditional RNN network structure, the RNN can retain only short-term memory. If a text sequence is very long, earlier time-step information cannot be propagated to later time steps, resulting in inaccurate text classification. In addition, the gradient tends to vanish during backpropagation, mainly because the gradient updates to the network weights become vanishingly small, so the network cannot learn longer-range sequence information. To address these issues, the long short-term memory (LSTM) [27] and GRU [28] text classification methods are used. Because the LSTM approach has a large number of parameters and is computationally expensive, the GRU is used instead of the LSTM; as shown in Figure 4, it can match the classification effect of the LSTM. The gating mechanism comprises an update gate and a reset gate, given by Formulas (15) and (16), respectively. The reset gate in Formula (16) dictates how much of the previous information to combine with the current input, and the update gate determines how much prior information is carried over to the current moment. Finally, the output vectors are calculated based on Formulas (17) and (18):

$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \tag{15}$$

$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{16}$$

$$\tilde{h}_t = \tanh\left(W \cdot [r_t \odot h_{t-1}, x_t]\right) \tag{17}$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{18}$$
where $z_t$ is the update gate vector, $\sigma$ is the rectified linear unit (ReLU) activation function, $x_t$ is the input text feature vector, $W$ is the parameter matrix, $r_t$ is the reset gate vector, $\tilde{h}_t$ is the candidate state, and $h_t$ is the output vector.
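A minimal PyTorch sketch of a cell implementing Formulas (15)–(18) follows; the dimensions are illustrative, and the standard sigmoid gate activation is used here (the text above describes the gate activation as ReLU, so treat this choice as an assumption).

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One linear map per formula, acting on [h_{t-1}, x_t].
        self.Wz = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.Wr = nn.Linear(input_dim + hidden_dim, hidden_dim)
        self.W = nn.Linear(input_dim + hidden_dim, hidden_dim)

    def forward(self, x_t, h_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.Wz(hx))  # update gate, Formula (15)
        r_t = torch.sigmoid(self.Wr(hx))  # reset gate, Formula (16)
        h_cand = torch.tanh(
            self.W(torch.cat([r_t * h_prev, x_t], dim=-1)))  # Formula (17)
        return (1 - z_t) * h_prev + z_t * h_cand  # output, Formula (18)
```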
In addition, a bidirectional GRU network is designed to address the information loss that occurs during propagation through a single GRU network. The bidirectional GRU network maintains two hidden states: one in a forward-learning GRU unit and one in a reverse-learning GRU unit [29]. Suppose that the hidden state at moment t needs to be calculated. The input of forward learning consists of $x_t$ and $\overrightarrow{h}_{t-1}$, and the input of reverse learning consists of $x_t$ and $\overleftarrow{h}_{t+1}$. Then, the hidden states of forward learning and reverse learning are calculated based on Formulas (19) and (20), respectively:

$$\overrightarrow{h}_t = f\left(\overrightarrow{W}\,\overrightarrow{h}_{t-1} + \overrightarrow{U} x_t + \overrightarrow{b}\right) \tag{19}$$

$$\overleftarrow{h}_t = f\left(\overleftarrow{W}\,\overleftarrow{h}_{t+1} + \overleftarrow{U} x_t + \overleftarrow{b}\right) \tag{20}$$
where $\overrightarrow{W}$ and $\overleftarrow{W}$ represent the hidden-layer weight matrices in forward and reverse learning, respectively, $\overrightarrow{h}_{t-1}$ and $\overleftarrow{h}_{t+1}$ represent the hidden states at times t − 1 and t + 1, respectively, $\overrightarrow{U}$ and $\overleftarrow{U}$ represent the weight matrices of the forward and reverse inputs in the input layer, respectively, $x_t$ represents the input at time t, $\overrightarrow{b}$ and $\overleftarrow{b}$ represent the forward and reverse offset values, respectively, and $f$ represents the activation function.
In the bidirectional GRU network, the forward- and backward-learning GRUs do not interfere with each other before the model output, and the weight matrices of the input and hidden layers and the bias terms are not shared [30]. At the output, the forward-learning and backward-learning text information are spliced, and the feature vector $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ is output at time t, which is finally normalized using the softmax function according to Formula (21):

$$y_t = \operatorname{softmax}\left(f\left(V h_t + b\right)\right) \tag{21}$$

where $f$ is the sigmoid activation function, $V$ is the weight matrix, and $b$ is the offset term of the output layer.
Finally, the category with the maximum probability is taken as the final prediction result. The whole GRU structure is depicted in Figure 5.
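A minimal sketch (PyTorch; all sizes are illustrative assumptions) of the shallow bidirectional GRU classifier: independent forward and backward GRUs whose final states are spliced and passed through a softmax-normalized output layer, as in Formula (21).

```python
import torch
import torch.nn as nn

class BiGRUClassifierSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True runs forward and backward GRUs with
        # unshared weights, matching the description above.
        self.gru = nn.GRU(embed_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)  # V and b in (21)

    def forward(self, token_ids):
        _, h_n = self.gru(self.emb(token_ids))     # h_n: [2, batch, hidden]
        h_t = torch.cat([h_n[0], h_n[1]], dim=-1)  # splice fwd/bwd states
        return torch.softmax(self.fc(h_t), dim=-1)
```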
3.4. Model Integration
In the integrated model, this paper integrates the three network architectures designed in the preceding sections: the FastText network, the shallow bidirectional GRU network and the shallow TextCNN network.
Figure 7 shows the overall structure of the integrated model. The model is composed of three layers: the input layer, the hidden layer and the output layer. The input layer represents the input text feature vector, and the output layer represents the classification statistics of each classifier after applying the softmax function. In the hidden layer, from left to right, the FastText, bidirectional GRU and TextCNN network structures are shown. These architectures classify the input text simultaneously; finally, the classification results of the three network structures are collected, and the category receiving the largest number of votes is selected as the final classification result.
By using the integrated strategy of voting, the integrated model has the following characteristics:
(1) Different from a traditional single classifier, the designed network architectures can effectively extract text features; for example, word-embedding, 2-gram and 3-gram feature vectors are spliced together in FastText, and convolution kernels of sizes (2, 3, 4) and pooling layers are utilized to extract features in the TextCNN. To minimize the model's information loss during training, a bidirectional GRU is used for forward and backward learning.
(2) To further boost the training efficiency and accuracy of the model, each network architecture is trained with the SGD, RMSProp and Adam optimization algorithms simultaneously, because these three algorithms respectively address large-scale sample data, sparse sample data and adaptive learning rates. Therefore, with three optimization algorithms applied to each of the three network architectures, nine network models are trained (see the sketch after this list), which covers most of the problems that arise in the training process.
(3) General integrated models carry a risk of overfitting. Each network model trained in this work uses the dropout method to reduce overfitting during training so that each model retains a strong learning ability. This paper also uses the ReLU activation function to reduce problems such as stalled training and gradient disappearance during training.
(4) Traditional integrated learning uses serial training, whereas the integrated model in this paper uses parallel training to train the nine network models at the same time, which significantly shortens the training time. In addition, parallel training allows the prediction results of all models to be recorded when each model makes a prediction; the final vote is then conducted, the category with the largest number of votes is compared with the actual category, and the model evaluation indicators, such as the accuracy, are calculated.
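The sketch below shows how the nine models could be assembled (PyTorch; the placeholder architectures and learning rates are assumptions standing in for the FastText, bidirectional GRU and TextCNN classifiers described above).

```python
import torch.nn as nn
import torch.optim as optim

# Placeholder stand-ins for the three architectures; in practice these
# would be the FastText, bidirectional GRU and TextCNN classifiers.
ARCHITECTURES = {
    "fasttext": lambda: nn.Linear(100, 10),
    "bigru":    lambda: nn.Linear(100, 10),
    "textcnn":  lambda: nn.Linear(100, 10),
}
# Assumed learning rates for the three optimizers.
OPTIMIZERS = {
    "sgd":     lambda params: optim.SGD(params, lr=0.01),
    "rmsprop": lambda params: optim.RMSprop(params, lr=0.001),
    "adam":    lambda params: optim.Adam(params, lr=0.001),
}

def make_ensemble():
    """Pair every architecture with every optimizer: 3 x 3 = 9 models."""
    ensemble = []
    for arch_name, build in ARCHITECTURES.items():
        for opt_name, make_opt in OPTIMIZERS.items():
            model = build()
            ensemble.append((f"{arch_name}-{opt_name}",
                             model, make_opt(model.parameters())))
    return ensemble

print([name for name, _, _ in make_ensemble()])  # nine model names
```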
In summary, the integrated model is designed as illustrated in Figure 8.
The main idea of the integrated model is as follows: the total number of models trained in parallel is k, and the number of document categories is m. The classification results of each model for text datum i are counted, and datum i is considered to belong to the category that receives the largest number of votes. Finally, the accuracy values for the winning category are summed and averaged. The final prediction result is calculated based on Formulas (24)–(26):

$$V_{im} = \sum_{j=1}^{k} \mathbb{I}\left(C_{ij} = m\right) \tag{24}$$

$$A_{im} = \frac{V_{im}}{N} \tag{25}$$

$$\hat{y}_i = \arg\max_{m} A_{im} \tag{26}$$
where $C_{ij}$ represents the result of model j classifying text i, $V_{im}$ represents the number of votes that text datum i belongs to category m, $A_{im}$ represents the accuracy of text datum i belonging to category m, and N represents the number of votes.
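A plain-Python sketch of the voting rule in Formulas (24)–(26); the example predictions are made up, and in this paper k = 9 models vote.

```python
from collections import Counter

def vote(predictions):
    """predictions: the k per-model labels C_ij for one text datum i.
    Returns the winning category and its vote share V_im / N."""
    votes = Counter(predictions)               # V_im, Formula (24)
    category, count = votes.most_common(1)[0]  # argmax over categories, (26)
    return category, count / len(predictions)  # vote share, (25)

# Example: nine models vote on one datum (labels are illustrative).
preds = ["sports", "sports", "finance", "sports", "tech",
         "sports", "finance", "sports", "sports"]
print(vote(preds))  # ('sports', 0.666...)
```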