1. Introduction
As communication and transportation technologies advance, communicating with people from other countries has become more common than it used to be, and Chinese has become one of the most popular languages. In an era of globalization, speaking a few foreign languages is commonplace. However, it is sometimes hard to tell whether our grammar is correct, so a system that can automatically correct sentences would be valuable.
The traditional way to correct sentences is to use a predefined dictionary. Although this approach scales easily because of its low computational cost, it struggles to correct semantic and grammatical errors because it does not capture sentence-level information.
With deep learning, a trainable neural network can capture context and sentence-level information to correct semantic and grammatical errors. Several studies have developed machine learning methods for Chinese language correction [1,2,3]. However, deep learning methods have an intrinsic disadvantage: their higher computational cost makes them hard to scale up to commercial services.
We introduce hybrid models as a solution. Using Bidirectional Encoder Representations from Transformers (BERT) [4] and Transformer [5] as the baseline, the hybrid models increase inference speed without reducing correctness.
2. Existing Work
Several studies have addressed the problem of capturing context and sentence-level information to correct semantic and grammatical errors. The most relevant approaches are summarized below.
2.1. Sequence to Sequence (Seq2Seq)
Our task of correcting an incorrect Chinese sentence can be viewed as a sequence transformation problem: transforming an erroneous sequence into a correct one. Sequence to sequence [6] is a model that can be used for sequence transformation [7,8,9]. The encoder captures the concept of the input sequence, and the decoder generates a sentence corresponding to the concept that the encoder has captured.
2.2. Recurrent Neural Network
Using a recurrent neural network (Figure 1) to process sequential data has become a tradition in the deep learning field, and many models are based on it. Its chain-like structure and hidden state mechanism make it possible to process sequential data [10]. However, the chain-like structure also makes parallelization harder, and information may be lost after many hidden state updates.
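The chain-like dependency can be made concrete with a minimal sketch of a vanilla recurrent cell. The weights, dimensions, and function names below are illustrative assumptions, not values or code from our system; the point is only that step t cannot start before step t-1 has produced its hidden state.

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    """One vanilla RNN update: h' = tanh(W_xh x + W_hh h + b)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_xh[j], x))
            + sum(w * hi for w, hi in zip(w_hh[j], h))
            + b[j]
        )
        for j in range(len(h))
    ]

def rnn_encode(inputs, h0, w_xh, w_hh, b):
    """Run the chain sequentially; each step consumes the previous hidden state."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states

# Toy example (hypothetical weights): 2-dimensional inputs and hidden state.
W_XH = [[0.5, -0.2], [0.1, 0.3]]
W_HH = [[0.4, 0.0], [0.0, 0.4]]
B = [0.0, 0.0]
states = rnn_encode([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0], W_XH, W_HH, B)
```

Because the loop in `rnn_encode` carries `h` from one iteration to the next, the iterations cannot be executed in parallel.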
2.3. Long Short-Term Memory (LSTM)
A vanilla recurrent neural network cell completely overwrites the hidden state at each step [11], even when the hidden state holds information that should be retained. LSTM [12] uses multiple learnable gates to control the update flow, deciding which information should be erased from, kept in, or written into the hidden state. This mechanism makes hidden state updates more efficient and loses less information, as can be seen in Figure 2, where σ denotes the sigmoid function.
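For reference, the gating can be written out as equations. The formulation below is the commonly used LSTM definition and is an assumption about what Figure 2 depicts; the symbols W, U, and b (learnable weights and biases), x_t (input), h_t (hidden state), and c_t (cell state) are introduced here for illustration only.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate: what to erase)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate: what to write)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate: what to expose)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

When f_t is close to 1 and i_t is close to 0, the cell state c_t is carried forward almost unchanged, which is how the cell retains information across many steps.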
2.4. Gated Recurrent Unit (GRU)
Although LSTM can handle most problems, its three learnable gates make for a more complicated structure, which may result in lower training speed, higher memory cost, and harder training. GRU [13] has only two gates, an update gate and a reset gate (Figure 3), which makes it more lightweight than LSTM. Nevertheless, some research [14] shows that GRU may achieve results comparable to LSTM in some tasks.
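A rough parameter count illustrates why GRU is lighter. The sketch assumes the common formulation in which each gate, and the candidate state, is one affine block with an input matrix, a recurrent matrix, and a single bias vector. The sizes of 768 are hypothetical and chosen only for illustration.

```python
def recurrent_params(input_size, hidden_size, num_blocks):
    """Parameter count of a gated cell built from `num_blocks` affine blocks.

    Each block has an input matrix, a recurrent matrix, and one bias vector.
    """
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return num_blocks * per_block

def lstm_params(input_size, hidden_size):
    # Three gates (forget, input, output) plus the candidate cell state.
    return recurrent_params(input_size, hidden_size, 4)

def gru_params(input_size, hidden_size):
    # Two gates (update, reset) plus the candidate hidden state.
    return recurrent_params(input_size, hidden_size, 3)

# Hypothetical sizes, for illustration only.
lstm = lstm_params(768, 768)
gru = gru_params(768, 768)
ratio = gru / lstm  # a GRU cell needs three quarters of the LSTM parameters
```

Under these assumptions a GRU layer has 25% fewer parameters than an LSTM layer of the same size, regardless of the dimensions chosen.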
2.5. Transformer
The Transformer is a novel architecture that can also be used for sequential tasks, but its core concept is far from that of the recurrent neural network. The Transformer has no hidden state or chain-like structure, only self-attention. Self-attention allows each token in the sequence to compute its correlation with all other tokens and to extract relevant information at the same time. Self-attention is illustrated in Figure 4; it is based on the attention mechanism [15]. The most significant advantage of self-attention is that the distance between any two tokens in the sequence is fixed and is not affected by the length of the input sequence. It also helps avoid the information loss that an RNN suffers when processing long sequences, thereby reducing training time and mitigating vanishing gradients.
The disadvantage of self-attention is its relatively large computational cost. On the other hand, because a Transformer composed of self-attention has no chain-like structure, it does not need to wait for the previous state to be completed before processing the next one, as an RNN does, which makes the Transformer parallelizable.
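The following is a minimal sketch of scaled dot-product self-attention in plain Python. It is deliberately simplified: it uses a single head and feeds the raw token vectors in as queries, keys, and values, without the learned projections of a real Transformer. The toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention of every token over all tokens."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Correlation of this token with every token, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(wj * v[d] for wj, v in zip(w, values)) for d in range(len(values[0]))]
        )
    return outputs, weights

# Simplification: no learned projections, so Q = K = V = token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each iteration of the loop over `queries` is independent of the others, which is why the computation can run in parallel. Every token is compared with every other token, so the cost grows quadratically with the sequence length.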
2.6. BERT
BERT (Bidirectional Encoder Representations from Transformers), as its name suggests, is a pre-trained language model based on a Transformer encoder, which can be adapted to many different natural language processing tasks.
The reason for using a pre-trained language model is that data for a specific task may be scarce. Because the model is first pre-trained on large-scale datasets, it can compensate for the lack of data, and it can also speed up training because the pre-trained model has already learned language modeling.
3. Proposed Method
The architecture of the correction system that we implemented is shown in Figure 5. First, the text to be corrected is fed into the system and pre-processed before any other operation. Because the correction system cannot understand human-readable text, the text must be converted into a machine-readable format before it can be passed to the Seq2Seq model for correction. Once the encoder has captured the sentence’s meaning, the decoder corrects the errors and produces the output, using different decoding methods in the training phase and the inference phase.
3.1. Preprocessing
Preprocessing can be divided into three parts: the tokenizer, the vocabulary, and the embedding layer. The three are inseparable, and the vocabulary usually determines how the tokenizer should segment a sentence.
3.2. Vocabulary
A vocabulary is essential in a natural language processing system. Vocabularies can be divided into two categories: word-based and character-based. We use a character-based vocabulary because it is more flexible and, unlike a word-based vocabulary, is not prone to encountering out-of-vocabulary words. Because we are implementing a sentence correction system, the input is more likely to contain unknown words than in other systems.
3.3. Tokenizer
The purpose of the tokenizer is to split a sentence into multiple words or combinations of characters. In practical applications, however, it must be used together with a vocabulary, and the type of tokenizer is determined by the characteristics of the vocabulary. Because we use a character-based vocabulary, we also use a character-based tokenizer.
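As an illustration, a character-based tokenizer and vocabulary can be sketched in a few lines. This toy version builds its vocabulary from a two-sentence corpus and uses a hypothetical `[UNK]` entry for unseen characters. It is not the BERT vocabulary or tokenizer used in the experiments.

```python
UNK = "[UNK]"

def build_vocab(corpus, specials=(UNK,)):
    """Map every distinct character in `corpus` to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in corpus:
        for ch in sentence:
            if ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab

def tokenize(sentence):
    """Character-based tokenizer: one token per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]

def encode(sentence, vocab):
    """Convert a sentence to ids, falling back to [UNK] for unseen characters."""
    return [vocab.get(ch, vocab[UNK]) for ch in tokenize(sentence)]

# Toy corpus (illustrative only).
vocab = build_vocab(["我喜歡學中文", "他學中文"])
ids = encode("我學日文", vocab)  # "日" is not in the toy vocabulary
```

Only the unseen character maps to `[UNK]`; the rest of the sentence is still represented exactly. With a word-based vocabulary, one unseen word would lose all of its characters.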
3.4. Embedding Layer
The embedding layer converts the characters or words segmented by the tokenizer into a machine-readable format. The result is called an embedding or word embedding, which represents each word or character as a high-dimensional vector and can also express the relationships between words and characters.
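Conceptually, an embedding layer is a lookup table indexed by token id. The sketch below uses a small randomly initialised table. In a real model the values are learned during training, and the vocabulary size and dimension used here are hypothetical.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Randomly initialised table; in a real model these values are learned."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one vector per token id."""
    return [table[i] for i in token_ids]

# Hypothetical sizes: 8 vocabulary entries, 4-dimensional vectors.
table = make_embedding_table(vocab_size=8, dim=4)
vectors = embed([1, 4, 0, 6], table)
```

The same id always maps to the same vector, so after training, characters that behave similarly can end up with similar vectors.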
3.5. Language Model
A language model represents the probability of a sentence through a distribution. Without a good language model, a good natural language processing system cannot be built. We use a fine-tuning-based language model, that is, a pre-trained language model that is trained a second time so that it can be better applied to different natural language processing tasks. The fine-tuning-based model used in the experiments is BERT, which is both a language model and an encoder. For the non-BERT-based encoders in the experiments, we still use the BERT vocabulary, but those models are not pre-trained.
3.6. Encoder Properties
Since the encoder’s input sequence is known in both the training phase and the inference phase, the encoder can be parallelized in the training phase, and the inference phase is roughly the same; whether this is possible depends mainly on the model’s architecture. The Transformer architecture, for example, is parallelizable because of the self-attention mechanism. In contrast, encoders based on recurrent neural networks are more difficult to parallelize: because of the hidden state mechanism, the encoder must finish computing the previous hidden state before proceeding to the next one. The model architecture also largely determines the directionality of the encoder. A recurrent neural network is uni-directional by default, although bi-directional encoding can be achieved by using two encoders that read the sequence in opposite directions, whereas the Transformer encoder is bi-directional by default. The encoder properties are shown in Table 1 and Table 2.
3.7. Decoder Properties
The parallelizability of the decoder may differ between the training phase and the inference phase. In the inference phase, the decoder cannot be parallelized because it has to wait for the previous output to be completed. In the training phase, the decoder’s behavior depends on its architecture and training method: by default it still has to wait for the previous output, but parallel training is possible with a specific training method.
Teacher forcing is a method that gives the decoder the groundtruth sequence in the training phase, so the decoder only needs to predict one step at a time. Because the groundtruth is available, the decoder can see the complete sequence, which makes parallel training possible. However, an RNN needs to wait for the hidden state at each step, so parallel training remains challenging even when teacher forcing is used. As for directionality, the decoder cannot know the complete sequence in advance, so the Transformer-based decoder can only perform uni-directional decoding. Although teacher forcing could let the decoder see the complete sequence, doing so would lead to inconsistency between the training phase and the inference phase. The decoder properties are shown in Table 3 and Table 4.
3.8. Analysis
BERT is a pre-trained language model based on Transformer, so it inherits both the advantages and disadvantages of Transformer. We found that a Transformer-based sequence to sequence model performs well in Chinese sentence correction but is very slow in the inference phase. By inspecting the properties of the encoder and decoder, we found that the bottleneck of the Transformer-based sequence to sequence model is the decoder.
The Transformer-based decoder is slow in the inference phase because it is based on self-attention. With self-attention, the Transformer does not have the chain-like structure of the RNN-based model, which makes the Transformer powerful and parallelizable but also gives it higher computational complexity.
At the inference phase, the encoder knows the complete input sequence in advance, which makes Transformer encoder parallelizable, but the decoder generates the next token based on the previous token, which makes the decoding process hard to parallelize.
Though the Transformer encoder has higher computational complexity, we found that parallelization has a significant effect on reducing execution time. However, the Transformer decoder has no such feature, which slows down the whole sequence transformation process.
To increase the inference speed of Chinese sentence correction, we introduce hybrid models that combine BERT with RNN-based models; they speed up inference while preserving the Transformer-based model’s performance.
3.9. Hybrid Architecture, BERT-RNN
The Transformer-based model is very different from the RNN-based model, so the two cannot be combined directly. For an RNN-based sequence to sequence model, the decoder’s hidden state usually has to be initialized. There are many ways to do this, such as using another neural network to predict the initial state. However, we use the most straightforward method, averaging BERT’s output, to avoid extra computation, as seen in Figure 6.
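A minimal sketch of this initialization follows. It assumes that the average is taken over the token axis of the encoder output and that padding positions have already been removed; the vectors are hypothetical and the function name is illustrative.

```python
def mean_pool(encoder_outputs):
    """Average the per-token vectors into a single vector."""
    length = len(encoder_outputs)
    dim = len(encoder_outputs[0])
    return [sum(tok[d] for tok in encoder_outputs) / length for d in range(dim)]

# Hypothetical encoder output: 2 tokens, 3-dimensional vectors.
bert_outputs = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
decoder_h0 = mean_pool(bert_outputs)  # used as the decoder's initial hidden state
```

Averaging has no learnable parameters, so it adds essentially no computation. It does require the decoder's hidden size to match the encoder's output size.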
3.10. BERT-GRU
As shown in Figure 7, the first hybrid model is BERT-GRU. It connects the BERT encoder to a GRU decoder, combining the fast, high-performance BERT encoder with the lightweight GRU decoder, whose performance is adequate. This model is geared toward improving inference speed.
3.11. BERT-LSTM
The second hybrid model is BERT-LSTM (Figure 8). It connects the BERT encoder to an LSTM decoder. LSTM has more parameters than GRU, but theoretically it has better performance. Therefore, compared to BERT-GRU, this hybrid model is geared toward improving performance rather than inference speed.
3.12. Training Methods
The simplest way to train a sequential model is to generate the target sequence directly, as in the inference phase, and compute the loss on the generated sequence. This method is called free running (Figure 9). Its most significant disadvantage is that training is hard to converge, because sequential tasks are structured prediction problems; that is, the outputs depend on each other. Early in training, the predictions of the sequential model are still inaccurate, so a wrong prediction at one step may cause the prediction at the next step to be wrong as well. The accumulation of errors leads to large oscillations during training, which then takes a long time to converge. In addition, the gradient has to be propagated back along the time axis, which requires a larger amount of memory.
Therefore, we use teacher forcing in the training phase (Figure 10). Teacher forcing is a training technique proposed to solve this training instability. We shift the groundtruth one step to the right and prepend a special symbol that indicates the start of the sequence (BOS in Figure 10, which stands for the beginning of the sequence), so the decoder’s input at each step is always correct. Thus, the error accumulation described above does not occur with teacher forcing. However, this training method has a side effect: a sequential model trained with teacher forcing only has to learn to predict one step and does not consider the output of the next few steps or the overall structure of the sentence. This problem is called exposure bias. The solution we use is beam search, which is introduced in detail in the following sections.
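The right shift can be sketched as follows. The token strings `[BOS]` and `[EOS]` are hypothetical names for the start and end signals, and appending an end symbol to the labels is an assumption made for this example.

```python
BOS, EOS = "[BOS]", "[EOS]"

def teacher_forcing_pairs(target_tokens):
    """Build decoder inputs (groundtruth shifted right) and per-step labels."""
    decoder_inputs = [BOS] + list(target_tokens)
    decoder_labels = list(target_tokens) + [EOS]
    return decoder_inputs, decoder_labels

inputs, labels = teacher_forcing_pairs(list("我學中文"))
# At step t the decoder reads inputs[t] (always correct) and must predict labels[t].
```

Because every decoder input comes from the groundtruth rather than from the model's own earlier predictions, a wrong prediction at one step cannot contaminate the next step during training.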
3.13. Greedy Decoding
Greedy decoding is the most basic decoding method in sequential tasks. At each decoding step, the token with the highest probability is selected as the output. It is called greedy decoding because the algorithm always selects the locally best option and does not look ahead. However, the best choice at each step does not necessarily yield the best overall sequence. This problem is called exposure bias, and one of its causes is teacher forcing.
Although teacher forcing makes the training process relatively stable, it causes the sequential model to learn only to predict one step during training without considering the long-term future, which leads to inconsistency between the inference phase and the training phase. One of the solutions is beam search.
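A minimal sketch of greedy decoding shows how the locally best choice can miss the best sequence. The "model" here is a hypothetical lookup table of next-token probabilities, not a trained decoder; its numbers are chosen so that the most probable first token leads to a less probable sentence.

```python
import math

EOS = "<eos>"

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def next_token_probs(prefix):
    # After two tokens the toy model always emits the end signal.
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_len=10):
    """Pick the single most probable token at every step."""
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        log_prob += math.log(probs[token])
        if token == EOS:
            break
        prefix.append(token)
    return prefix, math.exp(log_prob)

sequence, prob = greedy_decode()
```

In this toy model, greedy decoding commits to "A" because 0.6 > 0.4 and ends with a sequence of probability 0.6 × 0.55 = 0.33. The sequence "B x" has probability 0.4 × 0.9 = 0.36, but greedy decoding never considers it.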
3.14. Beam Search
Beam search, shown in Figure 11, is a method that searches over candidate sentences without additional training and tries to find a better sentence than greedy decoding does. While decoding, beam search selects the several outputs with the highest probability as candidates at each step, forming a tree with multiple branches. The number of candidates kept at each step is called the beam size. The decoder keeps expanding branches until every branch receives the end signal or the sentence exceeds the length limit, and then selects the branch with the highest probability as the final output. Although beam search can find a better answer, it requires more computation than greedy decoding. Greedy decoding can be regarded as beam search with a beam size of 1, in which only the best output is kept as a candidate at each step.
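The procedure can be sketched as below. The toy next-token table is hypothetical and stands in for a trained decoder. The sketch also simplifies real implementations: finished branches are set aside rather than competing for beam slots, and no length normalization is applied.

```python
import math

EOS = "<eos>"

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def next_token_probs(prefix):
    # After two tokens the toy model always emits the end signal.
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def beam_search(beam_size, max_len=10):
    """Keep the `beam_size` most probable unfinished branches at every step."""
    beams = [((), 0.0)]  # (prefix, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, p in next_token_probs(prefix).items():
                if token == EOS:
                    finished.append((prefix, score + math.log(p)))
                else:
                    candidates.append((prefix + (token,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:  # every branch has received the end signal
            break
    finished.extend(beams)  # branches cut off by the length limit, if any
    best_prefix, best_score = max(finished, key=lambda c: c[1])
    return list(best_prefix), math.exp(best_score)

wide_sequence, wide_prob = beam_search(beam_size=2)
narrow_sequence, narrow_prob = beam_search(beam_size=1)
```

With a beam size of 2 the search keeps the "B" branch alive and returns "B x" (probability 0.36), whereas a beam size of 1 reproduces the greedy result "A x" (probability 0.33). The extra candidates expanded at each step are the source of the additional computation.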
4. Experiment
For our experiment, we prepared the dataset, environment, settings, and evaluation metric. The subsections below describe the experiment in detail.
4.1. Dataset
We use the NLPCC 2018 grammatical error correction (GEC) dataset [16] for the experiments. However, the groundtruth of the GEC testing set is not available, so we split the GEC training set into a custom training set and a custom testing set. We perform the experiments with three different sequence lengths, 25, 50, and 128, as Table 5 shows. We chose these lengths on the presumption that they reflect the relation between sequence length and the results, according to the sequence length distribution seen in Figure 12.
4.2. Environment
The experiments were run on a single machine with Arch Linux as the operating system, an Intel i7 6700 CPU, and an Nvidia 1080 Ti GPU. We use MXNet [17] and GluonNLP [18] as the framework.
4.3. Experiment Settings
We use Adam (Adaptive Moment Estimation) [19] as the neural network optimizer in the experiments. When updating the parameters, Adam takes previous update directions into account and adapts the learning rate according to the gradient. The learning rate in the experiments is set to 0.0001. We use both greedy decoding (GD) and beam search (BS) in the experiments.
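The update rule can be sketched in plain Python. The learning rate of 0.0001 is the value stated above. The remaining hyperparameters (`beta1`, `beta2`, `eps`) are the commonly used defaults and are assumptions, since the values used in the experiments are not stated. The quadratic objective is a toy stand-in for the training loss.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over flat lists of parameters and gradients."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # running mean of gradients (direction)
        v_i = beta2 * v_i + (1 - beta2) * g * g   # running mean of squared gradients (scale)
        m_hat = m_i / (1 - beta1 ** t)            # bias correction for early steps
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy objective f(w) = w^2 with gradient 2w, starting from w = 1.0.
w, m, v = [1.0], [0.0], [0.0]
for t in range(1, 101):
    grads = [2 * w[0]]
    w, m, v = adam_step(w, grads, m, v, t)
```

The first moment `m` is what lets Adam refer to previous update directions, and dividing by the square root of the second moment `v` is what adapts the effective step size to the gradient magnitude.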
4.4. Vocabulary Setup
Because we use BERT in the experiments, the source vocabulary follows the default BERT vocabulary, which includes simplified Chinese characters, traditional Chinese characters, lowercase English characters, and some punctuation marks, 21,128 characters in total.
In the experiments, we convert all simplified Chinese characters into traditional Chinese characters, so the target vocabulary only has to contain traditional Chinese characters, English characters, and punctuation marks. This gives 10,991 characters in total, roughly half the size of the source vocabulary, which also reduces memory usage and training time.
4.5. Evaluation Metric
We use BLEU (Bilingual Evaluation Understudy) [20] to evaluate performance. It was originally designed to evaluate machine translation. The principle is to compute the overlap ratio of consecutive characters between the prediction and the groundtruth. In the experiments, we use the NLTK [16] package with N = 4, which combines the overlap rates from single characters up to four consecutive characters.
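To make the metric concrete, the following is a hand-written sketch of sentence-level, character-based BLEU with N = 4. It is not the NLTK implementation used in the experiments: it assumes uniform weights over the four n-gram orders, a single reference, and no smoothing, so any order with zero overlap gives a score of 0. The example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all runs of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngrams(hypothesis, n)
        ref = ngrams(reference, n)
        # Clipped overlap: an n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalise predictions that are not longer than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = list("我喜歡學習中文")
perfect = bleu(reference, list("我喜歡學習中文"))
partial = bleu(reference, list("我喜歡學中文"))  # one character missing
```

An exact match scores 1, and dropping a single character lowers the overlap at every n-gram order, which is why BLEU is sensitive to local errors.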
4.6. Model Performance Comparison
The model naming convention is encoder-decoder. For example, GRU-GRU means that both the encoder and the decoder are GRUs. Inference speed and training speed are measured in samples per second.
The results show that the pure RNN-based models have lower performance than the pure Transformer-based model but faster inference. Our hybrid models, however, achieve performance similar to that of the Transformer-based model while running inference faster than the pure RNN-based models. The hybrid model BERT-GRU showed the best performance and the fastest inference speed in all three experiments, with both beam search and greedy decoding.
From the perspective of the pure RNN-based model, replacing the RNN-based encoder with a Transformer-based encoder accelerates encoding in both the inference phase and the training phase, and also improves performance. From the perspective of the pure Transformer-based model, replacing the Transformer-based decoder with an RNN-based decoder accelerates decoding in the inference phase but slows down decoding in the training phase.
We use teacher forcing in the training phase. Teacher forcing gives the groundtruth to the decoder so that it knows the entire decoding target in advance, which makes it possible to parallelize the decoder. Without teacher forcing, the decoder has to wait for the previous token to be predicted. However, an RNN-based decoder also has to wait for the previous hidden state, so it still cannot be parallelized even when trained with teacher forcing. As a result, the hybrid model trains more slowly.
The experimental results also show that turning the pure Transformer-based model into the hybrid model yields a larger speedup when beam search is used. The reason is that beam search puts more load on the decoder, so replacing the Transformer-based decoder with a lightweight RNN-based decoder saves more time.
4.7. Experiment Comparison
The data used in the three sets of experiments are not precisely the same, so the following comparison will focus on the improvement of speed.
Figure 13 shows the percentage increase in the hybrid model’s inference speed relative to the original RNN-based model for different sequence lengths. The vertical axis represents the percentage increase in inference speed, and the horizontal axis represents the experiments with different lengths. The inference speed improves with both beam search and greedy decoding, and the longer the sequence, the larger the improvement. Because only the RNN-based encoder is replaced with the Transformer-based encoder, beam search, whose cost is concentrated in the decoding stage, improves less. In contrast, with greedy decoding, the longer the sequence, the higher the ratio of improvement. Therefore, the two greedy decoding lines in Figure 13 are non-linear.
Figure 14 shows the percentage increase in the hybrid model’s inference speed relative to the original Transformer-based model for different sequence lengths. The vertical axis represents the percentage increase in inference speed, and the horizontal axis represents the experiments with different lengths. The inference speed improves with both beam search and greedy decoding, and the longer the sequence, the larger the improvement. Because we replace the Transformer-based decoder with the RNN-based decoder, beam search, whose cost is concentrated in the decoding stage, shows the greater improvement. Moreover, Figure 14 does not show the apparent non-linear improvement seen in Figure 13. The main reason is that the decoder itself cannot be parallelized during inference, so the speedup relies on the lightweight nature of GRU and LSTM. In Figure 13, by contrast, we replace the non-parallelizable RNN-based encoder with the parallelizable BERT encoder, so the non-linear improvement is more pronounced. Detailed results can be seen in Table 6, Table 7 and Table 8.
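For readers converting between the percentage increases plotted in the figures and plain speedup factors, the arithmetic can be sketched as below. The sketch assumes that a percentage increase is computed relative to the baseline speed in samples per second; the speeds in the example are hypothetical and are not measurements from the tables.

```python
def percent_increase(baseline_speed, new_speed):
    """Percentage increase of `new_speed` over `baseline_speed` (samples per second)."""
    return (new_speed - baseline_speed) / baseline_speed * 100.0

def speedup_factor(percent):
    """Convert a percentage increase into a multiplicative speedup."""
    return 1.0 + percent / 100.0

# Hypothetical speeds, for illustration only.
increase = percent_increase(baseline_speed=10.0, new_speed=25.0)
factor = speedup_factor(increase)
```

Under this definition, an increase of 150% corresponds to running 2.5 times as fast as the baseline.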
5. Conclusions
As the Transformer architecture has gradually become popular, RNN-family models have fallen out of favor in the field of natural language processing. By examining the advantages and disadvantages of RNN and Transformer, we introduce a hybrid model with faster inference speed and better performance.
In addition to benefiting from pre-training, the BERT encoder, as a Transformer encoder, is parallelizable, which makes up for the shortcomings of GRU and LSTM as encoders in terms of speed and performance. Conversely, GRU and LSTM as decoders make up for the slower inference of the Transformer-based model, so the two compensate for each other’s shortcomings. Through multiple sets of experiments, we have also shown that this combination can indeed be applied to Chinese sentence correction. Among the models, BERT-GRU obtained the highest BLEU score in all experiments. Relative to the original Transformer-based model, inference speed improves by 1131% with beam search decoding in the 128-word experiment, and by 452% with greedy decoding. The longer the sequence, the larger the improvement.