A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation

Abstract: One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is not sufficient, which results in poor translation quality. In this paper, we propose a diversity data augmentation method that does not use extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on the source and target sides. To generate diverse data, a restricted sampling strategy is employed at the decoding steps. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In our experiments, the proposed approach achieved an improvement of 1.96 BLEU points on the IWSLT2014 German–English translation task, which was used to simulate a low-resource setting. Our approach also consistently and substantially obtained improvements of 1.0 to 2.0 BLEU points on three other low-resource translation tasks: English–Turkish, Nepali–English, and Sinhala–English.


Introduction
Neural machine translation (NMT) is one of the most interesting areas in natural language processing (NLP). It is based on an encoder-decoder architecture, where the encoder encodes the source sentence as a continuous space representation, and the decoder generates the target sentence based on the encoder output [1]. NMT has achieved tremendous success in the past few years [2,3], but it usually requires a large amount of high-quality bilingual data for training [4]. For some languages, there are not enough resources to train a robust neural machine translation system. Therefore, neural machine translation under low-resource conditions remains an enormous challenge.

There have been many studies on low-resource machine translation. Using high-resource languages to help improve the performance of low-resource neural machine translation is an intuitive approach; examples include transfer learning [5], model-agnostic meta-learning [6], triangular architectures [7], and multi-way, multilingual NMT frameworks [8]. As an essential way to enhance translation performance by generating additional training samples, data augmentation has been proven useful for low-resource neural machine translation. Some previous works use synonyms to replace specific words in the training data; however, thesauruses are scarce for low-resource languages. Another approach replaces some words in the target sentence with other words from the target vocabulary, for instance, randomly replacing a word with a placeholder, sampling a word from the frequency distribution of the vocabulary [9], or randomly setting the word embedding to 0 [10]. However, due to data sparsity in low-resource languages, it is difficult for these methods to leverage all possible augmented data. Another data augmentation method is the use of monolingual data; the well-known methods are back-translation and self-learning [11][12][13].
Back-translation is a data augmentation approach that translates monolingual data of the target side into the source to augment pseudo bitext [11]. Zhang et al. [12] proposed a self-learning method. They proved that translating the monolingual source data into the target in order to augment the training data is useful for improving the translation performance. However, these methods require substantial efforts to collect and clean the necessary amount of monolingual data. Meanwhile, the additional monolingual corpus is scarce in some low-resource languages.
In this paper, we propose an effective data augmentation strategy that does not use any monolingual data; it augments the training data by generating diverse source and target sentences from the original data. Compared with the original training data, the diverse source or target data has the same semantics but different expressions [14,15]. To augment diverse data, we train translation models in two directions: backward (target-to-source) and forward (source-to-target). These translation models are then employed to decode the training data multiple times. In the decoding process, we use the restricted sampling strategy, which can produce diverse data. Finally, duplicate sentences are deleted, and the training data and pseudo-parallel data are merged to train the final model.
To demonstrate the effectiveness of our method, we first performed experiments on the IWSLT2014 German-English translation task, which can be used to simulate a low-resource setting. We compared our approach with other data augmentation methods; the results show that the proposed method achieved an improvement of 1.96 BLEU points over the baseline without using extra monolingual data, the best result among all the data augmentation methods compared. We also conducted experiments on three low-resource translation tasks: English-Turkish, Nepali-English, and Sinhala-English. The experimental results indicate that our method boosted performance by 1.51, 1.28, and 1.53 BLEU points on the English-Turkish, Nepali-English, and Sinhala-English translation tasks, respectively.
In summary, our contributions are as follows: (1) We propose a data augmentation strategy that has proved effective for many languages. (2) Compared with other data augmentation approaches, ours obtained the best result. (3) We performed experiments to explain the effectiveness of our method. These results verify that the increase in performance is not due to data replication and that our approach can produce diverse data. Finally, we found that the backward model was more important than the forward model.
In the rest of this article, Section 2 presents some related works about data augmentation. Section 3 describes the details of the restricted sampling strategy and our diversity data augmentation. The experiment details and results are shown in Section 4. Section 5 presents some experiments we conducted to analyze the effect of our data augmentation method. Finally, the conclusions are presented in Section 6.

Related Work
Although neural machine translation (NMT) has achieved strong performance in many languages, data sparsity and the lack of morphological information are important issues. Some works aim to improve the effectiveness of machine translation by adjusting the translation granularity or incorporating morphological information. Sennrich et al. used the byte-pair-encoding algorithm to segment source and target sentences into subword sequences [16]. Pan et al. segmented words into morphemes based on morphological information; their results show that this method can effectively reduce data sparsity [17]. Sennrich et al. improved the performance of their encoder by employing source-side features such as morphological features, part-of-speech tags, and syntactic dependency labels [18]. Tamchyna et al. employed an encoder-decoder to predict a sequence of interleaved morphological tags and lemmas, then used a morphological generator to produce the final results [19]. However, some languages lack effective morphological analysis tools. Therefore, some researchers pay more attention to improving the performance of machine translation with data augmentation.
Data augmentation is an effective method that generates additional training examples to improve the performance of deep learning. This method has been widely applied in many areas. In the field of computer vision, some image augmentation methods such as cropping, rotating, scaling, shifting, and adding noise are widely used and highly effective [20,21]. There are several related works about data augmentation for NMT. One of the data augmentation methods is based on word replacement. Fadaee et al. propose a word replacement method that uses the target language model to replace the high-frequency words with rare words, then changes its corresponding word in the source [22]. Xie et al. replace the word with a placeholder token or a word sampled from the frequency distribution of the vocabulary [9]. Kobayashi et al. use a wide range of substitute words generated by a bi-directional language model to replace the word token in the sentence [23]. Wu et al. replace the bi-directional model with BERT [24], which is a more powerful model, then use it to generate a set of substitute words [25]. Gao et al. propose a soft contextual data augmentation method that uses a soft distribution to replace the word representation instead of a word token [26]. Due to the data sparsity for low-resource languages, it is difficult for those methods to leverage all possible augmented data.
The other category of data augmentation is based on monolingual data. Sennrich et al. propose a simple and effective data augmentation method, where the target language data is translated into the source to augment the parallel corpus [11]. It has been proved effective by many works [27][28][29][30]. Zhang et al. propose a self-learning data augmentation method that translates monolingual source data to the target, then combines it with the original data to train the final model [12]. Imamura et al. show that generating synthetic sentences based on sampling is more effective than beam search. They generate multiple source sentences for each target [14]. Currey et al. show that copying target monolingual data into the source can boost the performance of low-resource translation [31]. Chang et al. and He et al. employ monolingual corpora from the source and target sides to extend the back-translation method as dual learning [32,33]. A similar method has been applied in unsupervised NMT [34,35]. Hoang et al. suggest an iterative data augmentation procedure that continuously improves the quality of the back-translation and final systems [36]. Niu et al. use multilingual NMT, which trains two directions of a translation model in a single model, to translate monolingual data from the source or target side and generate synthetic data [37]. Zhang et al. propose a corpus augmentation method; they segment long sentences based on word alignment and use back-translation to generate pseudo-parallel sentence pairs [38]. Although these methods have significantly improved the effectiveness of machine translation, they use additional monolingual corpora.

Approach
In this section, we describe our diversity data augmentation approach in detail. First, we introduce the main idea of back-translation and self-learning. Then, we present a decoder strategy which is used to generate diversity data. Finally, the training process of our data augmentation approach is presented in detail.

Back-Translation and Self-Learning
Back-translation is an effective way to improve the performance of machine translation. It is usually used to increase the size of the parallel data. Given the parallel language pairs D = {(s_n, t_n)}_{n=1}^{N} and a monolingual target dataset D_mon = {t_m^mon}_{m=1}^{M}, the main idea of back-translation is as follows: First, the backward translation model NMT_{T→S} is trained with the parallel corpus D. Second, the monolingual target data D_mon are translated into the source by the translation model NMT_{T→S}. The translations and the monolingual target data are combined as the synthetic corpus D_synthetic = {(s_m^bt, t_m^mon)}_{m=1}^{M}. Finally, the initial corpus is combined with the synthetic corpus to train the main translation system NMT_{S→T}.
The main idea of self-learning is the same as that of back-translation. The difference is that self-learning is based on monolingual source data. Given the parallel language pairs D = {(s_n, t_n)}_{n=1}^{N} and a monolingual source dataset D_mon = {s_m^mon}_{m=1}^{M}, the process of self-learning includes the following steps: First, the forward translation model NMT_{S→T} is trained with the parallel corpus D. Second, the monolingual source data D_mon are translated into the target by the translation model NMT_{S→T}. The monolingual source data and their translations are combined as the synthetic corpus D_synthetic = {(s_m^mon, t_m^st)}_{m=1}^{M}. Finally, the main translation system NMT_{S→T} is trained with the mixture of parallel and synthetic data.
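The two procedures are symmetric and can be sketched in a few lines. The following Python sketch is illustrative only: `train(pairs)` and `translate(model, sents)` are hypothetical helpers standing in for a full NMT toolkit, where `train` returns a trained model and `translate` returns one output sentence per input.

```python
def back_translate(parallel, mono_target, train, translate):
    """Back-translation: monolingual target data -> synthetic sources."""
    # Train the backward (target-to-source) model on the real pairs.
    backward = train([(t, s) for s, t in parallel])
    synthetic_sources = translate(backward, mono_target)
    # Pair each synthetic source with its real target sentence.
    return parallel + list(zip(synthetic_sources, mono_target))

def self_learn(parallel, mono_source, train, translate):
    """Self-learning: monolingual source data -> synthetic targets."""
    # Train the forward (source-to-target) model on the real pairs.
    forward = train(parallel)
    synthetic_targets = translate(forward, mono_source)
    # Pair each real source sentence with its synthetic target.
    return parallel + list(zip(mono_source, synthetic_targets))
```

In both cases, the final system is then trained on the returned mixture of real and synthetic pairs.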

Decoder Strategy
NMT systems typically use beam search to translate sentences [39]. Beam search is an algorithm that approximately maximizes the conditional probability. Given the source sentence x, it retains several high-probability partial hypotheses at each decoding step and generates the translation with the highest overall probability:

y_t = argmax_y P(y | x, y_<t). (1)

However, beam search always focuses on the head of the model distribution, which results in very regular translation hypotheses that do not adequately cover the actual data distribution [30]. On the contrary, decoding based on sampling or restricted sampling can produce diverse data by sampling from the model distribution [14,30]. At each decoding step, the sampling method randomly chooses the token from the whole vocabulary distribution as:

y_t = sampling_y(P(y | x, y_<t)), (2)

where sampling_y(P) denotes the sampling operation of y according to the probability distribution P. The restricted sampling strategy is a middle ground between beam search and unrestricted sampling. It adds a restriction on the selection of candidate tokens. The process of the restricted sampling strategy contains the following steps: First, at each decoding step, the translation model generates the probability distribution P. According to the output distribution P, it selects the k highest-probability tokens as the candidate set C:

C = top-k_y(P(y | x, y_<t)). (3)

Second, it renormalizes the probabilities of all tokens in the candidate set. Finally, it samples a token from the candidate set as the output:

y_t = sampling_{y∈C}(P̃(y)). (4)

Figure 1 shows the difference between these decoding strategies.
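As a minimal illustration of one restricted sampling step, the Python function below picks a token from the k highest-probability entries of a toy token-probability table, renormalizing before sampling. The dictionary-based distribution is a stand-in for the decoder's softmax output over the vocabulary.

```python
import random

def restricted_sample(dist, k, rng=None):
    """Sample one token from the k highest-probability entries of `dist`.

    `dist` maps token -> probability. The top-k probabilities are
    renormalized before sampling, as in the restricted sampling strategy.
    """
    rng = rng or random
    # Step 1: keep only the k highest-probability candidate tokens.
    candidates = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Step 2: renormalize over the candidate set; step 3: sample from it.
    total = sum(p for _, p in candidates)
    r = rng.random() * total
    acc = 0.0
    for tok, p in candidates:
        acc += p
        if r <= acc:
            return tok
    return candidates[-1][0]
```

With k = 1 this degenerates to greedy search, and with k equal to the vocabulary size it becomes unrestricted sampling, which is exactly why it sits between the two strategies.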

For low-resource languages, the restricted sampling strategy achieves better results than unrestricted sampling [30], so we use the restricted sampling strategy in our decoding procedure.

Training Strategy
In our approach, we aim to generate diversified data based on the original data without any monolingual data. So, we use back-translation and self-learning on the initial training data to augment source or target data. We first train the two-directional model based on the idea of back-translation and self-learning, then use those models to decode the source or target sentence in the origin corpus. Finally, we combine multiple synthetic data with the original data to train the final model. The framework for our data augmentation method is presented in Figure 2.
The steps of our data diversification strategy are as follows. Notation: let S and T denote the source and target languages, and let D = (S, T) denote the bilingual training data set. We use R to represent the number of training rounds. Let M^R_{S→T} denote the forward translation model at the R-th round and M^R_{T→S} the backward translation model at the R-th round. We use M^R_{l→l',k}(X) to represent the result of translating dataset X with the model M^R_{l→l'} in the k-th sampling pass, where K denotes the diversification factor. Algorithm 1 summarizes our data augmentation strategy.

Algorithm 1. Our data augmentation strategy.
Input: bilingual data D_0 = (S, T), training rounds R, diversification factor K
Output: Model_final
1. procedure Train(D):
2.   Initialize M with random parameters θ
3.   Train M on D until convergence
4.   Return M
5. for r = 1 to R:
6.   M^r_{S→T} ← Train(D_{r-1}); M^r_{T→S} ← Train(reversed D_{r-1})
7.   D_r ← D_{r-1} ∪ ⋃_{k=1}^{K} (S, M^r_{S→T,k}(S)) ∪ ⋃_{k=1}^{K} (M^r_{T→S,k}(T), T)
8.   Remove duplicate sentence pairs from D_r
9. Model_final ← Train(D_R)
10. Return Model_final
In the first round, we train the backward NMT model M^1_{T→S} and the forward NMT model M^1_{S→T} on the initial dataset D_0. Then we employ the forward model M^1_{S→T} to decode the source sentences of the training data with the restricted sampling strategy. We repeat this process K times to create multiple synthetic sentences on the target side. In other words, we obtain the synthetic corpora (S, M^1_{S→T,k}(S)) for k = 1, ..., K. Similarly, we run the same process with the backward translation model M^1_{T→S}: the multiple synthetic source sentences are generated by using M^1_{T→S} to translate the original target sentences.
Then, we add the multiple synthetic corpora to the original data as follows:

D_1 = D_0 ∪ ⋃_{k=1}^{K} (S, M^1_{S→T,k}(S)) ∪ ⋃_{k=1}^{K} (M^1_{T→S,k}(T), T).

If R > 1, we continue by training the second-round backward model M^2_{T→S} and forward model M^2_{S→T} on D_1. We follow the process above until the final dataset D_R is generated. Finally, we train the final translation model on the corpus D_R. To aid understanding, Algorithm 1 summarizes this process.
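The whole diversification loop can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: `train(pairs)` and `sample_translate(model, sents, rng)` are hypothetical stand-ins for an NMT toolkit, where the latter decodes with restricted sampling, so repeated calls yield different outputs.

```python
import random

def diversify(data, R, K, train, sample_translate, seed=0):
    """Data diversification over R rounds with diversification factor K."""
    rng = random.Random(seed)
    for _ in range(R):
        forward = train(data)                        # source -> target
        backward = train([(t, s) for s, t in data])  # target -> source
        srcs = [s for s, _ in data]
        tgts = [t for _, t in data]
        synthetic = []
        for _ in range(K):
            # Forward model: diverse synthetic targets for real sources.
            synthetic += zip(srcs, sample_translate(forward, srcs, rng))
            # Backward model: diverse synthetic sources for real targets.
            synthetic += zip(sample_translate(backward, tgts, rng), tgts)
        # Merge with the previous round's data and drop duplicate pairs.
        data = list(dict.fromkeys(list(data) + synthetic))
    return data
```

With the paper's default setup (K = 3, R = 1), each original pair can contribute up to six synthetic pairs before deduplication, which matches the sevenfold data size used in the copying baseline of Section 5.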

Experiments
This section describes the experiments conducted to verify the effectiveness of our diversity data augmentation method. We first compared our method with other data augmentation methods on the IWSLT2014 German-English translation task, which can be used to simulate a low-resource setting. Then we verified the effectiveness of our method on three low-resource translation tasks.

IWSLT2014 EN-DE Translation Experiment
We used this dataset to simulate a low-resource setting. The dataset contains about 160k parallel sentences. We randomly sampled 5% of the sentences from the training data as a validation set and concatenated the IWSLT14.TED.dev2010 set, the IWSLT14.TED.dev2012 set, and the three IWSLT14 test sets from 2010 to 2012 for testing. We used byte-pair encoding (BPE) [16] to build a shared vocabulary of 10,000 tokens. All datasets were tokenized with the Moses toolkit [40]. We compared our method with the other data augmentation methods mentioned in Zhu's work [26]. The data augmentation strategies were as follows:
• Artetxe et al. and Lample et al. propose randomly swapping words in nearby locations within a window of size k; we denote this as SW [26,35,41].
• Xie et al. use a placeholder to randomly replace words. We denote this method as BW [9,26].
• Xie et al. also employ a method that randomly replaces word tokens with samples from the unigram frequency distribution over the vocabulary. We denote it as SmoothW [9,26].
• Kobayashi et al. randomly replace word tokens with words sampled from the output distribution of a language model. We denote it as LMW [23,26].
• We denote Gao's work as SoftW. They randomly replace word embeddings with a weighted combination of the embeddings of multiple semantically similar words [26].
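For concreteness, the three simplest of these perturbations (SW, BW, and SmoothW) can be sketched as operations on token lists. This is an illustrative Python sketch under our own naming, not the referenced implementations.

```python
import random

def swap_words(tokens, window, rng):
    """SW: swap one word with a nearby word within `window` positions."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    j = min(len(out) - 1, i + rng.randint(1, window))
    out[i], out[j] = out[j], out[i]
    return out

def blank_words(tokens, p, rng, placeholder="<blank>"):
    """BW: replace each word with a placeholder with probability p."""
    return [placeholder if rng.random() < p else t for t in tokens]

def smooth_words(tokens, p, unigram, rng):
    """SmoothW: with probability p, replace a word with a token sampled
    from the unigram frequency distribution `unigram` (token -> weight)."""
    vocab, weights = zip(*unigram.items())
    return [rng.choices(vocab, weights)[0] if rng.random() < p else t
            for t in tokens]
```

During training, such perturbations are typically applied to each sentence with a small replacement probability (0.15 in the experiments reported here).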
For the model parameters, we used the basic parameter settings of the Transformer model. It consists of a 6-layer encoder and a 6-layer decoder, with some exceptions: the model dimension was 512, the feed-forward dimension was 1024, and there were 4 attention heads. The dropout rate was 0.3, and the label smoothing was set to 0.1. We trained the models until convergence based on the validation loss. At the decoding step, we set a beam size of 5 and a length penalty of 1.0. All the above data augmentation methods used the same settings as Zhu's work [26]; we used a probability of 0.15 to replace the word tokens in the training steps. For our approach, unless specified otherwise, we used the same default setup, where K = 3 and R = 1. Table 1 presents the results of the DE-EN translation task. As we can see, the DE-EN baseline based on the Transformer achieved 34.72 BLEU points without data augmentation. Compared with the baseline, our method substantially improved the translation performance, by 1.96 BLEU points. Compared with the other data augmentation methods, our approach obtained the best result. These results verify the effectiveness of our approach. The LMW result is reported in [23]; the others are based on our runs.

Low-Resource Translation Tasks
We also verified the effectiveness of our method on three low-resource translation tasks: English-Turkish (EN-TR), English-Nepali (EN-NE), and English-Sinhala (EN-SI). Both Nepali and Sinhala are very challenging languages to translate because their morphology and syntax differ greatly from those of high-resource languages such as English. Meanwhile, there are not many speakers or parallel corpora, so data resources are particularly scarce. Table 2 presents the statistics of the three low-resource corpora. For the EN-TR experiment, we combined the WMT EN-TR training sets and the IWSLT14 training data; the final corpus contained about 350k sentence pairs. We chose dev2010 and test2010 as the validation sets and used the four test sets from 2011 to 2014 as test sets. We learned the BPE vocabulary jointly on the source and target language sentences, and the vocabulary was built with 32k merge operations. The English-Sinhala training data contained about 400k pairs, while the English-Nepali training data had about 500k pairs. We used the same development and test sets as in Guzmán's work [43]. We used the Indic NLP library [44] to tokenize the Nepali and Sinhala corpora and the SentencePiece toolkit [45] to build the shared vocabulary, whose size was 5000. For the TR-EN translation task, we chose the base Transformer as our model structure. The model parameters were as follows: the encoder and the decoder each had 6 layers, the dimension of the inner feed-forward layer was 2048, the model dimension was 512, and the number of attention heads was 8. The Adam optimizer was used with a learning rate of 0.001, the dropout rate was 0.3, there were 4000 warm-up steps, and the models ran for 50 epochs. For inference, we set the beam size to 5 and the length penalty to 1.0, and averaged the last five checkpoints. We used BLEU scores to measure the performance of the final model.
For the EN-NE and EN-SI translation tasks, we used a Transformer architecture with a 5-layer encoder and a 5-layer decoder, and each layer had two attention heads. The embedding dimension and feed-forward dimension were 512 and 2048, respectively. To regularize our models, we set a dropout rate of 0.4, label smoothing of 0.2, and weight decay of 10^-4. We set the batch size to 16,000 tokens and trained the models for 100 epochs. At the inference step, for the NE-EN and SI-EN tasks, we used a length penalty of 1.2. We used detokenized sacreBLEU [46] for these tasks. Table 3 shows the TR-EN translation results on the different test sets. We can observe that our method boosted the translation quality compared with the baseline, which was trained without data diversification. Averaged over the four test sets, our method achieved a 1.51 BLEU improvement. From Table 4, it can be seen that our method achieved a more than 1.0 BLEU improvement without using any monolingual data. These results indicate that augmenting the training data with diverse pseudo-parallel data is useful for improving translation performance.

Discussion
In this section, we analyze the proposed data augmentation approach from the following perspectives: (1) Is the improvement of our method due to the multiple copies of the original data? (2) Which is more important, the backward model or the forward model? (3) What effect do different sampling settings have on performance?

Copying the Original Data
We copied the initial data seven times and merged the copies to train the model, in order to verify whether the translation performance improves merely due to the increase in data volume; we denote this as 7Baseline. We ran experiments on two language pairs, EN-DE and TR-EN, using the same parameter settings. Table 5 shows the BLEU scores on the two datasets. We found that the model based on copied data consistently decreased performance by 0.1 to 0.6 BLEU points in all translation tasks, whereas our method yielded an improvement of 1.0 to 2.0 BLEU points on the two datasets. These results verify that the increase in performance is not due to data replication and that our approach can produce diverse data. Nguyen's work [15] showed that training translation models with different random seeds yields different model distributions. We trained three forward translation models and three backward translation models with different random seeds, then used those models to translate the training data, generating different synthetic corpora that we combined to train the final model. The parameters of these experiments were the same as in our approach. Table 6 shows the results of the EN-DE and EN-TR translation tasks. From Table 6, we observe that the approach based on random seeds also yielded improvements of 2.01 and 1.89 BLEU points for the EN-DE and EN-TR tasks, respectively, indicating the effectiveness of diversity-based data augmentation. Our approach achieved comparable gains while using fewer translation models. Therefore, decoding with restricted sampling increases the diversity of the training data, and our method can further enhance translation performance.

Backward Data or Forward Data
We performed experiments on the EN-TR translation task, using the diverse data generated by the forward model and the backward model separately. We also compared these models with our bidirectional diversified model and the baseline model without data diversification. The results are shown in Table 7. We can observe that both the backward and forward diversification models are still effective but worse than bidirectional diversification. We also find that diversification with the backward model outperforms diversification with the forward model. These findings strongly support that leveraging both forward and backward diversification is helpful.

The Number of Samplings
We conducted experiments on synthetic data generated with different sampling settings. We used the same translation models to translate source and target sentences with different decoding strategies, including unrestricted sampling from the model distribution and restricting sampling to the 5 or 10 highest-scoring outputs at every time step. Table 8 shows the BLEU scores for the EN-DE and EN-TR translation tasks. From Table 8, we observe that restricted sampling outperformed the unrestricted sampling method. Restricting sampling to the 5 highest-scoring outputs at every time step yielded the best result. This is because restricted sampling is a middle ground between beam search and unrestricted sampling; it is unlikely to choose a low-scoring output but still retains some randomness, which matters especially in low-resource translation tasks. Sample_K denotes the model based on restricting sampling to the K highest-scoring outputs at every time step.

Conclusions
In this paper, we proposed a novel approach that is very effective in improving the performance of low-resource translation tasks. We trained forward and backward models to translate the training data multiple times by restricted sampling, then used the resulting synthetic data together with the original data to train the final model. The experimental results demonstrated that the proposed method was effective in many translation tasks: it outperformed the baselines by 1.0 to 2.0 BLEU points in the IWSLT German-English and English-Turkish translation tasks and in two further low-resource translation tasks. Additional experiments were conducted to analyze why the proposed method is effective. We found that our method can increase the diversity of the training data without extra monolingual data, and that using bidirectional diversification is better than using either direction alone.