5.3. Experiment Results and Discussion
Table 3 presents the details of the ASPEC-JC corpus. We randomly extracted 300k sentence pairs from the 672k sentence pairs in the ASPEC-JC training data and used them as the training data for the experiments.
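For illustration, the following minimal Python sketch shows one way such a random subset can be drawn from a line-aligned parallel corpus; the 300k size comes from the setup above, while the file paths and the one-sentence-per-line format are assumptions.

```python
import random

def split_parallel_corpus(src_path, tgt_path, n_train=300_000, seed=1):
    """Randomly pick n_train aligned pairs as the 'original' training data;
    the remaining pairs can serve as held-out (monolingual) data."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        pairs = list(zip(fs.read().splitlines(), ft.read().splitlines()))
    random.seed(seed)
    train_idx = set(random.sample(range(len(pairs)), n_train))
    train = [pairs[i] for i in sorted(train_idx)]
    rest = [pairs[i] for i in range(len(pairs)) if i not in train_idx]
    return train, rest

# e.g., 300k pairs out of the 672k ASPEC-JC training pairs (paths are placeholders):
# train, rest = split_parallel_corpus("train.ja", "train.zh", n_train=300_000)
```

The same kind of split is reused later for the comparison experiments, where the remaining pairs provide the monolingual data.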
Figure 5 shows the changes in BLEU scores on the same test data over training epochs with the 300k original training data, and Figure 6 shows the corresponding changes in TER scores. Figure 7 shows the changes in validation perplexity on the same development data over training epochs with the 300k original training data. The proposed methods obtained better BLEU and TER scores than the other methods on the test data in both cases.
The best (lowest) perplexities on the development (dev) data after they stopped declining, together with the BLEU and TER scores of the translation results on the same test data and development-test data for each method, are presented in Table 4 and Table 5.
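For reference, corpus-level BLEU and TER can be computed with the sacreBLEU toolkit as in the sketch below; this is only an illustration, and the exact evaluation configuration used in these experiments (e.g., the tokenization option for character-level output) is not restated here, so the settings shown are assumptions.

```python
# Illustrative BLEU/TER scoring with sacreBLEU (assumed to be installed).
from sacrebleu.metrics import BLEU, TER

def score_outputs(hypotheses, references):
    """hypotheses and references are parallel lists of detokenized sentence strings."""
    bleu = BLEU(tokenize="zh")   # tokenizer choice is an assumption (e.g., "zh" for Chinese output)
    ter = TER()
    return (bleu.corpus_score(hypotheses, [references]).score,
            ter.corpus_score(hypotheses, [references]).score)

# bleu_score, ter_score = score_outputs(model_outputs, reference_translations)
```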
“Baseline 1” was a character-level translation with the 300k original training data. The back-translation models for corpus augmentation were constructed using the 300k original training data of “Baseline 1”.
“Proposed 1” was the proposed data augmentation method without consideration of the common Chinese character rates or reuse of the undivided sentences (Section 4.4), which can be used for all language pairs. This method expanded the parallel corpus from the original 300k sentence pairs to 952k sentence pairs in both directions (Japanese→Chinese and Chinese→Japanese). In each direction, 218k sentence pairs from the 300k training data were used for back-translation.
“Proposed 2” was the proposed method that considered the common Chinese character rates and reused undivided sentences, and can be used only for the Chinese-Japanese language pair. This method expanded the parallel corpus from 300k sentence pairs to 977k and 972k sentence pairs in the Japanese→Chinese and Chinese→Japanese directions, respectively. In these directions, 227k and 218k sentence pairs from the 300k original training data, respectively, were used for back-translation.
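The common Chinese character rate itself is defined in Section 4 and is not restated here; purely as a hypothetical illustration, one simple character-overlap ratio between a Chinese sentence and a Japanese sentence could be computed as follows (the actual criterion may additionally handle variant character forms, which this sketch ignores).

```python
def common_character_rate(zh_sentence, ja_sentence):
    """Hypothetical overlap ratio: the fraction of CJK ideographs in the Chinese
    sentence that also appear in the Japanese sentence. Illustrative only; it does
    not map between simplified Chinese and Japanese variant character forms."""
    ja_chars = set(ja_sentence)
    zh_ideographs = [c for c in zh_sentence if "\u4e00" <= c <= "\u9fff"]
    if not zh_ideographs:
        return 0.0
    return sum(c in ja_chars for c in zh_ideographs) / len(zh_ideographs)
```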
“Baseline 2 for P1” was the back-translation method that back-translated the same data as the Proposed 1 (P1) method did (218k pairs from the original training data). This experiment aimed to compare Proposed 1 with the back-translation method (Baseline 2) on the same back-translated data. “Baseline 2 for P2” was the back-translation method that back-translated the same data as the Proposed 2 (P2) method did (231k pairs from the original training data). This experiment aimed to compare Proposed 2 with the back-translation method (Baseline 2) on the same back-translated data.
“Copied” was the method that added unchanged duplicate copies of the training data the same number of times as the Proposed 2 method did. This experiment aimed to highlight the differences between the generated pseudo-parallel sentence pairs and unchanged sentence pairs. This method expanded the parallel corpus from 300k sentence pairs to 977k and 972k sentence pairs in the respective directions.
“Partial” was the method that augmented the corpus with the parallel partial sentences generated by the procedure in Section 4.3, without back-translating and mixing the partial sentences. This experiment aimed to confirm that the mixing step (Section 4.3, Step 2) was necessary. This method expanded the parallel corpus from 300k sentence pairs to 984k sentence pairs in both directions.
“# sentences” in the tables denotes the size (the number of sentence pairs) of the training data, whereas “# back-translated” denotes the number of parallel sentence pairs used for back-translation processing (i.e., the corpus augmentation) in each method. “ppl” denotes the best (lowest) perplexity value on the development (dev) data for each method.
In the case of the 300k training data, the number of parallel sentence pairs augmented by Proposed 2 was 977k in the Japanese→Chinese direction, but only 945k pairs were used as the training data. This is because translation errors in the back-translation steps sometimes generated unusually long pseudo-source sentences in which the same words or phrases were repeated, and such sentence pairs were removed for exceeding the upper limit on training sentence length (500 characters). As a result, the training data actually used by Proposed 2 was 32k pairs (3.3%) smaller than that of “Copied” (977k). For this reason, the training sentence numbers of “Copied” and “Proposed 2” in Table 4 and Table 5 differ. Hence, we added the columns “Raw” and “Used” to the tables to denote the numbers of generated sentences (raw data before filtering) and used sentences (after removing unusually long sentences), respectively.
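A minimal sketch of this length filter is given below: augmented pairs whose pseudo-source (or target) side exceeds the 500-character limit are dropped before training. The 500-character limit comes from the setup described here; whether both sides are checked is an assumption.

```python
MAX_CHARS = 500  # upper limit on training sentence length, in characters

def filter_long_pairs(pairs, max_chars=MAX_CHARS):
    """Drop pseudo-parallel pairs in which either side exceeds max_chars characters
    (e.g., pseudo-sources with runaway repetitions produced by back-translation)."""
    kept = [(src, tgt) for src, tgt in pairs
            if len(src) <= max_chars and len(tgt) <= max_chars]
    return kept, len(pairs) - len(kept)

# kept_pairs, n_removed = filter_long_pairs(augmented_pairs)
```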
The proposed methods obtained better BLEU and TER scores than the baselines on the development-test data and test data in both cases. Although there were translation errors and unnatural expressions in the generated pseudo-source sentences, the BLEU scores were higher, and the TER scores lower, than those of “Copied” and “Partial” on the development-test data and test data in both directions, Japanese→Chinese and Chinese→Japanese. The BLEU scores of the “Partial” method were lower than those of the proposed methods in both directions; therefore, the mixing step (Section 4.3, Step 2) was necessary. These results demonstrate that the proposed methods are effective for augmenting small-scale parallel corpora to improve translation performance for Japanese→Chinese and Chinese→Japanese NMT.
Comparing “Proposed 1” and “Proposed 2” with “Baseline 2 for P1” and “Baseline 2 for P2” in the tables, the proposed methods achieved comparable or better results, indicating that the proposed method was effective at improving translation accuracy in both directions, Japanese→Chinese and Chinese→Japanese.
The experiments described above (Table 4 and Table 5) proved the effectiveness of the proposed methods. Note, however, that our approach relied only on the original parallel data and did not require any additional monolingual data, unlike the back-translation method of Sennrich et al. [11]. Most corpus augmentation methods pair monolingual training data with automatic back-translations and then treat them as additional parallel training data. Therefore, we conducted additional comparison experiments.
We conducted comparison experiments using 150k and 300k sentence pairs randomly extracted from the 672k training data of ASPEC-JC as the original data, and used the remaining 522k and 372k sentences, respectively, as monolingual data.
For the comparison experiments, we implemented only our “Proposed 2” method, because the experiments described above showed that “Proposed 2” performed better than “Proposed 1” in most cases on the Chinese-Japanese parallel corpus. Translation results obtained on the test data and development-test data with the 150k and 300k original training data in both directions are shown in Table 6, Table 7, Table 8 and Table 9, respectively.
For Table 6 and Table 7, “Baseline 1” is a character-level translation trained on the 150k original training data without any augmentation. The back-translation models used for corpus augmentation in each method were constructed using the 150k original training data of “Baseline 1”. “Baseline 2 + mono (522k)” was the back-translation method of Sennrich et al. [11], which back-translated the remaining 522k target-language sentences of the 672k training data to generate 522k pseudo-source sentences directly, with no segmentation; the 522k pseudo-source sentences, together with the 522k target sentences, expanded the parallel corpus from 150k sentence pairs to 672k sentence pairs. This experiment aimed to confirm the effectiveness of applying our proposed methods to the data augmented by the back-translation method of Sennrich et al. [11].
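As a reference point, the general back-translation recipe underlying “Baseline 2 + mono (522k)” can be sketched as follows: a target→source model produces pseudo-source sentences for the monolingual target data, and the resulting pseudo pairs are appended to the original corpus. The translate_tgt_to_src callable is a placeholder for whatever backward NMT model is used, not a specific API.

```python
def back_translate_augment(orig_pairs, mono_tgt_sentences, translate_tgt_to_src):
    """Pair monolingual target sentences with machine back-translated pseudo-source
    sentences and append them to the original parallel data."""
    pseudo_pairs = [(translate_tgt_to_src(tgt), tgt) for tgt in mono_tgt_sentences]
    return orig_pairs + pseudo_pairs

# e.g., 150k original pairs + 522k monolingual target sentences -> 672k training pairs
# augmented = back_translate_augment(train_150k, mono_522k, backward_model.translate)
```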
“150k + mono (522k) + P2” represents the combination of “Baseline 2 + mono (522k)” (672k training data) and “Proposed 2”. “150k + mono (522k) + P2” back-translated 525k and 515k sentence pairs from the “Baseline 2 + mono (522k)” training data (672k pairs), so that the number of sentence pairs increased from 672k to 2313k and 2239k in the two directions, respectively.
For Table 8 and Table 9, “Baseline 1” is a character-level translation trained on the 300k original training data without any augmentation. The back-translation models used for corpus augmentation in each method were constructed using the 300k original training data of “Baseline 1”. “Baseline 2 + mono (372k)” was the back-translation method of Sennrich et al. [11], which back-translated the remaining 372k target-language sentences of the 672k training data to generate 372k pseudo-source sentences directly, with no segmentation; the 372k pseudo-source sentences, together with the 372k target sentences, expanded the parallel corpus from 300k sentence pairs to 672k sentence pairs. This experiment aimed to confirm the effectiveness of applying our proposed methods to the data augmented by the back-translation method of Sennrich et al. [11].
“300k + mono (372k) + P2” represents the combination of “Baseline 2 + mono (372k)” (672k training data) and “Proposed 2”. “300k + mono (372k) + P2” back-translated 522k and 501k sentence pairs from the “Baseline 2 + mono (372k)” training data (672k pairs), so that the number of sentence pairs increased from 672k to 2287k and 2223k in the two directions, respectively.
The proposed methods obtained comparable or better BLEU and TER scores than the baseline methods on the development-test data and test data. These comparison experiments demonstrate that our proposed method can further augment data already extended by other corpus augmentation methods to yield better translation performance. In the future, we plan to combine the proposed methods with other augmentation approaches, as our results suggest that this may be more beneficial than back-translation alone.
The salient benefits of the proposed method are that it requires no monolingual data and that it can generate more pseudo-parallel sentences without changing the neural network architecture. Moreover, it can be combined with other augmentation methods.