Open Access Article

Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

1 Electronics and Information Systems Engineering Division, Graduate School of Engineering, Gifu University, Gifu 501-1193, Japan
2 Department of Electrical, Electronic and Computer Engineering, Gifu University, Gifu 501-1193, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(10), 2036; https://doi.org/10.3390/app9102036
Received: 2 April 2019 / Revised: 6 May 2019 / Accepted: 9 May 2019 / Published: 17 May 2019
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the size of the training data. However, sufficient parallel data are not available for many language pairs. This paper presents a corpus augmentation method with two variants: one applicable to any language pair, and one specific to the Chinese-Japanese language pair. The method uses both the source and target sentences of an existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks, as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply code (see Supplementary Materials) that reproduces the proposed method.
Keywords: back translation; Chinese-Japanese translation; corpus augmentation; decoder; encoder; Japanese-Chinese translation; LSTM; neural machine translation; sentence segmentation
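
The three steps listed in the abstract can be illustrated with a minimal Python sketch. This is not the authors' supplementary code. It assumes the sentence pair has already been split into aligned partial sentences (the alignment-guided choice of split points is not implemented here), that each pseudo-source sentence is paired with the unchanged target sentence, and that segments can be rejoined by plain concatenation, as in unsegmented Chinese or Japanese text. The names `augment` and `back_translate` are hypothetical.

```python
from typing import Callable, List, Tuple


def augment(
    src_parts: List[str],
    tgt_parts: List[str],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Generate pseudo-parallel pairs from one pre-split sentence pair.

    src_parts / tgt_parts: aligned partial sentences (step 1 is assumed
    to have been done already, e.g. at punctuation marks).
    back_translate: hypothetical target-to-source translation function.
    """
    if len(src_parts) != len(tgt_parts):
        raise ValueError("source and target must have the same number of parts")

    target = "".join(tgt_parts)  # the target sentence is left unchanged (assumption)
    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Step 2: back-translate the i-th target partial sentence.
        pseudo_part = back_translate(tgt_part)
        # Step 3: replace only the i-th source partial sentence with it.
        pseudo_source = src_parts[:i] + [pseudo_part] + src_parts[i + 1:]
        pairs.append(("".join(pseudo_source), target))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a real back-translation model: it only marks the segment.
    toy_bt = lambda segment: "[" + segment + "]"
    for pseudo_src, tgt in augment(["A,", "B,", "C."], ["a,", "b,", "c."], toy_bt):
        print(pseudo_src, "|||", tgt)
```

Under these assumptions, a sentence pair split into n partial sentences yields n pseudo-parallel pairs, one per replaced segment. In a real setting, `back_translate` would wrap a trained target-to-source NMT model rather than the toy function used here.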
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Zhang, J.; Matsumoto, T. Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora. Appl. Sci. 2019, 9, 2036.

Appl. Sci. EISSN 2076-3417. Published by MDPI AG, Basel, Switzerland.