Open Access Article

A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation

by Yu Li 1,2,3, Xiao Li 1,2,3,*, Yating Yang 1,2,3 and Rui Dong 1,2,3

1 Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
* Author to whom correspondence should be addressed.
Information 2020, 11(5), 255; https://doi.org/10.3390/info11050255
Received: 9 March 2020 / Revised: 27 April 2020 / Accepted: 4 May 2020 / Published: 6 May 2020
(This article belongs to the Special Issue Advances in Computational Linguistics)
One important issue that affects the performance of neural machine translation is the scale of available parallel data. For low-resource languages, the amount of parallel data is insufficient, which results in poor translation quality. In this paper, we propose a diverse data augmentation method that does not use extra monolingual data. We expand the training data by generating diverse pseudo-parallel data on both the source and target sides. To generate diverse data, a restricted sampling strategy is employed at each decoding step. Finally, we filter and merge the original data and the synthetic parallel corpus to train the final model. In our experiments, the proposed approach achieved an improvement of 1.96 BLEU points on the IWSLT2014 German–English translation task, which was used to simulate a low-resource setting. Our approach also consistently and substantially obtained improvements of 1.0 to 2.0 BLEU on three other low-resource translation tasks: English–Turkish, Nepali–English, and Sinhala–English.
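The restricted sampling the abstract describes can be sketched as top-k sampling at each decoding step: instead of always taking the highest-scoring token (which yields one fixed translation), the model samples from a small set of high-probability candidates, producing varied pseudo-parallel sentences. The exact restriction used in the paper is not given here, so this is a minimal illustrative sketch; `restricted_sample` is a hypothetical helper, not the authors' implementation:

```python
import math
import random

def restricted_sample(logits, k=5, temperature=1.0, rng=None):
    """Sample a token index from only the k highest-scoring candidates.

    logits: unnormalized scores over the vocabulary for one decoding step.
    Restricting to the top-k keeps outputs fluent while still adding
    the diversity that plain greedy decoding lacks.
    """
    rng = rng or random.Random(0)
    # Keep only the k most likely token indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the restricted set (numerically stabilized).
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Draw one index according to the restricted distribution.
    r = rng.random()
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r < acc:
            return idx
    return top[-1]

# At each decoding step, the translator would call restricted_sample on the
# model's logits instead of argmax, yielding a different target sentence on
# each pass over the source corpus.
token = restricted_sample([0.1, 3.0, 2.5, -1.0], k=2)
```

Because only the top-k candidates can be drawn, the sampled token above is always index 1 or 2 (the two highest logits), never a low-probability token that could derail the translation.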
Keywords: neural machine translation; back-translation; data augmentation; low-resource language
MDPI and ACS Style

Li, Y.; Li, X.; Yang, Y.; Dong, R. A Diverse Data Augmentation Strategy for Low-Resource Neural Machine Translation. Information 2020, 11, 255.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
