Open Access Article

Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach

1 CrossLang NV, 9050 Gentbrugge, Belgium
2 Independent Data Science Consultant, 9000 Ghent, Belgium
* Author to whom correspondence should be addressed.
Informatics 2019, 6(3), 35; https://doi.org/10.3390/informatics6030035
Received: 30 April 2019 / Revised: 14 August 2019 / Accepted: 29 August 2019 / Published: 1 September 2019
(This article belongs to the Special Issue Advances in Computer-Aided Translation Technology)
To build state-of-the-art Neural Machine Translation (NMT) systems, high-quality parallel sentences are needed. Typically, large amounts of data are scraped from multilingual websites and aligned into datasets for training. Many tools exist for the automatic alignment of such datasets. However, the quality of the resulting aligned corpus can be disappointing. In this paper, we present a tool for automatic misalignment detection (MAD). We treated the task of determining whether a pair of aligned sentences constitutes a genuine translation as a supervised regression problem. We trained our algorithm on a manually labeled dataset in the FR–NL language pair. Our algorithm used shallow features and features obtained after an initial translation step. We showed that the Levenshtein distance between the target and the translated source and the cosine distance between sentence embeddings of the source and the target were the two most important features for misalignment detection. Using gold standards for alignment, we demonstrated that our model can substantially increase the quality of alignments in a corpus, reaching a precision close to 100%. Finally, we used our tool to investigate the effect of misalignments on NMT performance.
Keywords: data-curation; web crawling; neural machine translation
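
The two features highlighted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the length normalization of the Levenshtein distance, and the idea that sentence embeddings arrive as plain lists of floats are all assumptions made for illustration. The machine translation step and the embedding model are assumed to have run beforehand, so the sketch only computes the two distances that would then be passed, together with other features, to a regression model.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_distance(u, v) -> float:
    """1 - cosine similarity; returns 1.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def pair_features(translated_source, target, source_embedding, target_embedding):
    """Hypothetical feature dictionary for one aligned sentence pair."""
    return {
        "levenshtein": normalized_levenshtein(translated_source, target),
        "cosine_distance": cosine_distance(source_embedding, target_embedding),
    }
```

In this sketch, low values for both features suggest a genuine translation pair and high values suggest a misalignment. How the scores are combined and thresholded is left to the trained regression model and is not shown here.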
MDPI and ACS Style

Defauw, A.; Szoc, S.; Bardadym, A.; Brabers, J.; Everaert, F.; Mijic, R.; Scholte, K.; Vanallemeersch, T.; Van Winckel, K.; Van den Bogaert, J. Misalignment Detection for Web-Scraped Corpora: A Supervised Regression Approach. Informatics 2019, 6, 35.

