1. Introduction
In recent years, machine translation (MT) researchers have proposed approaches to counter the data sparsity problem and to improve the performance of neural MT (NMT) systems in low-resource scenarios, e.g., augmenting training data from source and/or target monolingual corpora [1,2], unsupervised learning strategies in the absence of labelled data [3,4], exploiting training data involving other languages [5,6], multi-task learning [7], the careful selection of hyperparameters [8], and fine-tuning pre-trained language models [9]. Despite some success, none of the existing approaches can be viewed as an overall solution as far as MT for low-resource language pairs is concerned. For example, the back-translation strategy of Sennrich et al. [1] is less effective in low-resource settings, where it is hard to train a good back-translation model [10]; unsupervised MT does not work well for distant languages [11] due to the difficulty of training unsupervised cross-lingual word embeddings for such languages [12], and the same applies to transfer learning [13]. Hence, this line of research needs more attention from the MT research community. In this context, we refer interested readers to papers [14,15] that compared phrase-based statistical machine translation (PB-SMT) and NMT on a variety of use-cases. As for low-resource scenarios, many studies (e.g., Koehn and Knowles [16], Östling and Tiedemann [17], Dowling et al. [18]) found that PB-SMT can provide better translations than NMT, while many others found the opposite [8,19,20]. Hence, this line of MT research has yielded a mixed bag of results, leaving the way ahead unclear.
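To make the back-translation strategy discussed above concrete, the following minimal Python sketch illustrates the data augmentation step of Sennrich et al. [1]; the backward (target-to-source) model is passed in as a hypothetical object with a translate() method, standing in for whatever NMT toolkit is actually used:

# A minimal sketch of back-translation-based data augmentation; the
# `backward_model` argument is a hypothetical wrapper around a pre-trained
# target-to-source MT model, not a real library API.
def augment_with_back_translation(parallel_data, target_monolingual, backward_model):
    # Pair each monolingual target sentence with a synthetic source sentence.
    synthetic = [(backward_model.translate(tgt), tgt) for tgt in target_monolingual]
    # In low-resource settings the backward model itself is weak, so the
    # synthetic source side is noisy, which is one reason back-translation
    # can underperform in such scenarios [10].
    return parallel_data + synthetic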
To this end, we investigated the performance of PB-SMT and NMT systems on two rarely tested under-resourced language pairs, English-to-Tamil and Hindi-to-Tamil, taking a specialised data domain (software localisation) into account [21]. We also produced rankings of the MT systems (PB-SMT, NMT, and a commercial MT system, Google Translate (GT) (https://translate.google.com/, accessed on 5 March 2020)) on English-to-Tamil via a social media platform-based human evaluation scheme and present our findings on this low-resource, domain-specific text translation task [22].
The remainder of the paper is organized as follows. In Section 2, we discuss related work. Section 3 explains the experimental setup, including descriptions of our MT systems and details of the datasets used. Section 4 presents the results with discussion and analysis, while Section 5 concludes our work with avenues for future work.
2. Related Work
The advent of NMT in MT research has led researchers to investigate how NMT is better (or worse) than PB-SMT. This section presents some of the papers that compared PB-SMT and NMT on a variety of use-cases. Although the primary objective of this work was to study the translations of MT systems (PB-SMT and NMT) under under-resourced conditions, we also provide a brief overview of papers that compared PB-SMT and NMT in high-resource settings.
Junczys-Dowmunt et al. [23] compared PB-SMT and NMT on a range of translation pairs and showed that, for all translation directions, NMT is either on par with or surpasses PB-SMT. Bentivogli et al. [14] analysed the output of MT systems in an English-to-German translation task by considering different linguistic categories. Toral and Sánchez-Cartagena [24] conducted an evaluation comparing NMT and PB-SMT outputs across broader aspects (e.g., fluency, reordering) for nine language directions. Castilho et al. [15] conducted an extensive qualitative and quantitative comparative evaluation of PB-SMT and NMT using automatic metrics and professional translators. Popović [25] carried out an extensive comparison of NMT and PB-SMT with respect to language-related issues for the German–English language pair in both translation directions. These studies [14,15,24,25] showed that NMT provides better translation quality than the previous state-of-the-art PB-SMT. This trend continued in other studies and use-cases: the translation of literary text [26], MT post-editing setups [27], industrial setups [28], the translation of patent documents [29,30], less-explored language pairs [31,32], highly investigated “easy” translation pairs [33], and the translation of catalogues of technical tools [34]. The opposite picture has also been observed for the translation of text pertaining to a specific domain: Nunez et al. [35] showed that PB-SMT outperforms NMT when translating user-generated content.
MT researchers have also tested and compared PB-SMT and NMT in resource-poor settings. Koehn and Knowles [16], Östling and Tiedemann [17], and Dowling et al. [18] found that PB-SMT can provide better translations than NMT in low-resource scenarios. In contrast to these findings, however, many studies have demonstrated that NMT is better than PB-SMT in low-resource situations [8,19]. This work investigated the translation of software localisation text for two low-resource translation pairs, Hindi-to-Tamil and English-to-Tamil, taking two MT paradigms, PB-SMT and NMT, into account.
5. Conclusions
In this paper, we investigated NMT and PB-SMT in resource-poor scenarios, choosing a specialised data domain (software localisation) for translation and two rarely tested, morphologically divergent language pairs, Hindi-to-Tamil and English-to-Tamil. We studied translations in two setups, with training data compiled from (i) a freely available variety of data domains (e.g., political news, Wikipedia) and (ii) exclusively software localisation data. In addition to an automatic evaluation, we carried out a manual error analysis on the translations produced by our MT systems. Furthermore, we randomly selected one hundred sentences from the test set and ranked our MT systems via a social media platform-based human evaluation scheme, also including a commercial MT system, Google Translate, in this ranking task.
Using in-domain data only for training had a positive impact on translation from a less inflected language into a highly inflected one, i.e., English-to-Tamil; however, it did not have such an impact on Hindi-to-Tamil translation. We conjectured that the morphological complexity of the source and target languages involved (Hindi and Tamil) could be one of the reasons why the MT systems performed relatively poorly even when they were trained exclusively on specialised domain data.
We looked at the translations produced by our MT systems and found that, in many cases, the BLEU scores underestimated the translation quality, mainly due to the relatively free word order of Tamil. In this context, Shterionov et al. [61] computed the degree to which the three most widely used automatic MT evaluation metrics, BLEU, METEOR [62], and TER [63], underestimate translation quality, showing that for NMT this may be up to 50%. Way [64] reminded the MT community how important human (subjective) evaluation is in MT and that it cannot easily be replaced in MT evaluation. We refer interested readers to Way [51], who also drew attention to this phenomenon.
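As an illustration of how such metric behaviour can be inspected, the following sketch uses the sacrebleu library (an assumption for illustration; the paper does not imply a particular implementation) to score a hypothesis whose word order differs from the reference, the situation in which we found BLEU to underestimate quality:

# A minimal sketch using sacrebleu (assumed here for illustration) to compare
# metric behaviour on a reordered but adequate hypothesis. METEOR is not part
# of sacrebleu and would require a separate package.
from sacrebleu.metrics import BLEU, CHRF, TER

references = [["yesterday the file was saved successfully"]]
hypotheses = ["the file was saved successfully yesterday"]  # free word order

for metric in (BLEU(), CHRF(), TER()):
    # corpus_score takes a list of hypotheses and a list of reference streams.
    print(metric.corpus_score(hypotheses, references))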
Our error analysis of the translations produced by the English-to-Tamil and Hindi-to-Tamil MT systems revealed many strengths and weaknesses of the two paradigms, PB-SMT and NMT: (i) NMT made many mistakes when translating domain terms and performed poorly when translating OOV terms; (ii) NMT often made incorrect lexical selections for polysemous words and omitted words and domain terms in translation, while occasionally committing reordering errors; and (iii) translations produced by the NMT systems occasionally contained repetitions of other translated words, strange translations, and one or more unexpected words that had no connection with the source sentence. We observed that whenever the NMT system encountered a source sentence containing OOVs, it tended to produce one or more unexpected words or repetitions of other translated words. Unlike NMT, the PB-SMT systems usually did not make such mistakes, i.e., repetitions or strange, spurious, or unexpected words in translation.
We observed that BPE-based segmentation could completely change the underlying semantic agreement between the source and target sentences for languages with greater morphological complexity. This could be one of the reasons why the Hindi-to-Tamil NMT system trained on sub-word-level data produced poorer translations than the one trained on word-level data.
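The following sketch, assuming the sentencepiece implementation of BPE (the toolkit actually used is not implied), shows how such a model is trained and how its segmentation of an inflected word form depends entirely on the training data it has seen:

# A minimal sketch, assuming sentencepiece as the BPE implementation; the
# file names and vocabulary size are illustrative placeholders.
import sentencepiece as spm

# Train a BPE model on the word-level training corpus (e.g., the Tamil side).
spm.SentencePieceTrainer.train(
    input="train.ta", model_prefix="bpe_ta", vocab_size=8000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="bpe_ta.model")
# A highly inflected word form may be split into pieces that cut across
# morpheme boundaries, obscuring source-target agreement cues.
print(sp.encode("an example sentence", out_type=str))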
From our human ranking task, we found that sentence length could be a crucial factor in the performance of NMT systems in low-resource scenarios: NMT turned out to be the best-performing system for very short sentences (number of words ≤ 3). This finding did not agree with the findings of our automatic evaluation, where PB-SMT was found to be the best performing, while GT and NMT were comparable. This finding could be of interest to translation service providers who use MT in production for low-resource languages, as they could select MT models based on the length of the source sentences to be translated, as in the sketch below.
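A minimal sketch of such length-based routing follows; nmt, pbsmt, and gt are hypothetical callables wrapping the three systems compared in our ranking task:

# A minimal sketch of length-based MT system selection; the three translation
# functions are hypothetical, and the 3-word threshold comes from our human
# ranking results.
def translate_by_length(source, nmt, pbsmt, gt, mixed_setup=True):
    if len(source.split()) <= 3:
        return nmt(source)  # NMT was ranked best on very short sentences
    # For longer sentences, the best-ranked system depended on the setup
    # (GT in the MIXED setup, PB-SMT in the IT setup; see below).
    return gt(source) if mixed_setup else pbsmt(source)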
For sentences of other lengths (number of words > 3), GT was the winner, followed by PB-SMT and NMT, in the MIXED setup, while PB-SMT was the winner, followed by NMT and GT, in the IT setup. Overall, the human evaluators ranked GT first, PB-SMT second, and NMT third in the MIXED setup; in the IT setup, PB-SMT was ranked first, NMT second, and GT third. Although manual evaluation is expensive, in the future, we want to conduct a ranking evaluation with five MT systems, i.e., the NMT and PB-SMT systems from both the MIXED and IT setups plus GT.
We believe that the findings of this work make significant contributions to this line of MT research. In the future, we intend to consider more languages from different language families. We also plan to judge errors in translations using the multidimensional quality metrics error annotation framework [65], a widely used standard for translation quality assessment in the translation industry and in MT research. MT evaluation metrics such as chrF, which operates at the character level, and COMET [66], which achieved state-of-the-art performance on the WMT2019 Metrics Shared Task [67], obtain high levels of correlation with human judgements; we intend to consider these metrics in our future investigations (see the sketch at the end of this section). As in Exel et al. [58], who examined terminology translation in NMT in an industrial setup using the terminology integration approaches presented in Dinu et al. [56], we intend to investigate terminology translation in NMT using the MT models of Dinu et al. [56] on English-to-Tamil and Hindi-to-Tamil. We also aim to carry out experiments with different configurations of BPE and NMT architectures, including an ablation study, to better understand the effects of the various components and settings, and to see whether our PB-SMT system can be improved using monolingual training data. We further aim to investigate building BPE-based SMT models and word-based NMT models, so that word-based NMT can be compared with BPE-based NMT. Since BPE model training depends on the training data, we aim to examine how effective it would be to train the BPE models on additional monolingual data. As for building the NMT systems, we plan to perform a two-stage training process in which we first train a model on the MIXED data and then fine-tune it on the IT data.
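As an illustration of the chrF/COMET evaluation mentioned above, the following sketch scores a system output with both metrics, assuming the sacrebleu and unbabel-comet packages; the COMET checkpoint name and example strings are illustrative rather than a commitment to a particular model or dataset:

# A minimal sketch of the planned chrF/COMET evaluation; packages, checkpoint
# name, and example strings are assumptions for illustration only.
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

sources = ["save the file"]
hypotheses = ["save the document"]
references = ["save the file"]

# chrF operates at the character level.
print(CHRF().corpus_score(hypotheses, [references]))

# COMET scores each (source, hypothesis, reference) triple with a trained model.
model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print(model.predict(data, batch_size=8, gpus=0).system_score)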