Improving English-to-Indian Language Neural Machine Translation Systems
2. Literature Review
2.1. Neural Machine Translation
2.3. Alternative Low-Resource Solutions
2.4. Related Work
- Hindi: Hindi belongs to the Indo-Aryan language family and, like many Indian languages, is a descendant of Sanskrit. Like Sanskrit, Hindi uses the Devanagari script, although the script offers only limited phonetic coverage of certain sounds. The word order of short sentences in Hindi is flexible; in longer sentences, the Subject–Object–Verb structure is preferred.
- Bengali: Bengali is also a descendant of Sanskrit and belongs to the Indo-Aryan language family, but it uses its own script, which is more phonetically suitable. It is not inflected for gender, follows largely the same grammatical rules as Hindi, and is morphologically rich. It is the most widely spoken Indian language after Hindi.
3. Research Questions
- RQ-1: How effective is back-translation in improving the baseline systems built from Samanantar, the largest known Indian-language parallel corpus, especially for Hindi and Bengali?
- RQ-2: Is the actual translation quality reflected similarly in both automatic and manual evaluations?
4. System Description
4.1. Corpora Used
4.2. Corpus Pre-Processing
- Filtering long sentences: Extremely long sentences were deleted because MT systems generally produce low-quality translations for them. If either side of a sentence pair contains too many words (100 words is set as the default limit), the pair is discarded.
- Removing blank lines: Sentence pairs with no content on either side are removed.
- Removing sentence pairs with odd ratio: Sentence pairs whose translation was disproportionately longer or shorter than the source sentence were removed, because such pairs are likely to be incorrect translations. The filtering ratio was 1:3 in our case.
- Removing duplicates: All duplicate sentence pairs were discarded.
- Tokenisation: We broke the sentences down into their most basic elements, called tokens. Tokenisation is particularly relevant because tokens are the form in which transformer models ingest sentences. In practice, most NMT models are fed sub-words as tokens.
- BPE: Both of the Indic languages used in this study are derivatives of Sanskrit, which makes them morphologically rich. This implies that most OOV words share morphemes with words already in our vocabulary. With this in mind, the BPE technique was leveraged to mitigate the OOV problem by helping the model infer the meaning of unseen words through sub-word similarity. The BPE algorithm performs sub-word segmentation by building a vocabulary from corpus statistics: starting from individual characters, it greedily merges the most frequent adjacent symbol pairs to obtain new text segments.
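The merge-learning step of BPE can be sketched in a few lines of Python. This is a minimal illustration in the spirit of Sennrich et al.'s subword-nmt; the function names are ours, and real implementations add vocabulary-frequency thresholds and efficient incremental pair updates:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    # Count adjacent symbol pairs across the vocabulary,
    # weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with its merged symbol.
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq
            for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    # Words start out as space-separated character sequences;
    # each iteration greedily merges the most frequent pair.
    vocab = {' '.join(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab
```

On a toy corpus such as `{'low': 5, 'lower': 2, 'lowest': 3}`, the first two merges learned are `('l', 'o')` and `('lo', 'w')`, after which `low` is a single sub-word unit.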
5.1. Building Baseline Models
5.2. Building Back-Translation Models
5.3. Parameter Settings for MT Models
- Minibatch size = 128;
- Hidden state size = 1000;
- Source and target vocabulary size = 32 K;
- Dropout probability = 0.2;
- Learning rate used for both forward and backward models = 0.2;
- Decay rate = 0.9999;
- Beam search width = 12;
- Save checkpoint steps = 10,000;
- Minimum train steps = 100,000.
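As an illustration, these settings correspond roughly to the following OpenNMT-py-style YAML fragment. This is a sketch, not the authors' actual configuration file: option names vary across OpenNMT versions, and beam width is a decoding-time option rather than a training one:

```yaml
# Illustrative OpenNMT-py-style training configuration (sketch only).
src_vocab_size: 32000
tgt_vocab_size: 32000
hidden_size: 1000            # hidden state size
batch_size: 128              # minibatch size
dropout: [0.2]
learning_rate: 0.2           # used for both forward and backward models
learning_rate_decay: 0.9999
save_checkpoint_steps: 10000
train_steps: 100000
# decoding-time setting (passed to the translate command):
# beam_size: 12
```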
5.4. Evaluation Metrics
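Translation quality is scored automatically with BLEU (the reference list also includes METEOR and TER). As a rough illustration of what BLEU measures, here is a minimal, unsmoothed single-sentence variant written from scratch; real evaluations should use a standard implementation such as sacreBLEU rather than this sketch:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(hypothesis, reference, max_n=4):
    # Geometric mean of modified n-gram precisions (n = 1..max_n),
    # multiplied by a brevity penalty for short hypotheses.
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to its reference scores 1.0; a hypothesis sharing no unigrams with the reference scores 0.0.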
5.5. Experimental Architecture
- Stage 1: This generates synthetic parallel data by back-translating monolingual Indian-language (IL) text into English (EN), and
- Stage 2: This builds the English-to-Indian-language MT systems using the existing parallel data together with the generated synthetic parallel data.
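The two stages above can be sketched as follows. The function and variable names are illustrative, and the backward (IL→EN) model is treated as a black-box callable:

```python
def back_translate(il_sentences, il_to_en_model):
    # Stage 1: translate monolingual Indian-language sentences into
    # English with the backward model, yielding synthetic (EN, IL) pairs.
    return [(il_to_en_model(il), il) for il in il_sentences]

def augmented_training_data(parallel_pairs, il_sentences, il_to_en_model):
    # Stage 2: the forward EN->IL system is then trained on the
    # authentic parallel data concatenated with the synthetic pairs.
    return list(parallel_pairs) + back_translate(il_sentences, il_to_en_model)
```

Note that only the source (English) side of the synthetic pairs is machine-generated; the target side remains authentic Indian-language text, which is what makes back-translation effective.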
6.1. Automatic Evaluation
6.2. Manual Evaluation
- Adequacy: This measures how much of the information in the reference is retained in the translation output, regardless of grammatical correctness.
- Fluency: This measures how fluent the output is, that is, how grammatically correct it is, regardless of adequacy.
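Given per-segment scores on this 1–5 scale, the averages reported for each system can be computed straightforwardly (an illustrative helper, not the authors' code):

```python
from statistics import mean

def average_adequacy_fluency(ratings):
    # ratings: per-segment (adequacy, fluency) pairs on the 1-5 scale
    # assigned by a human evaluator.
    adequacy = mean(a for a, _ in ratings)
    fluency = mean(f for _, f in ratings)
    return round(adequacy, 2), round(fluency, 2)
```

For example, segment scores of (4, 5), (5, 5), (2, 2), and (3, 5) yield an average adequacy of 3.5 and an average fluency of 4.25.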
7. Output Analysis
8. Conclusions and Future Work
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
- Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Berlin, Germany, 2016; pp. 86–96.
- Poncelas, A.; Shterionov, D.; Way, A.; Wenniger, G.; Passban, P. Investigating Backtranslation in Neural Machine Translation. 2018; pp. 249–258. Available online: https://arxiv.org/pdf/1804.06189.pdf (accessed on 8 April 2022).
- Fadaee, M.; Bisazza, A.; Monz, C. Data Augmentation for Low-Resource Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 567–573.
- Sennrich, R.; Haddow, B.; Birch, A. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Berlin, Germany, 2016; pp. 1715–1725.
- Norouzi, M.; Bengio, S.; Chen, Z.; Jaitly, N.; Schuster, M.; Wu, Y.; Schuurmans, D. Reward Augmented Maximum Likelihood for Neural Structured Prediction. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 1731–1739.
- Zoph, B.; Yuret, D.; May, J.; Knight, K. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–4 November 2016; Association for Computational Linguistics: Austin, TX, USA, 2016; pp. 1568–1575.
- Choudhary, H.; Rao, S.; Rohilla, R. Neural Machine Translation for Low-Resourced Indian Languages. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Marseille, France, 2020; pp. 3610–3615.
- Shibata, Y.; Kida, T.; Fukamachi, S.; Takeda, M.; Shinohara, A.; Shinohara, T.; Arikawa, S. Byte Pair Encoding: A Text Compression Scheme That Accelerates Pattern Matching. 1999. Available online: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.46.4046&rep=rep1&type=pdf (accessed on 8 April 2022).
- Huang, E.; Socher, R.; Manning, C.; Ng, A. Improving Word Representations via Global Context and Multiple Word Prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Jeju Island, Korea, 2012; pp. 873–882.
- Goyal, V.; Sharma, D.M. LTRC-MT Simple & Effective Hindi-English Neural Machine Translation Systems at WAT 2019. In Proceedings of the 6th Workshop on Asian Translation, Hong Kong, China, 4 November 2019; Association for Computational Linguistics: Hong Kong, China, 2019; pp. 137–140.
- Das, A.; Yerra, P.; Kumar, K.; Sarkar, S. A Study of Attention-Based Neural Machine Translation Model on Indian Languages. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP 2016); The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 163–172.
- Przystupa, M.; Abdul-Mageed, M. Neural Machine Translation of Low-Resource and Similar Languages with Backtranslation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2); Association for Computational Linguistics: Florence, Italy, 2019; pp. 224–235.
- Sennrich, R.; Firat, O.; Cho, K.; Birch, A.; Haddow, B.; Hitschler, J.; Junczys-Dowmunt, M.; Läubli, S.; Miceli Barone, A.V.; Mokry, J.; et al. Nematus: A Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics; Association for Computational Linguistics: Valencia, Spain, 2017; pp. 65–68.
- Klein, G.; Hernandez, F.; Nguyen, V.; Senellart, J. The OpenNMT Neural Machine Translation Toolkit: 2020 Edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track); Association for Machine Translation in the Americas: Orlando, FL, USA, 2020; pp. 102–109.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318.
- Denkowski, M.; Lavie, A. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 376–380.
- Snover, M.G.; Madnani, N.; Dorr, B.; Schwartz, R. TER-Plus: Paraphrase, Semantic, and Alignment Enhancements to Translation Edit Rate. Mach. Transl. 2009, 23, 117–127.
- Ramesh, G.; Doddapaneni, S.; Bheemaraj, A.; Jobanputra, M.; AK, R.; Sharma, A.; Sahoo, S.; Diddee, H.; Nagaraj, S.; Kakwani, D.; et al. Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages. arXiv 2021, arXiv:2104.05596 (accessed on 8 April 2022).
[Table: corpus names and number of parallel sentences per language pair; values not recovered.]
| Translation Model | BLEU (EN–HI) | BLEU (EN–BN) |
|---|---|---|
| English–Indian language (baseline) | 33.45 | 11.58 |
| English–Indian language + back-translation | 33.72 | 11.99 |
[Table: translation model vs. BLEU score; values not recovered.]
| Score | Adequacy | Fluency |
|---|---|---|
| 5 | All information present in the translation | Perfect in terms of grammatical correctness |
| 4 | Most of the information present | Not perfect but very good |
| 3 | Nearly half of the information present | Average quality |
| 2 | Very little information present | Poor quality |
| 1 | No information present | Worst or completely incomprehensible |
| Source (English) | Translation Output (English gloss) | Adequacy | Fluency |
|---|---|---|---|
| "In a way, this is endowed with the might to transform the entire season-cycle of the country". | It is the weather of the entire country. | 2 | 5 |

[Table: average adequacy and fluency per language pair and translation direction; values not recovered.]
| # | Source (English) | Translation Output (English gloss) | Adequacy | Fluency |
|---|---|---|---|---|
| 1 | I recently visited the Krishi Unnati Mela organized in New Delhi. | I had visited the Agri-Unnati Mela in Delhi recently. | 4 | 5 |
| 2 | Start-Ups have been given income tax exemption for three years. | Startups are exempted from paying income tax for 3 years. | 5 | 5 |
| 3 | Yoga helps to maintain balance amidst … | Adds yoga between this scatter. | 2 | 2 |
| 4 | "It brings about peace in the family by uniting the person with the family." | It brings happiness and prosperity to the family. | 3 | 5 |
| # | Source (English) | Translation Output (English gloss) | Adequacy | Fluency |
|---|---|---|---|---|
| 1 | "Not only this, we are also the sixth largest producer of renewable energy". | We have also earned the honour of the sixth largest producer of renewable energy. | 4 | 5 |
| 2 | Start-Ups have been given income tax exemption for three years. | Start-up companies have been given tax concessions for the first three years. | 5 | 5 |
| 3 | "In a way, this is endowed with the might to transform the entire season-cycle of the country". | All these ideas are the strength of the country's … | 1 | 3 |
| 4 | "That is why a large number of letters on agriculture have been received". | Many letters have been written about agriculture. | 3 | 5 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Kandimalla, A.; Lohar, P.; Maji, S.K.; Way, A. Improving English-to-Indian Language Neural Machine Translation Systems. Information 2022, 13, 245. https://doi.org/10.3390/info13050245