Enhancement of English-Bengali Machine Translation Leveraging Back-Translation

Abstract: An English-Bengali machine translation (MT) application can convert an English text into a corresponding Bengali translation. To build a better model for this task, we can optimize English-Bengali


Introduction

Background and Motivation
In machine translation (MT), we input an ordered sequence of words in a source language (i.e., a sentence) to the model and expect the model to output a sentence in a target language with the same meaning as the input [1]. Most recent research on MT falls under neural machine translation (NMT), which employs various neural networks, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformer models, and so on. A typical NMT model consists of two main components: an encoder that transcribes the input sentence into a hidden representation and a decoder that generates the target sentence according to the hidden representation [2]. In particular, RNN-based models (basic RNN, LSTM, GRU, and other variants) were used in the early days of NMT evolution [3,4]. However, with RNN-based models, translation accuracy degrades as the sequence length grows and as out-of-vocabulary or unknown words become more frequent. Another issue is that the only information the decoder receives from the encoder is the encoder's last hidden state, which contains just a fixed-size numerical summary of the input sequence. With such a summary, the model can handle the translation of short sentences but struggles with long ones; as human beings, we may also have trouble translating a long sentence in one pass. To address these issues, RNN-based models with an attention mechanism were introduced [5][6][7]. In this mechanism, the encoder encodes the input sentence into a sequence of vectors (one per token) rather than a single fixed-length vector and passes these vectors to the decoder, which mitigates the problem of forgetting long sequences because the per-token information persists across the whole sequence. Still, LSTM and GRU models are not computationally efficient for large datasets and cannot exploit parallel processing. As such, Vaswani et al. 
[8] proposed a new model, the Transformer, which renounces recurrence and instead relies entirely on self-attention to compute the global dependencies between the source (input) and the target (output). In addition, the Transformer enables better parallel processing and achieves better translation accuracy [8][9][10]. Transformer models have now become the mainstream architecture for NMT. A detailed demonstration of all three architectures is presented in the Related Work section (Section 2.1).
We observe that in NMT, the size of the training data affects the translation quality. Models for translation between language pairs with abundant parallel corpora are easier to train and perform well in general usage. However, languages that lack parallel corpora (known as low-resource languages) make it challenging to train a high-quality translation model on limited data; English-Bengali is an example of this case. As stated earlier, NMT is a data-driven task, so applying NMT to low-resource parallel corpora (English-Bengali in this study) is a challenge. We also observe that very few research works have addressed translation in this domain; we briefly review a set of approaches herein. Mumin et al. [11] use an NMT architecture with an RNN-based encoder-decoder and the attention mechanism. However, the translation accuracy and quality are very low (BLEU: 16.26; BiLingual Evaluation Understudy (BLEU) is the most commonly used evaluation metric in MT evaluation [12], and its significance is presented later in this section), and it is very hard to understand the gist of the output. Kunchukuttan et al. [13] contributed significantly: the authors built a corpus for the top 10 Indian languages, including Hindi, Bengali, Tamil, Telugu, and others; however, the accuracy of bilingual lexicon induction for English-Bengali is not good. We observe that in building an English-Bengali corpus and English-Bengali NMT architectures, the most prominent research works come from the BUET CSE NLP Group, Bangladesh (https://csebuetnlp.github.io/ accessed on 1 August 2024) [14][15][16][17]. In their research, they compile a Bengali-English parallel corpus, named BanglaNMT, comprising 2.75M sentence pairs [18]. They also build a model called BanglaT5 [16] (a Transformer model based on the standard Transformer introduced by Vaswani et al. 
[8]; the base variant of BanglaT5 comprises 12 layers, 12 attention heads, a hidden size of 768, and a feed-forward size of 2048, and was trained on a v3-8 TPU instance on Google Cloud Platform), which is evaluated on six NLP tasks: machine translation, question answering, text summarization, dialogue generation, cross-lingual summarization, and headline generation [16]. In the context of our study of English-Bengali translation, we observe that the translation accuracy of BanglaT5 is not high (BLEU: 17.4).
Notably, we can incorporate different augmentation approaches into NMT models to enhance machine translation tasks [2]. In this paper, we work on this aspect; however, our baseline Transformer model is not as large as BanglaT5 (the detailed configuration is presented in the Experiments section (Section 4.3)). Due to the lack of powerful computing resources, and to reduce the carbon footprint and the associated environmental impact, we designed a very basic Transformer model integrated with two decoding strategies, as discussed in this paper. Toward building a viable NMT engine for low-resource parallel corpora, a set of augmentation approaches can be used while leveraging monolingual data, such as back-translation [19][20][21][22][23][24][25][26][27][28][29], the pivot language technique [30][31][32], leveraging multi-modal data [33], and others. We observe that these notable approaches work well for the language pairs mentioned in the literature (a detailed demonstration is presented in the Related Work section (Section 2.2)). However, we need to evaluate the effectiveness of the approaches for other low-resource language pairs.

Challenges and Contributions
In this paper, in building our viable English-Bengali NMT engine, we employ back-translation while leveraging monolingual data. Back-translation is one of the state-of-the-art strategies used in NMT for addressing low-resource languages and monolingual corpora [19][20][21][22][23][24][25][26][27][28][29]. Notably, the architectures of the translation and back-translation models are similar; the source and target corpora are simply reversed for the encoder and decoder. In back-translation, instead of extracting the test set from the datasets used for training and validation, it is recommended to hold out a part of the dataset for testing or to directly use a monolingual target corpus, which improves model generalization. In particular, the translated text and the monolingual target data from the last stage can be used as a pseudo-parallel corpus and appended to the original dataset, and the model can then be trained on this augmented dataset, which helps train the model better. Therefore, with back-translation, we can obtain an augmented dataset. However, the new data can be regarded as noisy because they are generated by models that may not yet be trained very well, rather than by human translators (assuming that human translators produce perfect translations). Since the original output of a translation model is a probability distribution over candidate words, different decoding methods are used to make the model more robust, such as beam search [34], top-k random sampling [35], random sampling with temperature T [36], and others. Notably, top-k random sampling and random sampling with temperature T are more commonly used and more effective decoding methods than beam search [35,36], which was mostly used in the early days of NMT evolution. Therefore, we can append the pseudo-parallel corpus obtained from back-translation with these decoding strategies to the original dataset. The experiment results (Section 4.4) show that back-translation with decoding strategies improves translation accuracy. Thus, the model can be trained with the augmented dataset, which helps enhance its generalization ability [37][38][39]. Moreover, the process can be repeated with monolingual datasets from different sources.
In the experimental study of this paper, we start our analysis by comparing two architectures: LSTM and Transformer. Thereafter, we employ back-translation with the two decoding methods stated earlier: top-k random sampling [35] and random sampling with temperature T [36]. Top-k random sampling samples from the top k predictions. For example, if k = 3, the model simply keeps the 3 words or tokens with the highest probabilities and sets the probabilities of the rest to −∞ (−10,000 in practice). In random sampling with temperature T, the temperature is a factor that modifies the randomness of the candidate-word distribution. Notably, if we set k = +∞ in top-k random sampling, or T = 1 in random sampling with temperature T, the probability distribution is not modified, i.e., the model samples from all tokens in its vocabulary with their original probabilities (this can be called "no strategy"). It is not resource-intensive, but it is also not optimal, as it samples from all the tokens. In this paper, we explore the optimal values for k and T, respectively. For more details, including the definitions of k and T, we refer to Sections 3 and 4.
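To make the two decoding strategies concrete, the following is a minimal, self-contained sketch (not our actual decoding code) of top-k filtering and temperature sampling over a vector of logits; the −10,000 constant mirrors the practical stand-in for −∞ mentioned above:

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k highest logits; set the rest to -inf (-10,000 in practice)."""
    if k >= len(logits):  # k = +infinity leaves the distribution unchanged
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -10000.0 for x in logits]

def sample_with_temperature(logits, T=1.0):
    """Sample a token index after rescaling logits by 1/T (T = 1: unchanged)."""
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(x - m) for x in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]
```

With k equal to the vocabulary size or T = 1, both functions reduce to plain sampling from the unmodified distribution, i.e., the "no strategy" case.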
In our study, we evaluate the machine translation task using an automatic evaluation method called BiLingual Evaluation Understudy (BLEU) [12]. BLEU is the most commonly used evaluation metric in machine translation evaluation [2]. Given a sentence predicted by the model and the reference translation, BLEU measures their similarity. Notably, the BLEU score is based on precision (ranging from 0 to 1) in analyzing the similarity between the candidate translation and one or more reference translations. By convention, the BLEU score is multiplied by 100 to place it in a 0 to 100 interval [12]. Now, a question arises: what is a good BLEU score? Usually, a BLEU score higher than 30 is considered good. For the interpretation of BLEU scores, a rough guideline is provided by Google (https://cloud.google.com/translate/automl/docs/evaluate accessed on 1 August 2024): a score between 30 and 40 indicates "understandable to good translations"; 40-50 indicates "high-quality translations"; 50-60 indicates "very high quality, adequate, and fluent translations"; and >60 indicates "quality often better than human".
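As an illustration of how BLEU combines clipped n-gram precision with a brevity penalty, the sketch below implements a simplified sentence-level BLEU (the reference metric [12] aggregates statistics over a whole corpus and handles smoothing differently):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU on a 0-100 scale.

    candidate: list of tokens; references: list of token lists.
    """
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0  # candidate is shorter than n tokens
        # clip counts by the maximum occurrences in any single reference
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if clipped == 0:
            return 0.0  # zero n-gram precision; unsmoothed BLEU is 0
        log_prec_sum += math.log(clipped / sum(cand_counts.values()))
    # brevity penalty against the closest reference length
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1.0 - ref_len / len(candidate))
    return 100.0 * bp * math.exp(log_prec_sum / max_n)
```

On a toy pair, an identical sentence scores 100, while a truncated but otherwise exact prefix is penalized only by the brevity term.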
So far, we have presented our study thoroughly. For the sake of clarity, we now summarize its key aspects as follows: • The first aspect is a comparison between LSTM and Transformer on English-Bengali machine translation. We find that the Transformer architecture achieves better accuracy even with a small training dataset (BLEU: LSTM 3.62 on validation, 0.00 on test; Transformer 27.80 on validation, 1.33 on test). Specifically, this is our baseline study to evaluate whether the Transformer can perform better in our study of English-Bengali machine translation. In our further study, we use the baseline Transformer as the basis and employ back-translation with the decoding strategies mentioned earlier and in the following aspect.

•
The second aspect is performing back-translation, reversing the source and target languages. However, what is the novelty of doing this? Since we have a low-resource English-Bengali corpus, the model will not have a very high level of generalization ability, and if we perform inference on monolingual Bengali datasets with such a model, the translation accuracy and quality will not be very good. Further, even if we train the model with a large dataset like BanglaNMT [18], the generalization ability of the model only reaches a certain level, and in that case back-translation alone does not help much in augmenting machine translation. To this end, our goal is to augment the English-Bengali corpus with back-translation while incorporating decoding strategies that improve translation accuracy and quality. Since NMT is a data-driven task, we can repeat this process. We can especially employ this process for domain-specific translation tasks, since achieving high accuracy in domain-specific machine translation is very challenging due to the lack of authentic parallel corpora, and understanding the languages is a very complex task for the model even when high-resource parallel corpora are available. In sum, we use back-translation to boost our base Transformer model, incorporating the two decoding strategies: top-k random sampling and random sampling with temperature T. We observe that decoding by top-k = +∞ random sampling helps improve the accuracy (BLEU 38.22, +10.42 on validation; 2.07, +0.74 on test), while sampling with a proper value of T makes the model achieve a higher score (T = 0.5, BLEU 35.02, +7.22 on validation; 2.35, +1.02 on test) than the first strategy (top-k random sampling).

Outline of the Paper
The rest of the paper is organized as follows: Section 2 presents a literature review of approaches to NMT in the context of low-resource languages. Section 3 introduces the methods and evaluation metrics used in our experiments. Section 4 presents our experiments on the impact of back-translation with decoding strategies on the final performance of the model. Conclusions and future work are given in the last section.

Related Work
In this section, we review the literature on NMT approaches developed for low-resource languages. According to our observation, statistical machine translation (SMT) and neural machine translation (NMT) are the two most commonly used machine translation approaches. Both are corpus-based, where source and target texts are required to build the translation system. However, due to numerous issues with SMT, NMT has become the mainstream machine translation engine [2,[40][41][42][43].
We observe that in SMT, a model is separated into several sub-modules, such as the language model, the translation model, the reordering model, etc.; these components work together to implement the translation function. On the other hand, NMT adopts a neural network to translate from the source language to the target language directly. NMT with an attention mechanism dynamically obtains the source language information correlated with the word currently being generated; therefore, NMT can derive the corresponding alignment information without setting up an alignment model, as in SMT [2,44]. Notably, NMT models are non-linear and smaller, while SMT models are linear and larger [45]. We also observe that NMT outperforms SMT by a large margin on all evaluation metrics [44,[46][47][48][49][50]. The key advantage of NMT is that the source and target corpora can be trained directly in an encoder-decoder engine, which makes it fast and accurate [51]. Classic neural translation models involve sequence-to-sequence implementations with RNN encoder-decoder models [3,4,[52][53][54][55], attention mechanisms [5][6][7][56][57][58][59][60][61], transformers [8,[62][63][64][65], and so on. Applications include Google Translate [7], Microsoft Translator [66], OpenNMT [67][68][69], and many others.

Neural Machine Translation (NMT) Architecture
In this section, we review the architectures used in NMT, the mainstream approach for MT tasks. Let x be the input sentence in the source language, y be the output sentence in the target language, and θ be the parameters of the model; the objective of NMT is to find the translation satisfying Equation (1):

ŷ = argmax_y P(y | x; θ) (1)

The basic architecture of NMT, which handles input and output of arbitrary length, is a Seq2Seq model with pre- and post-processing. In this section, we introduce an example framework of NMT using LSTM as the encoder and decoder of a Seq2Seq model, as illustrated in Figure 1. The selection of the Seq2Seq model (including encoder and decoder) is essential in the architecture of NMT [70,71]. Many network structures can be leveraged as Seq2Seq models, including RNN-based models (basic RNN, LSTM, GRU, or other variants), RNN-based models with encoder-decoder attention mechanisms, the Transformer, and so on. For these structures, the key idea is to store and leverage the information from the input sequence and the generated part of the output sequence.

RNN Based Model
For the Seq2Seq architecture with RNN models as the encoder and decoder [3,4,[52][53][54][55], the framework was introduced in the previous section. Given an input sequence (x_1, x_2, ..., x_I), in the i-th step, the key features of an RNN model are two functions that take element x_i and the hidden state h_{i−1} from the last step as input and respectively output y_i for post-processing and a new hidden state h_i for the next step. Basically, each function can simply be a stack of FC layers with non-linear activation. As the encoder and decoder, a basic RNN can tackle input and output sequences of arbitrary length; it encodes and exploits the information from the input sequence in a hidden state. However, in a basic RNN, the hidden state has limited capacity and gradually forgets information obtained from earlier elements of the input sequence, which is known as the long-term dependency problem. For instance, if we pass a long sentence to a basic RNN encoder, the output translation may have the wrong subject, because the subject is at the beginning of the input and its information vanishes from the hidden state after many updates. Long short-term memory (LSTM) [73] alleviates this problem. It employs the LSTM unit as the function that updates the hidden state in RNN structures and introduces the cell state as another variable to store information. At step t, the input gate i_t, forget gate f_t, cell gate g_t, and output gate o_t are calculated from the input x_t and hidden state h_{t−1} by corresponding learnable linear transformations with non-linear activation functions. Then the cell state and hidden state are updated as in Equations (2) and (3):

c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t (2)
h_t = o_t ⊙ tanh(c_t) (3)

where ⊙ represents element-wise multiplication. Notably, the hidden state h_t can be used by the other functions to produce the output y_t.
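As a minimal illustration of the LSTM update in Equations (2) and (3), the toy code below runs one step with scalar states and hand-picked gate parameters (so element-wise multiplication reduces to ordinary multiplication); real implementations operate on vectors with learned weight matrices:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update with scalar input/hidden/cell states.

    params maps each gate name to toy (w_x, w_h, b) parameters.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)
    i_t = gate("i", sigmoid)    # input gate
    f_t = gate("f", sigmoid)    # forget gate
    g_t = gate("g", math.tanh)  # cell gate
    o_t = gate("o", sigmoid)    # output gate
    c_t = f_t * c_prev + i_t * g_t  # Equation (2)
    h_t = o_t * math.tanh(c_t)      # Equation (3)
    return h_t, c_t
```

Setting the forget gate's bias strongly negative, for example, drives f_t toward 0, so the previous cell state is discarded regardless of its value.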
In the aforementioned RNN models, the encoded information is accumulated in one or two hidden states, and the output sequence is generated according to these hidden states, which contain the information of the entire input. However, when producing some part of the output sequence, the context information of the corresponding part of the input is usually more important. For example, suppose the model is translating "We go to the bank of the river" into Bengali. When it translates "bank", besides "bank" itself, the encoded information from "river" is necessary because it specifies the meaning of "bank", while other parts of the input are less relevant.

RNN Based Model with Attention Mechanism
To address the aforementioned problem, Bahdanau et al. [5] introduced an attention mechanism in the RNN encoder-decoder architecture. In modified RNN models with an encoder-decoder attention mechanism, given the input sequence (x_1, ..., x_I), all hidden states in the encoder phase are stored in a matrix H = [h_1, ..., h_I]. Then, in step i of the decoder phase, the hidden state (denoted as s_i for distinction) is updated as usual, while a weighted sum c_i of the encoder hidden states is calculated to represent the information that is important for the output in this step. Formally, given the decoder hidden state s_i (the information of the output status), the normalized weight for the encoder hidden state h_j is given by Equation (4):

e_{i,j} = exp(a(s_i, h_j)) / Σ_{j'} exp(a(s_i, h_{j'})) (4)

where a(s_i, h_j) can simply be the dot product of s_i and h_j in practice. Then c_i is given by Equation (5):

c_i = Σ_j e_{i,j} h_j (5)

Then, c_i can be used to generate the output y_i. Notice that in each h_j, the information from the corresponding input x_j and its neighbors is expected to be predominant, and a high value of e_{i,j} makes c_i contain more information from h_j; therefore, the weight e_{i,j} indicates the attention of the decoder on input x_j. In general, RNN models with an encoder-decoder attention mechanism can focus on important parts of the input information in the transduction task, which improves performance [5][6][7][56][57][58][59][60][61]. However, RNN models have an auto-regressive workflow in both training and inference; that is, the hidden state h_i depends on h_{i−1} from the last step. Therefore, the input and output have to be processed step-by-step and thus cannot be accelerated by parallel computation. Moreover, the encoder-decoder attention mechanism works only in the decoder phase, while context information is neglected when encoding the input sequence.
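The attention weights and context vector of Equations (4) and (5) can be sketched with dot-product scoring on plain Python lists (toy dimensions; a real model would use batched matrix operations):

```python
import math

def attention_context(s_i, H):
    """Encoder-decoder attention with dot-product scoring.

    s_i: decoder hidden state (list of floats).
    H:   list of encoder hidden states h_j (same dimension as s_i).
    Returns the normalized weights e_{i,j} and the context vector c_i.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(s_i, h_j) for h_j in H]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]  # Equation (4)
    # Equation (5): weighted sum of encoder hidden states
    c_i = [sum(w * h_j[d] for w, h_j in zip(weights, H))
           for d in range(len(H[0]))]
    return weights, c_i
```

A decoder state closely aligned with one encoder state receives almost all of the attention mass, and the context vector collapses toward that state.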

Transformer Based Model
To address the aforementioned problems, the Transformer [8] was introduced, which receives the entire input sequence in only one step and uses self-attention to leverage context information when encoding the input sequence. For the details of the Transformer, we recommend the original paper by Vaswani et al. [8]. Figure 2 shows the architecture of the Transformer. In the encoder phase, the embedded input sequence (denoted as matrix X = [x_1, ..., x_I]) is first added to a positional encoding matrix and then goes through a stack of N encoder blocks (N = 6 in practice). In each encoder block, it is further encoded by a multi-head self-attention layer, which leverages context information. The production of head n (n = 1, ..., 8 in practice) is the matrix H_n, as shown by Equations (6)-(11):

q_i = W_Q x_i (6)
k_i = W_K x_i (7)
v_i = W_V x_i (8)
e_{i,j} = q_i · k_j / √d_k (9)
α_{i,j} = exp(e_{i,j}) / Σ_{j'} exp(e_{i,j'}) (10)
h_i = Σ_j α_{i,j} v_j (11)
where W_Q, W_K, and W_V are linear transformations that transfer the embedded input x_i into three vectors that serve as query, key, and value, respectively, in constructing context-aware representations, and d_k is the dimension of q_i and k_j, with i, j = 1, ..., I. The output O of this layer, obtained by projecting the concatenated heads with a learned matrix W_O, has the same size as the input, as shown in Equation (12):

O = Concat(H_1, ..., H_8) W_O (12)
The result is then passed to a feed-forward network (2 FC layers in practice) to obtain a better representation. After each step, there is a residual connection and a layer normalization operation to keep the output values in a reasonable range. The output of the encoder is the context matrix C = [c_1, ..., c_I], which is similar to the hidden state matrix in RNNs. The decoder of the Transformer also consists of N = 6 blocks. It uses teacher forcing in the training stage; that is, the embedded input of the first decoder block is the right-shifted ground truth Ŷ = [zero_vec, ŷ_1, ..., ŷ_M], where column i is analogous to the input of the decoder in step i for RNNs. After positional encoding, it is passed to a masked multi-head self-attention layer, where exp(q_i · k_j / √d_k) is masked (set to 0) for all j > i to prevent contributions from future information when constructing a representation of Ŷ. The other attention layer in the decoder block is similar to that in the encoder, with the difference that the queries come from the decoder side while the keys and values come from the encoder output, as shown in Equations (13)-(15):

q_i = W_Q s_i (13)
k_j = W_K c_j (14)
v_j = W_V c_j (15)

where s_i is the decoder-side representation at position i after the masked self-attention layer,
where c_j is from C, the output of the encoder, i = 1, ..., M + 1, and j = 1, ..., I. The output of the decoder in training (after softmax) is the matrix Y = [y_1, ..., y_M], where y_i is the vector of the probability distribution over candidate tokens for position i in the output sequence. The Transformer can be trained at high speed through parallel computation and has high capacity. It is the basis of many large language models (LLMs), including the chat generative pre-trained transformer (ChatGPT, https://openai.com/blog/chatgpt accessed on 1 August 2024). Also, as stated earlier, the BUET CSE NLP Group from Bangladesh, who are consistently working on optimizing NLP tasks in Bengali, also use the Transformer as the basis of their studies [14][15][16][17]. We further observe that prominent machine translation models [63][64][65] are built on top of the base Transformer [8].
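The causal masking used in the decoder's self-attention can be sketched as a stand-alone function: given a matrix of raw attention scores q_i · k_j / √d_k, every entry with j > i is excluded before the softmax so that a position never attends to future tokens (an illustrative sketch, not a full decoder):

```python
import math

def masked_attention_weights(scores):
    """Causal masking in decoder self-attention.

    scores: I x I matrix of raw q_i . k_j / sqrt(d_k) values.
    Position i may only attend to positions j <= i; masked entries
    contribute 0 to the softmax, as described in the text.
    """
    I = len(scores)
    weights = []
    for i in range(I):
        # mask out future positions (j > i)
        row = [scores[i][j] if j <= i else float("-inf") for j in range(I)]
        m = max(row[: i + 1])  # max over the unmasked prefix, for stability
        exp_row = [math.exp(s - m) if s != float("-inf") else 0.0 for s in row]
        z = sum(exp_row)
        weights.append([e / z for e in exp_row])
    return weights
```

The first position can therefore only attend to itself, and every row still sums to 1 over the allowed prefix.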

NMT Architecture in Our Study
For our study, we proceed with LSTM (without the attention mechanism) and Transformer-based architectures, as demonstrated in Section 4; notably, this is our baseline study, as stated earlier. Moreover, we exploit the OpenNMT [67][68][69] toolkit to build the NMT framework for our study. OpenNMT is an open-source platform that provides various pre- and post-processing methods, as well as encoder and decoder model structures for sequence transduction tasks. With OpenNMT, we can explore the performance of customized LSTM and Transformer models in the English-Bengali translation task. Besides the selection of the model structure, the training data are also critical for NMT tasks. However, as a low-resource language pair, English-Bengali translation lacks parallel data, so it is difficult to achieve high translation accuracy for low-resource language pairs (e.g., English-Bengali, or others). We observe that the prominent methods built on top of the baseline models can help address this issue and consequently optimize the accuracy and quality of translations significantly [2,74].
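For illustration, a training run in OpenNMT-py is typically driven by a small YAML configuration along these lines (a hedged sketch: the file paths are placeholders and the exact option names depend on the OpenNMT version; our actual configuration is given in Section 4.3):

```yaml
# Illustrative OpenNMT-py configuration sketch (placeholder paths).
save_data: run/en_bn
src_vocab: run/en_bn.vocab.src
tgt_vocab: run/en_bn.vocab.tgt
data:
  corpus_1:
    path_src: data/train.en
    path_tgt: data/train.bn
  valid:
    path_src: data/valid.en
    path_tgt: data/valid.bn
encoder_type: transformer
decoder_type: transformer
save_model: run/model
```

Such a file is then passed to the toolkit's command-line tools (e.g., onmt_train -config config.yaml in OpenNMT-py).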

Back-Translation
To augment the dataset, if we have a monolingual corpus in the target language, we can translate it into the source language using a translation engine in the reverse direction. Thus, we obtain a pseudo-parallel corpus, which we can add to the original dataset to obtain an augmented dataset. This is called back-translation [19][20][21][22][23][24][25][26][27][28][29]. Similarly, in forward translation, if we have monolingual data in the source language, we can generate their counterparts in the target language using the same model we are training [25]. The new data can be regarded as noisy because they are generated by models that are not yet trained very well, rather than by human translators (supposing they generate perfect translations). Since the original output of a translation model is a probability distribution over candidate words, to make the model more robust, Imamura et al. [77] employ random sampling instead of conventional beam search when generating translations. Meanwhile, the research work [78] adds noise to the pseudo-source sentences by deleting or masking certain words to improve the robustness of the model.
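The core bookkeeping of back-translation can be sketched in a few lines; here reverse_model stands in for a trained target-to-source NMT engine (any callable works for illustration):

```python
def back_translate(monolingual_tgt, reverse_model, parallel_corpus):
    """Augment a parallel corpus with pseudo-parallel pairs.

    monolingual_tgt: list of target-language sentences.
    reverse_model:   callable mapping a target sentence to a (noisy)
                     source sentence; a trained NMT model in practice.
    parallel_corpus: list of (source, target) pairs.
    """
    pseudo_parallel = [(reverse_model(t), t) for t in monolingual_tgt]
    return parallel_corpus + pseudo_parallel
```

The augmented corpus is then used to retrain the forward model; the reverse model may decode with beam search, top-k sampling, or temperature sampling, which is exactly the choice studied in this paper.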
In Caswell et al. [26]'s research work, "Tagged Back-Translation" (TaggedBT), the authors show that placing a tag ⟨BT⟩ at the beginning of generated source sentences, instead of adding noise by dropping words, improves the translation performance. In particular, by analyzing the entropy of decoder attention, the researchers noticed that when trained on normal back-translation data without a tag, the model concentrates more on the last input word (low entropy), which indicates a bias toward word-for-word translation. Meanwhile, models trained on tagged data show diffused attention (high entropy), indicating more attention to context. This implies that, without a tag, the model tends to use word-for-word translation for all input, while the ⟨BT⟩ tag helps the model identify sentences that belong to the back-translated data and avoid using the word-for-word strategy on untagged input.
Dual learning [27] is another approach related to back-translation that utilizes monolingual or low-resource datasets. This method trains a source-target model and a target-source model at the same time and combines the two models so that both input and output are in the source language. Thus, the model can be trained only on source-language data by minimizing the difference between input and output. Wang et al. [28] proposed letting models (agents) in different translation directions evaluate each other. In data diversification [29], the models in the two translation directions and the parallel corpus are alternately updated; that is, models are first trained on the old corpus (updating the models), and then pseudo-parallel data output by the trained models are added to the corpus (updating the corpus).

Pre-Training on Monolingual Data
If abundant parallel data are not available, we can pre-train the model with monolingual data, forcing it to build a hidden representation space for the source and target languages. The translation quality can then be improved because the model gains a better understanding of the languages and a better capability of generating them, even before translation training. The basic idea of monolingual pre-training is to ask the model to reconstruct corrupted text.
In masked language modeling (MLM) [75], some words in the input sentence are replaced with mask tokens, and the model is tasked with predicting the masked words. Later, Conneau et al. [76] incorporated MLM with translation language modeling (TLM); in this work, instead of a single sentence in one language, a parallel sentence pair is concatenated and used as input. Zhu et al. [79] employ a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model [75] to provide hidden representations of input sentences to the encoder and decoder designed in their work. These works pre-train the encoder and decoder, respectively. However, for the encoder-decoder attention mechanism, joint pre-training becomes important. In addition, to build a more robust model, some researchers seek more sophisticated methods of adding noise during joint pre-training.
Song et al. [80] propose MAsked Sequence-to-Sequence pre-training (MASS). In this work, at the pre-training stage, some neighboring words in the sentence input to the encoder are masked, while the input to the decoder is further masked to force the decoder to use the information extracted by the encoder. Likewise, at the pre-training stage of the text-to-text transfer transformer (T5) model [62], some neighboring words are masked by a single token, and the model should reconstruct the masked content of arbitrary length. Lewis et al. [81] propose Bidirectional and Auto-Regressive Transformers (BART), which allow more types of document noise in joint pre-training, including masking (e.g., x1 x2 x3 → x1 ⟨M⟩ x3), deleting words (e.g., x1 x2 x3 → x1 x3 without a mask token), sentence permutation (e.g., x1 x2, x3 x4 → x3 x4, x1 x2), and so on.
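The three BART-style noising operations listed above can be sketched directly on token lists (illustrative helpers, not the authors' code):

```python
import random

def mask_span(tokens, start, length, mask="<M>"):
    """Masking: x1 x2 x3 -> x1 <M> x3 (a span becomes one mask token)."""
    return tokens[:start] + [mask] + tokens[start + length:]

def delete_word(tokens, index):
    """Deletion: x1 x2 x3 -> x1 x3 (no mask token left behind)."""
    return tokens[:index] + tokens[index + 1:]

def permute_sentences(sentences, rng=random):
    """Sentence permutation: reorder the sentences of a document."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled
```

During pre-training, the model receives the corrupted text and is trained to reconstruct the original, uncorrupted version.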

Pivot Language Technique
In addition, some research employs rich-resource languages that are different from the source and the target (called auxiliary languages [30]). For example, Cheng et al. [31] directly use English as the pivot between German and French by training a German-English model and an English-French model. Leng et al. [32] train a predictor to estimate the potential translation accuracy according to the choice of pivot language. The critical point in using an auxiliary language is selecting an appropriate language. To better exploit the information from auxiliary languages, researchers may choose languages in the same or a similar family; the auxiliary language may share a writing system or similar grammar with the source or target language. In addition, if we choose languages whose speakers communicate frequently with speakers of the source or target languages, the auxiliary language and the source or target language may share expressions or words (called loan words), which helps build a shared hidden representation space for the translation model [82]. Tan et al. [83] define "similar languages" by clustering languages based on their embeddings instead of prior domain knowledge: they train a multi-lingual translator whose input is a sentence and a tag indicating its language, then extract the hidden representation of the tag (a vector) in the trained model as the embedding. Lin et al. [84] consider using auxiliary languages for transfer learning and propose LANGRANK, which selects the optimal language by ranking the candidates according to data size, typological information, and so on. Niu et al. 
[85] proposes target-conditioned sampling.In this algorithm, the model first samples a sentence in the target language (low-resource), then samples its corresponding source sentence, where the source language is not fixed.The joint probability distribution Q(sentence_X, sentence_Y) is constructed to minimize the expected loss function of the translation task: E (x,y)∼Q [loss(x, y|θ)].Then the sampling is according to the conditional probability distribution: P(y|x) = Q(x, y)/Q(x).If we consider introducing auxiliary languages, we should balance the size of training data in different languages.Otherwise, the model cannot gain much knowledge about the low-resource language because of its finite capacity.Wang et al. [86] train a scorer that automatically decides the size of the data given the languages.

Leveraging Multi-Modal Data
Some studies also leverage multi-modal data. The basic idea is that multi-modal data carries semantic information just as text in any language does, so researchers can embed an image, a piece of video, and/or a sentence in the same semantic space. Su et al. [33] conduct unsupervised learning of an English-French bi-directional translation model with (English, image) and (French, image) multi-modal monolingual corpora, where the images are not aligned. This model contains two Transformers, for English and French, respectively, as well as a ResNet for the image part of the two datasets. For the auto-encoding loss in English, the English Transformer reconstructs noised words in a sentence according to the information extracted from images by the ResNet, while the French counterpart works similarly. For the cycle-consistency loss, a pipeline like (En, encoder) → (Fr, decoder) → (Fr, encoder) → (En, decoder) is built so that both input and output are English, and the extracted image information is given to the English decoder. This allows researchers to avoid using an aligned English-French corpus. The model learns by minimizing the difference between the English input and output, and similar work is completed for the French corpus. Finally, at the English-French inference stage, an English sentence is input to the English encoder and the corresponding image to the French decoder, and the model outputs the translated French sentence.
Instead of using multi-modal data for training directly, we can also employ such data to construct a pseudo-parallel corpus. For example, given an image, Chen et al. [87] exploit image-to-text models in different languages to predict its caption. In this case, the short descriptions of the same image in different languages are expected to have the same meaning; thus, a pseudo-parallel corpus is constructed.

Augmentation Approach in Our Study
Since NMT is data-driven, the aforementioned augmentation approaches are very effective in enhancing machine translation accuracy and quality. The concern is to know which approach is suitable for our study. We observe that, for low-resource language pairs, researchers either augment the parallel corpus with pseudo-sentence pairs or exploit language data other than the parallel corpus. Back-translation and forward translation are used in many research papers, combined with other modifications. In our research, we use back-translation on an English-Bengali translation model and explore different decoding methods (sampling strategies) in back-translation, which is rare in previous research papers. Notably, top-k random sampling [35] and random sampling with temperature T [36] are the most commonly used decoding methods and are very effective in translation, as evaluated by the authors in [35,36]. We now explain why we do not employ the other augmentation approaches mentioned in this literature. We start with pre-training on monolingual data: we observe that it is mostly used (and more effective) in masked language modeling and next-sentence prediction rather than machine translation. Next is the pivot language technique. As stated earlier, it employs rich-resource language pairs that are different from the source and the target, such as using English as the pivot between German and French by training a German-English model and an English-French model. Thus, it is not effective if we do not have rich datasets for the two pairs that use the pivot language as a medium. The last approach is leveraging multi-modal data. In our domain, it is difficult to obtain such datasets, and it is also resource-demanding, since we would need to train both a language model and a vision model. To this end, we find that back-translation with sampling strategies can effectively augment our machine translation study.

Methodology
In this section, we demonstrate the approaches adopted in our study for machine translation, as well as the evaluation metrics used in our experiments. Notably, for MT, we study the effect of model design and back-translation on the quality of MT models. Our focus is on low-resource English-Bengali translation.
As stated earlier, for our study we proceed with LSTM (without attention mechanism) and Transformer-based architectures, as demonstrated in Section 4. This baseline study evaluates whether Transformer models can perform better in English-Bengali machine translation. In our further study, we use the baseline Transformer as the basis and employ back-translation with the decoding strategies mentioned earlier. Moreover, we exploit the OpenNMT [67][68][69] toolkit to build the framework from input text in the source language to output text in the target language. OpenNMT is an open-source platform that provides various pre-processing and post-processing methods as well as encoder and decoder model structures for sequence transduction tasks. By using OpenNMT, we can explore the performance of customized LSTM and Transformer models in the English-Bengali translation task. Besides the selection of model structure, the training data is also critical for NMT tasks. However, as a low-resource language pair, English-Bengali lacks parallel data. Therefore, we employ back-translation to augment the training data.

Back-Translation with OpenNMT
The framework of back-translation we conduct is as follows:

•
We train an English-Bengali model and a Bengali-English model on the original parallel corpus.

•
We introduce a Bengali monolingual corpus and use the trained Bn-En model to translate it into English; thus, we have another English-Bengali corpus (pseudo-corpus).

•
We use the two corpora to train a new English-Bengali model.
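The three steps above can be sketched in Python. This is a minimal illustration, not our actual OpenNMT-py pipeline: `train_model` and the `translate` method are hypothetical stand-ins for the toolkit's training and inference commands.

```python
def back_translation(parallel, bn_monolingual, train_model):
    """Three-step back-translation: train both directions, create a
    pseudo-parallel corpus from monolingual Bengali, then retrain En->Bn."""
    en_side = [en for en, bn in parallel]
    bn_side = [bn for en, bn in parallel]

    # Step 1: train En->Bn and Bn->En on the original parallel corpus
    # (the En->Bn model serves as the baseline for comparison).
    en_bn_baseline = train_model(src=en_side, tgt=bn_side)
    bn_en = train_model(src=bn_side, tgt=en_side)

    # Step 2: back-translate monolingual Bengali into English to build
    # a pseudo-parallel (English, Bengali) corpus.
    pseudo = [(bn_en.translate(bn), bn) for bn in bn_monolingual]

    # Step 3: train a new En->Bn model on the original plus pseudo data.
    combined = parallel + pseudo
    return train_model(src=[en for en, bn in combined],
                       tgt=[bn for en, bn in combined])
```

The function returns the retrained En-Bn model; the pseudo-parallel pairs place the (noisy) machine-generated text on the source side and the authentic Bengali on the target side, as in standard back-translation.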
Notably, when preparing the pseudo-parallel data by back-translation, we use two decoding methods: top-k random sampling and random sampling with temperature T. To demonstrate them, we examine the last steps of predicting an output word. The direct output from the decoder is a vector; the official OpenNMT-py documentation (https://opennmt.net/OpenNMT-py/options/translate.html, accessed on 1 August 2024) [67][68][69] uses the term "logits" for it. The logits vector is processed by softmax to become a probability distribution, according to which the model samples the output word. Top-k sampling is easy to understand: supposing that k = 3, it simply keeps the 3 words with the highest probabilities and sets the logits of the other words to −∞ (−10000 in practice). The other strategy introduces a temperature T, a positive float variable, to modify the probability distribution: we divide the logits by T before feeding them into softmax, and all words in the vocabulary of the model remain candidates. For a better understanding, consider logits with 2 elements, {'I': 0, 'you': x}, where x can be any value, positive or negative. After softmax with temperature T (any positive value), P('you') becomes Equation (16):

P('you') = e^(x/T) / (e^(0/T) + e^(x/T)) = 1 / (1 + e^(−x/T)).    (16)

Figure 3 shows the resulting P('you') at different temperatures. At a low temperature (T = 0.1), if the logit of 'you' (x) is lower than that of 'I' (zero), then P('you') approaches 0 and thus P('I') approaches 1, and vice versa. Therefore, at a low temperature, the probability distribution becomes sharp, and the sampling algorithm concentrates on the top words. Meanwhile, at a high temperature (T = 50), the probability distribution becomes flat, and all words have similar sampling probabilities regardless of the raw decoder output (x). Using these two decoding methods, with different values of k or T, we investigate the performance of models trained on data with different levels of noise. First, we compare the performance of LSTM and Transformer networks in both translation directions. Then we choose the better architecture and back-translate (Bengali-English) the Bengali part of the training data to create a pseudo-parallel corpus, which we add to the original training set. We keep the validation set invariant for consistency. At this stage, we generate different translations by setting the random sampling parameter k = 1 (the argmax), 5, and 10. We also generate translations with T = 0.25, 0.5, 0.75, and 1. Note that if we let T → 0, the word with the maximum probability has P(word) = 1 and is always sampled; this is equivalent to top-k = 1 random sampling. In addition, setting k = +∞ in top-k sampling allows the model to sample from all tokens in the vocabulary, while setting temperature T = 1 leaves the probability distribution unmodified for all tokens in the vocabulary. Both cases are equivalent to the setup where neither top-k nor temperature sampling is used, denoted as "no strategy" in this article.
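Both decoding methods can be reproduced in a few lines of plain Python. The sketch below is illustrative, not OpenNMT-py's internal code: it applies the top-k logit filter (with the same −10000 stand-in for −∞) and the temperature-scaled softmax described above.

```python
import math
import random

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits; set the rest to -10000 (the practical -inf)."""
    threshold = sorted(logits, reverse=True)[:k][-1]
    return [z if z >= threshold else -10000.0 for z in logits]

def sample_index(probs):
    """Draw one index according to the given probability distribution."""
    r, cumulative = random.random(), 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.5, -1.0]  # raw decoder output for a 4-word vocabulary

p_topk = softmax(top_k_filter(logits, k=2))  # only 2 candidates survive
p_cold = softmax(logits, T=0.1)              # sharp: concentrates on the top word
p_hot = softmax(logits, T=50.0)              # flat: near-uniform sampling
```

With T = 0.1 the distribution collapses onto the highest-scoring word, while with T = 50 the four probabilities become nearly uniform, matching the sharp-versus-flat behavior discussed above.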
Finally, we train new models on the augmented training data and compare their performance on the test set, which is from another dataset. In the experiment, we set the decoding methods using OpenNMT.

Evaluation Metrics: BLEU
Bilingual Evaluation Understudy (BLEU) [12] is a widely used metric in machine translation. Given a sentence predicted by the model and a reference translation (usually generated by humans), the BLEU score represents the similarity between them. To compute BLEU, the first step is, for n = 1 to N, counting c_{n,total}, the total number of n-grams in the prediction, and c_{n,match}, the n-grams in the prediction that are also found in the reference translation. An n-gram stops contributing to c_{n,match} once it appears more frequently in the prediction than in the reference translation (clipped counting). We then obtain P_n = c_{n,match}/c_{n,total} and a brevity penalty as in Equation (17):

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r,    (17)

where c and r are the lengths of the prediction and the reference, respectively; the penalty is introduced to reduce the score for short predictions. Finally, BLEU is calculated as Equation (18):

BLEU = BP · exp(Σ_{n=1}^{N} W_n log P_n).    (18)

In practice, we set N = 4 and all weights W_n = 0.25, as in the original literature. By convention, we multiply the original BLEU value by 100 to obtain a score in the 0 to 100 interval.
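The computation can be made concrete with a short sentence-level sketch in Python. Note that the official metric is corpus-level and our experiments use standard tooling; this unsmoothed version is only for illustration of the clipped counts, the brevity penalty, and the weighted geometric mean.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(prediction, reference, N=4):
    """Sentence-level BLEU with clipped n-gram counts and brevity penalty."""
    log_precision = 0.0
    for n in range(1, N + 1):
        pred = ngram_counts(prediction, n)
        ref = ngram_counts(reference, n)
        # clipped matches: an n-gram counts at most as often as it
        # appears in the reference
        c_match = sum(min(count, ref[gram]) for gram, count in pred.items())
        c_total = sum(pred.values())
        if c_match == 0 or c_total == 0:
            return 0.0  # no smoothing: an empty precision zeroes the score
        log_precision += (1.0 / N) * math.log(c_match / c_total)  # W_n = 0.25
    c, r = len(prediction), len(reference)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty, Eq. (17)
    return 100.0 * bp * math.exp(log_precision)   # Eq. (18), scaled to 0-100
```

A prediction identical to its reference scores 100, while a shortened prediction is penalized by the brevity factor even when all its n-grams match.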

Experiments
In this section, we present our experiments. We start by demonstrating the datasets used, then the pre-processing tasks and model configuration, and finally we show the experimental results.

Datasets
For English-Bengali machine translation, there are limited parallel corpora (∼1 k to ∼0.1 M sentences) compared with rich-resource language pairs like English-Chinese (∼10 M to ∼100 M sentences). Most of the datasets have parallel contents that are translated or checked by humans, while some are automatically crawled and scored by algorithms. We obtain the following freely accessible datasets from OPUS [88], except SUPara.

•
WikiMatrix [89] is collected from Wikipedia in 1620 language pairs by data mining technology. The pages describing the same entity in different languages can be related, but it may be hard to construct a sentence-to-sentence alignment. Therefore, the corresponding contents are "comparable" rather than "parallel". For English-Bengali, it has 0.7 M pairs of sentences. In practice, we can separately use its English part and Bengali part as monolingual corpora.

•

GlobalVoices [90] is extracted from Global Voices news articles. The constructors leverage human evaluation to rate the aligned contents and then filter out low-quality translations. For the English-Bengali language pair, it contains 0.1 M pairs of sentences. However, the only source we find for the SUPara corpus is not freely available; therefore, we only obtain 500 pairs of sentences (known as the test set of the SUPara benchmark) and employ them as the final test set.
We refer to Table 1 for some example sentences from the aforementioned datasets. It can be observed that some of the translations do not convey the exact same meaning; we plan to address this issue.

Pre-Processing
In this section, we present the pre-processing tasks of our study [92].
•

Experimental Setup: In our study, we conducted our experiments on the supercomputer at Macau University of Science and Technology. The setup for a personal user is: CPU: Intel Xeon E52098; GPU: NVIDIA Tesla V100.

•

Data Filtering: Upon receiving the dataset, we first perform data filtering, pruning the low-quality segments, which helps optimize translation accuracy and quality. Without filtering, the dataset may include misalignments, duplicates, empty segments, and other issues.

•
Tokenization/Subwording: To train our model, we first need to build a vocabulary for the machine translation task. To this end, sentences are usually tokenized/split into words, which is called word-based tokenization. The issue is that, in this case, our model is limited to learning a fixed number of vocabulary tokens. To address this issue, subword tokenization is preferred over whole words. During translation, if the model encounters a new word/token that resembles one in its vocabulary, it may attempt to translate it instead of labeling it as "unknown" ("unk"). The most commonly used subwording approaches are byte pair encoding (BPE) and the Unigram model [93]. In our experimental analysis, we observe that the results of these two models are almost similar. Notably, both models are integrated with OpenNMT-py, with which, as stated earlier, we conduct our analysis.
In our study, we use the Unigram model for subword tokenization by default. Notably, after translation, we have to "desubword" the output text, employing the same subword tokenization model.

•
Data Splitting: We use the GlobalVoices, Tanzil, and Tatoeba datasets for both training and validation. We extract the first 500 pairs of sentences from each dataset to construct the validation set, then leave the rest for training. On the other hand, instead of extracting the test set from the datasets used for training and validation, we use the SUPara dataset (SUPara benchmark) for testing.
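The data-filtering step described above can be sketched with simple heuristics. This is an illustrative Python sketch under assumed rules (empty-segment removal, exact-duplicate removal, and a length-ratio bound as a stand-in for misalignment detection), not the exact filter used in our pipeline.

```python
def filter_corpus(pairs, max_ratio=3.0):
    """Prune empty segments, exact duplicates, and likely misalignments.

    `max_ratio` bounds the token-length ratio between the two sides; a
    large imbalance usually signals a misaligned pair.
    """
    seen, clean = set(), []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:          # empty segment
            continue
        if (src, tgt) in seen:          # duplicate pair
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue                    # suspicious length imbalance
        seen.add((src, tgt))
        clean.append((src, tgt))
    return clean
```

Real filtering pipelines may add language identification and alignment scoring on top of these rules.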
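The "desubword" post-processing step is mechanical once the subword convention is known. The sketch below assumes SentencePiece-style pieces, where "▁" (U+2581) marks a word boundary; the actual tokenization in our pipeline is performed by the trained Unigram model.

```python
# "▁" (U+2581) is the word-boundary marker used by SentencePiece models.
BOUNDARY = "\u2581"

def desubword(pieces):
    """Rejoin subword pieces into plain text by restoring word boundaries."""
    return "".join(pieces).replace(BOUNDARY, " ").strip()
```

For example, the pieces `▁trans`, `lation`, `▁works` are rejoined into the sentence "translation works".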
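The splitting scheme above (first 500 pairs of each corpus for validation, the rest for training, with the test set drawn from a separate corpus) can be expressed directly; `split_datasets` is our own helper name, not a toolkit API.

```python
def split_datasets(datasets, n_valid=500):
    """Take the first `n_valid` pairs of every dataset for validation and
    keep the rest for training; the test set comes from a separate corpus."""
    train, valid = [], []
    for pairs in datasets:
        valid.extend(pairs[:n_valid])
        train.extend(pairs[n_valid:])
    return train, valid
```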
After the pre-processing tasks, we proceed to our experimental study.

Model Configuration
In this section, we present our model configuration. Notably, we begin our experimental analysis by comparing the performance of LSTM and Transformer models in both translation directions, and we conduct the analysis with OpenNMT-py. The basic configurations of the models are presented in Table 2. The simplification is intended to avoid overfitting of the models on the training set; custom-configured models can also be built. After the preliminary experiment with LSTM and Transformer, the better model is then used for back-translation. To train a translation model in the reversed direction (Bn-En), we swap the source and target vocabularies, as well as those of the training and validation data. Notably, when we obtain the translation file after passing the test dataset to the model, we need to desubword the translated output using the same subword tokenization model, as stated in the Tokenization/Subwording stage of the pre-processing section. After that, we evaluate the translated output using the BLEU metric. These are the post-processing stages of our study.
As stated earlier, we use the GlobalVoices, Tanzil, and Tatoeba datasets for both training and validation. We extract the first 500 pairs of sentences from each dataset to construct the validation set, then leave the rest for training. The vocabularies of both languages are generated from the first 200,000 pairs of sentences in the training set; in each vocabulary, words that appear only once in the training data are neglected. We train one Transformer and one LSTM model in each translation direction for 50,000 epochs and save a checkpoint every 1000 epochs. Then we compute the BLEU score on the validation set at each checkpoint.
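The vocabulary rule above (count tokens over the first 200,000 training sentences and drop singletons) can be sketched as follows; the helper name `build_vocab` is our own, not an OpenNMT-py API.

```python
from collections import Counter

def build_vocab(sentences, max_sentences=200_000, min_count=2):
    """Build a vocabulary from the first `max_sentences` sentences,
    neglecting words that appear only once in the training data."""
    counts = Counter(token for sent in sentences[:max_sentences]
                     for token in sent.split())
    return {word for word, c in counts.items() if c >= min_count}
```

For instance, on the toy corpus ["the cat", "the dog", "a cat"], only "the" and "cat" survive, since "dog" and "a" are singletons.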

Experiment Results and Discussion
In this section, we present the results of our experiments and discuss them. As Figure 4 shows, the Transformer models in both translation directions converge rapidly and outperform the LSTM models, which need many more training epochs and show lower scores. This phenomenon implies that we can choose better architectures, like the Transformer, to optimize the translation. Notably, according to our observation, the Bengali-English Transformer (used for back-translation) has a higher BLEU score than the En-Bn LSTM. One probable explanation is that a single sentence in English corresponds to several translations based on the status of the speakers, but that extra information is lost in the corpus. Meanwhile, the result is reversed for the Transformer, which warrants further research. We then use the best Transformer Bn-En checkpoint (11,000 epochs) to back-translate the Bengali part of the WikiMatrix dataset. We obtain 7 versions of the translation, corresponding to different generation strategies. For the top-k random sampling strategy, we randomly sample from the k most probable candidate words when generating the translation, with k = 1, 5, 10. The other method randomly samples from all candidate words with the probability modified by temperature T, with T = 0.25, 0.5, 0.75, and 1.
As discussed earlier, T → 0 (frozen) means that only the word with the highest probability survives, while the other choices have zero probability; this case is equivalent to top-1 random sampling (argmax). Similarly, T = 1 is equivalent to the "no strategy" setup because dividing by 1 does not change the probability. For large T (suppose T → +∞), the probability distribution becomes flat, which means the model degrades into an untrained situation (every word has the same probability). This destroys the modeled probability distribution and causes many writing errors to appear in the back-translated text; therefore, we do not examine higher temperatures.
The translations are separately added to the training data; thus, we have seven versions of augmented training sets. Next, we train a new Transformer model on each of the seven augmented training sets for 50,000 epochs, pick their best checkpoints on the validation set, and compare their performance on the test set. Instead of extracting the test set from the datasets used for training and validation, we use a small part of the SUPara dataset (the SUPara benchmark) for testing. With this approach, we expect to better examine generalizability.
Table 3 shows the results of different setups on the validation set and the test set. Note that k = 1 and T = 0 refer to the same case, the argmax sampling in back-translation, and that the top-all setup (k = +∞) and the T = 1 setup are the "no strategy" cases. From the tests, we obtain three findings. First, the Transformer model performs much better than LSTM on the base training set; therefore, the Transformer is promising for translating low-resource languages. Second, back-translation boosts translation performance, but adopting the top-k random sampling strategy weakens the effect of back-translation. Third, back-translation with temperature T sampling achieves higher BLEU with T = 0.5, while too low or too high T values reduce the enhancement.

Conclusion and Contributions
In this article, we develop a machine translation method to translate English texts into Bengali texts. In particular, we have evaluated the LSTM and Transformer architectures. The experiments show that the Transformer model outperforms LSTM in both directions of the English-Bengali translation task. This performance enhancement indicates that we can exploit the Transformer in designing better NMT architectures for low-resource language pairs like English-Bengali. We then investigate the effect of back-translation on the Transformer for the English-to-Bengali translation task. In our experiments, we analyze two decoding methods: top-k random sampling and random sampling with temperature T. According to the results, back-translation with proper parameters improves translation accuracy through data augmentation. As the experiments show, if we take top-k random sampling in back-translation, we should let all words in the vocabulary become candidates to achieve the best results. However, while the top-all (k = +∞) setup is the optimal case in the top-k strategy, it is theoretically equivalent to the "no strategy" setup, which can also be regarded as the T = 1 case in temperature T sampling, where we can optimize T for even better results. Therefore, to obtain the best back-translation output, it is better to adopt random sampling with temperature T and explore the optimal value of T between 0 and 1. We observed that using random sampling with temperature T = 0.5 in back-translation makes the model perform best. Therefore, we can repeatedly augment the English-Bengali corpus by back-translation incorporating random sampling with temperature T, which helps the model gain higher generalization ability and, finally, improves the accuracy and quality of translation. However, in the evaluation, we did not achieve a very high BLEU score. Notably, in our experiments, we built a simplified Transformer with only two blocks in the encoder and two blocks in the decoder, and only two heads in the multi-head attention, because of the lack of powerful computing resources and to reduce the carbon footprint. Above all, our study shows how we can augment the datasets for low-resource languages and, finally, how to enhance the learnability of the model toward optimizing machine translation tasks.

Limitations and Further Study
We can further optimize English-Bengali translation in several ways, as follows: • For English-Bengali NMT, we can augment the limited parallel corpus by back- and forward-translation. Since these methods simply need monolingual data, training models to generate monolingual text in English and Bengali will be beneficial for further study.

•
We can explore the explanation for the phenomenon that the BLEU score of Bengali-to-English translation by the Transformer is lower than that of English-to-Bengali translation.

•
We can develop an algorithm to automatically refine the existing parallel corpus.

•
We can exploit different datasets for training, validation, and testing. Compared with dividing one dataset into three parts, this method is expected to help select the model with the best generalization ability.

•
We can employ our study in domain-specific translation tasks. Achieving high accuracy in domain-specific machine translation is very challenging due to the lack of authentic parallel corpora. Notably, even with high-resource parallel corpora for domain-specific translation tasks, it is very complex for a model to understand the languages because of domain-specific vocabularies. Therefore, there is an urgent need to build domain-specific machine translation engines, and we plan to pursue this direction in the future.

Figure 3 .
Figure 3.The relation between x and P('you') at different temperatures T.

Figure 4 .
Figure 4. BLEU scores on the validation set over epochs for different models.
• Tanzil (Tanzil translations, https://tanzil.net/trans/, accessed on 1 August 2024) is a project that provides the Quran in different languages, including 0.2 M sentence pairs in English-Bengali. The translations are submitted by users. Notably, aligned bilingual versions can be downloaded from OPUS.
• Tatoeba (https://tatoeba.org/, accessed on 1 August 2024) is an open and free multi-language collection of sentences and translations. Its name means "for example" in Japanese. The version of the English-Bengali corpus we use was updated on 12 April 2023 and contains 5.6 k pairs of sentences. Most of the sentences in this dataset are selected from short daily conversations.
• Shahjalal University Parallel (SUPara) corpus [91] is a collection of aligned English-Bengali sentences. It is constructed with a balance of text sources, including literature, administrative texts, journalistic texts, and so on. It has 21,158 pairs of sentences.

Table 1 .
Parallel sentences from datasets.

Table 2 .
Training configuration. A simplified Transformer with only two blocks in the encoder and two blocks in the decoder (the default is 6 for each), with a hidden size of 512 and a feed-forward size of 512. We set two heads in the multi-head attention instead of the eight heads in the default setup. We designed this basic Transformer model because of the lack of powerful computing resources and to reduce the carbon footprint.

Table 3 .
English-Bengali NMT results on the test set. We boldfaced the best result for each set of comparisons. Note that in the base case, the training set is not augmented by back-translation.