High-Performance English–Chinese Machine Translation Based on GPU-Enabled Deep Neural Networks with Domain Corpus

Abstract: Automated machine translation has various applications in international commerce, medicine, travel, education, and text digitization. Because Chinese grammar differs from that of English and Chinese lacks clear word boundaries, translating from word-based languages (e.g., English) to Chinese is challenging. This article implements a GPU-enabled deep learning machine translation system based on a domain-specific corpus. Our system takes English text as input and uses an encoder-decoder model with an attention mechanism, based on Google's Transformer, to translate the text into Chinese. The model was trained using a simple self-designed entropy loss function and an Adam optimizer on English–Chinese bilingual sentences from the News domain of the UM-Corpus. The parallel training process of our model can be performed on common laptops, desktops, and servers with one or more GPUs. At training time, we not only track the loss over training epochs but also measure the quality of our model's translations with the BLEU score. We also provide an easy-to-use web interface for users to manage corpora, training projects, and trained models. The experimental results show that we can achieve a maximum BLEU score of 29.2. We can further improve this score by tuning other hyperparameters. The GPU-enabled model training runs over 15x faster than on a multi-core CPU, which shortens the turn-around time. As a case study, we compare the performance of our model to that of Baidu's, which shows that our model can compete with the industry-level translation system. We argue that our deep-learning-based translation system is particularly suitable for teaching purposes and small/medium-sized enterprises.


Introduction
Currently, machine learning (ML) is experiencing a renaissance where deep learning (DL) has been the main driving force. Deep neural networks (DNNs) are extremely powerful machine learning models that can achieve promising performance on challenging problems such as speech recognition [1,2] and visual object recognition [3][4][5][6]. In particular, due to the capacity of capturing complex linguistic structures, DNNs have enabled great breakthroughs in natural language processing (NLP) [7][8][9]. Among the NLP tasks, machine translation (MT) is a successful representative, and its main task is using computer software to translate text or speech from one language to another.
It is a common belief that machine translation has experienced three major development waves: rule-based machine translation (RMT) [10], statistical machine translation (SMT) [11], and neural machine translation (NMT) [12]. SMT has been the mainstream approach during the past two decades. However, this approach may ignore long-range dependencies beyond the length of phrases and thus cause inconsistencies in translation results, such as incorrect gender agreement. It also relies on many separate components, such as word aligners, translation rule extractors, and other feature extractors. Compared with the SMT platform for NMT. We conclude that our translation system is platform-portable, which makes it suitable for teaching purposes and for use in small/medium-sized enterprises. As a case study, we compare the performance of our model to that of Baidu's and show that our model can compete with the production-level translation system.
To summarize, our contributions are as follows:
• We present a Transformer-based machine translation system, which is built from scratch, based on a domain corpus.
• Our translation system is easy to use with a web interface, so that domain experts can extend the existing corpus and train the model in a fine-tuned manner.
• Our machine learning system is both portable and configurable and can be deployed on laptops, desktops, or servers with multiple GPUs.
• Our translation system, trained on a domain corpus, can achieve performance competitive with a production-level translation system.

Background and Related Work
This section introduces the background of neural machine translation (NMT) and emphasizes prior works on the English-Chinese machine translation.

Neural Machine Translation
Due to the proliferation of deep learning, using deep neural networks for machine translation tasks has gained great attention. Kalchbrenner and Blunsom are generally regarded as having proposed the first successful DNN-based machine translation model, which introduced a new concept for machine translation [24]. Compared with other models, the NMT model needs less linguistic knowledge while producing competitive performance. Since then, many researchers have shown that NMT can perform much better than SMT.

Formulating the NMT Task
In the MT task, the language model (LM) provides the most important information: the probability of a particular word (or phrase) conditioned on the previous words. Thus, the key to improving translation performance is to build a better language model. The NMT task is designed as an end-to-end learning task that directly maps a source sequence to a target sequence. The learning objective is to find the correct target sequence given the source sequence, which can be seen as a high-dimensional classification problem that tries to map the two sentences in the semantic space.
Given a parallel corpus $C$ consisting of parallel sentence pairs $(x, y)$, the training objective is to maximize the log-likelihood $L$ with respect to $\theta$, as shown in Equation (1):
$$L(\theta) = \sum_{(x, y) \in C} \log p(y \mid x; \theta) \qquad (1)$$
where $x = x_1, \ldots, x_n$ denotes an input sentence, $y = y_1, \ldots, y_m$ represents its translation, and $\theta$ is the set of parameters to be learned. Given the source sentence, the probability of a target sentence is calculated as shown in Equation (2):
$$p(y \mid x; \theta) = \prod_{j=1}^{m} p(y_j \mid y_{<j}, x; \theta) \qquad (2)$$
where $m$ is the number of words in $y$, $y_j$ is the currently generated word, and $y_{<j}$ are the previously generated words. At inference time, beam search is typically used to find the translation that maximizes this probability.
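The factorization in Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes that the likelihood is accumulated in log space and that the per-token probabilities have already been produced by some model; the function names and the toy probabilities are hypothetical and are not part of our system.

```python
import math

def sentence_log_prob(step_probs):
    """Return log p(y | x) given p(y_j | y_<j, x) for each target position j."""
    return sum(math.log(p) for p in step_probs)

def corpus_log_likelihood(corpus_step_probs):
    """Sum the sentence log-probabilities over all sentence pairs in the corpus."""
    return sum(sentence_log_prob(probs) for probs in corpus_step_probs)

# Hypothetical per-token probabilities for two target sentences.
toy_corpus = [[0.5, 0.25], [0.8, 0.5, 0.5]]
log_likelihood = corpus_log_likelihood(toy_corpus)
```

Working in log space turns the product over target positions into a sum, which avoids numerical underflow for long sentences.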

The NMT Structure
The most commonly used NMT approach is the "Embed-Encode-Attend-Decode" paradigm, which is illustrated in Figure 1. When the encoder receives a source sentence, it reads the sentence word by word and compresses the variable-length sequence into a fixed-length vector. This process is called encoding: the encoder first converts the words in the source sentence into word embeddings. These word embeddings are then processed by neural layers and converted to representations that capture contextual information. These contextual representations are called the encoder representations. The decoder uses an attention mechanism, the encoder representations, and previously generated words to generate the decoder representations, which in turn are used to generate the next target word. The encoder and decoder can be built from RNNs [13], CNNs [14], or self-attention and feed-forward layers [15]. While NMT has shown great potential in capturing the dependencies inside a sequence, its performance degrades sharply when the input sentences are too long. This is due to the limited representational capacity of a fixed-length vector. The attention mechanism was introduced to address this limitation. It works as an intermediate component between the encoder and the decoder and models word correlations dynamically (Figure 1). In fact, the inspiration for applying the attention mechanism to NMT comes from human behavior in reading and translating text: human beings often read a text repeatedly to mine the word dependencies within a sentence.

NMT with Attention Mechanism
Recently, fully attention-based NMT has shown promising performance. In particular, the attention mechanism has worked as a driving force in text feature extraction rather than having an auxiliary role. Among them, Transformer is one representative, which is a fully attention-based model from Google [15].
Different from prior RNN- or CNN-based models, Transformer is a fully attention-based NMT model. That is, it consists of self-attention with feed-forward connections, which act as a feature extractor that allows the entire sentence to be "read" and modeled at once. It is common practice to stack multiple layers, which leads to improved translation quality.
Formulating Self-Attention Layers. The attention mechanism is calculated across the decoder and encoder as shown in Equations (3) and (4):
$$e_{ji} = a(s_{j-1}, h_i) \qquad (3)$$
$$\alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{k} \exp(e_{jk})} \qquad (4)$$
where $e_{ji}$ is an alignment score, $a$ is an alignment model that scores the match level of the inputs around position $i$ and the output at position $j$, $s_{j-1}$ is the decoder hidden state of the previously generated word, $h_i$ is the encoder hidden state at position $i$, and $\alpha_{ji}$ is the resulting attention weight. The calculated attention vector is then used to weight the encoder hidden states to obtain a context vector as
$$c_j = \sum_{i} \alpha_{ji} h_i \qquad (5)$$
This context vector is fed to the decoder along with the previously generated word and its hidden state to produce a representation for generating the current word. A decoder hidden state for the current word $s_j$ is computed by
$$s_j = g(s_{j-1}, y_{j-1}, c_j) \qquad (6)$$
where $g$ is a decoder activation function, $s_{j-1}$ is the previous decoder hidden state, and $y_{j-1}$ is the embedding of the previous word. The current decoder hidden state $s_j$, the previous word embedding, and the context vector are fed to a feed-forward layer $f$ and a softmax layer to compute a score for generating a target word as output:
$$p(y_j \mid y_{<j}, x) = \mathrm{softmax}(f(s_j, y_{j-1}, c_j)) \qquad (7)$$
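As a small illustration of the weighting step described above, the following Python sketch normalizes a list of alignment scores with a softmax and uses the resulting weights to average the encoder hidden states into a context vector. The alignment model itself is not implemented; the scores and hidden states are hypothetical inputs chosen for illustration.

```python
import math

def softmax(scores):
    """Normalize alignment scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, encoder_states):
    """Weight the encoder hidden states h_i by the attention weights."""
    weights = softmax(scores)
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)]

# Hypothetical alignment scores for three source positions (all equal here).
scores = [0.0, 0.0, 0.0]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = context_vector(scores, states)
```

With equal scores, every source position receives the same weight, so the context vector is simply the mean of the encoder states.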

NMT Model Training
When training an NMT model, the first step is to convert the words to vectors, i.e., word embedding. The most frequently used words in a language are kept in the vocabulary, and the remaining words are treated as unknown words. To overcome the problem of unknown words, the most common practice is subword tokenization with methods such as byte-pair encoding (BPE) [25], the word-piece model (WPM) [26], or the sentence-piece model (SPM) [27].
During training, the encoder-decoder model is fed with a parallel corpus. The learning objective is to minimize the cross-entropy loss between the predicted target words and the actual target words in the reference. The model parameters are initialized randomly. The training process can be formulated as repeatedly updating the parameters until the loss of the neural network reaches a minimum. This loss minimization is an optimization problem, and we can use gradient descent methods such as SGD, Adam, ADAGRAD, and Adafactor [28]. Among them, Adam makes fast progress early in training but can suffer in final convergence. In contrast, SGD can converge better, but it requires a long training time. Designing a learning schedule that combines several optimizers can help train a model efficiently [29].
Training runs for a large number of iterations until the model converges, that is, until the model evaluation no longer changes by a significant amount over iterations. In the implementation, the parameters are updated after each batch of training samples is processed. Hyperparameters, including the learning rate and the number of layers, also have to be tuned carefully.

Model Evaluation
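It is common to use the bilingual evaluation understudy (BLEU) metric to evaluate NMT tasks. This metric measures the difference between a model-generated target sentence and its reference sentence. BLEU is defined in Equations (8) and (9):
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \qquad (8)$$
$$\mathrm{BP} = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases} \qquad (9)$$
where $p_n$ is the modified (clipped) n-gram precision, $w_n = 1/N$ is its weight, $\mathrm{BP}$ is the brevity penalty, $c$ is the length of the translated sentence, $r$ is the length of the reference sentence, and $N = 4$. The larger the BLEU score, the better.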
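To make the metric concrete, the sketch below computes a sentence-level BLEU score for one candidate against a single reference, using clipped n-gram precision, uniform weights, and a brevity penalty. It is a simplified illustration without smoothing and is not the evaluation script used in our experiments; scores reported for NMT systems are usually computed at the corpus level.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n for one candidate and one reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1 / max_n, no smoothing."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is zero here
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty * math.exp(log_mean)
```

Clipping prevents a candidate from being rewarded for repeating a reference word more often than it occurs in the reference.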

English-Chinese Machine Translation
English-Chinese machine translation has been investigated for several decades. We summarize the prior NMT work in terms of designing new learning models and leveraging language features.

Designing New Learning Models
Hassan et al. address the problem of how to define and accurately measure human parity in translation and describe Microsoft's machine translation system [30]. They report that the translation quality of their latest neural machine translation system reaches human parity. To address the issue of duplicate or missing translations, Lin et al. propose neural machine translation improvements based on a novel beam search evaluation function [31]. They show that the proposed methods can effectively improve English-to-Chinese translation quality.
SMT often performs better than NMT in translation adequacy and word coverage. Thus, it is a promising direction to combine the advantages of NMT and SMT. Zhou et al. propose a deep neural network-based system combination framework leveraging both minimum Bayes-risk decoding and multi-source NMT, which take as input the N-best outputs of NMT and SMT systems and produce the final translation [32]. This approach has been shown to significantly outperform the conventional system combination methods.
Xiong et al. propose to enhance encoding components with different levels of composition [33]. This model takes (1) the original word embedding for raw encoding with no composition and (2) a particular design of external memory in a neural turing machine (NTM) for more complex compositions. An empirical study on Chinese-English translation shows that their model can improve by 6.52 BLEU points. Wang et al. describe the Sogou neural machine translation systems for the WMT 2017 Chinese-English news translation tasks [34]. Their translation systems are built based on a multi-layer encoder-decoder architecture with attention mechanism. The best translation is obtained with ensemble and reranking techniques. Their translation system achieved the highest BLEU among all 20 submitted systems.
Tencent neural machine translation systems were designed for the WMT 2020 news translation tasks [35]. Their systems are built on deep Transformer and several data augmentation methods. They propose a boosted in-domain finetuning method to improve single models. They achieved a BLEU score of 36.8 on the Chinese-English task. In 2021, Tencent introduced a system based on the Transformer with several novel and effective variants. Their constrained systems achieve very good BLEU scores.

Leveraging Chinese Features
Chinese phonologic features play an important role in the sentence pronunciation. To improve the machine translation performance, Yang et al. propose a novel phonology-aware neural machine translation (PA-NMT) model where Chinese phonologic features are leveraged for translation tasks with Chinese as the target [36]. A separate recurrent neural network (RNN) is constructed in the NMT framework to exploit Chinese phonologic features to facilitate the generation of more native Chinese expressions. Experimental results on the English-to-Chinese task show that the proposed method significantly outperforms state-of-the-art baselines.
Neural machine translation (NMT) faces the challenge of out-of-vocabulary (OOV) word translation. Han et al. address this OOV issue and improve the NMT adequacy with a harder language, such as Chinese, whose characters are even more sophisticated in composition [37]. They integrate the Chinese radicals into the NMT model with different settings to address the unseen word challenge in Chinese-to-English translation. The experiments on standard Chinese-to-English NIST translation shared task data from 2006 and 2008 show that their designed models outperform the baseline model in a wide range of state-of-the-art evaluation metrics.

Our Methods
This section provides a detailed description of our methods for English-Chinese translation based on the Transformer model. We introduce our methods in terms of word segmentation, data preprocessing, model training, and deployment.

Word Segmentation
The task of word segmentation is to divide a string of written language into its component words. In English, the space is a good approximation of a word delimiter. However, the equivalent to word spacing is missing in languages such as Chinese.
Here, we use the BPE (byte pair encoding) algorithm to perform tokenization on the raw dataset [25]. Specifically, we use SentencePiece, which is an unsupervised text tokenizer for neural network-based text generation systems, where the vocabulary size is predetermined prior to the neural model training. SentencePiece allows us to make an end-to-end system that does not depend on language-specific pre-/post-processing.
For English, the segmenter splits the punctuation and separates some affixes such as possessives. For Chinese, which is written without spaces between words, SentencePiece treats the input text as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. Thus, we can detokenize the text without any ambiguities. Based on this tool, we can perform word segmentation efficiently, which is shown in Figure 2. Each sentence is transformed into a sequence of integers, each integer being the index of a token in the dictionary. Only the top N frequent words will be taken into account. The N is set to 32,000 for both the English and the Chinese vocabulary.
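The mapping from tokens to integer indices with a capped vocabulary can be sketched as follows. This plain-Python stand-in only illustrates the idea of keeping the N most frequent tokens; it does not use the SentencePiece API, and the `<unk>` symbol, the function names, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical symbol for out-of-vocabulary tokens

def build_vocab(tokenized_sentences, max_size):
    """Keep the max_size most frequent tokens; index 0 is reserved for UNK."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {UNK: 0}
    for token, _ in counts.most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Map each token to its integer index, falling back to UNK."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
vocab = build_vocab(corpus, max_size=2)
ids = encode(["the", "cat", "ran"], vocab)
```

In this toy run, only "the" and "cat" fit into the vocabulary, so "ran" is mapped to the unknown-token index.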

Data Preprocessing and Loading
After training the word segmentation model, we preprocess the data with the PyTorch utility torch.utils.data.Dataset. Preprocessing consists of three steps. First, we reorder the English-Chinese parallel corpus according to the length of the English sentences, so that the sentences within a batch have similar lengths. Second, we perform word segmentation on the parallel sentences with the trained model and map each token to a unique ID in the vocabulary. Third, we add a starting symbol and an ending symbol to each encoded sentence, which yields the tensors used for model training.
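The three preprocessing steps can be sketched in plain Python as follows. The sketch assumes that the sentences have already been converted to token IDs; the special-token IDs, the padding step, and the helper names are hypothetical, and the real pipeline wraps comparable logic in a torch.utils.data.Dataset.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical special-token IDs

def make_batches(pairs, batch_size):
    """Sort pairs by source length, add BOS/EOS, and pad each batch."""
    ordered = sorted(pairs, key=lambda pair: len(pair[0]))
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        src = [[BOS] + s + [EOS] for s, _ in chunk]
        tgt = [[BOS] + t + [EOS] for _, t in chunk]
        batches.append((pad(src), pad(tgt)))
    return batches

def pad(sequences):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [seq + [PAD] * (width - len(seq)) for seq in sequences]

# Hypothetical (source IDs, target IDs) pairs.
pairs = [([5, 6, 7], [8, 9]), ([5], [8]), ([5, 6], [8, 9, 10])]
batches = make_batches(pairs, batch_size=2)
```

Sorting by length before batching keeps the amount of padding per batch small.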

Model Training
We use the PyTorch functional API to create a Transformer model [15]. After shuffling the news dataset, we create training, validation, and test sets with a 70-10-20% split. As a result, there are 176,943 bilingual sentences for training, 25,278 for validation, and 50,556 for testing. The network is trained with a batch size of 32, the Adam optimizer, and a self-designed loss function for a total of 40 epochs. The training loss decreases to 2.08, and the validation loss converges to 4.10. The parameters of our trained model occupy around 400 MB. The following subsections describe how we train the Transformer-based NMT model.

The Transformer Model
Rather than using the RNN or CNN structure, Transformer is the first encoder-decoder model purely relying on the self-attention mechanism [15]. To be exact, Transformer only consists of self-attention and a feed-forward neural network. A real-world neural network can have many stacked encoder layers and decoder layers. Figure 3 shows the network architecture of the Transformer model. We see that the encoder consists of multi-head self-attention and a position-wise feed-forward network. It takes the sum of input embedding and positional embedding as input. The decoder consists of masked multi-head self-attention, multi-head self-attention, and a position-wise feed-forward network. The decoder takes the sum of output embedding and positional embedding as input. A typical transformer model has 6 layers of encoder and 6 layers of decoder. However, our trained model is much larger than this configuration and is available upon request.

Training Optimizer
We used the Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. We varied the learning rate (lr) over the course of training according to Equation (10):
$$lr = d_{\mathrm{model}}^{-0.5} \cdot \min\left(\mathrm{step\_num}^{-0.5},\ \mathrm{step\_num} \cdot \mathrm{warmup\_steps}^{-1.5}\right) \qquad (10)$$
where $d_{\mathrm{model}}$ is the model dimension, step_num denotes the current step number, and warmup_steps is the number of steps used to warm up the training process. This corresponds to increasing the learning rate linearly for the first warmup_steps training steps and decreasing it thereafter proportionally to the inverse square root of step_num. Here, warmup_steps = 4000. Figure 4 shows the implementation of the training optimizer. These parameters were selected by trial and error.
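The warm-up schedule described above can be written as a short Python function. This is a sketch that follows the schedule of the original Transformer paper [15]; the scaling by the model dimension and the default `d_model=512` are assumptions for illustration and do not necessarily match the configuration of our trained model.

```python
def transformer_lr(step_num, d_model=512, warmup_steps=4000):
    """Learning rate with linear warm-up followed by inverse-square-root decay."""
    if step_num < 1:
        raise ValueError("step_num must start at 1")
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

# The rate peaks at the end of the warm-up phase.
peak = transformer_lr(4000)
```

Before step 4000 the second argument of `min` is the smaller one, so the rate grows linearly; afterwards the first argument takes over and the rate decays.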

Parallel Training
Training deep learning models with a large amount of training data is not a trivial task and is typically performed on a high-performance computing infrastructure with a large number of computing nodes or accelerators. Training NMT models also consumes a lot of computing resources. Thus, we choose to train our Transformer model in parallel.
Training NMT models can be parallelized in many forms, including data parallelism, model parallelism, pipeline parallelism, and hybrid forms of parallelism. In data parallelism, a number of workers each load an identical copy of the deep learning model. The training data are partitioned into non-overlapping chunks and fed into the model replicas of the workers for training. In model parallelism, the NMT model is partitioned, and each worker loads a different portion of the NMT model for training. The workers that hold the input layer of the model are fed with the training data. By contrast, pipeline parallelism combines the two aforementioned forms of parallelism.
In this work, we aim to train our NMT model on diverse available computing resources such as laptop CPUs, desktop CPUs, and server CPUs with one or multiple GPUs. We mainly use data parallelism to speed up the training process. As shown in Figure 5, the entire English-Chinese parallel corpus is partitioned into a large number of batches, and each batch of the training data is distributed to a processor of a GPU or multi-core CPU. Thus, we have to ensure that there are sufficient batches to fully utilize the whole GPU processor or multiple GPUs. Our implementation is built on the DataParallel module of the PyTorch framework. Note that we have to use suitable APIs to create the model and to move data between CPUs and GPUs. The batch size is set to 32.

[Figure 5: the training data is partitioned into batches, and each batch is distributed to a processor of GPUs or CPUs.]
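The idea behind data parallelism can be simulated in plain Python without any GPU. In the sketch below, each "worker" holds the same model parameter, computes the gradient on its own shard of a batch, and the per-worker gradients are averaged. The one-parameter linear model, the loss, and the function names are hypothetical, and the code is not the PyTorch DataParallel API; it assumes that the batch size is divisible by the number of workers.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_gradient(w, batch, num_workers):
    """Split a batch into equal shards, compute per-worker gradients, average them."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    worker_grads = [gradient(w, shard) for shard in shards]  # one replica per worker
    return sum(worker_grads) / num_workers

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
full = gradient(0.0, batch)
parallel = data_parallel_gradient(0.0, batch, num_workers=2)
```

With equal shard sizes, the averaged gradient matches the gradient of the whole batch, which is why the replicas stay synchronized after each update.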

Model Deployment
Once an NMT model has been trained, it can be used to translate a sentence into another language, i.e., the inference or decoding stage. Note that there is a clear distinction between training and inference: at decoding time, we only have access to the source sentence. We must initialize the Transformer model and load its parameters from the saved PyTorch model file. We must also convert the input sentence into a tensor, which is fed into the model (Figure 3). Then, the input sentence is translated into the output sentence with our trained model.
When doing inference, the simplest strategy is to select the most likely word at each step of the output sequence (greedy search). A more robust decoding algorithm is beam search, which expands all the possible next steps and keeps the k most likely, where k is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities. The development-set source sentences are decoded using combinations of beam size and length penalty, and the combination that gives the best evaluation metric score is selected.
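The decoding procedure can be sketched with a toy model in plain Python. The sketch assumes that the model is exposed as a function returning log-probabilities for the next token given a prefix; the probability table, the `</s>` end symbol, and all names are hypothetical, and no length penalty is applied.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Return the highest-scoring (tokens, score) pair found with the given beam size."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == eos:
                candidates.append((tokens, score))  # finished hypotheses carry over
                continue
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams[0]

# Hypothetical toy model: a table of next-token probabilities per prefix.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_model(prefix):
    """Look up next-token log-probabilities; unknown prefixes end the sentence."""
    probs = TABLE.get(tuple(prefix), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy_tokens, _ = beam_search(toy_model, beam_size=1, max_len=3)
beam_tokens, _ = beam_search(toy_model, beam_size=2, max_len=3)
```

In this toy example, a beam size of 1 commits to "a" in the first step and ends with a sequence of probability 0.33, whereas a beam size of 2 keeps "b" alive and finds the sequence "b z" with probability 0.36.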

A DL-NMT Web Interface
For ease of use, we develop a Web interface for the machine translation system, which is shown in Figure 6. We aim to provide users with an easy-to-use interface for model training, corpus management, and model deployment. Thus, it has three modules for corpus, project, and model management.

Corpus Management
Many prior works have shown that trained models can yield better prediction accuracy when using a domain-specific bilingual corpus [20][21][22]. To this end, we enable domain experts to add or edit paired sentences. In this way, the domain experts can build their own corpus. For instance, teachers who majored in Business English can integrate their expertise in the form of English-Chinese paired sentences into a corpus with our web interface. We also provide access to the existing bilingual corpus. At the initialization stage, users can choose to import the UM-Corpus of eight different text domains, including Education, Laws, Microblog, News, Science, Spoken, Subtitles, and Thesis, into our backend platform. Note that users can also edit the existing corpus with our web interface. With the help of this web interface, we aim to build a sufficiently large and diverse corpus from various domains to train NMT models.

Project Management
Our web interface provides a set of functions to manage the training process. We refer to an NMT model training process as a project. Users can create, edit, or delete a project. For ease of use, we provide a web interface to configure a model training task. The training parameters are listed in Table 1. Once the configuration is ready, a single button starts the training process. The trained model can be saved in a specified directory.
[Table 1. Training parameters. Recoverable row: whether or not to use a GPU, 0.]

Model Management
Our web interface manages the trained NMT models. With the interface, users can import or delete a model. In particular, we provide an interface for users to input an English sentence, make a translation, and output a Chinese sentence. This is particularly suitable for teaching purposes, e.g., a live classroom demonstration. That is, we train the NMT model with the interface and then use the interface to demonstrate how well the trained model performs.

Experimental Setup
Hardware and Software. Our DL-NMT system can be deployed onto laptops, desktops, and servers with GPUs. In this work, we run the translation system on the platforms shown in Table 2. The table also lists the frequency, the number of cores of each CPU, whether the platform has GPUs, the Linux kernel, and the GCC/OpenMP version.
Dataset Details. We use the news dataset from the UM-Corpus, which is a large English-Chinese parallel corpus. It provides two million English-Chinese sentence pairs from eight text domains, covering several topics and text genres, including Education, Laws, Microblog, News, Science, Spoken, Subtitles, and Thesis [23]. We train our models with the news subset of 252K sentences, consisting of 10,635K Chinese words and 5,672K English words. We split the samples into 176,943 training pairs, 25,278 validation pairs, and 50,556 test pairs. Note that our web interface also allows users to build a new corpus.

Training Accuracy
We train models on an NVIDIA RTX 2080Ti and an NVIDIA Titan RTX. Since their memory is limited (11 GB and 24 GB, respectively), we use batch sizes of 8 and 16, respectively, to avoid out-of-memory (OOM) errors. At the same time, to keep training stable, we use gradient accumulation to enlarge the effective batch size. Gradient accumulation accumulates the gradients of several batches before updating the network parameters. The number of batches accumulated for each update is called the accumulation step. Figure 7 shows the training process on these two GPUs. Each curve describes the trend of the loss (on the training set and validation set) and the BLEU value on the validation set. We see that the training process on both GPUs is basically consistent. As the number of epochs increases, the training loss continues to decrease, whereas the loss and BLEU values on the validation set stabilize about halfway through training. Too many training iterations would lead to overfitting, which has a negative impact on translation quality. Thus, we run the validation process every five iterations to detect overfitting. We compare the test performance of the models trained on the two GPUs in Table 3. The batch size (bs) and the accumulation step (step) used for gradient accumulation on each GPU are indicated in the table. The choice of GPU device has basically no impact on the training result. Gradient accumulation lets us achieve the same performance with small batches as with large batches. We use the beam search algorithm to look for the best solution in the inference stage. It expands the solution space relative to greedy search and reduces the complexity relative to exhaustive search. The beam size is an important parameter of the beam search algorithm, affecting both performance and efficiency.
We compare the test BLEU and the test time for different beam sizes (from 1 to 6) in Figure 8 and Table 4. The results indicate that both the translation performance and the time consumption increase as the beam size increases. However, when the beam size exceeds three, the performance improvement is negligible, while the time consumption increases steeply. To conclude, the best beam size is 3 on the Titan RTX.
The batch size and the compute capability of the processor affect the training speed. We first compare the training time (per epoch) when using different batch sizes on the same device. We choose the Titan RTX as the platform because it is fast enough and has enough memory. Figure 9 shows that the training speed increases as the batch size increases, but the growth rate gradually slows down.
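The gradient accumulation method described above can be illustrated with a toy example in plain Python. The sketch accumulates the gradients of several small batches and applies a single parameter update; the one-parameter linear model, the learning rate, and the function names are hypothetical and only serve to show that the update matches the one obtained from a single large batch.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    """Accumulate gradients over several micro-batches, then update once."""
    accumulated = 0.0
    for micro_batch in micro_batches:
        # Scale each contribution so the sum matches one large-batch gradient.
        accumulated += gradient(w, micro_batch) / len(micro_batches)
    return w - lr * accumulated

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
micro_batches = [data[:2], data[2:]]  # accumulation step of 2, micro-batch size 2
w_accumulated = accumulated_step(0.0, micro_batches, lr=0.01)
w_large_batch = 0.0 - 0.01 * gradient(0.0, data)
```

Only one micro-batch has to fit into GPU memory at a time, while the parameter update behaves as if the larger batch had been processed at once.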

Case Study
In addition to analyzing performance parameters, we also compare the actual translations of our model to those of the industry-level Baidu translation system. Here, the accuracy is judged based on the opinions of native speakers. We randomly select sentences of three different lengths as test cases and show the comparison results in Figure 11. Overall, our model can translate English to Chinese correctly, especially for short and medium-length sentences (case 1 and case 2). However, for long sentences (case 3), the accuracy of the translation is not yet sufficiently good. In addition, the beam search strategy (beam size of 3) is better than the greedy search strategy (beam size of 1) when decoding a sentence. In a nutshell, with a relatively short training time, our trained model is competitive with an industry-level product such as the Baidu translation system. In the future, we will use a larger bilingual corpus for improved translation quality.
Figure 11. The translation cases of using Baidu translation system and our model.

Discussion
Implementing a Transformer-based translation system from scratch is indeed not new. However, we believe that our translation system stands out and can be applied in several scenarios. Large pretrained models have achieved promising results and have been widely accepted. Although each increase in model size has brought significant performance improvements in downstream NLP tasks, training such models requires large-scale specialized computing hardware such as Google's TPUs. These computing clusters are typically unaffordable for small/medium-sized enterprises. Our translation system is portable across laptop CPUs, desktop CPUs, and server CPUs with one or multiple GPUs. Such platforms are typically affordable for small/medium-sized enterprises, and our translation system can be used as a research infrastructure for such companies.
On the other hand, the large pretrained models are too complicated, and their capacity is too large for us to understand. That is, we know the models perform well, but we do not know why. They work like a "black box" and are particularly unsuitable for teaching purposes. Instead, our translation system can be used as a teaching demonstration tool for students majoring in translation. In particular, we have provided a web interface to manage the corpus, model training, and model prediction to ease the use of our translation system. For instance, our system provides research professors with a web interface to collect their translation expertise so as to build a new corpus.

Conclusions
In this work, we have implemented a deep learning machine translation system based on a news corpus. The system takes English text as input and uses an encoder-decoder model with an attention mechanism, based on Google's Transformer, to translate the text into Chinese. The model was trained using a simple self-designed entropy loss function and an Adam optimizer on paired English and Chinese sentences from the news domain of the UM-Corpus. We train the model on high-end GPUs with a parallel approach. During training, we not only track the loss over training epochs but also measure the quality of our model's translations using the BLEU score. The experimental results on the UM-Corpus show that our trained model can achieve a maximum BLEU score of 29.2. We can further improve this score by tuning other hyperparameters and increasing the complexity of our model, as well as by training on a larger subset of the data to avoid biased results. As a case study, we compare the performance of our model to that of Baidu's and show that our model can compete with the production-level translation system.
For future work, we plan to train our models on large-scale GPU-based clusters. We also want to incorporate language features into the model to improve its translation quality. In addition, we will use a larger bilingual corpus for improved translation quality.
Author Contributions: Conceptualization, L.Z. and J.F.; methodology, L.Z., J.F., and W.G.; validation, W.G. and J.F.; writing-original draft preparation, L.Z. and W.G.; writing-review and editing, L.Z. and J.F. All authors have read and agreed to the published version of the manuscript.