Decoding Strategies for Improving Low-Resource Machine Translation

Abstract: Pre-processing and post-processing are significant aspects of natural language processing (NLP) application software. Pre-processing in neural machine translation (NMT) includes subword tokenization to alleviate the problem of unknown words, parallel corpus filtering, which retains only data suitable for training, and data augmentation to ensure that the corpus contains sufficient content. Post-processing includes automatic post editing and the application of various strategies during decoding in the translation process. Most recent NLP research is based on the Pretrain-Finetuning Approach (PFA). However, when small and medium-sized organizations with insufficient hardware attempt to provide NLP services, throughput and memory problems often occur. These difficulties increase when utilizing PFA to process low-resource languages, as PFA requires large amounts of data, and the data for low-resource languages are often insufficient. Utilizing the current research premise that NMT model performance can be enhanced through various pre-processing and post-processing strategies without changing the model, we applied various decoding strategies to Korean-English NMT, which relies on a low-resource language pair. Through comparative experiments, we proved that translation performance could be enhanced without changes to the model. We experimentally examined how performance changed in response to beam size changes and n-gram blocking, and whether performance was enhanced when a length penalty was applied. The results showed that various decoding strategies enhance the performance and compare well with previous Korean-English NMT approaches. Therefore, the proposed methodology can improve the performance of NMT models without the use of PFA; this presents a new perspective for improving machine translation performance.


Introduction
Natural language processing (NLP) is a subfield of artificial intelligence in which computers analyze human languages. In general, NLP is divided into three main categories: rules-based, statistics-based, and deep learning-based. In rules-based and statistics-based NLP application software, system performance depends on the performance of various subcomponents such as the part-of-speech tagger, syntactic parser, and semantic analyzer. In contrast, deep learning-based NLP application software operates in an end-to-end manner, and the performance of a model is independent of such subcomponents; the processes required for each step are handled simultaneously during training. Deep learning-based NLP application software has exhibited innovative performance in various NLP fields such as machine translation, speech recognition, and text processing.
The contributions of this study are as follows:
• It is proved that performance can be enhanced through various decoding strategies without changing the model. This finding may serve as a basis for pre-processing and post-processing research in the field of NMT. This study presents a methodology that can improve performance without using PFA, and it presents a new perspective for improving machine translation (MT) performance.
• An in-depth comparative analysis applies various decoding strategies to existing Korean-English NMT approaches. To the best of our knowledge, no prior studies on Korean-English NMT compare and analyze various decoding strategies. In this study, a performance comparison is made against existing Korean-English NMT research, with the objective test sets IWSLT-16 and IWSLT-17 as the criteria. The decoding strategies were applied in a pipelined form: we identified the optimal beam size through a comparison of beam sizes, applied n-gram blocking based on the optimal beam size, and applied the length penalty based on the optimal n. This gradual decoding process represents a novel approach.
• The distribution of the model was enhanced by releasing it in the form of a platform. This approach contributes to the removal of language barriers as more people adopt the model.
This paper is structured as follows: Section 2 reviews related work in the area of Korean-English NMT and provides background to aid understanding of our proposed approach. Section 3 discusses the proposed approach, and Section 4 describes the experiments and results. Section 5 concludes the paper.

Machine Translation
MT refers to the computerized translation of a source sentence into a target sentence; the advent of deep learning has significantly enhanced the performance of MT. Yehoshua Bar-Hillel began research on MT in 1951 at MIT [20]; since then, MT has developed in the following order: rules-based, statistics-based, and deep learning-based.
Rules-based MT (RBMT) [21,22] performs translation on the basis of traditional NLP processes such as lexical analysis, syntax analysis, and semantic analysis, in conjunction with linguistic rules established by linguists. For example, in Korean-English translation, this methodology accepts a sentence in Korean (the source language) as input, guides it through a process of morphological and syntactic analysis, and produces output that complies with the grammatical rules of English (the target language). This methodology can produce a perfect translation for sentences that fit established rules. However, it is difficult to extract certain grammatical rules, for which significant linguistic knowledge is needed. In addition, it is difficult to extend the system to additional languages, and system complexity is relatively high.
Statistics-based MT (SMT) [23,24] is a method that uses statistical information that is trained from a large-scale parallel corpus. Specifically, it uses statistical information to perform translations on the basis of the alignment and co-occurrences between words in a large-scale parallel corpus. SMT is composed of a translation model and a language model. It extracts the alignments of the source sentence and target sentence through the translation model and predicts the probability of the target sentence through the language model. Unlike RBMT, this methodology can be developed without linguistic knowledge, and the quality of the translation improves as the quantity of data increases. However, it is difficult to obtain a large amount of data, and it can be a challenge to understand context because the translation is conducted in units of words and phrases.
The NMT method performs translation via deep learning. It vectorizes the source sentence through the encoder using the sequence-to-sequence (S2S) model, decodes the vector through the decoder, and creates a sentence in the target language. It identifies the most suitable expression and translation outcome by using deep learning and considering the input and output sentences as a pair. In other words, this methodology comprises an encoder and a decoder. It vectorizes the source sentence through the encoder, condenses the information as a context vector, and generates the translated target sentence in the decoder based on the condensed information. Methodologies utilized for NMT include recurrent neural networks (RNNs) [25,26], convolutional neural networks (CNNs) [27,28], and the Transformer model [17]. The Transformer model has exhibited better performance than the other approaches. Further, the recent trend is to apply PFA techniques such as cross-lingual language model pre-training (XLM) [29], masked sequence-to-sequence pre-training (MASS) [30], and the multilingual bidirectional and auto-regressive transformer (mBART) [31] to NMT as well, and such strategies are currently providing the best performance. However, because PFA requires numerous parameters and large model sizes, it is still impractical for organizations to apply the strategies. Therefore, the Transformer model was selected as the optimal NMT model after considering factors such as performance, speed, and memory requirements that have been reported in previous research; experiments were then conducted based on this model.

Pre-and Post-Processing in Neural Machine Translation
In deep learning, pre-processing involves actions such as the refining, transformation, augmentation, and filtering of the data to ensure better performance before training a model. Post-processing involves transforming the results that the model predicted into a better form after training.
Many studies on the NMT model have focused on improving pre-processing. Subword tokenization research aims at resolving unknown-word issues and involves splitting the input NMT sentence into certain units. It addresses the out-of-vocabulary (OOV) problem and is an essential step in the NMT pre-processing stage. Representative examples include byte pair encoding (BPE) [8] and SentencePiece (SP) [9]; such methodologies have become necessary pre-processing operations in most NMT studies. Further, research on data augmentation has been conducted based on the fact that NMT requires a substantial amount of training data. Models with good performance, such as the Transformer model, improved their learning by utilizing numerous parallel corpora, although the construction of such corpora is expensive and time-consuming. Techniques for creating a pseudo-corpus were investigated to overcome this problem. The quantity of training data can be increased by transforming a monolingual corpus into a pseudo-parallel corpus (PPC) when these methodologies are used. Representative examples include back translation [12] and copied translation [13]. Back translation creates a PPC by using an existing trained opposite-direction translator to translate the monolingual corpus; the new PPC is then added to the existing parallel corpus for use during training. In other words, it exploits the fact that a bidirectional translation model can be created from one parallel corpus, which is an advantage of NMT. In contrast, the copied translation methodology utilizes only the monolingual corpus without employing an opposite-direction translator; it trains by inserting the same data into the source and the target. Research focused on parallel corpus filtering (PCF) [10] aims to increase the performance of models by only using high-quality training data during training.
If data obtained from the Internet are used as training data, a substantial portion will be noisy, and it is prohibitively difficult for humans to verify all of it. Therefore, PCF aims to remove data that are not suitable for training.
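The BPE merge procedure mentioned above can be sketched in a few lines of pure Python. This is a toy illustration of the merge loop only, not the reference implementation; `learn_bpe` and its toy inputs are names invented for this sketch.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a word-frequency dict (toy sketch)."""
    # Represent each word as a tuple of symbols, initially characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the best pair everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
# The most frequent pair ("l", "o") is merged first, then ("lo", "w").
```

In practice, tools such as SentencePiece handle this learning and segmentation end to end; the sketch only shows why frequent character sequences end up as single subword units.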
First, automatic post editing (APE), a type of post-processing [14], is a subfield of MT that aims to create high-quality translation results by using the APE model to automatically edit the NMT results, which is then compared with the translation results produced by an existing model. This does not imply that the NMT model itself is changed, but it is a research effort to create another NMT model that corrects the translation results of the original NMT model. Moreover, there are various decoding strategies. In the decoding task to generate a translation in NMT, decoding strategies such as n-gram blocking and length penalties are applied instead of simply generating the translation through the beam search process. Through this approach, the quality of the results predicted by the NMT model can be enhanced without changing the model structure.
Recently, a significant amount of research has been conducted on APE through WMT Shared Task and other venues. However, minimal research is being conducted on the various decoding strategies. Accordingly, we proved that the performance of Korean-English NMT could be enhanced through a comparative experiment on the various decoding strategies.

Korean-English Neural Machine Translation Research
MT-related research and services are currently active in Korea. Along with the Papago service that is being provided by Naver, organizations such as the Electronics and Telecommunications Research Institute (ETRI), Kakao, SYSTRAN, LLsoLLu, and Genie Talk of Hancom Interfree are providing MT services. In addition, research on Korean MT is being conducted by a partnership between Flitto, Evertran, and Saltlux; these organizations constructed a Korean-English parallel corpus that was recently published in the AI Hub.
Kangwon National University [32] was the first to conduct Korean-English NMT research by applying MASS [30]. MASS is a pre-training technique that randomly designates K tokens in the input, masks them, and trains the prediction of the masked tokens. Because the tokens that were not masked in the encoder are masked in the decoder, the decoder must predict the tokens that were masked by referring only to the hidden representation and attention information provided by the encoder. This provides an environment in which the encoder and decoder can proceed through pre-training together. K denotes the number of tokens that are masked. When K is 1, one token is masked in the encoder and the decoder predicts one token that went through masking. This creates the same effect as the masked LM of BERT [1]. When K is equal to m, which denotes the total length of the entered sentence, every token on the encoder side is masked; this creates the same effect as the standard LM of GPT [2]. Kangwon National University achieved high performance by applying this strategy to Korean-English NMT.
Sogang University proposed a methodology in which large quantities of out-domain parallel corpora and multilingual corpora are applied to the NMT model training process to compensate for the lack of a bulk parallel corpus [33]. In addition, studies on expansion of the Korean-English parallel corpus through the use of back translation [34] and on Korean-English NMT using BPE [8] have also been conducted [35]. However, although the experimental results in those studies are provided in the form of BLEU scores (from 1 through 4), it is difficult to compare the performance objectively because the final BLEU scores [36] are not revealed.
Korea University (KU) routinely conducts research on NMT pre-processing. They suggested two-stage subword tokenization (morphological segmentation + SentencePiece unigram), which is a subword tokenization specialized for the Korean language, and presented a paper that described applying PCF to Korean-English NMT [6]. They also proposed a methodology that conducts training with a relative ratio when composing batches, rather than simply applying back translation and copied translation when applying the data augmentation [7]. Through this strategy, performance was higher than that achieved through the simple application of back translation. In addition, they illustrated the importance of data by conducting a study in which the AI Hub corpus, published by NIA, was applied to the Transformer model and surpassed the performance achieved by previous Korean-English NMT research [37]. In conclusion, although research on the NMT model is important, KU conducted various experiments and studies illustrating the importance of pre-processing. Moreover, the model presented in their paper was made available through a platform, which increased its distribution. The platform was selected as a "Best practice of data utilization for NIA artificial intelligence training data" (http://aihub.or.kr/node/4525).

Models
As a follow-up to the paper "Machine Translation Performance Improvement Research through the Usage of Public Korean-English Parallel Corpus", published by KU [37], in this study we produced an NMT model using the same training data, test set, and model structure, based on the Transformer [17]. A performance comparison experiment was conducted by applying various decoding strategies. The overall model architecture is shown in Figure 1.
The Transformer is a methodology that uses only attention, without convolution or recurrence. Because the Transformer does not receive words in serial order, it adds positional information to the embedding vector of each word so that the model can learn each word's position, and uses this as the model's input. This is referred to as positional encoding. Positional encoding utilizing sine and cosine functions is added in the Transformer, and the encoding of each position is computed as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))    (1)

where pos indicates the position and i indicates the dimension. In other words, each dimension of the positional encoding corresponds to a sinusoid, and the wavelengths form a geometric progression in the range between 2π and 20,000π. These functions were chosen because of the theoretical presumption that they would help the model easily learn to attend to relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
This model is a structure that learns the attention between the input and output after learning the self-attention of each of the input and output based on queries, keys, and values. The attention weights are then calculated as follows: attention can be described as mapping a query and a set of key-value pairs to an output.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [17]. We compute the dot products of the query with all keys, divide each by √(d_k), and apply a softmax function to obtain the weights on the values; that is, Attention(Q, K, V) = softmax(QK^T / √(d_k)) V.
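This weighting scheme can be illustrated with a minimal, dependency-free sketch of scaled dot-product attention; plain Python lists stand in for the tensor operations of a real implementation, so this is an illustration of the formula rather than production code.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on plain lists."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Compatibility scores between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention(Q, K, V)
# The query matches the first key more closely, so the output leans
# toward the first value vector.
```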
Because computational parallelism is possible, the training speed is faster than that of other models, and it is currently exhibiting good performance in the MT field. Numerous organizations are providing MT services based on this model.
In conclusion, we conducted an experiment that compared various decoding strategies on the basis of the KU model, which was in turn based on the Transformer.

Decoding Strategies
We applied various decoding strategies in the form of pipelining. First, we found the optimal beam size by comparing the performance achieved using different beam sizes and applied n-gram blocking based on this beam size. Subsequently, we found the optimal n through a performance comparison experiment and applied the length penalty.

Beam Search and Beam Size
Beam search is a method that increases computational efficiency by limiting the number of candidates retained during the decoding process of NMT to K, where K denotes the beam width or beam size. Performance is maximized when the hypothesis with the highest cumulative probability is chosen after considering every possible case; however, given the time complexity and speed involved, practical use of such exhaustive search is almost impossible.
Greedy decoding is a methodology that does not consider every possible case; it executes a translation by simply selecting the candidate with the highest probability at each step. However, even a single wrong prediction can exert a fatal effect on overall performance, and such an error cannot be corrected by optimal choices in subsequent steps. The beam search method was designed to overcome this problem; it is a practical alternative positioned between greedy decoding and exhaustive consideration of every possible case. Most NMT processes use beam search to perform decoding. Accordingly, in this study, the beam size was incremented from 1 to 10, and the resulting performance impact was examined experimentally. Algorithm 1 shows the details of the beam search process.

Algorithm 1 Beam Search.

set BeamSize = K
h_0 ⇐ Transformer-Encoder(S)
t ⇐ 1
// L_S denotes the length of the source sentence; α is a length factor
while t ≤ α · L_S do
    retain the top-K candidate translations at time step t
    t ⇐ t + 1
end while

Beam search can be defined as the retention of the top-k possible translations as candidates at each time step, where k refers to the beam width. New candidate translations are created at the next time step by combining each candidate with a new word. The new candidates then compete with each other in log probability to produce the new top-k most plausible results. The process continues until the translation ends.
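The loop in Algorithm 1 can be sketched as follows. Here `next_token_log_probs` is a hypothetical stand-in for the Transformer decoder's next-token distribution, and the toy model below exists only to make the sketch runnable.

```python
import math

def beam_search(next_token_log_probs, start, eos, beam_size, max_len):
    """Keep the top-k partial translations by cumulative log probability."""
    beams = [([start], 0.0)]  # (token sequence, cumulative log probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:  # finished hypotheses compete unchanged
                candidates.append((seq, score))
                continue
            for tok, logp in next_token_log_probs(seq).items():
                candidates.append((seq + [tok], score + logp))
        # Retain only the top-k candidates at this time step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams

# Toy next-token distribution: after "<s>" prefer "a"; afterwards prefer "</s>".
def toy_model(seq):
    if seq[-1] == "<s>":
        return {"a": math.log(0.6), "b": math.log(0.4)}
    return {"</s>": math.log(0.9), "c": math.log(0.1)}

best = beam_search(toy_model, "<s>", "</s>", beam_size=2, max_len=5)[0]
# best[0] == ["<s>", "a", "</s>"]
```

With `beam_size=1` this degenerates to greedy decoding, which is why a single early misstep cannot be recovered in that setting.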

N-Gram Blocking
After the advent of NMT, problems that did not appear in the existing rules-based and statistics-based methodologies started to occur. Representative examples include repeated translation, unknown (UNK) words, and omissions; these problems are generally caused by limited vocabulary size. Various subword tokenization methodologies such as BPE [8], SentencePiece [9], and BPE-Dropout [38] were suggested to solve these problems. These methodologies, which separate words into meaningful subwords, helped mitigate problems resulting from limited vocabulary size.
We determined that problems caused by repeated translation, unknown (UNK) words, and omissions can be mitigated not only through subword tokenization but also through n-gram blocking during decoding. In our experiments, blocking repetitions during decoding, with n ranging from unigram to 10-gram, improved NMT performance. Specifically, the output of the decoder contains the top-k beams produced by the beam search algorithm at each decoding step. N-gram blocking then refers to the output words of previous time steps to prevent n-gram repetition, discarding beams that would repeat an n-gram.
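A minimal version of this repetition check can be sketched as follows, assuming hypotheses are plain token lists; the integration with the beam loop is simplified to a single predicate.

```python
def violates_ngram_block(tokens, n):
    """Return True if the final n-gram of `tokens` already occurred earlier
    in the same hypothesis (the condition n-gram blocking penalizes)."""
    if n <= 0 or len(tokens) < n + 1:
        return False
    last = tuple(tokens[-n:])
    # Compare against every earlier (possibly overlapping) n-gram.
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == last:
            return True
    return False

# During beam expansion, candidate hypotheses for which this check
# returns True are simply dropped before the top-k selection.
```

Small n is aggressive (unigram blocking forbids any repeated token, which natural sentences routinely require), which is consistent with the sharp performance drop observed for unigram blocking in the experiments.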

Length Penalty
The length penalty is a methodology that prevents the score of a long translation from being unfairly penalized. The cumulative probability computed during beam search is a product of per-step probabilities, each ranging from 0 to 1; therefore, the value moves closer to 0 as the translation progresses. Specifically, an unfairness arises in which the cumulative probability of a long beam inevitably becomes smaller than that of a short one. The length penalty was proposed to solve this problem: because the probability value of a long sentence necessarily decreases, the length penalty can be defined as a means of recalibrating the score with respect to sentence length. During the decoding step, we applied the stepwise length penalty (SLP) and the average length penalty (ALP), which are representative length penalty methodologies. SLP applies the penalty at each decoding step, whereas ALP normalizes scores by sequence length. Through this strategy, long translations can be scored more fairly in the decoding step.
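The two penalties can be sketched as follows. The paper does not give the exact formulas, so the stepwise variant below assumes a GNMT-style penalty of the form (5 + length)^α / 6^α, and ALP is taken as simple length normalization; both are assumptions for illustration.

```python
def average_length_penalty(log_prob, length):
    """ALP: normalize the cumulative log probability by hypothesis length."""
    return log_prob / max(length, 1)

def stepwise_length_penalty(log_prob, length, alpha=0.6):
    """SLP-style rescaling applied as the hypothesis grows. The
    (5 + length)^alpha / 6^alpha form is a GNMT-style assumption,
    since the paper does not give the exact formula."""
    lp = ((5 + length) ** alpha) / (6 ** alpha)
    return log_prob / lp

# With ALP, a long hypothesis with the same per-token log probability
# no longer loses to a short one.
short_score = average_length_penalty(-2.0, 2)  # -1.0 per token
long_score = average_length_penalty(-5.0, 5)   # -1.0 per token
```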

Korean-English Neural Machine Translation Platform
A distribution method was developed for the model; this was considered a crucial task. Efficient distribution of the model can assist human translators by reducing expenses and development time for Korean-English translation research. Therefore, we distributed the model as a platform to aid the development of Korean-English neural machine translation. The platform is designed to utilize both the GPU and CPU. GPUs are capable of translating at a rapid pace, thus allowing a greater number of users to experience Korean-English translation. It was distributed officially at (http://nlplab.iptime.org:32296/). The execution process of the platform is shown in Figure 2.

Data Sets and Data Pre-Processing
We utilized the Korean-English parallel corpus provided by the AI Hub and the Korean-English subtitles parallel corpus from OpenSubtitles as the training data for the experiment. The AI Hub corpus is composed of 1,602,708 sentences. The average syllable length and average word segment length for the Korean data were 58.58 and 13.53, respectively. The average syllable length and average word segment length for English were 142.70 and 23.57, respectively. The data from OpenSubtitles comprised a total of 929,621 sentences. The average syllable length and average word segment length for Korean were 21.82 and 5.5, respectively. The average syllable length and word segment length for English were 42.61 and 8.06, respectively.
To achieve subword tokenization, the training data were pre-processed using SentencePiece [9] from Google, and the vocabulary size was set to 32,000. In the case of OpenSubtitles, sentences with fewer than three word tokens were deleted; this filtering process was not applied to the AI Hub data. To form the validation set, 5000 sentences were randomly selected from the training data. Every translation result was evaluated using the BLEU score, computed with the multi-bleu.perl script of Moses (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl). Table 1 presents an example of the training data. The IWSLT-16 and IWSLT-17 datasets were used as the test sets. These test sets are based on the TED domain, and a substantial amount of Korean-English NMT research has been conducted with them [32,33]. IWSLT-16 and IWSLT-17 are composed of 1143 and 1429 sentences, respectively.
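The clipped n-gram precision at the core of the BLEU score used for evaluation can be sketched as follows; this is a toy single-sentence, single-reference version (the actual evaluation used multi-bleu.perl, which also combines n = 1..4 and applies a brevity penalty).

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the core quantity behind BLEU (toy version)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

p1 = modified_ngram_precision("the cat sat".split(),
                              "the cat sat down".split(), 1)
# p1 == 1.0: every candidate unigram appears in the reference
```

The clipping step is what prevents degenerate outputs that repeat a common reference word from scoring well, which is also why repetition problems in decoding hurt BLEU directly.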

Models and Hyperparameters
Because the primary aim of this study is to determine whether NMT model performance can be improved through various decoding strategies without changing the model, the evaluated model was based on the KU model, which in turn builds on the Transformer [17], currently the most widely commercialized NMT model.
The hyperparameters of this model were set as follows: the batch size was 4096, 6 attention blocks and 8 attention heads were used, and the embedding size was 512. Adam and noam decay were used for optimization. Two GTX 1080 GPUs were utilized in the training process.
First, the performances of this model and previous Korean-English NMT models were compared. Specifically, the comparison was conducted with results from previous studies in which performances were evaluated with the same test set. The performance comparison included a study in which multilingual machine translation was explored to improve low-resource language translation [33] and a Korean-English NMT [32] that used MASS. The overall comparison results are listed in Table 2.
According to the comparison results, the KU model showed the best performance on IWSLT-16 with a score of 17.34, and the model that applied back translation to the MASS model of Kangwon National University showed the best performance on IWSLT-17 with a score of 15.34. The KU model appears to have performed best owing to the quality of the AI Hub data. In addition, because the purpose of that model was to verify the quality of the data, data augmentation techniques such as back translation were not used. The model outperformed existing models on IWSLT-16 even though back translation was not applied; on IWSLT-17, the model outperformed the existing models that did not employ back translation. The MASS model showed comparatively high performance presumably because the IWSLT 2017 TED English-Korean Parallel Corpus was used as its training data, and the test sets used in this study and the MASS study belong to the same domain. In the case of the KU model, those data were not used as training data because it was judged unfair to train on the same domain as the test set; a model that happens to specialize in a particular domain would naturally appear to perform well on a test set from that domain.

Beam Size Comparison Experiment
To compare the performances of various decoding strategies, the performance impact of changes in beam size was measured by incrementing it from 1 to 10. The results of this experiment are shown in Table 3.
According to the results, the BLEU score varied by up to 1.31 depending on the beam size. For both test sets, optimal performance was achieved with a beam size of two. In other words, the beam size directly affects the overall performance of the model, implying that decoding should be conducted with an optimal beam size to maximize performance. In addition, a large beam size does not necessarily guarantee high performance; in fact, the worst performance was observed at the largest setting, a beam size of 10. A large beam size appears to affect overall performance negatively because it increases the number of candidates that must be calculated. The case of greedy decoding, with a beam size of 1, showed that even a single incorrect prediction can fatally affect overall performance.
An experiment comparing decoding speeds according to beam size was also conducted. Speed is a critical factor in NMT processing, and many organizations consider it when developing NMT systems; problems such as server overload and outages can occur if slow speeds are encountered when the service is provided through a web server. Therefore, we conducted an experiment comparing the total translation time for the test set (Total Time), the average translation time per sentence (Average), and the number of tokens processed per second (tokens/s) according to the beam size. The speed experiment was conducted using IWSLT-16, and the results are presented in Table 4. According to the results, translation speed decreased as the beam size increased. Specifically, the translation time for the entire test set, the translation time per sentence, and the number of tokens processed per second all worsened proportionally as the beam size increased. According to Tables 3 and 4, a large beam size guaranteed neither good performance nor high speed; rather, a moderate beam size improved both. Figure 3 shows the relationship between the BLEU score and the translation time for the entire test set. As shown in Figure 3, speed decreases and performance degrades gradually as the beam size increases. In particular, the difference in Total Time between beam sizes 1 and 10 was 9.978 s, which implies that organizations should carefully select the beam size for decoding if processing speed is a priority.

N-Gram Blocking Comparative Experiment
A second experiment was conducted to evaluate the performance achievable with n-gram blocking. In RBMT, sentence structures are rarely scattered, and words are rarely translated repetitively because translations are conducted according to stringent rules. However, in NMT, sentence structures are often scattered because issues such as repetitive translations and omissions occur owing to UNK problems caused by the limited vocabulary size. Copy Mechanism [39] was proposed to solve this problem; however, it requires complex changes to the model structure and decreases the processing speed. To solve this problem without changing the model structure, an experiment was conducted to prove that model performance could be improved by applying n-gram blocking during decoding. The results of the experiment are shown in Table 5. A beam size of two was selected for each model when n-gram blocking was utilized, reflecting the experimental results in Table 3.
According to the results of the experiment, performance stabilized at 8-gram blocking on both test sets, and performance was optimized when 8-gram and 4-gram blocking were applied. In contrast, performance decreased sharply when unigram blocking was applied, implying that this setting should not be used in practice. In conclusion, performance on IWSLT-17 improved over the results shown in Table 2, verifying that the application of n-gram blocking leads to a moderate improvement in performance.

Comparative Experiment on Coverage and Length Penalty
Finally, an experiment was conducted to determine whether performance improves when SLP and ALP are applied. The results are shown in Table 6. According to the results, performance was optimized when SLP and ALP were both applied for IWSLT-16; when only ALP was applied for IWSLT-17, not only was the best performance achieved, but the BLEU score was also 0.08 points higher than the best score (15.34) reported in existing Korean-English NMT research. The performance improvement from the length penalty was greater than that from n-gram blocking; this may be because the length penalty mitigates the unfair reduction in cumulative probability that occurs when a long sentence is translated.
We applied various decoding strategies, and optimal performance was shown during the processing of both IWSLT-16 and IWSLT-17, which are the most objective test sets in Korean-English NMT. In conclusion, it was confirmed that performance improvements could be achieved through various decoding strategies without changing the model. This methodology can aid low-resource NMT research in which it is difficult to obtain data, and this study should provide the basis for pre-processing and post-processing research in the NMT field.

Conclusions
Through various experiments, we proved that the performance of Korean-English NMT can be increased through various decoding strategies without changing the model structure. The performance was compared against those reported in previous Korean-English NMT studies, and IWSLT-16 and IWSLT-17 were used as test sets to maintain objectivity in performance evaluations. Experimental results obtained with a beam size of 2, n-gram blocking, and a length penalty showed performance that was comparatively better than those reported in previous Korean-English NMT studies. Additionally, the model was distributed as a platform to increase its availability. By using various decoding strategies that were proven in this study, both speed and performance were improved without changes to the model structure. Performance may be further improved by integrating ideas proposed in other studies in which models were lightened through strategies such as network pruning [40], knowledge distillation [41], and quantization [15]. Hence, an effective beam search technique and a new decoding technique will be investigated in future studies. In addition, an optimal NMT service system will be researched by applying the various decoding strategies in conjunction with network pruning, knowledge distillation, and quantization. A limitation of this study is the lack of an innovative new decoding strategy. Therefore, in the future, we plan to investigate decoding strategies that are more efficient.