Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied multilingual pretrained language models (mPLMs) to address this task. Recent studies have focused on improving performance on this task using data augmentation with finetuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research involved in competitions addressing a shared task, we utilize the comparison for sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa.


Introduction
Quality estimation (QE) refers to automatically predicting translation quality using only the source sentence and the machine translation (MT) output [1]. The goal of QE is to estimate translation quality scores or categories for MT outputs without reference sentences at various levels of granularity (i.e., sentence, phrase, word). In general, determining the quality of a translation requires comparing the MT output with a reference sentence. However, reference sentences are not easy to obtain, and constructing them requires considerable cost and human labor. Because of these issues, the need for QE research is increasing, and a considerable number of studies are being conducted in this area.
In the QE process, the quality of the MT output is indicated using quality annotations, such as numerical values or error tags. This allows the user to select or rank the system that exhibits the best translation results [2]. In addition, for low-quality sentences, efficiency can be increased during automatic post editing [3] by modifying only the low-quality words or phrases using quality annotations. Therefore, QE is an important process that can be widely applied.
According to recent research trends, there are a number of cases in which the QE task is conducted based on multilingual pretrained language models (mPLMs) [4][5][6]. An mPLM learns a multilingual representation by extending a pretrained language model to multiple languages. In QE, where two languages are concatenated and entered as input, such a representation is required, so mPLMs are mostly used in this task. However, most studies focus on improving performance by simply applying data augmentation while finetuning for the QE task based on a large-capacity mPLM such as multilingual BERT (mBERT) [7], cross-lingual language model (XLM) [8], or XLM-RoBERTa (XLM-R) [9]. In addition, there are many cases in which QE models are trained based on XLM-R, which is the latest model with a state-of-the-art (SOTA) performance for cross-lingual transfer tasks [10,11] achieved by pretraining using an extremely large dataset [5,12,13,14]. However, unlike evaluation benchmarks for cross-lingual understanding that deal with multiple languages, QE requires measuring translation quality while referencing two languages at the same time. Thus, performance comparisons with other models should come first, but many papers tend to overlook this and simply use the XLM-R model [15].
Zhou et al. [16] compare the performance of mBERT and XLM-MLM for sub-task 1, Baek et al. [13] additionally compare the performance of XLM-CLM, and Ranasinghe et al. [17] compare the performance of mBERT and XLM-R. However, XLM models that include the English and German languages are quite diverse, and, in particular, there has been no comparison with XLM-TLM models, which learn information between languages in addition to multiple languages.
Unlike other previous studies that mostly utilize the SOTA model, we remove the effects of data augmentation that are utilized to achieve performance improvement and perform a comparative study between representative mPLMs based on sub-tasks 1 and 2 from WMT20. Each mPLM has a different capacity, training data size, or pretraining objective, and even the same model has different performance depending on how many languages it contains. Therefore, comparative analysis of various mPLMs in QE can serve as a good indicator of which model performs well for each task in future studies. In addition, because we compare pure performance, we can expect high performance by using data augmentation and new methodologies based on the model with high performance.
This study addresses two questions:
• Which mPLM is best for QE sub-tasks?
• Does the input order of the source sentence and the MT output sentence affect the performance of the model?
Considering the first question, the finetuning performance of mPLMs for a QE task can be validated using a quantitative analysis. To achieve this, we apply multilingual BART (mBART) [18], which has not been used in previous QE studies, and compare it with the existing mBERT, XLM, and XLM-R models. For XLM, we conduct performance comparisons between the causal language model (CLM), masked language model (MLM), and translation language model (TLM). In the case of XLM-MLM, the performances are compared according to the number of languages used for pretraining.
Considering the second question, it is possible to determine the criteria indicating which input structure should be adopted. The contributions of this study are as follows:
• We conduct comparative experiments on finetuning mPLMs for a QE task, which is different from research concerning the performance improvement of the WMT shared task competition. This quantitative analysis allows us to revisit the pure performance of mPLMs for the QE task. To the best of our knowledge, we are the first to conduct such research;
• Through a comparative analysis concerning how to construct an appropriate input structure for QE, we reveal that the performance can be improved by simply changing the input order of the source sentence and the MT output;
• In the process of finetuning mPLMs, we only use data officially distributed in WMT20 (without external knowledge or data augmentation) and use the official test set to ensure objectivity for all experiments.

Related Work and Background
A quality estimation (QE) task is a branch of machine translation. Representative metrics of NMT, such as BLEU [19] and METEOR [20], require reference sentences to evaluate the quality of MT output. QE does not require access to reference outputs, and quality is indicated by OK/BAD tokens, numerical values, spans, etc. QE research can be divided into three categories: the use of statistical methods, the use of recurrent neural networks (RNN) and long short-term memory (LSTM) after the advent of deep learning, and the use of pretraining and finetuning approaches with the advent of pretrained language models.
Most conventional QE studies have been conducted by extracting or selecting features to evaluate the quality of MT. When selecting such features, machine learning algorithms, such as Gaussian processes [21,22], support vector machines [23,24], and regression trees [1,25] are used. In the case of feature extraction, some studies have extracted useful features, such as linguistic features [26] and pseudo-reference features [27], using external resources such as parsers, taggers, and named entity recognizers [23,28]. However, these studies are focused on determining the complex relationship between features and references, and the process of selecting and extracting optimized features requires heuristic processes and high costs.
With the advent of deep learning, research using RNN and LSTM was mainly conducted in QE, and it achieved much higher performance improvement than statistical methods [29,30]. Kim et al. [31] proposed a new structure referred to as predictor-estimator. Predictor is a bilingual and bidirectional RNN-based word prediction model, which randomly selects and masks a word in a target sentence from a parallel corpus and then generates feature vectors by predicting it. In estimator, the generated feature vector is used as transferred knowledge to learn the QE model. This structure was able to alleviate the issue of data shortage while allowing an additional parallel corpus to be utilized for a limited amount of QE data, and it led to a dramatic performance improvement. Similar to this architecture, Wang et al. [32] constructed a QE brain model with two phases. In the first phase, features were extracted with the transformer model to be used as prior knowledge, and in the QE phase, these features were combined with human-craft features and fed into the Bi-LSTM structure to train for QE. A superior performance was also obtained using this method.
Since the advent of pretrained language models (PLMs), most QE research has been based on mPLMs. Designing the QE model on top of a large-scale pretrained model greatly improves performance. Kepler et al. [33] replaced the predictor component with a pretrained BERT or XLM model while training using the structure of a predictor-estimator. Kim et al. [34] finetuned mBERT for the QE task. Ranasinghe et al. [35] proposed two approaches: MonoTransquest and Siamese-Transquest. The former finetunes a single XLM-R model, while the latter uses two separate XLM-R models, one each for the source and target sentences, and measures the cosine similarity of both outputs to predict the translation quality at the sentence level. Lee [12] performed data augmentation using a parallel corpus and pretrained XLM-R on the resulting pseudo data. After this process, finetuning was performed using QE data provided by WMT. Wang et al. [36] considered the pretrained transformer model as a predictor and the task-specific regressors or classifiers as an estimator instead of an mPLM. In the learning process, a bottleneck adapter layer was newly added to improve the efficiency of transfer learning and prevent over-fitting.

Multilingual Pretrained Language Models for QE
In this section, we describe mPLMs for QE performance comparison. We used mBERT, XLM, XLM-R, and mBART, which are multilingual pretrained models that include English and German.

Multilingual BERT
BERT [37] is built on a transformer [38] architecture, which consists solely of an encoder structure.
BERT performs self-supervised learning on a large-scale monolingual corpus. Because self-supervised learning derives its supervision from the raw text itself, it does not require labeled data, so it can utilize large amounts of raw data. After training on user-defined problems such as the masked language model (MLM) and next sentence prediction (NSP) on unlabeled raw data, transfer learning is performed for downstream tasks. More specifically, the user generates arbitrary tasks and labels for raw text to learn language information, and uses the representations obtained through this process as initialization values for downstream tasks. In the case of BERT, MLM and NSP are used as pretraining schemes.
MLM is a procedure of randomly masking tokens in the original sentence with [MASK] tokens. The objective is to correctly predict these masked tokens based on the left and right context of the sentence. In particular, the last hidden vector corresponding to the mask token goes through a softmax, and the word with the highest probability in the vocabulary is returned. In the process of masking, 15% of the tokens in the original sentence are randomly sampled; among these selected tokens, 80% are replaced by [MASK], 10% are replaced by random tokens in the vocabulary, and 10% remain unchanged. Through this masking process, a defective sentence $\hat{X} = \{\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n\}$ is generated from an unlabeled monolingual sentence $X = \{x_1, x_2, \ldots, x_n\}$. In the training process, $\hat{X}$ is fed into a BERT model, which is parameterized by θ, and the model is then trained to return $X$. This task can be described by Equation (1).
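For reference, a standard way of writing this objective, consistent with the description above, is sketched below. The symbol $M$ for the set of masked positions and the accent in $\hat{X}$ (the defective sentence) are notational assumptions introduced here.

```latex
\mathcal{L}_{\mathrm{MLM}}(\theta)
  = -\sum_{i \in M} \log P\left(x_i \mid \hat{X};\, \theta\right)
\tag{1}
```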
This equation indicates that a model is trained to predict an original token $x_i$ by considering the defective sentence $\hat{X}$. By referring to the nearby context while restoring a [MASK] token, the model can be trained to use a bidirectional contextual representation.
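The masking procedure described above can be sketched as follows. This is a minimal illustration, not the original BERT implementation: the function name `mask_tokens`, the toy vocabulary, and the use of whole words instead of subword tokens are assumptions made for readability.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Corrupt a token list with the 80/10/10 rule (hypothetical helper).

    Returns the corrupted tokens and the positions the model must predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    masked_positions = []
    for i in range(len(tokens)):
        if rng.random() >= select_prob:
            continue  # token is not selected for prediction
        masked_positions.append(i)
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, masked_positions

tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "tree"]  # toy vocabulary, for illustration only
print(mask_tokens(tokens, vocab, seed=3))
```

The loss is computed only at the returned positions, so tokens that are selected but left unchanged still contribute to training.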
NSP is a binary classification task that aims to train by understanding sentence relationships. In the training process, two sentences are concatenated to construct inputs, and these sentences are then selected from an unlabeled monolingual corpus based on a probability. Successive sentences are selected for half of the time, while randomly picked sentences are chosen otherwise. The main objective of NSP is to distinguish whether these input sentences are successive or not. Through this training process, a model can obtain an improved understanding of relationships between sentences.
Multilingual BERT (mBERT) [7] is a BERT-based multilingual model. The same pretraining schemes as BERT (MLM and NSP) are adopted for mBERT. However, unlike BERT, mBERT is trained with a multilingual unlabeled corpus, which is comprised of 104 languages.
The way we adapt mBERT to a QE task is as follows. For the assessment of an entire sentence, we leverage the first hidden representation obtained from the mBERT model. By applying a linear classification head without an activation function, we can obtain the final prediction score of the sentence. Therefore, the sentence assessment score $score_{sentence}$ is derived from an encoded representation of the input sentence, $H = \{h_1, h_2, \ldots, h_m\}$, as shown in Equation (2).
In Equation (2), $W \in \mathbb{R}^{1 \times hidden}$ and $b \in \mathbb{R}^{1 \times 1}$ are trainable parameters, where hidden indicates the hidden layer size of the pretrained mBERT. During the QE training process, the mean squared error (MSE) loss between $score_{sentence}$ and the label score is considered.
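Given that the head is linear, has no activation function, and is applied to the first hidden representation, the equation presumably takes the following form. It is reconstructed from the surrounding definitions rather than copied from the original.

```latex
score_{sentence} = W h_1 + b
\tag{2}
```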

Cross-Lingual Language Model
XLM [8] is a transformer-based model that extends existing language model pretraining methods, which mainly focus on a monolingual language representation, to the multiple language representation. XLM is pretrained through MLM and CLM by leveraging a multilingual unlabeled corpus. To achieve a better multilingual language understanding, TLM, which is a pretraining scheme utilizing a parallel corpus, is applied. Unlike mBERT, NSP is not considered during pretraining.
CLM is a pretraining scheme in which the objective is to model the probability of a word given the previous words in a sentence. This can be described as in Equation (3).
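A common way of writing this left-to-right objective is sketched below. The loss symbol and the sentence length $n$ are notational assumptions chosen to match the MLM notation used earlier.

```latex
\mathcal{L}_{\mathrm{CLM}}(\theta)
  = -\sum_{t=1}^{n} \log P\left(x_t \mid x_1, \ldots, x_{t-1};\, \theta\right)
\tag{3}
```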
It can be said that the goal of CLM is to maximize the probability of a token based on preceding tokens. Through this process, a model can obtain an improved language understanding.
TLM is an extension of MLM and improves cross-lingual understanding by utilizing parallel data in the pretraining phase. The source and target sentences of a parallel corpus are first concatenated, and then some tokens in these sentences are replaced with [MASK] tokens. The training objective of TLM is to predict the masked tokens, as in mBERT. However, masked tokens can be predicted by referring not only to their surrounding context but also to the concatenated sentence in the other language. A characteristic of TLM is that, by predicting masked tokens while referencing both languages simultaneously, a representation containing information between languages can be obtained. This can be described as shown in Equation (4).
In Equation (4), $\hat{X} : \hat{Y}$ indicates the corrupted input data, where $\hat{X}$ is the source sentence component and $\hat{Y}$ is the target sentence component. $M_x$ and $M_y$ are index sets that consist of the indices of the masked tokens in the source and target sentences, respectively. When predicting a masked word in a source sentence during the training process, a model can refer to the nearby source language context, as well as the target sentence. This can encourage the model to acquire a better understanding of multilingual representation. Additionally, to obtain a decent multilingual representation, distinct language embeddings and respective position embeddings are applied to each language.
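One way to write this objective, consistent with the description of the index sets $M_x$ and $M_y$ and of the corrupted concatenated input, is sketched below. The accents in $\hat{X}$ and $\hat{Y}$ and the tokens $y_j$ of the target sentence are notational assumptions.

```latex
\mathcal{L}_{\mathrm{TLM}}(\theta)
  = -\sum_{i \in M_x} \log P\left(x_i \mid \hat{X} : \hat{Y};\, \theta\right)
    -\sum_{j \in M_y} \log P\left(y_j \mid \hat{X} : \hat{Y};\, \theta\right)
\tag{4}
```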
XLM utilizes Wikipedia data for the pretraining of various languages. As the amount of established Wikipedia data differs for each language, bias towards high-resource languages can be obtained if such data are utilized without any preprocessing. To alleviate the data imbalance problem, different sampling ratios are applied in the training process. The applied sampling ratios are determined using a multinomial distribution, which is denoted in Equation (5).
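The multinomial sampling distribution is usually written as below. The intermediate quantity $p_i$, the share of the $i$th language in the raw data, is a symbol introduced here for readability.

```latex
q_i = \frac{p_i^{\alpha}}{\sum_{j=1}^{N} p_j^{\alpha}},
\qquad
p_i = \frac{n_i}{\sum_{k=1}^{N} n_k}
\tag{5}
```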
Here, $q_i$ indicates the sampling ratio for the $i$th language data, with amount $n_i$, among the total dataset that comprises $N$ languages. α is a hyperparameter that is set to 0.7 for the pretraining of XLM, such that the sampling ratio is increased for low-resource languages and decreased for high-resource languages.
For the XLM-based QE model, the overall training process is similar to Section 3.1, except that positional embeddings that encode absolute positions and language embeddings that indicate the language of each token are applied.

XLM-RoBERTa
Because XLM learns using Wikipedia, there is a limitation in that data on low resource language is insufficient. In XLM-R [9], the data are expanded to a much larger scale. XLM-R is a multilingual masked language model that adopts large-scale pretraining by utilizing CommonCrawl data [39], which comprises 100 languages. XLM-R gains state-of-the-art performance for cross-lingual classification, question answering, and sequence labeling. Among the three pretraining schemes for XLM, only MLM is utilized for XLM-R training, and MLM proceeds in the same way as XLM. By expanding the model capacity and leveraging larger data sizes than permitted for XLM, XLM-R alleviates the performance degradation caused by the curse of multilinguality.
The curse of multilinguality represents a trade-off between the number of languages in the training data and the model performance at a fixed model capacity. Increasing the number of languages in training data can encourage an improved performance for monolingual and cross-lingual benchmarks to a certain extent because the understanding of low-resource languages is supported by similar high-resource languages. However, if the model capacity is fixed, an excessive number of languages will lead to the overall performance degradation of this method because of the decrease in the per-language capacity. XLM-R alleviates this problem by extending the number of model parameters.
XLM-R adopts a multinomial distribution (5) for applying different sampling ratios to each language. Unlike XLM, XLM-R sets α to 0.3 to strengthen the sampling ratio of low-resource languages. The training process for the XLM-R-based QE model is similar to that of Section 3.1.
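The effect of α on the sampling ratios can be illustrated with a short sketch. The corpus sizes below are made-up placeholders, not statistics from Wikipedia or CommonCrawl, and `sampling_ratios` is a hypothetical helper name.

```python
def sampling_ratios(sizes, alpha):
    """Exponentiated multinomial sampling ratios q_i for each language."""
    total = sum(sizes.values())
    shares = {lang: n / total for lang, n in sizes.items()}   # p_i
    norm = sum(p ** alpha for p in shares.values())
    return {lang: (p ** alpha) / norm for lang, p in shares.items()}

# Hypothetical sentence counts for two high-resource and one low-resource language.
sizes = {"en": 1_000_000, "de": 500_000, "sw": 10_000}

for alpha in (1.0, 0.7, 0.3):
    ratios = sampling_ratios(sizes, alpha)
    print(alpha, {lang: round(q, 4) for lang, q in ratios.items()})
```

With α = 1.0 the ratios equal the raw data shares. Lowering α flattens the distribution, so the low-resource language is sampled more often at α = 0.3 than at α = 0.7.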

Multilingual BART
BART [40] is a denoising autoencoder that corrupts the text by adding arbitrary noise and trains the model to restore it to the original text. mBART [18] is an extension of BART that has been applied to large monolingual corpora across multiple languages. mBART was trained using a 25-language corpus from CommonCrawl data (CC25).
BART utilizes five pretraining schemes leveraging a monolingual corpus: token masking, token deletion, text infilling, document rotation, and sentence permutation. Among these pretraining schemes, mBART adopts text infilling and sentence permutation. In the case of text infilling, unlike MLM, in which one token in the original sentence is replaced with one [MASK] token, spans of tokens are each replaced with a single mask token. The total number of selected tokens is 35% of the entire sentence, and the length of each masked span is determined based on the Poisson distribution, which is described in Equation (6).
Here, $f(n; \lambda)$ indicates the probability of selecting $n$ as the masking length. mBART sets λ to 3.5 for pretraining. By training to reconstruct masked sentences, which are generated by text infilling, a model can be trained for bidirectional contextual understanding, as well as to determine how many tokens should be restored from a single mask token.
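The Poisson probability mass function referred to here has the standard form:

```latex
f(n; \lambda) = \frac{\lambda^{n} e^{-\lambda}}{n!}
\tag{6}
```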
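To make the span-length distribution and the span replacement concrete, a small sketch follows. The helper names `poisson_pmf` and `infill_span` are hypothetical, and the example replaces one hand-picked span rather than sampling spans until 35% of the tokens are covered.

```python
import math

MASK = "[MASK]"

def poisson_pmf(n, lam=3.5):
    """Probability of choosing span length n under a Poisson distribution."""
    return lam ** n * math.exp(-lam) / math.factorial(n)

def infill_span(tokens, start, length):
    """Replace tokens[start:start+length] with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

tokens = ["A", "B", "C", "D", "E", "F"]
print(infill_span(tokens, 1, 3))                      # three tokens collapse into one mask
print([round(poisson_pmf(n), 3) for n in range(7)])   # span-length probabilities for n = 0..6
```

Because a whole span becomes a single mask token, the model must also infer how many tokens are missing, which is the property highlighted in the text.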
In the case of sentence permutation, the text is corrupted by changing the order of the sentences within each instance. In the process of restoring the noise injected by sentence permutation to the original text, the model can understand information about the relationship between sentences.
Similar to XLM and XLM-R, mBART adopts an up-down sampling method to achieve improved training for low-resource languages. The sampling ratio $\lambda_i$ applied to the $i$th language data is provided by Equation (7).
Here, $p_i$ is the percentage of each language in the total dataset. The amount of training data for each language is rebalanced according to Equation (7), and, therefore, sampling from high-resource languages is relatively suppressed while sampling from low-resource languages is encouraged. The training process of the QE model leveraging mBART is similar to that of Section 3.1, wherein the same input structure as in pretraining is utilized.
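The rebalancing ratio is commonly written in the following form. The smoothing exponent $\alpha$ is not defined in the surrounding text and is included here as an assumption; the mBART paper [18] should be consulted for its exact value.

```latex
\lambda_i = \frac{1}{p_i} \cdot \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}
\tag{7}
```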

Sub-Task 1
Sub-task 1 is a sentence-level direct assessment task. This task consists of scoring MT output according to a perceived quality score called direct assessment. A limitation of human translation error rate (HTER) [41] is that it does not capture the extent to which MT errors affect the overall quality of a sentence. The objective of sub-task 1 is to measure the overall quality of sentences through direct assessment (DA) by translation experts. One of the goals of QE in relation to this task is to investigate the relationship between a model for predicting DA scores and a model trained to predict post-editing tasks [15]. The DA score is a value obtained by evaluating the quality of the MT output from 0 to 100 by at least three professional translators. Using a total of 7K training data and 1K evaluation data, systems participating in this sub-task measure quality by predicting the mean z-standardized DA score of the MT output.
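The mean z-standardized DA score can be sketched as follows, assuming that raw scores are standardized per annotator before being averaged per sentence and that every annotator rated every sentence. Both points, the helper names, and the ratings are illustrative assumptions rather than details of the WMT20 pipeline.

```python
from statistics import mean, pstdev

def z_standardize(scores):
    """Standardize one annotator's raw scores (assumes non-constant scores)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]

def mean_z_scores(ratings_by_annotator):
    """Average the per-annotator z-scores sentence by sentence."""
    standardized = [z_standardize(r) for r in ratings_by_annotator.values()]
    return [mean(column) for column in zip(*standardized)]

# Hypothetical 0-100 ratings of three sentences by three annotators.
ratings = {
    "annotator_1": [80, 60, 40],
    "annotator_2": [90, 70, 50],
    "annotator_3": [70, 50, 30],
}
print(mean_z_scores(ratings))
```

Standardizing per annotator removes differences in how strictly individual annotators use the 0 to 100 scale.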

Sub-Task 2
Sub-task 2 covers word- and sentence-level post-editing effort. The objective of sub-task 2 is to improve post-editing by tagging which tokens have been mistranslated, along with estimating the overall quality of the sentence. At the word level, this task consists of evaluating whether the translation was successful for each token in the MT output and source sentence based on the human post-edited sentences. The tokens of the source and target sides are tagged as OK or BAD. In the case of the target sentences, a gap tag is added to account for missing words between the tokens. If the number of tokens in the target sentence is N, the total number of tags is 2N+1. Participating systems predict tags for MT output tokens and source sentence tokens. Similar to sub-task 1, the sentence-level post-editing effort task measures a quality score for the MT output, here based on the human translation error rate (HTER) [41]. HTER is similar to the translation error rate (TER), wherein the TER compares the MT output with a reference translation and counts how many edits (substitutions, deletions, and insertions) must be performed to obtain a correct sentence. This value divided by the reference length is the TER score. HTER differs from TER in that humans create new reference translations by post-editing the MT output. Using these new reference translations can lead to correct sentences with minimal modifications compared to the use of other reference translations. Referring to the source sentence and the MT output, the participating system predicts the quality of the MT output sentence based on the HTER.
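The edit-based score and the tag count described above can be sketched as follows. This is a simplified illustration: it counts only substitutions, deletions, and insertions at the word level, as in the description, and omits the shift operation used by full TER implementations. The function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis token
                            curr[j - 1] + 1,      # insert a reference token
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def hter(mt_tokens, post_edit_tokens):
    """Number of edits divided by the length of the post-edited reference."""
    return edit_distance(mt_tokens, post_edit_tokens) / len(post_edit_tokens)

def num_target_tags(n_tokens):
    """N token tags plus N + 1 gap tags on the target side."""
    return 2 * n_tokens + 1

mt = "the cat sat on mat".split()
post_edit = "the cat sat on the mat".split()
print(hter(mt, post_edit), num_target_tags(len(mt)))
```

In this toy example a single insertion turns the MT output into the post-edited sentence, so the score is one edit divided by six reference tokens.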

Dataset Details
In this study, we conducted experiments concerning sub-tasks 1 and 2 at the sentence level of WMT20 based on various mPLMs. We experimented using the English-German language pair and used the train, dev, and test data provided by WMT20 (http://www.statmt.org/wmt20/quality-estimation-task.html, accessed on 15 July 2021). Table 1 shows a summary of the data for each sub-task.
In the case of sub-task 1, there is a total of 7k training data, and the numbers of source and MT output tokens are 98,127 and 97,453, respectively. The average of the mean z-standardized DA score is −0.008 and the median is 0.162. The development and test data consist of a total of 1K data, and there are approximately 14K source and MT output tokens. The development and test data provide average scores of −0.049 and 0.040, and the respective median scores are slightly higher at 0.211 and 0.319.
In the case of sub-task 2 at the sentence-level, the number of sentences is 7K in the training data and 1K in each of the development and test data, as in sub-task 1. The average HTER score is distributed around 0.3, and the median value either does not significantly differ or is slightly lower than the average value. HTER is centered around values lower than the error rate of 0.5.

Table 1. Summary of the QE dataset. We denote the number of instances in each dataset as # Instance. # SRC Token and # MT Token refer to the number of tokens in source- and target-side sentences for each dataset, respectively.

Model Details
We conducted a finetuning performance comparison using a total of 9 models including XLM-R base, XLM-R large, mBERT, mBART, XLM-CLM, XLM-MLM, XLM-MLM-17, XLM-MLM-100, and XLM-TLM. English-German was used as the language pair for this experiment, and performance comparisons were conducted for each mPLM at sub-task 1 and sub-task 2 sentence-levels. These models are described as follows:
We performed finetuning using the pretrained models released in HuggingFace's transformers library [42]. We did not perform additional pretraining or data augmentation so that the pure performances of the mPLMs could be objectively evaluated and compared in the QE task.
In preprocessing, we performed subword tokenization using the tokenizer provided for each model in HuggingFace. For the model input, we added segment embeddings for mBERT, marking tokens with 0 or 1 to distinguish sentence 1 from sentence 2. For XLM, we added a position embedding that assigns each token of the source sentence and the MT output a number corresponding to its index, as well as a language embedding that assigns a unique number to each language.
As the training procedure for finetuning, we first load the mPLM to initialize the parameters. The additional embeddings for each model are then fed to the model along with the concatenation of the source and target sentences. We feed the output corresponding to the position of the [CLS] token among the last hidden states to the linear classifier and measure the loss between the predicted value and the label. We use the mean squared error (MSE) loss as the loss function.
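The prediction head and loss in this procedure can be sketched in plain Python. This illustrates the computation only: an actual finetuning run would use a deep learning framework and backpropagate through the whole mPLM, and the hidden states, weights, and labels below are made-up values.

```python
def sentence_score(last_hidden_states, weight, bias):
    """Linear head on the first ([CLS]) position, with no activation function."""
    h_cls = last_hidden_states[0]
    return sum(w * h for w, h in zip(weight, h_cls)) + bias

def mse_loss(predictions, labels):
    """Mean squared error between predicted and gold quality scores."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Toy example with hidden size 2 and a sequence of two positions.
hidden = [[1.0, 2.0], [9.0, 9.0]]   # only the first row is used by the head
weight, bias = [0.5, -0.25], 0.1    # hypothetical trainable parameters
pred = sentence_score(hidden, weight, bias)
print(pred, mse_loss([pred, 0.5], [0.0, 1.0]))
```

Only the [CLS] vector enters the head, so information from both sentences must reach that position through the encoder's attention layers.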
We found that the model has a diverse range of performance fluctuations depending on the seed value, and we attempted to reduce the effect of the seed value on the general performance of the model. To achieve this, we conduct five experiments using the same model and compare the average values, as well as the minimum and maximum performance values, thereby increasing the reliability of the experimental results.
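The evaluation protocol of repeated runs summarized by average, minimum, and maximum can be sketched as below, using the Pearson correlation coefficient as the metric reported in the result tables. The per-seed values are placeholders for illustration, not results from this study.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between predictions and labels."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def summarize_runs(scores):
    """Average, minimum, and maximum over runs with different seeds."""
    return {"avg": mean(scores), "min": min(scores), "max": max(scores)}

# Placeholder correlations from five hypothetical seeds.
per_seed = [0.10, 0.20, 0.30, 0.20, 0.20]
print(summarize_runs(per_seed))
print(pearson([0.1, 0.4, 0.3], [0.0, 0.5, 0.2]))
```

Reporting the minimum and maximum alongside the average shows how much of a difference between two models could be explained by seed variance alone.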

Experimental Results for Question 1

Sub-Task 1
To check which model out of various mPLMs performs well for the QE task, we raise question 1, and proceed with finetuning using mPLMs. The experimental results for the QE of sub-task 1 (i.e., the direct assessment at the sentence-level) are shown in Table 2. As a result of the experiment, XLM-TLM showed the highest performance for sub-task 1 with a Pearson correlation coefficient of 0.442. In terms of the minimum and average performances, this system consistently demonstrated the highest performance compared to the other models. To investigate the cause of this result, we need to focus on the input data of XLM-TLM in the pretraining process.
The XLM-TLM model utilizes parallel data during pretraining and can refer to the context of either side when predicting the source-and target-side masked words. Likewise, in the QE field, the concatenating sentences of the source and target language are provided as an input to the model. This is similar to the form of the input for the XLM-TLM model in that it provides sentences in both languages as the input, while the other models use the mono data of multiple languages. According to Lample and Conneau [8], when predicting a masked word during XLM-TLM learning, the model can be encouraged to align the source and target language representations by attending the translated sentence along with the surrounding masked word. Therefore, when using the aligned representation derived between the source and target languages in the XLM-TLM model for QE, it is possible to infer what part of the translated sentence is wrong. The model with the second highest average performance is the mBERT model. This model provided approximately 0.012 less than that of the first-ranked model and demonstrates a comparable performance. mBART did not show a strong performance in the regression task, but the maximum value only showed a difference of about 0.005 compared to the mBERT model. Both models apply various noising schemes during pretraining, and it can be predicted that this strategy will help improve their performance.
In the case of XLM-R-large, many research groups that participated in WMT20 used this model; however, for sub-task 1, it was not ranked high. When comparing the average Pearson correlation coefficients of the models based on XLM, XLM-MLM-17 was 0.021 higher than that of XLM-MLM-100, and XLM-MLM, which learned only English and German, showed the lowest performance. XLM-MLM-17 and XLM-MLM-100 are approximately twice the size of XLM-MLM considering the number of layers and hidden states, etc. and the languages were also expanded to 17 and 100 languages, respectively. It can be inferred that the number of languages and model capacity helped to improve the performance for QE.
To answer question 1, we refer back to the question we posed: which mPLM is best for QE tasks? The XLM-TLM model, which learned cross-lingual understanding, performed the best in sub-task 1.

Sub-Task 2
The finetuning results for sub-task 2 (sentence-level post-editing effort) are shown in Table 3. Ranked by average Pearson correlation coefficient in descending order, the models were XLM-TLM, XLM-R-large, mBART, XLM-R-base, mBERT, XLM-MLM-17, XLM-MLM-100, XLM-MLM, and XLM-CLM. XLM-TLM achieved the highest average, minimum, and maximum Pearson correlation coefficients, consistent with the results for sub-task 1. As analyzed for sub-task 1, XLM-TLM was induced to learn alignment information for language pairs from a parallel corpus, and this likely contributes substantially to its performance on QE, which requires knowledge of the relationships between languages.
XLM-R-large performed second best in sub-task 2, with a fairly comparable average Pearson correlation coefficient of 0.498. XLM-R-large is the most recent of the mPLMs considered in this study. As mentioned in Section 3.3, it achieved state-of-the-art performance among cross-lingual models by increasing the number of parameters to accommodate the large amount of training data and to mitigate the curse of multilinguality. Nevertheless, because XLM-R was pretrained on monolingual corpora in an unsupervised manner, it did not learn the relationship between source and target sentences. In QE, the source sentence and the MT output are referenced together to determine which parts have been translated incorrectly; XLM-R-large therefore fell short of XLM-TLM.
Although mBART is a sequence-to-sequence model, it ranked third in this regression task, above every XLM model except XLM-TLM. mBART extends MLM with two pretraining noising schemes, text infilling and sentence permutation, and obtained an average Pearson correlation coefficient of 0.463. This result was substantially higher than those of XLM-MLM (0.334), XLM-MLM-17 (0.415), and XLM-MLM-100 (0.409), which used only MLM. This indicates that the additional noising strategies of mBART had a positive effect on QE performance during finetuning. mBERT obtained an average Pearson correlation coefficient of 0.417 in sub-task 2, which is not particularly high relative to its sub-task 1 results. Among the XLM variants, XLM-MLM-17 performed slightly better than XLM-MLM-100 (as in sub-task 1), while XLM-CLM ranked below XLM-MLM, which had exhibited the lowest performance in sub-task 1.
To conclude sub-task 2, we return to the question we posed: which mPLM is best for QE tasks? The XLM-TLM model also performed best in sub-task 2.
[14], Ranasinghe et al. [35] adopted a prior structure as an input, while Moura et al. [4] and Kepler et al. [33] adopted a posterior structure. Although decent performance can be achieved by adopting these structures, the selection of an input structure has not been sufficiently investigated. In other words, clear criteria for constructing an adequate input structure have not yet been presented. Here, we focus on the inconsistent input structures used in current QE studies and quantitatively analyze the differences that arise from adopting different input structures.
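The two input orders at issue can be illustrated with a minimal sketch. The function name `build_input` and the special tokens `<s>` and `</s>` are placeholders chosen for illustration only; each mPLM's tokenizer defines its own special tokens and sentence-pair format, and this is not the implementation used in the cited studies. The sketch also does not assume which order corresponds to the prior or the posterior structure.

```python
from typing import List


def build_input(first: List[str], second: List[str],
                cls: str = "<s>", sep: str = "</s>") -> List[str]:
    """Concatenate two token lists into one sequence: cls, first, sep, second, sep.

    The special tokens are hypothetical placeholders, not those of a specific model.
    """
    return [cls, *first, sep, *second, sep]


# Toy, pre-tokenized example (illustrative data only).
source = ["I", "like", "tea"]        # source sentence
mt_output = ["Ich", "mag", "Tee"]    # machine translation output

source_first = build_input(source, mt_output)   # source sentence, then MT output
mt_first = build_input(mt_output, source)       # MT output, then source sentence

print(source_first)
print(mt_first)
```

Both sequences contain exactly the same tokens; only the segment order differs, which is the single factor varied when comparing input structures.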

Experimental Results for Question 2
To check whether the order of the input sentences affects performance during QE finetuning, we pose question 2 and compare the original sentence order with the reversed order when constructing the input sequence. The experimental results for sub-task 1 are shown in Table 4. Simply reversing the order of the input sentences changes model performance. In this table, Avg Diff denotes the difference between the average Pearson correlation coefficients of the reversed and original input orders; a positive value means that the reversed order performed better. When the input sentence order was reversed, the average Pearson correlation coefficient of XLM-R-large improved by 0.032 and that of XLM-MLM-100 improved by 0.008. For all other models, however, performance deteriorated. Figure 1 likewise confirms that, overall, reversing the input order did not improve model performance in sub-task 1.
Conversely, in sub-task 2, reversing the input sentences yielded better overall performance. As shown in Table 5 and Figure 1, only two models, XLM-CLM and XLM-MLM-17, declined in average Pearson correlation coefficient, while all other models improved. The performance fluctuations were particularly large for XLM-TLM and mBERT. These two models also showed the highest variation in sub-task 1, and they can therefore be said to respond most sensitively to the input sentence order. The models with the smallest performance fluctuations were XLM-MLM-17 and mBART. mBART also changed little in sub-task 1, so its performance did not vary significantly in response to the input structure.
We return to the question we asked: does the input order of the source sentence and the MT output affect model performance? These experiments show that the effect of input order varies by sub-task. We can therefore answer that the input structure is a factor that affects model performance, and it must be considered before conducting such experiments.
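The comparison above can be sketched in a few lines. This is a minimal illustration, assuming that the average, minimum, and maximum Pearson correlation coefficients are taken over several finetuning runs and that Avg Diff is the reversed-order average minus the original-order average (the sign convention under which the reported +0.032 is an improvement). The function names and all numbers below are hypothetical and are not results from this study.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and gold scores.

    Assumes both sequences have equal length and non-zero variance.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def summarize(runs: Sequence[float]) -> Dict[str, float]:
    """Average, minimum, and maximum Pearson coefficient over runs."""
    return {"avg": mean(runs), "min": min(runs), "max": max(runs)}


def avg_diff(original_runs: Sequence[float], reversed_runs: Sequence[float]) -> float:
    """Reversed-order average minus original-order average (assumed sign)."""
    return mean(reversed_runs) - mean(original_runs)


# Hypothetical per-run Pearson coefficients for one model (not from the paper).
original_runs = [0.40, 0.42, 0.44]
reversed_runs = [0.43, 0.45, 0.47]

print(summarize(original_runs))
print(summarize(reversed_runs))
print(round(avg_diff(original_runs, reversed_runs), 3))
```

Under this convention, a positive value means that the reversed input order helped a given model, and a negative value means that it hurt.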

Conclusions
Most recent QE studies apply data augmentation with finetuning based on a state-of-the-art large-scale mPLM, such as XLM-R, to achieve high performance on a WMT shared task. In this study, unlike typical QE research that focuses on shared-task competitions, we conducted a pure performance comparison between various mPLMs. The experiments confirmed that the XLM-TLM model performed best on both sub-tasks and that the induced learning of alignment between languages during pretraining had a positive impact. Additionally, we conducted QE experiments using mBART for the first time and found that its additional noising schemes had a positive effect on QE performance, which confirms the feasibility of using the mBART model in further QE research. We also demonstrated that the order of the source sentence and its MT output in the input sequence can affect model performance. In the future, we will further investigate data-centric issues that are not model-based [43,44]. By filtering data based on the HTER score, we will explore which score ranges contribute significantly to model performance and provide a basis for future data-centric research on QE. In addition, we plan to conduct an in-depth study on QE for low-resource languages and to study a methodology that can automatically generate data using semi-supervised learning.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: WMT20 English-German QE dataset: http://www.statmt.org/wmt20/quality-estimation-task.html (accessed on 15 July 2021).

Conflicts of Interest: The authors declare no conflicts of interest.