# Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

## 1. Introduction

- Which mPLM is best for QE sub-tasks?
- Does the input order of the source sentence and the MT output sentence affect the performance of the model?

- We conduct comparative experiments on finetuning mPLMs for the QE task. Unlike research that aims to improve performance in the WMT shared-task competition, this quantitative analysis allows us to revisit the pure performance of mPLMs for QE. To the best of our knowledge, we are the first to conduct such research;
- Through a comparative analysis concerning how to construct an appropriate input structure for QE, we reveal that the performance can be improved by simply changing the input order of the source sentence and the MT output;
- In the process of finetuning mPLMs, we only use data officially distributed in WMT20 (without external knowledge or data augmentation) and use the official test set to ensure objectivity for all experiments.
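
The input-order comparison mentioned in the contributions can be pictured with a minimal sketch. The helper name `build_qe_input` and the `[CLS]`/`[SEP]` markers are illustrative assumptions, not the setup used in the experiments; each mPLM's tokenizer defines its own special tokens and would normally insert them itself.

```python
def build_qe_input(src: str, mt: str, reverse: bool = False,
                   cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    """Join a source sentence and its MT output into one sequence-pair string.

    reverse=False places the source first (original order);
    reverse=True places the MT output first (reverse order).
    The cls/sep markers are placeholders, not any specific model's tokens.
    """
    first, second = (mt, src) if reverse else (src, mt)
    return f"{cls} {first} {sep} {second} {sep}"


# Toy sentences, invented purely for illustration.
original_order = build_qe_input("Das ist ein Test.", "This is a test.")
reverse_order = build_qe_input("Das ist ein Test.", "This is a test.", reverse=True)
```

Only the position of the two segments changes between the two variants; the regression target and everything else in the finetuning setup would stay the same.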

## 2. Related Work and Background

## 3. Multilingual Pretrained Language Models for QE

#### 3.1. Multilingual BERT

#### 3.2. Cross-Lingual Language Model

#### 3.3. XLM-RoBERTa

#### 3.4. Multilingual BART

## 4. Brief Introduction of the WMT20 QE Sub-Tasks

#### 4.1. Sub-Task 1

#### 4.2. Sub-Task 2

## 5. Question 1: Which mPLM Is Best for QE Tasks?

#### 5.1. Dataset Details

#### 5.2. Model Details

- **XLM-R-base**: Pretrained with 220M parameters, 12 layers, 8 heads, and 768 hidden states.
- **XLM-R-large**: Pretrained with 550M parameters. The hidden states were expanded to 1024, and 24 layers and 16 heads were used, which is twice the scale of the base model.
- **mBERT**: 110M parameters, 12 layers, 768 hidden states, and 12 heads.
- **mBART**: Pretrained with 610M parameters, 24 layers, 1024 hidden states, and 16 heads.
- **XLM-CLM**: A pretrained CLM for English and German, with 6 layers, 1024 hidden states, and 8 heads.
- **XLM-MLM**: A pretrained MLM for English and German, with 6 layers, 1024 hidden states, and 8 heads.
- **XLM-MLM-17**: The MLM expanded to 17 languages. It was trained with 570M parameters, 16 layers, 1280 hidden states, and 16 heads.
- **XLM-MLM-100**: The MLM expanded to 100 languages. It was trained with 570M parameters, 16 layers, 1280 hidden states, and 16 heads.
- **XLM-TLM**: TLM pretraining for 15 languages, with 12 layers, 1024 hidden states, and 8 heads.

#### 5.3. Experimental Results for Question 1

##### 5.3.1. Sub-Task 1

##### 5.3.2. Sub-Task 2

## 6. Question 2: Does the Input Order of the Source Sentence and the MT Output Sentence Affect the Performance of the Model?

#### 6.1. Revisiting the QE Input Structure

#### 6.2. Experimental Results for Question 2

## 7. Conclusions

**Figure 1.** Comparison of the average Pearson correlation coefficients of original and reverse order inputs in sub-tasks 1 and 2.

**Table 1.** Summary of the QE dataset. We denote the number of instances in each dataset as # Instance. # SRC Token and # MT Token refer to the number of tokens in source- and target-side sentences for each dataset, respectively.

| | Sub-Task 1 Train | Sub-Task 1 Dev | Sub-Task 1 Test | Sub-Task 2 Train | Sub-Task 2 Dev | Sub-Task 2 Test |
|---|---|---|---|---|---|---|
| # Instance | 7000 | 1000 | 1000 | 7000 | 1000 | 1000 |
| # SRC Token | 98,127 | 14,102 | 14,043 | 114,980 | 16,519 | 16,371 |
| # MT Token | 97,453 | 14,003 | 14,019 | 112,342 | 16,160 | 16,154 |
| Average Score | −0.008 | −0.049 | 0.040 | 0.318 | 0.312 | 0.312 |
| Median Score | 0.162 | 0.211 | 0.319 | 0.3 | 0.295 | 0.286 |

| Model | Pearson Max | Pearson Min | Pearson Average | MAE Min | MAE Max | MAE Average | RMSE Min | RMSE Max | RMSE Average |
|---|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 0.380 | 0.280 | 0.328 | 0.459 | 0.479 | 0.473 | 0.648 | 0.679 | 0.665 |
| XLM-R-large | 0.338 | 0.242 | 0.298 | 0.480 | 0.520 | 0.495 | 0.685 | 0.713 | 0.698 |
| mBERT | 0.407 | 0.322 | 0.382 | 0.452 | 0.468 | 0.458 | 0.642 | 0.672 | 0.655 |
| mBART | 0.402 | 0.306 | 0.351 | 0.465 | 0.534 | 0.490 | 0.642 | 0.729 | 0.677 |
| XLM-CLM | 0.296 | 0.168 | 0.253 | 0.474 | 0.516 | 0.489 | 0.683 | 0.703 | 0.691 |
| XLM-MLM | 0.219 | 0.192 | 0.206 | 0.493 | 0.526 | 0.503 | 0.693 | 0.728 | 0.708 |
| XLM-MLM-17 | 0.318 | 0.143 | 0.253 | 0.465 | 0.525 | 0.490 | 0.670 | 0.731 | 0.696 |
| XLM-MLM-100 | 0.256 | 0.191 | 0.232 | 0.482 | 0.536 | 0.498 | 0.690 | 0.702 | 0.695 |
| XLM-TLM | 0.442 | 0.336 | 0.394 | 0.451 | 0.683 | 0.517 | 0.631 | 0.805 | 0.681 |
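
The three metrics reported in the results tables (Pearson correlation, MAE, and RMSE) follow their standard definitions, which can be sketched in plain Python as below. The function names and the toy score lists are assumptions for illustration only; they are not the evaluation script used for the reported numbers.

```python
import math
from typing import Sequence


def pearson(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation between predicted and gold quality scores."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


def mae(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def rmse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(pred))


# Toy values, invented for illustration; not taken from the QE data.
toy_pred = [1.0, 2.0, 3.0]
toy_gold = [2.0, 4.0, 6.0]
toy_scores = (pearson(toy_pred, toy_gold), mae(toy_pred, toy_gold), rmse(toy_pred, toy_gold))
```

Higher is better for Pearson, while lower is better for MAE and RMSE, which may explain why the tables list Max first for Pearson and Min first for the two error metrics.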

| Model | Pearson Max | Pearson Min | Pearson Average | MAE Min | MAE Max | MAE Average | RMSE Min | RMSE Max | RMSE Average |
|---|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 0.456 | 0.438 | 0.448 | 0.146 | 0.156 | 0.150 | 0.189 | 0.204 | 0.195 |
| XLM-R-large | 0.507 | 0.489 | 0.498 | 0.141 | 0.155 | 0.145 | 0.178 | 0.204 | 0.186 |
| mBERT | 0.435 | 0.389 | 0.417 | 0.149 | 0.182 | 0.160 | 0.189 | 0.230 | 0.204 |
| mBART | 0.475 | 0.452 | 0.463 | 0.142 | 0.148 | 0.144 | 0.179 | 0.195 | 0.184 |
| XLM-CLM | 0.309 | 0.275 | 0.298 | 0.158 | 0.161 | 0.159 | 0.196 | 0.200 | 0.198 |
| XLM-MLM | 0.358 | 0.303 | 0.334 | 0.156 | 0.160 | 0.158 | 0.194 | 0.199 | 0.197 |
| XLM-MLM-17 | 0.433 | 0.408 | 0.415 | 0.149 | 0.157 | 0.154 | 0.188 | 0.192 | 0.190 |
| XLM-MLM-100 | 0.421 | 0.381 | 0.409 | 0.152 | 0.164 | 0.158 | 0.190 | 0.207 | 0.198 |
| XLM-TLM | 0.522 | 0.498 | 0.510 | 0.152 | 0.222 | 0.177 | 0.199 | 0.273 | 0.227 |

| Model | Pearson Max | Pearson Min | Pearson Average | Pearson Avg Diff | MAE Min | MAE Max | MAE Average | RMSE Min | RMSE Max | RMSE Average |
|---|---|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 0.365 | 0.272 | 0.326 | −0.002 | 0.462 | 0.495 | 0.481 | 0.653 | 0.698 | 0.670 |
| XLM-R-large | 0.394 | 0.260 | 0.330 | +0.032 | 0.447 | 0.508 | 0.479 | 0.644 | 0.729 | 0.681 |
| mBERT | 0.402 | 0.106 | 0.278 | −0.104 | 0.453 | 0.553 | 0.498 | 0.648 | 0.762 | 0.700 |
| mBART | 0.388 | 0.277 | 0.346 | −0.005 | 0.436 | 0.543 | 0.478 | 0.664 | 0.693 | 0.674 |
| XLM-CLM | 0.268 | 0.147 | 0.197 | −0.056 | 0.483 | 0.515 | 0.502 | 0.688 | 0.714 | 0.698 |
| XLM-MLM | 0.250 | 0.128 | 0.177 | −0.029 | 0.517 | 0.557 | 0.540 | 0.694 | 0.751 | 0.727 |
| XLM-MLM-17 | 0.267 | 0.172 | 0.230 | −0.023 | 0.482 | 0.502 | 0.491 | 0.682 | 0.713 | 0.693 |
| XLM-MLM-100 | 0.314 | 0.189 | 0.260 | +0.028 | 0.503 | 0.587 | 0.544 | 0.666 | 0.748 | 0.709 |
| XLM-TLM | 0.234 | 0.141 | 0.193 | −0.201 | 0.563 | 1.115 | 0.896 | 0.739 | 1.237 | 1.061 |
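
The Avg Diff column appears to be the average Pearson of the reverse-order run minus the average Pearson of the corresponding original-order run. The short check below works under that assumption, using average Pearson values copied from the table above and from the first results table (the one without an Avg Diff column and with MAE values near 0.5). The pairing of those two tables and the helper name `avg_diff` are assumptions made for illustration.

```python
# Average Pearson per model, copied from the first results table
# (assumed here to be the original-order run that this table is compared against).
original_avg = {
    "XLM-R-base": 0.328, "XLM-R-large": 0.298, "mBERT": 0.382,
    "mBART": 0.351, "XLM-CLM": 0.253, "XLM-MLM": 0.206,
    "XLM-MLM-17": 0.253, "XLM-MLM-100": 0.232, "XLM-TLM": 0.394,
}

# Average Pearson per model, copied from the table that includes Avg Diff.
reverse_avg = {
    "XLM-R-base": 0.326, "XLM-R-large": 0.330, "mBERT": 0.278,
    "mBART": 0.346, "XLM-CLM": 0.197, "XLM-MLM": 0.177,
    "XLM-MLM-17": 0.230, "XLM-MLM-100": 0.260, "XLM-TLM": 0.193,
}


def avg_diff(original: dict, reverse: dict) -> dict:
    """Reverse-order average minus original-order average, rounded to 3 decimals."""
    return {model: round(reverse[model] - original[model], 3) for model in original}


diffs = avg_diff(original_avg, reverse_avg)
for model, delta in diffs.items():
    print(f"{model:12s} {delta:+.3f}")
```

Under this reading, a positive value means the reverse input order gave a higher average correlation for that model, and a negative value means the original order did.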

| Model | Pearson Max | Pearson Min | Pearson Average | Pearson Avg Diff | MAE Min | MAE Max | MAE Average | RMSE Min | RMSE Max | RMSE Average |
|---|---|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 0.464 | 0.453 | 0.460 | +0.012 | 0.144 | 0.153 | 0.148 | 0.184 | 0.199 | 0.191 |
| XLM-R-large | 0.523 | 0.501 | 0.509 | +0.011 | 0.140 | 0.144 | 0.142 | 0.178 | 0.188 | 0.183 |
| mBERT | 0.449 | 0.434 | 0.441 | +0.024 | 0.147 | 0.179 | 0.162 | 0.185 | 0.229 | 0.207 |
| mBART | 0.478 | 0.463 | 0.469 | +0.006 | 0.141 | 0.151 | 0.145 | 0.179 | 0.196 | 0.187 |
| XLM-CLM | 0.297 | 0.283 | 0.287 | −0.011 | 0.159 | 0.162 | 0.160 | 0.197 | 0.205 | 0.199 |
| XLM-MLM | 0.364 | 0.333 | 0.351 | +0.017 | 0.153 | 0.159 | 0.156 | 0.193 | 0.200 | 0.196 |
| XLM-MLM-17 | 0.420 | 0.405 | 0.411 | −0.004 | 0.154 | 0.218 | 0.172 | 0.190 | 0.273 | 0.217 |
| XLM-MLM-100 | 0.442 | 0.405 | 0.417 | +0.008 | 0.151 | 0.183 | 0.161 | 0.187 | 0.220 | 0.196 |
| XLM-TLM | 0.552 | 0.526 | 0.538 | +0.028 | 0.156 | 0.168 | 0.163 | 0.204 | 0.218 | 0.212 |

