2. Background
It has been shown in a number of studies that the use of MT by professional translators often leads to productivity gains which are, in turn, related to the quality of the MT segments provided to the translators [5,6]. To ensure that high-quality segments are presented to the translator, MTQE or other forms of quality assessment have to be employed. The aim of MTQE is to provide an accurate assessment of a given translation without input from a reference translation or a human evaluator. To date, and as pointed out by Turchi et al. [7], “QE research has not been followed by conclusive results that demonstrate whether the use of quality labels can actually lead to noticeable productivity gains in the CAT framework”. This suggests that, despite previous work integrating MTQE into translation workflows, how best to use it remains an open question.
Previous research into the use of MTQE in a professional setting has been carried out by researchers from translation studies and related fields. Most notably, Turchi et al. [7] ran a study similar to ours and investigated whether the use of binary labels (green for a good MT suggestion and red for a bad one) can significantly improve the productivity of translators. The authors used MateCat [8], adapted to provide a single MT suggestion and a red or green label. They chose an HTER [9] of 0.4 as the boundary between post-editing and translating from scratch. HTER (“Human-targeted Translation Error Rate”) is a human-targeted edit-distance measure: it is based on the discrepancies between the raw MT output and a human post-edited version of that output, and thus indirectly reflects the difficulty of the translation. A lower HTER is desirable, as it corresponds to a higher-quality translation. Therefore, all sentences with a predicted HTER score under 0.4 were labelled green, and any sentence with an HTER over 0.4 was labelled red. Their dataset was an English user manual in the IT domain, translated into Italian using the phrase-based SMT system Moses [10] and then post-edited. In total, their dataset consisted of 1389 segments, of which 542 were used to train the MTQE engine and 847 were used for testing. For each segment, they gathered two instances: one for the scenario in which the translator was shown the estimated quality of the translation, and one in which the translator did not have this information for the MT output. While they observed a slight increase in productivity (1.5 s per word), they concluded that this increase was not statistically significant across the dataset. However, further investigation of their data showed a statistically significant percentage of gains for medium-length suggestions with HTER > 0.1.
Our user study follows in the footsteps of this work, with several differences. We use the Fuzzy Match Score (FMS) instead of more traditional MT evaluation metrics, as translators are more used to working with TM leveraging and fuzzy matches [11,12]. A fuzzy match score quantifies how closely two strings match, ranging from 0 (no match) to 100 (identical) depending on how similar the hypothesis and reference sentences are. While several algorithms for FMS exist, we use the fuzzy value computed by Okapi Rainbow (http://okapi.opentag.com/, accessed on 9 September 2021). The comparison is based on 3-grams at the character level. Furthermore, our experimental setup includes two “neutral” conditions, also shown to the translators as different colours: one in which they translate from scratch, and one in which they are shown MT but no MTQE information and are asked to decide whether to post-edit or translate from scratch. Most significantly, our study examines the difference between the effects of good MTQE and those of mediocre or even inaccurate MTQE.
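To make the HTER-based labelling used by Turchi et al. [7] concrete, recall that HTER is conventionally computed as the number of edit operations (insertions, deletions, substitutions, and shifts) needed to turn the MT output into its human post-edited version, normalised by the length of that version; the predicted score is then binarised at the 0.4 threshold (how the exact boundary value is handled is not stated):

\mathrm{HTER} = \frac{\#\,\text{edits (insertions, deletions, substitutions, shifts)}}{\#\,\text{words in the post-edited version}}, \qquad \text{label} = \begin{cases} \text{green (post-edit)} & \text{if } \widehat{\mathrm{HTER}} < 0.4,\\ \text{red (translate from scratch)} & \text{otherwise.} \end{cases}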
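The character 3-gram comparison underlying our FMS can likewise be sketched as in the following Python snippet. This is only an illustrative approximation (a Dice coefficient over character 3-gram counts, scaled to 0-100); the exact fuzzy value computed by Okapi Rainbow may differ.

from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    # Multiset of character n-grams; very short strings yield a single n-gram.
    text = text.lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))

def fuzzy_match_score(hypothesis: str, reference: str, n: int = 3) -> float:
    # Illustrative fuzzy match score in [0, 100] based on character n-gram overlap
    # (Dice coefficient); not necessarily the formula used by Okapi Rainbow.
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    overlap = sum((hyp & ref).values())          # shared n-grams (with multiplicity)
    total = sum(hyp.values()) + sum(ref.values())
    return 100.0 * 2.0 * overlap / total if total else 0.0

# Near-identical strings score close to 100; unrelated strings score close to 0.
print(fuzzy_match_score("Open the configuration file.", "Open the configuration files."))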
Also in 2015, Moorkens et al. [13] investigated the accuracy of human estimates of post-editing effort and the extent to which they mimic actual post-editing effort. They also researched how much the display of confidence scores (MTQE) influenced post-editing behaviour. The authors used two different groups of participants. The first consisted of six members of staff and researchers who were not students of translation or professional translators. The second group consisted of 33 translation students, at both master’s and undergraduate levels. The first group was presented with a set of 80 machine-translated segments from two Wikipedia articles describing Paraguay and Bolivia. The sentences were translated into Portuguese using Microsoft Bing Translator (https://www.bing.com/translator, accessed on 9 September 2021). The experiment consisted of three phases. In the first phase, the participants were asked to classify the MT output according to the following scale:
Segment requires a complete re-translation;
Segment requires some post-editing; and
Segment requires little or no post-editing.
In the second phase, the same participants were asked to post-edit these segments. To prevent the participants from remembering their ratings, two weeks were left between the evaluation phase and the post-editing phase of the experiment. The researchers then used the ratings collected in the first phase to create a set of MTQE labels for the segments, and in the third phase, they presented the segments with these labels to the second group of participants (undergraduate and master’s students). The second group was asked to post-edit the segments using the MTQE labels. While their study spans only 80 segments, their findings suggest that “the presentation of post-editing effort indicators in the user interface appears not to impact on actual post-editing effort”. Their conclusions contradict our findings, which we will expand on later in this paper.
Moorkens and Way [14] compared the use of translation memory (TM) to that of MT among translators. They presented seven translators with 60 segments of English-German translations, extracted from the documentation of an open-source program called freeCAD (https://www.freecadweb.org/, accessed on 9 September 2021) and from the Wikipedia entry on the same topic. Their results show that low-quality MT matches are not useful to the translators in over 36% of cases. Furthermore, the translators described these suggestions as “irritating”. In contrast, TM matches were always found to be useful. Moorkens and Way [14] conclude that their findings suggest that “MT confidence measures need to be developed as a matter of urgency, which can be used by post-editors to wrest control over what MT outputs they wish to see, and perhaps more importantly still, which ones should be withheld”.
In their more recent work, Moorkens and O’Brien [15] attempt to determine the specific user interface needs for post-editors of MT. The authors conduct a survey of translators and report that 81% expressed the need for confidence scores for each target text segment from the MT engine. This result underlines the relevance of the study reported in this paper, as we investigate precisely the effect of showing MTQE to translators during post-editing tasks.
Teixeira and O’Brien [16] investigated the impact of MTQE on the post-editing effort of 20 English to Spanish professional translators post-editing four texts from the WMT13 news dataset (http://www.statmt.org/wmt13/, accessed on 9 September 2021). They used four scenarios: No MTQE, Accurate MTQE, Inaccurate MTQE, and Human Quality Estimation, the last of which was calculated using the direct assessment method proposed by Graham et al. [17]. Their goal was to determine the impact of the different modes of MTQE on the time spent (temporal effort), the number of keystrokes (physical effort), and the gaze behaviour (cognitive effort). Their results showed no significant differences in terms of cognitive effort, nor were there significant differences in the average number of keystrokes or in the time spent per type of MTQE. However, there were significant differences across MTQE score levels: the higher the score, the less time was spent and the fewer keys were typed, regardless of the MTQE type. They concluded that displaying MTQE scores was not necessarily better than displaying no scores. As we shall see later in this paper, their conclusion seems to contradict our findings. This could be due to the nature of the task, as we focused on technical translations, whereas the experiment carried out by Teixeira and O’Brien [16] focused on news texts. As we will see later on, our results suggest that MTQE could be beneficial for post-editing tasks. (It has to be mentioned that Teixeira and O’Brien [16] never published a full paper; most of the information included here is based on the authors’ slides from MT Summit XVI and personal communication. For this reason, it is impossible to know precisely all the settings of their experiment.)
Finally, multiple recently published studies [18,19] confirm that PE continues to be a vibrant research topic at the intersection of translation process research, natural language processing, and human-computer interaction. Current relevant research directions include multi-modal PE [20,21], in which PE systems take several forms of input, such as speech, touch, and gaze, in addition to the usual keystrokes, and automatic post-editing (APE) [22,23,24], which attempts to automate the corrections that human translators commonly make to the MT output during the PE process.
To the best of our knowledge, our paper is the first attempt to measure the impact of MTQE in a realistic scenario that accounts for the accuracy of the predictions and compares against both translation from scratch and MTPE without MTQE information. While previous experimental setups [7,16] investigated the impact of using accurate MTQE predictions, the novelty of our approach lies in also accounting for what happens when the machine is wrong and predicts the quality of the MT output inaccurately. In addition, we use a dataset from a genre that professional translators commonly work with.