An Oblivious Approach to Machine Translation Quality Estimation

Abstract: Machine translation (MT) is being used by millions of people daily, and therefore evaluating the quality of such systems is an important task. While human expert evaluation of MT output remains the most accurate method, it is not scalable by any means. Automatic procedures that perform the task of Machine Translation Quality Estimation (MT-QE) are typically trained on a large corpus of source–target sentence pairs, which are labeled with human judgment scores. Furthermore, the test set is typically drawn from the same distribution as the training set. However, recently, interest in low-resource and unsupervised MT-QE has gained momentum. In this paper, we define and study a further restriction of the unsupervised MT-QE setting that we call oblivious MT-QE. Besides having no access to human judgment scores, the algorithm has no access to the test text's distribution. We propose an oblivious MT-QE system based on a new notion of sentence cohesiveness that we introduce. We tested our system on standard competition datasets for various language pairs. In all cases, the performance of our system was comparable to that of the non-oblivious baseline system provided by the competition organizers. Our results suggest that reasonable MT-QE can be carried out even in the restrictive oblivious setting.


Introduction
Machine translation (MT) is the task of translating text from one natural language to another. Starting in the 1950s, automatic approaches to text translation developed and matured to a level where one can practically use MT. The MT industry started with rule-based MT systems [1], then statistical MT systems [2] and hybrid MT systems [3], and now we are in the era of neural systems [4,5].
As MT is becoming a prevalent mode of translation, its quality is becoming increasingly more critical. The most straightforward option for judging machine translation quality is by human evaluation. The task is performed by experts in translation and linguistics, evaluating the input-output pair from various perspectives such as fluency, adequacy, accuracy, etc.
From a practical point of view, manual evaluation, performed by translation experts, is expensive and slow. Instead, it is desirable to have quick and cheap automatic judgments that approximate human judgment. Machine translation quality evaluation or estimation (MT-QE) refers to an algorithm that produces an evaluation score, which tells the user how good a translation is. The first automatic evaluation methods counted word- and sentence-based errors that can be detected automatically, while general text-level aspects (such as fluency or coherence) were not taken into account. However, in the last decade, new MT evaluation systems were developed to address these aspects.
MT-QE algorithms are used commonly during the development stages of MT systems to measure improvement. They are also used to compare different MT systems.
The "golden standard" metrics in the MT community include BLEU [6], NIST [7], METEOR [8], chrF [9], and TER [10]. Such metrics need reference translations as they compare the MT output with the references and report the comparison scores. If references are available, these metrics can be used to evaluate the output of any number of systems quickly, without the need for human intervention. However, in many situations, references are not available or expensive to obtain. This is, in particular, a problem for less-used languages. The task of reference-less MT-QE is the focus of this work.
Most of the MT-QE algorithms that work without a reference solve the task as a classification problem and are trained in a supervised-learning manner [11][12][13][14][15][16], to mention a few. The training set typically consists of a large corpus of source-target pairs, along with human judgment scores. In many cases, the test set is sampled from the same training distribution, limiting the evaluation's generality.
The validation of the algorithm's score is carried out via correlation coefficients with manual judgment scores. For a meaningful evaluation, a large manual tagged dataset is required (both for testing and training).
When only text is provided for training but no human judgment scores, it is referred to as an unsupervised learning setting. Examples of unsupervised algorithms include the works in [17][18][19][20][21][22][23][24]. This still does not preclude training on the same distribution of text as the test. Thus, the algorithm can use valuable features such as TFIDF (term frequency-inverse document frequency), n-gram statistics, or word embedding tailored to that specific distribution. In fact, in the WMT competitions (WMT is the Workshop on MT which is part of the EMNLP conference), such features are even provided by the organizers.

The Oblivious Setting
There is no formal name or distinction between the unsupervised MT-QE setting and the one where the algorithm has no access whatsoever to the text distribution on which its performance is being tested. In particular, no human judgment scores are provided. In this work, we make this distinction and define the latter as the oblivious learning setting. We chose the term "oblivious" as it is commonly used in the literature to describe a setting where the algorithm has limited access to the input, e.g., the famous Oblivious Transfer protocol [25] or the Oblivious-Caching algorithms [26].
Oblivious learning makes sense in cases where access to the text's distribution is impossible or too expensive. Furthermore, the oblivious setting can serve as a benchmark to test the robustness of an MT-QE algorithm to various degrees of noise in the training step. For example, when the unsupervised algorithm [17] was run in oblivious mode by training on general text, its performance degraded by almost 50%.
Another aspect that is related to the oblivious setting is the utilization of a parallel corpus for training the MT-QE algorithm. A parallel corpus consists of two or more monolingual corpora, which are translations of each other. For example, a novel and its translation. To generate such a corpus, corresponding segments, usually sentences or paragraphs, need to be aligned. This should be contrasted with the monolingual setting, where text is available in each language but not coupled or aligned into source-target sentences.
While the oblivious setting does not preclude parallel corpora, the fact that cross-lingual parallel corpora are scarce or non-existent for most of the ∼7000 languages spoken on earth makes this additional restriction practical [27]. Thus, we arrive at the research question that will be studied in this paper: what performance of machine translation quality estimation can be achieved in the oblivious monolingual setting?

Our Contribution
Our first contribution is to make formal the distinction between unsupervised and oblivious MT-QE. Our second contribution is a new MT-QE algorithm that can be executed in oblivious mode, ObliQuE (Oblivious Quality Estimation). The algorithm is inspired by physical systems of particles. We view a sentence as a small system of interacting components (the words represented using word embedding). For each sentence, we compute a cohesiveness factor, κ, which reflects the extent to which the meaning of the entire system is a function of the meanings of its constituents. For example, κ("hot-dog") should be smaller than κ("dog-house"). We take the difference in cohesiveness between source and target sentences as the measure for translation quality. The difference in cohesiveness captures some aspect of adequacy, which is commonly understood as the amount of information (meaning) preserved between the reference and the candidate translation.
Our method is compatible with the oblivious setting as it can use word embedding that was trained on a generic text from the relevant source-target language pair (e.g., Wikipedia). Furthermore, our method can use monolingual data, avoiding, for example, cross-lingual word embedding such as used in [28].
We tested our method on standard benchmarks, covering several source-target language pairs: English with Spanish, German and Japanese, and Japanese-Chinese. The performance of our algorithm was better, for example, than the oblivious version of [17], scoring a Spearman rank correlation of 0.37 compared to 0.22-0.28 of [17] on the English-Spanish WMT'12 dataset. Our algorithm came first for the German-English dataset and third in the Russian-English, on data from the WMT'19 third shared QE task [29]. In that task, the baseline provided by the organizers was an unsupervised algorithm, but both supervised and unsupervised algorithms competed. To the best of our knowledge, none were oblivious. Details in Section 4.
To conclude, we positively answer the research question posed above and confirm that a fair estimation of translation quality may be obtained using a general framework that is not tailored to the distribution of the text at hand, which also uses monolingual corpora. Another advantageous aspect of ObliQuE is the fact that it is an entirely white-box algorithm. The algorithm, described in Section 2, is straightforward to follow and grasp intuitively and contains no hidden tunable parameters that may impede reproducibility.

Methodology
An instance of the MT-QE problem (at the sentence level) is a pair of sentences S (source) and T (target). The output is a score that the algorithm assigns for the quality of the translation S → T.
Our algorithm first maps the words of S (respectively, T) to vectors using a word embedding, obtained, e.g., via word2vec [30]. The vectors corresponding to the s words of S are stacked into a sentence matrix M_S (M_T, respectively), whose rows are the d-dimensional vectors of the word embedding. For example, if S is the sentence "the paper is accepted", and the word embedding of the i-th word w_i in the sentence is the vector v_{w_i}, then M_S is the 4 × d matrix whose i-th row is v_{w_i}. The two sentence matrices are the objects from which the MT-QE score will be extracted, using what we call a cohesiveness measure.
Intuitively, cohesiveness measures the extent to which each word supports the meaning of the sentence. For illustration, consider the following three 5-word sentences: "Dear dear dear dear dear" (same word); "Breakfast, dinner, lunch, milk, egg" (same theme); and "Monster, factory, gym, lake, chair" (random themes). Using the word embedding vectors of [31], we computed the sentence matrix and cohesiveness factor of each sentence. The first sentence scored κ_1 = 1, and indeed each word fully determines the meaning of the sentence; the second scored κ_2 = 0.46 (nearly 50%), which reflects the thematic cohesiveness between the words. The third scored κ_3 = 0.22, roughly 1/#words = 1/5 = 0.2; namely, each word contributes uniformly to the meaning of the sentence, which is what one would expect from a random set of words due to symmetry.
We now proceed with the details of the computation of cohesiveness. First, we define the notion of the "main direction" of the sentence S, which is a single vector that captures the semantic meaning of the sentence. A standard way of computing the main direction is by averaging the vectors of the words in the sentence. This choice led to poor performance, and we replaced it with the leading right singular vector of the sentence matrix M_S.
To better understand this choice, let V = [v_1, . . . , v_d] be the matrix whose columns are M_S's right singular vectors (v_1 is the leading one, our candidate for the main direction), let U be the matrix of left singular vectors, and let σ_1 ≥ σ_2 ≥ . . . ≥ σ_m ≥ 0 be M_S's singular values (we assume that the embedding dimension d is larger than the number of words m in the sentence). The SVD decomposition theorem says that the i-th row of M_S (the vector of the i-th word w_i) can be written as

v_{w_i} = σ_1 U_{i,1} v_1 + err, where err = Σ_{j=2}^{m} σ_j U_{i,j} v_j.

That is, the vector of each word in the sentence is composed of the semantic contribution from v_1 (the "main direction" of the sentence) plus an error term, err. In signal processing, the energy in the direction of a certain singular vector, in our case v_1, is typically taken to be the ratio between its singular value and the sum of singular values. In our case,

κ(M_S) = σ_1 / (σ_1 + σ_2 + . . . + σ_m).

That is, the energy of v_{w_i} in the direction of the sentence (which is represented using v_1) is given by κ(M_S), and this is what we called cohesiveness to begin with. Thus κ(M_S) will be our proxy for cohesiveness.
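The cohesiveness computation above can be sketched in a few lines of numpy. This is a minimal illustration of the κ formula, not the authors' implementation; the sanity checks mirror the "same word" and "random themes" examples, using random vectors in place of real word2vec embeddings.

```python
import numpy as np

def cohesiveness(sentence_matrix: np.ndarray) -> float:
    """kappa(M): leading singular value of the m x d sentence matrix
    divided by the sum of all singular values, i.e. the fraction of the
    sentence's 'energy' carried by the main direction v_1."""
    # numpy returns singular values sorted in descending order.
    singular_values = np.linalg.svd(sentence_matrix, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())

# Sanity checks with stand-in vectors (not real embeddings):
rng = np.random.default_rng(0)
word = rng.normal(size=300)
same_word = np.tile(word, (5, 1))        # like "dear dear dear dear dear"
random_words = rng.normal(size=(5, 300)) # five unrelated word vectors

print(cohesiveness(same_word))     # close to 1: rank-1 matrix
print(cohesiveness(random_words))  # close to 1/5, as symmetry predicts
```

The rank-1 case recovers κ = 1 because all singular values except σ_1 vanish, matching the "same word" example in the text.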

The Algorithm
Our method, which we call ObliQuE (Oblivious Quality Estimation), is described formally below. The procedure receives as input the source sentence S, the target T, a word embedding w_S in the source language and w_T in the target one, and an error function ℓ : R × R → R which measures the difference in cohesiveness between source and target. Our working hypothesis is that the smaller ℓ(x, y), the better the translation. For the evaluation part of this paper we chose ℓ(x, y) = max{x, y}/min{x, y}. The procedure runs as follows: (1) embed S and T using the word embeddings w_S and w_T, respectively; (2) stack the resulting vectors into the sentence matrices M_S and M_T; (3) compute the cohesiveness factors κ(M_S) and κ(M_T); (4) output ℓ(κ(M_S), κ(M_T)). The algorithm is described for the sentence-level QE task; for the task of document-level QE, the algorithm is applied iteratively on the sentences, and the final score is computed as the average of the ℓ-values.
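The sentence-level and document-level procedures can be sketched as follows. This is our reading of the algorithm, not the released code; `toy_embed` is a hypothetical deterministic stand-in for a real word2vec lookup, and ℓ is the max/min choice used in the paper's evaluation.

```python
import numpy as np

def kappa(words, embed):
    """Cohesiveness of a sentence: leading singular value of the
    sentence matrix divided by the sum of singular values."""
    M = np.stack([embed(w) for w in words])   # m x d sentence matrix
    s = np.linalg.svd(M, compute_uv=False)
    return float(s[0] / s.sum())

def oblique_score(src_words, tgt_words, embed_src, embed_tgt):
    """Sentence-level ObliQuE: ell(kappa_S, kappa_T) = max/min.
    Smaller is better; identical cohesiveness gives 1.0."""
    k_s = kappa(src_words, embed_src)
    k_t = kappa(tgt_words, embed_tgt)
    return max(k_s, k_t) / min(k_s, k_t)

def oblique_document_score(sentence_pairs, embed_src, embed_tgt):
    """Document-level QE: average of the sentence-level ell-values."""
    return float(np.mean([oblique_score(s, t, embed_src, embed_tgt)
                          for s, t in sentence_pairs]))

# Hypothetical embedding for demonstration only (seeded by characters,
# so the same word always maps to the same vector).
def toy_embed(word, dim=300):
    rng = np.random.default_rng(sum(ord(c) for c in word))
    return rng.normal(size=dim)

score = oblique_score(["the", "paper", "is", "accepted"],
                      ["the", "paper", "is", "accepted"],
                      toy_embed, toy_embed)
# A sentence "translated" to itself has equal cohesiveness on both
# sides, so the score is exactly 1.0 (the best possible value).
```

In practice, `embed_src` and `embed_tgt` would be lookups into pre-trained monolingual word2vec tables, consistent with the oblivious setting.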

Related Work
MT-QE is typically addressed as a supervised machine learning task where the goal is to predict MT quality without relying on reference translation. Traditional feature-based approaches rely on manually designed features obtained from the source and translated sentences, as well as external resources, such as monolingual or parallel corpora [32].
Currently, the best-performing approaches to QE use NNs to learn useful representations for source and target sentences [14,16,33,34]. A notable example is the Predictor-Estimator (PredEst) model [24], which is based on an encoder-decoder RNN (recurrent NN) architecture (the predictor), trained on parallel data for a word prediction task, and a unidirectional RNN (the estimator) that produces quality estimates by using the context representations generated by the predictor. This method can be run both in supervised and unsupervised modes. Despite achieving good performance, neural-based approaches are resource-heavy and require a significant amount of in-domain parallel corpora and labeled data for training.
Other NN-based algorithms explore internal information from neural models as an indicator of translation quality. They rely on the entropy of attention weights in RNN-based NMT systems [23,35]. However, attention-based indicators perform competitively only when combined with other QE features in a supervised framework.
The few approaches for unsupervised QE that are not based on NNs are inspired by the work on statistical MT and perform significantly worse than supervised approaches [17,22,36]. For example, Etchegoyhen et al. [36] use lexical translation probabilities from word alignment models and language model probabilities; their unsupervised approach averages these features to produce the final score. None of these approaches was tested in an oblivious setting; rather, they computed statistics from the same distribution as the test.
Our approach departs from the NN-based methods in that it is white-box, simple to understand, and has only two parameters (the word embedding and the loss function ℓ). Furthermore, it uses word2vec trained on generic monolingual data and requires no additional training data. It also departs from the statistical approaches as it offers a completely new algorithmic take on the QE problem, which may turn out to be more useful in some settings. As mentioned above, the oblivious version of [17] scored a Spearman rank correlation of 0.22-0.28 compared to our 0.37 on the English-Spanish WMT'12 dataset.

Evaluation
In Sections 4.1 and 4.2, we discuss the performance of our method on standard benchmarks that are used in the literature and compare to both supervised and unsupervised algorithms. In Section 4.3, we explore how our algorithm correlates with the results of BLEU [6], the gold standard in the industry. We do that on two datasets that we generated. Finally, we discuss the robustness of ObliQuE to the choice of parameters, specifically, the pre-trained vectors.
In all the experiments described in Sections 4.1 and 4.2 we use Google's word2vec [31] to embed words in English text, and Wikipedia-trained vectors [37] for words that are in other languages. All vectors are 300-dimensional and were trained using the skip-gram architecture with negative sampling.

Comparing against Supervised Methods
The first batch of tests compares the performance of ObliQuE at sentence-level QE against various supervised algorithms. The results are summarized in Table 1. Each column corresponds to a different test set that was made public by previous work. Each test set is on a different pair of languages. The rows of the table describe the results obtained by previous work on that dataset. Four of the six datasets come from the WMT competition; for comparison, we provide the results of the best and baseline supervised systems in that competition. The last row of the table states the IQR (interquartile range) of the results of previous work.
As evident from Table 1, the performance of the supervised baseline (having access to human-judgment-annotated data from the same distribution as the training set) is on average merely 25% better than ObliQuE. For the last two test sets, De-En and Ru-En from WMT'19, the performance of all algorithms was poor. In this case, our algorithm was much better than the baseline, and in the De-En case, it was better than the first-place system in that competition.

We now proceed to describe in detail the test sets of Table 1. The WMT'12 QE task dataset [39] consists of 442 English-Spanish news texts produced by a phrase-based SMT system called Moses (source in English, target in Spanish). Translations were manually annotated for quality in terms of post-editing effort (1-5 scores). The winner of this MT-QE task is the author of [40], with an algorithm based on SVM and regression trees. The baseline algorithm is an SVM trained on 17 features extracted using QUEST++.
The WMT'17 QE task dataset [41] contains English sentences that were translated to German by various MT-systems and ranked by correlation with HTER labels that were computed using TERCOM. We took 479 sentences translated by SYSTRAN.4847 (this system had the largest number of human-scored sentence pairs). The winner of this competition was POSTECH, which is a neural algorithm with predictor-estimator architecture [33,34]. The baseline in this task was again kernel SVM with QUEST features.
The Japanese-English and Japanese-Chinese sentence pairs are taken from [38]. The dataset contains 1676 sentences in Japanese that were obtained from role-playing dialogues of health care providers. The sentences were translated into English and Chinese using their in-house MT system, and quality was graded on a 1-5 scale, reflecting post-edit effort. The QE task was performed by a support vector regression model with a radial basis function (RBF) kernel. The model was trained once with 17 features extracted by QUEST++ (Baseline) and then with additional features extracted from a word-embedding of the sentences (First Place).

Comparing against Unsupervised Methods
There are very few examples of unsupervised algorithms competing in shared tasks like WMT. This is because the training data in those competitions are published along with human judgment scores, and the goal is, of course, to win the competition; it therefore makes no sense to give up part of the information. Only recently has attention been drawn to the task of MT through the lens of unsupervised learning and low-resource languages. Examples include WMT20's first-of-its-kind unsupervised-learning/low-resource competition [27,42].
Nevertheless, the baseline in the third shared task of WMT'19 [29] (a baseline is an algorithm provided by the organizers), LASER [28], is an unsupervised algorithm, and therefore we can compare against it. Alongside LASER, supervised-learning algorithms also competed in that task. The last two columns of Table 1 show results from that task.
As the test set of that competition was never published, the results that we report are on a sample from the train set that was published by the organizers. We sampled about 200 sentences, which is roughly the size of the test set in that same competition. As evident, ObliQuE outperforms LASER by much, and its performance is roughly the same as the best-supervised algorithm.
Also very noticeable is the overall poor performance of all algorithms on that dataset compared to other datasets. This may be attributed to the fact that the QE systems were tested on a variety of MT outputs from different MT systems (unlike the standard case, where all of the test data, as well as the train/dev sets, are homogeneous and come from the same MT system). This poor performance provides another motivation for considering the oblivious setting, as it allows one to estimate the robustness of an algorithm when switching from the same testing-training distribution to a more diverse setting.

Benchmarking against BLEU
In the second batch of tests, we checked the correspondence between the scores given by ObliQuE and the BLEU score, which is the golden standard in MT-QE (except, of course, for human evaluation). We ran the two algorithms at the document level on two very different datasets that we assembled. The first set consisted of 100 online news pieces in English from websites like CNN, NBCNews, and NYTimes. The second dataset consisted of 100 English poems by more than 30 different poets, written in the 19th century. The average number of words was 130 for a poem and 196 for a news piece.
BLEU performs evaluation against a reference. The data we collected did not come with a reference. To circumvent this problem, we performed a forward translation into each of German and French and then back to English. We performed this operation independently using three MT systems: Bing, Google, and SDL. All texts and translations are provided as supplementary data. We ended up with 1200 (En,En) pairs: we had 200 original documents in English, and each document went through two agent languages and three MT systems.
We evaluated each (En, En) pair using ObliQuE and BLEU, the original English text serving as a reference for BLEU, and recorded the score that each pair received from each algorithm. We then ran a competition between the three MT systems, for every pair, as follows. We fixed an agent language L (L being German or French). Each document d of the 200 original documents resulted in a triplet {(d, d^L_Bing), (d, d^L_Google), (d, d^L_SDL)}, where d^L_Bing stands for the translation of d into L and back to English using Bing, and the other two are defined similarly. We then ran ObliQuE and BLEU on every triplet, recording the winning MT system for that triplet with respect to BLEU and with respect to ObliQuE. Each MT system was ranked by the number of times it won this competition.
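The per-triplet competition described above reduces to counting, for each document, which MT system received the best score. A minimal sketch (the scores below are made-up toy values, not the paper's data; for ObliQuE, where smaller is better, one would pass `higher_is_better=False`):

```python
from collections import Counter

def rank_mt_systems(triplet_scores, higher_is_better=True):
    """Rank MT systems by how many per-document comparisons each wins.
    triplet_scores: one dict {system_name: score} per document
    (a hypothetical data layout chosen for this illustration)."""
    wins = Counter()
    pick = max if higher_is_better else min
    for scores in triplet_scores:
        # The system with the best score for this document wins once.
        wins[pick(scores, key=scores.get)] += 1
    return wins.most_common()

# Toy example: BLEU-like scores for three documents.
scores = [
    {"Bing": 0.41, "Google": 0.38, "SDL": 0.30},
    {"Bing": 0.35, "Google": 0.39, "SDL": 0.28},
    {"Bing": 0.44, "Google": 0.40, "SDL": 0.33},
]
print(rank_mt_systems(scores))  # Bing wins 2 documents, Google wins 1
```

Running the same counting once with BLEU scores and once with ObliQuE scores yields the two rankings compared in Table 2.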
The ranking of the three MT systems is detailed in Table 2. It shows that BLEU and ObliQuE are aligned in their ranking: Bing > Google > SDL, for German and French, and for both types of documents (poetry and news). The Spearman rank correlation between BLEU and ObliQuE was around 0.4 for poetry (across languages), 0.27 for news via German, and 0.16 for news via French.

Table 2. Evaluation of three MT systems over 100 English poems and 100 news pieces. Each cell is the number of times that the specific MT system received the best score of the three, according to BLEU or ObliQuE. For example, out of 100 English poems that were translated to French and back to English, 47 poems received the highest BLEU score when Bing was used for the forward-backward translation; 36 poems received the highest score using Google, and 17 poems using SDL.

Robustness
In this section, we evaluate the robustness of our method. Recall that we run our method in oblivious mode. In other words, it has no access to the distribution of the test set during train time. Our method uses pre-trained word embedding, which required text for training. It is, therefore, natural to ask how the choice of text to train word2vec affects the performance of the algorithm. The training of a word2vec embedding further involves fixing several parameters like the window size, the dimension of the embedding, or whether to use negative sampling or not. These parameters are explained in [30].
In this work, we use pre-trained vectors, and therefore we can check robustness for varying parameters of existing versions of word2vec. The two parameters we checked robustness for are whether negative sampling was used and which text was used.
In the tests of Section 4.1, we used a word2vec embedding trained on Google news text with negative sampling. In this section, we replace it with two word2vec versions, both trained on Wikipedia text, one with negative sampling and one without (we call the two options WikiNeg and WikiNorm, respectively). The non-English text was embedded by training on Wikipedia text with negative sampling, and we likewise checked what happens without negative sampling. Table 3 shows the correlation with human judgment scores when ObliQuE is parameterized with the various combinations. The first line of Table 3 corresponds to the results summarized in Table 1.
As evident from the first two rows of Table 3, our method is robust to changes in the source of the text that was used for training the embedding. Specifically, as long as the embeddings used for the source and target languages were both trained with negative sampling, the results are much the same, regardless of the actual text used (Google news or Wikipedia entries). On the other hand, when negative sampling is used for only one of the languages (last two rows of Table 3), the performance is poor.

Table 3. Repeating the tests described in Table 1 with various pre-trained vector combinations. Each row corresponds to a specific pair of versions of a word2vec embedding. The text source and the negative-sampling flag are embedded in the name of the version. The left-hand-side (lhs) version was used for the source language and the rhs for the target.


Discussion and Limitation
Machine translation is being used by millions of people daily, and therefore evaluating the quality of MT systems is an important task. While human evaluation of MT output remains crucial for finding ideas to improve MT systems further, it is not scalable by any means. Automatic MT evaluation offers a cheap and fast alternative.
The standard pipeline for training MT and MT-QE systems relies on a large bilingual corpus of source-target pairs, along with human judgments. However, to date, there is no machine translation available for most of the approximately 7000 languages spoken on Earth, due to the scarcity of large bilingual corpora for training. Therefore, methods of unsupervised machine translation (and quality estimation) are important for alleviating this problem. This aspect of low-resource MT and MT-QE is an emerging field; for example, only in WMT 2020 was there a first such shared task.
We proposed an even stricter version of unsupervised MT-QE, which we called oblivious MT-QE. In the oblivious setting, besides having no scored pairs of source-target sentences, the algorithm has no access to source-target pairs from the distribution of the text on which its performance is then tested. We showed that despite such a restrictive setting, a competitive performance of MT-QE could be achieved. We compared the performance of our oblivious algorithm to high-resource supervised learning MT-QE systems and concluded that performance degrades but remains competitive.
Our aim in this work was not to design the best MT-QE system but rather to understand if there are "universal signals" in language that can be harnessed to the task of MT-QE. The oblivious MT-QE setting allowed us to answer this question affirmatively by presenting an algorithm that performs "blind" MT-QE quite successfully.
One limitation of this current work is the use of word2vec, which is a "static" vectorization approach. The same written words, despite having different meanings, are always vectorized the same. It may be that context-based vectorization types (e.g., BERT) would be a better choice. We intend to explore this direction in future research.
Another direction for future research is to add cohesiveness as an additional feature to an existing MT-QE system, supervised or not, and check to what extent it will boost its performance.
Finally, an interesting question left for future research is whether sentence cohesiveness could be used during the training of an MT system in order to improve its quality.