Towards a Better Integration of Fuzzy Matches in Neural Machine Translation through Data Augmentation

Abstract: We identify a number of aspects that can boost the performance of Neural Fuzzy Repair (NFR), an easy-to-implement method to integrate translation memory matches and neural machine translation (NMT). We explore various ways of maximising the added value of retrieved matches within the NFR paradigm for eight language combinations, using Transformer NMT systems. In particular, we test the impact of different fuzzy matching techniques, sub-word-level segmentation methods and alignment-based features on overall translation quality. Furthermore, we propose a fuzzy match combination technique that aims to maximise the coverage of source words. This is supplemented with an analysis of how translation quality is affected by input sentence length and fuzzy match score. The results show that applying a combination of the tested modifications leads to a significant increase in estimated translation quality over all baselines for all language combinations.


Introduction
Recent advances in machine translation (MT), most notably linked to the introduction of deep neural networks in combination with large data sets and matching computational capacity [1], have resulted in a significant increase in the quality of MT output, especially for specialised, technical and/or domain-specific translations [2]. The increase in quality has been such that more and more professional translators, translation services, and language service providers have integrated MT systems in their workflows [3]. This is exemplified by recent developments in the translation service of the European Commission, one of the world's largest translation departments, where MT is increasingly used by translators, also driven by increased demands on productivity [4][5][6]. Post-editing MT output has been shown to, under specific circumstances, increase the speed of translation, or translators' productivity, compared to translation from scratch [5][6][7]. Specifically, higher-quality MT output, as assessed by automated evaluation metrics, has been shown to lead to shorter post-editing times for translators [8].
MT is currently most often used by translators alongside translation memories (TMs), a computer-assisted translation (CAT) tool that by now is a well-established part of many translation workflows [9][10][11]. MT tends to be used as a 'back-off' solution to TMs in cases where no sufficiently similar source sentence is found in the TM [12,13], since post-editing MT output in many cases takes more time than correcting (close) TM matches. This is, for example, due to inconsistencies in translation and a lack of overlap between MT output and the desired translation [14]. The level of similarity between the sentence to translate and the sentence found in the TM, as calculated by a match metric [15,16], thus plays an important role. Whereas translations retrieved from a TM offer the advantage of being produced by a translator, in the case of partial, or fuzzy matches, they are the translation of a sentence that is similar, but not identical to the sentence to be translated. In contrast, MT produces a translation of any input sentence, but in spite of the recent increase in MT quality, this output is still not always completely error-free. Moreover, the perception of translators is that MT errors are often not predictable or coherent, which results in a lower confidence for MT output in comparison to TM segments [14,17].
At least since the turn of the century, researchers have attempted to combine the advantages of TMs and MT, for example by better integrating information retrieved from a TM into MT systems [18][19][20]. In addition, in the context of neural machine translation (NMT), several approaches to TM-MT integration have shown that NMT systems do not fully exploit the available information in the large parallel corpora (or TMs) that are used to train them [21][22][23]. Whereas most of these TM-NMT approaches require modifications to the NMT architectures or decoding algorithms, an easy-to-implement method, neural fuzzy repair (NFR), was proposed that only involves data pre-processing and augmentation [24]. This method, based on concatenating translations of similar sentences retrieved from a TM to the source sentence, has been shown to increase the quality of the MT output considerably, also when state-of-the-art NMT systems are used [25]. However, this approach has only been implemented recently, and several important aspects remain to be explored. Amongst other things, fuzzy match retrieval in the context of NFR has so far used word-level information only, and combinations of fuzzy matches relied exclusively on the criterion of match score.
In this paper, we identify a number of adaptations that can boost the performance of the NFR method. We explore sentence segmentation methods using sub-word units, employ vector-based sentence similarity metrics for retrieving TM matches in combination with alignment information added to the retrieved matches, and attempt to increase the source sentence coverage when multiple TM matches are combined. We evaluate the proposed adaptations on eight different language combinations. In addition, we analyse the impact on translation quality of the source sentence length as well as the estimated similarity of the TM match. The code for fuzzy match extraction and data augmentation is available at https://github.com/lt3/nfr. This paper is structured as follows: in the next part, we discuss relevant previous research (Section 2). This is followed by an overview of the tested NFR configurations (Section 3) and the experimental setup (Section 4). Section 5 presents the results, which are subsequently discussed (Section 6). In the final Section 7, conclusions are drawn.

Related Research
We first briefly describe TMs and fuzzy matching techniques (Section 2.1), before discussing previous attempts to integrate TM and MT (Section 2.2). We then focus on TM-based data augmentation methods within the NMT framework (Section 2.3), the approach that is also followed in this paper. Finally, we list other related studies that followed a similar general approach (Section 2.4).

Translation Memories and Fuzzy Matching
Proposed in the 1970s [26], TMs were integrated in commercial translation software in the 1980s-1990s [11]. They have since become an indispensable tool for professional translators, especially for the translation of specialised texts in the context of larger translation projects with a considerable amount of repetition, and for the translation of updates or new versions of previously translated documents [10]. TMs are particularly useful when consistency (e.g., with regard to terminology) is important. A TM consists of a set of sentences (or 'segments') in a source language with their corresponding translation in a target language. As such, a TM can be considered to be a specific case of a bilingual parallel corpus. TM maintenance efforts can help to ensure that translation pairs are of high quality, and that the resulting parallel corpus is particularly 'clean'.
Full or partial matches of sentences to be translated are retrieved from the TM using a range of possible matching metrics in a process referred to as fuzzy matching. Fuzzy matching techniques use different approaches to estimate the degree of overlap or similarity between two sentences, such as calculating:
• the percentage of tokens (or characters) that appear in both segments [15], potentially allowing for synonyms and paraphrase [27],
• the length of the longest matching sequence of tokens, or n-gram matching [25],
• the edit distance between segments [28], generally believed to be the most commonly used metric in CAT tools [16],
• automated MT evaluation metrics such as translation edit rate (TER) [29,30],
• the amount of overlap in syntactic parse trees [31], or
• the (cosine) distance between continuous sentence representations [32], a more recently proposed method.
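Two of the simpler metrics in the list above can be sketched as follows; the function names and the token-level granularity are our own illustration and are not taken from any particular CAT tool:

```python
from difflib import SequenceMatcher

def token_overlap(a, b):
    """Fraction of tokens of the longer segment that also appear in the other
    segment (multiset intersection over whitespace tokens)."""
    ta, tb = a.split(), b.split()
    shared = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    return shared / max(len(ta), len(tb))

def longest_match(a, b):
    """Length of the longest contiguous matching token sequence."""
    ta, tb = a.split(), b.split()
    m = SequenceMatcher(None, ta, tb).find_longest_match(0, len(ta), 0, len(tb))
    return m.size
```

For example, `token_overlap("the cat sat", "the dog sat")` scores 2/3, while `longest_match("a b c d", "x b c y")` returns 2 for the shared sequence "b c".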
To facilitate the correction of partial matches, fuzzy matching is usually combined with the identification and highlighting of parts in the corresponding target segment that either can be kept unchanged or need editing.

Approaches to TM-MT Integration
Besides the fact that TMs are valuable resources for training in-domain MT systems, researchers working within different MT frameworks have attempted to improve the quality of automatically generated translations by harnessing the potential of similar translations retrieved from a TM. Such translations offer the advantage of being produced by translators, and with a well-managed TM their quality is assumed to be high [33]. If the retrieved source sentence is sufficiently similar to the sentence to be translated, even using its retrieved translation as such can result in better output than that generated by MT systems, both as judged by automatic evaluation metrics and in terms of post-edit effort needed by translators [13,34]. This most basic option for TM-MT integration, i.e., favouring TM matches over MT output based on a fixed (or tunable) threshold for a chosen similarity metric, can be combined with systems designed to automatically edit such matches to bring them closer to the sentence to be translated [35,36], an approach that is sometimes referred to as 'fuzzy match repair' [37]. Recent implementations of this approach have been demonstrated to outperform both unaltered fuzzy matches and state-of-the-art NMT systems [38,39].
Next to this 'dual' approach, different methods were developed to achieve a closer integration of TMs and different types of MT systems. It can be argued that the principles underlying the example-based MT paradigm as a whole are closely related to TM-MT integration and fuzzy match repair [40,41]. Within this framework, researchers have focused on different ways of using sub-segmental TM data for the purpose of MT [20,42]. To this end, example-based MT systems have also been combined with phrase-based statistical MT (PBSMT) systems [43], which were generally considered to be the state-of-the-art in MT before the advent of NMT [44]. Other TM-PBSMT integration methods have been proposed as well, for example by constraining the MT output to contain (preferably large) parts of translations retrieved from a TM [34,45]. This process involves identifying continuous strings of translated tokens to be 'blocked' in the MT output on the basis of statistical information about the alignment between tokens in the source and target sentences. Alternatively, the phrase tables used in PBSMT systems can be enriched with information retrieved from TM matches [13,46], or the decoding algorithms can be adapted to take into consideration matches retrieved from a TM [47,48]. All of these approaches were shown to lead to a significantly increased quality of MT output, especially in contexts where the amount of high-scoring TM matches retrieved is high.
Within the NMT paradigm various modifications to the NMT architecture and search algorithms were proposed to leverage information retrieved from a TM. For example, an additional encoder can be added to the NMT architecture specifically for TM matches [49] or, alternatively, a lexical memory [21]. The NMT system has also been adapted so it can have access to a selection of TM matches at the decoding stage [22]. Decoding algorithms have further been modified to incorporate lexical constraints [50] or to take into account rewards attached to retrieved strings of target language tokens, or 'translation pieces' [23], both of which can be informed by retrieving matches from a TM. More recently, a method has been proposed that augments the decoder of an MT system using token-level retrieval based on a nearest neighbour method, labeled k-nearest-neighbor machine translation (kNN-MT) [51]. A potential advantage of this method is that it does not rely on the identification of useful sentence-level matches, but rather finds relevant matches for each individual word, resulting in a wider coverage. On the other hand, preserving the syntactic and terminological integrity of close matches is one of the presumed advantages of many other TM-MT integration methods, which may, at least in part, be lost in this approach.

Integrating Fuzzy Matches into NMT through Data Augmentation
An easy-to-implement TM-NMT integration approach, labelled neural fuzzy repair (NFR), was proposed by [24]. Drawing on work in the field of automatic post-editing [52] and multi-source translation [53], this method consists of concatenating the target-language side of matches retrieved from a TM to the original sentence to be translated. As such, it only involves data pre-processing and augmentation, and is compatible with different NMT architectures.
In the original paper, the method was tested using a single fuzzy match metric (i.e., token-based edit distance) and seq2seq bidirectional RNN models with global attention. Different options were explored with regard to the number of concatenated matches, the amount of training data generated and the choice between training a 'dedicated' model for sentences for which high-scoring matches were retrieved only and a 'unified' model that deals with sentences with and without concatenated matches.
Tests carried out on two language combinations (English-Dutch, EN-NL, and English-Hungarian, EN-HU) using the TM of the European Commission's translation service [33] showed large gains in terms of various automated quality metrics (e.g., around +7.5 BLEU points overall for both language combinations when compared to the NMT baseline). It was observed that the increase in quality is almost linearly related to the similarity of the retrieved and concatenated matches, with quality gains in the highest match range (i.e., 90% match or higher) of around 22 BLEU points compared to the NMT baseline for both language combinations, compared to an increase of not even one BLEU point for EN-NL and around three BLEU points for EN-HU in the match range between 50% and 59%. The more matches that are retrieved, and the higher the similarity of the matches to the sentence to be translated, the more beneficial NFR is for translation quality.
The data augmentation approach to TM-NMT integration was further explored by [25], who tested the impact of different matching metrics, using Transformer models [54] for one language combination, English-French, and nine data sets. In addition to token-based edit distance, they tested n-gram matching and cosine similarity of sentence embedding vectors generated using sent2vec [55], as well as different combinations of matches retrieved using these matching methods. In addition, they incorporated information about the alignment of the target tokens for token-based edit distance and for n-gram matching by either removing unmatched tokens from the input or by additionally providing this information to the NMT architecture using word-level features. Their results are in line with those of [24], showing important increases in translation quality for all of the tested NFR methods in comparison to a baseline NMT system. Moreover, they confirm that the NFR approach is compatible with the Transformer NMT architecture. Their best-performing model concatenates both the best edit-distance match and the best sentence-embedding match to the input sentence, and adds information about the provenance and alignment of the tokens using factors (indicating whether tokens belong to the original source sentence, are aligned or unaligned edit-distance tokens, or belong to the sentence-embedding match). Finally, their results demonstrate that the NFR approach also leads to considerable quality gains in a domain adaptation context.

Other Related Research
The NFR approach is somewhat similar to the approach simultaneously proposed by [56] to incorporate terminology constraints into NMT, in that this method also introduces tokens in the target language in the source sentences, leading to bilingual inputs. In addition, in the context of this method, NMT models were shown to be flexible in dealing with such bilingual information and incorporating constraints, leading to improved MT output.
In addition, the so-called Levenshtein Transformer [57] is relevant in this context. This promising neural model architecture is aimed at learning basic edit operations (i.e., insertion and deletion), which makes it, in theory at least, especially suited for a task such as fuzzy match repair. Researchers have already used this architecture successfully to impose lexical constraints on NMT output [58].
With regard to the incorporation of alignment information in NMT models, it should be noted that this has been attempted explicitly before, relatively quickly after the rise of NMT [59]. This integration of information from more traditional word alignment models was meant to combine the perceived advantages of the PBSMT and NMT paradigms, and thus differs from the incorporation of alignment information in the NFR paradigm.
Fuzzy match retrieval in combination with source-target concatenation has been shown to also be useful for improving the robustness of NMT models in the case of noisy data [60]. In addition, it was demonstrated to be a powerful method in the field of text generation as well [61].

Neural Fuzzy Repair: Methodology
In NFR, for a given TM consisting of source/target sentence pairs S, T, each source sentence s i ∈ S is augmented with the translations {t 1 , . . . , t n } ∈ T of n fuzzy matches {s 1 , . . . , s n } ∈ S, where s i ∉ {s 1 , . . . , s n }, given that the fuzzy match score is sufficiently high (i.e., above a given threshold λ). Previous research reported comparable NFR performance using 1, 2 and 3 concatenated fuzzy match targets [24]. In the experiments in this study, we use the translations of maximally two fuzzy matches for source augmentation (n = 2). We use "@@@" as the boundary between each sentence in the augmented source. The NMT model is then trained using the combination of the original TM, which consists of the original source/target sentence pairs S, T, and the augmented TM, consisting of augmented-source/target sentence pairs S′, T. At inference, each source sentence is augmented using the same method. If no fuzzy matches are found with a match score above λ, the non-augmented (i.e., original) source sentence is used as input. Figure 1 illustrates the NFR method for training and inference.
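A minimal sketch of this augmentation step, assuming fuzzy matches have already been retrieved and sorted by descending score (the function name and tuple layout are our own illustration):

```python
def augment_source(src, matches, threshold=0.5, n=2):
    """Concatenate the target sides of up to n fuzzy matches to a source
    sentence. `matches` is a list of (score, match_src, match_tgt) tuples
    sorted by descending score; "@@@" is the sentence boundary marker."""
    kept = [tgt for score, _, tgt in matches[:n] if score > threshold]
    if not kept:
        # No sufficiently similar match: fall back to the plain source.
        return src
    return " @@@ ".join([src] + kept)
```

With the default threshold of 0.5, a match scoring 0.3 is discarded: `augment_source("hello world", [(0.8, "s1", "bonjour monde"), (0.3, "s2", "salut")])` yields "hello world @@@ bonjour monde".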
With the aim of improving the NFR method further, this paper explores a number of adaptations that involve sub-word level segmentation methods for fuzzy match retrieval (Section 3.1), the integration of (sub-)word alignment information (Section 3.2), and combinations of multiple fuzzy matches that extend source coverage (Section 3.3).

Fuzzy Matching Using Sub-Word Units
Fuzzy matching is a key functionality in NFR, as the quality of the generated translations is determined by the similarity level of the retrieved fuzzy match(es) [24]. The original NFR method used token-based edit distance for fuzzy matching. With this similarity metric, the fuzzy match score ED(s i , s j ) between two sentences s i and s j is defined as:

ED(s i , s j ) = 1 − editdistance(s i , s j ) / max(|s i |, |s j |)

where |s| is the length of s. Following [24], candidates for high fuzzy matches are identified using SetSimilaritySearch (https://github.com/ardate/SetSimilaritySearch) before calculating edit distance.

It has been shown that matches extracted by measuring the similarity of distributed representations, in the form of sentence embeddings, can complement matching based on edit distance and lead to improvements in translation quality [25]. The sentence similarity score SE(s i , s j ) between two sentences s i and s j is defined as the cosine similarity of their sentence embeddings e i and e j :

SE(s i , s j ) = (e i · e j ) / (‖e i ‖ ‖e j ‖)

where ‖e‖ is the magnitude of vector e. Similar to [25], we use sent2vec [55] for training models from in-domain data, and build a FAISS index [62] containing the vector representation of each sentence for each NFR method and language. FAISS is a library specifically designed for efficient similarity search and vector clustering, and is compatible with the large data sets used in this study. In a previous study, it was hypothesised that, while edit distance provides lexicalised matches from which the model learns to copy tokens to the MT output, matches obtained using sentence embeddings help the model to further contextualise translations, meaning both methods complement each other in the NFR context [25]. At the same time, however, it can be argued that both sentence similarity metrics rely on the information provided by surface forms of word tokens to measure similarity.
This means that complex (inflectional) morphology can pose challenges for the retrieval of useful matches, especially when edit distance is used as match metric. Using sub-word units might provide a useful way to mitigate such challenges. Even though sentence embeddings built on word tokens are already effective at capturing semantic similarities between vector representations of sentences, sub-word units have proven useful for building multilingual sentence embeddings [63], and have been successfully utilised for the task of measuring sentence similarity [64].
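For illustration, the two match scores ED and SE can be computed as follows; this is a plain-Python sketch standing in for the SetSimilaritySearch and sent2vec/FAISS pipelines used in the actual experiments:

```python
import math

def ed_score(s_i, s_j):
    """ED(s_i, s_j) = 1 - editdistance(s_i, s_j) / max(|s_i|, |s_j|),
    computed over word tokens with dynamic programming."""
    a, b = s_i.split(), s_j.split()
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # (mis)match
        prev = cur
    return 1 - prev[-1] / max(len(a), len(b))

def se_score(e_i, e_j):
    """SE(s_i, s_j) = cosine similarity of sentence embeddings e_i and e_j."""
    dot = sum(a * b for a, b in zip(e_i, e_j))
    norm = math.sqrt(sum(a * a for a in e_i)) * math.sqrt(sum(b * b for b in e_j))
    return dot / norm
```

For instance, `ed_score("a b c d", "a b x d")` returns 0.75 (one substitution over four tokens); with sub-word units (Section 3.1), the same functions apply unchanged to the segmented token sequences.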
In NLP, different sub-word segmentation approaches have been proposed with the aim of reducing data sparsity caused by infrequent words and morphological complexity, such as byte-pair encoding (BPE) [65], WordPiece [66], and linguistically motivated vocabulary reduction (LMVR) [67]. BPE was originally a data compression algorithm [68] before being used in the context of MT. It seeks an optimal representation of the vocabulary by iteratively merging the most frequent character sequences [65]. WordPiece uses the same approach for vocabulary reduction as BPE, with the difference that the merge choice is based on likelihood maximisation rather than frequency [69]. LMVR, on the other hand, is based on an unsupervised morphological segmentation algorithm that predicts sub-word units in a corpus by a prior morphology model, while reducing the vocabulary size to fit a given constraint [67].
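To illustrate the BPE idea, a minimal merge-learning loop over a toy vocabulary (a simplification of the subword-nmt implementation; `</w>` marks word endings, and the vocabulary format is our own):

```python
import re
from collections import Counter

def learn_bpe(vocab, num_merges):
    """Learn BPE merges from a {word-as-space-separated-symbols: frequency}
    dictionary. Each step merges the most frequent adjacent symbol pair."""
    merges = []
    vocab = dict(vocab)
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace the winning pair by its concatenation in every word.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = {pattern.sub("".join(best), w): f for w, f in vocab.items()}
    return merges, vocab
```

On the toy vocabulary {"l o w </w>": 5, "l o w e r </w>": 2}, the first two learned merges are ("l", "o") and ("lo", "w"), after which "low" is a single symbol.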
Our hypothesis is that sub-word units can enable us to extract more relevant fuzzy matches, both when using exact, string-based matching algorithms (such as edit distance) and algorithms that utilise sentence embeddings. In this study, we first adapt both types of fuzzy matching approaches, ED tok and SE tok , and replace word tokens with two types of sub-word units, namely byte-pairs (ED bpe , SE bpe ) and LMVR tokens (ED lmvr , SE lmvr ). In our experiments, we use a merged vocabulary of 32K for source and target languages combined for the BPE implementation [65] (https://github.com/rsennrich/subword-nmt). As LMVR is based on language-specific morphological segmentation, we use a 32K vocabulary for source and target languages separately for the LMVR implementation [67] (https://github.com/d-ataman/lmvr). Table 1 provides an example of fuzzy match retrieval using edit distance with word tokens, byte-pairs and LMVR tokens.

Table 1. Best fuzzy matches retrieved with edit distance using word tokens, byte-pairs and LMVR tokens for the Hungarian input sentence 'a görbületi sugarak közötti eltérések:', with the English translation 'differences between the radii of curvature:'. Matching (sub)word tokens between the input source (s i ) and the best fuzzy match source/target pair (s j , t j ) are underlined.

ED tok (score 0.5)
s i : a görbületi sugarak közötti eltérések:
s j : a nemek közötti egyenlőség:
t j : gender equality:

ED bpe (score 0.5)
s i : a görb@@ ületi sugar@@ ak közötti eltérések :
s j : a tük@@ rö@@ k görb@@ ületi sugar@@ ai közötti eltérések
t j : differences between the radii of curvature of mirrors

ED lmvr (score 0.71)
s i : a gör @@b @@ület @@i sugar @@ak köz @@ött @@i eltérés @@ek :
s j : a tükr @@ök gör @@b @@ület @@i sugar @@ai köz @@ött @@i eltérés @@ek
t j : differences between the radii of curvature of mirrors

In the example in Table 1, both ED bpe and ED lmvr retrieve the same best fuzzy match, with the translation 'differences between the radii of curvature of mirrors', which is different from the fuzzy match retrieved by ED tok , with the translation 'gender equality:'. In this example, by utilising sub-word units, both ED bpe and ED lmvr retrieve a fuzzy match that is arguably more informative for the correct translation of the source sentence than ED tok . Due to the difference in the way sub-words are generated by ED bpe and ED lmvr , however, the two methods retrieve the same fuzzy match with different match scores (0.5 and 0.71, respectively).

Marking Relevant Words in Fuzzy Matches
Previous work shows that fuzzy matches with lower similarity scores may cause the NFR system to copy incorrect/unrelated text fragments to the output, leading to translation errors [24]. To mitigate this problem, source-side features have been utilised to mark relevant (and irrelevant) words in fuzzy target sentences extracted using ED tok [25]. In a given TM consisting of source/target sentence pairs S, T, for a given source sentence s i ∈ S and fuzzy match source-target pair s j ∈ S, t j ∈ T, they first utilise LCS (Longest Common Sub-sequence) to mark the words in s j that also exist in s i . The marked words in s j are then mapped to t j using word alignment information between s j and t j . In this study, we follow this general idea and mark relevant (and irrelevant) words in the fuzzy target segments t j used to augment source sentences. In contrast to [25], we use the optimal path of edit distance to find overlapping, identical source and fuzzy source tokens (as described in [70] (Chapter 3.11)) instead of LCS. We also use GIZA++ [71] for word alignment rather than fast_align [72]. Figure 2 illustrates how tokens are aligned between the input sentence, the fuzzy source and the fuzzy target.

Figure 2. Aligning tokens of a source sentence (s i ) with tokens of source and target sentences for a fuzzy match (s j , t j ). The features "A" and "N" stand for "aligned" (i.e., relevant) and "non-aligned" (i.e., irrelevant), respectively.
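A sketch of this marking step: matched tokens are recovered from the backtrace of the edit-distance table, then mapped through a word alignment (hand-supplied here, standing in for GIZA++ output; function names are ours):

```python
def matched_positions(src, fuzzy_src):
    """Positions in fuzzy_src (a token list) whose tokens are kept unchanged
    on the optimal edit-distance path between src and fuzzy_src."""
    a, b = src, fuzzy_src
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    # Backtrace, collecting positions where tokens match exactly.
    keep, i, j = set(), len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            keep.add(j - 1); i, j = i - 1, j - 1      # kept token
        elif d[i][j] == d[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1                       # substitution
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1                                    # deletion
        else:
            j -= 1                                    # insertion
    return keep

def mark_fuzzy_target(src, fuzzy_src, fuzzy_tgt, alignment):
    """Label each fuzzy target token "A" (aligned to a matched fuzzy source
    token) or "N". `alignment` is a set of (fuzzy_src_pos, fuzzy_tgt_pos)
    pairs, e.g., taken from a word aligner such as GIZA++."""
    keep = matched_positions(src, fuzzy_src)
    labels = ["N"] * len(fuzzy_tgt)
    for s_pos, t_pos in alignment:
        if s_pos in keep:
            labels[t_pos] = "A"
    return list(zip(fuzzy_tgt, labels))
```

For the toy triple src = "a b c", fuzzy source "a x c" and a monotone alignment, the middle fuzzy target token is labelled "N" and the outer two "A".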
Reasoning that similar sentences obtained with distributed representations do not necessarily present any lexical overlap, a previous study did not add this type of information to t j obtained by SE [25]. Instead, a dummy feature "E" was added to all words in the fuzzy target. We, however, hypothesise that such source-side features can still assist the model in making better lexical choices when relevant words are present in similar sentences. Therefore, we also mark the relevant (and irrelevant) words in t j obtained by SE. In this scenario, the focus is on high precision rather than on finding all semantically related words: by marking the words that occur in both s i and s j , we hope that the model learns to recognise these words as high-fidelity candidates to copy to the target side.
Besides word-level tokens, we use BPE and LMVR units for fuzzy matching (as described in Section 3.1). In this case, we follow the approach described above to mark relevant (and irrelevant) sub-word units, but use GIZA++ on sub-word-level data, which has been shown to lead to an improved alignment quality [73].

Maximum Source Coverage
As discussed in the previous section, features that mark relevant words in fuzzy match target sentences can provide additional information to the NFR model. Moreover, these features allow us to better combine fuzzy matches so that they provide complementary information about the source sentence. Existing methods for combining multiple fuzzy matches in the context of NFR, which lead to improvements over using a single fuzzy match for data augmentation, either use the n-best fuzzy matches obtained by a single fuzzy matching method or the best matches obtained by n different fuzzy matching methods. Neither of these approaches guarantees that more source information (i.e., words) is covered by the combined fuzzy matches.
In this study, we propose an alternative fuzzy match combination method, max_coverage. To combine two fuzzy target sentences for data augmentation, this method first retrieves the best fuzzy source-target pair s 1 , t 1 obtained for a given source sentence s i . As a second match, this method seeks a source-target pair s j , t j where s j maximises the coverage of source words in s i when combined with s 1 . We limit the search for the second match to the best 40 matches with match score above 0.5. If no such match is found, the algorithm falls back to using 2-best matches. To calculate the source coverage of a given t j , we use the methodology described in Section 3.2. Algorithm 1 shows the pseudo-code for max_coverage. Table 2 illustrates the different approaches to using features on fuzzy target sentences and combining fuzzy matches, including max_coverage.
Algorithm 1: max_coverage
Input: Source sentence s i ; list of fuzzy match source-target pairs (S, T) with fuzzy match score above the threshold λ, where s i ∉ S
Output: List of fuzzy match source-target pairs (S maxc , T maxc )
1 S maxc ← [s 1 ], where s 1 ∈ S is the source segment of the highest-scoring fuzzy match;
2 T maxc ← [t 1 ], where t 1 ∈ T is the target segment of the highest-scoring fuzzy match;
3 C ← list of token IDs in s i which are aligned with tokens in s 1 ;
4 extra_coverage ← 0;
5 s new ← None;
6 t new ← None;
7 for (s j , t j ) in (S, T) do
8     C cand ← list of token IDs in s i which are aligned with tokens in s j ;
9     if |C cand \ C| > extra_coverage then
10        extra_coverage ← |C cand \ C|;
11        s new ← s j ;
12        t new ← t j ;
13    end
14 end
15 if s new ≠ None then
16    S maxc .append(s new );
17    T maxc .append(t new );
18 else
19    S maxc .append(s 2 ), where s 2 ∈ S is the source segment of the fuzzy match with the second highest match score;
20    T maxc .append(t 2 ), where t 2 ∈ T is the target segment of the fuzzy match with the second highest match score;
21 end
22 return S maxc , T maxc

Table 2. Adding feature labels to best fuzzy match targets retrieved for the Hungarian input source 'orvosi fizikus szakértők .' (EN: 'medical physics experts .'). The feature labels "A" and "N" indicate target tokens that are aligned/not aligned to tokens in the input source. Label "E" is a dummy feature used for correct formatting. Target tokens that are aligned to the input source through features are marked in bold.

In Table 2, the only second fuzzy match target that increases the number of words covered in the source sentence is retrieved by max_coverage (Best SE + max_coverage SE). Despite its lower match score compared to the second-best match (0.765 vs. 0.829), the English sentence 'medical physics expert' provides additional information about the translation of the source word 'fizikus' (physics).
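Assuming a coverage function that returns the set of source token positions aligned by a candidate match (obtained as described in Section 3.2), the max_coverage selection can be sketched in Python as follows (the data layout is our own illustration):

```python
def max_coverage(matches, coverage, top_k=40, min_score=0.5):
    """Pick the best match plus the candidate that adds the most new source
    coverage. `matches` is a list of (score, src, tgt) tuples sorted by
    descending score; coverage(src, tgt) returns the set of source token
    positions covered. Falls back to the 2-best matches when no candidate
    extends coverage."""
    best = matches[0]
    covered = coverage(best[1], best[2])
    second, extra = None, 0
    # Only consider the top-k candidates above the score threshold.
    for score, s_j, t_j in matches[1:1 + top_k]:
        if score <= min_score:
            break                      # list is sorted, so we can stop here
        gain = len(coverage(s_j, t_j) - covered)
        if gain > extra:
            second, extra = (score, s_j, t_j), gain
    if second is None and len(matches) > 1:
        second = matches[1]            # fall back to the second-best match
    return [best] if second is None else [best, second]
```

In a toy case where the second-best match covers the same source positions as the best one, a lower-scoring match that covers a new position is preferred, mirroring the Best SE + max_coverage SE example in Table 2.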

Experimental Setup
This section describes the data sets that are used in the experiments (Section 4.1), the NFR models, and the baselines they are compared to (Section 4.2), as well as the procedures used for evaluation (Section 4.3).

Data
As the data set, we use the TM of the European Commission's translation service [33]. All sentence pairs were truecased and tokenised using the Moses toolkit [74]. We run detailed tests of different NFR configurations on one language pair, English ↔ Hungarian (EN-HU, HU-EN). The best systems are then tested for three further language pairs: English ↔ Dutch (EN-NL, NL-EN), English ↔ French (EN-FR, FR-EN), and English ↔ Polish (EN-PL, PL-EN).
For each language combination, we used ∼2.4 M sentences as training, ∼3 K sentences as validation, and ∼3.2 K sentences as the test set. The validation and test sets did not contain any sentences that also occurred in the training set (i.e., 100% matches were removed). Our aim was to keep the validation and test sets as constant as possible across language combinations, but some sentence pairs needed to be removed from the test set when the translation direction was switched, since their source sentences occurred in the corresponding training set, leading to slightly different test sets for the different language combinations. The exact number of sentences for each language combination, before data augmentation was applied, is provided in Appendix A.

Table 3 provides an overview of the NFR system configurations that are tested in this study for English ↔ Hungarian, alongside the baselines they are compared to. All systems, including the four baselines, make use of the Transformer NMT architecture (see Appendix B.1 for a detailed description). For training the NMT models, BPE is applied to all data, except when LMVR is used for sub-word segmentation prior to retrieving fuzzy matches. It should be noted that, for training the NMT models, the sub-word segmentation step is applied independently from the segmentation techniques used for fuzzy match retrieval (see Section 3.1). This distinction is also made in Table 3, where these steps are referred to as 'NMT unit' and 'Match unit', respectively. All NFR systems use source sentences augmented with two retrieved fuzzy matches at most, and all but one of the NFR systems use 0.5 as the threshold for fuzzy match retrieval.

Baseline and NFR Models
The first baseline, Baseline Transformer, is the only system not using the NFR method. As a second baseline, we apply the originally proposed NFR data augmentation method, involving token-based fuzzy matching using edit distance [24], but implemented using the Transformer architecture and BPE instead of BRNNs and tokens (ED tok ). Finally, we report the performance of two systems that implement the best NFR configuration proposed by [25], involving the combination of one edit-distance and one sentence-embedding match using token-based matching, with alignment features added to the first match, but not to the second. The first of these two baselines uses the thresholds for fuzzy matching used by [25], namely 0.6 for edit distance and 0.8 for sentence embeddings (ED tok /SE tok ), whereas the second uses the same threshold of 0.5 as the other NFR systems in this study (ED tok /SE tok ‡ ). For both variants of this system, we used SetSimilaritySearch to extract high-scoring fuzzy match candidates prior to calculating ED, as described in Section 3.1. The tested NFR systems are divided into three blocks: the first group of systems is implemented without alignment features or maximum source coverage, the second group uses alignment features only, and the third uses both alignment features and maximum source coverage. The labels used for each tested system indicate the fuzzy matching method (either edit distance, ED, or sentence embeddings, SE) and the fuzzy match unit (word-level tokens, tok; byte-pair encoding, bpe; or linguistically motivated vocabulary reduction, lmvr), and show whether source-side alignment features are used (indicated with +) and whether maximum source coverage is applied (indicated with M). In the case of SE, 'match unit' refers to the (sub-)word unit used to train the sent2vec model from which, in turn, the sentence embeddings used for fuzzy match retrieval were extracted.
For the systems that utilise word-level tokens as a 'match unit' but sub-word units as the 'NMT unit', alignment features are mapped from tokens to their corresponding sub-word units.
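A possible implementation of this token-to-sub-word feature mapping, assuming Moses-style BPE with the `@@` continuation marker, could look as follows (a sketch, not the paper's exact code):

```python
def map_features_to_subwords(subword_tokens, token_features, marker="@@"):
    """Copy each word-level feature to all BPE sub-words of that word.
    A sub-word carrying the continuation marker '@@' belongs to the same
    original token as the sub-word(s) that follow it."""
    features = []
    word_idx = 0
    for sw in subword_tokens:
        features.append(token_features[word_idx])
        if not sw.endswith(marker):   # last piece of the current word
            word_idx += 1
    return features

# 'unrelated' is segmented as 'un@@ related'; both pieces inherit its feature
subs = ["un@@", "related", "words"]
feats = map_features_to_subwords(subs, ["match", "no-match"])
print(feats)  # ['match', 'match', 'no-match']
```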
The hyper-parameters and the training details of the NMT systems, sent2vec, FAISS, LMVR, and GIZA++ are provided in Appendix B.

Evaluation
We make use of the automated evaluation metrics BLEU [75] (from Moses: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl), TER [30] (version 0.7.25: https://github.com/snover/terp), and METEOR [76] (version 1.5: https://www.cs.cmu.edu/~alavie/METEOR/) to assess the quality of the translations. Bootstrap resampling tests are performed to verify whether differences between the best baseline system and the NFR systems combining all tested modifications are statistically significant [77]. In addition to evaluations on the complete test set, we also carry out more fine-grained evaluations of subsets of the test set defined on the basis of input sentence length and fuzzy match score.
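The significance test can be sketched as a paired bootstrap over sentence-level scores. Note that this is a simplified stand-in: the actual procedure [77] recomputes corpus-level BLEU on each resample, whereas this sketch sums per-sentence scores.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap resampling over per-sentence scores of two systems
    evaluated on the same test set. Returns the fraction of resamples in
    which system A beats system B; 1 minus this value approximates the
    p-value for the claim 'A is better than B'."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]      # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples
```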

Results
First, we present a detailed analysis of the results for English ↔ Hungarian (Section 5.1). Section 5.2 shows the results for the three other language pairs, using the best NFR systems identified for English ↔ Hungarian. Finally, Section 5.3 focuses on the impact of fuzzy match score and sentence length on the quality of the generated translations. Table 4 shows the results of the automated evaluations for the four baseline systems and all NFR systems that were tested for EN-HU and HU-EN. The table consists of four sections. From top to bottom, it shows the results for (a) the baseline systems, (b) the NFR systems using different match methods and match units, (c) the systems that incorporate alignment features, and (d) those that use both alignment features and maximum source coverage. The table reports BLEU, TER, and METEOR scores, but, in the text, we mainly focus on BLEU.

Detailed Evaluation for English-Hungarian and Hungarian-English
For both translation directions, Baseline Transformer, which does not make use of the NFR augmentation method, is outperformed by all NFR baselines by 4.5 to 6.67 BLEU points. For both EN-HU and HU-EN, the strongest baseline is the system combining the best ED and SE matches, with a threshold of 0.5 (ED tok /SE tok ‡ ). This configuration also outperforms the same system that uses higher thresholds for ED and SE matches, 0.6 and 0.8, respectively (ED tok /SE tok ).
The results of our first set of experiments, targeting different fuzzy matching methods and units, show that the systems using matches retrieved by SE outperform their counterparts that use ED as the match metric (SE tok vs. ED tok , SE bpe vs. ED bpe , and SE lmvr vs. ED lmvr ). In this context, we note that, with ED, the number of retrieved fuzzy matches above the 0.5 threshold was consistently and substantially lower than with SE. Moreover, for HU-EN, the system combining the two highest-scoring SE matches using byte-pairs (SE bpe ) already outperforms the baseline system that combines the best ED and SE matches (ED tok /SE tok ‡ ), which additionally includes alignment features.
Comparing the different fuzzy match units, for both translation directions, the systems that use LMVR-based matching perform worst (for both ED and SE). Token-based matching, in contrast, appeared to work better, but not as well as BPE. Especially for HU-EN, SE bpe scored considerably higher than SE tok (+1.05 BLEU points). For EN-HU, the increase in quality was less pronounced (+0.22 BLEU). Given the consistent improvements yielded by SE over ED, for the next set of experiments, we focused on systems combining sentence-embedding matches only.
As the third section of Table 4 shows, adding features based on alignment information resulted in a small but consistent improvement in quality for EN-HU (between +0.37 and +0.63 BLEU points compared to the corresponding systems without features). For HU-EN, however, this was not the case. Whereas the performance was slightly better for the token-based system with added features (SE tok +) compared to the system without (+0.24 BLEU), the opposite was the case for the systems using BPE and LMVR as match unit (−0.19 BLEU for SE bpe + and −0.04 for SE lmvr +).
Next, we apply max_coverage to all three systems that use SE and alignment-based features. For EN-HU, this resulted in an improved performance in two out of three cases (+0.31 BLEU for SE tok +M and +0.4 for SE lmvr +M). For SE bpe +M, the estimated quality remained virtually identical (−0.01 BLEU). This results in SE tok +M scoring best overall for EN-HU, according to the three evaluation metrics. Compared to the best baseline (ED tok /SE tok ‡ ), the difference is +0.38 BLEU (p < 0.05), −0.48 TER, and +0.41 METEOR. For HU-EN, retrieving matches using max_coverage improved the system's performance on all occasions, both compared to the systems with and without alignment-based features. The differences compared to the systems with features were larger than for EN-HU (between +0.58 and +1.03 BLEU), but, at the same time, most HU-EN systems including features scored slightly worse than those without. This means that, for HU-EN, the best-scoring system overall was the system including fuzzy matching based on sentence embeddings for BPE, including alignment features and maximum source coverage (SE bpe +M). Compared to the best baseline (ED tok /SE tok ‡ ), its performance was estimated to be better on all evaluation metrics (+1.69 BLEU, p < 0.001; −0.99 TER; +0.73 METEOR).
In a final step, we retrained the two best-scoring systems (SE tok +M and SE bpe +M) without alignment features, considering that adding such features did not appear to be beneficial for HU-EN (SE vs. SE+ in Table 4). The results of these tests (see Table 5) show that, for both language combinations, both systems with added features score better than their counterparts without, the latter scoring between 0.20 and 0.95 BLEU points lower.

Evaluation on Three Additional Language Pairs
In this section, we present the results of the comparisons between the best NFR configurations identified for English ↔ Hungarian and the baseline systems on three additional language pairs: English ↔ Dutch, English ↔ French, and English ↔ Polish. We only report the three baselines using the same threshold for fuzzy matching (i.e., 0.5). The best NFR systems for English ↔ Hungarian were those using fuzzy matching based on sentence embeddings, including alignment-based features and maximum source coverage. One system uses tokens as match unit (SE tok +M), the other BPE (SE bpe +M). Table 6 shows the results for English ↔ Dutch. Compared to Baseline Transformer, the NFR system using token-based edit-distance matching (ED tok ) already performs considerably better for both translation directions (+4.96 BLEU for EN-NL, and +5.40 BLEU for NL-EN). In both cases, however, the best baseline is the system combining one sentence-embedding and one edit-distance match (ED tok /SE tok ‡ ), with an additional +1.92 BLEU for EN-NL and +0.90 for NL-EN. Compared to the best baseline, the two NFR configurations tested here score better according to all evaluation metrics for both EN-NL and NL-EN. For EN-NL, our best system (SE bpe +M) scores +0.54 BLEU better than the best baseline (p < 0.01). For NL-EN, the best system (SE tok +M) outperforms the best baseline by +0.86 BLEU (p < 0.001).
Table 6. Results for English ↔ Dutch. Statistical significance of the improvements in BLEU is tested against the strongest baseline scores, which are underlined (**: p < 0.01; ***: p < 0.001). Best scores for the tested systems are highlighted in bold.

The results for English ↔ French are presented in Table 7. The pattern that emerges is fairly similar to the one observed for English ↔ Dutch. ED tok outperforms Baseline Transformer by +3.61 BLEU for EN-FR and +5.20 BLEU for FR-EN, but the best baseline is ED tok /SE tok ‡ , scoring +1.35 BLEU higher than the second-best baseline for EN-FR and +1.03 BLEU for FR-EN. For both translation directions, the best NFR system is SE bpe +M. Compared to the best baseline, this system achieves +0.93 BLEU for EN-FR (p < 0.001) and +1.07 BLEU for FR-EN (p < 0.001).
Table 7. Results for English ↔ French. Statistical significance of the improvements in BLEU is tested against the strongest baseline scores, which are underlined (**: p < 0.01; ***: p < 0.001). Best scores for the tested systems are highlighted in bold.

Finally, Table 8 shows the results for English ↔ Polish. Generally speaking, these results are in line with those observed for the other language combinations. Here, too, ED tok /SE tok ‡ is the strongest baseline, scoring +1.09 BLEU better than ED tok for EN-PL and +2.04 for PL-EN. ED tok , in turn, outperformed Baseline Transformer by +3.96 BLEU for EN-PL and +4.81 for PL-EN. For both translation directions, SE bpe +M performs best, improving over the best baseline by a further +1.16 BLEU for EN-PL (p < 0.001) and +1.58 BLEU for PL-EN (p < 0.001).
Table 8. Results for English ↔ Polish. Statistical significance of the improvements in BLEU is tested against the strongest baseline scores, which are underlined (**: p < 0.01; ***: p < 0.001). Best scores for the tested systems are highlighted in bold.


Impact of Match Score and Sentence Length on Translation Quality
To obtain more insight into the performance of the NFR systems, we evaluate the impact of two variables that can influence the quality of the generated translations: the length of the input sentence [78] and the degree of similarity between the retrieved fuzzy matches and the input sentence [24]. For the purpose of this analysis, we focus once more on the language pair English ↔ Hungarian. This type of analysis can be informative for determining the most appropriate value for the threshold λ (i.e., the lower bound for the fuzzy match score), and it may tell us whether data augmentation is beneficial for input sentences of a certain length. Figures 3 and 4 plot the BLEU scores on the test set for each of the bins defined by input sentence length (calculated on the basis of word tokens) and fuzzy match score, for (a) SE tok +M and (b) SE bpe +M, compared to Baseline Transformer, for EN-HU and HU-EN, respectively. Appendix C contains an overview of the sizes of the corresponding bins for both language combinations. Considering that the bin sizes for SE tok +M in the lowest match range are very small, we exclude these bins from the interpretation of the analysis. We also note that the distribution of similarity scores was markedly different for BPE and token-based matches, with the latter being concentrated in the higher match ranges.
Figure 3. Comparison of (a) SE tok +M and (b) SE bpe +M with Baseline Transformer for different match ranges and sentence lengths, EN-HU. Note that, in (a), the bin sizes in the lowest match range are very small (i.e., between 3 and 9 sentences), so the reported BLEU scores for these bins are not reliable. S refers to sentences of length 1-10, M to 11-25, and L to over 25.
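The binning used in this analysis can be expressed as a small helper. The exact bin edges are our reading of the figures (S = 1-10 tokens, M = 11-25, L = over 25; match scores grouped per 0.1 interval from 0.5 upward), so treat them as assumptions:

```python
def bin_label(sent_len, match_score):
    """Assign a test sentence to a (length, match-range) bin, following the
    bands used in the fine-grained analysis: S = 1-10 tokens, M = 11-25,
    L = over 25; match scores are grouped per 0.1 interval, with the top
    range covering 0.9-1.0 (so a perfect 1.0 falls in the 0.9 bin)."""
    length = "S" if sent_len <= 10 else "M" if sent_len <= 25 else "L"
    lower = min(int(match_score * 10) / 10, 0.9)  # lower bound of the range
    return length, lower

print(bin_label(8, 0.73))   # ('S', 0.7)
print(bin_label(30, 1.0))   # ('L', 0.9)
```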
Even though Figure 3 shows a slightly different picture for the two systems using different match units, some trends can be observed: (a) BLEU scores increase with increasing match scores, (b) the added value of the NFR approach becomes greater in higher match ranges, (c) from a match score of 0.7 onward, the NFR system outperforms Baseline Transformer in each comparison, and (d) short sentences, overall, score higher than longer ones. Whereas SE bpe +M outperforms Baseline Transformer in every bin (i.e., above 0.5), for SE tok +M, this is not necessarily the case for matches scoring lower than 0.7.

The picture for HU-EN, shown in Figure 4, is in some respects similar to what we observed for EN-HU. Here, too, (a) the scores for both systems increase with increasing match scores, and (b) the higher the match range, the greater the positive impact of the NFR approach. However, in contrast to EN-HU, (c) the NFR systems only consistently outperform Baseline Transformer for match scores of 0.8 or higher, and (d) shorter input sentences do not necessarily lead to higher translation quality. For SE bpe +M, Baseline Transformer outperforms the NFR system in all bins in the lowest match range (i.e., 0.50-0.59). In the match range 0.60-0.69, this is the case for two out of three bins.
Figure 4. Comparison of (a) SE tok +M and (b) SE bpe +M with Baseline Transformer for different match ranges and sentence lengths, HU-EN. Note that, in (a), the bin sizes in the lowest match range are very small (i.e., between 3 and 5 sentences), so the reported BLEU scores for these bins are not reliable. S refers to sentences of length 1-10, M to 11-25, and L to over 25.

Discussion
Our detailed analyses for English ↔ Hungarian show that retrieving and ranking fuzzy matches using sentence embeddings leads to better results than edit distance or a combination of both methods. This confirms the usefulness of fuzzy matching based on distributed representations that capture semantic relationships between (sub-)words, in contrast to exact, string-based fuzzy matching [25,32]. One striking difference between ED and SE in our experiments is the percentage of input sentences for which a fuzzy match was retrieved: whereas almost all sentences in the training (as well as the test) set were augmented with a fuzzy match target using SE, with ED, this was only the case for around half of the sentences. In this study, we used a fuzzy match threshold of 0.5 for both ED and SE. Even though lowering the threshold for fuzzy matching in the case of ED could increase the proportion of input sentences for which a match is retrieved and data augmentation is performed, retrieved sentences that are (formally speaking) too dissimilar to or do not show enough overlap with the original sentence can lead to a decrease in translation quality [24]. It thus seems that SE is able to retrieve more matches that are informative in the NFR context. This may, in part, be related to the fact that matching based on SE is less likely to retrieve (semantically) unrelated sentences than matching based on ED, especially for short sentences (e.g., because of coincidental overlap in function words or punctuation). We return to the issue of the fuzzy matching threshold below.
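The contrast between the two match metrics can be illustrated with a toy example. Here, `difflib`'s ratio serves as a stand-in for a normalised edit-distance score, and the cosine function would operate on real sentence embeddings (e.g., from sent2vec); neither is the paper's implementation.

```python
import difflib

def ed_fuzzy_score(a_tokens, b_tokens):
    """Token-level surface similarity in [0, 1]; difflib's ratio is used as
    a stand-in for 1 minus the normalised edit distance."""
    return difflib.SequenceMatcher(None, a_tokens, b_tokens).ratio()

def cosine(u, v):
    """Cosine similarity between two (sentence-embedding) vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
    return dot / norm if norm else 0.0

# A paraphrase with little surface overlap: ED scores it low, while
# suitable sentence embeddings could still place the two sentences close.
a = "the meeting was postponed".split()
b = "the session has been deferred".split()
print(round(ed_fuzzy_score(a, b), 2))  # 0.22
```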
We also found that, generally speaking, fuzzy matching based on BPE performed slightly better than token-based matching, but this was not always the case. For example, the best-scoring NFR system for HU-EN used tokens as match unit, which is why we also tested NFR systems using token-based matching for the other language pairs. If we look across language combinations, on six out of eight occasions, the NFR system using BPE as match unit outperformed the system relying on tokens (the two exceptions being EN-HU and NL-EN). However, the differences between systems that use tokens and BPE were relatively small, never exceeding 1 BLEU point. The systems using LMVR did not perform as well as the ones using tokens and BPE. Since sub-word units were both used as matching and NMT unit, we cannot claim with certainty whether this lower performance is due to the performance of LMVR as a match unit or as an NMT unit. In this context, it is worth noting that previous research has shown that LMVR could lead to better translation quality than BPE for morphologically rich source languages [67].
With regard to features based on alignment information, adding such features improved the performance of all NFR systems for EN-HU, but not for HU-EN. For HU-EN, adding features was only beneficial when tokens were used as match unit, and when features were added in combination with maximum source coverage. These findings show that adding alignment-based features to matches retrieved using SE is potentially beneficial, and that providing such information should maybe not be restricted to fuzzy matches retrieved using ED or other exact string-based similarity metrics [25].
Across EN-HU and HU-EN, applying max_coverage in addition to features led to improvements in estimated translation quality for five out of six tested systems. The increase in BLEU scores was, however, more pronounced with Hungarian as the source language. Moreover, max_coverage also led to improvements in BLEU scores for systems without alignment features. To better understand the impact of applying this algorithm, we verified the proportion of sentences in the training set that was affected by it. max_coverage only affects the second fuzzy match target used for source augmentation (i.e., it may select a match other than the second highest-scoring one). For both EN-HU and HU-EN and for all systems, such a different second match was selected for between 68% and 77% of the sentences for which data augmentation was performed. This means that, for a majority of sentences, the fuzzy match with the second-highest match score was not the most informative for the NFR system, which shows that the match score should not be considered the only criterion for selecting useful matches for data augmentation.
Even though the tested modifications to the NFR system individually only led to small and at times inconsistent improvements to the estimated translation quality for EN-HU and HU-EN, in combination, they resulted in significant improvements over the best baseline system for all tested language combinations. The reported improvements were also not equally large for each language combination, with the most substantial improvements recorded for HU-EN (+1.69 BLEU) and PL-EN (+1.58), and the least for EN-NL (+0.54) and EN-HU (+0.38). It thus seems that the proposed modifications work best for source languages that are morphologically rich (i.e., Hungarian and Polish in our experiments), but this tentative conclusion would need to be confirmed in subsequent studies.
On the topic of the threshold that is used for fuzzy match retrieval, our analyses show that finding the optimal value for this threshold not only depends on the match metric, but also on the type of fuzzy match unit that is used and on the (source) language. In this context, we also pointed out that there were considerable differences between the distribution of similarity scores for BPE and token-based matches. We therefore argue that this threshold should be considered a tunable parameter of the NFR system, rather than a fixed value. According to our preliminary tests, the optimal value for this parameter varies between 0.5 and 0.7, but in a previous study, for example, a threshold of 0.8 for SE matching was applied [25]. Note that, for our data set and language combinations, however, a threshold of 0.5 led to better results for the system configuration used in that study. It thus seems plausible that the optimal threshold also varies between different data sets and domains. With regard to input sentence length, this factor had a clear impact on overall translation quality, but, according to our experiments, it did not seem beneficial to apply a filter to this parameter (in combination with fuzzy match score) for the purpose of NFR.
The experiments presented in this paper are limited in a number of ways. First, we only used one data set, albeit with multiple language pairs, and a single domain. Second, it was not possible to test all combinations of NFR configurations because of the high computational cost involved in fuzzy match retrieval for large data sets (when using sentence embeddings, with FAISS indexing, this operation takes approximately 48 hours for each training set on a single Tesla V100 GPU) and training Transformer models with training sets that are further enlarged (up to approximately twice the original size) through data augmentation. As a result, we did not, for example, test the impact of individual modifications to the NFR method for language combinations other than EN-HU and HU-EN, but only evaluated the combined impact of all modifications in comparison to the baselines. Finally, we relied on automated evaluation metrics only, and did not conduct any experiments involving human evaluation.

Conclusions
The experiments conducted in this study confirm previous findings [24,25] that applying the NFR data augmentation method leads to a considerable increase in estimated translation quality compared to baseline NMT models in a translation context in which a high-quality TM is available. Moreover, we identified a number of adaptations that can further improve the quality of the MT output generated by the NFR systems: retrieving fuzzy matches using cosine similarity for sentence embeddings obtained on the basis of sub-word units, adding features based on alignment information, and increasing the informativeness of retrieved matches by maximising source sentence coverage. When all proposed methods are combined, statistically significant improvements in BLEU scores were reported for all eight tested language combinations, EN↔{HU,NL,FR,PL}, compared to a baseline Transformer NMT model (up to +8.4 BLEU points) as well as an already strong NFR baseline [25] (up to +1.69 BLEU points).
We argue that TM-NMT integration is both useful in contexts where the generated automatic translation is used as a final product, and for integration in a professional translator's workflow. Not only does NFR increase the quality of the generated MT output, it may also help to overcome the lack of confidence in MT output on the part of translators. Both of these factors potentially have a positive impact on the required post-editing time.
There are several lines of research arising from this work that we intend to pursue. We would like to explore further adjustments to the NFR method involving (a) additional sub-word segmentation methods for fuzzy match retrieval, such as WordPiece [66] or SentencePiece [79], as well as fuzzy match retrieval using lemmas [80]; (b) the use of pretrained, multilingual sentence embeddings for fuzzy matching [81]; (c) more recent neural word alignment methods [47]; (d) alternative fuzzy match combination methods, e.g., by weighting fuzzy match score and the amount of overlap between input sentence and fuzzy matches; and (e) combinations with techniques for automatic post-editing [38,82]. A second line of research is to investigate the factors that influence the optimal fuzzy match threshold further, with the aim of better informing the selection of this threshold. It could also be interesting to supplement our experiments with an analysis of attention or saliency [83] to gain more insight into how the NMT system deals with augmented input sentences, for example to better study the impact of alignment features. Finally, despite the improvements achieved in estimated translation quality, the usefulness of the translations generated by the NFR system is yet to be confirmed by human judgements in the context of CAT workflows. In future work, we would also like to perform human evaluations, both in terms of perceived quality and post-editing time.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix B.1. NMT Systems
We trained our models with OpenNMT [84] using the Transformer architecture [54], with the hyper-parameters listed in Table A2, and the Adam optimizer. Note that we do not set the batch size in terms of the number of tokens, as this approach leads to a considerable difference in the number of sentences per batch between the NFR and the baseline NMT settings.
For the systems that utilise source-side features, we use a source word embedding size of 506, with six cells for the features (a total embedding size of 512). We train for a maximum of 1 million steps, with validation every 5000 steps. All models are trained with early stopping: training ends when the system has not improved for 10 validation rounds in terms of both accuracy and perplexity. We limit the source and target sentence length to 100 tokens for training the baseline NMT systems. The source sentence length is limited to 300 tokens for training the NFR models, as up to two additional sentences are used for data augmentation. All training runs are initialised with the same seed to avoid differences between systems due to randomness.
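The settings described above could be expressed as an OpenNMT-py configuration fragment along these lines. Option names follow OpenNMT-py conventions and may differ between versions; values not stated in the text (Table A2 is not reproduced here) are omitted, so this is a sketch, not the authors' actual configuration.

```yaml
# Sketch of an OpenNMT-py configuration reflecting the training details above.
seed: 1234                # same seed for all runs (illustrative value)
optim: adam
train_steps: 1000000
valid_steps: 5000
early_stopping: 10        # stop after 10 validation rounds without improvement
src_seq_length: 300       # 100 for the baseline systems; 300 for NFR models
tgt_seq_length: 100
word_vec_size: 512        # 506 + 6 feature cells when source features are used
```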

Appendix B.2. Sent2vec
To train our sent2vec models, we use the same hyper-parameters that are suggested in the description paper [55] for a sent2vec model trained on Wikipedia data containing both unigrams and bigrams. In our experiments, we distributed the training of a sent2vec model over 40 threads. The hyper-parameters are provided in Table A3.

Appendix B.3. FAISS
Because our goal is to find matches over all available sentences in the FAISS index, we create a Flat index with an inner-product metric to perform a brute-force search. By adding the L2-normalised vectors of the sentence representations to the index, and using an L2-normalised sentence vector as the input query, we are effectively using cosine similarity as the match metric. More information can be found at https://github.com/facebookresearch/faiss/wiki.
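The equivalence this setup relies on, namely that the inner product of L2-normalised vectors equals their cosine similarity, can be checked without FAISS:

```python
def l2_normalise(v):
    """Scale a vector to unit L2 norm."""
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

def inner_product(u, v):
    return sum(x * y for x, y in zip(u, v))

# Inner product of L2-normalised vectors equals cosine similarity, which is
# why a FAISS Flat inner-product index over normalised embeddings performs
# a brute-force cosine-similarity search.
u, v = [3.0, 4.0], [4.0, 3.0]
cos = inner_product(l2_normalise(u), l2_normalise(v))
print(round(cos, 2))  # 0.96
```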

Appendix B.4. LMVR
To use LMVR, we first have to train a Morfessor model (https://morfessor.readthedocs.io/en/latest/cmdtools.html#morfessor-train). This baseline model is then refined by LMVR. We use the same settings (see Table A4) as suggested in the examples at https://github.com/d-ataman/lmvr/blob/master/examples/example-train-segment.sh.

Appendix C. Bin Sizes
Table A5 shows the number of sentences in the test set classified by input sentence length (i.e., number of tokens prior to sub-word segmentation) and fuzzy match score for EN-HU, for both SE tok +M and SE bpe +M, the two best-performing systems. The distribution of matches across match ranges is markedly different for the two systems, with the similarity scores of token-based matches concentrated towards the higher end of the scale. There are hardly any best matches that score below 0.60. For SE bpe +M, the matches are spread more evenly across the different match ranges.