] is the mainstream method in MT research and development today. Interestingly, current state-of-the-art NMT systems (e.g., [3
]) do not make use of the context in which a source sentence to be translated appears. In other words, translation is performed in isolation, completely ignoring the remaining content of the document to which the source sentence belongs.
However, translation should not be performed in isolation, as in many cases the semantics of a source sentence can only be decoded by looking at the specific context of the document. Human translators work in computer-aided translation (CAT) tools, where the sentence to be translated appears in the context of the surrounding sentences. In recent years, there have been a number of approaches that have tried to incorporate document-level context into state-of-the-art NMT models [4
]. All this work demonstrated that the use of document-level context can positively impact the quality of translation in NMT.
In this work, we used HAN [6
], a document-level NMT model, to see how (document-level) context of a source sentence to be translated can impact translation quality. HAN is a context-aware NMT architecture that models the preceding context of a source sentence in a document for translation, and significantly outperforms NMT models that do not make use of document-level context. In this work, we investigated why context-aware models like HAN perform better than vanilla baseline NMT systems that do not take context into account. We considered three different morphologically distant language-pairs for our investigation: Hindi-to-English, Spanish-to-English and Chinese-to-English. We summarize the main contributions of this paper as follows:
It is a well-accepted belief that the context (neighbouring sentences) in which a sentence appears helps to improve the quality of its translation, and document-level MT models are built on this principle. We show how exactly context impacts translation quality in state-of-the-art document-level NMT systems.
For our investigation we chose a state-of-the-art document-level neural MT model (i.e., HAN) and three different morphologically distant language-pairs, experimented with the formation of context, and performed a comprehensive analysis on translations. Our research demonstrates that discourse information is not always useful in document-level NMT.
As far as document-level MT is concerned, our research provides a number of recommendations regarding the nature of context that can be useful in document-level MT.
The rest of this paper is organized as follows. In Section 2
, work related to our study is discussed. Section 3
details the data we utilized for our experiments. Our NMT models are discussed in detail in Section 4
. We discuss our evaluation strategy and results in Section 5
and Section 6
, respectively. Finally, Section 7
concludes our work by discussing avenues for future work.
2. Related Work
In recent years, there has been remarkable progress in NMT to the point where some researchers [8
] have started to claim that translations produced by NMT systems in specific domains are on par with human translation. Nevertheless, such evaluations were generally performed at the sentence level [1
], and document-level context was ignored in the evaluation task. Analogous to how human translators work, it should be the case that consideration of document-level context [9
] will help in resolving ambiguities and inconsistencies in MT. There has been a growing interest in modeling document-level context in NMT. As far as this direction of MT research is concerned, most of the studies aimed at improving translation quality by exploiting document-level context. For example, refs. [5
] have demonstrated that context helps in improving the translation including various linguistic phenomena such as anaphoric pronoun resolution and lexical cohesion.
Wang et al. [4
] proposed the idea of utilizing a context-aware MT architecture. Their architecture used a hierarchical recurrent neural network (RNN) on top of the encoder and decoder networks to summarize the context (previous n
sentences) of a source sentence to be translated. The summarized vector was then used to initialize the decoder, either directly or after going through a gate, or as an auxiliary input to the decoder state. They conducted experiments on large scale Chinese-to-English data and the outcome from those experiments clearly illustrates the significance of context in improving translation quality.
Tiedemann and Scherrer [11
] utilized an RNN-based MT model to investigate document-level MT. In their case, the context window was fixed to the preceding sentence and applied on a combination of both source and target sides. This was accomplished by extending both the source and target sentences to include the previous sentence as the context. Their experiments showed marginal improvements in translation quality.
Bawden et al. [12
] utilized multi-encoder NMT models that leverage context from the previous source sentence and combine the knowledge from the context and the source sentence. Their approach also involves a method that uses multiple encoders on the source side in order to decode the previous and current target sentences together. Despite the fact that they reported lower BLEU [17
] scores when considering the target-side context, they showed its significance by evaluating test sets for cohesion, co-reference, and coherence.
Maruf and Haffari [5
] proposed a document-level NMT architecture that used memory networks, a type of neural network that uses external memories to keep track of global context. The architecture used two memory components to consider context for both source and target sides. Experimental results show the success of their approach in exploiting the document context.
Voita et al. [7
] considered the Transformer architecture [3
] for investigating document-level MT, which they modified by injecting document-level context. They used an additional encoder (i.e., a context-based encoder) whose output is concatenated with the output of the source sentence-based encoder of the Transformer. The authors considered a single sentence as the context for translation, be it preceding or succeeding. They reported improvements in translation quality when the previous sentence was used as context, but their model could not outperform the baseline when the following sentence was used as the context.
Tan et al. [15
] proposed a hierarchical model that utilizes both local and global contexts. Their approach uses a sentence encoder to capture local dependency and a document encoder to capture global dependency. The hierarchical architecture propagates the context to each word to minimize mistranslations and to achieve context-specific translations. Their experiments showed significant improvements in document-level translation quality for benchmark corpora over strong baselines.
Unlike most approaches to document-level MT that utilize dual-encoder structures, Ma et al. [18
] proposed a Transformer model that utilizes a flat structure with a unified encoder. In this model, the attention focuses on both local and global context by splitting the encoder into two parts. Their experiments demonstrate significant improvements in translation quality on two datasets by using a flat Transformer over both the uni-encoder and dual-encoder architectures.
Zhang et al. [13
] proposed a new document-level architecture called Multi-Hop Transformer. Their approach involves iteratively refining sentence-level translations by utilizing contextual clues from the source and target antecedent sentences. Their experiments confirm the effectiveness of their approach by showing significant translation improvements, and by resolving various linguistic phenomena like co-reference and polysemy on both context-aware and context-agnostic baselines.
Lopes et al. [19
] conducted a systematic comparison of different document-level MT systems based on large pre-trained language models. They introduced and evaluated a variant of Star Transformer [20
] that incorporates document-level context. They showed the significance of their approach by evaluating test sets for anaphoric pronoun translation, demonstrating improvements both in pronoun translation and in overall translation quality.
Kim et al. [21
] investigated advances in document-level MT using general domain (non-targeted) datasets over targeted test sets. Their experiments on non-targeted datasets showed that improvements could not be attributed to context utilization, but rather the quality improvements were attributable to regularization. Additionally, their findings suggest that word embeddings are sufficient for context representation.
Stojanovski and Fraser [14
] explored the extent to which contextual information of documents is usable for zero-resource domain adaptation. The authors proposed two variants of the Transformer model to handle a significantly large context. Their findings on document-level context-aware NMT models showed that document-level context can be leveraged to obtain domain signals. Furthermore, the proposed models benefit from significant context and also obtain strong performance in multi-domain scenarios.
Yin et al. [22
] introduced Supporting Context for Ambiguous Translations (SCAT), an English-to-French dataset for pronoun disambiguation. They discovered that regularizing attention with SCAT enhances anaphoric pronoun translation, implying that supervising attention with supporting context from various tasks could help models to resolve other sorts of ambiguities.
Yun et al. [23
] proposed a Hierarchical Context Encoder (HCE) to exploit context from multiple sentences using a hierarchical attentional network. The proposed encoder extracts sentence-level information from preceding sentences and then hierarchically encodes context-level information. The experiments for increasing contextual usage show that their approach of using HCE performs better than their baseline methods. In addition, a detailed evaluation of pronoun resolution shows that HCE can exploit contextual information to a great extent.
Maruf et al. [16
] proposed a hierarchical attention mechanism for document-level NMT, forcing the attention to focus selectively on keywords in relevant sentences of the document. They also introduced single-level attention to utilize sentence- or word-level information in the document context. The generated context representations are integrated into the encoder or decoder networks. Experiments on English-to-German translation show that their approach significantly improves over most of the baselines. Readers interested in a more detailed survey on document-level MT can consult the paper by [24].
To summarize, numerous architectures have been proposed in recent times for incorporating document-level context. In their approaches, Wang et al. [4
], Maruf and Haffari [1
], Tiedemann and Scherrer [11
], and Zhang et al. [13
] mainly relied on modeling local context from previous sentences of the document. Some papers [5
] use memory networks, a type of neural network that uses external memories or cache memories to keep track of the global context. Others [7
] have focused on giving more importance to the attention mechanism, while hierarchical networks have been used to exploit context from multiple sentences. Miculicich et al. [6
] proposed HAN which uses hierarchical attention networks to incorporate previous context into MT models. They modeled contextual and source sentence information in a structured way by using word- and sentence-level abstractions. More specifically, HAN considers the preceding n
sentences as context for both source- and target-side data. Their approach clearly demonstrated the importance of wider contextual information in NMT. They show that their context-aware models can significantly outperform sentence-based baseline NMT models.
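The two-level attention idea behind such hierarchical context models can be sketched as follows. This is a minimal numpy illustration with toy dimensions and random vectors, not the exact HAN architecture or parameterization:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: a weighted sum of the values, weighted
    by the similarity of the query to each key."""
    weights = softmax(keys @ query)
    return weights @ values

def hierarchical_context(query, prev_sentences):
    """Two-level attention over the previous n sentences: word-level
    attention summarizes each sentence, then sentence-level attention
    summarizes the sentence summaries into one context vector."""
    sent_summaries = np.stack([
        attend(query, words, words)   # word-level attention within a sentence
        for words in prev_sentences
    ])
    return attend(query, sent_summaries, sent_summaries)  # sentence level

d = 4                                           # toy hidden size
rng = np.random.default_rng(1)
query = rng.normal(size=d)                      # current encoding/decoding state
prev = [rng.normal(size=(5, d)) for _ in range(3)]  # 3 previous sentences, 5 words each
ctx = hierarchical_context(query, prev)
print(ctx.shape)  # → (4,)
```

The hierarchy lets the model first decide which words matter within each context sentence, and then which context sentences matter for the current translation step.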
The usage of context in document-level translation was thoroughly investigated by Kim et al. [21
]. Their analysis showed that improvements in translation were due to regularization and not context utilization. Lopes et al. [19
] found, in a systematic comparison of different document-level MT systems, that context-aware techniques are less advantageous on larger datasets with strong sentence-level baselines. Although the experiments by Miculicich et al. [6
] show that context helps improve translation quality, it is not evident why their context-aware models perform better than those that do not take the context into account. We wanted to investigate why and when context helps to improve translation quality in document-level NMT. Accordingly, we performed a comprehensive qualitative analysis to better understand its actual role in document-level NMT. The subsequent sections first detail the dataset used for our investigation, describe the baseline and document-level MT systems, and present the results obtained.
6. Results and Discussion
We evaluated the MT systems (Transformer and HAN) on the Hindi-to-English, Spanish-to-English and Chinese-to-English translation tasks on OrigTestset, and present BLEU, chrF, TER and METEOR scores in Table 4
. As can be seen from Table 4
, HAN outperforms Transformer in terms of BLEU, chrF, TER and METEOR evaluation metrics. We performed statistical significance tests using bootstrap resampling [36
]. We found the differences in scores to be statistically significant. This demonstrates that context is helpful when integrated into NMT models.
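The significance test can be sketched as follows. For simplicity, this sketch averages hypothetical per-sentence scores on each resample, whereas the actual test recomputes the corpus-level metric per resample; scores and names here are illustrative only:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=42):
    """Paired bootstrap resampling over per-sentence scores.
    Returns the fraction of resamples in which system A beats system B;
    a value >= 0.95 is commonly read as A being significantly better."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples

# Hypothetical per-sentence scores: context-aware system (A) vs. baseline (B).
han_scores = [0.32, 0.41, 0.28, 0.37, 0.45, 0.30, 0.39, 0.36]
base_scores = [0.30, 0.38, 0.27, 0.33, 0.40, 0.29, 0.35, 0.33]
print(paired_bootstrap(han_scores, base_scores))  # → 1.0 (A wins every resample)
```

Because the same resampled sentence indices are used for both systems, the comparison is paired, which makes the test sensitive to consistent per-sentence differences.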
As mentioned above, in order to further assess the impact of context on translations produced by the HAN models, we randomly shuffled the test set sentences of OrigTestset five times, and created five different test sets, namely ShuffleTestsets. We evaluated the HAN models on these shuffled test sets (ShuffleTestsets) and report BLEU, chrF, TER and METEOR scores in Table 5
. As can be seen from Table 5
, the context-aware NMT model produces nearly identical BLEU, chrF, TER and METEOR scores across the ShuffleTestsets. Although the scores in Table 4 suggested that context helps to improve the translation quality of HAN, the scores in Table 5 undermine the positive impact of context in HAN. We again carried out statistical significance tests using bootstrap resampling and found the differences in scores to be statistically significant.
Furthermore, we analyzed the translation scores (BLEU, chrF, TER and METEOR) generated by HAN and found that 14%, 16% and 17% of the sentence translations vary significantly across the five shuffles (i.e., the five ShuffleTestsets) for Hindi-to-English, Spanish-to-English, and Chinese-to-English, respectively. We also observed that 58%, 64% and 61% of the sentence translations do not vary (i.e., remain the same) across the five shuffles for Hindi-to-English, Spanish-to-English, and Chinese-to-English, respectively.
These findings encouraged us to scale up our experiments, so we increased the number of samples in order to obtain further insights. For this, we shuffled our test data fifty times, which provided us with fifty ShuffleTestsets. We computed the mean of the variances for each sentence in the discourse over the fifty ShuffleTestsets. From now on, we call this measure MV (mean of the variances). This resulted in a single MV score for each sentence in the test set. We then calculated the sample mean ($\bar{x}$) and standard deviation ($s$) from the sampling distribution, i.e., the MV scores, and the 95% confidence interval of the population mean ($\mu$) using the formula: $\mu = \bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = \bar{x} \pm 1.96\,\frac{s}{\sqrt{n}}$. (The mean of the sampling distribution of $\bar{x}$ equals the mean of the sampled population. Since the sample size is large ($n = 50$), we use the sample standard deviation, $s$, as an estimate for $\sigma$ in the confidence interval formula.)
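The computation can be sketched as follows for a single metric; in the paper, the per-sentence variances are additionally averaged over BLEU, chrF, TER and METEOR to obtain MV. The scores below are toy values with five shuffles instead of fifty:

```python
import statistics

def shuffle_variances(scores_by_sentence):
    """For each sentence, the variance of its metric score across the
    ShuffleTestsets (scores_by_sentence[s] is that sentence's score list)."""
    return [statistics.pvariance(s) for s in scores_by_sentence]

def confidence_interval_95(values):
    """95% CI for the population mean, using the sample standard deviation s
    as an estimate of sigma: mean +/- 1.96 * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / n ** 0.5
    return mean - half, mean + half

# Toy per-sentence BLEU scores across 5 shuffles (50 in the actual setup).
scores = [
    [0.30, 0.31, 0.30, 0.29, 0.30],   # stable sentence -> low variance
    [0.10, 0.40, 0.25, 0.05, 0.45],   # unstable sentence -> high variance
    [0.20, 0.22, 0.21, 0.20, 0.21],
]
variances = shuffle_variances(scores)
lo, hi = confidence_interval_95(variances)
print(variances[1] > variances[0])  # → True
```

A sentence whose translation is strongly influenced by the surrounding context will show a large variance across shuffles, which is exactly what the MV score captures.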
The last row of Table 6
shows the 95% confidence interval of BLEU, chrF, TER and METEOR obtained from the sampling distribution of MV scores of the test set sentences. The above method leads us to classify the test set sentences into three categories: (i) context-sensitive
, (ii) context-insensitive
, and (iii) normal
. We focused on investigating sentences that belong to the two extreme zones (the first two categories), i.e., context-sensitive and context-insensitive. We now explain how we classified the test set sentences with an example. We selected two sentences from the test set of the Spanish-to-English task, and show the variances that were calculated from the distribution of the BLEU, chrF, TER and METEOR scores in the first two rows of Table 6
. As can be seen from Table 6
, the variances of both sentences lie outside the confidence interval (CI). The sentence with a value higher than the CI is classified as context-sensitive, and the one with a value lower than the CI is classified as context-insensitive.
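The resulting classification rule can be sketched as follows, with hypothetical CI bounds:

```python
def classify(mv, ci_low, ci_high):
    """Classify a sentence by where its MV score falls relative to the
    95% confidence interval of the population mean:
    above -> context-sensitive, below -> context-insensitive, inside -> normal."""
    if mv > ci_high:
        return "context-sensitive"
    if mv < ci_low:
        return "context-insensitive"
    return "normal"

ci_low, ci_high = 0.002, 0.015            # hypothetical CI bounds
print(classify(0.0250, ci_low, ci_high))  # → context-sensitive
print(classify(0.0005, ci_low, ci_high))  # → context-insensitive
print(classify(0.0080, ci_low, ci_high))  # → normal
```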
We also show the variances obtained for BLEU in Figure 3
a–c, for Hindi-to-English, Spanish-to-English and Chinese-to-English, respectively. The green, blue and red bars represent context-sensitive, context-insensitive and normal sentences, respectively. The figures clearly separate the distributions of sentences over the three classes.
Furthermore, we manually checked the translations of the sentences of context-sensitive and context-insensitive categories. We observed that contextual information indeed impacts the quality of the translations for the sentences of the context-sensitive class. We also observed that the quality of the translations mostly remains unaltered for the sentences of the context-insensitive class across shuffles.
6.2. Context-Sensitive Sentences
Context-sensitive sentences are those that are most susceptible to contextual influence. Their translations usually vary significantly with a change in the preceding context. We observed that in such cases the context either improves or degrades the translations. In Table 7
, we report the maximum, mean and minimum scores of the set of context-sensitive sentences in a test set. Note that these statistics were calculated over all fifty ShuffleTestsets. We can clearly see from Table 7
that the translations of context-sensitive sentences are impacted by contextual information.
In Table 8
, we give some examples of context-sensitive sentences for Hindi-to-English, Spanish-to-English, and Chinese-to-English for source, target, and various shuffle iterations. We see translations of the Hindi word
(examined) are “examined”, “questioned”, “questioned” in shuffle1, shuffle2, and shuffle3 sets, respectively. Similarly for Spanish-to-English, we see that the word Beethoven
in the Spanish sentence is translated incorrectly to “symphony” in shuffle2, correctly in shuffle1 and produces no translation in shuffle3. In the case of Chinese-to-English, translation of the word 交响乐
is “Symphony” (target). In the case of shuffle1, we see that the MT system produces the correct translation for that Chinese word. As for shuffle2 and shuffle3, we see that the translations do not include the target counterpart of the Chinese word 交响乐.
6.3. Context-Insensitive Sentences
Context-insensitive sentences are those that are least susceptible to contextual influence. These sentences maintain their translation quality irrespective of the context provided. In Table 9
, we report the maximum, mean and minimum scores for the set of context-insensitive sentences in a test set. We can clearly see from Table 9
that the context-insensitive sentences are less impacted by contextual information as compared with the context-sensitive sentences.
We observed that the BLEU and chrF scores remain almost constant across the shuffles regardless of the different contexts. Therefore, we can conclude that context has little or no impact on the translation quality of context-insensitive sentences.
We carried out our experiments to see how context usage by HAN impacts the quality of translation. We presented our results in Table 4
, Table 5
and Table 7
. The scores in these tables clearly indicate, as expected, that the HAN architecture is sensitive to context. This finding corroborates our sentence-level analysis, in which the NMT systems produced different translations for each context-sensitive sentence when provided with different contexts. Furthermore, we also observed from the scores presented in Table 4
that context is useful in improving the overall translation quality.
In this paper, we investigated the influence of context in NMT. Based on the results of the experiments we carried out for Hindi-to-English, Spanish-to-English, and Chinese-to-English, we found that, as expected, the HAN model is sensitive to context. This is indicated by our observation that the context-aware NMT system significantly outperforms the context-agnostic NMT system in terms of BLEU, chrF, TER and METEOR. We probed this and found that it is due to the context-sensitive class of sentences, which impacts the translation quality the most.
These findings also lead us to categorize the test set sentences into three classes: (i) context-sensitive, (ii) normal, and (iii) context-insensitive sentences. While the translation quality of context-sensitive sentences is affected by the presence or absence of the correct contextual information, the translation quality of context-insensitive sentences is not sensitive to context. We believe that investigating this problem (i.e., identifying the correct context for context-sensitive sentences) could positively impact discourse-level MT research. In future work, we plan to examine the characteristics of the contexts of the source sentences in the context-sensitive class in order to better understand why the translation of these sentences is so sensitive to context.
We also plan to build a classifier that can recognize the context-sensitive sentences of a test set. Our next set of experiments will focus on identifying and providing the correct context to the sentences of the context-sensitive class. We also aim to investigate how the presence of terminology in context or in the source sentence can impact the quality of translations.