Context-Aware Neural Machine Translation for Korean Honorific Expressions

Abstract: Neural machine translation (NMT) is a text generation task that has achieved significant improvement with the rise of deep neural networks. However, language-specific problems, such as handling the translation of honorifics, have received little attention. In this paper, we propose a context-aware NMT approach to improve the translation of Korean honorifics. By exploiting information such as the relationship between speakers from the surrounding sentences, our proposed model effectively manages the use of honorific expressions. Specifically, we utilize a novel encoder architecture that can represent the contextual information of the given input sentences. Furthermore, a context-aware post-editing (CAPE) technique is adopted to refine a set of inconsistent sentence-level honorific translations. Demonstrating the efficacy of the proposed method requires honorific-labeled test data, so we also design a heuristic that labels Korean sentences as honorific or non-honorific in style. Experimental results show that our proposed method outperforms sentence-level NMT baselines both in overall translation quality and in honorific translations.


Introduction
Neural machine translation (NMT) has shown impressive translation quality, owing to the availability of vast parallel corpora [1] and the introduction of novel deep neural network (DNN) architectures such as the encoder-decoder model [2,3] and self-attention-based networks [4]. The performance of NMT systems has reached parity with human translators in some domains, and hence many commercial MT services, such as Google Translate, have adopted NMT as the backbone of their translation systems [5].
Despite the significant improvement over previous machine translation (MT) systems, NMT still suffers from language-specific problems such as Russian pronoun resolution [6] and honorifics. Addressing such language-specific problems is crucial in both personal and business communications [7], not only because the preservation of meaning is necessary but also because many of these problems are closely tied to culture. Honorifics, which convey respect to the audience, are a good example. In languages that use honorifics frequently, including Korean, Japanese, and Hindi, using the correct honorifics is considered imperative.
In Korean, one of the most frequent uses of honorifics occurs in conversations with people in superior positions or with elders [8]. As shown in Figure 1, the source English sentence "Wait a minute, please.", which is the second utterance by the son, is translated into the target sentence "잠시만 기다려요." (jam-si-man gi-da-lyeo-yo), which is in haeyo-che (해요체) because the sentence ends with -요 (-yo). Haeyo-che is a type of Korean honorific reflecting the relationship between the two speakers.
In this paper, we propose a novel method to remedy the limitations of depending solely on the source sentence by using context, represented by the sentences surrounding the source sentence. In Figure 1, we can infer that this is a dialogue between a son and his father from the content of context_1 and the source sentence. Therefore, if such context is taken into account, the model can determine that the source sentence should be translated into a polite sentence using honorifics, such as haeyo-che (해요체).
To this end, we introduce a context-aware NMT that incorporates context to improve Korean honorific translation. It is known that context-aware NMT can improve the translation of words or phrases that need contextual information, such as pronouns that are sensitive to plurality and/or gender [10]. Considering the above example of how the adequate honorific style can be determined using context, we suggest that context-aware NMT can also aid honorific-aware translation. To the best of our knowledge, this work is the first attempt to utilize context-aware NMT for honorific-aware translation.
We consider two types of context-aware NMT framework in our proposed method. First, we use a contextual encoder that takes context in addition to the source sentence as input. The encoder captures contextual information from the source language that is needed to determine target honorifics. Second, a context-aware post-editing (CAPE) system is adopted to take the context of translated target sentences for refining the sentence-level translations accordingly.
To demonstrate the performance of our method, an honorific-labeled parallel corpus is needed, so we also developed a simple and fast rule-based honorific annotation method for labeling the test data. In the experiments, we compared our context-aware systems with context-agnostic models and show that our method significantly outperformed the context-agnostic baselines in both overall translation quality and the translation of honorifics.
We hope that our proposed method improves the overall quality of Korean NMT and thus expands the real-world use of NMT for communicating in Korean. Adequate use of honorifics can greatly improve the overall quality of Korean translations, especially in spoken language translation (SLT) systems. We suggest that MT systems for applications such as movie/TV captioning and chatting can benefit from our method.
Our contributions can be summarized as threefold:
• We show that an NMT model with a contextual encoder improves the quality of honorific translation regardless of the model structure. In our experiments, even the simplest model, which concatenates all the contextual sentences with the source sentence, improves honorific accuracy. We also show that an NMT model with a contextual encoder outperforms the sentence-level model even when the model is explicitly controlled to translate to a specific honorific style.
• In addition to the contextual encoder, we demonstrate that the CAPE can improve the honorifics of both sentence-level NMT and contextual NMT by exploiting contextual sentences in the target language. Our qualitative analysis also reveals the ability of CAPE to correct the inconsistent use of honorifics by the NMT model with a contextual encoder.
• We also develop automatic data annotation heuristics for labeling Korean sentences as honorific or non-honorific in style. Our heuristics utilize Korean morphology to precisely determine the honorific style of a given sentence. We labeled our test set using these heuristics and used it to validate the improvements of our proposed method.
The remainder of this paper is organized as follows: We briefly review related work in Section 2 and introduce Korean honorifics in Section 3. Context-aware NMT methods are presented in Section 4. We introduce our methods in Section 5 and then show the experimental results in Section 6. Finally, we present our conclusion in Section 7.

Neural Machine Translation
NMT represents the translation directly via a DNN. This differs from traditional methods such as statistical MT (SMT), which consist of a number of subcomponents such as the translation model and the language model. Generally, most NMT models consist of two parts, one that takes the source sentence and the other that generates the target sentence, each represented as a sequence of vectors. This framework is the so-called encoder-decoder or sequence-to-sequence model [2,3]. The model is then trained on a parallel corpus, which consists of many pairs of source and target sentences.
Early NMT methods were composed of recurrent neural networks (RNNs) such as long short-term memories (LSTMs) [3]. More recently, the attention mechanism [11] made a breakthrough in the field of NMT. It summarizes a sequence of vectors by finding the part of the sequence most relevant to the input. In its early stage, the attention mechanism was widely used as a sub-component of the model, attending to the encoded source sentence [11]. The Transformer [4] then adopted the attention mechanism as the backbone of the model, which consists of an attentional network followed by a feedforward network. This greatly improved translation quality compared with RNN-based methods with attention, and the Transformer is now widely used as a base for NMT and many other natural language processing (NLP) methods. In addition to architectural improvements, more sophisticated training methods such as back-translation [12] and language model (LM) pretraining (e.g., BERT [13], MASS [14], and BART [15]) have also been studied to further improve translation quality.
There have been a number of MT studies involving Korean. Because parallel corpora containing Korean are not as widely available as those for English and many European languages, a number of existing works have focused on low-resource MT settings. For example, Xu et al. [16] exploited out-of-domain and multilingual parallel corpora, and Jeong et al. [17] applied LM pretraining and back-translation. In addition, some other works have developed techniques to overcome the limitations of common low-resource MT methods. For example, Nguyen et al. [18] incorporated morphological information and word-sense disambiguation (WSD) on Korean source sentences to improve translation into Vietnamese. Park et al. [19] focused on beam search decoding and experimented with various decoding settings, including beam size, to improve translation quality without re-training the target NMT model. Although low-resource MT methods are out of the scope of this paper, some techniques, including back-translation, are closely related to our method of training the CAPE.

Controlling the Styles in NMT
Although the style of a generated translation also affects the quality of machine translation, it has received little attention in the field of NMT. Since the source sentence contains insufficient information about the output style, most existing works have introduced a set of special tokens [20]. For example, to control the formality of the target sentence, one can add <F> at the beginning of the source sentence to translate formally or add <I> to translate informally. The model can attend to this token and extract the relevant linguistic features during training. This approach has been adopted in many subsequent works such as [21,22]. Some other works have addressed this problem as domain adaptation, treating each style as a domain [23], or adopted multitask learning of machine translation and style transfer to address the lack of a style-annotated parallel corpus [24], but the output is still controlled by special tokens. By contrast, our approach can improve honorific translation without such special tokens by exploiting the contextual information of the surrounding text. In addition, our method can be combined with methods using special tokens to further improve honorific accuracy.
On the other hand, a few studies have addressed style-controlled MT for particular grammatical styles. English formality [25] and the T-V distinction in European languages such as Spanish [7] are two common examples. Viswanathan et al. [7] addressed the control of the T-V distinction, such as the use of a formal/informal form of second-person pronouns (usted vs. tú), as domain adaptation. Niu et al. [25] showed that employing syntactic supervision can improve the control of English formality. Furthermore, a few studies have addressed the honorifics of Asian languages such as Japanese [26] and Korean [22]. Wang et al. [22] used data labeling and reinforcement learning (RL) to enhance the translation of Korean honorifics. However, they ignored contextual sentences and relied only on special tokens to control the honorifics.

Context-Aware NMT
Context-aware MT models focus on contextual information in the surrounding text [27], and either source-side or target-side context can be considered. Exploiting source-side context usually involves an additional encoder that efficiently represents the multiple contextual sentences [6,28,29]. On the other hand, target-side context can be exploited by first translating a part of a document or discourse at the sentence level and then refining those translations. This can be implemented either by multi-pass decoding or by automatic post-editing (PE). A multi-pass decoder generates the translation at the sentence level first and then translates again, regarding the translated sentences as context [30,31]. In contrast, context-aware PE corrects the common and frequent errors of sentence-level models by considering both the target sentence and its context [10]. We choose to use both sides of context; the source-side context helps to choose suitable honorifics in the target sentence, whereas the target-side context is helpful for correcting inconsistencies in honorific translations, since we focus on honorifics in the target language.
Many context-aware MT studies have focused on improving pronoun resolution, such as choosing the correct gender or number for pronouns. For example, Voita et al. [6,10] addressed the translation of Russian, and Müller et al. [32] focused on German pronoun resolution. To the best of our knowledge, our work is the first attempt to use context-aware NMT to control grammatical styles such as honorifics.

Addressing Korean Honorifics in Context
In this section, we present an overview of the Korean honorifics system and how contextual sentences can be used to infer appropriate honorifics for translation.
Although all six examples are translated from the same English sentence, each has its own level of formality and politeness and its own usage. For example, banmal-che (반말체) and haela-che (해라체) are used between people with close relationships or by the elderly when speaking to younger people. Conversely, haeyo-che (해요체) and hapsio-che (합쇼체) are used to honor the addressee and express politeness [8].

Table 1. Speech levels and sentence endings in Korean. Names are translated with respect to [8]. Each example sentence is a translation of "The weather is cold." (nal-ssi-ga chub-da); each underlined sentence ending corresponds to its addressee honorific. (Columns: Style and Name, Politeness, Formality, Example.)

The Role of Context on Choosing Honorifics
As stated earlier, the relationship between the speaker and the audience affects the use of Korean honorifics. For example, a student should use haeyo-che or hapsio-che as addressee honorifics when asking a teacher questions. Since such social context is often reflected in utterances, readers may infer the relationship from the text without knowing who the speakers and/or audiences are.
In Figure 1, we can infer that the source and contextual sentences constitute a dialogue between a dad and his son, and that context_1 and the source sentence are utterances of the son, so the source English sentence should be translated into a polite Korean sentence as shown. Figure 2 shows two other examples from our dataset. In (a), there is a dialogue between a person (context_0) and his/her superior (context_1), so their Korean translations are polite (haeyo-che) and impolite (banmal-che), respectively. In addition, we can infer that the source sentence is also an utterance by the same person who spoke context_0, as we find the same pronoun we used for self-reference. Therefore, the translation should end as "중독 됐어요" (jung-dog dwaess-eo-yo), which has the same honorifics as context_0, instead of using banmal-che, as in "중독 됐어" (jung-dog dwaess-eo).
On the other hand, (b) shows the usage of hapsio-che, which is frequently used for formal expressions, in context_0 and the source sentence, as both sentences end with '-ㅂ니다' (-b-nida). The word suspect (용의자, yong-ui-ja) in context_0 gives us a hint that context_0 is spoken by a police officer, prosecutor, etc., since the word is frequently used in those occupations. We can also infer from the pronoun you that this dialogue is not held between such officers; rather, the utterances are addressed to a witness or a similar party. Therefore, context_0 and the source sentence should be translated into formal Korean utterances rather than informal sentences.
As shown in the examples, contextual sentences often contain important clues for choosing appropriate honorifics in Korean translation. However, prior approaches to honorific-aware NMT, including [26] for Japanese and [22] for Korean, have ignored such context. Instead, they explicitly controlled the model to translate the source sentence into a specific honorific style, using special tokens to indicate the target honorific as in [20].

Context-Aware NMT Frameworks
To utilize contextual sentences in NMT, we introduce context-aware NMT systems. These fall into two categories: NMT models with contextual encoders and a CAPE system. Here, we briefly review these systems before explaining our proposed method.

NMT Model with Contextual Encoders
Generally, NMT models operate at the sentence level: they take an input sentence in a source language and return an output sentence in a target language. In contrast, a contextual encoder in NMT is designed to handle one or more contextual sentences as input and extract a contextual representation. In our settings, NMT models are based on the Transformer [4], which is built on a stack of attentional networks. Each hidden layer in the Transformer consists of a self-attention mechanism followed by feedforward networks. Because of its performance and efficiency, the Transformer has been widely used in NMT, and many improvements have been made to it, including contextual encoders. We list the five Transformer-based models in our experiments:
• Transformer without contexts (TwoC): As a baseline, we experimented with the TwoC model, which has the same structure as [4]. TwoC does not use any contextual sentences and only incorporates the input and target sentences.
• Transformer with contexts (TwC): This is the simplest approach to incorporating contextual sentences with the Transformer [27]. TwC concatenates all contextual sentences and the input sentence and treats the concatenation as a single input sentence. The output of the TwC encoder is thus the output of a stacked Transformer encoder over the concatenated source and contextual sentences.
• Discourse Aware Transformer (DAT) [6]: DAT handles a single contextual sentence with an extra context encoder, which is also a stacked Transformer encoder. To handle multiple contextual sentences, we slightly modified DAT so that the contextual encoder takes a concatenation of contextual sentences. The context encoder has the same structure as the source encoder and even shares its weights. Encoded contextual sentences are integrated with the encoded source sentence using a source-to-context attention mechanism and a gated summation.
• Hierarchical Attention Networks (HAN) [28]: HAN has a hierarchical, two-stage structure at every hidden layer of its contextual encoder. At the first level of the hierarchy, HAN encodes each contextual sentence into sentence-level tensors using the stacked Transformer encoder as in [4]. Each encoded sentence is then summarized by word-level context-source attention, resulting in sentence-level representations. These sentence-level vectors are concatenated and again encoded with sentence-level context-source attention. Finally, the encoded contextual sentences are integrated using a gated summation.
• Hierarchical Context Encoder (HCE) [34]: HCE exploits a hierarchical structure similar to HAN but uses a different method to summarize word-level and sentence-level information. In the lower part of the hierarchy, the encoded sentence-level tensor is compressed into a sentence-level vector by a self-attentive weighted-sum module similar to that of [35]. The collection of sentence-level vectors is fed into another Transformer encoder layer, the upper part of the hierarchy, to encode the entire contextual information into a single tensor. Finally, the contextual information tensor is combined with the source encoder in a similar fashion to DAT.
All the model structures are described in Figure 3.
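The gated summation used by DAT (and, in similar fashion, HCE) to merge the encoded context into the source representation can be sketched as follows. This is a minimal numpy illustration, not the actual implementation; the parameter names and the toy dimensions are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_sum(src, ctx, W_g, b_g):
    # Gate computed from the concatenated source and context encodings;
    # the output is an element-wise convex combination of the two.
    g = sigmoid(np.concatenate([src, ctx], axis=-1) @ W_g + b_g)
    return g * src + (1.0 - g) * ctx

rng = np.random.default_rng(0)
d = 8                                   # toy hidden dimension
src = rng.normal(size=(5, d))           # encoded source sentence (length 5)
ctx = rng.normal(size=(5, d))           # context attended to source positions
W_g = rng.normal(size=(2 * d, d)) * 0.1 # hypothetical gate parameters
b_g = np.zeros(d)
out = gated_sum(src, ctx, W_g, b_g)     # shape (5, d)
```

Because the gate lies in (0, 1), each output element stays between the corresponding source and context values, so the model can smoothly interpolate between ignoring and fully using the context.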

Context-Aware Post Editing (CAPE)
CAPE is a variant of automatic post-editing (PE) systems (e.g., Vu et al. [36]). PE fixes systematic errors that frequently occur in a specific machine translation system. Most PE systems operate at the sentence level; however, Voita et al. [10] suggested using PE to correct inconsistencies between sentence-level translations of a context-agnostic MT system. Analogous to many existing PE systems, the CAPE itself is independent of a specific MT model and can therefore, in principle, be trained to correct translations from any black-box MT system, including a context-aware NMT system.
The training and testing process of CAPE is illustrated in Figure 4. First, the translation inconsistency of the target NMT model is simulated by round-trip translation. For example, to refine an English-to-Korean NMT system, Korean sentences are first translated into English using a Korean-to-English NMT system; then, they are back-translated into Korean with the target English-to-Korean NMT system. In this way, the errors of the NMT model are represented as the differences and inconsistencies between the original Korean sentences and their round-trip translations. Once these round-trip translations are prepared, the CAPE, which consists of a typical sequence-to-sequence model, is trained to minimize these gaps. At test time, the target NMT system translates each sentence first, and the CAPE then takes a group of such translations and produces fixed translations. CAPE has been shown to improve the English-to-Russian translation of context-sensitive phenomena such as deixis and ellipsis [10].
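The round-trip data generation step above can be sketched as follows. Here `ko_to_en` and `en_to_ko` stand in for the pretrained sentence-level NMT systems, and the `<SEP>` joining scheme is an illustrative assumption, not the paper's exact format.

```python
def make_cape_examples(ko_groups, ko_to_en, en_to_ko, sep=" <SEP> "):
    """Build CAPE training pairs from groups of consecutive Korean sentences.

    ko_to_en / en_to_ko are the pretrained sentence-level NMT systems
    (placeholders here). Each training pair maps the possibly inconsistent
    round-trip translation of a sentence group to the original group.
    """
    examples = []
    for group in ko_groups:
        # Round-trip each sentence independently to simulate the
        # sentence-level NMT model's errors and inconsistencies.
        round_trip = [en_to_ko(ko_to_en(s)) for s in group]
        examples.append((sep.join(round_trip), sep.join(group)))
    return examples
```

With real NMT systems, the first element of each pair differs from the second exactly where the sentence-level model introduces errors, which is the signal the sequence-to-sequence CAPE learns to undo.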

Figure 4. The training and testing process of CAPE. (a) Each consecutive text is segmented into a set of sentences first. Then, each sentence is translated and back-translated. The resulting sentence group is concatenated again, and the CAPE, which consists of a sequence-to-sequence model, is trained to minimize the errors of these round-trip translations. (b) At test time, a trained CAPE fixes sentence-level translations by taking them as a group.

Our Proposed Method-Context-Aware NMT for Korean Honorifics
In this section, we describe our proposed approach for generating appropriate Korean honorific expressions with context-aware NMT. We propose using context-aware NMT to translate honorific-styled sentences, which can improve the translation of honorifics without the explicit control provided by special tokens. We also developed an automatic honorific labeling method for labeling the parallel corpus, enabling the evaluation of honorific translations and the preparation of training data when the system is allowed to control target honorifics as in [22]. The process of our proposed method is illustrated in Figure 5.

Using NMT with Contextual Encoder and CAPE for Honorific-Aware Translation
To capture contextual information that affects the use of Korean honorifics, our method exploits the context-aware models in two ways, as described in Section 4.
The first is an NMT model with a contextual encoder (Section 4.1), which is trained to capture the dependency between the contents of the contextual sentences of the source language and the usage of honorific expressions represented in the training data. For example, in Figure 1, the model can attend to the noun dad in context_1 to generate a translation in haeyo-che. In this way, the trained model can implicitly control the translation to generate appropriate honorific expressions according to the contextual sentences. In the experiments, we compare this approach against NMT models that explicitly control the translation honorifics through special tokens as in [22].
The second is a CAPE (Section 4.2) for improving inconsistent sentence-level translations of honorifics. As stated earlier, the CAPE is trained by recovering inconsistent round-trip translations, which requires a pretrained bidirectional sentence-level MT model. Therefore, we first train a TwoC model to translate both Korean-English and English-Korean using the same parallel corpus. Then, we sample round-trip translations from a separately constructed monolingual Korean corpus and train a CAPE to reconstruct the original Korean sentences from the sampled round-trip translations, as illustrated in Figure 4. Our CAPE model is implemented with the same Transformer model as the TwoC [4], so once the monolingual corpus and its round-trip translations are prepared, training the CAPE is similar to training a TwoC. We also apply the CAPE to improve NMT models with contextual encoders, such as HCE. Although the CAPE was originally intended to correct the errors of sentence-level MT similar to TwoC [10], it can complement NMT with a contextual encoder. Importantly, the CAPE exploits the context information of the target language, and some types of inconsistency, such as inter-sentence disagreement of honorifics, can only be identified in the target language. In the experiments, we show that the CAPE can further improve the honorific translation of HCE as well by correcting the inconsistency of honorifics between sentences.

Figure 5. The process of our method, context-aware NMT for Korean honorifics. First, we train an NMT model with a contextual encoder for English-Korean and Korean-English translation. Then, we train a CAPE to correct errors in the round-trip translations made by the NMT model. The automatic honorific labeling is primarily used for assessing honorific translation but can also be used to label the training set if the NMT model uses special tokens to control target honorifics explicitly.

Scope of Honorific Expressions
Our work focuses on the translation of addressee honorifics, which is a key factor in determining whether a sentence is in honorific style. Of the six types of sentence endings in Table 1, haeyo-che and hapsio-che are usually considered honorific styles, used frequently by age- or rank-subordinates speaking to superiors [8,22]. Thus, we consider sentences having these two types of endings as honorific sentences, while the others are non-honorific sentences. The target sentence in Figure 1, "잠시만 기다려요" (jam-si-man gi-da-lyeo-yo), whose ending is haeyo-che, is an example of an honorific sentence. In contrast, "잠시만 기다리게" (jam-si-man gi-da-li-ge) is a non-honorific sentence with the same English translation according to our criteria, since its ending is hage-che.

Automatic Honorific Labeling
To assess the quality of honorific translation, we need to annotate the corpus into honorific vs. non-honorific sentences. We developed heuristics using the above criteria to label Korean sentences with their honorific styles.
As illustrated in Figure 6, we first segment sentences into morphemes and obtain their part-of-speech (POS) tags. This ensures that our heuristic can correctly identify the proper sentence ending. In our implementation, the Kkma Korean tagger [37] is used to extract morphemes and POS tags. Once morphemes and POS tags are extracted, we then select the eomi (어미), which is the sentence ending. We pick morphemes whose tag starts with 'EF' (http://kkma.snu.ac.kr/documents/index.jsp?doc=postag accessed on 1 May 2021) in our implementation. We label sentences as honorific if their eomi is hapsio-che or haeyo-che. In some cases where the morpheme tagger fails to extract word endings, we resort to sub-string matching with sentence-ending markers such as '?' or '.' to correctly extract the proper sentence ending.
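A minimal sketch of this labeling heuristic is shown below. It assumes a POS tagger (such as Kkma) that returns (morpheme, tag) pairs with sentence-final endings tagged 'EF*'; the suffix checks for haeyo-che (-요) and hapsio-che (-니다) are an illustrative simplification of the actual ending criteria.

```python
def is_honorific(tagged):
    """Label a POS-tagged Korean sentence as honorific or not.

    `tagged` is a list of (morpheme, POS tag) pairs, e.g. as produced by
    the Kkma tagger; final endings (eomi) carry tags starting with 'EF'.
    """
    endings = [morph for morph, tag in tagged if tag.startswith("EF")]
    if not endings:
        # A full implementation falls back to sub-string matching with
        # sentence-ending markers such as '?' or '.' here.
        return False
    eomi = endings[-1]
    # haeyo-che endings end in -yo (요); hapsio-che endings in -(seu)bnida (니다).
    return eomi.endswith("요") or eomi.endswith("니다")

# e.g. the haeyo-che sentence "잠시만 기다려요" (mock tagger output):
honorific = is_honorific([("잠시", "MAG"), ("만", "JX"),
                          ("기다리", "VV"), ("어요", "EFN")])
```

In practice, one would drive this function with the tagger's real output per sentence and aggregate the labels over the test set.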
This heuristic is used primarily to label the test set for evaluation of our method; however, it can also be used to label the training set for training NMT models with explicit control of honorifics. In this case, the honorific label is used to generate a special token if the translation honorific of the model is controlled by a special token.

Experiments
To verify how context-aware models improve Korean honorifics in English-Korean translation, we conducted comprehensive experiments and analyses. First, we constructed an English-Korean parallel corpus with contextual sentences. Then, we trained and compared the models described in Section 4. Finally, a qualitative analysis was conducted on examples produced by our proposed method.

Dataset and Preprocessing
To the best of our knowledge, no English-Korean discourse-level or context-aware parallel corpora are publicly available. Thus, we constructed an English-Korean parallel corpus with contextual sentences. We took an approach similar to [34] by choosing to use bilingual English-Korean subtitles of movies and TV shows, because these subtitles contain many scripts with honorific expressions.
We first crawled approximately 6100 subtitle files from websites such as GomLab.com. Then, we split these files into training, development, and test sets, which consist of 5.3k, 500, and 50 files, respectively. We applied a file-based split to make sure that contextual sentences are only extracted from the same movie/episode. Unlike other datasets such as OpenSubtitles2018 [38], our subtitle files contain both English and Korean sentences, so extracting bilingual sentence pairs is straightforward; we used timestamp-based heuristics to obtain those pairs. The resulting sentence pairs are 3.0M, 28.8k, and 31.1k pairs for training, development, and test sets, respectively. Some of the raw samples from our test sets are shown in Figure 7.
The contextual sentences are selected by using the timestamp of each subtitle, which contains the start time and end time in milliseconds. We assume that the sentences contain contextual information if they appear within a short period of time before the source sentence. Specifically, the start time of a contextual sentence is within K milliseconds from the start time of the source sentence. We set K as 3000 heuristically, and the maximum number of preceding contextual sentences is 2 for all experiments except those of Section 6.4.2. The final data contains 1.6M, 155.6k, and 18.1k examples of consecutive sentences in the training, development, and test sets, respectively.
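The timestamp-based context selection above can be sketched as follows; this is a simplified version of our procedure, and the variable names are illustrative.

```python
def select_context(subtitles, i, k_ms=3000, max_ctx=2):
    """Select contextual sentences for subtitles[i].

    `subtitles` is a list of (start_time_ms, sentence) pairs sorted by
    start time. A preceding sentence is a context candidate if its start
    time is within k_ms (our K = 3000) of the source sentence's start
    time; at most `max_ctx` of the nearest candidates are kept.
    """
    src_start = subtitles[i][0]
    candidates = [sent for start, sent in subtitles[:i]
                  if src_start - start <= k_ms]
    return candidates[-max_ctx:]
```

Each training example then pairs the selected context and the source sentence with the aligned Korean target sentence.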
For monolingual data to train the CAPE, we added 2.1M Korean sentences using an additional 4029 crawled monolingual subtitles. The resulting monolingual data consist of 5.1M sentences.
We finally tokenized the dataset using the wordpiece model [5]; the vocabulary size is approximately 16.5k. We also put a special token <BOC> at the beginning of each contextual sentence to differentiate it from the source sentence.
Figure 7. Example parallel sentence pairs extracted from bilingual subtitles.
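The <BOC> marking can be implemented as a simple preprocessing step; this sketch assumes plain whitespace concatenation before wordpiece tokenization, which is an assumption about our pipeline's exact format.

```python
def build_encoder_input(context_sents, source_sent, boc="<BOC>"):
    """Prefix each contextual sentence with the <BOC> token and
    concatenate with the source sentence to form the (pre-tokenization)
    encoder input, as used by the concatenation-style (TwC) models."""
    parts = ["%s %s" % (boc, c) for c in context_sents] + [source_sent]
    return " ".join(parts)
```

For the models with separate context encoders, the same marking is applied to the context stream only.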

Model Training Details
For the NMT models, we use the same model hyperparameters, such as the size of the hidden dimensions and the number of hidden layers, as transformer-base [4], since all of the models in our experiments share the same Transformer structure. Specifically, the hidden dimension is 512, the number of layers is 6, the number of attention heads is 8, and the dropout rate is set to 0.1. These hyperparameters are also applied to the CAPE model. For NMT models with additional encoders (DAT, HCE), we share the weights of the encoders.
All models are trained with Adam [39] with a learning rate of 1e-3, and we stop training early when the loss on the development set stops improving. We trained all models from scratch with random initialization and did not pretrain them on a sentence-level task as in [22,28]. All evaluated models are implemented using the tensor2tensor framework [40].
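The hyperparameters above, collected in one place for reference; the dictionary keys are illustrative and do not correspond to actual tensor2tensor flag names:

```python
# Transformer-base settings shared by all models in the experiments.
TRANSFORMER_BASE = {
    "hidden_dim": 512,      # size of hidden dimensions
    "num_layers": 6,        # encoder/decoder layers
    "num_heads": 8,         # attention heads
    "dropout": 0.1,         # dropout rate
    "optimizer": "adam",    # Adam [39]
    "learning_rate": 1e-3,
}
```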

Metrics
We measure translation quality with BLEU [41]. For scoring BLEU, we use the t2t-bleu script (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/bin/t2t-bleu, accessed on 1 May 2021), which yields the same scores as Moses [42]. We first measure BLEU on the original translations and refer to these as normal BLEU scores. In addition, we measure tokenized BLEU scores by tokenizing the translations before scoring, a common practice in the evaluation of Korean NMT [43].
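To illustrate why tokenized BLEU matters for a morphologically rich language like Korean, consider clipped unigram precision (the 1-gram component of BLEU) on a toy pair; the Korean forms and the morpheme split below are illustrative, not output of the paper's tokenizer:

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    clipped = sum(min(n, ref_counts[tok]) for tok, n in hyp_counts.items())
    return clipped / max(len(hyp), 1)

# Untokenized, the whole eojeol must match exactly: no overlap at all.
untokenized = unigram_precision(["만났어요"], ["만났죠"])
# After a (hypothetical) morpheme split, the shared stem "만났" is credited.
tokenized = unigram_precision(["만났", "어요"], ["만났", "죠"])
```

Tokenization thus rewards partial matches between inflected forms that raw whitespace splitting would score as complete mismatches.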

Results
First, we compare overall BLEU scores and honorific accuracy among MT models with various types of contextual encoders. We then examine how these models' performance varies with the number of contextual sentences, and the effect of CAPE on improving honorific translations.

Effect of Contextual Encoders
To evaluate the effect of contextual information on the translation of Korean honorifics, we first measure the performance of context-agnostic and context-aware models. The results are summarized in Table 2. As shown, all context-aware models (TwC, DAT, HAN, and HCE) outperform the context-agnostic model (TwoC) in terms of BLEU. The HCE shows a significant English-Korean BLEU improvement over TwoC of approximately 1.07/2.03, and the TwC, DAT, and HAN also show slight improvements. Since the HCE performed best among the context-aware models in terms of BLEU, we later use the Korean-English TwoC and HCE trained in this experiment to generate round-trip translations in the CAPE experiment. For comparison, we also evaluated the models on Korean-English BLEU using the same dataset; all context-aware models again outperformed the context-agnostic model. Note that BLEU scores are lower in all English-Korean experiments than in the Korean-English experiments on the same dataset. This is mainly due to the morphologically rich nature of Korean and the domain of the dataset, which consists of spoken language.

Table 2. English<->Korean BLEU scores and accuracy (%) of honorifics for the context-agnostic (TwoC) and context-aware (TwC, DAT, HAN, and HCE) NMT models. English-Korean BLEU scores are shown as (normal/tokenized). All models are trained and tested without any honorific labels or explicit control of honorifics.

Beyond the BLEU improvements, the context-aware models also translate with correct Korean honorifics more often in English-Korean translation. In particular, the HCE improved honorific accuracy by 3.6%. Since showing politeness is considered important in Korean culture, as discussed in Section 3.1, we also focus on accuracy over the test examples whose target sentences are polite. On this subset, the TwC outperformed all other models, improving accuracy by up to 4.81% over TwoC.
The HAN and HCE also showed significant improvements over TwoC, while the DAT's accuracy is slightly lower than TwoC's. We believe these differences derive from how each model utilizes contextual information. Since we use only sequence-level cross-entropy (CE) as the training objective, the more compact contextual representations in DAT, HAN, and HCE can improve the main objective (translation quality), but attending to the raw contextual sentences, as in TwC, could be more beneficial for honorific translation.
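The honorific accuracy reported above relies on the labeling heuristic of Section 5.3. A toy version based only on sentence endings, a deliberate simplification of the actual rules, might look like:

```python
import re

# Simplified honorific-style detector: haeyo-che sentences typically end
# in 요, hasipsio-che in 니다/니까. The real heuristic in Section 5.3 is
# richer than this ending check.
HONORIFIC_ENDINGS = ("요", "니다", "니까")

def is_honorific(sentence):
    """Label a Korean sentence as honorific based on its final ending."""
    stripped = re.sub(r"[\s.?!]+$", "", sentence)  # drop trailing punctuation
    return stripped.endswith(HONORIFIC_ENDINGS)
```

Accuracy is then the fraction of model translations whose predicted label matches the label of the reference.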

All of the results in Table 2 are from models that have no explicit control of honorifics and do not use the honorific-annotated dataset. For comparison with prior work that forces the model to translate with specific honorifics, such as [22], Table 3 also reports results for NMT models that control output honorifics with special tokens on the input. In particular, the TwoC with special tokens is identical to the data labeling (DL) method of [22]. The training set was labeled in the same way as the test set, using the method described in Section 5.3. As shown in the results, both models translate almost all test examples with the same honorific style as their references, similar to the result in [22]. Interestingly, both controlled models also improve translation quality over their counterparts without control, and the HCE with special tokens again outperforms the TwoC with special tokens on BLEU.

Table 3. English-Korean BLEU scores (normal/tokenized) and accuracy (%) of honorifics for models with explicit control of honorifics via special tokens on the input. All models are forced to produce translations with the honorific style of the reference sentence.

In summary, context-aware NMT models improve not only translation quality but also honorific accuracy. While their improvements are less pronounced than those of the honorific-controlled models, they can nevertheless exploit contextual information to aid the correct translation of honorifics.
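The special-token control follows the data labeling idea of [22]: a token indicating the desired honorific style of the output is attached to the source sentence. The token strings below are illustrative, not the paper's actual vocabulary entries:

```python
def add_control_token(source_sent, honorific):
    """Prepend a style token so the decoder is conditioned on the desired
    honorific register of the output."""
    token = "<hon>" if honorific else "<non-hon>"
    return f"{token} {source_sent}"
```

At training time the token is derived from the reference's label; at test time it is set to the style the user wants.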

Effect of the Number of Contextual Sentences
The number of contextual sentences has a significant effect on model performance, since not all contextual sentences contribute to an adequate translation [44], and such redundant information can hurt performance. Because the best number depends on the model and the data, we carried out experiments to examine its effect. As shown in Table 4, both BLEU and honorific accuracy peak at 2 contextual sentences and then decline as the number increases. The other context-aware NMT models show similar behavior, as displayed in Table 5.

Translation Examples and Analysis
We show some translation examples in Figures 8 and 9. As discussed in Section 5, honorific sentences are mostly used when a subordinate, such as a child, is speaking to a superior, such as his/her parents. Figure 8 shows two examples of such situations. In (a), the context and source sentences form a conversation between a mother and her child; this can be inferred from the contextual sentences, in which the child is talking but the mom urges him/her to continue eating. The TwoC completely ignores the contextual sentences, so this situation is not taken into account. Thus, TwoC translates the source sentence in a non-honorific style using the non-honorific sentence ending 때 (ttae), which is banmal-che. In contrast, the translation of HCE is an honorific sentence, since its sentence ending is 요 (yo), which is haeyo-che, matching the reference. This example shows how HCE's context-awareness helps the translation of honorific-styled sentences.

En (Context_1): Life must go on as it always has.
En (Source): How's mom?

Figure 8. Example translations of different NMT models. The sentences are given in sequence, from context_1 to source. The reference translation of each contextual sentence is given in parentheses. In (a), a mother and her child are talking to each other. The context-aware model (HCE) can infer this situation from the contextual sentences and translate the source sentence with the appropriate honorific style. Similarly, in (b) a dad and his child are talking, but only the translation from TwC has the correct honorific style. Note that the translations of the verb sorry and the 2nd person pronoun you also differ among models, even though all the translations have the same meaning as the source sentence.
On the other hand, Daddy! in context_1 of (b) and the content of context_1 directly indicate that the source sentence is spoken by a dad's child. Despite such direct hints, HCE failed to identify the proper honorific style, producing banmal-che (해 (hae) and 어 (eo)). However, the TwC correctly translated the source sentence as an honorific sentence using haeyo-che (해요 (haeyo) and 대요 (daeyo)). Note that there are two sentence segments in the source and the translations, and the honorific style of the two segments agrees in all the model translations and the reference. One interesting observation is that TwC translated the verb sorry as 죄송하다 (joesong-hada) instead of 미안하다 (mian-hada), and the 2nd person pronoun you as 아빠 (appa; daddy) instead of 네 (ne; you) as HCE did. The former is a more polite translation and the latter is closer to the reference, so this example can be viewed as a clue that TwC's context-awareness is better than HCE's. We suggest that TwC's simple, direct use of contextual sentences can outperform HCE's abstract representation of them when the contextual sentences are simple and short.

En: My condolences. / Skip the sympathy. This is a business. / My father met with you right before he died.
Ko (Reference): 고인의 명복을 빕니다. / 조문은 필요없어요. 이건 비즈니스입니다. / 아버지께서 돌아가시기 직전에 당신을 만났죠.

Figure 9. Example of a translation made by HCE and its correction by CAPE. The second and third sentence segments are utterances of the same speaker. HCE's translations are inconsistent in honorifics, since the honorifics of the second and third segments do not agree; CAPE successfully corrects this inconsistency. Note that CAPE also fixes the subject honorification, resulting in a more polite translation, and that the underlined nouns differ among models even though all the translations have the same meaning.
Finally, Figure 9 shows how the CAPE corrects the inconsistent use of honorifics. The three sentence segments are from a scene set in a funeral home. Considering the content of the sentences, we can assume that the 2nd and 3rd segments are utterances of the same speaker. However, the honorific styles of the HCE translations do not agree: banmal-che for the 2nd segment and haeyo-che for the 3rd. CAPE corrected this inconsistency by looking at the translated Korean sentences. In addition, CAPE also amended the 3rd segment by applying subject honorification, changing the case particle of the subject (his father) from -가 (-ga) to -께서 (-kkeseo) and replacing the verb 죽기 (jukgi) with 돌아가시기 (doragasigi); both translate as died. Considering that a deceased person is generally highly honored in Korean culture, CAPE's correction yields a more polite and thus more adequate honorific-styled sentence. Although subject honorification is out of scope in this paper, this shows CAPE's ability to capture the various honorific patterns observed in the training corpus and correct translations accordingly.

Conclusions
In this paper, we have introduced context-aware NMT to improve the translation of Korean honorifics. Using contextual encoders, context-aware NMT models can implicitly capture speaker information and translate the source sentence with the proper honorific style. In addition, context-aware post-editing (CAPE) is adopted to further improve honorific translation by correcting the inconsistent use of honorifics between sentence segments. Experimental results show that our proposed method improves the translation of Korean honorifics over context-agnostic methods in both BLEU and honorific accuracy. We also demonstrated that context-aware NMT can further improve prior methods that use special tokens to control honorific translation. Qualitative analysis of sample translations supports the effectiveness of our method in exploiting contextual information to improve translations of honorific sentences. In the future, we will extend our method to other Asian languages, such as Japanese and Hindi, which also have complex and widely used honorific systems.