Named Entity Correction in Neural Machine Translation Using the Attention Alignment Map

Abstract: Neural machine translation (NMT) methods based on various artificial neural network models have shown remarkable performance in diverse tasks and have currently become mainstream for machine translation. Despite the recent successes of NMT applications, a predefined vocabulary is still required, meaning that such systems cannot cope with out-of-vocabulary (OOV) or rarely occurring words. In this paper, we propose a postprocessing method that corrects machine translation outputs using a named entity recognition (NER) model to overcome the problem of OOV words in NMT tasks. We use an attention alignment map (AAM) between the named entities of the input and output sentences, and mistranslated named entities are corrected using word look-up tables. The proposed method corrects named entities only, so it does not require retraining of existing NMT models. We carried out translation experiments on a Chinese-to-Korean translation task for Korean historical documents, and the evaluation results demonstrated that the proposed method improved the bilingual evaluation understudy (BLEU) score by 3.70 over the baseline.


Introduction
Neural machine translation (NMT) models based on artificial neural networks have shown successful results in comparison to traditional machine translation methods [1][2][3][4][5]. Traditional methods usually consist of sequential steps, such as morphological, syntactic, and semantic analyses. In contrast, NMT aims to construct a single neural network and jointly train the entire system. Therefore, NMT requires less prior knowledge than traditional methods, provided that a sufficient amount of training data is available. Early NMT models, called sequence-to-sequence models [6][7][8][9], are based on encoder-decoder architectures implemented with recurrent neural networks (RNNs) [10], such as long short-term memory (LSTM) [11] and the gated recurrent unit (GRU) [12]. The attention mechanism is usually used in RNN-based machine translation systems to handle variable-length sequences: the network generates an output vector as well as its importance, called attention, to allow the decoder to focus on the important parts of the encoder output [13][14][15]. Recently, a new NMT model called the transformer [16] was proposed, based on an attention mechanism with feedforward networks and without RNNs. With the transformer, the training time is reduced greatly thanks to the absence of recurrent connections.
One of the problems in machine translation is the lack of training data. This problem was reported by Seljan [17] and Dunder [18,19] for the automatic translation of poetry with a low-resource language pair, where the fluency and adequacy of the translation results were found to be skewed toward higher scores. The problem is especially acute for the translation of old literature, where machine translation is of great importance but reliable training data are much more difficult to obtain. The types of errors in machine translation were extensively analyzed by Brkić [20]: wrong word mappings, omitted or surplus words, morphological and lexical errors, and syntactic errors such as word order and punctuation errors. Several methods have successfully addressed these problems using transfer learning [21], contrastive learning [22], and open vocabularies [23].
Another major problem in NMT is out-of-vocabulary (OOV) words [24,25], often called the rare word problem [26,27]. The words in the training dataset are converted into indices into the word dictionary or into a predefined set of vectors, and the sequence of converted numbers or vectors is used as the input to the NMT system. When a new word that is not in the dictionary is observed, the behavior of the trained network is unpredictable because no training sentences contain the OOV word. It is almost impossible to include all such words in the dictionary because of the complexity limits required for efficient translation. One solution to this problem is subword tokenization using byte pair encoding (BPE) [27], in which unknown words are broken into reasonable subunits. Another solution is the unsupervised learning of OOV words [28]. However, most OOV words are named entities: human names, city names, and newly coined academic terms. Subword tokenization [27] and unsupervised learning [28] cannot handle such named entities because the entities do not contain any meaningful subword information. As a solution, conventional systems use special labels for such OOV words (often "UNK") and include them in the training data [24][25][26] so that the NMT model can distinguish them from ordinary words. Table 1 shows examples of translation outputs with an "UNK" symbol. The first named entity in the first example, "李周鎭," is mistranslated into "이진," although the expected output is "이주진." The second named entity, "元景淳," is not translated but replaced with an "UNK" symbol, because the true translation, "원경순," is an OOV or rarely occurring word for the trained NMT model. There are many similar cases among the subsequent named entities.
There have been several attempts to build open-vocabulary NMT models to deal with OOV words. Ling et al. [29] used a sub-LSTM layer that takes a sequence of characters and produces a word embedding vector; in the decoding process, another LSTM cell generates words character by character. Luong and Manning [25] proposed a hybrid word-character model that adopts a sub-LSTM layer to use character-level information when it finds unknown words in both the encoding and decoding steps. Although character-based models show a translation quality comparable to word-based models and achieve open-vocabulary NMT, they require far more training time than word-based models: if words are split into characters, the sequence lengths increase to the number of characters, so the model complexity grows significantly. There are other character-based approaches, such as those using convolutional neural networks [30,31]. However, it is hard to apply fully character-based models directly to Korean, because a Korean character is formed by combining consonants and a vowel. Luong et al. [26] augmented a parallel corpus to allow NMT models to learn the alignments of "UNK" symbols between the input and output sentences. However, this method is difficult to apply to language pairs with extremely different structures, such as English-Chinese, English-Korean, and Chinese-Korean. Luong [26] and Jean [24] effectively addressed "UNK" symbols in translated sentences. However, mistranslated words, which often appear for rare input words, were still not considered.
In this work, we propose a postprocessing method that corrects mistranslated named entities in the target language using a named entity recognition (NER) model and an attention alignment map (AAM) between the input and output sentences, obtained from the attention mechanism (to the best of our knowledge, first proposed by Bahdanau et al. [13]). The proposed method can be applied directly to pretrained NMT models that use an attention mechanism by appending the postprocessing step to their output, without retraining the existing NMT models or modifying the parallel corpus. Our experiments on the Chinese-to-Korean translation task for historical documents, the Annals of the Joseon Dynasty (http://sillok.history.go.kr/main/main.do, last access date: 1 July 2021), demonstrate that the proposed method is effective. In a numerical evaluation, the proposed method improved the bilingual evaluation understudy (BLEU) score [32] by up to 3.70 over the baseline without postprocessing. Our work is available in a Git repository (https://bitbucket.org/saraitne76/chn_nmt/src/master/, last access date: 1 July 2021).

English Translation
The king went to Yeonghuijeon and performed rites, and then went to Jeogyeonggung, Sokseonggung, Yeonhogung, and Seonhuigung and performed rites.

English Translation
Conducted a So Dae and lectured on Myungshinism.
The remainder of the paper is organized as follows: Section 2 provides a review of the conventional machine translation and NER methods that are related to the proposed method. The named entity matching using the attention alignment map that forms the core of the current study is introduced in Section 3, along with the implementation details of the transformer and the proposed NER algorithm. Section 4 describes a series of experiments that were carried out to evaluate the performance of the proposed NER method. In Section 5, the output of the NER results is further analyzed, and Section 6 concludes the paper.

Neural Machine Translation
NMT maps a source sentence to a target sentence with neural networks. In a probabilistic representation, the NMT model is required to map a given source sentence $X = [x_1\, x_2 \cdots x_n] \in \mathbb{B}^{v_s \times n}$ to a target sentence $Y = [y_1\, y_2 \cdots y_m] \in \mathbb{B}^{v_t \times m}$, where $\mathbb{B} = \{0, 1\}$ is the binary domain, $v_s$ and $v_t$ are the source (input) and target (output) vocabulary sizes, and $n$ and $m$ are the sequence lengths of the input and output sentences, respectively. A vocabulary is usually defined as a set of tokens, the minimum processing units for natural language processing (NLP) models. From a linguistic point of view, words or characters are the most popular units for the tokens, depending on the grammar of the source and target languages. Each token is assigned a positive integer index that uniquely identifies it in the corresponding vocabulary, so we can construct source vectors $x_k \in \mathbb{B}^{v_s}$ by the following one-hot representation:
$$x_{k,i} = \begin{cases} 1 & \text{if the $k$th token has vocabulary index } i, \\ 0 & \text{otherwise,} \end{cases}$$
where $x_{k,i}$ is the $i$th element of $x_k$. We can construct a one-hot representation for the target vector $y_k$ in a similar manner. The one-hot representation is extremely sparse, and for large vocabulary sizes the dimensions of the input and target vectors, $v_s$ and $v_t$, may become too large to handle. The embedding method, a general approach in natural language processing, is introduced to produce dense vector representations of the one-hot encoding vectors [33][34][35][36]. For given source and target embedding dimensions $d_s$ and $d_t$, with $d_s \ll v_s$ and $d_t \ll v_t$, linear embeddings from the high-dimensional binary space to a lower-dimensional real domain are defined as follows:
$$\tilde{x}_k = E_s x_k \in \mathbb{R}^{d_s}, \qquad \tilde{y}_k = E_t y_k \in \mathbb{R}^{d_t},$$
where $\mathbb{R}$ is the real number space and $E_s \in \mathbb{R}^{d_s \times v_s}$ and $E_t \in \mathbb{R}^{d_t \times v_t}$ are the source and target embedding matrices, respectively.
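As a concrete illustration of the one-hot representation and linear embedding described above, the following NumPy sketch encodes a three-token sentence and maps it to a dense representation. The toy vocabulary, dimensions, and matrix values are invented for illustration only:

```python
import numpy as np

# Hypothetical toy vocabulary; sizes are illustrative only.
vocab = {"王": 0, "命": 1, "曰": 2}
v_s, d_s = len(vocab), 2     # vocabulary size and embedding dim (d_s << v_s in practice)

def one_hot(token):
    """Sparse one-hot column vector x_k in B^{v_s}."""
    x = np.zeros(v_s)
    x[vocab[token]] = 1.0
    return x

# Source embedding matrix E_s in R^{d_s x v_s}; random here, learned in a real model.
rng = np.random.default_rng(0)
E_s = rng.normal(size=(d_s, v_s))

sentence = ["王", "命", "曰"]
X = np.stack([one_hot(t) for t in sentence], axis=1)   # one-hot matrix, B^{v_s x n}
X_dense = E_s @ X                                      # dense representation, R^{d_s x n}

# The dense column for a token is exactly the corresponding column of E_s.
assert np.allclose(X_dense[:, 0], E_s[:, vocab["王"]])
```

Because each column of `X` is one-hot, the matrix product simply selects embedding columns, which is why this linear transformation behaves as a look-up table.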
Applying the linear embedding in (2), dense representations of the source and target sentences are obtained by multiplying $E_s$ and $E_t$ with $X$ and $Y$: $\tilde{X} = E_s X$ and $\tilde{Y} = E_t Y$. This linear transformation is one of the Word2Vec methods [33]. In our paper, we use this embedding for all input and target one-hot vectors.
The target of machine translation is to find a mapping that maximizes the conditional probability p(Y|X). The direct approximation of p(Y|X) is intractable due to the high dimensionality, so most recent NMT models are based on an encoder-decoder architecture [34]. The encoder reads an input sentence $\tilde{X}$ in its dense representation and encodes it into an intermediate, contextual representation $C = \mathrm{Encoder}(\tilde{X})$,
where "Encoder" is a neural network model for deriving the contextual representation. After the encoding process, the decoder starts generating the translated sentence. At the first decoding step, it takes the encoded contextual representation C and the "START" symbol, which marks the start of the decoding process, and generates the first translated token. The token generated previously is then fed back into the decoder, which produces the next token based on the previously generated tokens and the contextual representation C. This decoding process is repeated recursively until an "EOS" symbol, which denotes the end of the sentence, is generated. The decoding process can be formulated by the following Markovian equation:
$$p(y_j \mid \{y_1, \ldots, y_{j-1}\}, X) = g(\{y_1, \ldots, y_{j-1}\}, C),$$
where $j$ is the index of the symbol to be generated, $y_i$ is the $i$th symbol, and $g(\cdot)$ is the decoding step function, which generates the conditional probability of $y_j$ given the previous outputs $\{y_1, \ldots, y_{j-1}\}$ and the encoder output of the input sequence. Figures 1 and 2 illustrate the frameworks of the "sequence-to-sequence with attention mechanism" model (seq2seq) [13] and the transformer model [16], the NMT models used in the proposed method.
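The recursive decoding loop described above can be sketched as follows. The step function `g` is stubbed out with a fixed token sequence, and the START/EOS token ids are illustrative assumptions, not the paper's actual vocabulary:

```python
# Minimal sketch of the recursive decoding loop with a stubbed step function g.
START, EOS = 0, 3   # illustrative special-token ids

def g(prev_tokens, C):
    """Stub for the decoding step: returns the next token id given the
    previously generated tokens and the contextual representation C.
    A real decoder would return a probability distribution over the vocabulary."""
    # Emit tokens 1, 2, then EOS, ignoring C (placeholder behavior).
    return [1, 2, EOS][len(prev_tokens) - 1]

def decode(C, max_len=10):
    tokens = [START]
    while len(tokens) < max_len:
        y_j = g(tokens, C)
        if y_j == EOS:          # "EOS" terminates the recursion
            break
        tokens.append(y_j)      # feed the generated token back in
    return tokens[1:]           # drop the START symbol

print(decode(C=None))           # -> [1, 2]
```

The loop structure (feed back the last token, stop on EOS) is the same whether the decoder is an LSTM or a transformer; only `g` changes.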

Conventional Named Entity Recognition
There have been many studies of named entity recognition (NER) based on recurrent neural networks (RNNs) [37,38]. Similar to neural machine translation, the input is a sequence of tokens represented as dense vectors, $\tilde{X} = [\tilde{x}_1\, \tilde{x}_2 \cdots \tilde{x}_n]$, and the output is a sequence of binary labels indicating which tokens are named entities, $t = [t_1\, t_2 \cdots t_n] \in \mathbb{B}^n$, so the length of the output equals that of the input sequence. An example target encoding is shown in Table 1: each word in the "Truth" and "NMT output" rows is underlined if it is a named entity, and in those cases the target label is one. The objective of named entity recognition is to find the sequence that maximizes the posterior probability of $t$ given the input,
$$t^* = \arg\max_{t} p(t \mid \tilde{X}),$$
where $t^*$ is the optimal NER result. Recently, a novel model for NER based only on attention mechanisms and feedforward networks achieved state-of-the-art performance on the CoNLL-2003 NER task [39].
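A minimal sketch of per-token label prediction follows, using invented posterior scores. Note that this takes independent per-token argmaxes for simplicity; a real system (and the CRF used later in this paper) decodes the jointly most probable tag sequence instead:

```python
import numpy as np

# Invented per-token posteriors p(t_i | X~) for a 5-token sentence,
# with binary labels (column 1 = part of a named entity).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.3, 0.7],
                  [0.95, 0.05],
                  [0.6, 0.4]])

# Independent per-token argmax approximation of t* = argmax_t p(t | X~).
t_star = probs.argmax(axis=1)
print(t_star.tolist())   # -> [0, 1, 1, 0, 0]
```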

AAM in Sequence-to-Sequence Models
In this subsection, we describe the seq2seq model [13] used in our method and explain how to obtain an AAM from it. The seq2seq model consists of an LSTM-based [11] encoder-decoder and an attention mechanism. The encoder encodes a sequence of input tokens $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)$, represented as dense vectors, into a fixed-length context vector $c$. We used a bidirectional LSTM (BiLSTM) [40][41][42][43] as the encoder to capture the bidirectional context information of the input sentences.
$$\overrightarrow{h}_i = \overrightarrow{f}(\tilde{x}_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{f}(\tilde{x}_i, \overleftarrow{h}_{i+1}),$$
where $\overrightarrow{f}$ and $\overleftarrow{f}$ are stacked unidirectional LSTM cells. The hidden states at the last encoding step of both directions, $\overrightarrow{h}_n$ and $\overleftarrow{h}_1$, are concatenated to obtain $c \in \mathbb{R}^{2d}$. Stacked unidirectional LSTM cells are used for the decoder. Once the encoder produces the context vector $c$, the bottom LSTM cell of the decoder is initialized with $c$.
where $s_j \in \mathbb{R}^{2d}$ are the hidden states of the bottom LSTM cell in the decoder and subscript $j$ denotes the index of the decoding step. Next, the decoder starts the decoding process: each step computes the probability of the next token from three components: the previously generated token $y_{j-1}$, the current hidden state $s_j$, and an attention output vector $o_j$. The attention output allows the decoder to retrieve the hidden states of the encoder. The attention scores
$$e_{ij} = v_a^\top \tanh(W_a s_{j-1} + U_a h_i + b_a)$$
indicate how related $s_{j-1}$ is to $h_i$. Here, $W_a \in \mathbb{R}^{d_a \times 2d}$, $U_a \in \mathbb{R}^{d_a \times 2d}$, $v_a \in \mathbb{R}^{d_a}$, and $b_a \in \mathbb{R}^{d_a}$ are trainable parameters, where $d_a$ is the hidden dimension of the attention mechanism. The attention weights are computed by the softmax function across the attention scores, $\alpha_{ij} = \exp(e_{ij}) / \sum_{i'=1}^{n} \exp(e_{i'j})$, and the attention output $o_j = \sum_{i=1}^{n} \alpha_{ij} h_i$ is a weighted sum of the hidden states of the encoder $\{h_1, \ldots, h_n\}$. It tells the decoder where to focus on the input sentence when generating the next token.
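The additive attention computation above can be sketched with NumPy. All dimensions are illustrative, and random values stand in for trained weights and hidden states:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_a = 4, 3, 5                  # input length, LSTM hidden size, attention dim

H = rng.normal(size=(n, 2 * d))      # encoder hidden states h_1..h_n (BiLSTM concat -> 2d)
s_prev = rng.normal(size=2 * d)      # previous decoder state s_{j-1}

# Trainable attention parameters (random placeholders here).
W_a = rng.normal(size=(d_a, 2 * d))
U_a = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=d_a)
b_a = rng.normal(size=d_a)

# Scores e_ij = v_a^T tanh(W_a s_{j-1} + U_a h_i + b_a) for each encoder position i.
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h + b_a) for h in H])
alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax -> attention weights
o_j = alpha @ H                                     # weighted sum of encoder states

assert np.isclose(alpha.sum(), 1.0) and o_j.shape == (2 * d,)
```

The weight vector `alpha` is exactly one column of the attention alignment map described in the next paragraph.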
In the seq2seq model, an attention alignment map (AAM) A ∈ R n×m , where n and m represent the sequence lengths of the input and output sentences, can be easily computed by stacking up the results of (14) while the model generates the translation. Figure 1 illustrates the framework of the seq2seq model used in this paper.

Transformer
In this subsection, we describe the transformer model [16] used in our method and explain how to obtain an AAM from it. The transformer also consists of an encoder and a decoder. Unlike the seq2seq model, it introduces a position encoding [16,44] to add positional information into the model, because it does not have any recurrent units that would model positional information automatically.
Here, $PE(k) \in \mathbb{R}^{d_{model}}$ is the position encoding vector corresponding to position $k$, defined by the sinusoidal functions of [16], $PE(k)_{2i} = \sin\left(k/10000^{2i/d_{model}}\right)$ and $PE(k)_{2i+1} = \cos\left(k/10000^{2i/d_{model}}\right)$, where $d_{model}$ is the dimensionality of the model; the encoding is applied at the positional indices $i$ and $j$ of the input and output sentences, respectively. The positional encoding vectors are added to the sequence of input tokens $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)$ represented as dense vectors as in (2). The sums of the embedding vectors and positional encodings, $x' = (x'_1, x'_2, \ldots, x'_n) \in \mathbb{R}^{d_{model} \times n}$, are fed into the bottom encoding layer.
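Assuming the standard sinusoidal form from Vaswani et al. [16], the position encoding can be sketched as:

```python
import numpy as np

def positional_encoding(k, d_model):
    """Sinusoidal position encoding PE(k) in R^{d_model}:
    even dimensions use sin, odd dimensions use cos."""
    pe = np.zeros(d_model)
    pos = np.arange(d_model // 2)
    pe[0::2] = np.sin(k / 10000 ** (2 * pos / d_model))
    pe[1::2] = np.cos(k / 10000 ** (2 * pos / d_model))
    return pe

# PE(0) is [sin(0), cos(0), ...] = [0, 1, 0, 1, ...]
print(positional_encoding(0, 4))   # -> [0. 1. 0. 1.]
```

Because each dimension oscillates at a different wavelength, nearby positions get similar vectors while distant positions get dissimilar ones, which is what lets an RNN-free model recover word order.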
The encoder is a stack of encoding layers, where each encoding layer is composed of a self-attention layer and a feedforward layer. The self-attention layer of the bottom encoding layer takes $x'$, and the others receive the outputs of the encoding layer directly below them. Self-attention layers allow the model to refer to other tokens in the input sequence.
The "multihead scaled dot-product attention" proposed by Vaswani et al. [16] is computed as
$$\mathrm{Attention}(Q_h, K_h, V_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^\top}{\sqrt{d_k}}\right) V_h,$$
where $Q_h$, $K_h$, and $V_h$ are linear transformations of the layer input, $d_k$ is the dimensionality of the keys, and the outputs of the attention heads $h$ are concatenated and linearly transformed again.
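A minimal NumPy sketch of a single scaled dot-product attention head follows (the multihead variant runs several such heads in parallel and concatenates their outputs); shapes are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Computes softmax(Q K^T / sqrt(d_k)) V with a row-wise softmax.
    Returns the attended values and the attention-weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention weights, rows sum to 1
    return A @ V, A

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, A = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8) and np.allclose(A.sum(axis=1), 1.0)
```

The returned matrix `A` is the per-head attention map that the proposed method later averages into an AAM.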
The outputs of the self-attention layers pass through the feedforward network, where each position is processed independently and identically: $\mathrm{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2$. The final output of the encoder is considered the contextual representation $c = (c_1, c_2, \ldots, c_n) \in \mathbb{R}^{d_{model} \times n}$, as in (5). It is fed into the encoder-decoder attention layers of the decoder. The decoder is a stack of decoding layers, where each decoding layer consists of a self-attention layer, an encoder-decoder attention layer, and a feedforward network. By analogy to the encoder, the bottom decoding layer takes the sum of the embedding vectors and the positional encoding, $y' = (y'_1, y'_2, \ldots, y'_m) \in \mathbb{R}^{d_{model} \times m}$, as in (17), and the others receive the outputs of the decoding layer directly below them. Self-attention layers in the decoder are similar to those in the encoder; however, each position can only attend to earlier positions, so the model cannot attend to tokens not yet generated in the prediction phase. An encoder-decoder attention layer receives the contextual representation $c$ and the output $z \in \mathbb{R}^{d_{model} \times m}$ of the self-attention layer below it in the decoder, as in Figure 2.
This layer helps the decoder concentrate on the proper context in the input sequence when generating the next token. For every sublayer, a residual connection [45] and layer normalization [46] are applied. Although we did not annotate layer indices on the trainable parameters, the layers do not share them. There is one encoder-decoder AAM $A_h^{enc\text{-}dec}$ per attention head $h$ in each decoding layer. To obtain $A \in \mathbb{R}^{n \times m}$, we averaged them across the layers $l$ and the attention heads $h$. Figure 2 illustrates the framework of the transformer model in our study. The input Chinese characters "以李", if translated directly, can be mapped to the Korean "이을". However, because the decoder embeds context information, the output of the transformer becomes "이진을".
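The reduction of the per-layer, per-head encoder-decoder attention maps to a single AAM is a simple average; the following sketch uses invented dimensions and random placeholder maps:

```python
import numpy as np

rng = np.random.default_rng(3)
L, H, n, m = 6, 8, 10, 12     # layers, heads, input length, output length (illustrative)

# One encoder-decoder attention map per (layer, head); random placeholders here.
maps = rng.random(size=(L, H, n, m))
maps /= maps.sum(axis=2, keepdims=True)   # normalize over input positions per output token

A = maps.mean(axis=(0, 1))                # average over layers and heads -> R^{n x m}
assert A.shape == (n, m)
```

Averaging keeps each column a proper distribution over input positions, so the resulting map can be read the same way as the seq2seq AAM.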

NER Model
The detection of named entities in the input Chinese sentence is required to improve the quality of the translation. The NER model used in our study is based on a stacked BiLSTM [40][41][42] and a conditional random field (CRF) [47,48]. We considered each Chinese character a token and assigned a tag to each token, using the IOB tagging scheme [47]. As shown in Figure 3, each input character was given a label composed of one or two tags according to its membership in a named entity. The first tag is one of I, O, or B, for the inside, outside, or beginning of a named entity word, respectively: the B-tag marks the beginning character of a named entity, the I-tag denotes a character inside the named entity other than the first, and the O-tag means that the corresponding character is not inside a named entity. In our implementation, there were four types of named entities: Person, Location, Book, and Era; this type information is attached to the B- and I-tags. Therefore, the NER model is asked to assign one of nine tags to each token.
The nine tags are O, BP, BL, BB, BE, IP, IL, IB, and IE, where BP, BL, BB, and BE are the B-tags for Person, Location, Book, and Era, respectively, and IP, IL, IB, and IE are the I-tags for the same four named entity types. Table 2 shows an example of the input and output of the NER model. The NER model receives n Chinese tokens (characters) and predicts n named entity tags. A named entity such as "楊口縣" can be extracted by taking the characters of the Chinese input from the index of B-Location to the index of the last I-Location. To separate consecutive named entities, we used the B-tag and I-tag together: if we only classified whether a character is within a named entity or not, it would be impossible to separate "江原道楊口縣" into "江原道" and "楊口縣." Characters not belonging to named entities are labeled with the single tag "O," meaning "outside" the named entities. The first character of a named entity word is assigned the "B"-tag, meaning the "beginning" of the named entity, and all other characters of the named entity word are assigned the "I"-tag ("inside"). To each of the first tags of the named entities, "B" and "I," an extra tag from {P, L, B, E} is concatenated according to the type of the named entity, {Person, Location, Book, Era}, respectively. As in the NMT model, a Chinese sentence $x = (x_1, x_2, \ldots, x_n) \in \mathbb{B}^{v_s \times n}$ represented as one-hot encoding vectors is converted into dense vector representations $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n) \in \mathbb{R}^{d_s \times n}$ using the embedding method [33][34][35][36], as in (2). Next, $\tilde{x}$ is fed into the BiLSTM sequentially, and the BiLSTM captures bidirectional contextual information from the input sequence, as in (8) and (9).
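Extracting entity spans from a predicted IOB tag sequence can be sketched as follows; the helper function and the long-form tag strings are illustrative, not the paper's exact implementation:

```python
def extract_entities(tokens, tags):
    """Collect (entity_string, type) spans from IOB tags such as
    'B-Location', 'I-Location', 'O'. A 'B-' tag always starts a new span,
    which is how consecutive entities like 江原道 / 楊口縣 stay separate."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                         # close any open span first
                entities.append(("".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:  # continue the open span
            current.append(tok)
        else:                                   # 'O' closes any open span
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

tokens = list("江原道楊口縣")
tags = ["B-Location", "I-Location", "I-Location",
        "B-Location", "I-Location", "I-Location"]
print(extract_entities(tokens, tags))
# -> [('江原道', 'Location'), ('楊口縣', 'Location')]
```

Note how the second "B-Location" splits "江原道楊口縣" into two entities, which a plain inside/outside labeling could not do.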
Hidden state $h_i \in \mathbb{R}^{2d}$, the output of the BiLSTM, is a concatenation of the hidden states of both directions, $\overrightarrow{h}_i \in \mathbb{R}^{d}$ and $\overleftarrow{h}_i \in \mathbb{R}^{d}$, where $i$ is the token time step and $d$ is the number of hidden units of the top LSTM cell. A linear transformation layer and a CRF [47,48] layer are applied to $h = (h_1, h_2, \ldots, h_n)$, and the CRF layer predicts a named entity tag for each input token $x_i$. Here, $W_N \in \mathbb{R}^{2d \times d_t}$ and $b_N \in \mathbb{R}^{d_t}$ are trainable parameters, where $d_t = 9$ is the number of tag classes.
Finally, we can extract a list of named entities from the combination between the input sentence and the predicted tags. Figure 4 illustrates the NER framework used in our study.

Named Entity Correction with AAM
In Table 1, we can see that the mistranslated words in the output of the NMT model correspond to named entities in the input sentences. The reason is that these named entities are OOV words or occur rarely in the training corpus, so the NMT system cannot model them well. In this section, we describe, through an example, the proposed method for correcting mistranslated words in the output sentences.
First, the NMT model translates a given Chinese sentence to a Korean sentence. In Table 3, it cannot accurately predict named entities that are names of persons. Second, the NER model finds named entities in the given Chinese sentence. In Table 4, red-colored words denote the named entities found by the NER model.
Third, we computed the AAM $A \in \mathbb{R}^{n \times m}$ from the NMT model using (14) and (28), where n and m are the sequence lengths of the input and output sentence, respectively. Figure 5 shows examples of the attention alignment map. Each element $a_{ij}$ of $A$ measures how strongly input token $x_i$ is related to output token $y_j$. Fourth, we took the row vectors of the AAM corresponding to the indices of the Chinese named entities. Figure 6 illustrates a part of the AAM. In this example, the indices of the Chinese named entity "李周鎭" are 2, 3, and 4, so we took the row vectors $a_2$, $a_3$, and $a_4$, where $a_2 = (a_{21}, a_{22}, \ldots, a_{2m})$. Fifth, these rows were summed column-wise to obtain a single alignment vector, and the index of the Korean token aligned with the Chinese named entity was found by the arg max function: $\hat{j} = \arg\max_j (a_{2j} + a_{3j} + a_{4j})$, where $\hat{j}$ is the index of the Korean token "이진" aligned with the Chinese named entity "李周鎭." The NER matching results are shown in Table 5.
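Steps four and five, selecting the entity rows of the AAM, summing them column-wise, and taking the arg max, can be sketched as follows with an invented toy AAM:

```python
import numpy as np

# Toy AAM A in R^{n x m} for a 6-token input and 5-token output; values invented.
A = np.array([[0.70, 0.10, 0.10, 0.05, 0.05],
              [0.10, 0.60, 0.10, 0.10, 0.10],
              [0.10, 0.50, 0.20, 0.10, 0.10],
              [0.10, 0.40, 0.30, 0.10, 0.10],
              [0.05, 0.10, 0.10, 0.70, 0.05],
              [0.05, 0.10, 0.10, 0.10, 0.65]])

entity_rows = [1, 2, 3]                 # indices of the Chinese named-entity tokens
a_bar = A[entity_rows].sum(axis=0)      # column-wise sum of the selected rows
j_hat = int(a_bar.argmax())             # index of the aligned Korean token
print(j_hat)   # -> 1
```

Summing before taking the arg max lets a multi-character Chinese entity vote jointly for a single Korean token, even when no single row is decisive.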
Repeating the above process, we can align all Chinese named entities found by the NER model with the Korean tokens in the sentence translated by the NMT model.
Table 5. Korean tokens aligned with the Chinese named entities. The underlined words are named entities. Among those words, red-colored ones are human names, blue-colored ones are place names, and green-colored ones are book names.

Sixth, we assumed that the Korean token $y_{\hat{j}}$ was mistranslated. Finally, the aligned Korean tokens were replaced with direct translations of the corresponding Chinese named entities from the look-up table. If the look-up table does not contain the named entity, an identity copy of the Chinese named entity is an appropriate alternative. Figure 7 shows the correction of the named entities in the translation results using the look-up table. The corrections are: "이진 (Lee Jin)" ⇒ "이주진 (Lee Joo Jin)", "UNK" ⇒ "원경순 (Won Kyung Soon)", and "윤주 (Yoon Joo)" ⇒ "윤경주 (Yoon Kyung Joo)". The subscripted, parenthesized numbers are the alignment indices found by the proposed method.
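The final replacement step can be sketched as follows; the look-up table, token list, and helper function are hypothetical illustrations of the procedure:

```python
# Hypothetical look-up table mapping Chinese named entities to Korean readings.
lookup = {"李周鎭": "이주진", "元景淳": "원경순"}

def correct(tokens, alignments):
    """Replace each aligned Korean token with the look-up translation of its
    Chinese named entity; fall back to an identity copy of the Chinese
    entity when it is missing from the table."""
    out = list(tokens)
    for entity, j in alignments:
        out[j] = lookup.get(entity, entity)
    return out

korean = ["이진", "을", "평안", "감사", "로", "UNK", "를"]
alignments = [("李周鎭", 0), ("元景淳", 5)]   # (entity, aligned index j-hat)
print(correct(korean, alignments))
# -> ['이주진', '을', '평안', '감사', '로', '원경순', '를']
```

Because only the aligned positions are touched, the rest of the NMT output, and hence the pretrained model, is left unchanged.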

Experiments
We evaluated our approach on the Chinese-to-Korean translation task. The Annals of the Joseon Dynasty were used as the parallel corpus for our experiments. We compared the results for two cases: with and without the proposed postprocessing. We used this parallel corpus to train our NMT models. To simulate real-world situations, we split the records according to the time they were written: records from 1413 to 1623 formed the training corpus, and records from 1623 to 1865 formed the evaluation corpus. The training and evaluation corpora contained 230 K and 148 K parallel articles, respectively. Because the articles vary extremely in length, we only used articles whose Chinese and Korean token sequences were both shorter than 200. Figure 8 shows histograms of the sequence lengths of the Korean-Chinese parallel corpus; the Chinese-Korean pairs in the top 5% by length were omitted from the histograms. For the Chinese sentences, the mean sequence length was 112.87 and the median was 54; for the Korean sentences, the mean was 124.56 and the median was 56. For Chinese (input), no tokenization was used: we simply split each Chinese sentence into a sequence of characters, because each Chinese character has its own meaning. For Korean (output), we used an explicit segmentation method [49] to split each Korean sentence into a sequence of tokens. After filtering, the number of articles was 168 K for training and 113 K for evaluation.
For the NER model, we used the same corpus: the Annals of the Joseon Dynasty. The annotation of the Chinese named entities for this corpus is publicly available (https://www.data.go.kr/dataset/3071310/fileData.do, last access date: 1 July 2021). Table 6 shows an analysis of the Chinese NER corpus. Approximately 7.5% of the characters belong to named entities, and the most frequent named entity type is Person.

Models
The Description column in Table 7 describes the model architectures used in our experiments. For the seq2seq model, Description means (embedding size, hidden units of the encoder cells, number of stacked encoder cells, hidden units of the decoder cells, number of stacked decoder cells). The embedding matrices for the source and target tokens were both pretrained with the word2vec algorithm [50], using only the training parallel corpus. The encoder is a stacked BiLSTM, and the decoder is a stacked unidirectional LSTM. During training, dropout [51,52] was applied to the outputs and states of the LSTM cells. Once training was complete, beam-search decoding with a beam width of four was used to generate the translation that maximized the conditional probability of the sequence. For the transformer model, Description in Table 7 represents (hidden size, number of hidden layers, number of heads, FFN filter size). To avoid overfitting, dropout [52] was used between the layers during training. As with the seq2seq model, beam-search decoding was implemented with a beam width of four. The NER model used in this study was the BiLSTM-CRF model. Specifically, the embedding size was 500, and the embedding matrix was pretrained with the word2vec algorithm [50] using only the training dataset. Each LSTM cell had 512 hidden units, and two cells were stacked. Here, we also used dropout [51,52] in the learning phase.

Experimental Results
To evaluate our NER models, we introduced two types of F1-score: entity form and surface form [53]. The entity form is a conventional measurement calculated at the entity level, while the surface form evaluates the ability of NER models to find rare entity words. In Table 8, the lexicon used in the dictionary search was extracted only from the training corpus. The NER model used in the experiment was a two-stack LSTM model. Table 7 shows how the performance of the proposed method improved depending on the type of NMT model (seq2seq or transformer), the number of trainable parameters of the model, and the output (Korean) vocabulary size. Our experiments showed that the proposed method was effective regardless of these factors, with the BLEU scores improving by 2.60 to 3.70. The experimental results in Table 9 show that the proposed approach successfully corrected mistranslated named entities in the output of the NMT model. Table 9. Named entity correction using the proposed method. Baseline: outputs of the seq2seq model. Proposed: results of our approach. Named entities are underlined. Human names are in red; place names in blue; book names in green.

English Translation
Lee Joo Jin is assigned as the Pyeongan inspector, Won Kyung Soon as the vice dictator, Yun Gyeong Joo as the dictator.

English Translation
The secret royal inspectors Lee Yun Myeong, Kim Mong Shin, and Lee Woo Gyeom were dispatched to investigate various provinces.

English Translation
The king went to Yeonghuijeon and performed rites, and then went to Jeogyeonggung, Sokseonggung, Yeonhogung, and Seonhuigung and performed rites.

English Translation
Conducted a So Dae and lectured on Myungshinism. (Baseline: 소대를 행하고 UNK를 강하였다.)

Discussion
We found that the proposed method has several strengths and weaknesses. As for the strengths, the proposed method does not require retraining of existing NMT models, and it can be applied directly to NMT models without modifying the model architecture. It is suitable for any language pair. Moreover, it has a low computational complexity because of the small vocabulary it requires. As for the weaknesses, the proposed method does not work when the predictions of the NER model are wrong. Additionally, tokens that should not be changed may be "corrected" if the alignment is not proper. The proposed method also needs a look-up table to work properly. Table 10 shows examples of these weaknesses. In the first example, the NER model could not find a named entity in the source sentence, so the UNK for "거려청" was not corrected. The UNK for "심경" was also not corrected: although the NER model recognized the token "必經" as a named entity, our look-up table did not contain "必經." In the final example, the token "하" in the baseline was changed because the attention alignment map was not accurate. Table 10. Weaknesses of the proposed method. Source: input sentence. Truth: ground truth. Baseline: outputs of the NMT models. Proposed: results of our approach. NER output: outputs of the NER model. Underlined words: named entities. Green-colored tokens: book names; blue-colored tokens: place names.

English Translation
The king went to Georyeocheong and conducted a So Dae. The king ordered the subjects to read Shim Gyung.

Conclusions
Even NMT models that show state-of-the-art performance on multiple machine translation tasks are still limited when dealing with OOV and rarely occurring words. We found that this problem is particularly relevant for the translation of historical documents, which contain many named entities. In this paper, we proposed a postprocessing approach to address this limitation. The proposed method corrects the machine translation output using an NER model and the attention alignment map: the NER model finds named entities in the source sentence, and the attention map aligns the located named entities with tokens in the translated sentence. We then assumed that the tokens aligned with the source named entities were mistranslated and replaced them using the look-up table or an identity copy. Experiments with various target vocabulary sizes in Section 4 demonstrated that our method is effective for translating historical documents from Chinese to Korean. Using the proposed method, the machine translation performance improved by up to 3.70 BLEU (35.83 to 39.53) for the seq2seq translation models and by up to 3.17 (33.90 to 37.07) for the transformer models. Moreover, there was no BLEU score degradation due to the proposed method. The proposed method can be applied to any existing NMT model that uses the attention mechanism, without retraining the model, as long as an NER model exists for the source language. Our method can therefore be applied not only to Chinese-to-Korean translation but also to other language pairs; in future work, we plan to explore this direction.