Korean Historical Documents Analysis with Improved Dynamic Word Embedding

Abstract: Historical documents refer to records or books that provide textual information about the thoughts and consciousness of past civilisations, and therefore, they have historical significance. These documents are used as key sources for historical studies as they provide information over several historical periods. Many studies have analysed various historical documents using deep learning; however, studies that employ changes in information over time are lacking. In this study, we propose a deep-learning approach using improved dynamic word embedding to determine the characteristics of 27 kings mentioned in the Annals of the Joseon Dynasty, which contains a record of 500 years. The characteristics of words for each king were quantified based on dynamic word embedding; further, this information was applied to named entity recognition and neural machine translation. In experiments, we confirmed that the proposed method showed better performance than other methods. In the named entity recognition task, the F1-score was 0.68; in the neural machine translation task, the BLEU4 score was 0.34. We demonstrated that this approach can be used to extract information about diplomatic relationships with neighbouring countries and the economic conditions of the Joseon Dynasty.


Introduction
Historical documents, besides being old texts, carry considerable information, including observations of ideology and phenomena; this information can be used for reconstructing the past. Most research on such documents is performed via a close reading of a small number of documents [1][2][3][4]. These attempts have allowed us to understand the meaning of a large corpus of historical documents and identify their patterns, thereby helping us discover new information or reconfirm known facts [5,6]. Developments in these related technologies have improved the possibility of analysing larger collections of historical documents.
Historical documents generally maintain long-term records; for example, the Journal of the Royal Secretariat contains approximately 300 years of records from 1623 to 1910; similarly, the Ming Shilu provides nearly 300 years of records from 1368 to 1644. Such historical documents have been analysed to determine information related to specific periods or, as a longitudinal study, to long periods of time. For such analyses, it is necessary to identify characteristics while considering changes over time, because the meaning and usage of words can vary over time. For example, the word 'apple' was initially used to refer to a fruit, but it is now frequently used to refer to electronics or other products related to the company 'Apple' or the 'iPhone'. Therefore, knowledge about the changing meaning of words is an important factor for deciphering historical documents written over long periods of time. Word embeddings have been studied for various languages, such as Portuguese [7] and Lithuanian [8], but these studies are limited in their analysis of the passage of time. Several researchers have focused on changes over time [9,10]; however, they consider only a span of decades, so their work is not suitable for understanding and analysing changes in word meanings over a long period of time. Therefore, we aim to capture semantic changes over time in historical documents.
We utilise a representative Korean historical record, i.e., the Annals of the Joseon Dynasty (AJD). The AJD, which is inscribed on the UNESCO Memory of the World register (http://www.unesco.org/new/en/communication-andinformation/memory-of-the-world/register/full-list-of-registered-heritage/registered-heritagepage-8/the-annals-of-the-choson-dynasty/), is an extensive historical document that contains a considerable amount of information related to politics, economy, culture, society, and weather during the Joseon Dynasty; it spans over 500 years and covers the reigns of 27 kings. The AJD comprises 50 million characters in the 1893 volumes of the 888 books; it is a detailed and comprehensive historical document. The original text has now been digitised; in addition, a version translated by experts is available.
The AJD has been used in research in the fields of politics, culture, meteorology, and medicine [11][12][13][14][15][16][17]. However, these analyses were performed over the entire period of 500 years, thereby making it difficult to differentiate between the specific characteristics or ideologies of each king. The AJD covers the longest continual period of a single dynasty compared to other historical documents. We used improved dynamic word embedding (DWE) to capture the semantic changes in the historical document. The semantic changes were quantified based on the embedding vectors obtained from the DWE. This information was then used to improve the performance of named entity recognition (NER) and neural machine translation (NMT) of historical documents. The entire process is illustrated in Figure 1. The contributions of this study are listed below:

• We proposed an improved DWE using factorised embedding parameterization to identify the temporal change in the meaning of words in historical documents. We analysed the words after converting them to dense vectors using the improved DWE; we identified the change in the relationship between countries over time and the taxation structure that varied based on the king. As a result, the improved DWE reflected each period better than DWE alone.

• We confirmed the effectiveness of the improved DWE by incorporating it into tasks such as NER. With the proposed method, we were able to improve the F1-score by 3% to 7%. The improved DWE helps us identify the change in object name information or the usage of words for each king; further, the integration of this information with the NER model helped enhance the performance.

• We found that applying the parameters obtained from the NER model integrated with the improved DWE enhanced the effectiveness of historical document translation. With the proposed method, we were able to improve the BLEU score by 2% to 8%.

Related Work
The analysis of historical documents requires considerable resources because most of them are very large. Machine learning, text mining, and natural language processing methods have developed rapidly in recent years, and these methods are now being used to analyse and understand the meaning of large-scale text data such as online communities [18][19][20][21] and social media [22]. Likewise, many studies focusing on historical documents use deep learning or machine learning at various stages, for example during the digitisation of historical documents [23][24][25], the automatic extraction of information using methods such as topic modelling [1,26], and the analysis of the extracted information [2].
The AJD has been investigated in many studies. The data on the decisions of the king were collected and used to analyse the decision-making pattern of the kings [3,4]; the weather records were used to analyse the weather conditions for that specific period [11,12,17]. Further, the AJD has been used in many other cases, including the analysis of celestial motion and infections [13,16,27]. However, most of these studies perform the analysis by extracting information from a specific period or by applying the same definition to the entire period. Although these methods are effective when analysing the entire document, they are limited in identifying variability over time.
Many researchers have attempted to incorporate temporal information in words. Commonly used word embedding methods, such as Word2Vec [28] or GloVe [29], are not effective in using word information over time, as they are trained on the entire document. To overcome this problem, DWE techniques have been proposed; they build on distributional semantics, where learning is performed by comparing the co-occurrence frequencies of words over time and by using an existing word embedding technique [30][31][32]. However, many of these techniques require a large amount of data for each period, which prevents the application of DWE to historical documents, as the amount of accumulated information for each time period in historical documents is irregular [33].
To address this problem, we present an improved DWE, which is based on dynamic Bernoulli embedding [33]; the performance was improved by incorporating factorised embedding parameterization. Thus, the improved DWE allowed us to capture semantic changes. In addition, the improved DWE enhanced the performance of the NER and NMT tasks used for analysing historical documents.

Methodology
This study used the improved DWE to numerically analyse the semantic changes for each period; then, based on the results, the NER model effectively classified objects such as persons and organisations found in the text. The effectiveness of translating historical documents was further improved by using parameters trained through the NER model in the NMT.

Dynamic Word Embedding
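The word sequence of the entire document is denoted by $(x_1, \ldots, x_N)$, and the size of the vocabulary is denoted by $V$. Assuming that the occurrence of each word follows a Bernoulli distribution, the data point $x_{iv}$ indicates whether word $v$ of the vocabulary occurs at position $i$, i.e., $x_{iv} \in \{0, 1\}$. It is assumed that $c_i$ is the set of positions in the neighbourhood of position $i$ and $x_{c_i}$ is the collection of data points indexed by these positions. Further, an embedding vector $\rho_v \in \mathbb{R}^K$ and a context vector $\alpha_v \in \mathbb{R}^K$ are assigned for each index $(i, v)$, and, following dynamic Bernoulli embedding [33], the conditional distribution of $x_{iv}$ is given as

$$x_{iv} \mid x_{c_i} \sim \mathrm{Bern}\left(\sigma(\eta_{iv})\right),$$

where $\sigma(\cdot)$ is the sigmoid function and $\eta_{iv}$ denotes the log odds assigned through dynamic Bernoulli embedding; the formula is given by

$$\eta_{iv} = \rho_v^{(t_i)\top} \sum_{j \in c_i} \sum_{v'=1}^{V} \alpha_{v'} x_{jv'},$$

where $t_i$ is the time slice of position $i$ and $\rho_v^{(t)}$ is the embedding vector of word $v$ in time slice $t$. A zero-centred Gaussian random walk was used as the prior on the embeddings, as in dynamic Bernoulli embedding [33].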
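As an illustration of the quantities defined above, the following minimal sketch computes the log odds and the Bernoulli probability for one target word, together with the (unnormalised) log density of one Gaussian random-walk step. It follows the general form of dynamic Bernoulli embedding [33]; the toy vectors, the precision `lam`, and all function names are assumptions made for this example and are not taken from the study.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(rho_t, alpha, context_ids, v):
    """eta_iv: dot product of rho_v (for one time slice) with the summed context vectors."""
    k = len(rho_t[v])
    context_sum = [sum(alpha[j][d] for j in context_ids) for d in range(k)]
    return sum(r * c for r, c in zip(rho_t[v], context_sum))


def random_walk_log_prior(rho_prev, rho_curr, lam):
    """Unnormalised log density of rho_curr ~ N(rho_prev, I / lam)."""
    return -0.5 * lam * sum((c - p) ** 2 for p, c in zip(rho_prev, rho_curr))


# Toy setting (hypothetical values): V = 3 words, K = 2 dimensions, 2 time slices.
rho = {
    0: [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # embeddings in time slice 0
    1: [[0.8, 0.2], [0.0, 1.0], [0.5, 0.5]],  # embeddings in time slice 1
}
alpha = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]]  # context vectors shared over time

eta = log_odds(rho[0], alpha, context_ids=[1, 2], v=0)
p = sigmoid(eta)  # probability that word 0 occurs given its context
drift = random_walk_log_prior(rho[0][0], rho[1][0], lam=10.0)
print(eta, p, drift)
```

The random-walk term penalises large jumps of a word's embedding between adjacent time slices, which is what allows periods with little data to borrow strength from their neighbours.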
We adopted the factorised embedding parameterization used in ALBERT [34]. This method has two benefits: (i) The number of embedding parameters is reduced from $O(V \times H)$ in the existing method to $O(V \times D + D \times H)$ in the new method (here, $H$ denotes the configured embedding dimension and $D$ is a dimension smaller than $H$). (ii) The comprehension of context is improved, as converting words into dense vectors through an additional layer increases the training signal for each word, and passing the vectors through a shared layer allows the model to capture interactions among words. Factorised embedding parameterization is applied to the log odds $\eta_{iv}$ of the dynamic Bernoulli embedding.
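The parameter saving in (i) can be checked with a short calculation. The sketch below is only an illustration of the general idea of a factorised lookup (a $V \times D$ table followed by a $D \times H$ projection); the sizes are borrowed from the experimental setup (vocabulary of 24,000, dimensions 256 and 512), but mapping them to $V$, $D$ and $H$ is an assumption, and the function names are hypothetical.

```python
def embedding_params(vocab_size, hidden_dim, low_dim=None):
    """Number of embedding parameters with or without factorisation."""
    if low_dim is None:
        return vocab_size * hidden_dim                      # V x H
    return vocab_size * low_dim + low_dim * hidden_dim      # V x D + D x H


def factorised_lookup(token_id, lookup, projection):
    """Look up a D-dimensional row, then project it to H dimensions."""
    low = lookup[token_id]
    hidden = len(projection[0])
    return [sum(low[d] * projection[d][h] for d in range(len(low)))
            for h in range(hidden)]


# Assumed sizes for illustration only.
V, H, D = 24_000, 512, 256
full = embedding_params(V, H)
fact = embedding_params(V, H, D)
print(full, fact)

# Tiny worked lookup: D = 2, H = 3.
toy_lookup = [[1.0, 2.0], [0.0, 1.0]]
toy_projection = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
vec = factorised_lookup(0, toy_lookup, toy_projection)
print(vec)
```

Under these assumed sizes the factorised table needs roughly half as many parameters as the full one; the saving grows as $D$ becomes smaller relative to $H$.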

Named Entity Recognition & Neural Machine Translation
In addition to inputting each word as an embedding vector, we input additional information into the model. To reflect the information for each period more effectively, information about the 27 kings appearing in the AJD was added. After producing the embedding vector for each king with the same dimension as that of the DWE vectors, we use a bilinear function to transform them into vectors with identical dimensions. Here, the kings' embedding is $K_{emb} = W_k^\top k_i + b_k$, where $k_i$ is the information of the king for the $i$th context, $W_k \in \mathbb{R}^{27 \times d_e}$, and $b_k \in \mathbb{R}^{d_e}$; $A \in \mathbb{R}^{d_e \times d_e}$ is the bilinear parameter, and $d_e$ is the pre-designated embedding dimension. Non-linearity is added more effectively when such bilinear functions are used. We applied the transformer architecture [35] to the NER task. Following BERT [36], we used only the encoder part of the transformer architecture: the inputs were passed sequentially through the transformer encoder and a fully connected layer to perform NER.
The decoder of the transformer architecture is used for the NMT. During training, the bilinear function trained for NER, the values processed by the transformer encoder, and the Korean embedding vectors were used as inputs. This helps determine the effect of using NER as pre-training for contextualised word embedding. The entire procedure is visualised in Figure 2.
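To make the role of the kings' information more concrete, the sketch below builds a king embedding from a one-hot king indicator and evaluates a bilinear form between a word vector and that embedding. Several points are assumptions made for this example: that the king information is a one-hot vector of length 27, that the bilinear function takes the scalar form $x^\top A y$, and all numerical values and function names. How the bilinear output is combined with the word representation inside the model is not reproduced here.

```python
def king_embedding(king_index, W_k, b_k):
    """A one-hot king vector selects one row of W_k (27 x d_e); the bias is then added."""
    return [w + b for w, b in zip(W_k[king_index], b_k)]


def bilinear(x, A, y):
    """Generic bilinear form x^T A y (assumed form, see lead-in)."""
    return sum(x[r] * A[r][c] * y[c]
               for r in range(len(x)) for c in range(len(y)))


NUM_KINGS, D_E = 27, 3  # d_e = 3 is a toy size chosen for readability

# Hypothetical parameters; in practice these would be learned.
W_k = [[0.1 * (k + 1)] * D_E for k in range(NUM_KINGS)]
b_k = [0.0, 0.5, 1.0]
A = [[1.0 if r == c else 0.0 for c in range(D_E)] for r in range(D_E)]

word_vec = [1.0, 2.0, 3.0]           # stands in for a DWE vector
k_emb = king_embedding(0, W_k, b_k)  # embedding of the first king
score = bilinear(word_vec, A, k_emb)
print(k_emb, score)
```

With an identity matrix for `A` the bilinear form reduces to a dot product; a learned `A` lets every dimension of the word vector interact with every dimension of the king embedding.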

Dataset
To train the deep learning model to analyse historical documents, we used the AJD data. We crawled the original text of the AJD, the object name data labelled by experts, and the data translated by experts. We collected a total of 310,000 paired sentences, with an average length of 72 characters for the original (Hanja) sentences and 149 characters for the translated (Korean) sentences. Four types of object names were collected for the NER: person, organization, book, and time. The Hanja data were parsed with the character as the unit, as in previous studies [37,38]. In addition, we tokenised each Korean sentence using byte pair encoding (BPE) [39] as provided by Google's SentencePiece [40] library (https://github.com/google/sentencepiece). For stable training, only sentences with lengths under 300 characters were used for both Hanja and Korean; these accounted for 95% of the entire data. The preprocessed data were divided into training, validation, and test sets with a ratio of 8:1:1. The validation and test sets consist of 30,000 paired sentences each. In the experiments, the results on the test data were obtained with the hyperparameters that showed the best performance on the validation data.
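The length filter and the 8:1:1 split described above can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the study: the shuffling, the random seed, the synthetic sentence pairs, and the function names are all assumptions.

```python
import random

MAX_LEN = 300  # sentences must be shorter than this many characters


def filter_pairs(pairs, max_len=MAX_LEN):
    """Keep (hanja, korean) pairs in which both sides are under max_len characters."""
    return [(h, k) for h, k in pairs if len(h) < max_len and len(k) < max_len]


def split_8_1_1(pairs, seed=0):
    """Shuffle (an assumption) and split into train/validation/test with ratio 8:1:1."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_val = len(items) // 10
    n_test = len(items) // 10
    n_train = len(items) - n_val - n_test
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Synthetic stand-in data: repeated characters of varying length.
pairs = [("漢" * (i % 400 + 1), "가" * (i % 350 + 1)) for i in range(1000)]
kept = filter_pairs(pairs)
train, val, test = split_8_1_1(kept)
print(len(kept), len(train), len(val), len(test))
```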

Experimental Setup
We used four RTX-2080 graphics processing units to train the models. We used eight layers in the encoder of the transformer for the NER task, and another eight layers in the decoder of the transformer for the NMT task. Further, we used the AdamW optimizer [41]. The learning rate was initially set to $5 \times 10^{-5}$ and $1 \times 10^{-6}$ for the NER and NMT tasks, respectively. We used a warmup learning rate scheduler [42] with 12,000 warmup iterations, followed by a linear decay scheduler. In deep neural network training, multiplication by large weights can lead to an excessive update step, which can cause the algorithm to diverge. Therefore, to avoid such divergence, we used gradient clipping [43] and set the maximum norm to 5. To prevent overfitting, we used dropout [44] at a rate of 0.3. All embedding dimensions were 256, and the other dimensions were 512. For the NMT, we used BPE for sub-word segmentation, and the vocabulary size for the BPE was set to 24,000.
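The learning rate schedule and the gradient clipping described above can be sketched in a few lines. This is an illustrative sketch only: the linear shape of the warmup and the total number of training steps (`total_steps`) are assumptions, since only the 12,000 warmup iterations, the linear decay, and the maximum norm of 5 are stated.

```python
import math


def lr_at(step, base_lr, warmup_steps=12_000, total_steps=100_000):
    """Linear warmup to base_lr, then linear decay to zero at total_steps (assumed)."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


def clip_by_norm(grad, max_norm=5.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


ner_lr = 5e-5  # initial learning rate for the NER task
for step in (0, 6_000, 12_000, 56_000, 100_000):
    print(step, lr_at(step, ner_lr))

clipped = clip_by_norm([30.0, 40.0])  # norm 50 is rescaled to norm 5
print(clipped)
```

Clipping by the global norm keeps the direction of the update unchanged and only limits its length, which is why it prevents single oversized steps without biasing the gradient direction.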

Analysis of Dynamic Word Embedding
The performance of the proposed improved DWE was evaluated using quantitative and qualitative tests. For the quantitative test, the model was trained with the pseudo log likelihood [45] as the objective, as in a previous study [33]. The loss decreased over the iterations until it converged, as shown in Figure 3a. The pseudo log likelihood of the data $x$ can be separated into a positive term and a negative term:

$$L_{pos} = \sum_{i=1}^{N} \sum_{v=1}^{V} x_{iv} \log \sigma(\eta_{iv}), \qquad L_{neg} = \sum_{i=1}^{N} \sum_{v \in S_i} \log\left(1 - \sigma(\eta_{iv})\right),$$

where $\sigma(\cdot)$ is the sigmoid function, $S_i$ is the set of negative samples drawn for position $i$, and $\eta_{iv}$ is the same as in (5).
$L_{pos}$ indicates how well the model predicts the target word from the context, and $L_{neg}$ indicates how well the model rejects the negative samples given the context. In Figure 3b,c, both values were found to roughly converge, although $L_{neg}$ was more unstable than $L_{pos}$. This demonstrates that the proposed method was trained satisfactorily in numerical terms.
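The two terms of the pseudo log likelihood can be computed as in the following minimal sketch, which assumes the usual negative-sampling form used with Bernoulli embeddings [33]. The input lists of log odds are hypothetical values; in practice they would come from the embedding model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pseudo_log_likelihood(eta_pos, eta_neg):
    """Return (L_pos, L_neg) for observed words and negative samples."""
    l_pos = sum(log_sigmoid(e) for e in eta_pos)
    # log(1 - sigmoid(e)) equals log(sigmoid(-e))
    l_neg = sum(log_sigmoid(-e) for e in eta_neg)
    return l_pos, l_neg


# Hypothetical log odds: observed target words and sampled negatives.
eta_observed = [2.0, 0.5, 0.0]
eta_negative = [-1.5, 0.0]
l_pos, l_neg = pseudo_log_likelihood(eta_observed, eta_negative)
print(l_pos, l_neg, -(l_pos + l_neg))  # the last value is the loss to minimise
```

Both terms are at most zero, so a model that assigns high log odds to observed words and low log odds to negative samples drives the loss toward zero.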
Qualitative tests were performed so that the evaluation of the word embedding was not restricted to numerical aspects. We determined the nearest neighbours of an embedding vector and investigated how these neighbours changed over time. The words nearest to the given words for the 1st, 9th, 18th, and 27th kings were compared, as summarised in Table 1. For a more effective and convenient analysis, we applied the proposed improved DWE to both Hanja and Korean texts. As mentioned above, the Hanja and Korean texts were segmented based on characters and sub-words, respectively, and the outputs for the two languages were segmented on the same basis. All distances between words were calculated using the Euclidean distance.
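The nearest-neighbour comparison across periods can be illustrated with the short sketch below. The two-dimensional vectors are invented solely to show the mechanics and do not reproduce the embeddings or the results of this study; the period labels and the function name are likewise hypothetical.

```python
import math


def nearest(target, vectors, top_n=2):
    """Words closest to `target` by Euclidean distance within one period."""
    dists = [(math.dist(vectors[target], vec), word)
             for word, vec in vectors.items() if word != target]
    return [word for _, word in sorted(dists)[:top_n]]


# Invented toy embeddings for two periods (not results from the study).
by_period = {
    "early": {"japan": [0.0, 0.0], "salt": [0.1, 0.0],
              "war": [2.0, 2.0], "usa": [3.0, 0.0]},
    "late": {"japan": [2.0, 1.9], "salt": [0.1, 0.0],
             "war": [2.0, 2.0], "usa": [2.5, 1.9]},
}

neighbours = {period: nearest("japan", vecs)
              for period, vecs in by_period.items()}
print(neighbours)
```

Because each period has its own set of embedding vectors, the same query word can return different neighbours per period, which is the effect examined in Table 1.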
For the words relating to Japan, most of the nearest words in the early Joseon Dynasty were associated with fishery, such as 'fishnet', 'sea salt', and 'salt'. However, the distances to words such as '虜', which means slave or prisoner of war, decreased after the Imjin War and the Eulsaneukyak. Historically, the Joseon Dynasty was influenced more by the Ming Dynasty than by Japan [46], and toward the later period, the influence of war against Japan was the strongest. These shifts are consistent with the historical record, which suggests that the proposed method captures period-specific changes in meaning.
Through the Meiji Restoration, Japan actively accepted Western culture from countries such as England and the USA. Consistent with this, the results of the analysis showed that the word 'Japan' became closer to the words 'England' and 'USA' in the later Joseon Dynasty. The strongest power in China changed from the Ming Dynasty in the early Joseon Dynasty to the Qing Dynasty during the later Joseon Dynasty. In addition, most words near 'mine' represent place names, which are useful for inferring the location of mines in each period. In terms of 'tax', most taxes in the early Joseon Dynasty were levied on land, while such taxes were often replaced by labour in the middle period, and taxes on fisheries accounted for most of them in the later periods.

Results of Named Entity Recognition and Neural Machine Translation
To evaluate the performance of the proposed method, various word embeddings were applied before the NER task; Table 2 summarises the results. In the table, W2V represents the Word2Vec [28] method; DW2V is the dynamic word embedding from Yao et al. (2018) [32]; and DBE is the dynamic Bernoulli embedding from Rudolph et al. (2017) [33]. The * mark indicates that the method adopted a bilinear function to include information about the kings.
When the word embedding was pre-trained using the skip-gram method in Word2Vec (i.e., without DWE), the F1-score was 0.61, which was the worst performance among the compared methods, all of which used DWE. The proposed method, which combines DWE with factorised embedding parameterization, showed a higher F1-score than the existing word embedding methods. This implies that static word embedding is limited in reflecting temporal information; further, it demonstrates that adding temporal information through factorised embedding parameterization and the bilinear function enhances the performance in various tasks.
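As a consistency check on the reported scores, the F1-score can be recomputed from precision and recall as their harmonic mean. The sketch below uses the precision and recall values of the three complete rows of Table 2; because those values are themselves rounded to three decimals, the recomputed F1-scores may differ from the reported ones in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from Table 2.
rows = {
    "W2V": (0.581, 0.637, 0.607),
    "DW2V": (0.627, 0.640, 0.633),
    "DW2V *": (0.582, 0.665, 0.620),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{name}: recomputed {recomputed:.4f}, reported {reported:.3f}")
```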
Further, for the NMT, the proposed method was compared with a GRU-based [47] model and a transformer-based model. The GRU model used the seq2seq [48] encoder-decoder structure with an attention mechanism [49], and its hyperparameters were set to the same values as those of the proposed model. Three metrics (BLEU4 [50], METEOR [51], and ROUGE-L [52]) were used in the test; the results are summarised in Table 3.
Table 2. Results of the proposed NER method. The test results were evaluated with the parameter set that achieved the best validation F1-score. W2V represents the Word2Vec [28] method; DW2V is the dynamic word embedding from Yao et al. (2018) [32]; and DBE is the dynamic Bernoulli embedding from Rudolph et al. (2017) [33]. The * mark indicates that the method adopted a bilinear function to include information about the kings.

Method         Precision   Recall   F1-Score
W2V [28]       0.581       0.637    0.607
DW2V [32]      0.627       0.640    0.633
DW2V * [32]    0.582       0.665    0.620
DBE [33]       0

Table 3 indicates that the proposed method achieved higher performance than the other methods on historical documents. The results show that our model performs better than a model trained specifically for the NMT task, which demonstrates that our approach enhances NMT performance. Table 4 summarises the translation results for the test dataset; the first sentences were translated accurately, even though most of them contain place names. This is attributed to the fact that the application of the NER improved the translation of texts related to names of places, and therefore, vectors pre-trained through NER are effective for learning the representation of historical documents.

Predicted (Eng.): King went to Jongmyo and Gyeongmogung and worshiped them, and the Prince followed them to celebrate.

Discussion
Historical documents can offer insight to readers in any country, not only in the country where they were produced. However, the proposed method has only been used to translate Hanja into Korean and has not yet been used to translate into other languages such as English, French, or Chinese. Our method adopted the best-performing approach for translating Hanja into Korean, but it may not be as effective for other languages, where the Poisson distribution or the multinomial distribution may be more appropriate than the Bernoulli distribution.
Since historical documents are records written in the past, there is a limit to further data collection. We used dynamic word embedding to reflect the historical background of historical documents; however, to use dynamic word embedding, we must have accurate information on when each document was written and a sufficient amount of data for each period.

Conclusions & Future Work
This paper proposed an improved DWE technique to quantify semantic changes in historical documents; the performance of the proposed technique was evaluated via application to the AJD. The proposed technique revealed the semantic changes, and it was demonstrated that such information can be used for the NER and NMT tasks, which enhanced the performance in tasks related to historical documents.
The NER achieved an F1-score of 0.68 by combining the improved DWE with information about the kings through a bilinear function. The NMT achieved a BLEU4 score higher than that of previous models (by 0.02) by adding the information from the NER model trained on the improved DWE.
The proposed method can be applied to other historical documents such as the Journal of the Royal Secretariat, for which the translation remains incomplete owing to the vast amount of information it contains (over four times that of AJD), and the Ming Shilu.
In future work, we plan to build multi-lingual models with diverse word distributions. We also plan to conduct further studies on a general model for historical documents that incorporates regularisation techniques such as augmentation [53][54][55][56][57], which will enable the exploration of results based on interactions. The proposed methods are expected to reduce the cost of analysing and comprehending historical documents.

Figure 1. Overview of the proposed workflow of dynamic word embedding (DWE), named entity recognition (NER), and neural machine translation (NMT) in historical documents.

Figure 2. Overview of the proposed model for the NER and NMT tasks.

Table 1. Results of the proposed improved DWE method: the words closest to the target words, ordered by period of the Joseon Dynasty. The notation * denotes a location and ‡ denotes a job position. 'lang' denotes the language; (o) denotes the original (Hanja or Korean), and (e) denotes English.

Table 3. Results of the translation task. "Transformer (from scratch)" represents the model trained only on the machine translation task; "Ours" represents the model whose encoder was trained using NER and the bilinear function and whose decoder was trained only on the NMT task.