Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

Understanding voluminous historical records provides clues on the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize historical records, since most of the documents are not written in a modern language and parts of their contents are damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records that are among the most voluminous in the world. Experimental results show that our approach significantly improves the accuracy of the translation task compared to baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of our translated results via topic modeling, uncovering several significant historical events.


Introduction
Historical records are invaluable sources of information on the lifestyles and scientific knowledge of our ancestors. Humankind has learned how to handle social and political problems by learning from the past. Historical records also serve as evidence of the intellectual accomplishments of humanity over time. Given such importance, there have been substantial nationwide efforts to preserve these records. For instance, UNESCO protects world heritage sites, and experts from all around the world have been converting and restoring historical records in digital form for long-term preservation. A representative example is the Google Books Library Project 1 . However, despite their importance, it has been challenging to properly utilize the records for the following reasons. First, nontrivial portions of the documents are partially damaged and unrecognizable due to unfortunate historical events or environments, such as wars and disasters, as well as the weak durability of paper documents. These factors make it difficult to translate and understand the records. Second, as most of the records are written in ancient and outdated languages, it is difficult for non-experts to read and understand them. Thus, for in-depth analysis, it is crucial to recover the damaged parts and properly translate them into modern languages.
To address these issues in historical records, we formulate them as language modeling tasks, specifically restoration and neural machine translation, by leveraging advanced neural networks. Moreover, we apply topic modeling to the translated historical records to efficiently discover important historical events over the last several hundred years. In particular, we utilize two representative Korean historical records: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariat (hereafter referred to as AJD and DRS, respectively). These records, which contain 50 million and 243 million characters respectively, are recognized as the largest historical records in the world. Considering their high value, UNESCO listed them in the Memory of the World register. 2,3 These two historical corpora cover five hundred years from the fourteenth century to the early twentieth century. In detail, AJD consists of administrative affairs along with national events, and DRS contains events that occurred around the kings of the Joseon Dynasty. These corpora are valuable as they contain diverse information, including international relations and natural disasters. In addition, the contents of the records are objective, since the writing rules, enforced by an independent institution, were so strict that political intervention, even from the kings, was not allowed.
Although DRS contains a much larger amount of information than AJD, only 10-20% of DRS has been translated into modern Korean by a few dozen experts over the last twenty years. The complete translation of DRS is currently expected to take an additional 30-40 years if only human experts continue to translate it. Applying neural machine translation models to these historical records raises several issues. First, pre-trained models for Chinese are not suitable for DRS and AJD, mainly because of the differences between Hanja and the Chinese language. In the past, Korean historiographers borrowed Chinese characters to write down the sentences spoken by Koreans. As a result, diverse characters were modified or created, and considerable grammatical differences exist between the Chinese language and Hanja. Furthermore, several parts of those records are damaged and require restoration, as shown in Fig. 2. Therefore, these damaged parts should be restored in order to translate them correctly. To address these issues, we propose a model suitable for the historical documents based on the self-attention mechanism.
Overall, we propose a novel multi-task approach to restore the damaged parts and translate the records into a modern language. Afterward, we extract meaningful historical topics from the world's largest historical records, as shown in Fig. 1. This study makes the following contributions:
• We design a model based on the self-attention mechanism with multi-task learning to restore and translate the historical records. Results demonstrate that our methods are effective in restoring the damaged characters and translating the records into a modern language.
• We translate all the untranslated sentences in DRS. We believe that this dataset will be invaluable for researchers in various fields. 4
• We present a case study that extracts meaningful historical events by applying topic modeling, highlighting the importance of analyzing historical documents.

Related Work
This work broadly incorporates three different tasks: machine translation, document restoration, and document analysis. Accordingly, this section describes studies related to neural machine translation, the restoration of damaged documents, and the analysis of historical records.

Neural Machine Translation
Recently, neural machine translation (NMT) has achieved outstanding results. Based on the encoder-decoder architecture, the attention mechanism (Bahdanau et al., 2015) significantly improves the performance of NMT by calculating the target context vector at the current time step via dynamically combining the encoding vectors of source words. Self-attention-based networks (Vaswani et al., 2017) consider the correlations among all word pairs in the source and target sentences. Based on the success of self-attention networks, Transformer architectures for language modeling have been proposed, showing state-of-the-art performance (Devlin et al., 2019; Radford et al., 2019). In particular, pre-training approaches further improve performance, since they train the model robustly on several tasks using a large document corpus. In addition, lightweight models, such as ALBERT (Lan et al., 2019), have been proposed to reduce the model size while preserving performance. However, as most of the recent approaches focus on pre-training with documents written in a modern language, no such pre-trained model exists for historical datasets. Therefore, we adopt a lightweight model in the same manner as ALBERT to efficiently restore and translate millions of documents. Regarding the translation of historical documents, several studies attempt to translate ancient Chinese documents into the modern Chinese language (Zhang et al., 2019b). However, as they mainly attempt to translate archaic characters into the modern language using a paired corpus, they do not fully utilize the unpaired corpus. Therefore, we improve the performance of machine translation for historical corpora through multi-task learning with the translation and restoration tasks, which fully utilizes both the paired and unpaired corpora.

Restoration of Historical Documents
Unfortunately, many characters in the historical records are damaged or misspelled. As shown in Fig. 2, the damaged parts are prevalent in DRS, which significantly degrades the quality of subsequent translation tasks. To address this problem, several studies focus on normalizing misspelled words (Tang et al., 2018; Domingo and Nolla, 2018), and others further apply language modeling to restore parts of the documents via deep neural networks (DNNs) (Caner and Haritaoglu, 2010; Assael et al., 2019).
Recently, the Cloze-style approach of machine reading comprehension (masked language modeling; MLM) predicts the original tokens at positions where words in the original sentence have been randomly chosen and masked or replaced (Hermann et al., 2015). Several studies significantly improve model performance by pre-training the model via this Cloze-style approach. By utilizing the MLM approach with the self-attention mechanism and large-scale training datasets, numerous models improve the performance of various downstream tasks including NMT (Baevski et al., 2019; Devlin et al., 2019; Zhang et al., 2019a; Conneau and Lample, 2019; Liu et al., 2019c; Clark et al., 2019). However, to our knowledge, few studies apply such an MLM approach to restore the damaged parts of documents.
Motivated by these studies, we design our model using masked language modeling based on the self-attention architecture to recover the damaged documents while considering their contexts.

Analysis on Historical Records
Various studies apply machine learning approaches to analyze historical records (Zhao et al., 2014; Kumar et al., 2014; Mimno, 2012; Oh, 2015, 2018). In addition, researchers adopt neural networks, such as convolutional neural networks and autoencoders, for page segmentation and optical character recognition to convert the historical records into digital form (Chen et al., 2017; Clanuwat et al., 2019). Given such digital-form records, analysts utilize topic modeling to discover historically meaningful events (Yang et al., 2011).
Especially, using the translated AJD, researchers have discovered historical events such as magnetic storm activities (Yoo et al., 2015; Hayakawa et al., 2017), meteors (Lee et al., 2009), and solar activities (Jeon et al., 2018). In political science, researchers have analyzed the decision patterns of the royal family in the Joseon Dynasty (Oh, 2015, 2018), and related questions have also been investigated (Ki et al., 2018). However, existing studies mainly rely on the documents translated by human experts. Therefore, we first translate the documents in AJD and DRS. Afterward, we apply topic modeling approaches to mine meaningful historical events over the large-scale data.

Proposed Methods
This section describes a multi-task learning approach based on the Transformer networks to effectively restore and translate the historical records.
The overview of our model is shown in Fig. 3. The AJD and DRS datasets consist of Hanja sentences $H = \{h_1, \ldots, h_N\}$ and Korean sentences $K = \{k_1, \ldots, k_N\}$, where each Korean sentence is translated from its corresponding Hanja sentence. Here, Hanja refers to the Chinese characters borrowed to write the Korean language in the past. Notably, DRS contains additional Hanja sentences $\bar{H} = \{h_{N+1}, \ldots, h_M\}$ that have not been translated yet. Hence, we have in total $M$ Hanja sentences in the Hanja corpus $\hat{H} = H \cup \bar{H}$ and $N$ Korean sentences in the Korean corpus $K$.
Considering the properties of AJD and DRS, we design a multi-task learning approach with document restoration and machine translation, based on Transformer networks. As shown in Fig. 3, our model consists of embedding and output layers for Hanja and Korean, and three Transformer modules: the shared encoder, the restoration encoder, and the translation decoder. The restoration encoder is dedicated to the restoration task. The translation decoder is used for translating Hanja sentences into modern Korean sentences, and the shared encoder is used for both the restoration and translation tasks. By sharing the encoder module between both tasks, the shared encoder is trained with a large-scale corpus, i.e., the Hanja-Korean paired dataset and the additional unpaired Hanja dataset. This parameter-sharing technique helps the model learn rich information from the Hanja corpus. We also apply the cross-layer parameter-sharing technique in the same manner as ALBERT (Lan et al., 2019), which shares the attention parameters across the layers of each Transformer encoder and decoder module to reduce the model size and the inference time.
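For concreteness, below is a minimal PyTorch sketch of this three-module layout with ALBERT-style cross-layer parameter sharing; PyTorch itself, the module names, and the toy shapes are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn

class SharedLayerStack(nn.Module):
    """Applies a single Transformer layer `depth` times, so every pass
    shares one set of parameters (cross-layer parameter sharing)."""
    def __init__(self, layer: nn.Module, depth: int):
        super().__init__()
        self.layer, self.depth = layer, depth

    def forward(self, x, **kwargs):
        for _ in range(self.depth):
            x = self.layer(x, **kwargs)
        return x

d_model, n_heads, d_ff = 768, 12, 3072  # values from the hyper-parameter settings
shared_encoder = SharedLayerStack(
    nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True), depth=12)
restoration_encoder = SharedLayerStack(
    nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True), depth=6)
translation_decoder = SharedLayerStack(
    nn.TransformerDecoderLayer(d_model, n_heads, d_ff, batch_first=True), depth=12)

x = torch.randn(2, 50, d_model)                 # projected Hanja embeddings
h = shared_encoder(x)                           # context shared by both tasks
restored = restoration_encoder(h)               # restoration branch
y = torch.randn(2, 30, d_model)                 # Korean decoder inputs
translated = translation_decoder(y, memory=h)   # translation branch
```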

Restoration of Damaged Documents
The restoration task for damaged documents is similar to the MLM approach, which masks randomly chosen tokens in the input sentence and then predicts their original tokens at the corresponding positions. We apply the MLM technique to restore the damaged documents, especially in the case of the Hanja sentences $\hat{H}$.
For the word indices $(w^{h_i}_1, \ldots, w^{h_i}_{L_i})$ in the Hanja sentence $h_i$, where $L_i$ is the length of the $i$-th sequence, several words are randomly selected and replaced by a [MASK] token. We extract word embedding vectors $(e^{h_i}_1, \ldots, e^{h_i}_{L_i}) \in \mathbb{R}^{d_{\mathrm{emb}}}$ from the Hanja embedding layer combined with positional embedding vectors, where $d_{\mathrm{emb}}$ represents the dimension of the embedding space. Here, we apply the factorized embedding parameterization technique to reduce model parameters (Lan et al., 2019). These embedding vectors are projected onto the $d_{\mathrm{model}}$-dimensional embedding space through a linear layer. Subsequently, the embedding vectors are transformed into the Hanja context vectors $(\hat{s}^{h_i}_1, \ldots, \hat{s}^{h_i}_{L_i})$ via the shared encoder and the restoration encoder as

$(\hat{s}^{h_i}_1, \ldots, \hat{s}^{h_i}_{L_i}) = f_R(f_S(e^{h_i}_1, \ldots, e^{h_i}_{L_i})), \qquad (1)$

where $f_S$ and $f_R$ represent the shared encoder and the restoration encoder, respectively. Each Hanja context vector is non-linearly transformed into an output vector $z^{h_i}_k \in \mathbb{R}^{d_{\mathrm{emb}}}$ via the output layer, to which we also apply the factorized embedding parameterization for parameter reduction. We then calculate the probability that the vocabulary entry $V_m$ is the original token $\hat{w}^{h_i}_k$ using the softmax function as

$p(\hat{w}^{h_i}_k = V_m) = \mathrm{softmax}(W^h z^{h_i}_k)_m, \qquad (2)$

where $|V^h|$ is the size of the Hanja vocabulary and $W^h \in \mathbb{R}^{|V^h| \times d_{\mathrm{emb}}}$ is the output layer for the Hanja corpus.
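The factorized embedding parameterization can be sketched as follows: token embeddings live in a small $d_{\mathrm{emb}}$-dimensional space and are linearly projected up to $d_{\mathrm{model}}$ on the input side and back down on the output side. In this sketch, the tanh non-linearity and tying the output layer to the embedding matrix are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_h, d_emb, d_model = 8700, 256, 768  # sizes reported in the paper

emb = nn.Embedding(vocab_h, d_emb)   # |V^h| x d_emb rather than |V^h| x d_model
up_proj = nn.Linear(d_emb, d_model)  # projection onto the model dimension
down_proj = nn.Linear(d_model, d_emb)

tokens = torch.randint(0, vocab_h, (2, 50))
x = up_proj(emb(tokens))             # input to the shared encoder (Fig. 3)
# ...shared encoder + restoration encoder yield context vectors s_hat...
s_hat = torch.randn(2, 50, d_model)
z = torch.tanh(down_proj(s_hat))     # output vector z in R^{d_emb} (tanh assumed)
logits = z @ emb.weight.t()          # output layer tied to the embedding (assumed)
p = F.softmax(logits, dim=-1)        # Eq. 2: distribution over |V^h|
```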

Neural Machine Translation for Historical Records
To facilitate the training of our translation module, we exploit the Hanja-Korean paired dataset $\{(h_i, k_i) \mid h_i \in H, k_i \in K\}$. As shown in Fig. 3, we first extract the Hanja context vectors $(s^{h_i}_1, \ldots, s^{h_i}_{L_i})$ from the word tokens in the Hanja sentence $h_i$, using the shared encoder in the same manner as in Eq. 1. Utilizing the Hanja context vectors and the previously predicted Korean words $(w^{k_i}_1, \ldots, w^{k_i}_{t-1})$, we subsequently calculate the $d_{\mathrm{model}}$-dimensional Korean context vector $s^{k_i}_t$ for the current time step $t$ as

$s^{k_i}_t = f_D(s^{h_i}_1, \ldots, s^{h_i}_{L_i};\; w^{k_i}_1, \ldots, w^{k_i}_{t-1}), \qquad (3)$

where $f_D$ represents the translation decoder. After calculating the Korean context vector $s^{k_i}_t$, we non-linearly transform it into the output vector $z^{k_i}_t \in \mathbb{R}^{d_{\mathrm{emb}}}$ through the output layer, along with the above-mentioned factorized embedding parameterization for parameter reduction. Finally, we obtain the probability that the word $V_m$ is generated at the $t$-th step as

$p(w^{k_i}_t = V_m) = \mathrm{softmax}(W^k z^{k_i}_t)_m, \qquad (4)$

where $|V^k|$ is the size of the vocabulary for the Korean corpus, and $W^k \in \mathbb{R}^{|V^k| \times d_{\mathrm{emb}}}$ is the output layer for the Korean corpus.
As previously mentioned, we employ the parameter-sharing approach for the encoder module (i.e., the shared encoder), thus enhancing the robustness of our model, especially on the Hanja dataset.
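A hedged sketch of the step-wise decoding in Eqs. 3-4 follows; `emb_k`, `up_k`, and `down_k` are hypothetical Korean-side counterparts of the Hanja modules sketched above, the BOS/EOS ids are placeholders, and greedy decoding stands in for the beam search described later.

```python
import torch

BOS, EOS, MAX_LEN = 1, 2, 300  # hypothetical special-token ids and length cap

def greedy_translate(h_context, decoder, emb_k, up_k, down_k):
    """h_context: (1, L_i, d_model) Hanja context from the shared encoder."""
    ys = torch.tensor([[BOS]])
    for _ in range(MAX_LEN):
        y = up_k(emb_k(ys))                          # embed the prefix w_1..w_{t-1}
        s = decoder(y, memory=h_context)             # Korean context vectors (Eq. 3)
        z = torch.tanh(down_k(s[:, -1]))             # output vector z_t (tanh assumed)
        next_id = (z @ emb_k.weight.t()).argmax(-1)  # most probable next token (Eq. 4)
        ys = torch.cat([ys, next_id.unsqueeze(0)], dim=1)
        if next_id.item() == EOS:
            break
    return ys
```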

Training and Inference
To train our model, we use the cross-entropy loss to maximize the probability of the original token indices for the masked tokens and of the target sentence for the translation task:

$\mathcal{L}_R = -\sum_{k \in \xi(h_i)} \log p(\hat{w}^{h_i}_k = w^{h_i}_k), \qquad \mathcal{L}_T = -\sum_{t} \log p(w^{k_i}_t), \qquad (5)$

where $\xi(\cdot)$ is an operator that randomly selects the tokens from each sentence for MLM. In this study, we apply not only unigram masking but also n-gram masking techniques (i.e., bigrams and trigrams), as previously applied (Zhang et al., 2019a). Finally, the total loss is defined as the sum of the two losses, $\mathcal{L} = \mathcal{L}_R + \mathcal{L}_T$. Our model is optimized using rectified Adam (Liu et al., 2019b) with the layer-wise adaptive rate scheduling technique (You et al., 2017). We also apply the gradient accumulation technique and update our model for each loss asynchronously, to increase the effective batch size and efficiently manage the GPU memory.
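As a concrete illustration of the masking operator $\xi(\cdot)$, the sketch below masks random unigram-to-trigram spans; the 15% masking rate is an assumption borrowed from common MLM practice, not a reported setting.

```python
import random

MASK = "[MASK]"

def ngram_mask(tokens, mask_rate=0.15, max_n=3):
    """Masks random spans of 1-3 tokens; returns the corrupted sequence
    and the (position, original token) pairs used as restoration targets."""
    tokens, targets, i = list(tokens), [], 0
    while i < len(tokens):
        if random.random() < mask_rate:
            n = random.randint(1, max_n)   # unigram, bigram, or trigram span
            for j in range(i, min(i + n, len(tokens))):
                targets.append((j, tokens[j]))
                tokens[j] = MASK
            i += n
        else:
            i += 1
    return tokens, targets

masked, targets = ngram_mask(list("夜一更月暈五更西方坤方有氣如火光"))
```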
After training, each damaged token is replaced by the [MASK] token during the restoration stage, and the model returns the top-K words with the highest probabilities, among which users can choose and confirm the correct word at the position of the damaged word. In addition, we translate all the Hanja records that have not yet been translated, for further in-depth analysis. When translating a Hanja sentence, we additionally apply beam search with length normalization.
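The restoration stage can thus be sketched as the following top-K lookup, assuming a `model` that maps token ids to per-position logits over the Hanja vocabulary; the [MASK] id is a placeholder.

```python
import torch

MASK_ID = 4  # hypothetical index of the [MASK] token in the Hanja vocabulary

def restore_topk(model, token_ids, damaged_positions, k=10):
    """token_ids: (L,) LongTensor; damaged_positions: list of damaged indices."""
    ids = token_ids.clone()
    ids[damaged_positions] = MASK_ID
    logits = model(ids.unsqueeze(0))              # assumed shape (1, L, |V^h|)
    probs = logits.softmax(-1)[0, damaged_positions]
    topk = probs.topk(k, dim=-1)                  # K candidates per damaged slot
    return topk.indices, topk.values              # experts confirm the right one
```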

Experiments
This section first describes our datasets and experimental settings.

Datasets and Preprocessing
To train our model, we collect most of the documents of AJD and DRS, including those manually translated to date, provided by the National Institute of Korean History 5 . The records contain approximately 250K documents for AJD and 1.4M documents for DRS.
After collecting the documents, we tokenize each Hanja sentence into character-level tokens, similar to previous studies (Zhang et al., 2014; Li et al., 2018), and tokenize each Korean sentence with the unigram language model (Kudo, 2018) provided by Google's SentencePiece library. 6 Here, we include only those words appearing more than ten times in the Hanja vocabulary, whose size is about 8.7K words. For the Korean corpus, we limit the vocabulary size to 24K. Out-of-vocabulary words are replaced with UNK (unknown) tokens. To improve stability and efficiency during training, we filter out Hanja sentences with fewer than four or more than 350 tokens, and Korean sentences with fewer than four or more than 300 tokens. Note that the portion of sentences filtered out from each dataset is less than 10%.
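For reference, a minimal SentencePiece sketch of the Korean side of this preprocessing is shown below; the file names are assumptions, while the vocabulary size and model type follow the text. The Hanja side reduces to simple character-level splitting.

```python
import sentencepiece as spm

# Korean side: unigram language model with a 24K vocabulary.
spm.SentencePieceTrainer.train(
    input="korean_corpus.txt",        # one translated sentence per line (assumed)
    model_prefix="korean_unigram",
    vocab_size=24000,
    model_type="unigram",
)
sp = spm.SentencePieceProcessor(model_file="korean_unigram.model")
pieces = sp.encode("밤 1경에 달무리가 졌다.", out_type=str)

# Hanja side: character-level tokens.
hanja_tokens = list("夜一更, 月暈.".replace(" ", ""))
```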
To evaluate the performance of our model, we randomly select 20K sentences as a test dataset for each of the paired and the unpaired sets. The sizes of the training set for the Hanja-Korean paired corpus and the unpaired Hanja corpus are 240K and 1.38M, respectively. The statistics of the dataset are summarized in Table 1.

Hyper-parameter Settings
We set hyper-parameters similarly to the BERT (Devlin et al., 2019) base model. We set the embedding dimension $d_{\mathrm{emb}}$, the hidden vector dimension $d_{\mathrm{model}}$, and the dimension of the position-wise feed-forward layers to 256, 768, and 3,072, respectively. The shared encoder, the translation decoder, and the restoration encoder consist of 12, 12, and 6 layers, respectively. We use 12 attention heads in each multi-head attention layer. Overall, the total number of parameters is around 168.8M.

Mining Historical Records via Topic Modeling
After obtaining machine-translated outputs of the remaining records, we apply topic modeling to the full set of documents for an exploratory analysis of historical events. To be specific, the full set of documents includes all of the manually translated records as well as the records machine-translated by our model. Using each translated record $k_i$ and its written date information $d_i$, we first parse the document into morphemes and then keep only the noun and adjective tokens. Afterward, we build the term-date matrix $V \in \mathbb{R}^{V \times D}$, where $V$ is the vocabulary size and $D$ is the number of dates in the total set of historical documents.
In this study, we utilize non-negative matrix factorization (NMF) (Lee and Seung, 2001) as the topic modeling method. 7 We first assume that there exist $K$ topics in the corpus. The term-date matrix $V$ is decomposed into the term-topic weight matrix $W \in \mathbb{R}^{V \times K}$ and the date-topic weight matrix $H \in \mathbb{R}^{D \times K}$ as

$W, H = \arg\min_{W, H \geq 0} \| V - W H^\top \|^2_F + \alpha\,\psi(W) + \alpha\,\psi(H), \qquad (6)$

where $\| \cdot \|_F$ represents the Frobenius norm, and $\psi$ and $\alpha$ represent the $L_1$ regularization function and the regularization weight, respectively. We set the number of topics $K$ to 20 and the regularization weight $\alpha$ to 0.1.

7 Topic modeling includes several methods, such as latent Dirichlet allocation (LDA) (Blei et al., 2003)-based and non-negative matrix factorization-based models (Lee and Seung, 2001). We additionally tested topic modeling with LDA, but the results of NMF were slightly better than those of LDA.
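A sketch of this step with scikit-learn's NMF is given below; the number of topics and the L1 weight follow the text, while the matrix here is a random stand-in for the real term-date counts.

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.random.rand(5000, 3650)           # stand-in term-date matrix (V x D)

nmf = NMF(n_components=20,               # K = 20 topics
          init="nndsvd", max_iter=400,
          alpha_W=0.1, alpha_H=0.1,      # regularization weight alpha = 0.1
          l1_ratio=1.0)                  # pure L1 penalty, matching psi
W = nmf.fit_transform(V)                 # term-topic weights, V x K
H = nmf.components_.T                    # date-topic weights, D x K

top_terms_topic0 = W[:, 0].argsort()[::-1][:10]  # word-cloud terms for topic 0
```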

Experimental Results
This section reports the performance of our model on the restoration and translation tasks, followed by qualitative examples of each task as well as topic modeling results.

Document Restoration
We evaluate the performance of our model on the document restoration task using the test dataset, and compare the models trained with and without multi-task learning. Table 3 shows the top-K accuracy (HITS@K) results. The top-10 accuracy of our proposed model is almost 89%, which indicates the high performance of our model and demonstrates that it provides analysts with appropriate candidate options. However, the baseline model, trained without multi-task learning, performs slightly better than the one with multi-task learning, showing that the baseline model is more specialized in the document restoration task. Nevertheless, although our model's restoration performance is slightly lower than the baseline's, the benefits of the multi-task learning approach are clearly manifested in the NMT task, as shown in Table 5. As our model shows acceptable performance on both the restoration and the translation tasks, we conclude that multi-task learning serves the purpose of our research well. We further discuss the main benefits of multi-task learning in Section 5.2.
We further investigate the qualitative results of the document restoration task. Table 2 shows four randomly sampled example pairs. As shown in the first three rows of the table, the model is also able to predict bigram and trigram character-level tokens, because it is trained using n-gram-based MLM. Furthermore, although each character is not exactly the same as the original one, the last example in the table shows that our model restores the proper format of the name part. However, predicting an exact name is difficult even for human experts, since prior knowledge beyond the context of the sentence is necessary. Therefore, we quantitatively measured the model performance on proper nouns, e.g., person and location names, using 200 samples. The average top-10 accuracy is only 8.3%, significantly lower than the overall accuracy of about 89%. We conjecture that this degradation is mainly due to the difficulty of recovering proper nouns, which would require external knowledge. We leave this as future work.

Machine Translation Quality
To investigate the performance of the machine translation task, we translate the Hanja sentences in the test dataset and evaluate the results with BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and ROUGE-L (Lin, 2004), as shown in Table 5. In these results, "Full" represents our proposed model trained by multi-task learning of the translation and restoration tasks; it is thus trained on both the translated and untranslated sentences. On the other hand, "Base" represents the model trained only on the translation task, and thus only on the translated sentences. Our model outperforms the baseline model by a significant margin. Furthermore, we generate sentences using beam search with length normalization.
In this study, we compare greedy search and beam search with a beam size of 3. As shown in Table 5, results obtained with a beam size of 3 are slightly better than those of greedy search. Finally, the BLEU score of our model reaches 0.5410, which indicates that our model performs reasonably well compared to recent models trained on other languages.
We additionally compared our model to a model trained via the pretraining-then-finetuning approach. As shown in Table 6, the BLEU score of this approach is 0.3755, which is 5.9% higher than that of the model trained from scratch but 28.7% lower than that of our multi-task learning approach. These results can be explained by two factors. First, as the size of the unpaired data is much larger than that of the paired data, multi-task learning fully utilizes both the paired and unpaired data for the translation task, compared to the pretraining-then-finetuning approach. Second, the pretraining-then-finetuning approach suffers from the catastrophic forgetting problem (Chen et al., 2020); in other words, the finetuning step can fail to maintain the knowledge acquired in the pretraining step. As both the restoration and translation tasks are crucial for historical documents, such a forgetting issue is critical to our tasks.
We also tested the quality of the Hanja-Korean translation using a Chinese-Korean machine translator. As few publicly available machine translation models for Chinese-Korean exist, we used Google Translate 9 instead. The translator failed to translate the given Hanja sentences in most cases, mainly because Hanja and Chinese have different properties in terms of grammar and word meanings.
To investigate the translation performance qualitatively, we sampled translated sentences. Table 4 shows the sentences translated from the untranslated documents by our model. For readability, we append English sentences corresponding to the predicted sentences in each row. An excerpt from Table 4:

Predicted (Eng.): Replying to the Prosecutor General Namyongik's memorial, the king said, "I looked at the memorial and thoroughly understood what it meant. As the position of the director at the office of the royal physicians cannot help but agree to your message, you should not resign your position, care for your mother's illness, and come back to be responsible for your duties quickly."

Original: 夜一更, 月暈. 五更, 西方坤方, 有氣如火光.
Predicted: 밤 1경에 달무리가 졌다. 5경에 서방, 곤방에 화광 같은 기운이 있었다.
Predicted (Eng.): The moon has a ring around it at 7-9 PM. At 3-5 AM, there was the light of the fire in the west and south-west.

(Table 5 caption: Results of the translation task. "Base" and "Full" represent the model trained only with the machine translation task and the model trained via multi-task learning with the machine translation and restoration tasks, respectively.)

Each result indicates that our model generates modern sentences corresponding to the contexts of the source Hanja sentences. Interestingly, the third example in the table concerns an astronomical observation of an aurora. We later found prior studies confirming that the red energy mentioned in our document was an aurora (Zhang, 1985; Stephenson and Willis, 2008). This highlights the importance of the machine translation of historical records, as it is essential for researchers in various fields such as astrophysics and geology. Therefore, we further analyze the documents with the topic modeling approach.

Results of Topic Modeling
As described in Section 4.3, we calculate the term-topic weight matrix $W$ and the date-topic weight matrix $H$. We select three interesting topics from the total of $K$ topics and visualize, for each topic, the term-topic weights in $W$ as a word cloud and the date-topic weights in $H$ as a smoothed time-series graph. Fig. 4 shows the results. The first topic is related to troops and military exercises. As shown in the red dashed box in the time-series graph, the weights dramatically decrease in 1882, while the weights continuously increase after the biggest war in 1592. In fact, a coup attempt by old-fashioned soldiers occurred in 1882, causing the intervention of neighboring countries and the decline of self-reliant defense. The fifteenth topic is related to war and national defense. Although this topic is related to the preceding military topic, it concerns international relationships more than the first one does. In the early years of the dynasty, northern enemies and pirates frequently invaded Joseon, which is revealed by the large topical weights in the beginning. The weights increase in the late sixteenth century and remain at a high level until 1637, when three great wars broke out in Joseon.
The eighteenth topic is related to astronomical observations, such as halos and meteor showers. In the mid-sixteenth century, people observed the Leonids, as shown in the first red box of the graph. We later found that experts in astronomy had also discovered this using AJD (Yang et al., 2005). Moreover, from the mid-seventeenth century to the early eighteenth century, the number of sunspots was low. Solar observers call this event the Maunder minimum (Eddy, 1976; Shindell et al., 2001). This event caused abnormal climate phenomena, such as the third example in Table 4, as shown in the second red box of the graph. This topic demonstrates the importance of using historical records, since it is difficult to spot phenomena that occurred centuries ago by other means.
Note that previous studies mainly attempted to exploit only AJD or the translated parts of DRS. In contrast, we utilize both AJD and the majority of the DRS records by applying advanced NMT techniques. When we performed topic modeling using only the manually translated sentences, it failed to include topics such as the health of the royal families and actions against traitors, which were revealed by our approach, because the voluminous documents that have not been manually translated contain their own topics. Thereby, we extract several valuable topics even without special knowledge of the Hanja domain. Translating the historical records into modern languages expands our knowledge base, and analyzing the records using machine translation and text mining techniques can help analysts effectively explore the historical records.

Conclusions
In this paper, we proposed a novel approach to translate and restore the historical records of the Joseon dynasty by formulating a multi-task learning problem based on the self-attention mechanism. Our approach significantly increases the translation quality by learning the rich contents of large document collections. We anticipate that these tasks are the first steps toward translating the ancient Korean historical records into modern languages such as English. Furthermore, the model effectively predicts the original words in the damaged parts of the documents, which is an essential step for restoring the damaged documents. Results from the text mining approaches show that our approach has the potential to support analysts in effectively exploring the large volume of historical documents. We also expect that researchers from diverse domains can explore the documents and discover historical findings, such as astronomical phenomena and undiscovered international affairs, without special domain knowledge. As future work, we will leverage the transfer learning approach to translate the historical documents into other languages, such as English or French. We also plan to apply knowledge graph-based machine learning approaches, e.g., knowledge graph embedding and graph neural networks, to discover historical events and relations.