HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters that is not understood by modern Korean or Chinese speakers. Historians with expertise in this period have been analyzing the documents, but the process is very difficult and time-consuming, and language models could significantly speed it up. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation (HUE) dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models with continued pretraining on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and the Diaries of the Royal Secretariat. We compare the models with several baselines on all tasks and show that training on the two corpora yields significant improvements. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by historians, and not at all by the NLP community.


Introduction
Large-scale historical records in Korea were mostly produced during the Joseon dynasty (1392-1897), and the Institute for the Translation of Korean Classics (ITKC) maintains a comprehensive database of Korean classics at a scale of approximately 9 billion characters. This digital archive is a great resource for Korean historians, but the documents remain in the original Hanja language. Hanja is an extinct language, and as Table 1 illustrates with the simple sentence "The King attended the Royal Lecture.", Hanja is lexically and syntactically different from modern Korean as well as from simplified and traditional Chinese. Understanding the documents in the digital archive is thus difficult and would benefit greatly from a Hanja language model, which can also be used to accelerate expert translation (Vale de Gato, 2015). There are two corpora we can use to train such a language model: the Annals of the Joseon Dynasty (AJD), first introduced to the NLP community by Bak and Oh (2015), and the Diaries of the Royal Secretariat (DRS) (Kang et al., 2021). In this paper, we provide the HUE (Hanja Understanding Evaluation) dataset, consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval, a suite of tasks to help build and evaluate Hanja language models. In addition to AJD and DRS, we also work with the Daily Records of the Royal Court and Important Officials (DRRI). Unlike AJD and DRS, which have been analyzed by historians and contain their annotations, DRRI lacks such systematic analysis; we use it for zero-shot learning and introduce it to the NLP community.
We also provide pretrained language models (PLMs) for Hanja, trained on AJD and DRS and fine-tuned for each task in HUE. Our models pretrained on corpora from that era outperform existing language models built for ancient Chinese, confirming the need for specially trained Hanja language models. We also run additional experiments based on analyses of entity- and word-level changes in AJD, controlling the input conditions by masking named entities and by giving the time period as input. Finally, we demonstrate the effectiveness of our Hanja language models for analyzing unseen documents, running zero-shot experiments for the chronological attribution and named entity recognition tasks on DRRI.
Our main contributions are as follows:
• We release the HUE dataset and Hanja PLMs to support historians in understanding and analyzing a large volume of historical documents written in Hanja. To the best of our knowledge, this is the first work proposing Hanja language models and releasing an NLP benchmark dataset for ancient Hanja documents.
• We demonstrate that providing key information such as named entities and document age as input improves the performance of Hanja language models on the HUE tasks.
• We run zero-shot experiments on several HUE tasks using DRRI, which has not been discussed in the NLP community, and demonstrate the performance of our Hanja language models on unseen historical documents.

Hanja
Hanja, a writing system based on ancient Chinese characters, was the main writing system in Korea before the 20th century, while Hangul, the unique Korean alphabet, has been the main writing system since the last century. Formal records from the Joseon dynasty (1392-1897) are written in Hanja, while the spoken language and some written documents were in Hangul, developed in the 15th century. This co-existence of written and colloquial languages led Hanja to evolve from the basic syntax of classical Chinese, mixing in the lexical, semantic, and syntactic characteristics of colloquial Korean. Hanja is significantly different from both modern Korean and modern Chinese. Modern Korean uses a different alphabet and structure, and traditional Chinese shares some characters with Hanja, but its lexicon has evolved greatly to reflect the temporal, geographical, and cultural differences between the Joseon dynasty and modern-day China. Simplified Chinese, the current written language in China, has diverged even further because of the simplification of the characters. These differences between Hanja and related languages lead to suboptimal performance when current Chinese language models are adopted for NLU tasks on Korean historical Hanja documents.

Dataset
We describe the three corpora of records written in Hanja during the Joseon dynasty, whose contents and additional information, such as topics and named entities, are provided by historians at ITKC. Table 2 lists the Hanja corpora used.
Annals of the Joseon Dynasty (AJD), also called the Veritable Records of the Joseon Dynasty, is a corpus of 27 sets of chronological records, each covering one ruler's reign. AJD was translated into Korean from 1968 to 1993 and includes relevant tags such as the named entities and dates of the documents. We use AJD both for training our Hanja language models and for building the HUE dataset of NLP tasks.
Diaries of the Royal Secretariat (DRS) is a corpus of detailed records of daily events and official schedules of the court from the first King Taejo to the last (27th) King Sunjong. Many of the earlier records were lost, and we use the extant records starting from the 16th King Injo. DRS is known to hold the largest amount of authentic historical records and state secrets of the Joseon dynasty. We use DRS to continue pretraining the language models.
Daily Records of the Royal Court and Important Officials (DRRI) is a corpus of journals written from the 21st King Yeongjo to the last Emperor Sunjong, presumably initiated from the diaries of the crown prince who became the 22nd King Jeongjo after he was enthroned. DRRI contains official daily records from both the central and local governments, and thus encompasses events across the whole country as well as reports to the king with summaries. DRRI is known to include details and events of the late Joseon dynasty not recorded in AJD or DRS, making it a good corpus for zero-shot experiments. We use DRRI for the supervised summary retrieval task and for the zero-shot experiments on chronological attribution and named entity recognition.

HUE Dataset
The HUE (Hanja Understanding Evaluation) dataset is built to assist history scholars in understanding Korean historical records written in Hanja. HUE consists of chronological attribution, topic classification, named entity recognition, and summary retrieval, tasks that can provide helpful information for studying the documents. We expect that language models evaluated on HUE will ultimately help historians interpret unseen historical documents and help the public grasp the basic concepts of those documents. We describe each task in detail below.

Task Description
Chronological Attribution (CA) is a classification task predicting the reigning king at the time a document was written. Given a Hanja document from AJD, a classifier outputs one of the 27 kings of the Joseon dynasty. Chronological attribution of an undiscovered document is the first step toward interpreting and translating it. Korean historians mostly divide the history of the Joseon dynasty by the reigning king, so we treat chronological attribution as a classification task.
Topic Classification (TC) is a multi-class, multi-label classification task to find the topics of a given document. For TC, we use Hanja documents from AJD. We provide two levels of topics: major and minor categories. The major categories consist of 4 broad topics: politics, economy, society, and culture. The minor categories comprise 106 sub-topics such as diplomacy, agriculture, and science.
Named Entity Recognition (NER) is a sequence tagging task identifying two types of named entities, person and location, in Hanja documents from AJD. We divide the train, validation, and test sets such that there are no overlapping entities across the sets.
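An entity-disjoint split can be produced with a simple greedy pass. The sketch below is our illustration, not the actual HUE preprocessing; the function name and document fields are hypothetical. Each document is assigned to the split that already owns one of its entities, and documents whose entities would bridge two splits are dropped.

```python
def entity_disjoint_split(docs, ratios=(0.8, 0.1, 0.1)):
    """Greedily assign documents to train/valid/test splits so that
    no named entity appears in more than one split. Documents whose
    entities already span two different splits are dropped.
    Hypothetical sketch; the real HUE preprocessing may differ."""
    splits = [[], [], []]   # train, valid, test
    owner = {}              # entity -> index of the split that owns it
    for doc in docs:
        seen = {owner[e] for e in doc["entities"] if e in owner}
        if len(seen) > 1:   # entities conflict across splits: drop doc
            continue
        if len(seen) == 1:
            idx = seen.pop()
        else:               # pick the split most below its target ratio
            total = sum(len(s) for s in splits) + 1
            idx = min(range(3),
                      key=lambda i: len(splits[i]) - ratios[i] * total)
        splits[idx].append(doc)
        for e in doc["entities"]:
            owner[e] = idx
    return splits
```

With small toy inputs most documents land in the train split; on a large corpus the ratio-deficit rule keeps the splits near the target proportions.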
Summary Retrieval (SR) is the task of finding, among candidate summaries, the one most relevant to the given content. For this task, we use DRRI, in which each document is a pair of a summary (gang) and detailed content (mok). Among the 426k articles in DRRI, 265k contain both gang and mok. We also exclude pairs whose gang is longer than the mok, since such a gang cannot be a summary of the mok. The final dataset contains 213k pairs of content and corresponding summaries.
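The filtering described above can be sketched in a few lines; the field names (`gang`, `mok`) and function name below are illustrative, not taken from the released code:

```python
def build_sr_pairs(articles):
    """Keep only articles that contain both a gang (summary) and a
    mok (content), and drop pairs whose gang is longer than the mok,
    since such a gang cannot be a summary of the mok.
    Illustrative sketch with hypothetical field names."""
    pairs = []
    for art in articles:
        gang, mok = art.get("gang"), art.get("mok")
        if not gang or not mok:      # one side missing or empty
            continue
        if len(gang) > len(mok):     # "summary" longer than content
            continue
        pairs.append((mok, gang))
    return pairs
```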
We describe more details of the preprocessing in the Appendix.

Hanja Pretrained Model
As far as we know, there have been no pretrained language models for the Hanja language. One can use related LMs: pretrained models for ancient Chinese, as well as multilingual BERT (mBERT; Devlin et al., 2019), which includes traditional Chinese in its training corpus. AnchiBERT (Tian et al., 2021) is pretrained on ancient Chinese anthologies written around 1000 BC to 200 BC. Since there is some vocabulary overlap between Hanja documents and traditional Chinese corpora, we can adopt multilingual BERT and AnchiBERT to learn representations of Hanja texts. We propose pretrained language models suitable for Hanja documents by continuing to pretrain these two models on both AJD and DRS. Table 3 shows the ratio of unknown tokens in the test set of AJD for each model. It implies that the existing AnchiBERT and multilingual BERT can be exploited as language models for Hanja documents written in the Joseon dynasty, but that a second phase of pretraining on the corpora of that era remarkably decreases the ratio of unknown tokens.

Model       AnchiBERT   mBERT
Original    0.88%       0.76%
+AJD/DRS    0.04%       0.04%

Table 3: Unknown token ratio of each model on the test sets for the CA, TC, and NER tasks in HUE. The first row gives the results of the original PLMs without any additional pretraining, and the second row those with continued pretraining on AJD and DRS.
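The unknown-token ratio reported in Table 3 amounts to counting how often a tokenizer falls back to its unknown symbol. A minimal, generic sketch (with a real BERT model one would pass `tokenizer.tokenize` and `tokenizer.unk_token` from a Hugging Face tokenizer; the stub in the test stands in for it):

```python
def unk_ratio(tokenize, unk_token, texts):
    """Fraction of tokens equal to the tokenizer's unknown symbol,
    over a collection of texts. `tokenize` is any callable mapping
    a string to a list of tokens."""
    total = unk = 0
    for text in texts:
        for tok in tokenize(text):
            total += 1
            unk += tok == unk_token
    return unk / total if total else 0.0
```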

Experimental Settings
We conduct experiments on HUE with our pretrained models. As a baseline, we use BERT without pretraining and compare it to the various BERT models described in Section 3.1. For summary retrieval, we fine-tune each model to act as a re-ranker. We first retrieve the top-k relevant gangs (summaries) with BM25 (Robertson and Zaragoza, 2009) among all gangs in the dataset. Then, we fine-tune the model as a binary classifier to determine whether a summary matches the content of the mok, trained with the cross-entropy loss (Nogueira and Cho, 2019). If the ground-truth summary is not included in the top-k candidates, we replace the k-th candidate with the ground truth. We use k = 12 for training and k = 100 for inference. As the representative metrics, we present F1 scores for the CA, TC, and NER tasks, and Mean Reciprocal Rank (MRR) for SR. Detailed results with other metrics are available in the Appendix.

Table 4 shows the overall experimental results. The bold and underlined entries in the table indicate the best and second-best results, respectively. BERT without any pretraining shows the poorest results across all tasks. AnchiBERT and mBERT, existing language models in related domains, show better results, and the models with continued pretraining on Hanja documents achieve the best performance on all tasks. This tendency indicates that all of these tasks on understanding historical documents require pretraining language models on time- and domain-specific data.
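The candidate-construction step and the MRR metric can be sketched as follows. This is our illustration, not the released code: the helper names are hypothetical, and any first-stage scoring function (BM25 in the paper) can be plugged in as `score`.

```python
def build_candidates(query_mok, all_gangs, score, gold, k=12):
    """Top-k gangs for one mok under a first-stage scorer, with the
    gold gang forced into the list (replacing the k-th candidate if
    absent), paired with binary labels for re-ranker training."""
    ranked = sorted(all_gangs, key=lambda g: score(query_mok, g),
                    reverse=True)
    topk = ranked[:k]
    if gold not in topk:
        topk[-1] = gold
    return [(query_mok, g, int(g == gold)) for g in topk]

def mean_reciprocal_rank(ranked_lists, golds):
    """MRR over queries: reciprocal of the 1-based rank of the gold
    item in each ranked candidate list (0 if absent)."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, golds):
        total += 1.0 / (ranked.index(gold) + 1) if gold in ranked else 0.0
    return total / len(ranked_lists)
```

At inference, the fine-tuned classifier's match probability re-ranks the k = 100 candidates, and MRR is computed over the re-ranked lists.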

Overall Results
AnchiBERT pretrained on the Hanja corpora shows slightly better performance than mBERT pretrained on the same corpora. We assume this is because the original training corpora of AnchiBERT are much closer to the Hanja documents, even though the eras of the two corpora are completely different. The writing styles of both Hanja documents in the Joseon dynasty and anthologies in ancient China derive from Classical Chinese and share similarities. On the other hand, the training corpus of mBERT consists of contemporary texts whose characters include traditional Chinese, but whose structure and format have changed considerably.
Chronological Attribution (CA) Our models continued pretraining on Hanja corpora outperform other baselines on CA. Detailed analysis on CA result is illustrated in Section 6.1.
Topic Classification (TC) Figure 1 gives ROC curves and AUC values for each model on each task. Our models show trends similar to the overall results, outperforming the other language models. For the evaluation results including the F1 score, we find and set the best threshold for each label by Youden's index (Youden, 1950). While the F1 score goes down as the number of classes increases from 4 to 106, there is no significant difference in AUC. This might result from consistently high recall, around 90% on both tasks. It indicates that the threshold is too low and the models tend to predict as many plausible topics as they can, which might be addressed by adjusting the threshold. AnchiBERT pretrained on AJD and DRS, which shows the best performance, predicts 6.39 labels on average, while the average number of ground-truth labels in the minor categories is 1.97. This is probably due to meaning overlaps among the minor categories. For instance, minor categories such as revenue, finance, general price level, and commerce are all sub-categories of economy in the major categories, and their use cases cannot be strictly distinguished. In this case it may be more appropriate to provide all plausible topics rather than suggesting only the one with the highest certainty.
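Youden's index picks, per label, the threshold maximizing J = TPR − FPR. A pure-Python sketch (in practice one would typically derive the same quantity from scikit-learn's `roc_curve`; it assumes both a positive and a negative example are present for the label):

```python
def youden_threshold(y_true, y_score):
    """Return the decision threshold maximizing Youden's J statistic
    (J = TPR - FPR) for one binary label. Assumes y_true contains at
    least one positive and one negative example."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_j, best_t = -1.0, 0.5
    for t in sorted(set(y_score)):
        tp = sum(1 for yt, ys in zip(y_true, y_score) if yt == 1 and ys >= t)
        fp = sum(1 for yt, ys in zip(y_true, y_score) if yt == 0 and ys >= t)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t
```

For multi-label TC, this is applied independently to each of the 4 (or 106) label columns.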
Named Entity Recognition (NER) NER also shows trends similar to the overall benchmark results, but with a small gap among models, including BERT without pretraining. This implies that NER in Hanja documents is a comparatively easy task, which might result from certain patterns in Hanja named entities: most person entities are 3 characters starting with a common character (the family name), and most location entities end with common characters meaning locations or buildings. All models tend to predict person entities better than location entities.
Summary Retrieval (SR) All fine-tuned BERT-based re-rankers outperform BM25, whose MRR is merely 29.87%, mostly retrieving the ground-truth answer at the first attempt. Likewise, our models show the best results, while BERT without pretraining shows the lowest MRR. This additionally implies that BERT-based re-rankers might be exploited for retrieving relevant documents from chronicles that differ in writing style or content.

Effect of Entity and Document Age on Language Model
We investigate whether providing additional information as input can improve the performance of language models. For CA, we mask all named entities in the input and fine-tune the language model on the masked data to examine the impact of entity information. For TC and NER, we concatenate document age information to the input text and run comparative experiments to verify the importance of the time period for historical texts.
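Both input manipulations are simple string-level transformations; the sketch below is our illustration (function names and the exact special tokens are hypothetical; a real implementation would use the model tokenizer's own mask and separator tokens):

```python
def mask_entities(text, spans, mask_token="[MASK]"):
    """Replace each annotated entity span (start, end) with a single
    mask token. Spans are processed right-to-left so that earlier
    character offsets remain valid after replacement."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask_token + text[end:]
    return text

def prepend_age(text, king, sep_token="[SEP]"):
    """Concatenate the document age (here, the reigning king's name)
    to the input text, separated by the model's separator token."""
    return f"{king} {sep_token} {text}"
```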

Entity-Masked Chronological Attribution
Table 5 shows the difference in CA results when the named entities in the input texts are masked. Compared to the default setting, which does not mask named entities, all models show significant improvements. This is probably because the models can focus on the content and changes in writing style without the distraction of location entities that are used consistently throughout the whole era. Models fine-tuned with entity-masked inputs also achieve nearly the same level of performance when run on plain inputs that include named entities. This suggests fine-tuning models with masked named entities for CA, considering the realistic scenario in which texts at inference time lack time-period information.

Topic Classification and Named Entity Recognition with the Age of Document It is natural to assume that historical texts written over several eras reveal changes over time with respect to lexical choices and contents; Section 6.2 confirms this hypothesis in terms of n-gram changes over time. There was a large difference for non-pretrained BERT in TC, probably due to its poor performance in the original setting. All models show similar trends on both tasks, with improved performance compared to the original settings without the document age as input. This is an intuitive result, considering that the first step in studying an ancient manuscript is to estimate the era in which it was written. It underscores the significance of the chronological attribution task in HUE, suggesting that chronological attribution might improve the performance of the other HUE tasks.

Zero-shot Experiment
Countless Hanja documents remain unanalyzed, and new documents continue to be unearthed. We therefore run zero-shot experiments to verify the effectiveness of our language models at extracting information from historical documents unrelated to the training corpora. We use the DRRI dataset, which is included in neither the pretraining nor the fine-tuning data of our Hanja language models, and run CA and NER. Table 7 shows the experimental results for CA and NER on DRRI. All models perform comparably well on both tasks, considering that a random model would achieve approximately 3.70% with 27 classes in CA. All models in CA also commonly show high precision, which might be due to the monotonous and redundant phrases in the veritable records. The trends are similar to those in Table 4, but the gaps among models are notably larger in the zero-shot setting. These results imply that our CA models might be exploited for reliable time-period prediction of unseen documents.
Our models outperform the others on NER, achieving high absolute performance even though the entity maps of AJD and DRRI do not match strictly. Interestingly, all models tend to predict location entities better than person entities, the opposite of the result in the original NER experiments on AJD. This is probably due to the characteristics of each entity type: location entities are commonly used nationwide, while person entities differ by situation. Further analysis of person and location entities from the perspective of change over time is given in Section 6.2. We show that our models trained on the corpora of the Joseon dynasty provide reliable results on unseen records, implying that they can be exploited for low-resource documents.
Further Analysis

Do Historical Events Affect Language Models?
To examine the effect of historical events on models' predictions, we analyze the output of the language models on CA. Figure 2 (a) shows a log-scale confusion matrix of AnchiBERT with continued pretraining on AJD and DRS, and Figure 2 (b) shows the mean absolute error between the predicted king order and the ground truth for each king. The x-axis in Figure 2 (b) represents the passage of time by reign period; each bar indicates the mean absolute error between the order of the ground-truth king and that of the predicted king, and the line graph shows the number of samples in the test set. The results for the last two Emperors, Gojong and Sunjong, are remarkable in that the model rarely confuses these two labels with the others, showing a notable difference in Figure 2 (a) and significantly low mean absolute error in Figure 2 (b). We believe this is because our model learns the difference between these two records and the others from a historical point of view and picks up cues to distinguish them. The last two records are generally not treated as part of AJD, since they were inspected and produced by the Office of the Governor-General during the Japanese colonial period, reflecting the view of the Empire of Japan, which ruled Korea at the time (see http://esillok.history.go.kr/).
The mean absolute error of the predicted order is around one for each king, except for the first King Taejo and the second King Jeongjong, whose errors are almost doubled. We hypothesize that this is mainly because these classes contain too few examples; a similar tendency, where more samples lead to lower mean absolute error, is observed in the other classes. Also, the writing style of AJD only settled down from the third King Taejong onward, and the earlier, noisier records might confuse the models in predicting the exact dates.

Do Time Changes Affect Written Texts?
It stands to reason that AJD, written over five centuries, reveals features of language change. In this section, we investigate this hypothesis with respect to named entities and n-grams. Figure 3 plots the changes of trigrams over kings: its x-axis shows the kings and its y-axis the number of overlapping trigrams.
Words Change over Time We analyze how frequently words change over time. For each king, we plot how many trigrams overlap with each other king's era, in chronological order. Figure 3 shows the overlapping trigrams for the 1st, 9th, 17th, and 25th kings; detailed results for all kings are given in the Appendix. We consistently observe that the closer two king eras are, the more trigrams overlap. These changes result from lexical choices rather than named entities, considering that person and location entities account for only 6.38% and 2.05% of the characters in AJD, respectively. This verifies that the words used in the Joseon dynasty changed gradually over time, which enables the language models to capture these features.
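The trigram-overlap analysis can be sketched with character n-gram sets; function names below are our own, illustrative choices:

```python
def char_trigrams(text):
    """Set of character trigram types occurring in a text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def trigram_overlap_by_king(corpus_by_king, anchor):
    """Number of trigram types each king's sub-corpus shares with
    the anchor king's sub-corpus. `corpus_by_king` maps each king to
    the concatenated text of that reign. A sketch of the analysis
    behind Figure 3."""
    anchor_grams = char_trigrams(corpus_by_king[anchor])
    return {king: len(anchor_grams & char_trigrams(text))
            for king, text in corpus_by_king.items()}
```

Plotting these counts with kings on the x-axis, in reign order, gives curves like those in Figure 3.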
Named Entity Changes over Time We investigate how named entities were used over time. In particular, Figure 4 shows the frequency rates of the top-10 most frequently used named entities in each king's era and how they change over time. It implies a strong correlation between person entities and the passage of time, while there is no explicit correlation for location entities. Most person entities are officials of the time or previous kings, tied to the time period. In contrast, most location entities are neighboring countries or place names in Joseon, which are less time-dependent. Examples of frequently appearing named entities are given in the Appendix.

Related Work

Several works provide language models suited for historical texts in ancient languages and evaluate them on existing NLU tasks, aiming to support the understanding of such documents given that the target languages are mostly extinct. Bamman and Burns (2020) propose Latin BERT for part-of-speech tagging in ancient Latin script. Tian et al. (2021) suggest AnchiBERT and evaluate their model on several NLP tasks, including poem topic classification. However, there has been no research attempting to propose language models for Hanja, a dead language in Korea that is nonetheless essential for exploring Korean history. Most studies involving Hanja shed light only on translating historical Hanja documents and use AJD as their corpus (Park et al., 2020; Jin et al., 2020; Kang et al., 2021).

Conclusion
We present the HUE (Hanja Understanding Evaluation) dataset and BERT-based pretrained language models for classical Hanja documents. The HUE dataset includes diverse tasks that can support the analysis of historical documents written in Hanja, an extinct language in Korea: Chronological Attribution (CA), Topic Classification (TC), Named Entity Recognition (NER), and Summary Retrieval (SR). Our models pretrained on Hanja corpora outperform other language models, and we evaluate their performance in zero-shot settings with DRRI, a dataset never before introduced to the NLP community. The experimental results on king prediction imply that our models capture the historical events and facts disclosed in the texts. We also explore several methods to support Hanja language models, such as masking named entities and giving the document age as input, based on analyses of textual features in AJD.
With adequate resources for Hanja documents, future work could address a limitation of this study: the lack of additional experiments and analyses on records of other genres, such as poetry, novels, and the humanities, owing to the low resources currently available. Nevertheless, we expect that our dataset and accompanying language models will facilitate future work on historical documents written in Hanja by providing fundamental resources for leveraging unexplored Hanja corpora.

A Data Collection

We crawl AJD, DRS, and DRRI from the comprehensive database of Korean classics published by ITKC, which is publicly available. All source corpora are fully tagged with the written ages and named entities, although their entity maps differ from each other. AJD also provides topics, tagged by the experts during the translation process. Some examples present extremely short moks that cannot be treated as summary-content pairs, while the official administrative records have much longer moks.

C Discussion
C.1 Trigram Changes Over Time

Figure 5 shows the changes of trigrams over all the kings. It clearly shows how trigram overlap changes as time goes by. We observe the same trend for unigrams and bigrams.

D Experimental Results
For CA, we additionally measure the Quadratic Weighted Kappa (QWK) score as a metric that treats the king labels ordinally. Since TC is a multi-label classification task, in which an example may have multiple labels, we measure the Hamming score along with accuracy. Here, accuracy is the exact-match score, and the Hamming score is the subset-overlap accuracy, |T ∩ P| / |T ∪ P|, where T is the set of true labels and P is the set of predicted labels (Godbole and Sarawagi, 2004). For the evaluation results in Table 13, we find and set the best threshold for each label by Youden's index. All pretrained models outperform BERT without pretraining, and the two LMs further pretrained on Hanja documents show the best performance.
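The Hamming score above can be sketched directly from its definition (the function name is our own; the convention for two empty label sets is an assumption):

```python
def hamming_score(true_sets, pred_sets):
    """Mean |T ∩ P| / |T ∪ P| over examples (Godbole and Sarawagi,
    2004). An example with both sets empty counts as a perfect match
    (an assumed convention). Exact-match accuracy is the stricter
    companion metric."""
    scores = []
    for T, P in zip(true_sets, pred_sets):
        union = T | P
        scores.append(len(T & P) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```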