Translating Hanja Historical Documents to Contemporary Korean and English

The Annals of the Joseon Dynasty (AJD) contain the daily records of the kings of Joseon, the 500-year kingdom preceding the modern nation of Korea. The Annals were originally written in an archaic Korean writing system, 'Hanja', and were translated into Korean from 1968 to 1993. The resulting translation was, however, too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade. In parallel, expert translators are working on an English translation, also at a slow pace, having produced only one king's records in English so far. We therefore propose H2KE, a neural machine translation model that translates historical documents in Hanja into more easily understandable Korean and into English. Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja from both a full dataset of outdated Korean translations and a small dataset of more recently translated contemporary Korean and English. We compare our method against two baselines: a recent model that simultaneously learns to restore and translate Hanja historical documents, and a Transformer-based model trained only on newly translated corpora. The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations. We further conduct an extensive human evaluation, which shows that our translation is preferred over the original expert translations by both experts and non-expert Korean speakers.


Introduction
Historical documents written in an archaic language should be translated into a modern language. Most Korean historical documents are written in Hanja, the main written language in Korea before the 20th century. Hanja is an archaic language based on the old Chinese writing system, and although there is a large overlap in characters, it is different from both Chinese and Korean. The Annals of the Joseon Dynasty (AJD), the representative historical records of Joseon (1392-1910), originally written in Hanja, were translated into Korean from 1968 to 1993 by expert translators commissioned by the Korean government. Non-expert Korean speakers, however, have trouble understanding these original translations of the AJD because they contain many archaic Hanja-based words, often hard-to-understand transliterations. The Institute for the Translation of Korean Classics (ITKC) recognizes this problem and is re-translating the entire AJD in a modern style of writing (Table 1). This re-translation is expected to take 22 years with 12 to 15 expert translators. Simultaneously, the National Institute of Korean History (NIKH) has been translating the AJD into English since 2012, which is also expected to take about two more decades.

Table 1: An example from the Annals of the Joseon Dynasty. We show the original Hanja sentence and the original Korean human translation, which contains archaic words indicated in color boxes. The contemporary Korean translation replaces the archaic words with words and phrases understood by present-day Korean speakers.
Machine translation can accelerate the translation process. The challenge is the limited availability of parallel corpora between Hanja and contemporary Korean, as well as English. Only one annal of the 24 kings of the Joseon Dynasty has been newly translated into Korean and English. This is not a sufficient amount to train a full machine translation model. To address this low-resource problem, we adopt a multilingual translation approach that jointly learns to translate between Hanja, the outdated original Korean, contemporary Korean, and English, expecting positive transfer of knowledge among these languages.
We present a multilingual neural machine translation model that translates Hanja historical documents to contemporary Korean, which we refer to as H2KE. By exploiting extra resources, H2KE translates Hanja into contemporary Korean significantly better than approaches that rely solely on the parallel corpus of newly translated Korean and Hanja. We measure the perplexity with a large-scale language model trained on contemporary Korean, called KoGPT (Kim et al., 2021), to show that translations from our model are more similar to contemporary Korean than the old Korean translations from the original translation effort. These results are further confirmed by human evaluation, where both experts and non-experts prefer our model's translation over the original translation in old Korean. Using H2KE, we translated the remaining AJD to contemporary Korean as well as English and are releasing it publicly at https://juheeuu.github.io/h2ke-demo.
Our main contributions include:
• We propose a transfer learning method for translating AJD to contemporary Korean and English with a small training corpus.
• We conduct thorough human evaluation, where experts find that our generated translations are more accurate and fluent than the original expert translations, and non-expert Korean speakers choose our translations as more easily understandable compared to the original translations.
• We translate the entire AJD to modern Korean and English and publicly release the translations for easier access to the resources.

Neural Machine Translation for the Annals of the Joseon Dynasty
To translate AJD with neural networks, Park et al. (2020) propose a new subword tokenization method called share-vocabulary-and-entity-restriction byte-pair encoding. Kang et al. (2021) present a multitask learning approach that simultaneously restores and translates historical documents. For the restoration task, they use the untranslated Diaries of the Royal Secretariat (DRS), another Korean historical corpus written in Hanja. For translation, they only focus on translating Hanja into old Korean using the outdated AJD corpus. In contrast to these earlier approaches, ours supports translation into both contemporary Korean and English, while benefiting from the larger Hanja-old Korean parallel corpus.

The Annals of the Joseon Dynasty
The Annals of the Joseon Dynasty (AJD), also called the Veritable Records of the Joseon Dynasty, is an old and vast collection of historical documents from the Joseon Dynasty, covering the period from 1392 to 1864 on the Korean peninsula. It records 472 years of the reigns of 25 rulers of the Joseon Dynasty. It covers diverse historical events and is known for high integrity and credibility in its description of these events, making it invaluable as a historical record. The dataset is available at 'the Veritable Records of the Joseon Dynasty' website run by the National Institute of Korean History (NIKH). AJD was originally written in Hanja, the writing system of ancient Korea, whose characters and syntactic structures are entirely different from those of contemporary Korean. Hanja stemmed from traditional Chinese, but its lexical, semantic, and syntactic characteristics changed to reflect the cultural differences between the Joseon Dynasty and the ancient kingdoms of China.

Translated Datasets
AJD was initially translated from Hanja to Korean during 1968-1993. Of the new translation efforts, only the Annals of King Jeongjo (AKJ) have so far been re-translated into contemporary Korean, and only the Annals of the 4th King Sejong (AKS) have been translated into English; the latter is available from http://esillok.history.go.kr/. These translation projects are each expected to take about two more decades.
In Table 2 we list these corpora and their statistics.As discussed earlier, the corpora for contemporary Korean and English are substantially smaller than those for old Korean.
Method

H2KE is a model that learns to translate historical documents written in Hanja to contemporary Korean and English. We use the multilingual neural machine translation (MNMT) approach, which enables translation between multiple languages with a single model (Johnson et al., 2017; Firat et al., 2016).
Multilingual Translation Approach. Our dataset consists of 〈source, target〉 pairs of 〈Hanja, oKo〉, 〈Hanja, cKo〉, 〈Hanja, English〉, 〈oKo, cKo〉, and 〈oKo, English〉. We append a special target-language token (either <oKo>, <cKo>, or <En>) in front of each source sentence. We train a model on all these examples, shuffled randomly, presenting one pair of sentences at a time. Figure 1 illustrates the overall translation pipeline. With this approach, the model can benefit from the large amount of 〈Hanja, oKo〉 data to improve translation quality on the lower-resource target-language pairs, 〈Hanja, cKo〉 and 〈Hanja, English〉. Both the original translation of AJD and the new translation of AKJ are available at https://db.itkc.or.kr/.
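The tagging scheme can be sketched as follows; this is a minimal illustration with toy sentence pairs, not our actual data pipeline:

```python
import random

# Target-language tokens from the paper; the pair data below is toy.
LANG_TOKENS = {"oKo": "<oKo>", "cKo": "<cKo>", "En": "<En>"}

def tag_pairs(pairs):
    """Prepend the target-language token to each source sentence."""
    return [(f"{LANG_TOKENS[tgt_lang]} {src}", tgt) for src, tgt, tgt_lang in pairs]

# Toy examples standing in for real 〈Hanja, cKo〉 and 〈Hanja, En〉 pairs.
pairs = [
    ("上御經筵", "임금이 경연에 나아갔다", "cKo"),
    ("上御經筵", "The King attended the Royal Lecture.", "En"),
]
stream = tag_pairs(pairs)
random.shuffle(stream)  # all pairs mixed into one training stream
```

Because the language token is simply the first token of the source sequence, a single encoder-decoder can handle every language pair.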
Training and Inference. We use the Transformer model (Vaswani et al., 2017) to implement H2KE. We optimize the following loss for training:

L(θ) = -∑_{n=1}^{N} log p_θ(y^{(n)} | x^{(n)}, tok^{(n)}),  (1)

where there are N training examples, and each example is tagged with the target-side language token tok^{(n)} ∈ {<oKo>, <cKo>, <En>}.
For generation, we use beam search and translate the Hanja sentences into the language specified by the target-language token. We generate and evaluate sentences in the target languages, English (EN) and contemporary Korean (cKo), with either Hanja or the original Korean translation (oKo) as source sentences.

Experiments and Results

Data Preprocessing and Training Settings
We use the unigram language model tokenizer (Kudo, 2018) provided by Google's SentencePiece library. In order to use one shared vocabulary between the source and target languages, we tokenize the entire corpus together, including Hanja, oKo, cKo, and EN. We limit the size of the vocabulary to 32K. Out-of-vocabulary tokens are replaced with UNK (unknown) tokens. We use the hyperparameters recommended by Vaswani et al. (2017).
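As a stdlib-only sketch of the shared-vocabulary setup (whitespace tokens standing in for SentencePiece subwords), vocabulary construction and UNK replacement look like this:

```python
from collections import Counter

def build_shared_vocab(corpora, max_size=32000):
    """One vocabulary built over all corpora (Hanja, oKo, cKo, EN) together."""
    counts = Counter(tok for corpus in corpora for sent in corpus for tok in sent.split())
    vocab = {"<unk>": 0}  # out-of-vocabulary tokens map to <unk>
    for tok, _ in counts.most_common(max_size - 1):
        vocab[tok] = len(vocab)
    return vocab

def encode(sent, vocab):
    return [vocab.get(tok, vocab["<unk>"]) for tok in sent.split()]

# Toy corpora; the real setup trains one SentencePiece unigram model
# on all four corpora concatenated, capped at 32K subwords.
vocab = build_shared_vocab([["임금이 경연에 나아갔다"], ["上 御 經筵"]])
ids = encode("임금이 모르는단어 나아갔다", vocab)  # unseen token -> 0 (<unk>)
```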
We train and evaluate models using Fairseq (Ott et al., 2019). We average the five best checkpoints on the validation data to obtain the final model to be tested on the test set.
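Checkpoint averaging itself is an element-wise mean over parameters; a stdlib sketch (Fairseq ships an equivalent script, and real checkpoints hold tensors rather than lists of floats):

```python
def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping parameter name -> list of floats."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }

# Averaging the five best validation checkpoints smooths noise in the
# final weights; shown here with two toy "checkpoints".
avg = average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}])
```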

Translation Quality
We train models with different dataset combinations and measure the BLEU score (Papineni et al., 2002). To measure the Korean BLEU score, we follow the protocol from WAT 2019 (Nakazawa et al., 2019) and use the Mecab-ko tokenizer and Sacrebleu (Post, 2018). For English, we use Sacrebleu.
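For reference, corpus-level BLEU combines modified n-gram precisions (n = 1..4) with a brevity penalty. A compact sketch, omitting Sacrebleu's tokenization and smoothing:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    matches = [0] * max_n  # clipped n-gram matches
    totals = [0] * max_n   # n-grams in the hypotheses
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += sum(h.values())
    if 0 in matches:
        return 0.0  # no smoothing in this sketch
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```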
Table 3 shows the BLEU score for each case. Overall, utilizing 〈Hanja, oKo〉 pairs brings significant improvement in the low-resource translations (to cKo or EN). However, performance degrades when adding unrelated target-language pairs to the translation from Hanja. Since the encoder already learns expressive representations for Hanja from the plentiful training samples, adding pairs with different target languages instead hinders the representation learning of the source language, Hanja.
A Commercial Translation Engine. We first compare our models to a Korean-specialized commercial translation service, Papago (Lee et al., 2016). Although Papago was never trained to translate Hanja into modern Korean or English, we can force it to do so by asking it to translate from Taiwanese Mandarin (zh-TW), which shares a large set of characters with Hanja. According to row (A) in Table 3, Papago simply fails to translate Hanja documents properly, evident from the significantly low BLEU scores in both contemporary Korean and English.
Original Korean Translation. Although there is no preceding work on translating Hanja into either contemporary Korean or English, Kang et al. (2021) recently demonstrated the effectiveness of neural machine translation for translating Hanja into old Korean. We thus compare our approach against theirs on Hanja-to-old-Korean translation. For a fair comparison, we only use the 〈Hanja, oKo〉 corpus and train an H2KE-base with only 65M parameters.
As shown in row group (B) of Table 3, the proposed H2KE-base achieves a BLEU score 5 points higher than Kang et al. (2021). We attribute this improvement to the vocabulary-sharing strategy and the use of the Transformer. Without vocabulary sharing, the model achieves a BLEU score of 45.09. When we try a larger model, H2KE-big with 213M parameters, we achieve even better translation quality. We thus stick to H2KE-big in the rest of the experiments.
Contemporary Korean Translation. The first row in row group (C) of Table 3 shows that the model trained with only a small amount of 〈Hanja, cKo〉 and 〈oKo, cKo〉 pairs results in low BLEU scores. However, adding the 〈Hanja, oKo〉 parallel corpus dramatically improves translation quality for the cKo translations, evident from the 20-30 point increase in BLEU scores. This confirms the effectiveness of the multilingual training we hypothesized earlier.
When we treat the original Korean (oKo) translation as the hypothesis and compare it against the ground-truth contemporary Korean (cKo) reference, we obtain a BLEU score of 39.74. This score is lower than that of H2KE's cKo translation. This strongly suggests that the generated translations from our system are more similar to cKo than the experts' ground-truth oKo translations, fulfilling the goal of producing a machine translation system for contemporary Korean.
English Translation. According to the results in row (D) of Table 3, we observe a similar trend when we use H2KE for translating Hanja into English. We gain significant improvement in translation quality by including the 〈Hanja, oKo〉 corpus during training. Finally, in the last row (E) of Table 3, we demonstrate that a single H2KE-big model can be trained on all the corpora and can competitively translate Hanja into old Korean, contemporary Korean, and English.

How Contemporary Is the Contemporary Korean Translation?
Perplexity (Horgan, 1995) is the standard metric for measuring the performance of a language model, and it has recently been used by Lazaridou et al. (2021) to measure the deterioration of a language model over time. To identify the differences and similarities between the AJD translations produced by different methods and the modern Korean language, we calculate the perplexity of the translations in the test set under a Korean pre-trained GPT, KoGPT (Kim et al., 2021), using the Huggingface framework (Wolf et al., 2020). We use the H2KE-big model (Table 3, row group (B)) as the proposed approach.

Per-system Perplexity. Figure 2 draws each corpus' perplexity as a box plot. There is a significant perplexity difference between the ground-truth cKo (gt-cKo) and oKo (gt-oKo), which means the gt-cKo translation is closer to the modern language than gt-oKo. Our generated translations result in a lower perplexity than gt-oKo and Kang et al. (2021); they are close to the modern language, similarly to gt-cKo.
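Concretely, perplexity is the exponentiated negative mean log-probability a language model assigns to the tokens of a translation; the lower it is under KoGPT, the more the text resembles contemporary Korean. A minimal sketch over per-token probabilities:

```python
import math

def perplexity(token_probs):
    """token_probs: probabilities a language model assigns to each token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# A model that is maximally unsure among 4 equally likely tokens has
# perplexity 4; confident predictions drive perplexity toward 1.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```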
Pairwise Evaluation. Because the translations are associated with the same source sentences, we can compare each pair of systems by fitting a Bradley-Terry (BT) model (Peyrard et al., 2021; Bradley and Terry, 1952). The BT model estimates the probability that one system is better than another based on how frequently the former scores better. We report the estimated probabilities, P(ppl(A) < ppl(B)), in Table 4. H2KE is more similar to contemporary Korean than either the ground-truth oKo or Kang et al. (2021), with probability 0.72. As anticipated, ground-truth cKo is significantly more similar to contemporary Korean than both ground-truth oKo and the baseline. Between H2KE and the ground-truth cKo, we do not observe a significant difference in this evaluation, implying that the proposed H2KE's translations are almost on par with cKo in terms of how probable they are under a language model trained on contemporary Korean. This observation agrees with our earlier observation on the absolute evaluation.
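The BT fit can be sketched with standard minorization-maximization updates on pairwise win counts; the counts below are toy, whereas the paper fits the model on per-sentence perplexity comparisons:

```python
def bradley_terry(wins, n_iter=100):
    """wins[a][b]: number of comparisons in which system a beat system b."""
    systems = list(wins)
    s = {a: 1.0 for a in systems}  # latent strength of each system
    for _ in range(n_iter):
        new = {}
        for a in systems:
            w_a = sum(wins[a].values())
            denom = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (s[a] + s[b])
                for b in systems if b != a
            )
            new[a] = w_a / denom if denom else s[a]
        s = new
    return s

# Toy counts: "H2KE" wins 3 of 4 comparisons against "gt-oKo".
s = bradley_terry({"H2KE": {"gt-oKo": 3}, "gt-oKo": {"H2KE": 1}})
p = s["H2KE"] / (s["H2KE"] + s["gt-oKo"])  # estimate of P(ppl(H2KE) < ppl(gt-oKo))
```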

Human Evaluation
We conduct a human evaluation of the Korean translations to confirm that H2KE's translations are both more understandable and more accurate than the ground-truth oKo. We use Direct Assessment (DA) (Graham et al., 2013, 2014, 2017) as the primary method for evaluating translation systems, where crowd-sourced bilingual human assessors are asked to rate a translation, given the source sentences, by how adequately it expresses the meaning of the sentences on an analog scale (Akhbardeh et al., 2021).
We cannot, however, adopt the crowd-sourced DA approach as is, because only a few historians can evaluate the meaning of translations by interpreting Hanja. We thus work together with ITKC and ask their experts to evaluate our generated translations according to their internal evaluation criteria. This is the same procedure used to ensure the quality of human translations at ITKC. Additionally, we conduct another evaluation to confirm whether the new Korean translation improves the understanding of historical documents for non-expert Korean speakers.

Expert Evaluation
Evaluation Protocol. At ITKC, the evaluation criteria for historical documents are divided into accuracy and fluency. Along each of these aspects, scores are deducted for errors, and the amount of each deduction is determined by the severity of the error. For accuracy, 5, 10, and 15 points are deducted for word-level, phrase-level, and sentence-level errors, respectively. For fluency, 5 points are deducted for a word-level error. We randomly select 45 test samples from the Annals of Jeongjo for evaluation, with each sample's length capped at 100 Hanja characters. We ask six experts from ITKC to score both the ground-truth translations and the machine-generated translations. Each sample is evaluated by two experts, and we report the average score. When there is significant disagreement between the two experts, the score is adjusted through their discussion.

Evaluation Result. Figure 3 shows the average deducted scores for all three cases, along both accuracy and fluency. As anticipated, the ground-truth cKo samples exhibit the least deduction, implying that these new translations are indeed free of serious translation errors and better translated. On the other hand, the ground-truth oKo samples received the most deduction, which was expected, as their low readability and errors motivated the re-translation of AJD in the first place. Our samples received more deduction than the ground-truth cKo, but were perceived to be significantly better than the ground-truth oKo. In particular, we observed a significant improvement over the original Korean translations in terms of fluency. This outcome confirms the potential utility of the proposed machine translation approach for re-translating the entire AJD as well as other historical Hanja documents.
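The deduction scheme amounts to summing per-error penalties; a sketch with a simplified error taxonomy (the full criteria appear in Table 9):

```python
# Point deductions per error, following the protocol described above.
ACCURACY_DEDUCTION = {"word": -5, "phrase": -10, "sentence": -15}
FLUENCY_DEDUCTION = {"word": -5}

def deducted_scores(errors):
    """errors: list of (aspect, level) tuples identified by an expert."""
    accuracy = sum(ACCURACY_DEDUCTION[lvl] for asp, lvl in errors if asp == "accuracy")
    fluency = sum(FLUENCY_DEDUCTION[lvl] for asp, lvl in errors if asp == "fluency")
    return accuracy, fluency

# One phrase-level accuracy error and one word-level fluency error.
scores = deducted_scores([("accuracy", "phrase"), ("fluency", "word")])
```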

Non-expert Evaluation
Evaluation Protocol. To compare the general public's perception of the three translation types (gt-oKo, gt-cKo, and H2KE), we recruit 36 Korean speakers and ask them to make pairwise comparisons of readability. Given a triplet 〈gt-oKo, gt-cKo, H2KE〉 of translations of the same Hanja paragraph, we choose a random pair to give to each evaluator: either 〈gt-cKo, H2KE〉, 〈gt-cKo, gt-oKo〉, or 〈H2KE, gt-oKo〉. Evaluators have the option of 'no difference', although we encourage them to avoid it as much as possible. We use 150 triplets 〈gt-oKo, gt-cKo, H2KE〉 (450 pairs in total) from AKJ, and 150 pairs 〈gt-oKo, H2KE〉 from the annals of all the other kings ('others', in short), for which we do not have ground-truth contemporary Korean translations. Each evaluator compares 50 pairs, and each pair is assigned three evaluators. There are 12 different survey sheets of 50 pairs each, and each survey is answered by three evaluators independently. Details about the evaluation samples and the statistics of the evaluators are in Appendix E.

Evaluation Result. We use the majority vote among the three evaluators' responses to decide the winner of each pair. When the three opinions are divided among A, B, and no difference, we treat the pair as 'no difference'. In Figure 4 we present the mean and the standard deviation of the win rates.
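The aggregation rule can be sketched directly: three votes per pair, and a three-way split counts as a tie.

```python
from collections import Counter

def majority_vote(votes):
    """Return the majority choice, or 'no difference' on a 3-way split."""
    winner, freq = Counter(votes).most_common(1)[0]
    return winner if freq >= 2 else "no difference"

decision = majority_vote(["H2KE", "H2KE", "gt-oKo"])  # two of three agree
```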
The results from AKJ show that gt-cKo is, unsurprisingly, considered easier to understand than gt-oKo, with a win rate of 77.3%. This further emphasizes the importance and necessity of a new translation of AJD for the general public. The proposed H2KE's translations were considered more readable than oKo in AKJ with a win rate of 58.0%, confirming the readability improvement; this was also observed with the annals of the other kings. When compared against gt-cKo, gt-cKo was preferred with a probability of 52.0%, implying that there is room for improvement in the future.
Further Analysis

Sample-Level Analysis of Korean Translations
The human evaluation confirmed that H2KE significantly improves the readability and quality of the translation compared to the original oKo translations. In this section, we conduct a finer-grained analysis. First, we measure how many undesirable transliterations of Hanja words are eliminated by H2KE. These transliterations are often marked in the corpus with their corresponding Hanja words surrounded by parentheses. We thus construct the archaic Hanja-based word set by subtracting gt-cKo's Hanja-based word set from gt-oKo's. Among the detected transliterations, the proposed H2KE replaces 75% with more understandable contemporary translations.
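Detecting these marked transliterations can be sketched with a regular expression that matches a Hangul word immediately followed by parenthesized Hanja (the strings below are toy examples):

```python
import re

# Hangul syllables followed by CJK ideographs in parentheses,
# e.g. '기경(起耕)' -> ('기경', '起耕').
TRANSLIT = re.compile(r"([\uac00-\ud7a3]+)\(([\u4e00-\u9fff]+)\)")

def hanja_based_words(text):
    return {hangul for hangul, _ in TRANSLIT.findall(text)}

# Archaic words: marked in gt-oKo but absent from gt-cKo.
archaic = hanja_based_words("기경(起耕)한 도장(導掌)") - hanja_based_words("도장(導掌)")
```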
Table 5 illustrates one sample text in Hanja, the ground-truth oKo, cKo, and H2KE. The color boxes represent transliterated Hanja words. Words that have the same semantic meaning and correspond to each other across the different translations are grouped by color. The ground-truth oKo contains many literal translations, i.e., near-transliterations, identified by parentheses, and there is even a new Hanja word (起耕) added by the human translator. Compared to gt-oKo, H2KE and gt-cKo replace most of these difficult translations with more easily understood ones; these are marked with †. On the other hand, for a proper noun that is supposed to be transliterated, H2KE correctly preserves this behaviour; see Dojang (導掌), marked with *, which is the name of an institute. In some cases, we notice that H2KE generates a translation that is even more readable and more contemporary than the ground-truth contemporary Korean, such as the one marked with §.

Sample-Level Analysis of English Translation
(English translation from Table 5) It is too bad that the Dojang* excessively collected† the tax outside the regulations†. It is even more surprising that the old land§ was regarded as cultivated land and was collected for no reason†. Look at the provisions of the law† and let them deal with it strictly.

(Example from Table 6)
gt-En: Frost appeared and the King attended the Royal Lecture.
H2KE: Frost covered the ground. The King attended the Royal Lecture.
Papago: It frosted. I went on to the contest.

We make three observations from the results in Table 7. First, H2KE-cKo produces translations of high quality, evident from BLEU scores above 30. Second, H2KE-cKo performs favourably compared to H2KE-oKo, which further confirms that H2KE-cKo is capable of producing translations in contemporary Korean. Finally, we observe that our approach works substantially better than the baseline, which may be due to missing punctuation marks, although we leave a more detailed analysis to future work.

Conclusion
We present H2KE, a neural machine translation system for the AJD that translates from Hanja to contemporary Korean and English. H2KE is built on top of MNMT to overcome the problem of low-resource training data. H2KE shows a significantly higher BLEU score than the baseline and a current commercial translation system. Based on the perplexity evaluation with KoGPT, the translation samples from H2KE are closer to the contemporary Korean corpus than the ground-truth original Korean translations and the baseline. The human evaluation results show that the translation samples from H2KE are more accurate and understandable than the ground-truth original Korean. Finally, we translate the entire AJD into contemporary Korean and English with H2KE and publicly release the translations.
In this work, we provide strong evidence that existing algorithms for machine translation and natural language processing generalize to a scenario where the data span several centuries of an archaic language. This leads to a deeper understanding of existing algorithms and significantly extends the scope of previous studies.

Limitations
The Annals of the Joseon Dynasty (AJD) were written over the course of about 500 years, so naturally Hanja underwent change during this long period. Capturing this temporal change would result in a better-performing model. On a related note, some entities, such as locations, and some linguistic expressions may have disappeared altogether, and we simply would not be able to express them in today's language without lengthy explanations. In the non-expert evaluation, some of the surveys reported low inter-annotator agreement because there were only three annotators per question and the evaluation of readability is subjective. The range of the non-experts' prior knowledge of Korean history varies widely, and this also affects inter-annotator agreement.

A Translation Samples
A.1 Annals of King Jeongjo (AKJ)

D Expert Evaluation
Table 9 shows part of ITKC's criteria for evaluating Korean translations of historical documents written in Hanja. We directly adopt these criteria for our expert evaluation.

E Non-expert Evaluation
Figure 5 shows an example question from the non-expert evaluation. The average length of the evaluated samples is about 300 Korean characters, including spaces. The ages of the non-expert evaluators range from 21 to 37, with an average of 24. This implies that the evaluators are more familiar with the modern Korean of the 21st century (when AJD is being newly translated) than with the old Korean of the 20th century (when AJD was first translated).

Table 10: The winning rate in pairwise perplexity comparison of our models, the ground-truth samples, and the baseline model.

(Translation sample, English) Yangsa said, "We ask to apply the law to make the wife and children slaves and to confiscate the family property of the traitor Lee Chan, as in the document from the State Tribunal, and to enforce the law as soon as possible on Hong Gye-neung as well," but it was not granted.

Figure 1 :
Figure 1: H2KE works with multiple language pairs by prepending a target-language token to each source sentence during training and inference.

Figure 3 :
Figure 3: Average deducted score per translation type, as judged by experts. Experts identified errors in the translations and subtracted points according to the evaluation criteria.

Figure 4 :
Figure 4: Results of the pairwise comparison of readability by non-expert Korean speakers. The bars on each side represent the win (more understandable) rates against the other side, and the white bars in between indicate the tie rates. Each error bar indicates the standard deviation of win rates among the different survey sheets.

Table 2 :
Statistics of our dataset. For the entire AJD, there are 〈Hanja, oKo〉 pairs. For the Annals of King Jeongjo, we also have contemporary Korean translations, and for the Annals of King Sejong, we have English translations. The last column indicates the ratio of each dataset relative to the total AJD.
The dataset was uploaded and publicly released by the Institute for the Translation of Korean Classics (ITKC).

Table 3 :
Test results of our model on different training dataset combinations. The circles indicate the kings' annals and the language pairs of the training data. The BLEU score for one target language can be measured with different source languages.

Table 3 (B): the configuration used as the proposed approach.

Table 6 shows English translation examples from Hanja and oKo. Because Papago is not aware of the historical context, it translates the word '경연' (Royal Lecture) into its homonym, a 'contest.' In contrast, our model correctly translates it as 'Royal Lecture.'

Table 5 :
Translation examples of the ground-truth oKo, cKo, and En, and our generated cKo translation. The parenthesized words are literal translations of the original Hanja words. A shared color box marks a group of words with the same semantic meaning. * indicates a proper noun, for which literal translation is allowed. † marks cases where gt-cKo and H2KE-cKo eliminate the literal translation. § marks a word for which only our model generates a more understandable translation.

Table 6 :
English translation examples from the test set of the Annals of Sejong (4th King). Our generated sample is translated from Hanja, and the Papago sample is translated from the ground-truth oKo.

Table 7 :
BLEU score of translations on DRRI.

Table 11 shows additional translation samples. Table 10 represents the winning rate in the pairwise perplexity comparison. Consistent with the BT comparison in Table 4, the translation samples from H2KE are closer to gt-cKo than gt-oKo and the baseline model. Samples that have the same perplexity are exactly the same, because of the short length of the source sentences.

Table 9 :
Evaluation criteria of ITKC for historical document translation.

Table 11 :
Translation samples of the Annals of King Jeongjo (AKJ).