HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea
Haneul Yoo | Jiho Jin | Juhee Son | JinYeong Bak | Kyunghyun Cho | Alice Oh
Findings of the Association for Computational Linguistics: NAACL 2022
Historical records in Korea before the 20th century were primarily written in Hanja, an extinct language based on Chinese characters and not understood by modern Korean or Chinese speakers. Historians with expertise in this time period have been analyzing the documents, but that process is very difficult and time-consuming, and language models would significantly speed up the process. Toward building and evaluating language models for Hanja, we release the Hanja Understanding Evaluation dataset consisting of chronological attribution, topic classification, named entity recognition, and summary retrieval tasks. We also present BERT-based models continued training on the two major corpora from the 14th to the 19th centuries: the Annals of the Joseon Dynasty and Diaries of the Royal Secretariats. We compare the models with several baselines on all tasks and show there are significant improvements gained by training on the two corpora. Additionally, we run zero-shot experiments on the Daily Records of the Royal Court and Important Officials (DRRI). The DRRI dataset has not been studied much by the historians, and not at all by the NLP community.
Translating Hanja Historical Documents to Contemporary Korean and English
Juhee Son | Jiho Jin | Haneul Yoo | JinYeong Bak | Kyunghyun Cho | Alice Oh
Findings of the Association for Computational Linguistics: EMNLP 2022
The Annals of Joseon Dynasty (AJD) contain the daily records of the Kings of Joseon, the 500-year kingdom preceding the modern nation of Korea.The Annals were originally written in an archaic Korean writing system, ‘Hanja’, and were translated into Korean from 1968 to 1993.The resulting translation was however too literal and contained many archaic Korean words; thus, a new expert translation effort began in 2012. Since then, the records of only one king have been completed in a decade.In parallel, expert translators are working on English translation, also at a slow pace and produced only one king’s records in English so far.Thus, we propose H2KE, a neural machine translation model, that translates historical documents in Hanja to more easily understandable Korean and to English.Built on top of multilingual neural machine translation, H2KE learns to translate a historical document written in Hanja, from both a full dataset of outdated Korean translation and a small dataset of more recently translated contemporary Korean and English.We compare our method against two baselines:a recent model that simultaneously learns to restore and translate Hanja historical documentand a Transformer based model trained only on newly translated corpora.The experiments reveal that our method significantly outperforms the baselines in terms of BLEU scores for both contemporary Korean and English translations.We further conduct extensive human evaluation which shows that our translation is preferred over the original expert translations by both experts and non-expert Korean speakers.
Knowledge-Enhanced Evidence Retrieval for Counterargument Generation
Yohan Jo | Haneul Yoo | JinYeong Bak | Alice Oh | Chris Reed | Eduard Hovy
Findings of the Association for Computational Linguistics: EMNLP 2021
Finding counterevidence to statements is key to many tasks, including counterargument generation. We build a system that, given a statement, retrieves counterevidence from diverse sources on the Web. At the core of this system is a natural language inference (NLI) model that determines whether a candidate sentence is valid counterevidence or not. Most NLI models to date, however, lack proper reasoning abilities necessary to find counterevidence that involves complex inference. Thus, we present a knowledge-enhanced NLI model that aims to handle causality- and example-based inference by incorporating knowledge graphs. Our NLI model outperforms baselines for NLI tasks, especially for instances that require the targeted inference. In addition, this NLI model further improves the counterevidence retrieval system, notably finding complex counterevidence better.
- JinYeong Bak 3
- Alice Oh 3
- Jiho Jin 2
- Juhee Son 2
- Kyunghyun Cho 2
- show all...