Automatic Reconstruction of Ancient Chinese Pronunciations

Zhige Huang, Haoan Jin, Mengyue Wu, Kenny Zhu


Abstract
Reconstructing ancient Chinese pronunciation is a challenging task due to the scarcity of phonetic records. Unlike the comparative approaches of historical linguistics, we reformulate this problem as a temporal prediction task with masked language models, digitizing existing phonology rules into the ACP (Ancient Chinese Phonology) dataset of 70,943 entries for 17,001 Chinese characters. Using this dataset and Chinese character glyph information, our transformer-based model demonstrates superior performance on a series of reconstruction tasks, with or without prior phonological knowledge of the target historical period. Our work significantly advances the digitization and computational reconstruction of ancient Chinese phonology, providing a more complete and temporally contextualized resource for computational linguistics and historical research. The dataset and model training code are publicly available.
Anthology ID:
2024.findings-emnlp.325
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5689–5698
URL:
https://aclanthology.org/2024.findings-emnlp.325
Cite (ACL):
Zhige Huang, Haoan Jin, Mengyue Wu, and Kenny Zhu. 2024. Automatic Reconstruction of Ancient Chinese Pronunciations. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5689–5698, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Automatic Reconstruction of Ancient Chinese Pronunciations (Huang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.325.pdf