PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study

Yuqing Zhang, Baoyi He, Yihan Chen, Hangqi Li, Han Yue, Shengyu Zhang, Huaiyong Dou, Junchi Yan, Zemin Liu, Yongquan Zhang, Fei Wu


Abstract
Philology, the study of ancient manuscripts, demands years of professional training in ex-tensive knowledge memorization and manual textual retrieval. Despite these requirements align closely with strengths of recent successful Large Language Models (LLMs), the scarcity of high-quality, specialized training data has hindered direct applications. To bridge this gap, we curated the PhiloCorpus-ZH, a rich collec-tion of ancient Chinese texts spanning a millen-nium with 30 diverse topics, including firsthand folk copies. This corpus facilitated the develop-ment of PhiloGPT, the first LLM tailored for discovering ancient Chinese manuscripts. To effectively tackle complex philological tasks like restoration, attribution, and linguistic anal-ysis, we introduced the PhiloCoP framework. Modeled on the analytical patterns of philol-ogists, PhiloCoP enhances LLM’s handling of historical linguistic peculiarities such as phonetic loans, polysemy, and syntactic inver-sions. We further integrated these tasks into the PhiloBenchmark, establishing a new standard for evaluating ancient Chinese LLMs address-ing philology tasks. Deploying PhiloGPT in practical scenarios has enabled Dunhuang spe-cialists to resolve philology tasks, such as iden-tifying duplication of copied text and assisting archaeologists with text completion, demon-strating its potential in real-world applications.
Anthology ID:
2024.emnlp-main.163
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
2784–2801
Language:
URL:
https://aclanthology.org/2024.emnlp-main.163
DOI:
Bibkey:
Cite (ACL):
Yuqing Zhang, Baoyi He, Yihan Chen, Hangqi Li, Han Yue, Shengyu Zhang, Huaiyong Dou, Junchi Yan, Zemin Liu, Yongquan Zhang, and Fei Wu. 2024. PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2784–2801, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
PhiloGPT: A Philology-Oriented Large Language Model for Ancient Chinese Manuscripts with Dunhuang as Case Study (Zhang et al., EMNLP 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.emnlp-main.163.pdf