Jiahuan Cao
2024
TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models
Jiahuan Cao
|
Dezhi Peng
|
Peirong Zhang
|
Yongxin Shi
|
Yang Liu
|
Kai Ding
|
Lianwen Jin
Findings of the Association for Computational Linguistics: EMNLP 2024
Classical Chinese is a gateway to the rich heritage and wisdom of ancient China, yet its complexities pose formidable comprehension barriers for most modern people without specialized knowledge. While Large Language Models (LLMs) have shown remarkable capabilities in Natural Language Processing (NLP), they struggle with Classical Chinese Understanding (CCU), especially in data-demanding and knowledge-intensive tasks. In response to this dilemma, we propose TongGu (mean understanding ancient and modern), the first CCU-specific LLM, underpinned by three core contributions. First, we construct a two-stage instruction-tuning dataset ACCN-INS derived from rich classical Chinese corpora, aiming to unlock the full CCU potential of LLMs. Second, we propose Redundancy-Aware Tuning (RAT) to prevent catastrophic forgetting, enabling TongGu to acquire new capabilities while preserving its foundational knowledge. Third, we present a CCU Retrieval-Augmented Generation (CCU-RAG) technique to reduce hallucinations based on knowledge-grounding. Extensive experiments across 24 diverse CCU tasks validate TongGu’s superior ability, underscoring the effectiveness of RAT and CCU-RAG. The model and dataset are available at https://github.com/SCUT-DLVCLab/TongGu-LLM.
2023
Translating Ancient Chinese to Modern Chinese at Scale: A Large Language Model-based Approach
Jiahuan Cao
|
Dezhi Peng
|
Yongxin Shi
|
Zongyuan Jiang
|
Lianwen Jin
Proceedings of ALT2023: Ancient Language Translation Workshop
Recently, the emergence of large language models (LLMs) has provided powerful foundation models for a wide range of natural language processing (NLP) tasks. However, the vast majority of the pre-training corpus for most existing LLMs is in English, resulting in their Chinese proficiency falling far behind that of English. Furthermore, ancient Chinese has a much larger vocabulary and less available corpus than modern Chinese, which significantly challenges the generalization capacity of existing LLMs. In this paper, we investigate the Ancient-Chinese-to-Modern-Chinese (A2M) translation using LLMs including LLaMA and Ziya. Specifically, to improve the understanding of Chinese texts, we explore the vocabulary expansion and incremental pre-training methods based on existing pre-trained LLMs. Subsequently, a large-scale A2M translation dataset with 4M pairs is utilized to finetune the LLMs.Experimental results demonstrate the effectiveness of the proposed method, especially with Ziya-13B, in translating ancient Chinese to modern Chinese. Moreover,we deeply analyze the performance of various LLMs with different strategies, which we believe can benefit further research on LLM-based A2M approaches.
Search
Co-authors
- Dezhi Peng 2
- Yongxin Shi 2
- Lianwen Jin 2
- Peirong Zhang 1
- Yang Liu 1
- show all...