Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

Yuxuan Wang, Jack Wang, Dongyan Zhao, Zilong Zheng


Abstract
We introduce CDBert, a new learning paradigm that enhances the semantic understanding ability of Chinese pre-trained language models (PLMs) with dictionary knowledge and the structure of Chinese characters. We name the two core modules of CDBert Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries and Jiezi refers to the process of enhancing characters' glyph representations with structure understanding. To facilitate dictionary understanding, we propose three pre-training tasks, i.e., Masked Entry Modeling, Contrastive Learning for Synonym and Antonym, and Example Learning. We evaluate our method on both the modern Chinese understanding benchmark CLUE and the ancient Chinese benchmark CCLUE. Moreover, we propose a new polysemy discrimination task, PolyMRC, based on the collected dictionary of ancient Chinese. Our paradigm demonstrates consistent improvements over previous Chinese PLMs across all tasks. In addition, our approach yields a significant boost in the few-shot setting of ancient Chinese understanding.
Anthology ID:
2023.findings-acl.70
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1089–1101
URL:
https://aclanthology.org/2023.findings-acl.70
DOI:
10.18653/v1/2023.findings-acl.70
Cite (ACL):
Yuxuan Wang, Jack Wang, Dongyan Zhao, and Zilong Zheng. 2023. Rethinking Dictionaries and Glyphs for Chinese Language Pre-training. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1089–1101, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Rethinking Dictionaries and Glyphs for Chinese Language Pre-training (Wang et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.70.pdf