HanTrans: An Empirical Study on Cross-Era Transferability of Chinese Pre-trained Language Model

Chin-Tung Lin, Wei-Yun Ma


Abstract
Pre-trained language models have recently dominated most downstream tasks in NLP. In particular, Bidirectional Encoder Representations from Transformers (BERT) is the most iconic pre-trained language model across NLP tasks, and its masked language modeling (MLM) objective has become an indispensable part of existing pre-trained language models. These strong downstream results benefit directly from the large corpora used in the pre-training stage. However, the pre-training corpus for modern Traditional Chinese remains small, and ancient Chinese is still almost entirely absent from the pre-training stage. We address this problem by converting annotated ancient Chinese data into a BERT-style training corpus and propose a pre-trained Oldhan Chinese BERT model for the NLP community. Our proposed model outperforms the original BERT model, significantly reducing perplexity in masked language modeling, and our fine-tuned models improve F1 scores on word segmentation and part-of-speech tagging. We then comprehensively study the zero-shot cross-era ability of the BERT model. Finally, we visualize and investigate personal pronouns in the embedding space of ancient Chinese records from four eras. We have released our code at https://github.com/ckiplab/han-transformers.
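As a minimal sketch of how such a released checkpoint could be exercised for the masked language modeling task described above, the snippet below loads a model with the Hugging Face transformers library and predicts a masked character in a Classical Chinese sentence. The model identifier "ckiplab/bert-base-han-chinese" is an assumption based on the linked repository; substitute whichever checkpoint from the release you actually want to evaluate.

```python
# Sketch: masked-character prediction with an assumed Han Chinese BERT checkpoint.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "ckiplab/bert-base-han-chinese"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask pipeline: rank candidate characters for the [MASK] position
# in a Classical Chinese line.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for candidate in fill_mask("子曰：學而時習之，不亦[MASK]乎。"):
    print(candidate["token_str"], round(candidate["score"], 4))
```

The same repository also describes fine-tuned variants for word segmentation and part-of-speech tagging, which can be loaded analogously as token-classification models.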
Anthology ID:
2022.rocling-1.21
Volume:
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)
Month:
November
Year:
2022
Address:
Taipei, Taiwan
Editors:
Yung-Chun Chang, Yi-Chin Huang
Venue:
ROCLING
Publisher:
The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)
Pages:
164–173
Language:
Chinese
URL:
https://aclanthology.org/2022.rocling-1.21
Cite (ACL):
Chin-Tung Lin and Wei-Yun Ma. 2022. HanTrans: An Empirical Study on Cross-Era Transferability of Chinese Pre-trained Language Model. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), pages 164–173, Taipei, Taiwan. The Association for Computational Linguistics and Chinese Language Processing (ACLCLP).
Cite (Informal):
HanTrans: An Empirical Study on Cross-Era Transferability of Chinese Pre-trained Language Model (Lin & Ma, ROCLING 2022)
PDF:
https://aclanthology.org/2022.rocling-1.21.pdf