Automatic Word Segmentation and Part-of-Speech Tagging of Ancient Chinese Based on BERT Model

Yu Chang, Peng Zhu, Chaoping Wang, Chaofan Wang


Abstract
In recent years, new deep learning methods and pre-trained language models have emerged in the field of natural language processing (NLP). These methods and models can greatly improve the accuracy of automatic word segmentation and part-of-speech (POS) tagging in ancient Chinese research. Among these models, BERT achieved state-of-the-art results on the SQuAD 1.1 machine reading comprehension benchmark and also outperformed other models on 11 different NLP tasks. In this paper, we adopt the SIKU-RoBERTa pre-trained language model, which is based on the high-quality full-text corpus of SiKuQuanShu, and use a part of the ZuoZhuan corpus that has been word-segmented and POS-tagged as the training set to build a BERT-based deep network model for word segmentation and POS tagging experiments. We also run comparative experiments with other classical NLP network models. The results show that, with the SIKU-RoBERTa pre-trained language model, the overall prediction accuracy of this model reaches 93.87% for word segmentation and 88.97% for part-of-speech tagging, with excellent overall performance.
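To make the training setup concrete, the sketch below shows one common way a word-segmented and POS-tagged sentence can be turned into per-character labels for a BERT-style token classifier. The abstract does not state which labelling scheme the authors use, so the joint B/M/E/S-plus-POS scheme, the function names `to_char_labels` and `from_char_labels`, and the tag names in the sample are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumption: a joint scheme where each character is labelled with its
# position in the word (B = begin, M = middle, E = end, S = single) plus
# the POS tag of that word. The paper's abstract does not confirm this.


def to_char_labels(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Convert (word, pos) pairs into per-character (char, label) pairs."""
    labelled: List[Tuple[str, str]] = []
    for word, pos in words:
        if len(word) == 1:
            labelled.append((word, f"S-{pos}"))
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                prefix = "B"
            elif i == last:
                prefix = "E"
            else:
                prefix = "M"
            labelled.append((ch, f"{prefix}-{pos}"))
    return labelled


def from_char_labels(labelled: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rebuild (word, pos) pairs from well-formed per-character labels."""
    words: List[Tuple[str, str]] = []
    current = ""
    for ch, label in labelled:
        prefix, pos = label.split("-", 1)
        current += ch
        if prefix in ("E", "S"):  # a word closes on E or S
            words.append((current, pos))
            current = ""
    return words


# Hypothetical sample: the POS tag names are placeholders, not the
# tagset of the ZuoZhuan corpus.
sample = [("鄭伯", "n"), ("克", "v"), ("段", "n"), ("于", "p"), ("鄢", "n")]
char_labels = to_char_labels(sample)
restored = from_char_labels(char_labels)
```

With labels in this form, segmentation and POS tagging reduce to predicting one label per character, and the predicted labels can be decoded back into words with `from_char_labels`.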
Anthology ID:
2022.lt4hala-1.20
Volume:
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rachele Sprugnoli, Marco Passarotti
Venue:
LT4HALA
Publisher:
European Language Resources Association
Pages:
141–145
URL:
https://aclanthology.org/2022.lt4hala-1.20
Cite (ACL):
Yu Chang, Peng Zhu, Chaoping Wang, and Chaofan Wang. 2022. Automatic Word Segmentation and Part-of-Speech Tagging of Ancient Chinese Based on BERT Model. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 141–145, Marseille, France. European Language Resources Association.
Cite (Informal):
Automatic Word Segmentation and Part-of-Speech Tagging of Ancient Chinese Based on BERT Model (Chang et al., LT4HALA 2022)
PDF:
https://aclanthology.org/2022.lt4hala-1.20.pdf