Data Augmentation for Low-resource Word Segmentation and POS Tagging of Ancient Chinese Texts

Yutong Shen; Jiahuan Li; Shujian Huang (书剑 黄); Yi Zhou; Xiaopeng Xie; Qinxin Zhao

Abstract

Automatic word segmentation and part-of-speech tagging of ancient books can help researchers study ancient texts. In recent years, pre-trained language models have achieved significant improvements on text processing tasks. SikuRoberta is a pre-trained language model specially designed for automatic analysis of ancient Chinese texts. Although SikuRoberta significantly boosts performance on word segmentation (WSG) and part-of-speech (POS) tagging of ancient Chinese texts, the lack of labeled data still limits the model's performance. In this paper, to alleviate the problem of insufficient training data, we define hybrid tags that integrate the WSG and POS tasks and design a Roberta-CRF model to predict a tag for each Chinese character. Moreover, we generate synthetic labeled data with an LSTM language model. To further mine the knowledge in SikuRoberta, we generate synthetic unlabeled data with the masked LM. Experiments show that the model's performance improves with the synthetic data, indicating the effectiveness of the data augmentation methods.
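
The abstract does not spell out the hybrid tag set, so the following is only a minimal sketch of one common way to fold segmentation and POS into a single per-character label. It assumes a BMES boundary prefix joined to the word's POS label with a hyphen; the function names `to_hybrid_tags` and `from_hybrid_tags` and the POS labels `n`, `a`, `v` in the example are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def to_hybrid_tags(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into per-character (char, hybrid_tag) pairs.

    Assumed scheme (not confirmed by the paper): B/M/E mark the first,
    middle, and last character of a multi-character word, S marks a
    single-character word, and the POS label is appended after a hyphen.
    """
    tagged = []
    for word, pos in words:
        if len(word) == 1:
            tagged.append((word, f"S-{pos}"))
            continue
        for i, ch in enumerate(word):
            if i == 0:
                prefix = "B"
            elif i == len(word) - 1:
                prefix = "E"
            else:
                prefix = "M"
            tagged.append((ch, f"{prefix}-{pos}"))
    return tagged


def from_hybrid_tags(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Recover (word, pos) pairs from a well-formed hybrid tag sequence."""
    words = []
    buffer = ""
    for ch, tag in tagged:
        prefix, pos = tag.split("-", 1)
        buffer += ch
        # S and E both close the current word.
        if prefix in ("S", "E"):
            words.append((buffer, pos))
            buffer = ""
    return words


# Hypothetical example sentence with illustrative POS labels.
sentence = [("天下", "n"), ("大", "a"), ("乱", "v")]
hybrid = to_hybrid_tags(sentence)
```

With tags of this shape, a single sequence labeler (for instance an encoder followed by a CRF layer) can predict one label per character, and `from_hybrid_tags` decodes both the word boundaries and the POS labels from that one prediction.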

Anthology ID: 2022.lt4hala-1.26
Volume: Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Month: June
Year: 2022
Address: Marseille, France
Editors: Rachele Sprugnoli, Marco Passarotti
Venue: LT4HALA
Publisher: European Language Resources Association
Pages: 169–173
URL: https://aclanthology.org/2022.lt4hala-1.26/
Cite (ACL): Yutong Shen, Jiahuan Li, Shujian Huang, Yi Zhou, Xiaopeng Xie, and Qinxin Zhao. 2022. Data Augmentation for Low-resource Word Segmentation and POS Tagging of Ancient Chinese Texts. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 169–173, Marseille, France. European Language Resources Association.
Cite (Informal): Data Augmentation for Low-resource Word Segmentation and POS Tagging of Ancient Chinese Texts (Shen et al., LT4HALA 2022)
PDF: https://aclanthology.org/2022.lt4hala-1.26.pdf
