Two Sequence Labeling Approaches to Sentence Segmentation and Punctuation Prediction for Classic Chinese Texts

Xuebin Wang, Zhenghua Li


Abstract
This paper describes our system for the EvaHan2024 shared task. We design and experiment with two sequence labeling approaches: a one-stage approach and a two-stage approach. The one-stage approach directly predicts a label for each character, and a label may contain multiple punctuation marks. The two-stage approach divides punctuation marks into two classes, pause and non-pause, and handles them separately via two sequence labeling processes, so that each label contains at most one punctuation mark. We use pre-trained SikuRoBERTa as a key component of the encoder and employ a conditional random field (CRF) layer on top. According to the evaluation metrics adopted by the organizers, the two-stage approach is superior to the one-stage approach, and our system ranks second among all participating systems.
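The difference between the two labeling schemes can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the membership of the `PAUSE` and `NON_PAUSE` sets, the `O` tag for "no mark", the rule that marks attach to the preceding character, and the function names. The abstract does not say how the at-most-one-mark constraint is enforced, so the sketch only filters marks by class and does not enforce it.

```python
# Hypothetical mark inventories; the abstract does not list the actual classes.
PAUSE = set("，。、；：？！")      # assumed pause marks
NON_PAUSE = set("“”‘’《》")        # assumed non-pause marks
MARKS = PAUSE | NON_PAUSE
NO_MARK = "O"                      # assumed tag for "no punctuation follows"


def one_stage_labels(text):
    """Label each character with all marks that follow it, concatenated."""
    chars, labels = [], []
    for ch in text:
        if ch in MARKS:
            # Marks before the first character are dropped in this sketch.
            if labels:
                labels[-1] += ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, [lab or NO_MARK for lab in labels]


def two_stage_labels(text):
    """Split each combined label into a pause label and a non-pause label."""
    chars, combined = one_stage_labels(text)
    pause = ["".join(m for m in lab if m in PAUSE) or NO_MARK
             for lab in combined]
    non_pause = ["".join(m for m in lab if m in NON_PAUSE) or NO_MARK
                 for lab in combined]
    return chars, pause, non_pause


if __name__ == "__main__":
    sample = "子曰：“學而時習之。”"
    chars, combined = one_stage_labels(sample)
    _, pause, non_pause = two_stage_labels(sample)
    for row in zip(chars, combined, pause, non_pause):
        print("\t".join(row))
```

In this example the one-stage label for 曰 holds two marks (a colon followed by an opening quotation mark), while each of the two-stage labels for the same character holds a single mark.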
Anthology ID:
2024.lt4hala-1.28
Volume:
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Rachele Sprugnoli, Marco Passarotti
Venues:
LT4HALA | WS
Publisher:
ELRA and ICCL
Pages:
237–241
URL:
https://aclanthology.org/2024.lt4hala-1.28
Cite (ACL):
Xuebin Wang and Zhenghua Li. 2024. Two Sequence Labeling Approaches to Sentence Segmentation and Punctuation Prediction for Classic Chinese Texts. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 237–241, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Two Sequence Labeling Approaches to Sentence Segmentation and Punctuation Prediction for Classic Chinese Texts (Wang & Li, LT4HALA-WS 2024)
PDF:
https://aclanthology.org/2024.lt4hala-1.28.pdf