Glyph Features Matter: A Multimodal Solution for EvaHan in LT4HALA2022

Wei Xinyuan, Liu Weihao, Qing Zong, Zhang Shaoqing, Baotian Hu


Abstract
We participated in EvaHan, the LT4HALA 2022 shared task, which has two subtasks: Subtask 1 is word segmentation and Subtask 2 is part-of-speech tagging. Each subtask consists of two tracks, a closed track restricted to the data and models provided by the organizers, and an open track without restrictions. We employ three pre-trained models: two open-source pre-trained models for ancient Chinese (Siku-Roberta and roberta-classical-chinese), and GlyphBERT, our own pre-trained model that incorporates glyph features. Our methods include data augmentation, data pre-processing, model pretraining, downstream fine-tuning, k-fold cross-validation, and model ensembling. We achieve competitive precision, recall, and F1 scores on both our own validation set and the final public test set.
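The sketch below illustrates, under stated assumptions, the general recipe the abstract describes for Subtask 1: word segmentation cast as character-level BMES tagging with a pre-trained RoBERTa-style encoder, with a k-fold ensemble that averages the per-character label probabilities of the k fine-tuned checkpoints. It is not the authors' code; the checkpoint paths are hypothetical, and the tokenizer name is one publicly available classical-Chinese character-level model assumed here for illustration.

```python
# Minimal sketch (not the authors' implementation): word segmentation as
# character-level BMES tagging, ensembled over k hypothetical fine-tuned
# checkpoints by averaging softmax probabilities.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["B", "M", "E", "S"]                  # begin / middle / end / single-char word
CKPTS = [f"ckpt/fold{i}" for i in range(5)]    # hypothetical 5-fold checkpoint paths

# Assumed character-level tokenizer for classical Chinese (one token per character).
tokenizer = AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-classical-chinese-base-char")

def segment(sentence: str) -> list[str]:
    # Feed the sentence character by character so each character gets one label.
    enc = tokenizer(list(sentence), is_split_into_words=True, return_tensors="pt")
    probs = None
    for path in CKPTS:
        model = AutoModelForTokenClassification.from_pretrained(path, num_labels=len(LABELS))
        model.eval()
        with torch.no_grad():
            fold_probs = model(**enc).logits.softmax(-1)   # (1, seq_len, 4)
        probs = fold_probs if probs is None else probs + fold_probs
    # Assumes one subtoken per character, with [CLS] at position 0.
    tags = probs[0, 1:len(sentence) + 1].argmax(-1).tolist()
    words, buf = [], ""
    for ch, tag in zip(sentence, tags):
        buf += ch
        if LABELS[tag] in ("E", "S"):          # a word closes at E or S
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

Subtask 2 (POS tagging) follows the same token-classification pattern with a larger label set; the ensemble step is unchanged.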
Anthology ID:
2022.lt4hala-1.28
Volume:
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Rachele Sprugnoli, Marco Passarotti
Venue:
LT4HALA
Publisher:
European Language Resources Association
Pages:
178–182
URL:
https://aclanthology.org/2022.lt4hala-1.28
Cite (ACL):
Wei Xinyuan, Liu Weihao, Qing Zong, Zhang Shaoqing, and Baotian Hu. 2022. Glyph Features Matter: A Multimodal Solution for EvaHan in LT4HALA2022. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 178–182, Marseille, France. European Language Resources Association.
Cite (Informal):
Glyph Features Matter: A Multimodal Solution for EvaHan in LT4HALA2022 (Xinyuan et al., LT4HALA 2022)
PDF:
https://aclanthology.org/2022.lt4hala-1.28.pdf