Wei Xinyuan


pdf bib
Glyph Features Matter: A Multimodal Solution for EvaHan in LT4HALA2022
Wei Xinyuan | Liu Weihao | Qing Zong | Zhang Shaoqing | Baotian Hu
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

We participate in the LT4HALA2022 shared task EvaHan. This task has two subtasks. Subtask 1 is word segmentation, and subtask 2 is part-of-speech tagging. Each subtask consists of two tracks, a close track that can only use the data and models provided by the organizer, and an open track without restrictions. We employ three pre-trained models, two of which are open-source pre-trained models for ancient Chinese (Siku-Roberta and roberta-classical-chinese), and one is our pre-trained GlyphBERT combined with glyph features. Our methods include data augmentation, data pre-processing, model pretraining, downstream fine-tuning, k-fold cross validation and model ensemble. We achieve competitive P, R, and F1 scores on both our own validation set and the final public test set.