MVP-BERT: Multi-Vocab Pre-training for Chinese BERT

Wei Zhu


Abstract
Although the development of pre-trained language models (PLMs) has significantly raised the performance of a wide range of Chinese natural language processing (NLP) tasks, the vocabulary (vocab) of these Chinese PLMs remains the one provided by Google Chinese BERT (CITATION), which is based on Chinese characters (chars). In addition, masked language model pre-training is based on a single vocab, which limits downstream task performance. In this work, we first demonstrate experimentally that building a vocab via Chinese word segmentation (CWS) guided sub-word tokenization (SGT) can improve the performance of Chinese PLMs. We then propose two versions of multi-vocab pre-training (MVP), Hi-MVP and AL-MVP, to improve the models’ expressiveness. Experiments show that: (a) MVP training strategies improve PLMs’ downstream performance, especially on span-level tasks; (b) our AL-MVP outperforms the recent AMBERT (CITATION) after large-scale pre-training and is more robust against adversarial attacks.
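To make the idea of CWS-guided sub-word tokenization concrete, here is a minimal toy sketch (not the paper's actual implementation): the text is first segmented into words by a CWS tool, and each word is then split greedily into sub-word pieces from a vocab, so that no sub-word crosses a word boundary. The function name, the `##` continuation convention, and the toy vocab are all illustrative assumptions.

```python
def sgt_tokenize(words, vocab, unk="[UNK]"):
    """Toy CWS-guided sub-word tokenization: greedy longest-match
    (WordPiece-style) within each already-segmented word."""
    tokens = []
    for word in words:
        start, pieces = 0, []
        while start < len(word):
            end = len(word)
            piece = None
            # Try the longest remaining substring first, then shrink.
            while end > start:
                cand = word[start:end]
                if start > 0:
                    cand = "##" + cand  # mark word-internal continuation
                if cand in vocab:
                    piece = cand
                    break
                end -= 1
            if piece is None:
                # No sub-word matches: map the whole word to UNK.
                pieces = [unk]
                break
            pieces.append(piece)
            start = end
        tokens.extend(pieces)
    return tokens

# Hypothetical segmented input: "自然语言" (natural language) / "处理" (processing)
vocab = {"自然", "##语言", "处理"}
print(sgt_tokenize(["自然语言", "处理"], vocab))
# -> ['自然', '##语言', '处理']
```

Because segmentation happens before sub-word splitting, the resulting vocab can contain multi-character word pieces rather than only single characters, which is the contrast with the character-based Google BERT vocab described above.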
Anthology ID:
2021.acl-srw.27
Volume:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month:
August
Year:
2021
Address:
Online
Venues:
ACL | IJCNLP
Publisher:
Association for Computational Linguistics
Pages:
260–269
URL:
https://aclanthology.org/2021.acl-srw.27
DOI:
10.18653/v1/2021.acl-srw.27
Cite (ACL):
Wei Zhu. 2021. MVP-BERT: Multi-Vocab Pre-training for Chinese BERT. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 260–269, Online. Association for Computational Linguistics.
Cite (Informal):
MVP-BERT: Multi-Vocab Pre-training for Chinese BERT (Zhu, ACL 2021)
PDF:
https://aclanthology.org/2021.acl-srw.27.pdf
Video:
 https://aclanthology.org/2021.acl-srw.27.mp4
Data
CMRC | CMRC 2018 | ChID