Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling
Peijie Jiang | Dingkun Long | Yanzhao Zhang | Pengjun Xie | Meishan Zhang | Min Zhang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to the use of a high-quality external lexicon, where lexicon items can offer explicit boundary information. However, to ensure the quality of the lexicon, great human effort is always necessary, which has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets. In addition, our method can complement previous supervised lexicon exploration, where further improvements can be achieved when integrated with external lexicon information.
A Fine-Grained Domain Adaption Model for Joint Word Segmentation and POS Tagging
Peijie Jiang | Dingkun Long | Yueheng Sun | Meishan Zhang | Guangwei Xu | Pengjun Xie
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Domain adaption for word segmentation and POS tagging is a challenging problem for Chinese lexical processing. Self-training is one promising solution for it, which struggles to construct a set of high-quality pseudo training instances for the target domain. Previous work usually assumes a universal source-to-target adaption to collect such pseudo corpus, ignoring the different gaps from the target sentences to the source domain. In this work, we start from joint word segmentation and POS tagging, presenting a fine-grained domain adaption method to model the gaps accurately. We measure the gaps by one simple and intuitive metric, and adopt it to develop a pseudo target domain corpus based on fine-grained subdomains incrementally. A novel domain-mixed representation learning model is proposed accordingly to encode the multiple subdomains effectively. The whole process is performed progressively for both corpus construction and model training. Experimental results on a benchmark dataset show that our method can gain significant improvements over a vary of baselines. Extensive analyses are performed to show the advantages of our final domain adaption model as well.
- Dingkun Long 2
- Meishan Zhang 2
- Pengjun Xie 2
- Yueheng Sun 1
- Guangwei Xu 1
- show all...