uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers

Piji Li (李丕绩)

Abstract

Chinese Spelling Check (CSC) aims to detect and correct spelling errors in text. Manually annotating a high-quality dataset is expensive and time-consuming, so training datasets are usually very small (e.g., SIGHAN15 contains only 2339 training samples). As a result, supervised models often suffer from data sparsity and over-fitting, especially in the era of big language models. In this paper, we investigate the unsupervised paradigm for CSC and propose a framework named uChecker to conduct unsupervised spelling error detection and correction. Masked pretrained language models such as BERT serve as the backbone, given their powerful language diagnosis capability. Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model, further improving the performance of unsupervised detection and correction. Experimental results on standard datasets demonstrate the effectiveness of uChecker in terms of character-level and sentence-level Accuracy, Precision, Recall, and F1-Measure on the tasks of spelling error detection and correction.
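The abstract names a Confusionset-guided masking strategy but does not spell out its mechanics. The sketch below is one plausible reading, not the paper's actual procedure. When a character is selected for corruption, it is replaced either by a confusable character drawn from a confusion set or by the `[MASK]` token, and the original character is kept as the prediction target. The function name `confusion_guided_mask`, the toy `CONFUSION_SET`, and the probabilities `select_prob` and `confuse_prob` are all hypothetical and chosen only for illustration.

```python
import random

MASK = "[MASK]"

# Toy confusion set (hypothetical): each character maps to characters that
# look or sound similar and are therefore plausible misspellings of it.
CONFUSION_SET = {
    "他": ["她", "它"],
    "在": ["再"],
    "的": ["地", "得"],
}


def confusion_guided_mask(chars, confusion_set, select_prob=0.15,
                          confuse_prob=0.5, rng=None):
    """Corrupt a character sequence for masked-LM training (illustrative only).

    Each character is selected with probability `select_prob`. A selected
    character is replaced by a random confusable character with probability
    `confuse_prob` (when the confusion set has an entry for it), and by
    [MASK] otherwise. Returns (inputs, labels); labels hold the original
    character at corrupted positions and None elsewhere.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for ch in chars:
        if rng.random() < select_prob:
            labels.append(ch)  # the model should recover the original
            candidates = confusion_set.get(ch)
            if candidates and rng.random() < confuse_prob:
                inputs.append(rng.choice(candidates))  # simulated misspelling
            else:
                inputs.append(MASK)  # ordinary masking
        else:
            inputs.append(ch)  # left untouched, no training target
            labels.append(None)
    return inputs, labels


if __name__ == "__main__":
    sentence = list("他在家的时候")
    corrupted, targets = confusion_guided_mask(
        sentence, CONFUSION_SET, select_prob=0.5, rng=random.Random(0)
    )
    print(corrupted)
    print(targets)
```

Under this assumption, the corrupted inputs resemble real spelling errors more closely than uniform masking does, which is the intuition suggested by the phrase "Confusionset-guided". The actual selection rates and replacement rules should be taken from the paper itself.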

Anthology ID: 2022.coling-1.248
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 2812–2822
URL: https://aclanthology.org/2022.coling-1.248/
Cite (ACL): Piji Li. 2022. uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2812–2822, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers (Li, COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.248.pdf
