Jianbao Huang


2023

PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction
Xiao Wei | Jianbao Huang | Hang Yu | Qian Liu
Findings of the Association for Computational Linguistics: ACL 2023

Chinese spelling correction (CSC) is a challenging task whose goal is to correct each wrong character in Chinese text. Incorrect characters arise mainly because many Chinese characters are similar in shape or pronunciation. Recently, the pre-training and fine-tuning paradigm has achieved remarkable success in natural language processing. However, the pre-training objectives in existing methods are not tailored to the CSC task: they neglect the visual and phonetic properties of characters, resulting in suboptimal spelling correction. In this work, we propose to pre-train a new corrector, named PTCSpell, for the CSC task under the detector-corrector architecture. The proposed corrector makes two improvements. First, we design two novel pre-training objectives to capture pronunciation and shape information in Chinese characters. Second, we propose a new strategy that balances the loss on wrong characters and correct characters, to tackle the issue of the detector's predictions misleading the corrector. Experiments on three benchmarks (SIGHAN 2013, 2014, and 2015) show that our model achieves an average F1 improvement of 5.8% at the correction level over state-of-the-art methods, verifying its effectiveness.