PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction

Xiao Wei, Jianbao Huang, Hang Yu, Qian Liu


Abstract
Chinese spelling correction (CSC) is a challenging task whose goal is to correct every wrong character in a Chinese text. Incorrect characters arise mainly because many Chinese characters have similar shapes or similar pronunciations. Recently, the paradigm of pre-training and fine-tuning has achieved remarkable success in natural language processing. However, the pre-training objectives in existing methods are not tailored to the CSC task: they neglect the visual and phonetic properties of characters, which results in suboptimal spelling correction. In this work, we propose to pre-train a new corrector named PTCSpell for the CSC task under the detector-corrector architecture. The proposed corrector has two improvements. First, we design two novel pre-training objectives to capture pronunciation and shape information in Chinese characters. Second, we propose a new strategy that balances the loss on wrong characters and correct characters, addressing the issue that the detector’s prediction results can mislead the corrector. Experiments on three benchmarks (i.e., SIGHAN 2013, 2014, and 2015) show that our model achieves an average F1 improvement of 5.8% at the correction level over state-of-the-art methods, verifying its effectiveness.
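The abstract does not give the formula used to balance the loss of wrong and correct characters. As a rough illustration of the general idea only, and not the paper's actual objective, the sketch below averages the negative log-likelihood separately over positions where the input character differs from the gold character and positions where it does not, then mixes the two averages. The function name, the `wrong_weight` parameter, and the dictionary-based probability format are all assumptions made for this example.

```python
import math


def _mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def balanced_correction_loss(probs, source, target, wrong_weight=0.5):
    """Hypothetical balanced loss over wrong and correct characters.

    probs[i] maps candidate characters to the model's probability at
    position i; source is the (possibly misspelled) input and target is
    the gold sentence. wrong_weight is an assumed mixing coefficient.
    """
    wrong_nll, correct_nll = [], []
    for dist, src_char, gold_char in zip(probs, source, target):
        # Negative log-likelihood of the gold character, clipped for safety.
        nll = -math.log(max(dist.get(gold_char, 0.0), 1e-12))
        if src_char != gold_char:
            wrong_nll.append(nll)    # position that needs correction
        else:
            correct_nll.append(nll)  # position that should be copied
    # Average each group on its own so that the many correct characters
    # do not swamp the few wrong ones, then mix the two averages.
    return wrong_weight * _mean(wrong_nll) + (1.0 - wrong_weight) * _mean(correct_nll)


# Toy input: "平果" should be corrected to "苹果".
toy_probs = [{"苹": 0.5, "平": 0.5}, {"果": 1.0}]
toy_loss = balanced_correction_loss(toy_probs, "平果", "苹果", wrong_weight=0.5)
```

Raising `wrong_weight` makes the sketch pay more attention to the characters that actually need changing, while lowering it emphasises copying correct characters unchanged.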
Anthology ID:
2023.findings-acl.394
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
6330–6343
URL:
https://aclanthology.org/2023.findings-acl.394
DOI:
10.18653/v1/2023.findings-acl.394
Cite (ACL):
Xiao Wei, Jianbao Huang, Hang Yu, and Qian Liu. 2023. PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6330–6343, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
PTCSpell: Pre-trained Corrector Based on Character Shape and Pinyin for Chinese Spelling Correction (Wei et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.394.pdf