Hierarchical Character Tagger for Short Text Spelling Error Correction

Mengyi Gao, Canran Xu, Peng Shi


Abstract
State-of-the-art approaches to the spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference time, and sequence labeling models based on Transformer encoders like BERT, which involve a token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is accurate and much faster than many existing models.
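
To make the character-level edit idea concrete, here is a minimal sketch of how per-character edit tags can reconstruct a corrected string. The tag names (KEEP, DELETE, REPLACE_<c>, APPEND_<c>) and the helper function are illustrative assumptions, not the paper's exact label set or implementation.

```python
# Illustrative sketch of character-level edit tagging. Each input character
# receives exactly one edit tag; applying the tags yields the correction.
# Tag inventory here is hypothetical, chosen only to demonstrate the idea.

def apply_char_edits(text: str, tags: list[str]) -> str:
    """Apply per-character edit tags to a noisy string.

    Assumed tags:
      KEEP         - copy the character unchanged
      DELETE       - drop the character
      REPLACE_<c>  - substitute the character with <c>
      APPEND_<c>   - copy the character, then insert <c> after it
    """
    out = []
    for ch, tag in zip(text, tags):
        if tag == "KEEP":
            out.append(ch)
        elif tag == "DELETE":
            continue
        elif tag.startswith("REPLACE_"):
            out.append(tag[len("REPLACE_"):])
        elif tag.startswith("APPEND_"):
            out.append(ch)
            out.append(tag[len("APPEND_"):])
    return "".join(out)

# Example: correct "helo wrold" -> "hello world"
print(apply_char_edits(
    "helo wrold",
    ["KEEP", "KEEP", "APPEND_l", "KEEP", "KEEP",
     "KEEP", "REPLACE_o", "REPLACE_r", "KEEP", "KEEP"],
))
```

Because the label space is built from a small set of edit operations over characters rather than a token vocabulary, it stays far smaller than the word-level dictionaries required by BERT-style token taggers.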
Anthology ID:
2021.wnut-1.13
Volume:
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
Month:
November
Year:
2021
Address:
Online
Editors:
Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
106–113
URL:
https://aclanthology.org/2021.wnut-1.13
DOI:
10.18653/v1/2021.wnut-1.13
Cite (ACL):
Mengyi Gao, Canran Xu, and Peng Shi. 2021. Hierarchical Character Tagger for Short Text Spelling Error Correction. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021), pages 106–113, Online. Association for Computational Linguistics.
Cite (Informal):
Hierarchical Character Tagger for Short Text Spelling Error Correction (Gao et al., WNUT 2021)
PDF:
https://aclanthology.org/2021.wnut-1.13.pdf