A Text Normalisation System for Non-Standard English Words

Emma Flint, Elliot Ford, Olivia Thomas, Andrew Caines, Paula Buttery


Abstract
This paper investigates the problem of text normalisation; specifically, the normalisation of non-standard words (NSWs) in English. Non-standard words can be defined as those word tokens which do not have a dictionary entry and cannot be pronounced using the usual letter-to-phoneme conversion rules; e.g. lbs, 99.3%, #EMNLP2017. NSWs pose a challenge to the proper functioning of text-to-speech technology, and the solution is to spell them out in such a way that they can be pronounced appropriately. We describe our four-stage normalisation system, made up of components for detection, classification, division and expansion of NSWs. Performance compares favourably with previous work in the field (Sproat et al. 2001, Normalization of non-standard words), as well as with state-of-the-art text-to-speech software. Further, we update Sproat et al.’s NSW taxonomy, and create a more customisable system where users are able to input their own abbreviations and specify into which variety of English (currently available: British or American) they wish to normalise.
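The four stages named in the abstract (detection, classification, division, expansion) can be illustrated with a toy sketch. This is not the authors' implementation: the lexicon, the abbreviation and symbol tables, the class labels and every function name below are hypothetical, and numbers are simply read out digit by digit, which a real system would not do.

```python
import re

# Hypothetical stand-ins: a tiny lexicon, an abbreviation table and a symbol table.
LEXICON = {"the", "parcel", "weighs"}
ABBREVIATIONS = {"lbs": "pounds"}
SYMBOLS = {"%": "percent", ".": "point", "#": "hashtag"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def is_nsw(token):
    """Detection: flag any token that has no entry in the lexicon."""
    return token.lower() not in LEXICON


def classify(token):
    """Classification: coarse illustrative labels, not the paper's taxonomy."""
    if token.isascii() and token.isalpha():
        return "alpha"
    if token.isascii() and token.isdigit():
        return "numeric"
    return "mixed"


def divide(token):
    """Division: split a mixed token into letter runs, digit runs and symbols."""
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]", token)


def expand(part):
    """Expansion: spell out one part; unknown parts are returned unchanged."""
    if part.isascii() and part.isdigit():
        return " ".join(DIGITS[int(d)] for d in part)  # digit-by-digit simplification
    if part.lower() in ABBREVIATIONS:
        return ABBREVIATIONS[part.lower()]
    return SYMBOLS.get(part, part)


def normalise(text):
    """Run the four stages over whitespace-separated tokens."""
    out = []
    for token in text.split():
        if not is_nsw(token):
            out.append(token)
            continue
        parts = divide(token) if classify(token) == "mixed" else [token]
        out.append(" ".join(expand(p) for p in parts))
    return " ".join(out)


print(normalise("The parcel weighs 5 lbs"))  # The parcel weighs five pounds
print(normalise("99.3%"))                    # nine nine point three percent
```

Keeping the abbreviation table as plain data mirrors the customisation the abstract mentions, where users supply their own abbreviations; a British or American variant could be selected the same way, by swapping tables.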
Anthology ID:
W17-4414
Volume:
Proceedings of the 3rd Workshop on Noisy User-generated Text
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Leon Derczynski, Wei Xu, Alan Ritter, Tim Baldwin
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
107–115
URL:
https://aclanthology.org/W17-4414
DOI:
10.18653/v1/W17-4414
Cite (ACL):
Emma Flint, Elliot Ford, Olivia Thomas, Andrew Caines, and Paula Buttery. 2017. A Text Normalisation System for Non-Standard English Words. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 107–115, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
A Text Normalisation System for Non-Standard English Words (Flint et al., WNUT 2017)
PDF:
https://aclanthology.org/W17-4414.pdf