Numbers Normalisation in the Inflected Languages: a Case Study of Polish

Rafał Poświata, Michał Perełkiewicz


Abstract
Text normalisation in Text-to-Speech systems is the process of converting written expressions to their spoken forms. The task is complicated because in many cases the normalised form depends on the context. Furthermore, languages such as Croatian, Lithuanian, Polish, Russian or Slovak pose an additional difficulty related to their inflected nature. In this paper we show how to deal with this problem for one of these languages, Polish, without a large dedicated data set and using solutions prepared for other NLP tasks. We limit our study to number expressions, which are the most common non-standard words to normalise. The proposed solution is a combination of a morphological tagger and a transducer supported by a dictionary of numbers in their spoken forms. The data set used for evaluation is based on part of the 1-million-word subset of the National Corpus of Polish. The accuracy of the described approach is compared with a simple baseline and two commercial systems: Google Cloud Text-to-Speech and Amazon Polly.
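To illustrate why inflection matters here, the following is a minimal sketch of a dictionary lookup keyed by number and grammatical case. It is not the authors' implementation: the `SPOKEN_FORMS` table, the `normalise` function, the case labels and the assumption that each token arrives with a case tag from some upstream morphological tagger are all hypothetical, and grammatical gender is ignored for brevity.

```python
# Toy dictionary of spoken forms: (number, grammatical case) -> Polish word.
# Hypothetical and incomplete; gender is ignored, which a real system could not do.
SPOKEN_FORMS = {
    (3, "nom"): "trzy",
    (3, "gen"): "trzech",
    (3, "inst"): "trzema",
    (5, "nom"): "pięć",
    (5, "gen"): "pięciu",
    (5, "inst"): "pięcioma",
}


def normalise(tagged_tokens):
    """Replace digit tokens with spoken forms chosen by the supplied case tag.

    `tagged_tokens` is a list of (text, case) pairs; the case tag is assumed
    to come from an external morphological tagger (not modelled here).
    """
    out = []
    for text, case in tagged_tokens:
        if text.isdecimal():
            value = int(text)
            # Prefer the form for the tagged case; fall back to the nominative,
            # and leave the digits untouched if the number is not in the table.
            form = SPOKEN_FORMS.get((value, case))
            if form is None:
                form = SPOKEN_FORMS.get((value, "nom"), text)
            out.append(form)
        else:
            out.append(text)
    return " ".join(out)


# The preposition "bez" (without) governs the genitive case.
print(normalise([("bez", None), ("5", "gen"), ("złotych", "gen")]))
```

In this sketch the same written token "5" is read as "pięć" or "pięciu" depending only on the case tag, which is the kind of context dependence the abstract describes.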
Anthology ID:
W19-3703
Volume:
Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing
Month:
August
Year:
2019
Address:
Florence, Italy
Editors:
Tomaž Erjavec, Michał Marcińczuk, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
Venue:
BSNLP
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
23–28
URL:
https://aclanthology.org/W19-3703
DOI:
10.18653/v1/W19-3703
Cite (ACL):
Rafał Poświata and Michał Perełkiewicz. 2019. Numbers Normalisation in the Inflected Languages: a Case Study of Polish. In Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, pages 23–28, Florence, Italy. Association for Computational Linguistics.
Cite (Informal):
Numbers Normalisation in the Inflected Languages: a Case Study of Polish (Poświata & Perełkiewicz, BSNLP 2019)
PDF:
https://aclanthology.org/W19-3703.pdf
Code:
rafalposwiata/text-normalization