Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

Sina Ahmadi, Antonios Anastasopoulos


Abstract
The wide accessibility of social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages. This, however, comes with certain challenges in script normalization, particularly where the speakers of a language in a bilingual community rely on another script or orthography to write their native language. This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script. Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated. We also conduct a small-scale evaluation on real data. Our experiments indicate that script normalization also improves the performance of downstream tasks such as machine translation and language identification.
Anthology ID:
2023.acl-long.809
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
14466–14487
URL:
https://aclanthology.org/2023.acl-long.809
DOI:
10.18653/v1/2023.acl-long.809
Cite (ACL):
Sina Ahmadi and Antonios Anastasopoulos. 2023. Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14466–14487, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities (Ahmadi & Anastasopoulos, ACL 2023)
PDF:
https://aclanthology.org/2023.acl-long.809.pdf
Video:
https://aclanthology.org/2023.acl-long.809.mp4