Normalization and Back-Transliteration for Code-Switched Data

Dwija Parikh, Thamar Solorio


Abstract
Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script which prevents from utilizing monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.
Anthology ID:
2021.calcs-1.15
Volume:
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Month:
June
Year:
2021
Address:
Online
Venues:
CALCS | NAACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
119–124
Language:
URL:
https://aclanthology.org/2021.calcs-1.15
DOI:
10.18653/v1/2021.calcs-1.15
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/2021.calcs-1.15.pdf