Normalization and Back-Transliteration for Code-Switched Data

Dwija Parikh, Thamar Solorio


Abstract
Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script, which prevents the use of monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.
Anthology ID:
2021.calcs-1.15
Volume:
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Month:
June
Year:
2021
Address:
Online
Editors:
Thamar Solorio, Shuguang Chen, Alan W. Black, Mona Diab, Sunayana Sitaram, Victor Soto, Emre Yilmaz, Anirudh Srinivasan
Venue:
CALCS
Publisher:
Association for Computational Linguistics
Pages:
119–124
URL:
https://aclanthology.org/2021.calcs-1.15
DOI:
10.18653/v1/2021.calcs-1.15
Cite (ACL):
Dwija Parikh and Thamar Solorio. 2021. Normalization and Back-Transliteration for Code-Switched Data. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 119–124, Online. Association for Computational Linguistics.
Cite (Informal):
Normalization and Back-Transliteration for Code-Switched Data (Parikh & Solorio, CALCS 2021)
PDF:
https://aclanthology.org/2021.calcs-1.15.pdf