Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification

Kelsey Ball, Dan Garrette


Abstract
Code-switching, the use of more than one language within a single utterance, is ubiquitous in much of the world, but remains a challenge for NLP largely due to the lack of representative data for training models. In this paper, we present a novel model architecture that is trained exclusively on monolingual resources, but can be applied to unseen code-switched text at inference time. The model accomplishes this by jointly maintaining separate word representations for each of the possible languages, or scripts in the case of transliteration, allowing each to contribute to inferences without forcing the model to commit to a language. Experiments on Hindi-English part-of-speech tagging demonstrate that our approach outperforms standard models when training on monolingual text without transliteration, and testing on code-switched text with alternate scripts.
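The abstract's core idea, maintaining word representations for each possible language jointly so the model never commits to one, can be illustrated with a minimal sketch. This is not the paper's implementation: the vocabularies, vectors, and the choice of element-wise summation as the combination function are toy assumptions for illustration only.

```python
# Illustrative sketch (NOT the paper's architecture): give a tagger access to
# per-language word representations simultaneously, without a language-ID step.

def embed(token, en_vocab, hi_vocab, dim=4):
    """Sum the token's English and romanized-Hindi embeddings.
    A word unknown to one language contributes a zero vector, so both
    languages can inform downstream inference jointly."""
    zero = [0.0] * dim
    en_vec = en_vocab.get(token, zero)
    hi_vec = hi_vocab.get(token, zero)
    return [e + h for e, h in zip(en_vec, hi_vec)]

# Toy monolingual lookups: "to" is English; "kya" only romanized Hindi;
# "main" is ambiguous (English adjective vs. romanized Hindi pronoun).
en_vocab = {"to": [0.1, 0.0, 0.0, 0.2], "main": [0.0, 0.3, 0.0, 0.1]}
hi_vocab = {"kya": [0.0, 0.0, 0.4, 0.1], "main": [0.2, 0.0, 0.3, 0.0]}

# For the ambiguous "main", both languages' vectors contribute; the tagger
# (not shown) would resolve the ambiguity from context rather than from an
# explicit language tag.
vectors = [embed(tok, en_vocab, hi_vocab) for tok in ["main", "kya", "to"]]
```

In the paper itself the per-language (or per-script) representations are learned from monolingual resources only; the sketch above just shows why no explicit language identification is required at inference time.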
Anthology ID:
D18-1347
Volume:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month:
October-November
Year:
2018
Address:
Brussels, Belgium
Editors:
Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue:
EMNLP
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3084–3089
URL:
https://aclanthology.org/D18-1347
DOI:
10.18653/v1/D18-1347
Cite (ACL):
Kelsey Ball and Dan Garrette. 2018. Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3084–3089, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification (Ball & Garrette, EMNLP 2018)
PDF:
https://aclanthology.org/D18-1347.pdf
Data
Universal Dependencies