Multilingual Identification of English Code-Switching

Igor Sterner


Abstract
Code-switching research depends on fine-grained language identification. In this work, we study existing corpora used to train token-level language identification systems. We aggregate these corpora with a consistent labelling scheme and train a system to identify English code-switching in multilingual text. We show that the system identifies code-switching in unseen language pairs, improving on language-pair-specific SoTA by 2.3-4.6% absolute. We also analyse the correlation between the typological similarity of the languages and the difficulty of recognizing code-switching.
Anthology ID:
2024.vardial-1.14
Volume:
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues:
VarDial | WS
Publisher:
Association for Computational Linguistics
Pages:
163–173
URL:
https://aclanthology.org/2024.vardial-1.14
DOI:
10.18653/v1/2024.vardial-1.14
Cite (ACL):
Igor Sterner. 2024. Multilingual Identification of English Code-Switching. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 163–173, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Multilingual Identification of English Code-Switching (Sterner, VarDial-WS 2024)
PDF:
https://aclanthology.org/2024.vardial-1.14.pdf
Supplementary material:
2024.vardial-1.14.SupplementaryMaterial.txt