TongueSwitcher: Fine-Grained Identification of German-English Code-Switching

Igor Sterner, Simone Teufel


Abstract
This paper contributes to German-English code-switching research. We provide the largest corpus of naturally occurring German-English code-switching, in which English is embedded in German text, and two methods for code-switching identification. The first method is rule-based, using wordlists and morphological processing. We use this method to compile a corpus of 25.6M tweets employing German-English code-switching. In our second method, we continue pretraining a neural language model on this corpus and classify tokens based on embeddings from this language model. Our systems establish a new state of the art on our new corpus and on an existing German-English code-switching benchmark. In particular, we systematically study code-switching for language-ambiguous words, which can only be resolved in context, and for morphologically mixed words consisting of both English and German morphemes. We distribute both corpora and systems to the research community.
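To make the rule-based idea in the abstract concrete, the following is a minimal sketch of wordlist lookup combined with simple affix stripping. It is not the authors' implementation: the wordlists, the affix inventory, the label names, and the functions `is_mixed`, `label_token`, and `tag` are all assumptions invented for illustration.

```python
import re

# Toy wordlists, invented for illustration; real wordlists would be far larger.
GERMAN = {"ich", "habe", "das", "die", "gestern", "und", "war", "was", "hand"}
ENGLISH = {"download", "update", "nice", "the", "was", "hand"}

# A few German verbal affixes (an assumption of this sketch, not the paper's list).
PREFIXES = ("ge",)
SUFFIXES = ("en", "et", "st", "te", "t", "e")


def is_mixed(word):
    """Return True if stripping German affixes leaves an English stem."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if rest.endswith(suffix) and rest[: -len(suffix)] in ENGLISH:
                return True
    return False


def label_token(token):
    """Assign a coarse language label to a single token, ignoring context."""
    word = token.lower()
    in_german, in_english = word in GERMAN, word in ENGLISH
    if in_german and in_english:
        # Needs context to resolve, which this sketch does not attempt.
        return "ambiguous"
    if in_german:
        return "german"
    if in_english:
        return "english"
    if is_mixed(word):
        return "mixed"
    return "unknown"


def tag(text):
    """Split text into word tokens and label each one."""
    return [(token, label_token(token)) for token in re.findall(r"\w+", text)]


if __name__ == "__main__":
    for token, label in tag("Ich habe das Update gedownloadet"):
        print(f"{token}\t{label}")
```

Words found in both lists are only flagged as ambiguous here. As the abstract notes, such words can only be resolved in context, which is where a contextual model such as the second method would come in.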
Anthology ID:
2023.calcs-1.1
Volume:
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching
Month:
December
Year:
2023
Address:
Singapore
Editors:
Genta Winata, Sudipta Kar, Marina Zhukova, Thamar Solorio, Mona Diab, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
Venues:
CALCS | WS
Publisher:
Association for Computational Linguistics
Pages:
1–13
URL:
https://aclanthology.org/2023.calcs-1.1
DOI:
10.18653/v1/2023.calcs-1.1
Cite (ACL):
Igor Sterner and Simone Teufel. 2023. TongueSwitcher: Fine-Grained Identification of German-English Code-Switching. In Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching, pages 1–13, Singapore. Association for Computational Linguistics.
Cite (Informal):
TongueSwitcher: Fine-Grained Identification of German-English Code-Switching (Sterner & Teufel, CALCS-WS 2023)
PDF:
https://aclanthology.org/2023.calcs-1.1.pdf