@inproceedings{iliescu-etal-2021-much,
  title     = {Much Gracias: Semi-supervised Code-switch Detection for {Spanish}-{English}: How far can we get?},
  author    = {Iliescu, Dana-Maria and
               Grand, Rasmus and
               Qirko, Sara and
               van der Goot, Rob},
  editor    = {Solorio, Thamar and
               Chen, Shuguang and
               Black, Alan W. and
               Diab, Mona and
               Sitaram, Sunayana and
               Soto, Victor and
               Yilmaz, Emre and
               Srinivasan, Anirudh},
  booktitle = {Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching},
  month     = jun,
  year      = {2021},
  address   = {Online},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2021.calcs-1.9},
  doi       = {10.18653/v1/2021.calcs-1.9},
  pages     = {65--71},
  abstract  = {Because of globalization, it is becoming more and more common to use multiple languages in a single utterance, also called code-switching. This results in special linguistic structures and, therefore, poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches, that exploit out-of-domain mono-lingual training data. We experiment with word uni-grams, word n-grams, character n-grams, Viterbi Decoding, Latent Dirichlet Allocation, Support Vector Machine and Logistic Regression. The Viterbi model was the best semi-supervised model, scoring a weighted F1 score of 92.23{\%}, whereas a fully supervised state-of-the-art BERT-based model scored 98.43{\%}.},
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="iliescu-etal-2021-much">
<titleInfo>
<title>Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?</title>
</titleInfo>
<name type="personal">
<namePart type="given">Dana-Maria</namePart>
<namePart type="family">Iliescu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rasmus</namePart>
<namePart type="family">Grand</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sara</namePart>
<namePart type="family">Qirko</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rob</namePart>
<namePart type="family">van der Goot</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2021-06</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching</title>
</titleInfo>
<name type="personal">
<namePart type="given">Thamar</namePart>
<namePart type="family">Solorio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shuguang</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alan</namePart>
<namePart type="given">W</namePart>
<namePart type="family">Black</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mona</namePart>
<namePart type="family">Diab</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sunayana</namePart>
<namePart type="family">Sitaram</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Victor</namePart>
<namePart type="family">Soto</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Emre</namePart>
<namePart type="family">Yilmaz</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Anirudh</namePart>
<namePart type="family">Srinivasan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Online</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Because of globalization, it is becoming more and more common to use multiple languages in a single utterance, also called code-switching. This results in special linguistic structures and, therefore, poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches, that exploit out-of-domain mono-lingual training data. We experiment with word uni-grams, word n-grams, character n-grams, Viterbi Decoding, Latent Dirichlet Allocation, Support Vector Machine and Logistic Regression. The Viterbi model was the best semi-supervised model, scoring a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.</abstract>
<identifier type="citekey">iliescu-etal-2021-much</identifier>
<identifier type="doi">10.18653/v1/2021.calcs-1.9</identifier>
<location>
<url>https://aclanthology.org/2021.calcs-1.9</url>
</location>
<part>
<date>2021-06</date>
<extent unit="page">
<start>65</start>
<end>71</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?
%A Iliescu, Dana-Maria
%A Grand, Rasmus
%A Qirko, Sara
%A van der Goot, Rob
%Y Solorio, Thamar
%Y Chen, Shuguang
%Y Black, Alan W.
%Y Diab, Mona
%Y Sitaram, Sunayana
%Y Soto, Victor
%Y Yilmaz, Emre
%Y Srinivasan, Anirudh
%S Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
%D 2021
%8 June
%I Association for Computational Linguistics
%C Online
%F iliescu-etal-2021-much
%X Because of globalization, it is becoming more and more common to use multiple languages in a single utterance, also called code-switching. This results in special linguistic structures and, therefore, poses many challenges for Natural Language Processing. Existing models for language identification in code-switched data are all supervised, requiring annotated training data which is only available for a limited number of language pairs. In this paper, we explore semi-supervised approaches, that exploit out-of-domain mono-lingual training data. We experiment with word uni-grams, word n-grams, character n-grams, Viterbi Decoding, Latent Dirichlet Allocation, Support Vector Machine and Logistic Regression. The Viterbi model was the best semi-supervised model, scoring a weighted F1 score of 92.23%, whereas a fully supervised state-of-the-art BERT-based model scored 98.43%.
%R 10.18653/v1/2021.calcs-1.9
%U https://aclanthology.org/2021.calcs-1.9
%U https://doi.org/10.18653/v1/2021.calcs-1.9
%P 65-71
Markdown (Informal)
[Much Gracias: Semi-supervised Code-switch Detection for Spanish-English: How far can we get?](https://aclanthology.org/2021.calcs-1.9) (Iliescu et al., CALCS 2021)
ACL