Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter?

Kushal Tatariya, Heather Lent, Miryam de Lhoneux
Abstract
Monolinguals make up a minority of the world’s speakers, and yet most language technologies lag behind in handling linguistic behaviours produced by bilingual and multilingual speakers. A commonly observed phenomenon in such communities is code-mixing, which is prevalent on social media, and thus requires attention in NLP research. In this work, we look into the ability of pretrained language models to handle code-mixed data, with a focus on the impact of languages present in pretraining on the downstream performance of the model as measured on the task of sentiment analysis. Ultimately, we find that the pretraining language has little effect on performance when the model sees code-mixed data during downstream finetuning. We also evaluate the models on code-mixed data in a zero-shot setting, after task-specific finetuning on a monolingual dataset. We find that this brings out differences in model performance that can be attributed to the pretraining languages. We present a thorough analysis of these findings that also looks at model performance based on the composition of participating languages in the code-mixed datasets.
Anthology ID:
2023.wassa-1.32
Volume:
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Jeremy Barnes, Orphée De Clercq, Roman Klinger
Venue:
WASSA
Publisher:
Association for Computational Linguistics
Pages:
365–378
URL:
https://aclanthology.org/2023.wassa-1.32
DOI:
10.18653/v1/2023.wassa-1.32
Cite (ACL):
Kushal Tatariya, Heather Lent, and Miryam de Lhoneux. 2023. Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter?. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, pages 365–378, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Transfer Learning for Code-Mixed Data: Do Pretraining Languages Matter? (Tatariya et al., WASSA 2023)
PDF:
https://aclanthology.org/2023.wassa-1.32.pdf
Video:
https://aclanthology.org/2023.wassa-1.32.mp4