Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching

Parul Chopra, Sai Krishna Rallabandi, Alan W Black, Khyathi Raghavi Chandu


Abstract
Code-switching (CS), a ubiquitous phenomenon owing to the ease of communication it offers in multilingual communities, remains an understudied problem in language processing. The primary reasons are: (1) minimal effort in leveraging large pretrained multilingual models, and (2) the lack of annotated data. A distinguishing source of the low performance of multilingual models on CS text is the intra-sentential mixing of languages, which gives rise to switch points. We first benchmark two sequence labeling tasks, POS tagging and NER, on 4 different language pairs with a suite of pretrained models to identify the problems and select the best-performing char-BERT model among them (addressing (1)). We then propose a self-training method to repurpose the existing pretrained models using a switch-point bias by leveraging unannotated data (addressing (2)). We finally demonstrate that our approach performs well on both tasks by narrowing the performance gap at switch points while retaining the overall performance on two distinct language pairs in both tasks. We plan to release our models and the code for all our experiments.
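
The switch-point biased self-training loop described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the authors' released code: the `tag_sentence` callable, the per-token language-ID input, and the specific weighting scheme are all assumptions. The idea is that a pretrained tagger pseudo-labels unannotated code-switched sentences, and tokens on either side of a switch point (where the language ID changes) receive a higher weight when the model is retrained on the pseudo-labeled data.

```python
from typing import List, Tuple

# Hypothetical sketch of switch-point biased self-training for a
# sequence labeler (e.g., POS or NER) on code-switched text.
# `tag_sentence` stands in for any pretrained tagger; it is an assumption.

def switch_points(lang_ids: List[str]) -> List[int]:
    """Indices where the token language differs from the previous token."""
    return [i for i in range(1, len(lang_ids)) if lang_ids[i] != lang_ids[i - 1]]

def pseudo_label(
    unlabeled: List[Tuple[List[str], List[str]]],   # (tokens, lang_ids) pairs
    tag_sentence,                                    # callable: tokens -> predicted tags
    bias: float = 2.0,                               # extra weight around switch points
):
    """Produce (tokens, pseudo-tags, per-token weights) for retraining."""
    out = []
    for tokens, lang_ids in unlabeled:
        tags = tag_sentence(tokens)                  # pseudo-labels from the current model
        weights = [1.0] * len(tokens)
        for i in switch_points(lang_ids):
            # Up-weight the tokens on both sides of each switch point so the
            # retraining loss focuses on the positions where models struggle.
            weights[i] = bias
            weights[i - 1] = bias
        out.append((tokens, tags, weights))
    return out

if __name__ == "__main__":
    # Toy Hinglish example with a dummy tagger; lang_ids mark each token's language.
    sent = (["mujhe", "ye", "movie", "bahut", "acchi", "lagi"],
            ["hi", "hi", "en", "hi", "hi", "hi"])
    dummy_tagger = lambda toks: ["X"] * len(toks)
    for tokens, tags, w in pseudo_label([sent], dummy_tagger):
        print(list(zip(tokens, tags, w)))
```

In practice the weighted pseudo-labeled data would be mixed with the annotated data and fed back into fine-tuning; the weighting around switch points is what biases the self-training step toward the positions the paper identifies as hardest.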
Anthology ID:
2021.findings-emnlp.373
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
4389–4397
URL:
https://aclanthology.org/2021.findings-emnlp.373
DOI:
10.18653/v1/2021.findings-emnlp.373
Cite (ACL):
Parul Chopra, Sai Krishna Rallabandi, Alan W Black, and Khyathi Raghavi Chandu. 2021. Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4389–4397, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching (Chopra et al., Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.373.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.373.mp4