@inproceedings{kang-2024-covoswitch,
title = "{C}o{V}o{S}witch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units",
author = "Kang, Yeeun",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.40",
doi = "10.18653/v1/2024.acl-srw.40",
pages = "345--357",
abstract = "Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI{'}s Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into non-English. Translations into low-resource languages also perform worse than even raw code-switched inputs. We find that systems excel at copying English tokens but struggle with non-English tokens, that the off-target problem in monolingual settings is also relevant in code-switching settings, and that models hallucinate in code-switching translation by introducing words absent in both of the original source sentences. CoVoSwitch and code are available at https://github.com/sophiayk20/covoswitch.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="kang-2024-covoswitch">
    <titleInfo>
        <title>CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Yeeun</namePart>
        <namePart type="family">Kang</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2024-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Xiyan</namePart>
            <namePart type="family">Fu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Eve</namePart>
            <namePart type="family">Fleisig</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Bangkok, Thailand</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI’s Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into non-English. Translations into low-resource languages also perform worse than even raw code-switched inputs. We find that systems excel at copying English tokens but struggle with non-English tokens, that the off-target problem in monolingual settings is also relevant in code-switching settings, and that models hallucinate in code-switching translation by introducing words absent in both of the original source sentences. CoVoSwitch and code are available at https://github.com/sophiayk20/covoswitch.</abstract>
    <identifier type="citekey">kang-2024-covoswitch</identifier>
    <identifier type="doi">10.18653/v1/2024.acl-srw.40</identifier>
    <location>
        <url>https://aclanthology.org/2024.acl-srw.40</url>
    </location>
    <part>
        <date>2024-08</date>
        <extent unit="page">
            <start>345</start>
            <end>357</end>
        </extent>
    </part>
</mods>
</modsCollection>
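For readers who want to consume the MODS record above programmatically, here is a minimal sketch (not part of the ACL Anthology export) using only Python's standard library; the filename covoswitch.xml is a placeholder assumption for wherever the record is saved.

```python
# Minimal sketch: extract citation fields from the MODS XML record above.
# Assumes the record is saved as "covoswitch.xml" (hypothetical filename).
import xml.etree.ElementTree as ET

# Namespace declared on <modsCollection> in the record.
NS = {"m": "http://www.loc.gov/mods/v3"}

root = ET.parse("covoswitch.xml").getroot()
mods = root.find("m:mods", NS)

title = mods.findtext("m:titleInfo/m:title", namespaces=NS)
doi = mods.findtext("m:identifier[@type='doi']", namespaces=NS)
start = mods.findtext("m:part/m:extent/m:start", namespaces=NS)
end = mods.findtext("m:part/m:extent/m:end", namespaces=NS)

# Collect author names from direct <name type="personal"> children;
# the editors live inside <relatedItem> and are not matched here.
authors = []
for name in mods.findall("m:name[@type='personal']", NS):
    if name.findtext("m:role/m:roleTerm", namespaces=NS) == "author":
        given = name.findtext("m:namePart[@type='given']", namespaces=NS)
        family = name.findtext("m:namePart[@type='family']", namespaces=NS)
        authors.append(f"{given} {family}")

print(title)                    # CoVoSwitch: Machine Translation of ...
print(authors, doi)             # ['Yeeun Kang'] 10.18653/v1/2024.acl-srw.40
print(f"pages {start}-{end}")   # pages 345-357
```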
%0 Conference Proceedings
%T CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units
%A Kang, Yeeun
%Y Fu, Xiyan
%Y Fleisig, Eve
%S Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
%D 2024
%8 August
%I Association for Computational Linguistics
%C Bangkok, Thailand
%F kang-2024-covoswitch
%X Multilingual code-switching research is often hindered by the lack and linguistically biased status of available datasets. To expand language representation, we synthesize code-switching data by replacing intonation units detected through PSST, a speech segmentation model fine-tuned from OpenAI’s Whisper, using a speech-to-text translation dataset, CoVoST 2. With our dataset, CoVoSwitch, spanning 13 languages, we evaluate the code-switching translation performance of two multilingual translation models, M2M-100 418M and NLLB-200 600M. We reveal that the inclusion of code-switching units results in higher translation performance than monolingual settings and that models are better at code-switching translation into English than non-English. Further, low-resource languages gain most from integration of code-switched units when translating into English but much less when translating into non-English. Translations into low-resource languages also perform worse than even raw code-switched inputs. We find that systems excel at copying English tokens but struggle with non-English tokens, that the off-target problem in monolingual settings is also relevant in code-switching settings, and that models hallucinate in code-switching translation by introducing words absent in both of the original source sentences. CoVoSwitch and code are available at https://github.com/sophiayk20/covoswitch.
%R 10.18653/v1/2024.acl-srw.40
%U https://aclanthology.org/2024.acl-srw.40
%U https://doi.org/10.18653/v1/2024.acl-srw.40
%P 345-357
Markdown (Informal)
[CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units](https://aclanthology.org/2024.acl-srw.40) (Kang, ACL 2024)
ACL
Yeeun Kang. 2024. [CoVoSwitch: Machine Translation of Synthetic Code-Switched Text Based on Intonation Units](https://aclanthology.org/2024.acl-srw.40). In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)*, pages 345–357, Bangkok, Thailand. Association for Computational Linguistics.