Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Injy Hamed, Nizar Habash, Thang Vu


Abstract
Code-switching (CSW) text generation has been receiving increasing attention as a way to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, where both approaches achieve similar results.
Anthology ID:
2023.findings-emnlp.11
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
140–154
URL:
https://aclanthology.org/2023.findings-emnlp.11
DOI:
10.18653/v1/2023.findings-emnlp.11
Cite (ACL):
Injy Hamed, Nizar Habash, and Thang Vu. 2023. Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 140–154, Singapore. Association for Computational Linguistics.
Cite (Informal):
Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study (Hamed et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.11.pdf