Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages

Sina Ahmadi; Razhan Hameed; Rico Sennrich

doi:10.18653/v1/2025.iwslt-1.10

Abstract

Middle Eastern languages represent a linguistically diverse landscape, yet few have received substantial attention in language and speech technology outside those with official status. Machine translation, a cornerstone application in computational linguistics, remains particularly underexplored for these predominantly non-standardized, spoken varieties. This paper proposes data alignment and augmentation techniques that leverage monolingual corpora and large language models to create high-quality parallel corpora for low-resource Middle Eastern languages. Through systematic fine-tuning of a pretrained machine translation model in a multilingual framework, our results demonstrate that corpus quality consistently outperforms quantity as a determinant of translation accuracy. Furthermore, we provide empirical evidence that strategic data selection significantly enhances cross-lingual transfer in multilingual translation systems. These findings offer valuable insights for developing machine translation solutions in linguistically diverse, resource-constrained environments.

Anthology ID: 2025.iwslt-1.10
Volume: Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)
Month: July
Year: 2025
Address: Vienna, Austria (in-person and online)
Editors: Elizabeth Salesky, Marcello Federico, Antonis Anastasopoulos
Venues: IWSLT | WS
Publisher: Association for Computational Linguistics
Pages: 110–118
URL: https://aclanthology.org/2025.iwslt-1.10/
DOI: 10.18653/v1/2025.iwslt-1.10
Cite (ACL): Sina Ahmadi, Razhan Hameed, and Rico Sennrich. 2025. Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages. In Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025), pages 110–118, Vienna, Austria (in-person and online). Association for Computational Linguistics.
Cite (Informal): Literary Translations and Synthetic Data for Machine Translation of Low-resourced Middle Eastern Languages (Ahmadi et al., IWSLT 2025)
PDF: https://aclanthology.org/2025.iwslt-1.10.pdf