Machine Translation Between High-resource Languages in a Language Documentation Setting

Katharina von der Wense; Abteen Ebrahimi; Kristine Stenzel; Alexis Palmer

Abstract

Language documentation encompasses translation, typically into the dominant high-resource language in the region where the target language is spoken. To make data accessible to a broader audience, additional translation into other high-resource languages might be needed. Working within a project documenting Kotiria, we explore the extent to which state-of-the-art machine translation (MT) systems can support this second translation – in our case from Portuguese to English. This translation task is challenging for multiple reasons: (1) the data is out-of-domain with respect to the MT system’s training data, (2) much of the data is conversational, (3) existing translations include non-standard and uncommon expressions, often reflecting properties of the documented language, and (4) the data includes borrowings from other regional languages. Despite these challenges, existing MT systems perform at a usable level, though there is still room for improvement. We then conduct a qualitative analysis and suggest ways to improve MT between high-resource languages in a language documentation setting.

Anthology ID: 2022.fieldmatters-1.3
Volume: Proceedings of the First Workshop on NLP applications to field linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Oleg Serikov, Ekaterina Voloshina, Anna Postnikova, Elena Klyachko, Ekaterina Neminova, Ekaterina Vylomova, Tatiana Shavrina, Eric Le Ferrand, Valentin Malykh, Francis Tyers, Timofey Arkhangelskiy, Vladislav Mikhailov, Alena Fenogenova
Venue: FieldMatters
Publisher: International Conference on Computational Linguistics
Pages: 26–33
URL: https://aclanthology.org/2022.fieldmatters-1.3/
Cite (ACL): Katharina Kann, Abteen Ebrahimi, Kristine Stenzel, and Alexis Palmer. 2022. Machine Translation Between High-resource Languages in a Language Documentation Setting. In Proceedings of the First Workshop on NLP applications to field linguistics, pages 26–33, Gyeongju, Republic of Korea. International Conference on Computational Linguistics.
Cite (Informal): Machine Translation Between High-resource Languages in a Language Documentation Setting (Kann et al., FieldMatters 2022)
PDF: https://aclanthology.org/2022.fieldmatters-1.3.pdf