Interplay of Machine Translation, Diacritics, and Diacritization

Wei-Rui Chen, Ife Adebara, Muhammad Abdul-Mageed


Abstract
We investigate two research questions: (1) how do machine translation (MT) and diacritization influence each other's performance in a multi-task learning setting, and (2) what is the effect of keeping (vs. removing) diacritics on MT performance? We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We find that MT harms diacritization in LR but benefits it significantly in HR for some languages. For (2), MT performance is similar regardless of whether diacritics are kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.
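To make the "keeping (vs. removing) diacritics" condition concrete, the following is a minimal sketch of one common way to strip diacritics from text, using Unicode canonical decomposition. It is an illustration only and is not taken from the paper: the function name `strip_diacritics` is hypothetical, and the authors' actual preprocessing may differ.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text (hypothetical helper, not the authors' code)."""
    # NFD splits precomposed characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters whose canonical combining class is 0 (non-combining).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    for word in ["café", "ọjọ́", "naïve"]:
        print(word, "->", strip_diacritics(word))
```

This approach only removes marks that Unicode models as combining characters. Letters such as "ø" or "đ" have no canonical decomposition and pass through unchanged, so a language-specific mapping would be needed if those should also be reduced to base letters.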
Anthology ID: 2024.naacl-long.420
Volume: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Month: June
Year: 2024
Address: Mexico City, Mexico
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Venue: NAACL
Publisher: Association for Computational Linguistics
Pages: 7559–7601
URL: https://aclanthology.org/2024.naacl-long.420
DOI: 10.18653/v1/2024.naacl-long.420
Cite (ACL): Wei-Rui Chen, Ife Adebara, and Muhammad Abdul-Mageed. 2024. Interplay of Machine Translation, Diacritics, and Diacritization. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7559–7601, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal): Interplay of Machine Translation, Diacritics, and Diacritization (Chen et al., NAACL 2024)
PDF: https://aclanthology.org/2024.naacl-long.420.pdf