System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task

Joseph Attieh, Zachary Hopton, Yves Scherrer, Tanja Samardžić


Abstract
This paper presents the system description of the NordicsAlps team for the AmericasNLP 2024 Machine Translation Shared Task 1. We investigate the effect of tokenization on translation quality by comparing two tokenization schemes: byte-level and redundancy-driven tokenization. We submitted three runs per language pair. The system using redundancy-driven tokenization ranked first among all submissions, achieving the highest average chrF2++, chrF, and BLEU scores across all languages. These findings demonstrate the importance of carefully tailoring the tokenization strategy of a machine translation system, particularly in resource-constrained scenarios.
Anthology ID:
2024.americasnlp-1.18
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
150–158
URL:
https://aclanthology.org/2024.americasnlp-1.18
DOI:
10.18653/v1/2024.americasnlp-1.18
Cite (ACL):
Joseph Attieh, Zachary Hopton, Yves Scherrer, and Tanja Samardžić. 2024. System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 150–158, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task (Attieh et al., AmericasNLP-WS 2024)
PDF:
https://aclanthology.org/2024.americasnlp-1.18.pdf