A New Benchmark for Kalaallisut-Danish Neural Machine Translation

Ross Kristensen-Mclachlan, Johanne Nedergård


Abstract
Kalaallisut, also known as (West) Greenlandic, poses a number of unique challenges to contemporary natural language processing (NLP). In particular, the language has historically lacked benchmarking datasets and robust evaluation of specific NLP tasks, such as neural machine translation (NMT). In this paper, we present a new benchmark dataset for Greenlandic to Danish NMT comprising over 1.2m words of Greenlandic and 2.1m words of parallel Danish translations. We provide initial metrics for models trained on this dataset and conclude by suggesting how these findings can be taken forward to other NLP tasks for the Greenlandic language.
Anthology ID:
2024.americasnlp-1.7
Volume:
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Manuel Mager, Abteen Ebrahimi, Shruti Rijhwani, Arturo Oncevay, Luis Chiruzzo, Robert Pugh, Katharina von der Wense
Venues:
AmericasNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
50–55
URL:
https://aclanthology.org/2024.americasnlp-1.7
DOI:
10.18653/v1/2024.americasnlp-1.7
Cite (ACL):
Ross Kristensen-Mclachlan and Johanne Nedergård. 2024. A New Benchmark for Kalaallisut-Danish Neural Machine Translation. In Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024), pages 50–55, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
A New Benchmark for Kalaallisut-Danish Neural Machine Translation (Kristensen-Mclachlan & Nedergård, AmericasNLP-WS 2024)
PDF:
https://aclanthology.org/2024.americasnlp-1.7.pdf
Supplementary material:
2024.americasnlp-1.7.SupplementaryMaterial.zip