Transformers for Low-Resource Languages: Is Féidir Linn!

Seamus Lankford, Haithem Afli, Andy Way


Abstract
The Transformer model is the state-of-the-art in Machine Translation. However, in general, neural translation models often underperform on language pairs with insufficient training data. As a consequence, relatively few experiments have been carried out using this architecture on low-resource language pairs. In this study, hyperparameter optimization of Transformer models in translating the low-resource English-Irish language pair is evaluated. We demonstrate that choosing appropriate parameters leads to considerable performance improvements. Most importantly, the correct choice of subword model is shown to be the biggest driver of translation performance. SentencePiece models using both unigram and BPE approaches were appraised. Variations on model architectures included modifying the number of layers, testing various regularization techniques, and evaluating the optimal number of heads for attention. A generic 55k DGT corpus and an in-domain 88k public admin corpus were used for evaluation. A Transformer-optimized model demonstrated a BLEU score improvement of 7.8 points when compared with a baseline RNN model. Improvements were observed across a range of metrics, including TER, indicating a substantially reduced post-editing effort for Transformer-optimized models with 16k BPE subword models. Benchmarked against Google Translate, our translation engines demonstrated significant improvements. The question of whether or not Transformers can be used effectively in a low-resource setting of English-Irish translation has been addressed. Is féidir linn - yes we can.
Anthology ID:
2021.mtsummit-research.5
Volume:
Proceedings of Machine Translation Summit XVIII: Research Track
Month:
August
Year:
2021
Address:
Virtual
Editors:
Kevin Duh, Francisco Guzmán
Venue:
MTSummit
Publisher:
Association for Machine Translation in the Americas
Pages:
48–60
URL:
https://aclanthology.org/2021.mtsummit-research.5
Cite (ACL):
Seamus Lankford, Haithem Afli, and Andy Way. 2021. Transformers for Low-Resource Languages: Is Féidir Linn!. In Proceedings of Machine Translation Summit XVIII: Research Track, pages 48–60, Virtual. Association for Machine Translation in the Americas.
Cite (Informal):
Transformers for Low-Resource Languages: Is Féidir Linn! (Lankford et al., MTSummit 2021)
PDF:
https://aclanthology.org/2021.mtsummit-research.5.pdf