Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation

Vivek Iyer, Bhavitvya Malik, Pavel Stepachev, Pinzhen Chen, Barry Haddow, Alexandra Birch


Abstract
Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource languages (LRLs) still lags significantly behind Neural Machine Translation (NMT) models. In this work, we explore what it would take to adapt LLMs for the low-resource setting. Particularly, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has seen reduced use in adapting LLMs for MT, while data diversity has been embraced to promote transfer across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both considerations: a) parallel data is critical during both pre-training and SFT; b) diversity tends to cause interference instead of transfer. Our experiments with three LLMs across two low-resourced language groups—Indigenous American and North-East Indian—reveal consistent trends, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve LRLs.
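For readers unfamiliar with the SFT setup the abstract refers to: the sketch below is a minimal, hypothetical illustration (not the authors' actual pipeline) of how parallel sentence pairs are commonly wrapped as instruction-style prompt/response examples before fine-tuning an LLM for translation. The prompt template, field names, and sample sentences are assumptions made purely for illustration.

```python
# Minimal sketch (assumed format, not the paper's pipeline): turn parallel
# sentence pairs into instruction-style SFT examples for LLM-based MT.

def to_sft_example(src_text, tgt_text, src_lang, tgt_lang):
    """Wrap one parallel sentence pair as a prompt/response record."""
    prompt = (
        f"Translate the following {src_lang} sentence into {tgt_lang}.\n"
        f"{src_lang}: {src_text}\n{tgt_lang}:"
    )
    return {"prompt": prompt, "response": f" {tgt_text}"}

# Hypothetical parallel-corpus entries; the language pair is illustrative only.
parallel_corpus = [
    ("He aquí un ejemplo.", "Here is an example.", "Spanish", "English"),
]

sft_data = [to_sft_example(s, t, sl, tl) for s, t, sl, tl in parallel_corpus]
print(sft_data[0]["prompt"])
```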
Anthology ID: 2024.wmt-1.128
Volume: Proceedings of the Ninth Conference on Machine Translation
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 1393–1409
URL: https://aclanthology.org/2024.wmt-1.128
Cite (ACL): Vivek Iyer, Bhavitvya Malik, Pavel Stepachev, Pinzhen Chen, Barry Haddow, and Alexandra Birch. 2024. Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation. In Proceedings of the Ninth Conference on Machine Translation, pages 1393–1409, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation (Iyer et al., WMT 2024)
PDF: https://aclanthology.org/2024.wmt-1.128.pdf