BibTeX
@inproceedings{csaki-etal-2024-sambalingo,
    title = "{S}amba{L}ingo: Teaching Large Language Models New Languages",
    author = "Csaki, Zoltan and
      Li, Bo and
      Li, Jonathan Lingjie and
      Xu, Qiantong and
      Pawakapan, Pian and
      Zhang, Leon and
      Du, Yun and
      Zhao, Hengyu and
      Hu, Changran and
      Thakker, Urmish",
    editor = {S{\"a}lev{\"a}, Jonne and
      Owodunni, Abraham},
    booktitle = "Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.mrl-1.1",
    pages = "1--21",
    abstract = "Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.",
}
MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="csaki-etal-2024-sambalingo">
  <titleInfo>
    <title>SambaLingo: Teaching Large Language Models New Languages</title>
  </titleInfo>
  <name type="personal">
    <namePart type="given">Zoltan</namePart>
    <namePart type="family">Csaki</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Bo</namePart>
    <namePart type="family">Li</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Jonathan</namePart>
    <namePart type="given">Lingjie</namePart>
    <namePart type="family">Li</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Qiantong</namePart>
    <namePart type="family">Xu</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Pian</namePart>
    <namePart type="family">Pawakapan</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Leon</namePart>
    <namePart type="family">Zhang</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Yun</namePart>
    <namePart type="family">Du</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Hengyu</namePart>
    <namePart type="family">Zhao</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Changran</namePart>
    <namePart type="family">Hu</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart type="given">Urmish</namePart>
    <namePart type="family">Thakker</namePart>
    <role>
      <roleTerm authority="marcrelator" type="text">author</roleTerm>
    </role>
  </name>
  <originInfo>
    <dateIssued>2024-11</dateIssued>
  </originInfo>
  <typeOfResource>text</typeOfResource>
  <relatedItem type="host">
    <titleInfo>
      <title>Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Jonne</namePart>
      <namePart type="family">Sälevä</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Abraham</namePart>
      <namePart type="family">Owodunni</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">editor</roleTerm>
      </role>
    </name>
    <originInfo>
      <publisher>Association for Computational Linguistics</publisher>
      <place>
        <placeTerm type="text">Miami, Florida, USA</placeTerm>
      </place>
    </originInfo>
    <genre authority="marcgt">conference publication</genre>
  </relatedItem>
  <abstract>Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.</abstract>
  <identifier type="citekey">csaki-etal-2024-sambalingo</identifier>
  <location>
    <url>https://aclanthology.org/2024.mrl-1.1</url>
  </location>
  <part>
    <date>2024-11</date>
    <extent unit="page">
      <start>1</start>
      <end>21</end>
    </extent>
  </part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T SambaLingo: Teaching Large Language Models New Languages
%A Csaki, Zoltan
%A Li, Bo
%A Li, Jonathan Lingjie
%A Xu, Qiantong
%A Pawakapan, Pian
%A Zhang, Leon
%A Du, Yun
%A Zhao, Hengyu
%A Hu, Changran
%A Thakker, Urmish
%Y Sälevä, Jonne
%Y Owodunni, Abraham
%S Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
%D 2024
%8 November
%I Association for Computational Linguistics
%C Miami, Florida, USA
%F csaki-etal-2024-sambalingo
%X Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
%U https://aclanthology.org/2024.mrl-1.1
%P 1-21
Markdown (Informal)
[SambaLingo: Teaching Large Language Models New Languages](https://aclanthology.org/2024.mrl-1.1) (Csaki et al., MRL 2024)
ACL
Zoltan Csaki, Bo Li, Jonathan Lingjie Li, Qiantong Xu, Pian Pawakapan, Leon Zhang, Yun Du, Hengyu Zhao, Changran Hu, and Urmish Thakker. 2024. SambaLingo: Teaching Large Language Models New Languages. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 1–21, Miami, Florida, USA. Association for Computational Linguistics.