How Vocabulary Sharing Facilitates Multilingualism in LLaMA?

Fei Yuan, Shuai Yuan, Zhiyong Wu, Lei Li


Abstract
Large Language Models (LLMs) often show strong performance on English tasks while exhibiting limitations in other languages. What is an LLM's multilingual capability when it is trained only on certain languages? The underlying mechanism remains unclear. This study examines the multilingual capability of LLMs from the vocabulary sharing perspective by conducting an exhaustive analysis across 101 languages. By investigating the performance gap before and after embedding fine-tuning, we discovered four distinct quadrants. By delving into each quadrant, we provide actionable and efficient guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and that we can significantly improve the multilingual performance of LLMs based on the attributes of each quadrant.
Anthology ID:
2024.findings-acl.721
Volume:
Findings of the Association for Computational Linguistics ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12111–12130
URL:
https://aclanthology.org/2024.findings-acl.721
Cite (ACL):
Fei Yuan, Shuai Yuan, Zhiyong Wu, and Lei Li. 2024. How Vocabulary Sharing Facilitates Multilingualism in LLaMA?. In Findings of the Association for Computational Linguistics ACL 2024, pages 12111–12130, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
How Vocabulary Sharing Facilitates Multilingualism in LLaMA? (Yuan et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.721.pdf