FuxiTranyu: A Multilingual Large Language Model Trained with Balanced Data
Haoran Sun | Renren Jin | Shaoyang Xu | Leiyu Pan | Supryadi | Menglong Cui | Jiangcun Du | Yikun Lei | Lei Yang | Ling Shi | Juesi Xiao | Shaolin Zhu | Deyi Xiong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large language models (LLMs) have demonstrated prowess in a wide range of tasks. However, many LLMs exhibit significant performance discrepancies between high- and low-resource languages. To mitigate this challenge, we present FuxiTranyu, an open-source multilingual LLM designed to meet the research community's need for balanced and high-performing multilingual capabilities. The base model, FuxiTranyu-8B, features 8 billion parameters and is trained from scratch on meticulously balanced multilingual data containing 600 billion tokens that cover 43 natural languages and 16 programming languages. We also develop two instruction-tuned models: FuxiTranyu-8B-SFT, which is fine-tuned on a diverse multilingual instruction dataset, and FuxiTranyu-8B-DPO, which is further refined with DPO on a preference dataset for enhanced alignment. Extensive experiments on a wide range of multilingual benchmarks demonstrate the competitive performance of FuxiTranyu against existing multilingual LLMs, e.g., BLOOM-7B, PolyLM-13B, and Mistral-7B-Instruct. Both neuron and representation interpretability analyses reveal that FuxiTranyu achieves consistent multilingual representations across languages. To promote further research into multilingual LLMs, we release both the base and instruction-tuned FuxiTranyu models together with 58 pre-training checkpoints on HuggingFace and GitHub.
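The abstract notes that FuxiTranyu-8B-DPO is refined with DPO on a preference dataset. As a point of reference, the sketch below shows the standard DPO objective on pairwise preferences; it is a generic illustration, not the paper's training code, and the tensor names (e.g., policy_chosen_logps) are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss over a batch of preference pairs.

    Each argument holds summed token log-probabilities of the chosen or
    rejected response under the trainable policy or the frozen reference
    model. (Illustrative sketch; not the FuxiTranyu implementation.)
    """
    # Implicit rewards: log-ratio of policy vs. reference, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss encouraging a positive margin for the preferred response.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
lp = lambda: torch.randn(4)
print(dpo_loss(lp(), lp(), lp(), lp()))
```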