Sailor: Open Language Models for South-East Asia

Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, Min Lin


Abstract
We present Sailor, a family of open language models ranging from 0.5B to 14B parameters, tailored for South-East Asian (SEA) languages. Built on Qwen1.5, Sailor models are continually pre-trained on 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks (commonsense reasoning, question answering, reading comprehension, and examination) indicate that Sailor models perform strongly across different benchmarks. We share our insights to spark wider interest in developing large language models for multilingual use cases.
Anthology ID:
2024.emnlp-demo.45
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Delia Irazu Hernandez Farias, Tom Hope, Manling Li
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
424–435
URL:
https://aclanthology.org/2024.emnlp-demo.45
Cite (ACL):
Longxu Dou, Qian Liu, Guangtao Zeng, Jia Guo, Jiahui Zhou, Xin Mao, Ziqi Jin, Wei Lu, and Min Lin. 2024. Sailor: Open Language Models for South-East Asia. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 424–435, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Sailor: Open Language Models for South-East Asia (Dou et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-demo.45.pdf