TeleChat: An Open-source Billingual Large Language Model

Zihan Wang; Liuxz2@chinatelecom.cn; Liusx14@chinatelecom.cn; Yitong Yao; Huangyy121@chinatelecom.cn; Li Mengxiang; Zhongjiang He; Liyx25@chinatelecom.cn; Pulw@chinatelecom.cn; Xuhn@chinatelecom.cn; Chao Wang; Shuangyong Song

Abstract

In this paper, we present TeleChat, a collection of large language models (LLMs) with 7 billion and 12 billion parameters. TeleChat is first pretrained on an extensive corpus of diverse English and Chinese texts comprising trillions of tokens. The model is then fine-tuned to align with human preferences, following a detailed methodology that we describe. We evaluate TeleChat on various tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves state-of-the-art performance relative to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications of LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered high-quality pretraining data, to the public community.

Anthology ID: 2024.sighan-1.2
Volume: Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, Runcong Zhao
Venues: SIGHAN | WS
Publisher: Association for Computational Linguistics
Pages: 10–20
URL: https://aclanthology.org/2024.sighan-1.2
Cite (ACL): Zihan Wang, Liuxz2@chinatelecom.cn, Liusx14@chinatelecom.cn, Yitong Yao, Huangyy121@chinatelecom.cn, Li Mengxiang, Zhongjiang He, Liyx25@chinatelecom.cn, Pulw@chinatelecom.cn, Xuhn@chinatelecom.cn, Chao Wang, and Shuangyong Song. 2024. TeleChat: An Open-source Billingual Large Language Model. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 10–20, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): TeleChat: An Open-source Billingual Large Language Model (Wang et al., SIGHAN-WS 2024)
PDF: https://aclanthology.org/2024.sighan-1.2.pdf
