Chaoqun Liu
2024
SeaLLMs - Large Language Models for Southeast Asia
Xuan-Phi Nguyen
|
Wenxuan Zhang
|
Xin Li
|
Mahani Aljunied
|
Zhiqiang Hu
|
Chenhui Shen
|
Yew Ken Chia
|
Xingxuan Li
|
Jianyu Wang
|
Qingyu Tan
|
Liying Cheng
|
Guanzheng Chen
|
Yue Deng
|
Sen Yang
|
Chaoqun Liu
|
Hang Zhang
|
Lidong Bing
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon popular English-centric models through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
2023
Zero-Shot Text Classification via Self-Supervised Tuning
Chaoqun Liu
|
Wenxuan Zhang
|
Guizhen Chen
|
Xiaobao Wu
|
Anh Tuan Luu
|
Chip Hong Chang
|
Lidong Bing
Findings of the Association for Computational Linguistics: ACL 2023
Existing solutions to zero-shot text classification either conduct prompting with pre-trained language models, which is sensitive to the choices of templates, or rely on large-scale annotated data of relevant tasks for meta-tuning. In this work, we propose a new paradigm based on self-supervised learning to solve zero-shot text classification tasks by tuning the language models with unlabeled data, called self-supervised tuning. By exploring the inherent structure of free texts, we propose a new learning objective called first sentence prediction to bridge the gap between unlabeled data and text classification tasks. After tuning the model to learn to predict the first sentence in a paragraph based on the rest, the model is able to conduct zero-shot inference on unseen tasks such as topic classification and sentiment analysis. Experimental results show that our model outperforms the state-of-the-art baselines on 7 out of 10 tasks. Moreover, the analysis reveals that our model is less sensitive to the prompt design. Our code and pre-trained models are publicly available at https://github.com/DAMO-NLP-SG/SSTuning.
Search
Fix data
Co-authors
- Lidong Bing 2
- Wenxuan Zhang 2
- Mahani Aljunied 1
- Chip Hong Chang 1
- Guizhen Chen 1
- show all...