Yuelin Bai
2026
COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Siwei Wu | JinCheng Ren | Xeron Du | Shuyue Guo | Xingwei Qu | Yiming Liang | Jie Liu | Yunwen Li | Tyler Loakman | Tianyu Zheng | Boyu Feng | Huaqing Yuan | Zili Wang | Jiaheng Liu | Wenhao Huang | Chenglin Cai | Haoran Que | Jian Yang | Yuelin Bai | Zekun Moore Wang | Zhouliang Yu | Qunshu Lin | Ding Pan | Yuchen Eleanor Jiang | Tiannan Wang | Wangchunshu Zhou | Shenzhi Wang | Xingyuan Bu | Minghao Liu | Guoyin Wang | Ge Zhang | Chenghua Lin
Findings of the Association for Computational Linguistics: EACL 2026
Existing Chinese preference datasets suffer from limited scale, restricted domain coverage, and insufficiently rigorous data validation. Human annotation significantly limits the scalability of human preference datasets. As a result, Chinese Alignment and Chinese Reward Models (CRM) have not yet been thoroughly explored. To address these challenges, we design an LLM-based data annotation pipeline with no human intervention. Based on this pipeline, we curate COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset consisting of 1M Chinese preference pairs and 92k carefully curated Chinese queries across diverse domains, including Chat, Coding, Maths, and others. We conduct experiments to verify the quality of COIG-P from two perspectives. (1) COIG-P brings significant performance improvements for the Qwen2/2.5 and Infinity-Instruct model series on AlignBench through DPO, with gains ranging from 2% to 12%. Furthermore, it significantly outperforms other existing Chinese preference datasets. (2) We train an 8B-sized CRM and manually annotate a Chinese Reward Benchmark (CRBench). Our CRM demonstrates robust scoring ability on CRBench. In addition, in practical data construction experiments, the quality of the data constructed by our CRM is comparable to that produced by GPT-4o.
2025
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai | Xeron Du | Yiming Liang | Leo Jin | Junting Zhou | Ziqiang Liu | Feiteng Fang | Mingshan Chang | Tianyu Zheng | Xincheng Zhang | Nuo Ma | Zekun Moore Wang | Ruibin Yuan | Haihong Wu | Hongquan Lin | Wenhao Huang | Jiajun Zhang | Chenghua Lin | Jie Fu | Min Yang | Shiwen Ni | Ge Zhang
Findings of the Association for Computational Linguistics: NAACL 2025
Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and undergoing comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare the results with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang | Xi Feng | Yuelin Bai | Xeron Du | Jinchang Hou | Kaixin Deng | Guangzeng Han | Qinrui Li | Bingli Wang | Jiaheng Liu | Xingwei Qu | Yifei Zhang | Qixuan Zhao | Yiming Liang | Ziqiang Liu | Feiteng Fang | Min Yang | Wenhao Huang | Chenghua Lin | Ge Zhang | Shiwen Ni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As the capabilities of Multimodal Large Language Models (MLLMs) improve, the need for higher-order evaluation of them is increasing. However, there is a lack of work evaluating MLLMs on higher-order perception and understanding of Chinese visual content. To address this, we introduce CII-Bench, which aims to assess these capabilities of MLLMs on Chinese images. To ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model’s understanding of Chinese traditional culture. Experiments on multiple MLLMs using CII-Bench yield several significant findings. There is a large gap between MLLMs and humans in performance. The highest MLLM accuracy is 64.4%, while the human average is 78.2% and the peak is 81.0%. MLLMs perform poorly on traditional culture images, indicating limitations in understanding high-level semantics and the lack of a deep knowledge base of Chinese traditional culture. Moreover, most models have higher accuracy when image emotion hints are added to the prompts. We believe CII-Bench will help MLLMs better understand Chinese semantics and specific images, and advance the development of expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io.
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jiawei Guo | Tianyu Zheng | Yizhi Li | Yuelin Bai | Bo Li | Yubo Wang | King Zhu | Graham Neubig | Wenhu Chen | Xiang Yue
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse reasoning-intensive tasks. Experiments demonstrate that training MLLMs on our dataset not only significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%), but also yields improvements of up to 4% on non-reasoning-based benchmarks.
2024
Ruler: A Model-Agnostic Method to Control Generated Length for Large Language Models
Jiaming Li | Lei Zhang | Yunshui Li | Ziqiang Liu | Yuelin Bai | Run Luo | Longze Chen | Min Yang
Findings of the Association for Computational Linguistics: EMNLP 2024
The instruction-following ability of large language models enables humans to interact with AI agents in a natural way. However, when required to generate responses of a specific length, large language models often struggle to meet users’ needs due to their inherent difficulty in accurately perceiving numerical constraints. To explore the ability of large language models to control the length of generated responses, we propose the Target Length Generation Task (TLG) and design two metrics, Precise Match (PM) and Flexible Match (FM), to evaluate the model’s performance in adhering to specified response lengths. Furthermore, we introduce a novel, model-agnostic approach called Ruler, which employs Meta Length Tokens (MLTs) to enhance the instruction-following ability of large language models under length-constrained instructions. Specifically, Ruler equips LLMs with the ability to generate responses of a specified length based on length constraints within the instructions. Moreover, Ruler can automatically generate an appropriate MLT when length constraints are not explicitly provided, demonstrating excellent versatility and generalization. Comprehensive experiments show the effectiveness of Ruler across different LLMs on the Target Length Generation Task, e.g., an average gain of 27.97 on PM and 29.57 on FM at All Level. In addition, we conduct extensive ablation experiments to further substantiate the efficacy and generalization of Ruler. Our code and data are available on the internet.
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
Shiwen Ni | Minghuan Tan | Yuelin Bai | Fuqiang Niu | Min Yang | Bowen Zhang | Ruifeng Xu | Xiaojun Chen | Chengming Li | Xiping Hu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we develop a new IP-oriented multilingual large language model (called MoZi), a BLOOMZ-based model that has undergone supervised fine-tuning on multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while scoring lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark has much room for improvement, and even the most powerful model, ChatGPT, does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.
Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
Feiteng Fang | Yuelin Bai | Shiwen Ni | Min Yang | Xiaojun Chen | Ruifeng Xu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can potentially hinder the LLMs’ capacity to generate comprehensive and high-quality responses. Prior RAG studies on robustness to retrieval noise often confine themselves to a limited set of noise types, deviating from real-world retrieval environments and limiting practical applicability. In this study, we first investigate retrieval noises and categorize them into three distinct types, reflecting real-world environments. We analyze the impact of these various retrieval noises on the robustness of LLMs. Subsequently, we propose a novel RAG approach known as Retrieval-augmented Adaptive Adversarial Training (RAAT). RAAT leverages adaptive adversarial training to dynamically adjust the model’s training process in response to retrieval noises. Concurrently, it employs multi-task learning to ensure the model’s capacity to internally recognize noisy contexts. Extensive experiments demonstrate that the LLaMA-2 7B model trained using RAAT exhibits significant improvements in F1 and EM scores under diverse noise conditions. For reproducibility, we will release our code and data upon acceptance.
Co-authors
- Min Yang 6
- Feiteng Fang 4
- Shiwen Ni 4
- Xeron Du 3
- Wenhao Huang 3
- Yiming Liang 3
- Chenghua Lin 3
- Ziqiang Liu 3
- Ge Zhang 3
- Tianyu Zheng 3
- Longze Chen 2
- Xiaojun Chen 2
- Jiaheng Liu 2
- Xingwei Qu 2
- Minghuan Tan 2
- Zekun Moore Wang 2
- Ruifeng Xu (徐睿峰) 2
- Xingyuan Bu 1
- Chenglin Cai 1
- Mingshan Chang 1
- Wenhu Chen 1
- Kaixin Deng 1
- Xi Feng 1
- Boyu Feng 1
- Jie Fu 1
- Jiawei Guo 1
- Shuyue Guo 1
- Guangzeng Han 1
- Jinchang Hou 1
- Xiping Hu 1
- Yuchen Eleanor Jiang 1
- Leo Jin 1
- Jiaming Li 1
- Yunshui Li 1
- Chengming Li 1
- Qinrui Li 1
- Yizhi Li 1
- Bo Li 1
- Yunwen Li 1
- Hongquan Lin 1
- Qunshu Lin 1
- Jie Liu 1
- Minghao Liu 1
- Tyler Loakman 1
- Run Luo 1
- Nuo Ma 1
- Graham Neubig 1
- Fuqiang Niu 1
- Ding Pan 1
- Haoran Que 1
- JinCheng Ren 1
- Bingli Wang 1
- Yubo Wang 1
- Zili Wang 1
- Tiannan Wang 1
- Shenzhi Wang 1
- Guoyin Wang 1
- Haihong Wu 1
- Siwei Wu 1
- Jian Yang 1
- Zhouliang Yu 1
- Ruibin Yuan 1
- Huaqing Yuan 1
- Xiang Yue 1
- Lei Zhang 1
- Bowen Zhang 1
- Xincheng Zhang 1
- Jiajun Zhang 1
- Zhexiang Zhang 1
- Chenhao Zhang 1
- Yifei Zhang 1
- Qixuan Zhao 1
- Junting Zhou 1
- Wangchunshu Zhou 1
- Liang Zhu 1
- King Zhu 1