2024
Reward Modeling Requires Automatic Adjustment Based on Data Quality
Binghai Wang | Rui Zheng | Lu Chen | Zhiheng Xi | Wei Shen | Yuhao Zhou | Dong Yan | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. The human preference data used to train the reward model consists of a prompt and a pair of responses, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations, including OpenAI and Anthropic, report significant noise in human preference datasets, leading to instability in reward model training and deviation from human values. We discover that the difference between the scores the reward model assigns to a response pair effectively indicates data quality, and that data of varying quality behave very differently during reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of the dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.
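The abstract's key observation is that the reward score gap between a chosen and a rejected response signals data quality. Below is a minimal PyTorch sketch of that idea: each preference pair's Bradley-Terry loss is reweighted by a function of its reward margin. The weighting function is a hypothetical placeholder, not the paper's actual adjustment rule.

```python
# A minimal sketch (not the paper's exact method) of quality-weighted reward
# modeling: pairs whose reward margin suggests noisy labels are down-weighted
# in the standard Bradley-Terry loss. `quality_weight` is a hypothetical
# placeholder for whatever adjustment the paper applies.
import torch
import torch.nn.functional as F

def quality_weight(margin: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Assumption: treat small or strongly negative margins as likely noise and
    # give them less weight; a sigmoid of the margin is one simple choice.
    return torch.sigmoid(margin / temperature).detach()

def weighted_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    margin = r_chosen - r_rejected                      # score gap per pair
    per_pair = -F.logsigmoid(margin)                    # standard Bradley-Terry loss
    return (quality_weight(margin) * per_pair).mean()   # quality-adjusted average

# Example with dummy reward-model outputs for a batch of 4 preference pairs.
r_c = torch.tensor([1.2, 0.3, -0.5, 2.0])
r_r = torch.tensor([0.1, 0.4, -0.2, -1.0])
print(weighted_reward_loss(r_c, r_r))
```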
Leveraging Web-Crawled Data for High-Quality Fine-Tuning
Jing Zhou | Chenglin Jiang | Wei Shen | Xiao Zhou | Xiaonan He
Findings of the Association for Computational Linguistics: EMNLP 2024
Most large language models are fine-tuned using either expensive human-annotated data or GPT-4-generated data, which cannot guarantee performance in certain domains. We argue that although web-crawled data often has formatting errors that cause semantic inaccuracies, it can still serve as a valuable source for high-quality supervised fine-tuning in specific domains without relying on advanced models like GPT-4. To this end, we create a paired training dataset automatically by aligning web-crawled data with a smaller set of high-quality data. By training a language model on this dataset, we can convert web data with irregular formats into high-quality data. Our experiments show that training with the model-transformed data yields better results, surpassing training with only high-quality data by an average score of 9.4% on Chinese math problems. Additionally, our 7B model outperforms several open-source models larger than 32B and surpasses well-known closed-source models such as GPT-3.5, highlighting the efficacy of our approach. We have released our code at https://github.com/zhouj8553/Web_to_SFT.
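The core step described above is pairing noisy web-crawled items with similar high-quality references so a model can learn to rewrite the former into the latter. The sketch below illustrates such pairing with a simple token-overlap heuristic; the paper's actual alignment procedure is not given in the abstract, so this is only an assumed stand-in, and the resulting pairs would then feed a standard supervised fine-tuning step.

```python
# A minimal sketch, under my own assumptions, of building (noisy -> clean)
# training pairs by aligning web-crawled items with a small high-quality set
# via token overlap; the paper's actual alignment procedure may differ.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def build_pairs(web_items, clean_items, threshold=0.3):
    """Pair each web-crawled item with its most similar high-quality item."""
    pairs = []
    for noisy in web_items:
        best = max(clean_items, key=lambda c: token_overlap(noisy, c))
        if token_overlap(noisy, best) >= threshold:
            pairs.append({"input": noisy, "target": best})
    return pairs

web_items = [
    "question : what is 12 plus 30 ? answer 42 !!",
    "solve for x : 2 x + 3 = 9 so x = 3",
]
clean_items = [
    "What is 12 plus 30? The answer is 42.",
    "Solve for x: 2x + 3 = 9, so x = 3.",
]
print(build_pairs(web_items, clean_items))
```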
LoRAMoE: Alleviating World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Shihan Dou | Enyu Zhou | Yan Liu | Songyang Gao | Wei Shen | Limao Xiong | Yuhao Zhou | Xiao Wang | Zhiheng Xi | Xiaoran Fan | Shiliang Pu | Jiang Zhu | Rui Zheng | Tao Gui | Qi Zhang | Xuanjing Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhancing their capabilities on downstream tasks. Substantially increasing instruction data is a direct way to align the model with a broader range of downstream tasks or to notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of the LoRAs to focus on leveraging world knowledge to solve downstream tasks, alleviating world knowledge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE significantly improves the ability to process downstream tasks while maintaining the world knowledge stored in the LLM. Our code is available at https://github.com/Ablustrund/LoRAMoE.
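As a rough illustration of the plugin architecture described above, the sketch below combines a frozen linear layer with several LoRA experts mixed by a learned router. The rank, routing granularity, and absence of any balancing constraints are simplifying assumptions of this sketch, not the released implementation.

```python
# A minimal sketch of a LoRAMoE-style layer: a frozen base linear layer plus
# several LoRA experts whose outputs are mixed by a learned router.
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    def __init__(self, d_in, d_out, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)            # backbone stays frozen
        self.base.bias.requires_grad_(False)
        self.router = nn.Linear(d_in, num_experts)        # token-wise routing
        self.lora_A = nn.ModuleList([nn.Linear(d_in, rank, bias=False) for _ in range(num_experts)])
        self.lora_B = nn.ModuleList([nn.Linear(rank, d_out, bias=False) for _ in range(num_experts)])

    def forward(self, x):
        gates = torch.softmax(self.router(x), dim=-1)     # (batch, seq, experts)
        expert_out = torch.stack(
            [B(A(x)) for A, B in zip(self.lora_A, self.lora_B)], dim=-1
        )                                                 # (batch, seq, d_out, experts)
        mixed = (expert_out * gates.unsqueeze(-2)).sum(-1)
        return self.base(x) + mixed                       # frozen path + MoE-style LoRA plugin

layer = LoRAMoELinear(d_in=16, d_out=16)
print(layer(torch.randn(2, 5, 16)).shape)                 # torch.Size([2, 5, 16])
```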
2023
Loose lips sink ships: Mitigating Length Bias in Reinforcement Learning from Human Feedback
Wei Shen | Rui Zheng | Wenyu Zhan | Jun Zhao | Shihan Dou | Tao Gui | Qi Zhang | Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2023
Reinforcement learning from human feedback serves as a crucial bridge, aligning large language models with human and societal values. This alignment requires a vast corpus of human feedback to learn a reward model, which is subsequently used to fine-tune language models. However, we have identified that the reward model often finds shortcuts to bypass its intended objectives, misleadingly assuming that humans prefer longer responses. This length bias often induces the model to favor longer outputs, even though longer outputs do not necessarily contain more helpful information. In this paper, we propose an innovative solution, applying the Product-of-Experts (PoE) technique to separate reward modeling from the influence of sequence length. In our framework, the main expert concentrates on understanding human intents, while the biased expert targets the identification and capture of length bias. To further enhance the learning of bias, we introduce perturbations into the bias-focused expert, disrupting the flow of semantic information. Experimental results validate the effectiveness of our approach, indicating that language model performance is improved irrespective of sequence length.
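A minimal sketch of the Product-of-Experts setup described above: during training, the main and length-biased experts are combined by summing their scores (equivalently, multiplying their pairwise-preference probabilities), and only the main expert would score responses at inference. The exact perturbation scheme for the bias expert is omitted, so the details below are simplifying assumptions.

```python
# A minimal sketch of Product-of-Experts reward training under my own
# simplifying assumptions: combined logits during training, main head only
# at inference.
import torch
import torch.nn.functional as F

def poe_pairwise_loss(r_main_c, r_main_r, r_bias_c, r_bias_r):
    # Product of experts: P(chosen > rejected) uses the summed logits of both experts.
    margin = (r_main_c + r_bias_c) - (r_main_r + r_bias_r)
    return -F.logsigmoid(margin).mean()

# Dummy per-pair scores; in the paper the bias expert receives semantically
# perturbed inputs so that it absorbs the length shortcut.
r_main_c, r_main_r = torch.tensor([1.0, 0.2]), torch.tensor([0.3, 0.5])
r_bias_c, r_bias_r = torch.tensor([0.4, 0.9]), torch.tensor([0.1, 0.2])
print(poe_pairwise_loss(r_main_c, r_main_r, r_bias_c, r_bias_r))
```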
CL-QR: Cross-Lingual Enhanced Query Reformulation for Multi-lingual Conversational AI Agents
Zhongkai Sun | Zhengyang Zhao | Sixing Lu | Chengyuan Ma | Xiaohu Liu | Xing Fan | Wei Shen | Chenlei Guo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Conversational AI agents such as Alexa, Google Assistant, and Siri are growing in popularity and rely on accurate spoken language comprehension. Query reformulation (QR), which rewrites defective user queries, has been broadly adopted to mitigate the challenge of understanding user intent from imperfect speech recognition results. However, due to the scarcity of non-English QR labels, providing high-quality QR for non-English users remains a challenge. This work proposes a novel cross-lingual QR framework, CL-QR, that leverages the abundant reformulation resources in English to improve non-English QR performance. We also propose a Module-wise Mutually-supervised Feedback learning (MMF) algorithm to enable continual self-improvement of CL-QR, which alleviates the lack of cross-lingual QR training data and enhances the delivery of high-quality reformulations, learned in English, for multilingual queries. Both offline evaluation and online A/B testing demonstrate the effectiveness of the proposed method.
Improving Contextual Query Rewrite for Conversational AI Agents through User-preference Feedback Learning
Zhongkai Sun | Yingxue Zhou | Jie Hao | Xing Fan | Yanbin Lu | Chengyuan Ma | Wei Shen | Chenlei Guo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Contextual query rewriting (CQR) is a crucial component of conversational AI agents, leveraging contextual information from previous user-agent conversations to improve comprehension of the current user intent. However, traditional CQR methods often concentrate on supervised fine-tuning only, neglecting opportunities to learn from user feedback and align with user preferences. Inspired by recent advances in learning from human feedback (LHF), this paper proposes a novel Preference Aligned Contextual Query Rewriting (PA-CQR) framework to enhance the CQR model's capability to generate user-preference-aligned rewrites. The paper also investigates the efficacy of various state-of-the-art feedback learning algorithms on the CQR task and proposes a novel Dynamic Direct Preference Optimization (Dynamic DPO) algorithm to better adapt DPO to large-scale CQR training. Experiments on a large-scale real-world CQR dataset demonstrate the superiority of the proposed PA-CQR framework and the Dynamic DPO algorithm.
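Since the framework above builds on DPO, the sketch below shows the standard DPO objective with an adjustable beta applied to rewrite pairs. Treating the "dynamic" component as a beta that changes over training is purely an assumption of this sketch; the paper's Dynamic DPO is not specified in the abstract.

```python
# A minimal sketch of the standard DPO objective; the per-call `beta` is the
# only "dynamic" element here, and that interpretation is an assumption.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta):
    # Standard DPO: margin is beta * difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()

# Dummy sequence log-probabilities for two rewrite pairs; beta could be varied
# across training steps (the hypothesized "dynamic" component).
lp_c, lp_r = torch.tensor([-12.0, -9.5]), torch.tensor([-13.1, -9.0])
ref_c, ref_r = torch.tensor([-12.4, -9.8]), torch.tensor([-12.9, -9.3])
print(dpo_loss(lp_c, lp_r, ref_c, ref_r, beta=0.1))
```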
2022
Enhancing Self-Attention with Knowledge-Assisted Attention Maps
Jiangang Bai | Yujing Wang | Hong Sun | Ruonan Wu | Tianmeng Yang | Pengfei Tang | Defu Cao | Mingliang Zhang | Yunhai Tong | Yaming Yang | Jing Bai | Ruofei Zhang | Hao Sun | Wei Shen
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Large-scale pre-trained language models have attracted extensive attention in the research community and shown promising results on various natural language processing tasks. However, the attention maps, which record the attention scores between tokens in the self-attention mechanism, are sometimes ineffective because they are learned implicitly, without the guidance of explicit semantic knowledge. Thus, we aim to infuse explicit external knowledge into pre-trained language models to further boost their performance. Existing work on knowledge infusion largely depends on multi-task learning frameworks, which are inefficient and require large-scale re-training when new knowledge is considered. In this paper, we propose a novel and generic solution, KAM-BERT, which directly incorporates knowledge-generated attention maps into the self-attention mechanism. It requires only a few extra parameters and supports efficient fine-tuning once new knowledge is added. KAM-BERT achieves consistent improvements on various academic datasets for natural language understanding. It also outperforms other state-of-the-art methods that conduct knowledge infusion into transformer-based architectures. Moreover, we apply our model to an industry-scale ad relevance application and demonstrate its advantages in a real-world scenario.
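One simple way to realize the idea described above is to add a knowledge-derived attention map to the self-attention scores before the softmax. The sketch below illustrates this; how the knowledge map is constructed and how it is weighted are assumptions of the sketch, and KAM-BERT's actual integration may differ.

```python
# A minimal sketch of knowledge-assisted self-attention: an external,
# knowledge-derived map is added to the attention score matrix before softmax.
import math
import torch

def knowledge_assisted_attention(q, k, v, knowledge_map, alpha=1.0):
    """q, k, v: (batch, heads, seq, d); knowledge_map: (batch, 1, seq, seq)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + alpha * knowledge_map              # inject external knowledge as a score bias
    return torch.softmax(scores, dim=-1) @ v

b, h, s, d = 2, 4, 6, 16
q, k, v = (torch.randn(b, h, s, d) for _ in range(3))
km = torch.zeros(b, 1, s, s)
km[:, :, 0, 1] = 1.0                                     # e.g. a known relation between tokens 0 and 1
print(knowledge_assisted_attention(q, k, v, km).shape)   # torch.Size([2, 4, 6, 16])
```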