Fei Yu


2024

pdf bib
OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning
Fei Yu | Anningzhe Gao | Benyou Wang
Findings of the Association for Computational Linguistics: NAACL 2024

Large language models (LLMs) often struggle with maintaining accuracy throughout multiple multiple reasoning steps, especially in mathematical reasoning where an error in earlier steps can propagate to subsequent ones and it ultimately leading to an incorrect answer.To reduce error propagation, guided decoding is employed to direct the LM decoding on a step-by-step basis. We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness, as the former approach leads towards a correct final answer. This transforms the task into a value estimation problem in planning.Inspired by the findings that outcome supervision for guided decoding essentially acts as a value model, we propose Outcome-supervised Value Model (OVM) that employs outcome supervision for training a value model, which prioritizes steps that lead to accurate conclusions. Furthermore, the OVM eliminates the need for labor-intensive annotations of step-level correctness, thereby significantly enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. Notably, in GSM8K, our OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters; especially it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training value models for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for guided decoding.

pdf bib
AceGPT, Localizing Large Language Models in Arabic
Huang Huang | Fei Yu | Jianqing Zhu | Xuening Sun | Hao Cheng | Song Dingjie | Zhihong Chen | Mosen Alharthi | Bang An | Juncai He | Ziche Liu | Junying Chen | Jianquan Li | Benyou Wang | Lian Zhang | Ruoyu Sun | Xiang Wan | Haizhou Li | Jinchao Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed ‘AceGPT’, sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.

2023

pdf bib
HuatuoGPT, Towards Taming Language Model to Be a Doctor
Hongbo Zhang | Junying Chen | Feng Jiang | Fei Yu | Zhihong Chen | Guiming Chen | Jianquan Li | Xiangbo Wu | Zhang Zhiyi | Qingying Xiao | Xiang Wan | Benyou Wang | Haizhou Li
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we present HuatuoGPT, a Large Language Model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both distilled data from **ChatGPT** and real-world data from **doctors** in the supervised fine-tuning stage. This is not only because purely using **ChatGPT**-distilled data might cause ‘model collapse’, but also because real-world data from **doctors** would be complementary to **ChatGPT**-distilled data. The responses from ChatGPT are usually detailed, well-presented, fluent, and instruction-followed, but it cannot perform like a doctor in many aspects, e.g. for interactive diagnosis. Therefore, the extra doctors’ data could tame a distilled language model to perform like doctors. To synergize the strengths of both data sources, we introduce RLMF (Reinforcement Learning from Mixed Feedback) where a reward model is trained to align the language model with the merits that both sources (ChatGPT and doctors) bring. Experimental results (in GPT-4 evaluation, human evaluation, and medical benchmark datasets) demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs. It is worth noting that by using additional real-world data and RLMF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model (i.e., ChatGPT) in most cases.