Gunsoo Han
2024
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
Eunseop Yoon | Hee Suk Yoon | SooHwan Eom | Gunsoo Han | Daniel Nam | Daejin Jo | Kyoung-Woon On | Mark Hasegawa-Johnson | Sungwoong Kim | Chang Yoo
Findings of the Association for Computational Linguistics: ACL 2024
Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human preferences. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated by the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for the varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens; the confidence of the discriminator is used to assign continuous rewards to each token in context. Extensive experiments show that our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks.
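The core idea in the abstract — mapping a discriminator's per-token confidence to a signed continuous reward — can be illustrated with a minimal sketch. Everything below (the `token_rewards` function, the use of a sigmoid over raw logits, the [-1, 1] mapping) is a hypothetical illustration, not the paper's exact formulation.

```python
import math

def token_rewards(logits):
    """Map per-token discriminator logits (positive = 'preferred token')
    to continuous rewards in [-1, 1].

    Hypothetical sketch: the discriminator's confidence p that a token is
    positive is rescaled so that p = 1 gives +1, p = 0 gives -1, and an
    uncertain p = 0.5 gives a neutral reward of 0.
    """
    rewards = []
    for z in logits:
        p = 1.0 / (1.0 + math.exp(-z))  # confidence P(token is positive)
        rewards.append(2.0 * p - 1.0)   # signed, confidence-scaled reward
    return rewards

# Toy example: a confidently positive, an uncertain, and a confidently
# negative token receive graded rewards rather than fixed +1/0/-1 values.
print(token_rewards([4.0, 0.0, -4.0]))
```

Unlike discrete schemes, tokens the discriminator is unsure about contribute only weak reward signal, which is the graded behavior the abstract motivates.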
2023
Efficient Latent Variable Modeling for Knowledge-Grounded Dialogue Generation
Gunsoo Han | Daejin Jo | Daniel Nam | Eunseop Yoon | Taehwan Kwon | Seungeun Rho | Kyoung-Woon On | Chang Yoo | Sungwoong Kim
Findings of the Association for Computational Linguistics: EMNLP 2023
Knowledge-grounded dialogue generation requires first retrieving appropriate external knowledge based on a conversational context and then generating a response grounded in the retrieved knowledge. In general, these two sequential modules, a knowledge retriever and a response generator, have been trained separately in a supervised manner. However, obtaining intermediate labels of the ground-truth knowledge is expensive, especially in open-domain conversations. Latent variable modeling avoids this need for labels. In this paper, we propose an efficient algorithm for this latent variable modeling that is able to leverage a large amount of dialogue data. Rather than directly training the complex retriever, we adapt a query generator to an off-the-shelf retriever, and the query generator and response generator are trained simultaneously over the latent query variable. Moreover, we employ the evidence lower bound as a training objective and modify it to robustly perform the joint training. Experimental results on diverse knowledge-grounded dialogue datasets show that the proposed algorithm significantly outperforms the supervised learning algorithm even without the use of annotated knowledge, while maintaining efficiency and scalability.
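The evidence-lower-bound objective mentioned in the abstract can be sketched numerically. The snippet below is a generic Monte Carlo ELBO estimate over a discrete latent query; the function name, the toy posterior weights, and the response log-likelihoods are all illustrative assumptions, not the paper's modified objective.

```python
import math

def elbo_estimate(query_posterior, response_loglikes):
    """Generic evidence lower bound over a discrete latent query:
        ELBO = sum_q q(query) * [log p(response | query) - log q(query)]
    where q(query) plays the role of the query generator's (approximate)
    posterior. Hypothetical sketch with toy numbers.
    """
    elbo = 0.0
    for q, loglike in zip(query_posterior, response_loglikes):
        if q > 0.0:  # terms with zero posterior mass contribute nothing
            elbo += q * (loglike - math.log(q))
    return elbo

# Two candidate queries: posterior weights and the response generator's
# log-likelihood of the gold response given each retrieved result.
print(elbo_estimate([0.7, 0.3], [-1.2, -2.5]))
```

Maximizing this quantity jointly updates both modules: the query generator shifts posterior mass toward queries whose retrieved knowledge makes the response likely, without ever needing gold knowledge labels.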