Countering Reward Over-Optimization in LLM with Demonstration-Guided Reinforcement Learning

Authors: Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin
Published in: Findings of the Association for Computational Linguistics: ACL 2024, pages 12447-12472
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics
Location: Bangkok, Thailand
Date: 2024-08
Type: conference publication
Identifier: rita-etal-2024-countering
DOI: 10.18653/v1/2024.findings-acl.740
URL: https://aclanthology.org/2024.findings-acl.740/