Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, and Chaowei Xiao. 2024. RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2551–2570, Bangkok, Thailand, August 2024. Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Computational Linguistics. Anthology ID: wang-etal-2024-rlhfpoison. DOI: 10.18653/v1/2024.acl-long.140. URL: https://aclanthology.org/2024.acl-long.140/