Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents

Adversarial DPO: Harnessing Harmful Data for Reducing Toxicity with Minimal Impact on Coherence and Evasiveness in Dialogue Agents San Kim author Gary Lee author 2024-06 text Findings of the Association for Computational Linguistics: NAACL 2024 Kevin Duh editor Helena Gomez editor Steven Bethard editor Association for Computational Linguistics Mexico City, Mexico conference publication kim-lee-2024-adversarial 10.18653/v1/2024.findings-naacl.118 https://aclanthology.org/2024.findings-naacl.118/ 2024-06 1821 1835