Preference Tuning For Toxicity Mitigation Generalizes Across Languages

Xiaochen Li; Zheng Xin Yong; Stephen Bach

doi:10.18653/v1/2024.findings-emnlp.784

Abstract

Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.

Anthology ID: 2024.findings-emnlp.784
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13422–13440
URL: https://aclanthology.org/2024.findings-emnlp.784/
DOI: 10.18653/v1/2024.findings-emnlp.784
Cite (ACL): Xiaochen Li, Zheng Xin Yong, and Stephen Bach. 2024. Preference Tuning For Toxicity Mitigation Generalizes Across Languages. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13422–13440, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Preference Tuning For Toxicity Mitigation Generalizes Across Languages (Li et al., Findings 2024)
PDF: https://aclanthology.org/2024.findings-emnlp.784.pdf
