MICo: Preventative Detoxification of Large Language Models through Inhibition Control

Roy Siegelmann, Ninareh Mehrabi, Palash Goyal, Prasoon Goyal, Lisa Bauer, Jwala Dhamala, Aram Galstyan, Rahul Gupta, Reza Ghanadan


Abstract
Large Language Models (LLMs) are powerful tools that have become both dominant and commonplace in the field of Artificial Intelligence. Yet LLMs are prone to toxic degeneration, wherein otherwise safe and unproblematic models begin generating toxic content. For the sake of social responsibility, and inspired by the biological mechanism of inhibition control, we introduce the paradigm of Education for Societal Norms (ESN). By collecting and labeling examples as acceptable or unacceptable (in this case, non-toxic or toxic), and pairing every unacceptable example with an acceptable rewrite, we introduce a new mechanism for LLM detoxification. We annotate a dataset of 2,850 entries and use it to fine-tune a model, which we call a Model with Inhibition Control (MICo). Evaluating this model on toxicity detection capability, rewrite detoxification, meaning preservation, and overall toxicity reduction, we find significant improvements over the baseline model. In our experiments, overall toxicity is reduced by more than 60%, with over 75% reduction in severe toxicity.
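The abstract describes a dataset in which each entry is labeled acceptable or unacceptable, and every unacceptable entry carries an acceptable rewrite. The sketch below illustrates what such ESN-style training records might look like; the field names, prompt template, and helper functions are assumptions for illustration, not details from the paper.

```python
# Hypothetical sketch of an ESN (Education for Societal Norms) data record:
# every unacceptable (toxic) example must carry a non-toxic rewrite, so the
# model can learn both to detect the violation and to produce an acceptable
# alternative. All names and formats here are assumed, not from the paper.

def build_esn_record(text, acceptable, rewrite=None):
    """Return one ESN-style training record as a dict."""
    if not acceptable and rewrite is None:
        raise ValueError("unacceptable examples require an acceptable rewrite")
    return {
        "text": text,
        "label": "acceptable" if acceptable else "unacceptable",
        # Acceptable examples are their own rewrite; toxic ones use the
        # annotator-provided detoxified version.
        "rewrite": text if acceptable else rewrite,
    }

def to_finetune_pair(record):
    """Flatten a record into an (input, target) pair for supervised fine-tuning."""
    prompt = f"Classify and, if needed, rewrite: {record['text']}"
    target = f"{record['label']} | {record['rewrite']}"
    return prompt, target

dataset = [
    build_esn_record("Have a great day!", acceptable=True),
    build_esn_record("You are an idiot.", acceptable=False,
                     rewrite="I disagree with your point."),
]
pairs = [to_finetune_pair(r) for r in dataset]
```

Pairing each toxic example with its rewrite in a single record is one plausible way to realize the "inhibition" framing: the supervision signal couples detection and correction rather than teaching them separately.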
Anthology ID:
2024.findings-naacl.110
Volume:
Findings of the Association for Computational Linguistics: NAACL 2024
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Kevin Duh, Helena Gomez, Steven Bethard
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1696–1703
URL:
https://aclanthology.org/2024.findings-naacl.110
Cite (ACL):
Roy Siegelmann, Ninareh Mehrabi, Palash Goyal, Prasoon Goyal, Lisa Bauer, Jwala Dhamala, Aram Galstyan, Rahul Gupta, and Reza Ghanadan. 2024. MICo: Preventative Detoxification of Large Language Models through Inhibition Control. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1696–1703, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
MICo: Preventative Detoxification of Large Language Models through Inhibition Control (Siegelmann et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-naacl.110.pdf
Copyright:
2024.findings-naacl.110.copyright.pdf