Revisiting Who’s Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective

Yujian Liu; Yang Zhang; Tommi Jaakkola; Shiyu Chang

doi:10.18653/v1/2024.emnlp-main.495

Abstract

This paper investigates Who’s Harry Potter (WHP), a pioneering yet insufficiently understood method for LLM unlearning. We explore it in two steps. First, we introduce a new task of LLM targeted unlearning, where given an unlearning target (e.g., a person) and some unlearning documents, we aim to unlearn only the information about the target, rather than everything in the unlearning documents. We further argue that a successful unlearning should satisfy criteria such as not outputting gibberish, not fabricating facts about the unlearning target, and not releasing factual information under jailbreak attacks. Second, we construct a causal intervention framework for targeted unlearning, where the knowledge of the unlearning target is modeled as a confounder between LLM input and output, and the unlearning process as a deconfounding process. This framework justifies and extends WHP, deriving a simple unlearning algorithm that includes WHP as a special case. Experiments on existing and new datasets show that our approach, without explicitly optimizing for the aforementioned criteria, achieves competitive performance in all of them.

Anthology ID: 2024.emnlp-main.495
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 8708–8731
URL: https://aclanthology.org/2024.emnlp-main.495/
DOI: 10.18653/v1/2024.emnlp-main.495
Cite (ACL): Yujian Liu, Yang Zhang, Tommi Jaakkola, and Shiyu Chang. 2024. Revisiting Who’s Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8708–8731, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Revisiting Who’s Harry Potter: Towards Targeted Unlearning from a Causal Intervention Perspective (Liu et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.495.pdf
Data: 2024.emnlp-main.495.data.zip
