Xu Yinghui


2024

pdf bib
ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average
Shaojie Shi | Xiaoyu Tan | Xihe Qiu | Chao Qu | Kexin Nie | Yuan Cheng | Wei Chu | Xu Yinghui | Yuan Qi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

In recent years, large language models (LLMs) have attracted significant interest from the research community due to their broad applicability in many language-oriented tasks, and are now widely used in numerous areas of production and daily life. One source of the powerful capabilities of LLMs is the massive scale of their pre-training dataset. However, these pre-training datasets contain many outdated, harmful, and personally sensitive information, which inevitably becomes memorized by LLM during the pre-training process. Eliminating this undesirable data is crucial for ensuring the model’s safety and enhancing the user experience. However, the cost of extensively cleaning the pre-training dataset and retraining the model from scratch is very high. In this work, we propose ULMR , a unlearning framework for LLMs , which first uses carefully designed prompts to rewrite the instructions in the specified dataset, and generate corresponding negative responses. Subsequently, to ensure that the model does not excessively deviate post-training, we perform model parameter averaging to preserve the performance of the original LLM. We conducted experiments on two public datasets, TOFU and RWKU, demonstrating that our method can effectively forget specified information while retaining the capabilities of the original LLM.