Volkan Cevher


2025

Not Every Token Needs Forgetting: Selective Unlearning Balancing Forgetting and Utility in Large Language Models
Yixin Wan | Anil Ramakrishna | Kai-Wei Chang | Volkan Cevher | Rahul Gupta
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information—such as private, sensitive, or copyrighted content—from trained models. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that “not every token needs forgetting”. We propose Selective Unlearning (SU), which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model’s utility in the retaining set.
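A minimal sketch of the core idea, assuming a PyTorch setup: the unlearning loss is computed per token and masked so that gradient ascent touches only the selected positions. The construction of `token_mask` stands in for the paper's token-selection step and is purely illustrative.

```python
import torch
import torch.nn.functional as F


def selective_unlearning_loss(
    logits: torch.Tensor, labels: torch.Tensor, token_mask: torch.Tensor
) -> torch.Tensor:
    """Gradient-ascent unlearning loss restricted to selected token positions.

    logits: (B, T, V) model outputs; labels: (B, T) target token ids;
    token_mask: (B, T) with 1.0 where a token should be forgotten and 0.0
    elsewhere (e.g., zero for pronouns, prepositions, and other general tokens).
    """
    per_token_nll = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="none"
    ).view_as(labels)
    masked = per_token_nll * token_mask.float()
    # Negate so that minimizing this loss increases the NLL of the selected tokens only.
    return -masked.sum() / token_mask.float().sum().clamp(min=1.0)
```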

LUME: LLM Unlearning with Multitask Evaluations
Anil Ramakrishna | Yixin Wan | Xiaomeng Jin | Kai-Wei Chang | Zhiqi Bu | Bhanukiran Vinzamuri | Volkan Cevher | Mingyi Hong | Rahul Gupta
Findings of the Association for Computational Linguistics: EMNLP 2025

Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without full retraining. In this work, we develop LUME, a multi-task unlearning benchmark featuring three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameters as the target models. We conduct detailed evaluations of several recently proposed algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate
Xiaomeng Jin | Zhiqi Bu | Bhanukiran Vinzamuri | Anil Ramakrishna | Kai-Wei Chang | Volkan Cevher | Mingyi Hong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Machine unlearning has been used to remove unwanted knowledge acquired by large language models (LLMs). In this paper, we examine machine unlearning from an optimization perspective, framing it as a regularized multi-task optimization problem in which one task optimizes a forgetting objective and another optimizes model performance. In particular, we introduce a normalized gradient difference algorithm, which gives us better control over the trade-off between the two objectives, and we integrate a new, automatic learning-rate scheduler. We provide a theoretical analysis and empirically demonstrate the superior performance of our method among state-of-the-art unlearning methods on the TOFU and MUSE datasets, while exhibiting stable training.
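One plausible reading of the normalized-gradient-difference update, sketched below under the assumption that both task gradients are rescaled to comparable norms before being combined; the paper's exact update rule and its adaptive learning-rate scheduler may differ from this illustration.

```python
import torch


def normalized_gradient_difference(
    grad_retain: torch.Tensor, grad_forget: torch.Tensor, eps: float = 1e-12
) -> torch.Tensor:
    """Combine retain and forget gradients after per-task normalization.

    grad_retain, grad_forget: flattened parameter gradients (1-D tensors).
    Returns a descent direction that decreases the retain loss while ascending
    the forget loss, with neither task dominating merely because its gradient
    has a larger norm.
    """
    g_r = grad_retain / (grad_retain.norm() + eps)
    g_f = grad_forget / (grad_forget.norm() + eps)
    return g_r - g_f
```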

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models
Anil Ramakrishna | Yixin Wan | Xiaomeng Jin | Kai-Wei Chang | Zhiqi Bu | Bhanukiran Vinzamuri | Volkan Cevher | Mingyi Hong | Rahul Gupta
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We introduce SemEval-2025 Task 4: unlearning sensitive content from Large Language Models (LLMs). The task features three subtasks for LLM unlearning spanning different use cases: (1) unlearn long-form synthetic creative documents spanning different genres; (2) unlearn short-form synthetic biographies containing personally identifiable information (PII), including fake names, phone numbers, SSNs, and email and home addresses; and (3) unlearn real documents sampled from the target model’s training dataset. We received over 100 submissions from over 30 institutions, and we summarize the key techniques and lessons in this paper.

2024

Extreme Miscalibration and the Illusion of Adversarial Robustness
Vyas Raina | Samson Tan | Volkan Cevher | Aditya Rawal | Sheng Zha | George Karypis
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training to improve genuine robustness.
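A minimal sketch of the test-time counter-measure described above, assuming a standard PyTorch classifier: dividing the logits by a temperature greater than one softens an over-confident (miscalibrated) model and restores informative gradients for the attack's loss. The temperature value here is illustrative, not taken from the paper.

```python
import torch.nn.functional as F


def calibrated_attack_loss(model, inputs, targets, temperature: float = 10.0):
    """Cross-entropy loss on temperature-scaled logits.

    An attacker can backpropagate through this loss instead of the raw one,
    undoing the gradient masking caused by extreme miscalibration.
    """
    logits = model(inputs) / temperature
    return F.cross_entropy(logits, targets)
```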