Chang Sun


2025

Machine Unlearning of Personally Identifiable Information in Large Language Models
Dan Parii | Thomas van Osch | Chang Sun
Proceedings of the Natural Legal Language Processing Workshop 2025

Pretrained LLMs are trained on massive web-scale datasets, which often contain personally identifiable information (PII), raising serious legal and ethical concerns. A key research challenge is how to effectively unlearn PII without degrading the model’s utility or leaving implicit knowledge that can be exploited. This study proposes UnlearnPII, a benchmark designed to evaluate the effectiveness of PII unlearning methods, addressing limitations in existing metrics that overlook implicit knowledge and assess all tokens equally. Our benchmark focuses on detecting PII leakage, testing model robustness through obfuscated prompts and jailbreak attacks across different domains, while measuring utility and retention quality. To advance practical solutions, we propose a new PII unlearning method, PERMUtok. By applying token-level noise, we achieve 1) simplified integration into existing workflows and 2) improved retention and output quality, while maintaining unlearning effectiveness. The code is open-source and publicly available.
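As a rough illustration of the token-level noise idea mentioned in the abstract, the sketch below perturbs the input embeddings of tokens flagged as PII during an unlearning fine-tuning step. The PII mask, noise scale, and use of Gaussian noise are illustrative assumptions, not the PERMUtok implementation; see the paper and code release for the actual method.

```python
# Illustrative sketch only: Gaussian noise on the embeddings of PII tokens
# during an unlearning fine-tuning step. Mask construction, noise scale, and
# loss handling are hypothetical choices, not the PERMUtok method itself.
import torch

def noisy_unlearning_step(model, input_ids, labels, pii_mask, noise_std=0.1):
    """One fine-tuning step with token-level noise on PII positions.

    pii_mask: bool tensor of shape (batch, seq_len), True where a token
    belongs to a PII span (e.g., from annotated spans in the forget set).
    """
    embed = model.get_input_embeddings()        # the LM's token embedding layer
    inputs_embeds = embed(input_ids)

    # Perturb only the embeddings at PII positions; other tokens are untouched.
    noise = torch.randn_like(inputs_embeds) * noise_std
    inputs_embeds = inputs_embeds + noise * pii_mask.unsqueeze(-1)

    outputs = model(inputs_embeds=inputs_embeds, labels=labels)
    outputs.loss.backward()                     # caller applies optimizer.step()
    return outputs.loss.item()
```

In a real pipeline the mask would come from the benchmark's PII annotations and the noise schedule would be tuned against the retention and utility metrics the abstract describes.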

Copyright Infringement by Large Language Models in the EU: Misalignment, Safeguards, and the Path Forward
Noah Scharrenberg | Chang Sun
Proceedings of the Natural Legal Language Processing Workshop 2025

This position paper argues that European copyright law has struggled to keep pace with the development of large language models (LLMs), possibly creating a fundamental epistemic misalignment: copyright compliance relies on qualitative, context-dependent standards, while LLM development is governed by quantitative, proactive metrics. This gap means that technical safeguards, by themselves, may be insufficient to reliably demonstrate legal compliance. We identify several practical limitations in the existing EU legal frameworks, including ambiguous “lawful access” rules, fragmented opt-outs, and vague disclosure duties. We then discuss technical measures such as provenance-first data governance, machine unlearning for post-hoc removal, and synthetic data generation, showing their promise but also their limits. Finally, we propose a path forward grounded in legal-technical co-design, suggesting directions for standardising machine-readable opt-outs and disclosure templates, clarifying core legal terms, and developing legally-informed benchmarks and evidence standards. We conclude that such an integrated framework is essential to make compliance auditable, thus protecting creators’ rights while enabling responsible AI innovation at scale.