Uyen Le


2026

Activation steering or editing hidden states to control language-model behavior can be framed as a causal mediation problem: inputs induce internal activations, a subset of which act as mediators transmitting targeted behaviors to outputs. We formalize a structural graph over transformer layers and derive front-door-style identification conditions that justify steering through mediating subspaces while preserving non-mediating features, thereby reducing confounding and off-target effects. Within this mediation-first view, we present CAS-BiPO, a sparse mediation steering approach that learns targeted behavioral interventions via regularized training. Empirically, our method achieves 97-100% of dense baseline effectiveness across four behavioral control tasks while using only 10-30% of activation dimensions. Learned masks concentrate 94.3% of steering effects in 26.7% of dimensions, with neurons exhibiting 2.2× higher activation changes, validating the sparse mediation hypothesis. Our causal framework provides theoretical grounding while CAS-BiPO demonstrates that end-to-end learning of interpretable, reliable interventions is both feasible and advantageous.
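The core mechanism described above, steering only through a sparse subset of mediating activation dimensions while leaving the rest untouched, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function and variable names (`steer_sparse`, `alpha`) are hypothetical, and the actual method learns the mask and steering direction via regularized training rather than fixing them by hand.

```python
import numpy as np

def steer_sparse(h, v, mask, alpha=1.0):
    """Add a masked steering vector to a hidden state.

    h    : (d,) hidden activation at some transformer layer
    v    : (d,) dense steering direction
    mask : (d,) 0/1 mask selecting the hypothesized mediating dimensions
    alpha: steering strength
    """
    # Only masked (mediating) dimensions are modified; non-mediating
    # features pass through unchanged, limiting off-target effects.
    return h + alpha * mask * v

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)       # toy hidden state
v = rng.normal(size=d)       # toy steering direction
mask = np.zeros(d)
mask[:2] = 1.0               # e.g. a small fraction of dims act as mediators

h_new = steer_sparse(h, v, mask, alpha=0.5)
changed = np.flatnonzero(h_new != h)
print(changed)  # only the masked dimensions move
```

In practice such an edit would be applied inside a forward hook at a chosen layer; the sparsity of `mask` is what the abstract's 10-30% figures refer to.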
Direct Preference Optimization (DPO) is a powerful approach for aligning large language models (LLMs) with human preferences by formulating preference learning as a supervised classification problem over pairwise human-labeled outputs, thereby enabling stable and efficient training. We show that DPO inherits bias from confounders (e.g., topic, style, user objectives) that shape data generation and carry through to training, hindering recovery of true human preferences. We address this from a causal perspective, proposing Causal Direct Preference Optimization (CDPO), a general framework that incorporates causal inference principles to mitigate the influence of confounders and sharpen the signal of genuine human preferences. Our approach preserves the tractability of direct optimization while enhancing robustness to spurious correlations and annotation biases. Empirical evaluations on benchmark datasets show that CDPO surpasses DPO-based baselines by achieving unbiased fine-tuning through causal reasoning, confirming the effectiveness of confounder-aware preference optimization.
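For context, the standard DPO objective that CDPO builds on scores each preference pair by the policy's log-probability margin over a frozen reference model. The sketch below implements that published loss for a single pair; the confounder adjustment that distinguishes CDPO is not detailed in the abstract and is therefore not shown.

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one pair where y_w is preferred over y_l.

    -log sigmoid(beta * ((logp_w_pi - logp_w_ref) - (logp_l_pi - logp_l_ref)))
    """
    margin = beta * ((logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref))
    # Numerically this is log(1 + exp(-margin)), the logistic loss on the margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that has moved toward the preferred answer (relative to the
# reference) gets a loss below log(2), the value at zero margin.
loss = dpo_loss(logp_w_policy=-10.0, logp_l_policy=-12.0,
                logp_w_ref=-11.0, logp_l_ref=-11.0, beta=0.1)
print(loss)
```

The abstract's point is that this margin can be inflated by confounders (topic, style, annotator objectives) correlated with the preference labels; CDPO's contribution is to correct for them while keeping this direct, RL-free form of optimization.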

2023

Pretrained language models have achieved superhuman performance on many Machine Reading Comprehension (MRC) benchmarks. Nevertheless, their relative inability to defend against adversarial attacks has spurred skepticism about their natural language understanding. In this paper, we ask whether training with unanswerable questions in SQuAD 2.0 can help improve the robustness of MRC models against adversarial attacks. To explore that question, we fine-tune three state-of-the-art language models on either SQuAD 1.1 or SQuAD 2.0 and then evaluate their robustness under adversarial attacks. Our experiments reveal that current models fine-tuned on SQuAD 2.0 do not initially appear to be any more robust than ones fine-tuned on SQuAD 1.1, yet they reveal a measure of hidden robustness that can be leveraged to realize actual performance gains. Furthermore, we find that the robustness of models fine-tuned on SQuAD 2.0 extends to additional out-of-domain datasets. Finally, we introduce a new adversarial attack to reveal what aspects of SQuAD 2.0 current MRC models are learning.