Shuhan Yuan
2024
Discovering and Mitigating Indirect Bias in Attention-Based Model Explanations
Farsheed Haque | Depeng Xu | Shuhan Yuan
Findings of the Association for Computational Linguistics: NAACL 2024
As the field of Natural Language Processing (NLP) increasingly adopts transformer-based models, the issue of bias becomes more pronounced. Such bias, manifesting through stereotypes and discriminatory practices, can disadvantage certain groups. Our study focuses on direct and indirect bias in model explanations, where the model makes predictions relying heavily on identity tokens or their associated contexts. We present a novel analysis of bias in model explanations, especially the subtle indirect bias, underlining the limitations of traditional fairness metrics. We first define direct and indirect bias in model explanations, which complement fairness in predictions. We then develop an indirect bias discovery algorithm that quantitatively evaluates indirect bias in transformer models using their built-in self-attention matrices. We also propose an indirect bias mitigation algorithm that enforces fairness in transformer models by leveraging attention explanations. Our evaluation shows the significance of indirect bias and the effectiveness of our indirect bias discovery and mitigation algorithms.
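As a rough illustration of the kind of attention signal such an analysis starts from (this is a minimal sketch, not the paper's discovery algorithm), the snippet below measures the share of last-layer [CLS] attention that a BERT-style classifier places on identity tokens. The model name and the identity lexicon are placeholder assumptions.

```python
# Illustrative sketch: fraction of [CLS] attention mass on identity tokens,
# a crude proxy for direct bias in attention-based explanations.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "bert-base-uncased"  # assumed model; any BERT-style encoder works
IDENTITY_TOKENS = {"muslim", "black", "gay", "women"}  # hypothetical lexicon

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, output_attentions=True
)
model.eval()

def identity_attention_share(text: str) -> float:
    """Share of last-layer [CLS] attention placed on identity tokens."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.attentions: tuple of (batch, heads, seq, seq) tensors.
    # Take the last layer, average over heads, read the [CLS] row.
    cls_attn = out.attentions[-1].mean(dim=1)[0, 0]  # shape: (seq,)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    mask = torch.tensor([t in IDENTITY_TOKENS for t in tokens])
    return (cls_attn[mask].sum() / cls_attn.sum()).item()

print(identity_attention_share("the muslim man walked into the store"))
```

A direct-bias probe like this only sees identity tokens themselves; the indirect bias the paper targets lives in the context tokens associated with them, which this toy measure does not capture.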
2022
Generating Textual Adversaries with Minimal Perturbation
Xingyi Zhao | Lu Zhang | Depeng Xu | Shuhan Yuan
Findings of the Association for Computational Linguistics: EMNLP 2022
Many word-level adversarial attack approaches for textual data have been proposed in recent studies. However, due to the massive search space of candidate word combinations, existing approaches struggle to preserve the semantics of texts when crafting adversarial counterparts. In this paper, we develop a novel attack strategy that finds adversarial texts with high similarity to the original texts while introducing minimal perturbation. The rationale is that adversarial texts with small perturbations better preserve the semantic meaning of the original texts. Experiments show that, compared with state-of-the-art attack approaches, our approach achieves higher success rates and lower perturbation rates on four benchmark datasets.
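To make the minimal-perturbation idea concrete, here is a toy greedy word-substitution attack (an illustrative sketch under stated assumptions, not the paper's strategy): it tries one-word substitutions first, so any adversarial text it returns has the smallest possible edit distance from the original. The predict callable and the synonyms lexicon are assumed inputs.

```python
# Toy greedy word-substitution attack (illustrative only): keep the first
# single-word substitution that flips the victim classifier's prediction.
from typing import Callable, Dict, List, Optional

def minimal_perturbation_attack(
    text: str,
    predict: Callable[[str], int],   # victim classifier: text -> label
    synonyms: Dict[str, List[str]],  # hypothetical candidate lexicon
) -> Optional[str]:
    words = text.split()
    original_label = predict(text)
    # One-word perturbations only, so any success is minimally perturbed.
    for i, word in enumerate(words):
        for candidate in synonyms.get(word, []):
            perturbed = " ".join(words[:i] + [candidate] + words[i + 1:])
            if predict(perturbed) != original_label:
                return perturbed  # adversarial text found
    return None  # no single-word substitution flipped the label
```

Restricting the search to one-word edits sidesteps the combinatorial search space the abstract mentions, at the cost of a lower success rate than multi-word attacks.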
Robust Hate Speech Detection via Mitigating Spurious Correlations
Kshitiz Tiwari | Shuhan Yuan | Lu Zhang
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
We develop a novel robust hate speech detection model that can defend against both word- and character-level adversarial attacks. We identify that the essential factor making vanilla detection models vulnerable to adversarial attacks is the spurious correlation between certain target words in the text and the prediction label. To mitigate such spurious correlations, we describe the process of hate speech detection with a causal graph. We then employ causal strength to quantify the spurious correlation and formulate a regularized entropy loss function. We show that our method generalizes the backdoor adjustment technique in causal inference. Finally, empirical evaluation shows the efficacy of our method.
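As a schematic of what a causal-strength-regularized objective could look like (an assumption-laden sketch, not the paper's exact loss), the snippet below adds to standard cross-entropy a penalty on attention mass over tokens with high precomputed spurious-correlation scores. The token_attn tensor, the causal_strength scores, and the weight lam are all hypothetical inputs.

```python
# Schematic regularized loss (illustration only): cross-entropy plus a
# penalty that discourages attention on spuriously correlated tokens.
import torch
import torch.nn.functional as F

def regularized_loss(
    logits: torch.Tensor,           # (batch, num_classes)
    labels: torch.Tensor,           # (batch,)
    token_attn: torch.Tensor,       # (batch, seq) attention over input tokens
    causal_strength: torch.Tensor,  # (batch, seq) spurious-correlation scores
    lam: float = 0.1,               # regularization weight (hyperparameter)
) -> torch.Tensor:
    ce = F.cross_entropy(logits, labels)
    # Penalize attention concentrated on tokens the causal-strength
    # scores flag as spuriously correlated with the label.
    penalty = (token_attn * causal_strength).sum(dim=-1).mean()
    return ce + lam * penalty
```

The design choice here is to regularize the model's reliance on flagged tokens directly rather than removing them, so legitimate uses of target words in non-hateful contexts still inform the prediction.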