Ziwei Zhu

2025

Toward Inclusive Language Models: Sparsity-Driven Calibration for Systematic and Interpretable Mitigation of Social Biases in LLMs
Prommy Sultana Hossain | Chahat Raj | Ziwei Zhu | Jessica Lin | Emanuela Marasco
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) such as GPT and LLaMA excel in natural language tasks, e.g., text generation and machine translation. However, inherent biases from training on vast Internet datasets potentially amplify harmful stereotypes—widely held, oversimplified, and often inaccurate generalizations about groups of people. Our contribution introduces a novel, systematic, and architecture-aware method to identify and mitigate stereotypical bias in decoder-only transformer models. This interpretable approach operates without gradient access or retraining from scratch. We first evaluate bias and then apply a bias localization mechanism that correlates internal activations with a newly defined Context Influence (CI) Score. Our method pinpoints specific attention heads that consistently align with biased shifts in model predictions. To mitigate this, we introduce a soft pruning strategy that scales attention head parameters based on their correlation strength, followed by lightweight fine-tuning to maintain fluent text generation. Experiments across five models demonstrate our approach reduces bias by up to 37% on BBQ, 32% on StereoSet, and 33% on CrowS-Pairs while simultaneously improving reasoning performance on MMLU by up to 10%.

pdf bib abs

What’s Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs
Jinhao Pan | Chahat Raj | Ziyu Yao | Ziwei Zhu
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) often exhibit social biases inherited from their training data. While existing benchmarks evaluate bias by term-based mode through direct term associations between demographic terms and bias terms, LLMs have become increasingly adept at avoiding biased responses, leading to seemingly low levels of bias. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level that bias concepts are hidden within naturalistic, subtly framed contexts in real-world scenarios rather than superficial terms. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in response at the term level, they continue to reinforce biases in nuanced settings. Data, code, and results are available at https://github.com/JP-25/Description-based-Bias-Benchmark.

pdf bib abs

ProtoLens: Advancing Prototype Learning for Fine-Grained Interpretability in Text Classification
Bowen Wei | Ziwei Zhu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we propose ProtoLens, a novel prototype-based model that provides fine-grained, sub-sentence level interpretability for text classification. ProtoLens uses a Prototype-aware Span Extraction module to identify relevant text spans associated with learned prototypes and a Prototype Alignment mechanism to ensure prototypes are semantically meaningful throughout training. By aligning the prototype embeddings with human-understandable examples, ProtoLens provides interpretable predictions while maintaining competitive accuracy. Extensive experiments demonstrate that ProtoLens outperforms both prototype-based and non-interpretable baselines on multiple text classification benchmarks. Code and data are available at https://github.com/weibowen555/ProtoLens.

pdf bib abs

Spurious Correlations and Beyond: Understanding and Mitigating Shortcut Learning in SDOH Extraction with Large Language Models
Fardin Ahsan Sakib | Ziwei Zhu | Karen Trister Grace | Meliha Yetisgen | Ozlem Uzuner
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Social determinants of health (SDOH) extraction from clinical text is critical for downstream healthcare analytics. Although large language models (LLMs) have shown promise, they may rely on superficial cues leading to spurious predictions. Using the MIMIC portion of the SHAC (Social History Annotation Corpus) dataset and focusing on drug status extraction as a case study, we demonstrate that mentions of alcohol or smoking can falsely induce models to predict current/past drug use where none is present, while also uncovering concerning gender disparities in model performance. We further evaluate mitigation strategies—such as prompt engineering and chain-of-thought reasoning—to reduce these false positives, providing insights into enhancing LLM reliability in health domains.

pdf bib abs

Fighting Spurious Correlations in Text Classification via a Causal Learning Perspective
Yuqing Zhou | Ziwei Zhu
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In text classification tasks, models often rely on spurious correlations for predictions, incorrectly associating irrelevant features with the target labels. This issue limits the robustness and generalization of models, especially when faced with out-of-distribution data where such spurious correlations no longer hold. To address this challenge, we propose the Causally Calibrated Robust Classifier (CCR), which aims to reduce models’ reliance on spurious correlations and improve model robustness. Our approach integrates a causal feature selection method based on counterfactual reasoning, along with an unbiased inverse propensity weighting (IPW) loss function. By focusing on selecting causal features, we ensure that the model relies less on spurious features during prediction. We theoretically justify our approach and empirically show that CCR achieves state-of-the-art performance among methods without group labels, and in some cases, it can compete with the models that utilize group labels. Our code can be found at: https://github.com/yuqing-zhou/Causal-Learning-For-Robust-Classifier.

2024

pdf bib abs

Global Gallery: The Fine Art of Painting Culture Portraits through Multilingual Instruction Tuning
Anjishnu Mukherjee | Aylin Caliskan | Ziwei Zhu | Antonios Anastasopoulos
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Exploring the intersection of language and culture in Large Language Models (LLMs), this study critically examines their capability to encapsulate cultural nuances across diverse linguistic landscapes. Central to our investigation are three research questions: the efficacy of language-specific instruction tuning, the impact of pretraining on dominant language data, and the identification of optimal approaches to elicit accurate cultural knowledge from LLMs. Utilizing the GeoMLaMA benchmark for multilingual commonsense knowledge and an adapted CAMeL dataset (English-only) for evaluation of nuanced cultural aspects, our experiments span six different languages and cultural contexts, revealing the extent of LLMs’ cultural awareness. Our findings highlight a nuanced landscape: while language-specific tuning and bilingual pretraining enhance cultural understanding in certain contexts, they also uncover inconsistencies and biases, particularly in non-Western cultures. This work expands our understanding of LLMs’ cultural competence and emphasizes the importance of integrating diverse cultural perspectives in their development, aiming for a more globally representative and equitable approach in language modeling.

pdf bib abs

Navigating the Shortcut Maze: A Comprehensive Analysis of Shortcut Learning in Text Classification by Language Models
Yuqing Zhou | Ruixiang Tang | Ziyu Yao | Ziwei Zhu
Findings of the Association for Computational Linguistics: EMNLP 2024

Language models (LMs), despite their advances, often depend on spurious correlations, undermining their accuracy and generalizability. This study addresses the overlooked impact of subtler, more complex shortcuts that compromise model reliability beyond oversimplified shortcuts. We introduce a comprehensive benchmark that categorizes shortcuts into occurrence, style, and concept, aiming to explore the nuanced ways in which these shortcuts influence the performance of LMs. Through extensive experiments across traditional LMs, large language models, and state-of-the-art robust models, our research systematically investigates models’ resilience and susceptibilities to sophisticated shortcuts. Our benchmark and code can be found at: https://github.com/yuqing-zhou/shortcut-learning-in-text-classification.

pdf bib abs

BiasDora: Exploring Hidden Biased Associations in Vision-Language Models
Chahat Raj | Anjishnu Mukherjee | Aylin Caliskan | Antonios Anastasopoulos | Ziwei Zhu
Findings of the Association for Computational Linguistics: EMNLP 2024

Existing works examining Vision-Language Models (VLMs) for social biases predominantly focus on a limited set of documented bias associations, such as gender-profession or race-crime. This narrow scope often overlooks a vast range of unexamined implicit associations, restricting the identification and, hence, mitigation of such biases. We address this gap by probing VLMs to (1) uncover hidden, implicit associations across 9 bias dimensions. We systematically explore diverse input and output modalities and (2) demonstrate how biased associations vary in their negativity, toxicity, and extremity. Our work (3) identifies subtle and extreme biases that are typically not recognized by existing methodologies. We make the **D**ataset **o**f **r**etrieved **a**ssociations (**Dora**) publicly available.

2023

pdf bib abs

Global Voices, Local Biases: Socio-Cultural Prejudices across Languages
Anjishnu Mukherjee | Chahat Raj | Ziwei Zhu | Antonios Anastasopoulos
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Human biases are ubiquitous but not uniform: disparities exist across linguistic, cultural, and societal borders. As large amounts of recent literature suggest, language models (LMs) trained on human data can reflect and often amplify the effects of these social biases. However, the vast majority of existing studies on bias are heavily skewed towards Western and European languages. In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies and yielding interesting findings about LM bias. We additionally enhance this data with culturally relevant information for each language, capturing local contexts on a global scale. Further, to encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more. Moreover, we delve deeper into the Indian linguistic landscape, conducting a comprehensive regional bias analysis across six prevalent Indian languages. Finally, we highlight the significance of these social biases and the new dimensions through an extensive comparison of embedding methods, reinforcing the need to address them in pursuit of more equitable language models.

pdf bib abs

PromptAttack: Probing Dialogue State Trackers with Adversarial Prompts
Xiangjue Dong | Yun He | Ziwei Zhu | James Caverlee
Findings of the Association for Computational Linguistics: ACL 2023

A key component of modern conversational systems is the Dialogue State Tracker (or DST), which models a user’s goals and needs. Toward building more robust and reliable DSTs, we introduce a prompt-based learning approach to automatically generate effective adversarial examples to probe DST models. Two key characteristics of this approach are: (i) it only needs the output of the DST with no need for model parameters, and (ii) it can learn to generate natural language utterances that can target any DST. Through experiments over state-of-the-art DSTs, the proposed framework leads to the greatest reduction in accuracy and the best attack success rate while maintaining good fluency and a low perturbation ratio. We also show how much the generated adversarial examples can bolster a DST through adversarial training. These results indicate the strength of prompt-based attacks on DSTs and leave open avenues for continued refinement.

pdf bib abs

Unsupervised Candidate Answer Extraction through Differentiable Masker-Reconstructor Model
Zhuoer Wang | Yicheng Wang | Ziwei Zhu | James Caverlee
Findings of the Association for Computational Linguistics: EMNLP 2023

Question generation is a widely used data augmentation approach with extensive applications, and extracting qualified candidate answers from context passages is a critical step for most question generation systems. However, existing methods for candidate answer extraction are reliant on linguistic rules or annotated data that face the partial annotation issue and challenges in generalization. To overcome these limitations, we propose a novel unsupervised candidate answer extraction approach that leverages the inherent structure of context passages through a Differentiable Masker-Reconstructor (DMR) Model with the enforcement of self-consistency for picking up salient information tokens. We curated two datasets with exhaustively-annotated answers and benchmark a comprehensive set of supervised and unsupervised candidate answer extraction methods. We demonstrate the effectiveness of the DMR model by showing its performance is superior among unsupervised methods and comparable to supervised methods.

pdf bib abs

Co²PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning
Xiangjue Dong | Ziwei Zhu | Zhuoer Wang | Maria Teleki | James Caverlee
Findings of the Association for Computational Linguistics: EMNLP 2023

Pre-trained Language Models are widely used in many important real-world applications. However, recent studies show that these models can encode social biases from large pre-training corpora and even amplify biases in downstream applications. To address this challenge, we propose Co²PT, an efficient and effective *debias-while-prompt tuning* method for mitigating biases via counterfactual contrastive prompt tuning on downstream tasks. Our experiments conducted on three extrinsic bias benchmarks demonstrate the effectiveness of Co²PT on bias mitigation during the prompt tuning process and its adaptability to existing upstream debiased language models. These findings indicate the strength of Co²PT and provide promising avenues for further enhancement in bias mitigation on downstream tasks.

2020

pdf bib abs

Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition
Yun He | Ziwei Zhu | Yin Zhang | Qin Chen | James Caverlee
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Knowledge of a disease includes information of various aspects of the disease, such as signs and symptoms, diagnosis and treatment. This disease knowledge is critical for many health-related and biomedical tasks, including consumer health question answering, medical language inference and disease name recognition. While pre-trained language models like BERT have shown success in capturing syntactic, semantic, and world knowledge from text, we find they can be further complemented by specific information like knowledge of symptoms, diagnoses, treatments, and other disease aspects. Hence, we integrate BERT with disease knowledge for improving these important tasks. Specifically, we propose a new disease knowledge infusion training procedure and evaluate it on a suite of BERT models including BERT, BioBERT, SciBERT, ClinicalBERT, BlueBERT, and ALBERT. Experiments over the three tasks show that these models can be enhanced in nearly all cases, demonstrating the viability of disease knowledge infusion. For example, accuracy of BioBERT on consumer health question answering is improved from 68.29% to 72.09%, while new SOTA results are observed in two datasets. We make our data and code freely available.

Co-authors

Venues

Fix author