Yixin Wan - ACL Anthology

Yixin Wan

2026

MTMCS-Bench: Evaluating Contextual Safety of Multimodal Large Language Models in Multi-Turn Dialogues
Zheyuan Liu | Dongwhi Kim | Yixin Wan | Xiangchi Yuan | Zhaoxuan Tan | Fengran Mo | Meng Jiang
Findings of the Association for Computational Linguistics: ACL 2026

Multimodal large language models (MLLMs) are increasingly deployed as assistants that interact through text and images, making it crucial to evaluate contextual safety when risk depends on both the visual scene and the evolving dialogue. Existing contextual safety benchmarks are mostly single-turn and often miss how malicious intent can emerge gradually or how the same scene can support both benign and exploitative goals. We introduce the Multi-Turn Multimodal Contextual Safety Benchmark (MTMCS-Bench), a benchmark of realistic images and multi-turn conversations that evaluates contextual safety in MLLMs under two complementary settings, escalation-based risk and context-switch risk. MTMCS-Bench offers paired safe and unsafe dialogues with structured evaluation. It contains over 30 thousand multimodal (image+text) and unimodal (text-only) samples, with metrics that separately measure contextual intent recognition, safety-awareness on unsafe cases, and helpfulness on benign ones. Across eight open-source and seven proprietary MLLMs, we observe persistent trade-offs between contextual safety and utility, with models tending to either miss gradual risks or over-refuse benign dialogues. Finally, we evaluate five current guardrails and find that they mitigate some failures but do not fully resolve multi-turn contextual risks.

Knowledge Control for Responsible Generative AI: Bridging Academia, Industry, and Society
Zheyuan Liu | Yixin Wan | Kai-Wei Chang | Meng Jiang | Jieyu Zhao | Nouha Dziri | Yuning Mao | Jia-Chen Gu | Jindong Gu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 5: Tutorial Abstracts)

Controlling the knowledge and behavior of generative AI systems, including large language models (LLMs), multimodal LLMs (MLLMs), and text-to-image (T2I) models, has become critical as they are increasingly used in safety-sensitive and socially impactful applications. These models often encode unintended, biased, or private content, leading to harmful or unethical outputs. Post-training knowledge control has thus emerged as a practical framework for selectively modifying or removing model behaviors without full retraining, offering scalable and interpretable interventions for improving safety, privacy, and fairness. This tutorial introduces the foundations of post-training knowledge control and showcases recent frontier methods, bridging research insights with real-world practices from both academia and industry. We cover: (i) key motivations and failure modes, such as harmful generation and stereotype reinforcement; (ii) core methods such as machine unlearning, knowledge editing, and inference-time interventions for targeted behavior adjustment; and (iii) evaluation protocols for balancing forgetting, retention, and fairness. Case studies will span text and vision–language generation, including privacy preservation, bias mitigation, and factual correction.

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
Di Wu | Yixin Wan | Kai-Wei Chang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We proposeVisualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a simple yet effective perspective for advancing in text-image retrieval. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.

InsideOut: Measuring and Mitigating Insider–Outsider Bias in Interview Script Generation
Yixin Wan | Xingrun Chen | Kai-Wei Chang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Advancements in Large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation.However, recent research raised concerns about culture-related fairness issues in LLM-generated content.In this work, we identify and systematically investigate LLMs’ **insider-outsider bias**, a phenomenon where models position themselves as "insiders" of mainstream cultures during generation while externalizing less dominant cultures.We propose the ***InsideOut*** benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a *culturally situated interview script generation* task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures.Empirical evaluation on 5 state-of-the-art LLMs reveals that while models adopt insider tones in over 88% US-contexted scripts on average, they disproportionately default to "outsider" stances for non-Western cultures.To mitigate these biases, we propose *2 inference-time methods*: a baseline prompt-based **Fairness Intervention Pillars (FIP)** method, and a structured **Mitigation via Fairness Agents (MFA)** framework consisting of a Single-Agent (MFA-SA), a Hierarchical-Agent (MFA-HA), and an autonomous Agentic Planning (MFA-Plan) pipeline.Empirical results demonstrate that agent-based MFA methods achieve outstanding and robust performance in mitigating the insider-outsider bias:For instance, on the Cultural Alignment Gap (CAG) metric, *MFA-SA reduces bias in Llama model by 89.70 % and MFA-HA mitigates bias in Qwen by 82.54%*.These findings showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.

2025

Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Trista Cao | Anubrata Das | Tharindu Kumarage | Yixin Wan | Satyapriya Krishna | Ninareh Mehrabi | Jwala Dhamala | Anil Ramakrishna | Aram Galystan | Anoop Kumar | Rahul Gupta | Kai-Wei Chang
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models
Anil Ramakrishna | Yixin Wan | Xiaomeng Jin | Kai-Wei Chang | Zhiqi Bu | Bhanukiran Vinzamuri | Volkan Cevher | Mingyi Hong | Rahul Gupta
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We introduce SemEval-2025 Task 4: unlearn- ing sensitive content from Large Language Models (LLMs). The task features 3 subtasks for LLM unlearning spanning different use cases: (1) unlearn long form synthetic creative documents spanning different genres; (2) un- learn short form synthetic biographies contain- ing personally identifiable information (PII), in- cluding fake names, phone number, SSN, email and home addresses, and (3) unlearn real docu- ments sampled from the target model’s training dataset. We received over 100 submissions from over 30 institutions and we summarize the key techniques and lessons in this paper.

Not Every Token Needs Forgetting: Selective Unlearning Balancing Forgetting and Utility in Large Language Models
Yixin Wan | Anil Ramakrishna | Kai-Wei Chang | Volkan Cevher | Rahul Gupta
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Model (LLM) unlearning has recently gained significant attention, driven by the need to remove unwanted information—such as private, sensitive, or copyrighted content—from trained models. However, conventional unlearning approaches indiscriminately update model parameters to forget all tokens in a target document, including common tokens (e.g., pronouns, prepositions, general nouns) that carry general knowledge. In this paper, we highlight that “not every token needs forgetting”. We propose **Selective Unlearning (SU)**, which identifies a critical subset of tokens within the forgetting set that is relevant to the unwanted information, and unlearns only those tokens. Experiments on two benchmarks and six baseline unlearning algorithms demonstrate that SU not only achieves effective unlearning on the targeted forget data, but also significantly preserves the model’s utility in the retaining set.

Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases
Jen-tse Huang | Yuhang Yan | Linqi Liu | Yixin Wan | Wenxuan Wang | Kai-Wei Chang | Michael R. Lyu
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for “fairness,” yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but socially unfair, or conversely, appear fair while distorting factual reality. We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation. We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments. Our queries are constructed from 19 statistics and are grounded in cognitive psychology, drawing on representativeness bias, attribution bias, and ingroup–outgroup bias to explain why models often misalign fact and fairness. Experiments across ten frontier models reveal different levels of fact-fair trade-offs. By reframing fairness evaluation, we provide both a new theoretical lens and a practical benchmark to advance the responsible model assessments. Our test suite is publicly available at https://github.com/uclanlp/Fact-or-Fair.

LUME: LLM Unlearning with Multitask Evaluations
Anil Ramakrishna | Yixin Wan | Xiaomeng Jin | Kai-Wei Chang | Zhiqi Bu | Bhanukiran Vinzamuri | Volkan Cevher | Mingyi Hong | Rahul Gupta
Findings of the Association for Computational Linguistics: EMNLP 2025

Unlearning aims to remove copyrighted, sensitive, or private content from large language models (LLMs) without a full retraining. In this work, we develop a multi-task unlearning benchmark LUME that features three tasks: (1) unlearn synthetically generated creative short novels, (2) unlearn synthetic biographies with sensitive information, and (3) unlearn a collection of public biographies. We further release two fine-tuned LLMs of 1B and 7B parameter sizes as the target models. We conduct detailed evaluations of several recently-proposed algorithms and present results on carefully crafted metrics to understand their behavior and limitations.

V-ALPHASOCIAL: Benchmark and Self-Reflective Chain-of-Thought Generation for Visual Social Commonsense Reasoning
Zongyu Lin | Zhikun Xu | Xiaohan Song | Yixin Wan | Xingcheng Yao | Tsung-Han Lin | Selina Song | Pranav Subbaraman | Ben Zhou | Kai-Wei Chang | Yizhou Sun
Findings of the Association for Computational Linguistics: ACL 2025

Social commonsense reasoning naturally involves both the verbal and non-verbal cues of a social interaction. It is important for Large Vision-Language Models (VLMs) to leverage both textual and visual information in performing tasks like social understanding and reasoning. However, while current LLMs have shown good social reasoning capabilities in textual context, whether they can effectively incorporate visual information in social comprehension remains under-explored. To narrow the gap, we first construct and propose a benchmark: V-Social, featuring well-aligned text and visual content, tailored to assess visual social commonsense for multimodal foundation models. Through experimenting with V-Social, we find that even the most advanced VLM, GPT-4o, often falls short in social commonsense reasoning. This highlights the critical need to enhance the social grounding of VLMs. One major obstacle for improving this is the lack of high-quality data with good reasoning process. To overcome this obstacle, we introduce V-AlphaSocial, a novel method that generates high-quality chain-of-thought reasoning paths from unlabeled data. We design a visual reasoning reward model to improve VLM, and then iteratively refine both the VLM and the reward model. Our extensive analysis showcases how our method enhances social commonsense reasoning, proposing an effective approach that facilitates deeper exploration into field.

The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects
Yixin Wan | Kai-Wei Chang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent large-scale T2I models like DALLE-3 have made progress in reducing gender stereotypes when generating single-person images. However, significant biases remain when generating images with more than one person. To systematically evaluate this, we propose the **Paired Stereotype Test (PST)** framework, which queries T2I models to depict two individuals assigned with male-stereotyped and female-stereotyped social identities, respectively (e.g. “a CEO” and “an Assistant”). This contrastive setting often triggers T2I models to generate gender-stereotyped images. Using PST, we evaluate two aspects of gender biases – the well-known **bias in gendered occupation** and a novel aspect: **bias in organizational power**. Experiments show that **over 74% images generated by DALLE-3 display gender-occupational biases**. Additionally, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. We further propose **FairCritic**, a novel and interpretable framework that leverages an LLM-based critic model to i) detect bias in generated images, and ii) adaptively provide feedback to T2I models for improving fairness. FairCritic achieves near-perfect fairness on PST, overcoming the limitations of previous prompt-based intervention approaches.

White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs
Yixin Wan | Kai-Wei Chang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Social biases can manifest in language agency. However, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous works often rely on string-matching techniques to identify agentic and communal words within texts, falling short of accurately classifying language agency. We introduce the **Language Agency Bias Evaluation (LABE)** benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE tests for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) LLM generations tend to demonstrate greater gender bias than human-written texts; (2) Models demonstrate remarkably higher levels of intersectional bias than the other bias aspects. (3) Prompt-based mitigation is unstable and frequently leads to bias exacerbation. Based on our observations, we propose **Mitigation via Selective Rewrite (MSR)**, a novel bias mitigation strategy that leverages an agency classifier to identify and selectively revise parts of generated texts that demonstrate communal traits. Empirical results prove MSR to be more effective and reliable than prompt-based mitigation method, showing a promising research direction.

2024

MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Shujin Wu | Yi Fung | Sha Li | Yixin Wan | Kai-Wei Chang | Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2024

Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy for questions of invalid, ambiguous, and personalizable nature to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE, (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied with well-defined metrics. Our evaluations on indicate poor performance of existing LVLMs, with the best-performing open-weights model only achieving an Aggregate Align Rate (AAR) of 0.28. In response, we introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. Then, the self-imagined data is formatted for conditional reinforcement learning. Experimental results show MACAROON effectively improves LVLMs’ capabilities to be proactively engaged (0.84 AAR) while maintaining comparable performance on general tasks.

The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention
Yixin Wan | Di Wu | Haoran Wang | Kai-Wei Chang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating real historical figures? In this work, we propose **DemOgraphic FActualIty Representation (DoFaiR)**, a benchmark to systematically quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756 meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose **Fact-Augmented Intervention** (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orienting model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions while preserving diversity.

2023

Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems
Yixin Wan | Jieyu Zhao | Aman Chadha | Nanyun Peng | Kai-Wei Chang
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent advancements in Large Language Models empower them to follow freeform instructions, including imitating generic or specific demographic personas in conversations. We define generic personas to represent demographic groups, such as “an Asian person”, whereas specific personas may take the form of specific popular Asian names like “Yumi”. While the adoption of personas enriches user experiences by making dialogue systems more engaging and approachable, it also casts a shadow of potential risk by exacerbating social biases within model responses, thereby causing societal harm through interactions with users. In this paper, we systematically study “persona biases”, which we define to be the sensitivity of dialogue models’ harmful behaviors contingent upon the personas they adopt. We categorize persona biases into biases in harmful expression and harmful agreement, and establish a comprehensive evaluation framework to measure persona biases in five aspects: Offensiveness, Toxic Continuation, Regard, Stereotype Agreement, and Toxic Agreement. Additionally, we propose to investigate persona biases by experimenting with UNIVERSALPERSONA, a systematically constructed persona dataset encompassing various types of both generic and specific model personas. Through benchmarking on four different models- including Blender, ChatGPT, Alpaca, and Vicuna- our study uncovers significant persona biases in dialogue systems. Our findings also underscore the pressing need to revisit the use of personas in dialogue agents to ensure safe application.

“Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters
Yixin Wan | George Pu | Jiao Sun | Aparna Garimella | Kai-Wei Chang | Nanyun Peng
Findings of the Association for Computational Linguistics: EMNLP 2023

Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content, including professional documents such as recommendation letters. Though bringing convenience, this application also introduces unprecedented fairness concerns. Model-generated reference letters might be directly used by users in professional scenarios. If underlying biases exist in these model-constructed letters, using them without scrutinization could lead to direct societal harms, such as sabotaging application success rates for female applicants. In light of this pressing issue, it is imminent and necessary to comprehensively study fairness issues and associated harms in this real-world use case. In this paper, we critically examine gender biases in LLM-generated reference letters. Drawing inspiration from social science findings, we design evaluation methods to manifest biases through 2 dimensions: (1) biases in language style and (2) biases in lexical content. We further investigate the extent of bias propagation by analyzing the hallucination bias of models, a term that we define to be bias exacerbation in model-hallucinated contents. Through benchmarking evaluation on 2 popular LLMs- ChatGPT and Alpaca, we reveal significant gender biases in LLM-generated recommendation letters. Our findings not only warn against using LLMs for this application without scrutinization, but also illuminate the importance of thoroughly studying hidden biases and harms in LLM-generated professional documents.

PIP: Parse-Instructed Prefix for Syntactically Controlled Paraphrase Generation
Yixin Wan | Kuan-Hao Huang | Kai-Wei Chang
Findings of the Association for Computational Linguistics: ACL 2023

Syntactically controlled paraphrase generation requires language models to generate paraphrases for sentences according to specific syntactic structures. Existing fine-tuning methods on this task is costly, as all parameters of the model need to be updated during the training process. Inspired by recent studies on parameter-efficient learning, we propose Parse-Instructed Prefix (PIP), a novel adaptation of prefix-tuning to tune large pre-trained language models on syntactically controlled paraphrase generation task in a low-data setting with significantly less training cost. We introduce two methods to instruct a model’s encoder prefix to capture syntax-related knowledge: direct initiation (PIP-Direct) and indirect optimization (PIP-Indirect). Comparing to traditional fine-tuning methods for this task, PIP is a compute-efficient alternative with 10 times less learnable parameters. Comparing to existing prefix-tuning methods, PIP excels at capturing syntax control information, achieving significantly higher performance at the same level of learnable parameter count.

TheoremQA: A Theorem-driven Question Answering Dataset
Wenhu Chen | Ming Yin | Max Ku | Pan Lu | Yixin Wan | Xueguang Ma | Jianyu Xu | Xinyi Wang | Tony Xia
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models’ capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4’s capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs’ capabilities to solve challenging science problems.

Does BERT Exacerbate Gender or L1 Biases in Automated English Speaking Assessment?
Alexander Kwako | Yixin Wan | Jieyu Zhao | Mark Hansen | Kai-Wei Chang | Li Cai
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

In English speaking assessment, pretrained large language models (LLMs) such as BERT can score constructed response items as accurately as human raters. Less research has investigated whether LLMs perpetuate or exacerbate biases, which would pose problems for the fairness and validity of the test. This study examines gender and native language (L1) biases in human and automated scores, using an off-the-shelf (OOS) BERT model. Analyses focus on a specific type of bias known as differential item functioning (DIF), which compares examinees of similar English language proficiency. Results show that there is a moderate amount of DIF, based on examinees’ L1 background in grade band 912. DIF is higher when scored by an OOS BERT model, indicating that BERT may exacerbate this bias; however, in practical terms, the degree to which BERT exacerbates DIF is very small. Additionally, there is more DIF for longer speaking items and for older examinees, but BERT does not exacerbate these patterns of DIF.

2022

Improving the Adversarial Robustness of NLP Models by Information Bottleneck
Cenyuan Zhang | Xiang Zhou | Yixin Wan | Xiaoqing Zheng | Kai-Wei Chang | Cho-Jui Hsieh
Findings of the Association for Computational Linguistics: ACL 2022

Existing studies have demonstrated that adversarial examples can be directly attributed to the presence of non-robust features, which are highly predictive, but can be easily manipulated by adversaries to fool NLP models. In this study, we explore the feasibility of capturing task-specific robust features, while eliminating the non-robust ones by using the information bottleneck theory. Through extensive experiments, we show that the models trained with our information bottleneck-based method are able to achieve a significant improvement in robust accuracy, exceeding performances of all the previously reported defense methods while suffering almost no performance drop in clean accuracy on SST-2, AGNEWS and IMDB datasets.

Using Item Response Theory to Measure Gender and Racial Bias of a BERT-based Automated English Speech Assessment System
Alexander Kwako | Yixin Wan | Jieyu Zhao | Kai-Wei Chang | Li Cai | Mark Hansen
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

Recent advances in natural language processing and transformer-based models have made it easier to implement accurate, automated English speech assessments. Yet, without careful examination, applications of these models may exacerbate social prejudices based on gender and race. This study addresses the need to examine potential biases of transformer-based models in the context of automated English speech assessment. For this purpose, we developed a BERT-based automated speech assessment system and investigated gender and racial bias of examinees’ automated scores. Gender and racial bias was measured by examining differential item functioning (DIF) using an item response theory framework. Preliminary results, which focused on a single verbal-response item, showed no statistically significant DIF based on gender or race for automated scores.

Co-authors

Volkan Cevher 3

Alexander Kwako 2

Bhanukiran Vinzamuri 2

Jwala Dhamala 1

Aram Galystan 1

Aparna Garimella 1

Cho-Jui Hsieh 1

Jen-tse Huang 1

Kuan - Hao Huang 1

Satyapriya Krishna 1

Tharindu Kumarage 1

Tsung-Han Lin 1

Michael R. Lyu 1

Ninareh Mehrabi 1

Pranav Subbaraman 1

Xingcheng Yao 1

Ming Yin (尹明) 1

Xiangchi Yuan 1

Cenyuan Zhang 1

Xiaoqing Zheng 1

Xiang Zhou (周翔) 1

Venues