Ziang Xiao


2023

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
Ziang Xiao | Susu Zhang | Vivian Lai | Q. Vera Liao
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation—the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and the noise introduced by how current human evaluation is conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the sources of measurement error and offers statistical tools for evaluating evaluation metrics based on empirical data. With our framework, one can quantify the uncertainty of the metrics to better interpret the results. To exemplify the use of our framework in practice, we analyzed a set of evaluation metrics for summarization and identified issues related to a conflated validity structure in human evaluation and reliability in LLM-based metrics. Through MetricEval, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics to advance robust and effective NLG models.
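
As an illustration of the measurement-theory perspective described in the abstract (not the paper's actual implementation), the following minimal sketch estimates the reliability of a metric from repeated scoring runs via Cronbach's alpha and quantifies uncertainty in the mean score with a bootstrap confidence interval; the array shapes and score values are assumed for the example.

```python
# Minimal sketch, assuming `scores` holds k repeated runs of an LLM-based
# metric on n generated outputs (rows: outputs, cols: runs). Illustrative only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Reliability estimate over k parallel measurements of the same outputs."""
    n_outputs, k = scores.shape
    run_variances = scores.var(axis=0, ddof=1).sum()   # variance of each run
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of summed scores
    return (k / (k - 1)) * (1 - run_variances / total_variance)

def bootstrap_ci(values: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """95% bootstrap confidence interval for the mean metric score."""
    rng = np.random.default_rng(seed)
    means = [values[rng.integers(0, len(values), len(values))].mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

# Hypothetical data: 50 summaries scored by 3 independent runs of one metric.
scores = np.random.default_rng(1).normal(0.7, 0.1, size=(50, 3))
print("Cronbach's alpha:", round(cronbach_alpha(scores), 3))
print("95% CI for mean score:", bootstrap_ci(scores.mean(axis=1)))
```

A low alpha here would flag the metric as unreliable before any validity claims are made, which is the kind of diagnosis the framework is meant to support.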

ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Ruoyao Wang | Graham Todd | Xingdi Yuan | Ziang Xiao | Marc-Alexandre Côté | Peter Jansen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this work, we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32, a corpus of 32 reasoning-focused text games totalling 20k lines of Python code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 58%. While evaluating simulation fidelity is labor-intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.
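
As a rough illustration of the "runnability" style of automated metric mentioned above (this is not the ByteSized32 evaluation harness), the sketch below writes a generated game program to a temporary file and checks whether it executes without crashing; the `game_source` string and timeout are assumptions for the example.

```python
# Illustrative sketch: crude runnability check for a generated text-game program.
import subprocess
import sys
import tempfile

def is_runnable(game_source: str, timeout_s: int = 30) -> bool:
    """Return True if the generated Python program exits without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(game_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical usage with a trivial stand-in "game".
print(is_runnable("print('You are in a small lab. There is a beaker here.')"))
```

Checks like this only cover technical validity; fidelity and winnability require richer metrics or human judgment, as the paper notes.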

What should I Ask: A Knowledge-driven Approach for Follow-up Questions Generation in Conversational Surveys
Yubin Ge | Ziang Xiao | Jana Diesner | Heng Ji | Karrie Karahalios | Hari Sundaram
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation