Ziang Xiao


2023

Evaluating Evaluation Metrics: A Framework for Analyzing NLG Evaluation Metrics using Measurement Theory
Ziang Xiao | Susu Zhang | Vivian Lai | Q. Vera Liao
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We address a fundamental challenge in Natural Language Generation (NLG) model evaluation—the design and evaluation of evaluation metrics. Recognizing the limitations of existing automatic metrics and the noise introduced by how current human evaluation is conducted, we propose MetricEval, a framework informed by measurement theory, the foundation of educational test design, for conceptualizing and evaluating the reliability and validity of NLG evaluation metrics. The framework formalizes the sources of measurement error and offers statistical tools for evaluating evaluation metrics based on empirical data. With our framework, one can quantify the uncertainty of the metrics to better interpret the results. To exemplify the use of our framework in practice, we analyzed a set of evaluation metrics for summarization and identified issues related to a conflated validity structure in human evaluation and reliability in LLM-based metrics. Through MetricEval, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics to advance robust and effective NLG models.
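
As an illustration of the measurement-theory perspective described in the abstract (not the paper's actual implementation), the following minimal sketch estimates the reliability of a metric from repeated scoring runs via Cronbach's alpha and quantifies uncertainty in the mean score with a bootstrap confidence interval; the array shapes and score values are assumed for the example.

```python
# Minimal sketch, assuming `scores` holds k repeated runs of an LLM-based
# metric on n generated outputs (rows: outputs, cols: runs). Illustrative only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Reliability estimate over k parallel measurements of the same outputs."""
    n_outputs, k = scores.shape
    run_variances = scores.var(axis=0, ddof=1).sum()   # variance of each run
    total_variance = scores.sum(axis=1).var(ddof=1)    # variance of summed scores
    return (k / (k - 1)) * (1 - run_variances / total_variance)

def bootstrap_ci(values: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """95% bootstrap confidence interval for the mean metric score."""
    rng = np.random.default_rng(seed)
    means = [values[rng.integers(0, len(values), len(values))].mean()
             for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

# Hypothetical data: 50 summaries scored by 3 independent runs of one metric.
scores = np.random.default_rng(1).normal(0.7, 0.1, size=(50, 3))
print("Cronbach's alpha:", round(cronbach_alpha(scores), 3))
print("95% CI for mean score:", bootstrap_ci(scores.mean(axis=1)))
```

A low alpha here would flag the metric as unreliable before any validity claims are made, which is the kind of diagnosis the framework is meant to support.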

ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
Ruoyao Wang | Graham Todd | Xingdi Yuan | Ziang Xiao | Marc-Alexandre Côté | Peter Jansen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this work, we investigate the capacity of language models to generate explicit, interpretable, and interactive world models of scientific and common-sense reasoning tasks. We operationalize this as a task of generating text games, expressed as hundreds of lines of Python code. To facilitate this task, we introduce ByteSized32, a corpus of 32 reasoning-focused text games totalling 20k lines of Python code. We empirically demonstrate that GPT-4 can use these games as templates for single-shot in-context learning, successfully producing runnable games on unseen topics in 28% of cases. When allowed to self-reflect on program errors, game runnability substantially increases to 58%. While evaluating simulation fidelity is labor-intensive, we introduce a suite of automated metrics to assess game fidelity, technical validity, adherence to task specifications, and winnability, showing a high degree of agreement with expert human ratings. We pose this as a challenge task to spur further development at the juncture of world modeling and code generation.
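
As a rough illustration of the "runnability" style of automated metric mentioned above (this is not the ByteSized32 evaluation harness), the sketch below writes a generated game program to a temporary file and checks whether it executes without crashing; the `game_source` string and timeout are assumptions for the example.

```python
# Illustrative sketch: crude runnability check for a generated text-game program.
import subprocess
import sys
import tempfile

def is_runnable(game_source: str, timeout_s: int = 30) -> bool:
    """Return True if the generated Python program exits without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(game_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Hypothetical usage with a trivial stand-in "game".
print(is_runnable("print('You are in a small lab. There is a beaker here.')"))
```

Checks like this only cover technical validity; fidelity and winnability require richer metrics or human judgment, as the paper notes.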

What should I Ask: A Knowledge-driven Approach for Follow-up Questions Generation in Conversational Surveys
Yubin Ge | Ziang Xiao | Jana Diesner | Heng Ji | Karrie Karahalios | Hari Sundaram
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation