Zhiyu Chen - ACL Anthology

Zhiyu Chen

Papers on this page may belong to the following people: Zhiyu Chen (Lehigh University)

2025

ObfusLM: Privacy-preserving Language Model Service against Embedding Inversion Attacks
Yu Lin | Ruining Yang | Yunlong Mao | Qizhi Zhang | Jue Hong | Quanwei Cai | Ye Wu | Huiqi Liu | Zhiyu Chen | Bing Duan | Sheng Zhong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rapid expansion of Machine Learning as a Service (MLaaS) for language models, concerns over the privacy of client inputs during inference or fine-tuning have correspondingly escalated. Recently, solutions have been proposed to safeguard client privacy through obfuscation techniques. However, these solutions incur a notable decline in model utility and mainly focus on classification tasks, rendering them impractical for real-world applications. Moreover, recent studies reveal that these obfuscation techniques, if not well designed, are susceptible to embedding inversion attacks (EIAs). In this paper, we devise ObfusLM, a privacy-preserving MLaaS framework for both classification and generation tasks. ObfusLM leverages a model obfuscation module to achieve privacy protection for both kinds of tasks. Based on (k, 𝜖)-anonymity, ObfusLM includes novel obfuscation algorithms to reach provable security against EIAs. Extensive experiments show that ObfusLM outperforms existing works in utility by 10% with a nearly 80% resistance rate against EIAs.

Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents’ ability to autonomously reproduce scientific research.

IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction
Kaiyu He | Mian Zhang | Shuo Yan | Peilin Wu | Zhiyu Chen
Findings of the Association for Computational Linguistics: ACL 2025

While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in holistic rule learning in interactive environments remains less explored. We introduce RULEARN, a novel benchmark to assess the rule-learning abilities of LLM agents in interactive settings. In RULEARN, agents strategically interact with simulated environments to gather observations, discern patterns, and solve complex problems. To enhance the rule-learning capabilities for LLM agents, we propose IDEA, a novel reasoning framework that integrates the process of **I**nduction, **De**duction, and **A**bduction. The IDEA agent generates initial hypotheses from limited observations through abduction, devises plans to validate these hypotheses or leverages them to solve problems via deduction, and refines previous hypotheses through induction, dynamically establishing and applying rules that mimic human rule-learning behaviors. Our evaluation of the IDEA framework, which involves five representative LLMs, demonstrates significant improvements over the baseline. Furthermore, our study with human participants reveals notable discrepancies in rule-learning behaviors between humans and LLMs. We believe our benchmark will serve as a valuable and challenging resource, and IDEA will provide crucial insights for the development of LLM agents capable of human-like rule learning in real-world scenarios. Our code and data have been released at: https://github.com/KaiyuHe998/RULEARN_IDEA.

Search Wisely: Mitigating Sub-optimal Agentic Searches By Reducing Uncertainty
Peilin Wu | Mian Zhang | Xinlu Zhang | Xinya Du | Zhiyu Chen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Agentic Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by enabling dynamic, multi-step reasoning and information retrieval. However, these systems often exhibit sub-optimal search behaviors like over-search (retrieving redundant information) and under-search (failing to retrieve necessary information), which hinder efficiency and reliability. This work formally defines and quantifies these behaviors, revealing their prevalence across multiple QA datasets and agentic RAG systems (e.g., one model could have avoided searching in 27.7% of its search steps). Furthermore, we demonstrate a crucial link between these inefficiencies and the models’ uncertainty regarding their own knowledge boundaries, where response accuracy correlates with the model’s uncertainty in its search decisions. To address this, we propose β-GRPO, a reinforcement learning-based training method that incorporates a confidence threshold to reward high-certainty search decisions. Experiments on seven QA benchmarks show that β-GRPO gives a 3B model better agentic RAG ability, outperforming other strong baselines with a 4% higher average exact match score.

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy
Mian Zhang | Xianjun Yang | Xinlu Zhang | Travis Labrum | Jamie C. Chiu | Shaun M. Eack | Fei Fang | William Yang Wang | Zhiyu Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

There is a significant gap between patient needs and available mental health support today. In this paper, we aim to thoroughly examine the potential of using Large Language Models (LLMs) to assist professional psychotherapy. To this end, we propose a new benchmark, CBT-Bench, for the systematic evaluation of cognitive behavioral therapy (CBT) assistance. We include three levels of tasks in CBT-Bench: **I: Basic CBT knowledge acquisition**, with the task of multiple-choice questions; **II: Cognitive model understanding**, with the tasks of cognitive distortion classification, primary core belief classification, and fine-grained core belief classification; **III: Therapeutic response generation**, with the task of generating responses to patient speech in CBT therapy sessions. These tasks encompass key aspects of CBT that could potentially be enhanced through AI assistance, while also outlining a hierarchy of capability requirements, ranging from basic knowledge recitation to engaging in real therapeutic conversations. We evaluated representative LLMs on our benchmark. Experimental results indicate that while LLMs perform well in reciting CBT knowledge, they fall short in complex real-world scenarios requiring deep analysis of patients’ cognitive structures and generating effective responses, suggesting potential future work.

2024

PATIENT-𝜓: Using Large Language Models to Simulate Patients for Training Mental Health Professionals
Ruiyi Wang | Stephanie Milani | Jamie C. Chiu | Jiayin Zhi | Shaun M. Eack | Travis Labrum | Samuel M Murphy | Nev Jones | Kate V Hardy | Hong Shen | Fei Fang | Zhiyu Chen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Mental illness remains one of the most critical public health issues. Despite its importance, many mental health professionals highlight a disconnect between their training and actual real-world patient practice. To help bridge this gap, we propose PATIENT-𝜓, a novel patient simulation framework for cognitive behavior therapy (CBT) training. To build PATIENT-𝜓, we construct diverse patient cognitive models based on CBT principles and use large language models (LLMs) programmed with these cognitive models to act as a simulated therapy patient. We propose an interactive training scheme, PATIENT-𝜓-TRAINER, for mental health trainees to practice a key skill in CBT – formulating the cognitive model of the patient – through role-playing a therapy session with PATIENT-𝜓. To evaluate PATIENT-𝜓, we conducted a comprehensive user study of 13 mental health trainees and 20 experts. The results demonstrate that practice using PATIENT-𝜓-TRAINER enhances the perceived skill acquisition and confidence of the trainees beyond existing forms of training such as textbooks, videos, and role-play with non-patients. Based on the experts’ perceptions, PATIENT-𝜓 is perceived to be closer to real patient interactions than GPT-4, and PATIENT-𝜓-TRAINER holds strong promise to improve trainee competencies. Our code and data are released at https://github.com/ruiyiw/patient-psi.

Multimodal Procedural Planning via Dual Text-Image Prompting
Yujie Lu | Pan Lu | Zhiyu Chen | Wanrong Zhu | Xin Eric Wang | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2024

Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities. To tackle this, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages zero-shot reasoning ability in large language models (LLMs) and compelling text-to-image generation ability from diffusion-based models. TIP improves the interaction in the dual modalities using Text-to-Image Bridge and Image-to-Text Bridge, allowing LLMs to guide the textual-grounded image plan generation and leveraging the descriptions of image plans to ground the textual plan reversely. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. Our results show compelling human preferences and automatic scores against unimodal and multimodal baselines on WIKIPLAN and RECIPEPLAN in terms of informativeness, temporal coherence, and plan accuracy.

2023

Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting
Zhiyu Chen | Yujie Lu | William Wang
Findings of the Association for Computational Linguistics: EMNLP 2023

Mental illness remains one of the most critical public health issues of our time, due to the severe scarcity and limited accessibility of professionals. Psychotherapy requires high-level expertise to conduct deep, complex reasoning and analysis on the cognition modeling of the patients. In the era of Large Language Models, we believe it is the right time to develop AI assistance for computational psychotherapy. We study the task of cognitive distortion detection and propose the Diagnosis of Thought (DoT) prompting. DoT performs diagnosis on the patient’s speech via three stages: subjectivity assessment to separate the facts and the thoughts; contrastive reasoning to elicit the reasoning processes supporting and contradicting the thoughts; and schema analysis to summarize the cognition schemas. The diagnosis rationales generated through the three stages are essential for assisting the professionals. Experiments demonstrate that DoT obtains significant improvements over ChatGPT for cognitive distortion detection, while generating high-quality rationales approved by human experts.

2022

ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering
Zhiyu Chen | Shiyang Li | Charese Smiley | Zhiqiang Ma | Sameena Shah | William Yang Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

With the recent advances in large pre-trained language models, researchers have achieved record performances in NLP tasks that mostly focus on language pattern matching. The community is experiencing a shift of the challenge from how to model language to how to imitate complex, human-like reasoning abilities. In this work, we investigate the application domain of finance that involves real-world, complex numerical reasoning. We propose a new large-scale dataset, ConvFinQA, aiming to study the chain of numerical reasoning in conversational question answering. Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations. We conduct comprehensive experiments and analyses with both the neural symbolic methods and the prompting-based methods, to provide insights into the reasoning mechanisms of these two divisions. We believe our new dataset should serve as a valuable resource to push forward the exploration of real-world, complex reasoning tasks as the next research focus. Our dataset and code are publicly available at https://github.com/czyssrs/ConvFinQA.

KETOD: Knowledge-Enriched Task-Oriented Dialogue
Zhiyu Chen | Bing Liu | Seungwhan Moon | Chinnadhurai Sankar | Paul Crook | William Yang Wang
Findings of the Association for Computational Linguistics: NAACL 2022

Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains. Towards building a human-like assistant that can converse naturally and seamlessly with users, it is important to build a dialogue system that conducts both types of conversations effectively. In this work, we investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model. To this end, we create a new dataset, KETOD (Knowledge-Enriched Task-Oriented Dialogue), where we naturally enrich task-oriented dialogues with chit-chat based on relevant entity knowledge. We also propose two new models, SimpleToDPlus and Combiner, for the proposed task. Experimental results on both automatic and human evaluations show that the proposed methods can significantly improve the performance in knowledge-enriched response generation while maintaining a competitive task-oriented dialog performance. We believe our new dataset will be a valuable resource for future studies. Our dataset and code are publicly available at https://github.com/facebookresearch/ketod.

Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI)
Wenhu Chen | Xinyun Chen | Zhiyu Chen | Ziyu Yao | Michihiro Yasunaga | Tao Yu | Rui Zhang
Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI)

2021

HULK: An Energy Efficiency Benchmark Platform for Responsible Natural Language Processing
Xiyou Zhou | Zhiyu Chen | Xiaoyong Jin | William Yang Wang
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Computation-intensive pretrained models have been taking the lead on many natural language processing benchmarks such as GLUE. However, energy efficiency in the process of model training and inference has become a critical bottleneck. We introduce HULK, a multi-task energy efficiency benchmarking platform for responsible natural language processing. With HULK, we compare pretrained models’ energy efficiency from the perspectives of time and cost. Baseline benchmarking results are provided for further analysis. The fine-tuning efficiency of different pretrained models can differ significantly among different tasks, and a smaller number of parameters does not necessarily imply better efficiency. We analyzed such a phenomenon and demonstrated the method for comparing the multi-task efficiency of pretrained models. Our platform is available at https://hulkbenchmark.github.io/.

FinQA: A Dataset of Numerical Reasoning over Financial Data
Zhiyu Chen | Wenhu Chen | Charese Smiley | Sameena Shah | Iana Borova | Dylan Langdon | Reema Moussa | Matt Beane | Ting-Hao Huang | Bryan Routledge | William Yang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The sheer volume of financial statements makes it difficult for humans to access and analyze a business’s financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments in our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset – the first of its kind – should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA.

NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions
Zhiyu Chen | Honglei Liu | Hu Xu | Seungwhan Moon | Hao Zhou | Bing Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

Existing conversational systems are mostly agent-centric, which assumes the user utterances will closely follow the system ontology. However, in real-world scenarios, it is highly desirable that users can speak freely and naturally. In this work, we attempt to build a user-centric dialogue system for conversational recommendation. As there is no clean mapping from a user’s free-form utterance to an ontology, we first model the user preferences as estimated distributions over the system ontology and map the user’s utterances to such distributions. Learning such a mapping poses new challenges in reasoning over various types of knowledge, ranging from factoid knowledge and commonsense knowledge to the users’ own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings, with 5.1k dialogues and 26k turns of high-quality user responses. We conduct experiments, showing both the usefulness and challenges of our problem setting. We believe NUANCED can serve as a valuable resource to push existing research from the agent-centric system to the user-centric system. The code and data are publicly available.

2020

Few-Shot NLG with Pre-Trained Language Model
Zhiyu Chen | Harini Eavani | Wenhu Chen | Yinyin Liu | William Yang Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural-based end-to-end approaches to natural language generation (NLG) from structured data or knowledge are data-hungry, making their adoption for real-world applications difficult with limited data. In this work, we propose the new task of few-shot natural language generation. Motivated by how humans tend to summarize tabular data, we propose a simple yet effective approach and show that it not only demonstrates strong performance but also provides good generalization across domains. The design of the model architecture is based on two aspects: content selection from input data and language modeling to compose coherent sentences, which can be acquired from prior knowledge. With just 200 training examples, across multiple domains, we show that our approach achieves very reasonable performance and outperforms the strongest baseline by an average of over 8.0 BLEU points. Our code and data can be found at https://github.com/czyssrs/Few-Shot-NLG

Logical Natural Language Generation from Open-Domain Tables
Wenhu Chen | Jianshu Chen | Yu Su | Zhiyu Chen | William Yang Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural natural language generation (NLG) models have recently shown remarkable progress in fluency and coherence. However, existing studies on neural NLG are primarily focused on surface-level realizations with limited emphasis on logical inference, an important aspect of human thinking and language. In this paper, we suggest a new NLG task where a model is tasked with generating natural language statements that can be logically entailed by the facts in an open-domain semi-structured table. To facilitate the study of the proposed logical NLG problem, we use the existing TabFact dataset, which features a wide range of logical/symbolic inferences, as our testbed, and propose new automatic metrics to evaluate the fidelity of generation models w.r.t. logical inference. The new task poses challenges to the existing monotonic generation frameworks due to the mismatch between sequence order and logical order. In our experiments, we comprehensively survey different generation architectures (LSTM, Transformer, Pre-Trained LM) trained with different algorithms (RL, Adversarial Training, Coarse-to-Fine) on the dataset and make the following observations: 1) Pre-Trained LM can significantly boost both the fluency and logical fidelity metrics, 2) RL and Adversarial Training trade fluency for fidelity, 3) Coarse-to-Fine generation can help partially alleviate the fidelity issue while maintaining high language fluency. The code and data are available at https://github.com/wenhuchen/LogicNLG.

Logic2Text: High-Fidelity Natural Language Generation from Logical Forms
Zhiyu Chen | Wenhu Chen | Hanwen Zha | Xiyou Zhou | Yunkai Zhang | Sairam Sundaresan | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2020

Previous studies on Natural Language Generation (NLG) from structured data have primarily focused on surface-level descriptions of record sequences. However, for complex structured data, e.g., multi-row tables, it is often desirable for an NLG system to describe interesting facts from logical inferences across records. If only provided with the table, it is hard for existing models to produce controllable and high-fidelity logical generations. In this work, we formulate high-fidelity NLG as generation from logical forms in order to obtain controllable and faithful generations. We present a new large-scale dataset, Logic2Text, with 10,753 descriptions involving common logic types paired with the underlying logical forms. The logical forms show a diversified, free-schema graph structure, which poses great challenges to the model’s ability to understand the semantics. We experiment on (1) fully-supervised training with the full dataset, and (2) a few-shot setting, provided with hundreds of paired examples. We compare several popular generation models and analyze their performances. We hope our dataset can encourage research towards building an advanced NLG system capable of natural, faithful, and human-like generation. The dataset and code are available at https://github.com/czyssrs/Logic2Text.

HybridQA: A Dataset of Multi-Hop Question Answering over Tabular and Textual Data
Wenhu Chen | Hanwen Zha | Zhiyu Chen | Wenhan Xiong | Hong Wang | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2020

Existing question answering datasets focus on dealing with homogeneous information, based either only on text or only on KB/Table information. However, as human knowledge is distributed over heterogeneous forms, using homogeneous information alone might lead to severe coverage problems. To fill in the gap, we present HybridQA, a new large-scale question-answering dataset that requires reasoning on heterogeneous information. Each question is aligned with a Wikipedia table and multiple free-form corpora linked with the entities in the table. The questions are designed to aggregate both tabular information and text information, i.e., lack of either form would render the question unanswerable. We test with three different models: 1) a table-only model, 2) a text-only model, and 3) a hybrid model that combines heterogeneous information to find the answer. The experimental results show that the EM scores obtained by the two baselines are below 20%, while the hybrid model can achieve an EM over 40%. This gap suggests the necessity to aggregate heterogeneous information in HybridQA. However, the hybrid model’s score is still far behind human performance. Hence, HybridQA can serve as a challenging benchmark to study question answering with heterogeneous information.

2019

How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection
Wenhu Chen | Yu Su | Yilin Shen | Zhiyu Chen | Xifeng Yan | William Yang Wang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

With the rapid development in deep learning, deep neural networks have been widely adopted in many real-life natural language applications. Under deep neural networks, a pre-defined vocabulary is required to vectorize text inputs. The canonical approach to select pre-defined vocabulary is based on the word frequency, where a threshold is selected to cut off the long tail distribution. However, we observed that such a simple approach could easily lead to under-sized vocabulary or over-sized vocabulary issues. Therefore, we are interested in understanding how the end-task classification accuracy is related to the vocabulary size and what is the minimum required vocabulary size to achieve a specific performance. In this paper, we provide a more sophisticated variational vocabulary dropout (VVD) based on variational dropout to perform vocabulary selection, which can intelligently select the subset of the vocabulary to achieve the required performance. To evaluate different algorithms on the newly proposed vocabulary selection problem, we propose two new metrics: Area Under Accuracy-Vocab Curve and Vocab Size under X% Accuracy Drop. Through extensive experiments on various NLP classification tasks, our variational framework is shown to significantly outperform the frequency-based and other selection baselines on these metrics.

Global Textual Relation Embedding for Relational Understanding
Zhiyu Chen | Hanwen Zha | Honglei Liu | Wenhu Chen | Xifeng Yan | Yu Su
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Pre-trained embeddings such as word embeddings and sentence embeddings are fundamental tools facilitating a wide range of downstream NLP tasks. In this work, we investigate how to learn a general-purpose embedding of textual relations, defined as the shortest dependency path between entities. Textual relation embedding provides a level of knowledge between word/phrase level and sentence level, and we show that it can facilitate downstream tasks requiring relational understanding of the text. To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase. We use global co-occurrence statistics between textual and knowledge base relations as the supervision signal to train the embedding. Evaluation on two relational understanding tasks demonstrates the usefulness of the learned textual relation embedding. The data and code can be found at https://github.com/czyssrs/GloREPlus

Co-authors

Jamie C. Chiu 2

Shaun M. Eack 2

Travis Labrum 2

Seungwhan Moon 2

Charese Smiley 2

Paul A. Crook 1

Harini Eavani 1

Ting-Hao Huang 1

Dylan Langdon 1

George Michalopoulos 1

Stephanie Milani 1

Samuel M Murphy 1

Bryan R. Routledge 1

Chinnadhurai Sankar 1

Sairam Sundaresan 1

Xin Eric Wang 1

Michihiro Yasunaga 1

Hao Zhou (昊周) 1
