Yuan Zhang - ACL Anthology

Yuan Zhang

2025

Chatbot To Help Patients Understand Their Health
Won Seok Jang | Hieu Tran | Manav Shaileshkumar Mistry | Sai Kiran Gandluri | Yifan Zhang | Sharmin Sultana | Sunjae Kwon | Yuan Zhang | Zonghai Yao | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2025

Patients must possess the knowledge necessary to actively participate in their care. To this end, we developed NoteAid-Chatbot, a conversational AI designed to help patients better understand their health through a novel framework of learning as conversation. We introduce a new learning paradigm that leverages a multi-agent large language model (LLM) and reinforcement learning (RL) framework—without relying on costly human-generated training data. Specifically, NoteAid-Chatbot was built on a lightweight 3-billion-parameter LLaMA 3.2 model using a two-stage training approach: initial supervised fine-tuning on conversational data synthetically generated using medical conversation strategies, followed by RL with rewards derived from patient understanding assessments in simulated hospital discharge scenarios. Our evaluation, which includes comprehensive human-aligned assessments and case studies, demonstrates that NoteAid-Chatbot exhibits key emergent behaviors critical for patient education—such as clarity, relevance, and structured dialogue—even though it received no explicit supervision for these attributes. Our results show that even simple Proximal Policy Optimization (PPO)-based reward modeling can successfully train lightweight, domain-specific chatbots to handle multi-turn interactions, incorporate diverse educational strategies, and meet nuanced communication objectives. Our Turing test demonstrates that NoteAid-Chatbot surpasses non-expert human. Although our current focus is on healthcare, the framework we present illustrates the feasibility and promise of applying low-cost, PPO-based RL to realistic, open-ended conversational domains—broadening the applicability of RL-based alignment methods.

Evaluating Evaluation Metrics – The Mirage of Hallucination Detection
Atharva Kulkarni | Yuan Zhang | Joel Ruben Antony Moniz | Xiou Ge | Bo-Hsiang Tseng | Dhivya Piraviperumal | Swabha Swayamdipta | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2025

Hallucinations pose a significant obstacle to the reliability and widespread adoption of language models, yet their accurate measurement remains a persistent challenge. While many task- and domain-specific metrics have been proposed to assess faithfulness and factuality concerns, the robustness and generalization of these metrics are still untested. In this paper, we conduct a large-scale empirical evaluation of 6 diverse sets of hallucination detection metrics across 4 datasets, 37 language models from 5 families, and 5 decoding methods. Our extensive investigation reveals concerning gaps in current hallucination evaluation: metrics often fail to align with human judgments, take an overtly myopic view of the problem, and show inconsistent gains with parameter scaling. Encouragingly, LLM-based evaluation, particularly with GPT-4, yields the best overall results, and mode-seeking decoding methods seem to reduce hallucinations, especially in knowledge-grounded settings. These findings underscore the need for more robust metrics to understand and quantify hallucinations, and better strategies to mitigate them.

2024

Can Large Language Models Understand Context?
Yilun Zhu | Joel Ruben Antony Moniz | Shruti Bhargava | Jiarui Lu | Dhivya Piraviperumal | Site Li | Yuan Zhang | Hong Yu | Bo-Hsiang Tseng
Findings of the Association for Computational Linguistics: EACL 2024

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. This benchmark comprises of four distinct tasks and nine datasets, all featuring prompts designed to assess the models’ ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.

ReALM: Reference Resolution as Language Modeling
Joel Ruben Antony Moniz | Soundarya Krishnan | Melis Ozyildirim | Prathamesh Saraf | Halim Cagri Ates | Yuan Zhang | Hong Yu
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Reference resolution is an important problem, one that is essential to understand and successfully handle contexts of different kinds. This context includes both previous turns and context that pertains to non-conversational entities, such as entities on the user’s screen or those running in the background. While LLMs have been shown to be extremely powerful for a variety of tasks, their use in reference resolution, particularly for non-conversational entities, remains underutilized. This paper demonstrates how LLMs can be used to create an effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality. We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references. We also benchmark against GPT-3.5 and GPT-4, with our smallest model achieving performance comparable to that of GPT-4, and our larger models substantially outperforming it.

LJPCheck: Functional Tests for Legal Judgment Prediction
Yuan Zhang | Wanhong Huang | Yi Feng | Chuanyi Li | Zhiwei Fei | Jidong Ge | Bin Luo | Vincent Ng
Findings of the Association for Computational Linguistics: ACL 2024

Legal Judgment Prediction (LJP) refers to the task of automatically predicting judgment results (e.g., charges, law articles and term of penalty) given the fact description of cases. While SOTA models have achieved high accuracy and F1 scores on public datasets, existing datasets fail to evaluate specific aspects of these models (e.g., legal fairness, which significantly impact their applications in real scenarios). Inspired by functional testing in software engineering, we introduce LJPCHECK, a suite of functional tests for LJP models, to comprehend LJP models’ behaviors and offer diagnostic insights. We illustrate the utility of LJPCHECK on five SOTA LJP models. Extensive experiments reveal vulnerabilities in these models, prompting an in-depth discussion into the underlying reasons of their shortcomings.

2023

2022

Mask the Correct Tokens: An Embarrassingly Simple Approach for Error Correction
Kai Shen | Yichong Leng | Xu Tan | Siliang Tang | Yuan Zhang | Wenjie Liu | Edward Lin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Text error correction aims to correct the errors in text sequences such as those typed by humans or generated by speech recognition models.Previous error correction methods usually take the source (incorrect) sentence as encoder input and generate the target (correct) sentence through the decoder. Since the error rate of the incorrect sentence is usually low (e.g., 10%), the correction model can only learn to correct on limited error tokens but trivially copy on most tokens (correct tokens), which harms the effective training of error correction. In this paper, we argue that the correct tokens should be better utilized to facilitate effective training and then propose a simple yet effective masking strategy to achieve this goal.Specifically, we randomly mask out a part of the correct tokens in the source sentence and let the model learn to not only correct the original error tokens but also predict the masked tokens based on their context information. Our method enjoys several advantages: 1) it alleviates trivial copy; 2) it leverages effective training signals from correct tokens; 3) it is a plug-and-play module and can be applied to different models and tasks. Experiments on spelling error correction and speech recognition error correction on Mandarin datasets and grammar error correction on English datasets with both autoregressive and non-autoregressive generation models show that our method improves the correctionaccuracy consistently.

2021

Controllable Semantic Parsing via Retrieval Augmentation
Panupong Pasupat | Yuan Zhang | Kelvin Guu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In practical applications of semantic parsing, we often want to rapidly change the behavior of the parser, such as enabling it to handle queries in a new domain, or changing its predictions on certain targeted queries. While we can introduce new training examples exhibiting the target behavior, a mechanism for enacting such behavior changes without expensive model re-training would be preferable. To this end, we propose ControllAble Semantic Parser via Exemplar Retrieval (CASPER). Given an input query, the parser retrieves related exemplars from a retrieval index, augments them to the query, and then applies a generative seq2seq model to produce an output parse. The exemplars act as a control mechanism over the generic generative model: by manipulating the retrieval index or how the augmented query is constructed, we can manipulate the behavior of the parser. On the MTOP dataset, in addition to achieving state-of-the-art on the standard setup, we show that CASPER can parse queries in a new domain, adapt the prediction toward the specified patterns, or adapt to new semantic schemas without having to further re-train the model.

QA-Driven Zero-shot Slot Filling with Weak Supervision Pretraining
Xinya Du | Luheng He | Qi Li | Dian Yu | Panupong Pasupat | Yuan Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Slot-filling is an essential component for building task-oriented dialog systems. In this work, we focus on the zero-shot slot-filling problem, where the model needs to predict slots and their values, given utterances from new domains without training on the target domain. Prior methods directly encode slot descriptions to generalize to unseen slot types. However, raw slot descriptions are often ambiguous and do not encode enough semantic information, limiting the models’ zero-shot capability. To address this problem, we introduce QA-driven slot filling (QASF), which extracts slot-filler spans from utterances with a span-based QA model. We use a linguistically motivated questioning strategy to turn descriptions into questions, allowing the model to generalize to unseen slot types. Moreover, our QASF model can benefit from weak supervision signals from QA pairs synthetically generated from unlabeled conversations. Our full system substantially outperforms baselines by over 5% on the SNIPS benchmark.

Few-shot Intent Classification and Slot Filling with Retrieved Examples
Dian Yu | Luheng He | Yuan Zhang | Xinya Du | Panupong Pasupat | Qi Li
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Few-shot learning arises in important practical scenarios, such as when a natural language understanding system needs to learn new semantic labels for an emerging, resource-scarce domain. In this paper, we explore retrieval-based methods for intent classification and slot filling tasks in few-shot settings. Retrieval-based methods make predictions based on labeled examples in the retrieval index that are similar to the input, and thus can adapt to new domains simply by changing the index without having to retrain the model. However, it is non-trivial to apply such methods on tasks with a complex label space like slot filling. To this end, we propose a span-level retrieval method that learns similar contextualized representations for spans with the same label via a novel batch-softmax objective. At inference time, we use the labels of the retrieved spans to construct the final structure with the highest aggregated score. Our method outperforms previous systems in various few-shot settings on the CLINC and SNIPS benchmarks.

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.

2020

Mapping Natural Language Instructions to Mobile UI Action Sequences
Yang Li | Jiacong He | Xin Zhou | Yuan Zhang | Jason Baldridge
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present a new problem: grounding natural language instructions to mobile user interface actions, and create three new datasets for it. For full task evaluation, we create PixelHelp, a corpus that pairs English instructions with actions performed by people on a mobile UI emulator. To scale training, we decouple the language and action data by (a) annotating action phrase spans in How-To instructions and (b) synthesizing grounded descriptions of actions for mobile user interfaces. We use a Transformer to extract action phrase tuples from long-range natural language instructions. A grounding Transformer then contextually represents UI objects using both their content and screen position and connects them to object descriptions. Given a starting screen and instruction, our model achieves 70.59% accuracy on predicting complete ground-truth action sequences in PixelHelp.

LIT Team’s System Description for Japanese-Chinese Machine Translation Task in IWSLT 2020
Yimeng Zhuang | Yuan Zhang | Lijie Wang
Proceedings of the 17th International Conference on Spoken Language Translation

This paper describes the LIT Team’s submission to the IWSLT2020 open domain translation task, focusing primarily on Japanese-to-Chinese translation direction. Our system is based on the organizers’ baseline system, but we do more works on improving the Transform baseline system by elaborate data pre-processing. We manage to obtain significant improvements, and this paper aims to share some data processing experiences in this translation task. Large-scale back-translation on monolingual corpus is also investigated. In addition, we also try shared and exclusive word embeddings, compare different granularity of tokens like sub-word level. Our Japanese-to-Chinese translation system achieves a performance of BLEU=34.0 and ranks 2nd among all participating systems.

2019

Large-Scale Representation Learning from Visually Grounded Untranscribed Speech
Gabriel Ilharco | Yuan Zhang | Jason Baldridge
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Systems that can associate images with their spoken audio captions are an important step towards visually grounded language learning. We describe a scalable method to automatically generate diverse audio for image captioning datasets. This supports pretraining deep networks for encoding both audio and images, which we do via a dual encoder that learns to align latent representations from both modalities. We show that a masked margin softmax loss for such models is superior to the standard triplet loss. We fine-tune these models on the Flickr8k Audio Captions Corpus and obtain state-of-the-art results—improving recall in the top 10 from 29.6% to 49.5%. We also obtain human ratings on retrieval outputs to better assess the impact of incidentally matching image-caption pairs that were not associated in the data, finding that automatic evaluation substantially underestimates the quality of the retrieved results.

PAWS: Paraphrase Adversaries from Word Scrambling
Yuan Zhang | Jason Baldridge | Luheng He
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Existing paraphrase identification datasets lack sentence pairs that have high lexical overlap without being paraphrases. Models trained on such data fail to distinguish pairs like flights from New York to Florida and flights from Florida to New York. This paper introduces PAWS (Paraphrase Adversaries from Word Scrambling), a new dataset with 108,463 well-formed paraphrase and non-paraphrase pairs with high lexical overlap. Challenging pairs are generated by controlled word swapping and back translation, followed by fluency and paraphrase judgments by human raters. State-of-the-art models trained on existing datasets have dismal performance on PAWS (<40% accuracy); however, including PAWS training data for these models improves their accuracy to 85% while maintaining performance on existing tasks. In contrast, models that do not capture non-local contextual information fail even with PAWS training examples. As such, PAWS provides an effective instrument for driving further progress on models that better exploit structure, context, and pairwise comparisons.

PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
Yinfei Yang | Yuan Zhang | Chris Tar | Jason Baldridge
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most existing work on adversarial data generation focuses on English. For example, PAWS (Paraphrase Adversaries from Word Scrambling) consists of challenging English paraphrase identification pairs from Wikipedia and Quora. We remedy this gap with PAWS-X, a new dataset of 23,659 human translated PAWS evaluation pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. We provide baseline numbers for three models with different capacity to capture non-local context and sentence structure, and using different multilingual training and evaluation regimes. Multilingual BERT fine-tuned on PAWS English plus machine-translated data performs the best, with a range of 83.1-90.8 accuracy across the non-English languages and an average accuracy gain of 23% over the next best model. PAWS-X shows the effectiveness of deep, multilingual pre-training while also leaving considerable headroom as a new challenge to drive multilingual research that better captures structure and contextual information.

Tree Communication Models for Sentiment Analysis
Yuan Zhang | Yue Zhang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Tree-LSTMs have been used for tree-based sentiment analysis over Stanford Sentiment Treebank, which allows the sentiment signals over hierarchical phrase structures to be calculated simultaneously. However, traditional tree-LSTMs capture only the bottom-up dependencies between constituents. In this paper, we propose a tree communication model using graph convolutional neural network and graph recurrent neural network, which allows rich information exchange between phrases constituent tree. Experiments show that our model outperforms existing work on bidirectional tree-LSTMs in both accuracy and efficiency, providing more consistent predictions on phrase-level sentiments.

2018

Points, Paths, and Playscapes: Large-scale Spatial Language Understanding Tasks Set in the Real World
Jason Baldridge | Tania Bedrax-Weiss | Daphne Luong | Srini Narayanan | Bo Pang | Fernando Pereira | Radu Soricut | Michael Tseng | Yuan Zhang
Proceedings of the First International Workshop on Spatial Language Understanding

Spatial language understanding is important for practical applications and as a building block for better abstract language understanding. Much progress has been made through work on understanding spatial relations and values in images and texts as well as on giving and following navigation instructions in restricted domains. We argue that the next big advances in spatial language understanding can be best supported by creating large-scale datasets that focus on points and paths based in the real world, and then extending these to create online, persistent playscapes that mix human and bot players, where the bot players must learn, evolve, and survive according to their depth of understanding of scenes, navigation, and interactions.

A Fast, Compact, Accurate Model for Language Identification of Codemixed Text
Yuan Zhang | Jason Riesa | Daniel Gillick | Anton Bakalov | Jason Baldridge | David Weiss
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We address fine-grained multilingual language identification: providing a language code for every token in a sentence, including codemixed text containing multiple languages. Such text is prevalent online, in documents, social media, and message boards. We show that a feed-forward network with a simple globally constrained decoder can accurately and rapidly label both codemixed and monolingual text in 100 languages and 100 language pairs. This model outperforms previously published multilingual approaches in terms of both accuracy and speed, yielding an 800x speed-up and a 19.5% averaged absolute gain on three codemixed datasets. It furthermore outperforms several benchmark systems on monolingual language identification.

2017

Aspect-augmented Adversarial Networks for Domain Adaptation
Yuan Zhang | Regina Barzilay | Tommi Jaakkola
Transactions of the Association for Computational Linguistics, Volume 5

We introduce a neural method for transfer learning between two (source and target) classification tasks or aspects over the same domain. Rather than training on target labels, we use a few keywords pertaining to source and target aspects indicating sentence relevance instead of document class labels. Documents are encoded by learning to embed and softly select relevant sentences in an aspect-dependent manner. A shared classifier is trained on the source encoded documents and labels, and applied to target encoded documents. We ensure transfer through aspect-adversarial training so that encoded documents are, as sets, aspect-invariant. Experimental results demonstrate that our approach outperforms different baselines and model variants on two datasets, yielding an improvement of 27% on a pathology dataset and 5% on a review dataset.

2016

Ten Pairs to Tag – Multilingual POS Tagging via Coarse Mapping between Embeddings
Yuan Zhang | David Gaddy | Regina Barzilay | Tommi Jaakkola
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Stack-propagation: Improved Representation Learning for Syntax
Yuan Zhang | David Weiss
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

High-Order Low-Rank Tensors for Semantic Role Labeling
Tao Lei | Yuan Zhang | Lluís Màrquez | Alessandro Moschitti | Regina Barzilay
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Randomized Greedy Inference for Joint Segmentation, POS Tagging and Dependency Parsing
Yuan Zhang | Chengtao Li | Regina Barzilay | Kareem Darwish
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Hierarchical Low-Rank Tensors for Multilingual Transfer Parsing
Yuan Zhang | Regina Barzilay
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Low-Rank Tensors for Scoring Dependency Structures
Tao Lei | Yu Xin | Yuan Zhang | Regina Barzilay | Tommi Jaakkola
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Greed is Good if Randomized: New Inference for Dependency Parsing
Yuan Zhang | Tao Lei | Regina Barzilay | Tommi Jaakkola
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Steps to Excellence: Simple Inference with Refined Scoring of Dependency Trees
Yuan Zhang | Tao Lei | Regina Barzilay | Tommi Jaakkola | Amir Globerson
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

Transfer Learning for Constituency-Based Grammars
Yuan Zhang | Regina Barzilay | Amir Globerson
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Learning to Map into a Universal POS Tagset
Yuan Zhang | Roi Reichart | Regina Barzilay | Amir Globerson
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Co-authors

Amir Globerson 3

Panupong Pasupat 3

Dhivya Piraviperumal 3

Bo-Hsiang Tseng 3

Halim Cagri Ates 2

Shruti Bhargava 2

Melis Ozyildirim 2

Anton Bakalov 1

Tania Bedrax-Weiss 1

Pooja Chitkara 1

Kareem Darwish 1

Yi Feng (冯毅) 1

Sai Kiran Gandluri 1

Wanhong Huang 1

Gabriel Ilharco 1

Won Seok Jang 1

Atish Kothari 1

Soundarya Krishnan 1

Atharva Kulkarni 1

Siddhardha Maddula 1

Manav Shaileshkumar Mistry 1

Alessandro Moschitti 1

Deepak Muralidharan 1

Lluís Màrquez 1

Anil Kumar Nalamalapu 1

Srini Narayanan 1

Roman Hoang Nguyen 1

Fernando Pereira 1

Stephen Pulman 1

Vincent Renkens 1

Prathamesh Saraf 1

Mubarak Seyed Ibrahim 1

Sharmin Sultana 1

Swabha Swayamdipta 1

Michael Tseng 1

Jason D. Williams 1

Yimeng Zhuang 1

Venues