2024
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement
Wenda Xu | Guanglei Zhu | Xuandong Zhao | Liangming Pan | Lei Li | William Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent studies show that large language models (LLMs) improve their performance through self-feedback on certain tasks while degrading on others. We find that this discrepancy stems from LLMs’ bias in evaluating their own output. In this paper, we formally define an LLM’s self-bias – the tendency to favor its own generation – using two statistics. We analyze six LLMs (GPT-4, GPT-3.5, Gemini, LLaMA2, Mixtral and DeepSeek) on translation, constrained text generation, and mathematical reasoning tasks. We find that self-bias is prevalent in all examined LLMs across multiple languages and tasks. Our analysis reveals that while the self-refine pipeline improves the fluency and understandability of model outputs, it further amplifies self-bias. To mitigate such biases, we find that larger model size and external feedback with accurate assessment can significantly reduce bias in the self-refine pipeline, leading to actual performance improvement in downstream tasks. The code and data are released at https://github.com/xu1998hz/llm_self_bias.
BPO: Staying Close to the Behavior LLM Creates Better Online LLM Alignment
Wenda Xu | Jiachen Li | William Yang Wang | Lei Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Direct alignment from preferences (DAP) has emerged as a promising paradigm for aligning large language models (LLMs) to human desiderata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specifically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human reference text.
Translation Canvas: An Explainable Interface to Pinpoint and Analyze Translation Systems
Chinmay Dandekar | Wenda Xu | Xi Xu | Siqi Ouyang | Lei Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
With the rapid advancement of machine translation research, evaluation toolkits have become essential for benchmarking system progress. Tools like COMET and SacreBLEU offer single quality score assessments that are effective for pairwise system comparisons. However, these tools provide limited insights for fine-grained system-level comparisons and the analysis of instance-level defects. To address these limitations, we introduce Translation Canvas, an explainable interface designed to pinpoint and analyze translation systems’ performance: 1) Translation Canvas assists machine translation researchers in comprehending system-level model performance by identifying common errors (their frequency and severity) and analyzing relationships between different systems based on various evaluation metrics. 2) It supports fine-grained analysis by highlighting error spans with explanations and selectively displaying systems’ predictions. According to human evaluation, Translation Canvas demonstrates superior performance over COMET and SacreBLEU packages under enjoyability and understandability criteria.
LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback
Wenda Xu | Daniel Deutsch | Mara Finkelstein | Juraj Juraska | Biao Zhang | Zhongtao Liu | William Yang Wang | Lei Li | Markus Freitag
Findings of the Association for Computational Linguistics: NAACL 2024
Recent large language models (LLM) are leveraging human feedback to improve their generation quality. However, human feedback is costly to obtain, especially during inference. In this work, we propose LLMRefine, an inference time optimization method to refine LLM’s output. The core idea is to use a learned fine-grained feedback model to pinpoint defects and guide LLM to refine them iteratively. Using original LLM as a proposal of edits, LLMRefine searches for defect-less text via simulated annealing, trading off the exploration and exploitation. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization. LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies
Liangming Pan | Michael Saxon | Wenda Xu | Deepak Nathani | Xinyi Wang | William Yang Wang
Transactions of the Association for Computational Linguistics, Volume 12
While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
2023
SESCORE2: Learning Text Generation Evaluation via Synthesizing Realistic Mistakes
Wenda Xu | Xian Qian | Mingxuan Wang | Lei Li | William Yang Wang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Is it possible to train a general metric for evaluating text generation quality without human-annotated ratings? Existing learned metrics either perform unsatisfactorily across text generation tasks or require human ratings for training on specific tasks. In this paper, we propose SEScore2, a self-supervised approach for training a model-based metric for text generation evaluation. The key concept is to synthesize realistic model mistakes by perturbing sentences retrieved from a corpus. We evaluate SEScore2 and previous methods on four text generation tasks across three languages. SEScore2 outperforms all prior unsupervised metrics on four text generation evaluation benchmarks, with an average Kendall improvement of 0.158. Surprisingly, SEScore2 even outperforms the supervised BLEURT and COMET on multiple text generation tasks.
PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers
Michael Saxon | Xinyi Wang | Wenda Xu | William Yang Wang
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Building natural language inference (NLI) benchmarks that are both challenging for modern techniques and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, we introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO), which enables both the objective measurement of leakage and the automated detection of subpopulations in the data which maximally exhibit it.
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
Wenda Xu | Danqing Wang | Liangming Pan | Zhenqiao Song | Markus Freitag | William Wang | Lei Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics do not provide explicit explanation of their verdict, nor associate the scores with defects in the generated text. To address this limitation, we present INSTRUCTSCORE, a fine-grained explainable evaluation metric for text generation. By harnessing both explicit human instruction and the implicit knowledge of GPT-4, we fine-tune a text evaluation metric based on LLaMA, producing both a score for generated text and a human readable diagnostic report. We evaluate INSTRUCTSCORE on a variety of generation tasks, including translation, captioning, data-to-text, and commonsense generation. Experiments show that our 7B model surpasses all other unsupervised metrics, including those based on 175B GPT-3 and GPT-4. Surprisingly, our INSTRUCTSCORE, even without direct supervision from human-rated data, achieves performance levels on par with state-of-the-art metrics like COMET22, which were fine-tuned on human ratings.
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu | An Yan | Yujie Lu | Wenda Xu | Xin Wang | Miguel Eckstein | William Yang Wang
Findings of the Association for Computational Linguistics: EACL 2023
Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we can endow machines with the same ability to utilize visual information and construct a general picture of the context to guide text generation. In this work, we propose iNLG that uses machine-generated images to guide language models (LM) in open-ended text generation. The experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation in both few-shot and full-data scenarios. Both automatic metrics and human evaluations verify that the text snippets generated by our iNLG are coherent and informative while displaying minor degeneration.
CausalDialogue: Modeling Utterance-level Causality in Conversations
Yi-Lin Tuan | Alon Albalak | Wenda Xu | Michael Saxon | Connor Pryor | Lise Getoor | William Yang Wang
Findings of the Association for Computational Linguistics: ACL 2023
Despite their widespread adoption, neural conversation models have yet to exhibit natural chat capabilities with humans. In this research, we examine user utterances as causes and generated responses as effects, recognizing that changes in a cause should produce a different effect. To further explore this concept, we have compiled and expanded upon a new dataset called CausalDialogue through crowd-sourcing. This dataset includes multiple cause-effect pairs within a directed acyclic graph (DAG) structure. Our analysis reveals that traditional loss functions struggle to effectively incorporate the DAG structure, leading us to propose a causality-enhanced method called Exponential Maximum Average Treatment Effect (ExMATE) to enhance the impact of causality at the utterance level in training neural conversation models. To evaluate the need to consider causality in dialogue generation, we built a comprehensive benchmark on the CausalDialogue dataset using different models, inference, and training methods. Through experiments, we find that a causality-inspired loss like ExMATE can improve the diversity and agility of conventional loss functions, and that there is still room for improvement to reach human-level quality on this new dataset.
2022
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
Wenda Xu | Yi-Lin Tuan | Yujie Lu | Michael Saxon | Lei Li | William Yang Wang
Findings of the Association for Computational Linguistics: EMNLP 2022
Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves comparable performance to the best supervised metric COMET, despite receiving no human-annotated training data.