2025
pdf
bib
abs
DavIR: Data Selection via Implicit Reward for Large Language Models
Haotian Zhou
|
Tingkai Liu
|
Qianli Ma
|
Yufeng Zhang
|
Jianbo Yuan
|
Pengfei Liu
|
Yang You
|
Hongxia Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce DavIR, a model-based data selection method for post-training Large Language Models. DavIR generalizes Reducible Holdout Loss to core-set selection problem of causal language modeling, and quantifies the learnability of a given datum with respect to a pre-trained LLM based on relative reduction in loss during fine-tuning, a metric we show to be closely related to the implicit reward model described in Direct Preference Optimization (DPO). We show that 6% of Alpaca dataset selected with DavIR can steer both the LLaMA and Gemma model family to produce superior performance compared to the same models trained on the full 52K dataset. We also show that Alpaca dataset compressed with DavIR can be combined with GSM8K dataset to effectively balance open-domain freeform QA and mathematical reasoning capabilities. Finally, we apply the DavIR objective to DPO and develop a normalized DavIR-DPO objective which improves alignment performance of Zephyr-7B-SFT model by 8% (relative) on AlpacaEval, compared against training on vanilla DPO objective.
pdf
bib
abs
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
Yang Xiao
|
Jiashuo Wang
|
Qiancheng Xu
|
Changhe Song
|
Chunpu Xu
|
Yi Cheng
|
Wenjie Li
|
Pengfei Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present **DynToM**, a novel benchmark specifically designed to evaluate LLMs’ ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs’ ability to model the dynamic nature of human mental states.
pdf
bib
abs
Understanding Reference Policies in Direct Preference Optimization
Yixin Liu
|
Pengfei Liu
|
Arman Cohan
Findings of the Association for Computational Linguistics: NAACL 2025
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO – its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO’s effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of the KL-constraint from the reference policies in DPO by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO’s superiority in this controlled setting. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
pdf
bib
abs
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Haonan Li
|
Xudong Han
|
Zenan Zhai
|
Honglin Mu
|
Hao Wang
|
Zhenxuan Zhang
|
Yilin Geng
|
Shom Lin
|
Renxi Wang
|
Artem Shelmanov
|
Xiangyu Qi
|
Yuxia Wang
|
Donghai Hong
|
Youliang Yuan
|
Meng Chen
|
Haoqin Tu
|
Fajri Koto
|
Cong Zeng
|
Tatsuki Kuribayashi
|
Rishabh Bhardwaj
|
Bingchen Zhao
|
Yawen Duan
|
Yi Liu
|
Emad A. Alghamdi
|
Yaodong Yang
|
Yinpeng Dong
|
Soujanya Poria
|
Pengfei Liu
|
Zhengzhong Liu
|
Hector Xuguang Ren
|
Eduard Hovy
|
Iryna Gurevych
|
Preslav Nakov
|
Monojit Choudhury
|
Timothy Baldwin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of some other ones. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
2024
pdf
bib
abs
DocLens: Multi-aspect Fine-grained Evaluation for Medical Text Generation
Yiqing Xie
|
Sheng Zhang
|
Hao Cheng
|
Pengfei Liu
|
Zelalem Gero
|
Cliff Wong
|
Tristan Naumann
|
Hoifung Poon
|
Carolyn Rose
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Medical text generation aims to assist with administrative work and highlight salient information to support decision-making.To reflect the specific requirements of medical text, in this paper, we propose a set of metrics to evaluate the completeness, conciseness, and attribution of the generated text at a fine-grained level. The metrics can be computed by various types of evaluators including instruction-following (both proprietary and open-source) and supervised entailment models. We demonstrate the effectiveness of the resulting framework, DocLens, with three evaluators on three tasks: clinical note generation, radiology report summarization, and patient question summarization. A comprehensive human study shows that DocLens exhibits substantially higher agreement with the judgments of medical experts than existing metrics. The results also highlight the need to improve open-source evaluators and suggest potential directions. We released the code at https://github.com/yiqingxyq/DocLens.
pdf
bib
abs
Dissecting Human and LLM Preferences
Junlong Li
|
Fan Zhou
|
Shichao Sun
|
Yikai Zhang
|
Hai Zhao
|
Pengfei Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As a relative quality comparison of model responses, human and Large Language Model (LLM) preferences serve as common alignment goals in model fine-tuning and criteria in evaluation. Yet, these preferences merely reflect broad tendencies, resulting in less explainable and controllable models with potential safety risks. In this work, we dissect the preferences of human and 32 different LLMs to understand their quantitative composition, using annotations from real-world user-model conversations for a fine-grained, scenario-wise analysis. We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits. On the contrary, advanced LLMs like GPT-4-Turbo emphasize correctness, clarity, and harmlessness more. Additionally, LLMs of similar sizes tend to exhibit similar preferences, regardless of their training methods, and fine-tuning for alignment does not significantly alter the preferences of pretrained-only LLMs. Finally, we show that preference-based evaluation can be intentionally manipulated. In both training-free and training-based settings, aligning a model with the preferences of judges boosts scores, while injecting the least preferred properties lowers them. This results in notable score shifts: up to 0.59 on MT-Bench (1-10 scale) and 31.94 on AlpacaEval 2.0 (0-100 scale), highlighting the significant impact of this strategic adaptation. We have made all resources of this project publicly available.
pdf
bib
abs
MoPS: Modular Story Premise Synthesis for Open-Ended Automatic Story Generation
Yan Ma
|
Yu Qiao
|
Pengfei Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A story premise succinctly defines a story’s main idea, foundation, and trajectory. It serves as the initial trigger in automatic story generation. Existing sources of story premises are limited by a lack of diversity, uneven quality, and high costs that make them difficult to scale. In response, we introduce Modular Story Premise Synthesis (MoPS) which breaks down story premises into modules like background and persona for automated design and generation. MoPS consists of three phases: (1) Pre-collect a consistent set of candidates for each module to form a nested dictionary. (2) Extract a key path from the nested dictionary as the premise design. (3) Instruct an LLM to integrate the design into a coherent premise sentence. Thorough evaluations demonstrate that our synthesized premises excel in diversity, fascination, completeness, and originality compared to those induced from large language models and captured from public story datasets. Similarly, the extended novels and scripts generated from our premises also exhibit higher quality. In supplementary materials, we provide the MoPS code suite, along with 7.5k generated premises and 1k extended stories.
pdf
bib
abs
Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz
|
Iker García-Ferrero
|
Alon Jacovi
|
Jon Ander Campos
|
Yanai Elazar
|
Eneko Agirre
|
Yoav Goldberg
|
Wei-Lin Chen
|
Jenny Chim
|
Leshem Choshen
|
Luca D’Amico-Wong
|
Melissa Dell
|
Run-Ze Fan
|
Shahriar Golchin
|
Yucheng Li
|
Pengfei Liu
|
Bhavish Pahwa
|
Ameya Prabhu
|
Suryansh Sharma
|
Emily Silcock
|
Kateryna Solonko
|
David Stap
|
Mihai Surdeanu
|
Yu-Min Tseng
|
Vishaal Udandarao
|
Zengzhi Wang
|
Ruijie Xu
|
Jinglin Yang
Proceedings of the 1st Workshop on Data Contamination (CONDA)
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in current available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pool requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.
pdf
bib
abs
Knowledge-Centric Hallucination Detection
Xiangkun Hu
|
Dongyu Ru
|
Lin Qiu
|
Qipeng Guo
|
Tianhang Zhang
|
Yang Xu
|
Yun Luo
|
Pengfei Liu
|
Yue Zhang
|
Zheng Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection, compared to other granularities such as response, sentence and sub-sentence level claims. RefChecker outperforms prior methods by 18.2 to 27.2 points on our benchmark and the checking results of RefChecker are strongly aligned with human judgments.
pdf
bib
abs
FRoG: Evaluating Fuzzy Reasoning of Generalized Quantifiers in LLMs
Yiyuan Li
|
Shichao Sun
|
Pengfei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Fuzzy reasoning is vital due to the frequent use of imprecise information in daily contexts. However, the ability of current large language models (LLMs) to handle such reasoning remains largely uncharted. In this paper, we introduce a new benchmark, FRoG, for fuzzy reasoning, featuring real-world mathematical word problems that incorporate generalized quantifiers. Our experimental findings reveal that fuzzy reasoning continues to pose significant challenges for LLMs. Moreover, we find that existing methods designed to enhance reasoning do not consistently improve performance in tasks involving fuzzy logic. Additionally, our results show an inverse scaling effect in the performance of LLMs on FRoG. Interestingly, we also demonstrate that strong mathematical reasoning skills are not necessarily indicative of success on our benchmark.
pdf
bib
abs
ECON: On the Detection and Resolution of Evidence Conflicts
Cheng Jiayang
|
Chunkit Chan
|
Qianqian Zhuang
|
Lin Qiu
|
Tianhang Zhang
|
Tengxiao Liu
|
Yangqiu Song
|
Yue Zhang
|
Pengfei Liu
|
Zheng Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or “inter-evidence conflicts.” This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs’ conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.
pdf
bib
abs
OpenResearcher: Unleashing AI for Accelerated Scientific Research
Yuxiang Zheng
|
Shichao Sun
|
Lin Qiu
|
Dongyu Ru
|
Cheng Jiayang
|
Xuefeng Li
|
Jifan Lin
|
Binjie Wang
|
Yun Luo
|
Renjie Pan
|
Yang Xu
|
Qingkai Min
|
Zizhao Zhang
|
Yiwen Wang
|
Wenjie Li
|
Pengfei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
The rapid growth of scientific literature imposes significant challenges for researchers endeavoring to stay updated with the latest advancements in their fields and delve into new areas. We introduce OpenResearcher, an innovative platform that leverages Artificial Intelligence (AI) techniques to accelerate the research process by answering diverse questions from researchers. OpenResearcher is built based on Retrieval-Augmented Generation (RAG) to integrate Large Language Models (LLMs) with up-to-date, domain-specific knowledge. Moreover, we develop various tools for OpenResearcher to understand researchers’ queries, search from the scientific literature, filter retrieved information, provide accurate and comprehensive answers, and self-refine these answers. OpenResearcher can flexibly use these tools to balance efficiency and effectiveness. As a result, OpenResearcher enables researchers to save time and increase their potential to discover new insights and drive scientific breakthroughs. Demo, video, and code are available at: https://github.com/GAIR-NLP/OpenResearcher.
pdf
bib
abs
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization
Yixin Liu
|
Alexander Fabbri
|
Jiawen Chen
|
Yilun Zhao
|
Simeng Han
|
Shafiq Joty
|
Pengfei Liu
|
Dragomir Radev
|
Chien-Sheng Wu
|
Arman Cohan
Findings of the Association for Computational Linguistics: NAACL 2024
While large language models (LLMs) can already achieve strong performance on standard generic summarization benchmarks, their performance on more complex summarization task settings is less studied. Therefore, we benchmark LLMs on instruction controllable text summarization, where the model input consists of both a source article and a natural language requirement for desired summary characteristics. To this end, we curate an evaluation-only dataset for this task setting and conduct human evaluations of five LLM-based systems to assess their instruction-following capabilities in controllable summarization. We then benchmark LLM-based automatic evaluation for this task with 4 different evaluation protocols and 11 LLMs, resulting in 40 evaluation methods. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) no LLM-based evaluation methods can achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation capabilities. We make our collected benchmark InstruSum publicly available to facilitate future research in this direction.
pdf
bib
Prompt Chaining or Stepwise Prompt? Refinement in Text Summarization
Shichao Sun
|
Ruifeng Yuan
|
Ziqiang Cao
|
Wenjie Li
|
Pengfei Liu
Findings of the Association for Computational Linguistics: ACL 2024
pdf
bib
abs
LLMCrit: Teaching Large Language Models to Use Criteria
Weizhe Yuan
|
Pengfei Liu
|
Matthias Gallé
Findings of the Association for Computational Linguistics: ACL 2024
Humans follow criteria when they execute tasks, and these criteria are directly used to assess the quality of task completion. Therefore, having models learn to use criteria to provide feedback can help humans or models to perform tasks better. However, current research in this area tends to consider only a limited number of criteria, or only a limited number of quality assessment aspects. To fill this gap, we propose a general framework that enables large language models (LLMs) to use comprehensive criteria for a task in delivering natural language feedback on task execution. In particular, we present a model-in-the-loop framework that semi-automatically derives criteria from collected guidelines for different writing tasks and constructs in-context demonstrations for each criterion. We choose three tasks from real-world scenarios to operationalize this idea: paper introduction writing, Python code writing, and Reddit post writing, and evaluate our feedback generation framework using different LLMs. The results reveal the fine-grained effects of adding criteria and demonstrations and provide valuable guidance on how to teach LLMs to use criteria more effectively.
pdf
bib
The Critique of Critique
Shichao Sun
|
Junlong Li
|
Weizhe Yuan
|
Ruifeng Yuan
|
Wenjie Li
|
Pengfei Liu
Findings of the Association for Computational Linguistics: ACL 2024
pdf
bib
abs
InFoBench: Evaluating Instruction Following Ability in Large Language Models
Yiwei Qin
|
Kaiqiang Song
|
Yebowen Hu
|
Wenlin Yao
|
Sangwoo Cho
|
Xiaoyang Wang
|
Xuansheng Wu
|
Fei Liu
|
Pengfei Liu
|
Dong Yu
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models’ (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs’ compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR’s higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
pdf
bib
abs
Reformatted Alignment
Run-Ze Fan
|
Xuefeng Li
|
Haoyang Zou
|
Junlong Li
|
Shwai He
|
Ethan Chern
|
Jiewen Hu
|
Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
The quality of finetuning data is crucial for aligning large language models (LLMs) with human values. Current methods to improve data quality are either labor-intensive or prone to factual errors caused by LLM hallucinations. This paper explores elevating the quality of existing instruction data to better align with human values, introducing a simple and effective approach named ReAlign, which reformats the responses of instruction data into a format that better aligns with pre-established criteria and the collated evidence. This approach minimizes human annotation, hallucination, and the difficulty in scaling, remaining orthogonal to existing alignment techniques. Experimentally, ReAlign significantly boosts the general alignment ability, math reasoning, factuality, and readability of the LLMs.Encouragingly, without introducing any additional data or advanced training techniques, and merely by reformatting the response, LLaMA-2-13B’s mathematical reasoning ability on GSM8K can be improved **from 46.77% to 56.63%** in accuracy. Additionally, a mere 5% of ReAlign data yields a 67% boost in general alignment ability measured by the Alpaca dataset. This work highlights the need for further research into the science and mechanistic interpretability of LLMs. We have made the associated code and data publicly accessible to support future studies at https://github.com/GAIR-NLP/ReAlign.
pdf
bib
abs
SAFETY-J: Evaluating Safety with Critique
Yixiu Liu
|
Yuxiang Zheng
|
Shijie Xia
|
Jiajun Li
|
Yi Tu
|
Chaoling Song
|
Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
The deployment of Large Language Models (LLMs) in content generation raises significant safety concerns, particularly regarding the transparency and interpretability of content evaluations. Current methods, primarily focused on binary safety classifications, lack mechanisms for detailed critique, limiting their utility for model improvement and user trust. To address these limitations, we introduce SAFETY-J, a bilingual generative safety evaluator for English and Chinese with critique-based judgment. SAFETY-J utilizes a robust training dataset that includes diverse dialogues and augmented query-response pairs to assess safety across various scenarios comprehensively. We establish an automated meta-evaluation benchmark that objectively assesses the quality of critiques with minimal human intervention, facilitating scalable and continuous improvement. Additionally, SAFETY-Jemploys an iterative preference learning technique to dynamically refine safety assessments based on meta-evaluations and critiques. Our evaluations demonstrate that SAFETY-J provides more nuanced and accurate safety evaluations, thereby enhancing both critique quality and predictive reliability in complex content scenarios. To facilitate further research and application, we have released SAFETY-J’s training protocols, datasets, and code at https://github.com/GAIR-NLP/Safety-J.
pdf
bib
abs
Weak-to-Strong Reasoning
Yuqing Yang
|
Yan Ma
|
Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
When large language models (LLMs) surpass human capabilities, supervising them effectively becomes difficult. 
Weak-to-strong learning, where a less capable model enhances a stronger one, proves valuable in this context. Yet, the efficacy of this paradigm for complex reasoning tasks is still unexplored. In this paper, we introduce a progressive weak-to-strong reasoning framework that enables the strong model to autonomously refine its training data, maximizing the use of weak signals and unlocking its latent abilities. This framework begins with supervised fine-tuning on a selective small but high-quality dataset, followed by preference optimization on contrastive samples identified by the strong model itself. Experiments on the GSM8K and MATH datasets verify that our method can effectively improve the reasoning capabilities of Llama2-70b using three separate weak models. This work paves the way for a more scalable and sophisticated strategy to enhance AI reasoning powers. All relevant code and resources are available in 
https://github.com/GAIR-NLP/weak-to-strong-reasoning.
pdf
bib
abs
GPTScore: Evaluate as You Desire
Jinlan Fu
|
See-Kiong Ng
|
Zhengbao Jiang
|
Pengfei Liu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models.Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently.This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., in-context learning, zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., Flan-T5-small) to 175B (e.g., GPT3).Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.This nature helps us overcome several long-standing challenges in text evaluation–how to achieve customized, multi-faceted evaluation without model training. We make our code publicly available.
pdf
bib
abs
On Learning to Summarize with Large Language Models as References
Yixin Liu
|
Kejian Shi
|
Katherine He
|
Longtian Ye
|
Alexander Fabbri
|
Pengfei Liu
|
Dragomir Radev
|
Arman Cohan
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent studies have found that summaries generated by large language models (LLMs) are favored by human annotators over the original reference summaries in commonly used summarization datasets. Therefore, we study an LLM-as-reference learning setting for smaller text summarization models to investigate whether their performance can be substantially improved. To this end, we use LLMs as both oracle summary generators for standard supervised fine-tuning and oracle summary evaluators for efficient contrastive learning that leverages the LLMs’ supervision signals. We conduct comprehensive experiments with source news articles and find that (1) summarization models trained under the LLM-as-reference setting achieve significant performance improvement in both LLM and human evaluations; (2) contrastive learning outperforms standard supervised fine-tuning under both low and high resource settings. Our experimental results also enable a meta-analysis of LLMs’ summary evaluation capacities under a challenging setting, showing that LLMs are not well-aligned with human evaluators. Particularly, our expert human evaluation reveals remaining nuanced performance gaps between LLMs and our fine-tuned models, which LLMs fail to capture. Thus, we call for further studies into both the potential and challenges of using LLMs in summarization model development.
2023
pdf
bib
abs
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
Yixin Liu
|
Alex Fabbri
|
Pengfei Liu
|
Yilun Zhao
|
Linyong Nan
|
Ruilin Han
|
Simeng Han
|
Shafiq Joty
|
Chien-Sheng Wu
|
Caiming Xiong
|
Dragomir Radev
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale, and an in-depth analysis of human evaluation is lacking. Therefore, we address the shortcomings of existing summarization evaluation along the following axes: (1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units and allows for a high inter-annotator agreement. (2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems on three datasets. (3) We conduct a comparative study of four human evaluation protocols, underscoring potential confounding factors in evaluation setups. (4) We evaluate 50 automatic metrics and their variants using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. The metrics we benchmarked include recent methods based on large language models (LLMs), GPTScore and G-Eval. Furthermore, our findings have important implications for evaluating LLMs, as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators’ prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
pdf
bib
abs
DataFinder: Scientific Dataset Recommendation from Natural Language Descriptions
Vijay Viswanathan
|
Luyu Gao
|
Tongshuang Wu
|
Pengfei Liu
|
Graham Neubig
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modern machine learning relies on datasets to develop and validate research ideas. Given the growth of publicly available data, finding the right dataset to use is increasingly difficult. Any research question imposes explicit and implicit constraints on how well a given dataset will enable researchers to answer this question, such as dataset size, modality, and domain. We operationalize the task of recommending datasets given a short natural language description of a research idea, to help people find relevant datasets for their needs. Dataset recommendation poses unique challenges as an information retrieval problem; datasets are hard to directly index for search and there are no corpora readily available for this task. To facilitate this task, we build the DataFinder Dataset which consists of a larger automatically-constructed training set (17.5K queries) and a smaller expert-annotated evaluation set (392 queries). Using this data, we compare various information retrieval algorithms on our test set and present a superior bi-encoder retriever for text-based dataset recommendation. This system, trained on the DataFinder Dataset, finds more relevant search results than existing third-party dataset search engines. To encourage progress on dataset recommendation, we release our dataset and models to the public.
pdf
bib
abs
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song
|
Simran Khanuja
|
Pengfei Liu
|
Fahim Faisal
|
Alissa Ostapenko
|
Genta Winata
|
Alham Fikri Aji
|
Samuel Cahyawijaya
|
Yulia Tsvetkov
|
Antonios Anastasopoulos
|
Graham Neubig
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are the ones with a relatively high population, but nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
pdf
bib
abs
Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
Yixin Liu
|
Alexander Fabbri
|
Yilun Zhao
|
Pengfei Liu
|
Shafiq Joty
|
Chien-Sheng Wu
|
Caiming Xiong
|
Dragomir Radev
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics. In this work, we develop strong-performing automatic metrics for reference-based summarization evaluation, based on a two-stage evaluation pipeline that first extracts basic information units from one text sequence and then checks the extracted units in another sequence. The metrics we developed include two-stage metrics that can provide high interpretability at both the fine-grained unit level and summary level, and one-stage metrics that achieve a balance between efficiency and interpretability. We make the developed tools publicly available at https://github.com/Yale-LILY/AutoACU.
pdf
bib
abs
LLM-driven Instruction Following: Progresses and Concerns
Wenpeng Yin
|
Qinyuan Ye
|
Pengfei Liu
|
Xiang Ren
|
Hinrich Schütze
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
The progress of natural language processing (NLP) is primarily driven by machine learning that optimizes a system on a large-scale set of task-specific labeled examples. This learning paradigm limits the ability of machines to have the same capabilities as humans in handling new tasks since humans can often solve unseen tasks with a couple of examples accompanied by task instructions. In addition, we may not have a chance to prepare task-specific examples of large-volume for new tasks because we cannot foresee what task needs to be addressed next and how complex to annotate for it. Therefore, task instructions act as a novel and promising resource for supervision. This tutorial targets researchers and practitioners who are interested in AI and ML technologies for NLP generalization in a low-shot scenario. In particular, we will present a diverse thread of instruction-driven NLP studies that try to answer the following questions: (i) What is task instruction? (ii) How is the process of creating datasets and evaluating systems conducted? (iii) How to encode task instructions? (iv) When and why do some instructions work better? (v) What concerns remain in LLM-driven instruction following? We will discuss several lines of frontier research that tackle those challenges and will conclude the tutorial by outlining directions for further investigation.
pdf
bib
abs
Multi-Dimensional Evaluation of Text Summarization with In-Context Learning
Sameer Jain
|
Vaishakh Keshava
|
Swarnashree Mysore Sathyendra
|
Patrick Fernandes
|
Pengfei Liu
|
Graham Neubig
|
Chunting Zhou
Findings of the Association for Computational Linguistics: ACL 2023
Evaluation of natural language generation (NLG) is complex and multi-dimensional. Generated text can be evaluated for fluency, coherence, factuality, or any other dimensions of interest. Most frameworks that perform such multi-dimensional evaluation require training on large manually or synthetically generated datasets. In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning, obviating the need for large training datasets. Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization, establishing state-of-the-art on dimensions such as relevance and factual consistency. We then analyze the effects of factors such as the selection and number of in-context examples on performance. Finally, we study the efficacy of in-context learning-based evaluators in evaluating zero-shot summaries written by large language models such as GPT-3.
pdf
bib
abs
T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Yiwei Qin
|
Weizhe Yuan
|
Graham Neubig
|
Pengfei Liu
Findings of the Association for Computational Linguistics: EMNLP 2023
Modern embedding-based metrics for evaluation of generated text generally fall into one of two paradigms: discriminative metrics that are trained to directly predict which outputs are of higher quality according to supervised human annotations, and generative metrics that are trained to evaluate text based on the probabilities of a generative model. Both have their advantages; discriminative metrics are able to directly optimize for the problem of distinguishing between good and bad outputs, while generative metrics can be trained using abundant raw text. In this paper, we present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available. We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as backbone. We perform an extensive empirical comparison with other existing metrics on 5 datasets, 19 languages and 280 systems, demonstrating the utility of our method. Experimental results show that: T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
pdf
bib
abs
Improving Factuality of Abstractive Summarization via Contrastive Reward Learning
I-chun Chern
|
Zhiruo Wang
|
Sanjan Das
|
Bhavuk Sharma
|
Pengfei Liu
|
Graham Neubig
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Modern abstractive summarization models often generate summaries that contain hallucinated or contradictory information. In this paper, we propose a simple but effective contrastive learning framework that incorporates recent developments in reward learning and factuality metrics. Empirical studies demonstrate that the proposed framework enables summarization models to learn from feedback of factuality metrics using contrastive reward learning, leading to more factual summaries by human evaluations. This suggests that further advances in learning and evaluation algorithms can feed directly into providing more factual summaries. Code and human evaluation results will be publicly available at \url{https://github.com/EthanC111/factuality_summarization}.
2022
pdf
bib
abs
BRIO: Bringing Order to Abstractive Summarization
Yixin Liu
|
Pengfei Liu
|
Dragomir Radev
|
Graham Neubig
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Abstractive summarization models are commonly trained using maximum likelihood estimation, which assumes a deterministic (one-point) target distribution in which an ideal model will assign all the probability mass to the reference summary. This assumption may lead to performance degradation during inference, where the model needs to compare several system-generated (candidate) summaries that have deviated from the reference summary. To address this problem, we propose a novel training paradigm which assumes a non-deterministic distribution so that different candidate summaries are assigned probability mass according to their quality. Our method achieves a new state-of-the-art result on the CNN/DailyMail (47.78 ROUGE-1) and XSum (49.07 ROUGE-1) datasets. Further analysis also shows that our model can estimate probabilities of candidate summaries that are more correlated with their level of quality.
pdf
bib
abs
DataLab: A Platform for Data Analysis and Intervention
Yang Xiao
|
Jinlan Fu
|
Weizhe Yuan
|
Vijay Viswanathan
|
Zhoumianze Liu
|
Yixin Liu
|
Graham Neubig
|
Pengfei Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Despite data’s crucial role in machine learning, most existing tools and research tend to focus on systems on top of existing data rather than how to interpret and manipulate data. In this paper, we propose DataLab, a unified data-oriented platform that not only allows users to interactively analyze the characteristics of data but also provides a standardized interface so that many data processing operations can be provided within a unified interface. Additionally, in view of the ongoing surge in the proliferation of datasets, DataLab has features for dataset recommendation and global vision analysis that help researchers form a better view of the data ecosystem. So far, DataLab covers 1,300 datasets and 3,583 of its transformed version, where 313 datasets support different types of analysis (e.g., with respect to gender bias) with the help of 119M samples annotated by 318 feature functions. DataLab is under active development and will be supported going forward. We have released a web platform, web API, Python SDK, and PyPI published package, which hopefully, can meet the diverse needs of researchers.
pdf
bib
abs
Towards a Unified Multi-Dimensional Evaluator for Text Generation
Ming Zhong
|
Yang Liu
|
Da Yin
|
Yuning Mao
|
Yizhu Jiao
|
Pengfei Liu
|
Chenguang Zhu
|
Heng Ji
|
Jiawei Han
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data, and all pre-trained evaluators are available at https://github.com/maszhongming/UniEval.
pdf
bib
abs
Polyglot Prompt: Multilingual Multitask Prompt Training
Jinlan Fu
|
See-Kiong Ng
|
Pengfei Liu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
This paper aims for a potential architectural improvement for multilingual learning and asks: Can different tasks from different languages be modeled in a monolithic framework, i.e. without any task/language-specific module? The benefit of achieving this could open new doors for future multilingual research, including allowing systems trained on low resources to be further assisted by other languages as well as other tasks. We approach this goal by developing a learning framework named Polyglot Prompting to exploit prompting methods for learning a unified semantic space for different languages and tasks with multilingual prompt engineering. We performed a comprehensive evaluation of 6 tasks, namely topic classification, sentiment classification, named entity recognition, question answering, natural language inference, and summarization, covering 24 datasets and 49 languages. The experimental results demonstrated the efficacy of multilingual multitask prompt-based learning and led to inspiring observations. We also present an interpretable multilingual evaluation methodology and show how the proposed framework, multilingual multitask prompt training, works. We release all datasets prompted in the best setting and code.
pdf
bib
abs
KGxBoard: Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models
Haris Widjaja
|
Kiril Gashteovski
|
Wiem Ben Rim
|
Pengfei Liu
|
Christopher Malon
|
Daniel Ruffinelli
|
Carolin Lawrence
|
Graham Neubig
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Knowledge Graphs (KGs) store information in the form of (head, predicate, tail)-triples. To augment KGs with new knowledge, researchers proposed models for KG Completion (KGC) tasks such as link prediction; i.e., answering (h; p; ?) or (?; p; t) queries. Such models are usually evaluated with averaged metrics on a held-out test set. While useful for tracking progress, averaged single-score metrics cannotreveal what exactly a model has learned — or failed to learn. To address this issue, we propose KGxBoard: an interactive framework for performing fine-grained evaluation on meaningful subsets of the data, each of which tests individual and interpretable capabilities of a KGC model. In our experiments, we highlight the findings that we discovered with the use of KGxBoard, which would have been impossible to detect with standard averaged single-score metrics.
pdf
bib
abs
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
Yang Xiao
|
Jinlan Fu
|
See-Kiong Ng
|
Pengfei Liu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
In this paper, we ask the research question of whether all the datasets in the benchmark are necessary. We approach this by first characterizing the distinguishability of datasets when comparing different systems. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while those less used datasets exhibit impressive discriminative power. We further, taking the text classification task as a case study, investigate the possibility of predicting dataset discrimination based on its properties (e.g., average sentence length). Our preliminary experiments promisingly show that given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets. We released all datasets with features explored in this work on DataLab.
2021
pdf
bib
abs
CitationIE: Leveraging the Citation Graph for Scientific Information Extraction
Vijay Viswanathan
|
Graham Neubig
|
Pengfei Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Automatically extracting key information from scientific documents has the potential to help scientists work more efficiently and accelerate the pace of scientific progress. Prior work has considered extracting document-level entity clusters and relations end-to-end from raw scientific text, which can improve literature search and help identify methods and materials for a given problem. Despite the importance of this task, most existing works on scientific information extraction (SciIE) consider extraction solely based on the content of an individual paper, without considering the paper’s place in the broader literature. In contrast to prior work, we augment our text representations by leveraging a complementary source of document context: the citation graph of referential links between citing and cited papers. On a test set of English-language scientific documents, we show that simple ways of utilizing the structure and content of the citation graph can each lead to significant gains in different scientific information extraction tasks. When these tasks are combined, we observe a sizable improvement in end-to-end information extraction over the state-of-the-art, suggesting the potential for future work along this direction. We release software tools to facilitate citation-aware SciIE development.
pdf
bib
abs
SpanNER: Named Entity Re-/Recognition as Span Prediction
Jinlan Fu
|
Xuanjing Huang
|
Pengfei Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Recent years have seen the paradigm shift of Named Entity Recognition (NER) systems from sequence labeling to span prediction. Despite its preliminary effectiveness, the span prediction model’s architectural bias has not been fully understood. In this paper, we first investigate the strengths and weaknesses when the span prediction model is used for named entity recognition compared with the sequence labeling framework and how to further improve it, which motivates us to make complementary advantages of systems based on different paradigms. We then reveal that span prediction, simultaneously, can serve as a system combiner to re-recognize named entities from different systems’ outputs. We experimentally implement 154 systems on 11 datasets, covering three languages, comprehensive results show the effectiveness of span prediction models that both serve as base NER systems and system combiners. We make all codes and datasets available: 
https://github.com/neulab/spanner, as well as an online system demo: 
http://spanner.sh. Our model also has been deployed into the ExplainaBoard platform, which allows users to flexibly perform a system combination of top-scoring systems in an interactive way: 
http://explainaboard.nlpedia.ai/leaderboard/task-ner/.
pdf
bib
abs
SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization
Yixin Liu
|
Pengfei Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
In this paper, we present a conceptually simple while empirically powerful framework for abstractive summarization, SimCLS, which can bridge the gap between the learning objective and evaluation metrics resulting from the currently dominated sequence-to-sequence learning framework by formulating text generation as a reference-free evaluation problem (i.e., quality estimation) assisted by contrastive learning. Experimental results show that, with minor modification over existing top-scoring systems, SimCLS can improve the performance of existing top-performing models by a large margin. Particularly, 2.51 absolute improvement against BART and 2.50 over PEGASUS w.r.t ROUGE-1 on the CNN/DailyMail dataset, driving the state-of-the-art performance to a new level. We have open-sourced our codes and results: 
https://github.com/yixinL7/SimCLS. Results of our proposed models have been deployed into ExplainaBoard platform, which allows researchers to understand our systems in a more fine-grained way.
pdf
bib
abs
ExplainaBoard: An Explainable Leaderboard for NLP
Pengfei Liu
|
Jinlan Fu
|
Yang Xiao
|
Weizhe Yuan
|
Shuaichen Chang
|
Junqi Dai
|
Yixin Liu
|
Zihuiwen Ye
|
Graham Neubig
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations
With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g. what is the best-performing system bad at?) (ii) interpret relationships between multiple systems. (e.g. where does system A outperform system B? What if we combine systems A, B and C?) and (iii) examine prediction results closely (e.g. what are common errors made by multiple systems or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks. We not only released an online platform at the website but also make our evaluation tool an API with MIT Licence at Github and PyPi that allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected to motivate “output-driven” research in the future.
pdf
bib
abs
Towards More Fine-grained and Reliable NLP Performance Prediction
Zihuiwen Ye
|
Pengfei Liu
|
Jinlan Fu
|
Graham Neubig
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Performance prediction, the task of estimating a system’s performance without performing experiments, allows us to reduce the experimental burden caused by the combinatorial explosion of different datasets, languages, tasks, and models. In this paper, we make two contributions to improving performance prediction for NLP tasks. First, we examine performance predictors not only for holistic measures of accuracy like F1 or BLEU, but also fine-grained performance measures such as accuracy over individual classes of examples. Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration. We perform an analysis of four types of NLP tasks, and both demonstrate the feasibility of fine-grained performance prediction and the necessity to perform reliability analysis for performance prediction methods in the future.
pdf
bib
abs
XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
Sebastian Ruder
|
Noah Constant
|
Jan Botha
|
Aditya Siddhant
|
Orhan Firat
|
Jinlan Fu
|
Pengfei Liu
|
Junjie Hu
|
Dan Garrette
|
Graham Neubig
|
Melvin Johnson
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models.
pdf
bib
How well do you know your summarization datasets?
Priyam Tejaswin
|
Dhruv Naik
|
Pengfei Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
pdf
bib
abs
Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization
Yiran Chen
|
Pengfei Liu
|
Xipeng Qiu
Findings of the Association for Computational Linguistics: EMNLP 2021
With the continuous upgrading of the summarization systems driven by deep neural networks, researchers have higher requirements on the quality of the generated summaries, which should be not only fluent and informative but also factually correct. As a result, the field of factual evaluation has developed rapidly recently. Despite its initial progress in evaluating generated summaries, the meta-evaluation methodologies of factuality metrics are limited in their opacity, leading to the insufficient understanding of factuality metrics’ relative advantages and their applicability. In this paper, we present an adversarial meta-evaluation methodology that allows us to (i) diagnose the fine-grained strengths and weaknesses of 6 existing top-performing metrics over 24 diagnostic test datasets, (ii) search for directions for further improvement by data augmentation. Our observations from this work motivate us to propose several calls for future research. We make all codes, diagnostic test datasets, trained factuality models available: 
https://github.com/zide05/AdvFact.
pdf
bib
abs
RefSum: Refactoring Neural Summarization
Yixin Liu
|
Zi-Yi Dou
|
Pengfei Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Although some recent works show potential complementarity among different state-of-the-art systems, few works try to investigate this problem in text summarization. Researchers in other areas commonly refer to the techniques of reranking or stacking to approach this problem. In this work, we highlight several limitations of previous methods, which motivates us to present a new framework Refactor that provides a unified view of text summarization and summaries combination. Experimentally, we perform a comprehensive evaluation that involves twenty-two base systems, four datasets, and three different application scenarios. Besides new state-of-the-art results on CNN/DailyMail dataset (46.18 ROUGE-1), we also elaborate on how our proposed method addresses the limitations of the traditional methods and the effectiveness of the Refactor model sheds light on insight for performance improvement. Our system can be directly used by other researchers as an off-the-shelf tool to achieve further performance improvements. We open-source all the code and provide a convenient interface to use it: 
https://github.com/yixinL7/Refactoring-Summarization.
pdf
bib
abs
Larger-Context Tagging: When and Why Does It Work?
Jinlan Fu
|
Liangjing Feng
|
Qi Zhang
|
Xuanjing Huang
|
Pengfei Liu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
The development of neural networks and pretraining techniques has spawned many sentence-level tagging systems that achieved superior performance on typical benchmarks. However, a relatively less discussed topic is what if more context information is introduced into current top-scoring tagging systems. Although several existing works have attempted to shift tagging systems from sentence-level to document-level, there is still no consensus conclusion about when and why it works, which limits the applicability of the larger-context approach in tagging tasks. In this paper, instead of pursuing a state-of-the-art tagging system by architectural exploration, we focus on investigating when and why the larger-context training, as a general strategy, can work. To this end, we conduct a thorough comparative study on four proposed aggregators for context information collecting and present an attribute-aided evaluation method to interpret the improvement brought by larger-context training. Experimentally, we set up a testbed based on four tagging tasks and thirteen datasets. Hopefully, our preliminary observations can deepen the understanding of larger-context training and enlighten more follow-up works on the use of contextual information.
pdf
bib
abs
Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa
Junqi Dai
|
Hang Yan
|
Tianxiang Sun
|
Pengfei Liu
|
Xipeng Qiu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Aspect-based Sentiment Analysis (ABSA), aiming at predicting the polarities for aspects, is a fine-grained task in the field of sentiment analysis. Previous work showed syntactic information, e.g. dependency trees, can effectively improve the ABSA performance. Recently, pre-trained models (PTMs) also have shown their effectiveness on ABSA. Therefore, the question naturally arises whether PTMs contain sufficient syntactic information for ABSA so that we can obtain a good ABSA model only based on PTMs. In this paper, we firstly compare the induced trees from PTMs and the dependency parsing trees on several popular models for the ABSA task, showing that the induced tree from fine-tuned RoBERTa (FT-RoBERTa) outperforms the parser-provided tree. The further analysis experiments reveal that the FT-RoBERTa Induced Tree is more sentiment-word-oriented and could benefit the ABSA task. The experiments also show that the pure RoBERTa-based model can outperform or approximate to the previous SOTA performances on six datasets across four languages since it implicitly incorporates the task-oriented syntactic information.
pdf
bib
abs
GSum: A General Framework for Guided Neural Abstractive Summarization
Zi-Yi Dou
|
Pengfei Liu
|
Hiroaki Hayashi
|
Zhengbao Jiang
|
Graham Neubig
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Neural abstractive summarization models are flexible and can produce coherent summaries, but they are sometimes unfaithful and can be difficult to control. While previous studies attempt to provide different types of guidance to control the output and increase faithfulness, it is not clear how these strategies compare and contrast to each other. In this paper, we propose a general and extensible guided summarization framework (GSum) that can effectively take different kinds of external guidance as input, and we perform experiments across several different varieties. Experiments demonstrate that this model is effective, achieving state-of-the-art performance according to ROUGE on 4 popular summarization datasets when using highlighted sentences as guidance. In addition, we show that our guided model can generate more faithful summaries and demonstrate how different types of guidance generate qualitatively different summaries, lending a degree of controllability to the learned models.
2020
pdf
bib
abs
Extractive Summarization as Text Matching
Ming Zhong
|
Pengfei Liu
|
Yiran Chen
|
Danqing Wang
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries will be (extracted from the original text) matched in a semantic space. Notably, this paradigm shift to semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the property of the dataset. Besides, even instantiating the framework with a simple form of a matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our codes, processed dataset, as well as generated summaries in 
https://github.com/maszhongming/MatchSum.
pdf
bib
abs
Heterogeneous Graph Neural Networks for Extractive Document Summarization
Danqing Wang
|
Pengfei Liu
|
Yining Zheng
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
As a crucial step in extractive document summarization, learning cross-sentence relations has been explored by a plethora of approaches. An intuitive way is to put them in the graph-based neural network, which has a more complex structure for capturing inter-sentence relationships. In this paper, we present a heterogeneous graph-based neural network for extractive summarization (HETERSUMGRAPH), which contains semantic nodes of different granularity levels apart from sentences. These additional nodes act as the intermediary between sentences and enrich the cross-sentence relations. Besides, our graph structure is flexible in natural extension from a single-document setting to multi-document via introducing document nodes. To our knowledge, we are the first one to introduce different types of nodes into graph-based neural networks for extractive document summarization and perform a comprehensive qualitative analysis to investigate their benefits. The code will be released on Github.
pdf
bib
abs
Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics
Manik Bhandari
|
Pranav Narayan Gour
|
Atabak Ashfaq
|
Pengfei Liu
Proceedings of the 28th International Conference on Computational Linguistics
In text summarization, evaluating the efficacy of automatic metrics without human judgments has become recently popular. One exemplar work (Peyrard, 2019) concludes that automatic metrics strongly disagree when ranking high-scoring summaries. In this paper, we revisit their experiments and find that their observations stem from the fact that metrics disagree in ranking summaries from any narrow scoring range. We hypothesize that this may be because summaries are similar to each other in a narrow scoring range and are thus, difficult to rank. Apart from the width of the scoring range of summaries, we analyze three other properties that impact inter-metric agreement - Ease of Summarization, Abstractiveness, and Coverage.
pdf
bib
abs
RethinkCWS: Is Chinese Word Segmentation a Solved Task?
Jinlan Fu
|
Pengfei Liu
|
Qi Zhang
|
Xuanjing Huang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
The performance of the Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what we have achieved and rethink what’s left in the CWS task. Methodologically, we propose a fine-grained evaluation for existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but enables us to quantify the discrepancy between different criterion and alleviate the negative transfer problem when doing multi-criteria learning. Strategically, despite not aiming to propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, as well as thorough analysis, could search for some promising direction for future research. We make all codes publicly available and release an interface that can quickly evaluate and diagnose user’s models: 
https://github.com/neulab/InterpretEvalpdf
bib
abs
Interpretable Multi-dataset Evaluation for Named Entity Recognition
Jinlan Fu
|
Pengfei Liu
|
Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
With the proliferation of models for natural language processing tasks, it is even harder to understand the differences between models and their relative merits. Simply looking at differences between holistic metrics such as accuracy, BLEU, or F1 does not tell us why or how particular methods perform differently and how diverse datasets influence the model design choices. In this paper, we present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them, identifying the strengths and weaknesses of current systems. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area: 
https://github.com/neulab/InterpretEvalpdf
bib
abs
Re-evaluating Evaluation in Text Summarization
Manik Bhandari
|
Pranav Narayan Gour
|
Atabak Ashfaq
|
Pengfei Liu
|
Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems. We release a dataset of human judgments that are collected from 25 top-scoring neural summarization systems (14 abstractive and 11 extractive).
pdf
bib
abs
CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems
Yiran Chen
|
Pengfei Liu
|
Ming Zhong
|
Zi-Yi Dou
|
Danqing Wang
|
Xipeng Qiu
|
Xuanjing Huang
Findings of the Association for Computational Linguistics: EMNLP 2020
Neural network-based models augmented with unsupervised pre-trained knowledge have achieved impressive performance on text summarization. However, most existing evaluation methods are limited to an in-domain setting, where summarizers are trained and evaluated on the same dataset. We argue that this approach can narrow our understanding of the generalization ability for different summarization systems. In this paper, we perform an in-depth analysis of characteristics of different datasets and investigate the performance of different summarization models under a cross-dataset setting, in which a summarizer trained on one corpus will be evaluated on a range of out-of-domain corpora. A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation ways (i.e. abstractive and extractive) on model generalization ability. Further, experimental results shed light on the limitations of existing summarizers. Brief introduction and supplementary code can be found in 
https://github.com/zide05/CDEvalSumm.
pdf
bib
abs
Target-Guided Structured Attention Network for Target-Dependent Sentiment Analysis
Ji Zhang
|
Chengyao Chen
|
Pengfei Liu
|
Chao He
|
Cane Wing-Ki Leung
Transactions of the Association for Computational Linguistics, Volume 8
Target-dependent sentiment analysis (TDSA) aims to classify the sentiment of a text towards a given target. The major challenge of this task lies in modeling the semantic relatedness between a target and its context sentence. This paper proposes a novel Target-Guided Structured Attention Network (TG-SAN), which captures target-related contexts for TDSA in a fine-to-coarse manner. Given a target and its context sentence, the proposed TG-SAN first identifies multiple semantic segments from the sentence using a target-guided structured attention mechanism. It then fuses the extracted segments based on their relatedness with the target for sentiment classification. We present comprehensive comparative experiments on three benchmarks with three major findings. First, TG-SAN outperforms the state-of-the-art by up to 1.61% and 3.58% in terms of accuracy and Marco-F1, respectively. Second, it shows a strong advantage in determining the sentiment of a target when the context sentence contains multiple semantic segments. Lastly, visualization results show that the attention scores produced by TG-SAN are highly interpretable
2019
pdf
bib
abs
A Closer Look at Data Bias in Neural Extractive Summarization Models
Ming Zhong
|
Danqing Wang
|
Pengfei Liu
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 2nd Workshop on New Frontiers in Summarization
In this paper, we take stock of the current state of summarization datasets and explore how different factors of datasets influence the generalization behaviour of neural extractive summarization models. Specifically, we first propose several properties of datasets, which matter for the generalization of summarization models. Then we build the connection between priors residing in datasets and model designs, analyzing how different properties of datasets influence the choices of model structure design and training methods. Finally, by taking a typical dataset as an example, we rethink the process of the model design based on the experience of the above analysis. We demonstrate that when we have a deep understanding of the characteristics of datasets, a simple approach can bring significant improvements to the existing state-of-the-art model.
pdf
bib
abs
Star-Transformer
Qipeng Guo
|
Xipeng Qiu
|
Pengfei Liu
|
Yunfan Shao
|
Xiangyang Xue
|
Zheng Zhang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Although Transformer has achieved great successes on many NLP tasks, its heavy structure with fully-connected attention connections leads to dependencies on large training data. In this paper, we present Star-Transformer, a lightweight alternative by careful sparsification. To reduce model complexity, we replace the fully-connected structure with a star-shaped topology, in which every two non-adjacent nodes are connected through a shared relay node. Thus, complexity is reduced from quadratic to linear, while preserving the capacity to capture both local composition and long-range dependency. The experiments on four tasks (22 datasets) show that Star-Transformer achieved significant improvements against the standard Transformer for the modestly sized datasets.
pdf
bib
abs
Searching for Effective Neural Extractive Summarization: What Works and What’s Next
Ming Zhong
|
Pengfei Liu
|
Danqing Wang
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
The recent years have seen remarkable success in the use of deep neural networks on text summarization. However, there is no clear understanding of why they perform so well, or how they might be improved. In this paper, we seek to better understand how neural extractive summarization systems could benefit from different types of model architectures, transferable knowledge and learning schemas. Besides, we find an effective way to improve the current framework and achieve the state-of-the-art result on CNN/DailyMail by a large margin based on our observations and analysis. Hopefully, our work could provide more hints for future research on extractive summarization.
pdf
bib
abs
TIGS: An Inference Algorithm for Text Infilling with Gradient Search
Dayiheng Liu
|
Jie Fu
|
Pengfei Liu
|
Jiancheng Lv
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Text infilling aims at filling in the missing part of a sentence or paragraph, which has been applied to a variety of real-world natural language generation scenarios. Given a well-trained sequential generative model, it is challenging for its unidirectional decoder to generate missing symbols conditioned on the past and future information around the missing part. In this paper, we propose an iterative inference algorithm based on gradient search, which could be the first inference algorithm that can be broadly applied to any neural sequence generative models for text infilling tasks. Extensive experimental comparisons show the effectiveness and efficiency of the proposed method on three different text infilling tasks with various mask ratios and different mask strategies, comparing with five state-of-the-art methods.
2017
pdf
bib
abs
Idiom-Aware Compositional Distributed Semantics
Pengfei Liu
|
Kaiyu Qian
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Idioms are peculiar linguistic constructions that impose great challenges for representing the semantics of language, especially in current prevailing end-to-end neural models, which assume that the semantics of a phrase or sentence can be literally composed from its constitutive words. In this paper, we propose an idiom-aware distributed semantic model to build representation of sentences on the basis of understanding their contained idioms. Our models are grounded in the literal-first psycholinguistic hypothesis, which can adaptively learn semantic compositionality of a phrase literally or idiomatically. To better evaluate our models, we also construct an idiom-enriched sentiment classification dataset with considerable scale and abundant peculiarities of idioms. The qualitative and quantitative experimental analyses demonstrate the efficacy of our models.
pdf
bib
abs
Adversarial Multi-task Learning for Text Classification
Pengfei Liu
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Neural network models have shown their promising opportunities for multi-task learning, which focus on learning the shared layers to extract the common and task-invariant features. However, in most existing approaches, the extracted shared features are prone to be contaminated by task-specific features or the noise brought by other tasks. In this paper, we propose an adversarial multi-task learning framework, alleviating the shared and private latent feature spaces from interfering with each other. We conduct extensive experiments on 16 different text classification tasks, which demonstrates the benefits of our approach. Besides, we show that the shared knowledge learned by our proposed model can be regarded as off-the-shelf knowledge and easily transferred to new tasks. The datasets of all 16 tasks are publicly available at 
http://nlp.fudan.edu.cn/data/.
2016
pdf
bib
Deep Multi-Task Learning with Shared Memory for Text Classification
Pengfei Liu
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
pdf
bib
Modelling Interaction of Sentence Pair with Coupled-LSTMs
Pengfei Liu
|
Xipeng Qiu
|
Yaqian Zhou
|
Jifan Chen
|
Xuanjing Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
pdf
bib
Deep Fusion LSTMs for Text Semantic Matching
Pengfei Liu
|
Xipeng Qiu
|
Jifan Chen
|
Xuanjing Huang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
pdf
bib
Implicit Discourse Relation Detection via a Deep Architecture with Gated Relevance Network
Jifan Chen
|
Qi Zhang
|
Pengfei Liu
|
Xipeng Qiu
|
Xuanjing Huang
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2015
pdf
bib
Long Short-Term Memory Neural Networks for Chinese Word Segmentation
Xinchi Chen
|
Xipeng Qiu
|
Chenxi Zhu
|
Pengfei Liu
|
Xuanjing Huang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
pdf
bib
Fine-grained Opinion Mining with Recurrent Neural Networks and Word Embeddings
Pengfei Liu
|
Shafiq Joty
|
Helen Meng
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
pdf
bib
Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents
Pengfei Liu
|
Xipeng Qiu
|
Xinchi Chen
|
Shiyu Wu
|
Xuanjing Huang
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
2014
pdf
bib
SeemGo: Conditional Random Fields Labeling and Maximum Entropy Classification for Aspect Based Sentiment Analysis
Pengfei Liu
|
Helen Meng
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)