2024
pdf
bib
abs
When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
Yebowen Hu
|
Kaiqiang Song
|
Sangwoo Cho
|
Xiaoyang Wang
|
Wenlin Yao
|
Hassan Foroosh
|
Dong Yu
|
Fei Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Reasoning is most powerful when an LLM accurately aggregates relevant information. We examine the critical role of information aggregation in reasoning by requiring the LLM to analyze sports narratives. To succeed at this task, an LLM must infer points from actions, identify related entities, attribute points accurately to players and teams, and compile key statistics to draw conclusions. We conduct comprehensive experiments with real NBA basketball data and present SportsGen, a new method to synthesize game narratives. By synthesizing data, we can rigorously evaluate LLMs’ reasoning capabilities under complex scenarios with varying narrative lengths and density of information. Our findings show that most models, including GPT-4o, often fail to accurately aggregate basketball scores due to frequent scoring patterns. Open-source models like Llama-3 further suffer from significant score hallucinations. Finally, the effectiveness of reasoning is influenced by narrative complexity, information density, and domain-specific terms, highlighting the challenges in analytical reasoning tasks.
pdf
bib
abs
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Jiangshu Du
|
Yibo Wang
|
Wenting Zhao
|
Zhongfen Deng
|
Shuaiqi Liu
|
Renze Lou
|
Henry Peng Zou
|
Pranav Narayanan Venkit
|
Nan Zhang
|
Mukund Srinath
|
Haoran Ranran Zhang
|
Vipul Gupta
|
Yinghui Li
|
Tao Li
|
Fei Wang
|
Qin Liu
|
Tianlin Liu
|
Pengzhi Gao
|
Congying Xia
|
Chen Xing
|
Cheng Jiayang
|
Zhaowei Wang
|
Ying Su
|
Raj Sanjay Shah
|
Ruohao Guo
|
Jing Gu
|
Haoran Li
|
Kangda Wei
|
Zihao Wang
|
Lu Cheng
|
Surangika Ranathunga
|
Meng Fang
|
Jie Fu
|
Fei Liu
|
Ruihong Huang
|
Eduardo Blanco
|
Yixin Cao
|
Rui Zhang
|
Philip S. Yu
|
Wenpeng Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Claim: This work is not advocating the use of LLMs for paper (meta-)reviewing. Instead, wepresent a comparative analysis to identify and distinguish LLM activities from human activities. Two research goals: i) Enable better recognition of instances when someone implicitly uses LLMs for reviewing activities; ii) Increase community awareness that LLMs, and AI in general, are currently inadequate for performing tasks that require a high level of expertise and nuanced judgment.This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?This study focuses on the topic of LLMs as NLP Researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with “deficiency” labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) “LLMs as Reviewers”, how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) “LLMs as Metareviewers”, how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
pdf
bib
abs
Factuality of Large Language Models: A Survey
Yuxia Wang
|
Minghan Wang
|
Muhammad Arslan Manzoor
|
Fei Liu
|
Georgi Nenkov Georgiev
|
Rocktim Jyoti Das
|
Preslav Nakov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs), especially when instruction-tuned for chat, have become part of our daily lives, freeing people from the process of searching, extracting, and integrating information from multiple sources by offering a straightforward answer to a variety of questions in a single place. Unfortunately, in many cases, LLM responses are factually incorrect, which limits their applicability in real-world scenarios. As a result, research on evaluating and improving the factuality of LLMs has attracted a lot of research attention recently. In this survey, we critically analyze existing work with the aim to identify the major challenges and their associated causes, pointing out to potential solutions for improving the factuality of LLMs, and analyzing the obstacles to automated factuality evaluation for open-ended text generation. We further offer an outlook on where future research should go.
pdf
bib
abs
InFoBench: Evaluating Instruction Following Ability in Large Language Models
Yiwei Qin
|
Kaiqiang Song
|
Yebowen Hu
|
Wenlin Yao
|
Sangwoo Cho
|
Xiaoyang Wang
|
Xuansheng Wu
|
Fei Liu
|
Pengfei Liu
|
Dong Yu
Findings of the Association for Computational Linguistics: ACL 2024
This paper introduces the Decomposed Requirements Following Ratio (DRFR), a new metric for evaluating Large Language Models’ (LLMs) ability to follow instructions. Addressing a gap in current methodologies, DRFR breaks down complex instructions into simpler criteria, facilitating a detailed analysis of LLMs’ compliance with various aspects of tasks. Alongside this metric, we present InFoBench, a benchmark comprising 500 diverse instructions and 2,250 decomposed questions across multiple constraint categories. Our experiments compare DRFR with traditional scoring methods and explore annotation sources, including human experts, crowd-sourced workers, and GPT-4. The findings demonstrate DRFR’s higher reliability and the effectiveness of using GPT-4 as a cost-efficient annotator. The evaluation of several advanced LLMs using this framework reveals their strengths and areas needing improvement, particularly in complex instruction-following. This study contributes a novel metric and benchmark, offering insights for future LLM development and evaluation.
pdf
bib
abs
Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy
Liyan Xu
|
Zhenlin Su
|
Mo Yu
|
Jin Xu
|
Jinho D. Choi
|
Jie Zhou
|
Fei Liu
Findings of the Association for Computational Linguistics: EMNLP 2024
Factual inconsistencies pose a significant hurdle for the faithful summarization by generative models. While a major direction to enhance inconsistency detection is to derive stronger Natural Language Inference (NLI) models, we propose an orthogonal aspect that underscores the importance of incorporating task-specific taxonomy into the inference. To this end, we consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs. Extensive experiments on ten datasets of five distinct domains suggest that, zero-shot LLM inference could benefit from the explicit solution space depicted by the error type taxonomy, and achieves state-of-the-art performance overall, surpassing specialized non-LLM baselines, as well as recent LLM baselines. We further distill models that fuse the taxonomy into parameters through our designed prompt completions and supervised training strategies, efficiently substituting state-of-the-art zero-shot inference with much larger LLMs.
pdf
bib
abs
SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs
Yebowen Hu
|
Kaiqiang Song
|
Sangwoo Cho
|
Xiaoyang Wang
|
Hassan Foroosh
|
Dong Yu
|
Fei Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models hold significant potential for integrating various data types, such as text documents and database records, for advanced analytics. However, blending text and numerical data presents substantial challenges. LLMs need to process and cross-reference entities and numbers, handle data inconsistencies and redundancies, and develop planning capabilities such as building a working memory for managing complex data queries. In this paper, we introduce four novel tasks centered around sports data analytics to evaluate the numerical reasoning and information fusion capabilities of LLMs. These tasks involve providing LLMs with detailed, play-by-play sports game descriptions, then challenging them with adversarial scenarios such as new game rules, longer durations, scrambled narratives, and analyzing key statistics in game summaries. We conduct extensive experiments on NBA and NFL games to assess the performance of LLMs on these tasks. Our benchmark, SportsMetrics, introduces a new mechanism for assessing LLMs’ numerical reasoning and fusion skills.
pdf
bib
abs
Rescue: Ranking LLM Responses with Partial Ordering to Improve Response Generation
Yikun Wang
|
Rui Zheng
|
Haoming Li
|
Qi Zhang
|
Tao Gui
|
Fei Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Customizing LLMs for a specific task involves separating high-quality responses from lower-quality ones. This skill can be developed using supervised fine-tuning with extensive human preference data. However, obtaining a large volume of expert-annotated data is costly for most tasks. In this paper, we explore a novel method to optimize LLMs using ranking metrics. This method trains the model to prioritize the best responses from a pool of candidates created for a particular task. Rather than a traditional full ordering, we advocate for a partial ordering, as achieving consensus on the perfect order of candidate responses can be challenging. Our partial ordering is more robust, less sensitive to noise, and can be achieved with limited human annotations or through heuristic methods. We test our system’s improved response generation ability using benchmark datasets, including textual entailment and multi-document question answering. We conduct ablation studies to understand crucial factors, such as how to gather candidate responses for a specific task, determine their most suitable order, and balance supervised fine-tuning with ranking metrics. Our approach, named RESCUE, offers a promising avenue for enhancing the response generation and task accuracy of LLMs.
2023
pdf
bib
abs
Generating User-Engaging News Headlines
Pengshan Cai
|
Kaiqiang Song
|
Sangwoo Cho
|
Hongwei Wang
|
Xiaoyang Wang
|
Hong Yu
|
Fei Liu
|
Dong Yu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The potential choices for news article headlines are enormous, and finding the right balance between conveying the essential message and capturing the reader’s attention is key to effective headlining. However, presenting the same news headline to all readers is a suboptimal strategy, because it does not take into account the different preferences and interests of diverse readers, who may be confused about why a particular article has been recommended to them and do not see a clear connection between their interests and the recommended article. In this paper, we present a novel framework that addresses these challenges by incorporating user profiling to generate personalized headlines, and a combination of automated and human evaluation methods to determine user preference for personalized headlines. Our framework utilizes a learnable relevance function to assign personalized signature phrases to users based on their reading histories, which are then used to personalize headline generation. Through extensive evaluation, we demonstrate the effectiveness of our proposed framework in generating personalized headlines that meet the needs of a diverse audience. Our framework has the potential to improve the efficacy of news recommendations and facilitate creation of personalized content.
pdf
bib
abs
DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue
William Held
|
Christopher Hidey
|
Fei Liu
|
Eric Zhu
|
Rahul Goel
|
Diyi Yang
|
Rushin Shah
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modern virtual assistants use internal semantic parsing engines to convert user utterances to actionable commands. However, prior work has demonstrated multilingual models are less robust for semantic parsing compared to other tasks. In global markets such as India and Latin America, robust multilingual semantic parsing is critical as codeswitching between languages is prevalent for bilingual users. In this work we dramatically improve the zero-shot performance of a multilingual and codeswitched semantic parsing system using two stages of multilingual alignment. First, we show that contrastive alignment pretraining improves both English performance and transfer efficiency. We then introduce a constrained optimization approach for hyperparameter-free adversarial alignment during finetuning. Our Doubly Aligned Multilingual Parser (DAMP) improves mBERT transfer performance by 3x, 6x, and 81x on the Spanglish, Hinglish and Multilingual Task Oriented Parsing benchmarks respectively and outperforms XLM-R and mT5-Large using 3.2x fewer parameters.
pdf
bib
abs
MeetingBank: A Benchmark Dataset for Meeting Summarization
Yebowen Hu
|
Timothy Ganter
|
Hanieh Deilamsalehy
|
Franck Dernoncourt
|
Hassan Foroosh
|
Fei Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As the number of recorded meetings increases, it becomes increasingly important to utilize summarization technology to create useful summaries of these recordings. However, there is a crucial lack of annotated meeting corpora for developing this technology, as it can be hard to collect meetings, especially when the topics discussed are confidential. Furthermore, meeting summaries written by experienced writers are scarce, making it hard for abstractive summarizers to produce sensible output without a reliable reference. This lack of annotated corpora has hindered the development of meeting summarization technology. In this paper, we present MeetingBank, a new benchmark dataset of city council meetings over the past decade. MeetingBank is unique among other meeting corpora due to its divide-and-conquer approach, which involves dividing professionally written meeting minutes into shorter passages and aligning them with specific segments of the meeting. This breaks down the process of summarizing a lengthy meeting into smaller, more manageable tasks. The dataset provides a new testbed of various meeting summarization systems and also allows the public to gain insight into how council decisions are made. We make the collection, including meeting video links, transcripts, reference summaries, agenda, and other metadata, publicly available to facilitate the development of better meeting summarization techniques.
pdf
bib
abs
DecipherPref: Analyzing Influential Factors in Human Preference Judgments via GPT-4
Yebowen Hu
|
Kaiqiang Song
|
Sangwoo Cho
|
Xiaoyang Wang
|
Hassan Foroosh
|
Fei Liu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Human preference judgments are pivotal in guiding large language models (LLMs) to produce outputs that align with human values. Human evaluations are also used in summarization tasks to compare outputs from various systems, complementing existing automatic metrics. Despite their significance, however, there has been limited research probing these pairwise or k-wise comparisons. The collective impact and relative importance of factors such as output length, informativeness, fluency, and factual consistency are still not well understood. It is also unclear if there are other hidden factors influencing human judgments. In this paper, we conduct an in-depth examination of a collection of pairwise human judgments released by OpenAI. Utilizing the Bradley-Terry-Luce (BTL) model, we reveal the inherent preferences embedded in these human judgments. We find that the most favored factors vary across tasks and genres, whereas the least favored factors tend to be consistent, e.g., outputs are too brief, contain excessive off-focus content or hallucinated facts. Our findings have implications on the construction of balanced datasets in human preference evaluations, which is a crucial step in shaping the behaviors of future LLMs.
pdf
bib
Proceedings of the 4th New Frontiers in Summarization Workshop
Yue Dong
|
Wen Xiao
|
Lu Wang
|
Fei Liu
|
Giuseppe Carenini
Proceedings of the 4th New Frontiers in Summarization Workshop
2022
pdf
bib
abs
Toward Unifying Text Segmentation and Long Document Summarization
Sangwoo Cho
|
Kaiqiang Song
|
Xiaoyang Wang
|
Fei Liu
|
Dong Yu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Text segmentation is important for signaling a document’s structure. Without segmenting a long document into topically coherent sections, it is difficult for readers to comprehend the text, let alone find important information. The problem is only exacerbated by a lack of segmentation in transcripts of audio/video recordings. In this paper, we explore the role that section segmentation plays in extractive summarization of written and spoken documents. Our approach learns robust sentence representations by performing summarization and segmentation simultaneously, which is further enhanced by an optimization-based regularizer to promote selection of diverse summary sentences. We conduct experiments on multiple datasets ranging from scientific articles to spoken transcripts to evaluate the model’s performance. Our findings suggest that the model can not only achieve state-of-the-art performance on publicly available benchmarks, but demonstrate better cross-genre transferability when equipped with text segmentation. We perform a series of analyses to quantify the impact of section segmentation on summarizing written and spoken documents of substantial length and complexity.
2012
pdf
bib
Zhou qiaoli: A divide-and-conquer strategy for semantic dependency parsing
Qiaoli Zhou
|
Ling Zhang
|
Fei Liu
|
Dongfeng Cai
|
Guiping Zhang
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)