North American Chapter of the Association for Computational Linguistics (2024)



pdf (full)
bib (full)
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

pdf bib
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Kevin Duh | Helena Gomez | Steven Bethard

pdf bib
Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences
Hongyi Liu | Qingyun Wang | Payam Karisani | Heng Ji

Named entity recognition is a key component of Information Extraction (IE), particularly in scientific domains such as biomedicine and chemistry, where large language models (LLMs), e.g., ChatGPT, fall short. We investigate the applicability of transfer learning for enhancing a named entity recognition model trained in the biomedical domain (the source domain) to be used in the chemical domain (the target domain). A common practice for training such a model in a few-shot learning setting is to pretrain the model on the labeled source data, and then, to finetune it on a hand-full of labeled target examples. In our experiments, we observed that such a model is prone to mislabeling the source entities, which can often appear in the text, as the target entities. To alleviate this problem, we propose a model to transfer the knowledge from the source domain to the target domain, but, at the same time, to project the source entities and target entities into separate regions of the feature space. This diminishes the risk of mislabeling the source entities as the target entities. Our model consists of two stages: 1) entity grouping in the source domain, which incorporates knowledge from annotated events to establish relations between entities, and 2) entity discrimination in the target domain, which relies on pseudo labeling and contrastive learning to enhance discrimination between the entities in the two domains. We conduct our extensive experiments across three source and three target datasets, demonstrating that our method outperforms the baselines by up to 5% absolute value. Code, data, and resources are publicly available for research purposes: .

pdf bib
Text Diffusion Model with Encoder-Decoder Transformers for Sequence-to-Sequence Generation
Hongyi Yuan | Zheng Yuan | Chuanqi Tan | Fei Huang | Songfang Huang

The diffusion model, a new generative modeling paradigm, has achieved great success in image, audio, and video generation.However, considering the discrete categorical nature of the text, it is not trivial to extend continuous diffusion models to natural language. In this work, we propose SeqDiffuSeq, a text diffusion model, to approach sequence-to-sequence text generation with an encoder-decoder Transformer architecture.To improve the generation performance, SeqDiffuSeq is equipped with the self-conditioning technique and our newly proposed adaptive noise schedule technique. Self-conditioning enables SeqDiffuSeq to better use the predicted sequence information during the generation process.The adaptive noise schedule balances the difficulty of denoising across time steps at the token level.Experiment results illustrate the improved performance on five sequence-to-sequence generation tasks compared to other diffusion-based models regarding text quality and inference time.

pdf bib
An Interactive Framework for Profiling News Media Sources
Nikhil Mehta | Dan Goldwasser

The recent rise of social media has led to the spread of large amounts of fake and biased news, content published with the intent to sway beliefs. While detecting and profiling the sources that spread this news is important to maintain a healthy society, it is challenging for automated systems.In this paper, we propose an interactive framework for news media profiling. It combines the strengths of graph based news media profiling models, Pre-trained Large Language Models, and human insight to characterize the social context on social media. Experimental results show that with as little as 5 human interactions, our framework can rapidly detect fake and biased news media, even in the most challenging settings of emerging news events, where test data is unseen.

pdf bib
Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study
Yinghao Li | Haorui Wang | Chao Zhang

Large Language Models (LLMs) have shown remarkable proficiency in language understanding and have been successfully applied to a variety of real-world tasks through task-specific fine-tuning or prompt engineering. Despite these advancements, it remains an open question whether LLMs are fundamentally capable of reasoning and planning, or if they primarily rely on recalling and synthesizing information from their training data. In our research, we introduce a novel task—Minesweeper—specifically designed in a format unfamiliar to LLMs and absent from their training datasets. This task challenges LLMs to identify the locations of mines based on numerical clues provided by adjacent opened cells. Successfully completing this task requires an understanding of each cell’s state, discerning spatial relationships between the clues and mines, and strategizing actions based on logical deductions drawn from the arrangement of the cells. Our experiments, including trials with the advanced GPT-4 model, indicate that while LLMs possess the foundational abilities required for this task, they struggle to integrate these into a coherent, multi-step logical reasoning process needed to solve Minesweeper. These findings highlight the need for further research to understand the nature of reasoning capabilities in LLMs under similar circumstances, and to explore pathways towards more sophisticated AI reasoning and planning models.

pdf bib
TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation
Taeyang Yun | Hyunkuk Lim | Jeonghwan Lee | Min Song

Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue sys- tems to effectively respond to user requests. The emotions in a conversation can be identi- fied by the representations from various modal- ities, such as audio, visual, and text. How- ever, due to the weak contribution of non-verbal modalities to recognize emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from a lan- guage model acting as the teacher to the non- verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multi- modal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance in MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effec- tiveness of our components through additional experiments.

pdf bib
Effective and Efficient Conversation Retrieval for Dialogue State Tracking with Implicit Text Summaries
Seanie Lee | Jianpeng Cheng | Joris Driesen | Alexandru Coca | Anders Johannsen

Few-shot dialogue state tracking (DST) with Large Language Models (LLM) relies on an effective and efficient conversation retriever to find similar in-context examples for prompt learning. Previous works use raw dialogue context as search keys and queries, and a retriever is fine-tuned with annotated dialogues to achieve superior performance. However, the approach is less suited for scaling to new domains or new annotation languages, where fine-tuning data is unavailable. To address this problem, we handle the task of conversation retrieval based on text summaries of the conversations.A LLM-based conversation summarizer is adopted for query and key generation, which enables effective maximum inner product search. To avoid the extra inference cost brought by LLM-based conversation summarization, we further distill a light-weight conversation encoder which produces query embeddings without decoding summaries for test conversations. We validate our retrieval approach on MultiWOZ datasets with GPT-Neo-2.7B and LLaMA-7B/30B. The experimental results show a significant improvement over relevant baselines in real few-shot DST settings.

pdf bib
Promptly Predicting Structures: The Return of Inference
Maitrey Mehta | Valentina Pyatkin | Vivek Srikumar

Prompt-based methods have been used extensively across NLP to build zero- and few-shot label predictors. Many NLP tasks are naturally structured: that is, their outputs consist of multiple labels which constrain each other. Annotating data for such tasks can be cumbersome. Can the promise of the prompt-based paradigm be extended to such structured outputs? In this paper, we present a framework for constructing zero- and few-shot linguistic structure predictors. Our key insight is that we can use structural constraints—and combinatorial inference derived from them—to filter out inconsistent structures predicted by large language models. We instantiated this framework on two structured prediction tasks, and five datasets. Across all cases, our results show that enforcing consistency not only constructs structurally valid outputs, but also improves performance over the unconstrained variants.

pdf bib
On Linearizing Structured Data in Encoder-Decoder Language Models: Insights from Text-to-SQL
Yutong Shao | Ndapa Nakashole

Structured data, prevalent in tables, databases, and knowledge graphs, poses a significant challenge in its representation. With the advent of large language models (LLMs), there has been a shift towards linearization-based methods, which process structured data as sequential token streams, diverging from approaches that explicitly model structure, often as a graph. Crucially, there remains a gap in our understanding of how these linearization-based methods handle structured data, which is inherently non-linear.This work investigates the linear handling of structured data in encoder-decoder language models, specifically T5. Our findings reveal the model’s ability to mimic human-designed processes such as schema linking and syntax prediction, indicating a deep, meaningful learning of structure beyond simple token sequencing. We also uncover insights into the model’s internal mechanisms, including the ego-centric nature of structure node encodings and the potential for model compression due to modality fusion redundancy. Overall, this work sheds light on the inner workings of linearization-based methods and could potentially provide guidance for future research.

pdf bib
Extractive Summarization with Text Generator
Thang Le | Anh Tuan Luu

Standard extractive systems suffer from the lack of gold training signals since existing corpora solely provide document and human-written summary pairs while disregarding extractive labels. As a result, existing methods resort to imperfect pseudo-labels that are both biased and error-prone, thereby hindering the learning process of extractive models. In contrast, text generators which are commonly employed in abstractive summarization can effortlessly overcome this predicament on account of flexible sequence-to-sequence architectures. Motivated to bypass this inherent limitation, we investigate the possibility of conducting extractive summarization with text generators. Through extensive experiments covering six summarization benchmarks, we show that high-quality extractive summaries can be assembled via approximating the outputs (abstractive summaries) of these generators. Moreover, we find that the approximate summaries correlate positively with the auxiliary summaries (i.e. a better generator enables the production of better extractive summaries). Our results signify a new paradigm for training extractive summarizers i.e. learning with generation (abstractive) objectives rather than extractive schemes.

pdf bib
Self-generated Replay Memories for Continual Neural Machine Translation
Michele Resta | Davide Bacciu

Modern Neural Machine Translation systems exhibit strong performance in several different languages and are constantly improving. Their ability to learn continuously is, however, still severely limited by the catastrophic forgetting issue. In this work, we leverage a key property of encoder-decoder Transformers, i.e. their generative ability, to propose a novel approach to continually learning Neural Machine Translation systems. We show how this can effectively learn on a stream of experiences comprising different languages, by leveraging a replay memory populated by using the model itself as a generator of parallel sentences. We empirically demonstrate that our approach can counteract catastrophic forgetting without requiring explicit memorization of training data. Code will be publicly available upon publication.

pdf bib
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
Yangyi Chen | Karan Sikka | Michael Cogswell | Heng Ji | Ajay Divakaran

Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can parse natural queries about the visual content and generate human-like outputs. In this work, we explore the ability of these models to demonstrate human-like reasoning based on the perceived information. To address a crucial concern regarding the extent to which their reasoning capabilities are fully consistent and grounded, we also measure the reasoning consistency of these models. We achieve this by proposing a chain-of-thought (CoT) based consistency measure. However, such an evaluation requires a benchmark that encompasses both high-level inference and detailed reasoning chains, which is costly. We tackle this challenge by proposing an LLM-Human-in-the-Loop pipeline, which notably reduces cost while simultaneously ensuring the generation of a high-quality dataset. Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs. We evaluate existing state-of-the-art VLMs, and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency, indicating that substantial efforts are required to enable VLMs to perform visual reasoning as systematically and consistently as humans. As an early step, we propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs. The first stage involves employing supervised fine-tuning of VLMs using step-by-step reasoning samples automatically generated by LLMs. In the second stage, we further augment the training process by incorporating feedback provided by LLMs to produce reasoning chains that are highly consistent and grounded. We empirically highlight the effectiveness of our framework in both reasoning performance and consistency.

pdf bib
Building Knowledge-Guided Lexica to Model Cultural Variation
Shreya Havaldar | Salvatore Giorgi | Sunny Rai | Thomas Talhelm | Sharath Chandra Guntuku | Lyle Ungar

Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs’ failure to measure cultural variation or generate culturally varied language.

pdf bib
Adaptive Rank Selections for Low-Rank Approximation of Language Models
Shangqian Gao | Ting Hua | Yen-Chang Hsu | Yilin Shen | Hongxia Jin

Singular Value Decomposition (SVD) or its weighted variants has significantly progressed in compressing language models. Previous works assume the same importance for all operations and assign the same number of ranks for different layers in a language model. However, such a uniform rank selection is sub-optimal since different operations (layers) have non-uniform demand in capacity. In other words, a desired SVD strategy should allocate more ranks for important operations and vice versa. However, a globally-optimized selection of ranks for neural networks is still an open problem, and this is a non-trivial challenge since the selection is discrete. In this work, we propose a novel binary masking mechanism for optimizing the number of ranks in a differentiable framework. Our strategy uses a novel regularization to enable the masking to comply with the SVD property where the ranks have sorted singular values. The experiments examined both types of language models, encoder-only and decoder-only models, including large language models like LLaMA. Our compressed model achieves much better accuracy than previous SVD and their SOTA variants. More interestingly, our method retains significantly better accuracy with zero or limited fine-tuning, proving the substantial advantage of adaptive rank selection.

pdf bib
An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation
Pengzhi Gao | Ruiqing Zhang | Zhongjun He | Hua Wu | Haifeng Wang

Consistency regularization methods, such as R-Drop (Liang et al., 2021) and CrossConST (Gao et al., 2023), have achieved impressive supervised and zero-shot performance in the neural machine translation (NMT) field. Can we also boost end-to-end (E2E) speech-to-text translation (ST) by leveraging consistency regularization? In this paper, we conduct empirical studies on intra-modal and cross-modal consistency and propose two training strategies, SimRegCR and SimZeroCR, for E2E ST in regular and zero-shot scenarios. Experiments on the MuST-C benchmark show that our approaches achieve state-of-the-art (SOTA) performance in most translation directions. The analyses prove that regularization brought by the intra-modal consistency, instead of the modality gap, is crucial for the regular E2E ST, and the cross-modal consistency could close the modality gap and boost the zero-shot E2E ST performance.

pdf bib
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration
Zhenhailong Wang | Shaoguang Mao | Wenshan Wu | Tao Ge | Furu Wei | Heng Ji

Human intelligence thrives on cognitive synergy, where collaboration among different minds yield superior outcomes compared to isolated individuals. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds’ strengths and knowledge to enhance problem-solving in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, experimental results demonstrate that SPP effectively reduces factual hallucination, and maintains strong reasoning capabilities. Additionally, comparative experiments show that cognitive synergy only emerges in GPT-4 and does not appear in less capable models, such as GPT-3.5-turbo and Llama2-13b-chat, which draws an interesting analogy to human development. Code, data, and prompts can be found at:

pdf bib
FPT: Feature Prompt Tuning for Few-shot Readability Assessment
Ziyang Wang | Sanwoo Lee | Hsiu-Yuan Huang | Yunfang Wu

Prompt-based methods have achieved promising results in most few-shot text classification tasks. However, for readability assessment tasks, traditional prompt methods lack crucial linguistic knowledge, which has already been proven to be essential.Moreover, previous studies on utilizing linguistic features have shown non-robust performance in few-shot settings and may even impair model performance.To address these issues, we propose a novel prompt-based tuning framework that incorporates rich linguistic knowledge, called Feature Prompt Tuning (FPT). Specifically, we extract linguistic features from the text and embed them into trainable soft prompts. Further, we devise a new loss function to calibrate the similarity ranking order between categories. Experimental results demonstrate that our proposed method FTPnot only exhibits a significant performance improvement over the prior best prompt-based tuning approaches, but also surpasses the previous leading methods that incorporate linguistic features. Also, our proposed model significantly outperforms the large language model gpt-3.5-turbo-16k in most cases. Our proposed method establishes a new architecture for prompt tuning that sheds light on how linguistic features can be easily adapted to linguistic-related tasks.

pdf bib
Self-Prompting Large Language Models for Zero-Shot Open-Domain QA
Junlong Li | Jinyuan Wang | Zhuosheng Zhang | Hai Zhao

Open-Domain Question Answering (ODQA) aims to answer questions without explicitly providing specific background documents. This task becomes notably challenging in a zero-shot setting where no data is available to train tailored retrieval-reader models.While recent Large Language Models (LLMs) like GPT-3 have demonstrated their effectiveness in zero-shot ODQA using direct prompting methods, these methods still fall short of fully harnessing the potential of LLMs when implicitly invoked.In this paper, we propose a Self-Prompting framework to explicitly utilize the massive knowledge encoded in the parameters of LLMs and their strong instruction understanding abilities. Concretely, we prompt LLMs step by step to generate multiple pseudo QA pairs with background passages and explanations entirely from scratch.These generated elements are then utilized for in-context learning. Experimental results show that our method significantly surpasses previous state-of-the-art zero-shot methods on three widely-used ODQA datasets and even achieves comparable performance with various customized fine-tuned models on full training data. Our code is available at

pdf bib
Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?
Kai Sun | Yifan Xu | Hanwen Zha | Yue Liu | Xin Luna Dong

Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of LLMs, and whether Knowledge Graphs (KGs), which store the world knowledge in a symbolic form, will be replaced with LLMs. In this paper, we try to answer these questions from a new angle: How knowledgeable are LLMs?To answer this question, we constructed Head-to-Tail, a benchmark that consists of 18K question-answer (QA) pairs regarding head, torso, and tail facts in terms of popularity. We designed an automated evaluation method and a set of metrics that closely approximate the knowledge an LLM confidently internalizes. Through a comprehensive evaluation of 16 publicly available LLMs, we show that existing LLMs are still far from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities.

pdf bib
kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning
Wenting Zhao | Ye Liu | Yao Wan | Yibo Wang | Qingyang Wu | Zhongfen Deng | Jiangshu Du | Shuaiqi Liu | Yunlong Xu | Philip Yu

Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language, transforming them into structured outputs that combine elements of both natural language and intent/slot tags. Recently, Large Language Models (LLMs) have achieved impressive performance in synthesizing computer programs based on a natural-language prompt, mitigating the gap between natural language and structured programs. Our paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks, addressing the following three key research questions: 1) How can LLMs be effectively utilized for semantic parsing tasks? 2) What defines an effective prompt? and 3) How can LLM overcome the length constraint and streamline prompt design by including all examples as prompts? We introduce k Nearest Neighbor In-Context Learning (kNN-ICL), which simplifies prompt engineering by allowing it to be built on top of any design strategy while providing access to all demo examples. Extensive experiments show that: 1) Simple ICL without kNN search can achieve a comparable performance with strong supervised models on the TOP tasks, and 2) kNN-ICL significantly improves the comprehension of complex requests by seamlessly integrating ICL with a nearest-neighbor approach. Notably, this enhancement is achieved without the need for additional data or specialized prompts.

pdf bib
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
Jon Saad-Falcon | Omar Khattab | Christopher Potts | Matei Zaharia

Evaluating retrieval-augmented generation (RAG) systems traditionally relies on hand annotations for input queries, passages to retrieve, and responses to generate. We introduce ARES, an Automated RAG Evaluation System, for evaluating RAG systems along the dimensions of context relevance, answer faithfulness, and answer relevance. By creating its own synthetic training data, ARES finetunes lightweight LM judges to assess the quality of individual RAG components. To mitigate potential prediction errors, ARES utilizes a small set of human-annotated datapoints for prediction-powered inference (PPI). Across eight different knowledge-intensive tasks in KILT, SuperGLUE, and AIS, ARES accurately evaluates RAG systems while using only a few hundred human annotations during evaluation. Furthermore, ARES judges remain effective across domain shifts, proving accurate even after changing the type of queries and/or documents used in the evaluated RAG systems. We make our code and datasets publicly available on Github.

pdf bib
DEMO: A Statistical Perspective for Efficient Image-Text Matching
Fan Zhang | Xian-Sheng Hua | Chong Chen | Xiao Luo

Image-text matching has been a long-standing problem, which seeks to connect vision and language through semantic understanding. Due to the capability to manage large-scale raw data, unsupervised hashing-based approaches have gained prominence recently. They typically construct a semantic similarity structure using the natural distance, which subsequently guides the optimization of the hashing network. However, the similarity structure could be biased at the boundaries of semantic distributions, causing error accumulation during sequential optimization. To tackle this, we introduce a novel hashing approach termed Distribution-based Structure Mining with Consistency Learning (DEMO) for efficient image-text matching. From a statistical view, DEMO characterizes each image using multiple augmented views, which are considered as samples drawn from its intrinsic semantic distribution. Then, we employ a non-parametric distribution divergence to ensure a robust and precise similarity structure. In addition, we introduce collaborative consistency learning which not only preserves the similarity structure in the Hamming space but also encourages consistency between retrieval distribution from different directions in a self-supervised manner. Extensive experiments on several widely used datasets demonstrate that DEMO achieves superior performance compared with various state-of-the-art methods.

pdf bib
SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning
Bin Wang | Zhengyuan Liu | Xin Huang | Fangkai Jiao | Yang Ding | AiTi Aw | Nancy Chen

We present SeaEval, a benchmark for multilingual foundation models. In addition to characterizing how these models understand and reason with natural language, we also investigate how well they comprehend cultural practices, nuances, and values. Alongside standard accuracy metrics, we investigate the brittleness of foundation models in the dimensions of semantics and multilinguality. Our analyses span both open-sourced and closed models, leading to empirical results across classic NLP tasks, reasoning, and cultural comprehension. Key findings indicate (1) Many models exhibit varied behavior when given paraphrased instructions. (2) Many models still suffer from exposure bias (e.g., positional bias, majority label bias). (3) For questions rooted in factual, scientific, and commonsense knowledge, consistent responses are expected across multilingual queries that are semantically equivalent. Yet, most models surprisingly demonstrate inconsistent performance on these queries. (4) Multilingually-trained models have not attained “balanced multilingual” capabilities. Our endeavors underscore the need for more generalizable semantic representations and enhanced multilingual contextualization. SeaEval can serve as a launchpad for more thorough investigations and evaluations for multilingual and multicultural scenarios.

pdf bib
Volcano: Mitigating Multimodal Hallucination through Self-Feedback Guided Revision
Seongyun Lee | Sue Hyun Park | Yongrae Jo | Minjoon Seo

Large multimodal models suffer from multimodal hallucination, where they provide incorrect responses misaligned with the given visual information. Recent works have conjectured that one of the reasons behind multimodal hallucination is due to the vision encoder failing to ground on the image properly. To mitigate this issue, we propose a novel approach that leverages self-feedback as visual cues. Building on this approach, we introduce Volcano, a multimodal self-feedback guided revision model. Volcano generates natural language feedback to its initial response based on the provided visual information and utilizes this feedback to self-revise its initial response. Volcano effectively reduces multimodal hallucination and achieves state-of-the-art on MMHal-Bench, POPE, and GAVIE. It also improves on general multimodal abilities and outperforms previous models on MM-Vet and MMBench. Through qualitative analysis, we show that Volcano’s feedback is properly grounded on the image than the initial response. This indicates that Volcano can provide itself with richer visual information through feedback generation, leading to self-correct hallucinations. We publicly release our model, data, and code at

pdf bib
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Samuel Cahyawijaya | Holy Lovenia | Pascale Fung

In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages.Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages.Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages.Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages.

pdf bib
Simple and effective data augmentation for compositional generalization
Yuekun Yao | Alexander Koller

Compositional generalization, the ability to predict complex meanings from training on simpler sentences, poses challenges for powerful pretrained seq2seq models. In this paper, we show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization, but only if we sample from the right distribution. Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution, and greatly outperforms earlier methods that sampled from the training distribution.We further conduct experiments to investigate the reason why this happens and where the benefit of such data augmentation methods come from.

pdf bib
Rethinking Tabular Data Understanding with Large Language Models
Tianyang Liu | Fei Wang | Muhao Chen

Large Language Models (LLMs) have shown to be capable of various tasks, yet their capability in interpreting and reasoning over tabular data remains an underexplored area. In this context, this study investigates from three core perspectives: the robustness of LLMs to structural perturbations in tables, the comparative analysis of textual and symbolic reasoning on tables, and the potential of boosting model performance through the aggregation of multiple reasoning pathways. We discover that structural variance of tables presenting the same content reveals a notable performance decline, particularly in symbolic reasoning tasks. This prompts the proposal of a method for table structure normalization. Moreover, textual reasoning slightly edges out symbolic reasoning, and a detailed error analysis reveals that each exhibits different strengths depending on the specific tasks. Notably, the aggregation of textual and symbolic reasoning pathways, bolstered by a mix self-consistency mechanism, resulted in achieving SOTA performance, with an accuracy of 73.6% on WikiTableQuestions, representing a substantial advancement over previous existing table processing paradigms of LLMs.

pdf bib
From Shortcuts to Triggers: Backdoor Defense with Denoised PoE
Qin Liu | Fei Wang | Chaowei Xiao | Muhao Chen

Language models are often at risk of diverse backdoor attacks, especially data poisoning. Thus, it is important to investigate defense solutions for addressing them. Existing backdoor defense methods mainly focus on backdoor attacks with explicit triggers, leaving a universal defense against various backdoor attacks with diverse triggers largely unexplored. In this paper, we propose an end-to-end ensemble-based backdoor defense framework, DPoE (Denoised Product-of-Experts), which is inspired by the shortcut nature of backdoor attacks, to defend various backdoor attacks. DPoE consists of two models: a shallow model that captures the backdoor shortcuts and a main model that is prevented from learning the shortcuts. To address the label flip caused by backdoor attackers, DPoE incorporates a denoising design. Experiments on three NLP tasks show that DPoE significantly improves the defense performance against various types of backdoor triggers including word-level, sentence-level, and syntactic triggers. Furthermore, DPoE is also effective under a more challenging but practical setting that mixes multiple types of triggers.

pdf bib
BookSQL: A Large Scale Text-to-SQL Dataset for Accounting Domain
Rahul Kumar | Amar Raja Dibbu | Shrutendra Harsola | Vignesh Subrahmaniam | Ashutosh Modi

Several large-scale datasets (e.g., WikiSQL, Spider) for developing natural language interfaces to databases have recently been proposed. These datasets cover a wide breadth of domains but fall short on some essential domains, such as finance and accounting. Given that accounting databases are used worldwide, particularly by non-technical people, there is an imminent need to develop models that could help extract information from accounting databases via natural language queries. In this resource paper, we aim to fill this gap by proposing a new large-scale Text-to-SQL dataset for the accounting and financial domain: BookSQL. The dataset consists of 100k natural language queries-SQL pairs, and accounting databases of 1 million records. We experiment with and analyze existing state-of-the-art models (including GPT-4) for the Text-to-SQL task on BookSQL. We find significant performance gaps, thus pointing towards developing more focused models for this domain.

pdf bib
FLAP: Flow-Adhering Planning with Constrained Decoding in LLMs
Shamik Roy | Sailik Sengupta | Daniele Bonadiman | Saab Mansour | Arshit Gupta

Planning is a crucial task for agents in task oriented dialogs (TODs). Human agents typically resolve user issues by following predefined workflows, decomposing workflow steps into actionable items, and performing actions by executing APIs in order; all of which require reasoning and planning. With the recent advances in LLMs, there have been increasing attempts to use them for task planning and API usage. However, the faithfulness of the plans to predefined workflows and API dependencies, is not guaranteed with LLMs. Moreover, workflows in real life are often custom-defined and prone to changes; hence, adaptation is desirable. To study this, we propose the problem of faithful planning in TODs that needs to resolve user intents by following predefined flows and preserving API dependencies. To solve this problem, we propose FLAP, a Flow-Adhering Planning algorithm based on constrained decoding with lookahead heuristic for LLMs. Our algorithm alleviates the need for finetuning LLMs using domain specific (plan/dependency) data, enables quick adaptation to predefined flows, and outperforms other decoding and prompting-based baselines. Further, our algorithm empowers smaller LLMs (≈7B) to perform at par larger LLMs (≈30B-40B).

pdf bib
DuRE: Dual Contrastive Self Training for Semi-Supervised Relation Extraction
Yuxi Feng | Laks Lakshmanan

Document-level Relation Extraction (RE) aims to extract relation triples from documents. Existing document-RE models typically rely on supervised learning which requires substantial labeled data. To alleviate the amount of human supervision, Self-training (ST) has prospered again in language understanding by augmenting the fine-tuning of big pre-trained models whenever labeled data is insufficient. However, existing ST methods in RE fail to tackle the challenge of long-tail relations. In this work, we propose DuRE, a novel ST framework to tackle these problems. DuRE jointly models RE classification and text generation as a dual process. In this way, our model could construct and utilize both pseudo text generated from given labels and pseudo labels predicted from available unlabeled text, which are gradually refined during the ST phase. We proposed a contrastive loss to leverage the signal of the RE classifier to improve generation quality. In addition, we propose a self-adaptive way to sample pseudo text from different relation classes. Experiments on two document-level RE tasks show that DuRE significantly boosts recall and F1 score with comparable precision, especially for long-tail relations against several strong baselines.

pdf bib
Query-Efficient Textual Adversarial Example Generation for Black-Box Attacks
Zhen Yu | Zhenhua Chen | Kun He

Deep neural networks for Natural Language Processing (NLP) have been demonstrated to be vulnerable to textual adversarial examples. Existing black-box attacks typically require thousands of queries on the target model, making them expensive in real-world applications. In this paper, we propose a new approach that guides the word substitutions using prior knowledge from the training set to improve the attack efficiency. Specifically, we introduce Adversarial Boosting Preference (ABP), a metric that quantifies the importance of words and guides adversarial word substitutions. We then propose two query-efficient attack strategies based on ABP: query-free attack (ABPfree) and guided search attack (ABPguide). Extensive evaluations for text classification demonstrate that ABPfree generates more natural adversarial examples than existing universal attacks, ABPguide significantly reduces the number of queries by a factor of 10 500 while achieving comparable or even better performance than black-box attack baselines. Furthermore, we introduce the first ensemble attack ABPens in NLP, which gains further performance improvements and achieves better transferability and generalization by the ensemble of the ABP across different models and domains. Code is available at

pdf bib
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles
Kung-Hsiang Huang | Philippe Laban | Alexander Fabbri | Prafulla Kumar Choubey | Shafiq Joty | Caiming Xiong | Chien-Sheng Wu

Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, the summarization of diverse information dispersed across multiple articles about an event remains underexplored. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event. To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Next, to enable consistent automatic evaluation, we conducted a comprehensive analysis to pinpoint the position and verbosity biases when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of summaries. Through correlation analyses, we outline the best practices for effectively using automatic LLM-based metrics on the DiverseSumm dataset. Finally, we study how LLMs summarize multiple news articles by analyzing which type of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them mainly due to their limited coverage, with GPT-4 only able to cover under 40% of the diverse information on average.

pdf bib
AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation
Haoyi Qiu | Kung-Hsiang Huang | Jingnong Qu | Nanyun Peng

Ensuring factual consistency is crucial for natural language generation tasks, particularly in abstractive summarization, where preserving the integrity of information is paramount. Prior works on evaluating factual consistency of summarization often take the entailment-based approaches that first generate perturbed (factual inconsistent) summaries and then train a classifier on the generated data to detect the factually inconsistencies during testing time. However, previous approaches generating perturbed summaries are either of low coherence or lack error-type coverage. To address these issues, we propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs). Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage. Additionally, we present a data selection module NegFilter based on natural language inference and BARTScore to ensure the quality of the generated negative samples. Experimental results demonstrate our approach significantly outperforms previous systems on the AggreFact-SOTA benchmark, showcasing its efficacy in evaluating factuality of abstractive summarization.

pdf bib
PILOT: Legal Case Outcome Prediction with Case Law
Lang Cao | Zifeng Wang | Cao Xiao | Jimeng Sun

Machine learning shows promise in predicting the outcome of legal cases, but most research has concentrated on civil law cases rather than case law systems. We identified two unique challenges in making legal case outcome predictions with case law. First, it is crucial to identify relevant precedent cases that serve as fundamental evidence for judges during decision-making. Second, it is necessary to consider the evolution of legal principles over time, as early cases may adhere to different legal contexts. In this paper, we proposed a new framework named PILOT (PredictIng Legal case OuTcome) for case outcome prediction. It comprises two modules for relevant case retrieval and temporal pattern handling, respectively. To benchmark the performance of existing legal case outcome prediction models, we curated a dataset from a large-scale case law database. We demonstrate the importance of accurately identifying precedent cases and mitigating the temporal shift when making predictions for case law, as our method shows a significant improvement over the prior methods that focus on civil law case outcome predictions.

pdf bib
ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models
Zequan Liu | Jiawen Lyn | Wei Zhu | Xing Tian

Parameter-efficient fine-tuning (PEFT) is widely studied for its effectiveness and efficiency in the era of large language models. Low-rank adaptation (LoRA) has demonstrated commendable performance as a popular and representative method. However, it is implemented with a fixed intrinsic rank that might not be the ideal setting for the downstream tasks. Recognizing the need for more flexible downstream task adaptation, we extend the methodology of LoRA to an innovative approach we call allocating low-rank adaptation (ALoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process. First, we propose a novel method, AB-LoRA, that can effectively estimate the importance score of each LoRA rank. Second, guided by AB-LoRA, we gradually prune abundant and negatively impacting LoRA ranks and allocate the pruned LoRA budgets to important Transformer modules needing higher ranks. We have conducted experiments on various tasks, and the experimental results demonstrate that our ALoRA method can outperform the recent baselines with comparable tunable parameters.

pdf bib
R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces
Heng-Jui Chang | James Glass

This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin’s issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improving robustness in diverse acoustic environments.

pdf bib
InsCL: A Data-efficient Continual Learning Paradigm for Fine-tuning Large Language Models with Instructions
Yifan Wang | Yafei Liu | Chufan Shi | Haoling Li | Chen Chen | Haonan Lu | Yujiu Yang

Instruction tuning effectively optimizes Large Language Models (LLMs) for downstream tasks. Due to the changing environment in real-life applications, LLMs necessitate continual task-specific adaptation without catastrophic forgetting. Considering the heavy computational cost, replay-based Continual Learning (CL) methods are the simplest and most widely used for LLMs to address the forgetting issue. However, traditional replay-based methods do not fully utilize instructions to customize the replay strategy. In this work, we propose a novel paradigm called Instruction-based Continual Learning (InsCL). InsCL dynamically replays previous data based on task similarity, calculated by Wasserstein Distance with instructions. Moreover, we further introduce an Instruction Information Metric (InsInfo) to quantify the complexity and diversity of instructions. According to InsInfo, InsCL guides the replay process more inclined to high-quality data. We conduct extensive experiments over 16 tasks with different training orders, observing consistent performance improvements of InsCL. When all tasks have been trained, InsCL achieves performance gains of 3.0 Relative Gain compared with Random Replay, and 27.96 Relative Gain compared with No Replay.

pdf bib
Language Agnostic Code Embeddings
Saiteja Utpala | Alex Gu | Pin-Yu Chen

Recently, code language models have achieved notable advancements in addressing a diverse array of essential code comprehension and generation tasks. Yet, the field lacks a comprehensive deep dive and understanding of the code embeddings of multilingual code models. In this paper, we present a comprehensive study on multilingual code embeddings, focusing on the cross-lingual capabilities of these embeddings across different programming languages. Through probing experiments, we demonstrate that code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details, primarily focusing on semantics. Further, we show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks, leading to an absolute increase of up to +17 in the Mean Reciprocal Rank (MRR).

pdf bib
An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma | Rong Li | Junwei Liang

With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics ( VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a **SyntaxBias Score**, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose a novel benchmark, namely **S**ynt**A**ctically **DE**-biased benchmark (SADE). Our study provides an unbiased benchmark for the compositionality of GVLMs, facilitating future research in this direction. Code and dataset are available at

pdf bib
Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors
Victoria Graf | Qin Liu | Muhao Chen

Data poisoning backdoor attacks can cause undesirable behaviors in large language models (LLMs), and defending against them is of increasing importance. Existing defense mechanisms often assume that only one type of trigger is adopted by the attacker, while defending against multiple simultaneous and independent trigger types necessitates general defense frameworks and is relatively unexplored. In this paper, we propose Nested Product of Experts (NPoE) defense framework, which involves a mixture of experts (MoE) as a trigger-only ensemble within the PoE defense framework to simultaneously defend against multiple trigger types. During NPoE training, the main modelis trained in an ensemble with a mixture of smaller expert models that learn the features of backdoor triggers. At inference time, only the main model is used. Experimental results on sentiment analysis, hate speech detection, and question classification tasks demonstrate that NPoE effectively defends against a variety of triggers both separately and in trigger mixtures. Due to the versatility of the MoE structure in NPoE, this framework can be further expanded to defend against other attack settings.

pdf bib
VertAttack: Taking Advantage of Text Classifiers’ Horizontal Vision
Jonathan Rusert

Text classification systems have continuouslyimproved in performance over the years. How-ever, nearly all current SOTA classifiers have asimilar shortcoming, they process text in a hor-izontal manner. Vertically written words willnot be recognized by a classifier. In contrast,humans are easily able to recognize and readwords written both horizontally and vertically.Hence, a human adversary could write problem-atic words vertically and the meaning wouldstill be preserved to other humans. We simulatesuch an attack, VertAttack. VertAttack identifieswhich words a classifier is reliant on and thenrewrites those words vertically. We find thatVertAttack is able to greatly drop the accuracyof 4 different transformer models on 5 datasets.For example, on the SST2 dataset, VertAttackis able to drop RoBERTa’s accuracy from 94 to13%. Furthermore, since VertAttack does notreplace the word, meaning is easily preserved.We verify this via a human study and find thatcrowdworkers are able to correctly label 77%perturbed texts perturbed, compared to 81% ofthe original texts. We believe VertAttack offersa look into how humans might circumvent clas-sifiers in the future and thus inspire a look intomore robust algorithms.

pdf bib
KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning
Cong-Duy Nguyen | Thong Nguyen | Xiaobao Wu | Anh Tuan Luu

Previous work on multimodal sentence embedding has proposed multimodal contrastive learning and achieved promising results. However, by taking the rest of the batch as negative samples without reviewing when forming contrastive pairs, those studies encountered many suspicious and noisy negative examples, significantly affecting the methods’ overall performance. In this work, we propose KDMCSE (Knowledge Distillation Multimodal contrastive learning of Sentence Embeddings), a novel approach that enhances the discrimination and generalizability of multimodal representation and inherits the knowledge from the teacher model to learn the difference between positive and negative instances and via that, can detect noisy and wrong negative samples effectively before they are calculated in the contrastive objective. Furthermore, to overcome the limitation of modeling the variation within negative pairs, we introduce a new contrastive objective, AdapACSE (Adaptive Angular Margin Supervised Contrastive Learning for Multimodal sentence embeddings), that enhances the discriminative representation by strengthening the margin within the angular space while capturing varying semantics within the negative. Experimental results on widely used Semantic Textual Similarity (STS) benchmarks demonstrate the effectiveness of our approach.

pdf bib
The taste of IPA: Towards open-vocabulary keyword spotting and forced alignment in any language
Jian Zhu | Changbing Yang | Farhan Samir | Jahurul Islam

In this project, we demonstrate that phoneme-based models for speech processing can achieve strong crosslinguistic generalizability to unseen languages. We curated the IPAPACK, a massively multilingual speech corpora with phonemic transcriptions, encompassing more than 115 languages from diverse language families, selectively checked by linguists. Based on the IPAPACK, we propose CLAP-IPA, a multi-lingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between arbitrary speech signals and phonemic sequences. The proposed model was tested on 95 unseen languages, showing strong generalizability across languages. Temporal alignments between phonemes and speech signals also emerged from contrastive training, enabling zeroshot forced alignment in unseen languages. We further introduced a neural forced aligner IPA-ALIGNER by finetuning CLAP-IPA with the Forward-Sum loss to learn better phone-to-audio alignment. Evaluation results suggest that IPA-ALIGNER can generalize to unseen languages without adaptation.

pdf bib
Think Before You Act: A Two-Stage Framework for Mitigating Gender Bias Towards Vision-Language Tasks
Yunqi Zhang | Songda Li | Chunyuan Deng | Luyi Wang | Hui Zhao

Gender bias in vision-language models (VLMs) can reinforce harmful stereotypes and discrimination. In this paper, we focus on mitigating gender bias towards vision-language tasks. We identify object hallucination as the essence of gender bias in VLMs. Existing VLMs tend to focus on salient or familiar attributes in images but ignore contextualized nuances. Moreover, most VLMs rely on the co-occurrence between specific objects and gender attributes to infer the ignored features, ultimately resulting in gender bias. We propose GAMA, a task-agnostic generation framework to mitigate gender bias. GAMA consists of two stages: narrative generation and answer inference. During narrative generation, GAMA yields all-sided but gender-obfuscated narratives, which prevents premature concentration on localized image features, especially gender attributes. During answer inference, GAMA integrates the image, generated narrative, and a task-specific question prompt to infer answers for different vision-language tasks. This approach allows the model to rethink gender attributes and answers. We conduct extensive experiments on GAMA, demonstrating its debiasing and generalization ability.

pdf bib
BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings
Xianming Li | Jing Li

Sentence embeddings are crucial in measuring semantic similarity. Most recent studies employed large language models (LLMs) to learn sentence embeddings. Existing LLMs mainly adopted autoregressive architecture without explicit backward dependency modeling. Therefore, we examined the effects of backward dependencies in LLMs for semantic similarity measurements. Concretely, we propose a novel model: backward dependency enhanced large language model (BeLLM). It learns sentence embeddings via transforming specific attention layers from uni- to bi-directional. We extensively experiment across various semantic textual similarity (STS) tasks and downstream applications. BeLLM achieves state-of-the-art performance in varying scenarios. It shows that autoregressive LLMs benefit from backward dependencies for sentence embeddings.

pdf bib
Assessing Factual Reliability of Large Language Model Knowledge
Weixuan Wang | Barry Haddow | Alexandra Birch | Wei Peng

The factual knowledge of LLMs is typically evaluated using accuracy, yet this metric does not capture the vulnerability of LLMs to hallucination-inducing factors like prompt and context variability. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? In this paper, we propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs’ factual reliability. MONITOR is designed to compute the distance between the probability distributions of a valid output and its counterparts produced by the same LLM probing the same fact using different styles of prompts and contexts. Experiments on a comprehensive range of 12 LLMs demonstrate the effectiveness of MONITOR in evaluating the factual reliability of LLMs while maintaining a low computational overhead. In addition, we release the FKTC (Factual Knowledge Test Corpus) to foster research along this line

pdf bib
Dial-MAE: ConTextual Masked Auto-Encoder for Retrieval-based Dialogue Systems
Zhenpeng Su | Xing W | Wei Zhou | Guangyuan Ma | Songlin Hu

Dialogue response selection aims to select an appropriate response from several candidates based on a given user and system utterance history. Most existing works primarily focus on post-training and fine-tuning tailored for cross-encoders. However, there are no post-training methods tailored for dense encoders in dialogue response selection. We argue that when the current language model, based on dense dialogue systems (such as BERT), is employed as a dense encoder, it separately encodes dialogue context and response, leading to a struggle to achieve the alignment of both representations. Thus, we propose Dial-MAE (Dialogue Contextual Masking Auto-Encoder), a straightforward yet effective post-training technique tailored for dense encoders in dialogue response selection. Dial-MAE uses an asymmetric encoder-decoder architecture to compress the dialogue semantics into dense vectors, which achieves better alignment between the features of the dialogue context and response. Our experiments have demonstrated that Dial-MAE is highly effective, achieving state-of-the-art performance on two commonly evaluated benchmarks.

pdf bib
Toolink: Linking Toolkit Creation and Using through Chain-of-Solving on Open-Source Model
Cheng Qian | Chenyan Xiong | Zhenghao Liu | Zhiyuan Liu

Large Language Models (LLMs) have demonstrated remarkable progress in utilizing tools, but their closed-source nature and high inference costs pose limitations on their adaptability, necessitating a valid method that leverages smaller, open-sourced models. In this paper, we introduce Toolink, a comprehensive framework that performs task-solving by first creating a toolkit and then integrating the planning and calling of tools through a chain-of-solving (CoS) approach. We first validate the efficacy of Toolink in harnessing the model’s creativity and CoS ability on ChatGPT. Subsequently, we curate CoS-GPT, a chain-of-solving dataset designed for tool-using, and finetune the LLaMA-7B model. It results in LLaMA-CoS, a powerful open-source model with advanced tool-planning and tool-calling capabilities. Evaluation of diverse tasks from BIG-bench demonstrates its CoS ability matches that of ChatGPT while its performance surpasses the chain-of-thought approach. Further studies highlight the generalization of LLaMA-CoS to unseen tasks and showcase its capability in using toolkits not explicitly tailored for the target task, affirming its robustness in real-world scenarios. All codes and data are released.

pdf bib
Create! Don’t Repeat: A Paradigm Shift in Multi-Label Augmentation through Label Creative Generation
Letian Wang | Xianggen Liu | Jiancheng Lv

We propose Label Creative Generation (LCG), a new paradigm in multi-label data augmentation. Beyond repeating data points with fixed labels, LCG creates new data by exploring innovative label combinations. Within LCG, we introduce Tail-Driven Conditional Augmentation (TDCA), combining tail-driven label sampling and label-conditioned text generation for balanced, consistent data augmentation. Our approach has demonstrated a **100.21%** increase in PSP@1 across three datasets, successfully mitigating the long-tail effect in MLTC and markedly enhancing model performance.

pdf bib
Neurocache: Efficient Vector Retrieval for Long-range Language Modeling
Ali Safaya | Deniz Yuret

This paper introduces Neurocache, an approach to extend the effective context size of large language models (LLMs) using an external vector cache to store its past states. Like recent vector retrieval approaches, Neurocache uses an efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states and incorporate them into the attention process. Neurocache improves upon previous methods by (1) storing compressed states, which reduces cache size; (2) performing a single retrieval operation per token which increases inference speed; and (3) extending the retrieval window to neighboring states, which improves both language modeling and downstream task accuracy. Our experiments show the effectiveness of Neurocache both for models trained from scratch and for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the cache mechanism. We also compare Neurocache with text retrieval methods and show improvements in single-document question-answering and few-shot learning tasks. We made the source code available under:

pdf bib
Unveiling the Generalization Power of Fine-Tuned Large Language Models
Haoran Yang | Yumeng Zhang | Jiaqi Xu | Hongyuan Lu | Pheng-Ann Heng | Wai Lam

While Large Language Models (LLMs) have demonstrated exceptional multitasking abilities, fine-tuning these models on downstream, domain-specific datasets is often necessary to yield superior performance on test sets compared to their counterparts without fine-tuning. However, the comprehensive effects of fine-tuning on the LLMs’ generalization ability are not fully understood.This paper delves into the differences between original, unmodified LLMs and their fine-tuned variants. Our primary investigation centers on whether fine-tuning affects the generalization ability intrinsic to LLMs. To elaborate on this, we conduct extensive experiments across five distinct language tasks on various datasets.Our main findings reveal that models fine-tuned on generation and classification tasks exhibit dissimilar behaviors in generalizing to different domains and tasks.Intriguingly, we observe that integrating the in-context learning strategy during fine-tuning on generation tasks can enhance the model’s generalization ability.Through this systematic investigation, we aim to contribute valuable insights into the evolving landscape of fine-tuning practices for LLMs.

pdf bib
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning
Ruixin Hong | Hongming Zhang | Xinyu Pang | Dong Yu | Changshui Zhang

Logical reasoning has been an ongoing pursuit in the field of AI. Despite significant advancements made by large language models (LLMs), they still struggle with complex logical reasoning problems. To enhance reasoning performance, one promising direction is scalable oversight, which requires LLMs to identify their own errors and then improve by themselves. Various self-verification methods have been proposed in pursuit of this goal. Nevertheless, whether existing models understand their own errors well is still under investigation. In this paper, we take a closer look at the self-verification abilities of LLMs in the context of logical reasoning, focusing on their ability to identify logical fallacies accurately. We introduce a dataset, FALLACIES, containing 232 types of reasoning fallacies categorized in a hierarchical taxonomy. By conducting exhaustive experiments on FALLACIES, we obtain comprehensive and detailed analyses of a series of models on their verification abilities. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods. Drawing from these observations, we offer suggestions for future research and practical applications of self-verification methods.

pdf bib
Exploring Self-supervised Logic-enhanced Training for Large Language Models
Fangkai Jiao | Zhiyang Teng | Bosheng Ding | Zhengyuan Liu | Nancy Chen | Shafiq Joty

Traditional attempts to enhance the logical reasoning abilities of language models often rely on supervised fine-tuning, limiting their generalization to new tasks or domains. Large Language Models (LLMs), with their capacity to condense vast knowledge, can effectively tackle many tasks. Yet, our experiments reveal a gap in their performance on logical reasoning benchmarks when compared to state-of-the-art fine-tuning based models. To bridge this gap, we present LogicLLM, a first-of-its-kind, fully self-supervised framework for integrating logical reasoning capabilities into LLMs, and activating them via in-context learning. We apply this to two LLM series, FLAN-T5 and LLaMA, with parameter sizes from 3 billion to 33 billion. LogicLLM demonstrates its effectiveness through successful improvements on two logical reasoning benchmarks (ReClor and LogiQA-v2). Additionally, LogicLLM based on FLAN-T5-11B attains comparable results to ChatGPT, and evaluations with LLaMA-based models on three language understanding benchmarks (RACE, MMLU and Big-Bench-Hard) confirm that the improvements come without compromising the model’s general language understanding capabilities.

pdf bib
MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning
Debrup Das | Debopriyo Banerjee | Somak Aditya | Ashish Kulkarni

Tool-augmented Large Language Models (TALMs) are known to enhance the skillset of large language models (LLMs), thereby, leading to their improved reasoning abilities across many tasks. While, TALMs have been successfully employed in different question-answering benchmarks, their efficacy on complex mathematical reasoning benchmarks, and the potential complementary benefits offered by tools for knowledge retrieval and mathematical equation solving are open research questions. In this work, we present MathSensei, a tool-augmented large language model for mathematical reasoning. We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API) through evaluations on mathematical reasoning datasets. We perform exhaustive ablations on MATH, a popular dataset for evaluating mathematical reasoning on diverse mathematical disciplines. We also conduct experiments involving well-known tool planners to study the impact of tool sequencing on the model performance. MathSensei achieves 13.5% better accuracy over gpt-3.5-turbo with Chain-of-Thought on the MATH dataset. We further observe that TALMs are not as effective for simpler math word problems (in GSM-8K), and the benefit increases as the complexity and required knowledge increases (progressively over AQuA, MMLU-Math, and higher level complex questions in MATH). The code and data are available at

pdf bib
CoUDA: Coherence Evaluation via Unified Data Augmentation
Dawei Zhu | Wenhao Wu | Yifan Song | Fangwei Zhu | Ziqiang Cao | Sujian Li

Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules, lacking designing criteria as guidance.In this paper, we take inspiration from linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects, and designs augmentation strategies for both aspects, respectively.Especially for local coherence, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two controlling mechanisms to control the difficulty of generated samples. During inference, CoUDA also jointly evaluates both global and local aspects to comprehensively assess the overall coherence of a discourse.Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics.

pdf bib
mEdIT: Multilingual Text Editing via Instruction Tuning
Vipul Raheja | Dimitris Alikaniotis | Vivek Kulkarni | Bashar Alhafni | Dhruv Kumar

We introduce mEdIT, a multi-lingual extension to CoEdIT – the recent state-of-the-art text editing models for writing assistance. mEdIT models are trained by fine-tuning multi-lingual large, pre-trained language models (LLMs) via instruction tuning. They are designed to take instructions from the user specifying the attributes of the desired text in the form of natural language instructions, such as “Grammatik korrigieren” (German) or “이 텍스 트를 단순화” (Korean). We build mEdIT by curating data from multiple publicly available human-annotated text editing datasets for three text editing tasks (Grammatical Error Correction (GEC), Text Simplification, and Paraphrasing) across diverse languages belonging to six different language families. We detail the design and training of mEdIT models and demonstrate their strong performance on many multi-lingual text editing benchmarks against other multilingual LLMs. We also find that mEdIT generalizes effectively to new languages over multilingual baselines. We publicly release our data, code, and trained models.

pdf bib
Navigation as Attackers Wish? Towards Building Robust Embodied Agents under Federated Learning
Yunchao Zhang | Zonglin Di | Kaiwen Zhou | Cihang Xie | Xin Wang

Federated embodied agent learning protects the data privacy of individual visual environments by keeping data locally at each client (the individual environment) during training. However, since the local data is inaccessible to the server under federated learning, attackers may easily poison the training data of the local client to build a backdoor in the agent without notice. Deploying such an agent raises the risk of potential harm to humans, as the attackers may easily navigate and control the agent as they wish via the backdoor. Towards Byzantine-robust federated embodied agent learning, in this paper, we study the attack and defense for the task of vision-and-language navigation (VLN), where the agent is required to follow natural language instructions to navigate indoor environments. First, we introduce a simple but effective attack strategy, Navigation as Wish (NAW), in which the malicious client manipulates local trajectory data to implant a backdoor into the global model. Results on two VLN datasets (R2R and RxR) show that NAW can easily navigate the deployed VLN agent regardless of the language instruction, without affecting its performance on normal test sets. Then, we propose a new Prompt-Based Aggregation (PBA) to defend against the NAW attack in federated VLN, which provides the server with a ”prompt” of the vision-and-language alignment variance between the benign and malicious clients so that they can be distinguished during training. We validate the effectiveness of the PBA method on protecting the global model from the NAW attack, which outperforms other state-of-the-art defense methods by a large margin in the defense metrics on R2R and RxR.

pdf bib
In-context Learning and Gradient Descent Revisited
Gilad Deutch | Nadav Magar | Tomer Natan | Guy Dar

In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. A recent line of work suggests that ICL performs gradient descent (GD)-based optimization implicitly. While appealing, much of the research focuses on simplified settings, where the parameters of a shallow model are optimized. In this work, we revisit evidence for ICL-GD correspondence on realistic NLP tasks and models. We find gaps in evaluation, both in terms of problematic metrics and insufficient baselines. We show that surprisingly, even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL.Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality. We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.

pdf bib
Corpus Considerations for Annotator Modeling and Scaling
Sarumi Oluyemi | Béla Neuendorf | Joan Plepi | Lucie Flek | Jörg Schlötterer | Charles Welch

Recent trends in natural language processing research and annotation tasks affirm a paradigm shift from the traditional reliance on a single ground truth to a focus on individual perspectives, particularly in subjective tasks. In scenarios where annotation tasks are meant to encompass diversity, models that solely rely on the majority class labels may inadvertently disregard valuable minority perspectives. This oversight could result in the omission of crucial information and, in a broader context, risk disrupting the balance within larger ecosystems. As the landscape of annotator modeling unfolds with diverse representation techniques, it becomes imperative to investigate their effectiveness with the fine-grained features of the datasets in view. This study systematically explores various annotator modeling techniques and compares their performance across seven corpora. From our findings, we show that the commonly used user token model consistently outperforms more complex models. We introduce a composite embedding approach and show distinct differences in which model performs best as a function of the agreement with a given dataset. Our findings shed light on the relationship between corpus statistics and annotator modeling performance, which informs future work on corpus construction and perspectivist NLP.

pdf bib
On Large Language Models’ Hallucination with Regard to Known Facts
Che Jiang | Biqing Qi | Xiangyu Hong | Dayuan Fu | Yang Cheng | Fandong Meng | Mo Yu | Bowen Zhou | Jie Zhou

Large language models are successful in answering factoid questions but are also prone to hallucination.We investigate the phenomenon of LLMs possessing correct answer knowledge yet still hallucinating from the perspective of inference dynamics, an area not previously covered in studies on hallucinations.We are able to conduct this analysis via two key ideas.First, we identify the factual questions that query the same triplet knowledge but result in different answers. The difference between the model behaviors on the correct and incorrect outputs hence suggests the patterns when hallucinations happen.Second, to measure the pattern, we utilize mappings from the residual streams to vocabulary space.We reveal the different dynamics of the output token probabilities along the depths of layers between the correct and hallucinated cases. In hallucinated cases, the output token’s information rarely demonstrates abrupt increases and consistent superiority in the later stages of the model.Leveraging the dynamic curve as a feature, we build a classifier capable of accurately detecting hallucinatory predictions with an 88% success rate. Our study shed light on understanding the reasons for LLMs’ hallucinations on their known facts, and more importantly, on accurately predicting when they are hallucinating.

pdf bib
“One-Size-Fits-All”? Examining Expectations around What Constitute “Fair” or “Good” NLG System Behaviors
Li Lucy | Su Lin Blodgett | Milad Shokouhi | Hanna Wallach | Alexandra Olteanu

Fairness-related assumptions about what constitute appropriate NLG system behaviors range from invariance, where systems are expected to behave identically for social groups, to adaptation, where behaviors should instead vary across them. To illuminate tensions around invariance and adaptation, we conduct five case studies, in which we perturb different types of identity-related language features (names, roles, locations, dialect, and style) in NLG system inputs. Through these cases studies, we examine people’s expectations of system behaviors, and surface potential caveats of these contrasting yet commonly held assumptions. We find that motivations for adaptation include social norms, cultural differences, feature-specific information, and accommodation; in contrast, motivations for invariance include perspectives that favor prescriptivism, view adaptation as unnecessary or too difficult for NLG systems to do appropriately, and are wary of false assumptions. Our findings highlight open challenges around what constitute “fair” or “good” NLG system behaviors.

pdf bib
Language Models Hallucinate, but May Excel at Fact Verification
Jian Guan | Jesse Dodge | David Wadden | Minlie Huang | Hao Peng

Recent progress in natural language processing (NLP) owes much to remarkable advances in large language models (LLMs). Nevertheless, LLMs frequently “hallucinate,” resulting in non-factual outputs. Our carefully-designed human evaluation substantiates the serious hallucination issue, revealing that even GPT-3.5 produces factual outputs less than 25% of the time. This underscores the importance of fact verifiers in order to measure and incentivize progress. Our systematic investigation affirms that LLMs can be repurposed as effective fact verifiers with strong correlations with human judgments. Surprisingly, FLAN-T5-11B , the least factual generator in our study, performs the best as a fact verifier, even outperforming more capable LLMs like GPT3.5 and ChatGPT. Delving deeper, we analyze the reliance of these LLMs on high-quality evidence, as well as their deficiencies in robustness and generalization ability. Our study presents insights for developing trustworthy generation models.

pdf bib
A Rationale-centric Counterfactual Data Augmentation Method for Cross-Document Event Coreference Resolution
Bowen Ding | Qingkai Min | Shengkun Ma | Yingjie Li | Linyi Yang | Yue Zhang

Based on Pre-trained Language Models (PLMs), event coreference resolution (ECR) systems have demonstrated outstanding performance in clustering coreferential events across documents. However, the state-of-the-art system exhibits an excessive reliance on the ‘triggers lexical matching’ spurious pattern in the input mention pair text. We formalize the decision-making process of the baseline ECR system using a Structural Causal Model (SCM), aiming to identify spurious and causal associations (i.e., rationales) within the ECR task. Leveraging the debiasing capability of counterfactual data augmentation, we develop a rationale-centric counterfactual data augmentation method with LLM-in-the-loop. This method is specialized for pairwise input in the ECR system, where we conduct direct interventions on triggers and context to mitigate the spurious association while emphasizing the causation. Our approach achieves state-of-the-art performance on three popular cross-document ECR benchmarks and demonstrates robustness in out-of-domain scenarios.

pdf bib
TrojFSP: Trojan Insertion in Few-shot Prompt Tuning
Mengxin Zheng | Jiaqi Xue | Xun Chen | Yanshan Wang | Qian Lou | Lei Jiang

Prompt tuning is one of the most effective solutions to adapting a fixed pre-trained language model (PLM) for various downstream tasks, especially with only a few input samples. However, the security issues, e.g., Trojan attacks, of prompt tuning on a few data samples are not well-studied. Transferring established data poisoning attacks directly to few-shot prompt tuning presents multiple challenges. One significant issue is the _poisoned imbalance issue_, where non-target class samples are added to the target class, resulting in a greater number of target-class samples compared to non-target class. While this issue is not critical in regular tuning, it significantly hampers the few-shot prompt tuning, making it difficult to simultaneously achieve a high attack success rate (ASR) and maintain clean data accuracy (CDA). Additionally, few-shot prompting is prone to overfitting in terms of both ASR and CDA. In this paper, we introduce _TrojFSP_, a method designed to address the challenges. To solve the poisoned imbalance issue, we develop a _Target-Class Shrink (TC-Shrink)_ technique, which aims to equalize the number of poisoning samples. To combat overfitting, we employ a _Selective Token Poisoning_ technique to boost attack performance. Furthermore, we introduce a _Trojan-Trigger Attention_ objective function to amplify the attention of the poisoned trojan prompt on triggers. Experiments show that our TrojFSP achieves an ASR of over 99% while maintaining negligible decreases in CDA across various PLMs and datasets. The source code of TrojFSP is available at _

pdf bib
Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
Yi Luo | Zhenghao Lin | YuHao Zhang | Jiashuo Sun | Chen Lin | Chengjin Xu | Xiangdong Su | Yelong Shen | Jian Guo | Yeyun Gong

Large Language Models (LLMs) exhibit impressive capabilities but also present risks such as biased content generation and privacy issues. One of the current alignment techniques includes principle-driven integration, but it faces challenges arising from the imprecision of manually crafted rules and inadequate risk perception in models without safety training. To address these, we introduce Guide-Align, a two-stage approach. Initially, a safety-trained model identifies potential risks and formulates specific guidelines for various inputs, establishing a comprehensive library of guidelines and a model for input-guidelines retrieval. Subsequently, the retrieval model correlates new inputs with relevant guidelines, which guide LLMs in response generation to ensure safe and high-quality outputs, thereby aligning with human values. An additional optional stage involves fine-tuning a model with well-aligned datasets generated through the process implemented in the second stage.Our method customizes guidelines to accommodate diverse inputs, thereby enhancing the fine-grainedness and comprehensiveness of the guideline library. Furthermore, it incorporates safety expertise from a safety-trained LLM through a lightweight retrieval model.We evaluate our approach on three benchmarks, demonstrating significant improvements in LLM security and quality. Notably, our fine-tuned model, Labrador, even at 13 billion parameters, outperforms GPT-3.5-turbo and surpasses GPT-4 in alignment capabilities.

pdf bib
X-PARADE: Cross-Lingual Textual Entailment and Information Divergence across Paragraphs
Juan Rodriguez | Katrin Erk | Greg Durrett

Understanding when two pieces of text convey the same information is a goal touching many subproblems in NLP, including textual entailment and fact-checking. This problem becomes more complex when those two pieces of text are in different languages. Here, we introduce X-PARADE (Cross-lingual Paragraph-level Analysis of Divergences and Entailments), the first cross-lingual dataset of paragraph-level information divergences. Annotators label a paragraph in a target language at the span level and evaluate it with respect to a corresponding paragraph in a source language, indicating whether a given piece of information is the same, new, or new but can be inferred. This last notion establishes a link with cross-language NLI. Aligned paragraphs are sourced from Wikipedia pages in different languages, reflecting real information divergences observed in the wild. Armed with our dataset, we investigate a diverse set of approaches for this problem, including classic token alignment from machine translation, textual entailment methods that localize their decisions, and prompting LLMs. Our results show that these methods vary in their capability to handle inferable information, but they all fall short of human performance.

pdf bib
Topics, Authors, and Institutions in Large Language Model Research: Trends from 17K arXiv Papers
Rajiv Movva | Sidhika Balachandar | Kenny Peng | Gabriel Agostini | Nikhil Garg | Emma Pierson

Large language models (LLMs) are dramatically influencing AI research, spurring discussions on what has changed so far and how to shape the field’s future. To clarify such questions, we analyze a new dataset of 16,979 LLM-related arXiv papers, focusing on recent trends in 2023 vs. 2018-2022. First, we study disciplinary shifts: LLM research increasingly considers societal impacts, evidenced by 20× growth in LLM submissions to the Computers and Society sub-arXiv. An influx of new authors – half of all first authors in 2023 – are entering from non-NLP fields of CS, driving disciplinary expansion. Second, we study industry and academic publishing trends. Surprisingly, industry accounts for a smaller publication share in 2023, largely due to reduced output from Google and other Big Tech companies; universities in Asia are publishing more. Third, we study institutional collaboration: while industry-academic collaborations are common, they tend to focus on the same topics that industry focuses on rather than bridging differences. The most prolific institutions are all US- or China-based, but there is very little cross-country collaboration. We discuss implications around (1) how to support the influx of new authors, (2) how industry trends may affect academics, and (3) possible effects of (the lack of) collaboration.

pdf bib
E5: Zero-shot Hierarchical Table Analysis using Augmented LLMs via Explain, Extract, Execute, Exhibit and Extrapolate
Zhehao Zhang | Yan Gao | Jian-Guang Lou

Analyzing large hierarchical tables with multi-level headers presents challenges due to their complex structure, implicit semantics, and calculation relationships. While recent advancements in large language models (LLMs) have shown promise in flat table analysis, their application to hierarchical tables is constrained by the reliance on manually curated exemplars and the model’s token capacity limitations. Addressing these challenges, we introduce a novel code-augmented LLM-based framework, E5, for zero-shot hierarchical table question answering. This approach encompasses self-explaining the table’s hierarchical structures, code generation to extract relevant information and apply operations, external code execution to prevent hallucinations, and leveraging LLMs’ reasoning for final answer derivation. Empirical results indicate that our method, based on GPT-4, outperforms state-of-the-art fine-tuning methods with a 44.38 Exact Match improvement. Furthermore, we present F3, an adaptive algorithm designed for token-limited scenarios, effectively condensing large tables while maintaining useful information. Our experiments prove its efficiency, enabling the processing of large tables even with models having limited context lengths. The code is available at

pdf bib
S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Model
Fangyu Lei | Qian Liu | Yiming Huang | Shizhu He | Jun Zhao | Kang Liu

The rapid development of Large Language Models (LLMs) has led to great strides in model capabilities like long-context understanding and reasoning.However, as LLMs are able to process longer contexts, it becomes more challenging to evaluate whether they have acquired certain capabilities, since the length of text (e.g., 200K tokens) they can process far exceeds what humans can reliably assess in a reasonable duration.In this paper, we propose using complex synthetic tasks as a proxy evaluation method, and present S3Eval, a Synthetic, Scalable, Systematic evaluation suite for LLMs evaluation.The synthetic nature of S3Eval provides users full control over the dataset, allowing them to systematically probe LLM capabilities by scaling text length and varying task difficulty across diverse scenarios.The strong correlation between S3Eval and real-world benchmarks demonstrates the soundness of using S3Eval for evaluation of LLMs.S3Eval provides a flexible and infinite long-context data generation method. We have generated a comprehensive dataset called S3Eval-Standard, and experimental results have shown that it poses significant challenges for all existing LLMs.

pdf bib
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
Fuxiao Liu | Xiaoyang Wang | Wenlin Yao | Jianshu Chen | Kaiqiang Song | Sangwoo Cho | Yaser Yacoob | Dong Yu

With the rapid development of large language models (LLMs) and their integration into large multimodal models (LMMs), there has beenimpressive progress in zero-shot completion of user-oriented vision-language tasks. However, a gap remains in the domain of chartimage understanding due to the distinct abstract components in charts. To address this, we introduce a large-scale MultiModal ChartInstruction (MMC-Instruction) dataset comprising 600k instances supporting diverse tasks and chart types. Leveraging this data, we de-velop MultiModal Chart Assistant (MMCA), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks. Recognizing the need for a comprehensive evaluation of LMM chart understanding, we also propose a MultiModal Chart Benchmark (MMC-Benchmark), a comprehensive human-annotated benchmark with nine distinct tasks evaluating reasoning capabilities over charts.Extensive experiments on MMC-Benchmark reveal the limitations of existing LMMs on correctly interpreting charts, even for the mostrecent GPT-4V model. Our work provides an instruction-tuning methodology and benchmark to advance multimodal understanding ofcharts. Code and data are available at

pdf bib
Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
Chengxu Zhuang | Evelina Fedorenko | Jacob Andreas

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways — requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically — with grounded supervision — exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models’ learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-dataregime, and sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images isnot redundant—models mainly driven by visual information yield qualitatively different from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.

pdf bib
Accurate Knowledge Distillation via n-best Reranking
Hendra Setiawan

We propose utilizing n-best reranking to enhance Sequence-Level Knowledge Distillation (Kim and Rush, 2016) where we extract pseudo-labels for student model’s training data from top n-best hypotheses and leverage a diverse set of models with different inductive biases, objective functions or architectures, including some publicly-available large language models, to pick the highest-quality hypotheses as labels. The effectiveness of our proposal is validated through experiments on the WMT’21 German ↔ English and Chinese ↔ English translation tasks. Our results demonstrate that utilizing pseudo-labels generated by our n-best reranker leads to a significantly more accurate student model. In fact, our best student model achieves comparable accuracy to a large translation model from (Tran et al., 2021) with 4.7 billion parameters, while having two orders of magnitude fewer parameters.

pdf bib
AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition
Zhaorun Chen | Zhuokai Zhao | Zhihong Zhu | Ruiqi Zhang | Xiang Li | Bhiksha Raj | Huaxiu Yao

Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework **AutoPRM** that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, **AutoPRM** first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially apply reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that **AutoPRM** significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, **AutoPRM** can be easily integrated with other orthogonal reasoning pipelines.

pdf bib
SEMQA: Semi-Extractive Multi-Source Question Answering
Tal Schuster | Adam Lelkes | Haitian Sun | Jai Gupta | Jonathan Berant | William Cohen | Donald Metzler

Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge.In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer, while mixing factual quoted spans—copied verbatim from given input sources—and non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder to attribute fully abstractive answers. Particularly, it enables a new mode for language models that leverages their advanced language generation capabilities, while also producing fine in-line attributions by-design that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.

pdf bib
Fine-Tuning Language Models with Reward Learning on Policy
Hao Lang | Fei Huang | Yongbin Li

Reinforcement learning from human feedback (RLHF) has emerged as an effective approach to aligning large language models (LLMs) to human preferences.RLHF contains three steps, i.e., human preference collecting, reward learning, and policy optimization, which are usually performed serially.Despite its popularity, however, (fixed) reward models may suffer from inaccurate off-distribution, since policy optimization continuously shifts LLMs’ data distribution.Repeatedly collecting new preference data from the latest LLMs may alleviate this issue, which unfortunately makes the resulting system more complicated and difficult to optimize.In this paper, we propose reward learning on policy (RLP), an unsupervised framework that refines a reward model using policy samples to keep it on-distribution.Specifically, an unsupervised multi-view learning method is introduced to learn robust representations of policy samples.Meanwhile, a synthetic preference generation approach is developed to simulate high-quality preference data with policy outputs.Extensive experiments on three benchmark datasets show that RLP consistently outperforms the state-of-the-art.Our code is available at

pdf bib
A Universal Dependencies Treebank for Highland Puebla Nahuatl
Robert Pugh | Francis Tyers

We present a Universal Dependencies (UD) treebank for Highland Puebla Nahuatl. The treebank is only the second such UD corpus for a Mexican language, and supplements an existing treebank for another Nahuatl variant. We describe the process of data collection, annotation decisions and interesting syntactic constructions, and discuss some similarities and differences between the Highland Puebla Nahuatl treebank and the existing Western Sierra Puebla Nahuatl treebank.

pdf bib
COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances
Haryo Wibowo | Erland Fuadi | Made Nityasya | Radityo Eko Prasojo | Alham Aji

We present COPAL-ID, a novel, public Indonesian language common sense reasoning dataset. Unlike the previous Indonesian COPA dataset (XCOPA-ID), COPAL-ID incorporates Indonesian local and cultural nuances, and therefore, provides a more natural portrayal of day-to-day causal reasoning within the Indonesian cultural sphere. Professionally written by natives from scratch, COPAL-ID is more fluent and free from awkward phrases, unlike the translated XCOPA-ID. In addition, we present COPALID in both standard Indonesian and in Jakartan Indonesian–a dialect commonly used in daily conversation. COPAL-ID poses a greater challenge for existing open-sourced and closedstate-of-the-art multilingual language models, yet is trivially easy for humans. Our findings suggest that general multilingual models struggle to perform well, achieving 66.91% accuracy on COPAL-ID. South-East Asian-specific models achieve slightly better performance of 73.88% accuracy. Yet, this number still falls short of near-perfect human performance. This shows that these language models are still way behind in comprehending the local nuances of Indonesian.

pdf bib
IterAlign: Iterative Constitutional Alignment of Large Language Models
Xiusi Chen | Hongzhi Wen | Sreyashi Nag | Chen Luo | Qingyu Yin | Ruirui Li | Zheng Li | Wei Wang

With the rapid development of large language models (LLMs), aligning LLMs with human values and societal norms to ensure their reliability and safety has become crucial. Reinforcement learning with human feedback (RLHF) and Constitutional AI (CAI) have been proposed for LLM alignment. However, these methods require either heavy human annotations or explicitly pre-defined constitutions, which are labor-intensive and resource-consuming. To overcome these drawbacks, we study constitution-based LLM alignment and propose a data-driven constitution discovery and self-alignment framework called IterAlign. IterAlign leverages red teaming to unveil the weaknesses of an LLM and automatically discovers new constitutions using a stronger LLM. These constitutions are then used to guide self-correction of the base LLM. Such a constitution discovery pipeline can be run iteratively and automatically to discover new constitutions that specifically target the alignment gaps in the current LLM. Empirical results on several safety benchmark datasets and multiple base LLMs show that IterAlign successfully improves truthfulness, helpfulness, harmlessness and honesty, improving the LLM alignment by up to 13.5% in harmlessness.

pdf bib
OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking
Chia-Hsuan Lee | Hao Cheng | Mari Ostendorf

Large language models (LLMs) have revolutionized the landscape of Natural Language Processing, but are computationally expensive. To reduce the cost without sacrificing performance, previous studies have explored various approaches to harness the potential of Smaller Language Models (SLMs) as cost-effective alternatives to their larger counterparts. Driven by findings that SLMs and LLMs exhibit complementary strengths in a structured knowledge extraction task, this work presents a novel SLM/LLM routing framework designed to improve computational efficiency and enhance task performance. In dialogue state tracking tasks, the proposed routing framework enhances performance substantially compared to relying solely on LLMs, while reducing the computational costs by over 50%.

pdf bib
Multi-Operational Mathematical Derivations in Latent Space
Marco Valentino | Jordan Meadows | Lan Zhang | Andre Freitas

This paper investigates the possibility of approximating multiple mathematical operations in latent space for expression derivation. To this end, we introduce different multi-operational representation paradigms, modelling mathematical operations as explicit geometric transformations. By leveraging a symbolic engine, we construct a large-scale dataset comprising 1.7M derivation steps stemming from 61K premises and 6 operators, analysing the properties of each paradigm when instantiated with state-of-the-art neural encoders.Specifically, we investigate how different encoding mechanisms can approximate expression manipulation in latent space, exploring the trade-off between learning different operators and specialising within single operations, as well as the ability to support multi-step derivations and out-of-distribution generalisation. Our empirical analysis reveals that the multi-operational paradigm is crucial for disentangling different operators, while discriminating the conclusions for a single operation is achievable in the original expression encoder. Moreover, we show that architectural choices can heavily affect the training dynamics, structural organisation, and generalisation of the latent space, resulting in significant variations across paradigms and classes of encoders.

pdf bib
Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong
Chenglei Si | Navita Goyal | Tongshuang Wu | Chen Zhao | Shi Feng | Hal Daumé Iii | Jordan Boyd-Graber

Large Language Models (LLMs) are increasingly used for accessing information on the web. Their truthfulness and factuality are thus of great interest. To help users make the right decisions about the information they get, LLMs should not only provide information but also help users fact-check it. We conduct human experiments with 80 crowdworkers to compare language models with search engines (information retrieval systems) at facilitating fact-checking. We prompt LLMs to validate a given claim and provide corresponding explanations. Users reading LLM explanations are significantly more efficient than those using search engines while achieving similar accuracy. However, they over-rely on the LLMs when the explanation is wrong. To reduce over-reliance on LLMs, we ask LLMs to provide contrastive information—explain both why the claim is true and false, and then we present both sides of the explanation to users. This contrastive explanation mitigates users’ over-reliance on LLMs, but cannot significantly outperform search engines. Further, showing both search engine results and LLM explanations offers no complementary benefits compared to search engines alone. Taken together, our study highlights that natural language explanations by LLMs may not be a reliable replacement for reading the retrieved passages, especially in high-stakes settings where over-relying on wrong AI explanations could lead to critical consequences.

pdf bib
XferBench: a Data-Driven Benchmark for Emergent Language
Brendon Boldt | David Mortensen

In this paper, we introduce a benchmark for evaluating the overall quality of emergent languages using data-driven methods. Specifically, we interpret the notion of the “quality” of an emergent language as its similarity to human language within a deep learning framework. We measure this by using the emergent language as pretraining data for a downstream NLP tasks in human language—the better the downstream performance, the better the emergent language. We implement this benchmark as an easy-to-use Python package that only requires a text file of utterances from the emergent language to be evaluated. Finally, we empirically test the benchmark’s validity using human, synthetic, and emergent language baselines.

pdf bib
Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation
Se-eun Yoon | Zhankui He | Jessica Echterhoff | Julian McAuley

Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, raising the question of their ability to represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol is comprised of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate these tasks effectively reveal deviations of language models from human behavior, and offer insights on how to reduce the deviations with model selection and prompting strategies.

pdf bib
A Symbolic Framework for Evaluating Mathematical Reasoning and Generalisation with Transformers
Jordan Meadows | Marco Valentino | Damien Teney | Andre Freitas

This paper proposes a methodology for generating and perturbing detailed derivations of equations at scale, aided by a symbolic engine, to evaluate the generalisability of Transformers to out-of-distribution mathematical reasoning problems. Instantiating the framework in the context of sequence classification tasks, we compare the capabilities of GPT-4, GPT-3.5, and a canon of fine-tuned BERT models, exploring the relationship between specific operators and generalisation failure via the perturbation of reasoning aspects such as symmetry and variable surface forms. Surprisingly, our empirical evaluation reveals that the average in-distribution performance of fine-tuned models surpasses GPT-3.5, and rivals GPT-4. However, perturbations to input reasoning can reduce their performance by up to 80 F1 points. Overall, the results suggest that the in-distribution performance of smaller open-source models may potentially rival GPT by incorporating appropriately structured derivation dependencies during training, and highlight a shared weakness between BERT and GPT involving a relative inability to decode indirect references to mathematical entities. We release the full codebase, constructed datasets, and fine-tuned models to encourage future progress in the field.

pdf bib
Identifying Linear Relational Concepts in Large Language Models
David Chanin | Anthony Hunter | Oana-Maria Camburu

Transformer language models (LMs) have been shown to represent concepts as directions in the latent space of hidden activations. However, for any human-interpretable concept, how can we find its direction in the latent space? We present a technique called linear relational concepts (LRC) for finding concept directions corresponding to human-interpretable concepts by first modeling the relation between subject and object as a linear relational embedding (LRE). We find that inverting the LRE and using earlier object layers results in a powerful technique for finding concept directions that outperforms standard black-box probing classifiers. We evaluate LRCs on their performance as concept classifiers as well as their ability to causally change model output.

pdf bib
Benchmark Transparency: Measuring the Impact of Data on Evaluation
Venelin Kovatchev | Matthew Lease

In this paper we present an exploratory research on quantifying the impact that data distribution has on the performance and evaluation of NLP models. We propose an automated framework that measures the data point distribution across 6 different dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity.We use disproportional stratified sampling to measure how much the data distribution affects absolute (Acc/F1) and relative (Rank) model performance. We experiment on 2 different datasets (SQUAD and MNLI) and test a total of 135 different models (125 on SQUAD and 10 on MNLI). We demonstrate that without explicit control of the data distribution, standard evaluation frameworks are inconsistent and unreliable. We find that the impact of the data is statistically significant and is often larger than the impact of changing the metric. In a second set of experiments, we demonstrate that the impact of data on evaluation is not just observable, but also predictable. We propose to use benchmark transparency as a method for comparing datasets and quantifying the similarity between them. We find that the “dataset similarity vector” can be used to predict how well a model generalizes out of distribution.

pdf bib
JAMDEC: Unsupervised Authorship Obfuscation using Constrained Decoding over Small Language Models
Jillian Fisher | Ximing Lu | Jaehun Jung | Liwei Jiang | Zaid Harchaoui | Yejin Choi

The permanence of online content combined with the enhanced authorship identification techniques calls for stronger computational methods to protect the identity and privacy of online authorship when needed, e.g., blind reviews for scientific papers, anonymous online reviews, or anonymous interactions in the mental health forums. In this paper, we propose an unsupervised inference-time approach to authorship obfuscation to address the unique challenges of authorship obfuscation: lack of supervision data for diverse authorship and domains, and the need for a sufficient level of revision beyond simple paraphrasing to obfuscate the authorship, all the while preserving the original content and fluency.We introduce JAMDEC, a user-controlled, inference-time algorithm for authorship obfuscation that can be in principle applied to any text and authorship. Our approach builds on small language models such as GPT2-XL in order to help avoid disclosing the original content to proprietary LLM’s APIs, while also reducing the performance gap between small and large language models via algorithmic enhancement. The key idea behind our approach is to boost the creative power of smaller language models through constrained decoding, while also allowing for user-specified controls and flexibility. Experimental results demonstrate that our approach based on GPT2-XL outperforms previous state-of-the-art methods based on comparably small models, while performing competitively against GPT3.5 175B, a propriety model that is two orders of magnitudes larger.

pdf bib
REST: Retrieval-Based Speculative Decoding
Zhenyu He | Zexuan Zhong | Tianle Cai | Jason Lee | Di He

We introduce Retrieval-Based Speculative Decoding (REST), a novel algorithm designed to speed up language model generation. The key insight driving the development of REST is the observation that the process of text generation often includes certain common phases and patterns. Unlike previous methods that rely on a draft language model for speculative decoding, REST harnesses the power of retrieval to generate draft tokens. This method draws from the reservoir of existing knowledge, retrieving and employing relevant tokens based on the current context. Its plug-and-play nature allows for seamless integration and acceleration of any language model, all without necessitating additional training. When benchmarked on 7B and 13B language models in a single-batch setting, REST achieves a significant speedup of 1.62 × to 2.36 × on code or text generation. The source code of REST is available at

pdf bib
Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations
Sihao Chen | Hongming Zhang | Tong Chen | Ben Zhou | Wenhao Yu | Dian Yu | Baolin Peng | Hongwei Wang | Dan Roth | Dong Yu

We introduce sub-sentence encoder, a contrastively-learned contextual embedding model for fine-grained semantic representation of text. In contrast to the standard practice with sentence embeddings, where the meaning of an entire sequence of text is encoded into a fixed-length vector, the sub-sentence encoder learns to produce distinct contextual embeddings corresponding to different atomic propositions, i.e. atomic units of meaning expressed within a text sequence. The sub-sentence embeddings are contrastively learned to recognize (inferred) semantic equivalence between propositions across different text sequences. Our experiments show the effectiveness of sub-sentence encoders in applications, such as retrieving supporting facts for fine-grained text attribution or recognizing the conditional semantic similarity between texts. In practice, we demonstrate that sub-sentence encoders keep the same level of inference cost and space complexity compared to sentence encoders.

pdf bib
MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference
Mobashir Sadat | Cornelia Caragea

The task of scientific Natural Language Inference (NLI) involves predicting the semantic relation between two sentences extracted from research articles. This task was recently proposed along with a new dataset called SciNLI derived from papers published in the computational linguistics domain. In this paper, we aim to introduce diversity in the scientific NLI task and present MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains. The availability of multiple domains makes it possible to study domain shift for scientific NLI. We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs). The highest Macro F1 scores of PLM and LLM baselines are 77.21% and 51.77%, respectively, illustrating that MSciNLI is challenging for both types of models. Furthermore, we show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset. Finally, we use both scientific NLI datasets in an intermediate task transfer learning setting and show that they can improve the performance of downstream tasks in the scientific domain. We make our dataset and code available on Github.

pdf bib
Causal Inference for Human-Language Model Collaboration
Bohan Zhang | Yixin Wang | Paramveer Dhillon

In this paper, we examine the collaborative dynamics between humansand language models (LMs), where the interactions typically involveLMs proposing text segments and humans editing or responding to theseproposals. Productive engagement with LMs in such scenarios necessitates that humans discern effective text-based interaction strategies, such as editing and response styles, from historical human-LM interactions. This objective is inherently causal, driven by the counterfactual ‘what-if’ question: how would the outcome of collaboration change if humans employed a different text editing/refinement strategy? A key challenge in answering this causal inference question is formulating an appropriate causal estimand: the conventional average treatment effect (ATE) estimand is inapplicable to text-based treatments due to their high dimensionality. To address this concern, we introduce a new causal estimand– *Incremental Stylistic Effect (ISE)*, which characterizes the average impact of infinitesimally shifting a text towards a specific style, such as increasing formality. We establish the conditions for the non-parametric identification of ISE. Building on this, we develop *CausalCollab*, an algorithm designed to estimate the ISE of various interaction strategies in dynamic human-LM collaborations. Our empirical investigations across three distinct human-LM collaboration scenarios reveal that *CausalCollab* effectively reduces confounding and significantly improves counterfactual estimation over a set of competitive baselines.

pdf bib
SELF-GUARD: Empower the LLM to Safeguard Itself
Zezhong Wang | Fangkai Yang | Lu Wang | Pu Zhao | Hongru Wang | Liang Chen | Qingwei Lin | Kam-Fai Wong

With the increasing risk posed by jailbreak attacks, recent studies have investigated various methods to improve the safety of large language models (LLMs), mainly falling into two strategies: safety training and safeguards. Safety training involves fine-tuning the LLM with adversarial samples, which activate the LLM’s capabilities against jailbreak. However, it is not always effective in countering new attacks and often leads to potential performance degradation. Safeguards, on the other hand, are methods using additional models to filter harmful content from the LLM’s response. Nevertheless, they can only reduce a limited amount of harmful output and introduce extra computational costs. Given the distinct strengths and weaknesses of both, we combine them to balance out their flaws and propose a more effective method called Self-Guard.Specifically, we train the LLM to review its responses for any harmful content and append a [harmful] or [harmless] tag to the end of the response. In this way, Self-Guard possesses the advantages of safety training, leveraging the powerful capabilities of the LLMs themselves to detect harmfulness. Besides that, it gains flexibility like safeguards, making the safety check target the output side, which makes the system less vulnerable to attack updates. Experimental results indicate that our Self-Guard can effectively defend against jailbreak attacks and will not cause LLMs’ performance degradation.

pdf bib
COSIGN: Contextual Facts Guided Generation for Knowledge Graph Completion
Jinpeng Li | Hang Yu | Xiangfeng Luo | Qian Liu

Knowledge graph completion (KGC) aims to infer missing facts based on existing facts within a KG. Recently, research on generative models (GMs) has addressed the limitations of embedding methods in terms of generality and scalability. However, GM-based methods are sensitive to contextual facts on KG, so the contextual facts of poor quality can cause GMs to generate erroneous results. To improve the performance of GM-based methods for various KGC tasks, we propose a COntextual FactS GuIded GeneratioN (COSIGN) model. First, to enhance the inference ability of the generative model, we designed a contextual facts collector to achieve human-like retrieval behavior. Second, a contextual facts organizer is proposed to learn the organized capabilities of LLMs through knowledge distillation. Finally, the organized contextual facts as the input of the inference generator to generate missing facts. Experimental results demonstrate that COSIGN outperforms state-of-the-art baseline techniques in terms of performance.

pdf bib
Toward Informal Language Processing: Knowledge of Slang in Large Language Models
Zhewei Sun | Qian Hu | Rahul Gupta | Richard Zemel | Yang Xu

Recent advancement in large language models (LLMs) has offered a strong potential for natural language systems to process informal language. A representative form of informal language is slang, used commonly in daily conversations and online social media. To date, slang has not been comprehensively evaluated in LLMs due partly to the absence of a carefully designed and publicly accessible benchmark. Using movie subtitles, we construct a dataset that supports evaluation on a diverse set of tasks pertaining to automatic processing of slang. For both evaluation and finetuning, we show the effectiveness of our dataset on two core applications: 1) slang detection, and 2) identification of regional and historical sources of slang from natural sentences. We also show how our dataset can be used to probe the output distributions of LLMs for interpretive insights. We find that while LLMs such as GPT-4 achieve good performance in a zero-shot setting, smaller BERT-like models finetuned on our dataset achieve comparable performance. Furthermore, we show that our dataset enables finetuning of LLMs such as GPT-3.5 that achieve substantially better performance than strong zero-shot baselines. Our work offers a comprehensive evaluation and a high-quality benchmark on English slang based on the OpenSubtitles corpus, serving both as a publicly accessible resource and a platform for applying tools for informal language processing.

pdf bib
Ghostbuster: Detecting Text Ghostwritten by Large Language Models
Vivek Verma | Eve Fleisig | Nicholas Tomlin | Dan Klein

We introduce Ghostbuster, a state-of-the-art system for detecting AI-generated text.Our method works by passing documents through a series of weaker language models, running a structured search over possible combinations of their features, and then training a classifier on the selected features to predict whether documents are AI-generated.Crucially, Ghostbuster does not require access to token probabilities from the target model, making it useful for detecting text generated by black-box or unknown models.In conjunction with our model, we release three new datasets of human- and AI-generated text as detection benchmarks in the domains of student essays, creative writing, and news articles. We compare Ghostbuster to several existing detectors, including DetectGPT and GPTZero, as well as a new RoBERTa baseline. Ghostbuster achieves 99.0 F1 when evaluated across domains, which is 5.9 F1 higher than the best preexisting model. It also outperforms all previous approaches in generalization across writing domains (+7.5 F1), prompting strategies (+2.1 F1), and language models (+4.4 F1). We also analyze our system’s robustness to a variety of perturbations and paraphrasing attacks, and evaluate its performance on documents by non-native English speakers.

pdf bib
End-to-End Beam Retrieval for Multi-Hop Question Answering
Jiahao Zhang | Haiyang Zhang | Dongmei Zhang | Liu Yong | Shen Huang

Multi-hop question answering (QA) involves finding multiple relevant passages and step-by-step reasoning to answer complex questions, indicating a retrieve-and-read paradigm. However, previous retrievers were customized for two-hop questions, and most of them were trained separately across different hops, resulting in a lack of supervision over the entire multi-hop retrieval process and leading to poor performance in complicated scenarios beyond two hops. In this work, we introduce Beam Retrieval, an end-to-end beam retrieval framework for multi-hop QA. This approach models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and two classification heads across all hops. Moreover, Beam Retrieval maintains multiple partial hypotheses of relevant passages at each step, expanding the search space and reducing the risk of missing relevant passages. To establish a complete QA system, we incorporate a supervised reader or a large language model (LLM). Experimental results demonstrate that Beam Retrieval achieves a nearly 50% improvement compared with baselines on challenging MuSiQue-Ans, and it also surpasses all previous retrievers on HotpotQA and achieves 99.9% precision on 2WikiMultiHopQA. Providing high-quality context, Beam Retrieval helps our supervised reader achieve new state-of-the-art performance and substantially improves the few-shot QA performance of LLMs.

pdf bib
Leveraging Generative Large Language Models with Visual Instruction and Demonstration Retrieval for Multimodal Sarcasm Detection
Binghao Tang | Boda Lin | Haolong Yan | Si Li

Multimodal sarcasm detection aims to identify sarcasm in the given image-text pairs and has wide applications in the multimodal domains. Previous works primarily design complex network structures to fuse the image-text modality features for classification. However, such complicated structures may risk overfitting on in-domain data, reducing the performance in out-of-distribution (OOD) scenarios. Additionally, existing methods typically do not fully utilize cross-modal features, limiting their performance on in-domain datasets. Therefore, to build a more reliable multimodal sarcasm detection model, we propose a generative multimodal sarcasm model consisting of a designed instruction template and a demonstration retrieval module based on the large language model. Moreover, to assess the generalization of current methods, we introduce an OOD test set, RedEval. Experimental results demonstrate that our method is effective and achieves state-of-the-art (SOTA) performance on the in-domain MMSD2.0 and OOD RedEval datasets.

pdf bib
Multi-Scale Prompt Memory-Augmented Model for Black-Box Scenarios
Xiaojun Kuang | C. L. Philip Chen | Shuzhen Li | Tong Zhang

Black-box few-shot text classification handles text classification in limited data without accessing the parameters and gradients of language models (LMs). Existing black-box optimization methods have demonstrated strong few-shot learning capabilities. However, they still require numerous LMs’ calls to search optimal prompts, thus resulting in overfitting performance and increasing computational cost. To address this issue, we present MuSKPrompt (Multi-scale Knowledge Prompt for Memory Model), an efficient multi-scale knowledge prompt-based memory model in black-box few-shot text classification task. MuSKPrompt extracts instance-level and class-level knowledge at different scales and stores them in memory banks during training. Then, it references multi-scale memory banks to perform quick inference on new samples via a novel scoring module. MuSKPrompt achieves competitive performance in limited data through multi-scale instance-level and class-level knowledge. Moreover, it realizes gradient-free optimization with zero training parameters in the black-box scenario. Experiments on different benchmarks and parameter analysis demonstrate the effectiveness and efficiency of MuSKPrompt in black-box few-shot text classification tasks.

pdf bib
Ungrammatical-syntax-based In-context Example Selection for Grammatical Error Correction
Chenming Tang | Fanyi Qu | Yunfang Wu

In the era of large language models (LLMs), in-context learning (ICL) stands out as an effective prompting strategy that explores LLMs’ potency across various tasks. However, applying LLMs to grammatical error correction (GEC) is still a challenging task. In this paper, we propose a novel ungrammatical-syntax-based in-context example selection strategy for GEC. Specifically, we measure similarity of sentences based on their syntactic structures with diverse algorithms, and identify optimal ICL examples sharing the most similar ill-formed syntax to the test input. Additionally, we carry out a two-stage process to further improve the quality of selection results. On benchmark English GEC datasets, empirical results show that our proposed ungrammatical-syntax-based strategies outperform commonly-used word-matching or semantics-based methods with multiple LLMs. This indicates that for a syntax-oriented task like GEC, paying more attention to syntactic information can effectively boost LLMs’ performance. Our code is available at

pdf bib
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Akari Asai | Sneha Kudugunta | Xinyan Yu | Terra Blevins | Hila Gonen | Machel Reid | Yulia Tsvetkov | Sebastian Ruder | Hannaneh Hajishirzi

Despite remarkable advancements in few-shot generalization in natural language processing, most models are developed and evaluated primarily in English. To establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer, we introduce a new benchmark, called BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format and provides a fixed set of few-shot examples and instructions. Using BUFFET, we perform thorough evaluations of ten state-of-the-art multilingual large language models with different transfer methods, namely in-context learning and fine-tuning. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer. Strong multilingual pre-trained or instruction-tuned models such as BLOOM or ChatGPT often lag behind much smaller mT5-base models given the same number of few-shot samples, particularly in low-resource languages. Our analysis suggests avenues for future research in few-shot cross-lingual transfer.

pdf bib
TISE: A Tripartite In-context Selection Method for Event Argument Extraction
Yanhe Fu | Yanan Cao | Qingyue Wang | Yi Liu

In-context learning enhances the reasoning capabilities of LLMs by providing several examples. A direct yet effective approach to obtain in-context example is to select the top-k examples based on their semantic similarity to the test input. However, when applied to event argument extraction (EAE), this approach exhibits two shortcomings: 1) It may select almost identical examples, thus failing to provide additional event information, and 2) It overlooks event attributes, leading to the selected examples being unrelated to the test event type. In this paper, we introduce three necessary requirements when selecting an in-context example for EAE task: semantic similarity, example diversity and event correlation. And we further propose TISE, which scores examples from these three perspectives and integrates them using Determinantal Point Processes to directly select a set of examples as context. Experimental results on the ACE05 dataset demonstrate the effectiveness of TISE and the necessity of three requirements. Furthermore, we surprisingly observe that TISE can achieve superior performance with fewer examples and can even exceed some supervised methods.

pdf bib
Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
Zhaofeng Wu | Linlu Qiu | Alexis Ross | Ekin Akyürek | Boyuan Chen | Bailin Wang | Najoung Kim | Jacob Andreas | Yoon Kim

The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects.

pdf bib
TRUE-UIE: Two Universal Relations Unify Information Extraction Tasks
Yucheng Wang | Bowen Yu | Yilin Liu | Shudong Lu

Information extraction (IE) encounters challenges due to the variety of schemas and objectives that differ across tasks. Recent advancements hint at the potential for universal approaches to model such tasks, referred to as Universal Information Extraction (UIE). While handling diverse tasks in one model, their generalization is limited since they are actually learning task-specific knowledge.In this study, we introduce an innovative paradigm known as TRUE-UIE, wherein all IE tasks are aligned to learn the same goals: extracting mention spans and two universal relations named \mathtt{NEXT} and \mathtt{IS}. During the decoding process, the \mathtt{NEXT} relation is utilized to group related elements, while the \mathtt{IS} relation, in conjunction with structured language prompts, undertakes the role of type recognition. Additionally, we consider the sequential dependency of tokens during span extraction, an aspect often overlooked in prevalent models.Our empirical experiments indicate that TRUE-UIE achieves state-of-the-art performance on established benchmarks encompassing 16 datasets, spanning 7 diverse IE tasks. Further evaluations reveal that our approach effectively share knowledge between different IE tasks, showcasing significant transferability in zero-shot and few-shot scenarios.

pdf bib
zrLLM: Zero-Shot Relational Learning on Temporal Knowledge Graphs with Large Language Models
Zifeng Ding | Heling Cai | Jingpei Wu | Yunpu Ma | Ruotong Liao | Bo Xiong | Volker Tresp

Modeling evolving knowledge over temporal knowledge graphs (TKGs) has become a heated topic. Various methods have been proposed to forecast links on TKGs. Most of them are embedding-based, where hidden representations are learned to represent knowledge graph (KG) entities and relations based on the observed graph contexts. Although these methods show strong performance on traditional TKG forecasting (TKGF) benchmarks, they face a strong challenge in modeling the unseen zero-shot relations that have no prior graph context. In this paper, we try to mitigate this problem as follows. We first input the text descriptions of KG relations into large language models (LLMs) for generating relation representations, and then introduce them into embedding-based TKGF methods. LLM-empowered representations can capture the semantic information in the relation descriptions. This makes the relations, whether seen or unseen, with similar semantic meanings stay close in the embedding space, enabling TKGF models to recognize zero-shot relations even without any observed graph context. Experimental results show that our approach helps TKGF models to achieve much better performance in forecasting the facts with previously unseen relations, while still maintaining their ability in link forecasting regarding seen relations.

pdf bib
Embodied Executable Policy Learning with Language-based Scene Summarization
Jielin Qiu | Mengdi Xu | William Han | Seungwhan Moon | Ding Zhao

Large Language models (LLMs) have shown remarkable success in assisting robot learning tasks, i.e., complex household planning.However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve with non-expert interactions with environments.In this work, we introduce a novel learning paradigm that generates robots’ executable actions in the form of text, derived solely from visual observations. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. We demonstrate that our proposed method can employ two fine-tuning strategies, including imitation learning and reinforcement learning approaches, to adapt to the target test tasks effectively.We conduct extensive experiments involving various model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.

pdf bib
Metacognitive Prompting Improves Understanding in Large Language Models
Yuqing Wang | Yun Zhao

In Large Language Models (LLMs), there have been consistent advancements in task-specific performance, largely influenced by effective prompt design. Recent advancements in prompting have enhanced reasoning in logic-intensive tasks for LLMs, yet the nuanced understanding abilities of these models, crucial for processing and interpreting complex information, remain underexplored. In this study, we introduce Metacognitive Prompting (MP), a strategy inspired by human introspective reasoning processes. Using MP, LLMs undergo a systematic series of structured, self-aware evaluations, drawing on both their vast inherent knowledge and new insights. We conduct extensive experiments on four prevalent LLMs: Llama2, PaLM2, GPT-3.5, and GPT-4, across ten natural language understanding (NLU) datasets from GLUE, SuperGLUE, BLUE, and LexGLUE benchmarks. Additionally, we compare our method with chain-of-thought prompting and its advanced versions. The results show that GPT-4 consistently excels across all tasks, while other models have shown significant progress in some tasks when used in conjunction with MP. Furthermore, MP consistently outperforms existing prompting methods in both general and domain-specific NLU tasks. This study underscores the potential to amplify the understanding abilities of LLMs and highlights the benefits of mirroring human introspective reasoning in NLU tasks.

pdf bib
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
Suyu Ge | Chunting Zhou | Rui Hou | Madian Khabsa | Yi-Chia Wang | Qifan Wang | Jiawei Han | Yuning Mao

Red-teaming is a common practice for mitigating unsafe behaviors in Large Language Models (LLMs), which involves thoroughly assessing LLMs to identify potential flaws and addressing them with responsible and accurate responses.While effective, manual red-teaming is costly, and existing automatic red-teaming typically discovers safety risks without addressing them.In this paper, we propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation, significantly increasing red-teaming scalability and the safety of the target LLM.Specifically, an adversarial LLM and a target LLM interplay with each other in an iterative manner, where the adversarial LLM aims to generate challenging prompts that elicit unsafe responses from the target LLM, while the target LLM is fine-tuned with safety aligned data on these adversarial prompts. In each round, the adversarial LLM crafts better attacks on the updated target LLM, while the target LLM also improves itself through safety fine-tuning.On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment reduces up to 84.7% after 4 rounds of MART, achieving comparable performance to LLMs with extensive adversarial prompt writing. Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.

pdf bib
DialogCC: An Automated Pipeline for Creating High-Quality Multi-Modal Dialogue Dataset
Young-Jun Lee | Byungsoo Ko | Han-Gyu Kim | Jonghwan Hyeon | Ho-Jin Choi

As sharing images in an instant message is a crucial factor, there has been active research on learning an image-text multi-modal dialogue models.However, training a well-generalized multi-modal dialogue model remains challenging due to the low quality and limited diversity of images per dialogue in existing multi-modal dialogue datasets.In this paper, we propose an automated pipeline to construct a multi-modal dialogue dataset, ensuring both dialogue quality and image diversity without requiring minimum human effort. In our pipeline, to guarantee the coherence between images and dialogue, we prompt GPT-4 to infer potential image-sharing moments - specifically, the utterance, speaker, rationale, and image description. Furthermore, we leverage CLIP similarity to maintain consistency between aligned multiple images to the utterance.Through this pipeline, we introduce DialogCC, a high-quality and diverse multi-modal dialogue dataset that surpasses existing datasets in terms of quality and diversity in human evaluation.Our comprehensive experiments highlight that when multi-modal dialogue models are trained using our dataset, their generalization performance on unseen dialogue datasets is significantly enhanced. We make our source code and dataset publicly available (

pdf bib
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
Keming Lu | Hongyi Yuan | Runji Lin | Junyang Lin | Zheng Yuan | Chang Zhou | Jingren Zhou

The complementary potential of Large Language Models (LLM) assumes off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate on it by mining latent expertise with off-the-shelf reward models. We propose ZOOTER, a reward-guided routing method distilling rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. ZOOTER shows computation efficiency in inference as it only introduces minor computation overhead of a routing function compared with reward model ranking methods. We evaluate ZOOTER on a comprehensive benchmark collection with 26 subsets in different domains and tasks. ZOOTER outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.

pdf bib
Automatic Generation of Model and Data Cards: A Step Towards Responsible AI
Jiarui Liu | Wenkai Li | Zhijing Jin | Mona Diab

In an era of model and data proliferation in machine learning/AI especially marked by the rapid advancement of open-sourced technologies, there arises a critical need for standardized consistent documentation. Our work addresses the information incompleteness in current human-written model and data cards. We propose an automated generation approach using Large Language Models (LLMs). Our key contributions include the establishment of CardBench, a comprehensive dataset aggregated from over 4.8k model cards and 1.4k data cards, coupled with the development of the CardGen pipeline comprising a two-step retrieval process. Our approach exhibits enhanced completeness, objectivity, and faithfulness in generated model and data cards, a significant step in responsible AI documentation practices ensuring better accountability and traceability.

pdf bib
FUN with Fisher: Improving Generalization of Adapter-Based Cross-lingual Transfer with Scheduled Unfreezing
Chen Liu | Jonas Pfeiffer | Ivan Vulić | Iryna Gurevych

Standard fine-tuning of language models typically performs well on in-distribution data, but suffers with generalization to distribution shifts. In this work, we aim to improve the generalization of adapter-based cross-lingual task transfer where such cross-language distribution shifts are imminent. We investigate scheduled unfreezing algorithms –originally proposed to mitigate catastrophic forgetting in transfer learning – for fine-tuning task adapters. Our experiments show that scheduled unfreezing methods close the gap to full fine-tuning and achieve stronger cross-lingual transfer performance, suggesting that these methods can go beyond just mitigating catastrophic forgetting. Next, aiming to understand these empirical findings, we investigate the learning dynamics of scheduled unfreezing using Fisher Information. Our experiments reveal that scheduled unfreezing induces different learning dynamics compared to standard fine-tuning, and provide evidence that the dynamics of Fisher Information during training correlate with cross-lingual generalization performance. We additionally propose a general scheduled unfreezing algorithm that achieves an average of 2 points improvement over four datasets compared to standard fine-tuning and provides empirical evidence for a theory-based justification of the heuristic unfreezing schedule for task adapter training.

pdf bib
Are Multilingual LLMs Culturally-Diverse Reasoners? An Investigation into Multicultural Proverbs and Sayings
Chen Liu | Fajri Koto | Timothy Baldwin | Iryna Gurevych

Large language models (LLMs) are highly adept at question answering and reasoning tasks, but when reasoning in a situational context, human expectations vary depending on the relevant cultural common ground. As languages are associated with diverse cultures, LLMs should also be culturally-diverse reasoners. In this paper, we study the ability of a wide range of state-of-the-art multilingual LLMs (mLLMs) to reason with proverbs and sayings in a conversational context. Our experiments reveal that: (1) mLLMs “know” limited proverbs and memorizing proverbs does not mean understanding them within a conversational context; (2) mLLMs struggle to reason with figurative proverbs and sayings, and when asked to select the wrong answer (instead of asking it to select the correct answer); and (3) there is a “culture gap” in mLLMs when reasoning about proverbs and sayings translated from other languages. We construct and release our evaluation dataset MAPS (MulticulturAl Proverbs and Sayings) for proverb understanding with conversational context for six different languages.

pdf bib
The Colorful Future of LLMs: Evaluating and Improving LLMs as Emotional Supporters for Queer Youth
Shir Lissak | Nitay Calderon | Geva Shenkman | Yaakov Ophir | Eyal Fruchter | Anat Brunstein Klomek | Roi Reichart

Queer youth face increased mental health risks, such as depression, anxiety, and suicidal ideation. Hindered by negative stigma, they often avoid seeking help and rely on online resources, which may provide incompatible information. Although access to a supportive environment and reliable information is invaluable, many queer youth worldwide have no access to such support. However, this could soon change due to the rapid adoption of Large Language Models (LLMs) such as ChatGPT. This paper aims to comprehensively explore the potential of LLMs to revolutionize emotional support for queers. To this end, we conduct a qualitative and quantitative analysis of LLM’s interactions with queer-related content. To evaluate response quality, we develop a novel ten-question scale that is inspired by psychological standards and expert input. We apply this scale to score several LLMs and human comments to posts where queer youth seek advice and share experiences. We find that LLM responses are supportive and inclusive, outscoring humans. However, they tend to be generic, not empathetic enough, and lack personalization, resulting in nonreliable and potentially harmful advice. We discuss these challenges, demonstrate that a dedicated prompt can improve the performance, and propose a blueprint of an LLM-supporter that actively (but sensitively) seeks user context to provide personalized, empathetic, and reliable responses. Our annotated dataset is available for further research.*

pdf bib
IPED: An Implicit Perspective for Relational Triple Extraction based on Diffusion Model
Jianli Zhao | Changhao Xu | Bin. Jiang

Relational triple extraction is a fundamental task in the field of information extraction, and a promising framework based on table filling has recently gained attention as a potential baseline for entity relation extraction. However, inherent shortcomings such as redundant information and incomplete triple recognition remain problematic. To address these challenges, we propose an Implicit Perspective for relational triple Extraction based on Diffusion model (IPED), an innovative approach for extracting relational triples. Our classifier-free solution adopts an implicit strategy using block coverage to complete the tables, avoiding the limitations of explicit tagging methods. Additionally, we introduce a generative model structure, the block-denoising diffusion model, to collaborate with our implicit perspective and effectively circumvent redundant information disruptions. Experimental results on two popular datasets demonstrate that IPED achieves state-of-the-art performance while gaining superior inference speed and low computational complexity. To support future research, we have made our source code publicly available online.

pdf bib
QualEval: Qualitative Evaluation for Model Improvement
Vishvak Murahari | Ameet Deshpande | Peter Clark | Tanmay Rajpurohit | Ashish Sabharwal | Karthik Narasimhan | Ashwin Kalyan

Quantitative evaluation metrics have been pivotal in gauging the advancements of AI systems like large language models (LLMs).However, due to the intricate nature of real-world tasks, a single scalar to quantify and compare performance trivializes the fine-grained nuances of model behavior. Additionally, metrics do not yield actionable diagnostics for model improvement, thus requiring extensive manual efforts of scientists, involving sifting through vast datasets and attempting hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QualEval, which uses automated qualitative evaluation as a vehicle for model improvement. QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that when applied, accelerate model improvement. The insights are supported by a dashboard report with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QualEval by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative on a challenging dialogue task (DialogSum) when compared to baselines. QualEval successfully increases the pace and quality of model development by eliminating the need of arduous manual analysis, thus serving as a data-scientist-in-a-box.

pdf bib
Quantum-inspired Language Model with Lindblad Master Equation and Interference Measurement for Sentiment Analysis
Kehuan Yan | Peichao Lai | Yilei Wang

Quantum-inspired models have demonstrated superior performance in many downstream language tasks, such as question answering and sentiment analysis. However, recent models primarily focus on embedding and measurement operations, overlooking the significance of the quantum evolution process. In this work, we present a novel quantum-inspired neural network, LI-QiLM, which integrates the Lindblad Master Equation (LME) to model the evolution process and the interferometry to the measurement process, providing more physical meaning to strengthen the interpretability. We conduct comprehensive experiments on six sentiment analysis datasets. Compared to the traditional neural networks, transformer-based pre-trained models and quantum-inspired models, such as CICWE-QNN and ComplexQNN, the proposed method demonstrates superior performance in accuracy and F1-score on six commonly used datasets for sentiment analysis. Additional ablation tests verify the effectiveness of LME and interferometry.

pdf bib
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
Dongsheng Zhu | Daniel Tang | Weidong Han | Jinghui Lu | Yukun Zhao | Guoliang Xing | Junfeng Wang | Dawei Yin

This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we have also optimized the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual content. Our comprehensive experiments on MMLMs, based on FlanT5 and Vicuna, show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves a 13.1% and 9% increase in accuracy over the prior state-of-the-art on the TextVQA and HatefulMemes datasets. Our main code is available at

pdf bib
A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
Peng Ding | Jun Kuang | Dan Ma | Xuezhi Cao | Yunsen Xian | Jiajun Chen | Shujian Huang

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as ‘jailbreaks’ can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLMs defense from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLMs developers towards the provision of safer and more regulated LLMs. The code is available at

pdf bib
P3Sum: Preserving Author’s Perspective in News Summarization with Diffusion Language Models
Yuhan Liu | Shangbin Feng | Xiaochuang Han | Vidhisha Balachandran | Chan Young Park | Sachin Kumar | Yulia Tsvetkov

In this work, we take a first step towards designing summarization systems that are faithful to the author’s intent, not only the semantic content of the article. Focusing on a case study of preserving political perspectives in news summarization, we find that existing approaches alter the political opinions and stances of news articles in more than 50% of summaries, misrepresenting the intent and perspectives of the news authors. We thus propose P3Sum, a diffusion model-based summarization approach controlled by political perspective classifiers. In P3Sum, the political leaning of a generated summary is iteratively evaluated at each decoding step, and any drift from the article’s original stance incurs a loss back-propagated to the embedding layers, steering the political stance of the summary at inference time. Extensive experiments on three news summarization datasets demonstrate that P3Sum outperforms state-of-the-art summarization systems and large language models by up to 13.7% in terms of the success rate of stance preservation, with competitive performance on standard metrics of summarization quality. Our findings present a first analysis of preservation of pragmatic features in summarization, highlight the lacunae in existing summarization models—that even state-of-the-art models often struggle to preserve author’s intents—and develop new summarization systems that are more faithful to author’s perspectives.

pdf bib
Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes
Rose Wang | Qingyang Zhang | Carly Robinson | Susanna Loeb | Dorottya Demszky

Scaling high-quality tutoring remains a major challenge in education. Due to growing demand, many platforms employ novice tutors who, unlike experienced educators, struggle to address student mistakes and thus fail to seize prime learning opportunities. Our work explores the potential of large language models (LLMs) to close the novice-expert knowledge gap in remediating math mistakes. We contribute Bridge, a method that uses cognitive task analysis to translate an expert’s latent thought process into a decision-making model for remediation. This involves an expert identifying (A) the student’s error, (B) a remediation strategy, and (C) their intention before generating a response. We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions. We evaluate state-of-the-art LLMs on our dataset and find that the expert’s decision-making model is critical for LLMs to close the gap: responses from GPT4 with expert decisions (e.g., “simplify the problem”) are +76% more preferred than without. Additionally, context-sensitive decisions are critical to closing pedagogical gaps: random decisions decrease GPT4’s response quality by -97% than expert decisions. Our work shows the potential of embedding expert thought processes in LLM generations to enhance their capability to bridge novice-expert knowledge gaps. Our dataset and code can be found at:

pdf bib
RST-LoRA: A Discourse-Aware Low-Rank Adaptation for Long Document Abstractive Summarization
Dongqi Pu | Vera Demberg

For long document summarization, discourse structure is important to discern the key content of the text and the differences in importance level between sentences. Unfortunately, the integration of rhetorical structure theory (RST) into parameter-efficient fine-tuning strategies for long document summarization remains unexplored. Therefore, this paper introduces RST-LoRA and proposes four RST-aware variants to explicitly incorporate RST into the LoRA model. Our empirical evaluation demonstrates that incorporating the type and uncertainty of rhetorical relations can complementarily enhance the performance of LoRA in summarization tasks. Furthermore, the best-performing variant we introduced outperforms the vanilla LoRA and full-parameter fine-tuning models, as confirmed by multiple automatic and human evaluations, and even surpasses previous state-of-the-art methods.

pdf bib
Strings from the Library of Babel: Random Sampling as a Strong Baseline for Prompt Optimisation
Yao Lu | Jiayi Wang | Raphael Tang | Sebastian Riedel | Pontus Stenetorp

Recent prompt optimisation approaches use the generative nature of language models to produce prompts – even rivaling the performance of human-curated prompts. In this paper, we demonstrate that randomly sampling tokens from the model vocabulary as “separators” can be as effective as language models for prompt-style text classification. Our experiments show that random separators are competitive baselines, having less than a 1% difference compared to previous self-optimisation methods and showing a 12% average relative improvement over strong human baselines across nine text classification tasks and eight language models. We further analyse this phenomenon in detail using three different random generation strategies, establishing that the language space is rich with potentially good separators, with a greater than 40% average chance that a randomly drawn separator performs better than human-curated separators. These observations challenge the common assumption that an effective prompt should be human readable or task relevant and establish a strong baseline for prompt optimisation research.

pdf bib
ReTA: Recursively Thinking Ahead to Improve the Strategic Reasoning of Large Language Models
Jinhao Duan | Shiqi Wang | James Diffenderfer | Lichao Sun | Tianlong Chen | Bhavya Kailkhura | Kaidi Xu

Current logical reasoning evaluations of Large Language Models (LLMs) primarily focus on single-turn and static environments, such as arithmetic problems. The crucial problem of multi-turn, strategic reasoning is under-explored. In this work, we analyze the multi-turn strategic reasoning of LLMs through text-driven complete- and incomplete-information gaming, e.g., board games (Tic-Tac-Toe, Connect-4) and poker games (Texas Hold’em Poker). Specifically, we consider two distinct scenarios: 1) Online Racing, featuring multiple LLMs/agents to facilitate direct competition and comparison; 2) Offline Probing, constructing targeted questions with verified ground truth to evaluate LLMs’ strategic behaviors. Experimental results demonstrate that existing state-of-the-art LLMs and reasoning schemes are largely ineffective for strategic reasoning tasks. To mitigate these limitations, we propose a simple yet effective Recursively Thinking-Ahead (ReTA) agent, incorporating a recursive prompting mechanism that automatically analyzes the opponents’ future moves/actions and assigns reward signals for these situations, to strengthen the strategic reasoning of LLMs. We hope our work could spur further research and exploration in the multi-turn strategic reasoning of LLMs. The code is available at

pdf bib
Fact Checking Beyond Training Set
Payam Karisani | Heng Ji

Evaluating the veracity of everyday claims is time consuming and in some cases requires domain expertise. We empirically demonstrate that the commonly used fact checking pipeline, known as the retriever-reader, suffers from performance deterioration when it is trained on the labeled data from one domain and used in another domain. Afterwards, we delve into each component of the pipeline and propose novel algorithms to address this problem. We propose an adversarial algorithm to make the retriever component robust against distribution shift. Our core idea is to initially train a bi-encoder on the labeled source data, and then, to adversarially train two separate document and claim encoders using unlabeled target data. We then focus on the reader component and propose to train it such that it is insensitive towards the order of claims and evidence documents. Our empirical evaluations support the hypothesis that such a reader shows a higher robustness against distribution shift. To our knowledge, there is no publicly available multi-topic fact checking dataset. Thus, we propose a simple automatic method to re-purpose two well-known fact checking datasets. We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models, including recent domain adaptation models that use GPT4 for generating synthetic data.

pdf bib
Program-Aided Reasoners (Better) Know What They Know
Anubha Kabra | Sanketh Rangreji | Yash Mathur | Aman Madaan | Emmy Liu | Graham Neubig

Prior work shows that program-aided reasoning, in which large language models (LLMs) are combined with programs written in programming languages such as Python, can significantly improve accuracy on various reasoning tasks. However, while accuracy is essential, it is also important for such reasoners to “know what they know”, which can be quantified through the calibration of the model. In this paper, we compare the calibration of Program Aided Language Models (PAL) and text-based Chain-of-thought (COT) prompting techniques over 5 datasets and 2 model types - LLaMA models and OpenAI models. Our results indicate that PAL leads to improved calibration in 75% of the instances. Our analysis uncovers that prompting styles that produce lesser diversity in generations also have more calibrated results, and thus we also experiment with inducing lower generation diversity using temperature scaling and find that for certain temperatures, PAL is not only more accurate but is also more calibrated than COT. Overall, we demonstrate that, in the majority of cases, program-aided reasoners better know what they know than text-based counterparts.

pdf bib
The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels
Eve Fleisig | Su Lin Blodgett | Dan Klein | Zeerak Talat

Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement–some challenged by perspectivist approaches, and some that remain to be addressed–as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.

pdf bib
Principles from Clinical Research for NLP Model Generalization
Aparna Elangovan | Jiayuan He | Yuan Li | Karin Verspoor

The NLP community typically relies on performance of a model on a held-out test set to assess generalization. Performance drops observed in datasets outside of official test sets are generally attributed to “out-of-distribution” effects. Here, we explore the foundations of generalizability and study the factors that affect it, articulating lessons from clinical studies. In clinical research, generalizability is an act of reasoning that depends on (a) *internal validity* of experiments to ensure controlled measurement of cause and effect, and (b) *external validity* or transportability of the results to the wider population. We demonstrate how learning spurious correlations, such as the distance between entities in relation extraction tasks, can affect a model’s internal validity and in turn adversely impact generalization. We, therefore, present the need to ensure internal validity when building machine learning models in NLP. Our recommendations also apply to generative large language models, as they are known to be sensitive to even minor semantic preserving alterations. We also propose adapting the idea of *matching* in randomized controlled trials and observational studies to NLP evaluation to measure causation.

pdf bib
First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Naomi Saphra | Eve Fleisig | Kyunghyun Cho | Adam Lopez

Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large n-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.

pdf bib
Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models
Raphael Tang | Crystina Zhang | Xueguang Ma | Jimmy Lin | Ferhan Ture

Large language models (LLMs) exhibit positional bias in how they use context, which especially affects listwise ranking. To address this, we propose permutation self-consistency, a form of self-consistency over the ranking list outputs of black-box LLMs. Our key idea is to marginalize out different list orders in the prompt to produce an order-independent ranking with less positional bias. First, given some input prompt, we repeatedly shuffle the list in the prompt and pass it through the LLM while holding the instructions the same. Next, we aggregate the resulting sample of rankings by computing the central ranking closest in distance to all of them, marginalizing out prompt order biases in the process. Theoretically, we prove the robustness of our method, showing convergence to the true ranking under random perturbations.Empirically, on five datasets in sorting and passage reranking, our approach improves scores from conventional inference by up to 34-52% for Mistral, 7-18% for GPT-3.5, 8-16% for LLaMA v2 (70B). Our code is at

pdf bib
From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Xuansheng Wu | Wenlin Yao | Jianshu Chen | Xiaoman Pan | Xiaoyang Wang | Ninghao Liu | Dong Yu

Large Language Models (LLMs) have achieved remarkable success, where instruction tuning is the critical step in aligning LLMs with user intentions. In this work, we investigate how the instruction tuning adjusts pre-trained models with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution, and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective of the model shifts on a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts, and promotes the response generation constantly conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at

pdf bib
POLYIE: A Dataset of Information Extraction from Polymer Material Scientific Literature
Jerry Cheung | Yuchen Zhuang | Yinghao Li | Pranav Shetty | Wantian Zhao | Sanjeev Grampurohit | Rampi Ramprasad | Chao Zhang

Scientific information extraction (SciIE), which aims to automatically extract information from scientific literature, is becoming more important than ever. However, there are no existing SciIE datasets for polymer materials, which is an important class of materials used ubiquitously in our daily lives. To bridge this gap, we introduce POLYIE, a new SciIE dataset for polymer materials. POLYIE is curated from 146 full-length polymer scholarly articles, which are annotated with different named entities (i.e., materials, properties, values, conditions) as well as their N-ary relations by domain experts. POLYIE presents several unique challenges due to diverse lexical formats of entities, ambiguity between entities, and variable-length relations. We evaluate state-of-the-art named entity extraction and relation extraction models on POLYIE, analyze their strengths and weaknesses, and highlight some difficult cases for these models. To the best of our knowledge, POLYIE is the first SciIE benchmark for polymer materials, and we hope it will lead to more research efforts from the community on this challenging task. Our code and data are available on:

pdf bib
LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination
Kai Zhang | Yangyang Kang | Fubang Zhao | Xiaozhong Liu

Large Language Models (LLMs), such as GPT3.5, have exhibited remarkable proficiency in comprehending and generating natural language. On the other hand, medical assistants hold the potential to offer substantial benefits for individuals. However, the exploration of LLM-based personalized medical assistant remains relatively scarce. Typically, patients converse differently based on their background and preferences which necessitates the task of enhancing user-oriented medical assistant. While one can fully train an LLM for this objective, the resource consumption is unaffordable. Prior research has explored memory-based methods to enhance the response with aware of previous mistakes for new queries during a dialogue session. We contend that a mere memory module is inadequate and fully training an LLM can be excessively costly. In this study, we propose a novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning (PEFT) schema, to personalize medical assistants. To encourage further research into this area, we are releasing a new conversation dataset generated based on an open-source medical corpus and our implementation.

pdf bib
SumTra: A Differentiable Pipeline for Few-Shot Cross-Lingual Summarization
Jacob Parnell | Inigo Jauregi Unanue | Massimo Piccardi

Cross-lingual summarization (XLS) generates summaries in a language different from that of the input documents (e.g., English to Spanish), allowing speakers of the target language to gain a concise view of their content. In the present day, the predominant approach to this task is to take a performing, pretrained multilingual language model (LM) and fine-tune it for XLS on the language pairs of interest. However, the scarcity of fine-tuning samples makes this approach challenging in some cases. For this reason, in this paper we propose revisiting the summarize-and-translate pipeline, where the summarization and translation tasks are performed in a sequence. This approach allows reusing the many, publicly-available resources for monolingual summarization and translation, obtaining a very competitive zero-shot performance. In addition, the proposed pipeline is completely differentiable end-to-end, allowing it to take advantage of few-shot fine-tuning, where available. Experiments over two contemporary and widely adopted XLS datasets (CrossSum and WikiLingua) have shown the remarkable zero-shot performance of the proposed approach, and also its strong few-shot performance compared to an equivalent multilingual LM baseline, that the proposed approach has been able to outperform in many languages with only 10% of the fine-tuning samples.

pdf bib
KTRL+F: Knowledge-Augmented In-Document Search
Hanseok Oh | Haebin Shin | Miyoung Ko | Hyunji Lee | Minjoon Seo

We introduce a new problem KTRL+F, a knowledge-augmented in-document search that necessitates real-time identification of all semantic targets within a document with the awareness of external sources through a single natural query. KTRL+F addresses following unique challenges for in-document search: 1) utilizing knowledge outside the document for extended use of additional information about targets, and 2) balancing between real-time applicability with the performance.We analyze various baselines in KTRL+F and find limitations of existing models, such as hallucinations, high latency, or difficulties in leveraging external knowledge. Therefore, we propose a Knowledge-Augmented Phrase Retrieval model that shows a promising balance between speed and performance by simply augmenting external knowledge in phrase embedding. We also conduct a user study to verify whether solving KTRL+F can enhance search experience for users. It demonstrates that even with our simple model, users can reduce the time for searching with less queries and reduced extra visits to other sources for collecting evidence. We encourage the research community to work on KTRL+F to enhance more efficient in-document information access.

pdf bib
How Well Do Large Language Models Truly Ground?
Hyunji Lee | Se June Joo | Chaeeun Kim | Joel Jang | Doyoung Kim | Kyoung-Woon On | Minjoon Seo

To reduce issues like hallucinations and lack of control in Large Language Models (LLMs), a common method is to generate responses by grounding on external contexts given as input, known as knowledge-augmented models. However, previous research often narrowly defines “grounding” as just having the correct answer, which does not ensure the reliability of the entire response. To overcome this, we propose a stricter definition of grounding: a model is truly grounded if it (1) fully utilizes the necessary knowledge from the provided context, and (2) stays within the limits of that knowledge. We introduce a new dataset and a grounding metric to evaluate model capability under the definition. We perform experiments across 25 LLMs of different sizes and training methods and provide insights into factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications.

pdf bib
ALBA: Adaptive Language-Based Assessments for Mental Health
Vasudha Varadarajan | Sverker Sikström | Oscar Kjell | H. Schwartz

Mental health issues differ widely among individuals, with varied signs and symptoms. Recently, language-based assessments haveshown promise in capturing this diversity, but they require a substantial sample of words per person for accuracy. This work introducesthe task of Adaptive Language-Based Assessment (ALBA), which involves adaptively ordering questions while also scoring an individual’s latent psychological trait using limited language responses to previous questions. To this end, we develop adaptive testing methods under two psychometric measurement theories: Classical Test Theory and Item Response Theory.We empirically evaluate ordering and scoring strategies, organizing into two new methods: a semi-supervised item response theory-basedmethod (ALIRT) and a supervised Actor-Critic model. While we found both methods to improve over non-adaptive baselines, We foundALIRT to be the most accurate and scalable, achieving the highest accuracy with fewer questions (e.g., Pearson r ≈ 0.93 after only 3 questions as compared to typically needing at least 7 questions). In general, adaptive language-based assessments of depression and anxiety were able to utilize a smaller sample of language without compromising validity or large computational costs.

pdf bib
FREB-TQA: A Fine-Grained Robustness Evaluation Benchmark for Table Question Answering
Wei Zhou | Mohsen Mesgar | Heike Adel | Annemarie Friedrich

Table Question Answering (TQA) aims at composing an answer to a question based on tabular data. While prior research has shown that TQA models lack robustness, understanding the underlying cause and nature of this issue remains predominantly unclear, posing a significant obstacle to the development of robust TQA systems. In this paper, we formalize three major desiderata for a fine-grained evaluation of robustness of TQA systems. They should (i) answer questions regardless of alterations in table structure, (ii) base their responses on the content of relevant cells rather than on biases, and (iii) demonstrate robust numerical reasoning capabilities. To investigate these aspects, we create and publish a novel TQA evaluation benchmark in English. Our extensive experimental analysis reveals that none of the examined state-of-the-art TQA systems consistently excels in these three aspects. Our benchmark is a crucial instrument for monitoring the behavior of TQA systems and paves the way for the development of robust TQA systems. We release our benchmark publicly.

pdf bib
MILL: Mutual Verification with Large Language Models for Zero-Shot Query Expansion
Pengyue Jia | Yiding Liu | Xiangyu Zhao | Xiaopeng Li | Changying Hao | Shuaiqiang Wang | Dawei Yin

Query expansion, pivotal in search engines, enhances the representation of user information needs with additional terms. While existing methods expand queries using retrieved or generated contextual documents, each approach has notable limitations. Retrieval-based methods often fail to accurately capture search intent, particularly with brief or ambiguous queries. Generation-based methods, utilizing large language models (LLMs), generally lack corpus-specific knowledge and entail high fine-tuning costs. To address these gaps, we propose a novel zero-shot query expansion framework utilizing LLMs for mutual verification. Specifically, we first design a query-query-document generation method, leveraging LLMs’ zero-shot reasoning ability to produce diverse sub-queries and corresponding documents. Then, a mutual verification process synergizes generated and retrieved documents for optimal expansion. Our proposed method is fully zero-shot, and extensive experiments on three public benchmark datasets are conducted to demonstrate its effectiveness over existing methods. Our code is available online at to ease reproduction.

pdf bib
Efficient Benchmarking (of Language Models)
Yotam Perlitz | Elron Bandel | Ariel Gera | Ofir Arviv | Liat Ein-Dor | Eyal Shnarch | Noam Slonim | Michal Shmueli-Scheuer | Leshem Choshen

The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature.In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions, by using a new measure – Decision Impact on Reliability, DIoR for short.We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples.Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm, that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.

pdf bib
ReFACT: Updating Text-to-Image Models by Editing the Text Encoder
Dana Arad | Hadas Orgad | Yonatan Belinkov

Our world is marked by unprecedented technological, global, and socio-political transformations, posing a significant challenge to textto-image generative models. These models encode factual associations within their parameters that can quickly become outdated, diminishing their utility for end-users. To that end, we introduce ReFACT, a novel approach for editing factual associations in text-to-image models without relaying on explicit input from end-users or costly re-training. ReFACT updates the weights of a specific layer in the text encoder, modifying only a tiny portion of the model’s parameters and leaving the rest of the model unaffected.We empirically evaluate ReFACT on an existing benchmark, alongside a newly curated dataset.Compared to other methods, ReFACT achieves superior performance in both generalization to related concepts and preservation of unrelated concepts.Furthermore, ReFACT maintains image generation quality, making it a practical tool for updating and correcting factual information in text-to-image models.

pdf bib
A Likelihood Ratio Test of Genetic Relationship among Languages
V.S.D.S.Mahesh Akavarapu | Arnab Bhattacharya

Lexical resemblances among a group of languages indicate that the languages could be genetically related, i.e., they could have descended from a common ancestral language. However, such resemblances can arise by chance and, hence, need not always imply an underlying genetic relationship. Many tests of significance based on permutation of wordlists and word similarity measures appeared in the past to determine the statistical significance of such relationships. We demonstrate that although existing tests may work well for bilateral comparisons, i.e., on pairs of languages, they are either infeasible by design or are prone to yield false positives when applied to groups of languages or language families. To this end, inspired by molecular phylogenetics, we propose a likelihood ratio test to determine if given languages are related based on the proportion of invariant character sites in the aligned wordlists applied during tree inference. Further, we evaluate some language families and show that the proposed test solves the problem of false positives. Finally, we demonstrate that the test supports the existence of macro language families such as Nostratic and Macro-Mayan.

pdf bib
PaD: Program-aided Distillation Can Teach Small Models Reasoning Better than Chain-of-thought Fine-tuning
Xuekai Zhu | Biqing Qi | Kaiyan Zhang | Xinwei Long | Zhouhan Lin | Bowen Zhou

While large language models (LLMs) excel in various natural language processing tasks, their huge size and the inaccessibility of parameters present challenges for practical deployment. Previous studies try to distill task-specific ability from LLMs to smaller models, using data synthesis and chain-of-thought (CoT) fine-tuning. However, synthetic CoT data often contains faulty reasoning, which deteriorates the quality of distillation, especially in reasoning capabilities. In this work, we propose Program-aided Distillation (PaD), which introduces reasoning programs to suppress the errors in distilled data, and thus achieves better distillation quality for reasoning tasks. In PaD, we utilize the reasoning program to substitute the CoT, allowing automated error checking of synthetic data. Further, through error injecting and further training, the small distilling model could iteratively self-refine the reasoning. Moreover, we conduct a step-wise beam search by step-by-step verifying to acquire more exact reasoning chains. We evaluate PaD on arithmetic reasoning, symbolic reasoning, and general ability.Experimental results demonstrate that smaller models using PaD can not only outperform certain LLMs (e.g., LLaMA-1 13B) but also achieve strong improvement over baselines with a significantly smaller scale of parameters and data. The source code is publicly available at

pdf bib
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks
Sanchit Ahuja | Divyanshu Aggarwal | Varun Gumma | Ishaan Watts | Ashutosh Sathe | Millicent Ochieng | Rishav Hada | Prachi Jain | Mohamed Ahmed | Kalika Bali | Sunayana Sitaram

There has been a surge in LLM evaluation research to understand LLM capabilities and limitations. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Gemini-Pro, Mistral, Llama2, and Gemma) by comparing them on the same set of multilingual datasets. Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages. We also include two multimodal datasets in the benchmark and compare the performance of LLaVA models, GPT-4-Vision and Gemini-Pro-Vision. Our experiments show that larger models such as GPT-4, Gemini-Pro and PaLM2 outperform smaller models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 and Gemini-Pro on more datasets. We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.

pdf bib
Unlocking Emergent Modularity in Large Language Models
Zihan Qiu | Zeyu Huang | Jie Fu

Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models.Existing MNNs are generally explicit: their modular architectures are pre-defined, with individual modules expected to implement distinct functions.Recent works reveal that there exists implicit modularity in standard pre-trained transformers, namely Emergent Modularity.They indicate that such modular structures spontaneously exhibit during the early pre-training phase.Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized.In this work, focusing on unlocking the emergent modularity in LMs, we showcase that standard LMs could be fine-tuned as their Mixture-of-Expert (MoEs) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE).Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning.Our analysis and ablation studies further illustrate that it is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at

pdf bib
A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality
Maja Stahl | Nadine Michel | Sebastian Kilsbach | Julian Schmidtke | Sara Rezat | Henning Wachsmuth

Learning argumentative writing is challenging. Besides writing fundamentals such as syntax and grammar, learners must select and arrange argument components meaningfully to create high-quality essays. To support argumentative writing computationally, one step is to mine the argumentative structure. When combined with automatic essay scoring, interactions of the argumentative structure and quality scores can be exploited for comprehensive writing support. Although studies have shown the usefulness of using information about the argumentative structure for essay scoring, no argument mining corpus with ground-truth essay quality annotations has been published yet. Moreover, none of the existing corpora contain essays written by school students specifically. To fill this research gap, we present a German corpus of 1,320 essays from school students of two age groups. Each essay has been manually annotated for argumentative structure and quality on multiple levels of granularity. We propose baseline approaches to argument mining and essay scoring, and we analyze interactions between both tasks, thereby laying the ground for quality-oriented argumentative writing support.

pdf bib
Adjusting Interpretable Dimensions in Embedding Space with Human Judgments
Katrin Erk | Marianna Apidianaki

Embedding spaces contain interpretable dimensions indicating gender, formality in style, or even object properties. This has been observed multiple times. Such interpretable dimensions are becoming valuable tools in different areas of study, from social science to neuroscience. The standard way to compute these dimensions uses contrasting seed words and computes difference vectors over them. This is simple but does not always work well. We combine seed-based vectors with guidance from human ratings of where words fall along a specific dimension, and evaluate on predicting both object properties like size and danger, and the stylistic properties of formality and complexity. We obtain interpretable dimensions with markedly better performance especially in cases where seed-based dimensions do not work well.

pdf bib
PatentEval: Understanding Errors in Patent Generation
You Zuo | Kim Gerdes | Éric Clergerie | Benoît Sagot

In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.

pdf bib
Contextual Refinement of Translations: Large Language Models for Sentence and Document-Level Post-Editing
Sai Koneru | Miriam Exel | Matthias Huck | Jan Niehues

Large language models (LLMs) have demonstrated considerable success in various natural language processing tasks, but open-source LLMs have yet to attain state-of-the-art performance in Neural Machine Translation (NMT). Nevertheless, their significant performance in tasks demanding a broad understanding and contextual processing shows their potential for translation. To exploit these abilities, we investigate using LLMs for MT and explore recent parameter-efficient fine-tuning techniques. Surprisingly, our initial experiments found that fine-tuning with Q-LoRA for translation purposes led to performance improvements in terms of BLEU but degradation in COMET compared to in-context learning. To overcome this, we propose an alternative approach: adapting LLMs as Automatic Post-Editors (APE) rather than direct translators. Building on the ability of the LLM to handle long sequences, we also propose extending our approach to document-level translation. We show that leveraging Low-Rank-Adapter fine-tuning for APE can yield significant improvements across both sentence and document-level metrics while generalizing to out-of-domain data. Most notably, we achieve a state-of-the-art accuracy rate of 88.7% on the ContraPro test set, which assesses the model’s ability to resolve pronoun ambiguities when translating from English to German. Lastly, during manual post-editing for document-level translation, the source sentences are iteratively annotated, which can be used to refine further translations in the document. Here, we demonstrate that leveraging human corrections can significantly reduce the number of edits required for subsequent translations.

pdf bib
Metaphor Detection with Context Enhancement and Curriculum Learning
Kaidi Jia | Rongsheng Li

Metaphor detection is a challenging task for natural language processing (NLP) systems. Previous works failed to sufficiently utilize the internal and external semantic relationships between target words and their context. Furthermore, they have faced challenges in tackling the problem of data sparseness due to the very limited available training data. To address these two challenges, we propose a novel model called MiceCL. By leveraging the difference between the literal meaning of the target word and the meaning of the sentence as the sentence external difference, MiceCL can better handle the semantic relationships. Additionally, we propose a curriculum learning framework for automatically assessing difficulty of the sentence with a pre-trained model. By starting from easy examples and gradually progressing to more difficult ones, we can ensure that the model will not deal with complex data when its ability is weak so that to avoid wasting limited data. Experimental results demonstrate that MiceCL achieves competitive performance across multiple datasets, with a significantly improved convergence speed compared to other models.

pdf bib
What Causes the Failure of Explicit to Implicit Discourse Relation Recognition?
Wei Liu | Stephen Wan | Michael Strube

We consider an unanswered question in the discourse processing community: why do relation classifiers trained on explicit examples (with connectives removed) perform poorly in real implicit scenarios? Prior work claimed this is due to linguistic dissimilarity between explicit and implicit examples but provided no empirical evidence. In this study, we show that one cause for such failure is a label shift after connectives are eliminated. Specifically, we find that the discourse relations expressed by some explicit instances will change when connectives disappear. Unlike previous work manually analyzing a few examples, we present empirical evidence at the corpus level to prove the existence of such shift. Then, we analyze why label shift occurs by considering factors such as the syntactic role played by connectives, ambiguity of connectives, and more. Finally, we investigate two strategies to mitigate the label shift: filtering out noisy data and joint learning with connectives. Experiments on PDTB 2.0, PDTB 3.0, and the GUM dataset demonstrate that classifiers trained with our strategies outperform strong baselines.

pdf bib
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions
Siddhant Arora | Hayato Futami | Jee-weon Jung | Yifan Peng | Roshan Sharma | Yosuke Kashiwagi | Emiru Tsunoo | Karen Livescu | Shinji Watanabe

Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model’s behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model “UniverSLU” for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types.

pdf bib
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
Lingbo Mo | Boshi Wang | Muhao Chen | Huan Sun

The rapid progress in open-source Large Language Models (LLMs) is significantly driving AI development forward. However, there is still a limited understanding of their trustworthiness. Deploying these models at scale without sufficient trustworthiness can pose significant risks, highlighting the need to uncover these issues promptly. In this work, we conduct an adversarial assessment of open-source LLMs on trustworthiness, scrutinizing them across eight different aspects including toxicity, stereotypes, ethics, hallucination, fairness, sycophancy, privacy, and robustness against adversarial demonstrations. We propose advCoU, an extended Chain of Utterances-based (CoU) prompting strategy by incorporating carefully crafted malicious demonstrations for trustworthiness attack. Our extensive experiments encompass recent and representative series of open-source LLMs, including Vicuna, MPT, Falcon, Mistral, and Llama 2. The empirical outcomes underscore the efficacy of our attack strategy across diverse aspects. More interestingly, our result analysis reveals that models with superior performance in general NLP tasks do not always have greater trustworthiness; in fact, larger models can be more vulnerable to attacks. Additionally, models that have undergone instruction tuning, focusing on instruction following, tend to be more susceptible, although fine-tuning LLMs for safety alignment proves effective in mitigating adversarial trustworthiness attacks.

pdf bib
Paraphrase and Solve: Exploring and Exploiting the Impact of Surface Form on Mathematical Reasoning in Large Language Models
Yue Zhou | Yada Zhu | Diego Antognini | Yoon Kim | Yang Zhang

This paper studies the relationship between the surface form of a mathematical problem and its solvability by large language models. We find that subtle alterations in the surface form can significantly impact the answer distribution and the solve rate, exposing the language model’s lack of robustness and sensitivity to the surface form in reasoning through complex problems. To improve mathematical reasoning performance, we propose Self-Consistency-over-Paraphrases (SCoP), which diversifies reasoning paths from specific surface forms of the problem. We evaluate our approach on four mathematics reasoning benchmarks over three large language models and show that SCoP improves mathematical reasoning performance over vanilla self-consistency, particularly for problems initially deemed unsolvable. Finally, we provide additional experiments and discussion regarding problem difficulty and surface forms, including cross-model difficulty agreement and paraphrasing transferability, and Variance of Variations (VOV) for language model evaluation.

pdf bib
TriSum: Learning Summarization Ability from Large Language Models with Structured Rationale
Pengcheng Jiang | Cao Xiao | Zifeng Wang | Parminder Bhatia | Jimeng Sun | Jiawei Han

The advent of large language models (LLMs) has significantly advanced natural language processing tasks like text summarization. However, their large size and computational demands, coupled with privacy concerns in data transmission, limit their use in resource-constrained and privacy-centric settings. To overcome this, we introduce TriSum, a framework for distilling LLMs’ text summarization abilities into a compact, local model. Initially, LLMs extract a set of aspect-triple rationales and summaries, which are refined using a dual-scoring method for quality. Next, a smaller local model is trained with these tasks, employing a curriculum learning strategy that evolves from simple to complex tasks. Our method enhances local model performance on various benchmarks (CNN/DailyMail, XSum, and ClinicalTrial), outperforming baselines by 4.5%, 8.5%, and 7.4%, respectively. It also improves interpretability by providing insights into the summarization rationale.

pdf bib
GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models
Pengcheng Jiang | Jiacheng Lin | Zifeng Wang | Jimeng Sun | Jiawei Han

The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional relation extraction (RE) metrics like precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multi-dimensional assessment in terms of the topic similarity, uniqueness, granularity, factualness, and completeness of the GRE results. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated referential relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods that shows GenRES is consistent with human preferences for RE quality. Last, we made a comprehensive evaluation of fourteen leading LLMs using GenRES across document, bag, and sentence level RE datasets, respectively, to set the benchmark for future research in GRE

pdf bib
Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars
Andrés Lou | Juan Antonio Pérez-Ortiz | Felipe Sánchez-Martínez | Víctor Sánchez-Cartagena

The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value, that, nevertheless, remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish, and that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at

pdf bib
The Effect of Data Partitioning Strategy on Model Generalizability: A Case Study of Morphological Segmentation
Zoey Liu | Bonnie Dorr

Recent work to enhance data partitioning strategies for more realistic model evaluation face challenges in providing a clear optimal choice. This study addresses these challenges, focusing on morphological segmentation and synthesizing limitations related to language diversity, adoption of multiple datasets and splits, and detailed model comparisons. Our study leverages data from 19 languages, including ten indigenous or endangered languages across 10 language families with diverse morphological systems (polysynthetic, fusional, and agglutinative) and different degrees of data availability. We conduct large-scale experimentation with varying sized combinations of training and evaluation sets as well as new test data. Our results show that, when faced with new test data: (1) models trained from random splits are able to achieve higher numerical scores; (2) model rankings derived from random splits tend to generalize more consistently.

pdf bib
Measuring Entrainment in Spontaneous Code-switched Speech
Debasmita Bhattacharya | Siying Ding | Alayna Nguyen | Julia Hirschberg

It is well-known that speakers who entrain to one another have more successful conversations than those who do not. Previous research has shown that interlocutors entrain on linguistic features in both written and spoken monolingual domains. More recent work on code-switched communication has also shown preliminary evidence of entrainment on certain aspects of code-switching (CSW). However, such studies of entrainment in code-switched domains have been extremely few and restricted to human-machine textual interactions. Our work studies code-switched spontaneous speech between humans, finding that (1) patterns of written and spoken entrainment in monolingual settings largely generalize to code-switched settings, and (2) some patterns of entrainment on code-switching in dialogue agent-generated text generalize to spontaneous code-switched speech. Our findings give rise to important implications for the potentially “universal” nature of entrainment as a communication phenomenon, and potential applications in inclusive and interactive speech technology.

pdf bib
A Survey of Meaning Representations – From Theory to Practical Utility
Zacchary Sadeddine | Juri Opitz | Fabian Suchanek

Symbolic meaning representations of natural language text have been studied since at least the 1960s. With the availability of large annotated corpora, and more powerful machine learning tools, the field has recently seen several new developments. In this survey, we study today’s most prominent Meaning Representation Frameworks. We shed light on their theoretical properties, as well as on their practical research environment, i.e., on datasets, parsers, applications, and future challenges.

pdf bib
Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation
Haozhe Zhao | Zefan Cai | Shuzheng Si | Liang Chen | Yufeng He | Kaikai An | Baobao Chang

Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-language tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervise fine-tuning the mPLMs with multilingual data.However, obtaining labeled multilingual data is time-consuming, and fine-tuning mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data.Therefore, we introduce **ALSACE** to leverage the learned knowledge from the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparity across various mPLMs while showing the competitive performance on different multilingual NLU tasks, ranging from full resource to limited resource settings. The code for our approach is available at

pdf bib
Evaluating In-Context Learning of Libraries for Code Generation
Arkil Patel | Siva Reddy | Dzmitry Bahdanau | Pradeep Dasigi

Contemporary Large Language Models (LLMs) exhibit a high degree of code generation and comprehension capability. A particularly promising area is their ability to interpret code modules from unfamiliar libraries for solving user-instructed tasks. Recent work has shown that large proprietary LLMs can learn novel library usage in-context from demonstrations. These results raise several open questions: whether demonstrations of library usage is required, whether smaller (and more open) models also possess such capabilities, etc. In this work, we take a broader approach by systematically evaluating a diverse array of LLMs across three scenarios reflecting varying levels of domain specialization to understand their abilities and limitations in generating code based on libraries defined in-context. Our results show that even smaller open-source LLMs like Llama-2 and StarCoder demonstrate an adept understanding of novel code libraries based on specification presented in-context. Our findings further reveal that LLMs exhibit a surprisingly high proficiency in learning novel library modules even when provided with just natural language descriptions or raw code implementations of the functions, which are often cheaper to obtain than demonstrations. Overall, our results pave the way for harnessing LLMs in more adaptable and dynamic coding environments.

pdf bib
Visually-Aware Context Modeling for News Image Captioning
Tingyu Qu | Tinne Tuytelaars | Marie-Francine Moens

News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking human thought process of linking articles to images. Furthermore, to tackle the problem of the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method Contrasting with Language Model backbone (CoLaM) to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We out-perform the previous state-of-the-art (without external data) by 7.97/5.80 CIDEr scores on GoodNews/NYTimes800k. Our code is available at

pdf bib
Regularized Conventions: Equilibrium Computation as a Model of Pragmatic Reasoning
Athul Jacob | Gabriele Farina | Jacob Andreas

We present a game-theoretic model of pragmatics that we call ReCo (for Regularized Conventions). This model formulates pragmatic communication as a game in which players are rewarded for communicating successfully and penalized for deviating from a shared, “default” semantics. As a result, players assign utterances context-dependent meanings that jointly optimize communicative success and naturalness with respect to speakers’ and listeners’ background knowledge of language. By using established game-theoretic tools to compute equilibrium strategies for this game, we obtain principled pragmatic language generation procedures with formal guarantees of communicative success. Across several datasets capturing real and idealized human judgments about pragmatic implicature, ReCo matches, or slightly improves upon, predictions made by Iterated Best Response and Rational Speech Acts models of language understanding.

pdf bib
TopicGPT: A Prompt-based Topic Modeling Framework
Chau Pham | Alexander Hoyle | Simeng Sun | Philip Resnik | Mohit Iyyer

Topic modeling is a well-established technique for exploring text corpora. Conventional topic models (e.g., LDA) represent topics as bags of words that often require “reading the tea leaves” to interpret; additionally, they offer users minimal control over the formatting and specificity of resulting topics. To tackle these issues, we introduce TopicGPT, a prompt-based framework that uses large language models (LLMs) to uncover latent topics in a text collection. TopicGPT produces topics that align better with human categorizations compared to competing methods: it achieves a harmonic mean purity of 0.74 against human-annotated Wikipedia topics compared to 0.64 for the strongest baseline. Its topics are also more interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions. Moreover, the framework is highly adaptable, allowing users to specify constraints and modify topics without the need for model retraining. By streamlining access to high-quality and interpretable topics, TopicGPT represents a compelling, human-centered approach to topic modeling.

pdf bib
ChatGPT as an Attack Tool: Stealthy Textual Backdoor Attack via Blackbox Generative Model Trigger
Jiazhao Li | Yijin Yang | Zhuofeng Wu | V.G.Vinod Vydiswaran | Chaowei Xiao

Textual backdoor attacks, characterized by subtle manipulations of input triggers and training dataset labels, pose significant threats to security-sensitive applications. The rise of advanced generative models, such as GPT-4, with their capacity for human-like rewriting, makes these attacks increasingly challenging to detect. In this study, we conduct an in-depth examination of black-box generative models as tools for backdoor attacks, thereby emphasizing the need for effective defense strategies. We propose BGMAttack, a novel framework that harnesses advanced generative models to execute stealthier backdoor attacks on text classifiers. Unlike prior approaches constrained by subpar generation quality, BGMAttack renders backdoor triggers more elusive to human cognition and advanced machine detection. A rigorous evaluation of attack effectiveness over four sentiment classification tasks, complemented by four human cognition stealthiness tests, reveals BGMAttack’s superior performance, achieving a state-of-the-art attack success rate of 97.35% on average while maintaining superior stealth compared to conventional methods. The dataset and code are available:

pdf bib
Social Meme-ing: Measuring Linguistic Variation in Memes
Naitian Zhou | David Jurgens | David Bamman

Much work in the space of NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language comprised of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting SemanticMemes dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but that patterns of meme innovation and acculturation within these communities align with previous findings on written language.

pdf bib
ExpertQA: Expert-Curated Questions and Attributed Answers
Chaitanya Malaviya | Subin Lee | Sihao Chen | Elizabeth Sieber | Mark Yatskar | Dan Roth

As language models are adopted by a more sophisticated and diverse set of users, the importance of guaranteeing that they provide factually correct information supported by verifiable sources is critical across fields of study. This is especially the case for high-stakes fields, such as medicine and law, where the risk of propagating false information is high and can lead to undesirable societal consequences. Previous work studying attribution and factuality has not focused on analyzing these characteristics of language model outputs in domain-specific scenarios. In this work, we conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality, by bringing domain experts in the loop. Specifically, we collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions. In addition, we ask experts to improve upon responses from language models. The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.

pdf bib
What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception
Chaitanya Malaviya | Subin Lee | Dan Roth | Mark Yatskar

Eliciting feedback from end users of NLP models can be beneficial for improving models. However, how should we present model responses to users so they are most amenable to be corrected from user feedback? Further, what properties do users value to understand and trust responses? We answer these questions by analyzing the effect of rationales (or explanations) generated by QA models to support their answers. We specifically consider decomposed QA models that first extract an intermediate rationale based on a context and a question and then use solely this rationale to answer the question. A rationale outlines the approach followed by the model to answer the question. Our work considers various formats of these rationales that vary according to well-defined properties of interest. We sample rationales from language models using few-shot prompting for two datasets, and then perform two user studies. First, we present users with incorrect answers and corresponding rationales in various formats and ask them to provide natural language feedback to revise the rationale. We then measure the effectiveness of this feedback in patching these rationales through in-context learning. The second study evaluates how well different rationale formats enable users to understand and trust model answers, when they are correct. We find that rationale formats significantly affect how easy it is (1) for users to give feedback for rationales, and (2) for models to subsequently execute this feedback. In addition, formats with attributions to the context and in-depth reasoning significantly enhance user-reported understanding and trust of model outputs.

pdf bib
When Life Gives You Lemons, Make Cherryade: Converting Feedback from Bad Responses into Good Labels
Weiyan Shi | Emily Dinan | Kurt Shuster | Jason Weston | Jing Xu

Deployed dialogue agents have the potential to integrate human feedback to continuously improve themselves. However, humans may not always provide explicit signals when the chatbot makes mistakes during interactions. In this work, we propose Juicer, a framework to make use of both binary and free-form textual human feedback. It works by: (i) extending sparse binary feedback by training a satisfaction classifier to label the unlabeled data; and (ii) training a reply corrector to map the bad replies to good ones. We find that augmenting training with model-corrected replies improves the final dialogue model, and we can further improve performance by using both positive and negative replies through the recently proposed Director model.

pdf bib
Kreyòl-MT: Building MT for Latin American, Caribbean and Colonial African Creole Languages
Nathaniel Robinson | Raj Dabre | Ammon Shurtz | Rasul Dent | Onenamiyi Onesi | Claire Monroc | Loïc Grobol | Hasan Muhammad | Ashi Garg | Naome Etori | Vijay Murari Tiyyala | Olanrewaju Samuel | Matthew Stutzman | Bismarck Odoom | Sanjeev Khudanpur | Stephen Richardson | Kenton Murray

A majority of language technologies are tailored for a small number of high-resource languages, while relatively many low-resource languages are neglected. One such group, Creole languages, have long been marginalized in academic study, though their speakers could benefit from machine translation (MT). These languages are predominantly used in much of Latin America, Africa and the Caribbean. We present the largest cumulative dataset to date for Creole language MT, including 14.5M unique Creole sentences with parallel translations—11.6M of which we release publicly, and the largest bitexts gathered to date for 41 languages—the first ever for 21. In addition, we provide MT models supporting all 41 Creole languages in 172 translation directions. Given our diverse dataset, we produce a model for Creole language MT exposed to more genre diversity then ever before, which outperforms a genre-specific Creole MT model on its own benchmark for 23 of 34 translation directions.

pdf bib
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
Jiashu Xu | Mingyu Ma | Fei Wang | Chaowei Xiao | Muhao Chen

We investigate security concerns of the emergent instruction tuning paradigm, that models are trained on crowdsourced datasets with task instructions to achieve superior performance. Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions (~1000 tokens) and control model behavior through data poisoning, without even the need to modify data instances or labels themselves. Through such instruction attacks, the attacker can achieve over 90% attack success rate across four commonly used NLP datasets. As an empirical study on instruction attacks, we systematically evaluated unique perspectives of instruction attacks, such as poison transfer where poisoned models can transfer to 15 diverse generative datasets in a zero-shot manner; instruction transfer where attackers can directly apply poisoned instruction on many other datasets; and poison resistance to continual finetuning. Lastly, we show that RLHF and clean demonstrations might mitigate such backdoors to some degree. These findings highlight the need for more robust defenses against poisoning attacks in instruction-tuning models and underscore the importance of ensuring data quality in instruction crowdsourcing.

pdf bib
Modeling Empathetic Alignment in Conversation
Jiamin Yang | David Jurgens

Empathy requires perspective-taking: empathetic responses require a person to reason about what another has experienced and communicate that understanding in language. However, most NLP approaches to empathy do not explicitly model this alignment process. Here, we introduce a new approach to recognizing alignment in empathetic speech, grounded in Appraisal Theory. We introduce a new dataset of over 9.2K span-level annotations of different types of appraisals of a person’s experience and over 3K empathetic alignments between a speaker’s and observer’s speech. Through computational experiments, we show that these appraisals and alignments can be accurately recognized. In experiments in over 9.2M Reddit conversations, we find that appraisals capture meaningful groupings of behavior but that most responses have minimal alignment. However, we find that mental health professionals engage with substantially more empathetic alignment.

pdf bib
Native Language Identification in Texts: A Survey
Dhiman Goswami | Sharanya Thilagan | Kai North | Shervin Malmasi | Marcos Zampieri

We present the first comprehensive survey of Native Language Identification (NLI) applied to texts. NLI is the task of automatically identifying an author’s native language (L1) based on their second language (L2) production. NLI is an important task with practical applications in second language teaching and NLP. The task has been widely studied for both text and speech, particularly for L2 English due to the availability of suitable corpora. Speech-based NLI relies heavily on accent modeled by pronunciation patterns and prosodic cues while text-based NLI relies primarily on modeling spelling errors and grammatical patterns that reveal properties of an individuals’ L1 influencing L2 production. We survey over one hundred papers on the topic including the papers associated with the NLI and INLI shared tasks. We describe several text representations and computational techniques used in text-based NLI. Finally, we present a comprehensive account of publicly available datasets used for the task thus far.

pdf bib
LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models
Yifan Yang | Jiajun Zhou | Ngai Wong | Zheng Zhang

Various parameter-efficient fine-tuning (PEFT) techniques have been proposed to enable computationally efficient fine-tuning while maintaining model performance. However, existing PEFT methods are still limited by the growing number of trainable parameters with the rapid deployment of Large Language Models (LLMs). To address this challenge, we present LoRETTA, an ultra-parameter-efficient framework that significantly reduces trainable parameters through tensor-train decomposition. Specifically, we propose two methods, named LoRETTA_adp and LoRETTA_rep. The former employs tensorized adapters, offering a high-performance yet lightweight approach for the fine-tuning of LLMs. The latter emphasizes fine-tuning via weight reparameterization with a set of small tensor factors. LoRETTA achieves comparable or better performance than most widely used PEFT methods with up to 100× fewer parameters on the LLaMA-2-7B models. Furthermore, empirical results demonstrate that the proposed methods exhibit remarkable anti-overfitting capability, effectively improve training efficiency, and enjoy better multi-task learning performance. Plug-and-play loretta library built upon the Huggingface framework and PEFT library are provided.

pdf bib
Which One? Leveraging Context Between Objects and Multiple Views for Language Grounding
Chancharik Mitra | Abrar Anwar | Rodolfo Corona | Dan Klein | Trevor Darrell | Jesse Thomason

When connecting objects and their language referents in an embodied 3D environment, it is important to note that: (1) an object can be better characterized by leveraging comparative information between itself and other objects, and (2) an object’s appearance can vary with camera position. As such, we present the Multi-view Approach to Grounding in Context (MAGiC) model, which selects an object referent based on language that distinguishes between two similar objects. By pragmatically reasoning over both objects and across multiple views of those objects, MAGiC improves over the state-of-the-art model on the SNARE object reference task with a relative error reduction of 12.9% (representing an absolute improvement of 2.7%). Ablation studies show that reasoning jointly over object referent candidates and multiple views of each object both contribute to improved accuracy. Code:

pdf bib
Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks
Ting-Yun Chang | Jesse Thomason | Robin Jia

The concept of localization in LLMs is often mentioned in prior work; however, methods for localization have never been systematically and directly evaluated. We propose two complementary benchmarks that evaluate the ability of localization methods to pinpoint LLM components responsible for memorized data. In our INJ benchmark, we actively inject a piece of new information into a small subset of LLM weights, enabling us to directly evaluate whether localization methods can identify these “ground truth” weights. In our DEL benchmark, we evaluate localization by measuring how much dropping out identified neurons deletes a memorized pretrained sequence. Despite their different perspectives, our two benchmarks yield consistent rankings of five localization methods. Methods adapted from network pruning perform well on both benchmarks, and all evaluated methods show promising localization ability. On the other hand, even successful methods identify neurons that are not specific to a single memorized sequence.

pdf bib
PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning
Tianrong Zhang | Zhaohan Xi | Ting Wang | Prasenjit Mitra | Jinghui Chen

Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented.In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings.Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the trigger and counteract it respectively. The use of soft tokens and adversarial optimization eliminates the need to enumerate possible backdoor configurations and enables an adaptive balance between trigger finding and preservation of performance.Experiments with various backdoor attacks validate the effectiveness of the proposed method and the performances when domain shift is present further shows PromptFix’s applicability to models pretrained on unknown data source which is the common case in prompt tuning scenarios.

pdf bib
Comparing Explanation Faithfulness between Multilingual and Monolingual Fine-tuned Language Models
Zhixue Zhao | Nikolaos Aletras

In many real natural language processing application scenarios, practitioners not only aim to maximize predictive performance but also seek faithful explanations for the model predictions. Rationales and importance distribution given by feature attribution methods (FAs) provide insights into how different parts of the input contribute to a prediction. Previous studies have explored how different factors affect faithfulness, mainly in the context of monolingual English models. On the other hand, the differences in FA faithfulness between multilingual and monolingual models have yet to be explored. Our extensive experiments, covering five languages and five popular FAs, show that FA faithfulness varies between multilingual and monolingual models. We find that the larger the multilingual model, the less faithful the FAs are compared to its counterpart monolingual models. Our further analysis shows that the faithfulness disparity is potentially driven by the differences between model tokenizers. Our code is available:

pdf bib
A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Shayne Longpre | Gregory Yauney | Emily Reif | Katherine Lee | Adam Roberts | Barret Zoph | Denny Zhou | Jason Wei | Kevin Robinson | David Mimno | Daphne Ippolito

Pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. We pretrain models on data curated (1) at different collection times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we find that temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we measure the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Third, we empirically validate that heterogeneous data sources, like books and web, are beneficial and warrant greater prioritization. To date, these experiments constitute the single largest publicly documented empirical study of the effects of pretraining data. Spanning 28 unique 1.5 billion parameter models pretrained from scratch, these findings validate, quantify, and expose many undocumented intuitions about text pretraining, which ultimately support more informed data-centric decisions in model development.

pdf bib
Instructional Fingerprinting of Large Language Models
Jiashu Xu | Fei Wang | Mingyu Ma | Pang Wei Koh | Chaowei Xiao | Muhao Chen

The exorbitant cost of training Large language models (LLMs) from scratch makes it essential to fingerprint the models to protect intellectual property via ownership authentication and to ensure downstream users and developers comply with their license terms (eg restricting commercial use). In this study, we present a pilot study on LLM fingerprinting as a form of very lightweight instruction tuning. Model publisher specifies a confidential private key and implants it as an instruction backdoor that causes the LLM to generate specific text when the key is present. Results on 11 popularly-used LLMs showed that this approach is lightweight and does not affect the normal behavior of the model. It also prevents publisher overclaim, maintains robustness against fingerprint guessing and parameter-efficient training, and supports multi-stage fingerprinting akin to MIT License.

pdf bib
Reinforced Multiple Instance Selection for Speaker Attribute Prediction
Alireza Salkhordeh Ziabari | Ali Omrani | Parsa Hejabi | Preni Golazizian | Brendan Kennedy | Payam Piray | Morteza Dehghani

Language usage is related to speaker age, gender, moral concerns, political ideology, and other attributes. Current state-of-the-art methods for predicting these attributes take a speaker’s utterances as input and provide a prediction per speaker attribute. Most of these approaches struggle to handle a large number of utterances per speaker. This difficulty is primarily due to the computational constraints of the models. Additionally, only a subset of speaker utterances may be relevant to specific attributes. In this paper, we formulate speaker attribute prediction as a Multiple Instance Learning (MIL) problem and propose RL-MIL, a novel approach based on Reinforcement Learning (RL) that effectively addresses both of these challenges. Our experiments demonstrate that our RL-based methodology consistently outperforms previous approaches across a range of related tasks: predicting speakers’ psychographics and demographics from social media posts, and political ideologies from transcribed speeches. We create synthetic datasets and investigate the behavior of RL-MIL systematically. Our results show the success of RL-MIL in improving speaker attribute prediction by learning to select relevant speaker utterances.

pdf bib
DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling
Shikhar Tuli | Chi-Heng Lin | Yen-Chang Hsu | Niraj Jha | Yilin Shen | Hongxia Jin

Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models *dynamically* predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweighttechnique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves same-quality generated text as the baseline (Pythia-6.9B) while achieving 2.57× speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.

pdf bib
Few-shot Knowledge Graph Relational Reasoning via Subgraph Adaptation
Haochen Liu | Song Wang | Chen Chen | Jundong Li

Few-shot Knowledge Graph (KG) Relational Reasoning aims to predict unseen triplets (i.e., query triplets) for rare relations in KGs, given only several triplets of these relations as references (i.e., support triplets). This task has gained significant traction due to the widespread use of knowledge graphs in various natural language processing applications. Previous approaches have utilized meta-training methods and manually constructed meta-relation sets to tackle this task. Recent efforts have focused on edge-mask-based methods, which exploit the structure of the contextualized graphs of target triplets (i.e., a subgraph containing relevant triplets in the KG). However, existing edge-mask-based methods have limitations in extracting insufficient information from KG and are highly influenced by spurious information in KG. To overcome these challenges, we propose SAFER (Subgraph Adaptation for Few-shot Relational Reasoning), a novel approach that effectively adapts the information in contextualized graphs to various subgraphs generated from support and query triplets to perform the prediction. Specifically, SAFER enables the extraction of more comprehensive information from support triplets while minimizing the impact of spurious information when predicting query triplets. Experimental results on three prevalent datasets demonstrate the superiority of our proposed framework SAFER.

pdf bib
Uncertainty Quantification for In-Context Learning of Large Language Models
Chen Ling | Xujiang Zhao | Xuchao Zhang | Wei Cheng | Yanchi Liu | Yiyou Sun | Mika Oishi | Takao Osaki | Katsushi Matsuda | Jie Ji | Guangji Bai | Liang Zhao | Haifeng Chen

In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) and revolutionized various fields by providing a few task-relevant demonstrations in the prompt. However, trustworthy issues with LLM’s response, such as hallucination, have also been actively discussed. Existing works have been devoted to quantifying the uncertainty in LLM’s response, but they often overlook the complex nature of LLMs and the uniqueness of in-context learning. In this work, we delve into the predictive uncertainty of LLMs associated with in-context learning, highlighting that such uncertainties may stem from both the provided demonstrations (aleatoric uncertainty) and ambiguities tied to the model’s configurations (epistemic uncertainty). We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties. The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion. Extensive experiments are conducted to demonstrate the effectiveness of the decomposition. The code and data are available at:

pdf bib
HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM
Zhilin Wang | Yi Dong | Jiaqi Zeng | Virginia Adams | Makesh Narsimhan Sreedhar | Daniel Egert | Olivier Delalleau | Jane Scowcroft | Neel Kant | Aidan Swope | Oleksii Kuchaiev

Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses only due to their length). To alleviate this problem, we collect HelpSteer, a multi-attribute helpfulness dataset annotated for the various aspects that make responses helpful. Specifically, our 37k-sample dataset has annotations for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. Training Llama 2 70B using the HelpSteer dataset with SteerLM technique produces a model that scores 7.54 on MT Bench, which is currently the highest score for open models that do not require training data from more powerful models (e.g. GPT-4). We release this dataset with CC-BY-4.0 license at

pdf bib
A Preference-driven Paradigm for Enhanced Translation with Large Language Models
Dawei Zhu | Sony Trenous | Xiaoyu Shen | Dietrich Klakow | Bill Byrne | Eva Hasler

Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in “breaking the plateau” across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.

pdf bib
Fair Abstractive Summarization of Diverse Perspectives
Yusen Zhang | Nan Zhang | Yixin Liu | Alexander Fabbri | Junru Liu | Ryo Kamoi | Xiaoxin Lu | Caiming Xiong | Jieyu Zhao | Dragomir Radev | Kathleen McKeown | Rui Zhang

People from different social and demographic groups express diverse perspectives and conflicting opinions on a broad set of topics such as product reviews, healthcare, law, and politics. A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. However, current work in summarization metrics and Large Language Models (LLMs) evaluation has not explored fair abstractive summarization. In this paper, we systematically investigate fair abstractive summarization for user-generated data. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people, and we propose four reference-free automatic metrics by measuring the differences between target and source perspectives. We evaluate nine LLMs, including three GPT models, four LLaMA models, PaLM 2, and Claude, on six datasets collected from social media, online reviews, and recorded transcripts. Experiments show that both the model-generated and the human-written reference summaries suffer from low fairness. We conduct a comprehensive analysis of the common factors influencing fairness and propose three simple but effective methods to alleviate unfair summarization. Our dataset and code are available at

pdf bib
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
Anthony Tiong | Junqi Zhao | Boyang Li | Junnan Li | Steven Hoi | Caiming Xiong

Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection.Finally, we present a new dataset, OLIVE1, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods. 1

pdf bib
Show Your Work with Confidence: Confidence Bands for Tuning Curves
Nicholas Lourie | Kyunghyun Cho | He He

The choice of hyperparameters greatly impacts performance in natural language processing. Often, it is hard to tell if a method is better than another or just better tuned. *Tuning curves* fix this ambiguity by accounting for tuning effort. Specifically, they plot validation performance as a function of the number of hyperparameter choices tried so far. While several estimators exist for these curves, it is common to use point estimates, which we show fail silently and give contradictory results when given too little data.Beyond point estimates, *confidence bands* are necessary to rigorously establish the relationship between different approaches. We present the first method to construct valid confidence bands for tuning curves. The bands are exact, simultaneous, and distribution-free, thus they provide a robust basis for comparing methods.Empirical analysis shows that while bootstrap confidence bands, which serve as a baseline, fail to approximate their target confidence, ours achieve it exactly. We validate our design with ablations, analyze the effect of sample size, and provide guidance on comparing models with our method. To promote confident comparisons in future work, we release opda: an easy-to-use library that you can install with pip.

pdf bib
GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives
Vinodkumar Prabhakaran | Christopher Homan | Lora Aroyo | Aida Mostafazadeh Davani | Alicia Parrish | Alex Taylor | Mark Diaz | Ding Wang | Gregory Serapio-García

Human annotation plays a core role in machine learning — annotations for supervised models, safety guardrails for generative models, and human feedback for reinforcement learning, to cite a few avenues. However, the fact that many of these human annotations are inherently subjective is often overlooked. Recent work has demonstrated that ignoring rater subjectivity (typically resulting in rater disagreement) is problematic within specific tasks and for specific subgroups. Generalizable methods to harness rater disagreement and thus understand the socio-cultural leanings of subjective tasks remain elusive. In this paper, we propose GRASP, a comprehensive disagreement analysis framework to measure group association in perspectives among different rater subgroups, and demonstrate its utility in assessing the extent of systematic disagreements in two datasets: (1) safety annotations of human-chatbot conversations, and (2) offensiveness annotations of social media posts, both annotated by diverse rater pools across different socio-demographic axes. Our framework (based on disagreement metrics) reveals specific rater groups that have significantly different perspectives than others on certain tasks, and helps identify demographic axes that are crucial to consider in specific task contexts.

pdf bib
Event Causality Is Key to Computational Story Understanding
Yidan Sun | Qin Chao | Boyang Li

Cognitive science and symbolic AI research suggest that event causality provides vital information for story understanding. However, machine learning systems for story understanding rarely employ event causality, partially due to the lack of methods that reliably identify open-world causal event relations. Leveraging recent progress in large language models, we present the first method for event causality identification that leads to material improvements in computational story understanding. Our technique sets a new state of the art on the COPES dataset (Wang et al., 2023c) for causal event relation identification. Further, in the downstream story quality evaluation task, the identified causal relations lead to 3.6-16.6% relative improvement on correlation with human ratings. In the multimodal story video-text alignment task, we attain 4.1-10.9% increase on Clip Accuracy and 4.2-13.5% increase on Sentence IoU. The findings indicate substantial untapped potential for event causality in computational story understanding. The codebase is at

pdf bib
Subspace Representations for Soft Set Operations and Sentence Similarities
Yoichi Ishibashi | Sho Yokoi | Katsuhito Sudoh | Satoshi Nakamura

In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.

pdf bib
My Heart Skipped a Beat! Recognizing Expressions of Embodied Emotion in Natural Language
Yuan Zhuang | Tianyu Jiang | Ellen Riloff

Humans frequently experience emotions. When emotions arise, they affect not only our mental state but can also change our physical state. For example, we often open our eyes wide when we are surprised, or clap our hands when we feel excited. Physical manifestations of emotions are referred to as embodied emotion in the psychology literature. From an NLP perspective, recognizing descriptions of physical movements or physiological responses associated with emotions is a type of implicit emotion recognition. Our work introduces a new task of recognizing expressions of embodied emotion in natural language. We create a dataset of sentences that contains 7,300 body part mentions with human annotations for embodied emotion. We develop a classification model for this task and present two methods to acquire weakly labeled instances of embodied emotion by extracting emotional manner expressions and by prompting a language model. Our experiments show that the weakly labeled data can train an effective classification model without gold data, and can also improve performance when combined with gold data. Our dataset is publicly available at

pdf bib
Low-Cost Generation and Evaluation of Dictionary Example Sentences
Bill Cai | Ng Clarence | Daniel Liang | Shelvia Hotama

Dictionary example sentences play an important role in illustrating word definitions and usage, but manually creating quality sentences is challenging. Prior works have demonstrated that language models can be trained to generate example sentences. However, they relied on costly customized models and word sense datasets for generation and evaluation of their work. Rapid advancements in foundational models present the opportunity to create low-cost, zero-shot methods for the generation and evaluation of dictionary example sentences. We introduce a new automatic evaluation metric called OxfordEval that measures the win-rate of generated sentences against existing Oxford Dictionary sentences. OxfordEval shows high alignment with human judgments, enabling large-scale automated quality evaluation. We experiment with various LLMs and configurations to generate dictionary sentences across word classes. We complement this with a novel approach of using masked language models to identify and select sentences that best exemplify word meaning. The eventual model, FM-MLM, achieves over 85.1% win rate against Oxford baseline sentences according to OxfordEval, compared to 39.8% win rate for prior model-generated sentences.

pdf bib
Making Language Models Better Tool Learners with Execution Feedback
Shuofei Qiao | Honghao Gui | Chengfei Lv | Qianghuai Jia | Huajun Chen | Ningyu Zhang

Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools.

pdf bib
Complex Claim Verification with Evidence Retrieved in the Wild
Jifan Chen | Grace Kim | Aniruddh Sriram | Greg Durrett | Eunsol Choi

Retrieving evidence to support or refute claims is a core part of automatic fact-checking. Prior work makes simplifying assumptions in retrieval that depart from real-world use cases: either no access to evidence, access to evidence curated by a human fact-checker, or access to evidence published after a claim was made. In this work, we present the first realistic pipeline to check real-world claims by retrieving raw evidence from the web. We restrict our retriever to only search documents available prior to the claim’s making, modeling the realistic scenario of emerging claims. Our pipeline includes five components: claim decomposition, raw document retrieval, fine-grained evidence retrieval, claim-focused summarization, and veracity judgment. We conduct experiments on complex political claims in the ClaimDecomp dataset and show that the aggregated evidence produced by our pipeline improves veracity judgments. Human evaluation finds the evidence summary produced by our system is reliable (it does not hallucinate information) and relevant to answering key questions about a claim, suggesting that it can assist fact-checkers even when it does not reflect a complete evidence set.

pdf bib
Multimodal Multi-loss Fusion Network for Sentiment Analysis
Zehui Wu | Ziwei Gong | Jaywon Koo | Julia Hirschberg

This paper investigates the optimal selection and fusion of feature encoders across multiple modalities and combines these in one neural network to improve sentiment detection. We compare different fusion methods and examine the impact of multi-loss training within the multi-modality fusion network, identifying surprisingly important findings relating to subnet performance. We have also found that integrating context significantly enhances model performance. Our best model achieves state-of-the-art performance for three datasets (CMU-MOSI, CMU-MOSEI and CH-SIMS). These results suggest a roadmap toward an optimized feature selection and fusion approach for enhancing sentiment detection in neural networks.

pdf bib
Confronting LLMs with Traditional ML: Rethinking the Fairness of Large Language Models in Tabular Classifications
Yanchen Liu | Srishti Gautam | Jiaqi Ma | Himabindu Lakkaraju

Recent literature has suggested the potential of using large language models (LLMs) to make classifications for tabular tasks. However, LLMs have been shown to exhibit harmful social biases that reflect the stereotypes and inequalities present in society. To this end, as well as the widespread use of tabular data in many high-stake applications, it is important to explore the following questions: what sources of information do LLMs draw upon when making classifications for tabular tasks; whether and to what extent are LLM classifications for tabular data influenced by social biases and stereotypes; and what are the consequential implications for fairness?Through a series of experiments, we delve into these questions and show that LLMs tend to inherit social biases from their training data which significantly impact their fairness in tabular classification tasks. Furthermore, our investigations show that in the context of bias mitigation, though in-context learning and finetuning have a moderate effect, the fairness metric gap between different subgroups is still larger than that in traditional machine learning models, such as Random Forest and shallow Neural Networks. This observation emphasizes that the social biases are inherent within the LLMs themselves and inherited from their pretraining corpus, not only from the downstream task datasets. Besides, we demonstrate that label-flipping of in-context examples can significantly reduce biases, further highlighting the presence of inherent bias within LLMs.

pdf bib
Analyzing the Use of Metaphors in News Editorials for Political Framing
Meghdut Sengupta | Roxanne El Baff | Milad Alshomary | Henning Wachsmuth

Metaphorical language is a pivotal element inthe realm of political framing. Existing workfrom linguistics and the social sciences providescompelling evidence regarding the distinctivenessof conceptual framing for politicalideology perspectives. However, the nature andutilization of metaphors and the effect on audiencesof different political ideologies withinpolitical discourses are hardly explored. Toenable research in this direction, in this workwe create a dataset, originally based on newseditorials and labeled with their persuasive effectson liberals and conservatives and extend itwith annotations pertaining to metaphorical usageof language. To that end, first, we identifyall single metaphors and composite metaphors.Secondly, we provide annotations of the sourceand target domains for each metaphor. As aresult, our corpus consists of 300 news editorialsannotated with spans of texts containingmetaphors and the corresponding domains ofwhich these metaphors draw from. Our analysisshows that liberal readers are affected bymetaphors, whereas conservatives are resistantto them. Both ideologies are affected differentlybased on the metaphor source and targetcategory. For example, liberals are affected bymetaphors in the Darkness & Light (e.g., death)source domains, where as the source domain ofNature affects conservatives more significantly.

pdf bib
SharpSeq: Empowering Continual Event Detection through Sharpness-Aware Sequential-task Learning
Thanh-Thien Le | Viet Dao | Linh Nguyen | Thi-Nhung Nguyen | Linh Ngo | Thien Nguyen

Continual event detection is a cornerstone in uncovering valuable patterns in many dynamic practical applications, where novel events emerge daily. Existing state-of-the-art approaches with replay buffers still suffer from catastrophic forgetting, partially due to overly simplistic objective aggregation. This oversight disregards complex trade-offs and leads to sub-optimal gradient updates, resulting in performance deterioration across objectives. While there are successful, widely cited multi-objective optimization frameworks for multi-task learning, they lack mechanisms to address data imbalance and evaluate whether a Pareto-optimal solution can effectively mitigate catastrophic forgetting, rendering them unsuitable for direct application to continual learning. To address these challenges, we propose **SharpSeq**, a novel continual learning paradigm leveraging sharpness-aware minimization combined with a generative model to balance training data distribution. Through extensive experiments on multiple real-world datasets, we demonstrate the superior performance of SharpSeq in continual event detection, proving the importance of our approach in mitigating catastrophic forgetting in continual event detection.

pdf bib
Dissecting Paraphrases: The Impact of Prompt Syntax and supplementary Information on Knowledge Retrieval from Pretrained Language Models
Stephan Linzbach | Dimitar Dimitrov | Laura Kallmeyer | Kilian Evang | Hajira Jabeen | Stefan Dietze

Pre-trained Language Models (PLMs) are known to contain various kinds of knowledge.One method to infer relational knowledge is through the use of cloze-style prompts, where a model is tasked to predict missing subjects orobjects. Typically, designing these prompts is a tedious task because small differences in syntax or semantics can have a substantial impact on knowledge retrieval performance. Simultaneously, evaluating the impact of either prompt syntax or information is challenging due to their interdependence. We designed CONPARE-LAMA – a dedicated probe, consisting of 34 million distinct prompts that facilitate comparison across minimal paraphrases. These paraphrases follow a unified meta-template enabling the controlled variation of syntax and semantics across arbitrary relations.CONPARE-LAMA enables insights into the independent impact of either syntactical form or semantic information of paraphrases on the knowledge retrieval performance of PLMs. Extensive knowledge retrieval experiments using our probe reveal that prompts following clausal syntax have several desirable properties in comparison to appositive syntax: i) they are more useful when querying PLMs with a combination of supplementary information, ii) knowledge is more consistently recalled across different combinations of supplementary information, and iii) they decrease response uncertainty when retrieving known facts. In addition, range information can boost knowledge retrieval performance more than domain information, even though domain information is more reliably helpful across syntactic forms.

pdf bib
Know When To Stop: A Study of Semantic Drift in Text Generation
Ava Spataru | Eric Hambro | Elena Voita | Nicola Cancedda

In this work, we explicitly show that modern LLMs tend to generate correct facts first, then “drift away” and generate incorrect facts later: this was occasionally observed but never properly measured. We develop a semantic drift score that measures the degree of separation between correct and incorrect facts in generated texts and confirm our hypothesis when generating Wikipedia-style biographies. This correct-then-incorrect generation pattern suggests that factual accuracy can be improved by knowing when to stop generation. Therefore, we explore the trade-off between information quantity and factual accuracy for several early stopping methods and manage to improve factuality by a large margin. We further show that reranking with semantic similarity can further improve these results, both compared to the baseline and when combined with early stopping. Finally, we try calling external API to bring the model back to the right generation path, but do not get positive results. Overall, our methods generalize and can be applied to any long-form text generation to produce more reliable information, by balancing trade-offs between factual accuracy, information quantity and computational cost.

pdf bib
Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction
Kraig Tou | Zijun Sun

Many leading methods in Vision and language (V+L) pretraining utilize masked language modeling (MLM) as a standard pretraining component, with the expectation that reconstruction of masked text tokens would necessitate reference to corresponding image context via cross/self attention and thus promote representation fusion. However, we observe that the minimization of MLM loss in earlier training stages can depend disproportionately on local text signals, leading to poor training efficiency and inconsistency with the goal of representation fusion. The extent of this lack of cross modal interaction depends strongly which token(s) are masked. To address this issue, we propose a curriculum masking scheme as a replacement for random masking. Tokens are selected to be masked at a frequency proportional to the expected level of cross modal interaction necessary to reconstruct them. This is achieved using a parallel mask selection agent that measures the cross modal flow of information and treats it as a reward to be maximized. By additionally masking contiguous spans that include key objects and their relations, we also achieve better relational understanding, which has been shown to be lacking in many SOTA models. Our experiments on a wide range of V+L tasks show that we trail closely behind state-of-the-art methods despite pretraining on 300x to 1000x less data and we also achieve either top or runner-up performance on tasks from the ARO benchmark which tests compositional relationships. Finally, we demonstrate the potential of our method to scale to larger pretraining data.

pdf bib
Elote, Choclo and Mazorca: on the Varieties of Spanish
Cristina España-Bonet | Alberto Barrón-Cedeño

Spanish is one of the most widespread languages: the official language in 20 countries and the second most-spoken native language. Its contact with other languages across different regions and the rich regional and cultural diversity has produced varieties which divert from each other, particularly in terms of lexicon. Still, available corpora, and models trained upon them, generally treat Spanish as one monolithic language, which dampers prediction and generation power when dealing with different varieties. To alleviate the situation, we compile and curate datasets in the different varieties of Spanish around the world at an unprecedented scale and create the CEREAL corpus. With such a resource at hand, we perform a stylistic analysis to identify and characterise varietal differences. We implement a classifier specially designed to deal with long documents and identify Spanish varieties (and therefore expand CEREAL further). We produce varietal-specific embeddings, and analyse the cultural differences that they encode. We make data, code and models publicly available.

pdf bib
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks
Chonghua Wang | Haodong Duan | Songyang Zhang | Dahua Lin | Kai Chen

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs’ capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models’ long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs’ long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at

pdf bib
A Zero-Shot Monolingual Dual Stage Information Retrieval System for Spanish Biomedical Systematic Literature Reviews
Regina Ofori-Boateng | Magaly Aceves-Martins | Nirmalie Wiratunga | Carlos Moreno-Garcia

Systematic Reviews (SRs) are foundational in healthcare for synthesising evidence to inform clinical practices. Traditionally skewed towards English-language databases, SRs often exclude significant research in other languages, leading to potential biases. This study addresses this gap by focusing on Spanish, a language notably underrepresented in SRs. We present a foundational zero-shot dual information retrieval (IR) baseline system, integrating traditional retrieval methods with pre-trained language models and cross-attention re-rankers for enhanced accuracy in Spanish biomedical literature retrieval. Utilising the LILACS database, known for its comprehensive coverage of Latin American and Caribbean biomedical literature, we evaluate the approach with three real-life case studies in Spanish SRs. The findings demonstrate the system’s efficacy and underscore the importance of query formulation. This study contributes to the field of IR by promoting language inclusivity and supports the development of more comprehensive and globally representative healthcare guidelines.

pdf bib
LayoutPointer: A Spatial-Context Adaptive Pointer Network for Visual Information Extraction
Huang Siyuan | Yongping Xiong | Wu Guibin

Visual Information Extraction (VIE), as a crucial task of Document Intelligence, involves two primary sub-tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE). However, VIE faces two significant challenges. Firstly, most existing models inadequately utilize spatial information of entities, often failing to predict connections or incorrectly linking spatially distant entities. Secondly, the improper input order of tokens challenges in extracting complete entity pairs from documents with multi-line entities when text is extracted via PDF parser or OCR. To address these challenges, we propose LayoutPointer, a Spatial-Context Adaptive Pointer Network. LayoutPointer explicitly enhances spatial-context relationships by incorporating 2D relative position information and adaptive spatial constraints within self-attention. Furthermore, we recast the RE task as a specialized cycle detection problem, employing a unique tail-to-head pointer to restore the semantic order among multi-line entities. To better evaluate the effectiveness of our proposed method, we reconstruct a multi-line dataset named MLFUD, which more accurately reflects real-world scenarios. Fine-tuning experimental results on FUNSD, XFUND, and MLFUD datasets demonstrate that LayoutPointer significantly outperforms existing state-of-the-art methods in F1 scores for RE tasks (e.g., 5.71% improvement on XFUND using LayoutPointerBASE-X over LayoutLMv3).

pdf bib
Long-form evaluation of model editing
Domenic Rosati | Robie Gonzales | Jinkun Chen | Xuemin Yu | Yahya Kayani | Frank Rudzicz | Hassan Sajjad

Evaluations of model editing, a technique for changing the factual knowledge held by Large Language Models (LLMs), currently only use the ‘next few token’ completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME) a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings including internal consistency, lexical cohesion, and locality issues.

pdf bib
Analyzing the Role of Semantic Representations in the Era of Large Language Models
Zhijing Jin | Yuen Chen | Fernando Gonzalez Adauto | Jiarui Liu | Jiayi Zhang | Julian Michael | Bernhard Schölkopf | Mona Diab

Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCOT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code:

pdf bib
TRAQ: Trustworthy Retrieval Augmented Question Answering via Conformal Prediction
Shuo Li | Sangdon Park | Insup Lee | Osbert Bastani

When applied to open-domain question answering, large language models (LLMs) frequently generate incorrect responses based on made-up facts, which are called hallucinations. Retrieval augmented generation (RAG) is a promising strategy to avoid hallucinations, but it does not provide guarantees on its correctness. To address this challenge, we propose the Trustworthy Retrieval Augmented Question Answering, or *TRAQ*, which provides the first end-to-end statistical correctness guarantee for RAG. TRAQ uses conformal prediction, a statistical technique for constructing prediction sets that are guaranteed to contain the semantically correct response with high probability. Additionally, TRAQ leverages Bayesian optimization to minimize the size of the constructed sets. In an extensive experimental evaluation, we demonstrate that TRAQ provides the desired correctness guarantee while reducing prediction set size by 16.2% on average compared to an ablation. The implementation is available: [](

pdf bib
MapGuide: A Simple yet Effective Method to Reconstruct Continuous Language from Brain Activities
Xinpei Zhao | Jingyuan Sun | Shaonan Wang | Jing Ye | Xiaohan Zhang | Chengqing Zong

Decoding continuous language from brain activity is a formidable yet promising field of research. It is particularly significant for aiding people with speech disabilities to communicate through brain signals. This field addresses the complex task of mapping brain signals to text. The previous best attempt reverse-engineered this process in an indirect way: it began by learning to encode brain activity from text and then guided text generation by aligning with predicted brain responses. In contrast, we propose a simple yet effective method that guides text reconstruction by directly comparing them with the predicted text embeddings mapped from brain activities. Comprehensive experiments reveal that our method significantly outperforms the current state-of-the-art model, showing average improvements of 77% and 54% on BLEU and METEOR scores. We further validate the proposed modules through detailed ablation studies and case analyses and highlight a critical correlation: the more precisely we map brain activities to text embeddings, the better the text reconstruction results. Such insight can simplify the task of reconstructing language from brain activities for future work, emphasizing the importance of improving brain-to-text-embedding mapping techniques.

pdf bib
On-the-fly Definition Augmentation of LLMs for Biomedical NER
Monica Munnangi | Sergey Feldman | Byron Wallace | Silvio Amir | Tom Hope | Aakanksha Naik

Despite their general capabilities, LLMs still struggle on biomedicalNER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open source and closed LLMs.For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all (six) of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at

pdf bib
This Land is Your, My Land: Evaluating Geopolitical Bias in Language Models through Territorial Disputes
Bryan Li | Samar Haider | Chris Callison-Burch

Do the Spratly Islands belong to China, the Philippines, or Vietnam? A pretrained large language model (LLM) may answer differently if asked in the languages of each claimant country: Chinese, Tagalog, or Vietnamese. This contrasts with a multilingual human, who would likely answer consistently. In this paper, we show that LLMs recall certain geographical knowledge inconsistently when queried in different languages—a phenomenon we term geopolitical bias. As a targeted case study, we consider territorial disputes, an inherently controversial and multilingual task. We introduce BorderLines, a dataset of territorial disputes which covers 251 territories, each associated with a set of multiple-choice questions in the languages of each claimant country (49 languages in total). We also propose a suite of evaluation metrics to precisely quantify bias and consistency in responses across different languages. We then evaluate various multilingual LLMs on our dataset and metrics to probe their internal knowledge and use the proposed metrics to discover numerous inconsistencies in how these models respond in different languages. Finally, we explore several prompt modification strategies, aiming to either amplify or mitigate geopolitical bias, which highlights how brittle LLMs are and how they tailor their responses depending on cues from the interaction context. Our code and data are available at

pdf bib
Set-Aligning Framework for Auto-Regressive Event Temporal Graph Generation
Xingwei Tan | Yuxiang Zhou | Gabriele Pergola | Yulan He

Event temporal graphs have been shown as convenient and effective representations of complex temporal relations between events in text. Recent studies, which employ pre-trained language models to auto-regressively generate linearised graphs for constructing event temporal graphs, have shown promising results. However, these methods have often led to suboptimal graph generation as the linearised graphs exhibit set characteristics which are instead treated sequentially by language models. This discrepancy stems from the conventional text generation objectives, leading to erroneous penalisation of correct predictions caused by the misalignment of elements in target sequences. To address these challenges, we reframe the task as a conditional set generation problem, proposing a Set-aligning Framework tailored for the effective utilisation of Large Language Models (LLMs). The framework incorporates data augmentations and set-property regularisations designed to alleviate text generation loss penalties associated with the linearised graph edge sequences, thus encouraging the generation of more relation edges. Experimental results show that our framework surpasses existing baselines for event temporal graph generation. Furthermore, under zero-shot settings, the structural knowledge introduced through our framework notably improves model generalisation, particularly when the training examples available are limited.

pdf bib
LanguageFlow: Advancing Diffusion Language Generation with Probabilistic Flows
Shujian Zhang | Lemeng Wu | Chengyue Gong | Xingchao Liu

Recent works have demonstrated success in controlling sentence attributes (e.g., sentiment) and structure (e.g., syntactic structure) based on the diffusion language model. A key component that drives theimpressive performance for generating high-quality samples from noise is iteratively denoise for thousands of steps. While beneficial, the complexity of starting from the noise and the learning steps has limited its implementation to many NLP real-world applications. This paper proposes Language Rectified Flow (LF).Our method is based on the reformulation of the standard probabilistic flow models.Language rectified flow learns (neural) ordinary differentialequation models to transport between the source distribution and the target distribution, henceproviding a unified and effective solution to generative modeling and domain transfer.From the source distribution, our language rectified flow yields fast simulation and effectively decreases the inference time. Experiments on three challenging fine-grained control tasks and multiple high-quality text editing show that our method consistently outperforms its baselines. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.

pdf bib
Towards Improved Multi-Source Attribution for Long-Form Answer Generation
Nilay Patel | Shivashankar Subramanian | Siddhant Garg | Pratyay Banerjee | Amita Misra

Teaching large language models (LLMs) to generate text with attribution to evidence sources can reduce hallucinations, improve verifiability in question answering systems (QA), and increase reliability of retrieval augmented LLMs. Despite gaining increasing popularity for usage in QA systems and search engines, current LLMs struggle with attribution for long-form responses which require reasoning over multiple evidence sources. To address this, in this paper we aim to improve the attribution capability of LLMs for long-form answer generation to multiple sources, with multiple citations per sentence. However, data for training multi-source attributable QA systems is difficult and expensive to annotate, and therefore scarce. To overcome this challenge, we transform existing QA datasets for this task (MultiAttr), and empirically demonstrate, on a wide range of attribution benchmark datasets, that fine-tuning on MultiAttr provides significant improvements over training only on the target QA domain. Lastly, to fill a gap in existing benchmarks, we present a multi-source attribution dataset containing multi-paragraph answers, PolitiICite, based on PolitiFact articles that discuss events closely related to implementation statuses of election promises.

pdf bib
Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models
Aldo Carranza | Rezsa Farahani | Natalia Ponomareva | Alexey Kurakin | Matthew Jagielski | Milad Nasr

We address the challenge of ensuring differential privacy (DP) guarantees in training deep retrieval systems. Training these systems often involves the use of contrastive-style losses, which are typically non-per-example decomposable, making them difficult to directly DP-train with since common techniques require per-example gradients. To address this issue, we propose an approach that prioritizes ensuring query privacy prior to training a deep retrieval system. Our method employs DP language models (LMs) to generate private synthetic queries representative of the original data. These synthetic queries can be used in downstream retrieval system training without compromising privacy. Our approach demonstrates a significant enhancement in retrieval quality compared to direct DP-training, all while maintaining query-level privacy guarantees. This work highlights the potential of harnessing LMs to overcome limitations in standard DP-training methods.

pdf bib
Okay, Let’s Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation
Abhijnan Nath | Shadi Manafi Avari | Avyakta Chelle | Nikhil Krishnaswamy

In NLP, Event Coreference Resolution (ECR) is the task of connecting event clusters that refer to the same underlying real-life event, usually via neural systems. In this work, we investigate using abductive free-text rationales (FTRs) generated by modern autoregressive LLMs as distant supervision of smaller student models for cross-document coreference (CDCR) of events. We implement novel rationale-oriented event clustering and knowledge distillation methods for event coreference scoring that leverage enriched information from the FTRs for improved CDCR without additional annotation or expensive document clustering. Our model using coreference-specific knowledge distillation achieves SOTA B3 F1 on the ECB+ and GVC corpora and we establish a new baseline on the AIDA Phase 1 corpus. Our code can be found at

pdf bib
Can Knowledge Graphs Reduce Hallucinations in LLMs? : A Survey
Garima Agrawal | Tharindu Kumarage | Zeyad Alghamdi | Huan Liu

The contemporary LLMs are prone to producing hallucinations, stemming mainly from the knowledge gaps within the models. To address this critical limitation, researchers employ diverse strategies to augment the LLMs by incorporating external knowledge, aiming to reduce hallucinations and enhance reasoning accuracy. Among these strategies, leveraging knowledge graphs as a source of external information has demonstrated promising results. In this survey, we comprehensively review these knowledge-graph-based augmentation techniques in LLMs, focusing on their efficacy in mitigating hallucinations. We systematically categorize these methods into three overarching groups, offering methodological comparisons and performance evaluations. Lastly, this survey explores the current trends and challenges associated with these techniques and outlines potential avenues for future research in this emerging field.

pdf bib
Pedagogically Aligned Objectives Create Reliable Automatic Cloze Tests
Brian Ondov | Kush Attal | Dina Demner-Fushman

The cloze training objective of Masked Language Models makes them a natural choice for generating plausible distractors for human cloze questions. However, distractors must also be both distinct and incorrect, neither of which is directly addressed by existing neural methods. Evaluation of recent models has also relied largely on automated metrics, which cannot demonstrate the reliability or validity of human comprehension tests. In this work, we first formulate the pedagogically motivated objectives of plausibility, incorrectness, and distinctiveness in terms of conditional distributions from language models. Second, we present an unsupervised, interpretable method that uses these objectives to jointly optimize sets of distractors. Third, we test the reliability and validity of the resulting cloze tests compared to other methods with human participants. We find our method has stronger correlation with teacher-created comprehension tests than the state-of-the-art neural method and is more internally consistent. Our implementation is freely available and can quickly create a multiple choice cloze test from any given passage.

pdf bib
Take One Step at a Time to Know Incremental Utility of Demonstration: An Analysis on Reranking for Few-Shot In-Context Learning
Kazuma Hashimoto | Karthik Raman | Michael Bendersky

In-Context Learning (ICL) is an emergent capability of Large Language Models (LLMs). Only a few demonstrations enable LLMs to be used as blackbox for new tasks. Previous studies have shown that using LLMs’ outputs as labels is effective in training models to select demonstrations. Such a label is expected to estimate utility of a demonstration in ICL; however, it has not been well understood how different labeling strategies affect results on target tasks. This paper presents an analysis on different utility functions by focusing on LLMs’ output probability given ground-truth output, and task-specific reward given LLMs’ prediction. Unlike the previous work, we introduce a novel labeling method, incremental utility, which estimates how much incremental knowledge is brought into the LLMs by a demonstration. We conduct experiments with instruction-tuned LLMs on binary/multi-class classification, segmentation, and translation across Arabic, English, Finnish, Japanese, and Spanish. Our results show that (1) the probability is effective when the probability values are distributed across the whole value range (on the classification tasks), and (2) the downstream metric is more robust when nuanced reward values are provided with long outputs (on the segmentation and translation tasks). We then show that the proposed incremental utility further helps ICL by contrasting how the LLMs perform with and without the demonstrations.

pdf bib
LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
Chi Han | Qifan Wang | Hao Peng | Wenhan Xiong | Yu Chen | Heng Ji | Sinong Wang

Today’s large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encod- ing scientific articles, code repositories, or long dialogues. Through both theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis reveals that commonly used techniques like using a sliding-window attention pattern or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs’ capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7× decoding speed up and 7.5× memory saving over the original model. Our code will be publicly available upon publication.

pdf bib
CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
Albert Sun | Varun Nair | Elliot Schumacher | Anitha Kannan

A wave of new task-based virtual assistants has been fueled by increasingly powerful large language models (LLMs), such as GPT-4 (OpenAI, 2023). A major challenge in deploying LLM-based virtual conversational assistants in real world settings is ensuring they operate within what is admissible for the task. To overcome this challenge, the designers of these virtual assistants rely on an independent guardrail system that verifies the virtual assistant’s output aligns with the constraints required for the task. However, relying on commonly used, prompt-based guardrails can be difficult to engineer correctly and comprehensively. To address these challenges, we propose CONSCENDI. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set and provides chatbot designers greater control. To generate contrastive examples, we prompt the LLM to alter conversations with violations into acceptable conversations to enable fine-grained distinctions. We then use this data, generated by CONSCENDI, to train a smaller model. We find that CONSCENDI results in guardrail models that improve over baselines in multiple dialogue domains.

pdf bib
Advancing Beyond Identification: Multi-bit Watermark for Large Language Models
KiYoon Yoo | Wonhyuk Ahn | Nojun Kwak

We show the viability of tackling misuses of large language models beyond the identification of machine-generated text. While existing zero-bit watermark methods focus on detection only, some malicious misuses demand tracing the adversary user for counteracting them. To address this, we propose Multi-bit Watermark via Position Allocation, embedding traceable multi-bit information during language model generation. Through allocating tokens onto different parts of the messages, we embed longer messages in high corruption settings without added latency. By independently embedding sub-units of messages, the proposed method outperforms the existing works in terms of robustness and latency. Leveraging the benefits of zero-bit watermarking, our method enables robust extraction of the watermark without any model access, embedding and extraction of long messages ( 32-bit) without finetuning, and maintaining text quality, while allowing zero-bit detection all at the same time.

pdf bib
HTCCN: Temporal Causal Convolutional Networks with Hawkes Process for Extrapolation Reasoning in Temporal Knowledge Graphs
Tingxuan Chen | Jun Long | Liu Yang | Zidong Wang | Yongheng Wang | Xiongnan Jin

Temporal knowledge graphs (TKGs) serve as powerful tools for storing and modeling dynamic facts, holding immense potential in anticipating future facts. Since future facts are inherently unknowable, effectively modeling the intricate temporal structure of historical facts becomes paramount for accurate prediction. However, current models often rely heavily on fact recurrence or periodicity, leading to information loss due to prolonged evolutionary processes. Notably, the occurrence of one fact always influences the likelihood of another. To this end, we propose HTCCN, a novel Hawkes process-based temporal causal convolutional network designed for temporal reasoning under extrapolation settings. HTCCN employs a temporal causal convolutional network to model the historical interdependence of facts and leverages Hawkes to model link formation processes inductively in TKGs. Importantly, HTCCN introduces dual-level dynamics to comprehensively capture the temporal evolution of facts. Rigorous experimentation on four real-world datasets underscores the superior performance of HTCCN.

pdf bib
SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation
Abe Hou | Jingyu Zhang | Tianxing He | Yichen Wang | Yung-Sung Chuang | Hongwei Wang | Lingfeng Shen | Benjamin Van Durme | Daniel Khashabi | Yulia Tsvetkov

Existing watermarked generation algorithms employ token-level designs and therefore, are vulnerable to paraphrase attacks. To address this issue, we introduce watermarking on the semantic representation of sentences. We propose SemStamp, a robust sentence-level semantic watermarking algorithm that uses locality-sensitive hashing (LSH) to partition the semantic space of sentences. The algorithm encodes and LSH-hashes a candidate sentence generated by a language model, and conducts rejection sampling until the sampled sentence falls in watermarked partitions in the semantic embedding space. To test the paraphrastic robustness of watermarking algorithms, we propose a “bigram paraphrase” attack that produces paraphrases with small bigram overlap with the original sentence. This attack is shown to be effective against existing token-level watermark algorithms, while posing only minor degradations to SemStamp. Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on various paraphrasers and domains, but also better at preserving the quality of generation.

pdf bib
Media Bias Detection Across Families of Language Models
Iffat Maab | Edison Marrese-Taylor | Sebastian Padó | Yutaka Matsuo

Bias in reporting can influence the public’s opinion on relevant societal issues. Examples include informational bias (selective presentation of content) and lexical bias (specific framing of content through linguistic choices). The recognition of media bias is arguably an area where NLP can contribute to the “social good”. Traditional NLP models have shown good performance in classifying media bias, but require careful model design and extensive tuning. In this paper, we ask how well prompting of large language models can recognize media bias. Through an extensive empirical study including a wide selection of pre-trained models, we find that prompt-based techniques can deliver comparable performance to traditional models with greatly reduced effort and that, similar to traditional models, the availability of context substantially improves results. We further show that larger models can leverage different kinds of context simultaneously, obtaining further performance improvements.

pdf bib
Better Zero-Shot Reasoning with Role-Play Prompting
Aobo Kong | Shiwan Zhao | Hao Chen | Qicheng Li | Yong Qin | Ruiqi Sun | Xin Zhou | Enzhi Wang | Xiaohang Dong

Modern large language models (LLMs) exhibit a remarkable capacity for role-playing, enabling them to embody not only human characters but also non-human entities. This versatility allows them to simulate complex human-like interactions and behaviors within various contexts, as well as to emulate specific objects or systems. While these capabilities have enhanced user engagement and introduced novel modes of interaction, the influence of role-playing on LLMs’ reasoning abilities remains underexplored. In this study, we introduce a strategically designed role-play prompting methodology and assess its performance under the zero-shot setting across twelve diverse reasoning benchmarks. Our empirical results illustrate that role-play prompting consistently surpasses the standard zero-shot approach across most datasets. Notably, in experiments conducted using ChatGPT, accuracy on AQuA rises from 53.5% to 63.8%, and on Last Letter from 23.8% to 84.2%. Upon further comparison with the Zero-Shot-CoT technique, which prompts the model to “think step by step”, our study demonstrates that role-play prompting acts as a more effective trigger for the CoT process.This highlights its potential to augment the reasoning capabilities of LLMs. We release our code at

pdf bib
Event-Content-Oriented Dialogue Generation in Short Video
Fenghua Cheng | Xue Li | Zi Huang | Jinxiang Wang | Sen Wang

Understanding complex events from different modalities, associating to external knowledge and generating response in a clear point of view are still unexplored in today’s multi-modal dialogue research. The great challenges include 1) lack of event-based multi-modal dialogue dataset; 2) understanding of complex events and 3) heterogeneity gap between different modalities. To overcome these challenges, we firstly introduce a novel event-oriented video-dialogue dataset called SportsVD (Sports-domain Video-dialogue Dataset). To our best knowledge, SportsVD is the first dataset that consists of complex events videos and opinion-based conversations with regards to contents in these events. Meanwhile, we present multi-modal dialogue generation method VCD (Video Commentary Dialogue) to generate human-like response according to event contents in the video and related external knowledge. In contrast to previous video-based dialogue generation, we focus on opinion-based response and the understanding of longer and more complex event contents. We evaluate VCD’s performance on SportsVD and other baselines under several automatic metrics. Experiments demonstrate VCD can outperform among other state-of-the-art baselines. Our work is available at

pdf bib
DoG-Instruct: Towards Premium Instruction-Tuning Data via Text-Grounded Instruction Wrapping
Yongrui Chen | Haiyun Jiang | Xinting Huang | Shuming Shi | Guilin Qi

The improvement of LLMs’ instruction-following capabilities relies heavily on the availability of high-quality instruction-response pairs. Unfortunately, the current methods used to collect the pairs suffer from either unaffordable labor costs or severe hallucinations in the self-generation of LLM.To tackle these challenges, this paper proposes a scalable solution.It involves training LLMs to generate instruction-response pairs based on human-written documents, rather than relying solely on self-generation without context.Our proposed method not only exploits the advantages of human-written documents in reducing hallucinations but also utilizes an LLM to wrap the expression of documents, which enables us to bridge the gap between various document styles and the standard AI response.Experiments demonstrate that our method outperforms existing typical methods on multiple benchmarks.In particular, compared to the best-performing baseline, the LLM trained using our generated dataset exhibits a 10% relative improvement in performance on AlpacaEval, despite utilizing only 1/5 of its training data.Furthermore, a comprehensive manual evaluation validates the quality of the data we generated.

pdf bib
Beyond Borders: Investigating Cross-Jurisdiction Transfer in Legal Case Summarization
Santosh T.y.s.s | Vatsal Venkatkrishna | Saptarshi Ghosh | Matthias Grabmair

Legal professionals face the challenge of managing an overwhelming volume of lengthy judgments, making automated legal case summarization crucial. However, prior approaches mainly focused on training and evaluating these models within the same jurisdiction. In this study, we explore the cross-jurisdictional generalizability of legal case summarization models. Specifically, we explore how to effectively summarize legal cases of a target jurisdiction where reference summaries are not available. In particular, we investigate whether supplementing models with unlabeled target jurisdiction corpus and extractive silver summaries obtained from unsupervised algorithms on target data enhances transfer performance. Our comprehensive study on three datasets from different jurisdictions highlights the role of pre-training in improving transfer performance. We shed light on the pivotal influence of jurisdictional similarity in selecting optimal source datasets for effective transfer. Furthermore, our findings underscore that incorporating unlabeled target data yields improvements in general pre-trained models, with additional gains when silver summaries are introduced. This augmentation is especially valuable when dealing with extractive datasets and scenarios featuring limited alignment between source and target jurisdictions. Our study provides key insights for developing adaptable legal case summarization systems, transcending jurisdictional boundaries.

pdf bib
EDC: Effective and Efficient Dialog Comprehension For Dialog State Tracking
Qifan Lu | Bhaskar Ramasubramanian | Radha Poovendran

In Task-Oriented Dialog (TOD) systems, Dialog State Tracking (DST) structurally extracts information from user and system utterances, which can be further used for querying databases and forming responses to users. The two major categories of DST methods, sequential and independent methods, face trade-offs between accuracy and efficiency. To resolve this issue, we propose Effective and Efficient Dialog Comprehension (EDC), an alternative DST approach that leverages the tree structure of the dialog state. EDC predicts domains, slot names and slot values of the dialog state step-by-step for better accuracy, and efficiently encodes dialog contexts with causal attention patterns. We evaluate EDC on several popular TOD datasets and EDC is able to achieve state-of-the-art Joint Goal Accuracy (JGA). We also show theoretically and empirically that EDC is more efficient than model designs used by previous works.

pdf bib
Automatic Restoration of Diacritics for Speech Data Sets
Sara Shatnawi | Sawsan Alqahtani | Hanan Aldarmaki

Automatic text-based diacritic restoration models generally have high diacritic error rates when applied to speech transcripts as a result of domain and style shifts in spoken language. In this work, we explore the possibility of improving the performance of automatic diacritic restoration when applied to speech data by utilizing parallel spoken utterances. In particular, we use the pre-trained Whisper ASR model fine-tuned on relatively small amounts of diacritized Arabic speech data to produce rough diacritized transcripts for the speech utterances, which we then use as an additional input for diacritic restoration models. The proposed framework consistently improves diacritic restoration performance compared to text-only baselines. Our results highlight the inadequacy of current text-based diacritic restoration models for speech data sets and provide a new baseline for speech-based diacritic restoration.

pdf bib
XNLIeu: a dataset for cross-lingual NLI in Basque
Maite Heredia | Julen Etxaniz | Muitze Zulaika | Xabier Saralegi | Jeremy Barnes | Aitor Soroa

XNLI is a popular Natural Language Inference (NLI) benchmark widely used to evaluate cross-lingual Natural Language Understanding (NLU) capabilities across languages. In this paper, we expand XNLI to include Basque, a low-resource language that can greatly benefit from transfer-learning approaches. The new dataset, dubbed XNLIeu, has been developed by first machine-translating the English XNLI corpus into Basque, followed by a manual post-edition step. We have conducted a series of experiments using mono- and multilingual LLMs to assess a) the effect of professional post-edition on the MT system; b) the best cross-lingual strategy for NLI in Basque; and c) whether the choice of the best cross-lingual strategy is influenced by the fact that the dataset is built by translation. The results show that post-edition is necessary and that the translate-train cross-lingual strategy obtains better results overall, although the gain is lower when tested in a dataset that has been built natively from scratch. Our code and datasets are publicly available under open licenses.

pdf bib
MDR: Model-Specific Demonstration Retrieval at Inference Time for In-Context Learning
Huazheng Wang | Jinming Wu | Haifeng Sun | Zixuan Xia | Daixuan Cheng | Jingyu Wang | Qi Qi | Jianxin Liao

Recently, retrieval-based in-context learning (ICL) methods for selecting demonstrations have been widely investigated. Existing methods train a dense retriever to retrieve the most appropriate demonstrations for a given test query, which improves ICL performance. However, we find that distinct LLMs exhibit different biases for “what is a good demonstration” since they possess differences in training data, model architectures and training methods. As a result, a demonstration suitable for one LLM may not be appropriate for others.Previous approaches ignore the model bias and fail to retrieve the most appropriate demonstrations for different inference LLMs, resulting in a degradation of ICL performance.To address this problem, we propose a simple yet effective metric to evaluate the appropriateness of demonstrations for a specific inference LLM. Furthermore, we introduce a Model-specific Demonstration Retrieval (MDR) method for ICL at inference time, which considers the biases of different LLMs. We test MDR on seen and unseen tasks with multi-scale inference LLMs, such as GPT-Neo-2.7B, LLaMA-7B and Vicuna-13B. Experiments on 23 datasets across 11 data domains highlight the remarkable effectiveness of MDR, showcasing improvements of up to 41.2% in comparison to methods that neglect model biases.

pdf bib
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis
Nayeon Lee | Chani Jung | Junho Myung | Jiho Jin | Jose Camacho-Collados | Juho Kim | Alice Oh

Most hate speech datasets neglect the cultural diversity within a single language, resulting in a critical shortcoming in hate speech detection. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. To construct CREHate, we follow a two-step procedure: 1) cultural post collection and 2) cross-cultural annotation. We sample posts from the SBIC dataset, which predominantly represents North America, and collect posts from four geographically diverse English-speaking countries (Australia, United Kingdom, Singapore, and South Africa) using culturally hateful keywords we retrieve from our survey. Annotations are collected from the four countries plus the United States to establish representative labels for each country. Our analysis highlights statistically significant disparities across countries in hate speech annotations. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%. Qualitative analysis shows that label disagreement occurs mostly due to different interpretations of sarcasm and the personal bias of annotators on divisive topics. Lastly, we evaluate large language models (LLMs) under a zero-shot setting and show that current LLMs tend to show higher accuracies on Anglosphere country labels in CREHate.Our dataset and codes are available at:

pdf bib
Enhancing Contextual Understanding in Large Language Models through Contrastive Decoding
Zheng Zhao | Emilio Monti | Jens Lehmann | Haytham Assem

Large language models (LLMs) tend to inadequately integrate input context during text generation, relying excessively on encoded prior knowledge in model parameters, potentially resulting in generated text with factual inconsistencies or contextually unfaithful content. LLMs utilize two primary knowledge sources: 1) prior (parametric) knowledge from pretraining, and 2) contextual (non-parametric) knowledge from input prompts. The study addresses the open question of how LLMs effectively balance these knowledge sources during the generation process, specifically in the context of open-domain question answering. To address this issue, we introduce a novel approach integrating contrastive decoding with adversarial irrelevant passages as negative samples to enhance robust context grounding during generation. Notably, our method operates at inference time without requiring further training. We conduct comprehensive experiments to demonstrate its applicability and effectiveness, providing empirical evidence showcasing its superiority over existing methodologies.

pdf bib
Generalizable Sarcasm Detection is Just Around the Corner, of Course!
Hyewon Jang | Diego Frassinelli

We tested the robustness of sarcasm detection models by examining their behavior when fine-tuned on four sarcasm datasets containing varying characteristics of sarcasm: label source (authors vs. third-party), domain (social media/online vs. offline conversations/dialogues), style (aggressive vs. humorous mocking). We tested their prediction performance on the same dataset (intra-dataset) and across different datasets (cross-dataset). For intra-dataset predictions, models consistently performed better when fine-tuned with third-party labels rather than with author labels. For cross-dataset predictions, most models failed to generalize well to the other datasets, implying that one type of dataset cannot represent all sorts of sarcasm with different styles and domains. Compared to the existing datasets, models fine-tuned on the new dataset we release in this work showed the highest generalizability to other datasets. With a manual inspection of the datasets and post-hoc analysis, we attributed the difficulty in generalization to the fact that sarcasm actually comes in different domains and styles. We argue that future sarcasm research should take the broad scope of sarcasm into account.

pdf bib
Encoding of lexical tone in self-supervised models of spoken language
Gaofei Shen | Michaela Watkins | Afra Alishahi | Arianna Bisazza | Grzegorz Chrupała

Interpretability research has shown that self-supervised Spoken LanguageModels (SLMs) encode a wide variety of features in human speech from theacoustic, phonetic, phonological, syntactic and semantic levels, to speakercharacteristics. The bulk of prior research on representations of phonologyhas focused on segmental features such as phonemes; the encoding ofsuprasegmental phonology (such as tone and stress patterns) in SLMs is not yetwell understood. Tone is a suprasegmental feature that is present in more thanhalf of the world’s languages. This paper aims to analyze the tone encodingcapabilities of SLMs, using Mandarin and Vietnamese as case studies. We showthat SLMs encode lexical tone to a significant degree even when they aretrained on data from non-tonal languages. We further find that SLMs behavesimilarly to native and non-native human participants in tone and consonantperception studies, but they do not follow the same developmental trajectory.

pdf bib
A Systematic Comparison of Contextualized Word Embeddings for Lexical Semantic Change
Francesco Periti | Nina Tahmasebi

Contextualized embeddings are the preferred tool for modeling Lexical Semantic Change (LSC). Current evaluations typically focus on a specific task known as Graded Change Detection (GCD). However, performance comparison across work are often misleading due to their reliance on diverse settings. In this paper, we evaluate state-of-the-art models and approaches for GCD under equal conditions. We further break the LSC problem into Word-in-Context (WiC) and Word Sense Induction (WSI) tasks, and compare models across these different levels. Our evaluation is performed across different languages on eight available benchmarks for LSC, and shows that (i) APD outperforms other approaches for GCD; (ii) XL-LEXEME outperforms other contextualized models for WiC, WSI, and GCD, while being comparable to GPT-4; (iii) there is a clear need for improving the modeling of word meanings, as well as focus on *how*, *when*, and *why* these meanings change, rather than solely focusing on the extent of semantic change.

pdf bib
iACOS: Advancing Implicit Sentiment Extraction with Informative and Adaptive Negative Examples
Xiancai Xu | Jia-Dong Zhang | Lei Xiong | Zhishang Liu

Aspect-based sentiment analysis (ABSA) have been extensively studied, but little light has been shed on the quadruple extraction consisting of four fundamental elements: aspects, categories, opinions and sentiments, especially with implicit aspects and opinions. In this paper, we propose a new method iACOS for extracting Implicit Aspects with Categories and Opinions with Sentiments. First, iACOS appends two implicit tokens at the end of a text to capture the context-aware representation of all tokens including implicit aspects and opinions. Second, iACOS develops a sequence labeling model over the context-aware token representation to co-extract explicit and implicit aspects and opinions. Third, iACOS devises a multi-label classifier with a specialized multi-head attention for discovering aspect-opinion pairs and predicting their categories and sentiments simultaneously. Fourth, iACOS leverages informative and adaptive negative examples to jointly train the multi-label classifier and the other two classifiers on categories and sentiments by multi-task learning. Finally, the experimental results show that iACOS significantly outperforms other quadruple extraction baselines according to the F1 score on two public benchmark datasets.

pdf bib
Rectifying Demonstration Shortcut in In-Context Learning
Joonwon Jang | Sanghwan Jang | Wonbin Kweon | Minjin Jeon | Hwanjo Yu

Large language models (LLMs) are able to solve various tasks with only a few demonstrations utilizing their in-context learning (ICL) abilities.However, LLMs often rely on their pre-trained semantic priors of demonstrations rather than on the input-label relationships to proceed with ICL prediction. In this work, we term this phenomenon as the ‘Demonstration Shortcut’.While previous works have primarily focused on improving ICL prediction results for predefined tasks, we aim to rectify the Demonstration Shortcut, thereby enabling the LLM to effectively learn new input-label relationships from demonstrations.To achieve this, we introduce In-Context Calibration, a demonstration-aware calibration method.We evaluate the effectiveness of the proposed method in two settings: (1) the Original ICL Task using the standard label space and (2) the Task Learning setting, where the label space is replaced with semantically unrelated tokens.In both settings, In-Context Calibration demonstrates substantial improvements, with results generalized across three LLM families (OPT, GPT, and Llama2) under various configurations.

pdf bib
Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark
Stephen Mayhew | Terra Blevins | Shuheng Liu | Marek Suppa | Hila Gonen | Joseph Marvin Imperial | Börje Karlsson | Peiqin Lin | Nikola Ljubešić | Lester James Miranda | Barbara Plank | Arij Riabi | Yuval Pinter

We introduce Universal NER (UNER), an open, community-driven project to develop gold-standard NER benchmarks in many languages. The overarching goal of UNER is to provide high-quality, cross-lingually consistent annotations to facilitate and standardize multilingual NER research. UNER v1 contains 19 datasets annotated with named entities in a cross-lingual consistent schema across 13 diverse languages. In this paper, we detail the dataset creation and composition of UNER; we also provide initial modeling baselines on both in-language and cross-lingual learning settings. We will release the data, code, and fitted models to the public.

pdf bib
ODD: A Benchmark Dataset for the Natural Language Processing Based Opioid Related Aberrant Behavior Detection
Sunjae Kwon | Xun Wang | Weisong Liu | Emily Druhl | Minhee Sung | Joel Reisman | Wenjun Li | Robert Kerns | William Becker | Hong Yu

Opioid related aberrant behaviors (ORABs) present novel risk factors for opioid overdose. This paper introduces a novel biomedical natural language processing benchmark dataset named ODD, for ORAB Detection Dataset. ODD is an expert-annotated dataset designed to identify ORABs from patients’ EHR notes and classify them into nine categories; 1) Confirmed Aberrant Behavior, 2) Suggested Aberrant Behavior, 3) Opioids, 4) Indication, 5) Diagnosed opioid dependency, 6) Benzodiazepines, 7) Medication Changes, 8) Central Nervous System-related, and 9) Social Determinants of Health. We explored two state-of-the-art natural language processing models (fine-tuning and prompt-tuning approaches) to identify ORAB. Experimental results show that the prompt-tuning models outperformed the fine-tuning models in most categories and the gains were especially higher among uncommon categories (Suggested Aberrant Behavior, Confirmed Aberrant Behaviors, Diagnosed Opioid Dependence, and Medication Change). Although the best model achieved the highest 88.17% on macro average area under precision recall curve, uncommon classes still have a large room for performance improvement. ODD is publicly available.

pdf bib
A Comprehensive Study of Gender Bias in Chemical Named Entity Recognition Models
Xingmeng Zhao | Ali Niazi | Anthony Rios

Chemical named entity recognition (NER) models are used in many downstream tasks, from adverse drug reaction identification to pharmacoepidemiology. However, it is unknown whether these models work the same for everyone. Performance disparities can potentially cause harm rather than the intended good. This paper assesses gender-related performance disparities in chemical NER systems. We develop a framework for measuring gender bias in chemical NER models using synthetic data and a newly annotated corpus of over 92,405 words with self-identified gender information from Reddit. Our evaluation of multiple biomedical NER models reveals evident biases. For instance, synthetic data suggests that female names are frequently misclassified as chemicals, especially when it comes to brand name mentions. Additionally, we observe performance disparities between female- and male-associated data in both datasets. Many systems fail to detect contraceptives such as birth control. Our findings emphasize the biases in chemical NER models, urging practitioners to account for these biases in downstream applications.

pdf bib
The Promises and Pitfalls of Using Language Models to Measure Instruction Quality in Education
Paiheng Xu | Jing Liu | Nathan Jones | Julie Cohen | Wei Ai

Assessing instruction quality is a fundamental component of any improvement efforts in the education system. However, traditional manual assessments are expensive, subjective, and heavily dependent on observers’ expertise and idiosyncratic factors, preventing teachers from getting timely and frequent feedback. Different from prior research that mostly focuses on low-inference instructional practices on a singular basis, this paper presents the first study that leverages Natural Language Processing (NLP) techniques to assess multiple high-inference instructional practices in two distinct educational settings: in-person K-12 classrooms and simulated performance tasks for pre-service teachers. This is also the first study that applies NLP to measure a teaching practice that is widely acknowledged to be particularly effective for students with special needs. We confront two challenges inherent in NLP-based instructional analysis, including noisy and long input data and highly skewed distributions of human ratings. Our results suggest that pretrained Language Models (PLMs) demonstrate performances comparable to the agreement level of human raters for variables that are more discrete and require lower inference, but their efficacy diminishes with more complex teaching practices. Interestingly, using only teachers’ utterances as input yields strong results for student-centered variables, alleviating common concerns over the difficulty of collecting and transcribing high-quality student speech data in in-person teaching settings. Our findings highlight both the potential and the limitations of current NLP techniques in the education domain, opening avenues for further exploration.

pdf bib
Differentially Private Next-Token Prediction of Large Language Models
James Flemings | Meisam Razaviyayn | Murali Annavaram

Ensuring the privacy of Large Language Models (LLMs) is becoming increasingly important. The most widely adopted technique to accomplish this is DP-SGD, which trains a model to guarantee Differential Privacy (DP). However, DP-SGD overestimates an adversary’s capabilities in having white box access to the model and, as a result, causes longer training times and larger memory usage than SGD. On the other hand, commercial LLM deployments are predominantly cloud-based; hence, adversarial access to LLMs is black-box. Motivated by these observations, we present Private Mixing of Ensemble Distributions (PMixED): a private prediction protocol for next-token prediction that utilizes the inherent stochasticity of next-token sampling and a public model to achieve Differential Privacy. We formalize this by introducing RD-mollifers which project each of the model’s output distribution from an ensemble of fine-tuned LLMs onto a set around a public LLM’s output distribution, then average the projected distributions and sample from it. Unlike DP-SGD which needs to consider the model architecture during training, PMixED is model agnostic, which makes PMixED a very appealing solution for current deployments. Our results show that PMixED achieves a stronger privacy guarantee than sample-level privacy and outperforms DP-SGD for privacy 𝜖 = 8 on large-scale datasets. Thus, PMixED offers a practical alternative to DP training methods for achieving strong generative utility without compromising privacy.

pdf bib
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset
Janis Goldzycher | Paul Röttger | Gerold Schneider

Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators, to create more diverse adversarial examples more efficiently and provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at

pdf bib
Memory Augmented Language Models through Mixture of Word Experts
Cicero Nogueira dos Santos | James Lee-Thorp | Isaac Noble | Chung-Ching Chang | David Uthus

Scaling up the number of parameters of language models has proven to be an effective approach to improve performance. For dense models, increasing their size proportionally increases their computational footprint. In this work, we seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large knowledge-rich vocabulary based routing functions. Our proposed approach, dubbed Mixture of Word Experts (MoWE), can be seen as a memory augmented model, where a large set of word-specific experts play the role of a sparse memory. We demonstrate that MoWE performs significantly better than the T5 family of models with similar number of FLOPs in a variety of NLP tasks. Moreover, MoWE outperforms traditional MoE models on knowledge intensive tasks and has similar performance to complex memory augmented approaches that often require to invoke custom mechanisms to search the sparse memory.

pdf bib
Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Model
Jaehun Jung | Peter West | Liwei Jiang | Faeze Brahman | Ximing Lu | Jillian Fisher | Taylor Sorensen | Yejin Choi

We present Impossible Distillation, a novel framework for paraphrasing and sentence summarization, that distills a high-quality dataset and model from a low-quality teacher that itself cannot perform these tasks. Unlike prior works that rely on an extreme-scale teacher model (e.g., GPT3) or task-specific architecture, we hypothesize and verify the paraphrastic proximity intrinsic to pre-trained LMs (e.g., GPT2), where paraphrases occupy a proximal subspace in the LM distribution. By identifying and distilling generations from these subspaces, Impossible Distillation produces a high-quality dataset and model even from GPT2-scale LMs. We evaluate our method on multiple benchmarks spanning unconstrained / syntax-controlled paraphrase generation and sentence summarization. Our model with 770M parameters consistently outperforms strong baselines, including models distilled from ChatGPT, and sometimes, even ChatGPT itself. Also, we find that our distilled dataset from 1.5B LMs exhibits higher diversity and fidelity than up to 13 times larger datasets.

pdf bib
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Liyan Tang | Igor Shalyminov | Amy Wong | Jon Burnsky | Jake Vincent | Yu’an Yang | Siffi Singh | Song Feng | Hwanjun Song | Hang Su | Lijia Sun | Yi Zhang | Saab Mansour | Kathleen McKeown

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence- level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model’s size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.

pdf bib
MOKA: Moral Knowledge Augmentation for Moral Event Extraction
Xinliang Frederick Zhang | Winston Wu | Nicholas Beauchamp | Lu Wang

News media often strive to minimize explicit moral language in news articles, yet most articles are dense with moral values as expressed through the reported events themselves. However, values that are reflected in the intricate dynamics among *participating entities* and *moral events* are far more challenging for most NLP systems to detect, including LLMs. To study this phenomenon, we annotate a new dataset, **MORAL EVENTS**, consisting of 5,494 structured event annotations on 474 news articles by diverse US media across the political spectrum. We further propose **MOKA**, a moral event extraction framework with **MO**ral **K**nowledge **A**ugmentation, which leverages knowledge derived from moral words and moral scenarios to produce structural representations of morality-bearing events. Experiments show that **MOKA** outperforms competitive baselines across three moral event understanding tasks. Further analysis shows even ostensibly nonpartisan media engage in the selective reporting of moral events.

pdf bib
Fixing Rogue Memorization in Many-to-One Multilingual Translators of Extremely-Low-Resource Languages by Rephrasing Training Samples
Paulo Cavalin | Pedro Henrique Domingues | Claudio Pinhanez | Julio Nogima

In this paper we study the fine-tuning of pre-trained large high-resource language models (LLMs) into many-to-one multilingual machine translators for extremely-low-resource languages such as endangered Indigenous languages. We explore those issues using datasets created from pseudo-parallel translations to English of The Bible written in 39 Brazilian Indigenous languages using mBART50 and WMT19 as pre-trained models and multiple translation metrics. We examine bilingual and multilingual models and show that, according to machine translation metrics, same-linguistic family models tend to perform best. However, we also found that many-to-one multilingual systems have a tendency to learn a “rogue” strategy of storing output strings from the training data in the LLM structure and retrieving them instead of performing actual translations. We show that rephrasing the output of the training samples seems to solve the problem.

pdf bib
Backdoor Attacks on Multilingual Machine Translation
Jun Wang | Qiongkai Xu | Xuanli He | Benjamin Rubinstein | Trevor Cohn

While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages.Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.

pdf bib
Personalized Jargon Identification for Enhanced Interdisciplinary Communication
Yue Guo | Joseph Chee Chang | Maria Antoniak | Erin Bransom | Trevor Cohen | Lucy Wang | Tal August

Scientific jargon can confuse researchers when they read materials from other domains. Identifying and translating jargon for individual researchers could speed up research, but current methods of jargon identification mainly use corpus-level familiarity indicators rather than modeling researcher-specific needs, which can vary greatly based on each researcher’s background. We collect a dataset of over 10K term familiarity annotations from 11 computer science researchers for terms drawn from 100 paper abstracts. Analysis of this data reveals that jargon familiarity and information needs vary widely across annotators, even within the same sub-domain (e.g., NLP). We investigate features representing domain, subdomain, and individual knowledge to predict individual jargon familiarity. We compare supervised and prompt-based approaches, finding that prompt-based methods using information about the individual researcher (e.g., personal publications, self-defined subfield of research) yield the highest accuracy, though the task remains difficult and supervised approaches have lower false positive rates. This research offers insights into features and methods for the novel task of integrating personal data into scientific jargon identification.

pdf bib
Flames: Benchmarking Value Alignment of LLMs in Chinese
Kexin Huang | Xiangyang Liu | Qianyu Guo | Tianxiang Sun | Jiawei Sun | Yaru Wang | Zeyang Zhou | Yixu Wang | Yan Teng | Xipeng Qiu | Yingchun Wang | Dahua Lin

The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Despite numerous models achieving high scores and ‘topping the chart’ in these evaluations, there is still a significant gap in LLMs’ deeper alignment with human values and achieving genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. By prompting 17 mainstream LLMs, we obtain model responses and rigorously annotate them for detailed evaluation. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames, particularly in the safety and fairness dimensions. We also develop a lightweight specified scorer capable of scoring LLMs across multiple dimensions to efficiently evaluate new models on the benchmark. The complexity of Flames has far exceeded existing benchmarks, setting a new challenge for contemporary LLMs and highlighting the need for further alignment of LLMs. Our benchmark is publicly available at

pdf bib
Mitigating Bias for Question Answering Models by Tracking Bias Influence
Mingyu Ma | Jiun-Yu Kao | Arpit Gupta | Yu-Hsiang Lin | Wenbo Zhao | Tagyoung Chung | Wei Wang | Kai-Wei Chang | Nanyun Peng

Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.

pdf bib
Extending CLIP’s Image-Text Alignment to Referring Image Segmentation
Seoyeon Kim | Minguk Kang | Dongwon Kim | Jaesik Park | Suha Kwak

Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP’s inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP’s image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP’s image-text alignment to RIS.

pdf bib
Generating Attractive and Authentic Copywriting from Customer Reviews
Yu-Xiang Lin | Wei-Yun Ma

The goal of product copywriting is to capture the interest of potential buyers by emphasizing the features of products through text descriptions. As e-commerce platforms offer a wide range of services, it’s becoming essential to dynamically adjust the styles of these auto-generated descriptions. Typical approaches to copywriting generation often rely solely on specified product attributes, which may result in dull and repetitive content. To tackle this issue, we propose to generate copywriting based on customer reviews, as they provide firsthand practical experiences with products, offering a richer source of information than just product attributes. We have developed a sequence-to-sequence framework, enhanced with reinforcement learning, to produce copywriting that is attractive, authentic, and rich in information. Our framework outperforms all existing baseline and zero-shot large language models, including LLaMA-2-chat-7B and GPT-3.5, in terms of both attractiveness and faithfulness. Furthermore, this work features the use of LLMs for aspect-based summaries collection and argument allure assessment. Experiments demonstrate the effectiveness of using LLMs for marketing domain corpus construction. The code and the dataset is publicly available at:

pdf bib
Effective Long-Context Scaling of Foundation Models
Wenhan Xiong | Jingyu Liu | Igor Molybog | Hejia Zhang | Prajjwal Bhargava | Rui Hou | Louis Martin | Rashi Rungta | Karthik Abinav Sankararaman | Barlas Oguz | Madian Khabsa | Han Fang | Yashar Mehdad | Sharan Narang | Kshitiz Malik | Angela Fan | Shruti Bhosale | Sergey Edunov | Mike Lewis | Sinong Wang | Hao Ma

We present an effective recipe to train strong long-context LLMs that are capable of utilizing massive context windows of up to 32,000 tokens. Our models are built through continual pretraining from Llama 2 checkpoints with longer text sequences and on a dataset where long texts are upsampled. We perform extensive evaluation using language modeling, synthetic context probing tasks, and a wide range of downstream benchmarks. Across all evaluations, our models achieve consistent improvements on most regular-context tasks and significant improvements on long-context tasks over Llama 2. Moreover, with a cost-effective instruction tuning procedure that is free of expensive annotation, the presented models can already surpass gpt-3.5-turbo-16k‘s overall performance on long-context benchmarks. Alongside these results, we provide an in-depth analysis on each individual component of our method. We delve into Llama’s position encodings and discuss its key limitation in modeling long data. We examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths – ablation results suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

pdf bib
Empowering Diffusion Models on the Embedding Space for Text Generation
Zhujin Gao | Junliang Guo | Xu Tan | Yongxin Zhu | Fang Zhang | Jiang Bian | Linli Xu

Diffusion models have achieved state-of-the-art synthesis quality on both visual and audio tasks, and recent works further adapt them to textual data by diffusing on the embedding space. In this paper, we conduct systematic studies of the optimization challenges encountered with both the embedding space and the denoising model, which have not been carefully explored. Firstly, the data distribution is learnable for embeddings, which may lead to the collapse of the embedding space and unstable training. To alleviate this problem, we propose a new objective called the anchor loss which is more efficient than previous methods. Secondly, we find the noise levels of conventional schedules are insufficient for training a desirable denoising model while introducing varying degrees of degeneration in consequence. To address this challenge, we propose a novel framework called noise rescaling. Based on the above analysis, we propose Difformer, an embedding diffusion model based on Transformer. Experiments on varieties of seminal text generation tasks show the effectiveness of the proposed methods and the superiority of Difformer over previous state-of-the-art embedding diffusion baselines.

pdf bib
Aligning as Debiasing: Causality-Aware Alignment via Reinforcement Learning with Interventional Feedback
Yu Xia | Tong Yu | Zhankui He | Handong Zhao | Julian McAuley | Shuai Li

Large language models (LLMs) often generate biased outputs containing offensive, toxic, or stereotypical text. Existing LLM alignment methods such as reinforcement learning from human feedback (RLHF) alleviate biases primarily based on reward signals from current model outputs without considering the source of biases. In this work, to explore how biases are formed, we revisit LLMs’ text generation from a causal perspective. We identify pretraining data and input prompts, which contain semantic correlations of textual phrases, as two confounders between LLMs and model outputs causing biases. Inspired by our causal view, we leverage the reward model in RL alignment as an instrumental variable to perform causal intervention on LLMs. Utilizing the reward difference between an initial LLM and intervened LLM as interventional feedback to guide RL finetuning, we propose Causality-Aware Alignment (CAA) for LLM debiasing. Experiments on two text generation tasks with three different alignment objectives demonstrate the advantages of our method in aligning LLMs to generate less biased and safer outputs.

pdf bib
Fake Alignment: Are LLMs Really Aligned Well?
Yixu Wang | Yan Teng | Kexin Huang | Chengqi Lyu | Songyang Zhang | Wenwei Zhang | Xingjun Ma | Yu-Gang Jiang | Yu Qiao | Yingchun Wang

The growing awareness of safety concerns in large language models (LLMs) has sparked considerable interest in the evaluation of safety. This study investigates an under-explored issue about the evaluation of LLMs, namely the substantial discrepancy in performance between multiple-choice questions and open-ended questions. Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization. That is, LLM only remembers the answer style for open-ended safety questions, which makes it unable to solve other forms of safety tests. We refer to this phenomenon as fake alignment and construct a comparative benchmark to empirically verify its existence in LLMs. We introduce a Fake alIgNment Evaluation (FINE) framework and two novel metrics——Consistency Score (CS) and Consistent Safety Score (CSS), which jointly assess two complementary forms of evaluation to quantify fake alignment and obtain corrected performance estimation. Applying FINE to 14 widely-used LLMs reveals several models with purported safety are poorly aligned in practice. Subsequently, we found that multiple-choice format data can also be used as high-quality contrast distillation-based fine-tuning data, which can strongly improve the alignment consistency of LLMs with minimal fine-tuning overhead. For data and code, see

pdf bib
Visually Guided Generative Text-Layout Pre-training for Document Intelligence
Zhiming Mao | Haoli Bai | Lu Hou | Lifeng Shang | Xin Jiang | Qun Liu | Kam-Fai Wong

Prior study shows that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to gain abilities to perceive and reason both document texts and layouts (e.g., locations of texts and table-cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of processing long documents by Transformers, we introduce a straightforward yet effective multi-segment generative pre-training scheme, facilitating ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts of document images. Besides, ViTLP can be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance over existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.

pdf bib
HILL: Hierarchy-aware Information Lossless Contrastive Learning for Hierarchical Text Classification
He Zhu | Junran Wu | Ruomei Liu | Yue Hou | Ze Yuan | Shangzhe Li | Yicheng Pan | Ke Xu

Existing self-supervised methods in natural language processing (NLP), especially hierarchical text classification (HTC), mainly focus on self-supervised contrastive learning, extremely relying on human-designed augmentation rules to generate contrastive samples, which can potentially corrupt or distort the original information. In this paper, we tend to investigate the feasibility of a contrastive learning scheme in which the semantic and syntactic information inherent in the input sample is adequately reserved in the contrastive samples and fused during the learning process. Specifically, we propose an information lossless contrastive learning strategy for HTC, namely Hierarchy-aware Information Lossless contrastive Learning (HILL), which consists of a text encoder representing the input document, and a structure encoder directly generating the positive sample. The structure encoder takes the document embedding as input, extracts the essential syntactic information inherent in the label hierarchy with the principle of structural entropy minimization, and injects the syntactic information into the text representation via hierarchical representation learning. Experiments on three common datasets are conducted to verify the superiority of HILL.

pdf bib
Investigating the Emergent Audio Classification Ability of ASR Foundation Models
Rao Ma | Adian Liusie | Mark Gales | Kate Knill

Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.

pdf bib
In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax
Aaron Mueller | Albert Webson | Jackson Petty | Tal Linzen

In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks: given labeled examples in the input context, the LLM learns to perform the task without weight updates. Do models guided via ICL infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that only generalize to identically distributed examples? We address this question using transformations tasks and an NLI task that assess sensitivity to syntax—a requirement for robust language understanding. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs. The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size; in particular, models pre-trained on code generalize better, and benefit more from chain-of-thought prompting.

pdf bib
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Yongqi Wang | Ruofan Hu | Rongjie Huang | Zhiqing Hong | Ruiqi Li | Wenrui Liu | Fuming You | Tao Jin | Zhou Zhao

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at .

pdf bib
Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech
Dena Mujtaba | Nihar Mahapatra | Megan Arney | J Yaruss | Hope Gerlach-Houck | Caryn Herring | Jia Bin

Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80 million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR’s handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting in significant syntactical and semantic inaccuracies in transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology’s usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape.

pdf bib
MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification
Chadi Helwe | Tom Calamai | Pierre-Henri Paris | Chloé Clavel | Fabian Suchanek

We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.

pdf bib
Diffusion Glancing Transformer for Parallel Sequence-to-Sequence Learning
Lihua Qian | Mingxuan Wang | Yang Liu | Hao Zhou

Previously, non-autoregressive models were widely recognized as being superior in generation efficiency but inferior in generation quality due to the challenges of modeling multiple target modalities.To enhance the multi-modality modeling ability, we propose the diffusion glancing transformer, which employs a modality diffusion process and residual glancing sampling.The modality diffusion process is a discrete process that interpolates the multi-modal distribution along the decoding steps, and the residual glancing sampling approach guides the model to continuously learn the remaining modalities across the layers. Experimental results on various machine translation and text generation benchmarks demonstrate that DIFFGLAT achieves better generation accuracy while maintaining fast decoding speed compared with both autoregressive and non-autoregressive models.

pdf bib
No Context Needed: Contextual Quandary In Idiomatic Reasoning With Pre-Trained Language Models
Kellen Cheng | Suma Bhat

Reasoning in the presence of idiomatic expressions (IEs) remains a challenging frontier in natural language understanding (NLU). Unlike standard text, the non-compositional nature of an IE makes it difficult for model comprehension, as their figurative or non-literal mean- ing usually cannot be inferred from the constituent words alone. It stands to reason that in these challenging circumstances, pre-trained language models (PTLMs) should make use of the surrounding context to infer additional in- formation about the IE. In this paper, we investigate the utilization of said context for idiomatic reasoning tasks, which is under-explored relative to arithmetic or commonsense reason- ing (Liu et al., 2022; Yu et al., 2023). Preliminary findings point to a surprising observation: general purpose PTLMs are actually negatively affected by the context, as performance almost always increases with its removal. In these scenarios, models may see gains of up to 3.89%. As a result, we argue that only IE-aware models remain suitable for idiomatic reasoning tasks, given the unexpected and unexplainable manner in which general purpose PTLMs reason over IEs. Additionally, we conduct studies to examine how models utilize the context in various situations, as well as an in-depth analysis on dataset formation and quality. Finally, we provide some explanations and insights into the reasoning process itself based on our results.

pdf bib
Multi-stage Retrieve and Re-rank Model for Automatic Medical Coding Recommendation
Xindi Wang | Robert Mercer | Frank Rudzicz

The International Classification of Diseases (ICD) serves as a definitive medical classification system encompassing a wide range of diseases and conditions. The primary objective of ICD indexing is to allocate a subset of ICD codes to a medical record, which facilitates standardized documentation and management of various health conditions. Most existing approaches have suffered from selecting the proper label subsets from an extremely large ICD collection with a heavy long-tailed label distribution. In this paper, we leverage a multi-stage “retrieve and re-rank” framework as a novel solution to ICD indexing, via a hybrid discrete retrieval method, and re-rank retrieved candidates with contrastive learning that allows the model to make more accurate predictions from a simplified label space. The retrieval model is a hybrid of auxiliary knowledge of the electronic health records (EHR) and a discrete retrieval method (BM25), which efficiently collects high-quality candidates. In the last stage, we propose a label co-occurrence guided contrastive re-ranking model, which re-ranks the candidate labels by pulling together the clinical notes with positive ICD codes. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures on the MIMIC-III benchmark.

pdf bib
Anisotropy is Not Inherent to Transformers
Anemily Machina | Robert Mercer

Isotropy is the property that embeddings are uniformly distributed around the origin. Previous work has shown that Transformer embedding spaces are anisotropic, which is called the representation degradation problem. This degradation has been assumed to be inherent to the standard language modeling tasks and to apply to all Transformer models regardless of their architecture. In this work we identify a set of Transformer models with isotropic embedding spaces, the large Pythia models. We examine the isotropy of Pythia models and explore how isotropy and anisotropy develop as a model is trained. We find that anisotropic models do not develop as previously theorized, using our own analysis to show that the large Pythia models optimize their final Layer Norm for isotropy, and provide reasoning why previous theoretical justifications for anisotropy were insufficient. The identification of a set of isotropic Transformer models calls previous assumptions into question, provides a set of models to contrast existing analysis, and should lead to deeper insight into isotropy.

pdf bib
Finding Replicable Human Evaluations via Stable Ranking Probability
Parker Riley | Daniel Deutsch | George Foster | Viresh Ratnakar | Ali Dabirmoghaddam | Markus Freitag

Reliable human evaluation is critical to the development of successful natural language generation models, but achieving it is notoriously difficult. Stability is a crucial requirement when ranking systems by quality: consistent ranking of systems across repeated evaluations is not just desirable, but essential. Without it, there is no reliable foundation for hill-climbing or product launch decisions. In this paper, we use machine translation and its state-of-the-art human evaluation framework, MQM, as a case study to understand how to set up reliable human evaluations that yield stable conclusions. We investigate the optimal configurations for item allocation to raters, number of ratings per item, and score normalization. Our study on two language pairs provides concrete recommendations for designing replicable human evaluation studies. We also collect and release the largest publicly available dataset of multi-segment translations rated by multiple professional translators, consisting of nearly 140,000 segment annotations across two language pairs.

pdf bib
Stealthy and Persistent Unalignment on Large Language Models via Backdoor Injections
Yuanpu Cao | Bochuan Cao | Jinghui Chen

Recent developments in Large Language Models (LLMs) have manifested significant advancements. To facilitate safeguards against malicious exploitation, a body of research has concentrated on aligning LLMs with human preferences and inhibiting their generation of inappropriate content. Unfortunately, such alignments are often vulnerable: fine-tuning with a minimal amount of harmful data can easily unalign the target LLM. While being effective, such fine-tuning-based unalignment approaches also have their own limitations: (1) non-stealthiness, after fine-tuning, safety audits or red-teaming can easily expose the potential weaknesses of the unaligned models, thereby precluding their release/use. (2) non-persistence, the unaligned LLMs can be easily repaired through re-alignment, i.e., fine-tuning again with aligned data points. In this work, we show that it is possible to conduct stealthy and persistent unalignment on large language models via backdoor injections. We also provide a novel understanding of the relationship between the backdoor persistence and the activation pattern and further provide guidelines for potential trigger design. Through extensive experiments, we demonstrate that our proposed stealthy and persistent unalignment can successfully pass the safety evaluation while maintaining strong persistence against re-alignment defense.

pdf bib
Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts
Sai Ashish Somayajula | Youwei Liang | Li Zhang | Abhishek Singh | Pengtao Xie

Pretrained Language Models (PLMs) have advanced Natural Language Processing (NLP) tasks significantly, but finetuning PLMs on low-resource datasets poses significant challenges such as instability and overfitting. Previous methods tackle these issues by finetuning a strategically chosen subnetwork on a downstream task, while keeping the remaining weights fixed to the pretrained weights. However, they rely on a suboptimal criteria for sub-network selection, leading to suboptimal solutions. To address these limitations, we propose a regularization method based on attention-guided weight mixup for finetuning PLMs. Our approach represents each network weight as a mixup of task-specific weight and pretrained weight, controlled by a learnable attention parameter, providing finer control over sub-network selection. Furthermore, we employ a bi-level optimization (BLO) based framework on two separate splits of the training dataset, improving generalization and combating overfitting. We validate the efficacy of our proposed method through extensive experiments, demonstrating its superiority over previous methods, particularly in the context of finetuning PLMs on low-resource datasets. Our code is available at

pdf bib
Detecting Bipolar Disorder from Misdiagnosed Major Depressive Disorder with Mood-Aware Multi-Task Learning
Daeun Lee | Hyolim Jeon | Sejung Son | Chaewon Park | Ji hyun An | Seungbae Kim | Jinyoung Han

Bipolar Disorder (BD) is a mental disorder characterized by intense mood swings, from depression to manic states. Individuals with BD are at a higher risk of suicide, but BD is often misdiagnosed as Major Depressive Disorder (MDD) due to shared symptoms, resulting in delays in appropriate treatment and increased suicide risk. While early intervention based on social media data has been explored to uncover latent BD risk, little attention has been paid to detecting BD from those misdiagnosed as MDD. Therefore, this study presents a novel approach for identifying BD risk in individuals initially misdiagnosed with MDD. A unique dataset, BD-Risk, is introduced, incorporating mental disorder types and BD mood levels verified by two clinical experts. The proposed multi-task learning for predicting BD risk and BD mood level outperforms the state-of-the-art baselines. Also, the proposed dynamic mood-aware attention can provide insights into the impact of BD mood on future risk, potentially aiding interventions for at-risk individuals.

pdf bib
Leveraging Code to Improve In-Context Learning for Semantic Parsing
Ben Bogin | Shivanshu Gupta | Peter Clark | Ashish Sabharwal

In-context learning (ICL) is an appealing approach for semantic parsing due to its few-shot nature and improved generalization. However, learning to parse to rare domain-specific languages (DSLs) from just a few demonstrations is challenging, limiting the performance of even the most capable LLMs.In this work, we show how pre-existing coding abilities of LLMs can be leveraged for semantic parsing by (1) using general-purpose programming languages such as Python instead of DSLs and (2) augmenting prompts with a structured domain description that includes, e.g., the available classes and functions. We show that both these changes significantly improve accuracy across three popular datasets; combined, they lead to dramatic improvements (e.g., 7.9% to 66.5% on SMCalFlow compositional split) and can substantially improve compositional generalization, nearly closing the performance gap between easier i.i.d. and harder compositional splits. Finally, comparisons across multiple PLs and DSL variations suggest that the similarity of a target language to general-purpose code is more important than prevalence in pretraining corpora. Our findings provide an improved methodology for building semantic parsers in the modern context of ICL with LLMs.

pdf bib
Improving Pre-trained Language Model Sensitivity via Mask Specific losses: A case study on Biomedical NER
Micheal Abaho | Danushka Bollegala | Gary Leeming | Dan Joyce | Iain Buchan

Adapting language models (LMs) to novel domains is often achieved through fine-tuning a pre-trained LM (PLM) on domain-specific data. Fine-tuning introduces new knowledge into an LM, enabling it to comprehend and efficiently perform a target domain task. Fine-tuning can however be inadvertently insensitive if it ignores the wide array of disparities (e.g in word meaning) between source and target domains. For instance, words such as chronic and pressure may be treated lightly in social conversations, however, clinically, these words are usually an expression of concern. To address insensitive fine-tuning, we propose Mask Specific Language Modeling (MSLM), an approach that efficiently acquires target domain knowledge by appropriately weighting the importance of domain-specific terms (DS-terms) during fine-tuning. MSLM jointly masks DS-terms and generic words, then learns mask-specific losses by ensuring LMs incur larger penalties for inaccurately predicting DS-terms compared to generic words. Results of our analysis show that MSLM improves LMs sensitivity and detection of DS-terms. We empirically show that an optimal masking rate not only depends on the LM, but also on the dataset and the length of sequences. Our proposed masking strategy outperforms advanced masking strategies such as span- and PMI-based masking.

pdf bib
Language Models Implement Simple Word2Vec-style Vector Arithmetic
Jack Merullo | Carsten Eickhoff | Ellie Pavlick

A primary criticism towards language models (LMs) is their inscrutability. This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a simple vector arithmetic style mechanism to solve some relational tasks using regularities encoded in the hidden space of the model (e.g., Poland:Warsaw::China:Beijing). We investigate a range of language model sizes (from 124M parameters to 176B parameters) in an in-context learning setting, and find that for a variety of tasks (involving capital cities, uppercasing, and past-tensing) a key part of the mechanism reduces to a simple additive update typically applied by the feedforward (FFN) networks. We further show that this mechanism is specific to tasks that require retrieval from pretraining memory, rather than retrieval from local context. Our results contribute to a growing body of work on the interpretability of LMs, and offer reason to be optimistic that, despite the massive and non-linear nature of the models, the strategies they ultimately use to solve tasks can sometimes reduce to familiar and even intuitive algorithms.

pdf bib
AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning
Ruiyi Zhang | Rushi Qiang | Sai Ashish Somayajula | Pengtao Xie

Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pretrained weights, has proven particularly effective. Nonetheless, LoRA’s uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA. The code is publicly available at

pdf bib
SportQA: A Benchmark for Sports Understanding in Large Language Models
Haotian Xia | Zhengbang Yang | Yuqing Wang | Rhys Tracy | Yun Zhao | Dongdong Huang | Zezhi Chen | Yan Zhu | Yuan-fang Wang | Weining Shen

A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing Natural Language Processing (NLP). This holds particular significance in the context of evaluating and advancing Large Language Models (LLMs), given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs. The dataset is available at

pdf bib
Revisiting subword tokenization: A case study on affixal negation in large language models
Thinh Truong | Yulia Otmakhova | Karin Verspoor | Trevor Cohn | Timothy Baldwin

In this work, we measure the impact of affixal negation on modern English large language models (LLMs). In affixal negation, the negated meaning is expressed through a negative morpheme, which is potentially challenging for LLMs as their tokenizers are often not morphologically plausible. We conduct extensive experiments using LLMs with different subword tokenization methods, which lead to several insights on the interaction between tokenization performance and negation sensitivity. Despite some interesting mismatches between tokenization accuracy and negation detection performance, we show that models can, on the whole, reliably recognize the meaning of affixal negation.

pdf bib
Generating Mental Health Transcripts with SAPE (Spanish Adaptive Prompt Engineering)
Daniel Lozoya | Alejandro Berazaluce | Juan Perches | Eloy Lúa | Mike Conway | Simon D’Alfonso

Large language models have become valuable tools for data augmentation in scenarios with limited data availability, as they can generate synthetic data resembling real-world data. However, their generative performance depends on the quality of the prompt used to instruct the model. Prompt engineering that relies on hand-crafted strategies or requires domain experts to adjust the prompt often yields suboptimal results. In this paper we present SAPE, a Spanish Adaptive Prompt Engineering method utilizing genetic algorithms for prompt generation and selection. Our evaluation of SAPE focuses on a generative task that involves the creation of Spanish therapy transcripts, a type of data that is challenging to collect due to the fact that it typically includes protected health information. Through human evaluations conducted by mental health professionals, our results show that SAPE produces Spanish counselling transcripts that more closely resemble authentic therapy transcripts compared to other prompt engineering techniques that are based on Reflexion and Chain-of-Thought.

pdf bib
Where are you from? Geolocating Speech and Applications to Language Identification
Patrick Foley | Matthew Wiesner | Bismarck Odoom | Leibny Paola Garcia Perera | Kenton Murray | Philipp Koehn

We train models to answer the question, Where are you from? and show how such models can be repurposed for language identification (LID). To our knowledge, this paper is the first to introduce data sources, methods and models to tackle the task of geolocation of speech at a global scale, and the first to explore using geolocation as a proxy-task for LID. Specifically, we explore whether radio broadcasts with known origin can be used to train regression and classification-based models for geolocating speech. We build models on top of self-supervised pretrained models, using attention pooling to qualitatively verify that the model geolocates the speech itself, and not other channel artifacts.The best geolocation models localize speaker origin to around 650km. We confirm the value of speech geolocation as a proxy task by using speech geolocation models for zero-shot LID. Finally, we show that fine-tuning geolocation models for LID outperforms fine-tuning pretrained Wav2Vec2.0 models, and achieves state-of-the-art performance on the FLEURS benchmark.

pdf bib
Teaching Language Models to Self-Improve through Interactive Demonstrations
Xiao Yu | Baolin Peng | Michel Galley | Jianfeng Gao | Zhou Yu

The self-improving ability of large language models (LLMs), enabled by prompting them to analyze and revise their own outputs, has garnered significant interest in recent research. However, this ability has been shown to be absent and difficult to learn for smaller models, thus widening the performance gap between state-of-the-art LLMs and more cost-effective and faster ones. To reduce this gap, we introduce TriPosT, a training algorithm that endows smaller models with such self-improvement ability, and show that our approach can improve LLaMA-7B’s performance on math and reasoning tasks by up to 7.13%. In contrast to prior work, we achieve this by using the smaller model to interact with LLMs to collect feedback and improvements on *its own generations*. We then replay this experience to train the small model. Our experiments on four math and reasoning datasets show that the interactive experience of learning from and correcting its *own* mistakes is crucial for small models to improve their performance.

pdf bib
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
Hossein Aboutalebi | Hwanjun Song | Yusheng Xie | Arshit Gupta | Lijia Sun | Hang Su | Igor Shalyminov | Nikolaos Pappas | Siffi Singh | Saab Mansour

Development of multimodal interactive systems is hindered by the lack of rich, multimodal (text, images) conversational data, which is needed in large quantities for LLMs. Previous approaches augment textual dialogues with retrieved images, posing privacy, diversity, and quality constraints. In this work, we introduce Multimodal Augmented Generative Images Dialogues (MAGID), a framework to augment text-only dialogues with diverse and high-quality images . Subsequently, a diffusion model is applied to craft corresponding images, ensuring alignment with the identified text. Finally, MAGID incorporates an innovative feedback loop between an image description generation module (textual LLM) and image quality modules (addressing aesthetics, image-text matching, and safety), that work in tandem to generate high-quality and multi-modal dialogues. We compare MAGID to other SOTA baselines on three dialogue datasets, using automated and human evaluation. Our results show that MAGID is comparable to or better than baselines, with significant improvements in human evaluation, especially against retrieval baselines where the image database is small.

pdf bib
Zero-shot Generative Linguistic Steganography
Ke Lin | Yiyang Luo | Zijian Zhang | Luo Ping

Generative linguistic steganography attempts to hide secret messages into covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext, however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces 1.926× more innocent and intelligible stegotext than any other method.

pdf bib
Does GPT-4 pass the Turing test?
Cameron Jones | Ben Bergen

We evaluated GPT-4 in a public online Turing test. The best-performing GPT-4 prompt passed in 49.7% of games, outperforming ELIZA (22%) and GPT-3.5 (20%), but falling short of the baseline set by human participants (66%). Participants’ decisions were based mainly on linguistic style (35%) and socioemotional traits (27%), supporting the idea that intelligence, narrowly conceived, is not sufficient to pass the Turing test. Participant knowledge about LLMs and number of games played positively correlated with accuracy in detecting AI, suggesting learning and practice as possible strategies to mitigate deception. Despite known limitations as a test of intelligence, we argue that the Turing test continues to be relevant as an assessment of naturalistic communication and deception. AI models with the ability to masquerade as humans could have widespread societal consequences, and we analyse the effectiveness of different strategies and criteria for judging humanlikeness.

pdf bib
Polarity Calibration for Opinion Summarization
Yuanyuan Lei | Kaiqiang Song | Sangwoo Cho | Xiaoyang Wang | Ruihong Huang | Dong Yu

Opinion summarization is automatically generating summaries from a variety of subjective information, such as product reviews or political opinions. The challenge of opinions summarization lies in presenting divergent or even conflicting opinions. We conduct an analysis of previous summarization models, which reveals their inclination to amplify the polarity bias, emphasizing the majority opinions while ignoring the minority opinions. To address this issue and make the summarizer express both sides of opinions, we introduce the concept of polarity calibration, which aims to align the polarity of output summary with that of input text. Specifically, we develop a reinforcement training approach for polarity calibration. This approach feeds the polarity distance between output summary and input text as reward into the summarizer, and also balance polarity calibration with content preservation and language naturality. We evaluate our Polarity Calibration model (PoCa) on two types of opinions summarization tasks: summarizing product reviews and political opinions articles. Automatic and human evaluation demonstrate that our approach can mitigate the polarity mismatch between output summary and input text, as well as maintain the content semantic and language quality.

pdf bib
Sentence-level Media Bias Analysis with Event Relation Graph
Yuanyuan Lei | Ruihong Huang

Media outlets are becoming more partisan and polarized nowadays. In this paper, we identify media bias at the sentence level, and pinpoint bias sentences that intend to sway readers’ opinions. As bias sentences are often expressed in a neutral and factual way, considering broader context outside a sentence can help reveal the bias. In particular, we observe that events in a bias sentence need to be understood in associations with other events in the document. Therefore, we propose to construct an event relation graph to explicitly reason about event-event relations for sentence-level bias identification. The designed event relation graph consists of events as nodes and four common types of event relations: coreference, temporal, causal, and subevent relations. Then, we incorporate event relation graph for bias sentences identification in two steps: an event-aware language model is built to inject the events and event relations knowledge into the basic language model via soft labels; further, a relation-aware graph attention network is designed to update sentence embedding with events and event relations information based on hard labels. Experiments on two benchmark datasets demonstrate that our approach with the aid of event relation graph improves both precision and recall of bias sentence identification.

pdf bib
EMONA: Event-level Moral Opinions in News Articles
Yuanyuan Lei | Md Messal Monem Miah | Ayesha Qamar | Sai Ramana Reddy | Jonathan Tong | Haotian Xu | Ruihong Huang

Most previous research on moral frames has focused on social media short texts, little work has explored moral sentiment within news articles. In news articles, authors often express their opinions or political stance through moral judgment towards events, specifically whether the event is right or wrong according to social moral rules. This paper initiates a new task to understand moral opinions towards events in news articles. We have created a new dataset, EMONA, and annotated event-level moral opinions in news articles. This dataset consists of 400 news articles containing over 10k sentences and 45k events, among which 9,613 events received moral foundation labels. Extracting event morality is a challenging task, as moral judgment towards events can be very implicit. Baseline models were built for event moral identification and classification. In addition, we also conduct extrinsic evaluations to integrate event-level moral opinions into three downstream tasks. The statistical analysis and experiments show that moral opinions of events can serve as informative features for identifying ideological bias or subjective events.

pdf bib
DLM: A Decoupled Learning Model for Long-tailed Polyphone Disambiguation in Mandarin
Beibei Gao | Yangsen Zhang | Ga Xiang | Yushan Jiang

Grapheme-to-phoneme conversion (G2P) is a critical component of the text-to-speech system (TTS), where polyphone disambiguation is the most crucial task. However, polyphone disambiguation datasets often suffer from the long-tail problem, and context learning for polyphonic characters commonly stems from a single dimension. In this paper, we propose a novel model DLM: a Decoupled Learning Model for long-tailed polyphone disambiguation in Mandarin. Firstly, DLM decouples representation and classification learnings. It can apply different data samplers for each stage to obtain an optimal training data distribution. This can mitigate the long-tail problem. Secondly, two improved attention mechanisms and a gradual conversion strategy are integrated into the DLM, which achieve transition learning of context from local to global. Finally, to evaluate the effectiveness of DLM, we construct a balanced polyphone disambiguation corpus via in-context learning. Experiments on the benchmark CPP dataset demonstrate that DLM achieves a boosted accuracy of 99.07%. Moreover, DLM improves the disambiguation performance of long-tailed polyphonic characters. For many long-tailed characters, DLM even achieves an accuracy of 100%.

pdf bib
You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments
Bangzhao Shu | Lechen Zhang | Minje Choi | Lavinia Dunagan | Lajanugen Logeswaran | Moontae Lee | Dallas Card | David Jurgens

The versatility of Large Language Models (LLMs) on natural language understanding tasks has made them popular for research in social sciences. To properly understand the properties and innate personas of LLMs, researchers have performed studies that involve using prompts in the form of questions that ask LLMs about particular opinions. In this study, we take a cautionary step back and examine whether the current format of prompting LLMs elicits responses in a consistent and robust manner. We first construct a dataset that contains 693 questions encompassing 39 different instruments of persona measurement on 115 persona axes. Additionally, we design a set of prompts containing minor variations and examine LLMs’ capabilities to generate answers, as well as prompt variations to examine their consistency with respect to content-level variations such as switching the order of response options or negating the statement. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model’s question-answering ability, and that most LLMs have low negation consistency. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions, and we therefore discuss potential alternatives to improve these issues.

pdf bib
CASA: Causality-driven Argument Sufficiency Assessment
Xiao Liu | Yansong Feng | Kai-Wei Chang

The argument sufficiency assessment task aims to determine if the premises of a given argument support its conclusion.To tackle this task, existing works often train a classifier on data annotated by humans. However, annotating data is laborious, and annotations are often inconsistent due to subjective criteria. Motivated by the definition of probability of sufficiency (PS) in the causal literature, we proposeCASA, a zero-shot causality-driven argument sufficiency assessment framework. PS measures how likely introducing the premise event would lead to the conclusion when both the premise and conclusion events are absent. To estimate this probability, we propose to use large language models (LLMs) to generate contexts that are inconsistent with the premise and conclusion and revise them by injecting the premise event.Experiments on two logical fallacy detection datasets demonstrate that CASA accurately identifies insufficient arguments. We further deploy CASA in a writing assistance application, and find that suggestions generated by CASA enhance the sufficiency of student-written arguments. Code and data are available at

pdf bib
MacGyver: Are Large Language Models Creative Problem Solvers?
Yufei Tian | Abhilasha Ravichander | Lianhui Qin | Ronan Le Bras | Raja Marjieh | Nanyun Peng | Yejin Choi | Thomas Griffiths | Faeze Brahman

We explore the creative problem-solving capabilities of modern LLMs in a novel constrained setting. To this end, we create MACGYVER, an automatically generated dataset consisting of over 1,600 real-world problems deliberately designed to trigger innovative usage of objects and necessitate out-of-the-box thinking. We then present our collection to both LLMs and humans to compare and contrast their problem-solving abilities. MACGYVER is challenging for both groups, but in unique and complementary ways. For instance, humans excel in tasks they are familiar with but struggle with domain-specific knowledge, leading to a higher variance. In contrast, LLMs, exposed to a variety of specialized knowledge, attempt broader problems but fail by proposing physically-infeasible actions. Finally, we provide a detailed error analysis of LLMs, and demonstrate the potential of enhancing their problem-solving ability with novel prompting techniques such as iterative step-wise reflection and divergent-convergent thinking.This work (1) introduces a fresh arena for intelligent agents focusing on intricate aspects of physical reasoning, planning, and unconventional thinking, which supplements the existing spectrum of machine intelligence; and (2) provides insight into the constrained problem-solving capabilities of both humans and AI.

pdf bib
To Translate or Not to Translate: A Systematic Investigation of Translation-Based Cross-Lingual Transfer to Low-Resource Languages
Benedikt Ebing | Goran Glavaš

Perfect machine translation (MT) would render cross-lingual transfer (XLT) by means of multilingual language models (mLMs) superfluous. Given, on the one hand, the large body of work on improving XLT with mLMs and, on the other hand, recent advances in massively multilingual MT, in this work, we systematically evaluate existing and propose new translation-based XLT approaches for transfer to low-resource languages. We show that all translation-based approaches dramatically outperform zero-shot XLT with mLMs—with the combination of round-trip translation of the source-language training data and the translation of the target-language test instances at inference—being generally the most effective. We next show that one can obtain further empirical gains by adding reliable translations to other high-resource languages to the training data. Moreover, we propose an effective translation-based XLT strategy even for languages not supported by the MT system. Finally, we show that model selection for XLT based on target-language validation data obtained with MT outperforms model selection based on the source-language data. We believe our findings warrant a broader inclusion of more robust translation-based baselines in XLT research.

pdf bib
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
Rui Wang | Hongru Wang | Fei Mi | Boyang Xue | Yi Chen | Kam-Fai Wong | Ruifeng Xu

Numerous works are proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful.Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses.Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users’ false beliefs or malicious intents.In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.Additionally, we identified that different inductive styles affect the models’ ability to identify the same underlying errors,and the complexity of the underlying assumptions also influences the model’s performance.Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions.Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.

pdf bib
GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
Urchade Zaratiana | Nadi Tomeh | Pierre Holat | Thierry Charnois

Named Entity Recognition (NER) is essential in various Natural Language Processing (NLP) applications. Traditional NER models are effective but limited to a set of predefined entity types. In contrast, Large Language Models (LLMs) can extract arbitrary entities through natural language instructions, offering greater flexibility. However, their size and cost, particularly for those accessed via APIs like ChatGPT, make them impractical in resource-limited scenarios. In this paper, we introduce a compact NER model trained to identify any type of entity. Leveraging a bidirectional transformer encoder, our model, GLiNER, facilitates parallel entity extraction, an advantage over the slow sequential token generation of LLMs. Through comprehensive testing, GLiNER demonstrate strong performance, outperforming both ChatGPT and fine-tuned LLMs in zero-shot evaluations on various NER benchmarks.

pdf bib
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Paul Röttger | Hannah Kirk | Bertie Vidgen | Giuseppe Attanasio | Federico Bianchi | Dirk Hovy

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest’s creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.

pdf bib
Carpe diem: On the Evaluation of World Knowledge in Lifelong Language Models
Yujin Kim | Jaehong Yoon | Seonghyeon Ye | Sangmin Bae | Namgyu Ho | Sung Ju Hwang | Se-Young Yun

The dynamic nature of knowledge in an ever-changing world presents challenges for language models trained on static data; the model in the real world often requires not only acquiring new knowledge but also overwriting outdated information into updated ones. To study the ability of language models for these time-dependent dynamics in human language, we introduce a novel task, EvolvingQA, a temporally evolving question-answering benchmark designed for training and evaluating LMs on an evolving Wikipedia database. The construction of EvolvingQA is automated with our pipeline using large language models. We uncover that existing continual learning baselines suffer from updating and removing outdated knowledge. Our analysis suggests that models fail to rectify knowledge due to small weight gradients. In addition, we elucidate that language models particularly struggle to reflect the change of numerical or temporal information. Our work aims to model the dynamic nature of real-world information, suggesting faithful evaluations of the evolution-adaptability of language models. Our data construction code and dataset files are available at

pdf bib
Fine-grained Gender Control in Machine Translation with Large Language Models
Minwoo Lee | Hyukhun Koh | Minsung Kim | Kyomin Jung

In machine translation, the problem of ambiguously gendered input has been pointed out, where the gender of an entity is not available in the source sentence. To address this ambiguity issue, the task of controlled translation that takes the gender of the ambiguous entity as additional input have been proposed. However, most existing works have only considered a simplified setup of one target gender for input. In this paper, we tackle controlled translation in a more realistic setting of inputs with multiple entities and propose Gender-of-Entity (GoE) prompting method for LLMs. Our proposed method instructs the model with fine-grained entity-level gender information to translate with correct gender inflections. By utilizing four evaluation benchmarks, we investigate the controlled translation capability of LLMs in multiple dimensions and find that LLMs reach state-of-the-art performance in controlled translation. Furthermore, we discover an emergence of gender interference phenomenon when controlling the gender of multiple entities. Finally, we address the limitations of existing gender accuracy evaluation metrics and propose leveraging LLMs as an evaluator for gender inflection in machine translation.

pdf bib
DialogVCS: Robust Natural Language Understanding in Dialogue System Upgrade
Zefan Cai | Xin Zheng | Tianyu Liu | Haoran Meng | Jiaqi Han | Gang Yuan | Binghuai Lin | Baobao Chang | Yunbo Cao

In the constant updates of the product dialogue systems, we need to retrain the natural language understanding (NLU) model as new data from the real users would be merged into the existing data accumulated in the last updates. Within the newly added data, new intents would emerge and might have semantic entanglement with the existing intents, e.g. new intents that are semantically too specific or generic are actually a subset or superset of some existing intents in the semantic space, thus impairing the robustness of the NLU model.As the first attempt to solve this problem, we setup a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate the intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, in the inference.We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow. Our code and dataset are available at

pdf bib
LLatrieval: LLM-Verified Retrieval for Verifiable Generation
Xiaonan Li | Changtai Zhu | Linyang Li | Zhangyue Yin | Tianxiang Sun | Xipeng Qiu

Verifiable generation aims to let the large language model (LLM) generate text with supporting documents, which enables the user to flexibly verify the answer and makes the LLM’s output more reliable. Retrieval plays a crucial role in verifiable generation. Specifically, the retrieved documents not only supplement knowledge to help the LLM generate correct answers, but also serve as supporting evidence for the user to verify the LLM’s output. However, the widely used retrievers become the bottleneck of the entire pipeline and limit the overall performance. Their capabilities are usually inferior to LLMs since they often have much fewer parameters than the large language model and have not been demonstrated to scale well to the size of LLMs. If the retriever does not correctly find the supporting documents, the LLM can not generate the correct and verifiable answer, which overshadows the LLM’s remarkable abilities. To address these limitations, we propose **LLatrieval** (**L**arge **La**nguage Model Verified Re**trieval**),where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question. Thus, the LLM can iteratively provide feedback to retrieval and facilitate the retrieval result to fully support verifiable generation. Experiments on ALCE show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.

pdf bib
Mapping Long-term Causalities in Psychiatric Symptomatology and Life Events from Social Media
Siyuan Chen | Meilin Wang | Minghao Lv | Zhiling Zhang | Juqianqian Juqianqian | Dejiyangla Dejiyangla | Yujia Peng | Kenny Zhu | Mengyue Wu

Social media is a valuable data source for exploring mental health issues. However, previous studies have predominantly focused on the semantic content of these posts, overlooking the importance of their temporal attributes, as well as the evolving nature of mental disorders and symptoms.In this paper, we study the causality between psychiatric symptoms and life events, as well as among different symptoms from social media posts, which leads to better understanding of the underlying mechanisms of mental disorders. By applying these extracted causality features to tasks such as diagnosis point detection and early risk detection of depression, we notice considerable performance enhancement. This indicates that causality information extracted from social media data can boost the efficacy of mental disorder diagnosis and treatment planning.

pdf bib
Multimodal Chart Retrieval: A Comparison of Text, Table and Image Based Approaches
Averi Nowak | Francesco Piccinno | Yasemin Altun

We investigate multimodal chart retrieval, addressing the challenge of retrieving image-based charts using textual queries. We compare four approaches: (a) OCR with text retrieval, (b) chart derendering (DePlot) followed by table retrieval, (c) a direct image understanding model (PaLI-3), and (d) a combined PaLI-3 + DePlot approach. As the table retrieval component we introduce Tab-GTR, a text retrieval model augmented with table structure embeddings, achieving state-of-the-art results on the NQ-Tables benchmark with 48.88% R@1. On in-distribution data, the DePlot-based method (b) outperforms PaLI-3 (c), while being significantly more efficient (300M vs 3B trainable parameters). However, DePlot struggles with complex charts, indicating a need for improvements in chart derendering - specifically in terms of chart data diversity and the richness of text/table representations. We found no clear winner between methods (b) and (c) in general, with the best performance achieved by the combined approach (d), and further show that it benefits the most from multi-task training.

pdf bib
Retrieval Helps or Hurts? A Deeper Dive into the Efficacy of Retrieval Augmentation to Language Models
Seiji Maekawa | Hayate Iso | Sairam Gurajada | Nikita Bhutani

While large language models (LMs) demonstrate remarkable performance, they encounter challenges in providing accurate responses when queried for information beyond their pre-trained memorization. Although augmenting them with relevant external information can mitigate these issues, failure to consider the necessity of retrieval may adversely affect overall performance. Previous research has primarily focused on examining how entities influence retrieval models and knowledge recall in LMs, leaving other aspects relatively unexplored. In this work, our goal is to offer a more detailed, fact-centric analysis by exploring the effects of combinations of entities and relations. To facilitate this, we construct a new question answering (QA) dataset called WiTQA (Wikipedia Triple Question Answers). This dataset includes questions about entities and relations of various popularity levels, each accompanied by a supporting passage. Our extensive experiments with diverse LMs and retrievers reveal when retrieval does not consistently enhance LMs from the viewpoints of fact-centric popularity. Confirming earlier findings, we observe that larger LMs excel in recalling popular facts. However, they notably encounter difficulty with infrequent entity-relation pairs compared to retrievers. Interestingly, they can effectively retain popular relations of less common entities. We demonstrate the efficacy of our finer-grained metric and insights through an adaptive retrieval system that selectively employs retrieval and recall based on the frequencies of entities and relations in the question.

pdf bib
AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs
Yassir Fathullah | Chunyang Wu | Egor Lakomkin | Ke Li | Junteng Jia | Yuan Shangguan | Jay Mahadeokar | Ozlem Kalinli | Christian Fuegen | Mike Seltzer

In this work, we extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation. Such a model also has extended cross-modal capabilities such as being able to perform spoken question answering (QA), speech translation, and audio summarization amongst many other closed and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. On both synthesized and recorded speech QA test sets, evaluations show that our end-to-end approach is on par with or outperforms cascaded systems (speech recognizer + LLM) in terms of modelling the response to a prompt. Furthermore, unlike cascades, our approach can interchange text and audio modalities and intrinsically utilize prior context in a conversation to provide better results.

pdf bib
Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness
Ashim Gupta | Rishanth Rajendhran | Nathan Stringham | Vivek Srikumar | Ana Marasovic

*Do larger and more performant models resolve NLP’s longstanding robustness issues?* We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.

pdf bib
Sequential Compositional Generalization in Multimodal Models
Semih Yagcioglu | Osman Batur İnce | Aykut Erdem | Erkut Erdem | Desmond Elliott | Deniz Yuret

The rise of large-scale multimodal models has paved the pathway for groundbreaking advances in generative modeling and reasoning, unlocking transformative applications in a variety of complex tasks. However, a pressing question that remains is their genuine capability for stronger forms of generalization, which has been largely underexplored in the multimodal setting. Our study aims to address this by examining sequential compositional generalization using CompAct (Compositional Activities), a carefully constructed, perceptually grounded dataset set within a rich backdrop of egocentric kitchen activity videos. Each instance in our dataset is represented with a combination of raw video footage, naturally occurring sound, and crowd-sourced step-by-step descriptions. More importantly, our setup ensures that the individual concepts are consistently distributed across training and evaluation sets, while their compositions are novel in the evaluation set. We conduct a comprehensive assessment of several unimodal and multimodal models. Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts. This highlights the importance of multimodality while charting a trajectory for future research in this domain.

pdf bib
Generating Uncontextualized and Contextualized Questions for Document-Level Event Argument Extraction
Md Nayem Uddin | Enfa George | Eduardo Blanco | Steven Corman

This paper presents multiple question generation strategies for document-level event argument extraction. These strategies do not require human involvement and result in uncontextualized questions as well as contextualized questions grounded on the event and document of interest. Experimental results show that combining uncontextualized and contextualized questions is beneficial,especially when event triggers and arguments appear in different sentences. Our approach does not have corpus-specific components, in particular, the question generation strategies transfer across corpora. We also present a qualitative analysis of the most common errors made by our best model.

pdf bib
Evidence-Driven Retrieval Augmented Response Generation for Online Misinformation
Zhenrui Yue | Huimin Zeng | Yimeng Lu | Lanyu Shang | Yang Zhang | Dong Wang

The proliferation of online misinformation has posed significant threats to public interest. While numerous online users actively participate in the combat against misinformation, many of such responses can be characterized by the lack of politeness and supporting facts. As a solution, text generation approaches are proposed to automatically produce counter-misinformation responses. Nevertheless, existing methods are often trained end-to-end without leveraging external knowledge, resulting in subpar text quality and excessively repetitive responses. In this paper, we propose retrieval augmented response generation for online misinformation (RARG), which collects supporting evidence from scientific sources and generates counter-misinformation responses based on the evidences. In particular, our RARG consists of two stages: (1) evidence collection, where we design a retrieval pipeline to retrieve and rerank evidence documents using a database comprising over 1M academic articles; (2) response generation, in which we align large language models (LLMs) to generate evidence-based responses via reinforcement learning from human feedback (RLHF). We propose a reward function to maximize the utilization of the retrieved evidence while maintaining the quality of the generated text, which yields polite and factual responses that clearly refutes misinformation. To demonstrate the effectiveness of our method, we study the case of COVID-19 and perform extensive experiments with both in- and cross-domain datasets, where RARG consistently outperforms baselines by generating high-quality counter-misinformation responses.

pdf bib
Open-Vocabulary Federated Learning with Multimodal Prototyping
Huimin Zeng | Zhenrui Yue | Dong Wang

Existing federated learning (FL) studies usuallyassume the training label space and test labelspace are identical. However, in real-world applications, this assumption is too ideal to betrue. A new user could come up with queriesthat involve data from unseen classes, and suchopen-vocabulary queries would directly defectsuch FL systems. Therefore, in this work, weexplicitly focus on the under-explored openvocabulary challenge in FL. That is, for a newuser, the global server shall understand her/hisquery that involves arbitrary unknown classes.To address this problem, we leverage the pretrained vision-language models (VLMs). Inparticular, we present a novel adaptation framework tailored for VLMs in the context of FL,named as Federated Multimodal Prototyping(Fed-MP). Fed-MP adaptively aggregates thelocal model weights based on light-weightclient residuals, and makes predictions basedon a novel multimodal prototyping mechanism.Fed-MP exploits the knowledge learned fromthe seen classes, and robustifies the adaptedVLM to unseen categories. Our empirical evaluation on various datasets validates the effectiveness of Fed-MP.

pdf bib
Exploring Key Point Analysis with Pairwise Generation and Graph Partitioning
Xiao Li | Yong Jiang | Shen Huang | Pengjun Xie | Gong Cheng | Fei Huang

Key Point Analysis (KPA), the summarization of multiple arguments into a concise collection of key points, continues to be a significant and unresolved issue within the field of argument mining. Existing models adapt a two-stage pipeline of clustering arguments or generating key points for argument clusters. This approach rely on semantic similarity instead of measuring the existence of shared key points among arguments. Additionally, it only models the intra-cluster relationship among arguments, disregarding the inter-cluster relationship between arguments that do not share key points. To address these limitations, we propose a novel approach for KPA with pairwise generation and graph partitioning. Our objective is to train a generative model that can simultaneously provide a score indicating the presence of shared key point between a pair of arguments and generate the shared key point. Subsequently, to map generated redundant key points to a concise set of key points, we proceed to construct an arguments graph by considering the arguments as vertices, the generated key points as edges, and the scores as edge weights. We then propose a graph partitioning algorithm to partition all arguments sharing the same key points to the same subgraph. Notably, our experimental findings demonstrate that our proposed model surpasses previous models when evaluated on both the ArgKP and QAM datasets.

pdf bib
Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense
Siqi Shen | Lajanugen Logeswaran | Moontae Lee | Honglak Lee | Soujanya Poria | Rada Mihalcea

Large language models (LLMs) have demonstrated substantial commonsense understanding through numerous benchmark evaluations. However, their understanding of cultural commonsense remains largely unexamined. In this paper, we conduct a comprehensive examination of the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks. Using several general and cultural commonsense benchmarks, we find that (1) LLMs have a significant discrepancy in performance when tested on culture-specific commonsense knowledge for different cultures; (2) LLMs’ general commonsense capability is affected by cultural context; and (3) The language used to query the LLMs can impact their performance on cultural-related tasks.Our study points to the inherent bias in the cultural understanding of LLMs and provides insights that can help develop culturally-aware language models.

pdf bib
Code Models are Zero-shot Precondition Reasoners
Lajanugen Logeswaran | Sungryull Sohn | Yiwei Lyu | Anthony Liu | Dong-Ki Kim | Dongsub Shim | Moontae Lee | Honglak Lee

One of the fundamental skills required for an agent acting in an environment to complete tasks is the ability to understand what actions are plausible at any given point. This work explores a novel use of code representations to reason about action preconditions for sequential decision making tasks. Code representations offer the flexibility to model procedural activities and associated constraints as well as the ability to execute and verify constraint satisfaction. Leveraging code representations, we extract action preconditions from demonstration trajectories in a zero-shot manner using pre-trained code models. Given these extracted preconditions, we propose a precondition-aware action sampling strategy that ensures actions predicted by a policy are consistent with preconditions. We demonstrate that the proposed approach enhances the performance of few-shot policy learning approaches across task-oriented dialog and embodied textworld benchmarks.

pdf bib
Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding
Suyoung Kim | Jiyeon Hwang | Ho-Young Jung

Recently, deep end-to-end learning has been studied for intent classification in Spoken Language Understanding (SLU). However, end-to-end models require a large amount of speech data with intent labels, and highly optimized models are generally sensitive to the inconsistency between the training and evaluation conditions. Therefore, a natural language understanding approach based on Automatic Speech Recognition (ASR) remains attractive because it can utilize a pre-trained general language model and adapt to the mismatch of the speech input environment. Using this module-based approach, we improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors. We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. Experiments on four benchmark datasets show that CCL outperforms existing methods and improves the ASR robustness in various noisy environments. Code is available at

pdf bib
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers
Yuan Wang | Xuyang Wu | Hsin-Tai Wu | Zhiqiang Tao | Yi Fang

The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works such as RankGPT have demonstrated that the LLMs have better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker.

pdf bib
TabSQLify: Enhancing Reasoning Capabilities of LLMs Through Table Decomposition
Md Nahid | Davood Rafiei

Table reasoning is a challenging task that requires understanding both natural language questions and structured tabular data. Large language models (LLMs) have shown impressive capabilities in natural language understanding and generation, but they often struggle with large tables due to their limited input length. In this paper, we propose TabSQLify, a novel method that leverages text-to-SQL generation to decompose tables into smaller and relevant sub-tables, containing only essential information for answering questions or verifying statements, before performing the reasoning task. In our comprehensive evaluation on four challenging datasets, our approach demonstrates comparable or superior performance compared to prevailing methods reliant on full tables as input. Moreover, our method can reduce the input context length significantly, making it more scalable and efficient for large-scale table reasoning applications. Our method performs remarkably well on the WikiTQ benchmark, achieving an accuracy of 64.7%. Additionally, on the TabFact benchmark, it achieves a high accuracy of 79.5%. These results surpass other LLM-based baseline models on gpt-3.5-turbo (chatgpt). TabSQLify can reduce the table size significantly alleviating the computational load on LLMs when handling large tables without compromising performance.

pdf bib
Contextual Label Projection for Cross-Lingual Structured Prediction
Tanmay Parekh | I-Hung Hsu | Kuan-Hao Huang | Kai-Wei Chang | Nanyun Peng

Label projection, which involves obtaining translated labels and texts jointly, is essential for leveraging machine translation to facilitate cross-lingual transfer in structured prediction tasks. Prior research exploring label projection often compromise translation accuracy by favoring simplified label translation or relying solely on word-level alignments. In this paper, we introduce a novel label projection approach, CLaP, which translates text to the target language and performs *contextual translation* on the labels using the translated text as the context, ensuring better accuracy for the translated labels. We leverage instruction-tuned language models with multilingual capabilities as our contextual translator, imposing the constraint of the presence of translated labels in the translated text via instructions. We benchmark CLaP with other label projection techniques on zero-shot cross-lingual transfer across 39 languages on two representative structured prediction tasks - event argument extraction (EAE) and named entity recognition (NER), showing over 2.4 F1 improvement for EAE and 1.4 F1 improvement for NER. We further explore the applicability of CLaP on ten extremely low-resource languages to showcase its potential for cross-lingual structured prediction.

pdf bib
Event Detection from Social Media for Epidemic Prediction
Tanmay Parekh | Anh Mac | Jiarui Yu | Yuxuan Dong | Syed Shahriar | Bonnie Liu | Eric Yang | Kuan-Hao Huang | Wei Wang | Nanyun Peng | Kai-Wei Chang

Social media is an easy-to-access platform providing timely updates about societal trends and events. Discussions regarding epidemic-related events such as infections, symptoms, and social interactions can be crucial for informing policymaking during epidemic outbreaks. In our work, we pioneer exploiting Event Detection (ED) for better preparedness and early warnings of any upcoming epidemic by developing a framework to extract and analyze epidemic-related events from social media posts. To this end, we curate an epidemic event ontology comprising seven disease-agnostic event types and construct a Twitter dataset SPEED with human-annotated events focused on the COVID-19 pandemic. Experimentation reveals how ED models trained on COVID-based SPEED can effectively detect epidemic events for three unseen epidemics of Monkeypox, Zika, and Dengue; while models trained on existing ED datasets fail miserably. Furthermore, we show that reporting sharp increases in the extracted events by our framework can provide warnings 4-9 weeks earlier than the WHO epidemic declaration for Monkeypox. This utility of our framework lays the foundations for better preparedness against emerging epidemics.

pdf bib
RESPROMPT: Residual Connection Prompting Advances Multi-Step Reasoning in Large Language Models
Song Jiang | Zahra Shakeri | Aaron Chan | Maziar Sanjabi | Hamed Firooz | Yinglong Xia | Bugra Akyildiz | Yizhou Sun | Jinchao Li | Qifan Wang | Asli Celikyilmaz

Chain-of-thought (CoT) has impressively unlocked the reasoning potential of large language models (LLMs). Yet, it falls short when tackling problems that require multiple reasoning steps. This limitation arises from the complex nature of multi-step reasoning processes: later stages often depend not only on the immediately preceding step, but also on the results from several steps earlier. Such complexities indicate the reasoning process is naturally a graph. The almost linear structure of CoT, however, struggles to capture this complex reasoning graph. To address this challenge, we propose Residual Connection Prompting (ResPrompt), a new prompting strategy that advances multi-step reasoning in LLMs. The core of our idea is to reconstruct the reasoning graph within prompts. We achieve this by integrating necessary connections–links present in reasoning graph but missing in the linear CoT flow–into the prompts. Termed “residual connections”, these links can transform linear CoT into the complex reasoning graphs that multi-step problems entail. On benchmarks across math, sequential, and commonsense domains, ResPrompt demonstrates clear improvements in multi-step reasoning compared with CoT. Through extensive ablation studies and analyses, we pinpoint how to effectively build residual connections and also identify situations where it might be unnecessary.

pdf bib
BPE-knockout: Pruning Pre-existing BPE Tokenisers with Backwards-compatible Morphological Semi-supervision
Thomas Bauwens | Pieter Delobelle

Byte-pair encoding (BPE) has become the default subword tokeniser in language models (LMs), allowing the representation of an infinite space of text with a finite set of units. Yet, BPE training is unsupervised, receiving no explicit information about a language’s morphology. This results in a subword vocabulary wherein many units are a concatenation of partial morphemes, preventing their formation as tokens. This, in turn, causes consistent intra-word patterns to be displayed inconsistently to downstream models, and bloats the vocabulary, hence requiring unnecessary embedding storage. In this paper, we address this issue by identifying blameworthy BPE merges and removing the resulting subwords from the BPE vocabulary, without impeding further use of merges that relied on them. We find that our method, BPE-knockout, is effective at making BPE’s segmentation positions adhere better to derivational and compound boundaries in English, Dutch and German, and improves token-based tasks in Dutch RoBERTa models, indicating that a tokeniser’s adherence to morphology impacts downstream models. We demonstrate the latter not only by training LMs from scratch, but also by continuing the pre-training of existing LMs. This proves promising, showing that suboptimal tokenisers can be remedied whilst salvaging training cost of downstream LMs.

pdf bib
How are Prompts Different in Terms of Sensitivity?
Sheng Lu | Hendrik Schuff | Iryna Gurevych

In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompt techniques across different models and tasks. To address this, we present a comprehensive prompt analysis based on sensitivity. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding which incorporates sensitivity estimation as a penalty term in the standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts, and contributes to a better understanding of the mechanism of ICL.

pdf bib
LSTDial: Enhancing Dialogue Generation via Long- and Short-Term Measurement Feedback
Guanghui Ye | Huan Zhao | Zixing Zhang | Xupeng Zha | Zhihua Jiang

Generating high-quality responses is a key challenge for any open domain dialogue systems. However, even though there exist a variety of quality dimensions especially designed for dialogue evaluation (e.g., coherence and diversity scores), current dialogue systems rarely utilize them to guide the response generation during training. To alleviate this issue, we propose LSTDial (Long- and Short-Term Dialogue), a novel two-stage framework which generates and utilizes conversation evaluation as explicit feedback during training. Specifically, we fine-tune pre-trained dialogue systems through using turn-level quality feedback in the first stage and further train ever-improving dialogue agents through using dialogue-level quality feedback in the second stage. By using our approach on dialogue systems, capable of enabling dialogue generation with both short-term capabilities (generating more fluent, relevant and varied responses at the turn-level) and long-term capabilities (generating more coherent, engaging and informative responses at the dialogue-level). We implement LSTDial on four strong baseline models and experiment with two open-domain dialogue datasets. Experimental results show that LSTDial achieves significant improvement, enabling to generate better dialogue responses in terms of both human and automatic evaluation.

pdf bib
The ART of LLM Refinement: Ask, Refine, and Trust
Kumar Shridhar | Koustuv Sinha | Andrew Cohen | Tianlu Wang | Ping Yu | Ramakanth Pasunuru | Mrinmaya Sachan | Jason Weston | Asli Celikyilmaz

Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations and self-improve?A popular concept, referred to as *self-refinement*, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning with a refinement strategy called *ART: Ask, Refine, and Trust*, which *asks* necessary questions to decide when an LLM should *refine* its output, and uses it to affirm or deny *trust* in its refinement by ranking the refinement and the initial prediction. On two multistep reasoning tasks of mathematical word problems (GSM8K) and question answering (StrategyQA), *ART* achieves a performance gain of +5 points over self-refinement baselines, while using a much smaller model as the decision maker. We believe that *ART* with smaller models, making refinement decisions can be a cost-effective alternative to fine-tuning LLMs.

pdf bib
Modularized Multilingual NMT with Fine-grained Interlingua
Sungjun Lim | Yoonjung Choi | Sangha Kim

Recently, one popular alternative in Multilingual NMT (MNMT) is modularized MNMT that has both language-specific encoders and decoders. However, due to the absence of layer-sharing, the modularized MNMT failed to produce satisfactory language-independent (Interlingua) features, leading to performance degradation in zero-shot translation. To address this issue, a solution was proposed to share the top of language-specific encoder layers, enabling the successful generation of interlingua features. Nonetheless, it should be noted that this sharing structure does not guarantee the explicit propagation of language-specific features to their respective language-specific decoders. Consequently, to overcome this challenge, we present our modularized MNMT approach, where a modularized encoder is divided into three distinct encoder modules based on different sharing criteria: (1) source language-specific (Encs); (2) universal (Encall); (3) target language-specific (Enct). By employing these sharing strategies, Encall propagates the interlingua features, after which Enct propagates the target language-specific features to the language-specific decoders. Additionally, we suggest the Denoising Bi-path Autoencoder (DBAE) to fortify the Denoising Autoencoder (DAE) by leveraging Enct. For experimental purposes, our training corpus comprises both En-to-Any and Any-to-En directions. We adjust the size of our corpus to simulate both balanced and unbalanced settings. Our method demonstrates an improved average BLEU score by "+2.90” in En-to-Any directions and by "+3.06” in zero-shot compared to other MNMT baselines.

pdf bib
ParallelPARC: A Scalable Pipeline for Generating Natural-Language Analogies
Oren Sultan | Yonatan Bitton | Ron Yosef | Dafna Shahaf

Analogy-making is central to human cognition, allowing us to adapt to novel situations – an ability that current AI systems still lack. Most analogy datasets today focus on simple analogies (e.g., word analogies); datasets including complex types of analogies are typically manually curated and very small. We believe that this holds back progress in computational analogy.In this work, we design a data generation pipeline, ParallelPARC (Parallel Paragraph Creator) leveraging state-of-the-art Large Language Models (LLMs) to create complex, paragraph-based analogies, as well as distractors, both simple and challenging. We demonstrate our pipeline and create ProPara-Logy, a dataset of analogies between scientific processes. We publish a gold-set, validated by humans, and a silver-set, generated automatically. We test LLMs’ and humans’ analogy recognition in binary and multiple-choice settings, and found that humans outperform the best models (∼13% gap) after a light supervision. We demonstrate that our silver-set is useful for training models. Lastly, we show challenging distractors confuse LLMs, but not humans. We hope our pipeline will encourage research in this emerging field.

pdf bib
AWESOME: GPU Memory-constrained Long Document Summarization using Memory Mechanism and Global Salient Content
Shuyang Cao | Lu Wang

Long document summarization systems are critical for domains with lengthy and jargon-laden text, yet they present significant challenges to researchers and developers with limited computing resources. Existing solutions mainly focus on efficient attentions or divide-and-conquer strategies. The former reduces theoretical time complexity, but is still memory-heavy. The latter methods sacrifice global context, leading to uninformative and incoherent summaries. This work aims to leverage the memory-efficient nature of divide-and-conquer methods while preserving global context. Concretely, our framework AWESOME uses two novel mechanisms: (1) External memory mechanisms track previously encoded document segments and their corresponding summaries, to enhance global document understanding and summary coherence. (2) Global salient content is further identified beforehand to augment each document segment to support its summarization. Extensive experiments on diverse genres of text, including government reports, meeting transcripts, screenplays, scientific papers, and novels, show that AWESOME produces summaries with improved informativeness, faithfulness, and coherence than competitive baselines on longer documents, while having a smaller GPU memory footprint.

pdf bib
NLP Systems That Can’t Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps
Kristina Gligoric | Myra Cheng | Lucia Zheng | Esin Durmus | Dan Jurafsky

The use of words to convey speaker’s intent is traditionally distinguished from the ‘mention’ of words for quoting what someone said, or pointing out properties of a word. Here we show that computationally modeling this use-mention distinction is crucial for dealing with counterspeech online. Counterspeech that refutes problematic content often mentions harmful language but is not harmful itself (e.g., calling a vaccine dangerous is not the same as expressing disapproval of someone for calling vaccines dangerous). We show that even recent language models fail at distinguishing use from mention, and that this failure propagates to two key downstream tasks: misinformation and hate speech detection, resulting in censorship of counterspeech. We introduce prompting mitigations that teach the use-mention distinction, and show they reduce these errors. Our work highlights the importance of the use-mention distinction for NLP and CSS and offers ways to address it.

pdf bib
Debiasing with Sufficient Projection: A General Theoretical Framework for Vector Representations
Enze Shi | Lei Ding | Linglong Kong | Bei Jiang

Pre-trained vector representations in natural language processing often inadvertently encode undesirable social biases. Identifying and removing unwanted biased information from vector representation is an evolving and significant challenge. Our study uniquely addresses this issue from the perspective of statistical independence, proposing a framework for reducing bias by transforming vector representations to an unbiased subspace using sufficient projection. The key to our framework lies in its generality: it adeptly mitigates bias across both debiasing and fairness tasks, and across various vector representation types, including word embeddings and output representations of transformer models. Importantly, we establish the connection between debiasing and fairness, offering theoretical guarantees and elucidating our algorithm’s efficacy. Through extensive evaluation of intrinsic and extrinsic metrics, our method achieves superior performance in bias reduction while maintaining high task performance, and offers superior computational efficiency.

pdf bib
Semi-Supervised Dialogue Abstractive Summarization via High-Quality Pseudolabel Selection
Jianfeng He | Hang Su | Jason Cai | Igor Shalyminov | Hwanjun Song | Saab Mansour

Semi-supervised dialogue summarization (SSDS) leverages model-generated summaries to reduce reliance on human-labeled data and improve the performance of summarization models. While addressing label noise, previous works on semi-supervised learning primarily focus on natural language understanding tasks, assuming each sample has a unique label. However, these methods are not directly applicable to SSDS, as it is a generative task, and each dialogue can be summarized in different ways. In this work, we propose a novel scoring approach, SiCF, which encapsulates three primary dimensions of summarization model quality: Semantic invariance (indicative of model confidence), Coverage (factual recall), and Faithfulness (factual precision). Using the SiCF score, we select unlabeled dialogues with high-quality generated summaries to train summarization models. Comprehensive experiments on three public datasets demonstrate the effectiveness of SiCF scores in uncertainty estimation and semi-supervised learning for dialogue summarization tasks. Our code is available at

pdf bib
AfriMTE and AfriCOMET: Enhancing COMET to Embrace Under-resourced African Languages
Jiayi Wang | David Adelani | Sweta Agrawal | Marek Masiak | Ricardo Rei | Eleftheria Briakou | Marine Carpuat | Xuanli He | Sofia Bourhim | Andiswa Bukula | Muhidin Mohamed | Temitayo Olatoye | Tosin Adewumi | Hamam Mokayed | Christine Mwase | Wangui Kimotho | Foutse Yuehgoh | Anuoluwapo Aremu | Jessica Ojo | Shamsuddeen Muhammad | Salomey Osei | Abdul-Hakeem Omotayo | Chiamaka Chukwuneke | Perez Ogayo | Oumaima Hourrane | Salma El Anigri | Lolwethu Ndolela | Thabiso Mangwana | Shafie Mohamed | Hassan Ayinde | Oluwabusayo Awoyomi | Lama Alkhaled | Sana Al-azzawi | Naome Etori | Millicent Ochieng | Clemencia Siro | Njoroge Kiragu | Eric Muchiri | Wangari Kimotho | Toadoum Sari Sakayo | Lyse Naomi Wamba | Daud Abolade | Simbiat Ajao | Iyanuoluwa Shode | Ricky Macharm | Ruqayya Iro | Saheed Abdullahi | Stephen Moore | Bernard Opoku | Zainab Akinjobi | Abeeb Afolabi | Nnaemeka Obiefuna | Onyekachi Ogbu | Sam Ochieng’ | Verrah Otiende | Chinedu Mbonu | Yao Lu | Pontus Stenetorp

Despite the recent progress on scaling multilingual machine translation (MT) to several under-resourced African languages, accurately measuring this progress remains challenging, since evaluation is often performed on n-gram matching metrics such as BLEU, which typically show a weaker correlation with human judgments. Learned metrics such as COMET have higher correlation; however, the lack of evaluation data with human ratings for under-resourced languages, complexity of annotation guidelines like Multidimensional Quality Metrics (MQM), and limited language coverage of multilingual encoders have hampered their applicability to African languages. In this paper, we address these challenges by creating high-quality human evaluation data with simplified MQM guidelines for error detection and direct assessment (DA) scoring for 13 typologically diverse African languages. Furthermore, we develop AfriCOMET: COMET evaluation metrics for African languages by leveraging DA data from well-resourced languages and an African-centric multilingual encoder (AfroXLM-R) to create the state-of-the-art MT evaluation metrics for African languages with respect to Spearman-rank correlation with human judgments (0.441).

pdf bib
TableLlama: Towards Open Large Generalist Models for Tables
Tianshu Zhang | Xiang Yue | Yifei Li | Huan Sun

Semi-structured tables are ubiquitous. There has been a variety of tasks that aim to automatically interpret, augment, and query tables. Current methods often require pretraining on tables or special model architecture design, are restricted to specific table types, or have simplifying assumptions about tables and tasks. This paper makes the first step towards developing open-source large language models (LLMs) as generalists for a diversity of table-based tasks. Towards that end, we construct TableInstruct, a new dataset with a variety of realistic tables and tasks, for instruction tuning and evaluating LLMs. We further develop the first open-source generalist model for tables, TableLlama, by fine-tuning Llama 2 (7B) with LongLoRA to address the long context challenge. We experiment under both in-domain setting and out-of-domain setting. On 7 out of 8 in-domain tasks, TableLlama achieves comparable or better performance than the SOTA for each task, despite the latter often has task-specific design. On 6 out-of-domain datasets, it achieves 5-44 absolute point gains compared with the base model, showing that training on TableInstruct enhances the model’s generalizability. We open-source our dataset and trained model to boost future work on developing open generalist models for tables.

pdf bib
PEMA: An Offsite-Tunable Plug-in External Memory Adaptation for Language Models
HyunJin Kim | Young Jin Kim | JinYeong Bak

Pre-trained language models (PLMs) show impressive performance in various downstream NLP tasks. However, pre-training large language models demands substantial memory and training compute. Furthermore, due to the substantial resources required, many PLM weights are confidential. Consequently, users are compelled to share their data with model owners for fine-tuning specific tasks. To overcome the limitations, we introduce Plug-in External Memory Adaptation (PEMA), a Parameter-Efficient Fine-Tuning (PEFT) method, enabling PLM fine-tuning without requiring access to all the weights. PEMA integrates with context representations from test data during inference to perform downstream tasks. It uses external memory to store PLM-generated context representations mapped with target tokens. Our method utilizes weight matrices of LoRA-like bottlenecked adapter in the PLM’s final layer to enhance efficiency. Our approach also includes Gradual Unrolling, a novel interpolation strategy to improve generation quality. We validate PEMA’s effectiveness through experiments on syntactic and real datasets for machine translation and style transfer. Our findings show that PEMA outperforms other PEFT approaches in memory and latency efficiency for training, and also excels in maintaining sentence meaning and generating appropriate language and styles.

pdf bib
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
Jun Yan | Vikas Yadav | Shiyang Li | Lichang Chen | Zheng Tang | Hai Wang | Vijay Srinivasan | Xiang Ren | Hongxia Jin

Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt “Describe Joe Biden negatively.” for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model’s instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at

pdf bib
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models
Shuaijie She | Shujian Huang | Xingyun Wang | Yanke Zhou | Jiajun Chen

LLMs (Large Language Models) usually interact with users in the form of dialogue and generate responses following their instructions, which naturally require dialogue comprehension abilities. However, dialogue comprehension is a general language ability which is hard to be evaluated directly. In this work, we propose to perform the evaluation focusing on the factual consistency issue with the help of the dialogue summarization task. Besides evaluating and analyzing the dialogue summarization performance (DIAC-Sum) of different LLMs, we also derive factual questions from the generated summaries and use them as a more flexible measurement of dialogue comprehension (DIAC-FactQA). Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. Even ChatGPT, the strongest model evaluated, has such errors in 16% of its summaries. For answering the factual questions, which is more challenging, the average error rate of all evaluated LLMs is 36.1%. Both results indicate serious deficiencies. Detailed analysis shows that the understanding of subject/object of the conversation is still challenging for LLMs. Furthermore, to stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data, which achieved a relative error rate reduction of 11% on DIAC-FactQA.

pdf bib
Multilingual Pretraining and Instruction Tuning Improve Cross-Lingual Knowledge Alignment, But Only Shallowly
Changjiang Gao | Hongda Hu | Peng Hu | Jiajun Chen | Jixing Li | Shujian Huang

Despite their strong ability to retrieve knowledge in English, current large language models show imbalance abilities in different languages. Two approaches are proposed to address this, i.e., multilingual pretraining and multilingual instruction tuning. However, whether and how do such methods contribute to the cross-lingual knowledge alignment inside the models is unknown. In this paper, we propose CLiKA, a systematic framework to assess the cross-lingual knowledge alignment of LLMs in the Performance, Consistency and Conductivity levels, and explored the effect of multilingual pretraining and instruction tuning on the degree of alignment. Results show that: while both multilingual pretraining and instruction tuning are beneficial for cross-lingual knowledge alignment, the training strategy needs to be carefully designed. Namely, continued pretraining improves the alignment of the target language at the cost of other languages, while mixed pretraining affect other languages less. Also, the overall cross-lingual knowledge alignment, especially in the conductivity level, is unsatisfactory for all tested LLMs, and neither multilingual pretraining nor instruction tuning can substantially improve the cross-lingual knowledge conductivity.

pdf bib
A Study on the Calibration of In-context Learning
Hanlin Zhang | YiFan Zhang | Yaodong Yu | Dhruv Madeka | Dean Foster | Eric Xing | Himabindu Lakkaraju | Sham Kakade

Accurate uncertainty quantification is crucial for the safe deployment of machine learning models, and prior research has demonstrated improvements in the calibration of modern language models (LMs). We study in-context learning (ICL), a prevalent method for adapting static LMs through tailored prompts, and examine the balance between performance and calibration across a broad spectrum of natural language understanding and reasoning tasks. Through comprehensive experiments, we observe that, with an increasing number of ICL examples, models initially exhibit increased miscalibration before achieving better calibration and miscalibration tends to arise in low-shot settings. Moreover, we find that methods aimed at improving usability, such as fine-tuning and chain-of-thought (CoT) prompting, can lead to miscalibration and unreliable natural language explanations. Furthermore, we explore recalibration techniques and find that a scaling-binning calibrator can reduce calibration errors consistently.

pdf bib
DialogBench: Evaluating LLMs as Human-like Dialogue Systems
Jiao Ou | Junda Lu | Che Liu | Yihong Tang | Fuzheng Zhang | Di Zhang | Kun Gai

Large language models (LLMs) have achieved remarkable breakthroughs in new dialogue capabilities by leveraging instruction tuning,which refreshes human impressions of dialogue systems. The long-standing goal of dialogue systems is to be human-like enough to establish long-term connections with users. Therefore, there has been an urgent need to evaluate LLMs as human-like dialogue systems. In this paper, we propose DialogBench, a dialogue evaluation benchmark that contains 12 dialogue tasks to probe the capabilities of LLMs as human-like dialogue systems should have. Specifically, we prompt GPT-4 to generate evaluation instances for each task. We first design the basic prompt based on widely used design principles and further mitigate the existing biases to generate higher-quality evaluation instances. Our extensive tests on English and Chinese DialogBench of 26 LLMs show that instruction tuning improves the human likeness of LLMs to a certain extent, but most LLMs still have much room for improvement as human-like dialogue systems. Interestingly, results also show that the positioning of assistant AI can make instruction tuning weaken the human emotional perception of LLMs and their mastery of information about human daily life.

pdf bib
GINopic: Topic Modeling with Graph Isomorphism Network
Suman Adhya | Debarshi Kumar Sanyal

Topic modeling is a widely used approach for analyzing and exploring large document collections. Recent research efforts have incorporated pre-trained contextualized language models, such as BERT embeddings, into topic modeling. However, they often neglect the intrinsic informational value conveyed by mutual dependencies between words. In this study, we introduce GINopic, a topic modeling framework based on graph isomorphism networks to capture the correlation between words. By conducting intrinsic (quantitative as well as qualitative) and extrinsic evaluations on diverse benchmark datasets, we demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.

pdf bib
CMB: A Comprehensive Medical Benchmark in Chinese
Xidong Wang | Guiming Chen | Song Dingjie | Zhang Zhiyi | Zhihong Chen | Qingying Xiao | Junying Chen | Feng Jiang | Jianquan Li | Xiang Wan | Benyou Wang | Haizhou Li

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in contextual incongruities to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at

pdf bib
Massive End-to-end Speech Recognition Models with Time Reduction
Weiran Wang | Rohit Prabhavalkar | Haozhe Shan | Zhong Meng | Dongseong Hwang | Qiujia Li | Khe Chai Sim | Bo Li | James Qin | Xingyu Cai | Adam Stooke | Chengjian Zheng | Yanzhang He | Tara Sainath | Pedro Moreno Mengibar

We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. The encoders of our models use the neural architecture of Google’s universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We also explore a few practical methods to mitigate potential accuracy loss due to time reduction, while enjoying most efficiency gain. Our methods are demonstrated to work with both Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), with up to 2B model parameters, and over two domains. For a large-scale voice search recognition task, we perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets, and show that a 900M RNN-T is very tolerant to severe time reduction, with as low encoder output frame rate as 640ms. We also provide ablation studies on the Librispeech benchmark for important training hyperparameters and architecture designs, in training 600M RNN-T models at the frame rate of 160ms.

pdf bib
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Arash Ardakani | Altan Haan | Shangyin Tan | Doru Thom Popovici | Alvin Cheung | Costin Iancu | Koushik Sen

Transformer-based models, such as BERT and ViT, have achieved state-of-the-art results across different natural language processing (NLP) and computer vision (CV) tasks. However, these models are extremely memory intensive during their fine-tuning process, making them difficult to deploy on GPUs with limited memory resources. To address this issue, we introduce a new tool called SlimFit that reduces the memory requirements of these models by dynamically analyzing their training dynamics and freezing less-contributory layers during fine-tuning. The layers to freeze are chosen using a runtime inter-layer scheduling algorithm. This allows SlimFit to freeze up to 95% of layers and reduce the overall on-device GPU memory usage of transformer-based models such as ViT and BERT by an average of 2.2x, across different NLP and CV benchmarks/datasets such as GLUE, SQuAD 2.0, CIFAR-10, CIFAR-100 and ImageNet with an average degradation of 0.2% in accuracy. For such NLP and CV tasks, SlimFit can reduce up to 3.1x the total on-device memory usage with an accuracy degradation of only up to 0.4%. As a result, while fine-tuning of ViT on ImageNet and BERT on SQuAD 2.0 with a batch size of 128 requires 3 and 2 32GB GPUs, respectively, SlimFit enables fine-tuning them on a single 32GB GPU without any significant accuracy degradation. The code of SlimFit is available at

pdf bib
Effective Large Language Model Adaptation for Improved Grounding and Citation Generation
Xi Ye | Ruoxi Sun | Sercan Arik | Tomas Pfister

Large language models (LLMs) have achieved remarkable advancements in natural language understanding and generation. However, one major issue towards their widespread deployment in the real world is that they can generate “hallucinated” answers that are not factual.Towards this end, this paper focuses on improving LLMs by grounding their responses in retrieved passages and by providing citations. We propose a new framework, AGREE, Adaptation for GRounding EnhancEment, that improves the grounding from a holistic perspective. Our framework tunes LLMs to self-ground the claims in their responses and provide accurate citations to retrieved documents. This tuning on top of the pre-trained LLMs requires well-grounded responses (with citations) for paired queries, for which we introduce a method that can automatically construct such data from unlabeled queries. The self-grounding capability of tuned LLMs further grants them a test-time adaptation (TTA) capability that can actively retrieve passages to support the claims that have not been grounded, which iteratively improves the responses of LLMs. Across five datasets and two LLMs, our results show that the proposed tuning-based framework generates superior grounded responses with more accurate citations compared to prompting-based approaches and post-hoc citing-based approaches.

pdf bib
Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Yijia Shao | Yucheng Jiang | Theodore Kanell | Peter Xu | Omar Khattab | Monica Lam

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages. This underexplored problem poses new challenges at the pre-writing stage, including how to research the topic and prepare an outline prior to writing. We propose STORM, a writing system for the Synthesis of Topic Outlines throughRetrieval and Multi-perspective Question Asking. STORM models the pre-writing stage by (1) discovering diverse perspectives in researching the given topic, (2) simulating conversations where writers carrying different perspectives pose questions to a topic expert grounded on trusted Internet sources, (3) curating the collected information to create an outline.For evaluation, we curate FreshWiki, a dataset of recent high-quality Wikipedia articles, and formulate outline assessments to evaluate the pre-writing stage. We further gather feedback from experienced Wikipedia editors. Compared to articles generated by an outline-driven retrieval-augmented baseline, more of STORM’s articles are deemed to be organized (by a 25% absolute increase) and broad in coverage (by 10%). The expert feedback also helps identify new challenges for generating grounded long articles, such as source bias transfer and over-association of unrelated facts.

pdf bib
Grounding Gaps in Language Model Generations
Omar Shaikh | Kristina Gligoric | Ashna Khetan | Matthias Gerstgrasser | Diyi Yang | Dan Jurafsky

Effective conversation requires common ground: a shared understanding between the participants. Common ground, however, does not emerge spontaneously in conversation. Speakers and listeners work together to both identify and construct a shared basis while avoiding misunderstanding. To accomplish grounding, humans rely on a range of dialogue acts, like clarification (What do you mean?) and acknowledgment (I understand.). However, it is unclear whether large language models (LLMs) generate text that reflects human grounding. To this end, we curate a set of grounding acts and propose corresponding metrics that quantify attempted grounding. We study whether LLM generations contain grounding acts, simulating turn-taking from several dialogue datasets and comparing results to humans. We find that—compared to humans—LLMs generate language with less conversational grounding, instead generating text that appears to simply presume common ground. To understand the roots of the identified grounding gap, we examine the role of instruction tuning and preference optimization, finding that training on contemporary preference data leads to a reduction in generated grounding acts. Altogether, we highlight the need for more research investigating conversational grounding in human-AI interaction.

pdf bib
When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale
Christos Baziotis | Biao Zhang | Alexandra Birch | Barry Haddow

Multilingual machine translation (MMT), trained on a mixture of parallel and monolingual data, is key for improving translation in low-resource language pairs. However, the literature offers conflicting results on the performance of different methods of including monolingual data. To resolve this, we examine how denoising autoencoding (DAE) and backtranslation (BT) impact MMT under different data conditions and model scales. Unlike prior studies, we use a realistic dataset of 100 translation directions and consider many domain combinations of monolingual and test data. We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales. BT is beneficial when the parallel, monolingual, and test data sources are similar but can be detrimental otherwise, while DAE is less effective than previously reported. Next, we analyze the impact of scale (from 90M to 1.6B parameters) and find it is important for both methods, particularly DAE. As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M to converging with BT performance at 1.6B, and even surpassing it in low-resource. These results offer new insights into how to best use monolingual data in MMT.

pdf bib
ContraSim – Analyzing Neural Representations Based on Contrastive Learning
Adir Rahamim | Yonatan Belinkov

Recent work has compared neural network representations via similarity-based analyses to improve model interpretation. The quality of a similarity measure is typically evaluated by its success in assigning a high score to representations that are expected to be matched. However, existing similarity measures perform mediocrely on standard benchmarks. In this work, we develop a new similarity measure, dubbed ContraSim, based on contrastive learning. In contrast to common closed-form similarity measures, ContraSim learns a parameterized measure by using both similar and dissimilar examples. We perform an extensive experimental evaluation of our method, with both language and vision models, on the standard layer prediction benchmark and two new benchmarks that we introduce: the multilingual benchmark and the image–caption benchmark. In all cases, ContraSim achieves much higher accuracy than previous similarity measures, even when presented with challenging examples. Finally, ContraSim is more suitable for the analysis of neural networks, revealing new insights not captured by previous measures.

pdf bib
Universal Prompt Optimizer for Safe Text-to-Image Generation
Zongyu Wu | Hongcheng Gao | Yueze Wang | Xiang Zhang | Suhang Wang

Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal **p**rompt **o**ptimizer for **s**afe T2**I** (**POSI**) generation in black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at [](

pdf bib
Language Model Based Unsupervised Dependency Parsing with Conditional Mutual Information and Grammatical Constraints
Junjie Chen | Xiangheng He | Yusuke Miyao

Previous methods based on Large Language Models (LLM) perform unsupervised dependency parsing by maximizing bi-lexical dependence scores. However, these previous methods adopt dependence scores that are difficult to interpret. These methods cannot incorporate grammatical constraints that previous grammar-based parsing research has shown beneficial to improving parsing performance. In this work, we apply Conditional Mutual Information (CMI), an interpretable metric, to measure the bi-lexical dependence and incorporate grammatical constraints into LLM-based unsupervised parsing. We incorporate Part-Of-Speech information as a grammatical constraint at the CMI estimation stage and integrate two additional grammatical constraints at the subsequent tree decoding stage. We find that the CMI score positively correlates with syntactic dependencies and has a stronger correlation with the syntactic dependencies than baseline scores. Our experiment confirms the benefits and applicability of the proposed grammatical constraints across five languages and eight datasets. The CMI parsing model outperforms state-of-the-art LLM-based models and similarly constrained grammar-based models. Our analysis reveals that the CMI model is strong in retrieving dependency relations with rich lexical interactions but is weak in retrieving relations with sparse lexical interactions, indicating a potential limitation in CMI-based unsupervised parsing methods.

pdf bib
The Bias Amplification Paradox in Text-to-Image Generation
Preethi Seshadri | Sameer Singh | Yanai Elazar

Bias amplification is a phenomenon in which models exacerbate biases or stereotypes present in the training data. In this paper, we study bias amplification in the text-to-image domain using Stable Diffusion by comparing gender ratios in training vs. generated images. We find that the model appears to amplify gender-occupation biases found in the training data (LAION) considerably. However, we discover that amplification can be largely attributed to discrepancies between training captions and model prompts. For example, an inherent difference is that captions from the training data often contain explicit gender information while our prompts do not, which leads to a distribution shift and consequently inflates bias measures. Once we account for distributional differences between texts used for training and generation when evaluating amplification, we observe that amplification decreases drastically. Our findings illustrate the challenges of comparing biases in models and their training data, as well as evaluation more broadly, and highlight how confounding factors can impact analyses.

pdf bib
Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation
Agustín Lucas | Alexis Baladón | Victoria Pardiñas | Marvin Agüero-Torales | Santiago Góngora | Luis Chiruzzo

One of the main problems low-resource languages face in NLP can be pictured as a vicious circle: data is needed to build and test tools, but the available text is scarce and there are not powerful tools to collect it.In order to break this circle for Guarani, we explore if text automatically generated from a grammar can work as a Data Augmentation technique to boost the performance of Guarani-Spanish Machine Translation (MT) systems.After building a grammar-based system that generates Spanish text and syntactically transfers it to Guarani, we perform several experiments by pretraining models using this synthetic text.We find that the MT systems that are pretrained with synthetic text perform better, even outperforming previous baselines.

pdf bib
Global Gallery: The Fine Art of Painting Culture Portraits through Multilingual Instruction Tuning
Anjishnu Mukherjee | Aylin Caliskan | Ziwei Zhu | Antonios Anastasopoulos

Exploring the intersection of language and culture in Large Language Models (LLMs), this study critically examines their capability to encapsulate cultural nuances across diverse linguistic landscapes. Central to our investigation are three research questions: the efficacy of language-specific instruction tuning, the impact of pretraining on dominant language data, and the identification of optimal approaches to elicit accurate cultural knowledge from LLMs. Utilizing the GeoMLaMA benchmark for multilingual commonsense knowledge and an adapted CAMeL dataset (English-only) for evaluation of nuanced cultural aspects, our experiments span six different languages and cultural contexts, revealing the extent of LLMs’ cultural awareness. Our findings highlight a nuanced landscape: while language-specific tuning and bilingual pretraining enhance cultural understanding in certain contexts, they also uncover inconsistencies and biases, particularly in non-Western cultures. This work expands our understanding of LLMs’ cultural competence and emphasizes the importance of integrating diverse cultural perspectives in their development, aiming for a more globally representative and equitable approach in language modeling.

pdf bib
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee | Sanghyuk Chun | Sangdoo Yun

Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce RegionVLM, equipped with explicit regional modeling capabilities, allowing them to understand user-indicated image regions. To achieve this, we design a simple yet innovative architecture, requiring no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only achieves an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.

pdf bib
ScriptMix: Mixing Scripts for Low-resource Language Parsing
Jaeseong Lee | Dohyeon Lee | Seung-won Hwang

Despite the success of multilingual pretrained language models (mPLMs) for tasks such as dependency parsing (DEP) or part-of-speech (POS) tagging, their coverage of 100s of languages is still limited, as most of the 6500+ languages remains “unseen”. To adapt mPLMs for including such unseen langs, existing work has considered transliteration and vocabulary augmentation. Meanwhile, the consideration of combining the two has been surprisingly lacking. To understand why, we identify both complementary strengths of the two, and the hurdles to realizing it. Based on this observation, we propose ScriptMix, combining two strengths, and overcoming the hurdle.Specifically, ScriptMix a) is trained with dual-script corpus to combine strengths, but b) with separate modules to avoid gradient conflict. In combining modules properly, we also point out the limitation of the conventional method AdapterFusion, and propose AdapterFusion+ to overcome it. We empirically show ScriptMix is effective– ScriptMix improves the POS accuracy by up to 14%, and improves the DEP LAS score by up to 5.6%. Our code is publicly available.

pdf bib
MT-PATCHER: Selective and Extendable Knowledge Distillation from Large Language Models for Machine Translation
Jiahuan Li | Shanbo Cheng | Shujian Huang | Jiajun Chen

Large Language Models (LLM) have demonstrated their strong ability in the field of machine translation, yet they suffer from high computational cost and latency. Therefore, transferring translation knowledge from giant LLMs to medium-sized machine translation models is a promising research direction. However, traditional knowledge distillation methods ignore the capability of student and teacher models, therefore repeatedly teaching student models on the knowledge they have learned, and failing to extend to novel contexts and knowledge. In this paper, we propose a framework called MT-Patcher, which transfers knowledge from LLMs to existing MT models in a selective, comprehensive and proactive manner. Considering the current translation ability of student MT models, we only identify and correct their translation errors, instead of distilling the whole translation from the teacher. Leveraging the strong language abilities of LLMs, we instruct LLM teachers to synthesize diverse contexts and anticipate more potential errors for the student. Experiment results on translating both specific language phenomena and general MT benchmarks demonstrate that finetuning the MT model on about 10% examples can achieve comparable results to the traditional knowledge distillation method, and synthesized potential errors and diverse contexts further improve MT performances on unseen contexts and words.

pdf bib
ToXCL: A Unified Framework for Toxic Speech Detection and Explanation
Nhat Hoang | Xuan Long Do | Duc Anh Do | Duc Anh Vu | Anh Tuan Luu

The proliferation of online toxic speech is a pertinent problem posing threats to demographic groups. While explicit toxic speech contains offensive lexical signals, implicit one consists of coded or indirect language. Therefore, it is crucial for models not only to detect implicit toxic speech but also to explain its toxicity. This draws a unique need for unified frameworks that can effectively detect and explain implicit toxic speech. Prior works mainly formulated the task of toxic speech detection and explanation as a text generation problem. Nonetheless, models trained using this strategy can be prone to suffer from the consequent error propagation problem. Moreover, our experiments reveal that the detection results of such models are much lower than those that focus only on the detection task. To bridge these gaps, we introduce ToXCL, a unified framework for the detection and explanation of implicit toxic speech. Our model consists of three modules: a (i) Target Group Generator to generate the targeted demographic group(s) of a given post; an (ii) Encoder-Decoder Model in which the encoder focuses on detecting implicit toxic speech and is boosted by a (iii) Teacher Classifier via knowledge distillation, and the decoder generates the necessary explanation. ToXCL achieves new state-of-the-art effectiveness, and outperforms baselines significantly.

pdf bib
LinkPrompt: Natural and Universal Adversarial Attacks on Prompt-based Language Models
Yue Xu | Wenjie Wang

Prompt-based learning is a new language model training paradigm that adapts the Pre-trained Language Models (PLMs) to downstream tasks, which revitalizes the performance benchmarks across various natural language processing (NLP) tasks. Instead of using a fixed prompt template to fine-tune the model, some research demonstrates the effectiveness of searching for the prompt via optimization. Such prompt optimization process of prompt-based learning on PLMs also gives insight into generating adversarial prompts to mislead the model, raising concerns about the adversarial vulnerability of this paradigm. Recent studies have shown that universal adversarial triggers (UATs) can be generated to alter not only the predictions of the target PLMs but also the prediction of corresponding Prompt-based Fine-tuning Models (PFMs) under the prompt-based learning paradigm. However, UATs found in previous works are often unreadable tokens or characters and can be easily distinguished from natural texts with adaptive defenses. In this work, we consider the naturalness of the UATs and develop LinkPrompt, an adversarial attack algorithm to generate UATs by a gradient-based beam search algorithm that not only effectively attacks the target PLMs and PFMs but also maintains the naturalness among the trigger tokens. Extensive results demonstrate the effectiveness of LinkPrompt, as well as the transferability of UATs generated by LinkPrompt to open-sourced Large Language Model (LLM) Llama2 and API-accessed LLM GPT-3.5-turbo. The resource is available at

pdf bib
CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions
Hanchong Zhang | Ruisheng Cao | Hongshen Xu | Lu Chen | Kai Yu

Recently, Large Language Models (LLMs) have been demonstrated to possess impressive capabilities in a variety of domains and tasks. We investigate the issue of prompt design in the multi-turn text-to-SQL task and attempt to enhance the LLMs’ reasoning capacity when generating SQL queries. In the conversational context, the current SQL query can be modified from the preceding SQL query with only a few operations due to the context dependency. We introduce our method called CoE-SQL which can prompt LLMs to generate the SQL query based on the previously generated SQL query with an edition chain. We also conduct extensive ablation studies to determine the optimal configuration of our approach. Our approach outperforms different in-context learning baselines stably and achieves state-of-the-art performances on two benchmarks SParC and CoSQL using LLMs, which is also competitive to the SOTA fine-tuned models.

pdf bib
ContraDoc: Understanding Self-Contradictions in Documents with Large Language Models
Jierui Li | Vipul Raheja | Dhruv Kumar

In recent times, large language models (LLMs) have shown impressive performance on various document-level tasks such as document classification, summarization, and question-answering. However, research on understanding their capabilities on the task of self-contradictions in long documents has been very limited. In this work, we introduce ContraDoc, the first human-annotated dataset to study self-contradictions in long documents across multiple domains, varying document lengths, self-contradiction types, and appearance scope. We then analyze the current capabilities of four state-of-the-art open-source and commercially available LLMs: GPT3.5, GPT4, PaLM2, and LLaMAv2 on this dataset. While GPT4 performs the best and can outperform humans on this task, we find that it is still unreliable and struggles with self-contradictions that require more nuance and context. We release the dataset and all the code associated with the experiments.

pdf bib
Entity Disambiguation via Fusion Entity Decoding
Junxiong Wang | Ali Mousavi | Omar Attia | Ronak Pradeep | Saloni Potdar | Alexander Rush | Umar Farooq Minhas | Yunyao Li

Entity disambiguation (ED), which links the mentions of ambiguous entities to their referent entities in a knowledge base, serves as a core component in entity linking (EL). Existing generative approaches demonstrate improved accuracy compared to classification approaches under the standardized ZELDA benchmark. Nevertheless, generative approaches suffer from the need for large-scale pre-training and inefficient generation. Most importantly, entity descriptions, which could contain crucial information to distinguish similar entities from each other, are often overlooked.We propose an encoder-decoder model to disambiguate entities with more detailed entity descriptions. Given text and candidate entities, the encoder learns interactions between the text and each candidate entity, producing representations for each entity candidate. The decoder then fuses the representations of entity candidates together and selects the correct entity.Our experiments, conducted on various entity disambiguation benchmarks, demonstrate the strong and robust performance of this model, particularly +1.5% in the ZELDA benchmark compared with GENRE. Furthermore, we integrate this approach into the retrieval/reader framework and observe +1.5% improvements in end-to-end entity linking in the GERBIL benchmark compared with EntQA.

pdf bib
PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers
Myeonghwa Lee | Seonho An | Min-Soo Kim

In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define **Decision QA** as the task of answering the best decision, dbest, for a decision-making question Q, business rules R and a database D. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, **DQA**. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the *iterative plan-then-retrieval augmented generation* (**PlanRAG**). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at

pdf bib
GPTScore: Evaluate as You Desire
Jinlan Fu | See-Kiong Ng | Zhengbao Jiang | Pengfei Liu

Generative Artificial Intelligence (AI) has enabled the development of sophisticated models that are capable of producing high-caliber text, images, and other outputs through the utilization of large pre-trained models.Nevertheless, assessing the quality of the generation is an even more arduous task than the generation itself, and this issue has not been given adequate consideration recently.This paper proposes a novel evaluation framework, GPTScore, which utilizes the emergent abilities (e.g., in-context learning, zero-shot instruction) of generative pre-trained models to score generated texts. There are 19 pre-trained models explored in this paper, ranging in size from 80M (e.g., Flan-T5-small) to 175B (e.g., GPT3).Experimental results on four text generation tasks, 22 evaluation aspects, and corresponding 37 datasets demonstrate that this approach can effectively allow us to achieve what one desires to evaluate for texts simply by natural language instructions.This nature helps us overcome several long-standing challenges in text evaluation–how to achieve customized, multi-faceted evaluation without model training. We make our code publicly available.

pdf bib
A Survey of Confidence Estimation and Calibration in Large Language Models
Jiahui Geng | Fengyu Cai | Yuxia Wang | Heinz Koeppl | Preslav Nakov | Iryna Gurevych

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and to outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.

pdf bib
Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References
Tianyi Tang | Hongyuan Lu | Yuchen Jiang | Haoyang Huang | Dongdong Zhang | Xin Zhao | Tom Kocmi | Furu Wei

Most research about natural language generation (NLG) relies on evaluation benchmarks with limited references for a sample, which may result in poor correlations with human judgements. The underlying reason is that one semantic meaning can actually be expressed in different forms, and the evaluation with a single or few references may not accurately reflect the quality of the model’s hypotheses. To address this issue, this paper presents a simple and effective method, named **Div-Ref**, to enhance existing evaluation benchmarks by enriching the number of references. We leverage large language models (LLMs) to diversify the expression of a single reference into multiple high-quality ones to cover the semantic space of the reference sentence as much as possible. We conduct comprehensive experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation. This idea is compatible with recent LLM-based evaluation which can similarly derive advantages from incorporating multiple references. *We strongly encourage future generation benchmarks to include more references, even if they are generated by LLMs, which is once for all.* We release all the code and data at to facilitate research.

pdf bib
Separation and Fusion: A Novel Multiple Token Linking Model for Event Argument Extraction
Jing Xu | Dandan Song | Siu Hui | Zhijing Wu | Meihuizi Jia | Hao Wang | Yanru Zhou | Changzhi Zhou | Ziyi Yang

In event argument extraction (EAE), a promising approach involves jointly encoding text and argument roles, and performing multiple token linking operations. This approach further falls into two categories. One extracts arguments within a single event, while the other attempts to extract arguments from multiple events simultaneously. However, the former lacks to leverage cross-event information and the latter requires tougher predictions with longer encoded role sequences and extra linking operations. In this paper, we design a novel separation-and-fusion paradigm to separately acquire cross-event information and fuse it into the argument extraction of a target event. Following the paradigm, we propose a novel multiple token linking model named Sep2F, which can effectively build event correlations via roles and preserve the simple linking predictions of single-event extraction. In particular, we employ one linking module to extract arguments for the target event and another to aggregate the role information of multiple events. More importantly, we propose a novel two-fold fusion module to ensure that the aggregated cross-event information serves EAE well. We evaluate our proposed model on sentence-level and document-level datasets, including ACE05, RAMS, WikiEvents and MLEE. The extensive experimental results indicate that our model outperforms the state-of-the-art EAE models on all the datasets.

pdf bib
The Integration of Semantic and Structural Knowledge in Knowledge Graph Entity Typing
Muzhi Li | Minda Hu | Irwin King | Ho-fung Leung

The Knowledge Graph Entity Typing (KGET) task aims to predict missing type annotations for entities in knowledge graphs. Recent works only utilize the structural knowledge in the local neighborhood of entities, disregarding semantic knowledge in the textual representations of entities, relations, and types that are also crucial for type inference. Additionally, we observe that the interaction between semantic and structural knowledge can be utilized to address the false-negative problem. In this paper, we propose a novel Semantic and Structure-aware KG Entity Typing (SSET) framework, which is composed of three modules. First, the Semantic Knowledge Encoding module encodes factual knowledge in the KG with a Masked Entity Typing task. Then, the Structural Knowledge Aggregation module aggregates knowledge from the multi-hop neighborhood of entities to infer missing types. Finally, the Unsupervised Type Re-ranking module utilizes the inference results from the two models above to generate type predictions that are robust to false-negative samples. Extensive experiments show that SSET significantly outperforms existing state-of-the-art methods.

pdf bib
ComCLIP: Training-Free Compositional Image and Text Matching
Kenan Jiang | Xuehai He | Ruize Xu | Xin Wang

Contrastive Language-Image Pretraining (CLIP) has demonstrated great zero-shot performance for matching images and text. However, it is still challenging to adapt vision-language pretrained models like CLIP to compositional image and text matching — a more challenging image and text matching task requiring the model’s understanding of compositional word concepts and visual components. Towards better compositional generalization in zero-shot image and text matching, in this paper, we study the problem from a causal perspective: the erroneous semantics of individual entities are essentially confounders that cause the matching failure. Therefore, we propose a novel training-free compositional CLIP model (ComCLIP). ComCLIP disentangles input images into subjects, objects, and action subimages and composes CLIP’s vision encoder and text encoder to perform evolving matching over compositional text embedding and subimage embeddings. In this way, ComCLIP can mitigate spurious correlations introduced by the pretrained CLIP models and dynamically evaluate the importance of each component. Experiments on four compositional image-text matching datasets: Winoground, VL-checklist, SVO, and ComVG, and two general image-text retrieval datasets: Flick30K, and MSCOCO demonstrate the effectiveness of our plug-and-play method, which boosts the zero-shot inference ability of CLIP, SLIP, and BLIP2 even without further training or fine-tuning. Our codes can be found at

pdf bib
ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications
Sotaro Takeshita | Tommaso Green | Ines Reinig | Kai Eckert | Simone Ponzetto

Extensive efforts in the past have been directed toward the development of summarization datasets. However, a predominant number of these resources have been (semi)-automatically generated, typically through web data crawling. This resulted in subpar resources for training and evaluating summarization systems, a quality compromise that is arguably due to the substantial costs associated with generating ground-truth summaries, particularly for diverse languages and specialized domains. To address this issue, we present ACLSum, a novel summarization dataset carefully crafted and evaluated by domain experts. In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers, covering challenges, approaches, and outcomes in depth. Through extensive experiments, we evaluate the quality of our resource and the performance of models based on pretrained language models (PLMs) and state-of-the-art large language models (LLMs). Additionally, we explore the effectiveness of extract-then-abstract versus abstractive end-to-end summarization within the scholarly domain on the basis of automatically discovered aspects. While the former performs comparably well to the end-to-end approach with pretrained language models regardless of the potential error propagation issue, the prompting-based approach with LLMs shows a limitation in extracting sentences from source documents.

pdf bib
XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners
Yun Luo | Zhen Yang | Fandong Meng | Yingjie Li | Fang Guo | Qinglin Qi | Jie Zhou | Yue Zhang

Active learning (AL), which aims to construct an effective training set by iteratively curating the most formative unlabeled data for annotation, has been widely used in low-resource tasks. Most active learning techniques in classification rely on the model’s uncertainty or disagreement to choose unlabeled data, suffering from the problem of over-confidence in superficial patterns and a lack of exploration.Inspired by the cognitive processes in which humans deduce and predict through causal information, we take an initial attempt towards integrating rationales into AL and propose a novel Explainable Active Learning framework (XAL) for low-resource text classification, which aims to encourage classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations. Specifically, besides using a pre-trained bi-directional encoder for classification, we employ a pre-trained uni-directional decoder to generate and score the explanation. We further facilitate the alignment of the model with human reasoning preference through a proposed ranking loss. During the selection of unlabeled data, the predicted uncertainty of the encoder and the explanation score of the decoder complement each other as the final metric to acquire informative data. Extensive experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines. Analysis indicates that the proposed method can generate corresponding explanations for its predictions.

pdf bib
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?
Yuchi Wang | Shuhuai Ren | Rundong Gao | Linli Yao | Qingyan Guo | Kaikai An | Jianhong Bai | Xu Sun

Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.

pdf bib
Intent-conditioned and Non-toxic Counterspeech Generation using Multi-Task Instruction Tuning with RLAIF
Amey Hengle | Aswini Padhi | Sahajpreet Singh | Anil Bandhakavi | Md Shad Akhtar | Tanmoy Chakraborty

Counterspeech, defined as a response to mitigate online hate speech, is increasingly used as a non-censorial solution. The effectiveness of addressing hate speech involves dispelling the stereotypes, prejudices, and biases often subtly implied in brief, single-sentence statements or abuses. These expressions challenge language models, especially in seq2seq tasks, as model performance typically excels with longer contexts. Our study introduces CoARL, a novel framework enhancing counterspeech generation by modeling the pragmatic implications underlying social biases in hateful statements. The first two phases of CoARL involve sequential multi-instruction tuning, teaching the model to understand intents, reactions, and harms of offensive statements, and then learning task-specific low-rank adapter weights for generating intent-conditioned counterspeech. The final phase uses reinforcement learning to fine-tune outputs for effectiveness and nontoxicity. CoARL outperforms existing benchmarks in intent-conditioned counterspeech generation, showing an average improvement of ∼3 points in intent-conformity and ∼4 points in argument-quality metrics. Extensive human evaluation supports CoARL’s efficacy in generating superior and more context-appropriate responses compared to existing systems, including prominent LLMs like ChatGPT.

pdf bib
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
Zhichen Dong | Zhanhui Zhou | Chao Yang | Jing Shao | Yu Qiao

Large Language Models (LLMs) are now commonplace in conversation applications. However, their risks of misuse for generating harmful responses have raised serious societal concerns and spurred recent research on LLM conversation safety. Therefore, in this survey, we provide a comprehensive overview of recent studies, covering three critical aspects of LLM conversation safety: attacks, defenses, and evaluations. Our goal is to provide a structured summary that enhances understanding of LLM conversation safety and encourages further investigation into this important subject. For easy reference, we have categorized all the studies mentioned in this survey according to our taxonomy, available at:

pdf bib
Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models
Weize Liu | Guocong Li | Kai Zhang | Bang Du | Qiyuan Chen | Xuming Hu | Hongxia Xu | Jintai Chen | Jian Wu

Large language models (LLMs) have achieved remarkable advancements in natural language processing. However, the massive scale and computational demands of these models present formidable challenges when considering their practical deployment in resource-constrained environments. While techniques such as chain-of-thought (CoT) distillation have displayed promise in distilling LLMs into small language models (SLMs), there is a risk that distilled SLMs may still inherit flawed reasoning and hallucinations from LLMs. To address these issues, we propose a twofold methodology: First, we introduce a novel method for distilling the self-evaluation capability from LLMs into SLMs, aiming to mitigate the adverse effects of flawed reasoning and hallucinations inherited from LLMs. Second, we advocate for distilling more comprehensive thinking by incorporating multiple distinct CoTs and self-evaluation outputs, to ensure a more thorough and robust knowledge transfer into SLMs. Experiments on three NLP benchmarks demonstrate that our method significantly improves the performance of distilled SLMs, offering a new perspective for developing more effective and efficient SLMs in resource-constrained environments.

pdf bib
Divergent Token Metrics: Measuring degradation to prune away LLM components – and optimize quantization
Björn Deiseroth | Max Meuer | Nikolas Gritsch | Constantin Eichenberg | Patrick Schramowski | Matthias Aßenmacher | Kristian Kersting

Large Language Models (LLMs) have reshaped natural language processing with their impressive capabilities. However, their ever-increasing size has raised concerns about their effective deployment and the need for LLM compression. This study introduces the Divergent Token Metrics (DTMs), a novel approach to assessing compressed LLMs, addressing the limitations of traditional perplexity or accuracy measures that fail to accurately reflect text generation quality. DTMs measure token divergences that allow deeper insights into the subtleties of model compression, in particular, when evaluating components’ impacts individually. Utilizing the First Divergent Token Metric (FDTM) in model sparsification reveals that 25% of all attention components can be pruned beyond 90% on the Llama-2 model family, still keeping SOTA performance. For quantization, FDTM suggests that more than 80% of parameters can be naively transformed to int8 without special outlier management. These evaluations indicate the necessity of choosing appropriate compressions for parameters individually—and that FDTM can identify those—while standard metrics result in deteriorated outcomes.

pdf bib
Beyond Performance: Quantifying and Mitigating Label Bias in LLMs
Yuval Reif | Roy Schwartz

Large language models (LLMs) have shown remarkable adaptability to diverse tasks, by leveraging context prompts containing instructions, or minimal input-output examples. However, recent work revealed they also exhibit *label bias*—an undesirable preference toward predicting certain answers over others. Still, detecting and measuring this bias reliably and at scale has remained relatively unexplored. In this study, we evaluate different approaches to quantifying label bias in a model’s predictions, conducting a comprehensive investigation across 279 classification tasks and ten LLMs. Our investigation reveals substantial label bias in models both before and after debiasing attempts, as well as highlights the importance of outcomes-based evaluation metrics, which were not previously used in this regard. We further propose a novel label bias calibration method tailored for few-shot prompting, which outperforms recent calibration approaches for both improving performance and mitigating label bias. Our results emphasize that label bias in the predictions of LLMs remains a barrier to their reliability.

pdf bib
Instructing Large Language Models to Identify and Ignore Irrelevant Conditions
Zhenyu Wu | Chao Shen | Meng Jiang

Math word problem (MWP) solving requires generating a reasoning path based on a given problem description that often contains irrelevant conditions.Existing chain-of-thought (CoT) prompting methods elicited multi-step reasoning abilities of large language models (LLMs) to solve MWPs.However, they were seriously confused by the irrelevant conditions, resulting in low accuracy.In this paper, we propose a novel approach named I3C that instructs LLMs to identify and ignore irrelevant conditions.It identifies a set of irrelevant condition candidates that have a weak semantic relevance with the question.Then it prompts LLMs to verify the irrelevant conditions.Lastly it instructs the LLMs with the verification on relevant and irrelevant conditions to avoid confusion and improve reasoning paths.Moreover, we propose to select (problem, reasoning paths) pairs as demonstrations to enhance I3C with few-shot reasoning. We develop I3C-Select that selects the most confusing problems based on the semantic relevance measurement.We conduct extensive experiments on eight MWP datasets.I3C can be combined with any CoT prompting methods to improve the performance of solving MWPs.Notably, with GPT-3.5-Turbo and I3C-Select, we achieve an accuracy of 96.0 and 94.1 on GSM-IC2-1K and GSM-ICM-1K, respectively, significantly outperforming the state-of-the-art few-shot prompting method Complex-CoT by +11.7 and +11.1.Our implementation is made publicly available at

pdf bib
Lower Bounds on the Expressivity of Recurrent Neural Language Models
Anej Svete | Franz Nowak | Anisha Sahabdeen | Ryan Cotterell

The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their abilities. Describing their abilities through LMs’ representational capacity is a lively area of research. Investigations of the representational capacity of neural LMs have predominantly focused on their ability to recognize formal languages. For example, recurrent neural networks (RNNs) as classifiers are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN language models (LMs), which are definitionally distributions over strings. We take a fresh look at the represen- tational capacity of RNN LMs by connecting them to probabilistic FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.

pdf bib
Transformers Can Represent n-gram Language Models
Anej Svete | Ryan Cotterell

Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language acceptance. We contend that this is an ill-suited problem in the study of language models (LMs), which are definitionally probability distributions over strings. In this paper, we focus on the relationship between transformer LMs and n-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any n-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.

pdf bib
The Role of n-gram Smoothing in the Age of Neural Networks
Luca Malagutti | Andrius Buinovskij | Anej Svete | Clara Meister | Afra Amini | Ryan Cotterell

For nearly three decades, language models derived from the n-gram assumption held the state of the art on the task. The key to their success lay in the application of various smoothing techniques that served to combat overfitting. However, when neural language models toppled n-gram models as the best performers, n-gram smoothing techniques became less relevant. Indeed, it would hardly be an understatement to suggest that the line of inquiry into n-gram smoothing techniques became dormant. This paper re-opens the role classical n-gram smoothing techniques may play in the age of neural language models. First, we draw a formal equivalence between label smoothing, a popular regularization technique for neural language models, and add-𝜆 smoothing. Second, we derive a generalized framework for converting any n-gram smoothing technique into a regularizer compatible with neural language models. Our empirical results find that our novel regularizers are comparable to and, indeed, sometimes outperform label smoothing on language modeling and machine translation.

pdf bib
Reliability Estimation of News Media Sources: Birds of a Feather Flock Together
Sergio Burdisso | Dairazalia Sanchez-cortes | Esaú Villatoro-tello | Petr Motlicek

Evaluating the reliability of news sources is a routine task for journalists and organizations committed to acquiring and disseminating accurate information.Recent research has shown that predicting sources’ reliability represents an important first-prior step in addressing additional challenges such as fake news detection and fact-checking.In this paper, we introduce a novel approach for source reliability estimation that leverages reinforcement learning strategies for estimating the reliability degree of news sources. Contrary to previous research, our proposed approach models the problem as the estimation of a reliability degree, and not a reliability label, based on how all the news media sources interact with each other on the Web.We validated the effectiveness of our method on a news media reliability dataset that is an order of magnitude larger than comparable existing datasets. Results show that the estimated reliability degrees strongly correlates with journalists-provided scores (Spearman=0.80) and can effectively predict reliability labels (macro-avg. F1 score=81.05).We release our implementation and dataset, aiming to provide a valuable resource for the NLP community working on information verification.

pdf bib
On the Multilingual Ability of Decoder-based Pre-trained Language Models: Finding and Controlling Language-Specific Neurons
Takeshi Kojima | Itsuki Okimura | Yusuke Iwasawa | Hitomi Yanaka | Yutaka Matsuo

Current decoder-based pre-trained language models (PLMs) successfully demonstrate multilingual capabilities. However, it is unclear how these models handle multilingualism.We analyze the neuron-level internal behavior of multilingual decoder-based PLMs, Specifically examining the existence of neurons that fire “uniquely for each language” within decoder-only multilingual PLMs.We analyze six languages: English, German, French, Spanish, Chinese, and Japanese, and show that language-specific neurons are unique, with a slight overlap (< 5%) between languages. These neurons are mainly distributed in the models’ first and last few layers. This trend remains consistent across languages and models.Additionally, we tamper with less than 1% of the total neurons in each model during inference and demonstrate that tampering with a few language-specific neurons drastically changes the probability of target language occurrence in text generation.

pdf bib
NLP Progress in Indigenous Latin American Languages
Atnafu Tonja | Fazlourrahman Balouchzahi | Sabur Butt | Olga Kolesnikova | Hector Ceballos | Alexander Gelbukh | Thamar Solorio

The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We show the NLP progress of indigenous Latin American languages and the survey that covers the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the current literature in understanding the need and progress of NLP for indigenous communities of Latin America, specifically low-resource and indigenous communities in general.

pdf bib
On the Effectiveness of Adversarial Robustness for Abuse Mitigation with Counterspeech
Yi-Ling Chung | Jonathan Bright

Recent work on automated approaches to counterspeech have mostly focused on synthetic data but seldom look into how the public deals with abuse. While these systems identifying and generating counterspeech have the potential for abuse mitigation, it remains unclear how robust a model is against adversarial attacks across multiple domains and how models trained on synthetic data can handle unseen user-generated abusive content in the real world. To tackle these issues, this paper first explores the dynamics of abuse and replies using our novel dataset of 6,955 labelled tweets targeted at footballers for studying public figure abuse. We then curate DynaCounter, a new English dataset of 1,911 pairs of abuse and replies addressing nine minority identity groups, collected in an adversarial human-in-the-loop process over four rounds. Our analysis shows that adversarial attacks do not necessarily result in better generalisation. We further present a study of multi-domain counterspeech generation, comparing Flan-T5 and T5 models. We observe that handling certain abuse targets is particularly challenging.

pdf bib
Leveraging the Structure of Pre-trained Embeddings to Minimize Annotation Effort
Cesar Gonzalez-Gutierrez | Ariadna Quattoni

Most current state-of-the-art approaches for text classification are based on fine-tuning the representations computed by large language models (LLMs). This strategy has led to significant improvements in classification performance and contributed to a reduction of the amount of labeled data required for training a model. However, for some challenging classification tasks, providing enough annotations to ensure a reliable classification continues to be the main bottleneck. This is especially true in settings of highly imbalanced class distributions. This paper proposes to tackle this bottleneck by exploiting the structural properties of pre-trained embeddings. We develop a label propagation method that uses pre-trained embeddings to spread information from the labeled samples to nearby samples in the induced space, ensuring the optimal use of annotations. Our approach is simple and relatively low-cost since it only requires computing some distances in the embedded space. We conduct experiments on different text classification datasets showing that the proposed method is efficient and significantly outperforms both self-training and random walk label propagation strategies.

pdf bib
UniArk: Improving Generalisation and Consistency for Factual Knowledge Extraction through Debiasing
Yijun Yang | Jie He | Pinzhen Chen | Victor Gutierrez Basulto | Jeff Pan

Several recent papers have investigated the potential of language models as knowledge bases as well as the existence of severe biases when extracting factual knowledge. In this work, we focus on the factual probing performance over unseen prompts from tuning, and using a probabilistic view we show the inherent misalignment between pre-training and downstream tuning objectives in language models for probing knowledge. We hypothesize that simultaneously debiasing these objectives can be the key to generalisation over unseen prompts. We propose an adapter-based framework, **UniArk**, for generalised and consistent factual knowledge extraction through simple methods without introducing extra parameters. Extensive experiments show that UniArk can significantly improve the model’s out-of-domain generalisation as well as consistency under various prompts. Additionally, we construct **ParaTrex**, a large-scale and diverse dataset for measuring the inconsistency and out-of-domain generation of models. Further, ParaTrex offers a reference method for constructing paraphrased datasets using large language models.

pdf bib
Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity
Soyeong Jeong | Jinheon Baek | Sukmin Cho | Sung Ju Hwang | Jong Park

Retrieval-Augmented Large Language Models (LLMs), which incorporate the non-parametric knowledge from external knowledge bases into LLMs, have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA). However, even though there are various approaches dealing with queries of different complexities, they either handle simple queries with unnecessary computational overhead or fail to adequately address complex multi-step queries; yet, not all user requests fall into only one of the simple or complex categories. In this work, we propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs from the simplest to the most sophisticated ones based on the query complexity. Also, this selection process is operationalized with a classifier, which is a smaller LM trained to predict the complexity level of incoming queries with automatically collected labels, obtained from actual predicted outcomes of models and inherent inductive biases in datasets. This approach offers a balanced strategy, seamlessly adapting between the iterative and single-step retrieval-augmented LLMs, as well as the no-retrieval methods, in response to a range of query complexities. We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems, compared to relevant baselines including the adaptive retrieval approaches. Code is available at:

pdf bib
Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method
Yukun Zhao | Lingyong Yan | Weiwei Sun | Guoliang Xing | Chong Meng | Shuaiqiang Wang | Zhicong Cheng | Zhaochun Ren | Dawei Yin

Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.However, recent literature reveals that LLMs hallucinate intermittently, which impedes their reliability for further utilization. In this paper, we propose a novel self-detection method to detect which questions an LLM does not know.Our proposal is empirical and applicable for continually upgrading LLMs compared with state-of-the-art methods. Specifically, we examine the divergence of the LLM’s behaviors on different verbalizations for a question and examine the atypicality of the verbalized input. We combine the two components to identify whether the model generates a non-factual response to the question. The above components can be accomplished by utilizing the LLM itself without referring to any other external resources. We conduct comprehensive experiments and demonstrate the effectiveness of our method for recently released LLMs involving Llama 2, Vicuna, ChatGPT, and GPT-4 across factoid question-answering, arithmetic reasoning, and commonsense reasoning tasks.

pdf bib
Are Large Language Model Temporally Grounded?
Yifu Qiu | Zheng Zhao | Yftah Ziser | Anna Korhonen | Edoardo Ponti | Shay Cohen

Are Large Language Models (LLMs) temporally grounded? Since LLMs cannot perceive and interact with the environment, it is impossible to answer this question directly. Instead, we provide LLMs with textual narratives and probe them with respect to their common-sense knowledge of the structure and duration of events, their ability to order events along a timeline, and self-consistency within their temporal model (e.g., temporal relations such as after and before are mutually exclusive for any pair of events). We evaluate state-of-the-art LLMs (such as LLaMA 2 and GPT-4) on three tasks reflecting these abilities. Generally, we find that LLMs lag significantly behind both human performance as well as small-scale, specialised LMs. In-context learning, instruction tuning, and chain-of-thought prompting reduce this gap only to a limited degree. Crucially, LLMs struggle the most with self-consistency, displaying incoherent behaviour in at least 27.23% of their predictions. Contrary to expectations, we also find that scaling the model size does not guarantee positive gains in performance. To explain these results, we study the sources from which LLMs may gather temporal information: we find that sentence ordering in unlabelled texts, available during pre-training, is only weakly correlated with event ordering. Moreover, public instruction tuning mixtures contain few temporal tasks. Hence, we conclude that current LLMs lack a consistent temporal model of textual narratives.

pdf bib
Document Image Machine Translation with Dynamic Multi-pre-trained Models Assembling
Yupu Liang | Yaping Zhang | Cong Ma | Zhiyang Zhang | Yang Zhao | Lu Xiang | Chengqing Zong | Yu Zhou

Text image machine translation (TIMT) is a task that translates source texts embedded in the image to target translations. The existing TIMT task mainly focuses on text-line-level images. In this paper, we extend the current TIMT task and propose a novel task, **D**ocument **I**mage **M**achine **T**ranslation to **Markdown** (**DIMT2Markdown**), which aims to translate a source document image with long context and complex layout structure to markdown-formatted target translation.We also introduce a novel framework, **D**ocument **I**mage **M**achine **T**ranslation with **D**ynamic multi-pre-trained models **A**ssembling (**DIMTDA**).A dynamic model assembler is used to integrate multiple pre-trained models to enhance the model’s understanding of layout and translation capabilities.Moreover, we build a novel large-scale **Do**cument image machine **T**ranslation dataset of **A**rXiv articles in markdown format (**DoTA**), containing 126K image-translation pairs.Extensive experiments demonstrate the feasibility of end-to-end translation of rich-text document images and the effectiveness of DIMTDA.

pdf bib
Elastic Weight Removal for Faithful and Abstractive Dialogue Generation
Nico Daheim | Nouha Dziri | Mrinmaya Sachan | Iryna Gurevych | Edoardo Ponti

Generating factual responses is a crucial requirement for dialogue systems. To promotemore factual responses, a common strategyis to ground their responses in relevant documents that inform response generation. However, common dialogue models still often hallucinate information that was not containedin these documents and is therefore unfaithful. In this work, we propose to alleviate suchhallucinations by ‘subtracting’ the parametersof a model trained to hallucinate from a dialogue response generation model in order to‘negate’ the contribution of such hallucinatedexamples from it. Extensive automatic and human evaluation shows favourable results whencompared to state-of-the-art methods that combine the distributions of multiple models, suchas DExperts (Liu et al., 2021), and others thatchange the training procedure, such as Quark(Lu et al., 2022a). Finally, we show how wecan not only reduce hallucinations but also discourage extractive responses, which are oftena consequence of reducing hallucinations byencouraging copy-pasting of document spans.We publicly release our code for reproducibilityand facilitating further research.

pdf bib
R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’
Hanning Zhang | Shizhe Diao | Yong Lin | Yi Fung | Qing Lian | Xingyao Wang | Yangyi Chen | Heng Ji | Tong Zhang

Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. A predominant issue is the propensity for these models to generate non-existent facts, a concern termed hallucination. Our research is motivated by the observation that previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. When the question is out of the parametric knowledge, it will try to make up something and fail to indicate when it lacks knowledge. In this paper, we present a new approach called Refusal-Aware Instruction Tuning (R-Tuning). This approach is formalized by first identifying the disparity in knowledge encompassed by pre-trained parameters compared to that of instruction tuning data. Then, we construct the refusal-aware data based on the knowledge intersection, to tune LLMs to refrain from responding to questions beyond its parametric knowledge. Experimental results demonstrate R-Tuning effectively improves a model’s ability to answer known questions and refrain from answering unknown questions. Furthermore, when tested on out-of-domain datasets, the refusal ability was found to be a meta-skill that could be generalized to other tasks. Further analysis surprisingly finds that learning the uncertainty results in better calibration and an improved ability to estimate the uncertainty than uncertainty-based testing. Our code is available at

pdf bib
Bridging the Gap between Different Vocabularies for LLM Ensemble
Yangyifan Xu | Jinliang Lu | Jiajun Zhang

Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.

pdf bib
KnowLA: Enhancing Parameter-efficient Finetuning with Knowledgeable Adaptation
Xindi Luo | Zequn Sun | Jing Zhao | Zhe Zhao | Wei Hu

Parameter-efficient finetuning (PEFT) is a key technique for adapting large language models (LLMs) to downstream tasks. In this paper, we study leveraging knowledge graph embeddings to improve the effectiveness of PEFT. We propose a knowledgeable adaptation method called KnowLA. It inserts an adaptation layer into an LLM to integrate the embeddings of entities appearing in the input text. The adaptation layer is trained in combination with LoRA on instruction data. Experiments on six benchmarks with two popular LLMs and three knowledge graphs demonstrate the effectiveness and robustness of KnowLA. We show that KnowLA can help activate the relevant parameterized knowledge in an LLM to answer a question without changing its parameters or input prompts.

pdf bib
Extremely Weakly-supervised Text Classification with Wordsets Mining and Sync-Denoising
Lysa Xiao

Extremely weakly-supervised text classification aims to classify texts without any labeled data, but only relying on class names as supervision. Existing works include prompt-based and seed-based methods. Prompt-based methods prompt language model with instructions, while seed-based methods generate pseudo-labels with word matching. Both of them have significant flaws, including zero-shot instability and context-dependent ambiguities. This paper introduces SetSync, which follows a new paradigm, i.e. wordset-based, which can avoid the above problems. In SetSync, a class is represented with wordsets, and pseudo-labels are generated with wordsets matching. To facilitate this, we propose to use information bottleneck to identify class-relevant wordsets. Moreover, we regard the classifier training as a hybrid learning of semi-supervised and noisy-labels, and propose a new training strategy, termed sync-denoising. Extensive experiments on 11 datasets show that SetSync outperforms all existing prompt and seed methods, exceeding SOTA by an impressive average of 8 points.

pdf bib
F-MALLOC: Feed-forward Memory Allocation for Continual Learning in Neural Machine Translation
Junhong Wu | Yuchen Liu | Chengqing Zong

In the evolving landscape of Neural Machine Translation (NMT), the pretrain-then-finetune paradigm has yielded impressive results. However, the persistent challenge of Catastrophic Forgetting (CF) remains a hurdle. While previous work has introduced Continual Learning (CL) methods to address CF, these approaches grapple with the delicate balance between avoiding forgetting and maintaining system extensibility. To address this, we propose a CL method, named F-MALLOC (Feed-forward Memory ALLOCation). F-MALLOC is inspired by recent insights highlighting that feed-forward layers emulate neural memories and encapsulate crucial translation knowledge. It decomposes feed-forward layers into discrete memory cells and allocates these memories to different tasks. By learning to allocate and safeguard these memories, our method effectively alleviates CF while ensuring robust extendability. Besides, we propose a comprehensive assessment protocol for multi-stage CL of NMT systems. Experiments conducted following this new protocol showcase the superior performance of F-MALLOC, evidenced by higher BLEU scores and almost zero forgetting.

pdf bib
Towards Reducing Diagnostic Errors with Interpretable Risk Prediction
Denis McInerney | William Dickinson | Lucy Flynn | Andrea Young | Geoffrey Young | Jan-Willem van de Meent | Byron Wallace

Many diagnostic errors occur because clinicians cannot easily access relevant information in patient Electronic Health Records (EHRs). In this work we propose a method to use LLMs to identify pieces of evidence in patient EHR data that indicate increased or decreased risk of specific diagnoses; our ultimate aim is to increase access to evidence and reduce diagnostic errors. In particular, we propose a Neural Additive Model to make predictions backed by evidence with individualized risk estimates at time-points where clinicians are still uncertain, aiming to specifically mitigate delays in diagnosis and errors stemming from an incomplete differential. To train such a model, it is necessary to infer temporally fine-grained retrospective labels of eventual “true” diagnoses. We do so with LLMs, to ensure that the input text is from before a confident diagnosis can be made. We use an LLM to retrieve an initial pool of evidence, but then refine this set of evidence according to correlations learned by the model. We conduct an in-depth evaluation of the usefulness of our approach by simulating how it might be used by a clinician to decide between a pre-defined list of differential diagnoses.

pdf bib
Generalizable Multilingual Hate Speech Detection on Low Resource Indian Languages using Fair Selection in Federated Learning
Akshay Singh | Rahul Thakur

Social media, originally meant for peaceful communication, now faces issues with hate speech. Detecting hate speech from social media in Indian languages with linguistic diversity and cultural nuances presents a complex and challenging task. Furthermore, traditional methods involve sharing of users’ sensitive data with a server for model training making it undesirable and involving potential risk to their privacy remained under-studied. In this paper, we combined various low-resource language datasets and propose MultiFED, a federated approach that performs effectively to detect hate speech. MultiFED utilizes continuous adaptation and fine-tuning to aid generalization using subsets of multilingual data overcoming the limitations of data scarcity. Extensive experiments are conducted on 13 Indic datasets across five different pre-trained models. The results show that MultiFED outperforms the state-of-the-art baselines by 8% (approx.) in terms of Accuracy and by 12% (approx.) in terms of F-Score.

pdf bib
Key ingredients for effective zero-shot cross-lingual knowledge transfer in generative tasks
Nadezhda Chirkova | Vassilina Nikoulina

Zero-shot cross-lingual transfer, which implies finetuning of the multilingual pretrained language model on input-output pairs in one language and using it to make task predictions for inputs in other languages, was widely studied for natural language understanding but is understudied for generation. Previous works notice a frequent problem of generation in a wrong language and propose approaches to address it, usually using mT5 as a backbone model. In this work we compare various approaches proposed from the literature in unified settings, also including alternative backbone models, namely mBART and NLLB-200. We first underline the importance of tuning learning rate used for finetuning, which helps to substantially alleviate the problem of generation in the wrong language. Then, we show that with careful learning rate tuning, the simple full finetuning of the model acts as a very strong baseline and alternative approaches bring only marginal improvements. Finally, we find that mBART performs similarly to mT5 of the same size, and NLLB-200 can be competitive in some cases. Our final zero-shot models reach the performance of the approach based on data translation which is usually considered as an upper baseline for zero-shot cross-lingual transfer in generation.

pdf bib
The Impact of Depth on Compositional Generalization in Transformer Language Models
Jackson Petty | Sjoerd Steenkiste | Ishita Dasgupta | Fei Sha | Dan Garrette | Tal Linzen

To process novel sentences, language models (LMs) must generalize compositionally—combine familiar elements in new ways. What aspects of a model’s structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling. Because model latency is approximately linear in the number of layers, these results lead us to the recommendation that, with a given total parameter budget, transformers can be made shallower than is typical without sacrificing performance.

pdf bib
Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering
Neha Srikanth | Rupak Sarkar | Heran Mane | Elizabeth Aparicio | Quynh Nguyen | Rachel Rudinger | Jordan Boyd-Graber

Questions posed by information-seeking users often contain implicit false or potentially harmful assumptions. In a high-risk domain such as maternal and infant health, a question-answering system must recognize these pragmatic constraints and go beyond simply answering user questions, examining them in context to respond helpfully. To achieve this, we study assumptions and implications, or pragmatic inferences, made when mothers ask questions about pregnancy and infant care by collecting a dataset of 2,727 inferences from 500 questions across three diverse sources. We study how health experts naturally address these inferences when writing answers, and illustrate that informing existing QA pipelines with pragmatic inferences produces responses that are more complete, mitigating the propagation of harmful beliefs.

pdf bib
Towards Explainability in Legal Outcome Prediction Models
Josef Valvoda | Ryan Cotterell

Current legal outcome prediction models - a staple of legal NLP - do not explain their reasoning. However, to employ these models in the real world, human legal actors need to be able to understand the model’s decisions. In the case of common law, legal practitioners reason towards the outcome of a case by referring to past case law, known as precedent. We contend that precedent is, therefore, a natural way of facilitating explainability for legal NLP models. In this paper, we contribute a novel method for identifying the precedent employed by legal outcome prediction models. Furthermore, by developing a taxonomy of legal precedent, we are able to compare human judges and neural models with respect to the different types of precedent they rely on. We find that while the models learn to predict outcomes reasonably well, their use of precedent is unlike that of human judges.

pdf bib
The steerability of large language models toward data-driven personas
Junyi Li | Charith Peris | Ninareh Mehrabi | Palash Goyal | Kai-Wei Chang | Aram Galstyan | Richard Zemel | Rahul Gupta

Large language models (LLMs) are known to generate biased responses where the opinions of certain groups and populations are underrepresented. Here, we present a novel approach to achieve controllable generation of specific viewpoints using LLMs, that can be leveraged to produce multiple perspectives and to reflect the diverse opinions. Moving beyond the traditional reliance on demographics like age, gender, or party affiliation, we introduce a data-driven notion of persona grounded in collaborative filtering, which is defined as either a single individual or a cohort of individuals manifesting similar views across specific inquiries. As individuals in the same demographic group may have different personas, our data-driven persona definition allows for a more nuanced understanding of different (latent) social groups present in the population. In addition to this, we also explore an efficient method to steer LLMs toward the personas that we define. We show that our data-driven personas significantly enhance model steerability, with improvements of between 57%-77% over our best performing baselines.

pdf bib
CCSum: A Large-Scale and High-Quality Dataset for Abstractive News Summarization
Xiang Jiang | Markus Dreyer

Training a supervised news summarization model requires large amounts of high-quality training data consisting of news articles paired with reference summaries. However, obtaining such data is costly, and existing datasets contain considerable amount of noise. We present a new large-scale and high-quality dataset for supervised abstractive news summarization containing 1.3 million training samples, which we call CCSum. In creating this dataset, we take advantage of the journalistic inverted-pyramid style in news writing: In some articles, the first sentence can be considered a summary of the reported story. Accordingly, among 35 million CommonCrawl News articles, we identify pairs of articles about the same news story and use one article’s first sentence as the summary for the other article. To ensure high quality, we apply strict filters whose parameters we optimize using Bayesian optimization. We show that the resulting dataset is more factual and informative than established summarization datasets; less than 1% of the summaries have major factual inconsistencies with the corresponding news articles, compared to 5.5% to 15.4% in existing datasets, according to our human evaluation. Summarization models trained on our dataset are more favored compared to those trained on CNN/Daily Mail. The proposed dataset can open new opportunities for future research in abstractive summarization.

pdf bib
Capturing Perspectives of Crowdsourced Annotators in Subjective Learning Tasks
Negar Mokhberian | Myrl Marmarelis | Frederic Hopp | Valerio Basile | Fred Morstatter | Kristina Lerman

Supervised classification heavily depends on datasets annotated by humans. However, in subjective tasks such as toxicity classification, these annotations often exhibit low agreement among raters. Annotations have commonly been aggregated by employing methods like majority voting to determine a single ground truth label. In subjective tasks, aggregating labels will result in biased labeling and, consequently, biased models that can overlook minority opinions. Previous studies have shed light on the pitfalls of label aggregation and have introduced a handful of practical approaches to tackle this issue. Recently proposed multi-annotator models, which predict labels individually per annotator, are vulnerable to under-determination for annotators with few samples. This problem is exacerbated in crowdsourced datasets. In this work, we propose Annotator Aware Representations for Texts (AART) for subjective classification tasks. Our approach involves learning representations of annotators, allowing for exploration of annotation behaviors. We show the improvement of our method on metrics that assess the performance on capturing individual annotators’ perspectives. Additionally, we demonstrate fairness metrics to evaluate our model’s equability of performance for marginalized annotators compared to others.

pdf bib
Improving Factual Accuracy of Neural Table-to-Text Output by Addressing Input Problems in ToTTo
Barkavi Sundararajan | Yaji Sripada | Ehud Reiter

Neural Table-to-Text models tend to hallucinate, producing texts that contain factual errors. We investigate whether such errors in the output can be traced back to problems with the input. We manually annotated 1,837 texts generated by multiple models in the politics domain of the ToTTo dataset. We identify the input problems that are responsible for many output errors and show that fixing these inputs reduces factual errors by between 52% and 76% (depending on the model). In addition, we observe that models struggle in processing tabular inputs that are structured in a non-standard way, particularly when the input lacks distinct row and column values or when the column headers are not correctly mapped to corresponding values.

pdf bib
CERET: Cost-Effective Extrinsic Refinement for Text Generation
Jason Cai | Hang Su | Monica Sunkara | Igor Shalyminov | Saab Mansour

Large Language Models (LLMs) are powerful models for generation tasks, but they may not generate good quality outputs in their first attempt. Apart from model fine-tuning, existing approaches to improve prediction accuracy and quality typically involve LLM self-improvement / self-reflection that incorporate feedback from models themselves. Despite their effectiveness, these methods are hindered by their high computational cost and lack of scalability. In this work, we propose CERET, a method for refining text generations by considering semantic stability, entailment and inter-sample uncertainty measures. Experimental results show that CERET outperforms Self-consistency and Self-rerank baselines consistently under various task setups, by 1.6% in Rouge-1 for abstractive summarization and 3.5% in hit rate for question answering. Compared to LLM Self-rerank method, our approach only requires 9.4% of its latency and is more cost-effective.

pdf bib
Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling
Subhendu Khatuya | Rajdeep Mukherjee | Akash Ghosh | Manjunath Hegde | Koustuv Dasgupta | Niloy Ganguly | Saptarshi Ghosh | Pawan Goyal

We study the problem of automatically annotating relevant numerals (GAAP metrics) occurring in the financial documents with their corresponding XBRL tags. Different from prior works, we investigate the feasibility of solving this extreme classification problem using a generative paradigm through instruction tuning of Large Language Models (LLMs). To this end, we leverage metric metadata informationto frame our target outputs while proposing a parameter efficient solution for the task using LoRA. We perform experiments on two recently released financial numeric labeling datasets. Our proposed model, **FLAN-FinXC**, achieves new state-of-the-art performances on both the datasets, outperforming several strong baselines. We explain the better scores of our proposed model by demonstrating its capability for zero-shot as well as the least frequently occurring tags. Also, even when we fail to predict the XBRL tags correctly, our generated output has substantial overlap with the ground-truth in majority of the cases.

pdf bib
Analysis of State-Level Legislative Process in Enhanced Linguistic and Nationwide Network Contexts
Maryam Davoodi | Dan Goldwasser

State bills have a significant impact on various aspects of society, including health, education, and the economy. Consequently, it is crucial to conduct systematic research on state bills before and after they are enacted to evaluate their benefits and drawbacks, thereby guiding future decision-making. In this work, we developed the first state-level deep learning framework that (1) handles the complex and inconsistent language of policies across US states using generative large language models and (2) decodes legislators’ behavior and implications of state policies by establishing a shared nationwide network, enriched with diverse contexts, such as information on interest groups influencing public policy and legislators’ courage test results, which reflect their political positions.

pdf bib
DeMuX: Data-efficient Multilingual Learning
Simran Khanuja | Srinivas Gowriraj | Lucio Dery | Graham Neubig

Pre-trained multilingual models have enabled deployment of NLP technologies for multiple languages. However, optimally fine-tuning these models under an annotation budget, such that performance on desired target languages is jointly maximized, still remains an open question. In this paper, we introduce DeMuX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set. Unlike most prior works, our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations. Our active learning strategies rely upon distance and uncertainty measures to select task-specific neighbors that are most informative to label, given a model. DeMuX outperforms strong baselines in 84% of the test cases, in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), we observe gains of up to 8-11 F1 points. Our code is released here:

pdf bib
DUQGen: Effective Unsupervised Domain Adaptation of Neural Rankers by Diversifying Synthetic Query Generation
Ramraj Chandradevan | Kaustubh Dhole | Eugene Agichtein

State-of-the-art neural rankers pre-trained on large task-specific training data such as MS-MARCO, have been shown to exhibit strong performance on various ranking tasks without domain adaptation, also called zero-shot. However, zero-shot neural ranking may be sub-optimal, as it does not take advantage of the target domain information. Unfortunately, acquiring sufficiently large and high quality target training data to improve a modern neural ranker can be costly and time-consuming. To address this problem, we propose a new approach to unsupervised domain adaptation for ranking, DUQGen, which addresses a critical gap in prior literature, namely how to automatically generate both effective and diverse synthetic training data to fine tune a modern neural ranker for a new domain. Specifically, DUQGen produces a more effective representation of the target domain by identifying clusters of similar documents; and generates a more diverse training dataset by probabilistic sampling over the resulting document clusters. Our extensive experiments, over the standard BEIR collection, demonstrate that DUQGen consistently outperforms all zero-shot baselines and substantially outperforms the SOTA baselines on 16 out of 18 datasets, for an average of 4% relative improvement across all datasets. We complement our results with a thorough analysis for more in-depth understanding of the proposed method’s performance and to identify promising areas for further improvements.

pdf bib
How did we get here? Summarizing conversation dynamics
Yilun Hua | Nicholas Chernogor | Yuzhe Gu | Seoyeon Jeong | Miranda Luo | Cristian Danescu-Niculescu-Mizil

Throughout a conversation, the way participants interact with each other is in constant flux: their tones may change, they may resort to different strategies to convey their points, or they might alter their interaction patterns. An understanding of these dynamics can complement that of the actual facts and opinions discussed, offering a more holistic view of the trajectory of the conversation: how it arrived at its current state and where it is likely heading.In this work, we introduce the task of summarizing the dynamics of conversations, by constructing a dataset of human-written summaries, and exploring several automated baselines. We evaluate whether such summaries can capture the trajectory of conversations via an established downstream task: forecasting whether an ongoing conversation will eventually derail into toxic behavior. We show that they help both humans and automated systems with this forecasting task. Humans make predictions three times faster, and with greater confidence, when reading the summaries than when reading the transcripts. Furthermore, automated forecasting systems are more accurate when constructing, and then predicting based on, summaries of conversation dynamics, compared to directly predicting on the transcripts.

pdf bib
Can Language Model Moderators Improve the Health of Online Discourse?
Hyundong Cho | Shuai Liu | Taiwei Shi | Darpan Jain | Basem Rizk | Yuyang Huang | Zixun Lu | Nuan Wen | Jonathan Gratch | Emilio Ferrara | Jonathan May

Conversational moderation of online communities is crucial to maintaining civility for a constructive environment, but it is challenging to scale and harmful to moderators. The inclusion of sophisticated natural language generation modules as a force multiplier to aid human moderators is a tantalizing prospect, but adequate evaluation approaches have so far been elusive. In this paper, we establish a systematic definition of conversational moderation effectiveness grounded on moderation literature and establish design criteria for conducting realistic yet safe evaluation. We then propose a comprehensive evaluation framework to assess models’ moderation capabilities independently of human intervention. With our framework, we conduct the first known study of language models as conversational moderators, finding that appropriately prompted models that incorporate insights from social science can provide specific and fair feedback on toxic behavior but struggle to influence users to increase their levels of respect and cooperation.

pdf bib
LeanReasoner: Boosting Complex Logical Reasoning with Lean
Dongwei Jiang | Marcio Fonseca | Shay Cohen

Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty ofsuch reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing logical reasoning problems intotheorems within Lean, we can solve them by proving or disproving the corresponding theorems. This method reduces the risk of logical inconsistencies with the help of Lean’s symbolic solver. It also enhances our ability to treat complex reasoning tasks using Lean’s extensive library of theorem proofs. Our method achieves state-of-the-art performance on the FOLIO dataset and achieves performance near this level on ProofWriter. Notably, these results were accomplished by fine-tuning on fewer than 100 in-domain samples for each dataset

pdf bib
UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
Jason Wu | Eldon Schoop | Alan Leung | Titus Barik | Jeffrey Bigham | Jeffrey Nichols

Many large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely either on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset, and producing a new LLM by finetuning the original on the refined dataset.We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences.Our results show the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

pdf bib
Measuring Cross-lingual Transfer in Bytes
Leandro De Souza | Thales Almeida | Roberto Lotufo | Rodrigo Frassetto Nogueira

Multilingual pretraining has been a successful solution to the challenges posed by the lack of resources for languages. These models can transfer knowledge to target languages with minimal or no examples. Recent research suggests that monolingual models also have a similar capability, but the mechanisms behind this transfer remain unclear. Some studies have explored factors like language contamination and syntactic similarity. An emerging line of research suggests that the representations learned by language models contain two components: a language-specific and a language-agnostic component. The latter is responsible for transferring a more universal knowledge. However, there is a lack of comprehensive exploration of these properties across diverse target languages. To investigate this hypothesis, we conducted an experiment inspired by the work on the Scaling Laws for Transfer. We measured the amount of data transferred from a source language to a target language and found that models initialized from diverse languages perform similarly to a target language in a cross-lingual setting. This was surprising because the amount of data transferred to 10 diverse target languages, such as Spanish, Korean, and Finnish, was quite similar. We also found evidence that this transfer is not related to language contamination or language proximity, which strengthens the hypothesis that the model also relies on language-agnostic knowledge. Our experiments have opened up new possibilities for measuring how much data represents the language-agnostic representations learned during pretraining.

pdf bib
MisgenderMender: A Community-Informed Approach to Interventions for Misgendering
Tamanna Hossain | Sunipa Dev | Sameer Singh

Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering.Misgendering, the act of incorrectly addressing someone’s gender, inflicts serious harm and is pervasive in everyday technologies, yet there is a notable lack of research to combat it. We are the first to address this lack of research into interventions for misgendering by conducting a survey of gender-diverse individuals in the US to understand perspectives about automated interventions for text-based misgendering. Based on survey insights on the prevalence of misgendering, desired solutions, and associated concerns, we introduce a misgendering interventions task and evaluation dataset, MisgenderMender. We define the task with two sub-tasks: (i) detecting misgendering, followed by (ii) correcting misgendering where misgendering is present, in domains where editing is appropriate. MisgenderMender comprises 3790 instances of social media content and LLM-generations about non-cisgender public figures, annotated for the presence of misgendering, with additional annotations for correcting misgendering in LLM-generated text. Using this dataset, we set initial benchmarks by evaluating existing NLP systems and highlighting challenges for future models to address. We release the full dataset, code, and demo at

pdf bib
Interplay of Machine Translation, Diacritics, and Diacritization
Wei-Rui Chen | Ife Adebara | Muhammad Abdul-Mageed

We investigate two research questions: (1) how do machine translation (MT) and diacritization influence the performance of each other in a multi-task learning setting (2) the effect of keeping (vs. removing) diacritics on MT performance. We examine these two questions in both high-resource (HR) and low-resource (LR) settings across 55 different languages (36 African languages and 19 European languages). For (1), results show that diacritization significantly benefits MT in the LR scenario, doubling or even tripling performance for some languages, but harms MT in the HR scenario. We find that MT harms diacritization in LR but benefits significantly in HR for some languages. For (2), MT performance is similar regardless of diacritics being kept or removed. In addition, we propose two classes of metrics to measure the complexity of a diacritical system, finding these metrics to correlate positively with the performance of our diacritization models. Overall, our work provides insights for developing MT and diacritization systems under different data size conditions and may have implications that generalize beyond the 55 languages we investigate.

pdf bib
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
Ming Li | Yong Zhang | Zhitao Li | Jiuhai Chen | Lichang Chen | Ning Cheng |