uppdf
bib
abs
AmbiFC: Fact-Checking Ambiguous Claims with Evidence
Max Glockner
|
Ieva Staliūnaitė
|
James Thorne
|
Gisela Vallejo
|
Andreas Vlachos
|
Iryna Gurevych
Automated fact-checking systems verify claims against evidence to predict their veracity. In real-world scenarios, the retrieved evidence may not unambiguously support or refute the claim and yield conflicting but valid interpretations. Existing fact-checking datasets assume that the models developed with them predict a single veracity label for each claim, thus discouraging the handling of such ambiguity. To address this issue we present AmbiFC,1 a fact-checking dataset with 10k claims derived from real-world information needs. It contains fine-grained evidence annotations of 50k passages from 5k Wikipedia pages. We analyze the disagreements arising from ambiguity when comparing claims against evidence in AmbiFC, observing a strong correlation of annotator disagreement with linguistic phenomena such as underspecification and probabilistic reasoning. We develop models for predicting veracity handling this ambiguity via soft labels, and find that a pipeline that learns the label distribution for sentence-level evidence selection and veracity prediction yields the best performance. We compare models trained on different subsets of AmbiFC and show that models trained on the ambiguous instances perform better when faced with the identified linguistic phenomena.
pdf
bib
abs
Language Varieties of Italy: Technology Challenges and Opportunities
Alan Ramponi
Italy is characterized by a one-of-a-kind linguistic diversity landscape in Europe, which implicitly encodes local knowledge, cultural traditions, artistic expressions, and history of its speakers. However, most local languages and dialects in Italy are at risk of disappearing within a few generations. The NLP community has recently begun to engage with endangered languages, including those of Italy. Yet, most efforts assume that these varieties are under-resourced language monoliths with an established written form and homogeneous functions and needs, and thus highly interchangeable with each other and with high-resource, standardized languages. In this paper, we introduce the linguistic context of Italy and challenge the default machine-centric assumptions of NLP for Italy’s language varieties. We advocate for a shift in the paradigm from machine-centric to speaker-centric NLP, and provide recommendations and opportunities for work that prioritizes languages and their speakers over technological advances. To facilitate the process, we finally propose building a local community towards responsible, participatory efforts aimed at supporting vitality of languages and dialects of Italy.
pdf
bib
abs
Benchmarking Large Language Models for News Summarization
Tianyi Zhang
|
Faisal Ladhak
|
Esin Durmus
|
Percy Liang
|
Kathleen McKeown
|
Tatsunori B. Hashimoto
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, not model size, is the key to the LLM’s zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
pdf
bib
abs
mGPT: Few-Shot Learners Go Multilingual
Oleh Shliazhko
|
Alena Fenogenova
|
Maria Tikhonova
|
Anastasia Kozlova
|
Vladislav Mikhailov
|
Tatiana Shavrina
This paper introduces mGPT, a multilingual variant of GPT-3, pretrained on 61 languages from 25 linguistically diverse language families using Wikipedia and the C4 Corpus. We detail the design and pretraining procedure. The models undergo an intrinsic and extrinsic evaluation: language modeling in all languages, downstream evaluation on cross-lingual NLU datasets and benchmarks in 33 languages, and world knowledge probing in 23 languages. The in-context learning abilities are on par with the contemporaneous language models while covering a larger number of languages, including underrepresented and low-resource languages of the Commonwealth of Independent States and the indigenous peoples in Russia. The source code and the language models are publicly available under the MIT license.
pdf
bib
abs
Cultural Adaptation of Recipes
Yong Cao
|
Yova Kementchedjhieva
|
Ruixiang Cui
|
Antonia Karamolegkou
|
Li Zhou
|
Megan Dare
|
Lucia Donatelli
|
Daniel Hershcovich
Building upon the considerable advances in Large Language Models (LLMs), we are now equipped to address more sophisticated tasks demanding a nuanced understanding of cross-cultural contexts. A key example is recipe adaptation, which goes beyond simple translation to include a grasp of ingredients, culinary techniques, and dietary preferences specific to a given culture. We introduce a new task involving the translation and cultural adaptation of recipes between Chinese- and English-speaking cuisines. To support this investigation, we present CulturalRecipes, a unique dataset composed of automatically paired recipes written in Mandarin Chinese and English. This dataset is further enriched with a human-written and curated test set. In this intricate task of cross-cultural recipe adaptation, we evaluate the performance of various methods, including GPT-4 and other LLMs, traditional machine translation, and information retrieval techniques. Our comprehensive analysis includes both automatic and human evaluation metrics. While GPT-4 exhibits impressive abilities in adapting Chinese recipes into English, it still lags behind human expertise when translating English recipes into Chinese. This underscores the multifaceted nature of cultural adaptations. We anticipate that these insights will significantly contribute to future research on culturally aware language models and their practical application in culturally diverse contexts.
pdf
bib
abs
Metric-Free Learning Network with Dual Relations Propagation for Few-Shot Aspect Category Sentiment Analysis
Shiman Zhao
|
Yutao Xie
|
Wei Chen
|
Tengjiao Wang
|
Jiahui Yao
|
Jiabin Zheng
Few-shot Aspect Category Sentiment Analysis (ACSA) is a crucial task for aspect-based sentiment analysis, which aims to detect sentiment polarity for a given aspect category in a sentence with limited data. However, few-shot learning methods focus on distance metrics between the query and support sets to classify queries, heavily relying on aspect distributions in the embedding space. Thus, they suffer from overlapping distributions of aspect embeddings caused by irrelevant sentiment noise among sentences with multiple sentiment aspects, leading to misclassifications. To solve the above issues, we propose a metric-free method for few-shot ACSA, which models the associated relations among the aspects of support and query sentences by Dual Relations Propagation (DRP), addressing the passive effect of overlapping distributions. Specifically, DRP uses the dual relations (similarity and diversity) among the aspects of support and query sentences to explore intra-cluster commonality and inter-cluster uniqueness for alleviating sentiment noise and enhancing aspect features. Additionally, the dual relations are transformed from support-query to class-query to promote query inference by learning class knowledge. Experiments show that we achieve convincing performance on few-shot ACSA, especially an average improvement of 2.93% accuracy and 2.10% F1 score in the 3-way 1-shot setting.
pdf
bib
abs
Addressing the Binning Problem in Calibration Assessment through Scalar Annotations
Zhengping Jiang
|
Anqi Liu
|
Benjamin Van Durme
Computational linguistics models commonly target the prediction of discrete—categorical—labels. When assessing how well-calibrated these model predictions are, popular evaluation schemes require practitioners to manually determine a binning scheme: grouping labels into bins to approximate true label posterior. The problem is that these metrics are sensitive to binning decisions. We consider two solutions to the binning problem that apply at the stage of data annotation: collecting either distributed (redundant) labels or direct scalar value assignment. In this paper, we show that although both approaches address the binning problem by evaluating instance-level calibration, direct scalar assignment is significantly more cost-effective. We provide theoretical analysis and empirical evidence to support our proposal for dataset creators to adopt scalar annotation protocols to enable a higher-quality assessment of model calibration.
pdf
bib
abs
An Energy-based Model for Word-level AutoCompletion in Computer-aided Translation
Cheng Yang
|
Guoping Huang
|
Mo Yu
|
Zhirui Zhang
|
Siheng Li
|
Mingming Yang
|
Shuming Shi
|
Yujiu Yang
|
Lemao Liu
Word-level AutoCompletion (WLAC) is a rewarding yet challenging task in Computer-aided Translation. Existing work addresses this task through a classification model based on a neural network that maps the hidden vector of the input context into its corresponding label (i.e., the candidate target word is treated as a label). Since the context hidden vector itself does not take the label into account and it is projected to the label through a linear classifier, the model cannot sufficiently leverage valuable information from the source sentence as verified in our experiments, which eventually hinders its overall performance. To alleviate this issue, this work proposes an energy-based model for WLAC, which enables the context hidden vector to capture crucial information from the source sentence. Unfortunately, training and inference suffer from efficiency and effectiveness challenges, therefore we employ three simple yet effective strategies to put our model into practice. Experiments on four standard benchmarks demonstrate that our reranking-based approach achieves substantial improvements (about 6.07%) over the previous state-of-the-art model. Further analyses show that each strategy of our approach contributes to the final performance.1
pdf
bib
abs
Lost in the Middle: How Language Models Use Long Contexts
Nelson F. Liu
|
Kevin Lin
|
John Hewitt
|
Ashwin Paranjape
|
Michele Bevilacqua
|
Fabio Petroni
|
Percy Liang
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
pdf
bib
abs
Red Teaming Language Model Detectors with Language Models
Zhouxing Shi
|
Yihan Wang
|
Fan Yin
|
Xiangning Chen
|
Kai-Wei Chang
|
Cho-Jui Hsieh
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
pdf
bib
abs
Text Attribute Control via Closed-Loop Disentanglement
Lei Sha
|
Thomas Lukasiewicz
Changing an attribute of a text without changing the content usually requires first disentangling the text into irrelevant attributes and content representations. After that, in the inference phase, the representation of one attribute is tuned to a different value, expecting that the corresponding attribute of the text can also be changed accordingly. The usual way of disentanglement is to add some constraints on the latent space of an encoder-decoder architecture, including adversarial-based constraints and mutual-information-based constraints. However, previous semi-supervised processes of attribute change are usually not enough to guarantee the success of attribute change and content preservation. In this paper, we propose a novel approach to achieve a robust control of attributes while enhancing content preservation. In this approach, we use a semi-supervised contrastive learning method to encourage the disentanglement of attributes in latent spaces. Differently from previous works, we re-disentangle the reconstructed sentence and compare the re-disentangled latent space with the original latent space, which makes a closed-loop disentanglement process. This also helps content preservation. In addition, the contrastive learning method is also able to replace the role of minimizing mutual information and adversarial training in the disentanglement process, which alleviates the computation cost. We conducted experiments on three text datasets, including the Yelp Service review dataset, the Amazon Product review dataset, and the GoEmotions dataset. The experimental results show the effectiveness of our model.
pdf
bib
abs
Unifying Structured Data as Graph for Data-to-Text Pre-Training
Shujie Li
|
Liang Li
|
Ruiying Geng
|
Min Yang
|
Binhua Li
|
Guanghu Yuan
|
Wanwei He
|
Shao Yuan
|
Can Ma
|
Fei Huang
|
Yongbin Li
Data-to-text (D2T) generation aims to transform structured data into natural language text. Data-to-text pre-training has proved to be powerful in enhancing D2T generation and yields impressive performance. However, previous pre-training methods either oversimplified structured data into a sequence without considering input structures or designed training objectives tailored for a specific data structure (e.g., table or knowledge graph). In this paper, we unify different types of structured data (i.e., table, key-value data, knowledge graph) into the graph format and cast different D2T generation tasks as graph-to-text generation. To effectively exploit the structural information of the input graph, we propose a structure-enhanced pre-training method for D2T generation by designing a structure-enhanced Transformer. Concretely, we devise a position matrix for the Transformer, encoding relative positional information of connected nodes in the input graph. In addition, we propose a new attention matrix to incorporate graph structures into the original Transformer by taking the available explicit connectivity structure into account. Extensive experiments on six benchmark datasets show the effectiveness of our model. Our source codes are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/unid2t.
pdf
bib
abs
Exploring Human-Like Translation Strategy with Large Language Models
Zhiwei He
|
Tian Liang
|
Wenxiang Jiao
|
Zhuosheng Zhang
|
Yujiu Yang
|
Rui Wang
|
Zhaopeng Tu
|
Shuming Shi
|
Xing Wang
Large language models (LLMs) have demonstrated impressive capabilities in general scenarios, exhibiting a level of aptitude that approaches, in some aspects even surpasses, human-level intelligence. Among their numerous skills, the translation abilities of LLMs have received considerable attention. Compared to typical machine translation that focuses solely on source-to-target mapping, LLM-based translation can potentially mimic the human translation process, which might take preparatory steps to ensure high-quality translation. This work explores this possibility by proposing the MAPS framework, which stands for Multi-Aspect Prompting and Selection. Specifically, we enable LLMs first to analyze the given source sentence and induce three aspects of translation-related knowledge (keywords, topics, and relevant demonstrations) to guide the final translation process. Moreover, we employ a selection mechanism based on quality estimation to filter out noisy and unhelpful knowledge. Both automatic (3 LLMs × 11 directions × 2 automatic metrics) and human evaluation (preference study and MQM) demonstrate the effectiveness of MAPS. Further analysis shows that by mimicking the human translation process, MAPS reduces various translation errors such as hallucination, ambiguity, mistranslation, awkward style, untranslated text, and omission. Source code is available at https://github.com/zwhe99/MAPS-mt.
pdf
bib
abs
Retrieve What You Need: A Mutual Learning Framework for Open-domain Question Answering
Dingmin Wang
|
Qiuyuan Huang
|
Matthew Jackson
|
Jianfeng Gao
An open-domain question answering (QA) system usually follows a retrieve-then-read paradigm, in which a retriever is used to retrieve relevant passages from a large corpus, and then a reader generates answers based on the retrieved passages and the original question. In this paper, we propose a simple and novel mutual learning framework to improve the performance of retrieve-then-read-style models via an intermediate module named the knowledge selector, which we train with reinforcement learning. The key benefits of our proposed intermediate module are: 1) no requirement for additional annotated question-passage pairs; 2) improvements in both retrieval and QA performance, as well as computational efficiency, compared to prior competitive retrieve-then-read models; 3) with no finetuning, improvement in the zero-shot performance of large-scale pre-trained language models, e.g., ChatGPT, by encapsulating the input with relevant knowledge without violating the input length constraint.
pdf
bib
abs
Explicitly Representing Syntax Improves Sentence-to-Layout Prediction of Unexpected Situations
Wolf Nuyts
|
Ruben Cartuyvels
|
Marie-Francine Moens
Recognizing visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis as it allows localized and controlled in-painting of the image. In this comparative study it is shown that we can predict layouts from language representations that implicitly or explicitly encode sentence syntax, if the sentences mention similar entity-relationships to the ones seen during training. To test compositional understanding, we collect a test set of grammatically correct sentences and layouts describing compositions of entities and relations that unlikely have been seen during training. Performance on this test set substantially drops, showing that current models rely on correlations in the training data and have difficulties in understanding the structure of the input sentences. We propose a novel structural loss function that better enforces the syntactic structure of the input sentence and show large performance gains in the task of 2D spatial layout prediction conditioned on text. The loss has the potential to be used in other generation tasks where a tree-like structure underlies the conditioning modality. Code, trained models, and the USCOCO evaluation set are available via Github.1
pdf
bib
abs
Evaluating the Ripple Effects of Knowledge Editing in Language Models
Roi Cohen
|
Eden Biran
|
Ori Yoran
|
Amir Globerson
|
Mor Geva
Modern language models capture a large body of factual knowledge. However, some facts can be incorrectly induced or become obsolete over time, resulting in factually incorrect generations. This has led to the development of various editing methods that allow updating facts encoded by the model. Evaluation of these methods has primarily focused on testing whether an individual fact has been successfully injected, and if similar predictions for other subjects have not changed. Here we argue that such evaluation is limited, since injecting one fact (e.g., “Jack Depp is the son of Johnny Depp”) introduces a “ripple effect” in the form of additional facts that the model needs to update (e.g., “Jack Depp is the sibling of Lily-Rose Depp”). To address this, we propose novel evaluation criteria that consider the implications of an edit on related facts. Using these criteria, we then construct RippleEdits, a diagnostic benchmark of 5K factual edits, capturing various types of ripple effects. We evaluate prominent editing methods on RippleEdits, showing that they fail to introduce consistent changes in the model’s knowledge. In addition, we find that a simple in-context editing baseline obtains the best scores on our benchmark, suggesting a promising research direction for model editing.1
pdf
bib
abs
The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler
|
Matthieu Labeau
|
Chloé Clavel
When deriving contextualized word representations from language models, a decision needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other interesting findings, that the quality of representations of words that are split is often, but not always, worse than that of the embeddings of known words. Their similarity values, however, must be interpreted with caution.
pdf
bib
abs
Large Language Models Enable Few-Shot Clustering
Vijay Viswanathan
|
Kiril Gashteovski
|
Kiril Gashteovski
|
Carolin Lawrence
|
Tongshuang Wu
|
Graham Neubig
Unlike traditional unsupervised clustering, semi-supervised clustering allows users to provide meaningful structure to the data, which helps the clustering algorithm to match the user’s intent. Existing approaches to semi-supervised clustering require a significant amount of feedback from an expert to improve the clusters. In this paper, we ask whether a large language model (LLM) can amplify an expert’s guidance to enable query-efficient, few-shot semi-supervised text clustering. We show that LLMs are surprisingly effective at improving clustering. We explore three stages where LLMs can be incorporated into clustering: before clustering (improving input features), during clustering (by providing constraints to the clusterer), and after clustering (using LLMs post-correction). We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality, and that LLMs enable a user to make trade-offs between cost and accuracy to produce desired clusters. We release our code and LLM prompts for the public to use.1
pdf
bib
abs
JustiLM: Few-shot Justification Generation for Explainable Fact-Checking of Real-world Claims
Fengzhu Zeng
|
Wei Gao
Justification is an explanation that supports the veracity assigned to a claim in fact-checking. However, the task of justification generation has been previously oversimplified as summarization of a fact-check article authored by fact-checkers. Therefore, we propose a realistic approach to generate justification based on retrieved evidence. We present a new benchmark dataset called ExClaim (for Explainable fact-checking of real-world Claims), and introduce JustiLM, a novel few-shot Justification generation based on retrieval-augmented Language Model by using fact-check articles as an auxiliary resource during training only. Experiments show that JustiLM achieves promising performance in justification generation compared to strong baselines, and can also enhance veracity classification with a straightforward extension.1 Code and dataset are released at https://github.com/znhy1024/JustiLM.
pdf
bib
abs
To Diverge or Not to Diverge: A Morphosyntactic Perspective on Machine Translation vs Human Translation
Jiaming Luo
|
Colin Cherry
|
George Foster
We conduct a large-scale fine-grained comparative analysis of machine translations (MTs) against human translations (HTs) through the lens of morphosyntactic divergence. Across three language pairs and two types of divergence defined as the structural difference between the source and the target, MT is consistently more conservative than HT, with less morphosyntactic diversity, more convergent patterns, and more one-to-one alignments. Through analysis on different decoding algorithms, we attribute this discrepancy to the use of beam search that biases MT towards more convergent patterns. This bias is most amplified when the convergent pattern appears around 50% of the time in training data. Lastly, we show that for a majority of morphosyntactic divergences, their presence in HT is correlated with decreased MT performance, presenting a greater challenge for MT systems.
pdf
bib
abs
What Do Self-Supervised Speech Models Know About Words?
Ankita Pasad
|
Chung-Ming Chien
|
Shane Settle
|
Karen Livescu
Many self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks. However, these empirical successes alone do not give a complete picture of what is learned during pre-training. Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information, but we still lack a proper understanding of knowledge encoded at the word level and beyond. In this work, we use lightweight analysis methods to study segment-level linguistic properties—word identity, boundaries, pronunciation, syntactic features, and semantic features—encoded in S3Ms. We present a comparative study of layer-wise representations from ten S3Ms and find that (i) the frame-level representations within each word segment are not all equally informative, and (ii) the pre-training objective and model size heavily influence the accessibility and distribution of linguistic information across layers. We also find that on several tasks—word discrimination, word segmentation, and semantic sentence similarity—S3Ms trained with visual grounding outperform their speech-only counterparts. Finally, our task-based analyses demonstrate improved performance on word segmentation and acoustic word discrimination while using simpler methods than prior work.1
pdf
bib
abs
Are Character-level Translations Worth the Wait? Comparing ByT5 and mT5 for Machine Translation
Lukas Edman
|
Gabriele Sarti
|
Antonio Toral
|
Gertjan van Noord
|
Arianna Bisazza
Pretrained character-level and byte-level language models have been shown to be competitive with popular subword models across a range of Natural Language Processing tasks. However, there has been little research on their effectiveness for neural machine translation (NMT), particularly within the popular pretrain-then-finetune paradigm. This work performs an extensive comparison across multiple languages and experimental conditions of character- and subword-level pretrained models (ByT5 and mT5, respectively) on NMT. We show the effectiveness of character-level modeling in translation, particularly in cases where fine-tuning data is limited. In our analysis, we show how character models’ gains in translation quality are reflected in better translations of orthographically similar words and rare words. While evaluating the importance of source texts in driving model predictions, we highlight word-level patterns within ByT5, suggesting an ability to modulate word-level and character-level information during generation. We conclude by assessing the efficiency tradeoff of byte models, suggesting their usage in non-time-critical scenarios to boost translation quality.
pdf
bib
abs
Geographic Adaptation of Pretrained Language Models
Valentin Hofmann
|
Goran Glavaš
|
Nikola Ljubešić
|
Janet B. Pierrehumbert
|
Hinrich Schütze
While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
pdf
bib
abs
Do Text Simplification Systems Preserve Meaning? A Human Evaluation via Reading Comprehension
Sweta Agrawal
|
Marine Carpuat
Automatic text simplification (TS) aims to automate the process of rewriting text to make it easier for people to read. A pre-requisite for TS to be useful is that it should convey information that is consistent with the meaning of the original text. However, current TS evaluation protocols assess system outputs for simplicity and meaning preservation without regard for the document context in which output sentences occur and for how people understand them. In this work, we introduce a human evaluation framework to assess whether simplified texts preserve meaning using reading comprehension questions. With this framework, we conduct a thorough human evaluation of texts by humans and by nine automatic systems. Supervised systems that leverage pre-training knowledge achieve the highest scores on the reading comprehension tasks among the automatic controllable TS systems. However, even the best-performing supervised system struggles with at least 14% of the questions, marking them as “unanswerable” based on simplified content. We further investigate how existing TS evaluation metrics and automatic question-answering systems approximate the human judgments we obtained.
pdf
bib
abs
Simultaneous Selection and Adaptation of Source Data via Four-Level Optimization
Pengtao Xie
|
Xingchen Zhao
|
Xuehai He
In many NLP applications, to mitigate data deficiency in a target task, source data is collected to help with target model training. Existing transfer learning methods either select a subset of source examples that are close to the target domain or try to adapt all source examples into the target domain, then use selected or adapted source examples to train the target model. These methods either incur significant information loss or bear the risk that after adaptation, source examples which are originally already in the target domain may be outside the target domain. To address the limitations of these methods, we propose a four-level optimization based framework which simultaneously selects and adapts source data. Our method can automatically identify in-domain and out-of-domain source examples and apply example-specific processing methods: selection for in-domain examples and adaptation for out-of-domain examples. Experiments on various datasets demonstrate the effectiveness of our proposed method.
pdf
bib
abs
ConvoSense: Overcoming Monotonous Commonsense Inferences for Conversational AI
Sarah E. Finch
|
Jinho D. Choi
Mastering commonsense understanding and reasoning is a pivotal skill essential for conducting engaging conversations. While there have been several attempts to create datasets that facilitate commonsense inferences in dialogue contexts, existing datasets tend to lack in-depth details, restate information already present in the conversation, and often fail to capture the multifaceted nature of commonsense reasoning. In response to these limitations, we compile a new synthetic dataset for commonsense reasoning in dialogue contexts using GPT, ℂonvoSense, that boasts greater contextual novelty, offers a higher volume of inferences per example, and substantially enriches the detail conveyed by the inferences. Our dataset contains over 500,000 inferences across 12,000 dialogues with 10 popular inference types, which empowers the training of generative commonsense models for dialogue that are superior in producing plausible inferences with high novelty when compared to models trained on the previous datasets. To the best of our knowledge, ℂonvoSense is the first of its kind to provide such a multitude of novel inferences at such a large scale.
pdf
bib
abs
Automatically Correcting Large Language Models: Surveying the Landscape of Diverse Automated Correction Strategies
Liangming Pan
|
Michael Saxon
|
Wenda Xu
|
Deepak Nathani
|
Xinyi Wang
|
William Yang Wang
While large language models (LLMs) have shown remarkable effectiveness in various NLP tasks, they are still prone to issues such as hallucination, unfaithful reasoning, and toxicity. A promising approach to rectify these flaws is correcting LLMs with feedback, where the LLM itself is prompted or guided with feedback to fix problems in its own output. Techniques leveraging automated feedback—either produced by the LLM itself (self-correction) or some external system—are of particular interest as they make LLM-based solutions more practical and deployable with minimal human intervention. This paper provides an exhaustive review of the recent advances in correcting LLMs with automated feedback, categorizing them into training-time, generation-time, and post-hoc approaches. We also identify potential challenges and future directions in this emerging field.
pdf
bib
abs
KoBBQ: Korean Bias Benchmark for Question Answering
Jiho Jin
|
Jiseon Kim
|
Nayeon Lee
|
Haneul Yoo
|
Alice Oh
|
Hwaran Lee
Warning: This paper contains examples of stereotypes and biases. The Bias Benchmark for Question Answering (BBQ) is designed to evaluate social biases of language models (LMs), but it is not simple to adapt this benchmark to cultural contexts other than the US because social biases depend heavily on the cultural context. In this paper, we present KoBBQ, a Korean bias benchmark dataset, and we propose a general framework that addresses considerations for cultural adaptation of a dataset. Our framework includes partitioning the BBQ dataset into three classes—Simply-Transferred (can be used directly after cultural translation), Target-Modified (requires localization in target groups), and Sample-Removed (does not fit Korean culture)—and adding four new categories of bias specific to Korean culture. We conduct a large-scale survey to collect and validate the social biases and the targets of the biases that reflect the stereotypes in Korean culture. The resulting KoBBQ dataset comprises 268 templates and 76,048 samples across 12 categories of social bias. We use KoBBQ to measure the accuracy and bias scores of several state-of-the-art multilingual LMs. The results clearly show differences in the bias of LMs as measured by KoBBQ and a machine-translated version of BBQ, demonstrating the need for and utility of a well-constructed, culturally aware social bias benchmark.
pdf
bib
abs
AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning
Han Zhou
|
Xingchen Wan
|
Ivan Vulić
|
Anna Korhonen
Large pretrained language models are widely used in downstream NLP tasks via task- specific fine-tuning, but such procedures can be costly. Recently, Parameter-Efficient Fine-Tuning (PEFT) methods have achieved strong task performance while updating much fewer parameters than full model fine-tuning (FFT). However, it is non-trivial to make informed design choices on the PEFT configurations, such as their architecture, the number of tunable parameters, and even the layers in which the PEFT modules are inserted. Consequently, it is highly likely that the current, manually designed configurations are suboptimal in terms of their performance-efficiency trade-off. Inspired by advances in neural architecture search, we propose AutoPEFT for automatic PEFT configuration selection: We first design an expressive configuration search space with multiple representative PEFT modules as building blocks. Using multi-objective Bayesian optimization in a low-cost setup, we then discover a Pareto-optimal set of configurations with strong performance-cost trade-offs across different numbers of parameters that are also highly transferable across different tasks. Empirically, on GLUE and SuperGLUE tasks, we show that AutoPEFT-discovered configurations significantly outperform existing PEFT methods and are on par or better than FFT without incurring substantial training efficiency costs.
pdf
bib
abs
What Formal Languages Can Transformers Express? A Survey
Lena Strobl
|
William Merrill
|
Gail Weiss
|
David Chiang
|
Dana Angluin
As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help clarify the power of transformers relative to other models of computation, their fundamental capabilities and limits, and the impact of architectural choices. Work in this subarea has made considerable progress in recent years. Here, we undertake a comprehensive survey of this work, documenting the diverse assumptions that underlie different results and providing a unified framework for harmonizing seemingly contradictory findings.
pdf
bib
abs
Text-to-OverpassQL: A Natural Language Interface for Complex Geodata Querying of OpenStreetMap
Michael Staniek
|
Raphael Schumann
|
Maike Züfle
|
Stefan Riezler
We present Text-to-OverpassQL, a task designed to facilitate a natural language interface for querying geodata from OpenStreetMap (OSM). The Overpass Query Language (OverpassQL) allows users to formulate complex database queries and is widely adopted in the OSM ecosystem. Generating Overpass queries from natural language input serves multiple use-cases. It enables novice users to utilize OverpassQL without prior knowledge, assists experienced users with crafting advanced queries, and enables tool-augmented large language models to access information stored in the OSM database. In order to assess the performance of current sequence generation models on this task, we propose OverpassNL,1 a dataset of 8,352 queries with corresponding natural language inputs. We further introduce task specific evaluation metrics and ground the evaluation of the Text-to-OverpassQL task by executing the queries against the OSM database. We establish strong baselines by finetuning sequence-to-sequence models and adapting large language models with in-context examples. The detailed evaluation reveals strengths and weaknesses of the considered learning strategies, laying the foundations for further research into the Text-to-OverpassQL task.
pdf
bib
abs
Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions
Jiahuan Li
|
Hao Zhou
|
Shujian Huang
|
Shanbo Cheng
|
Jiajun Chen
Large-scale pretrained language models (LLMs), such as ChatGPT and GPT4, have shown strong abilities in multilingual translation, without being explicitly trained on parallel corpora. It is intriguing how the LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7.5B, to perform multilingual translation following given instructions. Firstly, we show that multilingual LLMs have stronger translation abilities than previously demonstrated. For a certain language, the translation performance depends on its similarity to English and the amount of data used in the pretraining phase. Secondly, we find that LLMs’ ability to carry out translation instructions relies on the understanding of translation instructions and the alignment among different languages. With multilingual finetuning with translation instructions, LLMs could learn to perform the translation task well even for those language pairs unseen during the instruction tuning phase.
pdf
bib
abs
Semantics of Multiword Expressions in Transformer-Based Models: A Survey
Filip Miletić
|
Sabine Schulte im Walde
Multiword expressions (MWEs) are composed of multiple words and exhibit variable degrees of compositionality. As such, their meanings are notoriously difficult to model, and it is unclear to what extent this issue affects transformer architectures. Addressing this gap, we provide the first in-depth survey of MWE processing with transformer models. We overall find that they capture MWE semantics inconsistently, as shown by reliance on surface patterns and memorized information. MWE meaning is also strongly localized, predominantly in early layers of the architecture. Representations benefit from specific linguistic properties, such as lower semantic idiosyncrasy and ambiguity of target expressions. Our findings overall question the ability of transformer models to robustly capture fine-grained semantics. Furthermore, we highlight the need for more directly comparable evaluation setups.
pdf
bib
abs
The Thai Discourse Treebank: Annotating and Classifying Thai Discourse Connectives
Ponrawee Prasertsom
|
Apiwat Jaroonpol
|
Attapol T. Rutherford
Discourse analysis is a highly applicable area of natural language processing. In English and other languages, resources for discourse-based tasks are widely available. Thai, however, has hitherto lacked such resources. We present the Thai Discourse Treebank, the first, large Thai corpus annotated in the style of the Penn Discourse Treebank. The resulting corpus has over 10,000 sentences and 18,000 instances of connectives in 33 different relations. We release the corpus alongside our list of 148 potentially polysemous discourse connectives with a total of 340 form-sense pairs and their classification criteria to facilitate future research. We also develop models for connective identification and classification tasks. Our best models achieve an F1 of 0.96 in the identification task and 0.46 on the sense classification task. Our results serve as benchmarks for future models for Thai discourse tasks.
pdf
bib
abs
Federated Learning for Exploiting Annotators’ Disagreements in Natural Language Processing
Nuria Rodríguez-Barroso
|
Eugenio Martínez Cámara
|
Jose Camacho Collados
|
M. Victoria Luzón
|
Francisco Herrera
The annotation of ambiguous or subjective NLP tasks is usually addressed by various annotators. In most datasets, these annotations are aggregated into a single ground truth. However, this omits divergent opinions of annotators, hence missing individual perspectives. We propose FLEAD (Federated Learning for Exploiting Annotators’ Disagreements), a methodology built upon federated learning to independently learn from the opinions of all the annotators, thereby leveraging all their underlying information without relying on a single ground truth. We conduct an extensive experimental study and analysis in diverse text classification tasks to show the contribution of our approach with respect to mainstream approaches based on majority voting and other recent methodologies that also learn from annotator disagreements.
pdf
bib
abs
Computational Complexity of Natural Morphology Revisited
Hajime Senuma
|
Akiko Aizawa
This paper revisits a classical, yet fundamental, discussion of theoretical computational linguistics: the computational complexity of natural languages. Past studies have revealed that syntax, as observed in Swiss-German, is not weakly context-free. Concerning morphology, Culy (1985) employed a construction in Bambara to show that morphology is not weakly context-free; however, Manaster-Ramer (1988) pointed out that the Bambara case can be problematic because the wordhood of the construction is reliant on special tonal behaviors, and it is ambiguous whether the behaviors belong to the morphological domain. This raises doubts about whether the case can be considered a genuine morphological phenomenon. In this paper, we argue that Classical Ainu, a language we examine, also defies weak context-freeness at the morphological level. The construction we introduce is unambiguously morphological because this language’s valency-sensitive structure and valency-changing operations, such as noun incorporation, preclude its grammatical interpretation as syntactic.
pdf
bib
abs
Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis
Sohee Yang
|
Jonghyeon Kim
|
Joel Jang
|
Seonghyeon Ye
|
Hyunji Lee
|
Minjoon Seo
Previous work in prompt engineering for large language models has introduced different gradient-free probability-based prompt selection methods that aim to choose the optimal prompt among the candidates for a given task but have failed to provide a comprehensive and fair comparison between each other. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods by performing extensive experiments on 13 common and diverse NLP tasks. We find that each of the existing methods can be interpreted as some variant of the method that maximizes mutual information between the input and the predicted output (MI). Utilizing this finding, we develop several other combinatorial variants of MI and increase the effectiveness of the oracle prompt selection method from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, considering that all the methods rely on the output probability distribution of the model that might be biased, we propose a novel calibration method called Calibration by Marginalization (CBM) that is orthogonal to the existing methods and helps increase the prompt selection effectiveness of the best method to 96.85%, achieving 99.44% of the oracle prompt F1 without calibration.1
pdf
bib
abs
Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Vaibhav Adlakha
|
Parishad BehnamGhader
|
Xing Han Lu
|
Nicholas Meade
|
Siva Reddy
Instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA). By simply prepending relevant documents and an instruction to their input, these models can be adapted to various information domains and tasks without additional training. However, these models tend to produce verbose responses with supplementary information, which makes traditional QA metrics like exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we evaluate instruction-following models along two fronts: 1) how well they satisfy user’s information need (correctness), and 2) whether they disseminate information supported by the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness and propose simple token-overlap metrics that correlate highly with human judgments. Our analysis reveals that for correctness, instruction-following models perform comparably to models specifically fine-tuned for that task. However, they struggle to accurately judge the relevance of the provided knowledge and often hallucinate in their responses. We hope our work encourages more holistic evaluation of instruction-following models for QA. Our code and human annotation data is available at https://github.com/McGill-NLP/instruct-qa.
pdf
bib
abs
The Ethics of Automating Legal Actors
Josef Valvoda
|
Alec Thompson
|
Ryan Cotterell
|
Simone Teufel
The introduction of large public legal datasets has brought about a renaissance in legal NLP. Many of these datasets are composed of legal judgments—the product of judges deciding cases. Since ML algorithms learn to model the data they are trained on, several legal NLP models are models of judges. While some have argued for the automation of judges, in this position piece, we argue that automating the role of the judge raises difficult ethical challenges, in particular for common law legal systems. Our argument follows from the social role of the judge in actively shaping the law, rather than merely applying it. Since current NLP models are too far away from having the facilities necessary for this task, they should not be used to automate judges. Furthermore, even in the case that the models could achieve human-level capabilities, there would still be remaining ethical concerns inherent in the automation of the legal process.
pdf
bib
abs
Source-Free Domain Adaptation for Question Answering with Masked Self-training
Maxwell J. Yin
|
Boyu Wang
|
Yue Dong
|
Charles Ling
Previous unsupervised domain adaptation (UDA) methods for question answering (QA) require access to source domain data while fine-tuning the model for the target domain. Source domain data may, however, contain sensitive information and should be protected. In this study, we investigate a more challenging setting, source-free UDA, in which we have only the pretrained source model and target domain data, without access to source domain data. We propose a novel self-training approach to QA models that integrates a specially designed mask module for domain adaptation. The mask is auto-adjusted to extract key domain knowledge when trained on the source domain. To maintain previously learned domain knowledge, certain mask weights are frozen during adaptation, while other weights are adjusted to mitigate domain shifts with pseudo-labeled samples generated in the target domain. Our empirical results on four benchmark datasets suggest that our approach significantly enhances the performance of pretrained QA models on the target domain, and even outperforms models that have access to the source data during adaptation.
pdf
bib
abs
Scope Ambiguities in Large Language Models
Gaurav Kamath
|
Sebastian Schuster
|
Sowmya Vajjala
|
Siva Reddy
Sentences containing multiple semantic operators with overlapping scope often create ambiguities in interpretation, known as scope ambiguities. These ambiguities offer rich insights into the interaction between semantic structure and world knowledge in language processing. Despite this, there has been little research into how modern large language models treat them. In this paper, we investigate how different versions of certain autoregressive language models—GPT-2, GPT-3/3.5, Llama 2, and GPT-4—treat scope ambiguous sentences, and compare this with human judgments. We introduce novel datasets that contain a joint total of almost 1,000 unique scope-ambiguous sentences, containing interactions between a range of semantic operators, and annotated for human judgments. Using these datasets, we find evidence that several models (i) are sensitive to the meaning ambiguity in these sentences, in a way that patterns well with human judgments, and (ii) can successfully identify human-preferred readings at a high level of accuracy (over 90% in some cases).1
pdf
bib
abs
Visually Grounded Speech Models Have a Mutual Exclusivity Bias
Leanne Nortje
|
Dan Oneaţă
|
Yevgen Matusevych
|
Herman Kamper
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: A novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialization strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialization approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered. Based on detailed analyses to piece out the model’s representation space, we attribute the ME bias to how familiar and novel classes are distinctly separated in the resulting space.
pdf
bib
abs
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias
Itay Itzhak
|
Gabriel Stanovsky
|
Nir Rosenfeld
|
Yonatan Belinkov
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. While these tuning methods can help align models with human objectives and generate high-quality text, not much is known about their potential adverse effects. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs, focusing on three cognitive biases—the decoy effect, the certainty effect, and the belief bias—all of which are known to influence human decision-making and reasoning. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families. Notably, we find a stronger presence of biases in models that have undergone instruction tuning, such as Flan-T5, Mistral-Instruct, GPT3.5, and GPT4. Our work constitutes a step toward comprehending cognitive biases in instruction-tuned LMs, which is crucial for the development of more reliable and unbiased language models.1
pdf
bib
abs
Beyond Boundaries: A Human-like Approach for Question Answering over Structured and Unstructured Information Sources
Jens Lehmann
|
Dhananjay Bhandiwad
|
Preetam Gattogi
|
Sahar Vahdati
Answering factual questions from heterogenous sources, such as graphs and text, is a key capacity of intelligent systems. Current approaches either (i) perform question answering over text and structured sources as separate pipelines followed by a merge step or (ii) provide an early integration, giving up the strengths of particular information sources. To solve this problem, we present “HumanIQ”, a method that teaches language models to dynamically combine retrieved information by imitating how humans use retrieval tools. Our approach couples a generic method for gathering human demonstrations of tool use with adaptive few-shot learning for tool augmented models. We show that HumanIQ confers significant benefits, including i) reducing the error rate of our strongest baseline (GPT-4) by over 50% across 3 benchmarks, (ii) improving human preference over responses from vanilla GPT-4 (45.3% wins, 46.7% ties, 8.0% loss), and (iii) outperforming numerous task-specific baselines.
pdf
bib
abs
Comparing Humans and Large Language Models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME)
Cameron R. Jones
|
Sean Trott
|
Benjamin Bergen
We address a growing debate about the extent to which large language models (LLMs) produce behavior consistent with Theory of Mind (ToM) in humans. We present EPITOME: a battery of six experiments that tap diverse ToM capacities, including belief attribution, emotional inference, and pragmatic reasoning. We elicit a performance baseline from human participants for each task. We use the dataset to ask whether distributional linguistic information learned by LLMs is sufficient to explain ToM in humans. We compare performance of five LLMs to a baseline of responses from human comprehenders. Results are mixed. LLMs display considerable sensitivity to mental states and match human performance in several tasks. Yet, they commit systematic errors in others, especially those requiring pragmatic reasoning on the basis of mental state information. Such uneven performance indicates that human-level ToM may require resources beyond distributional information.
pdf
bib
abs
A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice
Juri Opitz
Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently, metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called ‘macro’ metrics to rank systems (e.g., ‘macro F1’) but do not clearly specify what they would expect from such a ‘macro’ metric. This is problematic, since picking a metric can affect research findings and thus any clarity in the process should be maximized. Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics. The analysis helps us understand the metrics’ underlying properties, and how they align with expectations as found expressed in papers. Then we reflect on the practical situation in the field, and survey evaluation practice in recent shared tasks. We find that metric selection is often not supported with convincing arguments, an issue that can make a system ranking seem arbitrary. Our work aims at providing overview and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
pdf
bib
abs
Revisiting Meta-evaluation for Grammatical Error Correction
Masamune Kobayashi
|
Masato Mita
|
Mamoru Komachi
Metrics are the foundation for automatic evaluation in grammatical error correction (GEC), with their evaluation of the metrics (meta-evaluation) relying on their correlation with human judgments. However, conventional meta-evaluations in English GEC encounter several challenges, including biases caused by inconsistencies in evaluation granularity and an outdated setup using classical systems. These problems can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper proposes SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities: edit-based and sentence-based, covering 12 state-of-the-art systems including large language models, and two human corrections with different focuses. The results of improved correlations by aligning the granularity in the sentence-level meta-evaluation suggest that edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.
pdf
bib
abs
Context-Aware Machine Translation with Source Coreference Explanation
Huy Hien Vu
|
Hidetaka Kamigaito
|
Taro Watanabe
Despite significant improvements in enhancing the quality of translation, context-aware machine translation (MT) models underperform in many cases. One of the main reasons is that they fail to utilize the correct features from context when the context is too long or their models are overly complex. This can lead to the explain-away effect, wherein the models only consider features easier to explain predictions, resulting in inaccurate translations. To address this issue, we propose a model that explains the decisions made for translation by predicting coreference features in the input. We construct a model for input coreference by exploiting contextual features from both the input and translation output representations on top of an existing MT model. We evaluate and analyze our method in the WMT document-level translation task of English-German dataset, the English-Russian dataset, and the multilingual TED talk dataset, demonstrating an improvement of over 1.0 BLEU score when compared with other context-aware models.
pdf
bib
abs
Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?
Cristina Aggazzotti
|
Nicholas Andrews
|
Elizabeth Allyn Smith
Authorship verification is the task of determining if two distinct writing samples share the same author and is typically concerned with the attribution of written text. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. The main challenge is that many stylistic features, such as punctuation and capitalization, are not informative in this setting. On the other hand, transcribed speech exhibits other patterns, such as filler words and backchannels (e.g., um, uh-huh), which may be characteristic of different speakers. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts. To limit spurious associations of speakers with topic, we employ both conversation prompts and speakers participating in the same conversation to construct verification trials of varying difficulties. We establish the state of the art on this new benchmark by comparing a suite of neural and non-neural baselines, finding that although written text attribution models achieve surprisingly good performance in certain settings, they perform markedly worse as conversational topic is increasingly controlled. We present analyses of the impact of transcription style on performance as well as the ability of fine-tuning on speech transcripts to improve performance.1
pdf
bib
abs
Decision-Oriented Dialogue for Human-AI Collaboration
Jessy Lin
|
Nicholas Tomlin
|
Jacob Andreas
|
Jason Eisner
We describe a class of tasks called decision-oriented dialogues, in which AI assistants such as large language models (LMs) must collaborate with one or more humans via natural language to help them make complex decisions. We formalize three domains in which users face everyday decisions: (1) choosing an assignment of reviewers to conference papers, (2) planning a multi-step itinerary in a city, and (3) negotiating travel plans for a group of friends. In each of these settings, AI assistants and users have disparate abilities that they must combine to arrive at the best decision: Assistants can access and process large amounts of information, while users have preferences and constraints external to the system. For each task, we build a dialogue environment where agents receive a reward based on the quality of the final decision they reach. We evaluate LMs in self-play and in collaboration with humans and find that they fall short compared to human assistants, achieving much lower rewards despite engaging in longer dialogues. We highlight a number of challenges models face in decision-oriented dialogues, ranging from goal-directed behavior to reasoning and optimization, and release our environments as a testbed for future work.
pdf
bib
abs
Exploring Continual Learning of Compositional Generalization in NLI
Xiyan Fu
|
Anette Frank
Compositional Natural Language Inference (NLI) has been explored to assess the true abilities of neural models to perform NLI. Yet, current evaluations assume models to have full access to all primitive inferences in advance, in contrast to humans that continuously acquire inference knowledge. In this paper, we introduce the Continual Compositional Generalization in Inference (C2Gen NLI) challenge, where a model continuously acquires knowledge of constituting primitive inference tasks as a basis for compositional inferences. We explore how continual learning affects compositional generalization in NLI, by designing a continual learning setup for compositional NLI inference tasks. Our experiments demonstrate that models fail to compositionally generalize in a continual scenario. To address this problem, we first benchmark various continual learning algorithms and verify their efficacy. We then further analyze C2Gen, focusing on how to order primitives and compositional inference types, and examining correlations between subtasks. Our analyses show that by learning subtasks continuously while observing their dependencies and increasing degrees of difficulty, continual learning can enhance composition generalization ability.1
pdf
bib
abs
State of What Art? A Call for Multi-Prompt LLM Evaluation
Moran Mizrahi
|
Guy Kaplan
|
Dan Malkin
|
Rotem Dror
|
Dafna Shahaf
|
Gabriel Stanovsky
Recent advances in LLMs have led to an abundance of evaluation benchmarks, which typically rely on a single instruction template per task. We create a large-scale collection of instruction paraphrases and comprehensively analyze the brittleness introduced by single-prompt evaluations across 6.5M instances, involving 20 different LLMs and 39 tasks from 3 benchmarks. We find that different instruction templates lead to very different performance, both absolute and relative. Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities. We show that our metrics provide new insights into the strengths and limitations of current LLMs.
pdf
bib
abs
CreoleVal: Multilingual Multitask Benchmarks for Creoles
Heather Lent
|
Kushal Tatariya
|
Raj Dabre
|
Yiyi Chen
|
Marcell Fekete
|
Esther Ploeger
|
Li Zhou
|
Ruth-Ann Armstrong
|
Abee Eijansantos
|
Catriona Malau
|
Hans Erik Heje
|
Ernests Lavrinovics
|
Diptesh Kanojia
|
Paul Belony
|
Marcel Bollmann
|
Loïc Grobol
|
Miryam de Lhoneux
|
Daniel Hershcovich
|
Michel DeGraff
|
Anders Søgaard
|
Johannes Bjerva
Creoles represent an under-explored and marginalized group of languages, with few available resources for NLP research. While the genealogical ties between Creoles and a number of highly resourced languages imply a significant potential for transfer learning, this potential is hampered due to this lack of annotated data. In this work we present CreoleVal, a collection of benchmark datasets spanning 8 different NLP tasks, covering up to 28 Creole languages; it is an aggregate of novel development datasets for reading comprehension relation classification, and machine translation for Creoles, in addition to a practical gateway to a handful of preexisting benchmarks. For each benchmark, we conduct baseline experiments in a zero-shot setting in order to further ascertain the capabilities and limitations of transfer learning for Creoles. Ultimately, we see CreoleVal as an opportunity to empower research on Creoles in NLP and computational linguistics, and in general, a step towards more equitable language technology around the globe.
pdf
bib
abs
xcomet: Transparent Machine Translation Evaluation through Fine-grained Error Detection
Nuno M. Guerreiro
|
Ricardo Rei
|
Daan van Stigt
|
Luisa Coheur
|
Pierre Colombo
|
André F. T. Martins
Widely used learned metrics for machine translation evaluation, such as Comet and Bleurt, estimate the quality of a translation hypothesis by providing a single sentence-level score. As such, they offer little insight into translation errors (e.g., what are the errors and what is their severity). On the other hand, generative large language models (LLMs) are amplifying the adoption of more granular strategies to evaluation, attempting to detail and categorize translation errors. In this work, we introduce xcomet, an open-source learned metric designed to bridge the gap between these approaches. xcomet integrates both sentence-level evaluation and error span detection capabilities, exhibiting state-of-the-art performance across all types of evaluation (sentence-level, system-level, and error span detection). Moreover, it does so while highlighting and categorizing error spans, thus enriching the quality assessment. We also provide a robustness analysis with stress tests, and show that xcomet is largely capable of identifying localized critical errors and hallucinations.
pdf
bib
abs
SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks
Xuanli He
|
Qiongkai Xu
|
Jun Wang
|
Benjamin I. P. Rubinstein
|
Trevor Cohn
Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model’s behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.
pdf
bib
abs
Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design
Lindia Tjuatja
|
Valerie Chen
|
Tongshuang Wu
|
Ameet Talwalkwar
|
Graham Neubig
One widely cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording—but interestingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect human response biases, if at all. We look to survey design, where human response biases caused by changes in the wordings of “prompts” have been extensively explored in social psychology literature. Drawing from these works, we design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior, particularly in models that have undergone RLHF. Furthermore, even if a model shows a significant change in the same direction as humans, we find that they are sensitive to perturbations that do not elicit significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained characterizations of model behavior.1
pdf
bib
abs
Multi-level Shared Knowledge Guided Learning for Knowledge Graph Completion
Yongxue Shan
|
Jie Zhou
|
Jie Peng
|
Xin Zhou
|
Jiaqian Yin
|
Xiaodong Wang
In the task of Knowledge Graph Completion (KGC), the existing datasets and their inherent subtasks carry a wealth of shared knowledge that can be utilized to enhance the representation of knowledge triplets and overall performance. However, no current studies specifically address the shared knowledge within KGC. To bridge this gap, we introduce a multi-level Shared Knowledge Guided learning method (SKG) that operates at both the dataset and task levels. On the dataset level, SKG-KGC broadens the original dataset by identifying shared features within entity sets via text summarization. On the task level, for the three typical KGC subtasks—head entity prediction, relation prediction, and tail entity prediction—we present an innovative multi-task learning architecture with dynamically adjusted loss weights. This approach allows the model to focus on more challenging and underperforming tasks, effectively mitigating the imbalance of knowledge sharing among subtasks. Experimental results demonstrate that SKG-KGC outperforms existing text-based methods significantly on three well-known datasets, with the most notable improvement on WN18RR (MRR: 66.6%→ 72.2%, Hit@1: 58.7%→67.0%).
pdf
bib
abs
Do Multi-Document Summarization Models Synthesize?
Jay DeYoung
|
Stephanie C. Martinez
|
Iain J. Marshall
|
Byron C. Wallace
Multi-document summarization entails producing concise synopses of collections of inputs. For some applications, the synopsis should accurately synthesize inputs with respect to a key aspect, e.g., a synopsis of film reviews written about a particular movie should reflect the average critic consensus. As a more consequential example, narrative summaries that accompany biomedical systematic reviews of clinical trial results should accurately summarize the potentially conflicting results from individual trials. In this paper we ask: To what extent do modern multi-document summarization models implicitly perform this sort of synthesis? We run experiments over opinion and evidence synthesis datasets using a suite of summarization models, from fine-tuned transformers to GPT-4. We find that existing models partially perform synthesis, but imperfectly: Even the best performing models are over-sensitive to changes in input ordering and under-sensitive to changes in input compositions (e.g., ratio of positive to negative reviews). We propose a simple, general, effective method for improving model synthesis capabilities by generating an explicitly diverse set of candidate outputs, and then selecting from these the string best aligned with the expected aggregate measure for the inputs, or abstaining when the model produces no good candidate.
pdf
bib
abs
ARN: Analogical Reasoning on Narratives
Zhivar Sourati
|
Filip Ilievski
|
Pia Sommerauer
|
Yifan Jiang
As a core cognitive skill that enables the transferability of information across domains, analogical reasoning has been extensively studied for both humans and computational models. However, while cognitive theories of analogy often focus on narratives and study the distinction between surface, relational, and system similarities, existing work in natural language processing has a narrower focus as far as relational analogies between word pairs. This gap brings a natural question: can state-of-the-art large language models (LLMs) detect system analogies between narratives? To gain insight into this question and extend word-based relational analogies to relational system analogies, we devise a comprehensive computational framework that operationalizes dominant theories of analogy, using narrative elements to create surface and system mappings. Leveraging the interplay between these mappings, we create a binary task and benchmark for Analogical Reasoning on Narratives (ARN), covering four categories of far (cross-domain)/near (within-domain) analogies and disanalogies. We show that while all LLMs can largely recognize near analogies, even the largest ones struggle with far analogies in a zero-shot setting, with GPT4.0 scoring below random. Guiding the models through solved examples and Chain-of-Thought reasoning enhances their analogical reasoning ability. Yet, since even in the few-shot setting, the best model only performs halfway between random and humans, ARN opens exciting directions for computational analogical reasoners.
pdf
bib
abs
Are Language Models More Like Libraries or Like Librarians? Bibliotechnism, the Novel Reference Problem, and the Attitudes of LLMs
Harvey Lederman
|
Kyle Mahowald
Are LLMs cultural technologies like photocopiers or printing presses, which transmit information but cannot create new content? A challenge for this idea, which we call bibliotechnism, is that LLMs generate novel text. We begin with a defense of bibliotechnism, showing how even novel text may inherit its meaning from original human-generated text. We then argue that bibliotechnism faces an independent challenge from examples in which LLMs generate novel reference, using new names to refer to new entities. Such examples could be explained if LLMs were not cultural technologies but had beliefs, desires, and intentions. According to interpretationism in the philosophy of mind, a system has such attitudes if and only if its behavior is well explained by the hypothesis that it does. Interpretationists may hold that LLMs have attitudes, and thus have a simple solution to the novel reference problem. We emphasize, however, that interpretationism is compatible with very simple creatures having attitudes and differs sharply from views that presuppose these attitudes require consciousness, sentience, or intelligence (topics about which we make no claims).
pdf
bib
abs
Segmentation-Free Streaming Machine Translation
Javier Iranzo-Sánchez
|
Jorge Iranzo-Sánchez
|
Adrià Giménez
|
Jorge Civera
|
Alfons Juan
Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until after the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model.1
pdf
bib
abs
Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation
Cyril Chhun
|
Fabian M. Suchanek
|
Chloé Clavel
Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning, and deep understanding. Meanwhile, Large Language Models (LLMs) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.
pdf
bib
abs
How Often Are Errors in Natural Language Reasoning Due to Paraphrastic Variability?
Neha Srikanth
|
Marine Carpuat
|
Rachel Rudinger
Large language models have been shown to behave inconsistently in response to meaning-preserving paraphrastic inputs. At the same time, researchers evaluate the knowledge and reasoning abilities of these models with test evaluations that do not disaggregate the effect of paraphrastic variability on performance. We propose a metric, PC, for evaluating the paraphrastic consistency of natural language reasoning models based on the probability of a model achieving the same correctness on two paraphrases of the same problem. We mathematically connect this metric to the proportion of a model’s variance in correctness attributable to paraphrasing. To estimate PC, we collect ParaNlu, a dataset of 7,782 human-written and validated paraphrased reasoning problems constructed on top of existing benchmark datasets for defeasible and abductive natural language inference.1 Using ParaNlu, we measure the paraphrastic consistency of several model classes and show that consistency dramatically increases with pretraining but not fine-tuning. All models tested exhibited room for improvement in paraphrastic consistency.
pdf
bib
abs
Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
George Chrysostomou
|
Zhixue Zhao
|
Miles Williams
|
Nikolaos Aletras
Despite the remarkable performance of generative large language models (LLMs) on abstractive summarization, they face two significant challenges: their considerable size and tendency to hallucinate. Hallucinations are concerning because they erode reliability and raise safety issues. Pruning is a technique that reduces model size by removing redundant weights, enabling more efficient sparse inference. Pruned models yield downstream task performance comparable to the original, making them ideal alternatives when operating on a limited budget. However, the effect that pruning has upon hallucinations in abstractive summarization with LLMs has yet to be explored. In this paper, we provide an extensive empirical study across five summarization datasets, two state-of-the-art pruning methods, and five instruction-tuned LLMs. Surprisingly, we find that hallucinations are less prevalent from pruned LLMs than the original models. Our analysis suggests that pruned models tend to depend more on the source document for summary generation. This leads to a higher lexical overlap between the generated summary and the source document, which could be a reason for the reduction in hallucination risk.1
pdf
bib
abs
Hypernetworks for Personalizing ASR to Atypical Speech
Max Müller-Eberstein
|
Dianna Yee
|
Karren Yang
|
Gautam Varma Mantena
|
Colin Lea
Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. However, these approaches assume a priori knowledge of the atypical speech disorder being adapted for—the diagnosis of which requires expert knowledge that is not always available. Even given this knowledge, data scarcity and high inter-/intra-speaker variability further limit the effectiveness of traditional fine-tuning. To circumvent these challenges, we first identify the minimal set of model parameters required for ASR adaptation. Our analysis of each individual parameter’s effect on adaptation performance allows us to reduce Word Error Rate (WER) by half while adapting 0.03% of all weights. Alleviating the need for cohort-specific models, we next propose the novel use of a meta-learned hypernetwork to generate highly individualized, utterance-level adaptations on-the-fly for a diverse set of atypical speech characteristics. Evaluating adaptation at the global, cohort, and individual-level, we show that hypernetworks generalize better to out-of-distribution speakers, while maintaining an overall relative WER reduction of 75.2% using 0.1% of the full parameter budget.
pdf
bib
abs
Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval
Ohad Rubin
|
Jonathan Berant
Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch and applying it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
pdf
bib
abs
Retrieval-style In-context Learning for Few-shot Hierarchical Text Classification
Huiyao Chen
|
Yu Zhao
|
Zulong Chen
|
Mengjia Wang
|
Liangyue Li
|
Meishan Zhang
|
Min Zhang
Hierarchical text classification (HTC) is an important task with broad applications, and few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely ambiguous labels. In this work, we introduce the first ICL-based framework with LLM for few-shot HTC. We exploit a retrieval database to identify relevant demonstrations, and an iterative policy to manage multi-layer hierarchical labels. Particularly, we equip the retrieval database with HTC label-aware representations for the input texts, which is achieved by continual training on a pretrained language model with masked language modeling (MLM), layer-wise classification (CLS, specifically for HTC), and a novel divergent contrastive learning (DCL, mainly for adjacent semantically similar labels) objective. Experimental results on three benchmark datasets demonstrate superior performance of our method, and we can achieve state-of-the-art results in few-shot HTC.
pdf
bib
abs
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study
Jiaang Li
|
Yova Kementchedjhieva
|
Constanza Fierro
|
Anders Søgaard
Large-scale pretrained language models (LMs) are said to “lack the ability to connect utterances to the world” (Bender and Koller, 2020), because they do not have “mental models of the world” (Mitchell and Krakauer, 2023). If so, one would expect LM representations to be unrelated to representations induced by vision models. We present an empirical evaluation across four families of LMs (BERT, GPT-2, OPT, and LLaMA-2) and three vision model architectures (ResNet, SegFormer, and MAE). Our experiments show that LMs partially converge towards representations isomorphic to those of vision models, subject to dispersion, polysemy, and frequency. This has important implications for both multi-modal processing and the LM understanding debate (Mitchell and Krakauer, 2023).1
pdf
bib
abs
Assessing the Role of Context in Chat Translation Evaluation: Is Context Helpful and Under What Conditions?
Sweta Agrawal
|
Amin Farajian
|
Patrick Fernandes
|
Ricardo Rei
|
André F. T. Martins
Despite the recent success of automatic metrics for assessing translation quality, their application in evaluating the quality of machine-translated chats has been limited. Unlike more structured texts like news, chat conversations are often unstructured, short, and heavily reliant on contextual information. This poses questions about the reliability of existing sentence-level metrics in this domain as well as the role of context in assessing the translation quality. Motivated by this, we conduct a meta-evaluation of existing automatic metrics, primarily designed for structured domains such as news, to assess the quality of machine-translated chats. We find that reference-free metrics lag behind reference-based ones, especially when evaluating translation quality in out-of-English settings. We then investigate how incorporating conversational contextual information in these metrics for sentence-level evaluation affects their performance. Our findings show that augmenting neural learned metrics with contextual information helps improve correlation with human judgments in the reference-free scenario and when evaluating translations in out-of-English settings. Finally, we propose a new evaluation metric, Context-MQM, that utilizes bilingual context with a large language model (LLM) and further validate that adding context helps even for LLM-based evaluation metrics.
pdf
bib
abs
Not Eliminate but Aggregate: Post-Hoc Control over Mixture-of-Experts to Address Shortcut Shifts in Natural Language Understanding
Ukyo Honda
|
Tatsushi Oka
|
Peinan Zhang
|
Masato Mita
Recent models for natural language understanding are inclined to exploit simple patterns in datasets, commonly known as shortcuts. These shortcuts hinge on spurious correlations between labels and latent features existing in the training data. At inference time, shortcut-dependent models are likely to generate erroneous predictions under distribution shifts, particularly when some latent features are no longer correlated with the labels. To avoid this, previous studies have trained models to eliminate the reliance on shortcuts. In this study, we explore a different direction: pessimistically aggregating the predictions of a mixture-of-experts, assuming each expert captures relatively different latent features. The experimental results demonstrate that our post-hoc control over the experts significantly enhances the model’s robustness to the distribution shift in shortcuts. Additionally, we show that our approach has some practical advantages. We also analyze our model and provide results to support the assumption.1
pdf
bib
abs
Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers
Melanie Subbiah
|
Sean Zhang
|
Lydia B. Chilton
|
Kathleen McKeown
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories, which can be lengthy, and include nuanced subtext or scrambled timelines. Importantly, we work directly with authors to ensure that the stories have not been shared online (and therefore are unseen by the models), and to obtain informed evaluations of summary quality using judgments from the authors themselves. Through quantitative and qualitative analysis grounded in narrative theory, we compare GPT-4, Claude-2.1, and LLama-2-70B. We find that all three models make faithfulness mistakes in over 50% of summaries and struggle with specificity and interpretation of difficult subtext. We additionally demonstrate that LLM ratings and other automatic metrics for summary quality do not correlate well with the quality ratings from the writers.
pdf
bib
abs
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Ansong Ni
|
Pengcheng Yin
|
Yilun Zhao
|
Martin Riddell
|
Troy Feng
|
Rui Shen
|
Stephen Yin
|
Ye Liu
|
Semih Yavuz
|
Caiming Xiong
|
Shafiq Joty
|
Yingbo Zhou
|
Dragomir Radev
|
Arman Cohan
|
Arman Cohan
Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs. Despite promising results, there is a notable lack of a comprehensive evaluation of these models’ language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning, and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition, we assess confidence calibration, and conduct human evaluations to identify typical failures across different tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We release the evaluation framework1 and all model outputs, hoping to lay the groundwork for further future research. All future evaluations (e.g., LLaMA-3, StarCoder2, etc) will be updated on the project website: https://l2c-eval.github.io/.
pdf
bib
abs
NoviCode: Generating Programs from Natural Language Utterances by Novices
Asaf Achi Mordechai
|
Yoav Goldberg
|
Reut Tsarfaty
Current Text-to-Code models demonstrate impressive capabilities in generating executable code from natural language snippets. However, current studies focus on technical instructions and programmer-oriented language, and it is an open question whether these models can effectively translate natural language descriptions given by non-technical users and express complex goals, to an executable program that contains an intricate flow—composed of API access and control structures as loops, conditions, and sequences. To unlock the challenge of generating a complete program from a plain non-technical description we present NoviCode, a novel NL Programming task, which takes as input an API and a natural language description by a novice non-programmer, and provides an executable program as output. To assess the efficacy of models on this task, we provide a novel benchmark accompanied by test suites wherein the generated program code is assessed not according to their form, but according to their functional execution. Our experiments show that, first, NoviCode is indeed a challenging task in the code synthesis domain, and that generating complex code from non-technical instructions goes beyond the current Text-to-Code paradigm. Second, we show that a novel approach wherein we align the NL utterances with the compositional hierarchical structure of the code, greatly enhances the performance of LLMs on this task, compared with the end-to-end Text-to-Code counterparts.
pdf
bib
abs
Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability
Tyler A. Chang
|
Zhuowen Tu
|
Benjamin K. Bergen
How do language models learn to make predictions during pre-training? To study this, we extract learning curves from five autoregressive English language model pre-training runs, for 1M unseen tokens in context. We observe that the language models generate short repetitive phrases before learning to generate longer and more coherent text. We also find that individual tokens often exhibit sudden increases or decreases in loss that are surprisingly consistent across pre-training runs. To better understand these fluctuations, we quantify the final surprisal, within-run variability, age of acquisition, forgettability, and cross-run variability of learning curves for individual tokens in context. More frequent tokens reach lower final surprisals, exhibit less variability within and across pre-training runs, are learned earlier, and are less likely to be “forgotten” during pre-training. Higher n-gram probabilities further accentuate these effects. Independent of the target token, shorter and more frequent contexts correlate with marginally more stable and quickly acquired predictions. Based on our results, we argue for the existence of sequential learning dependencies between different model capabilities, and we characterize language model learning as early n-gram learning before gradual refinement of tail n-gram predictions.
pdf
bib
abs
Addressing Topic Leakage in Cross-Topic Evaluation for Authorship Verification
Jitkapat Sawatphol
|
Can Udomcharoenchaikit
|
Sarana Nutanong
Authorship verification (AV) aims to identify whether a pair of texts has the same author. We address the challenge of evaluating AV models’ robustness against topic shifts. The conventional evaluation assumes minimal topic overlap between training and test data. However, we argue that there can still be topic leakage in test data, causing misleading model performance and unstable rankings. To address this, we propose an evaluation method called Heterogeneity-Informed Topic Sampling (HITS), which creates a smaller dataset with a heterogeneously distributed topic set. Our experimental results demonstrate that HITS-sampled datasets yield a more stable ranking of models across random seeds and evaluation splits. Our contributions include: 1. An analysis of causes and effects of topic leakage; 2. A demonstration of the HITS in reducing the effects of topic leakage; and 3. The Robust Authorship Verification bENchmark (RAVEN) that allows topic shortcut test to uncover AV models’ reliance on topic-specific features.
pdf
bib
abs
Beyond Prompt Brittleness: Evaluating the Reliability and Consistency of Political Worldviews in LLMs
Tanise Ceron
|
Neele Falk
|
Ana Barić
|
Dmitry Nikolaev
|
Sebastian Padó
Due to the widespread use of large language models (LLMs), we need to understand whether they embed a specific “worldview” and what these views reflect. Recent studies report that, prompted with political questionnaires, LLMs show left-liberal leanings (Feng et al., 2023; Motoki et al., 2024). However, it is as yet unclear whether these leanings are reliable (robust to prompt variations) and whether the leaning is consistent across policies and political leaning. We propose a series of tests which assess the reliability and consistency of LLMs’ stances on political statements based on a dataset of voting-advice questionnaires collected from seven EU countries and annotated for policy issues. We study LLMs ranging in size from 7B to 70B parameters and find that their reliability increases with parameter count. Larger models show overall stronger alignment with left-leaning parties but differ among policy programs: They show a (left-wing) positive stance towards environment protection, social welfare state, and liberal society but also (right-wing) law and order, with no consistent preferences in the areas of foreign policy and migration.
pdf
bib
abs
Self-supervised Topic Taxonomy Discovery in the Box Embedding Space
Yuyin Lu
|
Hegang Chen
|
Pengbo Mao
|
Yanghui Rao
|
Haoran Xie
|
Fu Lee Wang
|
Qing Li
Topic taxonomy discovery aims at uncovering topics of different abstraction levels and constructing hierarchical relations between them. Unfortunately, most prior work can hardly model semantic scopes of words and topics by holding the Euclidean embedding space assumption. What’s worse, they infer asymmetric hierarchical relations by symmetric distances between topic embeddings. As a result, existing methods suffer from problems of low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into the box embedding space, where the asymmetric metric is defined to properly infer hierarchical relations among topics. Additionally, our BoxTM explicitly infers upper-level topics based on correlation between specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate high-quality of the topic taxonomy learned by BoxTM.
pdf
bib
abs
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
Ryo Kamoi
|
Yusen Zhang
|
Nan Zhang
|
Jiawei Han
|
Rui Zhang
Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent studies also report negative results. In this work, we critically survey broad papers and discuss the conditions required for successful self-correction. We first find that prior studies often do not define their research questions in detail and involve impractical frameworks or unfair evaluations that over-evaluate self-correction. To tackle these issues, we categorize research questions in self-correction research and provide a checklist for designing appropriate experiments. Our critical survey based on the newly categorized research questions shows that (1) no prior work demonstrates successful self-correction with feedback from prompted LLMs, except for studies in tasks that are exceptionally suited for self-correction, (2) self-correction works well in tasks that can use reliable external feedback, and (3) large-scale fine-tuning enables self-correction.
pdf
bib
abs
Interactive Machine Teaching by Labeling Rules and Instances
Giannis Karamanolakis
|
Daniel Hsu
|
Luis Gravaano
Weakly supervised learning aims to reduce the cost of labeling data by using expert-designed labeling rules. However, existing methods require experts to design effective rules in a single shot, which is difficult in the absence of proper guidance and tooling. Therefore, it is still an open question whether experts should spend their limited time writing rules or instead providing instance labels via active learning. In this paper, we investigate how to exploit an expert’s limited time to create effective supervision. First, to develop practical guidelines for rule creation, we conduct an exploratory analysis of diverse collections of existing expert-designed rules and find that rule precision is more important than coverage across datasets. Second, we compare rule creation to individual instance labeling via active learning and demonstrate the importance of both across 6 datasets. Third, we propose an interactive learning framework, INTERVAL, that achieves efficiency by automatically extracting candidate rules based on rich patterns (e.g., by prompting a language model), and effectiveness by soliciting expert feedback on both candidate rules and individual instances. Across 6 datasets, INTERVAL outperforms state-of-the-art weakly supervised approaches by 7% in F1. Furthermore, it requires as few as 10 queries for expert feedback to reach F1 values that existing active learning methods cannot match even with 100 queries.
pdf
bib
abs
Conformalizing Machine Translation Evaluation
Chrysoula Zerva
|
André F. T. Martins
Several uncertainty estimation methods have been recently proposed for machine translation evaluation. While these methods can provide a useful indication of when not to trust model predictions, we show in this paper that the majority of them tend to underestimate model uncertainty, and as a result, they often produce misleading confidence intervals that do not cover the ground truth. We propose as an alternative the use of conformal prediction, a distribution-free method to obtain confidence intervals with a theoretically established guarantee on coverage. First, we demonstrate that split conformal prediction can “correct” the confidence intervals of previous methods to yield a desired coverage level, and we demonstrate these findings across multiple machine translation evaluation metrics and uncertainty quantification methods. Further, we highlight biases in estimated confidence intervals, reflected in imbalanced coverage for different attributes, such as the language and the quality of translations. We address this by applying conditional conformal prediction techniques to obtain calibration subsets for each data subgroup, leading to equalized coverage. Overall, we show that, provided access to a calibration set, conformal prediction can help identify the most suitable uncertainty quantification methods and adapt the predicted confidence intervals to ensure fairness with respect to different attributes.1
pdf
bib
abs
Predicting Human Translation Difficulty with Neural Machine Translation
Zheng Wei Lim
|
Ekaterina Vylomova
|
Charles Kemp
|
Trevor Cohn
Human translators linger on some words and phrases more than others, and predicting this variation is a step towards explaining the underlying cognitive processes. Using data from the CRITT Translation Process Research Database, we evaluate the extent to which surprisal and attentional features derived from a Neural Machine Translation (NMT) model account for reading and production times of human translators. We find that surprisal and attention are complementary predictors of translation difficulty, and that surprisal derived from a NMT model is the single most successful predictor of production duration. Our analyses draw on data from hundreds of translators operating across 13 language pairs, and represent the most comprehensive investigation of human translation difficulty to date.
pdf
bib
abs
Conformal Prediction for Natural Language Processing: A Survey
Margarida Campos
|
António Farinhas
|
Chrysoula Zerva
|
Mário A. T. Figueiredo
|
André F. T. Martins
The rapid proliferation of large language models and natural language processing (NLP) applications creates a crucial need for uncertainty quantification to mitigate risks such as Hallucinations and to enhance decision-making reliability in critical applications. Conformal prediction is emerging as a theoretically sound and practically useful framework, combining flexibility with strong statistical guarantees. Its model-agnostic and distribution-free nature makes it particularly promising to address the current shortcomings of NLP systems that stem from the absence of uncertainty quantification. This paper provides a comprehensive survey of conformal prediction techniques, their guarantees, and existing applications in NLP, pointing to directions for future research and open challenges.
pdf
bib
abs
FINCH: Prompt-guided Key-Value Cache Compression for Large Language Models
Giulio Corallo
|
Paolo Papotti
Recent large language model applications, such as Retrieval-Augmented Generation and chatbots, have led to an increased need to process longer input contexts. However, this requirement is hampered by inherent limitations. Architecturally, models are constrained by a context window defined during training. Additionally, processing extensive texts requires substantial GPU memory. We propose a novel approach, Finch, to compress the input context by leveraging the pre-trained model weights of the self-attention. Given a prompt and a long text, Finch iteratively identifies the most relevant Key (K) and Value (V) pairs over chunks of the text conditioned on the prompt. Only such pairs are stored in the KV cache, which, within the space constrained by the context window, ultimately contains a compressed version of the long text. Our proposal enables models to consume large inputs even with high compression (up to 93x) while preserving semantic integrity without the need for fine-tuning.
pdf
bib
abs
Hierarchical Indexing for Retrieval-Augmented Opinion Summarization
Tom Hosking
|
Hao Tang
|
Mirella Lapata
We propose a method for unsupervised abstractive opinion summarization, that combines the attributability and scalability of extractive approaches with the coherence and fluency of Large Language Models (LLMs). Our method, HIRO, learns an index structure that maps sentences to a path through a semantically organized discrete hierarchy. At inference time, we populate the index and use it to identify and retrieve clusters of sentences containing popular opinions from input reviews. Then, we use a pretrained LLM to generate a readable summary that is grounded in these extracted evidential clusters. The modularity of our approach allows us to evaluate its efficacy at each stage. We show that HIRO learns an encoding space that is more semantically structured than prior work, and generates summaries that are more representative of the opinions in the input reviews. Human evaluation confirms that HIRO generates significantly more coherent, detailed, and accurate summaries.
pdf
bib
abs
A Survey on Model Compression for Large Language Models
Xunyu Zhu
|
Jian Li
|
Yong Liu
|
Can Ma
|
Weiping Wang
Large Language Models (LLMs) have transformed natural language processing tasks successfully. Yet, their large size and high computational needs pose challenges for practical use, especially in resource-limited settings. Model compression has emerged as a key research area to address these challenges. This paper presents a survey of model compression techniques for LLMs. We cover methods like quantization, pruning, and knowledge distillation, highlighting recent advancements. We also discuss benchmarking strategies and evaluation metrics crucial for assessing compressed LLMs. This survey offers valuable insights for researchers and practitioners, aiming to enhance efficiency and real-world applicability of LLMs while laying a foundation for future advancements.
pdf
bib
abs
Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization
Yangyang Zhao
|
Mehdi Dastani
|
Jinchuan Long
|
Zhenyu Wang
|
Shihan Wang
Training a task-oriented dialogue policy using deep reinforcement learning is promising but requires extensive environment exploration. The amount of wasted invalid exploration makes policy learning inefficient. In this paper, we define and argue that dead-end states are important reasons for invalid exploration. When a conversation enters a dead-end state, regardless of the actions taken afterward, it will continue in a dead-end trajectory until the agent reaches a termination state or maximum turn. We propose a Dead-end Detection and Resurrection (DDR) method that detects dead-end states in an efficient manner and provides a rescue action to guide and correct the exploration direction. To prevent dialogue policies from repeating errors, DDR also performs dialogue data augmentation by adding relevant experiences that include dead-end states and penalties into the experience pool. We first validate the dead-end detection reliability and then demonstrate the effectiveness and generality of the method across various domains through experiments on four public dialogue datasets.
pdf
bib
abs
Filtered Corpus Training (FiCT) Shows that Language Models Can Generalize from Indirect Evidence
Abhinav Patil
|
Jaap Jumelet
|
Yu Ying Chiu
|
Andy Lapastora
|
Peter Shen
|
Lexie Wang
|
Clevis Willrich
|
Shane Steinert-Threlkeld
This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect evidence.
pdf
bib
abs
Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models
Andreas Waldis
|
Yotam Perlitz
|
Leshem Choshen
|
Yufang Hou
|
Iryna Gurevych
We introduce Holmes, a new benchmark designed to assess language models’ (LMs’) linguistic competence—their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs’ internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs’ linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.
pdf
bib
abs
TabVer: Tabular Fact Verification with Natural Logic
Rami Aly
|
Andreas Vlachos
Fact verification on tabular evidence incentivizes the use of symbolic reasoning models where a logical form is constructed (e.g., a LISP-style program), providing greater verifiability than fully neural approaches. However, these logical forms typically rely on well-formed tables, restricting their use in many scenarios. An emerging symbolic reasoning paradigm for textual evidence focuses on natural logic inference, which constructs proofs by modeling set-theoretic relations between a claim and its evidence in natural language. This approach provides flexibility and transparency but is less compatible with tabular evidence since the relations do not extend to arithmetic functions. We propose a set-theoretic interpretation of numerals and arithmetic functions in the context of natural logic, enabling the integration of arithmetic expressions in deterministic proofs. We leverage large language models to generate arithmetic expressions by generating questions about salient parts of a claim which are answered by executing appropriate functions on tables. In a few-shot setting on FEVEROUS, we achieve an accuracy of 71.4, outperforming both fully neural and symbolic reasoning models by 3.4 points. When evaluated on TabFact without any further training, our method remains competitive with an accuracy lead of 0.5 points.
pdf
bib
abs
The Causal Influence of Grammatical Gender on Distributional Semantics
Karolina Stańczak
|
Kevin Du
|
Adina Williams
|
Isabelle Augenstein
|
Ryan Cotterell
How much meaning influences gender assignment across languages is an active area of research in linguistics and cognitive science. We can view current approaches as aiming to determine where gender assignment falls on a spectrum, from being fully arbitrarily determined to being largely semantically determined. For the latter case, there is a formulation of the neo-Whorfian hypothesis, which claims that even inanimate noun gender influences how people conceive of and talk about objects (using the choice of adjective used to modify inanimate nouns as a proxy for meaning). We offer a novel, causal graphical model that jointly represents the interactions between a noun’s grammatical gender, its meaning, and adjective choice. In accordance with past results, we find a significant relationship between the gender of nouns and the adjectives that modify them. However, when we control for the meaning of the noun, the relationship between grammatical gender and adjective choice is near zero and insignificant.
pdf
bib
abs
Toward Robust RALMs: Revealing the Impact of Imperfect Retrieval on Retrieval-Augmented Language Models
Seong-Il Park
|
Jay-Yoon Lee
Retrieval Augmented Language Models (RALMs) have gained significant attention for their ability to generate accurate answers and improve efficiency. However, RALMs are inherently vulnerable to imperfect information due to their reliance on the imperfect retriever or knowledge source. We identify three common scenarios—unanswerable, adversarial, conflicting—where retrieved document sets can confuse RALMs with plausible real-world examples. We present the first comprehensive investigation to assess how well RALMs detect and handle such problematic scenarios. Among these scenarios, to systematically examine adversarial robustness we propose a new adversarial attack method, Generative model-based ADVersarial attack (GenADV) and a novel metric Robustness under Additional Document (RAD). Our findings reveal that RALMs often fail to identify the unanswerability or contradiction of a document set, which frequently leads to hallucinations. Moreover, we show that the addition of an adversary significantly degrades RALM’s performance, with the model becoming even more vulnerable when the two scenarios overlap (adversarial+ unanswerable). Our research identifies critical areas for assessing and enhancing the robustness of RALMs, laying the foundation for the development of more robust models.1
pdf
bib
abs
IndoCulture: Exploring Geographically Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces
Fajri Koto
|
Rahmad Mahendra
|
Nurul Aisyah
|
Timothy Baldwin
Although commonsense reasoning is greatly shaped by cultural and geographical factors, previous studies have predominantly centered on cultures grounded in the English language, potentially resulting in an Anglocentric bias. In this paper, we introduce IndoCulture, aimed at understanding the influence of geographical factors on language model reasoning ability, with a specific emphasis on the diverse cultures found within eleven Indonesian provinces. In contrast to prior work that has relied on templates (Yin et al., 2022) and online scrapping (Fung et al., 2024), we create IndoCulture by asking local people to manually develop a cultural context and plausible options, across a set of predefined topics. Evaluation of 27 language models reveals several insights: (1) the open-weight Llama–3 is competitive with GPT–4, while other open-weight models struggle, with accuracies below 50%; (2) there is a general pattern of models generally performing better for some provinces, such as Bali and West Java, and less well for others; and (3) the inclusion of location context enhances performance, especially for larger models like GPT–4, emphasizing the significance of geographical context in commonsense reasoning.1
pdf
bib
abs
SCL: Selective Contrastive Learning for Data-driven Zero-shot Relation Extraction
Ning Pang
|
Xiang Zhao
|
Weixin Zeng
|
Zhen Tan
|
Weidong Xiao
Relation extraction has evolved from supervised relation extraction to zero-shot setting due to the continuous emergence of newly generated relations. Some pioneering works handle zero-shot relation extraction by reformulating it into proxy tasks, such as reading comprehension and textual entailment. Nonetheless, the divergence in proxy task formulations from relation extraction hinders the acquisition of informative semantic representations, leading to subpar performance. Therefore, in this paper, we take a data-driven view to handle zero-shot relation extraction under a three-step paradigm, including encoder training, relation clustering, and summarization. Specifically, to train a discriminative relational encoder, we propose a novel selective contrastive learning framework, namely, SCL, where selective importance scores are assigned to distinguish the importance of different negative contrastive instances. During testing, the prompt-based encoder is employed to map test samples into representation vectors, which are then clustered into several groups. Typical samples closest to the cluster centroid are selected for summarization to generate the predicted relation for all samples in the cluster. Moreover, we design a simple non-parametric threshold plugin to reduce false-positive errors in inference on unseen relation representations. Our experiments demonstrate that SCL outperforms the current state-of-the-art method by over 3% across all metrics.
pdf
bib
abs
Deuce: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning
Jiaxin Guo
|
C. L. Philip Chen
|
Shuzhen Li
|
Tong Zhang
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation. It provides high-quality data at a low annotation cost for label-scarce text classification. However, existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning. To address these issues, this paper proposes a novel dual-diversity enhancing and uncertainty-aware (Deuce) framework for CSAL. Specifically, Deuce leverages a pretrained language model (PLM) to efficiently extract textual representations, class predictions, and predictive uncertainty. Then, it constructs a Dual-Neighbor Graph (DNG) to combine information on both textual diversity and class diversity, ensuring a balanced data distribution. It further propagates uncertainty information via density-based clustering to select hard representative instances. Deuce performs well in selecting class-balanced and hard representative data by dual-diversity and informativeness. Experiments on six NLP datasets demonstrate the superiority and efficiency of Deuce.
pdf
bib
abs
Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?
Vagrant Gautam
|
Eileen Bingert
|
Dawei Zhu
|
Anne Lauscher
|
Dietrich Klakow
Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: Given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop on average by 34 percentage points. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.