The 18th Conference of the European Chapter of the Association for Computational Linguistics

St. Julian’s, Malta
March, 2024




pdf (full)
bib (full)
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Yvette Graham | Matthew Purver

pdf bib
Enhancing Ethical Explanations of Large Language Models through Iterative Symbolic Refinement
Xin Quan | Marco Valentino | Louise Dennis | Andre Freitas

An increasing amount of research in Natural Language Inference (NLI) focuses on the application and evaluation of Large Language Models (LLMs) and their reasoning capabilities. Despite their success, however, LLMs are still prone to factual errors and inconsistencies in their explanations, offering limited control and interpretability for inference in complex domains. In this paper, we focus on ethical NLI, investigating how hybrid neuro-symbolic techniques can enhance the logical validity and alignment of ethical explanations produced by LLMs. Specifically, we present an abductive-deductive framework named Logic-Explainer, which integrates LLMs with an external backward-chaining solver to refine step-wise natural language explanations and jointly verify their correctness, reduce incompleteness and minimise redundancy. An extensive empirical analysis demonstrates that Logic-Explainer can improve explanations generated via in-context learning methods and Chain-of-Thought (CoT) on challenging ethical NLI tasks, while, at the same time, producing formal proofs describing and supporting models’ reasoning. As ethical NLI requires commonsense reasoning to identify underlying moral violations, our results suggest the effectiveness of neuro-symbolic methods for multi-step NLI more broadly, opening new opportunities to enhance the logical consistency, reliability, and alignment of LLMs.

pdf bib
Multi-Relational Hyperbolic Word Embeddings from Natural Language Definitions
Marco Valentino | Danilo Carvalho | Andre Freitas

Natural language definitions possess a recursive, self-explanatory semantic structure that can support representation learning methods able to preserve explicit conceptual relations and constraints in the latent space. This paper presents a multi-relational model that explicitly leverages such a structure to derive word embeddings from definitions. By automatically extracting the relations linking defined and defining terms from dictionaries, we demonstrate how the problem of learning word embeddings can be formalised via a translational framework in Hyperbolic space and used as a proxy to capture the global semantic structure of definitions. An extensive empirical analysis demonstrates that the framework can help imposing the desired structural constraints while preserving the semantic mapping required for controllable and interpretable traversal. Moreover, the experiments reveal the superiority of the Hyperbolic word embeddings over the Euclidean counterparts and demonstrate that the multi-relational approach can obtain competitive results when compared to state-of-the-art neural models, with the advantage of being intrinsically more efficient and interpretable

pdf bib
Anisotropy Is Inherent to Self-Attention in Transformers
Nathan Godey | Éric Clergerie | Benoît Sagot

The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations tend to demonstrate that anisotropy might actually be inherent to Transformers-based models.

pdf bib
Generating Benchmarks for Factuality Evaluation of Language Models
Dor Muhlgay | Ori Ram | Inbal Magar | Yoav Levine | Nir Ratner | Yonatan Belinkov | Omri Abend | Kevin Leyton-Brown | Amnon Shashua | Yoav Shoham

Before deploying a language model (LM) within a given domain, it is important to measure its tendency to generate factually incorrect information in that domain. Existing methods for factuality evaluation of LLM generation focus on facts sampled from the LM itself, and thus do not control the set of evaluated facts and might under-represent domain specific or rare facts. We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality. FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM’s propensity to generate true facts from the corpus vs. similar but incorrect statements. We use our framework to create three benchmarks: Wiki-FACTOR, News-FACTOR and Expert-FACTOR. We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation, as measured by human annotators.

pdf bib
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
Simone Balloccu | Patrícia Schmidtová | Mateusz Lango | Ondrej Dusek

Natural Language Processing (NLP) research is increasingly focusing on the use of Large Language Models (LLMs), with some of the most popular ones being either fully or partially closed-source. The lack of access to model details, especially regarding training data, has repeatedly raised concerns about data contamination among researchers. Several attempts have been made to address this issue, but they are limited to anecdotal evidence and trial and error. Additionally, they overlook the problem of indirect data leaking, where modelsare iteratively improved by using data coming from users. In this work, we conduct the first systematic analysis of work using OpenAI’s GPT-3.5 and GPT-4, the most prominently used LLMs today, in the context of data contamination. By analysing 255 papers and considering OpenAI’s data usage policy, we extensively document the amount of data leaked to these models during the first year after the model’s release. We report that these models have been globally exposed to ∼4.7M samples from 263 benchmarks. At the same time, we document a number of evaluation malpractices emerging in the reviewed papers, such as unfair or missing baseline comparisons and reproducibility issues. We release our results as a collaborative project on, where other researchers can contribute to our efforts.

pdf bib
Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning
Danna Zheng | Mirella Lapata | Jeff Pan

We present Archer, a challenging bilingual text-to-SQL dataset specific to complex reasoning, including arithmetic, commonsense and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, along with 521 unique SQL queries, covering 20 English databases across 20 domains. Notably, this dataset demonstrates a significantly higher level of complexity compared to existing publicly available datasets. Our evaluation shows that Archer challenges the capabilities of current state-of-the-art models, with a high-ranked model on the Spider leaderboard achieving only 6.73% execution accuracy on Archer test set. Thus, Archer presents a significant challenge for future research in this field.

pdf bib
GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution
Yining Lu | Haoping Yu | Daniel Khashabi

Augmenting large language models (LLM) to use external tools enhances their performance across a variety of tasks. However, prior works over-rely on task-specific demonstration of tool use that limits their generalizability and computational cost due to making many calls to large-scale LLMs. We introduce GEAR, a computationally efficient query-tool grounding algorithm that is generalizable to various tasks that require tool use while not relying on task-specific demonstrations. GEAR achieves better efficiency by delegating tool grounding and execution to small language models (SLM) and LLM, respectively; while leveraging semantic and pattern-based evaluation at both question and answer levels for generalizable tool grounding. We evaluate GEAR on 14 datasets across 6 downstream tasks, demonstrating its strong generalizability to novel tasks, tools and different SLMs. Despite offering more efficiency, GEAR achieves higher precision in tool grounding compared to prior strategies using LLM prompting, thus improving downstream accuracy at a reduced computational cost. For example, we demonstrate that GEAR-augmented GPT-J and GPT-3 outperform counterpart tool-augmented baselines because of better tool use.

pdf bib
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models
Adian Liusie | Potsawee Manakul | Mark Gales

Current developments in large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. An interesting application of these systems is in the automated assessment of natural language generation (NLG), a highly challenging area with great practical benefit. In this paper, we explore two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment: absolute score prediction, and comparative assessment which uses relative comparisons between pairs of candidates. Though comparative assessment has not been extensively studied in NLG assessment, we note that humans often find it more intuitive to compare two options rather than scoring each one independently. This work examines comparative assessment from multiple perspectives: performance compared to absolute grading; positional biases in the prompt; and efficient ranking in terms of the number of comparisons. We illustrate that LLM comparative assessment is a simple, general and effective approach for NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring, and in many cases can achieve performance competitive with state-of-the-art methods. Additionally, we demonstrate that LLMs often exhibit strong positional biases when making pairwise comparisons, and we propose debiasing methods that can further improve performance.

pdf bib
Parameter-Efficient Conversational Recommender System as a Language Processing Task
Mathieu Ravaut | Hao Zhang | Lu Xu | Aixin Sun | Yong Liu

Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation. Prior work often utilizes external knowledge graphs for items’ semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items. This combination of multiple components suffers from a cumber-some training process, and leads to semantic misalignment issues between dialogue generation and item recommendation. In this paper, we represent items in natural language and formulate CRS as a natural language processing task. Accordingly, we leverage the power of pre-trained language models to encode items, understand user intent via conversation, perform item recommendation through semantic matching, and generate dialogues. As a unified model, our PECRS (Parameter-Efficient CRS), can be optimized in a single stage, without relying on non-textual metadata such as a knowledge graph. Experiments on two benchmark CRS datasets, ReDial and INSPIRED, demonstrate the effectiveness of PECRS on recommendation and conversation. Our code is available at:

pdf bib
OpenPI2.0: An Improved Dataset for Entity Tracking in Texts
Li Zhang | Hainiu Xu | Abhinav Kommula | Chris Callison-Burch | Niket Tandon

Much texts describe a changing world (e.g., procedures, stories, newswires), and understanding them requires tracking how entities change. An earlier dataset, OpenPI, provided crowdsourced annotations of entity state changes in text. However, a major limitation was that those annotations were free-form and did not identify salient changes, hampering model evaluation. To overcome these limitations, we present an improved dataset, OpenPI2.0, where entities and attributes are fully canonicalized and additional entity salience annotations are added. On our fairer evaluation setting, we find that current state-of-the-art language models are far from competent. We also show that using state changes of salient entities as a chain-of-thought prompt, downstream performance is improved on tasks such as question answering and classical planning, outperforming the setting involving all related entities indiscriminately. We offer OpenPI2.0 for the continued development of models that can understand the dynamics of entities in text.

pdf bib
A Comparative Multidimensional Analysis of Empathetic Systems
Andrew Lee | Jonathan Kummerfeld | Larry Ann | Rada Mihalcea

Recently, empathetic dialogue systems have received significant attention.While some researchers have noted limitations, e.g., that these systems tend to generate generic utterances, no study has systematically verified these issues. We survey 21 systems, asking what progress has been made on the task. We observe multiple limitations of current evaluation procedures. Most critically, studies tend to rely on a single non-reproducible empathy score, which inadequately reflects the multidimensional nature of empathy. To better understand the differences between systems, we comprehensively analyze each system with automated methods that are grounded in a variety of aspects of empathy. We find that recent systems lack three important aspects of empathy: specificity, reflection levels, and diversity. Based on our results, we discuss problematic behaviors that may have gone undetected in prior evaluations, and offer guidance for developing future systems.

pdf bib
Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering
Mingda Chen | Xilun Chen | Wen-tau Yih

Few-shot learning for open domain multi-hop question answering typically relies on the in-context learning capability of large language models (LLMs). While powerful, these LLMs usually contain tens or hundreds of billions of parameters, making them rather inefficient at inference time. To improve performance of smaller language models, we propose a data synthesis framework for multi-hop question answering that requires less than 10 human-annotated question answer pairs. Our framework depends only on rich, naturally-occurring relationships among documents and is built upon the data generation functions parameterized by LLMs and prompts. We synthesize millions of multi-hop questions and claims to finetune language models, evaluated on popular benchmarks for multi-hop question answering and fact verification. Empirically, our approach improves model performance significantly, allowing the finetuned models to be competitive with GPT-3.5 based approaches while being almost one-third the size in parameter count.

pdf bib
Language Models as Inductive Reasoners
Zonglin Yang | Li Dong | Xinya Du | Hao Cheng | Erik Cambria | Xiaodong Liu | Jianfeng Gao | Furu Wei

Inductive reasoning is a core component of human intelligence. In the past research of inductive reasoning within computer science, formal language is used as representations of knowledge (facts and rules, more specifically). However, formal language can cause systematic problems for inductive reasoning such as disability of handling raw input such as natural language, sensitiveness to mislabeled data, and incapacity to handle ambiguous input. To this end, we propose a new paradigm (task) for inductive reasoning, which is to induce natural language rules from natural language facts, and create a dataset termed DEER containing 1.2k rule-fact pairs for the task, where rules and facts are written in natural language. New automatic metrics are also proposed and analysed for the evaluation of this task. With DEER, we investigate a modern approach for inductive reasoning where we use natural language as representation for knowledge instead of formal language and use pretrained language models as ”reasoners”. Moreover, we provide the first and comprehensive analysis of how well pretrained language models can induce natural language rules from natural language facts. We also propose a new framework drawing insights from philosophy literature for this task, which we show in the experiment section that surpasses baselines in both automatic and human evaluations. We discuss about our future perspectives for inductive reasoning in Section 7. Dataset and code are available at

pdf bib
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
David Adelani | Hannah Liu | Xiaoyu Shen | Nikita Vassilyev | Jesujoba Alabi | Yanke Mao | Haonan Gao | En-Shiun Lee

Despite the progress in building multilingual language models, evaluation is often limited to a few languages with available datasets which excludes a large number of low-resource languages. In this paper, we create SIB-200—a large-scale open-sourced benchmark dataset for topic classification in 205 languages and dialects to address the lack of evaluation dataset for Natural Language Understanding (NLU). For many of the languages covered in SIB-200, this is the first publicly available evaluation dataset for NLU. The dataset is based on Flores-200 machine translation corpus. We annotated the English portion of the dataset and extended the sentence-level annotation to the remaining 204 languages covered in the corpus. Despite the simplicity of this task, our evaluation in full-supervised setting, cross-lingual transfer setting and prompting of large language model setting show that there is still a large gap between the performance of high-resource and low-resource languages when multilingual evaluation is scaled to numerous world languages. We found that languages unseen during the pre-training of multilingual language models, languages from under-represented families (like Nilotic and Altantic-Congo), and languages from the regions of Africa, Americas, Oceania and South East Asia, often have the lowest performance on our topic classification dataset. We hope our dataset %will encourages a more inclusive evaluation of multilingual language models on a more diverse set of languages.

pdf bib
FinBPM: A Framework for Portfolio Management-based Financial Investor Behavior Perception Model
Zhilu Zhang | Procheta Sen | Zimu Wang | Ruoyu Sun | Zhengyong Jiang | Jionglong Su

The goal of portfolio management is to simultaneously maximize the accumulated return and also to control risk. In consecutive trading periods, portfolio manager needs to continuously adjust the portfolio weights based on the factors which can cause price fluctuation in the market. In the stock market, the factors affecting the stock price can be divided into two categories. The first is price fluctuations caused by irrational investment of the speculators. The second is endogenous value changes caused by operations of the company. In recent years, with the advancement of artificial intelligence technology, reinforcement learning (RL) algorithms have been increasingly employed by scholars to address financial problems, particularly in the area of portfolio management. However, the deep RL models proposed by these scholars in the past have focused more on analyzing the price changes caused by the investment behavior of speculators in response to technical indicators of actual stock prices. In this research, we introduce an RL-based framework called FinBPM, which takes both the factor pertaining to the impact on operations of the company and the factor of the irrational investment of the speculator into consideration. For our experimentation, we randomly selected twelve stocks from the Dow Jones Industrial Index to construct our portfolio. The experimental results reveal that, in comparison to conventional reinforcement learning methods, our approach with at least 13.26% increase over other methods compared. Additionally, it achieved the best Sharpe ratio of 2.77, effectively maximizing the return per unit of risk.

pdf bib
Asking the Right Question at the Right Time: Human and Model Uncertainty Guidance to Ask Clarification Questions
Alberto Testoni | Raquel Fernández

Clarification questions are an essential dialogue tool to signal misunderstanding, ambiguities, and under-specification in language use. While humans are able to resolve uncertainty by asking questions since childhood, modern dialogue systems struggle to generate effective questions. To make progress in this direction, in this work we take a collaborative dialogue task as a testbed and study how model uncertainty relates to human uncertainty—an as yet under-explored problem. We show that model uncertainty does not mirror human clarification-seeking behavior, which suggests that using human clarification questions as supervision for deciding when to ask may not be the most effective way to resolve model uncertainty. To address this issue, we propose an approach to generating clarification questions based on model uncertainty estimation, compare it to several alternatives, and show that it leads to significant improvements in terms of task success. Our findings highlight the importance of equipping dialogue systems with the ability to assess their own uncertainty and exploit in interaction.

pdf bib
Like a Good Nearest Neighbor: Practical Content Moderation and Text Classification
Luke Bates | Iryna Gurevych

Few-shot text classification systems have impressive capabilities but are infeasible to deploy and use reliably due to their dependence on prompting and billion-parameter language models. SetFit (Tunstall, 2022) is a recent, practical approach that fine-tunes a Sentence Transformer under a contrastive learning paradigm and achieves similar results to more unwieldy systems. Inexpensive text classification is important for addressing the problem of domain drift in all classification tasks, and especially in detecting harmful content, which plagues social media platforms. Here, we propose Like a Good Nearest Neighbor (LaGoNN), a modification to SetFit that introduces no learnable parameters but alters input text with information from its nearest neighbor, for example, the label and text, in the training data, making novel data appear similar to an instance on which the model was optimized. LaGoNN is effective at flagging undesirable content and text classification, and improves SetFit’s performance. To demonstrate LaGoNN’s value, we conduct a thorough study of text classification systems in the context of content moderation under four label distributions, and in general and multilingual classification settings.

pdf bib
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
Fajri Koto | Tilman Beck | Zeerak Talat | Iryna Gurevych | Timothy Baldwin

Improving multilingual language models capabilities in low-resource languages is generally difficult due to the scarcity of large-scale data in those languages. In this paper, we relax the reliance on texts in low-resource languages by using multilingual lexicons in pretraining to enhance multilingual capabilities. Specifically, we focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets. We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets, and large language models like GPT–3.5, BLOOMZ, and XGLM. These findings are observable for unseen low-resource languages to code-mixed scenarios involving high-resource languages.

pdf bib
CEAN: Contrastive Event Aggregation Network with LLM-based Augmentation for Event Extraction
Zihao Meng | Tao Liu | Heng Zhang | Kai Feng | Peng Zhao

Event Extraction is a crucial yet arduous task in natural language processing (NLP), as its performance is significantly hindered by laborious data annotation. Given this challenge, recent research has predominantly focused on two approaches: pretraining task-oriented models for event extraction and employing data augmentation techniques. These methods involve integrating external knowledge, semantic structures, or artificially generated samples using large language models (LLMs). However, their performances can be compromised due to two fundamental issues. Firstly, the alignment between the introduced knowledge and event extraction knowledge is crucial. Secondly, the introduction of data noise during the augmentation is unavoidable and can mislead the model’s convergence. To address these issues, we propose a Contrastive Event Aggregation Network with LLM-based Augmentation to promote low-resource learning and reduce data noise for event extraction. Different from the existing methods introducing linguistic knowledge into data augmentation, an event aggregation network is established to introduce event knowledge into supervised learning by constructing adaptively-updated semantic representation for trigger and argument. For LLM-based augmentation, we design a new scheme including a multi-pattern rephrasing paradigm and a data-free composing paradigm. Instead of directly using augmentation samples in the supervised task, we introduce span-level contrastive learning to reduce data noise. Experiments on the ACE2005 and ERE-EN demonstrate that our proposed approach achieves new state-of-the-art results on both of the two datasets.

pdf bib
How Transferable are Attribute Controllers on Pretrained Multilingual Translation Models?
Danni Liu | Jan Niehues

Customizing machine translation models to comply with desired attributes (e.g., formality or grammatical gender) is a well-studied topic. However, most current approaches rely on (semi-)supervised data with attribute annotations. This data scarcity bottlenecks democratizing such customization possibilities to a wider range of languages, particularly lower-resource ones. This gap is out of sync with recent progress in pretrained massively multilingual translation models. In response, we transfer the attribute controlling capabilities to languages without attribute-annotated data with an NLLB-200 model as a foundation. Inspired by techniques from controllable generation, we employ a gradient-based inference-time controller to steer the pretrained model. The controller transfers well to zero-shot conditions, as it is operates on pretrained multilingual representations and is attribute- rather than language-specific. With a comprehensive comparison to finetuning-based control, we demonstrate that, despite finetuning’s clear dominance in supervised settings, the gap to inference-time control closes when moving to zero-shot conditions, especially with new and distant target languages. The latter also shows stronger domain robustness. We further show that our inference-time control complements finetuning. Moreover, a human evaluation on a real low-resource language, Bengali, confirms our findings. Our code is in the supplementary material.

pdf bib
MultiMUC: Multilingual Template Filling on MUC-4
William Gantt | Shabnam Behzad | Hannah An | Yunmo Chen | Aaron White | Benjamin Van Durme | Mahsa Yarmohammadi

We introduce MultiMUC, the first multilingual parallel corpus for template filling, comprising translations of the classic MUC-4 template filling benchmark into five languages: Arabic, Chinese, Farsi, Korean, and Russian. We obtain automatic translations from a strong multilingual machine translation system and manually project the original English annotations into each target language. For all languages, we also provide human translations for key portions of the dev and test splits. Finally, we present baselines on MultiMUC both with state-of-the-art template filling models for MUC-4 and with ChatGPT. We release MUC-4 and the supervised baselines to facilitate further work on document-level information extraction in multilingual settings.

pdf bib
Align and Augment: Generative Data Augmentation for Compositional Generalization
Francesco Cazzaro | Davide Locatelli | Ariadna Quattoni

Recent work on semantic parsing has shown that seq2seq models find compositional generalization challenging. Several strategies have been proposed to mitigate this challenge. One such strategy is to improve compositional generalization via data augmentation techniques. In this paper we follow this line of work and propose Archer, a data-augmentation strategy that exploits alignment annotations between sentences and their corresponding meaning representations. More precisely, we use alignments to train a two step generative model that combines monotonic lexical generation with reordering. Our experiments show that Archer leads to significant improvements in compositional generalization performance.

pdf bib
UNSEE: Unsupervised Non-contrastive Sentence Embeddings
Ömer Çağatan

In this paper, we introduce UNSEE, which stands for Unsupervised Non-Contrastive Sentence Embeddings. UNSEE demonstrates better performance compared to SimCSE in the Massive Text Embedding (MTEB) benchmark. We begin by highlighting the issue of representation collapse that occurs with the replacement of contrastive objectives with non-contrastive objectives in SimCSE. Subsequently, we introduce a straightforward solution called the target network to mitigate this problem. This approach enables us to harness non-contrastive objectives while ensuring training stability and achieving performance improvements similar to those seen with contrastive objectives. We have reached peak performance in non-contrastive sentence embeddings through extensive fine-tuning and optimization. These efforts have resulted in superior sentence representation models, emphasizing the importance of careful tuning and optimization for non-contrastive objectives.

pdf bib
EXPLORER: Exploration-guided Reasoning for Textual Reinforcement Learning
Kinjal Basu | Keerthiram Murugesan | Subhajit Chaudhury | Murray Campbell | Kartik Talamadupula | Tim Klinger

Text-based games (TBGs) have emerged as an important collection of NLP tasks, requiring reinforcement learning (RL) agents to combine natural language understanding with reasoning. A key challenge for agents attempting to solve such tasks is to generalize across multiple games and demonstrate good performance on both seen and unseen objects. Purely deep-RL-based approaches may perform well on seen objects; however, they fail to showcase the same performance on unseen objects. Commonsense-infused deep-RL agents may work better on unseen data; unfortunately, their policies are often not interpretable or easily transferable. To tackle these issues, in this paper, we present EXPLORER which is an exploration-guided reasoning agent for textual reinforcement learning. EXPLORER is neuro-symbolic in nature, as it relies on a neural module for exploration and a symbolic module for exploitation. It can also learn generalized symbolic policies and perform well over unseen data. Our experiments show that EXPLORER outperforms the baseline agents on Text-World cooking (TW-Cooking) and Text-World Commonsense (TWC) games.

pdf bib
From Text Segmentation to Smart Chaptering: A Novel Benchmark for Structuring Video Transcriptions
Fabian Retkowski | Alexander Waibel

Text segmentation is a fundamental task in natural language processing, where documents are split into contiguous sections. However, prior research in this area has been constrained by limited datasets, which are either small in scale, synthesized, or only contain well-structured documents. In this paper, we address these limitations by introducing a novel benchmark YTSeg focusing on spoken content that is inherently more unstructured and both topically and structurally diverse. As part of this work, we introduce an efficient hierarchical segmentation model MiniSeg, that outperforms state-of-the-art baselines. Lastly, we expand the notion of text segmentation to a more practical “smart chaptering” task that involves the segmentation of unstructured content, the generation of meaningful segment titles, and a potential real-time application of the models.

pdf bib
Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels
Negar Arabzadeh | Charles Clarke

The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has presented significant challenges in evaluating the performance of these systems. One of the main challenges is the scarcity of human-labeled data, which hinders the fair and accurate assessment of these systems. In this work, we specifically focus on evaluating IR systems with sparse labels, borrowing from recent research on evaluating computer vision tasks.taking inspiration from the success of using Fréchet Inception Distance (FID) in assessing text-to-image generation systems. We propose leveraging the Fréchet Distance to measure the distance between the distributions of relevant judged items and retrieved results. Our experimental results on MS MARCO V1 dataset and TREC Deep Learning Tracks query sets demonstrate the effectiveness of the Fréchet Distance as a metric for evaluating IR systems, particularly in settings where a few labels are available.This approach contributes to the advancement of evaluation methodologies in real-world scenarios such as the assessment of generative IR systems.

pdf bib
Semantic Sensitivities and Inconsistent Predictions: Measuring the Fragility of NLI Models
Erik Arakelyan | Zhaoqi Liu | Isabelle Augenstein

Recent studies of the emergent capabilities of transformer-based Natural Language Understanding (NLU) models have indicated that they have an understanding of lexical and compositional semantics. We provide evidence that suggests these claims should be taken with a grain of salt: we find that state-of-the-art Natural Language Inference (NLI) models are sensitive towards minor semantics preserving surface-form variations, which lead to sizable inconsistent model decisions during inference. Notably, this behaviour differs from valid and in-depth comprehension of compositional semantics, however does neither emerge when evaluating model accuracy on standard benchmarks nor when probing for syntactic, monotonic, and logically robust reasoning. We propose a novel framework to measure the extent of semantic sensitivity. To this end, we evaluate NLI models on adversarially generated examples containing minor semantics-preserving surface-form input noise. This is achieved using conditional text generation, with the explicit condition that the NLI model predicts the relationship between the original and adversarial inputs as a symmetric equivalence entailment. We systematically study the effects of the phenomenon across NLI models for in- and out-of- domain settings. Our experiments show that semantic sensitivity causes performance degradations of 12.92% and 23.71% average over in- and out-of- domain settings, respectively. We further perform ablation studies, analysing this phenomenon across models, datasets, and variations in inference and show that semantic sensitivity can lead to major inconsistency within model predictions.

pdf bib
Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties
Ekaterina Artemova | Verena Blaschke | Barbara Plank

Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages.We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data.Inspired by prior work on English varieties, we craft and manually evaluate perturbation rules that transform German sentences into colloquial forms and use them to synthesize test sets in four ToD datasets.Our perturbation rules cover 18 distinct language phenomena, enabling us to explore the impact of each perturbation on slot and intent performance.Using these new datasets, we conduct an experimental evaluation across six different transformers.Here, we demonstrate that when applied to colloquial varieties, ToD systems maintain their intent recognition performance, losing 6% (4.62 percentage points) in accuracy on average. However, they exhibit a significant drop in slot detection, with a decrease of 31% (21 percentage points) in slot F1 score.Our findings are further supported by a transfer experiment from Standard American English to synthetic Urban African American Vernacular English.

pdf bib
PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents
Simeng Sun | Yang Liu | Shuohang Wang | Dan Iter | Chenguang Zhu | Mohit Iyyer

Strategies such as chain-of-thought prompting improve the performance of large language models (LLMs) on complex reasoning tasks by decomposing input examples into intermediate steps. However, it remains unclear how to apply such methods to reason over long input documents, in which both the decomposition and the output of each intermediate step are non-trivial to obtain. In this work, we propose PEARL, a prompting framework to improve reasoning over long documents, which consists of three stages: action mining, plan formulation, and plan execution. More specifically, given a question about a long document, PEARL decomposes the question into a sequence of actions (e.g., SUMMARIZE, FIND_EVENT, FIND_RELATION) and then executes them over the document to obtain the answer. Each stage of PEARL is implemented via zero-shot or few-shot prompting of LLMs (in our work, GPT-4) with minimal human input. We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts. PEARL outperforms zero-shot and chain-of-thought prompting on this dataset, and ablation experiments show that each stage of PEARL is critical to its performance. Overall, PEARL is a first step towards leveraging LLMs to reason over long documents.

pdf bib
LAraBench: Benchmarking Arabic AI with Large Language Models
Ahmed Abdelali | Hamdy Mubarak | Shammur Chowdhury | Maram Hasanain | Basel Mousi | Sabri Boughorbel | Samir Abdaljalil | Yassine El Kheir | Daniel Izham | Fahim Dalvi | Majd Hawasly | Nizi Nazar | Youssef Elshahawy | Ahmed Ali | Nadir Durrani | Natasa Milic-Frayling | Firoj Alam

Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ~296K data points, ~46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.

pdf bib
SentenceLDA: Discriminative and Robust Document Representation with Sentence Level Topic Model
Taehun Cha | Donghun Lee

A subtle difference in context results in totally different nuances even for lexically identical words. On the other hand, two words can convey similar meanings given a homogeneous context. As a result, considering only word spelling information is not sufficient to obtain quality text representation. We propose SentenceLDA, a sentence-level topic model. We combine modern SentenceBERT and classical LDA to extend the semantic unit from word to sentence. By extending the semantic unit, we verify that SentenceLDA returns more discriminative document representation than other topic models, while maintaining LDA’s elegant probabilistic interpretability. We also verify the robustness of SentenceLDA by comparing the inference results on original and paraphrased texts. Additionally, we implement one possible application of SentenceLDA on corpus-level key opinion mining by applying SentenceLDA on an argumentative corpus, DebateSum.

pdf bib
Towards Hierarchical Spoken Language Disfluency Modeling
Jiachen Lian | Gopala Anumanchipalli

Speech dysfluency modeling is the bottleneck for both speech therapy and language learning. However, there is no AI solution to systematically tackle this problem. We first propose to define the concept of dysfluent speech and dysfluent speech modeling. We then present Hierarchical Unconstrained Dysfluency Modeling (H-UDM) approach that addresses both dysfluency transcription and detection to eliminate the need for extensive manual annotation. Furthermore, we introduce a simulated dysfluent dataset called VCTK++ to enhance the capabilities of H-UDM in phonetic transcription. Our experimental results demonstrate the effectiveness and robustness of our proposed methods in both transcription and detection tasks.

pdf bib
Finding a Needle in the Adversarial Haystack: A Targeted Paraphrasing Approach For Uncovering Edge Cases with Minimal Distribution Distortion
Aly Kassem | Sherif Saad

Adversarial attacks against Language models (LMs) are a significant concern. In particular, adversarial samples exploit the model’s sensitivity to small input changes. While these changes appear insignificant on the semantics of the input sample, they result in significant decay in model performance. In this paper, we propose Targeted Paraphrasing via RL (TPRL), an approach to automatically learn a policy to generate challenging samples that improve the model’s performance. TPRL leverages FLAN-T5, a language model, as a generator and employs a self-learned policy using a proximal policy optimization to generate the adversarial examples automatically. TPRL’s reward is based on the confusion induced in the classifier, preserving the original text meaning through a Mutual Implication score. We demonstrate & evaluate TPRL’s effectiveness in discovering natural adversarial attacks and improving model performance through extensive experiments on four diverse NLP classification tasks via Automatic & Human evaluation. TPRL outperforms strong baselines, exhibits generalizability across classifiers and datasets, and combines the strengths of language modeling and reinforcement learning to generate diverse and influential adversarial examples.

pdf bib
FAIR: Filtering of Automatically Induced Rules
Divya Jyoti Bajpai | Ayush Maheshwari | Manjesh Hanawal | Ganesh Ramakrishnan

Availability of large annotated data can be a critical bottleneck in training machine learning algorithms successfully, especially when applied to diverse domains. Weak supervision offers a promising alternative by accelerating the creation of labeled training data using domain-specific rules. However, it requires users to write a diverse set of high-quality rules to assign labels to the unlabeled data (eg., Snorkel (CITATION)). Automatic Rule Induction (ARI) approaches such as Snuba (CITATION) circumvent this problem by automatically creating rules from features on a small labeled set and filtering a final set of rules from them. In the ARI approach, the crucial step is to filter out a set of a high-quality useful subset of rules from the large set of automatically created rules. In this paper, we propose an algorithm FAIR (Filtering of Automatically Induced Rules) to filter rules from a large number of automatically induced rules using submodular objective functions that account for the collective precision, coverage, and conflicts of the rule set. We experiment with three ARI approaches and five text classification datasets to validate the superior performance of our algorithm with respect to several semi-supervised label aggregation approaches. We show that our approach achieves statistically significant results in comparison to existing rule-filtering approaches. The anonymized source code is available at

pdf bib
NNOSE: Nearest Neighbor Occupational Skill Extraction
Mike Zhang | Rob Goot | Min-Yen Kan | Barbara Plank

The labor market is changing rapidly, prompting increased interest in the automatic extraction of occupational skills from text. With the advent of English benchmark job description datasets, there is a need for systems that handle their diversity well. We tackle the complexity in occupational skill datasets tasks—combining and leveraging multiple datasets for skill extraction, to identify rarely observed skills within a dataset, and overcoming the scarcity of skills across datasets. In particular, we investigate the retrieval-augmentation of language models, employing an external datastore for retrieving similar skills in a dataset-unifying manner. Our proposed method, Nearest Neighbor Occupational Skill Extraction (NNOSE) effectively leverages multiple datasets by retrieving neighboring skills from other datasets in the datastore. This improves skill extraction without additional fine-tuning. Crucially, we observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.

pdf bib
GAINER: Graph Machine Learning with Node-specific Radius for Classification of Short Texts and Documents
Naganand Yadati

Graphs provide a natural, intuitive, and holistic means to capture relationships between different text elements in Natural Language Processing (NLP) such as words, sentences, and documents. Recent advancements in the field of Graph Machine Learning (GML) have led to the development of numerous models to process text for various natural language applications, including but not limited to short-text classification, document classification, and others.At the heart of GML models, specifically those based on Graph Neural Networks (GNNs), lies the message passing operation which has shown to be an essential component for strong empirical performance in NLP.However, the number of message passing steps (often known as the radius) is fixed for all the nodes in existing GML models for NLP.Fixing the radius poses a fundamental restriction as nodes exhibit diverse properties and varying amounts of informative local structures in the input graph.This paper presents GAINER, a novel framework called Graph mAchine learnIng with Node-spEcific Radius, aimed at graph-based NLP. We propose non-neural and novel neural approaches built on the core ideas of GAINER.Through rigorous experimentation, we demonstrate the efficacy of GAINER in various popular NLP tasks.

pdf bib
MAFIA: Multi-Adapter Fused Inclusive Language Models
Prachi Jain | Ashutosh Sathe | Varun Gumma | Kabir Ahuja | Sunayana Sitaram

Pretrained Language Models (PLMs) are widely used in NLP for various tasks. Recent studies have identified various biases that such models exhibit and have proposed methods to correct these biases. However, most of the works address a limited set of bias dimensions independently such as gender, race, or religion. Moreover, the methods typically involve finetuning the full model in order to maintain the performance on the downstream task. In this work, we aim to modularly debias a pre-trained language model across multiple dimensions. Previous works extensively explored debiasing PLMs by using limited US-centric counterfactual data augmentation (CDA). We use structured knowledge and a large generative model to build a diverse CDA across multiple bias dimensions in a semi-automated way. We highlight how existing debiasing methods do not consider interactions between multiple societal biases and propose a debiasing model that exploits the synergy amongst various societal biases and enables multi-bias debiasing simultaneously. An extensive evaluation on multiple tasks and languages demonstrates the efficacy of the approach.

pdf bib
Code-Switched Language Identification is Harder Than You Think
Laurie Burchell | Alexandra Birch | Robert Thompson | Kenneth Heafield

Code switching (CS) is a very common phenomenon in written and spoken communication, but is handled poorly by many NLP applications. Looking to the application of building CS corpora, we explore CS language identification for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as a sentence-level multi-label tagging problem to make it more tractable. Having defined the task, we investigate three reasonable architectures for this task and define metrics which better reflect desired performance. We present empirical evidence that no current approach is adequate, and finally provide recommendations for future work in this area.

pdf bib
Generation-driven Contrastive Self-training for Zero-shot Text Classification with Instruction-following LLM
Ruohong Zhang | Yau-Shian Wang | Yiming Yang

The remarkable performance of large language models (LLMs) in zero-shot language understanding has garnered significant attention.However, employing LLMs for large-scale inference or domain-specific fine-tuning requires immense computational resources due to their substantial model size. To overcome these limitations, we introduce a novel method, namely GenCo, which leverages the strong generative power of LLMs to assist in training a smaller and more adaptable language model. In our method, an LLM plays an important role in the self-training loop of a smaller model in two important ways. Firstly, we utilize an LLM to generate multiple augmented texts for each input instance to enhance its semantic meaning for better understanding. Secondly, we additionally generate high-quality training instances conditioned on predicted labels, ensuring the generated texts are relevant to the labels. In this way, GenCo not only corrects the errors of predicted labels during self-training but also eliminates the need for extensive unlabeled texts. In our experiments, GenCo outperforms previous state-of-the-art methods when only limited (<5% of original) in-domain text data is available. Notably, our approach surpasses Alpaca-7B with human instructions, highlighting the significance of self-training.

pdf bib
Quantifying the Hyperparameter Sensitivity of Neural Networks for Character-level Sequence-to-Sequence Tasks
Adam Wiemerslage | Kyle Gorman | Katharina Wense

Hyperparameter tuning, the process of searching for suitable hyperparameters, becomes more difficult as the computing resources required to train neural networks continue to grow. This topic continues to receive little attention and discussion—much of it hearsay—despite its obvious importance. We attempt to formalize hyperparameter sensitivity using two metrics: similarity-based sensitivity and performance-based sensitivity. We then use these metrics to quantify two such claims: (1) transformers are more sensitive to hyperparameter choices than LSTMs and (2) transformers are particularly sensitive to batch size. We conduct experiments on two different character-level sequence-to-sequence tasks and find that, indeed, the transformer is slightly more sensitive to hyperparameters according to both of our metrics. However, we do not find that it is more sensitive to batch size in particular.

pdf bib
Examining Gender and Racial Bias in Large Vision–Language Models Using a Novel Dataset of Parallel Images
Kathleen Fraser | Svetlana Kiritchenko

Following on recent advances in large language models (LLMs) and subsequent chat models, a new wave of large vision–language models (LVLMs) has emerged. Such models can incorporate images as input in addition to text, and perform tasks such as visual question answering, image captioning, story generation, etc. Here, we examine potential gender and racial biases in such systems, based on the perceived characteristics of the people in the input images. To accomplish this, we present a new dataset PAIRS (PArallel Images for eveRyday Scenarios). The PAIRS dataset contains sets of AI-generated images of people, such that the images are highly similar in terms of background and visual content, but differ along the dimensions of gender (man, woman) and race (Black, white). By querying the LVLMs with such images, we observe significant differences in the responses according to the perceived gender or race of the person depicted.

pdf bib
ConstraintChecker: A Plugin for Large Language Models to Reason on Commonsense Knowledge Bases
Quyet V. Do | Tianqing Fang | Shizhe Diao | Zhaowei Wang | Yangqiu Song

Reasoning over Commonsense Knowledge Bases (CSKB), i.e. CSKB reasoning, has been explored as a way to acquire new commonsense knowledge based on reference knowledge in the original CSKBs and external prior knowledge.Despite the advancement of Large Language Models (LLM) and prompt engineering techniques in various reasoning tasks, they still struggle to deal with CSKB reasoning.One of the problems is that it is hard for them to acquire explicit relational constraints in CSKBs from only in-context exemplars, due to a lack of symbolic reasoning capabilities (CITATION).To this end, we proposed **ConstraintChecker**, a plugin over prompting techniques to provide and check explicit constraints.When considering a new knowledge instance, ConstraintChecker employs a rule-based module to produce a list of constraints, then it uses a zero-shot learning module to check whether this knowledge instance satisfies all constraints.The acquired constraint-checking result is then aggregated with the output of the main prompting technique to produce the final output.Experimental results on CSKB Reasoning benchmarks demonstrate the effectiveness of our method by bringing consistent improvements over all prompting methods.

pdf bib
A* shortest string decoding for non-idempotent semirings
Kyle Gorman | Cyril Allauzen

Abstract: The single shortest path algorithm is undefined for weighted finite-state automata over non-idempotent semirings because such semirings do not guarantee the existence of a shortest path. However, in non-idempotent semirings admitting an order satisfying a monotonicity condition (such as the plus-times or log semirings), the shortest string is well-defined. We describe an algorithm which finds the shortest string for a weighted non-deterministic automaton over such semirings using the backwards shortest distance of an equivalent deterministic automaton (DFA) as a heuristic for A* search performed over a companion idempotent semiring, which is proven to return the shortest string. There may be exponentially more states in the DFA, but the proposed algorithm needs to visit only a small fraction of them if determinization is performed “on the fly”.

pdf bib
Importance-Aware Data Augmentation for Document-Level Neural Machine Translation
Minghao Wu | Yufei Wang | George Foster | Lizhen Qu | Gholamreza Haffari

Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive, in contrast to its sentence-level counterpart. However, due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity. To overcome this issue, we propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients. We conduct comprehensive experiments on three widely-used DocNMT benchmarks. Our empirical results show that our proposed IADA outperforms strong DocNMT baselines as well as several data augmentation approaches, with statistical significance on both sentence-level and document-level BLEU.

pdf bib
Lost in Translationese? Reducing Translation Effect Using Abstract Meaning Representation
Shira Wein | Nathan Schneider

Translated texts bear several hallmarks distinct from texts originating in the language (“translationese”). Though individual translated texts are often fluent and preserve meaning, at a large scale, translated texts have statistical tendencies which distinguish them from text originally written in the language and can affect model performance. We frame the novel task of translationese reduction and hypothesize that Abstract Meaning Representation (AMR), a graph-based semantic representation which abstracts away from the surface form, can be used as an interlingua to reduce the amount of translationese in translated texts. By parsing English translations into an AMR and then generating text from that AMR, the result more closely resembles originally English text across three quantitative macro-level measures, without severely compromising fluency or adequacy. We compare our AMR-based approach against three other techniques based on machine translation or paraphrase generation. This work represents the first approach to reducing translationese in text and highlights the promise of AMR, given that our AMR-based approach outperforms more computationally intensive methods.

pdf bib
Comparing Template-based and Template-free Language Model Probing
Sagi Shaier | Kevin Bennett | Lawrence Hunter | Katharina Wense

The differences between cloze-task language model (LM) probing with 1) expert-made templates and 2) naturally-occurring text have often been overlooked. Here, we evaluate 16 different LMs on 10 probing English datasets – 4 template-based and 6 template-free – in general and biomedical domains to answer the following research questions: (RQ1) Do model rankings differ between the two approaches? (RQ2) Do models’ absolute scores differ between the two approaches? (RQ3) Do the answers to RQ1 and RQ2 differ between general and domain-specific models? Our findings are: 1) Template-free and template-based approaches often rank models differently, except for the top domain- specific models. 2) Scores decrease by up to 42% Acc@1 when comparing parallel template-free and template-based prompts. 3) Perplexity is negatively correlated with accuracy in the template-free approach, but, counter-intuitively, they are positively correlated for template-based probing. 4) Models tend to predict the same answers frequently across prompts for template-based probing, which is less common when employing template-free techniques.

pdf bib
Desiderata For The Context Use Of Question Answering Systems
Sagi Shaier | Lawrence Hunter | Katharina Wense

Prior work has uncovered a set of common problems in state-of-the-art context-based question answering (QA) systems: a lack of attention to the context when the latter conflicts with a model’s parametric knowledge, little robustness to noise, and a lack of consistency with their answers. However, most prior work focus on one or two of those problems in isolation, which makes it difficult to see trends across them. We aim to close this gap, by first outlining a set of – previously discussed as well as novel – desiderata for QA models. We then survey relevant analysis and methods papers to provide an overview of the state of the field. The second part of our work presents experiments where we evaluate 15 QA systems on 5 datasets according to all desiderata at once. We find many novel trends, including (1) systems that are less susceptible to noise are not necessarily more consistent with their answers when given irrelevant context; (2) most systems that are more susceptible to noise are more likely to correctly answer according to a context that conflicts with their parametric knowledge; and (3) the combination of conflicting knowledge and noise can reduce system performance by up to 96%. As such, our desiderata help increase our understanding of how these models work and reveal potential avenues for improvements.

pdf bib
Scaling up Discovery of Latent Concepts in Deep NLP Models
Majd Hawasly | Fahim Dalvi | Nadir Durrani

Despite the revolution caused by deep NLP models, they remain black boxes, necessitating research to understand their decision-making processes. A recent work by Dalvi et al. (2022) carried out representation analysis through the lens of clustering latent spaces within pre-trained models (PLMs), but that approach is limited to small scale due to the high cost of running Agglomerative hierarchical clustering. This paper studies clustering algorithms in order to scale the discovery of encoded concepts in PLM representations to larger datasets and models. We propose metrics for assessing the quality of discovered latent concepts and use them to compare the studied clustering algorithms. We found that K-Means-based concept discovery significantly enhances efficiency while maintaining the quality of the obtained concepts. Furthermore, we demonstrate the practicality of this newfound efficiency by scaling latent concept discovery to LLMs and phrasal concepts.

pdf bib
AnthroScore: A Computational Linguistic Measure of Anthropomorphism
Myra Cheng | Kristina Gligoric | Tiziano Piccardi | Dan Jurafsky

Anthropomorphism, or the attribution of human-like characteristics to non-human entities, has shaped conversations about the impacts and possibilities of technology. We present AnthroScore, an automatic metric of implicit anthropomorphism in language. We use a masked language model to quantify how non-human entities are implicitly framed as human by the surrounding context. We show that AnthroScore corresponds with human judgments of anthropomorphism and dimensions of anthropomorphism described in social science literature. Motivated by concerns of misleading anthropomorphism in computer science discourse, we use AnthroScore to analyze 15 years of research papers and downstream news articles. In research papers, we find that anthropomorphism has steadily increased over time, and that papers related to language models have the most anthropomorphism. Within ACL papers, temporal increases in anthropomorphism are correlated with key neural advancements. Building upon concerns of scientific misinformation in mass media, we identify higher levels of anthropomorphism in news headlines compared to the research papers they cite. Since AnthroScore is lexicon-free, it can be directly applied to a wide range of text sources.

pdf bib
Centering the Speech Community
Steven Bird | Dean Yibarbuk

How can NLP/AI practitioners engage with oral societies and develop locally appropriate language technologies? We report on our experience of working together over five years in a remote community in the far north of Australia, and how we prototyped simple language technologies to support our collaboration. We navigated different understandings of language, the functional differentiation of oral vs institutional languages, and the distinct technology opportunities for each. Our collaboration unsettled the first author’s western framing of language as data for exploitation by machines, and we devised a design pattern that seems better aligned with local interests and aspirations. We call for new collaborations on the design of locally appropriate technologies for languages with primary orality.

pdf bib
Improving the TENOR of Labeling: Re-evaluating Topic Models for Content Analysis
Zongxia Li | Andrew Mao | Daniel Stephens | Pranav Goel | Emily Walpole | Alden Dima | Juan Fung | Jordan Boyd-Graber

Topic models are a popular tool for understanding text collections, but their evaluation has been a point of contention. Automated evaluation metrics such as coherence are often used, however, their validity has been questioned for neural topic models (NTMs) and can overlook a model’s benefits in real-world applications. To this end, we conduct the first evaluation of neural, supervised and classical topic models in an interactive task-based setting. We combine topic models with a classifier and test their ability to help humans conduct content analysis and document annotation. From simulated, real user and expert pilot studies, the Contextual Neural Topic Model does the best on cluster evaluation metrics and human evaluations; however, LDA is competitive with two other NTMs under our simulated experiment and user study results, contrary to what coherence scores suggest. We show that current automated metrics do not provide a complete picture of topic modeling capabilities, but the right choice of NTMs can be better than classical models on practical tasks.

pdf bib
Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga | Nisansa De Silva | Velayuthan Menan | Aloka Fernando | Charitha Rathnayake

We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

pdf bib
VOLTAGE: A Versatile Contrastive Learning based OCR Methodology for ultra low-resource scripts through Auto Glyph Feature Extraction
Prawaal Sharma | Poonam Goyal | Vidisha Sharma | Navneet Goyal

UNESCO has classified 2500 out of 7000 languages spoken worldwide as endangered. Attrition of a language leads to loss of traditional wisdom, folk literature, and the essence of the community that uses it. It is therefore imperative to bring digital inclusion to these languages and avoid its extinction. Low resource languages are at a greater risk of extinction. Lack of unsupervised Optical Character Recognition(OCR) methodologies for low resource languages is one of the reasons impeding their digital inclusion. We propose VOLTAGE - a contrastive learning based OCR methodology, leveraging auto-glyph feature recommendation for cluster-based labelling. We augment the labelled data for diversity and volume using image transformations and Generative Adversarial Networks. Voltage has been designed using Takri - a family of scripts used in 16th to 20th century in the Himalayan regions of India. We present results for Takri along with other Indic scripts (both low and high resource) to substantiate the universal behavior of the methodology. An accuracy of 95% for machine printed and 87% for handwritten samples on Takri script has been achieved. We conduct baseline and ablation studies along with building downstream use cases for Takri, demonstrating the usefulness of our work.

pdf bib
Unsupervised Contrast-Consistent Ranking with Language Models
Niklas Stoehr | Pengxiang Cheng | Jing Wang | Daniel Preotiuc-Pietro | Rajarshi Bhowmik

Language models contain ranking-based knowledge and are powerful solvers of in-context ranking tasks. For instance, they may have parametric knowledge about the ordering of countries by size or may be able to rank product reviews by sentiment. We compare pairwise, pointwise and listwise prompting techniques to elicit a language model’s ranking knowledge. However, we find that even with careful calibration and constrained decoding, prompting-based techniques may not always be self-consistent in the rankings they produce. This motivates us to explore an alternative approach that is inspired by an unsupervised probing method called Contrast-Consistent Search (CCS). The idea is to train a probe guided by a logical constraint: a language model’s representation of a statement and its negation must be mapped to contrastive true-false poles consistently across multiple statements. We hypothesize that similar constraints apply to ranking tasks where all items are related via consistent, pairwise or listwise comparisons. To this end, we extend the binary CCS method to Contrast-Consistent Ranking (CCR) by adapting existing ranking methods such as the Max-Margin Loss, Triplet Loss and an Ordinal Regression objective. Across different models and datasets, our results confirm that CCR probing performs better or, at least, on a par with prompting.

pdf bib
Entity-level Factual Adaptiveness of Fine-tuning based Abstractive Summarization Models
Jongyoon Song | Nohil Park | Bongkyu Hwang | Jaewoong Yun | Seongho Joe | Youngjune Gwon | Sungroh Yoon

Abstractive summarization models often generate factually inconsistent content particularly when the parametric knowledge of the model conflicts with the knowledge in the input document. In this paper, we analyze the robustness of fine-tuning based summarization models to the knowledge conflict, which we call factual adaptiveness. We utilize pre-trained language models to construct evaluation sets and find that factual adaptiveness is not strongly correlated with factual consistency on original datasets. Furthermore, we introduce a controllable counterfactual data augmentation method where the degree of knowledge conflict within the augmented data can be adjustable. Our experimental results on two pre-trained language models (PEGASUS and BART) and two fine-tuning datasets (XSum and CNN/DailyMail) demonstrate that our method enhances factual adaptiveness while achieving factual consistency on original datasets on par with the contrastive learning baseline.

pdf bib
Meme-ingful Analysis: Enhanced Understanding of Cyberbullying in Memes Through Multimodal Explanations
Prince Jha | Krishanu Maity | Raghav Jain | Apoorv Verma | Sriparna Saha | Pushpak Bhattacharyya

Internet memes have gained significant influence in communicating political, psychological, and sociocultural ideas. While meme are often humorous, there has been a rise in the use of memes for trolling and cyberbullying. Although a wide variety of effective deep learning-based models have been developed for detecting offensive multimodal memes, only a few works have been done on explainability aspect. Recent laws like “right to explanations” of General Data Protection Regulation, have spurred research in developing interpretable models rather than only focusing on performance. Motivated by this, we introduce MultiBully-Ex, the first benchmark dataset for multimodal explanation from code-mixed cyberbullying memes. Here, both visual and textual modalities are highlighted to explain why a given meme is cyberbullying. A Contrastive Language-Image Pretraining (CLIP) projection based multimodal shared-private multitask approach has been proposed for visual and textual explanation of a meme. Experimental results demonstrate that training with multimodal explanations improves performance in generating textual justifications and more accurately identifying the visual evidence supporting a decision with reliable performance improvements.

pdf bib
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Minghao Wu | Abdul Waheed | Chiyu Zhang | Muhammad Abdul-Mageed | Alham Aji

Large language models (LLMs) with instruction fine-tuning demonstrate superior generative capabilities. However, these models are resource-intensive. To alleviate this issue, we explore distilling knowledge from instruction-tuned LLMs into much smaller ones. While other similar works have been done, they are often conducted on a limited set of (usually still large) models and are not accompanied by proper evaluations. To this end, we carefully develop a large set of 2.58M instructions based on both existing and newly-generated instructions. In addition to being sizable, we design our instructions to cover a broad set of topics to ensure diversity. Extensive analysis of our instruction dataset confirms its diversity, and we generate responses for these instructions using gpt-3.5-turbo. Leveraging these instructions, we fine-tune a diverse herd of models, collectively referred to as LaMini-LM, which includes models from both the encoder-decoder and decoder-only families, with varying sizes. We evaluate the performance of our models using automatic metrics on 15 different natural language processing (NLP) benchmarks, as well as through human assessment. We also assess the model for hallucination and toxicity, and for the former, we introduce a new benchmark dataset for hallucination-inducing QA. The results demonstrate that our proposed LaMini-LM models are comparable to strong baselines while being much smaller in size.

pdf bib
Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer
V.S.D.S.Mahesh Akavarapu | Arnab Bhattacharya

Identification of cognates across related languages is one of the primary problems in historical linguistics. Automated cognate identification is helpful for several downstream tasks including identifying sound correspondences, proto-language reconstruction, phylogenetic classification, etc. Previous state-of-the-art methods are mostly based on distributions of phonemes computed across multilingual wordlists and make little use of the cognacy labels that define links among cognate clusters. In this paper, we present a transformer-based architecture inspired by computational biology for the task of automated cognate detection. Beyond a certain amount of supervision, this method performs better than the existing methods, and shows steady improvement with further increase in supervision proving the efficacy of utilizing the labeled information. We also demonstrate that accepting multiple sequence alignments as input and having an end-to-end architecture with link prediction head saves much computation time while simultaneously yielding superior performance.

pdf bib
Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding
Kaiyan Zhao | Qiyu Wu | Xin-Qiang Cai | Yoshimasa Tsuruoka

Learning multilingual sentence embeddings is a fundamental task in natural language processing. Recent trends in learning both monolingual and multilingual sentence embeddings are mainly based on contrastive learning (CL) among an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multilingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information for learning.In order to investigate the impact of multiple positives in CL, we propose a novel approach, named MPCL, to effectively utilize multiple positive instances to improve the learning of multilingual sentence embeddings. Experimental results on various backbone models and downstream tasks demonstrate that MPCL leads to better retrieval, semantic similarity, and classification performance compared to conventional CL. We also observe that in unseen languages, sentence embedding models trained on multiple positives show better cross-lingual transfer performance than models trained on a single positive instance.

pdf bib
Moderation in the Wild: Investigating User-Driven Moderation in Online Discussions
Neele Falk | Eva Vecchi | Iman Jundi | Gabriella Lapesa

Effective content moderation is imperative for fostering healthy and productive discussions in online domains. Despite the substantial efforts of moderators, the overwhelming nature of discussion flow can limit their effectiveness. However, it is not only trained moderators who intervene in online discussions to improve their quality. “Ordinary” users also act as moderators, actively intervening to correct information of other users’ posts, enhance arguments, and steer discussions back on course.This paper introduces the phenomenon of user moderation, documenting and releasing UMOD, the first dataset of comments in whichusers act as moderators. UMOD contains 1000 comment-reply pairs from the subreddit r/changemyview with crowdsourced annotations from a large annotator pool and with a fine-grained annotation schema targeting the functions of moderation, stylistic properties(aggressiveness, subjectivity, sentiment), constructiveness, as well as the individual perspectives of the annotators on the task. The releaseof UMOD is complemented by two analyses which focus on the constitutive features of constructiveness in user moderation and on thesources of annotator disagreements, given the high subjectivity of the task.

pdf bib
Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching
Kurt Micallef | Nizar Habash | Claudia Borg | Fadhl Eryani | Houda Bouamor

Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model’s pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

pdf bib
Where Do We Go From Here? Multi-scale Allocentric Relational Inferencefrom Natural Spatial Descriptions
Tzuf Paz-Argaman | John Palowitch | Sayali Kulkarni | Jason Baldridge | Reut Tsarfaty

The concept of acquired spatial knowledge is crucial in spatial cognitive research, particularly when it comes to communicating routes. However, NLP navigation studies often overlook the impact of acquired knowledge on textual descriptions. Current navigation studies concentrate on egocentric local descriptions (e.g., ‘it will be on your right’) that require reasoning over the agent’s local perception. These instructions are typically given in a sequence of steps, with each action-step explicitly mentioned and followed by a landmark that the agent can use to verify that they are on the correct path (e.g., ‘turn right and then you will see...’). In contrast, descriptions based on knowledge acquired through a map provide a complete view of the environment and capture its compositionality. These instructions typically contain allocentric relations, are non-sequential, with implicit actions and multiple spatial relations without any verification (e.g., ‘south of Central Park and a block north of a police station’). This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map-knowledge. Our analysis reveals that RVS exhibits a richer use of spatial allocentric relations, and requires resolving more spatial relations simultaneously compared to previous text-based navigation benchmarks.

pdf bib
Bias in Opinion Summarisation from Pre-training to Adaptation: A Case Study in Political Bias
Nannan Huang | Haytham Fayek | Xiuzhen Zhang

Opinion summarisation aims to summarise the salient information and opinions presented in documents such as product reviews, discussion forums, and social media texts into short summaries that enable users to effectively understand the opinions therein.Generating biased summaries has the risk of potentially swaying public opinion. Previous studies focused on studying bias in opinion summarisation using extractive models, but limited research has paid attention to abstractive summarisation models. In this study, using political bias as a case study, we first establish a methodology to quantify bias in abstractive models, then trace it from the pre-trained models to the task of summarising social media opinions using different models and adaptation methods. We find that most models exhibit intrinsic bias. Using a social media text summarisation dataset and contrasting various adaptation methods, we find that tuning a smaller number of parameters is less biased compared to standard fine-tuning; however, the diversity of topics in training data used for fine-tuning is critical.

pdf bib
Document Structure in Long Document Transformers
Jan Buchmann | Max Eichler | Jan-Micha Bodensohn | Ilia Kuznetsov | Iryna Gurevych

Long documents often exhibit structure with hierarchically organized elements of different functions, such as section headers and paragraphs. Despite the omnipresence of document structure, its role in natural language processing (NLP) remains opaque. Do long-document Transformer models acquire an internal representation of document structure during pre-training? How can structural information be communicated to a model after pre-training, and how does it influence downstream performance? To answer these questions, we develop a novel suite of probing tasks to assess structure-awareness of long-document Transformers, propose general-purpose structure infusion methods, and evaluate the effects of structure infusion on QASPER and Evidence Inference, two challenging long-document NLP tasks. Results on LED and LongT5 suggest that they acquire implicit understanding of document structure during pre-training, which can be further enhanced by structure infusion, leading to improved end-task performance. To foster research on the role of document structure in NLP modeling, we make our data and code publicly available.

pdf bib
The Role of Data Curation in Image Captioning
Wenyan Li | Jonas Lotz | Chen Qiu | Desmond Elliott

Image captioning models are typically trained by treating all samples equally, neglecting to account for mismatched or otherwise difficult data points. In contrast, recent work has shown the effectiveness of training models by scheduling the data using curriculum learning strategies. This paper contributes to this direction by actively curating difficult samples in datasets without increasing the total number of samples. We explore the effect of using three data curation methods within the training process: complete removal of an sample, caption replacement, or image replacement via a text-to-image generation model. Experiments on the Flickr30K and COCO datasets with the BLIP and BEiT-3 models demonstrate that these curation methods do indeed yield improved image captioning models, underscoring their efficacy.

pdf bib
Large-Scale Bitext Corpora Provide New Evidence for Cognitive Representations of Spatial Terms
Peter Viechnicki | Kevin Duh | Anthony Kostacos | Barbara Landau

Recent evidence from cognitive science suggests that there exist two classes of cognitive representations within the spatial terms of a language, one represented geometrically (e.g., above, below) and the other functionally (e.g., on, in). It has been hypothesized that geometric terms are more constrained and are mastered relatively early in language learning, whereas functional terms are less constrained and are mastered over longer time periods (Landau, 2016). One consequence of this hypothesis is that these two classes should exhibit different cross-linguistic variability, which is supported by human elicitation studies. In this work we present to our knowledge the first corpus-based empirical test of this hypothesis. We develop a pipeline for extracting, isolating, and aligning spatial terms in basic locative constructions from parallel text. Using Shannon entropy to measure the variability of spatial term use across eight languages, we find supporting evidence that variability in functional terms differs significantly from that of geometric terms. We also perform latent variable modeling and find support for the division of spatial terms into geometric and functional classes.

pdf bib
REFINER: Reasoning Feedback on Intermediate Representations
Debjit Paul | Mete Ismayilzada | Maxime Peyrard | Beatriz Borges | Antoine Bosselut | Robert West | Boi Faltings

Language models (LMs) have recently shown remarkable performance on reasoning tasks by explicitly generating intermediate inferences,e.g., chain-of-thought prompting. However, these intermediate inference steps may be inappropriate deductions from the initial contextand lead to incorrect final predictions. Here we introduce REFINER, a framework for finetuning LMs to explicitly generate intermediate reasoning steps while interacting with a critic model that provides automated feedback on the reasoning. Specifically, the critic provides structured feedback that the reasoning LM uses to iteratively improve its intermediate arguments. Empirical evaluations of REFINER on three diverse reasoning tasks show significant improvements over baseline LMs of comparable scale. Furthermore, when using GPT-3.5 or ChatGPT as the reasoner, the trained critic significantly improves reasoning without finetuning the reasoner. Finally, our critic model is trained without expensive human-in-the-loop data but can be substituted with humans at inference time.

pdf bib
HumBEL: A Human-in-the-Loop Approach for Evaluating Demographic Factors of Language Models in Human-Machine Conversations
Anthony Sicilia | Jennifer Gates | Malihe Alikhani

While demographic factors like age and gender change the way people talk, and in particular, the way people talk to machines, there is little investigation into how large pre-trained language models (LMs) can adapt to these changes. To remedy this gap, we consider how demographic factors in LM language skills can be measured to determine compatibility with a target demographic. We suggest clinical techniques from Speech Language Pathology, which has norms for acquisition of language skills in humans. We conduct evaluation with a domain expert (i.e., a clinically licensed speech language pathologist), and also propose automated techniques to complement clinical evaluation at scale. Empirically, we focus on age, finding LM capability varies widely depending on task: GPT-3.5 mimics the ability of humans ranging from age 6-15 at tasks requiring inference, and simultaneously, outperforms a typical 21 year old at memorization. GPT-3.5 also has trouble with social language use, exhibiting less than 50% of the tested pragmatic skills. Findings affirm the importance of considering demographic alignment and conversational goals when using LMs as public-facing tools. Code, data, and a package will be available.

pdf bib
LOCOST: State-Space Models for Long Document Abstractive Summarization
Florian Le Bronnec | Song Duong | Mathieu Ravaut | Alexandre Allauzen | Nancy Chen | Vincent Guigue | Alberto Lumbreras | Laure Soulier | Patrick Gallinari

State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of 𝒪(L log L), this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches a performance level that is 93-96% comparable to the top-performing sparse transformers of the same size while saving up to 50% memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.

pdf bib
A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation
Sahar Sadrizadeh | Ljiljana Dolamic | Pascal Frossard

Neural Machine Translation (NMT) models have been shown to be vulnerable to adversarial attacks, wherein carefully crafted perturbations of the input can mislead the target model. In this paper, we introduce ACT, a novel adversarial attack framework against NMT systems guided by a classifier. In our attack, the adversary aims to craft meaning-preserving adversarial examples whose translations in the target language by the NMT model belong to a different class than the original translations. Unlike previous attacks, our new approach has a more substantial effect on the translation by altering the overall meaning, which then leads to a different class determined by an oracle classifier. To evaluate the robustness of NMT models to our attack, we propose enhancements to existing black-box word-replacement-based attacks by incorporating output translations of the target NMT model and the output logits of a classifier within the attack process. Extensive experiments, including a comparison with existing untargeted attacks, show that our attack is considerably more successful in altering the class of the output translation and has more effect on the translation. This new paradigm can reveal the vulnerabilities of NMT systems by focusing on the class of translation rather than the mere translation quality as studied traditionally.

pdf bib
Improving Generalization in Semantic Parsing by Increasing Natural Language Variation
Irina Saparina | Mirella Lapata

Text-to-SQL semantic parsing has made significant progress in recent years, with various models demonstrating impressive performance on the challenging Spider benchmark. However, it has also been shown that these models often struggle to generalize even when faced with small perturbations of previously (accurately) parsed expressions. This is mainly due to the linguistic form of questions in Spider which are overly specific, unnatural, and display limited variation. In this work, we use data augmentation to enhance the robustness of text-to-SQL parsers against natural language variations. Existing approaches generate question reformulations either via models trained on Spider or only introduce local changes. In contrast, we leverage the capabilities of large language models to generate more realistic and diverse questions. Using only a few prompts, we achieve a two-fold increase in the number of questions in Spider. Training on this augmented dataset yields substantial improvements on a range of evaluation sets, including robustness benchmarks and out-of-domain data.

pdf bib
Text-to-Code Generation with Modality-relative Pre-training
Fenia Christopoulou | Guchun Zhang | Gerasimos Lampouras

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model–where training sequences typically contain both natural and (linearised) programming language. Such approaches effectively map both modalities of the sequence into the same embedding space. However, programming language keywords (e.g. “while”) often have very strictly defined semantics. As such, transfer learning from their natural language usage may not necessarily be beneficial to their code application and vise versa. Assuming an already pre-trained language model, in this work we investigate how sequence tokens can be adapted and represented differently, depending on which modality they belong to, and to the ultimate benefit of the downstream task. We experiment with separating embedding spaces between modalities during further model pre-training with modality-relative training objectives. We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets, measuring pass@k and a novel incremental variation.

pdf bib
No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models
Agnes Luhtaru | Elizaveta Korotkova | Mark Fishel

Grammatical Error Correction (GEC) enhances language proficiency and promotes effective communication, but research has primarily centered around English. We propose a simple approach to multilingual and low-resource GEC by exploring the potential of multilingual machine translation (MT) models for error correction. We show that MT models are not only capable of error correction out-of-the-box, but that they can also be fine-tuned to even better correction quality. Results show the effectiveness of this approach, with our multilingual model outperforming similar-sized mT5-based models and even competing favourably with larger models.

pdf bib
Quantifying Stereotypes in Language
Yang Liu

A stereotype is a generalized perception of a specific group of humans. It is often potentially encoded in human language, which is more common in texts on social issues. Previous works simply define a sentence as stereotypical and anti-stereotypical. However, the stereotype of a sentence may require fine-grained quantification. In this paper, to fill this gap, we quantify stereotypes in language by annotating a dataset. We use the pre-trained language models (PLMs) to learn this dataset to predict stereotypes of sentences. Then, we discuss stereotypes about common social issues such as hate speech, sexism, sentiments, and disadvantaged and advantaged groups. We demonstrate the connections and differences between stereotypes and common social issues, and all four studies validate the general findings of the current studies. In addition, our work suggests that fine-grained stereotype scores are a highly relevant and competitive dimension for research on social issues. The models and datasets used in this paper are available at

pdf bib
Generation, Distillation and Evaluation of Motivational Interviewing-Style Reflections with a Foundational Language Model
Andrew Brown | Jiading Zhu | Mohamed Abdelwahab | Alec Dong | Cindy Wang | Jonathan Rose

Large Foundational Language Models are capable of performing many tasks at a high level but are difficult to deploy in many applications because of their size and proprietary ownership. Many will be motivated to distill specific capabilities of foundational models into smaller models that can be owned and controlled. In the development of a therapeutic chatbot, we wish to distill a capability known as reflective listening, in which a therapist produces reflections of client speech. These reflections either restate what a client has said, or connect what was said to a relevant observation, idea or guess that encourages and guides the client to continue contemplation. In this paper, we present a method for distilling the generation of reflections from a Foundational Language Model (GPT-4) into smaller models. We first show that GPT-4, using zero-shot prompting, can generate reflections at near 100% success rate, superior to all previous methods. Using reflections generated by GPT-4, we fine-tune different sizes of the GPT-2 family. The GPT-2-small model achieves 83% success on a hold-out test set and the GPT-2 XL achieves 90% success. We also show that GPT-4 can help in the labor-intensive task of evaluating the quality of the distilled models, using it as a zero-shot classifier. Using triple-human review as a guide, the classifier achieves a Cohen-Kappa of 0.66, a substantial inter-rater reliability figure.

pdf bib
Multi-Reference Benchmarks for Russian Grammatical Error Correction
Frank Palma Gomez | Alla Rozovskaya

This paper presents multi-reference benchmarks for the Grammatical Error Correction (GEC) of Russian, based on two existing single-reference datasets, for a total of 7,444 learner sentences from a variety of first language backgrounds. Each sentence is corrected independently by two new raters, and their corrections are reviewed by a senior annotator, resulting in a total of three references per sentence. Analysis of the annotations reveals that the new raters tend to make more changes, compared to the original raters, especially at the lexical level. We conduct experiments with two popular GEC approaches and show competitive performance on the original datasets and the new benchmarks. We also compare system scores as evaluated against individual annotators and discuss the effect of using multiple references overall and on specific error types. We find that using the union of the references increases system scores by more than 10 points and decreases the gap between system and human performance, thereby providing a more realistic evaluation of GEC system performance, although the effect is not the same across the error types. The annotations are available for research.

pdf bib
Plan-Grounded Large Language Models for Dual Goal Conversational Settings
Diogo Glória-Silva | Rafael Ferreira | Diogo Tavares | David Semedo | Joao Magalhaes

Training Large Language Models (LLMs) to follow user instructions has shown to supply the LLM with ample capacity to converse fluently while being aligned with humans. Yet, it is not completely clear how an LLM can lead a plan-grounded conversation in mixed-initiative settings where instructions flow in both directions of the conversation, i.e. both the LLM and the user provide instructions to one another. In this paper, we tackle a dual goal mixed-initiative conversational setting where the LLM not only grounds the conversation on an arbitrary plan but also seeks to satisfy both a procedural plan and user instructions. The LLM is then responsible for guiding the user through the plan and, at the same time, adapting to new circumstances, answering questions, and activating safety guardrails when needed. We propose a novel LLM that grounds the dialogue on a procedural plan, can take the dialogue initiative, and enforces guardrails on the system’s behavior, while also improving the LLM’s responses to unexpected user behavior. Experiments in controlled settings and with real users show that the best-performing model, which we call PlanLLM, achieves a 2.1x improvement over a strong baseline. Moreover, experiments also show good generalization to unseen domains.

pdf bib
“Define Your Terms” : Enhancing Efficient Offensive Speech Classification with Definition
Huy Nghiem | Umang Gupta | Fred Morstatter

The propagation of offensive content through social media channels has garnered attention of the research community. Multiple works have proposed various semantically related yet subtle distinct categories of offensive speech. In this work, we explore meta-learning approaches to leverage the diversity of offensive speech corpora to enhance their reliable and efficient detection. We propose a joint embedding architecture that incorporates the input’s label and definition for classification via Prototypical Network. Our model achieves at least 75% of the maximal F1-score while using less than 10% of the available training data across 4 datasets. Our experimental findings also provide a case study of training strategies valuable to combat resource scarcity.

pdf bib
VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension
Thinh Ngo | Khoa Dang | Son Luu | Kiet Nguyen | Ngan Nguyen

This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension (MRC) tasks and provides insights into the challenges and opportunities associated with using real-world data for machine reading comprehension tasks. The existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks. In contrast, the VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube – an extensive source of user-uploaded content, covering the topics of food and travel. By capturing the spoken language of native Vietnamese speakers in natural settings, an obscure corner overlooked in Vietnamese research, the corpus provides a valuable resource for future research in reading comprehension tasks for the Vietnamese language. Regarding performance evaluation, our deep-learning models achieved the highest F1 score of 75.34% on the test set, indicating significant progress in machine reading comprehension for Vietnamese spoken language data. In terms of EM, the highest score we accomplished is 53.97%, which reflects the challenge in processing spoken-based content and highlights the need for further improvement.

pdf bib
CEV-LM: Controlled Edit Vector Language Model for Shaping Natural Language Generations
Samraj Moorjani | Adit Krishnan | Hari Sundaram

As large-scale language models become the standard for text generation, there is a greater need to tailor the generations to be more or less concise, targeted, and informative, depending on the audience/application. Existing control approaches primarily adjust the semantic (e.g., emotion, topics), structural (e.g., syntax tree, parts-of-speech), and lexical (e.g., keyword/phrase inclusion) properties of text, but are insufficient to accomplish complex objectives such as pacing which control the complexity and readability of the text. In this paper, we introduce CEV-LM - a lightweight, semi-autoregressive language model that utilizes constrained edit vectors to control three complementary metrics (speed, volume, and circuitousness) that quantify the shape of text (e.g., pacing of content). We study an extensive set of state-of-the-art CTG models and find that CEV-LM provides significantly more targeted and precise control of these three metrics while preserving semantic content, using less training data, and containing fewer parameters.

pdf bib
It’s All Relative: Learning Interpretable Models for Scoring Subjective Bias in Documents from Pairwise Comparisons
Aswin Suresh | Wu Hsuan | Matthias Grossglauser

We propose an interpretable model to score the subjective bias present in documents, based only on their textual content. Our model is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. Although prior approaches based on bias classification have struggled to obtain a high accuracy for the task, we are able to develop a useful model for scoring bias by learning to accurately perform pairwise comparisons. We show that we can interpret the parameters of the trained model to discover the words most indicative of bias. We also apply our model in three different settings by studying the temporal evolution of bias in Wikipedia articles, comparing news sources based on bias, and scoring bias in law amendments. In each case, we demonstrate that the outputs of the model can be explained and validated, even for the two domains that are outside the training-data domain. We also use the model to compare the general level of bias between domains, where we see that legal texts are the least biased and news media are the most biased, with Wikipedia articles in between.

pdf bib
HiGen: Hierarchy-Aware Sequence Generation for Hierarchical Text Classification
Vidit Jain | Mukund Rungta | Yuchen Zhuang | Yue Yu | Zeyu Wang | Mu Gao | Jeffrey Skolnick | Chao Zhang

Hierarchical text classification (HTC) is a complex subtask under multi-label text classification, characterized by a hierarchical label taxonomy and data imbalance. The best-performing models aim to learn a static representation by combining document and hierarchical label information. However, the relevance of document sections can vary based on the hierarchy level, necessitating a dynamic document representation. To address this, we propose HiGen, a text-generation-based framework utilizing language models to encode dynamic text representations. We introduce a level-guided loss function to capture the relationship between text and label name semantics. Our approach incorporates a task-specific pretraining strategy, adapting the language model to in-domain knowledge and significantly enhancing performance for classes with limited examples. Furthermore, we present a new and valuable dataset called ENZYME, designed for HTC, which comprises articles from PubMed with the goal of predicting Enzyme Commission (EC) numbers. Through extensive experiments on the ENZYME dataset and the widely recognized WOS and NYT datasets, our methodology demonstrates superior performance, surpassing existing approaches while efficiently handling data and mitigating class imbalance. We release our code and dataset here:

pdf bib
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Chenxi Whitehouse | Osama Mohammed Afzal | Tarek Mahmoud | Toru Sasaki | Thomas Arnold | Alham Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at

pdf bib
A Truly Joint Neural Architecture for Segmentation and Parsing
Danit Yshaayahu Levi | Reut Tsarfaty

Contemporary multilingual dependency parsers can parse a diverse set of languages, but for Morphologically Rich Languages (MRLs), performance is attested to be lower than other languages. The key challenge is that, due to high morphological complexity and ambiguity of the space-delimited input tokens, the linguistic units that act as nodes in the tree are not known in advance. Pre-neural dependency parsers for MRLs subscribed to the joint morpho-syntactic hypothesis, stating that morphological segmentation and syntactic parsing should be solved jointly, rather than as a pipeline where segmentation precedes parsing. However, neural state-of-the-art parsers to date use a strict pipeline. In this paper we introduce a joint neural architecture where a lattice-based representation preserving all morphological ambiguity of the input is provided to an arc-factored model, which then solves the morphological segmentation and syntactic parsing tasks at once. Our experiments on Hebrew, a rich and highly ambiguous MRL, demonstrate state-of-the-art performance on parsing, tagging and segmentation of the Hebrew section of UD, using a single model. This proposed architecture is LLM-based and language agnostic, providing a solid foundation for MRLs to obtain further performance improvements and bridge the gap with other languages.

pdf bib
ViLexNorm: A Lexical Normalization Corpus for Vietnamese Social Media Text
Thanh-Nhi Nguyen | Thanh-Phong Le | Kiet Nguyen

Lexical normalization, a fundamental task in Natural Language Processing (NLP), involves the transformation of words into their canonical forms. This process has been proven to benefit various downstream NLP tasks greatly. In this work, we introduce Vietnamese Lexical Normalization (ViLexNorm), the first-ever corpus developed for the Vietnamese lexical normalization task. The corpus comprises over 10,000 pairs of sentences meticulously annotated by human annotators, sourced from public comments on Vietnam’s most popular social media platforms. Various methods were used to evaluate our corpus, and the best-performing system achieved a result of 57.74% using the Error Reduction Rate (ERR) metric (van der Goot, 2019a) with the Leave-As-Is (LAI) baseline. For extrinsic evaluation, employing the model trained on ViLexNorm demonstrates the positive impact of the Vietnamese lexical normalization task on other NLP tasks. Our corpus is publicly available exclusively for research purposes.

pdf bib
Diffusion-NAT: Self-Prompting Discrete Diffusion for Non-Autoregressive Text Generation
Kun Zhou | Yifan Li | Xin Zhao | Ji-Rong Wen

Recently, continuous diffusion models (CDM) have been introduced into non-autoregressive (NAR) text-to-text generation. However, the discrete nature of text increases the difficulty of CDM to generate coherent and fluent texts, and also causes the incompatibility problem between CDM and advanced NLP techniques, especially the popular pre-trained language models (PLMs).To solve it, we propose Diffusion-NAT, which introduces discrete diffusion models (DDM) into NAR text-to-text generation and integrates BART to improve the performance.By revising the decoding process of BART and the typical settings of DDM, we unify the inference process of BART and the denoising process of DDM into the same NAR masked tokens recovering task.In this way, DDM can rely on BART to perform denoising, which can benefit from both the rich pre-learned knowledge of BART and the iterative refining paradigm of DDM.Besides, we also propose the iterative self-prompting strategy to further improve the generation quality.Experimental results on 7 datasets show that our approach can outperform competitive NAR methods, and even surpass autoregressive methods.Our code and data are released at

pdf bib
Unleashing the Power of Discourse-Enhanced Transformers for Propaganda Detection
Alexander Chernyavskiy | Dmitry Ilvovsky | Preslav Nakov

The prevalence of information manipulation online has created a need for propaganda detection systems. Such systems have typically focused on the surface words, ignoring the linguistic structure. Here we aim to bridge this gap. In particular, we present the first attempt at using discourse analysis for the task. We consider both paragraph-level and token-level classification and we propose a discourse-aware Transformer architecture. Our experiments on English and Russian demonstrate sizeable performance gains compared to a number of baselines. Moreover, our ablation study emphasizes the importance of specific types of discourse features, and our in-depth analysis reveals a strong correlation between propaganda instances and discourse spans.

pdf bib
Predicting Client Emotions and Therapist Interventions in Psychotherapy Dialogues
Tobias Mayer | Neha Warikoo | Amir Eliassaf | Dana Atzil-Slonim | Iryna Gurevych

Natural Language Processing (NLP) can advance psychotherapy research by scaling up therapy dialogue analysis as well as by allowing researchers to examine client-therapist interactions in detail. Previous studies have mainly either explored the clients’ behavior or the therapists’ intervention in dialogues. Yet, modelling conversations from both dialogue participants is crucial to understanding the therapeutic interaction. This study explores speaker contribution-based dialogue acts at the utterance-level; i.e, the therapist - Intervention Prediction (IP) and the client - Emotion Recognition (ER) in psychotherapy using a pan-theoretical schema. We perform experiments with fine-tuned language models and light-weight adapter solutions on a Hebrew dataset. We deploy the results from our ER model predictions in investigating the coherence between client self-reports on emotion and the utterance-level emotions. Our best adapters achieved on-par performance with fully fine-tuned models, at 0.64 and 0.66 micro F1 for IP and ER, respectively. In addition, our analysis identifies ambiguities within categorical clinical coding, which can be used to fine-tune the coding schema. Finally, our results indicate a positive correlation between client self-reports and utterance-level emotions.

pdf bib
Who Needs Decoders? Efficient Estimation of Sequence-Level Attributes with Proxies
Yassir Fathullah | Puria Radmard | Adian Liusie | Mark Gales

Sequence-to-sequence models often require an expensive autoregressive decoding process. However, for some downstream tasks such as out-of-distribution (OOD) detection and resource allocation, the actual decoding output is not needed, just a scalar attribute of this sequence. In such scenarios, where knowing the quality of a system’s output to predict poor performance prevails over knowing the output itself, is it possible to bypass the autoregressive decoding? We propose Non-Autoregressive Proxy (NAP) models that can efficiently predict scalar-valued sequence-level attributes. Importantly, NAPs predict these metrics directly from the encodings, avoiding the expensive decoding stage. We consider two sequence tasks: Machine Translation (MT) and Automatic Speech Recognition (ASR). In OOD for MT, NAPs outperform ensembles while being significantly faster. NAPs are also proven capable of predicting metrics such as BERTScore (MT) or word error rate (ASR). For downstream tasks, such as data filtering and resource optimization, NAPs generate performance predictions that outperform predictive uncertainty while being highly inference efficient.

pdf bib
3D Rotation and Translation for Hyperbolic Knowledge Graph Embedding
Yihua Zhu | Hidetoshi Shimodaira

The main objective of Knowledge Graph (KG) embeddings is to learn low-dimensional representations of entities and relations, enabling the prediction of missing facts. A significant challenge in achieving better KG embeddings lies in capturing relation patterns, including symmetry, antisymmetry, inversion, commutative composition, non-commutative composition, hierarchy, and multiplicity. This study introduces a novel model called 3H-TH (3D Rotation and Translation in Hyperbolic space) that captures these relation patterns simultaneously. In contrast, previous attempts have not achieved satisfactory performance across all the mentioned properties at the same time. The experimental results demonstrate that the new model outperforms existing state-of-the-art models in terms of accuracy, hierarchy property, and other relation patterns in low-dimensional space, meanwhile performing similarly in high-dimensional space.

pdf bib
Geo-Encoder: A Chunk-Argument Bi-Encoder Framework for Chinese Geographic Re-Ranking
Yong Cao | Ruixue Ding | Boli Chen | Xianzhi Li | Min Chen | Daniel Hershcovich | Pengjun Xie | Fei Huang

Chinese geographic re-ranking task aims to find the most relevant addresses among retrieved candidates, which is crucial for location-related services such as navigation maps. Unlike the general sentences, Chinese geographic contexts are closely intertwined with geographical concepts, from general spans (e.g., province) to specific spans (e.g., road). Given this feature, we propose an innovative framework, namely Geo-Encoder, to more effectively integrate Chinese geographical semantics into re-ranking pipelines. Our methodology begins by employing off-the-shelf tools to associate text with geographical spans, treating them as chunking units. Then, we present a multi-task learning module to simultaneously acquire an effective attention matrix that determines chunk contributions to geographic representations. Furthermore, we put forth an asynchronous update mechanism for the proposed task, aiming to guide the model to focus on specific chunks. Experiments on two Chinese benchmark datasets, show that the Geo-Encoder achieves significant improvements when compared to state-of-the-art baselines. Notably, it leads to a substantial improvement in the Hit@1 score of MGEO-BERT, increasing it by 6.22% from 62.76 to 68.98 on the GeoTES dataset.

pdf bib
Style-News: Incorporating Stylized News Generation and Adversarial Verification for Neural Fake News Detection
Wei-Yao Wang | Yu-Chieh Chang | Wen-Chih Peng

With the improvements in generative models, the issues of producing hallucinations in various domains (e.g., law, writing) have been brought to people’s attention due to concerns about misinformation. In this paper, we focus on neural fake news, which refers to content generated by neural networks aiming to mimic the style of real news to deceive people. To prevent harmful disinformation spreading fallaciously from malicious social media (e.g., content farms), we propose a novel verification framework, Style-News, using publisher metadata to imply a publisher’s template with the corresponding text types, political stance, and credibility. Based on threat modeling aspects, a style-aware neural news generator is introduced as an adversary for generating news content conditioning for a specific publisher, and style and source discriminators are trained to defend against this attack by identifying which publisher the style corresponds with, and discriminating whether the source of the given news is human-written or machine-generated. To evaluate the quality of the generated content, we integrate various dimensional metrics (language fluency, content preservation, and style adherence) and demonstrate that Style-News significantly outperforms the previous approaches by a margin of 0.35 for fluency, 15.24 for content, and 0.38 for style at most. Moreover, our discriminative model outperforms state-of-the-art baselines in terms of publisher prediction (up to 4.64%) and neural fake news detection (+6.94% 31.72%). We plan to release our Style-News publicly, with the aim of improving neural fake news detection.

pdf bib
Graph-based Clustering for Detecting Semantic Change Across Time and Languages
Xianghe Ma | Michael Strube | Wei Zhao

Despite the predominance of contextualized embeddings in NLP, approaches to detect semantic change relying on these embeddings and clustering methods underperform simpler counterparts based on static word embeddings. This stems from the poor quality of the clustering methods to produce sense clusters—which struggle to capture word senses, especially those with low frequency. This issue hinders the next step in examining how changes in word senses in one language influence another. To address this issue, we propose a graph-based clustering approach to capture nuanced changes in both high- and low-frequency word senses across time and languages, including the acquisition and loss of these senses over time. Our experimental results show that our approach substantially surpasses previous approaches in the SemEval2020 binary classification task across four languages. Moreover, we showcase the ability of our approach as a versatile visualization tool to detect semantic changes in both intra-language and inter-language setups. We make our code and data publicly available.

pdf bib
Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models
Haoqiang Kang | Terra Blevins | Luke Zettlemoyer

Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and perform well on diverse tasks such as translation and multilingual word sense disambiguation (WSD) when finetuned. However, they often struggle at disambiguating word sense in a zero-shot setting. To better understand this contrast, we present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT), an extension of word-level translation that prompts the model to translate a given word in context. We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance. Building on C-WLT, we introduce a zero-shot prompting approach for WSD, tested on 18 languages from the XL-WSD dataset. Our method outperforms fully supervised baselines on recall for many evaluation languages without additional training or finetuning. This study presents a first step towards understanding how to best leverage the cross-lingual knowledge inside PLMs for robust zero-shot reasoning in any language.

pdf bib
Anchor Points: Benchmarking Models with Much Fewer Examples
Rajan Vivek | Kawin Ethayarajh | Diyi Yang | Douwe Kiela

Modern language models often exhibit powerful but brittle behavior, leading to the development of larger and more diverse benchmarks to reliably assess their behavior. Here, we suggest that model performance can be benchmarked and elucidated with much smaller evaluation sets. We first show that in six popular language classification benchmarks, model confidence in the correct class on many pairs of points is strongly correlated across models. We build upon this phenomenon to propose Anchor Point Selection, a technique to select small subsets of datasets that capture model behavior across the entire dataset. Anchor points reliably rank models: across 87 diverse language model-prompt pairs, evaluating models using 1-30 anchor points outperforms uniform sampling and other baselines at accurately ranking models. Moreover, just a dozen anchor points can be used to estimate model per-class predictions on all other points in a dataset with low error, sufficient for gauging where the model is likely to fail. Lastly, we present Anchor Point Maps for visualizing these insights and facilitating comparisons of the performance of different models on various regions within the dataset distribution.

pdf bib
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang | Caren Han | Josiah Poon

Visual storytelling aims to automatically generate a coherent story based on a given image sequence. Unlike tasks like image captioning, visual stories should contain factual descriptions, worldviews, and human social commonsense to put disjointed elements together to form a coherent and engaging human-writeable story. However, most models mainly focus on applying factual information and using taxonomic/lexical external knowledge when attempting to create stories. This paper introduces SCO-VIST, a framework representing the image sequence as a graph with objects and relations that includes human action motivation and its social interaction commonsense knowledge. SCO-VIST then takes this graph representing plot points and creates bridges between plot points with semantic and occurrence-based edge weights. This weighted story graph produces the storyline in a sequence of events using Floyd-Warshall’s algorithm. Our proposed framework produces stories superior across multiple metrics in terms of visual grounding, coherence, diversity, and humanness, per both automatic and human evaluations.

pdf bib
Discovering and Articulating Frames of Communication from Social Media Using Chain-of-Thought Reasoning
Maxwell Weinzierl | Sanda Harabagiu

Frames of Communication (FoCs) are ubiquitous in social media discourse. They define what counts as a problem, diagnose what is causing the problem, elicit moral judgments and imply remedies for resolving the problem. Most research on automatic frame detection involved the recognition of the problems addressed by frames, but did not consider the articulation of frames. Articulating an FoC involves reasoning with salient problems, their cause and eventual solution. In this paper we present a method for Discovering and Articulating FoCs (DA-FoC) that relies on a combination of Chain-of-Thought prompting of large language models (LLMs) with In-Context Active Curriculum Learning. Very promising evaluation results indicate that 86.72% of the FoCs encoded by communication experts on the same reference dataset were also uncovered by DA-FoC. Moreover, DA-FoC uncovered many new FoCs, which escaped the experts. Interestingly, 55.1% of the known FoCs were judged as being better articulated than the human-written ones, while 93.8% of the new FoCs were judged as having sound rationale and being clearly articulated.

pdf bib
VEIL: Vetting Extracted Image Labels from In-the-Wild Captions for Weakly-Supervised Object Detection
Arushi Rai | Adriana Kovashka

The use of large-scale vision-language datasets is limited for object detection due to the negative impact of label noise on localization. Prior methods have shown how such large-scale datasets can be used for pretraining, which can provide initial signal for localization, but is insufficient without clean bounding-box data for at least some categories. We propose a technique to “vet” labels extracted from noisy captions, and use them for weakly-supervised object detection (WSOD), without any bounding boxes. We analyze and annotate the types of label noise in captions in our Caption Label Noise dataset, and train a classifier that predicts if an extracted label is actually present in the image or not. Our classifier generalizes across dataset boundaries and across categories. We compare the classifier to nine baselines on five datasets, and demonstrate that it can improve WSOD without label vetting by 30% (31.2 to 40.5 mAP when evaluated on PASCAL VOC). See dataset at:

pdf bib
WSC+: Enhancing The Winograd Schema Challenge Using Tree-of-Experts
Pardis Zahraei | Ali Emami

The Winograd Schema Challenge (WSC) serves as a prominent benchmark for evaluating machine understanding. While Large Language Models (LLMs) excel at answering WSC questions, their ability to generate such questions remains less explored. In this work, we propose Tree-of-Experts (ToE), a novel prompting method which enhances the generation of WSC instances (50% valid cases vs. 10% in recent methods). Using this approach, we introduce WSC+, a novel dataset comprising 3,026 LLM-generated sentences. Notably, we extend the WSC framework by incorporating new ‘ambiguous’ and ‘offensive’ categories, providing a deeper insight into model overconfidence and bias. Our analysis reveals nuances in generation-evaluation consistency, suggesting that LLMs may not always outperform in evaluating their own generated questions when compared to those crafted by other models. On WSC+, GPT-4, the top-performing LLM, achieves an accuracy of 68.7%, significantly below the human benchmark of 95.1%.

pdf bib
Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin – A Benchmark and Evaluation for Turkic Languages
Lütfi Kerem Senel | Benedikt Ebing | Konul Baghirova | Hinrich Schuetze | Goran Glavaš

Cross-lingual transfer (XLT) driven by massively multilingual language models (mmLMs) has been shown largely ineffective for low-resource (LR) target languages with little (or no) representation in mmLM’s pretraining, especially if they are linguistically distant from the high-resource (HR) source language. Much of the recent focus in XLT research has been dedicated to LR language families, i.e., families without any HR languages (e.g., families of African languages or indigenous languages of the Americas). In this work, in contrast, we investigate a configuration that is arguably of practical relevance for more of the world’s languages: XLT to LR languages that do have a close HR relative. To explore the extent to which a HR language can facilitate transfer to its LR relatives, we (1) introduce Kardeş-NLU, an evaluation benchmark with language understanding datasets in five LR Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Uzbek, and Uyghur; and (2) investigate (a) intermediate training and (b) fine-tuning strategies that leverage Turkish in XLT to these target languages. Our experimental results show that both - integrating Turkish in intermediate training and in downstream fine-tuning - yield substantial improvements in XLT to LR Turkic languages. Finally, we benchmark cutting-edge instruction-tuned large language models on Kardeş-NLU, showing that their performance is highly task- and language-dependent.

pdf bib
Chaining Event Spans for Temporal Relation Grounding
Jongho Kim | Dohyeon Lee | Minsoo Kim | Seung-won Hwang

Accurately understanding temporal relations between events is a critical building block of diverse tasks, such as temporal reading comprehension (TRC) and relation extraction (TRE). For example in TRC, we need to understand the temporal semantic differences between the following two questions that are lexically near-identical: “What finished right before the decision?” or “What finished right after the decision?”. To discern the two questions, existing solutions have relied on answer overlaps as a proxy label to contrast similar and dissimilar questions. However, we claim that answer overlap can lead to unreliable results, due to spurious overlaps of two dissimilar questions with coincidentally identical answers. To address the issue, we propose a novel approach that elicits proper reasoning behaviors through a module for predicting time spans of events. We introduce the Timeline Reasoning Network (TRN) operating in a two-step inductive reasoning process: In the first step model initially answers each question with semantic and syntactic information. The next step chains multiple questions on the same event to predict a timeline, which is then used to ground the answers. Results on the TORQUE and TB-dense, TRC, and TRE tasks respectively, demonstrate that TRN outperforms previous methods by effectively resolving the spurious overlaps using the predicted timeline.

pdf bib
Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks
Huajian Zhang | Yumo Xu | Laura Perez-Beltrachini

We study existing approaches to leverage off-the-shelf Natural Language Inference (NLI) models for the evaluation of summary faithfulness and argue that these are sub-optimal due to the granularity level considered for premises and hypotheses. That is, the smaller content unit considered as hypothesis is a sentence and premises are made up of a fixed number of document sentences. We propose a novel approach, namely INFUSE, that uses a variable premise size and simplifies summary sentences into shorter hypotheses. Departing from previous studies which focus on single short document summarisation, we analyse NLI based faithfulness evaluation for diverse summarisation tasks. We introduce DiverSumm, a new benchmark comprising long form summarisation (long documents and summaries) and diverse summarisation tasks (e.g., meeting and multi-document summarisation). In experiments, INFUSE obtains superior performance across the different summarisation tasks.

pdf bib
AnaDE1.0: A Novel Data Set for Benchmarking Analogy Detection and Extraction
Bhavya Bhavya | Shradha Sehgal | Jinjun Xiong | ChengXiang Zhai

Textual analogies that make comparisons between two concepts are often used for explaining complex ideas, creative writing, and scientific discovery. In this paper, we propose and study a new task, called Analogy Detection and Extraction (AnaDE), which includes three synergistic sub-tasks: 1) detecting documents containing analogies, 2) extracting text segments that make up the analogy, and 3) identifying the (source and target) concepts being compared. To facilitate the study of this new task, we create a benchmark dataset by scraping and investigate the performances of state-of-the-art models on all sub-tasks to establish the first-generation benchmark results for this new task. We find that the Longformer model achieves the best performance on all the three sub-tasks demonstrating its effectiveness for handling long texts. Moreover, smaller models fine-tuned on our dataset perform better than non-finetuned ChatGPT, suggesting high task difficulty. Overall, the models achieve a high performance on documents detection suggesting that it could be used to develop applications like analogy search engines. Further, there is a large room for improvement on the segment and concept extraction tasks.

pdf bib
A Comprehensive Survey of Sentence Representations: From the BERT Epoch to the CHATGPT Era and Beyond
Abhinav Ramesh Kashyap | Thanh-Tung Nguyen | Viktor Schlegel | Stefan Winkler | See-Kiong Ng | Soujanya Poria

Sentence representations are a critical component in NLP applications such as retrieval, question answering, and text classification. They capture the meaning of a sentence, enabling machines to understand and reason over human language. In recent years, significant progress has been made in developing methods for learning sentence representations, including unsupervised, supervised, and transfer learning approaches. However there is no literature review on sentence representations till now. In this paper, we provide an overview of the different methods for sentence representation learning, focusing mostly on deep learning models. We provide a systematic organization of the literature, highlighting the key contributions and challenges in this area. Overall, our review highlights the importance of this area in natural language processing, the progress made in sentence representation learning, and the challenges that remain. We conclude with directions for future research, suggesting potential avenues for improving the quality and efficiency of sentence representations.

pdf bib
Learning to Retrieve In-Context Examples for Large Language Models
Liang Wang | Nan Yang | Furu Wei

Large language models (LLMs) have demonstrated their ability to learn in-context, allowing them to perform various tasks based on a few input-output examples. However, the effectiveness of in-context learning is heavily reliant on the quality of the selected examples. In this paper, we propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples for LLMs. Our framework initially trains a reward model based on LLM feedback to evaluate the quality of candidate examples, followed by knowledge distillation to train a bi-encoder based dense retriever. Our experiments on a suite of 30 tasks demonstrate that our framework significantly enhances in-context learning performance. Furthermore, we show the generalization ability of our framework to unseen tasks during training. An in-depth analysis reveals that our model improves performance by retrieving examples with similar patterns, and the gains are consistent across LLMs of varying sizes.

pdf bib
EnCore: Fine-Grained Entity Typing by Pre-Training Entity Encoders on Coreference Chains
Frank Mtumbuka | Steven Schockaert

Entity typing is the task of assigning semantic types to the entities that are mentioned in a text. In the case of fine-grained entity typing (FET), a large set of candidate type labels is considered. Since obtaining sufficient amounts of manual annotations is then prohibitively expensive, FET models are typically trained using distant supervision. In this paper, we propose to improve on this process by pre-training an entity encoder such that embeddings of coreferring entities are more similar to each other than to the embeddings of other entities. The main problem with this strategy, which helps to explain why it has not previously been considered, is that predicted coreference links are often too noisy. We show that this problem can be addressed by using a simple trick: we only consider coreference links that are predicted by two different off-the-shelf systems. With this prudent use of coreference links, our pre-training strategy allows us to improve the state-of-the-art in benchmarks on fine-grained entity typing, as well as traditional entity extraction.

pdf bib
Unsupervised stance detection for social media discussions: A generic baseline
Maia Sutter | Antoine Gourru | Amine Trabelsi | Christine Largeron

With the ever-growing use of social media to express opinions on the national and international stage, unsupervised methods of stance detection are increasingly important to handle the task without costly annotation of data. The current unsupervised state-of-the-art models are designed for specific network types, either homophilic or heterophilic, and they fail to generalize to both. In this paper, we first analyze the generalization ability of recent baselines to these two very different network types. Then, we conduct extensive experiments with a baseline model based on text embeddings propagated with a graph neural network that generalizes well to heterophilic and homophilic networks. We show that it outperforms, on average, other state-of-the-art methods across the two network types. Additionally, we show that combining textual and network information outperforms using text only, and that the language model size has only a limited impact on the model performance.

pdf bib
Putting Context in Context: the Impact of Discussion Structure on Text Classification
Nicolò Penzo | Antonio Longa | Bruno Lepri | Sara Tonelli | Marco Guerini

Current text classification approaches usually focus on the content to be classified. Contextual aspects (both linguistic and extra-linguistic) are usually neglected, even in tasks based on online discussions. Still in many cases the multi-party and multi-turn nature of the context from which these elements are selected can be fruitfully exploited. In this work, we propose a series of experiments on a large dataset for stance detection in English, in which we evaluate the contribution of different types of contextual information, i.e. linguistic, structural and temporal, by feeding them as natural language input into a transformer-based model. We also experiment with different amounts of training data and analyse the topology of local discussion networks in a privacy-compliant way. Results show that structural information can be highly beneficial to text classification but only under certain circumstances (e.g. depending on the amount of training data and on discussion chain complexity). Indeed, we show that contextual information on smaller datasets from other classification tasks does not yield significant improvements. Our framework, based on local discussion networks, allows the integration of structural information while minimising user profiling, thus preserving their privacy.

pdf bib
Aligning Large and Small Language Models via Chain-of-Thought Reasoning
Leonardo Ranaldi | Andre Freitas

Chain-of-Thought (CoT) prompting empowersthe reasoning abilities of Large Language Models (LLMs), eliciting them to solve complexreasoning tasks in a step-wise manner. However, these capabilities appear only in models with billions of parameters, which represent an entry barrier for many users who are constrained to operate on a smaller model scale, i.e., Small Language Models (SLMs). Although many companies are releasing LLMs of the same family with fewer parameters, these models tend not to preserve all the reasoning capabilities of the original models, including CoT reasoning.In this paper, we propose a method for aligning and transferring reasoning abilities between larger to smaller Language Models. By using an Instruction-tuning-CoT method, that is, an Instruction-tuning designed around CoT-Demonstrations, we enable the SLMs to generate multi-step controlled reasoned answers when they are elicited with the CoT mechanism. Hence, we instruct a smaller Language Model using outputs generated by more robust models belonging to the same family or not, evaluating the impact across different types of models. Results obtained on question-answering and mathematical reasoning benchmarks show that LMs instructed via the Instruction-tuning CoT method produced by LLMs outperform baselines within both in-domain and out-domain scenarios.

pdf bib
Disentangling the Roles of Target-side Transfer and Regularization in Multilingual Machine Translation
Yan Meng | Christof Monz

Multilingual Machine Translation (MMT) benefits from knowledge transfer across different language pairs. However, improvements in one-to-many translation compared to many-to-one translation are only marginal and sometimes even negligible. This performance discrepancy raises the question of to what extent positive transfer plays a role on the target-side for one-to-many MT. In this paper, we conduct a large-scale study that varies the auxiliary target-side languages along two dimensions, i.e., linguistic similarity and corpus size, to show the dynamic impact of knowledge transfer on the main language pairs. We show that linguistically similar auxiliary target languages exhibit strong ability to transfer positive knowledge. With an increasing size of similar target languages, the positive transfer is further enhanced to benefit the main language pairs. Meanwhile, we find distant auxiliary target languages can also unexpectedly benefit main language pairs, even with minimal positive transfer ability. Apart from transfer, we show distant auxiliary target languages can act as a regularizer to benefit translation performance by enhancing the generalization and model inference calibration.

pdf bib
Uncovering Stereotypes in Large Language Models: A Task Complexity-based Approach
Hari Shrawgi | Prasanjit Rath | Tushar Singhal | Sandipan Dandapat

Recent Large Language Models (LLMs) have unlocked unprecedented applications of AI. As these models continue to transform human life, there are growing socio-ethical concerns around their inherent stereotypes that can lead to bias in their applications. There is an urgent need for holistic bias evaluation of these LLMs. Few such benchmarks exist today and evaluation techniques that do exist are either non-holistic or may provide a false sense of security as LLMs become better at hiding their biases on simpler tasks. We address these issues with an extensible benchmark - LLM Stereotype Index (LSI). LSI is grounded on Social Progress Index, a holistic social benchmark. We also test the breadth and depth of bias protection provided by LLMs via a variety of tasks with varying complexities. Our findings show that both ChatGPT and GPT-4 have strong inherent prejudice with respect to nationality, gender, race, and religion. The exhibition of such issues becomes increasingly apparent as we increase task complexity. Furthermore, GPT-4 is better at hiding the biases, but when displayed it is more significant. Our findings highlight the harms and divide that these LLMs can bring to society if we do not take very diligent care in their use.

pdf bib
Rainbow - A Benchmark for Systematic Testing of How Sensitive Visio-Linguistic Models are to Color Naming
Marie Bexte | Andrea Horbach | Torsten Zesch

With the recent emergence of powerful visio-linguistic models comes the question of how fine-grained their multi-modal understanding is. This has lead to the release of several probing datasets. Results point towards models having trouble with prepositions and verbs, but being relatively robust when it comes to color.To gauge how deep this understanding goes, we compile a comprehensive probing dataset to systematically test multi-modal alignment around color. We demonstrate how human perception influences descriptions of color and pay special attention to the extent to which this is reflected within the predictions of a visio-linguistic model. Probing a set of models with diverse properties with our benchmark confirms the superiority of models that do not rely on pre-extracted image features, and demonstrates that augmentation with too much noisy pre-training data can produce an inferior model. While the benchmark remains challenging for all models we test, the overall result pattern suggests well-founded alignment of color terms with hues. Analyses do however reveal uncertainty regarding the boundaries between neighboring color terms.

pdf bib
CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration
Rachneet Sachdeva | Martin Tutek | Iryna Gurevych

In recent years, large language models (LLMs) have shown remarkable capabilities at scale, particularly at generating text conditioned on a prompt. In our work, we investigate the use of LLMs to augment training data of smaller language models (SLMs) with automatically generated counterfactual (CF) instances – i.e. minimally altered inputs – in order to improve out-of-domain (OOD) performance of SLMs in the extractive question answering (QA) setup. We show that, across various LLM generators, such data augmentation consistently enhances OOD performance and improves model calibration for both confidence-based and rationale-augmented calibrator models. Furthermore, these performance improvements correlate with higher diversity of CF instances in terms of their surface form and semantic content. Finally, we show that CF augmented models which are easier to calibrate also exhibit much lower entropy when assigning importance, indicating that rationale-augmented calibrators prefer concise explanations.

pdf bib
UP5: Unbiased Foundation Model for Fairness-aware Recommendation
Wenyue Hua | Yingqiang Ge | Shuyuan Xu | Jianchao Ji | Zelong Li | Yongfeng Zhang

Recent advances in Foundation Models such as Large Language Models (LLMs) have propelled them to the forefront of Recommender Systems (RS). Despite their utility, there is a growing concern that LLMs might inadvertently perpetuate societal stereotypes, resulting in unfair recommendations. Since fairness is critical for RS as many users take it for decision-making and demand fulfillment, this paper focuses on user-side fairness for LLM-based recommendation where the users may require a recommender system to be fair on specific sensitive features such as gender or age. In this paper, we dive into the extent of unfairness exhibited by LLM-based recommender models based on both T5 and LLaMA backbones, and discuss appropriate methods for promoting equitable treatment of users in LLM-based recommendation models. We introduce a novel Counterfactually-Fair-Prompt (CFP) method towards Unbiased Foundation mOdels (UFO) for fairness-aware LLM-based recommendation. Experiments are conducted on two real-world datasets, MovieLens-1M and Insurance, and compared with both matching-based and sequential-based fairness-aware recommendation models. Results show that CFP achieves better recommendation performance with a high level of fairness.

pdf bib
Human Temporal Inferences Go Beyond Aspectual Class
Katarzyna Pruś | Mark Steedman | Adam Lopez

Past work in NLP has proposed the task of classifying English verb phrases into situation aspect categories, assuming that these categories play an important role in tasks requiring temporal reasoning. We investigate this assumption by gathering crowd-sourced judgements about aspectual entailments from non-expert, native English participants. The results suggest that aspectual class alone is not sufficient to explain the response patterns of the participants. We propose that looking at scenarios which can feasibly accompany an action description contributes towards a better explanation of the participants’ answers. A further experiment using GPT-3.5 shows that its outputs follow different patterns than human answers, suggesting that such conceivable scenarios cannot be fully accounted for in the language alone. We release our dataset to support further research.

pdf bib
It is not True that Transformers are Inductive Learners: Probing NLI Models with External Negation
Michael Sullivan

NLI tasks necessitate a substantial degree of logical reasoning; as such, the remarkable performance of SoTA transformers on these tasks may lead us to believe that those models have learned to reason logically. The results presented in this paper demonstrate that (i) models fine-tuned on NLI datasets learn to treat external negation as a distractor, effectively ignoring its presence in hypothesis sentences; (ii) several near-SoTA encoder and encoder-decoder transformer models fail to inductively learn the law of the excluded middle for a single external negation prefix with respect to NLI tasks, despite extensive fine-tuning; (iii) those models which are are able to learn the law of the excluded middle for a single prefix are unable to generalize this pattern to similar prefixes. Given the critical role of negation in logical reasoning, we may conclude from these findings that transformers do not learn to reason logically when fine-tuned for NLI tasks. Furthermore, these results suggest that transformers may not be able to inductively learn the role of negation with respect to NLI tasks, calling into question their capacity to fully acquire logical reasoning abilities.

pdf bib
Polarized Opinion Detection Improves the Detection of Toxic Language
John Pavlopoulos | Aristidis Likas

Distance from unimodality (DFU) has been found to correlate well with human judgment for the assessment of polarized opinions. However, its un-normalized nature makes it less intuitive and somewhat difficult to exploit in machine learning (e.g., as a supervised signal). In this work a normalized version of this measure, called nDFU, is proposed that leads to better assessment of the degree of polarization. Then, we propose a methodology for K-class text classification, based on nDFU, that exploits polarized texts in the dataset. Such polarized instances are assigned to a separate K+1 class, so that a K+1-class classifier is trained. An empirical analysis on three datasets for abusive language detection, shows that nDFU can be used to model polarized annotations and prevent them from harming the classification performance. Finally, we further exploit nDFU to specify conditions that could explain polarization given a dimension and present text examples that polarized the annotators when the dimension was gender and race. Our code is available at

pdf bib
Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations
Amit Meghanani | Thomas Hain

Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT, Wav2vec2, etc., are outperforming MFCC in many downstream tasks. However, they have not been well studied in the context of learning AWEs. This work explores the effectiveness of CAE with SSL-based speech representations to obtain improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. HuBERT-based CAE model achieves the best results for word discrimination in all languages, despite HuBERT being pre-trained on English only. Also, the HuBERT-based CAE model works well in cross-lingual settings. It outperforms MFCC-based CAE models trained on the target languages when trained on one source language and tested on target languages.

pdf bib
Investigating Agency of LLMs in Human-AI Collaboration Tasks
Ashish Sharma | Sudha Rao | Chris Brockett | Akanksha Malhotra | Nebojsa Jojic | Bill Dolan

Agency, the capacity to proactively shape events, is central to how humans interact and collaborate. While LLMs are being developed to simulate human behavior and serve as human-like agents, little attention has been given to the Agency that these models should possess in order to proactively manage the direction of interaction and collaboration. In this paper, we investigate Agency as a desirable function of LLMs, and how it can be measured and managed. We build on social-cognitive theory to develop a framework of features through which Agency is expressed in dialogue – indicating what you intend to do (Intentionality), motivating your intentions (Motivation), having self-belief in intentions (Self-Efficacy), and being able to self-adjust (Self-Regulation). We collect a new dataset of 83 human-human collaborative interior design conversations containing 908 conversational snippets annotated for Agency features. Using this dataset, we develop methods for measuring Agency of LLMs. Automatic and human evaluations show that models that manifest features associated with high Intentionality, Motivation, Self-Efficacy, and Self-Regulation are more likely to be perceived as strongly agentive.

pdf bib
SynthDST: Synthetic Data is All You Need for Few-Shot Dialog State Tracking
Atharva Kulkarni | Bo-Hsiang Tseng | Joel Moniz | Dhivya Piraviperumal | Hong Yu | Shruti Bhargava

In-context learning with Large Language Models (LLMs) has emerged as a promising avenue of research in Dialog State Tracking (DST). However, the best-performing in-context learning methods involve retrieving and adding similar examples to the prompt, requiring access to labeled training data. Procuring such training data for a wide range of domains and applications is time-consuming, expensive, and, at times, infeasible. While zero-shot learning requires no training data, it significantly lags behind the few-shot setup. Thus, ‘Can we efficiently generate synthetic data for any dialogue schema to enable few-shot prompting?' Addressing this question, we propose , a data generation framework tailored for DST, utilizing LLMs. Our approach only requires the dialogue schema and a few hand-crafted dialogue templates to synthesize natural, coherent, and free-flowing dialogues with DST annotations. Few-shot learning using data from results in 4-5% improvement in Joint Goal Accuracy over the zero-shot baseline on MultiWOZ 2.1 and 2.4. Remarkably, our few-shot learning approach recovers nearly 98% of the performance compared to the few-shot setup using human-annotated training data.

pdf bib
Argument Mining as a Text-to-Text Generation Task
Masayuki Kawarada | Tsutomu Hirao | Wataru Uchida | Masaaki Nagata

Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures.Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

pdf bib
Answering legal questions from laymen in German civil law system
Marius Büttner | Ivan Habernal

What is preventing us from building a NLP system that could help real people in real situations, for instance when they need legal advice but don’t understand law? This question is trickier than one might think, because legal systems vary from country to country, so do the law books, availability of data, and incomprehensibility of legalese. In this paper we focus Germany (which employs the civil-law system where, roughly speaking, interpretation of law codes dominates over precedence) and lay a foundational work to address the laymen’s legal question answering empirically. We create GerLayQA, a new dataset comprising of 21k laymen’s legal questions paired with answers from lawyers and grounded to concrete law book paragraphs. We experiment with a variety of retrieval and answer generation models and provide an in-depth analysis of limitations, which helps us to provide first empirical answers to the question above.

pdf bib
An Empirical Analysis of Diversity in Argument Summarization
Michiel Van Der Meer | Piek Vossen | Catholijn Jonker | Pradeep Murukannaiah

Presenting high-level arguments is a crucial task for fostering participation in online societal discussions. Current argument summarization approaches miss an important facet of this task—capturing diversity—which is important for accommodating multiple perspectives. We introduce three aspects of diversity: those of opinions, annotators, and sources. We evaluate approaches to a popular argument summarization task called Key Point Analysis, which shows how these approaches struggle to (1) represent arguments shared by few people, (2) deal with data from various sources, and (3) align with subjectivity in human-provided annotations. We find that both general-purpose LLMs and dedicated KPA models exhibit this behavior, but have complementary strengths. Further, we observe that diversification of training data may ameliorate generalization in zero-shot cases. Addressing diversity in argument summarization requires a mix of strategies to deal with subjectivity.

pdf bib
What Makes Medical Claims (Un)Verifiable? Analyzing Entity and Relation Properties for Fact Verification
Amelie Wuehrl | Yarik Menchaca Resendiz | Lara Grimminger | Roman Klinger

Verifying biomedical claims fails if no evidence can be discovered. In these cases, the fact-checking verdict remains unknown and the claim is unverifiable. To improve this situation, we have to understand if there are any claim properties that impact its verifiability. In this work we assume that entities and relations define the core variables in a biomedical claim’s anatomy and analyze if their properties help us to differentiate verifiable from unverifiable claims. In a study with trained annotation experts we prompt them to find evidence for biomedical claims, and observe how they refine search queries for their evidence search. This leads to the first corpus for scientific fact verification annotated with subject–relation–object triplets, evidence documents, and fact-checking verdicts (the BEAR-FACT corpus). We find (1) that discovering evidence for negated claims (e.g., X–does-not-cause–Y) is particularly challenging. Further, we see that annotators process queries mostly by adding constraints to the search and by normalizing entities to canonical names. (2) We compare our in-house annotations with a small crowdsourcing setting where we employ both medical experts and laypeople. We find that domain expertise does not have a substantial effect on the reliability of annotations. Finally, (3), we demonstrate that it is possible to reliably estimate the success of evidence retrieval purely from the claim text (.82F1), whereas identifying unverifiable claims proves more challenging (.27F1)

pdf bib
Approximate Attributions for Off-the-Shelf Siamese Transformers
Lucas Moeller | Dmitry Nikolaev | Sebastian Padó

Siamese encoders such as sentence transformers are among the least understood deep models.Established attribution methods cannot tackle this model class since it compares two inputs rather than processing a single one. To address this gap, we have recently proposed an attribution method specifically for Siamese encoders (Möller et al., 2023). However, it requires models to be adjusted and fine-tuned and therefore cannot be directly applied to off-the-shelf models. In this work, we reassess these restrictions and propose (i) a model with exact attribution ability that retains the original model’s predictive performance and (ii) a way to compute approximate attributions for off-the-shelf models.We extensively compare approximate and exact attributions and use them to analyze the models’ attendance to different linguistic aspects. We gain insights into which syntactic roles Siamese transformers attend to, confirm that they mostly ignore negation, explore how they judge semantically opposite adjectives, and find that they exhibit lexical bias.

pdf bib
Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
Ece Takmaz | Sandro Pezzelle | Raquel Fernández

There is an intricate relation between the properties of an image and how humans behave while describing the image. This behavior shows ample variation, as manifested in human signals such as eye movements and when humans start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of Dutch image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals, and find that they correlate with each other. Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation. Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.

pdf bib
Tracing the Roots of Facts in Multilingual Language Models: Independent, Shared, and Transferred Knowledge
Xin Zhao | Naoki Yoshinaga | Daisuke Oba

Acquiring factual knowledge for language models (LMs) in low-resource languages poses a serious challenge, thus resorting to cross-lingual transfer in multilingual LMs (ML-LMs). In this study, we ask how ML-LMs acquire and represent factual knowledge. Using the multilingual factual knowledge probing dataset, mLAMA, we first conducted a neuron investigation of ML-LMs (specifically, multilingual BERT). We then traced the roots of facts back to the knowledge source (Wikipedia) to identify the ways in which ML-LMs acquire specific facts. We finally identified three patterns of acquiring and representing facts in ML-LMs: language-independent, cross-lingual shared and transferred, and devised methods for differentiating them. Our findings highlight the challenge of maintaining consistent factual knowledge across languages, underscoring the need for better fact representation learning in ML-LMs.

pdf bib
Comparing Knowledge Sources for Open-Domain Scientific Claim Verification
Juraj Vladika | Florian Matthes

The increasing rate at which scientific knowledge is discovered and health claims shared online has highlighted the importance of developing efficient fact-checking systems for scientific claims. The usual setting for this task in the literature assumes that the documents containing the evidence for claims are already provided and annotated or contained in a limited corpus. This renders the systems unrealistic for real-world settings where knowledge sources with potentially millions of documents need to be queried to find relevant evidence. In this paper, we perform an array of experiments to test the performance of open-domain claim verification systems. We test the final verdict prediction of systems on four datasets of biomedical and health claims in different settings. While keeping the pipeline’s evidence selection and verdict prediction parts constant, document retrieval is performed over three common knowledge sources (PubMed, Wikipedia, Google) and using two different information retrieval techniques. We show that PubMed works better with specialized biomedical claims, while Wikipedia is more suited for everyday health concerns. Likewise, BM25 excels in retrieval precision, while semantic search in recall of relevant evidence. We discuss the results, outline frequent retrieval patterns and challenges, and provide promising future directions.

pdf bib
Measuring Uncertainty in Neural Machine Translation with Similarity-Sensitive Entropy
Julius Cheng | Andreas Vlachos

Uncertainty estimation is an important diagnostic tool for statistical models, and is often used to assess the confidence of model predictions. Previous work shows that neural machine translation (NMT) is an intrinsically uncertain task where there are often multiple correct and semantically equivalent translations, and that well-trained NMT models produce good translations despite spreading probability mass among many semantically similar translations. These findings suggest that popular measures of uncertainty based on token- and sequence-level entropies which measure surface form diversity may not be good proxies of the more useful quantity of interest, semantic diversity. We propose to adapt similarity-sensitive Shannon entropy (S3E), a concept borrowed from theoretical ecology, for NMT. By demonstrating significantly improved correlation between S3E and task performance on quality estimation and named entity recall, we show that S3E is a useful framework for measuring uncertainty in NMT.

pdf bib
LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text
Dor Bernsohn | Gil Semo | Yaron Vazana | Gila Hayat | Ben Hagag | Joel Niklaus | Rohit Saha | Kyryl Truskovskyi

In this study, we focus on two main tasks, the first for detecting legal violations within unstructured textual data, and the second for associating these violations with potentially affected individuals. We constructed two datasets using Large Language Models (LLMs) which were subsequently validated by domain expert annotators. Both tasks were designed specifically for the context of class-action cases. The experimental design incorporated fine-tuning models from the BERT family and open-source LLMs, and conducting few-shot experiments using closed-source LLMs. Our results, with an F1-score of 62.69% (violation identification) and 81.02% (associating victims), show that our datasets and setups can be used for both tasks. Finally, we publicly release the datasets and the code used for the experiments in order to advance further research in the area of legal natural language processing (NLP).

pdf bib
𝜇PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge
Fantine Huot | Joshua Maynez | Chris Alberti | Reinald Kim Amplayo | Priyanka Agrawal | Constanza Fierro | Shashi Narayan | Mirella Lapata

Cross-lingual summarization aims to generate a summary in one languagegiven input in a different language, allowing for the dissemination ofrelevant content among different language speaking populations. Thetask is challenging mainly due to the paucity of cross-lingualdatasets and the compounded difficulty of summarizing andtranslating.This work presents 𝜇PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge. We formulate the plan as a sequence of entities capturing thesummary’s content and the order in which it should becommunicated. Importantly, our plans abstract from surface form: usinga multilingual knowledge base, we align entities to their canonicaldesignation across languages and generate the summary conditioned onthis cross-lingual bridge and the input. Automatic and human evaluation on the XWikis dataset (across four language pairs) demonstrates that our planning objective achieves state-of-the-art performance interms of informativeness and faithfulness. Moreover, 𝜇PLAN modelsimprove the zero-shot transfer to new cross-lingual language pairscompared to baselines without a planning component.

pdf bib
Exploring Data Augmentation in Neural DRS-to-Text Generation
Muhammad Saad Amin | Luca Anselma | Alessandro Mazzei

Neural networks are notoriously data-hungry. This represents an issue in cases where data are scarce such as in low-resource languages. Data augmentation is a technique commonly used in computer vision to provide neural networks with more data and increase their generalization power. When dealing with data augmentation for natural language, however, simple data augmentation techniques similar to the ones used in computer vision such as rotation and cropping cannot be employed because they would generate ungrammatical texts. Thus, data augmentation needs a specific design in the case of neural logic-to-text systems, especially for a structurally rich input format such as the ones used for meaning representation. This is the case of the neural natural language generation for Discourse Representation Structures (DRS-to-Text), where the logical nature of DRS needs a specific design of data augmentation. In this paper, we adopt a novel approach in DRS-to-Text to selectively augment a training set with new data by adding and varying two specific lexical categories, i.e. proper and common nouns. In particular, we propose using WordNet supersenses to produce new training sentences using both in-and-out-of-context nouns. We present a number of experiments for evaluating the role played by augmented lexical information. The experimental results prove the effectiveness of our approach for data augmentation in DRS-to-Text generation.

pdf bib
Think Twice: Measuring the Efficiency of Eliminating Prediction Shortcuts of Question Answering Models
Lukáš Mikula | Michal Štefánik | Marek Petrovič | Petr Sojka

While the Large Language Models (LLMs) dominate a majority of language understanding tasks, previous work shows that some of these results are supported by modelling spurious correlations of training datasets. Authors commonly assess model robustness by evaluating their models on out-of-distribution (OOD) datasets of the same task, but these datasets might share the bias of the training dataset. We propose a simple method for measuring a scale of models’ reliance on any identified spurious feature and assess the robustness towards a large set of known and newly found prediction biases for various pre-trained models and debiasing methods in Question Answering (QA). We find that the reported OOD gains of debiasing methods can not be explained by mitigated reliance on biased features, suggesting that biases are shared among different QA datasets. We further evidence this by measuring that performance of OOD models depends on bias features comparably to the ID model. Our findings motivate future work to refine the reports of LLMs’ robustness to a level of known spurious features.

pdf bib
Improving Contrastive Learning in Emotion Recognition in Conversation via Data Augmentation and Decoupled Neutral Emotion
Yujin Kang | Yoon-Sik Cho

Emotion recognition in conversation (ERC) has attracted much attention due to its wide applications. While consistent improvement is being made in this area, inevitable challenge comes from the dataset. The ERC dataset exhibits significantly imbalanced emotion distribution. While the utterances with neutral emotion predominate the data, this emotion label is always treated the same as other emotion labels in current approaches. To address the problem caused by the dataset, we propose a supervised contrastive learning specifically oriented for ERC task. We employ a novel data augmentation method emulating the emotion dynamics in a conversation and formulate supervised contrastive learning method tailored for ERC addressing the predominance and the ambiguity of neutral emotion. Experimental results on four benchmark datasets demonstrate the effectiveness of our approach.

pdf bib
CroCoAlign: A Cross-Lingual, Context-Aware and Fully-Neural Sentence Alignment System for Long Texts
Francesco Molfese | Andrei Bejgu | Simone Tedeschi | Simone Conia | Roberto Navigli

Sentence alignment – establishing links between corresponding sentences in two related documents – is an important NLP task with several downstream applications, such as machine translation (MT). Despite the fact that existing sentence alignment systems have achieved promising results, their effectiveness is based on auxiliary information such as document metadata or machine-generated translations, as well as hyperparameter-sensitive techniques. Moreover, these systems often overlook the crucial role that context plays in the alignment process. In this paper, we address the aforementioned issues and propose CroCoAlign: the first context-aware, end-to-end and fully neural architecture for sentence alignment. Our system maps source and target sentences in long documents by contextualizing their sentence embeddings with respect to the other sentences in the document. We extensively evaluate CroCoAlign on a multilingual dataset consisting of 20 language pairs derived from the Opus project, and demonstrate that our model achieves state-of-the-art performance. To ensure reproducibility, we release our code and model checkpoints at

pdf bib
Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Eliana Pastor | Alkis Koudounas | Giuseppe Attanasio | Dirk Hovy | Elena Baralis

Predictive models make mistakes and have biases. To combat both, we need to understand their predictions.Explainable AI (XAI) provides insights into models for vision, language, and tabular data. However, only a few approaches exist for speech classification models. Previous works focus on a selection of spoken language understanding (SLU) tasks, and most users find their explanations challenging to interpret.We propose a novel approach to explain speech classification models. It provides two types of insights. (i) Word-level. We measure the impact of each audio segment aligned with a word on the outcome. (ii) Paralinguistic. We evaluate how non-linguistic features (e.g., prosody and background noise) affect the outcome if perturbed.We validate our approach by explaining two state-of-the-art SLU models on two tasks in English and Italian. We test their plausibility with human subject ratings. Our results show that the explanations correctly represent the model’s inner workings and are plausible to humans.

pdf bib
Zero-Shot End-to-End Spoken Language Understanding via Cross-Modal Selective Self-Training
Jianfeng He | Julian Salazar | Kaisheng Yao | Haoqi Li | Jason Cai

End-to-end (E2E) spoken language understanding (SLU) is constrained by the cost of collecting speech-semantics pairs, especially when label domains change. Hence, we explore zero-shot E2E SLU, which learns E2E SLU without speech-semantics pairs, instead using only speech-text and text-semantics pairs. Previous work achieved zero-shot by pseudolabeling all speech-text transcripts with a natural language understanding (NLU) model learned on text-semantics corpora. However, this method requires the domains of speech-text and text-semantics to match, which often mismatch due to separate collections. Furthermore, using the entire collected speech-text corpus from any domains leads to imbalance and noise issues. To address these, we propose cross-modal selective self-training (CMSST). CMSST tackles imbalance by clustering in a joint space of the three modalities (speech, text, and semantics) and handles label noise with a selection network. We also introduce two benchmarks for zero-shot E2E SLU, covering matched and found speech (mismatched) settings. Experiments show that CMSST improves performance in both two settings, with significantly reduced sample sizes and training time. Our code and data are released in

pdf bib
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models
Natalie Shapira | Mosh Levy | Seyed Hossein Alavi | Xuhui Zhou | Yejin Choi | Yoav Goldberg | Maarten Sap | Vered Shwartz

The escalating debate on AI’s capabilities warrants developing reliable metrics to assess machine “intelligence.” Recently, many anecdotal examples were used to suggest that newer Large Language Models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM); however, prior work reached conflicting conclusions regarding those abilities. We investigate the extent of LLMs’ N-ToM through an extensive evaluation of 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust. We further examine the factors impacting performance on N-ToM tasks and discover that LLMs struggle with adversarial examples, indicating reliance on shallow heuristics rather than robust ToM abilities. We caution against drawing conclusions from anecdotal examples, limited benchmark testing, and using human-designed psychological tests to evaluate models.

pdf bib
NevIR: Negation in Neural Information Retrieval
Orion Weller | Dawn Lawrie | Benjamin Van Durme

Negation is a common everyday phenomena and has been a consistent area of weakness for language models (LMs). Although the Information Retrieval (IR) community has adopted LMs as the backbone of modern IR architectures, there has been little to no research in understanding how negation impacts neural IR. We therefore construct a straightforward benchmark on this theme: asking IR models to rank two documents that differ only by negation. We show that the results vary widely according to the type of IR architecture: cross-encoders perform best, followed by late-interaction models, and in last place are bi-encoder and sparse neural architectures. We find that most current information retrieval models do not consider negation, performing similarly or worse than randomly ranking. We show that although the obvious approach of continued fine-tuning on a dataset of contrastive documents containing negations increases performance (as does model size), there is still a large gap between machine and human performance.

pdf bib
“According to . . . ”: Prompting Language Models Improves Quoting from Pre-Training Data
Orion Weller | Marc Marone | Nathaniel Weir | Dawn Lawrie | Daniel Khashabi | Benjamin Van Durme

Large Language Models (LLMs) may hallucinate and generate fake information, despite pre-training on factual data. Inspired by the journalistic device of “according to sources”, we propose according-to prompting: directing LLMs to ground responses against previously observed text. To quantify this grounding, we propose a novel evaluation metric (QUIP-Score) that measures the extent to which model-produced answers are directly found in underlying text corpora. We illustrate with experiments on three corpora (Wikipedia, PubMed, and the U.S. legal tax code) that these prompts improve grounding under our metrics, with the additional benefit of often improving end-task performance. Furthermore, prompts that ask the model to decrease grounding (or to ground to other corpora) indeed decrease QUIP-Score, indicating the ability of LLMs to increase or decrease grounded generations on request.

pdf bib
Accurate and Well-Calibrated ICD Code Assignment Through Attention Over Diverse Label Embeddings
Goncalo Gomes | Isabel Coutinho | Bruno Martins

Although the International Classification of Diseases (ICD) has been adopted worldwide, manually assigning ICD codes to clinical text is time-consuming, error-prone, and expensive, motivating the development of automated approaches. This paper describes a novel approach for automated ICD coding, combining several ideas from previous related work. We specifically employ a strong Transformer-based model as a text encoder and, to handle lengthy clinical narratives, we explored either (a) adapting the base encoder model into a Longformer, or (b) dividing the text into chunks and processing each chunk independently. The representations produced by the encoder are combined with a label embedding mechanism that explores diverse ICD code synonyms. Experiments with different splits of the MIMIC-III dataset show that the proposed approach outperforms the current state-of-the-art models in ICD coding, with the label embeddings significantly contributing to the good performance. Our approach also leads to properly calibrated classification results, which can effectively inform downstream tasks such as quantification.

pdf bib
Investigating Content Planning for Navigating Trade-offs in Knowledge-Grounded Dialogue
Kushal Chawla | Hannah Rashkin | Gaurav Singh Tomar | David Reitter

Knowledge-grounded dialogue generation is a challenging task because it requires satisfying two fundamental, yet often competing constraints: being responsive in a manner that is specific to what the conversation partner has said while also being attributable to an underlying source document. In this work, we bring this trade-off between these two objectives (specificity and attribution) to light, and ask the question: Can explicit content planning before the response generation help the model to address this challenge? To answer this question, we design a framework called PLEDGE, which allows us to experiment with various plan variables explored in prior work supporting both metric-agnostic and metric-aware approaches. While content planning shows promise, our results on whether it can actually help to navigate this trade-off are mixed – planning mechanisms that are metric-aware (use automatic metrics during training) are better at automatic evaluations but underperform in human judgment compared to metric-agnostic mechanisms. We discuss how this may be caused by over-fitting to automatic metrics, and the need for future work to better calibrate these metrics towards human judgment. We hope the observations from our analysis will inform future work that aims to apply content planning in this context.

pdf bib
SPUQ: Perturbation-Based Uncertainty Quantification for Large Language Models
Xiang Gao | Jiaxin Zhang | Lalla Mouatadid | Kamalika Das

In recent years, large language models (LLMs) have become increasingly prevalent, offering remarkable text generation capabilities. However, a pressing challenge is their tendency to make confidently wrong predictions, highlighting the critical need for uncertainty quantification (UQ) in LLMs. While previous works have mainly focused on addressing aleatoric uncertainty, the full spectrum of uncertainties, including epistemic, remains inadequately explored. Motivated by this gap, we introduce a novel UQ method, sampling with perturbation for UQ (SPUQ), designed to tackle both aleatoric and epistemic uncertainties. The method entails generating a set of perturbations for LLM inputs, sampling outputs for each perturbation, and incorporating an aggregation module that generalizes the sampling uncertainty approach for text generation tasks. Through extensive experiments on various datasets, we investigated different perturbation and aggregation techniques. Our findings show a substantial improvement in model uncertainty calibration, with a reduction in Expected Calibration Error (ECE) by 50% on average. Our findings suggest that our proposed UQ method offers promising steps toward enhancing the reliability and trustworthiness of LLMs.

pdf bib
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
Rabeeh Karimi Mahabadi | Hamish Ivison | Jaesung Tae | James Henderson | Iz Beltagy | Matthew Peters | Arman Cohan

Diffusion models have emerged as a powerful paradigm for generation, obtaining strong performance in various continuous domains. However, applying continuous diffusion models to natural language remains challenging due to its discrete nature and the need for a large number of diffusion steps to generate text, making diffusion-based generation expensive.In this work, we propose Text-to-text Self-conditioned Simplex Diffusion (TESS), a text diffusion model that is fully non-autoregressive, employs a new form of self-conditioning, and applies the diffusion process on the logit simplex space rather than the learned embedding space.Through extensive experiments on natural language understanding and generation tasks including summarization, text simplification, paraphrase generation, and question generation, we demonstrate that TESS outperforms state-of-the-art non-autoregressive models, requires fewer diffusion steps with minimal drop in performance, and is competitive with pretrained autoregressive sequence-to-sequence models.

pdf bib
Advancing Precise Outline-Conditioned Text Generation with Task Duality and Explicit Outline Control
Yunzhe Li | Qian Chen | Weixiang Yan | Wen Wang | Qinglin Zhang | Hari Sundaram

Existing works on outline-conditioned text generation typically aim to generate text using provided outlines as rough sketches, such as keywords and phrases. However, these approaches make it challenging to control the quality of text generation and assess consistency between outlines and generated texts due to lack of clarity and rationality of the rough outlines. In this paper, we introduce a novel text generation task called Precise Outline-conditioned Generation, which requires generating stories based on specific, sentence-level outlines. To facilitate research on this task, we construct two new datasets, WPOG and CDM. We provide strong baselines based on fine-tuning models such as BART and GPT-2, and evaluating zero-shot performance of models such as ChatGPT and Vicuna. Furthermore, we identify an issue of imbalanced utilization of the outline information in the precise outline-conditioned generation, which is ubiquitously observed across fine-tuned models and zero-shot inference models. To address this issue, we propose an explicit outline utilization control approach and a novel framework that leverages the task duality between summarization and generation. Experimental results show that the proposed approaches effectively alleviate the issue of imbalanced outline utilization and enhance the quality of precise outline-conditioned text generation for both fine-tuning and zero-shot settings.

pdf bib
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Zhuowan Li | Cihang Xie | Benjamin Van Durme | Alan Yuille

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether multi-modal learning can help understand each individual modality. In this work, we conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models by probing on a broad range of tasks. Five probing tasks are evaluated in order to assess the quality of the learned representations in a nuanced manner. Our results on five probing tasks suggest vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.

pdf bib
Creating Suspenseful Stories: Iterative Planning with Large Language Models
Kaige Xie | Mark Riedl

Automated story generation has been one of the long-standing challenges in NLP. Among all dimensions of stories, *suspense* is very common in human-written stories but relatively under-explored in AI-generated stories. While recent advances in large language models (LLMs) have greatly promoted language generation in general, state-of-the-art LLMs are still unreliable when it comes to suspenseful story generation. We propose a novel iterative-prompting-based planning method that is grounded in two theoretical foundations of story suspense from cognitive psychology and narratology. This theory-grounded method works in a fully zero-shot manner and does not rely on any supervised story corpora. To the best of our knowledge, this paper is the first attempt at suspenseful story generation with LLMs. Extensive human evaluations of the generated suspenseful stories demonstrate the effectiveness of our method.

pdf bib
Few-Shot Dialogue Summarization via Skeleton-Assisted Prompt Transfer in Prompt Tuning
Kaige Xie | Tong Yu | Haoliang Wang | Junda Wu | Handong Zhao | Ruiyi Zhang | Kanak Mahadik | Ani Nenkova | Mark Riedl

In real-world scenarios, labeled samples for dialogue summarization are usually limited (i.e., few-shot) due to high annotation costs for high-quality dialogue summaries. To efficiently learn from few-shot samples, previous works have utilized massive annotated data from other downstream tasks and then performed prompt transfer in prompt tuning so as to enable cross-task knowledge transfer. However, existing general-purpose prompt transfer techniques lack consideration for dialogue-specific information. In this paper, we focus on improving the prompt transfer from dialogue state tracking to dialogue summarization and propose Skeleton-Assisted Prompt Transfer (SAPT), which leverages skeleton generation as extra supervision that functions as a medium connecting the distinct source and target task and resulting in the model’s better consumption of dialogue state information. To automatically extract dialogue skeletons as supervised training data for skeleton generation, we design a novel approach with perturbation-based probes requiring neither annotation effort nor domain knowledge. Training the model on such skeletons can also help preserve model capability during prompt transfer. Our method significantly outperforms existing baselines. In-depth analyses demonstrate the effectiveness of our method in facilitating cross-task knowledge transfer in few-shot dialogue summarization.

pdf bib
Ask, Assess, and Refine: Rectifying Factual Consistency and Hallucination in LLMs with Metric-Guided Feedback Learning
Dongyub Lee | Eunhwan Park | Hodong Lee | Heuiseok Lim

Recent advancements in Large Language Models (LLMs) have heralded unprecedented capabilities in information-seeking and text generation, as evidenced by applications like Bing Chat and Despite these strides, challenges on hallucination and factual inconsistency continue to impede their wider real-world adoption. Contemporary methods, including retrieval-augmented LLMs and feedback-based learning, serve as alternatives to mitigate these challenges. However, challenges remain, particularly regarding referencing erroneous evidence (citation errors) and generating information not present in the evidence (hallucination). In this paper, we introduce the 𝖠2𝖱 framework: Ask, Assess, and Refine. Our approach utilizes an explicit evaluation paradigm, incorporating metrics specifically tailored to assess citation errors and hallucination, aiming to address these prevalent challenges robustly. Capitalizing on these evaluations, we devise a strategy to formulate actionable natural language feedback, enabling iterative refinements that yield improved factual consistency and reduced hallucinations in responses. Our experiments on ASQA, ELI5, and QAMPARI datasets demonstrate our method’s superiority in enhancing correctness, fluency, and citation quality.

pdf bib
Effective Controllable Bias Mitigation for Classification and Retrieval using Gate Adapters
Shahed Masoudian | Cornelia Volaucnik | Markus Schedl | Navid Rekabsaz

Bias mitigation of Language Models has been the topic of many studies with a recent focus on learning separate modules like adapters for on-demand debiasing. Besides optimizing for a modularized debiased model, it is often critical in practice to control the degree of bias reduction at inference time, e.g., in order to tune for a desired performance-fairness trade-off in search results or to control the strength of debiasing in classification tasks. In this paper, we introduce Controllable Gate Adapter (ConGater), a novel modular gating mechanism with adjustable sensitivity parameters, %In addition to better perseverance of task performance and enhanced information removal, which allows for a gradual transition from the biased state of the model to the fully debiased version at inference time. We demonstrate ConGater performance by (1) conducting adversarial debiasing experiments with three different models on three classification tasks with four protected attributes, and (2) reducing the bias of search results through fairness list-wise regularization to enable adjusting a trade-off between performance and fairness metrics. Our experiments on the classification tasks show that compared to baselines of the same caliber, ConGater can maintain higher task performance while containing less information regarding the attributes. Our results on the retrieval task show that the fully debiased ConGater can achieve the same fairness performance while maintaining more than twice as high task performance than recent strong baselines. Overall, besides strong performance ConGater enables the continuous transitioning between biased and debiased states of models, enhancing personalization of use and interpretability through controllability.

pdf bib
STable: Table Generation Framework for Encoder-Decoder Models
Michał Pietruszka | Michał Turski | Łukasz Borchmann | Tomasz Dwojak | Gabriela Nowakowska | Karolina Szyndler | Dawid Jurkiewicz | Łukasz Garncarek

The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this constatation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table’s content across all random permutations of the factorization order. During the content inference, we exploit the model’s ability to generate cells in any order by searching over possible orderings to maximize the model’s confidence and avoid substantial error accumulation, which other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15\\%.

pdf bib
A RelEntLess Benchmark for Modelling Graded Relations between Named Entities
Asahi Ushio | Jose Camacho-Collados | Steven Schockaert

Relations such as “is influenced by”, “is known for” or “is a competitor of” are inherently graded: we can rank entity pairs based on how well they satisfy these relations, but it is hard to draw a line between those pairs that satisfy them and those that do not. Such graded relations play a central role in many applications, yet they are typically not covered by existing Knowledge Graphs. In this paper, we consider the possibility of using Large Language Models (LLMs) to fill this gap. To this end, we introduce a new benchmark, in which entity pairs have to be ranked according to how much they satisfy a given graded relation. The task is formulated as a few-shot ranking problem, where models only have access to a description of the relation and five prototypical instances. We use the proposed benchmark to evaluate state-of-the-art relation embedding strategies as well as several publicly available LLMs and closed conversational models such as GPT-4. We find that smaller language models struggle to outperform a naive baseline. Overall, the best results are obtained with the 11B parameter Flan-T5 model and the 13B parameter OPT model, where further increasing the model size does not seem to be beneficial. For all models, a clear gap with human performance remains.

pdf bib
A Multimodal Framework to Detect Target Aware Aggression in Memes
Shawly Ahsan | Eftekhar Hossain | Omar Sharif | Avishek Das | Mohammed Moshiul Hoque | M. Dewan

Internet memes have gained immense traction as a medium for individuals to convey emotions, thoughts, and perspectives on social media. While memes often serve as sources of humor and entertainment, they can also propagate offensive, incendiary, or harmful content, deliberately targeting specific individuals or communities. Identifying such memes is challenging because of their satirical and cryptic characteristics. Most contemporary research on memes’ detrimental facets is skewed towards high-resource languages, often sidelining the unique challenges tied to low-resource languages, such as Bengali. To facilitate this research in low-resource languages, this paper presents a novel dataset MIMOSA (MultIMOdal aggreSsion dAtaset) in Bengali. MIMOSA encompasses 4,848 annotated memes across five aggression target categories: Political, Gender, Religious, Others, and non-aggressive. We also propose MAF (Multimodal Attentive Fusion), a simple yet effective approach that uses multimodal context to detect the aggression targets. MAF captures the selective modality-specific features of the input meme and jointly evaluates them with individual modality features. Experiments on MIMOSA exhibit that the proposed method outperforms several state-of-the-art rivaling approaches. Our code and data are available at

pdf bib
Graph Guided Question Answer Generation for Procedural Question-Answering
Hai Pham | Isma Hadji | Xinnuo Xu | Ziedune Degutyte | Jay Rainey | Evangelos Kazakos | Afsaneh Fazly | Georgios Tzimiropoulos | Brais Martinez

In this paper, we focus on task-specific question answering (QA). To this end, we introduce a method for generating exhaustive and high-quality training data, which allows us to train compact (e.g., run on a mobile device), task-specific QA models that are competitive against GPT variants. The key technological enabler is a novel mechanism for automatic question-answer generation from procedural text which can ingest large amounts of textual instructions and produce exhaustive in-domain QA training data. While current QA data generation methods can produce well-formed and varied data, their non-exhaustive nature is sub-optimal for training a QA model. In contrast, we leverage the highly structured aspect of procedural text and represent each step and the overall flow of the procedure as graphs. We then condition on graph nodes to automatically generate QA pairs in an exhaustive and controllable manner. Comprehensive evaluations of our method show that: 1) small models trained with our data achieve excellent performance on the target QA task, even exceeding that of GPT3 and ChatGPT despite being several orders of magnitude smaller. 2) semantic coverage is the key indicator for downstream QA performance. Crucially, while large language models excel at syntactic diversity, this does not necessarily result in improvements on the end QA model. In contrast, the higher semantic coverage provided by our method is critical for QA performance.

pdf bib
Contrastive Decoding Reduces Hallucinations in Large Multilingual Machine Translation Models
Jonas Waldendorf | Barry Haddow | Alexandra Birch

In Neural Machine Translation (NMT), models will sometimes generate repetitive or fluent output that is not grounded in the source sentence. This phenomenon is known as hallucination and is a problem even in large-scale multilingual translation models. We propose to use Contrastive Decoding, an algorithm developed to improve generation from unconditional language models, to mitigate hallucinations in NMT. Specifically, we maximise the log-likelihood difference between a model and the same model with reduced contribution from the encoder outputs. Additionally, we propose an alternative implementation of Contrastive Decoding that dynamically weights the difference based on the maximum probability in the output distribution to reduce the effect of CD when the model is confident of its prediction. We evaluate our methods using the Small (418M) and Medium (1.2B) M2M models across 21 low and medium-resource language pairs. Our results show a 14.6 ± 0.5 and 11.0 ± 0.6 maximal increase in the mean COMET scores for the Small and Medium models on those sentences for which the M2M models initially generate a hallucination., respectively.

pdf bib
Leveraging fine-tuned Large Language Models with LoRA for Effective Claim, Claimer, and Claim Object Detection
Sotiris Kotitsas | Panagiotis Kounoudis | Eleni Koutli | Haris Papageorgiou

Misinformation and disinformation phenomena existed long before the advent of digital technologies. The exponential use of social media platforms, whose information feeds have created the conditions for many to many communication and instant amplification of the news has accelerated the diffusion of inaccurate and misleading information. As a result, the identification of claims have emerged as a pivotal technology for combating the influence of misinformation and disinformation within news media. Most existing work has concentrated on claim analysis at the sentence level, neglecting the crucial exploration of supplementary attributes such as the claimer and the claim object of the claim or confining it by limiting its scope to a predefined list of topics. Furthermore, previous research has been mostly centered around political debates, Wikipedia articles, and COVID-19 related content. By leveraging the advanced capabilities of Large Language Models (LLMs) in Natural Language Understanding (NLU) and text generation, we propose a novel architecture utilizing LLMs finetuned with LoRA to transform the claim, claimer and claim object detection task into a Question Answering (QA) setting. We evaluate our approach in a dataset of 867 scientific news articles of 3 domains (Health, Climate Change, Nutrition) (HCN), which are human annotated with the major claim, the claimer and the object of the major claim. We also evaluate our proposed model in the benchmark dataset of NEWSCLAIMS. Experimental and qualitative results showcase the effectiveness of the proposed approach. We make our dataset publicly available to encourage further research.

pdf bib
Should I try multiple optimizers when fine-tuning a pre-trained Transformer for NLP tasks? Should I tune their hyperparameters?
Nefeli Gkouti | Prodromos Malakasiotis | Stavros Toumpis | Ion Androutsopoulos

NLP research has explored different neural model architectures and sizes, datasets, training objectives, and transfer learning techniques. However, the choice of optimizer during training has not been explored as extensively. Typically, some variant of Stochastic Gradient Descent (SGD) is employed, selected among numerous variants, using unclear criteria, often with minimal or no tuning of the optimizer’s hyperparameters. Experimenting with five GLUE datasets, two models (DistilBERT and DistilRoBERTa), and seven popular optimizers (SGD, SGD with Momentum, Adam, AdaMax, Nadam, AdamW, and AdaBound), we find that when the hyperparameters of the optimizers are tuned, there is no substantial difference in test performance across the five more elaborate (adaptive) optimizers, despite differences in training loss. Furthermore, tuning just the learning rate is in most cases as good as tuning all the hyperparameters. Hence, we recommend picking any of the best-behaved adaptive optimizers (e.g., Adam) and tuning only its learning rate. When no hyperparameter can be tuned, SGD with Momentum is the best choice.

pdf bib
GUMsley: Evaluating Entity Salience in Summarization for 12 English Genres
Jessica Lin | Amir Zeldes

As NLP models become increasingly capable of understanding documents in terms of coherent entities rather than strings, obtaining the most salient entities for each document is not only an important end task in itself but also vital for Information Retrieval (IR) and other downstream applications such as controllable summarization. In this paper, we present and evaluate GUMsley, the first entity salience dataset covering all named and non-named salient entities for 12 genres of English text, aligned with entity types, Wikification links and full coreference resolution annotations. We promote a strict definition of salience using human summaries and demonstrate high inter-annotator agreement for salience based on whether a source entity is mentioned in the summary. Our evaluation shows poor performance by pre-trained SOTA summarization models and zero-shot LLM prompting in capturing salient entities in generated summaries. We also show that predicting or providing salient entities to several model architectures enhances performance and helps derive higher-quality summaries by alleviating the entity hallucination problem in existing abstractive summarization.

pdf bib
Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting
Tilman Beck | Hendrik Schuff | Anne Lauscher | Iryna Gurevych

Annotators’ sociodemographic backgrounds (i.e., the individual compositions of their gender, age, educational background, etc.) have a strong impact on their decisions when working on subjective NLP tasks, such as toxic language detection. Often, heterogeneous backgrounds result in high disagreements. To model this variation, recent work has explored sociodemographic prompting, a technique, which steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give. However, the available NLP literature disagrees on the efficacy of this technique — it remains unclear for which tasks and scenarios it can help, and the role of the individual factors in sociodemographic prompting is still unexplored. We address this research gap by presenting the largest and most comprehensive study of sociodemographic prompting today. We use it to analyze its influence on model sensitivity, performance and robustness across seven datasets and six instruction-tuned model families. We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.However, its outcomes largely vary for different model types, sizes, and datasets, and are subject to large variance with regards to prompt formulations. Most importantly, our results show that sociodemographic prompting should be used with care when used for data annotation or studying LLM alignment.

pdf bib
Threat Behavior Textual Search by Attention Graph Isomorphism
Chanwoo Bae | Guanhong Tao | Zhuo Zhang | Xiangyu Zhang

Cyber attacks cause over $1 trillion loss every year. An important task for cyber security analysts is attack forensics. It entails understanding malware behaviors and attack origins. However, existing automated or manual malware analysis can only disclose a subset of behaviors due to inherent difficulties (e.g., malware cloaking and obfuscation). As such, analysts often resort to text search techniques to identify existing malware reports based on the symptoms they observe, exploiting the fact that malware samples share a lot of similarity, especially those from the same origin. In this paper, we propose a novel malware behavior search technique that is based on graph isomorphism at the attention layers of Transformer models. We also compose a large dataset collected from various agencies to facilitate such research.Our technique outperforms state-of-the-art methods, such as those based on sentence embeddings and keywords by 6-14%. In the case study of 10 real-world malwares, our technique can correctly attribute 8 of them to their ground truth origins while using Google only works for 3 cases.

pdf bib
Identifying Narrative Content in Podcast Transcripts
Yosra Abdessamed | Shadi Rezapour | Steven Wilson

As one of the oldest forms of human communication, narratives appear across a variety of genres and media. Computational methods have been applied to study narrativity in novels, social media, and patient records, leading to new approaches and insights. However, other types of media are growing in popularity, like podcasts. Podcasts contain a multitude of spoken narratives that can provide a meaningful glimpse into how people share stories with one another.In this paper, we outline and apply methods to process English-language podcast transcripts and extract narrative content from conversations within each episode. We provide an initial analysis of the types of narrative content that exists within a wide range of podcasts, and compare our results to other established narrative analysis tools.Our annotations for narrativity and pretrained models can help to enable future research into narrativity within a large corpus of approximately 100,000 podcast episodes.

pdf bib
Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times
Byung-Doh Oh | Shisen Yue | William Schuler

Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window size as well as stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models’ surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.

pdf bib
Presentations by the Humans and For the Humans: Harnessing LLMs for Generating Persona-Aware Slides from Documents
Ishani Mondal | Shwetha S | Anandhavelu Natarajan | Aparna Garimella | Sambaran Bandyopadhyay | Jordan Boyd-Graber

Scientific papers and slides are two different representations of the same underlying information, but both require substantial work to prepare. While there had been prior efforts on automating document-to-slides generation, there is still a pressing need of customizing the presentation of content aligning with the persona of target audience or duration of presentation. This paper first introduces the concept of end-user specification-aware document to slides conversion that incorporates end-user specifications into the conversion process. For this, we initially introduce a new dataset reuse the existing SciDuet dataset consisting of pairs of papers and corresponding slides decks from recent years’ *ACL conferences to create four persona-aware configurations. Secondly, we present Persona-Aware-D2S, a novel approach by finetuning LLMs using target audience feedback to create persona-aware slides from scientific documents. Our evaluation on both automated metrics and qualitative human evaluation suggests that by incorporating end-user specifications into the conversion process, our model can create presentations that are not only informative but also tailored to expectations and cognitive abilities of target audience.

pdf bib
ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks
Bolei Ma | Ercong Nie | Shuzhou Yuan | Helmut Schmid | Michael Färber | Frauke Kreuter | Hinrich Schuetze

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.

pdf bib
Small Language Models Improve Giants by Rewriting Their Outputs
Giorgos Vernikos | Arthur Brazinskas | Jakub Adamek | Jonathan Mallinson | Aliaksei Severyn | Eric Malmi

Despite the impressive performance of large language models (LLMs), theyoften lag behind specialized models in various tasks. LLMs only use a fractionof the existing training data for in-context learning, while task-specificmodels harness the full dataset for fine-tuning. In this work, we tackle theproblem of leveraging training data to improve the performance of LLMs withoutfine-tuning. Our approach directly targets LLM predictions without requiringaccess to their weights. We create a pool of candidates from the LLM throughfew-shot prompting and we employ a compact model, the LM-corrector (LMCor),specifically trained to merge these candidates to produce an enhanced output.Our experiments on four natural language generation tasks demonstrate that evena small LMCor model (250M) substantially improves the few-shot performance ofLLMs (62B), matching and even outperforming standard fine-tuning. Furthermore,we illustrate the robustness of LMCor against different prompts, therebyminimizing the need for extensive prompt engineering. Finally, we show thatLMCor can be seamlessly integrated with different LLMs at inference, serving asa plug-and-play module to improve their performance.

pdf bib
Unintended Bias Detection and Mitigation in Misogynous Memes
Gitanjali Kumari | Anubhav Sinha | Asif Ekbal

Online sexism has become a concerning issue in recent years, especially conveyed through memes. Although this alarming phenomenon has triggered many studies from computational linguistic and natural language processing points of view, less effort has been spent analyzing if those misogyny detection models are affected by an unintended bias. Such biases can lead models to incorrectly label non-misogynous memes misogynous due to specific identity terms, perpetuating harmful stereotypes and reinforcing negative attitudes. This paper presents the first and most comprehensive approach to measure and mitigate unintentional bias in the misogynous memes detection model, aiming to develop effective strategies to counter their harmful impact. Our proposed model, the Contextualized Scene Graph-based Multimodal Network (CTXSGMNet), is an integrated architecture that combines VisualBERT, a CLIP-LSTM-based memory network, and an unbiased scene graph module with supervised contrastive loss, achieves state-of-the-art performance in mitigating unintentional bias in misogynous memes.Empirical evaluation, including both qualitative and quantitative analysis, demonstrates the effectiveness of our CTXSGMNet framework on the SemEval-2022 Task 5 (MAMI task) dataset, showcasing its promising performance in terms of Equity of Odds and F1 score. Additionally, we assess the generalizability of the proposed model by evaluating their performance on a few benchmark meme datasets, providing a comprehensive understanding of our approach’s efficacy across diverse datasets.

pdf bib
A Weak Supervision Approach for Few-Shot Aspect Based Sentiment Analysis
Robert Vacareanu | Siddharth Varia | Kishaloy Halder | Shuai Wang | Giovanni Paolini | Neha Anna John | Miguel Ballesteros | Smaranda Muresan

We explore how weak supervision on abundant unlabeled data can be leveraged to improve few-shot performance in aspect-based sentiment analysis (ABSA) tasks. We propose a pipeline approach to construct a noisy ABSA dataset, and we use it to adapt a pre-trained sequence-to-sequence model to the ABSA tasks. We test the resulting model on three widely used ABSA datasets, before and after fine-tuning. Our proposed method preserves the full fine-tuning performance while showing significant improvements (15.84 absolute F1) in the few-shot learning scenario for the harder tasks. In zero-shot (i.e., without fine-tuning), our method outperforms the previous state of the art on the aspect extraction sentiment classification (AESC) task and is, additionally, capable of performing the harder aspect sentiment triplet extraction (ASTE) task.

pdf bib
Counterfactual Reasoning with Knowledge Graph Embeddings
Lena Zellinger | Andreas Stephan | Benjamin Roth

Knowledge graph embeddings (KGEs) were originally developed to infer true but missing facts in incomplete knowledge repositories.In this paper, we link knowledge graph completion and counterfactual reasoning via our new task CFKGR. We model the original world state as a knowledge graph, hypothetical scenarios as edges added to the graph, and plausible changes to the graph as inferences from logical rules. We create corresponding benchmark datasets, which contain diverse hypothetical scenarios with plausible changes to the original knowledge graph and facts that should be retained. We develop COULDD, a general method for adapting existing knowledge graph embeddings given a hypothetical premise, and evaluate it on our benchmark. Our results indicate that KGEs learn patterns in the graph without explicit training. We further observe that KGEs adapted with COULDD solidly detect plausible counterfactual changes to the graph that follow these patterns. An evaluation on human-annotated data reveals that KGEs adapted with COULDD are mostly unable to recognize changes to the graph that do not follow learned inference rules. In contrast, ChatGPT mostly outperforms KGEs in detecting plausible changes to the graph but has poor knowledge retention. In summary, CFKGR connects two previously distinct areas, namely KG completion and counterfactual reasoning.

pdf bib
System-Level Natural Language Feedback
Weizhe Yuan | Kyunghyun Cho | Jason Weston

Natural language (NL) feedback offers rich insights into user experience. While existing studies focus on an instance-level approach, where feedback is used to refine specific examples, we introduce a framework for system-level use of NL feedback. We show how to use feedback to formalize system-level design decisions in a human-in-the-loop-process – in order to produce better models. In particular this is done through: (i) metric design for tasks; and (ii) language model prompt design for refining model responses. We conduct two case studies of this approach for improving search query and dialog response generation, demonstrating the effectiveness of system-level feedback. We show the combination of system-level and instance-level feedback brings further gains, and that human written instance-level feedback results in more grounded refinements than GPT-3.5 written ones, underlying the importance of human feedback for building systems.

pdf bib
Syntactic Preposing and Discourse Relations
Yunfang Dong | Xixian Liao | Bonnie Webber

Over 15 years ago, Ward & Birner (2006) suggested that non-canonical constructions in English can serve both to mark information status and to structure the information flow of discourse. One such construction is preposing, where a phrasal constituent appears to the left of its canonical position, typically sentence-initially. But computational work on discourse has, to date, ignored non-canonical syntax. We take account of non-canonical syntax by providing quantitative evidence relating NP/PP preposing to discourse relations. The evidence comes from an LLM mask-filling task that compares the predictions when a mask is inserted between the arguments of an implicit inter-sentential discourse relation — first, when the right-hand argument (Arg2) starts with a preposed constituent, and again, when that constituent is in canonical (post-verbal) position. Results show that (1) the top-ranked mask-fillers in the preposed case agree more often with “gold” annotations in the Penn Discourse TreeBank than they do in the latter case, and (2) preposing in Arg2 can affect the distribution of discourse-relational senses.

pdf bib
Can we obtain significant success in RST discourse parsing by using Large Language Models?
Aru Maekawa | Tsutomu Hirao | Hidetaka Kamigaito | Manabu Okumura

Recently, decoder-only pre-trained large language models (LLMs), with several tens of billion parameters, have significantly impacted a wide range of natural language processing (NLP) tasks. While encoder-only or encoder-decoder pre-trained language models have already proved to be effective in discourse parsing, the extent to which LLMs can perform this task remains an open research question. Therefore, this paper explores how beneficial such LLMs are for Rhetorical Structure Theory (RST) discourse parsing. Here, the parsing process for both fundamental top-down and bottom-up strategies is converted into prompts, which LLMs can work with. We employ Llama 2 and fine-tune it with QLoRA, which has fewer parameters that can be tuned. Experimental results on three benchmark datasets, RST-DT, Instr-DT, and the GUM corpus, demonstrate that Llama 2 with 70 billion parameters in the bottom-up strategy obtained state-of-the-art (SOTA) results with significant differences. Furthermore, our parsers demonstrated generalizability when evaluated on RST-DT, showing that, in spite of being trained with the GUM corpus, it obtained similar performances to those of existing parsers trained with RST-DT.

pdf bib
Ameli: Enhancing Multimodal Entity Linking with Fine-Grained Attributes
Barry Yao | Sijia Wang | Yu Chen | Qifan Wang | Minqian Liu | Zhiyang Xu | Licheng Yu | Lifu Huang

We propose attribute-aware multimodal entity linking, where the input consists of a mention described with a text paragraph and images, and the goal is to predict the corresponding target entity from a multimodal knowledge base (KB) where each entity is also accompanied by a text description, visual images, and a collection of attributes that present the meta-information of the entity in a structured format. To facilitate this research endeavor, we construct Ameli, encompassing a new multimodal entity linking benchmark dataset that contains 16,735 mentions described in text and associated with 30,472 images, and a multimodal knowledge base that covers 34,690 entities along with 177,873 entity images and 798,216 attributes. To establish baseline performance on Ameli, we experiment with several state-of-the-art architectures for multimodal entity linking and further propose a new approach that incorporates attributes of entities into disambiguation. Experimental results and extensive qualitative analysis demonstrate that extracting and understanding the attributes of mentions from their text descriptions and visual images play a vital role in multimodal entity linking. To the best of our knowledge, we are the first to integrate attributes in the multimodal entity linking task. The programs, model checkpoints, and the dataset are publicly available at

pdf bib
Generative Dense Retrieval: Memory Can Be a Burden
Peiwen Yuan | Xinglin Wang | Shaoxiong Feng | Boyuan Pan | Yiwei Li | Heda Wang | Xupeng Miao | Kan Li

Generative Retrieval (GR), autoregressively decoding relevant document identifiers given a query, has been shown to perform well under the setting of small-scale corpora. By memorizing the document corpus with model parameters, GR implicitly achieves deep interaction between query and document. However, such a memorizing mechanism faces three drawbacks: (1) Poor memory accuracy for fine-grained features of documents; (2) Memory confusion gets worse as the corpus size increases; (3) Huge memory update costs for new documents. To alleviate these problems, we propose the Generative Dense Retrieval (GDR) paradigm. Specifically, GDR first uses the limited memory volume to achieve inter-cluster matching from query to relevant document clusters. Memorizing-free matching mechanism from Dense Retrieval (DR) is then introduced to conduct fine-grained intra-cluster matching from clusters to relevant documents. The coarse-to-fine process maximizes the advantages of GR’s deep interaction and DR’s scalability. Besides, we design a cluster identifier constructing strategy to facilitate corpus memory and a cluster-adaptive negative sampling strategy to enhance the intra-cluster mapping ability. Empirical results show that GDR obtains an average of 3.0 R@100 improvement on NQ dataset under multiple settings and has better scalability.

pdf bib
Backward Compatibility During Data Updates by Weight Interpolation
Raphael Schumann | Elman Mansimov | Yi-An Lai | Nikolaos Pappas | Xibin Gao | Yi Zhang

Backward compatibility of model predictions is a desired property when updating a machine learning driven application. It allows to seamlessly improve the underlying model without introducing regression bugs. In classification tasks these bugs occur in the form of negative flips. This means an instance that was correctly classified by the old model is now classified incorrectly by the updated model. This has direct negative impact on the user experience of such systems e.g. a frequently used voice assistant query is suddenly misclassified.A common reason to update the model is when new training data becomes available and needs to be incorporated. Simply retraining the model with the updated data introduces the unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new model and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straight forward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.

pdf bib
Gradient-Based Language Model Red Teaming
Nevan Wichers | Carson Denison | Ahmad Beirami

Red teaming is a common strategy for identifying weaknesses in generative language models (LMs) by producing adversarial prompts that trigger models to generate unsafe responses. Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a novel red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by scoring an LM response with a safety classifier and then backpropagating through the frozen safety classifier and LM to update the prompt. To improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. Our experiments show that GBRT is more effective at finding prompts that trigger an LM to generate unsafe responses than a strong reinforcement learning-based red teaming approach and works even when the LM has been fine-tuned to produce safer outputs.

pdf bib
Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test
Aditi Khandelwal | Utkarsh Agarwal | Kumar Tanmay | Monojit Choudhury

This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well known fact that moral judgment depends on the language in which the question is asked. We extend the work of beyond English, to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs – ChatGPT, GPT-4 and Llama2Chat-70B – that shows substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability for all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili, compared to Spanish, Russian, Chinese and English, while there is no clear trend for the performance of the latter four languages. The moral judgments too vary considerably by the language.

pdf bib
Analyzing the Evaluation of Cross-Lingual Knowledge Transfer in Multilingual Language Models
Sara Rajaee | Christof Monz

Recent advances in training multilingual language models on large datasets seem to have shown promising results in knowledge transfer across languages and achieve high performance on downstream tasks. However, we question to what extent the current evaluation benchmarks and setups accurately measure zero-shot cross-lingual knowledge transfer. In this work, we challenge the assumption that high zero-shot performance on target tasks reflects high cross-lingual ability by introducing more challenging setups involving instances with multiple languages. Through extensive experiments and analysis, we show that the observed high performance of multilingual models can be largely attributed to factors not requiring the transfer of actual linguistic knowledge, such as task- and surface-level knowledge. More specifically, we observe what has been transferred across languages is mostly data artifacts and biases, especially for low-resource languages. Our findings highlight the overlooked drawbacks of existing cross-lingual test data and evaluation setups, calling for a more nuanced understanding of the cross-lingual capabilities of multilingual models.

pdf bib
Large-Scale Label Interpretation Learning for Few-Shot Named Entity Recognition
Jonas Golde | Felix Hamborg | Alan Akbik

Few-shot named entity recognition (NER) detects named entities within text using only a few annotated examples. One promising line of research is to leverage natural language descriptions of each entity type: the common label PER might, for example, be verbalized as ”person entity.” In an initial label interpretation learning phase, the model learns to interpret such verbalized descriptions of entity types. In a subsequent few-shot tagset extension phase, this model is then given a description of a previously unseen entity type (such as ”music album”) and optionally a few training examples to perform few-shot NER for this type. In this paper, we systematically explore the impact of a strong semantic prior to interpret verbalizations of new entity types by massively scaling up the number and granularity of entity types used for label interpretation learning. To this end, we leverage an entity linking benchmark to create a dataset with orders of magnitude of more distinct entity types and descriptions as currently used datasets. We find that this increased signal yields strong results in zero- and few-shot NER in in-domain, cross-domain, and even cross-lingual settings. Our findings indicate significant potential for improving few-shot NER through heuristical data-based optimization.

pdf bib
MLCopilot: Unleashing the Power of Large Language Models in Solving Machine Learning Tasks
Lei Zhang | Yuge Zhang | Kan Ren | Dongsheng Li | Yuqing Yang

The field of machine learning (ML) has gained widespread adoption, leading to significant demand for adapting ML to specific scenarios, which is yet expensive and non-trivial. The predominant approaches towards the automation of solving ML tasks (e.g., AutoML) are often time-consuming and hard to understand for human developers. In contrast, though human engineers have the incredible ability to understand tasks and reason about solutions, their experience and knowledge are often sparse and difficult to utilize by quantitative approaches. In this paper, we aim to bridge the gap between machine intelligence and human knowledge by introducing a novel framework MLCopilot, which leverages the state-of-the-art large language models to develop ML solutions for novel tasks. We showcase the possibility of extending the capability of LLMs to comprehend structured inputs and perform thorough reasoning for solving novel ML tasks. And we find that, after some dedicated design, the LLM can (i) observe from the existing experiences of ML tasks and (ii) reason effectively to deliver promising results for new tasks. The solution generated can be used directly to achieve high levels of competitiveness.

pdf bib
Text-Guided Image Clustering
Andreas Stephan | Lukas Miklautz | Kevin Sidak | Jan Philip Wahle | Bela Gipp | Claudia Plant | Benjamin Roth

Image clustering divides a collection of images into meaningful groups, typically interpreted post-hoc via human-given annotations. Those are usually in the form of text, begging the question of using text as an abstraction for image clustering. Current image clustering methods, however, neglect the use of generated textual descriptions. We, therefore, propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models and subsequently clustering the generated text. Further, we introduce a novel approach to inject task- or domain knowledge for clustering by prompting VQA models. Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features. Additionally, we propose a counting-based cluster explainability method. Our evaluations show that the derived keyword-based explanations describe clusters better than the respective cluster accuracy suggests. Overall, this research challenges traditional approaches and paves the way for a paradigm shift in image clustering, using generated text.

pdf bib
CCPrefix: Counterfactual Contrastive Prefix-Tuning for Many-Class Classification
Yang Li | Canran Xu | Guodong Long | Tao Shen | Chongyang Tao | Jing Jiang

Recently, prefix-tuning was proposed to efficiently adapt pre-trained language models to a broad spectrum of natural language classification tasks. It leverages soft prefix as task-specific indicators and language verbalizers as categorical-label mentions to narrow the formulation gap from pre-training language models. However, when the label space increases considerably (i.e., many-class classification), such a tuning technique suffers from a verbalizer ambiguity problem since the many-class labels are represented by semantic-similar verbalizers in short language phrases. To overcome this, inspired by the human-decision process that the most ambiguous classes would be mulled over for an instance, we propose a brand-new prefix-tuning method, Counterfactual Contrastive Prefix-tuning (CCPrefix), for many-class classification. Basically, an instance-dependent soft prefix, derived from fact-counterfactual pairs in the label space, is leveraged to complement the language verbalizers in many-class classification. We conduct experiments on many-class benchmark datasets in both the fully supervised setting and the few-shot setting, which indicates that our model outperforms former baselines.


pdf (full)
bib (full)
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Yvette Graham | Matthew Purver

pdf bib
French GossipPrompts: Dataset For Prevention of Generating French Gossip Stories By LLMs
Msvpj Sathvik | Abhilash Dowpati | Revanth Narra

The realm of Large Language Models (LLMs) is undergoing a continuous and dynamic transformation. These state-of-the-art LLMs showcase an impressive ability to craft narratives based on contextual cues, highlighting their skill in comprehending and producing text resembling human writing. However, there exists a potential risk: the potential inclination of LLMs to create gossips when prompted with specific contexts. These LLMs possess the capacity to generate stories rooted in the context provided by the prompts. Yet, this very capability carries a risk of generating gossips given the context as input. To mitigate this, we introduce a dataset named “French GossipPrompts” designed for identifying prompts that lead to the creation of gossipy content in the French language. This dataset employs binary classification, categorizing whether a given prompt generates gossip or not. The dataset comprises a total of 7253 individual prompts. We have developed classification models and achieved an accuracy of 89.95%.

pdf bib
More Discriminative Sentence Embeddings via Semantic Graph Smoothing
Chakib Fettal | Lazhar Labiod | Mohamed Nadif

This paper explores an empirical approach to learn more discriminantive sentence representations in an unsupervised fashion. Leveraging semantic graph smoothing, we enhance sentence embeddings obtained from pretrained models to improve results for the text clustering and classification tasks. Our method, validated on eight benchmarks, demonstrates consistent improvements, showcasing the potential of semantic graph smoothing in improving sentence embeddings for the supervised and unsupervised document categorization tasks.

pdf bib
Multi-Level Attention Aggregation for Language-Agnostic Speaker Replication
Yejin Jeon | Gary Lee

This paper explores the task of language-agnostic speaker replication, a novel endeavor that seeks to replicate a speaker’s voice irrespective of the language they are speaking. Towards this end, we introduce a multi-level attention aggregation approach that systematically probes and amplifies various speaker-specific attributes in a hierarchical manner. Through rigorous evaluations across a wide range of scenarios including seen and unseen speakers conversing in seen and unseen lingua, we establish that our proposed model is able to achieve substantial speaker similarity, and is able to generalize to out-of-domain (OOD) cases.

pdf bib
Mitigating Hallucinations and Off-target Machine Translation with Source-Contrastive and Language-Contrastive Decoding
Rico Sennrich | Jannis Vamvas | Alireza Mohammadshahi

Hallucinations and off-target translation remain unsolved problems in MT, especially for low-resource languages and massively multilingual models. In this paper, we introduce two related methods to mitigate these failure cases with a modified decoding objective, without either requiring retraining or external models. In source-contrastive decoding, we search for a translation that is probable given the correct input, but improbable given a random input segment. In language-contrastive decoding, we search for a translation that is probable, but improbable given the wrong language indicator token. Experiments on the massively multilingual models M2M-100 (418M) and SMaLL-100 show that these methods suppress hallucinations and off-target translations, reducing the number of translations with segment-level chrF2 below 10 by 67-83% on average across 57 tested translation directions. In a proof of concept on out-of-English translation, we also show that we can suppress off-target translations with large language models. We release code upon acceptance.

pdf bib
Injecting Wiktionary to improve token-level contextual representations using contrastive learning
Anna Mosolova | Marie Candito | Carlos Ramisch

While static word embeddings are blind to context, for lexical semantics tasks context is rather too present in contextual word embeddings, vectors of same-meaning occurrences being too different (Ethayarajh, 2019). Fine-tuning pre-trained language models (PLMs) using contrastive learning was proposed, leveraging automatically self-augmented examples (Liu et al., 2021b). In this paper, we investigate how to inject a lexicon as an alternative source of supervision, using the English Wiktionary. We also test how dimensionality reduction impacts the resulting contextual word embeddings. We evaluate our approach on the Word-In-Context (WiC) task, in the unsupervised setting (not using the training set). We achieve new SoTA result on the original WiC test set. We also propose two new WiC test sets for which we show that our fine-tuning method achieves substantial improvements. We also observe improvements, although modest, for the semantic frame induction task. Although we experimented on English to allow comparison with related work, our method is adaptable to the many languages for which large Wiktionaries exist.

pdf bib
Multilingual Gradient Word-Order Typology from Universal Dependencies
Emi Baylor | Esther Ploeger | Johannes Bjerva

While information from the field of linguistic typology has the potential to improve performance on NLP tasks, reliable typological data is a prerequisite. Existing typological databases, including WALS and Grambank, suffer from inconsistencies primarily caused by their categorical format. Furthermore, typological categorisations by definition differ significantly from the continuous nature of phenomena, as found in natural language corpora. In this paper, we introduce a new seed dataset made up of continuous-valued data, rather than categorical data, that can better reflect the variability of language. While this initial dataset focuses on word-order typology, we also present the methodology used to create the dataset, which can be easily adapted to generate data for a broader set of features and languages.

pdf bib
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains
Sanjana Ramprasad | Kundan Krishna | Zachary Lipton | Byron Wallace

Recent work has shown that large language models (LLMs) are capable of generating summaries zero-shot—i.e., without explicit supervision—that, under human assessment, are often comparable or even preferred to manually composed reference summaries. However, this prior work has focussed almost exclusively on evaluating news article summarization. How do zero-shot summarizers perform in other (potentially more specialized) domains?In this work we evaluate zero-shot generated summaries across specialized domains including: biomedical articles, and legal bills (in addition to standard news benchmarks for reference). We focus especially on the factuality of outputs. We acquire annotations from domain experts to identify inconsistencies in summaries and systematically categorize these errors. We analyze whether the prevalence of a given domain in the pretraining corpus affects extractiveness and faithfulness of generated summaries of articles in this domain. We release all collected annotations to facilitate additional research toward measuring and realizing factually accurate summarization, beyond news articles (The dataset can be downloaded from

pdf bib
Leveraging Implicit Feedback from Deployment Data in Dialogue
Richard Yuanzhe Pang | Stephen Roller | Kyunghyun Cho | He He | Jason Weston

We study improving social conversational agents by learning from natural dialogue between users and a deployed model, without extra annotations. To implicitly measure the quality of a machine-generated utterance, we leverage signals like user response length, sentiment and reaction of the future human utterances in the collected dialogue episodes. Our experiments use the publicly released deployment data from BlenderBot (Xu et al., 2023). Human evaluation indicates improvements in our new models over baseline responses; however, we find that some proxy signals can lead to more generations with undesirable properties as well. For example, optimizing for conversation length can lead to more controversial or unfriendly generations compared to the baseline, whereas optimizing for positive sentiment or reaction can decrease these behaviors.

pdf bib
Characterizing the Confidence of Large Language Model-Based Automatic Evaluation Metrics
Rickard Stureborg | Dimitris Alikaniotis | Yoshi Suhara

There has recently been a growing interest in using Large Language Models (LLMs) to evaluate NLP tasks automatically. Considerable research effort has been put into improving such systems towards achieving high correlations with human judgement. However, it is still unclear what level of correlation is good enough for practical applications of LLM-based automatic evaluation systems. This paper characterizes these LLM evaluators’ confidence in ranking candidate NLP models and develops a configurable Monte Carlo simulation method. We show that even automatic metrics with low correlation with human judgement can reach high-confidence rankings of candidate models with reasonable evaluation set sizes (100s of examples). Further, we describe tradeoff curves between the LLM evaluator performance (i.e., correlation with humans) and evaluation set size; loss in correlation can be compensated with modest increases in the evaluation set size. We validate our results on RoSE, a text summarization dataset, and find our estimates of confidence align with empirical observations.Code available at

pdf bib
Equipping Language Models with Tool Use Capability for Tabular Data Analysis in Finance
Adrian Theuma | Ehsan Shareghi

Large language models (LLMs) have exhibited an array of reasoning capabilities but face challenges like error propagation and hallucination, particularly in specialised areas like finance, where data is heterogeneous, and precision is paramount. We explore the potential of language model augmentation with external tools to mitigate these limitations and offload certain reasoning steps to external tools that are more suited for the task, instead of solely depending on the LLM’s inherent abilities. More concretely, using financial domain question answering datasets, we apply supervised finetuning on a LLAMA-2 13B CHAT model to act both as a task router and task solver. The task router dynamically directs a question to either be answered internally by the LLM or externally via the right tool from the tool set. Our tool-equipped SFT model, RAVEN, demonstrates an improvement of 35.2% and 5.06% over the base model and SFT-only baselines, respectively, and is highly competitive with strong GPT-3.5 results. To the best of our knowledge, our work is the first that investigates tool augmentation of language models for the finance domain.

pdf bib
Commonsense-augmented Memory Construction and Management in Long-term Conversations via Context-aware Persona Refinement
Hana Kim | Kai Ong | Seoyeon Kim | Dongha Lee | Jinyoung Yeo

Memorizing and utilizing speakers’ personas is a common practice for response generation in long-term conversations. Yet, human-authored datasets often provide uninformative persona sentences that hinder response quality. This paper presents a novel framework that leverages commonsense-based persona expansion to address such issues in long-term conversation.While prior work focuses on not producing personas that contradict others, we focus on transforming contradictory personas into sentences that contain rich speaker information, by refining them based on their contextual backgrounds with designed strategies. As the pioneer of persona expansion in multi-session settings, our framework facilitates better response generation via human-like persona refinement. The supplementary video of our work is available at

pdf bib
Investigating the Potential of Task Arithmetic for Cross-Lingual Transfer
Marinela Parović | Ivan Vulić | Anna Korhonen

Cross-lingual transfer has recently been tackled through modular, parameter-efficient fine-tuning methods which allow arbitrary combinations of language and task modules for transfer of any task to any language. Concurrently, task arithmetic has emerged as a powerful and modular tool for editing pretrained models using multiple full fine-tunings. In this work, we connect the paradigms of task arithmetic and cross-lingual transfer, demonstrating that modularity for cross-lingual transfer can be achieved even with full model fine-tuning. Our approach displays strong performance on a range of multilingual benchmarks encompassing both high-resource and low-resource languages.

pdf bib
On the Benefits of Fine-Grained Loss Truncation: A Case Study on Factuality in Summarization
Lorenzo Jaime Flores | Arman Cohan

Text summarization and simplification are among the most widely used applications of AI. However, such models are often prone to hallucination, which can result from training models on unaligned data. One efficient approach to address this issue is Loss Truncation (Kang and Hashimoto, 2020), an approach to modify the standard log loss to adaptively remove noisy examples during training. However, we find that LT alone yields a considerable number of hallucinated entities on various datasets. We study the behavior of the underlying losses between factual and non-factual examples, to understand and refine the performance of LT. We demonstrate that LT’s performance is limited when the underlying assumption that noisy targets have higher NLL loss is not satisfied, and find that word-level NLL among entities provides better signal for distinguishing factuality. We then leverage this to propose a fine-grained NLL loss and fine-grained data cleaning strategies, and observe improvements in hallucination reduction across some datasets. Our work is available at

pdf bib
Evaluating Unsupervised Argument Aligners via Generation of Conclusions of Structured Scientific Abstracts
Yingqiang Gao | Nianlong Gu | Jessica Lam | James Henderson | Richard Hahnloser

Scientific abstracts provide a concise summary of research findings, making them a valuable resource for extracting scientific arguments. In this study, we assess various unsupervised approaches for extracting arguments as aligned premise-conclusion pairs: semantic similarity, text perplexity, and mutual information. We aggregate structured abstracts from PubMed Central Open Access papers published in 2022 and evaluate the argument aligners in terms of the performance of language models that we fine-tune to generate the conclusions from the extracted premise given as input prompts. We find that mutual information outperforms the other measures on this task, suggesting that the reasoning process in scientific abstracts hinges mostly on linguistic constructs beyond simple textual similarity.

pdf bib
Over-Reasoning and Redundant Calculation of Large Language Models
Cheng-Han Chiang | Hung-yi Lee

Large language models (LLMs) can solve problems step-by-step.While this chain-of-thought (CoT) reasoning boosts LLMs’ performance, it is unclear if LLMs know when to use CoT and whether those CoT are always necessary to answer the question. This paper shows that LLMs tend to generate redundant calculations and reasoning on a manually constructed math QA dataset, GSM8K-Zero.GSM8K-Zero is constructed such that the questions can be answered without any calculations, but LLMs, including Llama-2 models and Claude-2, tend to generate lengthy and unnecessary calculations to answer the questions.We also conduct experiments to explain why LLMs generate redundant calculations and reasonings.

pdf bib
Multimodal Fallacy Classification in Political Debates
Eleonora Mancini | Federico Ruggeri | Paolo Torroni

Recent advances in NLP suggest that some tasks, such as argument detection and relation classification, are better framed in a multimodal perspective. We propose multimodal argument mining for argumentative fallacy classification in political debates. To this end, we release the first corpus for multimodal fallacy classification. Our experiments show that the integration of the audio modality leads to superior classification performance. Our findings confirm that framing fallacy classification as a multimodal task is essential to capture paralinguistic aspects of fallacious arguments.

pdf bib
The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks
Anders Giovanni Møller | Arianna Pera | Jacob Dalsgaard | Luca Aiello

In the realm of Computational Social Science (CSS), practitioners often navigate complex, low-resource domains and face the costly and time-intensive challenges of acquiring and annotating data. We aim to establish a set of guidelines to address such challenges, comparing the use of human-labeled data with synthetically generated data from GPT-4 and Llama-2 in ten distinct CSS classification tasks of varying complexity. Additionally, we examine the impact of training data sizes on performance. Our findings reveal that models trained on human-labeled data consistently exhibit superior or comparable performance compared to their synthetically augmented counterparts. Nevertheless, synthetic augmentation proves beneficial, particularly in improving performance on rare classes within multi-class tasks. Furthermore, we leverage GPT-4 and Llama-2 for zero-shot classification and find that, while they generally display strong performance, they often fall short when compared to specialized classifiers trained on moderately sized training sets.

pdf bib
Language Model Sentence Completion with a Parser-Driven Rhetorical Control Method
Joshua Zingale | Jugal Kalita

Controlled text generation (CTG) seeks to guide large language model (LLM) output, that statistical language generation would conform to desired criteria. The current study presents a novel CTG algorithm that enforces adherence toward specific rhetorical relations in an LLM sentence-completion context by a parser-driven decoding scheme that requires no model fine-tuning. The method is validated both with automatic and human evaluation.

pdf bib
”It’s how you do things that matters”: Attending to Process to Better Serve Indigenous Communities with Language Technologies
Ned Cooper | Courtney Heldreth | Ben Hutchinson

Indigenous languages are historically under-served by Natural Language Processing (NLP) technologies, but this is changing for some languages with the recent scaling of large multilingual models and an increased focus by the NLP community on endangered languages. This position paper explores ethical considerations in building NLP technologies for Indigenous languages, based on the premise that such projects should primarily serve Indigenous communities. We report on interviews with 17 researchers working in or with Aboriginal and/or Torres Strait Islander communities on language technology projects in Australia. Drawing on insights from the interviews, we recommend practices for NLP researchers to increase attention to the process of engagements with Indigenous communities, rather than focusing only on decontextualised artefacts.

pdf bib
Source Identification in Abstractive Summarization
Yoshi Suhara | Dimitris Alikaniotis

Neural abstractive summarization models make summaries in an end-to-end manner, and little is known about how the source information is actually converted into summaries. In this paper, we define input sentences that contain essential information in the generated summary as source sentences and study how abstractive summaries are made by analyzing the source sentences. To this end, we annotate source sentences for reference summaries and system summaries generated by PEGASUS on document-summary pairs sampled from the CNN/DailyMail and XSum datasets. We also formulate automatic source sentence detection and compare multiple methods to establish a strong baseline for the task. Experimental results show that the perplexity-based method performs well in highly abstractive settings, while similarity-based methods perform robustly in relatively extractive settings.

pdf bib
From Partial to Strictly Incremental Constituent Parsing
Ana Ezquerro | Carlos Gómez-Rodríguez | David Vilares

We study incremental constituent parsers to assess their capacity to output trees based on prefix representations alone. Guided by strictly left-to-right generative language models and tree-decoding modules, we build parsers that adhere to a strong definition of incrementality across languages. This builds upon work that asserted incrementality, but that mostly only enforced it on either the encoder or the decoder. Finally, we conduct an analysis against non-incremental and partially incremental models.

pdf bib
Predict the Next Word: <Humans exhibit uncertainty in this task and language models _____>
Evgenia Ilia | Wilker Aziz

Language models (LMs) are statistical models trained to assign probability to human-generated text. As such, it is reasonable to question whether they approximate linguistic variability exhibited by humans well. This form of statistical assessment is difficult to perform at the passage level, for it requires acceptability judgments (i.e., human evaluation) or a robust automated proxy (which is non-trivial). At the word level, however, given some context, samples from an LM can be assessed via exact matching against a prerecorded dataset of alternative single-word continuations of the available context. We exploit this fact and evaluate the LM’s ability to reproduce variability that humans (in particular, a population of English speakers) exhibit in the ‘next word prediction’ task. This can be seen as assessing a form of calibration, which, in the context of text classification, Baan et al. (2022) termed calibration to human uncertainty. We assess GPT2, BLOOM and ChatGPT and find that they exhibit fairly low calibration to human uncertainty. We also verify the failure of expected calibration error (ECE) to reflect this, and as such, advise the community against relying on it in this setting.

pdf bib
A Prompt Response to the Demand for Automatic Gender-Neutral Translation
Beatrice Savoldi | Andrea Piergentili | Dennis Fucci | Matteo Negri | Luisa Bentivogli

Gender-neutral translation (GNT) that avoids biased and undue binary assumptions is a pivotal challenge for the creation of more inclusive translation technologies. Advancements for this task in Machine Translation (MT), however, are hindered by the lack of dedicated parallel data, which are necessary to adapt MT systems to satisfy neutral constraints. For such a scenario, large language models offer hitherto unforeseen possibilities, as they come with the distinct advantage of being versatile in various (sub)tasks when provided with explicit instructions. In this paper, we explore this potential to automate GNT by comparing MT with the popular GPT-4 model. Through extensive manual analyses, our study empirically reveals the inherent limitations of current MT systems in generating GNTs and provides valuable insights into the potential and challenges associated with prompting for neutrality.

pdf bib
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
Joris Baan | Raquel Fernández | Barbara Plank | Wilker Aziz

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good _representation of uncertainty_ by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

pdf bib
Smaller Language Models are Better Zero-shot Machine-Generated Text Detectors
Niloofar Mireshghallah | Justus Mattern | Sicun Gao | Reza Shokri | Taylor Berg-Kirkpatrick

As large language models are becoming more embedded in different user-facing services, it is important to be able to distinguish between human-written and machine-generated text to verify the authenticity of news articles, product reviews, etc. Thus, in this paper we set out to explore whether it is possible to use one language model to identify machine-generated text produced by another language model, in a zero-shot way, even if the two have different architectures and are trained on different data. We find that overall, smaller models are better universal machine-generated text detectors: they can more precisely detect text generated from both small and larger models, without the need for any additional training/data. Interestingly, we find that whether or not the detector and generator models were trained on the same data is not critically important to the detection success. For instance the OPT-125M model has an AUC of 0.90 in detecting GPT4 generations, whereas a larger model from the GPT family, GPTJ-6B, has AUC of 0.65.

pdf bib
CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
Kaushal Maurya | Rahul Kejriwal | Maunendra Desarkar | Anoop Kunchukuttan

We address the task of machine translation (MT) from extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from *closely-related* high-resource language (HRL). The development of an MT system for ELRL is challenging because these languages typically lack parallel corpora and monolingual corpora, and their representations are absent from large multilingual language models. Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity. However, existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align HRL and ELRL latent embedding space. To overcome this limitation, we propose a novel, CharSpan, approach based on character-span noise augmentation into the training data of HRL. This serves as a regularization technique, making the model more robust to lexical divergences between the HRL and ELRL, thus facilitating effective cross-lingual transfer. Our method significantly outperformed strong baselines in zero-shot settings on closely related HRL and ELRL pairs from three diverse language families, emerging as the state-of-the-art model for ELRLs.

pdf bib
Robust Neural Machine Translation for Abugidas by Glyph Perturbation
Hour Kaing | Chenchen Ding | Hideki Tanaka | Masao Utiyama

Neural machine translation (NMT) systems are vulnerable when trained on limited data. This is a common scenario in low-resource tasks in the real world. To increase robustness, a solution is to intently add realistic noise in the training phase. Noise simulation using text perturbation has been proven to be efficient in writing systems that use Latin letters. In this study, we further explore perturbation techniques on more complex abugida writing systems, for which the visual similarity of complex glyphs is considered to capture the essential nature of these writing systems. Besides the generated noise, we propose a training strategy to improve robustness. We conducted experiments on six languages: Bengali, Hindi, Myanmar, Khmer, Lao, and Thai. By overcoming the introduced noise, we obtained non-degenerate NMT systems with improved robustness for low-resource tasks for abugida glyphs.

pdf bib
Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning
Ashish Agrawal | Barah Fazili | Preethi Jyothi

Popular benchmarks (e.g., XNLI) used to evaluate cross-lingual language understanding consist of parallel versions of English evaluation sets in multiple target languages created with the help of professional translators. When creating such parallel data, it is critical to ensure high-quality translations for all target languages for an accurate characterization of cross-lingual transfer. In this work, we find that translation inconsistencies do exist and interestingly they disproportionally impact low-resource languages in XNLI. To identify such inconsistencies, we propose measuring the gap in performance between zero-shot evaluations on the human-translated and machine-translated target text across multiple target languages; relatively large gaps are indicative of translation errors. We also corroborate that translation errors exist for two target languages, namely Hindi and Urdu, by doing a manual reannotation of human-translated test instances in these two languages and finding poor agreement with the original English labels these instances were supposed to inherit.

pdf bib
Less is More for Long Document Summary Evaluation by LLMs
Yunshu Wu | Hayate Iso | Pouya Pezeshkpour | Nikita Bhutani | Estevam Hruschka

Large Language Models (LLMs) have shown promising performance in summary evaluation tasks, yet they face challenges such as high computational costs and the Lost-in-the-Middle problem where important information in the middle of long documents is often overlooked. To address these issues, this paper introduces a novel approach, Extract-then-Evaluate, which involves extracting key sentences from a long source document and then evaluating the summary by prompting LLMs. The results reveal that the proposed method not only significantly reduces evaluation costs but also exhibits a higher correlation with human evaluations. Furthermore, we provide practical recommendations for optimal document length and sentence extraction methods, contributing to the development of cost-effective yet more accurate methods for LLM-based text generation evaluation.

pdf bib
Leveraging ChatGPT in Pharmacovigilance Event Extraction: An Empirical Study
Zhaoyue Sun | Gabriele Pergola | Byron Wallace | Yulan He

With the advent of large language models (LLMs), there has been growing interest in exploring their potential for medical applications. This research aims to investigate the ability of LLMs, specifically ChatGPT, in the context of pharmacovigilance event extraction, of which the main goal is to identify and extract adverse events or potential therapeutic events from textual medical sources. We conduct extensive experiments to assess the performance of ChatGPT in the pharmacovigilance event extraction task, employing various prompts and demonstration selection strategies. The findings demonstrate that while ChatGPT demonstrates reasonable performance with appropriate demonstration selection strategies, it still falls short compared to fully fine-tuned small models. Additionally, we explore the potential of leveraging ChatGPT for data augmentation. However, our investigation reveals that the inclusion of synthesized data into fine-tuning may lead to a decrease in performance, possibly attributed to noise in the ChatGPT-generated labels. To mitigate this, we explore different filtering strategies and find that, with the proper approach, more stable performance can be achieved, although constant improvement remains elusive.

pdf bib
A Comparative Analysis of Conversational Large Language Models in Knowledge-Based Text Generation
Phillip Schneider | Manuel Klettner | Elena Simperl | Florian Matthes

Generating natural language text from graph-structured data is essential for conversational information seeking. Semantic triples derived from knowledge graphs can serve as a valuable source for grounding responses from conversational agents by providing a factual basis for the information they communicate. This is especially relevant in the context of large language models, which offer great potential for conversational interaction but are prone to hallucinating, omitting, or producing conflicting information. In this study, we conduct an empirical analysis of conversational large language models in generating natural language text from semantic triples. We compare four large language models of varying sizes with different prompting techniques. Through a series of benchmark experiments on the WebNLG dataset, we analyze the models’ performance and identify the most common issues in the generated predictions. Our findings show that the capabilities of large language models in triple verbalization can be significantly improved through few-shot prompting, post-processing, and efficient fine-tuning techniques, particularly for smaller models that exhibit lower zero-shot performance.

pdf bib
Extreme Fine-tuning: A Novel and Fast Fine-tuning Approach for Text Classification
Boonnithi Jiaramaneepinit | Thodsaporn Chay-intr | Kotaro Funakoshi | Manabu Okumura

Although fine-tuning a pre-trained model with a conventional approach has shown to be effective in various downstream tasks, previous work has used only backpropagation to fine-tune the model, which causes a massive amount of computational resources and time. We propose Extreme Fine-Tuning (EFT), a novel approach for fine-tuning a pre-trained model effectively and efficiently. EFT uses backpropagation for a brief fine-tuning and an iterative extreme learning machine for training a classifier. We applied EFT to four text classification datasets, MELD, IEMOCAP, IMDb, and AG News, and compared its performance with state-of-the-art (SOTA) approaches. The results indicate that EFT noticeably outperformed the other approaches in training-time measurement with comparable model performance. We will release our code at

pdf bib
Flow Matching for Conditional Text Generation in a Few Sampling Steps
Vincent Hu | Di Wu | Yuki Asano | Pascal Mettes | Basura Fernando | Björn Ommer | Cees Snoek

Diffusion models are a promising tool for high-quality text generation. However, current models face multiple drawbacks including slow sampling, noise schedule sensitivity, and misalignment between the training and sampling stages. In this paper, we introduce FlowSeq, which bypasses all current drawbacks by leveraging flow matching for conditional text generation. FlowSeq can generate text in a few steps by training with a novel anchor loss, alleviating the need for expensive hyperparameter optimization of the noise schedule prevalent in diffusion models. We extensively evaluate our proposed method and show competitive performance in tasks such as question generation, open-domain dialogue, and paraphrasing tasks.

pdf bib
Corpus-Steered Query Expansion with Large Language Models
Yibin Lei | Yu Cao | Tianyi Zhou | Tao Shen | Andrew Yates

Recent studies demonstrate that query expansions generated by large language models (LLMs) can considerably enhance information retrieval systems by generating hypothetical documents that answer the queries as expansions. However, challenges arise from misalignments between the expansions and the retrieval corpus, resulting in issues like hallucinations and outdated information due to the limited intrinsic knowledge of LLMs. Inspired by Pseudo Relevance Feedback (PRF), we introduce Corpus-Steered Query Expansion (CSQE) to promote the incorporation of knowledge embedded within the corpus. CSQE utilizes the relevance assessing capability of LLMs to systematically identify pivotal sentences in the initially-retrieved documents. These corpus-originated texts are subsequently used to expand the query together with LLM-knowledge empowered expansions, improving the relevance prediction between the query and the target documents. Extensive experiments reveal that CSQE exhibits strong performance without necessitating any training, especially with queries for which LLMs lack knowledge.

pdf bib
Defending Against Disinformation Attacks in Open-Domain Question Answering
Orion Weller | Aleem Khan | Nathaniel Weir | Dawn Lawrie | Benjamin Van Durme

Recent work in open-domain question answering (ODQA) has shown that adversarial poisoning of the search collection can cause large drops in accuracy for production systems. However, little to no work has proposed methods to defend against these attacks. To do so, we rely on the intuition that redundant information often exists in large corpora. To find it, we introduce a method that uses query augmentation to search for a diverse set of passages that could answer the original question but are less likely to have been poisoned. We integrate these new passages into the model through the design of a novel confidence method, comparing the predicted answer to its appearance in the retrieved contexts (what we call Confidence from Answer Redundancy, i.e. CAR). Together these methods allow for a simple but effective way to defend against poisoning attacks that provides gains of nearly 20% exact match across varying levels of data poisoning/knowledge conflicts.

pdf bib
Sentence Representations via Gaussian Embedding
Shohei Yoda | Hayato Tsukagoshi | Ryohei Sasano | Koichi Takeda

Recent progress in sentence embedding, which represents a sentence’s meaning as a point in a vector space, has achieved high performance on several tasks such as the semantic textual similarity (STS) task.However, a sentence representation cannot adequately express the diverse information that sentences contain: for example, such representations cannot naturally handle asymmetric relationships between sentences.This paper proposes GaussCSE, a Gaussian-distribution-based contrastive learning framework for sentence embedding that can handle asymmetric inter-sentential relations, as well as a similarity measure for identifying entailment relations.Our experiments show that GaussCSE achieves performance comparable to that of previous methods on natural language inference (NLI) tasks, and that it can estimate the direction of entailment relations, which is difficult with point representations.

pdf bib
STORiCo: Storytelling TTS for Hindi with Character Voice Modulation
Pavan Tankala | Preethi Jyothi | Preeti Rao | Pushpak Bhattacharyya

We present a new Hindi text-to-speech (TTS) dataset and demonstrate its utility for the expressive synthesis of children’s audio stories. The dataset comprises narration by a single female speaker who modifies her voice to produce different story characters. Annotation for dialogue identification, character labelling, and character attribution are provided, all of which are expected to facilitate the learning of character voice and speaking styles. Experiments are conducted using different versions of the annotated dataset that enable training a multi-speaker TTS model on the single-speaker data. Subjective tests show that the multi-speaker model improves expressiveness and character voice consistency compared to the baseline single-speaker TTS. With the multi-speaker model, objective evaluations show comparable word error rates, better speaker voice consistency, and higher correlations with ground-truth emotion attributes. We release a new 16.8 hours storytelling speech dataset in Hindi and propose effective solutions for expressive TTS with narrator voice modulation and character voice consistency.

pdf bib
Rethinking Loss Functions for Fact Verification
Yuta Mukobara | Yutaro Shigeto | Masashi Shimbo

We explore loss functions for fact verification in the FEVER shared task. While the cross-entropy loss is a standard objective for training verdict predictors, it fails to capture the heterogeneity among the FEVER verdict classes. In this paper, we develop two task-specific objectives tailored to FEVER. Experimental results confirm that the proposed objective functions outperform the standard cross-entropy. Performance is further improved when these objectives are combined with simple class weighting, which effectively overcomes the imbalance in the training data. The source code is available (

pdf bib
A Dataset for Metaphor Detection in Early Medieval Hebrew Poetry
Michael Toker | Oren Mishali | Ophir Münz-Manor | Benny Kimelfeld | Yonatan Belinkov

There is a large volume of late antique and medieval Hebrew texts. They represent a crucial linguistic and cultural bridge between Biblical and modern Hebrew. Poetry is prominent in these texts and one of its main characteristics is the frequent use of metaphor. Distinguishing figurative and literal language use is a major task for scholars of the Humanities, especially in the fields of literature, linguistics, and hermeneutics. This paper presents a new, challenging dataset of late antique and medieval Hebrew poetry with expert annotations of metaphor, as well as some baseline results, which we hope will facilitate further research in this area.

pdf bib
SOCIALITE-LLAMA: An Instruction-Tuned Model for Social Scientific Tasks
Gourab Dey | Adithya V Ganesan | Yash Kumar Lal | Manal Shah | Shreyashee Sinha | Matthew Matero | Salvatore Giorgi | Vivek Kulkarni | H. Schwartz

Social science NLP tasks, such as emotion or humor detection, are required to capture the semantics along with the implicit pragmatics from text, often with limited amounts of training data. Instruction tuning has been shown to improve the many capabilities of large language models (LLMs) such as commonsense reasoning, reading comprehension, and computer programming. However, little is known about the effectiveness of instruction tuning on the social domain where implicit pragmatic cues are often needed to be captured. We explore the use of instruction tuning for social science NLP tasks and introduce Socialite-Llama — an open-source, instruction-tuned Llama. On a suite of 20 social science tasks, Socialite-Llama improves upon the performance of Llama as well as matches or improves upon the performance of a state-of-the-art, multi-task finetuned model on a majority of them. Further, Socialite-Llama also leads to improvement on 5 out of 6 related social tasks as compared to Llama, suggesting instruction tuning can lead to generalized social understanding. All resources including our code, model and dataset can be found through [](

pdf bib
Pre-Training Methods for Question Reranking
Stefano Campese | Ivano Lauriola | Alessandro Moschitti

One interesting approach to Question Answering (QA) is to search for semantically similar questions, which have been answered before. This task is different from answer retrieval as it focuses on questions rather than only on the answers, therefore it requires different model training on different data.In this work, we introduce a novel unsupervised pre-training method specialized for retrieving and ranking questions. This leverages (i) knowledge distillation from a basic question retrieval model, and (ii) new pre-training task and objective for learning to rank questions in terms of their relevance with the query. Our experiments show that (i) the proposed technique achieves state-of-the-art performance on QRC and Quora-match datasets, and (ii) the benefit of combining re-ranking and retrieval models.

pdf bib
Dynamic Masking Rate Schedules for MLM Pretraining
Zachary Ankner | Naomi Saphra | Davis Blalock | Jonathan Frankle | Matthew Leavitt

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model’s fixed masking rate of 15%. We propose to instead dynamically schedule the masking rate throughout training. We find that linearly decreasing the masking rate over the course of pretraining improves average GLUE accuracy by up to 0.46% and 0.25% in BERT-base and BERT-large, respectively, compared to fixed rate baselines. These gains come from exposure to both high and low masking rate regimes, providing benefits from both settings. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models, achieving up to a 1.89x speedup in pretraining for BERT-base as well as a Pareto improvement for BERT-large.


pdf (full)
bib (full)
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Nikolaos Aletras | Orphee De Clercq

pdf bib
TextBI: An Interactive Dashboard for Visualizing Multidimensional NLP Annotations in Social Media Data
Maxime Masson | Christian Sallaberry | Marie-Noelle Bessagnet | Annig Le Parc Lacayrelle | Philippe Roose | Rodrigo Agerri

In this paper we introduce TextBI, a multimodal generic dashboard designed to present multidimensional text annotations on large volumes of multilingual social media data. This tool focuses on four core dimensions: spatial, temporal, thematic, and personal, and also supports additional enrichment data such as sentiment and engagement. Multiple visualization modes are offered, including frequency, movement, and association. This dashboard addresses the challenge of facilitating the interpretation of NLP annotations by visualizing them in a user-friendly, interactive interface catering to two categories of users: (1) domain stakeholders and (2) NLP researchers. We conducted experiments within the domain of tourism leveraging data from X (formerly Twitter) and incorporating requirements from tourism offices. Our approach, TextBI, represents a significant advancement in the field of visualizing NLP annotations by integrating and blending features from a variety of Business Intelligence, Geographical Information Systems and NLP tools. A demonstration video is also provided

pdf bib
kNN-BOX: A Unified Framework for Nearest Neighbor Generation
Wenhao Zhu | Qianfeng Zhao | Yunzhe Lv | Shujian Huang | Siheng Zhao | Sizhe Liu | Jiajun Chen

Augmenting the base neural model with a token-level symbolic datastore is a novel generation paradigm and has achieved promising results in machine translation (MT). In this paper, we introduce a unified framework kNN-BOX, which enables quick development and visualization for this novel paradigm. kNN-BOX decomposes the datastore-augmentation approach into three modules: datastore, retriever and combiner, thus putting diverse kNN generation methods into a unified way. Currently, kNN-BOX has provided implementation of seven popular kNN-MT variants, covering research from performance enhancement to efficiency optimization. It is easy for users to reproduce these existing work or customize their own models. Besides, users can interact with their kNN generation systems with kNN-BOX to better understand the underlying inference process in a visualized way. In experiment section, we apply kNN-BOX for machine translation and three other seq2seq generation tasks (text simplification, paraphrase generation and question generation). Experiment results show that augmenting the base neural model with kNN-BOX can bring large performance improvement in all these tasks. The code and document of kNN-BOX is available at The demo can be accessed at The introduction video is available at

pdf bib
A Human-Centric Evaluation Platform for Explainable Knowledge Graph Completion
Zhao Xu | Wiem Ben Rim | Kiril Gashteovski | Timo Sztyler | Carolin Lawrence

Explanations for AI are expected to help human users understand AI-driven predictions. Evaluating plausibility, the helpfulness of the explanations, is therefore essential for developing eXplainable AI (XAI) that can really aid human users. Here we propose a human-centric evaluation platform to measure plausibility of explanations in the context of eXplainable Knowledge Graph Completion (XKGC). The target audience of the platform are researchers and practitioners who want to 1) investigate real needs and interests of their target users in XKGC, 2) evaluate the plausibility of the XKGC methods. We showcase these two use cases in an experimental setting to illustrate what results can be achieved with our system.

pdf bib
pyTLEX: A Python Library for TimeLine EXtraction
Akul Singh | Jared Hummer | Mustafa Ocal | Mark Finlayson

pyTLEX is an implementation of the TimeLine EXtraction algorithm (TLEX; Finlayson et al.,2021) that enables users to work with TimeML annotations and perform advanced temporal analysis, offering a comprehensive suite of features. TimeML is a standardized markup language for temporal information in text. pyTLEX allows users to parse TimeML annotations, construct TimeML graphs, and execute the TLEX algorithm to effect complete timeline extraction. In contrast to previous implementations (i.e., jTLEX for Java), pyTLEX sets itself apart with a range of advanced features. It introduces a React-based visualization system, enhancing the exploration of temporal data and the comprehension of temporal connections within textual information. Furthermore, pyTLEX incorporates an algorithm for increasing connectivity in temporal graphs, which identifies graph disconnectivity and recommends links based on temporal reasoning, thus enhancing the coherence of the graph representation. Additionally, pyTLEX includes a built-in validation algorithm, ensuring compliance with TimeML annotation guidelines, which is essential for maintaining data quality and reliability. pyTLEX equips researchers and developers with an extensive toolkit for temporal analysis, and its testing across various datasets validates its accuracy and reliability.

pdf bib
DepressMind: A Depression Surveillance System for Social Media Analysis
Roque Fernández-Iglesias | Marcos Fernandez-Pichel | Mario Aragon | David E. Losada

Depression is a pressing global issue that impacts millions of individuals worldwide. This prevailing psychologicaldisorder profoundly influences the thoughts and behavior of those who suffer from it. We have developed DepressMind, a versatile screening tool designed to facilitate the analysis of social network data. This automated tool explores multiple psychological dimensions associated with clinical depression and estimates the extent to which these symptoms manifest in language use. Our project comprises two distinct components: one for data extraction and another one for analysis.The data extraction phase is dedicated to harvesting texts and the associated meta-information from social networks and transforming them into a user-friendly format that serves various analytical purposes.For the analysis, the main objective is to conduct an in-depth inspection of the user publications and establish connections between the posted contents and dimensions or traits defined by well-established clinical instruments.Specifically, we aim to associate extracts authored by individuals with symptoms or dimensions of the Beck Depression Inventory (BDI).

pdf bib
Check News in One Click: NLP-Empowered Pro-Kremlin Propaganda Detection
Veronika Solopova | Viktoriia Herman | Christoph Benzmüller | Tim Landgraf

Many European citizens become targets of the Kremlin propaganda campaigns, aiming to minimise public support for Ukraine, foster a climate of mistrust and disunity, and shape elections (Meister, 2022). To address this challenge, we developed “Check News in 1 Click”, the first NLP-empowered pro-Kremlin propaganda detection application available in 7 languages, which provides the lay user with feedback on their news, and explains manipulative linguistic features and keywords. We conducted a user study, analysed user entries and models’ behaviour paired with questionnaire answers, and investigated the advantages and disadvantages of the proposed interpretative solution.

pdf bib
NESTLE: a No-Code Tool for Statistical Analysis of Legal Corpus
Kyoungyeon Cho | Seungkum Han | Young Rok Choi | Wonseok Hwang

The statistical analysis of large scale legal corpus can provide valuable legal insights. For such analysis one needs to (1) select a subset of the corpus using document retrieval tools, (2) structure text using information extraction (IE) systems, and (3) visualize the data for the statistical analysis. Each process demands either specialized tools or programming skills whereas no comprehensive unified “no-code” tools have been available. Here we provide NESTLE, a no-code tool for large-scale statistical analysis of legal corpus. Powered by a Large Language Model (LLM) and the internal custom end-to-end IE system, NESTLE can extract any type of information that has not been predefined in the IE system opening up the possibility of unlimited customizable statistical analysis of the corpus without writing a single line of code. We validate our system on 15 Korean precedent IE tasks and 3 legal text classification tasks from LexGLUE. The comprehensive experiments reveal NESTLE can achieve GPT-4 comparable performance by training the internal IE module with 4 human-labeled, and 192 LLM-labeled examples.

pdf bib
Multi-party Multimodal Conversations Between Patients, Their Companions, and a Social Robot in a Hospital Memory Clinic
Angus Addlesee | Neeraj Cherakara | Nivan Nelson | Daniel Hernandez Garcia | Nancie Gunson | Weronika Sieińska | Christian Dondrup | Oliver Lemon

We have deployed an LLM-based spoken dialogue system in a real hospital. The ARI social robot embodies our system, which patients and their companions can have multi-party conversations with together. In order to enable this multi-party ability, multimodality is critical. Our system, therefore, receives speech and video as input, and generates both speech and gestures (arm, head, and eye movements). In this paper, we describe our complex setting and the architecture of our dialogue system. Each component is detailed, and a video of the full system is available with the appropriate components highlighted in real-time. Our system decides when it should take its turn, generates human-like clarification requests when the patient pauses mid-utterance, answers in-domain questions (grounding to the in-prompt knowledge), and responds appropriately to out-of-domain requests (like generating jokes or quizzes). This latter feature is particularly remarkable as real patients often utter unexpected sentences that could not be handled previously.

pdf bib
ScamSpot: Fighting Financial Fraud in Instagram Comments
Stefan Erben | Andreas Waldis

The long-standing problem of spam and fraudulent messages in the comment sections of Instagram pages in the financial sector claims new victims every day. Instagram’s current spam filter proves inadequate, and existing research approaches are primarily confined to theoretical concepts. Practical implementations with evaluated results are missing. To solve this problem, we propose ScamSpot, a comprehensive system that includes a browser extension, a fine-tuned BERT model and a REST API. This approach ensures public accessibility of our results for Instagram users using the Chrome browser. Furthermore, we conduct a data annotation study, shedding light on the reasons and causes of the problem and evaluate the system through user feedback and comparison with existing models. ScamSpot is an open-source project and is publicly available at

pdf bib
NarrativePlay: Interactive Narrative Understanding
Runcong Zhao | Wenjia Zhang | Jiazheng Li | Lixing Zhu | Yanran Li | Yulan He | Lin Gui

In this paper, we introduce NarrativePlay, a novel system that allows users to role-play a fictional character and interact with other characters in narratives in an immersive environment. We leverage Large Language Models (LLMs) to generate human-like responses, guided by personality traits extracted from narratives. The system incorporates auto-generated visual display of narrative settings, character portraits, and character speech, greatly enhancing the user experience. Our approach eschews predefined sandboxes, focusing instead on main storyline events from the perspective of a user-selected character. NarrativePlay has been evaluated on two types of narratives, detective and adventure stories, where users can either explore the world or increase affinity with other characters through conversations.

pdf bib
DP-NMT: Scalable Differentially Private Machine Translation
Timour Igamberdiev | Doan Nam Long Vu | Felix Kuennecke | Zhuo Yu | Jannik Holmer | Ivan Habernal

Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing models, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.

pdf bib
AnnoPlot: Interactive Visualizations of Text Annotations
Elisabeth Fittschen | Tim Fischer | Daniel Brühl | Julia Spahr | Yuliia Lysa | Phuoc Thang Le

This paper presents AnnoPlot, a web application designed to analyze, manage, and visualize annotated text data.Users can configure projects, upload datasets, and explore their data through interactive visualization of span annotations with scatter plots, clusters, and statistics. AnnoPlot supports various transformer models to compute high-dimensional embeddings of text annotations and utilizes dimensionality reduction algorithms to offer users a novel 2D view of their datasets.A dynamic approach to dimensionality reduction allows users to adjust visualizations in real-time, facilitating category reorganization and error identification. The proposed application is open-source, promoting transparency and user control.Especially suited for the Digital Humanities, AnnoPlot offers a novel solution to address challenges in dynamic annotation datasets, empowering users to enhance data integrity and adapt to evolving categorizations.

pdf bib
GeospaCy: A tool for extraction and geographical referencing of spatial expressions in textual data
Syed Mehtab Alam | Elena Arsevska | Mathieu Roche | Maguelonne Teisseire

Spatial information in text enables to understand the geographical context and relationships within text for better decision-making across various domains such as disease surveillance, disaster management and other location based services. Therefore, it is crucial to understand the precise geographical context for location-sensitive applications. In response to this necessity, we introduce the GeospaCy software tool, designed for the extraction and georeferencing of spatial information present in textual data. GeospaCy fulfils two primary objectives: 1) Geoparsing, which involves extracting spatial expressions, encompassing place names and associated spatial relations within the text data, and 2) Geocoding, which facilitates the assignment of geographical coordinates to the spatial expressions extracted during the Geoparsing task. Geoparsing is evaluated with a disease news article dataset consisting of event information, whereas a qualitative evaluation of geographical coordinates (polygons/geometries) of spatial expressions is performed by end-users for Geocoding task.

pdf bib
MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Timothee Mickus | Stig-Arne Grönroos | Joseph Attieh | Michele Boggia | Ona De Gibert | Shaoxiong Ji | Niki Andreas Loppi | Alessandro Raganato | Raúl Vázquez | Jörg Tiedemann

NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend goes to modularization, a necessary step into the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters.We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information.The toolkit is publicly available online at

pdf bib
The DURel Annotation Tool: Human and Computational Measurement of Semantic Proximity, Sense Clusters and Semantic Change
Dominik Schlechtweg | Shafqat Mumtaz Virk | Pauline Sander | Emma Sköldberg | Lukas Theuer Linke | Tuo Zhang | Nina Tahmasebi | Jonas Kuhn | Sabine Schulte Im Walde

We present the DURel tool implementing the annotation of semantic proximity between word uses into an online, open source interface. The tool supports standardized human annotation as well as computational annotation, building on recent advances with Word-in-Context models. Annotator judgments are clustered with automatic graph clustering techniques and visualized for analysis. This allows to measure word senses with simple and intuitive micro-task judgments between use pairs, requiring minimal preparation efforts. The tool offers additional functionalities to compare the agreement between annotators to guarantee the inter-subjectivity of the obtained judgments and to calculate summary statistics over the annotated data giving insights into sense frequency distributions, semantic variation or changes of senses over time.

pdf bib
RAGAs: Automated Evaluation of Retrieval Augmented Generation
Shahul Es | Jithin James | Luis Espinosa Anke | Steven Schockaert

We introduce RAGAs (Retrieval Augmented Generation Assessment), a framework for reference-free evaluation of Retrieval Augmented Generation (RAG) pipelines. RAGAs is available at []. RAG systems are composed of a retrieval and an LLM based generation module. They provide LLMs with knowledge from a reference textual database, enabling them to act as a natural language layer between a user and textual databases, thus reducing the risk of hallucinations. Evaluating RAG architectures is challenging due to several dimensions to consider: the ability of the retrieval system to identify relevant and focused context passages, the ability of the LLM to exploit such passages faithfully, and the quality of the generation itself. With RAGAs, we introduce a suite of metrics that can evaluate these different dimensions without relying on ground truth human annotations. We posit that such a framework can contribute crucially to faster evaluation cycles of RAG architectures, which is especially important given the fast adoption of LLMs.

pdf bib
NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation
Shachar Rosenman | Vasudev Lal | Phillip Howard

Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive framework that automatically enhances a user’s prompt to improve the quality of generations produced by text-to-image models. Our framework utilizes constrained text decoding with a pre-trained language model that has been adapted to generate prompts similar to those produced by human prompt engineers. This approach enables higher-quality text-to-image generations and provides user control over stylistic features via constraint set specification. We demonstrate the utility of our framework by creating an interactive application for prompt enhancement and image generation using Stable Diffusion. Additionally, we conduct experiments utilizing a large dataset of human-engineered prompts for text-to-image generation and show that our approach automatically produces enhanced prompts that result in superior image quality. We make our code, a screencast video demo and a live demo instance of NeuroPrompts publicly available.

pdf bib
MEGAnno+: A Human-LLM Collaborative Annotation System
Hannah Kim | Kushan Mitra | Rafael Li Chen | Sajjadur Rahman | Dan Zhang

Large language models (LLMs) can label data faster and cheaper than humans for various NLP tasks. Despite their prowess, LLMs may fall short in understanding of complex, sociocultural, or domain-specific context, potentially leading to incorrect annotations. Therefore, we advocate a collaborative approach where humans and LLMs work together to produce reliable and high-quality labels. We present MEGAnno+, a human-LLM collaborative annotation system that offers effective LLM agent and annotation management, convenient and robust LLM annotation, and exploratory verification of LLM labels by humans.

pdf bib
X-AMR Annotation Tool
Shafiuddin Rehan Ahmed | Jon Cai | Martha Palmer | James H. Martin

This paper presents a novel Cross-document Abstract Meaning Representation (X-AMR) annotation tool designed for annotating key corpus-level event semantics. Leveraging machine assistance through the Prodigy Annotation Tool, we enhance the user experience, ensuring ease and efficiency in the annotation process. Through empirical analyses, we demonstrate the effectiveness of our tool in augmenting an existing event corpus, highlighting its advantages when integrated with GPT-4. Code and annotations: href{}{} footnote Demo: {href{}{}} footnote Live Link: {href{}{}}

pdf bib
DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies
Anh Dau | Jin L.c. Guo | Nghi Bui

Comments in source code are crucial for developers to understand the purpose of the code and to use it correctly. However, keeping comments aligned with the evolving codebase poses a significant challenge. With increasing interest in automated solutions to identify and rectify discrepancies between code and its associated comments, most existing methods rely heavily on heuristic rules. This paper introduces DocChecker, a language model-based framework adept at detecting inconsistencies between code and comments and capable of generating synthetic comments. This functionality allows DocChecker to identify and rectify cases where comments do not accurately represent the code they describe.The efficacy of DocChecker is demonstrated using the Just-In-Time and CodeXGlue datasets in various scenarios. Notably, DocChecker sets a new benchmark in the Inconsistency Code-Comment Detection (ICCD) task, achieving 72.3% accuracy, and scoring 33.64 in BLEU-4 on the code summarization task. These results surpass other Large Language Models (LLMs), including GPT 3.5 and CodeLlama.DocChecker is accessible for use and evaluation. It can be found on and at For a more comprehensive understanding of its functionality, a demonstration video is available on

pdf bib
TL;DR Progress: Multi-faceted Literature Exploration in Text Summarization
Shahbaz Syed | Khalid Al Khatib | Martin Potthast

This paper presents TL;DR Progress, a new tool for exploring the literature on neural text summarization. It organizes 514~papers based on a comprehensive annotation scheme for text summarization approaches and enables fine-grained, faceted search. Each paper was manually annotated to capture aspects such as evaluation metrics, quality dimensions, learning paradigms, challenges addressed, datasets, and document domains. In addition, a succinct indicative summary is provided for each paper, describing contextual factors, issues, and proposed solutions. The tool is available at {url{}}, a demo video at {url{}}

pdf bib
FRAPPE: FRAming, Persuasion, and Propaganda Explorer
Ahmed Sajwani | Alaa El Setohy | Ali Mekky | Diana Turmakhan | Lara Hassan | Mohamed El Zeftawy | Omar El Herraoui | Osama Afzal | Qisheng Liao | Tarek Mahmoud | Zain Muhammad Mujahid | Muhammad Umar Salman | Muhammad Arslan Manzoor | Massa Baali | Jakub Piskorski | Nicolas Stefanovitch | Giovanni Da San Martino | Preslav Nakov

The abundance of news sources and the urgent demand for reliable information have led to serious concerns about the threat of misleading information. In this paper, we present FRAPPE, a FRAming, Persuasion, and Propaganda Explorer system. FRAPPE goes beyond conventional news analysis of articles and unveils the intricate linguistic techniques used to shape readers’ opinions and emotions. Our system allows users not only to analyze individual articles for their genre, framings, and use of persuasion techniques, but also to draw comparisons between the strategies of persuasion and framing adopted by a diverse pool of news outlets and countries across multiple languages for different topics, thus providing a comprehensive understanding of how information is presented and manipulated. FRAPPE is publicly accessible at and a video explaining our system is available at

pdf bib
LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
Fahim Dalvi | Maram Hasanain | Sabri Boughorbel | Basel Mousi | Samir Abdaljalil | Nizi Nazar | Ahmed Abdelali | Shammur Absar Chowdhury | Hamdy Mubarak | Ahmed Ali

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implements most standard evaluation metrics. It supports in-context learning with zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We open-sourced LLMeBench for the community ( and a video demonstrating the framework is available online (

pdf bib
Sig-Networks Toolkit: Signature Networks for Longitudinal Language Modelling
Talia Tseriotou | Ryan Chan | Adam Tsakalidis | Iman Munire Bilal | Elena Kochkina | Terry Lyons | Maria Liakata

We present an open-source, pip installable toolkit, Sig-Networks, the first of its kind for longitudinal language modelling. A central focus is the incorporation of Signature-based Neural Network models, which have recently shown success in temporal tasks. We apply and extend published research providing a full suite of signature-based models. Their components can be used as PyTorch building blocks in future architectures. Sig-Networks enables task-agnostic dataset plug-in, seamless preprocessing for sequential data, parameter flexibility, automated tuning across a range of models. We examine signature networks under three different NLP tasks of varying temporal granularity: counselling conversations, rumour stance switch and mood changes in social media threads, showing SOTA performance in all three, and provide guidance for future tasks. We release the Toolkit as a PyTorch package with an introductory video, Git repositories for preprocessing and modelling including sample notebooks on the modeled NLP tasks.


pdf (full)
bib (full)
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Neele Falk | Sara Papi | Mike Zhang

pdf bib
AutoAugment Is What You Need: Enhancing Rule-based Augmentation Methods in Low-resource Regimes
Juhwan Choi | Kyohoon Jin | Junho Lee | Sangmin Song | YoungBin Kim

Text data augmentation is a complex problem due to the discrete nature of sentences. Although rule-based augmentation methods are widely adopted in real-world applications because of their simplicity, they suffer from potential semantic damage. Previous researchers have suggested easy data augmentation with soft labels (softEDA), employing label smoothing to mitigate this problem. However, finding the best factor for each model and dataset is challenging; therefore, using softEDA in real-world applications is still difficult. In this paper, we propose adapting AutoAugment to solve this problem. The experimental results suggest that the proposed method can boost existing augmentation methods and that rule-based methods can enhance cutting-edge pretrained language models. We offer the source code.

pdf bib
Generating Diverse Translation with Perturbed kNN-MT
Yuto Nishida | Makoto Morishita | Hidetaka Kamigaito | Taro Watanabe

Generating multiple translation candidates would enable users to choose the one that satisfies their needs.Although there has been work on diversified generation, there exists room for improving the diversity mainly because the previous methods do not address the overcorrection problem—the model underestimates a prediction that is largely different from the training data, even if that prediction is likely.This paper proposes methods that generate more diverse translations by introducing perturbed k-nearest neighbor machine translation (kNN-MT).Our methods expand the search space of kNN-MT and help incorporate diverse words into candidates by addressing the overcorrection problem.Our experiments show that the proposed methods drastically improve candidate diversity and control the degree of diversity by tuning the perturbation’s magnitude.

pdf bib
The KIND Dataset: A Social Collaboration Approach for Nuanced Dialect Data Collection
Asma Yamani | Raghad Alziyady | Reem AlYami | Salma Albelali | Leina Albelali | Jawharah Almulhim | Amjad Alsulami | Motaz Alfarraj | Rabeah Al-Zaidy

Nuanced dialects are a linguistic variant that pose several challenges for NLP models and techniques. One of the main challenges is the limited amount of datasets to enable extensive research and experimentation. We propose an approach for efficiently collecting nuanced dialectal datasets that are not only of high quality, but are versatile enough to be multipurpose as well. To test our approach we collect the KIND corpus, which is a collection of fine-grained Arabic dialect data. The data is short texts, and unlike many nuanced dialectal datasets, it is curated manually through social collaboration efforts as opposed to being crawled from social media. The collaborative approach is incentivized through educational gamification and competitions for which the community itself benefits from the open source dataset. Our approach aims to achieve: (1) coverage of dialects from under-represented groups and fine-grained dialectal varieties, (2) provide aligned parallel corpora for translation between Modern Standard Arabic (MSA) and multiple dialects to enable translation and comparison studies, (3) promote innovative approaches for nuanced dialect data collection. We explain the steps for the competition as well as the resulting datasets and the competing data collection systems. The KIND dataset is shared with the research community.

pdf bib
Can Stanza be Used for Part-of-Speech Tagging Historical Polish?
Maria Irena Szawerna

The goal of this paper is to evaluate the performance of Stanza, a part-of-speech (POS) tagger developed for modern Polish, on historical text to assess its possible use for automating the annotation of other historical texts. While the issue of the reliability of utilizing POS taggers on historical data has been previously discussed, most of the research focuses on languages whose grammar differs from Polish, meaning that their results need not be fully applicable in this case. The evaluation of Stanza is conducted on two sets of 10286 and 3270 manually annotated tokens from a piece of historical Polish writing (1899), and the errors are analyzed qualitatively and quantitatively. The results show a good performance of the tagger, especially when it comes to Universal Part-of-Speech (UPOS) tags, which is promising for utilizing the tagger for automatic annotation in larger projects, and pinpoint some common features of misclassified tokens.

pdf bib
Toward Zero-Shot Instruction Following
Renze Lou | Wenpeng Yin

This work proposes a challenging yet more realistic setting for zero-shot cross-task generalization: zero-shot instruction following, presuming the existence of a paragraph-style task definition while no demonstrations exist. To better learn the task supervision from the definition, we propose two strategies: first, to automatically find out the critical sentences in the definition; second, a ranking objective to force the model to generate the gold outputs with higher probabilities when those critical parts are highlighted in the definition. The joint efforts of the two strategies yield state-of-the-art performance on the Super-NaturalInstructions. Our code is available on GitHub.

pdf bib
UnMASKed: Quantifying Gender Biases in Masked Language Models through Linguistically Informed Job Market Prompts
Iñigo Parra

Language models (LMs) have become pivotal in the realm of technological advancements. While their capabilities are vast and transformative, they often include societal biases encoded in the human-produced datasets used for their training. This research delves into the inherent biases present in masked language models (MLMs), with a specific focus on gender biases. This study evaluated six prominent models: BERT, RoBERTa, DistilBERT, BERT- multilingual, XLM-RoBERTa, and DistilBERT- multilingual. The methodology employed a novel dataset, bifurcated into two subsets: one containing prompts that encouraged models to generate subject pronouns in English and the other requiring models to return the probabilities of verbs, adverbs, and adjectives linked to the prompts’ gender pronouns. The analysis reveals stereotypical gender alignment of all models, with multilingual variants showing comparatively reduced biases.

pdf bib
Distribution Shifts Are Bottlenecks: Extensive Evaluation for Grounding Language Models to Knowledge Bases
Yiheng Shu | Zhiwei Yu

Grounding language models (LMs) to knowledge bases (KBs) helps to obtain rich and accurate facts. However, it remains challenging because of the enormous size, complex structure, and partial observability of KBs. One reason is that current benchmarks fail to reflect robustness challenges and fairly evaluate models.This paper analyzes whether these robustness challenges arise from distribution shifts, including environmental, linguistic, and modal aspects.This affects the ability of LMs to cope with unseen schema, adapt to language variations, and perform few-shot learning. Thus, the paper proposes extensive evaluation protocols and conducts experiments to demonstrate that, despite utilizing our proposed data augmentation method, both advanced small and large language models exhibit poor robustness in these aspects. We conclude that current LMs are too fragile to navigate in complex environments due to distribution shifts. This underscores the need for future research focusing on data collection, evaluation protocols, and learning paradigms.

pdf bib
AttriSage: Product Attribute Value Extraction Using Graph Neural Networks
Rohan Potta | Mallika Asthana | Siddhant Yadav | Nidhi Goyal | Sai Patnaik | Parul Jain

Extracting the attribute value of a product from the given product description is essential for ecommerce functions like product recommendations, search, and information retrieval. Therefore, understanding products in E-commerce. Greater accuracy certainly gives any retailer the edge. The burdensome aspect of this problem lies in the diversity of the products and their attributes and values. Existing solutions typically employ large language models or sequence-tagging approaches to capture the context of a given product description and extract attribute values. However, they do so with limited accuracy, which serves as the underlying motivation to explore a more comprehensive solution. Through this paper, we present a novel approach for attribute value extraction from product description leveraging graphs and graph neural networks. Our proposed method demonstrates improvements in attribute value extraction accuracy compared to the baseline sequence tagging approaches.

pdf bib
HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs
Cem Uluoglakci | Tugba Temizel

Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs), limiting their widespread acceptance beyond chatbot applications. Despite ongoing efforts, hallucinations remain a prevalent challenge in LLMs. The detection of hallucinations itself is also a formidable task, frequently requiring manual labeling or constrained evaluations. This paper introduces an automated scalable framework that combines benchmarking LLMs’ hallucination tendencies with efficient hallucination detection. We leverage LLMs to generate challenging tasks related to hypothetical phenomena, subsequently employing them as agents for efficient hallucination detection. The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain. We introduce the publicly available HypoTermQA Benchmarking Dataset, on which state-of-the-art models’ performance ranged between 3% and 11%, and evaluator agents demonstrated a 6% error rate in hallucination prediction. The proposed framework provides opportunities to test and improve LLMs. Additionally, it has the potential to generate benchmarking datasets tailored to specific domains, such as law, health, and finance.

pdf bib
Arabic Synonym BERT-based Adversarial Examples for Text Classification
Norah Alshahrani | Saied Alshahrani | Esma Wali | Jeanna Matthews

Text classification systems have been proven vulnerable to adversarial text examples, modified versions of the original text examples that are often unnoticed by human eyes, yet can force text classification models to alter their classification. Often, research works quantifying the impact of adversarial text attacks have been applied only to models trained in English. In this paper, we introduce the first word-level study of adversarial attacks in Arabic. Specifically, we use a synonym (word-level) attack using a Masked Language Modeling (MLM) task with a BERT model in a black-box setting to assess the robustness of the state-of-the-art text classification models to adversarial attacks in Arabic. To evaluate the grammatical and semantic similarities of the newly produced adversarial examples using our synonym BERT-based attack, we invite four human evaluators to assess and compare the produced adversarial examples with their original examples. We also study the transferability of these newly produced Arabic adversarial examples to various models and investigate the effectiveness of defense mechanisms against these adversarial examples on the BERT models. We find that fine-tuned BERT models were more susceptible to our synonym attacks than the other Deep Neural Networks (DNN) models like WordCNN and WordLSTM we trained. We also find that fine-tuned BERT models were more susceptible to transferred attacks. We, lastly, find that fine-tuned BERT models successfully regain at least 2% in accuracy after applying adversarial training as an initial defense mechanism.

pdf bib
A Hypothesis-Driven Framework for the Analysis of Self-Rationalising Models
Marc Braun | Jenny Kunz

The self-rationalising capabilities of LLMs are appealing because the generated explanations can give insights into the plausibility of the predictions. However, how faithful the explanations are to the predictions is questionable, raising the need to explore the patterns behind them further.To this end, we propose a hypothesis-driven statistical framework. We use a Bayesian network to implement a hypothesis about how a task (in our example, natural language inference) is solved, and its internal states are translated into natural language with templates. Those explanations are then compared to LLM-generated free-text explanations using automatic and human evaluations. This allows us to judge how similar the LLM’s and the Bayesian network’s decision processes are. We demonstrate the usage of our framework with an example hypothesis and two realisations in Bayesian networks. The resulting models do not exhibit a strong similarity to GPT-3.5. We discuss the implications of this as well as the framework’s potential to approximate LLM decisions better in future work.

pdf bib
Align before Attend: Aligning Visual and Textual Features for Multimodal Hateful Content Detection
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque | Sarah Masud Preum

Multimodal hateful content detection is a challenging task that requires complex reasoning across visual and textual modalities. Therefore, creating a meaningful multimodal representation that effectively captures the interplay between visual and textual features through intermediate fusion is critical. Conventional fusion techniques are unable to attend to the modality-specific features effectively. Moreover, most studies exclusively concentrated on English and overlooked other low-resource languages. This paper proposes a context-aware attention framework for multimodal hateful content detection and assesses it for both English and non-English languages. The proposed approach incorporates an attention layer to meaningfully align the visual and textual features. This alignment enables selective focus on modality-specific features before fusing them. We evaluate the proposed approach on two benchmark hateful meme datasets, viz. MUTE (Bengali code-mixed) and MultiOFF (English). Evaluation results demonstrate our proposed approach’s effectiveness with F1-scores of 69.7% and 70.3% for the MUTE and MultiOFF datasets. The scores show approximately 2.5% and 3.2% performance improvement over the state-of-the-art systems on these datasets. Our implementation is available at

pdf bib
Topic-guided Example Selection for Domain Adaptation in LLM-based Machine Translation
Seth Aycock | Rachel Bawden

Current machine translation (MT) systems perform well in the domains on which they were trained, but adaptation to unseen domains remains a challenge. Rather than fine-tuning on domain data or modifying the architecture for training, an alternative approach exploits large language models (LLMs), which are performant across NLP tasks especially when presented with in-context examples. We focus on adapting a pre-trained LLM to a domain at inference through in-context example selection. For MT, examples are usually randomly selected from a development set. Some more recent methods though select using the more intuitive basis of test source similarity. We employ topic models to select examples based on abstract semantic relationships below the level of a domain. We test the relevance of these statistical models and use them to select informative examples even for out-of-domain inputs, experimenting on 7 diverse domains and 11 language pairs of differing resourcedness. Our method outperforms baselines on challenging multilingual out-of-domain tests, though it does not match performance with strong baselines for the in-language setting. We find that adding few-shot examples and related keywords consistently improves translation quality, that example diversity must be balanced with source similarity, and that our pipeline is overly restrictive for example selection when a targeted development set is available.

pdf bib
Reforging : A Method for Constructing a Linguistically Valid Japanese CCG Treebank
Asa Tomita | Hitomi Yanaka | Daisuke Bekki

The linguistic validity of Combinatory Categorial Grammar (CCG) parsing results relies heavily on treebanks for training and evaluation, so the treebank construction is crucial. Yet the current Japanese CCG treebank is known to have inaccuracies in its analyses of Japanese syntactic structures, including passive and causative constructions. While ABCTreebank, a treebank for ABC grammar, has been made to improve the analysis, particularly of argument structures, it lacks the detailed syntactic features required for Japanese CCG. In contrast, the Japanese CCG parser, lightblue, efficiently provides detailed syntactic features, but it does not accurately capture argument structures. We propose a method to generate a linguistically valid Japanese CCG treebank with detailed information by combining the strengths of ABCTreebank and lightblue. We develop an algorithm that filters lightblue’s lexical items using ABCTreebank, effectively converting lightblue output into a linguistically valid CCG treebank. To evaluate our treebank, we manually evaluate CCG syntactic structures and semantic representations and analyze conversion rates.

pdf bib
Thesis Proposal: Detecting Agency Attribution
Igor Ryazanov | Johanna Björklund

We explore computational methods for perceived agency attribution in natural language data. We consider ‘agency’ as the freedom and capacity to act, and the corresponding Natural Language Processing (NLP) task involves automatically detecting attributions of agency to entities in text. Our theoretical framework draws on semantic frame analysis, role labelling and related techniques. In initial experiments, we focus on the perceived agency of AI systems. To achieve this, we analyse a dataset of English-language news coverage of AI-related topics, published within one year surrounding the release of the Large Language Model-based service ChatGPT, a milestone in the general public’s awareness of AI. Building on this, we propose a schema to annotate a dataset for agency attribution and formulate additional research questions to answer by applying NLP models.

pdf bib
A Thesis Proposal ClaimInspector Framework: A Hybrid Approach to Data Annotation using Fact-Checked Claims and LLMs
Basak Bozkurt

This thesis explores the challenges and limitations encountered in automated fact-checking processes, with a specific emphasis on data annotation in the context of misinformation. Despite the widespread presence of misinformation in multiple formats and across various channels, current efforts concentrate narrowly on textual claims sourced mainly from Twitter, resulting in datasets with considerably limited scope. Furthermore, the absence of automated control measures, coupled with the reliance on human annotation, which is very limited, increases the risk of noisy data within these datasets. This thesis proposal examines the existing methods, elucidates their limitations and explores the potential integration of claim detection subtasks and Large Language Models to mitigate these issues. It introduces ClaimInspector, a novel framework designed for a systemic collection of multimodal data from the internet. By implementing this framework, this thesis will propose a dataset comprising fact-checks alongside the corresponding claims made by politicians. Overall, this thesis aims to enhance the accuracy and efficiency of annotation processes, thereby contributing to automated fact-checking efforts.

pdf bib
Large Language Models for Mathematical Reasoning: Progresses and Challenges
Janice Ahn | Rishu Verma | Renze Lou | Di Liu | Rui Zhang | Wenpeng Yin

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the automated resolution of mathematical problems. However, the landscape of mathematical problem types is vast and varied, with LLM-oriented techniques undergoing evaluation across diverse datasets and settings. This diversity makes it challenging to discern the true advancements and obstacles within this burgeoning field. This survey endeavors to address four pivotal dimensions: i) a comprehensive exploration of the various mathematical problems and their corresponding datasets that have been investigated; ii) an examination of the spectrum of LLM-oriented techniques that have been proposed for mathematical problem-solving; iii) an overview of factors and concerns affecting LLMs in solving math; and iv) an elucidation of the persisting challenges within this domain. To the best of our knowledge, this survey stands as one of the first extensive examinations of the landscape of LLMs in the realm of mathematics, providing a holistic perspective on the current state, accomplishments, and future challenges in this rapidly evolving field.

pdf bib
Representation and Generation of Machine Learning Test Functions
Souha Hassine | Steven Wilson

Writing tests for machine learning (ML) code is a crucial step towards ensuring the correctness and reliability of ML software. At the same time, Large Language Models (LLMs) have been adopted at a rapid pace for various code generation tasks, making it a natural choice for many developers who need to write ML tests. However, the implications of using these models, and how the LLM-generated tests differ from human-written ones, are relatively unexplored. In this work, we examine the use of LLMs to extract representations of ML source code and tests in order to understand the semantic relationships between human-written test functions and LLM-generated ones, and annotate a set of LLM-generated tests for several important qualities including usefulness, documentation, and correctness. We find that programmers prefer LLM-generated tests to those selected using retrieval-based methods, and in some cases, to those written by other humans.

pdf bib
The Generative AI Paradox in Evaluation: “What It Can Solve, It May Not Evaluate”
Juhyun Oh | Eunsu Kim | Inha Cha | Alice Oh

This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of “the Generative AI Paradox” (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.

pdf bib
Generative Data Augmentation using LLMs improves Distributional Robustness in Question Answering
Arijit Chowdhury | Aman Chadha

Robustness in Natural Language Processing continues to be a pertinent issue, where state of the art models under-perform under naturally shifted distributions. In the context of Question Answering, work on domain adaptation methods continues to be a growing body of research. However, very little attention has been given to the notion of domain generalization under natural distribution shifts, where the target domain is unknown. With drastic improvements in the quality and access to generative models, we answer the question: How do generated datasets influence the performance of QA models under natural distribution shifts? We perform experiments on 4 different datasets under varying amounts of distribution shift, and analyze how “in-the-wild” generation can help achieve domain generalization. We take a two-step generation approach, generating both contexts and QA pairs to augment existing datasets. Through our experiments, we demonstrate how augmenting reading comprehension datasets with generated data leads to better robustness towards natural distribution shifts.

pdf bib
Japanese-English Sentence Translation Exercises Dataset for Automatic Grading
Naoki Miura | Hiroaki Funayama | Seiya Kikuchi | Yuichiroh Matsubayashi | Yuya Iwase | Kentaro Inui

This paper proposes the task of automatic assessment of Sentence Translation Exercises (STEs), that have been used in the early stage of L2 language learning.We formalize the task as grading student responses for each rubric criterion pre-specified by the educators.We then create a dataset for STE between Japanese and English including 21 questions, along with a total of 3,498 student responses (167 on average).The answer responses were collected from students and crowd workers.Using this dataset, we demonstrate the performance of baselines including a finetuned BERT model and GPT-3.5 with few-shot learning. Experimental results showed that the baseline model with fine-tuned BERT was able to classify correct responses with approximately 90% in F1, but only less than 80% for incorrect responses. Furthermore, GPT-3.5 with few-shot learning shows a poorer result than the BERT model, indicating that our newly proposed task presents a challenging issue, even for the state-of-the-art large language model.

pdf bib
The Impact of Integration Step on Integrated Gradients
Masahiro Makino | Yuya Asazuma | Shota Sasaki | Jun Suzuki

Integrated Gradients (IG) serve as a potent tool for explaining the internal structure of a language model. The calculation of IG requires numerical integration, wherein the number of steps serves as a critical hyperparameter. The step count can drastically alter the results, inducing considerable errors in interpretability. To scrutinize the effect of step variation on IG, we measured the difference between theoretical and observed IG totals for each step amount.Our findings indicate that the ideal number of steps to maintain minimal error varies from instance to instance. Consequently, we advocate for customizing the step count for each instance. Our study is the first to quantitatively analyze the variation of IG values with the number of steps.

pdf bib
GesNavi: Gesture-guided Outdoor Vision-and-Language Navigation
Aman Jain | Teruhisa Misu | Kentaro Yamada | Hitomi Yanaka

Vision-and-Language Navigation (VLN) task involves navigating mobility using linguistic commands and has application in developing interfaces for autonomous mobility. In reality, natural human communication also encompasses non-verbal cues like hand gestures and gaze. These gesture-guided instructions have been explored in Human-Robot Interaction systems for effective interaction, particularly in object-referring expressions. However, a notable gap exists in tackling gesture-based demonstrative expressions in outdoor VLN task. To address this, we introduce a novel dataset for gesture-guided outdoor VLN instructions with demonstrative expressions, designed with a focus on complex instructions requiring multi-hop reasoning between the multiple input modalities. In addition, our work also includes a comprehensive analysis of the collected data and a comparative evaluation against the existing datasets.

pdf bib
Can docstring reformulation with an LLM improve code generation?
Nicola Dainese | Alexander Ilin | Pekka Marttinen

Generating code is an important application of Large Language Models (LLMs) and the task of function completion is one of the core open challenges in this context. Existing approaches focus on either training, fine-tuning or prompting LLMs to generate better outputs given the same input. We propose a novel and complementary approach: to optimize part of the input, the docstring (summary of a function’s purpose and usage), via reformulation with an LLM, in order to improve code generation. We develop two baseline methods for optimizing code generation via docstring reformulation and test them on the original HumanEval benchmark and multiple curated variants which are made more challenging by realistically worsening the docstrings. Our results show that, when operating on docstrings reformulated by an LLM instead of the original (or worsened) inputs, the performance of a number of open-source LLMs does not change significantlyThis finding demonstrates an unexpected robustness of current open-source LLMs to the details of the docstrings. We conclude by examining a series of questions, accompanied by in-depth analyses, pertaining to the sensitivity of current open-source LLMs to the details in the docstrings, the potential for improvement via docstring reformulation and the limitations of the methods employed in this work.

pdf bib
Benchmarking Diffusion Models for Machine Translation
Yunus Demirag | Danni Liu | Jan Niehues

Diffusion models have recently shown great potential on many generative tasks.In this work, we explore diffusion models for machine translation (MT).We adapt two prominent diffusion-based text generation models, Diffusion-LM and DiffuSeq, to perform machine translation.As the diffusion models generate non-autoregressively (NAR),we draw parallels to NAR machine translation models.With a comparison to conventional Transformer-based translation models, as well as to the Levenshtein Transformer,an established NAR MT model,we show that the multimodality problem that limits NAR machine translation performance is also a challenge to diffusion models.We demonstrate that knowledge distillation from an autoregressive model improves the performance of diffusion-based MT.A thorough analysis on the translation quality of inputs of different lengths shows that the diffusion models struggle more on long-range dependencies than other models.

pdf bib
Forged-GAN-BERT: Authorship Attribution for LLM-Generated Forged Novels
Kanishka Silva | Ingo Frommholz | Burcu Can | Fred Blain | Raheem Sarwar | Laura Ugolini

The advancement of generative Large Language Models (LLMs), capable of producing human-like texts, introduces challenges related to the authenticity of the text documents. This requires exploring potential forgery scenarios within the context of authorship attribution, especially in the literary domain. Particularly,two aspects of doubted authorship may arise in novels, as a novel may be imposed by a renowned author or include a copied writing style of a well-known novel. To address these concerns, we introduce Forged-GAN-BERT, a modified GANBERT-based model to improve the classification of forged novels in two data-augmentation aspects: via the Forged Novels Generator (i.e., ChatGPT) and the generator in GAN. Compared to other transformer-based models, the proposed Forged-GAN-BERT model demonstrates an improved performance with F1 scores of 0.97 and 0.71 for identifying forged novels in single-author and multi-author classification settings. Additionally, we explore different prompt categories for generating the forged novels to analyse the quality of the generated texts using different similarity distance measures, including ROUGE-1, Jaccard Similarity, Overlap Confident, and Cosine Similarity.

pdf bib
Thesis Proposal: Detecting Empathy Using Multimodal Language Model
Md Rakibul Hasan | Md Zakir Hossain | Aneesh Krishna | Shafin Rahman | Tom Gedeon

Empathy is crucial in numerous social interactions, including human-robot, patient-doctor, teacher-student, and customer-call centre conversations. Despite its importance, empathy detection in videos continues to be a challenging task because of the subjective nature of empathy and often remains under-explored. Existing studies have relied on scripted or semi-scripted interactions in text-, audio-, or video-only settings that fail to capture the complexities and nuances of real-life interactions. This PhD research aims to fill these gaps by developing a multimodal language model (MMLM) that detects empathy in audiovisual data. In addition to leveraging existing datasets, the proposed study involves collecting real-life interaction video and audio. This study will leverage optimisation techniques like neural architecture search to deliver an optimised small-scale MMLM. Successful implementation of this project has significant implications in enhancing the quality of social interactions as it enables real-time measurement of empathy and thus provides potential avenues for training for better empathy in interactions.

pdf bib
Toward Sentiment Aware Semantic Change Analysis
Roksana Goworek | Haim Dubossarsky

This student paper explores the potential of augmenting computational models of semantic change with sentiment information. It tests the efficacy of this approach on the English SemEval of Lexical Semantic Change and its associated historical corpora. We first establish the feasibility of our approach by demonstrating that existing models extract reliable sentiment information from historical corpora, and then validate that words that underwent semantic change also show greater sentiment change in comparison to historically stable words. We then integrate sentiment information into standard models of semantic change for individual words, and test if this can improve the overall performance of the latter, showing mixed results. This research contributes to our understanding of language change by providing the first attempt to enrich standard models of semantic change with additional information. It taps into the multifaceted nature of language change, that should not be reduced only to binary or scalar report of change, but adds additional dimensions to this change, sentiment being only one of these. As such, this student paper suggests novel directions for future work in integrating additional, more nuanced information of change and interpretation for finer-grained semantic change analysis.

pdf bib
Dynamic Task-Oriented Dialogue: A Comparative Study of Llama-2 and Bert in Slot Value Generation
Tiziano Labruna | Sofia Brenna | Bernardo Magnini

Recent advancements in instruction-based language models have demonstrated exceptional performance across various natural language processing tasks. We present a comprehensive analysis of the performance of two open-source language models, BERT and Llama-2, in the context of dynamic task-oriented dialogues. Focusing on the Restaurant domain and utilizing the MultiWOZ 2.4 dataset, our investigation centers on the models’ ability to generate predictions for masked slot values within text. The dynamic aspect is introduced through simulated domain changes, mirroring real-world scenarios where new slot values are incrementally added to a domain over time.This study contributes to the understanding of instruction-based models’ effectiveness in dynamic natural language understanding tasks when compared to traditional language models and emphasizes the significance of open-source, reproducible models in advancing research within the academic community.


pdf (full)
bib (full)
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Mohsen Mesgar | Sharid Loáiciga

pdf bib
Computational modeling of semantic change
Pierluigi Cassotti | Francesco Periti | Stefano de Pascale | Haim Dubossarsky | Nina Tahmasebi

Languages change constantly over time, influenced by social, technological, cultural and political factors that affect how people express themselves. In particular, words can undergo the process of semantic change, which can be subtle and significantly impact the interpretation of texts. For example, the word terrific used to mean ‘causing terror’ and was as such synonymous to terrifying. Nowadays, speakers use the word in the sense of ‘excessive’ and even ‘amazing’. In Historical Linguistics, tools and methods have been developed to analyse this phenomenon, including systematic categorisations of the types of change, the causes and the mechanisms underlying the different types of change. However, traditional linguistic methods, while informative, are often based on small, carefully curated samples. Thanks to the availability of both large diachronic corpora, the computational means to model word meaning unsupervised, and evaluation benchmarks, we are seeing an increasing interest in the computational modelling of semantic change. This is evidenced by the increasing number of publications in this new domain as well as the organisation of initiatives and events related to this topic, such as four editions of the International Workshop on Computational Approaches to Historical Language Change LChange1, and several evaluation campaigns (Schlechtweg et al., 2020a; Basile et al., 2020b; Kutuzov et al.; Zamora-Reina et al., 2022).

pdf bib
Item Response Theory for Natural Language Processing
John P. Lalor | Pedro Rodriguez | João Sedoc | Jose Hernandez-Orallo

This tutorial will introduce the NLP community to Item Response Theory (IRT; Baker 2001). IRT is a method from the field of psychometrics for model and dataset assessment. IRT has been used for decades to build test sets for human subjects and estimate latent characteristics of dataset examples. Recently, there has been an uptick in work applying IRT to tasks in NLP. It is our goal to introduce the wider NLP community to IRT and show its benefits for a number of NLP tasks. From this tutorial, we hope to encourage wider adoption of IRT among NLP researchers.

pdf bib
Language + Molecules
Carl Edwards | Qingyun Wang | Heng Ji

Climate change, access to food and water, pandemics–the world faces an enormous number of problems in the coming decades on scales of complexity never-before-seen. To address these issues, development of scientific solutions which are scalable, flexible, and inexpensive are critical. Over the last couple years, considerable interest has arisen for applying natural language-driven solutions to these problems. Particularly, the chemistry field is posed to be substantially accelerated by language+molecule models. This tutorial is designed to provide an introduction to this area of research. It requires no knowledge outside mainstream NLP, and it will enable participants to begin exploring relevant research. By discussing cutting-edge work, we will highlight the key roles language can fill for 1) abstract, compositional control of generative models, 2) bridging different biochemical modalities, 3) planning experimental procedures, and 4) broadening access to computational approaches. Beyond this, language models have also seen considerable success when applied to proteins or molecule structures, which can be considered as ‘exotic’ languages, and computational linguistics researchers’ expertise can be highly valuable for these impactful, possibly life-saving tasks.

pdf bib
Transformer-specific Interpretability
Hosein Mohebbi | Jaap Jumelet | Michael Hanna | Afra Alishahi | Willem Zuidema

Transformers have emerged as dominant play- ers in various scientific fields, especially NLP. However, their inner workings, like many other neural networks, remain opaque. In spite of the widespread use of model-agnostic interpretability techniques, including gradient-based and occlusion-based, their shortcomings are becoming increasingly apparent for Transformer interpretation, making the field of interpretability more demanding today. In this tutorial, we will present Transformer-specific interpretability methods, a new trending approach, that make use of specific features of the Transformer architecture and are deemed more promising for understanding Transformer-based models. We start by discussing the potential pitfalls and misleading results model-agnostic approaches may produce when interpreting Transformers. Next, we discuss Transformer-specific methods, including those designed to quantify context- mixing interactions among all input pairs (as the fundamental property of the Transformer architecture) and those that combine causal methods with low-level Transformer analysis to identify particular subnetworks within a model that are responsible for specific tasks. By the end of the tutorial, we hope participants will understand the advantages (as well as current limitations) of Transformer-specific interpretability methods, along with how these can be applied to their own research.

pdf bib
LLMs for Low Resource Languages in Multilingual, Multimodal and Dialectal Settings
Firoj Alam | Shammur Absar Chowdhury | Sabri Boughorbel | Maram Hasanain

The recent breakthroughs in Artificial Intelligence (AI) can be attributed to the remarkable performance of Large Language Models (LLMs) across a spectrum of research areas (e.g., machine translation, question-answering, automatic speech recognition, text-to-speech generation) and application domains (e.g., business, law, healthcare, education, and psychology). The success of these LLMs largely de- pends on specific training techniques, most notably instruction tuning, RLHF, and subsequent prompting to achieve the desired output. As the development of such LLMs continues to increase in both closed and open settings, evaluation has become crucial for understanding their generalization capabilities across different tasks, modalities, languages, and dialects. This evaluation process is tightly coupled with prompting, which plays a key role in obtain- ing better outputs. There has been attempts to evaluate such models focusing on diverse tasks, languages, and dialects, which suggests that the capabilities of LLMs are still limited to medium-to-low-resource languages due to the lack of representative datasets. The tutorial offers an overview of this emerging research area. We explore the capabilities of LLMs in terms of their performance, zero- and few-shot settings, fine-tuning, instructions tuning, and close vs. open models with a special emphasis on low-resource settings. In addition to LLMs for standard NLP tasks, we will focus on speech and multimodality.


pdf (full)
bib (full)
Findings of the Association for Computational Linguistics: EACL 2024

pdf bib
Findings of the Association for Computational Linguistics: EACL 2024
Yvette Graham | Matthew Purver

pdf bib
Chem-FINESE: Validating Fine-Grained Few-shot Entity Extraction through Text Reconstruction
Qingyun Wang | Zixuan Zhang | Hongxiang Li | Xuan Liu | Jiawei Han | Huimin Zhao | Heng Ji

Fine-grained few-shot entity extraction in the chemical domain faces two unique challenges. First, compared with entity extraction tasks in the general domain, sentences from chemical papers usually contain more entities. Moreover, entity extraction models usually have difficulty extracting entities of long-tailed types. In this paper, we propose Chem-FINESE, a novel sequence-to-sequence (seq2seq) based few-shot entity extraction approach, to address these two challenges. Our Chem-FINESE has two components: a seq2seq entity extractor to extract named entities from the input sentence and a seq2seq self-validation module to reconstruct the original input sentence from extracted entities. Inspired by the fact that a good entity extraction system needs to extract entities faithfully, our new self-validation module leverages entity extraction results to reconstruct the original input sentence. Besides, we design a new contrastive loss to reduce excessive copying during the extraction process. Finally, we release ChemNER+, a new fine-grained chemical entity extraction dataset that is annotated by domain experts with the ChemNER schema. Experiments in few-shot settings with both ChemNER+ and CHEMET datasets show that our newly proposed framework has contributed up to 8.26% and 6.84% absolute F1-score gains respectively.

pdf bib
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi | Eunju Lee | Kyohoon Jin | YoungBin Kim

Data annotation is an essential step for constructing new datasets. However, the conventional approach of data annotation through crowdsourcing is both time-consuming and expensive. In addition, the complexity of this process increases when dealing with low-resource languages owing to the difference in the language pool of crowdworkers. To address these issues, this study proposes an autonomous annotation method by utilizing large language models, which have been recently demonstrated to exhibit remarkable performance. Through our experiments, we demonstrate that the proposed method is not just cost-efficient but also applicable for low-resource language annotation. Additionally, we constructed an image captioning dataset using our approach and are committed to open this dataset for future study. We have opened our source code for further study and reproducibility.

pdf bib
Next Visit Diagnosis Prediction via Medical Code-Centric Multimodal Contrastive EHR Modelling with Hierarchical Regularisation
Heejoon Koo

Predicting next visit diagnosis using Electronic Health Records (EHR) is an essential task in healthcare, critical for devising proactive future plans for both healthcare providers and patients. Nonetheless, many preceding studies have not sufficiently addressed the heterogeneous and hierarchical characteristics inherent in EHR data, inevitably leading to sub-optimal performance. To this end, we propose NECHO, a novel medical code-centric multimodal contrastive EHR learning framework with hierarchical regularisation. First, we integrate multifaceted information encompassing medical codes, demographics, and clinical notes using a tailored network design and a pair of bimodal contrastive losses, all of which pivot around a medical codes representation. We also regularise modality-specific encoders using a parental level information in medical ontology to learn hierarchical structure of EHR data. A series of experiments on MIMIC-III data demonstrates effectiveness of our approach.

pdf bib
FlexiQA: Leveraging LLM’s Evaluation Capabilities for Flexible Knowledge Selection in Open-domain Question Answering
Yuhan Chen | Shuqi Li | Rui Yan

Nowadays, large language models (LLMs) have demonstrated their ability to be a powerful knowledge generator of generate-then-read paradigm for open-domain question answering (ODQA). However this new paradigm mainly suffers from the “hallucination” and struggles to handle time-sensitive issue because of its expensive knowledge update costs. On the other hand, retrieve-then-read, as a traditional paradigm, is more limited by the relevance of acquired knowledge to the given question. In order to combine the strengths of both paradigms, and overcome their respective shortcomings, we design a new pipeline called “FlexiQA”, in which we utilize the diverse evaluation capabilities of LLMs to select knowledge effectively and flexibly. First, given a question, we prompt a LLM as a discriminator to identify whether it is time-sensitive. For time-sensitive questions, we follow the retrieve-then-read paradigm to obtain the answer. For the non time-sensitive questions, we further prompt the LLM as an evaluator to select a better document from two perspectives: factuality and relevance. Based on the selected document, we leverage a reader to get the final answer. We conduct extensive experiments on three widely-used ODQA benchmarks, the experimental results fully confirm the effectiveness of our approach.

pdf bib
Hyper-BTS Dataset: Scalability and Enhanced Analysis of Back TranScription (BTS) for ASR Post-Processing
Chanjun Park | Jaehyung Seo | Seolhwa Lee | Junyoung Son | Hyeonseok Moon | Sugyeong Eo | Chanhee Lee | Heuiseok Lim

The recent advancements in the realm of Automatic Speech Recognition (ASR) post-processing have been primarily driven by sequence-to-sequence paradigms. Despite their effectiveness, these methods often demand substantial amounts of data, necessitating the expensive recruitment of phonetic transcription experts to rectify the erroneous outputs of ASR systems, thereby creating the desired training data. Back TranScription (BTS) alleviates this issue by generating ASR inputs from clean text via a Text-to-Speech (TTS) system. While initial studies on BTS exhibited promise, they were constrained by a limited dataset of just 200,000 sentence pairs, leaving the scalability of this method in question. In this study, we delve into the potential scalability of BTS. We introduce the “Hyper-BTS” dataset, a corpus approximately five times larger than that utilized in prior research. Additionally, we present innovative criteria for categorizing error types within ASR post-processing. This not only facilitates a more comprehensive qualitative analysis, which was absent in preceding studies, but also enhances the understanding of ASR error patterns. Our empirical results, both quantitative and qualitative, suggest that the enlarged scale of the Hyper-BTS dataset sufficiently addresses a vast majority of the ASR error categories. We make the Hyper-BTS dataset publicly available.

pdf bib
ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah | Saiteja Kosgi | Vishal Tambrahalli | Neha S | Anil Nelakanti | Vineet Gandhi

We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in low resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker’s voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models using only a fraction of paired data as latter. Speech samples from ParrotTTS and code can be found at

pdf bib
NavHint: Vision and Language Navigation Agent with a Hint Generator
Yue Zhang | Quan Guo | Parisa Kordjamshidi

The existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment.In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions.The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent’s attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment.We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the agent’s interpretability.

pdf bib
Text or Image? What is More Important in Cross-Domain Generalization Capabilities of Hate Meme Detection Models?
Piush Aggarwal | Jawar Mehrabanian | Weigang Huang | Özge Alacam | Torsten Zesch

This paper delves into the formidable challenge of cross-domain generalization in multimodal hate meme detection, presenting compelling findings. We provide evidence supporting the hypothesis that only the textual component of hateful memes enables the multimodal classifier to generalize across different domains, while the image component proves highly sensitive to a specific training dataset. The evidence includes demonstrations showing that hate-text classifiers perform similarly to hate-meme classifiers in a zero-shot setting. Simultaneously, the introduction of captions generated from images of memes to the hate-meme classifier worsens performance by an average F1 of 0.02. Through blackbox explanations, we identify a substantial contribution of the text modality (average of 83%), which diminishes with the introduction of meme’s image captions (52%). Additionally, our evaluation on a newly created confounder dataset reveals higher performance on text confounders as compared to image confounders with average ∆F1 of 0.18.

pdf bib
Where are we Still Split on Tokenization?
Rob Goot

Many Natural Language Processing (NLP) tasks are labeled on the token level, forthese tasks, the first step is to identify the tokens (tokenization). Becausethis step is often considered to be a solved problem, gold tokenization iscommonly assumed. In this paper, we propose an efficient method fortokenization with subword-based language models, and reflect on the status ofperformance on the tokenization task by evaluating on 122 languages in 20different scripts. We show that our proposed model performs on par with thestate-of-the-art, and that tokenization performance is mainly dependent on theamount and consistency of annotated data. We conclude that besidesinconsistencies in the data and exceptional cases the task can be consideredsolved for Latin languages for in-dataset settings (>99.5 F1). However,performance is 0.75 F1 point lower on average for datasets in other scripts andperformance deteriorates in cross-dataset setups.

pdf bib
A Methodology for Generative Spelling Correction via Natural Spelling Errors Emulation across Multiple Domains and Languages
Nikita Martynov | Mark Baushenko | Anastasia Kozlova | Katerina Kolomeytseva | Aleksandr Abramov | Alena Fenogenova

Large language models excel in text generation and generalization, however they face challenges in text editing tasks, especially in correcting spelling errors and mistyping.In this paper, we present a methodology for generative spelling correction (SC), tested on English and Russian languages and potentially can be extended to any language with minor changes. Our research mainly focuses on exploring natural spelling errors and mistyping in texts and studying how those errors can be emulated in correct sentences to enrich generative models’ pre-train procedure effectively. We investigate the effects of emulations in various text domains and examine two spelling corruption techniques: 1) first one mimics human behavior when making a mistake through leveraging statistics of errors from a particular dataset, and 2) second adds the most common spelling errors, keyboard miss clicks, and some heuristics within the texts.We conducted experiments employing various corruption strategies, models’ architectures, and sizes in the pre-training and fine-tuning stages and evaluated the models using single-domain and multi-domain test sets. As a practical outcome of our work, we introduce SAGE (Spell checking via Augmentation and Generative distribution Emulation).

pdf bib
How Does In-Context Learning Help Prompt Tuning?
Simeng Sun | Yang Liu | Dan Iter | Chenguang Zhu | Mohit Iyyer

Fine-tuning large language models is becoming ever more impractical due to their rapidly-growing scale. This motivates the use of parameter-efficient adaptation methods such as prompt tuning (PT), which adds a small number of tunable embeddings to an otherwise frozen model, and in-context learning (ICL), in which demonstrations of the task are provided to the model in natural language without any additional training. Recently, (CITATION) propose “instruction prompt tuning” (IPT), which combines PT with ICL by concatenating a natural language demonstration with learned prompt embeddings. While all of these methods have proven effective on different tasks, how they interact with each other remains unexplored. In this paper, we empirically study when and how in-context examples improve prompt tuning by measuring the effectiveness of ICL, PT, and IPT on five text generation tasks with multiple base language models. We observe that (1) IPT does not always outperform PT, and in fact requires the in-context demonstration to be semantically similar to the test input to yield improvements; (2) PT is unstable and exhibits high variance, but combining PT and ICL (into IPT) consistently reduces variance across all five tasks; and(3) prompts learned for a specific source task via PT exhibit positive transfer when paired with in-context examples of a different target task. Our results offer actionable insights on choosing a suitable parameter-efficient adaptation method for a given task.

pdf bib
Large Language Models for Psycholinguistic Plausibility Pretesting
Samuel Amouyal | Aya Meltzer-Asscher | Jonathan Berant

In psycholinguistics, the creation of controlled materials is crucial to ensure that research outcomes are solely attributed to the intended manipulations and not influenced by extraneous factors. To achieve this, psycholinguists typically pretest linguistic materials, where a common pretest is to solicit plausibility judgments from human evaluators on specific sentences. In this work, we investigate whether Language Models (LMs) can be used to generate these plausibility judgements. We investigate a wide range of LMs across multiple linguistic structures and evaluate whether their plausibility judgements correlate with human judgements. We find that GPT-4 plausibility judgements highly correlate with human judgements across the structures we examine, whereas other LMs correlate well with humans on commonly used syntactic structures. We then test whether this correlation implies that LMs can be used instead of humans for pretesting. We find that when coarse-grained plausibility judgements are needed, this works well, but when fine-grained judgements are necessary, even GPT-4 does not provide satisfactory discriminative power.

pdf bib
Modeling Aspect Sentiment Coherency via Local Sentiment Aggregation
Heng Yang | Ke Li

Aspect sentiment coherency is an intriguing yet underexplored topic in the field of aspect-based sentiment classification. This concept reflects the common pattern where adjacent aspects often share similar sentiments. Despite its prevalence, current studies have not fully recognized the potential of modeling aspect sentiment coherency, including its implications in adversarial defense. To model aspect sentiment coherency, we propose a novel local sentiment aggregation (LSA) paradigm based on constructing a differential-weighted sentiment aggregation window. We have rigorously evaluated our model through experiments, and the results affirm the proficiency of LSA in terms of aspect coherency prediction and aspect sentiment classification. For instance, it outperforms existing models and achieves state-of-the-art sentiment classification performance across five public datasets. Furthermore, we demonstrate the promising ability of LSA in ABSC adversarial defense, thanks to its sentiment coherency modeling. To encourage further exploration and application of this concept, we have made our code publicly accessible. This will provide researchers with a valuable tool to delve into sentiment coherency modeling in future research.

pdf bib
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi | Aishwarya Agrawal

Recently, reference-free metrics such as CLIPScore (Hessel et al., 2021), UMIC (Lee et al., 2021), and PAC-S (Sarto et al., 2023) have been proposed for automatic reference-free evaluation of image captions. Our focus lies in evaluating the robustness of these metrics in scenarios that require distinguishing between two captions with high lexical overlap but very different meanings. Our findings reveal that despite their high correlation with human judgments, CLIPScore, UMIC, and PAC-S struggle to identify fine-grained errors. While all metrics exhibit strong sensitivity to visual grounding errors, their sensitivity to caption implausibility errors is limited. Furthermore, we found that all metrics are sensitive to variations in the size of image-relevant objects mentioned in the caption, while CLIPScore and PAC-S are also sensitive to the number of mentions of image-relevant objects in the caption. Regarding linguistic aspects of a caption, all metrics show weak comprehension of negation, and CLIPScore and PAC-S are insensitive to the structure of the caption to a great extent. We hope our findings will guide further improvements in reference-free evaluation of image captioning.

pdf bib
Barriers to Effective Evaluation of Simultaneous Interpretation
Shira Wein | Te I | Colin Cherry | Juraj Juraska | Dirk Padfield | Wolfgang Macherey

Simultaneous interpretation is an especially challenging form of translation because it requires converting speech from one language to another in real-time. Though prior work has relied on out-of-the-box machine translation metrics to evaluate interpretation data, we hypothesize that strategies common in high-quality human interpretations, such as summarization, may not be handled well by standard machine translation metrics. In this work, we examine both qualitatively and quantitatively four potential barriers to evaluation of interpretation: disfluency, summarization, paraphrasing, and segmentation. Our experiments reveal that, while some machine translation metrics correlate fairly well with human judgments of interpretation quality, much work is still needed to account for strategies of interpretation during evaluation. As a first step to address this, we develop a fine-tuned model for interpretation evaluation, and achieve better correlation with human judgments than the state-of-the-art machine translation metrics.

pdf bib
Inconsistent dialogue responses and how to recover from them
Mian Zhang | Lifeng Jin | Linfeng Song | Haitao Mi | Dong Yu

One critical issue for chat systems is to stay consistent about preferences, opinions, beliefs and facts of itself, which has been shown a difficult problem. In this work, we study methods to assess and bolster utterance consistency of chat systems. A dataset is first developed for studying the inconsistencies, where inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances are authored by annotators. This covers the life span of inconsistencies, namely introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, specifically focused on its detection and resolution. Our experimental findings indicate that our dataset significantly helps the progress in identifying and resolving conversational inconsistencies, and current popular large language models like ChatGPT which are good at resolving inconsistencies however still struggle with detection.

pdf bib
MUG: Interactive Multimodal Grounding on User Interfaces
Tao Li | Gang Li | Jingjie Zheng | Purple Wang | Yang Li

We present MUG, a novel interactive task for multimodal grounding where a user and an agent work collaboratively on an interface screen. Prior works modeled multimodal UI grounding in one round: the user gives a command and the agent responds to the command. Yet, in a realistic scenario, a user command can be ambiguous when the target action is inherently difficult to articulate in natural language. MUG allows multiple rounds of interactions such that upon seeing the agent responses, the user can give further commands for the agent to refine or even correct its actions. Such interaction is critical for improving grounding performances in real-world use cases. To investigate the problem, we create a new dataset that consists of 77,820 sequences of human user-agent interaction on mobile interfaces in which 20% involves multiple rounds of interactions. To establish benchmark, we experiment with a range of modeling variants and evaluation strategies, including both offline and online evaluation—the online strategy consists of both human evaluation and automatic with simulators. Our experiments show that iterative interaction significantly improves the absolute task completion by 18% over the entire test set and 31% over the challenging split. Our results lay the foundation for further investigation of the problem.

pdf bib
PRILoRA: Pruned and Rank-Increasing Low-Rank Adaptation
Nadav Benedek | Lior Wolf

With the proliferation of large pre-trained language models (PLMs), fine-tuning all model parameters becomes increasingly inefficient, particularly when dealing with numerous downstream tasks that entail substantial training and storage costs. Several approaches aimed at achieving parameter-efficient fine-tuning (PEFT) have been proposed. Among them, Low-Rank Adaptation (LoRA) stands out as an archetypal method, incorporating trainable rank decomposition matrices into each target module. Nevertheless, LoRA does not consider the varying importance of each layer. To address these challenges, we introduce PRILoRA, which linearly allocates a different rank for each layer, in an increasing manner, and performs pruning throughout the training process, considering both the temporary magnitude of weights and the accumulated statistics of the input to any given layer. We validate the effectiveness of PRILoRA through extensive experiments on eight GLUE benchmarks, setting a new state of the art.

pdf bib
Revamping Multilingual Agreement Bidirectionally via Switched Back-translation for Multilingual Neural Machine Translation
Hongyuan Lu | Haoyang Huang | Dongdong Zhang | Furu Wei | Wai Lam

Despite the fact that multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT), current methodologies in the field have two shortages: (i) require parallel data between multiple language pairs, which is not always realistic and (ii) optimize the agreement in an ambiguous direction, which hampers the translation performance. We present Bidirectional Multilingual Agreement via Switched Back-translation (BMA-SBT), a novel and universal multilingual agreement framework for fine-tuning pre-trained MNMT models, which (i) exempts the need for aforementioned parallel data by using a novel method called switched BT that creates synthetic text written in another source language using the translation target and (ii) optimizes the agreement bidirectionally with the Kullback-Leibler Divergence loss. Experiments indicate that BMA-SBT clearly improves the strong baselines on the task of MNMT with three benchmarks: TED Talks, News, and Europarl. In-depth analyzes indicate that BMA-SBT brings additive improvements to the conventional BT method.

pdf bib
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin | Chengzhi Hu | Zheyu Zhang | Andre Martins | Hinrich Schuetze

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

pdf bib
OYXOY: A Modern NLP Test Suite for Modern Greek
Konstantinos Kogkalidis | Stergios Chatzikyriakidis | Eirini Giannikouri | Vasiliki Katsouli | Christina Klironomou | Christina Koula | Dimitris Papadakis | Thelka Pasparaki | Erofili Psaltaki | Efthymia Sakellariou | Charikleia Soupiona

This paper serves as a foundational step towards the development of a linguistically motivated and technically relevant evaluation suite for Greek NLP. We initiate this endeavor by introducing four expert-verified evaluation tasks, specifically targeted at natural language inference, word sense disambiguation (through example comparison or sense selection) and metaphor detection. More than language-adapted replicas of existing tasks, we contribute two innovations which will resonate with the broader resource and evaluation community. Firstly, our inference dataset is the first of its kind, marking not just one, but rather all possible inference labels, accounting for possible shifts due to e.g. ambiguity or polysemy. Secondly, we demonstrate a cost-efficient method to obtain datasets for under-resourced languages. Using ChatGPT as a language-neutral parser, we transform the Dictionary of Standard Modern Greek into a structured format, from which we derive the other three tasks through simple projections. Alongside each task, we conduct experiments using currently available state of the art machinery. Our experimental baselines affirm the challenging nature of our tasks and highlight the need for expedited progress in order for the Greek NLP ecosystem to keep pace with contemporary mainstream research.

pdf bib
A Comprehensive Evaluation of Inductive Reasoning Capabilities and Problem Solving in Large Language Models
Chen Bowen | Rune Sætre | Yusuke Miyao

Inductive reasoning is fundamental to both human and artificial intelligence. The inductive reasoning abilities of current Large Language Models (LLMs) are evaluated in this research.We argue that only considering induction of rules is too narrow and unrealistic, since inductive reasoning is usually mixed with other abilities, like rules application, results/rules validation, and updated information integration.We probed the LLMs with a set of designed symbolic tasks and found that even state-of-the-art (SotA) LLMs fail significantly, showing the inability of LLMs to perform these intuitively simple tasks.Furthermore, we found that perfect accuracy in a small-size problem does not guarantee the same accuracy in a larger-size version of the same problem, provoking the question of how we can assess the LLMs’ actual problem-solving capabilities.We also argue that Chain-of-Thought prompts help the LLMs by decomposing the problem-solving process, but the LLMs still learn limitedly.Furthermore, we reveal that few-shot examples assist LLM generalization in out-of-domain (OOD) cases, albeit limited. The LLM starts to fail when the problem deviates from the provided few-shot examples.

pdf bib
Towards efficient self-supervised representation learning in speech processing
Luis Lugo | Valentin Vielzeuf

Self-supervised learning has achieved impressive results in speech processing, but current models are computationally expensive, generating environmental concerns because of their high energy consumption. Therefore, we propose an efficient self-supervised approach to address high computational costs, using a single GPU during 24 to 48 hours of pretraining. The proposed approach combines linear, convolutional, and self-attention layers with several optimizations, including dynamic batching, flash attention, mixed-precision training, gradient accumulation, and acoustic feature extraction with input preprocessing. Computational cost estimations for our proposed model represent up to two orders of magnitude improvements in computational efficiency against existing speech models.

pdf bib
Improving Cross-Domain Low-Resource Text Generation through LLM Post-Editing: A Programmer-Interpreter Approach
Zhuang Li | Levon Haroutunian | Raj Tumuluri | Philip Cohen | Reza Haf

Post-editing has proven effective in improving the quality of text generated by large language models (LLMs) such as GPT-3.5 or GPT-4, particularly when direct updating of their parameters to enhance text quality is infeasible or expensive. However, relying solely on smaller language models for post-editing can limit the LLMs’ ability to generalize across domains. Moreover, the editing strategies in these methods are not optimally designed for text generation tasks. To address these limitations, we propose a neural programmer-interpreter approach that preserves the domain generalization ability of LLMs while editing their output. The editing actions in this framework are specifically devised for text generation. Extensive experiments demonstrate that the programmer-interpreter significantly enhances GPT-3.5’s performance in logical form-to-text conversion and low-resource machine translation, surpassing other state-of-the-art (SOTA) LLM post-editing methods in cross-domain settings.

pdf bib
Noise Contrastive Estimation-based Matching Framework for Low-Resource Security Attack Pattern Recognition
Tu Nguyen | Nedim Šrndić | Alexander Neth

Techniques, Tactics and Procedures (TTP) mapping is an important and difficult task in the application of cyber threat intelligence (CTI) extraction for threat reports. TTPs are typically expressed in semantic forms within security knowledge bases like MITRE ATT&CK, serving as textual high-level descriptions for sophisticated attack patterns. Conversely, attacks in CTI threat reports are detailed in a combination of natural and technical language forms, presenting a significant challenge even for security experts to establish correlations or mappings with the corresponding TTPs.Conventional learning approaches often target the TTP mapping problem in the classical multiclass/label classification setting. This setting hinders the learning capabilities of the model, due to the large number of classes (i.e., TTPs), the inevitable skewness of the label distribution and the complex hierarchical structure of the label space. In this work, we approach the problem in a different learning paradigm, such that the assignment of a text to a TTP label is essentially decided by the direct semantic similarity between the two, thus, reducing the complexity of competing solely over the large labeling space. In order that, we propose a neural matching architecture that incorporates a sampling based learn-to-compare mechanism, facilitating the learning process of the matching model despite constrained resources.

pdf bib
Large Language Models for Scientific Information Extraction: An Empirical Study for Virology
Mahsa Shamsabadi | Jennifer D’Souza | Sören Auer

In this paper, we champion the use of structured and semantic content representation of discourse-based scholarly communication, inspired by tools like Wikipedia infoboxes or structured Amazon product descriptions. These representations provide users with a concise overview, aiding scientists in navigating the dense academic landscape. Our novel automated approach leverages the robust text generation capabilities of LLMs to produce structured scholarly contribution summaries, offering both a practical solution and insights into LLMs’ emergent abilities.For LLMs, the prime focus is on improving their general intelligence as conversational agents. We argue that these models can also be applied effectively in information extraction (IE), specifically in complex IE tasks within terse domains like Science. This paradigm shift replaces the traditional modular, pipelined machine learning approach with a simpler objective expressed through instructions. Our results show that finetuned FLAN-T5 with 1000x fewer parameters than the state-of-the-art GPT-davinci is competitive for the task.

pdf bib
Re3val: Reinforced and Reranked Generative Retrieval
EuiYul Song | Sangryul Kim | Haeju Lee | Joonkee Kim | James Thorne

Generative retrieval models encode pointers to information in a corpus as an index within the model’s parameters. These models serve as part of a larger pipeline, where retrieved information conditions generation for knowledge-intensive NLP tasks. However, we identify two limitations: the generative retrieval does not account for contextual information. Secondly, the retrieval can’t be tuned for the downstream readers as decoding the page title is a non-differentiable operation. This paper introduces Re3val, trained with generative reranking and reinforcement learning using limited data. Re3val leverages context acquired via Dense Passage Retrieval to rerank the retrieved page titles and utilizes REINFORCE to maximize rewards generated by constrained decoding. Additionally, we generate questions from our pre-training dataset to mitigate epistemic uncertainty and bridge the domain gap between the pre-training and fine-tuning datasets. Subsequently, we extract and rerank contexts from the KILT database using the rerank page titles. Upon grounding the top five reranked contexts, Re3val demonstrates the Top 1 KILT scores compared to all other generative retrieval models across five KILT datasets.

pdf bib
Entity Linking in the Job Market Domain
Mike Zhang | Rob Goot | Barbara Plank

In Natural Language Processing, entity linking (EL) has centered around Wikipedia, but yet remains underexplored for the job market domain. Disambiguating skill mentions can help us get insight into the current labor market demands. In this work, we are the first to explore EL in this domain, specifically targeting the linkage of occupational skills to the ESCO taxonomy (le Vrang et al., 2014). Previous efforts linked coarse-grained (full) sentences to a corresponding ESCO skill. In this work, we link more fine-grained span-level mentions of skills. We tune two high-performing neural EL models, a bi-encoder (Wu et al., 2020) and an autoregressive model (Cao et al., 2021), on a synthetically generated mention–skill pair dataset and evaluate them on a human-annotated skill-linking benchmark. Our findings reveal that both models are capable of linking implicit mentions of skills to their correct taxonomy counterparts. Empirically, BLINK outperforms GENRE in strict evaluation, but GENRE performs better in loose evaluation (accuracy@k).

pdf bib
(Chat)GPT v BERT Dawn of Justice for Semantic Change Detection
Francesco Periti | Haim Dubossarsky | Nina Tahmasebi

In the universe of Natural Language Processing, Transformer-based language models like BERT and (Chat)GPT have emerged as lexical superheroes with great power to solve open research problems. In this paper, we specifically focus on the temporal problem of semantic change, and evaluate their ability to solve two diachronic extensions of the Word-in-Context (WiC) task: TempoWiC and HistoWiC. In particular, we investigate the potential of a novel, off-the-shelf technology like ChatGPT (and GPT) 3.5 compared to BERT, which represents a family of models that currently stand as the state-of-the-art for modeling semantic change. Our experiments represent the first attempt to assess the use of (Chat)GPT for studying semantic change. Our results indicate that ChatGPT performs significantly worse than the foundational GPT version. Furthermore, our results demonstrate that (Chat)GPT achieves slightly lower performance than BERT in detecting long-term changes but performs significantly worse in detecting short-term changes.

pdf bib
Towards Unified Uni- and Multi-modal News Headline Generation
Mateusz Krubiński | Pavel Pecina

Thanks to the recent progress in vision-language modeling and the evolving nature of news consumption, the tasks of automatic summarization and headline generation based on multimodal news articles have been gaining popularity. One of the limitations of the current approaches is caused by the commonly used sophisticated modular architectures built upon hierarchical cross-modal encoders and modality-specific decoders, which restrict the model’s applicability to specific data modalities – once trained on, e.g., text+video pairs there is no straightforward way to apply the model to text+image or text-only data. In this work, we propose a unified task formulation that utilizes a simple encoder-decoder model to generate headlines from uni- and multi-modal news articles. This model is trained jointly on data of several modalities and extends the textual decoder to handle the multimodal output.

pdf bib
On the Relationship between Sentence Analogy Identification and Sentence Structure Encoding in Large Language Models
Thilini Wijesiriwardene | Ruwan Wickramarachchi | Aishwarya Naresh Reganti | Vinija Jain | Aman Chadha | Amit Sheth | Amitava Das

The ability of Large Language Models (LLMs) to encode syntactic and semantic structures of language is well examined in NLP. Additionally, analogy identification, in the form of word analogies are extensively studied in the last decade of language modeling literature. In this work we specifically look at how LLMs’ abilities to capture sentence analogies (sentences that convey analogous meaning to each other) vary with LLMs’ abilities to encode syntactic and semantic structures of sentences. Through our analysis, we find that LLMs’ ability to identify sentence analogies is positively correlated with their ability to encode syntactic and semantic structures of sentences. Specifically, we find that the LLMs which capture syntactic structures better, also have higher abilities in identifying sentence analogies.

pdf bib
Contextualization Distillation from Large Language Model for Knowledge Graph Completion
Dawei Li | Zhen Tan | Tianlong Chen | Huan Liu

While textual information significantly enhances the performance of pre-trained language models (PLMs) in knowledge graph completion (KGC), the static and noisy nature of existing corpora collected from Wikipedia articles or synsets definitions often limits the potential of PLM-based KGC models. To surmount these challenges, we introduce the Contextualization Distillation strategy, a versatile plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models (LLMs) to transform compact, structural triplets into context-rich segments. Subsequently, we introduce two tailored auxiliary tasks—reconstruction and contextualization—allowing smaller KGC models to assimilate insights from these enriched triplets. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach, revealing consistent performance enhancements irrespective of underlying pipelines or architectures. Moreover, our analysis makes our method more explainable and provides insight into how to generate high-quality corpora for KGC, as well as the selection of suitable distillation tasks.

pdf bib
Differentially Private Natural Language Models: Recent Advances and Future Directions
Lijie Hu | Ivan Habernal | Lei Shen | Di Wang

Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protecting the privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, NLP in DP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances in DP deep learning models in NLP. In particular, we first discuss some differences and additional challenges of DP-NLP compared with the standard DP deep learning. Then, we investigate some existing work on DP-NLP andpresent its recent developments from three aspects: gradient perturbation based methods, embedding vector perturbation based methods, and ensemble model based methods. We also discuss some challenges and future directions.

pdf bib
Learning to Compare Financial Reports for Financial Forecasting
Ross Koval | Nicholas Andrews | Xifeng Yan

Public companies in the US are required to publish annual reports that detail their recent financial performance, present the current state of ongoing business operations, and discuss future prospects. However, they typically contain over 25,000 words across all sections, large amounts of industry and legal jargon, and a high percentage of boilerplate content that does not change much year-to-year. These unique characteristics present challenges for many generic pretrained language models because it is likely that only a small percentage of the long report that reflects salient information contains meaningful signal about the future prospects of the company. In this work, we curate a large-scale dataset of paired financial reports and introduce two novel, challenging tasks of predicting long-horizon company risk and correlation that evaluate the ability of the model to recognize cross-document relationships with complex, nuanced signals. We explore and present a comprehensive set of methods and experiments, and establish strong baselines designed to learn to identify subtle similarities and differences between long documents. Furthermore, we demonstrate that it is possible to predict company risk and correlation solely from the text of their financial reports and further that modeling the cross-document interactions at a fine-grained level provides significant benefit. Finally, we probe the best performing model through quantitative and qualitative interpretability methods to reveal some insight into the underlying task signal.

pdf bib
Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation
Shohei Higashiyama | Hiroki Ouchi | Hiroki Teranishi | Hiroyuki Otomo | Yusuke Ide | Aitaro Yamamoto | Hiroyuki Shindo | Yuki Matsuda | Shoko Wakamiya | Naoya Inoue | Ikuya Yamada | Taro Watanabe

Geoparsing is a fundamental technique for analyzing geo-entity information in text, which is useful for geographic applications, e.g., tourist spot recommendation. We focus on document-level geoparsing that considers geographic relatedness among geo-entity mentions and present a Japanese travelogue dataset designed for training and evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

pdf bib
Knowledge Generation for Zero-shot Knowledge-based VQA
Rui Cao | Jing Jiang

Previous solutions to knowledge-based visual question answering (K-VQA) retrieve knowledge from external knowledge bases and use supervised learning to train the K-VQA model.Recently pre-trained LLMs have been used as both a knowledge source and a zero-shot QA model for K-VQA and demonstrated promising results.However, these recent methods do not explicitly show the knowledge needed to answer the questions and thus lack interpretability.Inspired by recent work on knowledge generation from LLMs for text-based QA, in this work we propose and test a similar knowledge-generation-based K-VQA method, which first generates knowledge from an LLM and then incorporates the generated knowledge for K-VQA in a zero-shot manner. We evaluate our method on two K-VQA benchmarks and found that our method performs better than previous zero-shot K-VQA methods and our generated knowledge is generally relevant and helpful.

pdf bib
Simple Temperature Cool-down in Contrastive Framework for Unsupervised Sentence Representation Learning
Yoo Hyun Jeong | Myeong Soo Han | Dong-Kyu Chae

In this paper, we proposes a simple, tricky method to improve sentence representation of unsupervised contrastive learning. Even though contrastive learning has achieved great performances in both visual representation learning (VRL) and sentence representation learning (SRL) fields, we focus on the fact that there is a gap between characteristics and training dynamics of VRL and SRL. We first examine the role of temperature to bridge the gap between VRL and SRL, and find some temperature-dependent elements in SRL; i.e., a higher temperature causes overfitting of the uniformity while improving the alignment in earlier phase of training. Then, we design a temperature cool-down technique based on this observation, which helps PLMs to be more suitable for contrastive learning via preparation of uniform representation space. Our experimental results on widely-utilized benchmarks demonstrate the effectiveness and extensiblity of our method.

pdf bib
Bootstrap Your Own PLM: Boosting Semantic Features of PLMs for Unsuperivsed Contrastive Learning
Yoo Hyun Jeong | Myeong Soo Han | Dong-Kyu Chae

This paper aims to investigate the possibility of exploiting original semantic features of PLMs (pre-trained language models) during contrastive learning in the context of SRL (sentence representation learning). In the context of feature modification, we identified a method called IFM (implicit feature modification), which reduces the tendency of contrastive models for VRL (visual representation learning) to rely on feature-suppressing shortcut solutions. We observed that IFM did not work well for SRL, which may be due to differences between the nature of VRL and SRL. We propose BYOP, which boosts well-represented features, taking the opposite idea of IFM, under the assumption that SimCSE’s dropout-noise-based augmentation may be too simple to modify high-level semantic features, and that the features learned by PLMs are semantically meaningful and should be boosted, rather than removed. Extensive experiments lend credence to the logic of BYOP, which considers the nature of SRL.

pdf bib
Personalized Abstractive Summarization by Tri-agent Generation Pipeline
Wen Xiao | Yujia Xie | Giuseppe Carenini | Pengcheng He

Tailoring outputs from large language models, like ChatGPT, to implicit user preferences remains a challenge despite their impressive generative capabilities. In this paper, we propose a tri-agent generation pipeline comprising a generator, an instructor, and an editor to enhance output personalization. The generator produces an initial output, the instructor automatically generates editing instructions based on user preferences, and the editor refines the output to align with those preferences. The inference-only large language model (ChatGPT) serves as both the generator and editor, with a smaller model acting as the instructor to guide output generation. We train the instructor using editor-steered reinforcement learning, leveraging feedback from a large-scale editor model to optimize instruction generation. Experimental results on two abstractive summarization datasets demonstrate the effectiveness of our approach in generating outputs that better meet user expectations.

pdf bib
Revisiting the Markov Property for Machine Translation
Cunxiao Du | Hao Zhou | Zhaopeng Tu | Jing Jiang

In this paper, we re-examine the Markov property in the context of neural machine translation. We design a Markov Autoregressive Transformer (MAT) and undertake a comprehensive assessment of its performance across four WMT benchmarks. Our findings indicate that MAT with an order larger than 4 can generate translations with quality on par with that of conventional autoregressive transformers. In addition, counter-intuitively, we also find that the advantages of utilizing a higher-order MAT do not specifically contribute to the translation of longer sentences.

pdf bib
Reward Engineering for Generating Semi-structured Explanation
Jiuzhou Han | Wray Buntine | Ehsan Shareghi

Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is utilised and supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model’s true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed method on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.

pdf bib
Towards Context-Based Violence Detection: A Korean Crime Dialogue Dataset
Minju Kim | Heuiyeen Yeen | Myoung-Wan Koo

In order to enhance the security of society, there is rising interest in artificial intelligence (AI) to help detect and classify in advanced violence in daily life. The field of violence detection has introduced various datasets, yet context-based violence detection predominantly focuses on vision data, with a notable lack of NLP datasets. To overcome this, this paper presents the first Korean dialogue dataset for classifying violence that occurs in online settings: the Korean Crime Dialogue Dataset (KCDD). KCDD contains 22,249 dialogues created by crowd workers assuming offline scenarios. It has four criminal classes that meet international legal standards and one clean class (Serious Threats, Extortion or Blackmail, Harassment in the Workplace, Other Harassment, and Clean Dialogue). Plus, we propose a strong baseline for the proposed dataset, Relationship-Aware BERT. The model shows that understanding varying relationships among interlocutors improves the performance of crime dialogue classification. We hope that the proposed dataset will be used to detect cases of violence and aid people in danger. The KCDD dataset and corresponding baseline implementations can be found at the following link:

pdf bib
Capturing the Relationship Between Sentence Triplets for LLM and Human-Generated Texts to Enhance Sentence Embeddings
Na Min An | Sania Waheed | James Thorne

Deriving meaningful sentence embeddings is crucial in capturing the semantic relationship between texts. Recent advances in building sentence embedding models have centered on replacing traditional human-generated text datasets with those generated by LLMs. However, the properties of these widely used LLM-generated texts remain largely unexplored. Here, we evaluate the quality of the LLM-generated texts from four perspectives (Positive Text Repetition, Length Difference Penalty, Positive Score Compactness, and Negative Text Implausibility) and find that there exists an inherent difference between human and LLM-generated datasets. To further enhance sentence embeddings using both human and LLM-generated datasets, we propose a novel loss function that incorporates Positive-Negative sample Augmentation (PNA) within the contrastive learning objective. Our results demonstrate that PNA effectively mitigates the sentence anisotropy problem in Wikipedia corpus (-7% compared to CLHAIF) and simultaneously improves the Spearman’s correlation in standard Semantic Textual Similarity (STS) tasks (+1.47% compared to CLHAIF).

pdf bib
Harmonizing Code-mixed Conversations: Personality-assisted Code-mixed Response Generation in Dialogues
Shivani Kumar | Tanmoy Chakraborty

Code-mixing, the blending of multiple languages within a single conversation, introduces a distinctive challenge, particularly in the context of response generation. Capturing the intricacies of code-mixing proves to be a formidable task, given the wide-ranging variations influenced by individual speaking styles and cultural backgrounds. In this study, we explore response generation within code-mixed conversations. We introduce a novel approach centered on harnessing the Big Five personality traits acquired in an unsupervised manner from the conversations to bolster the performance of response generation. These inferred personality attributes are seamlessly woven into the fabric of the dialogue context, using a novel fusion mechanism, . It uses an effective two-step attention formulation to fuse the dialogue and personality information. This fusion not only enhances the contextual relevance of generated responses but also elevates the overall performance of the model. Our experimental results, grounded in a dataset comprising of multi-party Hindi-English code-mix conversations, highlight the substantial advantages offered by personality-infused models over their conventional counterparts. This is evident in the increase observed in ROUGE and BLUE scores for the response generation task when the identified personality is seamlessly integrated into the dialogue context. Qualitative assessment for personality identification and response generation aligns well with our quantitative results.

pdf bib
Morality is Non-Binary: Building a Pluralist Moral Sentence Embedding Space using Contrastive Learning
Jeongwoo Park | Enrico Liscio | Pradeep Murukannaiah

Recent advances in NLP show that language models retain a discernible level of knowledge in deontological ethics and moral norms. However, existing works often treat morality as binary, ranging from right to wrong. This simplistic view does not capture the nuances of moral judgment. Pluralist moral philosophers argue that human morality can be deconstructed into a finite number of elements, respecting individual differences in moral judgment. In line with this view, we build a pluralist moral sentence embedding space via a state-of-the-art contrastive learning approach. We systematically investigate the embedding space by studying the emergence of relationships among moral elements, both quantitatively and qualitatively. Our results show that a pluralist approach to morality can be captured in an embedding space. However, moral pluralism is challenging to deduce via self-supervision alone and requires a supervised approach with human labels.

pdf bib
Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases
Giulio Zhou | Tsz Kin Lam | Alexandra Birch | Barry Haddow

Speech-to-Text Translation (S2TT) has typically been addressed with cascade systems, where speech recognition systems generate a transcription that is subsequently passed to a translation model. While there has been a growing interest in developing direct speech translation systems to avoid propagating errors and losing non-verbal content, prior work in direct S2TT has struggled to conclusively establish the advantages of integrating the acoustic signal directly into the translation process. This work proposes using contrastive evaluation to quantitatively measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role. Specifically, we evaluated Korean-English translation systems on a test set containing wh-phrases, for which prosodic features are necessary to produce translations with the correct intent, whether it’s a statement, a yes/no question, a wh-question, and more. Our results clearly demonstrate the value of direct translation systems over cascade translation models, with a notable 12.9% improvement in overall accuracy in ambiguous cases, along with up to a 15.6% increase in F1 scores for one of the major intent categories. To the best of our knowledge, this work stands as the first to provide quantitative evidence that direct S2TT models can effectively leverage prosody. The code for our evaluation is openly accessible and freely available for review and utilisation.

pdf bib
Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
Chunkit Chan | Cheng Jiayang | Weiqi Wang | Yuxin Jiang | Tianqing Fang | Xin Liu | Yangqiu Song

This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT’s promising performance across various tasks, we proceed to carry out thorough evaluations on the whole test sets of 11 datasets, including temporal and causal relations, PDTB2.0-based, and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task, including the zero-shot prompt template, zero-shot prompt engineering (PE) template, and in-context learning (ICL) prompt template, to establish the initial baseline scores for all popular sentence-pair relation classification tasks for the first time. Through our study, we discover that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, albeit it may not possess the same level of expertise in identifying the temporal order between two events. While it is capable of identifying the majority of discourse relations with existing explicit discourse connectives, the implicit discourse relation remains a formidable challenge. Concurrently, ChatGPT demonstrates subpar performance in the dialogue discourse parsing task that requires structural understanding in a dialogue before being aware of the discourse relation.

pdf bib
Backtracing: Retrieving the Cause of the Query
Rose Wang | Pawan Wirawarn | Omar Khattab | Noah Goodman | Dorottya Demszky

Many online content portals allow users to ask questions to supplement their understanding (e.g., of lectures). While information retrieval (IR) systems may provide answers for such user queries, they do not directly assist content creators—such as lecturers who want to improve their content—identify segments that caused a user to ask those questions.We introduce the task of backtracing, in which systems retrieve the text segment that most likely caused a user query.We formalize three real-world domains for which backtracing is important in improving content delivery and communication: understanding the cause of (a) student confusion in the Lecture domain, (b) reader curiosity in the News Article domain, and (c) user emotion in the Conversation domain.We evaluate the zero-shot performance of popular information retrieval methods and language modeling methods, including bi-encoder, re-ranking and likelihood-based methods and ChatGPT.While traditional IR systems retrieve semantically relevant information (e.g., details on “projection matrices” for a query “does projecting multiple times still lead to the same point?”), they often miss the causally relevant context (e.g., the lecturer states “projecting twice gets me the same answer as one projection”). Our results show that there is room for improvement on backtracing and it requires new retrieval approaches.We hope our benchmark serves to improve future retrieval systems for backtracing, spawning systems that refine content generation and identify linguistic triggers influencing user queries.

pdf bib
Unsupervised Multilingual Dense Retrieval via Generative Pseudo Labeling
Chao-Wei Huang | Chen-An Li | Tsu-Yuan Hsu | Chen-Yu Hsu | Yun-Nung Chen

Dense retrieval methods have demonstrated promising performance in multilingual information retrieval, where queries and documents can be in different languages. However, dense retrievers typically require a substantial amount of paired data, which poses even greater challenges in multilingual scenarios. This paper introduces UMR, an  ̲Unsupervised  ̲Multilingual dense  ̲Retriever trained without any paired data. Our approach leverages the sequence likelihood estimation capabilities of multilingual language models to acquire pseudo labels for training dense retrievers. We propose a two-stage framework which iteratively improves the performance of multilingual dense retrievers. Experimental results on two benchmark datasets show that UMR outperforms supervised baselines, showcasing the potential of training multilingual retrievers without paired data, thereby enhancing their practicality. All of our source code, data, and models are available:

pdf bib
Investigating grammatical abstraction in language models using few-shot learning of novel noun gender
Priyanka Sukumaran | Conor Houghton | Nina Kazanina

Humans can learn a new word and infer its grammatical properties from very few examples. They have an abstract notion of linguistic properties like grammatical gender and agreement rules that can be applied to novel syntactic contexts and words. Drawing inspiration from psycholinguistics, we conduct a noun learning experiment to assess whether an LSTM and a decoder-only transformer can achieve human-like abstraction of grammatical gender in French. Language models were tasked with learning the gender of a novel noun embedding from a few examples in one grammatical agreement context and predicting agreement in another, unseen context. We find that both language models effectively generalise novel noun gender from one to two learning examples and apply the learnt gender across agreement contexts, albeit with a bias for the masculine gender category. Importantly, the few-shot updates were only applied to the embedding layers, demonstrating that models encode sufficient gender information within the word-embedding space. While the generalisation behaviour of models suggests that they represent grammatical gender as an abstract category, like humans, further work is needed to explore the details of how exactly this is implemented. For a comparative perspective with human behaviour, we conducted an analogous one-shot novel noun gender learning experiment, which revealed that native French speakers, like language models, also exhibited a masculine gender bias and are not excellent one-shot learners either.

pdf bib
On-the-fly Denoising for Data Augmentation in Natural Language Understanding
Tianqing Fang | Wenxuan Zhou | Fangyu Liu | Hongming Zhang | Yangqiu Song | Muhao Chen

Data Augmentation (DA) is frequently used to provide additional training data without extra human annotation automatically.However, data augmentation may introduce noisy data that impairs training.To guarantee the quality of augmented data,existing methods either assume no noise exists in the augmented data and adopt consistency training or use simple heuristics such as training loss and diversity constraints to filter out “noisy” data.However, those filtered examples may still contain useful information, and dropping them completely causes a loss of supervision signals.In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.To further prevent overfitting on noisy labels, a simple self-regularization module is applied to force the model prediction to be consistent across two distinct dropouts.Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.

pdf bib
Style Vectors for Steering Generative Large Language Models
Kai Konen | Sophie Jentzsch | Diaoulé Diallo | Peer Schütt | Oliver Bensch | Roxanne El Baff | Dominik Opitz | Tobias Hecking

This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. We show that style vectors can be simply computed from recorded layer activations for input texts in a specific style in contrast to more complex training-based approaches. Through a series of experiments, we demonstrate the effectiveness of activation engineering using such style vectors to influence the style of generated text in a nuanced and parameterisable way, distinguishing it from prompt engineering. The presented research constitutes a significant step towards developing more adaptive and effective AI-empowered interactive systems.

pdf bib
Consistent Joint Decision-Making with Heterogeneous Learning Models
Hossein Rajaby Faghihi | Parisa Kordjamshidi

This paper introduces a novel decision-making framework that promotes consistency among decisions made by diverse models while utilizing external knowledge. Leveraging the Integer Linear Programming(ILP) framework, we map predictions from various models into globally normalized and comparable values by incorporating information about decisions’ prior probability, confidence (uncertainty), and the models’ expected accuracy. Our empirical study demonstrates the superiority of our approach over conventional baselines on multiple datasets.

pdf bib
Quantifying Association Capabilities of Large Language Models and Its Implications on Privacy Leakage
Hanyin Shao | Jie Huang | Shen Zheng | Kevin Chang

The advancement of large language models (LLMs) brings notable improvements across various applications, while simultaneously raising concerns about potential private data exposure. One notable capability of LLMs is their ability to form associations between different pieces of information, but this raises concerns when it comes to personally identifiable information (PII). This paper delves into the association capabilities of language models, aiming to uncover the factors that influence their proficiency in associating information. Our study reveals that as models scale up, their capacity to associate entities/information intensifies, particularly when target pairs demonstrate shorter co-occurrence distances or higher co-occurrence frequencies. However, there is a distinct performance gap when associating commonsense knowledge versus PII, with the latter showing lower accuracy. Despite the proportion of accurately predicted PII being relatively small, LLMs still demonstrate the capability to predict specific instances of email addresses and phone numbers when provided with appropriate prompts. These findings underscore the potential risk to PII confidentiality posed by the evolving capabilities of LLMs, especially as they continue to expand in scale and power.

pdf bib
Probing Critical Learning Dynamics of PLMs for Hate Speech Detection
Sarah Masud | Mohammad Aflah Khan | Vikram Goyal | Md Shad Akhtar | Tanmoy Chakraborty

Despite the widespread adoption, there is a lack of research into how various critical aspects of pretrained language models (PLMs) affect their performance in hate speech detection. Through five research questions, our findings and recommendations lay the groundwork for empirically investigating different aspects of PLMs’ use in hate speech detection. We deep dive into comparing different pretrained models, evaluating their seed robustness, finetuning settings, and the impact of pretraining data collection time. Our analysis reveals early peaks for downstream tasks during pretraining, the limited benefit of employing a more recent pretraining corpus, and the significance of specific layers during finetuning. We further call into question the use of domain-specific models and highlight the need for dynamic datasets for benchmarking hate speech detection.

pdf bib
Embible: Reconstruction of Ancient Hebrew and Aramaic Texts Using Transformers
Niv Fono | Harel Moshayof | Eldar Karol | Itai Assraf | Mark Last

Hebrew and Aramaic inscriptions serve as an essential source of information on the ancient history of the Near East. Unfortunately, some parts of the inscribed texts become illegible over time. Special experts, called epigraphists, use time-consuming manual procedures to estimate the missing content. This problem can be considered an extended masked language modeling task, where the damaged content can comprise single characters, character n-grams (partial words), single complete words, and multi-word n-grams.This study is the first attempt to apply the masked language modeling approach to corrupted inscriptions in Hebrew and Aramaic languages, both using the Hebrew alphabet consisting mostly of consonant symbols. In our experiments, we evaluate several transformer-based models, which are fine-tuned on the Biblical texts and tested on three different percentages of randomly masked parts in the testing corpus. For any masking percentage, the highest text completion accuracy is obtained with a novel ensemble of word and character prediction models.

pdf bib
Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling
Qingyang Wu | Zhou Yu

Transformer models have achieved great performance in dialogue generation tasks. However, their inability to process long dialogue history often leads to truncation of the context. To address this problem, we propose a novel memory-augmented transformer that is compatible with existing pre-trained encoder-decoder models and enables efficient preservation of the dialogue history information. The new model incorporates a separate memory module alongside the pre-trained transformer, which can effectively interchange information between the memory states and the current input context. We evaluate the efficiency of our model on three dialogue datasets and two language modeling datasets. Experimental results show that our method has achieved superior efficiency and performance compared to other pre-trained Transformer baselines.

pdf bib
The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models
Anton Razzhigaev | Matvey Mikhalchuk | Elizaveta Goncharova | Ivan Oseledets | Denis Dimitrov | Andrey Kuznetsov

In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. This fact is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.

pdf bib
MEDs for PETs: Multilingual Euphemism Disambiguation for Potentially Euphemistic Terms
Patrick Lee | Alain Chirino Trujillo | Diana Cuevas Plancarte | Olumide Ojo | Xinyi Liu | Iyanuoluwa Shode | Yuan Zhao | Anna Feldman | Jing Peng

Euphemisms are found across the world’s languages, making them a universal linguistic phenomenon. As such, euphemistic data may have useful properties for computational tasks across languages. In this study, we explore this premise by training a multilingual transformer model (XLM-RoBERTa) to disambiguate potentially euphemistic terms (PETs) in multilingual and cross-lingual settings. In line with current trends, we demonstrate that zero-shot learning across languages takes place. We also show cases where multilingual models perform better on the task compared to monolingual models by a statistically significant margin, indicating that multilingual data presents additional opportunities for models to learn about cross-lingual, computational properties of euphemisms. In a follow-up analysis, we focus on universal euphemistic “categories” such as death and bodily functions among others. We test to see whether cross-lingual data of the same domain is more important than within-language data of other domains to further understand the nature of the cross-lingual transfer.

pdf bib
PromptExplainer: Explaining Language Models through Prompt-based Learning
Zijian Feng | Hanzhang Zhou | Zixiao Zhu | Kezhi Mao

Pretrained language models have become workhorses for various natural language processing (NLP) tasks, sparking a growing demand for enhanced interpretability and transparency. However, prevailing explanation methods, such as attention-based and gradient-based strategies, largely rely on linear approximations, potentially causing inaccuracies such as accentuating irrelevant input tokens. To mitigate the issue, we develop PromptExplainer, a novel method for explaining language models through prompt-based learning. PromptExplainer aligns the explanation process with the masked language modeling (MLM) task of pretrained language models and leverages the prompt-based learning framework for explanation generation. It disentangles token representations into the explainable embedding space using the MLM head and extracts discriminative features with a verbalizer to generate class-dependent explanations. Extensive experiments demonstrate that PromptExplainer significantly outperforms state-of-the-art explanation methods.

pdf bib
Do-Not-Answer: Evaluating Safeguards in LLMs
Yuxia Wang | Haonan Li | Xudong Han | Preslav Nakov | Timothy Baldwin

With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Our data and code are available at

pdf bib
Do Language Models Know When They’re Hallucinating References?
Ayush Agrawal | Mirac Suzgun | Lester Mackey | Adam Kalai

State-of-the-art language models (LMs) are notoriously susceptible to generating hallucinated information. Such inaccurate outputs not only undermine the reliability of these models but also limit their use and raise serious concerns about misinformation and propaganda. In this work, we focus on hallucinated book and article references and present them as the “model organism” of language model hallucination research, due to their frequent and easy-to-discern nature. We posit that if a language model cites a particular reference in its output, then it should ideally possess sufficient information about its authors and content, among other relevant details. Using this basic insight, we illustrate that one can identify hallucinated references without ever consulting any external resources, by asking a set of direct or indirect queries to the language model about the references. These queries can be considered as “consistency checks.” Our findings highlight that while LMs, including GPT-4, often produce inconsistent author lists for hallucinated references, they also often accurately recall the authors of real references. In this sense, the LM can be said to “know” when it is hallucinating references. Furthermore, these findings show how hallucinated references can be dissected to shed light on their nature.

pdf bib
Bridging Cultural Nuances in Dialogue Agents through Cultural Value Surveys
Yong Cao | Min Chen | Daniel Hershcovich

The cultural landscape of interactions with dialogue agents is a compelling yet relatively unexplored territory. It’s clear that various sociocultural aspects—from communication styles and beliefs to shared metaphors and knowledge—profoundly impact these interactions. To delve deeper into this dynamic, we introduce cuDialog, a first-of-its-kind benchmark for dialogue generation with a cultural lens. We also develop baseline models capable of extracting cultural attributes from dialogue exchanges, with the goal of enhancing the predictive accuracy and quality of dialogue agents. To effectively co-learn cultural understanding and multi-turn dialogue predictions, we propose to incorporate cultural dimensions with dialogue encoding features. Our experimental findings highlight that incorporating cultural value surveys boosts alignment with references and cultural markers, demonstrating its considerable influence on personalization and dialogue quality. To facilitate further exploration in this exciting domain, we publish our benchmark publicly accessible at

pdf bib
CEO: Corpus-based Open-Domain Event Ontology Induction
Nan Xu | Hongming Zhang | Jianshu Chen

Existing event-centric NLP models often only apply to the pre-defined ontology, which significantly restricts their generalization capabilities.This paper presents CEO, a novel Corpus-based Event Ontology induction model to relax the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to be further curated. We anonymously release our dataset, codes, and induced ontology.

pdf bib
Rethinking STS and NLI in Large Language Models
Yuxia Wang | Minghan Wang | Preslav Nakov

Recent years, have seen the rise of large language models (LLMs), where practitioners use task-specific prompts; this was shown to be effective for a variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low-resource domain accuracy, model overconfidence, and difficulty to capture the disagreements between human judgements. With this in mind, here we try to rethink STS and NLI in the era of LLMs. We first evaluate the performance of STS and NLI in the clinical/biomedical domain, and then we assess LLMs’ predictive confidence and their capability of capturing collective human opinions. We find that these old problems are still to be properly addressed in the era of LLMs.

pdf bib
Learning High-Quality and General-Purpose Phrase Representations
Lihu Chen | Gael Varoquaux | Fabian Suchanek

Phrase representations play an important role in data science and natural language processing, benefiting various tasks like Entity Alignment, Record Linkage, Fuzzy Joins, and Paraphrase Classification.The current state-of-the-art method involves fine-tuning pre-trained language models for phrasal embeddings using contrastive learning. However, we have identified areas for improvement. First, these pre-trained models tend to be unnecessarily complex and require to be pre-trained on a corpus with context sentences.Second, leveraging the phrase type and morphology gives phrase representations that are both more precise and more flexible.We propose an improved framework to learn phrase representations in a context-free fashion.The framework employs phrase type classification as an auxiliary task and incorporates character-level information more effectively into the phrase representation.Furthermore, we design three granularities of data augmentation to increase the diversity of training samples.Our experiments across a wide range of tasks reveal that our approach generates superior phrase embeddings compared to previous methods while requiring a smaller model size.

pdf bib
Explaining Language Model Predictions with High-Impact Concepts
Ruochen Zhao | Tan Wang | Yongjie Wang | Shafiq Joty

To encourage fairness and transparency, there exists an urgent demand for deriving reliable explanations for large language models (LLMs). One promising solution is concept-based explanations, i.e., human-understandable concepts from internal representations. However, due to the compositional nature of languages, current methods mostly discover correlational explanations instead of causal features. Therefore, we propose a novel framework to provide impact-aware explanations for users to understand the LLM’s behavior, which are robust to feature changes and influential to the model’s predictions. Specifically, we extract predictive high-level features (concepts) from the model’s hidden layer activations. Then, we innovatively optimize for features whose existence causes the output predictions to change substantially. Extensive experiments on real and synthetic tasks demonstrate that our method achieves superior results on predictive impact, explainability, and faithfulness compared to the baselines, especially for LLMs.

pdf bib
Understanding and Mitigating Spurious Correlations in Text Classification with Neighborhood Analysis
Oscar Chew | Hsuan-Tien Lin | Kai-Wei Chang | Kuan-Hao Huang

Recent research has revealed that machine learning models have a tendency to leverage spurious correlations that exist in the training set but may not hold true in general circumstances. For instance, a sentiment classifier may erroneously learn that the token “performances” is commonly associated with positive movie reviews.Relying on these spurious correlations degrades the classifier’s performance when it deploys on out-of-distribution data.In this paper, we examine the implications of spurious correlations through a novel perspective called neighborhood analysis. The analysis uncovers how spurious correlations lead unrelated words to erroneously cluster together in the embedding space. Driven by the analysis, we design a metric to detect spurious tokens and also propose a family of regularization methods, NFL (doN’t Forget your Language) to mitigate spurious correlations in text classification.Experiments show that NFL can effectively prevent erroneous clusters and significantly improve the robustness of classifiers without auxiliary data. The code is publicly available at

pdf bib
On the Intractability to Synthesize Factual Inconsistencies in Summarization
Ge Luo | Weisi Fan | Miaoran Li | Youbiao He | Yinfei Yang | Forrest Bao

Factual consistency detection has gotten raised attention in the task of abstractive summarization. Many existing works rely on synthetic training data, which may not accurately reflect or match the inconsistencies produced by summarization models. In this paper, we first systematically analyze the shortcomings of the current methods in synthesizing inconsistent summaries. Current synthesis methods may fail to produce inconsistencies of coreference errors and discourse errors, per our quantitative and qualitative study. Then, employing the parameter-efficient finetuning (PEFT) technique, we discover that a competitive factual consistency detector can be achieved using thousands of real model-generated summaries with human annotations. Our study demonstrates the importance of real machine-generated texts with human annotation in NLG evaluation as our model outperforms the SOTA on the CoGenSumm, FactCC, Frank, and SummEval datasets.

pdf bib
IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
Luyang Lin | Lingzhi Wang | Xiaoyan Zhao | Jing Li | Kam-Fai Wong

This study focuses on media bias detection, crucial in today’s era of influential social media platforms shaping individual attitudes and opinions. In contrast to prior work that primarily relies on training specific models tailored to particular datasets, resulting in limited adaptability and subpar performance on out-of-domain data, we introduce a general bias detection framework, IndiVec, built upon large language models. IndiVec begins by constructing a fine-grained media bias database, leveraging the robust instruction-following capabilities of large language models and vector database techniques. When confronted with new input for bias detection, our framework automatically selects the most relevant indicator from the vector database and employs majority voting to determine the input’s bias label. IndiVec excels compared to previous methods due to its adaptability (demonstrating consistent performance across diverse datasets from various sources) and explainability (providing explicit top-k indicators to interpret bias predictions). Experimental results on four political bias datasets highlight IndiVec’s significant superiority over baselines. Furthermore, additional experiments and analysis provide profound insights into the framework’s effectiveness.

pdf bib
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?
Rishav Hada | Varun Gumma | Adrian Wynter | Harshita Diddee | Mohamed Ahmed | Monojit Choudhury | Kalika Bali | Sunayana Sitaram

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to existing benchmarks and metrics limitations. Employing LLMs as evaluators to rank or score other models’ outputs emerges as a viable solution, addressing the constraints tied to human annotators and established benchmarks. In this study, we explore the potential of LLM-based evaluators in enhancing multilingual evaluation by calibrating them against 20K human judgments across three text-generation tasks, five metrics, and eight languages. Our analysis reveals a bias in LLM-based evaluators towards higher scores, underscoring the necessity of calibration with native speaker judgments, especially in low-resource and non-Latin script languages, to ensure accurate evaluation of LLM performance across diverse languages.

pdf bib
Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals
Christian Khairallah | Reham Marzouk | Salam Khalifa | Mayar Nassar | Nizar Habash

Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals’ intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.

pdf bib
Relabeling Minimal Training Subset to Flip a Prediction
Jinghan Yang | Linjie Xu | Lequan Yu

When facing an unsatisfactory prediction from a machine learning model, users can be interested in investigating the underlying reasons and exploring the potential for reversing the outcome. We ask: To flip the prediction on a test point xt, how to identify the smallest training subset 𝒮t that we need to relabel?We propose an efficient algorithm to identify and relabel such a subset via an extended influence function for binary classification models with convex loss.We find that relabeling fewer than 2% of the training points can always flip a prediction.This mechanism can serve multiple purposes: (1) providing an approach to challenge a model prediction by altering training points; (2) evaluating model robustness with the cardinality of the subset (i.e., |𝒮t|); we show that |𝒮t| is highly related to the noise ratio in the training set and |𝒮t| is correlated with but complementary to predicted probabilities; and (3) revealing training points lead to group attribution bias. To the best of our knowledge, we are the first to investigate identifying and relabeling the minimal training subset required to flip a given prediction.

pdf bib
Why Generate When You Can Discriminate? A Novel Technique for Text Classification using Language Models
Sachin Pawar | Nitin Ramrakhiyani | Anubhav Sinha | Manoj Apte | Girish Palshikar

In this paper, we propose a novel two-step technique for text classification using autoregressive Language Models (LM). In the first step, a set of perplexity and log-likelihood based numeric features are elicited from an LM for a text instance to be classified. Then, in the second step, a classifier based on these features is trained to predict the final label. The classifier used is usually a simple machine learning classifier like Support Vector Machine (SVM) or Logistic Regression (LR) and it is trained using a small set of training examples. We believe, our technique presents a whole new way of exploiting the available training instances, in addition to the existing ways like fine-tuning LMs or in-context learning. Our approach stands out by eliminating the need for parameter updates in LMs, as required in fine-tuning, and does not impose limitations on the number of training examples faced while building prompts for in-context learning. We evaluate our technique across 5 different datasets and compare with multiple competent baselines.

pdf bib
Autism Detection in Speech – A Survey
Nadine Probol | Margot Mieskes

There has been a range of studies of how autism is displayed in voice, speech, and language. We analyse studies from the biomedical, as well as the psychological domain, but also from the NLP domain in order to find linguistic, prosodic and acoustic cues. Our survey looks at all three domains. We define autism and which comorbidities might influence the correct detection of the disorder. We especially look at observations such as verbal and semantic fluency, prosodic features, but also disfluencies and speaking rate. We also show word-based approaches and describe machine learning and transformer-based approaches both on the audio data as well as the transcripts. Lastly, we conclude, while there already is a lot of research, female patients seem to be severely under-researched. Also, most NLP research focuses on traditional machine learning methods instead of transformers. Additionally, we were unable to find research combining both features from audio and transcripts.

pdf bib
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Danae Sanchez Villegas | Daniel Preotiuc-Pietro | Nikolaos Aletras

Effectively leveraging multimodal information from social media posts is essential to various downstream tasks such as sentiment analysis, sarcasm detection or hate speech classification. Jointly modeling text and images is challenging because cross-modal semantics might be hidden or the relation between image and text is weak. However, prior work on multimodal classification of social media posts has not yet addressed these challenges. In this work, we present an extensive study on the effectiveness of using two auxiliary losses jointly with the main task during fine-tuning multimodal models. First, Image-Text Contrastive (ITC) is designed to minimize the distance between image-text representations within a post, thereby effectively bridging the gap between posts where the image plays an important role in conveying the post’s meaning. Second, Image-Text Matching (ITM) enhances the model’s ability to understand the semantic relationship between images and text, thus improving its capacity to handle ambiguous or loosely related posts. We combine these objectives with five multimodal models, demonstrating consistent improvements of up to 2.6 F1 score across five diverse social media datasets. Our comprehensive analysis shows the specific scenarios where each auxiliary task is most effective.

pdf bib
What the Weight?! A Unified Framework for Zero-Shot Knowledge Composition
Carolin Holtermann | Markus Frohmann | Navid Rekabsaz | Anne Lauscher

The knowledge encapsulated in a model is the core factor determining its final performance on downstream tasks. Much research in NLP has focused on efficient methods for storing and adapting different types of knowledge, e.g., in dedicated modularized structures, and on how to effectively combine these, e.g., by learning additional parameters. However, given the many possible options, a thorough understanding of the mechanisms involved in these compositions is missing, and hence it remains unclear which strategies to utilize. To address this research gap, we propose a novel framework for zero-shot module composition, which encompasses existing and some novel variations for selecting, weighting, and combining parameter modules under a single unified notion. Focusing on the scenario of domain knowledge and adapter layers, our framework provides a systematic unification of concepts, allowing us to conduct the first comprehensive benchmarking study of various zero-shot knowledge composition strategies. In particular, we test two module combination methods and five selection and weighting strategies for their effectiveness and efficiency in an extensive experimental setup. Our results highlight the efficacy of ensembling but also hint at the power of simple though often-ignored weighting methods. Further in-depth analyses allow us to understand the role of weighting vs. top-k selection, and show that, to a certain extent, the performance of adapter composition can even be predicted.

pdf bib
IndiFoodVQA: Advancing Visual Question Answering and Reasoning with a Knowledge-Infused Synthetic Data Generation Pipeline
Pulkit Agarwal | Settaluri Sravanthi | Pushpak Bhattacharyya

Large Vision Language Models (VLMs) like GPT-4, LLaVA, and InstructBLIP exhibit extraordinary capabilities for both knowledge understanding and reasoning. However, the reasoning capabilities of such models on sophisticated problems that require external knowledge of a specific domain have not been assessed well, due to the unavailability of necessary datasets. In this work, we release a first-of-its-kind dataset called IndiFoodVQA with around 16.7k data samples, consisting of explicit knowledge-infused questions, answers, and reasons. We also release IndiFoodKG, a related Knowledge Graph (KG) with 79k triples. The data has been created with minimal human intervention via an automated pipeline based on InstructBlip and GPT-3.5. We also present a methodology to extract knowledge from the KG and use it to both answer and reason upon the questions. We employ different models to report baseline zero-shot and fine-tuned results. Fine-tuned VLMs on our data showed an improvement of ~25% over the corresponding base model, highlighting the fact that current VLMs need domain-specific fine-tuning to excel in specialized settings. Our findings reveal that (1) explicit knowledge infusion during question generation helps in making questions that have more grounded knowledge, and (2) proper knowledge retrieval can often lead to better-answering potential in such cases. The data and code is available at

pdf bib
MAPLE: Micro Analysis of Pairwise Language Evolution for Few-Shot Claim Verification
Xia Zeng | Arkaitz Zubiaga

Claim verification is an essential step in the automated fact-checking pipeline which assesses the veracity of a claim against a piece of evidence. In this work, we explore the potential of few-shot claim verification, where only very limited data is available for supervision. We propose MAPLE (Micro Analysis of Pairwise Language Evolution), a pioneering approach that explores the alignment between a claim and its evidence with a small seq2seq model and a novel semantic measure. Its innovative utilization of micro language evolution path leverages unlabelled pairwise data to facilitate claim verification while imposing low demand on data annotations and computing resources. MAPLE demonstrates significant performance improvements over SOTA baselines SEED, PET and LLaMA 2 across three fact-checking datasets: FEVER, Climate FEVER, and SciFact. Data and code are available.

pdf bib
Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection
David Dukić | Kiril Gashteovski | Goran Glavaš | Jan Snajder

Event detection is a crucial information extraction task in many domains, such as Wikipedia or news. The task typically relies on trigger detection (TD) – identifying token spans in the text that evoke specific events. While the notion of triggers should ideally be universal across domains, domain transfer for TD from high- to low-resource domains results in significant performance drops. We address the problem of negative transfer in TD by coupling triggers between domains using subject-object relations obtained from a rule-based open information extraction (OIE) system. We demonstrate that OIE relations injected through multi-task training can act as mediators between triggers in different domains, enhancing zero- and few-shot TD domain transfer and reducing performance drops, in particular when transferring from a high-resource source domain (Wikipedia) to a low(er)-resource target domain (news). Additionally, we combine this improved transfer with masked language modeling on the target domain, observing further TD transfer gains. Finally, we demonstrate that the gains are robust to the choice of the OIE system.

pdf bib
Exploring efficient zero-shot synthetic dataset generation for Information Retrieval
Tiago Almeida | Sérgio Matos

The broad integration of neural retrieval models into Information Retrieval (IR) systems is significantly impeded by the high cost and laborious process associated with the manual labelling of training data. Similarly, synthetic training data generation, a potential workaround, often requires expensive computational resources due to the reliance on large language models. This work explored the potential of small language models for efficiently creating high-quality synthetic datasets to train neural retrieval models. We aim to identify an optimal method to generate synthetic datasets, enabling training neural reranking models in document collections where annotated data is unavailable. We introduce a novel methodology, grounded in the principles of information theory, to select the most appropriate documents to be used as context for question generation. Then, we employ a small language model for zero-shot conditional question generation, supplemented by a filtering mechanism to ensure the quality of generated questions. Extensive evaluation on five datasets unveils the potential of our approach, outperforming unsupervised retrieval methods such as BM25 and pretrained monoT5. Our findings indicate that an efficiently generated “silver-standard” dataset allows effective training of neural rerankers in unlabeled scenarios. To ensure reproducibility and facilitate wider application, we will release a code repository featuring an accessible API for zero-shot synthetic question generation.

pdf bib
Clustering-based Sampling for Few-Shot Cross-Domain Keyphrase Extraction
Prakamya Mishra | Lincy Pattanaik | Arunima Sundar | Nishant Yadav | Mayank Kulkarni

Keyphrase extraction is the task of identifying a set of keyphrases present in a document that captures its most salient topics. Scientific domain-specific pre-training has led to achieving state-of-the-art keyphrase extraction performance with a majority of benchmarks being within the domain. In this work, we explore how to effectively enable the cross-domain generalization capabilities of such models without requiring the same scale of data. We primarily focus on the few-shot setting in non-scientific domain datasets such as OpenKP from the Web domain & StackEx from the StackExchange forum. We propose to leverage topic information intrinsically available in the data, to build a novel clustering-based sampling approach that facilitates selecting a few samples to label from the target domain facilitating building robust and performant models. This approach leads to large gains in performance of up to 26.35 points in F1 when compared to selecting few-shot samples uniformly at random. We also explore the setting where we have access to labeled data from the model’s pretraining domain corpora and perform gradual training which involves slowly folding in target domain data to the source domain data. Here we demonstrate further improvements in the model performance by up to 12.76 F1 points.

pdf bib
Random Smooth-based Certified Defense against Text Adversarial Attack
Zeliang Zhang | Wei Yao | Susan Liang | Chenliang Xu

Certified defense methods have identified their effectiveness against textual adversarial examples, which train models on the worst-case text generated by substituting words in original texts with synonyms. However, due to the discrete word embedding representations, the large search space hinders the robust training efficiency, resulting in significant time consumption. To overcome this challenge, motivated by the observation that synonym embedding has a small distance, we propose to treat the word substitution as a continuous perturbation on the word embedding representation. The proposed method Text-RS applies random smooth techniques to approximate the word substitution operation, offering a computationally efficient solution that outperforms conventional discrete methods and improves the robustness in training. The evaluation results demonstrate its effectiveness in defending against multiple textual adversarial attacks.

pdf bib
Clarifying the Path to User Satisfaction: An Investigation into Clarification Usefulness
Hossein A. Rahmani | Xi Wang | Mohammad Aliannejadi | Mohammadmehdi Naghiaei | Emine Yilmaz

Clarifying questions are an integral component of modern information retrieval systems, directly impacting user satisfaction and overall system performance. Poorly formulated questions can lead to user frustration and confusion, negatively affecting the system’s performance. This research addresses the urgent need to identify and leverage key features that contribute to the classification of clarifying questions, enhancing user satisfaction. To gain deeper insights into how different features influence user satisfaction, we conduct a comprehensive analysis, considering a broad spectrum of lexical, semantic, and statistical features, such as question length and sentiment polarity. Our empirical results provide three main insights into the qualities of effective query clarification: (1) specific questions are more effective than generic ones; (2) the subjectivity and emotional tone of a question play a role; and (3) shorter and more ambiguous queries benefit significantly from clarification. Based on these insights, we implement feature-integrated user satisfaction prediction using various classifiers, both traditional and neural-based, including random forest, BERT, and large language models. Our experiments show a consistent and significant improvement, particularly in traditional classifiers, with a minimum performance boost of 45%. This study presents invaluable guidelines for refining the formulation of clarifying questions and enhancing both user satisfaction and system performance.

pdf bib
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning
Lifu Tu | Jin Qu | Semih Yavuz | Shafiq Joty | Wenhao Liu | Caiming Xiong | Yingbo Zhou

Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks, but focus on conversational tasks has been rather limited. This is partly due to the high cost of obtaining non-English conversational data, which results in limited coverage. In this work, we introduce for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset that we created by translating the English-only Schema-Guided Dialogue (SGD) dataset (Rastogi et al., 2020) into 105 other languages. XSGD contains about 330k utterances per language. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts. We also investigate two different classifiers: NLI-based and vanilla classifiers, and test cross-lingual capability enabled by the aligned prompts. We evaluate our model’s cross-lingual generalization capabilities on two conversation tasks: slot-filling and intent classification. Our results demonstrate strong and efficient modeling ability of NLI-based classifiers and the large cross-lingual transfer improvements achieved by our aligned prompts, particularly in few-shot settings. We also conduct studies on large language models (LLMs) such as text-davinci-003 and ChatGPT in both zero- and few-shot settings. While LLMs exhibit impressive performance in English, their cross-lingual capabilities in other languages, particularly low-resource ones, are limited.

pdf bib
Correcting Language Model Outputs by Editing Salient Layers
Kshitij Mishra | Tamer Soliman | Anil Ramakrishna | Aram Galstyan | Anoop Kumar

Large language models can accumulate incorrect or outdated knowledge as the real world evolves. Compared to typical solutions such as retraining, retrieval augmented generation, model editing offers an effective yet low cost solution to address this issue. However, existing model editing algorithms employ manual selection of edit layers, which requires prior domain knowledge or expensive architecture-specific empirical layer selection methods, such as causal tracing. In this work, we propose SaLEM (Salient Layers Editing Model), an efficient solution for data driven layer selection for the model editing task. Our solution utilizes layer-wise saliency maps for layer selection, and matches the accuracy of prior approaches but with only 1/3 of their edits, enabling efficient updates to the parametric knowledge in large language models.

pdf bib
Improving Grounded Language Understanding in a Collaborative Environment by Interacting with Agents Through Help Feedback
Nikhil Mehta | Milagro Teruel | Xin Deng | Sergio Figueroa Sanz | Ahmed Awadallah | Julia Kiseleva

Many approaches to Natural Language Processing tasks often treat them as single-step problems, where an agent receives an instruction, executes it, and is evaluated based on the final outcome. However, language is inherently interactive, as evidenced by the back-and-forth nature of human conversations. In light of this, we posit that human-AI collaboration should also be interactive, with humans monitoring the work of AI agents and providing feedback that the agent can understand and utilize. Further, the AI agent should be able to detect when it needs additional information and proactively ask for help. Enabling this scenario would lead to more natural, efficient, and engaging human-AI collaboration.In this paper, we investigate these directions using the challenging task established by the IGLU competition, an interactive grounded language understanding task in a MineCraft-like world. We delve into multiple types of help players can give to the AI to guide it and analyze the impact of this help on behavior, resulting in performance improvements and an end-to-end interactive system.

pdf bib
Goodhart’s Law Applies to NLP’s Explanation Benchmarks
Jennifer Hsia | Danish Pruthi | Aarti Singh | Zachary Lipton

Despite the rising popularity of saliency-based explanations, the research community remains at an impasse, facing doubts concerning their purpose, efficacy, and tendency to contradict each other. Seeking to unite the community’s efforts around common goals, several recent works have proposed evaluation metrics. In this paper, we critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics, focusing our inquiry on natural language processing. First, we show that we can inflate a model’s comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs. Our strategy exploits the tendency for extracted explanations and their complements to be “out-of-support” relative to each other and in-distribution inputs. Next, we demonstrate that the EVAL-X metrics can be inflated arbitrarily by a simple method that encodes the label, even though EVAL-X is precisely motivated to address such exploits. Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.

pdf bib
Syllable-level lyrics generation from melody exploiting character-level language model
Zhe Zhang | Karol Lasocki | Yi Yu | Atsuhiro Takasu

The generation of lyrics tightly connected to accompanying melodies involves establishing a mapping between musical notes and syllables of lyrics. This process requires a deep understanding of music constraints and semantic patterns at syllable-level, word-level, and sentence-level semantic meanings. However, pre-trained language models specifically designed at the syllable level are publicly unavailable. To solve these challenging issues, we propose to exploit fine-tuning character-level language models for syllable-level lyrics generation from symbolic melody. In particular, our method aims to fine-tune a character-level pre-trained language model, allowing to incorporation of linguistic knowledge of the language model into the beam search process of a syllable-level Transformer generator network. Besides, by exploring ChatGPT-based evaluation of generated lyrics in addition to human subjective evaluation, we prove that our approach improves the coherence and correctness of generated lyrics, without the need to train expensive new language models.

pdf bib
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca
Pinzhen Chen | Shaoxiong Ji | Nikolay Bogoychev | Andrey Kutuzov | Barry Haddow | Kenneth Heafield

Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which is then used to tune LLMs through either low-rank adaptation or full-parameter training. Under a controlled computation budget, comparisons show that multilingual tuning is on par or better than tuning a model for each language. Furthermore, multilingual tuning with downsampled data can be as powerful and more robust. Our findings serve as a guide for expanding language support through instruction tuning.

pdf bib
Prompt Perturbation Consistency Learning for Robust Language Models
Yao Qiang | Subhrangshu Nandi | Ninareh Mehrabi | Greg Ver Steeg | Anoop Kumar | Anna Rumshisky | Aram Galstyan

Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks, such as question answering and text summarization. However, their performance on sequence labeling tasks such as intent classification and slot filling (IC-SF), which is a central component in personal assistant systems, lags significantly behind discriminative models. Furthermore, there is a lack of substantive research on robustness of LLMs to various perturbations in the input prompts. The contributions of this paper are three-fold. First, we show that fine-tuning sufficiently large LLMs can produce IC-SF performance comparable to discriminative models. Next, we systematically analyze the performance deterioration of those fine-tuned models due to three distinct yet relevant types of input perturbations - oronyms, synonyms, and paraphrasing. Finally, we propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples. Our experiments show that PPCL can recover on an average 59% and 69% of the performance drop for IC and SF tasks, respectively. Furthermore, PPCL beats data augmentation approach while using ten times fewer augmented data samples.

pdf bib
Enhancing Society-Undermining Disinformation Detection through Fine-Grained Sentiment Analysis Pre-Finetuning
Tsung-Hsuan Pan | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

In the era of the digital world, while freedom of speech has been flourishing, it has also paved the way for disinformation, causing detrimental effects on society. Legal and ethical criteria are insufficient to address this concern, thus necessitating technological intervention. This paper presents a novel method leveraging pre-finetuning concept for efficient detection and removal of disinformation that may undermine society, as deemed by judicial entities. We argue the importance of detecting this type of disinformation and validate our approach with real-world data derived from court orders. Following a study that highlighted four areas of interest for rumor analysis, our research proposes the integration of a fine-grained sentiment analysis task in the pre-finetuning phase of language models, using the GoEmotions dataset. Our experiments validate the effectiveness of our approach in enhancing performance significantly. Furthermore, we explore the application of our approach across different languages using multilingual language models, showing promising results. To our knowledge, this is the first study that investigates the role of sentiment analysis pre-finetuning in disinformation detection.

pdf bib
Minimal Distillation Schedule for Extreme Language Model Compression
Chen Zhang | Yang Yang | Qifan Wang | Jiahao Liu | Jingang Wang | Wei Wu | Dawei Song

Recent studies have revealed that language model distillation can become less effective when there is a significant capacity gap between the teacher and the student models. In order to bridge the gap, teacher assistant-based distillation has been introduced, in which the selection of the teacher assistant plays a crucial role in transferring knowledge from the teacher to the student. However, existing approaches for teacher assistant-based distillation require numerous trials to find the optimal teacher assistant.In this paper, we propose a novel approach called Minimal Distillation Schedule (MiniDisc), which enables the scheduling of an optimal teacher assistant in just one trial for extreme model compression (e.g, to 5% scale). In particular, we empirically show that the performance of the student is positively correlated with the scale-performance tradeoff of the teacher assistant. We then introduce a new 𝜆-tradeoff metric that quantifies the optimality of the teacher assistant without the need for trial distillation to the student. By employing a sandwich framework, MiniDisc can select the optimal teacher assistant with the best 𝜆-tradeoff.We extensively evaluate MiniDisc through a series of experiments on the GLUE benchmark. The results demonstrate that our approach achieved an improved efficiency compared to various state-of-the-art baselines. Furthermore, we showcase the scalability of MiniDisc by applying it to a language model with billions of parameters.

pdf bib
Event Semantic Classification in Context
Haoyu Wang | Hongming Zhang | Kaiqiang Song | Dong Yu | Dan Roth

In this work, we focus on a fundamental yet underexplored problem, event semantic classification in context, to help machines gain a deeper understanding of events. We classify events from six perspectives: modality, affirmation, specificity, telicity, durativity, and kinesis. These properties provide essential cues regarding the occurrence and grounding of events, changes of status that events can bring about, and the connection between events and time. To this end, this paper introduces a novel dataset collected for the semantic classification tasks and several effective models. By incorporating these event properties into downstream tasks, we demonstrate that understanding the fine-grained event semantics benefits downstream event understanding and reasoning via experiments on event extraction, temporal relation extraction, and subevent relation extraction.

pdf bib
Local and Global Contexts for Conversation
Zuoquan Lin | Xinyi Shen

The context in conversation is the dialog history crucial for multi-turn dialogue. Learning from the relevant contexts in dialog history for grounded conversation is a challenging problem. Local context is the most neighbor and more sensitive to the subsequent response, and global context is relevant to a whole conversation far beyond neighboring utterances. Currently, pretrained transformer models for conversation challenge capturing the correlation and connection between local and global contexts. We introduce a local and global conversation model (LGCM) for general-purpose conversation in open domain. It is a local-global hierarchical transformer model that excels at accurately discerning and assimilating the relevant contexts necessary for generating responses. It employs a local encoder to grasp the local context at the level of individual utterances and a global encoder to understand the broader context at the dialogue level. The seamless fusion of these locally and globally contextualized encodings ensures a comprehensive comprehension of the conversation. Experiments on popular datasets show that LGCM outperforms the existing conversation models on the performance of automatic metrics with significant margins.

pdf bib
Aspect-based Key Point Analysis for Quantitative Summarization of Reviews
An Tang | Xiuzhen Zhang | Minh Dinh

Key Point Analysis (KPA) is originally for summarizing arguments, where short sentences containing salient viewpoints are extracted as key points (KPs) and quantified for their prevalence as salience scores. Recently, KPA was applied to summarize reviews, but the study still relies on sentence-based KP extraction and matching, which leads to two issues: sentence-based extraction can result in KPs of overlapping opinions on the same aspects, and sentence-based matching of KP to review comment can be inaccurate, resulting in inaccurate salience scores. To address the above issues, in this paper, we propose Aspect-based Key Point Analysis (ABKPA), a novel framework for quantitative review summarization. Leveraging the readily available aspect-based sentiment analysis (ABSA) resources of reviews to automatically annotate silver labels for matching aspect-sentiment pairs, we propose a contrastive learning model to effectively match KPs to reviews and quantify KPs at the aspect level. Especially, the framework ensures extracting KP of distinct aspects and opinions, leading to more accurate opinion quantification. Experiments on five business categories of the popular Yelp review dataset show that ABKPA outperforms state-of-the-art baselines. Source code and data are available at:

pdf bib
Improving Semantic Control in Discrete Latent Spaces with Transformer Quantized Variational Autoencoders
Yingji Zhang | Danilo Carvalho | Marco Valentino | Ian Pratt-Hartmann | Andre Freitas

Achieving precise semantic control over the latent spaces of Variational AutoEncoders (VAEs) holds significant value for downstream tasks in NLP as the underlying generative mechanisms could be better localised, explained and improved upon. Recent research, however, has struggled to achieve consistent results, primarily due to the inevitable loss of semantic information in the variational bottleneck and limited control over the decoding mechanism. To overcome these challenges, we investigate discrete latent spaces in Vector Quantized Variational AutoEncoder (VQVAE) to improve semantic control and generation in Transformer-based VAEs. In particular, We propose T5VQVAE, a novel model that leverages the controllability of VQVAE to guide the self-attention mechanism in T5, exploiting its full generalization capabilities. Experimental results indicate that T5VQVAE outperforms existing state-of-the-art VAE models, including Optimus, in terms of control and preservation of semantic information across different tasks such as auto-encoding of sentences and mathematical expressions, text transfer, and inference. Moreover, T5VQVAE exhibits improved reasoning capabilities, suggesting potential applications for downstream natural language and symbolic inference tasks.

pdf bib
High-quality Data-to-Text Generation for Severely Under-Resourced Languages with Out-of-the-box Large Language Models
Michela Lorandi | Anya Belz

The performance of NLP methods for severely under-resourced languages cannot currently hope to match the state of the art in NLP methods for well resourced languages. We explore the extent to which pretrained large language models (LLMs) can bridge this gap, via the example of data-to-text generation for Irish, Welsh, Breton and Maltese. We test LLMs on these under-resourced languages and English, in a range of scenarios. We find that LLMs easily set the state of the art for the under-resourced languages by substantial margins, as measured by both automatic and human evaluations. For all our languages, human evaluation shows on-a-par performance with humans for our best systems, but BLEU scores collapse compared to English, casting doubt on the metric’s suitability for evaluating non-task-specific systems. Overall, our results demonstrate the great potential of LLMs to bridge the performance gap for under-resourced languages.

pdf bib
Antonym vs Synonym Distinction using InterlaCed Encoder NETworks (ICE-NET)
Muhammad Ali | Yan Hu | Jianbin Qin | Di Wang

Antonyms vs synonyms distinction is a core challenge in lexico-semantic analysis and automated lexical resource construction. These pairs share a similar distributional context which makes it harder to distinguish them. Leading research in this regard attempts to capture the properties of the relation pairs, i.e., symmetry, transitivity, and trans-transitivity. However, the inability of existing research to appropriately model the relation-specific properties limits their end performance. In this paper, we propose InterlaCed Encoder NETworks (i.e., ICE-NET) for antonym vs synonym distinction, that aim to capture and model the relation-specific properties of the antonyms and synonyms pairs in order to perform the classification task in a performance-enhanced manner. Experimental evaluation using the benchmark datasets shows that ICE-NET outperforms the existing research by a relative score of upto 1.8% in F1-measure.

pdf bib
Predicting Machine Translation Performance on Low-Resource Languages: The Role of Domain Similarity
Eric Khiu | Hasti Toossi | Jinyu Liu | Jiaxu Li | David Anugraha | Juan Flores | Leandro Roman | A. Seza Doğruöz | En-Shiun Lee

Fine-tuning and testing a multilingual large language model is a challenge for low-resource languages (LRLs) since it is an expensive process. While previous studies have predicted the performance of natural language processing (NLP) tasks using machine learning methods, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we investigate three factors (the size of the fine-tuning corpus, domain similarity between fine-tuning and testing corpora, and language similarity between source and target languages), which can potentially impact the model performance by using classical regression models. Our results indicate that domain similarity has the most important impact on predicting the performance of Machine Translation models.

pdf bib
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Martha Lewis | Nihal Nayak | Peilin Yu | Jack Merullo | Qinan Yu | Stephen Bach | Ellie Pavlick

Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying ‘red cube’ by reasoning over the constituents ‘red’ and ‘cube’. In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating ‘cube behind sphere’ from ‘sphere behind cube’). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets – single-object, two-object, and relational – designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.

pdf bib
Code-Switching and Back-Transliteration Using a Bilingual Model
Daniel Weisberg Mitelman | Nachum Dershowitz | Kfir Bar

The challenges of automated transliteration and code-switching–detection in Judeo-Arabic texts are addressed. We introduce two novel machine-learning models, one focused on transliterating Judeo-Arabic into Arabic, and another aimed at identifying non-Arabic words, predominantly Hebrew and Aramaic. Unlike prior work, our models are based on a bilingual Arabic-Hebrew language model, providing a unique advantage in capturing shared linguistic nuances. Evaluation results show that our models outperform prior solutions for the same tasks. As a practical contribution, we present a comprehensive pipeline capable of taking Judeo-Arabic text, identifying non-Arabic words, and then transliterating the Arabic portions into Arabic script. This work not only advances the state of the art but also offers a valuable toolset for making Judeo-Arabic texts more accessible to a broader Arabic-speaking audience.

pdf bib
Tsetlin Machine Embedding: Representing Words Using Logical Expressions
Bimal Bhattarai | Ole-Christoffer Granmo | Lei Jiao | Rohan Yadav | Jivitesh Sharma

Embedding words in vector space is a fundamental first step in state-of-the-art natural language processing (NLP). Typical NLP solutions employ pre-defined vector representations to improve generalization by co-locating similar words in vector space. For instance, Word2Vec is a self-supervised predictive model that captures the context of words using a neural network. Similarly, GLoVe is a popular unsupervised model incorporating corpus-wide word co-occurrence statistics. Such word embedding has significantly boosted important NLP tasks, including sentiment analysis, document classification, and machine translation. However, the embeddings are dense floating-point vectors, making them expensive to compute and difficult to interpret. In this paper, we instead propose to represent the semantics of words with a few defining words that are related using propositional logic. To produce such logical embeddings, we introduce a Tsetlin Machine-based autoencoder that learns logical clauses self-supervised. The clauses consist of contextual words like black, cup, and hot to define other words like coffee, thus being human-understandable. We evaluate our embedding approach on several intrinsic and extrinsic benchmarks, outperforming GLoVe on six classification tasks. Furthermore, we investigate the interpretability of our embedding using the logical representations acquired during training. We also visualize word clusters in vector space, demonstrating how our logical embedding co-locate similar words.

pdf bib
Reading Between the Tweets: Deciphering Ideological Stances of Interconnected Mixed-Ideology Communities
Zihao He | Ashwin Rao | Siyi Guo | Negar Mokhberian | Kristina Lerman

Recent advances in NLP have improved our ability to understand the nuanced worldviews of online communities. Existing research focused on probing ideological stances treats liberals and conservatives as separate groups. However, this fails to account for the nuanced views of the organically formed online communities and the connections between them. In this paper, we study discussions of the 2020 U.S. election on Twitter to identify complex interacting communities. Capitalizing on this interconnectedness, we introduce a novel approach that harnesses message passing when finetuning language models (LMs) to probe the nuanced ideologies of these communities. By comparing the responses generated by LMs and real-world survey results, our method shows higher alignment than existing baselines, highlighting the potential of using LMs in revealing complex ideologies within and across interconnected mixed-ideology communities.

pdf bib
Unified Embeddings for Multimodal Retrieval via Frozen LLMs
Ziyang Wang | Heba Elfardy | Markus Dreyer | Kevin Small | Mohit Bansal

In this work, We present Unified Embeddings for Multimodal Retrieval (UniMuR), a simple but effective approach that embeds multimodal inputs and retrieves visual and textual outputs via frozen Large Language Models (LLMs). Specifically, UniMuR jointly retrieves multimodal outputs via a unified multimodal embedding and applies dual alignment training to account for both visual and textual semantics. Thus, unlike previous approaches, UniMuR significantly reduces LLM’s modality bias towards generating text-only outputs. Meanwhile, the proposed unified multimodal embedding mitigates the inconsistency between visual and textual outputs and provides coherent multimodal outputs. Furthermore, benefiting from the joint training of visual and textual semantics, UniMuR also achieves strong image/text retrieval ability. Compared to existing approaches, UniMuR achieves better zero-shot multimodal response retrieval performance on MMDialog, improving the overall R@1 by 6.5% while boosting the image retrieval rate and having better cross-modal consistency on multimodal outputs. UniMuR also achieves 2.4% and 3.9% improvement on context-based image retrieval tasks on MMDialog and VisDial respectively when compared to previous approaches, validating its generalization ability across multiple tasks.

pdf bib
Assessing the Portability of Parameter Matrices Trained by Parameter-Efficient Finetuning Methods
Mohammed Mohammed | Anya Belz

As the cost of training ever larger language models has grown, so has the interest in reusing previously learnt knowledge. Transfer learning methods have shown how reusing non-task-specific knowledge can help in subsequent task-specific learning.In this paper, we investigate the inverse: porting whole functional modules that encode task-specific knowledge from one model to another. We designed a study comprising 1,440 training/testing runs to test the portability of modules trained by parameter-efficient finetuning (PEFT) techniques, using sentiment analysis as an example task. We test portability in a wide range of scenarios, involving different PEFT techniques and different pretrained host models, among other dimensions. We compare the performance of ported modules with that of equivalent modules trained (i) from scratch, and (ii) from parameters sampled from the same distribution as the ported module.We find that the ported modules far outperform the two alternatives tested, but that there are interesting differences between the four PEFT techniques tested.We conclude that task-specific knowledge in the form of structurally modular sets of parameters as produced by PEFT techniques is highly portable, but that degree of success depends on type of PEFT and on differences between originating and receiving pretrained models.

pdf bib
Exploiting Class Probabilities for Black-box Sentence-level Attacks
Raha Moraffah | Huan Liu

Sentence-level attacks craft adversarial sentences that are synonymous with correctly-classified sentences but are misclassified by the text classifiers. Under the black-box setting, classifiers are only accessible through their feedback to queried inputs, which is predominately available in the form of class probabilities. Even though utilizing class probabilities results in stronger attacks, due to the challenges of using them for sentence-level attacks, existing attacks use either no feedback or only the class labels. Overcoming the challenges, we develop a novel algorithm that uses class probabilities for black-box sentence-level attacks, investigate the effectiveness of using class probabilities on the attack’s success, and examine the question if it is worthy or practical to use class probabilities by black-box sentence-level attacks. We conduct extensive evaluations of the proposed attack comparing with the baselines across various classifiers and benchmark datasets.

pdf bib
Learning Label Hierarchy with Supervised Contrastive Learning
Ruixue Lian | William Sethares | Junjie Hu

Supervised contrastive learning (SCL) frameworks treat each class as independent and thus consider all classes to be equally important. This neglects the common scenario in which label hierarchy exists, where fine-grained classes under the same category show more similarity than very different ones. This paper introduces a family of Label-Aware SCL methods (LA-SCL) that incorporates hierarchical information to SCL by leveraging similarities between classes, resulting in creating a more well-structured and discriminative feature space. This is achieved by first adjusting the distance between instances based on measures of the proximity of their classes with the scaled instance-instance-wise contrastive. An additional instance-center-wise contrastive is introduced to move within-class examples closer to their centers, which are represented by a set of learnable label parameters. The learned label parameters can be directly used as a nearest neighbor classifier without further finetuning. In this way, a better feature representation is generated with improvements of intra-cluster compactness and inter-cluster separation. Experiments on three datasets show that the proposed LA-SCL works well on text classification of distinguishing a single label among multi-labels, outperforming the baseline supervised approaches. Our code is publicly available 1.

pdf bib
GrounDial: Human-norm Grounded Safe Dialog Response Generation
Siwon Kim | Shuyang Dai | Mohammad Kachuee | Shayan Ray | Tara Taghavi | Sungroh Yoon

Current conversational AI systems based on large language models (LLMs) are known to generate unsafe responses agreeing to offensive user input or including toxic content. Previous research aimed to alleviate the toxicity by fine-tuning LLM with manually annotated safe dialogue histories. However, the dependency on additional tuning requires substantial costs. To remove the dependency, we propose GrounDial, where response safety is achieved by grounding responses to commonsense social rules without requiring fine-tuning. A hybrid approach of in-context learning and human-norm-guided decoding of GrounDial enables the response to be quantitatively and qualitatively safer even without additional data or tuning.

pdf bib
Trainable Hard Negative Examples in Contrastive Learning for Unsupervised Abstractive Summarization
Haojie Zhuang | Wei Emma Zhang | Chang Dong | Jian Yang | Quan Sheng

Contrastive learning has demonstrated promising results in unsupervised abstractive summarization. However, existing methods rely on manually crafted negative examples, demanding substantial human effort and domain knowledge. Moreover, these human-generated negative examples may be poor in quality and lack adaptability during model training. To address these issues, we propose a novel approach that learns trainable negative examples for contrastive learning in unsupervised abstractive summarization, which eliminates the need for manual negative example design. Our framework introduces an adversarial optimization process between a negative example network and a representation network (including the summarizer and encoders). The negative example network is trained to synthesize hard negative examples that are close to the positive examples, driving the representation network to improve the quality of the generated summaries. We evaluate our method on two benchmark datasets for unsupervised abstractive summarization and observe significant performance improvements compared to strong baseline models.

pdf bib
Low-Resource Counterspeech Generation for Indic Languages: The Case of Bengali and Hindi
Mithun Das | Saurabh Pandey | Shivansh Sethi | Punyajoy Saha | Animesh Mukherjee

With the rise of online abuse, the NLP community has begun investigating the use of neural architectures to generate counterspeech that can “counter” the vicious tone of such abusive speech and dilute/ameliorate their rippling effect over the social network. However, most of the efforts so far have been primarily focused on English. To bridge the gap for low-resource languages such as Bengali and Hindi, we create a benchmark dataset of 5,062 abusive speech/counterspeech pairs, of which 2,460 pairs are in Bengali, and 2,602 pairs are in Hindi. We implement several baseline models considering various interlingual transfer mechanisms with different configurations to generate suitable counterspeech to set up an effective benchmark. We observe that the monolingual setup yields the best performance. Further, using synthetic transfer, language models can generate counterspeech to some extent; specifically, we notice that transferability is better when languages belong to the same language family.

pdf bib
Teaching Probabilistic Logical Reasoning to Transformers
Aliakbar Nafar | K. Brent Venable | Parisa Kordjamshidi

In this paper, we evaluate the capability of transformer-based language models in making inferences over uncertain text that includes uncertain rules of reasoning. We cover both Pre-trained Language Models (PLMs) and generative Large Language Models (LLMs). Our evaluation results show that both generations of language models struggle with reasoning over uncertain text. We propose a novel end-to-end fine-tuning approach, Probabilistic Constraint Training (PCT), that utilizes probabilistic logical rules as constraints in the fine-tuning phase without relying on these rules in the inference stage. To assess the effectiveness of PCT, we utilize the related corpora and, additionally, create a new and more challenging benchmark that, unlike the previous ones, uses instance-specific rules. Our study demonstrates that PCT improves the transformer-based language model’s intrinsic reasoning and makes their probabilistic logical reasoning process more explicit and explainable. Furthermore, PCT equips these models to effectively handle novel situations, including higher reasoning depth, new domains, and complex probabilistic structures.

pdf bib
On Measuring Context Utilization in Document-Level MT Systems
Wafaa Mohammed | Vlad Niculae

Document-level translation models are usually evaluated using general metrics such as BLEU, which are not informative about the benefits of context. Current work on context-aware evaluation, such as contrastive methods, only measure translation accuracy on words that need context for disambiguation. Such measures cannot reveal whether the translation model uses the correct supporting context. We propose to complement accuracy-based evaluation with measures of context utilization. We find that perturbation-based analysis (comparing models’ performance when provided with correct versus random context) is an effective measure of overall context utilization. For a finer-grained phenomenon-specific evaluation, we propose to measure how much the supporting context contributes to handling context-dependent discourse phenomena. We show that automatically-annotated supporting context gives similar conclusions to human-annotated context and can be used as alternative for cases where human annotations are not available. Finally, we highlight the importance of using discourse-rich datasets when assessing context utilization.

pdf bib
Solving NLP Problems through Human-System Collaboration: A Discussion-based Approach
Masahiro Kaneko | Graham Neubig | Naoaki Okazaki

Humans work together to solve common problems by having discussions, explaining, and agreeing or disagreeing with each other.Similarly, if a system can have discussions with human partners when solving tasks, it has the potential to improve the system’s performance and reliability.In previous research on explainability, it has only been possible for systems to make predictions and for humans to ask questions about them, rather than having a mutual exchange of opinions.This research aims to create a dataset and computational framework for systems that discuss and refine their predictions through dialogue. Through experiments, we show that the proposed system can have beneficial discussions with humans, improving the accuracy by up to 25 points on a natural language inference task.

pdf bib
Autoregressive Score Generation for Multi-trait Essay Scoring
Heejin Do | Yunsu Kim | Gary Lee

Recently, encoder-only pre-trained models such as BERT have been successfully applied in automated essay scoring (AES) to predict a single overall score. However, studies have yet to explore these models in multi-trait AES, possibly due to the inefficiency of replicating BERT-based models for each trait. Breaking away from the existing sole use of *encoder*, we propose an autoregressive prediction of multi-trait scores (ArTS), incorporating a *decoding* process by leveraging the pre-trained T5. Unlike prior regression or classification methods, we redefine AES as a score-generation task, allowing a single model to predict multiple scores. During decoding, the subsequent trait prediction can benefit by conditioning on the preceding trait scores. Experimental results proved the efficacy of ArTS, showing over 5% average improvements in both prompts and traits.

pdf bib
CMA-R: Causal Mediation Analysis for Explaining Rumour Detection
Lin Tian | Xiuzhen Zhang | Jey Han Lau

We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter.Interventions at the input and network level reveal the causal impacts of tweets and words in the model output.We find that our approach CMA-R – Causal Mediation Analysis for Rumour detection – identifies salient tweets that explain model predictions and show strong agreement with human judgements for critical tweets determining the truthfulness of stories.CMA-R can further highlight causally impactful words in the salient tweets, providing another layer of interpretability and transparency into these blackbox rumour detection systems. Code is available at:

pdf bib
Morphology Aware Source Term Masking for Terminology-Constrained NMT
Ander Corral | Xabier Saralegi

Terminology-constrained NMT systems facilitate the forced translation of domain-specific vocabulary. A notable method in this context is the “copy-and-inflect” approach, which appends the target term lemmas of constraints to their corresponding source terms in the input sentence. In this work, we propose a novel adaptation of the “copy-and-inflect” method, referred to as “morph-masking”. Our method involves masking the source terms of the constraints from the input sentence while retaining essential grammatical information. Our approach is based on the hypothesis that “copy-and-inflect” systems have access to both source and target terms, allowing them to generate the correct surface form of the constraint by either translating the source term itself or properly inflecting the target term lemma. Through extensive validation of our method in two translation directions with different levels of source morphological complexity, Basque to Spanish and English to German, we have demonstrated that “morph-masking” is capable of providing a harder constraint signal, resulting in a notable improvement over the “copy-and-inflect” method (up to 38% in term accuracy), especially in challenging constraint scenarios.

pdf bib
Improving Backchannel Prediction Leveraging Sequential and Attentive Context Awareness
Yo-Han Park | Wencke Liermann | Yong-Seok Choi | Kong Joo Lee

Backchannels, which refer to short and often affirmative or empathetic responses from a listener during a conversation, play a crucial role in effective communication. In this paper, we introduce CABP(Context-Aware Backchannel Prediction), a sequential and attentive context approach aimed at enhancing backchannel prediction performance. Additionally, CABP leverages the pretrained wav2vec model for encoding audio signal. Experimental results show that CABP performs better than context-free models, with performance improvements of 1.3% and 1.8% in Korean and English datasets, respectively. Furthermore, when utilizing the pretrained wav2vec model, CABP consistently demonstrates the best performance, achieving performance improvements of 4.4% and 3.1% in Korean and English datasets.

pdf bib
SENSE-LM : A Synergy between a Language Model and Sensorimotor Representations for Auditory and Olfactory Information Extraction
Cédric Boscher | Christine Largeron | Véronique Eglin | Elöd Egyed-Zsigmond

The five human senses – vision, taste, smell, hearing, and touch – are key concepts that shape human perception of the world. The extraction of sensory references (i.e., expressions that evoke the presence of a sensory experience) in textual corpus is a challenge of high interest, with many applications in various areas. In this paper, we propose SENSE-LM, an information extraction system tailored for the discovery of sensory references in large collections of textual documents. Based on the novel idea of combining the strength of large language models and linguistic resources such as sensorimotor norms, it addresses the task of sensory information extraction at a coarse-grained (sentence binary classification) and fine-grained (sensory term extraction) level.Our evaluation of SENSE-LM for two sensory functions, Olfaction and Audition, and comparison with state-of-the-art methods emphasize a significant leap forward in automating these complex tasks.

pdf bib
Analyzing the Role of Part-of-Speech in Code-Switching: A Corpus-Based Study
Jie Chi | Peter Bell

Code-switching (CS) is a common linguistic phenomenon wherein speakers fluidly transition between languages in conversation. While the cognitive processes driving CS remain a complex domain, earlier investigations have shed light on its multifaceted triggers. This study delves into the influence of Part-of-Speech (POS) on the propensity of bilinguals to engage in CS, employing a comprehensive analysis of Spanish-English and Mandarin-English corpora. Compared with prior research, our findings not only affirm the existence of a statistically significant connection between POS and the likelihood of CS across language pairs, but notably find this relationship exhibits its maximum strength in proximity to CS instances, progressively diminishing as tokens distance themselves from these CS points.

pdf bib
In-Contextual Gender Bias Suppression for Large Language Models
Daisuke Oba | Masahiro Kaneko | Danushka Bollegala

Despite their impressive performance in a wide range of NLP tasks, Large Language Models (LLMs) have been reported to encode worrying-levels of gender biases. Prior work has proposed debiasing methods that require human labelled examples, data augmentation and fine-tuning of LLMs, which are computationally costly. Moreover, one might not even have access to the model parameters for performing debiasing such as in the case of closed LLMs such as GPT-4. To address this challenge, we propose bias suppression that prevents biased generations of LLMs by simply providing textual preambles constructed from manually designed templates and real-world statistics, without accessing to model parameters. We show that, using CrowsPairs dataset, our textual preambles covering counterfactual statements can suppress gender biases in English LLMs such as LLaMA2. Moreover, we find that gender-neutral descriptions of gender-biased objects can also suppress their gender biases. Moreover, we show that bias suppression has acceptable adverse effect on downstream task performance with HellaSwag and COPA.

pdf bib
Parameter-Efficient Fine-Tuning: Is There An Optimal Subset of Parameters to Tune?
Max Ploner | Alan Akbik

The ever-growing size of pretrained language models (PLM) presents a significant challenge for efficiently fine-tuning and deploying these models for diverse sets of tasks within memory-constrained environments.In light of this, recent research has illuminated the possibility of selectively updating only a small subset of a model’s parameters during the fine-tuning process.Since no new parameters or modules are added, these methods retain the inference speed of the original model and come at no additional computational cost. However, an open question pertains to which subset of parameters should best be tuned to maximize task performance and generalizability. To investigate, this paper presents comprehensive experiments covering a large spectrum of subset selection strategies. We comparatively evaluate their impact on model performance as well as the resulting model’s capability to generalize to different tasks.Surprisingly, we find that the gains achieved in performance by elaborate selection strategies are, at best, marginal when compared to the outcomes obtained by tuning a random selection of parameter subsets. Our experiments also indicate that selection-based tuning impairs generalizability to new tasks.

pdf bib
Contextualized Topic Coherence Metrics
Hamed Rahimi | David Mimno | Jacob Hoover | Hubert Naacke | Camelia Constantin | Bernd Amann

This article proposes a new family of LLM-based topic coherence metrics called Contextualized Topic Coherence (CTC) and inspired by standard human topic evaluation methods. CTC metrics simulate human-centered coherence evaluation while maintaining the efficiency of other automated methods. We compare the performance of our CTC metrics and five other baseline metrics on seven topic models and show that CTC metrics better reflect human judgment, particularly for topics extracted from short text collections by avoiding highly scored topics that are meaningless to humans.

pdf bib
ProMISe: A Proactive Multi-turn Dialogue Dataset for Information-seeking Intent Resolution
Yash Butala | Siddhant Garg | Pratyay Banerjee | Amita Misra

Users of AI-based virtual assistants and search systems encounter challenges in articulating their intents while seeking information on unfamiliar topics, possibly due to complexity of the user’s intent or the lack of meta-information on the topic. We posit that an iterative suggested question-answering (SQA) conversation can improve the trade-off between the satisfaction of the user’s intent while keeping the information exchange natural and cognitive load of the interaction minimal on the users. In this paper, we evaluate a novel setting ProMISe by means of a sequence of interactions between a user, having a predefined information-seeking intent, and an agent that generates a set of SQA pairs at each step to aid the user to get closer to their intent. We simulate this two-player setting to create a multi-turn conversational dataset of SQAs and user choices (1025 dialogues comprising 4453 turns and 17812 SQAs) using human-feedback, chain-of-thought prompting and web-retrieval augmented large language models. We evaluate the quality of the SQs in the dataset on attributes such as diversity, specificity, grounding, etc, and benchmark the performance of different language models for the task of replicating user behavior.

pdf bib
CODET: A Benchmark for Contrastive Dialectal Evaluation of Machine Translation
Md Mahfuz Ibn Alam | Sina Ahmadi | Antonios Anastasopoulos

Neural machine translation (NMT) systems exhibit limited robustness in handling source-side linguistic variations. Their performance tends to degrade when faced with even slight deviations in language usage, such as different domains or variations introduced by second-language speakers. It is intuitive to extend this observation to encompass dialectal variations as well, but the work allowing the community to evaluate MT systems on this dimension is limited. To alleviate this issue, we compile and release CODET, a contrastive dialectal benchmark encompassing 891 different variations from twelve different languages. We also quantitatively demonstrate the challenges large MT models face in effectively translating dialectal variants. All the data and code have been released.

pdf bib
QAEVENT: Event Extraction as Question-Answer Pairs Generation
Milind Choudhary | Xinya Du

We propose a novel representation of document-level events as question and answer pairs (QAEVENT). Under this paradigm: (1) questions themselves can define argument roles without the need for predefined schemas, which will cover a comprehensive list of event arguments from the document; (2) it allows for more scalable and faster annotations from crowdworkers without linguistic expertise. Based on our new paradigm, we collect a novel and wide-coverage dataset. Our examinations show that annotations with the QA representations produce high-quality data for document-level event extraction, both in terms of human agreement level and high coverage of roles comparing to the pre-defined schema. We present and compare representative approaches for generating event question answer pairs on our benchmark.

pdf bib
Sequence Shortening for Context-Aware Machine Translation
Paweł Maka | Yusuf Semerci | Jan Scholtes | Gerasimos Spanakis

Context-aware Machine Translation aims to improve translations of sentences by incorporating surrounding sentences as context. Towards this task, two main architectures have been applied, namely single-encoder (based on concatenation) and multi-encoder models. In this study, we show that a special case of multi-encoder architecture, where the latent representation of the source sentence is cached and reused as the context in the next step, achieves higher accuracy on the contrastive datasets (where the models have to rank the correct translation among the provided sentences) and comparable BLEU and COMET scores as the single- and multi-encoder approaches. Furthermore, we investigate the application of Sequence Shortening to the cached representations. We test three pooling-based shortening techniques and introduce two novel methods - Latent Grouping and Latent Selecting, where the network learns to group tokens or selects the tokens to be cached as context. Our experiments show that the two methods achieve competitive BLEU and COMET scores and accuracies on the contrastive datasets to the other tested methods while potentially allowing for higher interpretability and reducing the growth of memory requirements with increased context size.

pdf bib
Jigsaw Pieces of Meaning: Modeling Discourse Coherence with Informed Negative Sample Synthesis
Shubhankar Singh

Coherence in discourse is fundamental for comprehension and perception. Much research on coherence modeling has focused on better model architectures and training setups optimizing on the permuted document task, where random permutations of a coherent document are considered incoherent. However, there’s very limited work on creating “informed” synthetic incoherent samples that better represent or mimic incoherence. We source a diverse positive corpus for local coherence and propose six rule-based methods leveraging information from Constituency trees, Part-of-speech, semantic overlap and more, for “informed” negative sample synthesis for better representation of incoherence. We keep a straightforward training setup for local coherence modeling by fine-tuning popular transformer models, and aggregate local scores for global coherence. We evaluate on a battery of independent downstream tasks to assess the impact of improved negative sample quality. We assert that a step towards optimality for coherence modeling requires better negative sample synthesis in tandem with model improvements.

pdf bib
Non-Exchangeable Conformal Language Generation with Nearest Neighbors
Dennis Ulmer | Chrysoula Zerva | Andre Martins

Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework to provide predictions imbued with statistical guarantees, however, its application to text generation is challenging since any i.i.d. assumptions are not realistic. In this paper, we bridge this gap by leveraging recent results on *non-exchangeable* conformal prediction, which still ensures bounds on coverage. The result, *non-exchangeable conformal nucleus sampling*, is a novel extension of the conformal prediction framework to generation based on nearest neighbors. Our method can be used post-hoc for an arbitrary model without extra training and supplies token-level, calibrated prediction sets equipped with statistical guarantees. Experiments in machine translation and language modeling show encouraging results in generation quality. By also producing tighter prediction sets with good coverage, we thus give a more theoretically principled way to perform sampling with conformal guarantees.

pdf bib
Evidentiality-aware Retrieval for Overcoming Abstractiveness in Open-Domain Question Answering
Yongho Song | Dahyun Lee | Myungha Jang | Seung-won Hwang | Kyungjae Lee | Dongha Lee | Jinyoung Yeo

The long-standing goal of dense retrievers in abtractive open-domain question answering (ODQA) tasks is to learn to capture evidence passages among relevant passages for any given query, such that the reader produce factually correct outputs from evidence passages. One of the key challenge is the insufficient amount of training data with the supervision of the answerability of the passages. Recent studies rely on iterative pipelines to annotate answerability using signals from the reader, but their high computational costs hamper practical applications. In this paper, we instead focus on a data-driven approach and propose Evidentiality-Aware Dense Passage Retrieval (EADPR), which leverages synthetic distractor samples to learn to discriminate evidence passages from distractors. We conduct extensive experiments to validate the effectiveness of our proposed method on multiple abstractive ODQA tasks.

pdf bib
Self-training Strategies for Sentiment Analysis: An Empirical Study
Haochen Liu | Sai Rallabandi | Yijing Wu | Parag Dakle | Preethi Raghavan

Sentiment analysis is a crucial task in natural language processing that involves identifying and extracting subjective sentiment from text. Self-training has recently emerged as an economical and efficient technique for developing sentiment analysis models by leveraging a small amount of labeled data and a large amount of unlabeled data. However, given a set of training data, how to utilize them to conduct self-training makes a significant difference in the final performance of the model. We refer to this methodology as the self-training strategy. In this paper, we present an empirical study of various self-training strategies for sentiment analysis. First, we investigate the influence of the self-training strategy and hyper-parameters on the performance of traditional small language models (SLMs) in various few-shot settings. Second, we also explore the feasibility of leveraging large language models (LLMs) to help self-training. We propose and empirically compare several self-training strategies with the intervention of LLMs. Extensive experiments are conducted on three real-world sentiment analysis datasets.

pdf bib
Language is All a Graph Needs
Ruosong Ye | Caiqi Zhang | Runhui Wang | Shuyuan Xu | Yongfeng Zhang

The emergence of large-scale pre-trained language models has revolutionized various AI research domains. Transformers-based Large Language Models (LLMs) have gradually replaced CNNs and RNNs to unify fields of computer vision and natural language processing. Compared with independent data like images, videos or texts, graphs usually contain rich structural and relational information. Meanwhile, languages, especially natural language, being one of the most expressive mediums, excels in describing complex structures. However, existing work on incorporating graph problems into the generative language modeling framework remains very limited. Considering the rising prominence of LLMs, it becomes essential to explore whether LLMs can also replace GNNs as the foundation model for graphs. In this paper, we propose InstructGLM (Instruction-finetuned Graph Language Model) with highly scalable prompts based on natural language instructions. We use natural language to describe multi-scale geometric structure of the graph and then instruction finetune an LLM to perform graph tasks, which enables Generative Graph Learning. Our method surpasses all GNN baselines on ogbn-arxiv, Cora and PubMed datasets, underscoring its effectiveness and sheds light on generative LLMs as new foundation model for graph machine learning. Our code is available at

pdf bib
Unraveling the Dynamics of Semi-Supervised Hate Speech Detection: The Impact of Unlabeled Data Characteristics and Pseudo-Labeling Strategies
Florian Ludwig | Klara Dolos | Ana Alves-Pinto | Torsten Zesch

Despite advances in machine learning based hate speech detection, the need for larges amounts of labeled training data for state-of-the-art approaches remains a challenge for their application. Semi-supervised learning addresses this problem by leveraging unlabeled data and thus reducing the amount of annotated data required. Underlying this approach is the assumption that labeled and unlabeled data follow similar distributions. This assumption however may not always hold, with consequences for real world applications. We address this problem by investigating the dynamics of pseudo-labeling, a commonly employed form of semi-supervised learning, in the context of hate speech detection. Concretely we analysed the influence of data characteristics and of two strategies for selecting pseudo-labeled samples: threshold- and ratio-based. The results show that the influence of data characteristics on the pseudo-labeling performances depends on other factors, such as pseudo-label selection strategies or model biases. Furthermore, the effectiveness of pseudo-labeling in classification performance is determined by the interaction between the number, hate ratio and accuracy of the selected pseudo-labels. Analysis of the results suggests an advantage of the threshold-based approach when labeled and unlabeled data arise from the same domain, whilst the ratio-based approach may be recommended in the opposite situation.

pdf bib
When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
Orion Weller | Kyle Lo | David Wadden | Dawn Lawrie | Benjamin Van Durme | Arman Cohan | Luca Soldaini

Using large language models (LMs) for query or document expansion can improve generalization in information retrieval. However, it is unknown whether these techniques are universally beneficial or only effective in specific settings, such as for particular retrieval models, dataset domains, or query types. To answer this, we conduct the first comprehensive analysis of LM-based expansion. We find that there exists a strong negative correlation between retriever performance and gains from expansion: expansion improves scores for weaker models, but generally harms stronger models. We show this trend holds across a set of eleven expansion techniques, twelve datasets with diverse distribution shifts, and twenty-four retrieval models. Through qualitative error analysis, we hypothesize that although expansions provide extra information (potentially improving recall), they add additional noise that makes it difficult to discern between the top relevant documents (thus introducing false positives). Our results suggest the following recipe: use expansions for weaker models or when the target dataset significantly differs from training corpus in format; otherwise, avoid expansions to keep the relevance signal clear.

pdf bib
Can Large Language Models Understand Context?
Yilun Zhu | Joel Moniz | Shruti Bhargava | Jiarui Lu | Dhivya Piraviperumal | Site Li | Yuan Zhang | Hong Yu | Bo-Hsiang Tseng

Understanding context is key to understanding human language, an ability which Large Language Models (LLMs) have been increasingly seen to demonstrate to an impressive extent. However, though the evaluation of LLMs encompasses various domains within the realm of Natural Language Processing, limited attention has been paid to probing their linguistic capability of understanding contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. This benchmark comprises of four distinct tasks and nine datasets, all featuring prompts designed to assess the models’ ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models. Second, as LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.

pdf bib
Let’s Negotiate! A Survey of Negotiation Dialogue Systems
Haolan Zhan | Yufei Wang | Zhuang Li | Tao Feng | Yuncheng Hua | Suraj Sharma | Lizhen Qu | Zhaleh Semnani Azad | Ingrid Zukerman | Reza Haf

Negotiation is a crucial ability in human communication. Recently, there has been a resurgent research interest in negotiation dialogue systems, whose goal is to create intelligent agents that can assist people in resolving conflicts or reaching agreements. Although there have been many explorations into negotiation dialogue systems, a systematic review of this task has not been performed to date. We aim to fill this gap by investigating recent studies in the field of negotiation dialogue systems, and covering benchmarks, evaluations and methodologies within the literature. We also discuss potential future directions, including multi-modal, multi-party and cross-cultural negotiation scenarios. Our goal is to provide the community with a systematic overview of negotiation dialogue systems and to inspire future research.

pdf bib
Towards Understanding Counseling Conversations: Domain Knowledge and Large Language Models
Younghun Lee | Dan Goldwasser | Laura Schwab Reese

Understanding the dynamics of counseling conversations is an important task, yet it is a challenging NLP problem regardless of the recent advance of Transformer-based pre-trained language models. This paper proposes a systematic approach to examine the efficacy of domain knowledge and large language models (LLMs) in better representing conversations between a crisis counselor and a help seeker. We empirically show that state-of-the-art language models such as Transformer-based models and GPT models fail to predict the conversation outcome. To provide richer context to conversations, we incorporate human-annotated domain knowledge and LLM-generated features; simple integration of domain knowledge and LLM features improves the model performance by approximately 15%. We argue that both domain knowledge and LLM-generated features can be exploited to better characterize counseling conversations when they are used as an additional context to conversations.

pdf bib
Better Explain Transformers by Illuminating Important Information
Linxin Song | Yan Cui | Ao Luo | Freddy Lecue | Irene Li

Transformer-based models excel in various natural language processing (NLP) tasks, attracting countless efforts to explain their inner workings. Prior methods explain Transformers by focusing on the raw gradient and attention as token attribution scores, where non-relevant information is often considered during explanation computation, resulting in confusing results. In this work, we propose highlighting the important information and eliminating irrelevant information by a refined information flow on top of the layer-wise relevance propagation (LRP) method. Specifically, we consider identifying syntactic and positional heads as important attention heads and focus on the relevance obtained from these important heads. Experimental results demonstrate that irrelevant information does distort output attribution scores and then should be masked during explanation computation. Compared to eight baselines on both classification and question-answering datasets, our method consistently outperforms with over 3% to 33% improvement on explanation metrics, providing superior explanation performance. Our anonymous code repository is available at:

pdf bib
Testing the Depth of ChatGPT’s Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5’s Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking
David Bayani

In the months since its release, ChatGPT and its underlying model, GPT3.5, have garnered massive attention, due to their potent mix of capability and accessibility. While a niche industry of papers have emerged examining the scope of capabilities these models possess, language — whether natural or stylized like code — has been the vehicle to exchange information with the network. Drawing inspiration from the multi-modal knowledge we’d expect an agent with true understanding to possess, we examine GPT3.5’s aptitude for visual tasks, where the inputs feature ASCII-art without overt distillation into a lingual summary. In particular, we scrutinize its performance on carefully designed image recognition and generation tasks. An extended version of this write-up is available at: .

pdf bib
Cross-lingual Editing in Multilingual Language Models
Himanshu Beniwal | Kowsik D | Mayank Singh

The training of large language models (LLMs) necessitates substantial data and computational resources, and updating outdated LLMs entails significant efforts and resources. While numerous model editing techniques (METs) have emerged to efficiently update model outputs without retraining, their effectiveness in multilingual LLMs, where knowledge is stored in diverse languages, remains an underexplored research area. This research paper introduces the cross-lingual model editing (XME) paradigm, wherein a fact is edited in one language, and the subsequent update propagation is observed across other languages. To investigate the XME paradigm, we conducted experiments using BLOOM, mBERT, and XLM-RoBERTa using the two writing scripts: Latin (English, French, and Spanish) and Indic (Hindi, Gujarati, and Bengali). The results reveal notable performance limitations of state-of-the-art METs under the XME setting, mainly when the languages involved belong to two distinct script families. These findings highlight the need for further research and development of XME techniques to address these challenges. For more comprehensive information, the dataset used in this research and the associated code are publicly available at the following [URL](

pdf bib
Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference
Parsa Kavehzadeh | Mojtaba Valipour | Marzieh Tahaei | Ali Ghodsi | Boxing Chen | Mehdi Rezagholizadeh

Large language models (LLMs) have revolutionized natural language processing (NLP) by excelling at understanding and generating human-like text. However, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference by leveraging the modularity in networks and sorting sub-models based on computation/accuracy in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any Pre-Training and by only replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference. We show that this approach can unlock the potential of intermediate layers of transformers in generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. The efficacy of our proposed method was demonstrated by applying it to tune LLaMA 2 13B on the Stanford Alpaca dataset for instruction following and TriviaQA for closed-book question answering. Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit), all achieved with very efficient tuning and without additional memory usage during inference.

pdf bib
AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents
Abraham Owodunni | Aditya Yadavalli | Chris Emezue | Tobi Olatunji | Clinton Mbataku

Despite advancements in speech recognition, accented speech remains challenging. While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). Our exploratory analysis of speech embeddings representing 100+ African accents reveals interesting spatial accent relationships highlighting geographic and genealogical similarities, capturing consistent phonological, and morphological regularities, all learned empirically from speech. Furthermore, we discover accent relationships previously uncharacterized by the Ethnologue. Through empirical evaluation, we demonstrate the effectiveness of AccentFold by showing that, for out-of-distribution (OOD) accents, sampling accent subsets for training based on AccentFold information outperforms strong baselines a relative WER improvement of 4.6%. AccentFold presents a promising approach for improving ASR performance on accented speech, particularly in the context of African accents, where data scarcity and budget constraints pose significant challenges. Our findings emphasize the potential of leveraging linguistic relationships to improve zero-shot ASR adaptation to target accents.

pdf bib
Hierarchical and Dynamic Prompt Compression for Efficient Zero-shot API Usage
Yichen Jiang | Marco Vecchio | Mohit Bansal | Anders Johannsen

Long prompts present a significant challenge for practical LLM-based systems that need to operate with low latency and limited resources. We investigate prompt compression for zero-shot dialogue systems that learn to use unseen APIs directly in-context from their documentation, which may take up hundreds of prompt tokens per API. We start from a recently introduced approach (Mu et al., 2023) that learns to compress the prompt into a few “gist token” activations during finetuning. However, this simple idea is ineffective in compressing API documentation, resulting in low accuracy compared to the baseline using an uncompressed prompt. In this work, we introduce two major improvements. First, we specialize gist tokens for different hierarchies within an API: we use one Gistarg token for compressing an argument and one Gistvalue token for compressing an acceptable value of a categorical argument. We then dynamically reveal Gistvalue tokens only when they are needed. Second, we add a reconstruction loss to predict the API documentation from the gist tokens. On multiple API-calling tasks, our proposed system keeps the simplicity, efficiency, and large compression factor (20x on SGD) of the gist token approach while achieving significantly better accuracy.

pdf bib
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim | Seunghyun Yoon | Trung Bui | Handong Zhao | Quan Tran | Franck Dernoncourt | Jaewoo Kang

Contrastive language-image pre-training (CLIP) models have demonstrated considerable success across various vision-language tasks, such as text-to-image retrieval, where the model is required to effectively process natural language input to produce an accurate visual output. However, current models still face limitations in dealing with linguistic variations in input queries, such as paraphrases, making it challenging to handle a broad range of user queries in real-world applications. In this study, we introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases. Our approach involves a two-step paraphrase generation process, where we automatically create two categories of paraphrases from web-scale image captions by leveraging large language models. Subsequently, we fine-tune the CLIP text encoder using these generated paraphrases while freezing the image encoder. Our resulting model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks, including paraphrased retrieval (with rank similarity scores improved by up to 7.6% and 9.6%), Visual Genome Relation and Attribution, as well as seven semantic textual similarity tasks.

pdf bib
Generative Interpretation: Toward Human-Like Evaluation for Educational Question-Answer Pair Generation
Hyeonseok Moon | Jaewook Lee | Sugyeong Eo | Chanjun Park | Jaehyung Seo | Heuiseok Lim

Educational question-answer generation has been extensively researched owing to its practical applicability. However, we have identified a persistent challenge concerning the evaluation of such systems. Existing evaluation methods often fail to produce objective results and instead exhibit a bias towards favoring high similarity to the ground-truth question-answer pairs. In this study, we demonstrate that these evaluation methods yield low human alignment and propose an alternative approach called Generative Interpretation (GI) to achieve more objective evaluations. Through experimental analysis, we reveal that GI outperforms existing evaluation methods in terms of human alignment, and even shows comparable performance with GPT3.5, only with BART-large.

pdf bib
Dive into the Chasm: Probing the Gap between In- and Cross-Topic Generalization
Andreas Waldis | Yufang Hou | Iryna Gurevych

Pre-trained language models (PLMs) perform well in In-Topic setups, where training and testing data come from the same topics. However, they face challenges in Cross-Topic scenarios where testing data is derived from distinct topics. This paper analyzes various PLMs with three probing-based experiments to better understand the reasons behind such generalization gaps. For the first time, we demonstrate that the extent of these generalization gaps and the sensitivity to token-level interventions vary significantly across PLMs. By evaluating large language models (LLMs), we show the usefulness of our analysis for these recent models. Overall, we obs