uppdf
bib
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Andreas Vlachos
|
Isabelle Augenstein
pdf
bib
abs
PiC: A Phrase-in-Context Dataset for Phrase Understanding and Semantic Search
Thang Pham
|
Seunghyun Yoon
|
Trung Bui
|
Anh Nguyen
While contextualized word embeddings have been a de-facto standard, learning contextualized phrase embeddings is less explored and being hindered by the lack of a human-annotated benchmark that tests machine understanding of phrase semantics given a context sentence or paragraph (instead of phrases alone). To fill this gap, we propose PiC—a dataset of ∼28K of noun phrases accompanied by their contextual Wikipedia pages and a suite of three tasks for training and evaluating phrase embeddings. Training on PiC improves ranking-models’ accuracy and remarkably pushes span selection (SS) models (i.e., predicting the start and end index of the target phrase) near human accuracy, which is 95% Exact Match (EM) on semantic search given a query phrase and a passage. Interestingly, we find evidence that such impressive performance is because the SS models learn to better capture the common meaning of a phrase regardless of its actual context. SotA models perform poorly in distinguishing two senses of the same phrase in two contexts (∼60% EM) and in estimating the similarity between two different phrases in the same context (∼70% EM).
pdf
bib
abs
Enhancing Dialogue Summarization with Topic-Aware Global- and Local- Level Centrality
Xinnian Liang
|
Shuangzhi Wu
|
Chenhao Cui
|
Jiaqi Bai
|
Chao Bian
|
Zhoujun Li
Dialogue summarization aims to condense a given dialogue into a simple and focused summary text. Typically, both the roles’ viewpoints and conversational topics change in the dialogue stream. Thus how to effectively handle the shifting topics and select the most salient utterance becomes one of the major challenges of this task. In this paper, we propose a novel topic-aware Global-Local Centrality (GLC) model to help select the salient context from all sub-topics. The centralities are constructed in both global level and local level. The global one aims to identify vital sub-topics in the dialogue and the local one aims to select the most important context in each sub-topic. Specifically, the GLC collects sub-topic based on the utterance representations. And each utterance is aligned with one sub-topic. Based on the sub-topics, the GLC calculates global- and local-level centralities. Finally, we combine the two to guide the model to capture both salient context and sub-topics when generating summaries. Experimental results show that our model outperforms strong baselines on three public dialogue summarization datasets: CSDS, MC, and SAMSUM. Further analysis demonstrates that our GLC can exactly identify vital contents from sub-topics.
pdf
bib
abs
Exploiting Summarization Data to Help Text Simplification
Renliang Sun
|
Zhixian Yang
|
Xiaojun Wan
One of the major problems with text simplification is the lack of high-quality data. The sources of simplification datasets are limited to Wikipedia and Newsela, restricting further development of this field. In this paper, we analyzed the similarity between text summarization and text simplification and exploited summarization data to help simplify. First, we proposed an alignment algorithm to extract sentence pairs from summarization datasets. Then, we designed four attributes to characterize the degree of simplification and proposed a method to filter suitable pairs. We named these pairs Sum4Simp (S4S). Next, we conducted human evaluations to show that S4S is high-quality and compared it with a real simplification dataset. Finally, we conducted experiments to illustrate that the S4S can improve the performance of several mainstream simplification models, especially in low-resource scenarios.
pdf
bib
abs
Shironaam: Bengali News Headline Generation using Auxiliary Information
Abu Ubaida Akash
|
Mir Tafseer Nayeem
|
Faisal Tareque Shohan
|
Tanvir Islam
Automatic headline generation systems have the potential to assist editors in finding interesting headlines to attract visitors or readers. However, the performance of headline generation systems remains challenging due to the unavailability of sufficient parallel data for low-resource languages like Bengali and the lack of ideal approaches to develop a system for headline generation using pre-trained language models, especially for long news articles. To address these challenges, we present Shironaam, a large-scale dataset in Bengali containing over 240K news article-headline pairings with auxiliary data such as image captions, topic words, and category information. Unlike other headline generation models, this paper uses this auxiliary information to better model this task. Furthermore, we utilize the contextualized language models to design encoder-decoder model for Bengali news headline generation and follow a simple yet cost-effective coarse-to-fine approach using topic-words to retrieve important sentences considering the fixed length requirement of the pre-trained language models. Finally, we conduct extensive experiments on our dataset containing news articles of 13 different categories to demonstrate the effectiveness of incorporating auxiliary information and evaluate our system on a wide range of metrics. The experimental results demonstrate that our methods bring significant improvements (i.e., 3 to 10 percentage points across all evaluation metrics) over the baselines. Also to illustrate the utility and robustness, we report experimental results in few-shot and non-few-shot settings.
pdf
bib
abs
PCC: Paraphrasing with Bottom-k Sampling and Cyclic Learning for Curriculum Data Augmentation
Hongyuan Lu
|
Wai Lam
Curriculum Data Augmentation (CDA) improves neural models by presenting synthetic data with increasing difficulties from easy to hard. However, traditional CDA simply treats the ratio of word perturbation as the difficulty measure and goes through the curriculums only once. This paper presents PCC: Paraphrasing with Bottom-k Sampling and Cyclic Learning for Curriculum Data Augmentation, a novel CDA framework via paraphrasing, which exploits the textual paraphrase similarity as the curriculum difficulty measure. We propose a curriculum-aware paraphrase generation module composed of three units: a paraphrase candidate generator with bottom-k sampling, a filtering mechanism and a difficulty measure. We also propose a cyclic learning strategy that passes through the curriculums multiple times. The bottom-k sampling is proposed to generate super-hard instances for the later curriculums. Experimental results on few-shot text classification as well as dialogue generation indicate that PCC surpasses competitive baselines. Human evaluation and extensive case studies indicate that bottom-k sampling effectively generates super-hard instances, and PCC significantly improves the baseline dialogue agent.
pdf
bib
abs
A Two-Sided Discussion of Preregistration of NLP Research
Anders Søgaard
|
Daniel Hershcovich
|
Miryam de Lhoneux
Van Miltenburg et al. (2021) suggest NLP research should adopt preregistration to prevent fishing expeditions and to promote publication of negative results. At face value, this is a very reasonable suggestion, seemingly solving many methodological problems with NLP research. We discuss pros and cons - some old, some new: a) Preregistration is challenged by the practice of retrieving hypotheses after the results are known; b) preregistration may bias NLP toward confirmatory research; c) preregistration must allow for reclassification of research as exploratory; d) preregistration may increase publication bias; e) preregistration may increase flag-planting; f) preregistration may increase p-hacking; and finally, g) preregistration may make us less risk tolerant. We cast our discussion as a dialogue, presenting both sides of the debate.
pdf
bib
abs
WinoDict: Probing language models for in-context word acquisition
Julian Martin Eisenschlos
|
Jeremy R. Cole
|
Fangyu Liu
|
William W. Cohen
We introduce a new in-context learning paradigm to measure Large Language Models’ (LLMs) ability to learn novel words during inference. In particular, we rewrite Winograd-style co-reference resolution problems by replacing the key concept word with a synthetic but plausible word that the model must understand to complete the task. Solving this task requires the model to make use of the dictionary definition of the new word given in the prompt. This benchmark addresses word acquisition, one important aspect of the diachronic degradation known to afflict LLMs. As LLMs are frozen in time at the moment they are trained, they are normally unable to reflect the way language changes over time. We show that the accuracy of LLMs compared to the original Winograd tasks decreases radically in our benchmark, thus identifying a limitation of current models and providing a benchmark to measure future improvements in LLMs ability to do in-context learning.
pdf
bib
abs
Sentiment as an Ordinal Latent Variable
Niklas Stoehr
|
Ryan Cotterell
|
Aaron Schein
Sentiment analysis has become a central tool in various disciplines outside of natural language processing. In particular in applied and domain-specific settings with strong requirements for interpretable methods, dictionary-based approaches are still a popular choice. However, existing dictionaries are often limited in coverage, static once annotation is completed and sentiment scales differ widely; some are discrete others continuous. We propose a Bayesian generative model that learns a composite sentiment dictionary as an interpolation between six existing dictionaries with different scales. We argue that sentiment is a latent concept with intrinsically ranking-based characteristics — the word “excellent” may be ranked more positive than “great” and “okay”, but it is hard to express how much more exactly. This prompts us to enforce an ordinal scale of ordered discrete sentiment values in our dictionary. We achieve this through an ordering transformation in the priors of our model. We evaluate the model intrinsically by imputing missing values in existing dictionaries. Moreover, we conduct extrinsic evaluations through sentiment classification tasks. Finally, we present two extension: first, we present a method to augment dictionary-based approaches with word embeddings to construct sentiment scales along new semantic axes. Second, we demonstrate a Latent Dirichlet Allocation-inspired variant of our model that learns document topics that are ordered by sentiment.
pdf
bib
abs
Nationality Bias in Text Generation
Pranav Narayanan Venkit
|
Sanjana Gautam
|
Ruchi Panchanadikar
|
Ting-Hao Huang
|
Shomir Wilson
Little attention is placed on analyzing nationality bias in language models, especially when nationality is highly used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country’s economic status impacts the sentiment of the stories. To reduce the propagation of biases through large language models (LLM), we explore the debiasing method of adversarial triggering. Our results show that GPT-2 demonstrates significant bias against countries with lower internet users, and adversarial triggering effectively reduces the same.
pdf
bib
abs
Investigating data partitioning strategies for crosslinguistic low-resource ASR evaluation
Zoey Liu
|
Justin Spence
|
Emily Prud’hommeaux
Many automatic speech recognition (ASR) data sets include a single pre-defined test set consisting of one or more speakers whose speech never appears in the training set. This “hold-speaker(s)-out” data partitioning strategy, however, may not be ideal for data sets in which the number of speakers is very small. This study investigates ten different data split methods for five languages with minimal ASR training resources. We find that (1) model performance varies greatly depending on which speaker is selected for testing; (2) the average word error rate (WER) across all held-out speakers is comparable not only to the average WER over multiple random splits but also to any given individual random split; (3) WER is also generally comparable when the data is split heuristically or adversarially; (4) utterance duration and intensity are comparatively more predictive factors of variability regardless of the data split. These results suggest that the widely used hold-speakers-out approach to ASR data partitioning can yield results that do not reflect model performance on unseen data or speakers. Random splits can yield more reliable and generalizable estimates when facing data sparsity.
pdf
bib
abs
Shortcomings of Question Answering Based Factuality Frameworks for Error Localization
Ryo Kamoi
|
Tanya Goyal
|
Greg Durrett
Despite recent progress in abstractive summarization, models often generate summaries with factual errors. Numerous approaches to detect these errors have been proposed, the most popular of which are question answering (QA)-based factuality metrics. These have been shown to work well at predicting summary-level factuality and have potential to localize errors within summaries, but this latter capability has not been systematically evaluated in past research. In this paper, we conduct the first such analysis and find that, contrary to our expectations, QA-based frameworks fail to correctly identify error spans in generated summaries and are outperformed by trivial exact match baselines. Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules. Moreover, even human-in-the-loop question generation cannot easily offset these problems. Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
pdf
bib
abs
Socratic Question Generation: A Novel Dataset, Models, and Evaluation
Beng Heng Ang
|
Sujatha Das Gollapalli
|
See-Kiong Ng
Socratic questioning is a form of reflective inquiry often employed in education to encourage critical thinking in students, and to elicit awareness of beliefs and perspectives in a subject during therapeutic counseling. Specific types of Socratic questions are employed for enabling reasoning and alternate views against the context of individual personal opinions on a topic. Socratic contexts are different from traditional question generation contexts where “answer-seeking” questions are generated against a given formal passage on a topic, narrative stories or conversations. We present SocratiQ, the first large dataset of 110K (question, context) pairs for enabling studies on Socratic Question Generation (SoQG). We provide an in-depth study on the various types of Socratic questions and present models for generating Socratic questions against a given context through prompt tuning. Our automated and human evaluation results demonstrate that our SoQG models can produce realistic, type-sensitive, human-like Socratic questions enabling potential applications in counseling and coaching.
pdf
bib
abs
Do we need Label Regularization to Fine-tune Pre-trained Language Models?
Ivan Kobyzev
|
Aref Jafari
|
Mehdi Rezagholizadeh
|
Tianda Li
|
Alan Do-Omri
|
Peng Lu
|
Pascal Poupart
|
Ali Ghodsi
Knowledge Distillation (KD) is a prominent neural model compression technique that heavily relies on teacher network predictions to guide the training of a student model. Considering the ever-growing size of pre-trained language models (PLMs), KD is often adopted in many NLP tasks involving PLMs. However, it is evident that in KD, deploying the teacher network during training adds to the memory and computational requirements of training. In the computer vision literature, the necessity of the teacher network is put under scrutiny by showing that KD is a label regularization technique that can be replaced with lighter teacher-free variants such as the label-smoothing technique. However, to the best of our knowledge, this issue is not investigated in NLP. Therefore, this work concerns studying different label regularization techniques and whether we actually need them to improve the fine-tuning of smaller PLM networks on downstream tasks. In this regard, we did a comprehensive set of experiments on different PLMs such as BERT, RoBERTa, and GPT with more than 600 distinct trials and ran each configuration five times. This investigation led to a surprising observation that KD and other label regularization techniques do not play any meaningful role over regular fine-tuning when the student model is pre-trained. We further explore this phenomenon in different settings of NLP and computer vision tasks and demonstrate that pre-training itself acts as a kind of regularization, and additional label regularization is unnecessary.
pdf
bib
abs
COVID-VTS: Fact Extraction and Verification on Short Video Platforms
Fuxiao Liu
|
Yaser Yacoob
|
Abhinav Shrivastava
We introduce a new benchmark, COVID-VTS, for fact-checking multi-modal information involving short-duration videos with COVID19- focused information from both the real world and machine generation. We propose, TwtrDetective, an effective model incorporating cross-media consistency checking to detect token-level malicious tampering in different modalities, and generate explanations. Due to the scarcity of training data, we also develop an efficient and scalable approach to automatically generate misleading video posts by event manipulation or adversarial matching. We investigate several state-of-the-art models and demonstrate the superiority of TwtrDetective.
pdf
bib
abs
Multimodal Graph Transformer for Multimodal Question Answering
Xuehai He
|
Xin Wang
Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that requires performing reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, to the vanilla self-attention as effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inferring ability and helps align features from different modalities. We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
pdf
bib
abs
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies
Md Rizwan Parvez
|
Jianfeng Chi
|
Wasi Uddin Ahmad
|
Yuan Tian
|
Kai-Wei Chang
Prior studies in privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from a policy document given a user query. Existing labeled datasets are heavily imbalanced (only a few relevant segments), limiting the QA performance in this domain. In this paper, we develop a data augmentation framework based on ensembling retriever models that captures the relevant text segments from unlabeled policy documents and expand the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascaded them with noise reduction oracles. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach.
pdf
bib
abs
FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric
Maximillian Chen
|
Caitlyn Chen
|
Xiao Yu
|
Zhou Yu
Syntax is a fundamental component of language, yet few metrics have been employed to capture syntactic similarity or coherence at the utterance- and document-level. The existing standard document-level syntactic similarity metric is computationally expensive and performs inconsistently when faced with syntactically dissimilar documents. To address these challenges, we present FastKASSIM, a metric for utterance- and document-level syntactic similarity which pairs and averages the most similar constituency parse trees between a pair of documents based on tree kernels. FastKASSIM is more robust to syntactic dissimilarities and runs up to to 5.32 times faster than its predecessor over documents in the r/ChangeMyView corpus. FastKASSIM’s improvements allow us to examine hypotheses in two settings with large documents. We find that syntactically similar arguments on r/ChangeMyView tend to be more persuasive, and that syntax is predictive of authorship attribution in the Australian High Court Judgment corpus.
pdf
bib
abs
Friend-training: Learning from Models of Different but Related Tasks
Mian Zhang
|
Lifeng Jin
|
Linfeng Song
|
Haitao Mi
|
Xiabing Zhou
|
Dong Yu
Current self-training methods such as standard self-training, co-training, tri-training, and others often focus on improving model performance on a single task, utilizing differences in input features, model architectures, and training processes. However, many tasks in natural language processing are about different but related aspects of language, and models trained for one task can be great teachers for other related tasks. In this work, we propose friend-training, a cross-task self-training framework, where models trained to do different tasks are used in an iterative training, pseudo-labeling, and retraining process to help each other for better selection of pseudo-labels. With two dialogue understanding tasks, conversational semantic role labeling and dialogue rewriting, chosen for a case study, we show that the models trained with the friend-training framework achieve the best performance compared to strong baselines.
pdf
bib
abs
Understanding Transformer Memorization Recall Through Idioms
Adi Haviv
|
Ido Cohen
|
Jacob Gidron
|
Roei Schuster
|
Yoav Goldberg
|
Mor Geva
To produce accurate predictions, language models (LMs) must balance between generalization and memorization. Yet, little is known about the mechanism by which transformer LMs employ their memorization capacity. When does a model decide to output a memorized phrase, and how is this phrase then retrieved from memory? In this work, we offer the first methodological framework for probing and characterizing recall of memorized sequences in transformer LMs. First, we lay out criteria for detecting model inputs that trigger memory recall, and propose idioms as inputs that typically fulfill these criteria. Next, we construct a dataset of English idioms and use it to compare model behavior on memorized vs. non-memorized inputs. Specifically, we analyze the internal prediction construction process by interpreting the model’s hidden representations as a gradual refinement of the output probability distribution. We find that across different model sizes and architectures, memorized predictions are a two-step process: early layers promote the predicted token to the top of the output distribution, and upper layers increase model confidence. This suggests that memorized information is stored and retrieved in the early layers of the network. Last, we demonstrate the utility of our methodology beyond idioms in memorized factual statements. Overall, our work makes a first step towards understanding memory recall, and provides a methodological basis for future studies of transformer memorization.
pdf
bib
abs
A Discerning Several Thousand Judgments: GPT-3 Rates the Article + Adjective + Numeral + Noun Construction
Kyle Mahowald
Knowledge of syntax includes knowledge of rare, idiosyncratic constructions. LLMs must overcome frequency biases in order to master such constructions. In this study, I prompt GPT-3 to give acceptability judgments on the English-language Article + Adjective + Numeral + Noun construction (e.g., “a lovely five days”). I validate the prompt using the CoLA corpus of acceptability judgments and then zero in on the AANN construction. I compare GPT- 3’s judgments to crowdsourced human judgments on a subset of sentences. GPT-3’s judgments are broadly similar to human judgments and generally align with proposed constraints in the literature but, in some cases, GPT-3’s judgments and human judgments diverge from the literature and from each other.
pdf
bib
abs
Triple-Hybrid Energy-based Model Makes Better Calibrated Natural Language Understanding Models
Haotian Xu
|
Yingying Zhang
Though pre-trained language models achieve notable success in many applications, it’s usually controversial for over-confident predictions. Specifically, the in-distribution (ID) miscalibration and out-of-distribution (OOD) detection are main concerns. Recently, some works based on energy-based models (EBM) have shown great improvements on both ID calibration and OOD detection for images. However, it’s rarely explored in natural language understanding tasks due to the non-differentiability of text data which makes it more difficult for EBM training. In this paper, we first propose a triple-hybrid EBM which combines the benefits of classifier, conditional generative model and marginal generative model altogether. Furthermore, we leverage contrastive learning to approximately train the proposed model, which circumvents the non-differentiability issue of text data. Extensive experiments have been done on GLUE and six other multiclass datasets in various domains. Our model outperforms previous methods in terms of ID calibration and OOD detection by a large margin while maintaining competitive accuracy.
pdf
bib
abs
A weakly supervised textual entailment approach to zero-shot text classification
Marc Pàmies
|
Joan Llop
|
Francesco Multari
|
Nicolau Duran-Silva
|
César Parra-Rojas
|
Aitor Gonzalez-Agirre
|
Francesco Alessandro Massucci
|
Marta Villegas
Zero-shot text classification is a widely studied task that deals with a lack of annotated data. The most common approach is to reformulate it as a textual entailment problem, enabling classification into unseen classes. This work explores an effective approach that trains on a weakly supervised dataset generated from traditional classification data. We empirically study the relation between the performance of the entailment task, which is used as a proxy, and the target zero-shot text classification task. Our findings reveal that there is no linear correlation between both tasks, to the extent that it can be detrimental to lengthen the fine-tuning process even when the model is still learning, and propose a straightforward method to stop training on time. As a proof of concept, we introduce a domain-specific zero-shot text classifier that was trained on Microsoft Academic Graph data. The model, called SCIroShot, achieves state-of-the-art performance in the scientific domain and competitive results in other areas. Both the model and evaluation benchmark are publicly available on HuggingFace and GitHub.
pdf
bib
abs
Fair Enough: Standardizing Evaluation and Model Selection for Fairness Research in NLP
Xudong Han
|
Timothy Baldwin
|
Trevor Cohn
Modern NLP systems exhibit a range of biases, which a growing literature on model debiasing attempts to correct. However, current progress is hampered by a plurality of definitions of bias, means of quantification, and oftentimes vague relation between debiasing algorithms and theoretical measures of bias. This paper seeks to clarify the current situation and plot a course for meaningful progress in fair learning, with two key contributions: (1) making clear inter-relations among the current gamut of methods, and their relation to fairness theory; and (2) addressing the practical problem of model selection, which involves a trade-off between fairness and accuracy and has led to systemic issues in fairness research. Putting them together, we make several recommendations to help shape future work.
pdf
bib
abs
CHARD: Clinical Health-Aware Reasoning Across Dimensions for Text Generation Models
Steven Y. Feng
|
Vivek Khetan
|
Bogdan Sacaleanu
|
Anatole Gershman
|
Eduard Hovy
We motivate and introduce CHARD: Clinical Health-Aware Reasoning across Dimensions, to investigate the capability of text generation models to act as implicit clinical knowledge bases and generate free-flow textual explanations about various health-related conditions across several dimensions. We collect and present an associated dataset, CHARDat, consisting of explanations about 52 health conditions across three clinical dimensions. We conduct extensive experiments using BART and T5 along with data augmentation, and perform automatic, human, and qualitative analyses. We show that while our models can perform decently, CHARD is very challenging with strong potential for further exploration.
pdf
bib
abs
Prompt Tuning with Contradictory Intentions for Sarcasm Recognition
Yiyi Liu
|
Ruqing Zhang
|
Yixing Fan
|
Jiafeng Guo
|
Xueqi Cheng
Recently, prompt tuning has achieved promising results in a variety of natural language processing (NLP) tasks. The typical approach is to insert text pieces (i.e. templates) into the input and transform downstream tasks into the same form as pre-training. In essence, a high-quality template is the foundation of prompt tuning to support the performance of the converted cloze-style task. However, for sarcasm recognition, it is time-consuming and requires increasingly sophisticated domain knowledge to determine the appropriate templates and label words due to its highly figurative nature. In this work, we propose SarcPrompt, to incorporate the prior knowledge about contradictory intentions into prompt tuning for sarcasm recognition. SarcPrompt is inspired by that the speaker usually says the opposite of what they actually mean in the sarcastic text. Based on this idea, we explicitly mimic the actual intention by prompt construction and indicate whether the actual intention is contradictory to the literal content by verbalizer engineering. Experiments on three public datasets with standard and low-resource settings demonstrate the effectiveness of our SarcPrompt for sarcasm recognition.
pdf
bib
abs
COMBO: A Complete Benchmark for Open KG Canonicalization
Chengyue Jiang
|
Yong Jiang
|
Weiqi Wu
|
Yuting Zheng
|
Pengjun Xie
|
Kewei Tu
Open knowledge graph (KG) consists of (subject, relation, object) triples extracted from millions of raw text. The subject and object noun phrases and the relation in open KG have severe redundancy and ambiguity and need to be canonicalized. Existing datasets for open KG canonicalization only provide gold entity-level canonicalization for noun phrases. In this paper, we present COMBO, a Complete Benchmark for Open KG canonicalization. Compared with existing datasets, we additionally provide gold canonicalization for relation phrases, gold ontology-level canonicalization for noun phrases, as well as source sentences from which triples are extracted. We also propose metrics for evaluating each type of canonicalization. On the COMBO dataset, we empirically compare previously proposed canonicalization methods as well as a few simple baseline methods based on pretrained language models. We find that properly encoding the phrases in a triple using pretrained language models results in better relation canonicalization and ontology-level canonicalization of the noun phrase. We release our dataset, baselines, and evaluation scripts at path/to/url.
pdf
bib
abs
UScore: An Effective Approach to Fully Unsupervised Evaluation Metrics for Machine Translation
Jonas Belouadi
|
Steffen Eger
The vast majority of evaluation metrics for machine translation are supervised, i.e., (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we use to remap deficient underlying vector spaces (in an iterative manner) and to induce an unsupervised MT system, which then provides pseudo-references as an additional component in the metric. Finally, we also induce unsupervised multilingual sentence embeddings from pseudo-parallel data. We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets. We make our code publicly available.
pdf
bib
abs
Assistive Recipe Editing through Critiquing
Diego Antognini
|
Shuyang Li
|
Boi Faltings
|
Julian McAuley
There has recently been growing interest in the automatic generation of cooking recipes that satisfy some form of dietary restrictions, thanks in part to the availability of online recipe data. Prior studies have used pre-trained language models, or relied on small paired recipe data (e.g., a recipe paired with a similar one that satisfies a dietary constraint). However, pre-trained language models generate inconsistent or incoherent recipes, and paired datasets are not available at scale. We address these deficiencies with RecipeCrit, a hierarchical denoising auto-encoder that edits recipes given ingredient-level critiques. The model is trained for recipe completion to learn semantic relationships within recipes. Our work’s main innovation is our unsupervised critiquing module that allows users to edit recipes by interacting with the predicted ingredients; the system iteratively rewrites recipes to satisfy users’ feedback. Experiments onthe Recipe1M recipe dataset show that our model can more effectively edit recipes compared to strong language-modeling baselines, creating recipes that satisfy user constraints and are more correct, serendipitous, coherent, and relevant as measured by human judges.
pdf
bib
abs
DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer
Shanu Kumar
|
Soujanya Abbaraju
|
Sandipan Dandapat
|
Sunayana Sitaram
|
Monojit Choudhury
Zero-shot cross-lingual transfer is promising, however has been shown to be sub-optimal, with inferior transfer performance across low-resource languages. In this work, we envision languages as domains for improving zero-shot transfer by jointly reducing the feature incongruity between the source and the target language and increasing the generalization capabilities of pre-trained multilingual transformers. We show that our approach, DiTTO, significantly outperforms the standard zero-shot fine-tuning method on multiple datasets across all languages using solely unlabeled instances in the target language. Empirical results show that jointly reducing feature incongruity for multiple target languages is vital for successful cross-lingual transfer. Moreover, our model enables better cross-lingual transfer than standard fine-tuning methods, even in the few-shot setting.
pdf
bib
abs
“John is 50 years old, can his son be 65?” Evaluating NLP Models’ Understanding of Feasibility
Himanshu Gupta
|
Neeraj Varshney
|
Swaroop Mishra
|
Kuntal Kumar Pal
|
Saurabh Arjun Sawant
|
Kevin Scaria
|
Siddharth Goyal
|
Chitta Baral
In current NLP research, large-scale language models and their abilities are widely being discussed. Some recent works have also found notable failures of these models. Often these failure examples involve complex reasoning abilities. This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible. To this end, we introduce FeasibilityQA, a question-answering dataset involving binary classification (BCQ) and multi-choice multi-correct questions (MCQ) that test understanding of feasibility. We show that even state-of-the-art models such as GPT-3, GPT-2, and T5 struggle to answer the feasibility questions correctly. Specifically, on (MCQ, BCQ) questions, GPT-3 achieves accuracy of just (19%, 62%) and (25%, 64%) in zero-shot and few-shot settings, respectively. We also evaluate models by providing relevant knowledge statements required to answer the question and find that the additional knowledge leads to a 7% gain in performance, but the overall performance still remains low. These results make one wonder how much commonsense knowledge about action feasibility is encoded in state-of-the-art models and how well they can reason about it.
pdf
bib
abs
Efficient Encoders for Streaming Sequence Tagging
Ayush Kaushal
|
Aditya Gupta
|
Shyam Upadhyay
|
Manaal Faruqui
A naive application of state-of-the-art bidirectional encoders for streaming sequence tagging would require encoding each token from scratch for each new token in an incremental streaming input (like transcribed speech). The lack of re-usability of previous computation leads to a higher number of Floating Point Operations (or FLOPs) and higher number of unnecessary label flips. Increased FLOPs consequently lead to higher wall-clock time and increased label flipping leads to poorer streaming performance. In this work, we present a Hybrid Encoder with Adaptive Restart (HEAR) that addresses these issues while maintaining the performance of bidirectional encoders over the offline (or complete) and improving streaming (or incomplete) inputs. HEAR has a Hybrid unidirectional-bidirectional encoder architecture to perform sequence tagging, along with an Adaptive Restart Module (ARM) to selectively guide the restart of bidirectional portion of the encoder. Across four sequence tagging tasks, HEAR offers FLOP savings in streaming settings upto 71.1% and also outperforms bidirectional encoders for streaming predictions by upto +10% streaming exact match.
pdf
bib
abs
Retrieve-and-Fill for Scenario-based Task-Oriented Semantic Parsing
Akshat Shrivastava
|
Shrey Desai
|
Anchit Gupta
|
Ali Elkahky
|
Aleksandr Livshits
|
Alexander Zotov
|
Ahmed Aly
Task-oriented semantic parsing models have achieved strong results in recent years, but unfortunately do not strike an appealing balance between model size, runtime latency, and cross-domain generalizability. We tackle this problem by introducing scenario-based semantic parsing: a variant of the original task which first requires disambiguating an utterance’s “scenario” (an intent-slot template with variable leaf spans) before generating its frame, complete with ontology and utterance tokens. This formulation enables us to isolate coarse-grained and fine-grained aspects of the task, each of which we solve with off-the-shelf neural modules, also optimizing for the axes outlined above. Concretely, we create a Retrieve-and-Fill (RAF) architecture comprised of (1) a retrieval module which ranks the best scenario given an utterance and (2) a filling module which imputes spans into the scenario to create the frame. Our model is modular, differentiable, interpretable, and allows us to garner extra supervision from scenarios. RAF achieves strong results in high-resource, low-resource, and multilingual settings, outperforming recent approaches by wide margins despite, using base pre-trained encoders, small sequence lengths, and parallel decoding.
pdf
bib
abs
Document Flattening: Beyond Concatenating Context for Document-Level Neural Machine Translation
Minghao Wu
|
George Foster
|
Lizhen Qu
|
Gholamreza Haffari
Existing work in document-level neural machine translation commonly concatenates several consecutive sentences as a pseudo-document, and then learns inter-sentential dependencies. This strategy limits the model’s ability to leverage information from distant context. We overcome this limitation with a novel Document Flattening (DocFlat) technique that integrates Flat-Batch Attention (FBA) and Neural Context Gate (NCG) into Transformer model to utilizes information beyond the pseudo-document boundaries. FBA allows the model to attend to all the positions in the batch and model the relationships between positions explicitly and NCG identifies the useful information from the distant context. We conduct comprehensive experiments and analyses on three benchmark datasets for English-German translation, and validate the effectiveness of two variants of DocFlat. Empirical results show that our approach outperforms strong baselines with statistical significance on BLEU, COMET and accuracy on the contrastive test set. The analyses highlight that DocFlat is highly effective in capturing the long-range information.
pdf
bib
abs
Scaling Back-Translation with Domain Text Generation for Sign Language Gloss Translation
Jinhui Ye
|
Wenxiang Jiao
|
Xing Wang
|
Zhaopeng Tu
Sign language gloss translation aims to translate the sign glosses into spoken language texts, which is challenging due to the scarcity of labeled gloss-text parallel data. Back translation (BT), which generates pseudo-parallel data by translating in-domain spoken language texts into sign glosses, has been applied to alleviate the data scarcity problem. However, the lack of large-scale high-quality in-domain spoken language text data limits the effect of BT. In this paper, to overcome the limitation, we propose a Prompt based domain text Generation (PGen) approach to produce the large-scale in-domain spoken language text data. Specifically, PGen randomly concatenates sentences from the original in-domain spoken language text data as prompts to induce a pre-trained language model (i.e., GPT-2) to generate spoken language texts in a similar style. Experimental results on three benchmarks of sign language gloss translation in varied languages demonstrate that BT with spoken language texts generated by PGen significantly outperforms the compared methods. In addition, as the scale of spoken language texts generated by PGen increases, the BT technique can achieve further improvements, demonstrating the effectiveness of our approach. We release the code and data for facilitating future research in this field.
pdf
bib
abs
Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement
Soyeong Jeong
|
Jinheon Baek
|
Sung Ju Hwang
|
Jong Park
Conversational Question Answering (ConvQA) models aim at answering a question with its relevant paragraph and previous question-answer pairs that occurred during conversation multiple times. To apply such models to a real-world scenario, some existing work uses predicted answers, instead of unavailable ground-truth answers, as the conversation history for inference. However, since these models usually predict wrong answers, using all the predictions without filtering significantly hampers the model performance. To address this problem, we propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model, without making any architectural changes. Moreover, to make the confidence and uncertainty values more reliable, we propose to further calibrate them, thereby smoothing the model predictions. We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets, and the results show that our models significantly outperform relevant baselines. Code is available at:
https://github.com/starsuzi/AS-ConvQA.
pdf
bib
abs
PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically
Sedrick Scott Keh
|
Steven Y. Feng
|
Varun Gangal
|
Malihe Alikhani
|
Eduard Hovy
Tongue twisters are meaningful sentences that are difficult to pronounce. The process of automatically generating tongue twisters is challenging since the generated utterance must satisfy two conditions at once: phonetic difficulty and semantic meaning. Furthermore, phonetic difficulty is itself hard to characterize and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony. In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue twisters on two proposed task settings. To do this, we curate a dataset called TT-Corp, consisting of existing English tongue twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel, phonetically difficult, fluent, and semantically meaningful tongue twisters.
pdf
bib
abs
A User-Centered, Interactive, Human-in-the-Loop Topic Modelling System
Zheng Fang
|
Lama Alqazlan
|
Du Liu
|
Yulan He
|
Rob Procter
Human-in-the-loop topic modelling incorporates users’ knowledge into the modelling process, enabling them to refine the model iteratively. Recent research has demonstrated the value of user feedback, but there are still issues to consider, such as the difficulty in tracking changes, comparing different models and the lack of evaluation based on real-world examples of use. We developed a novel, interactive human-in-the-loop topic modeling system with a user-friendly interface that enables users compare and record every step they take, and a novel topic words suggestion feature to help users provide feedback that is faithful to the ground truth. Our system also supports not only what traditional topic models can do, i.e., learning the topics from the whole corpus, but also targeted topic modelling, i.e., learning topics for specific aspects of the corpus. In this article, we provide an overview of the system and present the results of a series of user studies designed to assess the value of the system in progressively more realistic applications of topic modelling.
pdf
bib
abs
A Survey of Methods for Addressing Class Imbalance in Deep-Learning Based Natural Language Processing
Sophie Henning
|
William Beluch
|
Alexander Fraser
|
Annemarie Friedrich
Many natural language processing (NLP) tasks are naturally imbalanced, as some target categories occur much more frequently than others in the real world. In such scenarios, current NLP models tend to perform poorly on less frequent classes. Addressing class imbalance in NLP is an active research topic, yet, finding a good approach for a particular task and imbalance scenario is difficult. In this survey, the first overview on class imbalance in deep-learning based NLP, we first discuss various types of controlled and real-world class imbalance. Our survey then covers approaches that have been explicitly proposed for class-imbalanced NLP tasks or, originating in the computer vision community, have been evaluated on them. We organize the methods by whether they are based on sampling, data augmentation, choice of loss function, staged learning, or model design. Finally, we discuss open problems and how to move forward.
pdf
bib
abs
Extracting or Guessing? Improving Faithfulness of Event Temporal Relation Extraction
Haoyu Wang
|
Hongming Zhang
|
Yuqian Deng
|
Jacob Gardner
|
Dan Roth
|
Muhao Chen
In this paper, we seek to improve the faithfulness of TempRel extraction models from two perspectives. The first perspective is to extract genuinely based on contextual description. To achieve this, we propose to conduct counterfactual analysis to attenuate the effects of two significant types of training biases: the event trigger bias and the frequent label bias. We also add tense information into event representations to explicitly place an emphasis on the contextual description. The second perspective is to provide proper uncertainty estimation and abstain from extraction when no relation is described in the text. By parameterization of Dirichlet Prior over the model-predicted categorical distribution, we improve the model estimates of the correctness likelihood and make TempRel predictions more selective. We also employ temperature scaling to recalibrate the model confidence measure after bias mitigation. Through experimental analysis on MATRES, MATRES-DS, and TDDiscourse, we demonstrate that our model extracts TempRel and timelines more faithfully compared to SOTA methods, especially under distribution shifts.
pdf
bib
abs
LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control
Yilun Zhao
|
Zhenting Qi
|
Linyong Nan
|
Lorenzo Jaime Flores
|
Dragomir Radev
Logical Table-to-Text (LT2T) generation is tasked with generating logically faithful sentences from tables. There currently exists two challenges in the field: 1) Faithfulness: how to generate sentences that are factually correct given the table content; 2) Diversity: how to generate multiple sentences that offer different perspectives on the table. This work proposes LoFT, which utilizes logic forms as fact verifiers and content planners to control LT2T generation. Experimental results on the LogicNLG dataset demonstrate that LoFT is the first model that addresses unfaithfulness and lack of diversity issues simultaneously. Our code is publicly available at
https://github.com/Yale-LILY/LoFT.
pdf
bib
abs
PromptDA: Label-guided Data Augmentation for Prompt-based Few Shot Learners
Canyu Chen
|
Kai Shu
Recent advances in large pre-trained language models (PLMs) lead to impressive gains on natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs heavily relies on sufficient labeled training instances, which are usually hard to obtain. Prompt-based tuning on PLMs has shown to be powerful for various downstream few-shot tasks. Existing works studying prompt-based tuning for few-shot NLU tasks mainly focus on deriving proper label words with a verbalizer or generating prompt templates to elicit semantics from PLMs. In addition, conventional data augmentation strategies such as synonym substitution are also widely adopted in low-resource scenarios. However, the improvements they bring to prompt-based few-shot learning have been demonstrated to be marginal. Thus, an important research question arises as follows: how to design effective data augmentation methods for prompt-based few-shot tuning? To this end, considering the label semantics are essential in prompt-based tuning, we propose a novel label-guided data augmentation framework PromptDA, which exploits the enriched label semantic information for data augmentation. Extensive experiment results on few-shot text classification tasks show that our proposed framework achieves superior performances by effectively leveraging label semantics and data augmentation for natural language understanding.
pdf
bib
abs
Incorporating Question Answering-Based Signals into Abstractive Summarization via Salient Span Selection
Daniel Deutsch
|
Dan Roth
In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model. Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs and automatically determining whether those questions are answered in the gold summaries. This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary. Our experiments demonstrate that the models trained using QA-based supervision generate higher-quality summaries than baseline methods of identifying salient spans on benchmark summarization datasets. Further, we show that the content of the generated summaries can be controlled based on which NPs are marked in the input document. Finally, we propose a method of augmenting the training data so the gold summaries are more consistent with the marked input spans used during training and show how this results in models which learn to better exclude unmarked document content.
pdf
bib
abs
Patient Outcome and Zero-shot Diagnosis Prediction with Hypernetwork-guided Multitask Learning
Shaoxiong Ji
|
Pekka Marttinen
Multitask deep learning has been applied to patient outcome prediction from text, taking clinical notes as input and training deep neural networks with a joint loss function of multiple tasks. However, the joint training scheme of multitask learning suffers from inter-task interference, and diagnosis prediction among the multiple tasks has the generalizability issue due to rare diseases or unseen diagnoses. To solve these challenges, we propose a hypernetwork-based approach that generates task-conditioned parameters and coefficients of multitask prediction heads to learn task-specific prediction and balance the multitask learning. We also incorporate semantic task information to improve the generalizability of our task-conditioned multitask model. Experiments on early and discharge notes extracted from the real-world MIMIC database show our method can achieve better performance on multitask patient outcome prediction than strong baselines in most cases. Besides, our method can effectively handle the scenario with limited information and improve zero-shot prediction on unseen diagnosis categories.
pdf
bib
abs
A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Annemarie Friedrich
|
Nianwen Xue
|
Alexis Palmer
Aspectual meaning refers to how the internal temporal structure of situations is presented. This includes whether a situation is described as a state or as an event, whether the situation is finished or ongoing, and whether it is viewed as a whole or with a focus on a particular phase. This survey gives an overview of computational approaches to modeling lexical and grammatical aspect along with intuitive explanations of the necessary linguistic concepts and terminology. In particular, we describe the concepts of stativity, telicity, habituality, perfective and imperfective, as well as influential inventories of eventuality and situation types. Aspect is a crucial component of semantics, especially for precise reporting of the temporal structure of situations, and future NLP approaches need to be able to handle and evaluate it systematically.
pdf
bib
abs
Incorporating Context into Subword Vocabularies
Shaked Yehezkel
|
Yuval Pinter
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the resulting vocabularies are used in language models’ highly contextualized settings. We present SaGe, a tokenizer that tailors subwords for their downstream use by baking in the contextualized signal at the vocabulary creation phase. We show that SaGe does a better job than current widespread tokenizers in keeping token contexts cohesive, while not incurring a large price in terms of encoding efficiency or domain robustness. SaGe improves performance on English GLUE classification tasks as well as on NER, and on Inference and NER in Turkish, demonstrating its robustness to language properties such as morphological exponence and agglutination.
pdf
bib
abs
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization
Laura Nguyen
|
Thomas Scialom
|
Benjamin Piwowarski
|
Jacopo Staiano
Text Summarization is a popular task and an active area of research for the Natural Language Processing community. By definition, it requires to account for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually-rich, layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with layout information and propose four novel datasets – consistently built from scholar resources – covering French, Spanish, Portuguese, and Korean languages. Further, we propose new baselines merging layout-aware and long-range models – two orthogonal approaches – and obtain state-of-the-art results, showing the importance of combining both lines of research.
pdf
bib
abs
ViHOS: Hate Speech Spans Detection for Vietnamese
Phu Gia Hoang
|
Canh Duc Luu
|
Khanh Quoc Tran
|
Kiet Van Nguyen
|
Ngan Luu-Thuy Nguyen
The rise in hateful and offensive language directed at other users is one of the adverse side effects of the increased use of social networking platforms. This could make it difficult for human moderators to review tagged comments filtered by classification systems. To help address this issue, we present the ViHOS (Vietnamese Hate and Offensive Spans) dataset, the first human-annotated corpus containing 26k spans on 11k comments. We also provide definitions of hateful and offensive spans in Vietnamese comments as well as detailed annotation guidelines. Besides, we conduct experiments with various state-of-the-art models. Specifically, XLM-R_Large achieved the best F1-scores in Single span detection and All spans detection, while PhoBERT_Large obtained the highest in Multiple spans detection. Finally, our error analysis demonstrates the difficulties in detecting specific types of spans in our data for future research. Our dataset is released on GitHub.
pdf
bib
abs
Vote’n’Rank: Revision of Benchmarking with Social Choice Theory
Mark Rofin
|
Vladislav Mikhailov
|
Mikhail Florinsky
|
Andrey Kravchenko
|
Tatiana Shavrina
|
Elena Tutubalina
|
Daniel Karabekyan
|
Ekaterina Artemova
The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow the unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote’n’Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote’n’Rank’s procedures are more robust than the mean average while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
pdf
bib
abs
Combining Parameter-efficient Modules for Task-level Generalisation
Edoardo Maria Ponti
|
Alessandro Sordoni
|
Yoshua Bengio
|
Siva Reddy
A modular design encourages neural models to disentangle and recombine different facets of knowledge to generalise more systematically to new tasks. In this work, we assume that each task is associated with a subset of latent skills from an (arbitrary size) inventory. In turn, each skill corresponds to a parameter-efficient (sparse / low-rank) model adapter. By jointly learning adapters and a routing function that allocates skills to each task, the full network is instantiated as the average of the parameters of active skills. We propose several inductive biases that encourage re-usage and composition of the skills, including variable-size skill allocation and a dual-speed learning rate. We evaluate our latent-skill model in two main settings: 1) multitask reinforcement learning for instruction following on 8 levels of the BabyAI platform; and 2) few-shot fine-tuning of language models on 160 NLP tasks of the CrossFit benchmark. We find that the modular design of our network enhances sample efficiency in reinforcement learning and few-shot generalisation in supervised learning, compared to a series of baselines. These include models where parameters are fully shared, task-specific, conditionally generated (HyperFormer), or sparse mixture-of-experts (TaskMoE).
pdf
bib
abs
Self-imitation Learning for Action Generation in Text-based Games
Zijing Shi
|
Yunqiu Xu
|
Meng Fang
|
Ling Chen
In this work, we study reinforcement learning (RL) in solving text-based games. We address the challenge of combinatorial action space, by proposing a confidence-based self-imitation model to generate action candidates for the RL agent. Firstly, we leverage the self-imitation learning to rank and exploit past valuable trajectories to adapt a pre-trained language model (LM) towards a target game. Then, we devise a confidence-based strategy to measure the LM’s confidence with respect to a state, thus adaptively pruning the generated actions to yield a more compact set of action candidates. In multiple challenging games, our model demonstrates promising performance in comparison to the baselines.
pdf
bib
abs
Investigating the Effect of Relative Positional Embeddings on AMR-to-Text Generation with Structural Adapters
Sebastien Montella
|
Alexis Nasr
|
Johannes Heinecke
|
Frederic Bechet
|
Lina M. Rojas Barahona
Text generation from Abstract Meaning Representation (AMR) has substantially benefited from the popularized Pretrained Language Models (PLMs). Myriad approaches have linearized the input graph as a sequence of tokens to fit the PLM tokenization requirements. Nevertheless, this transformation jeopardizes the structural integrity of the graph and is therefore detrimental to its resulting representation. To overcome this issue, Ribeiro et al. (2021b) have recently proposed StructAdapt, a structure-aware adapter which injects the input graph connectivity within PLMs using Graph Neural Networks (GNNs). In this paper, we investigate the influence of Relative Position Embeddings (RPE) on AMR-to-Text, and, in parallel, we examine the robustness of StructAdapt. Through ablation studies, graph attack and link prediction, we reveal that RPE might be partially encoding input graphs. We suggest further research regarding the role of RPE will provide valuable insights for Graph-to-Text generation.
pdf
bib
abs
On the Intersection of Context-Free and Regular Languages
Clemente Pasti
|
Andreas Opedal
|
Tiago Pimentel
|
Tim Vieira
|
Jason Eisner
|
Ryan Cotterell
The Bar-Hillel construction is a classic result in formal language theory. It shows, by a simple construction, that the intersection of a context-free language and a regular language is itself context-free. In the construction, the regular language is specified by a finite-state automaton. However, neither the original construction (Bar-Hillel et al., 1961) nor its weighted extension (Nederhof and Satta, 2003) can handle finite-state automata with ε-arcs. While it is possible to remove ε-arcs from a finite-state automaton efficiently without modifying the language, such an operation modifies the automaton’s set of paths. We give a construction that generalizes the Bar- Hillel in the case the desired automaton has ε-arcs, and further prove that our generalized construction leads to a grammar that encodes the structure of both the input automaton and grammar while retaining the asymptotic size of the original construction.
pdf
bib
abs
Social Influence Dialogue Systems: A Survey of Datasets and Models For Social Influence Tasks
Kushal Chawla
|
Weiyan Shi
|
Jingwen Zhang
|
Gale Lucas
|
Zhou Yu
|
Jonathan Gratch
Dialogue systems capable of social influence such as persuasion, negotiation, and therapy, are essential for extending the use of technology to numerous realistic scenarios. However, existing research primarily focuses on either task-oriented or open-domain scenarios, a categorization that has been inadequate for capturing influence skills systematically. There exists no formal definition or category for dialogue systems with these skills and data-driven efforts in this direction are highly limited. In this work, we formally define and introduce the category of social influence dialogue systems that influence users’ cognitive and emotional responses, leading to changes in thoughts, opinions, and behaviors through natural conversations. We present a survey of various tasks, datasets, and methods, compiling the progress across seven diverse domains. We discuss the commonalities and differences between the examined systems, identify limitations, and recommend future directions. This study serves as a comprehensive reference for social influence dialogue systems to inspire more dedicated research and discussion in this emerging area.
pdf
bib
abs
Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu
|
Silviu Paun
|
Maris Camilleri
|
Paloma Garcia
|
Jon Chamberlain
|
Udo Kruschwitz
|
Massimo Poesio
Although several datasets annotated for anaphoric reference / coreference exist, even the largest such datasets have limitations in term of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet, the approaches proposed to scale up anaphoric annotation haven’t so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players, in part thanks to the use of a new resolve-and-aggregate paradigm to ‘complete’ markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparable size datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents ( 2K in length).
pdf
bib
abs
What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study
Mohamed Abdalla
|
Krishnapriya Vishnubhotla
|
Saif Mohammad
The degree of semantic relatedness of two units of language has long been considered fundamental to understanding meaning. Additionally, automatically determining relatedness has many applications such as question answering and summarization. However, prior NLP work has largely focused on semantic similarity, a subset of relatedness, because of a lack of relatedness datasets. In this paper, we introduce a dataset for Semantic Textual Relatedness, STR-2022, that has 5,500 English sentence pairs manually annotated using a comparative annotation framework, resulting in fine-grained scores. We show that human intuition regarding relatedness of sentence pairs is highly reliable, with a repeat annotation correlation of 0.84. We use the dataset to explore questions on what makes sentences semantically related. We also show the utility of STR-2022 for evaluating automatic methods of sentence representation and for various downstream NLP tasks. Our dataset, data statement, and annotation questionnaire can be found at:
https://doi.org/10.5281/zenodo.7599667.
pdf
bib
abs
RevUp: Revise and Update Information Bottleneck for Event Representation
Mehdi Rezaee
|
Francis Ferraro
The existence of external (“side”) semantic knowledge has been shown to result in more expressive computational event models. To enable the use of side information that may be noisy or missing, we propose a semi-supervised information bottleneck-based discrete latent variable model. We reparameterize the model’s discrete variables with auxiliary continuous latent variables and a light-weight hierarchical structure. Our model is learned to minimize the mutual information between the observed data and optional side knowledge that is not already captured by the new, auxiliary variables. We theoretically show that our approach generalizes past approaches, and perform an empirical case study of our approach on event modeling. We corroborate our theoretical results with strong empirical experiments, showing that the proposed method outperforms previous proposed approaches on multiple datasets.
pdf
bib
abs
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Genta Indra Winata
|
Alham Fikri Aji
|
Samuel Cahyawijaya
|
Rahmad Mahendra
|
Fajri Koto
|
Ade Romadhony
|
Kemal Kurniawan
|
David Moeljadi
|
Radityo Eko Prasojo
|
Pascale Fung
|
Timothy Baldwin
|
Jey Han Lau
|
Rico Sennrich
|
Sebastian Ruder
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes sentiment and machine translation datasets, and bilingual lexicons. We provide extensive analyses and describe challenges for creating such resources. We hope this work can spark NLP research on Indonesian and other underrepresented languages.
pdf
bib
abs
The Functional Relevance of Probed Information: A Case Study
Michael Hanna
|
Roberto Zamparelli
|
David Mareček
Recent studies have shown that transformer models like BERT rely on number information encoded in their representations of sentences’ subjects and head verbs when performing subject-verb agreement. However, probing experiments suggest that subject number is also encoded in the representations of all words in such sentences. In this paper, we use causal interventions to show that BERT only uses the subject plurality information encoded in its representations of the subject and words that agree with it in number. We also demonstrate that current probing metrics are unable to determine which words’ representations contain functionally relevant information. This both provides a revised view of subject-verb agreement in language models, and suggests potential pitfalls for current probe usage and evaluation.
pdf
bib
abs
Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses?
Avi Shmidman
|
Cheyn Shmuel Shmidman
|
Dan Bareket
|
Moshe Koppel
|
Reut Tsarfaty
Semitic morphologically-rich languages (MRLs) are characterized by extreme word ambiguity. Because most vowels are omitted in standard texts, many of the words are homographs with multiple possible analyses, each with a different pronunciation and different morphosyntactic properties. This ambiguity goes beyond word-sense disambiguation (WSD), and may include token segmentation into multiple word units. Previous research on MRLs claimed that standardly trained pre-trained language models (PLMs) based on word-pieces may not sufficiently capture the internal structure of such tokens in order to distinguish between these analyses.Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated and analyzed using PLMs. We evaluate all existing models for contextualized Hebrew embeddings on a novel Hebrew homograph challenge sets that we deliver. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings; and that they are most effective for disambiguating segmentation and morphosyntactic features, less so regarding pure word-sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions, whether calculated as masked or unmasked tokens. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup.
pdf
bib
abs
Parameter-Efficient Tuning with Special Token Adaptation
Xiaocong Yang
|
James Y. Huang
|
Wenxuan Zhou
|
Muhao Chen
Parameter-efficient tuning aims at updating only a small subset of parameters when adapting a pretrained model to downstream tasks. In this work, we introduce PASTA, in which we only modify the special token representations (e.g., [SEP] and [CLS] in BERT) before the self-attention module at each layer in Transformer-based models. PASTA achieves comparable performance to fine-tuning in natural language understanding tasks including text classification and NER with up to only 0.029% of total parameters trained. Our work not only provides a simple yet effective way of parameter-efficient tuning, which has a wide range of practical applications when deploying finetuned models for multiple tasks, but also demonstrates the pivotal role of special tokens in pretrained language models.
pdf
bib
abs
Probing Power by Prompting: Harnessing Pre-trained Language Models for Power Connotation Framing
Shima Khanehzar
|
Trevor Cohn
|
Gosia Mikolajczak
|
Lea Frermann
When describing actions, subtle changes in word choice can evoke very different associations with the involved entities. For instance, a company ‘employing workers’ evokes a more positive connotation than the one ‘exploiting’ them. This concept is called connotation. This paper investigates whether pre-trained language models (PLMs) encode such subtle connotative information about power differentials between involved entities. We design a probing framework for power connotation, building on (CITATION)’s operationalization of connotation frames. We show that zero-shot prompting of PLMs leads to above chance prediction of power connotation, however fine-tuning PLMs using our framework drastically improves their accuracy. Using our fine-tuned models, we present a case study of power dynamics in US news reporting on immigration, showing the potential of our framework as a tool for understanding subtle bias in the media.
pdf
bib
abs
Zero and Few-Shot Localization of Task-Oriented Dialogue Agents with a Distilled Representation
Mehrad Moradshahi
|
Sina Semnani
|
Monica Lam
Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice a lot of accuracy for data efficiency, and largely fail in creating a usable dialogue agent. We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e. zero-shot) or a small training set (i.e. few-shot). Unlike most prior work in cross-lingual ToD that only focuses on Dialogue State Tracking (DST), we build an end-to-end agent. We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents. We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatic filtering of noisy translations. We evaluate our approach on the recent bilingual dialogue dataset BiToD.In Chinese to English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting where 10% of the data in the target language is used, we improve the state-of-the-art by 15.2% and 14.0%, coming within 5% of full-shot training.
pdf
bib
abs
Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues
Mehrad Moradshahi
|
Victoria Tsai
|
Giovanni Campagna
|
Monica Lam
Robust state tracking for task-oriented dialogue systems currently remains restricted to a few popular languages. This paper shows that given a large-scale dialogue data set in one language, we can automatically produce an effective semantic parser for other languages using machine translation. We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values and eliminate costly human supervision used in previous benchmarks. We also propose a new contextual semantic parsing model, which encodes the formal slots and values, and only the last agent and user utterances. We show that the succinct representation reduces the compounding effect of translation errors, without harming the accuracy in practice. We evaluate our approach on several dialogue state tracking benchmarks. On RiSAWOZ, CrossWOZ, CrossWOZ-EN, and MultiWOZ-ZH datasets we improve the state of the art by 11%, 17%, 20%, and 0.3% in joint goal accuracy. We present a comprehensive error analysis for all three datasets showing erroneous annotations can lead to misguided judgments on the quality of the model. Finally, we present RiSAWOZ English and German datasets, created using our translation methodology. On these datasets, accuracy is within 11% of the original showing that high-accuracy multilingual dialogue datasets are possible without relying on expensive human annotations. We release our datasets and software open source.
pdf
bib
abs
Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers
Minsoo Kim
|
Kyuhong Shim
|
Seongmin Park
|
Wonyong Sung
|
Jungwook Choi
Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of substantial increases in model complexity. Quantization-aware training (QAT) is a promising method to lower the implementation cost and energy consumption. However, aggressive quantization below 2-bit causes considerable accuracy degradation due to unstable convergence, especially when the downstream dataset is not abundant. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast converging QAT of ultra-low precision pre-trained Transformers. TI intervenes layer-wise signal propagation with the intact signal from the teacher to remove the interference of propagated quantization errors, smoothing loss surface of QAT and expediting the convergence. Furthermore, we propose a gradual intervention mechanism to stabilize the recovery of subsections of Transformer layers from quantization. The proposed schemes enable fast convergence of QAT and improve the model accuracy regardless of the diverse characteristics of downstream fine-tuning tasks. We demonstrate that TI consistently achieves superior accuracy with significantly lower fine-tuning iterations on well-known Transformers of natural language processing as well as computer vision compared to the state-of-the-art QAT methods.
pdf
bib
abs
Generative Replay Inspired by Hippocampal Memory Indexing for Continual Language Learning
Aru Maekawa
|
Hidetaka Kamigaito
|
Kotaro Funakoshi
|
Manabu Okumura
Continual learning aims to accumulate knowledge to solve new tasks without catastrophic forgetting for previously learned tasks. Research on continual learning has led to the development of generative replay, which prevents catastrophic forgetting by generating pseudo-samples for previous tasks and learning them together with new tasks. Inspired by the biological brain, we propose the hippocampal memory indexing to enhance the generative replay by controlling sample generation using compressed features of previous training samples. It enables the generation of a specific training sample from previous tasks, thus improving the balance and quality of generated replay samples. Experimental results indicate that our method effectively controls the sample generation and consistently outperforms the performance of current generative replay methods.
pdf
bib
abs
A Survey of Multi-task Learning in Natural Language Processing: Regarding Task Relatedness and Training Methods
Zhihan Zhang
|
Wenhao Yu
|
Mengxia Yu
|
Zhichun Guo
|
Meng Jiang
Multi-task learning (MTL) has become increasingly popular in natural language processing (NLP) because it improves the performance of related tasks by exploiting their commonalities and differences. Nevertheless, it is still not understood very well how multi-task learning can be implemented based on the relatedness of training tasks. In this survey, we review recent advances of multi-task learning methods in NLP, with the aim of summarizing them into two general multi-task training methods based on their task relatedness: (i) joint training and (ii) multi-step training. We present examples in various NLP downstream applications, summarize the task relationships and discuss future directions of this promising topic.
pdf
bib
abs
Conclusion-based Counter-Argument Generation
Milad Alshomary
|
Henning Wachsmuth
In real-world debates, the most common way to counter an argument is to reason against its main point, that is, its conclusion. Existing work on the automatic generation of natural language counter-arguments does not address the relation to the conclusion, possibly because many arguments leave their conclusion implicit. In this paper, we hypothesize that the key to effective counter-argument generation is to explicitly model the argument’s conclusion and to ensure that the stance of the generated counter is opposite to that conclusion. In particular, we propose a multitask approach that jointly learns to generate both the conclusion and the counter of an input argument. The approach employs a stance-based ranking component that selects the counter from a diverse set of generated candidates whose stance best opposes the generated conclusion. In both automatic and manual evaluation, we provide evidence that our approach generates more relevant and stance-adhering counters than strong baselines.
pdf
bib
abs
Question-Answer Sentence Graph for Joint Modeling Answer Selection
Roshni Iyer
|
Thuy Vu
|
Alessandro Moschitti
|
Yizhou Sun
This research studies graph-based approaches for Answer Sentence Selection (AS2), an essential component for retrieval-based Question Answering (QA) systems. During offline learning, our model constructs a small-scale relevant training graph per question in an unsupervised manner, and integrates with Graph Neural Networks. Graph nodes are question sentence to answer sentence pairs. We train and integrate state-of-the-art (SOTA) models for computing scores between question-question, question-answer, and answer-answer pairs, and use thresholding on relevance scores for creating graph edges. Online inference is then performed to solve the AS2 task on unseen queries. Experiments on two well-known academic benchmarks and a real-world dataset show that our approach consistently outperforms SOTA QA baseline models.
pdf
bib
abs
Evaluating and Improving the Coreference Capabilities of Machine Translation Models
Asaf Yehudai
|
Arie Cattan
|
Omri Abend
|
Gabriel Stanovsky
Machine translation (MT) requires a wide range of linguistic capabilities, which current end-to-end models are expected to learn implicitly by observing aligned sentences in bilingual corpora. In this work, we ask: How well MT models learn coreference resolution via implicit signal? To answer this question, we develop an evaluation methodology that derives coreference clusters from MT output and evaluates them without requiring annotations in the target language.Following, we evaluate several prominent open-source and commercial MT systems, translating from English to six target languages, and compare them to state-of-the-art coreference resolvers on three challenging benchmarks. Our results show that the monolingual resolvers greatly outperform MT models. Motivated by this result, we experiment with different methods for incorporating the output of coreference resolution models in MT, showing improvement over strong baselines.
pdf
bib
abs
Document-Level Planning for Text Simplification
Liam Cripwell
|
Joël Legrand
|
Claire Gardent
Most existing work on text simplification is limited to sentence-level inputs, with attempts to iteratively apply these approaches to document-level simplification failing to coherently preserve the discourse structure of the document. We hypothesise that by providing a high-level view of the target document, a simplification plan might help to guide generation. Building upon previous work on controlled, sentence-level simplification, we view a plan as a sequence of labels, each describing one of four sentence-level simplification operations (copy, rephrase, split, or delete). We propose a planning model that labels each sentence in the input document while considering both its context (a window of surrounding sentences) and its internal structure (a token-level representation). Experiments on two simplification benchmarks (Newsela-auto and Wiki-auto) show that our model outperforms strong baselines both on the planning task and when used to guide document-level simplification models.
pdf
bib
abs
Efficient Hybrid Generation Framework for Aspect-Based Sentiment Analysis
Haoran Lv
|
Junyi Liu
|
Henan Wang
|
Yaoming Wang
|
Jixiang Luo
|
Yaxiao Liu
Aspect-based sentiment analysis (ABSA) has attracted broad attention due to its commercial value. Natural Language Generation-based (NLG) approaches dominate the recent advance in ABSA tasks. However, current NLG practices are inefficient because most of them directly employ an autoregressive generation framework that cannot efficiently generate location information and semantic representations of ABSA targets. In this paper, we propose a novel framework, namely Efficient Hybrid Generation (EHG) to revolutionize traditions. Specifically, we leverage an Efficient Hybrid Transformer to generate the location and semantic information of ABSA targets in parallel. Besides, we design a novel global hybrid loss function in combination with bipartite matching to achieve end-to-end model training. Extensive experiments demonstrate that our proposed EHG framework outperforms current state-of-the-art methods in almost all cases and outperforms existing NLG-based methods in terms of inference efficiency.
pdf
bib
abs
What’s New? Summarizing Contributions in Scientific Literature
Hiroaki Hayashi
|
Wojciech Kryscinski
|
Bryan McCann
|
Nazneen Rajani
|
Caiming Xiong
With thousands of academic articles shared on a daily basis, it has become increasingly difficult to keep up with the latest scientific findings. To overcome this problem, we introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work, making it easier to identify the key findings shared in articles. For this purpose, we extend the S2ORC corpus of academic articles, which spans a diverse set of domains ranging from economics to psychology, by adding disentangled “contribution” and “context” reference labels. Together with the dataset, we introduce and analyze three baseline approaches: 1) a unified model controlled by input code prefixes, 2) a model with separate generation heads specialized in generating the disentangled outputs, and 3) a training strategy that guides the model using additional supervision coming from inbound and outbound citations. We also propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs. Through a human study involving expert annotators, we show that in 79%, of cases our new task is considered more helpful than traditional scientific paper summarization.
pdf
bib
abs
Find Parent then Label Children: A Two-stage Taxonomy Completion Method with Pre-trained Language Model
Fei Xia
|
Yixuan Weng
|
Shizhu He
|
Kang Liu
|
Jun Zhao
Taxonomies, which organize domain concepts into hierarchical structures, are crucial for building knowledge systems and downstream applications. As domain knowledge evolves, taxonomies need to be continuously updated to include new concepts. Previous approaches have mainly focused on adding concepts to the leaf nodes of the existing hierarchical tree, which does not fully utilize the taxonomy’s knowledge and is unable to update the original taxonomy structure (usually involving non-leaf nodes). In this paper, we propose a two-stage method called ATTEMPT for taxonomy completion. Our method inserts new concepts into the correct position by finding a parent node and labeling child nodes. Specifically, by combining local nodes with prompts to generate natural sentences, we take advantage of pre-trained language models for hypernym/hyponymy recognition. Experimental results on two public datasets (including six domains) show that ATTEMPT performs best on both taxonomy completion and extension tasks, surpassing existing methods.
pdf
bib
abs
Meta Self-Refinement for Robust Learning with Weak Supervision
Dawei Zhu
|
Xiaoyu Shen
|
Michael Hedderich
|
Dietrich Klakow
Training deep neural networks (DNNs) under weak supervision has attracted increasing research attention as it can significantly reduce the annotation cost. However, labels from weak supervision can be noisy, and the high capacity of DNNs enables them to easily overfit the label noise, resulting in poor generalization. Recent methods leverage self-training to build noise-resistant models, in which a teacher trained under weak supervision is used to provide highly confident labels for teaching the students. Nevertheless, the teacher derived from such frameworks may have fitted a substantial amount of noise and therefore produce incorrect pseudo-labels with high confidence, leading to severe error propagation. In this work, we propose Meta Self-Refinement (MSR), a noise-resistant learning framework, to effectively combat label noise from weak supervision. Instead of relying on a fixed teacher trained with noisy labels, we encourage the teacher to refine its pseudo-labels. At each training step, MSR performs a meta gradient descent on the current mini-batch to maximize the student performance on a clean validation set. Extensive experimentation on eight NLP benchmarks demonstrates that MSR is robust against label noise in all settings and outperforms state-of-the-art methods by up to 11.4% in accuracy and 9.26% in F1 score.
pdf
bib
abs
Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation
Nuno M. Guerreiro
|
Elena Voita
|
André Martins
Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate adequacy of detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise neither in training nor in inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods and both revisit methods used previously and propose using glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time that significantly reduces the hallucinatory rate.
pdf
bib
abs
Investigating UD Treebanks via Dataset Difficulty Measures
Artur Kulmizev
|
Joakim Nivre
Treebanks annotated with Universal Dependencies (UD) are currently available for over 100 languages and are widely utilized by the community. However, their inherent characteristics are hard to measure and are only partially reflected in parser evaluations via accuracy metrics like LAS. In this study, we analyze a large subset of the UD treebanks using three recently proposed accuracy-free dataset analysis methods: dataset cartography, 𝒱-information, and minimum description length. Each method provides insights about UD treebanks that would remain undetected if only LAS was considered. Specifically, we identify a number of treebanks that, despite yielding high LAS, contain very little information that is usable by a parser to surpass what can be achieved by simple heuristics. Furthermore, we make note of several treebanks that score consistently low across numerous metrics, indicating a high degree of noise or annotation inconsistency present therein.
pdf
bib
abs
On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex
Terry Yue Zhuo
|
Zhuang Li
|
Yujin Huang
|
Fatemeh Shiri
|
Weiqing Wang
|
Gholamreza Haffari
|
Yuan-Fang Li
Semantic parsing is a technique aimed at constructing a structured representation of the meaning of a natural-language question. Recent advances in language models trained on code have shown superior performance in generating these representations compared to language models trained solely on natural language text. The existing fine-tuned neural semantic parsers are vulnerable to adversarial attacks on natural-language inputs. While it has been established that the robustness of smaller semantic parsers can be enhanced through adversarial training, this approach is not feasible for large language models in real-world scenarios, as it requires both substantial computational resources and expensive human annotation on in-domain semantic parsing data. This paper presents the first empirical study on the adversarial robustness of a prompt-based semantic parser based on CODEX, a stateof-the-art (SOTA) language model trained on code. Our results demonstrate that the large language model of code is vulnerable to carefully crafted adversarial examples. To overcome this challenge, we propose methods for enhancing robustness without requiring substantial amounts of labelled data or intensive computational resources.
pdf
bib
abs
Leveraging Task Dependency and Contrastive Learning for Case Outcome Classification on European Court of Human Rights Cases
Santosh T.y.s.s
|
Marcel Perez San Blas
|
Phillip Kemper
|
Matthias Grabmair
We report on an experiment in case outcome classification on European Court of Human Rights cases where our model first learns to identify the convention articles allegedly violated by the state from case facts descriptions, and subsequently uses that information to classify whether the court finds a violation of those articles. We assess the dependency between these two tasks at the feature and outcome level. Furthermore, we leverage a hierarchical contrastive loss to pull together article-specific representations of cases at the higher level, leading to distinctive article clusters. The cases in each article cluster are further pulled closer based on their outcome, leading to sub-clusters of cases with similar outcomes. Our experiment results demonstrate that, given a static pre-trained encoder, our models produce a small but consistent improvement in classification performance over single-task and joint models without contrastive loss.
pdf
bib
abs
Semi-supervised Relation Extraction via Data Augmentation and Consistency-training
Komal Teru
Due to the semantic complexity of the Relation extraction (RE) task, obtaining high-quality human labelled data is an expensive and noisy process. To improve the sample efficiency of the models, semi-supervised learning (SSL) methods aim to leverage unlabelled data in addition to learning from limited labelled data points. Recently, strong data augmentation combined with consistency-based semi-supervised learning methods have advanced the state of the art in several SSL tasks. However, adapting these methods to the RE task has been challenging due to the difficulty of data augmentation for RE. In this work, we leverage the recent advances in controlled text generation to perform high-quality data augmentation for the RE task. We further introduce small but significant changes to model architecture that allows for generation of more training data by interpolating different data points in their latent space. These data augmentations along with consistency training result in very competitive results for semi-supervised relation extraction on four benchmark datasets.
pdf
bib
abs
Event Temporal Relation Extraction with Bayesian Translational Model
Xingwei Tan
|
Gabriele Pergola
|
Yulan He
Existing models to extract temporal relations between events lack a principled method to incorporate external knowledge. In this study, we introduce Bayesian-Trans, a Bayesian learning-based method that models the temporal relation representations as latent variables and infers their values via Bayesian inference and translational functions. Compared to conventional neural approaches, instead of performing point estimation to find the best set parameters, the proposed model infers the parameters’ posterior distribution directly, enhancing the model’s capability to encode and express uncertainty about the predictions. Experimental results on the three widely used datasets show that Bayesian-Trans outperforms existing approaches for event temporal relation extraction. We additionally present detailed analyses on uncertainty quantification, comparison of priors, and ablation studies, illustrating the benefits of the proposed approach.
pdf
bib
abs
Persona Expansion with Commonsense Knowledge for Diverse and Consistent Response Generation
Donghyun Kim
|
Youbin Ahn
|
Wongyu Kim
|
Chanhee Lee
|
Kyungchan Lee
|
Kyong-Ho Lee
|
Jeonguk Kim
|
Donghoon Shin
|
Yeonsoo Lee
Generating diverse and consistent responses is the ultimate goal of a persona-based dialogue. Although many studies have been conducted, the generated responses tend to be generic and bland due to the personas’ limited descriptiveness. Therefore, it is necessary to expand the given personas for more attractive responses. However, indiscriminate expansion of personas threaten the consistency of responses and therefore reduce the interlocutor’s interest in conversation. To alleviate this issue, we propose a consistent persona expansion framework that improves not only the diversity but also the consistency of persona-based responses. To do so, we define consistency criteria to avoid possible contradictions among personas as follows: 1) Intra-Consistency and 2) Inter-Consistency. Then, we construct a silver profile dataset to deliver the ability to conform with the consistency criteria to the expansion model. Finally, we propose a persona expansion model with an encoder-decoder structure, which considers the relatedness and consistency among personas. Our experiments on the Persona-Chat dataset demonstrate the superiority of the proposed framework.
pdf
bib
abs
UnifEE: Unified Evidence Extraction for Fact Verification
Nan Hu
|
Zirui Wu
|
Yuxuan Lai
|
Chen Zhang
|
Yansong Feng
FEVEROUS is a fact extraction and verification task that requires systems to extract evidence of both sentences and table cells from a Wikipedia dump, then predict the veracity of the given claim accordingly. Existing works extract evidence in the two formats separately, ignoring potential connections between them. In this paper, we propose a Unified Evidence Extraction model (UnifEE), which uses a mixed evidence graph to extract the evidence in both formats. With the carefully-designed unified evidence graph, UnifEE allows evidence interactions among all candidates in both formats at similar granularity. Experiments show that, with information aggregated from related evidence candidates in the fusion graph, UnifEE can make better decisions about which evidence should be kept, especially for claims requiring multi-hop reasoning or a combination of tables and texts. Thus it outperforms all previous evidence extraction methods and brings significant improvement in the subsequent claim verification step.
pdf
bib
abs
MiniALBERT: Model Distillation via Parameter-Efficient Recursive Transformers
Mohammadmahdi Nouriborji
|
Omid Rohanian
|
Samaneh Kouchaki
|
David A. Clifton
Pre-trained Language Models (LMs) have become an integral part of Natural Language Processing (NLP) in recent years, due to their superior performance in downstream applications. In spite of this resounding success, the usability of LMs is constrained by computational and time complexity, along with their increasing size; an issue that has been referred to as overparameterisation. Different strategies have been proposed in the literature to alleviate these problems, with the aim to create effective compact models that nearly match the performance of their bloated counterparts with negligible performance losses. One of the most popular techniques in this area of research is model distillation. Another potent but underutilised technique is cross-layer parameter sharing. In this work, we combine these two strategies and present MiniALBERT, a technique for converting the knowledge of fully parameterised LMs (such as BERT) into a compact recursive student. In addition, we investigate the application of bottleneck adapters for layer-wise adaptation of our recursive student, and also explore the efficacy of adapter tuning for fine-tuning of compact models. We test our proposed models on a number of general and biomedical NLP tasks to demonstrate their viability and compare them with the state-of-the-art and other existing compact models. All the codes used in the experiments and the pre-trained compact models will be made publicly available.
pdf
bib
abs
Multilingual Normalization of Temporal Expressions with Masked Language Models
Lukas Lange
|
Jannik Strötgen
|
Heike Adel
|
Dietrich Klakow
The detection and normalization of temporal expressions is an important task and preprocessing step for many applications. However, prior work on normalization is rule-based, which severely limits the applicability in real-world multilingual settings, due to the costly creation of new rules. We propose a novel neural method for normalizing temporal expressions based on masked language modeling. Our multilingual method outperforms prior rule-based systems in many languages, and in particular, for low-resource languages with performance improvements of up to 33 F1 on average compared to the state of the art.
pdf
bib
abs
K-hop neighbourhood regularization for few-shot learning on graphs: A case study of text classification
Niels van der Heijden
|
Ekaterina Shutova
|
Helen Yannakoudakis
We present FewShotTextGCN, a novel method designed to effectively utilize the properties of word-document graphs for improved learning in low-resource settings. We introduce K-hop Neighbourhood Regularization, a regularizer for heterogeneous graphs, and show that it stabilizes and improves learning when only a few training samples are available. We furthermore propose a simplification in the graph-construction method, which results in a graph that is ∼7 times less dense and yields better performance in little-resource settings while performing on par with the state of the art in high-resource settings. Finally, we introduce a new variant of Adaptive Pseudo-Labeling tailored for word-document graphs. When using as little as 20 samples for training, we outperform a strong TextGCN baseline with 17% in absolute accuracy on average over eight languages. We demonstrate that our method can be applied to document classification without any language model pretraining on a wide range of typologically diverse languages while performing on par with large pretrained language models.
pdf
bib
abs
What Clued the AI Doctor In? On the Influence of Data Source and Quality for Transformer-Based Medical Self-Disclosure Detection
Mina Valizadeh
|
Xing Qian
|
Pardis Ranjbar-Noiey
|
Cornelia Caragea
|
Natalie Parde
Recognizing medical self-disclosure is important in many healthcare contexts, but it has been under-explored by the NLP community. We conduct a three-pronged investigation of this task. We (1) manually expand and refine the only existing medical self-disclosure corpus, resulting in a new, publicly available dataset of 3,919 social media posts with clinically validated labels and high compatibility with the existing task-specific protocol. We also (2) study the merits of pretraining task domain and text style by comparing Transformer-based models for this task, pretrained from general, medical, and social media sources. Our BERTweet condition outperforms the existing state of the art for this task by a relative F1 score increase of 16.73%. Finally, we (3) compare data augmentation techniques for this task, to assess the extent to which medical self-disclosure data may be further synthetically expanded. We discover that this task poses many challenges for data augmentation techniques, and we provide an in-depth analysis of identified trends.
pdf
bib
abs
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective
Zijian Zhang
|
Chang Shu
|
Ya Xiao
|
Yuan Shen
|
Di Zhu
|
Youxin Chen
|
Jing Xiao
|
Jey Han Lau
|
Qian Zhang
|
Zheng Lu
Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at
https://github.com/96-Zachary/vse_2ad .
pdf
bib
abs
Policy-based Reinforcement Learning for Generalisation in Interactive Text-based Environments
Edan Toledo
|
Jan Buys
|
Jonathan Shock
Text-based environments enable RL agents to learn to converse and perform interactive tasks through natural language. However, previous RL approaches applied to text-based environments show poor performance when evaluated on unseen games. This paper investigates the improvement of generalisation performance through the simple switch from a value-based update method to a policy-based one, within text-based environments. We show that by replacing commonly used value-based methods with REINFORCE with baseline, a far more general agent is produced. The policy-based agent is evaluated on Coin Collector and Question Answering with interactive text (QAit), two text-based environments designed to test zero-shot performance. We see substantial improvements on a variety of zero-shot evaluation experiments, including tripling accuracy on various QAit benchmark configurations. The results indicate that policy-based RL has significantly better generalisation capabilities than value-based methods within such text-based environments, suggesting that RL agents could be applied to more complex natural language environments.
pdf
bib
abs
Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning
Hongyin Luo
|
James Glass
Due to their similarity-based learning objectives, pretrained sentence encoders often internalize stereotypical assumptions that reflect the social biases that exist within their training corpora. In this paper, we describe several kinds of stereotypes concerning different communities that are present in popular sentence representation models, including pretrained next sentence prediction and contrastive sentence representation models. We compare such models to textual entailment models that learn language logic for a variety of downstream language understanding tasks. By comparing strong pretrained models based on text similarity with textual entailment learning, we conclude that the explicit logic learning with textual entailment can significantly reduce bias and improve the recognition of social communities, without an explicit de-biasing process.
pdf
bib
abs
Entity Tracking via Effective Use of Multi-Task Learning Model and Mention-guided Decoding
Janvijay Singh
|
Fan Bai
|
Zhen Wang
Cross-task knowledge transfer via multi-task learning has recently made remarkable progress in general NLP tasks. However, entity tracking on the procedural text has not benefited from such knowledge transfer because of its distinct formulation, i.e., tracking the event flow while following structural constraints. State-of-the-art entity tracking approaches either design complicated model architectures or rely on task-specific pre-training to achieve good results. To this end, we propose MeeT, a Multi-task learning-enabled entity Tracking approach, which utilizes knowledge gained from general domain tasks to improve entity tracking. Specifically, MeeT first fine-tunes T5, a pre-trained multi-task learning model, with entity tracking-specialized QA formats, and then employs our customized decoding strategy to satisfy the structural constraints. MeeT achieves state-of-the-art performances on two popular entity tracking datasets, even though it does not require any task-specific architecture design or pre-training.
pdf
bib
abs
Conversational Tree Search: A New Hybrid Dialog Task
Dirk Väth
|
Lindsey Vanderlyn
|
Ngoc Thang Vu
Conversational interfaces provide a flexible and easy way for users to seek information that may otherwise be difficult or inconvenient to obtain. However, existing interfaces generally fall into one of two categories: FAQs, where users must have a concrete question in order to retrieve a general answer, or dialogs, where users must follow a pre-defined path but may receive a personalized answer. In this paper, we introduce Conversational Tree Search (CTS) as a new task that bridges the gap between FAQ-style information retrieval and task-oriented dialog, allowing domain-experts to define dialog trees which can then be converted to an efficient dialog policy that learns only to ask the questions necessary to navigate a user to their goal. We collect a dataset for the travel reimbursement domain and demonstrate a baseline as well as a novel deep Reinforcement Learning architecture for this task. Our results show that the new architecture combines the positive aspects of both the FAQ and dialog system used in the baseline and achieves higher goal completion while skipping unnecessary questions.
pdf
bib
abs
A Human Subject Study of Named Entity Recognition in Conversational Music Recommendation Queries
Elena Epure
|
Romain Hennequin
We conducted a human subject study of named entity recognition on a noisy corpus of conversational music recommendation queries, with many irregular and novel named entities. We evaluated the human NER linguistic behaviour in these challenging conditions and compared it with the most common NER systems nowadays, fine-tuned transformers. Our goal was to learn about the task to guide the design of better evaluation methods and NER algorithms. The results showed that NER in our context was quite hard for both human and algorithms under a strict evaluation schema; humans had higher precision, while the model higher recall because of entity exposure especially during pre-training; and entity types had different error patterns (e.g. frequent typing errors for artists). The released corpus goes beyond predefined frames of interaction and can support future work in conversational music recommendation.
pdf
bib
abs
Entity Disambiguation with Entity Definitions
Luigi Procopio
|
Simone Conia
|
Edoardo Barba
|
Roberto Navigli
Local models have recently attained astounding performances in Entity Disambiguation (ED), with generative and extractive formulations being the most promising research directions. However, previous works have so far limited their studies to using, as the textual representation of each candidate, only its Wikipedia title. Although certainly effective, this strategy presents a few critical issues, especially when titles are not sufficiently informative or distinguishable from one another. In this paper, we address this limitation and investigate the extent to which more expressive textual representations can mitigate it. We evaluate our approach thoroughly against standard benchmarks in ED and find extractive formulations to be particularly well-suited to such representations. We report a new state of the art on 2 out of the 6 benchmarks we consider and strongly improve the generalization capability over unseen patterns. We release our code, data and model checkpoints at
https://github.com/SapienzaNLP/extend.
pdf
bib
abs
Exploring Paracrawl for Document-level Neural Machine Translation
Yusser Al Ghussin
|
Jingyi Zhang
|
Josef van Genabith
Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in realworld translation systems mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages) and therefore previous works only used Paracrawl for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.
pdf
bib
abs
Poor Man’s Quality Estimation: Predicting Reference-Based MT Metrics Without the Reference
Vilém Zouhar
|
Shehzaad Dhuliawala
|
Wangchunshu Zhou
|
Nico Daheim
|
Tom Kocmi
|
Yuchen Eleanor Jiang
|
Mrinmaya Sachan
Machine translation quality estimation (QE) predicts human judgements of a translation hypothesis without seeing the reference. State-of-the-art QE systems based on pretrained language models have been achieving remarkable correlations with human judgements yet they are computationally heavy and require human annotations, which are slow and expensive to create. To address these limitations, we define the problem of metric estimation (ME) where one predicts the automated metric scores also without the reference. We show that even without access to the reference, our model can estimate automated metrics (ρ = 60% for BLEU, ρ = 51% for other metrics) at the sentence-level. Because automated metrics correlate with human judgements, we can leverage the ME task for pre-training a QE model. For the QE task, we find that pre-training on TER is better (ρ = 23%) than training for scratch (ρ = 20%).
pdf
bib
abs
Integrating Translation Memories into Non-Autoregressive Machine Translation
Jitao Xu
|
Josep Crego
|
François Yvon
Non-autoregressive machine translation (NAT) has recently made great progress. However, most works to date have focused on standard translation tasks, even though some edit-based NAT models, such as the Levenshtein Transformer (LevT), seem well suited to translate with a Translation Memory (TM). This is the scenario considered here. We first analyze the vanilla LevT model and explain why it does not do well in this setting. We then propose a new variant, TM-LevT, and show how to effectively train this model. By modifying the data presentation and introducing an extra deletion operation, we obtain performance that are on par with an autoregressive approach, while reducing the decoding load. We also show that incorporating TMs during training dispenses to use knowledge distillation, a well-known trick used to mitigate the multimodality issue.
pdf
bib
abs
Shorten the Long Tail for Rare Entity and Event Extraction
Pengfei Yu
|
Heng Ji
The distribution of knowledge elements such as entity types and event types is long-tailed in natural language. Hence information extraction datasets naturally conform long-tailed distribution. Although imbalanced datasets can teach the model about the useful real-world bias, deep learning models may learn features not generalizable to rare or unseen expressions of entities or events during evaluation, especially for rare types without sufficient training instances. Existing approaches for the long-tailed learning problem seek to manipulate the training data by re-balancing, augmentation or introducing extra prior knowledge. In comparison, we propose to handle the generalization challenge by making the evaluation instances closer to the frequent training cases. We design a new transformation module that transforms infrequent candidate mention representation during evaluation with the average mention representation in the training dataset. Experimental results on classic benchmarks on three entity or event extraction datasets demonstrates the effectiveness of our framework.
pdf
bib
abs
Do Deep Neural Networks Capture Compositionality in Arithmetic Reasoning?
Keito Kudo
|
Yoichi Aoki
|
Tatsuki Kuribayashi
|
Ana Brassard
|
Masashi Yoshikawa
|
Keisuke Sakaguchi
|
Kentaro Inui
Compositionality is a pivotal property of symbolic reasoning. However, how well recent neural models capture compositionality remains underexplored in the symbolic reasoning tasks. This study empirically addresses this question by systematically examining recently published pre-trained seq2seq models with a carefully controlled dataset of multi-hop arithmetic symbolic reasoning. We introduce a skill tree on compositionality in arithmetic symbolic reasoning that defines the hierarchical levels of complexity along with three compositionality dimensions: systematicity, productivity, and substitutivity. Our experiments revealed that among the three types of composition, the models struggled most with systematicity, performing poorly even with relatively simple compositions. That difficulty was not resolved even after training the models with intermediate reasoning steps.
pdf
bib
abs
BLM-AgrF: A New French Benchmark to Investigate Generalization of Agreement in Neural Networks
Aixiu An
|
Chunyang Jiang
|
Maria A. Rodriguez
|
Vivi Nastase
|
Paola Merlo
Successful machine learning systems currently rely on massive amounts of data, which are very effective in hiding some of the shallowness of the learned models. To help train models with more complex and compositional skills, we need challenging data, on which a system is successful only if it detects structure and regularities, that will allow it to generalize. In this paper, we describe a French dataset (BLM-AgrF) for learning the underlying rules of subject-verb agreement in sentences, developed in the BLM framework, a new task inspired by visual IQ tests known as Raven’s Progressive Matrices. In this task, an instance consists of sequences of sentences with specific attributes. To predict the correct answer as the next element of the sequence, a model must correctly detect the generative model used to produce the dataset. We provide details and share a dataset built following this methodology. Two exploratory baselines based on commonly used architectures show that despite the simplicity of the phenomenon, it is a complex problem for deep learning systems.
pdf
bib
abs
Robustification of Multilingual Language Models to Real-world Noise in Crosslingual Zero-shot Settings with Robust Contrastive Pretraining
Asa Cooper Stickland
|
Sailik Sengupta
|
Jason Krone
|
Saab Mansour
|
He He
Advances in neural modeling have achieved state-of-the-art (SOTA) results on public natural language processing (NLP) benchmarks, at times surpassing human performance. However, there is a gap between public benchmarks and real-world applications where noise, such as typographical or grammatical mistakes, is abundant and can result in degraded performance. Unfortunately, works which evaluate the robustness of neural models on noisy data and propose improvements, are limited to the English language. Upon analyzing noise in different languages, we observe that noise types vary greatly across languages. Thus, existing investigations do not generalize trivially to multilingual settings. To benchmark the performance of pretrained multilingual language models, we construct noisy datasets covering five languages and four NLP tasks and observe a clear gap in the performance between clean and noisy data in the zero-shot cross-lingual setting. After investigating several ways to boost the robustness of multilingual models in this setting, we propose Robust Contrastive Pretraining (RCP). RCP combines data augmentation with a contrastive loss term at the pretraining stage and achieves large improvements on noisy (and original test data) across two sentence-level (+3.2%) and two sequence-labeling (+10 F1-score) multilingual classification tasks.
pdf
bib
abs
Unsupervised Anomaly Detection in Multi-Topic Short-Text Corpora
Mira Ait-Saada
|
Mohamed Nadif
Unsupervised anomaly detection seeks to identify deviant data samples in a dataset without using labels and constitutes a challenging task, particularly when the majority class is heterogeneous. This paper addresses this topic for textual data and aims to determine whether a text sample is an outlier within a potentially multi-topic corpus. To this end, it is crucial to grasp the semantic aspects of words, particularly when dealing with short texts, since it is difficult to syntactically discriminate data samples based only on a few words. Thereby we make use of word embeddings to represent each sample by a dense vector, efficiently capturing the underlying semantics. Then, we rely on the Mixture Model approach to detect which samples deviate the most from the underlying distributions of the corpus. Experiments carried out on real datasets show the effectiveness of the proposed approach in comparison to state-of-the-art techniques both in terms of performance and time efficiency, especially when more than one topic is present in the corpus.
pdf
bib
abs
Metaphor Detection with Effective Context Denoising
Shun Wang
|
Yucheng Li
|
Chenghua Lin
|
Loic Barrault
|
Frank Guerin
We propose a novel RoBERTa-based model, RoPPT, which introduces a target-oriented parse tree structure in metaphor detection. Compared to existing models, RoPPT focuses on semantically relevant information and achieves the state-of-the-art on several main metaphor datasets. We also compare our approach against several popular denoising and pruning methods, demonstrating the effectiveness of our approach in context denoising. Our code and dataset can be found at
https://github.com/MajiBear000/RoPPT.
pdf
bib
abs
Low-Resource Compositional Semantic Parsing with Concept Pretraining
Subendhu Rongali
|
Mukund Sridhar
|
Haidar Khan
|
Konstantine Arkoudas
|
Wael Hamza
|
Andrew McCallum
Semantic parsing plays a key role in digital voice assistants such as Alexa, Siri, and Google Assistant by mapping natural language to structured meaning representations. When we want to improve the capabilities of a voice assistant by adding a new domain, the underlying semantic parsing model needs to be retrained using thousands of annotated examples from the new domain, which is time-consuming and expensive. In this work, we present an architecture to perform such domain adaptation automatically, with only a small amount of metadata about the new domain and without any new training data (zero-shot) or with very few examples (few-shot). We use a base seq2seq (sequence-to-sequence) architecture and augment it with a concept encoder that encodes intent and slot tags from the new domain. We also introduce a novel decoder-focused approach to pretrain seq2seq models to be concept aware using Wikidata and use it to help our model learn important concepts and perform well in low-resource settings. We report few-shot and zero-shot results for compositional semantic parsing on the TOPv2 dataset and show that our model outperforms prior approaches in few-shot settings for the TOPv2 and SNIPS datasets.
pdf
bib
abs
Made of Steel? Learning Plausible Materials for Components in the Vehicle Repair Domain
Annerose Eichel
|
Helena Schlipf
|
Sabine Schulte im Walde
We propose a novel approach to learn domain-specific plausible materials for components in the vehicle repair domain by probing Pretrained Language Models (PLMs) in a cloze task style setting to overcome the lack of annotated datasets. We devise a new method to aggregate salient predictions from a set of cloze query templates and show that domain-adaptation using either a small, high-quality or a customized Wikipedia corpus boosts performance. When exploring resource-lean alternatives, we find a distilled PLM clearly outperforming a classic pattern-based algorithm. Further, given that 98% of our domain-specific components are multiword expressions, we successfully exploit the compositionality assumption as a way to address data sparsity.
pdf
bib
abs
Self-Adapted Utterance Selection for Suicidal Ideation Detection in Lifeline Conversations
Zhong-Ling Wang
|
Po-Hsien Huang
|
Wen-Yau Hsu
|
Hen-Hsen Huang
This paper investigates a crucial aspect of mental health by exploring the detection of suicidal ideation in spoken phone conversations between callers and counselors at a suicide prevention hotline. These conversations can be lengthy, noisy, and cover a broad range of topics, making it challenging for NLP models to accurately identify the caller’s suicidal ideation. To address these difficulties, we introduce a novel, self-adaptive approach that identifies the most critical utterances that the NLP model can more easily distinguish. The experiments use real-world Lifeline transcriptions, expertly labeled, and show that our approach outperforms the baseline models in overall performance with an F-score of 66.01%. In detecting the most dangerous cases, our approach achieves a significantly higher F-score of 65.94% compared to the baseline models, an improvement of 8.9%. The selected utterances can also provide valuable insights for suicide prevention research. Furthermore, our approach demonstrates its versatility by showing its effectiveness in sentiment analysis, making it a valuable tool for NLP applications beyond the healthcare domain.
pdf
bib
abs
Can Pretrained Language Models (Yet) Reason Deductively?
Zhangdie Yuan
|
Songbo Hu
|
Ivan Vulić
|
Anna Korhonen
|
Zaiqiao Meng
Acquiring factual knowledge with Pretrained Language Models (PLMs) has attracted increasing attention, showing promising performance in many knowledge-intensive tasks. Their good performance has led the community to believe that the models do possess a modicum of reasoning competence rather than merely memorising the knowledge. In this paper, we conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Through a series of controlled experiments, we posit two main findings. 1) PLMs inadequately generalise learned logic rules and perform inconsistently against simple adversarial surface form edits. 2) While the deductive reasoning fine-tuning of PLMs does improve their performance on reasoning over unseen knowledge facts, it results in catastrophically forgetting the previously learnt knowledge. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning, demonstrating the importance of controlled examinations and probing of PLMs’ deductive reasoning abilities; we reach beyond (misleading) task performance, revealing that PLMs are still far from robust reasoning capabilities, even for simple deductive tasks.
pdf
bib
abs
Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information
Yen-Ting Lin
|
Alexandros Papangelis
|
Seokhwan Kim
|
Sungjin Lee
|
Devamanyu Hazarika
|
Mahdi Namazifar
|
Di Jin
|
Yang Liu
|
Dilek Hakkani-Tur
This work focuses on in-context data augmentation for intent detection. Having found that augmentation via in-context prompting of large pre-trained language models (PLMs) alone does not improve performance, we introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model. Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents. It then employs intent-aware filtering, based on PVI, to remove datapoints that are not helpful to the downstream intent classifier. Our method is thus able to leverage the expressive power of large language models to produce diverse training data. Empirical results demonstrate that our method can produce synthetic training data that achieve state-of-the-art performance on three challenging intent detection datasets under few-shot settings (1.28% absolute improvement in 5-shot and 1.18% absolute in 10-shot, on average) and perform on par with the state-of-the-art in full-shot settings (within 0.01% absolute, on average).
pdf
bib
abs
Multilingual Representation Distillation with Contrastive Learning
Weiting Tan
|
Kevin Heffernan
|
Holger Schwenk
|
Philipp Koehn
Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., find semantically similar sentences that can be used as translations of each other). We validate our approach with multilingual similarity search and corpus filtering tasks. Experiments across different low-resource languages show that our method greatly outperforms previous sentence encoders such as LASER, LASER3, and LaBSE.
pdf
bib
abs
On the inconsistency of separable losses for structured prediction
Caio Corro
In this paper, we prove that separable negative log-likelihood losses for structured prediction are not necessarily Bayes consistent, that is minimizing these losses may not result in a model that predicts the most probable structure in the data distribution for a given input. This fact opens the question of whether these losses are well-adapted for structured prediction and, if so, why.
pdf
bib
abs
A Systematic Search for Compound Semantics in Pretrained BERT Architectures
Filip Miletic
|
Sabine Schulte im Walde
To date, transformer-based models such as BERT have been less successful in predicting compositionality of noun compounds than static word embeddings. This is likely related to a suboptimal use of the encoded information, reflecting an incomplete grasp of how the models represent the meanings of complex linguistic structures. This paper investigates variants of semantic knowledge derived from pretrained BERT when predicting the degrees of compositionality for 280 English noun compounds associated with human compositionality ratings. Our performance strongly improves on earlier unsupervised implementations of pretrained BERT and highlights beneficial decisions in data preprocessing, embedding computation, and compositionality estimation. The distinct linguistic roles of heads and modifiers are reflected by differences in BERT-derived representations, with empirical properties such as frequency, productivity, and ambiguity affecting model performance. The most relevant representational information is concentrated in the initial layers of the model architecture.
pdf
bib
abs
Efficiently Upgrading Multilingual Machine Translation Models to Support More Languages
Simeng Sun
|
Maha Elbayad
|
Anna Sun
|
James Cross
With multilingual machine translation (MMT) models continuing to grow in size and number of supported languages, it is natural to reuse and upgrade existing models to save computation as data becomes available in more languages. However, adding new languages requires updating the vocabulary, which complicates the reuse of embeddings. The question of how to reuse existing models while also making architectural changes to provide capacity for both old and new languages has also not been closely studied. In this work, we introduce three techniques that help speed up the effective learning of new languages and alleviate catastrophic forgetting despite vocabulary and architecture mismatches. Our results show that by (1) carefully initializing the network, (2) applying learning rate scaling, and (3) performing data up-sampling, it is possible to exceed the performance of a same-sized baseline model with 30% computation and recover the performance of a larger model trained from scratch with over 50% reduction in computation. Furthermore, our analysis reveals that the introduced techniques help learn new directions more effectively and alleviate catastrophic forgetting at the same time. We hope our work will guide research into more efficient approaches to growing languages for these MMT models and ultimately maximize the reuse of existing models.
pdf
bib
abs
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages
Wasi Uddin Ahmad
|
Saikat Chakraborty
|
Baishakhi Ray
|
Kai-Wei Chang
Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The target-to-source model generates noisy sources, while the source-to-target model is trained to reconstruct the targets and vice versa. Recent developments of multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Hence, training them to build programming language translation systems via back-translation is compelling. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we propose performing back-translation via code summarization and generation. In code summarization, a model learns to generate natural language (NL) summaries given code snippets. In code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as a target-to-NL-to-source generation. We show that our proposed approach performs competitively with state-of-the-art methods. We have made the code publicly available.
pdf
bib
abs
The Impacts of Unanswerable Questions on the Robustness of Machine Reading Comprehension Models
Son Quoc Tran
|
Phong Nguyen-Thuan Do
|
Uyen Le
|
Matt Kretchmar
Pretrained language models have achieved super-human performances on many Machine Reading Comprehension (MRC) benchmarks. Nevertheless, their relative inability to defend against adversarial attacks has spurred skepticism about their natural language understanding. In this paper, we ask whether training with unanswerable questions in SQuAD 2.0 can help improve the robustness of MRC models against adversarial attacks. To explore that question, we fine-tune three state-of-the-art language models on either SQuAD 1.1 or SQuAD 2.0 and then evaluate their robustness under adversarial attacks. Our experiments reveal that current models fine-tuned on SQuAD 2.0 do not initially appear to be any more robust than ones fine-tuned on SQuAD 1.1, yet they reveal a measure of hidden robustness that can be leveraged to realize actual performance gains. Furthermore, we find that robustness of models fine-tuned on SQuAD 2.0 extends on additional out-of-domain datasets. Finally, we introduce a new adversarial attack to reveal of SQuAD 2.0 that current MRC models are learning.
pdf
bib
abs
FrameBERT: Conceptual Metaphor Detection with Frame Embedding Learning
Yucheng Li
|
Shun Wang
|
Chenghua Lin
|
Frank Guerin
|
Loic Barrault
In this paper, we propose FrameBERT, a BERT-based model that can explicitly learn and incorporate FrameNet Embeddings for concept-level metaphor detection. FrameBERT not only achieves better or comparable performance to the state-of-the-art, but also is more explainable and interpretable compared to existing models, attributing to its ability of accounting for external knowledge of FrameNet.
pdf
bib
abs
Towards More Efficient Insertion Transformer with Fractional Positional Encoding
Zhisong Zhang
|
Yizhe Zhang
|
Bill Dolan
Auto-regressive neural sequence models have been shown to be effective across text generation tasks. However, their left-to-right decoding order prevents generation from being parallelized. Insertion Transformer (Stern et al., 2019) is an attractive alternative that allows outputting multiple tokens in a single generation step. Nevertheless, due to the incompatibility between absolute positional encoding and insertion-based generation schemes, it needs to refresh the encoding of every token in the generated partial hypothesis at each step, which could be costly. We design a novel reusable positional encoding scheme for Insertion Transformers called Fractional Positional Encoding (FPE), which allows reusing representations calculated in previous steps. Empirical studies on various text generation tasks demonstrate the effectiveness of FPE, which leads to floating-point operation reduction and latency improvements on batched decoding.
pdf
bib
abs
SODAPOP: Open-Ended Discovery of Social Biases in Social Commonsense Reasoning Models
Haozhe An
|
Zongxia Li
|
Jieyu Zhao
|
Rachel Rudinger
A common limitation of diagnostic tests for detecting social biases in NLP models is that they may only detect stereotypic associations that are pre-specified by the designer of the test. Since enumerating all possible problematic associations is infeasible, it is likely these tests fail to detect biases that are present in a model but not pre-specified by the designer. To address this limitation, we propose SODAPOP (SOcial bias Discovery from Answers about PeOPle), an approach for automatic social bias discovery in social commonsense question-answering. The SODAPOP pipeline generates modified instances from the Social IQa dataset (Sap et al., 2019b) by (1) substituting names associated with different demographic groups, and (2) generating many distractor answers from a masked language model. By using a social commonsense model to score the generated distractors, we are able to uncover the model’s stereotypic associations between demographic groups and an open set of words. We also test SODAPOP on debiased models and show the limitations of multiple state-of-the-art debiasing algorithms.
pdf
bib
abs
Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering
Wenhu Chen
|
Pat Verga
|
Michiel de Jong
|
John Wieting
|
William W. Cohen
Existing state-of-the-art methods for open-domain question-answering (ODQA) use an open book approach in which information is first retrieved from a large text corpus or knowledge base (KB) and then reasoned over to produce an answer. A recent alternative is to retrieve from a collection of previously-generated question-answer pairs; this has several practical advantages including being more memory and compute-efficient. Question-answer pairs are also appealing in that they can be viewed as an intermediate between text and KB triples: like KB triples, they often concisely express a single relationship, but like text, have much higher coverage than traditional KBs. In this work, we describe a new QA system that augments a text-to-text model with a large memory of question-answer pairs, and a new pre-training task for the latent step of question retrieval. The pre-training task substantially simplifies training and greatly improves performance on smaller QA benchmarks. Unlike prior systems of this sort, our QA system can also answer multi-hop questions that do not explicitly appear in the collection of stored question-answer pairs.
pdf
bib
abs
Gold Doesn’t Always Glitter: Spectral Removal of Linear and Nonlinear Guarded Attribute Information
Shun Shao
|
Yftah Ziser
|
Shay B. Cohen
We describe a simple and effective method (Spectral Attribute removaL; SAL) to remove private or guarded information from neural representations. Our method uses matrix decomposition to project the input representations into directions with reduced covariance with the guarded information rather than maximal covariance as factorization methods normally use. We begin with linear information removal and proceed to generalize our algorithm to the case of nonlinear information removal using kernels. Our experiments demonstrate that our algorithm retains better main task performance after removing the guarded information compared to previous work. In addition, our experiments demonstrate that we need a relatively small amount of guarded attribute data to remove information about these attributes, which lowers the exposure to sensitive data and is more suitable for low-resource scenarios.
pdf
bib
abs
CTC Alignments Improve Autoregressive Translation
Brian Yan
|
Siddharth Dalmia
|
Yosuke Higuchi
|
Graham Neubig
|
Florian Metze
|
Alan W Black
|
Shinji Watanabe
Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC’s core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
pdf
bib
abs
Modelling Temporal Document Sequences for Clinical ICD Coding
Boon Liang Clarence Ng
|
Diogo Santos
|
Marek Rei
Past studies on the ICD coding problem focus on predicting clinical codes primarily based on the discharge summary. This covers only a small fraction of the notes generated during each hospital stay and leaves potential for improving performance by analysing all the available clinical notes. We propose a hierarchical transformer architecture that uses text across the entire sequence of clinical notes in each hospital stay for ICD coding, and incorporates embeddings for text metadata such as their position, time, and type of note. While using all clinical notes increases the quantity of data substantially, superconvergence can be used to reduce training costs. We evaluate the model on the MIMIC-III dataset. Our model exceeds the prior state-of-the-art when using only discharge summaries as input, and achieves further performance improvements when all clinical notes are used as input.
pdf
bib
abs
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization
Kalpesh Krishna
|
Erin Bransom
|
Bailey Kuehl
|
Mohit Iyyer
|
Pradeep Dasigi
|
Arman Cohan
|
Kyle Lo
While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units highly correlates with scores from a full annotation workload (0.89 Kendall’s tau using 50% judgements). We release our human judgments, annotation templates, and software as a Python library for future research.
pdf
bib
abs
Cluster-Guided Label Generation in Extreme Multi-Label Classification
Taehee Jung
|
Joo-kyung Kim
|
Sungjin Lee
|
Dongyeop Kang
For extreme multi-label classification (XMC), existing classification-based models poorly per- form for tail labels and often ignore the semantic relations among labels, like treating”Wikipedia” and “Wiki” as independent and separate labels. In this paper, we cast XMC as a generation task (XLGen), where we benefit from pre-trained text-to-text models. However, generating labels from the extremely large label space is challenging without any constraints or guidance. We, therefore, propose to guide label generation using label cluster information to hierarchically generate lower-level labels. We also find that frequency-based label ordering and using decoding ensemble methods are critical factors for the improvements in XLGen. XLGen with cluster guidance significantly outperforms the classification and generation baselines on tail labels, and also generally improves the overall performance in four popular XMC benchmarks. In human evaluation, we also find XLGen generates unseen but plausible labels. Our code is now available at
https://github.com/alexa/xlgen-eacl-2023.
pdf
bib
abs
Empathy Identification Systems are not Accurately Accounting for Context
Andrew Lee
|
Jonathan K. Kummerfeld
|
Larry An
|
Rada Mihalcea
Understanding empathy in text dialogue data is a difficult, yet critical, skill for effective human-machine interaction. In this work, we ask whether systems are making meaningful progress on this challenge. We consider a simple model that checks if an input utterance is similar to a small set of empathetic examples. Crucially, the model does not look at what the utterance is a response to, i.e., the dialogue context. This model performs comparably to other work on standard benchmarks and even outperforms state-of-the-art models for empathetic rationale extraction by 16.7 points on T-F1 and 4.3 on IOU-F1. This indicates that current systems rely on the surface form of the response, rather than whether it is suitable in context. To confirm this, we create examples with dialogue contexts that change the interpretation of the response and show that current systems continue to label utterances as empathetic. We discuss the implications of our findings, including improvements for empathetic benchmarks and how our model can be an informative baseline.
pdf
bib
abs
Enhancing Multi-Document Summarization with Cross-Document Graph-based Information Extraction
Zixuan Zhang
|
Heba Elfardy
|
Markus Dreyer
|
Kevin Small
|
Heng Ji
|
Mohit Bansal
Information extraction (IE) and summarization are closely related, both tasked with presenting a subset of the information contained in a natural language text. However, while IE extracts structural representations, summarization aims to abstract the most salient information into a generated text summary – thus potentially encountering the technical limitations of current text generation methods (e.g., hallucination). To mitigate this risk, this work uses structured IE graphs to enhance the abstractive summarization task. Specifically, we focus on improving Multi-Document Summarization (MDS) performance by using cross-document IE output, incorporating two novel components: (1) the use of auxiliary entity and event recognition systems to focus the summary generation model; (2) incorporating an alignment loss between IE nodes and their text spans to reduce inconsistencies between the IE graphs and text representations. Operationally, both the IE nodes and corresponding text spans are projected into the same embedding space and pairwise distance is minimized. Experimental results on multiple MDS benchmarks show that summaries generated by our model are more factually consistent with the source documents than baseline models while maintaining the same level of abstractiveness.
pdf
bib
abs
What happens before and after: Multi-Event Commonsense in Event Coreference Resolution
Sahithya Ravi
|
Chris Tanner
|
Raymond Ng
|
Vered Shwartz
Event coreference models cluster event mentions pertaining to the same real-world event. Recent models rely on contextualized representations to recognize coreference among lexically or contextually similar mentions. However, models typically fail to leverage commonsense inferences, which is particularly limiting for resolving lexically-divergent mentions. We propose a model that extends event mentions with temporal commonsense inferences. Given a complex sentence with multiple events, e.g., “the man killed his wife and got arrested”, with the target event “arrested”, our model generates plausible events that happen before the target event – such as “the police arrived”, and after it, such as “he was sentenced”. We show that incorporating such inferences into an existing event coreference model improves its performance, and we analyze the coreferences in which such temporal knowledge is required.
pdf
bib
abs
Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models
Sepehr Janghorbani
|
Gerard De Melo
Recent breakthroughs in self-supervised training have led to a new class of pretrained vision–language models. While there have been investigations of bias in multimodal models, they have mostly focused on gender and racial bias, giving much less attention to other relevant groups, such as minorities with regard to religion, nationality, sexual orientation, or disabilities. This is mainly due to lack of suitable benchmarks for such groups. We seek to address this gap by providing a visual and textual bias benchmark called MMBias, consisting of around 3,800 images and phrases covering 14 population subgroups. We utilize this dataset to assess bias in several prominent self-supervised multimodal models, including CLIP, ALBEF, and ViLT. Our results show that these models demonstrate meaningful bias favoring certain groups. Finally, we introduce a debiasing method designed specifically for such large pretrained models that can be applied as a post-processing step to mitigate bias, while preserving the remaining accuracy of the model.
pdf
bib
abs
CylE: Cylinder Embeddings for Multi-hop Reasoning over Knowledge Graphs
Chau Duc Minh Nguyen
|
Tim French
|
Wei Liu
|
Michael Stewart
Recent geometric-based approaches have been shown to efficiently model complex logical queries (including the intersection operation) over Knowledge Graphs based on the natural representation of Venn diagram. Existing geometric-based models (using points, boxes embeddings), however, cannot handle the logical negation operation. Further, those using cones embeddings are limited to representing queries by two-dimensional shapes, which reduced their effectiveness in capturing entities query relations for correct answers. To overcome this challenge, we propose unbounded cylinder embeddings (namely CylE), which is a novel geometric-based model based on three-dimensional shapes. Our approach can handle a complete set of basic first-order logic operations (conjunctions, disjunctions and negations). CylE considers queries as Cartesian products of unbounded sector-cylinders and consider a set of nearest boxes corresponds to the set of answer entities. Precisely, the conjunctions can be represented via the intersections of unbounded sector-cylinders. Transforming queries to Disjunctive Normal Form can handle queries with disjunctions. The negations can be represented by considering the closure of complement for an arbitrary unbounded sector-cylinder. Empirical results show that the performance of multi-hop reasoning task using CylE significantly increases over state-of-the-art geometric-based query embedding models for queries without negation. For queries with negation operations, though the performance is on a par with the best performing geometric-based model, CylE significantly outperforms a recent distribution-based model.
pdf
bib
abs
Fiction-Writing Mode: An Effective Control for Human-Machine Collaborative Writing
Wenjie Zhong
|
Jason Naradowsky
|
Hiroya Takamura
|
Ichiro Kobayashi
|
Yusuke Miyao
We explore the idea of incorporating concepts from writing skills curricula into human-machine collaborative writing scenarios, focusing on adding writing modes as a control for text generation models. Using crowd-sourced workers, we annotate a corpus of narrative text paragraphs with writing mode labels. Classifiers trained on this data achieve an average accuracy of ~87% on held-out data. We fine-tune a set of large language models to condition on writing mode labels, and show that the generated text is recognized as belonging to the specified mode with high accuracy. To study the ability of writing modes to provide fine-grained control over generated text, we devise a novel turn-based text reconstruction game to evaluate the difference between the generated text and the author’s intention. We show that authors prefer text suggestions made by writing mode-controlled models on average 61.1% of the time, with satisfaction scores 0.5 higher on a 5-point ordinal scale. When evaluated by humans, stories generated via collaboration with writing mode-controlled models achieve high similarity with the professionally written target story. We conclude by identifying the most common mistakes found in the generated stories.
pdf
bib
abs
Robustness Challenges in Model Distillation and Pruning for Natural Language Understanding
Mengnan Du
|
Subhabrata Mukherjee
|
Yu Cheng
|
Milad Shokouhi
|
Xia Hu
|
Ahmed Hassan Awadallah
Recent work has focused on compressing pre-trained language models (PLMs) like BERT where the major focus has been to improve the in-distribution performance for downstream tasks. However, very few of these studies have analyzed the impact of compression on the generalizability and robustness of compressed models for out-of-distribution (OOD) data. Towards this end, we study two popular model compression techniques including knowledge distillation and pruning and show that the compressed models are significantly less robust than their PLM counterparts on OOD test sets although they obtain similar performance on in-distribution development sets for a task. Further analysis indicates that the compressed models overfit on the shortcut samples and generalize poorly on the hard ones. We further leverage this observation to develop a regularization strategy for robust model compression based on sample uncertainty.
pdf
bib
abs
Don’t Blame the Annotator: Bias Already Starts in the Annotation Instructions
Mihir Parmar
|
Swaroop Mishra
|
Mor Geva
|
Chitta Baral
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator’s instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model performance, and that models struggle to generalize beyond biases originating in the crowdsourcing instructions. We further analyze the influence of instruction bias in terms of pattern frequency and model size, and derive concrete recommendations for creating future NLU benchmarks.
pdf
bib
abs
Performance Prediction via Bayesian Matrix Factorisation for Multilingual Natural Language Processing Tasks
Viktoria Schram
|
Daniel Beck
|
Trevor Cohn
Performance prediction for Natural Language Processing (NLP) seeks to reduce the experimental burden resulting from the myriad of different evaluation scenarios, e.g., the combination of languages used in multilingual transfer. In this work, we explore the framework ofBayesian matrix factorisation for performance prediction, as many experimental settings in NLP can be naturally represented in matrix format. Our approach outperforms the state-of-the-art in several NLP benchmarks, including machine translation and cross-lingual entity linking. Furthermore, it also avoids hyperparameter tuning and is able to provide uncertainty estimates over predictions.
pdf
bib
abs
Unified Neural Topic Model via Contrastive Learning and Term Weighting
Sungwon Han
|
Mingi Shin
|
Sungkyu Park
|
Changwook Jung
|
Meeyoung Cha
Two types of topic modeling predominate: generative methods that employ probabilistic latent models and clustering methods that identify semantically coherent groups. This paper newly presents UTopic (Unified neural Topic model via contrastive learning and term weighting) that combines the advantages of these two types. UTopic uses contrastive learning and term weighting to learn knowledge from a pretrained language model and discover influential terms from semantically coherent clusters. Experiments show that the generated topics have a high-quality topic-word distribution in terms of topic coherence, outperforming existing baselines across multiple topic coherence measures. We demonstrate how our model can be used as an add-on to existing topic models and improve their performance.
pdf
bib
abs
Don’t Mess with Mister-in-Between: Improved Negative Search for Knowledge Graph Completion
Fan Jiang
|
Tom Drummond
|
Trevor Cohn
The best methods for knowledge graph completion use a ‘dual-encoding’ framework, a form of neural model with a bottleneck that facilitates fast approximate search over a vast collection of candidates. These approaches are trained using contrastive learning to differentiate between known positive examples and sampled negative instances. The mechanism for sampling negatives to date has been very simple, driven by pragmatic engineering considerations (e.g., using mismatched instances from the same batch). We propose several novel means of finding more informative negatives, based on searching for candidates with high lexical overlaps, from the dual-encoder model and according to knowledge graph structures. Experimental results on four benchmarks show that our best single model improves consistently over previous methods and obtains new state-of-the-art performance, including the challenging large-scale Wikidata5M dataset. Combing different kinds of strategies through model ensembling results in a further performance boost.
pdf
bib
abs
Semantic Frame Induction with Deep Metric Learning
Kosuke Yamada
|
Ryohei Sasano
|
Koichi Takeda
Recent studies have demonstrated the usefulness of contextualized word embeddings in unsupervised semantic frame induction. However, they have also revealed that generic contextualized embeddings are not always consistent with human intuitions about semantic frames, which causes unsatisfactory performance for frame induction based on contextualized embeddings. In this paper, we address supervised semantic frame induction, which assumes the existence of frame-annotated data for a subset of predicates in a corpus and aims to build a frame induction model that leverages the annotated data. We propose a model that uses deep metric learning to fine-tune a contextualized embedding model, and we apply the fine-tuned contextualized embeddings to perform semantic frame induction. Our experiments on FrameNet show that fine-tuning with deep metric learning considerably improves the clustering evaluation scores, namely, the B-cubed F-score and Purity F-score, by about 8 points or more. We also demonstrate that our approach is effective even when the number of training instances is small.
pdf
bib
abs
The Devil is in the Details: On Models and Training Regimes for Few-Shot Intent Classification
Mohsen Mesgar
|
Thy Thy Tran
|
Goran Glavaš
|
Iryna Gurevych
In task-oriented dialog (ToD) new intents emerge on regular basis, with a handful of available utterances at best. This renders effective Few-Shot Intent Classification (FSIC) a central challenge for modular ToD systems. Recent FSIC methods appear to be similar: they use pretrained language models (PLMs) to encode utterances and predominantly resort to nearest-neighbor-based inference. However, they also differ in major components: they start from different PLMs, use different encoding architectures and utterance similarity functions, and adopt different training regimes. Coupling of these vital components together with the lack of informative ablations prevents the identification of factors that drive the (reported) FSIC performance. We propose a unified framework to evaluate these components along the following key dimensions:(1) Encoding architectures: Cross-Encoder vs Bi-Encoders;(2) Similarity function: Parameterized (i.e., trainable) vs non-parameterized; (3) Training regimes: Episodic meta-learning vs conventional (i.e., non-episodic) training. Our experimental results on seven FSIC benchmarks reveal three new important findings. First, the unexplored combination of cross-encoder architecture and episodic meta-learning consistently yields the best FSIC performance. Second, episodic training substantially outperforms its non-episodic counterpart. Finally, we show that splitting episodes into support and query sets has a limited and inconsistent effect on performance. Our findings show the importance of ablations and fair comparisons in FSIC. We publicly release our code and data.
pdf
bib
abs
Iterative Document-level Information Extraction via Imitation Learning
Yunmo Chen
|
William Gantt
|
Weiwei Gu
|
Tongfei Chen
|
Aaron White
|
Benjamin Van Durme
We present a novel iterative extraction model, IterX, for extracting complex relations, or templates, i.e., N-tuples representing a mapping from named slots to spans of text within a document. Documents may feature zero or more instances of a template of any given type, and the task of template extraction entails identifying the templates in a document and extracting each template’s slot values. Our imitation learning approach casts the problem as a Markov decision process (MDP), and relieves the need to use predefined template orders to train an extractor. It leads to state-of-the-art results on two established benchmarks – 4-ary relation extraction on SciREX and template extraction on MUC-4 – as well as a strong baseline on the new BETTER Granular task.
pdf
bib
abs
CLICK: Contrastive Learning for Injecting Contextual Knowledge to Conversational Recommender System
Hyeongjun Yang
|
Heesoo Won
|
Youbin Ahn
|
Kyong-Ho Lee
Conversational recommender systems (CRSs) capture a user preference through a conversation. However, the existing CRSs lack capturing comprehensive user preferences. This is because the items mentioned in a conversation are mainly regarded as a user preference. Thus, they have limitations in identifying a user preference from a dialogue context expressed without preferred items. Inspired by the characteristic of an online recommendation community where participants identify a context of a recommendation request and then comment with appropriate items, we exploit the Reddit data. Specifically, we propose a Contrastive Learning approach for Injecting Contextual Knowledge (CLICK) from the Reddit data to the CRS task, which facilitates the capture of a context-level user preference from a dialogue context, regardless of the existence of preferred item-entities. Moreover, we devise a relevance-enhanced contrastive learning loss to consider the fine-grained reflection of multiple recommendable items. We further develop a response generation module to generate a persuasive rationale for a recommendation. Extensive experiments on the benchmark CRS dataset show the effectiveness of CLICK, achieving significant improvements over state-of-the-art methods.
pdf
bib
abs
LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
Zhuoyuan Mao
|
Tetsuji Nakagawa
Large-scale language-agnostic sentence embedding models such as LaBSE (Feng et al., 2022) obtain state-of-the-art performance for parallel sentence alignment. However, these large-scale models can suffer from inference speed and computation overhead. This study systematically explores learning language-agnostic sentence embeddings with lightweight models. We demonstrate that a thin-deep encoder can construct robust low-dimensional sentence embeddings for 109 languages. With our proposed distillation methods, we achieve further improvements by incorporating knowledge from a teacher model. Empirical results on Tatoeba, United Nations, and BUCC show the effectiveness of our lightweight models. We release our lightweight language-agnostic sentence embedding models LEALLA on TensorFlow Hub.
pdf
bib
abs
Synthesizing Human Gaze Feedback for Improved NLP Performance
Varun Khurana
|
Yaman Kumar
|
Nora Hollenstein
|
Rajesh Kumar
|
Balaji Krishnamurthy
Integrating human feedback in models can improve the performance of natural language processing (NLP) models. Feedback can be either explicit (e.g. ranking used in training language models) or implicit (e.g. using human cognitive signals in the form of eyetracking). Prior eye tracking and NLP research reveal that cognitive processes, such as human scanpaths, gleaned from human gaze patterns aid in the understanding and performance of NLP models. However, the collection of real eyetracking data for NLP tasks is challenging due to the requirement of expensive and precise equipment coupled with privacy invasion issues. To address this challenge, we propose ScanTextGAN, a novel model for generating human scanpaths over text. We show that ScanTextGAN-generated scanpaths can approximate meaningful cognitive signals in human gaze patterns. We include synthetically generated scanpaths in four popular NLP tasks spanning six different datasets as proof of concept and show that the models augmented with generated scanpaths improve the performance of all downstream NLP tasks.
pdf
bib
abs
Memory-efficient Temporal Moment Localization in Long Videos
Cristian Rodriguez-Opazo
|
Edison Marrese-Taylor
|
Basura Fernando
|
Hiroya Takamura
|
Qi Wu
Temporal Moment Localization is a challenging multi-modal task which aims to identify the start and end timestamps of a moment of interest in an input untrimmed video, given a query in natural language. Solving this task correctly requires understanding the temporal relationships in the entire input video, but processing such long inputs and reasoning about them is memory and computationally expensive. In light of this issue, we propose Stochastic Bucket-wise Feature Sampling (SBFS), a stochastic sampling module that allows methods to process long videos at a constant memory footprint. We further combine SBFS with a new consistency loss to propose Locformer, a Transformer-based model that can process videos as long as 18 minutes. We test our proposals on relevant benchmark datasets, showing that not only can Locformer achieve excellent results, but also that our sampling is more effective than competing counterparts. Concretely, SBFS consistently improves the performance of prior work, by up to 3.13% in the mean temporal IoU, leading to a new state-of-the-art performance on Charades-STA and YouCookII, while also obtaining up to 12.8x speed-up at testing time and reducing memory requirements by up to 5x.
pdf
bib
abs
Extracting Victim Counts from Text
Mian Zhong
|
Shehzaad Dhuliawala
|
Niklas Stoehr
Decision-makers in the humanitarian sector rely on timely and exact information during crisis events. Knowing how many civilians were injured during an earthquake is vital to allocate aids properly. Information about such victim counts are however often only available within full-text event descriptions from newspapers and other reports. Extracting numbers from text is challenging: numbers have different formats and may require numeric reasoning. This renders purely tagging approaches insufficient. As a consequence, fine-grained counts of injured, displaced, or abused victims beyond fatalities are often not extracted and remain unseen. We cast victim count extraction as a question answering (QA) task with a regression or classification objective. We compare tagging approaches: regex, dependency parsing, semantic role labeling, and advanced text-to-text models. Beyond model accuracy, we analyze extraction reliability and robustness which are key for this sensitive task. In particular, we discuss model calibration and investigate out-of-distribution and few-shot performance. Ultimately, we make a comprehensive recommendation on which model to select for different desiderata and data domains. Our work is among the first to apply numeracy-focused large language models in a real-world use case with a positive impact.
pdf
bib
abs
ConEntail: An Entailment-based Framework for Universal Zero and Few Shot Classification with Supervised Contrastive Pretraining
Ranran Haoran Zhang
|
Aysa Xuemo Fan
|
Rui Zhang
A universal classification model aims to generalize to diverse classification tasks in both zero and few shot settings. A promising way toward universal classification is to cast heterogeneous data formats into a dataset-agnostic “meta-task” (e.g., textual entailment, question answering) then pretrain a model on the combined meta dataset. The existing work is either pretrained on specific subsets of classification tasks, or pretrained on both classification and generation data but the model could not fulfill its potential in universality and reliability. These also leave a massive amount of annotated data under-exploited. To fill these gaps, we propose ConEntail, a new framework for universal zero and few shot classification with supervised contrastive pretraining. Our unified meta-task for classification is based on nested entailment. It can be interpreted as “Does sentence a entails [sentence b entails label c]”. This formulation enables us to make better use of 57 annotated classification datasets for supervised contrastive pretraining and universal evaluation. In this way, ConEntail helps the model (1) absorb knowledge from different datasets, and (2) gain consistent performance gain with more pretraining data. In experiments, we compare our model with discriminative and generative models pretrained on the same dataset. The results confirm that our framework effectively exploits existing annotated data and consistently outperforms baselines in both zero (9.4% average improvement) and few shot settings (3.5% average improvement). Our code is available in supplementary materials.
pdf
bib
abs
Guide the Learner: Controlling Product of Experts Debiasing Method Based on Token Attribution Similarities
Ali Modarressi
|
Hossein Amirkhani
|
Mohammad Taher Pilehvar
Several proposals have been put forward in recent years for improving out-of-distribution (OOD) performance through mitigating dataset biases. A popular workaround is to train a robust model by re-weighting training examples based on a secondary biased model. Here, the underlying assumption is that the biased model resorts to shortcut features. Hence, those training examples that are correctly predicted by the biased model are flagged as being biased and are down-weighted during the training of the main model. However, assessing the importance of an instance merely based on the predictions of the biased model may be too naive. It is possible that the prediction of the main model can be derived from another decision-making process that is distinct from the behavior of the biased model. To circumvent this, we introduce a fine-tuning strategy that incorporates the similarity between the main and biased model attribution scores in a Product of Experts (PoE) loss function to further improve OOD performance. With experiments conducted on natural language inference and fact verification benchmarks, we show that our method improves OOD results while maintaining in-distribution (ID) performance.
pdf
bib
abs
Task and Sentiment Adaptation for Appraisal Tagging
Lin Tian
|
Xiuzhen Zhang
|
Myung Hee Kim
|
Jennifer Biggs
The Appraisal framework in linguistics defines the framework for fine-grained evaluations and opinions and has contributed to sentiment analysis and opinion mining. As developing appraisal-annotated resources requires tagging of several dimensions with distinct semantic taxonomies, it has been primarily conducted manually by human experts through expensive and time-consuming processes. In this paper, we study how to automatically identify and annotate text segments for appraisal. We formulate the problem as a sequence tagging problem and propose novel task and sentiment adapters based on language models for appraisal tagging. Our model, named Adaptive Appraisal (A
ˆ2), achieves superior performance than baseline adapter-based models and other neural classification models, especially for cross-domain and cross-language settings. Source code for A
ˆ2 is available at:
https://github.com/ltian678/AA-code.gitpdf
bib
abs
DREEAM: Guiding Attention with Evidence for Improving Document-Level Relation Extraction
Youmi Ma
|
An Wang
|
Naoaki Okazaki
Document-level relation extraction (DocRE) is the task of identifying all relations between each entity pair in a document. Evidence, defined as sentences containing clues for the relationship between an entity pair, has been shown to help DocRE systems focus on relevant texts, thus improving relation extraction. However, evidence retrieval (ER) in DocRE faces two major issues: high memory consumption and limited availability of annotations. This work aims at addressing these issues to improve the usage of ER in DocRE. First, we propose DREEAM, a memory-efficient approach that adopts evidence information as the supervisory signal, thereby guiding the attention modules of the DocRE system to assign high weights to evidence. Second, we propose a self-training strategy for DREEAM to learn ER from automatically-generated evidence on massive data without evidence annotations. Experimental results reveal that our approach exhibits state-of-the-art performance on the DocRED benchmark for both DocRE and ER. To the best of our knowledge, DREEAM is the first approach to employ ER self-training.
pdf
bib
abs
Span-based Named Entity Recognition by Generating and Compressing Information
Nhung T. H. Nguyen
|
Makoto Miwa
|
Sophia Ananiadou
The information bottleneck (IB) principle has been proven effective in various NLP applications. The existing work, however, only used either generative or information compression models to improve the performance of the target task. In this paper, we propose to combine the two types of IB models into one system to enhance Named Entity Recognition (NER).For one type of IB model, we incorporate two unsupervised generative components, span reconstruction and synonym generation, into a span-based NER system. The span reconstruction ensures that the contextualised span representation keeps the span information, while the synonym generation makes synonyms have similar representations even in different contexts. For the other type of IB model, we add a supervised IB layer that performs information compression into the system to preserve useful features for NER in the resulting span representations. Experiments on five different corpora indicate that jointly training both generative and information compression models can enhance the performance of the baseline span-based NER system. Our source code is publicly available at
https://github.com/nguyennth/joint-ib-models.
pdf
bib
abs
An In-depth Analysis of Implicit and Subtle Hate Speech Messages
Nicolás Benjamín Ocampo
|
Ekaterina Sviridova
|
Elena Cabrio
|
Serena Villata
The research carried out so far in detecting abusive content in social media has primarily focused on overt forms of hate speech. While explicit hate speech (HS) is more easily identifiable by recognizing hateful words, messages containing linguistically subtle and implicit forms of HS (as circumlocution, metaphors and sarcasm) constitute a real challenge for automatic systems. While the sneaky and tricky nature of subtle messages might be perceived as less hurtful with respect to the same content expressed clearly, such abuse is at least as harmful as overt abuse. In this paper, we first provide an in-depth and systematic analysis of 7 standard benchmarks for HS detection, relying on a fine-grained and linguistically-grounded definition of implicit and subtle messages. Then, we experiment with state-of-the-art neural network architectures on two supervised tasks, namely implicit HS and subtle HS message classification. We show that while such models perform satisfactory on explicit messages, they fail to detect implicit and subtle content, highlighting the fact that HS detection is not a solved problem and deserves further investigation.
pdf
bib
abs
MTEB: Massive Text Embedding Benchmark
Niklas Muennighoff
|
Nouamane Tazi
|
Loic Magne
|
Nils Reimers
Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings todate. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-theart results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at
https://github.com/embeddings-benchmark/mteb.
pdf
bib
abs
Step by Step Loss Goes Very Far: Multi-Step Quantization for Adversarial Text Attacks
Piotr Gaiński
|
Klaudia Bałazy
We propose a novel gradient-based attack against transformer-based language models that searches for an adversarial example in a continuous space of tokens probabilities. Our algorithm mitigates the gap between adversarial loss for continuous and discrete text representations by performing multi-step quantization in a quantization-compensation loop. Experiments show that our method significantly outperforms other approaches on various natural language processing (NLP) tasks.
pdf
bib
abs
TwiRGCN: Temporally Weighted Graph Convolution for Question Answering over Temporal Knowledge Graphs
Aditya Sharma
|
Apoorv Saxena
|
Chitrank Gupta
|
Mehran Kazemi
|
Partha Talukdar
|
Soumen Chakrabarti
Recent years have witnessed interest in Temporal Question Answering over Knowledge Graphs (TKGQA), resulting in the development of multiple methods. However, these are highly engineered, thereby limiting their generalizability, and they do not automatically discover relevant parts of the KG during multi-hop reasoning. Relational graph convolutional networks (RGCN) provide an opportunity to address both of these challenges – we explore this direction in the paper. Specifically, we propose a novel, intuitive and interpretable scheme to modulate the messages passed through a KG edge during convolution based on the relevance of its associated period to the question. We also introduce a gating device to predict if the answer to a complex temporal question is likely to be a KG entity or time and use this prediction to guide our scoring mechanism. We evaluate the resulting system, which we call TwiRGCN, on a recent challenging dataset for multi-hop complex temporal QA called TimeQuestions. We show that TwiRGCN significantly outperforms state-of-the-art models on this dataset across diverse question types. Interestingly, TwiRGCN improves accuracy by 9–10 percentage points for the most difficult ordinal and implicit question types.
pdf
bib
abs
ZELDA: A Comprehensive Benchmark for Supervised Entity Disambiguation
Marcel Milich
|
Alan Akbik
Entity disambiguation (ED) is the task of disambiguating named entity mentions in text to unique entries in a knowledge base. Due to its industrial relevance, as well as current progress in leveraging pre-trained language models, a multitude of ED approaches have been proposed in recent years. However, we observe a severe lack of uniformity across experimental setups in current ED work,rendering a direct comparison of approaches based solely on reported numbers impossible: Current approaches widely differ in the data set used to train, the size of the covered entity vocabulary, and the usage of additional signals such as candidate lists. To address this issue, we present ZELDA , a novel entity disambiguation benchmark that includes a unified training data set, entity vocabulary, candidate lists, as well as challenging evaluation splits covering 8 different domains. We illustrate its design and construction, and present experiments in which we train and compare current state-of-the-art approaches on our benchmark. To encourage greater direct comparability in the entity disambiguation domain, we make our benchmark publicly available to the research community.
pdf
bib
abs
GLADIS: A General and Large Acronym Disambiguation Benchmark
Lihu Chen
|
Gael Varoquaux
|
Fabian M. Suchanek
Acronym Disambiguation (AD) is crucial for natural language understanding on various sources, including biomedical reports, scientific papers, and search engine queries. However, existing acronym disambiguationbenchmarks and tools are limited to specific domains, and the size of prior benchmarks is rather small. To accelerate the research on acronym disambiguation, we construct a new benchmark with three components: (1) a much larger acronym dictionary with 1.5M acronyms and 6.4M long forms; (2) a pre-training corpus with 160 million sentences;(3) three datasets that cover thegeneral, scientific, and biomedical domains. We then pre-train a language model, AcroBERT, on our constructed corpus for general acronym disambiguation, and show the challenges and values of our new benchmark.
pdf
bib
abs
Probing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders
Ivan Vulić
|
Goran Glavaš
|
Fangyu Liu
|
Nigel Collier
|
Edoardo Maria Ponti
|
Anna Korhonen
Pretrained multilingual language models (LMs) can be successfully transformed into multilingual sentence encoders (SEs; e.g., LaBSE, xMPNet) via additional fine-tuning or model distillation with parallel data. However, it remains unclear how to best leverage them to represent sub-sentence lexical items (i.e., words and phrases) in cross-lingual lexical tasks. In this work, we probe SEs for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs. We also devise a simple yet efficient method for exposing the cross-lingual lexical knowledge by means of additional fine-tuning through inexpensive contrastive learning that requires only a small amount of word translation pairs. Using bilingual lexical induction (BLI), cross-lingual lexical semantic similarity, and cross-lingual entity linking as lexical probing tasks, we report substantial gains on standard benchmarks (e.g., +10 Precision@1 points in BLI). The results indicate that the SEs such as LaBSE can be ‘rewired’ into effective cross-lingual lexical encoders via the contrastive learning procedure, and that it is possible to expose more cross-lingual lexical knowledge compared to using them as off-the-shelf SEs. This way, we also provide an effective tool for harnessing ‘covert’ multilingual lexical knowledge hidden in multilingual sentence encoders.
pdf
bib
abs
Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples
Philipp Sadler
|
David Schlangen
NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image _i_ and text _t_), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., “_t_ is a description of _i_, for which the content of _i_ needs to be recognised and understood”).We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the “Incremental Algorithm”),which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.
pdf
bib
abs
Mitigating Exposure Bias in Grammatical Error Correction with Data Augmentation and Reweighting
Hannan Cao
|
Wenmian Yang
|
Hwee Tou Ng
The most popular approach in grammatical error correction (GEC) is based on sequence-to-sequence (seq2seq) models. Similar to other autoregressive generation tasks, seq2seq GEC also faces the exposure bias problem, i.e., the context tokens are drawn from different distributions during training and testing, caused by the teacher forcing mechanism. In this paper, we propose a novel data manipulation approach to overcome this problem, which includes a data augmentation method during training to mimic the decoder input at inference time, and a data reweighting method to automatically balance the importance of each kind of augmented samples. Experimental results on benchmark GEC datasets show that our method achieves significant improvements compared to prior approaches.
pdf
bib
abs
Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training
Wenliang Dai
|
Zihan Liu
|
Ziwei Ji
|
Dan Su
|
Pascale Fung
Large-scale vision-language pre-trained (VLP) models are prone to hallucinate non-existent visual objects when generating text based on visual information. In this paper, we systematically study the object hallucination problem from three aspects. First, we examine recent state-of-the-art VLP models, showing that they still hallucinate frequently and models achieving better scores on standard metrics (e.g., CIDEr) could be more unfaithful. Second, we investigate how different types of image encoding in VLP influence hallucination, including region-based, grid-based, and patch-based. Surprisingly, we find that patch-based features perform the best and smaller patch resolution yields a non-trivial reduction in object hallucination. Third, we decouple various VLP objectives and demonstrate that token-level image-text alignment and controlled generation are crucial to reducing hallucination. Based on that, we propose a simple yet effective VLP loss named ObjMLM to further mitigate object hallucination. Results show that it reduces object hallucination by up to 17.4% when tested on two benchmarks (COCO Caption for in-domain and NoCaps for out-of-domain evaluation).
pdf
bib
abs
Characterizing the Entities in Harmful Memes: Who is the Hero, the Villain, the Victim?
Shivam Sharma
|
Atharva Kulkarni
|
Tharun Suresh
|
Himanshi Mathur
|
Preslav Nakov
|
Md. Shad Akhtar
|
Tanmoy Chakraborty
Memes can sway people’s opinions over social media as they combine visual and textual information in an easy-to-consume manner. Since memes instantly turn viral, it becomes crucial to infer their intent and potentially associated harmfulness to take timely measures as needed. A common problem associated with meme comprehension lies in detecting the entities referenced and characterizing the role of each of these entities. Here, we aim to understand whether the meme glorifies, vilifies, or victimizes each entity it refers to. To this end, we address the task of role identification of entities in harmful memes, i.e., detecting who is the ‘hero’, the ‘villain’, and the ‘victim’ in the meme, if any. We utilize HVVMemes – a memes dataset on US Politics and Covid-19 memes, released recently as part of the CONSTRAINT@ACL-2022 shared-task. It contains memes, entities referenced, and their associated roles: hero, villain, victim, and other. We further design VECTOR (Visual-semantic role dEteCToR), a robust multi-modal framework for the task, which integrates entity-based contextual information in the multi-modal representation and compare it to several standard unimodal (text-only or image-only) or multi-modal (image+text) models. Our experimental results show that our proposed model achieves an improvement of 4% over the best baseline and 1% over the best competing stand-alone submission from the shared-task. Besides divulging an extensive experimental setup with comparative analyses, we finally highlight the challenges encountered in addressing the complex task of semantic role labeling within memes.
pdf
bib
abs
Systematic Investigation of Strategies Tailored for Low-Resource Settings for Low-Resource Dependency Parsing
Jivnesh Sandhan
|
Laxmidhar Behera
|
Pawan Goyal
In this work, we focus on low-resource dependency parsing for multiple languages. Several strategies are tailored to enhance performance in low-resource scenarios. While these are well-known to the community, it is not trivial to select the best-performing combination of these strategies for a low-resource language that we are interested in, and not much attention has been given to measuring the efficacy of these strategies. We experiment with 5 low-resource strategies for our ensembled approach on 7 Universal Dependency (UD) low-resource languages. Our exhaustive experimentation on these languages supports the effective improvements for languages not covered in pretrained models. We show a successful application of the ensembled system on a truly low-resource language Sanskrit. The code and data are available at:
https://github.com/Jivnesh/SanDPpdf
bib
abs
Compositional Generalisation with Structured Reordering and Fertility Layers
Matthias Lindemann
|
Alexander Koller
|
Ivan Titov
Seq2seq models have been shown to struggle with compositional generalisation, i.e. generalising to new and potentially more complex structures than seen during training. Taking inspiration from grammar-based models that excel at compositional generalisation, we present a flexible end-to-end differentiable neural model that composes two structural operations: a fertility step, which we introduce in this work, and a reordering step based on previous work (Wang et al., 2021). To ensure differentiability, we use the expected value of each step, which we compute using dynamic programming. Our model outperforms seq2seq models by a wide margin on challenging compositional splits of realistic semantic parsing tasks that require generalisation to longer examples. It also compares favourably to other models targeting compositional generalisation.
pdf
bib
abs
Investigating Multi-source Active Learning for Natural Language Inference
Ard Snijders
|
Douwe Kiela
|
Katerina Margatina
In recent years, active learning has been successfully applied to an array of NLP tasks. However, prior work often assumes that training and test data are drawn from the same distribution. This is problematic, as in real-life settings data may stem from several sources of varying relevance and quality. We show that four popular active learning schemes fail to outperform random selection when applied to unlabelled pools comprised of multiple data sources on the task of natural language inference. We reveal that uncertainty-based strategies perform poorly due to the acquisition of collective outliers, i.e., hard-to-learn instances that hamper learning and generalisation. When outliers are removed, strategies are found to recover and outperform random baselines. In further analysis, we find that collective outliers vary in form between sources, and show that hard-to-learn data is not always categorically harmful. Lastly, we leverage dataset cartography to introduce difficulty-stratified testing and find that different strategies are affected differently by example learnability and difficulty.
pdf
bib
abs
Towards a Unified Multi-Domain Multilingual Named Entity Recognition Model
Mayank Kulkarni
|
Daniel Preotiuc-Pietro
|
Karthik Radhakrishnan
|
Genta Indra Winata
|
Shijie Wu
|
Lingjue Xie
|
Shaohua Yang
Named Entity Recognition is a key Natural Language Processing task whose performance is sensitive to choice of genre and language. A unified NER model across multiple genres and languages is more practical and efficient by leveraging commonalities across genres or languages. In this paper, we propose a novel setup for NER which includes multi-domain and multilingual training and evaluation across 13 domains and 4 languages. We explore a range of approaches to building a unified model using domain and language adaptation techniques. Our experiments highlight multiple nuances to consider while building a unified model, including that naive data pooling fails to obtain good performance, that domain-specific adaptations are more important than language-specific ones and that including domain-specific adaptations in a unified model nears the performance of training multiple dedicated monolingual models at a fraction of their parameter count.
pdf
bib
abs
Do Neural Topic Models Really Need Dropout? Analysis of the Effect of Dropout in Topic Modeling
Suman Adhya
|
Avishek Lahiri
|
Debarshi Kumar Sanyal
Dropout is a widely used regularization trick to resolve the overfitting issue in large feedforward neural networks trained on a small dataset, which performs poorly on the held-out test subset. Although the effectiveness of this regularization trick has been extensively studied for convolutional neural networks, there is a lack of analysis of it for unsupervised models and in particular, VAE-based neural topic models. In this paper, we have analyzed the consequences of dropout in the encoder as well as in the decoder of the VAE architecture in three widely used neural topic models, namely, contextualized topic model (CTM), ProdLDA, and embedded topic model (ETM) using four publicly available datasets. We characterize the dropout effect on these models in terms of the quality and predictive performance of the generated topics.
pdf
bib
abs
A Psycholinguistic Analysis of BERT’s Representations of Compounds
Lars Buijtelaar
|
Sandro Pezzelle
This work studies the semantic representations learned by BERT for compounds, that is, expressions such as sunlight or bodyguard. We build on recent studies that explore semantic information in Transformers at the word level and test whether BERT aligns with human semantic intuitions when dealing with expressions (e.g., sunlight) whose overall meaning depends—to a various extent—on the semantics of the constituent words (sun, light). We leverage a dataset that includes human judgments on two psycholinguistic measures of compound semantic analysis: lexeme meaning dominance (LMD; quantifying the weight of each constituent toward the compound meaning) and semantic transparency (ST; evaluating the extent to which the compound meaning is recoverable from the constituents’ semantics). We show that BERT-based measures moderately align with human intuitions, especially when using contextualized representations, and that LMD is overall more predictable than ST. Contrary to the results reported for ‘standard’ words, higher, more contextualized layers are the best at representing compound meaning. These findings shed new light on the abilities of BERT in dealing with fine-grained semantic phenomena. Moreover, they can provide insights into how speakers represent compounds.
pdf
bib
abs
Measuring Normative and Descriptive Biases in Language Models Using Census Data
Samia Touileb
|
Lilja Øvrelid
|
Erik Velldal
We investigate in this paper how distributions of occupations with respect to gender is reflected in pre-trained language models. Such distributions are not always aligned to normative ideals, nor do they necessarily reflect a descriptive assessment of reality. In this paper, we introduce an approach for measuring to what degree pre-trained language models are aligned to normative and descriptive occupational distributions. To this end, we use official demographic information about gender–occupation distributions provided by the national statistics agencies of France, Norway, United Kingdom, and the United States. We manually generate template-based sentences combining gendered pronouns and nouns with occupations, and subsequently probe a selection of ten language models covering the English, French, and Norwegian languages. The scoring system we introduce in this work is language independent, and can be used on any combination of template-based sentences, occupations, and languages. The approach could also be extended to other dimensions of national census data and other demographic variables.
pdf
bib
abs
UDAPTER - Efficient Domain Adaptation Using Adapters
Bhavitvya Malik
|
Abhinav Ramesh Kashyap
|
Min-Yen Kan
|
Soujanya Poria
We propose two methods to make unsupervised domain adaptation (UDA) more parameter efficient using adapters – small bottleneck layers interspersed with every layer of the large-scale pre-trained language model (PLM). The first method deconstructs UDA into a two-step process: first by adding a domain adapter to learn domain-invariant information and then by adding a task adapter that uses domain-invariant information to learn task representations in the source domain. The second method jointly learns a supervised classifier while reducing the divergence measure. Compared to strong baselines, our simple methods perform well in natural language inference (MNLI) and the cross-domain sentiment classification task. We even outperform unsupervised domain adaptation methods such as DANN and DSN in sentiment classification, and we are within 0.85% F1 for natural language inference task, by fine-tuning only a fraction of the full model parameters. We release our code at this URL.
pdf
bib
abs
Efficient CTC Regularization via Coarse Labels for End-to-End Speech Translation
Biao Zhang
|
Barry Haddow
|
Rico Sennrich
For end-to-end speech translation, regularizing the encoder with the Connectionist Temporal Classification (CTC) objective using the source transcript or target translation as labels can greatly improve quality. However, CTC demands an extra prediction layer over the vocabulary space, bringing in non-negligible model parameters and computational overheads, although this layer becomes useless at inference. In this paper, we re-examine the need for genuine vocabulary labels for CTC for regularization and explore strategies to reduce the CTC label space, targeting improved efficiency without quality degradation. We propose coarse labeling for CTC (CoLaCTC), which merges vocabulary labels via simple heuristic rules, such as using truncation, division or modulo (MOD) operations. Despite its simplicity, our experiments on 4 source and 8 target languages show that CoLaCTC with MOD particularly can compress the label space aggressively to 256 and even further, gaining training efficiency (1.18× ∼ 1.77× speedup depending on the original vocabulary size) yet still delivering comparable or better performance than the CTC baseline. We also show that CoLaCTC successfully generalizes to CTC regularization regardless of using transcript or translation for labeling.
pdf
bib
abs
Exploring Category Structure with Contextual Language Models and Lexical Semantic Networks
Joseph Renner
|
Pascal Denis
|
Remi Gilleron
|
Angèle Brunellière
The psychological plausibility of word embeddings has been studied through different tasks such as word similarity, semantic priming, and lexical entailment. Recent work on predicting category structure with word embeddings report low correlations with human ratings. (Heyman and Heyman, 2019) showed that static word embeddings fail at predicting typicality using cosine similarity between category and exemplar words, while (Misra et al., 2021)obtain equally modest results for various contextual language models (CLMs) using a Cloze task formulation over hand-crafted taxonomic sentences. In this work, we test a wider array of methods for probing CLMs for predicting typicality scores. Our experiments, using BERT (Devlin et al., 2018), show the importance of using the right type of CLM probes, as our best BERT-based typicality prediction methods improve on previous works. Second, our results highlight the importance of polysemy in this task, as our best results are obtained when contextualization is paired with a disambiguation mechanism as in (Chronis and Erk, 2020). Finally, additional experiments and analyses reveal that Information Content-based WordNet (Miller, 1995) similarities with disambiguation match the performance of the best BERT-based method, and in fact capture complementary information, and when combined with BERT allow for enhanced typicality predictions.
pdf
bib
abs
An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters
Asma Ben Abacha
|
Wen-wai Yim
|
Yadan Fan
|
Thomas Lin
Medical doctors spend on average 52 to 102 minutes per day writing clinical notes from their patient encounters (Hripcsak et al., 2011). Reducing this workload calls for relevant and efficient summarization methods. In this paper, we introduce new resources and empirical investigations for the automatic summarization of doctor-patient conversations in a clinical setting. In particular, we introduce the MTS-Dialog dataset; a new collection of 1,700 doctor-patient dialogues and corresponding clinical notes. We use this new dataset to investigate the feasibility of this task and the relevance of existing language models, data augmentation, and guided summarization techniques. We compare standard evaluation metrics based on n-gram matching, contextual embeddings, and Fact Extraction to assess the accuracy and the factual consistency of the generated summaries. To ground these results, we perform an expert-based evaluation using relevant natural language generation criteria and task-specific criteria such as critical omissions, and study the correlation between the automatic metrics and expert judgments. To the best of our knowledge, this study is the first attempt to introduce an open dataset of doctor-patient conversations and clinical notes, with detailed automated and manual evaluations of clinical note generation.
pdf
bib
abs
Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset
Brielen Madureira
|
David Schlangen
In visual instruction-following dialogue games, players can engage in repair mechanisms in face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world. In this work, we annotate Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game. We show that it contains lexically and semantically diverse iCRs being produced self-motivatedly by players deciding to clarify in order to solve the task successfully. With 8.8k iCRs found in 9.9k dialogues, CoDraw-iCR (v1) is a large spontaneous iCR corpus, making it a valuable resource for data-driven research on clarification in dialogue. We then formalise and provide baseline models for two tasks: Determining when to make an iCR and how to recognise them, in order to investigate to what extent these tasks are learnable from data.
pdf
bib
abs
Can Synthetic Text Help Clinical Named Entity Recognition? A Study of Electronic Health Records in French
Nicolas Hiebel
|
Olivier Ferret
|
Karen Fort
|
Aurélie Névéol
In sensitive domains, the sharing of corpora is restricted due to confidentiality, copyrights or trade secrets. Automatic text generation can help alleviate these issues by producing synthetic texts that mimic the linguistic properties of real documents while preserving confidentiality. In this study, we assess the usability of synthetic corpus as a substitute training corpus for clinical information extraction. Our goal is to automatically produce a clinical case corpus annotated with clinical entities and to evaluate it for a named entity recognition (NER) task. We use two auto-regressive neural models partially or fully trained on generic French texts and fine-tuned on clinical cases to produce a corpus of synthetic clinical cases. We study variants of the generation process: (i) fine-tuning on annotated vs. plain text (in that case, annotations are obtained a posteriori) and (ii) selection of generated texts based on models parameters and filtering criteria. We then train NER models with the resulting synthetic text and evaluate them on a gold standard clinical corpus. Our experiments suggest that synthetic text is useful for clinical NER.
pdf
bib
abs
IRMA: the 335-million-word Italian coRpus for studying MisinformAtion
Fabio Carrella
|
Alessandro Miani
|
Stephan Lewandowsky
The dissemination of false information on the internet has received considerable attention over the last decade. Misinformation often spreads faster than mainstream news, thus making manual fact checking inefficient or, at best, labor-intensive. Therefore, there is an increasing need to develop methods for automatic detection of misinformation. Although resources for creating such methods are available in English, other languages are often under-represented in this effort. With this contribution, we present IRMA, a corpus containing over 600,000 Italian news articles (335+ million tokens) collected from 56 websites classified as ‘untrustworthy’ by professional fact-checkers. The corpus is freely available and comprises a rich set of text- and website-level data, representing a turnkey resource to test hypotheses and develop automatic detection algorithms. It contains texts, titles, and dates (from 2004 to 2022), along with three types of semantic measures (i.e., keywords, topics at three different resolutions, and LIWC lexical features). IRMA also includes domain-specific information such as source type (e.g., political, health, conspiracy, etc.), credibility, and higher-level metadata, including several metrics of website incoming traffic that allow to investigate user online behavior. IRMA constitutes the largest corpus of misinformation available today in Italian, making it a valid tool for advancing quantitative research on untrustworthy news detection and ultimately helping limit the spread of misinformation.
pdf
bib
abs
Parameter-Efficient Korean Character-Level Language Modeling
Marco Cognetta
|
Sangwhan Moon
|
Lawrence Wolf-sonkin
|
Naoaki Okazaki
Character-level language modeling has been shown empirically to perform well on highly agglutinative or morphologically rich languages while using only a small fraction of the parameters required by (sub)word models. Korean fits nicely into this framework, except that, like other CJK languages, it has a very large character vocabulary of 11,172 unique syllables. However, unlike Japanese Kanji and Chinese Hanzi, each Korean syllable can be uniquely factored into a small set of subcharacters, called jamo. We explore a “three-hot” scheme, where we exploit the decomposability of Korean characters to model at the syllable level but using only jamo-level representations. We find that our three-hot embedding and decoding scheme alleviates the two major issues with prior syllable- and jamo-level models. Namely, it requires fewer than 1% of the embedding parameters of a syllable model, and it does not require tripling the sequence length, as with jamo models. In addition, it addresses a theoretical flaw in a prior three-hot modeling scheme. Our experiments show that, even when reducing the number of embedding parameters by 99.6% (from 11.4M to just 36k), our model suffers no loss in translation quality compared to the baseline syllable model.
pdf
bib
abs
Opportunities and Challenges in Neural Dialog Tutoring
Jakub Macina
|
Nico Daheim
|
Lingzhi Wang
|
Tanmay Sinha
|
Manu Kapur
|
Iryna Gurevych
|
Mrinmaya Sachan
Designing dialog tutors has been challenging as it involves modeling the diverse and complex pedagogical strategies employed by human tutors. Although there have been significant recent advances in neural conversational systems using large language models and growth in available dialog corpora, dialog tutoring has largely remained unaffected by these advances. In this paper, we rigorously analyze various generative language models on two dialog tutoring datasets for language learning using automatic and human evaluations to understand the new opportunities brought by these advances as well as the challenges we must overcome to build models that would be usable in real educational settings. We find that although current approaches can model tutoring in constrained learning scenarios when the number of concepts to be taught and possible teacher strategies are small, they perform poorly in less constrained scenarios. Our human quality evaluation shows that both models and ground-truth annotations exhibit low performance in terms of equitable tutoring, which measures learning opportunities for students and how engaging the dialog is. To understand the behavior of our models in a real tutoring setting, we conduct a user study using expert annotators and find a significantly large number of model reasoning errors in 45% of conversations. Finally, we connect our findings to outline future work.
pdf
bib
abs
Evaluating the Robustness of Discrete Prompts
Yoichi Ishibashi
|
Danushka Bollegala
|
Katsuhito Sudoh
|
Satoshi Nakamura
Discrete prompts have been used for fine-tuning Pre-trained Language Models for diverse NLP tasks. In particular, automatic methods that generate discrete prompts from a small set of training instances have reported superior performance. However, a closer look at the learnt prompts reveals that they contain noisy and counter-intuitive lexical constructs that would not be encountered in manually-written prompts. This raises an important yet understudied question regarding the robustness of automatically learnt discrete prompts when used in downstream tasks. To address this question, we conduct a systematic study of the robustness of discrete prompts by applying carefully designed perturbations into an application using AutoPrompt and then measure their performance in two Natural Language Inference (NLI) datasets. Our experimental results show that although the discrete prompt-based method remains relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens. Moreover, they generalize poorly across different NLI datasets. We hope our findings will inspire future work on robust discrete prompt learning.
pdf
bib
abs
Assessing Out-of-Domain Language Model Performance from Few Examples
Prasann Singhal
|
Jarad Forristal
|
Xi Ye
|
Greg Durrett
While pretrained language models have exhibited impressive generalization capabilities, they still behave unpredictably under certain domain shifts. In particular, a model may learn a reasoning process on in-domain training data that does not hold for out-of-domain test data. We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion: given a few target-domain examples and a set of models with similar training performance, can we understand how these models will perform on OOD test data? We benchmark the performance on this task when looking at model accuracy on the few-shot examples, then investigate how to incorporate analysis of the models’ behavior using feature attributions to better tackle this problem. Specifically, we explore a set of factors designed to reveal model agreement with certain pathological heuristics that may indicate worse generalization capabilities. On textual entailment, paraphrase recognition, and a synthetic classification task, we show that attribution-based factors can help rank relative model OOD performance. However, accuracy on a few-shot test set is a surprisingly strong baseline, particularly when the system designer does not have in-depth prior knowledge about the domain shift.
pdf
bib
abs
Mind the Labels: Describing Relations in Knowledge Graphs With Pretrained Models
Zdeněk Kasner
|
Ioannis Konstas
|
Ondrej Dusek
Pretrained language models (PLMs) for data-to-text (D2T) generation can use human-readable data labels such as column headings, keys, or relation names to generalize to out-of-domain examples. However, the models are well-known in producing semantically inaccurate outputs if these labels are ambiguous or incomplete, which is often the case in D2T datasets. In this paper, we expose this issue on the task of descibing a relation between two entities. For our experiments, we collect a novel dataset for verbalizing a diverse set of 1,522 unique relations from three large-scale knowledge graphs (Wikidata, DBPedia, YAGO). We find that although PLMs for D2T generation expectedly fail on unclear cases, models trained with a large variety of relation labels are surprisingly robust in verbalizing novel, unseen relations. We argue that using data with a diverse set of clear and meaningful labels is key to training D2T generation systems capable of generalizing to novel domains.
pdf
bib
abs
Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers
William Held
|
Diyi Yang
Multilingual transformer-based models demonstrate remarkable zero and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades. Those who attribute this interference phenomenon to limited model capacity address the problem by adding additional parameters, despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific attention heads. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that pruning such heads from a fixed model improves performance for a target language on both sentence classification and structural prediction. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.
pdf
bib
abs
Why Don’t You Do It Right? Analysing Annotators’ Disagreement in Subjective Tasks
Marta Sandri
|
Elisa Leonardelli
|
Sara Tonelli
|
Elisabetta Jezek
Annotators’ disagreement in linguistic data has been recently the focus of multiple initiatives aimed at raising awareness on issues related to ‘majority voting’ when aggregating diverging annotations. Disagreement can indeed reflect different aspects of linguistic annotation, from annotators’ subjectivity to sloppiness or lack of enough context to interpret a text. In this work we first propose a taxonomy of possible reasons leading to annotators’ disagreement in subjective tasks. Then, we manually label part of a Twitter dataset for offensive language detection in English following this taxonomy, identifying how the different categories are distributed. Finally we run a set of experiments aimed at assessing the impact of the different types of disagreement on classification performance. In particular, we investigate how accurately tweets belonging to different categories of disagreement can be classified as offensive or not, and how injecting data with different types of disagreement in the training set affects performance. We also perform offensive language detection as a multi-task framework, using disagreement classification as an auxiliary task.
pdf
bib
abs
Analyzing Challenges in Neural Machine Translation for Software Localization
Sai Koneru
|
Matthias Huck
|
Miriam Exel
|
Jan Niehues
Advancements in Neural Machine Translation (NMT) greatly benefit the software localization industry by decreasing the post-editing time of human annotators. Although the volume of the software being localized is growing significantly, techniques for improving NMT for user interface (UI) texts are lacking. These UI texts have different properties than other collections of texts, presenting unique challenges for NMT. For example, they are often very short, causing them to be ambiguous and needing additional context (button, title text, a table item, etc.) for disambiguation. However, no such UI data sets are readily available with contextual information for NMT models to exploit. This work aims to provide a first step in improving UI translations and highlight its challenges. To achieve this, we provide a novel multilingual UI corpus collection (∼ 1.3M for English ↔ German) with a targeted test set and analyze the limitations of state-of-the-art methods on this challenging task. Specifically, we present a targeted test set for disambiguation from English to German to evaluate reliably and emphasize UI translation challenges. Furthermore, we evaluate several state-of-the-art NMT techniques from domain adaptation and document-level NMT on this challenging task. All the scripts to replicate the experiments and data sets are available here.ˆ,
pdf
bib
abs
Bootstrapping Multilingual Semantic Parsers using Large Language Models
Abhijeet Awasthi
|
Nitish Gupta
|
Bidisha Samanta
|
Shachi Dave
|
Sunita Sarawagi
|
Partha Talukdar
Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains to be a key mechanism for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, translation services may continue to be brittle due to domain mismatch between task-specific input text and general-purpose text used for training translation models. For multilingual semantic parsing, we demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. Through extensive comparisons on two public datasets, MTOP and MASSIVE, spanning 50 languages and several domains, we show that our method of translating data using LLMs outperforms a strong translate-train baseline on 41 out of 50 languages. We study the key design choices that enable more effective multilingual data translation via prompted LLMs.
pdf
bib
abs
Modeling Complex Event Scenarios via Simple Entity-focused Questions
Mahnaz Koupaee
|
Greg Durrett
|
Nathanael Chambers
|
Niranjan Balasubramanian
Event scenarios are often complex and involve multiple event sequences connected through different entity participants. Exploring such complex scenarios requires an ability to branch through different sequences, something that is difficult to achieve with standard event language modeling. To address this, we propose a question-guided generation framework that models events in complex scenarios as answers to questions about participants. At any step in the generation process, the framework uses the previously-generated events as context, but generates the next event as an answer to one of three questions: what else a participant did, what else happened to a participant, or what else happened. The participants and the questions themselves can be sampled or be provided as input from a user, allowing for controllable exploration. Our empirical evaluation shows that this question-guided generation provides better coverage of participants, diverse events within a domain, comparable perplexities for modeling event sequences, and more effective control for interactive schema generation.
pdf
bib
abs
Uncovering Implicit Inferences for Improved Relational Argument Mining
Ameer Saadat-Yazdi
|
Jeff Z. Pan
|
Nadin Kokciyan
Argument mining seeks to extract arguments and their structure from unstructured texts. Identifying relations between arguments (such as attack, support, and neutral) is a challenging task because two arguments may be related to each other via implicit inferences. This task often requires external commonsense knowledge to discover how one argument relates to another. State-of-the-art methods, however, rely on pre-defined knowledge graphs, and thus might not cover target argument pairs well. We introduce a new generative neuro-symbolic approach to finding inference chains that connect the argument pairs by making use of the Commonsense Transformer (COMET). We evaluate our approach on three datasets for both the two-label (attack/support) and three-label (attack/support/neutral) tasks. Our approach significantly outperforms the state-of-the-art, by 2-5% in F1 score, on all three datasets.
pdf
bib
abs
How people talk about each other: Modeling Generalized Intergroup Bias and Emotion
Venkata Subrahmanyan Govindarajan
|
Katherine Atwell
|
Barea Sinno
|
Malihe Alikhani
|
David I. Beaver
|
Junyi Jessy Li
Current studies of bias in NLP rely mainly on identifying (unwanted or negative) bias towards a specific demographic group. While this has led to progress recognizing and mitigating negative bias, and having a clear notion of the targeted group is necessary, it is not always practical. In this work we extrapolate to a broader notion of bias, rooted in social science and psychology literature. We move towards predicting interpersonal group relationship (IGR) - modeling the relationship between the speaker and the target in an utterance - using fine-grained interpersonal emotions as an anchor. We build and release a dataset of English tweets by US Congress members annotated for interpersonal emotion - the first of its kind, and ‘found supervision’ for IGR labels; our analyses show that subtle emotional signals are indicative of different biases. While humans can perform better than chance at identifying IGR given an utterance, we show that neural models perform much better; furthermore, a shared encoding between IGR and interpersonal perceived emotion enabled performance gains in both tasks.
pdf
bib
abs
Semantic Parsing for Conversational Question Answering over Knowledge Graphs
Laura Perez-Beltrachini
|
Parag Jain
|
Emilio Monti
|
Mirella Lapata
In this paper, we are interested in developing semantic parsers which understand natural language questions embedded in a conversation with a user and ground them to formal queries over definitions in a general purpose knowledge graph (KG) with very large vocabularies (covering thousands of concept names and relations, and millions of entities). To this end, we develop a dataset where user questions are annotated with Sparql parses and system answers correspond to execution results thereof. We present two different semantic parsing approaches and highlight the challenges of the task: dealing with large vocabularies, modelling conversation context, predicting queries with multiple entities, and generalising to new questions at test time. We hope our dataset will serve as useful testbed for the development of conversational semantic parsers. Our dataset and models are released at
https://github.com/EdinburghNLP/SPICE.
pdf
bib
abs
MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Oscar Mañas
|
Pau Rodriguez Lopez
|
Saba Ahmadi
|
Aida Nematzadeh
|
Yash Goyal
|
Aishwarya Agrawal
Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL’s modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at
https://github.com/oscmansan/mapl.
pdf
bib
abs
ComSearch: Equation Searching with Combinatorial Strategy for Solving Math Word Problems with Weak Supervision
Qianying Liu
|
Wenyu Guan
|
Jianhao Shen
|
Fei Cheng
|
Sadao Kurohashi
Previous studies have introduced a weakly-supervised paradigm for solving math word problems requiring only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search among a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm with combinatorial strategy ComSearch, which can compress the search space by excluding mathematically equivalent equations. The compression allows the searching algorithm to enumerate all possible equations and obtain high-quality data. We investigate the noise in the pseudo labels that hold wrong mathematical logic, which we refer to as the false-matching problem, and propose a ranking model to denoise the pseudo labels. Our approach holds a flexible framework to utilize two existing supervised math word problem solvers to train pseudo labels, and both achieve state-of-the-art performance in the weak supervision task.
pdf
bib
abs
Towards preserving word order importance through Forced Invalidation
Hadeel Al-Negheimish
|
Pranava Madhyastha
|
Alessandra Russo
Large pre-trained language models such as BERT have been widely used as a framework for natural language understanding (NLU) tasks. However, recent findings have revealed that pre-trained language models are insensitive to word order. The performance on NLU tasks remains unchanged even after randomly permuting the word of a sentence, where crucial syntactic information is destroyed. To help preserve the importance of word order, we propose a simple approach called Forced Invalidation (FI): forcing the model to identify permuted sequences as invalid samples. We perform an extensive evaluation of our approach on various English NLU and QA based tasks over BERT-based and attention-based models over word embeddings. Our experiments demonstrate that FI significantly improves the sensitivity of the models to word order.
pdf
bib
abs
How Many and Which Training Points Would Need to be Removed to Flip this Prediction?
Jinghan Yang
|
Sarthak Jain
|
Byron C. Wallace
We consider the problem of identifying a minimal subset of training data 𝒮t such that if the instances comprising 𝒮t had been removed prior to training, the categorization of a given test point xt would have been different.Identifying such a set may be of interest for a few reasons.First, the cardinality of 𝒮t provides a measure of robustness (if |𝒮t| is small for xt, we might be less confident in the corresponding prediction), which we show is correlated with but complementary to predicted probabilities.Second, interrogation of 𝒮t may provide a novel mechanism for contesting a particular model prediction: If one can make the case that the points in 𝒮t are wrongly labeled or irrelevant, this may argue for overturning the associated prediction. Identifying 𝒮t via brute-force is intractable.We propose comparatively fast approximation methods to find 𝒮t based on influence functions, and find that—for simple convex text classification models—these approaches can often successfully identify relatively small sets of training examples which, if removed, would flip the prediction.
pdf
bib
abs
Reinforced Sequence Training based Subjective Bias Correction
Karthic Madanagopal
|
James Caverlee
Subjective bias is ubiquitous on news sites, social media, and knowledge resources like Wikipedia. Many existing methods for subjective bias correction have typically focused on making one-word edits and have been trained over a single (often, noisy) domain. In contrast, we propose a novel reinforced sequence training approach for robust subjective bias correction. Three of the unique characteristics of the approach are: (i) it balances bias neutralization with fluency and semantics preservation through reinforcement learning, to broaden the scope to bias beyond a single word; (ii) it is cross-trained over multiple sources of bias to be more robust to new styles of biased writing that are not seen in the training data for a single domain; and (iii) it is used to fine-tune a large pre-trained transformer model to yield state-of-the-art performance in bias text correction task. Extensive experiments show that the proposed approach results in significant improvements in subjective bias correction versus alternatives.
pdf
bib
abs
Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
John E. Miller
|
Johann-Mattis List
Language contact is a pervasive phenomenon reflected in the borrowing of words from donor to recipient languages. Most computational approaches to borrowing detection treat all languages under study as equally important, even though dominant languages have a stronger impact on heritage languages than vice versa. We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role, applying two classical sequence comparison methods and one machine learning method to a sample of seven Latin American languages which have all borrowed extensively from Spanish. All systems perform well, with the supervised machine learning system outperforming the classical systems. A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.
pdf
bib
abs
Towards Integration of Discriminability and Robustness for Document-Level Relation Extraction
Jia Guo
|
Stanley Kok
|
Lidong Bing
Document-level relation extraction (DocRE) predicts relations for entity pairs that rely on long-range context-dependent reasoning in a document. As a typical multi-label classification problem, DocRE faces the challenge of effectively distinguishing a small set of positive relations from the majority of negative ones. This challenge becomes even more difficult to overcome when there exists a significant number of annotation errors in the dataset. In this work, we aim to achieve better integration of both the discriminability and robustness for the DocRE problem. Specifically, we first design an effective loss function to endow high discriminability to both probabilistic outputs and internal representations. We innovatively customize entropy minimization and supervised contrastive learning for the challenging multi-label and long-tailed learning problems. To ameliorate the impact of label errors, we equipped our method with a novel negative label sampling strategy to strengthen the model robustness. In addition, we introduce two new data regimes to mimic more realistic scenarios with annotation errors and evaluate our sampling strategy. Experimental results verify the effectiveness of each component and show that our method achieves new state-of-the-art results on the DocRED dataset, its recently cleaned version, Re-DocRED, and the proposed data regimes.
pdf
bib
abs
Penguins Don’t Fly: Reasoning about Generics through Instantiations and Exceptions
Emily Allaway
|
Jena D. Hwang
|
Chandra Bhagavatula
|
Kathleen McKeown
|
Doug Downey
|
Yejin Choi
Generics express generalizations about the world (e.g., birds can fly) that are not universally true (e.g., newborn birds and penguins cannot fly). Commonsense knowledge bases, used extensively in NLP, encode some generic knowledge but rarely enumerate such exceptions and knowing when a generic statement holds or does not hold true is crucial for developing a comprehensive understanding of generics. We present a novel framework informed by linguistic theory to generate exemplars—specific cases when a generic holds true or false. We generate ~19k exemplars for ~650 generics and show that our framework outperforms a strong GPT-3 baseline by 12.8 precision points. Our analysis highlights the importance of linguistic theory-based controllability for generating exemplars, the insufficiency of knowledge bases as a source of exemplars, and the challenges exemplars pose for the task of natural language inference.
pdf
bib
abs
Adding Instructions during Pretraining: Effective way of Controlling Toxicity in Language Models
Shrimai Prabhumoye
|
Mostofa Patwary
|
Mohammad Shoeybi
|
Bryan Catanzaro
Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real world applications is challenging because they generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. Our two strategies are: (1) MEDA: adds raw toxicity score as meta-data to the pretraining samples, and (2) INST: adds instructions to those samples indicating their toxicity. Our results indicate that our best performing strategy (INST) substantially reduces the toxicity probability up to 61% while preserving the accuracy on five benchmark NLP tasks as well as improving AUC scores on four bias detection tasks by 1.3%. We also demonstrate the generalizability of our techniques by scaling the number of training samples and the number of model parameters.
pdf
bib
abs
Multi2Claim: Generating Scientific Claims from Multi-Choice Questions for Scientific Fact-Checking
Neset Tan
|
Trung Nguyen
|
Josh Bensemann
|
Alex Peng
|
Qiming Bao
|
Yang Chen
|
Mark Gahegan
|
Michael Witbrock
Training machine learning models to successfully perform scientific fact-checking tasks is challenging due to the expertise bottleneck that limits the availability of appropriate training datasets. In this task, models use textual evidence to confirm scientific claims, which requires data that contains extensive domain-expert annotation. Consequently, the number of existing scientific-fact-checking datasets and the sizes of those datasets are limited. However, these limitations do not apply to multiple-choice question datasets because of the necessity of domain exams in the modern education system. As one of the first steps towards addressing the fact-checking dataset scarcity problem in scientific domains, we propose a pipeline for automatically converting multiple-choice questions into fact-checking data, which we call Multi2Claim. By applying the proposed pipeline, we generated two large-scale datasets for scientific-fact-checking tasks: Med-Fact and Gsci-Fact for the medical and general science domains, respectively. These two datasets are among the first examples of large-scale scientific-fact-checking datasets. We developed baseline models for the verdict prediction task using each dataset. Additionally, we demonstrated that the datasets could be used to improve performance with respect to the F 1 weighted metric on existing fact-checking datasets such as SciFact, HEALTHVER, COVID-Fact, and CLIMATE-FEVER. In some cases, the improvement in performance was up to a 26% increase.
pdf
bib
abs
On Evaluation of Document Classification with RVL-CDIP
Stefan Larson
|
Gordon Lim
|
Kevin Leach
The RVL-CDIP benchmark is widely used for measuring performance on the task of document classification. Despite its widespread use, we reveal several undesirable characteristics of the RVL-CDIP benchmark. These include (1) substantial amounts of label noise, which we estimate to be 8.1% (ranging between 1.6% to 16.9% per document category); (2) presence of many ambiguous or multi-label documents; (3) a large overlap between test and train splits, which can inflate model performance metrics; and (4) presence of sensitive personally-identifiable information like US Social Security numbers (SSNs). We argue that there is a risk in using RVL-CDIP for benchmarking document classifiers, as its limited scope, presence of errors (state-of-the-art models now achieve accuracy error rates that are within our estimated label error rate), and lack of diversity make it less than ideal for benchmarking. We further advocate for the creation of a new document classification benchmark, and provide recommendations for what characteristics such a resource should include.
pdf
bib
abs
Event Linking: Grounding Event Mentions to Wikipedia
Xiaodong Yu
|
Wenpeng Yin
|
Nitish Gupta
|
Dan Roth
Comprehending an article requires understanding its constituent events. However, the context where an event is mentioned often lacks the details of this event. A question arises: how can the reader obtain more knowledge about this particular event in addition to what is provided by the local context in the article? This work defines Event Linking, a new natural language understanding task at the event level. Event linking tries to link an event mention appearing in an article to the most appropriate Wikipedia page. This page is expected to provide rich knowledge about what the event mention refers to. To standardize the research in this new direction, we contribute in four-fold. First, this is the first work in the community that formally defines the Event Linking task. Second, we collect a dataset for this new task. Specifically, we automatically gather the training set from Wikipedia, and then create two evaluation sets: one from the Wikipedia domain, reporting the in-domain performance, and a second from the real-world news domain, to evaluate out-of-domain performance. Third, we retrain and evaluate two state-of-the-art (SOTA) entity linking models, showing the challenges of event linking, and we propose an event-specific linking system, EVELINK, to set a competitive result for the new task. Fourth, we conduct a detailed and insightful analysis to help understand the task and the limitations of the current model. Overall, as our analysis shows, Event Linking is a challenging and essential task requiring more effort from the community.
pdf
bib
abs
SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains
Koustava Goswami
|
Lukas Lange
|
Jun Araki
|
Heike Adel
Prompting pre-trained language models leads to promising results across natural language processing tasks but is less effective when applied in low-resource domains, due to the domain gap between the pre-training data and the downstream task. In this work, we bridge this gap with a novel and lightweight prompting methodology called SwitchPrompt for the adaptation of language models trained on datasets from the general domain to diverse low-resource domains. Using domain-specific keywords with a trainable gated prompt, SwitchPrompt offers domain-oriented prompting, that is, effective guidance on the target domains for general-domain language models. Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt. They often even outperform their domain-specific counterparts trained with baseline state-of-the-art prompting methods by up to 10.7% performance increase in accuracy. This result indicates that SwitchPrompt effectively reduces the need for domain-specific language model pre-training.
pdf
bib
abs
Do dialogue representations align with perception? An empirical study
Sarenne Wallbridge
|
Peter Bell
|
Catherine Lai
There has been a surge of interest regarding the alignment of large-scale language models with human language comprehension behaviour. The majority of this research investigates comprehension behaviours from reading isolated, written sentences. We propose studying the perception of dialogue, focusing on an intrinsic form of language use: spoken conversations. Using the task of predicting upcoming dialogue turns, we ask whether turn plausibility scores produced by state-of-the-art language models correlate with human judgements. We find a strong correlation for some but not all models: masked language models produce stronger correlations than auto-regressive models. In doing so, we quantify human performance on the response selection task for open-domain spoken conversation. To the best of our knowledge, this is the first such quantification. We find that response selection performance can be used as a coarse proxy for the strength of correlation with human judgements, however humans and models make different response selection mistakes. The model which produces the strongest correlation also outperforms human response selection performance. Through ablation studies, we show that pre-trained language models provide a useful basis for turn representations; however, fine-grained contextualisation, inclusion of dialogue structure information, and fine-tuning towards response selection all boost response selection accuracy by over 30 absolute points.
pdf
bib
abs
Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models
Peter Hase
|
Mona Diab
|
Asli Celikyilmaz
|
Xian Li
|
Zornitsa Kozareva
|
Veselin Stoyanov
|
Mohit Bansal
|
Srinivasan Iyer
Language models can memorize a considerable amount of factual information during pretraining that can be elicited through prompting or finetuning models on tasks like question answering. In this paper, we discuss approaches to measuring model factual beliefs, updating incorrect factual beliefs in models, and visualizing graphical relationships between factual beliefs. Our main contributions include: (1) new metrics for evaluating belief-updating methods focusing on the logical consistency of beliefs, (2) a training objective for Sequential, Local, and Generalizing updates (SLAG) that improves the performance of existing hypernetwork approaches, and (3) the introduction of the belief graph, a new form of visualization for language models that shows relationships between stored model beliefs. Our experiments suggest that models show only limited consistency between factual beliefs, but update methods can both fix incorrect model beliefs and greatly improve their consistency. Although off-the-shelf optimizers are surprisingly strong belief-updating baselines, our learned optimizers can outperform them in more difficult settings than have been considered in past work.
pdf
bib
abs
Improving Sign Recognition with Phonology
Lee Kezar
|
Jesse Thomason
|
Zed Sehyr
We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.
pdf
bib
abs
Parameter-efficient Modularised Bias Mitigation via AdapterFusion
Deepak Kumar
|
Oleg Lesota
|
George Zerveas
|
Daniel Cohen
|
Carsten Eickhoff
|
Markus Schedl
|
Navid Rekabsaz
Large pre-trained language models contain societal biases and carry along these biases to downstream tasks. Current in-processing bias mitigation approaches (like adversarial training) impose debiasing by updating a model’s parameters, effectively transferring the model to a new, irreversible debiased state. In this work, we propose a novel approach to develop stand-alone debiasing functionalities separate from the model, which can be integrated into the model on-demand, while keeping the core model untouched. Drawing from the concept of AdapterFusion in multi-task learning, we introduce DAM (Debiasing with Adapter Modules) – a debiasing approach to first encapsulate arbitrary bias mitigation functionalities into separate adapters, and then add them to the model on-demand in order to deliver fairness qualities. We conduct a large set of experiments on three classification tasks with gender, race, and age as protected attributes. Our results show that DAM improves or maintains the effectiveness of bias mitigation, avoids catastrophic forgetting in a multi-attribute scenario, and maintains on-par task performance, while granting parameter-efficiency and easy switching between the original and debiased models.
pdf
bib
abs
LingMess: Linguistically Informed Multi Expert Scorers for Coreference Resolution
Shon Otmazgin
|
Arie Cattan
|
Yoav Goldberg
Current state-of-the-art coreference systems are based on a single pairwise scoring component, which assigns to each pair of mention spans a score reflecting their tendency to corefer to each other. We observe that different kinds of mention pairs require different information sources to assess their score. We present LingMess, a linguistically motivated categorization of mention-pairs into 6 types of coreference decisions and learn a dedicated trainable scoring function for each category. This significantly improves the accuracy of the pairwise scorer as well as of the overall coreference performance on the English Ontonotes coreference corpus and 5 additional datasets.
pdf
bib
abs
Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks
Antoine Louis
|
Gijs van Dijck
|
Gerasimos Spanakis
Statutory article retrieval (SAR), the task of retrieving statute law articles relevant to a legal question, is a promising application of legal text processing. In particular, high-quality SAR systems can improve the work efficiency of legal professionals and provide basic legal assistance to citizens in need at no cost. Unlike traditional ad-hoc information retrieval, where each document is considered a complete source of information, SAR deals with texts whose full sense depends on complementary information from the topological organization of statute law. While existing works ignore these domain-specific dependencies, we propose a novel graph-augmented dense statute retriever (G-DSR) model that incorporates the structure of legislation via a graph neural network to improve dense retrieval performance. Experimental results show that our approach outperforms strong retrieval baselines on a real-world expert-annotated SAR dataset.
pdf
bib
abs
Behavior Cloned Transformers are Neurosymbolic Reasoners
Ruoyao Wang
|
Peter Jansen
|
Marc-Alexandre Côté
|
Prithviraj Ammanabrolu
In this work, we explore techniques for augmenting interactive agents with information from symbolic modules, much like humans use tools like calculators and GPS systems to assist with arithmetic and navigation. We test our agent’s abilities in text games – challenging benchmarks for evaluating the multi-step reasoning abilities of game agents in grounded, language-based environments. Our experimental study indicates that injecting the actions from these symbolic modules into the action space of a behavior cloned transformer agent increases performance on four text game benchmarks that test arithmetic, navigation, sorting, and common sense reasoning by an average of 22%, allowing an agent to reach the highest possible performance on unseen games. This action injection technique is easily extended to new agents, environments, and symbolic modules.
pdf
bib
abs
Bridging the Gap Between BabelNet and HowNet: Unsupervised Sense Alignment and Sememe Prediction
Xiang Zhang
|
Ning Shi
|
Bradley Hauer
|
Grzegorz Kondrak
As the minimum semantic units of natural languages, sememes can provide precise representations of concepts. Despite the widespread utilization of lexical resources for semantic tasks, use of sememes is limited by a lack of available sememe knowledge bases. Recent efforts have been made to connect BabelNet with HowNet by automating sememe prediction. However, these methods depend on large manually annotated datasets. We propose to use sense alignment via a novel unsupervised and explainable method. Our method consists of four stages, each relaxing predefined constraints until a complete alignment of BabelNet synsets to HowNet senses is achieved. Experimental results demonstrate the superiority of our unsupervised method over previous supervised ones by an improvement of 12% overall F1 score, setting a new state of the art. Our work is grounded in an interpretable propagation of sememe information between lexical resources, and may benefit downstream applications which can incorporate sememe information.
pdf
bib
abs
The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents
Xing Han Lu
|
Siva Reddy
|
Harm de Vries
We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on a on-going conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.
pdf
bib
abs
Question Generation Using Sequence-to-Sequence Model with Semantic Role Labels
Alireza Naeiji
|
Aijun An
|
Heidar Davoudi
|
Marjan Delpisheh
|
Muath Alzghool
Automatic generation of questions from text has gained increasing attention due to its useful applications. We propose a novel question generation method that combines the benefits of rule-based and neural sequence-to-sequence (Seq2Seq) models. The proposed method can automatically generate multiple questions from an input sentence covering different views of the sentence as in rule-based methods, while more complicated “rules” can be learned via the Seq2Seq model. The method utilizes semantic role labeling to convert training examples into their semantic representations, and then trains a Seq2Seq model over the semantic representations. Our extensive experiments on three real-world data sets show that the proposed method significantly improves the state-of-the-art neural question generation approaches.
pdf
bib
abs
StyLEx: Explaining Style Using Human Lexical Annotations
Shirley Anugrah Hayati
|
Kyumin Park
|
Dheeraj Rajagopal
|
Lyle Ungar
|
Dongyeop Kang
Large pre-trained language models have achieved impressive results on various style classification tasks, but they often learn spurious domain-specific words to make predictions (Hayati et al., 2021). While human explanation highlights stylistic tokens as important features for this task, we observe that model explanations often do not align with them. To tackle this issue, we introduce StyLEx, a model that learns from human annotated explanations of stylistic features and jointly learns to perform the task and predict these features as model explanations. Our experiments show that StyLEx can provide human like stylistic lexical explanations without sacrificing the performance of sentence-level style prediction on both in-domain and out-of-domain datasets. Explanations from StyLEx show significant improvements in explanation metrics (sufficiency, plausibility) and when evaluated with human annotations. They are also more understandable by human judges compared to the widely-used saliency-based explanation baseline.
pdf
bib
abs
Comparing Intrinsic Gender Bias Evaluation Measures without using Human Annotated Examples
Masahiro Kaneko
|
Danushka Bollegala
|
Naoaki Okazaki
Numerous types of social biases have been identified in pre-trained language models (PLMs), and various intrinsic bias evaluation measures have been proposed for quantifying those social biases. Prior works have relied on human annotated examples to compare existing intrinsic bias evaluation measures. However, this approach is not easily adaptable to different languages nor amenable to large scale evaluations due to the costs and difficulties when recruiting human annotators. To overcome this limitation, we propose a method to compare intrinsic gender bias evaluation measures without relying on human-annotated examples. Specifically, we create multiple bias-controlled versions of PLMs using varying amounts of male vs. female gendered sentences, mined automatically from an unannotated corpus using gender-related word lists. Next, each bias-controlled PLM is evaluated using an intrinsic bias evaluation measure, and the rank correlation between the computed bias scores and the gender proportions used to fine-tune the PLMs is computed. Experiments on multiple corpora and PLMs repeatedly show that the correlations reported by our proposed method that does not require human annotated examples are comparable to those computed using human annotated examples in prior work.
pdf
bib
abs
Faithfulness-Aware Decoding Strategies for Abstractive Summarization
David Wan
|
Mengwen Liu
|
Kathleen McKeown
|
Markus Dreyer
|
Mohit Bansal
Despite significant progress in understanding and improving faithfulness in abstractive summarization, the question of how decoding strategies affect faithfulness is less studied. We present a systematic study of the effect of generation techniques such as beam search and nucleus sampling on faithfulness in abstractive summarization. We find a consistent trend where beam search with large beam sizes produces the most faithful summaries while nucleus sampling generates the least faithful ones. We propose two faithfulness-aware generation methods to further improve faithfulness over current generation techniques: (1) ranking candidates generated by beam search using automatic faithfulness metrics and (2) incorporating lookahead heuristics that produce a faithfulness score on the future summary. We show that both generation methods significantly improve faithfulness across two datasets as evaluated by four automatic faithfulness metrics and human evaluation. To reduce computational cost, we demonstrate a simple distillation approach that allows the model to generate faithful summaries with just greedy decoding.
pdf
bib
abs
Dynamic Benchmarking of Masked Language Models on Temporal Concept Drift with Multiple Views
Katerina Margatina
|
Shuai Wang
|
Yogarshi Vyas
|
Neha Anna John
|
Yassine Benajiba
|
Miguel Ballesteros
Temporal concept drift refers to the problem of data changing over time. In the field of NLP, that would entail that language (e.g. new expressions, meaning shifts) and factual knowledge (e.g. new concepts, updated facts) evolve over time. Focusing on the latter, we benchmark 11 pretrained masked language models (MLMs) on a series of tests designed to evaluate the effect of temporal concept drift, as it is crucial that widely used language models remain up-to-date with the ever-evolving factual updates of the real world. Specifically, we provide a holistic framework that (1) dynamically creates temporal test sets of any time granularity (e.g. month, quarter, year) of factual data from Wikidata, (2) constructs fine-grained splits of tests (e.g. updated, new, unchanged facts) to ensure comprehensive analysis, and (3) evaluates MLMs in three distinct ways (single-token probing, multi-token generation, MLM scoring). In contrast to prior work, our framework aims to unveil how robust an MLM is over time and thus to provide a signal in case it has become outdated, by leveraging multiple views of evaluation.
pdf
bib
abs
Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow
Anjana Arunkumar
|
Swaroop Mishra
|
Bhavdeep Singh Sachdeva
|
Chitta Baral
|
Chris Bryan
Recent research has shown that language models exploit ‘artifacts’ in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP, that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing realtime visual feedback and recommendations to improve sample quality. Our approach is domain, model, task, and metric agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases effort, frustration, mental, and temporal demands of crowdworkers and analysts, simultaneously increasing the performance of both user groups with a 45.8% decrease in the level of artifacts in created samples. As a by product of our user study, we observe that created samples are adversarial across models, leading to decreases of 31.3% (BERT), 22.5% (RoBERTa), 14.98% (GPT-3 fewshot) in performance.
pdf
bib
abs
COMPS: Conceptual Minimal Pair Sentences for testing Robust Property Knowledge and its Inheritance in Pre-trained Language Models
Kanishka Misra
|
Julia Rayz
|
Allyson Ettinger
A characteristic feature of human semantic cognition is its ability to not only store and retrieve the properties of concepts observed through experience, but to also facilitate the inheritance of properties (can breathe) from superordinate concepts (animal) to their subordinates (dog)—i.e. demonstrate property inheritance. In this paper, we present COMPS, a collection of minimal pair sentences that jointly tests pre-trained language models (PLMs) on their ability to attribute properties to concepts and their ability to demonstrate property inheritance behavior. Analyses of 22 different PLMs on COMPS reveal that they can easily distinguish between concepts on the basis of a property when they are trivially different, but find it relatively difficult when concepts are related on the basis of nuanced knowledge representations. Furthermore, we find that PLMs can show behaviors suggesting successful property inheritance in simple contexts, but fail in the presence of distracting information, which decreases the performance of many models sometimes even below chance. This lack of robustness in demonstrating simple reasoning raises important questions about PLMs’ capacity to make correct inferences even when they appear to possess the prerequisite knowledge.
pdf
bib
abs
Probabilistic Robustness for Data Filtering
Yu Yu
|
Abdul Rafae Khan
|
Shahram Khadivi
|
Jia Xu
We introduce our probabilistic robustness rewarded data optimization (PRoDO) approach as a framework to enhance the model’s generalization power by selecting training data that optimizes our probabilistic robustness metrics. We use proximal policy optimization (PPO) reinforcement learning to approximately solve the computationally intractable training subset selection problem. The PPO’s reward is defined as our (𝛼,𝜖, 𝛾)-Robustness that measures performance consistency over multiple domains by simulating unknown test sets in real-world scenarios using a leaving-one-out strategy. We demonstrate that our PRoDO effectively filters data that lead to significantly higher prediction accuracy and robustness on unknown-domain test sets. Our experiments achieve up to +17.2% increase of accuracy (+25.5% relatively) in sentiment analysis, and -28.05 decrease of perplexity (-32.1% relatively) in language modeling.In addition, our probabilistic (𝛼,𝜖, 𝛾)-Robustness definition serves as an evaluation metric with higher levels of agreement with human annotations than typical performance-based metrics.
pdf
bib
abs
Unsupervised Improvement of Factual Knowledge in Language Models
Nafis Sadeq
|
Byungkyu Kang
|
Prarit Lamba
|
Julian McAuley
Masked language modeling (MLM) plays a key role in pretraining large language models. But the MLM objective is often dominated by high-frequency words that are sub-optimal for learning factual knowledge. In this work, we propose an approach for influencing MLM pretraining in a way that can improve language model performance on a variety of knowledge-intensive tasks. We force the language model to prioritize informative words in a fully unsupervised way. Experiments demonstrate that the proposed approach can significantly improve the performance of pretrained language models on tasks such as factual recall, question answering, sentiment analysis, and natural language inference in a closed-book setting.
pdf
bib
abs
Learning to Ignore Adversarial Attacks
Yiming Zhang
|
Yangqiaoyu Zhou
|
Samuel Carton
|
Chenhao Tan
Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can successfully ignore over 90% of attack tokens. This approach leads to consistent sizable improvements (~10%) over baseline models in robustness on three datasets for both BERT and RoBERTa, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set and hence reduce the effect of adversarial attacks.
pdf
bib
abs
Should You Mask 15% in Masked Language Modeling?
Alexander Wettig
|
Tianyu Gao
|
Zexuan Zhong
|
Danqi Chen
Masked language models (MLMs) conventionally mask 15% of tokens due to the belief that more masking would leave insufficient context to learn good representations; this masking rate has been widely used, regardless of model sizes or masking strategies. In this work, we revisit this important choice of MLM pre-training. We first establish that 15% is not universally optimal, and larger models should adopt a higher masking rate. Specifically, we find that masking 40% outperforms 15% for BERT-large size models on GLUE and SQuAD. Interestingly, an extremely high masking rate of 80% can still preserve 95% fine-tuning performance and most of the accuracy in linguistic probing, challenging the conventional wisdom about the role of the masking rate. We then examine the interplay between masking rates and masking strategies and find that uniform masking requires a higher masking rate compared to sophisticated masking strategies such as span or PMI masking. Finally, we argue that increasing the masking rate has two distinct effects: it leads to more corruption, which makes the prediction task more difficult; it also enables more predictions, which benefits optimization. Using this framework, we revisit BERT’s 80-10-10 corruption strategy. Together, our results contribute to a better understanding of MLM pre-training.
pdf
bib
abs
How do Words Contribute to Sentence Semantics? Revisiting Sentence Embeddings with a Perturbation Method
Wenlin Yao
|
Lifeng Jin
|
Hongming Zhang
|
Xiaoman Pan
|
Kaiqiang Song
|
Dian Yu
|
Dong Yu
|
Jianshu Chen
Understanding sentence semantics requires an interpretation of the main information from a concrete context. To investigate how individual word contributes to sentence semantics, we propose a perturbation method for unsupervised semantic analysis. We next re-examine SOTA sentence embedding models’ ability to capture the main semantics of a sentence by developing a new evaluation metric to adapt sentence compression datasets for automatic evaluation. Results on three datasets show that unsupervised discourse relation recognition can serve as a general inference task that can more effectively aggregate information to essential contents than several SOTA unsupervised sentence embedding models.
pdf
bib
abs
AutoTriggER: Label-Efficient and Robust Named Entity Recognition with Auxiliary Trigger Extraction
Dong-Ho Lee
|
Ravi Kiran Selvam
|
Sheikh Muhammad Sarwar
|
Bill Yuchen Lin
|
Fred Morstatter
|
Jay Pujara
|
Elizabeth Boschee
|
James Allan
|
Xiang Ren
Deep neural models for named entity recognition (NER) have shown impressive results in overcoming label scarcity and generalizing to unseen entities by leveraging distant supervision and auxiliary information such as explanations. However, the costs of acquiring such additional information are generally prohibitive. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging “entity triggers” which are human-readable cues in the text that help guide the model to make better decisions. Our framework leverages post-hoc explanation to generate rationales and strengthens a model’s prior knowledge using an embedding interpolation technique. This approach allows models to exploit triggers to infer entity boundaries and types instead of solely memorizing the entity words themselves. Through experiments on three well-studied NER datasets, AutoTriggER shows strong label-efficiency, is capable of generalizing to unseen entities, and outperforms the RoBERTa-CRF baseline by nearly 0.5 F1 points on average.
pdf
bib
abs
Incorporating Task-Specific Concept Knowledge into Script Learning
Chenkai Sun
|
Tie Xu
|
ChengXiang Zhai
|
Heng Ji
In this paper, we present Tetris, a new task of Goal-Oriented Script Completion. Unlike previous work, it considers a more realistic and general setting, where the input includes not only the goal but also additional user context, including preferences and history. To address this problem, we propose a novel approach, which uses two techniques to improve performance: (1) concept prompting, and (2) script-oriented contrastive learning that addresses step repetition and hallucination problems. On our WikiHow-based dataset, we find that both methods improve performance.
pdf
bib
abs
DeepMaven: Deep Question Answering on Long-Distance Movie/TV Show Videos with Multimedia Knowledge Extraction and Synthesis
Yi Fung
|
Han Wang
|
Tong Wang
|
Ali Kebarighotbi
|
Mohit Bansal
|
Heng Ji
|
Prem Natarajan
Long video content understanding poses a challenging set of research questions as it involves long-distance, cross-media reasoning and knowledge awareness. In this paper, we present a new benchmark for this problem domain, targeting the task of deep movie/TV question answering (QA) beyond previous work’s focus on simple plot summary and short video moment settings. We define several baselines based on direct retrieval of relevant context for long-distance movie QA. Observing that real-world QAs may require higher-order multi-hop inferences, we further propose a novel framework, called the DeepMaven, which extracts events, entities, and relations from the rich multimedia content in long videos to pre-construct movie knowledge graphs (movieKGs), and at the time of QA inference, complements general semantics with structured knowledge for more effective information retrieval and knowledge reasoning. We also introduce our recently collected DeepMovieQA dataset, including 1,000 long-form QA pairs from 41 hours of videos, to serve as a new and useful resource for future work. Empirical results show the DeepMaven performs competitively for both the new DeepMovieQA and the pre-existing MovieQA dataset.
pdf
bib
abs
Salient Span Masking for Temporal Understanding
Jeremy R. Cole
|
Aditi Chaudhary
|
Bhuwan Dhingra
|
Partha Talukdar
Salient Span Masking (SSM) has shown itself to be an effective strategy to improve closed-book question answering performance. SSM extends general masked language model pretraining by creating additional unsupervised training sentences that mask a single entity or date span, thus oversampling factual information. Despite the success of this paradigm, the span types and sampling strategies are relatively arbitrary and not widely studied for other tasks. Thus, we investigate SSM from the perspective of temporal tasks, where learning a good representation of various temporal expressions is important. To that end, we introduce Temporal Span Masking (TSM) intermediate training. First, we find that SSM alone improves the downstream performance on three temporal tasks by an avg. +5.8 points. Further, we are able to achieve additional improvements (avg. +0.29 points) by adding the TSM task. These comprise the new best reported results on the targeted tasks. Our analysis suggests that the effectiveness of SSM stems from the sentences chosen in the training data rather than the mask choice: sentences with entities frequently also contain temporal expressions. Nonetheless, the additional targeted spans of TSM can still improve performance, especially in a zero-shot context.
pdf
bib
abs
PECO: Examining Single Sentence Label Leakage in Natural Language Inference Datasets through Progressive Evaluation of Cluster Outliers
Michael Saxon
|
Xinyi Wang
|
Wenda Xu
|
William Yang Wang
Building natural language inference (NLI) benchmarks that are both challenging for modern techniques, and free from shortcut biases is difficult. Chief among these biases is “single sentence label leakage,” where annotator-introduced spurious correlations yield datasets where the logical relation between (premise, hypothesis) pairs can be accurately predicted from only a single sentence, something that should in principle be impossible. We demonstrate that despite efforts to reduce this leakage, it persists in modern datasets that have been introduced since its 2018 discovery. To enable future amelioration efforts, introduce a novel model-driven technique, the progressive evaluation of cluster outliers (PECO) which enables both the objective measurement of leakage, and the automated detection of subpopulations in the data which maximally exhibit it.
pdf
bib
abs
Weakly-Supervised Questions for Zero-Shot Relation Extraction
Saeed Najafi
|
Alona Fyshe
Zero-Shot Relation Extraction (ZRE) is the task of Relation Extraction where the training and test sets have no shared relation types. This very challenging domain is a good test of a model’s ability to generalize. Previous approaches to ZRE reframed relation extraction as Question Answering (QA), allowing for the use of pre-trained QA models. However, this method required manually creating gold question templates for each new relation. Here, we do away with these gold templates and instead learn a model that can generate questions for unseen relations. Our technique can successfully translate relation descriptions into relevant questions, which are then leveraged to generate the correct tail entity. On tail entity extraction, we outperform the previous state-of-the-art by more than 16 F1 points without using gold question templates. On the RE-QA dataset where no previous baseline for relation extraction exists, our proposed algorithm comes within 0.7 F1 points of a system that uses gold question templates. Our model also outperforms the state-of-the-art ZRE baselines on the FewRel and WikiZSL datasets, showing that QA models no longer need template questions to match the performance of models specifically tailored to the ZRE task. Our implementation is available at
https://github.com/fyshelab/QA-ZRE.
pdf
bib
abs
DiffQG: Generating Questions to Summarize Factual Changes
Jeremy R. Cole
|
Palak Jain
|
Julian Martin Eisenschlos
|
Michael J.Q. Zhang
|
Eunsol Choi
|
Bhuwan Dhingra
Identifying the difference between two versions of the same article is useful to update knowledge bases and to understand how articles evolve. Paired texts occur naturally in diverse situations: reporters write similar news stories and maintainers of authoritative websites must keep their information up to date. We propose representing factual changes between paired documents as question-answer pairs, where the answer to the same question differs between two versions. We find that question-answer pairs can flexibly and concisely capture the updated contents. Provided with paired documents, annotators identify questions that are answered by one passage but answered differently or cannot be answered by the other. We release DiffQG which consists of 759 QA pairs and 1153 examples of paired passages with no factual change. These questions are intended to be both unambiguous and information-seeking and involve complex edits, pushing beyond the capabilities of current question generation and factual change detection systems. Our dataset summarizes the changes between two versions of the document as questions and answers, studying automatic update summarization in a novel way.
pdf
bib
abs
Contextual Dynamic Prompting for Response Generation in Task-oriented Dialog Systems
Sandesh Swamy
|
Narges Tabari
|
Chacha Chen
|
Rashmi Gangadharaiah
Response generation is one of the critical components in task-oriented dialog systems. Existing studies have shown that large pre-trained language models can be adapted to this task. The typical paradigm of adapting such extremely large language models would be by fine-tuning on the downstream tasks which is not only time-consuming but also involves significant resources and access to fine-tuning data. Prompting (Schick and Schütze, 2020) has been an alternative to fine-tuning in many NLP tasks. In our work, we explore the idea of using prompting for response generation in task-oriented dialog systems. Specifically, we propose an approach that performs contextual dynamic prompting where the prompts are learnt from dialog contexts. We aim to distill useful prompting signals from the dialog context. On experiments with MultiWOZ 2.2 dataset (Zang et al., 2020), we show that contextual dynamic prompts improve response generation in terms of combined score (Mehri et al., 2019) by 3 absolute points, and an additional 17 points when dialog states are incorporated. Furthermore, we carried out human annotation on these conversations and found that agents which incorporate context are preferred over agents with vanilla prefix-tuning.
pdf
bib
abs
Why Can’t Discourse Parsing Generalize? A Thorough Investigation of the Impact of Data Diversity
Yang Janet Liu
|
Amir Zeldes
Recent advances in discourse parsing performance create the impression that, as in other NLP tasks, performance for high-resource languages such as English is finally becoming reliable. In this paper we demonstrate that this is not the case, and thoroughly investigate the impact of data diversity on RST parsing stability. We show that state-of-the-art architectures trained on the standard English newswire benchmark do not generalize well, even within the news domain. Using the two largest RST corpora of English with text from multiple genres, we quantify the impact of genre diversity in training data for achieving generalization to text types unseen during training. Our results show that a heterogeneous training regime is critical for stable and generalizable models, across parser architectures. We also provide error analyses of model outputs and out-of-domain performance. To our knowledge, this study is the first to fully evaluate cross-corpus RST parsing generalizability on complete trees, examine between-genre degradation within an RST corpus, and investigate the impact of genre diversity in training data composition.
pdf
bib
abs
Enriching Biomedical Knowledge for Low-resource Language Through Large-scale Translation
Long Phan
|
Tai Dang
|
Hieu Tran
|
Trieu H. Trinh
|
Vy Phan
|
Lam D. Chau
|
Minh-Thang Luong
Biomedical data and benchmarks are highly valuable yet very limited in low-resource languages other than English, such as Vietnamese. In this paper, we use a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained and supervised data in the biomedical domains. Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus. ViPubMedT5 demonstrates state-of-the-art results on two different biomedical benchmarks in summarization and acronym disambiguation. Further, we release ViMedNLI - a new NLP task in Vietnamese translated from MedNLI using the recently public En-vi translation model and carefully refined by human experts, with evaluations of existing methods against ViPubmedT5.
pdf
bib
abs
Syntax-guided Neural Module Distillation to Probe Compositionality in Sentence Embeddings
Rohan Pandey
Past work probing compositionality in sentence embedding models faces issues determining the causal impact of implicit syntax representations. Given a sentence, we construct a neural module net based on its syntax parse and train it end-to-end to approximate the sentence’s embedding generated by a transformer model. The distillability of a transformer to a Syntactic NeurAl Module Net (SynNaMoN) then captures whether syntax is a strong causal model of its compositional ability. Furthermore, we address questions about the geometry of semantic composition by specifying individual SynNaMoN modules’ internal architecture & linearity. We find differences in the distillability of various sentence embedding models that broadly correlate with their performance, but observe that distillability doesn’t considerably vary by model size. We also present preliminary evidence that much syntax-guided composition in sentence embedding models is linear, and that non-linearities may serve primarily to handle non-compositional phrases.
pdf
bib
abs
Closed-book Question Generation via Contrastive Learning
Xiangjue Dong
|
Jiaying Lu
|
Jianling Wang
|
James Caverlee
Question Generation (QG) is a fundamental NLP task for many downstream applications. Recent studies on open-book QG, where supportive answer-context pairs are provided to models, have achieved promising progress. However, generating natural questions under a more practical closed-book setting that lacks these supporting documents still remains a challenge. In this work, we propose a new QG model for this closed-book setting that is designed to better understand the semantics of long-form abstractive answers and store more information in its parameters through contrastive learning and an answer reconstruction module. Through experiments, we validate the proposed QG model on both public datasets and a new WikiCQA dataset. Empirical results show that the proposed QG model outperforms baselines in both automatic evaluation and human evaluation. In addition, we show how to leverage the proposed model to improve existing question-answering systems. These results further indicate the effectiveness of our QG model for enhancing closed-book question-answering tasks.
pdf
bib
abs
A Hybrid Detection and Generation Framework with Separate Encoders for Event Extraction
Ge Shi
|
Yunyue Su
|
Yongliang Ma
|
Ming Zhou
The event extraction task typically consists of event detection and event argument extraction. Most previous work models these two subtasks with shared representation by multiple classification tasks or a unified generative approach. In this paper, we revisit this pattern and propose to use independent encoders to model event detection and event argument extraction, respectively, and use the output of event detection to construct the input of event argument extraction. In addition, we use token-level features to precisely control the fusion between two encoders to achieve joint bridging training rather than directly reusing representations between different tasks. Through a series of careful experiments, we demonstrate the importance of avoiding feature interference of different tasks and the importance of joint bridging training. We achieved competitive results on standard benchmarks (ACE05-E, ACE05-E+, and ERE-EN) and established a solid baseline.
pdf
bib
abs
Path Spuriousness-aware Reinforcement Learning for Multi-Hop Knowledge Graph Reasoning
Chunyang Jiang
|
Tianchen Zhu
|
Haoyi Zhou
|
Chang Liu
|
Ting Deng
|
Chunming Hu
|
Jianxin Li
Multi-hop reasoning, a prevalent approach for query answering, aims at inferring new facts along reasonable paths over a knowledge graph. Reinforcement learning methods can be adopted by formulating the problem into a Markov decision process. However, common suffering within RL-based reasoning models is that the agent can be biased to spurious paths which coincidentally lead to the correct answer with poor explanation. In this work, we take a deep dive into this phenomenon and define a metric named Path Spuriousness (PS), to quantitatively estimate to what extent a path is spurious. Guided by the definition of PS, we design a model with a new reward that considers both answer accuracy and path reasonableness. We test our method on four datasets and experiments reveal that our method considerably enhances the agent’s capacity to prevent spurious paths while keeping comparable to state-of-the-art performance.
pdf
bib
abs
Self-Adaptive Named Entity Recognition by Retrieving Unstructured Knowledge
Kosuke Nishida
|
Naoki Yoshinaga
|
Kyosuke Nishida
Although named entity recognition (NER) helps us to extract domain-specific entities from text (e.g., artists in the music domain), it is costly to create a large amount of training data or a structured knowledge base to perform accurate NER in the target domain. Here, we propose self-adaptive NER, which retrieves external knowledge from unstructured text to learn the usages of entities that have not been learned well. To retrieve useful knowledge for NER, we design an effective two-stage model that retrieves unstructured knowledge using uncertain entities as queries. Our model predicts the entities in the input and then finds those of which the prediction is not confident. Then, it retrieves knowledge by using these uncertain entities as queries and concatenates the retrieved text to the original input to revise the prediction. Experiments on CrossNER datasets demonstrated that our model outperforms strong baselines by 2.35 points in F1 metric.
pdf
bib
abs
When Do Pre-Training Biases Propagate to Downstream Tasks? A Case Study in Text Summarization
Faisal Ladhak
|
Esin Durmus
|
Mirac Suzgun
|
Tianyi Zhang
|
Dan Jurafsky
|
Kathleen McKeown
|
Tatsunori Hashimoto
Large language models (LLMs) are subject to sociocultural and other biases previously identified using intrinsic evaluations. However, when and how these intrinsic biases in pre-trained LM representations propagate to downstream, fine-tuned NLP tasks like summarization is not well understood. In this work, we investigate one type of bias—name-nationality bias—and trace it from the pre-training stage to a downstream summarization task across multiple summarization modeling choices. We show that these biases manifest themselves as hallucinations in summarization, leading to factually incorrect summaries. We also find that this propagation of biases is algorithm-dependent: more abstractive models allow biases to propagate more directly to downstream tasks as hallucinated facts. Building on these observations, we further analyze how changes to the adaptation method and fine-tuning data set affect name nationality biases and show that while they can reduce the overall rate of hallucinations, they do not change the types of biases that do appear.
pdf
bib
abs
BERT Shows Garden Path Effects
Tovah Irwin
|
Kyra Wilson
|
Alec Marantz
Garden path sentences (i.e. “the horse raced past the barn fell”) are sentences that readers initially incorrectly parse, requiring partial or total re-analysis of the sentence structure. Given human difficulty in parsing garden paths, we aim to compare transformer language models’ performance on these sentences. We assess a selection of models from the BERT family which have been fine-tuned on the question-answering task, and evaluate each model’s performance on comprehension questions based on garden path and control sentences. We then further investigate the semantic roles assigned to arguments of verbs in garden path and control sentences by utilizing a probe task to directly assess which semantic role(s) the model assigns. We find that the models have relatively low performance in certain instances of question answering based on garden path contexts, and the model incorrectly assigns semantic roles, aligning for the most part with human performance.
pdf
bib
abs
Models Teaching Models: Improving Model Accuracy with Slingshot Learning
Lachlan O’Neill
|
Nandini Anantharama
|
Satya Borgohain
|
Simon D. Angus
One significant obstacle to the successful application of machine learning to real-world data is that of labeling: it is often prohibitively expensive to pay an ethical amount for the human labor required to label a dataset successfully. Human-in-the-loop techniques such as active learning can reduce the cost, but the required human time is still significant and many fixed costs remain. Another option is to employ pre-trained transformer models as labelers at scale, which can yield reasonable accuracy and significant cost savings. However, such models can still be expensive to use due to their high computational requirements, and the opaque nature of these models is not always suitable in applied social science and public use contexts. We propose a novel semi-supervised method, named Slingshot Learning, in which we iteratively and selectively augment a small human-labeled dataset with labels from a high-quality “teacher” model to slingshot the performance of a “student” model in a cost-efficient manner. This reduces the accuracy trade-off required to use these simpler algorithms without disrupting their benefits, such as lower compute requirements, better interpretability, and faster inference. We define and discuss the slingshot learning algorithm and demonstrate its effectiveness on several benchmark tasks, using ALBERT to teach a simple Naive Bayes binary classifier. We experimentally demonstrate that Slingshot learning effectively decreases the performance gap between the teacher and student models. We also analyze its performance in several scenarios and compare different variants of the algorithm.
pdf
bib
abs
A Federated Approach for Hate Speech Detection
Jay Gala
|
Deep Gandhi
|
Jash Mehta
|
Zeerak Talat
Hate speech detection has been the subject of high research attention, due to the scale of content created on social media. In spite of the attention and the sensitive nature of the task, privacy preservation in hate speech detection has remained under-studied. The majority of research has focused on centralised machine learning infrastructures which risk leaking data. In this paper, we show that using federated machine learning can help address privacy the concerns that are inherent to hate speech detection while obtaining up to 6.81% improvement in terms of F1-score.
pdf
bib
abs
Learning the Legibility of Visual Text Perturbations
Dev Seth
|
Rickard Stureborg
|
Danish Pruthi
|
Bhuwan Dhingra
Many adversarial attacks in NLP perturb text in puts to produce visually similar strings (‘ergo’, ‘εrgo’) which are legible to humans but degrade model performance. Although preserving legibility is a necessary condition for text perturbation, little work has been done to systematically characterize it; instead, legibility is typically loosely enforced via intuitions around the nature and extent of perturbations. Particularly, it is unclear to what extent can inputs be perturbed while preserving legibility, or how to quantify the legibility of a perturbed string. In this work, we address this gap by learning models that predict the legibility of a perturbed string, and rank candidate perturbations based on their legibility. To do so, we collect and release LEGIT, a human-annotated dataset comprising the legibility of visually perturbed text. Using this dataset, we build both text- and vision-based models which achieve up to 0.91 F score in predicting whether an input is legible, and an accuracy of 0.86 in predicting which of two given perturbations is more legible. Additionally, we discover that legible perturbations from the LEGIT dataset are more effective at lowering the performance of NLP models than best-known attack strategies, suggesting that current models may be vulnerable to a broad range of perturbations beyond what is captured by existing visual attacks.
pdf
bib
abs
DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
Mojtaba Valipour
|
Mehdi Rezagholizadeh
|
Ivan Kobyzev
|
Ali Ghodsi
With the ever-growing size of pretrained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pretrained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) to the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, then we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting the representation learned by the adapter module at different ranks during training. We evaluate our solution on different natural language understanding (GLUE benchmark) and language generation tasks (E2E, DART and WebNLG) using different pretrained models such as RoBERTa and GPT with different sizes. Our results show that we can train dynamic search-free models with DyLoRA at least 4 to 7 times (depending to the task) faster than LoRA without significantly compromising performance. Moreover, our models can perform consistently well on a much larger range of ranks compared to LoRA.
pdf
bib
abs
Conversational Emotion-Cause Pair Extraction with Guided Mixture of Experts
DongJin Jeong
|
JinYeong Bak
Emotion-Cause Pair Extraction (ECPE) task aims to pair all emotions and corresponding causes in documents.ECPE is an important task for developing human-like responses. However, previous ECPE research is conducted based on news articles, which has different characteristics compared to dialogues. To address this issue, we propose a Pair-Relationship Guided Mixture-of-Experts (PRG-MoE) model, which considers dialogue features (e.g., speaker information).PRG-MoE automatically learns relationship between utterances and advises a gating network to incorporate dialogue features in the evaluation, yielding substantial performance improvement. We employ a new ECPE dataset, which is an English dialogue dataset, with more emotion-cause pairs in documents than news articles. We also propose Cause Type Classification that classifies emotion-cause pairs according to the types of the cause of a detected emotion. For reproducing the results, we make available all our code and data.
pdf
bib
abs
Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar
|
Vidhisha Balachandran
|
Lucille Njoo
|
Antonios Anastasopoulos
|
Yulia Tsvetkov
Recent advances in the capacity of large language models to generate human-like text have resulted in their increased adoption in user-facing settings. In parallel, these improvements have prompted a heated discourse around the risks of societal harms they introduce, whether inadvertent or malicious. Several studies have explored these harms and called for their mitigation via development of safer, fairer models. Going beyond enumerating the risks of harms, this work provides a survey of practical methods for addressing potential threats and societal harms from language generation models. We draw on several prior works’ taxonomies of language model risks to present a structured overview of strategies for detecting and ameliorating different kinds of risks/harms of language generators. Bridging diverse strands of research, this survey aims to serve as a practical guide for both LM researchers and practitioners, with explanations of different strategies’ motivations, their limitations, and open problems for future research.
pdf
bib
abs
TraVLR: Now You See It, Now You Don’t! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning
Keng Ji Chow
|
Samson Tan
|
Min-Yen Kan
Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not adequately evaluate the extent to which they represent visual and linguistic concepts in a unified space. We propose several novel evaluation settings for V+L models, including cross-modal transfer. Furthermore, existing V+L benchmarks often report global accuracy scores on the entire dataset, making it difficult to pinpoint the specific reasoning tasks that models fail and succeed at. We present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. TraVLR’s synthetic nature allows us to constrain its training and testing distributions along task-relevant dimensions, enabling the evaluation of out-of-distribution generalisation. Each example in TraVLR redundantly encodes the scene in two modalities, allowing either to be dropped or added during training or testing without losing relevant information. We compare the performance of four state-of-the-art V+L models, finding that while they perform well on test examples from the same modality, they all fail at cross-modal transfer and have limited success accommodating the addition or deletion of one modality. We release TraVLR as an open challenge for the research community.
pdf
bib
abs
Paraphrase Acquisition from Image Captions
Marcel Gohsen
|
Matthias Hagen
|
Martin Potthast
|
Benno Stein
We propose to use image captions from the Web as a previously underutilized resource for paraphrases (i.e., texts with the same “message”) and to create and analyze a corresponding dataset. When an image is reused on the Web, an original caption is often assigned. We hypothesize that different captions for the same image naturally form a set of mutual paraphrases. To demonstrate the suitability of this idea, we analyze captions in the English Wikipedia, where editors frequently relabel the same image for different articles. The paper introduces the underlying mining technology, the resulting Wikipedia-IPC dataset, and compares known paraphrase corpora with respect to their syntactic and semantic paraphrase similarity to our new resource. In this context, we introduce characteristic maps along the two similarity dimensions to identify the style of paraphrases coming from different sources. An annotation study demonstrates the high reliability of the algorithmically determined characteristic maps.
pdf
bib
abs
Generation-Based Data Augmentation for Offensive Language Detection: Is It Worth It?
Camilla Casula
|
Sara Tonelli
Generation-based data augmentation (DA) has been presented in several works as a way to improve offensive language detection. However, the effectiveness of generative DA has been shown only in limited scenarios, and the potential injection of biases when using generated data to classify offensive language has not been investigated. Our aim is that of analyzing the feasibility of generative data augmentation more in-depth with two main focuses. First, we investigate the robustness of models trained on generated data in a variety of data augmentation setups, both novel and already presented in previous work, and compare their performance on four widely-used English offensive language datasets that present inherent differences in terms of content and complexity. In addition to this, we analyze models using the HateCheck suite, a series of functional tests created to challenge hate speech detection systems. Second, we investigate potential lexical bias issues through a qualitative analysis on the generated data. We find that the potential positive impact of generative data augmentation on model performance is unreliable, and generative DA can also have unpredictable effects on lexical bias.
pdf
bib
abs
Quantifying Context Mixing in Transformers
Hosein Mohebbi
|
Willem Zuidema
|
Grzegorz Chrupała
|
Afra Alishahi
Self-attention weights and their transformed variants have been the main source of information for analyzing token-to-token interactions in Transformer-based models. But despite their ease of interpretation, these weights are not faithful to the models’ decisions as they are only one part of an encoder, and other components in the encoder layer can have considerable impact on information mixing in the output representations. In this work, by expanding the scope of analysis to the whole encoder block, we propose Value Zeroing, a novel context mixing score customized for Transformers that provides us with a deeper understanding of how information is mixed at each encoder layer. We demonstrate the superiority of our context mixing score over other analysis methods through a series of complementary evaluations with different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis.
pdf
bib
abs
KGVL-BART: Knowledge Graph Augmented Visual Language BART for Radiology Report Generation
Kaveri Kale
|
Pushpak Bhattacharyya
|
Milind Gune
|
Aditya Shetty
|
Rustom Lawyer
Timely generation of radiology reports and diagnoses is a challenge worldwide due to the enormous number of cases and shortage of radiology specialists. In this paper, we propose a Knowledge Graph Augmented Vision Language BART (KGVL-BART) model that takes as input two chest X-ray images- one frontal and the other lateral- along with tags which are diagnostic keywords, and outputs a report with the patient-specific findings. Our system development effort is divided into 3 stages: i) construction of the Chest X-ray KG (referred to as chestX-KG), ii) image feature extraction, and iii) training a KGVL-BART model using the visual, text, and KG data. The dataset we use is the well-known Indiana University Chest X-ray reports with the train, validation, and test split of 3025 instances, 300 instances, and 500 instances respectively. We construct a Chest X-Ray knowledge graph from these reports by extracting entity1-relation-entity2 triples; the triples get extracted by a rule-based tool of our own. Constructed KG is verified by two experienced radiologists (with experience of 30 years and 8 years, respectively). We demonstrate that our model- KGVL-BART- outperforms State-of-the-Art transformer-based models on standard NLG scoring metrics. We also include a qualitative evaluation of our system by experienced radiologist (with experience of 30 years) on the test data, which showed that 73% of the reports generated were fully correct, only 5.5% are completely wrong and 21.5% have important missing details though overall correct. To the best of our knowledge, ours is the first system to make use of multi-modality and domain knowledge to generate X-ray reports automatically.
pdf
bib
abs
A simple but effective model for attachment in discourse parsing with multi-task learning for relation labeling
Zineb Bennis
|
Julie Hunter
|
Nicholas Asher
In this paper, we present a discourse parsing model for conversation trained on the STAC. We fine-tune a BERT-based model to encode pairs of discourse units and use a simple linear layer to predict discourse attachments. We then exploit a multi-task setting to predict relation labels. The multitask approach effectively aids in the difficult task of relation type prediction; our f1 score of 57 surpasses the state of the art with no loss in performance for attachment, confirming the intuitive interdependence of these two tasks. Our method also improves over previous discourse parsing models in allowing longer input sizes and in permitting attachments in which one node has multiple parents, an important feature of multiparty conversation.
pdf
bib
abs
How Far Can It Go? On Intrinsic Gender Bias Mitigation for Text Classification
Ewoenam Kwaku Tokpo
|
Pieter Delobelle
|
Bettina Berendt
|
Toon Calders
To mitigate gender bias in contextualized language models, different intrinsic mitigation strategies have been proposed, alongside many bias metrics. Considering that the end use of these language models is for downstream tasks like text classification, it is important to understand how these intrinsic bias mitigation strategies actually translate to fairness in downstream tasks and the extent of this. In this work, we design a probe to investigate the effects that some of the major intrinsic gender bias mitigation strategies have on downstream text classification tasks. We discover that instead of resolving gender bias, intrinsic mitigation techniques and metrics are able to hide it in such a way that significant gender information is retained in the embeddings. Furthermore, we show that each mitigation technique is able to hide the bias from some of the intrinsic bias measures but not all, and each intrinsic bias measure can be fooled by some mitigation techniques, but not all. We confirm experimentally, that none of the intrinsic mitigation techniques used without any other fairness intervention is able to consistently impact extrinsic bias. We recommend that intrinsic bias mitigation techniques should be combined with other fairness interventions for downstream tasks.
pdf
bib
abs
Multimodal Event Transformer for Image-guided Story Ending Generation
Yucheng Zhou
|
Guodong Long
Image-guided story ending generation (IgSEG) is to generate a story ending based on given story plots and ending image. Existing methods focus on cross-modal feature fusion but overlook reasoning and mining implicit information from story plots and ending image. To tackle this drawback, we propose a multimodal event transformer, an event-based reasoning framework for IgSEG. Specifically, we construct visual and semantic event graphs from story plots and ending image, and leverage event-based reasoning to reason and mine implicit information in a single modality. Next, we connect visual and semantic event graphs and utilize cross-modal fusion to integrate different-modality features. In addition, we propose a multimodal injector to adaptive pass essential information to decoder. Besides, we present an incoherence detection to enhance the understanding context of a story plot and the robustness of graph modeling for our model. Experimental results show that our method achieves state-of-the-art performance for the image-guided story ending generation.
pdf
bib
abs
Improving Cross-modal Alignment for Text-Guided Image Inpainting
Yucheng Zhou
|
Guodong Long
Text-guided image inpainting (TGII) aims to restore missing regions based on a given text in a damaged image. Existing methods are based on a strong vision encoder and a cross-modal fusion model to integrate cross-modal features. However, these methods allocate most of the computation to visual encoding, while light computation on modeling modality interactions. Moreover, they take cross-modal fusion for depth features, which ignores a fine-grained alignment between text and image. Recently, vision-language pre-trained models (VLPM), encapsulating rich cross-modal alignment knowledge, have advanced in most multimodal tasks. In this work, we propose a novel model for TGII by improving cross-modal alignment (CMA). CMA model consists of a VLPM as a vision-language encoder, an image generator and global-local discriminators. To explore cross-modal alignment knowledge for image restoration, we introduce cross-modal alignment distillation and in-sample distribution distillation. In addition, we employ adversarial training to enhance the model to fill the missing region in complicated structures effectively. Experiments are conducted on two popular vision-language datasets. Results show that our model achieves state-of-the-art performance compared with other strong competitors.
pdf
bib
abs
Semantic Specialization for Knowledge-based Word Sense Disambiguation
Sakae Mizuki
|
Naoaki Okazaki
A promising approach for knowledge-based Word Sense Disambiguation (WSD) is to select the sense whose contextualized embeddings computed for its definition sentence are closest to those computed for a target word in a given sentence. This approach relies on the similarity of the sense and context embeddings computed by a pre-trained language model. We propose a semantic specialization for WSD where contextualized embeddings are adapted to the WSD task using solely lexical knowledge. The key idea is, for a given sense, to bring semantically related senses and contexts closer and send different/unrelated senses farther away. We realize this idea as the joint optimization of the Attract-Repel objective for sense pairs and the self-training objective for context-sense pairs while controlling deviations from the original embeddings. The proposed method outperformed previous studies that adapt contextualized embeddings. It achieved state-of-the-art performance on knowledge-based WSD when combined with the reranking heuristic that uses the sense inventory. We found that the similarity characteristics of specialized embeddings conform to the key idea. We also found that the (dis)similarity of embeddings between the related/different/unrelated senses correlates well with the performance of WSD.
pdf
bib
abs
Concept-based Persona Expansion for Improving Diversity of Persona-Grounded Dialogue
Donghyun Kim
|
Youbin Ahn
|
Chanhee Lee
|
Wongyu Kim
|
Kyong-Ho Lee
|
Donghoon Shin
|
Yeonsoo Lee
A persona-grounded dialogue model aims to improve the quality of responses to promote user engagement. However, because the given personas are mostly short and limited to only a few informative words, it is challenging to utilize them to generate diverse responses. To tackle this problem, we propose a novel persona expansion framework, Concept-based Persona eXpansion (CPX). CPX takes the original persona as input and generates expanded personas that contain conceptually rich content. We constitute CPX with two task modules: 1) Concept Extractor and 2) Sentence Generator. To train these modules, we exploit the duality of two tasks with a commonsense dataset consisting of a concept set and the corresponding sentences which contain the given concepts. Extensive experiments on persona expansion and response generation show that our work sufficiently contributes to improving the quality of responses in diversity and richness.
pdf
bib
abs
RPTCS: A Reinforced Persona-aware Topic-guiding Conversational System
Zishan Ahmad
|
Kshitij Mishra
|
Asif Ekbal
|
Pushpak Bhattacharyya
Although there has been a plethora of work on open-domain conversational systems, most of the systems lack the mechanism of controlling the concept transitions in a dialogue. For activities like switching from casual chit-chat to task-oriented conversation, an agent with the ability to manage the flow of concepts in a conversation might be helpful. The user would find the dialogue more engaging and be more receptive to such transitions if these concept transitions were made while taking into account the user’s persona. Focusing on persona-aware concept transitions, we propose a Reinforced Persona-aware Topic-guiding Conversational System (RPTCS). Due to the lack of a persona-aware topic transition dataset, we propose a novel conversation dataset creation mechanism in which the conversational agent leads the discourse to drift to a set of target concepts depending on the persona of the speaker and the context of the conversation. To avoid scarcely available expensive human resource, the entire data-creation process is mostly automatic with human-in-loop only for quality checks. This created conversational dataset named PTCD is used to develop the RPTCS in two steps. First, a maximum likelihood estimation loss-based conversational model is trained on PTCD. Then this trained model is fine-tuned in a Reinforcement Learning (RL) framework by employing novel reward functions to assure persona, topic, and context consistency with non-repetitiveness in generated responses. Our experimental results demonstrate the strength of the proposed system with respect to strong baselines.
pdf
bib
abs
What Did You Learn To Hate? A Topic-Oriented Analysis of Generalization in Hate Speech Detection
Tom Bourgeade
|
Patricia Chiril
|
Farah Benamara
|
Véronique Moriceau
Hate speech has unfortunately become a significant phenomenon on social media platforms, and it can cover various topics (misogyny, sexism, racism, xenophobia, etc.) and targets (e.g., black people, women). Various hate speech detection datasets have been proposed, some annotated for specific topics, and others for hateful speech in general. In either case, they often employ different annotation guidelines, which can lead to inconsistencies, even in datasets focusing on the same topics. This can cause issues in models trying to generalize across more data and more topics in order to improve detection accuracy. In this paper, we propose, for the first time, a topic-oriented approach to study generalization across popular hate speech datasets. We first perform a comparative analysis of the performances of Transformer-based models in capturing topic-generic and topic-specific knowledge when trained on different datasets. We then propose a novel, simple yet effective approach to study more precisely which topics are best captured in implicit manifestations of hate, showing that selecting combinations of datasets with better out-of-domain topical coverage improves the reliability of automatic hate speech detection.
pdf
bib
abs
End-to-end Case-Based Reasoning for Commonsense Knowledge Base Completion
Zonglin Yang
|
Xinya Du
|
Erik Cambria
|
Claire Cardie
Pretrained language models have been shown to store knowledge in their parameters and have achieved reasonable performance in commonsense knowledge base completion (CKBC) tasks. However, CKBC is knowledge-intensive and it is reported that pretrained language models’ performance in knowledge-intensive tasks are limited because of their incapability of accessing and manipulating knowledge. As a result, we hypothesize that providing retrieved passages that contain relevant knowledge as additional input to the CKBC task will improve performance. In particular, we draw insights from Case-Based Reasoning (CBR) – which aims to solve a new problem by reasoning with retrieved relevant cases, and investigate the direct application of it to CKBC. On two benchmark datasets, we demonstrate through automatic and human evaluations that our End-to-end Case-Based Reasoning Framework (ECBRF) generates more valid, informative, and novel knowledge than the state-of-the-art COMET model for CKBC in both the fully supervised and few-shot settings. We provide insights on why previous retrieval-based methods only achieve merely the same performance with COMET. From the perspective of CBR, our framework addresses a fundamental question on whether CBR methodology can be utilized to improve deep learning models.
pdf
bib
abs
Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text
Marwa Gaser
|
Manuel Mager
|
Injy Hamed
|
Nizar Habash
|
Slim Abdennadher
|
Ngoc Thang Vu
Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.
pdf
bib
abs
Identifying the limits of transformers when performing model-checking with natural language
Tharindu Madusanka
|
Riza Batista-navarro
|
Ian Pratt-hartmann
Can transformers learn to comprehend logical semantics in natural language? Although many strands of work on natural language inference have focussed on transformer models’ ability to perform reasoning on text, the above question has not been answered adequately. This is primarily because the logical problems that have been studied in the context of natural language inference have their computational complexity vary with the logical and grammatical constructs within the sentences. As such, it is difficult to access whether the difference in accuracy is due to logical semantics or the difference in computational complexity. A problem that is much suited to address this issue is that of the model-checking problem, whose computational complexity remains constant (for fragments derived from first-order logic). However, the model-checking problem remains untouched in natural language inference research. Thus, we investigated the problem of model-checking with natural language to adequately answer the question of how the logical semantics of natural language affects transformers’ performance. Our results imply that the language fragment has a significant impact on the performance of transformer models. Furthermore, we hypothesise that a transformer model can at least partially understand the logical semantics in natural language but can not completely learn the rules governing the model-checking algorithm.
pdf
bib
abs
Improving the Generalizability of Collaborative Dialogue Analysis With Multi-Feature Embeddings
Ayesha Enayet
|
Gita Sukthankar
Conflict prediction in communication is integral to the design of virtual agents that support successful teamwork by providing timely assistance. The aim of our research is to analyze discourse to predict collaboration success. Unfortunately, resource scarcity is a problem that teamwork researchers commonly face since it is hard to gather a large number of training examples. To alleviate this problem, this paper introduces a multi-feature embedding (MFeEmb) that improves the generalizability of conflict prediction models trained on dialogue sequences. MFeEmb leverages textual, structural, and semantic information from the dialogues by incorporating lexical, dialogue acts, and sentiment features. The use of dialogue acts and sentiment features reduces performance loss from natural distribution shifts caused mainly by changes in vocabulary. This paper demonstrates the performance of MFeEmb on domain adaptation problems in which the model is trained on discourse from one task domain and applied to predict team performance in a different domain. The generalizability of MFeEmb is quantified using the similarity measure proposed by Bontonou et al. (2021). Our results show that MFeEmb serves as an excellent domain-agnostic representation for meta-pretraining a few-shot model on collaborative multiparty dialogues.
pdf
bib
abs
MetaQA: Combining Expert Agents for Multi-Skill Question Answering
Haritz Puerto
|
Gözde Şahin
|
Iryna Gurevych
The recent explosion of question-answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer predictions. Through quantitative and qualitative experiments, we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future research of multi-agent systems.
pdf
bib
abs
BERT Is Not The Count: Learning to Match Mathematical Statements with Proofs
Weixian Waylon Li
|
Yftah Ziser
|
Maximin Coavoux
|
Shay B. Cohen
We introduce a task consisting in matching a proof to a given mathematical statement. The task fits well within current research on Mathematical Information Retrieval and, more generally, mathematical article analysis (Mathematical Sciences, 2014). We present a dataset for the task (the MATcH dataset) consisting of over 180k statement-proof pairs extracted from modern mathematical research articles. We find this dataset highly representative of our task, as it consists of relatively new findings useful to mathematicians. We propose a bilinear similarity model and two decoding methods to match statements to proofs effectively. While the first decoding method matches a proof to a statement without being aware of other statements or proofs, the second method treats the task as a global matching problem. Through a symbol replacement procedure, we analyze the “insights” that pre-trained language models have in such mathematical article analysis and show that while these models perform well on this task with the best performing mean reciprocal rank of 73.7, they follow a relatively shallow symbolic analysis and matching to achieve that performance.
pdf
bib
abs
Lessons Learned from a Citizen Science Project for Natural Language Processing
Jan-Christoph Klie
|
Ji-Ung Lee
|
Kevin Stowe
|
Gözde Şahin
|
Nafise Sadat Moosavi
|
Luke Bates
|
Dominic Petrak
|
Richard Eckart De Castilho
|
Iryna Gurevych
Many Natural Language Processing (NLP) systems use annotated corpora for training and evaluation. However, labeled data is often costly to obtain and scaling annotation projects is difficult, which is why annotation tasks are often outsourced to paid crowdworkers. Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP. To investigate whether and how well Citizen Science can be applied in this setting, we conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset. Our results show that this can yield high-quality annotations and at- tract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues. We summarize lessons learned in the form of guidelines and provide our code and data to aid future work on Citizen Science.
pdf
bib
abs
Contrastive Learning with Keyword-based Data Augmentation for Code Search and Code Question Answering
Shinwoo Park
|
Youngwook Kim
|
Yo-Sub Han
The semantic code search is to find code snippets from the collection of candidate code snippets with respect to a user query that describes functionality. Recent work on code search proposes data augmentation of queries for contrastive learning. This data augmentation approach modifies random words in queries. When a user web query for searching code snippet is too brief, the important word that represents the search intent of the query could be undesirably modified. A code snippet has informative components such as function name and documentation that describe its functionality. We propose to utilize these code components to identify important words and preserve them in the data augmentation step. We present KeyDAC (Keyword-based Data Augmentation for Contrastive learning) that identifies important words for code search from queries and code components based on term matching. KeyDAC augments query-code pairs while preserving keywords, and then leverages generated training instances for contrastive learning. We use KeyDAC to fine-tune various pre-trained language models and evaluate the performance of code search and code question answering via CoSQA and WebQueryTest. The experimental results confirm that KeyDAC substantially outperforms the current state-of-the-art performance, and achieves the new state-of-the-arts for both tasks.
pdf
bib
abs
Large Scale Multi-Lingual Multi-Modal Summarization Dataset
Yash Verma
|
Anubhav Jangra
|
Raghvendra Verma
|
Sriparna Saha
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in the field of information retrieval and natural language processing; however, improvements in multi-modal techniques and their performance evaluation require large-scale multi-modal data which offers sufficient diversity. Multi-lingual modeling for a variety of tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present the current largest multi-lingual multi-modal summarization dataset (M3LS), and it consists of over a million instances of document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by British Broadcasting Corporation(BBC) over a decade and spans 20 languages, targeting diversity across five language roots, it is also the largest summarization dataset for 13 languages and consists of cross-lingual summarization data for 2 languages. We formally define the multi-lingual multi-modal summarization task utilizing our dataset and report baseline scores from various state-of-the-art summarization techniques in a multi-lingual setting. We also compare it with many similar datasets to analyze the uniqueness and difficulty of M3LS. The dataset and code used in this work are made available at “
https://github.com/anubhav-jangra/M3LS”.
pdf
bib
abs
External Knowledge Acquisition for End-to-End Document-Oriented Dialog Systems
Tuan M. Lai
|
Giuseppe Castellucci
|
Saar Kuzi
|
Heng Ji
|
Oleg Rokhlenko
End-to-end neural models for conversational AI often assume that a response can be generated by considering only the knowledge acquired by the model during training. Document-oriented conversational models make a similar assumption by conditioning the input on the document and assuming that any other knowledge is captured in the model’s weights. However, a conversation may refer to external knowledge sources. In this work, we present EKo-Doc, an architecture for document-oriented conversations with access to external knowledge: we assume that a conversation is centered around a topic document and that external knowledge is needed to produce responses. EKo-Doc includes a dense passage retriever, a re-ranker, and a response generation model. We train the model end-to-end by using silver labels for the retrieval and re-ranking components that we automatically acquire from the attention signals of the response generation model. We demonstrate with automatic and human evaluations that incorporating external knowledge improves response generation in document-oriented conversations. Our architecture achieves new state-of-the-art results on the Wizard of Wikipedia dataset, outperforming a competitive baseline by 10.3% in Recall@1 and 7.4% in ROUGE-L.
pdf
bib
abs
In-Depth Look at Word Filling Societal Bias Measures
Matúš Pikuliak
|
Ivana Beňová
|
Viktor Bachratý
Many measures of societal bias in language models have been proposed in recent years. A popular approach is to use a set of word filling prompts to evaluate the behavior of the language models. In this work, we analyze the validity of two such measures – StereoSet and CrowS-Pairs. We show that these measures produce unexpected and illogical results when appropriate control group samples are constructed. Based on this, we believe that they are problematic and using them in the future should be reconsidered. We propose a way forward with an improved testing protocol. Finally, we also introduce a new gender bias dataset for Slovak.
pdf
bib
abs
Retrieval-augmented Image Captioning
Rita Ramos
|
Desmond Elliott
|
Bruno Martins
Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
pdf
bib
abs
Automatic Evaluation and Analysis of Idioms in Neural Machine Translation
Christos Baziotis
|
Prashant Mathur
|
Eva Hasler
A major open problem in neural machine translation (NMT) is the translation of idiomatic expressions, such as “under the weather”. The meaning of these expressions is not composed by the meaning of their constituent words, and NMT models tend to translate them literally (i.e., word-by-word), which leads to confusing and nonsensical translations. Research on idioms in NMT is limited and obstructed by the absence of automatic methods for quantifying these errors. In this work, first, we propose a novel metric for automatically measuring the frequency of literal translation errors without human involvement. Equipped with this metric, we present controlled translation experiments with models trained in different conditions (with/without the test-set idioms) and across a wide range of (global and targeted) metrics and test sets. We explore the role of monolingual pretraining and find that it yields substantial targeted improvements, even without observing any translation examples of the test-set idioms. In our analysis, we probe the role of idiom context. We find that the randomly initialized models are more local or “myopic” as they are relatively unaffected by variations of the idiom context, unlike the pretrained ones.
pdf
bib
abs
Representation biases in sentence transformers
Dmitry Nikolaev
|
Sebastian Padó
Variants of the BERT architecture specialised for producing full-sentence representations often achieve better performance on downstream tasks than sentence embeddings extracted from vanilla BERT. However, there is still little understanding of what properties of inputs determine the properties of such representations. In this study, we construct several sets of sentences with pre-defined lexical and syntactic structures and show that SOTA sentence transformers have a strong nominal-participant-set bias: cosine similarities between pairs of sentences are more strongly determined by the overlap in the set of their noun participants than by having the same predicates, lengthy nominal modifiers, or adjuncts. At the same time, the precise syntactic-thematic functions of the participants are largely irrelevant.
pdf
bib
abs
AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature
Melissa Roemmele
|
Kyle Shaffer
|
Katrina Olsen
|
Yiyi Wang
|
Steve DeNeefe
Creating an abridged version of a text involves shortening it while maintaining its linguistic qualities. In this paper, we examine this task from an NLP perspective for the first time. We present a new resource, AbLit, which is derived from abridged versions of English literature books. The dataset captures passage-level alignments between the original and abridged texts. We characterize the linguistic relations of these alignments, and create automated models to predict these relations as well as to generate abridgements for new texts. Our findings establish abridgement as a challenging task, motivating future resources and research. The dataset is available at github.com/roemmele/AbLit.
pdf
bib
abs
Self-training Reduces Flicker in Retranslation-based Simultaneous Translation
Sukanta Sen
|
Rico Sennrich
|
Biao Zhang
|
Barry Haddow
In simultaneous translation, the retranslation approach has the advantage of requiring no modifications to the inference engine. However, in order to reduce the undesirable flicker in the output, previous work has resorted to increasing the latency through masking, and introducing specialised inference, thus losing the simplicity of the approach. In this work, we show that self-training improves the flicker-latency tradeoff, while maintaining similar translation quality to the original. Our analysis indicates that self-training reduces flicker by controlling monotonicity. Furthermore, self-training can be combined with biased beam search to further improve the flicker-latency tradeoff.
pdf
bib
abs
Social Commonsense for Explanation and Cultural Bias Discovery
Lisa Bauer
|
Hanna Tischer
|
Mohit Bansal
Social commonsense contains many human biases due to social and cultural influence (Sap et al., 2020; Emelin et al., 2020). We focus on identifying cultural biases in data, specifically causal assumptions and commonsense implications, that strongly influence model decisions for a variety of tasks designed for social impact. This enables us to examine data for bias by making explicit the causal (if-then, inferential) relations in social commonsense knowledge used for decision making, furthering interpretable commonsense reasoning from a dataset perspective. We apply our methods on 2 social tasks: emotion detection and perceived value detection. We identify influential social commonsense knowledge to explain model behavior in the following ways. First, we augment large-scale language models with social knowledge and show improvements for the tasks, indicating the implicit assumptions a model requires to be successful on each dataset. Second, we identify influential events in the datasets by using social knowledge to cluster data and demonstrate the influence that these events have on model behavior via leave-K-out experiments. This allows us to gain a dataset-level understanding of the events and causal commonsense relationships that strongly influence predictions. We then analyze these relationships to detect influential cultural bias in each dataset. Finally, we use our influential event identification for detecting mislabeled examples and improve training and performance through their removal. We support our findings with manual analysis.
pdf
bib
abs
Counter-GAP: Counterfactual Bias Evaluation through Gendered Ambiguous Pronouns
Zhongbin Xie
|
Vid Kocijan
|
Thomas Lukasiewicz
|
Oana-Maria Camburu
Bias-measuring datasets play a critical role in detecting biased behavior of language models and in evaluating progress of bias mitigation methods. In this work, we focus on evaluating gender bias through coreference resolution, where previous datasets are either hand-crafted or fail to reliably measure an explicitly defined bias. To overcome these shortcomings, we propose a novel method to collect diverse, natural, and minimally distant text pairs via counterfactual generation, and construct Counter-GAP, an annotated dataset consisting of 4008 instances grouped into 1002 quadruples. We further identify a bias cancellation problem in previous group-level metrics on Counter-GAP, and propose to use the difference between inconsistency across genders and within genders to measure bias at a quadruple level. Our results show that four pre-trained language models are significantly more inconsistent across different gender groups than within each group, and that a name-based counterfactual data augmentation method is more effective to mitigate such bias than an anonymization-based method.
pdf
bib
abs
The NLP Task Effectiveness of Long-Range Transformers
Guanghui Qin
|
Yukun Feng
|
Benjamin Van Durme
Transformer models cannot easily scale to long sequences due to their O(Nˆ2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have theoretically greater efficiency, their effectiveness on real NLP tasks has not been well studied. We benchmark 7 variants of Transformer models on 5 difficult NLP tasks and 7 datasets. We design experiments to isolate the effect of pretraining and hyperparameter settings, to focus on their capacity for long-range attention. Moreover, we present various methods to investigate attention behaviors to illuminate model details beyond metric scores. We find that the modified attention in long-range transformers has advantages on content selection and query-guided decoding, but they come with previously unrecognized drawbacks such as insufficient attention to distant tokens and accumulated approximation error.
pdf
bib
abs
Creation and evaluation of timelines for longitudinal user posts
Anthony Hills
|
Adam Tsakalidis
|
Federico Nanni
|
Ioannis Zachos
|
Maria Liakata
There is increasing interest to work with user generated content in social media, especially textual posts over time. Currently there is no consistent way of segmenting user posts into timelines in a meaningful way that improves the quality and cost of manual annotation. Here we propose a set of methods for segmenting longitudinal user posts into timelines likely to contain interesting moments of change in a user’s behaviour, based on their online posting activity. We also propose a novel framework for evaluating timelines and show its applicability in the context of two different social media datasets. Finally, we present a discussion of the linguistic content of highly ranked timelines.
pdf
bib
abs
Semi-supervised New Event Type Induction and Description via Contrastive Loss-Enforced Batch Attention
Carl Edwards
|
Heng Ji
Most event extraction methods have traditionally relied on an annotated set of event types. However, creating event ontologies and annotating supervised training data are expensive and time-consuming. Previous work has proposed semi-supervised approaches which leverage seen (annotated) types to learn how to automatically discover new event types. State-of-the-art methods, both semi-supervised or fully unsupervised, use a form of reconstruction loss on specific tokens in a context. In contrast, we present a novel approach to semi-supervised new event type induction using a masked contrastive loss, which learns similarities between event mentions by enforcing an attention mechanism over the data minibatch. We further disentangle the discovered clusters by approximating the underlying manifolds in the data, which allows us to achieve an adjusted rand index score of 48.85%. Building on these clustering results, we extend our approach to two new tasks: predicting the type name of the discovered clusters and linking them to FrameNet frames.
pdf
bib
abs
Multilingual Content Moderation: A Case Study on Reddit
Meng Ye
|
Karan Sikka
|
Katherine Atwell
|
Sabit Hassan
|
Ajay Divakaran
|
Malihe Alikhani
Content moderation is the process of flagging content based on pre-defined platform rules. There has been a growing need for AI moderators to safeguard users as well as protect the mental health of human moderators from traumatic content. While prior works have focused on identifying hateful/offensive language, they are not adequate for meeting the challenges of content moderation since 1) moderation decisions are based on violation of rules, which subsumes detection of offensive speech, and 2) such rules often differ across communities which entails an adaptive solution. We propose to study the challenges of content moderation by introducing a multilingual dataset of 1.8 Million Reddit comments spanning 56 subreddits in English, German, Spanish and French1. We perform extensive experimental analysis to highlight the underlying challenges and suggest related research problems such as cross-lingual transfer, learning under label noise (human biases), transfer of moderation models, and predicting the violated rule. Our dataset and analysis can help better prepare for the challenges and opportunities of auto moderation.
pdf
bib
abs
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Archiki Prasad
|
Peter Hase
|
Xiang Zhou
|
Mohit Bansal
Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting. Recent work has aimed to improve such prompts via manual rewriting or gradient-based tuning. However, manual rewriting is time-consuming and requires subjective interpretation, while gradient-based tuning can be extremely computationally demanding for large models and may not be feasible for API-based models. In this work, we introduce Gradient-free Instructional Prompt Search (GrIPS), a gradient-free, edit-based search approach for improving task instructions for large language models. GrIPS takes in instructions designed for humans and automatically returns an improved, edited prompt, while allowing for API-based tuning. With InstructGPT models, GrIPS improves the average task performance by up to 4.30 percentage points on eight classification tasks from the Natural Instructions dataset (with similar improvements for OPT, BLOOM, and FLAN-T5). We see improvements for both instruction-only prompts and instruction + k-shot examples prompts. Notably, GrIPS outperforms manual rewriting and purely example-based prompts while controlling for the available compute and data budget. Further, performance of GrIPS is comparable to select gradient-based tuning approaches. Qualitatively, we show our edits can simplify instructions and at times make them incoherent but nonetheless improve accuracy.
pdf
bib
abs
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei Zhao
|
Michael Strube
|
Steffen Eger
Recently, there has been a growing interest in designing text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics are weak in recognizing coherence, and thus are not reliable in a way to spot the discourse-level improvements of those text generation systems. In this work, we introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human rated coherence than early discourse metrics, invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operated at system level—which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we provide justifications to the importance of discourse coherence for evaluation metrics, and explain the superiority of one variant over another. Our code is available at
https://github.com/AIPHES/DiscoScore.
pdf
bib
abs
Know your audience: specializing grounded language models with listener subtraction
Aaditya K Singh
|
David Ding
|
Andrew Saxe
|
Felix Hill
|
Andrew Lampinen
Effective communication requires adapting to the idiosyncrasies of each communicative context—such as the common ground shared with each partner. Humans demonstrate this ability to specialize to their audience in many contexts, such as the popular game Dixit. We take inspiration from Dixit to formulate a multi-agent image reference game where a (trained) speaker model is rewarded for describing a target image such that one (pretrained) listener model can correctly identify it among distractors, but another listener cannot. To adapt, the speaker must exploit differences in the knowledge it shares with the different listeners. We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization from rewards only, without direct supervision. Through controlled experiments, we show that training a speaker with two listeners that perceive differently, using our method, allows the speaker to adapt to the idiosyncracies of the listeners. Furthermore, we show zero-shot transfer of the specialization to real-world data. Our experiments demonstrate a method for specializing grounded language models without direct supervision and highlight the interesting research challenges posed by complex multi-agent communication.
pdf
bib
abs
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Abteen Ebrahimi
|
Arya D. McCarthy
|
Arturo Oncevay
|
John E. Ortega
|
Luis Chiruzzo
|
Gustavo Giménez-Lugo
|
Rolando Coto-Solano
|
Katharina Kann
Large multilingual models have inspired a new class of word alignment methods, which work well for the model’s pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
uppdf
bib
Findings of the Association for Computational Linguistics: EACL 2023
Andreas Vlachos
|
Isabelle Augenstein
pdf
bib
abs
Using Punctuation as an Adversarial Attack on Deep Learning-Based NLP Systems: An Empirical Study
Brian Formento
|
Chuan Sheng Foo
|
Luu Anh Tuan
|
See Kiong Ng
This work empirically investigates punctuation insertions as adversarial attacks on NLP systems. Data from experiments on three tasks, five datasets, and six models with four attacks show that punctuation insertions, when limited to a few symbols (apostrophes and hyphens), are a superior attack vector compared to character insertions due to 1) a lower after-attack accuracy (Aaft-atk) than alphabetical character insertions; 2) higher semantic similarity between the resulting and original texts; and 3) a resulting text that is easier and faster to read as assessed with the Test of Word Reading Efficiency (TOWRE)). The tests also indicate that 4) grammar checking does not mitigate punctuation insertions and 5) punctuation insertions outperform word-level attacks in settings with a limited number of word synonyms and queries to the victim’s model. Our findings indicate that inserting a few punctuation types that result in easy-to-read samples is a general attack mechanism. In light of this threat, we assess the impact of punctuation insertions, potential mitigations, the mitigation’s tradeoffs, punctuation insertion’s worst-case scenarios and summarize our findings in a qualitative casual map, so that developers can design safer, more secure systems.
pdf
bib
abs
Self-Supervised Unimodal Label Generation Strategy Using Recalibrated Modality Representations for Multimodal Sentiment Analysis
Yewon Hwang
|
Jong-Hwan Kim
While multimodal sentiment analysis (MSA) has gained much attention over the last few years, the main focus of most work on MSA has been limited to constructing multimodal representations that capture interactions between different modalities in a single task. This was largely due to a lack of unimodal annotations in MSA benchmark datasets. However, training a model using only multimodal representations can lead to suboptimal performance due to insufficient learning of each uni-modal representation. In this work, to fully optimize learning representations from multimodal data, we propose SUGRM which jointly trains multimodal and unimodal tasks using recalibrated features. The features are recalibrated such that the model learns to weight the features differently based on the features of other modalities. Further, to leverage unimodal tasks, we auto-generate unimodal annotations via a unimodal label generation module (ULGM). The experiment results on two benchmark datasets demonstrate the efficacy of our framework.
pdf
bib
abs
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks
Pedro Rodriguez
|
Mahmoud Azab
|
Becka Silvert
|
Renato Sanchez
|
Linzy Labson
|
Hardik Shah
|
Seungwhan Moon
Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: since captions are marked as relevant only to their original video, many alternate videos also match the caption, which introduces false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points—a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we make recommendations for future text-to-video retrieval benchmarks.
pdf
bib
abs
Improving Numeracy by Input Reframing and Quantitative Pre-Finetuning Task
Chung-Chi Chen
|
Hiroya Takamura
|
Ichiro Kobayashi
|
Yusuke Miyao
Numbers have unique characteristics to words. Teaching models to understand numbers in text is an open-ended research question. Instead of discussing the required calculation skills, this paper focuses on a more fundamental topic: understanding numerals. We point out that innumeracy—the inability to handle basic numeral concepts—exists in most pretrained language models (LMs), and we propose a method to solve this issue by exploring the notation of numbers. Further, we discuss whether changing notation and pre-finetuning along with the comparing-number task can improve performance in three benchmark datasets containing quantitative-related tasks. The results of this study indicate that input reframing and the proposed pre-finetuning task is useful for RoBERTa.
pdf
bib
abs
Visualize Before You Write: Imagination-Guided Open-Ended Text Generation
Wanrong Zhu
|
An Yan
|
Yujie Lu
|
Wenda Xu
|
Xin Wang
|
Miguel Eckstein
|
William Yang Wang
Recent advances in text-to-image synthesis make it possible to visualize machine imaginations for a given context. On the other hand, when generating text, human writers are gifted at creative visualization, which enhances their writings by forming imaginations as blueprints before putting down the stories in words. Inspired by such a cognitive process, we ask the natural question of whether we can endow machines with the same ability to utilize visual information and construct a general picture of the context to guide text generation. In this work, we propose iNLG that uses machine-generated images to guide language models (LM) in open-ended text generation. The experiments and analyses demonstrate the effectiveness of iNLG on open-ended text generation tasks, including text completion, story generation, and concept-to-text generation in both few-shot and full-data scenarios. Both automatic metrics and human evaluations verify that the text snippets generated by our iNLG are coherent and informative while displaying minor degeneration.
pdf
bib
abs
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation
Wanrong Zhu
|
Xin Wang
|
An Yan
|
Miguel Eckstein
|
William Yang Wang
Automatic evaluations for natural language generation (NLG) conventionally rely on token-level or embedding-level comparisons with text references. This differs from human language processing, for which visual imagination often improves comprehension. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of StableDiffusion, a state-of-the-art text-to-image generator, we automatically generate an image as the embodied imagination for the text snippet and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that adding machine-generated images with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation, and improves existing automatic metrics’ correlations with human similarity judgments in both reference-based and reference-free evaluation scenarios.
pdf
bib
abs
Entity-Aware Dual Co-Attention Network for Fake News Detection
Sin-han Yang
|
Chung-chi Chen
|
Hen-Hsen Huang
|
Hsin-Hsi Chen
Fake news and misinformation spread rapidly on the Internet. How to identify it and how to interpret the identification results have become important issues. In this paper, we propose a Dual Co-Attention Network (Dual-CAN) for fake news detection, which takes news content, social media replies, and external knowledge into consideration. Our experimental results support that the proposed Dual-CAN outperforms current representative models in two benchmark datasets. We further make in-depth discussions by comparing how models work in both datasets with empirical analysis of attention weights.
pdf
bib
abs
CIKQA: Learning Commonsense Inference with a Unified Knowledge-in-the-loop QA Paradigm
Hongming Zhang
|
Yintong Huo
|
Yanai Elazar
|
Yangqiu Song
|
Yoav Goldberg
|
Dan Roth
We propose a new commonsense reasoning benchmark to motivate commonsense reasoning progress from two perspectives: (1) Evaluating whether models can distinguish knowledge quality by predicting if the knowledge is enough to answer the question; (2) Evaluating whether models can develop commonsense inference capabilities that generalize across tasks. We first extract supporting knowledge for each question and ask humans to annotate whether the auto-extracted knowledge is enough to answer the question or not. After that, we convert different tasks into a unified question-answering format to evaluate the models’ generalization capabilities. We name the benchmark Commonsense Inference with Knowledge-in-the-loop Question Answering (\name). Experiments show that with our learning paradigm, models demonstrate encouraging generalization capabilities. At the same time, we also notice that distinguishing knowledge quality remains challenging for current commonsense reasoning models.
pdf
bib
abs
Data-Efficient Methods For Improving Hate Speech Detection
Sumegh Roychowdhury
|
Vikram Gupta
Scarcity of large-scale datasets, especially for resource-impoverished languages motivates exploration of data-efficient methods for hate speech detection. Hateful intents are expressed explicitly (use of cuss, swear, abusive words) and implicitly (indirect and contextual). In this work, we progress implicit and explicit hate speech detection using an input-level data augmentation technique, task reformulation using entailment and cross-learning across five languages. Our proposed data augmentation technique EasyMix, improves the performance across all english datasets by ~1% and across multilingual datasets by ~1-9%. We also observe substantial gains of ~2-8% by reformulating hate speech detection as entail problem. We further probe the contextual models and observe that higher layers encode implicit hate while lower layers focus on explicit hate, highlighting the importance of token-level understanding for explicit and context-level for implicit hate speech detection. Code and Dataset splits -
https://anonymous.4open.science/r/data_efficient_hatedetect/pdf
bib
abs
Learning the Effects of Physical Actions in a Multi-modal Environment
Gautier Dagan
|
Frank Keller
|
Alex Lascarides
Large Language Models (LLMs) handle physical commonsense information inadequately. As a result of being trained in a disembodied setting, LLMs often fail to predict an action’s outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model’s performance on novel actions and objects and find that combining modalities help models to generalize and learn physical commonsense reasoning better.
pdf
bib
abs
FVQA 2.0: Introducing Adversarial Samples into Fact-based Visual Question Answering
Weizhe Lin
|
Zhilin Wang
|
Bill Byrne
The widely used Fact-based Visual Question Answering (FVQA) dataset contains visually-grounded questions that require information retrieval using common sense knowledge graphs to answer. It has been observed that the original dataset is highly imbalanced and concentrated on a small portion of its associated knowledge graph. We introduce FVQA 2.0 which contains adversarial variants of test questions to address this imbalance. We show that systems trained with the original FVQA train sets can be vulnerable to adversarial samples and we demonstrate an augmentation scheme to reduce this vulnerability without human annotations.
pdf
bib
abs
Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective
Jongwoo Ko
|
Seungjoon Park
|
Minchan Jeong
|
Sukjin Hong
|
Euijai Ahn
|
Du-Seong Chang
|
Se-Young Yun
Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has been a de facto standard KD method with its performance efficacy in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD. Next, we present the simple observations to mitigate the overfitting of ILD: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on our two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Substantial experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at
https://github.com/jongwooko/CR-ILD.
pdf
bib
abs
Implicit Temporal Reasoning for Evidence-Based Fact-Checking
Liesbeth Allein
|
Marlon Saelens
|
Ruben Cartuyvels
|
Marie-Francine Moens
Leveraging contextual knowledge has become standard practice in automated claim verification, yet the impact of temporal reasoning has been largely overlooked. Our study demonstrates that time positively influences the claim verification process of evidence-based fact-checking. The temporal aspects and relations between claims and evidence are first established through grounding on shared timelines, which are constructed using publication dates and time expressions extracted from their text. Temporal information is then provided to RNN-based and Transformer-based classifiers before or after claim and evidence encoding. Our time-aware fact-checking models surpass base models by up to 9% Micro F1 (64.17%) and 15% Macro F1 (47.43%) on the MultiFC dataset. They also outperform prior methods that explicitly model temporal relations between evidence. Our findings show that the presence of temporal information and the manner in which timelines are constructed greatly influence how fact-checking models determine the relevance and supporting or refuting character of evidence documents.
pdf
bib
abs
Active PETs: Active Data Annotation Prioritisation for Few-Shot Claim Verification with Pattern Exploiting Training
Xia Zeng
|
Arkaitz Zubiaga
To mitigate the impact of the scarcity of labelled data on fact-checking systems, we focus on few-shot claim verification. Despite recent work on few-shot classification by proposing advanced language models, there is a dearth of research in data annotation prioritisation that improves the selection of the few shots to be labelled for optimal model performance. We propose Active PETs, a novel weighted approach that utilises an ensemble of Pattern Exploiting Training (PET) models based on various language models, to actively select unlabelled data as candidates for annotation. Using Active PETs for few-shot data selection shows consistent improvement over the baseline methods, on two technical fact-checking datasets and using six different pretrained language models. We show further improvement with Active PETs-o, which further integrates an oversampling strategy. Our approach enables effective selection of instances to be labelled where unlabelled data is abundant but resources for labelling are limited, leading to consistently improved few-shot claim verification performance. Our code is available.
pdf
bib
abs
Plan-then-Seam: Towards Efficient Table-to-Text Generation
Liang Li
|
Ruiying Geng
|
Chengyang Fang
|
Bing Li
|
Can Ma
|
Binhua Li
|
Yongbin Li
Table-to-text generation aims at automatically generating text to help people conveniently obtain salient information in tables. Recent works explicitly decompose the generation process into content planning and surface generation stages, employing two autoregressive networks for them respectively. However, they are computationally expensive due to the non-parallelizable nature of autoregressive decoding and the redundant parameters of two networks. In this paper, we propose the first totally non-autoregressive table-to-text model (Plan-then-Seam, PTS) that produces its outputs in parallel with one single network.PTS firstly writes and calibrates one plan of the content to be generated with a novel rethinking pointer predictor, and then takes the plan as the context for seaming to decode the description. These two steps share parameters and perform iteratively to capture token inter-dependency while keeping parallel decoding. Experiments on two public benchmarks show that PTS achieves 3.0 5.6 times speedup for inference time, reducing 50% parameters, while maintaining as least comparable performance against strong two-stage table-to-text competitors.
pdf
bib
abs
A corpus of metaphors as register markers
Markus Egg
|
Valia Kordoni
The paper presents our work on corpus annotationfor metaphor in German. Metaphors denoteentities that are similar to their literal referent,e.g., when *Licht* ‘light’ is used in the senseof ‘hope’. We are interested in the relation betweenmetaphor and register, hence, the corpusincludes material from different registers. We focussed on metaphors that can serve asregister markers and can also be reliably indentifiedfor annotation. Our results show hugedifferences between registers in metaphor usage,which we interpret in terms of specificproperties of the registers.
pdf
bib
abs
Translate First Reorder Later: Leveraging Monotonicity in Semantic Parsing
Francesco Cazzaro
|
Davide Locatelli
|
Ariadna Quattoni
|
Xavier Carreras
Prior work in semantic parsing has shown that conventional seq2seq models fail at compositional generalization tasks. This limitation led to a resurgence of methods that model alignments between sentences and their corresponding meaning representations, either implicitly through latent variables or explicitly by taking advantage of alignment annotations. We take the second direction and propose TPol, a two-step approach that first translates input sentences monotonically and then reorders them to obtain the correct output. This is achieved with a modular framework comprising a Translator and a Reorderer component. We test our approach on two popular semantic parsing datasets. Our experiments show that by means of the monotonic translations, TPol can learn reliable lexico-logical patterns from aligned data, significantly improving compositional generalization both over conventional seq2seq models, as well as over other approaches that exploit gold alignments.
pdf
bib
abs
PePe: Personalized Post-editing Model utilizing User-generated Post-edits
Jihyeon Lee
|
Taehee Kim
|
Yunwon Tae
|
Cheonbok Park
|
Jaegul Choo
Incorporating personal preference is crucial in advanced machine translation tasks. Despite the recent advancement of machine translation, it remains a demanding task to properly reflect personal style. In this paper, we introduce a personalized automatic post-editing framework to address this challenge, which effectively generates sentences considering distinct personal behaviors. To build this framework, we first collect post-editing data that connotes the user preference from a live machine translation system. Specifically, real-world users enter source sentences for translation and edit the machine-translated outputs according to the user’s preferred style. We then propose a model that combines a discriminator module and user-specific parameters on the APE framework. Experimental results show that the proposed method outperforms other baseline models on four different metrics (i.e., BLEU, TER, YiSi-1, and human evaluation).
pdf
bib
abs
Infusing Context and Knowledge Awareness in Multi-turn Dialog Understanding
Ting-Wei Wu
|
Biing-Hwang Juang
In multi-turn dialog understanding, semantic frames are constructed by detecting intents and slots within each user utterance. However, recent works lack the capability of modeling multi-turn dynamics within a dialog in natural language understanding (NLU), instead leaving them for updating dialog states only. Moreover, humans usually associate relevant background knowledge with the current dialog contexts to better illustrate slot semantics revealed from word connotations, where previous works have explored such possibility mostly in knowledge-grounded response generation. In this paper, we propose to amend the research gap by equipping a BERT-based NLU framework with knowledge and context awareness. We first encode dialog contexts with a unidirectional context-aware transformer encoder and select relevant inter-word knowledge with the current word and previous history based on a knowledge attention mechanism. Experimental results in two complicated multi-turn dialog datasets have demonstrated significant improvements of our proposed framework. Attention visualization also demonstrates how our modules leverage knowledge across the utterance.
pdf
bib
abs
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
Zhiruo Wang
|
Grace Cuenca
|
Shuyan Zhou
|
Frank F. Xu
|
Graham Neubig
While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for program developers who are not proficient in English. To mitigate this gap in technology development across languages, we propose a multilingual dataset, MCoNaLa, to benchmark code generation from natural language commands extending beyond English. Modeled off of the methodology from the English Code/Natural Language Challenge (CoNaLa) dataset, we annotated a total of 896 NL-Code pairs in three languages: Spanish, Japanese, and Russian. We present a systematic evaluation on MCoNaLa by testing state-of-the-art code generation systems. Although the difficulties vary across three languages, all systems lag significantly behind their English counterparts, revealing the challenges in adapting code generation to new languages.
pdf
bib
abs
Augmenting pre-trained language models with audio feature embedding for argumentation mining in political debates
Rafael Mestre
|
Stuart E. Middleton
|
Matt Ryan
|
Masood Gheasi
|
Timothy Norman
|
Jiatong Zhu
The integration of multimodality in natural language processing (NLP) tasks seeks to exploit the complementary information contained in two or more modalities, such as text, audio and video. This paper investigates the integration of often under-researched audio features with text, using the task of argumentation mining (AM) as a case study. We take a previously reported dataset and present an audio-enhanced version (the Multimodal USElecDeb60To16 dataset). We report the performance of two text models based on BERT and GloVe embeddings, one audio model (based on CNN and Bi-LSTM) and multimodal combinations, on a dataset of 28,850 utterances. The results show that multimodal models do not outperform text-based models when using the full dataset. However, we show that audio features add value in fully supervised scenarios with limited data. We find that when data is scarce (e.g. with 10% of the original dataset) multimodal models yield improved performance, whereas text models based on BERT considerably decrease performance. Finally, we conduct a study with artificially generated voices and an ablation study to investigate the importance of different audio features in the audio models.
pdf
bib
abs
Improving Retrieval Augmented Neural Machine Translation by Controlling Source and Fuzzy-Match Interactions
Cuong Hoang
|
Devendra Sachan
|
Prashant Mathur
|
Brian Thompson
|
Marcello Federico
We explore zero-shot adaptation, where a general-domain model has access to customer or domain specific parallel data at inference time, but not during training. We build on the idea of Retrieval Augmented Translation (RAT) where top-k in-domain fuzzy matches are found for the source sentence, and target-language translations of those fuzzy-matched sentences are provided to the translation model at inference time. We propose a novel architecture to control interactions between a source sentence and the top-k fuzzy target-language matches, and compare it to architectures from prior work. We conduct experiments in two language pairs (En-De and En-Fr) by training models on WMT data and testing them with five and seven multi-domain datasets, respectively. Our approach consistently outperforms the alternative architectures, improving BLEU across language pair, domain, and number k of fuzzy matches.
pdf
bib
abs
CALM-Bench: A Multi-task Benchmark for Evaluating Causality-Aware Language Models
Dhairya Dalal
|
Paul Buitelaar
|
Mihael Arcan
Causal reasoning is a critical component of human cognition and is required across a range of question-answering (QA) tasks (such as abductive reasoning, commonsense QA, and procedural reasoning). Research on causal QA has been underdefined, task-specific, and limited in complexity. Recent advances in foundation language models (such as BERT, ERNIE, and T5) have shown the efficacy of pre-trained models across diverse QA tasks. However, there is limited research exploring the causal reasoning capabilities of those language models and no standard evaluation benchmark. To unify causal QA research, we propose CALM-Bench, a multi-task benchmark for evaluating causality-aware language models (CALM). We present a standardized definition of causal QA tasks and show empirically that causal reasoning can be generalized and transferred across different QA tasks. Additionally, we share a strong multi-task baseline model which outperforms single-task fine-tuned models on the CALM-Bench tasks.
pdf
bib
abs
ezCoref: Towards Unifying Annotation Guidelines for Coreference Resolution
Ankita Gupta
|
Marzena Karpinska
|
Wenlong Zhao
|
Kalpesh Krishna
|
Jack Merullo
|
Luke Yeh
|
Mohit Iyyer
|
Brendan O’Connor
Large-scale, high-quality corpora are critical for advancing research in coreference resolution. However, existing datasets vary in their definition of coreferences and have been collected via complex and lengthy guidelines that are curated for linguistic experts. These concerns have sparked a growing interest among researchers to curate a unified set of guidelines suitable for annotators with various backgrounds. In this work, we develop a crowdsourcing-friendly coreference annotation methodology, ezCoref, consisting of an annotation tool and an interactive tutorial. We use ezCoref to re-annotate 240 passages from seven existing English coreference datasets (spanning fiction, news, and multiple other domains) while teaching annotators only cases that are treated similarly across these datasets. Surprisingly, we find that reasonable quality annotations were already achievable (90% agreement between the crowd and expert annotations) even without extensive training. On carefully analyzing the remaining disagreements, we identify the presence of linguistic cases that our annotators unanimously agree upon but lack unified treatments (e.g., generic pronouns, appositives) in existing datasets. We propose the research community should revisit these phenomena when curating future unified annotation guidelines.
pdf
bib
abs
PREME: Preference-based Meeting Exploration through an Interactive Questionnaire
Negar Arabzadeh
|
Ali Ahmadvand
|
Julia Kiseleva
|
Yang Liu
|
Ahmed Hassan Awadallah
|
Ming Zhong
|
Milad Shokouhi
The recent increase in the volume of online meetings necessitates automated tools for organizing the material, especially when an attendee has missed the discussion and needs assistance in quickly exploring it. In this work, we propose a novel end-to-end framework for generating interactive questionnaires for preference-based meeting exploration. As a result, users are supplied with a list of suggested questions reflecting their preferences. Since the task is new, we introduce an automatic evaluation strategy by measuring how much the generated questions via questionnaire are answerable to ensure factual correctness and covers the source meeting for the depth of possible exploration.
pdf
bib
abs
Sentence Identification with BOS and EOS Label Combinations
Takuma Udagawa
|
Hiroshi Kanayama
|
Issei Yoshida
The sentence is a fundamental unit in many NLP applications. Sentence segmentation is widely used as the first preprocessing task, where an input text is split into consecutive sentences considering the end of the sentence (EOS) as their boundaries. This task formulation relies on a strong assumption that the input text consists only of sentences, or what we call the sentential units (SUs). However, real-world texts often contain non-sentential units (NSUs) such as metadata, sentence fragments, nonlinguistic markers, etc. which are unreasonable or undesirable to be treated as a part of an SU. To tackle this issue, we formulate a novel task of sentence identification, where the goal is to identify SUs while excluding NSUs in a given text. To conduct sentence identification, we propose a simple yet effective method which combines the beginning of the sentence (BOS) and EOS labels to determine the most probable SUs and NSUs based on dynamic programming. To evaluate this task, we design an automatic, language-independent procedure to convert the Universal Dependencies corpora into sentence identification benchmarks. Finally, our experiments on the sentence identification task demonstrate that our proposed method generally outperforms sentence segmentation baselines which only utilize EOS labels.
pdf
bib
abs
Gauging the Gap Between Human and Machine Text Simplification Through Analytical Evaluation of Simplification Strategies and Errors
Daichi Yamaguchi
|
Rei Miyata
|
Sayuka Shimada
|
Satoshi Sato
This study presents an analytical evaluation of neural text simplification (TS) systems. Because recent TS models are trained in an end-to-end fashion, it is difficult to grasp their abilities to perform particular simplification operations. For the advancement of TS research and development, we should understand in detail what current TS systems can and cannot perform in comparison with human performance. To that end, we first developed an analytical evaluation framework consisting of fine-grained taxonomies of simplification strategies (at both the surface and content levels) and errors. Using this framework, we annotated TS instances produced by professional human editors and multiple neural TS systems and compared the results. Our analyses concretely and quantitatively revealed a wide gap between humans and systems, specifically indicating that systems tend to perform deletions and local substitutions while excessively omitting important information, and that the systems can hardly perform information addition operations. Based on our analyses, we also provide detailed directions to address these limitations.
pdf
bib
abs
Bridging the Gap between Pre-Training and Fine-Tuning for Commonsense Generation
Haoran Yang
|
Yan Wang
|
Piji Li
|
Wei Bi
|
Wai Lam
|
Chen Xu
Commonsense generation aims to generate a plausible sentence containing all given unordered concept words. Previous methods focusing on this task usually directly concatenate these words as the input of a pre-trained language model (PLM). However, in PLMs’ pre-training process, the inputs are often corrupted sentences with correct word order. This input distribution discrepancy between pre-training and fine-tuning makes the model difficult to fully utilize the knowledge of PLMs. In this paper, we propose a two-stage framework to alleviate this issue. Firstly, in pre-training stage, we design a new format of input to endow PLMs the ability to deal with masked sentences with incorrect word order. Secondly, during fine-tuning, we insert the special token [MASK] between two consecutive concept words to make the input distribution more similar to the input distribution in pre-training. We conduct extensive experiments and provide thorough analysis to demonstrate the effectiveness of our proposed method.
pdf
bib
abs
LED: A Dataset for Life Event Extraction from Dialogs
Yi-Pei Chen
|
An-Zi Yen
|
Hen-Hsen Huang
|
Hideki Nakayama
|
Hsin-Hsi Chen
Lifelogging has gained more attention due to its wide applications, such as personalized recommendations or memory assistance. The issues of collecting and extracting personal life events have emerged. People often share their life experiences with others through conversations. However, extracting life events from conversations is rarely explored. In this paper, we present Life Event Dialog, a dataset containing fine-grained life event annotations on conversational data. In addition, we initiate a novel Conversational Life Event Extraction task and differentiate the task from the public event extraction or the life event extraction from other sources like microblogs. We explore three information extraction (IE) frameworks to address the Conversational Life Event Extraction task: OpenIE, relation extraction, and event extraction. A comprehensive empirical analysis of the three baselines is established. The results suggest that the current event extraction model still struggles with extracting life events from human daily conversations. Our proposed Life Event Dialog dataset and in-depth analysis of IE frameworks will facilitate future research on life event extraction from conversations.
pdf
bib
abs
Reading and Reasoning over Chart Images for Evidence-based Automated Fact-Checking
Mubashara Akhtar
|
Oana Cocarascu
|
Elena Simperl
Evidence data for automated fact-checking (AFC) can be in multiple modalities such as text, tables, images, audio, or video. While there is increasing interest in using images for AFC, previous works mostly focus on detecting manipulated or fake images. We propose a novel task, chart-based fact-checking, and introduce ChartBERT as the first model for AFC against chart evidence. ChartBERT leverages textual, structural and visual information of charts to determine the veracity of textual claims. For evaluation, we create ChartFC, a new dataset of 15,886 charts. We systematically evaluate 75 different vision-language (VL) baselines and show that ChartBERT outperforms VL models, achieving 63.8% accuracy. Our results suggest that the task is complex yet feasible, with many challenges ahead.
pdf
bib
abs
Causal Reasoning of Entities and Events in Procedural Texts
Li Zhang
|
Hainiu Xu
|
Yue Yang
|
Shuyan Zhou
|
Weiqiu You
|
Manni Arora
|
Chris Callison-Burch
Entities and events are crucial to natural language reasoning and common in procedural texts. Existing work has focused either exclusively on entity state tracking (e.g., whether a pan is hot) or on event reasoning (e.g., whether one would burn themselves by touching the pan), while these two tasks are often causally related. We propose CREPE, the first benchmark on causal reasoning of event plausibility and entity states. We show that most language models, including GPT-3, perform close to chance at .35 F1, lagging far behind human at .87 F1. We boost model performance to .59 F1 by creatively representing events as programming languages while prompting language models pretrained on code. By injecting the causal relations between entities and events as intermediate reasoning steps in our representation, we further boost the performance to .67 F1. Our findings indicate not only the challenge that CREPE brings for language models, but also the efficacy of code-like prompting combined with chain-of-thought prompting for multihop event reasoning.
pdf
bib
abs
Few-Shot Structured Policy Learning for Multi-Domain and Multi-Task Dialogues
Thibault Cordier
|
Tanguy Urvoy
|
Fabrice Lefèvre
|
Lina M. Rojas Barahona
Reinforcement learning has been widely adopted to model dialogue managers in task-oriented dialogues. However, the user simulator provided by state-of-the-art dialogue frameworks are only rough approximations of human behaviour. The ability to learn from a small number of human interactions is hence crucial, especially on multi-domain and multi-task environments where the action space is large. We therefore propose to use structured policies to improve sample efficiency when learning on these kinds of environments. We also evaluate the impact of learning from human vs simulated experts. Among the different levels of structure that we tested, the graph neural networks (GNNs) show a remarkable superiority by reaching a success rate above 80% with only 50 dialogues when learning from simulated experts. They also show superiority when learning from human experts, although a performance drop was observed. We therefore suggest to concentrate future research efforts on bridging the gap between human data, simulators and automatic evaluators in dialogue frameworks.
pdf
bib
abs
Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?
Jielin Qiu
|
William Han
|
Jiacheng Zhu
|
Mengdi Xu
|
Michael Rosenberg
|
Emerson Liu
|
Douglas Weber
|
Ding Zhao
Recent advancements in Large Language Models (LLMs) have drawn increasing attention since the learned embeddings pretrained on large-scale datasets have shown powerful ability in various downstream applications. However, whether the learned knowledge by LLMs can be transferred to clinical cardiology remains unknown. In this work, we aim to bridge this gap by transferring the knowledge of LLMs to clinical Electrocardiography (ECG). We propose an approach for cardiovascular disease diagnosis and automatic ECG diagnosis report generation. We also introduce an additional loss function by Optimal Transport (OT) to align the distribution between ECG and language embedding. The learned embeddings are evaluated on two downstream tasks: (1) automatic ECG diagnosis report generation, and (2) zero-shot cardiovascular disease detection. Our approach is able to generate high-quality cardiac diagnosis reports and also achieves competitive zero-shot classification performance even compared with supervised baselines, which proves the feasibility of transferring knowledge from LLMs to the cardiac domain.
pdf
bib
abs
Practical Takes on Federated Learning with Pretrained Language Models
Ankur Agarwal
|
Mehdi Rezagholizadeh
|
Prasanna Parthasarathi
Real-world applications of language models entail data privacy constraints when learning from diverse data domains. Federated learning with pretrained language models for language tasks has been gaining attention lately but there are definite confounders that warrants a careful study. Specifically, understanding the limits of federated NLP applications through varying the effects of different aspects (such as data heterogeneity, the trade-off between training time and performance, the effect of different data, and client distributions and sensitivity of the shared model to learning local distributions) is necessary to evaluate whether language models indeed learn to generalize by adapting to the different domains. Towards that, we elaborate different hypotheses over the components in federated NLP architectures and study them in detail with relevant experiments over three tasks: Stanford Sentiment Treebank-2, OntoNotes-5.0 and GigaWord. The experiments with different Transformer inductive biases on the variety of tasks provide a glimpse at the understanding of federated learning at NLP tasks. Specifically, the analysis suggests that regularization due to the ensembling effect may be masquerading as domain adaptation of federated learning in NLP with pre-trained language models.
pdf
bib
abs
Paper Bullets: Modeling Propaganda with the Help of Metaphor
Daniel Baleato Rodríguez
|
Verna Dankers
|
Preslav Nakov
|
Ekaterina Shutova
Propaganda aims to persuade an audience by appealing to emotions and using faulty reasoning, with the purpose of promoting a particular point of view. Similarly, metaphor modifies the semantic frame, thus eliciting a response that can be used to tune up or down the emotional volume of the message. Given the close relationship between them, we hypothesize that, when modeling them computationally, it can be beneficial to do so jointly. In particular, we perform multi-task learning with propaganda identification as the main task and metaphor detection as an auxiliary task. To the best of our knowledge, this is the first work that models metaphor and propaganda together. We experiment with two datasets for identifying propaganda techniques in news articles and in memes shared on social media. We find that leveraging metaphor improves model performance, particularly for the two most common propaganda techniques: loaded language and name-calling.
pdf
bib
abs
Lexical Semantics with Large Language Models: A Case Study of English “break”
Erika Petersen
|
Christopher Potts
Large neural language models (LLMs) can be powerful tools for research in lexical semantics. We illustrate this potential using the English verb “break”, which has numerous senses and appears in a wide range of syntactic frames. We show that LLMs capture known sense distinctions and can be used to identify informative new sense combinations for further analysis. More generally, we argue that LLMs are aligned with lexical semantic theories in providing high-dimensional, contextually modulated representations, but LLMs’ lack of discrete features and dependence on usage-based data offer a genuinely new perspective on traditional problems in lexical semantics.
pdf
bib
abs
SWING: Balancing Coverage and Faithfulness for Dialogue Summarization
Kung-Hsiang Huang
|
Siffi Singh
|
Xiaofei Ma
|
Wei Xiao
|
Feng Nan
|
Nicholas Dingwall
|
William Yang Wang
|
Kathleen McKeown
Missing information is a common issue of dialogue summarization where some information in the reference summaries is not covered in the generated summaries. To address this issue, we propose to utilize natural language inference (NLI) models to improve coverage while avoiding introducing factual inconsistencies. Specifically, we use NLI to compute fine-grained training signals to encourage the model to generate content in the reference summaries that have not been covered, as well as to distinguish between factually consistent and inconsistent generated sentences. Experiments on the DialogSum and SAMSum datasets confirm the effectiveness of the proposed approach in balancing coverage and faithfulness, validated with automatic metrics and human evaluations. Additionally, we compute the correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency to provide insight into the most suitable metric for evaluating dialogue summaries.
pdf
bib
abs
Language-Aware Multilingual Machine Translation with Self-Supervised Learning
Haoran Xu
|
Jean Maillard
|
Vedanuj Goswami
Multilingual machine translation (MMT) benefits from cross-lingual transfer but is a challenging multitask optimization problem. This is partly because there is no clear framework to systematically learn language-specific parameters. Self-supervised learning (SSL) approaches that leverage large quantities of monolingual data (where parallel data is unavailable) have shown promise by improving translation performance as complementary tasks to the MMT task. However, jointly optimizing SSL and MMT tasks is even more challenging. In this work, we first investigate how to utilize **intra-distillation** to learn more *language-specific* parameters and then show the importance of these language-specific parameters. Next, we propose a novel but simple SSL task, **concurrent denoising**, that co-trains with the MMT task by concurrently denoising monolingual data on both the encoder and decoder. Finally, we apply **intra-distillation** to this co-training approach. Combining these two approaches significantly improves MMT performance, outperforming three state-of-the-art SSL methods by a large margin, e.g., 11.3% and 3.7% improvement on an 8-language and a 15-language benchmark compared with MASS, respectively.
pdf
bib
abs
Cloze Quality Estimation for Language Assessment
Zizheng Zhang
|
Masato Mita
|
Mamoru Komachi
Cloze tests play an essential role in language assessment and help language learners improve their skills. In this paper, we propose a novel task called Cloze Quality Estimation (CQE) — a zero-shot task of evaluating whether a cloze test is of sufficient “high-quality” for language assessment based on two important factors: reliability and validity. We have taken the first step by creating a new dataset named CELA for the CQE task, which includes English cloze tests and corresponding evaluations about their quality annotated by native English speakers, which includes 2,597 and 1,730 instances in aspects of reliability and validity, respectively. We have tested baseline evaluation methods on the dataset, showing that our method could contribute to the CQE task, but the task is still challenging.
pdf
bib
abs
Bag of Tricks for In-Distribution Calibration of Pretrained Transformers
Jaeyoung Kim
|
Dongbin Na
|
Sungchul Choi
|
Sungbin Lim
While pre-trained language models (PLMs) have become a de-facto standard promoting the accuracy of text classification tasks, recent studies find that PLMs often predict over-confidently. Although calibration methods have been proposed, such as ensemble learning and data augmentation, most of the methods have been verified in computer vision benchmarks rather than in PLM-based text classification tasks. In this paper, we present an empirical study on confidence calibration for PLMs, addressing three categories, including confidence penalty losses, data augmentations, and ensemble methods. We find that the ensemble model overfitted to the training set shows sub-par calibration performance and also observe that PLMs trained with confidence penalty loss have a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. The CALL complements shortcomings that may occur when utilizing a calibration method individually and boosts both classification and calibration accuracy. Design choices in CALL’s training procedures are extensively studied, and we provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.
pdf
bib
abs
Fine-Tuning Deteriorates General Textual Out-of-Distribution Detection by Distorting Task-Agnostic Features
Sishuo Chen
|
Wenkai Yang
|
Xiaohan Bi
|
Xu Sun
Detecting out-of-distribution (OOD) inputs is crucial for the safe deployment of natural language processing (NLP) models. Though existing methods, especially those based on the statistics in the feature space of fine-tuned pre-trained language models (PLMs), are claimed to be effective, their effectiveness on different types of distribution shifts remains underexplored. In this work, we take the first step to comprehensively evaluate the mainstream textual OOD detection methods for detecting semantic and non-semantic shifts. We find that: (1) no existing method behaves well in both settings; (2) fine-tuning PLMs on in-distribution data benefits detecting semantic shifts but severely deteriorates detecting non-semantic shifts, which can be attributed to the distortion of task-agnostic features. To alleviate the issue, we present a simple yet effective general OOD score named GNOME that integrates the confidence scores derived from the task-agnostic and task-specific representations. Experiments show that GNOME works well in both semantic and non-semantic shift scenarios, and further brings significant improvement on two cross-task benchmarks where both kinds of shifts simultaneously take place. Our code is available at
https://github.com/lancopku/GNOME.
pdf
bib
abs
A Question of Style: A Dataset for Analyzing Formality on Different Levels
Elisabeth Eder
|
Ulrike Krieg-Holz
|
Michael Wiegand
Accounting for different degrees of formality is crucial for producing contextually appropriate language. To assist NLP applications concerned with this problem and formality analysis in general, we present the first dataset of sentences from a wide range of genres assessed on a continuous informal-formal scale via comparative judgments. It is the first corpus with a comprehensive perspective on German sentence-level formality overall. We compare machine learning models for formality scoring, a task we treat as a regression problem, on our dataset. Finally, we investigate the relation between sentence- and document-level formality and evaluate leveraging sentence-based annotations for assessing formality on documents.
pdf
bib
abs
Task-specific Compression for Multi-task Language Models using Attribution-based Pruning
Nakyeong Yang
|
Yunah Jang
|
Hwanhee Lee
|
Seohyeong Jeong
|
Kyomin Jung
Multi-task language models show outstanding performance for various natural language understanding tasks with only a single model. However, these language models inevitably utilize an unnecessarily large number of model parameters, even when used only for a specific task. In this paper, we propose a novel training-free compression method for multi-task language models using pruning method. Specifically, we use an attribution method to determine which neurons are essential for performing a specific task. We task-specifically prune unimportant neurons and leave only task-specific parameters. Furthermore, we extend our method to be applicable in both low-resource and unsupervised settings. Since our compression method is training-free, it uses little computing resources and does not update the pre-trained parameters of language models, reducing storage space usage. Experimental results on the six widely-used datasets show that our proposed pruning method significantly outperforms baseline pruning methods. In addition, we demonstrate that our method preserves performance even in an unseen domain setting.
pdf
bib
abs
Zero-shot Transfer of Article-aware Legal Outcome Classification for European Court of Human Rights Cases
Santosh T.y.s.s
|
Oana Ichim
|
Matthias Grabmair
In this paper, we cast Legal Judgment Prediction on European Court of Human Rights cases into an article-aware classification task, where the case outcome is classified from a combined input of case facts and convention articles. This configuration facilitates the model learning some legal reasoning ability in mapping article text to specific case fact text. It also provides an opportunity to evaluate the model’s ability to generalize to zero-shot settings when asked to classify the case outcome with respect to articles not seen during training. We devise zero-shot experiments and apply domain adaptation methods based on domain discrimination and Wasserstein distance. Our results demonstrate that the article-aware architecture outperforms straightforward fact classification. We also find that domain adaptation methods improve zero-shot transfer performance, with article relatedness and encoder pre-training influencing the effect.
pdf
bib
abs
Abstractive Document Summarization with Summary-length Prediction
Jingun Kwon
|
Hidetaka Kamigaito
|
Manabu Okumura
Recently, we can obtain a practical abstractive document summarization model by fine-tuning a pre-trained language model (PLM). Since the pre-training for PLMs does not consider summarization-specific information such as the target summary length, there is a gap between the pre-training and fine-tuning for PLMs in summarization tasks. To fill the gap, we propose a method for enabling the model to understand the summarization-specific information by predicting the summary length in the encoder and generating a summary of the predicted length in the decoder in fine-tuning. Experimental results on the WikiHow, NYT, and CNN/DM datasets showed that our methods improve ROUGE scores from BART by generating summaries of appropriate lengths. Further, we observed about 3.0, 1,5, and 3.1 point improvements for ROUGE-1, -2, and -L, respectively, from GSum on the WikiHow dataset. Human evaluation results also showed that our methods improve the informativeness and conciseness of summaries.
pdf
bib
Hierarchical Label Generation for Text Classification
Jingun Kwon
|
Hidetaka Kamigaito
|
Young-In Song
|
Manabu Okumura
pdf
bib
abs
Active Learning for Multilingual Semantic Parser
Zhuang Li
|
Gholamreza Haffari
Current multilingual semantic parsing (MSP) datasets are almost all collected by translating the utterances in the existing datasets from the resource-rich language to the target language. However, manual translation is costly. To reduce the translation effort, this paper proposes the first active learning procedure for MSP (AL-MSP). AL-MSP selects only a subset from the existing datasets to be translated. We also propose a novel selection method that prioritizes the examples diversifying the logical form structures with more lexical choices, and a novel hyperparameter tuning method that needs no extra annotation cost. Our experiments show that AL-MSP significantly reduces translation costs with ideal selection methods. Our selection method with proper hyperparameters yields better parsing performance than the other baselines on two multilingual datasets.
pdf
bib
abs
Joint Word and Morpheme Segmentation with Bayesian Non-Parametric Models
Shu Okabe
|
François Yvon
Language documentation often requires segmenting transcriptions of utterances collected on the field into words and morphemes. While these two tasks are typically performed in succession, we study here Bayesian models for simultaneously segmenting utterances at these two levels. Our aim is twofold: (a) to study the effect of explicitly introducing a hierarchy of units in joint segmentation models; (b) to further assess whether these two levels can be better identified through weak supervision. For this, we first consider a deterministic coupling between independent models; then design and evaluate hierarchical Bayesian models. Experiments with two under-resourced languages (Japhug and Tsez) allow us to better understand the value of various types of weak supervision. In our analysis, we use these results to revisit the distributional hypotheses behind Bayesian segmentation models and evaluate their validity for language documentation data.
pdf
bib
abs
Cross-Lingual Transfer of Cognitive Processing Complexity
Charlotte Pouw
|
Nora Hollenstein
|
Lisa Beinborn
When humans read a text, their eye movements are influenced by the structural complexity of the input sentences. This cognitive phenomenon holds across languages and recent studies indicate that multilingual language models utilize structural similarities between languages to facilitate cross-lingual transfer. We use sentence-level eye-tracking patterns as a cognitive indicator for structural complexity and show that the multilingual model XLM-RoBERTa can successfully predict varied patterns for 13 typologically diverse languages, despite being fine-tuned only on English data. We quantify the sensitivity of the model to structural complexity and distinguish a range of complexity characteristics. Our results indicate that the model develops a meaningful bias towards sentence length but also integrates cross-lingual differences. We conduct a control experiment with randomized word order and find that the model seems to additionally capture more complex structural information.
pdf
bib
abs
Does Transliteration Help Multilingual Language Modeling?
Ibraheem Muhammad Moosa
|
Mahmud Elahi Akhter
|
Ashfia Binte Habib
Script diversity presents a challenge to Multilingual Language Models (MLLM) by reducing lexical overlap among closely related languages. Therefore, transliterating closely related languages that use different writing scripts to a common script may improve the downstream task performance of MLLMs. We empirically measure the effect of transliteration on MLLMs in this context. We specifically focus on the Indic languages, which have the highest script diversity in the world, and we evaluate our models on the IndicGLUE benchmark. We perform the Mann-Whitney U test to rigorously verify whether the effect of transliteration is significant or not. We find that transliteration benefits the low-resource languages without negatively affecting the comparatively high-resource languages. We also measure the cross-lingual representation similarity of the models using centered kernel alignment on parallel sentences from the FLORES-101 dataset. We find that for parallel sentences across different languages, the transliteration-based model learns sentence representations that are more similar.
pdf
bib
abs
A Multilingual Dataset of Racial Stereotypes in Social Media Conversational Threads
Tom Bourgeade
|
Alessandra Teresa Cignarella
|
Simona Frenda
|
Mario Laurent
|
Wolfgang Schmeisser-Nieto
|
Farah Benamara
|
Cristina Bosco
|
Véronique Moriceau
|
Viviana Patti
|
Mariona Taulé
In this paper, we focus on the topics of misinformation and racial hoaxes from a perspective derived from both social psychology and computational linguistics. In particular, we consider the specific case of anti-immigrant feeling as a first case study for addressing racial stereotypes. We describe the first corpus-based study for multilingual racial stereotype identification in social media conversational threads. Our contributions are: (i) a multilingual corpus of racial hoaxes, (ii) a set of common guidelines for the annotation of racial stereotypes in social media texts, and a multi-layered, fine-grained scheme, psychologically grounded on the work by Fiske, including not only stereotype presence, but also contextuality, implicitness, and forms of discredit, (iii) a multilingual dataset in Italian, Spanish, and French annotated following the aforementioned guidelines, and cross-lingual comparative analyses taking into account racial hoaxes and stereotypes in online discussions. The analysis and results show the usefulness of our methodology and resources, shedding light on how racial hoaxes are spread, and enable the identification of negative stereotypes that reinforce them.
pdf
bib
abs
Detecting Contextomized Quotes in News Headlines by Contrastive Learning
Seonyeong Song
|
Hyeonho Song
|
Kunwoo Park
|
Jiyoung Han
|
Meeyoung Cha
Quotes are critical for establishing credibility in news articles. A direct quote enclosed in quotation marks has a strong visual appeal and is a sign of a reliable citation. Unfortunately, this journalistic practice is not strictly followed, and a quote in the headline is often “contextomized.” Such a quote uses words out of context in a way that alters the speaker’s intention so that there is no semantically matching quote in the body text. We present QuoteCSE, a contrastive learning framework that represents the embedding of news quotes based on domain-driven positive and negative samples to identify such an editorial strategy. The dataset and code are available at
https://github.com/ssu-humane/contextomized-quote-contrastive.
pdf
bib
abs
Zero-Shot On-the-Fly Event Schema Induction
Rotem Dror
|
Haoyu Wang
|
Dan Roth
What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information, and analyzing it. We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any manual data collection, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate in a series of experiments that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts and even reaching the best score in the prediction task.
pdf
bib
abs
BanglaNLG and BanglaT5: Benchmarks and Resources for Evaluating Low-Resource Natural Language Generation in Bangla
Abhik Bhattacharjee
|
Tahmid Hasan
|
Wasi Uddin Ahmad
|
Rifat Shahriyar
This work presents ‘BanglaNLG,’ a comprehensive benchmark for evaluating natural language generation (NLG) models in Bangla, a widely spoken yet low-resource language. We aggregate six challenging conditional text generation tasks under the BanglaNLG benchmark, introducing a new dataset on dialogue generation in the process. Furthermore, using a clean corpus of 27.5 GB of Bangla data, we pretrain ‘BanglaT5’, a sequence-to-sequence Transformer language model for Bangla. BanglaT5 achieves state-of-the-art performance in all of these tasks, outperforming several multilingual models by up to 9% absolute gain and 32% relative gain. We are making the new dialogue dataset and the BanglaT5 model publicly available at
https://github.com/csebuetnlp/BanglaNLG in the hope of advancing future research on Bangla NLG.
pdf
bib
abs
It’s about Time: Rethinking Evaluation on Rumor Detection Benchmarks using Chronological Splits
Yida Mu
|
Kalina Bontcheva
|
Nikolaos Aletras
New events emerge over time influencing the topics of rumors in social media. Current rumor detection benchmarks use random splits as training, development and test sets which typically results in topical overlaps. Consequently, models trained on random splits may not perform well on rumor classification on previously unseen topics due to the temporal concept drift. In this paper, we provide a re-evaluation of classification models on four popular rumor detection benchmarks considering chronological instead of random splits. Our experimental results show that the use of random splits can significantly overestimate predictive performance across all datasets and models. Therefore, we suggest that rumor detection models should always be evaluated using chronological splits for minimizing topical overlaps.
pdf
bib
abs
MUTANT: A Multi-sentential Code-mixed Hinglish Dataset
Rahul Gupta
|
Vivek Srivastava
|
Mayank Singh
The multi-sentential long sequence textual data unfolds several interesting research directions pertaining to natural language processing and generation. Though we observe several high-quality long-sequence datasets for English and other monolingual languages, there is no significant effort in building such resources for code-mixed languages such as Hinglish (code-mixing of Hindi-English). In this paper, we propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset i.e., MUTANT. We propose a token-level language-aware pipeline and extend the existing metrics measuring the degree of code-mixing to a multi-sentential framework and automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research directions, we will make the dataset and the code publicly available upon publication.
pdf
bib
abs
Bridging the Gap between Native Text and Translated Text through Adversarial Learning: A Case Study on Cross-Lingual Event Extraction
Pengfei Yu
|
Jonathan May
|
Heng Ji
Recent research in cross-lingual learning has found that combining large-scale pretrained multilingual language models with machine translation can yield good performance. We explore this idea for cross-lingual event extraction with a new model architecture that jointly encodes a source language input sentence with its translation to the target language during training, and takes a target language sentence with its translation back to the source language as input during evaluation. However, we observe significant representational gap between the native source language texts during training and the texts translated into source language during evaluation, as well as the texts translated into target language during training and the native target language texts during evaluation. This representational gap undermines the effectiveness of cross-lingual transfer learning for event extraction with machine-translated data. In order to mitigate this problem, we propose an adversarial training framework that encourages the language model to produce more similar representations for the translated text and the native text. To be specific, we train the language model such that its hidden representations are able to fool a jointly trained discriminator that distinguishes translated texts’ representations from native texts’ representations. We conduct experiments on cross-lingual for event extraction across three languages. Results demonstrate that our proposed adversarial training can effectively incorporate machine translation to improve event extraction, while simply adding machine-translated data yields unstable performance due to the representational gap.
pdf
bib
abs
Scalable Prompt Generation for Semi-supervised Learning with Language Models
Yuhang Zhou
|
Suraj Maharjan
|
Beiye Liu
Prompt-based learning methods in semi-supervised learning (SSL) settings have been shown to be effective on multiple natural language understanding (NLU) datasets and tasks in the literature. However, manually designing multiple prompts and verbalizers requires domain knowledge and human effort, making it difficult and expensive to scale across different datasets. In this paper, we propose two methods to automatically design multiple prompts and integrate automatic verbalizer in SSL settings without sacrificing performance. The first method uses various demonstration examples with learnable continuous prompt tokens to create diverse prompt models. The second method uses a varying number of soft prompt tokens to encourage language models to learn different prompts. For the verbalizer, we use the prototypical verbalizer to replace the manual one. In summary, we obtained the best average accuracy of 71.5% (a relative improvement of 0.99% over even the previous state-of-the-art SSL method with manual prompts and verbalizers) in different few-shot learning settings.
pdf
bib
abs
Novel Feature Discovery for Task-Oriented Dialog Systems
Vinh Thinh Ho
|
Mohamed Soliman
|
Abdalghani Abujabal
A novel feature represents a cluster of semantically equivalent novel user requests e.g., requests to play a song on a service or read user’s messages. Detecting and supporting novel features is crucial towards wider adoption of dialog systems by end users. Intuitively, features are represented by a combination of intents, slot types and/or their values. For example, while playing a song is a feature represented by a single intent (PlayMusic) only, playing a song on a service is another feature represented by the combination of PlayMusic intent and ServiceName slot type. Prior work on novelty detection limits the scope of features to those represented by novel single intents, leading to (1) giant clusters spanning several user-perceived fine-grained features belonging to the same intent, (2) incoherent interpretation of clusters from users’ perspective (no direct connection to some user-perceived feature), and (3) missing those features spanning several intents. In this work, we introduce feature discovery as opposed to single intent discovery, which aims at discovering novel features spanning a combination of intents and slots, and present a technique for discovering novel features from user utterances. Experiments on two datasets demonstrate the effectiveness of our approach and consistently show its ability to detect novel features.
pdf
bib
abs
Context Generation Improves Open Domain Question Answering
Dan Su
|
Mostofa Patwary
|
Shrimai Prabhumoye
|
Peng Xu
|
Ryan Prenger
|
Mohammad Shoeybi
|
Pascale Fung
|
Anima Anandkumar
|
Bryan Catanzaro
Closed-book question answering (QA) requires a model to directly answer an open-domain question without access to any external knowledge. Prior work on closed-book QA either directly finetunes or prompts a pretrained language model (LM) to leverage the stored knowledge. However, they do not fully exploit the parameterized knowledge. To address this inefficiency, we propose a two-stage, closed-book QA framework which employs a coarse-to-fine approach to extract the relevant knowledge and answer a question. We first generate a related context for a given question by prompting a pretrained LM. We then prompt the same LM to generate an answer using the generated context and the question. Additionally, we marginalize over the generated contexts to improve the accuracies and reduce context uncertainty. Experimental results on three QA benchmarks show that our method significantly outperforms previous closed-book QA methods. For example on TriviaQA, our method improves exact match accuracy from 55.3% to 68.6%, and is on par with open-book QA methods (68.6% vs. 68.0%). Our results show that our new methodology is able to better exploit the stored knowledge in pretrained LMs without adding extra learnable parameters or needing finetuning, and paves the way for hybrid models that integrate pretrained LMs with external knowledge.
pdf
bib
abs
RedHOT: A Corpus of Annotated Medical Questions, Experiences, and Claims on Social Media
Somin Wadhwa
|
Vivek Khetan
|
Silvio Amir
|
Byron Wallace
We present Reddit Health Online Talk (RedHOT), a corpus of 22,000 richly annotated social media posts from Reddit spanning 24 health conditions. Annotations include demarcations of spans corresponding to medical claims, personal experiences, and questions. We collect additional granular annotations on identified claims. Specifically, we mark snippets that describe patient Populations, Interventions, and Outcomes (PIO elements) within these. Using this corpus, we introduce the task of retrieving trustworthy evidence relevant to a given claim made on social media. We propose a new method to automatically derive (noisy) supervision for this task which we use to train a dense retrieval model; this outperforms baseline models. Manual evaluation of retrieval results performed by medical doctors indicate that while our system performance is promising, there is considerable room for improvement. We release all annotations collected (and scripts to assemble the dataset), and all code necessary to reproduce the results in this paper at:
https://sominw.com/redhot.
pdf
bib
abs
Paparazzi: A Deep Dive into the Capabilities of Language and Vision Models for Grounding Viewpoint Descriptions
Henrik Voigt
|
Jan Hombeck
|
Monique Meuschke
|
Kai Lawonn
|
Sina Zarrieß
Existing language and vision models achieve impressive performance in image-text understanding. Yet, it is an open question to what extent they can be used for language understanding in 3D environments and whether they implicitly acquire 3D object knowledge, e.g. about different views of an object. In this paper, we investigate whether a state-of-the-art language and vision model, CLIP, is able to ground perspective descriptions of a 3D object and identify canonical views of common objects based on text queries. We present an evaluation framework that uses a circling camera around a 3D object to generate images from different viewpoints and evaluate them in terms of their similarity to natural language descriptions. We find that a pre-trained CLIP model performs poorly on most canonical views and that fine-tuning using hard negative sampling and random contrasting yields good results even under conditions with little available training data.
pdf
bib
abs
PLACES: Prompting Language Models for Social Conversation Synthesis
Maximillian Chen
|
Alexandros Papangelis
|
Chenyang Tao
|
Seokhwan Kim
|
Andy Rosenbaum
|
Yang Liu
|
Zhou Yu
|
Dilek Hakkani-Tur
Collecting high quality conversational data can be very expensive for most applications and infeasible for others due to privacy, ethical, or similar concerns. A promising direction to tackle this problem is to generate synthetic dialogues by prompting large language models. In this work, we use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting. We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations. This includes various dimensions of conversation quality with human evaluation directly on the synthesized conversations, and interactive human evaluation of chatbots fine-tuned on the synthetically generated dataset. We additionally demonstrate that this prompting approach is generalizable to multi-party conversations, providing potential to create new synthetic data for multi-party tasks. Our synthetic multi-party conversations were rated more favorably across all measured dimensions compared to conversation excerpts sampled from a human-collected multi-party dataset.
pdf
bib
abs
FedPerC: Federated Learning for Language Generation with Personal and Context Preference Embeddings
Andrew Silva
|
Pradyumna Tambwekar
|
Matthew Gombolay
Federated learning is a training paradigm that learns from multiple distributed users without aggregating data on a centralized server, promising the ability to deploy machine-learning to a diverse population of users without first collecting large, labeled datasets. As federated learning involves averaging gradient updates across a decentralized population, there is a growing need for personalization of federated learning systems (i.e. conversational agents must personalize to individual users and the context of an interaction).In this work, we propose a new direction for personalization research within federated learning, leveraging both personal embeddings and shared context embeddings.We also present an approach to predict these “preference” embeddings, enabling personalization without backpropagation. Compared to state-of-the-art personalization baselines, our approach achieves a 50% improvement in test-time perplexity using 0.001% of the memory required by baseline approaches, and achieving greater sample- and compute-efficiency.
pdf
bib
abs
A Neural CRF-based Hierarchical Approach for Linear Text Segmentation
Inderjeet Nair
|
Aparna Garimella
|
Balaji Vasan Srinivasan
|
Natwar Modani
|
Niyati Chhaya
|
Srikrishna Karanam
|
Sumit Shekhar
We consider the problem of segmenting unformatted text and transcripts linearly based on their topical structure. While prior approaches explicitly train to predict segment boundaries, our proposed approach solves this task by inferring the hierarchical segmentation structure associated with the input text fragment. Given the lack of a large annotated dataset for this task, we propose a data curation strategy and create a corpus of over 700K Wikipedia articles with their hierarchical structures. We then propose the first supervised approach to generating hierarchical segmentation structures based on these annotations. Our method, in particular, is based on a neural conditional random field (CRF), which explicitly models the statistical dependency between a node and its constituent child nodes. We introduce a new data augmentation scheme as part of our model training strategy, which involves sampling a variety of node aggregations, permutations, and removals, all of which help capture fine-grained and coarse topical shifts in the data and improve model performance. Extensive experiments show that our model outperforms or achieves competitive performance when compared to previous state-of-the-art algorithms in the following settings: rich-resource, cross-domain transferability, few-shot supervision, and segmentation when topic label annotations are provided.
pdf
bib
abs
MultiFin: A Dataset for Multilingual Financial NLP
Rasmus Jørgensen
|
Oliver Brandt
|
Mareike Hartmann
|
Xiang Dai
|
Christian Igel
|
Desmond Elliott
Financial information is generated and distributed across the world, resulting in a vast amount of domain-specific multilingual data. Multilingual models adapted to the financial domain would ease deployment when an organization needs to work with multiple languages on a regular basis. For the development and evaluation of such models, there is a need for multilingual financial language processing datasets. We describe MultiFin – a publicly available financial dataset consisting of real-world article headlines covering 15 languages across different writing systems and language families. The dataset consists of hierarchical label structure providing two classification tasks: multi-label and multi-class. We develop our annotation schema based on a real-world application and annotate our dataset using both ‘label by native-speaker’ and ‘translate-then-label’ approaches. The evaluation of several popular multilingual models, e.g., mBERT, XLM-R, and mT5, show that although decent accuracy can be achieved in high-resource languages, there is substantial room for improvement in low-resource languages.
pdf
bib
abs
MLASK: Multimodal Summarization of Video-based News Articles
Mateusz Krubiński
|
Pavel Pecina
In recent years, the pattern of news consumption has been changing. The most popular multimedia news formats are now multimodal - the reader is often presented not only with a textual article but also with a short, vivid video. To draw the attention of the reader, such video-based articles are usually presented as a short textual summary paired with an image thumbnail. In this paper, we introduce MLASK (MultimodaL Article Summarization Kit) - a new dataset of video-based news articles paired with a textual summary and a cover picture, all obtained by automatically crawling several news websites. We demonstrate how the proposed dataset can be used to model the task of multimodal summarization by training a Transformer-based neural model. We also examine the effects of pre-training when the usage of generative pre-trained language models helps to improve the model performance, but (additional) pre-training on the simpler task of text summarization yields even better results. Our experiments suggest that the benefits of pre-training and using additional modalities in the input are not orthogonal.
pdf
bib
abs
Going beyond research datasets: Novel intent discovery in the industry setting
Aleksandra Chrabrowa
|
Tsimur Hadeliya
|
Dariusz Kajtoch
|
Robert Mroczkowski
|
Piotr Rybak
Novel intent discovery automates the process of grouping similar messages (questions) to identify previously unknown intents. However, current research focuses on publicly available datasets which have only the question field and significantly differ from real-life datasets. This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform. We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision. We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv. All our methods combined to fully utilize real-life datasets give up to 33pp performance boost over state-of-the-art Constrained Deep Adaptive Clustering (CDAC) model for question only. By comparison CDAC model for the question data only gives only up to 13pp performance boost over the naive baseline.
pdf
bib
abs
DATScore: Evaluating Translation with Data Augmented Translations
Moussa Kamal Eddine
|
Guokan Shang
|
Michalis Vazirgiannis
The rapid development of large pretrained language models has revolutionized not only the field of Natural Language Generation (NLG) but also its evaluation. Inspired by the recent work of BARTScore: a metric leveraging the BART language model to evaluate the quality of generated text from various aspects, we introduce DATScore. DATScore uses data augmentation techniques to improve the evaluation of machine translation. Our main finding is that introducing data augmented translations of the source and reference texts is greatly helpful in evaluating the quality of the generated translation. We also propose two novel score averaging and term weighting strategies to improve the original score computing process of BARTScore. Experimental results on WMT show that DATScore correlates better with human meta-evaluations than the other recent state-of-the-art metrics, especially for low-resource languages. Ablation studies demonstrate the value added by our new scoring strategies. Moreover, we report in our extended experiments the performance of DATScore on 3 NLG tasks other than translation.
pdf
bib
abs
How do decoding algorithms distribute information in dialogue responses?
Saranya Venkatraman
|
He He
|
David Reitter
Humans tend to follow the Uniform Information Density (UID) principle by distributing information evenly in utterances. We study if decoding algorithms implicitly follow this UID principle, and under what conditions adherence to UID might be desirable for dialogue generation. We generate responses using different decoding algorithms with GPT-2 on the Persona-Chat dataset and collect human judgments on their quality using Amazon Mechanical Turk. We find that (i) surprisingly, model-generated responses follow the UID principle to a greater extent than human responses, and (ii) decoding algorithms that promote UID do not generate higher-quality responses. Instead, when we control for surprisal, non-uniformity of information density correlates with the quality of responses with very low/high surprisal. Our findings indicate that encouraging non-uniform responses is a potential solution to the “likelihood trap” problem (quality degradation in very high-likelihood text). Our dataset containing multiple candidate responses per dialog history along with human-annotated quality ratings is available at:
https://huggingface.co/datasets/saranya132/dialog_uid_gpt2.
pdf
bib
abs
Benchmarking Long-tail Generalization with Likelihood Splits
Ameya Godbole
|
Robin Jia
In order to reliably process natural language, NLP systems must generalize to the long tail of rare utterances. We propose a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets. We create ‘Likelihood Splits’ where examples that are assigned lower likelihood by a pre-trained language model (LM) are placed in the test set, and more likely examples are in the training set. This simple approach can be customized to construct meaningful train-test splits for a wide range of tasks. Likelihood Splits surface more challenges than random splits: relative error rates of state-of-the-art models increase by 59% for semantic parsing on Spider, 93% for natural language inference on SNLI, and 33% for yes/no question answering on BoolQ, on our splits compared with the corresponding random splits. Moreover, Likelihood Splits create fairer benchmarks than adversarial filtering; when the LM used to create the splits is also employed as the task model, our splits do not unfairly penalize the LM.
pdf
bib
abs
Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation
Vivek Iyer
|
Arturo Oncevay
|
Alexandra Birch
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains — owing to better multilingual semantic representations and transfer learning. However, they generated the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons - which can lead to significant noise in a variety of cases, including the poor handling of polysemes and multi-word expressions, violation of linguistic agreement and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a ‘base’ NMT model. We conduct experiments on 3 different language families - Romance, Uralic, and Indo-Aryan - and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably or better than massive models like mBART50 and mRASP2, depending on the size of data provided. We empirically analyse several key factors responsible for these - including context, many-to-many substitutions, code-switching language count etc. - and prove that they all contribute to enhanced pretraining of multilingual NMT models.
pdf
bib
abs
XQA-DST: Multi-Domain and Multi-Lingual Dialogue State Tracking
Han Zhou
|
Ignacio Iacobacci
|
Pasquale Minervini
Dialogue State Tracking (DST), a crucial component of task-oriented dialogue (ToD) systems, keeps track of all important information pertaining to dialogue history: filling slots with the most probable values throughout the conversation. Existing methods generally rely on a predefined set of values and struggle to generalise to previously unseen slots in new domains. To overcome these challenges, we propose a domain-agnostic extractive question answering (QA) approach with shared weights across domains. To disentangle the complex domain information in ToDs, we train our DST with a novel domain filtering strategy by excluding out-of-domain question samples. With an independent classifier that predicts the presence of multiple domains given the context, our model tackles DST by extracting spans in active domains. Empirical results demonstrate that our model can efficiently leverage domain-agnostic QA datasets by two-stage fine-tuning while being both domain-scalable and open vocabulary in DST. It shows strong transferability by achieving zero-shot domain-adaptation results on MultiWOZ 2.1 with an average JGA of 36.7%. It further achieves cross-lingual transfer with state-of-the-art zero-shot results, 66.2% JGA from English to German and 75.7% JGA from English to Italian on WOZ 2.0.
pdf
bib
abs
Improving Prediction Backward-Compatiblility in NLP Model Upgrade with Gated Fusion
Yi-An Lai
|
Elman Mansimov
|
Yuqing Xie
|
Yi Zhang
When upgrading neural models to a newer version, new errors that were not encountered in the legacy version can be introduced, known as regression errors. This inconsistent behavior during model upgrade often outweighs the benefits of accuracy gain and hinders the adoption of new models. To mitigate regression errors from model upgrade, distillation and ensemble have proven to be viable solutions without significant compromise in performance. Despite the progress, these approaches attained an incremental reduction in regression which is still far from achieving backward-compatible model upgrade. In this work, we propose a novel method, Gated Fusion, that promotes backward compatibility via learning to mix predictions between old and new models. Empirical results on two distinct model upgrade scenarios show that our method reduces the number of regression errors by 62% on average, outperforming the strongest baseline by an average of 25%.
pdf
bib
abs
AmbiCoref: Evaluating Human and Model Sensitivity to Ambiguous Coreference
Yuewei Yuan
|
Chaitanya Malaviya
|
Mark Yatskar
Given a sentence “Abby told Brittney that she upset Courtney”, one would struggle to understand who “she” refers to, and ask for clarification. However, if the word “upset” were replaced with “hugged”, “she” unambiguously refers to Abby. We study if modern coreference resolution models are sensitive to such pronominal ambiguity. To this end, we construct AmbiCoref, a diagnostic corpus of minimal sentence pairs with ambiguous and unambiguous referents. Our examples generalize psycholinguistic studies of human perception of ambiguity around particular arrangements of verbs and their arguments. Analysis shows that (1) humans are less sure of referents in ambiguous AmbiCoref examples than unambiguous ones, and (2) most coreference models show little difference in output between ambiguous and unambiguous pairs. We release AmbiCoref as a diagnostic corpus for testing whether models treat ambiguity similarly to humans.
pdf
bib
abs
Improving Unsupervised Out-of-domain detection through Pseudo Labeling and Learning
Byounghan Lee
|
Jaesik Kim
|
Junekyu Park
|
Kyung-Ah Sohn
Unsupervised out-of-domain (OOD) detection is a task aimed at discriminating whether given samples are from the in-domain or not, without the categorical labels of in-domain instances. Unlike supervised OOD, as there are no labels for training a classifier, previous works on unsupervised OOD detection adopted the one-class classification (OCC) approach, assuming that the training samples come from a single domain. However, in-domain instances in many real world applications can have a heterogeneous distribution (i.e., across multiple domains or multiple classes). In this case, OCC methods have difficulty in reflecting the categorical information of the domain properly. To tackle this issue, we propose a two-stage framework that leverages the latent categorical information to improve representation learning for textual OOD detection. In the first stage, we train a transformer-based sentence encoder for pseudo labeling by contrastive loss and cluster loss. The second stage is pseudo label learning in which the model is re-trained with pseudo-labels obtained in the first stage. The empirical results on the three datasets show that our two-stage framework significantly outperforms baseline models in more challenging scenarios.
pdf
bib
abs
How Many Data Samples is an Additional Instruction Worth?
Ravsehaj Singh Puri
|
Swaroop Mishra
|
Mihir Parmar
|
Chitta Baral
Recently introduced instruction-paradigm empowers non-expert users to leverage NLP resources by defining a new task in natural language. Instruction-tuned models have significantly outperformed multitask learning models (without instruction); however they are far from state-of-the-art task-specific models. Conventional approaches to improve model performance via creating datasets with large number of task instances or architectural changes in the model may not be feasible for non-expert users. However, they can write alternate instructions to represent an instruction task. Is Instruction-augmentation helpful? We augment a subset of tasks in the expanded version of NATURAL INSTRUCTIONS with additional instructions and find that it significantly improves model performance (up to 35%), especially in the low-data regime. Our results indicate that an additional instruction can be equivalent to ~200 data samples on average across tasks.
pdf
bib
abs
[MASK] Insertion: a robust method for anti-adversarial attacks
Xinrong Hu
|
Ce Xu
|
Junlong Ma
|
Zijian Huang
|
Jie Yang
|
Yi Guo
|
Johan Barthelemy
Adversarial attack aims to perturb input sequences and mislead a trained model for false predictions. To enhance the model robustness, defensing methods are accordingly employed by either data augmentation (involving adversarial samples) or model enhancement (modifying the training loss and/or model architecture). In contrast to previous work, this paper revisits the masked language modeling (MLM) and presents a simple yet efficient algorithm against adversarial attacks, termed [MASK] insertion for defensing (MI4D). Specifically, MI4D simply inserts [MASK] tokens to input sequences during training and inference, maximizing the intersection of the new convex hull (MI4D creates) with the original one (the clean input forms). As neither additional adversarial samples nor the model modification is required, MI4D is as computationally efficient as traditional fine-tuning. Comprehensive experiments have been conducted using three benchmark datasets and four attacking methods. MI4D yields a significant improvement (on average) of the accuracy between 3.2 and 11.1 absolute points when compared with six state-of-the-art defensing baselines.
pdf
bib
abs
ViDeBERTa: A powerful pre-trained language model for Vietnamese
Cong Dao Tran
|
Nhut Huy Pham
|
Anh Tuan Nguyen
|
Truong Son Hy
|
Tu Vu
This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts using DeBERTa architecture. Although many successful pre-trained language models based on Transformer have been widely proposed for the English language, there are still few pre-trained models for Vietnamese, a low-resource language, that perform good results on downstream tasks, especially Question answering. We fine-tune and evaluate our model on three important natural language downstream tasks, Part-of-speech tagging, Named-entity recognition, and Question answering. The empirical results demonstrate that ViDeBERTa with far fewer parameters surpasses the previous state-of-the-art models on multiple Vietnamese-specific natural language understanding tasks. Notably, ViDeBERTa_base with 86M parameters, which is only about 23% of PhoBERT_large with 370M parameters, still performs the same or better results than the previous state-of-the-art model. Our ViDeBERTa models are available at:
https://github.com/HySonLab/ViDeBERTa.
pdf
bib
abs
NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization
Junru Lu
|
Jiazheng Li
|
Byron Wallace
|
Yulan He
|
Gabriele Pergola
Accessing medical literature is difficult for laypeople as the content is written for specialists and contains medical jargon. Automated text simplification methods offer a potential means to address this issue. In this work, we propose a summarize-then-simplify two-stage strategy, which we call NapSS, identifying the relevant content to simplify while ensuring that the original narrative flow is preserved. In this approach, we first generate reference summaries via sentence matching between the original and the simplified abstracts. These summaries are then used to train an extractive summarizer, learning the most relevant content to be simplified. Then, to ensure the narrative consistency of the simplified text, we synthesize auxiliary narrative prompts combining key phrases derived from the syntactical analyses of the original text. Our model achieves results significantly better than the seq2seq baseline on an English medical corpus, yielding 3% 4% absolute improvements in terms of lexical similarity, and providing a further 1.1% improvement of SARI score when combined with the baseline. We also highlight shortcomings of existing evaluation methods, and introduce new metrics that take into account both lexical and high-level semantic similarity. A human evaluation conducted on a random sample of the test set further establishes the effectiveness of the proposed approach. Codes and models are released here:
https://github.com/LuJunru/NapSS.
pdf
bib
abs
Long-tailed Extreme Multi-label Text Classification by the Retrieval of Generated Pseudo Label Descriptions
Ruohong Zhang
|
Yau-Shian Wang
|
Yiming Yang
|
Donghan Yu
|
Tom Vu
|
Likun Lei
Extreme Multi-label Text Classification (XMTC) has been a tough challenge in machine learning research and applications due to the sheer sizes of the label spaces and the severe data scarcity problem associated with the long tail of rare labels in highly skewed distributions. This paper addresses the challenge of tail label prediction by leveraging the power of dense neural retrieval model in mapping input documents (as queries) to relevant label descriptions. To further enhance the quality of label descriptions, we propose to generate pseudo label descriptions from a trained bag-of-words (BoW) classifier, which demonstrates better classification performance under severe scarce data conditions. The proposed approach achieves the state-of-the-art (SOTA) performance of overall label prediction on XMTC benchmark datasets and especially outperforms the SOTA models in the tail label prediction. We also provide a theoretical analysis for relating the BoW and neural models w.r.t. performance lower bound.
pdf
bib
abs
Unsupervised Keyphrase Extraction via Interpretable Neural Networks
Rishabh Joshi
|
Vidhisha Balachandran
|
Emily Saldanha
|
Maria Glenski
|
Svitlana Volkova
|
Yulia Tsvetkov
Keyphrase extraction aims at automatically extracting a list of “important” phrases representing the key concepts in a document. Prior approaches for unsupervised keyphrase extraction resorted to heuristic notions of phrase importance via embedding clustering or graph centrality, requiring extensive domain expertise. Our work presents a simple alternative approach which defines keyphrases as document phrases that are salient for predicting the topic of the document. To this end, we propose INSPECT—an approach that uses self-explaining models for identifying influential keyphrases in a document by measuring the predictive impact of input phrases on the downstream task of the document topic classification. We show that this novel method not only alleviates the need for ad-hoc heuristics but also achieves state-of-the-art results in unsupervised keyphrase extraction in four datasets across two domains: scientific publications and news articles.
pdf
bib
abs
Large Language Models are few(1)-shot Table Reasoners
Wenhu Chen
Recent literature has shown that large language models (LLMs) are generally excellent few-shot reasoners to solve text reasoning tasks. However, the capability of LLMs on table reasoning tasks is yet to be explored. In this paper, we aim at understanding how well LLMs can perform table-related tasks with few-shot in-context learning. Specifically, we evaluated LLMs on popular table QA and fact verification datasets like WikiTableQuestion, FetaQA, TabFact, and FEVEROUS and found that LLMs are competent at complex reasoning over table structures, though these models are not pre-trained on any table corpus. When combined with ‘chain of thoughts’ prompting, LLMs can achieve very strong performance with only a 1-shot demonstration, even on par with some SoTA models. We show that LLMs are even more competent at generating comprehensive long-form answers on FetaQA than tuned T5-large. We further manually studied the reasoning chains elicited from LLMs and found that these reasoning chains are highly consistent with the underlying semantic form. We believe that LLMs can serve as a simple yet generic baseline for future research. The code and data are released in
https://github.com/wenhuchen/TableCoT.
pdf
bib
abs
Realistic Citation Count Prediction Task for Newly Published Papers
Jun Hirako
|
Ryohei Sasano
|
Koichi Takeda
Citation count prediction is the task of predicting the future citation counts of academic papers, which is particularly useful for estimating the future impacts of an ever-growing number of academic papers. Although there have been many studies on citation count prediction, they are not applicable to predicting the citation counts of newly published papers, because they assume the availability of future citation counts for papers that have not had enough time pass since publication. In this paper, we first identify problems in the settings of existing studies and introduce a realistic citation count prediction task that strictly uses information available at the time of a target paper’s publication. For realistic citation count prediction, we then propose two methods to leverage the citation counts of papers shortly after publication. Through experiments using papers collected from arXiv and bioRxiv, we demonstrate that our methods considerably improve the performance of citation count prediction for newly published papers in a realistic setting.
pdf
bib
abs
“Why do I feel offended?” - Korean Dataset for Offensive Language Identification
San-Hee Park
|
Kang-Min Kim
|
O-Joun Lee
|
Youjin Kang
|
Jaewon Lee
|
Su-Min Lee
|
SangKeun Lee
Warning: This paper contains some offensive expressions. Offensive content is an unavoidable issue on social media. Most existing offensive language identification methods rely on the compilation of labeled datasets. However, existing methods rarely consider low-resource languages that have relatively less data available for training (e.g., Korean). To address these issues, we construct a novel KOrean Dataset for Offensive Language Identification (KODOLI). KODOLI comprises more fine-grained offensiveness categories (i.e., not offensive, likely offensive, and offensive) than existing ones. A likely offensive language refers to texts with implicit offensiveness or abusive language without offensive intentions. In addition, we propose two auxiliary tasks to help identify offensive languages: abusive language detection and sentiment analysis. We provide experimental results for baselines on KODOLI and observe that language models suffer from identifying “LIKELY” offensive statements. Quantitative results and qualitative analysis demonstrate that jointly learning offensive language, abusive language and sentiment information improves the performance of offensive language identification.
pdf
bib
abs
Empirical Investigation of Neural Symbolic Reasoning Strategies
Yoichi Aoki
|
Keito Kudo
|
Tatsuki Kuribayashi
|
Ana Brassard
|
Masashi Yoshikawa
|
Keisuke Sakaguchi
|
Kentaro Inui
Neural reasoning accuracy improves when generating intermediate reasoning steps. However, the source of this improvement is yet unclear. Here, we investigate and factorize the benefit of generating intermediate steps for symbolic reasoning. Specifically, we decompose the reasoning strategy w.r.t. step granularity and chaining strategy. With a purely symbolic numerical reasoning dataset (e.g., A=1, B=3, C=A+3, C?), we found that the choice of reasoning strategies significantly affects the performance, with the gap becoming even larger as the extrapolation length becomes longer. Surprisingly, we also found that certain configurations lead to nearly perfect performance, even in the case of length extrapolation. Our results indicate the importance of further exploring effective strategies for neural reasoning models.
pdf
bib
abs
Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering
Xanh Ho
|
Anh-Khoa Duong Nguyen
|
Saku Sugawara
|
Akiko Aizawa
To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question as to how effective UR tasks are for the QA task when training models on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effectiveness of UR tasks (including both sentence-level and entity-level tasks) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. While the previous models have not been explicitly trained on an entity-level reasoning prediction task, we build a multi-task model that performs three tasks together: sentence-level supporting facts prediction, entity-level reasoning prediction, and answer prediction. Experimental results on 2WikiMultiHopQA and HotpotQA-small datasets reveal that (1) UR tasks can improve QA performance. Using four debiased datasets that are newly created, we demonstrate that (2) UR tasks are helpful in preventing reasoning shortcuts in the multi-hop QA task. However, we find that (3) UR tasks do not contribute to improving the robustness of the model on adversarial questions, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of entity-level reasoning in the form of natural language questions (e.g., sub-question forms).
pdf
bib
abs
PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain?
Sedigheh Eslami
|
Christoph Meinel
|
Gerard de Melo
Contrastive Language–Image Pre-training (CLIP) has shown remarkable success in learning with cross-modal supervision from extensive amounts of image–text pairs collected online. Thus far, the effectiveness of CLIP has been investigated primarily in general-domain multimodal problems. In this work, we evaluate the effectiveness of CLIP for the task of Medical Visual Question Answering (MedVQA). We present PubMedCLIP, a fine-tuned version of CLIP for the medical domain based on PubMed articles. Our experiments conducted on two MedVQA benchmark datasets illustrate that PubMedCLIP achieves superior results improving the overall accuracy up to 3% in comparison to the state-of-the-art Model-Agnostic Meta-Learning (MAML) networks pre-trained only on visual data. The PubMedCLIP model with different back-ends, the source code for pre-training them and reproducing our MedVQA pipeline is publicly available at
https://github.com/sarahESL/PubMedCLIP.
pdf
bib
abs
Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models
Isabel Papadimitriou
|
Kezia Lopez
|
Dan Jurafsky
While multilingual language models can improve NLP performance on low-resource languages by leveraging higher-resource languages, they also reduce average performance on all languages (the ‘curse of multilinguality’). Here we show another problem with multilingual models: grammatical structures in higher-resource languages bleed into lower-resource languages, a phenomenon we call grammatical structure bias. We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models: testing their preference for two carefully-chosen variable grammatical structures (optional pronoun-drop in Spanish and optional Subject-Verb ordering in Greek). We find that multilingual BERT is biased toward the English-like setting (explicit pronouns and Subject-Verb-Object ordering) as compared to our monolingual control language model. With our case studies, we hope to bring to light the fine-grained ways in which multilingual models can be biased, and encourage more linguistically-aware fluency evaluation.
pdf
bib
abs
Reassessing Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization
Aishwarya Agrawal
|
Ivana Kajic
|
Emanuele Bugliarello
|
Elnaz Davoodi
|
Anita Gergely
|
Phil Blunsom
|
Aida Nematzadeh
Vision-and-language (V&L) models pretrained on large-scale multimodal data have demonstrated strong performance on various tasks such as image captioning and visual question answering (VQA). The quality of such models is commonly assessed by measuring their performance on unseen data that typically comes from the same distribution as the training data. However, when evaluated under out-of-distribution (out-of-dataset) settings for VQA, we observe that these models exhibit poor generalization. We comprehensively evaluate two pretrained V&L models under different settings (i.e. classification and open-ended text generation) by conducting cross-dataset evaluations. We find that these models tend to learn to solve the benchmark, rather than learning the high-level skills required by the VQA task. We also find that in most cases generative models are less susceptible to shifts in data distribution compared to discriminative ones, and that multimodal pretraining is generally helpful for OOD generalization. Finally, we revisit assumptions underlying the use of automatic VQA evaluation metrics, and empirically show that their stringent nature repeatedly penalizes models for correct responses.
pdf
bib
abs
Our kind of people? Detecting populist references in political debates
Christopher Klamm
|
Ines Rehbein
|
Simone Paolo Ponzetto
This paper investigates the identification of populist rhetoric in text and presents a novel cross-lingual dataset for this task. Our work is based on the definition of populism as a “communication style of political actors that refers to the people” but also includes anti-elitism as another core feature of populism. Accordingly, we annotate references to The People and The Elite in German and English parliamentary debates with a hierarchical scheme. The paper describes our dataset and annotation procedure and reports inter-annotator agreement for this task. Next, we compare and evaluate different transformer-based model architectures on a German dataset and report results for zero-shot learning on a smaller English dataset. We then show that semi-supervised tri-training can improve results in the cross-lingual setting. Our dataset can be used to investigate how political actors talk about The Elite and The People and to study how populist rhetoric is used as a strategic device.
pdf
bib
abs
SharPT: Shared Latent Space Prompt Tuning
Bo Pang
|
Semih Yavuz
|
Caiming Xiong
|
Yingbo Zhou
Prompt tuning is an efficient method for adapting large language models, and Soft Prompt Transfer (SPoT) further narrows the gap between prompt tuning and full model tuning by transferring prompts learned from source tasks to target tasks. It is nevertheless difficult and expensive to identify the source task that provides optimal prompts. In this work, we propose to learn a shared latent space which captures a set of basis skills from a mixture of source tasks. Given an instance, its embedding queries the latent space, yielding a basis skill vector. This vector generates soft prompts, via a lightweight prompt generator, which modulates a frozen model. The latent space and prompt transformation are learned end-to-end by training on source tasks. Transfer learning from source tasks to a target task simply amounts to finetuning the prompt generator, accounting for roughly 0.3% parameters of the frozen backbone model, while the shared latent space is also frozen in finetuning. Our approach outperforms prior soft prompt methods by a significant margin on a variety of tasks such as NLI, sentence completion, QA, conference resolution, word sense disambiguation. We also find, on various model scales, our method achieves competitive performance compared to finetuning the full model.
pdf
bib
abs
Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection
Tolulope Ogunremi
|
Dan Jurafsky
|
Christopher Manning
With the prominence of large pretrained language models, low-resource languages are rarely modelled monolingually and become victims of the “curse of multilinguality” in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1GB of data from many unrelated African languages outperforms massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We propose that training on smaller amounts of data but from related languages could match the performance of models trained on large, unrelated data. We test our hypothesis on the Niger-Congo family and its Bantu and Volta-Niger sub-families, pretraining models with data solely from Niger-Congo languages and finetuning on 4 downstream tasks: NER, part-of-speech tagging, sentiment analysis and text classification. We find that models trained on genetically related languages achieve equal performance on downstream tasks in low-resource languages despite using less training data. We recommend selecting training data based on language-relatedness when pretraining language models for low-resource languages.
pdf
bib
abs
Long Document Summarization with Top-down and Bottom-up Inference
Bo Pang
|
Erik Nijkamp
|
Wojciech Kryscinski
|
Silvio Savarese
|
Yingbo Zhou
|
Caiming Xiong
Text summarization aims to condense long documents and retain key information. Critical to the success of a summarization model is the faithful inference of latent representations of words or tokens in the source documents. Most recent models infer the latent representations with a transformer encoder, which is purely bottom-up and thus does not capture long-distance context well. Also, self-attention-based models face the challenge of quadratic complexity with respect to sequence length. We propose a method to improve summarization models on these two aspects. Our method assumes a hierarchical latent structure of a document where the top-level captures the long range dependency at a coarser time scale and the bottom token level preserves the details. Critically, our method enables token representations to be updated in both a bottom-up and top-down manner. In the bottom-up pass, token representations are inferred with local self-attention to leverage its efficiency. Top-down correction is then applied to allow tokens to capture global context. We demonstrate the effectiveness on a diverse set of summarization datasets, including narrative, conversational, scientific documents and news. Our model achieves state-of-the-art performance on a wide range of long document summarization benchmarks, compared to recent efficient transformers. We show that our model can summarize an entire book and achieve competitive performance using 0.27% parameters and much less training data, compared to a recent GPT-3-based model. These results indicate the general applicability and benefits of the framework.
pdf
bib
abs
Open Information Extraction with Entity Focused Constraints
Prajna Upadhyay
|
Oana Balalau
|
Ioana Manolescu
Open Information Extraction (OIE) is the task of extracting tuples of the form (subject, predicate, object), without any knowledge of the type and lexical form of the predicate, the subject, or the object. In this work, we focus on improving OIE quality by exploiting domain knowledge about the subject and object. More precisely, knowing that the subjects and objects in sentences are oftentimes named entities, we explore how to inject constraints in the extraction through constrained inference and constraint-aware training. Our work leverages the state-of-the-art OpenIE6 platform, which we adapt to our setting. Through a carefully constructed training dataset and constrained training, we obtain a 29.17% F1-score improvement in the CaRB metric and a 24.37% F1-score improvement in the WIRe57 metric. Our technique has important applications – one of them is investigative journalism, where automatically extracting conflict-of-interest between scientists and funding organizations helps understand the type of relations companies engage with the scientists.
pdf
bib
abs
Hierarchical3D Adapters for Long Video-to-text Summarization
Pinelopi Papalampidi
|
Mirella Lapata
In this paper, we focus on video-to-text summarization and investigate how to best utilize multimodal information for summarizing long inputs (e.g., an hour-long TV show) into long outputs (e.g., a multi-sentence summary). We extend SummScreen (Chen et al., 2022), a dialogue summarization dataset consisting of transcripts of TV episodes with reference summaries, and create a multimodal variant by collecting corresponding full-length videos. We incorporate multimodal information into a pre-trained textual summarizer efficiently using adapter modules augmented with a hierarchical structure while tuning only 3.8% of model parameters. Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
pdf
bib
abs
An Intra-Class Relation Guided Approach for Code Comment Generation
Zhenni Wang
|
Xiaohan Yu
|
Yansong Feng
|
Dongyan Zhao
Code comments are critical for maintaining and comprehending software programs, but they are often missing, mismatched, or outdated in practice. Code comment generation task aims to automatically produce descriptive comments for code snippets. Recently, methods based on the neural encoder-decoder architecture have achieved impressive performance. These methods assume that all the information required to generate comments is encoded in the target function itself, yet in most realistic situations, it is hard to understand a function in isolation from the surrounding context. Furthermore, the global context may contain redundant information that should not be introduced. To address the above issues, we present a novel graph-based learning framework to capture various relations among functions in a class file. Our approach is based on a common real-world scenario in which only a few functions in the source file have human-written comments. Guided by intra-class function relations, our model incorporates contextual information extracted from both the source code and available comments to generate missing comments. We conduct experiments on a Java dataset collected from real-world projects. Experimental results show that the proposed method outperforms competitive baseline models on all automatic and human evaluation metrics.
pdf
bib
abs
Spelling convention sensitivity in neural language models
Elizabeth Nielsen
|
Christo Kirov
|
Brian Roark
We examine whether large neural language models, trained on very large collections of varied English text, learn the potentially long-distance dependency of British versus American spelling conventions, i.e., whether spelling is consistently one or the other within model-generated strings. In contrast to long-distance dependencies in non-surface underlying structure (e.g., syntax), spelling consistency is easier to measure both in LMs and the text corpora used to train them, which can provide additional insight into certain observed model behaviors. Using a set of probe words unique to either British or American English, we first establish that training corpora exhibit substantial (though not total) consistency. A large T5 language model does appear to internalize this consistency, though only with respect to observed lexical items (not nonce words with British/American spelling patterns). We further experiment with correcting for biases in the training data by fine-tuning T5 on synthetic data that has been debiased, and find that finetuned T5 remains only somewhat sensitive to spelling consistency. Further experiments show GPT2 to be similarly limited.
pdf
bib
abs
Modelling Language Acquisition through Syntactico-Semantic Pattern Finding
Jonas Doumen
|
Katrien Beuls
|
Paul Van Eecke
Usage-based theories of language acquisition have extensively documented the processes by which children acquire language through communicative interaction. Notably, Tomasello (2003) distinguishes two main cognitive capacities that underlie human language acquisition: intention reading and pattern finding. Intention reading is the process by which children try to continuously reconstruct the intended meaning of their interlocutors. Pattern finding refers to the process that allows them to distil linguistic schemata from multiple communicative interactions. Even though the fields of cognitive science and psycholinguistics have studied these processes in depth, no faithful computational operationalisations of these mechanisms through which children learn language exist to date. The research on which we report in this paper aims to fill part of this void by introducing a computational operationalisation of syntactico-semantic pattern finding. Concretely, we present a methodology for learning grammars based on similarities and differences in the form and meaning of linguistic observations alone. Our methodology is able to learn compositional lexical and item-based constructions of variable extent and degree of abstraction, along with a network of emergent syntactic categories. We evaluate our methodology on the CLEVR benchmark dataset and show that the methodology allows for fast, incremental and effective learning. The constructions and categorial network that result from the learning process are fully transparent and bidirectional, facilitating both language comprehension and production. Theoretically, our model provides computational evidence for the learnability of usage-based constructionist theories of language acquisition. Practically, the techniques that we present facilitate the learning of computationally tractable, usage-based construction grammars, which are applicable for natural language understanding and production tasks.
pdf
bib
abs
Benchmark Data and Evaluation Framework for Intent Discovery Around COVID-19 Vaccine Hesitancy
Shai Gretz
|
Assaf Toledo
|
Roni Friedman
|
Dan Lahav
|
Rose Weeks
|
Naor Bar-Zeev
|
João Sedoc
|
Pooja Sangha
|
Yoav Katz
|
Noam Slonim
The COVID-19 pandemic has made a huge global impact and cost millions of lives. As COVID-19 vaccines were rolled out, they were quickly met with widespread hesitancy. To address the concerns of hesitant people, we launched VIRA, a public dialogue system aimed at addressing questions and concerns surrounding the COVID-19 vaccines. Here, we release VIRADialogs, a dataset of over 8k dialogues conducted by actual users with VIRA, providing a unique real-world conversational dataset. In light of rapid changes in users’ intents, due to updates in guidelines or in response to new information, we highlight the important task of intent discovery in this use-case. We introduce a novel automatic evaluation framework for intent discovery, leveraging the existing intent classifier of VIRA. We use this framework to report baseline intent discovery results over VIRADialogs, that highlight the difficulty of this task.
pdf
bib
abs
Learning Disentangled Representations for Natural Language Definitions
Danilo Silva De Carvalho
|
Giangiacomo Mercatali
|
Yingji Zhang
|
André Freitas
Disentangling the encodings of neural models is a fundamental aspect for improving interpretability, semantic control and downstream task performance in Natural Language Processing. Currently, most disentanglement methods are unsupervised or rely on synthetic datasets with known generative factors. We argue that recurrent syntactic and semantic regularities in textual data can be used to provide the models with both structural biases and generative factors. We leverage the semantic structures present in a representative and semantically dense category of sentence types, definitional sentences, for training a Variational Autoencoder to learn disentangled representations. Our experimental results show that the proposed model outperforms unsupervised baselines on several qualitative and quantitative benchmarks for disentanglement, and it also improves the results in the downstream task of definition modeling.
pdf
bib
abs
Distinguishability Calibration to In-Context Learning
Hongjing Li
|
Hanqi Yan
|
Yanran Li
|
Li Qian
|
Yulan He
|
Lin Gui
Recent years have witnessed increasing interests in prompt-based learning in which models can be trained on only a few annotated instances, making them suitable in low-resource settings. It is even challenging in fine-grained classification as the pre-trained language models tend to generate similar output embedding which makes it difficult to discriminate for the prompt-based classifier. In this work, we alleviate this information diffusion issue by proposing a calibration method based on a transformation which rotates the embedding feature into a new metric space where we adapt the ratio of each dimension to a uniform distribution to guarantee the distinguishability of learned embeddings. Furthermore, we take the advantage of hyperbolic embedding to capture the relation between dimensions by a coarse-fine metric learning strategy to enhance interpretability. Extensive experiments on the three datasets under various settings demonstrate the effectiveness of our approach.
pdf
bib
abs
Investigating anatomical bias in clinical machine learning algorithms
Jannik Pedersen
|
Martin Laursen
|
Pernille Vinholt
|
Anne Alnor
|
Thiusius Savarimuthu
Clinical machine learning algorithms have shown promising results and could potentially be implemented in clinical practice to provide diagnosis support and improve patient treatment. Barriers for realisation of the algorithms’ full potential include bias which is systematic and unfair discrimination against certain individuals in favor of others. The objective of this work is to measure anatomical bias in clinical text algorithms. We define anatomical bias as unfair algorithmic outcomes against patients with medical conditions in specific anatomical locations. We measure the degree of anatomical bias across two machine learning models and two Danish clinical text classification tasks, and find that clinical text algorithms are highly prone to anatomical bias. We argue that datasets for creating clinical text algorithms should be curated carefully to isolate the effect of anatomical location in order to avoid bias against patient subgroups.
pdf
bib
abs
Topic Ontologies for Arguments
Yamen Ajjour
|
Johannes Kiesel
|
Benno Stein
|
Martin Potthast
Many computational argumentation tasks, such as stance classification, are topic-dependent: The effectiveness of approaches to these tasks depends largely on whether they are trained with arguments on the same topics as those on which they are tested. The key question is: What are these training topics? To answer this question, we take the first step of mapping the argumentation landscape with The Argument Ontology (TAO). TAO draws on three authoritative sources for argument topics: the World Economic Forum, Wikipedia’s list of controversial topics, and Debatepedia. By comparing the topics in our ontology with those in 59 argument corpora, we perform the first comprehensive assessment of their topic coverage. While TAO already covers most of the corpus topics, the corpus topics barely cover all the topics in TAO. This points to a new goal for corpus construction to achieve a broad topic coverage and thus better generalizability of computational argumentation approaches.
pdf
bib
abs
Longtonotes: OntoNotes with Longer Coreference Chains
Kumar Shridhar
|
Nicholas Monath
|
Raghuveer Thirukovalluru
|
Alessandro Stolfo
|
Manzil Zaheer
|
Andrew McCallum
|
Mrinmaya Sachan
Ontonotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in Ontonotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually-curated, merging of annotations from documents that were split into multiple parts in the original Ontonotes annotation process. The resulting corpus, which we call LongtoNotes contains documents in multiple genres of the English language with varying lengths, the longest of which are up to 8x the length of documents in Ontonotes, and 2x those in Litbank.We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length on performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modelling revealed by our new corpus.
pdf
bib
abs
More Robust Schema-Guided Dialogue State Tracking via Tree-Based Paraphrase Ranking
Alexandru Coca
|
Bo-Hsiang Tseng
|
Weizhe Lin
|
Bill Byrne
The schema-guided paradigm overcomes scalability issues inherent in building task-oriented dialogue (TOD) agents with static ontologies. Rather than operating on dialogue context alone, agents have access to hierarchical schemas containing task-relevant natural language descriptions. Fine-tuned language models excel at schema-guided dialogue state tracking (DST) but are sensitive to the writing style of the schemas. We explore methods for improving the robustness of DST models. We propose a framework for generating synthetic schemas which uses tree-based ranking to jointly optimise lexical diversity and semantic faithfulness. The robust generalisation of strong baselines is improved when augmenting their training data with prompts generated by our framework, as demonstrated by marked improvements in average Joint Goal Accuracy (JGA) and schema sensitivity (SS) on the SGD-X benchmark.
pdf
bib
abs
Language Model Decoding as Likelihood–Utility Alignment
Martin Josifoski
|
Maxime Peyrard
|
Frano Rajič
|
Jiheng Wei
|
Debjit Paul
|
Valentin Hartmann
|
Barun Patra
|
Vishrav Chaudhary
|
Emre Kiciman
|
Boi Faltings
A critical component of a successful language generation pipeline is the decoding algorithm. However, the general principles that should guide the choice of a decoding algorithm remain unclear. Previous works only compare decoding algorithms in narrow scenarios, and their findings do not generalize across tasks. We argue that the misalignment between the model’s likelihood and the task-specific notion of utility is the key factor in understanding the effectiveness of decoding algorithms. To structure the discussion, we introduce a taxonomy of misalignment mitigation strategies (MMSs), providing a unifying view of decoding as a tool for alignment. The MMS taxonomy groups decoding algorithms based on their implicit assumptions about likelihood–utility misalignment, yielding general statements about their applicability across tasks. Specifically, by analyzing the correlation between the likelihood and the utility of predictions across a diverse set of tasks, we provide empirical evidence supporting the proposed taxonomy and a set of principles to structure reasoning when choosing a decoding algorithm. Crucially, our analysis is the first to relate likelihood-based decoding algorithms with algorithms that rely on external information, such as value-guided methods and prompting, and covers the most diverse set of tasks to date. Code, data, and models are available at
https://github.com/epfl-dlab/understanding-decoding.
pdf
bib
abs
Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents
Yanfei Dong
|
Lambert Deng
|
Jiazheng Zhang
|
Xiaodong Yu
|
Ting Lin
|
Francesco Gelli
|
Soujanya Poria
|
Wee Sun Lee
Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-Former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities’ attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches in terms of the number of trainable parameters. Despite this, experiments across various datasets show our method outperforms baselines in most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset.
pdf
bib
abs
On the Generalization Ability of Retrieval-Enhanced Transformers
Tobias Norlund
|
Ehsan Doostmohammadi
|
Richard Johansson
|
Marco Kuhlmann
Recent work on the Retrieval-Enhanced Transformer (RETRO) model has shown impressive results: off-loading memory from trainable weights to a retrieval database can significantly improve language modeling and match the performance of non-retrieval models that are an order of magnitude larger in size. It has been suggested that at least some of this performance gain is due to non-trivial generalization based on both model weights and retrieval. In this paper, we try to better understand the relative contributions of these two components. We find that the performance gains from retrieval to a very large extent originate from overlapping tokens between the database and the test data, suggesting less of non-trivial generalization than previously assumed. More generally, our results point to the challenges of evaluating the generalization of retrieval-augmented language models such as RETRO, as even limited token overlap may significantly decrease test-time loss. We release our code and model at
https://github.com/TobiasNorlund/retropdf
bib
abs
Assessing Monotonicity Reasoning in Dutch through Natural Language Inference
Gijs Wijnholds
In this paper we investigate monotonicity reasoning in Dutch, through a novel Natural Language Inference dataset. Monotonicity reasoning shows to be highly challenging for Transformer-based language models in English and here, we corroborate those findings using a parallel Dutch dataset, obtained by translating the Monotonicity Entailment Dataset of Yanaka et al. (2019). After fine-tuning two Dutch language models BERTje and RobBERT on the Dutch NLI dataset SICK-NL, we find that performance severely drops on the monotonicity reasoning dataset, indicating poor generalization capacity of the models. We provide a detailed analysis of the test results by means of the linguistic annotations in the dataset. We find that models struggle with downward entailing contexts, and argue that this is due to a poor understanding of negation. Additionally, we find that the choice of monotonicity context affects model performance on conjunction and disjunction. We hope that this new resource paves the way for further research in generalization of neural reasoning models in Dutch, and contributes to the development of better language technology for Natural Language Inference, specifically for Dutch.
pdf
bib
abs
Noisy Parallel Data Alignment
Ruoyu Xie
|
Antonios Anastasopoulos
An ongoing challenge in current natural language processing is how its major advancements tend to disproportionately favor resource-rich languages, leaving a significant number of under-resourced languages behind. Due to the lack of resources required to train and evaluate models, most modern language technologies are either nonexistent or unreliable to process endangered, local, and non-standardized languages. Optical character recognition (OCR) is often used to convert endangered language documents into machine-readable data. However, such OCR output is typically noisy, and most word alignment models are not built to work under such noisy conditions. In this work, we study the existing word-level alignment models under noisy settings and aim to make them more robust to noisy data. Our noise simulation and structural biasing method, tested on multiple language pairs, manages to reduce the alignment error rate on a state-of-the-art neural-based alignment model up to 59.6%.
pdf
bib
abs
Enhancing Dialogue Generation with Conversational Concept Flows
Siheng Li
|
Wangjie Jiang
|
Pengda Si
|
Cheng Yang
|
Qiu Yao
|
Jinchao Zhang
|
Jie Zhou
|
Yujiu Yang
Human conversations contain natural and reasonable topic shifts, reflected as the concept flows across utterances. Previous researches prove that explicitly modeling concept flows with a large commonsense knowledge graph effectively improves response quality. However, we argue that there exists a gap between the knowledge graph and the conversation. The knowledge graph has limited commonsense knowledge and ignores the characteristics of natural conversations. Thus, many concepts and relations in conversations are not included. To bridge this gap, we propose to enhance dialogue generation with conversational concept flows. Specifically, we extract abundant concepts and relations from natural conversations and build a new conversation-aware knowledge graph. In addition, we design a novel relation-aware graph encoder to capture the concept flows guided by the knowledge graph. Experimental results on the large-scale Reddit conversation dataset indicate that our method performs better than strong baselines, andfurther analysis verifies the effectiveness of each component. All our code and data will be publicly available after acceptance.
pdf
bib
abs
SMHD-GER: A Large-Scale Benchmark Dataset for Automatic Mental Health Detection from Social Media in German
Sourabh Zanwar
|
Daniel Wiechmann
|
Yu Qiao
|
Elma Kerz
Mental health problems are a challenge to our modern society, and their prevalence is predicted to increase worldwide. Recently, a surge of research has demonstrated the potential of automated detection of mental health conditions (MHC) through social media posts, with the ultimate goal of enabling early intervention and monitoring population-level health outcomes in real-time. Progress in this area of research is highly dependent on the availability of high-quality datasets and benchmark corpora. However, the publicly available datasets for understanding and modelling MHC are largely confined to the English language. In this paper, we introduce SMHD-GER (Self-Reported Mental Health Diagnoses for German), a large-scale, carefully constructed dataset for MHC detection built on high-precision patterns and the approach proposed for English. We provide benchmark models for this dataset to facilitate further research and conduct extensive experiments. These models leverage engineered (psycho-)linguistic features as well as BERT-German. We also examine nuanced patterns of linguistic markers characteristics of specific MHC.
pdf
bib
abs
Exploring Data Augmentation for Code Generation Tasks
Pinzhen Chen
|
Gerasimos Lampouras
Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and expanded it through multi-modality and multi-tasking, yet the data for downstream tasks remain modest in size. Focusing on data utilization for downstream tasks, we propose and adapt augmentation methods that yield consistent improvements in code translation and summarization by up to 6.9% and 7.5% respectively. Further analysis suggests that our methods work orthogonally and show benefits in output code style and numeric consistency. We also discuss test data imperfections.
pdf
bib
abs
Stabilized In-Context Learning with Pre-trained Language Models for Few Shot Dialogue State Tracking
Derek Chen
|
Kun Qian
|
Zhou Yu
Prompt-based methods with large pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks. These models improve even further with the addition of a few labeled in-context exemplars to guide output generation. However, for more complex tasks such as dialogue state tracking (DST), designing prompts that reliably convey the desired intent is nontrivial, leading to unstable results. Furthermore, building in-context exemplars for dialogue tasks is difficult because conversational contexts are long while model input lengths are relatively short. To overcome these issues we first adapt a meta-learning scheme to the dialogue domain which stabilizes the ability of the model to perform well under various prompts. We additionally design a novel training method to improve upon vanilla retrieval mechanisms to find ideal in-context examples. Finally, we introduce a saliency model to limit dialogue text length, allowing us to include more exemplars per query. In effect, we are able to achieve highly competitive results for few-shot DST on MultiWOZ.
pdf
bib
abs
Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers
Chia-Chien Hung
|
Anne Lauscher
|
Dirk Hovy
|
Simone Paolo Ponzetto
|
Goran Glavaš
Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods proven effective for incorporating external knowledge into pretrained Transformers (e.g., domain-specific or geographic knowledge). We adapt the language representations for the demographic dimensions of gender and age, using continuous language modeling and dynamic multi-task learning for adaptation, where we couple language modeling objectives with the prediction of demographic classes. Our results, when employing a multilingual PLM, show substantial gains in task performance across four languages (English, German, French, and Danish), which is consistent with the results of previous work. However, controlling for confounding factors – primarily domain and language proficiency of Transformer-based PLMs – shows that downstream performance gains from our demographic adaptation do not actually stem from demographic knowledge. Our results indicate that demographic specialization of PLMs, while holding promise for positive societal impact, still represents an unsolved problem for (modern) NLP.
pdf
bib
abs
JBLiMP: Japanese Benchmark of Linguistic Minimal Pairs
Taiga Someya
|
Yohei Oseki
In this paper, we introduce JBLiMP (Japanese Benchmark of Linguistic Minimal Pairs), a novel dataset for targeted syntactic evaluations of language models in Japanese. JBLiMP consists of 331 minimal pairs, which are created based on acceptability judgments extracted from journal articles in theoretical linguistics. These minimal pairs are grouped into 11 categories, each covering a different linguistic phenomenon. JBLiMP is unique in that it successfully combines two important features independently observed in existing datasets: (i) coverage of complex linguistic phenomena (cf. CoLA) and (ii) presentation of sentences as minimal pairs (cf. BLiMP). In addition, JBLiMP is the first dataset for targeted syntactic evaluations of language models in Japanese, thus allowing the comparison of syntactic knowledge of language models across different languages. We then evaluate the syntactic knowledge of several language models on JBLiMP: GPT-2, LSTM, and n-gram language models. The results demonstrated that all the architectures achieved comparable overall accuracies around 75%. Error analyses by linguistic phenomenon further revealed that these language models successfully captured local dependencies like nominal structures, but not long-distance dependencies such as verbal agreement and binding.
pdf
bib
abs
SMATCH++: Standardized and Extended Evaluation of Semantic Graphs
Juri Opitz
The Smatch metric is a popular method for evaluating graph distances, as is necessary, for instance, to assess the performance of semantic graph parsing systems. However, we observe some issues in the metric that jeopardize meaningful evaluation. E.g., opaque pre-processing choices can affect results, and current graph-alignment solvers do not provide us with upper-bounds. Without upper-bounds, however, fair evaluation is not guaranteed. Furthermore, adaptions of Smatch for extended tasks (e.g., fine-grained semantic similarity) are spread out, and lack a unifying framework. For better inspection, we divide the metric into three modules: pre-processing, alignment, and scoring. Examining each module, we specify its goals and diagnose potential issues, for which we discuss and test mitigation strategies. For pre-processing, we show how to fully conform to annotation guidelines that allow structurally deviating but valid graphs. For safer and enhanced alignment, we show the feasibility of optimal alignment in a standard evaluation setup, and develop a lossless graph compression method that shrinks the search space and significantly increases efficiency. For improved scoring, we propose standardized and extended metric calculation of fine-grained sub-graph meaning aspects. Our code is available at
https://github.com/flipz357/smatchpppdf
bib
abs
An Extended Sequence Tagging Vocabulary for Grammatical Error Correction
Stuart Mesham
|
Christopher Bryant
|
Marek Rei
|
Zheng Yuan
We extend a current sequence-tagging approach to Grammatical Error Correction (GEC) by introducing specialised tags for spelling correction and morphological inflection using the SymSpell and LemmInflect algorithms. Our approach improves generalisation: the proposed new tagset allows a smaller number of tags to correct a larger range of errors. Our results show a performance improvement both overall and in the targeted error categories. We further show that ensembles trained with our new tagset outperform those trained with the baseline tagset on the public BEA benchmark.
pdf
bib
abs
Cheating to Identify Hard Problems for Neural Machine Translation
Proyag Pal
|
Kenneth Heafield
We identify hard problems for neural machine translation models by analyzing progressively higher-scoring translations generated by letting models cheat to various degrees. If a system cheats and still gets something wrong, that suggests it is a hard problem. We experiment with two forms of cheating: providing the model a compressed representation of the target as an additional input, and fine-tuning on the test set. Contrary to popular belief, we find that the most frequent tokens are not necessarily the most accurately translated due to these often being function words and punctuation that can be used more flexibly in translation, or content words which can easily be paraphrased. We systematically analyze system outputs to identify categories of tokens which are particularly hard for the model to translate, and find that this includes certain types of named entities, subordinating conjunctions, and unknown and foreign words. We also encounter a phenomenon where words, often names, which were not infrequent in the training data are still repeatedly mistranslated by the models — we dub this the Fleetwood Mac problem.
pdf
bib
abs
Model-Agnostic Bias Measurement in Link Prediction
Lena Schwertmann
|
Manoj Prabhakar Kannan Ravi
|
Gerard de Melo
Link prediction models based on factual knowledge graphs are commonly used in applications such as search and question answering. However, work investigating social bias in these models has been limited. Previous work focused on knowledge graph embeddings, so more recent classes of models achieving superior results by fine-tuning Transformers have not yet been investigated. We therefore present a model-agnostic approach for bias measurement leveraging fairness metrics to compare bias in knowledge graph embedding-based predictions (KG only) with models that use pre-trained, Transformer-based language models (KG+LM). We further create a dataset to measure gender bias in occupation predictions and assess whether the KG+LM models are more or less biased than KG only models. We find that gender bias tends to be higher for the KG+LM models and analyze potential connections to the accuracy of the models and the data bias inherent in our dataset. Finally, we discuss the limitations and ethical considerations of our work. The repository containing the source code and the data set is publicly available at
https://github.com/lena-schwert/comparing-bias-in-KG-models.
pdf
bib
abs
Divergence-Based Domain Transferability for Zero-Shot Classification
Alexander Pugantsov
|
Richard McCreadie
Transferring learned patterns from pretrained neural language models has been shown to significantly improve effectiveness across a variety of language-based tasks, meanwhile further tuning on intermediate tasks has been demonstrated to provide additional performance benefits, provided the intermediate task is sufficiently related to the target task. However, how to identify related tasks is an open problem, and brute-force searching effective task combinations is prohibitively expensive. Hence, the question arises, are we able to improve the effectiveness and efficiency of tasks with no training examples through selective fine-tuning? In this paper, we explore statistical measures that approximate the divergence between domain representations as a means to estimate whether tuning using one task pair will exhibit performance benefits over tuning another. This estimation can then be used to reduce the number of task pairs that need to be tested by eliminating pairs that are unlikely to provide benefits. Through experimentation over 58 tasks and over 6,600 task pair combinations, we demonstrate that statistical measures can distinguish effective task pairs, and the resulting estimates can reduce end-to-end runtime by up to 40%.
pdf
bib
abs
EDU-level Extractive Summarization with Varying Summary Lengths
Yuping Wu
|
Ching-Hsun Tseng
|
Jiayu Shang
|
Shengzhong Mao
|
Goran Nenadic
|
Xiao-Jun Zeng
Extractive models usually formulate text summarization as extracting fixed top-k salient sentences from the document as a summary. Few works exploited extracting finer-grained Elementary Discourse Unit (EDU) with little analysis and justification for the extractive unit selection. Further, the selection strategy of the fixed top-k salient sentences fits the summarization need poorly, as the number of salient sentences in different documents varies and therefore a common or best k does not exist in reality. To fill these gaps, this paper first conducts the comparison analysis of oracle summaries based on EDUs and sentences, which provides evidence from both theoretical and experimental perspectives to justify and quantify that EDUs make summaries with higher automatic evaluation scores than sentences. Then, considering this merit of EDUs, this paper further proposes an EDU-level extractive model with Varying summary Lengths (EDU-VL) and develops the corresponding learning algorithm. EDU-VL learns to encode and predict probabilities of EDUs in the document, generate multiple candidate summaries with varying lengths based on various k values, and encode and score candidate summaries, in an end-to-end training manner. Finally, EDU-VL is experimented on single and multi-document benchmark datasets and shows improved performances on ROUGE scores in comparison with state-of-the-art extractive models, and further human evaluation suggests that EDU-constituent summaries maintain good grammaticality and readability.
pdf
bib
abs
“Chère maison” or “maison chère”? Transformer-based prediction of adjective placement in French
Eleni Metheniti
|
Tim Van de Cruys
|
Wissam Kerkri
|
Juliette Thuilier
|
Nabil Hathout
In French, the placement of the adjective within a noun phrase is subject to variation: it can appear either before or after the noun. We conduct experiments to assess whether transformer-based language models are able to learn the adjective position in noun phrases in French –a position which depends on several linguistic factors. Prior findings have shown that transformer models are insensitive to permutated word order, but in this work, we show that finetuned models are successful at learning and selecting the correct position of the adjective. However, this success can be attributed to the process of finetuning rather than the linguistic knowledge acquired during pretraining, as evidenced by the low accuracy of experiments of classification that make use of pretrained embeddings. Comparing the finetuned models to the choices of native speakers (with a questionnaire), we notice that the models favor context and global syntactic roles, and are weaker with complex structures and fixed expressions.
pdf
bib
abs
On the Role of Reviewer Expertise in Temporal Review Helpfulness Prediction
Mir Tafseer Nayeem
|
Davood Rafiei
Helpful reviews have been essential for the success of e-commerce services, as they help customers make quick purchase decisions and benefit the merchants in their sales. While many reviews are informative, others provide little value and may contain spam, excessive appraisal, or unexpected biases. With the large volume of reviews and their uneven quality, the problem of detecting helpful reviews has drawn much attention lately. Existing methods for identifying helpful reviews primarily focus on review text and ignore the two key factors of (1) who post the reviews and (2) when the reviews are posted. Moreover, the helpfulness votes suffer from scarcity for less popular products and recently submitted (a.k.a., cold-start) reviews. To address these challenges, we introduce a dataset and develop a model that integrates the reviewer’s expertise, derived from the past review history of the reviewers, and the temporal dynamics of the reviews to automatically assess review helpfulness. We conduct experiments on our dataset to demonstrate the effectiveness of incorporating these factors and report improved results compared to several well-established baselines.
pdf
bib
abs
Towards a Unified Model for Generating Answers and Explanations in Visual Question Answering
Chenxi Whitehouse
|
Tillman Weyde
|
Pranava Madhyastha
The field of visual question answering (VQA) has recently seen a surge in research focused on providing explanations for predicted answers. However, current systems mostly rely on separate models to predict answers and generate explanations, leading to less grounded and frequently inconsistent results. To address this, we propose a multitask learning approach towards a Unified Model for Answer and Explanation generation (UMAE). Our approach involves the addition of artificial prompt tokens to training data and fine-tuning a multimodal encoder-decoder model on a variety of VQA-related tasks. In our experiments, UMAE models surpass the prior state-of-the-art answer accuracy on A-OKVQA by 10 15%, show competitive results on OK-VQA, achieve new state-of-the-art explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
pdf
bib
abs
Machine Translation between Spoken Languages and Signed Languages Represented in SignWriting
Zifan Jiang
|
Amit Moryossef
|
Mathias Müller
|
Sarah Ebling
This paper presents work on novel machine translation (MT) systems between spoken and signed languages, where signed languages are represented in SignWriting, a sign language writing system. Our work seeks to address the lack of out-of-the-box support for signed languages in current MT systems and is based on the SignBank dataset, which contains pairs of spoken language text and SignWriting content. We introduce novel methods to parse, factorize, decode, and evaluate SignWriting, leveraging ideas from neural factored MT. In a bilingual setup—translating from American Sign Language to (American) English—our method achieves over 30 BLEU, while in two multilingual setups—translating in both directions between spoken languages and signed languages—we achieve over 20 BLEU. We find that common MT techniques used to improve spoken language translation similarly affect the performance of sign language translation. These findings validate our use of an intermediate text representation for signed languages to include them in natural language processing research.
pdf
bib
abs
A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
Jimin Sun
|
Patrick Fernandes
|
Xinyi Wang
|
Graham Neubig
Recent works on tokenizer-free multilingual pretrained models show promising results in improving cross-lingual transfer and reducing engineering overhead compared to subword-based alternatives. However, previous work mainly focuses on reporting accuracy on a limited set of tasks and data settings, placing less emphasis on other important factors when tuning and deploying the models in practice, such as memory usage, inference speed, and finetuning data efficiency. We attempt to fill this gap by performing a comprehensive empirical comparison of multilingual tokenizer-free and subword-based models considering the various dimensions. Surprisingly, we find that subword-based models might still be the most practical choice in many settings, achieving better performance for lower inference latency and memory usage. Based on these results, we encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.
pdf
bib
abs
Neural Ranking with Weak Supervision for Open-Domain Question Answering : A Survey
Xiaoyu Shen
|
Svitlana Vakulenko
|
Marco del Tredici
|
Gianni Barlacchi
|
Bill Byrne
|
Adria de Gispert
Neural ranking (NR) has become a key component for open-domain question-answering in order to access external knowledge. However, training a good NR model requires substantial amounts of relevance annotations, which is very costly to scale. To address this, a growing body of research works have been proposed to reduce the annotation cost by training the NR model with weak supervision (WS) instead. These works differ in what resources they require and employ a diverse set of WS signals to train the model. Understanding such differences is crucial for choosing the right WS technique. To facilitate this understanding, we provide a structured overview of standard WS signals used for training a NR model. Based on their required resources, we divide them into three main categories: (1) only documents are needed; (2) documents and questions are needed; and (3) documents and question-answer pairs are needed. For every WS signal, we review its general idea and choices. Promising directions are outlined for future research.
pdf
bib
abs
Double Retrieval and Ranking for Accurate Question Answering
Zeyu Zhang
|
Thuy Vu
|
Alessandro Moschitti
Recent work has shown that an answer verification step introduced in Transformer-based answer selection models can significantly improve the state of the art in Question Answering. This step is performed by aggregating the embeddings of top k answer candidates to support the verification of a target answer. Although the approach is intuitive and sound, it still shows two limitations: (i) the supporting candidates are ranked only according to the relevancy with the question and not with the answer, and (ii) the support provided by the other answer candidates is suboptimal as these are retrieved independently of the target answer. In this paper, we address both drawbacks by proposing (i) a double reranking model, which, for each target answer, selects the best support; and (ii) a second neural retrieval stage designed to encode question and answer pair as the query, which finds more specific verification information. The results on well-known datasets for Answer Sentence Selection show significant improvement over the state of the art.
pdf
bib
abs
Evaluating the Diversity, Equity, and Inclusion of NLP Technology: A Case Study for Indian Languages
Simran Khanuja
|
Sebastian Ruder
|
Partha Talukdar
In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world’s languages, be equitable, i.e., not unduly biased towards any particular language, and be inclusive of all users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions. While diversity and inclusion have received attention in recent literature, equity is currently unexplored. We propose to address this gap using the Gini coefficient, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of current technologies for Indian (IN) languages (a linguistically large and diverse set, with a varied speaker population), across all three dimensions. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation, and more importantly, propose a novel, generalisable approach to optimal resource allocation during fine-tuning. Finally, we discuss steps to mitigate these biases and encourage the community to employ multi-faceted evaluation when building linguistically diverse and equitable technologies.
pdf
bib
abs
Joint Reasoning on Hybrid-knowledge sources for Task-Oriented Dialog
Mayank Mishra
|
Danish Contractor
|
Dinesh Raghu
Traditional systems designed for task oriented dialog utilize knowledge present only in structured knowledge sources to generate responses. However, relevant information required to generate responses may also reside in unstructured sources, such as documents. Recent state of the art models such as HyKnow (Gao et al., 2021b) and SEKNOW (Gao et al., 2021a) aimed at overcoming these challenges make limiting assumptions about the knowledge sources. For instance, these systems assume that certain types of information, such as a phone number, is always present in a structured knowledge base (KB) while information about aspects such as entrance ticket prices, would always be available in documents. In this paper, we create a modified version of the MutliWOZ-based dataset prepared by (Gao et al., 2021a) to demonstrate how current methods have significant degradation in performance when strict assumptions about the source of information are removed. Then, in line with recent work exploiting pre-trained language models, we fine-tune a BART (Lewiset al., 2020) based model using prompts (Brown et al., 2020; Sun et al., 2021) for the tasks of querying knowledge sources, as well as, for response generation, without makingassumptions about the information present in each knowledge source. Through a series of experiments, we demonstrate that our model is robust to perturbations to knowledge modality (source of information), and that it can fuse information from structured as well as unstructured knowledge to generate responses.
pdf
bib
abs
Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models
Mohammadreza Banaei
|
Klaudia Bałazy
|
Artur Kasymov
|
Rémi Lebret
|
Jacek Tabor
|
Karl Aberer
Recent transformer language models achieve outstanding results in many natural language processing (NLP) tasks. However, their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks. In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model. We challenge the classical matrix factorization methods by proposing a novel, better-performing autoencoder-based framework. We perform a comprehensive ablation study of our approach, examining its different aspects over a diverse set of evaluation settings. Moreover, we show that enabling collaboration between modules across layers by compressing certain modules together positively impacts the final model performance. Experiments on various NLP tasks demonstrate that our approach significantly outperforms commonly used factorization-based offline compression methods.
pdf
bib
abs
PriMeSRL-Eval: A Practical Quality Metric for Semantic Role Labeling Systems Evaluation
Ishan Jindal
|
Alexandre Rademaker
|
Khoi-Nguyen Tran
|
Huaiyu Zhu
|
Hiroshi Kanayama
|
Marina Danilevsky
|
Yunyao Li
Semantic role labeling (SRL) identifies the predicate-argument structure in a sentence. This task is usually accomplished in four steps: predicate identification, predicate sense disambiguation, argument identification, and argument classification. Errors introduced at one step propagate to later steps. Unfortunately, the existing SRL evaluation scripts do not consider the full effect of this error propagation aspect. They either evaluate arguments independent of predicate sense (CoNLL09) or do not evaluate predicate sense at all (CoNLL05), yielding an inaccurate SRL model performance on the argument classification task. In this paper, we address key practical issues with existing evaluation scripts and propose a more strict SRL evaluation metric PriMeSRL. We observe that by employing PriMeSRL, the quality evaluation of all SoTA SRL models drops significantly, and their relative rankings also change. We also show that PriMeSRLsuccessfully penalizes actual failures in SoTA SRL models.
pdf
bib
abs
Prompt-based Learning for Text Readability Assessment
Bruce W. Lee
|
Jason Lee
We propose the novel adaptation of a pre-trained seq2seq model for readability assessment. We prove that a seq2seq model - T5 or BART - can be adapted to discern which text is more difficult from two given texts (pairwise). As an exploratory study to prompt-learn a neural network for text readability in a text-to-text manner, we report useful tips for future work in seq2seq training and ranking-based approach to readability assessment. Specifically, we test nine input-output formats/prefixes and show that they can significantly influence the final model performance. Also, we argue that the combination of text-to-text training and pairwise ranking setup 1) enables leveraging multiple parallel text simplification data for teaching readability and 2) trains a neural model for the general concept of readability (therefore, better cross-domain generalization). At last, we report a 99.6% pairwise classification accuracy on Newsela and a 98.7% for OneStopEnglish, through a joint training approach. Our code is available at github.com/brucewlee/prompt-learning-readability.
pdf
bib
abs
Best Practices in the Creation and Use of Emotion Lexicons
Saif Mohammad
Words play a central role in how we express ourselves. Lexicons of word–emotion associations are widely used in research and real-world applications for sentiment analysis, tracking emotions associated with products and policies, studying health disorders, tracking emotional arcs of stories, and so on. However, inappropriate and incorrect use of these lexicons can lead to not just sub-optimal results, but also inferences that are directly harmful to people. This paper brings together ideas from Affective Computing and AI Ethics to present, some of the practical and ethical considerations involved in the creation and use of emotion lexicons – best practices. The goal is to provide a comprehensive set of relevant considerations, so that readers (especially those new to work with emotions) can find relevant information in one place. We hope this work will facilitate more thoughtfulness when one is deciding on what emotions to work on, how to create an emotion lexicon, how to use an emotion lexicon, how to draw meaningful inferences, and how to judge success.
pdf
bib
abs
The Role of Semantic Parsing in Understanding Procedural Text
Hossein Rajaby Faghihi
|
Parisa Kordjamshidi
|
Choh Man Teng
|
James Allen
In this paper, we investigate whether symbolic semantic representations, extracted from deep semantic parsers, can help reasoning over the states of involved entities in a procedural text. We consider a deep semantic parser (TRIPS) and semantic role labeling as two sources of semantic parsing knowledge. First, we propose PROPOLIS, a symbolic parsing-based procedural reasoning framework. Second, we integrate semantic parsing information into state-of-the-art neural models to conduct procedural reasoning. Our experiments indicate that explicitly incorporating such semantic knowledge improves procedural understanding. This paper presents new metrics for evaluating procedural reasoning tasks that clarify the challenges and identify differences among neural, symbolic, and integrated models.
pdf
bib
abs
Named Entity Recognition in a Very Homogenous Domain
Oshin Agarwal
|
Ani Nenkova
Machine Learning models have lower accuracy when tested on out-of-domain data. Developing models that perform well on several domains or can be quickly adapted to a new domain is an important research area. Domain, however, is a vague term, that can refer to any aspect of data such as language, genre, source and structure. We consider a very homogeneous source of data, specifically sentences from news articles from the same newspaper in English, and collect a dataset of such “in-domain” sentences annotated with named entities. We find that even in such a homogeneous domain, the performance of named entity recognition models varies significantly across news topics. Selection of diverse data, as we demonstrate, is crucial even in a seemingly homogeneous domain.
pdf
bib
abs
Crawling The Internal Knowledge-Base of Language Models
Roi Cohen
|
Mor Geva
|
Jonathan Berant
|
Amir Globerson
Language models are trained on large volumes of text, and as a result their parameters might contain a significant body of factual knowledge. Any downstream task performed by these models implicitly builds on these facts, and thus it is highly desirable to have means for representing this body of knowledge in an interpretable way. However, there is currently no mechanism for such a representation. Here, we propose to address this goal by extracting a knowledge-graph of facts from a given language model. We describe a procedure for “crawling” the internal knowledge-base of a language model. Specifically, given a seed entity, we expand a knowledge-graph around it. The crawling procedure is decomposed into sub-tasks, realized through specially designed prompts that control for both precision (i.e., that no wrong facts are generated) and recall (i.e., the number of facts generated). We evaluate our approach on graphs crawled starting from dozens of seed entities, and show it yields high precision graphs (82-92%), while emitting a reasonable number of facts per entity.
pdf
bib
abs
Intent Identification and Entity Extraction for Healthcare Queries in Indic Languages
Ankan Mullick
|
Ishani Mondal
|
Sourjyadip Ray
|
Raghav R
|
G Chaitanya
|
Pawan Goyal
Scarcity of data and technological limitations for resource-poor languages in developing countries like India poses a threat to the development of sophisticated NLU systems for healthcare. To assess the current status of various state-of-the-art language models in healthcare, this paper studies the problem by initially proposing two different Healthcare datasets, Indian Healthcare Query Intent-WebMD and 1mg (IHQID-WebMD and IHQID-1mg) and one real world Indian hospital query data in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati) which are annotated with the query intents as well as entities. Our aim is to detect query intents and corresponding entities. We perform extensive experiments on a set of models which in various realistic settings and explore two scenarios based on the access to English data only (less costly) and access to target language data (more expensive). We analyze context specific practical relevancy through empirical analysis. The results, expressed in terms of overall F-score show that our approach is practically useful to identify intents and entities.
pdf
bib
abs
Text-Derived Knowledge Helps Vision: A Simple Cross-modal Distillation for Video-based Action Anticipation
Sayontan Ghosh
|
Tanvi Aggarwal
|
Minh Hoai
|
Niranjan Balasubramanian
Anticipating future actions in a video is useful for many autonomous and assistive technologies. Prior action anticipation work mostly treat this as a vision modality problem, where the models learn the task information primarily from the video features in the action anticipation datasets. However, knowledge about action sequences can also be obtained from external textual data. In this work, we show how knowledge in pretrained language models can be adapted and distilled into vision based action anticipation models. We show that a simple distillation technique can achieve effective knowledge transfer and provide consistent gains on a strong vision model (Anticipative Vision Transformer) for two action anticipation datasets (3.5% relative gain on EGTEA-GAZE+ and 7.2% relative gain on EPIC-KITCHEN 55), giving a new state-of-the-art result.
pdf
bib
abs
Simple Yet Effective Synthetic Dataset Construction for Unsupervised Opinion Summarization
Ming Shen
|
Jie Ma
|
Shuai Wang
|
Yogarshi Vyas
|
Kalpit Dixit
|
Miguel Ballesteros
|
Yassine Benajiba
Opinion summarization provides an important solution for summarizing opinions expressed among a large number of reviews. However, generating aspect-specific and general summaries is challenging due to the lack of annotated data. In this work, we propose two simple yet effective unsupervised approaches to generate both aspect-specific and general opinion summaries by training on synthetic datasets constructed with aspect-related review contents. Our first approach, Seed Words Based Leave-One-Out (SW-LOO), identifies aspect-related portions of reviews simply by exact-matching aspect seed words and outperforms existing methods by 3.4 ROUGE-L points on Space and 0.5 ROUGE-1 point on Oposum+ for aspect-specific opinion summarization. Our second approach, Natural Language Inference Based Leave-One-Out (NLI-LOO) identifies aspect-related sentences utilizing an NLI model in a more general setting without using seed words and outperforms existing approaches by 1.2 ROUGE-L points on Space for aspect-specific opinion summarization and remains competitive on other metrics.
pdf
bib
abs
Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation
Mohammadreza Tayaranian Hosseini
|
Alireza Ghaffari
|
Marzieh S. Tahaei
|
Mehdi Rezagholizadeh
|
Masoud Asgharian
|
Vahid Partovi Nia
The large number of parameters of some prominent language models, such as BERT, makes their fine-tuning on downstream tasks computationally intensive and energy hungry. Previously researchers were focused on lower bit-width integer data types for the forward propagation of language models to save memory and computation. As for the backward propagation, however, only 16-bit floating-point data type has been used for the fine-tuning of BERT.In this work, we use integer arithmetic for both forward and back propagation in the fine-tuning of BERT.We study the effects of varying the integer bit-width on the model’s metric performance. Our integer fine-tuning uses integer arithmetic to perform forward propagation and gradient computation of linear, layer-norm, and embedding layers of BERT.We fine-tune BERT using our integer training method on SQuAD v1.1 and SQuAD v2., and GLUE benchmark. We demonstrate that metric performance of fine-tuning 16-bit integer BERT matches both 16-bit and 32-bit floating-point baselines. Furthermore, using the faster and more memory efficient 8-bit integer data type, integer fine-tuning of BERT loses an average of 3.1 points compared to the FP32 baseline.
pdf
bib
abs
Data Augmentation for Radiology Report Simplification
Ziyu Yang
|
Santhosh Cherian
|
Slobodan Vucetic
This work considers the development of a text simplification model to help patients better understand their radiology reports. This paper proposes a data augmentation approach to address the data scarcity issue caused by the high cost of manual simplification. It prompts a large foundational pre-trained language model to generate simplifications of unlabeled radiology sentences. In addition, it uses paraphrasing of labeled radiology sentences. Experimental results show that the proposed data augmentation approach enables the training of a significantly more accurate simplification model than the baselines.
pdf
bib
abs
Embedding Recycling for Language Models
Jon Saad-Falcon
|
Amanpreet Singh
|
Luca Soldaini
|
Mike D’Arcy
|
Arman Cohan
|
Doug Downey
Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for overhead costs. We perform an extensive evaluation of ER across eight different models (17 to 900 million parameters) and fourteen tasks in English. We show how a simple ER technique that caches activations from an intermediate layer of a pretrained model, and learns task-specific adapters on the later layers, is broadly effective. For the best-performing baseline in our experiments (DeBERTa-v2 XL), adding a precomputed cache results in a 90% speedup during training and 87-91% speedup for inference, with negligible impact on accuracy. Our analysis reveals important areas of future work.
pdf
bib
abs
Trained on 100 million words and still in shape: BERT meets British National Corpus
David Samuel
|
Andrey Kutuzov
|
Lilja Øvrelid
|
Erik Velldal
While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source – the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.
pdf
bib
abs
Generating Synthetic Speech from SpokenVocab for Speech Translation
Jinming Zhao
|
Gholamreza Haffari
|
Ehsan Shareghi
Training end-to-end speech translation (ST) systems requires sufficiently large-scale data, which is unavailable for most language pairs and domains. One practical solution to the data scarcity issue is to convert text-based machine translation (MT) data to ST data via text-to-speech (TTS) systems. Yet, using TTS systems can be tedious and slow. In this work, we propose SpokenVocab, a simple, scalable and effective data augmentation technique to convert MT data to ST data on-the-fly. The idea is to retrieve and stitch audio snippets, corresponding to words in an MT sentence, from a spoken vocabulary bank. Our experiments on multiple language pairs show that stitched speech helps to improve translation quality by an average of 1.83 BLEU score, while performing equally well as TTS-generated speech in improving translation quality. We also showcase how SpokenVocab can be applied in code-switching ST for which often no TTS systems exit.
pdf
bib
abs
Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints
Albert Lu
|
Hongxin Zhang
|
Yanzhe Zhang
|
Xuezhi Wang
|
Diyi Yang
The limits of open-ended generative models are unclear, yet increasingly important. What causes them to succeed and what causes them to fail? In this paper, we take a prompt-centric approach to analyzing and bounding the abilities of open-ended generative models. We present a generic methodology of analysis with two challenging prompt constraint types: structural and stylistic. These constraint types are categorized into a set of well-defined constraints that are analyzable by a single prompt. We then systematically create a diverse set of simple, natural, and useful prompts to robustly analyze each individual constraint. Using the GPT-3 text-davinci-002 model as a case study, we generate outputs from our collection of prompts and analyze the model’s generative failures. We also show the generalizability of our proposed method on other large models like BLOOM and OPT. Our results and our in-context mitigation strategies reveal open challenges for future research.
pdf
bib
abs
Learning to Retrieve Engaging Follow-Up Queries
Christopher Richardson
|
Sudipta Kar
|
Anjishnu Kumar
|
Anand Ramachandran
|
Zeynab Raeesy
|
Omar Khan
|
Abhinav Sethy
Open domain conversational agents can answer a broad range of targeted queries. However, the sequential nature of interaction with these systems makes knowledge exploration a lengthy task which burdens the user with asking a chain of well phrased questions. In this paper, we present a retrieval based system and associated dataset for predicting the next questions that the user might have. Such a system can proactively assist users in knowledge exploration leading to a more engaging dialog. The retrieval system is trained on a dataset called the Follow-up Query Bank (FQ-Bank). FQ-Bank contains ~14K multi-turn information-seeking conversations with a valid follow-up question and a set of invalid candidates. The invalid candidates are generated to simulate various syntactic and semantic confounders such as paraphrases, partial entity match, irrelevant entity, and ASR errors. We use confounder specific techniques to simulate these negative examples on the OR-QuAC dataset. Then, we train ranking models on FQ-Bank and present results comparing supervised and unsupervised approaches. The results suggest that we can retrieve the valid follow-ups by ranking them in higher positions compared to confounders, but further knowledge grounding can improve ranking performance.FQ-Bank is publicly available at
https://github.com/amazon-science/fq-bank.
pdf
bib
abs
Selective-LAMA: Selective Prediction for Confidence-Aware Evaluation of Language Models
Hiyori Yoshikawa
|
Naoaki Okazaki
Recent studies have suggested that neural language models learn and store a large amount of facts and commonsense knowledge from training data. The ability of language models to restore such knowledge is often evaluated via zero-shot cloze-style QA tasks. However, such evaluations rely only on prediction accuracy without punishing the systems for their mistakes, e.g., simply guessing or hallucinating likely answers. Selective prediction is a more informative evaluation framework that takes the confidence of predictions into account. Under the selective prediction setting, a model is evaluated not only by the number of correct predictions, but also by the ability to filter out dubious predictions by estimating the confidence of individual predictions. Such confidence-aware evaluation is crucial for determining whether to trust zero-shot predictions of language models. In this paper, we apply the selective prediction setting to an existing benchmark, LAMA probe, and conduct extensive experiments with recent neural language models and different confidence functions. We empirically show that our Selective-LAMA evaluation is more robust to the effect of simple guesses than the conventional accuracy-based evaluation. Our evaluation reveals the importance of the choice of confidence functions by showing that simply relying on token probabilities is not always the best choice. Further analysis shows that various confidence functions exhibit different preferences over predicted tokens for a given context.
pdf
bib
abs
Multi-View Source Ablation for Faithful Summarization
Shuyang Cao
|
Liang Ma
|
Di Lu
|
Robert L Logan IV
|
Joel Tetreault
|
Alejandro Jaimes
In this paper, we present MuFaSSa (Multi-view Faithfulness Scoring via Source Ablation), a metric for evaluating faithfulness of abstractive summaries, and for guiding training of more faithful summarizers. For evaluation, MuFaSSa employs different strategies (e.g., masking entity mentions) to first remove information from the source document to form multiple ablated views. Then, the faithfulness level of each token in a generated summary is measured by the difference between the token generation probabilities when given the original document and the ablated document as inputs to trained summarizers. For training, MuFaSSa uses a novel word truncation objective that drops unfaithful tokens located by MuFaSSa in both the decoder input and output. Alignments with human-annotated faithfulness labels on AggreFact show that MuFaSSa is comparable to or better than existing metrics built on classifiers or QA models pre-trained on other tasks. In experiments on summarization with XSum and CNN/DailyMail, models trained with word truncation using MuFaSSa outperform competitive methods according to both automatic faithfulness metrics and human assessments.
pdf
bib
abs
Mining Effective Features Using Quantum Entropy for Humor Recognition
Yang Liu
|
Yuexian Hou
Humor recognition has been extensively studied with different methods in the past years. However, existing studies on humor recognition do not understand the mechanisms that generate humor. In this paper, inspired by the incongruity theory, any joke can be divided into two components (the setup and the punchline). Both components have multiple possible semantics, and there is an incongruous relationship between them. We use density matrices to represent the semantic uncertainty of the setup and the punchline, respectively, and design QE-Uncertainty and QE-Incongruity with the help of quantum entropy as features for humor recognition. The experimental results on the SemEval2021 Task 7 dataset show that the proposed features are more effective than the baselines for recognizing humorous and non-humorous texts.
pdf
bib
abs
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
Alexandra Chronopoulou
|
Matthew Peters
|
Alexander Fraser
|
Jesse Dodge
Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A parameter-efficient adaptation method suggests training an adapter for each domain on the task of language modeling. This leads to good in-domain scores but can be impractical for domain- or resource-restricted settings. A solution is to use a related-domain adapter for the novel domain at test time. In this paper, we introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains. Our approach is embarrassingly parallel: first, we train a set of domain-specific adapters; then, for each novel domain, we determine which adapters should be averaged at test time. We present extensive experiments showing that AdapterSoup consistently improves performance to new domains without extra training. We also explore weight averaging of adapters trained on the same domain with different hyper-parameters, and show that it preserves the performance of a PLM on new domains while obtaining strong in-domain results. We explore various approaches for choosing which adapters to combine, such as text clustering and semantic similarity. We find that using clustering leads to the most competitive results on novel domains.
pdf
bib
abs
Towards End-to-End Open Conversational Machine Reading
Sizhe Zhou
|
Siru Ouyang
|
Zhuosheng Zhang
|
Hai Zhao
In open-retrieval conversational machine reading (OR-CMR) task, machines are required to do multi-turn question answering given dialogue history and a textual knowledge base. Existing works generally utilize two independent modules to approach this problem’s two successive sub-tasks: first with a hard-label decision making and second with a question generation aided by various entailment reasoning methods. Such usual cascaded modeling is vulnerable to error propagation and prevents the two sub-tasks from being consistently optimized. In this work, we instead model OR-CMR as a unified text-to-text task in a fully end-to-end style. Experiments on the ShARC and OR-ShARC dataset show the effectiveness of our proposed end-to-end framework on both sub-tasks by a large margin, achieving new state-of-the-art results. Further ablation studies support that our framework can generalize to different backbone models.
pdf
bib
abs
Generative Knowledge Selection for Knowledge-Grounded Dialogues
Weiwei Sun
|
Pengjie Ren
|
Zhaochun Ren
Knowledge selection is the key in knowledge-grounded dialogues (KGD), which aims to select an appropriate knowledge snippet to be used in the utterance based on dialogue history. Previous studies mainly employ the classification approach to classify each candidate snippet as “relevant” or “irrelevant” independently. However, such approaches neglect the interactions between snippets, leading to difficulties in inferring the meaning of snippets. Moreover, they lack modeling of the discourse structure of dialogue-knowledge interactions. We propose a simple yet effective generative approach for knowledge selection, called GenKS. GenKS learns to select snippets by generating their identifiers with a sequence-to-sequence model. GenKS therefore captures intra-knowledge interaction inherently through attention mechanisms. Meanwhile, we devise a hyperlink mechanism to model the dialogue-knowledge interactions explicitly. We conduct experiments on three benchmark datasets, and verify GenKS achieves the best results on both knowledge selection and response generation.
pdf
bib
abs
Evaluating the Tradeoff Between Abstractiveness and Factuality in Abstractive Summarization
Markus Dreyer
|
Mengwen Liu
|
Feng Nan
|
Sandeep Atluri
|
Sujith Ravi
Neural models for abstractive summarization tend to generate output that is fluent and well-formed but lacks semantic faithfulness, or factuality, with respect to the input documents. In this paper, we analyze the tradeoff between abstractiveness and factuality of generated summaries across multiple datasets and models, using extensive human evaluations of factuality. In our analysis, we visualize the rates of change in factuality as we gradually increase abstractiveness using a decoding constraint, and we observe that, while increased abstractiveness generally leads to a drop in factuality, the rate of factuality decay depends on factors such as the data that the system was trained on. We introduce two datasets with human factuality judgements; one containing 10.2k generated summaries with systematically varied degrees of abstractiveness; the other containing 4.2k summaries from five different summarization models. We propose new factuality metrics that adjust for the degree of abstractiveness, and we use them to compare the abstractiveness-adjusted factuality of previous summarization works, providing baselines for future work.
pdf
bib
abs
Fairness in Language Models Beyond English: Gaps and Challenges
Krithika Ramesh
|
Sunayana Sitaram
|
Monojit Choudhury
With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors. Most research on evaluating and mitigating fairness harms has been concentrated on English, while multilingual models and non-English languages have received comparatively little attention. In this paper, we survey different aspects of fairness in languages beyond English and multilingual contexts. This paper presents a survey of fairness in multilingual and non-English contexts, highlighting the shortcomings of current research and the difficulties faced by methods designed for English. We contend that the multitude of diverse cultures and languages across the world makes it infeasible to achieve comprehensive coverage in terms of constructing fairness datasets. Thus, the measurement and mitigation of biases must evolve beyond the current dataset-driven practices that are narrowly focused on specific dimensions and types of biases and, therefore, impossible to scale across languages and cultures.
pdf
bib
abs
Global-Local Modeling with Prompt-Based Knowledge Enhancement for Emotion Inference in Conversation
Renxi Wang
|
Shi Feng
The ability to recognize emotions in conversations is necessary and important for the online chatbot to do tasks such as empathetic response generation and emotional support. Present researches mainly focus on recognizing emotions through a speaker’s utterance, while research on emotion inference predicts emotions of addressees through previous utterances. Because of the lack of the addressee’s utterance, emotion inference is more challenging than emotion recognition. In this paper, we propose a global-local modeling method based on recurrent neural networks (RNN) and pre-trained language models (PLM) to do emotion inference, which utilizes the sequence modeling ability of RNNs and abundant knowledge from PLMs. Moreover, we take the whole dialogue history as input of PLM to generate knowledge by in-context learning. Experimental results show that our model with knoledge enhancement achieves state-of-the-art performance on all three datasets.
pdf
bib
abs
Headline Token-based Discriminative Learning for Subheading Generation in News Article
Joonwon Jang
|
Misuk Kim
The news subheading summarizes an article’s contents in several sentences to support the headline limited to solely conveying the main contents. So, it is necessary to generate compelling news subheadings in consideration of the structural characteristics of the news. In this paper, we propose a subheading generation model using topical headline information. We introduce a discriminative learning method that utilizes the prediction result of masked headline tokens. Experiments show that the proposed model is effective and outperforms the comparative models on three news datasets written in two languages. We also show that our model performs robustly on a small dataset and various masking ratios. Qualitative analysis and human evaluations also show that the overall quality of generated subheadings improved over the comparative models.
pdf
bib
abs
Decipherment as Regression: Solving Historical Substitution Ciphers by Learning Symbol Recurrence Relations
Nishant Kambhatla
|
Logan Born
|
Anoop Sarkar
Solving substitution ciphers involves mapping sequences of cipher symbols to fluent text in a target language. This has conventionally been formulated as a search problem, to find the decipherment key using a character-level language model to constrain the search space. This work instead frames decipherment as a sequence prediction task, using a Transformer-based causal language model to learn recurrences between characters in a ciphertext. We introduce a novel technique for transcribing arbitrary substitution ciphers into a common recurrence encoding. By leveraging this technique, we (i) create a large synthetic dataset of homophonic ciphers using random keys, and (ii) train a decipherment model that predicts the plaintext sequence given a recurrence-encoded ciphertext. Our method achieves strong results on synthetic 1:1 and homophonic ciphers, and cracks several real historic homophonic ciphers. Our analysis shows that the model learns recurrence relations between cipher symbols and recovers decipherment keys in its self-attention.
pdf
bib
abs
A Survey on Recent Advances in Keyphrase Extraction from Pre-trained Language Models
Mingyang Song
|
Yi Feng
|
Liping Jing
Keyphrase Extraction (KE) is a critical component in Natural Language Processing (NLP) systems for selecting a set of phrases from the document that could summarize the important information discussed in the document. Typically, a keyphrase extraction system can significantly accelerate the speed of information retrieval and help people get first-hand information from a long document quickly and accurately. Specifically, keyphrases are capable of providing semantic metadata characterizing documents and producing an overview of the content of a document. In this paper, we introduce keyphrase extraction, present a review of the recent studies based on pre-trained language models, offer interesting insights on the different approaches, highlight open issues, and give a comparative experimental study of popular supervised as well as unsupervised techniques on several datasets. To encourage more instantiations, we release the related files mentioned in this paper.
pdf
bib
abs
Prompting for explanations improves Adversarial NLI. Is this true? {Yes} it is {true} because {it weakens superficial cues}
Pride Kavumba
|
Ana Brassard
|
Benjamin Heinzerling
|
Kentaro Inui
Explanation prompts ask language models to not only assign a particular label to a giveninput, such as true, entailment, or contradiction in the case of natural language inference but also to generate a free-text explanation that supports this label. For example: “This is label because explanation.” While this type of prompt was originally introduced with the aim of improving model interpretability, we showhere that explanation prompts also improve robustness to adversarial perturbations in naturallanguage inference benchmarks. Compared to prompting for labels only, explanation prompting consistently yields stronger performance on adversarial benchmarks, outperforming the state of the art on Adversarial Natural Language Inference, Counterfactually-Augmented Natural Language Inference, and SNLI-Hard datasets. We argue that the increase in robustness is due to the fact that prompting for explanations weakens superficial cues. Specifically, single tokens that are highly predictive of the correct answer in the label-only setting become uninformative when the model also has to generate explanations.
pdf
bib
abs
JobXMLC: EXtreme Multi-Label Classification of Job Skills with Graph Neural Networks
Nidhi Goyal
|
Jushaan Kalra
|
Charu Sharma
|
Raghava Mutharaju
|
Niharika Sachdeva
|
Ponnurangam Kumaraguru
Writing a good job description is an important step in the online recruitment process to hire the best candidates. Most recruiters forget to include some relevant skills in the job description. These missing skills affect the performance of recruitment tasks such as job suggestions, job search, candidate recommendations, etc. Existing approaches are limited to contextual modelling, do not exploit inter-relational structures like job-job and job-skill relationships, and are not scalable. In this paper, we exploit these structural relationships using a graph-based approach. We propose a novel skill prediction framework called JobXMLC, which uses graph neural networks with skill attention to predict missing skills using job descriptions. JobXMLC enables joint learning over a job-skill graph consisting of 22.8K entities (jobs and skills) and 650K relationships. We experiment with real-world recruitment datasets to evaluate our proposed approach. We train JobXMLC on 20,298 job descriptions and 2,548 skills within 30 minutes on a single GPU machine. JobXMLC outperforms the state-of-the-art approaches by 6% in precision and 3% in recall. JobXMLC is 18X faster for training task and up to 634X faster in skill prediction on benchmark datasets enabling JobXMLC to scale up on larger datasets.
pdf
bib
abs
ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities
Terry Yue Zhuo
|
Yaqing Liao
|
Yuecheng Lei
|
Lizhen Qu
|
Gerard de Melo
|
Xiaojun Chang
|
Yazhou Ren
|
Zenglin Xu
We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.
pdf
bib
abs
Grammatical Error Correction through Round-Trip Machine Translation
Yova Kementchedjhieva
|
Anders Søgaard
Machine translation (MT) operates on the premise of an interlingua which abstracts away from surface form while preserving meaning. A decade ago the idea of using round-trip MT to guide grammatical error correction was proposed as a way to abstract away from potential errors in surface forms (Madnani et al., 2012). At the time, it did not pan out due to the low quality of MT systems of the day. Today much stronger MT systems are available so we re-evaluate this idea across five languages and models of various sizes. We find that for extra large models input augmentation through round-trip MT has little to no effect. For more ‘workable’ model sizes, however, it yields consistent improvements, sometimes bringing the performance of a base or large model up to that of a large or xl model, respectively. The round-trip translation comes at a computational cost though, so one would have to determine whether to opt for a larger model or for input augmentation on a case-by-case basis.
pdf
bib
abs
Does Masked Language Model Pre-training with Artificial Data Improve Low-resource Neural Machine Translation?
Hiroto Tamura
|
Tosho Hirasawa
|
Hwichan Kim
|
Mamoru Komachi
Pre-training masked language models (MLMs) with artificial data has been proven beneficial for several natural language processing tasks such as natural language understanding and summarization; however, it has been less explored for neural machine translation (NMT).A previous study revealed the benefit of transfer learning for NMT in a limited setup, which differs from MLM.In this study, we prepared two kinds of artificial data and compared the translation performance of NMT when pre-trained with MLM.In addition to the random sequences, we created artificial data mimicking token frequency information from the real world. Our results showed that pre-training the models with artificial data by MLM improves translation performance in low-resource situations. Additionally, we found that pre-training on artificial data created considering token frequency information facilitates improved performance.
pdf
bib
abs
Performance and Risk Trade-offs for Multi-word Text Prediction at Scale
Aniket Vashishtha
|
S Sai Prasad
|
Payal Bajaj
|
Vishrav Chaudhary
|
Kate Cook
|
Sandipan Dandapat
|
Sunayana Sitaram
|
Monojit Choudhury
Large Language Models such as GPT-3 are well-suited for text prediction tasks, which can help and delight users during text composition. LLMs are known to generate ethically inappropriate predictions even for seemingly innocuous contexts. Toxicity detection followed by filtering is a common strategy for mitigating the harm from such predictions. However, as we shall argue in this paper, in the context of text prediction, it is not sufficient to detect and filter toxic content. One also needs to ensure factual correctness and group-level fairness of the predictions; failing to do so can make the system ineffective and nonsensical at best, and unfair and detrimental to the users at worst. We discuss the gaps and challenges of toxicity detection approaches - from blocklist-based approaches to sophisticated state-of-the-art neural classifiers - by evaluating them on the text prediction task for English against a manually crafted CheckList of harms targeted at different groups and different levels of severity.
pdf
bib
abs
Searching for Better Database Queries in the Outputs of Semantic Parsers
Anton Osokin
|
Irina Saparina
|
Ramil Yarullin
The task of generating a database query from a question in natural language suffers from ambiguity and insufficiently precise description of the goal. The problem is amplified when the system needs to generalize to databases unseen at training. In this paper, we consider the case when, at the test time, the system has access to an external criterion that evaluates the generated queries. The criterion can vary from checking that a query executes without errors to verifying the query on a set of tests. In this setting, we augment neural autoregressive models with a search algorithm that looks for a query satisfying the criterion. We apply our approach to the state-of-the-art semantic parsers and report that it allows us to find many queries passing all the tests on different datasets.
pdf
bib
abs
Style-Aware Contrastive Learning for Multi-Style Image Captioning
Yucheng Zhou
|
Guodong Long
Existing multi-style image captioning methods show promising results in generating a caption with accurate visual content and desired linguistic style. However, existing methods overlook the relationship between linguistic style and visual content. To overcome this drawback, we propose style-aware contrastive learning for multi-style image captioning. First, we present a style-aware visual encoder with contrastive learning to mine potential visual content relevant to style. Moreover, we propose a style-aware triplet contrast objective to distinguish whether the image, style and caption matched. To provide positive and negative samples for contrastive learning, we present three retrieval schemes: object-based retrieval, RoI-based retrieval and triplet-based retrieval, and design a dynamic trade-off function to calculate retrieval scores. Experimental results demonstrate that our approach achieves state-of-the-art performance. In addition, we conduct an extensive analysis to verify the effectiveness of our method.
pdf
bib
abs
Strategize Before Teaching: A Conversational Tutoring System with Pedagogy Self-Distillation
Lingzhi Wang
|
Mrinmaya Sachan
|
Xingshan Zeng
|
Kam-Fai Wong
Conversational tutoring systems (CTSs) aim to help students master educational material with natural language interaction in the form of a dialog. CTSs have become a key pillar in educational data mining research. A key challenge in CTSs is to engage the student in the conversation while exposing them to a diverse set of teaching strategies, akin to a human teacher, thereby, helping them learn in the process. Different from previous work that generates responses given the strategies as input, we propose to jointly predict teaching strategies and generate tutor responses accordingly, which fits a more realistic application scenario. We benchmark several competitive models on three dialog tutoring datasets and propose a unified framework that combines teaching response generation and pedagogical strategy prediction, where a self-distillation mechanism is adopted to guide the teaching strategy learning and facilitate tutor response generation. Our experiments and analyses shed light on how teaching strategies affect dialog tutoring.
pdf
bib
abs
ICA-Proto: Iterative Cross Alignment Prototypical Network for Incremental Few-Shot Relation Classification
Wangjie Jiang
|
Zhihao Ye
|
Bang Liu
|
Ruihui Zhao
|
Jianguang Zheng
|
Mengyao Li
|
Zhiyong Li
|
Yujiu Yang
|
Yefeng Zheng
In the task of incremental few-shot relation classification, model performance is always limited by the incompatibility between the base feature embedding space and the novel feature embedding space. To tackle the issue, we propose a novel model named ICA-Proto: Iterative Cross Alignment prototypical network. Specifically, we incorporate the query representation into the encoding of novel prototypes and utilize the query-aware prototypes to update the query representation at the same time. Further, we implement the above process iteratively to achieve more interaction. In addition, a novel prototype quadruplet loss is designed to regulate the spatial distributions of embedding space, so as to make it easier for the relation classification. Experimental results on two benchmark datasets demonstrate that ICA-Proto significantly outperforms the state-of-the-art baseline model.
pdf
bib
abs
A Large-Scale Multilingual Study of Visual Constraints on Linguistic Selection of Descriptions
Uri Berger
|
Lea Frermann
|
Gabriel Stanovsky
|
Omri Abend
We present a large, multilingual study into how vision constrains linguistic choice, covering four languages and five linguistic properties, such as verb transitivity or use of numerals. We propose a novel method that leverages existing corpora of images with captions written by native speakers, and apply it to nine corpora, comprising 600k images and 3M captions. We study the relation between visual input and linguistic choices by training classifiers to predict the probability of expressing a property from raw images, and find evidence supporting the claim that linguistic properties are constrained by visual context across languages. We complement this investigation with a corpus study, taking the test case of numerals. Specifically, we use existing annotations (number or type of objects) to investigate the effect of different visual conditions on the use of numeral expressions in captions, and show that similar patterns emerge across languages. Our methods and findings both confirm and extend existing research in the cognitive literature. We additionally discuss possible applications for language generation.
pdf
bib
abs
How Much Syntactic Supervision is “Good Enough”?
Hiroshi Noji
|
Yohei Oseki
In this paper, we explore how much syntactic supervision is “good enough” to make language models (LMs) more human-like. Specifically, we propose the new method called syntactic ablation, where syntactic LMs, namely Recurrent Neural Network Grammars (RNNGs), are gradually ablated from full syntactic supervision to zero syntactic supervision (≈ unidirectional LSTM) by preserving NP, VP, PP, SBAR nonterminal symbols and the combinations thereof. The 17 ablated grammars are then evaluated via targeted syntactic evaluation on the SyntaxGym benchmark. The results of our syntactic ablation demonstrated that (i) the RNNG with zero syntactic supervision underperformed the RNNGs with some syntactic supervision, (ii) the RNNG with full syntactic supervision underperformed the RNNGs with less syntactic supervision, and (iii) the RNNG with mild syntactic supervision achieved the best performance comparable to the state-of-the-art GPT-2-XL. Those results may suggest that the “good enough” approach to language processing seems to make LMs more human-like.
pdf
bib
abs
Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
Sonal Sannigrahi
|
Josef van Genabith
|
Cristina España-Bonet
Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art resultsin various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and lack of appropriate data. Instead, most approaches fall back on computing document embeddings based on sentence representations. Although there exist architectures and models to encode documents fully, they are in general limited to English and few other high-resourced languages. In this work, we provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We compare input token number truncation, sentence averaging as well as some simple windowing and in some cases new augmented and learnable approaches, on 3 multi- and cross-lingual tasks in 8 languages belonging to 3 different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average results in a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks
pdf
bib
abs
Improving User Controlled Table-To-Text Generation Robustness
Hanxu Hu
|
Yunqing Liu
|
Zhongyi Yu
|
Laura Perez-Beltrachini
In this work we study user controlled table-to-text generation where users explore the content in a table by selecting cells and reading a natural language description thereof automatically produce by a natural language generator. Such generation models usually learn from carefully selected cell combinations (clean cell selections); however, in practice users may select unexpected, redundant, or incoherent cell combinations (noisy cell selections). In experiments, we find that models perform well on test sets coming from the same distribution as the train data but their performance drops when evaluated on realistic noisy user inputs. We propose a fine-tuning regime with additional user-simulated noisy cell selections. Models fine-tuned with the proposed regime gain 4.85 BLEU points on user noisy test cases and 1.4 on clean test cases; and achieve comparable state-of-the-art performance on the ToTTo dataset.
pdf
bib
abs
Better Pre-Training by Reducing Representation Confusion
Haojie Zhang
|
Mingfei Liang
|
Ruobing Xie
|
Zhenlong Sun
|
Bo Zhang
|
Leyu Lin
In this work, we revisit the Transformer-based pre-trained language models and identify two different types of information confusion in position encoding and model representations, respectively. Firstly, we show that in the relative position encoding, the joint modeling about relative distances and directions brings confusion between two heterogeneous information. It may make the model unable to capture the associative semantics of the same distance and the opposite directions, which in turn affects the performance of downstream tasks. Secondly, we notice the BERT with Mask Language Modeling (MLM) pre-training objective outputs similar token representations (last hidden states of different tokens) and head representations (attention weightsof different heads), which may make the diversity of information expressed by different tokens and heads limited. Motivated by the above investigation, we propose two novel techniques to improve pre-trained language models: Decoupled Directional Relative Position (DDRP) encoding and MTH pre-training objective. DDRP decouples the relative distance features and the directional features in classical relative position encoding. MTH applies two novel auxiliary regularizers besides MLM to enlarge the dissimilarities between (a) last hidden states of different tokens, and (b) attention weights of different heads. These designs allow the model to capture different categories of information more clearly, as a way to alleviate information confusion in representation learning for better optimization. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness of our proposed methods.
pdf
bib
abs
MAFiD: Moving Average Equipped Fusion-in-Decoder for Question Answering over Tabular and Textual Data
Sung-Min Lee
|
Eunhwan Park
|
Daeryong Seo
|
Donghyeon Jeon
|
Inho Kang
|
Seung-Hoon Na
Transformer-based models for question answering (QA) over tables and texts confront a “long” hybrid sequence over tabular and textual elements, causing long-range reasoning problems. To handle long-range reasoning, we extensively employ a fusion-in-decoder (FiD) and exponential moving average (EMA), proposing a Moving Average Equipped Fusion-in-Decoder (MAFiD). With FiD as the backbone architecture, MAFiD combines various levels of reasoning: independent encoding of homogeneous data and single-row and multi-row heterogeneous reasoning, using a gated cross attention layer to effectively aggregate the three types of representations resulting from various reasonings. Experimental results on HybridQA indicate that MAFiD achieves state-of-the-art performance by increasing exact matching (EM) and F1 by 1.1 and 1.7, respectively, on the blind test set.
pdf
bib
abs
Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis
Akshita Jha
|
Adithya Samavedhi
|
Vineeth Rakesh
|
Jaideep Chandrashekar
|
Chandan Reddy
Recent advances in the area of long document matching have primarily focused on using transformer-based models for long document encoding and matching. There are two primary challenges associated with these models. Firstly, the performance gain provided by transformer-based models comes at a steep cost – both in terms of the required training time and the resource (memory and energy) consumption. The second major limitation is their inability to handle more than a pre-defined input token length at a time. In this work, we empirically demonstrate the effectiveness of simple neural models (such as feed-forward networks, and CNNs) and simple embeddings (like GloVe, and Paragraph Vector) over transformer-based models on the task of document matching. We show that simple models outperform the more complex BERT-based models while taking significantly less training time, energy, and memory. The simple models are also more robust to variations in document length and text perturbations.
pdf
bib
abs
Simple and Effective Multi-Token Completion from Masked Language Models
Oren Kalinsky
|
Guy Kushilevitz
|
Alexander Libov
|
Yoav Goldberg
Pre-trained neural masked language models are often used for predicting a replacement token for a given sequence position, in a cloze-like task. However, this usage is restricted to predicting a single token, from a relatively small pre-trained vocabulary. Recent Sequence2Sequence pre-trained LMs like T5 do allow predicting multi-token completions, but are more expensive to train and run. We show that pre-trained masked language models can be adapted to produce multi-token completions, with only a modest addition to their parameter count. We propose two simple adaptation approaches, trading parameter counts for accuracy. The first method generates multi-token completions from a conditioned RNN. It has a very low parameter count and achieves competitive results. The second method is even simpler: it adds items corresponding to multi-token units to the output prediction matrix. While being higher in parameter count than the RNN method, it also surpasses current state-of-the-art multi-token completion models, including T5-3B, while being significantly more parameter efficient. We demonstrate that our approach is flexible to different vocabularies and domains and can effectively leverage existing pre-trained models available in different domains. Finally, a human evaluation further validates our results and shows that our solution regularly provides valid completions, as well as reasonable correctness for factual-sentence completions.
pdf
bib
abs
A Survey on Dynamic Neural Networks for Natural Language Processing
Canwen Xu
|
Julian McAuley
Effectively scaling large Transformer models is a main driver of recent advances in natural language processing. Dynamic neural networks, as an emerging research direction, are capable of scaling up neural networks with sub-linear increases in computation and time by dynamically adjusting their computational path based on the input. Dynamic neural networks could be a promising solution to the growing parameter numbers of pretrained language models, allowing both model pretraining with trillions of parameters and faster inference on mobile devices. In this survey, we summarize the progress of three types of dynamic neural networks in NLP: skimming, mixture of experts, and early exit. We also highlight current challenges in dynamic neural networks and directions for future research.
pdf
bib
abs
Transformers with Learnable Activation Functions
Haishuo Fang
|
Ji-Ung Lee
|
Nafise Sadat Moosavi
|
Iryna Gurevych
Activation functions can have a significant impact on reducing the topological complexity of input data and therefore, improving a model’s performance. However, the choice of activation functions is seldom discussed or explored in Transformer-based language models. As a common practice, commonly used activation functions like Gaussian Error Linear Unit (GELU) are chosen beforehand and then remain fixed from pre-training to fine-tuning. In this paper, we investigate the impact of activation functions on Transformer-based models by utilizing rational activation functions (RAFs). In contrast to fixed activation functions (FAF), RAFs are capable of learning the optimal activation functions from data. Our experiments show that the RAF-based Transformer model (RAFT) achieves a better performance than its FAF-based counterpart (). For instance, we find that RAFT outperforms on the GLUE benchmark by 5.71 points when using only 100 training examples and by 2.05 points on SQuAD with all available data. Analyzing the shapes of the learned RAFs further unveils that they vary across different layers and different tasks; opening a promising way to better analyze and understand large, pre-trained language models.
pdf
bib
abs
The Solvability of Interpretability Evaluation Metrics
Yilun Zhou
|
Julie Shah
Feature attribution methods are popular for explaining neural network predictions, and they are often evaluated on metrics such as comprehensiveness and sufficiency. In this paper, we highlight an intriguing property of these metrics: their solvability. Concretely, we can define the problem of optimizing an explanation for a metric, which can be solved by beam search. This observation leads to the obvious yet unaddressed question: why do we use explainers (e.g., LIME) not based on solving the target metric, if the metric value represents explanation quality? We present a series of investigations showing strong performance of this beam search explainer and discuss its broader implication: a definition-evaluation duality of interpretability concepts. We implement the explainer and release the Python solvex package for models of text, image and tabular domains.
pdf
bib
abs
Reliable Gradient-free and Likelihood-free Prompt Tuning
Maohao Shen
|
Soumya Ghosh
|
Prasanna Sattigeri
|
Subhro Das
|
Yuheng Bu
|
Gregory Wornell
Due to privacy or commercial constraints, large pre-trained language models (PLMs) are often offered as black-box APIs. Fine-tuning such models to downstream tasks is challenging because one can neither access the model’s internal representations nor propagate gradients through it. This paper addresses these challenges by developing techniques for adapting PLMs with only API access. Building on recent work on soft prompt tuning, we develop methods to tune the soft prompts without requiring gradient computation. Further, we develop extensions that in addition to not requiring gradients also do not need to access any internal representation of the PLM beyond the input embeddings. Moreover, instead of learning a single prompt, our methods learn a distribution over prompts allowing us to quantify predictive uncertainty. Ours is the first work to consider uncertainty in prompts when only having API access to the PLM. Finally, through extensive experiments, we carefully vet the proposed methods and find them competitive with (and sometimes even improving on) gradient-based approaches with full access to the PLM.
pdf
bib
abs
Combining Psychological Theory with Language Models for Suicide Risk Detection
Daniel Izmaylov
|
Avi Segal
|
Kobi Gal
|
Meytal Grimland
|
Yossi Levi-Belz
With the increased awareness of situations of mental crisis and their societal impact, online services providing emergency support are becoming commonplace in many countries. Computational models, trained on discussions between help-seekers and providers, can support suicide prevention by identifying at-risk individuals. However, the lack of domain-specific models, especially in low-resource languages, poses a significant challenge for the automatic detection of suicide risk. We propose a model that combines pre-trained language models (PLM) with a fixed set of manually crafted (and clinically approved) set of suicidal cues, followed by a two-stage fine-tuning process. Our model achieves 0.91 ROC-AUC and an F2-score of 0.55, significantly outperforming an array of strong baselines even early on in the conversation, which is critical for real-time detection in the field. Moreover, the model performs well across genders and age groups.
pdf
bib
abs
Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension
Chen Zhang
|
Yuxuan Lai
|
Yansong Feng
|
Xingyu Shen
|
Haowei Du
|
Dongyan Zhao
Although many large-scale knowledge bases (KBs) claim to contain multilingual information, their support for many non-English languages is often incomplete. This incompleteness gives birth to the task of cross-lingual question answering over knowledge base (xKBQA), which aims to answer questions in languages different from that of the provided KB. One of the major challenges facing xKBQA is the high cost of data annotation, leading to limited resources available for further exploration. Another challenge is mapping KB schemas and natural language expressions in the questions under cross-lingual settings. In this paper, we propose a novel approach for xKBQA in a reading comprehension paradigm. We convert KB subgraphs into passages to narrow the gap between KB schemas and questions, which enables our model to benefit from recent advances in multilingual pre-trained language models (MPLMs) and cross-lingual machine reading comprehension (xMRC). Specifically, we use MPLMs, with considerable knowledge of cross-lingual mappings, for cross-lingual reading comprehension. Existing high-quality xMRC datasets can be further utilized to finetune our model, greatly alleviating the data scarcity issue in xKBQA. Extensive experiments on two xKBQA datasets in 12 languages show that our approach outperforms various baselines and achieves strong few-shot and zero-shot performance. Our dataset and code are released for further research.
pdf
bib
abs
Delving Deeper into Cross-lingual Visual Question Answering
Chen Liu
|
Jonas Pfeiffer
|
Anna Korhonen
|
Ivan Vulić
|
Iryna Gurevych
Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, existing VQA research has mostly focused on the English language, due to a lack of suitable evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers with large gaps to monolingual performance, without any deeper analysis. In this work, we delve deeper into the different aspects of cross-lingual VQA, aiming to understand the impact of 1) modeling methods and choices, including architecture, inductive bias, fine-tuning; 2) learning biases: including question types and modality biases in cross-lingual setups. The key results of our analysis are: 1. We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance, yielding +10 accuracy points over existing methods. 2. We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers, and identify question types that are the most difficult to improve on. 3. We provide an analysis of modality biases present in training data and models, revealing why zero-shot performance gaps remain for certain question types and languages.
pdf
bib
abs
Bridging Argument Quality and Deliberative Quality Annotations with Adapters
Neele Falk
|
Gabriella Lapesa
Assessing the quality of an argument is a complex, highly subjective task, influenced by heterogeneous factors (e.g., prior beliefs of the annotators, topic, domain, and application), and crucial for its impact in downstream tasks (e.g., argument retrieval or generation). Both the Argument Mining and the Social Science community have devoted plenty of attention to it, resulting in a wide variety of argument quality dimensions and a large number of annotated resources. This work aims at a better understanding of how the different aspects of argument quality relate to each other from a practical point of view. We employ adapter-fusion (Pfeiffer et al., 2021) as a multi-task learning framework which a) can improve the prediction of individual quality dimensions by injecting knowledge about related dimensions b) is efficient and modular and c) can serve as an analysis tool to investigate relations between different dimensions. We conduct experiments on 6 datasets and 20 quality dimensions. We find that the majority of the dimensions can be learned as a weighted combination of other quality aspects, and that for 8 dimensions adapter fusion improves quality prediction. Last, we show the benefits of this approach by improving the performance in an extrinsic, out-of-domain task: prediction of moderator interventions in a deliberative forum.
pdf
bib
abs
Interventional Probing in High Dimensions: An NLI Case Study
Julia Rozanova
|
Marco Valentino
|
Lucas Cordeiro
|
André Freitas
Probing strategies have been shown to detectthe presence of various linguistic features inlarge language models; in particular, seman-tic features intermediate to the “natural logic”fragment of the Natural Language Inferencetask (NLI). In the case of natural logic, the rela-tion between the intermediate features and theentailment label is explicitly known: as such,this provides a ripe setting for interventionalstudies on the NLI models’ representations, al-lowing for stronger causal conjectures and adeeper critical analysis of interventional prob-ing methods. In this work, we carry out newand existing representation-level interventionsto investigate the effect of these semantic fea-tures on NLI classification: we perform am-nesic probing (which removes features as di-rected by learned linear probes) and introducethe mnestic probing variation (which forgetsall dimensions except the probe-selected ones).Furthermore, we delve into the limitations ofthese methods and outline some pitfalls havebeen obscuring the effectivity of interventionalprobing studies.
pdf
bib
abs
Program Synthesis for Complex QA on Charts via Probabilistic Grammar Based Filtered Iterative Back-Translation
Shabbirhussain Bhaisaheb
|
Shubham Paliwal
|
Rajaswa Patil
|
Manasi Patwardhan
|
Lovekesh Vig
|
Gautam Shroff
Answering complex reasoning questions from chart images is a challenging problem requiring a combination of natural language understanding, fine-grained perception, and analytical reasoning. Current chart-based Question Answering (QA) approaches largely address structural, visual or simple data retrieval-type questions with fixed-vocabulary answers and perform poorly on reasoning queries. We focus on answering realistic, complex, reasoning-based questions where the answer needs to be computed and not selected from a fixed set of choices. Our approach employs a neural semantic parser to transform Natural Language (NL) questions into SQL programs and execute them on a standardized schema populated from the extracted chart contents. In the absence of program annotations, i.e., in a weak supervision setting, we obtain initial SQL predictions from a pre-trained CodeT5 semantic parser and employ Filtered Iterative Back-Translation (FIBT) for iteratively augmenting our NL-SQL training set. The forward (neural semantic parser) and backward (language model) models are initially trained with an external NL-SQL dataset. We iteratively move towards the NL query distribution by generating NL questions from the synthesized SQL programs using a Probabilistic Context-Free Grammar (PCFG) where the production rule probabilities are induced to be inversely proportional to the probabilities in the training data. We filter out the generated NL queries with mismatched structures and compositions. Our FIBT approach achieves State-of-the-Art (SOTA) results on reasoning-based queries in the PlotQA dataset yielding a test accuracy of 60.44%, superseding the previous baselines by a large margin.
pdf
bib
abs
Exploiting Language Characteristics for Legal Domain-Specific Language Model Pretraining
Inderjeet Nair
|
Natwar Modani
Pretraining large language models has resulted in tremendous performance improvement for many natural language processing (NLP) tasks. While for non-domain specific tasks, such models can be used directly, a common strategy to achieve better performance for specific domains involves pretraining these language models over domain specific data using objectives like Masked Language Modelling (MLM), Autoregressive Language Modelling, etc. While such pretraining addresses the change in vocabulary and style of language for the domain, it is otherwise a domain agnostic approach. In this work, we investigate the effect of incorporating pretraining objectives that explicitly tries to exploit the domain specific language characteristics in addition to such MLM based pretraining. Particularly, we examine two distinct characteristics associated with the legal domain and propose pretraining objectives modelling these characteristics. The proposed objectives target improvement of token-level feature representation, as well as aim to incorporate sentence level semantics. We demonstrate superiority in the performance of the models pretrained using our objectives against those trained using domain-agnostic objectives over several legal downstream tasks.
pdf
bib
abs
Global Constraints with Prompting for Zero-Shot Event Argument Classification
Zizheng Lin
|
Hongming Zhang
|
Yangqiu Song
Determining the role of event arguments is a crucial subtask of event extraction. Most previous supervised models leverage costly annotations, which is not practical for open-domain applications. In this work, we propose to use global constraints with prompting to effectively tackles event argument classification without any annotation and task-specific training. Specifically, given an event and its associated passage, the model first creates several new passages by prefix prompts and cloze prompts, where prefix prompts indicate event type and trigger span, and cloze prompts connect each candidate role with the target argument span. Then, a pre-trained language model scores the new passages, making the initial prediction. Our novel prompt templates can easily adapt to all events and argument types without manual effort. Next, the model regularizes the prediction by global constraints exploiting cross-task, cross-argument, and cross-event relations. Extensive experiments demonstrate our model’s effectiveness: it outperforms the best zero-shot baselines by 12.5% and 10.9% F1 on ACE and ERE with given argument spans and by 4.3% and 3.3% F1, respectively, without given argument spans. We have made our code publicly available.
pdf
bib
abs
Distillation of encoder-decoder transformers for sequence labelling
Marco Farina
|
Duccio Pappadopulo
|
Anant Gupta
|
Leslie Huang
|
Ozan Irsoy
|
Thamar Solorio
Driven by encouraging results on a wide range of tasks, the field of NLP is experiencing an accelerated race to develop bigger language models. This race for bigger models has also underscored the need to continue the pursuit of practical distillation approaches that can leverage the knowledge acquired by these big models in a compute-efficient manner. Having this goal in mind, we build on recent work to propose a hallucination-free framework for sequence tagging that is especially suited for distillation. We show empirical results of new state-of-the-art performance across multiple sequence labelling datasets and validate the usefulness of this framework for distilling a large model in a few-shot learning scenario.
pdf
bib
abs
Predicting Desirable Revisions of Evidence and Reasoning in Argumentative Writing
Tazin Afrin
|
Diane Litman
We develop models to classify desirable evidence and desirable reasoning revisions in student argumentative writing. We explore two ways to improve classifier performance – using the essay context of the revision, and using the feedback students received before the revision. We perform both intrinsic and extrinsic evaluation for each of our models and report a qualitative analysis. Our results show that while a model using feedback information improves over a baseline model, models utilizing context - either alone or with feedback - are the most successful in identifying desirable revisions.
pdf
bib
abs
Discourse Structure Extraction from Pre-Trained and Fine-Tuned Language Models in Dialogues
Chuyuan Li
|
Patrick Huber
|
Wen Xiao
|
Maxime Amblard
|
Chloe Braud
|
Giuseppe Carenini
Discourse processing suffers from data sparsity, especially for dialogues. As a result, we explore approaches to infer latent discourse structures for dialogues, based on attention matrices from Pre-trained Language Models (PLMs). We investigate multiple auxiliary tasks for fine-tuning and show that the dialogue-tailored Sentence Ordering task performs best. To locate and exploit discourse information in PLMs, we propose an unsupervised and a semi-supervised method. Our proposals thereby achieve encouraging results on the STAC corpus, with F1 scores of 57.2 and 59.3 for the unsupervised and semi-supervised methods, respectively. When restricted to projective trees, our scores improved to 63.3 and 68.1.
pdf
bib
abs
Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision
Zhen Wan
|
Fei Cheng
|
Qianying Liu
|
Zhuoyuan Mao
|
Haiyue Song
|
Sadao Kurohashi
Contrastive pre-training on distant supervision has shown remarkable effectiveness in improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to estimate the reliability of pre-training instances and explicitly reduce the effect of noise. Experimental results on three supervised datasets demonstrate the advantages of our proposed weighted contrastive learning approach compared to two state-of-the-art non-weighted baselines. Our code and models are available at:
https://github.com/YukinoWan/WCL.
pdf
bib
abs
CK-Transformer: Commonsense Knowledge Enhanced Transformers for Referring Expression Comprehension
Zhi Zhang
|
Helen Yannakoudakis
|
Xiantong Zhen
|
Ekaterina Shutova
The task of multimodal referring expression comprehension (REC), aiming at localizing an image region described by a natural language expression, has recently received increasing attention within the research comminity. In this paper, we specifically focus on referring expression comprehension with commonsense knowledge (KB-Ref), a task which typically requires reasoning beyond spatial, visual or semantic information. We propose a novel framework for Commonsense Knowledge Enhanced Transformers (CK-Transformer) which effectively integrates commonsense knowledge into the representations of objects in an image, facilitating identification of the target objects referred to by the expressions. We conduct extensive experiments on several benchmarks for the task of KB-Ref. Our results show that the proposed CK-Transformer achieves a new state of the art, with an absolute improvement of 3.14% accuracy over the existing state of the art.
pdf
bib
abs
Curricular Next Conversation Prediction Pretraining for Transcript Segmentation
Anvesh Rao Vijjini
|
Hanieh Deilamsalehy
|
Franck Dernoncourt
|
Snigdha Chaturvedi
Transcript segmentation is the task of dividing a single continuous transcript into multiple segments. While document segmentation is a popular task, transcript segmentation has significant challenges due to the relatively noisy and sporadic nature of data. We propose pretraining strategies to address these challenges. The strategies are based on “Next Conversation Prediction” (NCP) with the underlying idea of pretraining a model to identify consecutive conversations. We further introduce “Advanced NCP” to make the pretraining task more relevant to the downstream task of segmentation break prediction while being significantly easier. Finally we introduce a curriculum to Advanced NCP (Curricular NCP) based on the similarity between pretraining and downstream task samples. Curricular NCP applied to a state-of-the-art model for text segmentation outperforms prior results. We also show that our pretraining strategies make the model robust to speech recognition errors commonly found in automatically generated transcripts.