Workshop on Representation Learning for NLP (2023)


bib (full) Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

pdf bib
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)
Burcu Can | Maximilian Mozes | Samuel Cahyawijaya | Naomi Saphra | Nora Kassner | Shauli Ravfogel | Abhilasha Ravichander | Chen Zhao | Isabelle Augenstein | Anna Rogers | Kyunghyun Cho | Edward Grefenstette | Lena Voita

pdf bib
Adversarial Clean Label Backdoor Attacks and Defenses on Text Classification Systems
Ashim Gupta | Amrith Krishna

Clean-label (CL) attack is a form of data poisoning attack where an adversary modifies only the textual input of the training data, without requiring access to the labeling function. CL attacks are relatively unexplored in NLP, as compared to label flipping (LF) attacks, where the latter additionally requires access to the labeling function as well. While CL attacks are more resilient to data sanitization and manual relabeling methods than LF attacks, they often demand as high as ten times the poisoning budget than LF attacks. In this work, we first introduce an Adversarial Clean Label attack which can adversarially perturb in-class training examples for poisoning the training set. We then show that an adversary can significantly bring down the data requirements for a CL attack, using the aforementioned approach, to as low as 20 % of the data otherwise required. We then systematically benchmark and analyze a number of defense methods, for both LF and CL attacks, some previously employed solely for LF attacks in the textual domain and others adapted from computer vision. We find that text-specific defenses greatly vary in their effectiveness depending on their properties.

pdf bib
Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords
Shahriar Golchin | Mihai Surdeanu | Nazgol Tavabi | Ata Kiapour

We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that the fine-tuned PLMs adapted using our in-domain pre-training strategy outperform PLMs that used in-domain pre-training with random masking as well as those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).

pdf bib
Grammatical information in BERT sentence embeddings as two-dimensional arrays
Vivi Nastase | Paola Merlo

Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher dimensional arrays. Our results cast light on representations produced by language models and help move towards developing few-shot learning approaches.

pdf bib
A Multilingual Evaluation of NER Robustness to Adversarial Inputs
Akshay Srinivasan | Sowmya Vajjala

Adversarial evaluations of language models typically focus on English alone. In this paper, we performed a multilingual evaluation of Named Entity Recognition (NER) in terms of its robustness to small perturbations in the input. Our results showed the NER models we explored across three languages (English, German and Hindi) are not very robust to such changes, as indicated by the fluctuations in the overall F1 score as well as in a more fine-grained evaluation. With that knowledge, we further explored whether it is possible to improve the existing NER models using a part of the generated adversarial data sets as augmented training data to train a new NER model or as fine-tuning data to adapt an existing NER model. Our results showed that both these approaches improve performance on the original as well as adversarial test sets. While there is no significant difference between the two approaches for English, re-training is significantly better than fine-tuning for German and Hindi.

pdf bib
Retrieval-Augmented Domain Adaptation of Language Models
Benfeng Xu | Chunxu Zhao | Wenbin Jiang | PengFei Zhu | Songtai Dai | Chao Pang | Zhuo Sun | Shuohuan Wang | Yu Sun

Language models pretrained on general domain corpora usually exhibit considerable degradation when generalizing to downstream tasks of specialized domains. Existing approaches try to construct PLMs for each specific domains either from scratch or through further pretraining, which not only costs substantial resources, but also fails to cover all target domains at various granularity. In this work, we propose RADA, a novel Retrieval-Augmented framework for Domain Adaptation. We first construct a textual corpora that covers the downstream task at flexible domain granularity and resource availability. We employ it as a pluggable datastore to retrieve informative background knowledge, and integrate them into the standard language model framework to augment representations. We then propose a two-level selection scheme to integrate the most relevant information while alleviating irrelevant noises. Specifically, we introduce a differentiable sampling module as well as an attention mechanism to achieve both passage-level and word-level selection. Such a retrieval-augmented framework enables domain adaptation of language models with flexible domain coverage and fine-grained domain knowledge integration. We conduct comprehensive experiments across biomedical, science and legal domains to demonstrate the effectiveness of the overall framework, and its advantage over existing solutions.

pdf bib
Fine-grained Text Style Transfer with Diffusion-Based Language Models
Yiwei Lyu | Tiange Luo | Jiacheng Shi | Todd Hollon | Honglak Lee

Diffusion probabilistic models have shown great success in generating high-quality images controllably, and researchers have tried to utilize this controllability into text generation domain. Previous works on diffusion-based language models have shown that they can be trained without external knowledge (such as pre-trained weights) and still achieve stable performance and controllability. In this paper, we trained a diffusion-based model on StylePTB dataset, the standard benchmark for fine-grained text style transfers. The tasks in StylePTB requires much more refined control over the output text compared to tasks evaluated in previous works, and our model was able to achieve state-of-the-art performance on StylePTB on both individual and compositional transfers. Moreover, our model, trained on limited data from StylePTB without external knowledge, outperforms previous works that utilized pretrained weights, embeddings, and external grammar parsers, and this may indicate that diffusion-based language models have great potential under low-resource settings.

pdf bib
Enhancing text comprehension for Question Answering with Contrastive Learning
Seungyeon Lee | Minho Lee

Although Question Answering (QA) have advanced to the human-level language skills in NLP tasks, there is still a problem: the QA model gets confused when there are similar sentences or paragraphs. Existing studies focus on enhancing the text understanding of the candidate answers to improve the overall performance of the QA models. However, since these methods focus on re-ranking queries or candidate answers, they fail to resolve the confusion when many generated answers are similar to the expected answer. To address these issues, we propose a novel contrastive learning framework called ContrastiveQA that alleviates the confusion problem in answer extraction. We propose a supervised method where we generate positive and negative samples from the candidate answers and the given answer, respectively. We thus introduce ContrastiveQA, which uses contrastive learning with sampling data to reduce incorrect answers. Experimental results on four QA benchmarks show the effectiveness of the proposed method.

pdf bib
Towards Flow Graph Prediction of Open-Domain Procedural Texts
Keisuke Shirai | Hirotaka Kameko | Shinsuke Mori

Machine comprehension of procedural texts is essential for reasoning about the steps and automating the procedures. However, this requires identifying entities within a text and resolving the relationships between the entities. Previous work focused on the cooking domain and proposed a framework to convert a recipe text into a flow graph (FG) representation. In this work, we propose a framework based on the recipe FG for flow graph prediction of open-domain procedural texts. To investigate flow graph prediction performance in non-cooking domains, we introduce the wikiHow-FG corpus from articles on wikiHow, a website of how-to instruction articles. In experiments, we consider using the existing recipe corpus and performing domain adaptation from the cooking to the target domain. Experimental results show that the domain adaptation models achieve higher performance than those trained only on the cooking or target domain data.

pdf bib
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Gregor Geigle | Chen Liu | Jonas Pfeiffer | Iryna Gurevych

Current multimodal models, aimed at solving Vision and Language (V+L) tasks, predominantly repurpose Vision Encoders (VE) as feature extractors. While many VEs—of different architectures, trained on different data and objectives—are publicly available, they are not designed for the downstream V+L tasks. Nonetheless, most current work assumes that a single pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e. if providing the model with features from multiple VEs can improve the performance on a target task, and how they are combined. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, where the improvements are not due to simple ensemble effects (i.e. the performance does not always improve when increasing the number of encoders). We demonstrate that future VEs, which are not repurposed, but explicitly designed for V+L tasks, have the potential of improving performance on the target V+L tasks.

pdf bib
SPC: Soft Prompt Construction for Cross Domain Generalization
Wenbo Zhao | Arpit Gupta | Tagyoung Chung | Jing Huang

Recent advances in prompt tuning have proven effective as a new language modeling paradigm for various natural language understanding tasks. However, it is challenging to adapt the soft prompt embeddings to different domains or generalize to low-data settings when learning soft prompts itself is unstable, task-specific, and bias-prone. This paper proposes a principled learning framework—soft prompt construction (SPC)—to facilitate learning domain-adaptable soft prompts. Derived from the SPC framework is a simple loss that can plug into various models and tuning approaches to improve their cross-domain performance. We show SPC can improve upon SOTA for contextual query rewriting, summarization, and paraphrase detection by up to 5%, 19%, and 16%, respectively.

pdf bib
Friendly Neighbors: Contextualized Sequence-to-Sequence Link Prediction
Adrian Kochsiek | Apoorv Saxena | Inderjeet Nair | Rainer Gemulla

We propose KGT5-context, a simple sequence-to-sequence model for link prediction (LP) in knowledge graphs (KG). Our work expands on KGT5, a recent LP model that exploits textual features of the KG, has small model size, and is scalable. To reach good predictive performance, however, KGT5 relies on an ensemble with a knowledge graph embedding model, which itself is excessively large and costly to use. In this short paper, we show empirically that adding contextual information — i.e., information about the direct neighborhood of the query entity — alleviates the need for a separate KGE model to obtain good performance. The resulting KGT5-context model is simple, reduces model size significantly, and obtains state-of-the-art performance in our experimental study.

pdf bib
Extracting Multi-valued Relations from Language Models
Sneha Singhania | Simon Razniewski | Gerhard Weikum

The widespread usage of latent language representations via pre-trained language models (LMs) suggests that they are a promising source of structured knowledge. However, existing methods focus only on a single object per subject-relation pair, even though often multiple objects are correct. To overcome this limitation, we analyze these representations for their potential to yield materialized multi-object relational knowledge. We formulate the problem as a rank-then-select task. For ranking candidate objects, we evaluate existing prompting techniques and propose new ones incorporating domain knowledge. Among the selection methods, we find that choosing objects with a likelihood above a learned relation-specific threshold gives a 49.5% F1 score. Our results highlight the difficulty of employing LMs for the multi-valued slot-filling task, and pave the way for further research on extracting relational knowledge from latent language representations.

pdf bib
Hierarchical Multi-Instance Multi-Label Learning for Detecting Propaganda Techniques
Anni Chen | Bhuwan Dhingra

Since the introduction of the SemEval 2020 Task 11 (CITATION), several approaches have been proposed in the literature for classifying propagandabased on the rhetorical techniques used to influence readers. These methods, however, classify one span at a time, ignoring dependencies from the labels of other spans within the same context. In this paper, we approach propaganda technique classification as aMulti-Instance Multi-Label (MIML) learning problem (CITATION) and propose a simple RoBERTa-based model (CITATION) for classifying all spans in an article simultaneously. Further, we note that, due to the annotation process whereannotators classified the spans by following a decision tree,there is an inherent hierarchical relationship among the differenttechniques, which existing approaches ignore. We incorporate these hierarchical label dependencies by adding an auxiliary classifier for each node in the decision tree to the training objective and ensembling the predictions from the original and auxiliary classifiers at test time. Overall, our model leads to an absolute improvement of 2.47% micro-F1 over the model from the shared task winning team in a cross-validation setup and is the best performing non-ensemble model on the shared task leaderboard.

pdf bib
Contrastive Loss is All You Need to Recover Analogies as Parallel Lines
Narutatsu Ri | Fei-Tzin Lee | Nakul Verma

While static word embedding models are known to represent linguistic analogies as parallel lines in high-dimensional space, the underlying mechanism as to why they result in such geometric structures remains obscure. We find that an elementary contrastive-style method employed over distributional information performs competitively with popular word embedding models on analogy recovery tasks, while achieving dramatic speedups in training time. Further, we demonstrate that a contrastive loss is sufficient to create these parallel structures in word embeddings, and establish a precise relationship between the co-occurrence statistics and the geometric structure of the resulting word embeddings.

pdf bib
Syntax-Aware Graph-to-Graph Transformer for Semantic Role Labelling
Alireza Mohammadshahi | James Henderson

Recent models have shown that incorporating syntactic knowledge into the semantic role labelling (SRL) task leads to a significant improvement. In this paper, we propose Syntax-aware Graph-to-Graph Transformer (SynG2G-Tr) model, which encodes the syntactic structure using a novel way to input graph relations as embeddings, directly into the self-attention mechanism of Transformer. This approach adds a soft bias towards attention patterns that follow the syntactic structure but also allows the model to use this information to learn alternative patterns. We evaluate our model on both span-based and dependency-based SRL datasets, and outperform previous alternative methods in both in-domain and out-of-domain settings, on CoNLL 2005 and CoNLL 2009 datasets.

pdf bib
Improving Zero-shot Relation Classification via Automatically-acquired Entailment Templates
Mahdi Rahimi | Mihai Surdeanu

While fully supervised relation classification (RC) models perform well on large-scale datasets, their performance drops drastically in low-resource settings. As generating annotated examples are expensive, recent zero-shot methods have been proposed that reformulate RC into other NLP tasks for which supervision exists such as textual entailment. However, these methods rely on templates that are manually created which is costly and requires domain expertise. In this paper, we present a novel strategy for template generation for relation classification, which is based on adapting Harris’ distributional similarity principle to templates encoded using contextualized representations. Further, we perform empirical evaluation of different strategies for combining the automatically acquired templates with manual templates. The experimental results on TACRED show that our approach not only performs better than the zero-shot RC methods that only use manual templates, but also that it achieves state-of-the-art performance for zero-shot TACRED at 64.3 F1 score.

pdf bib
MUX-PLMs: Pre-training Language Models with Data Multiplexing
Vishvak Murahari | Ameet Deshpande | Carlos Jimenez | Izhak Shafran | Mingqiu Wang | Yuan Cao | Karthik Narasimhan

The widespread adoption of large language models such as ChatGPT and Bard has led to unprecedented demand for these technologies. The burgeoning cost of inference for ever-increasing model sizes coupled with hardware shortages has limited affordable access and poses a pressing need for efficiency approaches geared towards high throughput and performance. Multi-input multi-output (MIMO) algorithms such as data multiplexing, offer a promising solution with a many-fold increase in throughput by performing inference for multiple inputs at the cost of a single input. Yet these approaches are not currently performant enough to be deployed in modern systems. We change that by developing MUX-PLMs, a class of high throughput pre-trained language models (PLMs) trained with data multiplexing, that can be fine-tuned for any downstream task to yield high-throughput high-performance. Our novel multiplexing and demultiplexing modules proficiently entangle and disentangle inputs, and enable high-performance high throughput that are competitive with vanilla PLMs while achieving 2x/5x inference speedup with only a 1−4% drop on a broad suite of tasks.

pdf bib
Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)
Robert Gale | Alexandra Salem | Gerasimos Fergadiotis | Steven Bedrick

Speech language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g. phonology & semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the international phonetic alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word- or near-word level, and pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. For an “easy” and “hard” version of these tasks, we show that fine-tuning from our models is more accurate by a relative 24% and 29%, and improved on character error rates by a relative 75% and 31%, respectively, than those starting from the original model.

pdf bib
Effectiveness of Data Augmentation for Parameter Efficient Tuning with Limited Data
Stephen Obadinma | Hongyu Guo | Xiaodan Zhu

Recent work has demonstrated that using parameter efficient tuning techniques such as prefix tuning (or P-tuning) on pretrained language models can yield performance that is comparable or superior to fine-tuning while dramatically reducing trainable parameters. Nevertheless, the effectiveness of such methods under the context of data augmentation, a common strategy to improve learning under low data regimes, has not been fully explored. In this paper, we examine the effectiveness of several popular task-agnostic data augmentation techniques, i.e., EDA, Back Translation, and Mixup, when using two general parameter efficient tuning methods, P-tuning v2 and LoRA, under data scarcity. We show that data augmentation can be used to boost the performance of P-tuning and LoRA models, but the effectiveness of each technique varies and certain methods can lead to a notable degradation in performance, particularly when using larger models and on harder tasks. We further analyze the sentence representations of P-tuning compared to fine-tuning to help understand the above behaviour, and reveal how P-tuning generally presents a more limited ability to separate the sentence embeddings from different classes of augmented data. In addition, it displays poorer performance on heavily altered data. However, we demonstrate that by adding a simple contrastive loss function it can help mitigate such issues for prefix tuning, resulting in sizable improvements to augmented data performance.

pdf bib
Relational Sentence Embedding for Flexible Semantic Matching
Bin Wang | Haizhou Li

pdf bib
Tucker Decomposition with Frequency Attention for Temporal Knowledge Graph Completion
Likang Xiao | Richong Zhang | Zijie Chen | Junfan Chen

pdf bib
CLIP-based image captioning via unsupervised cycle-consistency in the latent space
Romain Bielawski | Rufin VanRullen

pdf bib
Token-level Fitting Issues of Seq2seq Models
Guangsheng Bao | Zhiyang Teng | Yue Zhang

pdf bib
Revealing the Blind Spot of Sentence Encoder Evaluation by HEROS
Cheng-Han Chiang | Hung-yi Lee | Yung-Sung Chuang | James Glass

pdf bib
One-Shot Exemplification Modeling via Latent Sense Representations
John Harvill | Mark Hasegawa-Johnson | Hee Suk Yoon | Chang D. Yoo | Eunseop Yoon

pdf bib
Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model
Lingfeng Shen | Haiyun Jiang | Lemao Liu | Shuming Shi

pdf bib
Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele