Asian Chapter of the Association for Computational Linguistics (2022)



pdf (full)
bib (full)
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Yulan He | Heng Ji | Sujian Li | Yang Liu | Chua-Hui Chang

pdf bib
Chasing the Tail with Domain Generalization: A Case Study on Frequency-Enriched Datasets
Manoj Kumar | Anna Rumshisky | Rahul Gupta

Natural language understanding (NLU) tasks are typically defined by creating an annotated dataset in which each utterance is encountered once. Such data does not resemble real-world natural language interactions in which certain utterances are encountered frequently, others rarely. For deployed NLU systems this is a vital problem, since the underlying machine learning (ML) models are often fine-tuned on typical NLU data, and then applied to real-world data with a very different distribution. Such systems need to maintain interpretation consistency for both high-frequency utterances and low-frequency utterances. We propose an alternative strategy that explicitly uses utterance frequency in training data to learn models that are more robust to unknown distributions. We present a methodology to simulate utterance usage in two public NLU corpora and create new corpora with head, body and tail segments. We evaluate several methods for joint intent classification and named entity recognition (IC-NER), and use two domain generalization approaches that we adapt to NER. The proposed approaches demonstrate upto 7.02% relative improvement in semantic accuracy over baselines on the tail data. We provide insights as to why the proposed approaches work and show that the reasons for observed improvements do not align with those reported in previous work.

pdf bib
Double Trouble: How to not Explain a Text Classifier’s Decisions Using Counterfactuals Synthesized by Masked Language Models?
Thang Pham | Trung Bui | Long Mai | Anh Nguyen

A principle behind dozens of attribution methods is to take the prediction difference between before-and-after an input feature (here, a token) is removed as its attribution. A popular Input Marginalization (IM) method (Kim et al., 2020) uses BERT to replace a token, yielding more plausible counterfactuals. While Kim et al., 2020 reported that IM is effective, we find this conclusion not convincing as the Deletion-BERT metric used in their paper is biased towards IM. Importantly, this bias exists in Deletion-based metrics, including Insertion, Sufficiency, and Comprehensiveness. Furthermore, our rigorous evaluation using 6 metrics and 3 datasets finds no evidence that IM is better than a Leave-One-Out (LOO) baseline. We find two reasons why IM is not better than LOO: (1) deleting a single word from the input only marginally reduces a classifier’s accuracy; and (2) a highly predictable word is always given near-zero attribution, regardless of its true importance to the classifier. In contrast, making LIME samples more natural via BERT consistently improves LIME accuracy under several ROAR metrics.

pdf bib
An Empirical Study on Cross-X Transfer for Legal Judgment Prediction
Joel Niklaus | Matthias Stürmer | Ilias Chalkidis

Cross-lingual transfer learning has proven useful in a variety of Natural Language (NLP) tasks, but it is understudied in the context of legal NLP, and not at all in Legal Judgment Prediction (LJP). We explore transfer learning techniques on LJP using the trilingual Swiss-Judgment-Prediction (SJP) dataset, including cases written in three languages. We find that Cross-Lingual Transfer (CLT) improves the overall results across languages, especially when we use adapter-based fine-tuning. Finally, we further improve the model’s performance by augmenting the training dataset with machine-translated versions of the original documents, using a 3× larger training corpus. Further on, we perform an analysis exploring the effect of cross-domain and cross-regional transfer, i.e., train a model across domains (legal areas), or regions. We find that in both settings (legal areas, origin regions), models trained across all groups perform overall better, while they also have improved results in the worst-case scenarios. Finally, we report improved results when we ambitiously apply cross-jurisdiction transfer, where we further augment our dataset with Indian legal cases.

pdf bib
CNN for Modeling Sanskrit Originated Bengali and Hindi Language
Chowdhury Rahman | MD. Hasibur Rahman | Mohammad Rafsan | Mohammed Eunus Ali | Samiha Zakir | Rafsanjani Muhammod

Though recent works have focused on modeling high resource languages, the area is still unexplored for low resource languages like Bengali and Hindi. We propose an end to end trainable memory efficient CNN architecture named CoCNN to handle specific characteristics such as high inflection, morphological richness, flexible word order and phonetical spelling errors of Bengali and Hindi. In particular, we introduce two learnable convolutional sub-models at word and at sentence level that are end to end trainable. We show that state-of-the-art (SOTA) Transformer models including pretrained BERT do not necessarily yield the best performance for Bengali and Hindi. CoCNN outperforms pretrained BERT with 16X less parameters and achieves much better performance than SOTA LSTMs on multiple real-world datasets. This is the first study on the effectiveness of different architectures from Convolution, Recurrent, and Transformer neural net paradigm for modeling Bengali and Hindi.

pdf bib
Leveraging Key Information Modeling to Improve Less-Data Constrained News Headline Generation via Duality Fine-Tuning
Zhuoxuan Jiang | Lingfeng Qiao | Di Yin | Shanshan Feng | Bo Ren

Recent language generative models are mostly trained on large-scale datasets, while in some real scenarios, the training datasets are often expensive to obtain and would be small-scale. In this paper we investigate the challenging task of less-data constrained generation, especially when the generated news headlines are short yet expected by readers to keep readable and informative simultaneously. We highlight the key information modeling task and propose a novel duality fine-tuning method by formally defining the probabilistic duality constraints between key information prediction and headline generation tasks. The proposed method can capture more information from limited data, build connections between separate tasks, and is suitable for less-data constrained generation tasks. Furthermore, the method can leverage various pre-trained generative regimes, e.g., autoregressive and encoder-decoder models. We conduct extensive experiments to demonstrate that our method is effective and efficient to achieve improved performance in terms of language modeling metric and informativeness correctness metric on two public datasets.

pdf bib
Systematic Evaluation of Predictive Fairness
Xudong Han | Aili Shen | Trevor Cohn | Timothy Baldwin | Lea Frermann

Mitigating bias in training on biased datasets is an important open problem. Several techniques have been proposed, however the typical evaluation regime is very limited, considering very narrow data conditions. For instance, the effect of target class imbalance and stereotyping is under-studied. To address this gap, we examine the performance of various debiasing methods across multiple tasks, spanning binary classification (Twitter sentiment), multi-class classification (profession prediction), and regression (valence prediction). Through extensive experimentation, we find that data conditions have a strong influence on relative model performance, and that general conclusions cannot be drawn about method efficacy when evaluating only on standard datasets, as is current practice in fairness research.

pdf bib
Graph-augmented Learning to Rank for Querying Large-scale Knowledge Graph
Hanning Gao | Lingfei Wu | Po Hu | Zhihua Wei | Fangli Xu | Bo Long

Knowledge graph question answering (KGQA) based on information retrieval aims to answer a question by retrieving answer from a large-scale knowledge graph. Most existing methods first roughly retrieve the knowledge subgraphs (KSG) that may contain candidate answer, and then search for the exact answer in the KSG. However, the KSG may contain thousands of candidate nodes since the knowledge graph involved in querying is often of large scale, thus decreasing the performance of answer selection. To tackle this problem, we first propose to partition the retrieved KSG to several smaller sub-KSGs via a new subgraph partition algorithm and then present a graph-augmented learning to rank model to select the top-ranked sub-KSGs from them. Our proposed model combines a novel subgraph matching networks to capture global interactions in both question and subgraphs and an Enhanced Bilateral Multi-Perspective Matching model to capture local interactions. Finally, we apply an answer selection model on the full KSG and the top-ranked sub-KSGs respectively to validate the effectiveness of our proposed graph-augmented learning to rank method. The experimental results on multiple benchmark datasets have demonstrated the effectiveness of our approach.

pdf bib
An Embarrassingly Simple Approach for Intellectual Property Rights Protection on Recurrent Neural Networks
Zhi Qin Tan | Hao Shan Wong | Chee Seng Chan

Capitalise on deep learning models, offering Natural Language Processing (NLP) solutions as a part of the Machine Learning as a Service (MLaaS) has generated handsome revenues. At the same time, it is known that the creation of these lucrative deep models is non-trivial. Therefore, protecting these inventions’ intellectual property rights (IPR) from being abused, stolen and plagiarized is vital. This paper proposes a practical approach for the IPR protection on recurrent neural networks (RNN) without all the bells and whistles of existing IPR solutions. Particularly, we introduce the Gatekeeper concept that resembles the recurrent nature in RNN architecture to embed keys. Also, we design the model training scheme in a way such that the protected RNN model will retain its original performance iff a genuine key is presented. Extensive experiments showed that our protection scheme is robust and effective against ambiguity and removal attacks in both white-box and black-box protection schemes on different RNN variants. Code is available at

pdf bib
WAX: A New Dataset for Word Association eXplanations
Chunhua Liu | Trevor Cohn | Simon De Deyne | Lea Frermann

Word associations are among the most common paradigms to study the human mental lexicon. While their structure and types of associations have been well studied, surprisingly little attention has been given to the question of why participants produce the observed associations. Answering this question would not only advance understanding of human cognition, but could also aid machines in learning and representing basic commonsense knowledge. This paper introduces a large, crowd-sourced data set of English word associations with explanations, labeled with high-level relation types. We present an analysis of the provided explanations, and design several tasks to probe to what extent current pre-trained language models capture the underlying relations. Our experiments show that models struggle to capture the diversity of human associations, suggesting WAX is a rich benchmark for commonsense modeling and generation.

pdf bib
Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality
Haozhe Chi | Minghua Yang | Junhao Zhu | Guanhong Wang | Gaoang Wang

Multimodal sentiment analysis (MSA) is an important way of observing mental activities with the help of data captured from multiple modalities. However, due to the recording or transmission error, some modalities may include incomplete data. Most existing works that address missing modalities usually assume a particular modality is completely missing and seldom consider a mixture of missing across multiple modalities. In this paper, we propose a simple yet effective meta-sampling approach for multimodal sentiment analysis with missing modalities, namely Missing Modality-based Meta Sampling (M3S). To be specific, M3S formulates a missing modality sampling strategy into the modal agnostic meta-learning (MAML) framework. M3S can be treated as an efficient add-on training component on existing models and significantly improve their performances on multimodal data with a mixture of missing modalities. We conduct experiments on IEMOCAP, SIMS and CMU-MOSI datasets, and superior performance is achieved compared with recent state-of-the-art methods.

pdf bib
SPARQL-to-Text Question Generation for Knowledge-Based Conversational Applications
Gwénolé Lecorvé | Morgan Veyret | Quentin Brabant | Lina M. Rojas Barahona

This paper focuses on the generation of natural language questions based on SPARQL queries, with an emphasis on conversational use cases (follow-up question-answering). It studies what can be achieved so far based on current deep learning models (namely pretrained T5 and BART models). To do so, 4 knowledge-based QA corpora have been homogenized for the task and a new challenge set is introduced. A first series of experiments analyzes the impact of different training setups, while a second series seeks to understand what is still difficult for these models. The results from automatic metrics and human evaluation show that simple questions and frequent templates of SPARQL queries are usually well processed whereas complex questions and conversational dimensions (coreferences and ellipses) are still difficult to handle. The experimental material is publicly available on .

pdf bib
S+PAGE: A Speaker and Position-Aware Graph Neural Network Model for Emotion Recognition in Conversation
Chen Liang | Jing Xu | Yangkun Lin | Chong Yang | Yongliang Wang

Emotion recognition in conversation (ERC) has attracted much attention in recent years for its necessity in widespread applications. With the development of graph neural network (GNN), recent state-of-the-art ERC models mostly use GNN to embed the intrinsic structure information of a conversation into the utterance features. In this paper, we propose a novel GNN-based model for ERC, namely S+PAGE, to better capture the speaker and position-aware conversation structure information. Specifically, we add the relative positional encoding and speaker dependency encoding in the representations of edge weights and edge types respectively to acquire a more reasonable aggregation algorithm for ERC. Besides, a two-stream conversational Transformer is presented to extract both the self and inter-speaker contextual features for each utterance. Extensive experiments are conducted on four ERC benchmarks with state-of-the-art models employed as baselines for comparison, whose results demonstrate the superiority of our model.

pdf bib
Grammatical Error Correction Systems for Automated Assessment: Are They Susceptible to Universal Adversarial Attacks?
Vyas Raina | Yiting Lu | Mark Gales

Grammatical error correction (GEC) systems are a useful tool for assessing a learner’s writing ability. These systems allow the grammatical proficiency of a candidate’s text to be assessed without requiring an examiner or teacher to read the text. A simple summary of a candidate’s ability can be measured by the total number of edits between the input text and the GEC system output: the fewer the edits the better the candidate. With advances in deep learning, GEC systems have become increasingly powerful and accurate. However, deep learning systems are susceptible to adversarial attacks, in which a small change at the input can cause large, undesired changes at the output. In the context of GEC for automated assessment, the aim of an attack can be to deceive the system into not correcting (concealing) grammatical errors to create the perception of higher language ability. An interesting aspect of adversarial attacks in this scenario is that the attack needs to be simple as it must be applied by, for example, a learner of English. The form of realistic attack examined in this work is appending the same phrase to each input sentence: a concatenative universal attack. The candidate only needs to learn a single attack phrase. State-of-the-art GEC systems are found to be susceptible to this form of simple attack, which transfers to different test sets as well as system architectures,

pdf bib
This Patient Looks Like That Patient: Prototypical Networks for Interpretable Diagnosis Prediction from Clinical Text
Betty van Aken | Jens-Michalis Papaioannou | Marcel Naik | Georgios Eleftheriadis | Wolfgang Nejdl | Felix Gers | Alexander Loeser

The use of deep neural models for diagnosis prediction from clinical text has shown promising results. However, in clinical practice such models must not only be accurate, but provide doctors with interpretable and helpful results. We introduce ProtoPatient, a novel method based on prototypical networks and label-wise attention with both of these abilities. ProtoPatient makes predictions based on parts of the text that are similar to prototypical patients—providing justifications that doctors understand. We evaluate the model on two publicly available clinical datasets and show that it outperforms existing baselines. Quantitative and qualitative evaluations with medical doctors further demonstrate that the model provides valuable explanations for clinical decision support.

pdf bib
Cross-lingual Similarity of Multilingual Representations Revisited
Maksym Del | Mark Fishel

Related works used indexes like CKA and variants of CCA to measure the similarity of cross-lingual representations in multilingual language models. In this paper, we argue that assumptions of CKA/CCA align poorly with one of the motivating goals of cross-lingual learning analysis, i.e., explaining zero-shot cross-lingual transfer. We highlight what valuable aspects of cross-lingual similarity these indexes fail to capture and provide a motivating case study demonstrating the problem empirically. Then, we introduce Average Neuron-Wise Correlation (ANC) as a straightforward alternative that is exempt from the difficulties of CKA/CCA and is good specifically in a cross-lingual context. Finally, we use ANC to construct evidence that the previously introduced “first align, then predict” pattern takes place not only in masked language models (MLMs) but also in multilingual models with causal language modeling objectives (CLMs). Moreover, we show that the pattern extends to the scaled versions of the MLMs and CLMs (up to 85x original mBERT). Our code is publicly available at

pdf bib
Arabic Dialect Identification with a Few Labeled Examples Using Generative Adversarial Networks
Mahmoud Yusuf | Marwan Torki | Nagwa El-Makky

Given the challenges and complexities introduced while dealing with Dialect Arabic (DA) variations, Transformer based models, e.g., BERT, outperformed other models in dealing with the DA identification task. However, to fine-tune these models, a large corpus is required. Getting a large number high quality labeled examples for some Dialect Arabic classes is challenging and time-consuming. In this paper, we address the Dialect Arabic Identification task. We extend the transformer-based models, ARBERT and MARBERT, with unlabeled data in a generative adversarial setting using Semi-Supervised Generative Adversarial Networks (SS-GAN). Our model enabled producing high-quality embeddings for the Dialect Arabic examples and aided the model to better generalize for the downstream classification task given few labeled examples. Experimental results showed that our model reached better performance and faster convergence when only a few labeled examples are available.

pdf bib
Semantic Shift Stability: Efficient Way to Detect Performance Degradation of Word Embeddings and Pre-trained Language Models
Shotaro Ishihara | Hiromu Takahashi | Hono Shirai

Word embeddings and pre-trained language models have become essential technical elements in natural language processing. While the general practice is to use or fine-tune publicly available models, there are significant advantages in creating or pre-training unique models that match the domain. The performance of the models degrades as language changes or evolves continuously, but the high cost of model building inhibits regular re-training, especially for the language models. This study proposes an efficient way to detect time-series performance degradation of word embeddings and pre-trained language models by calculating the degree of semantic shift. Monitoring performance through the proposed method supports decision-making as to whether a model should be re-trained. The experiments demonstrated that the proposed method can identify time-series performance degradation in two datasets, Japanese and English. The source code is available at

pdf bib
Neural Text Sanitization with Explicit Measures of Privacy Risk
Anthi Papadopoulou | Yunhao Yu | Pierre Lison | Lilja Øvrelid

We present a novel approach for text sanitization, which is the task of editing a document to mask all (direct and indirect) personal identifiers and thereby conceal the identity of the individuals(s) mentioned in the text. In contrast to previous work, the approach relies on explicit measures of privacy risk, making it possible to explicitly control the trade-off between privacy protection and data utility. The approach proceeds in three steps. A neural, privacy-enhanced entity recognizer is first employed to detect and classify potential personal identifiers. We then determine which entities, or combination of entities, are likely to pose a re-identification risk through a range of privacy risk assessment measures. We present three such measures of privacy risk, respectively based on (1) span probabilities derived from a BERT language model, (2) web search queries and (3) a classifier trained on labelled data. Finally, a linear optimization solver decides which entities to mask to minimize the semantic loss while simultaneously ensuring that the estimated privacy risk remains under a given threshold. We evaluate the approach both in the absence and presence of manually annotated data. Our results highlight the potential of the approach, as well as issues specific types of personal data can introduce to the process.

pdf bib
AGRank: Augmented Graph-based Unsupervised Keyphrase Extraction
Haoran Ding | Xiao Luo

Keywords or keyphrases are often used to highlight a document’s domains or main topics. Unsupervised keyphrase extraction (UKE) has always been highly anticipated because no labeled data is needed to train a model. This paper proposes an augmented graph-based unsupervised model to identify keyphrases from a document by integrating graph and deep learning methods. The proposed model utilizes mutual attention extracted from the pre-trained BERT model to build the candidate graph and augments the graph with global and local context nodes to improve the performance. The proposed model is evaluated on four publicly available datasets against thirteen UKE baselines. The results show that the proposed model is an effective and robust UKE model for long and short documents. Our source code is available on GitHub.

pdf bib
Towards Unified Representations of Knowledge Graph and Expert Rules for Machine Learning and Reasoning
Zhepei Wei | Yue Wang | Jinnan Li | Zhining Liu | Erxin Yu | Yuan Tian | Xin Wang | Yi Chang

With a knowledge graph and a set of if-then rules, can we reason about the conclusions given a set of observations? In this work, we formalize this question as the cognitive inference problem, and introduce the Cognitive Knowledge Graph (CogKG) that unifies two representations of heterogeneous symbolic knowledge: expert rules and relational facts. We propose a general framework in which the unified knowledge representations can perform both learning and reasoning. Specifically, we implement the above framework in two settings, depending on the availability of labeled data. When no labeled data are available for training, the framework can directly utilize symbolic knowledge as the decision basis and perform reasoning. When labeled data become available, the framework casts symbolic knowledge as a trainable neural architecture and optimizes the connection weights among neurons through gradient descent. Empirical study on two clinical diagnosis benchmarks demonstrates the superiority of the proposed method over time-tested knowledge-driven and data-driven methods, showing the great potential of the proposed method in unifying heterogeneous symbolic knowledge, i.e., expert rules and relational facts, as the substrate of machine learning and reasoning models.

pdf bib
Who did what to Whom? Language models and humans respond diversely to features affecting argument hierarchy construction
Xiaonan Xu | Haoshuo Chen

Pre-trained transformer-based language models have achieved state-of-the-art performance in many areas of NLP. It is still an open question whether the models are capable of integrating syntax and semantics in language processing like humans. This paper investigates if models and humans construct argument hierarchy similarly with the effects from telicity, agency, and individuation, using the Chinese structure “NP1+BA/BEI+NP2+VP”. We present both humans and six transformer-based models with prepared sentences and analyze their preference between BA (view NP1 as an agent) and BEI (NP2 as an agent). It is found that the models and humans respond to (non-)agentive features in telic context and atelic feature very similarly. However, the models show insufficient sensitivity to both pragmatic function in expressing undesirable events and different individuation degrees represented by human common nouns vs. proper names. By contrast, humans rely heavily on these cues to establish the thematic relation between two arguments NP1 and NP2. Furthermore, the models tend to interpret the subject as an agent, which is not the case for humans who align agents independently of subject position in Mandarin Chinese.

pdf bib
CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media
Momchil Hardalov | Anton Chernyavskiy | Ivan Koychev | Dmitry Ilvovsky | Preslav Nakov

While there has been substantial progress in developing systems to automate fact-checking, they still lack credibility in the eyes of the users. Thus, an interesting approach has emerged: to perform automatic fact-checking by verifying whether an input claim has been previously fact-checked by professional fact-checkers and to return back an article that explains their decision. This is a sensible approach as people trust manual fact-checking, and as many claims are repeated multiple times. Yet, a major issue when building such systems is the small number of known tweet–verifying article pairs available for training. Here, we aim to bridge this gap by making use of crowd fact-checking, i.e., mining claims in social media for which users have responded with a link to a fact-checking article. In particular, we mine a large-scale collection of 330,000 tweets paired with a corresponding fact-checking article. We further propose an end-to-end framework to learn from this noisy data based on modified self-adaptive training, in a distant supervision scenario. Our experiments on the CLEF’21 CheckThat! test set show improvements over the state of the art by two points absolute. Our code and datasets are available at

pdf bib
Hate Speech and Offensive Language Detection in Bengali
Mithun Das | Somnath Banerjee | Punyajoy Saha | Animesh Mukherjee

Social media often serves as a breeding ground for various hateful and offensive content. Identifying such content on social media is crucial due to its impact on the race, gender, or religion in an unprejudiced society. However, while there is extensive research in hate speech detection in English, there is a gap in hateful content detection in low-resource languages like Bengali. Besides, a current trend on social media is the use of Romanized Bengali for regular interactions. To overcome the existing research’s limitations, in this study, we develop an annotated dataset of 10K Bengali posts consisting of 5K actual and 5K Romanized Bengali tweets. We implement several baseline models for the classification of such hateful posts. We further explore the interlingual transfer mechanism to boost classification performance. Finally, we perform an in-depth error analysis by looking into the misclassified posts by the models. While training actual and Romanized datasets separately, we observe that XLM-Roberta performs the best. Further, we witness that on joint training and few-shot training, MuRIL outperforms other models by interpreting the semantic expressions better. We make our code and dataset public for others.

pdf bib
Learning Interpretable Latent Dialogue Actions With Less Supervision
Vojtěch Hudeček | Ondřej Dušek

We present a novel architecture for explainable modeling of task-oriented dialogues with discrete latent variables to represent dialogue actions. Our model is based on variational recurrent neural networks (VRNN) and requires no explicit annotation of semantic information. Unlike previous works, our approach models the system and user turns separately and performs database query modeling, which makes the model applicable to task-oriented dialogues while producing easily interpretable action latent variables. We show that our model outperforms previous approaches with less supervision in terms of perplexity and BLEU on three datasets, and we propose a way to measure dialogue success without the need for expert annotation. Finally, we propose a novel way to explain semantics of the latent variables with respect to system actions.

pdf bib
Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts
Asahi Ushio | Francesco Barbieri | Vitor Sousa | Leonardo Neves | Jose Camacho-Collados

Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, in which it adds another layer of complexity due to its noisy and dynamic nature. In this paper, we focus on NER in Twitter, one of the largest social media platforms, and construct a new NER dataset, TweetNER7, which contains seven entity types annotated over 11,382 tweets from September 2019 to August 2021. The dataset was constructed by carefully distributing the tweets over time and taking representative trends as a basis. Along with the dataset, we provide a set of language model baselines and perform an analysis on the language model performance on the task, especially analyzing the impact of different time periods. In particular, we focus on three important temporal aspects in our analysis: short-term degradation of NER models over time, strategies to fine-tune a language model over different periods, and self-labeling as an alternative to lack of recently-labeled data. TweetNER7 is released publicly ( along with the models fine-tuned on it (NER models have been integrated into TweetNLP and can be found at

pdf bib
PInKS: Preconditioned Commonsense Inference with Minimal Supervision
Ehsan Qasemi | Piyush Khanna | Qiang Ning | Muhao Chen

Reasoning with preconditions such as “glass can be used for drinking water unless the glass is shattered” remains an open problem for language models. The main challenge lies in the scarcity of preconditions data and the model’s lack of support for such reasoning. We present PInKS , Preconditioned Commonsense Inference with WeaK Supervision, an improved model for reasoning with preconditions through minimum supervision. We show, empirically and theoretically, that PInKS improves the results on benchmarks focused on reasoning with the preconditions of commonsense knowledge (up to 40% Macro-F1 scores). We further investigate PInKS through PAC-Bayesian informativeness analysis, precision measures, and ablation study.

pdf bib
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
Benjamin Muller | Luca Soldaini | Rik Koncel-Kedziorski | Eric Lind | Alessandro Moschitti

Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.

pdf bib
Discourse Parsing Enhanced by Discourse Dependence Perception
Yuqing Xing | Longyin Zhang | Fang Kong | Guodong Zhou

In recent years, top-down neural models have achieved significant success in text-level discourse parsing. Nevertheless, they still suffer from the top-down error propagation issue, especially when the performance on the upper-level tree nodes is terrible. In this research, we aim to learn from the correlations in between EDUs directly to shorten the hierarchical distance of the RST structure to alleviate the above problem. Specifically, we contribute a joint top-down framework that learns from both discourse dependency and constituency parsing through one shared encoder and two independent decoders. Moreover, we also explore a constituency-to-dependency conversion scheme tailored for the Chinese discourse corpus to ensure the high quality of the joint learning process. Our experimental results on CDTB show that the dependency information we use well heightens the understanding of the rhetorical structure, especially for the upper-level tree layers.

pdf bib
Prediction of People’s Emotional Response towards Multi-modal News
Ge Gao | Sejin Paik | Carley Reardon | Yanling Zhao | Lei Guo | Prakash Ishwar | Margrit Betke | Derry Tanti Wijaya

We aim to develop methods for understanding how multimedia news exposure can affect people’s emotional responses, and we especially focus on news content related to gun violence, a very important yet polarizing issue in the U.S. We created the dataset NEmo+ by significantly extending the U.S. gun violence news-to-emotions dataset, BU-NEmo, from 320 to 1,297 news headline and lead image pairings and collecting 38,910 annotations in a large crowdsourcing experiment. In curating the NEmo+ dataset, we developed methods to identify news items that will trigger similar versus divergent emotional responses. For news items that trigger similar emotional responses, we compiled them into the NEmo+-Consensus dataset. We benchmark models on this dataset that predict a person’s dominant emotional response toward the target news item (single-label prediction). On the full NEmo+ dataset, containing news items that would lead to both differing and similar emotional responses, we also benchmark models for the novel task of predicting the distribution of evoked emotional responses in humans when presented with multi-modal news content. Our single-label and multi-label prediction models outperform baselines by large margins across several metrics.

pdf bib
AugCSE: Contrastive Sentence Embedding with Diverse Augmentations
Zilu Tang | Muhammed Yusuf Kocyigit | Derry Tanti Wijaya

Data augmentation techniques have been proven useful in many applications in NLP fields. Most augmentations are task-specific, and cannot be used as a general-purpose tool. In our work, we present AugCSE, a unified framework to utilize diverse sets of data augmentations to achieve a better, general-purpose, sentence embedding model. Building upon the latest sentence embedding models, our approach uses a simple antagonistic discriminator that differentiates the augmentation types. With the finetuning objective borrowed from domain adaptation, we show that diverse augmentations, which often lead to conflicting contrastive signals, can be tamed to produce a better and more robust sentence representation. Our methods achieve state-of-the-art results on downstream transfer tasks and perform competitively on semantic textual similarity tasks, using only unsupervised data.

pdf bib
Seamlessly Integrating Factual Information and Social Content with Persuasive Dialogue
Maximillian Chen | Weiyan Shi | Feifan Yan | Ryan Hou | Jingwen Zhang | Saurav Sahay | Zhou Yu

Complex conversation settings such as persuasion involve communicating changes in attitude or behavior, so users’ perspectives need to be addressed, even when not directly related to the topic. In this work, we contribute a novel modular dialogue system framework that seamlessly integrates factual information and social content into persuasive dialogue. Our framework is generalizable to any dialogue tasks that have mixed social and task contents. We conducted a study that compared user evaluations of our framework versus a baseline end-to-end generation model. We found our model was evaluated to be more favorable in all dimensions including competence and friendliness compared to the baseline model which does not explicitly handle social content or factual questions.

pdf bib
Dual-Encoder Transformers with Cross-modal Alignment for Multimodal Aspect-based Sentiment Analysis
Zhewen Yu | Jin Wang | Liang-Chih Yu | Xuejie Zhang

Multimodal aspect-based sentiment analysis (MABSA) aims to extract the aspect terms from text and image pairs, and then analyze their corresponding sentiment. Recent studies typically use either a pipeline method or a unified transformer based on a cross-attention mechanism. However, these methods fail to explicitly and effectively incorporate the alignment between text and image. Supervised finetuning of the universal transformers for MABSA still requires a certain number of aligned image-text pairs. This study proposes a dual-encoder transformer with cross-modal alignment (DTCA). Two auxiliary tasks, including text-only extraction and text-patch alignment are introduced to enhance cross-attention performance. To align text and image, we propose an unsupervised approach which minimizes the Wasserstein distance between both modalities, forcing both encoders to produce more appropriate representations for the final extraction. Experimental results on two benchmarks demonstrate that DTCA consistently outperforms existing methods.

pdf bib
AVAST: Attentive Variational State Tracker in a Reinforced Navigator
Je-Wei Jang | Mahdin Rohmatillah | Jen-Tzung Chien

Recently, emerging approaches have been proposed to deal with robotic navigation problems, especially vision-and-language navigation task which is one of the most realistic indoor navigation challenge tasks. This task can be modelled as a sequential decision-making problem, which is suitable to be solved by deep reinforcement learning. Unfortunately, the observations provided from the simulator in this task are not fully observable states, which exacerbate the difficulty of implementing reinforcement learning. To deal with this challenge, this paper presents a novel method, called as attentive variational state tracker (AVAST), a variational approach to approximate belief state distribution for the construction of a reinforced navigator. The variational approach is introduced to improve generalization to the unseen environment which barely achieved by traditional deterministic state tracker. In order to stabilize the learning procedure, a fine-tuning process using policy optimization is proposed. From the experimental results, the proposed AVAST does improve the generalization relative to previous works in vision-and-language navigation task. A significant performance is achieved without requiring any additional exploration in the unseen environment.

pdf bib
Phylogeny-Inspired Adaptation of Multilingual Models to New Languages
Fahim Faisal | Antonios Anastasopoulos

Large pretrained multilingual models, trained on dozens of languages, have delivered promising results due to cross-lingual learning capabilities on a variety of language tasks. Further adapting these models to specific languages, especially ones unseen during pre-training, is an important goal toward expanding the coverage of language technologies. In this study, we show how we can use language phylogenetic information to improve cross-lingual transfer leveraging closely related languages in a structured, linguistically-informed manner. We perform adapter-based training on languages from diverse language families (Germanic, Uralic, Tupian, Uto-Aztecan) and evaluate on both syntactic and semantic tasks, obtaining more than 20% relative performance improvements over strong commonly used baselines, especially on languages unseen during pre-training.

pdf bib
Transferring Knowledge via Neighborhood-Aware Optimal Transport for Low-Resource Hate Speech Detection
Tulika Bose | Irina Illina | Dominique Fohr

The concerning rise of hateful content on online platforms has increased the attention towards automatic hate speech detection, commonly formulated as a supervised classification task. State-of-the-art deep learning-based approaches usually require a substantial amount of labeled resources for training. However, annotating hate speech resources is expensive, time-consuming, and often harmful to the annotators. This creates a pressing need to transfer knowledge from the existing labeled resources to low-resource hate speech corpora with the goal of improving system performance. For this, neighborhood-based frameworks have been shown to be effective. However, they have limited flexibility. In our paper, we propose a novel training strategy that allows flexible modeling of the relative proximity of neighbors retrieved from a resource-rich corpus to learn the amount of transfer. In particular, we incorporate neighborhood information with Optimal Transport, which permits exploiting the geometry of the data embedding space. By aligning the joint embedding and label distributions of neighbors, we demonstrate substantial improvements over strong baselines, in low-resource scenarios, on different publicly available hate speech corpora.

pdf bib
Bag-of-Vectors Autoencoders for Unsupervised Conditional Text Generation
Florian Mai | James Henderson

Text autoencoders are often used for unsupervised conditional text generation by applying mappings in the latent space to change attributes to the desired values. Recently, Mai et al. (2020) proposed Emb2Emb, a method to learn these mappings in the embedding space of an autoencoder. However, their method is restricted to autoencoders with a single-vector embedding, which limits how much information can be retained. We address this issue by extending their method to Bag-of-Vectors Autoencoders (BoV-AEs), which encode the text into a variable-size bag of vectors that grows with the size of the text, as in attention-based models. This allows to encode and reconstruct much longer texts than standard autoencoders. Analogous to conventional autoencoders, we propose regularization techniques that facilitate learning meaningful operations in the latent space. Finally, we adapt Emb2Emb for a training scheme that learns to map an input bag to an output bag, including a novel loss function and neural architecture. Our empirical evaluations on unsupervised sentiment transfer show that our method performs substantially better than a standard autoencoder.

pdf bib
RecInDial: A Unified Framework for Conversational Recommendation with Pretrained Language Models
Lingzhi Wang | Huang Hu | Lei Sha | Can Xu | Daxin Jiang | Kam-Fai Wong

Conversational Recommender System (CRS), which aims to recommend high-quality items to users through interactive conversations, has gained great research interest recently. A CRS is usually composed of a recommendation module and a generation module. In the previous work, these two modules are loosely connected in the model training and are shallowly integrated during inference, where a simple switching or copy mechanism is adopted to incorporate recommended items into generated responses. Moreover, the current end-to-end neural models trained on small crowd-sourcing datasets (e.g., 10K dialogs in the ReDial dataset) tend to overfit and have poor chit-chat ability. In this work, we propose a novel unified framework that integrates recommendation into the dialog (RecInDial) generation by introducing a vocabulary pointer. To tackle the low-resource issue in CRS, we finetune the large-scale pretrained language models to generate fluent and diverse responses, and introduce a knowledge-aware bias learned from an entity-oriented knowledge graph to enhance the recommendation performance. Furthermore, we propose to evaluate the CRS models in an end-to-end manner, which can reflect the overall performance of the entire system rather than the performance of individual modules, compared to the separate evaluations of the two modules used in previous work. Experiments on the benchmark dataset ReDial show our RecInDial model significantly surpasses the state-of-the-art methods. More extensive analyses show the effectiveness of our model.

pdf bib
SummVD : An efficient approach for unsupervised topic-based text summarization
Gabriel Shenouda | Aurélien Bossard | Oussama Ayoub | Christophe Rodrigues

This paper introduces a new method, SummVD, for automatic unsupervised extractive summarization. This method is based on singular value decomposition, a linear method in the number of words, in order to reduce the dimensionality of word embeddings and propose a representation of words on a small number of dimensions, each representing a hidden topic. It also uses word clustering to reduce the vocabulary size. This representation, specific to one document, reduces the noise brought by several dimensions of the embeddings that are useless in a restricted context. It is followed by a linear sentence extraction heuristic. This makes SummVD an efficient method for text summarization. We evaluate SummVD using several corpora of different nature (news, scientific articles, social network). Our method outperforms in effectiveness recent extractive approaches. Moreover, SummVD requires low resources, in terms of data and computing power. So it can be run on long single documents such as scientific papers as much as large multi-document corpora and is fast enough to be used in live summarization systems.

pdf bib
Director: Generator-Classifiers For Supervised Language Modeling
Kushal Arora | Kurt Shuster | Sainbayar Sukhbaatar | Jason Weston

Current language models achieve low perplexity but their resulting generations still suffer from toxic responses, repetitiveness, and contradictions. The standard language modeling setup fails to address these issues. In this paper, we introduce a new architecture, Director, that consists of a unified generator-classifier with both a language modeling and a classification head for each output token. Training is conducted jointly using both standard language modeling data, and data labeled with desirable and undesirable sequences. Experiments in several settings show that the model has competitive training and decoding speed compared to standard language models while yielding superior results, avoiding undesirable behaviors while maintaining generation quality. It also outperforms existing model guiding approaches in terms of both accuracy and efficiency. Our code is made publicly available.

pdf bib
VLStereoSet: A Study of Stereotypical Bias in Pre-trained Vision-Language Models
Kankan Zhou | Eason Lai | Jing Jiang

In this paper we study how to measure stereotypical bias in pre-trained vision-language models. We leverage a recently released text-only dataset, StereoSet, which covers a wide range of stereotypical bias, and extend it into a vision-language probing dataset called VLStereoSet to measure stereotypical bias in vision-language models. We analyze the differences between text and image and propose a probing task that detects bias by evaluating a model’s tendency to pick stereotypical statements as captions for anti-stereotypical images. We further define several metrics to measure both a vision-language model’s overall stereotypical bias and its intra-modal and inter-modal bias. Experiments on six representative pre-trained vision-language models demonstrate that stereotypical biases clearly exist in most of these models and across all four bias categories, with gender bias slightly more evident. Further analysis using gender bias data and two vision-language models also suggest that both intra-modal and inter-modal bias exist.

pdf bib
Dynamic Context Extraction for Citation Classification
Suchetha Nambanoor Kunnath | David Pride | Petr Knoth

We investigate the effect of varying citation context window sizes on model performance in citation intent classification. Prior studies have been limited to the application of fixed-size contiguous citation contexts or the use of manually curated citation contexts. We introduce a new automated unsupervised approach for the selection of a dynamic-size and potentially non-contiguous citation context, which utilises the transformer-based document representations and embedding similarities. Our experiments show that the addition of non-contiguous citing sentences improves performance beyond previous results. Evalu- ating on the (1) domain-specific (ACL-ARC) and (2) the multi-disciplinary (SDP-ACT) dataset demonstrates that the inclusion of additional context beyond the citing sentence significantly improves the citation classifi- cation model’s performance, irrespective of the dataset’s domain. We release the datasets and the source code used for the experiments at:

pdf bib
Affective Retrofitted Word Embeddings
Sapan Shah | Sreedhar Reddy | Pushpak Bhattacharyya

Word embeddings learned using the distributional hypothesis (e.g., GloVe, Word2vec) do not capture the affective dimensions of valence, arousal, and dominance, which are present inherently in words. We present a novel retrofitting method for updating embeddings of words for their affective meaning. It learns a non-linear transformation function that maps pre-trained embeddings to an affective vector space, in a representation learning setting. We investigate word embeddings for their capacity to cluster emotion-bearing words. The affective embeddings learned by our method achieve better inter-cluster and intra-cluster distance for words having the same emotions, as evaluated through different cluster quality metrics. For the downstream tasks on sentiment analysis and sarcasm detection, simple classification models, viz. SVM and Attention Net, learned using our affective embeddings perform better than their pre-trained counterparts (more than 1.5% improvement in F1-score) and other benchmarks. Furthermore, the difference in performance is more pronounced in limited data setting.

pdf bib
Is Encoder-Decoder Redundant for Neural Machine Translation?
Yingbo Gao | Christian Herold | Zijian Yang | Hermann Ney

Encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of attention mechanism, encoder-decoder is still the de facto neural network architecture for state-of-the-art models. While the motivation for decoding information from some hidden space is straightforward, the strict separation of the encoding and decoding steps into an encoder and a decoder in the model architecture is not necessarily a must. Compared to the task of autoregressive language modeling in the target language, machine translation simply has an additional source sentence as context. Given the fact that neural language models nowadays can already handle rather long contexts in the target language, it is natural to ask whether simply concatenating the source and target sentences and training a language model to do translation would work. In this work, we investigate the aforementioned concept for machine translation. Specifically, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation. In all cases, this alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.

pdf bib
SAPGraph: Structure-aware Extractive Summarization for Scientific Papers with Heterogeneous Graph
Siya Qi | Lei Li | Yiyang Li | Jin Jiang | Dingxin Hu | Yuze Li | Yingqi Zhu | Yanquan Zhou | Marina Litvak | Natalia Vanetik

Scientific paper summarization is always challenging in Natural Language Processing (NLP) since it is hard to collect summaries from such long and complicated text. We observe that previous works tend to extract summaries from the head of the paper, resulting in information incompleteness. In this work, we present SAPGraph to utilize paper structure for solving this problem. SAPGraph is a scientific paper extractive summarization framework based on a structure-aware heterogeneous graph, which models the document into a graph with three kinds of nodes and edges based on structure information of facets and knowledge. Additionally, we provide a large-scale dataset of COVID-19-related papers, CORD-SUM. Experiments on CORD-SUM and ArXiv datasets show that SAPGraph generates more comprehensive and valuable summaries compared to previous works.

pdf bib
Toward Implicit Reference in Dialog: A Survey of Methods and Data
Lindsey Vanderlyn | Talita Anthonio | Daniel Ortega | Michael Roth | Ngoc Thang Vu

Communicating efficiently in natural language requires that we often leave information implicit, especially in spontaneous speech. This frequently results in phenomena of incompleteness, such as omitted references, that pose challenges for language processing. In this survey paper, we review the state of the art in research regarding the automatic processing of such implicit references in dialog scenarios, discuss weaknesses with respect to inconsistencies in task definitions and terminologies, and outline directions for future work. Among others, these include a unification of existing tasks, addressing data scarcity, and taking into account model and annotator uncertainties.

pdf bib
A Decade of Knowledge Graphs in Natural Language Processing: A Survey
Phillip Schneider | Tim Schopf | Juraj Vladika | Mikhail Galkin | Elena Simperl | Florian Matthes

In pace with developments in the research field of artificial intelligence, knowledge graphs (KGs) have attracted a surge of interest from both academia and industry. As a representation of semantic relations between entities, KGs have proven to be particularly relevant for natural language processing (NLP), experiencing a rapid spread and wide adoption within recent years. Given the increasing amount of research work in this area, several KG-related approaches have been surveyed in the NLP research community. However, a comprehensive study that categorizes established topics and reviews the maturity of individual research streams remains absent to this day. Contributing to closing this gap, we systematically analyzed 507 papers from the literature on KGs in NLP. Our survey encompasses a multifaceted review of tasks, research types, and contributions. As a result, we present a structured overview of the research landscape, provide a taxonomy of tasks, summarize our findings, and highlight directions for future work.

pdf bib
Multimodal Generation of Radiology Reports using Knowledge-Grounded Extraction of Entities and Relations
Francesco Dalla Serra | William Clackett | Hamish MacKinnon | Chaoyang Wang | Fani Deligianni | Jeff Dalton | Alison Q. O’Neil

Automated reporting has the potential to assist radiologists with the time-consuming procedure of generating text radiology reports. Most existing approaches generate the report directly from the radiology image, however we observe that the resulting reports exhibit realistic style but lack clinical accuracy. Therefore, we propose a two-step pipeline that subdivides the problem into factual triple extraction followed by free-text report generation. The first step comprises supervised extraction of clinically relevant structured information from the image, expressed as triples of the form (entity1, relation, entity2). In the second step, these triples are input to condition the generation of the radiology report. In particular, we focus our work on Chest X-Ray (CXR) radiology report generation. The proposed framework shows state-of-the-art results on the MIMIC-CXR dataset according to most of the standard text generation metrics that we employ (BLEU, METEOR, ROUGE) and to clinical accuracy metrics (recall, precision and F1 assessed using the CheXpert labeler), also giving a 23% reduction in the total number of errors and a 29% reduction in critical clinical errors as assessed by expert human evaluation. In future, this solution can easily integrate more advanced model architectures - to both improve the triple extraction and the report generation - and can be applied to other complex image captioning tasks, such as those found in the medical domain.

pdf bib
SBERT studies Meaning Representations: Decomposing Sentence Embeddings into Explainable Semantic Features
Juri Opitz | Anette Frank

Models based on large-pretrained language models, such as S(entence)BERT, provide effective and efficient sentence embeddings that show high correlation to human similarity ratings, but lack interpretability. On the other hand, graph metrics for graph-based meaning representations (e.g., Abstract Meaning Representation, AMR) can make explicit the semantic aspects in which two sentences are similar. However, such metrics tend to be slow, rely on parsers, and do not reach state-of-the-art performance when rating sentence similarity. In this work, we aim at the best of both worlds, by learning to induce Semantically Structured Sentence BERT embeddings (S3BERT). Our S3BERT embeddings are composed of explainable sub-embeddings that emphasize various sentence meaning features (e.g., semantic roles, negation, or quantification). We show how to i) learn a decomposition of the sentence embeddings into meaning features, through approximation of a suite of interpretable semantic AMR graph metrics, and how to ii) preserve the overall power of the neural embeddings by controlling the decomposition learning process with a second objective that enforces consistency with the similarity ratings of an SBERT teacher model. In our experimental studies, we show that our approach offers interpretability – while preserving the effectiveness and efficiency of the neural sentence embeddings.

pdf bib
The Lifecycle of “Facts”: A Survey of Social Bias in Knowledge Graphs
Angelie Kraft | Ricardo Usbeck

Knowledge graphs are increasingly used in a plethora of downstream tasks or in the augmentation of statistical models to improve factuality. However, social biases are engraved in these representations and propagate downstream. We conducted a critical analysis of literature concerning biases at different steps of a knowledge graph lifecycle. We investigated factors introducing bias, as well as the biases that are rendered by knowledge graphs and their embedded versions afterward. Limitations of existing measurement and mitigation strategies are discussed and paths forward are proposed.

pdf bib
Food Knowledge Representation Learning with Adversarial Substitution
Diya Li | Mohammed J Zaki

Knowledge graph embedding (KGE) has been well-studied in general domains, but has not been examined for food computing. To fill this gap, we perform knowledge representation learning over a food knowledge graph (KG). We employ a pre-trained language model to encode entities and relations, thus emphasizing contextual information in food KGs. The model is trained on two tasks – predicting a masked entity from a given triple from the KG and predicting the plausibility of a triple. Analysis of food substitutions helps in dietary choices for enabling healthier eating behaviors. Previous work in food substitutions mainly focuses on semantic similarity while ignoring the context. It is also hard to evaluate the substitutions due to the lack of an adequate validation set, and further, the evaluation is subjective based on perceived purpose. To tackle this problem, we propose a collection of adversarial sample generation strategies for different food substitutions over our learnt KGE. We propose multiple strategies to generate high quality context-aware recipe and ingredient substitutions and also provide generalized ingredient substitutions to meet different user needs. The effectiveness and efficiency of the proposed knowledge graph learning method and the following attack strategies are verified by extensive evaluations on a large-scale food KG.

pdf bib
Construction Repetition Reduces Information Rate in Dialogue
Mario Giulianelli | Arabella Sinclair | Raquel Fernández

Speakers repeat constructions frequently in dialogue. Due to their peculiar information-theoretic properties, repetitions can be thought of as a strategy for cost-effective communication. In this study, we focus on the repetition of lexicalised constructions—i.e., recurring multi-word units—in English open-domain spoken dialogues. We hypothesise that speakers use construction repetition to mitigate information rate, leading to an overall decrease in utterance information content over the course of a dialogue. We conduct a quantitative analysis, measuring the information content of constructions and that of their containing utterances, estimating information content with an adaptive neural language model. We observe that construction usage lowers the information content of utterances. This facilitating effect (i) increases throughout dialogues, (ii) is boosted by repetition, (iii) grows as a function of repetition frequency and density, and (iv) is stronger for repetitions of referential constructions.

pdf bib
Analogy-Guided Evolutionary Pretraining of Binary Word Embeddings
R. Alexander Knipper | Md. Mahadi Hassan | Mehdi Sadi | Shubhra Kanti Karmaker Santu

As we begin to see low-powered computing paradigms (Neuromorphic Computing, Spiking Neural Networks, etc.) becoming more popular, learning binary word embeddings has become increasingly important for supporting NLP applications at the edge. Existing binary word embeddings are mostly derived from pretrained real-valued embeddings through different simple transformations, which often break the semantic consistency and the so-called “arithmetic” properties learned by the original, real-valued embeddings. This paper aims to address this limitation by introducing a new approach to learn binary embeddings from scratch, preserving the semantic relationships between words as well as the arithmetic properties of the embeddings themselves. To achieve this, we propose a novel genetic algorithm to learn the relationships between words from existing word analogy data-sets, carefully making sure that the arithmetic properties of the relationships are preserved. Evaluating our generated 16, 32, and 64-bit binary word embeddings on Mikolov’s word analogy task shows that more than 95% of the time, the best fit for the analogy is ranked in the top 5 most similar words in terms of cosine similarity.

pdf bib
Contrastive Video-Language Learning with Fine-grained Frame Sampling
Zixu Wang | Yujie Zhong | Yishu Miao | Lin Ma | Lucia Specia

Despite recent progress in video and language representation learning, the weak or sparse correspondence between the two modalities remains a bottleneck in the area. Most video-language models are trained via pair-level loss to predict whether a pair of video and text is aligned. However, even in paired video-text segments, only a subset of the frames are semantically relevant to the corresponding text, with the remainder representing noise; where the ratio of noisy frames is higher for longer videos. We propose FineCo (Fine-grained Contrastive Loss for Frame Sampling), an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames. It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence. Building on the well established VideoCLIP model as a starting point, FineCo achieves state-of-the-art performance on YouCookII, a text-video retrieval benchmark with long videos. FineCo also achieves competitive results on text-video retrieval (MSR-VTT), and video question answering datasets (MSR-VTT QA and MSR-VTT MC) with shorter videos.

pdf bib
Enhancing Tabular Reasoning with Pattern Exploiting Training
Abhilash Shankarampeta | Vivek Gupta | Shuo Zhang

Recent methods based on pre-trained language models have exhibited superior performance over tabular tasks (e.g., tabular NLI), despite showing inherent problems such as not using the right evidence and inconsistent predictions across inputs while reasoning over the tabular data (Gupta et al., 2021). In this work, we utilize Pattern-Exploiting Training (PET) (i.e., strategic MLM) on pre-trained language models to strengthen these tabular reasoning models’ pre-existing knowledge and reasoning abilities. Our upgraded model exhibits a superior understanding of knowledge facts and tabular reasoning compared to current baselines. Additionally, we demonstrate that such models are more effective for underlying downstream tasks of tabular inference on INFOTABS. Furthermore, we show our model’s robustness against adversarial sets generated through various character and word level perturbations.

pdf bib
Re-contextualizing Fairness in NLP: The Case of India
Shaily Bhatt | Sunipa Dev | Partha Talukdar | Shachi Dave | Vinodkumar Prabhakaran

Recent research has revealed undesirable biases in NLP data and models. However, these efforts focus of social disparities in West, and are not directly portable to other geo-cultural contexts. In this paper, we focus on NLP fairness in the context of India. We start with a brief account of the prominent axes of social disparities in India. We build resources for fairness evaluation in the Indian context and use them to demonstrate prediction biases along some of the axes. We then delve deeper into social stereotypes for Region and Religion, demonstrating its prevalence in corpora and models. Finally, we outline a holistic research agenda to re-contextualize NLP fairness research for the Indian context, accounting for Indian societal context, bridging technological gaps in NLP capabilities and resources, and adapting to Indian cultural values. While we focus on India, this framework can be generalized to other geo-cultural contexts.

pdf bib
Low-Resource Multilingual and Zero-Shot Multispeaker TTS
Florian Lux | Julia Koch | Ngoc Thang Vu

While neural methods for text-to-speech (TTS) have shown great advances in modeling multiple speakers, even in zero-shot settings, the amount of data needed for those approaches is generally not feasible for the vast majority of the world’s over 6,000 spoken languages. In this work, we bring together the tasks of zero-shot voice cloning and multilingual low-resource TTS. Using the language agnostic meta learning (LAML) procedure and modifications to a TTS encoder, we show that it is possible for a system to learn speaking a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language. We show the success of our proposed approach in terms of intelligibility, naturalness and similarity to target speaker using objective metrics as well as human studies and provide our code and trained models open source.

pdf bib
Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps
Hiroki Iida | Naoaki Okazaki

IR models using a pretrained language model significantly outperform lexical approaches like BM25. In particular, SPLADE, which encodes texts to sparse vectors, is an effective model for practical use because it shows robustness to out-of-domain datasets. However, SPLADE still struggles with exact matching of low-frequency words in training data. In addition, domain shifts in vocabulary and word frequencies deteriorate the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing the domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method by filling vocabulary and word-frequency gaps. First, we expand a vocabulary and execute continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights to consider the importance of documents with low-frequency words. We conducted experiments using our method on datasets with a large vocabulary gap from a source domain. We show that our method outperforms the present state-of-the-art domain adaptation method. In addition, our method achieves state-of-the-art results, combined with BM25.

pdf bib
KESA: A Knowledge Enhanced Approach To Sentiment Analysis
Qinghua Zhao | Shuai Ma | Shuo Ren

Though some recent works focus on injecting sentiment knowledge into pre-trained language models, they usually design mask and reconstruction tasks in the post-training phase. This paper aims to integrate sentiment knowledge in the fine-tuning stage. To achieve this goal, we propose two sentiment-aware auxiliary tasks named sentiment word selection and conditional sentiment prediction and, correspondingly, integrate them into the objective of the downstream task. The first task learns to select the correct sentiment words from the given options. The second task predicts the overall sentiment polarity, with the sentiment polarity of the word given as prior knowledge. In addition, two label combination methods are investigated to unify multiple types of labels in each auxiliary task. Experimental results demonstrate that our approach consistently outperforms baselines (achieving a new state-of-the-art) and is complementary to existing sentiment-enhanced post-trained models.

pdf bib
Cross-lingual Few-Shot Learning on Unseen Languages
Genta Winata | Shijie Wu | Mayank Kulkarni | Thamar Solorio | Daniel Preotiuc-Pietro

Large pre-trained language models (LMs) have demonstrated the ability to obtain good performance on downstream tasks with limited examples in cross-lingual settings. However, this was mostly studied for relatively resource-rich languages, where at least enough unlabeled data is available to be included in pre-training a multilingual language model. In this paper, we explore the problem of cross-lingual transfer in unseen languages, where no unlabeled data is available for pre-training a model. We use a downstream sentiment analysis task across 12 languages, including 8 unseen languages, to analyze the effectiveness of several few-shot learning strategies across the three major types of model architectures and their learning dynamics. We also compare strategies for selecting languages for transfer and contrast findings across languages seen in pre-training compared to those that are not. Our findings contribute to the body of knowledge on cross-lingual models for low-resource settings that is paramount to increasing coverage, diversity, and equity in access to NLP technology. We show that, in few-shot learning, linguistically similar and geographically similar languages are useful for cross-lingual adaptation, but taking the context from a mixture of random source languages is surprisingly more effective. We also compare different model architectures and show that the encoder-only model, XLM-R, gives the best downstream task performance.

pdf bib
Domain-aware Self-supervised Pre-training for Label-Efficient Meme Analysis
Shivam Sharma | Mohd Khizir Siddiqui | Md. Shad Akhtar | Tanmoy Chakraborty

Existing self-supervised learning strategies are constrained to either a limited set of objectives or generic downstream tasks that predominantly target uni-modal applications. This has isolated progress for imperative multi-modal applications that are diverse in terms of complexity and domain-affinity, such as meme analysis. Here, we introduce two self-supervised pre-training methods, namely Ext-PIE-Net and MM-SimCLR that (i) employ off-the-shelf multi-modal hate-speech data during pre-training and (ii) perform self-supervised learning by incorporating multiple specialized pretext tasks, effectively catering to the required complex multi-modal representation learning for meme analysis. We experiment with different self-supervision strategies, including potential variants that could help learn rich cross-modality representations and evaluate using popular linear probing on the Hateful Memes task. The proposed solutions strongly compete with the fully supervised baseline via label-efficient training while distinctly outperforming them on all three tasks of the Memotion challenge with 0.18%, 23.64%, and 0.93% performance gain, respectively. Further, we demonstrate the generalizability of the proposed solutions by reporting competitive performance on the HarMeme task. Finally, we empirically establish the quality of the learned representations by analyzing task-specific learning, using fewer labeled training samples, and arguing that the complexity of the self-supervision strategy and downstream task at hand are correlated. Our efforts highlight the requirement of better multi-modal self-supervision methods involving specialized pretext tasks for efficient fine-tuning and generalizable performance.

pdf bib
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Hugo Berg | Siobhan Hall | Yash Bhalgat | Hannah Kirk | Aleksandar Shtedritski | Max Bain

Vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddings to text queries that are jointly trained with adversarial debiasing and a contrastive loss, reduces various bias measures with minimal degradation to the image-text representation.

pdf bib
Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga | Nisansa de Silva

Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not be always correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of family, geographical location, GDP and the speaker population of languages and provide possible reasons for this disparity, along with some suggestions to overcome the same.

pdf bib
Neural Readability Pairwise Ranking for Sentences in Italian Administrative Language
Martina Miliani | Serena Auriemma | Fernando Alva-Manchego | Alessandro Lenci

Automatic Readability Assessment aims at assigning a complexity level to a given text, which could help improve the accessibility to information in specific domains, such as the administrative one. In this paper, we investigate the behavior of a Neural Pairwise Ranking Model (NPRM) for sentence-level readability assessment of Italian administrative texts. To deal with data scarcity, we experiment with cross-lingual, cross- and in-domain approaches, and test our models on Admin-It, a new parallel corpus in the Italian administrative language, containing sentences simplified using three different rewriting strategies. We show that NPRMs are effective in zero-shot scenarios (~0.78 ranking accuracy), especially with ranking pairs containing simplifications produced by overall rewriting at the sentence-level, and that the best results are obtained by adding in-domain data (achieving perfect performance for such sentence pairs). Finally, we investigate where NPRMs failed, showing that the characteristics of the training data, rather than its size, have a bigger effect on a model’s performance.

pdf bib
Delivering Fairness in Human Resources AI: Mutual Information to the Rescue
Leo Hemamou | William Coleman

Automatic language processing is used frequently in the Human Resources (HR) sector for automated candidate sourcing and evaluation of resumes. These models often use pre-trained language models where it is difficult to know if possible biases exist. Recently, Mutual Information (MI) methods have demonstrated notable performance in obtaining representations agnostic to sensitive variables such as gender or ethnicity. However, accessing these variables can sometimes be challenging, and their use is prohibited in some jurisdictions. These factors can make detecting and mitigating biases challenging. In this context, we propose to minimize the MI between a candidate’s name and a latent representation of their CV or short biography. This method may mitigate bias from sensitive variables without requiring the collection of these variables. We evaluate this methodology by first projecting the name representation into a smaller space to prevent potential MI minimization problems in high dimensions.

pdf bib
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Thinh Hung Truong | Yulia Otmakhova | Timothy Baldwin | Trevor Cohn | Jey Han Lau | Karin Verspoor

Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise–hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is constructed by making minimal modifications to the premise in order to reflect different possible interpretations. Aside from adopting standard NLI labels, our test suite is systematically constructed under a rigorous linguistic framework. It includes annotation of negation types and constructions grounded in linguistic theory, as well as the operations used to construct hypotheses. This facilitates fine-grained analysis of model performance. We conduct experiments using pre-trained language models to demonstrate that our test suite is more challenging than existing benchmarks focused on negation, and show how our annotation supports a deeper understanding of the current NLI capabilities in terms of negation and quantification.

pdf bib
HaRiM+: Evaluating Summary Quality with Hallucination Risk
Seonil (Simon) Son | Junsoo Park | Jeong-in Hwang | Junghwa Lee | Hyungjong Noh | Yeonsoo Lee

One of the challenges of developing a summarization model arises from the difficulty in measuring the factual inconsistency of the generated text. In this study, we reinterpret the decoder overconfidence-regularizing objective suggested in (Miao et al., 2021) as a hallucination risk measurement to better estimate the quality of generated summaries. We propose a reference-free metric, HaRiM+, which only requires an off-the-shelf summarization model to compute the hallucination risk based on token likelihoods. Deploying it requires no additional training of models or ad-hoc modules, which usually need alignment to human judgments. For summary-quality estimation, HaRiM+ records state-of-the-art correlation to human judgment on three summary-quality annotation sets: FRANK, QAGS, and SummEval. We hope that our work, which merits the use of summarization models, facilitates the progress of both automated evaluation and generation of summary.

pdf bib
The lack of theory is painful: Modeling Harshness in Peer Review Comments
Rajeev Verma | Rajarshi Roychoudhury | Tirthankar Ghosal

The peer-review system has primarily remained the central process of all science communications. However, research has shown that the process manifests a power-imbalance scenario where the reviewer enjoys a position where their comments can be overly critical and wilfully obtuse without being held accountable. This brings into question the sanctity of the peer-review process, turning it into a fraught and traumatic experience for authors. A little more effort to still remain critical but be constructive in the feedback would help foster a progressive outcome from the peer-review process. In this paper, we argue to intervene at the step where this power imbalance actually begins in the system. To this end, we develop the first dataset of peer-review comments with their real-valued harshness scores. We build our dataset by using the popular Best-Worst-Scaling mechanism. We show the utility of our dataset for text moderation in peer reviews to make review reports less hurtful and more welcoming. We release our dataset and associated codes in Our research is one step towards helping create constructive peer-review reports.

pdf bib
Dual Mechanism Priming Effects in Hindi Word Order
Sidharth Ranjan | Marten van Schijndel | Sumeet Agarwal | Rajakrishnan Rajkumar

Word order choices during sentence production can be primed by preceding sentences. In this work, we test the DUAL MECHANISM hypothesis that priming is driven by multiple different sources. Using a Hindi corpus of text productions, we model lexical priming with an n-gram cache model, and we capture more abstract syntactic priming with an adaptive neural language model. We permute the preverbal constituents of corpus sentences and then use a logistic regression model to predict which sentences actually occurred in the corpus against artificially generated meaning-equivalent variants. Our results indicate that lexical priming and lexically-independent syntactic priming affect complementary sets of verb classes. By showing that different priming influences are separable from one another, our results support the hypothesis that multiple different cognitive mechanisms underlie priming.

pdf bib
Unsupervised Single Document Abstractive Summarization using Semantic Units
Jhen-Yi Wu | Ying-Jia Lin | Hung-Yu Kao

In this work, we study the importance of content frequency on abstractive summarization, where we define the content as “semantic units.” We propose a two-stage training framework to let the model automatically learn the frequency of each semantic unit in the source text. Our model is trained in an unsupervised manner since the frequency information can be inferred from source text only. During inference, our model identifies sentences with high-frequency semantic units and utilizes frequency information to generate summaries from the filtered sentences. Our model performance on the CNN/Daily Mail summarization task outperforms the other unsupervised methods under the same settings. Furthermore, we achieve competitive ROUGE scores with far fewer model parameters compared to several large-scale pre-trained models. Our model can be trained under low-resource language settings and thus can serve as a potential solution for real-world applications where pre-trained models are not applicable.

pdf bib
Detecting Incongruent News Articles Using Multi-head Attention Dual Summarization
Sujit Kumar | Gaurav Kumar | Sanasam Ranbir Singh

With the increasing use of influencing incongruent news headlines for spreading fake news, detecting incongruent news articles has become an important research challenge. Most of the earlier studies on incongruity detection focus on estimating the similarity between the headline and the encoding of the body or its summary. However, most of these methods fail to handle incongruent news articles created with embedded noise. Motivated by the above issue, this paper proposes a Multi-head Attention Dual Summary (MADS) based method which generates two types of summaries that capture the congruent and incongruent parts in the body separately. From various experimental setups over three publicly available datasets, it is evident that the proposed model outperforms the state-of-the-art baseline counterparts.

pdf bib
Meta-Learning based Deferred Optimisation for Sentiment and Emotion aware Multi-modal Dialogue Act Classification
Tulika Saha | Aditya Prakash Patra | Sriparna Saha | Pushpak Bhattacharyya

Dialogue Act Classification (DAC) that determines the communicative intention of an utterance has been investigated widely over the years as a standalone task. But the emotional state of the speaker has a considerable effect on its pragmatic content. Sentiment as a human behavior is also closely related to emotion and one aids in the better understanding of the other. Thus, their role in identification of DAs needs to be explored. As a first step, we extend the newly released multi-modal EMOTyDA dataset to enclose sentiment tags for each utterance. In order to incorporate these multiple aspects, we propose a Dual Attention Mechanism (DAM) based multi-modal, multi-tasking conversational framework. The DAM module encompasses intra-modal and interactive inter-modal attentions with multiple loss optimization at various hierarchies to fuse multiple modalities efficiently and learn generalized features across all the tasks. Additionally, to counter the class-imbalance issue in dialogues, we introduce a 2-step Deferred Optimisation Schedule (DOS) that involves Meta-Net (MN) learning and deferred re-weighting where the former helps to learn an explicit weighting function from data automatically and the latter deploys a re-weighted multi-task loss with a smaller learning rate. Empirically, we establish that the joint optimisation of multi-modal DAC, SA and ER tasks along with the incorporation of 2-step DOS and MN learning produces better results compared to its different counterparts and outperforms state-of-the-art model.

pdf bib
Enhancing Financial Table and Text Question Answering with Tabular Graph and Numerical Reasoning
Rungsiman Nararatwong | Natthawut Kertkeidkachorn | Ryutaro Ichise

Typical financial documents consist of tables, texts, and numbers. Given sufficient training data, large language models (LM) can learn the tabular structures and perform numerical reasoning well in question answering (QA). However, their performances fall significantly when data and computational resources are limited. This study improves this performance drop by infusing explicit tabular structures through a graph neural network (GNN). We proposed a model developed from the baseline of a financial QA dataset named TAT-QA. The baseline model, TagOp, consists of answer span (evidence) extraction and numerical reasoning modules. As our main contributions, we introduced two components to the model: a GNN-based evidence extraction module for tables and an improved numerical reasoning module. The latter provides a solution to TagOp’s arithmetic calculation problem specific to operations requiring number ordering, such as subtraction and division, which account for a large portion of numerical reasoning. Our evaluation shows that the graph module has the advantage in low-resource settings, while the improved numerical reasoning significantly outperforms the baseline model.

pdf bib
Fine-grained Contrastive Learning for Definition Generation
Hengyuan Zhang | Dawei Li | Shiping Yang | Yanran Li

Recently, pre-trained transformer-based models have achieved great success in the task of definition generation (DG). However, previous encoder-decoder models lack effective representation learning to contain full semantic components of the given word, which leads to generating under-specific definitions. To address this problem, we propose a novel contrastive learning method, encouraging the model to capture more detailed semantic representations from the definition sequence encoding. According to both automatic and manual evaluation, the experimental results on three mainstream benchmarks demonstrate that the proposed method could generate more specific and high-quality definitions compared with several state-of-the-art models.

pdf bib
Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging
Sajad Mirzababaei | Amir Hossein Kargaran | Hinrich Schütze | Ehsaneddin Asgari

Many NLP main tasks benefit from an accurate understanding of temporal expressions, e.g., text summarization, question answering, and information retrieval. This paper introduces Hengam, an adversarially trained transformer for Persian temporal tagging outperforming state-of-the-art approaches on a diverse and manually created dataset. We create Hengam in the following concrete steps: (1) we develop HengamTagger, an extensible rule-based tool that can extract temporal expressions from a set of diverse language-specific patterns for any language of interest. (2) We apply HengamTagger to annotate temporal tags in a large and diverse Persian text collection (covering both formal and informal contexts) to be used as weakly labeled data. (3) We introduce an adversarially trained transformer model on HengamCorpus that can generalize over the HengamTagger’s rules. We create HengamGold, the first high-quality gold standard for Persian temporal tagging. Our trained adversarial HengamTransformer not only achieves the best performance in terms of the F1-score (a type F1-Score of 95.42 and a partial F1-Score of 91.60) but also successfully deals with language ambiguities and incorrect spellings. Our code, data, and models are publicly available at

pdf bib
What’s Different between Visual Question Answering for Machine “Understanding” Versus for Accessibility?
Yang Trista Cao | Kyle Seelman | Kyungjun Lee | Hal Daumé III

In visual question answering (VQA), a machine must answer a question given an associated image. Recently, accessibility researchers have explored whether VQA can be deployed in a real-world setting where users with visual impairments learn about their environment by capturing their visual surroundings and asking questions. However, most of the existing benchmarking datasets for VQA focus on machine “understanding” and it remains unclear how progress on those datasets corresponds to improvements in this real-world use case. We aim to answer this question by evaluating discrepancies between machine “understanding” datasets (VQA-v2) and accessibility datasets (VizWiz) by evaluating a variety of VQA models. Based on our findings, we discuss opportunities and challenges in VQA for accessibility and suggest directions for future work.

pdf bib
Persona or Context? Towards Building Context adaptive Personalized Persuasive Virtual Sales Assistant
Abhisek Tiwari | Sriparna Saha | Shubhashis Sengupta | Anutosh Maitra | Roshni Ramnani | Pushpak Bhattacharyya

Task-oriented conversational agents are gaining immense popularity and success in a wide range of tasks, from flight ticket booking to online shopping. However, the existing systems presume that end-users will always have a pre-determined and servable task goal, which results in dialogue failure in hostile scenarios, such as goal unavailability. On the other hand, human agents accomplish users’ tasks even in a large number of goal unavailability scenarios by persuading them towards a very similar and servable goal. Motivated by the limitation, we propose and build a novel end-to-end multi-modal persuasive dialogue system incorporated with a personalized persuasive module aided goal controller and goal persuader. The goal controller recognizes goal conflicting/unavailability scenarios and formulates a new goal, while the goal persuader persuades users using a personalized persuasive strategy identified through dialogue context. We also present a novel automatic evaluation metric called Persuasiveness Measurement Rate (PMeR) for quantifying the persuasive capability of a conversational agent. The obtained improvements (both quantitative and qualitative) firmly establish the superiority and need of the proposed context-guided, personalized persuasive virtual agent over existing traditional task-oriented virtual agents. Furthermore, we also curated a multi-modal persuasive conversational dialogue corpus annotated with intent, slot, sentiment, and dialogue act for e-commerce domain.

pdf bib
Legal Case Document Summarization: Extractive and Abstractive Methods and their Evaluation
Abhay Shukla | Paheli Bhattacharya | Soham Poddar | Rajdeep Mukherjee | Kripabandhu Ghosh | Pawan Goyal | Saptarshi Ghosh

Summarization of legal case judgement documents is a challenging problem in Legal NLP. However, not much analyses exist on how different families of summarization models (e.g., extractive vs. abstractive) perform when applied to legal case documents. This question is particularly important since many recent transformer-based abstractive summarization models have restrictions on the number of input tokens, and legal documents are known to be very long. Also, it is an open question on how best to evaluate legal case document summarization systems. In this paper, we carry out extensive experiments with several extractive and abstractive summarization methods (both supervised and unsupervised) over three legal summarization datasets that we have developed. Our analyses, that includes evaluation by law practitioners, lead to several interesting insights on legal summarization in specific and long document summarization in general.

pdf bib
FPC: Fine-tuning with Prompt Curriculum for Relation Extraction
Sicheng Yang | Dandan Song

The current classification methods for relation extraction (RE) generally utilize pre-trained language models (PLMs) and have achieved superior results. However, such methods directly treat relation labels as class numbers, therefore they ignore the semantics of relation labels. Recently, prompt-based fine-tuning has been proposed and attracted much attention. This kind of methods insert templates into the input and convert the classification task to a (masked) language modeling problem. With this inspiration, we propose a novel method Fine-tuning with Prompt Curriculum (FPC) for RE, with two distinctive characteristics: the relation prompt learning, introducing an auxiliary prompt-based fine-tuning task to make the model capture the semantics of relation labels; the prompt learning curriculum, a fine-tuning procedure including an increasingly difficult task to adapt the model to the difficult multi-task setting. We have conducted extensive experiments on four widely used RE benchmarks under fully supervised and low-resource settings. The experimental results show that FPC can significantly outperform the existing methods and obtain the new state-of-the-art results.

pdf bib
Dead or Murdered? Predicting Responsibility Perception in Femicide News Reports
Gosse Minnema | Sara Gemelli | Chiara Zanchi | Tommaso Caselli | Malvina Nissim

Different linguistic expressions can conceptualize the same event from different viewpoints by emphasizing certain participants over others. Here, we investigate a case where this has social consequences: how do linguistic expressions of gender-based violence (GBV) influence who we perceive as responsible? We build on previous psycholinguistic research in this area and conduct a large-scale perception survey of GBV descriptions automatically extracted from a corpus of Italian newspapers. We then train regression models that predict the salience of GBV participants with respect to different dimensions of perceived responsibility. Our best model (fine-tuned BERT) shows solid overall performance, with large differences between dimensions and participants: salient _focus_ is more predictable than salient _blame_, and perpetrators’ salience is more predictable than victims’ salience. Experiments with ridge regression models using different representations show that features based on linguistic theory similarly to word-based features. Overall, we show that different linguistic choices do trigger different perceptions of responsibility, and that such perceptions can be modelled automatically. This work can be a core instrument to raise awareness of the consequences of different perspectivizations in the general public and in news producers alike.

pdf bib
PESE: Event Structure Extraction using Pointer Network based Encoder-Decoder Architecture
Alapan Kuila | Sudeshna Sarkar

The task of event extraction (EE) aims to find the events and event-related argument information from the text and represent them in a structured format. Most previous works try to solve the problem by separately identifying multiple substructures and aggregating them to get the complete event structure. The problem with the methods is that it fails to identify all the interdependencies among the event participants (event-triggers, arguments, and roles). In this paper, we represent each event record in a unique tuple format that contains trigger phrase, trigger type, argument phrase, and corresponding role information. Our proposed pointer network-based encoder-decoder model generates an event tuple in each time step by exploiting the interactions among event participants and presenting a truly end-to-end solution to the EE task. We evaluate our model on the ACE2005 dataset, and experimental results demonstrate the effectiveness of our model by achieving competitive performance compared to the state-of-the-art methods.

pdf bib
How do we get there? Evaluating transformer neural networks as cognitive models for English past tense inflection
Xiaomeng Ma | Lingyu Gao

There is an ongoing debate of whether neural network can grasp the quasi-regularities in languages like humans. In a typical quasi-regularity task, English past tense inflections, the neural network model has long been criticized that it learns only to generalize the most frequent pattern, but not the regular pattern, thus can not learn the abstract categories of regular and irregular and is dissimilar to human performance. In this work, we train a set of transformer models with different settings to examine their behavior on this task. The models achieved high accuracy on unseen regular verbs and some accuracy on unseen irregular verbs. The models’ performance on the regulars are heavily affected by type frequency and ratio but not token frequency and ratio, and vice versa for the irregulars. The different behaviors on the regulars and irregulars suggest that the models have some degree of symbolic learning on the regularity of the verbs. In addition, the models are weakly correlated with human behavior on nonce verbs. Although the transformer model exhibits some level of learning on the abstract category of verb regularity, its performance does not fit human data well suggesting that it might not be a good cognitive model.

pdf bib
Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling
Ilia Kulikov | Maksim Eremeev | Kyunghyun Cho

Neural autoregressive sequence models smear the probability among many possible sequences including degenerate ones, such as empty or repetitive sequences. In this work, we tackle one specific case where the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both model distribution and decoding performance. We use a neural machine translation task as the testbed and consider three different datasets of varying size. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, by enhancing the oversmoothing loss contribution, the probability and the rank of eos token decrease heavily at positions where it is not supposed to be. Third, the proposed regularization impacts the outcome of beam search especially when a large beam is used. The degradation of translation quality (measured in BLEU) with a large beam significantly lessens with lower oversmoothing rate, but the degradation compared to smaller beam sizes remains to exist. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in a neural autoregressive model.

pdf bib
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET
Chantal Amrhein | Rico Sennrich

Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data and release our code and data for facilitating further experiments.

pdf bib
Whodunit? Learning to Contrast for Authorship Attribution
Bo Ai | Yuchen Wang | Yugin Tan | Samson Tan

Authorship attribution is the task of identifying the author of a given text. The key is finding representations that can differentiate between authors. Existing approaches typically use manually designed features that capture a dataset’s content and style, but these approaches are dataset-dependent and yield inconsistent performance across corpora. In this work, we propose to learn author-specific representations by fine-tuning pre-trained generic language representations with a contrastive objective (Contra-X). We show that Contra-X learns representations that form highly separable clusters for different authors. It advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8% over cross-entropy fine-tuning. However, we find that Contra-X improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to integrate contrastive learning with pre-trained language model fine-tuning for authorship attribution.

pdf bib
Higher-Order Dependency Parsing for Arc-Polynomial Score Functions via Gradient-Based Methods and Genetic Algorithm
Xudong Zhang | Joseph Le Roux | Thierry Charnois

We present a novel method for higher-order dependency parsing which takes advantage of the general form of score functions written as arc-polynomials, a general framework which encompasses common higher-order score functions, and includes new ones. This method is based on non-linear optimization techniques, namely coordinate ascent and genetic search where we iteratively update a candidate parse. Updates are formulated as gradient-based operations, and are efficiently computed by auto-differentiation libraries. Experiments show that this method obtains results matching the recent state-of-the-art second order parsers on three standard datasets.

pdf bib
Underspecification in Scene Description-to-Depiction Tasks
Ben Hutchinson | Jason Baldridge | Vinodkumar Prabhakaran

Questions regarding implicitness, ambiguity and underspecification are crucial for understanding the task validity and ethical concerns of multimodal image+text systems, yet have received little attention to date. This position paper maps out a conceptual framework to address this gap, focusing on systems which generate images depicting scenes from scene descriptions. In doing so, we account for how texts and images convey meaning differently. We outline a set of core challenges concerning textual and visual ambiguity, as well as risks that may be amplified by ambiguous and underspecified elements. We propose and discuss strategies for addressing these challenges, including generating visually ambiguous images, and generating a set of diverse images.

pdf bib
COFAR: Commonsense and Factual Reasoning in Image Search
Prajwal Gatti | Abhirama Subramanyam Penamakuri | Revant Teotia | Anand Mishra | Shubhashis Sengupta | Roshni Ramnani

One characteristic that makes humans superior to modern artificially intelligent models is the ability to interpret images beyond what is visually apparent. Consider the following two natural language search queries – (i) “a queue of customers patiently waiting to buy ice cream” and (ii) “a queue of tourists going to see a famous Mughal architecture in India”. Interpreting these queries requires one to reason with (i) Commonsense such as interpreting people as customers or tourists, actions as waiting to buy or going to see; and (ii) Fact or world knowledge associated with named visual entities, for example, whether the store in the image sells ice cream or whether the landmark in the image is a Mughal architecture located in India. Such reasoning goes beyond just visual recognition. To enable both commonsense and factual reasoning in the image search, we present a unified framework namely Knowledge Retrieval-Augmented Multimodal Transformer (KRAMT) that treats the named visual entities in an image as a gateway to encyclopedic knowledge and leverages them along with natural language query to ground relevant knowledge. Further, KRAMT seamlessly integrates visual content and grounded knowledge to learn alignment between images and search queries. This unified framework is then used to perform image search requiring commonsense and factual reasoning. The retrieval performance of KRAMT is evaluated and compared with related approaches on a new dataset we introduce – namely COFAR.


pdf (full)
bib (full)
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Yulan He | Heng Ji | Sujian Li | Yang Liu | Chua-Hui Chang

pdf bib
Transfer Learning for Humor Detection by Twin Masked Yellow Muppets
Aseem Arora | Gaël Dias | Adam Jatowt | Asif Ekbal

Humorous texts can be of different forms such as punchlines, puns, or funny stories. Existing humor classification systems have been dealing with such diverse forms by treating them independently. In this paper, we argue that different forms of humor share a common background either in terms of vocabulary or constructs. As a consequence, it is likely that classification performance can be improved by jointly tackling different humor types. Hence, we design a shared-private multitask architecture following a transfer learning paradigm and perform experiments over four gold standard datasets. Empirical results steadily confirm our hypothesis by demonstrating statistically-significant improvements over baselines and accounting for new state-of-the-art figures for two datasets.

pdf bib
A Unified Model for Reverse Dictionary and Definition Modelling
Pinzhen Chen | Zheng Zhao

We build a dual-way neural dictionary to retrieve words given definitions, and produce definitions for queried words. The model learns the two tasks simultaneously and handles unknown words via embeddings. It casts a word or a definition to the same representation space through a shared layer, then generates the other form in a multi-task fashion. Our method achieves promising automatic scores on previous benchmarks without extra resources. Human annotators prefer the model’s outputs in both reference-less and reference-based evaluation, indicating its practicality. Analysis suggests that multiple objectives benefit learning.

pdf bib
Benchmarking the Covariate Shift Robustness of Open-world Intent Classification Approaches
Sopan Khosla | Rashmi Gangadharaiah

Task-oriented dialog systems deployed in real-world applications are often challenged by out-of-distribution queries. These systems should not only reliably detect utterances with unsupported intents (semantic shift), but also generalize to covariate shift (supported intents from unseen distributions). However, none of the existing benchmarks for open-world intent classification focus on the second aspect, thus only performing a partial evaluation of intent detection techniques. In this work, we propose two new datasets ( and ) that include utterances useful for evaluating the robustness of open-world models to covariate shift. Along with the i.i.d. test set, both datasets contain a new cov-test set that, along with out-of-scope utterances, contains in-scope utterances sampled from different distributions not seen during training. This setting better mimics the challenges faced in real-world applications. Evaluating several open-world classifiers on the new datasets reveals that models that perform well on the test set struggle to generalize to the cov-test. Our datasets fill an important gap in the field, offering a more realistic evaluation scenario for intent classification in task-oriented dialog systems.

pdf bib
Number Theory Meets Linguistics: Modelling Noun Pluralisation Across 1497 Languages Using 2-adic Metrics
Gregory Baker | Diego Molla

A simple machine learning model of pluralisation as a linear regression problem minimising a p-adic metric substantially outperforms even the most robust of Euclidean-space regressors on languages in the Indo-European, Austronesian, Trans New-Guinea, Sino-Tibetan, Nilo-Saharan, Oto-Meanguean and Atlantic-Congo language families. There is insufficient evidence to support modelling distinct noun declensions as a p-adic neighbourhood even in Indo-European languages.

pdf bib
CLIP4IDC: CLIP for Image Difference Captioning
Zixin Guo | Tzu-Jui Wang | Jorma Laaksonen

Image Difference Captioning (IDC) aims at generating sentences to describe differences between two similar-looking images. Conventional approaches learn an IDC model with a pre-trained and usually frozen visual feature extractor. Accordingly, two major issues may arise: (1) a large domain gap usually exists between the pre-training datasets used for training such a visual encoder and that of the downstream IDC task, and (2) the visual feature extractor, when separately encoding two images, often does not effectively encode the visual changes between two images. Due to the excellent zero-shot performance of the recently proposed CLIP, we thus propose CLIP4IDC to transfer a CLIP model for the IDC task to address those issues. Different from directly fine-tuning CLIP to generate sentences, we introduce an adaptation training process to adapt CLIP’s visual encoder to capture and align differences in image pairs based on the textual descriptions. Experiments on three IDC benchmark datasets, CLEVR-Change, Spot-the-Diff, and Image-Editing-Request, demonstrate the effectiveness of CLIP4IDC.

pdf bib
Towards Modeling Role-Aware Centrality for Dialogue Summarization
Xinnian Liang | Chao Bian | Shuangzhi Wu | Zhoujun Li

Role-oriented dialogue summarization generates summaries for different roles in dialogue (e.g. doctor and patient). Existing methods consider roles separately where interactions among different roles are not fully explored. In this paper, we propose a novel Role-Aware Centrality (RAC) model to capture role interactions, which can be easily applied to any seq2seq models. The RAC assigns each role a specific sentence-level centrality score by involving role prompts to control what kind of summary to generate. The RAC measures both the importance of utterances and the relevance between roles and utterances. Then we use RAC to re-weight context representations, which are used by the decoder to generate role summaries. We verify RAC on two public benchmark datasets, CSDS and MC. Experimental results show that the proposed method achieves new state-of-the-art results on the two datasets. Extensive analyses have demonstrated that the role-aware centrality helps generate summaries more precisely.

pdf bib
Robust Hate Speech Detection via Mitigating Spurious Correlations
Kshitiz Tiwari | Shuhan Yuan | Lu Zhang

We develop a novel robust hate speech detection model that can defend against both word- and character-level adversarial attacks. We identify the essential factor that vanilla detection models are vulnerable to adversarial attacks is the spurious correlation between certain target words in the text and the prediction label. To mitigate such spurious correlation, we describe the process of hate speech detection by a causal graph. Then, we employ the causal strength to quantify the spurious correlation and formulate a regularized entropy loss function. We show that our method generalizes the backdoor adjustment technique in causal inference. Finally, the empirical evaluation shows the efficacy of our method.

pdf bib
FAD-X: Fusing Adapters for Cross-lingual Transfer to Low-Resource Languages
Jaeseong Lee | Seung-won Hwang | Taesup Kim

Adapter-based tuning, by adding light-weight adapters to multilingual pretrained language models (mPLMs), selectively updates language-specific parameters to adapt to a new language, instead of finetuning all shared weights. This paper explores an effective way to leverage a public pool of pretrained language adapters, to overcome resource imbalances for low-resource languages (LRLs). Specifically, our research questions are, whether pretrained adapters can be composed, to complement or replace LRL adapters. While composing adapters for multi-task learning setting has been studied, the same question for LRLs has remained largely unanswered. To answer this question, we study how to fuse adapters across languages and tasks, then validate how our proposed fusion adapter, namely FAD-X, can enhance a cross-lingual transfer from pretrained adapters, for well-known named entity recognition and classification benchmarks.

pdf bib
Combining Argumentation Structure and Language Model for Generating Natural Argumentative Dialogue
Koh Mitsuda | Ryuichiro Higashinaka | Kuniko Saito

Argumentative dialogue is an important process where speakers discuss a specific theme for consensus building or decision making. In previous studies for generating consistent argumentative dialogue, retrieval-based methods with hand-crafted argumentation structures have been used. In this study, we propose a method to generate natural argumentative dialogues by combining an argumentation structure and language model. We trained the language model to rewrite a proposition of an argumentation structure on the basis of its information, such as keywords and stance, into the next utterance while considering its context, and we used the model to rewrite propositions in the argumentation structure. We manually evaluated the generated dialogues and found that the proposed method significantly improved the naturalness of dialogues without losing consistency of argumentation.

pdf bib
Every word counts: A multilingual analysis of individual human alignment with model attention
Stephanie Brandl | Nora Hollenstein

Human fixation patterns have been shown to correlate strongly with Transformer-based attention. Those correlation analyses are usually carried out without taking into account individual differences between participants and are mostly done on monolingual datasets making it difficult to generalise findings. In this paper, we analyse eye-tracking data from speakers of 13 different languages reading both in their native language (L1) and in English as language learners (L2). We find considerable differences between languages but also that individual reading behaviour such as skipping rate, total reading time and vocabulary knowledge (LexTALE) influence the alignment between humans and models to an extent that should be considered in future studies.

pdf bib
Analyzing Biases to Spurious Correlations in Text Classification Tasks
Adian Liusie | Vatsal Raina | Vyas Raina | Mark Gales

Machine learning systems have shown impressive performance across a range of natural language tasks. However, it has been hypothesized that these systems are prone to learning spurious correlations that may be present in the training data. Though these correlations will not impact in-domain performance, they are unlikely to generalize well to out-of-domain data, limiting the applicability of systems. This work examines this phenomenon on text classification tasks. Rather than artificially injecting features into the data, we demonstrate that real spurious correlations can be exploited by current state-of-the-art deep-learning systems. Specifically, we show that even when only ‘stop’ words are available at the input stage, it is possible to predict the class significantly better than random. Though it is shown that these stop words are not required for good in-domain performance, they can degrade the ability of the system to generalize well to out-of-domain data.

pdf bib
BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation
Haiyue Song | Raj Dabre | Zhuoyuan Mao | Chenhui Chu | Sadao Kurohashi

Existing subword segmenters are either 1) frequency-based without semantics information or 2) neural-based but trained on parallel corpora. To address this, we present BERTSeg, an unsupervised neural subword segmenter for neural machine translation, which utilizes the contextualized semantic embeddings of words from characterBERT and maximizes the generation probability of subword segmentations. Furthermore, we propose a generation probability-based regularization method that enables BERTSeg to produce multiple segmentations for one word to improve the robustness of neural machine translation. Experimental results show that BERTSeg with regularization achieves up to 8 BLEU points improvement in 9 translation directions on ALT, IWSLT15 Vi->En, WMT16 Ro->En, and WMT15 Fi->En datasets compared with BPE. In addition, BERTSeg is efficient, needing up to 5 minutes for training.

pdf bib
NERDz: A Preliminary Dataset of Named Entities for Algerian
Samia Touileb

This paper introduces a first step towards creating the NERDz dataset. A manually annotated dataset of named entities for the Algerian vernacular dialect. The annotations are built on top of a recent extension to the Algerian NArabizi Treebank, comprizing NArabizi sentences with manual transliterations into Arabic and code-switched scripts. NERDz is therefore not only the first dataset of named entities for Algerian, but it also comprises parallel entities written in Latin, Arabic, and code-switched scripts. We present a detailed overview of our annotations, inter-annotator agreement measures, and define two preliminary baselines using a neural sequence labeling approach and an Algerian BERT model. We also make the annotation guidelines and the annotations available for future work

pdf bib
An Effective Post-training Embedding Binarization Approach for Fast Online Top-K Passage Matching
Yankai Chen | Yifei Zhang | Huifeng Guo | Ruiming Tang | Irwin King

With the rapid development of Natural Language Understanding for information retrieval, fine-tuned deep language models, e.g., BERT-based, perform remarkably effective in passage searching tasks. To lower the architecture complexity, the recent state-of-the-art model ColBERT employs Contextualized Late Interaction paradigm to independently learn fine-grained query-passage representations. Apart from the architecture simplification, embedding binarization, as another promising branch in model compression, further specializes in the reduction of memory and computation overheads. In this concise paper, we propose an effective post-training embedding binarization approach over ColBERT, achieving both architecture-level and embedding-level optimization for online inference. The empirical results demonstrate the efficaciousness of our proposed approach, empowering it to perform online query-passage matching acceleration.

pdf bib
Addressing Segmentation Ambiguity in Neural Linguistic Steganography
Jumon Nozaki | Yugo Murawaki

Previous studies on neural linguistic steganography, except Ueoka et al. (2021), overlook the fact that the sender must detokenize cover texts to avoid arousing the eavesdropper’s suspicion. In this paper, we demonstrate that segmentation ambiguity indeed causes occasional decoding failures at the receiver’s side. With the near-ubiquity of subwords, this problem now affects any language. We propose simple tricks to overcome this problem, which are even applicable to languages without explicit word boundaries.

pdf bib
Parsing linearizations appreciate PoS tags - but some are fussy about errors
Alberto Muñoz-Ortiz | Mark Anderson | David Vilares | Carlos Gómez-Rodríguez

PoS tags, once taken for granted as a useful resource for syntactic parsing, have become more situational with the popularization of deep learning. Recent work on the impact of PoS tags on graph- and transition-based parsers suggests that they are only useful when tagging accuracy is prohibitively high, or in low-resource scenarios. However, such an analysis is lacking for the emerging sequence labeling parsing paradigm, where it is especially relevant as some models explicitly use PoS tags for encoding and decoding. We undertake a study and uncover some trends. Among them, PoS tags are generally more useful for sequence labeling parsers than for other paradigms, but the impact of their accuracy is highly encoding-dependent, with the PoS-based head-selection encoding being best only when both tagging accuracy and resource availability are high.

pdf bib
EmoNoBa: A Dataset for Analyzing Fine-Grained Emotions on Noisy Bangla Texts
Khondoker Ittehadul Islam | Tanvir Yuvraz | Md Saiful Islam | Enamul Hassan

For low-resourced Bangla language, works on detecting emotions on textual data suffer from size and cross-domain adaptability. In our paper, we propose a manually annotated dataset of 22,698 Bangla public comments from social media sites covering 12 different domains such as Personal, Politics, and Health, labeled for 6 fine-grained emotion categories of the Junto Emotion Wheel. We invest efforts in the data preparation to 1) preserve the linguistic richness and 2) challenge any classification model. Our experiments to develop a benchmark classification system show that random baselines perform better than neural networks and pre-trained language models as hand-crafted features provide superior performance.

pdf bib
Exploring Universal Sentence Encoders for Zero-shot Text Classification
Souvika Sarkar | Dongji Feng | Shubhra Kanti Karmaker Santu

Universal Sentence Encoder (USE) has gained much popularity recently as a general-purpose sentence encoding technique. As the name suggests, USE is designed to be fairly general and has indeed been shown to achieve superior performances for many downstream NLP tasks. In this paper, we present an interesting “negative” result on USE in the context of zero-shot text classification, a challenging task, which has recently gained much attraction. More specifically, we found some interesting cases of zero-shot text classification, where topic based inference outperformed USE-based inference in terms of F1 score. Further investigation revealed that USE struggles to perform well on data-sets with a large number of labels with high semantic overlaps, while topic-based classification works well for the same.

pdf bib
The Effects of Language Token Prefixing for Multilingual Machine Translation
Rachel Wicks | Kevin Duh

Machine translation traditionally refers to translating from a single source language into a single target language. In recent years, the field has moved towards large neural models either translating from or into many languages. The model must be correctly cued to translate into the correct target language. This is typically done by prefixing language tokens onto the source or target sequence. The location and content of the prefix can vary and many use different approaches without much justification towards one approach or another. As a guidance to future researchers and directions for future work, we present a series of experiments that show how the positioning and type of a target language prefix token effects translation performance. We show that source side prefixes improve performance. Further, we find that the best language information to denote via tokens depends on the supported language set.

pdf bib
How Relevant is Selective Memory Population in Lifelong Language Learning?
Vladimir Araujo | Helena Balabin | Julio Hurtado | Alvaro Soto | Marie-Francine Moens

Lifelong language learning seeks to have models continuously learn multiple tasks in a sequential order without suffering from catastrophic forgetting. State-of-the-art approaches rely on sparse experience replay as the primary approach to prevent forgetting. Experience replay usually adopts sampling methods for the memory population; however, the effect of the chosen sampling strategy on model performance has not yet been studied. In this paper, we investigate how relevant the selective memory population is in the lifelong learning process of text classification and question-answering tasks. We found that methods that randomly store a uniform number of samples from the entire data stream lead to high performances, especially for low memory size, which is consistent with computer vision studies.

pdf bib
An Improved Baseline for Sentence-level Relation Extraction
Wenxuan Zhou | Muhao Chen

Sentence-level relation extraction (RE) aims at identifying the relationship between two entities in a sentence. Many efforts have been devoted to this problem, while the best performing methods are still far from perfect. In this paper, we revisit two problems that affect the performance of existing RE models, namely entity representation and noisy or ill-defined labels. Our improved RE baseline, incorporated with entity representations with typed markers, achieves an F1 of 74.6% on TACRED, significantly outperforms previous SOTA methods. Furthermore, the presented new baseline achieves an F1 of 91.1% on the refined Re-TACRED dataset, demonstrating that the pretrained language models (PLMs) achieve high performance on this task. We release our code to the community for future research.

pdf bib
Multi-Type Conversational Question-Answer Generation with Closed-ended and Unanswerable Questions
Seonjeong Hwang | Yunsu Kim | Gary Geunbae Lee

Conversational question answering (CQA) facilitates an incremental and interactive understanding of a given context, but building a CQA system is difficult for many domains due to the problem of data scarcity. In this paper, we introduce a novel method to synthesize data for CQA with various question types, including open-ended, closed-ended, and unanswerable questions. We design a different generation flow for each question type and effectively combine them in a single, shared framework. Moreover, we devise a hierarchical answerability classification (hierarchical AC) module that improves quality of the synthetic data while acquiring unanswerable questions. Manual inspections show that synthetic data generated with our framework have characteristics very similar to those of human-generated conversations. Across four domains, CQA systems trained on our synthetic data indeed show good performance close to the systems trained on human-annotated data.

pdf bib
Improving Chinese Story Generation via Awareness of Syntactic Dependencies and Semantics
Henglin Huang | Chen Tang | Tyler Loakman | Frank Guerin | Chenghua Lin

Story generation aims to generate a long narrative conditioned on a given input. In spite of the success of prior works with the application of pre-trained models, current neural models for Chinese stories still struggle to generate high-quality long text narratives. We hypothesise that this stems from ambiguity in syntactically parsing the Chinese language, which does not have explicit delimiters for word segmentation. Consequently, neural models suffer from the inefficient capturing of features in Chinese narratives. In this paper, we present a new generation framework that enhances the feature capturing mechanism by informing the generation model of dependencies between words and additionally augmenting the semantic representation learning through synonym denoising training. We conduct a range of experiments, and the results demonstrate that our framework outperforms the state-of-the-art Chinese generation models on all evaluation metrics, demonstrating the benefits of enhanced dependency and semantic representation learning.

pdf bib
NGEP: A Graph-based Event Planning Framework for Story Generation
Chen Tang | Zhihao Zhang | Tyler Loakman | Chenghua Lin | Frank Guerin

To improve the performance of long text generation, recent studies have leveraged automatically planned event structures (i.e. storylines) to guide story generation. Such prior works mostly employ end-to-end neural generation models to predict event sequences for a story. However, such generation models struggle to guarantee the narrative coherence of separate events due to the hallucination problem, and additionally the generated event sequences are often hard to control due to the end-to-end nature of the models. To address these challenges, we propose NGEP, an novel event planning framework which generates an event sequence by performing inference on an automatically constructed event graph and enhances generalisation ability through a neural event advisor. We conduct a range of experiments on multiple criteria, and the results demonstrate that our graph-based neural framework outperforms the state-of-the-art (SOTA) event planning approaches, considering both the performance of event sequence generation and the effectiveness on the downstream task of story generation.

pdf bib
A Simple Yet Effective Hybrid Pre-trained Language Model for Unsupervised Sentence Acceptability Prediction
Yang Zhao | Issei Yoshida

Sentence acceptability judgment assesses to what degree a sentence is acceptable to native speakers of the language. Most unsupervised prediction approaches rely on a language model to obtain the likelihood of a sentence that reflects acceptability. However, two problems exist: first, low-frequency words would have a significant negative impact on the sentence likelihood derived from the language model; second, when it comes to multiple domains, the language model needs to be trained on domain-specific text for domain adaptation. To address both problems, we propose a simple method that substitutes Part-of-Speech (POS) tags for low-frequency words in sentences used for continual training of masked language models. Experimental results show that our word-tag-hybrid BERT model brings improvement on both a sentence acceptability benchmark and a cross-domain sentence acceptability evaluation corpus. Furthermore, our annotated cross-domain sentence acceptability evaluation corpus would benefit future research.

pdf bib
Post-Training with Interrogative Sentences for Enhancing BART-based Korean Question Generator
Gyu-Min Park | Seong-Eun Hong | Seong-Bae Park

The pre-trained language models such as KoBART often fail in generating perfect interrogative sentences when they are applied to Korean question generation. This is mainly due to the fact that the language models are much experienced with declarative sentences, but not with interrogative sentences. Therefore, this paper proposes a novel post-training of KoBART to enhance it for Korean question generation. The enhancement of KoBART is accomplished in three ways: (i) introduction of question infilling objective to KoBART to enforce it to focus more on the structure of interrogative sentences, (ii) augmentation of training data for question generation with another data set to cope with the lack of training instances for post-training, (iii) introduction of Korean spacing objective to make KoBART understand the linguistic features of Korean. Since there is no standard data set for Korean question generation, this paper also proposes KorQuAD-QG, a new data set for this task, to verify the performance of the proposed post-training. Our code are publicly available at

pdf bib
Do ever larger octopi still amplify reporting biases? Evidence from judgments of typical colour
Fangyu Liu | Julian Eisenschlos | Jeremy Cole | Nigel Collier

Language models (LMs) trained on raw texts have no direct access to the physical world. Gordon and Van Durme (2013) point out that LMs can thus suffer from reporting bias: texts rarely report on common facts, instead focusing on the unusual aspects of a situation. If LMs are only trained on text corpora and naively memorise local co-occurrence statistics, they thus naturally would learn a biased view of the physical world. While prior studies have repeatedly verified that LMs of smaller scales (e.g., RoBERTa, GPT-2) amplify reporting bias, it remains unknown whether such trends continue when models are scaled up. We investigate reporting bias from the perspective of colour in larger language models (LLMs) such as PaLM and GPT-3. Specifically, we query LLMs for the typical colour of objects, which is one simple type of perceptually grounded physical common sense. Surprisingly, we find that LLMs significantly outperform smaller LMs in determining an object’s typical colour and more closely track human judgments, instead of overfitting to surface patterns stored in texts. This suggests that very large models of language alone are able to overcome certain types of reporting bias that are characterized by local co-occurrences.

pdf bib
Adversarially Improving NMT Robustness to ASR Errors with Confusion Sets
Shuaibo Wang | Yufeng Chen | Songming Zhang | Deyi Xiong | Jinan Xu

Neural machine translation (NMT) models are known to be fragile to noisy inputs from automatic speech recognition (ASR) systems. Existing methods are usually tailored for robustness against only homophone errors which account for a small portion of realistic ASR errors. In this paper, we propose an adversarial example generation method based on confusion sets that contain words easily confusable with a target word by ASR to conduct adversarial training for NMT models. Specifically, an adversarial example is generated from the perspective of acoustic relations instead of the traditional uniform or unigram sampling from the confusion sets. Experiments on different test sets with hand-crafted and real-world noise demonstrate the effectiveness of our method over previous methods. Moreover, our approach can achieve improvements on the clean test set.

pdf bib
Improving Graph-Based Text Representations with Character and Word Level N-grams
Wenzhe Li | Nikolaos Aletras

Graph-based text representation focuses on how text documents are represented as graphs for exploiting dependency information between tokens and documents within a corpus. Despite the increasing interest in graph representation learning, there is limited research in exploring new ways for graph-based text representation, which is important in downstream natural language processing tasks. In this paper, we first propose a new heterogeneous word-character text graph that combines word and character n-gram nodes together with document nodes, allowing us to better learn dependencies among these entities. Additionally, we propose two new graph-based neural models, WCTextGCN and WCTextGAT, for modeling our proposed text graph. Extensive experiments in text classification and automatic text summarization benchmarks demonstrate that our proposed models consistently outperform competitive baselines and state-of-the-art graph-based models.

pdf bib
Risk-graded Safety for Handling Medical Queries in Conversational AI
Gavin Abercrombie | Verena Rieser

Conversational AI systems can engage in unsafe behaviour when handling users’ medical queries that may have severe consequences and could even lead to deaths. Systems therefore need to be capable of both recognising the seriousness of medical inputs and producing responses with appropriate levels of risk. We create a corpus of human written English language medical queries and the responses of different types of systems. We label these with both crowdsourced and expert annotations. While individual crowdworkers may be unreliable at grading the seriousness of the prompts, their aggregated labels tend to agree with professional opinion to a greater extent on identifying the medical queries and recognising the risk types posed by the responses. Results of classification experiments suggest that, while these tasks can be automated, caution should be exercised, as errors can potentially be very serious.

pdf bib
Performance-Efficiency Trade-Offs in Adapting Language Models to Text Classification Tasks
Laura Aina | Nikos Voskarides | Roi Blanco

Pre-trained language models (LMs) obtain state-of-the-art performance when adapted to text classification tasks. However, when using such models in real world applications, efficiency considerations are paramount. In this paper, we study how different training procedures that adapt LMs to text classification perform, as we vary model and train set size. More specifically, we compare standard fine-tuning, prompting, and knowledge distillation (KD) when the teacher was trained with either fine-tuning or prompting. Our findings suggest that even though fine-tuning and prompting work well to train large LMs on large train sets, there are more efficient alternatives that can reduce compute or data cost. Interestingly, we find that prompting combined with KD can reduce compute and data cost at the same time.

pdf bib
Seeking Diverse Reasoning Logic: Controlled Equation Expression Generation for Solving Math Word Problems
Yibin Shen | Qianying Liu | Zhuoyuan Mao | Zhen Wan | Fei Cheng | Sadao Kurohashi

To solve Math Word Problems, human students leverage diverse reasoning logic that reaches different possible equation solutions. However, the mainstream sequence-to-sequence approach of automatic solvers aims to decode a fixed solution equation supervised by human annotation. In this paper, we propose a controlled equation generation solver by leveraging a set of control codes to guide the model to consider certain reasoning logic and decode the corresponding equations expressions transformed from the human reference. The empirical results suggest that our method universally improves the performance on single-unknown (Math23K) and multiple-unknown (DRAW1K, HMWP) benchmarks, with substantial improvements up to 13.2% accuracy on the challenging multiple-unknown datasets.

pdf bib
BanglaParaphrase: A High-Quality Bangla Paraphrase Dataset
Ajwad Akil | Najrin Sultana | Abhik Bhattacharjee | Rifat Shahriyar

In this work, we present BanglaParaphrase, a high-quality synthetic Bangla Paraphrase dataset curated by a novel filtering pipeline. We aim to take a step towards alleviating the low resource status of the Bangla language in the NLP domain through the introduction of BanglaParaphrase, which ensures quality by preserving both semantics and diversity, making it particularly useful to enhance other Bangla datasets. We show a detailed comparative analysis between our dataset and models trained on it with other existing works to establish the viability of our synthetic paraphrase data generation pipeline. We are making the dataset and models publicly available at to further the state of Bangla NLP.

pdf bib
NepBERTa: Nepali Language Model Trained in a Large Corpus
Sulav Timilsina | Milan Gautam | Binod Bhattarai

Nepali is a low-resource language with more than 40 million speakers worldwide. It is written in Devnagari script and has rich semantics and complex grammatical structure. To this date, multilingual models such as Multilingual BERT, XLM and XLM-RoBERTa haven’t been able to achieve promising results in Nepali NLP tasks, and there does not exist any such a large-scale monolingual corpus. This study presents NepBERTa, a BERT-based Natural Language Understanding (NLU) model trained on the most extensive monolingual Nepali corpus ever. We collected a dataset of 0.8B words from 36 different popular news sites in Nepal and introduced the model. This data set is 3 folds times larger than the previous publicly available corpus. We evaluated the performance of NepBERTa in multiple Nepali-specific NLP tasks, including Named-Entity Recognition, Content Classification, POS Tagging, and Sequence Pair Similarity. We also introduce two different datasets for two new downstream tasks and benchmark four diverse NLU tasks altogether. We bring all these four tasks under the first-ever Nepali Language Understanding Evaluation (Nep-gLUE) benchmark. We will make Nep-gLUE along with the pre-trained model and data sets publicly available for research.

pdf bib
Local Structure Matters Most in Most Languages
Louis Clouatre | Prasanna Parthasarathi | Amal Zouaq | Sarath Chandar

Many recent perturbation studies have found unintuitive results on what does and does not matter when performing Natural Language Understanding (NLU) tasks in English. Coding properties, such as the order of words, can often be removed through shuffling without impacting downstream performances. Such insight may be used to direct future research into English NLP models. As many improvements in multilingual settings consist of wholesale adaptation of English approaches, it is important to verify whether those studies replicate or not in multilingual settings. In this work, we replicate a study on the importance of local structure, and the relative unimportance of global structure, in a multilingual setting. We find that the phenomenon observed on the English language broadly translates to over 120 languages, with a few caveats.

pdf bib
Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
Meera Hahn | James M. Rehg

We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog from two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer’s location, the goal is to predict the Observer’s final location in a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.

pdf bib
CSS: Combining Self-training and Self-supervised Learning for Few-shot Dialogue State Tracking
Haoning Zhang | Junwei Bao | Haipeng Sun | Huaishao Luo | Wenye Li | Shuguang Cui

Few-shot dialogue state tracking (DST) is a realistic problem that trains the DST model with limited labeled data. Existing few-shot methods mainly transfer knowledge learned from external labeled dialogue data (e.g., from question answering, dialogue summarization, machine reading comprehension tasks, etc.) into DST, whereas collecting a large amount of external labeled data is laborious, and the external data may not effectively contribute to the DST-specific task. In this paper, we propose a few-shot DST framework called CSS, which Combines Self-training and Self-supervised learning methods. The unlabeled data of the DST task is incorporated into the self-training iterations, where the pseudo labels are predicted by a DST model trained on limited labeled data in advance. Besides, a contrastive self-supervised method is used to learn better representations, where the data is augmented by the dropout operation to train the model. Experimental results on the MultiWOZ dataset show that our proposed CSS achieves competitive performance in several few-shot scenarios.

pdf bib
Demographic-Aware Language Model Fine-tuning as a Bias Mitigation Technique
Aparna Garimella | Rada Mihalcea | Akhash Amarnath

BERT-like language models (LMs), when exposed to large unstructured datasets, are known to learn and sometimes even amplify the biases present in such data. These biases generally reflect social stereotypes with respect to gender, race, age, and others. In this paper, we analyze the variations in gender and racial biases in BERT, a large pre-trained LM, when exposed to different demographic groups. Specifically, we investigate the effect of fine-tuning BERT on text authored by historically disadvantaged demographic groups in comparison to that by advantaged groups. We show that simply by fine-tuning BERT-like LMs on text authored by certain demographic groups can result in the mitigation of social biases in these LMs against various target groups.

pdf bib
Towards Simple and Efficient Task-Adaptive Pre-training for Text Classification
Arnav Ladkat | Aamir Miyajiwala | Samiksha Jagadale | Rekha A. Kulkarni | Raviraj Joshi

Language models are pre-trained using large corpora of generic data like book corpus, com- mon crawl and Wikipedia, which is essential for the model to understand the linguistic characteristics of the language. New studies suggest using Domain Adaptive Pre-training (DAPT) and Task-Adaptive Pre-training (TAPT) as an intermediate step before the final finetuning task. This step helps cover the target domain vocabulary and improves the model performance on the downstream task. In this work, we study the impact of training only the embedding layer on the model’s performance during TAPT and task-specific finetuning. Based on our study, we propose a simple approach to make the in- termediate step of TAPT for BERT-based mod- els more efficient by performing selective pre-training of BERT layers. We show that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance. Our approach is computationally efficient, with 78% fewer parameters trained during TAPT. The proposed embedding layer finetuning approach can also be an efficient domain adaptation technique.

pdf bib
Extractive Entity-Centric Summarization as Sentence Selection using Bi-Encoders
Ella Hofmann-Coyle | Mayank Kulkarni | Lingjue Xie | Mounica Maddela | Daniel Preotiuc-Pietro

Entity-centric summarization is a type of controllable summarization that aims to produce a summary of a document that is specific to a given target entity. Extractive summaries possess multiple advantages over abstractive ones such as preserving factuality and can be directly used in downstream tasks like target-based sentiment analysis or incorporated into search applications. In this paper, we explore methods to solve this task by recasting it as a sentence selection task, as supported by the EntSUM data set. We use methods inspired by information retrieval, where the input to the model is a pair representing a sentence from the original document and the target entity, in place of the query. We explore different architecture variants and loss functions in this framework with results showing an up to 5.8 F1 improvement over past state-of-the-art and outperforming the competitive entity-centric Lead 3 heuristic by 1.1 F1. In addition, we also demonstrate similarly strong results on the related task of salient sentence selection for an entity.

pdf bib
Towards Unsupervised Morphological Analysis of Polysynthetic Languages
Sujay Khandagale | Yoann Léveillé | Samuel Miller | Derek Pham | Ramy Eskander | Cass Lowry | Richard Compton | Judith Klavans | Maria Polinsky | Smaranda Muresan

Polysynthetic languages present a challenge for morphological analysis due to the complexity of their words and the lack of high-quality annotated datasets needed to build and/or evaluate computational models. The contribution of this work is twofold. First, using linguists’ help, we generate and contribute high-quality annotated data for two low-resource polysynthetic languages for two tasks: morphological segmentation and part-of-speech (POS) tagging. Second, we present the results of state-of-the-art unsupervised approaches for these two tasks on Adyghe and Inuktitut. Our findings show that for these polysynthetic languages, using linguistic priors helps the task of morphological segmentation and that using stems rather than words as the core unit of abstraction leads to superior performance on POS tagging.

pdf bib
Self-Repetition in Abstractive Neural Summarizers
Nikita Salkar | Thomas Trikalinos | Byron Wallace | Ani Nenkova

We provide a quantitative and qualitative analysis of self-repetition in the output of neural summarizers. We measure self-repetition as the number of n-grams of length four or longer that appear in multiple outputs of the same system. We analyze the behavior of three popular architectures (BART, T5, and Pegasus), fine-tuned on five datasets. In a regression analysis, we find that the three architectures have different propensities for repeating content across output summaries for inputs, with BART being particularly prone to self-repetition. Fine-tuning on more abstractive data, and on data featuring formulaic language is associated with a higher rate of self-repetition. In qualitative analysis, we find systems produce artefacts such as ads and disclaimers unrelated to the content being summarized, as well as formulaic phrases common in the fine-tuning domain. Our approach to corpus-level analysis of self-repetition may help practitioners clean up training data for summarizers and ultimately support methods for minimizing the amount of self-repetition.

pdf bib
Domain Specific Sub-network for Multi-Domain Neural Machine Translation
Amr Hendy | Mohamed Abdelghaffar | Mohamed Afify | Ahmed Y. Tawfik

This paper presents Domain-Specific Sub-network (DoSS). It uses a set of masks obtained through pruning to define a sub-network for each domain and finetunes the sub-network parameters on domain data. This performs very closely and drastically reduces the number of parameters compared to finetuning the whole network on each domain. Also a method to make masks unique per domain is proposed and shown to greatly improve the generalization to unseen domains. In our experiments on German to English machine translation the proposed method outperforms the strong baseline of continue training on multi-domain (medical, tech and religion) data by 1.47 BLEU points. Also continue training DoSS on new domain (legal) outperforms the multi-domain (medical, tech, religion, legal) baseline by 1.52 BLEU points.

pdf bib
Modeling Document-level Temporal Structures for Building Temporal Dependency Graphs
Prafulla Kumar Choubey | Ruihong Huang

We propose to leverage news discourse profiling to model document-level temporal structures for building temporal dependency graphs. Our key observation is that the functional roles of sentences used for profiling news discourse signify different time frames relevant to a news story and can, therefore, help to recover the global temporal structure of a document. Our analyses and experiments with the widely used knowledge distillation technique show that discourse profiling effectively identifies distant inter-sentence event and (or) time expression pairs that are temporally related and otherwise difficult to locate.

pdf bib
Evaluating Pre-Trained Sentence-BERT with Class Embeddings in Active Learning for Multi-Label Text Classification
Lukas Wertz | Jasmina Bogojeska | Katsiaryna Mirylenka | Jonas Kuhn

The Transformer Language Model is a powerful tool that has been shown to excel at various NLP tasks and has become the de-facto standard solution thanks to its versatility. In this study, we employ pre-trained document embeddings in an Active Learning task to group samples with the same labels in the embedding space on a legal document corpus. We find that the calculated class embeddings are not close to the respective samples and consequently do not partition the embedding space in a meaningful way. In addition, we explore using the class embeddings as an Active Learning strategy with dramatically reduced results compared to all baselines.

pdf bib
MiQA: A Benchmark for Inference on Metaphorical Questions
Iulia Comșa | Julian Eisenschlos | Srini Narayanan

We propose a benchmark to assess the capability of large language models to reason with conventional metaphors. Our benchmark combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by accurately selecting between the literal and metaphorical register. We examine the performance of state-of-the-art pre-trained models on binary-choice tasks and find a large discrepancy between the performance of small and very large models, going from chance to near-human level. We also analyse the largest model in a generative setting and find that although human performance is approached, careful multiple-shot prompting is required.

pdf bib
GCDT: A Chinese RST Treebank for Multigenre and Multilingual Discourse Parsing
Siyao Peng | Yang Janet Liu | Amir Zeldes

A lack of large-scale human-annotated data has hampered the hierarchical discourse parsing of Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text, using the same relation inventory as contemporary RST treebanks for English. We also report on this dataset’s parsing experiments, including state-of-the-art (SOTA) scores for Chinese RST parsing and RST parsing on the English GUM dataset, using cross-lingual training in Chinese and English with multilingual embeddings.

pdf bib
Assessing Combinational Generalization of Language Models in Biased Scenarios
Yanbo Fang | Zuohui Fu | Xin Dong | Yongfeng Zhang | Gerard de Melo

In light of the prominence of Pre-trained Language Models (PLMs) across numerous downstream tasks, shedding light on what they learn is an important endeavor. Whereas previous work focuses on assessing in-domain knowledge, we evaluate the generalization ability in biased scenarios through component combinations where it could be easy for the PLMs to learn shortcuts from the training corpus. This would lead to poor performance on the testing corpus, which is combinationally reconstructed from the training components. The results show that PLMs are able to overcome such distribution shifts for specific tasks and with sufficient data. We further find that overfitting can lead the models to depend more on biases for prediction, thus hurting the combinational generalization ability of PLMs.

pdf bib
Controllable Text Simplification with Deep Reinforcement Learning
Daiki Yanamoto | Tomoki Ikawa | Tomoyuki Kajiwara | Takashi Ninomiya | Satoru Uchida | Yuki Arase

We propose a method for controlling the difficulty of a sentence based on deep reinforcement learning. Although existing models are trained based on the word-level difficulty, the sentence-level difficulty has not been taken into account in the loss function. Our proposed method generates sentences of appropriate difficulty for the target audience through reinforcement learning using a reward calculated based on the difference between the difficulty of the output sentence and the target difficulty. Experimental results of English text simplification show that the proposed method achieves a higher performance than existing approaches. Compared to previous studies, the proposed method can generate sentences whose grade-levels are closer to those of human references estimated using a fine-tuned pre-trained model.

pdf bib
Vector Space Interpolation for Query Expansion
Deepanway Ghosal | Somak Aditya | Sandipan Dandapat | Monojit Choudhury

Topic-sensitive query set expansion is an important area of research that aims to improve search results for information retrieval. It is particularly crucial for queries related to sensitive and emerging topics. In this work, we describe a method for query set expansion about emerging topics using vector space interpolation. We use a transformer model called OPTIMUS, which is suitable for vector space manipulation due to its variational autoencoder nature. One of our proposed methods – Dirichlet interpolation shows promising results for query expansion. Our methods effectively generate new queries about the sensitive topic by incorporating set-level diversity, which is not captured by traditional sentence-level augmentation methods such as paraphrasing or back-translation.

pdf bib
SchAman: Spell-Checking Resources and Benchmark for Endangered Languages from Amazonia
Arturo Oncevay | Gerardo Cardoso | Carlo Alva | César Lara Ávila | Jovita Vásquez Balarezo | Saúl Escobar Rodríguez | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Juan López Bautista | Nimia Acho Rios | Remigio Zapata Cesareo | Héctor Erasmo Gómez Montoya | Roberto Zariquiey

Spell-checkers are core applications in language learning and normalisation, which may enormously contribute to language revitalisation and language teaching in the context of indigenous communities. Spell-checking as a generation task, however, requires large amount of data, which is not feasible for endangered languages, such as the languages spoken in Peruvian Amazonia. We propose here augmentation methods for various misspelling types as a strategy to train neural spell-checking models and we create an evaluation resource for four indigenous languages of Peru: Shipibo-Konibo, Asháninka, Yánesha, Yine. We focus on special errors that are significant for learning these languages, such as phoneme-to-grapheme ambiguity, grammatical errors (gender, tense, number, among others), accentuation, punctuation and normalisation in contexts where two or more writing traditions co-exist. We found that an ensemble model, trained with augmented data from various types of error achieves overall better scores in most of the error types and languages. Finally, we released our spell-checkers as a web service to be used by indigenous communities and organisations to develop future language materials.

pdf bib
CoFE: A New Dataset of Intra-Multilingual Multi-target Stance Classification from an Online European Participatory Democracy Platform
Valentin Barriere | Guillaume Guillaume Jacquet | Leo Hemamou

Stance Recognition over proposals is the task of automatically detecting whether a comment on a specific proposal is in favor of this proposal, against this proposal or that neither inference is likely. The dataset that we propose to use is an online debating platform inaugurated in 2021, where users can submit proposals and comment over proposals or over other comments. It contains 4.2k proposals and 20k comments focused on various topics. Every comment and proposal can come written in another language, with more than 40% of the proposal/comment pairs containing at least two languages, creating a unique intra-multilingual setting. A portion of the data (more than 7k comment/proposal pairs, in 26 languages) was annotated by the writers with a self-tag assessing whether they are in favor or against the proposal. Another part of the data (without self-tag) has been manually annotated: 1206 comments in 6 morphologically different languages (fr, de, en, el, it, hu) were tagged, leading to a Krippendorff’s α of 0.69. This setting allows defining an intra-multilingual and multi-target stance classification task over online debates.

pdf bib
Exploring the Effects of Negation and Grammatical Tense on Bias Probes
Samia Touileb

We investigate in this paper how correlations between occupations and gendered-pronouns can be affected and changed by adding negation in bias probes, or changing the grammatical tense of the verbs in the probes. We use a set of simple bias probes in Norwegian and English, and perform 16 different probing analysis, using four Norwegian and four English pre-trained language models. We show that adding negation to probes does not have a considerable effect on the correlations between gendered-pronouns and occupations, supporting other works on negation in language models. We also show that altering the grammatical tense of verbs in bias probes do have some interesting effects on models’ behaviours and correlations. We argue that we should take grammatical tense into account when choosing bias probes, and aggregating results across tenses might be a better representation of the existing correlations.

pdf bib
Promoting Pre-trained LM with Linguistic Features on Automatic Readability Assessment
Shudi Hou | Simin Rao | Yu Xia | Sujian Li

Automatic readability assessment (ARA) aims at classifying the readability level of a passage automatically. In the past, manually selected linguistic features are used to classify the passages. However, as the use of deep neural network surges, there is less work focusing on these linguistic features. Recently, many works integrate linguistic features with pre-trained language model (PLM) to make up for the information that PLMs are not good at capturing. Despite their initial success, insufficient analysis of the long passage characteristic of ARA has been done before. To further investigate the promotion of linguistic features on PLMs in ARA from the perspective of passage length, with commonly used linguistic features and abundant experiments, we find that: (1) Linguistic features promote PLMs in ARA mainly on long passages. (2) The promotion of the features on PLMs becomes less significant when the dataset size exceeds 750 passages. (3) By analyzing commonly used ARA datasets, we find Newsela is actually not suitable for ARA. Our code is available at

pdf bib
An Empirical Study of Pipeline vs. Joint approaches to Entity and Relation Extraction
Zhaohui Yan | Zixia Jia | Kewei Tu

The Entity and Relation Extraction (ERE) task includes two basic sub-tasks: Named Entity Recognition and Relation Extraction. In the last several years, much work focused on joint approaches for the common perception that the pipeline approach suffers from the error propagation problem. Recent work reconsiders the pipeline scheme and shows that it can produce comparable results. To systematically study the pros and cons of these two schemes. We design and test eight pipeline and joint approaches to the ERE task. We find that with the same span representation methods, the best joint approach still outperforms the best pipeline model, but improperly designed joint approaches may have poor performance. We hope our work could shed some light on the pipeline-vs-joint debate of the ERE task and inspire further research.

pdf bib
CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing
Andy Rosenbaum | Saleh Soltan | Wael Hamza | Marco Damonte | Isabel Groves | Amir Saffari

A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples, however LLMs are unsuitable for runtime systems which require low latency. In this work, we propose CLASP, a simple method to improve low-resource SP for moderate-sized models: we generate synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters). We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English, and the model must generalize to four new languages. On both datasets, we show significant improvements over strong baseline methods.

pdf bib
Plug and Play Knowledge Distillation for kNN-LM with External Logits
Xuyang Jin | Tao Ge | Furu Wei

Despite the promising evaluation results by knowledge distillation (KD) in natural language understanding (NLU) and sequence-to-sequence (seq2seq) tasks, KD for causal language modeling (LM) remains a challenge. In this paper, we present a novel perspective of knowledge distillation by proposing plug and play knowledge distillation (PP-KD) to improve a (student) kNN-LM that is the state-of-the-art in causal language modeling by leveraging external logits from either a powerful or a heterogeneous (teacher) LM. Unlike conventional logit-based KD where the teacher’s knowledge is built-in during training, PP-KD is plug and play: it stores the teacher’s knowledge (i.e., logits) externally and uses the teacher’s logits of the retrieved k-nearest neighbors during kNN-LM inference at test time. In contrast to marginal perplexity improvement by logit-based KD in conventional neural (causal) LM, PP-KD achieves a significant improvement, enhancing the kNN-LMs in multiple language modeling datasets, showing a novel and promising perspective for causal LM distillation.

pdf bib
How Well Do Multi-hop Reading Comprehension Models Understand Date Information?
Xanh Ho | Saku Sugawara | Akiko Aizawa

Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, HieraDate, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and our dataset can be used for data augmentation to improve the robustness of the models.

pdf bib
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora
Sara Papi | Alina Karakanta | Matteo Negri | Marco Turchi

Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant to specific displaying guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text has to be also annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems respectively trained on manual and automatic segmentations result in similar performance, showing the effectiveness of our approach.

pdf bib
How to tackle an emerging topic? Combining strong and weak labels for Covid news NER
Aleksander Ficek | Fangyu Liu | Nigel Collier

Being able to train Named Entity Recognition (NER) models for emerging topics is crucial for many real-world applications especially in the medical domain where new topics are continuously evolving out of the scope of existing models and datasets. For a realistic evaluation setup, we introduce a novel COVID-19 news NER dataset (COVIDNEWS-NER) and release 3000 entries of hand annotated strongly labelled sentences and 13000 auto-generated weakly labelled sentences. Besides the dataset, we propose CONTROSTER, a recipe to strategically combine weak and strong labels in improving NER in an emerging topic through transfer learning. We show the effectiveness of CONTROSTER on COVIDNEWS-NER while providing analysis on combining weak and strong labels for training. Our key findings are: (1) Using weak data to formulate an initial backbone before tuning on strong data outperforms methods trained on only strong or weak data. (2) A combination of out-of-domain and in-domain weak label training is crucial and can overcome saturation when being training on weak labels from a single source.


pdf (full)
bib (full)
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop

pdf bib
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop
Yan Hanqi | Yang Zonghan | Sebastian Ruder | Wan Xiaojun

pdf bib
Emotional Intensity Estimation based on Writer’s Personality
Haruya Suzuki | Sora Tarumoto | Tomoyuki Kajiwara | Takashi Ninomiya | Yuta Nakashima | Hajime Nagahara

We propose a method for personalized emotional intensity estimation based on a writer’s personality test for Japanese SNS posts. Existing emotion analysis models are difficult to accurately estimate the writer’s subjective emotions behind the text. We personalize the emotion analysis using not only the text but also the writer’s personality information. Experimental results show that personality information improves the performance of emotional intensity estimation. Furthermore, a hybrid model combining the existing personalized method with ours achieved state-of-the-art performance.

pdf bib
Bipartite-play Dialogue Collection for Practical Automatic Evaluation of Dialogue Systems
Shiki Sato | Yosuke Kishinami | Hiroaki Sugiyama | Reina Akama | Ryoko Tokuhisa | Jun Suzuki

Automation of dialogue system evaluation is a driving force for the efficient development of dialogue systems. This paper introduces the bipartite-play method, a dialogue collection method for automating dialogue system evaluation. It addresses the limitations of existing dialogue collection methods: (i) inability to compare with systems that are not publicly available, and (ii) vulnerability to cheating by intentionally selecting systems to be compared. Experimental results show that the automatic evaluation using the bipartite-play method mitigates these two drawbacks and correlates as strongly with human subjectivity as existing methods.

pdf bib
Toward Building a Language Model for Understanding Temporal Commonsense
Mayuko Kimura | Lis Kanashiro Pereira | Ichiro Kobayashi

The ability to capture temporal commonsense relationships for time-related events expressed in text is a very important task in natural language understanding. On the other hand, pre-trained language models such as BERT, which have recently achieved great success in a wide range of natural language processing tasks, are still considered to have poor performance in temporal reasoning. In this paper, we focus on the development of language models for temporal commonsense inference over several pre-trained language models. Our model relies on multi-step fine-tuning using multiple corpora, and masked language modeling to predict masked temporal indicators that are crucial for temporal commonsense reasoning. We also experimented with multi-task learning and build a language model that can improve performance on multiple time-related tasks. In our experiments, multi-step fine-tuning using the general commonsense reading task as auxiliary task produced the best results. This result showed a significant improvement in accuracy over standard fine-tuning in the temporal commonsense inference task.

pdf bib
Optimal Summaries for Enabling a Smooth Handover in Chat-Oriented Dialogue
Sanae Yamashita | Ryuichiro Higashinaka

In dialogue systems, one option for creating a better dialogue experience for the user is to have a human operator take over the dialogue when the system runs into trouble communicating with the user. In this type of handover situation (we call it intervention), it is useful for the operator to have access to the dialogue summary. However, it is not clear exactly what type of summary would be the most useful for a smooth handover. In this study, we investigated the optimal type of summary through experiments in which interlocutors were presented with various summary types during interventions in order to examine their effects. Our findings showed that the best summaries were an abstractive summary plus one utterance immediately before the handover and an extractive summary consisting of five utterances immediately before the handover. From the viewpoint of computational cost, we recommend that extractive summaries consisting of the last five utterances be used.

pdf bib
MUTE: A Multimodal Dataset for Detecting Hateful Memes
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque

The exponential surge of social media has enabled information propagation at an unprecedented rate. However, it also led to the generation of a vast amount of malign content, such as hateful memes. To eradicate the detrimental impact of this content, over the last few years hateful memes detection problem has grabbed the attention of researchers. However, most past studies were conducted primarily for English memes, while memes on resource constraint languages (i.e., Bengali) are under-studied. Moreover, current research considers memes with a caption written in monolingual (either English or Bengali) form. However, memes might have code-mixed captions (English+Bangla), and the existing models can not provide accurate inference in such cases. Therefore, to facilitate research in this arena, this paper introduces a multimodal hate speech dataset (named MUTE) consisting of 4158 memes having Bengali and code-mixed captions. A detailed annotation guideline is provided to aid the dataset creation in other resource constraint languages. Additionally, extensive experiments have been carried out on MUTE, considering the only visual, only textual, and both modalities. The result demonstrates that joint evaluation of visual and textual features significantly improves (≈ 3%) the hateful memes classification compared to the unimodal evaluation.

pdf bib
A Simple and Fast Strategy for Handling Rare Words in Neural Machine Translation
Nguyen-Hoang Minh-Cong | Vinh Thi Ngo | Van Vinh Nguyen

Neural Machine Translation (NMT) has currently obtained state-of-the-art in machine translation systems. However, dealing with rare words is still a big challenge in translation systems. The rare words are often translated using a manual dictionary or copied from the source to the target with original words. In this paper, we propose a simple and fast strategy for integrating constraints during the training and decoding process to improve the translation of rare words. The effectiveness of our proposal is demonstrated in both high and low-resource translation tasks, including the language pairs: English → Vietnamese, Chinese → Vietnamese, Khmer → Vietnamese, and Lao → Vietnamese. We show the improvements of up to +1.8 BLEU scores over the baseline systems.

pdf bib
C3PO: A Lightweight Copying Mechanism for Translating Pseudocode to Code
Vishruth Veerendranath | Vibha Masti | Prajwal Anagani | Mamatha Hr

Writing computer programs is a skill that remains inaccessible to most due to the barrier of programming language (PL) syntax. While large language models (LLMs) have been proposed to translate natural language pseudocode to PL code, they are costly in terms of data and compute. We propose a lightweight alternative to LLMs that exploits the property of code wherein most tokens can be simply copied from the pseudocode. We divide the problem into three phases: Copy, Generate, and Combine. In the Copy Phase, a binary classifier is employed to determine and mask the pseudocode tokens that can be directly copied into the code. In the Generate Phase, a Sequence-to-Sequence model is used to generate the masked PL code equivalent. In the Combine Phase, the generated sequence is combined with the tokens that the Copy Phase had masked. We show that our C3PO models achieve similar performance to non-C3PO models while reducing the computational cost of training as well as the vocabulary sizes.

pdf bib
Outlier-Aware Training for Improving Group Accuracy Disparities
Li-Kuang Chen | Canasai Kruengkrai | Junichi Yamagishi

Methods addressing spurious correlations such as Just Train Twice (JTT, Liu et al. 2021) involve reweighting a subset of the training set to maximize the worst-group accuracy. However, the reweighted set of examples may potentially contain unlearnable examples that hamper the model’s learning. We propose mitigating this by detecting outliers to the training set and removing them before reweighting. Our experiments show that our method achieves competitive or better accuracy compared with JTT and can detect and remove annotation errors in the subset being reweighted in JTT.

pdf bib
An Empirical Study on Topic Preservation in Multi-Document Summarization
Mong Yuan Sim | Wei Emma Zhang | Congbo Ma

Multi-document summarization (MDS) is a process of generating an informative and concise summary from multiple topic-related documents. Many studies have analyzed the quality of MDS dataset or models, however no work has been done from the perspective of topic preservation. In this work, we fill the gap by performing an empirical analysis on two MDS datasets and study topic preservation on generated summaries from 8 MDS models. Our key findings include i) Multi-News dataset has better gold summaries compared to Multi-XScience in terms of its topic distribution consistency and ii) Extractive approaches perform better than abstractive approaches in preserving topic information from source documents. We hope our findings could help develop a summarization model that can generate topic-focused summary and also give inspiration to researchers in creating dataset for such challenging task.

pdf bib
Detecting Urgency in Multilingual Medical SMS in Kenya
Narshion Ngao | Zeyu Wang | Lawrence Nderu | Tobias Mwalili | Tal August | Keshet Ronen

Access to mobile phones in many low- and middle-income countries has increased exponentially over the last 20 years, providing an opportunity to connect patients with healthcare interventions through mobile phones (known as mobile health). A barrier to large-scale implementation of interactive mobile health interventions is the human effort needed to manage participant messages. In this study, we explore the use of natural language processing to improve healthcare workers’ management of messages from pregnant and postpartum women in Kenya. Using multilingual, low-resource language text messages from the Mobile solutions for Women and Children’s health (Mobile WACh NEO) study, we developed models to assess urgency of incoming messages. We evaluated models using a novel approach that focuses on clinical usefulness in either triaging or prioritizing messages. Our best-performing models did not reach the threshold for clinical usefulness we set, but have the potential to improve nurse workflow and responsiveness to urgent messages.

pdf bib
Language over Labels: Contrastive Language Supervision Exceeds Purely Label-Supervised Classification Performance on Chest X-Rays
Anton Wiehe | Florian Schneider | Sebastian Blank | Xintong Wang | Hans-Peter Zorn | Christian Biemann

The multi-modal foundation model CLIP computes representations from texts and images that achieved unprecedented performance on tasks such as zero-shot image classification. However, CLIP was pretrained on public internet data. Thus it lacks highly domain-specific knowledge. We investigate the adaptation of CLIP-based models to the chest radiography domain using the MIMIC-CXR dataset. We show that the features of the pretrained CLIP models do not transfer to this domain. We adapt CLIP to the chest radiography domain using contrastive language supervision and show that this approach yields a model that outperforms supervised learning on labels on the MIMIC-CXR dataset while also generalizing to the CheXpert and RSNA Pneumonia datasets. Furthermore, we do a detailed ablation study of the batch and dataset size. Finally, we show that language supervision allows for better explainability by using the multi-modal model to generate images from texts such that experts can inspect what the model has learned.

pdf bib
Dynamic Topic Modeling by Clustering Embeddings from Pretrained Language Models: A Research Proposal
Anton Eklund | Mona Forsman | Frank Drewes

A new trend in topic modeling research is to do Neural Topic Modeling by Clustering document Embeddings (NTM-CE) created with a pretrained language model. Studies have evaluated static NTM-CE models and found them performing comparably to, or even better than other topic models. An important extension of static topic modeling is making the models dynamic, allowing the study of topic evolution over time, as well as detecting emerging and disappearing topics. In this research proposal, we present two research questions to understand dynamic topic modeling with NTM-CE theoretically and practically. To answer these, we propose four phases with the aim of establishing evaluation methods for dynamic topic modeling, finding NTM-CE-specific properties, and creating a framework for dynamic NTM-CE. For evaluation, we propose to use both quantitative measurements of coherence and human evaluation supported by our recently developed tool.

pdf bib
Concreteness vs. Abstractness: A Selectional Preference Perspective
Tarun Tater | Diego Frassinelli | Sabine Schulte im Walde

Concrete words refer to concepts that are strongly experienced through human senses (banana, chair, salt, etc.), whereas abstract concepts are less perceptually salient (idea, glory, justice, etc.). A clear definition of abstractness is crucial for the understanding of human cognitive processes and for the development of natural language applications such as figurative language detection. In this study, we investigate selectional preferences as a criterion to distinguish between concrete and abstract concepts and words: we hypothesise that abstract and concrete verbs and nouns differ regarding the semantic classes of their arguments. Our study uses a collection of 5,438 nouns and 1,275 verbs to exploit selectional preferences as a salient characteristic in classifying English abstract vs. concrete words, and in predicting their concreteness scores. We achieve an f1-score of 0.84 for nouns and 0.71 for verbs in classification, and Spearman’s ρ correlation of 0.86 for nouns and 0.59 for verbs.


pdf (full)
bib (full)
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations

pdf bib
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: System Demonstrations
Wray Buntine | Maria Liakata

pdf bib
VScript: Controllable Script Generation with Visual Presentation
Ziwei Ji | Yan Xu | I-Tsun Cheng | Samuel Cahyawijaya | Rita Frieske | Etsuko Ishii | Min Zeng | Andrea Madotto | Pascale Fung

In order to offer a customized script tool and inspire professional scriptwriters, we present VScript. It is a controllable pipeline that generates complete scripts, including dialogues and scene descriptions, as well as presents visually using video retrieval. With an interactive interface, our system allows users to select genres and input starting words that control the theme and development of the generated script. We adopt a hierarchical structure, which first generates the plot, then the script and its visual presentation. A novel approach is also introduced to plot-guided dialogue generation by treating it as an inverse dialogue summarization. The experiment results show that our approach outperforms the baselines on both automatic and human evaluations, especially in genre control.

pdf bib
TexPrax: A Messaging Application for Ethical, Real-time Data Collection and Annotation
Lorenz Stangier | Ji-Ung Lee | Yuxi Wang | Marvin Müller | Nicholas Frick | Joachim Metternich | Iryna Gurevych

Collecting and annotating task-oriented dialog data is difficult, especially for highly specific domains that require expert knowledge. At the same time, informal communication channels such as instant messengers are increasingly being used at work. This has led to a lot of work-relevant information that is disseminated through those channels and needs to be post-processed manually by the employees. To alleviate this problem, we present TexPrax, a messaging system to collect and annotate _problems_, _causes_, and _solutions_ that occur in work-related chats. TexPrax uses a chatbot to directly engage the employees to provide lightweight annotations on their conversation and ease their documentation work. To comply with data privacy and security regulations, we use an end-to-end message encryption and give our users full control over their data which has various advantages over conventional annotation tools. We evaluate TexPrax in a user-study with German factory employees who ask their colleagues for solutions on problems that arise during their daily work. Overall, we collect 202 task-oriented German dialogues containing 1,027 sentences with sentence-level expert annotations. Our data analysis also reveals that real-world conversations frequently contain instances with code-switching, varying abbreviations for the same entity, and dialects which NLP systems should be able to handle.

pdf bib
PicTalky: Augmentative and Alternative Communication for Language Developmental Disabilities
Chanjun Park | Yoonna Jang | Seolhwa Lee | Jaehyung Seo | Kisu Yang | Heuiseok Lim

Children with language disabilities face communication difficulties in daily life. They are often deprived of the opportunity to participate in social activities due to their difficulty in understanding or using natural language. In this regard, Augmentative and Alternative Communication (AAC) can be a practical means of communication for children with language disabilities. In this study, we propose PicTalky, which is an AI-based AAC system that helps children with language developmental disabilities to improve their communication skills and language comprehension abilities. PicTalky can process both text and pictograms more accurately by connecting a series of neural-based NLP modules. Additionally, we perform quantitative and qualitative analyses on the modules of PicTalky. By using this service, it is expected that those suffering from language problems will be able to express their intentions or desires more easily and improve their quality of life. We have made the models freely available alongside a demonstration of the web interface. Furthermore, we implemented robotics AAC for the first time by applying PicTalky to the NAO robot.

pdf bib
UKP-SQuARE v2: Explainability and Adversarial Attacks for Trustworthy QA
Rachneet Sachdeva | Haritz Puerto | Tim Baumgärtner | Sewin Tariverdian | Hao Zhang | Kexin Wang | Hossain Shaikh Saadi | Leonardo F. R. Ribeiro | Iryna Gurevych

Question Answering (QA) systems are increasingly deployed in applications where they support real-world decisions. However, state-of-the-art models rely on deep neural networks, which are difficult to interpret by humans. Inherently interpretable models or post hoc explainability methods can help users to comprehend how a model arrives at its prediction and, if successful, increase their trust in the system. Furthermore, researchers can leverage these insights to develop new methods that are more accurate and less biased. In this paper, we introduce SQuARE v2, the new version of SQuARE, to provide an explainability infrastructure for comparing models based on methods such as saliency maps and graph-based explanations. While saliency maps are useful to inspect the importance of each input token for the model’s prediction, graph-based explanations from external Knowledge Graphs enable the users to verify the reasoning behind the model prediction. In addition, we provide multiple adversarial attacks to compare the robustness of QA models. With these explainability methods and adversarial attacks, we aim to ease the research on trustworthy QA models. SQuARE is available on

pdf bib
TaxFree: a Visualization Tool for Candidate-free Taxonomy Enrichment
Irina Nikishina | Ivan Andrianov | Alsu Vakhitova | Alexander Panchenko

Taxonomies are widely used in a various number of downstream NLP tasks and, therefore, should be kept up-to-date. In this paper, we present TaxFree, an open source system for taxonomy visualisation and automatic Taxonomy Enrichment without pre-defined candidates on the example of WordNet-3.0. As oppose to the traditional task formulation (where the list of new words is provided beforehand), we provide an approach for automatic extension of a taxonomy using a large pre-trained language model. As an advantage to the existing visualisation tools of WordNet, TaxFree also integrates graphic representations of synsets from ImageNet. Such visualisation tool can be used for both updating taxonomies and inspecting them for the required modifications.

pdf bib
F-coref: Fast, Accurate and Easy to Use Coreference Resolution
Shon Otmazgin | Arie Cattan | Yoav Goldberg

We introduce fastcoref, a python package for fast, accurate, and easy-to-use English coreference resolution. The package is pip-installable, and allows two modes: an accurate mode based on the LingMess architecture, providing state-of-the-art coreference accuracy, and a substantially faster model, F-coref, which is the focus of this work. F-coref allows to process 2.8K OntoNotes documents in 25 seconds on a V100 GPU (compared to 6 minutes for the LingMess model, and to 12 minutes of the popular AllenNLP coreference model) with only a modest drop in accuracy. The fast speed is achieved through a combination of distillation of a compact model from the LingMess model, and an efficient batching implementation using a technique we call leftover batching.

pdf bib
PIEKM: ML-based Procedural Information Extraction and Knowledge Management System for Materials Science Literature
Huichen Yang | Carlos Aguirre | William Hsu

The published materials science literature contains abundant description information about synthesis procedures that can help discover new material areas, deepen the study of materials synthesis, and accelerate its automated planning. Nevertheless, this information is expressed in unstructured text, and manually processing and assimilating useful information is expensive and time-consuming for researchers. To address this challenge, we develop a Machine Learning-based procedural information extraction and knowledge management system (PIEKM) that extracts procedural information recipe steps, figures, and tables from materials science articles, and provides information retrieval capability and the statistics visualization functionality. Our system aims to help researchers to gain insights and quickly understand the connections among massive data. Moreover, we demonstrate that the machine learning-based system performs well in low-resource scenarios (i.e., limited annotated data) for domain adaption.

pdf bib
BiomedCurator: Data Curation for Biomedical Literature
Mohammad Golam Sohrab | Khoa N.A. Duong | Ikeda Masami | Goran Topić | Yayoi Natsume-Kitatani | Masakata Kuroda | Mari Nogami Itoh | Hiroya Takamura

We present BiomedCurator1, a web application that extracts the structured data from scientific articles in PubMed and BiomedCurator uses state-of-the-art natural language processing techniques to fill the fields pre-selected by domain experts in the relevant biomedical area. The BiomedCurator web application includes: text generation based model for relation extraction, entity detection and recognition, text classification model for extracting several fields, information retrieval from external knowledge base to retrieve IDs, and a pattern-based extraction approach that can extract several fields using regular expressions over the PubMed and datasets. Evaluation results show that different approaches of BiomedCurator web application system are effective for automatic data curation in the biomedical domain.

pdf bib
Text Characterization Toolkit (TCT)
Daniel Simig | Tianlu Wang | Verna Dankers | Peter Henderson | Khuyagbaatar Batsuren | Dieuwke Hupkes | Mona Diab

We present a tool, Text Characterization Toolkit (TCT), that researchers can use to study characteristics of large datasets. Furthermore, such properties can lead to understanding the influence of such attributes on models’ behaviour. Traditionally, in most NLP research, models are usually evaluated by reporting single-number performance scores on a number of readily available benchmarks, without much deeper analysis. Here, we argue that – especially given the well-known fact that benchmarks often contain biases, artefacts, and spurious correlations – deeper results analysis should become the de-facto standard when presenting new models or benchmarks. TCT aims at filling this gap by facilitating such deeper analysis for datasets at scale, where datasets can be for training/development/evaluation. TCT includes both an easy-to-use tool, as well as off-the-shelf scripts that can be used for specific analyses. We also present use-cases from several different domains. TCT is used to predict difficult examples for given well-known trained models; TCT is also used to identify (potentially harmful) biases present in a dataset.

pdf bib
Meeting Decision Tracker: Making Meeting Minutes with De-Contextualized Utterances
Shumpei Inoue | Hy Nguyen | Hoang Pham | Tsungwei Liu | Minh-Tien Nguyen

Meetings are a universal process to make decisions in business and project collaboration. The capability to automatically itemize the decisions in daily meetings allows for extensive tracking of past discussions. To that end, we developed Meeting Decision Tracker, a prototype system to construct decision items comprising decision utterance detector (DUD) and decision utterance rewriter (DUR). We show that DUR makes a sizable contribution to improving the user experience by dealing with utterance collapse in natural conversation. An introduction video of our system is also available at


pdf (full)
bib (full)
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts

pdf bib
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts
Miguel A. Alonso | Zhongyu Wei

pdf bib
Efficient and Robust Knowledge Graph Construction
Ningyu Zhang | Tao Gui | Guoshun Nan

Knowledge graph construction which aims to extract knowledge from the text corpus, has appealed to the NLP community researchers. Previous decades have witnessed the remarkable progress of knowledge graph construction on the basis of neural models; however, those models often cost massive computation or labeled data resources and suffer from unstable inference accounting for biased or adversarial samples. Recently, numerous approaches have been explored to mitigate the efficiency and robustness issues for knowledge graph construction, such as prompt learning and adversarial training. In this tutorial, we aim to bring interested NLP researchers up to speed on the recent and ongoing techniques for efficient and robust knowledge graph construction. Additionally, our goal is to provide a systematic and up-to-date overview of these methods and reveal new research opportunities to the audience.

pdf bib
Recent Advances in Pre-trained Language Models: Why Do They Work and How Do They Work
Cheng-Han Chiang | Yung-Sung Chuang | Hung-yi Lee

Pre-trained language models (PLMs) are language models that are pre-trained on large-scaled corpora in a self-supervised fashion. These PLMs have fundamentally changed the natural language processing community in the past few years. In this tutorial, we aim to provide a broad and comprehensive introduction from two perspectives: why those PLMs work, and how to use them in NLP tasks. The first part of the tutorial shows some insightful analysis on PLMs that partially explain their exceptional downstream performance. The second part first focuses on emerging pre-training methods that enable PLMs to perform diverse downstream tasks and then illustrates how one can apply those PLMs to downstream tasks under different circumstances. These circumstances include fine-tuning PLMs when under data scarcity, and using PLMs with parameter efficiency. We believe that attendees of different backgrounds would find this tutorial informative and useful.

pdf bib
When Cantonese NLP Meets Pre-training: Progress and Challenges
Rong Xiang | Hanzhuo Tan | Jing Li | Mingyu Wan | Kam-Fai Wong

Cantonese is an influential Chinese variant with a large population of speakers worldwide. However, it is under-resourced in terms of the data scale and diversity, excluding Cantonese Natural Language Processing (NLP) from the stateof-the-art (SOTA) “pre-training and fine-tuning” paradigm. This tutorial will start with a substantially review of the linguistics and NLP progress for shaping language specificity, resources, and methodologies. It will be followed by an introduction to the trendy transformerbased pre-training methods, which have been largely advancing the SOTA performance of a wide range of downstream NLP tasks in numerous majority languages (e.g., English and Chinese). Based on the above, we will present the main challenges for Cantonese NLP in relation to Cantonese language idiosyncrasies of colloquialism and multilingualism, followed by the future directions to line NLP for Cantonese and other low-resource languages up to the cutting-edge pre-training practice.

pdf bib
Grounding Meaning Representation for Situated Reasoning
Nikhil Krishnaswamy | James Pustejovsky

As natural language technology becomes ever-present in everyday life, people will expect artificial agents to understand language use as humans do. Nevertheless, most advanced neural AI systems fail at some types of interactions that are trivial for humans (e.g., ask a smart system “What am I pointing at?”). One critical aspect of human language understanding is situated reasoning, where inferences make reference to the local context, perceptual surroundings, and contextual groundings from the interaction. In this cutting-edge tutorial, we bring to the NLP/CL community a synthesis of multimodal grounding and meaning representation techniques with formal and computational models of embodied reasoning. We will discuss existing approaches to multimodal language grounding and meaning representations, discuss the kind of information each method captures and their relative suitability to situated reasoning tasks, and demon- strate how to construct agents that conduct situated reasoning by embodying a simulated environment. In doing so, these agents also represent their human interlocutor(s) within the simulation, and are represented through their virtual embodiment in the real world, enabling true bidirectional communication with a computer using multiple modalities.

pdf bib
The Battlefront of Combating Misinformation and Coping with Media Bias
Yi Fung | Kung-Hsiang Huang | Preslav Nakov | Heng Ji

Misinformation is a pressing issue in modern society. It arouses a mixture of anger, distrust, confusion, and anxiety that cause damage on our daily life judgments and public policy decisions. While recent studies have explored various fake news detection and media bias detection techniques in attempts to tackle the problem, there remain many ongoing challenges yet to be addressed, as can be witnessed from the plethora of untrue and harmful content present during the COVID-19 pandemic and the international crises of late. In this tutorial, we provide researchers and practitioners with a systematic overview of the frontier in fighting misinformation. Specifically, we dive into the important research questions of how to (i) develop a robust fake news detection system, which not only fact-check information pieces provable by background knowledge but also reason about the consistency and the reliability of subtle details for emerging events; (ii) uncover the bias and agenda of news sources to better characterize misinformation; as well as (iii) correct false information and mitigate news bias, while allowing diverse opinions to be expressed. Moreover, we discuss the remaining challenges, future research directions, and exciting opportunities to help make this world a better place, with safer and more harmonic information sharing.

pdf bib
A Tour of Explicit Multilingual Semantics: Word Sense Disambiguation, Semantic Role Labeling and Semantic Parsing
Roberto Navigli | Edoardo Barba | Simone Conia | Rexhina Blloshmi

The recent advent of modern pretrained language models has sparked a revolution in Natural Language Processing (NLP), especially in multilingual and cross-lingual applications. Today, such language models have become the de facto standard for providing rich input representations to neural systems, achieving unprecedented results in an increasing range of benchmarks. However, questions that often arise are: firstly, whether current language models are, indeed, able to capture explicit, symbolic meaning; secondly, if they are, to what extent; thirdly, and perhaps more importantly, whether current approaches are capable of scaling across languages. In this cutting-edge tutorial, we will review recent efforts that have aimed at shedding light on meaning in NLP, with a focus on three key open problems in lexical and sentence-level semantics: Word Sense Disambiguation, Semantic Role Labeling, and Semantic Parsing. After a brief introduction, we will spotlight how state-of-the-art models tackle these tasks in multiple languages, showing where they excel and where they fail. We hope that this tutorial will broaden the audience interested in multilingual semantics and inspire researchers to further advance the field.


pdf (full)
bib (full)
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022

pdf bib
Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022
Yulan He | Heng Ji | Sujian Li | Yang Liu | Chua-Hui Chang

pdf bib
Efficient Entity Embedding Construction from Type Knowledge for BERT
Yukun Feng | Amir Fayazi | Abhinav Rastogi | Manabu Okumura

Recent work has shown advantages of incorporating knowledge graphs (KGs) into BERT for various NLP tasks. One common way is to feed entity embeddings as an additional input during pre-training. There are two limitations to such a method. First, to train the entity embeddings to include rich information of factual knowledge, it typically requires access to the entire KG. This is challenging for KGs with daily changes (e.g., Wikidata). Second, it requires a large scale pre-training corpus with entity annotations and high computational cost during pre-training. In this work, we efficiently construct entity embeddings only from the type knowledge, that does not require access to the entire KG. Although the entity embeddings contain only local information, they perform very well when combined with context. Furthermore, we show that our entity embeddings, constructed from BERT’s input embeddings, can be directly incorporated into the fine-tuning phase without requiring any specialized pre-training. In addition, these entity embeddings can also be constructed on the fly without requiring a large memory footprint to store them. Finally, we propose task-specific models that incorporate our entity embeddings for entity linking, entity typing, and relation classification. Experiments show that our models have comparable or superior performance to existing models while being more resource efficient.

pdf bib
Spa: On the Sparsity of Virtual Adversarial Training for Dependency Parsing
Chao Lou | Wenjuan Han | Kewei Tu

Virtual adversarial training (VAT) is a powerful approach to improving robustness and performance, leveraging both labeled and unlabeled data to compensate for the scarcity of labeled data. It is adopted on lots of vision and language classification tasks. However, for tasks with structured output (e.g., dependency parsing), the application of VAT is nontrivial due to the intrinsic proprieties of structures: (1) the non-sparse problem and (2) exponential complexity. Against this background, we propose the Sparse Parse Adjustment (spa) algorithm and successfully applied VAT to the dependency parsing task. spa refers to the learning algorithm which combines the graph-based dependency parsing model with VAT in an exact computational manner and enhances the dependency parser with controllable and adjustable sparsity. Empirical studies show that the TreeCRF parser optimized using outperforms other methods without sparsity regularization.

pdf bib
KreolMorisienMT: A Dataset for Mauritian Creole Machine Translation
Raj Dabre | Aneerav Sukhoo

In this paper, we describe KreolMorisienMT, a dataset for benchmarking machine translation quality of Mauritian Creole. Mauritian Creole (Kreol Morisien) is a French-based creole and a lingua franca of the Republic of Mauritius. KreolMorisienMT consists of a parallel corpus between English and Kreol Morisien, French and Kreol Morisien and a monolingual corpus for Kreol Morisien. We first give an overview of Kreol Morisien and then describe the steps taken to create the corpora. Thereafter, we benchmark Kreol Morisien ↔ English and Kreol Morisien ↔ French models leveraging pre-trained models and multilingual transfer learning. Human evaluation reveals our systems’ high translation quality.

pdf bib
LEATHER: A Framework for Learning to Generate Human-like Text in Dialogue
Anthony Sicilia | Malihe Alikhani

Algorithms for text-generation in dialogue can be misguided. For example, in task-oriented settings, reinforcement learning that optimizes only task-success can lead to abysmal lexical diversity. We hypothesize this is due to poor theoretical understanding of the objectives in text-generation and their relation to the learning process (i.e., model training). To this end, we propose a new theoretical framework for learning to generate text in dialogue. Compared to existing theories of learning, our framework allows for analysis of the multi-faceted goals inherent to text-generation. We use our framework to develop theoretical guarantees for learners that adapt to unseen data. As an example, we apply our theory to study data-shift within a cooperative learning algorithm proposed for the GuessWhat?! visual dialogue game. From this insight, we propose a new algorithm, and empirically, we demonstrate our proposal improves both task-success and human-likeness of the generated text. Finally, we show statistics from our theory are empirically predictive of multiple qualities of the generated dialogue, suggesting our theory is useful for model-selection when human evaluations are not available.

pdf bib
Conceptual Similarity for Subjective Tags
Yacine Gaci | Boualem Benatallah | Fabio Casati | Khalid Benabdeslem

Tagging in the context of online resources is a fundamental addition to search systems. Tags assist with the indexing, management, and retrieval of online products and services to answer complex user queries. Traditional methods of matching user queries with tags either rely on cosine similarity, or employ semantic similarity models that fail to recognize conceptual connections between tags, e.g. ambiance and music. In this work, we focus on subjective tags which characterize subjective aspects of a product or service. We propose conceptual similarity to leverage conceptual awareness when assessing similarity between tags. We also provide a simple cost-effective pipeline to automatically generate data in order to train the conceptual similarity model. We show that our pipeline generates high-quality datasets, and evaluate the similarity model both systematically and on a downstream application. Experiments show that conceptual similarity outperforms existing work when using subjective tags.

pdf bib
TaskMix: Data Augmentation for Meta-Learning of Spoken Intent Understanding
Surya Kant Sahu

Meta-Learning has emerged as a research direction to better transfer knowledge from related tasks to unseen but related tasks. However, Meta-Learning requires many training tasks to learn representations that transfer well to unseen tasks; otherwise, it leads to overfitting, and the performance degenerates to worse than Multi-task Learning. We show that a state-of-the-art data augmentation method worsens this problem of overfitting when the task diversity is low. We propose a simple method, TaskMix, which synthesizes new tasks by linearly interpolating existing tasks. We compare TaskMix against many baselines on an in-house multilingual intent classification dataset of N-Best ASR hypotheses derived from real-life human-machine telephony utterances and two datasets derived from MTOP. We show that TaskMix outperforms baselines, alleviates overfitting when task diversity is low, and does not degrade performance even when it is high.

pdf bib
Understanding the Use of Quantifiers in Mandarin
Guanyi Chen | Kees van Deemter

We introduce a corpus of short texts in Mandarin, in which quantified expressions figure prominently. We illustrate the significance of the corpus by examining the hypothesis (known as Huang’s “coolness” hypothesis) that speakers of East Asian Languages tend to speak more briefly but less informatively than, for example, speakers of West-European languages. The corpus results from an elicitation experiment in which participants were asked to describe abstract visual scenes. We compare the resulting corpus, called MQTUNA, with an English corpus that was collected using the same experimental paradigm. The comparison reveals that some, though not all, aspects of quantifier use support the above-mentioned hypothesis. Implications of these findings for the generation of quantified noun phrases are discussed.

pdf bib
Does Representational Fairness Imply Empirical Fairness?
Aili Shen | Xudong Han | Trevor Cohn | Timothy Baldwin | Lea Frermann

NLP technologies can cause unintended harms if learned representations encode sensitive attributes of the author, or predictions systematically vary in quality across groups. Popular debiasing approaches, like adversarial training, remove sensitive information from representations in order to reduce disparate performance, however the relation between representational fairness and empirical (performance) fairness has not been systematically studied. This paper fills this gap, and proposes a novel debiasing method building on contrastive learning to encourage a latent space that separates instances based on target label, while mixing instances that share protected attributes. Our results show the effectiveness of our new method and, more importantly, show across a set of diverse debiasing methods that representational fairness does not imply empirical fairness. This work highlights the importance of aligning and understanding the relation of the optimization objective and final fairness target.

pdf bib
SEHY: A Simple yet Effective Hybrid Model for Summarization of Long Scientific Documents
Zhihua Jiang | Junzhan Yang | Dongning Rao

Long document (e.g., scientific papers) summarization is obtaining more and more attention in recent years. Extractive approaches attempt to choose salient sentences via understanding the whole document, but long documents cover numerous subjects with varying details and will not ease content understanding. Instead, abstractive approaches elaborate to generate related tokens while suffering from truncating the source document due to their input sizes. To this end, we propose a Simple yet Effective HYbrid approach, which we call SEHY, that exploits the discourse information of a document to select salient sections instead sentences for summary generation. On the one hand, SEHY avoids the full-text understanding; on the other hand, it retains salient information given the length limit. In particular, we design two simple strategies for training the extractor: extracting sections incrementally and based on salience-analysis. Then, we use strong abstractive models to generate the final summary. We evaluate our approach on a large-scale scientific paper dataset: arXiv. Further, we discuss how the disciplinary class (e.g., computer science, math or physics) of a scientific paper affects the performance of SEHY as its writing style indicates, which is unexplored yet in existing works. Experimental results show the effectiveness of our approach and interesting findings on arXiv and its subsets generated in this paper.

pdf bib
PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
Siqi Bao | Huang He | Fan Wang | Hua Wu | Haifeng Wang | Wenquan Wu | Zhihua Wu | Zhen Guo | Hua Lu | Xinxian Huang | Xin Tian | Xinchao Xu | Yingzhan Lin | Zheng-Yu Niu

To explore the limit of dialogue generation pre-training, we present the models of PLATO-XL with up to 11 billion parameters, trained on both Chinese and English social media conversations. To train such large models, we adopt the architecture of unified transformer with high computation and parameter efficiency. In addition, we carry out multi-party aware pre-training to better distinguish the characteristic information in social media conversations. With such designs, PLATO-XL successfully achieves superior performances as compared to other approaches in both Chinese and English chitchat. We further explore the capacity of PLATO-XL on other conversational tasks, such as knowledge grounded dialogue and task-oriented conversation. The experimental results indicate that PLATO-XL obtains state-of-the-art results across multiple conversational tasks, verifying its potential as a foundation model of conversational AI.

pdf bib
A Hybrid Architecture for Labelling Bilingual Māori-English Tweets
David Trye | Vithya Yogarajan | Jemma König | Te Taka Keegan | David Bainbridge | Mark Apperley

Most large-scale language detection tools perform poorly at identifying Māori text. Moreover, rule-based and machine learning-based techniques devised specifically for the Māori-English language pair struggle with interlingual homographs. We develop a hybrid architecture that couples Māori-language orthography with machine learning models in order to annotate mixed Māori-English text. This architecture is used to label a new bilingual Twitter corpus at both the token (word) and tweet (sentence) levels. We use the collected tweets to show that the hybrid approach outperforms existing systems with respect to language detection of interlingual homographs and overall accuracy. We also evaluate its performance on out-of-domain data. Two interactive visualisations are provided for exploring the Twitter corpus and comparing errors across the new and existing techniques. The architecture code and visualisations are available online, and the corpus is available on request.

pdf bib
Meta-Learning Adaptive Knowledge Distillation for Efficient Biomedical Natural Language Processing
Abiola Obamuyide | Blair Johnston

There has been an increase in the number of large and high-performing models made available for various biomedical natural language processing tasks. While these models have demonstrated impressive performance on various biomedical tasks, their training and run-time costs can be computationally prohibitive. This work investigates the use of knowledge distillation, a common model compression method, to reduce the size of large models for biomedical natural language processing. We further improve the performance of knowledge distillation methods for biomedical natural language by proposing a meta-learning approach which adaptively learns parameters that enable the optimal rate of knowledge exchange between the teacher and student models from the distillation data during knowledge distillation. Experiments on two biomedical natural language processing tasks demonstrate that our proposed adaptive meta-learning approach to knowledge distillation delivers improved predictive performance over previous and recent state-of-the-art knowledge distillation methods.

pdf bib
The Effects of Surprisal across Languages: Results from Native and Non-native Reading
Andrea de Varda | Marco Marelli

It is well known that the surprisal of an upcoming word, as estimated by language models, is a solid predictor of reading times (Smith and Levy, 2013). However, most of the studies that support this view are based on English and few other Germanic languages, leaving an open question as to the cross-lingual generalizability of such findings. Moreover, they tend to consider only the best-performing eye-tracking measure, which might conflate the effects of predictive and integrative processing. Furthermore, it is not clear whether prediction plays a role in non-native language processing in bilingual individuals (Grüter et al., 2014). We approach these problems at large scale, extracting surprisal estimates from mBERT, and assessing their psychometric predictive power on the MECO corpus, a cross-linguistic dataset of eye movement behavior in reading (Siegelman et al., 2022; Kuperman et al., 2020). We show that surprisal is a strong predictor of reading times across languages and fixation measurements, and that its effects in L2 are weaker with respect to L1.

pdf bib
Assessing How Users Display Self-Disclosure and Authenticity in Conversation with Human-Like Agents: A Case Study of Luda Lee
Won Ik Cho | Soomin Kim | Eujeong Choi | Younghoon Jeong

There is an ongoing discussion on what makes humans more engaged when interacting with conversational agents. However, in the area of language processing, there has been a paucity of studies on how people react to agents and share interactions with others. We attack this issue by investigating the user dialogues with human-like agents posted online and aim to analyze the dialogue patterns. We construct a taxonomy to discern the users’ self-disclosure in the dialogue and the communication authenticity displayed in the user posting. We annotate the in-the-wild data, examine the reliability of the proposed scheme, and discuss how the categorization can be utilized for future research and industrial development.

pdf bib
Block Diagram-to-Text: Understanding Block Diagram Images by Generating Natural Language Descriptors
Shreyanshu Bhushan | Minho Lee

Block diagrams are very popular for representing a workflow or process of a model. Understanding block diagrams by generating summaries can be extremely useful in document summarization. It can also assist people in inferring key insights from block diagrams without requiring a lot of perceptual and cognitive effort. In this paper, we propose a novel task of converting block diagram images into text by presenting a framework called “BloSum”. This framework extracts the contextual meaning from the images in the form of triplets that help the language model in summary generation. We also introduce a new dataset for complex computerized block diagrams, explain the dataset preparation process, and later analyze it. Additionally, to showcase the generalization of the model, we test our method with publicly available handwritten block diagram datasets. Our evaluation with different metrics demonstrates the effectiveness of our approach that outperforms other methods and techniques.

pdf bib
Multi-Domain Dialogue State Tracking By Neural-Retrieval Augmentation
Lohith Ravuru | Seonghan Ryu | Hyungtak Choi | Haehun Yang | Hyeonmok Ko

Dialogue State Tracking (DST) is a very complex task that requires precise understanding and information tracking of multi-domain conversations between users and dialogue systems. Many task-oriented dialogue systems use dialogue state tracking technology to infer users’ goals from the history of the conversation. Existing approaches for DST are usually conditioned on previous dialogue states. However, the dependency on previous dialogues makes it very challenging to prevent error propagation to subsequent turns of a dialogue. In this paper, we propose Neural Retrieval Augmentation to alleviate this problem by creating a Neural Index based on dialogue context. Our NRA-DST framework efficiently retrieves dialogue context from the index built using a combination of unstructured dialogue state and structured user/system utterances. We explore a simple pipeline resulting in a retrieval-guided generation approach for training a DST model. Experiments on different retrieval methods for augmentation show that neural retrieval augmentation is the best performing retrieval method for DST. Our evaluations on the large-scale MultiWOZ dataset show that our model outperforms the baseline approaches.

pdf bib
TaKG: A New Dataset for Paragraph-level Table-to-Text Generation Enhanced with Knowledge Graphs
Qianqian Qi | Zhenyun Deng | Yonghua Zhu | Lia Jisoo Lee | Michael Witbrock | Jiamou Liu

We introduce TaKG, a new table-to-text generation dataset with the following highlights: (1) TaKG defines a long-text (paragraph-level) generation task as opposed to well-established short-text (sentence-level) generation datasets. (2) TaKG is the first large-scale dataset for this task, containing three application domains and ~750,000 samples. (3) To address the divergence phenomenon, TaKG enhances table input using external knowledge graphs, extracted by a new Wikidata-based method. We then propose a new Transformer-based multimodal sequence-to-sequence architecture for TaKG that integrates two pretrained language models RoBERTa and GPT-2. Our model shows reliable performance on long-text generation across a variety of metrics, and outperforms existing models for short-text generation tasks.

pdf bib
Revisiting Checkpoint Averaging for Neural Machine Translation
Yingbo Gao | Christian Herold | Zijian Yang | Hermann Ney

Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform and the fact that the translation improvement almost comes for free, makes it widely adopted in neural machine translation research. Despite the popularity, the method itself simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes without many justifications. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with ideas such as using different checkpoint selection strategies, calculating weighted average instead of simple mean, making use of gradient information and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the landscape between the converged checkpoints is rather flat and not much further improvement compared to simple averaging is to be obtained.

pdf bib
Modeling Referential Gaze in Task-oriented Settings of Varying Referential Complexity
Özge Alacam | Eugen Ruppert | Sina Zarrieß | Ganeshan Malhotra | Chris Biemann | Sina Zarrieß

Referential gaze is a fundamental phenomenon for psycholinguistics and human-human communication. However, modeling referential gaze for real-world scenarios, e.g. for task-oriented communication, is lacking the well-deserved attention from the NLP community. In this paper, we address this challenging issue by proposing a novel multimodal NLP task; namely predicting when the gaze is referential. We further investigate how to model referential gaze and transfer gaze features to adapt to unseen situated settings that target different referential complexities than the training environment. We train (i) a sequential attention-based LSTM model and (ii) a multivariate transformer encoder architecture to predict whether the gaze is on a referent object. The models are evaluated on the three complexity datasets. The results indicate that the gaze features can be transferred not only among various similar tasks and scenes but also across various complexity levels. Taking the referential complexity of a scene into account is important for successful target prediction using gaze parameters especially when there is not much data for fine-tuning.

pdf bib
Automating Interlingual Homograph Recognition with Parallel Sentences
Yi Han | Ryohei Sasano | Koichi Takeda

Interlingual homographs are words that spell the same but possess different meanings across languages. Recognizing interlingual homographs from form-identical words generally needs linguistic knowledge and massive annotation work. In this paper, we propose an automatic interlingual homograph recognition method based on the cross-lingual word embedding similarity and co-occurrence of form-identical words in parallel sentences. We conduct experiments with various off-the-shelf language models coordinating with cross-lingual alignment operations and co-occurrence metrics on the Chinese-Japanese and English-Dutch language pairs. Experimental results demonstrate that our proposed method is able to make accurate and consistent predictions across languages.

pdf bib
CoRAL: a Context-aware Croatian Abusive Language Dataset
Ravi Shekhar | Mladen Karan | Matthew Purver

In light of unprecedented increases in the popularity of the internet and social media, comment moderation has never been a more relevant task. Semi-automated comment moderation systems greatly aid human moderators by either automatically classifying the examples or allowing the moderators to prioritize which comments to consider first. However, the concept of inappropriate content is often subjective, and such content can be conveyed in many subtle and indirect ways. In this work, we propose CoRAL – a language and culturally aware Croatian Abusive dataset covering phenomena of implicitness and reliance on local and global context. We show experimentally that current models degrade when comments are not explicit and further degrade when language skill and context knowledge are required to interpret the comment.

pdf bib
A Copy Mechanism for Handling Knowledge Base Elements in SPARQL Neural Machine Translation
Rose Hirigoyen | Amal Zouaq | Samuel Reyd

Neural Machine Translation (NMT) models from English to SPARQL are a promising development for SPARQL query generation. However, current architectures are unable to integrate the knowledge base (KB) schema and handle questions on knowledge resources, classes, and properties unseen during training, rendering them unusable outside the scope of topics covered in the training set. Inspired by the performance gains in natural language processing tasks, we propose to integrate a copy mechanism for neural SPARQL query generation as a way to tackle this issue. We illustrate our proposal by adding a copy layer and a dynamic knowledge base vocabulary to two Seq2Seq architectures (CNNs and Transformers). This layer makes the models copy KB elements directly from the questions, instead of generating them. We evaluate our approach on state-of-the-art datasets, including datasets referencing unknown KB elements and measure the accuracy of the copy-augmented architectures. Our results show a considerable increase in performance on all datasets compared to non-copy architectures.

pdf bib
A Multilingual Multiway Evaluation Data Set for Structured Document Translation of Asian Languages
Bianka Buschbeck | Raj Dabre | Miriam Exel | Matthias Huck | Patrick Huy | Raphael Rubino | Hideki Tanaka

Translation of structured content is an important application of machine translation, but the scarcity of evaluation data sets, especially for Asian languages, limits progress. In this paper we present a novel multilingual multiway evaluation data set for the translation of structured documents of the Asian languages Japanese, Korean and Chinese. We describe the data set, its creation process and important characteristics, followed by establishing and evaluating baselines using the direct translation as well as detag-project approaches. Our data set is well suited for multilingual evaluation, and it contains richer annotation tag sets than existing data sets. Our results show that massively multilingual translation models like M2M-100 and mBART-50 perform surprisingly well despite not being explicitly trained to handle structured content. The data set described in this paper and used in our experiments is released publicly.

pdf bib
On Measures of Biases and Harms in NLP
Sunipa Dev | Emily Sheng | Jieyu Zhao | Aubrie Amstutz | Jiao Sun | Yu Hou | Mattie Sanseverino | Jiin Kim | Akihiro Nishi | Nanyun Peng | Kai-Wei Chang

Recent studies show that Natural Language Processing (NLP) technologies propagate societal biases about demographic groups associated with attributes such as gender, race, and nationality. To create interventions and mitigate these biases and associated harms, it is vital to be able to detect and measure such biases. While existing works propose bias evaluation and mitigation methods for various tasks, there remains a need to cohesively understand the biases and the specific harms they measure, and how different measures compare with each other. To address this gap, this work presents a practical framework of harms and a series of questions that practitioners can answer to guide the development of bias measures. As a validation of our framework and documentation questions, we also present several case studies of how existing bias measures in NLP—both intrinsic measures of bias in representations and extrinsic measures of bias of downstream applications—can be aligned with different harms and how our proposed documentation questions facilitates more holistic understanding of what bias measures are measuring.

pdf bib
Logographic Information Aids Learning Better Representations for Natural Language Inference
Zijian Jin | Duygu Ataman

Statistical language models conventionally implement representation learning based on the contextual distribution of words or other formal units, whereas any information related to the logographic features of written text are often ignored, assuming they should be retrieved relying on the cooccurence statistics. On the other hand, as language models become larger and require more data to learn reliable representations, such assumptions may start to fall back, especially under conditions of data sparsity. Many languages, including Chinese and Vietnamese, use logographic writing systems where surface forms are represented as a visual organization of smaller graphemic units, which often contain many semantic cues. In this paper, we present a novel study which explores the benefits of providing language models with logographic information in learning better semantic representations. We test our hypothesis in the natural language inference (NLI) task by evaluating the benefit of computing multi-modal representations that combine contextual information with glyph information. Our evaluation results in six languages with different typology and writing systems suggest significant benefits of using multi-modal embeddings in languages with logograhic systems, especially for words with less occurence statistics.

pdf bib
Cross-domain Analysis on Japanese Legal Pretrained Language Models
Keisuke Miyazaki | Hiroaki Yamada | Takenobu Tokunaga

This paper investigates the pretrained language model (PLM) specialised in the Japanese legal domain. We create PLMs using different pretraining strategies and investigate their performance across multiple domains. Our findings are (i) the PLM built with general domain data can be improved by further pretraining with domain-specific data, (ii) domain-specific PLMs can learn domain-specific and general word meanings simultaneously and can distinguish them, (iii) domain-specific PLMs work better on its target domain; still, the PLMs retain the information learnt in the original PLM even after being further pretrained with domain-specific data, (iv) the PLMs sequentially pretrained with corpora of different domains show high performance for the later learnt domains.

pdf bib
Multilingual CheckList: Generation and Evaluation
Karthikeyan K | Shaily Bhatt | Pankaj Singh | Somak Aditya | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury

Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We device an algorithm –Template Extraction Algorithm (TEA) for automatically extracting target language CheckList templates from machine translated instances of a source language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more different languages. We find that TEA followed by human verification is ideal for scaling Checklist-based evaluation to multiple languages while TEA gives a good estimates of model performance. We release the code of TEA and the CheckLists created at

pdf bib
Part Represents Whole: Improving the Evaluation of Machine Translation System Using Entropy Enhanced Metrics
Yilun Liu | Shimin Tao | Chang Su | Min Zhang | Yanqing Zhao | Hao Yang

Machine translation (MT) metrics often experience poor correlations with human assessments. In terms of MT system evaluation, most metrics pay equal attentions to every sample in an evaluation set, while in human evaluation, difficult sentences often make candidate systems distinguishable via notable fluctuations in human scores, especially when systems are competitive. We find that samples with high entropy values, which though usually count less than 5%, tend to play a key role in MT evaluation: when the evaluation set is shrunk to only the high-entropy portion, correlations with human assessments are actually improved. Thus, in this paper, we propose a fast and unsupervised approach to enhance MT metrics using entropy, expanding the dimension of evaluation by introducing sentence-level difficulty. A translation hypothesis with a significantly high entropy value is considered difficult and receives a large weight in aggregation of system-level scores. Experimental results on five sub-tracks in the WMT19 Metrics shared tasks show that our proposed method significantly enhanced the performance of commonly-used MT metrics in terms of system-level correlations with human assessments, even outperforming existing SOTA metrics. In particular, all enhanced metrics exhibit overall stability in correlations with human assessments in circumstances where only competitive MT systems are included, while the corresponding vanilla metrics fail to correlate with human assessments.

pdf bib
Memformer: A Memory-Augmented Transformer for Sequence Modeling
Qingyang Wu | Zhenzhong Lan | Kun Qian | Jing Gu | Alborz Geramifard | Zhou Yu

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling, that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement. Experimental results show that Memformer has achieved comparable performance compared against the baselines by using 8.1x less memory space and 3.2x faster on inference. Analysis of the attention pattern shows that our external memory slots can encode and retain important information through timesteps.

pdf bib
Open-Domain Conversational Question Answering with Historical Answers
Hung-Chieh Fang | Kuo-Han Hung | Chen-Wei Huang | Yun-Nung Chen

Open-domain conversational question answering can be viewed as two tasks: passage retrieval and conversational question answering, where the former relies on selecting candidate passages from a large corpus and the latter requires better understanding of a question with contexts to predict the answers. This paper proposes ConvADR-QA that leverages historical answers to boost retrieval performance and further achieves better answering performance. Our experiments on the benchmark dataset, OR-QuAC, demonstrate that our model outperforms existing baselines in both extractive and generative reader settings, well justifying the effectiveness of historical answers for open-domain conversational question answering.

pdf bib
Robustness Evaluation of Text Classification Models Using Mathematical Optimization and Its Application to Adversarial Training
Hikaru Tomonari | Masaaki Nishino | Akihiro Yamamoto

Neural networks are known to be vulnerable to adversarial examples due to slightly perturbed input data. In practical applications of neural network models, the robustness of the models against perturbations must be evaluated. However, no method can strictly evaluate their robustness in natural language domains. We therefore propose a method that evaluates the robustness of text classification models using an integer linear programming (ILP) solver by an optimization problem that identifies a minimum synonym swap that changes the classification result. Our method allows us to compare the robustness of various models in realistic time. It can also be used for obtaining adversarial examples. Because of the minimal impact on the altered sentences, adversarial examples with our method obtained high scores in human evaluations of grammatical correctness and semantic similarity for an IMDb dataset. In addition, we implemented adversarial training with the IMDb and SST2 datasets and found that our adversarial training method makes the model robust.

pdf bib
HERB: Measuring Hierarchical Regional Bias in Pre-trained Language Models
Yizhi Li | Ge Zhang | Bohao Yang | Chenghua Lin | Anton Ragni | Shi Wang | Jie Fu

Fairness has become a trending topic in natural language processing (NLP) and covers biases targeting certain social groups such as genders and religions. Yet regional bias, another long-standing global discrimination problem, remains unexplored still. Consequently, we intend to provide a study to analyse the regional bias learned by the pre-trained language models (LMs) that are broadly used in NLP tasks. While verifying the existence of regional bias in LMs, we find that the biases on regional groups can be largely affected by the corresponding geographical clustering. We accordingly propose a hierarchical regional bias evaluation method (HERB) utilising the information from the sub-region clusters to quantify the bias in the pre-trained LMs. Experiments show that our hierarchical metric can effectively evaluate the regional bias with regard to comprehensive topics and measure the potential regional bias that can be propagated to downstream tasks. Our codes are available at

pdf bib
Multilingual Auxiliary Tasks Training: Bridging the Gap between Languages for Zero-Shot Transfer of Hate Speech Detection Models
Syrielle Montariol | Arij Riabi | Djamé Seddah

Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between lan- guages, such as in hate speech detection. In this paper, we highlight this limitation for hate speech detection in several domains and languages using strict experimental settings. Then, we propose to train on multilingual auxiliary tasks – sentiment analysis, named entity recognition, and tasks relying on syntactic information – to improve zero-shot transfer of hate speech detection models across languages. We show how hate speech detection models benefit from a cross-lingual knowledge proxy brought by auxiliary tasks fine-tuning and highlight these tasks’ positive impact on bridging the hate speech linguistic and cultural gap between languages.

pdf bib
Chop and Change: Anaphora Resolution in Instructional Cooking Videos
Cennet Oguz | Ivana Kruijff-Korbayova | Emmanuel Vincent | Pascal Denis | Josef van Genabith

Linguistic ambiguities arising from changes in entities in action flows are a key challenge in instructional cooking videos. In particular, temporally evolving entities present rich and to date understudied challenges for anaphora resolution. For example “oil” mixed with “salt” is later referred to as a “mixture”. In this paper we propose novel annotation guidelines to annotate recipes for the anaphora resolution task, reflecting change in entities. Moreover, we present experimental results for end-to-end multimodal anaphora resolution with the new annotation scheme and propose the use of temporal features for performance improvement.

pdf bib
“#DisabledOnIndianTwitter” : A Dataset towards Understanding the Expression of People with Disabilities on Indian Twitter
Ishani Mondal | Sukhnidh Kaur | Kalika Bali | Aditya Vashistha | Manohar Swaminathan

Twitter serves as a powerful tool for self-expression among the disabled people. To understand how disabled people in India use Twitter, we introduce a manually annotated corpus #DisabledOnIndianTwitter comprising of 2,384 tweets posted by 27 female and 15 male users. These users practice diverse professions and engage in varied online discourses on disability in India. To examine patterns in their Twitter use, we propose a novel hierarchical annotation taxonomy to classify the tweets into various themes including discrimination, advocacy, and self-identification. Using these annotations, we benchmark the corpus leveraging state-of-the-art classifiers. Finally through a mixed-methods analysis on our annotated corpus, we reveal stark differences in self-expression between male and female disabled users on Indian Twitter.

pdf bib
Topic-aware Multimodal Summarization
Sourajit Mukherjee | Anubhav Jangra | Sriparna Saha | Adam Jatowt

Multimodal Summarization (MS) has attracted research interest in the past few years due to the ease with which users perceive multimodal summaries. It is important for MS models to consider the topic a given target content belongs to. In the current paper, we propose a topic-aware MS system which performs two tasks simultaneously: differentiating the images into “on-topic” and “off-topic” categories and further utilizing the “on-topic” images to generate multimodal summaries. The hypothesis is that, the proposed topic similarity classifier will help in generating better multimodal summary by focusing on important components of images and text which are specific to a particular topic. To develop the topic similarity classifier, we have augmented the existing popular MS data set, MSMO, with similar “on-topic” and dissimilar “off-topic” images for each sample. Our experimental results establish that the focus on “on-topic” features helps in generating topic-aware multimodal summaries, which outperforms the state of the art approach by 1.7 % in ROUGE-L metric.

pdf bib
ArgGen: Prompting Text Generation Models for Document-Level Event-Argument Aggregation
Debanjana Kar | Sudeshna Sarkar | Pawan Goyal

Most of the existing discourse-level Information Extraction tasks have been modeled to be extractive in nature. However, we argue that extracting information from larger bodies of discourse-like documents requires more natural language understanding and reasoning capabilities. In our work, we propose the novel task of document-level event argument aggregation which generates consolidated event-arguments at a document-level with minimal loss of information. More specifically, we focus on generating precise document-level information frames in a multilingual setting using prompt-based methods. In this paper, we show the effectiveness of u prompt-based text generation approach to generate document-level argument spans in a low-resource and zero-shot setting. We also release the first of its kind multilingual event argument aggregation dataset that can be leveraged in other related multilingual text generation tasks as well:

pdf bib
Hierarchical Processing of Visual and Language Information in the Brain
Haruka Kawasaki | Satoshi Nishida | Ichiro Kobayashi

In recent years, many studies using deep learning have been conducted to elucidate the mechanism of information representation in the brain under stimuli evoked by various modalities. On the other hand, it has not yet been clarified how we humans link information of different modalities in the brain. In this study, to elucidate the relationship between visual and language information in the brain, we constructed encoding models that predict brain activity based on features extracted from the hidden layers of VGG16 for visual information and BERT for language information. We investigated the hierarchical characteristics of cortical localization and representational content of visual and semantic information in the cortex based on the brain activity predicted by the encoding model. The results showed that the cortical localization modeled by VGG16 is getting close to that of BERT as VGG16 moves to higher layers, while the representational contents differ significantly between the two modalities.

pdf bib
Differential Bias: On the Perceptibility of Stance Imbalance in Argumentation
Alonso Palomino | Khalid Al Khatib | Martin Potthast | Benno Stein

Most research on natural language processing treats bias as an absolute concept: Based on a (probably complex) algorithmic analysis, a sentence, an article, or a text is classified as biased or not. Given the fact that for humans the question of whether a text is biased can be difficult to answer or is answered contradictory, we ask whether an “absolute bias classification” is a promising goal at all. We see the problem not in the complexity of interpreting language phenomena but in the diversity of sociocultural backgrounds of the readers, which cannot be handled uniformly: To decide whether a text has crossed the proverbial line between non-biased and biased is subjective. By asking “Is text X more [less, equally] biased than text Y?” we propose to analyze a simpler problem, which, by its construction, is rather independent of standpoints, views, or sociocultural aspects. In such a model, bias becomes a preference relation that induces a partial ordering from least biased to most biased texts without requiring a decision on where to draw the line. A prerequisite for this kind of bias model is the ability of humans to perceive relative bias differences in the first place. In our research, we selected a specific type of bias in argumentation, the stance bias, and designed a crowdsourcing study showing that differences in stance bias are perceptible when (light) support is provided through training or visual aid.

pdf bib
BeamR: Beam Reweighing with Attribute Discriminators for Controllable Text Generation
David Landsman | Jerry Zikun Chen | Hussain Zaidi

Recent advances in natural language processing have led to the availability of large pre-trained language models (LMs), with rich generative capabilities. Although these models are able to produce fluent and coherent text, it remains a challenge to control various attributes of the generation, including sentiment, formality, topic and many others. We propose a Beam Reweighing (BeamR) method, building on top of standard beam search, in order to control different attributes. BeamR combines any generative LM with any attribute discriminator, offering full flexibility of generation style and attribute, while the beam search backbone maintains fluency across different domains. Notably, BeamR allows practitioners to leverage pre-trained models without the need to train generative LMs together with discriminators. We evaluate BeamR in two diverse tasks: sentiment steering, and machine translation formality. Our results show that BeamR performs on par with or better than existing state-of-the-art approaches (including fine-tuned methods), and highlight the flexiblity of BeamR in both causal and seq2seq language modeling tasks.

pdf bib
R&R: Metric-guided Adversarial Sentence Generation
Lei Xu | Alfredo Cuesta-Infante | Laure Berti-Equille | Kalyan Veeramachaneni

Adversarial examples are helpful for analyzing and improving the robustness of text classifiers. Generating high-quality adversarial examples is a challenging task as it requires generating fluent adversarial sentences that are semantically similar to the original sentences and preserve the original labels, while causing the classifier to misclassify them. Existing methods prioritize misclassification by maximizing each perturbation’s effectiveness at misleading a text classifier; thus, the generated adversarial examples fall short in terms of fluency and similarity. In this paper, we propose a rewrite and rollback (R&R) framework for adversarial attack. It improves the quality of adversarial examples by optimizing a critique score which combines the fluency, similarity, and misclassification metrics. R&R generates high-quality adversarial examples by allowing exploration of perturbations that do not have immediate impact on the misclassification metric but can improve fluency and similarity metrics. We evaluate our method on 5 representative datasets and 3 classifier architectures. Our method outperforms current state-of-the-art in attack success rate by +16.2%, +12.8%, and +14.0% on the classifiers respectively. Code is available at

pdf bib
A Simple yet Effective Learnable Positional Encoding Method for Improving Document Transformer Model
Guoxin Wang | Yijuan Lu | Lei Cui | Tengchao Lv | Dinei Florencio | Cha Zhang

Positional encoding plays a key role in Transformer-based architecture, which is to indicate and embed token sequential order information. Understanding documents with unreliable reading order information is a real challenge for document Transformer models. This paper proposes a simple and effective positional encoding method, learnable sinusoidal positional encoding (LSPE), by building a learnable sinusoidal positional encoding feed-forward network. We apply LSPE to document Transformer models and pretrain them on document datasets. Then we finetune and evaluate the model performance on document understanding tasks in form, receipt, and invoice domains. Experimental results show our proposed method not only outperforms other baselines, but also demonstrates its robustness and stability on handling noisy data with incorrect order information.

pdf bib
MMM: An Emotion and Novelty-aware Approach for Multilingual Multimodal Misinformation Detection
Vipin Gupta | Rina Kumari | Nischal Ashok | Tirthankar Ghosal | Asif Ekbal

The growth of multilingual web content in low-resource languages is becoming an emerging challenge to detect misinformation. One particular hindrance to research on this problem is the non-availability of resources and tools. Majority of the earlier works in misinformation detection are based on English content which confines the applicability of the research to a specific language only. Increasing presence of multimedia content on the web has promoted misinformation in which real multimedia content (images, videos) are used in different but related contexts with manipulated texts to mislead the readers. Detecting this category of misleading information is almost impossible without any prior knowledge. Studies say that emotion-invoking and highly novel content accelerates the dissemination of false information. To counter this problem, here in this paper, we first introduce a novel multilingual multimodal misinformation dataset that includes background knowledge (from authentic sources) of the misleading articles. Second, we propose an effective neural model leveraging novelty detection and emotion recognition to detect fabricated information. We perform extensive experiments to justify that our proposed model outperforms the state-of-the-art (SOTA) on the concerned task.

pdf bib
Adversarial Sample Generation for Aspect based Sentiment Classification
Mamta | Asif Ekbal

Deep learning models have been proven vulnerable towards small imperceptible perturbed input, known as adversarial samples, which are indiscernible by humans. Initial attacks in Natural Language Processing perturb characters or words in sentences using heuristics and synonyms-based strategies, resulting in grammatical incorrect or out-of-context sentences. Recent works attempt to generate contextual adversarial samples using a masked language model, capturing word relevance using leave-one-out (LOO). However, they lack the design to maintain the semantic coherency for aspect based sentiment analysis (ABSA) tasks. Moreover, they focused on resource-rich languages like English. We present an attack algorithm for the ABSA task by exploiting model explainability techniques to address these limitations. It does not require access to the training data, raw access to the model, or calibrating a new model. Our proposed method generates adversarial samples for a given aspect, maintaining more semantic coherency. In addition, it can be generalized to low-resource languages, which are at high risk due to resource scarcity. We show the effectiveness of the proposed attack using automatic and human evaluation. Our method outperforms the state-of-art methods in perturbation ratio, success rate, and semantic coherence.


pdf (full)
bib (full)
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

pdf bib
Proceedings of the Workshop on Cognitive Aspects of the Lexicon
Michael Zock | Emmanuele Chersoni | Yu-Yin Hsu | Enrico Santus

pdf bib
Patterns of Text Readability in Human and Predicted Eye Movements
Nora Hollenstein | Itziar Gonzalez-Dios | Lisa Beinborn | Lena Jäger

It has been shown that multilingual transformer models are able to predict human reading behavior when fine-tuned on small amounts of eye tracking data. As the cumulated prediction results do not provide insights into the linguistic cues that the model acquires to predict reading behavior, we conduct a deeper analysis of the predictions from the perspective of readability. We try to disentangle the three-fold relationship between human eye movements, the capability of language models to predict these eye movement patterns, and sentence-level readability measures for English. We compare a range of model configurations to multiple baselines. We show that the models exhibit difficulties with function words and that pre-training only provides limited advantages for linguistic generalization.

pdf bib
(In)Alienable Possession in Mandarin Relative Clauses
Deran Kong | Yu-Yin Hsu

Inalienable possession differs from alienable possession in that, in the former – e.g., kinships and part-whole relations – there is an intrinsic semantic dependency between the possessor and possessum. This paper reports two studies that used acceptability-judgment tasks to investigate whether native Mandarin speakers experienced different levels of interpretational costs while resolving different types of possessive relations, i.e., inalienable possessions (kinship terms and body parts) and alienable ones, expressed within relative clauses. The results show that sentences received higher acceptability ratings when body parts were the possessum as compared to sentences with alienable possessum, indicating that the inherent semantic dependency facilitates the resolution. However, inalienable kinship terms received the lowest acceptability ratings. We argue that this was because the kinship terms, which had the [+human] feature and appeared at the beginning of the experimental sentences, tended to be interpreted as the subject in shallow processing; these features contradicted the semantic-syntactic requirements of the experimental sentences.

pdf bib
Do Age of Acquisition and Orthographic Transparency Have the Same Effects in Different Modalities?
Mohammad Momenian

This paper is intended to study the effects of age of acquisition (AoA) and orthographic transparency on word retrieval in Persian, which is an understudied language. A naming task (both pictures and words) and a recall task (both pictures and words) were used to explore how lexical retrieval and verbal memory are affected by AoA and transparency. Seventy two native speakers of Persian were recruited to participate in two experiments. The results showed that early acquired words are processed faster than late acquired words only when pictures were used as stimuli. Transparency of the words was not an influential factor. However, in the recall experiment a three-way interaction was observed: early acquired pictures and words were processed faster than late acquired stimuli except the words in the transparent condition. The findings speak to the fact that language-specific properties of languages are very important.

pdf bib
CAT ManyNames: A New Dataset for Object Naming in Catalan
Mar Domínguez Orfila | Maite Melero Nogués | Gemma Boleda Torrent

Object Naming is an important task within the field of Language and Vision that consists of generating a correct and appropriate name for an object given an image. The ManyNames dataset uses real-world human annotated images with multiple labels, instead of just one. In this work, we describe the adaptation of this dataset (originally in English) to Catalan, by (i) machine-translating the English labels and (ii) collecting human annotations for a subset of the original corpus and comparing both resources. Analyses reveal divergences in the lexical variation of the two sets showing potential problems of directly translated resources, particularly when there is no resource to a proper context, which in this case is conveyed by the image. The analysis also points to the impact of cultural factors in the naming task, which should be accounted for in future cross-lingual naming tasks.

pdf bib
Finetuning Latin BERT for Word Sense Disambiguation on the Thesaurus Linguae Latinae
Piroska Lendvai | Claudia Wick

The Thesaurus Linguae Latinae (TLL) is a comprehensive monolingual dictionary that records contextualized meanings and usages of Latin words in antique sources at an unprecedented scale. We created a new dataset based on a subset of sense representations in the TLL, with which we finetuned the Latin-BERT neural language model (Bamman and Burns, 2020) on a supervised Word Sense Disambiguation task. We observe that the contextualized BERT representations finetuned on TLL data score better than static embeddings used in a bidirectional LSTM classifier on the same dataset, and that our per-lemma BERT models achieve higher and more robust performance than reported by Bamman and Burns (2020) based on data from a bilingual Latin dictionary. We demonstrate the differences in sense organizational principles between these two lexical resources, and report about our dataset construction and improved evaluation methodology.

pdf bib
Putting WordNet’s Dictionary Examples in the Context of Definition Modelling: An Empirical Analysis
Fatemah Almeman | Luis Espinosa Anke

Definition modeling is the task to generate a valid definition for a given input term. This relatively novel task has been approached either with no context (i.e., given a word embedding alone) and, more recently, as word-in-context modeling. Despite their success, most works make little to no distinction between resources and their specific features (e.g., type and style of definitions, or quality of examples) when used for training. Given the high diversity lexicographic resources exhibit in terms of topic coverage, style and formal structure, it is desirable for downstream definition modeling to better understand which of them are better suited for the task. In this paper, we propose an empirical evaluation of the well-known lexical database WordNet, and specifically, its dictionary examples. We evaluate them both directly, by matching them against criteria for good dictionary writing, and indirectly, in the task of definition modeling. Our results suggest that WordNet’s dictionary examples could be improved by extending them in length, and incorporating prototypicality.

pdf bib
Exploring Nominal Coercion in Semantic Spaces with Static and Contextualized Word Embeddings
Chenxin Liu | Emmanuele Chersoni

The distinction between mass nouns and count nouns has a long history in formal semantics, and linguists have been trying to identify the semantic properties defining the two classes. However, they also recognized that both can undergo meaning shifts and be used in contexts of a different type, via nominal coercion. In this paper, we present an approach to measure the meaning shift in count-mass coercion in English that makes use of static and contextualized word embedding distance. Our results show that the coercion shifts are detected only by a small subset of the traditional word embedding models, and that the shifts detected by the contextualized embedding of BERT are more pronounced for mass nouns.

pdf bib
A Frame-Based Model of Inherent Polysemy, Copredication and Argument Coercion
Chen Long | Laura Kallmeyer | Rainer Osswald

The paper presents a frame-based model of inherently polysemous nouns (such as ‘book’, which denotes both a physical object and an informational content) in which the meaning facets are directly accessible via attributes and which also takes into account the semantic relations between the facets. Predication over meaning facets (as in ‘memorize the book’) is then modeled as targeting the value of the corresponding facet attribute while coercion (as in ‘finish the book’) is modeled via specific patterns that enrich the predication. We use a compositional framework whose basic components are lexicalized syntactic trees paired with semantic frames and in which frame unification is triggered by tree composition. The approach is applied to a variety of combinations of predications over meaning facets and coercions.

pdf bib
VISCOSE - a Kanji Dictionary Enriched with VISual, COmpositional, and SEmantic Information
Werner Winiwarter | Bartholomäus Wloka

In this paper, we present a novel approach for building kanji dictionaries by enriching the lexical data of 3,500 kanji with images, structural decompositions, and semantically based cross-media mappings from the textual to the visual dimension. Our kanji dictionary is part of a Web-based contextual language learning environment based on augmented browsing technology. We display our multimodal kanji information as kanji cards in the Web browser, offering a versatile representation that can be integrated into other advanced creative language learning applications, such as memorization puzzles, creative storytelling assignments, or educational games.

pdf bib
Compositionality as an Analogical Process: Introducing ANNE
Giulia Rambelli | Emmanuele Chersoni | Philippe Blache | Alessandro Lenci

Usage-based constructionist approaches consider language a structured inventory of constructions, form-meaning pairings of different schematicity and complexity, and claim that the more a linguistic pattern is encountered, the more it becomes accessible to speakers. However, when an expression is unavailable, what processes underlie the interpretation? While traditional answers rely on the principle of compositionality, for which the meaning is built word-by-word and incrementally, usage-based theories argue that novel utterances are created based on previously experienced ones through analogy, mapping an existing structural pattern onto a novel instance. Starting from this theoretical perspective, we propose here a computational implementation of these assumptions. As the principle of compositionality has been used to generate distributional representations of phrases, we propose a neural network simulating the construction of phrasal embedding as an analogical process. Our framework, inspired by word2vec and computer vision techniques, was evaluated on tasks of generalization from existing vectors.


pdf (full)
bib (full)
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems

pdf bib
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Daniel Deutsch | Can Udomcharoenchaikit | Juri Opitz | Yang Gao | Marina Fomicheva | Steffen Eger

pdf bib
A Japanese Corpus of Many Specialized Domains for Word Segmentation and Part-of-Speech Tagging
Shohei Higashiyama | Masao Ideuchi | Masao Utiyama | Yoshiaki Oida | Eiichiro Sumita

pdf bib
Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis
Zachary Zhou | Alisha Zachariah | Devin Conathan | Jeffery Kline

pdf bib
From COMET to COMES – Can Summary Evaluation Benefit from Translation Evaluation?
Mateusz Krubiński | Pavel Pecina

pdf bib
Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Juri Opitz | Anette Frank

pdf bib
GLARE: Generative Left-to-right AdversaRial Examples
Ryan Andrew Chi | Nathan Kim | Patrick Liu | Zander Lack | Ethan A Chi

pdf bib
Random Text Perturbations Work, but not Always
Zhengxiang Wang

pdf bib
A Comparative Analysis of Stance Detection Approaches and Datasets
Parush Gera | Tempestt Neal

pdf bib
Why is sentence similarity benchmark not predictive of application-oriented task performance?
Kaori Abe | Sho Yokoi | Tomoyuki Kajiwara | Kentaro Inui

pdf bib
Chat Translation Error Detection for Assisting Cross-lingual Communications
Yunmeng Li | Jun Suzuki | Makoto Morishita | Kaori Abe | Ryoko Tokuhisa | Ana Brassard | Kentaro Inui

pdf bib
Evaluating the role of non-lexical markers in GPT-2’s language modeling behavior
Roberta Rocca | Alejandro de la Vega

pdf bib
Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset
Guanyi Chen | Fahime Same | Kees Van Deemter


pdf (full)
bib (full)
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

pdf bib
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Khalid Alnajjar | Niko Partanen | Jack Rueter

pdf bib
A Stylometric Analysis of Amadís de Gaula and Sergas de Esplandián
Yoshifumi Kawasaki

Amadís de Gaula (AG) and its sequel Sergas de Esplandián (SE) are masterpieces of medieval Spanish chivalric romances. Much debate has been devoted to the role played by their purported author Garci Rodríguez de Montalvo. According to the prologue of AG, which consists of four books, the author allegedly revised the first three books that were in circulation at that time and added the fourth book and SE. However, the extent to which Montalvo edited the materials at hand to compose the extant works has yet to be explored extensively. To address this question, we applied stylometric techniques for the first time. Specifically, we investigated the stylistic differences (if any) between the first three books of AG and his own extensions. Literary style is represented as usage of parts-of-speech n-grams. We performed principal component analysis and k-means to demonstrate that Montalvo’s retouching on the first book was minimal, while revising the second and third books in such a way that they came to moderately resemble his authentic creation, that is, the fourth book and SE. Our findings empirically corroborate suppositions formulated from philological viewpoints.

pdf bib
Computational Exploration of the Origin of Mood in Literary Texts
Emily Öhman | Riikka H. Rossi

This paper is a methodological exploration of the origin of mood in early modern and modern Finnish literary texts using computational methods. We discuss the pre-processing steps as well as the various natural language processing tools used to try to pinpoint where mood can be best detected in text. We also share several tools and resources developed during this process. Our early attempts suggest that overall mood can be computationally detected in the first three paragraphs of a book.

pdf bib
Sentiment is all you need to win US Presidential elections
Sovesh Mohapatra | Somesh Mohapatra

Election speeches play an integral role in communicating the vision and mission of the candidates. From lofty promises to mud-slinging, the electoral candidate accounts for all. However, there remains an open question about what exactly wins over the voters. In this work, we used state-of-the-art natural language processing methods to study the speeches and sentiments of the Republican candidates and Democratic candidates fighting for the 2020 US Presidential election. Comparing the racial dichotomy of the United States, we analyze what led to the victory and defeat of the different candidates. We believe this work will inform the election campaigning strategy and provide a basis for communicating to diverse crowds.

pdf bib
Interactive Analysis and Visualisation of Annotated Collocations in Spanish (AVAnCES)
Simon Gonzalez

Phraseology studies have been enhanced by Corpus Linguistics, which has become an interdisciplinary field where current technologies play an important role in its development. Computational tools have been implemented in the last decades with positive results on the identification of phrases in different languages. One specific technology that has impacted these studies is social media. As researchers, we have turned our attention to collecting data from these platforms, which comes with great advantages and its own challenges. One of the challenges is the way we design and build corpora relevant to the questions emerging in this type of language expression. This has been approached from different angles, but one that has given invaluable outputs is the building of linguistic corpora with the use of online web applications. In this paper, we take a multidimensional approach to the collection, design, and deployment of a phraseology corpus for Latin American Spanish from Twitter data, extracting features using NLP techniques, and presenting it in an interactive online web application. We expect to contribute to the methodologies used for Corpus Linguistics in the current technological age. Finally, we make this tool publicly available to be used by any researcher interested in the data itself and also on the technological tools developed here.

pdf bib
Fractality of sentiment arcs for literary quality assessment: The case of Nobel laureates
Yuri Bizzoni | Kristoffer Laigaard Nielbo | Mads Rosendahl Thomsen

In the few works that have used NLP to study literary quality, sentiment and emotion analysis have often been considered valuable sources of information. At the same time, the idea that the nature and polarity of the sentiments expressed by a novel might have something to do with its perceived quality seems limited at best. In this paper, we argue that the fractality of narratives, specifically the long-term memory of their sentiment arcs, rather than their simple shape or average valence, might play an important role in the perception of literary quality by a human audience. In particular, we argue that such measure can help distinguish Nobel-winning writers from control groups in a recent corpus of English language novels. To test this hypothesis, we present the results from two studies: (i) a probability distribution test, where we compute the probability of seeing a title from a Nobel laureate at different levels of arc fractality; (ii) a classification test, where we use several machine learning algorithms to measure the predictive power of both sentiment arcs and their fractality measure. Our findings seem to indicate that despite the competitive and complex nature of the task, the populations of Nobel and non-Nobel laureates seem to behave differently and can to some extent be told apart by a classifier.

pdf bib
Style Classification of Rabbinic Literature for Detection of Lost Midrash Tanhuma Material
Solomon Tannor | Nachum Dershowitz | Moshe Lavee

Midrash collections are complex rabbinic works that consist of text in multiple languages, that evolved through long processes of instable oral and written transmission. Determining the origin of a given passage in such a compilation is not always straightforward and is often a matter disputed by scholars, yet it is essential for scholars’ understanding of the passage and its relationship to other texts in the rabbinic corpus. To help solve this problem, we propose a system for classification of rabbinic literature based on its style, leveraging recently released pretrained Transformer models for Hebrew. Additionally, we demonstrate how our method can be applied to uncover lost material from the Midrash Tanhuma.

pdf bib
Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives
Balázs Indig | Zsófia Sárközi-Lindner | Mihály Nagy

Many digital humanists (philologists, historians, sociologists, librarians, the audience for web archives) design their research around metadata (publication date ranges, sources, authors, etc.). However, current major web archives are limited to technical metadata while lacking high quality, descriptive metadata allowing for faceted queries. As researchers often lack the technical skill necessary to enrich existing web archives with descriptive metadata, they increasingly turn to creating personal web archives that contain such metadata, tailored to their research requirements. Software that enable creating such archives without advanced technical skills have gained popularity, however, tools for examination and querying are currently the missing link. We showcase a solution designed to fill this gap.

pdf bib
MALM: Mixing Augmented Language Modeling for Zero-Shot Machine Translation
Kshitij Gupta

Large pre-trained language models have brought remarkable progress in NLP. Pre-training and Fine-tuning have given state-of-art performance across tasks in text processing. Data Augmentation techniques have also helped build state-of-art models on low or zero resource tasks. Many works in the past have attempted at learning a single massively multilingual machine translation model for zero-shot translation. Although those translation models are producing correct translations, the main challenge is those models are producing the wrong languages for zero-shot translation. This work and its results indicate that prompt conditioned large models do not suffer from off-target language errors i.e. errors arising due to translation to wrong languages. We empirically demonstrate the effectiveness of self-supervised pre-training and data augmentation for zero-shot multi-lingual machine translation.

pdf bib
ParsSimpleQA: The Persian Simple Question Answering Dataset and System over Knowledge Graph
Hamed Babaei Giglou | Niloufar Beyranvand | Reza Moradi | Amir Mohammad Salehoof | Saeed Bibak

The simple question answering over the knowledge graph concerns answering single-relation questions by querying the facts in the knowledge graph. This task has drawn significant attention in recent years. However, there is a demand for a simple question dataset in the Persian language to study open-domain simple question answering. In this paper, we present the first Persian single-relation question answering dataset and a model that uses a knowledge graph as a source of knowledge to answer questions. We create the ParsSimpleQA dataset semi-automatically in two steps. First, we build single-relation question templates. Next, we automatically create simple questions and answers using templates, entities, and relations from Farsbase. To present the reliability of the presented dataset, we proposed a simple question-answering system that receives questions and uses deep learning and information retrieval techniques for answering questions. The experimental results presented in this paper show that the ParsSimpleQA dataset is very promising for the Persian simple question-answering task.

pdf bib
Enhancing Digital History – Event discovery via Topic Modeling and Change Detection
King Ip Lin | Sabrina Peng

Digital history is the application of computer science techniques to historical data in order to uncover insights into events occurring during specific time periods from the past. This relatively new interdisciplinary field can help identify and record latent information about political, cultural, and economic trends that are not otherwise apparent from traditional historical analysis. This paper presents a method that uses topic modeling and breakpoint detection to observe how extracted topics come in and out of prominence over various time periods. We apply our techniques on British parliamentary speech data from the 19th century. Findings show that some of the events produced are cohesive in topic content (religion, transportation, economics, etc.) and time period (events are focused in the same year or month). Topic content identified should be further analyzed for specific events and undergo external validation to determine the quality and value of the findings to historians specializing in 19th century Britain.

pdf bib
A Parallel Corpus and Dictionary for Amis-Mandarin Translation
Francis Zheng | Edison Marrese-Taylor | Yutaka Matsuo

Amis is an endangered language indigenous to Taiwan with limited data available for computational processing. We thus present an Amis-Mandarin dataset containing a parallel corpus of 5,751 Amis and Mandarin sentences and a dictionary of 7,800 Amis words and phrases with their definitions in Mandarin. Using our dataset, we also established a baseline for machine translation between Amis and Mandarin in both directions. Our dataset can be found at

pdf bib
Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers
Nilo Pedrazzini | Barbara McGillivray

The industrialization process associated with the so-called Industrial Revolution in 19th-century Great Britain was a time of profound changes, including in the English lexicon. An important yet understudied phenomenon is the semantic shift in the lexicon of mechanisation. In this paper we present the first large-scale analysis of terms related to mechanization over the course of the 19th-century in English. We draw on a corpus of historical British newspapers comprising 4.6 billion tokens and train historical word embedding models. We test existing semantic change detection techniques and analyse the results in light of previous historical linguistic scholarship.

pdf bib
Optimizing the weighted sequence alignment algorithm for large-scale text similarity computation
Maciej Janicki

We present an optimized implementation of the weighted sequence alignment algorithm (a.k.a. weighted edit distance) in a scenario where the items to align are numeric vectors and the substitution weights are determined by their cosine similarity. The optimization relies on using vector and matrix operations provided by numeric computation libraries (including GPU acceleration) instead of loops. The resulting algorithm provides an efficient way of aligning large sets of texts represented as sequences of continuous-space numeric vectors (embeddings). The optimization made it possible to compute alignment-based similarity for all pairs of texts in a large corpus of Finnic oral folk poetry for the purpose of studying intertextuality in the oral tradition.

pdf bib
Domain-specific Evaluation of Word Embeddings for Philosophical Text using Direct Intrinsic Evaluation
Goya van Boven | Jelke Bloem

We perform a direct intrinsic evaluation of word embeddings trained on the works of a single philosopher. Six models are compared to human judgements elicited using two tasks: a synonym detection task and a coherence task. We apply a method that elicits judgements based on explicit knowledge from experts, as the linguistic intuition of non-expert participants might differ from that of the philosopher. We find that an in-domain SVD model has the best 1-nearest neighbours for target terms, while transfer learning-based Nonce2Vec performs better for low frequency target terms.

pdf bib
Towards Bootstrapping a Chatbot on Industrial Heritage through Term and Relation Extraction
Mihael Arcan | Rory O’Halloran | Cécile Robin | Paul Buitelaar

We describe initial work in developing a methodology for the automatic generation of a conversational agent or ‘chatbot’ through term and relation extraction from a relevant corpus of language data. We develop our approach in the domain of industrial heritage in the 18th and 19th centuries, and more specifically on the industrial history of canals and mills in Ireland. We collected a corpus of relevant newspaper reports and Wikipedia articles, which we deemed representative of a layman’s understanding of this topic. We used the Saffron toolkit to extract relevant terms and relations between the terms from the corpus and leveraged the extracted knowledge to query the British Library Digital Collection and the Project Gutenberg library. We leveraged the extracted terms and relations in identifying possible answers for a constructed set of questions based on the extracted terms, by matching them with sentences in the British Library Digital Collection and the Project Gutenberg library. In a final step, we then took this data set of question-answer pairs to train a chatbot. We evaluate our approach by manually assessing the appropriateness of the generated answers for a random sample, each of which is judged by four annotators.

pdf bib
Non-Parametric Word Sense Disambiguation for Historical Languages
Enrique Manjavacas Arevalo | Lauren Fonteyn

Recent approaches to Word Sense Disambiguation (WSD) have profited from the enhanced contextualized word representations coming from contemporary Large Language Models (LLMs). This advancement is accompanied by a renewed interest in WSD applications in Humanities research, where the lack of suitable, specific WSD-annotated resources is a hurdle in developing ad-hoc WSD systems. Because they can exploit sentential context, LLMs are particularly suited for disambiguation tasks. Still, the application of LLMs is often limited to linear classifiers trained on top of the LLM architecture. In this paper, we follow recent developments in non-parametric learning and show how LLMs can be efficiently fine-tuned to achieve strong few-shot performance on WSD for historical languages (English and Dutch, date range: 1450-1950). We test our hypothesis using (i) a large, general evaluation set taken from large lexical databases, and (ii) a small real-world scenario involving an ad-hoc WSD task. Moreover, this paper marks the release of GysBERT, a LLM for historical Dutch.

pdf bib
Introducing a Large Corpus of Tokenized Classical Chinese Poems of Tang and Song Dynasties
Chao-Lin Liu | Ti-Yong Zheng | Kuan-Chun Chen | Meng-Han Chung

Classical Chinese poems of Tang and Song dynasties are an important part for the studies of Chinese literature. To thoroughly understand the poems, properly segmenting the verses is an important step for human readers and software agents. Yet, due to the availability of data and the costs of annotation, there are still no known large and useful sources that offer classical Chinese poems with annotated word boundaries. In this project, annotators with Chinese literature background labeled 32399 poems. We analyzed the annotated patterns and conducted inter-rater agreement studies about the annotations. The distributions of the annotated patterns for poem lines are very close to some well-known professional heuristics, i.e., that the 2-2-1, 2-1-2, 2-2-1-2, and 2-2-2-1 patterns are very frequent. The annotators agreed well at the line level, but agreed on the segmentations of a whole poem only 43% of the time. We applied a traditional machine-learning approach to segment the poems, and achieved promising results at the line level as well. Using the annotated data as the ground truth, these methods could segment only about 18% of the poems completely right under favorable conditions. Switching to deep-learning methods helped us achieved better than 30%.

pdf bib
Creative Text-to-Image Generation: Suggestions for a Benchmark
Irene Russo

Language models for text-to-image generation can output good quality images when referential aspects of pictures are evaluated. The generation of creative images is not under scrutiny at the moment, but it poses interesting challenges: should we expect more creative images using more creative prompts? What is the relationship between prompts and images in the global process of human evaluation? In this paper, we want to highlight several criteria that should be taken into account for building a creative text-to-image generation benchmark, collecting insights from multiple disciplines (e.g., linguistics, cognitive psychology, philosophy, psychology of art).

pdf bib
The predictability of literary translation
Andrew Piper | Matt Erlin

Research has shown that the practice of translation exhibits predictable linguistic cues that make translated texts detectable from original-language texts (a phenomenon known as “translationese”). In this paper, we test the extent to which literary translations are subject to the same effects and whether they also exhibit meaningful differences at the level of content. Research into the function of translations within national literary markets using smaller case studies has suggested that translations play a cultural role that is distinct from that of original-language literature, i.e. their differences reside not only at the level of translationese but at the level of content. Using a dataset consisting of original-language fiction in English and translations into English from 120 languages (N=21,302), we find that one of the principal functions of literary translation is to convey predictable geographic identities to local readers that nevertheless extend well beyond the foreignness of persons and places.

pdf bib
Emotion Conditioned Creative Dialog Generation
Khalid Alnajjar | Mika Hämäläinen

We present a DialGPT based model for generating creative dialog responses that are conditioned based on one of the following emotions: anger, disgust, fear, happiness, pain, sadness and surprise. Our model is capable of producing a contextually apt response given an input sentence and a desired emotion label. Our model is capable of expressing the desired emotion with an accuracy of 0.6. The best performing emotions are neutral, fear and disgust. When measuring the strength of the expressed emotion, we find that anger, fear and disgust are expressed in the most strong fashion by the model.

pdf bib
Integration of Named Entity Recognition and Sentence Segmentation on Ancient Chinese based on Siku-BERT
Sijia Ge

Sentence segmentation and named entity recognition are two significant tasks in ancient Chinese processing since punctuation and named entity information are important for further research on ancient classics. These two are sequence labeling tasks in essence so we can tag the labels of these two tasks for each token simultaneously. Our work is to evaluate whether such a unified way would be better than tagging the label of each task separately with a BERT-based model. The paper adopts a BERT-based model that was pre-trained on ancient Chinese text to conduct experiments on Zuozhuan text. The results show there is no difference between these two tagging approaches without concerning the type of entities and punctuation. The ablation experiments show that the punctuation token in the text is useful for NER tasks, and finer tagging sets such as differentiating the tokens that locate at the end of an entity and those are in the middle of an entity could offer a useful feature for NER while impact negatively sentences segmentation with unified tagging.

pdf bib
(Re-)Digitizing 吳守禮 Ngôo Siú-lé’s Mandarin – Taiwanese Dictionary
Pierre Magistry | Afala Phaxay

This paper presents the efforts conducted to obtain a usable and open digital version in XML-TEI of one of the major lexicographic work for bilingual Taiwanese dictionaries, namely the 《國臺對照活用辭典》(Practical Mandarin-Taiwanese Dictionary) The original dictionary was published in 2000, after decades of work by Prof. 吳守禮 (Ngôo Siu-le/Wu Shouli)


pdf (full)
bib (full)
Proceedings of the first Workshop on Information Extraction from Scientific Publications

pdf bib
Proceedings of the first Workshop on Information Extraction from Scientific Publications
Tirthankar Ghosal | Sergi Blanco-Cuaresma | Alberto Accomazzi | Robert M. Patton | Felix Grezes | Thomas Allen

pdf bib
Overview of the First Shared Task on Detecting Entities in the Astrophysics Literature (DEAL)
Felix Grezes | Sergi Blanco-Cuaresma | Thomas Allen | Tirthankar Ghosal

In this article, we describe the overview of our shared task: Detecting Entities in the Astrophysics Literature (DEAL). The DEAL shared task was part of the Workshop on Information Extraction from Scientific Publications (WIESP) in AACL-IJCNLP 2022. Information extraction from scientific publications is critical in several downstream tasks such as identification of critical entities, article summarization, citation classification, etc. The motivation of this shared task was to develop a community-wide effort for entity extraction from astrophysics literature. Automated entity extraction would help to build knowledge bases, high-quality meta-data for indexing and search, and several other use-cases of interests. Thirty-three teams registered for DEAL, twelve of them participated in the system runs, and finally four teams submitted their system descriptions. We analyze their system and performance and finally discuss the findings of DEAL.

pdf bib
Classification of URL Citations in Scholarly Papers for Promoting Utilization of Research Artifacts
Masaya Tsunokake | Shigeki Matsubara

Utilizing citations for research artifacts (e.g., dataset, software) in scholarly papers contributes to efficient expansion of research artifact repositories and various applications e.g., the search, recommendation, and evaluation of such artifacts. This study focuses on citations using URLs (URL citations) and aims to identify and analyze research artifact citations automatically. This paper addresses the classification task for each URL citation to identify (1) the role that the referenced resources play in research activities, (2) the type of referenced resources, and (3) the reason why the author cited the resources. This paper proposes the classification method using section titles and footnote texts as new input features. We extracted URL citations from international conference papers as experimental data. We performed 5-fold cross-validation using the data and computed the classification performance of our method. The results demonstrate that our method is effective in all tasks. An additional experiment demonstrates that using cited URLs as input features is also effective.

pdf bib
TELIN: Table Entity LINker for Extracting Leaderboards from Machine Learning Publications
Sean Yang | Chris Tensmeyer | Curtis Wigington

Tracking state-of-the-art (SOTA) results in machine learning studies is challenging due to high publication volume. Existing methods for creating leaderboards in scientific documents require significant human supervision or rely on scarcely available LaTeX source files. We propose Table Entity LINker (TELIN), a framework which extracts (task, model, dataset, metric) quadruples from collections of scientific publications in PDF format. TELIN identifies scientific named entities, constructs a knowledge base, and leverages human feedback to iteratively refine automatic extractions. TELIN identifies and prioritizes uncertain and impactful entities for human review to create a cascade effect for leaderboard completion. We show that TELIN is competitive with the SOTA but requires much less human annotation.

pdf bib
PICO Corpus: A Publicly Available Corpus to Support Automatic Data Extraction from Biomedical Literature
Faith Mutinda | Kongmeng Liew | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki

We present a publicly available corpus with detailed annotations describing the core elements of clinical trials: Participants, Intervention, Control, and Outcomes. The corpus consists of 1011 abstracts of breast cancer randomized controlled trials extracted from the PubMed database. The corpus improves previous corpora by providing detailed annotations for outcomes to identify numeric texts that report the number of participants that experience specific outcomes. The corpus will be helpful for the development of systems for automatic extraction of data from randomized controlled trial literature to support evidence-based medicine. Additionally, we demonstrate the feasibility of the corpus by using two strong baselines for named entity recognition task. Most of the entities achieve F1 scores greater than 0.80 demonstrating the quality of the dataset.

pdf bib
Linking a Hypothesis Network From the Domain of Invasion Biology to a Corpus of Scientific Abstracts: The INAS Dataset
Marc Brinner | Tina Heger | Sina Zarriess

We investigate the problem of identifying the major hypothesis that is addressed in a scientific paper. To this end, we present a dataset from the domain of invasion biology that organizes a set of 954 papers into a network of fine-grained domain-specific categories of hypotheses. We carry out experiments on classifying abstracts according to these categories and present a pilot study on annotating hypothesis statements within the text. We find that hypothesis statements in our dataset are complex, varied and more or less explicit, and, importantly, spread over the whole abstract. Experiments with BERT-based classifiers show that these models are able to classify complex hypothesis statements to some extent, without being trained on sentence-level text span annotations.

pdf bib
Leveraging knowledge graphs to update scientific word embeddings using latent semantic imputation
Jason Hoelscher-Obermaier | Edward Stevinson | Valentin Stauber | Ivaylo Zhelev | Viktor Botev | Ronin Wu | Jeremy Minton

The most interesting words in scientific texts will often be novel or rare. This presents a challenge for scientific word embedding models to determine quality embedding vectors for useful terms that are infrequent or newly emerging. We demonstrate how Latent Semantic Imputation (LSI) can address this problem by imputing embeddings for domain-specific words from up-to-date knowledge graphs while otherwise preserving the original word embedding model. We use the MeSH knowledge graph to impute embedding vectors for biomedical terminology without retraining and evaluate the resulting embedding model on a domain-specific word-pair similarity task. We show that LSI can produce reliable embedding vectors for rare and out-of-vocabulary terms in the biomedical domain.

pdf bib
Full-Text Argumentation Mining on Scientific Publications
Arne Binder | Leonhard Hennig | Bhuvanesh Verma

Scholarly Argumentation Mining (SAM) has recently gained attention due to its potential to help scholars with the rapid growth of published scientific literature. It comprises two subtasks: argumentative discourse unit recognition (ADUR) and argumentative relation extraction (ARE), both of which are challenging since they require e.g. the integration of domain knowledge, the detection of implicit statements, and the disambiguation of argument structure. While previous work focused on dataset construction and baseline methods for specific document sections, such as abstract or results, full-text scholarly argumentation mining has seen little progress. In this work, we introduce a sequential pipeline model combining ADUR and ARE for full-text SAM, and provide a first analysis of the performance of pretrained language models (PLMs) on both subtasks. We establish a new SotA for ADUR on the Sci-Arg corpus, outperforming the previous best reported result by a large margin (+7% F1). We also present the first results for ARE, and thus for the full AM pipeline, on this benchmark dataset. Our detailed error analysis reveals that non-contiguous ADUs as well as the interpretation of discourse connectors pose major challenges and that data annotation needs to be more consistent.

pdf bib
On the portability of extractive Question-Answering systems on scientific papers to real-life application scenarios
Chyrine Tahri | Xavier Tannier | Patrick Haouat

There are still hurdles standing in the way of faster and more efficient knowledge consumption in industrial environments seeking to foster innovation. In this work, we address the portability of extractive Question Answering systems from academic spheres to industries basing their decisions on thorough scientific papers analysis. Keeping in mind that such industrial contexts often lack high-quality data to develop their own QA systems, we illustrate the misalignment between application requirements and cost sensitivity of such industries and some widespread practices tackling the domain-adaptation problem in the academic world. Through a series of extractive QA experiments on QASPER, we adopt the pipeline-based retriever-ranker-reader architecture for answering a question on a scientific paper and show the impact of modeling choices in different stages on the quality of answer prediction. We thus provide a characterization of practical aspects of real-life application scenarios and notice that appropriate trade-offs can be efficient and add value in those industrial environments.

pdf bib
Detecting Entities in the Astrophysics Literature: A Comparison of Word-based and Span-based Entity Recognition Methods
Xiang Dai | Sarvnaz Karimi

Information Extraction from scientific literature can be challenging due to the highly specialised nature of such text. We describe our entity recognition methods developed as part of the DEAL (Detecting Entities in the Astrophysics Literature) shared task. The aim of the task is to build a system that can identify Named Entities in a dataset composed by scholarly articles from astrophysics literature. We planned our participation such that it enables us to conduct an empirical comparison between word-based tagging and span-based classification methods. When evaluated on two hidden test sets provided by the organizer, our best-performing submission achieved F1 scores of 0.8307 (validation phase) and 0.7990 (testing phase).

pdf bib
Domain Specific Augmentations as Low Cost Teachers for Large Students
Po-Wei Huang

Current neural network solutions in scientific document processing employ models pretrained on domain-specific corpi, which are usually limited in model size, as pretraining can be costly and limited by training resources. We introduce a framework that uses data augmentation from such domain-specific pretrained models to transfer domain specific knowledge to larger general pretrained models and improve performance on downstream tasks. Our method improves the performance of Named Entity Recognition in the astrophysical domain by more than 20% compared to domain-specific pretrained models finetuned to the target dataset.

pdf bib
Moving beyond word lists: towards abstractive topic labels for human-like topics of scientific documents
Domenic Rosati

Topic models represent groups of documents as a list of words (the topic labels). This work asks whether an alternative approach to topic labeling can be developed that is closer to a natural language description of a topic than a word list. To this end, we present an approach to generating human-like topic labels using abstractive multi-document summarization (MDS). We investigate our approach with an exploratory case study. We model topics in citation sentences in order to understand what further research needs to be done to fully operationalize MDS for topic labeling. Our case study shows that in addition to more human-like topics there are additional advantages to evaluation by using clustering and summarization measures instead of topic model measures. However, we find that there are several developments needed before we can design a well-powered study to evaluate MDS for topic modeling fully. Namely, improving cluster cohesion, improving the factuality and faithfulness of MDS, and increasing the number of documents that might be supported by MDS. We present a number of ideas on how these can be tackled and conclude with some thoughts on how topic modeling can also be used to improve MDS in general.

pdf bib
Astro-mT5: Entity Extraction from Astrophysics Literature using mT5 Language Model
Madhusudan Ghosh | Payel Santra | Sk Asif Iqbal | Partha Basuchowdhuri

Scientific research requires reading and extracting relevant information from existing scientific literature in an effective way. To gain insights over a collection of such scientific documents, extraction of entities and recognizing their types is considered to be one of the important tasks. Numerous studies have been conducted in this area of research. In our study, we introduce a framework for entity recognition and identification of NASA astrophysics dataset, which was published as a part of the DEAL SharedTask. We use a pre-trained multilingual model, based on a natural language processing framework for the given sequence labeling tasks. Experiments show that our model, Astro-mT5, out-performs the existing baseline in astrophysics related information extraction.

pdf bib
NLPSharedTasks: A Corpus of Shared Task Overview Papers in Natural Language Processing Domains
Anna Martin | Ted Pedersen | Jennifer D’Souza

As the rate of scientific output continues to grow, it is increasingly important to develop systems to improve interfaces between researchers and scholarly papers. Training models to extract scientific information from the full texts of scholarly documents is important for improving how we structure and access scientific information. However, there are few annotated corpora that provide full paper texts. This paper presents the NLPSharedTasks corpus, a new resource of 254 full text Shared Task Overview papers in NLP domains with annotated task descriptions. We calculated strict and relaxed inter-annotator agreement scores, achieving Cohen’s kappa coefficients of 0.44 and 0.95, respectively. Lastly, we performed a sentence classification task over the dataset, in order to generate a neural baseline for future research and to provide an example of how to preprocess unbalanced datasets of full scientific texts. We achieved an F1 score of 0.75 using SciBERT, fine-tuned and tested on a rebalanced version of the dataset.

pdf bib
Parsing Electronic Theses and Dissertations Using Object Detection
Aman Ahuja | Alan Devera | Edward Alan Fox

Electronic theses and dissertations (ETDs) contain valuable knowledge that can be useful for a wide range of purposes. To effectively utilize the knowledge contained in ETDs for downstream tasks such as search and retrieval, question-answering, and summarization, the data first needs to be parsed and stored in a format such as XML. However, since most of the ETDs available on the web are PDF documents, parsing them to make their data useful for downstream tasks is a challenge. In this work, we propose a dataset and a framework to help with parsing long scholarly documents such as ETDs. We take the Object Detection approach for document parsing. We first introduce a set of objects that are important elements of an ETD, along with a new dataset ETD-OD that consists of over 25K page images originating from 200 ETDs with bounding boxes around each of the objects. We also propose a framework that utilizes this dataset for converting ETDs to XML, which can further be used for ETD-related downstream tasks. Our code and pre-trained models are available at:

pdf bib
TDAC, The First Corpus in Time-Domain Astrophysics: Analysis and First Experiments on Named Entity Recognition
Atilla Kaan Alkan | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum

The increased interest in time-domain astronomy over the last decades has resulted in a substantial increase in observation reports publication leading to a saturation of how astrophysicists read, analyze and classify information. Due to the short life span of the detected astronomical events, the information related to the characterization of new phenomena has to be communicated and analyzed very rapidly to allow other observatories to react and conduct their follow-up observations. This paper introduces TDAC: the first Corpus in Time-Domain Astrophysics, based on observation reports. We also present the NLP experiments we made for named entity recognition based on annotations we made and annotations from the WIESP NLP Challenge.

pdf bib
Reproducibility Signals in Science: A preliminary analysis
Akhil Pandey Akella | Hamed Alhoori | David Koop

Reproducibility is an important feature of science; experiments are retested, and analyses are repeated. Trust in the findings increases when consistent results are achieved. Despite the importance of reproducibility, significant work is often involved in these efforts, and some published findings may not be reproducible due to oversights or errors. In this paper, we examine a myriad of features in scholarly articles published in computer science conferences and journals and test how they correlate with reproducibility. We collected data from three different sources that labeled publications as either reproducible or irreproducible and employed statistical significance tests to identify features of those publications that hold clues about reproducibility. We found the readability of the scholarly article and accessibility of the software artifacts through hyperlinks to be strong signals noticeable amongst reproducible scholarly articles.

pdf bib
A Majority Voting Strategy of a SciBERT-based Ensemble Models for Detecting Entities in the Astrophysics Literature (Shared Task)
Atilla Kaan Alkan | Cyril Grouin | Fabian Schussler | Pierre Zweigenbaum

Detecting Entities in the Astrophysics Literature (DEAL) is a proposed shared task in the scope of the first Workshop on Information Extraction from Scientific Publications (WIESP) at AACL-IJCNLP 2022. It aims to propose systems identifying astrophysical named entities. This article presents our system based on a majority voting strategy of an ensemble composed of multiple SciBERT models. The system we propose is ranked second and outperforms the baseline provided by the organisers by achieving an F1 score of 0.7993 and a Matthews Correlation Coefficient (MCC) score of 0.8978 in the testing phase.