Vera Demberg

2026

Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh | M. Hamza Mughal | Christian Theobalt | Vera Demberg
Findings of the Association for Computational Linguistics: EACL 2026

In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.

pdf bib

Findings of the Association for Computational Linguistics: EACL 2026
Vera Demberg | Kentaro Inui | Lluís Marquez
Findings of the Association for Computational Linguistics: EACL 2026

pdf bib abs

System-Mediated Attention Imbalances Make Vision-Language Models Say Yes
Tsan Tsai Chan | Varsha Suresh | Anisha Saha | Michael Hahn | Vera Demberg
Findings of the Association for Computational Linguistics: ACL 2026

Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond ‘yes’. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.

pdf bib

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib abs

Tug-of-war between idioms’ figurative and literal interpretations in LLMs
Soyoung Oh | Xinting Huang | Mathis Pink | Michael Hahn | Vera Demberg
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Idioms present a unique challenge for language models due to their non-compositional figurative interpretations, which often strongly diverge from the idiom’s literal interpretation. In this paper, we employ causal tracing to systematically analyze how pretrained causal transformers deal with this ambiguity. We localize three mechanisms: (i) Early sublayers and specific attention heads retrieve figurative interpretation, while suppressing literal interpretation. (ii) When disambiguating context precedes the idiom, the model leverages it from the earliest layer and later layers refine the interpretation if the context conflicts with the retrieved interpretation. (iii) Then, selective, competing pathways carry both interpretations: an intermediate pathway that prioritizes the figurative interpretation and a parallel direct route that favors literal interpretation, ensuring that both readings remain available. Our findings provide mechanistic evidence for idiom comprehension in autoregressive transformers.

pdf bib

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Vera Demberg | Kentaro Inui | Lluís Marquez
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib abs

Discharge instructions are patient-facing, safety-critical documents that guide medication use, follow-up care, and recovery after hospitalization. Because they must synthesize information across the clinical record and often include post-discharge guidance not stated verbatim in the EHR, they are a difficult target for clinical text generation. In this work, we study discharge instructions in MIMIC-IV through a grounding-first lens. Using two LLMs, we decompose each discharge instruction into medically relevant statements and verify them against the Electronic Health Record (EHR). We find that discharge instructions for Surgical admissions are much longer, averaging roughly 24–25 statements per admission versus 11–12 in Non-Surgical cases, while supported content remains similar in absolute count. The additional Surgical content is dominated by statements that are not directly stated in the record or require clinically plausible extrapolation. Through this analysis, we advocate for better grounding and completeness evaluations at a fine-grained level, establishing a foundational step toward safer and more reliable discharge-instruction generation.

2025

pdf bib abs

Explanatory Summarization with Discourse-Driven Planning
Dongqi Liu | Xi Yu | Vera Demberg | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 13

Lay summaries for scientific documents typically include explanations to help readers grasp sophisticated concepts or arguments. However, current automatic summarization methods do not explicitly model explanations, which makes it difficult to align the proportion of explanatory content with human-written summaries. In this paper, we present a plan-based approach that leverages discourse frameworks to organize summary generation and guide explanatory sentences by prompting responses to the plan. Specifically, we propose two discourse-driven planning strategies, where the plan is conditioned as part of the input or part of the output prefix, respectively. Empirical experiments on three lay summarization datasets show that our approach outperforms existing state-of-the-art methods in terms of summary quality, and it enhances model robustness, controllability, and mitigates hallucination. The project information is available at https://dongqi.me/projects/ExpSum.

pdf bib abs

Synthetic Data Augmentation for Cross-domain Implicit Discourse Relation Recognition
Frances Yung | Varsha Suresh | Zaynab Reza | Mansoor Ahmad | Vera Demberg
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Implicit discourse relation recognition (IDRR) – the task of identifying the implicit coherence relation between two text spans – requires deep semantic understanding. Recent studies have shown that zero-/few-shot approaches significantly lag behind supervised models. However, LLMs may be useful for synthetic data augmentation, where LLMs generate a second argument following a specified coherence relation. We applied this approach in a cross-domain setting, generating discourse continuations using unlabelled target-domain data to adapt a base model which was trained on source-domain labelled data. Evaluations conducted on a large-scale test set revealed that different variations of the approach did not result in any significant improvements. We conclude that LLMs often fail to generate useful samples for IDRR, and emphasize the importance of considering both statistical significance and comparability when evaluating IDRR models.

pdf bib

Team SaarLST at the GEM’24 D2T Task: Symbolic retrieval substantially reduces hallucination in data-to-text generation
Mayank Jobanputra | Vera Demberg
Proceedings of the 18th International Natural Language Generation Conference: Generation Challenges

pdf bib abs

ProofTeller: Exposing recency bias in LLM reasoning and its side effects on communication
Mayank Jobanputra | Alisa Kovtunova | Brisca Balthes | Fedor Grigoryevich Pogulskiy | Yifan Wang | Stefan Borgwardt | Vera Demberg
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Large language models (LLMs) are increasingly applied in domains that demand reliable and interpretable reasoning. While formal methods can generate provably correct proofs, these proofs are often inaccessible to non-expert users. This raises a natural question: can LLMs, when given a verified proof, faithfully interpret its reasoning and communicate it clearly? We introduce ProofTeller, a benchmark that evaluates this ability across three tasks: (1) identifying key proof steps, (2) summarizing the reasoning, and (3) explaining the result in concise natural language. The benchmark covers three domains: _Biology_, _Drones_, and _Recipes_, representing scientific, safety-critical, and everyday reasoning scenarios. We find a consistent near-conclusion bias: LLMs tend to focus on steps closest to the final proof conclusion rather than on the most informative ones. A targeted human study confirms that explanations based on such steps are rated less appropriate for end users. These findings indicate that even when reasoning is provided, current LLMs face challenges in communicating key information in a useful manner, highlighting the need for LLMs that can communicate important details reliably.

pdf bib abs

On Crowdsourcing Task Design for Discourse Relation Annotation
Frances Yung | Vera Demberg
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus - initially annotated with the free-choice method - using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

pdf bib abs

Implicit Discourse Relation Classification For Nigerian Pidgin
Muhammed Saeed | Peter Bourgonje | Vera Demberg
Proceedings of the 31st International Conference on Computational Linguistics

Nigerian Pidgin (NP) is an English-based creole language spoken by nearly 100 million people across Nigeria, and is still low-resource in NLP. In particular, there are currently no available discourse parsing tools, which, if available, would have the potential to improve various downstream tasks. Our research focuses on implicit discourse relation classification (IDRC) for NP, a task which, even in English, is not easily solved by prompting LLMs, but requires supervised training. % With this in mind, we have developed a framework for the task, which could also be used by researchers for other English-lexified languages. We systematically compare different approaches to the low resource IDRC task: in one approach, we use English IDRC tools directly on the NP text as well as on their English translations (followed by a back-projection of labels). In another approach, we create a synthetic discourse corpus for NP, in which we automatically translate the English discourse-annotated corpus PDTB to NP, project PDTB labels, and then train an NP IDR classifier. The latter approach of training a “native” NP classifier outperforms our baseline by 13.27% and 33.98% in f₁ score for 4-way and 11-way classification, respectively.

pdf bib abs

LLMs syntactically adapt their language use to their conversational partner
Florian Kandra | Vera Demberg | Alexander Koller
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation.We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

pdf bib abs

Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues
Varsha Suresh | M. Hamza Mughal | Christian Theobalt | Vera Demberg
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.

pdf bib abs

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of our dataset. This study aims to pave the way for future research on scientific video-to-text summarization.

pdf bib abs

Diffusion models have recently achieved remarkable advancements in terms of image quality and fidelity to textual prompts. Concurrently, the safety of such generative models has become an area of growing concern. This work introduces a novel type of jailbreak, which triggers T2I models to generate the image with visual text, where the image and the text, although considered to be safe in isolation, combine to form unsafe content. To systematically explore this phenomenon, we propose a dataset to evaluate the current diffusion-based text-to-image (T2I) models under such jailbreak. We benchmark nine representative T2I models, including two closed-source commercial models. Experimental results reveal a concerning tendency to produce unsafe content: all tested models suffer from such type of jailbreak, with rates of unsafe generation ranging from around 10% to 70% where DALL·E 3 demonstrates almost the highest unsafety. In real-world scenarios, various filters such as keyword blocklists, customized prompt filters, and NSFW image filters, are commonly employed to mitigate these risks. We evaluate the effectiveness of such filters against our jailbreak and found that, while these filters may be effective for single modality detection, they fail to work against our jailbreak. We also investigate the underlying reason for such jailbreaks, from the perspective of text rendering capability and training data. Our work provides a foundation for further development towards more secure and reliable T2I models.

2024

pdf bib abs

Do large language models and humans have similar behaviours in causal inference with script knowledge?
Xudong Hong | Margarita Ryzhova | Daniel Biondi | Vera Demberg
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Recently, large pre-trained language models (LLMs) have demonstrated superior language understanding abilities, including zero-shot causal reasoning. However, it is unclear to what extent their capabilities are similar to human ones. We here study the processing of an event B in a script-based story, which causally depends on a previous event A. In our manipulation, event A is stated, negated, or omitted in an earlier section of the text. We first conducted a self-paced reading experiment, which showed that humans exhibit significantly longer reading times when causal conflicts exist (¬ A → B) than under logical conditions (A → B). However, reading times remain similar when cause A is not explicitly mentioned, indicating that humans can easily infer event B from their script knowledge. We then tested a variety of LLMs on the same data to check to what extent the models replicate human behavior. Our experiments show that 1) only recent LLMs, like GPT-3 or Vicuna, correlate with human behavior in the ¬ A → B condition. 2) Despite this correlation, all models still fail to predict that nil → B is less surprising than ¬ A → B, indicating that LLMs still have difficulties integrating script knowledge.

pdf bib abs

Generalizing across Languages and Domains for Discourse Relation Classification
Peter Bourgonje | Vera Demberg
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The availability of corpora annotated for discourse relations is limited and discourse relation classification performance varies greatly depending on both language and domain. This is a problem for downstream applications that are intended for a language (i.e., not English) or a domain (i.e., not financial news) with comparatively low coverage for discourse annotations. In this paper, we experiment with a state-of-the-art model for discourse relation classification, originally developed for English, extend it to a multi-lingual setting (testing on Italian, Portuguese and Turkish), and employ a simple, yet effective method to mark out-of-domain training instances. By doing so, we aim to contribute to better generalization and more robust discourse relation classification performance across both language and domain.

pdf bib

Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Tatsuya Kawahara | Vera Demberg | Stefan Ultes | Koji Inoue | Shikib Mehri | David Howcroft | Kazunori Komatani
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib abs

RST-LoRA: A Discourse-Aware Low-Rank Adaptation for Long Document Abstractive Summarization
Dongqi Liu | Vera Demberg
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

For long document summarization, discourse structure is important to discern the key content of the text and the differences in importance level between sentences. Unfortunately, the integration of rhetorical structure theory (RST) into parameter-efficient fine-tuning strategies for long document summarization remains unexplored. Therefore, this paper introduces RST-LoRA and proposes four RST-aware variants to explicitly incorporate RST into the LoRA model. Our empirical evaluation demonstrates that incorporating the type and uncertainty of rhetorical relations can complementarily enhance the performance of LoRA in summarization tasks. Furthermore, the best-performing variant we introduced outperforms the vanilla LoRA and full-parameter fine-tuning models, as confirmed by multiple automatic and human evaluations, and even surpasses previous state-of-the-art methods.

pdf bib abs

DiscoGeM 2.0: A Parallel Corpus of English, German, French and Czech Implicit Discourse Relations
Frances Yung | Merel Scholman | Sarka Zikanova | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present DiscoGeM 2.0, a crowdsourced, parallel corpus of 12,834 implicit discourse relations, with English, German, French and Czech data. We propose and validate a new single-step crowdsourcing annotation method and apply it to collect new annotations in German, French and Czech. The corpus was constructed by having crowdsourced annotators choose a suitable discourse connective for each relation from a set of unambiguous candidates. Every instance was annotated by 10 workers. Our corpus hence represents the first multi-lingual resource that contains distributions of discourse interpretations for implicit relations. The results show that the connective insertion method of discourse annotation can be reliably extended to other languages. The resulting multi-lingual annotations also reveal that implicit relations inferred in one language may differ from those inferred in the translation, meaning the annotations are not always directly transferable. DiscoGem 2.0 promotes the investigation of cross-linguistic differences in discourse marking and could improve automatic discourse parsing applications. It is openly downloadable here: https://github.com/merelscholman/DiscoGeM.

pdf bib abs

SpreadNaLa: A Naturalistic Code Generation Evaluation Dataset of Spreadsheet Formulas
Sebastian Schuster | Ayesha Ansar | Om Agarwal | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Automatic generation of code from natural language descriptions has emerged as one of the main use cases of large language models (LLMs). This has also led to a proliferation of datasets to track progress in the reliability of code generation models, including domains such as programming challenges and common data science tasks. However, existing datasets primarily target the use of code generation models to aid expert programmers in writing code. In this work, we consider a domain of code generation which is more frequently used by users without sophisticated programming skills: translating English descriptions to spreadsheet formulas that can be used to do everyday data processing tasks. We extract naturalistic instructions from StackOverflow posts and manually verify and standardize the corresponding spreadsheet formulas. We use this dataset to evaluate an off-the-shelf code generation model (GPT 3.5 text-davinci-003) as well as recently proposed pragmatic code generation procedures and find that Code Reviewer reranking (Zhang et al., 2022) performs best among the evaluated methods but still frequently generates formulas that differ from human-generated ones.

pdf bib abs

SIGA: A Naturalistic NLI Dataset of English Scalar Implicatures with Gradable Adjectives
Rashid Nizamani | Sebastian Schuster | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Many utterances convey meanings that go beyond the literal meaning of a sentence. One class of such meanings is scalar implicatures, a phenomenon by which a speaker conveys the negation of a more informative utterance by producing a less informative utterance. This paper introduces a Natural Language Inference (NLI) dataset designed to investigate the ability of language models to interpret utterances with scalar implicatures. Our dataset is comprised of text extracted from the C4 English text corpus and annotated with both crowd-sourced and expert annotations. We evaluate NLI models based on DeBERTa to investigate 1) whether NLI models can learn to predict pragmatic inferences involving gradable adjectives and 2) whether models generalize to utterances involving unseen adjectives. We find that fine-tuning NLI models on our dataset significantly improves their performance to derive scalar implicatures, both for in-domain and for out-of domain examples. At the same time, we find that the investigated models still perform considerably worse on examples with scalar implicatures than on other types of NLI examples, highlighting that pragmatic inferences still pose challenges for current models.

pdf bib abs

SciNews: From Scholarly Complexities to Public Narratives – a Dataset for Scientific News Report Generation
Dongqi Liu | Yifan Wang | Jia Loy | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.

pdf bib abs

Retrieval-Augmented Modular Prompt Tuning for Low-Resource Data-to-Text Generation
Ruitao Feng | Xudong Hong | Mayank Jobanputra | Mattes Warning | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Data-to-text (D2T) generation describes the task of verbalizing data, often given as attribute-value pairs. While this task is relevant for many different data domains beyond the traditionally well-explored tasks of weather forecasting, restaurant recommendations, and sports reporting, a major challenge to the applicability of data-to-text generation methods is typically data sparsity. For many applications, there is extremely little training data in terms of attribute-value inputs and target language outputs available for training a model. Given the sparse data setting, recently developed prompting methods seem most suitable for addressing D2T tasks since they do not require substantial amounts of training data, unlike finetuning approaches. However, prompt-based approaches are also challenging, as a) the design and search of prompts are non-trivial; and b) hallucination problems may occur because of the strong inductive bias of these models. In this paper, we propose a retrieval-augmented modular prompt tuning () method, which constructs prompts that fit the input data closely, thereby bridging the domain gap between the large-scale language model and the structured input data. Experiments show that our method generates texts with few hallucinations and achieves state-of-the-art performance on a dataset for drone handover message generation.

pdf bib abs

Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin
Pin-Jie Lin | Merel Scholman | Muhammed Saeed | Vera Demberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets that exist are characterised by noise in the form of orthographic variations. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe various types of orthographic variations commonly found in Nigerian Pidgin texts, and model this orthographic variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variations to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants which are relevant for the test set but did not occur in the training set originally. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora as well as synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.

pdf bib abs

Prompting Implicit Discourse Relation Annotation
Frances Yung | Mansoor Ahmad | Merel Scholman | Vera Demberg
Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII)

Pre-trained large language models, such as ChatGPT, archive outstanding performance in various reasoning tasks without supervised training and were found to have outperformed crowdsourcing workers. Nonetheless, ChatGPT’s performance in the task of implicit discourse relation classification, prompted by a standard multiple-choice question, is still far from satisfactory and considerably inferior to state-of-the-art supervised approaches. This work investigates several proven prompting techniques to improve ChatGPT’s recognition of discourse relations. In particular, we experimented with breaking down the classification task that involves numerous abstract labels into smaller subtasks. Nonetheless, experiment results show that the inference accuracy hardly changes even with sophisticated prompt engineering, suggesting that implicit discourse relation classification is not yet resolvable under zero-shot or few-shot settings.

pdf bib abs

Summary of the Visually Grounded Story Generation Challenge
Xudong Hong | Asad Sayeed | Vera Demberg
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

Recent advancements in vision-and-language models have opened new possibilities for natural language generation, particularly in generating creative stories from visual input. We thus host an open-sourced shared task, Visually Grounded Story Generation (VGSG), to explore whether these models can create coherent, diverse, and visually grounded narratives. This task challenges participants to generate coherent stories based on sequences of images, where characters and events must be grounded in the images provided. The task is structured into two tracks: the Closed track with constraints on fixed visual features and the Open track which allows all kinds of models. We propose the first two-stage model using GPT-4o as the baseline for the Open track that first generates descriptions for the images and then creates a story based on those descriptions. Human and automatic evaluations indicate that: 1) Retrieval augmentation helps generate more human-like stories, and 2) Largescale pre-trained LLM improves story quality by a large margin; 3) Traditional automatic metrics can not capture the overall quality.

pdf bib abs

TeamSaarLST at the GEM’24 Data-to-text Task: Revisiting symbolic retrieval in the LLM-age
Mayank Jobanputra | Vera Demberg
Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges

Data-to-text (D2T) generation is a natural language generation (NLG) task in which a system describes structured data in natural language. Generating natural language verbalization for structured data is challenging as the data may not contain all the required details (here, properties such as gender are missing from the input data and need to be inferred for correct language generation), and because the structured data may conflict with the knowledge contained in the LLM’s parameters learned during pre-training. Both of these factors (incorrect filling in of details, pretraining conflict and input data) can lead to so-called hallucinations. In this paper, we propose a few-shot retrieval augmented generation (RAG) system, using a symbolic retriever – PropertyRetriever. Additionally, we experiment with state-of-the-art large language models (LLMs) to generate data verbalizations. Our system achieves the best results on 4 out of 6 subtasks for METEOR and chrF++ metrics. We present our results along with an error analysis. We release our code for reproducing the results as well as the generated verbalizations from all the experiments for any further explorations here.

pdf bib abs

Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It?
Anupama Chingacham | Miaoran Zhang | Vera Demberg | Dietrich Klakow
Proceedings of the 1st Human-Centered Large Language Modeling Workshop

Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text.However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic.We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline.Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise.

pdf bib abs

A Parameter-Efficient Multi-Objective Approach to Mitigate Stereotypical Bias in Language Models
Yifan Wang | Vera Demberg
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Pre-trained language models have shown impressive abilities of understanding and generating natural languages. However, they typically inherit undesired human-like bias and stereotypes from training data, which raises concerns about putting these models into use in real-world scenarios. Although prior research has proposed to reduce bias using different fairness objectives, they usually fail to capture different representations of bias and, therefore, struggle with fully debiasing models. In this work, we introduce a multi-objective probability alignment approach to overcome current challenges by incorporating multiple debiasing losses to locate and penalize bias in different forms. Compared to existing methods, our proposed method can more effectively and comprehensively reduce stereotypical bias, and maintains the language ability of pre-trained models at the same time. Besides, we adopt prefix-tuning to optimize fairness objectives, and results show that it can achieve better bias removal than full fine-tuning while requiring much fewer computational resources. Our code and data are available at https://github.com/Ewanwong/debias_NLG.

pdf bib abs

RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework
Yifan Wang | Vera Demberg
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite significant advancements in natural language generation, controlling language models to produce texts with desired attributes remains a formidable challenge. In this work, we introduce RSA-Control, a training-free controllable text generation framework grounded in pragmatics. RSA-Control directs the generation process by recursively reasoning between imaginary speakers and listeners, enhancing the likelihood that target attributes are correctly interpreted by listeners amidst distractors. Additionally, we introduce a self-adjustable rationality parameter, which allows for automatic adjustment of control strength based on context. Our experiments, conducted with two task types and two types of language models, demonstrate that RSA-Control achieves strong attribute control while maintaining language fluency and content consistency. Our code is available at https://github.com/Ewanwong/RSA-Control.

pdf bib abs

Large language models fail to derive atypicality inferences in a human-like manner
Charlotte Kurch | Margarita Ryzhova | Vera Demberg
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Recent studies have claimed that large language models (LLMs) are capable of drawing pragmatic inferences (Qiu et al., 2023; Hu et al., 2022; Barattieri di San Pietro et al., 2023). The present paper sets out to test LLM’s abilities on atypicality inferences, a type of pragmatic inference that is triggered through informational redundancy. We test several state-of-the-art LLMs in a zero-shot setting and find that LLMs fail to systematically fail to derive atypicality inferences. Our robustness analysis indicates that when inferences are seemingly derived in a few-shot settings, these results can be attributed to shallow pattern matching and not pragmatic inferencing. We also analyse the performance of the LLMs at the different derivation steps required for drawing atypicality inferences – our results show that models have access to script knowledge and can use it to identify redundancies and accommodate the atypicality inference. The failure instead seems to stem from not reacting to the subtle maxim of quantity violations introduced by the informationally redundant utterances.

pdf bib abs

Temperature-scaling surprisal estimates improve fit to human reading times – but does it do so for the “right reasons”?
Tong Liu | Iza Škrjanec | Vera Demberg
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A wide body of evidence shows that human language processing difficulty is predicted by the information-theoretic measure surprisal, a word’s negative log probability in context. However, it is still unclear how to best estimate these probabilities needed for predicting human processing difficulty – while a long-standing belief held that models with lower perplexity would provide more accurate estimates of word predictability, and therefore lead to better reading time predictions, recent work has shown that for very large models, psycholinguistic predictive power decreases. One reason could be that language models might be more confident of their predictions than humans, because they have had exposure to several magnitudes more data. In this paper, we test what effect temperature-scaling of large language model (LLM) predictions has on surprisal estimates and their predictive power of reading times of English texts. Firstly, we show that calibration of large language models typically improves with model size, i.e. poorer calibration cannot account for poorer fit to reading times. Secondly, we find that temperature-scaling probabilities lead to a systematically better fit to reading times (up to 89% improvement in delta log likelihood), across several reading time corpora. Finally, we show that this improvement in fit is chiefly driven by words that are composed of multiple subword tokens.

2023

pdf bib abs

Design Choices for Crowdsourcing Implicit Discourse Relations: Revealing the Biases Introduced by Task Design
Valentina Pyatkin | Frances Yung | Merel C. J. Scholman | Reut Tsarfaty | Ido Dagan | Vera Demberg
Transactions of the Association for Computational Linguistics, Volume 11

Disagreement in natural language annotation has mostly been studied from a perspective of biases introduced by the annotators and the annotation frameworks. Here, we propose to analyze another source of bias—task design bias, which has a particularly strong impact on crowdsourced linguistic annotations where natural language is used to elicit the interpretation of lay annotators. For this purpose we look at implicit discourse relation annotation, a task that has repeatedly been shown to be difficult due to the relations’ ambiguity. We compare the annotations of 1,200 discourse relations obtained using two distinct annotation tasks and quantify the biases of both methods across four different domains. Both methods are natural language annotation tasks designed for crowdsourcing. We show that the task design can push annotators towards certain relations and that some discourse relation senses can be better elicited with one or the other annotation approach. We also conclude that this type of bias should be taken into account when training and testing models.

pdf bib abs

Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences
Xudong Hong | Asad Sayeed | Khushboo Mehra | Vera Demberg | Bernt Schiele
Transactions of the Association for Computational Linguistics, Volume 11

Current work on image-based story generation suffers from the fact that the existing image sequence collections do not have coherent plots behind them. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories which were collected via crowdsourcing given the image sequences and a set of grounded characters from the corresponding image sequence. Our new image sequence collection and filtering process has allowed us to obtain stories that are more coherent, diverse, and visually grounded compared to previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and diverse than stories generated with the current state-of-the-art model. Our code, image features, annotations and collected stories are available at https://vwprompt.github.io/.

pdf bib abs

Investigating Explicitation of Discourse Connectives in Translation using Automatic Annotations
Frances Yung | Merel Scholman | Ekaterina Lapshinova-Koltunski | Christina Pollkläsener | Vera Demberg
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Discourse relations have different patterns of marking across different languages. As a result, discourse connectives are often added, omitted, or rephrased in translation. Prior work has shown a tendency for explicitation of discourse connectives, but such work was conducted using restricted sample sizes due to difficulty of connective identification and alignment. The current study exploits automatic methods to facilitate a large-scale study of connectives in English and German parallel texts. Our results based on over 300 types and 18000 instances of aligned connectives and an empirical approach to compare the cross-lingual specificity gap provide strong evidence of the Explicitation Hypothesis. We conclude that discourse relations are indeed more explicit in translation than texts written originally in the same language. Automatic annotations allow us to carry out translation studies of discourse relations on a large scale. Our methodology using relative entropy to study the specificity of connectives also provides more fine-grained insights into translation patterns.

pdf bib

Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

pdf bib abs

Tackling Hallucinations in Neural Chart Summarization
Saad Obaid ul Islam | Iza Škrjanec | Ondrej Dusek | Vera Demberg
Proceedings of the 16th International Natural Language Generation Conference

Hallucinations in text generation occur when the system produces text that is not grounded in the input. In this work, we tackle the problem of hallucinations in neural chart summarization. Our analysis shows that the target side of chart summarization training datasets often contains additional information, leading to hallucinations. We propose a natural language inference (NLI) based method to preprocess the training data and show through human evaluation that our method significantly reduces hallucinations. We also found that shortening long-distance dependencies in the input sequence and adding chart-related information like title and legends improves the overall performance.

pdf bib abs

Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on single image input and produce short text descriptions that are not grounded on the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story that is conditioned on the input images. This task is particularly challenging because: 1) the protagonists in the generated stories need to be grounded in the images and 2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.

pdf bib abs

Visual Coherence Loss for Coherent and Visually Grounded Story Generation
Xudong Hong | Vera Demberg | Asad Sayeed | Qiankun Zheng | Bernt Schiele
Findings of the Association for Computational Linguistics: ACL 2023

Local coherence is essential for long-form text generation models. We identify two important aspects of local coherence within the visual storytelling task: (1) the model needs to represent re-occurrences of characters within the image sequence in order to mention them correctly in the story; (2) character representations should enable us to find instances of the same characters and distinguish different characters. In this paper, we propose a loss function inspired by a linguistic theory of coherence for self-supervised learning for image sequence representations. We further propose combining features from an object and a face detector to construct stronger character features. To evaluate input-output relevance that current reference-based metrics don’t measure, we propose a character matching metric to check whether the models generate referring expressions correctly for characters in input image sequences. Experiments on a visual story generation dataset show that our proposed features and loss function are effective for generating more coherent and visually grounded stories.

pdf bib abs

Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on small amount of training samples – which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (~0.9%) with only 10% data.

pdf bib abs

Exploiting Knowledge about Discourse Relations for Implicit Discourse Relation Classification
Nobel Varghese | Frances Yung | Kaveri Anuranjana | Vera Demberg
Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)

In discourse relation recognition, the classification labels are typically represented as one-hot vectors. However, the categories are in fact not all independent of one another on the contrary, there are several frameworks that describe the labels’ similarities (by e.g. sorting them into a hierarchy or describing them interms of features (Sanders et al., 2021)). Recently, several methods for representing the similarities between labels have been proposed (Zhang et al., 2018; Wang et al., 2018; Xiong et al., 2021). We here explore and extend the Label Confusion Model (Guo et al., 2021) for learning a representation for discourse relation labels. We explore alternative ways of informing the model about the similarities between relations, by representing relations in terms of their names (and parent category), their typical markers, or in terms of CCR features that describe the relations. Experimental results show that exploiting label similarity improves classification results.

pdf bib abs

ChatGPT vs Human-authored Text: Insights into Controllable Text Summarization and Sentence Style Transfer
Dongqi Liu | Vera Demberg
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Large-scale language models, like ChatGPT, have garnered significant media attention and stunned the public with their remarkable capacity for generating coherent text from short natural language prompts. In this paper, we aim to conduct a systematic inspection of ChatGPT’s performance in two controllable generation tasks, with respect to ChatGPT’s ability to adapt its output to different target audiences (expert vs. layman) and writing styles (formal vs. informal). Additionally, we evaluate the faithfulness of the generated text, and compare the model’s performance with human-authored texts. Our findings indicate that the stylistic variations produced by humans are considerably larger than those demonstrated by ChatGPT, and the generated texts diverge from human samples in several characteristics, such as the distribution of word types. Moreover, we observe that ChatGPT sometimes incorporates factual errors or hallucinations when adapting the text to suit a specific style.

pdf bib abs

Incorporating Distributions of Discourse Structure for Long Document Abstractive Summarization
Dongqi Liu | Yifan Wang | Vera Demberg
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

For text summarization, the role of discourse structure is pivotal in discerning the core content of a text. Regrettably, prior studies on incorporating Rhetorical Structure Theory (RST) into transformer-based summarization models only consider the nuclearity annotation, thereby overlooking the variety of discourse relation types. This paper introduces the ‘RSTformer’, a novel summarization model that comprehensively incorporates both the types and uncertainty of rhetorical relations. Our RST-attention mechanism, rooted in document-level rhetorical structure, is an extension of the recently devised Longformer framework. Through rigorous evaluation, the model proposed herein exhibits significant superiority over state-of-the-art models, as evidenced by its notable performance on several automatic metrics and human evaluation.

2022

pdf bib abs

Natural language generation in real-time settings with raw sensor data is a challenging task. We find that formulating the task as an end-to-end problem leads to two major challenges in content selection – the sensor data is both redundant and diverse across environments, thereby making it hard for the encoders to select and reason on the data. We here present a new corpus for a specific domain that instantiates these properties. It includes handover utterances that an assistant for a semi-autonomous drone uses to communicate with humans during the drone flight. The corpus consists of sensor data records and utterances in 8 different environments. As a structured intermediary representation between data records and text, we explore the use of description logic (DL). We also propose a neural generation model that can alert the human pilot of the system state and environment in preparation of the handover of control.

pdf bib abs

Barch: an English Dataset of Bar Chart Summaries
Iza Škrjanec | Muhammad Salman Edhi | Vera Demberg
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present Barch, a new English dataset of human-written summaries describing bar charts. This dataset contains 47 charts based on a selection of 18 topics. Each chart is associated with one of the four intended messages expressed in the chart title. Using crowdsourcing, we collected around 20 summaries per chart, or one thousand in total. The text of the summaries is aligned with the chart data as well as with analytical inferences about the data drawn by humans. Our datasets is one of the first to explore the effect of intended messages on the data descriptions in chart summaries. Additionally, it lends itself well to the task of training data-driven systems for chart-to-text generation. We provide results on the performance of state-of-the-art neural generation models trained on this dataset and discuss the strengths and shortcomings of different models.

pdf bib abs

DiscoGeM: A Crowdsourced Corpus of Genre-Mixed Implicit Discourse Relations
Merel Scholman | Tianai Dong | Frances Yung | Vera Demberg
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present DiscoGeM, a crowdsourced corpus of 6,505 implicit discourse relations from three genres: political speech, literature, and encyclopedic texts. Each instance was annotated by 10 crowd workers. Various label aggregation methods were explored to evaluate how to obtain a label that best captures the meaning inferred by the crowd annotators. The results show that a significant proportion of discourse relations in DiscoGeM are ambiguous and can express multiple relation senses. Probability distribution labels better capture these interpretations than single labels. Further, the results emphasize that text genre crucially affects the distribution of discourse relations, suggesting that genre should be included as a factor in automatic relation classification. We make available the newly created DiscoGeM corpus, as well as the dataset with all annotator-level labels. Both the corpus and the dataset can facilitate a multitude of applications and research purposes, for example to function as training data to improve the performance of automatic discourse relation parsers, as well as facilitate research into non-connective signals of discourse relations.

pdf bib abs

Design Choices in Crowdsourcing Discourse Relation Annotations: The Effect of Worker Selection and Training
Merel Scholman | Valentina Pyatkin | Frances Yung | Ido Dagan | Reut Tsarfaty | Vera Demberg
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Obtaining linguistic annotation from novice crowdworkers is far from trivial. A case in point is the annotation of discourse relations, which is a complicated task. Recent methods have obtained promising results by extracting relation labels from either discourse connectives (DCs) or question-answer (QA) pairs that participants provide. The current contribution studies the effect of worker selection and training on the agreement on implicit relation labels between workers and gold labels, for both the DC and the QA method. In Study 1, workers were not specifically selected or trained, and the results show that there is much room for improvement. Study 2 shows that a combination of selection and training does lead to improved results, but the method is cost- and time-intensive. Study 3 shows that a selection-only approach is a viable alternative; it results in annotations of comparable quality compared to annotations from trained participants. The results generalized over both the DC and QA method and therefore indicate that a selection-only approach could also be effective for other crowdsourced discourse annotation tasks.

pdf bib abs

The effect of domain knowledge and implicitation on discourse relation inferences
Marian Marchal | Merel Scholman | Vera Demberg
Dialogue & Discourse Volume 13

Readers adopt their domain knowledge to make inferences about information that is left implicit in the text. The present research investigates the role of domain knowledge in discourse relation interpretation, as this has not been examined experimentally in previous work. We compare interpretations of experts from the field of economics and biomedical sciences in texts from within and outside of their domain of expertise. The results show that high-knowledge readers are better at inferring the correct relation interpretation compared to low-knowledge readers. This effect was stronger in relations that contained a connective in the original text than in relations that were originally implicit. The study provides insight on the impact of background knowledge on discourse relation inferencing and how readers interpret discourse relations when they lack the required domain knowledge.

pdf bib abs

Two-Stage Movie Script Summarization: An Efficient Method For Low-Resource Long Document Summarization
Dongqi Liu | Xudong Hong | Pin-Jie Lin | Ernie Chang | Vera Demberg
Proceedings of the Workshop on Automatic Summarization for Creative Writing

The Creative Summarization Shared Task at COLING 2022 aspires to generate summaries given long-form texts from creative writing. This paper presents the system architecture and the results of our participation in the Scriptbase track that focuses on generating movie plots given movie scripts. The core innovation in our model employs a two-stage hierarchical architecture for movie script summarization. In the first stage, a heuristic extraction method is applied to extract actions and essential dialogues, which reduces the average length of input movie scripts by 66% from about 24K to 8K tokens. In the second stage, a state-of-the-art encoder-decoder model, Longformer-Encoder-Decoder (LED), is trained with effective fine-tuning methods, BitFit and NoisyTune. Evaluations on the unseen test set indicate that our system outperforms both zero-shot LED baselines as well as other participants on various automatic metrics and ranks 1st in the Scriptbase track.

pdf bib abs

Improving Zero-Shot Multilingual Text Generation via Iterative Distillation
Ernie Chang | Alex Marin | Vera Demberg
Proceedings of the 29th International Conference on Computational Linguistics

The demand for multilingual dialogue systems often requires a costly labeling process, where human translators derive utterances in low resource languages from resource rich language annotation. To this end, we explore leveraging the inductive biases for target languages learned by numerous pretrained teacher models by transferring them to student models via sequence-level knowledge distillation. By assuming no target language text, the both the teacher and student models need to learn from the target distribution in a few/zero-shot manner. On the MultiATIS++ benchmark, we explore the effectiveness of our proposed technique to derive the multilingual text for 6 languages, using only the monolingual English data and the pretrained models. We show that training on the synthetic multilingual generation outputs yields close performance to training on human annotations in both slot F1 and intent accuracy; the synthetic text also scores high in naturalness and correctness based on human evaluation.

pdf bib abs

Few-Shot Pidgin Text Adaptation via Contrastive Fine-Tuning
Ernie Chang | Jesujoba O. Alabi | David Ifeoluwa Adelani | Vera Demberg
Proceedings of the 29th International Conference on Computational Linguistics

The surging demand for multilingual dialogue systems often requires a costly labeling process for each language addition. For low resource languages, human annotators are continuously tasked with the adaptation of resource-rich language utterances for each new domain. However, this prohibitive and impractical process can often be a bottleneck for low resource languages that are still without proper translation systems nor parallel corpus. In particular, it is difficult to obtain task-specific low resource language annotations for the English-derived creoles (e.g. Nigerian and Cameroonian Pidgin). To address this issue, we utilize the pretrained language models i.e. BART which has shown great potential in language generation/understanding – we propose to finetune the BART model to generate utterances in Pidgin by leveraging the proximity of the source and target languages, and utilizing positive and negative examples in constrastive training objectives. We collected and released the first parallel Pidgin-English conversation corpus in two dialogue domains and showed that this simple and effective technique is suffice to yield impressive results for English-to-Pidgin generation, which are two closely-related languages.

pdf bib abs

Zero-shot Script Parsing
Fangzhou Zhai | Vera Demberg | Alexander Koller
Proceedings of the 29th International Conference on Computational Linguistics

Script knowledge is useful to a variety of NLP tasks. However, existing resources only cover a small number of activities, limiting its practical usefulness. In this work, we propose a zero-shot learning approach to script parsing, the task of tagging texts with scenario-specific event and participant types, which enables us to acquire script knowledge without domain-specific annotations. We (1) learn representations of potential event and participant mentions by promoting cluster consistency according to the annotated data; (2) perform clustering on the event / participant candidates from unannotated texts that belongs to an unseen scenario. The model achieves 68.1/74.4 average F1 for event / participant parsing, respectively, outperforming a previous CRF model that, in contrast, has access to scenario-specific supervision. We also evaluate the model by testing on a different corpus, where it achieved 55.5/54.0 average F1 for event / participant parsing.

pdf bib abs

Establishing Annotation Quality in Multi-label Annotations
Marian Marchal | Merel Scholman | Frances Yung | Vera Demberg
Proceedings of the 29th International Conference on Computational Linguistics

In many linguistic fields requiring annotated data, multiple interpretations of a single item are possible. Multi-label annotations more accurately reflect this possibility. However, allowing for multi-label annotations also affects the chance that two coders agree with each other. Calculating inter-coder agreement for multi-label datasets is therefore not trivial. In the current contribution, we evaluate different metrics for calculating agreement on multi-label annotations: agreement on the intersection of annotated labels, an augmented version of Cohen’s Kappa, and precision, recall and F1. We propose a bootstrapping method to obtain chance agreement for each measure, which allows us to obtain an adjusted agreement coefficient that is more interpretable. We demonstrate how various measures affect estimates of agreement on simulated datasets and present a case study of discourse relation annotations. We also show how the proportion of double labels, and the entropy of the label distribution, influences the measures outlined above and how a bootstrapped adjusted agreement can make agreement measures more comparable across datasets in multi-label scenarios.

pdf bib abs

Programmable Annotation with Diversed Heuristics and Data Denoising
Ernie Chang | Alex Marin | Vera Demberg
Proceedings of the 29th International Conference on Computational Linguistics

Neural natural language generation (NLG) and understanding (NLU) models are costly and require massive amounts of annotated data to be competitive. Recent data programming frameworks address this bottleneck by allowing human supervision to be provided as a set of labeling functions to construct generative models that synthesize weak labels at scale. However, these labeling functions are difficult to build from scratch for NLG/NLU models, as they often require complex rule sets to be specified. To this end, we propose a novel data programming framework that can jointly construct labeled data for language generation and understanding tasks – by allowing the annotators to modify an automatically-inferred alignment rule set between sequence labels and text, instead of writing rules from scratch. Further, to mitigate the effect of poor quality labels, we propose a dually-regularized denoising mechanism for optimizing the NLU and NLG models. On two benchmarks we show that the framework can generate high-quality data that comes within a 1.48 BLEU and 6.42 slot F1 of the 100% human-labeled data (42k instances) with just 100 labeled data samples – outperforming benchmark annotation frameworks and other semi-supervised approaches.

pdf bib abs

Label distributions help implicit discourse relation classification
Frances Yung | Kaveri Anuranjana | Merel Scholman | Vera Demberg
Proceedings of the 3rd Workshop on Computational Approaches to Discourse

Implicit discourse relations can convey more than one relation sense, but much of the research on discourse relations has focused on single relation senses. Recently, DiscoGeM, a novel multi-domain corpus, which contains 10 crowd-sourced labels per relational instance, has become available. In this paper, we analyse the co-occurrences of relations in DiscoGem and show that they are systematic and characteristic of text genre. We then test whether information on multi-label distributions in the data can help implicit relation classifiers. Our results show that incorporating multiple labels in parser training can improve its performance, and yield label distributions which are more similar to human label distributions, compared to a parser that is trained on just a single most frequent label per instance.

2021

pdf bib abs

Time-Aware Ancient Chinese Text Translation and Inference
Ernie Chang | Yow-Ting Shiue | Hui-Syuan Yeh | Vera Demberg
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

In this paper, we aim to address the challenges surrounding the translation of ancient Chinese text: (1) The linguistic gap due to the difference in eras results in translations that are poor in quality, and (2) most translations are missing the contextual information that is often very crucial to understanding the text. To this end, we improve upon past translation techniques by proposing the following: We reframe the task as a multi-label prediction task where the model predicts both the translation and its particular era. We observe that this helps to bridge the linguistic gap as chronological context is also used as auxiliary information. We validate our framework on a parallel corpus annotated with chronology information and show experimentally its efficacy in producing quality translation outputs. We release both the code and the data for future research.

pdf bib abs

The SelectGen Challenge: Finding the Best Training Samples for Few-Shot Neural Text Generation
Ernie Chang | Xiaoyu Shen | Alex Marin | Vera Demberg
Proceedings of the 14th International Conference on Natural Language Generation

We propose a shared task on training instance selection for few-shot neural text generation. Large-scale pretrained language models have led to dramatic improvements in few-shot text generation. Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. Studying the selection strategy can help us (1) make the most use of our annotation budget in downstream tasks and (2) better benchmark few-shot text generative models. We welcome submissions that present their selection strategies and the effects on the generation quality.

pdf bib abs

Jointly Improving Language Understanding and Generation with Quality-Weighted Weak Supervision of Automatic Labeling
Ernie Chang | Vera Demberg | Alex Marin
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Neural natural language generation (NLG) and understanding (NLU) models are data-hungry and require massive amounts of annotated data to be competitive. Recent frameworks address this bottleneck with generative models that synthesize weak labels at scale, where a small amount of training labels are expert-curated and the rest of the data is automatically annotated. We follow that approach, by automatically constructing a large-scale weakly-labeled data with a fine-tuned GPT-2, and employ a semi-supervised framework to jointly train the NLG and NLU models. The proposed framework adapts the parameter updates to the models according to the estimated label-quality. On both the E2E and Weather benchmarks, we show that this weakly supervised training paradigm is an effective approach under low resource scenarios with as little as 10 data instances, and outperforming benchmark systems on both datasets when 100% of the training data is used.

pdf bib abs

Neural Data-to-Text Generation with LM-based Text Augmentation
Ernie Chang | Xiaoyu Shen | Dawei Zhu | Vera Demberg | Hui Su
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

For many new application domains for data-to-text generation, the main obstacle in training neural models consists of a lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values by alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised sequence-to-sequence models with less than 10% of the training set. By utilizing all annotated data, our model can boost the performance of a standard sequence-to-sequence model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.

pdf bib abs

Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning
Ernie Chang | Hui-Syuan Yeh | Vera Demberg
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recent advancements in data-to-text generation largely take on the form of neural end-to-end systems. Efforts have been dedicated to improving text generation systems by changing the order of training samples in a process known as curriculum learning. Past research on sequence-to-sequence learning showed that curriculum learning helps to improve both the performance and convergence speed. In this work, we delve into the same idea surrounding the training samples consisting of structured data and text pairs, where at each update, the curriculum framework selects training samples based on the model’s competence. Specifically, we experiment with various difficulty metrics and put forward a soft edit distance metric for ranking training samples. On our benchmarks, it shows faster convergence speed where training time is reduced by 38.7% and performance is boosted by 4.84 BLEU.

pdf bib abs

Comparison of methods for explicit discourse connective identification across various domains
Merel Scholman | Tianai Dong | Frances Yung | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

Existing parse methods use varying approaches to identify explicit discourse connectives, but their performance has not been consistently evaluated in comparison to each other, nor have they been evaluated consistently on text other than newspaper articles. We here assess the performance on explicit connective identification of three parse methods (PDTB e2e, Lin et al., 2014; the winner of CONLL2015, Wang et al., 2015; and DisSent, Nie et al., 2019), along with a simple heuristic. We also examine how well these systems generalize to different datasets, namely written newspaper text (PDTB), written scientific text (BioDRB), prepared spoken text (TED-MDB) and spontaneous spoken text (Disco-SPICE). The results show that the e2e parser outperforms the other parse methods in all datasets. However, performance drops significantly from the PDTB to all other datasets. We provide a more fine-grained analysis of domain differences and connectives that prove difficult to parse, in order to highlight the areas where gains can be made.

pdf bib abs

Semi-automatic discourse annotation in a low-resource language: Developing a connective lexicon for Nigerian Pidgin
Marian Marchal | Merel Scholman | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

Cross-linguistic research on discourse structure and coherence marking requires discourse-annotated corpora and connective lexicons in a large number of languages. However, the availability of such resources is limited, especially for languages for which linguistic resources are scarce in general, such as Nigerian Pidgin. In this study, we demonstrate how a semi-automatic approach can be used to source connectives and their relation senses and develop a discourse-annotated corpus in a low-resource language. Connectives and their relation senses were extracted from a parallel corpus combining automatic (PDTB end-to-end parser) and manual annotations. This resulted in Naija-Lex, a lexicon of discourse connectives in Nigerian Pidgin with English translations. The lexicon shows that the majority of Nigerian Pidgin connectives are borrowed from its English lexifier, but that there are also some connectives that are unique to Nigerian Pidgin.

pdf bib abs

A practical perspective on connective generation
Frances Yung | Merel Scholman | Vera Demberg
Proceedings of the 2nd Workshop on Computational Approaches to Discourse

In data-driven natural language generation, we typically know what relation should be expressed and need to select a connective to lexicalize it. In the current contribution, we analyse whether a sophisticated connective generation module is necessary to select a connective, or whether this can be solved with simple methods (such as random choice between connectives that are known to express a given relation, or usage of a generic language model). Comparing these methods to the distributions of connective choices from a human connective insertion task, we find mixed results: for some relations, it is acceptable to lexicalize them using any of the connectives that mark this relation. However, for other relations (temporals, concessives) either a more detailed relation distinction needs to be introduced, or a more sophisticated connective choice module would be necessary.

pdf bib abs

On Training Instance Selection for Few-Shot Neural Text Generation
Ernie Chang | Xiaoyu Shen | Hui-Syuan Yeh | Vera Demberg
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Large-scale pretrained language models have led to dramatic improvements in text generation. Impressive performance can be achieved by finetuning only on a small number of instances (few-shot setting). Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. In this work, we present a study on training instance selection in few-shot neural text generation. The selection decision is made based only on the unlabeled data so as to identify the most worthwhile data points that should be annotated under some budget of labeling cost. Based on the intuition that the few-shot training instances should be diverse and representative of the entire data distribution, we propose a simple selection strategy with K-means clustering. We show that even with the naive clustering-based approach, the generation models consistently outperform random sampling on three text generation tasks: data-to-text generation, document summarization and question generation. The code and training data are made available. We hope that this work will call for more attention on this largely unexplored area.

pdf bib abs

Entity Enhancement for Implicit Discourse Relation Classification in the Biomedical Domain
Wei Shi | Vera Demberg
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Implicit discourse relation classification is a challenging task, in particular when the text domain is different from the standard Penn Discourse Treebank (PDTB; Prasad et al., 2008) training corpus domain (Wall Street Journal in 1990s). We here tackle the task of implicit discourse relation classification on the biomedical domain, for which the Biomedical Discourse Relation Bank (BioDRB; Prasad et al., 2011) is available. We show that entity information can be used to improve discourse relational argument representation. In a first step, we show that explicitly marked instances that are content-wise similar to the target relations can be used to achieve good performance in the cross-domain setting using a simple unsupervised voting pipeline. As a further step, we show that with the linked entity information from the first step, a transformer which is augmented with entity-related information (KBERT; Liu et al., 2020) sets the new state of the art performance on the dataset, outperforming the large pre-trained BioBERT (Lee et al., 2020) model by 2% points.

2020

pdf bib abs

A problem in automatically generated stories for image sequences is that they use overly generic vocabulary and phrase structure and fail to match the distributional characteristics of human-generated text. We address this problem by introducing explicit representations for objects and their relations by extracting scene graphs from the images. Utilizing an embedding of this scene graph enables our model to more explicitly reason over objects and their relations during story generation, compared to the global features from an object classifier used in previous work. We apply metrics that account for the diversity of words and phrases of generated stories as well as for reference to narratively-salient image features and show that our approach outperforms previous systems. Our experiments also indicate that our models obtain competitive results on reference-based metrics.

pdf bib abs

Story Generation with Rich Details
Fangzhou Zhai | Vera Demberg | Alexander Koller
Proceedings of the 28th International Conference on Computational Linguistics

Automatically generated stories need to be not only coherent, but also interesting. Apart from realizing a story line, the text also needs to include rich details to engage the readers. We propose a model that features two different generation components: an outliner, which proceeds the main story line to realize global coherence; a detailer, which supplies relevant details to the story in a locally coherent manner. Human evaluations show our model substantially improves the informativeness of generated text while retaining its coherence, outperforming various baselines.

pdf bib abs

DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool
Ernie Chang | Jeriah Caplinger | Alex Marin | Xiaoyu Shen | Vera Demberg
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

We present a lightweight annotation tool, the Data AnnotatoR Tool (DART), for the general task of labeling structured data with textual descriptions. The tool is implemented as an interactive application that reduces human efforts in annotating large quantities of structured data, e.g. in the format of a table or tree structure. By using a backend sequence-to-sequence model, our system iteratively analyzes the annotated labels in order to better sample unlabeled data. In a simulation experiment performed on annotating large quantities of structured data, DART has been shown to reduce the total number of annotations needed with active learning and automatically suggesting relevant labels.

2019

pdf bib

pdf bib abs

Crowdsourcing Discourse Relation Annotations by a Two-Step Connective Insertion Task
Frances Yung | Vera Demberg | Merel Scholman
Proceedings of the 13th Linguistic Annotation Workshop

The perspective of being able to crowd-source coherence relations bears the promise of acquiring annotations for new texts quickly, which could then increase the size and variety of discourse-annotated corpora. It would also open the avenue to answering new research questions: Collecting annotations from a larger number of individuals per instance would allow to investigate the distribution of inferred relations, and to study individual differences in coherence relation interpretation. However, annotating coherence relations with untrained workers is not trivial. We here propose a novel two-step annotation procedure, which extends an earlier method by Scholman and Demberg (2017a). In our approach, coherence relation labels are inferred from connectives that workers insert into the text. We show that the proposed method leads to replicable coherence annotations, and analyse the agreement between the obtained relation labels and annotations from PDTB and RSTDT on the same texts.

pdf bib abs

A Hybrid Model for Globally Coherent Story Generation
Fangzhou Zhai | Vera Demberg | Pavel Shkadzko | Wei Shi | Asad Sayeed
Proceedings of the Second Workshop on Storytelling

Automatically generating globally coherent stories is a challenging problem. Neural text generation models have been shown to perform well at generating fluent sentences from data, but they usually fail to keep track of the overall coherence of the story after a couple of sentences. Existing work that incorporates a text planning module succeeded in generating recipes and dialogues, but appears quite data-demanding. We propose a novel story generation approach that generates globally coherent stories from a fairly small corpus. The model exploits a symbolic text planning module to produce text plans, thus reducing the demand of data; a neural surface realization module then generates fluent text conditioned on the text plan. Human evaluation showed that our model outperforms various baselines by a wide margin and generates stories which are fluent as well as globally coherent.

pdf bib abs

Verb-Second Effect on Quantifier Scope Interpretation
Asad Sayeed | Matthias Lindemann | Vera Demberg
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Sentences like “Every child climbed a tree” have at least two interpretations depending on the precedence order of the universal quantifier and the indefinite. Previous experimental work explores the role that different mechanisms such as semantic reanalysis and world knowledge may have in enabling each interpretation. This paper discusses a web-based task that uses the verb-second characteristic of German main clauses to estimate the influence of word order variation over world knowledge.

pdf bib abs

Acquiring Annotated Data with Cross-lingual Explicitation for Implicit Discourse Relation Classification
Wei Shi | Frances Yung | Vera Demberg
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019

Implicit discourse relation classification is one of the most challenging and important tasks in discourse parsing, due to the lack of connectives as strong linguistic cues. A principle bottleneck to further improvement is the shortage of training data (ca. 18k instances in the Penn Discourse Treebank (PDTB)). Shi et al. (2017) proposed to acquire additional data by exploiting connectives in translation: human translators mark discourse relations which are implicit in the source language explicitly in the translation. Using back-translations of such explicitated connectives improves discourse relation parsing performance. This paper addresses the open question of whether the choice of the translation language matters, and whether multiple translations into different languages can be effectively used to improve the quality of the additional data.

pdf bib

Proceedings of the 13th International Conference on Computational Semantics - Student Papers
Simon Dobnik | Stergios Chatzikyriakidis | Vera Demberg | Kathrein Abu Kwaik | Vladislav Maraev
Proceedings of the 13th International Conference on Computational Semantics - Student Papers

pdf bib

Proceedings of the 13th International Conference on Computational Semantics - Short Papers
Simon Dobnik | Stergios Chatzikyriakidis | Vera Demberg
Proceedings of the 13th International Conference on Computational Semantics - Short Papers

pdf bib abs

Learning to Explicitate Connectives with Seq2Seq Network for Implicit Discourse Relation Classification
Wei Shi | Vera Demberg
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Implicit discourse relation classification is one of the most difficult steps in discourse parsing. The difficulty stems from the fact that the coherence relation must be inferred based on the content of the discourse relational arguments. Therefore, an effective encoding of the relational arguments is of crucial importance. We here propose a new model for implicit discourse relation classification, which consists of a classifier, and a sequence-to-sequence model which is trained to generate a representation of the discourse relational arguments by trying to predict the relational arguments including a suitable implicit connective. Training is possible because such implicit connectives have been annotated as part of the PDTB corpus. Along with a memory network, our model could generate more refined representations for the task. And on the now standard 11-way classification, our method outperforms the previous state of the art systems on the PDTB benchmark on multiple settings including cross validation.

pdf bib

Proceedings of the 13th International Conference on Computational Semantics - Long Papers
Simon Dobnik | Stergios Chatzikyriakidis | Vera Demberg
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

pdf bib abs

Improving Language Generation from Feature-Rich Tree-Structured Data with Relational Graph Convolutional Encoders
Xudong Hong | Ernie Chang | Vera Demberg
Proceedings of the 2nd Workshop on Multilingual Surface Realisation (MSR 2019)

The Multilingual Surface Realization Shared Task 2019 focuses on generating sentences from lemmatized sets of universal dependency parses with rich features. This paper describes the results of our participation in the deep track. The core innovation in our approach is to use a graph convolutional network to encode the dependency trees given as input. Upon adding morphological features, our system achieves the third rank without using data augmentation techniques or additional components (such as a re-ranker).

pdf bib abs

Next Sentence Prediction helps Implicit Discourse Relation Classification within and across Domains
Wei Shi | Vera Demberg
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Implicit discourse relation classification is one of the most difficult tasks in discourse parsing. Previous studies have generally focused on extracting better representations of the relational arguments. In order to solve the task, it is however additionally necessary to capture what events are expected to cause or follow each other. Current discourse relation classifiers fall short in this respect. We here show that this shortcoming can be effectively addressed by using the bidirectional encoder representation from transformers (BERT) proposed by Devlin et al. (2019), which were trained on a next-sentence prediction task, and thus encode a representation of likely next sentences. The BERT-based model outperforms the current state of the art in 11-way classification by 8% points on the standard PDTB dataset. Our experiments also demonstrate that the model can be successfully ported to other domains: on the BioDRB dataset, the model outperforms the state of the art system around 15% points.

pdf bib abs

How compatible are our discourse annotation frameworks? Insights from mapping RST-DT and PDTB annotations
Vera Demberg | Merel C.J. Scholman | Fatemeh Torabi Asr
Dialogue & Discourse Volume 10

Discourse-annotated corpora are an important resource for the community, but they are often annotated according to different frameworks. This makes joint usage of the annotations difficult, preventing researchers from searching the corpora in a unified way, or using all annotated data jointly to train computational systems. Several theoretical proposals have recently been made for mapping the relational labels of different frameworks to each other, but these proposals have so far not been validated against existing annotations. The two largest discourse relation annotated resources, the Penn Discourse Treebank and the Rhetorical Structure Theory Discourse Treebank, have however been annotated on the same texts, allowing for a direct comparison of the annotation layers. We propose a method for automatically aligning the discourse segments, and then evaluate existing mapping proposals by comparing the empirically observed against the proposed mappings. Our analysis highlights the influence of segmentation on subsequent discourse relation labelling, and shows that while agreement between frameworks is reasonable for explicit relations, agreement on implicit relations is low. We identify several sources of systematic discrepancies between the two annotation schemes and discuss consequences for future annotation and for usage of the existing resources.

2018

pdf bib abs

Toward Bayesian Synchronous Tree Substitution Grammars for Sentence Planning
David M. Howcroft | Dietrich Klakow | Vera Demberg
Proceedings of the 11th International Conference on Natural Language Generation

Developing conventional natural language generation systems requires extensive attention from human experts in order to craft complex sets of sentence planning rules. We propose a Bayesian nonparametric approach to learn sentence planning rules by inducing synchronous tree substitution grammars for pairs of text plans and morphosyntactically-specified dependency trees. Our system is able to learn rules which can be used to generate novel texts after training on small datasets.

pdf bib abs

We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation.

pdf bib abs

Do Speakers Produce Discourse Connectives Rationally?
Frances Yung | Vera Demberg
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

A number of different discourse connectives can be used to mark the same discourse relation, but it is unclear what factors affect connective choice. One recent account is the Rational Speech Acts theory, which predicts that speakers try to maximize the informativeness of an utterance such that the listener can interpret the intended meaning correctly. Existing prior work uses referential language games to test the rational account of speakers’ production of concrete meanings, such as identification of objects within a picture. Building on the same paradigm, we design a novel Discourse Continuation Game to investigate speakers’ production of abstract discourse relations. Experimental results reveal that speakers significantly prefer a more informative connective, in line with predictions of the RSA model.

pdf bib abs

Learning distributed event representations with a multi-task approach
Xudong Hong | Asad Sayeed | Vera Demberg
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

Human world knowledge contains information about prototypical events and their participants and locations. In this paper, we train the first models using multi-task learning that can both predict missing event participants and also perform semantic role classification based on semantic plausibility. Our best-performing model is an improvement over the previous state-of-the-art on thematic fit modelling tasks. The event embeddings learned by the model can additionally be used effectively in an event similarity task, also outperforming the state-of-the-art.

pdf bib

A vision-grounded dataset for predicting typical locations for verbs
Nelson Mukuze | Anna Rohrbach | Vera Demberg | Bernt Schiele
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib

Rollenwechsel-English: a large-scale semantic role corpus
Asad Sayeed | Pavel Shkadzko | Vera Demberg
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs

G-TUNA: a corpus of referring expressions in German, including duration information
David Howcroft | Jorrig Vogels | Vera Demberg
Proceedings of the 10th International Conference on Natural Language Generation

Corpora of referring expressions elicited from human participants in a controlled environment are an important resource for research on automatic referring expression generation. We here present G-TUNA, a new corpus of referring expressions for German. Using the furniture stimuli set developed for the TUNA and D-TUNA corpora, our corpus extends on these corpora by providing data collected in a simulated driving dual-task setting, and additionally provides exact duration annotations for the spoken referring expressions. This corpus will hence allow researchers to analyze the interaction between referring expression length and speech rate, under conditions where the listener is under high vs. low cognitive load.

pdf bib abs

Crowdsourcing discourse interpretations: On the influence of context and the reliability of a connective insertion task
Merel Scholman | Vera Demberg
Proceedings of the 11th Linguistic Annotation Workshop

Traditional discourse annotation tasks are considered costly and time-consuming, and the reliability and validity of these tasks is in question. In this paper, we investigate whether crowdsourcing can be used to obtain reliable discourse relation annotations. We also examine the influence of context on the reliability of the data. The results of a crowdsourced connective insertion task showed that the method can be used to obtain reliable annotations: The majority of the inserted connectives converged with the original label. Further, the method is sensitive to the fact that multiple senses can often be inferred for a single relation. Regarding the presence of context, the results show no significant difference in distributions of insertions between conditions overall. However, a by-item comparison revealed several characteristics of segments that determine whether the presence of context makes a difference in annotations. The findings discussed in this paper can be taken as evidence that crowdsourcing can be used as a valuable method to obtain insights into the sense(s) of relations.

pdf bib abs

Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction
Ashutosh Modi | Ivan Titov | Vera Demberg | Asad Sayeed | Manfred Pinkal
Transactions of the Association for Computational Linguistics, Volume 5

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

pdf bib abs

Using Explicit Discourse Connectives in Translation for Implicit Discourse Relation Classification
Wei Shi | Frances Yung | Raphael Rubino | Vera Demberg
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Implicit discourse relation recognition is an extremely challenging task due to the lack of indicative connectives. Various neural network architectures have been proposed for this task recently, but most of them suffer from the shortage of labeled data. In this paper, we address this problem by procuring additional training data from parallel corpora: When humans translate a text, they sometimes add connectives (a process known as explicitation). We automatically back-translate it into an English connective and use it to infer a label with high confidence. We show that a training set several times larger than the original training set can be generated this way. With the extra labeled instances, we show that even a simple bidirectional Long Short-Term Memory Network can outperform the current state-of-the-art.

pdf bib abs

On the Need of Cross Validation for Discourse Relation Classification
Wei Shi | Vera Demberg
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The task of implicit discourse relation classification has received increased attention in recent years, including two CoNNL shared tasks on the topic. Existing machine learning models for the task train on sections 2-21 of the PDTB and test on section 23, which includes a total of 761 implicit discourse relations. In this paper, we’d like to make a methodological point, arguing that the standard test set is too small to draw conclusions about whether the inclusion of certain features constitute a genuine improvement, or whether one got lucky with some properties of the test set, and argue for the adoption of cross validation for the discourse relation classification task by the community.

pdf bib abs

Psycholinguistic Models of Sentence Processing Improve Sentence Readability Ranking
David M. Howcroft | Vera Demberg
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

While previous research on readability has typically focused on document-level measures, recent work in areas such as natural language generation has pointed out the need of sentence-level readability measures. Much of psycholinguistics has focused for many years on processing measures that provide difficulty estimates on a word-by-word basis. However, these psycholinguistic measures have not yet been tested on sentence readability ranking tasks. In this paper, we use four psycholinguistic measures: idea density, surprisal, integration cost, and embedding depth to test whether these features are predictive of readability levels. We find that psycholinguistic features significantly improve performance by up to 3 percentage points over a standard document-level readability metric baseline.

pdf bib abs

A Systematic Study of Neural Discourse Models for Implicit Discourse Relation
Attapol Rutherford | Vera Demberg | Nianwen Xue
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Inferring implicit discourse relations in natural language text is the most difficult subtask in discourse parsing. Many neural network models have been proposed to tackle this problem. However, the comparison for this task is not unified, so we could hardly draw clear conclusions about the effectiveness of various architectures. Here, we propose neural network models that are based on feedforward and long-short term memory architecture and systematically study the effects of varying structures. To our surprise, the best-configured feedforward architecture outperforms LSTM-based model in most cases despite thorough tuning. Further, we compare our best feedforward system with competitive convolutional and recurrent networks and find that feedforward can actually be more effective. For the first time for this task, we compile and publish outputs from previous neural and non-neural systems to establish the standard for further comparison.

pdf bib abs

Examples and Specifications that Prove a Point: Identifying Elaborative and Argumentative Discourse Relations
Merel C.J. Scholman | Vera Demberg
Dialogue & Discourse Volume 8

Examples and specifications occur frequently in text, but not much is known about how they function in discourse and how readers interpret them. Looking at how they’re annotated in existing discourse corpora, we find that annotators often disagree on these types of relations; specifically, there is disagreement about whether these relations are elaborative (additive) or argumentative (pragmatic causal). To investigate how readers interpret examples and specifications, we conducted a crowdsourced discourse annotation study. The results show that these relations can indeed have two functions: they can be used to both illustrate/specify a situation and serve as an argument for a claim. These findings suggest that examples and specifications can have multiple simultaneous readings. We discuss the implications of these results for discourse annotation.

2016

pdf bib

How can we adapt generation to the user’s cognitive load?
Vera Demberg
Proceedings of the 9th International Natural Language Generation conference

pdf bib

Thematic fit evaluation: an aspect of selectional preferences
Asad Sayeed | Clayton Greenberg | Vera Demberg
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

pdf bib

Roleo: Visualising Thematic Fit Spaces on the Web
Asad Sayeed | Xudong Hong | Vera Demberg
Proceedings of ACL-2016 System Demonstrations

pdf bib

LingoTurk: managing crowdsourced tasks for psycholinguistics
Florian Pusse | Asad Sayeed | Vera Demberg
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf bib

Improving event prediction by representing script participants
Simon Ahrendt | Vera Demberg
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib abs

Annotating Discourse Relations in Spoken Language: A Comparison of the PDTB and CCR Frameworks
Ines Rehbein | Merel Scholman | Vera Demberg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In discourse relation annotation, there is currently a variety of different frameworks being used, and most of them have been developed and employed mostly on written data. This raises a number of questions regarding interoperability of discourse relation annotation schemes, as well as regarding differences in discourse annotation for written vs. spoken domains. In this paper, we describe ouron annotating two spoken domains from the SPICE Ireland corpus (telephone conversations and broadcast interviews) according todifferent discourse annotation schemes, PDTB 3.0 and CCR. We show that annotations in the two schemes can largely be mappedone another, and discuss differences in operationalisations of discourse relation schemes which present a challenge to automatic mapping. We also observe systematic differences in the prevalence of implicit discourse relations in spoken data compared to written texts,find that there are also differences in the types of causal relations between the domains. Finally, we find that PDTB 3.0 addresses many shortcomings of PDTB 2.0 wrt. the annotation of spoken discourse, and suggest further extensions. The new corpus has roughly theof the CoNLL 2015 Shared Task test set, and we hence hope that it will be a valuable resource for the evaluation of automatic discourse relation labellers.

pdf bib

Event participant modelling with neural networks
Ottokar Tilk | Vera Demberg | Asad Sayeed | Dietrich Klakow | Stefan Thater
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib abs

From OpenCCG to AI Planning: Detecting Infeasible Edges in Sentence Generation
Maximilian Schwenger | Álvaro Torralba | Joerg Hoffmann | David M. Howcroft | Vera Demberg
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The search space in grammar-based natural language generation tasks can get very large, which is particularly problematic when generating long utterances or paragraphs. Using surface realization with OpenCCG as an example, we show that we can effectively detect partial solutions (edges) which cannot ultimately be part of a complete sentence because of their syntactic category. Formulating the completion of an edge into a sentence as finding a solution path in a large state-transition system, we demonstrate a connection to AI Planning which is concerned with this kind of problem. We design a compilation from OpenCCG into AI Planning allowing the detection of infeasible edges via AI Planning dead-end detection methods (proving the absence of a solution to the compilation). Our experiments show that this can filter out large fractions of infeasible edges in, and thus benefit the performance of, complex realization processes.

2015

pdf bib

Towards Flexible, Small-Domain Surface Generation: Combining Data-Driven and Grammatical Approaches
Andrea Fischer | Vera Demberg | Dietrich Klakow
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

pdf bib

Verb polysemy and frequency effects in thematic fit modeling
Clayton Greenberg | Vera Demberg | Asad Sayeed
Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics

pdf bib

Uniform Surprisal at the Level of Discourse Relations: Negation Markers and Discourse Connective Omission
Fatemeh Torabi Asr | Vera Demberg
Proceedings of the 11th International Conference on Computational Semantics

pdf bib

Learning to predict script events from domain-specific text
Rachel Rudinger | Vera Demberg | Ashutosh Modi | Benjamin Van Durme | Manfred Pinkal
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib

Vector-space calculation of semantic surprisal for predicting word pronunciation duration
Asad Sayeed | Stefan Fischer | Vera Demberg
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib

Improving unsupervised vector-space thematic fit evaluation via role-filler prototype clustering
Clayton Greenberg | Asad Sayeed | Vera Demberg
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib

Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics
Vera Demberg | Timothy O’Donnell
Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics

pdf bib

Incremental Semantic Role Labeling with Tree Adjoining Grammar
Ioannis Konstas | Frank Keller | Vera Demberg | Mirella Lapata
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib

On the Information Conveyed by Discourse Markers
Fatemeh Torabi Asr | Vera Demberg
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

pdf bib

The semantic augmentation of a psycholinguistically-motivated syntactic formalism
Asad Sayeed | Vera Demberg
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

pdf bib

Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)
Vera Demberg | Roger Levy
Proceedings of the Fourth Annual Workshop on Cognitive Modeling and Computational Linguistics (CMCL)

pdf bib

Incremental, Predictive Parsing with Psycholinguistically Motivated Tree-Adjoining Grammar
Vera Demberg | Frank Keller | Alexander Koller
Computational Linguistics, Volume 39, Issue 4 - December 2013

2012

pdf bib

Measuring the Strength of Linguistic Cues for Discourse Relations
Fatemeh Torabi Asr | Vera Demberg
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects

pdf bib

Incremental Derivations in CCG
Vera Demberg
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

pdf bib

Incremental Neo-Davidsonian semantic construction for TAG
Asad Sayeed | Vera Demberg
Proceedings of the 11th International Workshop on Tree Adjoining Grammars and Related Formalisms (TAG+11)

bib abs

German and English Treebanks and Lexica for Tree-Adjoining Grammars
Miriam Kaeshammer | Vera Demberg
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a treebank and lexicon for German and English, which have been developed for PLTAG parsing. PLTAG is a psycholinguistically motivated, incremental version of tree-adjoining grammar (TAG). The resources are however also applicable to parsing with other variants of TAG. The German PLTAG resources are based on the TIGER corpus and, to the best of our knowledge, constitute the first scalable German TAG grammar. The English PLTAG resources go beyond existing resources in that they include the NP annotation by (Vadas and Curran, 2007), and include the prediction lexicon necessary for PLTAG.

pdf bib

Syntactic Surprisal Affects Spoken Word Duration in Conversational Contexts
Vera Demberg | Asad Sayeed | Philip Gorinski | Nikolaos Engonopoulos
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib

Implicitness of Discourse Relations
Fatemeh Torabi Asr | Vera Demberg
Proceedings of COLING 2012

Vera Demberg

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

Co-authors

Venues