David Schlangen - ACL Anthology

David Schlangen

2026

The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models
Sherzod Hakimov | Roland Bernard | Tim Leiber | Karl Osswald | Kristina Richert | Ruilin Yang | Raffaella Bernardi | David Schlangen
Findings of the Association for Computational Linguistics: EACL 2026

Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We present the first comprehensive study that systematically evaluates how explicit reasoning training affects the negotiation abilities of both commercial and open-weight large language models, comparing these models to their vanilla counterparts across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning (that is, scaling test-time compute) significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while a leading commercial model maintains language consistency between reasoning and final output.

2025

From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Instruction-tuned large language models (LLMs) have shown strong performance on a variety of tasks; however, generalizing from synthetic to human-authored instructions in grounded environments remains a challenge for them. In this work, we study generalization challenges in spatial grounding tasks where models interpret and translate instructions for building object arrangements on a 2.5D grid. We fine-tune LLMs using only synthetic instructions and evaluate their performance on a benchmark dataset containing both synthetic and human-authored instructions. Our results reveal that while models generalize well on simple tasks, their performance degrades significantly on more complex tasks. We present a detailed error analysis of the gaps in instruction generalization.

There is an increasing trend towards evaluating NLP models with LLMs instead of human judgments, raising questions about the validity of these evaluations, as well as their reproducibility in the case of proprietary models. We provide JUDGE-BENCH, an extensible collection of 20 NLP datasets with human annotations covering a broad range of evaluated properties and types of data, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show substantial variance across models and datasets. Models are reliable evaluators on some tasks, but overall display substantial variability depending on the property being evaluated, the expertise level of the human judges, and whether the language is human or model-generated. We conclude that LLMs should be carefully validated against human judgments before being used as evaluators.

Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model’s response. In this paper, we investigate whether Dialogue Games (goal-directed and rule-governed activities driven predominantly by verbal actions) can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with Group Relative Policy Optimization (GRPO). We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in this promising new direction of “learning in (synthetic) interaction”.

Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models
Sherzod Hakimov | Yerkezhan Abdullayeva | Kushal Koshti | Antonia Schmidt | Yan Weiser | Anne Beyer | David Schlangen
Proceedings of the 31st International Conference on Computational Linguistics

While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through goal-oriented game (self-)play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model’s capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.

Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
Sherzod Hakimov | Lara Pfennigschmidt | David Schlangen
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to identify factors affecting their performance. The evaluation reveals details about their strategies, challenging cases, and limitations of LLMs.

clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

The emergence of instruction-tuned large language models (LLMs) has advanced the field of dialogue systems, enabling both realistic user simulations and robust multi-turn conversational agents. However, existing research often evaluates these components in isolation, either focusing on a single user simulator or a specific system design, limiting the generalisability of insights across architectures and configurations. In this work, we propose clem:todd (chat-optimized LLMs for task-oriented dialogue systems development), a flexible framework for systematically evaluating dialogue systems under consistent conditions. clem:todd enables detailed benchmarking across combinations of user simulators and dialogue systems, whether existing models from literature or newly developed ones. To the best of our knowledge, clem:todd is the first evaluation framework for task-oriented dialogue systems that supports plug-and-play integration and ensures uniform datasets, evaluation metrics, and computational constraints. We showcase clem:todd’s flexibility by re-evaluating existing task-oriented dialogue systems within this unified setup and integrating three newly proposed dialogue systems into the same evaluation pipeline. Our results provide actionable insights into how architecture, scale, and prompting strategies affect dialogue performance, offering practical guidance for building efficient and effective conversational AI systems.

Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
Filippo Momentè | Alessandro Suglia | Mario Giulianelli | Ambra Ferrari | Alexander Koller | Oliver Lemon | David Schlangen | Raquel Fernández | Raffaella Bernardi
Findings of the Association for Computational Linguistics: EMNLP 2025

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two—benchmarks or games—is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.

Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes
Meng Li | Michael Vrazitulis | David Schlangen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Rational speakers are supposed to know what they know and what they do not know, and to generate expressions matching the strength of evidence. In contrast, it is still a challenge for current large language models to generate corresponding utterances based on the assessment of facts and confidence in an uncertain real-world environment. While it has recently become popular to estimate and calibrate confidence of LLMs with verbalized uncertainty, what is lacking is a careful examination of the linguistic knowledge of uncertainty encoded in the latent space of LLMs. In this paper, we draw on typological frameworks of epistemic expressions to evaluate LLMs’ knowledge of epistemic modality, using controlled stories. Our experiments show that the performance of LLMs in generating epistemic expressions is limited and not robust, and hence the expressions of uncertainty generated by LLMs are not always reliable. To build uncertainty-aware LLMs, it is necessary to enrich semantic knowledge of epistemic modality in LLMs.

Prior Lessons of Incremental Dialogue and Robot Action Management for the Age of Language Models
Casey Kennington | Pierre Lison | David Schlangen
Dialogue & Discourse Volume 16

Efforts towards endowing robots with the ability to speak have benefited from recent advancements in natural language processing, in particular large language models. However, current language models are not fully incremental, as their processing is inherently monotonic and they thus lack the ability to revise their interpretations or output in light of newer observations. This monotonicity has important implications for the development of dialogue systems for human–robot interaction. In this paper, we review the literature on interactive systems that operate incrementally (i.e., at the word level or below it). We motivate the need for incremental systems and survey incremental modeling of important aspects of dialogue such as speech recognition and language generation. The primary focus is on the part of the system that makes decisions, known as the dialogue manager. We find that there is very little research on incremental dialogue management, offer some requirements for practical incremental dialogue management, and discuss implications of incremental dialogue for embodied, robotic platforms in the age of large language models.

2024

Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
Chalamalasetti Kranti | Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: EMNLP 2024

In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs’ in-context learning abilities, we use few-shot prompting techniques that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work.

Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies
Philipp Sadler | Sherzod Hakimov | David Schlangen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In collaborative goal-oriented settings, the participants are not only interested in achieving a successful outcome, but also implicitly negotiate the effort they put into the interaction (by adapting to each other). In this work, we propose a challenging interactive reference game that requires two players to coordinate on vision and language observations. The learning signal in this game is a score (given after playing) that takes into account the achieved goal and the players’ assumed efforts during the interaction. We show that a standard Proximal Policy Optimization (PPO) setup achieves a high success rate when bootstrapped with heuristic partner behaviors that implement insights from the analysis of human-human interactions. We also find that a pairing of neural partners indeed reduces the measured joint effort when playing together repeatedly. However, we observe that in comparison to a reasonable heuristic pairing there is still room for improvement, which invites further research in the direction of cost-sharing in collaborative interactions.

A Dialogue Game for Eliciting Balanced Collaboration
Isidora Jeknić | David Schlangen | Alexander Koller
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.

The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization
Luka Borec | Philipp Sadler | David Schlangen
Proceedings of the 17th International Natural Language Generation Conference

This work analyses the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling. Stochastic decoding methods like nucleus sampling are typically applied to overcome issues such as monotonous and repetitive text generation, which are often observed with maximization-based decoding techniques. We hypothesize that nucleus sampling might also reduce the occurrence of memorization patterns, because it could lead to the selection of tokens outside the memorized sequence. To test this hypothesis we create a diagnostic dataset with a known distribution of duplicates that gives us some control over the likelihood of memorization of certain parts of the training data. Our analysis of two GPT-Neo models fine-tuned on this dataset interestingly shows that (i) an increase of the nucleus size reduces memorization only modestly, and (ii) even when models do not engage in “hard” memorization – a verbatim reproduction of training samples – they may still display “soft” memorization whereby they generate outputs that echo the training data but without a complete one-by-one resemblance.

Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

It Couldn’t Help but Overhear: On the Limits of Modelling Meta-Communicative Grounding Acts with Supervised Learning
Brielen Madureira | David Schlangen
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Active participation in a conversation is key to building common ground, since understanding is jointly tailored by producers and recipients. Overhearers are deprived of the privilege of performing grounding acts and can only conjecture about intended meanings. Still, data generation and annotation, modelling, training and evaluation of NLP dialogue models place reliance on the overhearing paradigm. How much of the underlying grounding processes are thereby forfeited? As we show, there is evidence pointing to the impossibility of properly modelling human meta-communicative acts with data-driven learning models. In this paper, we discuss this issue and provide a preliminary analysis on the variability of human decisions for requesting clarification. Most importantly, we wish to bring this topic back to the community’s table, encouraging discussion on the consequences of having models designed to only “listen in”.

Learning Communication Policies for Different Follower Behaviors in a Collaborative Reference Game
Philipp Sadler | Sherzod Hakimov | David Schlangen
Proceedings of the 4th Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP 2024)

In this work, we evaluate the adaptability of neural agents towards assumed partner behaviors in a collaborative reference game. In this game, success is achieved when a knowledgeable guide can verbally lead a follower to the selection of a specific puzzle piece among several distractors. We frame this language grounding and coordination task as a reinforcement learning problem and measure to what extent a common reinforcement training algorithm (PPO) is able to produce neural agents (the guides) that perform well with various heuristic follower behaviors that vary along the dimensions of confidence and autonomy. We experiment with a learning signal that, in addition to the goal condition, also respects an assumed communicative effort. Our results indicate that this novel ingredient leads to communicative strategies that are less verbose (staying silent in some of the steps) and that, in this respect, the guide’s strategies indeed adapt to the partner’s level of confidence and autonomy.

When Only Time Will Tell: Interpreting How Transformers Process Local Ambiguities Through the Lens of Restart-Incrementality
Brielen Madureira | Patrick Kahardipraja | David Schlangen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Incremental models that process sentences one token at a time will sometimes encounter points where more than one interpretation is possible. Causal models are forced to output one interpretation and continue, whereas models that can revise may edit their previous output as the ambiguity is resolved. In this work, we look at how restart-incremental Transformers build and update internal states, in an effort to shed light on what processes cause revisions not viable in autoregressive models. We propose an interpretable way to analyse the incremental states, showing that their sequential structure encodes information on the garden path effect and its resolution. Our method brings insights on various bidirectional encoders for contextualised meaning representation and dependency parsing, helping to show their advantage over causal models when it comes to revisions.

Evaluating Modular Dialogue System for Form Filling Using Large Language Models
Sherzod Hakimov | Yan Weiser | David Schlangen
Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT 2024)

This paper introduces a novel approach to form-filling and dialogue system evaluation by leveraging Large Language Models (LLMs). The proposed method establishes a setup wherein multiple modules collaborate on addressing the form-filling task. The dialogue system is constructed on top of LLMs, focusing on defining specific roles for individual modules. We show that using multiple independent sub-modules working cooperatively on this task can improve performance and handle the typical constraints of using LLMs, such as context limitations. The study involves testing the modular setup on four selected forms of varying topics and lengths, employing commercial and open-access LLMs. The experimental results demonstrate that the modular setup consistently outperforms the baseline, showcasing the effectiveness of this approach. Furthermore, our findings reveal that open-access models perform comparably to commercial models for the specified task.

Taking Action Towards Graceful Interaction: The Effects of Performing Actions on Modelling Policies for Instruction Clarification Requests
Brielen Madureira | David Schlangen
Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language

Clarification requests are a mechanism to help solve communication problems, e.g. due to ambiguity or underspecification, in instruction-following interactions. Despite their importance, even skilful models struggle with producing or interpreting such repair acts. In this work, we test three hypotheses concerning the effects of action taking as an auxiliary task in modelling iCR policies. Contrary to initial expectations, we conclude that its contribution to learning an iCR policy is limited, but some information can still be extracted from prediction uncertainty. We present further evidence that even well-motivated, Transformer-based models fail to learn good policies for when to ask Instruction CRs (iCRs), while the task of determining what to ask about can be more successfully modelled. Considering the implications of these findings, we further discuss the shortcomings of the data-driven paradigm for learning meta-communication acts.

2023

Instruction Clarification Requests in Multimodal Collaborative Dialogue Games: Tasks, and an Analysis of the CoDraw Dataset
Brielen Madureira | David Schlangen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

In visual instruction-following dialogue games, players can engage in repair mechanisms in the face of an ambiguous or underspecified instruction that cannot be fully mapped to actions in the world. In this work, we annotate Instruction Clarification Requests (iCRs) in CoDraw, an existing dataset of interactions in a multimodal collaborative dialogue game. We show that it contains lexically and semantically diverse iCRs that players produce on their own initiative when they decide to clarify in order to solve the task successfully. With 8.8k iCRs found in 9.9k dialogues, CoDraw-iCR (v1) is a large spontaneous iCR corpus, making it a valuable resource for data-driven research on clarification in dialogue. We then formalise and provide baseline models for two tasks: determining when to make an iCR and how to recognise them, in order to investigate to what extent these tasks are learnable from data.

On General Language Understanding
David Schlangen
Findings of the Association for Computational Linguistics: EMNLP 2023

Natural Language Processing prides itself on being an empirically-minded, if not outright empiricist field, and yet lately it seems to get itself into essentialist debates on issues of meaning and measurement (“Do Large Language Models Understand Language, And If So, How Much?”). This is not by accident: Here, as everywhere, the evidence underspecifies the understanding. As a remedy, this paper sketches the outlines of a model of understanding, which can ground questions of the adequacy of current methods of measurement of model quality. The paper makes three claims: A) That different language use situation types have different characteristics, B) That language understanding is a multifaceted phenomenon, bringing together individualistic and social processes, and C) That the choice of Understanding Indicator marks the limits of benchmarking, and the beginnings of considerations of the ethics of NLP use.

clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents
Kranti Chalamalasetti | Jana Götze | Sherzod Hakimov | Brielen Madureira | Philipp Sadler | David Schlangen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent work has proposed a methodology for the systematic evaluation of “Situated Language Understanding Agents” (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models generally performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will continue to have diagnostic value.

Yes, this Way! Learning to Ground Referring Expressions into Actions with Intra-episodic Feedback from Supportive Teachers
Philipp Sadler | Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

The ability to pick up on language signals in an ongoing interaction is crucial for future machine learning models to collaborate and interact with humans naturally. In this paper, we present an initial study that evaluates intra-episodic feedback given in a collaborative setting. We use a referential language game as a controllable example of a task-oriented collaborative joint activity. A teacher utters a referring expression generated by a well-known symbolic algorithm (the “Incremental Algorithm”) as an initial instruction and then monitors the follower’s actions to possibly intervene with intra-episodic feedback (which does not explicitly have to be requested). We frame this task as a reinforcement learning problem with sparse rewards and learn a follower policy for a heuristic teacher. Our results show that intra-episodic feedback allows the follower to generalize on aspects of scene complexity and performs better than providing only the initial statement.

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks
Sherzod Hakimov | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. While being actively researched, multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. In this work, we ask whether language-only models can be utilised for tasks that require visual input – but also, as we argue, often require a strong reasoning component. Similar to some recent related work, we make visual information accessible to the language model using separate verbalisation models. Specifically, we investigate the performance of open-source, open-access language models against GPT-3 on five vision-language tasks when given textually-encoded visual information. Our results suggest that language models are effective for solving vision-language tasks even with limited samples. This approach also enhances the interpretability of a model’s output by providing a means of tracing the output back through the verbalised image content.

TAPIR: Learning Adaptive Revision for Incremental Natural Language Understanding with a Two-Pass Model
Patrick Kahardipraja | Brielen Madureira | David Schlangen
Findings of the Association for Computational Linguistics: ACL 2023

Language is by its very nature incremental in how it is produced and processed. This property can be exploited by NLP systems to produce fast responses, which has been shown to be beneficial for real-time interactive applications. Recent neural network-based approaches for incremental processing mainly use RNNs or Transformers. RNNs are fast but monotonic (cannot correct earlier output, which can be necessary in incremental processing). Transformers, on the other hand, consume whole sequences, and hence are by nature non-incremental. A restart-incremental interface that repeatedly passes longer input prefixes can be used to obtain partial outputs, while providing the ability to revise. However, this method becomes costly as the sentence grows longer. In this work, we propose the Two-pass model for AdaPtIve Revision (TAPIR) and introduce a method to obtain an incremental supervision signal for learning an adaptive revision policy. Experimental results on sequence labelling show that our model has better incremental performance and faster inference speed compared to restart-incremental Transformers, while showing little degradation on full sequences.

The Road to Quality is Paved with Good Revisions: A Detailed Evaluation Methodology for Revision Policies in Incremental Sequence Labelling
Brielen Madureira | Patrick Kahardipraja | David Schlangen
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Incremental dialogue model components produce a sequence of output prefixes based on incoming input. Mistakes can occur due to local ambiguities or to wrong hypotheses, making the ability to revise past outputs a desirable property that can be governed by a policy. In this work, we formalise and characterise edits and revisions in incremental sequence labelling and propose metrics to evaluate revision policies. We then apply our methodology to profile the incremental behaviour of three Transformer-based encoders in various tasks, paving the road for better revision policies.

Pento-DIARef: A Diagnostic Dataset for Learning the Incremental Algorithm for Referring Expression Generation from Examples
Philipp Sadler | David Schlangen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

NLP tasks are typically defined extensionally through datasets containing example instantiations (e.g., pairs of image _i_ and text _t_), but motivated intensionally through capabilities invoked in verbal descriptions of the task (e.g., “_t_ is a description of _i_, for which the content of _i_ needs to be recognised and understood”). We present Pento-DIARef, a diagnostic dataset in a visual domain of puzzle pieces where referring expressions are generated by a well-known symbolic algorithm (the “Incremental Algorithm”), which itself is motivated by appeal to a hypothesised capability (eliminating distractors through application of Gricean maxims). Our question then is whether the extensional description (the dataset) is sufficient for a neural model to pick up the underlying regularity and exhibit this capability given the simple task definition of producing expressions from visual inputs. We find that a model supported by a vision detection step and a targeted data generation scheme achieves an almost perfect BLEU@1 score and sentence accuracy, whereas simpler baselines do not.

Revising with a Backward Glance: Regressions and Skips during Reading as Cognitive Signals for Revision Policies in Incremental Processing
Brielen Madureira | Pelin Çelikkol | David Schlangen
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

In NLP, incremental processors produce output in instalments, based on incoming prefixes of the linguistic input. Some tokens trigger revisions, causing edits to the output hypothesis, but little is known about why models revise when they revise. A policy that detects the time steps where revisions should happen can improve efficiency. Still, retrieving a suitable signal to train a revision policy is an open problem, since it is not naturally available in datasets. In this work, we investigate the appropriateness of regressions and skips in human reading eye-tracking data as signals to inform revision policies in incremental sequence labelling. Using generalised mixed-effects models, we find that the probability of regressions and skips by humans can potentially serve as useful predictors for revisions in BiLSTMs and Transformer models, with consistent results for various languages.

Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Svetlana Stoyanchev | Shafiq Joty | David Schlangen | Ondrej Dusek | Casey Kennington | Malihe Alikhani
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2022

Norm Participation Grounds Language
David Schlangen
Proceedings of the 2022 CLASP Conference on (Dis)embodiment

The striking recent advances in eliciting seemingly meaningful language behaviour from language-only machine learning models have only made more apparent, through the surfacing of clear limitations, the need to go beyond the language-only mode and to ground these models “in the world”. Proposals for doing so vary in the details, but what unites them is that the solution is sought in the addition of non-linguistic data types such as images or video streams, while largely keeping the mode of learning constant. I propose a different, and more wide-ranging conception of how grounding should be understood: What grounds language is its normative nature. There are standards for doing things right, these standards are public and authoritative, while at the same time acceptance of authority can and must be disputed and negotiated, in interactions in which only bearers of normative status can rightfully participate. What grounds language, then, is the determined use that language users make of it, and what it is grounded in is the community of language users. I sketch this idea, and draw some conclusions for work on computational modelling of meaningful language use.

Anaphoric Phenomena in Situated dialog: A First Round of Annotations
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

We present a first release of 500 documents from the multimodal corpus Tell-me-more (Ilinykh et al., 2019) annotated with coreference information according to the ARRAU guidelines (Poesio et al., 2021). The corpus consists of images and short texts of five sentences. We describe the annotation process and present the adaptations to the original guidelines in order to account for the challenges of grounding the annotations to the image. 50 documents from the 500 available are annotated by two people and used to estimate inter-annotator agreement (IAA) relying on Krippendorff’s alpha.

The slurk Interaction Server Framework: Better Data for Better Dialog Models
Jana Götze | Maike Paetzel-Prüsmann | Wencke Liermann | Tim Diekmann | David Schlangen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents the slurk software, a lightweight interaction server for setting up dialog data collections and running experiments. slurk enables a multitude of settings including text-based, speech and video interaction between two or more humans or humans and bots, and a multimodal display area for presenting shared or private interactive context. The software is implemented in Python with an HTML and JavaScript frontend that can easily be adapted to individual needs. It also provides a setup for pairing participants on common crowdworking platforms such as Amazon Mechanical Turk and some example bot scripts for common interaction scenarios.

New or Old? Exploring How Pre-Trained Language Models Represent Discourse Entities
Sharid Loáiciga | Anne Beyer | David Schlangen
Proceedings of the 29th International Conference on Computational Linguistics

Recent research shows that pre-trained language models, built to generate text conditioned on some context, learn to encode syntactic knowledge to a certain degree. This has motivated researchers to move beyond the sentence-level and look into their ability to encode less studied discourse-level phenomena. In this paper, we add to the body of probing research by investigating discourse entity representations in large pre-trained language models in English. Motivated by early theories of discourse and key pieces of previous work, we focus on the information-status of entities as discourse-new or discourse-old. We present two probing models, one based on binary classification and another one on sequence labeling. The results of our experiments show that pre-trained language models do encode information on whether an entity has been introduced before or not in the discourse. However, this information alone is not sufficient to find the entities in a discourse, opening up interesting questions about the definition of entities for future work.

Generating Coherent and Informative Descriptions for Groups of Visual Objects and Categories: A Simple Decoding Approach
Nazia Attari | David Schlangen | Martin Heckmann | Heiko Wersing | Sina Zarrieß
Proceedings of the 15th International Conference on Natural Language Generation

Can Visual Dialogue Models Do Scorekeeping? Exploring How Dialogue Representations Incrementally Encode Shared Knowledge
Brielen Madureira | David Schlangen
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Cognitively plausible visual dialogue models should keep a mental scoreboard of shared established facts in the dialogue context. We propose a theory-based evaluation method for investigating to what degree models pretrained on the VisDial dataset incrementally build representations that appropriately do scorekeeping. Our conclusion is that the ability to make the distinction between shared and privately known statements along the dialogue is moderately present in the analysed models, but not always incrementally consistent, which may partially be due to the limited need for grounding interactions in the original task.

Generating Landmark-based Manipulation Instructions from Image Pairs
Sina Zarrieß | Henrik Voigt | David Schlangen | Philipp Sadler
Proceedings of the 15th International Conference on Natural Language Generation

2021

Annotating anaphoric phenomena in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

In recent years several corpora have been developed for vision and language tasks. With this paper, we intend to start a discussion on the annotation of referential phenomena in situated dialogue. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. In addition, a rich annotation scheme covering a broad range of referential phenomena and compatible with the textual task of coreference resolution is necessary in order to take the most advantage of these corpora. Consequently, there are several open questions regarding the semantics of reference and annotation, and the extent to which standard textual coreference accounts for the situated dialogue genre. Working with two corpora on situated dialogue, we present our extension to the ARRAU (Uryupina et al., 2020) annotation scheme in order to start this discussion.

Targeting the Benchmark: On Methodology in Current Natural Language Processing Research
David Schlangen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

It has become a common pattern in our field: One group introduces a language task, exemplified by a dataset, which they argue is challenging enough to serve as a benchmark. They also provide a baseline model for it, which then soon is improved upon by other groups. Often, research efforts then move on, and the pattern repeats itself. What is typically left implicit is the argumentation for why this constitutes progress, and progress towards what. In this paper, we try to step back for a moment from this pattern and work out possible argumentations and their parts.

Reference and coreference in situated dialogue
Sharid Loáiciga | Simon Dobnik | David Schlangen
Proceedings of the Second Workshop on Advances in Language and Vision Research

In recent years several corpora have been developed for vision and language tasks. We argue that there is still significant room for corpora that increase the complexity of both visual and linguistic domains and which capture different varieties of perceptual and conversational contexts. Working with two corpora approaching this goal, we present a linguistic perspective on some of the challenges in creating and extending resources combining language and vision while preserving continuity with the existing best practices in the area of coreference annotation.

Incremental Unit Networks for Multimodal, Fine-grained Information State Representation
Casey Kennington | David Schlangen
Proceedings of the 1st Workshop on Multimodal Semantic Representations (MMSR)

We offer a fine-grained information state annotation scheme that follows directly from the Incremental Unit abstract model of dialogue processing when used within a multimodal, co-located, interactive setting. We explain the Incremental Unit model and give an example application using the Localized Narratives dataset, then offer avenues for future research.

Is Incoherence Surprising? Targeted Evaluation of Coherence Prediction from Language Models
Anne Beyer | Sharid Loáiciga | David Schlangen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Coherent discourse is distinguished from a mere collection of utterances by the satisfaction of a diverse set of constraints, for example choice of expression, logical relation between denoted events, and implicit compatibility with world-knowledge. Do neural language models encode such constraints? We design an extendable set of test suites addressing different aspects of discourse and dialogue coherence. Unlike most previous coherence evaluation studies, we address specific linguistic devices beyond sentence order perturbations, which allow for a more fine-grained analysis of what constitutes coherence and what neural models trained on a language modelling objective are capable of encoding. Extending the targeted evaluation paradigm for neural language models (Marvin and Linzen, 2018) to phenomena beyond syntax, we show that this paradigm is equally suited to evaluate linguistic qualities that contribute to the notion of coherence.

Towards Incremental Transformers: An Empirical Analysis of Transformer Models for Incremental NLU
Patrick Kahardipraja | Brielen Madureira | David Schlangen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Incremental processing allows interactive systems to respond based on partial inputs, which is a desirable property e.g. in dialogue agents. The currently popular Transformer architecture inherently processes sequences as a whole, abstracting away the notion of time. Recent work attempts to apply Transformers incrementally via restart-incrementality by repeatedly feeding, to an unchanged model, increasingly longer input prefixes to produce partial outputs. However, this approach is computationally costly and does not scale efficiently for long sequences. In parallel, we witness efforts to make Transformers more efficient, e.g. the Linear Transformer (LT) with a recurrence mechanism. In this work, we examine the feasibility of LT for incremental NLU in English. Our results show that the recurrent LT model has better incremental performance and faster inference speed compared to the standard Transformer and LT with restart-incrementality, at the cost of part of the non-incremental (full sequence) quality. We show that the performance drop can be mitigated by training the model to wait for right context before committing to an output and that training with input prefixes is beneficial for delivering correct partial outputs.

Space Efficient Context Encoding for Non-Task-Oriented Dialogue Generation with Graph Attention Transformer
Fabian Galetzka | Jewgeni Rose | David Schlangen | Jens Lehmann
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To improve the coherence and knowledge retrieval capabilities of non-task-oriented dialogue systems, recent Transformer-based models aim to integrate fixed background context. This often comes in the form of knowledge graphs, and the integration is done by creating pseudo utterances through paraphrasing knowledge triples, added into the accumulated dialogue context. However, the context length is fixed in these architectures, which restricts how much background or dialogue context can be kept. In this work, we propose a more concise encoding for background context structured in the form of knowledge graphs, by expressing the graph connections through restrictions on the attention weights. The results of our human evaluation show that this encoding reduces space requirements without negative effects on the precision of reproduction of knowledge and perceived consistency. Further, models trained with our proposed context encoding generate dialogues that are judged to be more comprehensive and interesting.

2020

A Corpus of Controlled Opinionated and Knowledgeable Movie Discussions for Training Neural Conversation Models
Fabian Galetzka | Chukwuemeka Uchenna Eneh | David Schlangen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Fully data driven Chatbots for non-goal oriented dialogues are known to suffer from inconsistent behaviour across their turns, stemming from a general difficulty in controlling parameters like their assumed background personality and knowledge of facts. One reason for this is the relative lack of labeled data from which personality consistency and fact usage could be learned together with dialogue behaviour. To address this, we introduce a new labeled dialogue dataset in the domain of movie discussions, where every dialogue is based on pre-specified facts and opinions. We thoroughly validate the collected dialogue for adherence of the participants to their given fact and opinion profile, and find that the general quality in this respect is high. This process also gives us an additional layer of annotation that is potentially useful for training models. We introduce as a baseline an end-to-end trained self-attention decoder model trained on this data and show that it is able to generate opinionated responses that are judged to be natural and knowledgeable and show attentiveness.

Incremental Processing in the Age of Non-Incremental Encoders: An Empirical Assessment of Bidirectional Models for Incremental NLU
Brielen Madureira | David Schlangen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While humans process language incrementally, the best language encoders currently used in NLP do not. Both bidirectional LSTMs and Transformers assume that the sequence that is to be encoded is available in full, to be processed either forwards and backwards (BiLSTMs) or as a whole (Transformers). We investigate how they behave under incremental interfaces, when partial output must be provided based on partial input seen up to a certain time step, which may happen in interactive systems. We test five models on various NLU datasets and compare their performance using three incremental evaluation metrics. The results support the possibility of using bidirectional encoders in incremental mode while retaining most of their non-incremental quality. The “omni-directional” BERT model, which achieves better non-incremental performance, is impacted more by the incremental access. This can be alleviated by adapting the training regime (truncated training), or the testing procedure, by delaying the output until some right context is available or by incorporating hypothetical right contexts generated by a language model like GPT-2.

From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain
Robin Rojowiec | Jana Götze | Philipp Sadler | Henrik Voigt | Sina Zarrieß | David Schlangen
Proceedings of the 13th International Conference on Natural Language Generation

While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it; and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Qun Liu | David Schlangen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

2019

Can Neural Image Captioning be Controlled via Forced Attention?
Philipp Sadler | Tatjana Scheffler | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

Learned dynamic weighting of the conditioning signal (attention) has been shown to improve neural language generation in a variety of settings. The weights applied when generating a particular output sequence have also been viewed as providing a potentially explanatory insight in the internal workings of the generator. In this paper, we reverse the direction of this connection and ask whether through the control of the attention of the model we can control its output. Specifically, we take a standard neural image captioning model that uses attention, and fix the attention to predetermined areas in the image. We evaluate whether the resulting output is more likely to mention the class of the object in that area than the normally generated caption. We introduce three effective methods to control the attention and find that these are producing expected results in up to 27.43% of the cases.

Natural Language Semantics With Pictures: Some Language & Vision Datasets and Potential Uses for Computational Semantics
David Schlangen
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

Propelling, and propelled by, the “deep learning revolution”, recent years have seen the introduction of ever larger corpora of images annotated with natural language expressions. We survey some of these corpora, taking a perspective that reverses the usual directionality, as it were, by viewing the images as semantic annotation of the natural language expressions. We discuss datasets that can be derived from the corpora, and tasks of potential interest for computational semanticists that can be defined on those. In this, we make use of relations provided by the corpora (namely, the link between expression and image, and that between two expressions linked to the same image) and relations that we can add (similarity relations between expressions, or between images). Specifically, we show that in this way we can create data that can be used to learn and evaluate lexical and compositional grounded semantics, and we show that the “linked to same image” relation tracks a semantic implication relation that is recognisable to annotators even in the absence of the linking image as evidence. Finally, as an example of possible benefits of this approach, we show that an exemplar-model-based approach to implication beats a (simple) distributional space-based one on some derived datasets, while lending itself to explainability.

Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories
Sina Zarrieß | David Schlangen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Zero-shot learning in Language & Vision is the task of correctly labelling (or naming) objects of novel categories. Another strand of work in L&V aims at pragmatically informative rather than “correct” object descriptions, e.g. in reference games. We combine these lines of research and model zero-shot reference games, where a speaker needs to successfully refer to a novel object in an image. Inspired by models of “rational speech acts”, we extend a neural generator to become a pragmatic speaker reasoning about uncertain object categories. As a result of this reasoning, the generator produces fewer nouns and names of distractor categories as compared to a literal speaker. We show that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.

Tell Me More: A Dataset of Visual Scene Description Sequences
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

We present a dataset consisting of what we call image description sequences, which are multi-sentence descriptions of the contents of an image. These descriptions were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such description discourses.

From Explainability to Explanation: Using a Dialogue Setting to Elicit Annotations with Justifications
Nazia Attari | Martin Heckmann | David Schlangen
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

Despite recent attempts in the field of explainable AI to go beyond black box prediction models, typically already the training data for supervised machine learning is collected in a manner that treats the annotator as a “black box”, the internal workings of which remain unobserved. We present an annotation method where a task is given to a pair of annotators who collaborate on finding the best response. With this we want to shed light on the questions if the collaboration increases the quality of the responses and if this “thinking together” provides useful information in itself, as it at least partially reveals their reasoning steps. Furthermore, we expect that this setting puts the focus on explanation as a linguistic act, vs. explainability as a property of models. In a crowd-sourcing experiment, we investigated three different annotation tasks, each in a collaborative dialogical (two annotators) and monological (one annotator) setting. Our results indicate that our experiment elicits collaboration and that this collaboration increases the response accuracy. We see large differences in the annotators’ behavior depending on the task. Similarly, we also observe that the dialog patterns emerging from the collaboration vary significantly with the task.

2018

Decoding Strategies for Neural Referring Expression Generation
Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

RNN-based sequence generation is now widely used in NLP and NLG (natural language generation). Most work focusses on how to train RNNs, even though also decoding is not necessarily straightforward: previous work on neural MT found seq2seq models to radically prefer short candidates, and has proposed a number of beam search heuristics to deal with this. In this work, we assess decoding strategies for referring expression generation with neural models. Here, expression length is crucial: output should contain neither too much nor too little information, in order to be pragmatically adequate. We find that most beam search heuristics developed for MT do not generalize well to referring expression generation (REG), and do not generally outperform greedy decoding. We observe that beam search heuristics for termination seem to override the model’s knowledge of what a good stopping point is. Therefore, we also explore a recent approach called trainable decoding, which uses a small network to modify the RNN’s hidden state for better decoding results. We find this approach to consistently outperform greedy decoding for REG.

Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG
Sina Zarrieß | David Schlangen
Proceedings of the Workshop on NLG for Human–Robot Interaction

Modeling traditional NLG tasks with data-driven techniques has been a major focus of research in NLG in the past decade. We argue that existing modeling techniques are mostly tailored to textual data and are not sufficient to make NLG technology meet the requirements of agents which target fluid interaction and collaboration in the real world. We revisit interactive instruction giving as a challenge for data-driven NLG and, based on insights from previous GIVE challenges, propose that instruction giving should be addressed in a setting that involves visual grounding and spoken language. These basic design decisions will require NLG frameworks that are capable of monitoring their environment as well as timing and revising their verbal output. We believe that these are core capabilities for making NLG technology transferrable to interactive systems.

A Corpus of Natural Multimodal Spatial Scene Descriptions
Ting Han | David Schlangen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bounded to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.

2017

Grounding Language by Continuous Observation of Instruction Following
Ting Han | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Grounded semantics is typically learnt from utterance-level meaning representations (e.g., successful database retrievals, denoted objects in images, moves in a game). We explore learning word and utterance meanings by continuous observation of the actions of an instruction follower (IF). While an instruction giver (IG) provided a verbal description of a configuration of objects, IF recreated it using a GUI. Aligning these GUI actions to sub-utterance chunks allows a simple maximum entropy model to associate them as chunk meaning better than just providing it with the utterance-final configuration. This shows that semantics useful for incremental (word-by-word) application, as required in natural dialogue, might also be better acquired from incremental settings.

Natural Language Informs the Interpretation of Iconic Gestures: A Computational Approach
Ting Han | Julian Hough | David Schlangen
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

When giving descriptions, speakers often signify object shape or size with hand gestures. Such so-called ‘iconic’ gestures represent their meaning through their relevance to referents in the verbal content, rather than having a conventional form. The gesture form on its own is often ambiguous, and the aspect of the referent that it highlights is constrained by what the language makes salient. We show how the verbal content guides gesture interpretation through a computational model that frames the task as a multi-label classification task that maps multimodal utterances to semantic categories, using annotated human-human data.

Deriving continuous grounded meaning representations from referentially structured multimodal contexts
Sina Zarrieß | David Schlangen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Corpora of referring expressions paired with their visual referents are a good source for learning word meanings directly grounded in visual representations. Here, we explore additional ways of extracting from them word representations linked to multi-modal context: through expressions that refer to the same object, and through expressions that refer to different objects in the same scene. We show that continuous meaning representations derived from these contexts capture complementary aspects of similarity, even if not outperforming textual embeddings trained on very large amounts of raw text when tested on standard similarity benchmarks. We propose a new task for evaluating grounded meaning representations (detection of potentially co-referential phrases) and show that it requires precise denotational representations of attribute meanings, which our method provides.

Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task
Ting Han | David Schlangen
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

While language conveys meaning largely symbolically, actual communication acts typically contain iconic elements as well: People gesture while they speak, or may even draw sketches while explaining something. Image retrieval prima facie seems like a task that could profit from combined symbolic and iconic reference, but it is typically set up to work either from language only, or via (iconic) sketches with no verbal contribution. Using a model of grounded language semantics and a model of sketch-to-image mapping, we show that adding even very reduced iconic information to a verbal image description improves recall. Verbal descriptions paired with fully detailed sketches still perform better than these sketches alone. We see these results as supporting the assumption that natural user interfaces should respond to multimodal input, where possible, rather than just language alone.

Obtaining referential word meanings from visual and distributional information: Experiments on object naming
Sina Zarrieß | David Schlangen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate object naming, which is an important sub-task of referring expression generation on real-world images. As opposed to mutually exclusive labels used in object recognition, object names are more flexible, subject to communicative preferences and semantically related to each other. Therefore, we investigate models of referential word meaning that link visual to lexical information which we assume to be given through distributional word embeddings. We present a model that learns individual predictors for object names that link visual and distributional aspects of word meaning during training. We show that this is particularly beneficial for zero-shot learning, as compared to projecting visual objects directly into the distributional space. In a standard object naming task, we find that different ways of combining lexical and visual information achieve very similar performance, though experiments on model combination suggest that they capture complementary aspects of referential meaning.

Is this a Child, a Girl or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings
Sina Zarrieß | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

There has recently been a lot of work trying to use images of referents of words for improving vector space meaning representations derived from text. We investigate the opposite direction, as it were, trying to improve visual word predictors that identify objects in images, by exploiting distributional similarity information during training. We show that for certain words (such as entry-level nouns or hypernyms), we can indeed learn better referential word meanings by taking into account their semantic similarity to other words. For other words, there is no or even a detrimental effect, compared to a learning setup that presents even semantically related objects as negative instances.

Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images
Sina Zarrieß | M. Soledad López Gambino | David Schlangen
Proceedings of the 10th International Conference on Natural Language Generation

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present on-going work on incremental generation of spoken expressions referring to objects in real-world images. This approach extends upon previous work using the words-as-classifier model for generation. We implement this generator in an incremental dialogue processing framework such that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

Joint, Incremental Disfluency Detection and Utterance Segmentation from Speech
Julian Hough | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We present the joint task of incremental disfluency detection and utterance segmentation and a simple deep learning system which performs it on transcripts and ASR results. We show how the constraints of the two tasks interact. Our joint-task system outperforms the equivalent individual task systems, provides competitive results and is suitable for future use in conversation agents in the psychiatric domain.

Beyond On-hold Messages: Conversational Time-buying in Task-oriented Dialogue
Soledad López Gambino | Sina Zarrieß | David Schlangen
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

A common convention in graphical user interfaces is to indicate a “wait state”, for example while a program is preparing a response, through a changed cursor state or a progress bar. What should the analogue be in a spoken conversational system? To address this question, we set up an experiment in which a human information provider (IP) was given their information only in a delayed and incremental manner, which systematically created situations where the IP had the turn but could not provide task-related information. Our data analysis shows that 1) IPs bridge the gap until they can provide information by re-purposing a whole variety of task- and grounding-related communicative actions (e.g. echoing the user’s request, signaling understanding, asserting partially relevant information), rather than being silent or explicitly asking for time (e.g. “please wait”), and that 2) IPs combined these actions productively to ensure an ongoing conversation. These results, we argue, indicate that natural conversational interfaces should also be able to manage their time flexibly using a variety of conversational resources.

2016

Supporting Spoken Assistant Systems with a Graphical User Interface that Signals Incremental Understanding and Prediction State
Casey Kennington | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Real-Time Understanding of Complex Discriminative Scene Descriptions
Ramesh Manuvinakurike | Casey Kennington | David DeVault | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In order to explore intuitive verbal and non-verbal interfaces in smart environments we recorded user interactions with an intelligent apartment. Besides offering various interactive capabilities itself, the apartment is also inhabited by a social robot that is available as a humanoid interface. This paper presents a multi-modal corpus that contains goal-directed actions of naive users in attempts to solve a number of predefined tasks. Alongside audio and video recordings, our data-set consists of a large amount of temporally aligned sensory data and system behavior provided by the environment and its interactive components. Non-verbal system responses such as changes in light or display contents, as well as robot and apartment utterances and gestures serve as a rich basis for later in-depth analysis. Manual annotations provide further information about meta data like the current course of study and user behavior including the incorporated modality, all literal utterances, language features, emotional expressions, foci of attention, and addressees.

Toward incremental dialogue act segmentation in fast-paced interactive dialogue systems
Ramesh Manuvinakurike | Maike Paetzel | Cheng Qu | David Schlangen | David DeVault
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs
Sina Zarrieß | David Schlangen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected?
Sina Zarrieß | David Schlangen
Proceedings of the 9th International Natural Language Generation conference

PentoRef: A Corpus of Spoken References in Task-oriented Dialogues
Sina Zarrieß | Julian Hough | Casey Kennington | Ramesh Manuvinakurike | David DeVault | Raquel Fernández | David Schlangen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

PentoRef is a corpus of task-oriented dialogues collected in systematically manipulated settings. The corpus is multilingual, with English and German sections, and overall comprises more than 20000 utterances. The dialogues are fully transcribed and annotated with referring expressions mapped to objects in corresponding visual scenes, which makes the corpus a rich resource for research on spoken referring expressions in generation and resolution. The corpus includes several sub-corpora that correspond to different dialogue situations where parameters related to interactivity, visual access, and verbal channel have been manipulated in systematic ways. The corpus thus lends itself to very targeted studies of reference in spontaneous dialogue.

Investigating Fluidity for Human-Robot Interaction with Real-time, Real-world Grounding Strategies
Julian Hough | David Schlangen
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter
Julian Hough | Ye Tian | Laura de Ruiter | Simon Betz | Spyros Kousidis | David Schlangen | Jonathan Ginzburg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present the DUEL corpus, consisting of 24 hours of natural, face-to-face, loosely task-directed dialogue in German, French and Mandarin Chinese. The corpus is uniquely positioned as a cross-linguistic, multimodal dialogue resource controlled for domain. DUEL includes audio, video and body tracking data and is transcribed and annotated for disfluency, laughter and exclamations.

Resolving References to Objects in Photographs using the Words-As-Classifiers Model
David Schlangen | Sina Zarrieß | Casey Kennington
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution
Casey Kennington | Livia Dia | David Schlangen
Proceedings of the 11th International Conference on Computational Semantics

Incremental Semantics for Dialogue Processing: Requirements, and a Comparison of Two Approaches
Julian Hough | Casey Kennington | David Schlangen | Jonathan Ginzburg
Proceedings of the 11th International Conference on Computational Semantics

Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution
Casey Kennington | David Schlangen
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Reading Times Predict the Quality of Generated Text Above and Beyond Human Ratings
Sina Zarrieß | Sebastian Loth | David Schlangen
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

Incrementally Tracking Reference in Human/Human Dialogue Using Linguistic and Extra-Linguistic Information
Casey Kennington | Ryu Iida | Takenobu Tokunaga | David Schlangen
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

InproTKs: A Toolkit for Incremental Situated Processing
Casey Kennington | Spyros Kousidis | David Schlangen
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

Situationally Aware In-Car Information Presentation Using Incremental Speech Generation: Safer, and More Effective
Spyros Kousidis | Casey Kennington | Timo Baumann | Hendrik Buschmeier | Stefan Kopp | David Schlangen
Proceedings of the EACL 2014 Workshop on Dialogue in Motion

Situated Incremental Natural Language Understanding using a Multimodal, Linguistically-driven Update Model
Casey Kennington | Spyros Kousidis | David Schlangen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Investigating speaker gaze and pointing behaviour in human-computer interaction with the mint.tools collection
Spyros Kousidis | Casey Kennington | David Schlangen
Proceedings of the SIGDIAL 2013 Conference

Open-ended, Extensible System Utterances Are Preferred, Even If They Require Filled Pauses
Timo Baumann | David Schlangen
Proceedings of the SIGDIAL 2013 Conference

Interpreting Situated Dialogue Utterances: an Update Model that Uses Speech, Gaze, and Gesture Information
Casey Kennington | Spyros Kousidis | David Schlangen
Proceedings of the SIGDIAL 2013 Conference

2012

Combining Incremental Language Generation and Incremental Speech Synthesis for Adaptive Information Presentation
Hendrik Buschmeier | Timo Baumann | Benjamin Dosch | Stefan Kopp | David Schlangen
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Markov Logic Networks for Situated Incremental Natural Language Understanding
Casey Kennington | David Schlangen
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Joint Satisfaction of Syntactic and Pragmatic Constraints Improves Incremental Spoken Language Understanding
Andreas Peldszus | Okko Buß | Timo Baumann | David Schlangen
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Incremental Construction of Robust but Deep Semantic Representations for Use in Responsive Dialogue Systems
Andreas Peldszus | David Schlangen
Proceedings of the Workshop on Advances in Discourse Analysis and its Computational Aspects

The Future of Spoken Dialogue Systems is in their Past: Long-Term Adaptive, Conversational Assistants
David Schlangen
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)

The InproTK 2012 release
Timo Baumann | David Schlangen
NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012)

INPRO_iSS: A Component for Just-In-Time Incremental Speech Synthesis
Timo Baumann | David Schlangen
Proceedings of the ACL 2012 System Demonstrations

2011

Evaluation and Optimisation of Incremental Processors
Timo Baumann | Okko Buß | David Schlangen
Dialogue & Discourse Volume 2

Incremental spoken dialogue systems, which process user input as it unfolds, pose additional engineering challenges compared to more standard non-incremental systems: Their processing components must be able to accept partial, and possibly subsequently revised input, and must produce output that is at the same time as accurate as possible and delivered with as little delay as possible. In this article, we define metrics that measure how well a given processor meets these challenges, and we identify types of gold standards for evaluation. We exemplify these metrics in the evaluation of several incremental processors that we have developed. We also present generic means to optimise some of the measures, if certain trade-offs are accepted. We believe that this work will help enable principled comparison of components for incremental dialogue systems and portability of results.

A General, Abstract Model of Incremental Dialogue Processing
David Schlangen | Gabriel Skantze
Dialogue & Discourse Volume 2

We present a general model and conceptual framework for specifying architectures for incremental processing in dialogue systems, in particular with respect to the topology of the network of modules that make up the system, the way information flows through this network, how information increments are ‘packaged’, and how these increments are processed by the modules. This model enables the precise specification of incremental systems and hence facilitates detailed comparisons between systems, as well as giving guidance on designing new systems. In particular, the model can serve as a framework for specifying module communication in such systems, as we illustrate with some examples.

Predicting the Micro-Timing of User Input for an Incremental Spoken Dialogue System that Completes a User’s Ongoing Turn
Timo Baumann | David Schlangen
Proceedings of the SIGDIAL 2011 Conference

Introduction to the Special Issue on Incremental Processing in Dialogue
Hannes Rieser | David Schlangen
Dialogue & Discourse Volume 2

A brief introduction to the topics discussed in the special issue, and to the individual papers.

2010

Comparing Local and Sequential Models for Statistical Incremental Natural Language Understanding
Silvan Heintze | Timo Baumann | David Schlangen
Proceedings of the SIGDIAL 2010 Conference

Collaborating on Utterances with a Spoken Dialogue System Using an ISU-based Approach to Incremental Dialogue Management
Okko Buß | Timo Baumann | David Schlangen
Proceedings of the SIGDIAL 2010 Conference

Middleware for Incremental Processing in Conversational Agents
David Schlangen | Timo Baumann | Hendrik Buschmeier | Okko Buß | Stefan Kopp | Gabriel Skantze | Ramin Yaghoubzadeh
Proceedings of the SIGDIAL 2010 Conference

2009

Incremental Reference Resolution: The Task, Metrics for Evaluation, and a Bayesian Filtering Model that is Sensitive to Disfluencies
David Schlangen | Timo Baumann | Michaela Atterer
Proceedings of the SIGDIAL 2009 Conference

TELIDA: A Package for Manipulation and Visualization of Timed Linguistic Data
Titus von der Malsburg | Timo Baumann | David Schlangen
Proceedings of the SIGDIAL 2009 Conference

A General, Abstract Model of Incremental Dialogue Processing
David Schlangen | Gabriel Skantze
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Assessing and Improving the Performance of Speech Recognition for Incremental Systems
Timo Baumann | Michaela Atterer | David Schlangen
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

RUBISC - a Robust Unification-Based Incremental Semantic Chunker
Michaela Atterer | David Schlangen
Proceedings of SRSL 2009, the 2nd Workshop on Semantic Representation of Spoken Language

Incremental Dialogue Processing in a Micro-Domain
Gabriel Skantze | David Schlangen
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

2008

Towards Incremental End-of-Utterance Detection in Dialogue Systems
Michaela Atterer | Timo Baumann | David Schlangen
Coling 2008: Companion volume: Posters

A Simple Method for Resolution of Definite Reference in a Shared Visual Context
Alexander Siebert | David Schlangen
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue
David Schlangen | Beth Ann Hockey
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

2007

An Implemented Method for Distributed Collection and Assessment of Speech Data
Alexander Siebert | David Schlangen | Raquel Fernández
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

Beyond Repair – Testing the Limits of the Conversational Repair System
David Schlangen | Raquel Fernández
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

Referring under Restricted Interactivity Conditions
Raquel Fernández | Tatjana Lucht | David Schlangen
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2005

Towards Finding and Fixing Fragments—Using ML to Identify Non-Sentential Utterances and their Antecedents in Multi-Party Dialogue
David Schlangen
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Causes and Strategies for Requesting Clarification in Dialogue
David Schlangen
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004

Feeding OWL: Extracting and Representing the Content of Pathology Reports
David Schlangen | Manfred Stede | Elena Paslaru Bontas
Proceedings of the Workshop on NLP and XML (NLPXML-2004): RDF/RDFS and OWL in Language Technology

2003

The interpretation of non-sentential utterances in dialogue
David Schlangen | Alex Lascarides
Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue

Co-authors

Philipp Sadler 10

Raquel Fernández 7

Spyros Kousidis 6

Sharid Loáiciga 5

Michaela Atterer 4

Raffaella Bernardi 4

Kranti Chalamalasetti 4

Patrick Kahardipraja 4

Alexander Koller 4

Gabriel Skantze 4

Hendrik Buschmeier 3

David DeVault 3

Mario Giulianelli 3

Ramesh Manuvinakurike 3

Alessandro Suglia 3

Fabian Galetzka 2

Jonathan Ginzburg 2

Martin Heckmann 2

Nikolai Ilinykh 2

M. Soledad López Gambino 2

Filippo Momentè 2

Andreas Peldszus 2

Antonia Schmidt 2

Alexander Siebert 2

Alberto Testoni 2

Yerkezhan Abdullayeva 1

Malihe Alikhani 1

Anna Bavaresco 1

Roland Bernard 1

Jasmin Bernotat 1

Leonardo Bertolazzi 1

Timo Baumann Okko Buß 1

Birte Carlmeyer 1

Philipp Cimiano 1

Benjamin Dosch 1

Ondřej Dušek 1

Desmond Elliott 1

Chukwuemeka Uchenna Eneh 1

Kai Frederic Engelmann 1

Friederike Eyssel 1

Ambra Ferrari 1

Luca Franceschetti 1

Michael Hanna 1

Silvan Heintze 1

Thomas Hermann 1

Beth Ann Hockey 1

Patrick Holthaus 1

Isidora Jeknić 1

Kushal Koshti 1

Franz Kummert 1

Norman Köster 1

Alex Lascarides 1

Christian Leichsenring 1

Wencke Liermann 1

Sebastian Loth 1

Tatjana Lucht 1

Titus Von Der Malsburg 1

André F. T. Martins 1

Davide Mazzaccara 1

Philipp Mondorf 1

Vera Neplenbroek 1

Maike Paetzel 1

Maike Paetzel-Prüsmann 1

Elena Paslaru Bontas 1

Sandro Pezzelle 1

Lara Pfennigschmidt 1

Barbara Plank 1

Massimo Poesio 1

Marian Pohling 1

Kristina Richert 1

Viktor Richter 1

Hannes Rieser 1

Robin Rojowiec 1

Tatjana Scheffler 1

Birte Schiffhauer 1

Manfred Stede 1

Svetlana Stoyanchev 1

Michael Sullivan 1

Aditya K Surikuchi 1

Takenobu Tokunaga 1

Michael Vrazitulis 1

Sven Wachsmuth 1

Heiko Wersing 1

Sebastian Wrede 1

Ramin Yaghoubzadeh 1

Laura de Ruiter 1

Sebastian Meyer zu Borgsen 1

Pelin Çelikkol 1

Venues