Thomas Baier


2025

How can flexibility and control over the interpretation of multimodal signals by embodied agents be balanced? Flexibility means that agents respond fluently in any context, whereas control means that responses are transparent and faithful to explicitly defined goals and principles. This paper describes a modular platform for creating multimodal interactive agents. The platform uses an event bus on which signals and interpretations are posted as a sequence in time, and it also provides control options to drive the interaction given specific intentions and goals. Different sensors and interpretation components can be integrated by defining their input and output topics in the event bus, which results in an open multimodal sequence-driven workflow for further interpretations. In addition, our platform allows us to define higher-level intents that control sequence patterns to achieve a goal. A key component is an episodic Knowledge Graph (eKG) that acts as a long-term symbolic memory to aggregate and connect these interpretations. This eKG establishes coherence and continuity across different interactions. Intents and the eKG make it possible to define different (embodied) agents and compare their behavior without having to implement complex software components for multimodal sensor data or design the control over their dependencies. In this paper, we explain the broad range of components that we developed and integrated into various interactive agents. We also explain how the interaction is recorded as multimodal data and how it results in an aggregated memory in the eKG. By analyzing the recorded interaction, we can compare agents and agent components and study their interactive behavior with people and other agents.
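The topic-based integration described in this abstract can be illustrated with a minimal sketch. It is not the platform's actual API: the `EventBus` class, the `connect` helper, and the topic names (`audio.raw`, `text.transcript`, `text.mentions`) are all hypothetical, and the two "components" are toy stand-ins for real sensors and interpreters. The sketch only shows how declaring input and output topics chains components into a sequence and leaves a time-ordered record of events.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any
import itertools


@dataclass
class Event:
    topic: str
    payload: Any
    seq: int  # position in the global, time-ordered sequence


class EventBus:
    """Hypothetical in-process bus; a real platform would likely be asynchronous."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self._counter = itertools.count()
        self.log = []  # every event, in the order it was posted

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        event = Event(topic, payload, next(self._counter))
        self.log.append(event)
        for handler in list(self._subscribers[topic]):
            handler(event)
        return event


def connect(bus, input_topic, output_topic, interpret):
    """Register a component purely by its input and output topics."""
    def handler(event):
        bus.publish(output_topic, interpret(event.payload))
    bus.subscribe(input_topic, handler)


bus = EventBus()
# Toy "speech recognizer": reads the transcript field of a fake audio signal.
connect(bus, "audio.raw", "text.transcript", lambda audio: audio["speech"])
# Toy "mention detector": treats capitalized words as candidate mentions.
connect(bus, "text.transcript", "text.mentions",
        lambda text: [word for word in text.split() if word.istitle()])

bus.publish("audio.raw", {"speech": "Hello my name is Thomas"})
topics = [event.topic for event in bus.log]
```

Because each component only names its topics, a new interpreter can be added by subscribing to an existing topic without changing the components upstream of it.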

2024

Linguistic conventions that arise in dialogue reflect common ground and can increase communicative efficiency. Social robots that can understand these conventions and the process by which they arise have the potential to become efficient communication partners. Nevertheless, it is unclear how robots can engage in convention formation when presented with both familiar and new information. We introduce an adaptable game platform, SPOTTER, to study the dynamics of convention formation for visually grounded referring expressions in both human-human and human-robot interaction. Specifically, we seek to elicit convention formation for members of an inner circle of well-known individuals in the common ground, as opposed to individuals from an outer circle, who are unfamiliar. We release an initial corpus of 5000 utterances from two exploratory pilot experiments in Dutch. In contrast to previous work focussing on human-human interaction, we find that referring expressions for both familiar and unfamiliar individuals maintain their length throughout human-robot interaction. Stable conventions are formed, although these conventions can be impacted by distracting outer circle individuals. With our distinction between familiar and unfamiliar, we create a contrastive operationalization of common ground, which aids research into convention formation.

2022

We present a new method based on episodic Knowledge Graphs (eKGs) for evaluating (multimodal) conversational agents in open domains. This graph is generated by interpreting raw signals during conversation and is able to capture the accumulation of knowledge over time. We apply structural and semantic analysis to the resulting graphs and translate their properties into qualitative measures. We compare these measures with existing automatic and manual evaluation metrics commonly used for conversational agents. Our results show that our Knowledge-Graph-based evaluation provides more qualitative insights into interaction and the agent’s behavior.
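As a rough illustration of what structural analysis of an accumulating graph might look like, the sketch below tracks a few simple properties after each conversational turn. The chosen measures (node count, edge count, average degree), the `structural_measures` function, and the example triples are assumptions made for illustration; they are not the measures reported in the paper.

```python
def structural_measures(triples):
    """Compute simple structural properties of a set of (subject, predicate, object) triples."""
    nodes = set()
    edges = set()
    for subject, predicate, obj in triples:
        nodes.update((subject, obj))
        edges.add((subject, predicate, obj))  # duplicate claims collapse into one edge
    n, m = len(nodes), len(edges)
    return {
        "nodes": n,
        "edges": m,
        "avg_degree": 2 * m / n if n else 0.0,
    }


# Hypothetical interpretations extracted from three conversational turns.
turns = [
    [("thomas", "likes", "robots")],
    [("thomas", "lives_in", "amsterdam"), ("robots", "type", "machine")],
    [("thomas", "likes", "robots")],  # a repeated claim adds no new structure
]

graph = []
history = []
for triples in turns:
    graph.extend(triples)
    history.append(structural_measures(graph))
```

Comparing consecutive entries of `history` shows whether a turn contributed new knowledge (the second turn) or only repeated what the graph already held (the third turn).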

2021

We present EMISSOR: a platform to capture multimodal interactions as recordings of episodic experiences with explicit referential interpretations that also yield an episodic Knowledge Graph (eKG). The platform stores streams of multiple modalities as parallel signals. Each signal is segmented and annotated independently with interpretations. Annotations are eventually mapped to explicit identities and relations in the eKG. As we ground signal segments from different modalities to the same instance representations, we also ground different modalities across each other. Unique to our eKG is that it accepts different interpretations across modalities, sources and experiences and supports reasoning over conflicting information and uncertainties that may result from multimodal experiences. EMISSOR can record and annotate experiments in virtual and real-world settings, combine data, and evaluate system behavior and performance against preset goals. It can also model the accumulation of knowledge and interpretations in the Knowledge Graph as a result of these episodic experiences.
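The idea of grounding segments from different modalities in the same identity can be sketched as a small data structure. This is not EMISSOR's data format: the `Segment` and `Annotation` classes, the `ex:` identifiers, the source names, and the use of simple start/end offsets for every modality (an image segment would more plausibly be a bounding box) are all simplifying assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    signal_id: str
    modality: str
    start: int  # simplified offset; real image segments would need 2D bounds
    end: int


@dataclass(frozen=True)
class Annotation:
    segment: Segment
    identity: str  # hypothetical identifier of an instance in the eKG
    source: str    # component that produced the interpretation


def ground_by_identity(annotations):
    """Group annotated segments by the identity they refer to."""
    grounded = defaultdict(list)
    for annotation in annotations:
        grounded[annotation.identity].append(annotation.segment)
    return dict(grounded)


annotations = [
    Annotation(Segment("img-1", "image", 120, 480), "ex:thomas", "face-recognizer"),
    Annotation(Segment("txt-1", "text", 17, 23), "ex:thomas", "entity-linker"),
    Annotation(Segment("txt-1", "text", 0, 5), "ex:greeting", "dialogue-act-tagger"),
]

grounded = ground_by_identity(annotations)
# Identities referenced from more than one modality link those modalities to each other.
cross_modal = {
    identity: sorted({segment.modality for segment in segments})
    for identity, segments in grounded.items()
    if len({segment.modality for segment in segments}) > 1
}
```

Keeping the `source` on each annotation is what would allow conflicting interpretations of the same segment to coexist and be compared later, rather than being overwritten.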