2024
pdf
bib
abs
The Illusion of Competence: Evaluating the Effect of Explanations on Users’ Mental Models of Visual Question Answering Systems
Judith Sieker
|
Simeon Junker
|
Ronja Utescher
|
Nazia Attari
|
Heiko Wersing
|
Hendrik Buschmeier
|
Sina Zarrieß
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We examine how users perceive the limitations of an AI system when it encounters a task that it cannot perform perfectly and whether providing explanations alongside its answers aids users in constructing an appropriate mental model of the system’s capabilities and limitations. We employ a visual question answer and explanation task where we control the AI system’s limitations by manipulating the visual inputs: during inference, the system either processes full-color or grayscale images. Our goal is to determine whether participants can perceive the limitations of the system. We hypothesize that explanations will make limited AI capabilities more transparent to users. However, our results show that explanations do not have this effect. Instead of allowing users to more accurately assess the limitations of the AI system, explanations generally increase users’ perceptions of the system’s competence – regardless of its actual performance.
pdf
bib
abs
Resilience through Scene Context in Visual Referring Expression Generation
Simeon Junker
|
Sina Zarrieß
Proceedings of the 17th International Natural Language Generation Conference
Scene context is well known to facilitate humans’ perception of visible objects. In this paper, we investigate the role of context in Referring Expression Generation (REG) for objects in images, where existing research has often focused on distractor contexts that exert pressure on the generator. We take a new perspective on scene context in REG and hypothesize that contextual information can be conceived of as a resource that makes REG models more resilient and facilitates the generation of object descriptions, and object types in particular. We train and test Transformer-based REG models with target representations that have been artificially obscured with noise to varying degrees. We evaluate how properties of the models’ visual context affect their processing and performance. Our results show that even simple scene contexts make models surprisingly resilient to perturbations, to the extent that they can identify referent types even when visual information about the target is completely missing.
2023
pdf
bib
abs
Keeping an Eye on Context: Attention Allocation over Input Partitions in Referring Expression Generation
Simeon Schüz
|
Sina Zarrieß
Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023)
In Referring Expression Generation, model inputs are often composed of different representations, including the visual properties of the intended referent, its relative position and size, and the visual context. Yet, the extent to which this information influences the generation process of black-box neural models is largely unclear. We investigate the relative weighting of target, location, and context information in the attention components of a Transformer-based generation model. Our results show a general target bias, which, however, depends on the content of the generated expressions, pointing to interesting directions for future research.
2022
pdf
bib
abs
Exploring Semantic Spaces for Detecting Clustering and Switching in Verbal Fluency
Özge Alacam
|
Simeon Schüz
|
Martin Wegrzyn
|
Johanna Kißler
|
Sina Zarrieß
Proceedings of the 29th International Conference on Computational Linguistics
In this work, we explore the fitness of various word/concept representations in analyzing an experimental verbal fluency dataset providing human responses to 10 different category enumeration tasks. Based on human annotations of so-called clusters and switches between sub-categories in the verbal fluency sequences, we analyze whether lexical semantic knowledge represented in word embedding spaces (GloVe, fastText, ConceptNet, BERT) is suitable for detecting these conceptual clusters and switches within and across different categories. Our results indicate that ConceptNet embeddings, a distributional semantics method enriched with taxonomical relations, outperforms other semantic representations by a large margin. Moreover, category-specific analysis suggests that individual thresholds per category are more suited for the analysis of clustering and switching in particular embedding sub-space instead of a one-fits-all cross-category solution. The results point to interesting directions for future work on probing word embedding models on the verbal fluency task.
2021
pdf
bib
abs
Decoupling Pragmatics: Discriminative Decoding for Referring Expression Generation
Simeon Schüz
|
Sina Zarrieß
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)
The shift to neural models in Referring Expression Generation (REG) has enabled more natural set-ups, but at the cost of interpretability. We argue that integrating pragmatic reasoning into the inference of context-agnostic generation models could reconcile traits of traditional and neural REG, as this offers a separation between context-independent, literal information and pragmatic adaptation to context. With this in mind, we apply existing decoding strategies from discriminative image captioning to REG and evaluate them in terms of pragmatic informativity, likelihood to ground-truth annotations and linguistic diversity. Our results show general effectiveness, but a relatively small gain in informativity, raising important questions for REG in general.
pdf
bib
abs
Decoding, Fast and Slow: A Case Study on Balancing Trade-Offs in Incremental, Character-level Pragmatic Reasoning
Sina Zarrieß
|
Hendrik Buschmeier
|
Ting Han
|
Simeon Schüz
Proceedings of the 14th International Conference on Natural Language Generation
Recent work has adopted models of pragmatic reasoning for the generation of informative language in, e.g., image captioning. We propose a simple but highly effective relaxation of fully rational decoding, based on an existing incremental and character-level approach to pragmatically informative neural image captioning. We implement a mixed, ‘fast’ and ‘slow’, speaker that applies pragmatic reasoning occasionally (only word-initially), while unrolling the language model. In our evaluation, we find that increased informativeness through pragmatic decoding generally lowers quality and, somewhat counter-intuitively, increases repetitiveness in captions. Our mixed speaker, however, achieves a good balance between quality and informativeness.
pdf
bib
abs
Diversity as a By-Product: Goal-oriented Language Generation Leads to Linguistic Variation
Simeon Schüz
|
Ting Han
|
Sina Zarrieß
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
The ability for variation in language use is necessary for speakers to achieve their conversational goals, for instance when referring to objects in visual environments. We argue that diversity should not be modelled as an independent objective in dialogue, but should rather be a result or by-product of goal-oriented language generation. Different lines of work in neural language generation investigated decoding methods for generating more diverse utterances, or increasing the informativity through pragmatic reasoning. We connect those lines of work and analyze how pragmatic reasoning during decoding affects the diversity of generated image captions. We find that boosting diversity itself does not result in more pragmatically informative captions, but pragmatic reasoning does increase lexical diversity. Finally, we discuss whether the gain in informativity is achieved in linguistically plausible ways.
2020
pdf
bib
abs
Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms
Simeon Schüz
|
Sina Zarrieß
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
In human cognition, world knowledge supports the perception of object colours: knowing that trees are typically green helps to perceive their colour in certain contexts. We go beyond previous studies on colour terms using isolated colour swatches and study visual grounding of colour terms in realistic objects. Our models integrate processing of visual information and object-specific knowledge via hard-coded (late) or learned (early) fusion. We find that both models consistently outperform a bottom-up baseline that predicts colour terms solely from visual inputs, but show interesting differences when predicting atypical colours of so-called colour diagnostic objects. Our models also achieve promising results when tested on new object categories not seen during training.