Sina Zarrieß

Also published as: Sina Zarriess


2021

Decoding, Fast and Slow: A Case Study on Balancing Trade-Offs in Incremental, Character-level Pragmatic Reasoning
Sina Zarrieß | Hendrik Buschmeier | Ting Han | Simeon Schüz
Proceedings of the 14th International Conference on Natural Language Generation

Recent work has adopted models of pragmatic reasoning for the generation of informative language in, e.g., image captioning. We propose a simple but highly effective relaxation of fully rational decoding, based on an existing incremental and character-level approach to pragmatically informative neural image captioning. We implement a mixed, ‘fast’ and ‘slow’, speaker that applies pragmatic reasoning occasionally (only word-initially), while unrolling the language model. In our evaluation, we find that increased informativeness through pragmatic decoding generally lowers quality and, somewhat counter-intuitively, increases repetitiveness in captions. Our mixed speaker, however, achieves a good balance between quality and informativeness.

Proceedings of the 14th International Conference on Computational Semantics (IWCS)
Sina Zarrieß | Johan Bos | Rik van Noord | Lasha Abzianidze
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Decoupling Pragmatics: Discriminative Decoding for Referring Expression Generation
Simeon Schüz | Sina Zarrieß
Proceedings of the Reasoning and Interaction Conference (ReInAct 2021)

The shift to neural models in Referring Expression Generation (REG) has enabled more natural set-ups, but at the cost of interpretability. We argue that integrating pragmatic reasoning into the inference of context-agnostic generation models could reconcile traits of traditional and neural REG, as this offers a separation between context-independent, literal information and pragmatic adaptation to context. With this in mind, we apply existing decoding strategies from discriminative image captioning to REG and evaluate them in terms of pragmatic informativity, likelihood to ground-truth annotations and linguistic diversity. Our results show general effectiveness, but a relatively small gain in informativity, raising important questions for REG in general.

What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts
Ronja Utescher | Sina Zarrieß
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Multi-modal texts are abundant and diverse in structure, yet Language & Vision research on these naturally occurring texts has mostly focused on genres that are comparatively light on text, like tweets. In this paper, we discuss the challenges and potential benefits of a L&V framework that explicitly models referential relations, taking Wikipedia articles about buildings as an example. We briefly survey existing related tasks in L&V and propose multi-modal information extraction as a general direction for future research.

Challenges in Designing Natural Language Interfaces for Complex Visual Models
Henrik Voigt | Monique Meuschke | Kai Lawonn | Sina Zarrieß
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

Intuitive interaction with visual models is becoming an increasingly important task in the field of Visualization (VIS), and verbal interaction represents a significant aspect of it. Conversely, modeling verbal interaction in visual environments is a major trend in ongoing research in NLP. To date, research on Language & Vision, however, mostly happens at the intersection of NLP and Computer Vision (CV), and much less at the intersection of NLP and Visualization, which is an important area in Human-Computer Interaction (HCI). This paper presents a brief survey of recent work on interactive tasks and set-ups in NLP and Visualization. We discuss the respective methods, show interesting gaps, and conclude by suggesting neural, visually grounded dialogue modeling as a promising direction for natural language interfaces (NLIs) for visual models.

Diversity as a By-Product: Goal-oriented Language Generation Leads to Linguistic Variation
Simeon Schüz | Ting Han | Sina Zarrieß
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

The ability for variation in language use is necessary for speakers to achieve their conversational goals, for instance when referring to objects in visual environments. We argue that diversity should not be modelled as an independent objective in dialogue, but should rather be a result or by-product of goal-oriented language generation. Different lines of work in neural language generation investigated decoding methods for generating more diverse utterances, or increasing the informativity through pragmatic reasoning. We connect those lines of work and analyze how pragmatic reasoning during decoding affects the diversity of generated image captions. We find that boosting diversity itself does not result in more pragmatically informative captions, but pragmatic reasoning does increase lexical diversity. Finally, we discuss whether the gain in informativity is achieved in linguistically plausible ways.

2020

Knowledge Supports Visual Language Grounding: A Case Study on Colour Terms
Simeon Schüz | Sina Zarrieß
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In human cognition, world knowledge supports the perception of object colours: knowing that trees are typically green helps to perceive their colour in certain contexts. We go beyond previous studies on colour terms using isolated colour swatches and study visual grounding of colour terms in realistic objects. Our models integrate processing of visual information and object-specific knowledge via hard-coded (late) or learned (early) fusion. We find that both models consistently outperform a bottom-up baseline that predicts colour terms solely from visual inputs, but show interesting differences when predicting atypical colours of so-called colour diagnostic objects. Our models also achieve promising results when tested on new object categories not seen during training.

Humans Meet Models on Object Naming: A New Dataset and Analysis
Carina Silberer | Sina Zarrieß | Matthijs Westera | Gemma Boleda
Proceedings of the 28th International Conference on Computational Linguistics

We release ManyNames v2 (MN v2), a verified version of an object naming dataset that contains dozens of valid names per object for 25K images. We analyze issues in the data collection method originally employed, standard in Language & Vision (L&V), and find that the main source of noise in the data comes from simulating a naming context solely from an image with a target object marked with a bounding box, which causes subjects to sometimes disagree regarding which object is the target. We also find that both the degree of this uncertainty in the original data and the amount of true naming variation in MN v2 differ substantially across object domains. We use MN v2 to analyze a popular L&V model and demonstrate its effectiveness on the task of object naming. However, our fine-grained analysis reveals that what appears to be human-like model behavior is not stable across domains, e.g., the model confuses people and clothing objects much more frequently than humans do. We also find that standard evaluations underestimate the actual effectiveness of the naming model: on the single-label names of the original dataset (Visual Genome), its accuracy is 27 percentage points lower than on MN v2, which includes all valid object names.

From “Before” to “After”: Generating Natural Language Instructions from Image Pairs in a Simple Visual Domain
Robin Rojowiec | Jana Götze | Philipp Sadler | Henrik Voigt | Sina Zarrieß | David Schlangen
Proceedings of the 13th International Conference on Natural Language Generation

While certain types of instructions can be compactly expressed via images, there are situations where one might want to verbalise them, for example when directing someone. We investigate the task of Instruction Generation from Before/After Image Pairs, which is to derive from images an instruction for effecting the implied change. For this, we make use of prior work on instruction following in a visual environment. We take an existing dataset, the BLOCKS data collected by Bisk et al. (2016), and investigate whether it is suitable for training an instruction generator as well. We find that it is, and investigate several simple baselines, taking these from the related task of image captioning. Through a series of experiments that simplify the task (by making image processing easier or completely side-stepping it; and by creating template-based targeted instructions), we investigate areas for improvement. We find that captioning models get some way towards solving the task, but have some difficulty with it, and future improvements must lie in the way the change is detected in the instruction.

Object Naming in Language and Vision: A Survey and a New Dataset
Carina Silberer | Sina Zarrieß | Gemma Boleda
Proceedings of the 12th Language Resources and Evaluation Conference

People choose particular names for objects, such as dog or puppy for a given dog. Object naming has been studied in Psycholinguistics, but has received relatively little attention in Computational Linguistics. We review resources from Language and Vision that could be used to study object naming on a large scale, discuss their shortcomings, and create a new dataset that affords more opportunities for analysis and modeling. Our dataset, ManyNames, provides 36 name annotations for each of 25K objects in images selected from VisualGenome. We highlight the challenges involved and provide a preliminary analysis of the ManyNames data, showing that there is a high level of agreement in naming, on average. At the same time, the average number of name types associated with an object is much higher in our dataset than in existing corpora for Language and Vision, such that ManyNames provides a rich resource for studying phenomena like hierarchical variation (chihuahua vs. dog), which has been discussed at length in the theoretical literature, and other less well studied phenomena like cross-classification (cake vs. dessert).

2019

Know What You Don’t Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories
Sina Zarrieß | David Schlangen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Zero-shot learning in Language & Vision is the task of correctly labelling (or naming) objects of novel categories. Another strand of work in L&V aims at pragmatically informative rather than “correct” object descriptions, e.g. in reference games. We combine these lines of research and model zero-shot reference games, where a speaker needs to successfully refer to a novel object in an image. Inspired by models of “rational speech acts”, we extend a neural generator to become a pragmatic speaker reasoning about uncertain object categories. As a result of this reasoning, the generator produces fewer nouns and names of distractor categories as compared to a literal speaker. We show that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.

Sketch Me if You Can: Towards Generating Detailed Descriptions of Object Shape by Grounding in Images and Drawings
Ting Han | Sina Zarrieß
Proceedings of the 12th International Conference on Natural Language Generation

A lot of recent work in Language & Vision has looked at generating descriptions or referring expressions for objects in scenes of real-world images, though focusing mostly on relatively simple language like object names, color and location attributes (e.g., brown chair on the left). This paper presents work on Draw-and-Tell, a dataset of detailed descriptions for common objects in images where annotators have produced fine-grained attribute-centric expressions distinguishing a target object from a range of similar objects. Additionally, the dataset comes with hand-drawn sketches for each object. As Draw-and-Tell is medium-sized and contains a rich vocabulary, it constitutes an interesting challenge for CNN-LSTM architectures used in state-of-the-art image captioning models. We explore whether the additional modality given through sketches can help such a model to learn to accurately ground detailed language referring expressions to object shapes. Our results are encouraging.

Tell Me More: A Dataset of Visual Scene Description Sequences
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 12th International Conference on Natural Language Generation

We present a dataset consisting of what we call image description sequences, which are multi-sentence descriptions of the contents of an image. These descriptions were collected in a pseudo-interactive setting, where the describer was told to describe the given image to a listener who needs to identify the image within a set of images, and who successively asks for more information. As we show, this setup produced nicely structured data that, we think, will be useful for learning models capable of planning and realising such description discourses.

2018

The Task Matters: Comparing Image Captioning and Task-Based Dialogical Image Description
Nikolai Ilinykh | Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

Image captioning models are typically trained on data that is collected from people who are asked to describe an image, without being given any further task context. As we argue here, this context independence is likely to cause problems for transferring to task settings in which image description is bound by task demands. We demonstrate that careful design of data collection is required to obtain image descriptions which are contextually bound to a particular meta-level task. As a task, we use MeetUp!, a text-based communication game where two players have the goal of finding each other in a visual environment. To reach this goal, the players need to describe images representing their current location. We analyse a dataset from this domain and show that the nature of image descriptions found in MeetUp! is diverse, dynamic and rich with phenomena that are not present in descriptions obtained through a simple image captioning task, which we ran for comparison.

Decoding Strategies for Neural Referring Expression Generation
Sina Zarrieß | David Schlangen
Proceedings of the 11th International Conference on Natural Language Generation

RNN-based sequence generation is now widely used in NLP and NLG (natural language generation). Most work focusses on how to train RNNs, even though decoding is not necessarily straightforward either: previous work on neural MT found seq2seq models to radically prefer short candidates, and has proposed a number of beam search heuristics to deal with this. In this work, we assess decoding strategies for referring expression generation with neural models. Here, expression length is crucial: output should contain neither too much nor too little information, in order to be pragmatically adequate. We find that most beam search heuristics developed for MT do not generalize well to referring expression generation (REG), and do not generally outperform greedy decoding. We observe that beam search heuristics for termination seem to override the model’s knowledge of what a good stopping point is. Therefore, we also explore a recent approach called trainable decoding, which uses a small network to modify the RNN’s hidden state for better decoding results. We find this approach to consistently outperform greedy decoding for REG.

Being data-driven is not enough: Revisiting interactive instruction giving as a challenge for NLG
Sina Zarrieß | David Schlangen
Proceedings of the Workshop on NLG for Human–Robot Interaction

Modeling traditional NLG tasks with data-driven techniques has been a major focus of research in NLG in the past decade. We argue that existing modeling techniques are mostly tailored to textual data and are not sufficient to make NLG technology meet the requirements of agents which target fluid interaction and collaboration in the real world. We revisit interactive instruction giving as a challenge for data-driven NLG and, based on insights from previous GIVE challenges, propose that instruction giving should be addressed in a setting that involves visual grounding and spoken language. These basic design decisions will require NLG frameworks that are capable of monitoring their environment as well as timing and revising their verbal output. We believe that these are core capabilities for making NLG technology transferable to interactive systems.

2017

Obtaining referential word meanings from visual and distributional information: Experiments on object naming
Sina Zarrieß | David Schlangen
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate object naming, which is an important sub-task of referring expression generation on real-world images. As opposed to mutually exclusive labels used in object recognition, object names are more flexible, subject to communicative preferences and semantically related to each other. Therefore, we investigate models of referential word meaning that link visual to lexical information which we assume to be given through distributional word embeddings. We present a model that learns individual predictors for object names that link visual and distributional aspects of word meaning during training. We show that this is particularly beneficial for zero-shot learning, as compared to projecting visual objects directly into the distributional space. In a standard object naming task, we find that different ways of combining lexical and visual information achieve very similar performance, though experiments on model combination suggest that they capture complementary aspects of referential meaning.

Deriving continous grounded meaning representations from referentially structured multimodal contexts
Sina Zarrieß | David Schlangen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Corpora of referring expressions paired with their visual referents are a good source for learning word meanings directly grounded in visual representations. Here, we explore additional ways of extracting from them word representations linked to multi-modal context: through expressions that refer to the same object, and through expressions that refer to different objects in the same scene. We show that continuous meaning representations derived from these contexts capture complementary aspects of similarity, even if not outperforming textual embeddings trained on very large amounts of raw text when tested on standard similarity benchmarks. We propose a new task for evaluating grounded meaning representations—detection of potentially co-referential phrases—and show that it requires precise denotational representations of attribute meanings, which our method provides.

Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images
Sina Zarrieß | M. Soledad López Gambino | David Schlangen
Proceedings of the 10th International Conference on Natural Language Generation

Current referring expression generation systems mostly deliver their output as one-shot, written expressions. We present on-going work on incremental generation of spoken expressions referring to objects in real-world images. This approach builds on previous work using the words-as-classifiers model for generation. We implement this generator in an incremental dialogue processing framework such that we can exploit an existing interface to incremental text-to-speech synthesis. Our system generates and synthesizes referring expressions while continuously observing non-verbal user reactions.

The Code2Text Challenge: Text Generation in Source Libraries
Kyle Richardson | Sina Zarrieß | Jonas Kuhn
Proceedings of the 10th International Conference on Natural Language Generation

We propose a new shared task for tactical data-to-text generation in the domain of source code libraries. Specifically, we focus on text generation of function descriptions from example software projects. Data is drawn from existing resources used for studying the related problem of semantic parser induction, and spans a wide variety of both natural languages and programming languages. In this paper, we describe these existing resources, which will serve as training and development data for the task, and discuss plans for building new independent test sets.

Beyond On-hold Messages: Conversational Time-buying in Task-oriented Dialogue
Soledad López Gambino | Sina Zarrieß | David Schlangen
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

A common convention in graphical user interfaces is to indicate a “wait state”, for example while a program is preparing a response, through a changed cursor state or a progress bar. What should the analogue be in a spoken conversational system? To address this question, we set up an experiment in which a human information provider (IP) was given their information only in a delayed and incremental manner, which systematically created situations where the IP had the turn but could not provide task-related information. Our data analysis shows that 1) IPs bridge the gap until they can provide information by re-purposing a whole variety of task- and grounding-related communicative actions (e.g. echoing the user’s request, signaling understanding, asserting partially relevant information), rather than being silent or explicitly asking for time (e.g. “please wait”), and that 2) IPs combined these actions productively to ensure an ongoing conversation. These results, we argue, indicate that natural conversational interfaces should also be able to manage their time flexibly using a variety of conversational resources.

Is this a Child, a Girl or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings
Sina Zarrieß | David Schlangen
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

There has recently been a lot of work trying to use images of referents of words for improving vector space meaning representations derived from text. We investigate the opposite direction, as it were, trying to improve visual word predictors that identify objects in images, by exploiting distributional similarity information during training. We show that for certain words (such as entry-level nouns or hypernyms), we can indeed learn better referential word meanings by taking into account their semantic similarity to other words. For other words, there is no effect or even a detrimental one, compared to a learning setup that presents even semantically related objects as negative instances.

2016

Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected?
Sina Zarrieß | David Schlangen
Proceedings of the 9th International Natural Language Generation conference

PentoRef: A Corpus of Spoken References in Task-oriented Dialogues
Sina Zarrieß | Julian Hough | Casey Kennington | Ramesh Manuvinakurike | David DeVault | Raquel Fernández | David Schlangen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

PentoRef is a corpus of task-oriented dialogues collected in systematically manipulated settings. The corpus is multilingual, with English and German sections, and overall comprises more than 20000 utterances. The dialogues are fully transcribed and annotated with referring expressions mapped to objects in corresponding visual scenes, which makes the corpus a rich resource for research on spoken referring expressions in generation and resolution. The corpus includes several sub-corpora that correspond to different dialogue situations where parameters related to interactivity, visual access, and verbal channel have been manipulated in systematic ways. The corpus thus lends itself to very targeted studies of reference in spontaneous dialogue.

Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs
Sina Zarrieß | David Schlangen
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Resolving References to Objects in Photographs using the Words-As-Classifiers Model
David Schlangen | Sina Zarrieß | Casey Kennington
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

Reading Times Predict the Quality of Generated Text Above and Beyond Human Ratings
Sina Zarrieß | Sebastian Loth | David Schlangen
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)

2013

Combining Referring Expression Generation and Surface Realization: A Corpus-Based Investigation of Architectures
Sina Zarrieß | Jonas Kuhn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

An Automatic Method for Building a Data-to-Text Generator
Sina Zarriess | Kyle Richardson
Proceedings of the 14th European Workshop on Natural Language Generation

LFG-based Features for Noun Number and Article Grammatical Errors
Gábor Berend | Veronika Vincze | Sina Zarrieß | Richárd Farkas
Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task

2012

Generating Non-Projective Word Order in Statistical Linearization
Bernd Bohnet | Anders Björkelund | Jonas Kuhn | Wolfgang Seeker | Sina Zarriess
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

To what extent does sentence-internal realisation reflect discourse context? A study on word order
Sina Zarrieß | Aoife Cahill | Jonas Kuhn
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

A Corpus-based Study of the German Recipient Passive
Patrick Ziering | Sina Zarrieß | Jonas Kuhn
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we investigate the usage of a non-canonical German passive alternation for ditransitive verbs, the recipient passive, in naturally occurring corpus data. We propose a classifier that predicts the voice of a ditransitive verb based on the contextually determined properties of its arguments. As the recipient passive is a low-frequency phenomenon, we first create a special data set focussing on German ditransitive verbs which are frequently used in the recipient passive. We use a broad-coverage grammar-based parser, the German LFG parser, to automatically annotate our data set for the morpho-syntactic properties of the involved predicate arguments. We train a Maximum Entropy classifier on the automatically annotated sentences and achieve an accuracy of 98.05%, clearly outperforming the baseline that always predicts active voice (94.6%).

2011

Underspecifying and Predicting Voice for Surface Realisation Ranking
Sina Zarrieß | Aoife Cahill | Jonas Kuhn
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Cross-Lingual Induction for Deep Broad-Coverage Syntax: A Case Study on German Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer
Coling 2010: Posters

A Cross-Lingual Induction Technique for German Adverbial Participles
Sina Zarrieß | Aoife Cahill | Jonas Kuhn | Christian Rohrer
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground

Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
Cheikh M. Bamba Dione | Jonas Kuhn | Sina Zarrieß
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we report on the design of a part-of-speech-tagset for Wolof and on the creation of a semi-automatically annotated gold standard. In order to achieve high-quality annotation relatively fast, we first generated an accurate lexicon that draws on existing word and name lists and takes into account inflectional and derivational morphology. The main motivation for the tagged corpus is to obtain data for training automatic taggers with machine learning approaches. Hence, we took machine learning considerations into account during tagset design and we present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, it was useful to use a part of the Bible as the gold standard corpus, for which sentence-aligned parallel versions in many languages are easy to obtain. We also report on preliminary experiments exploiting a statistical word alignment of the parallel text.

2009

Developing German Semantics on the basis of Parallel LFG Grammars
Sina Zarrieß
Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009)

Exploiting Translational Correspondences for Pattern-Independent MWE Identification
Sina Zarrieß | Jonas Kuhn
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

2006

A Conceptual Analysis of the Notion of Instrumentality via a Multilingual Analysis
Asanee Kawtrakul | Mukda Suktarachan | Bali Ranaivo-Malancon | Pek Kuan | Achla Raina | Sudeshna Sarkar | Alda Mari | Sina Zarriess | Elixabete Murguia | Patrick Saint-Dizier
Proceedings of the Third ACL-SIGSEM Workshop on Prepositions