Carina Silberer


2023

Combining Tradition with Modernness: Exploring Event Representations in Vision-and-Language Models for Visual Goal-Step Inference
Chong Shen | Carina Silberer
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Procedural knowledge understanding (PKU) underlies the ability to infer goal-step relations. The task of Visual Goal-Step Inference addresses this ability in the multimodal domain: it requires identifying images that represent the steps towards achieving a textually expressed goal. The best existing methods encode texts and images either with independent encoders or with object-level multimodal encoders using black-box transformers. This stands in contrast to early, linguistically inspired methods for event representations, which focus on capturing the most crucial information, namely actions and their participants, to learn stereotypical event sequences and hence procedural knowledge. In this work, we study various methods of injecting these early, shallow event representations into modern multimodal deep learning-based models, and their effects on PKU. We find that the early, linguistically inspired methods for representing event knowledge do contribute to understanding procedures in combination with modern vision-and-language models. In future work, we will explore more complex event structures and study how to exploit them on top of large language models.

HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities
Esra Dönmez | Pascal Tilli | Hsiu-Yu Yang | Ngoc Thang Vu | Carina Silberer
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

Image-Text Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between web-collected image–text pairs, models fail to show fine-grained understanding of the combined semantics of these modalities. To this end, we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training, towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually created test set for benchmarking models on fine-grained cross-modal mismatches with varying levels of compositional complexity. Our results show the effectiveness of training on HNC: it improves the models' zero-shot capabilities in detecting mismatches on diagnostic tasks and makes them perform robustly under noisy visual input. We also demonstrate that HNC models yield a comparable or better initialization for fine-tuning. Our code and data are publicly available.

Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing
Piush Aggarwal | Özge Alaçam | Carina Silberer | Sina Zarrieß | Torsten Zesch
Proceedings of the 1st Workshop on Linguistic Insights from and for Multimodal Language Processing

Implicit Affordance Acquisition via Causal Action–Effect Modeling in the Video Domain
Hsiu-Yu Yang | Carina Silberer
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

2022

On the Complementarity of Images and Text for the Expression of Emotions in Social Media
Anna Khlyzova | Carina Silberer | Roman Klinger
Proceedings of the 12th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis

Authors of posts in social media communicate their emotions and what causes them with text and images. While there is work on emotion and stimulus detection for each modality separately, it is yet unknown whether the modalities contain complementary emotion information in social media. We aim to fill this research gap and contribute a novel, annotated corpus of English multimodal Reddit posts. On this resource, we develop models to automatically detect the relation between image and text, the emotion stimulus category, and the emotion class. We evaluate whether these tasks require both modalities and find, for the image–text relations, that text alone is sufficient for most categories (complementary, illustrative, opposing): the information in the text allows us to predict whether an image is required for emotion understanding. The emotions of anger and sadness are best predicted with a multimodal model, while text alone is sufficient for disgust, joy, and surprise. Stimuli depicted by objects, animals, food, or a person are best predicted by image-only models, while multimodal models are most effective on art, events, memes, places, or screenshots.

Are Visual-Linguistic Models Commonsense Knowledge Bases?
Hsiu-Yu Yang | Carina Silberer
Proceedings of the 29th International Conference on Computational Linguistics

Despite the recent success of pretrained language models as on-the-fly knowledge sources for various downstream tasks, they have been shown to inadequately represent trivial common facts that vision typically captures. This limits their application to natural language understanding tasks that require commonsense knowledge. We seek to determine the capability of pretrained visual-linguistic models as on-demand knowledge sources. To this end, we systematically compare language-only and visual-linguistic models in a zero-shot commonsense question answering inference task. We find that visual-linguistic models are highly promising regarding their benefit for text-only tasks on certain types of commonsense knowledge associated with the visual world. Surprisingly, this knowledge can be activated even when no visual input is given during inference, suggesting effective multimodal fusion during pretraining. However, we reveal that there is still much room for improvement towards better cross-modal reasoning abilities and pretraining strategies for event understanding.

2020

Humans Meet Models on Object Naming: A New Dataset and Analysis
Carina Silberer | Sina Zarrieß | Matthijs Westera | Gemma Boleda
Proceedings of the 28th International Conference on Computational Linguistics

We release ManyNames v2 (MN v2), a verified version of an object naming dataset that contains dozens of valid names per object for 25K images. We analyze issues in the data collection method originally employed, which is standard in Language & Vision (L&V), and find that the main source of noise in the data comes from simulating a naming context solely from an image with a target object marked with a bounding box, which causes subjects to sometimes disagree about which object is the target. We also find that both the degree of this uncertainty in the original data and the amount of true naming variation in MN v2 differ substantially across object domains. We use MN v2 to analyze a popular L&V model and demonstrate its effectiveness on the task of object naming. However, our fine-grained analysis reveals that what appears to be human-like model behavior is not stable across domains; e.g., the model confuses people and clothing objects much more frequently than humans do. We also find that standard evaluations underestimate the actual effectiveness of the naming model: on the single-label names of the original dataset (Visual Genome), its accuracy is 27 points lower than on MN v2, which includes all valid object names.

Object Naming in Language and Vision: A Survey and a New Dataset
Carina Silberer | Sina Zarrieß | Gemma Boleda
Proceedings of the Twelfth Language Resources and Evaluation Conference

People choose particular names for objects, such as dog or puppy for a given dog. Object naming has been studied in Psycholinguistics, but has received relatively little attention in Computational Linguistics. We review resources from Language and Vision that could be used to study object naming on a large scale, discuss their shortcomings, and create a new dataset that affords more opportunities for analysis and modeling. Our dataset, ManyNames, provides 36 name annotations for each of 25K objects in images selected from VisualGenome. We highlight the challenges involved and provide a preliminary analysis of the ManyNames data, showing that there is, on average, a high level of agreement in naming. At the same time, the average number of name types associated with an object is much higher in our dataset than in existing corpora for Language and Vision, such that ManyNames provides a rich resource for studying phenomena like hierarchical variation (chihuahua vs. dog), which has been discussed at length in the theoretical literature, and other less well-studied phenomena like cross-classification (cake vs. dessert).

2019

What do Entity-Centric Models Learn? Insights from Entity Linking in Multi-Party Dialogue
Laura Aina | Carina Silberer | Ionut-Teodor Sorodoc | Matthijs Westera | Gemma Boleda
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Humans use language to refer to entities in the external world. Motivated by this, in recent years several models that incorporate a bias towards learning entity representations have been proposed. Such entity-centric models have shown empirical success, but we still know little about why. In this paper we analyze the behavior of two recently proposed entity-centric models in a referential task, Entity Linking in Multi-party Dialogue (SemEval 2018 Task 4). We show that these models outperform the state of the art on this task, and that they do better on lower frequency entities than a counterpart model that is not entity-centric, with the same model size. We argue that making models entity-centric naturally fosters good architectural decisions. However, we also show that these models do not really build entity representations and that they make poor use of linguistic context. These negative results underscore the need for model analysis, to test whether the motivations for particular architectures are borne out in how models behave when deployed.

2018

Grounding Semantic Roles in Images
Carina Silberer | Manfred Pinkal
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We address the task of visual semantic role labeling (vSRL): the identification of the participants of a situation or event in a visual scene, and their labeling with their semantic relations to the event or situation. We render candidate participants as image regions of objects, and train a model which learns to ground roles in the regions which depict the corresponding participant. Experimental results demonstrate that we can train a vSRL model without relying on prohibitive image-based role annotations, by utilizing noisy data which we extract automatically from image captions using a linguistic SRL system. Furthermore, our model induces frame-semantic visual representations, and comparing them to previous work on supervised visual verb sense disambiguation yields overall better results.

AMORE-UPF at SemEval-2018 Task 4: BiLSTM with Entity Library
Laura Aina | Carina Silberer | Ionut-Teodor Sorodoc | Matthijs Westera | Gemma Boleda
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes our winning contribution to SemEval 2018 Task 4: Character Identification on Multiparty Dialogues. It is a simple, standard model with one key innovation, an entity library. Our results show that this innovation greatly facilitates the identification of infrequent characters. Because of the generic nature of our model, this finding is potentially relevant to any task that requires effective learning from sparse or imbalanced data.

2014

Learning Grounded Meaning Representations with Autoencoders
Carina Silberer | Mirella Lapata
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

Models of Semantic Representation with Visual Attributes
Carina Silberer | Vittorio Ferrari | Mirella Lapata
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Casting Implicit Role Linking as an Anaphora Resolution Task
Carina Silberer | Anette Frank
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Grounded Models of Semantic Representation
Carina Silberer | Mirella Lapata
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2010

UHD: Cross-Lingual Word Sense Disambiguation Using Multilingual Co-Occurrence Graphs
Carina Silberer | Simone Paolo Ponzetto
Proceedings of the 5th International Workshop on Semantic Evaluation

2008

Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration
Wolodja Wentland | Johannes Knopp | Carina Silberer | Matthias Hartung
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we present HeiNER, the multilingual Heidelberg Named Entity Resource. HeiNER contains 1,547,586 disambiguated English Named Entities together with translations and transliterations into 15 languages. Our work builds on the approach described by Bunescu and Pasca (2006), yet extends it to a multilingual dimension. Translating Named Entities into the various target languages is carried out by exploiting cross-lingual information contained in the online encyclopedia Wikipedia. In addition, HeiNER provides linguistic contexts for every NE in all target languages, which makes it a valuable resource for multilingual Named Entity Recognition, Disambiguation and Classification. The results of our evaluation against the assessments of human annotators yield a high precision of 0.95 for the NEs we extract from the English Wikipedia. These source-language NEs are thus very reliable seeds for our multilingual NE translation method.