Alan Yuille


2024

pdf bib
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Zhuowan Li | Cihang Xie | Benjamin Van Durme | Alan Yuille
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether multi-modal learning can help understand each individual modality. In this work, we conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models by probing on a broad range of tasks. Five probing tasks are evaluated in order to assess the quality of the learned representations in a nuanced manner. Our results on five probing tasks suggest vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.

2018

pdf bib
Scene Graph Parsing as Dependency Parsing
Yu-Siang Wang | Chenxi Liu | Xiaohui Zeng | Alan Yuille
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation that considers objects together with their attributes and relations: this representation has been proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connect to dependency parses. Together with a careful redesign of label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.

pdf bib
PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution
Hong Chen | Zhenhua Fan | Hao Lu | Alan Yuille | Shu Rong
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce PreCo, a large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separated analysis of mention detection and mention clustering. To strengthen the training-test overlap, we collect a large corpus of 38K documents and 12.5M words which are mostly from the vocabulary of English-speaking preschoolers. Experiments show that with higher training-test overlap, error analysis on PreCo is more efficient than the one on OntoNotes, a popular existing dataset. Furthermore, we annotate singleton mentions making it possible for the first time to quantify the influence that a mention detector makes on coreference resolution performance. The dataset is freely available at https://preschool-lab.github.io/PreCo/.