Grace Luo


pdf bib
G3: Geolocation via Guidebook Grounding
Grace Luo | Giscard Biamby | Trevor Darrell | Daniel Fried | Anna Rohrbach
Findings of the Association for Computational Linguistics: EMNLP 2022

We demonstrate how language can improve geolocation: the task of predicting the location where an image was taken. Here we study explicit knowledge from human-written guidebooks that describe the salient and class-discriminative visual features humans use for geolocation. We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations and an associated textual guidebook for GeoGuessr, a popular interactive geolocation game. Our approach predicts a country for each image by attending over the clues automatically extracted from the guidebook. Supervising attention with country-level pseudo labels achieves the best performance. Our approach substantially outperforms a state-of-the-art image-only geolocation method, with an improvement of over 5% in Top-1 accuracy. Our dataset and code can be found at

pdf bib
Focus! Relevant and Sufficient Context Selection for News Image Captioning
Mingyang Zhou | Grace Luo | Anna Rohrbach | Zhou Yu
Findings of the Association for Computational Linguistics: EMNLP 2022

News Image Captioning requires describing an image by leveraging additional context derived from a news article. Previous works only coarsely leverage the article to extract the necessary context, which makes it challenging for models to identify relevant events and named entities. In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model’s ability to generate accurate news captions. This begs the question, how to automatically extract such key entities from an image? We propose to use pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article, and then capture the non-visual entities via a open relation extraction model. Our experiments demonstrate that by simply selecting better context from the article, we can significantly improve the performance of existing models and achieve the new state-of-the-art performance on multiple benchmarks.

pdf bib
Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation
Giscard Biamby | Grace Luo | Trevor Darrell | Anna Rohrbach
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Detecting out-of-context media, such as “miscaptioned” images on Twitter, is a relevant problem, especially in domains of high public significance. In this work we aim to develop defenses against such misinformation for the topics of Climate Change, COVID-19, and Military Vehicles. We first present a large-scale multimodal dataset with over 884k tweets relevant to these topics. Next, we propose a detection method, based on the state-of-the-art CLIP model, that leverages automatically generated hard image-text mismatches. While this approach works well on our automatically constructed out-of-context tweets, we aim to validate its usefulness on data representative of the real world. Thus, we test it on a set of human-generated fakes, created by mimicking in-the-wild misinformation. We achieve an 11% detection improvement in a high precision regime over a strong baseline. Finally, we share insights about our best model design and analyze the challenges of this emerging threat.


pdf bib
NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media
Grace Luo | Trevor Darrell | Anna Rohrbach
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes. We are motivated by the threat scenario where an image is used out of context to support a certain narrative. While some prior datasets for detecting image-text inconsistency generate samples via text manipulation, we propose a dataset where both image and text are unmanipulated but mismatched. We introduce several strategies for automatically retrieving convincing images for a given caption, capturing cases with inconsistent entities or semantic context. Our large-scale automatically generated the NewsCLIPpings Dataset: (1) demonstrates that machine-driven image repurposing is now a realistic threat, and (2) provides samples that represent challenging instances of mismatch between text and image in news that are able to mislead humans. We benchmark several state-of-the-art multimodal models on our dataset and analyze their performance across different pretraining domains and visual backbones.