Tamara Berg

Also published as: Tamara L Berg, Tamara L. Berg


2023

Revealing Single Frame Bias for Video-and-Language Learning
Jie Lei | Tamara Berg | Mohit Bansal
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth the drastically increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong “static appearance bias” in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity.

2021

mTVR: Multilingual Moment Retrieval in Videos
Jie Lei | Tamara Berg | Mohit Bansal
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We introduce mTVR, a large-scale multilingual video moment retrieval dataset, containing 218K English and Chinese queries from 21.8K TV show video clips. The dataset is collected by extending the popular TVR dataset (in English) with paired Chinese queries and subtitles. Compared to existing moment retrieval datasets, mTVR is multilingual, larger, and comes with diverse annotations. We further propose mXML, a multilingual moment retrieval model that learns and operates on data from both languages, via encoder parameter sharing and language neighborhood constraints. We demonstrate the effectiveness of mXML on the newly collected mTVR dataset, where mXML outperforms strong monolingual baselines while using fewer parameters. In addition, we provide detailed dataset analyses and model ablations. Data and code are publicly available at https://github.com/jayleicn/mTVRetrieval.

2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
Jie Lei | Liwei Wang | Yelong Shen | Dong Yu | Tamara Berg | Mohit Bansal
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better predict the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets, ActivityNet Captions and YouCookII, show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events.

TVQA+: Spatio-Temporal Grounding for Video Question Answering
Jie Lei | Licheng Yu | Tamara Berg | Mohit Bansal
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present the task of Spatio-Temporal Video Question Answering, which requires intelligent systems to simultaneously retrieve relevant moments and detect referenced visual concepts (people and objects) to answer natural language questions about videos. We first augment the TVQA dataset with 310.8K bounding boxes, linking depicted objects to visual concepts in questions and answers. We name this augmented version TVQA+. We then propose Spatio-Temporal Answerer with Grounded Evidence (STAGE), a unified framework that grounds evidence in both spatial and temporal domains to answer questions about videos. Comprehensive experiments and analyses demonstrate the effectiveness of our framework and how the rich annotations in our TVQA+ dataset can contribute to the question answering task. Moreover, by performing this joint task, our model is able to produce insightful and interpretable spatio-temporal attention visualizations.

What is More Likely to Happen Next? Video-and-Language Future Event Prediction
Jie Lei | Licheng Yu | Tamara Berg | Mohit Bansal
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work.

2018

TVQA: Localized, Compositional Video Question Answering
Jie Lei | Licheng Yu | Mohit Bansal | Tamara Berg
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.

2017

Hierarchically-Attentive RNN for Album Summarization and Storytelling
Licheng Yu | Mohit Bansal | Tamara Berg
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We address the problem of end-to-end visual storytelling. Given a photo album, our model first selects the most representative (summary) photos, and then composes a natural language story for the album. For this task, we make use of the Visual Storytelling dataset and a model composed of three hierarchically-attentive Recurrent Neural Nets (RNNs) to: encode the album photos, select representative (summary) photos, and compose the story. Automatic and human evaluations show our model achieves better performance on selection, generation, and retrieval than baselines.

2014

TreeTalk: Composition and Compression of Trees for Image Descriptions
Polina Kuznetsova | Vicente Ordonez | Tamara L. Berg | Yejin Choi
Transactions of the Association for Computational Linguistics, Volume 2

We present a new tree-based approach to composing expressive image descriptions that makes use of naturally occurring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree fragments) from existing image descriptions, then to compose a new description by selectively combining the extracted (and optionally pruned) tree fragments. Key algorithmic components are tree composition and compression, both integrating tree structure with sequence structure. Our proposed system attains significantly better performance than previous approaches for both image caption generalization and generation. In addition, our work is the first to show the empirical benefit of automatically generalized captions for composing natural image descriptions.

ReferItGame: Referring to Objects in Photographs of Natural Scenes
Sahar Kazemzadeh | Vicente Ordonez | Mark Matten | Tamara Berg
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Generalizing Image Captions for Image-Text Parallel Corpus
Polina Kuznetsova | Vicente Ordonez | Alexander Berg | Tamara Berg | Yejin Choi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Proceedings of the Workshop on Vision and Natural Language Processing
Julia Hockenmaier | Tamara Berg
Proceedings of the Workshop on Vision and Natural Language Processing

2012

Detecting Visual Text
Jesse Dodge | Amit Goyal | Xufeng Han | Alyssa Mensch | Margaret Mitchell | Karl Stratos | Kota Yamaguchi | Yejin Choi | Hal Daumé III | Alex Berg | Tamara Berg
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Collective Generation of Natural Image Descriptions
Polina Kuznetsova | Vicente Ordonez | Alexander Berg | Tamara Berg | Yejin Choi
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Midge: Generating Image Descriptions From Computer Vision Detections
Margaret Mitchell | Jesse Dodge | Amit Goyal | Kota Yamaguchi | Karl Stratos | Xufeng Han | Alyssa Mensch | Alex Berg | Tamara Berg | Hal Daumé III
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Composing Simple Image Descriptions using Web-scale N-grams
Siming Li | Girish Kulkarni | Tamara L Berg | Alexander C Berg | Yejin Choi
Proceedings of the Fifteenth Conference on Computational Natural Language Learning