Christoph Feichtenhofer


2024

pdf bib
Altogether: Image Captioning via Re-aligning Alt-text
Hu Xu | Po-Yao Huang | Xiaoqing Tan | Ching-Feng Yeh | Jacob Kahn | Christine Jou | Gargi Ghosh | Omer Levy | Luke Zettlemoyer | Wen-tau Yih | Shang-Wen Li | Saining Xie | Christoph Feichtenhofer
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata, and second, lack transparency if the captioners’ training data (e.g. GPT) is unknown. In this paper, we study a principled approach Altogether based on the key idea to edit and re-align existing alt-texts associated with the images. To generate training data, we perform human annotation where annotators start with the existing alt-text and re-align it to the image content in multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task solely based on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.

2021

pdf bib
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Prahal Arora | Masoumeh Aminzadeh | Christoph Feichtenhofer | Florian Metze | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Dmytro Okhonko | Armen Aghajanyan | Florian Metze | Luke Zettlemoyer | Christoph Feichtenhofer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.