Gargi Ghosh


2024

pdf bib
Altogether: Image Captioning via Re-aligning Alt-text
Hu Xu | Po-Yao Huang | Xiaoqing Tan | Ching-Feng Yeh | Jacob Kahn | Christine Jou | Gargi Ghosh | Omer Levy | Luke Zettlemoyer | Wen-tau Yih | Shang-Wen Li | Saining Xie | Christoph Feichtenhofer
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata, and second, lack transparency if the captioners’ training data (e.g. GPT) is unknown. In this paper, we study a principled approach Altogether based on the key idea to edit and re-align existing alt-texts associated with the images. To generate training data, we perform human annotation where annotators start with the existing alt-text and re-align it to the image content in multiple rounds, consequently constructing captions with rich visual concepts. This differs from prior work that carries out human annotation as a one-time description task solely based on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show our Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification tasks.

2023

pdf bib
ALERT: Adapt Language Models to Reasoning Tasks
Ping Yu | Tianlu Wang | Olga Golovneva | Badr AlKhamissi | Siddharth Verma | Zhijing Jin | Gargi Ghosh | Mona Diab | Asli Celikyilmaz
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recent advancements in large language models have enabled them to perform well on complex tasks that require step-by-step reasoning with few-shot learning. However, it is unclear whether these models are applying reasoning skills they have learnt during pre-training , or if they are simply memorizing their training corpus at finer granularity and have learnt to better understand their context. To address this question, we introduce {pasted macro ‘OUR’}model, a benchmark and suite of analyses for evaluating reasoning skills of language models. {pasted macro ‘OUR’}model enables comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. Our benchmark provides a test bed to asses any language model on fine-grained reasoning skills, which spans over 20 datasets and covers 10 different reasoning skills. By using {pasted macro ‘OUR’}model we further investigate the role of finetuning. Our extensive empirical analysis shows that language models learn more reasoning skills such as textual entailment, abductive reasoning, and analogical reasoning during the finetuning stage compared to pretraining stage. However, we also find that when language models are finetuned they tend to overfit to the prompt template, which hurts the robustness of models causing generalization problems.

2021

pdf bib
Multi-Task Retrieval for Knowledge-Intensive Tasks
Jean Maillard | Vladimir Karpukhin | Fabio Petroni | Wen-tau Yih | Barlas Oguz | Veselin Stoyanov | Gargi Ghosh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be _universal_ and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.

pdf bib
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Prahal Arora | Masoumeh Aminzadeh | Christoph Feichtenhofer | Florian Metze | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Dmytro Okhonko | Armen Aghajanyan | Florian Metze | Luke Zettlemoyer | Christoph Feichtenhofer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.

2019

pdf bib
Exploring Deep Multimodal Fusion of Text and Photo for Hate Speech Classification
Fan Yang | Xiaochang Peng | Gargi Ghosh | Reshef Shilon | Hao Ma | Eider Moore | Goran Predovic
Proceedings of the Third Workshop on Abusive Language Online

Interactions among users on social network platforms are usually positive, constructive and insightful. However, sometimes people also get exposed to objectionable content such as hate speech, bullying, and verbal abuse etc. Most social platforms have explicit policy against hate speech because it creates an environment of intimidation and exclusion, and in some cases may promote real-world violence. As users’ interactions on today’s social networks involve multiple modalities, such as texts, images and videos, in this paper we explore the challenge of automatically identifying hate speech with deep multimodal technologies, extending previous research which mostly focuses on the text signal alone. We present a number of fusion approaches to integrate text and photo signals. We show that augmenting text with image embedding information immediately leads to a boost in performance, while applying additional attention fusion methods brings further improvement.