Po-Yao Huang


pdf bib
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Dmytro Okhonko | Armen Aghajanyan | Florian Metze | Luke Zettlemoyer | Christoph Feichtenhofer
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks. VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval. Our experiments on a diverse series of downstream tasks, including sequence-level text-video retrieval, VideoQA, token-level action localization, and action segmentation reveal state-of-the-art performance, surpassing prior work, and in some cases even outperforming supervised approaches. Code is made available at https://github.com/pytorch/fairseq/examples/MMPT.

pdf bib
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Prahal Arora | Masoumeh Aminzadeh | Christoph Feichtenhofer | Florian Metze | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang | Mandela Patrick | Junjie Hu | Graham Neubig | Florian Metze | Alexander Hauptmann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX; as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M is available at http://github.com/berniebear/Multi-HT100M.


pdf bib
Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting
Po-Yao Huang | Junjie Hu | Xiaojun Chang | Alexander Hauptmann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unsupervised machine translation (MT) has recently achieved impressive results with monolingual corpora only. However, it is still challenging to associate source-target sentences in the latent space. As people speak different languages biologically share similar visual systems, the potential of achieving better alignment through visual content is promising yet under-explored in unsupervised multimodal MT (MMT). In this paper, we investigate how to utilize visual content for disambiguation and promoting latent space alignment in unsupervised MMT. Our model employs multimodal back-translation and features pseudo visual pivoting in which we learn a shared multilingual visual-semantic embedding space and incorporate visually-pivoted captioning as additional weak supervision. The experimental results on the widely used Multi30K dataset show that the proposed model significantly improves over the state-of-the-art methods and generalizes well when images are not available at the testing time.


pdf bib
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations
Po-Yao Huang | Xiaojun Chang | Alexander Hauptmann
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

With the aim of promoting and understanding the multilingual version of image search, we leverage visual object detection and propose a model with diverse multi-head attention to learn grounded multilingual multimodal representations. Specifically, our model attends to different types of textual semantics in two languages and visual objects for fine-grained alignments between sentences and images. We introduce a new objective function which explicitly encourages attention diversity to learn an improved visual-semantic embedding space. We evaluate our model in the German-Image and English-Image matching tasks on the Multi30K dataset, and in the Semantic Textual Similarity task with the English descriptions of visual content. Results show that our model yields a significant performance gain over other methods in all of the three tasks.


pdf bib
Attention-based Multimodal Neural Machine Translation
Po-Yao Huang | Frederick Liu | Sz-Rung Shiang | Jean Oh | Chris Dyer
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers