A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking

Hideki Nakayama, Akihiro Tamura, Takashi Ninomiya


Abstract
Visually-grounded natural language processing has become an important research direction in the past few years. However, the majority of the available cross-modal resources (e.g., image-caption datasets) are built in English and cannot be directly utilized in multilingual or non-English scenarios. In this study, we present a novel multilingual multimodal corpus by extending the Flickr30k Entities image-caption dataset with Japanese translations, which we name Flickr30k Entities JP (F30kEnt-JP). To the best of our knowledge, this is the first multilingual image-caption dataset in which the captions in the two languages are parallel and share annotations of many-to-many phrase-to-region linking. We believe that phrase-to-region as well as phrase-to-phrase supervision can play a vital role in fine-grained grounding of language and vision, and will promote many tasks such as multilingual image captioning and multimodal machine translation. To validate our dataset, we performed phrase localization experiments in both languages and investigated the effectiveness of our Japanese annotations as well as of the multilingual learning enabled by our dataset.
Anthology ID:
2020.lrec-1.518
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4204–4210
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.518
Cite (ACL):
Hideki Nakayama, Akihiro Tamura, and Takashi Ninomiya. 2020. A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4204–4210, Marseille, France. European Language Resources Association.
Cite (Informal):
A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking (Nakayama et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.518.pdf