@inproceedings{karoui-etal-2023-stop,
title = "Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages",
author = "Karoui, Yasmine and
Lebret, R{\'e}mi and
Foroutan Eghlidi, Negar and
Aberer, Karl",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-short.32",
doi = "10.18653/v1/2023.acl-short.32",
pages = "366--375",
abstract = "Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM.We utilize a cross-lingual contextualised token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at \url{https://github.com/Yasminekaroui/CliCoTea}.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="karoui-etal-2023-stop">
<titleInfo>
<title>Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages</title>
</titleInfo>
<name type="personal">
<namePart type="given">Yasmine</namePart>
<namePart type="family">Karoui</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Rémi</namePart>
<namePart type="family">Lebret</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Negar</namePart>
<namePart type="family">Foroutan Eghlidi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Karl</namePart>
<namePart type="family">Aberer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2023-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Anna</namePart>
<namePart type="family">Rogers</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jordan</namePart>
<namePart type="family">Boyd-Graber</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Naoaki</namePart>
<namePart type="family">Okazaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Toronto, Canada</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
    <abstract>Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualised token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.</abstract>
<identifier type="citekey">karoui-etal-2023-stop</identifier>
<identifier type="doi">10.18653/v1/2023.acl-short.32</identifier>
<location>
<url>https://aclanthology.org/2023.acl-short.32</url>
</location>
<part>
<date>2023-07</date>
<extent unit="page">
<start>366</start>
<end>375</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages
%A Karoui, Yasmine
%A Lebret, Rémi
%A Foroutan Eghlidi, Negar
%A Aberer, Karl
%Y Rogers, Anna
%Y Boyd-Graber, Jordan
%Y Okazaki, Naoaki
%S Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
%D 2023
%8 July
%I Association for Computational Linguistics
%C Toronto, Canada
%F karoui-etal-2023-stop
%X Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualised token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at https://github.com/Yasminekaroui/CliCoTea.
%R 10.18653/v1/2023.acl-short.32
%U https://aclanthology.org/2023.acl-short.32
%U https://doi.org/10.18653/v1/2023.acl-short.32
%P 366-375
Markdown (Informal)
[Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages](https://aclanthology.org/2023.acl-short.32) (Karoui et al., ACL 2023)
ACL
Yasmine Karoui, Rémi Lebret, Negar Foroutan Eghlidi, and Karl Aberer. 2023. Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 366–375, Toronto, Canada. Association for Computational Linguistics.
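
For readers of the abstract above, a minimal sketch of the cross-lingual contextualised token-embedding alignment it describes might look as follows. This is an illustrative assumption, not the authors' implementation (which lives at https://github.com/Yasminekaroui/CliCoTea): it assumes a frozen English text encoder from the VLP model as teacher, a multilingual encoder as student, machine-translated sentence pairs, and word alignments supplied as index pairs (e.g. from an off-the-shelf word aligner).

```python
# Illustrative sketch only -- see the CliCoTea repository linked above for
# the authors' actual code. Assumes: a frozen English (teacher) encoder from
# the VLP model, a multilingual (student) encoder, and precomputed word
# alignments between each English sentence and its machine translation.
import torch
import torch.nn.functional as F

def alignment_loss(student_hidden, teacher_hidden, token_pairs):
    """MSE between contextual embeddings of aligned token pairs.

    student_hidden: (tgt_len, dim) hidden states of the multilingual
        encoder on the machine-translated (target-language) sentence.
    teacher_hidden: (src_len, dim) hidden states of the frozen English
        encoder on the source sentence.
    token_pairs: list of (tgt_index, src_index) word-alignment pairs.
    """
    tgt_idx = torch.tensor([t for t, _ in token_pairs])
    src_idx = torch.tensor([s for _, s in token_pairs])
    # Teacher states are detached: only the student encoder is updated.
    return F.mse_loss(student_hidden[tgt_idx],
                      teacher_hidden[src_idx].detach())
```

Training would then loop over translated sentence pairs and minimise this loss, pulling the multilingual encoder's token embeddings into the space the English VLP encoder already occupies. No images enter this step, consistent with the abstract's claim that the adaptation requires no image input or target-language image data.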