Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing

Max Reinhardt; Gregor Geigle; Radu Timofte; Goran Glavaš

Abstract

Large-scale pretraining of vision-language (VL) models has brought dramatic improvements across numerous tasks, from visual question answering to cross-modal retrieval, but these gains are mostly limited to English. Massively multilingual VL encoder models (mVLMs) hold promise for other languages: after fine-tuning on only English task data, they can perform the task in other languages, in what is termed zero-shot cross-lingual transfer (ZS-XLT). Still, ZS-XLT performance lags well behind English performance, especially for low-resource languages. In this work, we reduce this gap with a fine-tuning strategy known as Scheduled Unfreezing (SUF): instead of updating all parameters from the start, we begin with the top layer(s) of the vision-language encoder and gradually unfreeze (i.e., update) its layers from top to bottom. SUF forces reliance on the encoder's representations from higher layers. We hypothesize that, because in multilingual models these representations encode higher-level semantics rather than low-level language-specific idiosyncrasies, SUF should be beneficial for ZS-XLT. Experiments with two mVLMs (UC2 and CCLM) on three downstream tasks (xGQA, XVNLI, xFlickrCo) show that SUF brings consistent gains in ZS-XLT, especially for visual question answering (xGQA), where gains reach up to 10 points.
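
The top-to-bottom unfreezing order described in the abstract can be illustrated with a small schedule function. This is only a sketch under stated assumptions, not the authors' implementation: the function name `unfrozen_layers`, the fixed `interval` between unfreezing events, unfreezing one layer at a time, and the 12-layer encoder in the usage lines are all hypothetical choices. The abstract itself specifies only that training starts with the top layer(s) and proceeds downward.

```python
def unfrozen_layers(step, num_layers, interval, initial_top=1):
    """Return the indices of encoder layers that receive updates at `step`.

    Layers are indexed 0 (bottom) to num_layers - 1 (top).
    Assumption: one more layer is unfrozen every `interval` training steps,
    starting from the `initial_top` topmost layers.
    """
    if step < 0 or num_layers <= 0 or interval <= 0 or initial_top < 1:
        raise ValueError("step must be >= 0; the other arguments must be positive")
    # Number of trainable layers grows with training, capped at the full depth.
    count = min(num_layers, initial_top + step // interval)
    # The trainable block is always the top `count` layers.
    return list(range(num_layers - count, num_layers))


if __name__ == "__main__":
    # Hypothetical 12-layer encoder, one more layer unfrozen every 100 steps.
    for step in (0, 100, 250, 5000):
        print(step, unfrozen_layers(step, num_layers=12, interval=100))
```

In an actual fine-tuning loop, the returned indices would determine which layers have gradient updates enabled at each step, while all remaining layers stay frozen.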

Anthology ID: 2024.alvr-1.13
Volume: Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)
Month: August
Year: 2024
Address: Bangkok, Thailand
Editors: Jing Gu, Tsu-Jui (Ray) Fu, Drew Hudson, Asli Celikyilmaz, William Wang
Venues: ALVR | WS
Publisher: Association for Computational Linguistics
Pages: 155–166
URL: https://aclanthology.org/2024.alvr-1.13
Cite (ACL): Max Reinhardt, Gregor Geigle, Radu Timofte, and Goran Glavaš. 2024. Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing. In Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), pages 155–166, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal): Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing (Reinhardt et al., ALVR-WS 2024)
PDF: https://aclanthology.org/2024.alvr-1.13.pdf
