SignCLIP: Connecting Text and Sign Language by Contrastive Learning

Zifan Jiang; Gerard Sant; Amit Moryossef; Mathias Müller; Rico Sennrich; Sarah Ebling

Abstract

We present SignCLIP, which re-purposes CLIP (Contrastive Language-Image Pretraining) to project spoken language text and sign language videos, two classes of natural languages of distinct modalities, into the same space. SignCLIP is an efficient method of learning useful visual representations for sign language processing from large-scale, multilingual video-text pairs, without directly optimizing for a specific task or sign language, for which data is often of limited size. We pretrain SignCLIP on Spreadthesign, a prominent sign language dictionary consisting of ~500 thousand video clips in up to 44 sign languages, and evaluate it with various downstream datasets. SignCLIP discerns in-domain signing with notable text-to-video/video-to-text retrieval accuracy. It also performs competitively for out-of-domain downstream tasks such as isolated sign language recognition upon essential few-shot prompting or fine-tuning. We analyze the latent space formed by the spoken language text and sign language poses, which provides additional linguistic insights. Our code and models are openly available.
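
As a rough illustration of the idea in the abstract, the sketch below shows a generic CLIP-style symmetric contrastive loss over paired text and video embeddings, followed by text-to-video retrieval in the shared space. It is not the authors' implementation: the function names, the plain-list embeddings, and the temperature value of 0.07 are all assumptions made for this example, and real encoders would produce the embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(text_embs, video_embs, temperature=0.07):
    """Cosine similarity of every text with every video, divided by a temperature.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    texts = [l2_normalize(t) for t in text_embs]
    videos = [l2_normalize(v) for v in video_embs]
    return [
        [sum(a * b for a, b in zip(t, v)) / temperature for v in videos]
        for t in texts
    ]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(text_embs, video_embs, temperature=0.07):
    """Average of the text-to-video and video-to-text cross-entropy terms."""
    logits = similarity_matrix(text_embs, video_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve(text_emb, video_embs):
    """Return the index of the video whose embedding is closest to the text."""
    sims = similarity_matrix([text_emb], video_embs, temperature=1.0)[0]
    return max(range(len(sims)), key=sims.__getitem__)
```

With correctly paired embeddings the loss is lower than with mismatched pairs, which is the signal that pulls matching text and video together in the shared space. Retrieval then reduces to a nearest-neighbour lookup by cosine similarity.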

Anthology ID: 2024.emnlp-main.518
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 9171–9193
URL: https://aclanthology.org/2024.emnlp-main.518
Cite (ACL): Zifan Jiang, Gerard Sant, Amit Moryossef, Mathias Müller, Rico Sennrich, and Sarah Ebling. 2024. SignCLIP: Connecting Text and Sign Language by Contrastive Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9171–9193, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): SignCLIP: Connecting Text and Sign Language by Contrastive Learning (Jiang et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.518.pdf
