In this paper, we describe TTIC’s submission to the WMT 2023 Sign Language Translation task on the Swiss-German Sign Language (DSGS) to German track. Our approach explores the advantages of large-scale self-supervised pre-training for sign language translation over more traditional approaches that rely heavily on supervision and on costly labels such as gloss annotations. The proposed model consists of a VideoSwin transformer for image encoding and a T5 model adapted to receive VideoSwin features as input instead of text. On the WMT-SLT 2022 development set, this system achieves a BLEU score of 2.03, a 59% increase over the previous best reported result. On the official test set, our primary submission achieves a BLEU score of 1.1 and a chrF score of 17.0.
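To make the adaptation of T5 concrete, the sketch below shows one way to feed visual features into T5 through the HuggingFace `inputs_embeds` argument in place of token embeddings. This is a minimal sketch, not the submitted system: the `t5-base` checkpoint, the 768-dimensional placeholder tensor standing in for VideoSwin output, and the linear projection are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of adapting T5 to take
# video features instead of text, via HuggingFace's `inputs_embeds` argument.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

t5 = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Placeholder for VideoSwin output: one 768-d feature per temporal window.
# (768 is an assumption; the real dimension depends on the backbone.)
video_feats = torch.randn(1, 64, 768)          # (batch, time, feat_dim)
proj = nn.Linear(768, t5.config.d_model)       # map features to T5's width

inputs_embeds = proj(video_feats)              # replaces token embeddings
labels = tokenizer("ein Beispielsatz", return_tensors="pt").input_ids

# T5 accepts `inputs_embeds` in place of `input_ids`, so the encoder can
# consume projected video features directly while training on German text.
loss = t5(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()
```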
Existing work on sign language translation – that is, translation from sign language videos into sentences in a written language – has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits its applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL)-English dataset collected from online video sites (e.g., YouTube). OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality over baseline models based on prior work.
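As a concrete illustration of the fusion idea, the snippet below sketches one common strategy: concatenating per-frame mouthing and handshape features with the base sign features and projecting back to the model width. This is an assumed illustration, not OpenASL's exact mechanism; the `FusionLayer` name and all dimensions are hypothetical.

```python
# Illustrative sketch of multi-cue feature fusion (not OpenASL's exact method).
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, sign_dim=512, mouth_dim=128, hand_dim=128, out_dim=512):
        super().__init__()
        # Concatenate the three cues, then project back to the model width.
        self.proj = nn.Linear(sign_dim + mouth_dim + hand_dim, out_dim)

    def forward(self, sign, mouth, hand):
        # All inputs: (batch, time, dim); fused output: (batch, time, out_dim)
        return self.proj(torch.cat([sign, mouth, hand], dim=-1))

fuse = FusionLayer()
fused = fuse(torch.randn(2, 100, 512),   # base sign features
             torch.randn(2, 100, 128),   # mouthing features
             torch.randn(2, 100, 128))   # handshape features
```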
We describe TTIC’s model submission to the WMT-SLT 2022 task on sign language translation (Swiss-German Sign Language (DSGS) to German). Our model consists of an I3D backbone for image encoding and a Transformer-based encoder-decoder model for sequence modeling. The I3D backbone is pre-trained with isolated sign recognition on the WLASL dataset. The model is based on RGB images alone and does not rely on pre-extracted human pose. We explore several different strategies for model training in this paper. Our system achieves a BLEU score of 0.3 and a chrF score of 0.195 on the official test set.
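For illustration, the sketch below pairs placeholder I3D clip features with PyTorch's standard `nn.Transformer` encoder-decoder, roughly matching the architecture described. The feature, vocabulary, and sequence sizes are illustrative assumptions, not the submission's configuration.

```python
# Minimal sketch (assumed, not the submitted system) of feeding I3D clip
# features into a standard Transformer encoder-decoder for translation.
import torch
import torch.nn as nn

d_model, vocab = 512, 32000                     # illustrative sizes
model = nn.Transformer(d_model=d_model, batch_first=True)
tok_embed = nn.Embedding(vocab, d_model)        # German token embeddings
out_head = nn.Linear(d_model, vocab)            # logits over target vocab

# Stand-in for I3D output: one feature vector per video clip.
i3d_feats = torch.randn(1, 32, d_model)         # (batch, clips, d_model)
tgt_ids = torch.randint(0, vocab, (1, 10))      # target German token ids

# Causal mask so each decoder position only attends to earlier tokens.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
dec = model(i3d_feats, tok_embed(tgt_ids), tgt_mask=tgt_mask)
logits = out_head(dec)                          # next-token scores
```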