Bowen Shi


pdf bib
TTIC’s Submission to WMT-SLT 23
Marcelo Sandoval-Castaneda | Yanhong Li | Bowen Shi | Diane Brentari | Karen Livescu | Gregory Shakhnarovich
Proceedings of the Eighth Conference on Machine Translation

In this paper, we describe TTIC’s submission to WMT 2023 Sign Language Translation task on the Swiss-German Sign Language (DSGS) to German track. Our approach explores the advantages of using large-scale self-supervised pre-training in the task of sign language translation, over more traditional approaches that rely heavily on supervision, along with costly labels such as gloss annotations. The proposed model consists of a VideoSwin transformer for image encoding, and a T5 model adapted to receive VideoSwin features as input instead of text. In WMT-SLT 22’s development set, this system achieves 2.03 BLEU score, a 59% increase over the previous best reported performance. In the official test set, our primary submission achieves 1.1 BLEU score and 17.0 chrF score.


pdf bib
TTIC’s WMT-SLT 22 Sign Language Translation System
Bowen Shi | Diane Brentari | Gregory Shakhnarovich | Karen Livescu
Proceedings of the Seventh Conference on Machine Translation (WMT)

We describe TTIC’s model submission to WMT-SLT 2022 task on sign language translation (Swiss-German Sign Language (DSGS) - German). Our model consists of an I3D backbone for image encoding and a Transformerbased encoder-decoder model for sequence modeling. The I3D is pre-trained with isolated sign recognition using the WLASL dataset. The model is based on RGB images alone and does not rely on the pre-extracted human pose. We explore a few different strategies for model training in this paper. Our system achieves 0.3 BLEU score and 0.195 Chrf score on the official test set.

pdf bib
Searching for fingerspelled content in American Sign Language
Bowen Shi | Diane Brentari | Greg Shakhnarovich | Karen Livescu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language processing for sign language video—including tasks like recognition, translation, and search—is crucial for making artificial intelligence technologies accessible to deaf individuals, and is gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled keywords or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks.

pdf bib
Open-Domain Sign Language Translation Learned from Online Video
Bowen Shi | Diane Brentari | Gregory Shakhnarovich | Karen Livescu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Existing work on sign language translation – that is, translation from sign language videos into sentences in a written language – has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube).OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality, over baseline models basedon prior work.


pdf bib
A Cross-Task Analysis of Text Span Representations
Shubham Toshniwal | Haoyue Shi | Bowen Shi | Lingyu Gao | Karen Livescu | Kevin Gimpel
Proceedings of the 5th Workshop on Representation Learning for NLP

Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for representing words and sentences, there is less work on representing arbitrary spans of text within sentences. In this paper, we conduct a comprehensive empirical evaluation of six span representation methods using eight pretrained language representation models across six tasks, including two tasks that we introduce. We find that, although some simple span representations are fairly reliable across tasks, in general the optimal span representation varies by task, and can also vary within different facets of individual tasks. We also find that the choice of span representation has a bigger impact with a fixed pretrained encoder than with a fine-tuned encoder.