2025
Improving Multilingual Sign Language Translation with Automatically Clustered Language Family Information
Ruiquan Zhang | Cong Hu | Pei Yu | Yidong Chen
Proceedings of the 31st International Conference on Computational Linguistics
Sign Language Translation (SLT) bridges the communication gap between deaf and hearing individuals by converting sign language videos into spoken language texts. While most SLT research has focused on bilingual translation models, the recent surge in interest has led to the exploration of Multilingual Sign Language Translation (MSLT). However, MSLT presents unique challenges due to the diversity of sign languages across nations. This diversity can lead to cross-linguistic conflicts and hinder translation accuracy. To exploit the similarity of actions and semantics between sign languages and thereby alleviate these conflicts, we propose a novel approach that leverages sign language families to improve MSLT performance. Sign languages are clustered into families automatically based on their language distribution in the MSLT network. We compare the results of our proposed family clustering method with the analysis conducted by sign language linguists, and then train dedicated translation models for each family in the many-to-one translation scenario. Our experiments on the SP-10 dataset demonstrate that our approach can achieve a balance between translation accuracy and computational cost by regulating the number of language families.
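A minimal sketch of the family-clustering idea described in the abstract, assuming each sign language is represented by an embedding vector derived from the multilingual SLT network; the variable names, the use of agglomerative clustering, and the example language codes are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical per-language representations, e.g. mean-pooled language
# embeddings taken from the trained MSLT model: shape (num_languages, dim).
language_ids = ["ASL", "BSL", "DGS", "CSL", "LSF"]              # illustrative
language_embeddings = np.random.randn(len(language_ids), 256)   # placeholder

def cluster_into_families(embeddings, n_families):
    """Group languages into n_families clusters; each cluster is then used
    to train one dedicated many-to-one translation model."""
    clusterer = AgglomerativeClustering(n_clusters=n_families, linkage="average")
    labels = clusterer.fit_predict(embeddings)
    families = {}
    for lang, label in zip(language_ids, labels):
        families.setdefault(label, []).append(lang)
    return families

# Regulating n_families trades translation accuracy against the cost of
# training and serving one model per family.
print(cluster_into_families(language_embeddings, n_families=2))
```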
Dynamic Feature Fusion for Sign Language Translation Using HyperNetworks
Ruiquan Zhang | Rui Zhao | Zhicong Wu | Liang Zhang | Haoqi Zhang | Yidong Chen
Findings of the Association for Computational Linguistics: NAACL 2025
This paper presents an efficient dual-stream early fusion method for sign language translation. Inspired by the brain’s ability to process color, shape, and motion simultaneously, the method explores complex dependencies between RGB and keypoint streams, improving speed and efficiency. A key challenge is extracting complementary features from both streams while ensuring global semantic consistency to avoid conflicts and improve generalization. To address this issue, we propose a hypernetwork-based fusion strategy that effectively extracts salient features from RGB and keypoint streams, alongside a partial shortcut connection training method to strengthen the complementary information between the dual streams. Additionally, we introduce self-distillation and SST contrastive learning to maintain feature advantages while aligning the global semantic space. Experiments show that our method achieves state-of-the-art performance on two public sign language datasets, reducing model parameters by about two-thirds.
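A minimal PyTorch sketch of hypernetwork-based dual-stream fusion, assuming frame-aligned RGB and keypoint features; the layer sizes, the choice of conditioning the hypernetwork on the keypoint stream, and all names are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class HyperFusion(nn.Module):
    """The hypernetwork consumes a pooled summary of the keypoint stream and
    generates the weights of the linear layer that projects RGB features,
    so the fusion is conditioned on the signer's pose."""
    def __init__(self, rgb_dim, kp_dim, out_dim):
        super().__init__()
        self.out_dim, self.rgb_dim = out_dim, rgb_dim
        # Hypernetwork: keypoint summary -> weights and bias of the RGB projection.
        self.weight_gen = nn.Linear(kp_dim, out_dim * rgb_dim)
        self.bias_gen = nn.Linear(kp_dim, out_dim)
        self.kp_proj = nn.Linear(kp_dim, out_dim)

    def forward(self, rgb_feats, kp_feats):
        # rgb_feats: (B, T, rgb_dim), kp_feats: (B, T, kp_dim)
        kp_summary = kp_feats.mean(dim=1)                            # (B, kp_dim)
        W = self.weight_gen(kp_summary).view(-1, self.out_dim, self.rgb_dim)
        b = self.bias_gen(kp_summary).unsqueeze(1)                   # (B, 1, out_dim)
        rgb_proj = torch.einsum("btd,bod->bto", rgb_feats, W) + b    # per-sample weights
        # Early fusion: combine the conditioned RGB projection with the keypoints.
        return rgb_proj + self.kp_proj(kp_feats)

fusion = HyperFusion(rgb_dim=128, kp_dim=64, out_dim=128)
fused = fusion(torch.randn(2, 32, 128), torch.randn(2, 32, 64))
print(fused.shape)  # torch.Size([2, 32, 128])
```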
2024
Translation Quality Evaluation of Sign Language Avatar
Yuan Zhao | Ruiquan Zhang | Dengfeng Yao | Yidong Chen
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
Sign Language Avatar technology aims to create virtual agents capable of communicating with deaf individuals through sign language, similar to the text dialogue agent ChatGPT but focusing on sign language communication. Challenges in sign language production include limited dataset sizes, information loss due to reliance on intermediate representations, and insufficient realism in generated actions. In this event, we particularly focus on the ability of the Sign Language Avatar to translate spoken language text into sign language that is easily understood by deaf individuals. As the first sign language avatar event held by the China National Conference on Computational Linguistics (CCL), this event attracted wide attention from both industry and academia, with 14 teams registering and 10 of them submitting their system interfaces on time. We provided a dataset consisting of 1074 text-video parallel sentence pairs for training, and the evaluation team comprised proficient Chinese sign language users and professional sign language translators. The scoring method employed a comprehensive evaluation based on multiple metrics, focusing primarily on sign language grammar accuracy, naturalness, readability, and cultural adaptability, with final scores determined by performance across these four aspects. The results showed that four teams demonstrated good readability, with Vivo Mobile Communication Co., Ltd. ranking first with a score of 3.513 (out of a full score of 5), leading the baseline model by 1.394 points. According to the analysis of the results, most teams used the traditional method of converting text into gloss sequences before generating sign language. Additionally, some teams experimented with emerging methods, including gloss-free end-to-end training and Large Language Model (LLM) prompt learning, which also achieved promising results. We anticipate that this event will promote the development of sign language avatar technology and provide higher-quality communication tools for the deaf community. For more information on this task, please visit the website of the CCL24-Eval: Translation Quality Evaluation of Sign Language Avatar Task.
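A minimal sketch of how a composite score over the four evaluated aspects could be computed; the equal weighting and the helper name are assumptions for illustration only, since the event's exact aggregation scheme is not spelled out in the abstract.

```python
def composite_score(grammar, naturalness, readability, cultural_adaptability):
    """Average the four aspect scores, each on a 1-5 scale, into one final score.
    Equal weighting is an assumption; the official scoring may differ."""
    aspects = [grammar, naturalness, readability, cultural_adaptability]
    if not all(1.0 <= s <= 5.0 for s in aspects):
        raise ValueError("each aspect score must lie in [1, 5]")
    return sum(aspects) / len(aspects)

# e.g. a system rated 3.8, 3.4, 3.5, and 3.3 would receive 3.5 overall.
print(round(composite_score(3.8, 3.4, 3.5, 3.3), 3))
```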
Adaptive Simultaneous Sign Language Translation with Confident Translation Length Estimation
Tong Sun | Biao Fu | Cong Hu | Liang Zhang | Ruiquan Zhang | Xiaodong Shi | Jinsong Su | Yidong Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Traditional non-simultaneous Sign Language Translation (SLT) methods, while effective for pre-recorded videos, face challenges in real-time scenarios due to inherent inference delays. The emerging field of simultaneous SLT aims to address this issue by progressively translating incrementally received sign video. However, the sole existing work in simultaneous SLT adopts a fixed gloss-based policy, which suffers from limitations in boundary prediction and contextual comprehension. In this paper, we delve deeper into this area and propose an adaptive policy for simultaneous SLT. Our approach introduces the concept of “confident translation length”, denoting the maximum accurate translation length achievable from the current input. An estimator measures this length for streaming sign video, enabling the model to make informed decisions on whether to wait for more input or proceed with translation. To train the estimator, we construct training data of confident translation lengths based on the longest common prefix between the translations of partial and complete inputs. Furthermore, we incorporate adaptive training, utilizing pseudo prefix pairs, to refine the offline translation model for optimal performance in simultaneous scenarios. Experimental results on PHOENIX2014T and CSL-Daily demonstrate the superiority of our adaptive policy over existing methods, particularly excelling in situations requiring extremely low latency.
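A minimal sketch of how training labels for the confident translation length estimator could be derived, following the longest-common-prefix idea in the abstract; the whitespace tokenization and function name are illustrative assumptions.

```python
def confident_length(partial_translation, full_translation):
    """Return the length (in tokens) of the longest common prefix between the
    translation of a partial video and the translation of the complete video."""
    length = 0
    for p, f in zip(partial_translation.split(), full_translation.split()):
        if p != f:
            break
        length += 1
    return length

# Each streaming prefix of the sign video is paired with such a label; the
# estimator then decides whether to wait for more frames or emit tokens.
print(confident_length("the weather will be sunny today",
                       "the weather will be cloudy tomorrow"))  # 4
```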