@inproceedings{bianco-etal-2025-augmenting,
    title = "Augmenting Sign Language Translation Datasets with Large Language Models",
    author = "Bianco, Pedro Alejandro Dal and
      Reinhold, Jean Paul Nunes and
      Quiroga, Facundo Manuel and
      Ronchetti, Franco",
    editor = "Hasanuzzaman, Mohammed and
      Quiroga, Facundo Manuel and
      Modi, Ashutosh and
      Kamila, Sabyasachi and
      Artiaga, Keren and
      Joshi, Abhinav and
      Singh, Sanjeet",
    booktitle = "Proceedings of the Workshop on Sign Language Processing (WSLP)",
    month = dec,
    year = "2025",
    address = "IIT Bombay, Mumbai, India (Co-located with IJCNLP{--}AACL 2025)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.wslp-main.4/",
    pages = "20--26",
    ISBN = "979-8-89176-304-3",
    abstract = "Sign language translation (SLT) is a challenging task due to the scarcity of labeled data and the heavy-tailed distribution of sign language vocabularies. In this paper, we explore a novel data augmentation approach for SLT: using a large language model (LLM) to generate paraphrases of the target language sentences in the training data. We experiment with a Transformer-based SLT model (Signformer) on three datasets spanning German, Greek, and Argentinian Sign Languages. For models trained with augmentation, we adopt a two-stage regime: pre-train on the LLM-augmented corpus and then fine-tune on the original, non-augmented training set. Our augmented training sets, expanded with GPT-4-generated paraphrases, yield mixed results. On a medium-scale German SL corpus (PHOENIX14T), LLM augmentation improves BLEU-4 from 9.56 to 10.33. In contrast, a small-vocabulary Greek SL dataset with a near-perfect baseline (94.38 BLEU) sees a slight drop to 92.22 BLEU, and a complex Argentinian SL corpus with a long-tail vocabulary distribution remains around 1.2 BLEU despite augmentation. We analyze these outcomes in relation to each dataset{'}s complexity and token frequency distribution, finding that LLM-based augmentation is more beneficial when the dataset contains a richer vocabulary and many infrequent tokens. To our knowledge, this work is the first to apply LLM paraphrasing to SLT, and we discuss these results with respect to prior data augmentation efforts in sign language translation."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="bianco-etal-2025-augmenting">
    <titleInfo>
        <title>Augmenting Sign Language Translation Datasets with Large Language Models</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Pedro</namePart>
        <namePart type="given">Alejandro</namePart>
        <namePart type="given">Dal</namePart>
        <namePart type="family">Bianco</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jean</namePart>
        <namePart type="given">Paul</namePart>
        <namePart type="given">Nunes</namePart>
        <namePart type="family">Reinhold</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Facundo</namePart>
        <namePart type="given">Manuel</namePart>
        <namePart type="family">Quiroga</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Franco</namePart>
        <namePart type="family">Ronchetti</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Workshop on Sign Language Processing (WSLP)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Mohammed</namePart>
            <namePart type="family">Hasanuzzaman</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Facundo</namePart>
            <namePart type="given">Manuel</namePart>
            <namePart type="family">Quiroga</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ashutosh</namePart>
            <namePart type="family">Modi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sabyasachi</namePart>
            <namePart type="family">Kamila</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Keren</namePart>
            <namePart type="family">Artiaga</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Abhinav</namePart>
            <namePart type="family">Joshi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sanjeet</namePart>
            <namePart type="family">Singh</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">IIT Bombay, Mumbai, India (Co-located with IJCNLP–AACL 2025)</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-8-89176-304-3</identifier>
    </relatedItem>
    <abstract>Sign language translation (SLT) is a challenging task due to the scarcity of labeled data and the heavy-tailed distribution of sign language vocabularies. In this paper, we explore a novel data augmentation approach for SLT: using a large language model (LLM) to generate paraphrases of the target language sentences in the training data. We experiment with a Transformer-based SLT model (Signformer) on three datasets spanning German, Greek, and Argentinian Sign Languages. For models trained with augmentation, we adopt a two-stage regime: pre-train on the LLM-augmented corpus and then fine-tune on the original, non-augmented training set. Our augmented training sets, expanded with GPT-4-generated paraphrases, yield mixed results. On a medium-scale German SL corpus (PHOENIX14T), LLM augmentation improves BLEU-4 from 9.56 to 10.33. In contrast, a small-vocabulary Greek SL dataset with a near-perfect baseline (94.38 BLEU) sees a slight drop to 92.22 BLEU, and a complex Argentinian SL corpus with a long-tail vocabulary distribution remains around 1.2 BLEU despite augmentation. We analyze these outcomes in relation to each dataset’s complexity and token frequency distribution, finding that LLM-based augmentation is more beneficial when the dataset contains a richer vocabulary and many infrequent tokens. To our knowledge, this work is the first to apply LLM paraphrasing to SLT, and we discuss these results with respect to prior data augmentation efforts in sign language translation.</abstract>
    <identifier type="citekey">bianco-etal-2025-augmenting</identifier>
    <location>
        <url>https://aclanthology.org/2025.wslp-main.4/</url>
    </location>
    <part>
        <date>2025-12</date>
        <extent unit="page">
            <start>20</start>
            <end>26</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Augmenting Sign Language Translation Datasets with Large Language Models
%A Bianco, Pedro Alejandro Dal
%A Reinhold, Jean Paul Nunes
%A Quiroga, Facundo Manuel
%A Ronchetti, Franco
%Y Hasanuzzaman, Mohammed
%Y Quiroga, Facundo Manuel
%Y Modi, Ashutosh
%Y Kamila, Sabyasachi
%Y Artiaga, Keren
%Y Joshi, Abhinav
%Y Singh, Sanjeet
%S Proceedings of the Workshop on Sign Language Processing (WSLP)
%D 2025
%8 December
%I Association for Computational Linguistics
%C IIT Bombay, Mumbai, India (Co-located with IJCNLP–AACL 2025)
%@ 979-8-89176-304-3
%F bianco-etal-2025-augmenting
%X Sign language translation (SLT) is a challenging task due to the scarcity of labeled data and the heavy-tailed distribution of sign language vocabularies. In this paper, we explore a novel data augmentation approach for SLT: using a large language model (LLM) to generate paraphrases of the target language sentences in the training data. We experiment with a Transformer-based SLT model (Signformer) on three datasets spanning German, Greek, and Argentinian Sign Languages. For models trained with augmentation, we adopt a two-stage regime: pre-train on the LLM-augmented corpus and then fine-tune on the original, non-augmented training set. Our augmented training sets, expanded with GPT-4-generated paraphrases, yield mixed results. On a medium-scale German SL corpus (PHOENIX14T), LLM augmentation improves BLEU-4 from 9.56 to 10.33. In contrast, a small-vocabulary Greek SL dataset with a near-perfect baseline (94.38 BLEU) sees a slight drop to 92.22 BLEU, and a complex Argentinian SL corpus with a long-tail vocabulary distribution remains around 1.2 BLEU despite augmentation. We analyze these outcomes in relation to each dataset’s complexity and token frequency distribution, finding that LLM-based augmentation is more beneficial when the dataset contains a richer vocabulary and many infrequent tokens. To our knowledge, this work is the first to apply LLM paraphrasing to SLT, and we discuss these results with respect to prior data augmentation efforts in sign language translation.
%U https://aclanthology.org/2025.wslp-main.4/
%P 20-26

Markdown (Informal)

[Augmenting Sign Language Translation Datasets with Large Language Models](https://aclanthology.org/2025.wslp-main.4/) (Bianco et al., WSLP 2025)

ACL

Pedro Alejandro Dal Bianco, Jean Paul Nunes Reinhold, Facundo Manuel Quiroga, and Franco Ronchetti. 2025. Augmenting Sign Language Translation Datasets with Large Language Models. In Proceedings of the Workshop on Sign Language Processing (WSLP), pages 20–26, IIT Bombay, Mumbai, India (Co-located with IJCNLP–AACL 2025). Association for Computational Linguistics.
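
The abstract describes the augmentation recipe concretely: each training pair couples a sign-video clip with a target-language sentence, GPT-4 paraphrases the sentence side only, and the SLT model is pre-trained on the expanded corpus before fine-tuning on the original data. The sketch below illustrates that pipeline under stated assumptions; it is not the authors' code. The helper names (`paraphrase_sentence`, `augment`), the prompt, the paraphrase count, the `(video_path, sentence)` dataset representation, and the use of the OpenAI Python client are all assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of LLM paraphrase augmentation for SLT,
# as described in the paper's abstract: expand each (sign video, target sentence)
# pair with GPT-4 paraphrases of the sentence, leaving the video side unchanged.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def paraphrase_sentence(sentence: str, n: int = 3) -> list[str]:
    """Request n paraphrases of one target-language sentence (prompt is an assumption)."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite the following sentence in {n} different ways, "
                f"one per line, preserving its meaning:\n{sentence}"
            ),
        }],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:n]


def augment(dataset: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the original (video_path, sentence) pairs plus paraphrased copies."""
    augmented = list(dataset)  # keep every original pair
    for video_path, sentence in dataset:
        for paraphrase in paraphrase_sentence(sentence):
            # Same clip, new text: only the target side is augmented.
            augmented.append((video_path, paraphrase))
    return augmented


# Two-stage regime from the abstract (the SLT training loop itself is omitted):
#   1. pre-train the Transformer-based SLT model (Signformer) on augment(train_pairs)
#   2. fine-tune on the original, non-augmented train_pairs
```

Note that, as the abstract states, only the target sentences are paraphrased, so each video clip is simply reused with new text; no new sign videos are synthesized.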