Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing

Zifan Jiang; Youngjoon Jang; Liliane Momeni; Gül Varol; Sarah Ebling; Andrew Zisserman

Abstract

The goal of this work is to develop a universal approach for aligning subtitles (i.e., spoken language text with corresponding timestamps) to continuous sign language videos. Prior approaches typically rely on end-to-end training tied to a specific language or dataset, which limits their generality. In contrast, our method Segment, Embed, and Align (SEA) provides a single framework that works across multiple languages and domains. SEA leverages two pretrained models: the first to segment a video sequence into individual signs and the second to embed each sign video clip into a shared latent space with text. Alignment is subsequently performed with a lightweight dynamic programming procedure that runs efficiently on CPU within a minute, even for hour-long episodes. SEA is flexible and can adapt to a wide range of scenarios, utilizing resources from small lexicons to large continuous corpora. Experiments on four sign language datasets demonstrate state-of-the-art alignment performance, highlighting the potential of SEA to generate high-quality parallel data for advancing sign language processing.
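The abstract does not spell out the dynamic programming step, so the following is only a minimal sketch of one common formulation, not the authors' procedure. It assumes a precomputed similarity matrix between subtitle text embeddings and sign segment embeddings, and it assigns each segment to a subtitle under a monotonic (time-ordered) constraint. The function name `align_monotonic` and the input layout `sim[i][j]` are hypothetical.

```python
from typing import List


def align_monotonic(sim: List[List[float]]) -> List[int]:
    """Assign each sign segment to one subtitle, preserving temporal order.

    sim[i][j] is an assumed similarity between subtitle i and sign segment j
    (for example, a cosine similarity in a shared embedding space).
    Returns a list `a` with one subtitle index per segment, such that
    a[j] <= a[j + 1] and the summed similarity is maximal.
    """
    n_sub = len(sim)
    n_seg = len(sim[0]) if n_sub else 0
    if n_seg == 0:
        return []

    # score[j][i]: best total similarity for segments 0..j
    # when segment j is assigned to subtitle i.
    score = [[0.0] * n_sub for _ in range(n_seg)]
    # back[j][i]: subtitle chosen for segment j - 1 on that best path.
    back = [[0] * n_sub for _ in range(n_seg)]

    for i in range(n_sub):
        score[0][i] = sim[i][0]

    for j in range(1, n_seg):
        best_val = float("-inf")
        best_idx = 0
        for i in range(n_sub):
            # Running maximum over previous subtitles k <= i keeps the
            # whole table fill at O(n_sub * n_seg).
            if score[j - 1][i] > best_val:
                best_val = score[j - 1][i]
                best_idx = i
            score[j][i] = sim[i][j] + best_val
            back[j][i] = best_idx

    # Backtrack from the best final state.
    i = max(range(n_sub), key=lambda k: score[n_seg - 1][k])
    assignment = [0] * n_seg
    for j in range(n_seg - 1, -1, -1):
        assignment[j] = i
        i = back[j][i]
    return assignment


# Toy example: two subtitles, four sign segments.
toy_sim = [
    [0.9, 0.8, 0.1, 0.2],
    [0.1, 0.3, 0.7, 0.6],
]
print(align_monotonic(toy_sim))  # [0, 0, 1, 1]
```

Under these assumptions, subtitle time boundaries could then be read off from the first and last segment assigned to each subtitle. The table has one cell per (subtitle, segment) pair and each cell is filled in constant time, which is consistent with the abstract's claim that alignment is cheap enough to run on CPU.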

Anthology ID: 2026.acl-long.1401
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 30371–30384
URL: https://aclanthology.org/2026.acl-long.1401/
Cite (ACL): Zifan Jiang, Youngjoon Jang, Liliane Momeni, Gül Varol, Sarah Ebling, and Andrew Zisserman. 2026. Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30371–30384, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing (Jiang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1401.pdf
Checklist: 2026.acl-long.1401.checklist.pdf