Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation

Linjun Li, Tao Jin, Xize Cheng, Ye Wang, Wang Lin, Rongjie Huang, Zhou Zhao


Abstract
Visual temporal-aligned translation aims to transform a visual sequence into natural language, covering important applications such as lipreading and fingerspelling recognition. However, different speakers or signers produce the same words with varied performance habits, which causes visual ambiguity and has become a major obstacle for current methods. Given this constraint, the generalization ability of a translation system should be further examined through evaluation on unseen performers. In this paper, we develop a novel generalizable framework named Contrastive Token-Wise Meta-learning (CtoML), which strives to transfer recognition skills to unseen performers. To the best of our knowledge, directly employing meta-learning methods in the image domain poses two main challenges, and we propose corresponding strategies. First, sequence prediction in visual temporal-aligned translation, which generates multiple words autoregressively, differs from vanilla classification. We therefore devise token-wise diversity-aware weights for the meta-train stage, which encourage the model to concentrate on ambiguously recognized tokens. Second, to keep word-visual prototypes consistent across domains, we develop two complementary contrastive losses, one global and one local, that maintain inter-class relationships and promote domain-independent representations. We conduct extensive experiments on the widely used lipreading dataset GRID and the fingerspelling dataset ChicagoFSWild, and the results demonstrate the effectiveness of our proposed CtoML over existing state-of-the-art methods.
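The abstract names two components: token-wise diversity-aware weights for the meta-train loss, and paired global/local contrastive losses over word-visual prototypes. Since the paper's formulas are not reproduced on this page, the PyTorch sketch below is only a minimal illustration under stated assumptions: per-token prediction entropy stands in for the diversity-aware weight, and a standard InfoNCE-style supervised contrastive term stands in for the global loss. The function names, the pad_id argument, and the temperature value are all hypothetical, not taken from the paper.

import torch
import torch.nn.functional as F

def token_wise_weighted_loss(logits, targets, pad_id=0):
    """Cross-entropy where each token is re-weighted by prediction ambiguity.

    Assumption: 'diversity-aware weights' are approximated by the per-token
    entropy of the predicted distribution, so ambiguously recognized tokens
    contribute more to the meta-train loss.
    logits:  (batch, seq_len, vocab); targets: (batch, seq_len)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(-1)     # (batch, seq_len)
    mask = (targets != pad_id).float()
    weights = entropy * mask
    weights = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
    # nll_loss expects (batch, vocab, seq_len) for sequence targets.
    nll = F.nll_loss(log_probs.transpose(1, 2), targets, reduction="none")
    return (weights * nll).sum() / logits.size(0)

def global_contrastive_loss(features, labels, temperature=0.1):
    """InfoNCE-style supervised contrastive term over token embeddings.

    Assumption: positives are embeddings that share the same word label,
    regardless of performer, which pulls word-visual prototypes together
    across domains.
    features: (n, d); labels: (n,) word indices
    """
    feats = F.normalize(features, dim=-1)
    sim = feats @ feats.t() / temperature                # (n, n) similarities
    eye = torch.eye(labels.size(0), dtype=torch.bool, device=features.device)
    pos_mask = ((labels[:, None] == labels[None, :]) & ~eye).float()
    # Exclude self-similarity from the normalizing denominator.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    return -(log_prob * pos_mask).sum(1).div(pos_count).mean()

# Toy usage: 4 token embeddings of dim 8 with two word labels.
feats = torch.randn(4, 8)
labels = torch.tensor([0, 1, 0, 1])
print(global_contrastive_loss(feats, labels))

A local contrastive term, per the abstract, would complement this by operating within a narrower scope (e.g., within a domain or sequence); its exact form is not specified here, so it is omitted from the sketch.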
Anthology ID:
2023.findings-acl.699
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
10993–11007
URL:
https://aclanthology.org/2023.findings-acl.699
DOI:
10.18653/v1/2023.findings-acl.699
Cite (ACL):
Linjun Li, Tao Jin, Xize Cheng, Ye Wang, Wang Lin, Rongjie Huang, and Zhou Zhao. 2023. Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 10993–11007, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation (Li et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.699.pdf