SA-DETR:Span Aware Detection Transformer for Moment Retrieval

Tianheng Xiong, Wei Wei, Kaihe Xu, Dangyang Chen


Abstract
Moment Retrieval aims to locate the video segments that correspond to a given text query. Recently, DETR-based methods, which originate from Object Detection, have emerged as effective solutions for Moment Retrieval. These approaches focus on multimodal feature fusion and on refining queries, each composed of a span anchor and a content embedding. Despite their success, they often overlook the instance-related information of the video-text pair during Query Initialization and the crucial guiding role of span anchors during Query Refinement, which leads to inaccurate predictions. To address this, we propose a novel Span Aware DEtection TRansformer (SA-DETR) that exploits instance-related span anchors. To make full use of the instance-related information, we generate span anchors from the video-text pair rather than from learnable parameters, as is common in conventional DETR-based methods, and supervise them with ground-truth (GT) labels. To exploit the correspondence between span anchors and video clips, we enhance the content embedding under the guidance of textual features and generate a Gaussian mask to modulate the interaction between the content embedding and the fusion features. Furthermore, we explore feature alignment across various stages and granularities and apply denoising learning to boost the span awareness of the model. Extensive experiments on QVHighlights, Charades-STA, and TACoS demonstrate the effectiveness of our approach.
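The abstract does not give the exact form of the Gaussian mask, so the following is only a minimal sketch of the general idea, under stated assumptions: a span anchor is taken to be a (center, width) pair in normalized video coordinates, the mask is a Gaussian over clip midpoints centered on the anchor, and it modulates attention by being added as a log-bias before the softmax. The function names, the `sigma_scale` parameter, and the log-bias choice are hypothetical and are not taken from the paper.

```python
import math


def gaussian_mask(center, width, num_clips, sigma_scale=0.5):
    """Weights in (0, 1] over clips, largest at the clips nearest the span center.

    center, width: span anchor in normalized [0, 1] video coordinates (assumed).
    sigma_scale: hypothetical knob tying the Gaussian's std to the span width.
    """
    sigma = max(width * sigma_scale, 1e-6)  # guard against zero-width spans
    mask = []
    for t in range(num_clips):
        pos = (t + 0.5) / num_clips  # normalized midpoint of clip t
        mask.append(math.exp(-((pos - center) ** 2) / (2 * sigma ** 2)))
    return mask


def masked_attention(scores, mask, eps=1e-9):
    """Softmax over per-clip scores after adding log(mask) as a bias (assumed)."""
    logits = [s + math.log(m + eps) for s, m in zip(scores, mask)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# One query whose anchor covers the middle of a 10-clip video.
mask = gaussian_mask(center=0.5, width=0.2, num_clips=10)
# Uniform raw scores, so the resulting weights reflect the mask alone.
weights = masked_attention([0.0] * 10, mask)
```

With uniform raw scores, the attention weights concentrate on the clips near the anchor center, which is the sense in which a span anchor can guide the interaction between a content embedding and clip-level features.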
Anthology ID:
2025.coling-main.510
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
7634–7647
URL:
https://aclanthology.org/2025.coling-main.510/
Cite (ACL):
Tianheng Xiong, Wei Wei, Kaihe Xu, and Dangyang Chen. 2025. SA-DETR:Span Aware Detection Transformer for Moment Retrieval. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7634–7647, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
SA-DETR:Span Aware Detection Transformer for Moment Retrieval (Xiong et al., COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.510.pdf