Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding

Erica Kido Shimomoto, Edison Marrese-Taylor, Hiroya Takamura, Ichiro Kobayashi, Hideki Nakayama, Yusuke Miyao


Abstract
This paper explores the task of Temporal Video Grounding (TVG), where, given an untrimmed video and a query sentence, the goal is to recognize and determine the temporal boundaries of the action instances in the video that the natural language query describes. Recent works have tackled this task by improving query inputs with large pre-trained language models (PLMs), at the cost of more expensive training. However, the effects of this integration are unclear, as these works also propose improvements to the visual inputs. Therefore, this paper studies the role of query sentence representation with PLMs in TVG and assesses the applicability of parameter-efficient training with NLP adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that, with the same visual inputs, TVG models greatly benefited from PLM integration and fine-tuning, stressing the importance of the text query representation in this task. Furthermore, adapters were an effective alternative to full fine-tuning, even though they were not tailored to our task, allowing PLM integration in larger TVG models and delivering results comparable to state-of-the-art models. Finally, our results shed light on which adapters work best in different scenarios.
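To make the idea of parameter-efficient training with adapters more concrete, the following is a minimal sketch of a generic bottleneck adapter in plain Python. It is an illustration only and not the authors' implementation: the abstract does not say which adapter designs were tested, and the names `make_adapter`, `adapter_forward`, and `count_params`, the ReLU nonlinearity, and the zero-initialised up-projection are all assumptions made for this example. The sketch shows a small trainable module added on top of a frozen hidden vector through a residual connection, and how few parameters it introduces.

```python
import random


def make_adapter(hidden_dim, bottleneck_dim, seed=0):
    """Create weights for a hypothetical bottleneck adapter (illustrative only)."""
    rng = random.Random(seed)
    # Down-projection: hidden_dim -> bottleneck_dim, small random values.
    down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck_dim)]
            for _ in range(hidden_dim)]
    # Up-projection: bottleneck_dim -> hidden_dim, zero-initialised (an
    # assumption) so that the adapter starts out as the identity map.
    up = [[0.0] * hidden_dim for _ in range(bottleneck_dim)]
    return {
        "down": down,
        "down_bias": [0.0] * bottleneck_dim,
        "up": up,
        "up_bias": [0.0] * hidden_dim,
    }


def matvec(vec, mat, bias):
    """Compute vec @ mat + bias for a vector of length n and an n x m matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) + bias[j]
            for j in range(len(bias))]


def adapter_forward(adapter, hidden):
    """Apply down-projection, ReLU, up-projection, then a residual connection."""
    z = matvec(hidden, adapter["down"], adapter["down_bias"])
    z = [max(0.0, x) for x in z]
    delta = matvec(z, adapter["up"], adapter["up_bias"])
    return [h + d for h, d in zip(hidden, delta)]


def count_params(adapter):
    """Count the trainable parameters the adapter adds."""
    return (sum(len(row) for row in adapter["down"]) + len(adapter["down_bias"])
            + sum(len(row) for row in adapter["up"]) + len(adapter["up_bias"]))


if __name__ == "__main__":
    adapter = make_adapter(hidden_dim=8, bottleneck_dim=2)
    hidden = [0.5] * 8
    print(count_params(adapter))
    print(adapter_forward(adapter, hidden))
```

In this sketch only the adapter weights would be updated during training while the PLM weights stay frozen, so the added parameter count grows with the product of the hidden size and the (much smaller) bottleneck size rather than with the size of the full model.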
Anthology ID:
2023.findings-acl.829
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
13101–13123
URL:
https://aclanthology.org/2023.findings-acl.829
DOI:
10.18653/v1/2023.findings-acl.829
Cite (ACL):
Erica Kido Shimomoto, Edison Marrese-Taylor, Hiroya Takamura, Ichiro Kobayashi, Hideki Nakayama, and Yusuke Miyao. 2023. Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding. In Findings of the Association for Computational Linguistics: ACL 2023, pages 13101–13123, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Towards Parameter-Efficient Integration of Pre-Trained Language Models In Temporal Video Grounding (Kido Shimomoto et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-acl.829.pdf
Video:
https://aclanthology.org/2023.findings-acl.829.mp4