Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks

Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, Seungwhan Moon


Abstract
Searching troves of videos with textual descriptions is a core multimodal retrieval task. Owing to the lack of a purpose-built dataset for text-to-video retrieval, video captioning datasets have been re-purposed to evaluate models by (1) treating captions as positive matches to their respective videos and (2) assuming all other videos to be negatives. However, this methodology leads to a fundamental flaw during evaluation: because each caption is marked as relevant only to its original video, any other video that also matches the caption is counted as a negative, which introduces false-negative caption-video pairs. We show that when these false negatives are corrected, a recent state-of-the-art model gains 25% recall points, a difference that threatens the validity of the benchmark itself. To diagnose and mitigate this issue, we annotate and release 683K additional caption-video pairs. Using these, we recompute effectiveness scores for three models on two standard benchmarks (MSR-VTT and MSVD). We find that (1) the recomputed metrics are up to 25% recall points higher for the best models, (2) these benchmarks are nearing saturation for Recall@10, (3) caption length (generality) is related to the number of positives, and (4) annotation costs can be mitigated through sampling. We recommend retiring these benchmarks in their current form, and we make recommendations for future text-to-video retrieval benchmarks.
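The effect of false negatives on a recall metric can be illustrated with a small sketch. The code below is not the authors' evaluation code: the function name `recall_at_k`, the toy rankings, and the label sets are all hypothetical, and it assumes Recall@k counts a caption as a hit when at least one relevant video appears in the top k results, which may differ from the definition used in the paper.

```python
from typing import Dict, List, Set


def recall_at_k(rankings: Dict[str, List[str]],
                positives: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of captions with at least one relevant video in the top k.

    Assumption: a caption is a "hit" if any of its positives is retrieved.
    """
    hits = 0
    for caption_id, ranked_videos in rankings.items():
        top_k = ranked_videos[:k]
        if any(video in positives[caption_id] for video in top_k):
            hits += 1
    return hits / len(rankings)


# Hypothetical model output: for each caption, videos ranked best-first.
rankings = {
    "c1": ["v2", "v1", "v3", "v4"],
    "c2": ["v2", "v3", "v1", "v4"],
    "c3": ["v4", "v1", "v3", "v2"],
}

# Original benchmark labels: each caption matches only its source video.
original = {"c1": {"v1"}, "c2": {"v2"}, "c3": {"v3"}}

# Corrected labels: hypothetical extra annotations mark additional matches.
corrected = {"c1": {"v1", "v2"}, "c2": {"v2"}, "c3": {"v3", "v4"}}

r1_original = recall_at_k(rankings, original, k=1)
r1_corrected = recall_at_k(rankings, corrected, k=1)
print(f"Recall@1 original:  {r1_original:.3f}")
print(f"Recall@1 corrected: {r1_corrected:.3f}")
```

In this toy setup the model's rankings are unchanged between the two calls; only the relevance labels differ, so any change in the score comes entirely from videos that were previously counted as negatives.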
Anthology ID:
2023.findings-eacl.3
Volume:
Findings of the Association for Computational Linguistics: EACL 2023
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Andreas Vlachos, Isabelle Augenstein
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
47–68
URL:
https://aclanthology.org/2023.findings-eacl.3
DOI:
10.18653/v1/2023.findings-eacl.3
Cite (ACL):
Pedro Rodriguez, Mahmoud Azab, Becka Silvert, Renato Sanchez, Linzy Labson, Hardik Shah, and Seungwhan Moon. 2023. Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks. In Findings of the Association for Computational Linguistics: EACL 2023, pages 47–68, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks (Rodriguez et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-eacl.3.pdf
Dataset:
 2023.findings-eacl.3.dataset.zip
Video:
 https://aclanthology.org/2023.findings-eacl.3.mp4