TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning

Kate Sanders, Nathaniel Weir, Benjamin Van Durme

Abstract
It is challenging for models to understand complex, multimodal content such as television clips, in part because video-language models often rely on single-modality reasoning and lack interpretability. To combat these issues, we propose TV-TREES, the first multimodal entailment tree generator. TV-TREES is an approach to video understanding that promotes interpretable joint-modality reasoning by searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. We also introduce the task of multimodal entailment tree generation to evaluate reasoning quality. Our method’s performance on the challenging TVQA benchmark demonstrates interpretable, state-of-the-art zero-shot performance on full clips, illustrating that multimodal entailment tree generation can be a best-of-both-worlds alternative to black-box systems.
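The abstract describes searching for trees of entailment relationships between simple text-video evidence and higher-level conclusions that prove question-answer pairs. As an illustration only, the minimal sketch below shows one way such a tree could be represented; the `EntailmentNode` class, `Modality` enum, field names, threshold, and toy example are assumptions made for exposition, not the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Modality(Enum):
    """Source modality of a leaf-level piece of evidence (illustrative labels)."""
    TEXT = "text"    # e.g., dialogue or subtitle evidence
    VIDEO = "video"  # e.g., visual evidence from frames


@dataclass
class EntailmentNode:
    """One node in a hypothetical multimodal entailment tree.

    A leaf holds a simple statement grounded in a single piece of text or
    video evidence; an internal node holds a higher-level conclusion entailed
    by its child premises. The root corresponds to the statement formed from
    a question-answer pair.
    """
    statement: str
    premises: List["EntailmentNode"] = field(default_factory=list)
    modality: Optional[Modality] = None  # set for leaves only
    entailment_score: float = 1.0        # confidence that premises entail statement

    def is_leaf(self) -> bool:
        return not self.premises

    def is_proved(self, threshold: float = 0.5) -> bool:
        """A node is proved if it is grounded evidence, or if all of its
        premises are proved and they jointly entail the statement."""
        if self.is_leaf():
            return self.modality is not None
        return (
            self.entailment_score >= threshold
            and all(p.is_proved(threshold) for p in self.premises)
        )


# Toy example: proving a QA-pair statement from one text and one video premise.
root = EntailmentNode(
    statement="The character left the apartment because she was angry.",
    entailment_score=0.9,
    premises=[
        EntailmentNode("She says 'I can't stay here another minute.'",
                       modality=Modality.TEXT),
        EntailmentNode("She slams the door on her way out.",
                       modality=Modality.VIDEO),
    ],
)
print(root.is_proved())  # True
```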
Anthology ID:
2024.emnlp-main.1059
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
19009–19028
URL:
https://aclanthology.org/2024.emnlp-main.1059
DOI:
10.18653/v1/2024.emnlp-main.1059
Cite (ACL):
Kate Sanders, Nathaniel Weir, and Benjamin Van Durme. 2024. TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 19009–19028, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning (Sanders et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1059.pdf