VisualEDU: A Benchmark for Assessing Coding and Visual Comprehension through Educational Problem-Solving Video Generation

Hao Chen, Tianyu Shi, Pengran Huang, Zeyuan Li, Jiahui Pan, Qianglong Chen, Lewei He


Abstract
Generating logically coherent video from text (T2V) for reasoning-intensive tasks such as mathematical problem-solving remains a significant challenge for Vision-Language Models (VLMs). We introduce VisualEDU, a benchmark built on the Manim package that rigorously evaluates VLM capabilities in producing coherent, step-by-step video solutions for educational purposes. Its framework integrates meta-prompt learning, visual and code feedback, and a modular drawing toolkit to enhance output quality. We propose novel metrics for temporal consistency, logical correctness, and visual clarity. Extensive experiments across nine VLMs reveal that while advanced proprietary models show promise, all struggle significantly as task complexity increases (e.g., Claude-3.7-Sonnet and GPT-4o both perform below 56% on difficult tasks), highlighting limitations in code generation, visual feedback correction, and precise tool invocation. VisualEDU offers a robust platform for systematic T2V assessment in reasoning-intensive domains and guides future VLM improvements in this area.
Anthology ID:
2025.findings-emnlp.889
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16363–16394
URL:
https://aclanthology.org/2025.findings-emnlp.889/
Cite (ACL):
Hao Chen, Tianyu Shi, Pengran Huang, Zeyuan Li, Jiahui Pan, Qianglong Chen, and Lewei He. 2025. VisualEDU: A Benchmark for Assessing Coding and Visual Comprehension through Educational Problem-Solving Video Generation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16363–16394, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
VisualEDU: A Benchmark for Assessing Coding and Visual Comprehension through Educational Problem-Solving Video Generation (Chen et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.889.pdf
Checklist:
 2025.findings-emnlp.889.checklist.pdf