Visual Program Distillation with Template-Based Augmentation

Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem


Abstract
Adapting visual programming, in which large language models (LLMs) are prompted to generate executable code for visual tasks such as visual question answering (VQA), to specialized tasks or domains remains challenging due to high annotation and inference costs. We propose a low-cost visual program distillation method that works with models of at most 1 billion parameters and requires no human-generated program annotations. We achieve this through synthetic data augmentation that decouples programs into higher-level skills, called templates, and their corresponding arguments. Experimental results show that, with a relatively small amount of question/answer data, small language models can generate high-quality specialized visual programs, with the added benefit of much faster inference.
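The core idea of decoupling a visual program into a template and its arguments can be sketched as follows. This is a minimal illustrative example, not the paper's actual implementation; the template format, the `exists`/`find` program primitives, and the function names are assumptions for illustration.

```python
# Hypothetical sketch of template-based augmentation: a visual program is
# split into a reusable template (a higher-level skill) and its arguments,
# and new synthetic question/program training pairs are produced by
# instantiating the template with different arguments.

TEMPLATE = {
    "question": "Is there a {obj} in the image?",
    "program": "exists(find(image, '{obj}'))",
}

def augment(template, arguments):
    """Instantiate a template once per argument, yielding (question, program) pairs."""
    return [
        (template["question"].format(obj=a), template["program"].format(obj=a))
        for a in arguments
    ]

pairs = augment(TEMPLATE, ["dog", "traffic light", "stop sign"])
for question, program in pairs:
    print(question, "->", program)
```

Such synthetically generated pairs could then serve as distillation data for a small language model, avoiding human-written program annotations.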
Anthology ID:
2025.findings-emnlp.162
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2998–3018
URL:
https://aclanthology.org/2025.findings-emnlp.162/
Cite (ACL):
Michal Shlapentokh-Rothman, Yu-Xiong Wang, and Derek Hoiem. 2025. Visual Program Distillation with Template-Based Augmentation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2998–3018, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Visual Program Distillation with Template-Based Augmentation (Shlapentokh-Rothman et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.162.pdf
Checklist:
 2025.findings-emnlp.162.checklist.pdf