ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

Yuhang Li; Chenchen Zhang; Ruilin Lv; Ao Liu; Ken Deng; Yuanxing Zhang; Jiaheng Liu; Bo Zhou

Abstract

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement learning framework that empowers an agent to close a robust generate–diagnose–refine loop by invoking a multimodal LLM (MLLM) as a tool. During training, the agent employs an MLLM-in-the-loop to serve as a visual critic, evaluating code via screenshots and providing actionable feedback. Crucially, we enforce a strict zero-reward policy for invalid renders to guarantee renderability and mitigate reward hacking. To prevent behavioral collapse, we introduce Forced Optimization, a strict acceptance rule that admits only improving revisions, yielding monotonically better trajectories. At inference, we decouple the critic and run a lightweight, critic-free self-edit cycle, keeping latency comparable to base decoding while retaining most of the gains. Across three widely used benchmarks, ReLook consistently outperforms strong baselines in vision-grounded front-end code generation, highlighting the benefits of agentic perception, visual rewards, and training–inference decoupling.
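The two training-time rules named in the abstract (zero reward for invalid renders, and an acceptance rule that admits only improving revisions) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Revision` record, the `critic_score` field, the scalar score range, and the strict "greater than best so far" comparison are all assumptions made for illustration, since the abstract does not specify how the critic's feedback is scored or compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Revision:
    """Hypothetical record for one generated version of a web page."""
    code: str
    rendered: bool       # assumed flag: did the page render to a valid screenshot?
    critic_score: float  # assumed scalar score from the visual critic


def reward(rev: Revision) -> float:
    """Zero reward for invalid renders; otherwise the (assumed) critic score."""
    return rev.critic_score if rev.rendered else 0.0


def accept_improving(candidates: List[Revision]) -> List[Revision]:
    """Keep a revision only if its reward strictly exceeds the best so far.

    The first candidate is always kept as the starting point, so the
    rewards of the accepted revisions are strictly increasing.
    """
    accepted: List[Revision] = []
    best = float("-inf")
    for rev in candidates:
        r = reward(rev)
        if r > best:
            accepted.append(rev)
            best = r
    return accepted


if __name__ == "__main__":
    trajectory = [
        Revision("v0", True, 0.4),
        Revision("v1", False, 0.9),  # invalid render: reward forced to 0
        Revision("v2", True, 0.3),   # valid but worse than v0: rejected
        Revision("v3", True, 0.7),   # improves on v0: accepted
    ]
    print([rev.code for rev in accept_improving(trajectory)])
```

In this sketch a revision with a high critic score but a failed render is still rejected, which is the property the zero-reward rule is meant to guarantee under the stated assumptions.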

Anthology ID: 2026.acl-long.1167
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 25471–25485
URL: https://aclanthology.org/2026.acl-long.1167/
Cite (ACL): Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, and Bo Zhou. 2026. ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 25471–25485, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding (Li et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1167.pdf
Checklist: 2026.acl-long.1167.checklist.pdf
