SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models

Gyubeum Lim; Yemo Koo; Vijay Krishna Madisetti

Abstract

Understanding long-context visual information remains a fundamental challenge for vision-language models, particularly in agentic tasks such as GUI control and web navigation. While web pages and GUI environments are inherently structured documents, current VLMs typically neglect decision-oriented document understanding in their training objectives. Existing approaches primarily extend visual embeddings to process long, high-resolution inputs, but these methods are memory-intensive and impractical for locally deployable solutions. To address these issues, we propose SCoPE VLM, a document navigation expert that leverages a novel Chain of Scroll mechanism to selectively and recursively navigate documents, focusing exclusively on relevant segments. We introduce a dedicated data generation pipeline to construct informative Chain of Scroll trajectories and Episodic Group Relative Policy Optimization, a tailored reinforcement learning method to bridge the gap between training and inference. Our method substantially reduces memory usage and effectively models human-like reading behaviors. To the best of our knowledge, SCoPE VLM is the first framework to explicitly model agentic reading patterns in multi-page document question answering, advancing the capabilities of multimodal agents.

Anthology ID: 2026.eacl-long.6
Volume: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Vera Demberg, Kentaro Inui, Lluís Marquez
Venue: EACL
Publisher: Association for Computational Linguistics
Pages: 95–140
URL: https://aclanthology.org/2026.eacl-long.6/
Cite (ACL): Gyubeum Lim, Yemo Koo, and Vijay Krishna Madisetti. 2026. SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 95–140, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): SCoPE VLM: Selective Context Processing for Efficient Document Navigation in Vision-Language Models (Lim et al., EACL 2026)
PDF: https://aclanthology.org/2026.eacl-long.6.pdf
Checklist: 2026.eacl-long.6.checklist.pdf
