Switching Heads and Softening Tokens: Turnkey Solutions to Visually Grounded Document QA

Ximing Wen; Wenbo Li; Sudipta Paul; Yashas Malur Saidutta; Kalpa Gunaratna; Srinivas Chappidi

Abstract

Visually Grounded Document Question Answering often lacks robust, end-to-end solutions capable of handling complex, multi-answer queries without reliance on ad-hoc processing. In this work, we propose two turnkey LLM architectures to address this gap. We first introduce a single-head architecture where coordinates are represented as special tokens within the unified vocabulary. While structurally robust, this approach suffers from the limitations of discrete supervision; to address this, we propose a novel “softening token” method that enables differentiable Mean-Squared-Error loss over token probabilities. Although this significantly improves visual grounding, the spatial precision remains bound by discretization. Consequently, we propose a second solution: a dual-head architecture that alternates between text generation and regression-based bounding box prediction. This method offers high spatial precision via a regression head, further stabilized by our introduction of an Intersection-over-Union loss. Finally, by combining the single-head model’s structural robustness with the high precision of the dual-head model, we propose an ensemble method that yields significant performance gains beyond either individual component.
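
The abstract does not give the exact formulation of either loss, so the following is only a minimal sketch of one plausible reading. It assumes each coordinate token corresponds to a numeric bin value, that the "softened" prediction is the probability-weighted mean of those bin values, and that boxes are given as (x1, y1, x2, y2). All function names, the bin layout, and the box format are hypothetical illustrations, not the authors' implementation.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_coordinate(logits, bin_values):
    # Assumption: the soft prediction is the expectation of the bin
    # values under the token distribution, which is differentiable
    # with respect to the logits (unlike an argmax over tokens).
    probs = softmax(logits)
    return sum(p * v for p, v in zip(probs, bin_values))

def soft_token_mse(logits_per_coord, target_box, bin_values):
    # Mean squared error between the soft coordinates and the target box.
    preds = [expected_coordinate(l, bin_values) for l in logits_per_coord]
    return sum((p - t) ** 2 for p, t in zip(preds, target_box)) / len(target_box)

def iou_loss(pred_box, target_box):
    # 1 - IoU for axis-aligned boxes in (x1, y1, x2, y2) format.
    px1, py1, px2, py2 = pred_box
    tx1, ty1, tx2, ty2 = target_box
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    return 1.0 - inter / union if union > 0 else 1.0

# Hypothetical setup: five coordinate tokens covering [0, 1].
bin_values = [0.0, 0.25, 0.5, 0.75, 1.0]
uniform_logits = [0.0] * len(bin_values)
print(expected_coordinate(uniform_logits, bin_values))
print(iou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

In this reading, the expectation lets a regression-style loss pass gradients through the token probabilities, while the predicted value still stays within the range spanned by the discrete bins.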

Anthology ID: 2026.findings-acl.1818
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 36490–36503
URL: https://aclanthology.org/2026.findings-acl.1818/
Cite (ACL): Ximing Wen, Wenbo Li, Sudipta Paul, Yashas Malur Saidutta, Kalpa Gunaratna, and Srinivas Chappidi. 2026. Switching Heads and Softening Tokens: Turnkey Solutions to Visually Grounded Document QA. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36490–36503, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Switching Heads and Softening Tokens: Turnkey Solutions to Visually Grounded Document QA (Wen et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1818.pdf
Checklist: 2026.findings-acl.1818.checklist.pdf
