Efficient Architectures for High Resolution Vision-Language Models

Miguel Carvalho, Bruno Martins


Abstract
Vision-Language Models (VLMs) have recently advanced significantly. However, challenges persist in accurately recognizing fine details within high-resolution images, which limits performance in multiple tasks. This work introduces Pheye, a novel architecture that efficiently processes high-resolution images while training fewer parameters than similarly sized VLMs. Notably, Pheye achieves high efficiency while maintaining strong performance, particularly in tasks that demand fine-grained image understanding and/or the handling of scene text.
Anthology ID:
2025.coling-main.700
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
10520–10530
URL:
https://aclanthology.org/2025.coling-main.700/
Cite (ACL):
Miguel Carvalho and Bruno Martins. 2025. Efficient Architectures for High Resolution Vision-Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10520–10530, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Efficient Architectures for High Resolution Vision-Language Models (Carvalho & Martins, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.700.pdf