Fusion of Object-Centric and Linguistic Features for Domain-Adapted Multimodal Learning

Jordan Konstantinov Kralev

Abstract

Modern multimodal systems often struggle to link domain-specific visual content with textual descriptions, especially when object recognition is limited to general categories (e.g., COCO classes) and is not tailored to language models. In this paper, we present a novel framework that integrates a domain-adapted Detectron2 model into predefined models via a trainable projection layer, enabling precise cross-modal adaptation for specialised domains. Our approach extends Detectron2’s recognition capabilities to new categories by fine-tuning on multi-domain datasets, while a lightweight linear projection layer maps region-based visual features into the model’s embedding space without retraining the model in full. We evaluate the framework on domain-specific image captioning. The approach provides a scalable design for combining domain-specific visual recognition with language inference, with applications in domains that require fine-grained multimodal understanding.
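
The projection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the dimensions, the function names, and the choice to prepend projected region vectors to the text token embeddings are all assumptions made for illustration, and plain Python lists stand in for the tensors and trainable layer a deep learning framework would provide.

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Randomly initialise weights W (out_dim x in_dim) and a zero bias b."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    W = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
         for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b

def project_regions(regions, W, b):
    """Apply the linear map y = W x + b to each region feature vector x."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + bias
         for row, bias in zip(W, b)]
        for feat in regions
    ]

def build_input_sequence(regions, token_embeddings, W, b):
    """Prepend projected region vectors to the token embeddings (assumed fusion)."""
    return project_regions(regions, W, b) + token_embeddings

# Hypothetical sizes: 3 detected regions with 8-dim features, 4-dim embeddings.
REGION_DIM, EMBED_DIM = 8, 4
W, b = make_projection(REGION_DIM, EMBED_DIM)
regions = [[float(i + j) for j in range(REGION_DIM)] for i in range(3)]
tokens = [[0.0] * EMBED_DIM for _ in range(5)]  # stand-in for 5 text tokens
sequence = build_input_sequence(regions, tokens, W, b)
print(len(sequence), len(sequence[0]))  # prints: 8 4
```

In this sketch only `W` and `b` would be trained, which reflects the abstract's point that the language model itself does not need to be retrained in full.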

Anthology ID: 2025.ranlp-1.69
Volume: Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era
Month: September
Year: 2025
Address: Varna, Bulgaria
Editors: Galia Angelova, Maria Kunilovskaya, Marie Escribe, Ruslan Mitkov
Venue: RANLP
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Pages: 587–594
URL: https://aclanthology.org/2025.ranlp-1.69/
Cite (ACL): Jordan Konstantinov Kralev. 2025. Fusion of Object-Centric and Linguistic Features for Domain-Adapted Multimodal Learning. In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era, pages 587–594, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.
Cite (Informal): Fusion of Object-Centric and Linguistic Features for Domain-Adapted Multimodal Learning (Kralev, RANLP 2025)
PDF: https://aclanthology.org/2025.ranlp-1.69.pdf
