Manuel Kaufmann
2026
Learning Vision-Language Alignment in Unified LLMs with 24 Text Tokens per Image
Nicola Irmiger | Yixuan Xu | Raphael Kreft | Aram Davtyan | Manuel Kaufmann | Imanol Schlag
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
We explore how to adapt a pre-trained large language model to understand and generate both visual and textual information. We use an image tokenizer to compress images into discrete tokens, and train the model with the standard next-token prediction objective under cross-entropy loss. We apply a two-stage pre-training approach: first training on image-only data, then on a small amount of image-text data. We evaluate how different image-text token mixing ratios during continual pre-training affect the model's ability to retain language skills while learning visual representations. The resulting model shows promising signs of flexible multimodal understanding, bridging vision and language in a single pre-trained model.
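To make the described objective concrete, here is a minimal sketch: each image is compressed into a short sequence of discrete tokens (24 per image, per the title), concatenated with text tokens, and the joint sequence is trained with plain next-token prediction under cross-entropy. This is an illustration under stated assumptions, not the paper's implementation: the decoder, vocabulary size, and names (`ToyDecoder`, `step_loss`) are invented, and the image tokenizer is stubbed with random token ids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 32000      # assumed shared vocabulary covering text and image tokens
IMG_TOKENS = 24    # discrete tokens per image, as in the title

class ToyDecoder(nn.Module):
    """Stand-in for the pre-trained LLM: token embeddings + causal Transformer."""
    def __init__(self, vocab, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        # Causal mask so position t only attends to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.encoder(self.embed(ids), mask=mask)
        return self.head(h)

def step_loss(model, image_ids, text_ids):
    """Next-token prediction over the concatenated image+text sequence."""
    seq = torch.cat([image_ids, text_ids], dim=1)   # (B, 24 + T)
    logits = model(seq[:, :-1])                     # predict token t+1 from tokens <= t
    return F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))

# Hypothetical batch: a real image tokenizer would produce `image_ids`;
# random ids stand in here so the sketch runs end to end.
model = ToyDecoder(VOCAB)
image_ids = torch.randint(0, VOCAB, (2, IMG_TOKENS))
text_ids = torch.randint(0, VOCAB, (2, 48))
step_loss(model, image_ids, text_ids).backward()
```

Under this reading, the two-stage recipe leaves the loss unchanged and only shifts the data mixture: stage one feeds image-only sequences, and stage two interleaves image-text pairs at a chosen token mixing ratio.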