Imanol Schlag
2026
Learning Vision-Language Alignment in Unified LLMs with 24 Text Tokens per Image
Nicola Irmiger | Yixuan Xu | Raphael Kreft | Aram Davtyan | Manuel Kaufmann | Imanol Schlag
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
We explore how to adapt a pre-trained large language model to understand and generate both visual and textual information. We use an image tokenizer to compress images into discrete tokens, and train the model using the next-token prediction paradigm with the standard cross-entropy loss. A two-stage pre-training approach is applied, first training on image-only data and then on a small amount of image-text data. We evaluate how different image-text token mixing ratios during continual pre-training affect the model’s ability to retain language skills while learning visual representations. The resulting model shows promising signs of flexible multimodal understanding, bridging vision and language in a single pre-trained model.
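The objective described here is ordinary next-token prediction over a single interleaved stream of text and discrete image tokens. Below is a minimal sketch of that setup; the model, vocabulary sizes, and begin/end-of-image markers are hypothetical stand-ins, not the paper's actual components or configuration.

```python
# Minimal sketch of mixed-modal next-token training (illustrative assumptions).
import torch
import torch.nn.functional as F

TEXT_VOCAB = 32_000                     # assumed text vocabulary size
IMAGE_VOCAB = 8_192                     # assumed image-tokenizer codebook size
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1   # hypothetical begin/end-of-image markers

def build_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Interleave text tokens and discrete image tokens into one sequence.

    Image codes are shifted past the text vocabulary (and the two markers)
    so both modalities share a single embedding table.
    """
    assert image_codes.max() < IMAGE_VOCAB
    image_ids = image_codes + TEXT_VOCAB + 2
    return torch.cat([text_ids, torch.tensor([BOI]), image_ids, torch.tensor([EOI])])

def next_token_loss(model, seq: torch.Tensor) -> torch.Tensor:
    """Standard cross-entropy next-token prediction over the full sequence."""
    logits = model(seq[:-1].unsqueeze(0))           # (1, T-1, vocab)
    return F.cross_entropy(logits.squeeze(0), seq[1:])
```

Because both modalities live in one vocabulary, varying the image-text mixing ratio during continual pre-training amounts to changing how these sequences are sampled into batches.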
2024
On the Effect of (Near) Duplicate Subwords in Language Modelling
Anton Schäfer | Thomas Hofmann | Imanol Schlag | Tiago Pimentel
Findings of the Association for Computational Linguistics: ACL 2024
Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords, which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to less efficient LM training, because it removes character-level information, thereby making it harder to generalise across similar subwords, such as *now* and *Now*. We refer to such subwords as **near duplicates**. In this paper, we study the impact of near duplicate subwords on LM training efficiency. First, we design an experiment that gives us an upper bound on how much we should expect a model to improve if it could perfectly generalise across near duplicates. We do this by duplicating each token in our LM's vocabulary, creating perfectly equivalent classes of subwords. Experimentally, we find that LMs need roughly 17% more data when trained in a fully duplicated setting. Second, we investigate the impact of naturally occurring near duplicates on LMs. Here, we see that deduplicating them considerably hurts LM performance, but that this loss in performance can be easily mitigated.
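The full-duplication experiment is straightforward to reproduce. The sketch below, with illustrative names and an assumed uniform 50/50 routing, remaps each token occurrence to one of two perfectly equivalent copies, doubling the effective vocabulary.

```python
# Minimal sketch of the full-duplication setting (illustrative assumptions).
import torch

def duplicate_vocabulary(token_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    """Map token id i to either i or i + vocab_size with equal probability.

    The resulting stream uses a vocabulary of size 2 * vocab_size in which
    ids i and i + vocab_size are perfectly equivalent by construction.
    """
    use_copy = torch.randint(0, 2, token_ids.shape, dtype=token_ids.dtype)
    return token_ids + use_copy * vocab_size

ids = torch.tensor([5, 17, 5, 42])
dup = duplicate_vocabulary(ids, vocab_size=32_000)  # e.g. [5, 32017, 5, 32042]
```

Training on the remapped stream versus the original one isolates how much performance is lost when the model must learn each equivalence class twice.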
Swiss AI Initiative - Collecting Large Amounts of High-Quality Data for Training Large Language Models
Jan Deriu | Maud Ehrmann | Emanuela Boros | Maximilian Böther | Christiane Sibille | Ihor Protsenko | Marta Brucka | Imanol Schlag | Elliott Ash
Proceedings of the 9th edition of the Swiss Text Analytics Conference