Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander Toshev


Abstract
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, analogous to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics from text tokens, rendering pre-trained language models no more effective at modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language-model pre-training data, which causes catastrophic degradation of language models' capabilities.
Anthology ID: 2024.emnlp-main.75
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1281–1287
URL: https://aclanthology.org/2024.emnlp-main.75
Cite (ACL): Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander Toshev. 2024. Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1281–1287, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (Zhang et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.75.pdf