Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Yuhui Zhang; Brandon McKinzie; Zhe Gan; Vaishaal Shankar; Alexander T Toshev

doi:10.18653/v1/2024.emnlp-main.75

Abstract

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models’ capability.

Anthology ID: 2024.emnlp-main.75
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 1281–1287
URL: https://aclanthology.org/2024.emnlp-main.75/
DOI: 10.18653/v1/2024.emnlp-main.75
Cite (ACL): Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, and Alexander T Toshev. 2024. Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1281–1287, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (Zhang et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.75.pdf