EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs

Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, Xiao-Ming Wu


Abstract
We present EasyGen, an efficient model designed to enhance multimodal understanding and generation by harnessing the capabilities of diffusion models and large language models (LLMs). Unlike existing multimodal models that predominantly depend on encoders like CLIP or ImageBind and need large amounts of training data to bridge modalities, EasyGen leverages BiDiffuser, a bidirectional conditional diffusion model, to foster more efficient modality interactions. EasyGen achieves text generation by training a projection layer linking BiDiffuser and an LLM, and facilitates image generation by training an adapter to align the LLM’s text space with BiDiffuser’s image space. Comprehensive quantitative and qualitative experiments show that EasyGen excels in data-efficient training, high-quality image generation, and extendibility, effectively addressing the challenges in multimodal generation.
Anthology ID:
2024.acl-long.74
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1351–1370
URL:
https://aclanthology.org/2024.acl-long.74/
DOI:
10.18653/v1/2024.acl-long.74
Cite (ACL):
Xiangyu Zhao, Bo Liu, Qijiong Liu, Guangyuan Shi, and Xiao-Ming Wu. 2024. EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1351–1370, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs (Zhao et al., ACL 2024)
PDF:
https://aclanthology.org/2024.acl-long.74.pdf