Multimodal Generation with Consistency Transferring

Junxiang Qiu; Jinda Lu; Shuo Wang

doi:10.18653/v1/2025.findings-naacl.31

Abstract

Multimodal content generation has become an area of considerable interest. However, existing methods are hindered by limitations related to model constraints and training strategies: (1) most current approaches rely on training models from scratch, resulting in inefficient training when these models are extended; (2) there is a lack of constraints on adjacent steps within the models, leading to slow sampling and poor generation stability across various sampling methods. To address these issues, we introduce Multimodal Generation with Consistency Transferring (MGCT). The method introduces two key improvements: (1) a Model Consistency Transferring (MCT) strategy to acquire low-cost prior knowledge, increasing training efficiency and avoiding error accumulation; (2) a Layer Consistency Transferring (LCT) strategy between adjacent steps, enhancing denoising capabilities at each step and improving model stability across various generation methods. These strategies ensure the consistency of jointly generated multimodal content and improve training efficiency. Experiments show that the algorithm enhances the model’s ability to capture actions and depict backgrounds more effectively. On both the AIST++ and Landscape datasets, it improves video generation speed by approximately 40% and quality by about 39.3%, while also achieving a slight 3% improvement in audio quality over the baseline.

Anthology ID: 2025.findings-naacl.31
Volume: Findings of the Association for Computational Linguistics: NAACL 2025
Month: April
Year: 2025
Address: Albuquerque, New Mexico
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 504–513
URL: https://aclanthology.org/2025.findings-naacl.31/
DOI: 10.18653/v1/2025.findings-naacl.31
Cite (ACL): Junxiang Qiu, Jinda Lu, and Shuo Wang. 2025. Multimodal Generation with Consistency Transferring. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 504–513, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal): Multimodal Generation with Consistency Transferring (Qiu et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-naacl.31.pdf
