Masked Diffusion Captioning for Visual Feature Learning

Chao Feng, Zihao Wei, Andrew Owens


Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image–caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token’s position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
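To make the training objective concrete, below is a minimal, self-contained PyTorch sketch of one MDC training step: tokens are masked at a randomly chosen ratio, and a bidirectional decoder conditioned on visual features predicts the original tokens at the masked positions. The architecture, vocabulary size, mask-token id, and per-batch mask ratio are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a masked diffusion captioning (MDC) training step.
# All module names, sizes, and token ids are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE, MASK_ID, PAD_ID = 30522, 103, 0   # assumed tokenizer conventions
D_MODEL, SEQ_LEN = 512, 32

class MDCModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoder (stand-in for a ViT/CNN backbone): image -> patch features.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, D_MODEL, kernel_size=16, stride=16),  # 224x224 -> 14x14 patches
            nn.Flatten(2),                                      # (B, D, 196)
        )
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, D_MODEL))
        # Bidirectional text decoder conditioned on visual features via cross-attention.
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, images, masked_tokens):
        vis = self.visual_encoder(images).transpose(1, 2)       # (B, 196, D)
        txt = self.token_emb(masked_tokens) + self.pos_emb      # (B, T, D)
        hid = self.decoder(tgt=txt, memory=vis)                 # no causal mask: bidirectional
        return self.lm_head(hid)                                # (B, T, V)

def mdc_training_step(model, images, tokens, optimizer):
    """One MDC step: mask tokens at a random ratio, predict the originals."""
    ratio = torch.empty(()).uniform_(0.05, 1.0)                 # one mask ratio per batch (assumption)
    maskable = tokens != PAD_ID
    mask = (torch.rand_like(tokens, dtype=torch.float) < ratio) & maskable
    masked_tokens = tokens.masked_fill(mask, MASK_ID)

    logits = model(images, masked_tokens)
    # Cross-entropy only on masked positions; every position contributes equally,
    # unlike the position-dependent signal of autoregressive captioning.
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy usage with random data in place of real image-caption pairs.
model = MDCModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(4, 3, 224, 224)
tokens = torch.randint(1, VOCAB_SIZE, (4, SEQ_LEN))
print(mdc_training_step(model, images, tokens, opt))
```

After training under such an objective, the caption decoder would be discarded and the visual encoder's features evaluated on downstream tasks, e.g. by linear probing as in the paper.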
Anthology ID: 2025.findings-emnlp.1376
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 25247–25263
URL: https://aclanthology.org/2025.findings-emnlp.1376/
Cite (ACL): Chao Feng, Zihao Wei, and Andrew Owens. 2025. Masked Diffusion Captioning for Visual Feature Learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 25247–25263, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Masked Diffusion Captioning for Visual Feature Learning (Feng et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.1376.pdf
Checklist: 2025.findings-emnlp.1376.checklist.pdf