Andrew Bunner
2024
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
Roopal Garg | Andrea Burns | Burcu Karagol Ayan | Yonatan Bitton | Ceslee Montgomery | Yasumasa Onoe | Andrew Bunner | Ranjay Krishna | Jason Baldridge | Radu Soricut
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Despite the longstanding adage “an image is worth a thousand words,” generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image-text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT-4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest fidelity images, and boost compositional reasoning by up to 6% on ARO, SVO-Probes, and Winoground datasets. We release the IIW-Eval benchmark with human judgement labels, object and image-level annotations from our framework, and existing image caption datasets enriched via IIW-model.