Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
Kyungryul Back | Seongbeom Park | Milim Kim | Mincheol Kwon | SangHyeok Lee | Hyunyoung Lee | Junhee Cho | Seunghyun Park | Jinkyu Kim
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Vision-Language Models (LVLMs) have recently shown promising results on various multimodal tasks, even achieving human-comparable performance in certain cases. Nevertheless, LVLMs remain prone to hallucinations: they often rely heavily on a single modality or memorize training data without properly grounding their outputs. To address this, we propose a training-free, tri-layer contrastive decoding with watermarking, which proceeds in three steps: (1) select a mature layer and an amateur layer among the decoding layers, (2) identify a pivot layer using a watermark-related question to assess whether the layer is visually well-grounded, and (3) apply tri-layer contrastive decoding to generate the final output. Experiments on public benchmarks such as POPE, MME, and AMBER demonstrate that our method achieves state-of-the-art performance in reducing hallucinations in LVLMs and generates more visually grounded responses.
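To illustrate step (3), the sketch below shows one plausible way to combine per-layer next-token logits from a mature, an amateur, and a pivot layer. It is a minimal sketch, not the paper's implementation: the layer indices, the alpha/beta weights, and the exact combination rule are assumptions, following the standard contrastive-decoding form (amplify the mature layer relative to the amateur layer) with an added term for the visually grounded pivot layer.

```python
import torch
import torch.nn.functional as F

def tri_layer_contrastive_logits(layer_logits, mature_idx, amateur_idx, pivot_idx,
                                 alpha=0.5, beta=0.5):
    """Combine per-layer next-token logits with a tri-layer contrast.

    layer_logits: tensor of shape (num_layers, vocab_size), obtained by
    applying the LM head to each decoder layer's hidden state
    (an early-exit-style readout).

    NOTE: the weighting below is an assumed form for illustration;
    the paper's exact formulation may differ.
    """
    mature = F.log_softmax(layer_logits[mature_idx], dim=-1)
    amateur = F.log_softmax(layer_logits[amateur_idx], dim=-1)
    pivot = F.log_softmax(layer_logits[pivot_idx], dim=-1)

    # Contrastive term: keep what the mature layer predicts beyond the
    # amateur layer, then pull the distribution toward the pivot layer,
    # which was judged visually grounded via the watermark-related question.
    contrast = (1 + alpha) * mature - alpha * amateur
    return contrast + beta * pivot

# Hypothetical usage: 32 decoder layers, 32k-token vocabulary.
layer_logits = torch.randn(32, 32000)
final_logits = tri_layer_contrastive_logits(layer_logits,
                                            mature_idx=31, amateur_idx=2, pivot_idx=20)
next_token = final_logits.argmax(dim=-1)
```

In a full decoding loop, this combination would replace the final-layer logits at every generation step before sampling or greedy selection.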