Wenbo Hu


2025

pdf bib
Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models
Zijun Chen | Wenbo Hu | Guande He | Zhijie Deng | ZHeng ZHang | Richang Hong
Proceedings of the 31st International Conference on Computational Linguistics

Multimodal large language models (MLLMs) combine visual and textual data for tasks like image captioning and visual question answering. Proper uncertainty calibration is crucial but challenging for reliable use in areas like healthcare and autonomous driving. This paper investigates several MLLMs, focusing on their calibration across various scenarios, including before and after visual fine-tuning as well as before and after multimodal training of the base LLMs. We observed miscalibration in their performance, and at the same time, no significant differences in calibration across these scenarios. We also highlight differences in uncertainty between text and the impact of the integration of these two types of information in uncertainty. To better understand MLLMs’ miscalibration and their ability to self-assess uncertainty, we developed the IDK (I don’t know) dataset, which is key for evaluating how they handle unknowns. Our findings reveal that MLLMs tend to give answers rather than admit uncertainty, but this self-assessment improves with prompt adjustments. Finally, to calibrate MLLMs and enhance model reliability, we propose techniques such as temperature scaling and iterative prompt optimization. Our results provide insights into improving MLLMs for effective and responsible deployment in multimodal applications.

2024

pdf bib
VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Haoyi Qiu | Wenbo Hu | Zi-Yi Dou | Nanyun Peng
Findings of the Association for Computational Linguistics: ACL 2024

Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs, undermining their reliability. A comprehensive quantitative evaluation is necessary to identify and understand the extent of hallucinations in these models. However, existing benchmarks are often limited in scope, focusing mainly on object hallucinations. Furthermore, current evaluation methods struggle to effectively address the subtle semantic distinctions between model outputs and reference data, as well as the balance between hallucination and informativeness. To address these issues, we introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases. Moreover, we propose a large language model (LLM)-based two-stage evaluation framework that generalizes the popular CHAIR metric and incorporates both faithfulness and coverage into the evaluation. Experiments on 10 established LVLMs demonstrate that our evaluation metric is more comprehensive and better correlated with humans than existing work when evaluating on our challenging human-annotated benchmark dataset. Our work also highlights the critical balance between faithfulness and coverage of model outputs, and encourages future works to address hallucinations in LVLMs while keeping their outputs informative.