Tianrui Guan
2025
Large Language Models and Causal Inference in Collaboration: A Comprehensive Survey
Xiaoyu Liu | Paiheng Xu | Junda Wu | Jiaxin Yuan | Yifan Yang | Yuhang Zhou | Fuxiao Liu | Tianrui Guan | Haoliang Wang | Tong Yu | Julian McAuley | Wei Ai | Furong Huang
Findings of the Association for Computational Linguistics: NAACL 2025
Causal inference has demonstrated significant potential to enhance Natural Language Processing (NLP) models in areas such as predictive accuracy, fairness, robustness, and explainability by capturing causal relationships among variables. The rise of generative Large Language Models (LLMs) has greatly impacted various language processing tasks. This survey focuses on research that evaluates or improves LLMs from a causal view in the following areas: reasoning capacity, fairness and safety issues, explainability, and handling multimodality. Meanwhile, LLMs can assist in causal inference tasks, such as causal relationship discovery and causal effect estimation, by leveraging their generation ability and knowledge learned during pre-training. This review explores the interplay between causal inference frameworks and LLMs from both perspectives, emphasizing their collective potential to further the development of more advanced and robust artificial intelligence systems.
2024
AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
Xiyang Wu | Tianrui Guan | Dianqi Li | Shuaiyi Huang | Xiaoyu Liu | Xijun Wang | Ruiqi Xian | Abhinav Shrivastava | Furong Huang | Jordan Lee Boyd-Graber | Tianyi Zhou | Dinesh Manocha
Findings of the Association for Computational Linguistics: EMNLP 2024
Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well. Additionally, fine-tuning on these examples could undermine their validity. To address this, we aim to scale up the number of cases through an automated approach, reducing human bias in crafting such corner cases. This motivates the development of AutoHallusion, the first automated benchmark generation approach that employs several key strategies to create a diverse range of hallucination examples. Our generated visual-question pairs pose significant challenges to LVLMs, requiring them to overcome contextual biases and distractions to arrive at correct answers. AutoHallusion enables us to create new benchmarks at minimal cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show success rates of 97.7% and 98.7% for hallucination induction on the synthetic and real-world datasets of AutoHallusion, respectively, paving the way for a long battle against hallucinations. The codebase and data can be accessed at https://github.com/wuxiyang1996/AutoHallusion
Co-authors
- Furong Huang 2
- Xiaoyu Liu 2
- Wei Ai 1
- Jordan Lee Boyd-Graber 1
- Shuaiyi Huang 1