2024
Knowledge-Centric Hallucination Detection
Xiangkun Hu | Dongyu Ru | Lin Qiu | Qipeng Guo | Tianhang Zhang | Yang Xu | Yun Luo | Pengfei Liu | Yue Zhang | Zheng Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have shown impressive capabilities but also a concerning tendency to hallucinate. This paper presents RefChecker, a framework that introduces claim-triplets to represent claims in LLM responses, aiming to detect fine-grained hallucinations. In RefChecker, an extractor generates claim-triplets from a response, which are then evaluated by a checker against a reference. We delineate three task settings: Zero, Noisy and Accurate Context, to reflect various real-world use cases. We curated a benchmark spanning various NLP tasks and annotated 11k claim-triplets from 2.1k responses by seven LLMs. RefChecker supports both proprietary and open-source models as the extractor and checker. Experiments demonstrate that claim-triplets enable superior hallucination detection compared to other granularities such as response-, sentence- and sub-sentence-level claims. RefChecker outperforms prior methods by 18.2 to 27.2 points on our benchmark, and its checking results are strongly aligned with human judgments.
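To make the extract-then-check pipeline concrete, here is a minimal Python sketch of claim-triplet checking. It assumes a generic `llm` callable (prompt in, text out); the prompt wording, the `extract_triplets`/`check_triplet` names, and the Entailment/Contradiction/Neutral labels are illustrative assumptions, not RefChecker's actual interface.

```python
# Minimal sketch of claim-triplet extraction and checking (illustrative only).
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)

def extract_triplets(llm: Callable[[str], str], response: str) -> List[Triplet]:
    """Ask an extractor model to decompose a response into claim-triplets."""
    prompt = (
        "Extract the factual claims in the text below as (subject, predicate, object) "
        "triplets, one per line, fields separated by ' | '.\n\n" + response
    )
    triplets = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            triplets.append((parts[0], parts[1], parts[2]))
    return triplets

def check_triplet(llm: Callable[[str], str], triplet: Triplet, reference: str) -> str:
    """Ask a checker model to label one claim against the reference."""
    prompt = (
        f"Reference:\n{reference}\n\n"
        f"Claim: {triplet[0]} {triplet[1]} {triplet[2]}\n"
        "Answer with one word: Entailment, Contradiction, or Neutral."
    )
    return llm(prompt).strip()

def detect_hallucinations(llm, response: str, reference: str):
    """Label every claim-triplet; Contradiction/Neutral labels flag hallucinated claims."""
    return [(t, check_triplet(llm, t, reference)) for t in extract_triplets(llm, response)]
```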
RepEval: Effective Text Evaluation with LLM Representation
Shuqian Sheng | Yi Xu | Tianhang Zhang | Zanwei Shen | Luoyi Fu | Jiaxin Ding | Lei Zhou | Xiaoying Gan | Xinbing Wang | Chenghu Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text evaluation are often tailored to specific scenarios, while LLM-based evaluation metrics are costly, requiring fine-tuning or relying heavily on the generation capabilities of LLMs. Moreover, previous LLM-based metrics ignore the fact that, within the space of LLM representations, there exist direction vectors that indicate the estimation of text quality. To this end, we introduce RepEval, a metric that leverages the projection of LLM representations for evaluation. Through simple prompt modifications, RepEval can easily transition to various tasks, requiring only minimal sample pairs for direction vector construction. Results on fourteen datasets across two evaluation tasks demonstrate the high effectiveness of our method, which exhibits a higher correlation with human judgments than previous methods, even in complex evaluation scenarios involving pairwise selection under nuanced aspects. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
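The core idea, projecting representations onto a quality direction estimated from a few sample pairs, can be sketched as below. The `embed` function (text to a fixed-size LLM hidden-state vector) and the simple mean-difference construction are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch: score text by its projection onto a "quality" direction.
import numpy as np

def build_direction(embed, good_texts, bad_texts):
    """Estimate a quality direction from a few pairs of good/bad examples."""
    good = np.mean([embed(t) for t in good_texts], axis=0)
    bad = np.mean([embed(t) for t in bad_texts], axis=0)
    direction = good - bad
    return direction / np.linalg.norm(direction)

def rep_score(embed, text, direction):
    """A larger projection indicates higher estimated text quality."""
    return float(np.dot(embed(text), direction))
```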
ECON: On the Detection and Resolution of Evidence Conflicts
Cheng Jiayang | Chunkit Chan | Qianqian Zhuang | Lin Qiu | Tianhang Zhang | Tengxiao Liu | Yangqiu Song | Yue Zhang | Pengfei Liu | Zheng Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The rise of large language models (LLMs) has significantly influenced the quality of information in decision-making systems, leading to the prevalence of AI-generated content and challenges in detecting misinformation and managing conflicting information, or “inter-evidence conflicts.” This study introduces a method for generating diverse, validated evidence conflicts to simulate real-world misinformation scenarios. We evaluate conflict detection methods, including Natural Language Inference (NLI) models, factual consistency (FC) models, and LLMs, on these conflicts (RQ1) and analyze LLMs’ conflict resolution behaviors (RQ2). Our key findings include: (1) NLI and LLM models exhibit high precision in detecting answer conflicts, though weaker models suffer from low recall; (2) FC models struggle with lexically similar answer conflicts, while NLI and LLM models handle these better; and (3) stronger models like GPT-4 show robust performance, especially with nuanced conflicts. For conflict resolution, LLMs often favor one piece of conflicting evidence without justification and rely on internal knowledge if they have prior beliefs.
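As a simple illustration of one detector family studied here, the sketch below flags an evidence conflict when an NLI model predicts contradiction in either direction. `nli_predict` is a hypothetical wrapper returning a (label, score) pair; the label strings and threshold are assumptions, not the paper's exact setup.

```python
# Illustrative NLI-based evidence-conflict detector.
def has_conflict(nli_predict, evidence_a: str, evidence_b: str, threshold: float = 0.5) -> bool:
    """Flag a conflict if either direction is judged a contradiction."""
    for premise, hypothesis in ((evidence_a, evidence_b), (evidence_b, evidence_a)):
        label, score = nli_predict(premise, hypothesis)  # e.g. ("contradiction", 0.93)
        if label == "contradiction" and score >= threshold:
            return True
    return False
```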
SH2: Self-Highlighted Hesitation Helps You Decode More Truthfully
Jushi Kai | Tianhang Zhang | Hai Hu | Zhouhan Lin
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) demonstrate great performance in text generation. However, LLMs still suffer from hallucinations. In this work, we propose an inference-time method, Self-Highlighted Hesitation (SH2), to help LLMs decode more truthfully. SH2 is based on a simple, information-theoretic observation: for an LLM, tokens predicted with lower probabilities tend to be more informative than others. Our analysis shows that these low-confidence tokens are more likely to be closely related to factual information, such as nouns, proper nouns, and adjectives. Therefore, we propose to “highlight” the factual information by selecting key tokens with the lowest probabilities and concatenating them to the original context, thus forcing the model to repeatedly read and hesitate on these tokens before generation. During decoding, we also adopt contrastive decoding to emphasize the difference in output probabilities brought by the hesitation. Experimental results demonstrate that SH2, requiring no additional data or models, can effectively help LLMs elicit factual knowledge and distinguish hallucinated contexts by themselves. Significant and consistent improvements are achieved by SH2 for LLaMA-7b, LLaMA2-7b and Mistral-7b on various hallucination tasks.
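The mechanism can be sketched roughly as follows: pick the lowest-probability tokens from the input as a hesitation prefix and contrast the resulting next-token distribution with the original one. Here `next_token_logits` and `token_logprobs` stand in for model calls, and the contrastive formula is a simplified stand-in for the paper's exact decoding rule.

```python
# Rough sketch of self-highlighted hesitation with contrastive decoding.
import numpy as np

def select_key_tokens(context_tokens, token_logprobs, k=5):
    """Return the k input tokens the model assigned the lowest probability."""
    order = np.argsort(token_logprobs)  # ascending: least confident first
    return [context_tokens[i] for i in order[:k]]

def hesitant_logits(next_token_logits, context_tokens, token_logprobs, alpha=1.0, k=5):
    """Contrast predictions with and without the highlighted (hesitation) prefix."""
    key = select_key_tokens(context_tokens, token_logprobs, k)
    base = np.asarray(next_token_logits(context_tokens))       # original context only
    hes = np.asarray(next_token_logits(key + context_tokens))  # hesitation prefix + context
    return hes + alpha * (hes - base)                          # emphasize the difference
```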
2023
Enhancing Uncertainty-Based Hallucination Detection with Stronger Focus
Tianhang Zhang | Lin Qiu | Qipeng Guo | Cheng Deng | Yue Zhang | Zheng Zhang | Chenghu Zhou | Xinbing Wang | Luoyi Fu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have gained significant popularity for their impressive performance across diverse fields. However, LLMs are prone to hallucinate untruthful or nonsensical outputs that fail to meet user expectations in many real-world applications. Existing works for detecting hallucinations in LLMs either rely on external knowledge for reference retrieval or require sampling multiple responses from the LLM for consistency verification, making these methods costly and inefficient. In this paper, we propose a novel reference-free, uncertainty-based method for detecting hallucinations in LLMs. Our approach imitates human factuality checking by focusing on three aspects: 1) the most informative and important keywords in the given text; 2) the unreliable tokens in the historical context that may lead to a cascade of hallucinations; and 3) token properties such as token type and token frequency. Experimental results on relevant datasets demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance across all the evaluation metrics and eliminates the need for additional information.
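A stripped-down version of the keyword-focused uncertainty score might look like the sketch below; `token_probs` are the model's probabilities for its own generated tokens, and `is_keyword` is a hypothetical stand-in for the paper's keyword and token-property focus mechanisms.

```python
# Illustrative reference-free uncertainty score focused on informative tokens.
import math

def hallucination_score(tokens, token_probs, is_keyword):
    """Average negative log-probability over keyword tokens; higher means more suspect."""
    scores = [
        -math.log(p)
        for tok, p in zip(tokens, token_probs)
        if is_keyword(tok)  # focus only on informative keywords
    ]
    return sum(scores) / len(scores) if scores else 0.0
```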