Hui Zhao


2024

Benchmarking Hallucination in Large Language Models Based on Unanswerable Math Word Problem
YuHong Sun | Zhangyue Yin | Qipeng Guo | Jiawen Wu | Xipeng Qiu | Hui Zhao
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, in ambiguous contexts they are prone to producing unreliable conjectures, a phenomenon known as hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), which comprises 5200 questions across five categories. We develop an evaluation methodology combining text similarity and mathematical expression detection to determine whether an LLM considers a question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training significantly enhance the model’s ability to avoid hallucination. We show that utilizing MWPs is a reliable and effective approach to assessing hallucination. Our code and data are available at https://github.com/Yuki-Asuuna/UMWP.
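
As a rough illustration of how text similarity and mathematical expression detection might be combined to judge whether a response treats a question as unanswerable, here is a minimal sketch. It is not the authors' implementation (see the linked repository for that). The refusal templates, the 0.8 threshold, the equation pattern, and the function names are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical refusal templates (assumption; not taken from the paper).
REFUSAL_TEMPLATES = [
    "the question cannot be answered",
    "there is not enough information to solve the problem",
    "the answer is unknown",
]

# Matches a simple arithmetic equation such as "3 + 4 = 7" or "12*5=60".
EQUATION = re.compile(
    r"\d+(?:\.\d+)?\s*[-+*/]\s*\d+(?:\.\d+)?\s*=\s*\d+(?:\.\d+)?"
)


def max_similarity(sentence, templates=REFUSAL_TEMPLATES):
    """Return the highest character-level similarity to any refusal template."""
    s = sentence.lower().strip()
    return max(SequenceMatcher(None, s, t).ratio() for t in templates)


def judged_unanswerable(response, threshold=0.8):
    """Heuristically decide whether a response declares the question unanswerable.

    A response that contains a worked equation is treated as an attempted
    answer. Otherwise, the response counts as a refusal if any of its
    sentences is similar enough to a refusal template.
    """
    if EQUATION.search(response):
        return False
    sentences = [s for s in re.split(r"[.!?]\s*", response) if s.strip()]
    return any(max_similarity(s) >= threshold for s in sentences)
```

Under these assumptions, a response that works through a calculation is counted as an attempted answer even if it also contains cautious wording, which is one simple way to let the expression check take priority over the similarity check.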

2023

基于语义任务辅助的方面情感分析(Semantic Task-assisted Aspect-based Sentiment Analysis)
Zhaozhen Wu (吴肇真) | Hui Zhao (赵晖) | Tiquan Gu (谷体泉) | Guoyi Cao (曹国义)
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

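Aspect-Based Sentiment Analysis (ABSA) aims to determine the fine-grained sentiment polarity of different aspects in a sentence. Effectively capturing the semantic information of the sentence is the key to this task. Most existing classification methods introduce external knowledge and design complex modules to understand sentence semantics, while overlooking the noise from external parsers and the added complexity of the model. In this paper, we propose a multi-task learning network based on semantic understanding, which aims to capture sentence semantics from the raw corpus through multi-task learning. Taking a multi-task perspective, we propose two semantic auxiliary tasks on the original dataset with shared parameters: an aspect-context order prediction task and an aspect-context syntactic dependency prediction task. The auxiliary tasks are then trained jointly with the original aspect sentiment classification task to obtain an encoder with enhanced semantic understanding, and this encoder is finally used for aspect sentiment classification. Experimental results show that the model performs well in both accuracy and Macro-F1 on three major public datasets: Rest14, Lap14, and Twitter.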