Shengyuan Wang


2025

Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
Shengyuan Wang | Jie Feng | Tianhui Liu | Dan Pei | Yong Li
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.
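For orientation, the following is a minimal sketch of a Kahneman-Tversky-style optimization objective on binary desirable/undesirable labels, as one might apply it to factuality data. It illustrates the general KTO formulation only, not the paper's dynamic factuality aligning method; the function and variable names are hypothetical.

```python
# Minimal KTO-style objective sketch (hypothetical names, not the paper's code).
# Assumes per-response log-probabilities under the policy and a frozen reference
# model have already been computed; `is_desirable` is a boolean tensor marking
# responses judged factual (True) vs. hallucinated (False).
import torch

def kto_loss(policy_logps, ref_logps, is_desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    # Log-ratio ("implicit reward") of the policy relative to the reference model.
    rewards = policy_logps - ref_logps

    # Batch-level reference point: a rough KL(policy || reference) estimate,
    # clamped at zero and detached so it only shifts the value function.
    z_ref = torch.clamp(rewards.mean().detach(), min=0.0)

    # Desirable outputs are pulled above the reference point,
    # undesirable (hallucinated) outputs are pushed below it.
    desirable = lambda_d * (1 - torch.sigmoid(beta * (rewards - z_ref)))
    undesirable = lambda_u * (1 - torch.sigmoid(beta * (z_ref - rewards)))

    return torch.where(is_desirable, desirable, undesirable).mean()
```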

CLEAR: A Clinically Grounded Tabular Framework for Radiology Report Evaluation
Yuyang Jiang | Chacha Chen | Shengyuan Wang | Feng Li | Zecong Tang | Benjamin M. Mervak | Lydia Chelala | Christopher M Straus | Reve Chahine | Samuel G. Armato III | Chenhao Tan
Findings of the Association for Computational Linguistics: EMNLP 2025

Existing metrics often lack the granularity and interpretability to capture nuanced clinical differences between candidate and ground-truth radiology reports, resulting in suboptimal evaluation. We introduce a **Cl**inically grounded tabular framework with **E**xpert-curated labels and **A**ttribute-level comparison for **R**adiology report evaluation (**CLEAR**). CLEAR not only examines whether a report can accurately identify the presence or absence of medical conditions, but it also assesses whether the report can precisely describe each positively identified condition across five key attributes: first occurrence, change, severity, descriptive location, and recommendation. Compared with prior works, CLEAR’s multi-dimensional, attribute-level outputs enable a more comprehensive and clinically interpretable evaluation of report quality. Additionally, to measure the clinical alignment of CLEAR, we collaborated with five board-certified radiologists to develop **CLEAR-Bench**, a dataset of 100 chest radiograph reports from MIMIC-CXR, annotated across 6 curated attributes and 13 CheXpert conditions. Our experiments demonstrated that CLEAR achieves high accuracy in extracting clinical attributes and provides automated metrics that are strongly aligned with clinical judgment.
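As a rough illustration of what an attribute-level comparison can look like, the sketch below scores a candidate report against a reference once condition/attribute labels have been extracted into flat dictionaries. The five attribute names follow the abstract; the data structures, the extraction step, and the scoring rule are assumptions, not CLEAR's implementation.

```python
# Hypothetical attribute-level agreement between two extracted report tables.
# Each report is a dict: {condition_name: {attribute_name: label}}, covering
# only the conditions that the report identifies as positive.
from typing import Dict

ATTRIBUTES = ["first occurrence", "change", "severity",
              "descriptive location", "recommendation"]

def attribute_agreement(candidate: Dict[str, Dict[str, str]],
                        reference: Dict[str, Dict[str, str]]) -> Dict[str, float]:
    """Fraction of shared positive conditions whose attribute labels match."""
    shared = set(candidate) & set(reference)
    scores = {}
    for attr in ATTRIBUTES:
        if not shared:
            scores[attr] = 0.0
            continue
        matches = sum(candidate[c].get(attr) == reference[c].get(attr)
                      for c in shared)
        scores[attr] = matches / len(shared)
    return scores
```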

2024

AlignBench: Benchmarking Chinese Alignment of Large Language Models
Xiao Liu | Xuanyu Lei | Shengyuan Wang | Yue Huang | Andrew Feng | Bosi Wen | Jiale Cheng | Pei Ke | Yifan Xu | Weng Lam Tam | Xiaohan Zhang | Lichao Sun | Xiaotao Gu | Hongning Wang | Jing Zhang | Minlie Huang | Yuxiao Dong | Jie Tang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario-grounded, open-ended, challenging, and automatic evaluations tailored for alignment. To fill this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline containing 8 main categories, 683 real-scenario-rooted queries, and corresponding human-verified references. To ensure the references’ correctness, each knowledge-intensive query is accompanied by evidence collected from reliable webpages (including the URL and quotation) by our annotators. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (CITATION) with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability. All evaluation code and data are publicly available at https://github.com/THUDM/AlignBench.
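The sketch below shows the general shape of a multi-dimensional LLM-as-Judge call with chain-of-thought reasoning followed by structured ratings. The prompt wording, scoring dimensions, and the `query_llm` helper are hypothetical placeholders, not AlignBench's actual rule-calibrated prompts or code.

```python
# Hypothetical multi-dimensional judge call: reason first, then emit JSON scores.
import json

DIMENSIONS = ["factual correctness", "user satisfaction", "clarity", "completeness"]

def judge(query: str, reference: str, answer: str, query_llm) -> dict:
    prompt = (
        "You are a strict evaluator. Compare the answer against the reference.\n"
        f"Question: {query}\nReference: {reference}\nAnswer: {answer}\n"
        "Reason step by step about each dimension, then output a JSON object with "
        f"integer scores (1-10) for {', '.join(DIMENSIONS)} and a 'final' rating."
    )
    completion = query_llm(prompt)  # assumed to return the judge's raw text
    # Keep only the trailing JSON object; the text before it is the explanation.
    start, end = completion.find("{"), completion.rfind("}") + 1
    return json.loads(completion[start:end])
```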

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
Pei Ke | Bosi Wen | Andrew Feng | Xiao Liu | Xuanyu Lei | Jiale Cheng | Shengyuan Wang | Aohan Zeng | Yuxiao Dong | Hongning Wang | Jie Tang | Minlie Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4’s direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison, especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with/without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines, and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
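To make the two-stage idea concrete, here is a minimal sketch of a critique-data pipeline in the spirit of Eval-Instruct: draft a pointwise critique against a pseudo reference, then revise it along several prompting paths and merge the results. The prompt texts and the `query_llm` helper are assumptions for illustration, not the authors' pipeline.

```python
# Hypothetical two-stage critique construction (draft, then multi-path revision).
def build_critique(question: str, response: str, pseudo_reference: str,
                   query_llm, num_paths: int = 3) -> str:
    # Stage 1: pointwise grading critique written against a pseudo reference.
    draft = query_llm(
        "Grade the response from 1 to 10 against the reference and explain why.\n"
        f"Question: {question}\nReference: {pseudo_reference}\nResponse: {response}"
    )
    # Stage 2: multi-path revision so the critique stays informative without a reference.
    revisions = [
        query_llm(
            f"Revision path {i + 1}: rewrite the critique below so it remains "
            f"informative even when no reference is available.\n{draft}"
        )
        for i in range(num_paths)
    ]
    return query_llm("Merge these critiques into one concise, informative critique:\n"
                     + "\n---\n".join(revisions))
```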