Zheng Xie

2026

Assessing how well a large language model (LLM) understands human, rather than merely text, remains an open challenge.To bridge the gap, we introduce Sentient Agent as a Judge(SAGE), an automated evaluation framework that measures an LLM’s higher-order social cognition.SAGE instantiates a “Sentient Agent” – an LLM-powered agent that simulates human-like emotional changes and inner thoughts to provide a more realistic evaluation of the tested model in multi-turn conversations.At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts.Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. Human evaluation further demonstrates 85.3% consistency between the agent’s emotional reasoning and human judgments. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4×) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g. Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.

pdf bib abs

Bridging Distance and Spectral Positional Encodings via Anchor-Based Diffusion Geometry Approximation
Yan Zimo | Zheng Xie | Runfan Duan | Chang Liu | Wumei Du
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Molecular graph learning benefits from positional signals that capture both local neighborhoods and global topology. Two widely used families are spectral encodings derived from Laplacian or diffusion operators and anchor-based distance encodings built from shortest-path information, but the relationship between them is still not well understood. In this paper, we study when anchor-based distance encodings can approximate diffusion geometry. Under a random r-regular graph model, we derive an explicit trilateration map that reconstructs truncated diffusion coordinates from transformed anchor distances and anchor spectral positions, together with pointwise and Frobenius-gap guarantees. On DrugBank with a shared GNP-based DDI backbone, anchor-distance Nyström accurately recovers diffusion geometry, and both DE and LapPE outperform models without positional encodings, with LapPE showing slightly more consistent performance.

2025

pdf bib abs

Reasoning based on chains of thought (CoTs) enables large language models (LLMs) to solve problems by thinking step by step and becomes the mainstream solution for Question-Answering (QA) tasks. Knowledge graph (KG)-enhanced CoT technology helps correct factual errors or predict reasoning direction. Existing KG-enhanced methods find relevant information in KGs “within” each reasoning step of CoTs. However, in some cases, logical connections “between” reasoning steps may be missing or wrong, leading to broken reasoning chains and wrong reasoning direction. To solve the above problem, we argue that the errors between reasoning steps require collaborative verification and mining of multiple triplets and multiple paths in KG. So we propose the DCMKC (Dual Consistency Matching for KG and CoT) method, aiming to maintain semantic and structural consistency between KG and CoT. The main idea is to convert CoTs and KGs into two granularity-aligned graphs, transforming multi-hop reasoning and KG matching into iterative matching and modification of two graphs. In each iteration, DCMKC matches the KG reasoning chains with CoTs based on semantic similarity and judges the structural consistency between them. Then it modifies CoTs using the matched chains. After iterations, the CoTs and KG reasoning chains reach high semantic and structural consistency, which is theoretically and experimentally demonstrated by kernel and spectral methods. The two kinds of chains are then used to generate the final answers. Experimental results show that our method outperforms baselines on multiple datasets, especially on multi-answer questions, with up to 5.1% improvement over the baseline.