Jiazheng Li - ACL Anthology

2026

SALT: Step-level Advantage Assignment for Long-horizon Agents via Trajectory Graph
Jiazheng Li | Yawei Wang | Qiaojing Yan | Yijun Tian | Zhichao Xu | Huan Song | Panpan Xu | Lin Lee Cheong
Findings of the Association for Computational Linguistics: EACL 2026

Large Language Models (LLMs) have demonstrated remarkable capabilities, enabling language agents to excel at single-turn tasks. However, their application to complex, multi-step, and long-horizon tasks remains challenging. While reinforcement learning (RL) offers a promising avenue for addressing these challenges, mainstream approaches typically rely solely on sparse, outcome-based rewards — a limitation that becomes especially problematic for group-based RL algorithms lacking critic models, such as Group Relative Policy Optimization (GRPO). In such methods, uniformly rewarding or penalizing all actions within a trajectory can lead to training instability and suboptimal policies, because beneficial and detrimental actions are often entangled across multi-step interactions. To address this challenge, we propose SALT, a novel and lightweight framework that provides a finer-grained advantage assignment, derived solely from outcome rewards. We achieve this by constructing a graph from trajectories of the same prompt, which allows us to quantify the quality of each step and assign advantages accordingly. Crucially, SALT is designed as a plug-and-play module that seamlessly integrates with existing group-based RL algorithms — requiring no modifications to the rollout procedure and introducing negligible computational overhead. Extensive experiments on the WebShop, ALFWorld, and AppWorld benchmarks with various model sizes demonstrate that SALT consistently improves performance. We also conduct a thorough analysis to validate the design choices behind SALT and offer actionable insights.

2025

LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop
Runcong Zhao | Artem Bobrov | Jiazheng Li | Cesare Aloisi | Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.

Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Jiazheng Li | Yuxiang Zhou | Junru Lu | Gladys Tyen | Lin Gui | Cesare Aloisi | Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Although preference optimization methods have improved reasoning performance in Large Language Models (LLMs), they often lack transparency regarding why one reasoning outcome is preferred over another. This limitation is especially critical in Automated Student Answer Scoring (ASAS), where explainability is essential to justify assessment outcomes. Verbal reinforcement learning offers the potential to generate explicit reflection, but it tends to produce superficial critiques that can harm assessment performance. Existing LLMs also struggle to reliably detect subtle reasoning errors in ASAS tasks. Moreover, manually identifying intermediate reasoning errors is expensive and difficult to scale. To address these challenges, we introduce a contrastive reflection synthesis pipeline that generates precise verbal feedback by identifying discrepancies in structured reasoning graph paths. Leveraging these synthetic reflection data, we propose DARS, a Dual-model Reflective Scoring framework featuring a dedicated Critic model trained for effective reflection. DARS achieves strong performance and consistently outperforms existing ASAS baselines across all evaluation metrics. Extensive experiments further provide novel insights into the value of reflection data, framework design, and the scaling behavior of DARS. We release the DARS code at https://github.com/lijiazheng99/DARS.

Drift: Enhancing LLM Faithfulness in Rationale Generation via Dual-Reward Probabilistic Inference
Jiazheng Li | Hanqi Yan | Yulan He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

As Large Language Models (LLMs) are increasingly applied to complex reasoning tasks, achieving both accurate task performance and faithful explanations becomes crucial. However, LLMs often generate unfaithful explanations, partly because they do not consistently adhere closely to the provided context. Existing approaches to this problem either rely on superficial calibration methods, such as decomposed Chain-of-Thought prompting, or require costly retraining to improve model faithfulness. In this work, we propose a probabilistic inference paradigm that leverages task-specific and lookahead rewards to ensure that LLM-generated rationales are more faithful to model decisions and align better with input context. These rewards are derived from a domain-specific proposal distribution, allowing for optimized sequential Monte Carlo approximations. Our evaluations across three different reasoning tasks show that this method, which allows for controllable generation during inference, improves both accuracy and faithfulness of LLMs. This method offers a promising path towards making LLMs more reliable for reasoning tasks without sacrificing performance.

AERA Chat: An Interactive Platform for Automated Explainable Student Answer Assessment
Jiazheng Li | Artem Bobrov | Runcong Zhao | Cesare Aloisi | Yulan He
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Explainability in automated student answer scoring systems is critical for building trust and enhancing usability among educators. Yet, generating high-quality assessment rationales remains challenging due to the scarcity of annotated data and the prohibitive cost of manual verification, prompting heavy reliance on rationales produced by large language models (LLMs), which are often noisy and unreliable. To address these limitations, we present AERA Chat, an interactive visualization platform designed for automated explainable student answer assessment. AERA Chat leverages multiple LLMs to concurrently score student answers and generate explanatory rationales, offering innovative visualization features that highlight critical answer components and rationale justifications. The platform also incorporates intuitive annotation and evaluation tools, supporting educators in marking tasks and researchers in evaluating rationale quality from different models. We demonstrate the effectiveness of our platform through evaluations of multiple rationale-generation methods on several datasets, showcasing its capability for facilitating robust rationale evaluation and comparative analysis.

EnigmaToM: Improve LLMs’ Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States
Hainiu Xu | Siya Qi | Jiazheng Li | Yuxiang Zhou | Jinhua Du | Caroline Catmur | Yulan He
Findings of the Association for Computational Linguistics: ACL 2025

Theory-of-Mind (ToM), the ability to infer others’ perceptions and mental states, is fundamental to human interaction but remains challenging for Large Language Models (LLMs). While existing ToM reasoning methods show promise with reasoning via perceptual perspective-taking, they often rely excessively on off-the-shelf LLMs, reducing their efficiency and limiting their applicability to high-order ToM reasoning. To address these issues, we present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma) for (1) a psychology-inspired iterative masking mechanism that facilitates accurate perspective-taking and (2) knowledge injection that elicits key entity information. Enigma generates structured knowledge of entity states to build spatial scene graphs for belief tracking across various ToM orders and enrich events with fine-grained entity state details. Experimental results on ToMi, HiToM, and FANToM benchmarks show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes, particularly excelling in high-order reasoning scenarios.

RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
Junru Lu | Jiazheng Li | Guodong Shen | Lin Gui | Siyu An | Yulan He | Di Yin | Xing Sun
Findings of the Association for Computational Linguistics: ACL 2025

Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role’s pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions of given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that models fine-tuned on RoleMRC enhance instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. RoleMRC, RoleMRC-mix, and code are available at: https://github.com/LuJunru/RoleMRC.

2024

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
Junru Lu | Jiazheng Li | Siyu An | Meng Zhao | Yulan He | Di Yin | Xing Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Direct Preference Optimization (DPO) has emerged as a prominent algorithm for the direct and robust alignment of Large Language Models (LLMs) with human preferences, offering a more straightforward alternative to the complex Reinforcement Learning from Human Feedback (RLHF). Despite its promising efficacy, DPO faces a notable drawback: “verbosity”, a common over-optimization phenomenon also observed in RLHF. While previous studies mainly attributed verbosity to biased labels within the data, we propose that the issue also stems from an inherent algorithmic length reliance in DPO. Specifically, we suggest that the discrepancy between sequence-level Kullback–Leibler (KL) divergences between chosen and rejected sequences, used in DPO, results in overestimated or underestimated rewards due to varying token lengths. Empirically, we utilize datasets with different label lengths to demonstrate the presence of biased rewards. We then introduce an effective downsampling approach, named SamPO, to eliminate potential length reliance. Our experimental evaluations, conducted across three LLMs of varying scales and a diverse array of conditional and open-ended benchmarks, highlight the efficacy of SamPO in mitigating verbosity, achieving improvements of 5% to 12% over DPO through debiased rewards. Our code can be accessed at: https://github.com/LuJunru/SamPO/.
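
To make the length-reliance argument concrete, the sketch below shows one plausible reading of the down-sampling idea: per-token log-ratios of the chosen and rejected responses are randomly down-sampled to the same count before they are summed into the DPO margin. This is an illustrative assumption, not the authors' implementation (see the linked repository for that); the function name `downsampled_dpo_loss`, the `beta` default, and the plain-list inputs are all hypothetical.

```python
import math
import random


def downsampled_dpo_loss(chosen_logratios, rejected_logratios, beta=0.1, seed=0):
    """Hypothetical sketch of a length-debiased DPO loss.

    Each input is a list of per-token values log(pi_theta / pi_ref)
    for one response. Both lists are down-sampled to the shorter
    length so neither sum is inflated by having more tokens.
    """
    rng = random.Random(seed)
    k = min(len(chosen_logratios), len(rejected_logratios))
    # Uniformly sample k tokens (without replacement) from each response.
    chosen_sample = rng.sample(list(chosen_logratios), k)
    rejected_sample = rng.sample(list(rejected_logratios), k)
    margin = beta * (sum(chosen_sample) - sum(rejected_sample))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    loss = math.log1p(math.exp(-margin))
    return loss, k


# Same per-token log-ratio, very different lengths.
chosen = [0.5] * 3
rejected = [0.5] * 10
loss, k = downsampled_dpo_loss(chosen, rejected)
print(k, loss)
```

In this toy case every token carries the same log-ratio, so a plain sequence-level sum would favour the longer rejected response purely because of its length; after down-sampling both sides to three tokens, the margin is zero and the loss reduces to log 2.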

Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
Runcong Zhao | Qinglin Zhu | Hainiu Xu | Jiazheng Li | Yuxiang Zhou | Yulan He | Lin Gui
Findings of the Association for Computational Linguistics: ACL 2024

Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) like GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferring complex relationships and handling longer narratives. The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts.

The Mystery of In-Context Learning: A Comprehensive Survey on Interpretation and Analysis
Yuxiang Zhou | Jiazheng Li | Yanzheng Xiang | Hanqi Yan | Lin Gui | Yulan He
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Understanding the in-context learning (ICL) capability, which enables large language models (LLMs) to perform tasks proficiently from demonstration examples, is of utmost importance. This importance stems not only from the better utilization of this capability across various tasks, but also from the proactive identification and mitigation of potential risks, including concerns regarding truthfulness, bias, and toxicity, that may arise alongside the capability. In this paper, we present a thorough survey on the interpretation and analysis of in-context learning. First, we provide a concise introduction to the background and definition of in-context learning. Then, we give an overview of advancements from two perspectives: 1) a theoretical perspective, emphasizing studies on mechanistic interpretability and delving into the mathematical foundations behind ICL; and 2) an empirical perspective, concerning studies that empirically analyze factors associated with ICL. We conclude by discussing open questions and the challenges encountered, and suggesting potential avenues for future research. We believe that our work establishes the basis for further exploration into the interpretation of in-context learning. To aid this effort, we have created a repository containing resources that will be continually updated.

Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Jiazheng Li | Hainiu Xu | Zhaoyue Sun | Yuxiang Zhou | David West | Cesare Aloisi | Yulan He
Findings of the Association for Computational Linguistics: EMNLP 2024

Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Moreover, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths. Data and code are available at: https://github.com/lijiazheng99/thought_tree_assessment.
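
QWK here is taken to mean quadratic weighted kappa, a standard agreement metric between predicted and reference scores that penalises a disagreement by the squared distance between the two labels. As a reminder of what the reported number measures, a minimal sketch follows; the function name `quadratic_weighted_kappa` and the assumption that scores are integers in 0..num_labels-1 are illustrative choices, not taken from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_labels):
    """Quadratic weighted kappa for integer labels in [0, num_labels).

    Returns 1.0 for perfect agreement and about 0.0 for chance-level
    agreement. Assumes num_labels >= 2 and that the expected
    disagreement is non-zero (labels are not all identical).
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns predictions.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_labels))
                 for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if true and predicted labels were independent.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n
    return 1.0 - weighted_observed / weighted_expected


print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], num_labels=4))
print(quadratic_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], num_labels=2))
```

The first call scores perfect agreement; the second scores predictions that are statistically independent of the reference labels.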

NarrativePlay: Interactive Narrative Understanding
Runcong Zhao | Wenjia Zhang | Jiazheng Li | Lixing Zhu | Yanran Li | Yulan He | Lin Gui
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

In this paper, we introduce NarrativePlay, a novel system that allows users to role-play a fictional character and interact with other characters in narratives in an immersive environment. We leverage Large Language Models (LLMs) to generate human-like responses, guided by personality traits extracted from narratives. The system incorporates auto-generated visual display of narrative settings, character portraits, and character speech, greatly enhancing the user experience. Our approach eschews predefined sandboxes, focusing instead on main storyline events from the perspective of a user-selected character. NarrativePlay has been evaluated on two types of narratives, detective and adventure stories, where users can either explore the world or increase affinity with other characters through conversations.

2023

NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization
Junru Lu | Jiazheng Li | Byron Wallace | Yulan He | Gabriele Pergola
Findings of the Association for Computational Linguistics: EACL 2023

Accessing medical literature is difficult for laypeople as the content is written for specialists and contains medical jargon. Automated text simplification methods offer a potential means to address this issue. In this work, we propose a summarize-then-simplify two-stage strategy, which we call NapSS, identifying the relevant content to simplify while ensuring that the original narrative flow is preserved. In this approach, we first generate reference summaries via sentence matching between the original and the simplified abstracts. These summaries are then used to train an extractive summarizer, learning the most relevant content to be simplified. Then, to ensure the narrative consistency of the simplified text, we synthesize auxiliary narrative prompts combining key phrases derived from the syntactical analyses of the original text. Our model achieves results significantly better than the seq2seq baseline on an English medical corpus, yielding 3% to 4% absolute improvements in terms of lexical similarity, and providing a further 1.1% improvement of SARI score when combined with the baseline. We also highlight shortcomings of existing evaluation methods, and introduce new metrics that take into account both lexical and high-level semantic similarity. A human evaluation conducted on a random sample of the test set further establishes the effectiveness of the proposed approach. Code and models are released at: https://github.com/LuJunru/NapSS.

Distilling ChatGPT for Explainable Automated Student Answer Assessment
Jiazheng Li | Lin Gui | Yuxiang Zhou | David West | Cesare Aloisi | Yulan He
Findings of the Association for Computational Linguistics: EMNLP 2023

Providing explainable and faithful feedback is crucial for automated student answer assessment. In this paper, we introduce a novel framework that explores using ChatGPT, a cutting-edge large language model, for the concurrent tasks of student answer scoring and rationale generation. We identify the appropriate instructions by prompting ChatGPT with different templates to collect the rationales, where inconsistent rationales are refined to align with marking standards. The refined ChatGPT outputs enable us to fine-tune a smaller language model that simultaneously assesses student answers and provides rationales. Extensive experiments on the benchmark dataset show that the proposed method improves the overall QWK score by 11% compared to ChatGPT. Furthermore, our thorough analysis and human evaluation demonstrate that the rationales generated by our proposed method are comparable to those of ChatGPT. Our approach provides a viable solution to achieve explainable automated assessment in education.

2022

PHEE: A Dataset for Pharmacovigilance Event Extraction from Text
Zhaoyue Sun | Jiazheng Li | Gabriele Pergola | Byron Wallace | Bino John | Nigel Greene | Joseph Kim | Yulan He
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The primary goal of drug safety researchers and regulators is to promptly identify adverse drug reactions. Doing so may in turn prevent or reduce the harm to patients and ultimately improve public health. Evaluating and monitoring drug safety (i.e., pharmacovigilance) involves analyzing an ever-growing collection of spontaneous reports from health professionals, physicians, and pharmacists, and information voluntarily submitted by patients. In this scenario, facilitating analysis of such reports via automation has the potential to rapidly identify safety signals. Unfortunately, public resources for developing natural language models for this task are scant. We present PHEE, a novel dataset for pharmacovigilance comprising over 5000 annotated events from medical case reports and biomedical literature, making it the largest such public dataset to date. We describe the hierarchical event schema designed to provide coarse and fine-grained information about patients’ demographics, treatments and (side) effects. Along with the discussion of the dataset, we present a thorough experimental evaluation of current state-of-the-art approaches for biomedical event extraction, point out their limitations, and highlight open challenges to foster future research in this area.

2021

Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis
Linyi Yang | Jiazheng Li | Padraig Cunningham | Yue Zhang | Barry Smyth | Ruihai Dong
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

While state-of-the-art NLP models have achieved excellent performance on a wide range of tasks in recent years, important questions are being raised about their robustness and their underlying sensitivity to systematic biases that may exist in their training and test data. Such issues manifest as performance problems when models face out-of-distribution data in the field. One recent solution has been to use counterfactually augmented datasets in order to reduce any reliance on spurious patterns that may exist in the original data. Producing high-quality augmented data can be costly and time-consuming as it usually needs to involve human feedback and crowdsourcing efforts. In this work, we propose an alternative by describing and evaluating an approach to automatically generating counterfactual data for the purpose of data augmentation and explanation. A comprehensive evaluation on several different datasets and using a variety of state-of-the-art benchmarks demonstrates how our approach can achieve significant improvements in model performance when compared to models trained on the original data and even when compared to models trained with the benefit of human-generated augmented data.
