Yao Liu

2025

While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and LLaMA-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.

pdf bib abs
Can Large Language Models Act as Ensembler for Multi-GNNs?
Hanqi Duan | Yao Cheng | Jianxiang Yu | Yao Liu | Xiang Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Graph Neural Networks (GNNs) have emerged as powerful models for learning from graph-structured data. However, GNNs lack the inherent semantic understanding capability of rich textual node attributes, limiting their effectiveness in applications. On the other hand, we empirically observe that for existing GNN models, no one can consistently outperforms others across diverse datasets. In this paper, we study whether LLMs can act as an ensembler for multi-GNNs and propose the LensGNN model. The model first aligns multiple GNNs, mapping the representations of different GNNs into the same space. Then, through LoRA fine-tuning, it aligns the space between the GNN and the LLM, injecting graph tokens and textual information into LLMs. This allows LensGNN to ensemble multiple GNNs and take advantage of the strengths of LLM, leading to a deeper understanding of both textual semantic information and graph structural information. The experimental results show that LensGNN outperforms existing models. This research advances text-attributed graph ensemble learning by providing a robust and superior solution for integrating semantic and structural information. We provide our code and data here: https://github.com/AquariusAQ/LensGNN.

pdf bib abs
Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
Yibo Zhao | Jiapeng Zhu | Can Xu | Yao Liu | Xiang Li
Findings of the Association for Computational Linguistics: ACL 2025

The rapid growth of social media platforms has raised significant concerns regarding online content toxicity. When Large Language Models (LLMs) are used for toxicity detection, two key challenges emerge: 1) the absence of domain-specific toxicity knowledge leads to false negatives; 2) the excessive sensitivity of LLMs to toxic speech results in false positives, limiting freedom of speech. To address these issues, we propose a novel method called *MetaTox*, leveraging graph search on a meta-toxic knowledge graph to enhance hatred and toxicity detection. First, we construct a comprehensive meta-toxic knowledge graph by utilizing LLMs to extract toxic information through a three step pipeline. Second, we query the graph via retrieval and ranking processes to supplement accurate, relevant toxicity knowledge. Extensive experiments and case studies across multiple datasets demonstrate that our MetaTox boosts overall toxicity detection performance, particularly in out-of-domain settings. In addition, under in-domain scenarios, we surprisingly find that small language models are more competent. Our code is available at https://github.com/YiboZhao624/MetaTox.

Generative methods significantly advance event argument extraction by probabilistically generating event argument sequences in a structured format. However, existing approaches primarily rely on a single prompt to generate event arguments in a fixed, predetermined order. Such a rigid approach overlooks the complex structural and dynamic interdependencies among event arguments. In this work, we present GEMS, a multi-prompt learning framework that Generates Event arguments via Multi-perspective prompts and ontology Steering. Specifically, GEMS utilizes multiple unfilled prompts for each sentence, predicting event arguments in varying sequences to explicitly capture the interrelationships between arguments. These predictions are subsequently aggregated using a voting mechanism. Furthermore, an ontology-driven steering mechanism is proposed to ensure that the generated arguments are contextually appropriate and consistent with event-specific knowledge. Extensive experiments on two benchmark datasets demonstrate that GEMS achieves state-of-the-art performance, particularly in low-resource settings. The source code is available at: https://github.com/AONE-NLP/EAE-GEMS

pdf bib abs
SCE: Semantic Consistency Enhanced Reinforcement Learning for Multi-Hop Knowledge Graph Reasoning
Yanwen Huang | Yao Liu | Qiao Liu | Rui Hou | Tingting Dai
Findings of the Association for Computational Linguistics: EMNLP 2025

Multi-hop reasoning with reinforcement learning has proven effective in discovering inference paths in incomplete knowledge graphs. However, a major challenge remains: spurious paths (incorrect reasoning paths that accidentally lead to correct answers) often arise due to reward mechanisms that prioritize final results over reasoning quality. While existing approaches attempt to mitigate this issue using external rules, they often neglect the internal semantic consistency between the target triple and the intermediate triples along the reasoning path. In this paper, we propose a novel framework, Semantic Consistency Enhanced Reinforcement Learning (SCE), which incorporates semantic consistency into the reward function to guide multi-hop reasoning. Experimental results demonstrate that SCE outperforms strong baseline methods and facilitates the discovery of more interpretable reasoning paths.

Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like the Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate code-to-glyph mapping to inject deceptive content which are invisible to users. We evaluate two critical attack scenarios: (1) malicious content relay and (2) sensitive data leakage through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious font can bypass LLM safety mechanisms through external resources, achieving varying success rates based on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.

2024

pdf bib abs
InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment
Jianing Wang | Junda Wu | Yupeng Hou | Yao Liu | Ming Gao | Julian McAuley
Findings of the Association for Computational Linguistics: ACL 2024

Do current large language models (LLMs) better solve graph reasoning and generation tasks with parameter updates? In this paper, we propose InstructGraph, a framework that empowers LLMs with the abilities of graph reasoning and generation by instruction tuning and preference alignment. Specifically, we first propose a structured format verbalizer to unify all graph data into a universal code-like format, which can simply represent the graph without any external graph-specific encoders. Furthermore, a graph instruction tuning stage is introduced to guide LLMs in solving graph reasoning and generation tasks. Finally, we identify potential hallucination problems in graph tasks and sample negative instances for preference alignment, the target of which is to enhance the output’s reliability of the model. Extensive experiments across multiple graph-centric tasks exhibit that InstructGraph can achieve the best performance and outperform GPT-4 and LLaMA2 by more than 13% and 38%, respectively.

pdf bib abs
Cognitive Bias in Decision-Making with LLMs
Jessica Maria Echterhoff | Yao Liu | Abeer Alessa | Julian McAuley | Zexue He
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) offer significant potential as tools to support an expanding range of decision-making tasks. Given their training on human (created) data, LLMs have been shown to inherit societal biases against protected groups, as well as be subject to bias functionally resembling cognitive bias. Human-like bias can impede fair and explainable decisions made with LLM assistance. Our work introduces BiasBuster, a framework designed to uncover, evaluate, and mitigate cognitive bias in LLMs, particularly in high-stakes decision-making tasks. Inspired by prior research in psychology and cognitive science, we develop a dataset containing 13,465 prompts to evaluate LLM decisions on different cognitive biases (e.g., prompt-induced, sequential, inherent). We test various bias mitigation strategies, while proposing a novel method utilizing LLMs to debias their own human-like cognitive bias within prompts. Our analysis provides a comprehensive picture of the presence and effects of cognitive bias across commercial and open-source models. We demonstrate that our selfhelp debiasing effectively mitigates model answers that display patterns akin to human cognitive bias without having to manually craft examples for each bias.

2023

Curated datasets for healthcare are often limited due to the need of human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.