Jiaoyan Chen - ACL Anthology

Jiaoyan Chen

2026

HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
Hanhua Hong | Yizhi LI | Jiaoyan Chen | Sophia Ananiadou | Xiaoli Li | Jung-jae Kim | Chenghua Lin
Findings of the Association for Computational Linguistics: ACL 2026

Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance. In this work, we propose Hierarchical Research Agent System (HiRAS), a hierarchical multi-agent framework for end-to-end paper reproduction that employs supervisory manager agents to coordinate specialised agents across fine-grained stages. We also identify limitations in the reference-free evaluation of the Paper2Code benchmark and introduce Paper2Code-Extra (P2C-Ex), a refined protocol that incorporates repository-level information and better aligns with the original reference-based metric. We conduct extensive evaluation, validating the effectiveness and robustness of our proposed methods, and observing improvements, including >10% relative performance gain above the previous state-of-the-art using open-source backbone models and significantly reduced hallucination in the evaluation. All code and data will be made publicly available.

What Breaks Knowledge Graph based RAG? Benchmarking and Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou | Yuqicheng Zhu | Xiaxia Wang | Hongkuan Zhou | Yuan He | Jiaoyan Chen | Steffen Staab | Evgeny Kharlamov
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge Graph-based Retrieval-Augmented Generation (KG-RAG) is an increasingly explored approach for combining the reasoning capabilities of large language models with the structured evidence of knowledge graphs. However, current evaluation practices fall short: existing benchmarks often include questions that can be directly answered using existing triples in KG, making it unclear whether models perform reasoning or simply retrieve answers directly. Moreover, inconsistent evaluation metrics and lenient answer matching criteria further obscure meaningful comparisons. In this work, we introduce a general method for constructing benchmarks and present BRINK (Benchmark for Reasoning under Incomplete Knowledge) to systematically assess KG-RAG methods under knowledge incompleteness. Our empirical results show that current KG-RAG methods have limited reasoning ability under missing knowledge, often rely on internal memorization, and exhibit varying degrees of generalization depending on their design.

MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Long-tail Knowledge
Jie He | Nan Hu | Wanqiu Long | Jiaoyan Chen | Jeff Z. Pan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, enabling them to tackle knowledge-intensive tasks. However, limited research has explored how LLMs effectively leverage RAG techniques for multi-hop question answering (QA), particularly when handling knowledge with with varying degrees of familiarity. In this paper, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a benchmark designed to evaluate multi-hop QA performance on questions involving 10,479 question-answer pairs for evaluating old/new knowledge and 17,887 pairs for assessing popular/unpopular knowledge, with each question equipped with corresponding sub-questions and answers. This benchmark primarily evaluates the multi-hop reasoning ability of LLMs and their capacity to handle knowledge with varying levels of familiarity during the reasoning process. We evaluate 22 state-of-the-art LLMs using three distinct QA strategies: LLM-based parameterized knowledge QA, direct RAG-enhanced QA, and multi-hop RAG-enhanced QA. Our experiments reveal key challenges in how LLMs handle knowledge with different familiarity and offer insights into improving their multi-hop reasoning capabilities when combined with RAG techniques.

2025

Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning
Yuxia Geng | Runkai Zhu | Jiaoyan Chen | Jintai Chen | Xiang Chen | Zhuo Chen | Shuofei Qiao | Yuxiang Wang | Xiaoliang Xu | Sheng-Jun Huang
Findings of the Association for Computational Linguistics: ACL 2025

Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP’s frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at: https://github.com/zhurunkai/DCDA.

Certainty in Uncertainty: Reasoning over Uncertain Knowledge Graphs with Statistical Guarantees
Yuqicheng Zhu | Jingcheng Wu | Yizhen Wang | Hongkuan Zhou | Jiaoyan Chen | Evgeny Kharlamov | Steffen Staab
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Uncertain knowledge graph embedding (UnKGE) methods learn vector representations that capture both structural and uncertainty information to predict scores of unseen triples. However, existing methods produce only point estimates, without quantifying predictive uncertainty—limiting their reliability in high-stakes applications where understanding confidence in predictions is crucial. To address this limitation, we propose UnKGCP, a framework that generates prediction intervals guaranteed to contain the true score with a user-specified level of confidence. The length of the intervals reflects the model’s predictive uncertainty. UnKGCP builds on the conformal prediction framework but introduces a novel nonconformity measure tailored to UnKGE methods and an efficient procedure for interval construction. We provide theoretical guarantees for the intervals and empirically verify these guarantees. Extensive experiments on standard UKG benchmarks across diverse UnKGE methods further demonstrate that the intervals are sharp and effectively capture predictive uncertainty.

Noise-powered Multi-modal Knowledge Graph Representation Framework
Zhuo Chen | Yin Fang | Yichi Zhang | Lingbing Guo | Jiaoyan Chen | Jeff Z. Pan | Huajun Chen | Wen Zhang
Proceedings of the 31st International Conference on Computational Linguistics

The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph (MMKG) representation learning framework. Such a framework is essential for embedding structured knowledge into multi-modal Large Language Models effectively, alleviating issues like knowledge misconceptions and multi-modal hallucinations. In this work, we explore the efficacy of models in accurately embedding entities within MMKGs through two pivotal tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking to robustly integrate multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility. Moreover, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Code and data are available at https://github.com/zjukg/SNAG.

Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
Nan Hu | Jiaoyan Chen | Yike Wu | Guilin Qi | Hongru Wang | Sheng Bi | Yongrui Chen | Tongtong Wu | Jeff Z. Pan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including lacking fine-grained attribution categories, relying on manual annotations, and failing to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, the testing of LLM evaluators fine-tuned by CAQA and so on. These experiments also lead to a series of important findings that can benefit the future research of AQA.

2024

Exploring the Impact of Table-to-Text Methods on Augmenting LLM-based Question Answering with Domain Hybrid Data
Dehai Min | Nan Hu | Rihui Jin | Nuo Lin | Jiaoyan Chen | Yongrui Chen | Yu Li | Guilin Qi | Yun Li | Nijun Li | Qianren Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track)

Augmenting Large Language Models (LLMs) for Question Answering (QA) with domain specific data has attracted wide attention. However, domain data often exists in a hybrid format, including text and semi-structured tables, posing challenges for the seamless integration of information. Table-to-Text Generation is a promising solution by facilitating the transformation of hybrid data into a uniformly text-formatted corpus. Although this technique has been widely studied by the NLP community, there is currently no comparative analysis on how corpora generated by different table-to-text methods affect the performance of QA systems.In this paper, we address this research gap in two steps. First, we innovatively integrate table-to-text generation into the framework of enhancing LLM-based QA systems with domain hybrid data. Then, we utilize this framework in real-world industrial data to conduct extensive experiments on two types of QA systems (DSFT and RAG frameworks) with four representative methods: Markdown format, Template serialization, TPLM-based method, and LLM-based method. Based on the experimental results, we draw some empirical findings and explore the underlying reasons behind the success of some methods. We hope the findings of this work will provide a valuable reference for the academic and industrial communities in developing robust QA systems.

DET: A Dual-Encoding Transformer for Relational Graph Embedding
Lingbing Guo | Zhuo Chen | Jiaoyan Chen | Qiang Zhang | Huajun Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Despite recent successes in natural language processing and computer vision, Transformer faces scalability issues when processing graphs, e.g., computing the full node-to-node attention on knowledge graphs (KGs) with million of entities is still infeasible. The existing methods mitigate this problem by considering only the local neighbors, sacrificing the Transformer’s ability to attend to elements at any distance. This paper proposes a new Transformer architecture called Dual-Encoding Transformer (DET). DET comprises a structural encoder to aggregate information from nearby neighbors, and a semantic encoder to seek for semantically relevant nodes. We adopt a semantic neighbor search approach inspired by multiple sequence alignment (MSA) algorithms used in biological sciences. By stacking the two encoders alternately, similar to the MSA Transformer for protein representation, our method achieves superior performance compared to state-of-the-art attention-based methods on complex relational graphs like KGs and citation networks. Additionally, DET remains competitive for smaller graphs such as molecules.

TacoERE: Cluster-aware Compression for Event Relation Extraction
Yong Guan | Xiaozhi Wang | Lei Hou | Juanzi Li | Jeff Z. Pan | Jiaoyan Chen | Freddy Lecue
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Event relation extraction (ERE) is a critical and fundamental challenge for natural language processing. Existing work mainly focuses on directly modeling the entire document, which cannot effectively handle long-range dependencies and information redundancy. To address these issues, we propose a cluster-aware compression method for improving event relation extraction (TacoERE), which explores a compression-then-extraction paradigm. Specifically, we first introduce document clustering for modeling event dependencies. It splits the document into intra- and inter-clusters, where intra-clusters aim to enhance the relations within the same cluster, while inter-clusters attempt to model the related events at arbitrary distances. Secondly, we utilize cluster summarization to simplify and highlight important text content of clusters for mitigating information redundancy and event distance. We have conducted extensive experiments on both pre-trained language models, such as RoBERTa, and large language models, such as ChatGPT and GPT-4, on three ERE datasets, i.e., MAVEN-ERE, EventStoryLine and HiEve. Experimental results demonstrate that TacoERE is an effective method for ERE.

CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering
Yike Wu | Yi Huang | Nan Hu | Yuncheng Hua | Guilin Qi | Jiaoyan Chen | Jeff Z. Pan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent studies have explored the use of Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for Knowledge Graph Question Answering (KGQA). They typically require rewriting retrieved subgraphs into natural language formats comprehensible to LLMs. However, when tackling complex questions, the knowledge rewritten by existing methods may include irrelevant information, omit crucial details, or fail to align with the question’s semantics. To address them, we propose a novel rewriting method CoTKR, Chain- of-Thought Enhanced Knowledge Rewriting, for generating reasoning traces and corresponding knowledge in an interleaved manner, thereby mitigating the limitations of single-step knowledge rewriting. Additionally, to bridge the preference gap between the knowledge rewriter and the question answering (QA) model, we propose a training strategy PAQAF, Preference Alignment from Question Answering Feedback, for leveraging feedback from the QA model to further optimize the knowledge rewriter. We conduct experiments using various LLMs across several KGQA benchmarks. Experimental results demonstrate that, compared with previous knowledge rewriting methods, CoTKR generates the most beneficial knowledge representation for QA models, which significantly improves the performance of LLMs in KGQA.

2023

Novel Relation Detection: Discovering Unknown Relation Types via Multi-Strategy Self-Supervised Learning
Qingbin Liu | Yin Kung | Yanchao Hao | Dianbo Sui | Siyuan Cheng | Xi Chen | Ningyu Zhang | Jiaoyan Chen
Findings of the Association for Computational Linguistics: EMNLP 2023

Conventional approaches to relation extraction can only recognize predefined relation types. In the real world, new or out-of-scope relation types may keep challenging the deployed models. In this paper, we formalize such a challenging problem as Novel Relation Detection (NRD), which aims to discover potential new relation types based on training samples of known relations. To this end, we construct two NRD datasets and exhaustively investigate a variety of out-of-scope detection methods. We further propose an effective NRD method that utilizes multi-strategy self-supervised learning to handle the problem of shallow semantic similarity in the NRD task. Experimental results demonstrate the effectiveness of our method, which significantly outperforms previous state-of-the-art methods on both datasets.

Trigger-Argument based Explanation for Event Detection
Yong Guan | Jiaoyan Chen | Freddy Lecue | Jeff Pan | Juanzi Li | Ru Li
Findings of the Association for Computational Linguistics: ACL 2023

Event Detection (ED) is a critical task that aims to identify events of certain types in plain text. Neural models have achieved great success on ED, thus coming with a desire for higher interpretability. Existing works mainly exploit words or phrases of the input text to explain models’ inner mechanisms. However, for ED, the event structure, comprising of an event trigger and a set of arguments, are more enlightening clues to explain model behaviors. To this end, we propose a Trigger-Argument based Explanation method (TAE), which can utilize event structure knowledge to uncover a faithful interpretation for the existing ED models at neuron level. Specifically, we design group, sparsity, support mechanisms to construct the event structure from structuralization, compactness, and faithfulness perspectives. We evaluate our model on the large-scale MAVEN and the widely-used ACE 2005 datasets, and observe that TAE is able to reveal the process by which the model predicts. Experimental results also demonstrate that TAE can not only improve the interpretability on standard evaluation metrics, but also effectively facilitate the human understanding.

Language Model Analysis for Ontology Subsumption Inference
Yuan He | Jiaoyan Chen | Ernesto Jimenez-Ruiz | Hang Dong | Ian Horrocks
Findings of the Association for Computational Linguistics: ACL 2023

Investigating whether pre-trained language models (LMs) can function as knowledge bases (KBs) has raised wide research interests recently. However, existing works focus on simple, triple-based, relational KBs, but omit more sophisticated, logic-based, conceptualised KBs such as OWL ontologies. To investigate an LM’s knowledge of ontologies, we propose OntoLAMA, a set of inference-based probing tasks and datasets from ontology subsumption axioms involving both atomic and complex concepts. We conduct extensive experiments on ontologies of different domains and scales, and our results demonstrate that LMs encode relatively less background knowledge of Subsumption Inference (SI) than traditional Natural Language Inference (NLI) but can improve on SI significantly when a small number of samples are given. We will open-source our code and datasets.

Class Lifelong Learning for Intent Detection via Structure Consolidation Networks
Qingbin Liu | Yanchao Hao | Xiaolong Liu | Bo Li | Dianbo Sui | Shizhu He | Kang Liu | Jun Zhao | Xi Chen | Ningyu Zhang | Jiaoyan Chen
Findings of the Association for Computational Linguistics: ACL 2023

Intent detection, which estimates diverse intents behind user utterances, is an essential component of task-oriented dialogue systems. Previous intent detection models are usually trained offline, which can only handle predefined intent classes. In the real world, new intents may keep challenging deployed models. For example, with the prevalence of the COVID-19 pandemic, users may pose various issues related to the pandemic to conversational systems, which brings many new intents. A general intent detection model should be intelligent enough to continually learn new data and recognize new arriving intent classes. Therefore, this work explores Class Lifelong Learning for Intent Detection (CLL-ID), where the model continually learns new intent classes from new data while avoiding catastrophic performance degradation on old data. To this end, we propose a novel lifelong learning method, called Structure Consolidation Networks (SCN), which consists of structure-based retrospection and contrastive knowledge distillation to handle the problems of expression diversity and class imbalance in the CLL-ID task. In addition to formulating the new task, we construct 3 benchmarks based on 8 intent detection datasets. Experimental results demonstrate the effectiveness of SCN, which significantly outperforms previous lifelong learning methods on the three benchmarks.

2022

UoR-NCL at SemEval-2022 Task 3: Fine-Tuning the BERT-Based Models for Validating Taxonomic Relations
Thanet Markchom | Huizhi Liang | Jiaoyan Chen
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In human languages, there are many presuppositional constructions that impose a constrain on the taxonomic relations between two nouns depending on their order. These constructions create a challenge in validating taxonomic relations in real-world contexts. In SemEval2022-Task3 Presupposed Taxonomies: Evaluating Neural Network Semantics (PreTENS), the organizers introduced a task regarding validating the taxonomic relations within a variety of presuppositional constructions. This task is divided into two subtasks: classification and regression. Each subtask contains three datasets in multiple languages, i.e., English, Italian and French. To tackle this task, this work proposes to fine-tune different BERT-based models pre-trained on different languages. According to the experimental results, the fine-tuned BERT-based models are effective compared to the baselines in classification. For regression, the fine-tuned models show promising performance with the possibility of improvement.

2021

OntoEA: Ontology-guided Entity Alignment via Joint Knowledge Graph Embedding
Yuejia Xiang | Ziheng Zhang | Jiaoyan Chen | Xi Chen | Zhenxi Lin | Yefeng Zheng
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

An Industry Evaluation of Embedding-based Entity Alignment
Ziheng Zhang | Hualuo Liu | Jiaoyan Chen | Xi Chen | Bo Liu | YueJia Xiang | Yefeng Zheng
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Embedding-based entity alignment has been widely investigated in recent years, but most proposed methods still rely on an ideal supervised learning setting with a large number of unbiased seed mappings for training and validation, which significantly limits their usage. In this study, we evaluate those state-of-the-art methods in an industrial context, where the impact of seed mappings with different sizes and different biases is explored. Besides the popular benchmarks from DBpedia and Wikidata, we contribute and evaluate a new industrial benchmark that is extracted from two heterogeneous knowledge graphs (KGs) under deployment for medical applications. The experimental results enable the analysis of the advantages and disadvantages of these alignment methods and the further discussion of suitable strategies for their industrial deployment.

Zero-shot Text Classification via Reinforced Self-training
Zhiquan Ye | Yuxia Geng | Jiaoyan Chen | Jingmin Chen | Xiaoxiao Xu | SuHang Zheng | Feng Wang | Jun Zhang | Huajun Chen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Zero-shot learning has been a tough problem since no labeled data is available for unseen classes during training, especially for classes with low similarity. In this situation, transferring from seen classes to unseen classes is extremely hard. To tackle this problem, in this paper we propose a self-training based method to efficiently leverage unlabeled data. Traditional self-training methods use fixed heuristics to select instances from unlabeled data, whose performance varies among different datasets. We propose a reinforcement learning framework to learn data selection strategy automatically and provide more reliable selection. Experimental results on both benchmarks and a real-world e-commerce dataset show that our approach significantly outperforms previous methods in zero-shot text classification

Co-authors

Yong Guan (关勇) 2

Evgeny Kharlamov 2

Steffen Staab 2

Hongkuan Zhou 2

Yuqicheng Zhu 2

Sophia Ananiadou 1

Shizhu He (何世柱) 1

Sheng-Jun Huang 1

Ernesto Jimenez-Ruiz 1

Ru Li (李茹) 1

Thanet Markchom 1

Dongzhuoran Zhou 1

Venues