Jiaqi Li
Papers on this page may belong to the following people: Jiaqi Li, Jiaqi Li
2025
AD-LLM: Benchmarking Large Language Models for Anomaly Detection
Tiankai Yang | Yi Nian | Li Li | Ruiyao Xu | Yuangang Li | Jiaqi Li | Zhuo Xiao | Xiyang Hu | Ryan A. Rossi | Kaize Ding | Xia Hu | Yue Zhao
Findings of the Association for Computational Linguistics: ACL 2025
Anomaly detection (AD) is an important machine learning task with many real-world uses, including fraud detection, medical diagnosis, and industrial monitoring. Within natural language processing (NLP), AD helps detect issues like spam, misinformation, and unusual user activity. Although large language models (LLMs) have had a strong impact on tasks such as text generation and summarization, their potential in AD has not been studied enough. This paper introduces AD-LLM, the first benchmark that evaluates how LLMs can help with NLP anomaly detection. We examine three key tasks: (i) zero-shot detection, using LLMs’ pre-trained knowledge to perform AD without task-specific training; (ii) data augmentation, generating synthetic data and category descriptions to improve AD models; and (iii) model selection, using LLMs to suggest unsupervised AD models. Through experiments with different datasets, we find that LLMs can work well in zero-shot AD, that carefully designed augmentation methods are useful, and that explaining model selection for specific datasets remains challenging. Based on these results, we outline six future research directions on LLMs for AD.
Adaptive Preference Optimization with Uncertainty-aware Utility Anchor
Xiaobo Wang | Zixia Jia | Jiaqi Li | Qi Liu | Zilong Zheng
Findings of the Association for Computational Linguistics: EMNLP 2025
Offline preference optimization methods are efficient for large language model (LLM) alignment. Direct Preference Optimization (DPO)-like learning, one of the most popular approaches, stands out for its efficiency in reward modeling. However, these methods typically follow the convention of using Bradley-Terry (BT) reward modeling, which rests on several critical assumptions, including the requirement for pairwise training data, model distribution shifting, and the human rationality assumption. To address these limitations, we propose a general framework for offline preference optimization methods, Adaptive Preference Optimization with Utility Anchor (UAPO), which introduces an anchoring function to estimate the uncertainties introduced by preference data annotation. Our method enables training even in scenarios where the data is unpaired, significantly enhancing data utilization efficiency. Moreover, the anchor design makes UAPO more robust in the training process. Experimental results demonstrate that UAPO achieves competitive outcomes without the strict dependency on data pairing, paving the way for more flexible and effective preference optimization methods.
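For readers unfamiliar with the pairwise assumption mentioned in this abstract, the standard Bradley-Terry preference model and the standard DPO objective are written out below. This is background notation only, not the UAPO formulation, which the abstract does not spell out. The symbols are the conventional ones and are assumptions relative to this page: r is a reward function, y_w and y_l are the preferred and rejected responses to prompt x, π_θ is the policy, π_ref is the reference policy, σ is the logistic function, and β is a temperature.

```latex
% Bradley-Terry: probability that y_w is preferred over y_l given prompt x
P(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Standard DPO loss, which substitutes an implicit reward into the BT model
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

Both expressions depend on a difference between two responses to the same prompt, which is why this family of methods needs paired data.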
Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering
Runxuan Liu | Bei Luo | Jiaqi Li | Baoxin Wang | Ming Liu | Dayong Wu | Shijin Wang | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have shown remarkable capabilities in natural language processing. However, in knowledge graph question answering tasks (KGQA), there remains the issue of answering questions that require multi-hop reasoning. Existing methods rely on entity vector matching, but the purpose of the question is abstract and difficult to match with specific entities. As a result, it is difficult to establish reasoning paths to the purpose, which leads to information loss and redundancy. To address this issue, inspired by human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a novel framework that constructs reasoning paths from purposes back to conditions. ORT operates in three key phases: (1) using LLM to extract purpose labels and condition labels, (2) constructing label reasoning paths based on the KG ontology, and (3) using the label reasoning paths to guide knowledge retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves state-of-the-art performance and significantly enhances the capability of LLMs for KGQA.
NLP-ADBench: NLP Anomaly Detection Benchmark
Yuangang Li | Jiaqi Li | Zhuo Xiao | Tiankai Yang | Yi Nian | Xiyang Hu | Yue Zhao
Findings of the Association for Computational Linguistics: EMNLP 2025
Anomaly detection (AD) is an important machine learning task with applications in fraud detection, content moderation, and user behavior analysis. However, AD is relatively understudied in a natural language processing (NLP) context, limiting its effectiveness in detecting harmful content, phishing attempts, and spam reviews. We introduce NLP-ADBench, the most comprehensive NLP anomaly detection (NLP-AD) benchmark to date, which includes eight curated datasets and 19 state-of-the-art algorithms. These span 3 end-to-end methods and 16 two-step approaches that adapt classical, non-AD methods to language embeddings from BERT and OpenAI. Our empirical results show that no single model dominates across all datasets, indicating a need for automated model selection. Moreover, two-step methods with transformer-based embeddings consistently outperform specialized end-to-end approaches, with OpenAI embeddings outperforming those of BERT. We release NLP-ADBench at https://github.com/USC-FORTIS/NLP-ADBench, providing a unified framework for NLP-AD and supporting future investigations.
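The two-step idea described in this abstract (embed the text first, then run a detector on the embeddings) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2-D vectors stand in for real language embeddings, the function name `knn_anomaly_scores` is hypothetical, and k-nearest-neighbor distance scoring is just one simple detector, not claimed to be one of the benchmark's methods.

```python
import math


def knn_anomaly_scores(train, queries, k=2):
    """Score each query by its mean Euclidean distance to its k nearest
    training vectors; a larger score means the query looks more anomalous."""
    scores = []
    for q in queries:
        dists = sorted(math.dist(q, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


# Step 1 (assumed already done): texts mapped to embedding vectors.
# These toy 2-D points are hypothetical placeholders for real embeddings.
normal_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
test_embeddings = [(0.05, 0.05), (3.0, 3.0)]

# Step 2: apply a detector to the embeddings.
scores = knn_anomaly_scores(normal_embeddings, test_embeddings, k=2)
```

In this toy setup the second test point lies far from the training cluster and so receives the higher score; a real pipeline would replace the placeholder vectors with actual sentence embeddings.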
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection
Jiaqi Li | Xinyi Dong | Yang Liu | Zhizhuo Yang | Quansen Wang | Xiaobo Wang | Song-Chun Zhu | Zixia Jia | Zilong Zheng
Findings of the Association for Computational Linguistics: ACL 2025
We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs’ reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.
Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks
Xubo Qin | Jun Bai | Jiaqi Li | Zixia Jia | Zilong Zheng
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale LLMs like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce Reinforced Query Reasoner (RQR), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. Our approach frames query reformulation as a reinforcement learning problem and employs a novel semi-rule-based reward function. This enables smaller language models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve reasoning performance rivaling large-scale LLMs without their prohibitive inference costs. Experimental results on the BRIGHT benchmark show that, with BM25 as the retriever, both RQR-7B and RQR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some of the latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment. All code and datasets will be publicly released.
2024
LooGLE: Can Long-Context Language Models Understand Long Contexts?
Jiaqi Li | Mengmeng Wang | Zilong Zheng | Muhan Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) are typically limited to processing texts within their context window size, which has spurred significant research efforts into enhancing LLMs’ long-context understanding as well as developing high-quality benchmarks to evaluate this ability. However, prior datasets suffer from shortcomings such as short length compared to the context window of modern LLMs; outdated documents that might have data leakage problems; and an emphasis on short dependency tasks only. In this paper, we present LooGLE, a Long Context Generic Language Evaluation benchmark. It features documents post-2022, with over 24,000 tokens per document and 6,000 newly generated questions spanning varying dependency ranges in diverse domains. Human annotators meticulously crafted over 1,100 high-quality question-answer (QA) pairs with thorough cross-validation for a more precise assessment of LLMs’ long dependency capabilities. We conduct a comprehensive evaluation of representative LLMs on LooGLE. The results indicate that most LLMs have shockingly bad long context ability and fail to capture long dependencies in the context, even when their context window size is enough to fit the entire document. Our results shed light on enhancing the “true long-context understanding” ability of LLMs instead of merely enlarging their context window.
Triple-view Event Hierarchy Model for Biomedical Event Representation
Jiayi Huang | Lishuang Li | Xueyang Qin | Yi Xiang | Jiaqi Li | Yubo Feng
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)
Biomedical event representation can be applied to various language tasks. A biomedical event often involves multiple biomedical entities and trigger words, and the event structure is complex. However, existing research on event representation mainly focuses on the general domain. If models from the general domain are directly transferred to biomedical event representation, the results may not be satisfactory. We argue that biomedical events can be divided into three hierarchies, each containing unique feature information. Therefore, we propose the Triple-views Event Hierarchy Model (TEHM) to enhance the quality of biomedical event representation. TEHM extracts feature information from three different views and integrates them. Specifically, due to the complexity of biomedical events, we propose the Trigger-aware Aggregator module to handle complex units within biomedical events. Additionally, we annotate two similarity task datasets in the biomedical domain using annotation standards from the general domain. Extensive experiments demonstrate that TEHM achieves state-of-the-art performance on biomedical similarity tasks and biomedical event causal relation extraction.
Can Large Language Models Understand DL-Lite Ontologies? An Empirical Study
Keyu Wang | Guilin Qi | Jiaqi Li | Songlin Zhai
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) have shown significant achievements in solving a wide range of tasks. Recently, LLMs’ capability to store, retrieve and infer with symbolic knowledge has drawn a great deal of attention, showing their potential to understand structured information. However, it is not yet known whether LLMs can understand Description Logic (DL) ontologies. In this work, we empirically analyze LLMs’ capability of understanding DL-Lite ontologies, covering 6 representative tasks from syntactic and semantic aspects. With extensive experiments, we demonstrate both the effectiveness and limitations of LLMs in understanding DL-Lite ontologies. We find that LLMs can understand formal syntax and model-theoretic semantics of concepts and roles. However, LLMs struggle with understanding TBox NI transitivity and handling ontologies with large ABoxes. We hope that our experiments and analyses provide more insights into LLMs and inspire the development of more faithful knowledge engineering solutions.
SparkRA: A Retrieval-Augmented Knowledge Service System Based on Spark Large Language Model
Dayong Wu | Jiaqi Li | Baoxin Wang | Honghong Zhao | Siyuan Xue | Yanjie Yang | Zhijun Chang | Rui Zhang | Li Qian | Bo Wang | Shijin Wang | Zhixiong Zhang | Guoping Hu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Assistant (SparkRA) based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing
Jiaqi Li | Miaozeng Du | Chuanyi Zhang | Yongrui Chen | Nan Hu | Guilin Qi | Haiyun Jiang | Siyuan Cheng | Bozhong Tian
Findings of the Association for Computational Linguistics: ACL 2024
Multimodal knowledge editing represents a critical advancement in enhancing the capabilities of Multimodal Large Language Models (MLLMs). Despite its potential, current benchmarks predominantly focus on coarse-grained knowledge, leaving the intricacies of fine-grained (FG) multimodal entity knowledge largely unexplored. This gap presents a notable challenge, as FG entity recognition is pivotal for the practical deployment and effectiveness of MLLMs in diverse real-world scenarios. To bridge this gap, we introduce MIKE, a comprehensive benchmark and dataset specifically designed for the FG multimodal entity knowledge editing. MIKE encompasses a suite of tasks tailored to assess different perspectives, including Vanilla Name Answering, Entity-Level Caption, and Complex-Scenario Recognition. In addition, a new form of knowledge editing, Multi-step Editing, is introduced to evaluate the editing efficiency. Through our extensive evaluations, we demonstrate that the current state-of-the-art methods face significant challenges in tackling our proposed benchmark, underscoring the complexity of FG knowledge editing in MLLMs. Our findings spotlight the urgent need for novel approaches in this domain, setting a clear agenda for future research and development efforts within the community.
2023
Three Stream Based Multi-level Event Contrastive Learning for Text-Video Event Extraction
Jiaqi Li | Chuanyi Zhang | Miaozeng Du | Dehai Min | Yongrui Chen | Guilin Qi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Text-video based multimodal event extraction refers to identifying event information from the given text-video pairs. Existing methods predominantly utilize video appearance features (VAF) and text sequence features (TSF) as input information. Some of them employ contrastive learning to align VAF with the event types extracted from TSF. However, they disregard the motion representations in videos, and the optimization of the contrastive objective could be misguided by the background noise from RGB frames. We observe that the same event triggers correspond to similar motion trajectories, which are hardly affected by the background noise. Motivated by this, we propose a Three Stream Multimodal Event Extraction framework (TSEE) that simultaneously utilizes the features of text sequence and video appearance, as well as the motion representations to enhance the event extraction capacity. Firstly, we extract the optical flow features (OFF) as motion representations from videos to incorporate with VAF and TSF. Then we introduce a Multi-level Event Contrastive Learning module to align the embedding space between OFF and event triggers, as well as between event triggers and types. Finally, a Dual Querying Text module is proposed to enhance the interaction between modalities. Experimental results show that TSEE outperforms the state-of-the-art methods, which demonstrates its superiority.
2020
Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure
Jiaqi Li | Ming Liu | Min-Yen Kan | Zihao Zheng | Zekun Wang | Wenqiang Lei | Ting Liu | Bing Qin
Proceedings of the 28th International Conference on Computational Linguistics
Research into the area of multiparty dialog has grown considerably over recent years. We present the Molweni dataset, a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni’s source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style for all of its multiparty dialogs, contributing large-scale (78,245 annotated discourse relations) data to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a current, strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni’s questions, a 20+% significant drop as compared against its SQuAD 2.0 performance.
Co-authors
- Zilong Zheng 4
- Zixia Jia 3
- Guilin Qi 3
- Yongrui Chen 2
- Miaozeng Du 2
- Xiyang Hu 2
- Yuangang Li 2
- Ming Liu 2
- Yi Nian 2
- Bing Qin (秦兵) 2
- Xiaobo Wang 2
- Baoxin Wang 2
- Shijin Wang 2
- Dayong Wu 2
- Zhuo Xiao 2
- Tiankai Yang 2
- Chuanyi Zhang 2
- Yue Zhao 2
- Jun Bai 1
- Zhijun Chang 1
- Siyuan Cheng 1
- Kaize Ding 1
- Xinyi Dong 1
- Yubo Feng 1
- Xia Hu 1
- Guoping Hu 1
- Nan Hu 1
- Jiayi Huang 1
- Haiyun Jiang 1
- Min-Yen Kan 1
- Wenqiang Lei 1
- Lishuang Li 1
- Li Li 1
- Qi Liu 1
- Runxuan Liu 1
- Ting Liu 1
- Yang Liu 1
- Bei Luo 1
- Dehai Min 1
- Li Qian 1
- Xueyang Qin 1
- Xubo Qin 1
- Ryan A. Rossi 1
- Bozhong Tian 1
- Mengmeng Wang 1
- Keyu Wang 1
- Zekun Wang 1
- Bo Wang 1
- Quansen Wang 1
- Yi Xiang 1
- Ruiyao Xu 1
- Siyuan Xue 1
- Yanjie Yang 1
- Zhizhuo Yang 1
- Songlin Zhai 1
- Muhan Zhang 1
- Rui Zhang 1
- Zhixiong Zhang 1
- Honghong Zhao 1
- Zihao Zheng 1
- Song-Chun Zhu 1