Zehan Li

2025

Large Language Models (LLMs) have shown impressive capabilities in language understanding and generation, leading to growing interest in zero-shot relation triplet extraction (ZeroRTE), a task that aims to extract triplets for unseen relations without annotated data. However, existing methods typically depend on costly fine-tuning and lack the structured semantic guidance required for accurate and interpretable extraction. To overcome these limitations, we propose FrameRTE, a novel ZeroRTE framework that adopts a “frame first, then extract” paradigm. Rather than extracting triplets directly, FrameRTE first constructs high-quality Relation Semantic Frames (RSFs) through a unified pipeline that integrates frame retrieval, synthesis, and enhancement. These RSFs serve as structured and interpretable knowledge scaffolds that guide frozen LLMs in the extraction process. Building upon these RSFs, we further introduce a human-inspired three-stage reasoning pipeline consisting of semantic frame evocation, frame-guided triplet extraction, and core frame elements validation to achieve semantically constrained extraction. Experiments demonstrate that FrameRTE achieves competitive zero-shot performance on multiple benchmarks. Moreover, the RSFs we construct serve as high-quality semantic resources that can enhance other extraction methods, showcasing the synergy between linguistic knowledge and foundation models.

Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts
Zehan Li | Ruhua Pan | Xinyu Pi
Proceedings of the 7th Workshop on Narrative Understanding

Re-Cent: A Relation-Centric Framework for Joint Zero-Shot Relation Triplet Extraction
Zehan Li | Fu Zhang | Kailun Lyu | Jingwei Cheng | Tianyue Peng
Proceedings of the 31st International Conference on Computational Linguistics

Zero-shot Relation Triplet Extraction (ZSRTE) aims to extract triplets from the context where the relation patterns are unseen during training. Due to the inherent challenges of the ZSRTE task, existing extractive ZSRTE methods often decompose it into named entity recognition and relation classification, which overlooks the interdependence of the two tasks and may introduce error propagation. Motivated by the intuition that crucial entity attributes might be implicit in the relation labels, we propose a Relation-Centric joint ZSRTE method named Re-Cent. This approach uses minimal information, specifically unseen relation labels, to extract triplets in one go through a unified model. We develop two span-based extractors to identify the subjects and objects corresponding to relation labels, forming span-pairs. Additionally, we introduce a relation-based correction mechanism that further refines the triplets by calculating the relevance between span-pairs and relation labels. Experiments demonstrate that Re-Cent achieves state-of-the-art performance with fewer parameters and does not rely on synthetic data or manual labor.

Generation-Augmented Retrieval: Rethinking the Role of Large Language Models in Zero-Shot Relation Extraction
Zehan Li | Fu Zhang | Tianyue Peng | He Liu | Jingwei Cheng
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in Relation Extraction (RE) emphasize Zero-Shot methodologies, aiming to recognize unseen relations between entities with no annotated data. Although Large Language Models (LLMs) have demonstrated outstanding performance in many NLP tasks, their performance in Zero-Shot RE (ZSRE) without entity type constraints still lags behind Small Language Models (SLMs). LLM-based ZSRE often involves manual interventions and significant computational overhead, especially when scaling to large-scale multi-choice data. To this end, we introduce RE-GAR-AD, which not only leverages the generative capability of LLMs but also utilizes their representational power without tuning LLMs. We redefine LLM-based ZSRE as a retrieval challenge, utilizing a Generation-Augmented Retrieval framework coupled with a retrieval Adjuster. Specifically, our approach guides LLMs through crafted prompts to distill sentence semantics and enrich relation labels. We encode sentences and relation labels using LLMs and match their embeddings in a triplet fashion. This retrieval technique significantly reduces token input requirements. Additionally, to further optimize embeddings, we propose a plug-in retrieval adjuster with only 2M parameters, which allows rapid fine-tuning without accessing LLMs’ parameters. Our LLM-based model demonstrates comparable performance on multiple benchmarks.

In the large language model (LLM) revolution, embedding is a key component of various systems, such as retrieving knowledge or memories for LLMs or building content moderation filters. As such cases span from English to other natural or programming languages, from retrieval to classification and beyond, it is advantageous to build a unified embedding model rather than dedicated ones for each scenario. In this context, the pre-trained multilingual decoder-only large language models, e.g., BLOOM, emerge as a viable backbone option. To assess their potential, we propose straightforward strategies for constructing embedders and introduce a universal evaluation benchmark. Experimental results show that our trained model is proficient at generating good embeddings across languages and tasks, even extending to languages and tasks for which no finetuning/pretraining data is available. We also present detailed analyses and additional evaluations. We hope that this work could encourage the development of more robust open-source universal embedders.

CE-DA: Custom Embedding and Dynamic Aggregation for Zero-Shot Relation Extraction
Fu Zhang | He Liu | Zehan Li | Jingwei Cheng
Proceedings of the 31st International Conference on Computational Linguistics

Zero-shot Relation Extraction (ZSRE) aims to predict novel relations from sentences with given entity pairs, where the relations have not been encountered during training. Prototype-based methods, which achieve ZSRE by aligning the sentence representation and the relation prototype representation, have shown great potential. However, most existing works focus solely on improving the quality of prototype representations, neglecting sentence representations and lacking interaction between different types of relation side information. In this paper, we propose a novel ZSRE framework named CE-DA, which includes two modules: Custom Embedding and Dynamic Aggregation. We employ a two-stage approach to obtain customized embeddings of sentences. In the first stage, we train a sentence encoder through unsupervised contrastive learning, and in the second stage, we highlight the potential relations between entities in sentences using carefully designed entity emphasis prompts to further enhance sentence representations. Additionally, our dynamic aggregation method assigns different weights to different types of relation side information through a learnable network to enhance the quality of relation prototype representations. In contrast to traditional methods that treat the importance of all side information equally, our dynamic aggregation method further strengthens the interaction between different types of relation side information. Our method demonstrates competitive performance across various metrics on two ZSRE datasets.
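
As a rough illustration of the weighting idea in this abstract (not the CE-DA implementation), the sketch below turns one score per side-information embedding into softmax weights and returns their weighted sum as a relation prototype. The `scores` list is a hand-written stand-in for the output of the learnable network, and the names `softmax`, `aggregate`, `side_embs`, and `scores` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(side_embs, scores):
    """Weighted sum of side-information embeddings using softmax weights."""
    weights = softmax(scores)
    dim = len(side_embs[0])
    return [sum(w * emb[i] for w, emb in zip(weights, side_embs))
            for i in range(dim)]

# Hypothetical toy embeddings for three kinds of side information
# about one relation (for example label, description, alias).
side_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: in a trained model these scores would come from a
# learnable network; here they are fixed by hand.
scores = [2.0, 1.0, 0.0]
prototype = aggregate(side_embs, scores)
```

With equal scores the function reduces to a plain average, which corresponds to the equal-importance baseline that the abstract contrasts against.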

2024

ProCQA: A Large-scale Community-based Programming Question Answering Dataset for Code Search
Zehan Li | Jianfei Zhang | Chuantao Yin | Yuanxin Ouyang | Wenge Rong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Retrieval-based code question answering seeks to match user queries in natural language to relevant code snippets. Previous approaches typically rely on pretraining models using crafted bi-modal and uni-modal datasets to align text and code representations. In this paper, we introduce ProCQA, a large-scale programming question answering dataset extracted from the StackOverflow community, offering naturally structured mixed-modal QA pairs. To validate its effectiveness, we propose a modality-agnostic contrastive pre-training approach to improve the alignment of text and code representations of current code language models. Compared to previous models that primarily employ bimodal and unimodal pairs extracted from CodeSearchNet for pre-training, our model exhibits significant performance improvements across a wide range of code retrieval benchmarks.
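
The abstract does not spell out the contrastive objective, so the following is only a generic sketch of an InfoNCE-style loss with in-batch negatives, a common choice for this kind of pre-training. The function names, the temperature value, and the toy vectors are all assumptions; each question and answer is treated as a single embedding regardless of whether it contains text, code, or both.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce(questions, answers, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    answers[i] is the positive for questions[i]; every other answer
    in the batch serves as a negative. The temperature is an assumed
    hyperparameter, not a value taken from the paper.
    """
    losses = []
    for i, q in enumerate(questions):
        logits = [dot(q, a) / temperature for a in answers]
        peak = max(logits)
        # log-sum-exp computed stably by subtracting the largest logit
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy batch: matched pairs should give a lower loss than mismatched ones.
questions = [[1.0, 0.0], [0.0, 1.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(questions, answers)
shuffled_loss = info_nce(questions, list(reversed(answers)))
```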

AlignRE: An Encoding and Semantic Alignment Approach for Zero-Shot Relation Extraction
Zehan Li | Fu Zhang | Jingwei Cheng
Findings of the Association for Computational Linguistics: ACL 2024

Zero-shot Relation Extraction (ZSRE) aims to predict unseen relations between entity pairs from input sentences. Existing prototype-based ZSRE methods encode relation descriptions into prototype embeddings and predict by measuring the similarity between sentence embeddings and prototype embeddings. However, these methods often overlook abundant side information of relations and suffer from a significant encoding gap between prototypes and sentences, limiting performance. To this end, we propose a framework named AlignRE, based on two Alignment methods for ZSRE. Specifically, we present a novel perspective centered on encoding schema alignment to enhance prototype-based ZSRE methods. We utilize well-designed prompt-tuning to bridge the encoding gap. To improve prototype quality, we explore and leverage multiple types of side information and propose a prototype aggregation method based on semantic alignment to create comprehensive relation prototype representations. We conduct experiments on FewRel and Wiki-ZSL datasets and consistently outperform state-of-the-art methods. Moreover, our method is substantially faster and reduces the need for extensive manual labor in prototype construction. Code is available at https://github.com/lizehan1999/AlignRE.
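
The prototype-based prediction step described in this abstract can be sketched as a nearest-prototype lookup. This is a minimal illustration, assuming cosine similarity as the similarity measure and hand-written toy vectors in place of encoder outputs; it is not the AlignRE code, and the names `cosine`, `predict_relation`, and the relation labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predict_relation(sentence_emb, prototypes):
    """Return the label whose prototype is most similar to the sentence."""
    return max(prototypes,
               key=lambda label: cosine(sentence_emb, prototypes[label]))

# Toy 3-dimensional embeddings; real ones would come from an encoder.
prototypes = {
    "place_of_birth": [0.9, 0.1, 0.0],
    "employer": [0.1, 0.9, 0.2],
}
sentence_emb = [0.8, 0.2, 0.1]
predicted = predict_relation(sentence_emb, prototypes)
```

Because unseen relations only need a prototype embedding, adding a new relation at inference time amounts to adding one more entry to `prototypes`.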

2023

Text Representation Distillation via Information Bottleneck Principle
Yanzhao Zhang | Dingkun Long | Zehan Li | Pengjun Xie
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (PLMs) have recently shown great success in the text representation field. However, the high computational cost and high-dimensional representation of PLMs pose significant challenges for practical applications. To make models more accessible, an effective method is to distill large models into smaller representation models. To mitigate performance degradation after distillation, we propose a novel Knowledge Distillation method called IBKD. This approach is motivated by the Information Bottleneck principle and aims to maximize the mutual information between the final representations of the teacher and student models, while simultaneously reducing the mutual information between the student model’s representation and the input data. This enables the student model to preserve important learned information while avoiding unnecessary information, thus reducing the risk of over-fitting. Empirical studies on two main downstream applications of text representation (Semantic Textual Similarity and Dense Retrieval tasks) demonstrate the effectiveness of our proposed approach.
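
Read literally, the two goals stated in this abstract can be combined into a single objective. The form below is one plausible way to write it, not the paper's exact formulation: $X$ is the input, $Z_t$ and $Z_s$ are the final teacher and student representations, $\theta_s$ are the student parameters, and $\beta > 0$ is an assumed trade-off weight that the abstract does not mention.

```latex
\max_{\theta_s} \; I(Z_s; Z_t) \;-\; \beta \, I(Z_s; X)
```

The first term rewards the student for retaining what the teacher encodes, while the second penalizes information about the input that the student keeps beyond that.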