Xueqiang Xu
2026
Zero-Shot Open-Schema Entity Structure Discovery
Xueqiang Xu | Jinfeng Xiao | James Barry | Mohab Elkaref | Jiaru Zou | Pengcheng Jiang | Yunyi Zhang | Maxwell J Giammona | Geeth De Mel | Jiawei Han
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Entity structure extraction, which aims to extract entities and their associated attribute–value structures from text, is an essential task for text understanding and knowledge graph construction. Existing methods based on large language models (LLMs) typically rely heavily on predefined entity attribute schemas or annotated datasets, often leading to incomplete extraction results. To address these challenges, we introduce ZOES, a novel approach to entity structure extraction that does not require any schema or annotated samples. ZOES operates via a principled mechanism of enrichment, refinement, and unification, based on the insight that an entity and its associated structure are mutually reinforcing. Experiments demonstrate that ZOES consistently enhances LLMs’ ability to extract more complete entity structures across three different domains, showcasing both the effectiveness and generalizability of the method. These findings suggest that such an enrichment, refinement, and unification mechanism may serve as a principled approach to improving the quality of LLM-based entity structure discovery in various scenarios.
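The enrichment, refinement, and unification mechanism described above can be pictured as an iterative loop around a zero-shot extractor. The sketch below is a speculative illustration, not ZOES's actual implementation: the `llm` callable, the prompts, and the round structure are all assumptions.

```python
def discover_entity_structures(llm, text, max_rounds=2):
    """Hypothetical sketch of an enrich -> refine -> unify loop in the
    spirit of the abstract above. The `llm` callable and all prompts
    are illustrative assumptions, not the paper's method."""
    # Initial zero-shot pass: no predefined schema, no annotated samples.
    structures = llm(f"Extract entities with attribute-value pairs:\n{text}")
    for _ in range(max_rounds):
        # Enrichment: use discovered entities to recover missed attributes,
        # exploiting the mutual reinforcement between entity and structure.
        structures = llm(f"Add attributes missed for these entities:\n{structures}\n---\n{text}")
        # Refinement: drop attribute values the source text does not support.
        structures = llm(f"Remove unsupported attribute values:\n{structures}\n---\n{text}")
    # Unification: merge duplicate entities and align attribute names.
    return llm(f"Merge duplicates and unify attribute names:\n{structures}")
```

Any LLM wrapper that maps a prompt string to a completion string can be dropped in as `llm`.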
2025
LogiCoL: Logically-Informed Contrastive Learning for Set-based Dense Retrieval
Yanzhen Shen | Sihao Chen | Xueqiang Xu | Yunyi Zhang | Chaitanya Malaviya | Dan Roth
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
While significant progress has been made with dual-encoder dense retrievers, they often struggle on queries with logical connectives, a use case that is overlooked yet important in downstream applications: the retrieved results frequently fail to respect the logical constraints implied by the query. To address this challenge, we introduce LogiCoL, a logically-informed contrastive learning objective for dense retrievers. LogiCoL builds upon in-batch supervised contrastive learning and trains dense retrievers to respect subset and mutually exclusive relations between query result sets via two sets of soft constraints expressed with t-norms in the learning objective. We evaluate the effectiveness of LogiCoL on entity retrieval, where the model is expected to retrieve the set of Wikipedia entities that satisfy the implicit logical constraints in the query. We show that models trained with LogiCoL improve both retrieval performance and the logical consistency of their results. We provide detailed analysis and insights to uncover why queries with logical connectives are challenging for dense retrievers and why LogiCoL is effective.
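The t-norm soft constraints mentioned in the abstract can be made concrete with a small sketch. The following is an illustrative formulation using the product t-norm (truth of "x AND y" modeled as `x * y`), assuming the model outputs relevance probabilities in [0, 1]; the function name and exact penalty form are assumptions, not LogiCoL's published objective.

```python
def tnorm_constraint_losses(p_parent, p_child, p_a, p_b):
    """Hypothetical sketch of logic-informed soft constraints under the
    product t-norm. Inputs are lists of predicted relevance probabilities
    for one document under different (sub)queries; all names are
    illustrative, not the paper's exact loss."""
    n = len(p_parent)
    # Subset constraint: relevance to a sub-query should imply relevance
    # to its superset query (child => parent), so penalize the t-norm
    # truth of the violation: p_child AND NOT p_parent.
    subset_loss = sum(pc * (1.0 - pp) for pp, pc in zip(p_parent, p_child)) / n
    # Mutual exclusion: a document should not be relevant to two mutually
    # exclusive queries at once, so penalize p_a AND p_b.
    exclusion_loss = sum(x * y for x, y in zip(p_a, p_b)) / n
    return subset_loss, exclusion_loss
```

Because both penalties are differentiable in the probabilities, they can be added as soft terms alongside an in-batch contrastive loss.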
s3: You Don’t Need That Much Data to Train a Search Agent via RL
Pengcheng Jiang | Xueqiang Xu | Jiacheng Lin | Jinfeng Xiao | Zifeng Wang | Jimeng Sun | Jiawei Han
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Retrieval-augmented generation (RAG) systems empower large language models (LLMs) to access external knowledge during inference. Recent advances have enabled LLMs to act as search agents via reinforcement learning (RL), improving information acquisition through multi-turn interactions with retrieval engines. However, existing approaches either optimize retrieval using search-only metrics (e.g., NDCG) that ignore downstream utility, or fine-tune the entire LLM to jointly reason and retrieve, entangling retrieval with generation and limiting both real search utility and compatibility with frozen or proprietary models. In this work, we propose **s3**, a lightweight, model-agnostic framework that decouples the searcher from the generator and trains the searcher using a Gain Beyond RAG reward: the improvement in generation accuracy over naïve RAG. **s3** requires only 2.4k training samples to outperform baselines trained on over 70× more data, consistently delivering stronger downstream performance across six general QA and five medical QA benchmarks.
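The Gain Beyond RAG reward described above compares the frozen generator's accuracy with the searcher's retrieved context against a naïve RAG baseline. The sketch below is a minimal illustration of that idea; `generate_fn`, `score_fn`, and the function name are hypothetical placeholders, not s3's actual API.

```python
def gain_beyond_rag_reward(generate_fn, score_fn, question, gold,
                           searcher_docs, naive_rag_docs):
    """Hypothetical sketch of a Gain Beyond RAG (GBR) style reward.

    generate_fn(question, docs) -> answer from a frozen generator LLM;
    score_fn(answer, gold) -> accuracy-style score in [0, 1].
    The searcher is rewarded only for improvement over naive RAG, so the
    generator itself is never fine-tuned."""
    score_searcher = score_fn(generate_fn(question, searcher_docs), gold)
    score_naive = score_fn(generate_fn(question, naive_rag_docs), gold)
    return score_searcher - score_naive
```

Because the reward depends only on the generator's outputs, any frozen or proprietary model can serve as `generate_fn`, which is what makes the framework model-agnostic.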