2025
pdf
bib
abs
The Ranking Blind Spot: Decision Hijacking in LLM-based Text Ranking
Yaoyao Qian
|
Yifan Zeng
|
Yuchao Jiang
|
Chelsi Jain
|
Huazheng Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the “Ranking Blind Spot”—a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: **Decision Objective Hijacking**, which alters the evaluation goal in pairwise ranking systems, and **Decision Criteria Hijacking**, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to real-world examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks.
pdf
bib
abs
SimpleDoc: Multi‐Modal Document Understanding with Dual‐Cue Page Retrieval and Iterative Refinement
Chelsi Jain
|
Yiran Wu
|
Yifan Zeng
|
Jiale Liu
|
Shengyu Dai
|
Zhenwen Shao
|
Qingyun Wu
|
Huazheng Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Document Visual Question Answering (DocVQA) is a practical yet challenging task, which is to ask questions based on documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize Visual Language Models (VLMs) based embedding model to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval - augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with much fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
pdf
bib
abs
Divide, Optimize, Merge: Scalable Fine-Grained Generative Optimization for LLM Agents
Jiale Liu
|
Yifan Zeng
|
Shaokun Zhang
|
Chi Zhang
|
Malte Højmark-Bertelsen
|
Marie Normann Gadeberg
|
Huazheng Wang
|
Qingyun Wu
Findings of the Association for Computational Linguistics: EMNLP 2025
LLM-based optimization has shown remarkable potential in improving agentic systems. However, the conventional approach of prompting LLM-based generative optimizer with the trajectories on the whole training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-grained Generative Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging.Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrates that FGO outperforms conventional approach by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based generative optimization of increasingly sophisticated agentic systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.
pdf
bib
abs
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N Sampling
Jiahao Qiu
|
Yifu Lu
|
Yifan Zeng
|
Jiacheng Guo
|
Jiayi Geng
|
Chenhao Zhu
|
Xinzhe Juan
|
Ling Yang
|
Huazheng Wang
|
Kaixuan Huang
|
Yue Wu
|
Mengdi Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, UltraFeedback, GSM8K, HH-RLHF, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves a 65% win rate at maximum lengths of 192 and 384 tokens, outperforming standard BoN with the same computational cost. Furthermore, TreeBoN achieves around a 60% win rate across longer responses, showcasing its scalability and alignment efficacy.