Jiale Liu
2025
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
Chelsi Jain | Yiran Wu | Yifan Zeng | Jiale Liu | Shengyu Dai | Zhenwen Shao | Qingyun Wu | Huazheng Wang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Document Visual Question Answering (DocVQA) is a practical yet challenging task that requires answering questions about documents while referring to multiple pages and different modalities of information, e.g., images and tables. To handle multi-modality, recent methods follow a similar Retrieval-Augmented Generation (RAG) pipeline, but utilize Visual Language Model (VLM)-based embedding models to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept images as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval-augmented framework for DocVQA. It boosts evidence-page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average across 4 DocVQA datasets while retrieving far fewer pages. Our code is available at https://github.com/ag2ai/SimpleDoc.
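As a rough illustration of the pipeline the abstract describes, the Python sketch below mirrors the dual-cue retriever (embedding-similarity shortlist, then summary-based filtering and re-ranking) and the reasoner loop that pulls fresh pages into a working memory. Every name here (dual_cue_retrieve, answer_with_memory, summary_relevance, reason) is a hypothetical stand-in, not the actual SimpleDoc API; see the linked repository for the real implementation.

```python
from typing import Callable

def dual_cue_retrieve(
    question_emb: list[float],
    page_embs: list[list[float]],
    page_summaries: list[str],
    summary_relevance: Callable[[str], float],
    k: int = 10,
    top_n: int = 3,
) -> list[int]:
    """Cue 1: shortlist pages by embedding similarity.
    Cue 2: filter and re-rank the shortlist by page-summary relevance."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    # Cue 1: top-k candidate pages by embedding similarity to the question.
    shortlist = sorted(
        range(len(page_embs)),
        key=lambda i: dot(question_emb, page_embs[i]),
        reverse=True,
    )[:k]
    # Cue 2: re-rank candidates by how relevant their summaries look.
    reranked = sorted(
        shortlist,
        key=lambda i: summary_relevance(page_summaries[i]),
        reverse=True,
    )
    return reranked[:top_n]

def answer_with_memory(question, retrieve, reason, max_rounds: int = 3):
    """Reasoner loop: keep adding fresh retrieved pages to a working
    memory until the VLM reports a confident answer (or rounds run out)."""
    memory: list[int] = []
    answer = None
    for _ in range(max_rounds):
        memory += [p for p in retrieve(question) if p not in memory]
        answer, confident = reason(question, memory)  # one VLM call over page images
        if confident:
            break
    return answer
```

The two cues are complementary by design: embeddings are cheap but coarse, while page summaries let a language model judge relevance in text before any expensive VLM reasoning happens.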
Divide, Optimize, Merge: Scalable Fine-Grained Generative Optimization for LLM Agents
Jiale Liu | Yifan Zeng | Shaokun Zhang | Chi Zhang | Malte Højmark-Bertelsen | Marie Normann Gadeberg | Huazheng Wang | Qingyun Wu
Findings of the Association for Computational Linguistics: EMNLP 2025
LLM-based optimization has shown remarkable potential in improving agentic systems. However, the conventional approach of prompting an LLM-based generative optimizer with trajectories from the whole training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-grained Generative Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across the ALFWorld, LogisticsQA, and GAIA benchmarks demonstrates that FGO outperforms the conventional approach by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based generative optimization of increasingly sophisticated agentic systems. Further analysis demonstrates that FGO achieves the most consistent performance gain across all training dataset sizes, showcasing its scalability and efficiency.
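The sketch below illustrates the divide / optimize / merge structure the abstract outlines. It is a minimal sketch under stated assumptions: optimize, merge, and subset_size are hypothetical placeholders for FGO's LLM-driven subset optimization and progressive merging steps, not the paper's actual interface.

```python
def fgo(trajectories, optimize, merge, subset_size: int = 32):
    """Divide trajectories into subsets, optimize each, then progressively
    merge the optimized components pairwise into a single result."""
    # Divide: split the training trajectories into manageable subsets,
    # so each optimizer call fits comfortably in the context window.
    subsets = [trajectories[i:i + subset_size]
               for i in range(0, len(trajectories), subset_size)]
    # Optimize: run the generative optimizer on each subset independently.
    components = [optimize(subset) for subset in subsets]
    if not components:
        raise ValueError("no trajectories to optimize")
    # Merge: progressively combine pairs until one component remains.
    while len(components) > 1:
        merged = [merge(components[i], components[i + 1])
                  for i in range(0, len(components) - 1, 2)]
        if len(components) % 2 == 1:
            merged.append(components[-1])  # carry the odd one forward
        components = merged
    return components[0]
```

Pairwise progressive merging keeps every optimizer and merge prompt small, which is consistent with the reported reduction in average prompt token consumption.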