Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation

Mann Bajpai; Pulkit Chatwal; Priyanshu Deswal; Harish Pratap Singh; Santosh Kumar Mishra

Abstract

Reliable automatic evaluation of retrieval-grounded long-form reports typically requires human annotation or frontier-scale proprietary LLMs, both of which are expensive in resource-constrained settings. Team rgipt participated in RAG4Reports@ACL 2026 Task 1 with a zero-shot nugget-verification system that runs entirely on a single NVIDIA T4 GPU. We compare three ultra-lightweight decoder-only models (Qwen2-0.5B, Qwen2-1.5B, and Qwen2.5-0.5B) under identical inference conditions to examine how small an LLM judge can be while retaining a human-aligned ranking signal. Both Qwen2 models produced a negative 𝜏_gap, whereas Qwen2.5-0.5B achieved 𝜏_gap = 0.0772 and Pearson r = 0.2209, ranking 13th of 21 teams. Within this model family and evaluation setting, model generation appears to matter more than parameter count, although this finding rests on three configurations on a single task and warrants further validation.
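The abstract reports agreement with human judgments through 𝜏_gap and Pearson r but does not define 𝜏_gap. As a rough illustration only, the sketch below computes Pearson r and a plain Kendall tau-a between a judge's per-report scores and human scores. The function names and the score lists are hypothetical, and plain tau-a is an assumption standing in for the task's own metric, which may be defined differently.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant.
    This is NOT claimed to be the task's tau_gap definition.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-report scores, invented purely for illustration.
judge_scores = [0.2, 0.5, 0.4, 0.9]
human_scores = [0.1, 0.6, 0.3, 0.8]

print("Pearson r:", pearson_r(judge_scores, human_scores))
print("Kendall tau-a:", kendall_tau_a(judge_scores, human_scores))
```

A negative rank correlation, as reported for the two Qwen2 models, would mean the judge tends to order reports opposite to the human ordering; in this sketch that corresponds to more discordant than concordant pairs.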

Anthology ID: 2026.rag4reports-1.13
Volume: Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
Month: July
Year: 2026
Address: San Diego, CA, USA
Editors: Eugene Yang, Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Andrew Yates
Venues: RAG4Reports | WS
Publisher: Association for Computational Linguistics
Pages: 94–98
URL: https://aclanthology.org/2026.rag4reports-1.13/
Cite (ACL): Mann Bajpai, Pulkit Chatwal, Priyanshu Deswal, Harish Pratap Singh, and Santosh Kumar Mishra. 2026. Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation. In Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026), pages 94–98, San Diego, CA, USA. Association for Computational Linguistics.
Cite (Informal): Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation (Bajpai et al., RAG4Reports 2026)
PDF: https://aclanthology.org/2026.rag4reports-1.13.pdf
