CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Hongchao Jiang; Yiming Chen; Yushi Cao; Hung-yi Lee; Robby T. Tan

doi:10.18653/v1/2026.acl-long.888

Abstract

Large Language Models (LLMs) are increasingly used not only to generate code, but also to judge it: comparing, ranking, or scoring competing solutions. However, their reliability in this evaluative role remains poorly understood. Inconsistent or flawed judgments can undermine benchmarks and distort training signals. This paper investigates the performance and robustness of LLMs when used as code judges. We introduce CodeJudgeBench, a benchmark explicitly designed to evaluate LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. We comprehensively benchmark the performance of 26 LLM-as-a-Judge models, encompassing general-purpose, code-tuned, and reasoning models. Our empirical findings reveal that relatively small reasoning models (e.g., Qwen3-8B) can outperform much larger non-reasoning models of up to 70B parameters. We further stress-test robustness by applying both general and code-specific perturbations. All models show significant instability and are sensitive to changes such as response ordering, variable naming, and misleading comments. These findings highlight serious concerns about the consistency and robustness of LLM-based judges for coding tasks.

Anthology ID: 2026.acl-long.888
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 19416–19448
URL: https://aclanthology.org/2026.acl-long.888/
DOI: 10.18653/v1/2026.acl-long.888
Cite (ACL): Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, and Robby T. Tan. 2026. CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19416–19448, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks (Jiang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.888.pdf
Checklist: 2026.acl-long.888.checklist.pdf
