Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation

Dmitriy Fedrushkov; Yulong He; Ivan Smirnov; Artem Aliev; Sergey Kovalchuk

Abstract

Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P𝜃(code ∣ task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
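As an illustration of the log-probability signal described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes per-token log-probabilities for the code tokens, conditioned on the task prompt, have already been obtained from some language model. The function names `sequence_logprob` and `pairwise_separation`, the optional length normalization, and the numbers in the usage lines are all hypothetical and do not come from the paper.

```python
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float], normalize: bool = False) -> float:
    """Return log P(code | task) as the sum of per-token log-probabilities.

    `token_logprobs` holds log P(token_i | task, token_<i) for each code token.
    Length normalization is an assumption of this sketch, not a detail from the paper.
    """
    if not token_logprobs:
        raise ValueError("need at least one code token")
    total = sum(token_logprobs)
    return total / len(token_logprobs) if normalize else total


def pairwise_separation(canonical: Sequence[float], mutated: Sequence[float]) -> float:
    """Fraction of (canonical, mutated) pairs where the canonical solution scores higher."""
    pairs = list(zip(canonical, mutated))
    if not pairs:
        raise ValueError("need at least one pair")
    wins = sum(1 for c, m in pairs if c > m)
    return wins / len(pairs)


# Illustrative, made-up token log-probabilities for one task.
canonical_score = sequence_logprob([-0.2, -0.1, -0.4])
mutated_score = sequence_logprob([-0.2, -1.9, -0.4])
print(pairwise_separation([canonical_score], [mutated_score]))
```

A pairwise comparison of this kind only asks whether the canonical solution outscores its mutated counterpart for the same task, which is a different question from ranking arbitrary generated solutions by correctness.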

Anthology ID: 2026.gem-main.55
Volume: Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 574–581
URL: https://aclanthology.org/2026.gem-main.55/
Cite (ACL): Dmitriy Fedrushkov, Yulong He, Ivan Smirnov, Artem Aliev, and Sergey Kovalchuk. 2026. Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 574–581, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation (Fedrushkov et al., GEM 2026)
PDF: https://aclanthology.org/2026.gem-main.55.pdf
