CodeJudge: Evaluating Code Generation with Large Language Models

Weixi Tong, Tianyi Zhang
Abstract
Large Language Models (LLMs) have shown promising performance in code generation. However, how to reliably evaluate code generated by LLMs remains an unresolved problem. This paper presents CodeJudge, a code evaluation framework that leverages LLMs to evaluate the semantic correctness of generated code without the need for test cases. We investigate different ways to guide the LLM in performing “slow thinking” to arrive at an in-depth and reliable evaluation. We experimented with four LLMs as evaluators on four code generation datasets and five programming languages. The results show that CodeJudge significantly outperformed existing methods in most settings. Furthermore, compared with a state-of-the-art GPT-3.5-based code evaluation method, CodeJudge achieved better results even when using a much smaller model, Llama-3-8B-Instruct. Our code and datasets are available on GitHub: https://github.com/VichyTong/CodeJudge.
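To make the abstract's idea concrete, here is a minimal sketch of test-free "LLM-as-judge" code evaluation: prompt an evaluator model to reason step by step ("slow thinking") before committing to a binary correctness verdict, then parse that verdict. The prompt wording, the `Verdict:` output convention, and the function names are illustrative assumptions, not the paper's actual templates or API.

```python
# Illustrative sketch of test-free LLM-based code evaluation.
# The prompt format and verdict convention are hypothetical; sending
# the prompt to an actual chat model is left to the caller.

def build_judge_prompt(task: str, code: str) -> str:
    """Ask the evaluator LLM to analyze the code step by step
    before giving a binary correctness verdict."""
    return (
        "You are a careful code reviewer.\n"
        f"Task description:\n{task}\n\n"
        f"Candidate solution:\n{code}\n\n"
        "Analyze the code step by step, list any semantic errors, "
        "then end with a final line 'Verdict: correct' or "
        "'Verdict: incorrect'."
    )

def parse_verdict(response: str) -> bool:
    """Extract the binary judgment from the evaluator's response,
    scanning from the last line backward."""
    for line in reversed(response.strip().splitlines()):
        line = line.strip().lower()
        if line.startswith("verdict:"):
            return "incorrect" not in line
    raise ValueError("no verdict found in LLM response")
```

In use, `build_judge_prompt` output would be sent to the evaluator model and `parse_verdict` applied to its reply; aggregating verdicts over a dataset yields an execution-free accuracy estimate.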
Anthology ID:
2024.emnlp-main.1118
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20032–20051
URL:
https://aclanthology.org/2024.emnlp-main.1118
Cite (ACL):
Weixi Tong and Tianyi Zhang. 2024. CodeJudge: Evaluating Code Generation with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20032–20051, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
CodeJudge: Evaluating Code Generation with Large Language Models (Tong & Zhang, EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1118.pdf
Software:
 2024.emnlp-main.1118.software.zip
Data:
 2024.emnlp-main.1118.data.zip