L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan, Arman Cohan


Abstract
Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs. Despite promising results, there is a notable lack of a comprehensive evaluation of these models’ language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning, and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition, we assess confidence calibration, and conduct human evaluations to identify typical failures across different tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We release the evaluation framework1 and all model outputs, hoping to lay the groundwork for further future research. All future evaluations (e.g., LLaMA-3, StarCoder2, etc) will be updated on the project website: https://l2c-eval.github.io/.
Anthology ID:
2024.tacl-1.72
Volume:
Transactions of the Association for Computational Linguistics, Volume 12
Month:
Year:
2024
Address:
Cambridge, MA
Venue:
TACL
SIG:
Publisher:
MIT Press
Note:
Pages:
1311–1329
Language:
URL:
https://aclanthology.org/2024.tacl-1.72
DOI:
10.1162/tacl_a_00705
Bibkey:
Cite (ACL):
Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan, and Arman Cohan. 2024. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models. Transactions of the Association for Computational Linguistics, 12:1311–1329.
Cite (Informal):
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models (Ni et al., TACL 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.tacl-1.72.pdf