One-to-many testing for code generation from (just) natural language

Mansi Uniyal, Mukul Singh, Gust Verbruggen, Sumit Gulwani, Vu Le


Abstract
MBPP is a popular dataset for evaluating code generation from natural language. Despite its popularity, it has three problems: (1) it relies on providing test cases to generate the right signature, (2) its instructions are poorly aligned with its evaluation test cases, and (3) it suffers from contamination, as its exact phrasing is present in training datasets. We adapt MBPP to emphasize generating code from just natural language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets of assertions to account for ambiguity in the syntax. We compare popular open- and closed-weight models on the original (MBPP) and adapted (MBUPP) datasets.
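To make the idea of evaluating on multiple sets of assertions concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a generated solution counts as correct when it satisfies every assertion in at least one set, where each set encodes one plausible signature. The function names, the `add_pair` task, and the assertion sets are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def passes_assertion_set(code: str, assertions: List[str]) -> bool:
    """Return True if `code` runs and every assertion in the set holds."""
    namespace: Dict[str, object] = {}
    try:
        exec(code, namespace)  # define the generated function
        for assertion in assertions:
            exec(assertion, namespace)  # raises on a failed or invalid call
    except Exception:
        return False
    return True


def passes_one_to_many(code: str, assertion_sets: List[List[str]]) -> bool:
    """Assumed acceptance rule: pass if any one assertion set is fully satisfied."""
    return any(passes_assertion_set(code, s) for s in assertion_sets)


# Hypothetical task: "add the two numbers of a pair".
# The description does not say whether the pair is one tuple or two arguments,
# so each assertion set below tests one of the two plausible signatures.
generated = "def add_pair(a, b):\n    return a + b\n"
assertion_sets = [
    ["assert add_pair((1, 2)) == 3"],  # signature variant: single tuple argument
    ["assert add_pair(1, 2) == 3"],    # signature variant: two positional arguments
]

print(passes_one_to_many(generated, assertion_sets))
```

Under this rule the generated code is accepted through the second assertion set even though it fails the first, so a reasonable choice of signature is not penalized. Running untrusted generated code with `exec` is only suitable inside a sandbox.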
Anthology ID:
2024.findings-emnlp.902
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
15397–15402
URL:
https://aclanthology.org/2024.findings-emnlp.902
DOI:
10.18653/v1/2024.findings-emnlp.902
Cite (ACL):
Mansi Uniyal, Mukul Singh, Gust Verbruggen, Sumit Gulwani, and Vu Le. 2024. One-to-many testing for code generation from (just) natural language. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15397–15402, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
One-to-many testing for code generation from (just) natural language (Uniyal et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.902.pdf
Software:
 2024.findings-emnlp.902.software.zip
Data:
 2024.findings-emnlp.902.data.zip