2024
RAR: Retrieval-augmented retrieval for code generation in low resource languages
Avik Dutta | Mukul Singh | Gust Verbruggen | Sumit Gulwani | Vu Le
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language models struggle to generate code for low-resource programming languages, since these are underrepresented in training data. Either examples or documentation is commonly used to improve code generation. We propose to use both types of information together and present retrieval-augmented retrieval (RAR), a two-step method for selecting relevant examples and documentation. Experiments on three low-resource languages (Power Query M, OfficeScript, and Excel formulas) show that RAR outperforms independent example and grammar retrieval (+2.81–26.14%). Interestingly, we also show that two-step retrieval selects better examples and documentation even when each is then used independently.
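The abstract does not spell out the retrieval pipeline, so the sketch below only illustrates the general two-step idea with a toy bag-of-words retriever: first fetch the examples most similar to the query, then use those examples to guide documentation selection. All identifiers and data here are hypothetical stand-ins, not the paper's implementation.

```python
from collections import Counter
import math

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def two_step_retrieve(query, examples, docs, k=2):
    # Step 1: retrieve the k examples most similar to the query.
    q = Counter(query.lower().split())
    ranked = sorted(examples,
                    key=lambda e: cosine(q, Counter(e.lower().split())),
                    reverse=True)
    top_examples = ranked[:k]
    # Step 2: expand the query with the retrieved examples and use the
    # expanded representation to select a documentation entry.
    expanded = q + sum((Counter(e.lower().split()) for e in top_examples),
                       Counter())
    top_doc = max(docs, key=lambda d: cosine(expanded, Counter(d.lower().split())))
    return top_examples, top_doc
```

The point of the second step is that the retrieved examples enrich the query with vocabulary (here, literal tokens; in practice, learned embeddings) that the bare natural-language request lacks, which helps surface the right documentation.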
One-to-many testing for code generation from (just) natural language
Mansi Uniyal | Mukul Singh | Gust Verbruggen | Sumit Gulwani | Vu Le
Findings of the Association for Computational Linguistics: EMNLP 2024
MBPP is a popular dataset for evaluating the task of code generation from natural language. Despite its popularity, it has three problems: (1) it relies on providing test cases to generate the right signature, (2) there is poor alignment between instructions and evaluation test cases, and (3) the exact phrasing is contaminated by being present in training datasets. We adapt MBPP to emphasize generating code from just natural language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets of assertions to account for ambiguity in the syntax. We compare popular open- and closed-weight models on the original (MBPP) and adapted (MBUPP) datasets.
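One way to picture the one-to-many evaluation the abstract describes: a candidate is accepted if it satisfies every assertion in at least one of several alternative assertion sets, so outputs that differ only syntactically (say, a list versus a tuple) are not penalized. This is an illustrative sketch with an invented task, not the benchmark's actual harness.

```python
def passes_any(candidate_fn, assertion_sets):
    # Accept the candidate if it satisfies *every* assertion in at least
    # one of the alternative assertion sets (one-to-many evaluation).
    return any(all(check(candidate_fn) for check in assertions)
               for assertions in assertion_sets)

# Hypothetical task: "return the distinct elements of a list in sorted order".
# Two assertion sets accept either a list or a tuple as the output syntax.
assertion_sets = [
    [lambda f: f([2, 1, 2]) == [1, 2]],   # list-shaped output
    [lambda f: f([2, 1, 2]) == (1, 2)],   # tuple-shaped output
]
```

Under a single assertion set, one of the two semantically correct candidates below would be marked wrong purely because of its return type; under one-to-many testing both pass.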
2023
InstructExcel: A Benchmark for Natural Language Instruction in Excel
Justin Payan | Swaroop Mishra | Mukul Singh | Carina Negreanu | Christian Poelitz | Chitta Baral | Subhro Roy | Rasika Chakravarthy | Benjamin Van Durme | Elnaz Nouri
Findings of the Association for Computational Linguistics: EMNLP 2023
With the evolution of Large Language Models (LLMs) we can solve increasingly complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so, we introduce a new large-scale benchmark, InstructExcel, created by leveraging the ‘Automate’ feature in Excel to automatically generate OfficeScripts from users’ actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that InstructExcel is a hard benchmark for state-of-the-art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can help improve performance on this benchmark.
TSTR: Target Similarity Tuning Meets the Real World
Anirudh Khatry | Sumit Gulwani | Priyanshu Gupta | Vu Le | Mukul Singh | Ananya Singha | Gust Verbruggen
Findings of the Association for Computational Linguistics: EMNLP 2023
Target similarity tuning (TST) is a method for selecting relevant examples for natural language (NL) to code generation with large language models (LLMs) to improve performance. Its goal is to adapt a sentence embedding model so that the similarity between two NL inputs matches the similarity between their associated code outputs. In this paper, we propose different methods to apply and improve TST in the real world. First, we replace the sentence transformer with embeddings from a larger model, which reduces sensitivity to the language distribution and thus provides more flexibility in synthetic generation of examples, and we train a tiny model that transforms these embeddings to a space where embedding similarity matches code similarity, which allows the larger model to remain a black box and requires only a few matrix multiplications at inference time. Second, we show how to efficiently select a smaller number of training examples to train the TST model. Third, we introduce a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
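As an illustration of the "tiny model over frozen black-box embeddings" idea, here is a toy sketch that fits a single linear map by plain numerical-gradient descent so that cosine similarity after the map approximates a target code-similarity matrix. The data shapes, objective, and training loop are invented for illustration; the paper's actual model and loss may differ.

```python
import numpy as np

def tst_loss(W, X, S):
    # Mean squared gap between cosine similarities of the transformed
    # embeddings X @ W.T and the target code-similarity matrix S.
    Z = X @ W.T
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return float(np.mean((Z @ Z.T - S) ** 2))

def fit_tst_transform(X, S, steps=150, lr=0.02, eps=1e-5, seed=0):
    # Learn a small linear map W over *frozen* embeddings X. The embedding
    # model itself is never touched (it stays a black box), and applying
    # the learned transform at inference is one matrix multiplication.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], X.shape[1]))
    for _ in range(steps):
        g = np.zeros_like(W)
        # Finite-difference gradient: cheap here because W is tiny.
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                Wp, Wm = W.copy(), W.copy()
                Wp[i, j] += eps
                Wm[i, j] -= eps
                g[i, j] = (tst_loss(Wp, X, S) - tst_loss(Wm, X, S)) / (2 * eps)
        W = W - lr * g
    return W
```

Because only the small map is trained, the expensive embedding model can be queried as a service, and the transform adds negligible cost at retrieval time.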
CodeFusion: A Pre-trained Diffusion Model for Code Generation
Mukul Singh | José Cambronero | Sumit Gulwani | Vu Le | Carina Negreanu | Gust Verbruggen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Imagine a developer who can only change their last line of code — how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering tokens generated earlier. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M–175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance of diversity versus quality.