2024
RAR: Retrieval-augmented retrieval for code generation in low resource languages
Avik Dutta | Mukul Singh | Gust Verbruggen | Sumit Gulwani | Vu Le
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language models struggle to generate code for low-resource programming languages, since these are underrepresented in training data. Examples or documentation are commonly used to improve code generation. We propose to use both types of information together and present retrieval-augmented retrieval (RAR), a two-step method for selecting relevant examples and documentation. Experiments on three low-resource languages (Power Query M, OfficeScript and Excel formulas) show that RAR outperforms independent example and grammar retrieval (+2.81–26.14%). Interestingly, two-step retrieval also selects better examples and documentation when each is used independently.
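A minimal sketch of the two-step idea, using TF-IDF and cosine similarity as stand-ins for whatever embedding-based retriever the paper actually uses; the function name and signature below are illustrative, not from the paper:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rar_retrieve(query, examples, docs, k_ex=3, k_doc=3):
    # Fit one vectorizer over everything so all texts share a vocabulary.
    vec = TfidfVectorizer().fit(examples + docs + [query])
    q = vec.transform([query])

    # Step 1: retrieve the examples most similar to the query.
    ex_sims = cosine_similarity(q, vec.transform(examples))[0]
    top_ex = [examples[i] for i in np.argsort(-ex_sims)[:k_ex]]

    # Step 2: retrieve documentation using the query augmented with those examples.
    aug = vec.transform([" ".join([query] + top_ex)])
    doc_sims = cosine_similarity(aug, vec.transform(docs))[0]
    top_docs = [docs[i] for i in np.argsort(-doc_sims)[:k_doc]]
    return top_ex, top_docs

Both the selected examples and the selected documentation snippets would then be placed in the prompt for code generation.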
Solving Data-centric Tasks using Large Language Models
Shraddha Barke | Christian Poelitz | Carina Negreanu | Benjamin Zorn | José Cambronero | Andrew Gordon | Vu Le | Elnaz Nouri | Nadia Polikarpova | Advait Sarkar | Brian Slininger | Neil Toronto | Jack Williams
Findings of the Association for Computational Linguistics: NAACL 2024
Large language models are rapidly replacing help forums like StackOverflow, and are especially helpful to non-professional programmers and end users. These users are often interested in data-centric tasks, like spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including data. But how do we decide how much data and which data to include in the prompt? This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a novel cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table, our cluster-then-select technique outperforms a random selection baseline.
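A hedged sketch of a cluster-then-select row picker: cluster the rows and keep the row nearest each cluster centre. The TF-IDF featurisation, k-means clustering, and cluster count below are illustrative assumptions, not necessarily the paper's exact choices:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def select_representative_rows(rows, k=3):
    # rows: table rows rendered as strings; returns up to k representative rows.
    X = TfidfVectorizer().fit_transform(rows)
    km = KMeans(n_clusters=min(k, len(rows)), n_init=10, random_state=0).fit(X)
    reps = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = km.transform(X[members])[:, c]  # distance of each member to centre c
        reps.append(rows[members[np.argmin(dists)]])
    return reps

The selected rows would then be appended to the natural-language task description in the prompt.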
One-to-many testing for code generation from (just) natural language
Mansi Uniyal | Mukul Singh | Gust Verbruggen | Sumit Gulwani | Vu Le
Findings of the Association for Computational Linguistics: EMNLP 2024
MBPP is a popular dataset for evaluating the task of code generation from natural language. Despite its popularity, there are three problems: (1) it relies on providing test cases to generate the right signature, (2) there is poor alignment between instruction and evaluation test cases, and (3) contamination of the exact phrasing being present in training datasets. We adapt MBPP to emphasize generating code from just natural language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets of assertions to account for ambiguity in the syntax. We compare popular open- and closed-weight models on the original (MBPP) and adapted (MBUPP) datasets.
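An illustrative sketch of one-to-many evaluation: a candidate solution counts as correct if it passes any one of several alternative assertion sets. The "any set passes" criterion and the example task below are assumptions for illustration, not details taken from the paper:

def passes_any(candidate_src, assertion_sets):
    # candidate_src: generated Python source; assertion_sets: alternative test sets.
    for assertions in assertion_sets:
        env = {}
        try:
            exec(candidate_src, env)      # define the candidate function
            for a in assertions:
                exec(a, env)              # e.g. "assert divide(6, 3) == 2"
        except Exception:
            continue                      # this assertion set failed; try the next one
        return True
    return False

# Two assertion sets that accept either argument order for a hypothetical task.
sets = [["assert divide(6, 3) == 2"], ["assert divide(3, 6) == 2"]]
print(passes_any("def divide(a, b):\n    return a // b", sets))  # True, via the first set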
2023
TSTR: Target Similarity Tuning Meets the Real World
Anirudh Khatry | Sumit Gulwani | Priyanshu Gupta | Vu Le | Mukul Singh | Ananya Singha | Gust Verbruggen
Findings of the Association for Computational Linguistics: EMNLP 2023
Target similarity tuning (TST) is a method for selecting relevant examples in natural language (NL) to code generation with large language models (LLMs) to improve performance. Its goal is to adapt a sentence embedding model so that the similarity between two NL inputs matches the similarity between their associated code outputs. In this paper, we propose different methods to apply and improve TST in the real world. First, we replace the sentence transformer with embeddings from a larger model, which reduces sensitivity to the language distribution and thus provides more flexibility in the synthetic generation of examples, and we train a tiny model that transforms these embeddings to a space where embedding similarity matches code similarity; this allows the larger model to remain a black box and requires only a few matrix multiplications at inference time. Second, we show how to efficiently select a smaller number of training examples to train the TST model. Third, we introduce a ranking-based evaluation for TST that does not require end-to-end code generation experiments, which can be expensive to perform.
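A minimal sketch of the "tiny model on top of frozen embeddings" idea: a linear map trained so that cosine similarity of transformed NL embeddings matches a given code-similarity score. The dimensions, loss, and random toy data below are assumptions for illustration:

import torch
import torch.nn.functional as F

d_in, d_out, n_pairs = 1536, 256, 512
nl_a = torch.randn(n_pairs, d_in)     # frozen embeddings of NL utterance A (toy data)
nl_b = torch.randn(n_pairs, d_in)     # frozen embeddings of NL utterance B (toy data)
code_sim = torch.rand(n_pairs)        # precomputed similarity of the paired code outputs

proj = torch.nn.Linear(d_in, d_out, bias=False)
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

for _ in range(200):
    pred = F.cosine_similarity(proj(nl_a), proj(nl_b))
    loss = F.mse_loss(pred, code_sim)  # push NL-space similarity towards code similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

# At inference time, example selection only needs the projected query embedding and a
# few matrix multiplications against pre-projected candidate example embeddings.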
CodeFusion: A Pre-trained Diffusion Model for Code Generation
Mukul Singh | José Cambronero | Sumit Gulwani | Vu Le | Carina Negreanu | Gust Verbruggen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Imagine a developer who can only change their last line of code—how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier generated tokens. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M–175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.
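A toy sketch of the diffusion-style decoding loop: start from noise and repeatedly denoise a latent representation of the whole program, conditioned on the encoded natural language. The denoiser below is a randomly initialised stand-in, not the paper's trained model, and the latent size and step count are arbitrary assumptions:

import torch

d_latent, steps = 512, 10
nl_encoding = torch.randn(1, d_latent)            # output of the NL encoder (placeholder)
denoiser = torch.nn.Sequential(                   # randomly initialised stand-in denoiser
    torch.nn.Linear(2 * d_latent, d_latent),
    torch.nn.Tanh(),
    torch.nn.Linear(d_latent, d_latent),
)

x = torch.randn(1, d_latent)                      # fully noised latent for the whole program
for _ in range(steps):
    # Each step re-predicts a cleaner latent for the entire program, so earlier
    # positions can still be revised, unlike left-to-right autoregressive decoding.
    x = denoiser(torch.cat([x, nl_encoding], dim=-1))
# A separate decoder head would map the final latent x to concrete code tokens.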