José Cambronero


2024

pdf bib
Encoding Spreadsheets for Large Language Models
Haoyu Dong | Jianbo Zhao | Yuzhang Tian | Junyu Xiong | Mengyu Zhou | Yun Lin | José Cambronero | Yeye He | Shi Han | Dongmei Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Spreadsheets are characterized by their extensive two-dimensional grids, flexible layouts, and varied formatting options, which pose significant challenges for large language models (LLMs). In response, we introduce SheetEncoder, pioneering an efficient encoding method designed to unleash and optimize LLMs’ powerful understanding and reasoning capability on spreadsheets. Initially, we propose a vanilla serialization approach that incorporates cell addresses, values, and formats. However, this approach was limited by LLMs’ token constraints, making it impractical for most applications. To tackle this challenge, three innovative modules are proposed to compress spreadsheets effectively: structural-anchor-based compression, inverse index translation, and data-format-aware aggregation. It significantly improves performance in spreadsheet table detection task, outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting. Moreover, fine-tuned LLM with SheetEncoder has an average compression ratio of 25×, but achieves a state-of-the-art 78.9% F1 score, surpassing the best existing models by 12.3%, demonstrating that SheetEncoder greatly boosts LLMs’s performance on spreadsheet data.

pdf bib
Solving Data-centric Tasks using Large Language Models
Shraddha Barke | Christian Poelitz | Carina Negreanu | Benjamin Zorn | José Cambronero | Andrew Gordon | Vu Le | Elnaz Nouri | Nadia Polikarpova | Advait Sarkar | Brian Slininger | Neil Toronto | Jack Williams
Findings of the Association for Computational Linguistics: NAACL 2024

Large language models are rapidly replacing help forums like StackOverflow, and are especially helpful to non-professional programmers and end users. These users are often interested in data-centric tasks, like spreadsheet manipulation and data wrangling, which are hard to solve if the intent is only communicated using a natural-language description, without including data. But how do we decide how much data and which data to include in the prompt?This paper makes two contributions towards answering this question. First, we create a dataset of real-world NL-to-code tasks manipulating tabular data, mined from StackOverflow posts. Second, we introduce a novel cluster-then-select prompting technique, which adds the most representative rows from the input data to the LLM prompt. Our experiments show that LLM performance is indeed sensitive to the amount of data passed in the prompt, and that for tasks with a lot of syntactic variation in the input table,our cluster-then-select technique outperforms a random selection baseline.

2023

pdf bib
CodeFusion: A Pre-trained Diffusion Model for Code Generation
Mukul Singh | José Cambronero | Sumit Gulwani | Vu Le | Carina Negreanu | Gust Verbruggen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Imagine a developer who can only change their last line of code—how often would they have to start writing a function from scratch before it is correct? Auto-regressive models for code generation from natural language have a similar limitation: they do not easily allow reconsidering earlier tokens generated. We introduce CodeFusion, a pre-trained diffusion code generation model that addresses this limitation by iteratively denoising a complete program conditioned on the encoded natural language. We evaluate CodeFusion on the task of natural language to code generation for Bash, Python, and Microsoft Excel conditional formatting (CF) rules. Experiments show that CodeFusion (75M parameters) performs on par with state-of-the-art auto-regressive systems (350M-175B parameters) in top-1 accuracy and outperforms them in top-3 and top-5 accuracy due to its better balance in diversity versus quality.