Saleema Amershi


2024

AUTOGEN STUDIO: A No-Code Developer Tool for Building and Debugging Multi-Agent Systems
Victor Dibia | Jingya Chen | Gagan Bansal | Suff Syed | Adam Fourney | Erkang Zhu | Chi Wang | Saleema Amershi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms) and debugging them remain challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation. https://github.com/microsoft/autogen/tree/autogenstudio/samples/apps/autogen-studio

2023

Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
John Chung | Ece Kamar | Saleema Amershi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of languages that have already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user’s domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.
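
The two diversification approaches named in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `penalty` parameter, and the count-proportional form of the suppression are assumptions made for illustration only. Temperature sampling divides the logits by a temperature before the softmax (values above 1 flatten the distribution), and the hypothetical suppression step lowers the logits of tokens that have already been generated often.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; temperature > 1 flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_frequent(logits, counts, penalty=1.0):
    """Hypothetical logit suppression: lower each logit in proportion
    to how often that token has already been generated."""
    return [x - penalty * c for x, c in zip(logits, counts)]


logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)

# Token 0 has already been generated three times, so its logit is reduced.
suppressed = softmax_with_temperature(suppress_frequent(logits, [3, 0, 0]))
```

In this sketch, `flat` assigns less probability to the top token than `sharp`, and `suppressed` assigns less probability to token 0 than `sharp` does.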

Aligning Offline Metrics and Human Judgments of Value for Code Generation Models
Victor Dibia | Adam Fourney | Gagan Bansal | Forough Poursabzi-Sangdeh | Han Liu | Saleema Amershi
Findings of the Association for Computational Linguistics: ACL 2023

Large language models have demonstrated great potential to assist programmers in generating code. For such human-AI pair programming scenarios, we empirically demonstrate that while generated code is most often evaluated in terms of its functional correctness (i.e., whether generations pass available unit tests), correctness does not fully capture (e.g., may underestimate) the productivity gains these models may provide. Through a user study with N=49 experienced programmers, we show that while correctness captures high-value generations, programmers still rate code that fails unit tests as valuable if it reduces the overall effort needed to complete a coding task. Finally, we propose a hybrid metric that combines functional correctness and syntactic similarity and show that it achieves a 14% stronger correlation with value and can therefore better represent real-world gains when evaluating and comparing models.

2016

A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs
Kristina Toutanova | Chris Brockett | Ke M. Tran | Saleema Amershi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing