Sally Fang
2024
RETAIN: Interactive Tool for Regression Testing Guided LLM Migration
Tanay Dixit | Daniel Lee | Sally Fang | Sai Sree Harsha | Anirudh Sureshan | Akash V Maharaj | Yunyao Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large Language Models (LLMs) are increasingly integrated into diverse applications. The rapid evolution of LLMs presents opportunities for developers to enhance applications continuously. However, this constant adaptation can also lead to performance regressions during model migrations. While several interactive tools have been proposed to streamline the complexity of prompt engineering, few address the specific requirements of regression testing for LLM migrations. To bridge this gap, we introduce RETAIN (REgression Testing guided LLM migrAtIoN), a tool designed explicitly for regression testing in LLM migrations. RETAIN comprises two key components: an interactive interface tailored to regression testing needs during LLM migrations, and an error discovery module that facilitates understanding of differences in model behaviors. The error discovery module generates textual descriptions of various errors or differences between model outputs, providing actionable insights for prompt refinement. Our automatic evaluation and empirical user studies demonstrate that, compared to manual evaluation, RETAIN enabled participants to identify twice as many errors, facilitated experimentation with 75% more prompts, and achieved 12% higher metric scores in a given time frame.
Evaluation and Continual Improvement for an Enterprise AI Assistant
Akash Maharaj | Kun Qian | Uttaran Bhattacharya | Sally Fang | Horia Galatanu | Manas Garg | Rachel Hanessian | Nishant Kapoor | Ken Russell | Shivakumar Vaithyanathan | Yunyao Li
Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)
The development of conversational AI assistants is an iterative process involving many components. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper introduces the challenges of evaluating and improving an enterprise generative AI assistant that is under active development and describes how we address those challenges. We also share preliminary results and discuss lessons learned.