Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)

Eduard Dragut, Yunyao Li, Lucian Popa, Slobodan Vucetic, Shashank Srivastava (Editors)

Anthology ID:: 2024.dash-1
Month:: June
Year:: 2024
Address:: Mexico City, Mexico
Venues:: DaSH | WS
SIG:
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2024.dash-1
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2024.dash-1.pdf

pdf bib
Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)
Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic | Shashank Srivastava

Prompt engineering is an iterative procedure that often requires extensive manual effort to formulate suitable instructions for effectively directing large language models (LLMs) in specific tasks. Incorporating few-shot examples is a vital and effective approach to provide LLMs with precise instructions, leading to improved LLM performance. Nonetheless, identifying the most informative demonstrations for LLMs is labor-intensive, frequently entailing sifting through an extensive search space. In this demonstration, we showcase a human-in-the-loop tool called ool (Active Prompt Engineering) designed for refining prompts through active learning. Drawing inspiration from active learning, ool iteratively selects the most ambiguous examples for human feedback, which will be transformed into few-shot examples within the prompt.

pdf bib abs
Towards Optimizing and Evaluating a Retrieval Augmented QA Chatbot using LLMs with Human-in-the-Loop
Anum Afzal | Alexander Kowsik | Rajna Fani | Florian Matthes

Large Language Models have found application in various mundane and repetitive tasks including Human Resource (HR) support. We worked with the domain experts of a large multinational company to develop an HR support chatbot as an efficient and effective tool for addressing employee inquiries. We inserted a human-in-the-loop in various parts of the development cycles such as dataset collection, prompt optimization, and evaluation of generated output. By enhancing the LLM-driven chatbot’s response quality and exploring alternative retrieval methods, we have created an efficient, scalable, and flexible tool for HR professionals to address employee inquiries effectively. Our experiments and evaluation conclude that GPT-4 outperforms other models and can overcome inconsistencies in data through internal reasoning capabilities. Additionally, through expert analysis, we infer that reference-free evaluation metrics such as G-Eval and Prometheus demonstrate reliability closely aligned with that of human evaluation.

The development of conversational AI assistants is an iterative process with many components involved. As such, the evaluation and continual improvement of these assistants is a complex and multifaceted problem. This paper introduces the challenges in evaluating and improving a generative AI assistant for enterprise that is under active development and how we address these challenges. We also share preliminary results and discuss lessons learned.

pdf bib abs
Mini-DA: Improving Your Model Performance through Minimal Data Augmentation using LLM
Shuangtao Yang | Xiaoyi Liu | Xiaozheng Dong | Bo Fu

When performing data augmentation using large language models (LLMs), the common approach is to directly generate a large number of new samples based on the original dataset, and then model is trained on the integration of augmented dataset and the original dataset. However, data generation demands extensive computational resources. In this study, we propose Mini-DA, a minimized data augmentation method that leverages the feedback from the target model during the training process to select only the most challenging samples from the validation set for augmentation. Our experimental results show in text classification task, by using as little as 13 percent of the original augmentation volume, Mini-DA can achieve performance comparable to full data augmentation for intent detection task, significantly improving data and computational resource utilization efficiency.

pdf bib abs
CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models
Son The Nguyen | Niranjan Uma Naresh | Theja Tulabandhula

This paper addresses the challenges of aligning large language models (LLMs) with human values via preference learning (PL), focusing on incomplete and corrupted data in preference datasets. We propose a novel method for robustly and completely recalibrating values within these datasets to enhance LLMs’ resilience against the issues. In particular, we devise a guaranteed polynomial time ranking algorithm that robustifies several existing models, such as the classic Bradley–Terry–Luce (BTL) model and certain generalizations of it. To the best of our knowledge, our present work is the first to propose an algorithm that provably recovers an 𝜖-optimal ranking with high probability while allowing as large as O(n) perturbed pairwise comparison results per model response. Furthermore, we show robust recovery results in the partially observed setting. Our experiments confirm that our algorithms handle adversarial noise and unobserved comparisons well in LLM preference dataset settings. This work contributes to the development and scaling of more reliable and ethically aligned AI models by equipping the dataset curation pipeline with the ability to handle missing and maliciously manipulated inputs.