Negar Arabzadeh


2024

pdf bib
Assessing and Verifying Task Utility in LLM-Powered Applications
Negar Arabzadeh | Siqing Huo | Nikhil Mehta | Qingyun Wu | Chi Wang | Ahmed Hassan Awadallah | Charles L. A. Clarke | Julia Kiseleva
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what extent LLM-powered applications genuinely enhance user experience and task execution efficiency. This highlights the need to verify utility of LLM-powered applications, particularly by ensuring alignment between the application’s functionality and end-user needs. We introduce AgentEval, a novel framework designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any given application. This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria. We present a comprehensive analysis of the effectiveness and robustness of AgentEval for two open source datasets including Math Problem solving and ALFWorld House-hold related tasks. For reproducibility purposes, we make the data, code and all the logs publicly available at https://github.com/Narabzad/AgentEval

pdf bib
Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels
Negar Arabzadeh | Charles Clarke
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The rapid advancement of natural language processing, information retrieval (IR), computer vision, and other technologies has presented significant challenges in evaluating the performance of these systems. One of the main challenges is the scarcity of human-labeled data, which hinders the fair and accurate assessment of these systems. In this work, we specifically focus on evaluating IR systems with sparse labels, borrowing from recent research on evaluating computer vision tasks.taking inspiration from the success of using Fréchet Inception Distance (FID) in assessing text-to-image generation systems. We propose leveraging the Fréchet Distance to measure the distance between the distributions of relevant judged items and retrieved results. Our experimental results on MS MARCO V1 dataset and TREC Deep Learning Tracks query sets demonstrate the effectiveness of the Fréchet Distance as a metric for evaluating IR systems, particularly in settings where a few labels are available.This approach contributes to the advancement of evaluation methodologies in real-world scenarios such as the assessment of generative IR systems.

2023

pdf bib
PREME: Preference-based Meeting Exploration through an Interactive Questionnaire
Negar Arabzadeh | Ali Ahmadvand | Julia Kiseleva | Yang Liu | Ahmed Hassan Awadallah | Ming Zhong | Milad Shokouhi
Findings of the Association for Computational Linguistics: EACL 2023

The recent increase in the volume of online meetings necessitates automated tools for organizing the material, especially when an attendee has missed the discussion and needs assistance in quickly exploring it. In this work, we propose a novel end-to-end framework for generating interactive questionnaires for preference-based meeting exploration. As a result, users are supplied with a list of suggested questions reflecting their preferences. Since the task is new, we introduce an automatic evaluation strategy by measuring how much the generated questions via questionnaire are answerable to ensure factual correctness and covers the source meeting for the depth of possible exploration.