2024
pdf
bib
abs
Generative Explore-Exploit: Training-free Optimization of Generative Recommender Systems using LLM Optimizers
Lütfi Kerem Senel
|
Besnik Fetahu
|
Davis Yoshida
|
Zhiyu Chen
|
Giuseppe Castellucci
|
Nikhita Vedula
|
Jason Ingyu Choi
|
Shervin Malmasi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recommender systems are widely used to suggest engaging content, and Large Language Models (LLMs) have given rise to generative recommenders. Such systems can directly generate items, including for open-set tasks like question suggestion. While the world knowledge of LLMs enables good recommendations, improving the generated content through user feedback is challenging as continuously fine-tuning LLMs is prohibitively expensive. We present a training-free approach for optimizing generative recommenders by connecting user feedback loops to LLM-based optimizers. We propose a generative explore-exploit method that can not only exploit generated items with known high engagement, but also actively explore and discover hidden population preferences to improve recommendation quality. We evaluate our approach on question generation in two domains (e-commerce and general knowledge), and model user feedback with Click Through Rate (CTR). Experiments show our LLM-based explore-exploit approach can iteratively improve recommendations and consistently increase CTR. Ablation analysis shows that generative exploration is key to learning user preferences, avoiding the pitfalls of greedy exploit-only approaches. A human evaluation strongly supports our quantitative findings.
pdf
bib
abs
Identifying High Consideration E-Commerce Search Queries
Zhiyu Chen
|
Jason Ingyu Choi
|
Besnik Fetahu
|
Shervin Malmasi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
In e-commerce, high consideration search missions typically require careful and elaborate decision making, and involve a substantial research investment from customers. We consider the task of identifying High Consideration (HC) queries. Identifying such queries enables e-commerce sites to better serve user needs using targeted experiences such as curated QA widgets that help users reach purchase decisions. We explore the task by proposing an Engagement-based Query Ranking (EQR) approach, focusing on query ranking to indicate potential engagement levels with query-related shopping knowledge content during product search. Unlike previous studies on predicting trends, EQR prioritizes query-level features related to customer behavior, finance, and catalog information rather than popularity signals. We introduce an accurate and scalable method for EQR and present experimental results demonstrating its effectiveness. Offline experiments show strong ranking performance. Human evaluation shows a precision of 96% for HC queries identified by our model. The model was commercially deployed, and shown to outperform human-selected queries in terms of downstream customer impact, as measured through engagement.
pdf
bib
abs
Combining Multiple Metrics for Evaluating Retrieval-Augmented Conversations
Jason Ingyu Choi
|
Marcus Collins
|
Eugene Agichtein
|
Oleg Rokhlenko
|
Shervin Malmasi
Proceedings of the Third Workshop on Bridging Human--Computer Interaction and Natural Language Processing
Conversational AI is a subtype of Human Computer Interaction that has gained wide adoption. These systems are typically powered by Large Language Models (LLMs) that use Retrieval Augmented Generation (RAG) to infuse external knowledge, which is effective against issues like hallucination. However, automatically evaluating retrieval augmented conversations with minimal human effort remains challenging, particularly in online settings. We address this challenge by proposing a lexical metric, and a novel method for combining it with other metrics, including semantic models. Our approach involves: (1) Conversational Information Utility (CIU), a new automated metric inspired by prior user studies on web search evaluation, to compute information overlap between conversation context and grounded information in an unsupervised, purely lexical way; and (2) a generalized reward model through Mixture-of-Experts (MoE-CIU) that dynamically ensembles CIU with other metrics, including learned ones, into a single reward. Evaluation against human ratings on two public datasets (Topical Chat and Persona Chat) shows that CIU improves correlation against human judgments by 2.0% and 0.9% respectively compared to the second best metric. When MoE is applied to combine lexical and learned semantic metrics, correlations further improve by 9.9% and 5.0%, suggesting that unified reward models are a promising approach.
2022
pdf
bib
abs
Wizard of Tasks: A Novel Conversational Dataset for Solving Real-World Tasks in Conversational Settings
Jason Ingyu Choi
|
Saar Kuzi
|
Nikhita Vedula
|
Jie Zhao
|
Giuseppe Castellucci
|
Marcus Collins
|
Shervin Malmasi
|
Oleg Rokhlenko
|
Eugene Agichtein
Proceedings of the 29th International Conference on Computational Linguistics
Conversational Task Assistants (CTAs) are conversational agents whose goal is to help humans perform real-world tasks. CTAs can help in exploring available tasks, answering task-specific questions and guiding users through step-by-step instructions. In this work, we present Wizard of Tasks, the first corpus of such conversations in two domains: Cooking and Home Improvement. We crowd-sourced a total of 549 conversations (18,077 utterances) with an asynchronous Wizard-of-Oz setup, relying on recipes from WholeFoods Market for the cooking domain, and WikiHow articles for the home improvement domain. We present a detailed data analysis and show that the collected data can be a valuable and challenging resource for CTAs in two tasks: Intent Classification (IC) and Abstractive Question Answering (AQA). While on IC we acquired a high performing model (>85% F1), on AQA the performance is far from being satisfactory (~27% BertScore-F1), suggesting that more work is needed to solve the task of low-resource AQA.