Shujing Dong

2026

ShopperBench: A Benchmark for Personalized Shopping with Persona-Guided Simulation
Yuan Ling | Chunqing Yuan | Shujing Dong | Yongjian Yang | Nataraj Mocherla | Ayush Goyal
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 5: Industry Track)

Personalized shopping agents must adapt their decisions to different user personas, balancing efficiency, preference alignment, and goal success. Building upon the WebShop dataset and 𝜏²-Bench environment, ShopperBench introduces a persona-guided benchmark for evaluating such adaptive behaviors. ShopperBench augments shopping trajectories with persona-conditioned goals, reasoning rationales, and preference cues, capturing how diverse shopper types—from price-conscious planners to trend-seeking explorers—navigate product search and selection. We further design a baseline of ShopperAgents that operate under persona guidance to simulate realistic, goal-oriented shopping interactions. To evaluate these agents, we propose new metrics including Persona Fidelity, Persona-Query Alignment, and Path Consistency. Together, Our ShopperBench provides a testbed for studying personalized and context-aware shopping intelligence, bridging the gap between human-centered e-commerce behavior and agent-based simulation.

2025

pdf bib abs

A critical challenge in deploying Large Language Models (LLMs) is developing reliable mechanisms to estimate their confidence, enabling systems to determine when to trust model outputs and when to seek human intervention. In this paper, we present a Calibrated Reflection Approach for Enhancing Confidence Estimation in LLMs, a framework that combines structured reasoning with distance-aware calibration techniques. Our approach introduces three key innovations: (1) a Maximum Confidence Selection (MCS) method that comprehensively evaluates confidence across all possible labels, (2) a reflection-based prompting mechanism that enhances reasoning reliability, and (3) a distance-aware calibration technique that accounts for ordinal relationships between labels. We evaluate our framework across diverse datasets, including HelpSteer2, Llama T-REx, and an internal conversational dataset, demonstrating its effectiveness across both conversational and fact-based classification tasks. This work contributes to the broader goal of developing reliable and well-calibrated confidence estimation methods for LLMs, enabling informed decisions about when to trust model outputs and when to defer to human judgement.

Co-authors

Nataraj Mocherla 1

Yongjian Yang 1

Chunqing Yuan 1

Venues

Fix author