Cyrus Andre DSouza


2026

High-quality search is essential for the success of online platforms, spanning e-commerce, social media, shopping-focused applications, and broader search systems such as content discovery and enterprise web search. To ensure optimal user experience and drive business growth, continuous evaluation and improvement of search systems is crucial. This paper introduces PROBES, a novel multi-task system powered by Large Language Models (LLMs) designed for end-to-end evaluation of semantic search systems. PROBES identifies context-aware relevance using a fine-grained scale (exact, substitute, complement, irrelevant) by leveraging the query category, feature-level intent, and category-aware feature importance, enabling more precise and consistent judgments than relying solely on raw query text. This allows PROBES to provide differentiated relevance assessment across a diverse range of query categories. PROBES then dives deeper to understand the reasons behind irrelevant results (precision issues) by checking for product content conflicts and inaccuracies. It also analyzes missed recall by leveraging retrieval and relevance models to determine whether a miss was due to a selection issue or a retrieval/ranking system issue. To evaluate PROBES, we introduce a new metric, the Actionable Error Rate (AER), defined as the proportion of actionable errors among all flagged errors. We observe that PROBES operates at an AER of 76%, generating actionable insights across 100 product categories.
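The AER metric has a simple closed form: the fraction of flagged errors that turn out to be actionable. A minimal sketch (the function name and the flagged-error data shape are illustrative, not from the paper):

```python
def actionable_error_rate(flagged_errors):
    """Actionable Error Rate (AER): the proportion of flagged
    errors that are actionable, out of all flagged errors."""
    if not flagged_errors:
        return 0.0
    actionable = sum(1 for e in flagged_errors if e["actionable"])
    return actionable / len(flagged_errors)

# e.g. 38 actionable out of 50 flagged errors -> AER = 0.76 (76%)
errors = [{"actionable": i < 38} for i in range(50)]
print(actionable_error_rate(errors))  # 0.76
```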

2025

High-quality content is critical for driving customer satisfaction and conversions across digital platforms and e-commerce. Ensuring that essential information is complete, accurate, and aligned with customer expectations presents a significant challenge at scale. Existing approaches to content evaluation often treat all information uniformly, without prioritizing based on customer relevance, and rely heavily on manual prompt design to encode domain expertise into Large Language Models (LLMs). We present ISEE, a unified framework that addresses these limitations through three core innovations: (1) automated identification of customer-impacting features by synthesizing signals from search behavior, queries, and feedback, enabling targeted content improvements; (2) an instruction-tuned multimodal LLM trained to reliably follow structured operational guidelines, reducing dependence on manual prompt engineering; and (3) robust zero-shot generalization to new product content, features, and SOPs via targeted instruction tuning. Evaluated across 20 product categories and 150 product-specific features, ISEE achieves 90% precision at 78% recall in detecting content inconsistencies, outperforming much larger (> 200B parameters) models while using a compact 12B architecture.
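The headline numbers (90% precision at 78% recall) follow the standard definitions over detected inconsistencies. A minimal sketch of how such figures are computed; the set-of-IDs representation and counts below are illustrative, not from the paper:

```python
def precision_recall(flagged, true_inconsistencies):
    """Precision and recall for inconsistency detection, given
    the set of flagged item IDs and the set of true inconsistencies."""
    tp = len(flagged & true_inconsistencies)  # true positives
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(true_inconsistencies) if true_inconsistencies else 0.0
    return precision, recall

# e.g. 130 items flagged, 117 of them real, out of 150 true
# inconsistencies -> precision 0.9, recall 0.78
flagged = set(range(130))
truth = set(range(13, 163))  # overlaps flagged on 117 IDs
p, r = precision_recall(flagged, truth)
print(p, r)  # 0.9 0.78
```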