Zachary Ellis


2026

As healthcare services deploy AI to automate patient-facing communication, concerns persist about the interactional work through which empathy is made relevant. We examine empathy not as an internal state but as an interactional accomplishment, asking how patients display orientations to an LLM-powered voice assistant’s turns as (non-)empathic in real clinical telephone calls. Using Conversation Analysis (CA) to analyse post–cataract surgery follow-up calls conducted by the AI-powered voice assistant Dora (Ufonia), we compare patient responses across earlier and later system versions. Earlier calls show minimal, delayed, prosodically closed responses to wellbeing enquiries, consistent with treating Dora as a transactional information-gathering device. Later calls more often feature socially rich formats, for example colloquial upgrades, gratitude tokens, occasional return enquiries, and increased turn-final rising intonation, suggesting that patients hear Dora’s talk as socially implicative and thus as opening space for affiliative/empathetic uptake. We discuss implications for CA-informed conversation design and for evaluating “empathy” via participant orientations in situ rather than post-hoc self-report.
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
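The two metrics named in the abstract are both standard and easy to state precisely. As a minimal sketch (not the paper's implementation; function names and example data are illustrative), WER is the word-level edit distance between reference and hypothesis divided by the reference length, and Cohen's kappa measures chance-corrected agreement between two sets of labels, such as clinician- and judge-assigned impact categories:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# One deleted word out of a four-word reference gives WER = 0.25,
# even though the deletion of "no" inverts the clinical meaning —
# the mismatch between textual fidelity and clinical impact that
# motivates the benchmark above.
print(wer("patient reports no pain", "patient reports pain"))  # 0.25
```

The toy WER call illustrates the paper's core point: a single-word error can carry Significant clinical impact while barely moving the metric, which is why an impact-aware judge is needed alongside textual fidelity.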