Zachary Ellis


2026

As healthcare services deploy AI to automate patient-facing communication, concerns persist about the interactional work through which empathy is made relevant. We examine empathy not as an internal state but as an interactional accomplishment, asking how patients display orientations to an LLM-powered voice assistant’s turns as (non-)empathic in real clinical telephone calls. Using Conversation Analysis (CA) to analyse post–cataract surgery follow-up calls conducted by the AI-powered voice assistant Dora (Ufonia), we compare patient responses across earlier and later system versions. Earlier calls show minimal, delayed, prosodically closed responses to wellbeing enquiries, consistent with treating Dora as a transactional information-gathering device. Later calls more often feature socially rich formats, for example colloquial upgrades, gratitude tokens, occasional return enquiries, and increased turn-final rising intonation, suggesting that patients hear Dora’s talk as socially implicative and thus as opening space for affiliative/empathetic uptake. We discuss implications for CA-informed conversation design and for evaluating “empathy” via participant orientations in situ rather than post-hoc self-report.
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen’s kappa of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
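The two metrics named in the abstract are both standard and easy to state precisely. As a minimal sketch (not the paper's implementation; function names and example data are illustrative), WER is the word-level edit distance between reference and hypothesis divided by the reference length, and Cohen's kappa measures chance-corrected agreement between two sets of labels, such as clinician- and judge-assigned impact categories:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# One deleted word out of a four-word reference gives WER = 0.25,
# even though the deletion of "no" inverts the clinical meaning —
# the mismatch between textual fidelity and clinical impact that
# motivates the benchmark above.
print(wer("patient reports no pain", "patient reports pain"))  # 0.25
```

The toy WER call illustrates the paper's core point: a single-word error can carry Significant clinical impact while barely moving the metric, which is why an impact-aware judge is needed alongside textual fidelity.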