Keelan Evanini
2026
Conversational AI for Virtual Standardized Patients using a Speech-to-Speech LLM
Andrew Emerson | Keelan Evanini | Su Somay | Kevin Frome | Le An Ha | Polina Harik
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
To develop clinical reasoning skills, medical students are often tasked with interacting with trained standardized patients (SPs). Human SPs enable real conversations that can resemble authentic clinical scenarios. However, human SPs require extensive training and are often limited in their accessibility and continual availability to medical students or residents. Virtual SPs offer the ability for medical students to practice clinical interviews in a lower-stakes setting across a broader set of clinical cases. This paper introduces a virtual SP (VSP) that leverages Amazon’s Nova Sonic, a speech-to-speech foundation model designed for human-like conversation. We investigated the ability of Nova Sonic to portray four distinct clinical cases in virtual doctor-patient encounters with 20 third-year medical students. The system’s realism, its perceived learning value, and user experience were all assessed via a survey administered to the students. Students were also asked to compare this experience to interactions with a human SP. Survey results and conversations were analyzed to derive insights for improving the Nova Sonic-based VSP system.
2025
Toward Automated Evaluation of AI-Generated Item Drafts in Clinical Assessment
Tazin Afrin | Le An Ha | Victoria Yaneva | Keelan Evanini | Steven Go | Kristine DeRuchie | Michael Heilig
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
This study examines the classification of AI-generated clinical multiple-choice question drafts as “helpful” or “non-helpful” starting points. Expert judgments were analyzed, and multiple classifiers were evaluated, including feature-based models, fine-tuned transformers, and few-shot prompting with GPT-4. Our findings highlight the challenges and considerations for evaluating AI-generated items in clinical test development.
Automated Evaluation of Standardized Patients with LLMs
Andrew Emerson | Le An Ha | Keelan Evanini | Su Somay | Kevin Frome | Polina Harik | Victoria Yaneva
Proceedings of the Artificial Intelligence in Measurement and Education Conference (AIME-Con): Full Papers
Standardized patients (SPs) are essential for clinical reasoning assessments in medical education. This paper introduces evaluation metrics that apply to both human and simulated SP systems. The metrics are computed using two LLM-as-a-judge approaches that align with human evaluators on SP performance, enabling scalable formative clinical reasoning assessments.
2019
Using Rhetorical Structure Theory to Assess Discourse Coherence for Non-native Spontaneous Speech
Xinhao Wang | Binod Gyawali | James V. Bruno | Hillary R. Molloy | Keelan Evanini | Klaus Zechner
Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019
This study aims to model the discourse structure of spontaneous spoken responses within the context of an assessment of English speaking proficiency for non-native speakers. Rhetorical Structure Theory (RST) has been commonly used in the analysis of discourse organization of written texts; however, limited research has been conducted to date on RST annotation and parsing of spoken language, in particular, non-native spontaneous speech. Because the measurement of discourse coherence is typically a key metric in human scoring rubrics for assessments of spoken language, we conducted research to obtain RST annotations on non-native spoken responses from a standardized assessment of academic English proficiency. Subsequently, automatic parsers were trained on these annotations to process non-native spontaneous speech. Finally, a set of features was extracted from automatically generated RST trees to evaluate the discourse structure of non-native spontaneous speech; these features were then employed to further improve the validity of an automated speech scoring system.
Application of an Automatic Plagiarism Detection System in a Large-scale Assessment of English Speaking Proficiency
Xinhao Wang | Keelan Evanini | Matthew Mulholland | Yao Qian | James V. Bruno
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications
This study aims to build an automatic system for the detection of plagiarized spoken responses in the context of an assessment of English speaking proficiency for non-native speakers. Classification models were trained to distinguish between plagiarized and non-plagiarized responses with two different types of features: text-to-text content similarity measures, which are commonly used in the task of plagiarism detection for written documents, and speaking proficiency measures, which were specifically designed for spontaneous speech and extracted using an automated speech scoring system. The experiments were first conducted on a large data set drawn from an operational English proficiency assessment across multiple years, and the best classifier on this heavily imbalanced data set resulted in an F1-score of 0.761 on the plagiarized class. This system was then validated on operational responses collected from a single administration of the assessment and achieved a recall of 0.897. The results indicate that the proposed system can potentially be used to improve the validity of both human and automated assessment of non-native spoken English.
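The text-to-text content similarity measures used as features above can be illustrated with a minimal sketch: cosine similarity over term-frequency vectors of two transcripts. The function name and whitespace tokenization are illustrative assumptions, not the paper's actual feature implementation.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between term-frequency vectors of two texts.

    A high score between a test-taker's transcript and a known source
    text is one signal of a potentially plagiarized spoken response.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

In practice such similarity scores would be combined with speaking-proficiency features in a trained classifier rather than thresholded directly.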
2017
Discourse Annotation of Non-native Spontaneous Spoken Responses Using the Rhetorical Structure Theory Framework
Xinhao Wang | James Bruno | Hillary Molloy | Keelan Evanini | Klaus Zechner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
The availability of the Rhetorical Structure Theory (RST) Discourse Treebank has spurred substantial research into discourse analysis of written texts; however, limited research has been conducted to date on RST annotation and parsing of spoken language, in particular, non-native spontaneous speech. Considering that the measurement of discourse coherence is typically a key metric in human scoring rubrics for assessments of spoken language, we initiated a research effort to obtain RST annotations of a large number of non-native spoken responses from a standardized assessment of academic English proficiency. The resulting inter-annotator kappa agreements on the three different levels of Span, Nuclearity, and Relation are 0.848, 0.766, and 0.653, respectively. Furthermore, a set of features was explored to evaluate the discourse structure of non-native spontaneous speech based on these annotations; the highest performing feature resulted in a correlation of 0.612 with scores of discourse coherence provided by expert human raters.
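The inter-annotator agreements reported above (0.848, 0.766, 0.653) are kappa values. For reference, Cohen's kappa corrects observed agreement for chance agreement; a minimal stdlib computation (the function name and labels are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: expected overlap given each annotator's label marginals.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa is 1.0 for perfect agreement and 0.0 when agreement is no better than chance, which is why values in the 0.6–0.85 range are read as moderate to strong agreement.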
A Report on the 2017 Native Language Identification Shared Task
Shervin Malmasi | Keelan Evanini | Aoife Cahill | Joel Tetreault | Robert Pugh | Christopher Hamill | Diane Napolitano | Yao Qian
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Native Language Identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is typically framed as a classification task where the set of L1s is known a priori. Two previous shared tasks on NLI have been organized where the aim was to identify the L1 of learners of English based on essays (2013) and spoken responses (2016) they provided during a standardized assessment of academic English proficiency. The 2017 shared task combines the inputs from the two prior tasks for the first time. There are three tracks: NLI on the essay only, NLI on the spoken response only (based on a transcription of the response and i-vector acoustic features), and NLI using both responses. We believe this makes for a more interesting shared task while building on the methods and results from the previous two shared tasks. In this paper, we report the results of the shared task. A total of 19 teams competed across the three different sub-tasks. The fusion track showed that combining the written and spoken responses provides a large boost in prediction accuracy. Multiple classifier systems (e.g. ensembles and meta-classifiers) were the most effective in all tasks, with most based on traditional classifiers (e.g. SVMs) with lexical/syntactic features.
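The multiple-classifier systems noted above as most effective typically fuse the outputs of several base classifiers. A minimal plurality-vote sketch of that fusion step (function name and L1 labels are illustrative, not drawn from any submitted system):

```python
from collections import Counter

def majority_vote(predictions: list) -> list:
    """Fuse per-classifier predictions by plurality vote.

    predictions[i][j] is classifier i's predicted L1 label for sample j;
    returns one fused label per sample.
    """
    n_samples = len(predictions[0])
    fused = []
    for j in range(n_samples):
        votes = Counter(p[j] for p in predictions)
        fused.append(votes.most_common(1)[0][0])
    return fused
```

Meta-classifiers extend this idea by training a second-stage model on the base classifiers' outputs instead of voting.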
2015
Automated Speech Recognition Technology for Dialogue Interaction with Non-Native Interlocutors
Alexei V. Ivanov | Vikram Ramanarayanan | David Suendermann-Oeft | Melissa Lopez | Keelan Evanini | Jidong Tao
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
A distributed cloud-based dialog system for conversational application development
Vikram Ramanarayanan | David Suendermann-Oeft | Alexei V. Ivanov | Keelan Evanini
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
2014
Automatic detection of plagiarized spoken responses
Keelan Evanini | Xinhao Wang
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
Automated scoring of speaking items in an assessment for teachers of English as a Foreign Language
Klaus Zechner | Keelan Evanini | Su-Youn Yoon | Lawrence Davis | Xinhao Wang | Lei Chen | Chong Min Lee | Chee Wee Leong
Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications
2013
Coherence Modeling for the Automated Assessment of Spontaneous Spoken Responses
Xinhao Wang | Keelan Evanini | Klaus Zechner
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Prompt-based Content Scoring for Automated Spoken Language Assessment
Keelan Evanini | Shasha Xie | Klaus Zechner
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications
2012
Exploring Content Features for Automated Speech Scoring
Shasha Xie | Keelan Evanini | Klaus Zechner
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
2011
Non-scorable Response Detection for Automated Speaking Proficiency Assessment
Su-Youn Yoon | Keelan Evanini | Klaus Zechner
Proceedings of the Sixth Workshop on Innovative Use of NLP for Building Educational Applications
2010
Co-authors
- Klaus Zechner 8
- Xinhao Wang 6
- Le An Ha 3
- James V. Bruno 2
- Andrew Emerson 2
- Kevin Frome 2
- Polina Harik 2
- Alexei V. Ivanov 2
- Yao Qian 2
- Vikram Ramanarayanan 2
- Su Somay 2
- David Suendermann-Oeft 2
- Shasha Xie 2
- Victoria Yaneva 2
- Su-Youn Yoon 2
- Tazin Afrin 1
- James Bruno 1
- Aoife Cahill 1
- Lei Chen 1
- Lawrence Davis 1
- Kristine DeRuchie 1
- Steven Go 1
- Binod Gyawali 1
- Christopher Hamill 1
- Michael Heilig 1
- Derrick Higgins 1
- Chungmin Lee 1
- Chee Wee Leong 1
- Melissa Lopez 1
- Shervin Malmasi 1
- Hillary Molloy 1
- Hillary R. Molloy 1
- Matthew Mulholland 1
- Diane Napolitano 1
- Robert Pugh 1
- Jidong Tao 1
- Joel Tetreault 1