Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings

Automatic Speech Recognition (ASR) in medical contexts has the potential to save time, cut costs, increase report accuracy, and reduce physician burnout. However, the healthcare industry has been slower to adopt this technology, in part due to the importance of avoiding medically-relevant transcription mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically-relevant mistakes more than others. We collect a benchmark of 18 clinician preferences on 149 realistic medical sentences called the Clinician Transcript Preference benchmark (CTP) and make it publicly available for the community to further develop clinically-aware ASR metrics. To our knowledge, this is the first public dataset of its kind. We demonstrate that our metric more closely aligns with clinician preferences on medical sentences as compared to other metrics (WER, BLEU, METEOR, etc.), sometimes by wide margins.


Introduction
Clinicians in a number of disciplines work in an overburdened healthcare system that leads to difficult working environments and an epidemic of physician burnout [1]. AI-related technologies have the potential to improve efficiency on repetitive tasks, thereby increasing patient throughput and decreasing physician burnout. For example, physicians in a number of disciplines spend as much time doing paperwork as with patients [2]. However, the adoption of speech technology in the medical community has been slow [3], even though a number of speech technologies could improve efficiency.
Speech technology can be applied to a number of medical problems including transcribing patient-physician conversations [4], helping dysarthric patients communicate [5], and diagnosing medical conditions from speech [6,7,8,9]. In this work, we focus on the task of generating a report after a colonoscopy procedure.
One of many reasons for the lower adoption of time-saving speech transcription technologies is that ASR systems often don't perform as well in real-world clinical settings as they do on evaluation benchmarks. The most common metric for measuring ASR performance, Word Error Rate (WER), has significant practical drawbacks [10,11,12]. First, all mistakes are treated equally. In clinical settings, however, medical words are more important (e.g. "had complete resection" → "had complete c-section" is a worse mistake than → "has complete resection", but both have equal WER). Second, some mistakes affect the overall intelligibility more than others (e.g. "was no perforation" → "was no puffer age" vs "was not any perforation"). Although researchers have proposed alternatives to the WER, no metric combines medical domain knowledge with recent AI advances in language understanding.
*Authors contributed equally. 1 https://osf.io/tg492/
In this work, we make the following contributions:
1. We generate a collection of realistic medical sentences and transcripts with plausible ASR errors and collect preferences from 18 clinicians on 149 sentences. We publicly release this dataset for reference and future studies.
2. We present the Clinical BERTScore (CBERTScore), and demonstrate that it more closely matches clinician preferences on medical transcripts than other ASR metrics (WER, BLEU, METEOR, BERTScore).
3. We demonstrate that CBERTScore does not perform worse than other metrics on non-medical transcripts.

Related work
There are a number of ways to evaluate transcript quality. The Word Error Rate (WER) is the simplest to compute and the most common. It counts the number of word insertions, deletions, and substitutions between two text strings, and normalizes by the length of the reference string. The Bilingual Evaluation Understudy (BLEU) [13] measures the amount of n-gram overlap between two text strings (typically for n up to 4). It captures the intuition that groups of words are important in addition to individual words. METEOR [14] focuses on unigrams, but computes an explicit alignment between two strings and takes both precision and recall into consideration. While these techniques are cheap to compute, they primarily focus on surface string similarity, not semantic similarity.
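The WER definition above can be made concrete with a short sketch. This is a minimal illustration assuming whitespace tokenization and no text normalization; the function name `word_error_rate` is hypothetical and not taken from the paper.

```python
def word_error_rate(reference: str, candidate: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = candidate.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j candidate words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "had complete resection"
wer_medical_error = word_error_rate(reference, "had complete c-section")
wer_minor_error = word_error_rate(reference, "has complete resection")
```

Both candidate transcripts contain exactly one substituted word out of three, so the two WER values are identical even though only the first error changes the medical meaning.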
Our work most closely follows the BERTScore [15]. This metric computes a neural word embedding for each word in the reference and candidate. Embeddings are matched using cosine similarity instead of string similarity, and the final score takes precision and recall into account (see Fig. 1). This method takes semantic similarity into account, but does not account for the fact that some words are more important to preserve in clinical contexts.
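The greedy matching described above can be sketched as follows. This is an illustrative toy, assuming each word already has an embedding vector (in practice these would come from a BERT-style model, which is not loaded here); the names `cosine_similarity` and `greedy_match_f1` are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(ref_emb: Sequence[Vector], cand_emb: Sequence[Vector]) -> float:
    """F1 of greedy best-match similarities between reference and candidate."""
    # Recall: each reference word is matched to its most similar candidate word.
    recall = sum(
        max(cosine_similarity(r, c) for c in cand_emb) for r in ref_emb
    ) / len(ref_emb)
    # Precision: each candidate word is matched to its most similar reference word.
    precision = sum(
        max(cosine_similarity(c, r) for r in ref_emb) for c in cand_emb
    ) / len(cand_emb)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-d "embeddings": the candidate repeats one word and drops the other.
toy_reference = [[1.0, 0.0], [0.0, 1.0]]
toy_candidate = [[1.0, 0.0], [1.0, 0.0]]
toy_score = greedy_match_f1(toy_reference, toy_candidate)
```

In the toy case the candidate words are all matched perfectly (precision 1), but one reference word has no counterpart (recall 0.5), which lowers the combined score.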
Structured graphs are one way to encode real-world knowledge in a machine-readable format. The Knowledge Graph (KG) [16] is a publicly available structure that encodes medical knowledge. Previous work has used the medical subset of the KG to learn medical entity extraction [4]. We primarily follow this approach to determine which words are clinically significant.

Clinical BERTScore
Our proposed metric, the Clinical BERTScore (CBERTScore), combines the BERTScore [15] and the medical subset of the Knowledge Graph [4], and extends the work in a few ways. We first define the BERTScore here for convenience: each word is represented by a neural word embedding. The similarity score for each word in the reference is the maximum cosine similarity with word embeddings from the candidate, and vice versa. The scores from the reference and candidate are both computed, then combined into a single score as follows:

$$R = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \cos(x_i, \hat{x}_j), \qquad P = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \cos(x_i, \hat{x}_j), \qquad \mathrm{BERTScore}(x, \hat{x}) = \frac{2 P R}{P + R}$$

where $x$ is the reference and $\hat{x}$ is the candidate. The definition of CBERTScore is:

$$\mathrm{CBERTScore}_k(x, \hat{x}) = k \cdot \mathrm{BERTScore}_{\mathrm{medical}}(x, \hat{x}) + (1 - k) \cdot \mathrm{BERTScore}_{\mathrm{all}}(x, \hat{x})$$

where BERTScore is computed over all words or a subset of words. For the subset, the cosine similarity is considered only if one of the words is in the subset of words. If there are no medical words in either the reference or candidate sentence, we define the CBERTScore to be the BERTScore on all words.
We inject medical information into this metric in two ways. First, we compute a weighted score on a subset of words involving medical terms, as determined by the Knowledge Graph [4]. Second, we tune the weight k of the clinical term penalty to best match a clinician transcript dataset (CTP) that we collected. We describe our method for determining k in Sec. 3.1.2.
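As a rough sketch of how the two scores might be combined, the snippet below assumes the weight k mixes a medical-only BERTScore with an all-words BERTScore as a convex combination, and that both scores have been computed elsewhere. The function name and the use of `None` to signal "no medical words in either sentence" are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional


def clinical_bertscore(
    score_all: float, score_medical: Optional[float], k: float
) -> float:
    """Mix an all-words score with a medical-words score using weight k.

    score_medical is None when neither sentence contains a medical word,
    in which case the all-words score is returned unchanged.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    if score_medical is None:
        return score_all
    return k * score_medical + (1.0 - k) * score_all


# Hypothetical scores for one transcript pair.
mixed = clinical_bertscore(score_all=0.8, score_medical=0.4, k=0.5)
no_medical_terms = clinical_bertscore(score_all=0.9, score_medical=None, k=0.5)
```

With k = 0 the sketch reduces to the plain BERTScore, and with k = 1 it is weighted entirely toward medical terms.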

Medical Entities
Similar to [4], we derive roughly 20K medically relevant words from Google's Knowledge Graph [16]. These words come from entities with properties such as "/medicine/disease", "/medicine/drug", "/medicine/medical treatment", and "/medicine/medical finding". We also treat numbers as clinically relevant in the CBERTScore algorithm, since numerical accuracy is important in medical contexts.
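A minimal sketch of flagging clinically relevant words might look like the following. The four-word set is a tiny stand-in for the roughly 20K-word list (which is not reproduced here), and the tokenization, punctuation stripping, and digit-only number pattern are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical stand-in for the Knowledge Graph derived word list.
MEDICAL_WORDS = {"propofol", "prilosec", "polyp", "coagulopathy"}
# Matches digit-based numbers such as "5" or "2.5"; spelled-out numbers are not handled.
NUMBER_PATTERN = re.compile(r"^\d+(\.\d+)?$")


def clinically_relevant(words: List[str]) -> List[bool]:
    """Return one flag per word: True for medical terms and numbers."""
    flags = []
    for word in words:
        token = word.lower().strip(".,;:!?\"'")
        is_number = bool(NUMBER_PATTERN.match(token))
        flags.append(token in MEDICAL_WORDS or is_number)
    return flags


sedation_flags = clinically_relevant(
    "Patient elects to go under Propofol sedation.".split()
)
dose_flags = clinically_relevant("another 1 to 2 cc.".split())
```

Only "Propofol" is flagged in the first sentence, while both numerals are flagged in the second.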

Tuning the medical entities weight factor
CBERTScore has a parameter, k, controlling the weight of the clinical component. To determine this factor, we picked the best performing k on the training subset of the Clinician Transcript Preference (CTP) dataset (Sec. 3.2). We evaluated k at 11 points evenly spaced between 0 and 1, and performed the evaluation methodology in Sec. 3.2 for each. We then used the selected value for all subsequent results and analyses.
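The grid search over k can be sketched as below. The `accuracy_fn` callable is a hypothetical placeholder for "run the CTP evaluation on the training half with this k"; the toy function used in the example is made up purely so the snippet runs.

```python
from typing import Callable, Iterable, Tuple


def tune_weight(
    candidate_ks: Iterable[float], accuracy_fn: Callable[[float], float]
) -> Tuple[float, float]:
    """Return the (k, accuracy) pair with the highest accuracy.

    Ties are resolved in favor of the first k encountered.
    """
    best_k, best_accuracy = None, float("-inf")
    for k in candidate_ks:
        current = accuracy_fn(k)
        if current > best_accuracy:
            best_k, best_accuracy = k, current
    if best_k is None:
        raise ValueError("candidate_ks must not be empty")
    return best_k, best_accuracy


# 11 evenly spaced points between 0 and 1, inclusive.
k_grid = [i / 10 for i in range(11)]

# Toy accuracy curve that peaks at k = 0.3 (illustrative only, not a result).
best_k, best_accuracy = tune_weight(k_grid, lambda k: 1.0 - (k - 0.3) ** 2)
```

Because the weight is chosen on the training subset only, the held-out half remains untouched for reporting accuracy.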

Clinician Transcript Preference (CTP) Dataset
In order to compare CBERTScore's agreement with human preference, we sent out a Qualtrics survey to elicit judgment specifically from clinicians. We call this dataset the Clinician Transcript Preference dataset (CTP), and we will make it publicly available on the Open Science Framework (OSF).
We collected data on 150 sentences, divided into three groups of 50 trials each. Eighteen subjects with a clinical background responded to more than half of the questions. Fig. 1 (left) describes clinician backgrounds. Each participant was randomly assigned to a group to ensure approximately uniform response coverage. For each trial, participants were given a ground truth sentence and two "transcripts" and asked to select the less useful one or to indicate that the two were about the same. An example of such a triplet is as follows:
Target: "Patient elects to go under Propofol sedation."
#1: Patient elects to go under Prilosec sedation.
#2: Patient selects to go under Propofol sedation.
The survey was designed to take no more than 20 min to minimize the cognitive strain on participants. One sentence was malformed, resulting in 149 sentences for the final dataset.

Constructing the CTP triplets
To generate the triplets of (target, transcript #1, transcript #2) used in the survey, we started by downloading publicly available YouTube videos on colonoscopies created by GI physicians and educational institutes. The target sentences were transcribed by Google's publicly available Speech-to-Text medical dictation model [17] and manually checked for accuracy. Filler words such as "uh" and repeated words were edited out. Sentences longer than 30 words or shorter than 5 words were discarded.
For each target sentence, transcript #1 was generated by one of Google's other, non-medical, publicly available ASR models. Transcripts with an edit distance [18] outside [1,3] were discarded. This procedure generated 1220 candidate sentences.
To ensure that the two transcripts were roughly comparable in terms of fidelity, transcript #2 was generated synthetically. We used a publicly available English word frequency dictionary [19] to select words in the target sentence that were candidates for synthetic errors. Candidate words were at least 5 characters long, appeared in the 1M word dictionary fewer than 10 times, and were not proper nouns. 486 candidate sentences matched these criteria. Finally, transcript #2 was generated by deleting the candidate word or manually substituting it with a phonetically similar word or phrase [footnote 3]. We discarded similar sentences and selected 150 triplets for the final survey. The ordering of the two transcripts was randomized, as was the ordering of the sentences.
[Figure caption: Agreement with speech pathologist raters on the non-medical dataset, when restricting the data to cases where there is a fidelity difference between two candidate transcripts.]
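The candidate-word criteria (at least 5 characters, fewer than 10 occurrences in the frequency dictionary, not a proper noun) can be sketched as a simple filter. The word counts and the proper-noun set below are invented for illustration, and treating words absent from the dictionary as having a count of zero is an assumption.

```python
from typing import Dict, FrozenSet, List


def candidate_error_words(
    sentence: str,
    word_counts: Dict[str, int],
    proper_nouns: FrozenSet[str] = frozenset(),
    min_length: int = 5,
    max_count: int = 10,
) -> List[str]:
    """Return words that are long enough, rare enough, and not proper nouns."""
    candidates = []
    for raw in sentence.split():
        word = raw.strip(".,;:!?\"'")
        if len(word) < min_length:
            continue
        if word in proper_nouns:
            continue
        # Words missing from the dictionary are treated as count 0 (assumption).
        if word_counts.get(word.lower(), 0) >= max_count:
            continue
        candidates.append(word)
    return candidates


# Made-up counts standing in for the real frequency dictionary.
toy_counts = {"patient": 500, "elects": 40, "under": 9000, "sedation": 3}
selected = candidate_error_words(
    "Patient elects to go under Propofol sedation.",
    toy_counts,
    proper_nouns=frozenset({"Propofol"}),
)
```

In this toy setup only "sedation" survives the filter; the substitution step itself was manual in the original procedure and is not modeled here.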

Evaluating metrics on the CTP
To compare the ability of different metrics to agree with rater preference from the CTP, we define a 3-class classification problem as follows:

$$\mathrm{pred}(t_1, t_2) = \begin{cases} t_1 \text{ less useful} & \text{if } M(t_2) - M(t_1) > l \\ t_2 \text{ less useful} & \text{if } M(t_1) - M(t_2) > l \\ \text{about the same} & \text{otherwise} \end{cases}$$

where $M$ is a metric, $t_i$ are the transcripts, and the predictions are reversed for the WER, since lower values indicate higher fidelity. The cutoff $l$ is a free variable, which we optimize separately for each metric. We split the data into two halves, choose the best performing $l$ on one half, and report the accuracy using that $l$ on the second half.
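One way to picture this evaluation is the sketch below. It assumes the prediction compares the difference between the two metric scores against the cutoff l, and that the cutoff is picked from a fixed list of candidate values; the label strings, function names, and toy scores are all hypothetical.

```python
from typing import List, Sequence, Tuple


def predict_less_useful(
    score_1: float, score_2: float, cutoff: float, lower_is_better: bool = False
) -> str:
    """Predict which transcript is less useful, or 'same' within the cutoff."""
    if lower_is_better:  # e.g. WER, where lower values mean higher fidelity
        score_1, score_2 = -score_1, -score_2
    diff = score_1 - score_2
    if diff > cutoff:
        return "transcript_2"
    if diff < -cutoff:
        return "transcript_1"
    return "same"


def accuracy(
    score_pairs: Sequence[Tuple[float, float]],
    labels: Sequence[str],
    cutoff: float,
    lower_is_better: bool = False,
) -> float:
    """Fraction of pairs whose prediction equals the rater label."""
    hits = sum(
        predict_less_useful(s1, s2, cutoff, lower_is_better) == label
        for (s1, s2), label in zip(score_pairs, labels)
    )
    return hits / len(labels)


def tune_cutoff(
    score_pairs: Sequence[Tuple[float, float]],
    labels: Sequence[str],
    cutoffs: List[float],
    lower_is_better: bool = False,
) -> float:
    """Pick the cutoff with the best accuracy on the tuning half."""
    return max(
        cutoffs, key=lambda c: accuracy(score_pairs, labels, c, lower_is_better)
    )


# Toy tuning half: metric scores for (transcript 1, transcript 2) and rater labels.
tuning_pairs = [(0.9, 0.5), (0.52, 0.5), (0.3, 0.8)]
tuning_labels = ["transcript_2", "same", "transcript_1"]
chosen_cutoff = tune_cutoff(tuning_pairs, tuning_labels, [0.0, 0.1, 0.5])
```

The chosen cutoff would then be frozen and applied to the held-out half to report accuracy.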

Non-medical sentences
To demonstrate that CBERTScore doesn't degrade on non-medical speech, we compare the metrics' agreement with rater preferences on a dataset with annotations similar to [20]. Part of this dataset consists of 5-tuples of (ground truth sentence, transcript 1, transcript 2, assessment 1, assessment 2), where the sentence assessments describe how much of the ground truth sentence's meaning is captured in the transcript. We used a subset of 103 utterances from our annotated data where the ratings were not the same, and at least one transcript was rated as having "Major errors". We report performance using a formulation similar to the CTP evaluation in Sec. 3.2.2: we frame this as a 2-way classification problem (no cutoff is needed since we exclude tuples that have the same rating).
3 A Python fuzz search algorithm based on the CMU Pronouncing Dictionary was used for consistency.

Clinician responses
Eighteen clinicians responded to a total of 149 triplet questions. Each question had 5 or 6 responses. 78% of questions had more than half agreement on which transcript was less useful and 42% had more than 80% agreement. Clinicians thought the transcripts had the same usefulness in 21% of cases.
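The agreement thresholds mentioned above can be illustrated with a small helper that returns the most common response for a question only when its share exceeds a threshold. This is a sketch with invented responses; the label strings and the choice to treat "same" like any other response are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(responses: Sequence[str], min_agreement: float) -> Optional[str]:
    """Return the top response if its share is strictly above min_agreement."""
    if not responses:
        return None
    label, count = Counter(responses).most_common(1)[0]
    if count / len(responses) > min_agreement:
        return label
    return None


# Invented responses for two questions with 5 and 6 raters.
five_raters = ["transcript_1", "transcript_1", "transcript_1", "same", "transcript_2"]
six_raters = ["transcript_1"] * 5 + ["transcript_2"]

label_half = majority_label(five_raters, 0.5)
label_strict = majority_label(five_raters, 0.8)
label_six = majority_label(six_raters, 0.8)
```

A question with 3 of 5 matching responses passes the more-than-half threshold but not the more-than-80% threshold, whereas 5 of 6 passes both.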

Metric agreement on medical text
We report 3-way classification accuracy on the CTP dataset using two labeling schemes (Fig. 2). In the first, we only look at the questions where more than half the respondents agreed. In the second, we report accuracy on the questions where more than 4/5 of the respondents agreed. For both schemes, we determine the cutoff from one half of the data and report accuracy on the second half.
First, the metric ordering by performance is the same using both labeling schemes, and the best CBERTScore medical weighting factor was the same using both label schemes. Second, BERTScore and CBERTScore are significantly more closely aligned with clinician preferences than other metrics. Third, CBERTScore weighted entirely toward medical terms outperforms or ties with BERTScore agreement. Fourth, the weighted combination of medical and non-medical terms outperforms other metrics in terms of clinician agreement. Fifth, the medical component meaningfully improves performance of CBERTScore over BERTScore (75.9% vs 67.2% and 87.5% vs 84.4%).

Metric agreement on non-medical text
CBERTScore was the second best performing metric on non-medical text. Importantly, the addition of the medical component did not degrade the performance compared to BERTScore.

Knowledge Graph medical terms wins and losses on the CTP
The CTP (Sec. 3.2) had 127 distinct words that were the source of transcript errors, and 684 distinct other words. The medically-relevant terms used in the CBERTScore algorithm, identified primarily from the Knowledge Graph as described in Sec. 3.1.1, intersected with 99 of the 127 transcript error words. By manual inspection, 25 of the 28 transcript error words in the CTP not included in the CBERTScore word list were used in a medical context but were not exclusively medical in meaning (ex. "surveillance", "tethered", and "longitudinal"). 3 of the 28 missed words did have a primarily medical meaning, but were not included in the CBERTScore list either due to errors in the KG or errors in the queries generating the list ("cologuard", "colonoscope", "protuberance"). Some of the words have a clear meaning in a medical context, and could be manually added to the list for future applications ("snare", "suctioning", etc.). The CBERTScore word list included 100 words that weren't selected for transcript errors. Many of these are medical in nature, but were not selected for synthetic transcript errors via the method described in Sec. 3.2 (ex. "endoscope", "hypoplastic", "lymphoma").

Fig. 3 (left) shows the degree to which better performing metrics subsume other metrics, or make a different pattern of mistakes. The plot shows the ratio (Metric Y correct)/(Metric X and Y disagree). Metrics that have higher clinician agreement and a high fraction on this plot are strictly better, whereas a metric with higher agreement but a low value in this plot indicates that another metric might have additional signal. We see that CBERTScore is nearly strictly better than the other metrics, with the possible exception of METEOR (when they differ, METEOR gives the correct rating in roughly a third of cases).
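The quantity plotted, (Metric Y correct)/(Metric X and Y disagree), can be computed as in the sketch below. The predictions and labels are invented, and the function name is hypothetical.

```python
from typing import Optional, Sequence


def fraction_y_correct_when_disagree(
    pred_x: Sequence[str], pred_y: Sequence[str], labels: Sequence[str]
) -> Optional[float]:
    """Among items where X and Y predict differently, how often Y is correct."""
    disagreements = [
        (y, truth) for x, y, truth in zip(pred_x, pred_y, labels) if x != y
    ]
    if not disagreements:
        return None  # the ratio is undefined when the metrics never disagree
    return sum(y == truth for y, truth in disagreements) / len(disagreements)


# Invented predictions from two metrics on four triplets.
metric_x = ["transcript_1", "transcript_2", "same", "transcript_1"]
metric_y = ["transcript_1", "transcript_1", "transcript_2", "transcript_2"]
rater_labels = ["transcript_1", "transcript_1", "same", "transcript_2"]

ratio = fraction_y_correct_when_disagree(metric_x, metric_y, rater_labels)
```

Because the task has three classes, both metrics can be wrong on a disagreement, so the ratios for (X, Y) and (Y, X) need not sum to one.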
There were some triplets that CBERTScore got correct that no other metric did. The improvements over BERTScore always involved a medical term, and sometimes involved encouraging the metric to prioritize medical mistakes (ex. "Marked the site with 5 cc's of indigo carmine." → "Marked the site with 5 cici's of indigo carmine." vs "Marked the sight with 5 cc's of indigo carmine."). There were thirteen triplets that the neural word embeddings predicted correctly that other metrics did not. Many of these wins came from the strength of neural word embeddings penalizing less for semantically similar mistakes (ex. "Small burst of coagulation to create a darkish white ablation." → "Small burst of coagulation to create a darkish white oblation." vs "Small burst of coagulation to create a dark white ablation."). Furthermore, BERTScore agreed with clinicians on some medical word mistakes, likely due to the BERT embedding somewhat understanding when a transcript error leads to a large semantic change in a medical term (ex. "No ongoing infection or coagulopathy." → "No on going infection or coagulopathy." vs "No ongoing infection or glomerulopathy.").

CBERTScore mistakes
Fig. 3 shows that METEOR made the most correct predictions when CBERTScore was incorrect. Some mistakes are due to the KG medical list being incomplete. For example, "longitudinal" was not included, but has medical meaning in clinical contexts (ex. "The longitudinal extent of the hot snare." → "The long eternal extent of the hot snare." vs "The longitudinal extend to the hot snare.").
Another pattern of mistake occurs when a non-medical adjective contains an error, but the adjective modifies a medical term in an important way. For example, "vessel" is a medical term, but "feeding" is not (ex. "This polyp is at high risk of bleeding, with multiple feeding vessels." → "This polyp is at high risk of bleeding, with multiple seeding vessels." vs "This polyp is at high risking bleeding, with multiple feeding vessels."). This suggests that future work might account for modifiers and dependencies when calculating clinical importance.
Finally, a third pattern of mistake involves the fact that METEOR penalizes complex correspondences between candidate and reference sentences, while CBERTScore only considers the best pairwise word matches. One example in the CTP preserves most of the words, but reorders them (ex. "Inject into the head of the polyp, another 1 to 2 cc." → "Injectant the head of the polyp, another 1 to 2 cc." vs "Inject into the head of the polyp, another 1 2 to cc.").

Conclusions
We present CBERTScore, a novel metric that combines medical domain knowledge and recent advances in neural word embeddings. We collect a benchmark of clinician rater preferences on transcript errors, demonstrate that CBERTScore is more closely aligned with clinician preferences than existing metrics, and release the benchmark for the research community to continue to improve ASR in medical contexts.