How Might We Create Better Benchmarks for Speech Recognition?

The applications of automatic speech recognition (ASR) systems are proliferating, in part due to recent significant quality improvements. However, as recent work indicates, even state-of-the-art speech recognition systems, including some that deliver impressive benchmark results, struggle to generalize across use cases. We review relevant work and, hoping to inform future benchmark development, outline a taxonomy of speech recognition use cases proposed for the next generation of ASR benchmarks. We also survey work on metrics beyond the de facto standard Word Error Rate (WER), and we introduce a versatile framework designed to describe interactions between linguistic variation and ASR performance metrics.


Introduction
The applications of ASR systems are many and varied: conversational virtual assistants on smartphones and smart-home devices, automatic captioning for videos, text dictation, and phone chat bots for customer support, to name a few. This proliferation has been enabled by significant gains in ASR quality. ASR quality is typically measured by word error rate (WER): informally, the word-level Levenshtein distance between the target transcript and the machine-generated transcript, normalized by the length of the target transcript (Levenshtein, 1966; Wang et al., 2003); see Section 3.
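Concretely, given the minimal-cost word alignment between the hypothesis and the reference, WER is conventionally computed as

\[ \mathrm{WER} = \frac{S + D + I}{N}, \]

where \(S\), \(D\), and \(I\) are the numbers of substituted, deleted, and inserted words, and \(N\) is the number of words in the reference transcript.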
Current state-of-the-art WER is now in the low single digits for the widely used Librispeech benchmark set (Panayotov et al., 2015), with e.g. Zhang et al. (2020) achieving a WER of 1.4%. However, as Szymański et al. (2020) have pointed out, our current ASR benchmarks overall leave much to be desired when it comes to evaluating performance across multiple real-world applications. Typical benchmark sets beyond Librispeech include TIMIT (Garofolo et al., 1993), Switchboard (Godfrey et al., 1992), WSJ (Paul and Baker, 1992), CALLHOME (Canavan et al., 1997), and Fisher (Cieri et al., 2004). (For an overview of such datasets and benchmarks, see https://github.com/syhw/wer_are_we; additionally, FAIR recently released the Casual Conversations dataset, intended for AI fairness measurements (Hazirbas et al., 2021).) These benchmark sets cover a range of speech use cases, including read speech (e.g. Librispeech) and spontaneous speech (e.g. Switchboard).
However, with many ASR systems benchmarking in the low single digits, small improvements have become increasingly difficult to interpret, and the remaining errors may be concentrated in specific phenomena. For example, for Switchboard, a considerable portion of the remaining errors involve filler words, hesitations, and non-verbal backchannel cues (Xiong et al., 2017; Saon et al., 2017).
Furthermore, achieving state-of-the-art results on one of these sets does not necessarily mean that an ASR system will generalize successfully when faced with input from a wide range of domains at inference time: as Likhomanenko et al. (2020) show, "no single validation or test set from public datasets is sufficient to measure transfer to other public datasets or to real-world audio data". In one extreme example, Keung et al. (2020) show that modern ASR architectures may even start emitting repetitive, nonsensical transcriptions when faced with audio from a domain that was not covered at training time, even when they achieve perfectly acceptable Librispeech evaluation numbers. Inspired by Goodhart's law, which states that any measure that becomes a target ceases to be a good measure, we argue that, as a field, it behooves us to think more about better benchmarks in order to gain a well-rounded view of the performance of ASR systems across domains.
In this paper, we make three contributions. First, we provide a taxonomy of relevant domains, based on our experience developing ASR systems for use in many different products, with the goal of helping make next-generation benchmarks as representative as possible (Biber, 1993). Second, we argue that optimizing only for WER, as most current benchmarks imply, does not reflect considerations that are ubiquitous in real-world deployments of ASR technology: for example, production considerations such as latency and compute resources can imply additional interrelated optimization objectives. We survey relevant work on additional metrics that can be used to measure ASR systems. Third, we describe what metadata would be useful in next-generation benchmark data sets in order to help analyze the interaction between linguistic variation and the performance of ASR systems, for example, to measure how well an ASR system holds up in the face of sociolinguistic variation within the target language, or second-language accents, as in e.g. Feng et al. (2021).

ASR Use Cases
With ASR use cases spanning many applications and tasks, ASR systems would ideally be robust to various classes of variation in speech input. For example, an ASR system which provides automatic captions for video meetings would need to recognize words from many different semantic fields, adapting to the topic of the meeting. Speech characteristics may also vary across domains: for example, the speech style used when dictating text messages differs from the style of a group conversation, where speakers may occasionally talk over each other.
An ideal benchmark set would include what we will call 'horizontal' and 'vertical' variation. Horizontal challenges refer to the wide variety of scenarios in which ASR may be used, while vertical challenges involve cross-cutting technical factors such as diversity in topics, encoding formats, and acoustic conditions.
Horizontals: ASR applications
ASR application domains can be roughly subdivided based on the number of speakers, the mode of speech (spontaneous vs. prepared speech), and the intended recipient (human or device). An ideal benchmark set would cover as many of these horizontals as possible, e.g. by merging existing benchmark sets, as do Likhomanenko et al. (2020), and adding additional data to cover any gaps.
Dictation Text dictation is a popular use case of ASR systems, and one of the first successful commercial applications with broad appeal. This feature serves both convenience and accessibility, allowing users to enter text without manually typing. Dictation tends to involve relatively slow speech, typically that of a single speaker, who is aware they are interacting with a device, and who may consciously modify their speech patterns to facilitate device understanding (Cohn et al., 2020). Dictation has applications in many fields. One with many idiosyncratic challenges is medical dictation, where ASR systems are used to help medical personnel take notes and generate medical records (Miner et al., 2020; Mani et al., 2020). This poses challenges in the support of domain-specific jargon, which we discuss in the Verticals subsection below. In a related application, dictation practice is sometimes used by language learners, often in combination with a pronunciation feedback system (McCrocklin, 2019). In other contexts, transcription of dictated audio may be part of a composite pipeline, such as automatic translation, where the initial transcript feeds a subsequent system that translates it into another language.
Voice Search and Control Voice search and other conversational assistant products enable users to access information or invoke actions via spoken input. Similar to dictation, audio in such settings is typically single-speaker, with human-to-device characteristics. Compared to dictation, queries tend to be somewhat shorter, and may contain proper nouns (e.g. place names or business names). Semiotic-class tokens such as times (Sproat et al., 2001) are also more common in this setting. A related type of human-to-device speech is interactive voice response (IVR), where callers to customer support may first interact with a voice chatbot, which can help gather information prior to redirecting the call, or potentially resolve issues itself (Inam et al., 2017).
Voicemails, Oration, and Audiobooks While dictation users may modify their speech based on the knowledge that they are dictating directly to a device, ASR systems may also be used to help provide transcriptions for voicemail messages (Padmanabhan et al., 2002; Liao et al., 2010), parliamentary speeches (Gollan et al., 2005; Steingrímsson et al., 2020), and so on. Such settings, while still typically single-speaker, include artifacts of spontaneity, e.g. fillers or hesitations like 'uh', backchannel speech, as well as disfluencies, false starts, and corrections (Jamshid Lou and Johnson, 2020; Mendelev et al., 2021; Knudsen et al., 2020). Transcribing audiobooks includes elements of dictation and oration: due to their read-speech nature, audiobooks typically contain less spontaneity than typical human-to-human speech (Igras-Cybulska et al.), but they are usually more natural than human-to-device speech.

Conversations and Meetings
In settings such as human-to-human conversations, the task of the ASR system typically involves transcribing spontaneous speech among several participants within a single audio recording. For example, meeting transcription can help improve the accessibility of video meetings, or may serve to document conversations (Kanda et al., 2021); see e.g. Janin et al. (2004); Carletta et al. (2005) for relevant data sets. Another use case is transcribing customer-agent conversations and other types of telephony, which can help monitor the quality of phone-based customer service.
Podcasts, Movies and TV Podcast transcription forms a related, and fast-growing, application area, with recent data sets including Clifton et al. (2020). Podcast transcription is in some ways similar to the long-standing task of automatically transcribing interviews, e.g. to help make them more accessible, as in various oral-history projects (Byrne et al., 2004). Finally, another similar use case is the transcription of motion pictures, including documentaries, which may require increased robustness to non-speech audio, such as music and special effects. Spontaneous speech is common to these human-to-human, multi-speaker settings, with fillers such as 'uh', overlap, and interruptions between speakers. We draw a distinction between movie subtitling and TV closed captioning. Subtitling is an 'offline' task in that the entire audio is available to the ASR system at recognition time, and the setting allows for multiple passes, including human post-editors. Compare this to closed captioning, where streaming ASR processes a live broadcast under tight latency constraints. Additionally, these two modes have different transcription conventions and formatting requirements. Subtitles often contain non-verbal cues that support comprehension for the hearing impaired, and are optimized for readability. Conversely, closed captions are often projected in upper case and with fewer formatting constraints, using devices such as line breaks to denote speaker turns.

Verticals: Technical challenges
ASR applications do not just differ in the style of speech. Other dimensions include: the semantic content of the input speech (a lecture about nuclear physics involves very different terminology than a phone conversation to set up a car maintenance appointment), the audio encoding format, and sample rate, among others. Again, the ideal benchmark should cover as many of these factors as possible.
Terminology and Phrases ASR systems applied to a wide range of domains need to recognize hundreds of thousands, if not millions, of distinct words. Such systems typically involve a language model trained on large volumes of text from multiple sources. To benchmark an ASR system's capability across a wide range of topics, test sets could include terms and phrases from many different fields: consider medical terminology (e.g. 'ribonucleotides'), historical phrases (e.g. 'Yotvingians'), and many more. ASR systems should also be savvy to neologisms (e.g. 'doomscrolling'), although, admittedly, the fast-changing nature of neologisms and trending phrases makes this particularly challenging. Another area that deserves special attention in measurements is loanwords, which may have pronunciations that involve unusual grapheme-to-phoneme correspondences; such words may even necessitate personalized pronunciation learning (Bruguier et al., 2016).
Speed Recordings where speech is significantly faster or slower than average may pose additional recognition challenges (Siegler and Stern, 1995; Fosler-Lussier and Morgan, 1999), so the ideal benchmark should also cover samples with various speech rates. This is particularly important for paid services, where users sometimes artificially speed up recordings or cut out easily detectable portions of silence in order to reduce costs. Such processing can introduce unnatural shifts in pitch and add confusion around punctuation at speaker-turn and sentence boundaries.
Acoustic Environment The setting in which the input audio was recorded (in-person or phone conversation, video call, dictation) can also materially impact ASR performance, and settings with high amounts of background noise can be particularly challenging. Ideally, test sets should be available to measure how robust an ASR system is in the face of background noise and other environmental factors (Park et al., 2019; Kinoshita et al., 2020). The entertainment domain contains a large number of scenes with background music, whose lyrics are usually not meant to be transcribed. Even call center conversations sometimes contain hold music, which is not part of the payload of the call.
Encoding Formats Lastly, different audio encodings (linear PCM, A-law, µ-law), codecs (FLAC, OPUS, MP3), and non-standard sample rates such as 17 kHz may affect recognition quality, and should be represented (Sanderson and Paliwal, 1997; Hokking et al., 2016). The same holds for audio that has been up- or down-sampled, e.g. between the 8 kHz typical for telephony and the 16 kHz or above used for broadcast media.

Practical Issues
We argue that the more horizontal and vertical areas are covered by a benchmark, the more representative it will be, and hence the more appropriate for measuring ASR progress. There are some practical matters that are also important to consider when creating the ideal benchmark.
Transcription Conventions Creating transcriptions of human speech in a consistent manner can be unexpectedly challenging: for example, should hesitations like 'uh' be transcribed? How should transcribers handle unusual cases like the artist 'dead mouse', whose name is written as 'deadmau5' by convention? And if a speaker says 'wanna', should the transcription reflect that as such, or should the transcriber render it as 'want to'? The answers to such questions will depend on the downstream use context (e.g. a dialog system, where hesitations may be useful, or an email message, where they may need to be omitted instead). For example, while omitting repetitions, disfluencies, and filler words (e.g. "like", "kind of") is considered desirable in closed captioning or podcast transcriptions, this might not be appropriate for some other ASR domains such as subtitling. Defining and applying a comprehensive set of transcription conventions, as was done for e.g. Switchboard (Godfrey et al., 1992) and CORAAL (Kendall and Farrington, 2020), is critical in building high-quality data sets. It is also important to detect and correct transcription errors in annotated corpora (Rosenberg, 2012).
Perhaps the most important choice in such transcription conventions is whether to adopt 'spoken-domain' transcriptions, where numbers are spelled out in words (e.g. 'three thirty'), or 'written-domain' transcriptions, where they are rendered in the typical written form ('3:30'). Many data sets use spoken-domain transcriptions only, but in real-world ASR deployments it is often valuable, for readability and downstream usage (e.g. by a natural-language understanding system), to have fully-formatted, written-domain transcripts, as described by O'Neill et al. (2021), who also provide a written-domain benchmark data set.

Representativeness For any ASR test set, at least two considerations come into play: first, how closely does the test set approximate reality; and second, is the test set sufficiently large to be representative? For example, test sets that are intended to measure how well an ASR system deals with speech with background noise should have a realistic amount of background noise: not too little, but also not too much, e.g. to the point that even human listeners stand no chance of transcribing the audio correctly. Adding noise artificially, as established e.g. by the Aurora corpora (Pearce and Hirsch, 2000; Parihar and Picone, 2002), does not take into account the Lombard effect, whereby speakers involuntarily adjust their speech in the presence of noise. In terms of size, analyses akin to Guyon et al. (1998) are helpful to ensure that any change is statistically significant; we are not aware of much work along these lines for ASR systems specifically, but it seems worthwhile to explore this area more. The ultimate goal should be to increase the predictive power of error metrics.
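One commonly used approach to such significance questions, shown here only as a minimal sketch, is a paired bootstrap over utterances; this is not the specific analysis of Guyon et al. (1998), and the function and argument names are illustrative.

```python
import random

def paired_bootstrap_pvalue(errors_a, errors_b, ref_lengths,
                            n_resamples=10_000, seed=0):
    """Paired bootstrap over utterances: estimate how often system A would
    fail to beat system B on a resampled test set of the same size.
    errors_a[i] / errors_b[i] are word-level error counts of each system on
    utterance i, ref_lengths[i] is the reference word count."""
    rng = random.Random(seed)
    n = len(ref_lengths)
    a_not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        wer_a = sum(errors_a[i] for i in idx) / sum(ref_lengths[i] for i in idx)
        wer_b = sum(errors_b[i] for i in idx) / sum(ref_lengths[i] for i in idx)
        a_not_better += wer_a >= wer_b
    return a_not_better / n_resamples
```

A small returned value suggests the observed WER improvement is unlikely to be a statistical fluctuation; larger test sets generally shrink the confidence intervals such an analysis produces.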

Metrics: WER and Beyond
Assume, for the sake of argument, that an impressive selection of test sets has been collected in order to create our imagined ideal next-generation benchmark for ASR, covering many use cases, technical challenges, and so on. The performance of an ASR system could now be measured simply by computing a single, overall WER across all the utterances in this collection of test sets-and a system that yields lower WER on this benchmark could be said to be 'better' than a system with higher WER.
However, in a real-world deployment setting, the question of which system is 'best' typically relies on an analysis of many metrics. For example, imagine a system with a WER of 1.5% but an average transcription latency of 2500 milliseconds, and another system that achieves 1.6% WER but a latency of only 1250 milliseconds: in many settings, the second system could still be more suitable for deployment, despite achieving a worse WER. Of course, 'latency' itself is not a well-defined term: sometimes the measurement is reported as the average delay between the end of each spoken word and the time it is emitted by the ASR system, while in other cases the measure is based only on the first or the last word in an utterance. Neither is well-defined in the presence of recognition errors. Yet another kind of latency is end-to-end latency, involving everything between the microphone activity and the final projection of results, including network overhead and optional post-processing such as capitalization and punctuation. A "pure" ASR latency metric ignores those and focuses on the processing time of the recognizer, while latency in the context of voice assistant commands may consider the delay before successful recognition of a command, which might sometimes precede the actual end of the utterance. In this section, we describe how, much like latency, even WER itself has many nuances, and we point to other metrics, beyond WER and latency, that can be taken into account when measuring ASR systems.
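As a minimal sketch of the first latency definition above (per-word emission delay), the following hypothetical helper assumes error-free word-level alignments between audio and emissions, which real systems cannot guarantee.

```python
def per_word_latency_ms(word_end_times_ms, emission_times_ms):
    """Average delay between the time each spoken word ends in the audio and
    the time the ASR system emits it. Both inputs are parallel lists of
    millisecond timestamps for the same words (an idealized assumption)."""
    delays = [emit - end for end, emit in zip(word_end_times_ms, emission_times_ms)]
    return sum(delays) / len(delays)
```

First-word or last-word latency would simply use the corresponding single entries of these lists instead of the average.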

WER
The workhorse metric of ASR is the Word Error Rate, or WER. Calculating WER is relatively easy on spoken-domain transcriptions with no formatting (e.g. 'set an alarm for seven thirty'), but quickly becomes a nuanced matter when processing written-domain transcriptions. For example, if the ground truth is provided as 'Set an alarm for 7:30.' with capitalization and punctuation, is it an error in WER terms if the system emits lowercase 'set' instead of uppercase 'Set', as given in the ground truth? Typically, for standard WER calculations in such scenarios, capitalization and word-final punctuation are not considered; separate metrics are calculated for fully-formatted output, e.g. case-sensitive WER, where 'set' vs. 'Set' would be considered an error.
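As a minimal illustration (not the implementation of any particular toolkit), the sketch below computes WER via a word-level edit distance, with a simple normalization step that ignores capitalization and word-final punctuation; the normalization rules are assumptions chosen for this example.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and strip word-final punctuation, so that 'Set' vs 'set'
    and a trailing period are not counted as errors."""
    text = text.lower()
    return re.sub(r"[.,!?;:]+(\s|$)", r"\1", text)

def wer(reference: str, hypothesis: str, case_sensitive: bool = False) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    if not case_sensitive:
        reference, hypothesis = normalize(reference), normalize(hypothesis)
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Set an alarm for 7:30.", "set an alarm for 7:30"))   # 0.0: formatting ignored
print(wer("Set an alarm for 7:30.", "set an alarm for 7:30",
          case_sensitive=True))                                  # 0.4: 'Set' and '7:30.' mismatch
```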
WER can also be calculated on only a subset of relevant words or phrases: for example, it may be helpful to compute separate error rates for different kinds of semiotic classes, such as spoken punctuation, times, or phone numbers, as well as for different semantic areas, such as relevant domain terminology vs. generic English words. The assessment of ASR quality on rare phrases is yet another issue: average WER does not always adequately reflect how well an ASR system picks up rare yet important words, suggesting it may be valuable to know WER separately for common and less common words. A related approach is to use precision and recall, e.g. as Chiu et al. (2018) do for medical terminology. Such 'sliced' approaches can help provide insight into the recognition quality of words or phrases that are particularly salient in a given setting. For example, if a system that is intended for use in a voicemail transcription setting achieves 3% overall WER, but it mistranscribes every phone number, that system would almost certainly not be preferred over a system that achieves 3.5% overall WER but makes virtually no mistakes on phone numbers. As Peyser et al. (2019) show, such examples are far from theoretical; fortunately, as they also show, it is possible to create synthetic test sets using text-to-speech systems to get a sense of WER in a specific context. Standard tools like NIST SCLITE can be used to calculate WER and various additional statistics.
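As a rough sketch of one such sliced measurement, the hypothetical function below estimates the recall of a list of key phrases via string matching; Chiu et al. (2018) use a more careful alignment-based formulation for medical terms, and this is not their method.

```python
def key_phrase_recall(references, hypotheses, key_phrases):
    """Fraction of key-phrase occurrences in the references that also appear
    in the corresponding hypotheses (a crude recall proxy; case-insensitive)."""
    found = total = 0
    for ref, hyp in zip(references, hypotheses):
        for phrase in key_phrases:
            count = ref.lower().count(phrase.lower())
            total += count
            found += min(count, hyp.lower().count(phrase.lower()))
    return found / total if total else None
```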
Importantly, it is possible to calculate the local WER at any level of granularity: utterance, speaker turn, file, entire recording, etc. The average WER alone, weighted by the number of words, is not sufficient to describe the shape of the distribution over the individual local measurements. Given two ASR systems with identical WERs, we almost always prefer the one with the lower standard deviation, as it reduces the uncertainty with respect to the worst case. A more informative summary of the shape of the distribution consists of percentiles (e.g. 90, 95, or 99), which are more suitable for providing an upper bound. Additionally, reporting the standard deviation allows researchers to judge whether an improvement in WER is significant or just a statistical fluctuation. The same argument holds for latency.
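One possible way to report such distributional statistics alongside the corpus-level average is sketched below; it assumes per-utterance error counts and reference lengths are available, and the names are illustrative rather than from any existing tool.

```python
import numpy as np

def wer_distribution(per_utt_errors, per_utt_ref_lengths):
    """Summarize the distribution of local (per-utterance) WERs.
    per_utt_errors[i] and per_utt_ref_lengths[i] are the word-level error
    count and reference word count for utterance i."""
    errors = np.asarray(per_utt_errors, dtype=float)
    lengths = np.asarray(per_utt_ref_lengths, dtype=float)
    local_wer = errors / np.maximum(lengths, 1)
    return {
        # Corpus-level WER, i.e. local WERs weighted by reference length.
        "weighted_mean": errors.sum() / lengths.sum(),
        "std": local_wer.std(),
        "p90": np.percentile(local_wer, 90),
        "p95": np.percentile(local_wer, 95),
        "p99": np.percentile(local_wer, 99),
    }
```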
Finally, WER can also be calculated on not just the top machine hypothesis, but also on the full n-best list, as in e.g. Biadsy et al. (2017).

Metadata about Words
Correctly transcribing speech into text is the most critical part of an ASR system, but downstream use cases may require more than just a word-by-word textual transcription of the input audio. For example, having per-word confidence scores can be helpful in dialog systems (Yu et al., 2011); having accurate timestamps at the word level is essential in many long-form applications, such as closed captioning, subtitling, and keyword search; having phonemic transcriptions for every word enables downstream disambiguation (e.g. when the transcription gives 'live', did the user say the verb [lɪv] or the adjective [laɪv]?); and emitting word timings to indicate where each word appeared in the audio can be important for search applications, especially for longer recordings. The ideal ASR benchmark would also make it possible to verify this metadata: for example, by using forced alignment to infer where in the audio words appear, making it possible to check how accurately an ASR system is emitting word timings (Sainath et al., 2020a). Speaker diarization is yet another type of metadata that can be emitted at a per-word or per-phrase level, for which independent benchmarks already exist (Ryant et al., 2021).

Real-Time Factor
A general metric for processing speed is the real-time factor (RTF), commonly defined as the ratio between the processing wall-clock time and the raw audio duration (Liu, 2000). Streaming ASR systems are required to operate at an RTF below one, but in applications that do not require immediate processing an RTF above one might be acceptable. As with WER and latency, RTF samples form a distribution, whose shape is important in understanding the behavior in the worst case. The process of finding the most likely hypothesis in ASR (often referred to as "decoding" for historical reasons) requires an efficient exploration of the search space: a subset of all possible hypotheses. The larger the search space, the slower the search, but the more likely the recognizer is to find the correct hypothesis. A small search space allows for quick decoding, but often comes at the cost of higher WER. It is common to report an RTF vs. WER curve which shows all possible operating points, allowing the two to be traded off against each other. Note that this definition operates on wall-clock time, thus ignoring the hardware requirements; it is common to normalize the RTF by the number of CPU cores and hardware accelerators.
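A minimal sketch of this definition follows; normalizing by core count is shown as one possible convention (multiplying so that the result reflects total compute per second of audio), not a universal standard.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float,
                     num_cores: int = 1) -> float:
    """RTF = wall-clock processing time / audio duration, optionally
    normalized by the number of CPU cores used during decoding."""
    return (processing_seconds / audio_seconds) * num_cores

# A 60-second recording decoded in 15 seconds of wall-clock time:
print(real_time_factor(15.0, 60.0))               # 0.25: faster than real time
print(real_time_factor(15.0, 60.0, num_cores=4))  # 1.0 once normalized by cores
```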

Streaming ASR
For ASR systems that stream output to the user while recognition is ongoing, as in many voice assistant and dictation applications, additional metrics are useful, e.g. measuring the stability of partial results, which reflects the number of times the recognizer changes previously emitted words while recognizing a query (Shangguan et al., 2020). A related dimension is the quality of the intermediate hypotheses: a streaming system that emits highly inaccurate intermediate hypotheses can yield a jarring user experience, even if the final hypothesis achieves an acceptable WER. This is particularly important in combination with a downstream application like machine translation, which can be very sensitive to corrections in partial hypotheses (Ansari et al., 2020).
Yet another factor is streaming latency, e.g. how quickly partials are emitted (Shangguan et al., 2021), and more generally, the delay between the end of the user's input and the finalized transcription (Sainath et al., 2020b). The accuracy of the endpointer module can significantly affect this latency: endpointers need to strike the right balance between keeping the microphone open while the user may still continue speaking (e.g. if the user pauses briefly to collect their thoughts) and closing it as soon as the user is likely to be done speaking, and a number of relevant endpointer metrics can be calculated.

Inference and Training
Latency is influenced by many factors beyond the quality of the endpointer: for example, the number of parameters in the ASR model, the surrounding software stack, and the computational resources available will impact the duration of the recognition process for an audio recording, in both streaming and non-streaming (batch) recognition settings. Compressing models can help them run faster, and in more settings (Peng et al., 2021), although the impact of shrinking models should be measured carefully (Hooker et al., 2020a,b).
Beyond inference, training may also be worth benchmarking in more detail: factors such as the number of parameters in the model, the model architecture, the amount of data used, the training software, and the hardware available will influence how long it takes to train an ASR model using a given algorithm. Benchmarks such as MLPerf (Mattson et al., 2020) do not yet incorporate speech recognition, but this may be worth exploring in the future.

Contextual Biasing
Certain phrases or words are sometimes expected in dialogue contexts (e.g. 'yes' or 'no'), along with particular types of words (e.g. brand names in the context of shopping). In such cases, ASR systems may allow for contextual biasing to increase the language model probability of relevant words or phrases (Aleksic et al., 2015). Measuring contextual biasing typically involves evaluating a relevant test set twice: once with, and once without, the contextual biasing enabled (the default behavior). Even when contextual biasing is enabled, it will typically be desirable for the system to continue to recognize other words and phrases without too much of an accuracy impact, so that recognition results remain reasonable in the event that the input does not contain the words or phrases that were expected; typically, anti-sets are used for this purpose, as described by Aleksic et al. (2015). Contextual biasing plays a key role in classical dialogue systems like IVR.

Hallucination
In some cases, ASR models can hallucinate transcriptions: e.g. providing transcriptions for audio even where no speech is present, or simply misbehaving on out-of-domain utterances (Liao et al., 2015; Keung et al., 2020). Intuitively, this type of error should be reported explicitly as the "insertion rate", which is calculated as part of the WER anyway. However, insertion errors are rather rare and do not stand out strongly in the presence of speech and natural recognition errors.
Measuring whether an ASR system is prone to such hallucinations can be done by running it on test sets from domains that were unseen at training time. In addition, it is possible to employ reject sets which contain various kinds of audio that should not result in a transcription: for example, such reject sets may cover various noises (e.g. AudioSet Gemmeke et al. (2017)), silence, speech in other languages, and so on.
A related topic is adversarial attacks, in which a particular message is 'hidden' in audio in a way that humans cannot hear, but which may deceive ASR systems into transcribing it in an unexpected way; measuring robustness to such issues would be desirable, but it remains an active area of research, much like the creation of such attacks more broadly (Carlini and Wagner, 2018).

Debuggability and Fixability
Finally, one aspect of ASR systems that tends to be important for real-world deployments, but which is hard to quantify in a numeric metric, is how easy it is to debug and fix any misrecognitions that may arise. For example, if a new word such as 'COVID-19' comes up which is not yet recognized by the system, it would be preferable if adding such a new word could be done without necessitating a full retrain of the system. While quantifying this property of ASR systems is hard, we believe that the degree to which it is easy to debug and fix any ASR system is worth mentioning.

Demographically Informed Quality
As previously discussed, the ideal benchmark for ASR systems would cover as many horizontals and verticals as possible, and would involve various kinds of metrics beyond just WER. Another important dimension, however, would be the availability of demographic characteristics, and analyzing the metrics based on such characteristics. Such demographic characteristics may correlate with linguistic variation-for example, non-native speakers of English may have an accent showing traces of their native language-which may in turn impact ASR performance. Having demographic characteristics can help produce analyses like the one reported by Feng et al. (2021), who analyzed differences in recognition performance for different accents, age ranges, and gender within an ASR system.
The ideal benchmark set, then, should include sufficient metadata to run similar analyses, enabling developers to understand how their system behaves when processing various accents or dialects, and to see whether factors like gender and age influence recognition performance in their system. Linguistic variation may take many different shapes, including:
• phonetic differences, e.g. vowel realizations that are specific to a given accent
• phonological differences, e.g. varying numbers of phonemes in different dialects of a language
• lexical differences, e.g. region-specific terms
• syntactic differences, e.g. double negatives
• voice quality differences, e.g. pitch differences, which are correlated with parameters such as gender and age (Liao et al., 2015)
Fortunately, several data sets already exist with relevant demographic tags for many utterances, e.g. Mozilla Common Voice (Ardila et al., 2020), which offers public data sets across many languages with dialect and accent tags. There are also academic data sets produced by sociolinguists, such as CORAAL for AAVE (Kendall and Farrington, 2020), ESLORA for Galician Spanish (Barcala et al., 2018), the Corpus Gesproken Nederlands for Dutch (van Eerten, 2007), and others. Such corpora provide a useful blueprint for providing such metadata, and we believe that it would be valuable for similar tags to be available for as many other data sets as possible. As Andrus et al. (2021) show, at times it will likely be difficult to get the demographic metadata that is needed, but still, getting such data wherever possible is important; as they put it, "what we can't measure, we can't understand".
Even where demographic information is already present in ASR evaluation sets, it can be valuable to conduct an analysis of the target user base for a deployed ASR system in order to ensure that all relevant tags are available. For example, if a data set has labels for four distinct accents, but the target user base is known from sociolinguistic research to use six distinct accents, this gap will not necessarily be evident when running an analysis of any possible differences among the four accents for which tags are available. It is important to understand the sociolinguistic characteristics of the target user base, and to cover as many of these properties as possible. Given that language has almost infinite variation as you zoom in (in the extreme, everyone has a slightly different voice), this is a task that requires careful sociolinguistic judgement and analysis, calling for interdisciplinary collaboration between linguists and developers of ASR systems.
Even when a rich set of tags is available, it can be difficult to interpret the results. We describe a simple, metric-independent, population-weighted visualization framework designed to evaluate ASR systems based on such demographic metadata. Our approach supports the different kinds of language variation outlined above, and we propose this analysis as a valuable addition to future benchmarks.

Population-Weighted Slicing Framework
Factors like accents (native or non-native), dialects, gender, and others can result in linguistic variation, and this may in turn impact ASR performance. Thus it can be valuable to calculate WER, latency, and other metrics not just on a data set as a whole, but also to slice metrics based on such meta-linguistic parameters.

Figure 1: Examples of WER sliced into groups A, B, and C, with the width of the bars reflecting relative sizes of those groups.
Such sliced metrics can be used to determine whether there is a performance gap between groups, and if so, what efforts may need to be undertaken to shrink such gaps. The ideal test set should be representative of the target user base, but as this may be hard to achieve at data collection time, it can make sense to re-weight any metrics based on real-world population statistics. For example, imagine a scenario where 98% of the recordings in a data set come from native speakers, with the remaining 2% coming from non-native speakers. If the target deployment setting involves more like 15% non-native speech, the metrics obtained over the 2% slice of the data set coming from non-native speakers should carry 15% of the weight.
To make such analyses easier, we propose subdividing all speakers into mutually exclusive groups based on relevant linguistic or demographic criteria. For example, consider a scenario where the real-world population is subdivided into 3 mutually exclusive groups: group A (60% of the population), group B (30%), and group C (10%). The two subplots of Figure 1 visualize examples of evaluations of two ASR models for slices corresponding to these groups, with the WER scores represented by the height of the bars, and the width of the bars reflecting the size of the groups.
Even if, in the actual test data set, group A covers 80% of the test data, with groups B and C accounting for 10% each (i.e. under-representing group B and over-representing group A), this population-weighted framework provides an intuitive way to address the imbalance and understand how ASR systems perform in the face of linguistic diversity. The average WER of the system can be calculated as an average of the WER scores across population groups, weighted according to the size of those groups, which may differ from the WER obtained by simply calculating the WER on the actual data set, as we have re-weighted based on the real-world distribution.
Importantly, while the average weighted WER is a useful metric, the full distribution should still be understood: continuing the example depicted in Figure 1, the average WER for both scenarios would be 10%, but the disparity between the groups is clearly much larger in the scenario where group C achieves a WER of 19.3% than in the other.
Given WER measurements for several groups of speakers, we should also measure the disparity of ASR performance across the groups. In a simplified way, one could calculate the difference between the best-performing and the worst-performing groups, but see Mitchell et al. (2020) for a general discussion of ML fairness metrics. While the WER gap between the best-performing and the worst-performing groups for the scenario depicted in the second subplot of Figure 1 is 3.5 absolute points, the gap is 12.8 absolute points for the distribution in the first subfigure: despite these two systems having the same average WER, one system is clearly more consistent than the other.
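The following sketch shows the population-weighted average and the best-vs-worst gap in code. The per-group WERs are hypothetical values chosen only so that the aggregates match the numbers discussed above (a 10% weighted average and gaps of 12.8 and 3.5 points); they are not the actual Figure 1 data.

```python
def population_weighted_wer(slice_wers, population_shares):
    """Average the per-slice WERs using real-world population shares rather
    than the (possibly unbalanced) composition of the test set."""
    assert abs(sum(population_shares.values()) - 1.0) < 1e-6
    return sum(population_shares[g] * slice_wers[g] for g in slice_wers)

def wer_gap(slice_wers):
    """Disparity: gap between the worst- and best-performing groups (one
    simple summary; see Mitchell et al. (2020) for alternatives)."""
    return max(slice_wers.values()) - min(slice_wers.values())

# Real-world population shares from the example in the text.
shares = {"A": 0.60, "B": 0.30, "C": 0.10}
# Hypothetical per-group WERs (%) for two systems.
system_1 = {"A": 6.5, "B": 13.9, "C": 19.3}     # large disparity
system_2 = {"A": 8.75, "B": 11.75, "C": 12.25}  # more consistent
for name, wers in [("system 1", system_1), ("system 2", system_2)]:
    print(name,
          round(population_weighted_wer(wers, shares), 2),  # 10.0 for both
          round(wer_gap(wers), 2))                           # 12.8 vs 3.5
```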
Slicing can be based on just a single parameter, such as accent, gender, or age, but in reality, speakers are likely to fall into several categories at once. Therefore, it may make sense to look at intersectional groups: for example, the ASR performance for 20-to-30-year-old female speakers of Chicano English from Miami. Obtaining such rich metadata, however, may be challenging. Also, the more groups we intersect, the stronger the effect of data sparsity becomes: it may be challenging to fill every bucket with enough samples to obtain solid statistics and to control for all other variables not considered. At any rate, as long as mutually exclusive groups can be defined, whether based on a single parameter or in an intersectional way, this framework can help provide a more thorough understanding of various ASR metrics. Weighting by population also allows re-balancing potentially unbalanced test sets, and gives insight into what kinds of ASR performance would be encountered by different groups.
The goal of this approach is to generate new insights into the ASR accuracy for each slice without making assumptions about the causal interaction between the underlying latent variables.
The analytical methods we discuss here are much more detailed than what is commonly employed for ASR system evaluation nowadays, but this level of detail is more usual in the field of variationist sociolinguistics, suggesting potential for future collaborations (Labov, 1990;Grama et al., 2019).

Defining slices
To evaluate ASR systems in the framework we are proposing, it is crucial to define representative and mutually exclusive slices. While the classification we suggest in this section is by no means exhaustive, it can be used as a starting point.
Regional language variation Many languages have regional language variation. For example, in the United States alone, there are three main regional groups of dialects: the Inland North, the South, and the West (Labov, 1991), with multiple cities developing their own regional language variants. Such regional variants may involve regional phonology ('get' rhymes with 'vet' in the North, and with 'fit' in the South), and even significant lexical and syntactic differences ('going/planning to' can be expressed as 'fixin' to' in the South). Aksënova et al. (2020) have shown how such regional variation can be explored, and how it can impact ASR performance. Ideally, then, as many regional variants as possible should be covered by the ideal benchmark for a given language.
Sociolects Along with regional differences, there may also be linguistic diversity introduced by speakers of various sociolects: in American English, one might think of AAVE, Chicano (Mexican-American) English, and others. For example, AAVE, covered by the CORAAL data set (Kendall and Farrington, 2020), has distinctive syntactic constructions such as habitual be ('She be working') and perfective done ('He done run'), along with systematic phonological differences (Wolfram, 2004). And even within a single sociolect such as AAVE there might be linguistic diversity. Sociolects may impact ASR quality (Koenecke et al., 2020), and it would therefore be desirable for benchmarks to cover as many sociolects as possible.
L2 background Speech produced by non-native (L2) speakers may reflect some characteristics of their native (L1) language (Bloem et al., 2016), making it important to measure the impact of L2 accents on ASR accuracy. One relevant data set for English is the GMU Speech Accent Archive (Weinberger, 2015), which collects such data for L2 speakers of English.
Gender, age, and pitch Recognition performance may vary depending on the gender or age of the speaker (Liao et al., 2015;Tatman, 2017;Tatman and Kasten, 2017;Feng et al., 2021). In some cases, as in Common Voice (Ardila et al., 2020;Hazirbas et al., 2021), self-reported metadata is available. Where such information is not available, it may make sense to fall back to a proxy analysis based on pitch-which is known to be correlated with factors such as age and gender-in order to understand whether there are recognition accuracy differences for various pitch buckets, as in Liao et al. (2015).
Speech impairments Accuracy rates of standard ASR systems may also degrade for speech produced by people with speech impairments. Recent work has investigated ways to collect relevant data (Grill and Tučková, 2016;Park et al., 2021), enabling analyses of ASR systems in this area. However, given the high degree of variability in this space, a more robust path at least for the near-term future may be designing personalized ASR systems for people with non-standard speech (Shor et al., 2019). Beyond speech impairments, voice technologies could bring benefits to people with various types of diseases and impairments such as Alzheimer's, Parkinson's, and hearing loss.

Conclusion
The ultimate goal of benchmarking should be the ability to predict how well an ASR system is going to generalize to new and unseen data. In the previous sections, we have argued that a single aggregate statistic like the average WER can be too coarse-grained to describe accuracy in a real-world deployment that targets multiple sociolinguistic slices of the population. Ideally, the insights generated by the proposed analyses would be actionable, informing everything from the composition of the training data to fine-grained tuning against a clear objective function.
Before we conclude, we should point out that any benchmark that implemented even a fraction of the metrics outlined above would yield rich amounts of information, which will likely pose challenges in terms of organizing, presenting, and understanding all this material. Model report cards, as outlined by Mitchell et al. (2019), may be a natural way to capture this information for an ASR system, although we would suggest calling them system report cards instead, given that most ASR systems do not consist solely of a single monolithic model. Given the sheer amount of variation in the ways in which people speak, and the large number of technical factors involved, measuring ASR systems is a complicated task. Today's benchmarks clearly leave room for improvement, whether through covering more horizontal domains (different kinds of speech), measuring the impact of cross-cutting vertical issues (e.g. factors like background noise), using more metrics than just WER (e.g. latency), or including demographic characteristics. We hope that our survey of these areas, and the simple population-weighted visualization framework we introduced, can help improve future benchmarks, not just for English, but also for the thousands of other languages spoken in our world today. This will clearly be a long-term journey, but it will be very important for the field as a whole to find ways to measure ASR systems better as speech recognition research continues to advance.