E2E Spoken Entity Extraction for Virtual Agents

In human-computer conversations, extracting entities such as names, street addresses and email addresses from speech is a challenging task. In this paper, we study the impact of fine-tuning pre-trained speech encoders to extract spoken entities in human-readable form directly from speech, without the need for text transcription. We illustrate that such a direct approach optimizes the encoder to transcribe only the entity-relevant portions of speech, ignoring superfluous portions such as carrier phrases or spelled-out name expansions. In the context of dialog from an enterprise virtual agent, we demonstrate that this 1-step approach outperforms the typical 2-step approach, which first generates lexical transcriptions and then applies text-based entity extraction to identify spoken entities.


INTRODUCTION
Enterprise Virtual Agents (EVA) provide automated customer care services that rely on spoken language understanding (SLU) in a dialog context to extract a diverse range of intents and entities that are specific to that business [1]. Gathering entities like names, email addresses and street addresses from human callers has become part of a large range of virtual agents. To minimize errors in the recognition and extraction of names, designers of speech interfaces often craft prompts that request the user not only to say their name but to spell it as well, addressing issues of homophones (e.g., Catherine vs. Katheryn). This spelling behavior carries over to other entities such as street and email addresses, even without users being explicitly prompted to do so.
Extensive research has been done on recognizing entities in spoken input [2,3,4,5,6]. Similar to text-based NER, approaches for spoken NER often involve predicting entity offsets and types in text produced by an automatic speech recognizer (ASR) [7,8,9], or recognizing them directly as part of E2E ASR output. Significantly less research has been done on spoken entity extraction in dialogs [10,11], and even less in enterprise virtual agents [3,5]. Proposed methods include using a predefined list of entity names [12] in a speech recognizer, fuzzy refinement by exploiting knowledge graphs [13], or using a large vocabulary speech recognizer to obtain a transcript that is further processed with text-based NER tools. Such techniques are difficult to adapt to caller responses to the prompt "say and spell your first/last name" in a spoken dialog system, as illustrated by the following example:

s as in sam k as in kipe i as in ina b as in boy o as in over --> skibo

Recently, [14] adapted a standard Seq-2-Seq model designed for machine translation to extract person names from text generated by a speech recognizer. We verify their 2-step approach and extend it to extract additional entities: postal and email addresses. However, this 2-step approach not only means two systems to maintain and adapt, but also a loss of acoustic-prosodic information. In this paper, we propose a novel method for extracting human-readable spoken entities directly from speech with a single model (1-step approach) that is optimized for the entity extraction task. We hypothesize that CTC loss, which has become a standard for training E2E ASR systems, can be re-imagined to map audio events to text events, thus generating only entity-relevant subword tokens instead of a literal transcription of what is being said. Our system transparently needs only samples of the form (<audio.wav>, entity) for training.
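The brittleness of text-side handling can be made concrete with a sketch of the pattern-collapsing such a say-and-spell response requires. The helper `collapse_spelling` below is hypothetical and for illustration only; real caller responses vary far more than this simple rule covers.

```python
import re

def collapse_spelling(transcript: str) -> str:
    """Collapse 'x as in word' patterns and runs of single letters into
    a spelled string. A hypothetical helper for illustration only."""
    # Replace each 'x as in word' phrase with just the letter x.
    text = re.sub(r"\b([a-z]) as in \w+", r"\1", transcript)
    # Join any remaining run of single letters into one token.
    out, spell = [], []
    for tok in text.split():
        if len(tok) == 1:
            spell.append(tok)
        else:
            if spell:
                out.append("".join(spell))
                spell = []
            out.append(tok)
    if spell:
        out.append("".join(spell))
    return " ".join(out)

print(collapse_spelling("s as in sam k as in kipe i as in ina b as in boy o as in over"))
# -> "skibo"
```

A rule like this breaks as soon as the caller deviates from the expected phrasing; the 1-step approach instead learns such normalization implicitly from (audio, entity) pairs.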
We also adopt the standard tagging approach [7], where the system marks the offsets of an entity using entity-specific begin and end tokens when there are multiple entities in the same input, or predicts the entity type in a joint model.
We acquire data from a production EVA system which has a human-in-the-loop for automation purposes. For testing, we deploy an in-house team which listens to audio in a constrained environment similar to an EVA: human annotators listen to the audio briefly for a designated duration and then type what they heard into the system. We found that our proposed 1-step approach significantly outperforms the 2-step approach for extracting names, addresses and emails from users of an EVA system. By generating only entity-relevant tokens, our system learns to perform more intelligent entity extraction, instead of just performing the literal lexical transcription done by existing ASR systems. We believe this opens the door to more interesting use-cases where E2E ASR systems can perform structured multi-task generation, e.g., where the first token is an intent label. The contributions of our paper are as follows:
• We adapt a standard E2E ASR which uses CTC loss and greedy decoding to transcribe only entity-relevant tokens from speech.
• We show that our proposed method performs better than the 2-step cascading approach and also better than human annotators in a fully automated human-in-the-loop dialog system.

RELATED WORK
A common practice is to convert the token sequence in spoken form produced by ASR into a written form better suited to processing by downstream components in dialog systems [15]. This written form is then used to extract structured information in the form of intents and slot-values to continue a dialog [16].
Recently, there is a growing trend to use neural encoders optimized directly on speech input, popularly known as E2E SLU approaches [17,18].
Inverse text normalization: Information extraction systems generally use an inverse text normalization (ITN) component to convert the token sequence in spoken form produced by ASR into a written form suitable for presentation to users and for processing by downstream components such as NLU and dialog. Entities requiring significant transformation from spoken form to written form include cardinals and ordinals, as well as more complex items like dates, times and addresses [19,15]. Methods proposed for ITN include using language models (LM) to decode written-form hypotheses [19], a finite-state verbalization model [20], and leveraging rules and handcrafted grammars to cast ITN as a labeling problem [15]. [21] gives a good overview of text normalization approaches used for ASR systems.
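The kind of spoken-to-written conversion ITN performs can be sketched with a minimal rule table. This is a hypothetical illustration only; production ITN systems use FSTs, LMs, or labeling models as the works cited above describe.

```python
# Minimal, hypothetical spoken-form -> written-form rules.
SPOKEN_TO_WRITTEN = {
    "twenty third": "23rd",
    "one hundred": "100",
    "first": "1st",
}

def itn(spoken: str) -> str:
    """Rule-based ITN sketch: apply longest rules first so multi-word
    spans win over their sub-spans."""
    written = spoken
    for k in sorted(SPOKEN_TO_WRITTEN, key=len, reverse=True):
        written = written.replace(k, SPOKEN_TO_WRITTEN[k])
    return written

print(itn("twenty third street"))  # -> "23rd street"
```

Even this tiny table shows why ITN is fragile on errorful ASR output: the rules only fire on exactly the spoken forms they anticipate.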

Structured information extraction:
Significant work has been done in the NLP literature to adopt seq-2-seq based methods to extract structured information from text. This covers contributions on recognizing offsets and types of entities [22], structured summarization [23], and creating parse trees by predicting appropriate tags at language generation time [24]. These works, which have primarily focused on monologues, fail to capture the variability which exists in a spoken dialog.
E2E SLU: Several E2E approaches which act directly on speech have been proposed for named entity recognition, a task closely related to the entity extraction studied in this work [7,25,26,27,28]. Unlike these previous works, which typically need the text transcript along with entity type and offset tags, our approach needs only the normalized entity for supervision. Our system transparently uses only pairs of audio and the target normalized entities, removing a significant amount of the cost and effort needed to obtain transcriptions and entity tags.

E2E ASR:
Recent state-of-the-art automatic speech recognition systems transcribe speech into lexical tokens that represent what is being said. This is generally achieved by either fine-tuning self-supervised neural encoders or training from scratch using a CTC loss to recognize lexical tokens. Such literal transcription creates the need for an additional text-based component to extract entities from transcriptions. We believe directly fine-tuning these self-supervised speech encoders to transcribe only entity-relevant tokens leads the machine to transcribe the way humans do: understand, ignore and transcribe.

METHOD
We rethink ASR not merely as a transcription system but as an E2E speech-based encoder that can extract human-readable entities, thus learning to ignore, normalize and generate only target entity tokens directly from speech. In addition, we adopt a method to compute a confidence score for the entity extraction system, which can be used to improve the precision of entity extraction. We compare our 1-step approach to the 2-step approach of [14], which cascades ASR and an NLU (seq-2-seq) component for automatic name capture.

Non-Autoregressive Speech Based Extraction
We re-purpose an E2E ASR fine-tuned for the standard transcription task and fine-tune it on (speech-input, entity) pairs optimized with CTC loss [29]. We use the NeMo library [30] for all training and testing purposes.
In this work, we pick an off-the-shelf Citrinet [31] model downloaded from the NeMo library 1 . It is trained on a 7k-hour collection of publicly available transcribed data and uses a SentencePiece [32] tokenizer with a vocabulary size of 1024 (L). We fine-tune it again for the transcription task using an additional 800 hours of transcribed speech from a collection of enterprise virtual agent applications. This model achieves a word accuracy of 93.1% on a 28k-utterance test set consisting of user utterances in response to the "How may I help you?" opening prompt from various enterprise virtual agent applications [14].
The E2E Citrinet ASR is then re-purposed and fine-tuned for name extraction. For entities (email and postal address) which contain vocabulary tokens not part of the ASR tokenizer (digits, special symbols), we initialize the classification head for a new SentencePiece tokenizer. We keep the vocabulary size at 1024. We then fine-tune this E2E encoder for direct entity extraction from speech using a standard CTC loss.

From Network output to Entities
We use the same mathematical formulation as CTC [29] to classify unseen speech input sequences in a way that minimizes a task-specific error measure, i.e., to output entity-specific tokens. As in standard practice, our CTC network has a softmax layer with one more unit than there are labels in L. The activation of the extra unit is the probability of observing a 'blank', or no label. The activations of the first |L| units are interpreted as the probabilities of observing the corresponding labels, with y_t^k the probability of observing label k at time t. This defines a distribution over the set L'^T of length-T sequences over the alphabet L' = L ∪ {blank}:

p(π|x) = ∏_{t=1}^{T} y_t^{π_t}, π ∈ L'^T    (1)

We refer to the elements of L'^T as paths and denote them π. [29] makes an implicit assumption in Equation 1 that network outputs at different times are conditionally independent. However, feedback loops within encoders connecting information at different positions make them conditionally dependent, which is possibly an important reason behind the success of CTC-based E2E ASRs.
The many-to-one map B is defined as B : L'^T → L^{≤T}, where L^{≤T} refers to the set of sequences of length less than or equal to T over the original label alphabet L.
All blanks and repeated labels are removed from the paths. Thus, when optimized to output only entity-relevant tokens, the system outputs blanks for those time-steps instead of mapping them to a token in L (step-wise CTC outputs in Figure 2). Finally, B is used to define the conditional probability of an entity l ∈ L^{≤T} as the sum of the probabilities of all the paths corresponding to it:

p(l|x) = Σ_{π ∈ B^{-1}(l)} p(π|x)    (2)

Thus, the original mathematical formulation of the CTC loss, together with the contextualized representations of the speech encoder, allows us to optimize directly for entity extraction instead of transcription, highlighting that CTC loss can learn to extract only entity-relevant tokens and ignore other information in the speech input such as carrier phrases, words given as examples of a character, repetitions, etc. Figure 2 shows a sample with output tokens at each time step (80 ms for Citrinet), along with the probability of each token. The E2E ASR (blue plot) predicts blank tokens for steps with no output, likely corresponding to silences between tokens, whereas the 1-step approach generates blank tokens for the information the system is optimized to ignore. Thus, CTC loss helps to output contextualized tokens and align them to steps in the audio without any alignment supervision. Our experiments suggest that other E2E ASR architectures, like Conformer [33], show similar results when fine-tuned with the non-autoregressive CTC loss. However, we limit our experiments to Citrinet in this paper.

Fig. 2. Samples showing the output of greedy decoding, comparing an E2E Citrinet ASR with our proposed 1-step approach. Time steps not marked with any token are predicted as blanks. Blue tokens, which mark our system's output, generally focus on the spelled portion but can learn from context to make a better decision.
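The path-summing definition of the CTC conditional probability p(l|x) can be illustrated with a brute-force sketch over a toy label set. The posteriors below are hypothetical numbers; real systems use the efficient forward-backward recursion of [29] rather than enumerating paths.

```python
from itertools import product

def collapse(path, blank=0):
    """The many-to-one map B: merge repeated labels, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return tuple(out)

def ctc_prob(y, target, blank=0):
    """p(l|x): sum of prod_t y[t][pi_t] over every path pi with B(pi) == l."""
    T, V = len(y), len(y[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == tuple(target):
            p = 1.0
            for t, s in enumerate(path):
                p *= y[t][s]
            total += p
    return total

# Toy 3-frame posteriors over {blank, 'a', 'b'} (hypothetical numbers).
y = [[0.6, 0.3, 0.1],
     [0.2, 0.7, 0.1],
     [0.5, 0.2, 0.3]]
p_a = ctc_prob(y, [1])  # probability that the collapsed output is "a"
```

Summing `ctc_prob` over all possible collapsed sequences returns 1, since the map B partitions the path space.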
At training time, the classifier is constructed according to [29], as implemented in the NeMo library. We refer the reader to the original Citrinet paper for more implementation details [31].
For decoding, we use simple greedy CTC decoding (the best-path method), where the argmax function is applied to the output predictions of the network at each time step and the most probable tokens are concatenated to form a preliminary output token sequence. The CTC decoding rules are then applied to remove blank symbols and repeated tokens, and the remaining tokens are merged to obtain the extracted entity. We compute confidence scores by summing the posterior probabilities of the non-blank predicted tokens, a method originally proposed by [34].
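Greedy best-path decoding and the confidence score can be sketched as follows. The posteriors are toy numbers; this illustrates the method and is not NeMo's exact implementation.

```python
def greedy_ctc_decode(y, blank=0):
    """Best-path decoding: frame-wise argmax, collapse repeats, drop blanks.
    Confidence sums the posteriors of the surviving non-blank frames,
    in the spirit of [34]; a sketch, not NeMo's implementation."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in y]
    tokens, prev, conf = [], None, 0.0
    for t, s in enumerate(best):
        if s != prev and s != blank:
            tokens.append(s)
            conf += y[t][s]
        prev = s
    return tokens, conf

y = [[0.1, 0.8, 0.1],   # frame 0 -> token 1
     [0.7, 0.2, 0.1],   # frame 1 -> blank (ignored portion of speech)
     [0.2, 0.1, 0.7]]   # frame 2 -> token 2
tokens, conf = greedy_ctc_decode(y)  # tokens [1, 2], confidence 0.8 + 0.7
```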

Baseline: Cascading ASR and NLU systems
In this 2-step approach, we first transcribe the speech into text using the E2E ASR described in the previous section (sample output in Table 1). We then extract entities from the transcribed text by learning to translate using (transcription, entity) pairs. We use a standard off-the-shelf transformer-based Seq-2-Seq system of the kind typically used for neural machine translation. Table 1 shows a sample input H and the desired output M.
The ASR hypothesis is provided as byte-pair encoded (BPE) tokens to the Seq-2-Seq model, while the decoder generates entity-relevant BPE tokens along with the start and end tokens. We use a shared embedding layer for encoder and decoder tokens, and fastBPE 2 to learn a shared vocabulary for both. We found that using four multi-headed attention layers for both encoder and decoder gives the best result on the validation sets, which is the only change we make to the seq-2-seq setup of [14]. We use the Adam optimizer with a fixed batch size of 32 and a fixed learning rate of 1.0e−5. We do not perform any pre-training but instead train from random initialization. The confidence score is the sum of the log-probabilities assigned to the character entity sequence.

Dataset
While some public datasets like the OGI collection [35] include a small subset of spelled names, the growing demand for EVAs across multiple industry verticals means they capture millions of user utterances in response to different prompts. Table 1 shows transcriptions of callers' speech input and the desired machine output. The system output also includes entity-type-specific tags that help guide the decoding algorithm in the case of joint models for all entities. We collect training data from several production EVA applications including banking, insurance, mobile service, and retail, from callers based in the United States. Our collections are user responses to several prompts, in the form of audio samples with labels from human-in-the-loop agents. Human agents listen to customer inputs, then either type a human-readable entity or report that the user provided an invalid input. We remove the samples where the user does not provide a meaningful input (keeping 70%-85% of utterances), thus creating a dataset of (speech, entity) pairs used by the automatic extraction systems. Table 2 shows statistics for each prompt type we collect, namely first name, last name, full name, postal address and email address, along with the sizes of the training and test sets. We found that speakers generally took longer to say their full name and email address compared to other prompt types. We lowercase all entities. We keep additional validation sets, of size 10% of the training sets, for model selection. Table 2 also shows the median duration in the training set along with the 95th percentile range.
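The filtering step described above can be sketched as follows. The records and field names are hypothetical, not the production schema.

```python
# Hypothetical label records produced by human-in-the-loop agents.
records = [
    {"audio": "utt_001.wav", "label": "Jack Smith", "valid": True},
    {"audio": "utt_002.wav", "label": "", "valid": False},  # invalid input
    {"audio": "utt_003.wav", "label": "Singh", "valid": True},
]

def build_pairs(records):
    """Keep valid utterances and lowercase entities, yielding the
    (<audio.wav>, entity) pairs used for CTC fine-tuning."""
    return [(r["audio"], r["label"].lower()) for r in records if r["valid"]]

pairs = build_pairs(records)
# -> [('utt_001.wav', 'jack smith'), ('utt_003.wav', 'singh')]
```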
For testing purposes, we randomly sampled audio from a large pool of data collected at different time frames than the training data, but from the same set of applications. We imitate a human-in-the-loop scenario where annotators listen to user inputs and type the entity, listening to the audio only once within a limited time. Our test participants are a mix of native and non-native speakers who may be less exposed to European names. Table 3 shows the size of the test data for each type. It is often observed that the test participants introduce errors when labeling in a constrained setting like an EVA. Afterwards, we employ native speakers of English to verify and correct the entity labels. The last column in Table 3 shows human-in-the-loop performance in this constrained setting.

Table 2. Statistics of data
In dialog systems, customers are prompted to provide information such as personal names, postal addresses, and email addresses at various dialog turns using different authored prompts. In Table 1, we present a few sample responses from callers to such prompts and the desired system output, with the system predicting both the start and end of an entity. For full-name entity extraction, we consider the first word to be the first name and the remaining words the last name.

Table 3. Statistics of data and results

RESULTS
Our results illustrate that 1-step extraction, which acts directly on speech, outperforms the 2-step entity extraction approach, as shown in Table 3. We use the same training and test data for both the 1-step and 2-step approaches. The results also show that our systems achieve better performance than human annotators for most prompt types except email addresses. We found extracting email addresses to be the hardest of all types. This is in line with our hypothesis that email IDs are hardest to extract because of the practically infinite combinations humans can use to describe a unique ID (approximately 70% of the email training data used some form of carrier phrase like "as", "in" and "like"). The joint model, which pools training data for all entities, shows improved performance for first-name, full-name and email extraction.

ASR Transcription | ASR → S2S | E2E extraction
jack smith j a k s m i t h | jack smith | jak smith
fingh s i n g h | fingh | singh
lunscarard l u n d s t a a r d | lundstaard | lundsgaard
o leary o capital l apostrophe e a r y | leary | ol'eary
fourty one hundred twenty third street | 4100 23rd street | 41 123rd street

Table 4. A few sample cases where our proposed 1-step approach performs better than the 2-step approach. Text in red highlights wrong output, while green is correct.

Varying amount of training data
EVAs hear a large number of users providing their names, addresses and emails on a daily basis. Our proposed approach, which optimizes using (speech, entity) pairs, depends on supervised data for automation. It therefore becomes important to analyze the amount of data needed before the system starts showing results useful enough to replace humans in an EVA.

Fig. 3. Varying training data and measuring accuracy for the 1-step approach.

Figure 3 shows the variation in accuracy on the full-name extraction test set. We measure accuracy at the level of words, i.e., a word is either the first name or the last name, and at the level of characters. We found that the system achieves high accuracy at the character level with less training data, but needs more data to get the complete name correct.
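The two accuracy levels can be sketched as follows. This is a simplified illustration: the character metric uses a naive positional alignment rather than edit distance, and the paper's exact evaluation protocol may differ.

```python
def word_accuracy(hyps, refs):
    """Word-level: each name word (first or last) must match exactly."""
    pairs = [(hw, rw) for h, r in zip(hyps, refs)
             for hw, rw in zip(h.split(), r.split())]
    return sum(hw == rw for hw, rw in pairs) / len(pairs)

def char_accuracy(hyps, refs):
    """Character-level: fraction of positionally matching characters
    (naive alignment; a real evaluation might use edit distance)."""
    match = total = 0
    for h, r in zip(hyps, refs):
        total += max(len(h), len(r))
        match += sum(a == b for a, b in zip(h, r))
    return match / total

wa = word_accuracy(["jak smith"], ["jak smyth"])  # one of two words correct
ca = char_accuracy(["jak smith"], ["jak smyth"])  # 8 of 9 characters match
```

A single wrong character sinks a whole word, which is why character accuracy saturates with far less training data than full-name accuracy.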

Effect of transcription quality
We found that the performance of the cascade approach is better when human-transcribed text is provided, compared to E2E ASR output which contains errors. The 2-step S2S system, trained on the same noisy data as the E2E extraction system, appears more robust in that it produces high-quality results if correct transcriptions are provided. However, generating transcriptions with no errors is practically impossible, and acquiring data to fine-tune an ASR for this task would be costly.

Table 5. Comparing performance when human-transcribed text is used instead of ASR output.

OBSERVATIONS
Linguistic analysis: We found that for address and email extraction, users extensively break their answers into spelling, with or without language descriptions, e.g., "s as in sam". For names, however, humans tend to use spelled-out expansions only when they are explicitly asked to provide them. Table 4 shows output of both the 2-step and 1-step approaches for extracting entities. We found the cascading S2S approach performs better when the transcribed text provided by the ASR has fewer errors. We believe some of these transcription errors are due to a bias toward the vocabulary of the ASR training data. The improved performance of the E2E extraction system indicates that it can learn to resolve such ambiguities for efficient entity extraction.

Automation rate: Virtual agents use the confidence score provided by an automatic module to decide whether a call should be routed to a human agent. The confidence threshold determines the error-versus-rejection curve, and a suitable operating point is chosen that optimizes the rejection at a given error rate. Figure 4 shows error-rejection curves for full-name extraction. Our 1-step approach shows a 12% error rate at 20% rejection, while the 2-step approach shows 21% error at 25% rejection, thus significantly outperforming the 2-step approach.
It also performs better than the human-in-the-loop.

Fig. 4. Error-rejection curves for full name extraction. Setting a threshold helps the dialog system designer control the automation rate.
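The error-versus-rejection trade-off behind Figure 4 can be sketched as follows, using toy confidence scores and correctness flags (illustrative only).

```python
def error_rejection(scores, correct, threshold):
    """Route utterances whose confidence falls below `threshold` to a
    human agent; report the error rate on the accepted remainder.
    A sketch of choosing an operating point on the curve."""
    accepted = [c for s, c in zip(scores, correct) if s >= threshold]
    rejection = 1 - len(accepted) / len(scores)
    error = accepted.count(False) / len(accepted) if accepted else 0.0
    return error, rejection

# Hypothetical confidence scores and correctness flags.
scores  = [0.9, 0.8, 0.4, 0.3, 0.95]
correct = [True, True, False, True, False]
err, rej = error_rejection(scores, correct, threshold=0.5)
```

Sweeping the threshold over all score values traces out the full error-rejection curve from which the operating point is chosen.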

CONCLUSIONS
In this paper, we showed that high-quality spoken entities can be extracted directly from speech by fine-tuning E2E ASR systems. The proposed 1-step model avoids propagating ASR mistakes in the critical token sequence to the final entity extraction phase. We believe our results can be further improved using autoregressive loss functions.
For complete automation of prompts in customer calls, a system also needs to extract the intent for the samples (10-15%) with no entities in them. Our early experiments suggest this can be done by mixing intent-labelled data with the entity extraction data, at the cost of a minor loss in performance for each entity. We believe the fact that our system can learn to ignore speech patterns and transcribe only entity-relevant tokens opens avenues to extract other forms of structured information directly from speech.