Automatic Named Entity Obfuscation in Speech

Sharing data containing personal information often requires its anonymization, even when consent for sharing was obtained from the data originator. While approaches exist for automated anonymization of text, the area is not as thoroughly explored in speech. This work focuses on identifying named entities in speech and replacing them with substitutes synthesized using voice cloning, which are inserted into the original audio, thereby retaining prosodic information while reducing the likelihood of deanonymization. The approach employs a novel named entity recognition (NER) system built directly on speech by training HuBERT (Hsu et al., 2021) using the English speech NER dataset (Yadav et al., 2020). Name substitutes are found using a masked language model and are synthesized using text-to-speech voice cloning (Eren and Team, 2021), upon which the substitute named entities are re-inserted into the original audio. The approach is prototyped on a sample of the LibriSpeech corpus (Panayotov et al., 2015) with each step evaluated individually.


Introduction
Privacy concerns, particularly where an individual could be identified, preclude sharing and therefore automatic exploitation of many data sources. Anonymization, the removal of identifying information, has been automated for text (Lison et al., 2021), including large scale applications such as in clinical (Hartman et al., 2020) or legal settings (Oksanen et al., 2022), with off-the-shelf systems having reported performance of 90+% (Hartman et al., 2020). To minimize the risk of re-identification, obfuscation (replacing identifying information with a different substitute of the same type) has been explored as an alternative to replacing identifying information with a generic marker (Sousa and Kern, 2022). The main focus in speech has been on voice anonymization, which may not be a problem with speaker consent, with the removal of identifying information receiving less attention. To our knowledge, this is the first prototype to perform named entity obfuscation directly, in the original speaker's voice. Aside from voice cloning, it explores a named entity recognition approach based directly on the audio signal and uses language model masking to find appropriate substitutions.
Recent advances in speech models, particularly the inclusion of language models within the speech model itself (e.g. HuBERT (Hsu et al., 2021)), give models greater insight into expected contexts. Previous work on named entity recognition (NER) in speech frequently employs a two step approach, transcribing speech first, followed by the application of existing named entity techniques (Yadav et al., 2020). However, this process has the potential to compound errors, as errors in transcription will increase the probability of error in NER. We suggest that the addition of language models into the speech model gives these models sufficient power to perform NER directly, and therefore that transcribing (automatic speech recognition, ASR) and NER can be separated, and used to provide a confidence measure in their performance. Divided, the two do not propagate errors in the same way; in fact, treating ASR and NER separately allows one to fix (some of the) errors of the other. The proposed second (final) ASR pass merely produces a confidence value in the result to decide whether a manual check should be performed.
The success of few shot learning, where a limited number of examples is used to generalize a pre-trained deep learning model to a new situation, for text-to-speech, and specifically voice cloning (Zhang and Lin, 2022), enables an alternative, equivalent but different, entity to be inserted in the audio signal in place of the original while preserving the prosody information throughout. While large databases of potential replacement entities can be used to select a substitution, these may not preserve necessary properties (such as gender). Alternatively, word embeddings have been used to suggest close (in the multi-dimensional space) alternatives (Abdalla et al., 2020); however, these can suffer from the same drawback. We propose using a more contextualized alternative to word embeddings, a masked language model (Devlin et al., 2019), where the model is trained by hiding (masking) words and predictions of the original word are made based on their context. This work makes the following contributions: (1) a complete obfuscation pipeline for names in speech, (2) a named entity recognizer built directly on speech without requiring text transcription first, (3) alternative (obfuscated) entity replacement selection via a masked language model, and (4) confidence annotated system output, allowing for manual correction and / or selection of shareable instances. Section 2 contains the methodology with results in Section 3. Section 4 presents the conclusions and future work.

Methodology
The steps of the overall pipeline, which takes in an audio file and produces an obfuscated audio file along with a confidence value, can be found in Figure 1. The approach comprises three main parts: 1) identification of named entities (NEs) in the audio, 2) finding an equivalent alternative for the original NEs, and 3) reconstructing the original audio to incorporate the replacement NEs. The reconstructed audio can further be used to obtain a confidence value.

Identification of named entities
To enable the direct use of a language model on speech input for the purpose of named entity recognition (NER), a dataset of audio recordings with annotated NEs is required. The English speech NER dataset (Yadav et al., 2020), which consists of 70,769 waveforms with transcripts annotated with person, location and organization NEs, is used for fine-tuning the Hidden-Unit BERT speech model (HuBERT) (Hsu et al., 2021). HuBERT was selected over other speech models since it learns both acoustic and language models from its inputs and therefore has an increased awareness of context. The success of language models on text NER has demonstrated how crucial context is for this task, and using a model which incorporates both an acoustic and a language model (over acoustic only) allows the approach to exploit the information used in text NER, while managing to avoid the need for a transcript.
For training, NE annotations need to be converted to a suitable format, indicating the presence or absence of a NE in each position. Following the inside-outside(-beginning) chunking common to many NER approaches (Tjong Kim Sang and De Meulder, 2003), three formats were explored: 1) character level annotation, mapping each character to either o for a character outside of a named entity, space, or n, l, e for characters within person, location or organization entities respectively, 2) the same character level annotation with separate characters added to denote the beginning of each type of NE (mapping the sentence TELL JACK to oooo mnnn with m denoting the start of a person NE), and 3) for completeness, annotation at word level.
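The first two annotation formats can be illustrated with a minimal sketch. The function name, the per-word label input, and the rule that consecutive words with the same label form one entity are assumptions made for illustration; only the characters o, n, l, e and the person beginning marker m come from the description above, and the beginning markers for location and organization are hypothetical placeholders.

```python
# Inside-entity characters as described in the text.
INSIDE = {"person": "n", "location": "l", "organization": "e"}
# Only "m" (person) is given in the text; "k" and "d" are hypothetical placeholders.
BEGIN = {"person": "m", "location": "k", "organization": "d"}


def annotate(words, labels, mark_beginning=False):
    """Map each character of each word to an annotation character.

    words  -- list of transcript words
    labels -- parallel list of None, "person", "location" or "organization"
    Consecutive words with the same label are assumed to form one entity.
    """
    tagged = []
    previous = None
    for word, label in zip(words, labels):
        if label is None:
            tag = "o" * len(word)
        else:
            tag = INSIDE[label] * len(word)
            # Format 2: replace the first character of a new entity
            # with the beginning marker for its type.
            if mark_beginning and label != previous:
                tag = BEGIN[label] + tag[1:]
        tagged.append(tag)
        previous = label
    # Spaces between words are kept as spaces.
    return " ".join(tagged)


print(annotate(["TELL", "JACK"], [None, "person"]))
print(annotate(["TELL", "JACK"], [None, "person"], mark_beginning=True))
```

Under the second format the beginning marker occurs only once per entity, which is consistent with the sparsity of training examples for those markers.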
With the training parameters shown in Appendix A.1, the best NE performance was obtained from the first annotation approach, where NE beginnings were not explicitly annotated. The lower performance of the second annotation approach can be attributed to the low quantity of training data for the beginning marker annotations. While word level annotation was explored, it is likely to need a far greater quantity of data to enable mapping of different length inputs to a single label.
Separately, HuBERT was also fine-tuned for automatic speech recognition (ASR), i.e. for transcribing text from audio. Identical training data was used, with annotation being the transcription provided as part of the NE annotation (with NE annotation removed). The same parameters were employed for its training. Alongside the predicted (NE or ASR) annotation, prediction output also yields an offset which can be converted to a time offset. This can be used to identify the position of the NE(s) to be replaced, and after a greedy alignment of the two outputs, the original transcription of the original NE(s) can be extracted.
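The conversion from prediction offsets to time offsets can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: the input format (one dictionary per predicted character with frame offsets), the function names, and the default of 320 input samples per output frame (the usual downsampling factor of the HuBERT base feature encoder, i.e. 20 ms at 16 kHz) are all assumptions.

```python
def frame_to_seconds(offset, samples_per_frame=320, sample_rate=16_000):
    """Convert a model frame offset to seconds (assumed 320 samples per frame)."""
    return offset * samples_per_frame / sample_rate


def entity_spans(char_offsets, tag="n"):
    """Group consecutive characters carrying `tag` into (start, end) times.

    char_offsets -- hypothetical list of dicts with keys "char",
                    "start_offset" and "end_offset" (offsets in frames)
    Any other character, including a space, closes the current span.
    """
    spans = []
    current = None
    for item in char_offsets:
        if item["char"] == tag:
            if current is None:
                current = [item["start_offset"], item["end_offset"]]
            else:
                current[1] = item["end_offset"]
        elif current is not None:
            spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return [(frame_to_seconds(s), frame_to_seconds(e)) for s, e in spans]


example = [
    {"char": "o", "start_offset": 0, "end_offset": 10},
    {"char": " ", "start_offset": 10, "end_offset": 12},
    {"char": "n", "start_offset": 50, "end_offset": 55},
    {"char": "n", "start_offset": 55, "end_offset": 100},
]
print(entity_spans(example))
```

Because a space closes a span in this sketch, a multi-word name yields several spans; merging them would rely on the alignment with the ASR output.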

Finding an alternative NE
Once a person NE is identified, a suitable equivalent substitution needs to be obtained, i.e. we want to find the word which could replace the NE in the text if the NE was hidden. This is precisely the concept behind masked language models (MLMs): these models learn their weights so that given a sentence with a hidden (masked) word, the model will output the complete original sentence. The (ASR extracted) original sentences with NEs (as identified by the NE tuned model) masked were passed to an MLM. Three MLMs were explored: the BERT bert-large-uncased model (Devlin et al., 2019), the ALBERT albert-xxlarge-v2 model (Lan et al., 2019) and the distilled RoBERTa base model, distilroberta-base (Sanh et al., 2019). Each model, with no additional tuning, yields a (pre-specified) number of predictions for each NE in the sentence. Since the models used different datasets in training, their predictions are expected to be different: for example, some may suggest pronouns rather than names.
Given the propensity of the MLM to return substitutions which are not names (for example, for the sentence "you should call Stella", the model returns "you should call him", "you should call them", "you should call 911" etc.), an external list of person names is used for the validation of the proposed suggestions and the highest scoring substitution is returned. Heuristically, the original name is matched against the list to identify whether it is a first or a last name (where possible) and names of the same type suggested by the MLM are returned. Simple rules are employed (the last of a sequence of names is a last name, a single name without a title is a first name etc.) to decide on a substitution when the original name does not appear in either the first or last name list. Given the nature of MLMs, suggested alternatives are likely to be more common words: as a positive side effect, this should make them easier to render with voice cloning as they may already appear in the reference speech. Should the MLM fail to propose any suitable substitutions, one is selected at random from the first and last name lists, subject to the same heuristic rules.
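The validation and back-off logic described above might look like the following sketch. The function names, the (token, score) candidate format, the lowercase name sets, and the exclusion of the original name from the candidates are assumptions for illustration; the positional rules for unlisted names are reduced to a single default argument.

```python
import random


def name_type(name, first_names, last_names, default="first"):
    """Classify a name as "first" or "last" using the external lists.

    `default` stands in for the positional rules (e.g. a single name
    without a title is treated as a first name).
    """
    key = name.lower()
    if key in first_names:
        return "first"
    if key in last_names:
        return "last"
    return default


def choose_substitute(original, candidates, first_names, last_names, rng=None):
    """Return the highest scoring MLM candidate that is a name of the same type.

    candidates -- list of (token, score) pairs proposed by the MLM
    Falls back to a random name of the same type if no candidate is valid.
    """
    rng = rng or random.Random()
    kind = name_type(original, first_names, last_names)
    pool = first_names if kind == "first" else last_names
    # Assumption: the original name itself is never an acceptable substitute.
    valid = [
        (token, score)
        for token, score in candidates
        if token.lower() in pool and token.lower() != original.lower()
    ]
    if valid:
        return max(valid, key=lambda pair: pair[1])[0]
    return rng.choice(sorted(pool - {original.lower()}))


first = {"stella", "mary", "john"}
last = {"smith"}
mlm_output = [("him", 0.4), ("them", 0.2), ("911", 0.1), ("mary", 0.05), ("john", 0.03)]
print(choose_substitute("Stella", mlm_output, first, last))
```

As noted in the Limitations, a random back-off of this kind does not take properties such as gender into account.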

Reconstruction of original audio
In this work, the substitute NE is to be re-inserted into the original audio. To reduce the risk of re-identification via the extraction of entities which failed to be identified and therefore stayed in their original form, the substitute entity needs to be produced in the speaker's voice. The YourTTS (Casanova et al., 2021) model, which offers the ability for fine-tuning with less than one minute of speech while achieving good results with reasonable quality, can be used to generate the substitute sentence with all available speech of the speaker provided as reference. Note that it is not necessary to remove the original sentence from the reference data: in fact, its presence may result in more accurate rendering of the substitute sentence. The pre-trained model used in this work (tts_models/multilingual/multi-dataset/your_tts) was trained on the voice cloning toolkit (VCTK) dataset (Yamagishi et al., 2019) which contains approximately 400 sentences, selected from newspaper text, uttered by 108-110 different speakers, giving it its generalization power. Aside from the reference passed to the model on the command line, no tuning or training of the YourTTS model is done in this work.
The ASR transcribed text with the substituted NE is generated, rather than the substitution alone, to ensure that the intonation matches the substitution's position in the sentence as closely as possible. The average amplitude of the generated audio is matched to that of the original segment using the Python pydub library. The generated audio is again passed through the HuBERT based NE recognizer, to identify the location of the substituted NE in the generated audio and allow its extraction (note that in this pass, it is not necessary to perform ASR: only the offsets of the replacement NE are required). Should the NE recognizer not identify the same number of NEs as were present in the original, the instance is flagged for manual review.
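The amplitude matching step can be illustrated without an audio library. The sketch below works on plain lists of samples and interprets "average amplitude" as root mean square (RMS) level; both choices, and the function names, are assumptions made to keep the example self-contained, whereas the pipeline itself uses pydub.

```python
import math


def rms(samples):
    """Root mean square level of a sequence of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_level(generated, reference):
    """Scale `generated` so that its RMS level equals that of `reference`."""
    level = rms(generated)
    if level == 0.0:
        # Silence cannot be scaled to a target level; return it unchanged.
        return list(generated)
    scale = rms(reference) / level
    return [s * scale for s in generated]


print(match_level([1.0, -1.0], [2.0, -2.0]))
```

With pydub, a comparable effect could probably be obtained by applying a gain equal to the difference between the dBFS levels of the two segments.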
For each NE in the text, two pairs of start and end offsets are available: one pair extracted by the HuBERT based NE extraction from the original audio and a second pair from the audio generated from the substituted text. This allows the new NEs to be inserted in place of the original NEs. The splicing and concatenation of the waveforms is also performed using the pydub library.
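The splicing itself reduces to slice replacement once both sets of offsets are known. The following sketch operates on plain sample lists rather than pydub segments; the function name, the offsets in seconds, and the right-to-left processing order are illustrative assumptions.

```python
def splice(original, generated, orig_spans, gen_spans, sample_rate=16_000):
    """Replace each original NE span with the matching generated NE span.

    original, generated  -- sequences of audio samples
    orig_spans, gen_spans -- parallel lists of (start, end) offsets in seconds
    """
    if len(orig_spans) != len(gen_spans):
        # Mirrors the text: a mismatch in NE counts needs manual review.
        raise ValueError("NE count mismatch: flag for manual review")
    result = list(original)
    # Work from the last span to the first so that earlier sample
    # positions stay valid when replacement lengths differ.
    for (o_start, o_end), (g_start, g_end) in sorted(
        zip(orig_spans, gen_spans), reverse=True
    ):
        o0, o1 = round(o_start * sample_rate), round(o_end * sample_rate)
        g0, g1 = round(g_start * sample_rate), round(g_end * sample_rate)
        result[o0:o1] = generated[g0:g1]
    return result


# Toy example with one sample per second.
print(splice([0, 1, 2, 3, 4, 5], [10, 11, 12, 13], [(2, 4)], [(1, 4)], sample_rate=1))
```

The replacement may be longer or shorter than the span it replaces, so the overall duration of the utterance can change.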
A second HuBERT based ASR pass over the newly constructed (substituted) audio, and its comparison against the substituted text using word error rate (WER) and character error rate (CER), gives measures of confidence. Both metrics, commonly used for the evaluation of ASR, allow for sequences of different length to the target: the further the reconstructed audio is from the target sentence, the less likely it is that the substitution will go unnoticed. For the purpose of demonstrating the viability of the prototype, no hyperparameter optimization was performed, and the larger HuBERT models were not employed; however, improvements in the performance of both models are expected should this be pursued.
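Both confidence measures are edit distances normalized by the length of the reference. A minimal sketch, assuming the substituted text as the reference and the second-pass ASR output as the hypothesis (the function names are illustrative):

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_item != hyp_item),  # substitution
                )
            )
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


target = "you should call mary"
recognized = "you should call marry"
print(wer(target, recognized), cer(target, recognized))
```

In this example a single misrecognized word gives a WER of 0.25 but a CER of only 0.05, which shows why the two measures are informative together.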

Finding an alternative NE
A small scale evaluation is performed on a sample of 20 sentences selected at random from the LibriSpeech corpus (Panayotov et al., 2015) across 6 speakers. Sentence selection was subject to them containing a person named entity. While detailed results for the individual steps can be found in Table 2, it should be noted that, for the purposes of this work, the focus is the accuracy of the extraction of the correct NE. The stated accuracy is therefore somewhat misleading: in a number of cases, such as the word Raphael, the named entity is divided into two separate words, suggesting two consecutive named entities. However, this issue is corrected when the NE output is aligned with the ASR output and the two separate NE instances are (correctly) merged. Cases with NEs which cannot be aligned are flagged for manual intervention. The average ASR and (exact match) NE identification do not vary when a different MLM is employed, as this only affects the selection of the substituted name, resulting in different average confidence values.

Reconstruction of original audio
The voice cloning model requires some reference audio for the speaker: for the 6 selected speakers, 4 have fewer than 5 audio files (two having 3, and one having only 2 files) in the dataset. The quantity of data used as reference is likely to impact the quality (in terms of its similarity to the original speaker) of the generated audio. Given the likely scenarios of deployment, such as dialogues where more than 2 sentences of speech per speaker are available, this may not be representative of the results obtainable with the pipeline. However, it should be noted that even if all substituted instances can be identified as substitutions, the system is equal to a masking technique (where an entity is replaced with a fixed entity, such as a bleep).

Conclusion
The prototype described shows the steps of an obfuscation pipeline for speech, which results in substituted person named entities uttered in the original speaker's voice and replaced in the original audio signal. The prototype makes use of a named entity recognizer built directly on top of audio input, and employs masked language models to generate the substituted entity. It offers an end-to-end automatic solution enabling the sharing of speech with identifying information removed.
The resulting obfuscated speech remains in the original speaker's voice, allowing for the application of traditional speaker anonymization approaches to mask the speaker's identity. The original prosody can be protected by applying a transformation such as waveform change, offering a significant advantage over a technique which generates a complete obfuscated transcription (instead of splicing an obfuscated entity into original speech).

Limitations
The cloning model used, YourTTS, is trained on the VCTK dataset which consists of high-quality speech signal. It is therefore unclear whether the same accuracy would be obtained with lower quality signal which may contain some background noise. (However, it should again be noted that even if all substituted instances are identifiable in the output, the system is equivalent to a masking model.) The selection of a person NE replacement does not currently account for continuity: if the same person entity is referred to later, it may be substituted with a different entity to the previous occasion. In addition, the back-off strategy ignores aspects such as gender.
To show that the approach is feasible, very little optimization was performed. Further training and parameter optimization are likely to lead to improved performance for both ASR and NER models.
The approach is currently only implemented for person NEs but it could be extended very simply to other types of NEs. However, the degree to which other entity types require obfuscation in speech is not clear to us, as mentions of organizations may well not be identifying at all.

Ethics Statement
Aside from the ethical concerns regarding voice cloning (covered in e.g. YourTTS (Casanova et al., 2021)), deployment would require a detailed evaluation of the risk of re-identification. It is believed that the final confidence and the accuracy of each step can be combined to significantly reduce this risk. The voice itself also offers options for identification: the value of yielding substitutions in the original speaker's voice (and keeping the original prosody) would need to be weighed up against approaches which anonymize voice but preserve prosodic information.

3.1 Identification of named entities
The corpus of 70,769 instances, sampled at 16 kHz, is divided into 70% for training (49,540 instances) and 15% each for validation and evaluation (10,615 examples each). The hubert-base-ls960 model is used with parameters listed in Appendix A.1. The performance in training, indicated via WER and CER, is shown in Table 1 for both ASR and NER.

Table 2: Evaluation of individual steps