Hearing Lips in Noise: Universal Viseme-Phoneme Mapping and Transfer for Robust Audio-Visual Speech Recognition

Audio-visual speech recognition (AVSR) provides a promising solution to improve the noise-robustness of audio-only speech recognition with visual information. However, most existing efforts still focus on the audio modality to improve robustness, considering its dominance in the AVSR task, with noise adaptation techniques such as front-end denoising. Though effective, these methods usually face two practical challenges: 1) a lack of sufficient labeled noisy audio-visual training data in some real-world scenarios, and 2) limited model generality to unseen testing noises. In this work, we investigate the noise-invariant visual modality to strengthen the robustness of AVSR, which can adapt to any testing noise without depending on noisy training data, a.k.a., unsupervised noise adaptation. Inspired by the human perception mechanism, we propose a universal viseme-phoneme mapping (UniVPM) approach to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noisy condition. Extensive experiments on the public benchmarks LRS3 and LRS2 show that our approach achieves the state-of-the-art under various noisy as well as clean conditions. In addition, we also outperform previous state-of-the-art methods on the visual speech recognition task.


Introduction
The world surrounding us involves multiple modalities, including vision, audio, text, etc., which complement each other and jointly comprise human perception (Baltrušaitis et al., 2018; Zhu et al., 2021b). Audio-visual speech recognition (AVSR) leverages both audio and visual modalities to understand human speech, which provides a promising solution to improve the noise-robustness of audio-only speech recognition with noise-invariant lip movement information (Sumby and Pollack, 1954). Code is available at https://github.com/YUCHEN005/UniVPM.

However, most existing efforts still focus on the audio modality to improve noise-robustness considering its dominance in AVSR, as the audio modality contains much richer information to represent speech content than the visual modality (Sataloff, 1992; Ren et al., 2021). Current mainstream approaches introduce noise adaptation techniques to improve robustness (experimental analyses are in §A.1 and §4.2), inspired by robust speech recognition (Wang et al., 2020). Most of them leverage noise-corrupted training data to strengthen robustness (Afouras et al., 2018a; Ma et al., 2021b; Song et al., 2022), and recent works extend this to the self-supervised learning scheme (Shi et al., 2022b; Hsu and Shi, 2022). Based on that, the latest works introduce speech enhancement as a front-end to denoise before recognition (Xu et al., 2020; Hong et al., 2022). Despite their effectiveness, these methods usually face two practical challenges. First, they require abundant labeled noisy audio-visual data for network training, which is not always available in some real-world scenarios (Lin et al., 2021; Chen et al., 2022a). Second, the well-trained model may not adapt to new noise scenes in practical applications, resulting in less optimal model generality (Meng et al., 2017). Therefore, our research idea in this paper is to leverage the visual modality to develop a general noise-robust AVSR system without dependence on noisy training data.
We may gain some inspiration from the human perception mechanism of noisy audio-visual speech. Neuroscience studies (Nath and Beauchamp, 2011) find that the human brain unconsciously relies more on lip movement to understand speech under noisy conditions (a.k.a., the McGurk Effect; McGurk and MacDonald, 1976). During this process, instead of directly recognizing lip movement, the human brain first transfers it to a speech signal in the auditory cortex for further understanding (Bourguignon et al., 2020; Mégevand et al., 2020). With prior knowledge of lip-audio mapping, the human brain can restore informative clean audio from lip movement under any noisy condition to aid speech understanding (Bernstein et al., 2004; Aller et al., 2022).
Motivated by the above observations, we propose a universal viseme-phoneme mapping approach (UniVPM) to implement modality transfer, which can restore clean audio from lip movement to enable speech recognition under any noisy condition. We first build two universal memory banks to model all the visemes and phonemes via online balanced clustering. Based on that, an adversarial mutual information estimator is proposed to construct a strong viseme-phoneme mapping, which enables the final lip-to-audio modality transfer via retrieval. As a result, our system can adapt well to any testing noise without noisy training data. Empirical results show the effectiveness of our approach. Our contributions are summarized as follows:

• We present UniVPM, a general noise-robust AVSR approach built on the visual modality, which can adapt to any testing noise without dependence on noisy training data, a.k.a., unsupervised noise adaptation.
• We build two universal banks to model all the visemes and phonemes via online balanced clustering, followed by an adversarial mutual information estimator to construct a strong mapping between them, which enables modality transfer to restore clean audio from lip movement for speech recognition under any noise.
• Our UniVPM outperforms previous state-of-the-art methods on the LRS3 and LRS2 benchmarks. Extensive experiments also show its superiority on the visual speech recognition (VSR) task.

Related Work
Audio-Visual Speech Recognition. AVSR provides a promising solution to noise-robust speech recognition with the noise-invariant visual modality (Afouras et al., 2018a). However, most existing efforts still focus on the audio modality to improve robustness, considering its dominance in the AVSR task (Sataloff, 1992; Ren et al., 2021). Mainstream approaches introduce noise adaptation techniques to strengthen robustness, where most of them leverage noise-corrupted data to improve network training (Afouras et al., 2018a; Ma et al., 2021b; Pan et al., 2022; Shi et al., 2022b; Hsu and Shi, 2022), and recent works further introduce speech enhancement as a front-end to denoise before recognition (Xu et al., 2020; Hong et al., 2022). Despite their effectiveness, these methods require abundant labeled noisy audio-visual training data that is not always available in some real scenarios, and they may not adapt to new noise scenes in practical applications. In this work, we investigate the visual modality to develop a general noise-robust AVSR approach without dependence on noisy training data, a.k.a., unsupervised noise adaptation.

Memory Network. Memory network (Weston et al., 2014) presents a long-term memory component that can be read from and written to with inference capability. Miller et al. (2016) introduce a key-value memory structure where the key memory is used to address a query and the retrieved output is obtained from the value memory using the address. Since this scheme can remember selected information, it is effective for augmenting features in many tasks, including video prediction (Lee et al., 2021), cross-modal retrieval (Song et al., 2018; Chen et al., 2020a), lip reading (Kim et al., 2021a, 2022) and talking face generation (Park et al., 2022). Despite these advances, the memory network is prone to overfitting when handling imbalanced data distributions, a.k.a., long tail (Liu et al., 2019), and may fail to model the minority classes well. In this work, we propose to build two memory banks via online balanced clustering to model all the visemes and phonemes equally, i.e., universally.
Viseme-Phoneme Mapping. Viseme-phoneme mapping is important to many visual-audio learning tasks, including speech recognition (Chan et al., 2022), lip reading (Ren et al., 2021) and lip-to-speech synthesis (Prajwal et al., 2020). Among them, cross-modal distillation is a popular technique to transfer knowledge from viseme to phoneme (Afouras et al., 2020; Zhao et al., 2020; Ren et al., 2021). Other works design specific neural networks to learn their mapping (Qu et al., 2019; Kim et al., 2021b). Recent studies introduce self-supervised learning to capture correlations between visemes and phonemes (Qu et al., 2021; Ma et al., 2021a). Though effective, these methods are often challenged by the ambiguity of homophenes (Bear and Harvey, 2017), where one lip shape can produce different sounds. To this end, we propose an adversarial mutual information estimator to construct a strict viseme-phoneme mapping with the strong distinguishing ability of adversarial learning.

Overview
The overall framework of the proposed UniVPM is illustrated in Fig. 2. During training, we first send the input video and clean audio streams into two front-ends for processing, which generates modality sequences f_v, f_a ∈ R^{T×D}, where T is the number of frames and D is the embedding dimension. These frames are sent into two memory banks to model all the visemes and phonemes, using an online balanced clustering algorithm where each cluster center represents a specific viseme or phoneme. Then, we propose an adversarial mutual information estimator to construct a strong mapping between corresponding visemes and phonemes. Based on that, we finally implement modality transfer via retrieval to restore clean audio from visual signals, which enables speech recognition under any testing noise.
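The retrieval step described above can be sketched as follows: each visual frame is matched to its nearest viseme center, and the phoneme center at the same index is returned as the restored audio feature. This is a minimal NumPy sketch assuming a one-to-one viseme-phoneme index mapping; function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def transfer_modality(f_v, viseme_centers, phoneme_centers):
    """Restore audio-like features from visual frames by retrieval.

    f_v: (T, D) visual frame sequence.
    viseme_centers, phoneme_centers: (N, D) cluster-center banks, where
    index i of both banks is assumed to refer to the same speech unit.
    Returns a (T, D) sequence of retrieved phoneme centers.
    """
    # squared Euclidean distance from every frame to every viseme center
    dists = ((f_v[:, None, :] - viseme_centers[None, :, :]) ** 2).sum(-1)  # (T, N)
    idx = dists.argmin(axis=1)      # index of the nearest viseme per frame
    return phoneme_centers[idx]     # mapped phoneme center per frame
```

In the full model the retrieved sequence stands in for clean audio at inference time, so the recognizer never sees the noisy waveform.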

Online Balanced Clustering
Clustering is a widely used knowledge discovery technique to partition a set of data points into homogeneous groups, with a variety of applications such as data mining (Fayyad et al., 1996). Among clustering methods, the K-Means algorithm (MacQueen, 1967) is the most well-known and popular one. However, it cannot be directly applied to our viseme and phoneme clustering due to the imbalanced data distribution (see §A.4), which challenges K-Means clustering according to the uniform effect (Xiong et al., 2006), as shown in Fig. 3 (a).

First, we set the number of clusters N to 40, following the number of English phonemes (Phy, 2022). Then, we set a maximum cluster size S_max (i.e., the number of samples in each cluster) to control the total memory. We also initialize an empty bank B as an overall cache, as well as a list of empty banks {B_1, B_2, ..., B_N} to cache each cluster.
The proposed algorithm is executed in three steps: center initialization, K-Means clustering and re-sampling. First, we collect the first few batches of data frames into B to initialize N dispersed cluster centers {c_1, c_2, ..., c_N}, using the K-Means++ algorithm (Arthur and Vassilvitskii, 2006). Second, we add the current batch of data to bank B and employ the vanilla K-Means algorithm to re-allocate each sample in the bank to its nearest cluster center, after which the cluster centers are updated. Finally, we propose a re-sampling strategy to balance the size of different clusters as well as control the total memory of bank B, by setting a threshold cluster size S_thr (line 12 in Alg. 1). For clusters with more than S_thr samples (i.e., majority clusters), we perform undersampling by keeping only the S_thr samples nearest to the cluster center. In contrast, for minority clusters with fewer samples than the threshold, we perform oversampling by interpolating a new sample between the center and its nearest sample with a random weight, inspired by the SMOTE algorithm (Chawla et al., 2002). In this way, as illustrated in Fig. 3 (b), the resulting clusters are balanced in size and well separated, better representing each of the visemes and phonemes.
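The re-sampling step above can be sketched per cluster as follows. This is an illustrative NumPy sketch of the under/over-sampling logic; the function name and the exact random-weight scheme are assumptions based on the description, not the paper's code.

```python
import numpy as np

def rebalance_cluster(samples, center, s_thr, rng=np.random.default_rng(0)):
    """Re-sample one cluster to exactly s_thr members.

    samples: (n, D) members of the cluster, center: (D,) cluster center.
    Majority clusters are undersampled to the s_thr nearest members;
    minority clusters get SMOTE-style interpolated samples.
    """
    d = ((samples - center) ** 2).sum(-1)          # distance of each member to the center
    if len(samples) >= s_thr:
        # undersample: keep only the s_thr samples nearest to the center
        return samples[np.argsort(d)[:s_thr]]
    # oversample: interpolate between the center and its nearest sample
    # with a random weight in [0, 1), once per missing slot
    nearest = samples[d.argmin()]
    new = [center + rng.random() * (nearest - center)
           for _ in range(s_thr - len(samples))]
    return np.vstack([samples, new])
```

Running this after every K-Means update keeps all clusters at a common size, which is what prevents the majority visemes/phonemes from dominating the banks.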

Adversarial Mutual Information Estimator
After clustering visemes and phonemes in the banks, we propose an Adversarial Mutual Information Estimator (AMIE) to construct a strong mapping between them. Mutual Information (MI) is a commonly used measure of the coherence between two distributions, which is, however, historically difficult to estimate. Recently, Belghazi et al. (2018) proposed a Mutual Information Neural Estimation (MINE) approach to approximate the MI lower bound with a neural network. Based on that, we propose an adversarial learning approach to maximize the MI between visemes and phonemes, in order to construct a strict mapping between them and thus alleviate the ambiguity of homophenes.

Preliminary Theory of MINE
Mutual information measures the mutual dependency between two probability distributions. Let $p(x, y)$ be the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ be the marginals. The mutual information can then be written in terms of the Kullback-Leibler (KL-) divergence:

$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)\, p(y)\big),$$

where $D_{KL}$ is defined as:

$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{P}\Big[\log \frac{dP}{dQ}\Big].$$

Furthermore, the KL-divergence admits the Donsker-Varadhan (DV) representation (Donsker and Varadhan, 1983; Belghazi et al., 2018):

$$D_{KL}(P \,\|\, Q) = \sup_{T:\, \Omega \to \mathbb{R}} \mathbb{E}_{P}[T] - \log\big(\mathbb{E}_{Q}[e^{T}]\big),$$

where the supremum is taken over all functions $T$ on $\Omega \subset \mathbb{R}^d$ such that the two expectations are finite. Therefore, we have the MI lower bound:

$$I(X; Y) \geq I_{\Theta}(X; Y) = \sup_{\theta \in \Theta} \mathbb{E}_{p(x, y)}[T_{\theta}] - \log\big(\mathbb{E}_{p(x)\, p(y)}[e^{T_{\theta}}]\big),$$

where $I_{\Theta}$ is the neural information measure and $T_{\theta}$ denotes a trainable neural network.
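As a numerical illustration, the DV/MINE lower bound can be evaluated from critic outputs on joint samples and on product-of-marginals samples; in MINE these scores come from the trainable network T_θ. A minimal NumPy sketch (illustrative names, not the paper's code):

```python
import numpy as np

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on MI from critic scores.

    t_joint: scores T(x, y) on samples drawn from the joint p(x, y);
    t_marginal: scores on samples from p(x) p(y) (e.g. shuffled pairs).
    The bound is E_P[T] - log E_Q[exp(T)].
    """
    return t_joint.mean() - np.log(np.exp(t_marginal).mean())
```

In practice T_θ is trained by gradient ascent on this quantity, tightening the bound toward the true MI.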

Proposed AMIE
Based on MINE, we propose an Adversarial Mutual Information Estimator to explore and maximize the mutual information between clustered visemes and phonemes. As illustrated in Fig. 2 and 4, given a visual sequence f_v, we send each frame of it into the viseme bank to find the nearest cluster center c_v, which forms the viseme sequence s_v ∈ R^{T×D}. Similarly, we obtain a phoneme sequence s_a to represent the audio features f_a. The neural network T_θ then takes {s_v, s_a} as input and outputs a scalar for MI estimation, where T_θ is a 3-layer classifier with a 1-dimensional scalar output. Furthermore, since we are not concerned with the exact value of the MI when maximizing it, we employ the Jensen-Shannon (JS) representation (Hjelm et al., 2018) to approximate the KL-divergence in Eq. 4, which has been shown to yield more stable neural network optimization. Therefore, the mutual information between clustered visemes and phonemes is estimated as:

$$I(s_v, s_a) = \mathbb{E}\big[-\mathrm{sp}\big(-T_{\theta}(s_v, s_a)\big)\big] - \mathbb{E}\big[\mathrm{sp}\big(T_{\theta}(s_v, \tilde{s}_a)\big)\big],$$

where $\tilde{s}_a$ is the shuffle-ordered version of s_a that follows the marginal distribution of phonemes, and sp(z) = log(1 + e^z) is the softplus function.
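A minimal sketch of this JS-based estimate, given critic scores on aligned and on shuffled viseme-phoneme pairs (illustrative names; the real T_θ is the 3-layer network described above):

```python
import numpy as np

def softplus(z):
    """sp(z) = log(1 + e^z), computed stably via log1p."""
    return np.log1p(np.exp(z))

def js_mi_estimate(t_joint, t_shuffled):
    """Jensen-Shannon MI surrogate (Hjelm et al., 2018).

    t_joint: critic scores on aligned (viseme, phoneme) pairs;
    t_shuffled: scores on pairs with the phoneme sequence shuffled,
    which approximates the product of marginals.
    """
    return (-softplus(-t_joint)).mean() - softplus(t_shuffled).mean()
```

Note that, unlike the DV bound, this surrogate is bounded and does not track the exact MI value, which is acceptable here since only its gradient direction is used for maximization.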
As stated in Belghazi et al. (2018), the neural network T_θ can be used to estimate the MI between generated data (s_v, s_a in our case) by training directly on them. However, this suffers from the poor quality of generated data at the early training stage. One feasible scheme (Zhu et al., 2021a) is to train T_θ on real data (f_v, f_a in our case) and then estimate the MI on generated data, but this suffers from the ambiguity of homophenes (see Fig. 8). To this end, we propose AMIE with adversarial learning to estimate and maximize the MI between corresponding visemes and phonemes, which can construct a strict viseme-phoneme mapping without ambiguity.
Inspired by GAN (Goodfellow et al., 2014), we design the AMIE as the discriminator and the viseme-phoneme banks as the generator. Based on that, the adversarial loss is defined as:

$$\mathcal{L}_{GAN} = I(f_v, f_a) - \big(I(s_v, s_a) + I(f_v, \hat{f}_a)\big),$$

where $\hat{f}_a$ denotes the restored audio sequence. Our framework employs an adversarial learning strategy for optimization, where D and G play a two-player minimax game as detailed in Alg. 2. As a result, the estimated MI between corresponding visemes and phonemes is maximized to construct the mapping relationships. The strong distinguishing ability of adversarial learning enables a strict viseme-phoneme mapping that overcomes the ambiguity of homophenes, as shown in Fig. 5.
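The minimax structure can be sketched as follows, assuming the loss form implied by Appendix B.6, i.e., the discriminator raises the MI estimate on real audio-visual pairs while pushing it down on generated pairs, and the generator does the opposite on the generated-pair term. This is an assumed reading of Eq. 11, not the paper's verbatim objective.

```python
def gan_losses(mi_real, mi_gen):
    """Sketch of the two minimax objectives (assumed form).

    mi_real: estimated I(f_v, f_a) on real audio-visual sequences;
    mi_gen: estimated I(s_v, s_a) + I(f_v, restored audio) on
            generated pairs.
    Returns (l_d, l_g): quantities each player *minimizes*.
    """
    l_d = -(mi_real - mi_gen)  # discriminator step: maximize L_GAN
    l_g = -mi_gen              # generator step: raise MI on generated pairs
    return l_d, l_g
```

Alternating gradient steps on l_d (critic only) and l_g (banks and front-ends only) implement the two-player game of Alg. 2.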

Optimization
The UniVPM is optimized in an end-to-end manner (see Alg. 2), with the final training objective:

$$\mathcal{L} = \mathcal{L}_{ASR} + \lambda_{GAN}\, \mathcal{L}_{GAN} + \lambda_{rec}\, \mathcal{L}_{rec} + \lambda_{var}\, \mathcal{L}_{var},$$

where L_ASR denotes the downstream speech recognition loss, and L_var is a variance regularization term to disperse the clustered viseme and phoneme centers, which aims to ease their mapping construction. λ_GAN, λ_rec and λ_var are weighting parameters.

We also select some mainstream AVSR approaches as baselines for comparison, e.g., u-HuBERT (Hsu and Shi, 2022); details are presented in §B.7.
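Assuming the objective is the standard weighted sum of the four losses (the default weight values below are those reported in Appendix B.5), the combination can be sketched as:

```python
def total_loss(l_asr, l_gan, l_rec, l_var,
               lam_gan=0.1, lam_rec=0.2, lam_var=0.5):
    """Weighted UniVPM training objective (Eq. 12, assumed weighted-sum
    form); defaults follow the paper's reported hyper-parameters."""
    return l_asr + lam_gan * l_gan + lam_rec * l_rec + lam_var * l_var
```

The ASR loss keeps the unit weight, reflecting that recognition remains the primary task while the other terms regularize the banks and the mapping.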
Table 3 further compares the performance of UniVPM with AV-HuBERT on unseen testing noises, which are sampled from the DEMAND (Thiemann et al., 2013) dataset. First, when AV-HuBERT is finetuned and tested both on DEMAND noise, good WER performance can be achieved. However, if it is finetuned on MUSAN noise and then tested on unseen DEMAND noise, the performance degrades considerably. In comparison, our UniVPM finetuned on clean data (purple shades) achieves significant improvement and surpasses the AV-HuBERT finetuned on MUSAN noise, which further verifies the strong generality of our model. Furthermore, when finetuned on MUSAN noise, our UniVPM even outperforms the AV-HuBERT finetuned on in-domain DEMAND noise, which highlights the superiority of our approach on unseen testing noises.
Visual Speech Recognition. To further verify the effectiveness of UniVPM, we evaluate its VSR performance by discarding the input audio modality during inference, where UniVPM yields a large improvement over the AV-HuBERT baseline (34.7%→26.7% WER). Although the visual-only training and self-training strategies improve AV-HuBERT's results, our UniVPM still sets a new state-of-the-art on the LRS3 benchmark.

Effect of Online Balanced Clustering. The proposed online balanced clustering models all the visemes and phonemes equally without over-fitting, which is further shown in Fig. 5.

Effect of AMIE.
As presented in Table 5, AMIE plays a key role in the promising performance of UniVPM by constructing a strong viseme-phoneme mapping. As a comparison, the contrastive learning baseline only provides limited improvement, while MINE performs better by maximizing the estimated MI between visemes and phonemes. On top of MINE, introducing the JS representation stabilizes system optimization and improves performance, but it still suffers from the ambiguity of homophenes. To this end, our adversarial learning approach achieves further improvement by constructing a strict viseme-phoneme mapping without ambiguity, as shown in Fig. 8.
Analysis of Adversarial Learning. As illustrated in Eq. 11, there are two key components in adversarial learning, i.e., I(s_v, s_a), which constructs the viseme-phoneme mapping, and I(f_v, f̂_a), which supervises the quality of the restored clean audio. Results in Table 5 indicate that the viseme-phoneme mapping is the most important, while the supervision on restored clean audio also improves the AVSR performance.
Analysis of Regularization. According to Eq. 12, L_rec and L_var are two auxiliary regularization terms, where the former supervises the quality of the restored audio and the latter disperses the clustered viseme and phoneme centers to ease their mapping construction. Both are shown to contribute positively to the performance gains.
Visualizations. Fig. 5 presents t-SNE visualizations and confusion matrices to further verify the effectiveness of UniVPM. First, the online clustering baseline generates gathered viseme and phoneme centers due to over-fitting, where only several majority phonemes are modeled, as shown in (g). Our proposed online balanced clustering alleviates this over-fitting issue and generates separated phoneme centers, which cover most of the real phonemes, as illustrated in (h). However, we can still observe gathered viseme centers due to homophenes, and the resulting ambiguity of the viseme-phoneme mapping is shown in (k). To this end, our proposed AMIE effectively alleviates the ambiguity of homophenes thanks to the strong distinguishing ability of adversarial learning, constructing a strict viseme-phoneme mapping in (l). Meanwhile, we also observe dispersed viseme centers in (c), which can distinguish the same visemes that correspond to different phonemes. In addition, the real phonemes are also better modeled by clustering, as shown in (i).
Evaluation of Modality Transfer. The proposed AMIE significantly improves the quality of the restored clean audio with a strict viseme-phoneme mapping, which also yields a better VSR result.

Conclusion
In this paper, we propose UniVPM, a general robust AVSR approach built on the visual modality via unsupervised noise adaptation. UniVPM constructs a universal viseme-phoneme mapping to implement modality transfer, which can restore clean audio from visual signals to enable speech recognition under any noise. Experiments on public benchmarks show that UniVPM achieves the state-of-the-art under various noisy as well as clean conditions. Further analysis also verifies its effectiveness on the VSR task.

Limitations
We state two limitations and directions for future work in this section. First, UniVPM combines both the restored clean audio and the original input audio for downstream speech recognition, but without any trade-off mechanism to weight them. For example, under extremely noisy conditions the restored clean audio plays a more important role, while in less noisy scenarios the original audio may provide more valid information. Weighting strategies that select the most effective audio information could benefit downstream speech recognition. Second, the proposed clustering and viseme-phoneme mapping are actually unsupervised schemes, so it could be promising to extend UniVPM to the popular self-supervised learning framework, in order to make full use of abundant unlabeled data.

A Supplementary Experimental Analysis
A.1 Analysis of the Noise-Robustness of AVSR

Table 7 presents the performance of AV-HuBERT to analyze the noise-robustness of AVSR systems. First, as the original motivation of AVSR, the visual modality significantly improves audio-only speech recognition performance under various noisy as well as clean testing conditions, especially in low-SNR environments. However, most existing efforts still focus on the audio modality to improve robustness, considering its dominance in the AVSR task; the reason is the inherent information insufficiency of the visual modality to represent speech content. Mainstream approaches introduce noise adaptation techniques (Hu et al., 2022a,b; Chen et al., 2022b, 2023a,b; Hu et al., 2023a,b,c; Zhu et al., 2023a,b) to strengthen robustness, where most of them leverage noise-corrupted data to improve network training (Afouras et al., 2018a; Ma et al., 2021b; Pan et al., 2022; Shi et al., 2022b; Hsu and Shi, 2022; Chen et al., 2022c; Hu et al., 2023d; Zhu et al., 2023c). As shown in Table 7, available noisy training data significantly improves AVSR performance under different testing conditions. However, this strategy is usually faced with two practical challenges. First, it requires abundant labeled noisy audio-visual training data, which is not always available in some real-world scenarios (Meng et al., 2017; Long et al., 2017; Lin et al., 2021; Chen et al., 2022a). For instance, in scenarios like theatre, it is valuable to develop an AVSR system but costly to obtain sufficient training data. Second, as it is impossible to cover all the real-world noises in training data, when some unseen noises appear in practical testing scenarios, the well-trained model may not perform well as shown in Table 3, resulting in less optimal model generality (Meng et al., 2017).

Table 7: WER (%) of AV-HuBERT on the LRS3 benchmark. "Mode" denotes the input modality during both finetuning and inference stages, "PT Type" denotes the pre-training data type, "FT Type" denotes the finetuning data type, and "avg" denotes the average performance over all SNRs.

A.2 Analysis of Limited In-domain Noisy Audio-Visual Data
According to §1 and §A.1, the first challenge of audio modality-based robust AVSR is the limited in-domain noisy audio-visual data, which leads to a domain mismatch between training and testing (Meng et al., 2017; Long et al., 2017; Lin et al., 2021; Chen et al., 2020c, 2022a). There are actually two methods of obtaining such data, i.e., collection and simulation. First, we can collect and transcribe large amounts of noisy audio-visual data under real-world scenarios, but that is extremely time-consuming and laborious, and to the best of our knowledge there is currently no such public dataset. Second, as sufficient clean transcribed audio-visual data exists (Afouras et al., 2018b; Chung et al., 2017), we can collect in-domain noise to simulate noisy audio-visual data. However, this data augmentation method can only partially alleviate, not resolve, the domain mismatch problem (Zhang et al., 2022). Worse, in-domain noise data is also not always available in all real-world scenarios (Meng et al., 2017; Long et al., 2017; Chen et al., 2020c, 2022a).
As presented in Table 1, in the case of no available in-domain noise, our UniVPM achieves comparable performance to the previous state-of-the-art trained on in-domain noise. When in-domain noise is available, our UniVPM directly outperforms the previous state-of-the-art, which breaks the limit of data augmentation and moves one step closer to the real noisy data training setting (i.e., the oracle). In addition, Table 3 further investigates the cases with out-of-domain training noise, where our UniVPM even surpasses the previous state-of-the-art trained on in-domain noise. As a result, our proposed approach effectively alleviates the limitation of in-domain noisy data in audio modality-based robust AVSR.

A.3 Analysis of UniVPM from Meta-Learning Perspective
The main idea of our proposed UniVPM can also be explained from a meta-learning perspective (Raghu et al., 2019), i.e., learning how to learn. In the AVSR task, considering the inherent information insufficiency of the visual modality to represent speech content (Sataloff, 1992; Ren et al., 2021), the key factor of robustness is still the informative audio modality. However, audio is usually interfered with by background noise during practical inference. Therefore, the key to improving robustness is to gain sufficient knowledge from clean audio at the training stage, and meta-learning exactly tells AVSR how to learn from the clean audio. Motivated by this idea, we leverage clean audio-visual data to train the core modules of UniVPM, i.e., the viseme and phoneme banks, where video serves as the "prompt" and clean audio serves as the "meta". In particular, our UniVPM learns the mapping between visemes and phonemes, which then enables modality transfer to restore clean audio against testing noises. Here, the viseme-phoneme mapping defines how to learn from clean audio. Therefore, we only need the video "prompt" during inference to access the clean audio "meta", which enables UniVPM to adapt to any testing noise.

A.4 Analysis of Phoneme Distribution in LRS3 and LRS2 Datasets

Fig. 6 presents the phoneme distribution in the LRS3 and LRS2 datasets. We can observe that in both datasets, the phonemes obey a long-tail distribution (Liu et al., 2019), with head classes including 'h#', 'ih', 'n', 'l', 's', 'ah', etc. For better visualization, Fig. 7 removes the dominant phoneme 'h#' and still presents a long-tail distribution. Therefore, a neural network trained on these data is prone to over-fitting to head phoneme classes, resulting in less satisfactory performance on tail classes. LRS3 and LRS2 are both large-scale English reading speech datasets recorded with thousands of speakers from a wide range of races, so they can be roughly representative of the phoneme distribution of the English language.

B Experimental Details
B.1 Datasets

LRS3 (Afouras et al., 2018b) is currently the largest public sentence-level lip reading dataset, which contains over 400 hours of English video extracted from TED and TEDx talks on YouTube (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html). The training data is divided into two parts: pretrain (403 hours) and trainval (30 hours), both transcribed at the sentence level. The pretrain part differs from trainval in that the duration of its

B.2 Data Preprocessing
The data preprocessing for the above two datasets follows the LRS3 preprocessing steps in prior work (Shi et al., 2022a). For the audio stream, we extract 26-dimensional log filter-bank features at a stride of 10 ms from the input raw waveform.
For the video clips, we detect the 68 facial keypoints using the dlib toolkit (King, 2009) and align each image frame to a reference face frame via affine transformation. Then, we convert the image frame to gray-scale and crop a 96×96 region-of-interest (ROI) centered on the detected mouth. During training, we randomly crop an 88×88 region from the whole ROI and flip it horizontally with a probability of 0.5. At inference time, the 88×88 ROI is center-cropped without horizontal flipping. To synchronize the two modalities, we stack every 4 neighboring acoustic frames to match the image frames, which are sampled at 25 Hz.
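The frame-stacking step for modality synchronization (100 Hz acoustic frames → 25 Hz video frames) can be sketched as follows; this is a minimal NumPy sketch, and dropping a ragged tail is an assumption for illustration.

```python
import numpy as np

def stack_audio_frames(feats, k=4):
    """Stack every k neighboring acoustic frames along the feature axis.

    feats: (T, F) filter-bank features at 100 Hz; with k=4 the output is
    (T // k, k * F), aligned one-to-one with 25 Hz video frames.
    """
    t = (len(feats) // k) * k                       # drop any ragged tail
    return feats[:t].reshape(-1, k * feats.shape[1])
```

After stacking, each audio row covers the same 40 ms window as one video frame, so the two streams can be fused frame-by-frame.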

B.3 Model Configurations
Front-ends. We adopt the modified ResNet-18 from prior work (Shi et al., 2022a) as the visual front-end, where the first convolutional layer is replaced by a 3D convolutional layer with a kernel size of 5×7×7. The visual feature is finally flattened into a 1D vector by spatial average pooling. For the audio front-end, we use one linear projection layer followed by layer normalization (Ba et al., 2016).
UniVPM. The viseme and phoneme banks contain N = 40 clusters, following the number of English phonemes (Phy, 2022), i.e., 39 regular phonemes and one special phoneme '[PAD]' that indicates silence. It is worth mentioning that the actual number of visemes is smaller than that of phonemes due to the homophene phenomenon, i.e., a one-to-many lip-audio mapping (Bear and Harvey, 2017), but in this work we set the same number of clusters to construct a strict one-to-one viseme-phoneme mapping, as shown in Fig. 5 and Fig. 8. The cluster capacity S_max in Alg. 1 is set to 20, and the temperature τ in Eq. 9 is set to 0.1.
Speech Recognition. The downstream speech recognition model follows AV-HuBERT (Shi et al., 2022b) with 24 Transformer (Vaswani et al., 2017) encoder layers and 9 decoder layers, where the embedding dimension/feed-forward dimension/attention heads in each Transformer layer are set to 1024/4096/16, respectively. We use a dropout of p = 0.1 after the self-attention block within each Transformer layer, and each Transformer layer is dropped (Fan et al., 2019) at a rate of 0.1.
The total numbers of parameters in our UniVPM and the AV-HuBERT baseline are 478M and 476M, respectively.

B.4 Data Augmentation
Following prior work (Shi et al., 2022b), we use multiple noise categories for data augmentation to simulate noisy training data. We select the noise categories "babble", "music" and "natural" from the MUSAN noise dataset (Snyder et al., 2015), and extract some "speech" noise samples from the LRS3 dataset. For experiments on unseen testing noises (see Table 3), we also select the noise categories "Meeting", "Cafe", "Resto" and "Station" from the DEMAND noise dataset (Thiemann et al., 2013). All categories are divided into training, validation and test partitions.
During the training process, we randomly select one noise category and sample a noise clip from its training partition. Then, we randomly mix the sampled noise with the input clean audio at a signal-to-noise ratio (SNR) of 0 dB, with a probability of 0.25.
At inference time, we evaluate our model on the clean and noisy test sets respectively. Specifically, the system performance on each noise type is evaluated separately, where the testing noise clips are added at five different SNR levels: {−10, −5, 0, 5, 10} dB. Finally, the testing results on different noise types and SNR levels are averaged to obtain the final noisy WER result.
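Mixing noise into clean speech at a target SNR can be sketched as follows. This is a common recipe (scale the noise so the clean-to-noise power ratio matches the target), stated here as an assumption about the implementation rather than the paper's exact code.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Additively mix noise into clean speech at a target SNR in dB.

    The noise is rescaled so that
    10 * log10(P_clean / P_scaled_noise) == snr_db.
    """
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12      # guard against silent noise
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

With snr_db=0 (the training setting) the scaled noise carries the same power as the speech; the test-time sweep simply varies snr_db over {−10, −5, 0, 5, 10}.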

B.5 Training and Inference
Training. The noisy training data is synthesized by adding random noise from MUSAN (Snyder et al., 2015) or DEMAND (Thiemann et al., 2013) at 0 dB SNR with a probability of 0.25. We load the pre-trained AV-HuBERT for the front-ends and the downstream speech recognition model, and then follow its sequence-to-sequence (S2S) finetuning configurations (Shi et al., 2022b) to train our system. We use the Transformer decoder to decode the encoded features into unigram-based subword units (Kudo, 2018), where the vocabulary size is set to 1000. The weighting parameters λ_GAN/λ_rec/λ_var in Eq. 12 are set to 0.1/0.2/0.5, respectively. The entire system is trained for 60K steps using the Adam optimizer (Kingma and Ba, 2014), where the learning rate is warmed up to a peak of 0.001 over the first 20K updates and then linearly decayed. The training process takes ∼2.5 days on 4 NVIDIA V100 32GB GPUs, while in comparison the AV-HuBERT finetuning takes ∼1.3 days on the same hardware.
Inference. As shown in Table 1, the testing noises "Babble", "Music" and "Natural" are sampled from MUSAN, and "Speech" is drawn from LRS3, following prior work (Shi et al., 2022b). No language model is used during inference. We employ beam search for decoding, with the beam width and length penalty set to 50 and 1, respectively. All hyper-parameters in our systems are tuned on the validation set. Since our experimental results are quite stable, a single run is performed for each reported result.

B.6 Details of UniVPM Optimization
As detailed in Alg. 2, we design a two-step adversarial learning strategy for UniVPM optimization, where the discriminator and generator play a two-player minimax game. First, we maximize L_GAN to update the discriminator, with the generator detached from optimization. According to Eq. 11, maximizing the first term of L_GAN increases the mutual information (MI) between the visual and audio sequences, while maximizing the second term actually decreases the MI between visemes and phonemes, as well as the MI between the visual and restored audio sequences (the opposite of our desired viseme-phoneme mapping and modality transfer). Second, we freeze the discriminator and update the rest of the network, where minimizing L_G increases the MI between visemes and phonemes, as well as the MI between the visual and restored audio sequences. In addition, L_ASR optimizes the downstream speech recognition model, L_rec supervises the quality of the restored clean audio, and L_var disperses the viseme and phoneme centers to ease the construction of their mapping. The entire system is trained in an end-to-end manner.
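The two-step pattern (detach the generator in step 1, freeze the discriminator in step 2) can be sketched with toy modules (a minimal PyTorch sketch; the real discriminator estimates MI per Eq. 11, and the module names and shapes here are illustrative stand-ins, not the released implementation):

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins for UniVPM's modules (names and shapes are illustrative).
G = nn.Linear(4, 4)   # "generator": restores audio features from visual features
D = nn.Linear(8, 1)   # "discriminator": scores (visual, audio) feature pairs
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)
opt_g = torch.optim.SGD(G.parameters(), lr=0.1)

def two_step_update(visual, audio):
    """One round of the two-player minimax game."""
    # Step 1: update the discriminator only; the generator output is detached.
    restored = G(visual).detach()
    real = D(torch.cat([visual, audio], dim=-1)).mean()
    fake = D(torch.cat([visual, restored], dim=-1)).mean()
    loss_d = -(real - fake)   # maximizing L_GAN <=> minimizing its negative
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Step 2: freeze the discriminator and update the rest of the network.
    for p in D.parameters():
        p.requires_grad_(False)
    loss_g = -D(torch.cat([visual, G(visual)], dim=-1)).mean()  # minimize L_G
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    for p in D.parameters():  # unfreeze for the next round
        p.requires_grad_(True)
```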
In actual experiments, to save computation cost, we update B_v and B_a once every 10 epochs, which we verified has no effect on system performance. Please refer to our released code for more implementation details.

B.7 Baselines
In this section, we describe the baselines for comparison.
• RNN-T (Makino et al., 2019): RNN-T adopts the popular recurrent neural network transducer (Graves, 2012) for the AVSR task, where the audio and visual features are concatenated before being fed into the encoder.
• EG-seq2seq (Xu et al., 2020): EG-seq2seq builds a joint audio enhancement and multi-modal speech recognition system based on RNNs (Zhang et al., 2019), where the A-V features are concatenated before decoding.
• LF-MMI TDNN (Yu et al., 2020): LF-MMI TDNN proposes a joint audio-visual speech separation and recognition system based on a time-delay neural network (TDNN), where the A-V features are concatenated before being fed into the recognition network.
• AV-HuBERT (Shi et al., 2022a,b): AV-HuBERT employs self-supervised learning to capture deep A-V contextual information, where the A-V features are masked and concatenated before being fed into a Transformer encoder to compute a masked-prediction loss for pretraining; a sequence-to-sequence loss is then used for finetuning.
• u-HuBERT (Hsu and Shi, 2022): u-HuBERT extends AV-HuBERT to a unified framework of audio-visual and audio-only pre-training.

C Clustering Algorithms
C.1 Uniform Effect in K-Means
K-Means (MacQueen, 1967) is the most popular and successful clustering algorithm, which alternates sample re-allocation and center renewal to minimize the intra-cluster distance. However, Xiong et al. (2006) point out that K-Means tends to produce balanced clustering results, a.k.a. the uniform effect. This preference seriously degrades performance when the clusters are imbalanced in size: the centers of minority clusters gradually move into the territory of majority clusters, as illustrated in Fig. 3 (a). In other words, K-Means over-fits to the majority clusters, leaving the samples in minority clusters poorly modeled.

C.2 K-Means++
The performance of K-Means clustering depends on the center initialization, where the vanilla algorithm initializes cluster centers randomly. K-Means++ (Arthur and Vassilvitskii, 2006) is an improved version with dispersed initial centers. It determines the cluster centers one by one, pushing each newly initialized center as far as possible from the existing centers. As a result, the K initial cluster centers are well separated from each other, which benefits the subsequent clustering process.
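The seeding rule above can be sketched as follows (a minimal NumPy sketch of standard K-Means++ seeding, not the paper's code; each new center is drawn with probability proportional to its squared distance from the nearest existing center):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """K-Means++ seeding: dispersed initial centers for K-Means."""
    if rng is None:
        rng = np.random.default_rng(0)
    # The first center is chosen uniformly at random.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every sample to its nearest existing center.
        d2 = np.min(((X[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        # Sample the next center proportionally to that distance.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```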

C.3 Details of Online Clustering Baseline
For comparison, we build an Online Clustering algorithm as a baseline. It is similar to Alg. 1 but employs a vanilla random pruning strategy, instead of re-sampling, to control the total memory of the bank: if the number of samples in a cluster exceeds S_thr, we randomly keep S_thr of them. Compared to the proposed Online Balanced Clustering algorithm, this baseline also controls the memory size but ignores the imbalance among clusters, as indicated by the dashed ellipses in Fig. 3 (a).
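The pruning rule of this baseline amounts to the following (a minimal NumPy sketch; the function name is illustrative):

```python
import numpy as np

def random_prune(cluster, s_thr, rng=None):
    """Baseline memory control: if a cluster exceeds S_thr samples,
    randomly keep S_thr of them (no balancing across clusters)."""
    if rng is None:
        rng = np.random.default_rng()
    if len(cluster) <= s_thr:
        return cluster
    keep = rng.choice(len(cluster), size=s_thr, replace=False)
    return cluster[keep]
```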

C.4 Principles of Online Balanced Clustering
According to Alg. 1, the main idea of the proposed Online Balanced Clustering is a re-sampling operation that balances cluster sizes. For majority clusters, we perform undersampling to keep the S_thr samples nearest to the cluster center, so that the gathered clusters in Fig. 3 (a) can be separated.
For minority clusters, we introduce oversampling to interpolate new samples near the center, so that the minority clusters are highlighted. As a result, all clusters are balanced in size and separated from each other, as shown in Fig. 3 (b), which resolves the over-fitting problem. Consequently, all of the visemes and phonemes are well represented, which enables the subsequent viseme-phoneme mapping construction.

Figure 1 :
Figure 1: Illustration of noisy audio-visual speech recognition. (a) Mainstream AVSR approaches with noise adaptation. (b) Our framework constructs viseme-phoneme mapping for modality transfer, which restores clean audio from visual signals to enable speech recognition under any noisy conditions.

Figure 2 :
Figure 2: Illustration of our proposed UniVPM. (a) Training on clean audio-visual data to construct the universal viseme-phoneme mapping. (b) Inference on any noisy data with clean audio restored via modality transfer.

Figure 3
Figure 3: t-SNE visualization of clustered phonemes from (a) online clustering (with random pruning to keep a fixed cluster size; details are in §C.3), and (b) our proposed online balanced clustering. We randomly select six clusters for visualization; black triangles denote cluster centers. Dashed ellipses highlight the real phoneme classes, which are confirmed by a pre-trained phoneme recognition model (Phy, 2022).
Setup
Datasets. Our experiments are conducted on two large-scale public datasets, LRS3 (Afouras et al., 2018b) and LRS2 (Chung et al., 2017). LRS3 collects 433 hours of transcribed English videos from TED & TEDx talks; LRS2 contains 224 hours of video speech from BBC programs.
Configurations and Baselines. The proposed UniVPM is implemented based on AV-HuBERT with similar configurations, which are detailed in §B.3.

Figure 5 :
Figure 5: Left panel: t-SNE visualization of clustered viseme and phoneme centers (ellipses highlight the undesirably gathered centers). Right panel: confusion matrices of phoneme matching and viseme-phoneme mapping. In (g)-(i), the vertical axis indicates phoneme center IDs and the horizontal axis indicates real phonemes predicted by a pre-trained model (Phy, 2022), while in (j)-(l) the horizontal axis indicates viseme center IDs.

Figure 9 :
Figure 9: Illustration of the noisy training pipeline of UniVPM. Both clean and noisy audio are used for training, where the clean audio is employed for phoneme clustering and the noisy audio is used to improve the system's noise-robustness. Compared to Fig. 2, there is an extra data stream of noisy audio.

Table 4 .
In this case, with clean audio restored from lip movements, the proposed UniVPM significantly outperforms AV-HuBERT.
Table 5 presents the ablation study of the components in UniVPM. The four ablations are independent, i.e., each study is conducted with the other three components kept the same as in the full UniVPM.
The above two challenges motivate this work. With unsupervised noise adaptation investigated on the visual modality, our proposed UniVPM trained on clean data achieves AVSR performance comparable to the state-of-the-art AV-HuBERT trained on noisy data, under various noisy as well as clean testing conditions, as shown in Tables 1, 2, and 3. Moreover, when noisy training data is available, it can further improve the robustness of UniVPM and yield new state-of-the-art results on both the LRS3 and LRS2 benchmarks.