Contrastive Token-Wise Meta-Learning for Unseen Performer Visual Temporal-Aligned Translation

Visual temporal-aligned translation aims to transform a visual sequence into natural language words, and includes important applications such as lipreading and fingerspelling recognition. However, the varied ways in which different speakers or signers perform specific words can lead to visual ambiguity, which has become a major obstacle for current methods. Given this constraint, the generalization ability of a translation system should be further examined through evaluation on unseen performers. In this paper, we develop a novel generalizable framework named Contrastive Token-Wise Meta-Learning (CtoML), which strives to transfer recognition skills to unseen performers. To the best of our knowledge, employing meta-learning methods directly from the image domain poses two main challenges, and we propose corresponding strategies. First, sequence prediction in visual temporal-aligned translation, which generates multiple words autoregressively, differs from vanilla classification. Thus, we devise token-wise diversity-aware weights for the meta-train stage, which encourage the model to concentrate on ambiguously recognized tokens. Second, considering the consistency of word-visual prototypes across different domains, we develop two complementary global and local contrastive losses to maintain inter-class relationships and promote domain-independence. We conduct extensive experiments on the widely-used lipreading dataset GRID and the fingerspelling dataset ChicagoFSWild, and the results show the effectiveness of our proposed CtoML over existing state-of-the-art methods.


Introduction
Human communication is dominated by speech, which conveys semantic information through acoustic signals. However, people with dysphonia must rely on visual channels for independent expression, such as lip movements and hand gestures. Hence, automatic visual language translation helps bridge the gap between people who communicate through different senses.
In visual language translation tasks, whether observing lip movements or understanding gesture sequences, the common denominator is that the visual content and the translated natural language words are temporally aligned. In this paper, we collectively refer to tasks with this property as visual temporal-aligned translation. Specifically, lipreading, a.k.a. visual speech recognition (VSR), aims to recognize spoken sentences based on lip movements. Another task is fingerspelling recognition, where text is generated letter-by-letter from the fast, coherent and hard-to-distinguish handshapes of signers (Figure 2). Accordingly, mainstream methods utilize autoregressive models to generate multiple words.
However, current methods (Afouras et al., 2022; Shi et al., 2019) show weaknesses when applied to real-life scenarios, because different performers have a variety of performance habits on specific words, leading to ambiguity, as shown in Figure 1. Moreover, the gap between performers can be magnified in data-limited settings, i.e., environments or low-resource languages where collection and annotation are expensive. Ideally, an applicable visual temporal-aligned translation system should have excellent generalization ability and convincing recognition accuracy on unseen performers. We argue that borrowing methods from domain generalization is a feasible way to deal with this dilemma. Concretely, performers with different performance styles or presentation habits can be treated as different domains. For example, in lipreading, some speakers have slight lip movements, while others have relatively exaggerated ones. Another instance, in fingerspelling recognition, is that the handshapes of each signer on the same letter are diverse, which can be due to factors such as personal habits and movement directions. Therefore, domain-independent visual temporal-aligned translation can break through the above obstacles, especially on datasets with limited labeled samples.
In this paper, we propose an innovative generalizable framework to deal with challenging domain-independent visual temporal-aligned translation, called Contrastive Token-Wise Meta-Learning (CtoML). Video-sentence pairs of specific performers are used for training, and we then test on unseen performers. As far as we know, directly transferring meta-learning methods from the image domain to lipreading and fingerspelling recognition raises the following two challenges. First, sequence prediction in visual temporal-aligned translation autoregressively generates multiple words, which is different from vanilla classification. Consequently, we design token-wise diversity-aware weights for the meta-train phase. The variance of the interacted attention map across performers is measured and then regarded as the learning difficulty coefficient of the token, so that the model concentrates on tackling ambiguous words.
Second, taking into account the consistency of word-visual prototypes across different domains, we develop two complementary contrastive losses. Globally, we align the decoded frame-word interaction matrix between movement features and sequential sentences to maintain a consistent semantic space for the same class across domains, regardless of domain-specific variations and class-specific vocabulary positions in the sequence. Locally, relying on contrastive constraints, we encourage the model to draw closer the decoded outputs of words that are semantically similar regardless of domain, and to pull apart those that are disparate. In summary, the token-wise diversity-aware weights we devise encourage the model to focus on tokens with inherent ambiguity in sequence prediction, and the meta-learning process simulates various domain-shift scenarios to help find a generalizable learning direction. The effectiveness of our CtoML is demonstrated through extensive experiments on the lipreading benchmark dataset GRID and the fingerspelling dataset ChicagoFSWild. Our main contributions are as follows: • We are dedicated to enhancing the generalization ability of the translation system to out-of-domain performers, and correspondingly propose the Contrastive Token-Wise Meta-Learning (CtoML) framework to clarify the generalization learning direction in sequence prediction.
• To focus on the inherently ambiguous words that are confusing to recognize, we devise token-wise diversity-aware weights that reflect the learning difficulty coefficient of tokens.
• Based on the contrastive constraints, two complementary global and local losses are developed to preserve the inter-class semantic relationships and promote domain-independence.
Related Work

Lip Reading
Lipreading is the task of recognizing spoken sentences from a silent talking-face video. Early works focus on word-level recognition (Chung and Zisserman, 2016; Wand et al., 2016); with the adoption of models developed for ASR tasks, researchers then turn to sentence-level prediction (Assael et al., 2016; Chung et al., 2017; Zhang et al., 2019). Existing studies are primarily based on CTC methods (Assael et al., 2016; Petridis et al., 2018; Chen et al., 2020) and autoregressive sequence-to-sequence models (Chung et al., 2017; Zhang et al., 2019; Afouras et al., 2022). Strikingly, Transformer-based architectures (Afouras et al., 2022; Lin et al., 2021) are commonly developed and lead to significant improvements. Distilling knowledge from speech recognition to enhance the visual modality in lipreading (Zhao et al., 2020; Ma et al., 2021a) also deserves attention. Additionally, advances in self-supervised representation learning (Ma et al., 2021b; Shi et al., 2022; Pan et al., 2022; Cheng et al., 2023) employing pre-training strategies are instructive for visual speech recognition. However, the methods mentioned above do not delve into the generalization ability of lipreading models, which motivates us to explore the direction of model learning on unseen speakers.

Figure 3: The overall framework of the diversity-aware Transformer, composed of a visual encoder, a token-wise weights calculator and a decoder. Diversity-aware decoding with token-wise weights is applied in the meta-train stage.

Fingerspelling Recognition
Fingerspelling recognition is a component of sign language recognition that aims to discriminate the fine-grained handshapes of signers. Since the introduction of early end-to-end models (Koller et al., 2017; Shi and Livescu, 2017; Papadimitriou and Potamianos, 2019) to continuous sign language recognition (Jin et al., 2022b,a), fingerspelling recognition in the wild has achieved substantial progress with a greater emphasis on real-life scenarios (Joze and Koller, 2018; Shi et al., 2018, 2019; Gajurel et al., 2021; S et al., 2021). Subsequently, a multi-task learning manner is proposed in recent research (Shi et al., 2021; Jiang et al., 2022) to detect effective gesture regions while recognizing fingerspelling signs. Although fingerspelling recognition is a constrained task, it is actually more suitable for evaluating the generalization capability of a model on unseen signers, due to the indistinguishable ambiguity caused by fast fine-grained finger movements.

Domain Generalization
Domain generalization aims to train a model on limited source domains that generalizes directly to unseen target domains. Recently proposed methods are progressive in several aspects, including (1) representation learning: domain-alignment based methods (Mahajan et al., 2021) for learning domain-agnostic representations and disentangled representation methods (Zhang et al., 2022) for separating domain-specific and domain-invariant features; and (2) learning strategies: meta-learning (Shu et al., 2021; Jin and Zhao, 2021) with meta-train and meta-test stages. Despite advancements in speech fusion and synthesis (Huang et al., 2022, 2023; Li et al., 2023), domain generalization for visual temporal-aligned translation tasks has only seen preliminary observations on lipreading with unseen speakers (Assael et al., 2016). For fingerspelling, although the existing real-life dataset (Shi et al., 2019) has signer-disjoint splits, the recognition accuracy still retains a large capacity for improvement. Thus, we propose a novel framework called CtoML to solve this task.

Problem Formulation
We first introduce the problem formulation of visual temporal-aligned translation. We are given a sequence of frames from a video segment S = {s_1, s_2, ..., s_{T_s}} and its transcription L = {l_1, l_2, ..., l_{T_l}}, where s_i is the i-th frame, T_s is the number of frames in the segment, l_j is the j-th word or letter, T_l is the number of transcribed units and T_l ≤ T_s. Here, the visual content of the video segment S is temporally aligned with the semantics of the textual sequence L. In the domain-independent setting, we treat each speaker or each cluster of signers as a domain, denoted as D_m. Next, the source training set D_sr and the target testing set D_tg are divided strictly according to the performers, ensuring that performers appearing in D_sr are never seen in D_tg, i.e. D_sr ∩ D_tg = ∅. Thus, the training set containing video-sentence pairs can be denoted as D_sr = {(S_n, L_n)}_{n=1}^{N_sr} and the testing set as D_tg = {(S_n, L_n)}_{n=1}^{N_tg}, where N_sr and N_tg are the sizes of the training and testing sets, respectively.
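As a concrete illustration of the performer-disjoint split, the sketch below is our own minimal example (the sample format, field names and function name are assumptions, not from any released code); it partitions video-sentence pairs so that no performer appears in both sets.

from collections import defaultdict

def split_by_performer(samples, target_performers):
    """Partition (performer_id, video, sentence) samples into source/target sets
    with disjoint performers, i.e. D_sr ∩ D_tg = ∅."""
    by_performer = defaultdict(list)
    for performer_id, video, sentence in samples:
        by_performer[performer_id].append((video, sentence))

    d_sr, d_tg = [], []
    for performer_id, pairs in by_performer.items():
        (d_tg if performer_id in target_performers else d_sr).extend(pairs)
    return d_sr, d_tg

# Example: hold out speakers S1, S2, S20 and S22 as unseen performers (a GRID-style split).
samples = [("S1", "vid_001.npy", "bin blue at f two now"),
           ("S3", "vid_042.npy", "place red by d five soon")]
train_set, test_set = split_by_performer(samples, target_performers={"S1", "S2", "S20", "S22"})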

Diversity-Aware Transformer
To encourage the model to concentrate on ambiguously recognized words, we introduce a diversity-aware Transformer with token-wise weights, as shown in Figure 3. Concretely, we devise a token-wise module to capture ambiguities between the visual representations of different performers output by the visual encoder. Integrating the token-wise difficulty coefficient, natural language words are generated sequentially in the meta-train stage.
Visual Encoder. Following the vanilla Transformer (Vaswani et al., 2017) and the autoregressive visual speech recognition model TM-seq2seq (Afouras et al., 2022), the encoder of the diversity-aware Transformer is composed of stacked multi-head self-attention blocks and feed-forward layers. In advance, we prepare the features extracted by a pretrained model, denoted as F ∈ R^{T_s×d}. Then, we obtain the encoded representations F' through the visual encoder (VisEncoder) as:

F' = VisEncoder(F),    (1)

where F' ∈ R^{T_s×d}. The details of the encoder are provided in Appendix A.
Cross-Modal Decoder. We train a stable task model with the vanilla decoder before the meta-learning phase for subsequent token-wise weight calculations. The standard cross-modal decoder interacts the target word embeddings E ∈ R^{T_l×d} with the encoded visual features F' to generate character probabilities. Specifically, the decoder is stacked with self-attention, inter-attention, and feed-forward layers. At each time step t, the word embeddings E_t ∈ R^{T_t×d} produced before t are updated to E'_t via the self-attention layer:

E'_t = LN(E_t + SA(E_t)),    (2)

where E'_t ∈ R^{T_t×d}, T_t is the word embedding length up to time t, and LN(•) and SA(•) denote layer normalization and the self-attention layer, respectively. E'_t is then used as a query to calculate the output I_t of the inter-attention layer. The process with residual connections is:

I_t = LN(E'_t + MHA(E'_t, F', F')),    I'_t = LN(I_t + FFN(I_t)),    (3)

where I_t, I'_t ∈ R^{T_t×d}, I'_t denotes the output of the feed-forward network FFN(•), and MHA(•) denotes multi-head attention. Then, we produce the probability distribution p_t and the cross-entropy loss L_ta as:

p_t = softmax(I'_t W_p + b_p),    L_ta = − Σ_{t=1}^{T_l} log p_t(l_t),    (4)

where W_p and b_p are trainable parameters.
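As a rough PyTorch sketch of the decoder block described above (our own simplification, not the released implementation; the class name, layer sizes and the ReLU activation are assumptions), one block performs self-attention over the target embeddings, inter-attention to the encoded visual features, and a linear head for the character distribution:

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDecoderStep(nn.Module):
    """Single decoder block: self-attention over target embeddings,
    inter-attention to encoded visual features, feed-forward, then a softmax head."""

    def __init__(self, d=512, heads=8, vocab_size=31):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)
        self.proj = nn.Linear(d, vocab_size)  # W_p, b_p

    def forward(self, E_t, F_enc):
        # E'_t = LN(E_t + SA(E_t))
        E_prime = self.ln1(E_t + self.self_attn(E_t, E_t, E_t, need_weights=False)[0])
        # I_t = LN(E'_t + MHA(E'_t, F', F')); keep the attention map for the weights calculator
        inter_out, attn_map = self.inter_attn(E_prime, F_enc, F_enc)
        I_t = self.ln2(E_prime + inter_out)
        # I'_t = LN(I_t + FFN(I_t)), then p_t = softmax(W_p I'_t + b_p)
        I_prime = self.ln3(I_t + self.ffn(I_t))
        log_p_t = F.log_softmax(self.proj(I_prime), dim=-1)
        return log_p_t, attn_map


# Usage: F_enc encodes T_s frames, E_t holds the T_t target embeddings produced so far.
decoder = CrossModalDecoderStep()
F_enc, E_t = torch.randn(2, 75, 512), torch.randn(2, 10, 512)
log_p, attn = decoder(E_t, F_enc)  # log_p: (2, 10, vocab), attn: (2, 10, 75)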
Token-wise Weights Calculator. To obtain the learning difficulty coefficient of tokens, we compute the diversity-aware weights from the attention maps of the inter-attention layers across different performers. Using the final-layer inter-attention outputs, the personality vector u_c^{(k)} ∈ R^d of the k-th performer on word c can be denoted as:

u_c^{(k)} = (1 / n_k) Σ_{r=1}^{n_k} u_r^{(k)},    (5)

where u_r^{(k)} is an attention vector of the k-th performer and n_k is the number of samples labelled as c. Subsequently, we compute the variance of each word across different performers to reflect the ambiguities caused by different motion habits:

v_c = σ(Var_k(u_c^{(k)})),    (6)

where v_c ∈ R^d, c is the index of the word in the vocabulary, and σ denotes a non-linear activation function such as the Sigmoid. Hence, the complete word difficulty representations V = [v_c]_{c=1}^{T_c} ∈ R^{T_c×d} are produced and provided to the diversity-aware decoding, where T_c is the vocabulary length.
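The sketch below is our own illustration of the idea behind Eqns. (5) and (6) (the tensor shapes and function interface are assumptions): each performer's inter-attention vectors are averaged per word into a personality vector, and the variance across performers becomes the word's difficulty representation.

import torch

def diversity_aware_weights(attn_vectors, performer_ids, word_ids, vocab_size, d=512):
    """attn_vectors: (N, d) inter-attention vectors u_r, one per decoded token occurrence;
    performer_ids, word_ids: (N,) integer labels. Returns V: (vocab_size, d)."""
    performers = sorted(set(performer_ids.tolist()))
    V = torch.zeros(vocab_size, d)
    for c in range(vocab_size):
        prototypes = []
        for k in performers:
            mask = (word_ids == c) & (performer_ids == k)
            if mask.any():
                # personality vector u_c^(k): mean of the n_k attention vectors labelled c (Eq. 5)
                prototypes.append(attn_vectors[mask].mean(dim=0))
        if len(prototypes) > 1:
            # variance across performers, squashed by a sigmoid (Eq. 6)
            V[c] = torch.sigmoid(torch.stack(prototypes).var(dim=0))
    return V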
Diversity-aware Decoding. After obtaining the token-wise weights, we perform diversity-aware decoding on the encoded features in the meta-train stage. Specifically, we apply the token-wise weights to the cross-entropy loss in Eqn. (4) and obtain a developed loss L_da, in which the per-token loss terms are multiplied token-wise (denoted *) by the corresponding diversity-aware weights. Therefore, our model can consciously focus on ambiguous words under the adjustment of the token-wise diversity-aware weights.
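As a hedged illustration of the weighted objective (the exact reduction of v_c to a per-token scalar is not spelled out in the text, so a simple mean is assumed here), the per-token cross-entropy terms can be rescaled token-wise:

import torch
import torch.nn.functional as F

def diversity_aware_loss(log_probs, targets, V):
    """log_probs: (T_l, vocab) log p_t; targets: (T_l,) ground-truth token ids;
    V: (vocab, d) word difficulty representations. Assumes (one possible choice) that the
    scalar weight of a token is the mean of its difficulty vector."""
    per_token_ce = F.nll_loss(log_probs, targets, reduction="none")   # (T_l,)
    weights = V[targets].mean(dim=-1)                                 # (T_l,)
    return (weights * per_token_ce).mean()                            # token-wise product '*'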

Contrastive Meta-learning
Our method trains the task model on meta-train sets and then improves its generalization ability on meta-test sets. Furthermore, inter-class semantic relationships are consolidated while promoting domain-independence under two complementary contrastive constraints. During the meta-train stage, the previously computed token-wise weights cooperate with the task-specific loss. The complete meta-learning process is summarized in Alg. 1.
Concretely, in each epoch we randomly split the entire set of K source domains D_sr into meta-train (D_tr = {D_i}_{i=1}^{N_tr}) and meta-test (D_te = {D_j}_{j=1}^{N_te}) domains, where N_te = N_sr − N_tr. During the meta-train stage, the parameters are updated with the task-supervised cross-entropy loss L_da, calculated as θ' = θ − α∇_θ L_da(D_tr; θ), where θ denotes all the trainable parameters and α is the learning rate. Then, the relationship between ambiguous words and the semantic-space consistencies among performers are preserved and harmonized in the following meta-test stage.
In detail, we devise two complementary contrastive constraints. With the global objective of stabilizing inter-class relationships, we attempt to preserve the relationships between learned ambiguous words on unseen performers. Hence, for a specific performer, we compute a word distribution g_c^{(k)} with the personality vector obtained by Eqn. (5) and a softmax at temperature τ. Then, the global loss L_gl is obtained by minimizing the symmetrized relative entropy:

L_gl = (1 / N_o) Σ_{o=1}^{N_o} Σ_{c=1}^{T_c} ½ [ H(g_c^{(o,i)} ∥ g_c^{(o,j)}) + H(g_c^{(o,j)} ∥ g_c^{(o,i)}) ],    (7)

where N_o is the number of pairs of (D_i, D_j), g_c^{(o,i)} denotes the distribution of D_i in the o-th pair on word c, T_c is the vocabulary length, and H(p ∥ q) = Σ_z p_z log(p_z / q_z) is the relative entropy. Crucially, predictions should not be sensitive to unseen performers, thus requiring a local objective of reducing ambiguous word overlap regardless of performers. We gather the u_r in Eqn. (5) of all samples in D_sr in a performer-insensitive manner to obtain a set A of token-wise attention vectors, containing N_z tokens. Next, the contrastive loss L_lc we exploit is calculated as:

L_lc = (1 / N_b) Σ_{b=1}^{N_b} [ (1 − Y) ρ(x_b, y_b)^2 + Y max(0, ξ − ρ(x_b, y_b))^2 ],    (8)

where ρ(•) denotes a distance function, (x_b, y_b) is the b-th random sample pair in set A, N_b is the number of sample pairs, Y = 1 − [x_l = y_l], x_l and y_l are the respective labels, and the margin ξ controls the distance between two samples. Considering the computational complexity, when we sample pairs, we use a sorted queue that dequeues two elements at a time instead of full enumeration. We optimize the meta visual temporal-aligned translation model with the developed task loss L_da and the contrastive constraints L_gl and L_lc:

L_obj = L_da + λ (L_gl + L_lc),    (9)

where λ controls the balance and β is the learning rate of the meta update θ ← θ − β∇_θ L_obj. During the inference stage, we use the common visual encoder and cross-modal decoder without meta stages.
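The snippet below is our own reconstruction of the two constraints under the definitions above (the exact loss forms, pairing scheme and tensor shapes are assumptions): L_gl is a symmetrized KL divergence over per-word distributions of two domains, and L_lc is a margin-based contrastive loss over token-wise attention vector pairs; they are combined with L_da into the outer objective of Alg. 1.

import torch
import torch.nn.functional as F

def global_loss(g_i, g_j):
    """Symmetrized relative entropy between per-word distributions of two domains.
    g_i, g_j: (T_c, d), each row is softmax(u_c / tau) for word c."""
    kl_ij = F.kl_div(g_j.log(), g_i, reduction="batchmean")  # KL(g_i || g_j)
    kl_ji = F.kl_div(g_i.log(), g_j, reduction="batchmean")  # KL(g_j || g_i)
    return 0.5 * (kl_ij + kl_ji)

def local_loss(x, y, same_label, margin=1.0):
    """Margin-based contrastive loss over token-wise attention vector pairs.
    x, y: (N_b, d); same_label: (N_b,) bool, True when the two tokens share a label."""
    dist = F.pairwise_distance(x, y)                     # rho(.)
    pull = dist.pow(2)                                   # Y = 0: draw similar tokens closer
    push = torch.clamp(margin - dist, min=0).pow(2)      # Y = 1: push different tokens apart
    return torch.where(same_label, pull, push).mean()

# Outer objective per episode (Alg. 1, line 12): L_obj = L_da + lam * (L_gl + L_lc),
# optimized with the meta learning rate beta after the inner update with alpha.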

Experiments
We evaluate and compare our CtoML on two challenging visual temporal-aligned translation tasks, lipreading and fingerspelling recognition. Our experiments are conducted on two datasets: GRID (Cooke et al., 2006) for lipreading and ChicagoFSWild (Shi et al., 2018) for fingerspelling recognition. In this section, we provide a brief introduction to the datasets and corresponding evaluation metrics. Then, we present the concrete experimental settings and compare CtoML with baseline methods. Subsequently, we analyze the main results and conduct ablation studies. Besides, we also provide qualitative examples and analysis on the GRID and ChicagoFSWild datasets in Appendix D.
ChicagoFSWild: The ChicagoFSWild dataset contains 7,304 fingerspelling clips performed by 160 signers. The data is split into three sets with no overlapping signers: 5,455 training sentences from 87 signers, 981 development sentences from 37 signers, and 868 test sentences from 36 signers. The vocabulary size is 31, including the 26 letters and 5 special characters. In this paper, we follow the split of (Shi et al., 2018).

Experiments for Lipreading
Evaluation Metrics: Following prior works (Assael et al., 2016), we evaluate performance with the character error rate (CER) and word error rate (WER). The error rates are computed as ErrorRate = (S + D + I) / M, where S, D and I are the numbers of substitutions, deletions and insertions in the alignment, and M is the number of reference characters or words.
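For illustration only (this is our own minimal example, not the authors' evaluation script), the error rate can be computed from a standard Levenshtein edit distance:

def error_rate(reference, hypothesis):
    """ErrorRate = (S + D + I) / M via Levenshtein distance over word (or character) lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# e.g. one substitution in a six-word GRID-style sentence gives WER = 1/6 ≈ 16.7%
print(error_rate("bin blue at f two now", "bin blue at s two now"))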
Implementation Details: The videos are first processed with the Dlib detector (King, 2009), and we then extract a mouth-centered crop of size 100 × 60 as the video input. We augment the dataset with horizontal flips with 50% probability. Following (Shi et al., 2022), we obtain robust feature representations for our model. For each meta-train stage, we perform 10 iterations and take the last updated parameters as θ'. More training and parameter settings are listed in Appendix C.

Results and Analysis:
We compare our CtoML with several state-of-the-art methods: LipNet (Assael et al., 2016), SimulLR (Lin et al., 2021), LCANet (Xu et al., 2018), TM-seq2seq (Afouras et al., 2022) and AV-HuBERT (Shi et al., 2022). We denote CtoML without the modules designed for the generalization objective as Ours(base), which is trained with only the task loss L_ta. Table 1 summarizes the results of unseen-speaker lipreading on the GRID dataset in comparison with the baselines. Across all domain splits, our CtoML outperforms the state-of-the-art method AV-HuBERT by an average of 1.44% WER and 1.36% CER.
In comparison, our method provides token-wise diversity-aware weights, which help the model better grasp the learning direction for ambiguous words. Furthermore, we exploit the essence of domain generalization to improve the model's generalization capability to unseen speakers, which has not been considered in previous methods. Moreover, the performance of Ours(base) is comparable to that of the state-of-the-art methods, further demonstrating the effectiveness of our modules for generalization to unseen performers. Notably, SimulLR (Lin et al., 2021), which performs well on the overlapping regular split, struggles in domain-independent settings. This suggests that methods with special objectives may not generalize effectively. In addition, the evaluation results fluctuate across different divisions, indicating that there are indeed significant differences in performance habits between speakers.
Ablation Study: We conduct extensive ablation studies on the token-wise diversity-aware weights, contrastive constraints, and meta-learning strategies to assess all contributions. The results in Table 2 show the capabilities of each key module, where w/o.TAW denotes the model without token-wise weights in the diversity-aware decoding, w/o.CS denotes the model without contrastive constraints, and the remaining variant removes the meta-learning strategy. Comparing CtoML with w/o.CS, we find that CtoML achieves relatively superior results, indicating that the contrastive constraints work smoothly. Qualitative Analysis: Figure 4 shows qualitative results on GRID. We provide a comparison of the sentences predicted by CtoML and AV-HuBERT. Intuitively, CtoML performs better due to the joint effect of the devised token-wise diversity-aware weights, the complementary contrastive losses and the meta-learning strategy. In the first example, we can see that when encountering the letters b and d, which have similar lip movements, AV-HuBERT struggles. In contrast, CtoML effectively guides the model to deal with ambiguous words and thus predicts the text sequence correctly. Although our model does not fully succeed on the second case, it captures the fast and ambiguous features and predicts tokens close to the ground truth.

Experiments for Fingerspelling
Due to space constraints, implementation details and qualitative analysis are provided in Appendices C and D.
Evaluation Metrics: For evaluation, we adopt the letter accuracy derived from the Levenshtein string edit distance, 1 − (S + D + I) / M (Shi et al., 2019). Here, the sum of S, D and I is the minimum number of edits that transforms the prediction into the ground truth, and M is the number of ground-truth letters. Results and Analysis: For fingerspelling recognition, our CtoML is compared with four state-of-the-art methods: HDC-FSR (Shi et al., 2018), IAF-FSR (Shi et al., 2019), FGVA (Gajurel et al., 2021), and TDC-SL (Papadimitriou and Potamianos, 2020). Since the authors do not name their methods, the names used here are abbreviations assigned based on their main characteristics. We use Ours(w.ResNet18) to denote the model that adopts ResNet18 rather than ResNet50 (He et al., 2016). From Table 4, we find that CtoML outperforms the other methods by at least 4.9% and 8.7% on average on the development and test sets, respectively. This is because the performance of signers on the same letter can be diverse, and our complementary contrastive constraints allow the model to understand ambiguous words while maintaining a consistent semantic space on unseen signers. Convincingly, Ours(w.ResNet18) still surpasses the others by a margin of 4.1% on the test set, even though it uses a weaker feature extractor.

Conclusion
We have proposed a new framework called contrastive token-wise meta-learning (CtoML) for visual temporal-aligned translation, which promotes the generalization capability on unseen performers.
To concentrate on the inherently ambiguous words, we devise token-wise diversity-aware weights. Furthermore, we develop contrastive meta-learning to clarify the learning direction. Reasonable complementary contrastive constraints are provided to preserve inter-class semantic relationships and promote domain-independence. The experimental results on the GRID and ChicagoFSWild datasets demonstrate the effectiveness of CtoML.

Limitations
In this section, we discuss the limitations of this paper. Our method is difficult to validate on datasets other than those used in this paper. For example, LRS2 (Afouras et al., 2022) is a widely used dataset in visual language recognition tasks. However, since the LRS2 dataset does not provide speaker identity labels, we cannot easily divide speakers into domain-specific and domain-independent sets. Although it requires an enormous amount of work, re-annotating existing datasets through crowdsourcing or annotating a new real-life dataset with speaker labels is a viable solution. Besides, existing data augmentation methods do not perfectly match the generalization requirements of visual temporal-aligned translation, which encourages researchers to develop targeted augmentation paradigms, based on the study in this paper, to cooperate with our meta-learning strategies.

Ethics Statement
The datasets used in this study were produced by previous researchers, and we followed all relevant legal and ethical guidelines for their acquisition and use. Furthermore, we recognize the potential moral hazards of visual temporal-aligned translation, such as misuse for surveillance or eavesdropping. We are committed to conducting our research ethically and ensuring that it is beneficial.

A Encoder Details
In this section, we provide the concrete structure of the visual encoder. With the prepared visual features, the self-attention layer (SA) is defined as:

SA(F) = MHA(F, F, F),

where MHA(•) denotes multi-head attention, which projects the inputs with randomly initialized matrices into different representation subspaces. The multi-head attention is calculated from multiple single heads:

MHA(Q, K, V) = Concat(h_1, ..., h_h) W_1,    h_i = ATT(Q W_i^Q, K W_i^K, V W_i^V),

where the per-head projections W_i^Q, W_i^K, W_i^V and W_1 ∈ R^{d×d} are trainable parameter matrices, h_i denotes the i-th head and h is the number of heads. MHA(•) is the short form of multi-head attention and ATT(•) represents scaled dot-product attention:

ATT(Q, K, V) = softmax(Q K^T / √d_k) V,

where d_k is the dimension of matrix K. A residual connection and layer normalization LN(•) follow each self-attention layer:

F̃ = LN(F + SA(F)).

Subsequently, incorporating a feed-forward network FFN(•) with transformation layers and a non-linear activation function σ, we obtain the encoded features F' as:

F' = LN(F̃ + FFN(F̃)),    FFN(F̃) = σ(F̃ W_3) W_2,

where W_2 ∈ R^{4d×d}, W_3 ∈ R^{d×4d} are trainable weight matrices and F' ∈ R^{T_s×d}.
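A minimal sketch of one encoder block matching the equations above (our own illustration; the module names and the ReLU activation are assumptions). The helper shows ATT(•) explicitly, while the block itself relies on PyTorch's multi-head attention for the projections.

import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(Q, K, V):
    """ATT(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ V

class EncoderBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))  # W_3 then W_2
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, F_in):
        # intermediate = LN(F + SA(F))
        F_mid = self.ln1(F_in + self.mha(F_in, F_in, F_in, need_weights=False)[0])
        # F' = LN(intermediate + FFN(intermediate))
        return self.ln2(F_mid + self.ffn(F_mid))

# Stacking several such blocks gives the VisEncoder used in Eq. (1): F' = VisEncoder(F).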

B Hyperparameter Analysis
As depicted in Table 5, we tune the initialization combination of the learning rates α and β and find that α = 1 × 10^-3, β = 5 × 10^-4 gives the best performance. Besides, the result in Figure 6 shows that the performance is best on both partitions when τ is 0.2.

C Implementation Details
Lipreading: In terms of training details, we set the hidden size d to 512 for GRID. The numbers of heads and attention blocks in the multi-head attention mechanisms are 8 and 3, respectively. The dropout rate is set to 0.3. We optimize the loss function using the Adam optimizer (Kingma and Ba, 2014) with learning rates α and β initialized to 1 × 10^-3 and 5 × 10^-4. The coefficient λ in L_obj is set to 0.005. The maximum number of epochs is 30 and the batch size is 32 on an NVIDIA GeForce RTX 3090 GPU. For our CtoML, each epoch takes around 3 hours.
Fingerspelling: For each video segment, we use a face detector to obtain a face-centered crop, consistent with (Shi et al., 2019). We use ResNet50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) to extract features from sampled frames resized to 112 × 112. The hidden size d is set to 512 and the number of heads is 8. We set the dropout rate to 0.2. We use 3 attention blocks in both the encoder and decoder. The Adam optimizer is selected, and the learning rates α and β are initialized to 5 × 10^-4. The coefficient λ that controls the balance of the loss function is set to 0.005. The batch size is 32 and the maximum number of epochs is 30. The setting of the meta-train stage is the same as for lipreading.
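For convenience, the hyperparameters stated above can be gathered into a single configuration; the dictionary layout below is merely our own summary, not part of any released code.

# Training configurations summarized from the text above (GRID lipreading vs. ChicagoFSWild fingerspelling).
CONFIGS = {
    "grid_lipreading": {
        "hidden_size": 512, "heads": 8, "attention_blocks": 3, "dropout": 0.3,
        "optimizer": "adam", "alpha": 1e-3, "beta": 5e-4, "lambda": 0.005,
        "epochs": 30, "batch_size": 32, "crop": (100, 60),
    },
    "chicagofswild_fingerspelling": {
        "hidden_size": 512, "heads": 8, "attention_blocks": 3, "dropout": 0.2,
        "optimizer": "adam", "alpha": 5e-4, "beta": 5e-4, "lambda": 0.005,
        "epochs": 30, "batch_size": 32, "crop": (112, 112), "backbone": "resnet50",
    },
}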

D Qualitative Results
In this section, we provide several examples for qualitative analysis on the ChicagoFSWild dataset to demonstrate the superiority of our proposed CtoML.
From Figure 7, we find that our CtoML predicts the most promising results with all the proposed modules. The green characters are predictions that meet our expectations, while the red characters are wrong predictions or ones that lowered the evaluation metric. The three examples, from top to bottom, are representative samples selected from the train, development, and test sets. Note that the top example is a result from the training process, shown only to observe the impact of ambiguous words, and does not reflect the effect of the model. Through the above examples, we can draw three observations: (1) The visual appearance of some characters is so similar that it becomes the main factor confusing the model. (2) In continuous signing, the characters in the middle are more likely to cause prediction errors or omissions due to overly fast or incomplete signs. (3) Owing to how the dataset was produced, irrelevant signs may appear at the beginning of a video clip, resulting in redundant predicted characters. For the first two challenges, our model shows considerable performance due to its excellent generalization ability.

Figure 1: Ambiguities in visual temporal-aligned translation. Green box: similar performance of the same performer on different words. Blue box: diverse performance of different performers on the same text unit.
Algorithm 1 Contrastive Meta Visual Temporal-Aligned Translation
Input: Source training domains D_sr = {D_m}_{m=1}^{N_sr}
Initialize: Model parameters θ; hyperparameters α, β, λ
1: while not converged do
2:   Randomly split D_sr into D_tr and D_te
3:   Sample a batch B_tr from D_tr    ▷ Meta-train
4:   for all B_tr do
5:     Compute the task-specific loss L_da(B_tr; θ)
6:     θ' ← θ − α∇_θ L_da
7:   end for
8:   Sample a batch B_te from D_te    ▷ Meta-test
9:   for all B_te do
10:    Compute the global loss (B_i ∈ B_tr, B_j ∈ B_te): L_gl(B_i, B_j; θ')
11:    Compute the local loss (B_sr ← [B_tr, B_te]): L_lc(B_sr; θ')
12:    Compute the meta objective L_obj = L_da + λ(L_gl + L_lc)
13:  end for
14:  θ ← θ − β∇_θ L_obj
15: end while
16: return Model parameters θ

Figure 4: Qualitative results on GRID. Words in red are incorrect predictions.

Figure 5: The impact of λ on WER and CER.

Table 1: Results of CtoML on the four splits of the GRID dataset compared to the baselines and variants. S(1&2&20&22) represents the four unseen speakers S1, S2, S20 and S22, and the others are similar. All values are percentages.

Table 2: Ablation results of our CtoML on the GRID dataset.

Table 3: Ablation study of the two devised complementary contrastive constraints on the GRID dataset.

Table 4: Results of CtoML on the ChicagoFSWild dataset compared to the baselines, where dev and test denote the development and test sets, respectively.

Table 5: The impact of different initialization learning rates on GRID S(1&2&20&22).