Listen, Decipher and Sign: Toward Unsupervised Speech-to-Sign Language Recognition

Existing supervised sign language recognition systems rely on an abundance of well-annotated data. Instead, an unsupervised speech-to-sign language recognition (SSR-U) system learns to translate between spoken and sign languages by observing only non-parallel speech and sign-language corpora. We propose speech2sign-U, a neural network-based approach capable of both character-level and word-level SSR-U. Our approach significantly outperforms baselines directly adapted from unsupervised speech recognition (ASR-U) models, by as much as 50% recall@10, on several challenging American Sign Language corpora spanning a range of sample sizes, vocabulary sizes, and levels of audio and visual variability. The code is available at cactuswiththoughts/UnsupSpeech2Sign.git.


Introduction
Many hearing-impaired people communicate natively in sign language (SL); for them, SL communication is as effortless as native spoken communication is for normal-hearing people. However, conversations between a hearing-impaired person and a normal-hearing person face tremendous barriers for several reasons. First, there is a shortage of people who are bilingual in spoken and sign languages. Automatic sign language recognition models exist (Koller et al., 2016; Huang et al., 2018) but are fully supervised and require large amounts of annotated data, which are hard to acquire. As a result, such systems are often limited to a small vocabulary. On the other hand, untranscribed speech audio and SL videos are quite common on the Internet, presenting an exciting possibility: given a non-parallel pair of speech and sign language datasets, can we train a model to translate between spoken and sign languages? This task, which we call unsupervised speech-to-sign language recognition (SSR-U), is analogous to well-known problems such as unsupervised machine translation (MT-U) (Ravi and Knight, 2011; Artetxe et al., 2018a; Lample et al., 2018) and unsupervised automatic speech recognition (ASR-U) (Liu et al., 2018; Chen et al., 2019; Baevski et al., 2021), albeit with a few new challenges. First, in the case of SSR-U, both modalities are continuous, whereas at least one modality is discrete in ASR-U and MT-U. Consequently, the matching process is much more challenging due to higher within- and cross-modal variability. Further, most sign and spoken languages can only be matched at the word level, as opposed to the subword level in the case of ASR-U. Not only does the space of possible mappings explode combinatorially, but less training data and fewer temporal constraints are also available to recover the correct mapping.
In this paper, we develop a neural network-based framework, speech2sign-U, for both character-level (with fingerspelling sequences) and word-level SSR-U. It achieves promising results on datasets with up to around 900 ASL signs.

Problem formulation
Suppose we have a corpus of unlabeled speech recordings sampled from the random process $A = (A_1, \cdots, A_T)$ and another, separately collected corpus of unlabeled sign language videos sampled from the random process $V = (V_1, \cdots, V_L)$. Both $A$ and $V$ contain the same semantic information but different para-linguistic information, such as speaker/signer identity and prosody. In other words, if we filter out the para-linguistic information and retain the semantic information as $X := X(A) = (X_1, \cdots, X_T) \sim P_X$ for the speech and $Y := Y(V) = (Y_1, \cdots, Y_L) \sim P_Y$ for the videos, we can find a generator function $G: \mathcal{X}^T \rightarrow \mathcal{Y}^L$ such that $Y = G(X)$. Since the corpora are unpaired, we cannot estimate $G$ directly from samples, and the goal of SSR-U is to "decipher" it using only the relations between the speech-only and video-only distributions $P_X$ and $P_Y$:

$$P_Y(y) = \sum_{x \in \mathcal{X}^T} G(y|x) P_X(x), \quad \forall y \in \mathcal{Y}^L, \qquad (1)$$

where $G(y|x) = 1$ if and only if $y = G(x)$.
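To see why matching unpaired distributions can recover $G$ at all, consider a minimal worked case of Eq. (1) (our illustration, not part of the formal development): one-step sequences over binary vocabularies.

```latex
% Toy instance of Eq. (1): T = L = 1, X = Y = {1, 2},
% with G the swap permutation G(1) = 2, G(2) = 1. Then
\[
  P_Y(1) = P_X(2), \qquad P_Y(2) = P_X(1).
\]
% Whenever P_X is non-uniform, only one of the two candidate
% permutations is consistent with the observed marginals,
% so G is identifiable from unpaired samples alone.
```

With longer sequences and larger vocabularies the same principle applies, but higher-order statistics (Section 3.2) are needed to break ties between candidate mappings.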
Proposed Methods

Character-level speech2sign-U
In the case of character-level speech2sign-U, $V$ is drawn from a collection of unlabeled fingerspelling sequences, where each $V_i$ is the hand gesture for a character. In this case, we adopt an architecture similar to wav2vec-U (Baevski et al., 2021).
Sign video preprocessing. Given a sign video $v \sim V$, we obtain its visual features by passing the raw video frames into a local feature extractor such as VGG19 or RCNN (Ren et al., 2015). The local features $(z_1, \cdots, z_L)$ are then contextualized by a sign language encoder, consisting of a two-layer multilayer perceptron (MLP) and a one-layer uni-directional LSTM:

$$c_i = \mathrm{LSTM}(\mathrm{MLP}(z_1), \cdots, \mathrm{MLP}(z_i)).$$

The sign language encoder is then trained using contrastive predictive coding (CPC) (van den Oord et al., 2018):

$$\mathcal{L}_{\mathrm{CPC}} = -\sum_{i}\sum_{k} \log \frac{\exp(z_{i+k}^{\top} W_k c_i)}{\sum_{\tilde{z} \in N_{i,k} \cup \{z_{i+k}\}} \exp(\tilde{z}^{\top} W_k c_i)},$$

where $W_k$ is a step-specific linear projection and $N_{i,k}$ is a set of negative samples chosen uniformly at random from times other than $i+k$. Finally, we apply K-means clustering on the encoder outputs to obtain the discrete sign cluster sequence $Y$.

Speech preprocessing. As in wav2vec-U, for each utterance we first use a voice activity detector (VAD) to remove silences between speech frames, and randomly insert silences between word boundaries of the sign cluster sequence so that the silence distributions of the two modalities match. Next, we contextualize the raw speech frames using wav2vec 2.0 pretrained on LibriLight to obtain frame-level features $(z_1, \cdots, z_T)$. Finally, we extract K-means clusters from $(z_1, \cdots, z_T)$ and merge consecutive frames belonging to the same clusters to obtain the segment-level speech features $X$.

Unsupervised training. A convolutional generator $G: \mathcal{X} \rightarrow \mathcal{Y}$ then generates a sequence of cluster units $(\hat{Y}_1, \cdots, \hat{Y}_L) = G(X)$ from the segment features $X$ by sampling from the posterior probabilities at each segment $i$:

$$\hat{Y}_i \sim G(\cdot \mid X_i). \qquad (5)$$

We then adopt the generative adversarial network (GAN) (Goodfellow et al., 2014) objective, training a binary classifier $D: \mathcal{Y} \rightarrow [0, 1]$ to discriminate between the real cluster sequence and the generated one:

$$\min_{G} \max_{D} \; \mathbb{E}_{Y \sim P_Y}[\log D(Y)] + \mathbb{E}_{X \sim P_X}[\log(1 - D(G(X)))] - \lambda \mathcal{L}_{gp} + \gamma \mathcal{L}_{sp} + \eta \mathcal{L}_{cd}, \qquad (6)$$

where $\mathcal{L}_{gp}$, $\mathcal{L}_{sp}$ and $\mathcal{L}_{cd}$ stand for the gradient penalty, smoothness penalty and code diversity losses as defined in (Baevski et al., 2021).
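For concreteness, the CPC objective above can be sketched in a few lines of PyTorch. This is a minimal illustration under our own simplifying assumptions (uniform negative sampling over all time steps, illustrative tensor names), not the released implementation.

```python
import torch
import torch.nn.functional as F

def cpc_loss(c, z, W, n_negatives=32):
    """Contrastive predictive coding (InfoNCE) loss.

    c: (T, d) LSTM context vectors; z: (T, d) MLP frame features;
    W: (K, d, d) one linear predictor per prediction step k.
    """
    T = z.shape[0]
    K = W.shape[0]
    loss = 0.0
    for k in range(1, K + 1):
        pred = c[:T - k] @ W[k - 1]                  # predictions for z_{i+k}
        pos = (pred * z[k:]).sum(-1, keepdim=True)   # positive logits, (T-k, 1)
        # negatives drawn uniformly over time steps (a simplification)
        neg_idx = torch.randint(0, T, (T - k, n_negatives))
        neg = torch.einsum('nd,nmd->nm', pred, z[neg_idx])
        logits = torch.cat([pos, neg], dim=-1)
        # the positive sample always sits at index 0
        target = torch.zeros(T - k, dtype=torch.long)
        loss = loss + F.cross_entropy(logits, target)
    return loss / K
```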

Word-level speech2sign-U
Word-level speech2sign-U is more challenging than character-level: the GAN objective in Eq. (6) fails to converge for vocabulary sizes of 100 or larger, apparently due to variability in the audio and video signals. Therefore, we instead adopt a novel GAN-free architecture trained to match marginals between the generated and real probability distributions, as shown in Fig. 1.
Preprocessing. We extract the sign video and speech features as in Section 3.1, with a few modifications: first, we assume that word-level boundaries for both the speech and sign videos are available, which may be ground truth or boundaries detected using unsupervised word segmentation algorithms from phoneme boundaries (Kreuk et al., 2020; Bhati et al., 2021; Cuervo et al., 2022).
Then we compute the segment-level speech features by averaging the frame-level wav2vec 2.0 features within each word. Further, we use I3D (Carreira and Zisserman, 2017) as the local feature extractor and average the pretrained video feature frames within each word-level sign video segment. Lastly, we perform K-means clustering on the segment features and use the output cluster units as inputs $X$ to the speech generator, as we found that quantized speech features work better than continuous features.

Unsupervised unigram matching. As in Section 3.1, we seek to match the probability distributions of the two modalities as our unsupervised training criterion. Instead of the convolutional generator in Eq. (5), we use a linear generator applied independently at each segment $i$, so that the conditionals $G(y|x)$ form a single matrix shared across positions. Eq. (1) can now be approximated by minimizing the $\ell_1$ distance between the empirical positional unigram probabilities of the generated and real sign cluster units:

$$\mathcal{L}_{\mathrm{pos}} = \sum_{i=1}^{L} \left\| \hat{P}_{Y_i} - G^{\top} \hat{P}_{X_i} \right\|_1,$$

where $\hat{P}_{X_i}$ and $\hat{P}_{Y_i}$ are empirical unigram distributions for the speech and sign units at position $i$, and $G \in \mathbb{R}^{|\mathcal{X}| \times |\mathcal{Y}|} := (G(y|x))_{x \in \mathcal{X}, y \in \mathcal{Y}}$. Note that such an objective is typically optimized implicitly by a GAN, but we found that the explicit formulation not only avoids the need for a discriminator but also leads to more stable training and better performance.
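The positional unigram matching loss can be sketched as follows; this is a minimal illustration assuming the generator is a single matrix of logits and that the positional unigram statistics have been precomputed from the unpaired corpora (all names are ours, not from the released code).

```python
import torch

def unigram_matching_loss(G_logits, P_X, P_Y):
    """G_logits: (|X|, |Y|) generator parameters;
    P_X: (L, |X|) empirical speech unigrams per position;
    P_Y: (L, |Y|) empirical sign unigrams per position."""
    G = torch.softmax(G_logits, dim=-1)  # rows form G(y|x)
    P_Y_hat = P_X @ G                    # pushforward of the speech marginals
    return (P_Y_hat - P_Y).abs().sum(-1).sum()  # l1 distance, summed over positions
```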
Unsupervised skip-gram matching. Positional unigram constraints alone may not be sufficient for word-level SSR-U. Therefore, we add additional moment constraints using skip-grams. Define the $k$-step skip-gram to be the joint probability

$$\hat{P}^{k}_{Y}(y, y') := P(Y_i = y, Y_{i+k} = y').$$

Then, applying Eq. (1) again, the skip-grams of the generated and real sign cluster units satisfy

$$\hat{P}^{k}_{\hat{Y}} = G^{\top} \hat{P}^{k}_{X} G.$$

Again, we approximate this constraint by minimizing their $\ell_1$ distance:

$$\mathcal{L}_{\mathrm{skip}} = \sum_{k} \left\| \hat{P}^{k}_{Y} - G^{\top} \hat{P}^{k}_{X} G \right\|_1.$$

The overall loss for word-level speech2sign-U is then

$$\mathcal{L} = \mathcal{L}_{\mathrm{pos}} + \mathcal{L}_{\mathrm{skip}}. \qquad (12)$$

Speech-to-sign retriever. Given a query speech audio (sign video), we would like to retrieve its translation from a database of sign videos (speech audios). To this end, we use the generator to compute a similarity score between each speech sequence $X$ and sign sequence $Y$ as

$$s(X, Y) := -\mathrm{DTW}(G(X), Y),$$

where $\mathrm{DTW}(\cdot, \cdot)$ is the dynamic time warping distance between two feature sequences with cosine distance as the frame-level metric, computed using the DTW library (Giorgino, 2009).
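Under the same assumptions as the unigram sketch above, the skip-gram matching term admits an equally short sketch; the choice of steps $k = 1, 2, 3$ is an illustrative default, not necessarily the setting used in our experiments.

```python
import torch

def skipgram_matching_loss(G_logits, P2_X, P2_Y, steps=(1, 2, 3)):
    """P2_X[k]: (|X|, |X|) empirical joint distribution of speech unit
    pairs (i, i+k); P2_Y[k]: (|Y|, |Y|) likewise for the sign units."""
    G = torch.softmax(G_logits, dim=-1)
    loss = 0.0
    for k in steps:
        # push the speech skip-gram through G on both arguments
        P2_Y_hat = G.T @ P2_X[k] @ G
        loss = loss + (P2_Y_hat - P2_Y[k]).abs().sum()
    return loss
```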

Datasets
The detailed statistics are shown in Table 1.
Fingerspelling LibriSpeech. To extract semantic units from the fingerspelling signs, we trained the visual CPC encoder on a sentence-level fingerspelling dataset constructed from the 960-hour LibriSpeech dataset and the Unvoiced dataset (Nagaraj, 2018). To construct the dataset, we replace each letter in the LibriSpeech transcript with an image of that letter's ASL Alphabet symbol, chosen uniformly at random from Unvoiced. To study the effect of visual variability on SSR-U, we subset the ASL Alphabet images to 100, 300, 500, or 1000 images per letter sign. The dev-clean subset of LibriSpeech is used as the validation set.
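The construction can be sketched as follows; the helper name and input format are hypothetical, not those of our release.

```python
import random

def fingerspell_sentence(transcript, letter_images, images_per_letter=100):
    """Map a LibriSpeech transcript to a fingerspelling image sequence.

    letter_images: dict from 'a'..'z' to lists of ASL Alphabet image paths
    (from Unvoiced); images_per_letter caps the visual variability.
    """
    frames = []
    for ch in transcript.lower():
        if ch.isalpha():
            pool = letter_images[ch][:images_per_letter]
            frames.append(random.choice(pool))  # uniform choice per letter
    return frames
```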
Fingerspelling LJSpeech. We train our character-level model on another sentence-level fingerspelling dataset, constructed from LJSpeech (Ito and Johnson, 2017) and the ASL Alphabet dataset in the same way as fingerspelling LibriSpeech.

ASL LibriSpeech
For word-level SSR-U, we construct another corpus using LibriSpeech for speech and MS-ASL (Joze and Koller, 2019) for word-level sign videos. Since many MS-ASL videos no longer exist on YouTube, only 11.6k out of 25k videos could be downloaded. Further, due to the mismatch in vocabulary size, we use forced-alignment information to filter out LibriSpeech words that don't appear in MS-ASL and keep sentences that are at least 5 words long. Next, for each word in each sentence, we pick a word-level sign video uniformly at random from MS-ASL. To study the effect of vocabulary size on our model, we follow the splits provided by (Joze and Koller, 2019) to subset the data to vocabulary sizes of 100, 200, 500, or 1000.

Overall results
Evaluation metrics. We evaluate the performance of our systems using two metrics. The unit error rate (UER) is the average insertion ($I$), deletion ($D$), and substitution ($S$) error rate between the predicted and true visual cluster units, which may be character- or word-level units depending on the task:

$$\mathrm{UER} = \frac{I + D + S}{N},$$

where $N$ is the number of units in the reference. The other metric, used to evaluate the speech-to-sign (A→V) and sign-to-speech (V→A) retrieval tasks, is recall@k (R@k) ($k = 1, 5, 10$), the percentage of hits in the top $k$ results returned by the retriever.
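Both metrics are straightforward to compute; below is a sketch using the third-party editdistance package, with hypothetical input formats.

```python
import editdistance  # Levenshtein distance over unit sequences

def unit_error_rate(refs, hyps):
    """refs, hyps: lists of reference/predicted unit sequences."""
    errors = sum(editdistance.eval(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

def recall_at_k(ranked_lists, golds, k=10):
    """ranked_lists[i]: database indices ranked by similarity for query i;
    golds[i]: index of the true translation of query i."""
    hits = sum(g in ranked[:k] for ranked, g in zip(ranked_lists, golds))
    return hits / len(golds)
```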
Character-level SSR-U. The character-level results are shown in Table 2. To obtain retrieval results, we trained our own wav2vec-U 2.0 using the code released by the authors; unfortunately, we were unable to reproduce the results reported in their paper. In our ASR-U experiments, wav2vec-U significantly outperforms wav2vec-U 2.0 in terms of both word error rates and retrieval performance. For SSR-U, we compare our models with wav2vec-U (and 2.0) as well as a supervised image and caption retrieval model trained under a ranking-based criterion (Harwath et al., 2018). We replace their original CNN speech encoder with a two-layer MLP with hidden and output sizes of 256 and ReLU activation, and their VGG16 image encoder with a linear image encoder with an output size of 256. We found that our models with 100 and 300 images per letter achieve recall scores superior even to those of the text-based wav2vec-U, but remain about 30% below the supervised topline. Notably, our model performs worse in the A→V direction than in the V→A direction, especially in terms of recall@1. This is perhaps due to significant insertion errors in the generated character sequences, which lead to many false positives during speech-to-sign retrieval.
Word-level SSR-U. The word-level results are shown in Table 3. To establish topline results for error rates and retrieval recall scores, we train a word-level unsupervised speech recognition model, speech2text-U, using the same criterion as speech2sign-U in Eq. (12), except that the sign cluster sequences obtained from clustering word-level sign video features (see Section 3.2) are replaced with the underlying textual word labels as the target random variable $Y$. For the subset with a vocabulary size of 98, we compare our model, which uses unsupervised unigram and skip-gram matching, with wav2vec-U, which uses a JSD GAN for distribution matching; our proposed training method significantly improves the word error rates and the recall scores in both retrieval directions. However, we still observe a large gap in recall between our unsupervised model and the supervised speech-to-image retrieval model (Harwath et al., 2018). The performance of both word-level ASR-U and SSR-U degrades as the vocabulary size increases: the unit error rate (UER) rises from 53.6% to 87.9%, the recall@1 of speech-to-sign (A→V) retrieval decreases from 69.6% to 12.1%, and the recall@1 of sign-to-speech (V→A) retrieval decreases from 71.4% to 10.9% as the vocabulary size grows from 98 to 877. This degradation is much more significant than for character-level SSR-U, because the word modality involves extra morphological complexity on top of the phonological character modality.

Effect of the number of speech clusters. We experiment with speech2sign-U models with speech cluster sizes $|\mathcal{X}|$ equal to 100, 200, and 400, and a model that directly takes raw wav2vec 2.0 features as inputs ($|\mathcal{X}| = \infty$), as shown in Figure 2. We found that the continuous model performs significantly worse than its quantized counterparts, consistent with our observation in Section 3.2.

Effect of training objectives. The effect of different training objectives, including the default speech2sign-U loss ($\ell_1$) in Eq. (12), the maximum mean discrepancy (MMD) GAN and the Jensen-Shannon divergence (JSD) GAN, is shown in Figure 3. For models trained with the MMD and JSD GAN losses, we instead feed the generator outputs to a discriminator with a single convolutional layer while keeping all other settings the same. Our experiment indicates that the GAN-free approach is consistently more stable and accurate than the GAN-based approaches.

Effect of visual features
The effect of visual features is shown in Table 4. We experimented with different types of visual features on ASL LibriSpeech with different vocabulary sizes, such as VGG19 and the pose keypoint features from OpenPose (Cao et al., 2019). For the OpenPose features, we extract the keypoints from each video frame and re-sample each sign video's feature frames to 30 frames to form the segment-level feature. The I3D architecture (Carreira and Zisserman, 2017) significantly outperforms VGG19 and OpenPose as a feature extractor, demonstrating the importance of temporal information for SSR-U. We also found that I3D with optical flow features performs better than I3D with raw RGB inputs for most vocabulary sizes. Further, we found that concatenating the features from the RGB-based and flow-based I3Ds is beneficial for vocabulary sizes 193 and 468 but not when the vocabulary size is too small or too large, even causing training instability for vocabulary size 877.
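The re-sampling step amounts to uniform temporal sub- or super-sampling of the per-frame keypoint features; a minimal sketch, assuming the features are stored as a NumPy array, is shown below.

```python
import numpy as np

def resample_to_fixed_length(feats, target_len=30):
    """feats: (T, d) per-frame keypoint features for one sign segment."""
    idx = np.linspace(0, len(feats) - 1, target_len).round().astype(int)
    return feats[idx]  # (target_len, d), by repeating or skipping frames
```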

Effect of segmentations
The effect of gold and predicted speech segmentations for word-level SSR-U is shown in Table 5. For models trained with phoneme boundaries, we obtain predicted word segmentations using a CPC-based unsupervised segmentation system (Kreuk et al., 2020) with mean-pooled phoneme-level wav2vec 2.0 features as inputs. The convolutional encoder in the original model is replaced by a two-layer MLP with 256 output dimensions, trained on ASL LibriSpeech 100 for 200 epochs. This yields an exact-match boundary F1 of 88%. Using these detected word boundaries, we found about a 20% drop in recall@1 for speech2text-U and an 8-17% relative drop in recall@1, 5, and 10 for speech2sign-U. Still, our model remains much better than the wav2vec-U baseline with ground-truth word boundaries, demonstrating its robustness to segmentation noise.

Effect of word frequencies
We plot the F1 scores of the first 100 word classes, ranked by frequency, in Figure 4. For ASL LibriSpeech 100 and 500, while noisy, it is not hard to observe that the F1 score correlates positively with word frequency in a roughly exponential fashion: starting above 0.55 for the most frequent word, the F1 quickly drops below 0.2 at around the 30th most frequent word. This trend is less conclusive on ASL LibriSpeech 1000, with generally low F1 scores, but the highest F1 scores are still observed for the most frequent words. The trend is also illustrated by the DTW alignment of a speech-video pair correctly retrieved by speech2sign-U in Figure 5. In this example, speech2sign-U mistakes the sign "more" for more frequent signs such as "when" and "have". Additional factors such as visual similarity also play a role in the case of "more" and "when", as both signs involve touching the tips of both hands. Such factors may explain the fluctuations in Figure 4. More error analysis can be found in Appendix A.

Related works
Sign language recognition. One way to bridge sign language and written/spoken language is to build a sign language recognition (SLR) system trained on parallel sign language and text corpora. The earliest attempts tried to recognize fingerspelling gestures using hand-tracking signals from wired gloves (Grimes, 1983; Charayaphan and Marble, 1992). Later works introduced vision, either to correct the errors made by the hand-tracking model or to serve as a cheaper and less intrusive alternative (Tamura and Kawasaki, 1988).
Focusing on the problem of isolated sign recognition and treating it as a classification task, a variety of statistical and deep learning models have been proposed, such as HMMs (Starner and Pentland, 1997), 3D-CNNs (Huang et al., 2015), two-stream inflated 3D (I3D) CNNs (Carreira and Zisserman, 2017; Joze and Koller, 2019), and transformers (Boháček and Hrúz, 2022), among others. To handle multi-sign video sequences, Koller et al. (2016, 2017, 2018, 2019) reformulate the problem as a sequence labeling problem and develop various systems based on 2D-CNN-HMM hybrid models for German sign language recognition. Later works improve the alignment mechanism of previous models using soft DTW (Huang et al., 2018), CTC with DTW constraints (Pu et al., 2019), or pseudo-labeling refinement (Zhou et al., 2019). While some aim to directly use raw RGB images or generic action features like optical flow as inputs (Koller et al., 2016; Huang et al., 2018; Joze and Koller, 2019), others have found domain-specific features like whole-body and hand keypoints to be more reliable and robust (Boháček and Hrúz, 2022). Thanks to the rapid development of the field, there are now many word-level and sentence-level datasets available in different SLs, and we refer to (Joze and Koller, 2019) for a more comprehensive review.
Unsupervised cross-modal alignment. The task of translating between two languages without parallel corpora has been demonstrated between written language pairs (MT-U) and between spoken-written language pairs (ASR-U). (Haghighi et al., 2008) and (Ravi and Knight, 2011; Pourdamghani and Knight, 2017) were respectively the first to treat word-level and sentence-level MT-U as a distribution matching problem, building statistical machine translation systems from nonparallel corpora, which were further improved by (Artetxe et al., 2018b).

Conclusion
In this paper, we propose the task of unsupervised speech-to-sign language recognition and a neural network model, speech2sign-U, capable of both character-level and word-level SSR-U. On various unpaired speech and ASL datasets, our models consistently outperform previous unsupervised models such as wav2vec-U. Further, we found our model reliable to train for a variety of vocabulary sizes and robust against various types of noise in both the speech and visual modalities.

Limitations
Our model currently requires high-quality word boundaries for both speech and sign videos. However, as demonstrated by our preliminary results in Table 5, such limitations can be overcome by incorporating more powerful unsupervised segmentation algorithms into our system. Further, while our dataset is sufficient to model the variability in speech and videos, all experiments to date have assumed that spoken and signed sentences share similar word order, which may not be true of natural spoken and signed communications. A future direction of this research is to develop methods for spoken-sign language pairs with very different syntactic structures. Lastly, the vocabulary sizes in our study of word-level SSR-U are relatively small (<1000), and a promising future direction is to extend the current approach to much larger vocabularies in more diverse conversations.

Ethical considerations
One potential ethical concern for our model is the risk of miscommunication. Due to the small amount of resources used to train our system, it tends to be less accurate than its supervised counterparts, and its mistakes may cause confusion, misunderstanding and other psychological harm to users of the system. The other ethical concern is that the data used to train the system is demographically homogeneous: we have noticed from brief inspection that most of the signers in the ASL datasets are white middle-aged adults. This may lead to worse retrieval accuracy for people underrepresented in the training corpus, such as Black people, children and elderly people.

A.1 Reproducibility checklist
All experiments are done on four 16GB NVIDIA V100 GPUs, and all models are implemented using PyTorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019).
Character-level speech2sign-U. We use exactly the same generator and discriminator architectures as wav2vec-U (Baevski et al., 2021). For the CPC-based fingerspelling feature extractor, we use a two-layer MLP encoder with 256 hidden units, ReLU activation, and 256 output units, and a single-layer LSTM with 256 hidden and output units as the autoregressive predictor. We found 3 prediction steps and 32 negative samples per positive sample to be the best setting for the CPC loss. We train the feature extractor for 60 epochs using the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001, β1 = 0.9, β2 = 0.999, and a batch size of 16. The checkpoint with the highest average next-frame prediction accuracy during training is used for feature extraction. For the K-means clustering, we use FAISS (Johnson et al., 2019) and set the number of clusters to be the same as the vocabulary size. For the GAN training, we train the model for 10000 updates and validate every 1000 updates using the UER metric; we observe similar performance between the best and the last checkpoints for most experiments. We follow the publicly available Fairseq implementation of wav2vec-U (Baevski et al., 2021) for all distributed training, optimizer, and scheduler settings.
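For reference, the K-means quantization step with FAISS can be sketched as follows; the feature file path and the cluster count are placeholders.

```python
import faiss
import numpy as np

feats = np.load('cpc_features.npy').astype('float32')  # (N, d), hypothetical path
kmeans = faiss.Kmeans(d=feats.shape[1], k=26, niter=20, seed=0)  # k = vocab size
kmeans.train(feats)
_, units = kmeans.index.search(feats, 1)  # nearest centroid id per frame
```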
Word-level speech2sign-U. For extracting the optical flow features of sign images, we use the OpenCV implementation of the Dual TV-L1 method and resize all images to 224 × 224. For the OpenPose features, we follow the default settings to extract the pose keypoints, set the keypoint coordinates to 0 when the model fails to detect any keypoints, and normalize the keypoints by the size of the video frame. The I3D models we use are trained on the ImageNet dataset and fine-tuned on the Charades dataset, for both the RGB and flow streams. The same CPC sign encoder as in the character-level experiments is used, except that it takes pretrained video features as inputs and uses the outputs of the MLP encoder rather than those of the LSTM. We then train the CPC sign encoder for 200 epochs on ASL LibriSpeech 1000. The CPC sign encoder features are then quantized into the same number of discrete units as the vocabulary size (100 for ASL LibriSpeech 100, etc.) using K-means implemented in FAISS (Johnson et al., 2019). For the speech feature clustering, we again use the K-means implementation in FAISS (Johnson et al., 2019).
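A minimal sketch of the Dual TV-L1 flow extraction is shown below; it assumes the opencv-contrib-python build and that frame loading and resizing happen elsewhere.

```python
import cv2

# Dual TV-L1 dense optical flow (requires the opencv-contrib build)
tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()

def flow_sequence(frames):
    """frames: list of 224x224 grayscale uint8 images of one sign video."""
    return [tvl1.calc(prev, nxt, None)  # (H, W, 2) flow field per frame pair
            for prev, nxt in zip(frames[:-1], frames[1:])]
```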

A.2 More SSR-U retrieval examples and error analysis
More DTW alignments between speech-video pairs correctly retrieved by speech2sign-U are shown in Figure 6. As we can see, our model is able to correctly align the speech and sign video after the DTW step. To better understand the types of errors the model is susceptible to, we also show the similarity maps before the DTW step in Figure 7. While the similarity maps are noisier than their corresponding DTW alignments, the high-similarity regions are correctly concentrated approximately along the diagonal most of the time. There are, however, several common failure modes for speech2sign-U. The most common mistake is to confuse less frequent words with more frequent ones: for example, the less frequent word "history" with the more frequent words "from" and "outside" in Figure 7d, the less frequent "more" with the more frequent "good" in Figure 7c, or the less frequent "like" with the more frequent "when" and "man" in Figure 7b. Another type of mistake is to confuse visually similar signs, such as "one", "two" and "three" in Figure 7a. The last common type of mistake is to confuse acoustically similar words, such as "they" and "their" in Figure 7c.

Figure 1: Overall architecture of the word-level speech2sign-U. Solid blocks contain trainable parameters while dashed blocks do not.

Figure 5: An example of the DTW alignment by speech2sign-U between a pair of speech and sign video (with its optical flow sequence shown below).

Table 3: Overall speech2sign-U results on ASL LibriSpeech.

Table 5: Effect of the speech segmentation using speech2sign-U on ASL LibriSpeech 100.