Handshape-Aware Sign Language Recognition: Extended Datasets and Exploration of Handshape-Inclusive Method

The majority of existing work on sign language recognition encodes signed videos without explicitly acknowledging the phonological attributes of signs. Given that handshape is a vital parameter in sign languages, we explore the potential of handshape-aware sign language recognition. We augment the PHOENIX14T dataset with gloss-level handshape labels, resulting in the new PHOENIX14T-HS dataset. Two unique methods are proposed for handshape-inclusive sign language recognition: a single-encoder network and a dual-encoder network, complemented by a training strategy that simultaneously optimizes both the CTC loss and frame-level cross-entropy loss. The proposed methodology consistently outperforms the baseline performance. The dataset and code can be accessed at https://github.com/Este1le/slr_handshape.git.


Introduction
Sign languages are primarily the languages of Deaf people. They are the center of Deaf culture and the daily lives of the Deaf community. In the U.S., estimates suggest that between 500,000 and two million people communicate using American Sign Language (ASL), making it the fifth most-used minority language in the country after Spanish, Italian, German, and French (Lane et al., 1996). Natural sign languages, which develop independently and possess unique grammatical structures distinct from surrounding spoken languages, are just as crucial to include in the field of natural language processing (NLP) as any other language, as Yin et al. (2021) advocate.
One direction in sign language processing (SLP) is sign language recognition (SLR), the task of recognizing and translating signs into glosses, the written representations of signs typically denoted by spoken language words. [Footnote 1: Deaf sociolinguist Barbara Kannapell: "It is our language in every sense of the word. We create it, we keep it alive, and it keeps us and our traditions alive." And further, "To reject ASL is to reject the Deaf person."]

Figure 1: An example of a handshape minimal pair in ASL: WHITE and LIKE. Both signs start from an identical handshape but end with a distinct one. In practical scenarios, when a signer signs rapidly, the terminal handshape of LIKE may closely resemble that of WHITE, leading to potential difficulties in differentiation.

Among the array of SLR products, sign gloves, wearable devices that use sensors to track hand movements, are widespread. However, these devices have faced criticism from the Deaf community, primarily due to the social stigma associated with wearing them. This feedback has motivated us to explore video-based SLR, an alternative approach that utilizes cameras to record signs and feed them into the system as video inputs. By doing so, we aspire to foster a more inclusive society, preserving the valuable sign languages at the heart of Deaf culture, and thereby facilitating improved communication between the Deaf and hearing communities.
Signs can be defined by five parameters: handshape, orientation, location, movement, and non-manual markers such as facial expressions. Signs that differ in only one of these parameters form minimal pairs. An example of a handshape minimal pair in ASL is illustrated in Figure 1. As reported by Fahey and Hilger (2022), among all parameters, handshape minimal pairs are identified with the lowest accuracy, only 20%, compared to palm orientation (40%), location (47%), and movement (87%). This indicates the complexity involved in distinguishing handshapes, underscoring their importance in correctly interpreting signs.
The majority of existing research on SLR does not incorporate phonological features such as handshapes into its system designs, with only a few exceptions (Koller et al., 2016; Cihan Camgoz et al., 2017; Koller et al., 2019). Typically, signs are interpreted as a cohesive whole, meaning that an SLR model is expected to correctly recognize all five parameters simultaneously to accurately identify a sign. This constitutes a major distinction between spoken and sign languages: the former is linear, while the latter incorporates both linearity and simultaneity (Hill et al., 2018). This uniqueness introduces considerable challenges to SLR tasks.
The limited interest in integrating handshapes into SLR systems can be attributed largely to the absence of handshape annotations in existing continuous SLR (CSLR) datasets. In response, we have extended one of the most widely used SLR datasets, PHOENIX14T, with handshape annotations sourced from online dictionaries and manual labeling, thus creating the PHOENIX14T-HS dataset. Our hope is that this will facilitate more research into handshape-aware SLR.
Moreover, we introduce two handshape-inclusive SLR networks (Figure 2), designed with either single- or dual-encoder architectures. These proposed models extend the basic SLR network, which does not include handshape information in gloss prediction. Thus, any existing SLR network can adopt our approach, underscoring the adaptability of our methods.
We set a benchmark on the PHOENIX14T-HS dataset with the proposed methods. Our models outperform previous state-of-the-art (SOTA) single-modality SLR networks, which utilize only RGB videos as input and were trained on PHOENIX14T.
Related Work

Various approaches have been proposed to improve SLR system performance.

Multi-stream network
Multi-stream networks use multiple parallel encoders to extract features from distinct input streams. In addition to the RGB stream, Cui et al. (2019) incorporate an optical flow stream, while Zhou et al. (2021b) and Chen et al. (2022b) use key points. Koller et al. (2019) and Papadimitriou and Potamianos (2020) introduce two extra encoders for hand and mouth encoding, directing the system's focus towards critical image areas.

Cross-entropy loss
Training objectives beyond the CTC loss can also be employed. Cheng et al. (2020) and Hao et al. (2021) train their models to also minimize a frame-level cross-entropy loss, with frame-level labels derived from the CTC decoder's most probable alignment.

Handshape-inclusive Datasets
Datasets currently employed for the continuous SLR task, such as RWTH-PHOENIX-Weather 2014T (Camgoz et al., 2018) and CSL-Daily (Zhou et al., 2021a), generally lack handshape annotations, except for RWTH-PHOENIX-Weather 2014 (Koller et al., 2015), which Forster et al. (2014) extended with handshape and orientation labels. The annotation process involved initially labeling the orientations frame-by-frame, followed by clustering within each orientation, and then manually assigning a handshape label to each cluster. Additionally, a subset of 2k signs is annotated using the SignWriting (Sutton and DAC, 2000) annotation system. To facilitate handshape recognition, Koller et al. (2016) introduced the 1-Million-Hands dataset, comprising 1 million cropped hand images from sign videos, each labeled with a handshape. The dataset consists of two vocabulary-level datasets in Danish and New Zealand Sign Language, where handshapes are provided in the lexicon, and a continuous SLR dataset, PHOENIX14, annotated with SignWriting. It also includes 3k manually labeled handshape test images.

Handshape-aware SLR
Research on leveraging handshape labels to support SLR has been relatively scarce. Koller et al. (2016) applied the statistical modelling of Koller et al. (2015) and incorporated a stacked fusion of features from the 1-Million-Hands model and full frames. Cihan Camgoz et al. (2017) and Koller et al. (2019) utilized multi-stream systems, in which two separate streams are built to predict handshapes and glosses, respectively. These two streams are then merged and trained for gloss recognition. The aforementioned studies are all carried out on the PHOENIX14 dataset, made possible by the efforts of Forster et al. (2014), who extended that dataset with handshape labels. Our work instead focuses on the PHOENIX14T dataset.

Datasets
We have enriched the SLR dataset PHOENIX14T by incorporating handshape labels derived from the SignWriting dictionary and manual labeling. In the subsequent sections, we initially present the original PHOENIX14T dataset (3.1) and the SignWriting dictionary (3.2), followed by a detailed description of the updated PHOENIX14T dataset (PHOENIX14T-HS), now featuring handshape labels (3.3).

PHOENIX14T
PHOENIX14T (Camgoz et al., 2018) is one of the few widely used datasets for SLR tasks today. This dataset consists of German Sign Language (DGS) footage aired by the German public TV station PHOENIX in the context of weather forecasts. The corpus comprises DGS videos from 9 different signers, glosses annotated by deaf experts, and translations into spoken German. Key statistics of the dataset are detailed in Table 1.
PHOENIX14T (Camgoz et al., 2018), an extension of PHOENIX14 (Koller et al., 2015), features redefined sentence segmentations and a slightly reduced vocabulary compared to its predecessor. Although Forster et al. (2014) expanded PHOENIX14 with handshape labels, their extended dataset is not publicly accessible and only includes labels for the right hand. In contrast, our annotated data will be released publicly, encompassing handshapes for both hands.

SignWriting
The SignWriting dictionary (Sutton and DAC, 2000; Koller et al., 2013) is a publicly accessible, user-edited sign language dataset encompassing more than 80 distinct sign languages. Adhering to the International SignWriting Alphabet, which prescribes a standard set of icon bases, users represent signs via abstract illustrations of handshapes, facial expressions, orientations, and movements. These depictions can be encoded into XML format and converted into textual descriptions. We utilized the SignWriting parser provided by Koller et al. (2013) to extract handshapes for both hands from the original SignWriting dictionary.
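For illustration only, per-hand handshape extraction from an XML-encoded entry might look like the sketch below. The element and attribute names here are invented for the example; the actual SignWriting XML encoding and the parser of Koller et al. (2013) differ.

```python
import xml.etree.ElementTree as ET

def extract_handshapes(entry_xml):
    """Return (right, left) handshape symbols from a toy XML entry.

    Assumes a made-up layout <sign><hand side="..." shape="..."/></sign>;
    the real SignWriting encoding is considerably more involved.
    """
    root = ET.fromstring(entry_xml)
    shapes = {"right": None, "left": None}
    for hand in root.iter("hand"):
        shapes[hand.get("side")] = hand.get("shape")
    return shapes["right"], shapes["left"]
```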

Handshape-extended PHOENIX14T (PHOENIX14T-HS)
There are 17,947 entries for DGS in the SignWriting dictionary. However, 314 signs/glosses in the PHOENIX14T dataset are either not included or lack handshape annotations in the dictionary (Table 1). This implies that 4,366 of the 7,096 samples in the train set contain signs devoid of handshape labels. We thus manually labeled these 314 signs. This results in the following annotation steps:
1. Look up the SignWriting dictionary.
2. Manually label handshapes for signs not present in SignWriting.
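The two annotation steps amount to a dictionary lookup with a manual fallback. A minimal sketch (the data structures and label strings below are hypothetical, not the released annotation format):

```python
def annotate_handshapes(glosses, signwriting_dict, manual_labels):
    """Assign (right, left) handshape labels to each gloss.

    signwriting_dict: gloss -> (right_hs, left_hs), from the parsed
    SignWriting dictionary.  manual_labels: fallback annotations for
    the 314 glosses missing from (or unannotated in) SignWriting.
    """
    annotations = []
    for gloss in glosses:
        if gloss in signwriting_dict:      # step 1: dictionary lookup
            annotations.append(signwriting_dict[gloss])
        else:                              # step 2: manual labeling
            annotations.append(manual_labels[gloss])
    return annotations
```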
The author, who has a competent understanding of ASL and sign language linguistics but lacks formal training in DGS, annotated while simultaneously watching the corresponding sign video to ensure alignment. The task proved particularly demanding when consecutive gaps (signs missing handshapes) emerged. To delineate the boundaries of these signs, the author resorted to online DGS dictionaries (not SignWriting). The entire manual annotation process took around 30 hours.
Our method contrasts with that of Koller et al. (2016), which applied frame-level handshape annotations; we have instead adopted gloss-level handshape annotations. While the frame-level approach is more detailed, Koller et al. (2016) reported a significant number of blurred frames, making the task of frame-by-frame labeling both challenging and time-intensive. Moreover, given that sign language recognition is essentially a gloss-level recognition task, our aim is to maintain consistency in granularity when integrating handshape recognition as an additional task within the framework. We adopt the Danish taxonomy for handshape labels as in Koller et al. (2016), which includes 60 unique handshapes. The result is the PHOENIX14T-HS dataset, an example of which is shown in Figure 3. Given that all 9 signers in PHOENIX14T are right-hand dominant, it is appropriate to employ the default annotations from SignWriting without side-switching.
Figure 4 presents the frequency distribution of the top 10 most prevalent handshapes for each hand in the PHOENIX14T dataset. It is important to note that a single sign may comprise multiple handshapes. In fact, 13.5% of the signs in the dataset incorporate more than one handshape for the right hand, whereas the left hand employs more than one handshape in 5% of the signs.
Limitations We would like to note certain limitations of the proposed PHOENIX14T-HS dataset. While approximately one-third of the signs were manually labeled with handshapes, the remaining two-thirds were labeled using the user-generated SignWriting dictionary. As a result, these handshape labels may contain noise and should not be seen as curated. When dealing with sign variants, i.e., multiple entries for a single sign, our selection process was random and thus may not necessarily correspond with the sign video.
Moreover, individual signers possess unique signing preferences, leading them to opt for different sign variants. Furthermore, signers might deviate from the dictionary-form signs, resulting in discrepancies between real-world usage and the standardized form. In terms of our labeling process, we omitted handshape labels during the initial and final moments of each video, when the signers' hands are in a resting position. Finally, we did not account for co-articulation, the transition phase between two consecutive signs, in our handshape labeling.

Methods
The task of SLR can be defined as follows. Given an input sign video V = (v_1, ..., v_T) with T frames, the goal is to learn a network that predicts a sequence of glosses G = (g_1, ..., g_L) with L words, monotonically aligned to the T frames, where T ≥ L.
In this section, we start by describing the vanilla SLR network, where handshapes are not provided or learned during training, in Section 4.1. We then introduce the two handshape-inclusive network architectures employed in this study in Section 4.2. Specifically, these networks are designed to predict sign glosses and handshapes concurrently. Finally, we elaborate on our training and pretraining strategies in Sections 4.3 and 4.4.

Vanilla SLR networks
The architecture of the vanilla SLR network is illustrated in Figure 5. Similar to Chen et al. (2022a) and Chen et al. (2022b), we use an S3D (Xie et al., 2018) as the video encoder, followed by a head network, where only the first four blocks of S3D are included to extract dense temporal representations. A gloss classification layer and a CTC decoder are then attached to generate sequences of gloss predictions.

Handshape-inclusive SLR networks

Model I. In comparison to the vanilla network, this model forwards the S3D feature to two additional heads, each tasked with predicting the handshapes for the left and right hand respectively. The loss for this model is computed as:

L = L^G_CTC + λ_L · L^L_CTC + λ_R · L^R_CTC

where L^G_CTC represents the CTC loss of the gloss predictor, and L^L_CTC and L^R_CTC denote the CTC losses for the left and right handshape predictors, weighted by λ_L and λ_R.
Model II. This model employs dual encoders, each dedicated to encoding the representations for glosses and handshapes independently. While both encoders receive the same input (sign videos) and share the same architecture, they are trained with different target labels (gloss vs. handshape). We also incorporate a joint head, which combines the visual representations learned by both encoders to generate gloss predictions. The architecture of this joint head mirrors that of the gloss head and the handshape head. The loss for this model is computed as:

L = L^G_CTC + λ_L · L^L_CTC + λ_R · L^R_CTC + L^J_CTC

where L^J_CTC denotes the CTC loss of the joint gloss predictor.
For this model, we also adopt a late ensemble strategy. This involves averaging the gloss probabilities predicted by the gloss head and the joint head. The averaged probabilities are then fed into a CTC decoder, producing the gloss sequence.
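A minimal sketch of the late ensemble, using a greedy CTC decode for brevity (the actual system feeds the averaged probabilities into a full CTC decoder, and all function and variable names here are ours, not from the released code):

```python
def greedy_ctc_decode(probs, blank=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded, prev = [], None
    for label in best:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

def late_ensemble(gloss_head_probs, joint_head_probs, blank=0):
    """Average frame-level gloss distributions from the two heads,
    then decode the averaged distribution into a gloss sequence."""
    averaged = [[(g + j) / 2 for g, j in zip(gf, jf)]
                for gf, jf in zip(gloss_head_probs, joint_head_probs)]
    return greedy_ctc_decode(averaged, blank)
```

In practice a beam-search CTC decoder would replace the greedy pass; the averaging step is unchanged.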

Training strategy
The CTC loss is computed by taking the negative logarithm of the probability of the correct path, which corresponds to the true transcription. It is a relatively coarse-grained metric because it operates at the gloss level, not requiring temporal boundaries of glosses. Given that handshape prediction could potentially operate at the frame level, it stands to reason to compute the loss at this level as well. However, as the PHOENIX14T-HS dataset does not provide temporal segmentations, we opt to estimate these with the gloss probabilities generated by our models. First, we extract the best path for glosses from a CTC decoder and fill in the blanks with neighboring glosses. After this, if a particular gloss has only one associated handshape, we assign that handshape to all frames within the extent of that gloss. If there is more than one handshape, we gather the handshape probabilities produced by the handshape classifiers within that segment and feed them into a CTC decoder to determine the optimal handshape labels for the frames within that gloss's range. Finally, we calculate the cross-entropy loss between the pseudo-labels and the handshape probabilities. This enables more fine-grained frame-level supervision.
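The single-handshape branch of this pseudo-labeling procedure can be sketched as follows (the multi-handshape branch, which re-runs a CTC decoder within each gloss segment, is omitted, and all names are ours):

```python
def fill_blanks(best_path, blank=0):
    """Replace CTC blanks in the best gloss path with neighboring glosses."""
    filled = list(best_path)
    for i, g in enumerate(filled):
        if g == blank:
            # prefer the previous non-blank gloss, otherwise look ahead
            left = next((filled[j] for j in range(i - 1, -1, -1)
                         if filled[j] != blank), None)
            right = next((g2 for g2 in filled[i + 1:] if g2 != blank), None)
            filled[i] = left if left is not None else right
    return filled

def frame_handshape_labels(best_path, gloss_to_handshape, blank=0):
    """Frame-level handshape pseudo-labels: each frame inherits the single
    handshape of the gloss spanning it (single-handshape glosses only)."""
    return [gloss_to_handshape[g] for g in fill_blanks(best_path, blank)]
```

The resulting per-frame labels are then compared against the handshape classifier's frame-level probabilities with a cross-entropy loss.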
The loss function then becomes the CTC objective above augmented with the two cross-entropy terms:

L = L_CTC + λ^L_CE · L^L_CE + λ^R_CE · L^R_CE

where L_CTC is the model's CTC objective (from Model I or Model II), and L^L_CE and L^R_CE are the cross-entropy losses for the left and right hand, weighted by λ^L_CE and λ^R_CE respectively.

Pretraining
Given that our target dataset is relatively small, it is crucial to pretrain the model to ensure a solid initialization. We first pretrain the S3D encoder on the action recognition dataset Kinetics-400 (Kay et al., 2017), consisting of roughly 300 thousand video clips. Following this, we further pretrain on a word-level ASL dataset, WLASL (Li et al., 2020), which includes 21 thousand videos.

Experiments
In this section, we present the performance of our top-performing model (Section 5.1) and conduct ablation studies (Section 5.2) to analyze the crucial components of our implementation.

Best model
Our highest-performing system utilizes the dual-encoder architecture of Model II. After initial pretraining on the Kinetics-400 and WLASL datasets, we freeze the parameters of the first three blocks of the S3D. For the hyperparameters, we set λ_L and λ_R to 1, while λ^L_CE and λ^R_CE are set to 0.05. The initial learning rate is 0.001, and Adam is used as the optimizer.

Model variants
In our analysis, we contrast our proposed model variants, Model I and Model II (discussed in Section 4.2), with the vanilla SLR network (described in Section 4.1). Additionally, we compare models that feature solely a right handshape head against those equipped with two heads, one for each hand. An extended variant, Model II+, which adds two handshape heads to the gloss encoder, is also considered in our experimentation.
As demonstrated in Table 3, Model II outperforms the other variants. Note that the experimental setup differs from that in Table 2: here, we unfreeze all parameters in S3D and exclude the optimization of the cross-entropy loss.

Pretraining for gloss encoder
We delve into optimal pretraining strategies for the S3D encoder that is coupled with a gloss head, conducting experiments with Model I, as shown in Table 4. We contrast the efficacy of four distinct pretraining methodologies: (1) pretraining solely on Kinetics-400; (2) sequential pretraining, first on Kinetics-400, then on WLASL; (3) three-tiered pretraining on Kinetics-400, then WLASL, and finally on handshape prediction by attaching two handshape heads while deactivating the gloss head; and (4) a similar three-stage process, but focusing on gloss prediction in the final step.

Pretraining for handshape encoder
Table 5 outlines various pretraining strategies adopted for the handshape encoder in Model II. The results pertain to right-handshape predictions on the PHOENIX14T-HS dataset. Both Kinetics and WLASL are employed for gloss prediction, as they lack handshape annotations. We also test the 1-Million-Hands dataset (Koller et al., 2016) for pretraining purposes. This dataset comprises a million images cropped from sign videos, each labeled with a handshape. To adapt these images for S3D, we duplicate each image 16 times, creating a 'static' video. Furthermore, we experiment with two input formats: the full frame and the right-hand clip. As indicated in Table 5, both pretraining and full-frame input significantly outperform their counterparts.
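Repeating each cropped hand image into a 'static' clip is a one-line operation along a new temporal axis. A sketch with NumPy, where the (H, W, C) array layout is our assumption:

```python
import numpy as np

def image_to_static_clip(image, num_frames=16):
    """Repeat a single H x W x C image into a (num_frames, H, W, C) clip
    so that an image dataset can be fed to a video encoder such as S3D."""
    return np.repeat(image[np.newaxis, ...], num_frames, axis=0)
```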

Frozen parameters
We evaluate the impact of freezing varying numbers of blocks within the pretrained S3D encoders in Model II. The results are presented in Table 6.

Cross-entropy loss
In Table 7, we investigate the computation of the cross-entropy loss on handshape predictions using pseudo-labels obtained via two methods: Ensemble and HS. The former gathers pseudo-labels as outlined in Section 4.3, while the latter applies a CTC decoder to the handshape probabilities from the handshape head to produce pseudo-labels. We also examine a hybrid approach (Ensemble, HS), which sums the losses from both methods. In addition, we tune the weights λ^L_CE and λ^R_CE in Equation 3, setting them to the same value.

Conclusions
In this work, we introduce the concept of handshape-aware SLR, enriching this area of research by offering a handshape-enriched dataset, PHOENIX14T-HS, and proposing two distinctive handshape-inclusive SLR methods. Our methodologies maintain orthogonality with existing SLR architectures, delivering top performance among single-modality SLR models. Our goal is to draw increased attention from the research community toward the integration of sign languages' phonological features within SLR systems. Furthermore, we invite researchers and practitioners in NLP to contribute to the relatively nascent and challenging research area of SLP, thus fostering a richer understanding from linguistic and language modeling perspectives.
In future work, we would like to explore three primary avenues. (1) Extension of multimodal SLR models. This involves expanding multi-modality SLR models, which use inputs of various modalities such as RGB, human body key points, and optical flow, to become handshape-aware. This approach holds potential because different streams capture distinct aspects of sign videos, supplying the system with richer information. (2) Contrastive learning. Rather than using handshape labels as supervision, they can be employed to generate negative examples for contrastive learning. This can be achieved by acquiring the gloss segmentation from the CTC decoder and replacing a sign in the positive examples with its counterpart in a handshape minimal pair. The resulting negative examples would be particularly challenging for the model to distinguish, thereby aiding in the development of better representations. (3) Data augmentation. Alternatively, the same minimal-pair substitution used to create negative examples could be applied to increase the volume of training data.

Limitations
Noisy labels: As highlighted in Section 3, the handshape labels we create might be noisy, since two-thirds of them are from a user-edited online dictionary. Additionally, these labels may not correspond perfectly to the sign videos due to variations among signers and specific signs.
Single annotator: Finding DGS experts to serve as annotators proved challenging, as did obtaining multiple annotators and measuring inter-annotator agreement.

Single parameter:
The dataset used in our study does not account for other sign language parameters, including orientation, location, movement, and facial expressions. Moreover, these parameters are not explicitly incorporated as subsidiary tasks within our SLR methodologies.
Single dataset: We only extend a single dataset with handshape labels. It remains to be seen whether the methods we propose will prove equally effective on other datasets featuring different sign domains or sizes.

Figure 2 :
Figure 2: We propose two handshape-inclusive SLR network variants. Model I employs a single video encoder, while Model II implements both a gloss encoder and a handshape (HS) encoder, applying a joint head to the concatenated representations produced by the two encoders.

Figure 2
Figure 2 depicts our two proposed handshape-inclusive SLR network variants, which are expansions of the vanilla network. Both variants explicitly utilize handshape information by training the network to predict handshapes alongside glosses.

Table 1 :
Statistics of PHOENIX14T. In the train set, 299 signs are absent from SignWriting (SW), and 15 signs lack handshape (HS) annotations in SW, which indicates that 4,366 samples include signs without handshape annotations. We thus manually annotated the combined total of 314 (299+15) signs.

Table 2 :
Comparison with previous work on SLR on PHOENIX14T, evaluated by WER. The previous best results are underlined. Methods marked with * denote approaches that utilize multiple modalities besides RGB videos, such as human body key points and optical flow. Notably, our best model (Model II) achieves the lowest WER among single-modality models.

As shown in Table 3, Model II outperforms Model I and Model II+.

Table 3 :
Performance comparison of model variants. Models marked with * are variants that employ only the right handshape head. Please note that the experimental setup differs from that of the HS-SLR model presented in Table 2.

Table 4 :
Performance of Model I with different pretraining strategies. HS pretrains the model on predicting handshapes only (the gloss head is deactivated). Gloss pretrains the model on predicting glosses only (the handshape heads are deactivated).

Table 6 :
Performance of Model II with various frozen blocks in both S3D encoders.

Table 7 :
Performance of Model II with varying weights and methods for acquiring pseudo-labels used in the cross-entropy loss calculation.