WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language

Signed Language Processing (SLP) concerns the automated processing of signed languages, the main means of communication of Deaf and hearing-impaired individuals. SLP features many different tasks, ranging from sign recognition to translation and production of signed speech, but has been overlooked by the NLP community thus far. In this paper, we bring to attention the task of modelling the phonology of sign languages. We leverage existing resources to construct a large-scale dataset of American Sign Language signs annotated with six different phonological properties. We then conduct an extensive empirical study to investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties. We find that, despite the inherent challenges of the task, graph-based neural networks that operate over skeleton features extracted from raw videos can succeed at the task to a varying degree. Most importantly, we show that this performance persists even on signs unobserved during training.


Introduction
Around 200 languages in the world are signed rather than spoken, featuring their own vocabulary and grammatical structures. For example, American Sign Language (ASL) is not a mere translation of English into signs and is unrelated to British Sign Language (BSL). Their non-textual nature introduces many challenges to their automated processing, compared with purely textual NLP. Research on Sign Language Processing (SLP) encompasses tasks such as sign language detection, i.e. recognising if and which signed language is performed (Moryossef et al., 2020), and sign language recognition (SLR) (Koller, 2020), i.e. the identification of signs either in isolation or in continuous speech. Other tasks concern the translation from signed to spoken (or written) language (Camgoz et al., 2018) or the production of signs from text (Rastgoo et al., 2021). With the recent success of deep learning-based approaches in computer vision (CV), as well as advancements in the (from the CV perspective) related tasks of action and gesture recognition (Asadi-Aghbolaghi et al., 2017), SLP is gaining more attention in the CV community (Zheng et al., 2017). Due to the complexity of the tasks, some recent approaches to various SLP tasks implicitly rely on phonological features (Tornay, 2021; Metaxas et al., 2018; Gebre et al., 2013; Tavella et al., 2021). Surprisingly, however, little work has been carried out on explicitly modelling the phonology of signed languages. This presents a timely opportunity to investigate signed languages from the perspective of computational linguistics (Yin et al., 2021). In the context of signed languages, phonology typically distinguishes between manual features, such as usage, position and movement of hands and fingers, and non-manual features, such as facial expressions. Sign language phonology is a mature field with well-developed theoretical frameworks (Liddell and Johnson, 1989; Fenlon et al., 2017; Sandler, 2012). These phonological features, or phonemes, are drawn from a
fixed inventory of possible configurations, which is typically much smaller than the vocabulary of signed languages (Borg and Camilleri, 2020). For example, only a limited number of fingers can be used to perform a sign due to anatomical constraints. Hence, different signs share phonological properties, and well-performing classifiers can be used to predict those properties for signs unseen during training. This potentially holds even across different languages because, while different languages may dictate different combinations of phonemes, there are also significant overlaps (Tornay et al., 2020).
Finally, these phonological properties have strong discriminatory power when determining signs. For example, in ASL-Lex (Caselli et al., 2017), a lexicon which also captures phonological information, the authors report that more than 50% of its 994 described signs have a unique combination of only six phonological properties, and more than 80% of the signs share their combination with at most two other signs. By relying on this phonological information from resources such as ASL-Lex, many signs can be uniquely determined. This means that well-performing classifiers can leverage this information to predict signs without having encountered them during training, a capability that current data-driven approaches to SLR lack by design (Koller, 2020). Thus, in combination, mature approaches to phonology recognition can facilitate the development of sign language resources, for example by providing first-pass silver annotations for new sign languages based on their phonological properties. This matters both for documenting low-resource sign languages and for the rapid development of large-scale datasets, and for fully harnessing data-driven CV approaches.
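The discriminatory power of combined properties can be illustrated with a minimal sketch: given a sign inventory annotated with six properties, count how many signs have a unique combination. The signs and property values below are toy examples, not taken from ASL-Lex.

```python
from collections import Counter

# Toy inventory: each sign maps to a tuple of six phonological properties
# (flexion, major location, minor location, movement, selected fingers, sign type).
# These annotations are illustrative only.
signs = {
    "BOOK":   ("flat", "neutral", "neutral", "rotation", "all", "symmetrical"),
    "MILK":   ("closed", "neutral", "neutral", "close", "all", "one-handed"),
    "MOTHER": ("open", "head", "chin", "contact", "all", "one-handed"),
    "FATHER": ("open", "head", "forehead", "contact", "all", "one-handed"),
    "CHEESE": ("flat", "neutral", "neutral", "rotation", "all", "symmetrical"),
}

combo_counts = Counter(signs.values())

# Signs whose full six-property combination is unique in the inventory
unique_signs = [g for g, combo in signs.items() if combo_counts[combo] == 1]
print(sorted(unique_signs))  # ['FATHER', 'MILK', 'MOTHER']
```

Here BOOK and CHEESE share all six properties and cannot be distinguished by phonology alone, while the other three signs are uniquely determined.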
To spur research in this direction, we extend the preliminary work by Tavella et al. (2021) and introduce the task of Phonological Property Recognition (PPR). More specifically, with this paper we contribute (i) WLASL-Lex2001, a large-scale, automatically constructed PPR dataset, (ii) an analysis of the dataset quality, and (iii) an empirical study of the performance of different deep learning-based baselines thereon.

Methodology
We address PPR as a classification problem based on features extracted from videos of people signing. Although manual annotation approaches are widely adopted, they are time-consuming and require expert knowledge. Instead, we rely on automated dataset construction. At a high level, we cross-reference a large-scale ASL SLR dataset with an ASL lexicon and annotate videos of signs with their corresponding phonological properties. We then extract skeletal features by taking advantage of pre-trained deep models from the computer vision community (Rong et al., 2021; Wang et al., 2019). Finally, we train several deep models to classify these features into phonological classes.

Dataset construction
As previously mentioned, ASL-Lex (Caselli et al., 2017) contains phonological features of American Sign Language, such as where a sign is executed, the movement performed by the hand, and the number of hands and fingers involved. These properties were coded by three people versed in ASL. In our work, we are interested in recognising phonological properties from videos of people signing in ASL. Consequently, we aim to construct a dataset suitable for supervised learning, containing videos labelled with six phonological properties. Specifically, we choose the manual properties with the strongest discriminatory power to determine signs based on their configuration (Caselli et al., 2017): (i) flexion: aperture of the selected fingers of the dominant hand at sign onset, (ii) major location: general location of the dominant hand at sign onset, (iii) minor location: specific location of the dominant hand at sign onset, (iv) movement: the first movement path of the sign, (v) selected fingers: fingers that are moving or are foregrounded during that movement, and (vi) sign type: symmetry of the hands according to Battison (1978).
A detailed description of all properties is provided in the appendix. One limitation of ASL-Lex is the small number of examples and lack of variety: its first iteration (ASL-Lex 1.0) contains fewer than 1000 videos, all signed by the same person. While sufficient for educational purposes, these videos are of limited suitability for developing robust classifiers that can capture the diversity of ASL signers (Yin et al., 2021). To this end, we source videos from WLASL (Word-Level ASL; Li et al., 2020), one of the largest available SL datasets, featuring more than 2000 glosses demonstrated by over 100 people, for a total of more than 20000 videos. Each sign is performed by at least three different signers, which implies greater variability compared to having each gloss performed by only one person. By cross-referencing ASL-Lex and WLASL2000 based on corresponding glosses, we can increase the number of samples available to train our models.
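The cross-referencing step can be sketched as a join on glosses: keep only WLASL videos whose gloss also has an ASL-Lex entry, and attach the phonological annotation to each video. The records and field names below are hypothetical, chosen for illustration.

```python
# Hypothetical ASL-Lex entries: gloss -> phonological annotation.
asl_lex = {
    "BOOK": {"sign_type": "symmetrical", "movement": "rotation"},
    "MILK": {"sign_type": "one-handed", "movement": "close"},
}

# Hypothetical WLASL records: (video_id, gloss) pairs; several videos per gloss.
wlasl_videos = [
    ("v001", "BOOK"), ("v002", "BOOK"), ("v003", "MILK"), ("v004", "HOUSE"),
]

# Join on gloss: videos without an ASL-Lex entry are dropped.
dataset = [
    {"video": vid, "gloss": gloss, **asl_lex[gloss]}
    for vid, gloss in wlasl_videos
    if gloss in asl_lex
]
print(len(dataset))  # 3 labelled videos; "HOUSE" has no ASL-Lex entry
```

Each retained video thus inherits all six property labels of its gloss, which is what makes the automated annotation possible without per-video manual coding.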
Finally, to leverage state-of-the-art SLR architectures that operate over structured input, we enrich each raw video with extracted keypoints that represent the joints of the signer. To do so, we use two pre-trained models, FrankMocap (Rong et al., 2021) and HRNet (Wang et al., 2019). While these tracking algorithms follow different paradigms, the former extracting 3D coordinates based on a predicted human model and the latter predicting keypoints as coordinates from videos directly, they produce similar outputs. An important distinction is that while FrankMocap estimates 3D keypoints, HRNet outputs 2D keypoints with associated prediction confidence scores. We use these different models to explore whether different tracking algorithms affect the recognition of phonological classes. We select a subset of features of the upper body, namely: nose, eyes, shoulders, elbows, wrists, thumbs and first/last knuckles of the fingers. These manual features were determined to be the most informative for sign language recognition (Jiang et al., 2021b).
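Selecting the upper-body subset amounts to index-slicing the tracker output. A minimal sketch follows; the array layout (frames, keypoints, channels) and the index list are assumptions for illustration, since the actual indices depend on the tracker's keypoint ordering.

```python
import numpy as np

# Illustrative indices of upper-body keypoints (nose, eyes, shoulders, elbows,
# wrists, ...) in the tracker's output ordering -- NOT the real FrankMocap/HRNet
# indices, which depend on each model's keypoint convention.
UPPER_BODY = [0, 1, 2, 5, 6, 7, 8, 9, 10]

def select_upper_body(skeleton: np.ndarray) -> np.ndarray:
    """Keep only the upper-body keypoints used as model input.

    skeleton has shape (frames, num_keypoints, channels), where channels is
    (x, y, z) for a 3D tracker or (x, y, confidence) for a 2D one.
    """
    return skeleton[:, UPPER_BODY, :]

frames = np.random.rand(150, 42, 3)     # 150 frames, 42 tracked keypoints
print(select_upper_body(frames).shape)  # (150, 9, 3)
```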
Our final dataset, WLASL-Lex2001 (WLASL2000 + ASL-Lex 1.0), comprises 10017 videos corresponding to 800 glosses, each paired with extracted skeletons (x, y, z coordinates from FrankMocap; x, y coordinates and a confidence score from HRNet) and labelled with the corresponding phonological properties. A characteristic of this dataset is that it follows a long-tailed distribution: due to the nature of language, some phonological properties are more common than others, which means that some classes are better represented than others. On the one hand, the training setup for our models should take this factor into account; on the other hand, the advantage of training over phonological classes instead of glosses is that different glosses can share phonological classes.

Models
To estimate the complexity of the dataset, we use a majority-class baseline and a Multi-Layer Perceptron (MLP) as basic models. We further use Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks as models capable of capturing the temporal component of videos. As state-of-the-art architectures that have been used to perform SLR, we use the I3D 3D Convolutional Neural Network (Carreira and Zisserman, 2017; Li et al., 2020), which learns from raw videos, and the Spatio-Temporal Graph Convolutional Network (STGCN) (Jiang et al., 2021b), which captures both spatial and temporal components from the extracted keypoints.

Experimental Setup
For each phonological property, we generate dataset splits and train dedicated models separately. While a multi-class multi-label approach could achieve higher scores by relying on potential interdependencies of different properties, we choose to model the properties in isolation to disentangle the factors that affect the learnability of each property. From now on, when we mention the dataset, we refer to an instance of the WLASL-Lex2001 dataset where the labels are the values of a single phonological class.
We make this distinction because we produce six different train, validation and test splits (with a 70:15:15 ratio), stratifying on the corresponding phonological property (Phoneme). By doing so, we make sure that (a) all splits contain all possible labels for a classification target (i.e. phonological property) and (b) all splits follow the same label distribution. Since we source the videos from WLASL, we have multiple videos representing each gloss; therefore, randomly splitting our data means that glosses in the test set might also appear in the training set, signed by a different person. Thus, to investigate how well the models can predict properties of unseen glosses, we also produce label-stratified splits at the gloss level (Gloss), such that videos of glosses in the validation and test sets do not appear in the training data and vice versa. To summarise, experiments in the Phoneme setting evaluate the capability to recognise phonological properties of signs that were already encountered in the training data but are performed by a different signer in the test set. Conversely, experiments in the Gloss setting evaluate the capability to recognise phonological properties of signs completely unseen during training.
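The key constraint of the Gloss setting, that no gloss crosses split boundaries, can be sketched by partitioning the set of glosses first and then assigning videos. This is a simplified sketch: it splits at the gloss level only and omits the label stratification described above; the record format is hypothetical.

```python
import random

def gloss_level_split(records, train_frac=0.7, val_frac=0.15, seed=0):
    """Split (video_id, gloss) records so that each gloss appears in exactly
    one of train/val/test. Does NOT stratify on the phonological label."""
    glosses = sorted({g for _, g in records})
    random.Random(seed).shuffle(glosses)
    n_train = int(len(glosses) * train_frac)
    n_val = int(len(glosses) * val_frac)
    train_g = set(glosses[:n_train])
    val_g = set(glosses[n_train:n_train + n_val])
    test_g = set(glosses[n_train + n_val:])
    pick = lambda keep: [r for r in records if r[1] in keep]
    return pick(train_g), pick(val_g), pick(test_g)

# 50 hypothetical videos covering 10 glosses, 5 videos per gloss
records = [(f"v{i}", f"gloss{i % 10}") for i in range(50)]
train, val, test = gloss_level_split(records)

# No gloss leaks between training and evaluation splits
assert {g for _, g in train}.isdisjoint({g for _, g in test})
```

The Phoneme setting, by contrast, would shuffle videos directly and stratify on the property value, so the same gloss can appear on both sides performed by different signers.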
We use an I3D model that has been pre-trained on Kinetics-400 (Carreira and Zisserman, 2017) and fine-tune it on raw videos from our dataset. The other models are trained from scratch using keypoints as input. We fix the length of all inputs to 150 frames: longer sequences are truncated, while shorter sequences are looped to reach the fixed length. We select the best-performing model based on performance on the validation set; for the final test set performance, we train the models on both the train and validation sets. For more details on model selection, consult the appendix. We measure both accuracy, to investigate how well models perform in general, and class-balanced accuracy, to take into account how well they are able to model the different classes of the phonological properties.

Table 1: Accuracy and per-class averaged accuracy (A) of various models on the test sets of the six tasks. For accuracy, we report the error margin as a confidence interval at α = 0.05 using an asymptotic normal approximation. We omit error margins for balanced accuracy, as the low number of classes results in a small sample size. Additional performance measures are reported in the appendix.
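The fixed-length preprocessing described in the setup (truncate long sequences, loop short ones to 150 frames) can be sketched as follows; the array layout is an assumption for illustration.

```python
import numpy as np

TARGET_LEN = 150  # fixed input length in frames, as in the experimental setup

def fix_length(seq: np.ndarray) -> np.ndarray:
    """Truncate sequences longer than TARGET_LEN and loop shorter ones.

    seq has shape (frames, ...), e.g. (frames, keypoints, channels).
    """
    if len(seq) >= TARGET_LEN:
        return seq[:TARGET_LEN]
    reps = -(-TARGET_LEN // len(seq))  # ceiling division
    return np.concatenate([seq] * reps)[:TARGET_LEN]

short = np.zeros((40, 9, 3))   # looped: 40 -> 160 -> truncated to 150
long_ = np.zeros((200, 9, 3))  # truncated to 150
print(fix_length(short).shape, fix_length(long_).shape)  # (150, 9, 3) twice
```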

Results and discussion
The upper half of Table 1 presents the results for the six dataset splits in the Phoneme setting, where glosses in the test data may have appeared in the training data as well. The poor performance of the simple MLP architecture suggests that the tasks are in fact challenging and do not exhibit easily exploitable regularities. Due to its simplicity, it barely reaches the baseline for some properties (34% vs. 35% and 44% vs. 50% for movement and flexion, respectively). In particular, the MLP classifying based on FrankMocap output (MLP_F) is often the worst-performing combination. Conversely, the STGCN using HRNet output (STGCN_H) outperforms the other models on all six tasks. In some cases, for example when predicting movement or flexion, it is the only model that significantly surpasses the majority-class baseline. This superior performance is expected, as this specific combination of the STGCN operating over HRNet-extracted keypoints has been shown to be the largest contributor to SLR performance on the WLASL2000 dataset (Jiang et al., 2021a). Models that operate over structured input often outperform the 3D CNN, demonstrating the utility of the additional information provided by the skeleton features. The results also suggest that models using the HRNet skeleton output outperform those using FrankMocap, possibly due to the confidence scores that HRNet associates with the coordinates. This difference in performance suggests a more rigorous study of the impact of different feature extraction methods as a possible future research direction.
The lower half of Table 1 shows the performance of the models when predicting the phonological properties of unseen glosses (Gloss). Performance deteriorates for all tasks and all models, suggesting that their success partly derives from exploiting similarities between glosses that appear in both training and test data. However, the best model, STGCN_H, performs comparably to the Phoneme split, with a drop of less than 10 accuracy points for five of the six tasks.
Crowd-sourced (Polonio et al., 2018) or automatically constructed datasets such as ours often have a performance ceiling, possibly due to incorrectly assigned ground-truth labels or low-quality input data (Chen et al., 2016; Schlegel et al., 2020). To investigate the former, we measure the agreement on videos that all models misclassify using Fleiss' κ. Intuitively, if models consistently agree on a label different from the ground truth, the ground-truth label might be wrong. We find that, averaged across the six tasks, the agreement is negligible: 0.09 ± 0.06 and 0.11 ± 0.09 for the Phoneme and Gloss splits, respectively.
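Fleiss' κ treats the models as raters and measures how much their agreement on the misclassified videos exceeds chance. A minimal self-contained implementation, assuming every item is rated by the same number of raters:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of
    raters. `ratings` is a list of per-item label lists (here: the labels
    that the different models predicted for one misclassified video)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = {c for item in ratings for c in item}
    p_i = []             # per-item agreement
    totals = Counter()   # overall category counts
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        p_i.append((sum(c * c for c in counts.values()) - n_raters)
                   / (n_raters * (n_raters - 1)))
    p_bar = sum(p_i) / n_items                                       # observed agreement
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)  # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Two videos, three models each; unanimous but different labels -> kappa = 1
print(fleiss_kappa([["a", "a", "a"], ["b", "b", "b"]]))  # 1.0
```

Values near 0, as reported above, mean the models' errors are essentially uncorrelated, which is evidence against systematic label noise.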
Similarly, for the latter, if all models consistently fail to assign any correct label to a given video (e.g. all models err on a video appearing in the test sets of movement and flexion), this can hint at low quality of the input, making it impossible to predict anything correctly. We find that this is not the case for WLASL-Lex2001, as videos appearing in the test sets of different tasks tend to have a low mutual misclassification rate: 1% and 0.7% of videos appearing in the test sets of two and three tasks, respectively, were misclassified by all models for all associated tasks in the Phoneme split. For the Gloss split, the numbers are 3% and 0% for two and three tasks, respectively. Together, these observations suggest that the models presented in this paper are unlikely to have reached the performance ceiling on WLASL-Lex2001, and more advanced approaches could obtain even higher accuracy scores.

Conclusion
In this paper, we discuss the task of Phonological Property Recognition (PPR). We automatically construct a dataset for the task featuring six phonological properties and analyse it extensively. We find that there is potential for improvement over the presented data-driven baseline approaches. Researchers pursuing this direction can focus on developing better-performing models, for example by jointly learning all properties, as labels for different properties can be mutually dependent.
Another possible avenue is to investigate the feasibility of using PPR to perform tokenisation of continuous signing by decomposing it into multiple phonemes, which has been identified as one of the big challenges of SLP (Yin et al., 2021).

A Hyperparameter optimisation
Table 2 contains all the hyperparameters explored during our experiments for each model. For the STGCN, we use the hyperparameters chosen by Jiang et al. (2021a), because initial experiments on our data showed a difference of at most 2% accuracy, which is within the uncertainty estimate. To find the optimal hyperparameters for the other models, we perform Bayesian optimisation over a pre-defined set, selecting the model that maximises the Matthews correlation coefficient (MCC) (Matthews, 1975), MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), with TP, TN, FP, FN being true/false positives/negatives, on the validation sets of all six tasks. We choose MCC as it provides a good trade-off between overall and class-level accuracy, which is necessary due to the imbalance inherently present in our dataset.
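The binary form of MCC used for model selection is a direct translation of the formula above (the paper's tasks are multi-class; this sketch shows the two-class case for illustration):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion counts.
    Returns 0.0 when the denominator vanishes, a common convention."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(2, 3, 1, 1))  # 5/12 ~ 0.4167
```

MCC ranges from -1 (total disagreement) through 0 (chance-level) to +1 (perfect prediction), and unlike plain accuracy it stays near 0 for a majority-class predictor on imbalanced data.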

C Phonological classes description
Tables 4 to 9 describe in detail the meaning of the values for all phonological classes according to ASL-Lex (Caselli et al., 2017).
The cardinality is calculated on WLASL-Lex, which is why some classes that are in ASL-Lex are not represented (i.e. have a cardinality equal to 0).

D Additional results
Table 10 reports additional results for several different metrics. In particular, we report micro- and macro-averaged precision/recall and the Matthews correlation coefficient. These metrics give a better understanding of the classification results, as they are more sensitive to data imbalance than accuracy.

Figure 1 :
Figure 1: We annotate ASL sign videos with their corresponding phonological information and skeleton features of the speakers, and train neural networks to recognise the former from the latter.

Table 2 :
Set of explored hyperparameters for each model.

Table 3 illustrates the performance on the test set for each model with respect to random chance, as measured by training 5 models from different random seeds. The performance differences are negligible, suggesting that model training is largely stable with respect to random initialisation.

Table 3 :
Mean and standard deviation of the accuracy of all architectures trained with the HRNet output, measured on the sign type test set and averaged over 5 different random seeds. Results for the 3D CNN are obtained from the validation set.

Table 4 :
Values and relative definitions for selected fingers

Table 5 :
Values and relative definitions for major location

Table 6 :
Values and relative definitions for flexion

Table 7 :
Values and relative definitions for minor location. Recoverable rows (value: definition, with its count in WLASL-Lex):
(…): on the front of the fingers of the non-dominant hand (99)
PalmBack: Sign is produced on the back of the palm of the non-dominant hand (218)
FingerBack: Sign is produced on the back of the fingers of the non-dominant hand (186)
FingerRadial: Sign is produced on the radial side of the non-dominant hand (410)
FingerUlnar: Sign is produced on the ulnar side of the non-dominant hand (40)
FingerTip: Sign is produced on the tip of the fingers of the non-dominant hand (158)
Heel: Sign is produced on the heel of the non-dominant hand (88)
Other: Sign is produced in an unspecified location on the body (707)
Neutral: Sign is not produced on or near the body (3390)

Table 8 :
Values and relative definitions for sign type. Recoverable definitions (value names lost in extraction):
(…): Only the dominant hand moves
(…): The location and orientation of the hands may differ, and the other specifications of handshape are not the same
(…): Non-dominant hand must be an unmarked handshape (B A S 1 C O 5)

Table 9 :
Values and relative definitions for movement