Including Facial Expressions in Contextual Embeddings for Sign Language Generation

State-of-the-art sign language generation frameworks lack expressivity and naturalness, a result of focusing only on manual signs and neglecting the affective, grammatical, and semantic functions of facial expressions. The purpose of this work is to augment the semantic representation of sign language through grounding facial expressions. We study the effect of modeling the relationship between text, gloss, and facial expressions on the performance of sign generation systems. In particular, we propose a Dual Encoder Transformer able to generate manual signs as well as facial expressions by capturing the similarities and differences found in text and sign gloss annotation. We take into account the role of facial muscle activity in expressing intensities of manual signs by being the first to employ facial action units in sign language generation. We perform a series of experiments showing that our proposed model improves the quality of automatically generated sign language.


Introduction
Communication between Deaf and Hard of Hearing (DHH) people and hearing non-signing people may be facilitated by emerging language technologies. DHH individuals are medically underserved worldwide (McKee et al., 2020; Masuku et al., 2021) due to the lack of doctors who can understand and use sign language. Educational resources available in sign language are also limited, especially in STEM fields (Boyce et al., 2021; Lynn et al., 2020). Although the Americans with Disabilities Act (United States Department of Justice, 2010) requires government services, public accommodations, and commercial facilities to communicate effectively with DHH individuals, the reality is far from ideal. Sign language interpreters are not always available, and communicating through text is not always feasible as written languages are completely different from signed languages.
In contrast to Sign Language Recognition (SLR), which has been studied for several decades (Rastgoo et al., 2021) in the computer vision community (Yin et al., 2021), Sign Language Generation (SLG) is a more recent and less explored research topic (Quandt et al., 2021; Cox et al., 2002; Glauert et al., 2006).
Lacking a rich, grounded semantic representation, existing SLG frameworks are far from generating understandable and natural sign language. Sign languages use spatiotemporal modalities and encode semantic information in manual signs as well as in facial expressions. A major focus in SLG has been put on manual signs, neglecting the affective, grammatical, and semantic roles of facial expressions. In this work, we bring insights from computational linguistics to study the role of facial expressions in automated SLG. Apart from using facial landmarks encoding the contours of the face, eyes, nose, and mouth, we are the first to explore the use of facial Action Units (AUs) to learn semantic spaces and representations for sign language generation.
In addition, with insights from multimodal Transformer architecture design, we present a novel model, the Dual Encoder Transformer for SLG, which takes as input spoken text and glosses, computes the correlation between both inputs, and generates skeleton poses with facial landmarks and facial AUs. Previous work used either gloss or text to generate sign language, or used text-to-gloss (T2G) prediction as an intermediary step (Saunders et al., 2020). Our model architecture, on the other hand, allows us to capture information otherwise lost when using gloss only, and captures differences between text and gloss, which is especially useful for recovering adjectives otherwise lost in gloss annotation. We perform several experiments using the PHOENIX14T weather forecast dataset and show that our model performs better than baseline models using only gloss or text.

Figure 1: Sign language uses multiple modalities, such as hands, body, and facial expressions, to convey semantic information. Although gloss annotation is often used to transcribe sign language, the above examples show that meaning encoded through facial expressions is not captured. In addition, the translation from text (blue) to gloss (red) is lossy, even though sign languages have the capability to express the complete meaning of the text. The lower example shows lowered brows and a wrinkled nose adding the meaning of kräftiger (heavy), present in the text, to the RAIN sign.
In summary, our main contributions are the following:

• A novel Dual Encoder Transformer for SLG which captures information from text and gloss, as well as their relationship, to generate continuous 3D sign pose sequences, facial landmarks, and facial action units.

• The use of facial action units to ground semantic representation in sign language.

Background and Related Work
More than 70 million Deaf and Hard of Hearing people worldwide use one of 300 existing sign languages as their primary language (Kozik, 2020). In this section, we explain the linguistic characteristics of sign languages and the importance of facial expressions for conveying meaning, and elaborate on prior work in SLG.

Sign Language Linguistics
Sign languages are spatiotemporal languages and are articulated using the hands, face, and other parts of the body, which need to be visible. In contrast to spoken languages, which are oral-aural languages, sign languages are articulated in front of the top half of the body and around the head. No universal method such as the International Phonetic Alphabet (IPA) exists to capture the complexity of signs. Gloss annotation is often used to represent the meaning of signs in written form. Glosses do not provide any information about the execution of the sign, only about its meaning. Moreover, as glosses use written languages rather than the sign language itself, they are a mere approximation of the sign's meaning, representing only one possible transcription. For that reason, glosses do not always represent the full meaning of signs, as shown in Figure 1. Every sign can be broken into four manual characteristics: shape, location, movement, and orientation. Non-manual components such as mouth movements (mouthing), facial expressions, and body movements are other aspects of sign language phonology. In contrast to spoken languages, where vowels and consonants occur sequentially, the phonological components of a sign occur simultaneously. Although the vocabulary size of ASL in dictionaries is around 15,000 (Spread the Sign, 2017), compared to approximately 170,000 words in spoken English, the simultaneity of phonological components allows for a wide range of signs to describe slight variations of the same gloss. While English has various words to describe largeness (big, large, huge, humongous, etc.), ASL has one main sign for "large": BIG. However, through modifications of facial expressions, mouthing, and the size of the sign, different levels of largeness can be expressed just as in a spoken language (Grushkin, 2017). To communicate spoken concepts without a corresponding sign, fingerspelling, a manual alphabet, is sometimes used (Baker et al., 2016).

Grammatical Facial Expressions
Facial expressions are grammatical components of sign languages that encode semantic information; when they are excluded, meaning is lost. Facial expressions play an important role in distinguishing different types of sentences such as WH-questions, Yes/No questions, doubt, negations, affirmatives, conditional clauses, focus, and relative clauses (da Silva et al., 2020). The following example shows how the same gloss order can represent either a question or an affirmation (Baker et al., 2016):

Example 1 (Indopakistani Sign Language)
a) FATHER CAR EXIST. "(My) father has a car."
b) FATHER CAR EXIST? "Does (your/his) father have a car?"

In this example, what makes sentence b) a question are raised eyebrows and a forward and/or downward movement of the head/chin in parallel to the manual signs.

Figure 2: Examples of facial Action Units (AUs) (Friesen and Ekman, 1978) from the lower face relevant to the generation of mouthings in sign languages. AUs can occur with intensity values between 0 and 5. AUs have been used in psychology and in affective computing to understand emotions expressed through facial expressions. Image from (De la Torre and Cohn, 2011).

In addition, facial expressions can differentiate the meaning of a sign by assuming the role of a quantifier. Figure 1 shows different signs for the same gloss, REGEN (rain). We can observe from the text transcript (in blue) that the news anchor says "rain" in the upper example but "heavy rain" in the lower. This example shows how gloss annotations are not perfect transcriptions of sign languages, as they only convey the meaning of the manual aspects of signs. Information conveyed through facial expressions to show intensity is not represented in gloss annotation. To quantify the loss of information that occurs in gloss annotation, we used spaCy (Honnibal and Montani, 2017) to compute Part-of-Speech (POS) annotations for text and gloss. Table 1 shows the occurrence of nouns, verbs, adverbs, and adjectives for text and gloss over the entire dataset. Although gloss annotations have lower occurrences for all POS, the difference is statistically significant for adjectives with p < 0.05. To calculate this significance, we performed hypothesis testing with two proportions by computing the Z score. We used t-tests to determine the statistical significance of our model's performance.
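The sketch below illustrates this analysis; the spaCy German pipeline name and the list-of-sentences data layout are assumptions for illustration, not the exact code used.

```python
# Hedged sketch: counting POS tags in text vs. gloss with spaCy and testing
# whether adjectives are significantly rarer in gloss (two-proportion z-test).
import math
from collections import Counter

import spacy

nlp = spacy.load("de_core_news_sm")  # German pipeline; PHOENIX14T text/gloss are German

def pos_counts(sentences):
    """Count coarse POS tags (NOUN, VERB, ADV, ADJ, ...) over a list of sentences."""
    counts, total = Counter(), 0
    for sent in sentences:
        for tok in nlp(sent):
            counts[tok.pos_] += 1
            total += 1
    return counts, total

def two_proportion_z(x1, n1, x2, n2):
    """Z score for H0: the proportions x1/n1 and x2/n2 are equal."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Usage (illustrative): text_sents and gloss_sents are lists of strings
# text_counts, text_total = pos_counts(text_sents)
# gloss_counts, gloss_total = pos_counts(gloss_sents)
# z = two_proportion_z(text_counts["ADJ"], text_total, gloss_counts["ADJ"], gloss_total)
# |z| > 1.96 corresponds to p < 0.05 for a two-sided test
```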

Sign Language Generation
Several advances in generating sign poses from text have been recently achieved in SLG; however, there is limited work that considers the loss of semantic information that occurs when using gloss to generate poses and aligned facial expressions. Previous work has generated poses by translating text-to-gloss (T2G) and then gloss-to-pose (G2S), or by using either text or gloss as input (Stoll et al., 2020; Saunders et al., 2020). We propose a Dual Encoder Transformer for SLG which trains individual encoders for text and gloss and combines their outputs to capture similarities and differences. In addition, the majority of previous work on SLG has focused mainly on manual signs (Stoll et al., 2020; Saunders et al., 2020; Zelinka and Kanis, 2020; Saunders et al., 2021b). Saunders et al. (2021a) are the first to generate facial expressions and mouthing together with hand poses; the representation used for the non-manual channels is the same as for the hand gestures, namely coordinates of facial landmarks. In this work we explore the use of facial Action Units (AUs) (see Figure 2), which represent intensities of facial muscle movements (Friesen and Ekman, 1978). Although AUs have been primarily used in tasks related to emotion recognition (Viegas et al., 2018), recent work has shown that AUs help detect WH-questions, Y/N questions, and other sentence types in Brazilian Sign Language (da Silva et al., 2020).

Figure 3: Our proposed model architecture, the Dual Encoder Transformer for Sign Language Generation. The architecture uses two encoders, one for text and one for gloss annotation; multiplying the outputs of both encoders emphasizes their differences and similarities. In addition to skeleton poses and facial landmarks, we include facial action units (Friesen and Ekman, 1978).

Sign Language Dataset
In this work we use the publicly available PHOENIX14T dataset (Camgoz et al., 2018), frequently used as a benchmark for SLR and SLG tasks. The dataset comprises a collection of weather forecast videos in German Sign Language (DGS), segmented into sentences and accompanied by German transcripts of the news anchor's speech and sign-gloss annotations. PHOENIX14T contains videos of 9 different signers with 1,066 different sign glosses and 2,887 different German words. The videos have a resolution of 210 by 260 pixels and a frame rate of 30 frames per second. The dataset is partitioned into training, validation, and test sets with 7,096, 519, and 642 sentences, respectively.

Methods: Dual Encoder Transformer for Sign Language Generation
In this section, we present our proposed model, the Dual Encoder Transformer for Sign Language Generation, shown in Figure 3. Given the loss of information that occurs when translating from text to gloss, our novel architecture takes into account the information from text and gloss, as well as their similarities and differences, to generate sign language in the form of skeleton poses and facial features. For that purpose, we learn the conditional probability p(Y | X, Z) of producing a sequence of signs Y = (y_1, ..., y_T) with T frames, given a spoken language sentence X = (x_1, ..., x_N) with N words and the corresponding gloss sequence Z = (z_1, ..., z_U) with U glosses.
Our work is inspired by the Progressive Transformer (Saunders et al., 2020), which translates from a symbolic representation (words or glosses) to a continuous domain (joint and face landmark coordinates) and employs positional encoding to permit the processing of inputs with varied lengths. In contrast to the Progressive Transformer, which uses a single encoder over either text or glosses to generate skeleton poses, we employ two encoders, one for text and one for glosses, to capture information from both sources, and create a combined representation from the encoder outputs to represent correlations between text and glosses. In the following, we describe the different components of the Dual Encoder Transformer.

Embeddings
As our input sources are words, we need to convert them into numerical representations. Similar to transformers used for text-to-text translation, we use word embeddings based on the vocabulary present in the training set. As we are using two encoders to represent similarities and differences between text and glosses, we use one word embedding based on the vocabulary of the text and one based on the vocabulary of the glosses. We also experiment with using the text word embedding for both encoders. Given that our target is a sequence of skeleton joint coordinates, facial landmark coordinates, and continuous facial AU values of varying length, we use counter encoding (Saunders et al., 2020). The counter c varies between [0, 1] with intervals proportional to the sequence length, which allows the generation of frames without an end token. The target joints are then defined as the concatenation of the joint values of frame t with the counter value:

m_t = [y_t, c_t]

The target joints m_t are then passed to a continuous embedding, which is a linear layer.
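A minimal sketch of the counter encoding, under the assumption that each target frame is a flat vector of joint, landmark, and AU values, is given below; tensor names are illustrative.

```python
# Minimal sketch of counter encoding (following the description above and
# Saunders et al., 2020): each target frame is the concatenation of its joint,
# landmark, and AU values with a counter c in [0, 1] that tracks progress
# through the sequence.
import torch

def add_counter(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) sequence of pose/landmark/AU values for T frames.
    Returns (T, D + 1) with the counter value appended to every frame."""
    T = frames.shape[0]
    counter = torch.linspace(0.0, 1.0, steps=T).unsqueeze(1)  # (T, 1), runs from 0 to 1
    return torch.cat([frames, counter], dim=1)

# The counter-augmented frames are then passed through a linear layer
# (the "continuous embedding"), e.g. torch.nn.Linear(D + 1, model_dim).
```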

Dual Encoders
We use two encoders, one for text and one for gloss annotations. Both encoders have the same architecture: they are composed of L layers, each with one Multi-Head Attention (MHA) sub-layer and a feed-forward sub-layer. Residual connections (He et al., 2016) are applied around each of the two sub-layers, followed by layer normalization (Ba et al., 2016). MHA uses multiple projections of scaled dot-product attention, which permits the model to associate each word of the input with every other word. The scaled dot-product attention outputs a weighted combination of the values V, determined by the queries Q, keys K, and dimensionality d_k:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Multiple self-attention heads are used in MHA, generating parallel mappings of Q, K, and V with different learned parameters.
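For clarity, the following sketch writes out the scaled dot-product attention used inside each MHA sub-layer; shapes and names are illustrative.

```python
# Sketch of scaled dot-product attention, written out explicitly.
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (batch, length, d_k). Returns the attention-weighted values."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, len_q, len_k)
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ V                                 # (batch, len_q, d_k)
```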
The outputs of MHA are then fed into a non-linear feed-forward projection. In our case, with two different encoders, their outputs can be formulated as:

h_{1:N} = E_text(x_{1:N}),   h_{1:U} = E_gloss(z_{1:U})

where h_n is the contextual representation of the n-th word of the text sequence, h_u the contextual representation of the u-th gloss, N the number of words, and U the number of glosses in the source sequence.
As we want to use not only the information encoded in text and gloss but also their relationship, we combine the outputs of both encoders with a Hadamard multiplication. As N ≠ U, we stack h_n vertically U times and stack h_u vertically N times in order to obtain two matrices of the same dimensions, and then multiply both matrices element-wise. The Hadamard product multiplies every pair of corresponding elements of two matrices, so that a_{i,j} and b_{i,j} yield a_{i,j}b_{i,j}. In this way, every output vector of the text encoder is combined with every output vector of the gloss encoder.
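The following sketch illustrates one possible reading of this combination step, where tiling produces every text-gloss pair before the element-wise product; the exact tiling order and output shape are assumptions.

```python
# Hedged sketch of the dual-encoder combination: the text encoder output
# h_text (N x d) and gloss encoder output h_gloss (U x d) are tiled so that
# every text vector meets every gloss vector, then multiplied element-wise
# (Hadamard product).
import torch

def combine_encoders(h_text: torch.Tensor, h_gloss: torch.Tensor) -> torch.Tensor:
    """h_text: (N, d), h_gloss: (U, d) -> combined representation (N * U, d)."""
    N, d = h_text.shape
    U, _ = h_gloss.shape
    text_tiled = h_text.repeat_interleave(U, dim=0)  # each text vector repeated U times
    gloss_tiled = h_gloss.repeat(N, 1)               # gloss sequence repeated N times
    return text_tiled * gloss_tiled                  # element-wise (Hadamard) product
```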

Decoder
Our decoder is based on the progressive transformer decoder (DPT), an auto-regressive model that produces continuous sequences of sign poses together with the previously described counter value (Saunders et al., 2020). In addition to producing sign poses and facial landmarks, our decoder also produces 17 facial AUs. The counter-concatenated joint embeddings ĵ_u, which include manual and facial features (facial landmarks and AUs), represent the sign pose of each frame. First, an initial MHA sub-layer is applied to the joint embeddings, similar to the encoder but with an extra masking operation; masking future frames prevents the model from attending to future time steps. A further MHA mechanism is then used to map the symbolic representations from the encoder to the continuous domain of the decoder. A final feed-forward sub-layer follows, with each sub-layer followed by a residual connection and layer normalization as in the encoder. The output of the progressive decoder can be formulated as:

[ŷ_u, ĉ_u] = DPT(ĵ_{1:u-1}, h)

where h is the combined encoder representation, ŷ_u corresponds to the 3D joint positions, facial landmarks, and AUs representing the produced sign pose of frame u, and ĉ_u is the respective counter value. The decoder learns to generate one frame at a time until the predicted counter value ĉ_u reaches 1. The model is trained using the mean squared error (MSE) loss between the predicted sequence ŷ_{1:U} and the ground truth y*_{1:U}:

L_MSE = (1/U) Σ_{u=1}^{U} (y*_u − ŷ_u)²
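A simplified sketch of the progressive decoding loop and the MSE objective is given below; the decoder interface shown is a hypothetical stand-in, not the exact implementation.

```python
# Illustrative sketch of the progressive decoding loop described above.
# `decoder` is assumed to map the frames generated so far plus the combined
# encoder representation to the next frame and a counter value.
import torch

def generate(decoder, encoder_out, start_frame, max_frames=300):
    """Autoregressively produce frames until the predicted counter reaches 1."""
    frames = [start_frame]                    # (D,) initial frame
    for _ in range(max_frames):
        seq = torch.stack(frames, dim=0)      # (t, D) frames produced so far
        next_frame, counter = decoder(seq, encoder_out)
        frames.append(next_frame)
        if counter >= 1.0:                    # counter value signals end of sequence
            break
    return torch.stack(frames, dim=0)

# Training objective: mean squared error between predicted and ground-truth frames,
# e.g. loss = torch.nn.functional.mse_loss(y_hat, y_star)
```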

Computational Experiments

Features
We extract three different types of features from the PHOENIX14T dataset: skeleton joint coordinates, facial landmark coordinates, and facial action unit intensities. We use OpenPose (Cao et al., 2019) to extract skeleton poses from each frame and use the coordinates of 50 joints representing the upper body, arms, and hands, which we refer to as "manual features". We use OpenFace (Baltrusaitis et al., 2018) to extract 68 facial landmarks as well as the 17 facial action units (AUs) shown in Figure 2, which together constitute the "facial features".
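The sketch below shows how a per-frame feature vector could be assembled from these features; array names and the use of 2D landmark coordinates are assumptions for illustration.

```python
# Sketch of assembling a per-frame feature vector from the extracted features:
# 50 upper-body/hand joints, 68 facial landmarks, and 17 AU intensities.
import numpy as np

def frame_features(joints_xyz, landmarks_xy, aus):
    """joints_xyz: (50, 3) joint coordinates, landmarks_xy: (68, 2) facial
    landmarks, aus: (17,) AU intensities -> flat feature vector per frame."""
    manual = joints_xyz.reshape(-1)                           # 150 manual values
    facial = np.concatenate([landmarks_xy.reshape(-1), aus])  # 136 + 17 facial values
    return np.concatenate([manual, facial])
```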

Baseline Models
We compare the performance of our proposed model (TG2S) with two Progressive Transformers (Saunders et al., 2020): one using gloss only to produce sign poses (G2S), and one using text only (T2S). We train each model with manual features only, and also with the combination of manual and facial features through concatenation.

Evaluation Methods
To automatically evaluate the performance of our model and the baseline models, we use back translation as suggested by Saunders et al. (2020). For that purpose we use the Sign Language Transformer (SLT) (Camgoz et al., 2020), which translates sign poses into text, and compute BLEU and ROUGE scores between the back-translated text and the original text. As the original SLT was designed to receive video frames as input, we modified the architecture to enable the processing of skeleton poses and facial features. When facial AUs are added to the hands, body, and face features, the difference from using manual data only is slightly smaller, with a BLEU-4 score of 10.61. Table 3 reports the results of using the hand and body joint skeleton as the sole input to the baseline models and our proposed model. Our proposed model TG2S achieves the highest BLEU-4 score of 8.19 on the test set, compared to 7.84 for G2S and 7.56 for T2S. Table 4 presents the results of including facial landmarks as well as facial AUs together with the body and hand skeleton joints as input. Here too our proposed model outperforms the baseline models, showing a BLEU-4 score of 5.76 on the test set; G2S obtained a BLEU-4 score of 6.37 and T2S 5.53.
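A minimal sketch of this back-translation evaluation is shown below; slt_translate is a hypothetical stand-in for the modified SLT model, and BLEU is computed here with sacrebleu.

```python
# Hedged sketch of back-translation evaluation: generated pose sequences are
# translated back to German text by a sign-language-translation model and
# scored against the reference transcripts.
import sacrebleu

def back_translation_bleu(pose_sequences, reference_texts, slt_translate):
    """pose_sequences: list of generated pose/face feature sequences.
    reference_texts: list of original German sentences."""
    hypotheses = [slt_translate(seq) for seq in pose_sequences]
    bleu = sacrebleu.corpus_bleu(hypotheses, [reference_texts])
    return bleu.score  # corpus-level BLEU-4 by default
```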

Quantitative Results
We see in Tables 3 and 4 that G2S obtained higher scores than T2S. Given that gloss annotations fail to encode the richness of meaning in signs, it appears that the smaller vocabulary helps the model achieve higher scores by discarding information otherwise described in the text. Our proposed model obtains better results than G2S by compromising between the information from gloss, text, and their similarities and differences. We also see in both tables that the inclusion of facial information reduces the overall scores. We believe this may be due to the diverse range of possible facial expressions. We cannot directly compare the results of Tables 3 and 4, as different SLT models were used to compute the BLEU scores.

Qualitative Results
Figure 4 shows the visual quality of our model's predictions when using manual and facial information. Both examples show that the predictions captured the hand shape, orientation, and movement of the ground truth. In the bottom example for RAIN, the predictions even captured the repetitive hand movement symbolizing falling rain. It can also be noted that the ground truth is not perfect: in both examples unnatural finger and head postures can be seen, and the ground truth does not display movements of the eyebrows and mouth with the expected intensities. Figure 5 shows situations in which the predictions failed to represent the correct phonology of signs. In the first example, hand shape, orientation, and position are not correct. The predictions of our models also fail to capture pointing hand shapes, as shown in the second example.

Discussion and Conclusion
In this work, for the first time, we attempt to augment contextual embeddings for sign language by learning a joint meaning representation that includes fine-grained facial expressions. Our results show that the proposed semantic representation is richer and linguistically grounded. Although our proposed model helped bridge the loss of information by taking into account text, gloss, and their similarities and differences, there are still several challenges to be tackled by a multidisciplinary scientific community.
Complex hand shapes with pointing fingers are very challenging to generate. The first step toward improving the generation of fingers is to improve methods that recognize finger movements more accurately. Similarly, we need tools that are more robust in detecting facial expressions even in situations of occlusion. We also observe that SLG models overfit specific sign languages instead of learning generalized representations of signs.
We chose to work with German Sign Language since it offers the only dataset with gloss annotation that could help us study our hypotheses. The How2Sign dataset (Duarte et al., 2021) is a feasible dataset for ASL, but it does not allow any model to extract facial landmarks, facial action units, or facial expressions from the original video frames since the faces are blurred. In the future, we hope to see new datasets with better and more diverse annotations for different sign languages that would allow the design of natural and usable sign language generation systems.

Figure 4: The upper example shows that the predictions captured the correct hand shape, orientation, and movement of the sign CLOUD. In the lower example it is visible that the predictions captured the repeating hand movement meaning RAIN. Although at first glance the hand orientation may seem incorrect, it is a slight variation which is still correct.