Making Body Movement in Sign Language Corpus Accessible for Linguists and Machines with Three-Dimensional Normalization of MediaPipe

Linguists can access movement in a sign language video corpus through manual annotation or computational methods. The first relies on a predefinition of features, and the second requires technical knowledge. Methods like MediaPipe and OpenPose are now increasingly used in sign language processing. MediaPipe detects a two-dimensional (2D) body pose in a single image with only a limited approximation of the depth coordinate. Such a 2D projection of a three-dimensional (3D) body pose limits the potential application of the resulting models outside the capturing camera settings and position. Moreover, 2D pose data do not provide linguists with direct and human-readable access to the collected movement data. We propose four main contributions: a novel 3D normalization method for MediaPipe's 2D pose; a novel human-readable way of representing the 3D normalized pose data; an analysis of Japanese Sign Language (JSL) sociolinguistic features using the proposed techniques, in which we show how an individual signer can be identified based on unique personal movement patterns, suggesting a potential threat to anonymity; and an evaluation in which our method outperforms the common 2D normalization on a small, diverse JSL dataset. We further demonstrate its benefit for deep-learning approaches by significantly outperforming the pose-based state-of-the-art models on an open sign language recognition benchmark.


Introduction
Our research aims to find a movement representation that allows processing of sign language movement directly, without relying on annotations or systems of predefined movement features, such as the Hamburg Notation System (Prillwitz et al., 1989), and that helps overcome the camera-setting constraints of the available datasets.

Problem
Due to its visual nature, sign language data are stored and distributed in a video format. Linguists must annotate the sign language features on the video to process them, and before a feature can be annotated, it must first be clearly defined and thoroughly explained to the annotators. That process is not only time-consuming, but it also limits access to the collected data and the processing potential. Pose estimation methods like OpenPose (Cao et al., 2017, 2021) and MediaPipe (Lugaresi et al., 2019) are increasingly included in the sign language processing pipeline. Recently published sign language datasets often include pose estimation data. How2Sign, a large multimodal dataset of American Sign Language (ASL) presented in Duarte et al. (2021), and the Word-Level American Sign Language (WLASL) video dataset presented in Li et al. (2020a) both provide estimated pose data. However, the detection accuracy of pose estimation techniques still requires improvement: Moryossef et al. (2021) reported the negative influence of inaccurate or missing estimations on model performance and applicability beyond training datasets.
Generally, sign language researchers independently develop their own ways of processing pose data for specific body joints and features. Recent approaches still rely on raw pixel data (Sadeghzadeh and Islam, 2022) or a combination of pixel and pose data (Shi et al., 2021). Moreover, sign language datasets vary in terms of the position of the camera relative to the signer, resulting in dissimilarity in 2D pose estimation output for similar movements. Such inconsistencies prevent model generalization, thereby limiting movement feature extraction and inference outside the dataset.
The commonly used standard normalization process proposed in Celebi et al. (2013), recently adopted in Schneider et al. (2019) and Fragkiadakis et al. (2020), involves translating the coordinate origin to a "neck key point" and scaling all coordinates so that the distance between the shoulder key points equals one. This work will refer to this normalization method as the "basic" one. This method successfully eliminates the influence of body size differences. However, features such as body rotation (toward the camera), posture, and body proportions in the dataset can still influence feature extraction.
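The basic normalization described above can be sketched in a few lines. The key-point layout below (neck and shoulder indices) is a hypothetical example, since the actual index order depends on the pose estimator.

```python
import numpy as np

def basic_normalization(pose, neck=0, l_shoulder=1, r_shoulder=2):
    """Basic 2D normalization (after Celebi et al., 2013):
    translate the origin to the neck key point and scale so the
    shoulder-to-shoulder distance equals one.

    pose: (N, 2) array of key-point coordinates for one frame.
    The index arguments are illustrative, not MediaPipe's layout.
    """
    shoulder_dist = np.linalg.norm(pose[l_shoulder] - pose[r_shoulder])
    return (pose - pose[neck]) / shoulder_dist
```

As the paper notes, this removes body-size differences but leaves body rotation, posture, and proportions untouched.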

Related Work
Overcoming camera setting boundaries and improving pose estimation are actively researched. Complete 3D body pose estimation using a single camera is one of the main goals (Ji et al., 2020). Activity classification results obtained using 2D and 3D pose data were compared, with no significant difference emerging; the features in 2D data were sufficient (Marshall et al., 2019). A pipeline that includes learning 3D pose coordinates from 2D pose data collected from sign language videos was proposed and used for recognition tasks and for synthesizing avatar animations (Brock et al., 2020). Skeletor (Jiang et al., 2021) is a deep-learning approach that refines the pose's estimated coordinates in three dimensions; however, it relies on contextual information and therefore requires a sequence of frames.

Our Proposal
Here, we propose a three-dimensional normalization and encoding method for MediaPipe pose data that is entirely algorithmic and normalizes pose data based on fixed human proportions. Data are presented as a series of angular joint changes. The method can be applied to various sign language processing tasks and datasets and allows the search for and extraction of movement features. Both methods are detailed in Sections 2.1 and 2.3.
We tested our method against the basic normalization on continuous sign language data using standard machine learning techniques in Section 3.1 and on isolated sign language data using deep learning in Section 3.2. We show how JSL sociolinguistic features are present in the signing movement and how they can be explored using our methods.
The main contributions are:
• A novel three-dimensional normalization method for MediaPipe
• A novel movement representation method for direct sign language processing

Methodology
To achieve our goal of directly processing sign language, we set the following requirements for the desired movement representation: adherence to the triangle inequality, the capability of movement synthesis, being intuitive and understandable to humans, and being low-dimensional.
Adherence to the triangle inequality is essential for automated data processing techniques like clustering, machine learning, and indexing.
Movement representation data must be distributed across a space compatible with the notion of distance and similarity. To meet the movement synthesis requirement, sampling from such a space should return the corresponding pose, and moving through it should produce movement.
The space should not have any latent features, and its dimensions must be perceivable by a human. To promote readability and facilitate processing, the space must be as low-dimensional as possible to eliminate unnecessary information from representations.
Normalization must transform pose data into the desired space, and encoding must represent it suitably for human perception.
To determine the degree of adherence to these requirements and to compare our method with the basic normalization method mentioned in Section 1.1, we conducted experiments on two types of sign language data using standard machine learning and deep-learning techniques.

3D Normalization
MediaPipe's holistic model provides, along with the x and y coordinates, a limited estimation of the z coordinate for each body key point in a given frame. We propose a procedure to improve the depth (z) coordinate estimated by MediaPipe.
Joints do not move in isolation; they naturally interact within the skeleton. For example, the movement of the arm changes the position of the hand and its fingers. Therefore, we propose processing pose skeleton data as a "joint tree structure" that respects actual human proportions, with the root node at the neck key point. We aimed to use all available information to simplify pose data processing. We selected 137 joints from the holistic model: 22 for each hand, nine for the upper body, and 84 for facial expression. We created a human skeleton whose rigidness, proportions, and connectivity are in line with the human body.
To improve the depth coordinates estimated by MediaPipe, we use the proportions of the body. Herman (2016) provides an overview of standard human body proportions. For simplification, our method assumes that body proportions are constant for all data processed with MediaPipe.
We captured a short video of a body standing upright and used MediaPipe to calculate the ratio of distances between key points relative to the distance between the shoulders. The maximum distances across frames were used to calculate the proportions. The joint tree model of the human body, aligned with MediaPipe key points, stores the proportional values for each joint. From the length of just one joint in real space, we can compute the lengths of all joints; this requires some reliable MediaPipe estimation as a basis.
The holistic MediaPipe model includes a face model for estimating key facial points; its depth coordinates are the most accurate, so we trust its estimation the most. The distance between the eyes in the MediaPipe model is therefore selected as the basis for calculating the lengths of body joints. Eye positions are calculated as the average positions of key points 159, 143, 157, and 149 for the left eye, and key points 384, 386, 379, and 372 for the right eye. Relative to the distance between the shoulders, the distance between the eyes was calculated as 0.237 from the previously captured short video.
With this ratio, we calculate the lengths of all body joints in 3D using Formula 1, where eye_distance is the distance between the eyes according to MediaPipe and prop_j is the captured proportion of joint j. We calculate the z coordinate using this length and the relative x and y coordinates, with the origin at the "parent joint" in the joint tree structure. Lastly, we apply the sign of MediaPipe's original z estimation to it.
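A per-joint sketch of this depth refinement follows. Formula 1 itself is not reproduced in the text, so the bone-length expression below (scaling the eye distance back to shoulder units, then by the joint's proportion) is our reading of it, not a verbatim implementation.

```python
import numpy as np

EYE_TO_SHOULDER_RATIO = 0.237  # eye distance / shoulder distance, from the paper

def refine_depth(dx, dy, z_sign, prop_j, eye_distance):
    """Recompute a joint's z offset from its known bone length.

    dx, dy: x/y offsets of the joint relative to its parent joint;
    z_sign: sign of MediaPipe's original z estimate for this joint;
    prop_j: the joint's captured proportion (relative to shoulder width).
    """
    length = eye_distance * prop_j / EYE_TO_SHOULDER_RATIO
    planar_sq = dx * dx + dy * dy
    # The 2D projection can never exceed the true bone length;
    # clip to zero to stay robust against estimation noise.
    dz = np.sqrt(max(length * length - planar_sq, 0.0))
    return np.copysign(dz, z_sign)
```

The clipping step is a practical safeguard for noisy frames where the projected 2D distance slightly exceeds the assumed bone length.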
The joint tree structure allows us to control the order of calculation with the traversal and process only the desired part of the tree if needed. To obtain the coordinates for a joint with the origin set at the neck point, we sum the coordinates for all its parent joints in the tree. A detailed example of the 3D estimation step is shown in the middle part of Figure 1.
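The parent-summing step can be sketched as a simple walk up the tree. The joint names used here are hypothetical placeholders; the actual tree has 137 joints rooted at the neck key point.

```python
def absolute_position(joint, relative, parent):
    """Sum relative offsets up the joint tree to get coordinates
    with the origin at the root (neck) key point.

    relative: dict joint -> (x, y, z) offset from its parent joint;
    parent:   dict joint -> parent joint name (root maps to None).
    """
    x = y = z = 0.0
    while joint is not None:
        dx, dy, dz = relative.get(joint, (0.0, 0.0, 0.0))
        x, y, z = x + dx, y + dy, z + dz
        joint = parent.get(joint)
    return (x, y, z)
```

Traversing from a chosen subtree root instead of the neck gives the partial-tree processing mentioned above.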

Scaling and Rotation
After 3D coordinates refinement, the pose data are represented as a joint tree structure with coordinates in 3D space. To address the variation in camera angle and relative position, we rotate and scale the coordinates as the final part of the normalization process. The resulting pose is consistently rotated toward the camera and is fixed in size. The root node (neck point) is the origin (0,0,0), and the left shoulder is at (0.5,0,0). Both scaling and rotation are performed through a linear transformation in 3D space. To generate the transformation matrix, a scaling factor and rotation angles are required, which we compute for each frame. We apply the transformation matrix to all joint coordinates, using joint tree structure traversal to obtain rotated and scaled coordinates for each joint. A detailed example of a scaled and rotated pose is shown in Figure 1 (bottom panel).
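A minimal sketch of the per-frame transformation follows, assuming the stated targets (neck at the origin, left shoulder at (0.5, 0, 0)). It is one plausible reading of the normalization; the full method also fixes the remaining roll around the shoulder axis, which this sketch omits.

```python
import numpy as np

def rotation_to_x(v):
    """Rotation matrix taking the direction of v onto the x-axis
    (Rodrigues' rotation formula)."""
    v = v / np.linalg.norm(v)
    x = np.array([1.0, 0.0, 0.0])
    axis = np.cross(v, x)
    s = np.linalg.norm(axis)     # sin of the rotation angle
    c = float(np.dot(v, x))      # cos of the rotation angle
    if s < 1e-9:                 # already (anti-)parallel to x
        return np.eye(3) if c > 0 else np.diag([-1.0, 1.0, -1.0])
    axis /= s
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + s * K + (1.0 - c) * (K @ K)

def scale_and_rotate(joints, neck, left_shoulder):
    """Translate the neck to the origin, rotate the shoulder line onto
    the x-axis, and scale so the left shoulder lands at (0.5, 0, 0).

    joints: (N, 3) array of absolute 3D coordinates for one frame."""
    centered = joints - neck
    R = rotation_to_x(left_shoulder - neck)
    scale = 0.5 / np.linalg.norm(left_shoulder - neck)
    return centered @ R.T * scale
```

Because the transformation is linear, it can equally be applied during the tree traversal, as the paper describes.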
Facial expressions are essential for processing sign language. Therefore, we perform an additional separate transformation only for face points. We scale and rotate the key face points so that the nose points toward the camera along the z-axis, while the point between the eyes is on the y-axis. Additional normalized facial key point data are shown in Figure 1 (bottom panel, upper right corners) and Figure 2e-h.

Representation
Our normalization process returns pose data as 137 "tree-structured joints" with 3D coordinates, which is helpful for decomposing movement. We use relative coordinates for each joint, with the origin set at the parent joint, to represent the joint's movement in space independently. Since the proportions are fixed and known, independent movement can be estimated with the arccosine of the direction cosines, i.e., the angles between the joint and the axes, which range from −π to π in radians. The resulting body movement appears as a series of 411 (3 × 137) isolated signals. Each signal shows the value of the angle between the corresponding joint and the corresponding axis at every frame. The resulting decomposition allows the quick determination of when, where, and what movement of the body is captured, providing direct access to it.
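The angle computation for one joint can be sketched as follows. Note that the arccosine itself yields values in [0, π]; recovering the signed [−π, π] range described above would need the coordinate signs, which this sketch omits.

```python
import numpy as np

def direction_angles(dx, dy, dz):
    """Angles between a joint's bone vector (relative to its parent)
    and the x-, y-, and z-axes: arccosine of the direction cosines."""
    v = np.array([dx, dy, dz], dtype=float)
    length = np.linalg.norm(v)
    return np.arccos(v / length)  # one angle per axis
```

Applied to every joint at every frame, this yields the 411 angle signals that make up the movement representation.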
Initially, we obtained three values from MediaPipe for each key body point. After normalization, the key body points became joints with three direction angle values, stripped of the variation in size, rotation, and body proportions. The dimensionality of the information remained the same while the representation space changed, adhering to the requirements in Section 2.
We use the RGB color space of an image to visualize a series of direction angles for joints, simplifying the interpretation of the movement. The process is shown in Figure 2a-b: direction angles with the x-, y-, and z-axes, ranging from −π to π in radians, are encoded in the red, green, and blue channels, respectively, as 8-bit integers ranging from 0 to 255. Figure 3 shows an example image of a representative movement.
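The angle-to-channel mapping is a linear rescaling of [−π, π] onto [0, 255], which can be sketched (with its inverse) as:

```python
import numpy as np

def angles_to_rgb(angles):
    """Map direction angles in [-pi, pi] (x-, y-, z-axis) to 8-bit
    red, green, and blue channel values in [0, 255]."""
    angles = np.asarray(angles, dtype=float)
    return np.round((angles + np.pi) / (2 * np.pi) * 255).astype(np.uint8)

def rgb_to_angles(rgb):
    """Inverse mapping, exact up to 8-bit quantization error."""
    return np.asarray(rgb, dtype=float) / 255 * 2 * np.pi - np.pi
```

The 8-bit quantization introduces an error of at most half a step (about 0.012 rad), which is why the encoding decodes back to angles only approximately.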
For handshapes, adding a one-dimensional (1D) encoding of the absolute 3D angle (0°-180°; blue = 0°, red = 90°, green = 180°) between the hand joints and their parents can serve as an additional visual clue; Figure 2d provides an example. Key face points are encoded differently. We captured a relaxed facial expression as the baseline (Figure 2e) and encoded deviation angles for each key point along the three axes. The angle changes are usually tiny, so we multiplied the difference by a factor of four (determined by trial and error) to aid visual interpretation. Figure 2f-g shows an example of encoded facial expressions.
For computer processing methods, the channels must be separated; thus, data are encoded as 8-bit integers with only one channel per pixel. This method allows for movement manipulation via image manipulation. Image processing is a highly advanced area, with various methods and approaches applicable to movement processing; movement detection and recognition, for example, can borrow approaches from object detection.
Figure 4 shows an encoded representation of 13 simultaneous closings and openings of rotating fists. Where and when the movement occurs is easily detectable with the naked eye and might also be easily detectable via modern pattern-recognition methods.
The proposed encoding scheme is straightforward, and the image can easily be decoded back to direction angles and coordinates. Movement patterns are fully explainable and can produce skeleton animations, aiding visual comprehension and thus satisfying the requirements in Section 2.
We hope to encourage researchers to explore the capabilities of encoded movement data, augmenting their sign language knowledge to explore movement features. Section 3 discusses how the proposed methods compare to the standard normalization process for linguistic and sociolinguistic features extracted from sign language datasets.

Experimental Setup
The proposed normalization was explicitly developed to process data with high variance, which is typical of data captured in real life. The decomposition property of our approach allows for comparing pure movement data on a joint-by-joint basis. In this section, we compare the performance of our method with that of the basic normalization method mentioned in Section 1.1 on a dataset composed of continuous samples (the JSL dataset) and a public benchmark composed of isolated samples (the WLASL-100 dataset). The JSL dataset has variation in camera angle and includes coding for various sociolinguistic features. However, it is a small and very diverse dataset; therefore, it is used for feature exploration and camera-setting boundary testing with standard machine learning algorithms. The isolated WLASL-100 is more suitable for deep-learning testing since it is an established public sign language recognition benchmark.

Continuous Signs - JSL
We created a simple dataset from the Japanese Sign Language (JSL) Corpus presented by Bono et al. (2014, 2020). The JSL Corpus is a continuous, dialog-based corpus that includes utterance- and word-level annotations. It consists of conversations between two participants freely discussing various topics. The signers vary in age, geographic area, gender, and the deaf school they attended. Conversations were captured from both the semi-front and side positions; a sample from the dataset is shown in Figure 1.

JSL Dataset Statistics
Using the word-level annotations, we selected lexical signs from the JSL Corpus with more than 25 samples. For each lexical sign, we extracted 25 video examples for each camera view (a total of 50 samples). Some samples had insufficient capturing quality for pose estimation, so our final dataset comprised 674 semi-front view and 608 side view videos. The resulting number of classes and samples per class for each feature is shown in Table 4.
For comparison, we created a second dataset from the same samples by normalizing the MediaPipe pose data with the basic normalization method. The resulting samples vary in duration from four to 120 frames, so we resized them using linear interpolation to fit the longest sample in the dataset. The JSL Corpus includes the signer ID, prefecture of residence, deaf school, age group, and gender. This information was added to the dataset since we were interested in examining whether these features affect signing movements.
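The duration-equalizing resize can be sketched as per-signal linear interpolation over a common time axis. The fixed target length is a parameter here; in the JSL experiments it is the length of the longest sample.

```python
import numpy as np

def resize_sample(sample, target_len):
    """Resize a variable-length pose sequence to target_len frames
    with per-signal linear interpolation.

    sample: (frames, signals) array of angle or coordinate signals."""
    frames, signals = sample.shape
    old_t = np.linspace(0.0, 1.0, frames)
    new_t = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(new_t, old_t, sample[:, s])
                     for s in range(signals)], axis=1)
```

The same operation is reused later when fitting WLASL samples to a fixed 100 frames.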

Classification
First, we used the Neighbourhood Components Analysis (NCA) approach presented in Goldberger et al. (2004) to visualize the embedding space for each sociolinguistic feature in the dataset. We tested various classification techniques from the scikit-learn package (Pedregosa et al., 2011), including the linear support vector classifier (SVC) (Fan et al., 2008), the nearest neighbor classifier (Cover and Hart, 1967), naive Bayes (Zhang, 2004), and the decision tree (Breiman et al., 1984), to check for the presence of features in the data and to assess the potential applicability of our normalization method to classification tasks.
We designed an experiment in which a model was trained on data captured from the front perspective and tested using data captured from the side perspective.We did this to address the camera angle boundary and generalization issue mentioned in Section 1, i.e., to determine the applicability to other datasets and capture conditions.

Isolated Signs - WLASL
We used the popular public deep-learning benchmark, the Word Level American Sign Language Dataset (Li et al., 2020a), to demonstrate the utility of our normalization and representation methods in deep-learning pose-based approaches.

WLASL-100 Dataset Statistics
We selected WLASL-100, the WLASL subset comprising the one hundred glosses with the most samples. The split statistics are shown in Table 1.

WLASL Preprocessing
The WLASL dataset is distributed in video format, requiring preprocessing before training. Our preprocessing flow starts with MediaPipe pose data extraction, followed by normalization with both the proposed method and the basic normalization to create two datasets for comparison. The next preprocessing steps are visualized in Figure 8 and include finding and cutting to the part where both hands are visible, resizing with linear interpolation to a fixed 100 frames, and removing the facial and relative joint information rows from the samples to reduce the dimensionality from 455 (411 in the basic normalization case) to 159 values per 100 frames. Note that the start and end frames for the basic normalization samples were determined using the corresponding samples of the proposed normalization dataset to guarantee consistency between the two datasets.
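The hands-visible cutting step can be sketched as trimming to the span between the first and last frame in which both hands were detected. The per-frame visibility flag is an assumption here; in practice it would come from MediaPipe's hand detection output.

```python
import numpy as np

def cut_to_hands_visible(sample, hands_visible):
    """Trim a pose sequence to the span where both hands are detected.

    sample: (frames, signals) array;
    hands_visible: boolean array, one flag per frame."""
    idx = np.flatnonzero(hands_visible)
    if idx.size == 0:          # no usable frames: leave the sample as-is
        return sample
    return sample[idx[0]:idx[-1] + 1]
```

After this cut, the sequence would be interpolated to 100 frames and the facial and relative-joint rows dropped, as described above.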

Model
We chose the Conformer model presented by Gulati et al. (2020) as the core unit, since it is aimed at two-dimensional signal representations. Figure 5 shows the overview of our model, in which we use an adaptive average pooling layer to reduce each sample to 15 frames and add one fully connected layer before and one after the Conformer. Neither fully connected layer has a bias or activation, but each is followed by an L2 normalization and a dropout layer. The resulting model is as simple as possible. We train it using the Adam optimizer (Kingma and Ba, 2015) and the log loss for 200 epochs with a mini-batch size of 32. Before training, all samples are standard-scaled on the training set, and during training, 50% uniform noise is added to the samples.

Continuous Signs
Figure 6 shows well-distinguished clusters for Signer ID, Prefecture, Deaf School, and Age Group, with the sole exception being the Gender feature.
Table 4 shows all classification results and samples per class distribution, whereas Table 2 shows only the best results as a summary.
For the Lexical Sign feature, our method outperforms the basic normalization method. Signer ID was the best-performing feature on front view data (accuracy = 78.57% with naive Bayes, against a baseline of 8.33% and a samples-to-classes ratio of 43/12). Likewise, Lexical Sign reached 49.78% accuracy on front plus side view data with naive Bayes, compared to a 3.7% chance of guessing and a samples-to-classes ratio of only 43/27. Each signer has a unique movement pattern (i.e., signing style). Stylistic characteristics also vary uniquely with the prefecture of residence, deaf school, and age group; only gender had no influence on signing movements. The results in Table 2 indicate that the only feature that did not significantly improve prediction performance over the chance level was Gender (accuracy = 75.53%), even though it had the most samples per class among all features and is binary in the data. This is consistent with the NCA embedding visualization: we cannot predict a JSL signer's gender based on their signing movements. Model performance on the other features was consistently above the baseline, except for the learning transfer tests (training on front view data and testing on side view data), thus confirming the presence of the movement patterns attributed to them. JSL experts validated the results, confirming our findings based on their experience.
The last two columns in Table 2 indicate that our method retains the extracted features better than the basic normalization method for all features, overcoming the camera angle setting boundary.

Isolated Signs
In Table 3, we report the average accuracy across ten runs for each dataset with the top-1, top-5, and top-10 prediction scores, as established in the WLASL benchmark reporting practice. Our model outperforms the state-of-the-art pose-based results on both datasets. Moreover, with the proposed normalization, the pose-only dataset exceeds models with combined modalities. Comparing the two normalizations, the results show a substantial performance improvement with the proposed normalization over the basic one, going from 75.85% to 84.26% using the same pose data estimated by MediaPipe. Figure 7 shows the accuracy curve on the test set during training, indicating a clear improvement in learning with the proposed normalization.

Discussion and Conclusions
The proposed methods allow linguists and engineers to directly access the movement captured in a sign language corpus. Previously, they had to use human annotation or recognition methods, both of which relied on a predefinition of the features and were effectively limited by it.
Sign language movement can now be represented and stored in human-readable form with the proposed encoding method, allowing researchers to observe and comprehend it visually. Normalized pose data are distributed over a joint-based, low-dimensional feature space with distinct and fully explainable dimensions. Machine learning methods can also process it directly, since it complies with the notion of distance and the triangle inequality.
The embedding results showed the presence of stylistic movement features that correspond to known sociolinguistic features of JSL, similar to predicting a speaker's country of origin based on their accent. Linguists and sign language experts can apply their knowledge of language properties and the proposed method to uncover novel features. Nevertheless, our results raise a concern about signer privacy protection, since stylistic features of individual signers can be predicted based solely on signing movement. The deep-learning WLASL-100 benchmark results are consistent with the JSL dataset tests. Our method significantly outperforms other pose-based methods and successfully competes with multimodal approaches. Sign language is naturally conveyed through body movement; extracting it from the collected video data improves performance and robustness.
Our method performs consistently well across all datasets. We satisfied the initial requirements outlined in Section 2 and addressed the generalization issue discussed in Section 1. The proposed methods are suitable for any sign language, and multiple sign languages can be encoded into one signing space, thus facilitating cross-language studies in future research.

Limitations
The proposed representation method can be used with any three-dimensional pose estimation. However, the proposed normalization method relies entirely on the initial recording quality and the estimation accuracy of MediaPipe and is incompatible with two-dimensional pose estimation methods like OpenPose. Our normalization method recalculates the value of the z coordinate but relies on MediaPipe's depth estimations to determine the order of the final coordinates. Even under ideal conditions, accounting for body proportions is difficult, since the normalization method assumes all humans have the same body proportions. This may lead to instances where the hands do not ideally touch, failing to capture an important sign language feature.
Processing some facial expressions, mouth gestures, and mouthing is limited and requires additional modalities (e.g., pixel data). Still, the detected facial key points can aid in pixel extraction.

Figure 2: Proposed encoding scheme: a) calculating a joint's direction angles with arccosine; b) translating the angle values into RGB color space; c) displaying the RGB-encoded data as an image; d) adding additional 3D angles between the joints with color-coding; e) normalizing facial expressions with a relaxed baseline face; and f-g) example color-coding of various facial expressions deviating from the relaxed baseline face.

Figure 4: Hand movement is clearly visible in the encoding plot of a processed video sample with 13 closings and openings of rotating fists.

Figure 5: An adaptive average pooling layer with a single Conformer layer (Gulati et al., 2020), trained with cross-entropy loss.

Table 2: Comparison of the normalization methods in terms of JSL feature classification performance.

Table 3: Comparison of accuracy scores on the WLASL-100 test data. We report the performance of our model with the proposed and the basic normalization method.