Spontaneous gestures encoded by hand positions can improve language models: An information-theoretically motivated study



Introduction
Human communication is a multi-modal process in which verbal and non-verbal information are expressed simultaneously. This is true across forms of communication, whether one-way (speech) or two-way (conversation). Empirical studies have revealed that speakers' expression in the visual modality, including gestures, body poses, eye contact and other types of non-verbal behavior, plays a critical role in face-to-face communication, as it adds subtle information that is hard to convey in verbal language. The study of such signals is becoming an emerging sub-area of computational linguistics. However, whether and to what degree these sparse and seemingly random non-verbal signals can be treated as a formal communication channel that transmits "serious" information remains a seldom-validated question, especially with computational methods. We believe a key missing step is to explore whether the non-verbal information can be quantified.
The questions worth further investigation include (but are not limited to): How rich is the information contained in these non-verbal channels? What is their relationship to verbal information? Can we understand the meanings of different gestures, poses, and motions embedded in spontaneous language in a way similar to understanding word meanings? The goal of this study is to propose a simple and straightforward framework to approach the above questions under the guidance of Information Theory. Some preliminary yet promising results are presented. The code and data for this study are published at https://github.com/innerfirexy/Life-lessons.

Studies on gestures in communication
Early studies and theories The functions of gestures in communication and their connection to verbal language have been extensively studied in behavioral science, psychology, and cognitive science. McNeill (1992) developed the Growth Point theory, in which a growth point can be conceptualized as a "snapshot" of an utterance at its psychological beginning stage. McNeill (1992)'s theory classifies gestures into two categories: representative gestures, which have clearer semantic meanings (e.g., depicting objects and describing locations), and non-representative gestures, which refer to repetitive movements with little substantive meaning. McNeill et al. (2008) further put forward a more fine-grained classification scheme for gestures: iconic, metaphoric, deictic, and beats, in which the iconic and metaphoric gestures are directly related to concrete and abstract content in the verbal language, respectively. These psycholinguistic theories and studies indicate the feasibility of investigating the "meanings" of gestures with computational semantic approaches.

Lab-based experimental studies
The effect of gestures has been broadly studied in laboratory-based behavioral experiments. Holler and Levinson (2019) study how multiple layers of visual and vocal signals add semantic and pragmatic information in face-to-face communication. Similarly, Macuch Silva et al. (2020) find that visible gestures are a more powerful form of communication than vocalization in dialogue object-description tasks. In these studies, gestures from human subjects are usually manually coded by observing the hands' spatial positions and motions to characterize naturalistic and meaningful movements. Trujillo et al. (2019) take a step forward and develop a protocol for automatically extracting kinematic features from video data, which can be applied to quantitative and qualitative analysis of gestures. Their work provides insight for the hand position-based encoding method for gestures (discussed in section 4.2).

Computational studies
More recently, the communicative functions of gestures have been studied in settings ranging from human-human to human-agent interaction. Synthesized gestures are integrated into virtual characters and robots to facilitate dialogue fluidity and user experience (Kopp, 2017). In such systems, the content and form of co-speech gestures are determined from the semantic meanings of the utterances being produced (Hartmann et al., 2006), and/or from given communication goals and situations (Bergmann and Kopp, 2010). The success of these systems also indicates the possibility of understanding gestures in the wild by learning language models that include simple gestural features.
To summarize, the works reviewed above have paved the way for studying gestures in a more "data-driven" style, that is, using data collected from more naturalistic contexts and more automatic methods for encoding gestures.

Multi-modal techniques in machine learning and NLP research
Recent advances in deep neural network-based machine learning techniques provide new methods for understanding the non-verbal components of human communication. Many existing works primarily focus on using multi-modal features as clues for a variety of inference tasks, including video content understanding and summarization (Li et al., 2020; Bertasius et al., 2021), as well as more specific ones such as predicting the shared attention among speakers (Fan et al., 2018) and semantic-aware action segmentation (Gavrilyuk et al., 2018; Xu et al., 2019). More recently, models that include multiple channels have been developed to characterize visually embedded and context-situated language use (Fan et al., 2021; Li et al., 2019, 2021; He et al., 2022). Another line of work focuses on the prediction task in the opposite direction, that is, predicting/generating gesture motion from audio and language data (Ginosar et al., 2019; Yoon et al., 2020; Alexanderson et al., 2020). In short, advances in representation learning have enabled researchers to study theoretical questions with the tools of multi-modal language models.

The theoretical basis of informative communication
To what degree do non-verbal actions contribute to informative communication? Besides the empirical works reviewed in section 2.1, the same question can also be explored from the perspective of abstract theories. Sandler (2018) draws evidence from sign languages to show that the actions of hands and other body parts reflect the compositional nature of linguistic components (their methods are further discussed in section 4.2). Their work reveals that the use of bodily articulators maps the way a verbal language originates and evolves. Although the spontaneous gestures of interest here differ from a strictly defined sign language, Sandler (2018)'s work suggests that more such shared properties between verbal and non-verbal languages may be found at a higher level of abstraction. Information Theory (Shannon, 1948) is the next lens that we use. Information Theory is broadly applied as the theoretical background for probabilistic models of language. It also provides philosophical explanations for a broad spectrum of linguistic phenomena. One interesting example is the assumption/principle of entropy rate constancy (ERC). Under this assumption, human communication in any form (written, spoken, etc.) should optimize the rate of information transmission by keeping the overall entropy rate constant.
In natural language, entropy refers to the predictability of words (tokens, syllables) estimated with probabilistic language models. Genzel and Charniak (2002, 2003) first formulated a method to examine ERC for written language by decomposing the entropy term into local and global components:

H(s | C, L) = H(s | L) − I(s; C | L),   (1)

in which s can be any symbol whose probability can be estimated, such as a word, punctuation mark, or sentence. C and L refer to the global and local contexts of s, among which C is purely conceptual and only L can be operationally defined. By ERC, the left-hand term in eq. (1) should remain invariant with respect to the position of s. This leads to the expectation that the first term on the right, H(s|L), should increase with the position of s, because the second term, I(s; C|L), i.e., the mutual information between s and its global context, should always decrease; this is confirmed in Genzel and Charniak (2003)'s work. Xu and Reitter (2016, 2018) also confirmed the pattern in spoken language, relating it to the success of task-oriented dialogues (Xu and Reitter, 2017).
The term H(s|L) can be estimated with various methods. Genzel and Charniak (2002, 2003) used the average negative log-probability of all n-grams in a sentence to estimate H(s|L), with the probabilities returned from an n-gram language model. More recent works have used transformer-based neural language models to examine ERC in dialogue (Giulianelli et al., 2021, 2022) and in broader data modalities with various operationalizations (Meister et al., 2021). The goal of this study is to extend the application scope of ERC to the non-verbal realm. More specifically, if the s in eq. (1) represents any symbol that carries information, for example, a gesture or pose, then the same increasing pattern should be observed within a sequence of gestures. ERC can be interpreted as a "rational" strategy for the information sender (speaker) because it places less predictable content (higher local entropy) at later positions within the message, which maximizes the likelihood that the receiver (listener) successfully decodes the information with the least effort. The question explored here is whether we "speak" rationally through gestures.
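To make the estimator concrete, the following is a minimal sketch of the n-gram approach: H(s|L) for a sentence is approximated as the mean negative log-probability of its tokens under a language model. The bigram model with add-one smoothing and the function names are our own simplification, not the exact setup of Genzel and Charniak.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over tokenized sentences (for add-one smoothing)."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def local_entropy(sent, unigrams, bigrams):
    """H(s|L) estimate: mean negative log2-probability of the sentence's tokens."""
    vocab_size = len(unigrams)
    toks = ["<s>"] + sent
    nll = 0.0
    for prev, cur in zip(toks, toks[1:]):
        # Add-one smoothed conditional probability P(cur | prev).
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        nll += -math.log2(p)
    return nll / len(sent)
```

Under this estimator, sentences made of frequent, predictable token transitions receive lower local entropy than sentences containing rare transitions, which is the quantity that ERC predicts should rise with position.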

Questions and Hypotheses
We examine two hypotheses in this study. Hypothesis 1: Incorporating non-verbal representations as input will improve the performance of language modeling tasks. To test Hypothesis 1, we extract non-verbal representations using the output of pose estimation, and then compose discrete tokens to represent the non-verbal information. The non-verbal tokens are inserted into word sequences to form a hybrid type of input data for training language models. The language models are modified to take non-verbal and verbal input sequences simultaneously and compute a fused internal representation. We expect the inclusion of non-verbal information to improve the performance of language models as measured by perplexity. Hypothesis 2: Non-verbal communication conforms to the principle of Entropy Rate Constancy. To test Hypothesis 2, we approximate the local entropy H(s|L) of non-verbal "tokens" using the perplexity scores obtained from neural sequential models, and correlate it with the utterances' relative positions within the monologue data. Finding that H(s|L) increases with utterance position would support the hypothesis.

Data collection and processing
The video data are collected from 4 YouTube channels, i.e., 4 distinct speakers. There are 1 female and 3 male speakers, and the spoken language is English. All videos are carefully selected under the criteria that each video must contain only one speaker, who faces the camera and whose hands are visible. The automatically generated captions in .vtt format are obtained for each video.
The pre-processing step extracts the full-body landmark points of the speaker, in preparation for the subsequent gesture representation step. For this task, we use the BlazePose (Bazarevsky et al., 2020) model, a lightweight convolutional neural network-based pose estimator provided in MediaPipe. It outputs the (x, y) coordinates of 33 pose landmarks that characterize the key points of the body pose, including {nose, left-eye, . . .} (see (Xu et al., 2020) for a full description). Each coordinate (x, y) is a pair of fractional values within [0, 1] describing the key point's relative position in a frame, with the origin at the upper-left corner. In fact, the pose estimator returns a 3-D coordinate (x, y, z) for each point, where the third dimension z is the depth. We discard the z component based on our observation that most speakers do not move their hands along that axis.

Encode gestures based on hands' positions
The next step is to obtain representations for gestures so that they can be studied with language models in a way similar to word embeddings. After extensively surveying previous studies on methods of encoding gestures, we decided to develop an encoding scheme that categorizes gestures into discrete tokens based on the positions of the hands, inspired by the work of Trujillo et al. (2019) and Sandler (2018). To briefly summarize their work: Trujillo et al. (2019) measure the vertical amplitude feature of the dominant right hand in relation to a participant's body (upper-left of fig. 1); Sandler (2018) uses the relative positions of the dominant and non-dominant hands between torso and face as evidence for the hierarchical organization of body language (lower-left of fig. 1).
The workflow of our method has three steps. The first two steps identify the focus area of the speaker's upper body, which is a square area whose side length approximately equals the height of the upper body. We arrived at this empirical setup based on the observation that this square area covers the vast majority of possible hand positions in our data. The third step encodes the gesture based on the relative positions of the hands within the focus area. 1) Compute the horizontal center of the body, x_center, by averaging the x coordinates of the nose, left & right shoulders, and left & right hips. 2) Find the vertical boundaries of the body area. First, compute the vertical distance between the nose and the mid-point of the two eyes, δ = |y_nose − y_eyes|. Then the top bound (forehead) is calculated as y_min = y_eyes − 2δ, according to common knowledge about the proportions of the human head (Artyfactory, 2022). The bottom bound y_max is the mean y coordinate of both hips, because the speakers are in a sitting pose and only their upper bodies are visible. Lastly, the size of the focus area is y_max − y_min. 3) Divide the focus area into 3 × 3 regions, i.e., nine regions with indices {1, 2, . . ., 9}. Index each hand with an integer based on the region it is in, and then encode the gesture into an integer using the combination of both hands' indices:

g = 9 × (L − 1) + R,   (2)

in which L and R are the region indices for the left and right hand, respectively. This formula maps any combination of (L, R) to a distinct integer g, which we call a gesture token.
As shown in the example of fig. 1, the speaker's left and right hands fall into regions 8 and 9, so the gesture label is <72>. Because there are 9 possible positions for each hand, the total number of gesture tokens is 9 × 9 = 81. For the convenience of the later modeling step, we use a single integer index (instead of a string connected by a hyphen) to denote each of these 81 gestures: <1>, <2>, ..., <81>. The pseudocode is presented in appendix A.1. A note on why we use an integer rather than a string: encoding (L, R) into an integer is not as straightforward to interpret as simply representing the gesture with a string such as "L-R" (left hand in region L, right hand in region R), but an integer index has the advantage that the gesture tokens can be supplied directly to the language models, just like word indices.
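The two encoding steps (region lookup and token combination) can be sketched as follows. The exact formula and region numbering are partially garbled in our source, so the mapping below is an assumption: it is chosen to be consistent with the decoding of token <79> discussed in the analysis section (left hand in region 9, right hand in region 7), and the region order (1 = top-left, 9 = bottom-right) is hypothetical.

```python
def region_index(x, y, x_min, y_min, size):
    """Map a hand coordinate onto the 3x3 grid over the square focus area.
    Region numbering (1 = top-left, 9 = bottom-right) is an assumption."""
    col = min(int(3 * (x - x_min) / size), 2)
    row = min(int(3 * (y - y_min) / size), 2)
    return 3 * row + col + 1

def encode_gesture(L, R):
    """Combine left/right hand region indices into one of 81 gesture tokens.
    The mapping g = 9*(L-1)+R is our reconstruction; it reproduces the paper's
    decoding of token <79> (left hand in region 9, right hand in region 7)."""
    return 9 * (L - 1) + R
```

Because the mapping is a bijection from the 81 possible (L, R) pairs onto {1, ..., 81}, each gesture token can be decoded back into the two hand regions.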

Prepare gesture sequences
Having encoded gestures, we prepare the gesture sequences using the time-stamped text transcript of each video. We use the automatically generated transcript in .vtt format, which contains the time stamps for each word (token) in the subtitle. See the following example:

<00:00:00.510><c>let's</c> <00:00:00.780><c>talk</c> <00:00:01.020><c>about</c>

in which each word is annotated by a pair of <c></c> tags, with its <START> time stamp prepended. We treat the start time of one word as the ending time of the previous word. In this example, the token talk elapses from 0.780 to 1.020 (seconds). Multiplying these time stamps by the frame rate of 24 (FPS) gives a frame range from the 19th to the 24th frame. Then, for each frame within this range, we extract a gesture token using the method described in section 4.2, resulting in a sequence of gesture tokens, {g_19, g_20, . . ., g_24}. This sequence represents the continuous change of gestures during the articulation of the word and, in most cases, consists of identical tokens. Thus, we select the majority token g_m within the sequence as the final representation.
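The timestamp-to-frame conversion and majority selection above can be sketched as follows; the function names are ours, and the 24 FPS frame rate is the one stated in the text.

```python
import math
from collections import Counter

FPS = 24  # frame rate assumed in the paper

def word_frames(start_sec, end_sec, fps=FPS):
    """Frame indices spanned by a word, from its start to the next word's start.
    E.g., 0.780s-1.020s at 24 FPS covers frames 19 through 24."""
    first = math.ceil(start_sec * fps)
    last = math.floor(end_sec * fps)
    return range(first, last + 1)

def majority_token(gesture_by_frame, frames):
    """Pick the most frequent per-frame gesture token within the word's span."""
    return Counter(gesture_by_frame[f] for f in frames).most_common(1)[0][0]
```

`gesture_by_frame` here stands for a mapping from frame index to the per-frame gesture token produced by the encoding step of section 4.2.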
Despite the down-sampling effect of majority sampling, there is still a large amount of repetition in the resulting gesture sequences, which could cause sparsity issues for the modeling tasks. For instance, in the first row of table 1, the gesture token is the same <49> for the first 6 tokens, which means that the speaker did not move his/her hands during that period of time. We deal with this issue by "compressing" the repeated gesture tokens. For the same example in table 1, merging the 6 repeats of <49> and the 2 repeats of <76> results in a compressed gesture sequence, {<49>, <76>}, which indicates that the speaker made two distinct gestures during the utterance. Throughout the rest of the paper, we call the original gesture sequence with repeats the raw sequence, and the one with repeats merged the compressed sequence. For each raw gesture sequence of length N, its compressed version {ĝ_1, ĝ_2, . . ., ĝ_N′} has length N′ ≤ N.
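The compression step is a run-length merge of consecutive repeats, which can be sketched in one line:

```python
from itertools import groupby

def compress(raw_seq):
    """Merge consecutive repeated gesture tokens: raw sequence -> compressed one."""
    return [tok for tok, _ in groupby(raw_seq)]
```

Note that only adjacent repeats are merged, so a gesture that recurs later in the utterance still appears again in the compressed sequence.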

Incorporate gesture inputs to LMs
We implement two neural network-based models for the language modeling tasks, using LSTM (Hochreiter and Schmidhuber, 1997) and Transformer (Vaswani et al., 2017) encoders. The models are tailored for handling two types of input: single-modal (words or gestures alone) and mixed-modal (words + gestures).

Single-modal LM task
The single-modal model takes as input a sequence of either word (w) or gesture (raw g or compressed ĝ) tokens and converts them to the embedding space. The token embeddings are then fed to the LSTM/Transformer encoder to compute a dense representation at each time step of the sequence. Finally, the dense representation at the current time step t is used to predict the token at the next time step t + 1 through a softmax output.
The model architecture is shown in fig. 2.
The learning objective is the same as in a typical sequential language modeling task, i.e., to minimize the negative log-probability:

NLL = − Σ_k log P(t_k | t_1, . . ., t_{k−1}),   (3)

in which t_1, . . ., t_{k−1} are the tokens (gesture or word) preceding t_k within the same utterance. We directly use this NLL value as the estimated local entropy, i.e., H(g|L) ≜ NLL, which is the target variable of our interest. Detailed model hyperparameters and training procedures are included in appendix A.2.
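The objective in eq. (3) can be sketched in a few lines of pure Python: at each step the encoder's output logits are normalized with a softmax, and the negative log-probabilities of the actual next tokens are summed. The logits here are hypothetical stand-ins for the LSTM/Transformer outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(step_logits, targets):
    """Eq. (3): sum of -log P(t_k | t_1..t_{k-1}), where each step's
    distribution is the softmax of the encoder's output logits."""
    return sum(-math.log(softmax(logits)[t])
               for logits, t in zip(step_logits, targets))
```

Dividing this sum by the sequence length gives the per-token cross-entropy whose exponential is the perplexity reported in the results section.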

Mixed-modal LM task
The mixed-modal model takes the word sequence S_w(u) = {w_i} and gesture sequence S_g(u) = {g_i} of the same utterance u simultaneously as input. The pair of sequences, S_w (words) and S_g (gestures), is fed into a modality fusion module, where the embedding representations of words and gestures at each time step, i.e., w_i and g_i, are fused by a sum, concat, or bilinear fusion component. Finally, the resulting mixed embeddings are encoded by the LSTM/Transformer encoder for the next-word prediction task. The purpose of this model is to verify Hypothesis 1, for which we expect the perplexity scores of a mixed-modal model to be lower than those of a single-modal one. It is also of interest to explore the optimal modality fusion method. The model's architecture is shown in fig. 2b. Detailed hyperparameters are presented in the appendix.
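The three fusion variants can be sketched over plain lists as below; in the actual models these would operate on embedding tensors, and the bilinear weight tensor W is a learned parameter (hypothetical here).

```python
def fuse_sum(w, g):
    """Element-wise sum of word and gesture embeddings (requires equal dims)."""
    return [a + b for a, b in zip(w, g)]

def fuse_concat(w, g):
    """Concatenation; the encoder's input dim becomes dim(w) + dim(g)."""
    return w + g

def fuse_bilinear(w, g, W):
    """Bilinear fusion: output_k = w^T W_k g for each slice W_k of the
    learned weight tensor W."""
    return [sum(w[i] * W_k[i][j] * g[j]
                for i in range(len(w)) for j in range(len(g)))
            for W_k in W]
```

The sum variant keeps the encoder input size unchanged, concat doubles it, and the bilinear variant adds a quadratic number of parameters, which is one possible reason it underperforms on this small dataset.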

Statistics
62 videos with a total length of 10 hours and 39 minutes are collected. The average length of a video is 723.7 seconds (SD = 438.1). The data and pre-processing scripts will be open-sourced. 17.9K lines of subtitles consisting of 121.5K words are collected. We extracted 81 distinct gesture tokens, whose total count is 121.5K in the raw sequence data (equal to the total number of words). In the compressed sequence data, the total number of gesture tokens is reduced to 26.12K.
The top five most frequent gesture tokens (according to the raw, uncompressed data) are <79>, <71>, <70>, <80> and <76>. Their frequency counts, proportions, and average entropy values are shown in table 2. It can be seen that <79> is the dominant gesture token, in which the speaker's right hand falls in region 7 and left hand in region 9. The entropy value increases as the frequency rank drops, roughly following Zipf's law (see the frequency vs. rank plots in fig. 3). Because Zipf's law is a common distribution for word tokens (Zipf, 2013; Piantadosi, 2014), this provides indirect evidence that gestures encode semantic information in a way similar to words. A detailed analysis of the gestures' positional and semantic meanings is provided in section 5.4.
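A quick way to check the Zipfian shape is the slope of log-frequency against log-rank; a slope near −1 indicates a Zipf-like distribution. The sketch below (our own helper, not from the paper's codebase) computes that slope by ordinary least squares.

```python
import math
from collections import Counter

def zipf_slope(tokens):
    """Slope of log-frequency vs. log-rank over a token list; a value near -1
    suggests the frequencies roughly follow Zipf's law."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

Applied to the raw gesture token stream, this gives a single-number summary of how closely the rank-frequency curve in fig. 3 tracks Zipf's law.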

Examining Hypothesis 1: Mixed vs. single modal comparison
The plots of validation cross-entropy loss against training epochs are shown in fig. 4. We use the prefixes s- and m- to indicate the single-modal and mixed-modal models, respectively; that is, s-models take pure word sequences as input, while m-models take word+gesture sequences as input.
It can be clearly seen that m-LSTM has a lower validation loss than s-LSTM, and the same trend holds between m-Transformer and s-Transformer. This supports Hypothesis 1: gestures indeed contain useful information that can improve a language model's performance.
Note that an exponential transformation of the cross-entropy loss (i.e., the NLL in eq. (3)) yields perplexity, which is more commonly used to evaluate the performance of language models. The Transformer-based models have overall lower perplexity than the LSTM-based ones, which is expected, as a Transformer encoder has more parameters to facilitate the sequence prediction task. Meanwhile, the validation loss of the Transformer models does not decrease as smoothly (see the less smooth curves in fig. 4b) as that of the LSTM models, which probably indicates some overfitting; this could be mitigated by collecting more training data.
We also compare three feature fusion methods, sum, concat, and bilinear, in training the m-LSTM/Transformer models; the corresponding validation losses are shown in fig. 5. It can be seen that sum and concat result in a significantly lower loss for m-LSTM, but the difference is less observable for m-Transformer, because in the latter the loss converges shortly after training starts.

Examining Hypothesis 2: Local entropy increases with utterance position
To examine Hypothesis 2, we plot the local entropy of each gesture sequence (raw and compressed, respectively) against the corresponding utterance's position in fig. 6, which shows a visible increasing trend. We also use linear models to verify the correlation between local entropy and utterance position, with local entropy as the dependent variable and utterance position as the predictor (no random effect is considered due to the limited data size). Utterance position is confirmed to be a significant predictor of local entropy with positive β coefficients. For raw gestures, the betas are smaller: β_LSTM = 1.6 × 10^-3 (p < .05), β_Trm = 2.3 × 10^-3 (p < .01); for compressed gestures: β_LSTM = 0.097, β_Trm = 0.093 (both p < .001).
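The linear-model check reduces to an ordinary least-squares slope of local entropy on utterance position; a minimal sketch (the paper presumably used a standard statistics package, which would also supply the p-values):

```python
def ols_slope(x, y):
    """Least-squares slope beta of y (local entropy) on x (utterance position)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den
```

A positive slope over the (position, entropy) pairs is the pattern that supports ERC.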
Therefore, the increase of local entropy is statistically significant, which supports our hypothesis.

Analysis of typical gestures
We examine the top five frequent gesture tokens <79>, <71>, <70>, <80> and <76>, and show selected screenshots in fig. 7 (see appendix A.3 for more examples). For <79>, <70> and <80>, both hands are at a mid-lower position in front of the body. Gesture <79> has the two hands evenly distant from the center, while <70> captures a movement to the right and <80> a movement to the left. Gesture <76> has the right hand at the same height as the speaker's neck and the left hand hanging down, a typical one-hand gesture in conversation. One technical detail is that in most screenshots of <76> the left hand is invisible, but the pose estimation algorithm can still infer its position with accuracy above 95% (see the report from MediaPipe), which is why such frames are included in our analysis. In general, the selected gestures represent commonly seen patterns in daily communication.
Based on the result from section 5.2 that including gesture features can improve the performance of language models, we conjecture that there could exist a correlation between gestures and certain semantic representations, i.e., a speaker may use a certain type of gesture to convey certain meanings. We examine this conjecture through the embedding vectors of word tokens that co-occur with three selected gestures: <70>, <71>, and <80>. The two other frequent gestures, <79> and <76>, are excluded from the analysis because <79> is overwhelmingly frequent, which could result in imbalanced samples across gestures, and <76> is scarcely distributed, which makes it difficult to find sentences solely containing it. Next, we pick sentences that contain exactly one distinct gesture, and obtain the corresponding sentence vectors from a pre-trained BERT model: the 768-d last hidden layer for each word is averaged to compute the mean sentence vector. We calculate the inner-group pairwise cosine distances for each gesture, and the outer-group pairwise distances between gestures. From the results shown in table 3, we can see that for gesture <70>, its inner-group distance (.291) is smaller than the outer-group ones (.298 and .298), for which t-tests yield p < .001. This suggests that its corresponding sentences are distributed in a semantic sub-space farther away from the others, and that <70> is probably a gesture that co-occurs with particular meanings. This needs to be further examined in future studies with more data.
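The inner- vs. outer-group comparison can be sketched as follows, with cosine distance over the mean sentence vectors; the function names are ours, and the vectors would come from the pre-trained BERT model described above.

```python
import math
from itertools import combinations, product

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def mean_inner_distance(vecs):
    """Mean pairwise cosine distance within one gesture's sentence vectors."""
    pairs = list(combinations(vecs, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

def mean_outer_distance(vecs_a, vecs_b):
    """Mean pairwise cosine distance between two gestures' sentence vectors."""
    pairs = list(product(vecs_a, vecs_b))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)
```

An inner-group mean below the outer-group means, as observed for <70>, is the signal that a gesture's co-occurring sentences cluster in a distinct part of the semantic space.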
In sum, we found preliminary positive evidence for associating gestures with distinct semantic meanings. However, the analysis above is limited in the following aspects. First, the data come from a limited population, which means the findings about gesture semantics may lack generality. Second, pre-trained embeddings are used instead of fine-tuned ones, which can result in an inaccurate description of the semantic space. We believe these limitations can be overcome in future studies.

Conclusions
Our main conclusions are two-fold. First, incorporating gestural features significantly improves the performance of language modeling tasks, even when gestures are represented with a simplistic method. Second, the way gestures are used as a complementary non-verbal communication side-channel follows the principle of entropy rate constancy (ERC) in Information Theory. This means that the information encoded in hand gestures, albeit subtle, is organized in a rational way that enhances the decoding/understanding of information from a receiver's perspective. To the best of our knowledge, this is the first work to extend the scope of ERC to non-verbal communication.
The conclusions are based on empirical results from multi-modal language models trained on monologue speech videos, with gesture information represented by discrete tokens. There are two possible explanations for the observed pattern of increasing entropy: first, rarer gestures (with higher entropy) occur near the later stage of communication; second, the entropy of the same gesture also increases over the course of communication. While the latter points to a more sophisticated and interesting theory of gesture usage, both explanations require further investigation.
This work is exploratory, but the evidence is promising, given that only a small dataset is used and a simplistic gesture representation method is applied. For future work, we plan to work with a larger and more diverse dataset with a higher variety of genres (public speech, etc.) and to examine more advanced representation methods, such as continuous embeddings and clustering. Another direction is to interpret the semantic meanings of gestures and other non-verbal features by examining their semantic distance from utterances in vector space. More specifically, non-parametric clustering algorithms can be useful for identifying distinct dynamic actions, which provides a different way to extract non-verbal representations.
The collected data are used only for studying general patterns in human non-verbal communication, but not for the identification of individual speakers nor for other commercial use.

A.3 Screenshots for frequent gestures
Some typical screenshots of the top 4 frequent gestures from all four speakers are shown in fig. 8. We can find similar appearances of the same gesture across different speakers.

Figure 1: The method for encoding gestures based on hand positions.

Figure 3: Frequency vs. rank plots of the gesture tokens.
Figure 2: Architecture of the LSTM/Transformer-based language models for handling single-(a) and mixed-modal (b) input sequences.

Figure 4: The cross-entropy loss on the validation set against training epochs for mixed-and single-modal models.

Figure 5: Validation loss against training epochs for comparing the three feature fusion methods in mixedmodal models: sum, concat, and bilinear.

Figure 6: Local entropy of gesture sequences increases with utterance position. Dots are actual data points. Lines are smoothed curves using generalized additive models (GAM). 95% bootstrap CIs presented.

Table 2: The frequency counts, proportions, and entropy values of the top five frequent gesture tokens.