Does Listener Gaze in Face-to-Face Interaction Follow the Entropy Rate Constancy Principle: An Empirical Study



Introduction
Human social interaction is intrinsically multimodal (Stivers and Sidnell, 2005). Face-to-face communication, as a multimodal process, includes verbal information as well as non-verbal cues such as gaze, head movements, and speech-accompanying manual gestures from both interlocutors. Previous studies have demonstrated that non-verbal behaviour is rich in communicative functions (Wagner et al., 2014; Holler and Levinson, 2019).
In this paper, we look at the gaze behaviour of listeners in video-recorded face-to-face interaction, specifically explanation dialogues in which a board game is explained by one interlocutor (the 'explainer') to another (the 'explainee'; Türk et al., 2023). We apply information-theoretic measures to the gaze behaviour of the explainees as well as the corresponding utterances of both interlocutors, and aim to answer the following questions: (i) How informative is listener gaze from an interactional perspective, and does it follow the 'entropy rate constancy' principle (ERC; Genzel and Charniak, 2002)? (ii) Is there a correlation between verbal information and listener gaze, in terms of local entropy, in dialogue (similar to recent findings on manual gesture in monologue; Xu et al., 2022)?
Related work

Communicative functions of gaze
In human interaction, gaze is an important and multifunctional non-verbal signal, with functions such as indicating attention, allocating space, and eliciting and monitoring feedback (Kendon, 1967; Duncan, 1975; Harness Goodwin and Goodwin, 1986). According to Brône et al. (2017), these functions fall under two important roles of gaze: a participation role and a regulation role. They also emphasise the importance of gaze for turn management. During dialogue interaction, listeners' continuous gaze towards the speaker signals attention and engagement (Kendon, 1967; Rossano, 2012).
On the technical side, the importance of gaze has been studied and exploited for recovering meaning in interactional dialogues (Alaçam et al., 2021), for facilitating grounding in interaction with embodied agents (Nakano et al., 2003), and in human-robot interaction (Skantze et al., 2014).

Entropy-rate constancy
Information theory (Shannon, 1948) is a mathematical framework that has proved useful for linguistic analysis, as it can explain aspects of language use. Genzel and Charniak (2002, 2003), for example, propose the 'Entropy Rate Constancy' (ERC) principle, which states that the entropy rate is constant over the length of a written text or other use of language.
The claim by Genzel and Charniak (2002) is as follows: Let H(Y_i | C_i, L_i) denote the conditional entropy of the word Y_i, where L_i = Y_{i-n+1}, ..., Y_{i-1} is the local n-gram context, and C_i = Y_0, Y_1, ..., Y_{i-1} is the global context, which contains all of the words preceding the word Y_i. The conditional entropy of the word Y_i can then be decomposed as:

H(Y_i | C_i, L_i) = H(Y_i | L_i) - I(Y_i; C_i | L_i)

The assumption of ERC is that H(Y_i | C_i, L_i) is constant. Given that the mutual information I(Y_i; C_i | L_i) between Y_i and its global context C_i (under its local context L_i) increases as the global context grows, the local entropy H(Y_i | L_i) would have to increase in order for H(Y_i | C_i, L_i) to remain constant.
A similar theory, the 'Uniform Information Density' (UID) hypothesis, states that speakers tend to distribute information uniformly throughout an utterance (Jaeger, 2010). In UID, the information of a linguistic unit u, defined by its surprisal, is its negative log probability S(u) = -log p_l(u), where p_l is the underlying probability distribution of u. Since the true probability distribution of u is not available, a language model with learned parameters is usually used to approximate the surprisal value of the corresponding linguistic unit (Smith and Levy, 2013; Goodkind and Bicknell, 2018; Wilcox et al., 2020). The linguistic unit u, as a larger sequence (e.g., an utterance or a text), can be further divided into a sequence of smaller units ⟨u_1, ..., u_N⟩, where u_t ∈ V and V is the vocabulary. The surprisal of the current linguistic unit u_t is then expressed as the conditional negative log probability given its previous context: S(u_t) = -log p_l(u_t | u_{<t}). According to the UID hypothesis, drastic variations in the per-unit information density of an utterance can place a heavy processing burden on the listener and thus make communication more difficult. Evenly distributed information density in speech, on the other hand, promotes 'rational' communication (Xu and Reitter, 2018).
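The per-unit surprisal computation can be sketched with a toy example. The bigram probabilities below are invented for illustration; in practice, a trained language model supplies p(u_t | u_{<t}):

```python
import math

# Toy bigram "language model": p(u_t | u_{t-1}). These probabilities are
# made up for illustration only.
BIGRAM_P = {
    ("<s>", "the"): 0.5,
    ("the", "game"): 0.25,
    ("game", "starts"): 0.125,
}

def surprisal(context: str, unit: str) -> float:
    """Surprisal in bits of `unit` given the (truncated) context."""
    return -math.log2(BIGRAM_P[(context, unit)])

tokens = ["<s>", "the", "game", "starts"]
surprisals = [surprisal(prev, cur) for prev, cur in zip(tokens, tokens[1:])]
print(surprisals)  # [1.0, 2.0, 3.0]
```

Rarer continuations yield higher surprisal; UID predicts that speakers avoid large jumps in this per-unit quantity.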

Hypothesis
In this paper, we investigate the hypothesis that, during face-to-face interaction, listener gaze, being an important non-verbal communication mechanism, also conforms to the ERC principle.

Data collection
The dialogue interactions on which this study is based are explanations of board games (Türk et al., 2023). The explainer, who is familiar with the game, explains it to the explainee, who is unfamiliar with it (see Figure 1). There are three reasons why we focused on task-oriented dialogue, namely the game explanation scenario, in our study: (i) Based on the theory of topic shifts (Ng and Bradac, 1993; Linell, 1998), we know that some interlocutors play a more active role, controlling the dialogue and introducing new topics, while others take a more passive role and follow these topic shifts. We therefore consider explanations common and representative of everyday dialogue (there is always a more dominant speaker, the explainer, and a more passive listener). Furthermore, we assume that the passive listeners rely more on non-verbal signals (e.g., gaze, facial expressions) than on verbal signals to give feedback to the dominant speakers (Buschmeier and Kopp, 2018), which potentially provides particularly rich gaze data for this study. (ii) By focusing on one type of dialogue (explanations) and additionally on one topic (a specific game), we assume that the dialogue contents are fairly static at the lexical and semantic level and can be considered invariant in our experiment (so that effects on the entropy rate of gaze are not driven by the domain). (iii) Based on this, we can, in principle, even look at the behaviour of individual speakers and whether their speech behaviour follows the ERC principle.
The interactions are divided into two phases: in the first phase, the game is explained without the game material being present. In the second phase, the game is put on the table and the two participants start playing (though the explanation usually continues at least while the game is being set up). The interaction can therefore be considered a task-oriented dialogue. All participants speak German. For this study, we use the videos of 58 interactions from the corpus and extract the explanation part (the first phase). The explanations vary in length from 2:12 to 17:36 min (mean length 7:04 min, standard deviation 3:15 min).

Preparing gaze sequences
For each dialogue, we extract the explainee's gaze information using the 'Openface' framework (Baltrusaitis et al., 2018) and create a 'gaze sequence' that is used in the neural sequential model (see Section 4.4). Openface generates two types of gaze features: (i) gaze vectors in world coordinates (three dimensions for the left and right eye each), and (ii) gaze direction values in radians, averaged over both eyes (two dimensions). We integrate both representations because gaze direction is easier to use but does not contain depth information, which is potentially relevant, as listeners regularly change their posture and head orientation during the interaction.
We cluster the two gaze features using the DBSCAN algorithm (Ester et al., 1996) to find the spatial distribution of gaze and identify its 'dense region' (Tran et al., 2020), both horizontally and vertically. This dense region typically represents the target at which explainees gaze most of the time during an interaction. After the dense region has been detected, we use a 3 × 3 grid-based labelling scheme (inspired by Xu et al., 2022) and label the gaze points inside the dense region with '5'. Depending on whether the gaze direction deviates horizontally or vertically from the dense region, eight other number-based labels are used for the gaze points outside it (see Table 1). A similar approach is used for the depth component selected from the eye gaze vectors: DBSCAN clustering is used to find the dense region in which the gaze of the left and right eye is located in the depth dimension. The eye gaze vector inside the dense region is again given the label '5'. Based on how close the left or right eye vector is to the dense region, eight different labels are used (see Table 2).
Given the label y_d ∈ [1, 9] for the gaze direction value and the label y_v ∈ [1, 9] for the gaze vector, a combined label y that represents the gaze information is generated as y = (y_d - 1) · 9 + y_v, with y ∈ [1, 81] (the set of possible eye gaze labels).
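A minimal sketch of the grid labelling and label combination described above. We assume the dense region found by DBSCAN is summarised as a bounding box in gaze space; the row-major cell numbering and the combination formula are reconstructions consistent with the label ranges given in the text, not necessarily the exact layout of Tables 1 and 2:

```python
def grid_label(x, y, x_min, x_max, y_min, y_max):
    """Map a 2D gaze point to a label in 1..9; the centre cell '5' is the dense region."""
    col = 0 if x < x_min else (2 if x > x_max else 1)
    row = 0 if y < y_min else (2 if y > y_max else 1)
    return row * 3 + col + 1  # row-major numbering, centre cell = 5

def combine(y_d, y_v):
    """Combine direction label y_d and vector label y_v into a label y in 1..81."""
    return (y_d - 1) * 9 + y_v

print(grid_label(0.0, 0.0, -0.1, 0.1, -0.1, 0.1))  # 5 (inside the dense region)
print(grid_label(0.3, -0.2, -0.1, 0.1, -0.1, 0.1))  # 3 (right of and below the region)
print(combine(5, 5))  # 41
```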

Automatic speech recognition for the interaction
Transcriptions of the dialogues were created automatically using 'Whisper' (Radford et al., 2022), which produces speech segments with a start and an end time. In order to calculate word timings, we approximated word onsets by taking the duration of each speech segment and dividing it by the segment's length in words (assuming, for this study, that all words have the same duration). Eye gaze labels are then aligned with words based on the video frame rate (50 fps). Figure 2 shows an example of this alignment. After pre-processing, we concatenate all of the utterance-aligned gaze sequences and use them as training and test data for the neural sequential model.
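The uniform word-timing approximation and the frame-based alignment can be sketched as follows. Segment times and gaze labels are invented for illustration:

```python
FPS = 50  # video frame rate

def word_spans(seg_start, seg_end, words):
    """Assign each word an equal share of the segment duration."""
    dur = (seg_end - seg_start) / len(words)
    return [(w, seg_start + i * dur, seg_start + (i + 1) * dur)
            for i, w in enumerate(words)]

def gaze_for_span(gaze_labels, start, end):
    """Gaze labels whose frame timestamps fall inside [start, end)."""
    return [g for frame, g in enumerate(gaze_labels)
            if start <= frame / FPS < end]

# A 0.2 s segment containing two words, and 10 frames of gaze labels.
spans = word_spans(0.0, 0.2, ["das", "Spiel"])
gaze = [5, 5, 5, 6, 6, 6, 6, 6, 6, 6]
print(spans)
print(gaze_for_span(gaze, *spans[0][1:]))  # [5, 5, 5, 6, 6]
```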

Processing gaze sequences
Analogous to the processing of the information density of linguistic units under the UID hypothesis (see Section 2.2), we consider eye movements as sequences of gaze labels and need to estimate their underlying probability distribution. We approach this by training an autoregressive model, more specifically a Transformer model (Vaswani et al., 2017), which we chose because of its strong psychometric predictive power compared to LSTM-RNN models (Wilcox et al., 2020). To compute the local entropy of a gaze sequence, we first calculate its negative log likelihood:

NLL = -(1/N) Σ_{i=1}^{N} log p(g_i | g_1, ..., g_{i-1})

where N is the maximum index of a given eye movement sequence and g_i ∈ [1, 81]. The local entropy H(g_1, g_2, ..., g_N) of the gaze sequence is then the exponential of the NLL (i.e., the perplexity). The learning task is thus to predict the next gaze label g_i based on the preceding sequence ⟨g_1, ..., g_{i-1}⟩ and to minimise the negative log probability NLL.
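The NLL-to-perplexity computation can be illustrated with a toy example. The per-step probabilities p(g_i | g_1, ..., g_{i-1}) below are made up; in our setup they would come from the trained Transformer:

```python
import math

# Invented per-step probabilities of the observed gaze labels given their
# preceding context, as an autoregressive model would output them.
step_probs = [0.5, 0.25, 0.5, 0.125]

# Length-normalised negative log likelihood of the sequence ...
nll = -sum(math.log(p) for p in step_probs) / len(step_probs)
# ... and its exponential, the perplexity (local entropy).
perplexity = math.exp(nll)
print(round(perplexity, 4))  # 3.3636
```

A uniform model over 81 gaze labels would give a perplexity of 81; lower values indicate that the gaze sequence is more predictable.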

Processing dialogue data
To compare the local entropy of the gaze sequences with the local entropy of the corresponding speech segments, we compute the latter using a pre-trained language model (specifically dbmdz/german-gpt2; Schweter, 2020). For the computation, we use whole dialogues, i.e., both explainer and explainee utterances, as both contribute to the dialogue: explainee utterances are conditioned on what the explainer has said before, and vice versa. Some of the utterances contain only backchannels (such as "yes", "uh-huh", or "okay"; Yngve, 1970), which play an important role in the dialogue as linguistic 'feedback' to previous utterances (Allwood et al., 1992; Clark, 1996). In our data, backchannels are usually produced by explainees, and we do not exclude them as semantically irrelevant words.

Results and discussion
Figure 3 shows the combined eye gaze sequences from all 58 videos. The x-axes represent the dialogue position of the speech segments (with each dialogue position corresponding to about 7 s of speech) and the y-axes represent the local entropy of the gaze sequences. Figure 3A shows a generally rising trend of the local entropy. Figure 3B replots the data without the effect of the entropy spikes visible in 3A (which we consider outlier values); it also yields a globally rising trend, but with a decreasing trend over the first 75 dialogue positions.
The 58 explanations vary in length (see Section 4.1), so plotting them together introduces some bias into the visualisation. We have therefore divided the dataset into four groups (based on the total length of the explanations) and plotted the local entropy of the gaze sequences separately in Figure 4. Besides the common rising trend in all four sub-plots (Figure 4A-D), they also share a decreasing tendency at the beginning, resulting in (roughly) convex shapes. One explanation for this phenomenon may be that, at the beginning of the interaction, explainees focus on the explanation of the game and direct their gaze (and attention) mainly to the explainer, thus signalling their participation role (Brône et al., 2017). As a result, there is little variation in the gaze labels. However, as the explanation progresses, the explainees' cognitive load may increase, a known cause of gaze-aversive behaviour (Morency et al., 2006; Doherty-Sneddon and Phelps, 2005). Another explanation for the increasing trend could be that explainees are likely to provide non-verbal feedback to explainers (e.g., head nodding or other head movements; Heylen, 2006), signalling understanding etc., which could also lead to changes in gaze behaviour. In any case, the result is a higher diversity of gaze labels and thus an increase in local entropy.
We chose a board game explanation so that the linguistic content of the dialogue is limited. This gives us a good opportunity to examine whether different individuals organise their speech in a 'rational' way (following the ERC principle; Xu and Reitter, 2018). According to Figure 5, it turns out that, given the same topic (the explanation of the board game), different explainers do organise their explanations in a rational way, as evidenced by the increasing trend of the local entropy. Moreover, this increasing trend of local entropy, for both the speech segments across dialogues and their corresponding gaze sequences, indicates a potential congruence between the information density of speech and the gaze behaviour of explainees/listeners.

Conclusions
This study attempts to find out (i) whether gaze, as an important non-verbal communication signal, follows the entropy rate constancy principle, and (ii) whether the information density of listeners' gaze behaviour correlates with that of the dialogue content. We recorded interaction videos, trained a Transformer model, and used a pre-trained language model to approximate the information density of listeners' gaze and of the dialogue content. We find that listeners' gaze roughly follows the ERC principle, which can be taken as further evidence that non-verbal communication generally follows the ERC principle (at least to some extent; Xu et al., 2022), although the result is inconclusive in that the information density of listeners' gaze fluctuates over the course of the interaction. A congruence between dialogue content and listeners' gaze can be roughly confirmed, as the local entropy of listeners' gaze and the local entropy of speech both show an increasing tendency. The fluctuation of the entropy rate (Figure 3A as well as Figure 4) indicates that the properties of non-verbal communication cannot simply be explained by the ERC principle.
As a next step, we plan to look more closely at listeners' eye gaze and further analyse whether sudden changes in the local entropy of gaze behaviour align with changes in the local entropy of speech, and if so, in what kinds of context. We also plan to perform further linguistic analysis of the entropy spikes, looking at dialogue acts and calculating the syntactic complexity of the corresponding dialogue positions. Possible future research includes simplifying or compressing the dialogue context based on changes in the entropy values of the listener's gaze.

Limitations
One limitation of the study is that the dialogues are not very diverse in content and activity.They are all explanations of a specific board game.
A second limitation is that, because we rely on video-based gaze information rather than dedicated eye-tracking hardware, the gaze data derived from Openface is sometimes sub-optimal: the camera setup was not optimised for gaze and face tracking, as the camera shots were over the shoulder and thus slightly elevated, rather than at eye level, so the data may contain noise. This compromise was necessary to meet multiple analysis objectives and also to avoid disrupting the naturalness of the conversational setting (e.g., by using wearable eye-trackers).
A third limitation concerns the alignment of utterances with gaze labels. Since German words can be very long (e.g., 'Tiefseeabenteuer', the name of the board game), the simplifying assumption that all words have the same duration and align with the same number of gaze labels will introduce some noise.

Ethics statement
The data collected for this study is for research purposes only; no commercial use is allowed. The participants recruited were mainly university students. Participants gave informed consent for their participation in the study and the use of their data, and were paid 10 euros per hour. The study was approved by the university's internal ethics and data protection review boards.
A Fragment of a game explanation

Figure 6 shows an example fragment of a game explanation. The timestamped speech segments were automatically recognised from the audio of the interaction videos using 'Whisper' (Radford et al., 2022). Whisper's word error rate (WER) for German is given as 4.5% (see https://github.com/openai/whisper/blob/main/README.md). The transcripts were not corrected.

B Hyperparameter selection for DBSCAN clustering
For the DBSCAN algorithm, we set up the criterion that a point can only be considered a core point of a cluster if it is surrounded by at least ten samples. A for-loop is used to find an optimal epsilon value, as required by the DBSCAN algorithm; the epsilon value is searched in the range [0.01, 0.1]. We wanted
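The epsilon sweep might be sketched as follows. For brevity we stand in for full DBSCAN with its core-point criterion only (a point with at least `min_samples` neighbours within `eps` is a core point), on synthetic 2D gaze directions; the selection rule (accept the smallest eps for which most points are core points) is our assumption, as the paper's exact stopping criterion is not spelled out:

```python
import math
import random

random.seed(0)
# Synthetic gaze directions: a tight dense region plus scattered aversions.
points = ([(random.gauss(0, 0.005), random.gauss(0, 0.005)) for _ in range(200)]
          + [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(20)])

def core_fraction(pts, eps, min_samples=10):
    """Fraction of points that are DBSCAN core points for the given eps."""
    n_core = 0
    for i, (xi, yi) in enumerate(pts):
        neighbours = sum(1 for j, (xj, yj) in enumerate(pts)
                         if i != j and math.hypot(xi - xj, yi - yj) <= eps)
        if neighbours >= min_samples:
            n_core += 1
    return n_core / len(pts)

best_eps = None
for step in range(1, 10):  # eps in 0.01, 0.02, ..., 0.09
    eps = step / 100
    if core_fraction(points, eps) >= 0.8:  # assumed acceptance threshold
        best_eps = eps
        break
print(best_eps)
```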

Figure 1: Scene from the data collection. The participant on the left (the explainer) explains the board game to the participant on the right (the explainee). The explainee's eye gaze is captured with a camera behind the explainer.

Figure 2: Example of the alignment of speech segments (grey) and gaze label sequences (white).

Figure 3: Change of the perplexity (local entropy) of eye gaze during the dialogues. A shows the general trend; B shows the normalised trend without the effect of the entropy spikes.

Figure 5: Change of the perplexity (local entropy) of the speech segments during the dialogue. The shaded area shows bootstrapped 95% confidence intervals.

Table 1: Label decision criteria for the direction value representation of eye gaze (DR: dense region).

Table 2: Label decision criteria for the vector representation of eye gaze (RE: right eye; LE: left eye).