Is Information Density Uniform in Task-Oriented Dialogues?

The Uniform Information Density principle states that speakers plan their utterances to reduce fluctuations in the density of the information transmitted. In this paper, we test whether, and within which contextual units, this principle holds in task-oriented dialogues. We show that there is evidence supporting the principle in written dialogues where participants play a cooperative reference game as well as in spoken dialogues involving instruction giving and following. Our study underlines the importance of identifying the relevant contextual components, showing that information content increases particularly within topically and referentially related contextual units.


Introduction
Due to production and perception errors, differences between individuals, and other sources of uncertainty, language use for information transmission can be thought to happen through a noisy channel. Effective and efficient information exchange under such conditions can be modelled using the tools of Information Theory (Shannon, 1948). Indeed, information-theoretic models have successfully accounted for surprisal in speech perception (Jelinek et al., 1975; Clayards et al., 2008), reading (Keller, 2004; Demberg and Keller, 2008; Levy et al., 2009), and sentence interpretation (Levy, 2008; Gibson et al., 2013), providing psycholinguistic evidence that the information content of linguistic signals is related to comprehension processing effort.
Speakers, too, are sensitive to the properties of the communication channel. They are thought to simultaneously minimise their own production effort and the addressee's processing effort (Clark and Wilkes-Gibbs, 1986; Clark and Schaefer, 1989). The most efficient way of dealing with both pressures, according to Information Theory, is to transmit information at a constant rate (Genzel and Charniak, 2002), making linguistic choices that reduce fluctuations in the density of the information transmitted. Evidence for the principle of uniform information density (UID; Jaeger and Levy, 2007; Jaeger, 2010) has been found at many levels of language production: speakers tend to reduce the duration of more predictable sounds (Aylett and Turk, 2004, 2006; Bell et al., 2003; Demberg et al., 2012); they tend to drop sentential material within more predictable scenarios (Jaeger and Levy, 2007; Jaeger, 2010; Frank and Jaeger, 2008); in spoken dialogue they are more likely to overlap at turn transitions when information density is low (Dethlefs et al., 2016); and the rate at which they transmit information in texts is uniform (Genzel and Charniak, 2002, 2003; Qian and Jaeger, 2011). Empirically, it is as yet unclear whether information density remains uniform throughout conversations (Vega and Ward, 2009; Doyle and Frank, 2015a,b; Xu and Reitter, 2018).
That information density is not always uniform in dialogue may be due to the complex structure of conversational context (Clark and Brennan, 1991), which not only includes previous utterances and world knowledge, but can also comprise preceding interactions between the interlocutors, their perceptual input, and their goals. This paper tests the UID principle in the previously unexplored setting of task-oriented dialogue, with its well-defined structural units and more constrained context than in open-domain dialogues. To estimate information density, we use a pre-trained Transformer-based language model, which provides more robust measurements than the n-gram models used in prior work. We study whether, and within which structural units, the UID principle holds, finding new evidence in support of it in certain structural units of both written cooperative reference games and spoken map navigation dialogues. Our study highlights the importance of identifying relevant contextual structures, showing that topically and referentially related contextual units correspond to more uniform information transmission profiles.

Measuring Information Content
To investigate whether information density is uniform throughout a discourse, each lexical choice can be modelled as a random variable $Y_i$ and its information density estimated as the Shannon information content $H(Y_i)$. For the UID principle to hold, the amount of information transmitted with every new word, $H(Y_i)$, must remain constant. We express this quantity as

$$H(Y_i) = H(X_i \mid C_i, L_i) \qquad (1)$$

where $C_i$ is the entire relevant context and $L_i$ is the local context, both influencing lexical choice $X_i$. Typically, $H(X_i \mid C_i, L_i)$ is not estimated directly. The term is further decomposed into $H(X_i \mid L_i)$, the information content of $X_i$ given the local context, and $I(X_i; C_i \mid L_i)$, the locally conditioned mutual information between $X_i$ and the entire relevant context:

$$H(X_i \mid C_i, L_i) = H(X_i \mid L_i) - I(X_i; C_i \mid L_i)$$

As the relevant context is built up, $I(X_i; C_i \mid L_i)$ is assumed to increase: next word prediction becomes easier when more contextual cues are available (Genzel and Charniak, 2002). So for the UID principle to hold, i.e., for $H(X_i \mid C_i, L_i)$ to remain constant in Eq. 1, the locally conditioned information content $H(X_i \mid L_i)$ must increase, too, as relevant context accumulates.
The local context of a word choice is typically taken to be the utterance or sentence, and these are also considered as the units of information transmission (Genzel and Charniak, 2002, 2003; Doyle and Frank, 2015a,b; Qian and Jaeger, 2011; Xu and Reitter, 2018). The information content of an utterance $S$ is computed by averaging over the negative logarithms of all locally contextualised word probabilities:

$$H(S) = -\frac{1}{|S|} \sum_{w_i \in S} \log P(w_i \mid w_1, \ldots, w_{i-1}) \qquad (2)$$

To remove the confounding effect of utterance length on information content (Keller, 2004), we use Xu and Reitter's (2018) normalised metric of utterance information content:

$$H_n(S) = \frac{H(S)}{\frac{1}{|L(n)|} \sum_{X \in L(n)} H(X)} \qquad (3)$$

where $L(n)$ is the set of all utterances of length $n$ and $X \in L(n)$; for simplicity, we leave out the conditioning variable.
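As a minimal sketch of the two quantities just defined, the Python snippet below computes the average negative log probability of an utterance and its length-normalised version. It assumes that per-token probabilities have already been obtained from some language model; the function names and the toy probabilities are hypothetical, and the logarithm base (2 here) is an assumption that cancels out in the normalised metric.

```python
import math
from collections import defaultdict


def utterance_information(token_probs):
    """Average negative log2 probability of the tokens in one utterance."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


def normalised_information(utterances):
    """Divide each utterance's information content by the mean information
    content of all utterances that have the same length."""
    raw = [utterance_information(u) for u in utterances]
    by_length = defaultdict(list)
    for u, h in zip(utterances, raw):
        by_length[len(u)].append(h)
    mean_by_length = {n: sum(hs) / len(hs) for n, hs in by_length.items()}
    return [h / mean_by_length[len(u)] for u, h in zip(utterances, raw)]


# Toy per-token probabilities (hypothetical, not taken from the corpora).
utterances = [[0.5, 0.25], [0.125, 0.125], [0.5]]
raw_scores = [utterance_information(u) for u in utterances]
scores = normalised_information(utterances)
```

With these toy values, the two utterances of length 2 are scaled by their shared mean, while the single utterance of length 1 is its own reference group and therefore receives a normalised score of 1.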

Data and Hypotheses
The UID principle is assumed to hold within a structural unit that determines the type and size of the overall relevant context $C_i$ as used in Eq. 1. Genzel and Charniak (2002, 2003) show that, in texts, part of the relevant context is lexical (writers tend to reuse words that have already appeared in the discourse) and topically determined, as given by the paragraph structure of texts. In dialogue, defining a topically relevant contextual unit is not straightforward. Xu and Reitter (2018) use a topic segmentation algorithm to identify relevant units in unconstrained dialogues and show that information density is influenced by topic shift. Here we exploit the inherent (task-related) structure of task-oriented dialogues to test the UID principle within contextual units of different type and size. We analyse two corpora of task-oriented English dialogues: MapTask (MT, Anderson et al., 1991) and PhotoBook (PB).
MT contains 128 transcribed spoken dialogues, each consisting of an instruction giver directing an instruction follower to navigate to a point on a map. The participants cannot see each other's maps, and their respective maps may contain slightly different landmarks. We consider two types of contextual unit: a) the overall dialogue: a series of landmarks are described in succession to help the instruction follower draw a path towards a goal location; b) a dialogue transaction: a dialogue excerpt related to reaching a certain landmark, manually annotated as part of the corpus. For both types of contextual unit, we also construct versions where we use the MT dialogue act annotation to filter out turns exclusively consisting of backchannels and other grounding acts ('okay', 'mmhmm') common in spoken language. This results in contextual units that focus on information-transmission dialogue acts and are more referentially coherent.
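The construction of MT contextual units can be sketched as follows. This is an illustrative sketch only: the turn records, the field names, and the set of dialogue acts treated as grounding acts are assumptions made for the example, not the corpus format or the exact filter used in the study.

```python
# Assumption: these two act labels stand in for "backchannels and other
# grounding acts"; the exact set used in the study is not specified here.
GROUNDING_ACTS = frozenset({"acknowledge", "ready"})


def turn_positions(turns, unit_key, drop_acts=frozenset()):
    """Return (text, position) pairs, with 1-based positions counted from the
    start of each contextual unit. Turns that consist exclusively of acts in
    `drop_acts` are removed before positions are counted."""
    counters = {}
    result = []
    for turn in turns:
        if turn["acts"] and set(turn["acts"]) <= drop_acts:
            continue
        unit = unit_key(turn)
        counters[unit] = counters.get(unit, 0) + 1
        result.append((turn["text"], counters[unit]))
    return result


# Hypothetical turns from one dialogue with two transactions.
turns = [
    {"dialogue": "d1", "transaction": 1, "acts": ["instruct"], "text": "go left of the rapids"},
    {"dialogue": "d1", "transaction": 1, "acts": ["acknowledge"], "text": "okay"},
    {"dialogue": "d1", "transaction": 1, "acts": ["instruct"], "text": "then down to the rope bridge"},
    {"dialogue": "d1", "transaction": 2, "acts": ["acknowledge"], "text": "mmhmm"},
    {"dialogue": "d1", "transaction": 2, "acts": ["instruct"], "text": "now head for the gold mine"},
]

# Unit a) whole dialogue, all dialogue acts.
by_dialogue = turn_positions(turns, lambda t: t["dialogue"])
# Unit b) transaction, information-transmission acts only.
by_transaction = turn_positions(
    turns, lambda t: (t["dialogue"], t["transaction"]), GROUNDING_ACTS
)
```

Filtering before counting means that positions in the filtered units refer only to information-transmission turns, so removing a backchannel shifts the position of every later turn in the same unit.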
PB contains 2,500 dialogues where two participants without specified roles communicate via written chat. Each dialogue consists of 5 rounds: in each round, each participant sees a set of photographs which partially overlap with the set of images seen by their dialogue partner. The goal is to find out which images they have in common.
The images available to each participant change in each round, but a subset reappears, thus triggering subsequent references to previously described photographs. This task design allows us to investigate the following types of contextual unit: a) the overall dialogue: throughout a game, all the photographs are about a certain domain (e.g., food or dogs); b) a dialogue round: different images are described in succession as participants try to figure out which ones they share in a given round; c) an image reference chain: the (non-adjacent) utterances that refer to a certain image across rounds (we use the automatic annotation of referring utterance chains by Takmaz et al., 2020).
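To make the notion of a reference chain concrete, the sketch below groups non-adjacent referring utterances by their target image and assigns each a position in its chain. The image identifiers, round numbers, and the 'dog' utterances are hypothetical; in the study the chains come from the automatic annotation by Takmaz et al. (2020), which this snippet does not reproduce.

```python
from collections import defaultdict


def build_reference_chains(referring_utterances):
    """Group (round, image_id, text) triples by target image, ordered by round.
    Each chain entry is (position_in_chain, round, text)."""
    chains = defaultdict(list)
    # sorted() is stable, so utterances from the same round keep their order.
    for round_no, image_id, text in sorted(referring_utterances, key=lambda r: r[0]):
        chains[image_id].append((len(chains[image_id]) + 1, round_no, text))
    return dict(chains)


# Hypothetical referring utterances from one five-round game.
referring = [
    (1, "img_pizza", "Man eating slice of pizza"),
    (1, "img_dog", "dog on a sofa"),
    (2, "img_pizza", "last one for me is guy with pizza"),
    (3, "img_dog", "sofa dog"),
    (3, "img_pizza", "pizza eater"),
    (5, "img_pizza", "pizza"),
]

chains = build_reference_chains(referring)
pizza_positions = [pos for pos, _, _ in chains["img_pizza"]]
pizza_lengths = [len(text.split()) for _, _, text in chains["img_pizza"]]
```

Positions are counted within the chain rather than within the round, so an image that is skipped in some rounds still receives consecutive chain positions.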
We hypothesise that, in MT, the UID principle will be more visible at the transaction level, where the context is more topically coherent, than at the dialogue level, where a dozen different landmarks are brought up in succession, in particular when only information-transmission dialogue acts are taken into account. In PB, we expect the strongest effect to be present at the level of reference chains. Chains are determined both topically, by the target image, and lexically, by the conceptual pacts established in previous mentions of a target (Brennan and Clark, 1996). In rounds and dialogues, where several different images are described, topic and lexical choices are constrained by the image domain but the vocabulary used in previous turns is more varied. We thus expect the effect to be less pronounced at these two levels.

Modelling
To estimate the information content of an utterance, we compute the log probabilities in Eq. 2 using GPT-2 (Radford et al., 2019), a pre-trained Transformer language model, which allows us to obtain more accurate probability estimates than n-gram models. We rely on HuggingFace's implementation of GPT-2 with default tokenizers and default parameters (Wolf et al., 2020). As GPT-2 was pre-trained mainly on written text, it is less tuned to the idiosyncrasies of dialogue data. We therefore finetune it separately on a 70% split of each target corpus. As shown in Table 1, finetuning yields a substantial reduction in the model's perplexity. More information on model parameters and the finetuning procedure can be found in Appendix B.
Table 1: Word-level perplexity of the GPT-2 models on 30% held-out portions of the corpora.
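For readers unfamiliar with the measure, a word-level perplexity can be sketched as below. The snippet assumes natural-log token probabilities are already available from a language model, and it assumes that word-level normalisation means dividing the summed subword negative log-likelihood by the number of words; how the normalisation was actually carried out is not stated in this text, and the function name is hypothetical.

```python
import math


def word_level_perplexity(utterances):
    """Perplexity per word for a list of (token_log_probs, n_words) pairs,
    where token_log_probs are natural-log probabilities of (sub)word tokens."""
    total_nll = -sum(sum(log_probs) for log_probs, _ in utterances)
    total_words = sum(n_words for _, n_words in utterances)
    return math.exp(total_nll / total_words)


# Toy example: one utterance of 2 words split into 4 subword tokens,
# each token with probability 0.5.
toy = [([math.log(0.5)] * 4, 2)]
ppl = word_level_perplexity(toy)
```

In the toy case each word costs two tokens of one bit each, so the per-word perplexity is 4 even though the per-token perplexity would be 2.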
We use the finetuned language models to estimate the information content (Eq. 3) of the 30% held-out portion of each corpus, and count turn positions (i.e., the positions of utterances within a dialogue or a smaller structural unit) from the beginning of the relevant structural unit. 6 Following Xu and Reitter (2018), to test whether utterance information remains uniform we fit a linear mixed-effect model using the logarithm of information content as response variable and the logarithm of turn position as predictor. We include a random slope for the turn position and a random intercept term grouped by distinct dialogues, which allows us to model variation among individual speakers as a function of their addressee.
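The fixed-effect part of this analysis can be illustrated with a deliberately simplified stand-in: an ordinary least-squares regression of log information content on log turn position. This is not the linear mixed-effect model used in the study (it has no random intercepts or slopes and provides no significance test); it only shows what a positive coefficient in log-log space means. The function name and the synthetic data are assumptions.

```python
import math


def loglog_slope(positions, info):
    """Ordinary least-squares fit of log(info) = intercept + beta * log(position).
    Simplification: pooled data, no random effects."""
    xs = [math.log(p) for p in positions]
    ys = [math.log(h) for h in info]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    return beta, intercept


# Synthetic data generated from info = 0.9 * position ** 0.03,
# i.e. a true log-log slope of 0.03.
positions = [1, 2, 3, 4, 5, 8, 13]
info = [0.9 * p ** 0.03 for p in positions]
beta, intercept = loglog_slope(positions, info)
```

A coefficient of this size corresponds to information content growing roughly as the 0.03 power of turn position, which is the order of magnitude of the effects reported in the results.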
We adopt Genzel and Charniak's assumption that the mutual information $I(X_i; C_i \mid L_i)$ between an utterance and its context increases with turn position (Genzel and Charniak, 2002; see Section 2); so for $H(S_i \mid C_i, L_i)$ to remain stable, utterance information content $H(S_i \mid L_i)$, too, must increase. Consequently, we consider the UID principle to hold when turn position has a significant positive effect on information content.
Validation. To validate our estimates of utterance information content, we replicate Genzel and Charniak's (2002, 2003) and Keller's (2004) studies on the Wall Street Journal articles of the Penn Treebank (Mitchell et al., 1999) 7 using GPT-2 finetuned on this corpus (see Table 1). In the original studies, the authors measure the correlation between the position of sentences within newspaper articles, as well as within paragraphs, and the sentence information content, as measured using n-gram language models. As mentioned above, these studies assume that $I(X_i; C_i \mid L_i)$ increases as discourse context is built up, and test whether the locally conditioned information content $H(X_i \mid L_i)$, too, increases throughout articles and paragraphs.
In our validation study, we take both entire articles and paragraphs as structural units and count sentence positions from the beginning of the relevant unit. Our linear mixed-effect models show a significant positive effect of sentence position on information content both within articles (β = 1.65e−2, p < 0.001) and within paragraphs (β = 1.53e−2, p < 0.01). To reproduce the original experimental setting, we further train an n-gram language model with interpolated Kneser-Ney smoothing using Keller's (2004) data split and select the configuration with the lowest perplexity on the test set, a 3-gram model with a discount value of 0.8. In line with previous work, we find a positive Kendall's rank-correlation 8 between sentence position and information, as measured with the n-gram model as well as with the Transformer model. The original results are therefore replicated. 9
6 All dialogues, annotated with information content estimates, are provided in the supplementary material. Excerpts can be found in Appendix A.
7 https://catalog.ldc.upenn.edu/LDC99T42; WSJ part of the corpus (sections 0-24).

Results
We test whether the UID principle holds in MT and PB using the procedure presented in Section 4. The full results of our statistical analysis can be found in Appendix E (Tables 6 and 7). Recall that for the principle to hold, the locally conditioned information content $H(X_i \mid L_i)$ must increase with the position of $X_i$ in the relevant context unit $C_i$. The local context $L_i$ is defined as a dialogue turn.

MapTask
When we take entire MT dialogues as the contextual unit, we do not find a positive effect of turn position on information content, regardless of whether we focus on information-transmission dialogue acts (see Figure 1a for the results with all dialogue acts). In contrast, the types of dialogue act considered affect our results on transactions. We fail to find an effect in transactions with backchannels but the linear mixed-effect models show a positive effect of turn position within transactions without backchannels (β = 2.38e−2, p < 0.001). We attribute these findings to the nature of the task. Over the course of a dialogue, speakers traverse a map naming different landscape features and therefore are unable to establish more than a minimal level of linguistic routine at the dialogue level. Transactions, on the other hand, correspond to more referentially constrained subtasks; this becomes more evident when information-transmission dialogue acts are isolated from transmission-coordination acts. Analysing the instruction giver and follower information-transmission turns independently reveals that there is no significant effect for instruction followers; the overall positive effect is driven by the instruction givers (β = 3.46e−2, p < 0.001; see Figure 1b). This reflects the asymmetric nature of information transmission in MT dialogues.
8 Our data consist of multiple measurements for each sentence position (one for each document), thus causing a large number of ties (i.e., multiple entries with the same sentence position but different entropy estimates). We choose Kendall's test for all our experiments because it deals with ties better than other correlation tests such as Spearman's or Pearson's.
9 A detailed description of the experimental setup and the full results can be found in Appendix D.

PhotoBook
The effect of position on information content is positive within the PB dialogues (β = 3.13e−2, p < 0.001); Figure 1c shows a consistently increasing sawtooth pattern for information content, providing evidence that participants optimise their information-transmission strategy throughout PB games. Information content slightly decreases within game rounds (β = −0.74e−2, p < 0.005), yet this effect is mainly due to the higher estimates obtained for the first turns of these contextual units (see Figure 3e in Appendix E), often used by participants to coordinate on how to start the new round. Because multiple images are discussed in a round, this contextual unit seems neither to capture the relevant context of individual dialogue turns nor to be large enough to display the participants' overall information transmission strategy that we observe at the dialogue level.
Finally, as hypothesised, the effect of position on information content is positive at the reference chain level (β = 1.27e−2, p < 0.001). As participants re-refer to an image over the game, they increase the density of their messages (as shown in Figure 1d) and also decrease message length (Kendall's correlation between position in chain and length is τ = −0.268, p < 0.001). Thus, as reference chains unfold, the reduction process observed by Takmaz et al. (2020) is complemented by information compression. The relatively low magnitude of the fixed effect as well as that of the correlation between utterance length and chain position, however, suggest that the process we see at play is not only one of compression and reduction. Figure 1d indeed shows that the fourth position in a chain often comes with a decrease in information content, perhaps indicating that once a conceptual pact has been established between interlocutors, referential expressions can be significantly simplified without losing referential power, as in the following reference chain (information content estimates in parentheses):
1. 'Man eating slice of pizza' (0.69)
2. 'last one for me is guy with pizza' (0.78)
3. 'pizza eater' (0.91)
4. 'pizza' (0.67)

Conclusion
We investigated to what extent the principle of uniform information density holds in two corpora of English task-oriented dialogues. We have related the properties of task-determined contextual units to patterns of information transmission and have hypothesised that the UID principle holds to a stronger degree in more topically coherent and reference-specific contextual units. Our hypotheses are confirmed in PhotoBook, where we find evidence that dialogue participants use rational strategies of information transmission over an entire dialogue. We do not observe uniformity of information in the MapTask dialogues and transactions as a whole, similarly to other negative results in interactive settings (e.g., Vega and Ward, 2009; Doyle and Frank, 2015b). Yet the effect is present within MapTask transactions when we restrict our analysis to information-transmission dialogue acts: these make for a more topically and referentially coherent contextual unit. Indeed, the organisation of context can be complex in dialogues. We have shown that theoretically motivated contextual units such as reference chains in PhotoBook and information-transmission acts in MapTask transactions are good candidates to characterise the relevant context over which participants deploy strategies of information compression.
We are aware that the assumptions used to test the UID principle, which we have adopted from Genzel and Charniak's seminal work (2002), i.e., that context informativeness increases as strongly as sentence entropy as discourse is built up, can be controversial. Nevertheless, in this paper we have followed this line of reasoning, used in previous work (Genzel and Charniak, 2003; Vega and Ward, 2009; Qian and Jaeger, 2011; Doyle and Frank, 2015b; Xu and Reitter, 2018), and applied it to novel data and contextual units. In Giulianelli and Fernández (2021), we go one step further and empirically test these assumptions for the first time, using direct estimates of the contextualised entropy $H(S_i \mid C_i, L_i)$ of an utterance and thus of the informativity of its linguistic context $I(S_i; C_i, L_i)$.
The study presented in this paper provides new empirical evidence on language production in dialogue which we believe can directly inform the development of natural language generation models. Our findings suggest that models that take relevant contextual units into account (Takmaz et al., 2020; Hawkins et al., 2020) are better suited for reproducing human patterns of information transmission, and confirm that the use of training objectives that enforce a uniform organisation of information density (Meister et al., 2020; Wei et al., 2021) is a promising avenue for training language models.

Appendix A Dialogue Excerpts
Tables 2 and 3 show excerpts of MapTask and PhotoBook dialogues. The dialogues are annotated with turn positions (within different contextual units), speaker identifier, and information content estimates. The speaker identifiers in MapTask refer to the speaker roles of instruction givers (G) and followers (F).

B Transformer Language Models
We experiment with GPT-2 (Radford et al., 2019), an autoregressive Transformer-based (Vaswani et al., 2017) language model, relying on HuggingFace's implementation with default tokenizers and default parameters (Wolf et al., 2020). 10 The maximum sequence length is set equal to the maximum utterance length in the corpus: 320 for Penn Treebank, 150 for MapTask, and 40 for PhotoBook. As the pre-trained model yields high perplexity on the dialogue corpora (Table 1), we finetune it on 70% of each target corpus and leave out 30% of the dataset to compute the model's evaluation perplexity and to conduct our statistical analysis. The training and held-out portions of PhotoBook consist of games 0-1751 and 1752-2501 respectively; the training and held-out sets of MapTask comprise dialogues q1ec1-q6nc2 and q6nc3-q8nc8. One version of GPT-2 is finetuned for 30 epochs on PhotoBook dialogues with a learning rate of 5e−05 and batches of size 64; a second version is finetuned for 60 epochs on MapTask dialogues with a learning rate of 1e−05 and batches of size 16; the last version is finetuned for 30 epochs on Penn Treebank articles with a learning rate of 5e−05 and batches of size 8. The other finetuning parameters are set to their default values. Utterance beginning and end are used as context cues but their information content is not computed. Furthermore, for the dialogue corpora, we try prepending input utterances with dialogue turn cues ("A: ", "B: ") as a hint to the language models that the data is conversational; the information content of these speaker-identifying tokens is never computed. This modification of the input text does not consistently reduce the models' perplexity scores. The perplexity of the pre-trained and finetuned models on the target corpora is reported in the main paper.
10 The pre-trained model is named gpt2 in HuggingFace.
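The PhotoBook split and the speaker cues described above can be sketched in a few lines. The snippet assumes that game identifiers are the contiguous integers 0 to 2501, which is what the stated ranges suggest but is not confirmed here; the constant and helper names are hypothetical.

```python
# Assumption: PhotoBook game identifiers are the integers 0..2501.
game_ids = range(0, 2502)
LAST_TRAINING_GAME = 1751

train_ids = [g for g in game_ids if g <= LAST_TRAINING_GAME]
held_out_ids = [g for g in game_ids if g > LAST_TRAINING_GAME]
train_share = len(train_ids) / len(game_ids)


def with_speaker_cue(speaker, utterance):
    """Prepend a dialogue turn cue such as 'A: ' to an input utterance.
    The cue tokens are context only; their information content is not scored."""
    return f"{speaker}: {utterance}"


cued = with_speaker_cue("A", "kk done")
```

Under the contiguity assumption the split contains 1,752 training games and 750 held-out games, which matches the stated 70%/30% proportions to within rounding.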

C Effects of Finetuning
The following are the main effects of finetuning GPT-2 on MapTask dialogues:
• GPT-2 finetuned on MapTask assigns lower perplexity to disfluencies. While the pre-trained model assigns high information content to utterances that contain disfluencies, this is not the case for the finetuned model.
• Backchannels also become less surprising with finetuning: the information content of, e.g., okay, mmhmm, well, right, erm, yeah, no, aye decreases by 25% to 75%.
• With finetuning, GPT-2 doesn't only get used to features of transcribed speech: expressions that refer to MapTask landmarks also become more likely (e.g., the rapids, a rope bridge, the gold mine).
• Simple spatial indications (towards the bottom left-hand corner, on the left-hand side) are among the utterances with the lowest surprisal.
These are the main effects of finetuning GPT-2 on PhotoBook dialogues:
• Among the most surprising utterances for the pre-trained model are some that are specific to PhotoBook games: submit bye, loading may be frozen. For these two utterances, e.g., surprisal decreases by 1/4 and 1/3 respectively after finetuning.
• Written chat language becomes less surprising: e.g., the surprisal for kk done decreases by one third.

D Penn Treebank Replication Study
                         Cut-off = 25   Cut-off = 76   Cut-off = ∞
Raw data                 τ              τ              τ
3-gram (Keller, 2004)    0.078**        0.093**        0.081**
3-gram (ours)            0.082**        0.087**        0.087**
GPT-2 pre-trained        0.034**        0.054**        0.054**
GPT-2 finetuned          0.077**        0.084**        0.084**
Binned data              τ              τ              τ
3-gram (Keller, 2004)    0.671**        0.147          0.170**
3-gram (ours)            0.740**        0.099          0.097
GPT-2 pre-trained        0.453*         0.448**        0.101
GPT-2 finetuned          0.680**        0.347**        0.104
Table 5: Kendall's rank-correlation between sentence information and sentence position, with sentence length partialled out, for the Penn Treebank test set. Significance: '**' p < 0.001, '*' p < 0.01, '' p ≥ 0.05.
[...] the data into multiple test sets (Genzel and Charniak, 2002, 2003) as this was shown not to alter the sentence information estimates (Keller, 2004; Xu and Reitter, 2018). The best language model is the 3-gram model with a discount value of 0.8, which achieves a perplexity of 335.80 on the test set. The perplexity obtained using NLTK's evaluation script is 221.57 (Figure 2) as it is calculated by taking into account beginning and end of sentence symbols. We use the n-gram language model as well as the GPT-2 language model (as described in Section B) to estimate the information content of all sentences in the test set and measure the correlation with sentence position. In Genzel and Charniak's (2002) original work, the correlation between sentence position and sentence information is computed by binning the sentence information data points based on their sentence position. Correlation is measured between sentence position indices 1-25 and the average sentence information estimated for the respective sentence position. Keller (2004) also measures the raw correlation between all sentence position-information pairs, without binning. Neither work reports the correlation measure used.
We use Kendall's rank-correlation as it is less sensitive than Spearman's rank-correlation to the large number of ties (position-information pairs with the same position index) in our data. Moreover, whereas Genzel and Charniak (2002) select a single sentence position cut-off (c = 25), in Keller's (2004) study three variants of the cut-off are used (c = 25, c = 76, and no cut-off). We also compute correlation at these three levels. Finally, following Keller (2004), we compute the partial correlation between sentence position and sentence information, excluding the effect of sentence length. The results are reported in Tables 4 and 5.
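The raw and binned correlations described in this appendix can be sketched as follows. The snippet implements the tie-adjusted tau-b variant of Kendall's coefficient, which is an assumption (the variant is not named here), and it omits the partialling-out of sentence length. Function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where Tx
    and Ty count pairs tied only in x and only in y. O(n^2) pair comparison."""
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx, dy = xs[i] - xs[j], ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from all counts
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denom if denom else float("nan")


def binned_means(positions, info, cutoff=None):
    """Average information per sentence position, optionally up to a cut-off."""
    bins = defaultdict(list)
    for p, h in zip(positions, info):
        if cutoff is None or p <= cutoff:
            bins[p].append(h)
    kept = sorted(bins)
    return kept, [sum(bins[p]) / len(bins[p]) for p in kept]


# Toy data: two documents, three sentence positions each (many position ties).
positions = [1, 1, 2, 2, 3, 3]
info = [1.0, 1.2, 1.1, 1.3, 1.4, 1.2]

raw_tau = kendall_tau_b(positions, info)
bin_positions, bin_means = binned_means(positions, info, cutoff=25)
binned_tau = kendall_tau_b(bin_positions, bin_means)
```

On the toy data the binned correlation is much higher than the raw one, because averaging within positions removes the between-document variation; the same contrast appears between the raw and binned rows of Table 5 at the smallest cut-off.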

E Experimental Results
Tables 6 and 7 summarise the results of our statistical analysis, as introduced in Section 4. In both tables, the logarithm of information content is the response variable and the logarithm of turn position is the fixed effect. We include a random intercept grouped by distinct dialogues and a random slope for the turn position. Fixed effects with significant coefficient estimates are marked in bold. The Random effects columns show the standard deviation of the random effects (Coeff.) and the residual standard deviation. The UID principle is considered to hold when turn position has a significant positive effect on information content. Figure 3 shows the patterns of information content against turn position for the contextual units whose patterns are not displayed in Section 5.

F Computing Infrastructure
The models were trained and evaluated on a computer cluster with Debian Linux OS. Parallelization over four GPUs was implemented for the finetuning of GPT-2. All information content computations were executed using a single GPU. The GPU nodes are GeForce 1080Ti GPUs with 11GB GDDR5X memory, NVIDIA driver version 418.56, and CUDA version 10.1.