Speaking the Language of Your Listener: Audience-Aware Adaptation via Plug-and-Play Theory of Mind

Dialogue participants may have varying levels of knowledge about the topic under discussion. In such cases, it is essential for speakers to adapt their utterances by taking their audience into account. Yet, it is an open question how such adaptation can be modelled in computational agents. In this paper, we model a visually grounded referential game between a knowledgeable speaker and a listener with more limited visual and linguistic experience. Inspired by psycholinguistic theories, we endow our speaker with the ability to adapt its referring expressions via a simulation module that monitors the effectiveness of planned utterances from the listener's perspective. We propose an adaptation mechanism building on plug-and-play approaches to controlled language generation, where utterance generation is steered on the fly by the simulator without finetuning the speaker's underlying language model. Our results and analyses show that our approach is effective: the speaker's utterances become closer to the listener's domain of expertise, which leads to higher communicative success.


Introduction
Speakers tend to adapt their language use to the perceived knowledge, information, and linguistic abilities of their interlocutors (Isaacs and Clark, 1987; Clark, 1996; Pickering and Garrod, 2004). When adults speak with children, for example, they use simplified expressions to ensure children are able to understand (Saxton, 2009); when computational linguists give a talk at a cognitive science conference, they (hopefully) avoid making extensive use of NLP jargon, as that would prevent their audience from following the presentation. Successful adaptation to the conceptual knowledge of conversational partners requires the ability to represent and reason about others' mental states

* Shared first authorship.

Figure 1: An illustration of our knowledge-asymmetric setup where an expert Speaker, knowledgeable about all domains, interacts with a less knowledgeable Listener who only knows about the indoor domain. The Speaker tailors its utterance about an image from the food domain (non-adapted utterance: "Green salad") for this Listener. The Speaker's Simulator module, inspired by Theory of Mind, guides this adaptation by operating on a frozen language model to control generation. The adapted utterance exploits indoor terms ('bookshelves') without referring to food.

(Tomasello, 2005), a socio-cognitive ability typically referred to as Theory of Mind (ToM; Premack and Woodruff, 1978). Yet, speakers do not always resort to explicitly modelling the knowledge of their dialogue partner: due to different cognitive costs and pressures, they sometimes plan their utterances egocentrically, i.e., only taking into account their own knowledge and abilities (Keysar, 2007).
In this paper, we model a communicative situation where the interlocutors have asymmetric language abilities: a proficient speaker interacts with a listener characterised by limited semantic knowledge to complete a reference game, as illustrated in Fig. 1. Our goal is to mimic a scenario in which, for example, a high school physics professor can make complex atomic models understandable to young students by using terminology familiar to them, such as culinary terminology to explain Thomson's 'plum pudding model'. We focus on the speaker's Referring Expression Generation (REG; Reiter and Dale, 1997; Krahmer and van Deemter, 2012) in a multimodal dialogue setting and use REG models equipped with visual perception to generate discriminative image descriptions within a set of related image candidates. Several psycholinguistic theories have proposed that language production is interwoven with comprehension via 'forward prediction', i.e., producing an utterance involves predicting how a comprehender would understand it (e.g., Pickering and Garrod, 2013; Roelofs, 2020). Inspired by this idea, we equip our speaker model with a simulator, i.e., a module that 'simulates' whether a listener would be able to identify the target referent. Based on this predicted behaviour (i.e., the expected effect of the planned utterance), the simulator modifies the generation plan on the fly to increase communicative success.
These are the main contributions of our study:
• We model adaptation between agents with asymmetric knowledge, using a referential task as a case study, where agents communicate in natural language about realistic images (in contrast to related work using synthetic data; see §2).
• We propose a novel simulation-based approach and test it in two settings: (1) a self-aware setting where the speaker predicts how a generic listener (with the same knowledge as the speaker) would resolve a planned utterance, and (2) an audience-aware setting where the speaker learns, from the behaviour of a listener with restricted semantic knowledge, to form representations of the listener's knowledge (Clark, 1985; Isaacs and Clark, 1987) and predict their responses.
• We exploit the simulator's representations in an innovative way: by leveraging a plug-and-play approach originally introduced for controllable text generation (Dathathri et al., 2020), which steers language production at the decoding stage without altering the underlying language model.
• We show that our approach leads to increased resolution accuracy; in particular, our audience-aware speaker is able to adapt its utterances effectively when referring to a target within a visual domain unknown to the listener.
• We provide an in-depth analysis of the patterns present in the adapted utterances and the model's production strategies underpinning our results.

1 Code and models available at https://github.com/nicofirst1/speaker-adaptation

Pragmatic Reference Generation
Speakers tend to design their referring expressions to be pragmatically informative, i.e., discriminative from the listener's perspective. Most approaches to pragmatic referring expression generation (REG) have considered scenarios where we can assume a shared set of linguistic conventions between speakers and addressees (common domain and training data). The Rational Speech Act framework (RSA; Frank and Goodman, 2012; Goodman and Stuhlmüller, 2013; Goodman and Frank, 2016) has become a popular option for characterising such settings, with REG models that reason probabilistically about their interlocutors' interpretation via recursively defined speaker and listener models (Andreas and Klein, 2016; Monroe et al., 2017; Cohn-Gordon et al., 2018; Zarrieß and Schlangen, 2019; Fried et al., 2021), possibly taking into account information accumulated during interaction (Hawkins et al., 2020). There also exist joint speaker-listener models that are not recursive in the RSA sense. In these models, speakers can become listener-aware at inference time thanks to enhanced decoding algorithms (Vedantam et al., 2017), or they can learn to generate discriminative utterances at training time, for example via altered supervised training objectives (Mao et al., 2016) or auxiliary reinforcement learning (RL) modules (Yu et al., 2017), including approaches where the RL rewards are determined by the reference resolution success of a listener model (Lazaridou et al., 2020).
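As a concrete illustration, the basic RSA recursion can be sketched in a few lines. The two-utterance, two-referent lexicon below is a made-up toy example (not taken from any of the cited models): a literal listener normalises truth values over referents, and a pragmatic speaker soft-maximises the literal listener's accuracy.

```python
import numpy as np

# Hypothetical truth-conditional lexicon: lexicon[u][r] = 1 if utterance u
# is true of referent r. "hat" fits both referents; "glasses" fits only r1.
lexicon = np.array([[1.0, 1.0],   # "hat":     true of r0 and r1
                    [0.0, 1.0]])  # "glasses": true of r1 only

def literal_listener(lexicon):
    # L0(r | u): normalise truth values over referents (uniform prior).
    return lexicon / lexicon.sum(axis=1, keepdims=True)

def pragmatic_speaker(lexicon, alpha=1.0):
    # S1(u | r) proportional to exp(alpha * log L0(r | u)).
    L0 = literal_listener(lexicon)
    scores = np.exp(alpha * np.log(L0 + 1e-12)).T  # rows: referents
    return scores / scores.sum(axis=1, keepdims=True)

S1 = pragmatic_speaker(lexicon)
# For referent r1 (fit by both utterances), the speaker prefers the
# unambiguous "glasses": S1[1] puts more mass on utterance 1.
```

The recursion can be deepened (a pragmatic listener reasoning about S1, and so on); the models cited above differ mainly in how these distributions are parameterised and trained.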
Our model, too, produces audience-aware discriminative image descriptions through an auxiliary module that captures the listener's perspective. However, in contrast to the above studies, the setting we investigate has two distinct key features: (1) we model situations with knowledge asymmetry between the dialogue participants, and (2) we experiment with plug-and-play controlled generation methods that result in temporary updates to the speaker's language model, rather than steering generation via recursive probabilistic reasoning. We review work related to these two aspects next.

Knowledge Asymmetry & Referring Tasks
What if the speaker and the listener have access to differing semantic knowledge? It is well known that speakers are able to adapt to less proficient addressees (Isaacs and Clark, 1987). Janarthanam and Lemon (2010) were among the first to address adaptation in dialogue systems with asymmetric knowledge. They modelled REG for technical domains where users may not know the jargon, using RL to learn a REG policy from a user simulation. More recently, Ohashi and Higashinaka (2022) focused on generating utterances in task-oriented dialogue with users that have a limited vocabulary. They exploit the natural language understanding module of the system (representing user understanding) to set up a reward function, which is then used to finetune the NLG module via RL.
In the context of visually grounded referring tasks, Bao et al. (2022) focus on a scenario where the listener has comprehension difficulties and model adaptation by reweighting the probability of candidate referring utterances as a function of their likelihood to be successfully interpreted by the listener. Similarly, Liu et al. (2016) apply ToM-based listener modelling, where the speaker generates multiple candidate utterances and ranks them with the help of the ToM listener. Generating and ranking multiple utterances, however, is an inefficient production mechanism. For this reason, others have tried to condition the speaker model prior to utterance generation, mainly with external modules. Corona Rodriguez et al. (2019) model interactions where the listener has an impaired perceptual system and implement this conditioning through an external policy network that takes listener embeddings as input. Zhu et al. (2021) propose a ToM module that tracks the listener's understanding via meta-learning for few-shot coordination in a setup where listeners understand different languages. Singh et al. (2023) train an attention-based adapter layer in a reward-based manner as part of a multi-agent referential game where the speaker aims to generate utterances that would be understood by one listener, but not the other. Finally, Greco et al. (2023) have the setup most similar to ours, where Expert speakers adapt to Layman listeners. Unlike our plug-and-play approach, however, the authors follow the RSA framework in developing audience-aware models that are updated through interaction.

Adaptive Controlled Generation
Most of the approaches to adaptation we have reviewed apply RL to the speaker model or finetune its language model through interaction. As a result, the speaker is not able to retain its original knowledge, which might cause catastrophic forgetting (McCloskey and Cohen, 1989; French, 1999). With the advent of large pretrained language models, a plethora of new methods for controlled text generation have been proposed, including prefix-tuning (Li and Liang, 2021; Ben-David et al., 2022), prompting (Brown et al., 2020), adapters (Houlsby et al., 2019; Pfeiffer et al., 2020a,b), and energy-based constraints (Qin et al., 2022). Visual prefixes and prompts (Alayrac et al., 2022) have also been used to condition generation, especially without training the full language model. We argue that this recent line of research offers promising alternative frameworks for adaptive REG. In particular, we investigate a solution to adaptation inspired by the plug-and-play approach to controlled text generation (PPLM; Dathathri et al., 2020; Pascual et al., 2021), which has been used to steer large pretrained language models towards generating texts with certain features (e.g., positive/negative sentiment or a given vocabulary distribution). In Dathathri et al. (2020), latent representations are updated at inference time with the help of a classifier while keeping the model parameters unchanged. Building on this idea, we propose a modular approach to REG adaptation in asymmetric knowledge settings, where a module trained to predict the listener's behaviour, similar to the 'prediction net' in the machine ToM model by Rabinowitz et al. (2018b), is exploited to control generation on the fly.
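The core plug-and-play idea can be sketched with stand-in components: here a frozen linear attribute classifier (a hypothetical simplification of PPLM's attribute model) scores a latent vector, and we nudge the latent along the closed-form gradient of the classifier's cross-entropy loss. No model weight changes; only the latent moves.

```python
import numpy as np

# Frozen "attribute classifier": a single linear layer standing in for the
# real attribute model. Only the latent h receives gradient updates.
rng = np.random.default_rng(1)
W = rng.standard_normal((2, 8)) * 0.5   # frozen classifier weights
h = rng.standard_normal(8)              # latent state to be steered
target = 1                              # desired attribute class
lr = 0.3

for _ in range(50):                     # gradient steps on h only
    logits = W @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs[target] -= 1.0                # d(cross-entropy)/d(logits)
    h -= lr * W.T @ probs               # W stays frozen throughout
```

In the real PPLM setting the gradient is obtained by automatic differentiation through the language model's latent activations rather than in closed form, but the division of labour is the same: attribute model frozen, latents steered.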

Problem Formulation
We provide an abstract overview of the problem we address and our approach. Details on the data and the experimental pipeline are given in §4 and §5.
Scenario Our setup is a classic referential game: two artificial agents, a speaker and a listener, share a visual context involving multiple images. The speaker produces an utterance to refer to one of the images (the target) and the listener attempts to identify the referent given that utterance. In particular, we model a scenario with knowledge asymmetry, where the speaker is more knowledgeable than the listener. We hypothesise that, in such a setup, for communication to be successful, the speaker will need to adapt its utterances to the listener's representational space and language. To make this possible, we endow the speaker with a simulation module and an adaptation mechanism.
Simulation We provide the speaker with a module that simulates how a listener would process a planned utterance. We assume that, by having interacted with listeners in the past, the speaker has learned a model of certain listener types (e.g., a prototypical idea of what a 3-year-old would understand). We operationalise this by pretraining several instances of the simulator, one per listener type, to predict how a listener is likely to resolve a referring utterance. We compare three settings:
• Baseline: No simulation takes place.
• Self-aware: The simulator is trained to predict how a listener with the same knowledge as the speaker would resolve an utterance. This is equivalent to a pragmatic speaker who reasons about the effect of its utterances on a generic listener (see §2.1), but in our approach, at test time, the listener's interpretations are predicted rather than directly observed. Our proposal is also inspired by human production models based on 'self-monitoring' (Levelt, 1993; Roelofs, 2020).
• Audience-aware: The simulator is trained to predict how a listener with a subset of the speaker's knowledge, i.e., a single domain, would resolve an utterance. Thus, the speaker learns a model, a theory of mind, of a less knowledgeable listener type that allows the speaker to make predictions about the listener's behaviour. When performing the referential task, we assume that the speaker knows the type of the listener beforehand, i.e., which simulator needs to be engaged (similarly to knowing that we are addressing a 3-year-old, for example).
Adaptation Rather than finetuning the speaker's language model, we exploit the pretrained simulators to control utterance generation on the fly via a monitoring loop. The simulator checks whether planned utterances would be effective; if that is not the case, a loss is backpropagated to update the initial hidden state of the speaker's decoder, and a new utterance is generated. Our hypothesis is that such a mechanism will lead to referring utterances that are adapted to the listener's knowledge.

Data
As a basis for our experiments, we use the PhotoBook dataset (PB; Haber et al., 2019), a collection of task-oriented visually grounded English dialogues between pairs of participants who communicate via written chat. In a PhotoBook game, two participants see their own private sets ('photobooks') of real-life images belonging to the same visual domain. The goal of the interaction is for them to find out which images they have in common. This elicits referring utterances such as "I have a little boy holding a phone to a teddy bear", where participants refer to an image that their dialogue partner needs to identify among six similar images. Our focus is on the generation of such referring utterances, leaving aside the dialogue context for simplicity. We use the dataset of referring utterances automatically extracted from the PB dialogues by Takmaz et al. (2020), which includes 41,340 utterances paired with their target image and the other five images in the visual context. We choose PhotoBook because its visual contexts consist of realistic images and feature multiple challenging distractors, all selected from the visual domain of the target image. The original images are taken from the Microsoft COCO dataset (Lin et al., 2014) and belong to 30 different visual domains (e.g., 'person-umbrella', 'car-motorcycle').
To model speaker adaptation to different semantic domains, we split the dataset of PB referring utterances according to the visual domain of each game. We cluster the image domains as a function of the similarity between their vocabulary vectors, constructed by counting word frequencies in the referring utterances belonging to a given domain. We obtain a set of 5 macro-domains (appliances, food, indoor, outdoor, vehicles), selected so that the domain vocabularies have minimal overlap. For each cluster of visual domains, we extract the corresponding referring utterances and visual contexts. We then randomly split these into training (70%), validation (15%), and test (15%) sets. We also merge the 5 domain-specific datasets into an 'all-domains' dataset used to train domain-general models as described in §5. See the summary in Table 1.
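The clustering criterion can be illustrated with a toy sketch. The domains and utterances below are invented for illustration (the actual vocabulary vectors are built from the PB referring utterances): each domain gets a word-count vector, and domains with high cosine similarity between vectors would be merged into the same macro-domain.

```python
import numpy as np
from collections import Counter

# Invented toy data: a few referring utterances per "domain".
utterances = {
    "food":     ["green salad bowl", "salad with cheese"],
    "kitchen":  ["bowl on the counter", "cheese on a plate"],
    "vehicles": ["red truck", "truck near a car"],
}
vocab = sorted({w for uts in utterances.values() for u in uts for w in u.split()})

def vocab_vector(domain):
    # Word-frequency vector of a domain over the shared vocabulary.
    counts = Counter(w for u in utterances[domain] for w in u.split())
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vecs = {d: vocab_vector(d) for d in utterances}
# "food" and "kitchen" share words (bowl, cheese), so they score higher with
# each other than either does with "vehicles" and would be clustered together.
sim_food_kitchen = cosine(vecs["food"], vecs["kitchen"])
sim_food_vehicles = cosine(vecs["food"], vecs["vehicles"])
```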

Experimental Pipeline
As described in §3, our experimental pipeline includes two agents, a speaker and a listener, implemented as a generative language model and a discriminative model respectively, plus a third model, a simulator, used by the speaker to assess the forward effect of its planned utterance on the listener. The language model and the discriminator model are adapted from those by Takmaz et al. (2020), and the simulator model is built on the discriminator's architecture with additional components. We train these models from scratch to have full control over the linguistic and visual knowledge of the agents and their degree of asymmetry. We use ResNet-152 to encode the images (He et al., 2016). See Appendix A for more information about the training schemes and hyperparameters.

Generative Language Model
The speaker is a visually conditioned language model that generates an utterance describing a target image within a visual context. The model follows an encoder-decoder architecture consisting of a visual encoder that represents the visual context along with the target image, and a decoder for language generation. The decoder generates a referring utterance via nucleus sampling (Holtzman et al., 2020), also attending to the encoder output at every time step. See Appendix A.1 for more details about the model architecture.
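For reference, nucleus (top-p) sampling can be sketched as follows; the logits below are arbitrary toy values, not outputs of our model. The decoder keeps the smallest set of tokens whose cumulative probability exceeds p, renormalises, and samples only within that set.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    # Nucleus (top-p) sampling: restrict sampling to the smallest prefix of
    # tokens (in decreasing probability) whose cumulative mass exceeds p.
    rng = rng or np.random.default_rng(0)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # tokens by decreasing prob
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1    # smallest prefix with mass > p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

logits = np.array([5.0, 4.0, 0.1, -3.0])  # tokens 2 and 3 fall outside the nucleus
token = nucleus_sample(logits, p=0.9)
```

Unlike greedy decoding, this keeps generation stochastic while truncating the unreliable low-probability tail.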
We train the visually conditioned language model from scratch on the training set including all domains in PB, and optimize the model with respect to the Cross Entropy loss using Adam (Kingma and Ba, 2015). We select the best model based on its performance on a set of natural language generation metrics on the validation set. The weights of the trained speaker are then frozen and used identically as the core language generation model in all our experiments.

Discriminator
Our listener is a discriminator model that receives the six images in the visual context plus an utterance, and is tasked with identifying the target image that the utterance refers to. To encode the utterance, we use word embeddings trained from scratch to make sure no knowledge leaks in from any pretraining. The model combines the visual context and the utterance to produce a multimodal context vector. The listener identifies the target image by comparing this multimodal context vector to the representation of each candidate image via dot product and selecting the image with the highest score. See Appendix A.2 for a detailed description of the model architecture.
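Schematically, the resolution step works as follows. The feature vectors and the additive "fusion" below are simplified stand-ins for the actual learned layers; the utterance embedding is constructed to resemble one candidate so that the dot-product comparison has a clear winner.

```python
import numpy as np

# Toy stand-ins for the listener's inputs: six candidate image features and
# an utterance embedding that happens to "describe" candidate 2.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((6, 64))
utterance = candidates[2] + 0.1 * rng.standard_normal(64)
context = candidates.mean(axis=0)          # crude visual-context summary

multimodal = utterance + 0.1 * context     # stand-in for the fusion layers
scores = candidates @ multimodal           # one dot-product score per candidate
prediction = int(np.argmax(scores))        # listener's selected image
```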
We train one listener model per domain in Table 1. The models are optimized with the Cross Entropy loss using the Adam optimizer. The best models are selected based on resolution accuracy on the validation set. We keep these domain-specific listener models frozen in the rest of the study. See Appendix A.2 for further details.
Performance We distinguish between in-domain (IND) accuracy, i.e., the resolution accuracy achieved on the test set of the domain on which the listener has been trained, and out-of-domain (OOD) accuracy, i.e., the accuracy on domains the listener has not been exposed to (e.g., the accuracy on images from the vehicles domain of a listener exclusively trained on the food domain). Our listeners are truly domain-specific: they are able to identify the target image with an average accuracy of 83.08% IND, while their OOD accuracy is 19.05% on average, barely above a random baseline (16.67%). See Appendix B.2 for the full results broken down per domain.

Simulator
As explained in §3, the speaker is endowed with a simulator module. The simulator receives inputs in two parallel streams. In one stream, it receives the visual context v coupled with the speaker's planned utterance u_t; in the second stream, it receives the visual context along with the language model's initial hidden state h_0. The motivation behind this architectural choice is related to the plug-and-play approach at the core of our proposal. The first stream is inspired by previous work on ToM (e.g., Rabinowitz et al., 2018a): its main input is the same as what a listener would receive, an utterance. However, to control generation on the fly, we need to modify the language model's internal representations. Thus, the main reason for the second stream is technical: if the input to the simulator is text, the gradients from the simulator's loss cannot flow back to the language model's hidden states due to the non-differentiability of the argmax operation. The second stream uses a combination of linear layers and standardization to compute the dot product between h_0 and v. The outcomes of the two streams are multiplied to obtain the final representation, which is compared to the candidate images.
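A minimal sketch of the two-stream design shows why the second stream matters. All dimensions and projections below are hypothetical simplifications of the actual architecture, and we write the gradient in closed form rather than with an autodiff framework: the point is that the cross-entropy gradient reaches the continuous h_0 through stream 2, while the sampled utterance text in stream 1 is not differentiable.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
P_u = rng.standard_normal((d, d)) / d      # utterance-stream projection (toy)
P_h = rng.standard_normal((d, d)) / d      # hidden-state-stream projection (toy)
candidates = rng.standard_normal((6, d))   # six candidate image features
utt_emb = rng.standard_normal(d)           # embedded planned utterance (discrete origin)
h0 = rng.standard_normal(d)                # speaker decoder's initial state
target = 0                                 # gold target image index

s1 = candidates @ (P_u @ utt_emb)          # stream 1: utterance vs images
s2 = candidates @ (P_h @ h0)               # stream 2: h0 vs images
scores = s1 * s2                           # multiplicative fusion of the streams

probs = np.exp(scores - scores.max())
probs /= probs.sum()
grad_scores = probs.copy()
grad_scores[target] -= 1.0                 # d(cross-entropy)/d(scores)
grad_h0 = P_h.T @ candidates.T @ (grad_scores * s1)  # gradient reaches h0 only
```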
We train one audience-aware simulator per domain-specific listener and one self-aware general simulator, with the Cross Entropy loss and the AdamW optimizer (Loshchilov and Hutter, 2017). The training set sizes of the two types of simulators are the same; only the target behaviour differs. When simulating a general listener, the simulator predicts the behaviour of a listener that was exposed to all domains, like the speaker, as opposed to a single domain in the domain-specific case. We choose the best simulator per listener type based on the simulators' prediction accuracies (more details in Appendix A.3). The simulators are then kept frozen in the rest of the pipeline.
Performance The self-aware simulator achieves an accuracy of 70% when predicting the behaviour of a general listener. The audience-aware simulators predict the behaviour of domain-specific listeners with an average accuracy of 78.20% on IND samples and 72.78% on OOD samples. The drop in accuracy from IND to OOD samples could be due to difficulties in ascertaining the reactions of a listener on OOD data. See the detailed results in Appendix B.3.

Audience-Aware Adaptation
In our framework, adaptation takes place at inference time, building on the pretrained, frozen language model, discriminators, and simulators described in §5. We first explain our adaptation mechanism (§6.1) and then report the results obtained (§6.2).

Adaptation Mechanism
Algorithm 1 describes the adaptation mechanism sketched in §3, which exploits the simulator to iteratively monitor the generation outcomes of the speaker. Given the visual context v, the initial hidden state of the speaker's decoder h_0, and the currently planned utterance u_t, the simulator makes a prediction for the listener's selection. We calculate the Cross Entropy loss between the simulator's prediction and the true target. We use the gradients flowing back from this loss to update h_0 with the Adam optimizer. That is, adaptation is performed by backpropagating the loss to modify only the initial hidden state of the speaker's decoder. Based on the updated h_0, the language model generates a new utterance to be reviewed by the simulator. The mechanism stops either when (1) the simulator predicts that the listener will choose the gold target image, or (2) the maximum number of adaptation steps (st_adp) is reached. At each step, we reset the random seed to ensure that changes in the sampling of words are attributable only to the updates to h_0, showing the effects of adaptation directly without the confound of the stochastic nature of sampling.
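The monitoring loop can be sketched as follows. For illustration, the frozen simulator is reduced to a single linear layer (W, b are hypothetical stand-ins for the module of §5.3), which lets us write the cross-entropy gradient with respect to h_0 in closed form and use plain gradient descent in place of Adam; utterance regeneration is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 16)) * 0.25  # frozen stand-in simulator weights
b = np.zeros(6)
h0 = rng.standard_normal(16)             # decoder's initial hidden state
target = 3                               # index of the gold target image
lr, st_adp = 0.5, 100                    # step size and max adaptation steps

for step in range(st_adp):
    scores = W @ h0 + b                  # simulator's predicted listener scores
    if int(np.argmax(scores)) == target:
        break                            # predicted success: stop adapting
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    probs[target] -= 1.0                 # d(cross-entropy)/d(scores)
    h0 -= lr * W.T @ probs               # update h0 only; simulator stays frozen
```

In the actual pipeline, each updated h_0 seeds a fresh nucleus-sampled utterance, and the stopping conditions are exactly the two listed above.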

Results
We evaluate whether our approach leads to increased communicative success, quantified in terms of listener resolution accuracy. We report the results for the three settings described in §3. For each of the three modules involved in these settings, we provide an evaluation card (Hupkes et al., 2022) in Appendix C to clarify the nature of our generalisation tests.
Baseline Table 2 provides a breakdown of resolution accuracies per type of domain-specific listener in the setting without simulation; Table 3 shows the averages. Not surprisingly, the results obtained with generated utterances are lower than those reported in §5.2. However, the patterns are the same: when the speaker agent refers to an image within a domain known to the listener (IND), the average resolution accuracy is 52.30%; communication, however, breaks down in out-of-domain instances, where the average OOD score is 19.06%, close to random choice.

Audience-aware adaptation When the speaker adapts its utterances by predicting the behaviour of a domain-specific listener, we see a significant increase in both IND and OOD accuracy (Table 3). This indicates that audience-aware adaptation helps in knowledge-asymmetric scenarios, including in IND situations where the agents communicate about a domain known to the listener (65.09% vs. 71.77%). More importantly, while there is certainly room for improvement, the speaker is able to generate utterances that can more often be resolved OOD (19% vs. 26.74%).

Analysis
Our experiments show that simulation-based adaptation leads to more successful communication. In this section, we analyse the speaker model and its generated utterances to understand which neural processing mechanisms and which production strategies are behind our main results.

Probing for Domain Information
We begin with an analysis of the neural representations of the speaker model in the audience-aware setting. We focus on h_0, the first hidden state of the LSTM decoder. This is the output of the visual encoder on which the simulator module intervenes in order to adapt the speaker's utterance plan. Because h_0 is the result of encoding a target image (within a visual context), we expect it to carry information about the semantic domain of the image.
If it were not able to differentiate visual domains, it would be very unlikely to successfully adapt to domain-specific listeners. We test this hypothesis using diagnostic probing (Adi et al., 2017; Conneau et al., 2018; Hupkes et al., 2018). We train a logistic regression classifier on 70% of the hidden states h_0 collected from the speaker at test time, and then assess whether it can predict the image domain corresponding to the remaining 30% of the hidden states. As expected, the probing classifier is able to do so with perfect precision and recall (both equal to 1.0) across the 5 visual domains.
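The probing protocol can be sketched on synthetic data. The "hidden states" below are noisy copies of per-domain centroids (invented for illustration), and we swap the logistic regression for an equivalent hand-rolled softmax probe trained by gradient descent to keep the sketch dependency-free: if domain information is linearly decodable, held-out accuracy should be high.

```python
import numpy as np

# Synthetic "hidden states": each one is a noisy copy of its domain centroid.
rng = np.random.default_rng(0)
n_domains, dim, n_per = 5, 20, 40
centroids = rng.standard_normal((n_domains, dim)) * 3
states = np.vstack([centroids[d] + rng.standard_normal((n_per, dim))
                    for d in range(n_domains)])
labels = np.repeat(np.arange(n_domains), n_per)

# 70/30 train/test split, mirroring the probing setup in the text.
perm = rng.permutation(len(states))
split = int(0.7 * len(states))
tr, te = perm[:split], perm[split:]

W = np.zeros((dim, n_domains))               # linear probe, softmax output
for _ in range(500):                         # plain gradient descent
    logits = states[tr] @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(tr)), labels[tr]] -= 1.0   # softmax minus one-hot
    W -= 0.01 * states[tr].T @ probs / len(tr)

acc = (np.argmax(states[te] @ W, axis=1) == labels[te]).mean()  # held-out accuracy
```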
Using the same approach, we test whether the domain of the listener, rather than the image domain, is also encoded in h_0. Our hypothesis is that this should not be the case: before the simulator kicks in, the speaker model has no information on the listener's domain-specific knowledge. Probing accuracy scores vary between 0.13 and 0.16 across domains (the random baseline is 0.17), indicating that the speaker's hidden state indeed does not carry listener information before adaptation.
As the simulator activates, the original h_0 is updated for a maximum of st_adp adaptation steps. We now look at the updated hidden states h_0^1, ..., h_0^{st_adp} and test whether their encoding of the image and listener domains changes with adaptation. First, we use the probing classifier previously trained to predict image domains from h_0 to test the adapted hidden states. We find that the encoding of the image domain deteriorates with domain-specific adaptation (Figure 2). Then, we probe h_0^1, h_0^2, ... for listener information and show that the listener's domain can be predicted almost perfectly from the adapted h_0 after only three adaptation steps (Figure 2). Taken together, these observations indicate that the neural processing mechanism that leads to more successful interaction is one by which information about the semantic domain of the visual context is replaced by information about the domain of the listener, and one which only requires a few gradient updates.

The Speaker's Adapted Vocabulary
We analyse macro-level properties of the corpus of adapted utterances as compared to the utterances generated in the simulator-less baseline setting. We compute the type-utterance ratio and the type-token ratio over adaptation steps to monitor the relative size and variety of the vocabulary as the speaker uses its simulator module. As Figure 3 shows, after an initial drop during the first 1-3 adaptation steps, the type-utterance ratio and type-token ratio increase substantially with respect to the non-adapted utterances (and to the gold referring utterances). The speaker's vocabulary becomes much more diverse. What remains rather stable throughout adaptation, instead, is the unigram part-of-speech distribution (Figure 7 in Appendix E). While, after the first adaptation step, the difference in POS usage is notable (e.g., less punctuation, more nouns), only proper nouns and determiners show substantial changes in relative proportions, with proper nouns increasing and determiners decreasing over time.
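The two diversity measures are straightforward to compute; the utterances below are invented for illustration.

```python
# Type-token and type-utterance ratios over a small made-up corpus.
utterances = ["a green salad", "a green bowl", "pink donuts on shelves"]
tokens = [w for u in utterances for w in u.split()]
types = set(tokens)

type_token_ratio = len(types) / len(tokens)          # vocabulary variety
type_utterance_ratio = len(types) / len(utterances)  # vocabulary size per utterance
```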

Adaptation Strategies
The trends observed so far characterise the effect of adaptation across steps, but they do not differentiate between successful and unsuccessful adaptation. In Figure 4, we split the adapted utterances (the ones actually generated by the speaker when it believed its utterance would be successful) according to whether they lead to a correct listener guess. We observe that more successful utterances contain words with a lower age of acquisition (AoA, t = −28.88, p < 0.001), show a lower rate of lexical choices from the target image vocabulary (t = −28.76, p < 0.001), and a higher rate of words from the listener vocabulary (t = 5.88, p < 0.001). The average AoA in an utterance increases with adaptation steps (see Fig. 8 in Appendix E), suggesting that the excessive abstractness of the descriptions may be behind the limited gains we observe with adaptation.

Qualitative Inspection
In Figure 5, we provide examples of adapted sentences from the test set to demonstrate how the audience-aware adaptation strategies affect the lexical choices made by the language model. In the top example, the image domain is 'food'; however, the listener was trained on the 'indoor' domain. We see that the speaker moves away from generating detailed mentions of food to including a word related to the listener's own domain, bookshelves. In the bottom example, where the listener has only been exposed to the 'food' domain and the image domain is 'outdoor', the model avoids mentioning the truck. Instead, it produces an utterance containing a prominent color in the image, i.e., pink, and some visible entities that belong to the listener's domain, namely, donuts. These observations suggest that the model exploits various adaptation strategies.
In the whole set of adapted utterances, we observe comprehensible sentences; however, there is also a large number of less fluent, unnatural ones. Since we do not use pretrained large language models, the speaker's initial utterances are themselves sometimes not fluent. The dynamics of adaptation may further exacerbate this situation and lead the language model towards generating unnatural utterances. Such utterances may not be understood by human listeners; yet, they could make sense to artificial listeners. To ensure that the adapted utterances are comprehensible to humans, further precautions may be needed, such as incentivizing the generative model to keep the adapted utterances grammatical and fluent, possibly with the aid of human feedback.

Conclusion
We focused on a standard reference game: a speaker produces an utterance, and a listener uses it to pick the referent from a visual context. However, our setup is asymmetric: the speaker has general semantic knowledge, while the listener has little knowledge of all domains but one (e.g., food). Such a setting is a perfect scenario for studying adaptation, i.e., the common process in human communication by which a speaker tunes its language to that of a listener to achieve communicative success. We modeled this mechanism using a plug-and-play approach to controllable text generation: the speaker's output is conditioned on the effect of the planned utterance on the listener, as predicted by an internal simulator. Our results show that speaking the language of a listener increases communicative success. Through adaptation, the speaker's language becomes less tied to the input domain and more tied to the listener's vocabulary, revealing that audience-aware adaptation can be realized without irreversible changes to generation models.
Our approach and findings pave the way for pragmatic models that can account for different communicative scenarios. Future work may study adaptation along other dimensions such as age group or sociocultural background. Moreover, adaptation could be explored in multiple 'directions': in our setup, only the speaker adapts. We also simplify the setup by abstracting away the online process that leads to the simulation of the listeners. It would be beneficial to allow the simulators to learn to predict listener behaviour during interaction in an online manner. Finally, our approach could be applied to other, possibly more complex communicative tasks, perhaps in conjunction with a mechanism leveraging human feedback via reinforcement learning.

Limitations
Although we use data from dialogues, we do not model collaborative reference, i.e., we do not model continual mutual adaptation. Instead, we focus on the speaker's adaptation to the listener in a single turn, which is certainly a simplified setup. Furthermore, our plug-and-play approach still requires training simulators per listener type. However, as we keep the speaker and listener models frozen and use their outputs to train the simulators, this reduces the amount of training required. We train the models from scratch using PhotoBook data and do not make use of state-of-the-art large pretrained vision-and-language models, which are nowadays commonly based on Transformers; this could be considered a limitation. We opted for this setup as it is more aligned with our research questions, allowing us to control the domain-specificity of the models. We also acknowledge the imbalance in the set sizes of the domains, as well as the possible lexical and visual overlaps in the samples across domains. The overlaps may facilitate the adaptation of certain sentences from one domain to another (asymmetry is not controlled in a fine-grained manner), though this is not uncommon in human communication.

Ethics Statement
We use neither large pretrained language models, which have been found to be prone to bias issues, nor uncurated data scraped from the internet, which would open up a myriad of problems. Still, there could be some bias in the PhotoBook data that should be investigated: players might have used offensive or undesirable language in describing images. Therefore, deploying these speakers and listeners directly is not advisable. Our research on the adaptation of a speaker to their audience aims to improve communicative success in scenarios with knowledge asymmetry, following human capabilities of self-monitoring and Theory of Mind. It is possible that adaptation to a specific listener could exacerbate biases if the training set of a given listener happens to include more bias. However, the reverse is also the case: adaptation to underrepresented user groups could be beneficial.

Appendix A Training Details
We provide the details of the setups of the generative language model in §A.1, the discriminators in §A.2, the simulators in §A.3, and the adaptation mechanism in §A.4. We use Python version 3.9.0 and PyTorch version 1.11.0 in the development and testing of all our models. Table 4 lists the hyperparameters used for training the models.

A.1 Generative Language Model
In addition to the main hyperparameters listed in Table 4, the language model requires several additional parameters. In nucleus sampling, we set the p value for top-p to 0.9 and sample from a vocabulary that consists of the words in the training splits of all 5 domains. The maximum length of the generated utterances is set to 30. The model is initialized and trained with 4 different seeds, which yield similar performances. We use an early stopping patience of 30 epochs based on the validation set scores. Regarding the architectural details of the visually-conditioned language model, in the visual encoder we feed both the standardized target image vector and the concatenation of the six images in the full visual context into a linear layer followed by the ReLU non-linearity. We then concatenate the ensuing representations of the target image and the visual context and once more apply a linear layer followed by a ReLU non-linearity to obtain the final visual context, v. This visual context is used to initialize a bidirectional LSTM encoder that takes as input the previous utterance referring to the target image in the current dialogue, if one exists (see footnote 6), or otherwise a special token indicating the absence of such an utterance. The final forward and backward hidden states of this encoder are concatenated and passed through a linear layer and a tanh non-linearity. The output is then set as the initial hidden state h_0 of the LSTM decoder (Hochreiter and Schmidhuber, 1997).
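The architecture described above can be sketched in PyTorch. All dimensions below are illustrative assumptions (not the hyperparameters in Table 4), and the input standardization step is omitted for brevity.

```python
import torch
import torch.nn as nn

class VisuallyConditionedLM(nn.Module):
    """Sketch of the speaker's visually-conditioned encoder-decoder."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=512, hidden_dim=512):
        super().__init__()
        self.target_proj = nn.Sequential(nn.Linear(img_dim, hidden_dim), nn.ReLU())
        self.context_proj = nn.Sequential(nn.Linear(6 * img_dim, hidden_dim), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM encoder over the previous utterance.
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        self.to_h0 = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, target_img, context_imgs, prev_utt, tgt_tokens):
        # Final visual context v: fuse target image with the six-image context.
        v = self.fuse(torch.cat([self.target_proj(target_img),
                                 self.context_proj(context_imgs)], dim=-1))
        # v initializes the encoder (one copy per LSTM direction).
        h0 = v.unsqueeze(0).repeat(2, 1, 1)
        c0 = torch.zeros_like(h0)
        _, (h_n, _) = self.encoder(self.embed(prev_utt), (h0, c0))
        # Concatenated forward/backward states -> decoder initial state via tanh.
        dec_h0 = torch.tanh(self.to_h0(torch.cat([h_n[0], h_n[1]], dim=-1)))
        dec_out, _ = self.decoder(
            self.embed(tgt_tokens),
            (dec_h0.unsqueeze(0), torch.zeros_like(dec_h0).unsqueeze(0)))
        return self.out(dec_out)  # per-step vocabulary logits
```

Given a batch of 2 samples and a 5-token target sequence, the forward pass yields logits of shape (2, 5, vocab_size).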

A.2 Discriminators
In these models, which instantiate the listeners, the word embeddings go through a dropout layer and a linear layer followed by the Leaky-ReLU non-linearity, after which standardization is applied. The visual context is processed in the same way as in the generative language model. Each word representation is concatenated with the representation of the visual context. The resulting vectors go through a linear layer and ReLU. Finally, we apply attention over these vectors to obtain the attention-weighted multimodal context vector. It is this context vector that is compared to the representations of the candidate images via dot product.
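A compact PyTorch sketch of this listener follows. Dimensions are illustrative assumptions; the visual-context encoder is collapsed into a single projection and standardization is omitted, so this is a simplified stand-in rather than the exact model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ListenerDiscriminator(nn.Module):
    """Sketch of the listener: attention-weighted multimodal context
    scored against candidate images via dot product."""

    def __init__(self, vocab_size, embed_dim=256, img_dim=512,
                 hidden_dim=512, dropout=0.3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.word_proj = nn.Sequential(nn.Dropout(dropout),
                                       nn.Linear(embed_dim, hidden_dim),
                                       nn.LeakyReLU())
        self.vis_proj = nn.Sequential(nn.Linear(img_dim, hidden_dim), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.attn = nn.Linear(hidden_dim, 1)
        self.img_proj = nn.Linear(img_dim, hidden_dim)

    def forward(self, utterance, visual_context, candidate_imgs):
        words = self.word_proj(self.embed(utterance))      # (B, T, H)
        v = self.vis_proj(visual_context).unsqueeze(1)     # (B, 1, H)
        # Concatenate each word with the visual context, then project.
        multimodal = self.fuse(torch.cat([words, v.expand_as(words)], dim=-1))
        weights = F.softmax(self.attn(multimodal), dim=1)  # attention over words
        context = (weights * multimodal).sum(dim=1)        # (B, H)
        # Dot product between the context and each candidate image.
        return torch.einsum("bh,bch->bc", context, self.img_proj(candidate_imgs))
```

With six candidate images per context, the output is a (batch, 6) score matrix over which a softmax gives the listener's guess distribution.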
We use the same set of hyperparameters for each domain, as shown in Table 4. The domain-specific listener models were selected based on their accuracy on the in-domain validation set. We report accuracy and MRR on the in-domain and out-of-domain test sets in Table 6.
OOD word masking Our listeners are initialized with the same vocabulary comprising all the words in the training data. However, the domain-specific listeners only learn the words that exist in their own training sets. Therefore, if the speaker generates an OOD word for a domain-specific listener, we mask the word with the <unk> vector in order not to further confound the effects of adaptation on the listeners. This vector is the same across all domains.
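The masking step amounts to a simple vocabulary filter; a minimal sketch, with a toy 'indoor' vocabulary standing in for a listener's actual domain vocabulary:

```python
def mask_ood_words(utterance, listener_vocab, unk_token="<unk>"):
    """Replace words unseen in the listener's training domain with <unk>,
    so OOD lexical items do not confound the adaptation analysis."""
    return [w if w in listener_vocab else unk_token for w in utterance]

# Toy domain-specific vocabulary (hypothetical).
indoor_vocab = {"bookshelves", "chair", "lamp", "the", "a"}
print(mask_ood_words(["a", "pink", "chair"], indoor_vocab))
# → ['a', '<unk>', 'chair']
```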

A.3 Simulator
We select the simulator models based on their accuracy in predicting the behaviour of the listener models on the validation set. The simulator models are trained using the AdamW optimizer (Loshchilov and Hutter, 2017) with a weight decay of 0.0001 and a plateau learning-rate scheduler with a patience of 2, a factor of 0.5, and a threshold of 0.5.
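In PyTorch, this optimisation setup can be configured as below. The linear model is a stand-in for a simulator, and `mode="max"` (plateau on validation accuracy) is our assumption, as the monitored metric is not stated above.

```python
import torch

# Stand-in for a simulator model.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), weight_decay=0.0001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",      # assumption: plateau measured on validation accuracy
    patience=2,
    factor=0.5,
    threshold=0.5)

# In the training loop, one would call scheduler.step(val_accuracy)
# once per epoch; the learning rate is halved when accuracy plateaus.
```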

A.4 Adaptation Mechanism
We optimize the number of adaptation steps and the learning rate for the adaptation mechanism. We perform 2 hyperparameter sweeps using the Weights & Biases (WandB) platform (Biewald, 2020), evaluating a range of values. We find a positive correlation between both hyperparameters and adaptation accuracy, with Pearson's correlation coefficients of 0.71 for the learning rate and 0.66 for the number of steps.
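The correlation computed over the sweep records is a plain Pearson coefficient; a minimal pure-Python sketch, with hypothetical sweep values rather than the actual WandB results:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy sweep records (hypothetical learning rates and adaptation accuracies).
learning_rates = [0.01, 0.05, 0.1, 0.5]
accuracies = [0.42, 0.48, 0.55, 0.61]
print(pearson_r(learning_rates, accuracies))  # positive correlation
```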

B Additional Results
Here, we provide additional results for the speaker in §B.1, the listener in §B.2, the simulator in §B.3, and the adaptation mechanism in §B.4.

B.1 Speaker Results
We provide the detailed results of the speaker model on the test set in Table 5 with the averages and standard deviations over 4 runs.

B.2 Listener Results
Table 6 reports the domain-specific listener performances on IND and OOD gold data. We observe that the domain-specific listeners perform well in in-domain settings and close to the random baseline in OOD settings.
Table 7 presents the domain-specific listener accuracies on speaker-generated input. Especially in IND settings, we see lower scores compared to the gold data, presumably because the listener models were trained on gold data.

B.3 Simulator Results
The detailed outcomes of the simulator models are reported in Table 8. Here, we also report results on the subsets where the listener made a correct prediction (Pos) vs. an incorrect prediction (Neg). The simulators are better able to capture correct listener behaviour, possibly because during the training of the simulators, in-domain data provides a clear picture of the listener's correct behaviour.

B.4 Adaptation Results
In Table 9, we provide the test set results of the adaptation pipeline, broken down by domain and reported separately for IND and OOD inputs. The outcomes show that adaptation has effects in both IND and OOD settings, increasing resolution accuracies over speaker-generated utterances.

C Evaluation Cards
For each of the three main modules in our experiments, we provide an evaluation card to clarify the nature of our generalisation tests. See Table 10 for the generator, Table 11 for the simulator, and Table 12 for the listener. We also register our work in the GenBench evolving survey of generalisation in NLP (Hupkes et al., 2022).

D Additional Experiments
Here, we provide details on additional experiments we performed in our adaptation pipeline.
In our adaptation mechanism, one of the stopping conditions is that the simulator predicts that the listener will be able to guess the referent. We also explored continuing adaptation until the listener itself correctly guesses the referent. We report the results in Table 13, which reveal that this stopping condition yields higher scores, since the utterances are adapted until the actual listener makes a correct guess, mimicking an online interaction setup.
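The alternative stopping condition can be sketched as a simple loop. The `adapt_step` and `listener_guess` callables below are toy stand-ins for the actual adaptation mechanism and listener model:

```python
def adapt_until_correct(utterance, adapt_step, listener_guess, target, max_steps=10):
    """Keep adapting until the actual listener (not the simulator) picks
    the target referent, or a step budget is exhausted."""
    for step in range(max_steps):
        if listener_guess(utterance) == target:
            return utterance, step
        utterance = adapt_step(utterance)
    return utterance, max_steps

# Toy stand-ins: each adaptation step appends a listener-domain word;
# the listener succeeds once "donuts" appears in the utterance.
adapt = lambda u: u + ["donuts"]
guess = lambda u: "target" if "donuts" in u else "distractor"
final, steps = adapt_until_correct(["pink", "truck"], adapt, guess, "target")
print(final, steps)  # → ['pink', 'truck', 'donuts'] 1
```

Using the listener itself as the stopping oracle naturally inflates accuracy, since adaptation only halts on success or when the step budget runs out.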

E Additional Analyses
We note that we measure the type-utterance ratio for each step (i.e., the vocabulary size divided by the number of utterances available for that step), rather than the vocabulary size, because different steps correspond to different numbers of utterances: adaptation stops when the simulator module predicts the target image. We also measure the domain-specificity of utterances over steps, both in terms of the target image domain and of the listener domain, as the percentage of domain-specific words in an utterance. We consider as domain-specific the words that appear only in interactions about a certain domain.
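Both measures are straightforward to compute; a minimal sketch over tokenised utterances, with toy data and a hypothetical domain-specific vocabulary:

```python
def type_utterance_ratio(utterances):
    """Vocabulary size at a step divided by the number of utterances at that step."""
    types = {w for u in utterances for w in u}
    return len(types) / len(utterances)

def domain_specificity(utterance, domain_vocab):
    """Percentage of words in the utterance that are domain-specific,
    i.e., appear only in interactions about that domain."""
    return 100 * sum(w in domain_vocab for w in utterance) / len(utterance)

# Toy utterances at one adaptation step, plus a toy 'food'-specific vocabulary.
step_utts = [["pink", "donuts"], ["pink", "truck", "donuts"]]
food_only = {"donuts", "salad"}
print(type_utterance_ratio(step_utts))           # 3 types / 2 utterances
print(domain_specificity(step_utts[1], food_only))
```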
Throughout adaptation, the speaker produces more words belonging to both the image domain and the listener domain (Figure 9), and thus fewer domain-agnostic words. We saw that, over adaptation steps, the decoder hidden state forgets image domain information in favour of the listener domain. This does not translate into no longer producing words from the image domain, suggesting that the speaker may be focusing more on the specific image than on its semantic domain.
Figure 8 shows the mean utterance age of acquisition rating (Kuperman et al., 2012) over steps.

Figure 2: Probing accuracy for image domain and listener domain predictions over adaptation steps. The 0-th step corresponds to the non-adapted h_0.

Figure 4: Factors affecting the success of an adapted utterance: age of acquisition (left) and % of words in an utterance belonging to the target image domain (right).
Target: FOOD. Listener domain: INDOOR
Gold (not adapted): green salad with a person holding up a portion with fork?
Generated (not adapted): I have one more maybe round you think that has a lime green shaped greens, a salad?
Adapted: must bookshelves in the salad?

Target: OUTDOOR. Listener domain: FOOD
Gold (not adapted): I have the pink food truck again ... white shirt lady
Generated (not adapted): girl at black phone, red truck, brown hair, pink
Adapted: pink donuts

Figure 5: Examples showing how audience-aware adaptation changes the generated utterances. For simplicity, we only show the target images and not the whole visual contexts. We report the final adapted utterances when the adaptation mechanism stops because the simulator predicts that the listener will select the correct image.

Figure 7: Unigram part-of-speech distribution across adaptation steps for the in-domain and out-of-domain conditions.

Figure 8: Mean utterance Age of Acquisition over adaptation steps. Step 0 corresponds to the non-adapted utterance.

Figure 9: Rate of lexical choice from image and listener domain-specific vocabularies.

Table 2: Resolution accuracy in the Baseline setting. Rows indicate the listener domain and columns the evaluation domain. Shaded cells show IND accuracy. Averages across 5 seeds. Full table with standard deviations in App. B.2.
Table 3: Average resolution accuracy for our 3 settings in OOD and IND. Results on the test set over 5 runs. IND row: 52.30 ± 1.10, 65.09 ± 1.98, 71.77 ± 2.16.

Table 4: Hyperparameters used for training the generative language model, discriminator, and simulator models.

Table 5: Speaker results on the test set as measured by common natural language generation evaluation metrics.

Table 7: Listener accuracies on speaker-generated data. Each row indicates the domain a listener was trained on; columns indicate the domain of the input samples. Results over 5 seeds.

Table 13: Listener accuracy using the listener stopping condition in the adaptation mechanism.