So many design choices: Improving and interpreting neural agent communication in signaling games



Introduction
Emergent language games are experimental protocols designed to model how communication may arise among a group of agents. For the linguist, they can serve as models of how language might have emerged in humans (Nowak et al., 1999; Kirby, 2002; Kirby et al., 2008); for the AI or NLP scientist, they provide an interesting and challenging test-bed for cooperation and communication across distinct neural agents using symbolic channels (Havrylov and Titov, 2017; Zhang et al., 2021).
Our focus in this paper is on signaling games (Lewis, 1969). More precisely, we adopt a setting in which a sender is exposed to some data and produces a message that is transmitted to a receiver. The receiver then has to answer a question related to the data that the sender was exposed to. Both agents share the common goal of the receiver answering the question correctly. This common goal encourages the sender to encode relevant information about the input data in its message, in such a way that the receiver can decode it. In the present paper, we show the sender an image, the original image. The receiver is shown a pair of images: a target image, which is semantically related to the original image, and one unrelated distractor. These images all depict a solid on a uniform background; the shape, the size, the position and the color of this object are the same for the original and the target image, while at least one of these features is different for the distractor. Based on the sender's message, the receiver has to guess which image of the pair is the target. We allow the senders to compose sequences of arbitrary symbols of variable length.
One of the long-term goals of the study of such language games is to understand under which conditions emergent communication protocols display language-like features. In particular, compositionality has been a major concern ever since Hockett (1960) and remains so in today's NLP research landscape (Baroni, 2019). In order to observe complex, structured communication protocols, we need to provide the agents with an environment complex enough for such a characteristic to develop. This adds two requirements on the agents' stimuli: the images we show them will need to be structured, and ought not to be discriminated through low-level features (Bouchacourt and Baroni, 2018).
When designing and experimenting with such a signaling game, a number of design choices are left open, ranging from the exact objective optimized by the agents, to the selection of training examples and to whether agents have prior information about their environment. In this paper, we exhaustively study how different choices often encountered in the relevant literature interact, and which combinations of these, if any, yield the most stable, efficient communication protocols. In addition, we use training data that theoretically allow the agents to ignore one aspect of the images (e.g., the color of the object shown, or its size), so as to test whether the agents do ignore one feature and how implementation choices impact this behavior. To that end, we define four automatic metrics to probe syntactic and semantic aspects of their communication protocols; we believe them to be useful to future emergent communication studies, as the current agreed-upon tool set for studying artificial emergent languages remains fairly narrow. These metrics help us assess what the emergent languages have in common and how they differ. We find that language-like characteristics can be driven by seemingly unrelated factors, and that ensuring the emergence of a reliable communication protocol that generalizes to held-out examples requires a careful consideration of how to implement the language game. The main contributions of this work are thus twofold: we report an exhaustive review of implementation choices, and we provide novel automated metrics to study the semantics of emergent communication protocols.
We provide an overview of related works in Section 2. Dataset and game details are presented in Section 3. We describe our implementation variants in Section 4 and our automatic metrics in Section 5. We discuss our results in Section 6.

Related works
There is a large prior body of research that investigates how specific implementation choices can impact the characteristics of the emergent communication protocol. For instance, Liang et al. (2020) advocate in favor of competition as an environmental pressure for learning composition by only rewarding the fastest of two teams in a multi-turn signaling game. Rita et al. (2022) mathematically demonstrate that the typical losses used to implement Lewis games can be broken down into an information term and a co-adaptation term, and that limiting overfitting on the latter term experimentally leads to more compositional and generalizable protocols. Mu and Goodman (2021) discuss generalization, and how to induce it by modifying the signaling game to involve sets of targets, rather than unique targets per episode. Patel et al. (2021) study a navigation task to show how to foster interpretability, i.e., communication protocols that are grounded in agents' perceptions of their environment. Rita et al. (2020) discuss how encouraging "laziness" in the sender and "impatience" in the receiver shapes the messages so as to exhibit Zipfian patterns. Chaabouni et al. (2019b) use handcrafted languages to study word-order preferences of LSTM-based agents. Kim and Oh (2021) discuss the importance of dataset size, game difficulty and agent population sizes. Bouchacourt and Baroni (2018) study how the visual components of signaling game agents can undermine the naturalness of their communication. Korbak et al. (2019) propose a specific pretraining regimen to foster compositionality.
Another relevant section of the literature discusses automatic metrics designed to capture specific language-like aspects of the emergent protocol. Chief among these is the meaning-form correlation (a.k.a. topographic similarity) of Brighton and Kirby (2006), which quantifies compositionality by measuring whether changes in form are commensurate with changes in meaning (though other metrics exist, e.g., Andreas, 2019). Chaabouni et al. (2020) argue that this metric does not correlate with generalization capabilities, and that it is thus unsuitable for studying compositionality. Mickus et al. (2020) show how it is impacted by other language-like features. Following these remarks, we focus on novel metrics and defer discussions of topographic similarity to Appendix B.1.

Experimental setup
Dataset. We construct a dataset of synthetic images depicting solids on gray backgrounds, using vpython. They exhibit a combination of five features, each of which has two possible values: horizontal position (left, right), vertical position (top, bottom), object type (cube, sphere), object color (red, blue), object size (small, large). We generate 1000 images for each of the $2^5$ possible combinations of feature values (or categories).
We divide the dataset into two splits: a training split and an evaluation split. This partition is performed as follows. First, one category is selected as the seed category. Then, base categories are the 16 categories that differ from the seed category on exactly 0, 2 or 4 features. Generalization categories are the 16 remaining categories, which differ from the seed category on exactly 1, 3 or 5 features. Base category images are then further divided 80%-20% between training and evaluation splits. All generalization category images are assigned to the evaluation split. The training split therefore contains only images from base categories, while the evaluation split contains both images from base categories and images from generalization categories.
This partition of categories entails that, during training, all training instances involve image categories that differ by at least two features. Hence, agents may entirely disregard one feature (e.g., color) and still manage to perfectly discriminate all training instances. Only during evaluation are they confronted with pairs of categories that differ by a single feature: namely, when the original image is taken from a base category and the distractor image from a generalization one (or vice versa).
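The parity-based partition and its consequence for training pairs can be checked with a short sketch. The encoding of categories as tuples of five binary values, the choice of seed, and all function names are illustrative assumptions, not the code used for the experiments.

```python
from itertools import product

# Assumption: each category is encoded as a tuple of five 0/1 feature values.
N_FEATURES = 5
categories = list(product((0, 1), repeat=N_FEATURES))

def hamming(a, b):
    """Number of features on which two categories differ."""
    return sum(x != y for x, y in zip(a, b))

def partition(seed):
    """Split categories by the parity of their distance to the seed category."""
    base = [c for c in categories if hamming(c, seed) % 2 == 0]
    generalization = [c for c in categories if hamming(c, seed) % 2 == 1]
    return base, generalization

seed = categories[0]  # arbitrary seed, chosen for illustration only
base, generalization = partition(seed)

# Smallest distance between two distinct base categories (training pairs).
min_base_distance = min(hamming(a, b) for a in base for b in base if a != b)
# Smallest distance between a base and a generalization category (mixed pairs).
min_mixed_distance = min(hamming(a, b) for a in base for b in generalization)
```

Under this encoding, two base categories always differ on an even number of features, which is why single-feature contrasts can only occur in mixed pairs.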
Game & model architecture. All of our models comprise two agents: a sender and a receiver. They are trained to solve a Lewis signaling game with a single communication turn. The sender is first shown an image I and produces a message: a sequence of up to 10 symbols from an alphabet of size 16. The receiver is then provided as input a target image I′ of the same category as I, a distractor image J of a different category, and the message, and has to identify I′ as the intended target. This game is illustrated in Figure 1. The original image I differs from the target image I′ so as to deter the sender from describing low-level features of the images (e.g., specific pixel brightness, Bouchacourt and Baroni, 2018).
Both agents contain an image encoder, implemented as a convolution stack, and an LSTM to process symbols. The sender's LSTM is primed with the encoded original image representation, and then generates the message. The receiver uses its LSTM to convert the message into a vector; it then computes the dot product between the message encoding and the encoding of each of the target and distractor images; we infer a probability distribution over the image pair using a softmax function.
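The receiver's final scoring step (dot products followed by a softmax) can be sketched as follows. The vectors below are made-up stand-ins for the LSTM and convolution outputs, and the function names are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def receiver_distribution(message_vec, target_vec, distractor_vec):
    """Probability distribution over (target, distractor) given a message encoding."""
    scores = [dot(message_vec, target_vec), dot(message_vec, distractor_vec)]
    return softmax(scores)

# Illustrative encodings only; real encodings come from the trained networks.
probs = receiver_distribution([1.0, 0.5], [0.8, 0.4], [-0.2, 0.1])
```

Here `probs[0]` plays the role of the probability assigned to the target image.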
Models are trained with REINFORCE (Williams, 1992); the loss for an episode is defined as:
$$\mathcal{L} = -\sum_{t} r_t \log p(a_t)$$
where $a_t$ is the $t$-th action taken in the episode, $p(a_t)$ its probability, and $r_t$ its associated reward. Each episode contains one generation action per symbol in the message, and one classification action. All actions of an episode are associated with the same reward $r_t = r$. By default, we set $r$ to 1 when the receiver successfully retrieves the target image, and 0 otherwise.
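As a minimal sketch, assuming the standard REINFORCE form in which each action's log-probability is weighted by the shared episode reward and the sum is negated, the episode loss can be computed as below. The action probabilities are invented for illustration.

```python
import math

def reinforce_loss(action_probs, reward):
    """Negated sum of reward-weighted log-probabilities over an episode's actions."""
    return -sum(reward * math.log(p) for p in action_probs)

# Hypothetical episode: three symbol-generation actions, then one classification action.
episode_probs = [0.5, 0.25, 0.5, 0.8]
loss_success = reinforce_loss(episode_probs, reward=1.0)  # target retrieved
loss_failure = reinforce_loss(episode_probs, reward=0.0)  # target missed
```

With binary rewards, a failed episode contributes no gradient signal at all, which is one motivation for the reward variants discussed in Section 4.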

Implementation choices
Having described our basic setup above, we now list the different implementation variants that we study in the present paper. We refer to these implementation variants using a vector notation; for a binary trait Φ, a model for which Φ is implemented will be denoted as ⟨. . ., +Φ, . . .⟩; conversely, its absence is signaled with ⟨. . ., −Φ, . . .⟩.
Pretraining of the visual component. In order to ensure that the recurrent message encoders and decoders receive coherent, usable representations of the images, for some variants, we pretrain the image encoders' convolutions. In the remainder of the text, we denote as ⟨+P, . . .⟩ models that have undergone pretraining, and ⟨−P, . . .⟩ models that did not. We consider three pretraining objectives: an auto-encoding task and two classification tasks.
The auto-encoding pretraining consists in training the convolution stack along with an additional deconvolution stack to reproduce images provided as input, using a mean squared error loss:
$$\mathcal{L}_{AE} = \frac{1}{3hw} \sum_{c=1}^{3} \sum_{i=1}^{h} \sum_{j=1}^{w} \left(\hat{Y}_{cij} - Y_{cij}\right)^2$$
where $\hat{Y}$ is the reconstruction of the RGB image $Y$ of height $h$ and width $w$. Models pretrained with this objective are denoted as ⟨+P_AE, . . .⟩.
The first classification objective, which we dub "category-wise", corresponds to predicting which of the $2^5$ categories the input image corresponds to, and is learned using a cross-entropy loss:
$$\mathcal{L}_{CW} = -\sum_{k=1}^{2^5} \mathbb{1}[y = k] \log p(y = k \mid I)$$
where $\hat{y}$ is the vector $\left(p(y = 1 \mid I), \ldots, p(y = 2^5 \mid I)\right)$ corresponding to the classifier's probability distribution over possible labels. Models pretrained with this objective are denoted as ⟨+P_CW, . . .⟩.
The second classification objective, called "feature-wise", consists in predicting each of the 5 feature values of the input image, i.e., an aggregate of five binary classification sub-tasks. The loss function for this last objective $\mathcal{L}_{FW}$ is thus:
$$\mathcal{L}_{FW} = -\sum_{f=1}^{5} \log \hat{Y}_{f, y_f}$$
where $\hat{Y}$ is the structured prediction, such that $\hat{Y}_{fi}$ is the probability assigned to the $i$-th possible value of the $f$-th feature, and $y = (y_1, \ldots, y_5)$ is the vector of target feature values for this example. We denote models pretrained with this objective as ⟨+P_FW, . . .⟩.
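To illustrate the feature-wise objective, the sketch below sums one cross-entropy term per feature, assuming the loss is the plain (unweighted) sum of the five sub-task losses. The prediction values and the function name are hypothetical.

```python
import math

def feature_wise_loss(predictions, targets):
    """Sum of per-feature cross-entropies.

    predictions[f][i] is the probability assigned to value i of feature f;
    targets[f] is the index of the correct value for feature f.
    """
    return -sum(math.log(predictions[f][y_f]) for f, y_f in enumerate(targets))

# Made-up predictions for the five binary features of one image.
predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3], [0.4, 0.6]]
targets = [0, 1, 0, 0, 1]
loss = feature_wise_loss(predictions, targets)
```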
We also consider whether or not to freeze the parameters of the image encoder convolution stacks. Assuming the pretraining was successful, the resulting image vector representations should contain all the information necessary for models to succeed. In this case, freezing convolutions reduces the number of learnable parameters, which may help the optimization. Pretrained models whose convolution stacks are frozen are denoted as ⟨+P, +F, . . .⟩, whereas models whose convolutions (pretrained or not) are updated are denoted as ⟨. . ., −F, . . .⟩.
Rewards and regularization. One drawback of the pretraining methods and the adversarial sampling alike is that most of them (i.e., all except the auto-encoder method) require information which might not be available in other datasets, namely labels pertaining to the semantics of the images.
One possible technique not subject to this concern consists in adding an entropy term to the REINFORCE loss, as is sometimes done in emergent communication (e.g., Lazaridou et al., 2018; Chaabouni et al., 2019a). This entropy loss is defined as:
$$\mathcal{L}_{H} = -\beta_S \sum_{t} H_{S,t} - \beta_R H_R$$
where $\beta_S$ and $\beta_R$ are two scalar coefficients controlling the strength of this regularization, $H_{S,t}$ is the entropy of the probability distribution computed by the sender and used to select the $t$-th symbol of the message, and $H_R$ is the entropy of the probability distribution computed by the receiver. The scalar coefficients are set to $\beta_S = 10^{-2}$ and $\beta_R = 10^{-3}$. The use of this entropy term is denoted with ⟨. . ., +H, . . .⟩.
Another technique consists in redefining the reward system. Instead of associating each action of an episode with a binary reward r ∈ {0, 1}, the reward is defined as the probability that the receiver assigns to the target image, i.e., how confident it is in retrieving the target. The use of this confidence-based reward system is denoted with ⟨. . ., +C, . . .⟩.
The last technique that we study consists in deducting the recent average rewards as a baseline term $b$ (Sutton and Barto, 2018, §13):
$$\mathcal{L} = -\sum_{t} (r_t - b) \log p(a_t)$$
where $b$ is the average of $r$ over the last 1000 batches. The use of this baseline term is denoted with ⟨. . ., +B⟩.
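A baseline of this kind can be sketched with a fixed-size window of recent rewards. The class and function names are hypothetical, and storing one mean reward per batch is an assumption about how the 1000-batch average is maintained.

```python
import math
from collections import deque

class RunningBaseline:
    """Average of the mean rewards of the most recent batches (assumed window: 1000)."""

    def __init__(self, window=1000):
        self.history = deque(maxlen=window)

    def update(self, batch_mean_reward):
        self.history.append(batch_mean_reward)

    def value(self):
        # Before any batch has been seen, fall back to a baseline of 0.
        return sum(self.history) / len(self.history) if self.history else 0.0

def reinforce_loss_with_baseline(action_probs, reward, baseline):
    """REINFORCE episode loss where the baseline is subtracted from the reward."""
    return -sum((reward - baseline) * math.log(p) for p in action_probs)

baseline = RunningBaseline(window=1000)
for batch_reward in (0.5, 0.5, 0.6, 0.4):  # made-up batch rewards
    baseline.update(batch_reward)
b = baseline.value()

loss = reinforce_loss_with_baseline([0.5, 0.5], reward=1.0, baseline=b)
# When the reward equals the baseline, the episode contributes nothing to the loss.
null_loss = reinforce_loss_with_baseline([0.5, 0.5], reward=0.5, baseline=0.5)
```

The last line illustrates why rewards that stay very close to the baseline yield losses, and hence gradient updates, close to zero.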
While confidence-based rewards and a baseline can technically be applied jointly, doing so proves to be detrimental. None of the runs for models implemented as ⟨. . ., +C, +B⟩ yielded a successful communication protocol. We conjecture that this is due to the probability mass assigned to the target image being very close to the average reward (0.5) at the beginning of the training process, which leads to losses and gradient updates close to 0. In what follows, these two techniques are therefore treated as mutually exclusive.
Comparison with previous work. In our experiments, we exhaustively evaluate various design choices, which cover many architectures similar to those studied in earlier works. For instance, Lazaridou et al. (2018) would correspond to a ⟨−P, −F, −A, −H, −C, −B⟩ model, while Bouchacourt and Baroni (2018) adopt a model similar to ⟨+P_CW, +F, −A, −H, −C, −B⟩. In what follows, we do not focus on how specific earlier works fare, but instead attempt to develop a more global picture.

Automatic metrics
Communication efficiency. We primarily measure the performance of a model by its communication efficiency (c.e.), which we define as the average probability assigned by the model to the target image over a large number of evaluation instances.⁵ Evaluation instances involve all categories seen during training as well as additional categories (see Section 3). To assess how the agents handle unseen combinations of features at a finer level, we define base-c.e., gen.-c.e. and mixed-c.e. by restricting the two selected categories to two base categories, two generalization categories, and one of each, respectively.
⁵ Communication efficiency differs from accuracy, defined as the proportion of evaluation instances for which the target image is assigned a higher probability than the distractor. Accuracy can be maximal (100%) even with a very low communication efficiency (50 + ϵ%). Low communication efficiency is a sign of sub-optimal performance, as an effective communication system should describe the target category unambiguously, i.e., the agents should solve the game with a high degree of confidence. In practice, we find these two values to be highly correlated in our experiments, suggesting our models are well calibrated (Guo et al., 2017).
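The contrast between communication efficiency and accuracy can be made concrete with a small sketch. The probability lists are invented, and both helper functions are illustrative rather than taken from the experimental code.

```python
def communication_efficiency(target_probs):
    """Average probability assigned to the target image."""
    return sum(target_probs) / len(target_probs)

def accuracy(target_probs):
    """Share of instances where the target outscores the single distractor."""
    return sum(p > 1 - p for p in target_probs) / len(target_probs)

# Two hypothetical models: both always pick the target, with different confidence.
hesitant = [0.51, 0.52, 0.51, 0.53]
confident = [0.99, 0.98, 0.97, 0.99]

hesitant_scores = (accuracy(hesitant), communication_efficiency(hesitant))
confident_scores = (accuracy(confident), communication_efficiency(confident))
```

Both models reach maximal accuracy, but only the second one has a high communication efficiency.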
All of our metrics are generalized from single models to sets of models by computing their average across models (i) using, for each model, the value obtained during the evaluation phase in which it reaches its highest communication efficiency and (ii) discarding any model which never reaches a communication efficiency of 60% or above at any point of the training process. Any model that does reach a communication efficiency of 60% or above is said to be "successful". The convergence ratio (cvg.) of a set of models is the proportion of successful models in this set.
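The aggregation procedure can be sketched as follows. The data layout (one list of `(c.e., metric)` pairs per run, one pair per evaluation phase) and the function name are assumptions made for illustration.

```python
def aggregate(runs, threshold=0.6):
    """Return (convergence ratio, mean metric over successful runs).

    runs: list of runs; each run is a list of (c.e., metric) pairs,
    one pair per evaluation phase.
    """
    # (i) For each run, keep the phase with the highest communication efficiency.
    best = [max(run, key=lambda phase: phase[0]) for run in runs]
    # (ii) Discard runs that never reach the success threshold.
    successful = [phase for phase in best if phase[0] >= threshold]
    cvg = len(successful) / len(runs)
    mean_metric = (
        sum(metric for _, metric in successful) / len(successful)
        if successful else None
    )
    return cvg, mean_metric

# Three made-up runs with two evaluation phases each.
runs = [
    [(0.55, 0.1), (0.90, 0.8)],
    [(0.50, 0.3), (0.52, 0.2)],
    [(0.70, 0.6), (0.65, 0.9)],
]
cvg, mean_metric = aggregate(runs)
```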
Abstractness. We task receivers with recognizing not the original image I shown to senders, but another target I′ of the same category. This is meant to encourage senders to describe not so much the input image as its category. We evaluate this aspect using the abstractness of a model: where $p_R(J)$ is the probability assigned by the receiver to the image J, I and I′ are the original and target images, and m is the sender's message for the input I. Abstractness is 0 if all the mass is on the original image, and 1 when it is distributed evenly.
Scrambling resistance. To measure how sensitive to symbol ordering receivers are, we define the scrambling resistance of a model by comparing the probability assigned to the target image by the receiver when provided with the sender's message m, and when provided with a randomly permuted version m′ of it. More precisely, given a message m, we compute: where $a_t$ is the $t$-th symbol of the message produced by the sender, $p_R(x)$ is the probability of the receiver selecting the target image given the message $x$, and σ is a random permutation of the integer interval ⟦1, n⟧. The scrambling resistance of a model is an average of sr over a large number of evaluation instances.
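The construction of the permuted message m′ can be sketched as below. This covers only the random permutation step, not the score itself; the message, the seed and the function name are illustrative assumptions.

```python
import random

def scramble(message, rng):
    """Return a randomly permuted copy of a message (a list of symbol indices)."""
    permuted = list(message)  # copy, so the original message is left untouched
    rng.shuffle(permuted)
    return permuted

rng = random.Random(0)      # fixed seed, for reproducibility of the sketch
message = [3, 7, 7, 1, 12]  # made-up message over a 16-symbol alphabet
scrambled = scramble(message, rng)
```

The receiver would then be queried once with `message` and once with `scrambled`, and the two target probabilities compared.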
Semantic probes. In order to determine which features of the original/target category are described in a sender's message, we implement a probing method based on decision trees. We convert any message m into a bag-of-symbols vector $u \in \mathbb{N}^{16}$, such that $u_i$ is the number of occurrences of symbol i in m. Given a set of messages each associated with its corresponding original/target category, for each of the five features, we can train a decision tree to predict the values of the feature based on the bag-of-symbols representation of the messages. While the messages may very well encode information under a form that cannot be decoded by such a simple system, high accuracy from a decision tree is proof that the corresponding feature is consistently described in the messages.

Results

Global performance
Table 1 shows the performance of all of the runs we have performed, aggregated based on the reward system they use (binary rewards, confidence-based reward, or binary rewards with a baseline term), on whether the visual convolution stacks are pretrained (without differentiating between the various pretraining objectives) and, if so, on whether these convolution stacks are frozen during training. We observe that the most impactful implementation choice is whether or not to use a baseline term (⟨. . ., −C, +B⟩). Improvements with +B are much more consistent and pronounced than those obtained with confidence-based rewards (⟨. . ., +C, −B⟩) or pretraining (⟨+P, . . .⟩).
On its own, pretraining brings some degree of improvement comparable to what we see in models implemented as ⟨. . ., +C, −B⟩. Setups involving freezing pretrained convolution stacks (⟨+P, +F, . . .⟩) reach a convergence ratio of 1 at the expense of a downgrade in communication efficiency. Moreover, pretraining without freezing weights (⟨+P, −F, . . .⟩), while not detrimental, does not improve performance unless used jointly with either +C or +B. Optimal performance is attested when using pretraining with a baseline term (⟨+P, −F, . . ., −C, +B⟩).
Table 2 shows the performance (top) of all of the runs that we have performed and (bottom) of all runs with the baseline term and without frozen convolution stacks, aggregated based on whether they are trained with the entropy penalty. We observe that, while in general using this regularization term is an efficient way to boost both the convergence ratio and the communication efficiency of converging runs, this positive effect does not persist with ⟨. . ., −F, . . ., −C, +B⟩ runs (see below for more information about the drop in cvg. in this case).
Because of their high performance, we focus on models implemented as ⟨. . ., −F, . . ., −C, +B⟩ in the remainder of this discussion. A communication efficiency around 97% might intuitively seem an indicator of excellent performance, but note that, should the sender completely ignore one semantic feature of the images, the communication efficiency could still rise up to 30.5/31 (≈ 98.4%): this value is obtained when, among the 31 possible categories for the distractor, 30 lead to perfect retrieval of the target image and 1 leads to chance retrieval. As such, none of the performances seen so far guarantees that all features are encoded in the messages.
Table 3 shows the performance of the runs aggregated based on the sampling strategy for distractors and the use of pretraining for the visual convolution stacks (still without differentiating between the various pretraining objectives). We see that, compared to uniform sampling, the adversarial sampling strategy systematically and substantially increases the communication efficiency. Nonetheless, the adversarial strategy can induce a lower convergence ratio when the convolution stacks are not pretrained and an entropy penalty is added, suggesting that this sampling strategy and the entropy penalty used jointly make training too challenging for agents with randomly initialized convolution stacks. In all, the higher performance observed with the adversarial sampling strategy leads us to narrow down our discussion once more, this time focusing on models implemented as ⟨. . ., −F, +A, . . ., −C, +B⟩.
Finally, we focus on the effect of the different pretraining objectives in Table 4. Though all three pretraining objectives are helpful, we observe the highest improvement in communication efficiency with the two classification objectives. Among them, the category-wise objective outperforms the feature-wise objective. While the feature-wise objective provides feature-level guidance, the category-wise pretraining regimen directly trains the convolution stacks to tease apart images of different categories, which is what the signaling game requires of them. We hypothesize that the feature-wise objective might be superior when the category space is substantially larger and more complex.
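The 30.5/31 ceiling for a sender that ignores one feature follows from a short computation, sketched here with exact fractions. The function name is hypothetical, and the sketch assumes distractor categories are drawn uniformly.

```python
from fractions import Fraction

def max_ce_ignoring_one_feature(n_distractor_categories):
    """Best average target probability when exactly one distractor category
    is indistinguishable from the target (chance level, 1/2) and all the
    others are retrieved perfectly (probability 1)."""
    perfect = n_distractor_categories - 1
    return (perfect + Fraction(1, 2)) / n_distractor_categories

bound = max_ce_ignoring_one_feature(31)  # 31 possible distractor categories
bound_percent = round(float(bound) * 100, 1)
```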

Generalization and language analysis
Having looked at how to foster reliability and high performance, we now turn to a study of how well the models generalize to unseen items and whether their messages display language-like characteristics, as the literature often remarks that such characteristics should not be taken for granted (Mu and Goodman, 2021; Patel et al., 2021).
However, when grouping runs implemented as ⟨. . ., +A, . . ., −C, −B⟩ depending on their pretraining and convolution freezing, we find one group of outliers: ⟨P_AE, +F, +A, . . ., −C, −B⟩ runs have an abstractness of 0.958. This value is significantly lower than for each of the six other groups (as shown by a Pitman test; $p < 10^{-6}$ in all cases). Convolution stacks pretrained as auto-encoders learn to capture the specificity of each image, which apparently permeates the emergent languages if subsequently frozen.
We also observe an opposite, albeit weaker, effect with the category-wise pretraining objective. ⟨P_CW, +F, +A, . . ., −C, −B⟩ runs have an abstractness of 0.998, higher than the 0.994 of ⟨P_CW, −F, +A, . . ., −C, −B⟩ runs. The difference (p < 0.04, Pitman test) indicates that in such cases, fine-tuning the convolution stacks leads the agents to include image-specific information in their messages.
Scrambling resistance. Scrambling resistance yields high values, ranging from 0.892 when using auto-encoder pretraining to 0.915 when using feature-wise pretraining. In other words, the receiver is able to recognize a category based on a randomly permuted message with a high degree of accuracy. This property, however, does not entail that the sender produces symbols in a (near) random order. Indeed, even English, which requires a rather strict word order, arguably has a high scrambling resistance: it is natural to associate the scrambled sentence "cube a there blue is" with a picture of a blue cube rather than that of a blue sphere (or a red cube, etc.). High scrambling resistance points towards the possibility that each symbol is loaded with an intrinsic meaning, the interpretation of which is fairly independent of its position, in contrast with, e.g., the digits in positional numeral systems (which are compositional systems with low scrambling resistance).
Generalization. As we saw in Section 6.1, the highest communication efficiency we observe, of 0.985, is obtained with the ⟨P_CW, −F, +A, . . ., −C, −B⟩ implementation. Let us recall that this means that when the source/target category and the distractor category are selected from the whole set of categories, the receiver puts on average 0.985 of the probability mass of its choice distribution on the target image. As for the base-c.e. (when both categories are base categories, i.e., categories seen during training) of this implementation, its value is near perfect, above 0.999. Its gen.-c.e. (when both categories are generalization categories) is also very high, at 0.997. These different values indicate that the models are able to generalize very well not only to unseen images but also to new categories (i.e., unseen combinations of features).
For this same implementation, the mixed-c.e. (when only one of the categories is a base category) drops to 0.971. Recall that this is the only case where target and distractor may differ by a single feature. Even if agents disregard one feature, their mixed-c.e. can still theoretically reach up to 14.5/15 (≈ 96.7%). Hence, ⟨P_FW, −F, +A, . . ., −C, −B⟩ runs communicate about all features, despite it not being required by the training objective. Similarly, ⟨P_CW, −F, +A, . . ., −C, −B⟩ runs obtain a mixed-c.e. of 0.967 (almost equal to the threshold) and ⟨−P, −F, +A, . . ., −C, −B⟩ runs reach a mixed-c.e. of 0.964 (slightly below).
Semantic content. Scrambling resistance scores highlight that the semantic contents of symbols are mostly position-insensitive. This entails that our decision-tree-based probes, which rely on bag-of-symbols representations of the messages, are relevant. Shape is less accurately conveyed than other image features. This indicates that shape is harder to identify than color, size or position and that, since the training process does not incentivize the agents to describe all features, they systematically focus on the four easiest. Interestingly, applying an entropy penalty during training strongly drives the agents to communicate about the shape. Moreover, models pretrained with the auto-encoder objective lead to higher values than any others. The difference in shape recognition between this group and the others is always significant ($p < 10^{-2}$).

Conclusions
Two broad conclusions emerge from our experiments. Firstly, we saw that not all implementations perform equally well. We demonstrated how the use of a baseline term or an adversarial input sampling mechanism was necessary to reach high performance. While pretraining convolution stacks can prove beneficial in limited circumstances, not fine-tuning them afterwards may prove to be highly detrimental. In all, a well-designed implementation can learn reliably and generalize to new images and combinations of features.
Secondly, we have made a case for the need for fine-grained methods when analyzing the emergent communication protocol. We have introduced an array of tools. Among them, scrambling resistance was used to demonstrate that each symbol in our languages has a semantic contribution independent of its position. Decision-tree-based probes informed us that these symbols were put to use to systematically describe all but one of the input image's features, shape being consistently neglected, though not entirely ignored despite the possibility we left open through the design of the training instances. These results also connect with design choices: for instance, we saw how entropy regularization and auto-encoder pretraining strengthened the prominence of shape in the messages.
We next plan to experiment with a partition of categories between base and generalization that forces all features to be encoded in the messages, and then use decision trees and other methods to automatically describe the syntax and the semantics of the emergent communication protocols in simple terms, so as to better characterize how these protocols relate to natural language. We also plan to study the impact of the semantic complexity of the input images on these emergent protocols, using a richer set of features and values, and using unlabeled real-world scenes. Lastly, our findings will have to be confirmed in setups involving other games such as navigation tasks.

Limitations
There are two main limitations to the present work. First and foremost is the computational cost associated with the present experiments. We present here results and analyses gleaned over 10 runs, 7 pretraining regimens, 8 RL gradient propagation variants and 2 data sampling approaches, for a total of 1120 models. While training any one of our models is cheap (less than 3 hours on a single NVIDIA A100 GPU), the total number of models may pose a challenge for future replication studies and comes at an environmental cost. This also prevented us from selecting an optimal batch size, learning rate, and so on for specific setups: as described in Appendix A, we set these values globally prior to running experiments. This may affect results and impact conclusions.
Second is the theoretical scope of the current paper. We have focused solely on single-turn, two-agent signaling game setups. The recommendations and conclusions drawn in the present paper may or may not translate to other language games. Likewise, while this study aims at exhaustiveness, material limitations have bounded the scope of implementation choices we studied. Some approaches, such as KL regularization (Geist et al., 2019), have thus been left out of the present study.

A Hyperparameters selection and training details
Throughout our experiments, we allow agents to generate messages of up to 10 symbols, using a vocabulary of 16 symbols. We train all models for up to 100 epochs of 1000 batches each, using 128 training instances per batch. We repeat each training procedure across 10 random seeds. Parameters are optimized with RMSProp (Hinton et al., 2012).
Prior to any experiment reported here, we ran a small-scale grid search to select a learning rate most likely to reliably induce a successful emergent communication protocol. We exhaustively test learning rates in $\{10^{-x/2} \mid 4 \le x \le 12\}$ and measure the convergence ratio for groups of 10 runs trained for 50 epochs. Results, displayed in Figure 2, suggest an optimal learning rate of $10^{-4}$, which we adopt in all subsequent experiments.
In Section 4, hyperparameter values for the pretraining procedures were selected based on the models' lack of further improvement on a held-out subset of the training data. Using 1000 steps per epoch and batches of 128 images, we found that 5 epochs and a learning rate of $3 \cdot 10^{-4}$ were sufficient to guarantee an accuracy close to 100% for the classification pretraining tasks, whereas the auto-encoding task required 40 epochs with the same learning rate.
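For concreteness, the learning-rate grid used in the search can be enumerated as below, assuming that x ranges over the integers from 4 to 12.

```python
# Assumption: x takes integer values, giving half-decade steps from 1e-2 down to 1e-6.
learning_rates = [10 ** (-x / 2) for x in range(4, 13)]

# The retained value, 1e-4, corresponds to x = 8.
selected = 10 ** (-8 / 2)
```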

B.1 Meaning-Form Correlation
In compositional languages, the meaning and the form of messages tend to be correlated: minute changes in form (e.g., substitutions of a single token) are expected to correspond to minute changes in meaning. To study the compositionality of the communication protocols set up by the agents, one can also measure their meaning-form correlation (MFC).
Meaning-form correlation, or topological similarity, consists in comparing how the distance between two messages relates to the distance between their semantic contents. More formally, it is computed as a Spearman correlation between two paired samples of distance measurements $D_F = (d_F(o_i, o_j))_{1 \le i < j \le n}$ and $D_M = (d_M(o_i, o_j))_{1 \le i < j \le n}$ over the same set of observations, with the assumption that one distance function ($d_F$) captures variation in form and the other ($d_M$) captures variation in meaning. For clarity, we denote an MFC correlation score using the symbol τ. In our case, we have compared the Jaccard index of the two messages as bags-of-symbols to the Hamming distance between the two corresponding image categories.¹⁵
MFC scores are not easy to interpret by themselves, but it can be illuminating to see how they vary and correlate with other properties. While the distribution of MFC and its relation with communication efficiency (c.e.) is quite complex, we have observed that difficult setups (e.g., where a globally useful design choice is not implemented, or where an adversarial sampling strategy factors in) display two trends. On the one hand, they exhibit lower MFC scores. On the other hand, for such a setup, the MFC scores of individual runs are more in line with performance (i.e., they display a stronger Spearman correlation or a weaker anti-correlation with communication efficiency). For example, the two top rows of Table 6 show a case in which the absence of a baseline term entails a lower MFC and a weaker anti-correlation with c.e. The middle two rows show a case in which the use of the adversarial distractor sampling strategy during training also entails a lower MFC and a stronger correlation with c.e. The two bottom rows show another case in which the adversarial training strategy has a similar effect. In addition, the last row shows that when the training is made particularly easy, the models produce on average messages that are very compositional (in the sense reflected by the MFC), but that the best models diverge from this: the best models are the ones in which the two agents develop some form of co-adaptation at odds with compositionality. This echoes the findings of Chaabouni et al. (2020), who highlight that MFC is not necessarily tied to generalization capabilities.
¹⁵ Using the Levenshtein distance instead of the Jaccard index yields the same conclusions, as MFC scores derived from either distance are highly significantly correlated.
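The MFC computation described in this section can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: messages are taken to be sequences of symbols compared as multisets, the form distance is taken to be one minus the multiset Jaccard index (so that both quantities are distances), categories are taken to be tuples of feature values, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def jaccard_distance(msg_a, msg_b):
    """1 - multiset Jaccard index of two messages seen as bags of symbols (assumption)."""
    bag_a, bag_b = Counter(msg_a), Counter(msg_b)
    union = sum((bag_a | bag_b).values())  # element-wise max of counts
    if union == 0:
        return 0.0  # two empty messages are treated as identical
    inter = sum((bag_a & bag_b).values())  # element-wise min of counts
    return 1.0 - inter / union


def hamming_distance(cat_a, cat_b):
    """Number of features on which two categories differ."""
    return sum(a != b for a, b in zip(cat_a, cat_b))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks.

    Raises ZeroDivisionError if either sample is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(rx, ry))
    var_x = sum((x - mx) ** 2 for x in rx)
    var_y = sum((y - my) ** 2 for y in ry)
    return cov / (var_x * var_y) ** 0.5


def mfc(messages, categories):
    """Correlate form distances with meaning distances over all pairs i < j."""
    pairs = list(combinations(range(len(messages)), 2))
    d_form = [jaccard_distance(messages[i], messages[j]) for i, j in pairs]
    d_meaning = [hamming_distance(categories[i], categories[j]) for i, j in pairs]
    return spearman(d_form, d_meaning)
```

On a toy protocol in which each feature value is encoded by its own dedicated symbol, form and meaning distances are monotonically related, so this sketch yields a score at the maximum of the scale. Swapping `jaccard_distance` for an edit distance would give the Levenshtein variant mentioned in the footnote.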

B.2 Decision Trees
Full results for the decision-tree semantic content probes are displayed in Table 7. As noted in the main text, the behavior for size and position features is very similar to that for color, and very distinct from that for shape.

Figure 1: The Lewis Signaling game considered in this paper. The sender (left) is shown the original image and produces a message that the receiver (right) uses to distinguish the target image from the distractor image. The original and target images share the same semantic category (here: top right big red cube).
Distractor sampling. By default, during training, we first select the original/target category $c_t$ uniformly at random, before selecting the distractor category $c_d$ uniformly among the remaining categories. A second strategy that we envision to improve performance consists in adversarially sampling $c_d$ instead. More precisely, when we evaluate the agents at the end of each training epoch, we derive count-based estimates of the probability $P(\text{fail} \mid (c_t, c_d))$ of communication failure for each pair $(c_t, c_d)$. At training time, $c_d$ is sampled with a probability proportional to $P(\text{fail} \mid (c_t, c_d))$. At evaluation time, $c_d$ is still sampled uniformly. We denote the use of this adversarial sampling during training as ⟨..., +A, ...⟩, and its absence as ⟨..., −A, ...⟩.
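The two sampling strategies can be sketched as follows. This is a minimal sketch and not the implementation used in the experiments: the class and method names are hypothetical, pairs never observed at evaluation are assumed to have an estimated failure probability of zero, and the sampler is assumed to fall back to uniform sampling when every estimate for a given target is zero. Whether counts are reset at each epoch is not specified in the text and is left to the caller here.

```python
import random
from collections import defaultdict


class DistractorSampler:
    """Uniform or adversarial sampling of a distractor category (hypothetical API)."""

    def __init__(self, categories, rng=None):
        self.categories = list(categories)
        self.rng = rng or random.Random()
        self.trials = defaultdict(int)    # evaluation counts per (c_t, c_d)
        self.failures = defaultdict(int)  # communication failures per (c_t, c_d)

    def record(self, c_t, c_d, success):
        """Update the counts with one evaluation outcome."""
        self.trials[(c_t, c_d)] += 1
        if not success:
            self.failures[(c_t, c_d)] += 1

    def p_fail(self, c_t, c_d):
        """Count-based estimate of P(fail | (c_t, c_d)); 0 if unseen (assumption)."""
        n = self.trials.get((c_t, c_d), 0)
        return self.failures.get((c_t, c_d), 0) / n if n else 0.0

    def sample(self, c_t, adversarial=True):
        """Draw c_d != c_t, proportionally to P(fail) if adversarial, else uniformly."""
        candidates = [c for c in self.categories if c != c_t]
        if adversarial:
            weights = [self.p_fail(c_t, c) for c in candidates]
            if sum(weights) > 0:
                return self.rng.choices(candidates, weights=weights, k=1)[0]
        # Uniform sampling: the default strategy, and the fallback (assumption)
        # when no failure has been observed for this target category.
        return self.rng.choice(candidates)
```

In this sketch, training with +A would call `sample(c_t, adversarial=True)`, while training with −A and all evaluation would call `sample(c_t, adversarial=False)`.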

Table 1: Effects of pretraining and reward redefinition on convergence and communication efficiency.

Table 4: Effects of pretraining objectives