Learning to translate by learning to communicate

We formulate and test a technique that uses Emergent Communication (EC) with a pre-trained multilingual model to improve modern unsupervised NMT systems, especially for low-resource languages. It has been argued that the current dominant paradigm in NLP of pre-training on text-only corpora will not yield robust natural language understanding systems, and the need for grounded, goal-oriented, and interactive language learning has been highlighted. In our approach, we embed a multilingual model (mBART; Liu et al., 2020) in an EC image-reference game, in which the model is incentivized to use multilingual generations to accomplish a vision-grounded task. The hypothesis is that this will align multiple languages to a shared task space. We present two variants of EC Fine-Tuning (Steinert-Threlkeld et al., 2022), one of which outperforms a backtranslation-only baseline in all four languages investigated, including the low-resource language Nepali.


Introduction
While neural machine translation (NMT) systems are one of the great success stories of natural language processing (Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016), typical methods rely on large quantities of parallel text (i.e. existing human-translated texts) as gold data for supervised learning. These approaches are thus difficult to apply to low-resource languages, which lack large bodies of such data (Joshi et al., 2020). To extend this vital language technology to low-resource languages, many have focused on Unsupervised NMT (UNMT): the task of building NMT systems without any parallel text (Artetxe et al., 2018; Lample et al., 2018a,c; Lample and Conneau, 2019; Conneau et al., 2020).

Figure 1: Illustration of our modeling process. For the pre-training stage, we use the off-the-shelf mBART (Liu et al., 2020). We fine-tune the model for translation with Emergent Communication.
Typical approaches to UNMT rely on large pre-trained multilingual models (Lample and Conneau, 2019; Conneau et al., 2020; Liu et al., 2020; Song et al., 2019) and the method of back-translation (Sennrich et al., 2016b) to iteratively generate synthetic parallel text. These approaches, however, still rely on plain text information alone. For that reason, the resulting models are considered ungrounded: there is no link between the text and the external world. This may limit model abilities.
Despite NLP breakthroughs stemming from large-scale pre-training on raw text corpora with self-supervised learning (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Conneau et al., 2020; Liu et al., 2020; Brown et al., 2020, i.a.), several recent results suggest limitations in model generalization (McCoy et al., 2019; Niven and Kao, 2019; Ettinger, 2020; Rogers et al., 2020, i.a.). More fundamentally, several have argued that pre-training on text alone will not deliver fully general and robust NLP systems. For example, using several detailed thought experiments, Bender and Koller (2020) argue that models trained on text alone will not, in principle, be able to recover either the conventional meaning of expressions or the communicative intent of an expression in context. Their arguments highlight the importance of the interaction between linguistic expressions and extra-linguistic communicative intents (e.g. acting in the world, executing programs). Similarly, Bisk et al. (2020) articulate progressively broader world scopes in which language use is embedded, and argue that present pre-training methods work at a relatively limited scope. They too emphasize the importance of embodied interaction with the environment and with the social world for future NLP systems.

In this paper, we propose to use methods from the field of emergent communication (EC) (Wagner et al., 2003; Skyrms, 2010; Lazaridou and Baroni, 2020) to improve UNMT systems. EC studies artificial agents communicating with each other to accomplish particular environmental goals. EC is a subfield of reinforcement learning, wherein language (i.e. the communication protocol) is shaped by rewards determined by interacting with an external environment and with other agents. Typical work in this area starts from a tabula rasa and studies under what conditions (e.g. environments, tasks/goals, social settings) the resulting communication protocols among agents resemble human language, along axes like word-length economy (Chaabouni et al., 2019a), word-order biases (Chaabouni et al., 2019b), and compositionality (Andreas, 2019; Chaabouni et al., 2020; Steinert-Threlkeld, 2020; Geffen Lan et al., 2020), among others (Mu and Goodman, 2021).
Our approach leverages the insight that people learn new languages by using them to do things (e.g. order food, buy train tickets); our machines should do the same. We improve upon a standard UNMT system by taking a large pre-trained multilingual model (mBART) and embedding it in an EC task, having it participate in goal-directed communication (in addition to back-translation). Communication should promote translation in the following way. Translation can be viewed as 'aligning' model representations for sentences in several languages. In the supervised case, parallel text instructs the model how to do this alignment. In the unsupervised case, through communication, each model aligns its language representations with the same shared environment, thereby promoting alignment between the languages themselves. This work is thus an instance of the wider framework of Emergent Communication Fine-Tuning (EC-FT) (Steinert-Threlkeld et al., 2022).
In what remains, we describe our pipeline for EC fine-tuning (Section 2) and the experiments that we conduct to demonstrate its benefit for UNMT (Section 3). We then present our experimental results, in which EC yields benefits for every language we study, with particularly strong gains for the low-resource language Nepali (Section 4). Finally, we study some manipulations of our training pipeline (Section 5) before discussing the implications of these experiments (Section 6) and situating them in the context of existing work (Section 7).
Our contributions are the following: (i) we demonstrate that EC-FT can be used to improve upon UNMT baselines; (ii) we give a proof-of-concept for the viability of using modern pre-trained language models in an EC scenario; (iii) we articulate a view of EC-FT as a generalized and parameterizable framework.

Methodology
As shown in Figure 1, the pipeline that we introduce here consists of three main phases: (1) begin with a pre-trained multilingual model, which either already has an encoder and decoder, or from which this seq2seq stack can be initialized; (2) conduct emergent-communication training using image and/or text embeddings (Figure 2); (3) use iterative backtranslation (Sennrich et al., 2016a).

Figure 2: Emergent Communication Fine-Tuning: the task is a standard image reference game from the EC literature, but with the sender and receiver initialized from a pre-trained multilingual decoder and encoder. The communication language alternates between the two languages in the translation pair being fine-tuned.
For step (2), we test two versions of the EC fine-tuning task. In the first (I2I-EC), the EC step uses only image embeddings, and the model must select the original input image from among distractors, based on a text generation (akin to a caption). In the second (T2I-EC), the communication game involves gold captions, instead of only image features: based on a caption, the model must generate a translation of it, on the basis of which the original image must be selected from among distractors.
First, we introduce some notation. We use E_m and D_m for the multilingual encoder and decoder, respectively, which are parameterized by θ_E and θ_D. This formulation of our pipeline leaves many concrete choices open. In the remainder of this section, we describe the specific implementation of this process used in our experiments.

Pipeline Components
Pre-trained Model We use mBART(-large) (Liu et al., 2020), which has demonstrated strong unsupervised translation performance in several languages. mBART employs seq2seq pre-training, encoding a "noised" input sequence and then reconstructing the original sequence with the decoder, over a collection of 25 languages. mBART's encoder-decoder architecture and corresponding seq2seq training make it a natural fit for our EC experiments, in which a multilingual decoder and encoder are used to send and receive natural language messages. We use θ_E^PT to denote the parameters of the pre-trained encoder, and mutatis mutandis for the pre-trained decoder.
Backtranslation Iterative backtranslation allows a model (usually pre-trained) to achieve some level of translation performance while training only on monolingual data (Section 7). Our baseline system is mBART fine-tuned with backtranslation only. In the EC-FT case, backtranslation is always performed last, so that the model is tuned for translation immediately before it is evaluated.
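The iterative backtranslation loop can be sketched schematically as follows. This is a minimal illustration, not our actual codebase: `model.translate` and `step_fn` are hypothetical stand-ins for (gradient-free) generation and a supervised training update, and direction labels are arbitrary.

```python
def iterative_backtranslation(model, mono_a, mono_b, steps, step_fn):
    """Schematic on-the-fly backtranslation: the current model generates a
    synthetic source from real monolingual text, and is immediately updated
    to reconstruct the real text from the synthetic source. Directions
    alternate each step so both translation directions improve together."""
    for step in range(steps):
        if step % 2 == 0:
            real = next(mono_b)                                  # real text in language B
            synthetic = model.translate(real, direction="b->a")  # synthetic A (no gradients)
            step_fn(model, src=synthetic, tgt=real, direction="a->b")
        else:
            real = next(mono_a)                                  # real text in language A
            synthetic = model.translate(real, direction="a->b")  # synthetic B
            step_fn(model, src=synthetic, tgt=real, direction="b->a")
    return model
```

Because the model is updated before each new synthetic batch is generated, data quality and model quality improve incrementally together.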
Image-to-Image EC (I2I-EC) Our emergent communication framework consists of two main subtasks. First, an agent (the sender, a decoder) must take in an image encoding and produce a natural language description of it. The generation language may vary; there will be several in our experiments. Next, another agent (the receiver, an encoder) takes in the generated text and uses it to pick the described image from a set of distractors. In the EC literature, this is referred to as a standard image reference game (see Figure 2). Let i ∈ R^{d_i} be an image embedding (d_i is the dimension of these embeddings, which may come from a vision model). We also assume that we have a reshaper R(i; θ_R) which maps images to R^{d_m}.
Because mBART is not natively multi-modal, some adaptations are made to allow it to generate a description of an image. In particular, the image embedding cannot simply be the first token to the sender, since mBART reserves this position for a special language-identification token. Further, it is not obvious that a pre-trained transformer decoder's cross-attention can be "turned off" without affecting overall performance. For these reasons, we pass the image embedding into an "unroller" U (one auto-regressive transformer layer) to generate a sequence of M embeddings in R^{M×d_m}, where M is a hyperparameter. This sequence is then used as the keys and values in the sender's cross-attention.
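One way to realize such an unroller in PyTorch is sketched below. This is an illustrative reconstruction under stated assumptions, not the authors' exact module: the class name, the choice of `nn.TransformerDecoderLayer`, and the head count are all hypothetical, and the image vector is assumed to have already been mapped into model space by the reshaper R.

```python
import torch
import torch.nn as nn

class Unroller(nn.Module):
    """Sketch of the 'unroller' U: expands one image embedding into a
    sequence of M vectors with a single autoregressive transformer layer,
    so the sequence can serve as keys/values for the sender's
    cross-attention."""

    def __init__(self, d_model: int, num_steps: int, nhead: int = 8):
        super().__init__()
        self.num_steps = num_steps  # M, a hyperparameter
        self.layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (batch, d_model), already reshaped by R into model space
        seq = img.unsqueeze(1)  # start from a length-1 sequence
        for _ in range(self.num_steps - 1):
            L = seq.size(1)
            # causal mask so generation is autoregressive
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            out = self.layer(seq, memory=img.unsqueeze(1), tgt_mask=causal)
            seq = torch.cat([seq, out[:, -1:, :]], dim=1)  # append next vector
        return seq  # (batch, M, d_model)
```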
We auto-regressively generate from the sender's distributions S, where each distribution S_K is produced by the decoder D_m conditioned on a language-ID token LID, the prefix T_{<K} of text generated at previous time steps, and (via cross-attention) the unrolled image sequence. The sampling required for discrete generation is not differentiable, so we use the straight-through Gumbel-Softmax estimator (Jang et al., 2017; Maddison et al., 2017) with temperature τ = 1.0. T := GS-ST(S) is the sequence of one-hot vectors sampled in this way.
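The straight-through estimator is available directly in PyTorch via `gumbel_softmax` with `hard=True`; a minimal sketch (the surrounding autoregressive decoding loop is omitted, and the function name is ours):

```python
import torch
import torch.nn.functional as F

def sample_message(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-Softmax sampling.
    logits: (batch, seq_len, vocab) sender distributions S.
    Returns one-hot vectors T = GS-ST(S): the forward pass is a hard
    (discrete) sample, while gradients flow through the soft relaxation."""
    return F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
```

The one-hot output can be multiplied with the embedding matrix to feed the receiver while keeping the whole game differentiable end-to-end.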
The receiver consumes this generated 'caption': E_m(T) ∈ R^{K×d_m}. To produce a single representation of the image, we use an 'aggregator' A which takes this sequence of representations and pools it into a single one, A(E_m(T); θ_A) ∈ R^{d_m}. The score for each of the candidate images is the inverse of the mean squared error between the image and the receiver's final representation. The loss for the image selection task is then cross-entropy among the image candidates. This loss partially follows Lee et al. (2018), though they jointly train on supervised caption generation during EC.
Given the original image i and a set {i_m}_{m=1}^M of distractor images, let the image selection loss (1) be the cross-entropy of the correct candidate under a softmax over the candidate scores, where Θ = {θ_D, θ_E, θ_R, θ_A, θ_U} and the softmax is taken over the distractor images {R(i_m)} together with the target R(i).
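A concrete sketch of this loss, reading "inverse of the mean squared error" as negative MSE (so smaller errors yield higher scores under the softmax); the function name and tensor layout are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def image_selection_loss(receiver_repr, candidates, target_idx):
    """Contrastive image-selection loss.
    receiver_repr: (batch, d) pooled representation A(E_m(T)).
    candidates: (batch, C, d) reshaped candidate images (target + distractors).
    target_idx: (batch,) index of the correct image among the candidates."""
    # (batch, C): higher score = smaller MSE to the receiver's representation
    scores = -((candidates - receiver_repr.unsqueeze(1)) ** 2).mean(dim=-1)
    # cross-entropy over the candidate set
    return F.cross_entropy(scores, target_idx)
```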
Finally, because EC can cause significant language drift (Lee et al., 2018, 2019; Lu et al., 2020; Lazaridou et al., 2020), we use KL regularization (Havrylov and Titov, 2017; Baziotis et al., 2020) to ensure that the sender's output distribution does not drift too far from the distribution of an auxiliary causal language model (CLM; this model is not trained as part of EC), giving a KL term (2). Combining equations (1) and (2) and averaging over iterations of the game, the final EC loss (3) is the selection loss plus λ times the KL term, with λ a hyperparameter.
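A sketch of the KL term and the combined objective, assuming both the sender and the frozen CLM expose per-token logits over the same vocabulary; the function names and the λ value shown are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def kl_regularizer(sender_logits, clm_logits):
    """Per-token KL divergence from the sender's distribution p to the
    frozen CLM's distribution q, averaged over batch and time steps.
    Shapes: (batch, seq_len, vocab). clm_logits are detached since the
    CLM is not trained during EC."""
    log_p = F.log_softmax(sender_logits, dim=-1)
    log_q = F.log_softmax(clm_logits.detach(), dim=-1)
    return (log_p.exp() * (log_p - log_q)).sum(-1).mean()  # KL(p || q)

def ec_loss(selection_loss, sender_logits, clm_logits, lam=0.5):
    """Combined EC objective: selection loss plus lambda-weighted KL term."""
    return selection_loss + lam * kl_regularizer(sender_logits, clm_logits)
```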
Text-to-Image EC (T2I-EC) The text-to-image EC task is identical to I2I-EC, except in what is presented to the sender via cross-attention. In T2I-EC, monolingual gold captions are used in the cross-attention for the emergent generation, after being embedded by the encoder E_m.
In other words, given c_i as a caption for image i, T2I-EC still uses the EC loss (3), but without the unroller for the sender: the sender's cross-attention now attends to E_m(c_i) instead of the unrolled image sequence. As in I2I-EC, the image descriptions are generated in either the caption language (here, English) or another translation target language. Importantly, the emergent generation need not be identical to the gold caption. This is desirable, since there may be several valid paraphrases of a given translation/caption. Similarly, we only require gold captions in one language, not every language; for this reason, there is no implicit parallel text, and the translation task can still be considered unsupervised.
The motivation for this version of EC comes from the observation that the encodings used in the sender's cross-attention should be fairly similar to those generated by the model's encoder, since the model is being fine-tuned to be an encoder-decoder translation model. Generating into varying target languages incentivizes the model to use the same encodings for generating different languages, rather than copying the input text to the output. In contrast, there is no guarantee that the image encodings used in I2I-EC are at all similar to those produced by the model's encoder.

Initial Supervision
Because multilingual EC is a complicated task with a sparse training signal, we first ground the agents in their visual sub-tasks independently of the combined communication task. We train the sender to produce gold-standard captions in a high-resource language (English in our experiments) while simultaneously training the receiver to pick out the correct image based on the gold-standard caption. Critically, this stage only assumes gold-standard captions in one language; the model is never trained on gold captions in non-English languages. This step is conducted independently, before EC.

Data
Training We use two main sources of training data: monolingual corpora for backtranslation, and pairings of images and captions in a single high-resource language. We train translation systems between English and four other languages: Chinese (zh), German (de), Nepali (ne), and Sinhala (si).
Backtranslation creates synthetic translation pairs by generating sentences in the second language given natural sentences in the first. Following experiments using mBART for unsupervised translation (Liu et al., 2020), we use small portions of the Common Crawl 25 dataset, which is the pre-training data for mBART. In this way, no novel data is introduced to establish our UNMT baseline.
For the EC stage, the data required differs between I2I-EC and T2I-EC. The former requires only image embeddings. The latter requires paired images and captions, since the true caption is used to prompt the sender's generation. As mentioned, we assume that captions are available in only one language. Since English is in every translation pair, we use English captions. Our image-caption pairs come from the MS-COCO dataset (Lin et al., 2014), and our image embeddings are extracted from ResNet-50 (He et al., 2016b) (these are also used during the supervised captioning stage).
Validation and Test Translation validation and test sets are the only parallel data used in our experiments. For Nepali and Sinhala, we use the standard splits of the FLoRes evaluation datasets (Guzmán et al., 2019). For Chinese and German, we use the newstest2018 and newsdev2019 splits of the WMT'19 release as validation data (Barrault et al., 2019). For test data in these two languages, we sample 4096 examples from the News Commentary v14 subset of the same release.

Experiments
We evaluate a UNMT baseline and our two proposed EC-FT pipelines on translation performance for each language pair. Checkpoints are picked by highest mean BLEU on the validation set. We first describe these models and then our evaluation. More extensive details can be found in Appendix A.
Baseline For our UNMT baseline, we start with mBART-25 and perform iterative backtranslation for 8192 steps in each direction. mBART employs language control tokens at the beginning of sequences, but it is not pre-trained to decode one language from another (Liu et al., 2020), which is a key feature of (back-)translation. To overcome the model's tendency to copy the input sequence to the output, we establish language-controlled generation using language control tokens and language masks (Liu et al., 2020). Concretely, we obtain token counts from the mBART training data and use them to create a logit mask, allowing the model to generate only those tokens which make up the top p percent of the probability mass of the data in the given language. For the first 2048 backtranslation steps, we use a masking threshold of p = 0.9. After that, we raise the threshold to p = 0.99.
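The mask construction follows the same logic as standard top-p (nucleus) filtering, but computed once per language from corpus token counts rather than per decoding step. A sketch with a toy vocabulary; the helper names are ours, not from the paper's codebase:

```python
import torch

def build_language_mask(token_counts: torch.Tensor, p: float) -> torch.Tensor:
    """Given per-token counts from one language's training data, allow only
    the tokens making up the top p of that language's probability mass.
    Returns a boolean mask over the vocabulary (True = allowed)."""
    probs = token_counts.float() / token_counts.sum()
    sorted_probs, order = probs.sort(descending=True)
    cum = sorted_probs.cumsum(0)
    # keep each token whose preceding cumulative mass is still below p
    # (so the first token crossing the threshold is included)
    keep = cum - sorted_probs < p
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[order[keep]] = True
    return mask

def mask_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Disallow out-of-language tokens by setting their logits to -inf."""
    return logits.masked_fill(~mask, float("-inf"))
```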
(I2I/T2I)-EC In both of our EC-FT models, we keep the total number of backtranslation steps the same (8192) and add 2048 steps each of supervised caption training and EC-FT. The language of generation can also be controlled during EC, so we use language-control tokens and a logit mask to ensure the sender generates in the specified language. The language of the emergent generation is selected uniformly at random per example.
Evaluation For our final evaluation, we report both BLEU and COMET (Rei et al., 2020) scores in both translation directions for each language pair. COMET provides the output of a regression model trained to predict the human direct-assessment translation quality score of a translation pair. Based on normalized quality scores, a COMET score of 0 means the translation is predicted to be of average quality; positive scores indicate above-average quality, and vice versa. We use the wmt22-comet-da model.

Results

The baseline with backtranslation only (BT) shows a marked decrease in performance from the two higher-resource languages (Chinese and German) to the two lower-resource languages (Nepali and Sinhala). This is expected, since BT-based UNMT often requires a strong initialization (Lample et al., 2018c) and multilingual models (like mBART) do not perform as well for lower-resource languages (Wu and Dredze, 2020).

Our model fine-tuned with both backtranslation and I2I-EC remains close to or exceeds the baseline for the two higher-resource languages and Nepali, but achieves very poor performance on Sinhala. It appears that EC provides a worse initialization for backtranslation for this language.
In contrast, our "text-to-image" variant of EC-FT (T2I-EC) yields the best-performing model in terms of mean BLEU for all four of our languages. In particular, we see significant gains for both lower-resource languages. Most striking is the Nepali-English pair, which sees a +15% BLEU improvement over the baseline. While there are improvements in both directions, the Nepali→English direction has the largest gain. By contrast, Sinhala shows improvements in both directions, with the larger improvement in the to-Sinhala direction (partially due to a stronger baseline). The improvements are smallest for German, which is both very high-resource and the most similar to English of our languages. The COMET scores were broadly correlated with BLEU scores in all of our settings.
These results show that EC-FT of a pre-trained multilingual model can provide real improvement over a backtranslation-only baseline, giving a proof-of-concept of communication for fine-tuning.

Manipulations
To better understand which components of the pipeline affect the results in T2I-EC, we conducted several follow-up experiments. For each manipulation, we looked at one high-resource language (German) and one low-resource language (Sinhala). See Appendix B for full methodological details.
Image Encoder To test the effect of the image encoder, we replaced the ResNet image encoder with the best-performing one from CLIP (Radford et al., 2021). This image encoder is based on the Vision Transformer (Dosovitskiy et al., 2021) architecture and is trained jointly with a text encoder via a contrastive loss to pair image encodings with caption encodings.

Initial Backtranslation
Because the EC component of training is the first time that language ID codes are used to generate text from the decoder with input other than representations of the same language from the encoder, we experimented with splitting the backtranslation training into two parts. Instead of doing all 8192 steps after EC, we did 2048 steps after image supervision but before EC, and the final 6144 steps after EC.

Interleaved Training Following Lowe et al. (2020), who showed that interleaving EC with a supervised learning objective can improve EC results, we ran a version of our training pipeline in which we alternated between EC and BT four times. The total number of training steps remained the same (2048 and 8192, respectively), but these were now divided into four equal-sized EC-to-BT pieces.

Results
Table 2 shows the results of these ablations. Evaluation is in terms of BLEU on the test set, and the ∆ column reports the percent difference from the best value for a language in Table 1. We find a significant reduction in translation quality with the CLIP image encoder, and inconsistent performance for both an initial BT phase and interleaved training, with performance dropping for German but slightly increasing for Sinhala when compared to T2I-EC (as seen in the ∆ column).

Discussion
We have demonstrated that (at least one variant of) EC fine-tuning provides improvement on unsupervised translation over a standard backtranslation baseline. The gains are especially pronounced for the low-resource language Nepali, which is ideal, since under-resourced languages constitute the expected use case for unsupervised translation techniques. Furthermore, since the hyperparameters for the EC-FT portion of our pipeline were mostly determined empirically, our approach may be under-optimized, meaning future work may yield further improvement using the same technique.
I2I-EC However, it is also clear that our formulation and implementation of "standard" EC (I2I-EC) does not improve upon the baseline, and even degrades performance in many cases. Our interpretation of this behavior is linked to our motivation for formulating T2I-EC in the first place.
As mentioned in Section 2.1, the image representations used in the sender's cross-attention in the image-to-image setup are not guaranteed to be at all similar to the representations that the receiver learns to encode. Because we seek to fine-tune for a standard seq2seq task (translation), it is desirable that the sender (mBART decoder) be trained to use the same or similar representations to those produced by the receiver (mBART encoder). Thus, we hypothesize that the null and negative effects of I2I-EC may be due to this mismatch between the representations the sender is trained to use and those the receiver is trained to produce.
However, we do not believe we have shown that I2I-EC will not be useful under slightly different formulations. In particular, the image representations could be constrained to be similar to those of the receiver, either during EC or during the initial supervision phase. This could be accomplished using an auxiliary distance loss, or by normalizing the mean and variance of the representations in both places.
EC Fine-Tuning Lastly, we view EC fine-tuning as a broader framework in which we have tested two distinct formulations (Steinert-Threlkeld et al., 2022). We take the invariant element of EC to be a model's use of discrete, natural-language generations as input to a second model, which must use them to accomplish some task.
Given this definition, there are several choice points for applying EC-FT. The parameter we explicitly explore in our experiments is whether the input to the sender is image-based or text-based. In both of our formulations, the receiver is trained by a contrastive image-choice loss. Another parameter for future work concerns whether this loss applies to images or texts: the receiver could be trained to choose the correct sentence out of a set of distractors via the similarity of the sentence embeddings.
A third parameter is whether the receiver is trained by a contrastive loss or a generative one (i.e. exactly reproducing a target sequence, as in seq2seq training). In fact, an EC parameterization with text input, text output, and a generative loss has already been formulated elsewhere, though it is not referred to as such. Niu et al. (2019) design a formulation of backtranslation in which the artificial intermediate text is generated with straight-through Gumbel-Softmax, instead of being generated separately first. Future work will explore using this method with pre-trained models, i.e. in an EC-FT context.
These and other parameter choices leave extensive room for exciting future work with EC-FT as a general framework, both for UNMT and beyond.

Related Work
UNMT Unsupervised NMT uses only monolingual texts in each language of interest. Lample et al. (2018c) describe three principles for successful UNMT systems: (1) initialization: the initial model must leverage aligned representations between languages; (2) language modeling: there should be a strong "data-driven prior" over the text patterns of each language; and (3) backtranslation, which turns the unsupervised problem into a noisily-supervised one through the use of semi-synthetic translations.
Significant progress has been made in improving each of these aspects of UNMT. Pre-trained multilingual language models (Lample and Conneau, 2019; Conneau et al., 2020; Liu et al., 2020; Song et al., 2019) have vastly improved the tractability of principles 1 and 2, largely replacing initialization techniques using inferred bilingual dictionaries (e.g. Lample et al., 2018b).
For the third principle, iterative backtranslation is widely used (Sennrich et al., 2016a; He et al., 2016a; Lample et al., 2018a; Haddow et al., 2022). On this approach, synthetic data is generated "on the fly" during training. The model is updated before each new batch of synthetic text is generated, leading to simultaneous incremental improvement in generated data quality and model quality.
In this work, we adhere to all three principles, but add EC as a training signal. It has been noted that UNMT baselines still perform relatively poorly for low-resource languages (Guzmán et al., 2019). We improve upon low-resource UNMT pipelines by leveraging goal-directed, multimodal fine-tuning via emergent communication.
EC and NLP A few other papers combine EC and NMT specifically. Lee et al. (2018) use EC and image captioning to build UNMT models, showing that EC promotes better translation than the multimodal alignment technique of Nakayama and Nishida (2017). Our approach differs in several important respects: we initialize our EC environment with pre-trained language models; we use both EC and backtranslation; and we do not simultaneously train on the EC objective and the image captioning objective. Moreover, because we use one multilingual model, our caption grounding only uses one language, instead of all languages. Our results show that EC promotes unsupervised translation in the context of advanced methods that combine pre-training with backtranslation.

Li et al. (2020b) use emergent communication as a pre-training step for NMT systems. They have agents play an EC game, and then use those parameters to initialize an NMT system. They find that (together with adapters and weight-distance regularization) EC pre-training improves in BLEU over a standard NMT baseline, with especially large gains in the few-shot setting. While this shows that EC can provide a good initialization for a recurrent NMT system, our present work shows that EC can provide a good fine-tuning signal for a pre-trained multilingual language model. We also note two differences with respect to both works: (i) they use recurrent networks, whereas we start from a pre-trained transformer; and (ii) they use separate models for each language, whereas we use one multilingual model.

Lee et al. (2019) cast translation as a communication game with a third pivot language as the latent space, in order to study (i) language drift from a pre-trained supervised MT model and (ii) using visual grounding (via gold image captions) plus language modeling to counter such drift. This approach thus does use EC with a pre-trained model, but it is a small model trained on the target task (translation). Our approach encourages using EC in conjunction with large-scale pre-trained language models which are intended to be general-purpose.
Finally, Lazaridou et al. (2020) study various ways of combining EC with a standard vision-language task, namely image captioning. They identify several forms of language drift and explore ways of incorporating auxiliary losses. This work heavily inspires our own, since many of their settings correspond to using a pre-trained image-captioning system. Our focus, however, has been on using EC to fine-tune large-scale pre-trained models on a language-only task, which introduces its own challenges and has its own benefits.
Multimodal pre-training Recently, efforts in multimodal pre-training are surging, especially in vision-language (V-L) pre-training (Du et al., 2022). Most works create joint V-L representations through a fusion encoder (Li et al., 2019, 2020a; Tan and Bansal, 2019), where the fused representation is the joint representation of image and text, as learned by a single encoder. Other recent works, such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), use different encoders for images and text to make the framework more efficient. While V-L pre-training models image and text data jointly (Du et al., 2022; Wang et al., 2021), we start with an existing pre-trained language model and further train it through the communication process in an image referential game. Although we expect alignment between image and text to arise through this process, we view the visual modality as an additional signal to ground the multilingual communication process.
We also note that most previous work on V-L pre-training is evaluated solely on vision or V-L tasks (Li et al., 2019; Radford et al., 2021; Jia et al., 2021). The advantage of this joint pre-training for language-only tasks remains unclear (Yun et al., 2021; Pezzelle et al., 2021). In this paper, we focus on a language-only task (UNMT) to evaluate whether visual grounding can improve such tasks.
Finally, we note that EC-FT is more general than typical approaches to multimodal pre-training. While the image-based task we employ here works by promoting multimodal alignment, the range of possible tasks that can be used in EC-FT is huge, from directing other agents (Mordatch and Abbeel, 2018) to controlling a robot (Das et al., 2019) to playing games and reasoning about social dilemmas (Jaques et al., 2019). This wide range of tasks can incorporate many dimensions of communication that should be beneficial for NLP systems (e.g. other agents with their own private goals, social context, embodied control) but that are not easily captured by multimodal pre-training (Bender and Koller, 2020; Bisk et al., 2020). In terms of Bisk et al. (2020)'s world scopes mentioned in the introduction, multimodal pre-training corresponds to world scope 3 (perception); EC-FT has the ability to move us much closer towards the final scopes 4 (embodiment and action) and 5 (the social world).
Multimodal Fine-tuning A related body of work focuses on adapting pre-trained language-only models for use in multi-modal tasks. For example, Tsimpoukelli et al. (2021) show that using a frozen language model and adapting a visual encoder to produce embeddings aligned with the LM's can be useful for few-shot learning in multimodal tasks like visual question answering. Liang et al. (2022) make this approach more modular by additionally freezing the visual encoder and learning separate prompt vectors. In the EC-FT context, these works suffer some of the same limitations in world scope, but could provide very useful methods for the environment-to-sender adapter step discussed in Section 2.1.

Conclusion
We have shown that Emergent Communication can be used as a fine-tuning signal for a large pre-trained multilingual model; this grounding in a goal-oriented multimodal task yields improvements over an unsupervised NMT baseline in all four languages studied. There is likely room to further improve upon the specific EC variants we propose here, since we believe the EC process's hyperparameters are under-optimized. We have further noted that the framework we propose leaves extensive room for further experimentation, since there are many choice points in the general EC setup that we have not yet tested and that may be promising avenues for future improvement. The general EC-FT framework may also be applied to other tasks beyond UNMT in future work.

A Main Experiments Training Details
Here we include more details about the training protocol for the results reported in Section 4. Our codebase is built upon the mBART code from Hugging Face Transformers (Wolf et al., 2019) and PyTorch (Paszke et al., 2019). We use one NVIDIA RTX 8000 GPU for each experiment. Backtranslation is the most expensive part of the entire training pipeline; it takes around 24-28 hours to finish, depending on the languages. The combined training time for caption grounding and emergent communication is within 1 hour.
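The iterative backtranslation loop that dominates this budget alternates between generating synthetic parallel pairs and training on them. The following is a minimal sketch of one round; the `translate` and `train_step` helpers and their signatures are our own illustration, not the paper's actual API.

```python
# Minimal sketch of one round of iterative backtranslation. `translate` and
# `train_step` are hypothetical helpers standing in for the real model calls.

def backtranslation_round(model, mono_src, mono_tgt, translate, train_step):
    """One round: synthesize parallel pairs in both directions, then train.

    mono_src / mono_tgt: monolingual corpora for the two languages.
    Returns the synthetic pairs as (source sentence, target sentence) tuples.
    """
    # 1. Translate monolingual target-language text back into the source
    #    language, yielding (synthetic source, real target) pairs.
    synth_src = translate(model, mono_tgt, src_lang="tgt", tgt_lang="src")
    pairs_tgt_side = list(zip(synth_src, mono_tgt))

    # 2. Symmetrically for the other direction: (real source, synthetic target).
    synth_tgt = translate(model, mono_src, src_lang="src", tgt_lang="tgt")
    pairs_src_side = list(zip(mono_src, synth_tgt))

    # 3. Train on all synthetic pairs; the real side acts as the gold target.
    for src, tgt in pairs_tgt_side + pairs_src_side:
        train_step(model, src, tgt)
    return pairs_tgt_side + pairs_src_side
```

In practice each round's generation quality improves as the model improves, which is why the synthetic data is regenerated iteratively rather than once.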
Baseline As discussed in Section 3, our UNMT baseline is established by starting with mBART and performing 8192 steps of iterative backtranslation for each translation pair. We use a batch size of 32 and a maximum generated sequence length of 64. See more hyperparameter choices in Table 3.

Supervised caption training is described in Section 2.1. We have 8 choices for the image selection task (7 distractors and 1 correct choice). As part of the sender agent, we use a one-layer autoregressive transformer to serialize (or "unroll") a single ResNet image representation into a sequence of vectors, imitating the sequential data mBART observes during its pre-training. The unrolled sequence is used in the sender's cross-attention, and the sender is trained to generate the gold-standard caption.
Also during the supervised captioning stage, the receiver takes in the gold-standard caption, and a one-layer RNN is used to aggregate its final hidden states and choose the correct image. The image selection (cross-entropy) loss is scaled by λ before being added to the caption-generation loss. Full hyperparameter choices are detailed in Table 4.
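The sender-side unroll adapter and the receiver's image scorer described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the module names, layer sizes, and the use of a causal mask to make the one-layer transformer autoregressive are our assumptions.

```python
import torch
import torch.nn as nn

class UnrollAdapter(nn.Module):
    """Unrolls a single ResNet image vector into a sequence of `length`
    vectors via a one-layer transformer with a causal mask, imitating the
    sequential inputs mBART saw in pre-training. Dimensions are illustrative."""
    def __init__(self, img_dim=2048, d_model=512, length=32):
        super().__init__()
        self.proj = nn.Linear(img_dim, d_model)
        self.pos = nn.Parameter(torch.randn(length, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.unroll = nn.TransformerEncoder(layer, num_layers=1)
        self.length = length

    def forward(self, img_feat):                          # (B, img_dim)
        # Broadcast the projected image vector across `length` positions.
        x = self.proj(img_feat).unsqueeze(1) + self.pos.unsqueeze(0)
        mask = nn.Transformer.generate_square_subsequent_mask(self.length)
        return self.unroll(x, mask=mask)                  # (B, length, d_model)

class Receiver(nn.Module):
    """Aggregates the caption with a one-layer RNN, then scores each
    candidate image by dot product against the final hidden state."""
    def __init__(self, vocab=1000, d=512, img_dim=2048):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.img_proj = nn.Linear(img_dim, d)

    def forward(self, tokens, candidates):  # (B, T), (B, C, img_dim)
        _, h = self.rnn(self.emb(tokens))   # final hidden state: (1, B, d)
        q = h[-1].unsqueeze(2)              # (B, d, 1)
        return torch.bmm(self.img_proj(candidates), q).squeeze(2)  # (B, C)
```

The receiver's logits would then feed a cross-entropy loss over the 8 candidates, scaled by λ and added to the caption-generation loss as in the text.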
I2I-EC fine-tuning is also described in Section 2.1. Unlike caption grounding, we have a total of 16 image choices instead of 8. The adapter unrolls the ResNet image representation to a length of 32. The emergent generation is language-constrained as described in Section 3 with a threshold. A repetition penalty is applied to the generations, and they are constrained to not repeat any 4-grams or longer. KL-regularization with a separate mBART instance fine-tuned on causal language modeling is applied with a λ parameter. Full hyperparameter choices are detailed in Table 5.

Auxiliary CLM To have a language model for use in KL regularization (see Equation (2)), we fine-tuned just the mBART decoder on the same Common Crawl data used for its pre-training in all of the languages of interest. We trained for 100000 steps with a batch size of 32, a sequence length of 96, and a learning rate of 6 × 10^-6. This model was then frozen during EC training and used only to compute the KL divergence used in updating the sender's parameters.
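The KL term that keeps the sender's generations close to the frozen auxiliary CLM can be sketched as below. The tensor names and the exact reduction (mean over batch and time) are our assumptions; only the overall form, task loss plus λ times KL(sender ∥ frozen CLM), follows the text's Equation (2).

```python
import torch
import torch.nn.functional as F

def kl_regularized_loss(task_loss, sender_logits, clm_logits, lam):
    """Adds lam * KL(sender || frozen CLM), averaged over tokens.

    sender_logits, clm_logits: (B, T, V) unnormalized vocabulary scores.
    The CLM is frozen, so its logits are detached from the graph.
    """
    log_p = F.log_softmax(sender_logits, dim=-1)        # sender distribution
    log_q = F.log_softmax(clm_logits.detach(), dim=-1)  # frozen reference LM
    kl = (log_p.exp() * (log_p - log_q)).sum(-1).mean() # per-token KL, averaged
    return task_loss + lam * kl
```

Because the CLM's parameters are detached, gradients flow only into the sender, pulling its output distribution toward fluent language without updating the reference model.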

B Manipulations Training Details
All manipulations are performed on the main T2I-EC process. Interleaved training uses versions of the learning rate schedules used for the main experiments, shortened by a factor of 4.

C Full results
In Table 6, we include full results for our main experiment (summarized in Table 1). Although we found the EC process to help with machine translation, it also leads to instability in model training. We leave a systematic study of this variation to future work.
In Table 7, we show experiments with a more modern choice of image encoder, CLIP-Large (Radford et al., 2021). We find that the CLIP-Large encoder underperforms ResNet.
The full results from our manipulation experiments (Section 5) are found in Table 8.
x_E ∈ R^{N×|V|} and x_D ∈ R^{K×|V|} are sequences of symbols of length N and K, respectively. E_m(x_E; θ_E) ∈ R^{N×d_m}, where d_m is the model hidden dimension, is the encoder output. Similarly, the decoder output is D_m(x_D, e; θ_D) ∈ R^{K×|V|}, where e ∈ R^{N×d_m} is a set of vectors used for the decoder's cross-attention.
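As a concrete sanity check of the shapes in this notation, the following toy example uses a small vanilla transformer in place of mBART (all dimensions are arbitrary placeholders, not the paper's):

```python
import torch
import torch.nn as nn

# Toy shape check for the encoder/decoder notation, with a tiny stand-in model.
V, d_m, N, K = 11, 16, 7, 5
emb = nn.Embedding(V, d_m)
tr = nn.Transformer(d_model=d_m, nhead=4, num_encoder_layers=1,
                    num_decoder_layers=1, batch_first=True)
lm_head = nn.Linear(d_m, V)

x_E = torch.randint(0, V, (1, N))       # encoder input symbols, length N
x_D = torch.randint(0, V, (1, K))       # decoder input symbols, length K
e = tr.encoder(emb(x_E))                # E_m(x_E): (1, N, d_m)
out = lm_head(tr.decoder(emb(x_D), e))  # D_m(x_D, e): (1, K, V)
assert e.shape == (1, N, d_m) and out.shape == (1, K, V)
```

The encoder output `e` is exactly the set of cross-attention vectors consumed by the decoder, matching the roles of E_m and D_m above.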
I2I-EC For our I2I-EC fine-tuned model, training consists of the following pipeline:
1. 2048 steps of backtranslation
2. 2048 steps of supervised captioning training (English-only)
3. 2048 steps of EC fine-tuning
4. 6144 steps of backtranslation
Backtranslation uses the exact same hyperparameters as in the baseline, but with training split between the first 2048 and last 6144 steps (Table 3).

T2I-EC For our T2I-EC fine-tuned model, training is performed slightly differently for empirical reasons:
1. 2048 steps of supervised captioning training (English-only)
2. 2048 steps of EC fine-tuning
3. 8192 steps of backtranslation
T2I-EC hyperparameters are very similar to I2I-EC. See full parameters in Tables 3, 4, and 5.
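The two fine-tuning schedules above can be written out as explicit stage lists; the step counts come from the text, while the stage names are our own labels.

```python
# Fine-tuning schedules as (stage name, number of steps) pairs.
I2I_EC_SCHEDULE = [
    ("backtranslation",       2048),
    ("supervised_captioning", 2048),  # English-only caption grounding
    ("ec_fine_tuning",        2048),  # image-to-image reference game
    ("backtranslation",       6144),
]

T2I_EC_SCHEDULE = [
    ("supervised_captioning", 2048),
    ("ec_fine_tuning",        2048),  # text-to-image variant
    ("backtranslation",       8192),
]

# Both schedules total 12288 gradient steps, and each contains 8192
# backtranslation steps in total, matching the baseline's backtranslation
# budget.
```

Writing the schedules this way makes the comparison explicit: the two variants differ only in where the 8192 backtranslation steps fall relative to the captioning and EC stages.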

Table 1: Results of our main experiment. Values reported here are the maximum across 3 random seeds per row; see Appendix C for full variation. T2I-EC shows consistent improvement for each language in terms of both mean BLEU and COMET. ∆ shows percent improvement over the baseline.

Table 1 shows the results from our main experiment. Firstly, our UNMT baseline is based on iterative back-translation.