On the Relationship between Zipf’s Law of Abbreviation and Interfering Noise in Emergent Languages

This paper studies whether emergent languages in a signaling game follow Zipf's law of abbreviation (ZLA), especially when the communication ability of agents is limited by interfering noise. ZLA is a well-known tendency in human languages whereby the more frequently a word is used, the shorter it tends to be. Surprisingly, previous work demonstrated that emergent languages do not obey ZLA at all when neural agents play a signaling game. It also reported that a ZLA-like tendency appeared when an explicit penalty on word lengths was added, which can be regarded as modeling external factors in reality such as articulatory effort. We hypothesize, on the other hand, that there might be not only such external factors but also internal factors related to cognitive abilities, and we assume that they can be simulated by modeling the effect of noise on the agents' environment. In our experimental setup, the hidden states of the LSTM-based speaker and listener were perturbed with Gaussian noise, while the channel was subject to discrete random replacement. Our results suggest that noise on a speaker is one of the factors behind ZLA, or at least causes emergent languages to approach ZLA, while noise on a listener or a channel is not.


Introduction
There has recently been a growing interest in simulating languages that spontaneously emerge among artificial agents trained to solve tasks requiring communication. A primary motivation in this area is to pursue the development of artificial intelligence that can interact or communicate with human beings (e.g., Havrylov and Titov, 2017; Lazaridou et al., 2017, 2018; Lee et al., 2018). In addition to this line of research, some studies have investigated the characteristics of emergent languages, mainly concerning to what extent they are similar to human languages or what kinds of factors form language-like protocols (e.g., Kottur et al., 2017; Harding Graesser et al., 2019). Chaabouni et al. (2019), for example, studied the relationship between emergent languages and Zipf's law of abbreviation (ZLA), a universal tendency in human languages whereby frequent words tend to be shorter (Zipf, 1935; Kanwal et al., 2017). To see whether emergent languages follow ZLA, they performed experiments in which agents played a signaling game. Their results suggested that emergent languages have the opposite tendency to ZLA: more frequent inputs are encoded into longer messages. They also reported that by adding a penalty on message lengths (Eq. 6), the emergence of a ZLA-like tendency was observed.
Zipf (1935) hypothesized that ZLA arises from two conflicting pressures: one for accuracy and the other for efficiency. In a paradigm with human subjects using a simple artificial language, Kanwal et al. (2017), for instance, introduced external factors simulating these competing pressures, namely monetary rewards for precise and quick communication. In emergent-language simulations, the explicit penalty on message lengths (Eq. 6) of Chaabouni et al. (2019) can also be considered an external factor for ZLA.
However, we speculate that there might be not only such external factors but also internal factors (or implicit penalties) related to the cognitive abilities of human beings, such as memory. Inspired by some concepts in psychology, we first hypothesize the following:

Hypothesis 1. ZLA appears due to some internal factors arising from the cognitive abilities of human beings, as well as external factors. In other words, human beings assign shorter codes to frequent words so that they can avoid difficulty in their internal processes as much as possible.
Some studies in psychology suggested that in human beings, there is an output buffer of some sort that temporarily reserves some words to be spoken (Baddeley et al., 1975;Baddeley, 2003;Meyer et al., 2003;Damian et al., 2010;Baddeley and Hitch, 2019). The output buffer might decay over time, be overwhelmed by incoming inputs one after another, or be exposed to other disturbances. Such pressures, we thought, could be factors to shorten frequent words.
But how should they be modeled in simulations of language emergence? Since artificial agents in simulations are not humans but often (recurrent) neural networks, it is not trivial to define equivalent pressures for them. To adopt such pressures into a signaling game, we propose modeling them as noise that interferes with the states of agents. Although the potential factors described above might primarily concern the speaker in a signaling game, we also propose adding noise to the listener for comprehensive research, as the listener's short-term memory might also be limited for reasons similar to the speaker's. Besides, we try adding noise to the channel that spans the speaker and the listener, following the noisy-channel model (Shannon, 1948). Although a noisy channel is probably a pressure for accuracy rather than efficiency, the assumption that redundancy contributes to accuracy seems to implicitly regard the listener as capable enough of correcting errors while maintaining the necessary information, which is not trivial for neural agents. Therefore, it is worth a try.
With this modeling, and for comprehensiveness, Hypothesis 1 is revised as follows:

Hypothesis 2. ZLA appears due to some of the three types of noise: noise on a speaker, noise on a listener, and noise on a channel.
In our experimental setup, speaker and listener agents are exposed to Gaussian noise, since they have continuous vectors as their states. The channel, on the other hand, is exposed to discrete random replacement, as the messages passing through it consist of discrete symbols.
Our experiments suggest that noise on a speaker is one factor for ZLA or at least causes emergent languages to be closer to ZLA, whereas noise on a listener and a channel is not in our signaling game. Rather, the noise on a channel strengthened redundancy.
Our analysis reveals the following. First, when noise interferes with a speaker agent, noise accumulation can make it difficult to generate long consistent messages. Second, when noise interferes with a listener agent, noise accumulation does not crucially affect the overall tendency: even if the listener agent "forgets" the prefix of a message, the suffix is sufficient for communication. Third, noise on a channel can be thought of as a pressure for accuracy rather than efficiency, which is consistent with an information-theoretic point of view and Zipf's hypothesis.


Background

Chaabouni et al. (2019) studied whether emergent languages follow ZLA when neural agents play a signaling game. As we largely build on their work, we review their setup, method, and results in this section.

Signaling Game with a Power-law Distribution
They extended a signaling game (Lewis, 1969) by making inputs be sampled from a power-law distribution, in which the n-th most frequent input is sampled from a finite input space I with probability ∝ 1/n. Thus, if agents learn to assign frequent inputs to shorter messages, their communication protocol can be said to obey ZLA. Let S and L be a speaker and a listener. Formally, the game procedure is as follows:

1. An input i ∈ I is sampled from a power-law distribution. Let i_r be the r-th most frequent input. Then i_r is sampled with probability ∝ r^{-1}.
2. Given i, the speaker S generates a message m, i.e., m = S(i). m = x_1 ... x_|m| is a string over an alphabet A = {a_1, ..., a_{|A|-1}, eos} s.t. x_i ≠ eos (1 ≤ i < |m|), x_|m| = eos, and 0 < |m| ≤ max_len, where |m| is the length of m and max_len is a hyperparameter. Note that eos ∈ A stands for "end-of-sentence," and it is guaranteed to be attached to the end of each message.^1
3. Given the message m, the listener L generates an output o, i.e., o = L(m).
4. The procedure is successful if i = o.
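As a minimal sketch of step 1, the power-law input distribution can be sampled as follows (the function name and the use of `random.choices` are our own illustrative choices, not part of the original setup):

```python
import random

def sample_input_rank(num_inputs, rng):
    """Sample a frequency rank r in {1, ..., num_inputs} with P(r) proportional
    to 1/r, i.e., the power-law input distribution used in the game."""
    ranks = range(1, num_inputs + 1)
    weights = [1.0 / r for r in ranks]  # unnormalized 1/r weights
    return rng.choices(ranks, weights=weights, k=1)[0]

# Draw many samples: rank-1 inputs should dominate rank-256 inputs.
rng = random.Random(0)
counts = {}
for _ in range(20000):
    r = sample_input_rank(256, rng)
    counts[r] = counts.get(r, 0) + 1
```

With |I| = 256, the most frequent input is sampled roughly 250 times more often than the least frequent one, which is what makes a ZLA-style length/frequency trade-off measurable.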

Training Method
Since the players in a signaling game are neural networks, each input i ∈ I is represented as a |I|-dimensional one-hot vector i. Likewise, an output o is represented as a |I|-dimensional vector o s.t. (o)_k > 0 (k = 1, ..., |I|) and Σ_{k=1}^{|I|} (o)_k = 1. Let L(i, o) be the cross-entropy error between i and o = L(S(i)):

L(i, o) = − Σ_{k=1}^{|I|} (i)_k log (o)_k,  (1)

where S is a speaker and L is a listener. Our purpose is to minimize its expectation E[L], but the simple backpropagation algorithm is not applicable due to the discrete messages m = x_1 ... x_|m| sampled from a speaker. Chaabouni et al. (2019) used the following surrogate function, the gradient of which is an unbiased gradient estimator, with an auxiliary entropy-regularizer loss ER:

L̃(i, o) = L_bp + L_rf − λ_H · ER,  (2)
L_bp = L(i, o),  (3)
L_rf = SG(L(i, o) − b) · Σ_{t=1}^{|m|} log P_{S,t}(x_t),  (4)
ER = (1/|m|) Σ_{t=1}^{|m|} H(P_{S,t}),  (5)

where b is a mean baseline added to reduce the variance of the estimate, SG(·) denotes the stop-gradient operation^2, P_{S,t} is the speaker's output layer at time step t defining a categorical distribution over the alphabet A, P_{S,t}(x_t) is the probability of x_t ∈ A being sampled at time step t, λ_H is a coefficient, and H(·) is the entropy function. Eq. 3 and Eq. 4 are derived by the approach of Schulman et al. (2015), which can be seen as the combination of a REINFORCE-like method (Williams, 1992) and standard backpropagation. ER (Eq. 5) is added to encourage exploration during training (Williams and Peng, 1991).
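As a rough numeric sketch of this surrogate (treating all quantities as plain numbers and omitting the stop-gradient mechanics, which require an autograd framework such as PyTorch; the function and argument names are ours):

```python
def surrogate_loss(ce, log_probs, entropies, baseline, lambda_h):
    """Numeric sketch of the surrogate objective: the cross-entropy term is the
    standard backpropagation path, the (ce - baseline)-weighted sum of per-step
    log-probabilities is the REINFORCE-like path, and the mean per-step entropy
    (ER) is subtracted to encourage exploration. In a real implementation the
    (ce - baseline) coefficient is detached from the graph (the SG operation)."""
    reinforce = (ce - baseline) * sum(log_probs)
    er = sum(entropies) / len(entropies)
    return ce + reinforce - lambda_h * er
```

Note that when the baseline matches the loss exactly, the REINFORCE-like term vanishes and only the backpropagation and entropy terms remain.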

Anti-ZLA Emergent Languages
Chaabouni et al. (2019) reported, somewhat surprisingly, that the communication protocols had a clear anti-ZLA tendency when agents play the signaling game described in section 2.1. They also reported that a ZLA-like tendency appeared when they additionally imposed an artificial length pressure on messages:

L_length(m) = α · |m|,  (6)

where m is a message, |·| denotes length, and α ≥ 0 is a hyperparameter. Rita et al. (2020) took a quite similar approach and observed the emergence of ZLA. As well as imposing a length pressure on the speaker agent, they re-designed the architecture of the listener agent so that the listener would be "impatient," recovering i as soon as possible.
Note that both the length pressure (Eq. 6) and the architecture re-design in Rita et al. (2020) can be regarded as somewhat explicit losses, whereas we try to impose an implicit pressure on agents.

Game with Noise
For the game, we take almost the same design as Chaabouni et al. (2019), introduced in section 2.1. We additionally introduce a channel C over which messages travel from speaker to listener: the listener L obtains a message m̃ = C(m) through the channel C, instead of receiving m = S(i) directly from the speaker. There are also several differences in hyperparameter settings.

Architectures
As speaker and listener agents have continuous vectors as their states, they are perturbed with continuous noise. For simplicity, we choose Gaussian noise sampled independently at each time step. Channels, on the other hand, are exposed to discrete noise, since they convey discrete symbols. We adopt a random replacement operation for the channel noise.

Speaker and Listener
The architectures of speaker and listener agents are based on a single-layer LSTM, following .
At training time, we add Gaussian noise to the cell states of the LSTM of each agent.^3 Formally,

c̃_t = c_t + ε_t,  ε_t ∼ N(· | 0, σ²E),

where c_t is the cell state at time step t, σ > 0 is a standard deviation (SD), E is the identity matrix, N(· | 0, σ²E) is a Gaussian distribution with mean vector 0 and variance-covariance matrix σ²E, and ε_t is the value sampled from N(· | 0, σ²E) at time step t. We denote by σ_S and σ_L the SDs for the speaker and listener architectures respectively. At test time, we do not add noise, for deterministic evaluation.
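A minimal sketch of this perturbation, operating on a cell state represented as a plain list of floats (the function name is ours; a real implementation would act on the LSTM's tensor state inside the forward pass):

```python
import random

def perturb_cell_state(cell_state, sigma, rng):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to each unit of an LSTM cell
    state. sigma = 0 recovers the noise-free test-time behaviour."""
    return [c + rng.gauss(0.0, sigma) for c in cell_state]
```

Applying this at every time step is what lets the noise accumulate over long messages, which is the mechanism the paper appeals to when explaining the speaker-side results.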

Channel
At training time, we think of the channel as being exposed to some noise, so that messages can be degraded during transportation. Such degradation is modeled as replacement: each symbol in a message is probabilistically replaced with another one. Note that each message ends with eos, which is exceptionally protected from replacement, since the effect of inserting or deleting eos is too strong for our purpose.
Formally, let A be an alphabet, m = a_1 ... a_n be an original message generated by the speaker, and m̃ = ã_1 ... ã_n be the transformed one. Then the probability distribution over ã_i ≠ eos given a_i ≠ eos (i = 1, ..., n − 1) is as follows:

P(ã_i = a | a_i) = 1 − π_C  (a = a_i),
P(ã_i = a | a_i) = π_C / (|A| − 2)  (a ∈ A \ {a_i, eos}),

where π_C is a hyperparameter s.t. 0 ≤ π_C ≤ 1. We call π_C the channel replacement probability. At test time, the channel is free from noise so that we can perform deterministic evaluations.
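A minimal sketch of this channel over messages of string symbols, assuming (as in the distribution above) that a replaced symbol is drawn uniformly from the other non-eos symbols; the function name is ours:

```python
import random

def noisy_channel(message, alphabet, pi_c, rng):
    """Replace each non-eos symbol with probability pi_c by a different symbol
    drawn uniformly from the non-eos alphabet; the terminating "eos" is
    protected, so message length never changes."""
    non_eos = [a for a in alphabet if a != "eos"]
    out = []
    for sym in message:
        if sym == "eos" or rng.random() >= pi_c:
            out.append(sym)  # eos protected, or no replacement event
        else:
            out.append(rng.choice([a for a in non_eos if a != sym]))
    return out
```

Setting pi_c = 0 makes the channel an identity map, matching the deterministic test-time setting.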

Design and Estimation of Loss Function
We use almost the same loss function as Eq. 2. We modify ER (Eq. 5) into a Decayed Entropy Regularizer (DER), and we define an additional auxiliary loss, Soft Max Length (SML), in the following sections. Both DER and SML are introduced to prevent messages from becoming unnaturally long. Note that they themselves are not factors for ZLA in our assumption.


Decayed Entropy Regularizer

Chaabouni et al. (2019) used ER (Eq. 5) to encourage exploration. However, ER might have an unexpected side effect: it could lead messages to become unnecessarily long. We give an intuitive explanation, as shown in Figure 1. Suppose that a speaker agent has learned a message pattern m = x_1 ... x_|m| for an input i. By the definition of the message, x_|m| = eos, indicating that the probability of eos being sampled is relatively high at time step |m|. Then, the speaker's output layer P_{S,|m|} at time step |m| is updated so that the entropy H(P_{S,|m|}) will be larger. This means that the probability of eos being sampled becomes lower, which might lead messages to become longer. Such an effect can cause an undesirable bias in emergent languages. Thus, we modify ER into the Decayed Entropy Regularizer (DER) as:

DER = (Σ_{t=1}^{|m|} ρ_H^{t−1} H(P_{S,t})) / (Σ_{t=1}^{|m|} ρ_H^{t−1}),  (11)

where ρ_H is a hyperparameter s.t. 0 < ρ_H ≤ 1. DER is a weighted mean that puts a higher priority on the entropy at earlier time steps and a lower one at later steps. Therefore, it is expected to cancel the unwanted effect of hindering eos emission at later time steps.
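A minimal sketch of DER as described in prose (a geometrically decaying weighted mean of per-step entropies; the exact weighting scheme is our assumption, and the function name is ours):

```python
def decayed_entropy_regularizer(entropies, rho):
    """Weighted mean of per-step entropies with geometrically decaying weights
    rho**t, so earlier time steps dominate and eos emission at later steps is
    no longer discouraged. rho = 1 recovers the plain mean, i.e., ER."""
    weights = [rho ** t for t in range(len(entropies))]
    total = sum(w * h for w, h in zip(weights, entropies))
    return total / sum(weights)
```

The rho = 1 boundary case is a useful sanity check: DER then coincides with ER, so the modification only changes behaviour when decay is actually applied.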

Soft Max Length
Each message m is generated by sampling a symbol x_t at each time step t and concatenating them until either eos is sampled (self-termination) or the time step reaches max_len − 1 (forced termination). In the forced-termination case, eos is attached to the end of the sequence. However, this generation procedure may cause a speaker agent to fail to learn to emit eos for some inputs, since message lengths are bounded regardless of the eos emission. To handle this problem, we introduce an additional auxiliary loss, Soft Max Length (SML), defined as:

SML(m) = λ_sml · max(0, |m| − eff_max_len),  (13)

where m is a message, |·| denotes length, λ_sml is the coefficient of this term, and eff_max_len is a hyperparameter s.t. 0 ≤ eff_max_len ≤ max_len.
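A one-line sketch of this penalty under the assumption (ours) that "soft" means a hinge at eff_max_len, i.e., no penalty within the soft bound and a linearly growing one beyond it:

```python
def soft_max_length(length, eff_max_len, lam):
    """Assumed SML form: zero while |m| stays within the soft bound
    eff_max_len, then growing linearly, nudging the speaker to emit eos well
    before the hard bound max_len forces termination."""
    return lam * max(0, length - eff_max_len)
```

Unlike the hard cap at max_len, this keeps a gradient signal available whenever messages drift toward the forced-termination regime.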

Training and Implementation
We follow Chaabouni et al. (2019) on the rest of the training method: agents are trained for 2500 episodes, each of which contains 100 mini-batches. Each mini-batch is made of 5120 inputs sampled from the power-law distribution with replacement. When the accuracy at test time reaches 0.99 or higher, training stops early. Note that we do not add any noise at test time.
The game and the training are implemented using the EGG toolkit (Kharitonov et al., 2019) 4 .

Evaluating Communicative Effectiveness
As Lowe et al. (2019) pointed out, emergent communications have to be carefully examined in terms of effectiveness: even if something like communication emerges, agents might act without referring to signals from others. Since message lengths can vary in our signaling game, it is doubtful that every single symbol in a message conveys essential information. For example, it is not trivial whether eos is really end-of-sentence, since agents can use other symbols as "punctuations" or meaningless "blanks." The effective position of beginning-of-sentence is not trivial, either. Thus, apparent message lengths may differ from actual ones.
To evaluate effectiveness, we introduce position-wise symbol effectiveness and then head/intermediate/tail effectiveness to cover a weak point in the former.

Position-wise Symbol Effectiveness
First, to evaluate how informative symbols are distributed across positions, we introduce position-wise symbol effectiveness, which is quite similar to the notion of positional encoding in Rita et al. (2020). Suppose a symbol x_k in a message m = x_1 ... x_k ... x_|m| is informative enough. Then a listener L is expected to fail to recover an input i correctly if x_k is replaced with another symbol y, i.e., i ≠ L(x_1 ... y ... x_|m|). Based on this intuition, the symbol effectiveness e(m, k) at position k ∈ {1, ..., max_len} in a message m = x_1 ... x_|m| is defined as follows:

e(m, k) = (1 / (|A| − 1)) Σ_{a ∈ A \ {x_k}} 1[L(m[x_k := a]) ≠ i],  (14)

where A is an alphabet, m[x_k := a] denotes x_1 ... x_{k−1} a x_{k+1} ... x_|m|, and the indicator 1[φ] is defined as

1[φ] = 1 if φ is true, and 0 otherwise.  (15)

By definition, 0 ≤ e(m, k) ≤ 1. A low e(m, k) means that the symbol x_k is redundant, since the listener L can recover i from most of the m[x_k := a] (a ∈ A). Otherwise, x_k is considered necessary for successful communication. Note that eos = x_|m| is prevented from being replaced. The value of e(m, k) (Eq. 14) may vary depending on messages and speaker agents, which would make it difficult to perform a straightforward evaluation of position-wise symbol effectiveness. To handle this problem, we also define e_k, the mean of e(m, k) across messages and across speaker agents. Formally, let S = {S_1, ..., S_|S|} be a set of |S| speaker agents trained with different random seeds, and let M_{j,k} be the set of messages generated by S_j whose lengths are at least k. Then e_k is defined as:

e_k = (1 / |S|) Σ_{j=1}^{|S|} (1 / |M_{j,k}|) Σ_{m ∈ M_{j,k}} e(m, k).  (17)

^4 The code for the EGG toolkit is found at https://github.com/facebookresearch/EGG. Our code is available at https://github.com/wedddy0707/noisyEGG.git.
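A minimal sketch of the substitution test behind e(m, k), with a toy listener standing in for a trained agent (the function names, the toy listener, and the exclusion of "eos" as a replacement symbol are our own illustrative assumptions):

```python
def symbol_effectiveness(listener, message, k, alphabet, target):
    """Fraction of single-symbol substitutions at position k (any symbol other
    than the original; "eos" excluded as a replacement here) that make the
    listener fail to recover the target input."""
    candidates = [a for a in alphabet if a != message[k] and a != "eos"]
    failures = sum(
        1 for a in candidates
        if listener(message[:k] + [a] + message[k + 1:]) != target
    )
    return failures / len(candidates)

# Toy listener that only reads position 0: "a" there means input 0, else 1.
toy_listener = lambda m: 0 if m[0] == "a" else 1
```

For the toy listener, position 0 is maximally effective (every substitution breaks recovery) and position 1 is fully redundant, which is exactly the contrast e(m, k) is designed to expose.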

Head, Intermediate, and Tail Effectiveness
One may be interested in whether the effectiveness is concentrated in the prefixes, infixes, or suffixes of messages. However, e_k (Eq. 17) does not seem suitable for this purpose: since message lengths can vary, the effectiveness of infixes and suffixes can scatter across the e_k. Thus, we additionally introduce head effectiveness e_head, intermediate effectiveness e_med, and tail effectiveness e_tail.
Intuitively, e_head is the mean effectiveness across the heads of messages (i.e., x_1 in m = x_1 ... x_|m|) and across speaker agents. Similarly, e_med (resp. e_tail) is the mean effectiveness across the intermediate positions (resp. tails) of messages and across speaker agents:

e_head = mean of e(m, 1),  (18)
e_med = mean of e(m, ⌊(1 + |m|) / 2⌋),  (19)
e_tail = mean of e(m, |m|),  (20)

where each mean is taken across messages and speaker agents as in Eq. 17, and ⌊·⌋ is the floor function.

Table 1: Number of successful runs out of 16 for each model.
ER (baseline): 16
DER: 7
SML: 6
DER+SML: 11

Hyperparameter Setting
In all our experiments, the size |I| of the input space was set to 256, the size |A| of the alphabet was 40, the size of the hidden layers was 100 for both agents, and the entropy-regularizer coefficient λ_H was 1. The noise hyperparameters σ_S, σ_L, and π_C varied across sections.
We define a training run ending with an accuracy higher than 0.99 as a successful run.
In addition, to check the effects on learning, Table 1 shows the number of successful runs out of 16 for each model. Although the apparent tendencies in Figure 2 are similar between the DER and DER+SML models, Table 1 suggests that it is easier to learn with the DER+SML model, which has 5 more successful runs than the SML model.

Effects of Noise
In this section, we show the influence of noise on a speaker, a listener, and a channel. We used the DER+SML model with the same hyperparameters as in the previous section and examined the effect of each type of noise by varying σ_S, σ_L, and π_C. Note that σ_S is the standard deviation of the noise on a speaker, σ_L is that of the noise on a listener, and π_C is the channel replacement probability.
To see the overall tendency, we show mean message lengths for each model in Figure 3.^5 The tendency shifts from anti-ZLA to one between ZLA and anti-ZLA as σ_S gets bigger.
In addition, we show Spearman correlations between input frequency ranks and message length ranks in Table 2. Intuitively, ρ < 0 implies ZLA and ρ > 0 implies anti-ZLA. According to Table 2, ρ gets smaller as σ S gets bigger, which is consistent with the observation in Figure 3.
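Correlations like those in Table 2 are standardly computed with scipy.stats.spearmanr; as a self-contained sketch of what that statistic is (the Pearson correlation of the two rank vectors, with tied values sharing average ranks; function names are ours):

```python
def rankdata(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the block of ties
        avg_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Feeding in input frequency ranks and message lengths then yields ρ = ±1 for perfectly monotone relationships, with the sign distinguishing the ZLA and anti-ZLA regimes.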
To check the symbol effectiveness, we show e_k (Eq. 17) in Figure 4. Judging from Figure 4, the effectiveness at earlier positions becomes higher as σ_S gets bigger. We also show e_head, e_med, and e_tail (Eq. 18, Eq. 19, and Eq. 20) in Figure 9. In Figure 9, the bigger σ_S is, the higher e_head and e_med are, indicating that the former halves of messages become more informative by the effect of noise on a speaker.

^5 There are some messages of length max_len = 40 while other messages are much shorter. We excluded the former in Figure 3 because otherwise the mean lines would have unnatural peaks and impair readability. As a result, 4 out of 1792, 30 out of 2048, and 7 out of 1526 data points were removed for σ_S = 1/4, 1/2, and 1 respectively.
These results suggest that noise on a speaker is a factor for ZLA, or at least causes message lengths to be closer to ZLA. One possible reason is that noise accumulation over time made it difficult for a speaker agent to generate long consistent messages.
To see the overall tendency, mean message lengths are shown in Figure 5. The apparent tendencies are quite similar among all the settings including 'no noise,' showing clear anti-ZLA tendencies. Spearman correlations in Table 2 also suggest anti-ZLA tendencies.
To check the symbol effectiveness, we show e_k (Eq. 17) in Figure 6. In Figure 6, e_k for σ_L > 0 shows similar tendencies to those for 'no noise,' although the peak of e_k for σ_L = 1/2 is lower than in the other results. e_head, e_med, and e_tail (Eq. 18, Eq. 19, and Eq. 20) are shown in Figure 9. According to Figure 9, e_head for σ_L > 0 tends to be smaller than that for 'no noise,' but the overall tendencies seem similar (e.g., e_head < e_med < e_tail).
These results suggest that noise on a listener is not a crucial factor for changing a tendency in emergent languages. The listener's short-term memory is thought to have been limited by noise accumulation over time, as e_head got smaller. However, even when there was no noise, informative symbols tended to be located in the latter half of messages, i.e., e_head < e_med < e_tail, which is one possible reason why noise on a listener did not crucially affect the overall tendency.
To see the overall tendency, mean message lengths are shown in Figure 7. The apparent results for π_C > 0 are similar to those for 'no noise,' showing clear anti-ZLA tendencies. Spearman correlations in Table 2 also suggest anti-ZLA tendencies.

To check the symbol effectiveness, we show e_k (Eq. 17) in Figure 8. In Figure 8, e_k becomes lower overall as π_C gets bigger. e_head, e_med, and e_tail (Eq. 18, Eq. 19, and Eq. 20) are shown in Figure 9, where all three become lower as π_C gets bigger. Recall that a low e(m, k) (Eq. 14) means that the symbol at position k in m is redundant. Thus, lower e_k, e_head, e_med, and e_tail indicate that symbols are redundant on the whole.
These results suggest that redundancy was facilitated due to the noise on a channel. It is consistent with Zipf's hypothesis and a noisy-channel model.

Discussion
Our experiments suggest that noise on a speaker is a factor for ZLA, while noise on a listener and a channel is not in our signaling game.
One possible explanation for the effect of noise on a speaker is that noise accumulation matters as time goes on. At each trial, the speaker agent receives an input i and transforms it into an initial hidden state h_0. The hidden states need to maintain the input i in some way to emit consistent symbols. But noise accumulates over time and harms this memory, which may cause frequent messages to become shorter. However, the result per se shows a neutral tendency between ZLA and anti-ZLA. Our implicit length pressure might not have been strong enough, or there might have been some problems with the agents' architectures.
Noise on a listener is not a crucial factor for ZLA in our setting. Judging from symbol effectiveness, the latter halves of messages tend to be more informative than the former when noise interferes with the listener. This means that the listener could "forget" the former halves of messages. In the first place, however, the former halves are less informative even when there is no noise. That may be why noise on a listener did not affect the overall tendency. Noise on a channel seems to facilitate the redundancy of messages, which is consistent with Zipf's hypothesis and a noisy-channel model.

(Figure 9: e_head, e_med, and e_tail in successful runs under various noise conditions.)
To help agents with learning, we used the two auxiliary losses DER (Eq. 11) and SML (Eq. 13), which are somewhat artificial. In particular, the use of SML conflicts somewhat with our original goal of giving rise to ZLA via an implicit penalty, as SML is similar to an artificial length pressure (Eq. 6).

Conclusion
In this paper, we simulated the emergence of language and checked whether the emergent languages follow Zipf's law of abbreviation (ZLA). Inspired by some psychological concepts, we proposed exposing architectures to some noise during training. Our experiments were conducted under several noise conditions. The results suggested that noise on a speaker agent is one factor for ZLA, whereas neither noise on a listener nor noise on a channel is in our signaling game.
Our main contribution is to propose a potential factor for ZLA instead of an external length pressure and to demonstrate that noise imposing internal difficulty on a speaker agent may cause ZLA.
However, there are several problems and limitations in addition to what is discussed in section 5. First, we did not try combinations of noises. One might be interested in combining the noises on a speaker, listener, and channel, but we failed to train agents stably under such conditions, simply because learning became much more difficult under multiple sources of noise.
Second, our signaling game did not contain any contexts. As the input space was no more complex than an ordering by frequency, emergent languages could only have a unigram-like structure. However, according to Piantadosi et al. (2011), word predictability given context is a better predictor of word length than unigram probability. From a more realistic point of view, therefore, contexts should be considered in some way. Moreover, if agents are forced to remember contexts, noise on a listener may also become a factor for ZLA by making the listener impatient.
We leave these issues for future work.