The Effect of Efficient Messaging and Input Variability on Neural-Agent Iterated Language Learning

Natural languages display a trade-off among different strategies to convey syntactic structure, such as word order or inflection. This trade-off, however, has not appeared in recent simulations of iterated language learning with neural network agents (Chaabouni et al., 2019b). We re-evaluate this result in light of three factors that play an important role in comparable experiments from the Language Evolution field: (i) speaker bias towards efficient messaging, (ii) non-systematic input languages, and (iii) learning bottleneck. Our simulations show that neural agents mainly strive to maintain the utterance type distribution observed during learning, instead of developing a more efficient or systematic language.


Introduction
The world's languages show immense variety, but linguistic patterns also show important universal tendencies (Greenberg, 1963). It has been argued that these common design features are shaped by human cognitive constraints and pressures during communication and transmission (Kirby et al., 2014). A well-known example of these tendencies is the trade-off between case marking and word order as redundant strategies to encode the role of sentence constituents (Sinnemäki, 2008; Futrell et al., 2015): flexible order typically correlates with the presence of case marking (e.g. in Russian), while fixed order is often observed in languages with little or no case marking (e.g. English).
Researchers interested in language universals and their origins have extensively used agent-based modeling techniques to study the impact of social processes on the emergence of linguistic structures (de Boer, 2006). Besides the horizontal transmission that is often modeled in the referential game setup, the process of iterated learning, where signals are transmitted vertically from generation to generation, has also been identified as a force shaping language (Kirby, 2001; Kirby et al., 2014).
Recently, the advent of deep-learning-based NLP has triggered a renewed interest in agent-based simulations of language emergence. Most existing studies simulate the emergence of language by letting neural network agents play referential games and studying the signals they use (Kottur et al., 2017; Havrylov and Titov, 2017; Lazaridou et al., 2018; Chaabouni et al., 2019a; Dagan et al., 2021). By contrast, Chaabouni et al. (2019b) expose their agents to a pre-defined language, which is then learned and reproduced iteratively by a chain of agents. They analyze how specific properties of the initial languages affect learnability, and further investigate how agent biases affect the evolution across generations. Among others, they studied whether neural agents tend to avoid redundant coding strategies as natural languages do. However, the case-marking/word-order trade-off did not clearly appear in their iterated learning experiments.
In this work, we re-evaluate this finding in light of three factors that play an important role in comparable experiments from the Language Evolution field: (i) speaker bias towards efficient messaging (i Cancho and Solé, 2003), (ii) unpredictable variation in the initial languages (Smith and Wonnacott, 2010; Fedzechkina et al., 2017), and (iii) exposure to a limited set of example utterances, known as 'learning bottleneck' (Kirby et al., 2014). We follow the iterated learning setup of Chaabouni et al. (2019b), where neural agents are trained to communicate about trajectories in a simple gridworld, exchanging instructions in miniature languages (Fig. 1).


Miniature Languages
Word order and case marking have been shown to serve as equally valuable cues to learn grammatical roles during language learning by children (Slobin and Bever, 1982), and by simple recurrent neural networks (Lupyan and Christiansen, 2002). Each utterance or sentence (S) consists of several phrases (P), which in turn are composed of a command (C) and a quantifier (Q):

Rule 1: S → P_1 P_2 ... P_i
Rule 2: P → C Q
C → left | right | up | down
Q → 1 | 2 | 3

where left, right, up, down, 1, 2, 3 are spoken words which are atomic elements of the language. We consider three basic language types: Fixed-order with marker (redundant), Fixed-order without marker (non-redundant), and Free-order with marker (non-redundant). See examples in Fig. 1.
Fixed-order vs. Free-order This concerns Rule 1: in a fixed-order language, the order of phrases strictly corresponds to the temporal order of instructions in the trajectory. Free-order languages, instead, allow any permutation of phrases. For instance, there are in total 3! = 6 possible free-order utterances for a 3-phrase trajectory (Fig. 1).
Case Marking In a case-marking language, each phrase is preceded by a temporal marker indicating its role. Thus, Rule 2 changes to: P_i → m_i C Q, with the marker m_i indicating that C Q is the i-th action segment. Note that a fully free-order language is unintelligible without markers.
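To make the grammar above concrete, the following sketch generates utterances for the three basic language types. The marker word forms ('m1', 'm2', ...) and the Python encoding of trajectories as lists of commands are our own illustrative choices, not taken from the original implementation.

```python
import itertools
import random

COMMANDS = ["left", "right", "up", "down"]

def phrases(trajectory):
    """Collapse a trajectory (list of commands) into (command, quantity)
    phrases, e.g. ['right', 'up', 'up'] -> [('right', 1), ('up', 2)]."""
    return [(c, len(list(g))) for c, g in itertools.groupby(trajectory)]

def utterance(trajectory, order="fixed", marker=True):
    """Generate one utterance for a trajectory under a basic language type.
    Markers m1, m2, ... encode the temporal position of each phrase, so a
    free-order utterance stays interpretable when markers are present."""
    ps = list(enumerate(phrases(trajectory), start=1))
    if order == "free":
        random.shuffle(ps)  # any permutation of phrases is acceptable
    words = []
    for i, (cmd, qty) in ps:
        if marker:
            words.append(f"m{i}")
        words.extend([cmd, str(qty)])
    return " ".join(words)
```

For example, `utterance(["right", "up", "up"], order="fixed", marker=True)` yields "m1 right 1 m2 up 2", while dropping the marker argument yields the shorter "right 1 up 2".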

Neural-Agents Iterated Learning
We strictly follow the iterated learning setup of Chaabouni et al. (2019b) unless explicitly noted.
Agent architecture Agents are implemented as 1-layer attentional Seq2Seq (Sutskever et al., 2014; Bahdanau et al., 2015) LSTM (Hochreiter and Schmidhuber, 1997) networks. Each agent acts as both speaker (receiving trajectories and describing them with utterances) and listener (receiving utterances and trying to induce the corresponding trajectories). The vocabulary contains both actions and words; embeddings of the encoder input and decoder output are tied (Press and Wolf, 2017).
Individual and iterated learning Given trajectory-utterance pairs, agents are trained by teacher forcing (Salakhutdinov, 2014) in both listening and speaking mode, using the early-stopping and optimizer settings of Chaabouni et al. (2019b). Iterated learning (Kirby, 2001) is implemented by letting a trained adult agent teach a randomly initialized child agent, and repeating this process for a number of generations. At each generation, two steps are performed: (1) a trained adult agent receives a batch of trajectories and generates utterances by sampling from its own decoder outputs; (2) a randomly initialized child agent is trained on these agent-specific trajectory-utterance pairs. As an exception, the generation-0 agent is directly trained on the corpus generated by a given miniature grammar.
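The generational loop can be sketched as follows; `train` and `speak` are placeholders standing in for the actual Seq2Seq training and decoder-sampling routines, which are not shown here.

```python
def iterated_learning(initial_corpus, trajectories, n_generations, train, speak):
    """Sketch of the iterated learning loop.

    train(corpus)     -> fits a fresh (randomly initialized) agent on
                         trajectory-utterance pairs (placeholder).
    speak(agent, t)   -> samples an utterance for trajectory t from the
                         agent's decoder (placeholder).
    """
    corpus = initial_corpus  # generation 0 learns directly from the grammar
    child = None
    for gen in range(n_generations):
        child = train(corpus)  # step (2): child learns from the corpus
        # step (1): the trained agent becomes the adult teacher and
        # generates the next generation's training data by sampling
        corpus = [(t, speak(child, t)) for t in trajectories]
    return child
```

A toy instantiation, e.g. `train = lambda corpus: dict(corpus)` with `speak = lambda agent, t: agent[t]`, makes the data flow between generations explicit without any neural machinery.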
Evaluation In both speaking and listening mode, sequences are generated by greedy decoding and evaluated by sentence-level accuracy. Listener evaluation is standard as the true meaning of an utterance is unique. For speakers in the first generation, instead, we consider all acceptable utterances according to the grammar as candidate targets; for later generations, we take k = i! utterances sampled from the parent's speaking network as targets (i is the maximum number of phrases per trajectory). Validation for early stopping is performed similarly. These evaluation procedures allow a child agent's language to deviate from the parent language, even while achieving perfect accuracy.
For each experiment, we report speaking accuracy, listening accuracy, as well as average utterance length across generations. To get more insight into language changes, we plot the distribution of utterance types in the adult speaking agents across generations. Specifically, we count how often an utterance belongs to one of the basic language types (fix, fix_marker, free, free_marker), and how often markers are dropped for some of the phrases (fix_drop, free_drop). Utterances that do not fall into any of these categories are labeled as 'other'.
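A minimal way to implement this categorization is to test an utterance's membership in the set of acceptable strings per basic type. The sketch below is simplified (it does not detect partial marker drops such as fix_drop/free_drop) and assumes the hypothetical 'm1', 'm2', ... marker tokens; neither detail comes from the original implementation.

```python
import itertools

def candidates(phrases, order, marker):
    """All acceptable utterances of one basic language type, given the
    trajectory's phrases as (command, quantity) pairs in temporal order."""
    idx = list(enumerate(phrases, start=1))
    orders = itertools.permutations(idx) if order == "free" else [idx]
    outs = set()
    for perm in orders:
        words = []
        for i, (cmd, qty) in perm:
            if marker:
                words.append(f"m{i}")  # marker keeps its temporal index
            words += [cmd, str(qty)]
        outs.add(" ".join(words))
    return outs

def classify(utt, phrases):
    """Label an utterance string with a basic type, or 'other'.
    Fixed order is checked first, since every fixed-order utterance is
    also a (trivial) permutation. Marker-less free order is
    unintelligible and therefore falls through to 'other'."""
    if utt in candidates(phrases, "fixed", True):
        return "fix_marker"
    if utt in candidates(phrases, "fixed", False):
        return "fix"
    if utt in candidates(phrases, "free", True):
        return "free_marker"
    return "other"
```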
For more technical details on the agents and evaluation, see Appendix A. Example utterances at various generations are shown in Appendix B.

Effect of Least-Effort Bias
A bias towards efficient messaging has been proposed as an explanatory factor for several tendencies observed in natural languages (i Cancho and Solé, 2003; Kanwal et al., 2017; Fedzechkina et al., 2017). Could the lack of such a bias in neural networks explain the survival of redundant languages? To verify this, we design a simple mechanism simulating an agent's preference to minimize utterance length, based on Chaabouni et al. (2019b)'s framework: to teach the next generation, given a trajectory t, an adult agent generates n (possibly identical) utterances {û} = {û_1, û_2, ..., û_n} by sampling from its trained network. Instead of modifying the training process, we exploit the diversity occurring in the sampled utterances and hard-code a shorter-sentence selection bias into this adult language generation. As shown in Algorithm 1, the sampling function is called n times to generate n samples. In turn, at each of the n iterations, we ask the adult speaker to generate s candidate sentences and select the shortest one. Thus, we can control the bias strength by varying the number of generated samples s in each of the n iterations. As s increases, the chances of sampling a shorter sentence increase, resulting in a stronger pressure; s = 1 means no bias. We expect this least-effort bias will cause the redundant disambiguation mechanism to gradually disappear, and the fixed-order strategy to dominate, as it always leads to shorter utterances.
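The shortest-sentence selection can be sketched as below; `sample_fn` stands in for sampling from the trained speaker network, and the name of the bias-strength parameter is our own notation.

```python
def speak_with_bias(sample_fn, trajectory, n, s):
    """Least-effort speaker bias: for each of the n utterances transmitted
    to the child, draw s candidate utterances from the trained speaker
    (sample_fn is a placeholder for decoder sampling) and keep the
    shortest; s = 1 recovers unbiased sampling."""
    outputs = []
    for _ in range(n):
        cands = [sample_fn(trajectory) for _ in range(s)]
        outputs.append(min(cands, key=len))  # hard-coded length preference
    return outputs
```

Note that the bias acts purely at data-generation time: the child's training objective is untouched, so any pressure towards shorter utterances must come from the skewed distribution of its input.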
Results Fig. 2 shows the iterated learning results of the Fixed+Marker language with various levels of least-effort bias s ∈ {1, 3, 5, 8}, representing no pressure and low-, medium- and high-level pressure towards shorter utterances, respectively. The experiment without least-effort pressure (s = 1) serves as our baseline. We find that, while speaking accuracy remains stable (2a), our least-effort pressure leads to a severe drop in listening accuracy (2b) and a dramatic increase of other utterance types in the speaking adult agent starting from the fifth generation (2d). Stronger levels of pressure lead to a faster decrease of average utterance length (2c), as expected. However, manual inspection of the utterances (Appendix B) reveals that agents start dropping entire phrases, thereby losing information, instead of either dropping markers or changing the word order.

Effect of Input Language Variability
In the languages of Chaabouni et al. (2019b), markers are either present and fully systematic, or not present at all. If there is no marker example in the initial language, it is unlikely an agent would suddenly invent one. Conversely, a fully systematic use of markers may be perfectly learnable by the agent, and thus unlikely to change or disappear over generations. Unpredictable variation, instead, is a common feature of the artificial languages used in human learning studies. For example, Fedzechkina et al. (2017) combine optional case marking with either fixed or free word order, while Smith and Wonnacott (2010) use two plural markers with different distributions over nouns. Inspired by this, we experiment with unpredictable variations in the use of markers, namely: (i) variability among utterances, where each utterance is consistent with one of the basic language types chosen at random, and (ii) variability within utterances, where the use of markers is also unpredictable within a single utterance.

Variability Among Utterances
For every trajectory in the initial training set, two utterances are generated for each of the three basic language types: (i) Fixed-order+Marker, (ii) Fixed-order without markers, and (iii) Free-order+Marker. Our goal here is to find out whether the agents will tend to prefer any of the three language types over generations, according to their inherent biases.
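The construction of this mixed initial corpus can be sketched as follows; `gen_utt` is a placeholder for the grammar-based utterance generator, so each trajectory ends up paired with six utterances (two per basic type).

```python
def mixed_corpus(trajectories, gen_utt):
    """Build the mixed initial training set: two utterances per basic
    language type for every trajectory. gen_utt(t, order, marker) is a
    placeholder for the grammar-based generator."""
    basic_types = [("fixed", True),   # Fixed-order+Marker
                   ("fixed", False),  # Fixed-order without markers
                   ("free", True)]    # Free-order+Marker
    pairs = []
    for t in trajectories:
        for order, marker in basic_types:
            for _ in range(2):  # two samples per type
                pairs.append((t, gen_utt(t, order, marker)))
    return pairs
```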
Results Results are shown in Fig. 3 (blue lines). The overall high speaking accuracy suggests that the agents can learn to imitate their parents' language very well. We observe a slow, but steady, loss in listening accuracy, which we attribute to the random sampling errors from the parent speaker and the natural presence of errors in the neural network learning process. Besides a steady increase of uncategorizable utterances (other) in Fig. 3d, the distribution of the three language types remains relatively stable even after 20 generations. We looked for sentences where only some of the markers are dropped (free_drop/fix_drop) but found almost none. See examples in Appendix B. These results show that mixing language types in the initial training set is not sufficient to induce the loss of redundant encoding with agent learners, as opposed to human learners (Fedzechkina et al., 2017).

Results with Least-Effort Bias
We also study the combination of the two factors (Mix+pressure) using medium pressure (s = 3). This setup (Fig. 3, green lines) leads to a more efficient language during the first five generations, as shown by the initially stable speaking and listening accuracy and a decrease of average length. This phase corresponds to a proliferation of fixed-order no-marker sentences and the disappearance of the other language types (3e). Within only two generations, this language reaches the shortest possible overall length while remaining intelligible. After a while, however, child agents start to receive shorter but incorrect utterances, resulting in a drop of listening accuracy and, finally, an unintelligible language.

Variability Within Utterances
We design a more unpredictable language where each phrase marker is randomly dropped with a given probability (10%), similar to the optional case marking of Fedzechkina et al. (2017). We expect agents will either stop using markers completely over generations, or use them more consistently.
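The marker-dropping procedure can be sketched as below, again assuming the illustrative 'm1', 'm2', ... marker tokens (the actual token inventory may differ).

```python
import random

def drop_markers(tokens, p=0.10, rng=random):
    """Within-utterance variability: delete each marker token independently
    with probability p, leaving command and quantifier words untouched.
    Markers are assumed to look like 'm1', 'm2', ... (an assumption)."""
    return [w for w in tokens
            if not (w.startswith("m") and w[1:].isdigit() and rng.random() < p)]
```

With p = 0.1, most utterances keep all their markers, but a learner can never be sure whether a given phrase will be marked, which is exactly the unpredictability this experiment probes.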

Results
Despite the relatively small probability of dropping a marker, speaking and listening accuracies drop rapidly (3a and 3b, grey lines) and other utterances become dominant (Fig. 3f). This language becomes unintelligible before regularization is observed, again challenging our expectations.

Effect of Learning Bottleneck
Although real languages support the production of enormous sets of utterances, human learners master them by exposure to a limited number of examples. This learning bottleneck acts as a pressure forcing languages to regularize during cultural transmission (Smith et al., 2003; Brighton et al., 2005). Human-based experiments and computational simulations have found that this pressure can lead to increased structure in emerging language systems (Kirby et al., 2014). We apply a learning bottleneck to our mixed language experiment (Sect. 5.1) by randomly sub-sampling, at each iteration, 50% of the data used to train the next generation. Evaluation and training details are the same as in Sect. 5.1.
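The bottleneck amounts to a per-generation random subsample of the adult-generated training pairs, e.g.:

```python
import random

def bottleneck(pairs, keep=0.5, rng=random):
    """Learning bottleneck: the child agent is trained on a random
    fraction (here 50%) of the adult-generated trajectory-utterance
    pairs, drawn without replacement."""
    k = int(len(pairs) * keep)
    return rng.sample(pairs, k)
```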

Results
We find that training data sub-sampling leads to a slightly steeper drop in listening accuracy (yellow vs. blue line in Fig. 3b), but the respective distributions of utterance types (Fig. 3g vs. 3d) remain very similar, suggesting the learning bottleneck does not, by itself, push our agents towards a more regular language.

Discussion and Conclusions
Neural-agent iterated learning is a promising framework to study the impact of social processes on the emergence of patterns and language universals, like the word order/case marking trade-off. However, previous work with LSTM-based agents (Chaabouni et al., 2019b) has failed to replicate this human-like pattern. We re-evaluated this finding by (i) hard-coding a least-effort bias into our agents, (ii) designing less systematic input languages, and (iii) introducing a learning bottleneck. In all cases, our agents proved to be accurate learners, but the patterns of language change over generations did not match our expectations. Specifically, least-effort bias (§4) and highly unpredictable input language (§5.2) led to communication failure, whereas moderate input language variability (§5.1) and learning bottleneck (§6) led to a stable language distribution, confirming previous observations on the survival of redundant coding strategies in neural-agent iterated learning (Chaabouni et al., 2019b). Only combining least-effort bias with moderate language variability (§5.1) led to a temporary optimization of the language, but that was again followed by communication failure.
In real language use, the pressure to reduce effort is balanced with communicative needs (Kirby et al., 2015; Regier et al., 2015) and does not normally lead to severe language degradation. Future work should design subtler least-effort biases, for instance considering efficiency in terms of grammatical structure and cognitive effort. Moreover, our results with not fully systematic languages show that agents strive to preserve the initial utterance type distribution. In human learning, this behavior is called probability matching and is affected by task complexity: more difficult tasks tend towards regularization or over-matching (Ferdinand et al., 2019; Kam and Newport, 2009), where the more frequent variant is chosen more often than it appeared in the input. Over many generations, even a slight over-matching can lead to the emergence of linguistic regularities (Smith and Wonnacott, 2010) like the word order/case marking trade-off observed in human learners by Fedzechkina et al. (2017). We conclude that the current neural-agent iterated learning framework is not yet ready to simulate language evolution processes in a human-like way. More natural cognitive biases supporting efficiency should be modeled, while the speaker objective should be balanced with a measure of communicative success, such as the likelihood of a message to be understood by the listener (Goodman and Frank, 2016; Scontras et al., 2021).

A Training details
Following Chaabouni et al. (2019b) we limit the maximum number of segments per trajectory, i, to 5, with at most 3 steps per phrase, resulting in a total of 89k possible trajectories. Accordingly, the number of candidate target utterances during evaluation is set to k = i! = 120. As an exception, for the drop-marker language (Section 5.2) we limit i to 4 instead of 5 due to the computational cost of enumerating all correct utterances for a trajectory in this language during validation (accordingly, k is reduced to 24). The trajectory-utterance pairs are randomly split into training, validation and test sets with a proportion of 80%, 10% and 10% respectively.
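The reported figure of 89k trajectories is consistent with assuming that consecutive phrases must use different commands (otherwise two adjacent phrases would collapse into one); the following count is our own reconstruction of that combinatorics, not part of the original appendix.

```python
def n_trajectories(max_phrases=5, max_steps=3, n_commands=4):
    """Count trajectories of 1..max_phrases phrases, each phrase pairing a
    command with 1..max_steps steps, assuming consecutive phrases use
    different commands (an assumption needed to reach ~89k)."""
    first = n_commands * max_steps         # first phrase: 4 * 3 options
    follow = (n_commands - 1) * max_steps  # later phrases: 3 * 3 options
    return sum(first * follow ** (i - 1) for i in range(1, max_phrases + 1))
```

Under these assumptions `n_trajectories()` evaluates to 88,572, matching the reported 89k.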
We fix the hidden layer size to 20.

B Example utterances
Table 3: Utterances sampled from the agents' speaking network given the trajectory 'RIGHT UP UP DOWN RIGHT RIGHT RIGHT' in Fix+Marker language learning with pressure (§4), Mix language learning without and with pressure (§5.1), Mix_drop language learning (§5.2) and Mix language with learning bottleneck (§6). For each experiment and each generation, we show six randomly sampled utterances (duplicates are omitted for clarity).