Inducing Transformer’s Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks

Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions. However, existing neural models have been shown to lack this basic ability in learning symbolic structures. Motivated by the failure of a Transformer model on the SCAN compositionality challenge (Lake and Baroni, 2018), which requires parsing a command into actions, we propose two auxiliary sequence prediction tasks as additional training supervision. These automatically-generated sequences are more representative of the underlying compositional symbolic structures of the input data. During inference, the model jointly predicts the next action and the next tokens in the auxiliary sequences at each step. Experiments on the SCAN dataset show that our method encourages the Transformer to understand compositional structures of the command, improving its accuracy on multiple challenging splits from ≤ 10% to 100%. With only 418 (5%) training instances, our approach still achieves 97.8% accuracy on the MCD1 split. Therefore, we argue that compositionality can be induced in Transformers given minimal but proper guidance. We also show that a better result is achieved using less contextualized vectors as the attention’s query, providing insights into architecture choices in achieving systematic compositionality. Finally, we show positive generalization results on the grounded-SCAN task (Ruis et al., 2020).


Introduction
Human intelligence, including natural languages, demonstrates systematic compositionality, the algebraic capacity to understand and produce a potentially infinite number of novel combinations of known components (Chomsky, 1957; Montague, 1970). For example, we know the usage of the words "walk," "twice," and "and"; once we learn a new verb "dax", we can immediately understand or produce utterances like "dax twice and walk." This type of compositionality is central to the human ability to make strong generalizations from limited data (Lake et al., 2017). However, there have been arguments that neural networks are associative devices that cannot capture systematic compositionality (Fodor and Pylyshyn, 1988; Marcus, 1998; Fodor and Lepore, 2002; Marcus, 2003; Calvo and Symons, 2014). Supporting this view, it has been shown that general neural models, like RNNs and Transformers (Vaswani et al., 2017), generalize poorly to development-set examples that contain unseen combinations of components seen in the training set (Liu et al., 2020).
Table 1: Examples from the SCAN dataset under the MCD1 split (Keysers et al., 2020). The outputs include the action sequence and two auxiliary sequences we created. "JP" is short for "JUMP" and "TL" for "TURN LEFT". Commands "[primitive] around left twice" are excluded from the training set, so the model must build a generalizable understanding of "twice" from training examples like "jump opposite left twice".
However, recent works have equipped recurrent neural networks (RNNs) with separate primitive and functional embeddings of the input tokens (Li et al., 2019; Russin et al., 2020). On the SCAN dataset (Lake and Baroni, 2018), which requires parsing a command into actions, these models can effectively parse "jump thrice" when only trained on how to "walk thrice", "walk", and "jump". In this work, we first show that this dual embedding method from CGPS-RNN (Li et al., 2019) can be transferred to the Transformer architecture to achieve nearly perfect results in substituting new primitives. Our CGPS-Transformer encoder maintains a functional/syntactic embedding and a primitive/semantic embedding for every word in the vocabulary. This separation of syntax and semantics is crucial in achieving compositionality: during training, the model successfully learns the syntactic similarity between "jump" and other primitives through training examples like "jump → JUMP" and "walk → WALK". The semantic difference between primitives (e.g., "jump" should be translated into "JUMP" rather than "WALK") is encoded into the semantic embeddings, which do not participate in all but the last attention layer. Therefore, the model can generalize to the test example "jump around left" from the training example "walk around left." Next, we show that although this model is capable of substituting new primitives (e.g., "jump") into learned structures (e.g., "[prim] around left"), it still fails to learn the compositional structure of larger syntactic units.
For example, in the MCD splits (Keysers et al., 2020) that maximize the output compound divergence between train and test sets, CGPS-Transformer fails to execute "walk around left twice" after training on "walk around left" and "walk left twice" (see examples in Table 1), only registering an average exact-match score of 5.7% on the three splits. This model also struggles to generalize to action sequences longer than those seen in training. Hence, in this work, we automatically create two simple and intuitive auxiliary sequence generation tasks to represent the lower-level symbolic structures of input commands. These tasks can better teach the Transformer model to achieve compositional generalization. For the example "walk left thrice → TURNL WALK TURNL WALK TURNL WALK", we create the first sequence [2, 2, 1, 1, 0, 0] to track the progress of three "walk left" actions and to ensure that the correct number of repetitions of the action is executed. This sequence exposes the compositional structure of the action sequence "TURNL WALK TURNL WALK TURNL WALK" as three separate segments of "TURNL WALK". We also create a second sequence [1, 0, 1, 0, 1, 0] to supervise the successful completion of parsing every "walk left" action into "TURNL WALK". This sequence isolates the semantics of "walk left" as an action sequence of length 2. Overall, we extend the original seq-to-seq task defined in SCAN to a new seq-to-3seq task. The two ground-truth auxiliary sequences, like the action sequence, are only given for training, and the model has to predict these three sequences jointly at test time.
On the three MCD splits, the CGPS-Transformer that predicts the auxiliary sequences achieves a perfect test-set accuracy of 100%, as compared to 7.66%, 3.25%, and 6.12% from the same model without the auxiliary sequence prediction tasks. Our approach also generalizes to longer action sequences with 100% accuracy, as compared to 0% from the baseline without the auxiliary sequences. Two previous works achieved significant progress on the MCD splits. LANE (Liu et al., 2020) explicitly models the procedure of recognizing a symbolic function (e.g., "x twice") and applying the function via two separate models. These two models each make their discrete predictions step-by-step and are jointly trained with Hierarchical Reinforcement Learning. Guo et al. (2021) proposed to use monolingual dev/test data for semi-supervised learning, and their results are worse than ours. Our approach differs from these two works in that it builds upon the general seq2seq architecture and does not require a peek into novel dev-set commands.
We further demonstrate that the CGPS-Transformer model only needs a small amount of supervision with auxiliary sequences to develop compositionality, as it achieves 97.8% accuracy on the MCD1 split with 418 (5% of all) training examples. This suggests that systematic compositionality does not require a large number of training examples. Instead, a small number of well-designed demonstrations that exhibit the compositional structures of the data can better induce a generalizable model. Our ablation study shows that both auxiliary tasks are necessary in promoting the compositional generalization behavior of the model. We further conduct experiments to show that it is important to use the output/intermediate vectors of the decoder's first layer as the queries in the task-specific attention for predicting the first auxiliary sequence (e.g., [2, 2, 1, 1, 0, 0] for "jump left thrice"). If we instead use the decoder's highly contextualized final outputs as the queries, the model fails to predict the correct auxiliary sequence and the target actions. We make three arguments from this ablation study: (1) the model's prediction of the target action is dependent on its predictions of the auxiliary sequences, and it does not see them as three independent tasks; (2) predicting the auxiliary sequences, although seemingly simple, is not a trivial task and is highly correlated with understanding the compositional structure of symbolic functions; (3) it is easier to achieve compositionality using less contextualized representations as query vectors of the attention function. This third point echoes the fact that systematic compositionality values the meaning of some individual words (e.g., "twice" and "thrice") in symbolic structures.
Finally, we also show that our auxiliary sequence prediction method can be transferred to grounded-SCAN (Ruis et al., 2020), a newer multi-modal compositional challenge that requires new recombinations of seen phrases. Overall, we hope our method and findings can provide an insightful view of the compositional generalization in deep neural models and inspire future works in this direction.

SCAN Dataset Generalization Splits
The SCAN dataset (Lake and Baroni, 2018) consists of natural language command inputs (e.g., "jump twice and walk opposite left") paired with action sequence outputs (e.g., "JUMP JUMP TURNL TURNL WALK") generated synthetically. Each sub-command is made of four types of words: primitive ("walk, jump, look, run, turn"), adverb ("opposite, around"), direction adv. ("left, right"), and repetition adv. ("twice, thrice"). Each command contains at most 2 sub-commands connected via a conjunction ("and, after"). In order to test the compositional generalization ability, different splits of the SCAN dataset were created, each of which has a distributional shift between the training set and the dev/test sets. One of these splits is ADDJUMP, where the training set consists of the atomic example ("jump → JUMP") and all other atomic and compound commands without "jump"; the test set contains compound commands that involve "jump" (e.g., "jump around left thrice and jump left").
Later, Keysers et al. (2020) proposed a procedure to maximize the output compound divergence while guaranteeing a small atom divergence between train and test sets. They produced three MCD splits under this objective. For example, the training and dev sets of MCD1 have a similar distribution of individual words to ensure minimal atom divergence. However, the training set does not contain compounds "[primitive] around left twice", which only appear in the dev and test sets (as shown in Table 1). These splits require a higher level of compositionality than recognizing the syntactic equivalence of primitives: the models must be able to (1) understand the underlying symbolic functions "x twice → x x" from training examples like "jump left twice" and master the semantics of "jump around left" from examples like "jump around left thrice"; (2) compositionally apply the function "twice" to a novel argument "jump around left" in the dev and test sets. Therefore, most models that achieve close-to-perfect results on ADDJUMP fail on this challenge completely, struggling under 10% accuracy on all MCD splits.
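To make the command grammar described above concrete, the following is a minimal sketch of a command-to-action interpreter. It is not the dataset's generation code: the semantics of "opposite" (turn twice, then act) and "around" (turn and act, four times) reflect our understanding of the SCAN grammar of Lake and Baroni (2018), and the names `interpret` and `interpret_subcommand` are hypothetical.

```python
# Illustrative sketch only: assumed SCAN semantics, hypothetical function names.
TURN = {"left": "TURNL", "right": "TURNR"}
PRIMITIVE = {"walk": ["WALK"], "jump": ["JUMP"], "look": ["LOOK"],
             "run": ["RUN"], "turn": []}  # bare "turn" emits no extra action
REPEAT = {"twice": 2, "thrice": 3}


def interpret_subcommand(words):
    """Map one sub-command, e.g. ['jump', 'around', 'left', 'twice'], to actions."""
    words = list(words)
    repeat = 1
    if words[-1] in REPEAT:               # optional repetition adverb
        repeat = REPEAT[words.pop()]
    act = PRIMITIVE[words[0]]
    if len(words) == 1:                   # "jump"
        unit = act
    elif len(words) == 2:                 # "jump left": turn, then act
        unit = [TURN[words[1]]] + act
    elif len(words) == 3 and words[1] == "opposite":   # turn twice, then act
        unit = [TURN[words[2]]] * 2 + act
    elif len(words) == 3 and words[1] == "around":     # (turn, act) four times
        unit = ([TURN[words[2]]] + act) * 4
    else:
        raise ValueError("unsupported sub-command: " + " ".join(words))
    return unit * repeat                  # the symbolic function "x twice -> x x"


def interpret(command):
    """Interpret a command with at most one conjunction ('and' or 'after')."""
    words = command.split()
    for conj in ("and", "after"):
        if conj in words:
            i = words.index(conj)
            first = interpret_subcommand(words[:i])
            second = interpret_subcommand(words[i + 1:])
            # "x and y" runs x then y; "x after y" runs y then x.
            return first + second if conj == "and" else second + first
    return interpret_subcommand(words)


print(" ".join(interpret("walk left thrice")))
print(len(interpret("jump around left twice")) // 2, "repetitions of TURNL JUMP")
```

Under these assumed semantics, "jump around left twice" expands to 8 repetitions of "TURNL JUMP", which is exactly the application of "twice" to the argument "jump around left" that the MCD1 dev and test sets require.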

Failure Analysis of a Previous Model
Before introducing our method, we first analyze the failure mode of our CGPS-Transformer, a model that achieves 95.82% accuracy on the ADDJUMP test set but only 7.66% on MCD1. We observe that, given the novel dev command "jump around left twice" that requires 8 repetitions of "jump left", the model mistakenly generates the seen, training action sequence for "jump around left thrice", "jump around left", or "jump opposite left twice". In some examples, the model completes 11 "jump left", one repetition short of the similar training-set example. This evidence suggests that CGPS-Transformer does not understand the symbolic function "x thrice → x x x." During training, the model builds a representation for "jump opposite left thrice" as a whole and maps it to the correct action sequence to reach 100% accuracy. At test time, instead of generalizing compositionally to apply the symbolic function ("twice") to a novel argument ("jump around left"), the model generalizes distributionally to map this unseen compound to a seen example from the training set with similar semantics. Hence, this failure mode of the CGPS model motivates us to design auxiliary tasks that encourage the model to view the command as a symbolic structure: the function "twice" applied to the argument "walk around left". We elaborate on our method in the next section.
Figure 1: The CGPS-Transformer with primitive and functional input embeddings (Li et al., 2019). The decoder takes three partially generated sequences as the input, and predicts the next action and next ids in the auxiliary sequences. The parts highlighted in non-gray colors are added to the CGPS-Transformer to support the prediction of two auxiliary sequences.

Method
First, we briefly introduce the baseline model to which we apply our method (Sec. 3.1). We then motivate and introduce the auxiliary sequences we create to improve the model's compositional generalization ability (Sec. 3.2). Finally, we explain how a seq2seq model can jointly predict its target sequence and the auxiliary sequences during training and inference (Sec. 3.3).

CGPS-Transformer Baseline
The CGPS model (Li et al., 2019) has an RNN encoder that embeds and encodes the syntax and semantics of the input separately and an RNN decoder to achieve generalization over single-word substitutions (e.g., "walk/run → jump"). We recreate this model on top of the Transformer (Vaswani et al., 2017) as the baseline (visualized in Fig. 1).
In the SCAN dataset, we denote the input sequence as $x$, where each word is from an input vocabulary of size $U$. The output $y$ is a sequence of $T$ actions, where each action is from an output vocabulary of size $V$. The CGPS model has two separate embedding matrices for the input, the functional embedding $E_f$ and the primitive embedding $E_p$:
$$f = E_f(x), \qquad p = E_p(x),$$
where $f$ and $p$ are the functional and primitive embeddings of the input sequence. The encoder builds the contextualized representation $c$ of the input using the functional embeddings $f$, while the decoder produces the output vector $z_t$ by attending to $c$ and the previous actions $[y_1, ..., y_{t-1}]$. Instead of directly projecting $z_t$ to the logits on the output vocabulary, the decoder employs an extra multi-head attention layer, with $z_t$ as the query, $c$ as the key, and the primitive embeddings $p$ as the value. Its output vector, and further the logits, come from an attention average over the un-contextualized $p$:
$$o_t = \mathrm{MHAttn}(z_t, c, p),$$
where MHAttn is the multi-head cross-attention, $o_t$ is its output vector, and $\hat{y}_t$, the final distribution on the output vocabulary, is computed from the logits obtained from $o_t$. To enforce a strict separation of the information encoded in $f$ and $p$, they regularize the $L_2$ norm of both embeddings and add noise to them during training.
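The separation of keys and values described above can be illustrated with a small, single-head sketch of scaled dot-product attention. This is a simplification under stated assumptions, not the model's implementation: the real layer is multi-head with learned projections, and the function name `attend` and the toy vectors below are hypothetical.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    query:  decoder vector z_t (list of floats)
    keys:   contextualized encoder vectors c, one per input word
    values: un-contextualized primitive embeddings p, one per input word
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]       # softmax over input positions
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Toy example with three input words (all numbers are made up).
z_t = [4.0, 0.0]
c = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]             # keys: where to look
p = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # values: what to read
output, weights = attend(z_t, c, p)
print(weights)
print(output)
```

Because the output is a weighted average of the rows of `p` only, the contextualized vectors influence which word is selected but not what content is read out, which is the intuition behind keeping the primitive embeddings out of all but the last attention layer.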

Creating Auxiliary Sequences
For every command-action pair in the SCAN dataset, we automatically create two auxiliary sequences of the same length as the action sequence. These sequences represent the lower-level symbolic structures in the input and can better teach the model to achieve compositional generalization.
(1) The model sometimes executes "jump around left thrice" when it is actually asked to "jump around left twice". To prevent this error, we create the first auxiliary sequence AuxSeq1 (the 2nd row of every output in Table 1) to track the progress of the repetitions of "jump around left" and to ensure that the correct number of repetitions of the action is executed. For the example "walk left thrice → TURNL WALK TURNL WALK TURNL WALK", we create a sequence of ids [2, 2, 1, 1, 0, 0]. This sequence exposes the compositional structure of the action sequence "TURNL WALK TURNL WALK TURNL WALK" as three separate segments of "TURNL WALK": it ignores the content of every action and focuses on the symbolic functions embodied by "twice" and "thrice".
(2) The model also sometimes executes "jump opposite left twice" when it is actually asked to "jump around left twice". In response to this error, we create the second auxiliary sequence AuxSeq2 (the 3rd row of every output in Table 1) to supervise the correct completion of every single "jump around left". For a shorter example "walk left thrice → TURNL WALK TURNL WALK TURNL WALK", we create a sequence of ids [1, 0, 1, 0, 1, 0]. This sequence isolates the semantics of "walk left" as an action sequence of length 2. We argue that, if the model can correctly predict these two sequences and build a connection between them and the actions, it will learn the compositional structures of the commands and generalize to novel combinations in the test set. Please refer to the appendix Sec. B for more details about the auxiliary sequences.
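The two sequences for the "walk left thrice" example can be generated with a few lines of code. The sketch below covers only a single segment repeated a fixed number of times; how conjunctions and adverbs such as "around" are segmented is described in appendix Sec. B and is not reproduced here. The function name `auxiliary_sequences` is hypothetical.

```python
def auxiliary_sequences(segment, repetitions):
    """Build (actions, AuxSeq1, AuxSeq2) for one segment repeated several times.

    AuxSeq1 counts down the repetitions that remain after the current one.
    AuxSeq2 counts down the actions that remain inside the current segment.
    """
    actions, aux1, aux2 = [], [], []
    for rep in range(repetitions):
        for pos, action in enumerate(segment):
            actions.append(action)
            aux1.append(repetitions - 1 - rep)
            aux2.append(len(segment) - 1 - pos)
    return actions, aux1, aux2


# "walk left thrice": the segment "TURNL WALK" repeated three times.
actions, aux1, aux2 = auxiliary_sequences(["TURNL", "WALK"], 3)
print(actions)  # ['TURNL', 'WALK', 'TURNL', 'WALK', 'TURNL', 'WALK']
print(aux1)     # [2, 2, 1, 1, 0, 0]
print(aux2)     # [1, 0, 1, 0, 1, 0]
```

Both auxiliary sequences have the same length as the action sequence, so the three can be aligned position by position during decoding.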

Joint Prediction of Auxiliary Sequences
Now with these two auxiliary sequences, the original seq2seq task defined in SCAN is augmented to a 'sequence-to-3sequences' problem. Therefore, we make some adaptations to our Transformer decoder to jointly predict three sequences. First, we introduce two extra embedding matrices for the two auxiliary sequences in the decoder in addition to the existing action embeddings. The input to the decoder is the sum of three embedding vectors. After the regular Transformer layers, we add another multi-head cross-attention (the red component in Fig. 1) using the output $h_t$ of the decoder's first self-attention layer as the query, the input's functional embedding $f$ as the key, and the encoder's output representation $c$ as the value. The attention outputs $o_{aux}$ are then projected to the space of the auxiliary sequence ids to produce the logits of the next id in the auxiliary sequence.
Later experiments show that the choice of the query vector plays a crucial role in deciding whether the model can achieve the compositionality in understanding the command. During the training, the decoder takes the two auxiliary sequences, each prepended with a start-of-sentence token, as the input. We then maximize the log-likelihood of predicting the next id in the auxiliary sequence at each step. During the inference, the decoder uses the partial auxiliary and action sequences generated in the previous steps, instead of the ground-truth sequences, as the input.
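The inference procedure described above can be sketched as a greedy loop over three aligned prefixes. This is an illustration under assumptions rather than the actual decoder: `step_fn` is a hypothetical stand-in for one decoder step, and stopping when the action head emits an end-of-sequence token is our assumption, since the stopping rule is not specified here.

```python
SOS, EOS = "<sos>", "<eos>"


def joint_greedy_decode(step_fn, max_len=64):
    """Greedily decode the action sequence and both auxiliary sequences.

    step_fn(actions, aux1, aux2) receives the three partial sequences generated
    so far (each starting with SOS) and returns the next (action, id1, id2).
    """
    actions, aux1, aux2 = [SOS], [SOS], [SOS]
    for _ in range(max_len):
        next_action, next_id1, next_id2 = step_fn(actions, aux1, aux2)
        if next_action == EOS:               # assumed stopping rule
            break
        # The model's own predictions, not ground truths, become the next input.
        actions.append(next_action)
        aux1.append(next_id1)
        aux2.append(next_id2)
    return actions[1:], aux1[1:], aux2[1:]


def toy_step(actions, aux1, aux2):
    """Hand-written stand-in for a trained decoder on "walk left thrice"."""
    segment, repetitions = ["TURNL", "WALK"], 3
    t = len(actions) - 1                     # tokens generated so far
    if t == len(segment) * repetitions:
        return EOS, None, None
    rep, pos = divmod(t, len(segment))
    return segment[pos], repetitions - 1 - rep, len(segment) - 1 - pos


print(joint_greedy_decode(toy_step))
```

In a real model, `step_fn` would embed the three prefixes, sum the embeddings, run the decoder, and take the argmax of each of the three output heads.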

SCAN Dataset
The SCAN dataset (Lake and Baroni, 2018) consists of natural language commands paired with action sequences. Each data split has a distributional shift between its training and test sets to evaluate models' compositional generalization ability.

ADDJUMP: The training set consists of the atomic "jump" example and all atomic and compound commands without "jump"; the dev and test sets contain compound commands with "jump".
LENGTH: All command-action pairs are split according to the action sequence length into the training set (≤22 tokens) and dev/test set (≥24 tokens).
MCD: This benchmark (Keysers et al., 2020) includes three separate splits created to maximize the output compound divergence while guaranteeing a small atom divergence between train and test sets. We refer to Sec. 2.1 for more details about the challenges.
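As an illustration of the LENGTH criterion listed above, the following sketch partitions command-action pairs by action sequence length using the thresholds stated in the text. The function name `length_split` and the toy pairs are hypothetical, and dropping a pair whose length falls between the two thresholds is our assumption.

```python
def length_split(pairs, train_max=22, test_min=24):
    """Split (command, actions) pairs by the number of actions."""
    train, held_out = [], []
    for command, actions in pairs:
        n = len(actions)
        if n <= train_max:
            train.append((command, actions))
        elif n >= test_min:
            held_out.append((command, actions))
        # Lengths strictly between the thresholds are assigned to neither set.
    return train, held_out


pairs = [
    ("walk left thrice", ["TURNL", "WALK"] * 3),          # 6 actions
    ("jump around left thrice", ["TURNL", "JUMP"] * 12),  # 24 actions
]
train, held_out = length_split(pairs)
print([command for command, _ in train])
print([command for command, _ in held_out])
```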

Experimental Setup
We use 2 separate Transformer stacks as the encoder and decoder. Each stack has 2 layers, 2 heads per layer, 64 hidden units per head, and a feedforward dimension of 256. We train all our models using the Adam optimizer (Kingma and Ba, 2015) with a constant learning rate of $5 \times 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.98$. Each model is trained on an NVIDIA V100 for ∼16 hours with a batch size of 512. We use the same encoder-embedding regularization coefficient of 0.01 as Li et al. (2019).

Main Results
In Table 2, we show our model's performance against the CGPS-Transformer baseline and previous works on multiple splits of SCAN. The CGPS-Transformer baseline struggles to obtain the basic compositional generalization ability, with very low accuracies of 7.66%, 3.25%, and 6.12% on the three MCD splits, respectively. When the model faces novel combinations of seen elements, it fails in a systematic and predictable way, as we explained in Sec. 2.2: instead of generalizing compositionally to recognize the relationship between the symbolic functions ("twice") and their arguments ("jump around left"), the model generalizes distributionally to map this unseen compound to a seen example ("jump around left thrice") from the training set with similar semantics. The CGPS-Transformer baseline is also unable to generalize to examples with longer action sequences, with 0% accuracy on the LENGTH split. During training, the model has seen multiple short commands with the adverb "thrice" but has never seen "jump around left thrice", which has a longer sequence of actions. At evaluation, the model fails to perform this long command as it doesn't learn a compositional understanding of the symbolic function "x thrice → x x x". On the LENGTH and three MCD splits, our model that predicts the auxiliary sequences obtains 100% accuracy, significantly improving upon the baseline analyzed above. At the time of completing this work, there were two previous efforts (Semi-sup and LANE in Table 2) that achieved close-to-perfect performance on one or multiple MCD splits. Guo et al. (2021) explored semi-supervised learning using extra pseudo-parallel dev/test data and showed the efficacy of the iterative back-translation method under this setting. Its performance is worse than our method on the MCD splits.
LANE (Liu et al., 2020) explicitly models the procedure of recognizing a symbolic function (e.g., "x twice") and applying the symbolic function ("x twice → x x") via two separate models. These two models make their own discrete predictions and are jointly trained with Hierarchical Reinforcement Learning. Compared to these two previous methods, our work has a fundamental difference in the initial objective: we are investigating the possibility of inducing compositionality in the internal mechanism of a general seq2seq neural network. Therefore, we propose a data-driven approach that teaches the seq2seq model by examples and without exposure to the novel test-set commands.

Few-Shot Learning Studies
In order to understand the sample efficiency of the CGPS-Transformer when auxiliary sequences are available, we limit the number of training examples of the command-action pairs and auxiliary sequences. As shown in Table 3, the model still achieves 97.8% accuracy on the MCD1 split with only 418 (5% of all) training examples. We also control the amount of auxiliary sequence supervision given to the model during training. All (8365) command-action pairs from the training set are still available to the model. For those training examples without the auxiliary sequences, we feed a sequence of start-of-sentence tokens and do not supervise the model's predictions of the auxiliary sequences. As shown in Table 4, CGPS-Transformer can achieve 72.73% accuracy on the MCD1 split with 5% (418) of all ground-truth auxiliary sequences and 89.19% accuracy with 10% (836) of all ground-truths. This result seems surprising at first glance: the model obtains 97.8% accuracy with only 418 command-action pairs and auxiliary sequences, yet with 7947 extra command-action supervisions, the performance is even worse at 72.73%. Based on this observation, we believe that the extra examples without auxiliary sequences enhance the model's tendency to fit whole commands to distributed representations, and thus deteriorate the compositional reasoning ability of the model.
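The partial-supervision setting described above can be sketched as a preprocessing step. This is an illustration under assumptions: choosing the supervised subset uniformly at random, the dictionary field names, and the function name `mask_auxiliary_supervision` are all hypothetical; only the start-of-sentence filling and the unsupervised auxiliary predictions come from the text.

```python
import random

SOS = "<sos>"


def mask_auxiliary_supervision(examples, fraction, seed=0):
    """Keep ground-truth auxiliary sequences for only a fraction of examples.

    examples: list of (command, actions, aux1, aux2) tuples.
    Unsupervised examples get SOS-filled auxiliary inputs and a False flag that
    tells the training loop to skip the auxiliary losses for them.
    """
    rng = random.Random(seed)
    n_keep = int(len(examples) * fraction)
    keep = set(rng.sample(range(len(examples)), n_keep))  # assumed: uniform sample
    prepared = []
    for i, (command, actions, aux1, aux2) in enumerate(examples):
        supervised = i in keep
        prepared.append({
            "command": command,
            "actions": actions,               # action supervision is always kept
            "aux1": aux1 if supervised else [SOS] * len(actions),
            "aux2": aux2 if supervised else [SOS] * len(actions),
            "aux_supervised": supervised,
        })
    return prepared


# Dummy data with the training-set size quoted in the text.
dummy = [("walk left thrice", ["TURNL", "WALK"] * 3,
          [2, 2, 1, 1, 0, 0], [1, 0, 1, 0, 1, 0])] * 8365
prepared = mask_auxiliary_supervision(dummy, fraction=0.05)
n_supervised = sum(ex["aux_supervised"] for ex in prepared)
print(n_supervised, len(prepared) - n_supervised)  # 418 7947
```

With 8365 training pairs, a 5% fraction keeps 418 supervised examples and leaves 7947 with action supervision only, matching the counts quoted above.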

Analyzing the Architectures to Achieve Compositionality
We conduct another ablation study to show that it is important to use the intermediate or output vectors of the decoder's first layer as the queries in the attention layer for predicting the first auxiliary sequence (e.g., [2, 2, 1, 1, 0, 0] for "jump left thrice").
We present the comprehensive ablation results in Table 6, where the column headers correspond to different choices of the query: "L1-INT" stands for the intermediate vector (before cross-attention) of decoder's first layer; "L1-OUT" stands for the first layer's output vector (after cross-attention); "L2-OUT" is the decoder's final output vector after two layers. Each row represents a choice of the key and value vectors, among different combinations of f : functional embeddings, p: primitive embeddings, and c: contextualized vectors of the input command. Every cell contains the accuracy of both the action sequence and the first auxiliary sequence. We make three arguments from this ablation study. First, the model's prediction on the target action is dependent on its predictions of the auxiliary sequences and it does not see them as independent tasks. We can observe this clear trend in Table 6: the models with higher accuracy in predicting the auxiliary sequence (the 2nd number in each cell) are always better at predicting the action sequence (the 1st number in each cell) from SCAN.
Second, predicting the auxiliary sequences, although seemingly simple, is not a trivial task and is highly correlated with the compositional structure of symbolic functions. As shown by the results, the model would struggle to predict the auxiliary sequence correctly if we simply use the decoder's final output vector as the query to the attention, which leads us to the third point.
Third, it is easier to achieve compositionality using less contextualized representations as query vectors of the attention function. As shown in Table 6, the performance of using the decoder's first self-attention outputs "L1-INT" as the queries is consistently better than using the decoder's final outputs "L2-OUT", no matter what vectors we use as the keys and values. This finding echoes the fact that systematic compositionality values the functionality of some individual words (e.g., "twice" and "thrice") in certain symbolic structures. Such information can be partially lost or harder to isolate in the highly contextualized vectors.

Generalization to gSCAN
Finally, we show that our method can be generalized to gSCAN (Ruis et al., 2020), a multi-modal compositional challenge that grounds language in the states of a grid world. Similar to SCAN, the gSCAN "Adverb to verb" sub-task tests a model's ability to execute novel commands made of seen components (e.g., "pull" and "while spinning"). The "Adverb" sub-task challenges the model to learn the meaning of the adverb "cautiously" from just one or a few examples in training. As shown in Table 7, on these two adverb sub-tasks, adding our two auxiliary sequence prediction tasks improves the performance of the original LSTM baseline. This demonstrates that our auxiliary sequences are not only useful for SCAN, but can also have a strong positive impact on similar compositionality challenges that require recombination of seen phrases.

Discussion
In this section, we discuss what our experiments reveal regarding the compositionality of Transformers as well as the limitation of our method in terms of its applicability to other datasets.
First, we develop our method for SCAN, which is a synthetic dataset whose language commands are produced from a limited set of rules. Thus, it is unclear how the findings on the simplified SCAN setting can be transferred to large-scale, natural datasets. However, the community believes that SCAN is a valuable benchmark and a useful analysis tool for studying language compositionality because, first, its inputs are still realistic English, i.e., they use the same set of functional words that people use in natural language ("and", "after", etc.) and each of these words has a symbolic function that influences the structure of the output sequence. Second, it has been shown that even large pre-trained language models cannot achieve strong performance on SCAN, indicating that exposure to more texts and linguistic structures does not naturally induce compositionality in neural models. Therefore, our method's effectiveness and simplicity should still provide some key insights into the nature of the neural model's acquisition of compositionality.
Second, it is worth noting that the synthetic language input in SCAN can be written as a context-free grammar; as a result, we can design an automatic procedure to generate both auxiliary sequences based on the underlying grammar. Applying this method to a dataset with natural language requires designing a heuristic to approximate the underlying grammar. However, as the community is still trying to establish a basic understanding of whether/how a neural network can recognize the compositionality in language, an important first step could be taken under a simpler setting (e.g., SCAN) with a controlled grammar. Furthermore, predicting the auxiliary sequences, although seemingly simple, is not a trivial task and is highly correlated with the compositional structure of symbolic functions. The fact that the Transformer can predict the two auxiliary sequences perfectly suggests that it can model the compositional structure without extra information at test time if given the proper training supervision. Therefore, we believe that some of our observations are promising and exciting to the community.
Last but not least, our method achieves strong few-shot generalization (97.8% on MCD1 with only 418 training instances) and perfect length generalization. This opens up the possibility of using a small number of human-annotated auxiliary sequences to improve the models' performance on large-scale, natural datasets where automatically generating auxiliary sequences is infeasible.

Related Work

Compositional Generalization Datasets
The SCAN dataset (Lake and Baroni, 2018) consists of natural language commands paired with action sequences and comprises multiple splits that test the generalization of different compositional elements. Keysers et al. (2020) proposed a method to maximize compound divergence while guaranteeing a small atom divergence between train and test sets and created three MCD splits for SCAN. They also constructed the CFQ semantic parsing dataset of natural language questions paired with SPARQL output using this method. It was later expanded to *-CFQ (Tsarkov et al., 2021), a large suite of benchmarks based on the original CFQ task. COGS (Kim and Linzen, 2020) is a semantic parsing dataset with multiple systematic gaps that can only be addressed by compositional generalization. More related tasks (Loula et al., 2018; Liška et al., 2018; Bastings et al., 2018) are proposed on top of these original datasets to better evaluate the compositional generalization ability.

Compositional Generalization Methods
Many early works have explored the compositionality of neural networks, like RNNs, for systematic behavior (Wong and Wang, 2007;Brakel and Frank, 2009) in language learning and compositional counting ability (Wiles, 1998;Weiss et al., 2018). In a study of sensitivity to hierarchical structure (Linzen et al., 2016), the authors argued that sequential language modeling signal is insufficient for capturing syntax-sensitive dependencies and called for more direct supervision.
Recently, because of the publication of these popular benchmarks, multiple works have come up with promising methods that achieved better but still limited compositional generalization. Dessì and Baroni (2019) showed that CNNs can better generalize to novel compositions than RNNs. Lake (2019) proposed a meta-learning approach using a seq2seq model with a memory mechanism. They randomly shuffled the command-action matching of four primitives and stored the correct matching for this batch in the memory. A later work (Nye et al., 2020) argued for generalizing via the paradigm of program synthesis with a predefined meta-grammar. Data augmentation (Andreas, 2020; Akyürek et al., 2021) is also a natural method for promoting generalization by automatically creating extra data that could resemble the test-set distribution. Most interestingly, previous work (Li et al., 2019; Russin et al., 2020) showed that it is possible to directly encode the inductive bias into the model architecture. They proposed to embed and encode the syntax and semantics of the input separately to achieve compositional generalization over single-word substitution ("walk/run → jump"). However, all of these works that achieve good results on some SCAN splits (e.g., ADDJUMP) still struggle significantly on the MCD and LENGTH splits.
At the time of completing this work, only two previous efforts had achieved close-to-perfect performance on the MCD splits of the SCAN dataset. Liu et al. (2020) designed a memory-augmented neural network that explicitly models the procedure of recognizing a symbolic function and applying this function via two separate models. These two models make discrete predictions and are jointly trained with Hierarchical Reinforcement Learning. Guo et al. (2021) explored semi-supervised learning with pseudo-parallel dev/test data and showed the efficacy of iterative back-translation. Our method differs from these two works in that (1) it induces the compositional rules implicitly from a general seq2seq Transformer architecture; (2) it doesn't require peeking into the novel commands of dev/test data. A contemporary work (Conklin et al., 2021) proposed to construct meta-train and meta-test sets that consist of similar input sequences and used meta-learning to encourage the model to learn generalizable features.

Conclusion
In this work, we propose two auxiliary sequence prediction tasks to induce the compositional generalization ability in a Transformer model. On the challenging LENGTH and MCD splits of the SCAN dataset, our method achieves a perfect 100% accuracy, a large improvement over the ≤ 10% accuracy of the baseline model. We further show that our method works well in low-resource settings, as it reaches 97.8% accuracy with only 418 training examples. Ablation analysis shows that the model achieves better compositionality when using the decoder's less contextualized vectors to compute the next token in the auxiliary sequences.

A Stability across Random Seeds
It has been previously observed that the performance of non-pretrained models on the SCAN (Lake and Baroni, 2018) dataset is not stable across different random seeds. We train our model with auxiliary sequence prediction tasks with 5 random seeds and report the full results in Table 8. Our model achieves stable, close-to-perfect results on the MCD1, MCD3, and LENGTH splits. However, there is a relatively larger standard deviation among the 5 runs on the MCD2 split. Upon analyzing the examples in the MCD2 split and the mistakes the model makes in some random-seed runs, we find that MCD2 poses a unique compositional challenge that is not covered in the other MCD splits: the training set contains no examples of the form "X [once] after Y twice." While the other MCD splits require the model to perform an unseen application of a seen function (e.g., "twice") to a seen argument (e.g., "jump around left"), MCD2 additionally challenges the model with an unseen combination of seen functions (e.g., twice and once in "X once after Y twice"). Both of our auxiliary sequences are designed to guide actions inside a single function (e.g., "jump around left twice"), while the extra challenge of MCD2 calls for generalizing over two functions, which makes performance somewhat less stable across random seeds. However, our method still achieves a best score of 100% and an average of over 90% accuracy on MCD2, outperforming the baseline significantly.

B.1 Auxiliary Sequence for SCAN
For every command-action pair in the SCAN dataset (Lake and Baroni, 2018), we automatically create two auxiliary sequences of the same length as the action sequence. These sequences represent the lower-level symbolic structures in the input and can better guide the model toward compositional generalization.
(1) As explained in Sec. 3.2, we create the first auxiliary sequence AuxSeq1 (the 2nd row of every output in Table 1 and Table 9) to track progress through the three repetitions of "jump opposite left" and to ensure that the correct number of repetitions is executed. For the example "jump opposite left thrice → TL TL JP TL TL JP TL TL JP", we create a sequence of ids [2, 2, 2, 1, 1, 1, 0, 0, 0]. This sequence exposes the compositional structure of the action sequence "TL TL JUMP TL TL JUMP TL TL JUMP" as three separate segments of "TL TL JUMP": it ignores the content of every single action inside a "jump opposite left" and focuses on the symbolic functions embodied by "twice" and "thrice".
Additionally, if the command comes after the conjunction word "and" or "after", we increment every element of the sequence by 3 (because each command is repeated at most three times, i.e., "thrice"). Therefore, in the first example of Table 9, the second command "walk right thrice" is paired with an AuxSeq1 of [5, 5, 4, 4, 3, 3].
(2) We create the second auxiliary sequence AuxSeq2 (the 3rd row of every output in Table 1 and Table 9) to supervise the correct completion of every single repeated unit (e.g., "jump around left"). For a shorter example "walk left thrice → TURNL WALK TURNL WALK TURNL WALK", we create a sequence of ids [1, 0, 1, 0, 1, 0]. This sequence isolates the semantics of "walk left" as an action sequence of length 2.
Additionally, if the command comes after the conjunction word "and" or "after", we increment every element of the sequence by 8 because each command has a maximum length of 8 (e.g., "walk around left"). Therefore, in the first example of Table 9, the second command "walk right thrice" is paired with an AuxSeq2 of [9, 8, 9, 8, 9, 8]. We argue that, if the model can correctly predict these two sequences and build a connection between them and the actions, it will learn the compositional structures of the commands and generalize to novel combinations in the test set.
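The two constructions above can be sketched as follows. This is a minimal illustration rather than the actual generation script: the function names and the inputs `unit_len` (the number of actions in one repeated unit, e.g., 3 for "jump opposite left") and `repetitions` (1, 2, or 3) are hypothetical and would in practice be derived from the command itself.

```python
from typing import List

MAX_REPETITIONS = 3  # offset for AuxSeq1: a command repeats at most "thrice"
MAX_UNIT_LENGTH = 8  # offset for AuxSeq2: e.g., "walk around left" has 8 actions


def aux_seq1(unit_len: int, repetitions: int, after_conjunction: bool = False) -> List[int]:
    """Count down the remaining repetitions, one id per action (hypothetical helper)."""
    offset = MAX_REPETITIONS if after_conjunction else 0
    sequence: List[int] = []
    for remaining in range(repetitions - 1, -1, -1):
        # Every action inside the same repetition shares the same id.
        sequence.extend([remaining + offset] * unit_len)
    return sequence


def aux_seq2(unit_len: int, repetitions: int, after_conjunction: bool = False) -> List[int]:
    """Count down the remaining actions inside each repeated unit (hypothetical helper)."""
    offset = MAX_UNIT_LENGTH if after_conjunction else 0
    countdown = [step + offset for step in range(unit_len - 1, -1, -1)]
    # The same countdown restarts at the beginning of every repetition.
    return countdown * repetitions


# "jump opposite left thrice": unit of 3 actions, repeated 3 times.
print(aux_seq1(3, 3))                          # [2, 2, 2, 1, 1, 1, 0, 0, 0]
# "walk left thrice": unit of 2 actions, repeated 3 times.
print(aux_seq2(2, 3))                          # [1, 0, 1, 0, 1, 0]
# "walk right thrice" appearing after a conjunction word.
print(aux_seq1(2, 3, after_conjunction=True))  # [5, 5, 4, 4, 3, 3]
print(aux_seq2(2, 3, after_conjunction=True))  # [9, 8, 9, 8, 9, 8]
```

Both sequences have the same length as the action sequence (`unit_len * repetitions`), so they can be predicted step by step alongside the actions.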

B.2 Auxiliary Sequence for gSCAN
Grounded-SCAN (Ruis et al., 2020) (gSCAN) is a multi-modal compositional challenge that grounds language in the states of a grid world. Similar to SCAN, gSCAN also tests a model's ability to execute novel commands made of seen components (e.g., "pull" and "while spinning"). It also challenges the model to learn the meaning of the adverb "cautiously" from just one or a few training examples. The automatic procedure for generating auxiliary sequences for SCAN can be easily transferred to gSCAN with only a small change to AuxSeq2. As shown in the first example in Table 10, we create the first auxiliary sequence (AuxSeq1) to track the progress of all "walk" actions by counting down the remaining "walk" actions to perform. For the second auxiliary sequence (AuxSeq2), instead of simply counting down an action sequence of length k (e.g., [2, 1, 0] for "walk opposite left"), we additionally distinguish between different adverbs in the sequence. Consider the first two examples in Table 10: the AuxSeq2 counts down from 15 to 11 when the model needs to "walk cautiously", but counts down from 8 to 4 when prompted to "walk while spinning". This is not needed for SCAN because every adverb in SCAN has a different action sequence length. For example, AuxSeq2 starts from 8 for "walk around left" and starts from 3 for "walk opposite left" and thus can already teach the model to distinguish between these adverbs.
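The adverb-dependent countdown can be sketched as below. This is only an illustration under stated assumptions: the starting values 15 and 8 are read off the two Table 10 examples quoted above, and treating them as fixed per-adverb starting points (along with the names `ADVERB_START` and `gscan_aux_seq2`) is a hypothetical simplification, not a description of the actual generation procedure.

```python
from typing import Dict, List

# Assumed per-adverb starting values, taken from the two examples quoted above.
# How starting values are assigned to other adverbs is not specified here.
ADVERB_START: Dict[str, int] = {
    "cautiously": 15,
    "while spinning": 8,
}


def gscan_aux_seq2(adverb: str, num_steps: int) -> List[int]:
    """Count down from an adverb-specific start for num_steps steps (hypothetical helper)."""
    start = ADVERB_START[adverb]
    return list(range(start, start - num_steps, -1))


# Two countdowns of equal length that remain distinguishable by adverb.
cautious_ids = gscan_aux_seq2("cautiously", 5)      # [15, 14, 13, 12, 11]
spinning_ids = gscan_aux_seq2("while spinning", 5)  # [8, 7, 6, 5, 4]
```

Because the two id ranges do not overlap in this example, the auxiliary target itself signals which adverb is being executed, even though both countdowns have the same length.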