Systematic Generalization on gSCAN: What is Nearly Solved and What is Next?

We analyze the grounded SCAN (gSCAN) benchmark, which was recently proposed to study systematic generalization for grounded language understanding. First, we study which aspects of the original benchmark can be solved by commonly used methods in multi-modal research. We find that a general-purpose Transformer-based model with cross-modal attention achieves strong performance on a majority of the gSCAN splits, surprisingly outperforming more specialized approaches from prior work. Furthermore, our analysis suggests that many of the remaining errors reveal the same fundamental challenge in systematic generalization of linguistic constructs regardless of visual context. Second, inspired by this finding, we propose challenging new tasks for gSCAN by generating data to incorporate relations between objects in the visual environment. Finally, we find that current models are surprisingly data inefficient given the narrow scope of commands in gSCAN, suggesting another challenge for future work.


Introduction
Systematic generalization refers to the ability to understand new compositions of previously observed concepts and linguistic constructs. While humans exhibit this ability, neural networks often struggle. To study systematic generalization, several synthetic datasets have been proposed. Lake and Baroni (2018) introduced SCAN, a dataset of natural language instructions paired with action sequences, split in various ways to assess systematic generalization. Recently, Ruis et al. (2020) introduced the grounded SCAN (gSCAN) benchmark, which similarly pairs natural language instructions with action sequences, but further requires that instructions be interpreted within the context of a grid-based visual navigation environment. In this work, we analyze which aspects of gSCAN can currently be solved by general-purpose models, and propose new tasks and evaluation metrics for future research on systematic generalization with gSCAN.
First, to understand which aspects of gSCAN can be addressed by a general-purpose approach, we evaluate a Transformer-based model with cross-modal attention. Cross-modal attention has proven effective for other multi-modal tasks (Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2020). It achieves strong performance on a majority of the splits, surprisingly outperforming several "specialist" approaches designed for gSCAN (Heinze-Deml and Bouchacourt, 2020; Gao et al., 2020; Kuo et al., 2020). We analyze the remaining errors and find that many of them reflect the same fundamental challenge in systematic generalization of linguistic constructs studied in datasets such as SCAN, regardless of the visual context.
Our analysis motivates the creation of an additional gSCAN task, which features a greater degree of complexity in how natural language instructions are grounded in the visual context. In this task, the agent needs to reason about spatial relations between objects expressed in language. We find this new task to be challenging for existing models.
Finally, we also assess the data efficiency of our cross-modal attention model on gSCAN. We find that despite the simplicity of the world state and the grammar used to generate instructions, model performance on most splits declines significantly when provided with less than ∼40% of the 360,000 original training examples. This suggests that we should consider sample complexity for future work.

Cross-modal Attention Solves gSCAN, Almost
Experimental Setup gSCAN has two types of generalization tasks: compositional generalization (CG) and length generalization. We focus on the CG splits, which consist of a shared training set and held-out test splits. The input commands are synthetically generated. The agent observes the world state, which is a d × d grid (d = 6 in our case) containing objects with various visual attributes. The output is an action sequence in the grid world (e.g., turn left, walk, walk, stay). We use exact match of the entire output sequence as the evaluation metric and report the mean and standard deviation across 5 runs.

[Table 1: Exact match on the gSCAN CG splits for Seq2Seq (2020), GECA (2020), Kuo (2020), Heinze (2020), Gao (2020), FiLM (2018), and our model.]
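For concreteness, the sequence-level exact match metric can be sketched as follows (function and variable names are ours, not from the gSCAN codebase):

```python
def exact_match(predictions, targets):
    """Fraction of examples whose full action sequence matches the reference.

    predictions, targets: lists of action sequences, e.g.
    [["turn left", "walk", "walk", "stay"], ...]
    A single wrong or missing action makes the whole example incorrect.
    """
    assert len(predictions) == len(targets)
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)
```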

Our Model
We implement a seq2seq model with 6 Transformer layers each for the encoder and the decoder. The architecture is similar to ViLBERT (Lu et al., 2019), a popular multi-modal model for fusing visual and textual information. At a high level, the encoder's text stream reads the commands while its visual stream encodes the world states, with cross-modal attention between the two streams. Details are in Appendix A.
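As a rough illustration of the encoder design (a minimal sketch in the style of ViLBERT, not our exact implementation; dimensions are placeholders, and Appendix A gives the real details), one co-attention layer can be written in PyTorch as:

```python
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One co-attention block: text attends to vision and vice versa."""

    def __init__(self, d_model=128, n_heads=8):  # placeholder sizes
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
        self.vis_ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                     nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, txt, vis):
        # Queries come from one stream; keys/values from the other, so the
        # interaction between command and world state is bi-directional.
        t, _ = self.txt_attn(txt, vis, vis)   # command tokens attend to grid cells
        v, _ = self.vis_attn(vis, txt, txt)   # grid cells attend to command tokens
        txt = self.norms[0](txt + t)
        vis = self.norms[1](vis + v)
        txt = self.norms[2](txt + self.txt_ffn(txt))
        vis = self.norms[3](vis + self.vis_ffn(vis))
        return txt, vis
```

Stacking several such layers gives both streams repeated access to each other, which is the bi-directional interaction we contrast with uni-directional text-to-vision attention below.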
Results Table 1 shows the results of the various models. Among them, FiLM and Relation Networks (RN) have achieved strong performance on other synthetic datasets for similar multi-modal tasks (Perez et al., 2018; Santoro et al., 2017).
We also compare to models with designs specialized for CG or even for gSCAN specifically: a seq2seq model with data augmentation (Andreas, 2020), an auxiliary loss (Heinze-Deml and Bouchacourt, 2020), and task-specific architectures (Kuo et al., 2020; Gao et al., 2020). The cross-modal attention model outperforms the others on 5 out of 8 splits. We hypothesize it is more effective because cross-modal attention allows bi-directional interaction between the language instruction and the visual environment, in contrast to prior work that only has uni-directional attention from text to visual context (Ruis et al., 2020; Kuo et al., 2020; Gao et al., 2020; Heinze-Deml and Bouchacourt, 2020). The additional attention from visual context to text improves the grounding of natural language instructions in the visual environment. However, on the "hard" splits (D, G, and H), all methods struggle.
Analysis First, we analyze the "hard" splits in detail. Certain aspects of interpreting instructions are highly dependent on visual grounding; most prominently, every instruction requires resolving the location of a referred object. However, adverbs such as cautiously or while spinning have the same meaning regardless of the visual context. Figure 1 shows one such example from the H split: the agent successfully locates the target object but fails to combine the seen verb pull and adverb while spinning to generate the correct action sequence. Therefore, to assess the degree to which errors are caused by incorrect visual grounding, we calculate the percentage of predicted sequences that end at the correct target position, shown in Table 3.

[Table 3: Percentage of exact matches of the target position on the splits where the cross-modal attention model fails. For the novel direction split (D), the agent ends up at the correct row or column around 80% of the time. For the adverb splits (G, H), the agent can usually find the object, but fails to generate the action sequence in the correct manner.]

Despite the low exact match of entire sequences, the agent finds the correct target more than 90% of the time on the two adverb splits (G, H). For the novel direction split (D), the direction "south west" is not seen during training. Similarly to Ruis et al. (2020), we find that the correct row or column is often selected, but not both. By analyzing attention weights, Ruis et al. (2020) attribute this to a failure to generate novel combinations of actions, not necessarily to identify the correct target object location, similar to our findings from visualizing attention weights in Figure 2. Therefore, we hypothesize that for these three splits (D, G, and H), the primary remaining challenge for the cross-modal attention model is not necessarily related to visual grounding.
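The target-position metric can be computed by replaying a predicted action sequence in the grid and comparing final cells. A minimal sketch (our own simplification: push/pull effects on position are elided here, though a full gSCAN simulator would account for them):

```python
# Headings as (dx, dy) grid offsets; the cyclic order matters for turning.
HEADINGS = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # north, east, south, west

def final_position(actions, start, heading):
    """Replay an action sequence and return the agent's final cell.

    `heading` indexes HEADINGS; the initial value should match the
    environment's starting orientation (a detail we gloss over here).
    """
    x, y = start
    for a in actions:
        if a == "turn left":
            heading = (heading - 1) % 4
        elif a == "turn right":
            heading = (heading + 1) % 4
        elif a == "walk":
            dx, dy = HEADINGS[heading]
            x, y = x + dx, y + dy
        # "stay" leaves the position unchanged; push/pull are ignored in
        # this sketch but do affect position in the actual environment.
    return x, y

def target_position_match(pred, gold, start, heading):
    return (final_position(pred, start, heading)
            == final_position(gold, start, heading))
```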
Next, we analyze the splits (A, B, C, E, and F) where the cross-modal attention model performs well. Table 4 contrasts several variants of our model. Removing the decoder's textual attention (−T) causes virtually no significant change. In contrast, removing visual attention (−V) significantly worsens performance on the E split, and removing cross-modal attention (−X) significantly degrades all splits, highlighting the benefit of cross-modal attention.

[Table 4: Ablation studies of our model on gSCAN (−V, −T, and −X: removing the decoder's visual attention, the decoder's textual attention, and the encoder's cross-modal attention, respectively).]

[Figure 3: Training vs. testing examples for the spatial relation task. Training: Command: walk to a big circle south west of a green small cylinder; Target: walk walk L_turn walk walk. Testing: Command: walk to a circle west of a blue small square; Target: L_turn walk L_turn walk walk.]

Grounded Spatial Relation CG
Proposed Task Given our analysis suggesting that many of the remaining challenges for gSCAN may not necessarily be related to visual grounding, we propose an additional task that features a greater degree of complexity in how natural language instructions are grounded in the visual environment.
We hope this will complement the original gSCAN tasks as a useful assessment of systematic generalization in grounded language understanding. The new data we create contains language expressions that refer to target objects via their relations to a second, reference object. We use two types of relations: next to and relative positions such as north and west. As an example, we can have expressions such as a blue square next to a red circle or a blue square north of a red circle.
To ensure that interpreting the spatial relation is necessary to correctly identify the target object, we create visual distractors in the environment. In the above examples, a blue square that is north of a magenta circle would be a distractor to the intended blue square which is next to the red circle. The existence of visual distractors forces the agent to examine the correspondence between the visual information and the (compositional) language expression to disambiguate and locate the correct target object.
Similar to the gSCAN setup, we use a shared training set and hold out specific examples to evaluate generalization in visually grounded spatial relation reasoning, as in Figure 3. In addition to a random split (I: Random), we create new test splits featuring novel object properties (II: Visual), novel target and reference combinations (III: Relation), novel referents (IV: Referent), and novel relative positions (V, VI: Relative position).

Results We evaluate the cross-modal attention model on this new task and show results in Table 6. The model outperforms the baseline methods by a large margin on 4 out of 6 splits, but performs surprisingly poorly on splits V and VI. The model performs unexpectedly well on split III. We conduct ablation studies and report the results in Table 7. Split III is also robust to the various ways of removing attention, similar to what we observed for the F split in Table 4. In contrast, splits II and IV are strongly affected, while V and VI surprisingly improve when cross-modal attention is removed. However, no single model excels on all splits at the same time.
We hypothesize that the cross-modal attention model overfits to certain aspects of the training distribution, leading to worse out-of-distribution performance on the V and VI splits. To verify in-domain generalization, we compute the exact match of our models on examples of seen relations from the random test split. In particular, these in-domain settings evaluate situations where targets are "north west/north east" (in-domain V) and "south/west" (in-domain VI) of their references.

[Table 6: Exact match on the spatial relation splits for Seq2Seq (2020), FiLM (2018), and our models.]

Sample Complexity
One motivation for studying systematic generalization is to develop techniques that reduce the sample complexity of learning novel behaviors. While we observe that many models perform well on at least some splits, we ask a natural question that has not been studied before: even on those splits, are our models fulfilling the promise of systematic generalization? Figure 4 shows that, for both the original gSCAN and the newly proposed task, performance of the cross-modal attention model starts to drop significantly when it is trained on less than around 40% of the training data for most splits. The model without cross-modal attention is even more data inefficient: performance starts to drop significantly when trained on less than 70% of the training data for the original compositional splits, or when reducing the training data by any amount for the spatial relation splits. This suggests that exploring model architectural priors can help improve data efficiency and provide further benefits for generalization.

[Figure 4: Data efficiency of the models on the compositional splits (top) and the spatial relation splits (bottom). The x-axis is the percentage of training data (%) and the y-axis is the exact match (%).]
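The data-efficiency sweep can be sketched as follows (a simplified illustration of the protocol; fraction values, names, and the seeding scheme are ours):

```python
import random

def subsample(train_examples, fraction, seed=0):
    """Randomly keep `fraction` of the training examples."""
    rng = random.Random(seed)
    k = int(len(train_examples) * fraction)
    return rng.sample(train_examples, k)

train_data = list(range(360_000))  # placeholder for the gSCAN training examples
for frac in (0.1, 0.2, 0.4, 0.7, 1.0):
    subset = subsample(train_data, frac)
    print(frac, len(subset))  # train a model on `subset`, then record exact match
```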

To investigate how the number of primitives influences data efficiency, we re-generate smaller compositional and spatial relation splits by reducing the number of primitives. For the compositional splits, we exclude one noun (cylinder), one color adjective (blue), and one adverb (while zigzagging). For the spatial relation splits, we exclude one noun (cylinder) and one location preposition (next to). This reduces the number of training examples by around 2/3, leaving around 110,000 and 74,000 training examples for the smaller compositional and spatial relation splits, respectively. We then perform similar experiments with our cross-modal attention model; the results are shown in Figure 5. When the number of primitives decreases, the percentage of training data required to attain satisfactory performance increases, with more than 80% of the training data needed for both kinds of splits. One possible explanation is that the total number of training examples shrinks as the number of primitives decreases; the model therefore sees fewer combinations during training and needs a higher percentage of the data to properly learn visual grounding and compositionality. This provides further evidence that current models need extensive training data to achieve systematic generalization, demonstrating the necessity of evaluating sample complexity in future work.

Related Work
Many datasets and tasks have been proposed for examining systematic generalization: visual question answering (Johnson et al., 2017; Bahdanau et al., 2019a; Pezzelle and Fernández, 2019; Bahdanau et al., 2019b) and visually grounded navigation instruction following (Hermann et al., 2017; Yu et al., 2018; Chevalier-Boisvert et al., 2019; Chaplot et al., 2018). Many approaches have been proposed as well, though a large percentage of them use task-specific designs and only work well for the specific tasks or datasets, for example, learning neural program executions (Andreas et al., 2016; Hu et al., 2017; Johnson et al., 2017; Santoro et al., 2017; Hudson and Manning, 2018; Mao et al., 2019), in addition to the ones we have described previously.
This work focuses on using a generic cross-modal attention model to probe the gSCAN dataset, in the hope of understanding in what respects the task/dataset is challenging. Ding et al. (2020) also presented new evidence that a neural-based model can solve similar CG tasks. Our interest is to use such models to inspire new task designs that continue to challenge neural models.

Conclusion
In this work, we have demonstrated that a general-purpose cross-modal attention model can achieve strong performance on a majority of the gSCAN splits and outperform more specialized prior work. We have proposed a challenging additional task for gSCAN that requires agents to reason over spatial relations between objects in the visual scene, and have highlighted data efficiency as a consideration for future work.

[Figure 6: The architecture of our model. The encoder consists of 6 Transformer layers with cross-modal attention. The decoder contains 6 Transformer layers, each with one self-attention block and one multi-head attention block over the encoder's contextual representations.]

A.3 Baseline Methods
We implement FiLM (Perez et al., 2018) and RN (Santoro et al., 2017) as our baselines. Both baselines are built upon the seq2seq model. FiLM learns functions that predict a set of β, γ parameters conditioned on the inputs, which modulate the network's activations; we add a linear layer to predict β, γ for each convolution layer. For RN, we use 3 relation layers with a hidden size of 256 followed by 2 fully-connected layers to predict object relations. The hidden states are then used to initialize the LSTM decoder. We refer the reader to the original FiLM and RN papers for model details. Other hyper-parameters remain the same as in the original seq2seq baseline.
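For reference, a minimal sketch of one FiLM-modulated convolution block, following Perez et al. (2018) (channel counts and names are placeholders, not our exact configuration):

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """One convolution block modulated by FiLM: out = gamma * conv(x) + beta."""

    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # A linear layer predicts per-channel [gamma; beta] from the
        # command encoding, as described above.
        self.film = nn.Linear(cond_dim, 2 * out_ch)

    def forward(self, x, cond):
        # x: (B, in_ch, H, W) grid features; cond: (B, cond_dim) text encoding
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = self.conv(x)
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return torch.relu(h)
```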

B Data Generation
We build upon the original gSCAN codebase to generate the spatial relation data splits.

B.1 Input Commands Generation
The input commands are generated based on the context-free grammar (CFG) in Table 8. We add new grammar rules and lexicon entries to support object relations. Since we focus on evaluating relational reasoning, we exclude templates without any object relation.
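The exact grammar is given in Table 8; as a rough illustration only, the relational commands have the following shape (a toy fragment with a hypothetical lexicon, not the actual CFG):

```python
import random

# Toy lexicon illustrating the command shape; the real rules differ.
VERBS = ["walk to", "push", "pull"]
SIZES = ["small", "big"]
COLORS = ["red", "green", "blue", "yellow"]
SHAPES = ["circle", "square", "cylinder"]
RELATIONS = ["next to", "north of", "south of", "east of", "west of",
             "north west of", "north east of",
             "south west of", "south east of"]

def noun_phrase():
    # Adjectives are optional, mirroring commands like "a circle"
    # vs. "a small yellow square".
    words = ["a"]
    if random.random() < 0.5:
        words.append(random.choice(SIZES))
    if random.random() < 0.5:
        words.append(random.choice(COLORS))
    words.append(random.choice(SHAPES))
    return " ".join(words)

def relational_command():
    # Every generated command contains exactly one object relation,
    # since templates without relations are excluded.
    return " ".join([random.choice(VERBS), noun_phrase(),
                     random.choice(RELATIONS), noun_phrase()])

print(relational_command())  # e.g. "push a small square north of a red circle"
```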

B.2 World State Generation
We generate the world state under the following constraints: 1) for each target referent T, there is one unique reference object R next to it; 2) there are up to n visual distractors V_1, V_2, ..., V_n, which have the same size, shape, and color as the target object T; 3) each visual distractor V_i may or may not have its own reference object O_i.
Additionally, to avoid ambiguity, when generating a visual distractor V_i we ensure: 1) if the input command contains the abstract relative position (i.e., next to), V_i cannot be placed near R, and O_i must be distinct from R; 2) if the input command contains a specific relative position (e.g., north, west, etc.), V_i can be placed anywhere and O_i can be the same as R; however, O_i cannot have the same relative position to V_i as R has to T. Other procedures remain the same as in the original setup.
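These anti-ambiguity constraints can be expressed as simple predicates. A sketch under our own conventions (positions are (col, row) with rows growing southward; function names are hypothetical):

```python
def adjacent(a, b):
    """Two cells are 'next to' each other if they differ by one grid step."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

def relative_position(src, dst):
    """Coarse relation of dst w.r.t. src, e.g. 'north' or 'south west'."""
    ns = "north" if dst[1] < src[1] else "south" if dst[1] > src[1] else ""
    ew = "west" if dst[0] < src[0] else "east" if dst[0] > src[0] else ""
    return " ".join(w for w in (ns, ew) if w)

def distractor_ok(relation, T, R, V_i, O_i):
    """Check the constraints for one distractor V_i with optional
    reference O_i (None if absent)."""
    if relation == "next to":
        # Constraint 1: V_i not near R, and O_i distinct from R.
        return not adjacent(V_i, R) and (O_i is None or O_i != R)
    # Constraint 2: O_i must not stand in the same relative position
    # to V_i as R does to T.
    if O_i is not None and relative_position(V_i, O_i) == relative_position(T, R):
        return False
    return True
```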

B.3 Data Examples
We show data examples for each spatial relation split in Figure 7. The systematic difference between the training and test split is highlighted.
I: Random
Command: walk to a big cylinder east of a big square
Target: walk walk walk walk walk L_turn walk

II: Visual
Command: walk to a red square west of a green small cylinder
Target: L_turn L_turn walk walk R_turn walk walk walk walk

III: Relation
Command: walk to a green small square next to a blue circle
Target: walk walk walk walk walk L_turn walk

IV: Referent
Command: push a small yellow square west of a green big cylinder
Target: L_turn L_turn walk R_turn walk push push

Command: pull a circle north of a yellow big cylinder
Target: