Syntax-Guided Transformers: Elevating Compositional Generalization and Grounding in Multimodal Environments

Compositional generalization, the ability of intelligent models to extrapolate understanding of components to novel compositions, is a fundamental yet challenging facet of AI research, especially within multimodal environments. In this work, we address this challenge by exploiting the syntactic structure of language to boost compositional generalization. This paper elevates the importance of syntactic grounding, particularly through attention masking techniques derived from parsing the text input. We introduce and evaluate the merits of using syntactic information in the multimodal grounding problem. Our results on grounded compositional generalization underscore the positive impact of dependency parsing across diverse tasks when combined with weight sharing across the Transformer encoder. The results push the state of the art in multimodal grounding and parameter-efficient modeling and provide insights for future research.


Introduction
Compositional Generalization refers to the ability of an intelligent agent to generalize its understanding of the underlying structure of a problem, especially when it is faced with novel compositions of previously seen building blocks or components (Chomsky, 1957; Montague, 1970). It is fundamental for models to be able to extrapolate from their training environment to novel situations, a common occurrence in real-world applications. Hupkes et al. (2020) categorize compositional generalization capabilities into five categories: systematicity, productivity, substitutivity, localism, and overgeneralization. These abilities are crucial for models to achieve strong performance on tasks that require reasoning and understanding of hierarchical structures, such as natural language understanding, object classification, and robotics.
Humans understand new compositions of previously observed concepts and simpler constructs.
For example, consider the input command "pull the small blue object that is inside of the small green box and in the same row as the red circle while zigzagging" and its action sequence "turn left, turn left, walk, turn right, walk, turn right, walk, pull". Here, an agent is provided with a command; its objective is to generate and execute a series of predefined actions to fulfill the task within the given environment.
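As a rough illustration of the action semantics, the following sketch replays such a sequence in a grid world. This is a simplification: the actual benchmark environments also track objects, pushes/pulls, and adverb effects such as zigzagging, and the start pose here is chosen arbitrarily.

```python
# Directions indexed clockwise: 0 = north, 1 = east, 2 = south, 3 = west.
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def execute(actions, row, col, facing):
    """Replay a predicted action sequence and return the final agent pose.

    'push'/'pull' would act on the object in the current cell; object
    interactions are omitted in this sketch.
    """
    for action in actions:
        if action == "turn left":
            facing = (facing - 1) % 4   # counterclockwise
        elif action == "turn right":
            facing = (facing + 1) % 4   # clockwise
        elif action == "walk":
            dr, dc = MOVES[facing]
            row, col = row + dr, col + dc
    return row, col, facing

# Replaying the example sequence from an arbitrary start pose:
pose = execute(["turn left", "turn left", "walk", "turn right",
                "walk", "turn right", "walk", "pull"],
               row=3, col=0, facing=2)
```

Exact-match evaluation then amounts to comparing the predicted action sequence token by token against the gold sequence.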
On the other hand, despite remarkable progress in the field of Artificial Intelligence, even state-of-the-art language models demonstrate limitations in this aspect (Lake and Baroni, 2018; Thomas McCoy et al., 2020; Shaw et al., 2021). In particular, they often fail to generalize effectively in reasoning depth, which involves handling multi-turn reasoning about entities and their properties in the world, or even the co-occurrence of unseen spatial relations (Wu et al., 2021). These limitations indicate a crucial need for innovative approaches to address these issues.
In this research, our objective is to exploit the syntactic structure of language to enhance compositional generalization. Our focus is mainly on the multimodal problem setting that entangles vision and language. In this setting, compositional linguistic descriptions must be accurately grounded in the environment to devise coherent action plans or achieve specific goals. An illustrative example of this scenario is shown in Figure 1.
The motivation behind leveraging syntax in our approach stems from the inherent structure and compositionality of natural language. Syntactic parsing provides crucial structural information about how words in a sentence relate to each other. We hypothesize that syntactic structure can improve intelligent agents' ability to discern the applicable attributes and descriptions for each object in their environment and better apprehend deeper levels of reasoning.
By imposing an understanding of language structure through syntactic parsing, we aim to extend the capabilities of current multimodal language models. This could potentially pave the way for more sophisticated models capable of robustly interacting with dynamic and complex vision-and-language environments. Apart from using structure, we equip our end-to-end model with weight sharing, which has been shown to improve generalization in single-modality tasks.
As a result, we reach state-of-the-art performance on the ReaSCAN compositional generalization benchmark, showing improvement across all test splits, especially ones requiring sentence structure comprehension. In summary, our contributions include:

• Enhancing grounded compositional generalization by integrating syntactic parsing into our model.

• Using syntax-guided attention masking along with weight sharing to build a highly parameter-efficient model compared to baselines.

• Demonstrating marked improvement in performance across a variety of tasks designed for compositional generalization evaluation while enhancing computational efficiency.
Furthermore, recent research highlights the significant role of syntactic information in enhancing neural models' compositional generalization capability. Kuo et al. (2021) suggested aligning the compositional structure of networks with the problem domain, resulting in a dynamic compositional neural network. Moreover, Shaw et al. (2021) and Qiu et al. (2022) recommended grammar-induction-based data augmentation techniques to improve compositional generalization. Unlike our work, which focuses on input command structure, Kim et al. (2021b) introduced the concept of using parse tree node annotations in the target sequence of sequence-to-sequence tasks for enhancing compositional generalization. Meanwhile, Kim et al. (2021a) incorporated parse tree nodes into the ETC (Ainslie et al., 2020) model. They employed attention masking specific to ETC to symbolize the relations of tokens and aid this model in a simplified classification task based on the CFQ dataset.
We are inspired by previous research (Kim et al., 2021a) that employs a similar technique with manually extracted parses for compositional generalization on the single text modality. However, our model utilizes off-the-shelf parsers instead of accurate manually generated parse trees, and it is generally applicable independently of the underlying models.

Problem Setting
Various studies on compositional generalization have presented a range of tasks and problem settings (Lake and Baroni, 2018; Keysers et al., 2020; Kim and Linzen, 2020; Wu et al., 2021; Ruis et al., 2020). These datasets comprise a training set and several test sets. To ensure rigorous evaluation, the test sets have been deliberately structured to differ from the training set in a way that requires compositional capability to succeed. Our paper focuses on grounding natural language instructions in the visual modality, where we map words to specific objects or actions in a multimodal environment, providing a framework to evaluate an intelligent agent's compositional structures and spatial reasoning capabilities.
We use the most recent multimodal compositional generalization benchmarks to assess our models comprehensively. In these benchmarks, an agent receives a natural language instruction to carry out an action or navigate specific environments. These datasets are inherently synthetic, and they have been carefully crafted to guarantee that the test sets are systematically different from the training sets. By placing commands within a spatial context, these benchmarks bridge the gap between abstract cognitive understanding and practical action execution. Consequently, they stand as both a scholarly tool for studying compositional generalization and a valuable resource for fields like robotics that require comprehension of spatially anchored commands.
Among these benchmarks, our primary focus is ReaSCAN, owing to its heightened complexity and recent introduction to the academic community. An example of this dataset, depicted in Figure 1, consists of three main components: the initial state of the world, the provided input command, and the corresponding target command. Tasked with this information, the agent aims to infer the target command by leveraging both the information from the input command and the initial state. Structurally, the world's representation in ReaSCAN is formulated as a 6×6×17 matrix. Each matrix cell comprises a 17-dimensional vector encapsulating information pertaining to an object's attributes, namely color, shape, and size, along with indicators of the agent's positioning and orientation. The evaluation metric for this dataset is the percentage of exact matches of the predicted action sequence.

Table 1: ReaSCAN test splits and their held-out examples.
Random: Same component and compound distribution as training.
A1: yellow square referred to with color & shape.
A2: red square referred to in the command.
A3: small cylinder referred to with size and shape.
B1: co-occurrence of small red circle and big blue square.
B2: co-occurrence of "same size as" and "inside of" relations.
C1: Additional conjunction clause depth added to 2-relative-clause commands.

The ReaSCAN dataset includes one random test split that mirrors the training's component and compound distribution, in addition to seven compositional generalization test splits. Each of these splits is designed to probe a specific facet of a model's grounding generalization capability, as detailed in Table 1. Category A test splits delve into novel attribute compositions at both the command and object levels, drawing inspiration from gSCAN. Category B test splits assess the model's ability to generalize to unprecedented co-occurrences of concepts and spatial relations. Meanwhile, Category C probes the model's capacity to extrapolate from simple command structures to more intricate structures with higher levels of reasoning (Wu et al., 2021). To illustrate, for the A1 split, all examples with commands containing variations of "yellow square" (such as "small yellow square" or "big yellow square") are excluded from the training data. This prevents models from associating targets with that phrase. However, the training set does include examples like "yellow cylinder" and "blue square." As a result, during testing, models are expected to accurately interpret "yellow square" even without prior exposure to the actual composition.
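The 6×6×17 world representation can be sketched as follows. The exact ordering of the 17 feature dimensions here is our assumption for illustration (4 colors, 4 shapes, 4 one-hot sizes, agent presence, and 4 orientations), not necessarily the layout used by the benchmark.

```python
import torch

GRID, FEAT = 6, 17
COLORS = ["red", "blue", "green", "yellow"]        # dims 0-3 (assumed layout)
SHAPES = ["circle", "square", "cylinder", "box"]   # dims 4-7
# dims 8-11: one-hot object size (1-4); dim 12: agent present;
# dims 13-16: one-hot agent orientation

def encode_cell(color=None, shape=None, size=None, agent_dir=None):
    """Build the 17-dimensional feature vector for a single grid cell."""
    v = torch.zeros(FEAT)
    if color is not None:
        v[COLORS.index(color)] = 1.0
    if shape is not None:
        v[4 + SHAPES.index(shape)] = 1.0
    if size is not None:
        v[8 + size - 1] = 1.0
    if agent_dir is not None:
        v[12] = 1.0              # agent occupies this cell
        v[13 + agent_dir] = 1.0  # one of four orientations
    return v

world = torch.zeros(GRID, GRID, FEAT)
world[2, 3] = encode_cell(color="red", shape="circle", size=2)  # a small red circle
world[5, 0] = encode_cell(agent_dir=1)                          # the agent, facing east
```

Each of the 36 cells then serves as one visual token after projection through a linear layer.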

Proposed Method
To address the challenge at hand, we implemented a multimodal transformer, as illustrated in Figure 2. In this model, input commands are tokenized and then supplemented with positional encoding before being passed to the transformer. Concurrently, the visual environment is segmented into 36 distinct cells, each serving as a visual token. After passing the visual tokens through a linear layer, these tokens receive positional encoding and are passed to the transformer.
We employed a generic parser to embed the structure of the textual modality into our model, thereby shaping attention masks for the encoder's textual self-attention. Prioritizing efficiency, parsing of each input command is conducted during a preprocessing phase.
Our transformer is based on the GroCoT model (Sikarwar et al., 2022). Each encoder layer employs a cross-attention mechanism between modalities, followed by modality-specific self-attention. Our computed input command masks are utilized in the self-attention modules of the textual modality. Notably, the encoder layer weights are shared across all layers.
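A simplified sketch of such an encoder layer is shown below: cross-attention between the two modalities, followed by modality-specific self-attention with the syntax-derived mask applied on the text branch. The dimensions and the omission of residual connections, layer norms, and feed-forward sublayers are simplifications, not the exact GroCoT implementation.

```python
import torch
import torch.nn as nn

class MultimodalEncoderLayer(nn.Module):
    """One encoder layer: cross-attention between the two modalities,
    then modality-specific self-attention. Residuals, layer norms, and
    feed-forward sublayers are omitted for brevity."""

    def __init__(self, d_model=128, nhead=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.txt_self = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.img_self = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, txt, img, txt_mask=None):
        # Cross-attention: each modality queries the other.
        t, _ = self.txt2img(txt, img, img)
        v, _ = self.img2txt(img, txt, txt)
        # Modality-specific self-attention; the text branch receives the
        # syntax-guided mask (boolean, True = attention disallowed).
        t, _ = self.txt_self(t, t, t, attn_mask=txt_mask)
        v, _ = self.img_self(v, v, v)
        return t, v

layer = MultimodalEncoderLayer()
txt = torch.randn(2, 5, 128)             # 5 command tokens
img = torch.randn(2, 36, 128)            # 36 grid cells as visual tokens
mask = ~torch.eye(5, dtype=torch.bool)   # toy mask: each token attends only to itself
t_out, v_out = layer(txt, img, txt_mask=mask)
```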
Finally, we concatenate the encoded results of each modality and pass them to the transformer's auto-regressive decoder to generate the action sequence corresponding to the input command, given the environment.

Syntax-guided attention
One main component of our proposed model is exploiting the syntactic structure of the command. For this aim, we investigate using both dependency and constituency parsing. Dependency and constituency trees can both be used to analyze the grammatical structure of sentences. Dependency trees focus on the grammatical relationships between individual words, where each word except the root depends on another, and the edges of the tree signify these dependencies. In contrast, constituency trees emphasize the hierarchical organization of words into larger syntactic units or constituents, with internal nodes representing these groupings and leaves representing individual words. While dependency trees are more concerned with identifying grammatical roles and relationships between words, constituency trees aim to show how words group together into larger syntactic units, often carrying syntactic labels like NP (noun phrase) or VP (verb phrase) (Foscarin et al., 2023; Hearne et al., 2008). Examples of these parse trees are shown in Figure 3.
Syntax-guided attention masking. We use the syntactic information to guide the self-attention module of the transformer encoder layers, as depicted in Figures 2 and 4b. We force each token to attend only to the tokens connected to it in the syntax tree. In this way, we avoid faulty attention patterns and overfitting to irrelevant parts of the sentence. In addition, by imposing the structure with a parse tree, our model can capture the nesting structure of the command's meaning and the relationships between its components. By making the structural information explicit, our model can potentially extrapolate the meaning of novel combinations and nested linguistic structures encompassing higher reasoning depth.
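A minimal sketch of constructing such a mask from a dependency parse follows. The head indices mirror the format produced by off-the-shelf parsers such as stanza (`word.head`, 1-indexed, with 0 for the root); whether tokens may also attend to more distant ancestors is a design choice we leave out here.

```python
import torch

def dependency_attention_mask(heads):
    """Boolean matrix where entry (i, j) is True if token i may attend to j.

    `heads[i]` is the 1-indexed head of token i + 1, with 0 for the root,
    matching the `word.head` field of parsers such as stanza. Each token
    attends to itself, its head, and its direct dependents.
    """
    n = len(heads)
    allowed = torch.eye(n, dtype=torch.bool)   # self-attention always allowed
    for i, h in enumerate(heads):
        if h > 0:                              # the root has no head
            allowed[i, h - 1] = True           # dependent -> head
            allowed[h - 1, i] = True           # head -> dependent
    return allowed

# "pull the small blue object": a plausible parse attaches "the", "small",
# and "blue" to "object", and "object" to the root verb "pull".
allowed = dependency_attention_mask([0, 5, 5, 5, 1])
```

Note that PyTorch attention modules flag *disallowed* positions, so `~allowed` would be passed as `attn_mask` to `nn.MultiheadAttention`.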

Weight Sharing
Parameter sharing is a strategic approach where identical learned parameters are applied across various positions or layers within a model. This technique enables the reuse of the same encoder unit at each phase of the transformer encoder (Dehghani et al., 2019). Such an approach not only streamlines the model but also nurtures the acquisition of more robust and adaptable representations of the input (Ontanon et al., 2022). The findings of Kim et al. (2021a) demonstrate that a transformer employing attention masking requires extended training epochs for convergence, potentially due to masking-induced backpropagation constraints. In light of this, we hypothesize that introducing weight sharing might counterbalance this challenge. Weight sharing reduces the model's complexity by decreasing the number of parameters, which could lead to faster convergence. This method acts as a form of regularization, stabilizing training and facilitating smoother optimization landscapes. In addition, Ontanon et al. (2022) show that a transformer with shared weights across its encoder layers is arguably endowed with a more suitable inductive bias that allows the model to learn primitive concepts. We hypothesize this will positively affect learning spatial relations or object-property relations, which are frequently used in our model's input. Motivated by these advantages, we incorporated this weight sharing technique into our transformer model to evaluate its efficacy in a multimodal setting. Beyond the enhanced generalizability, weight sharing serves as a computational benefit by reducing the number of learnable parameters during the training phase.
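In PyTorch, sharing one encoder layer's weights across all depths can be sketched as below (a generic Universal-Transformer-style sketch, not our exact architecture); the unshared reference encoder uses roughly `num_layers` times as many layer parameters.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Transformer encoder that applies one layer's weights at every depth."""

    def __init__(self, d_model=128, nhead=8, num_layers=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x, mask=None):
        for _ in range(self.num_layers):   # same weights, applied repeatedly
            x = self.layer(x, src_mask=mask)
        return x

shared = SharedEncoder()
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(128, 8, batch_first=True), num_layers=6)
n_shared = sum(p.numel() for p in shared.parameters())
n_unshared = sum(p.numel() for p in unshared.parameters())
```

With six layers, the shared encoder holds one-sixth of the unshared encoder's parameters while applying the same depth of computation.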

Experiments
Implementation Details. Our model architecture is founded on the GroCoT framework, as detailed by Sikarwar et al. (2022), and is implemented using the PyTorch machine learning library (Paszke et al., 2019). We employed the pre-trained stanza toolkit (Qi et al., 2020) for constituency and dependency parsing. We used 48 GB A6000 GPUs accompanied by 756 GB of RAM. On average, each experiment took about 52 hours to train the models from scratch, with the Adam optimizer (Kingma and Ba, 2017) used for parameter updates throughout the training regimen. To ensure a rigorous evaluation, we used the same specialized compositional validation set as Sikarwar et al. (2022), drawing 500 samples from each compositional division of the primary dataset. Model proficiency was assessed against this validation set, with the highest-performing model designated as our optimal choice. Our results are presented as an average derived from three independent runs, each initialized with a random seed. We ran the models for the ReaSCAN benchmark for 120 epochs, and the models for the gSCAN and GSRR benchmarks for 100 epochs. Hyperparameters used for the experiments of each dataset are shown in Appendix A. The code and models proposed in this work are all available on GitHub.
Datasets. We used the gSCAN (Ruis et al., 2020), GSRR (Qiu et al., 2021b), and ReaSCAN (Wu et al., 2021) benchmarks for evaluation. The Grounded SCAN (gSCAN) dataset is a benchmark tailored for examining compositional generalization in machine learning models by translating natural language commands into actions in a grid-world scenario. Its unique splits ensure models move beyond rote memorization to a deep compositional understanding of concepts. The Grounded Systematic Relation Reasoning (GSRR) dataset extends gSCAN by aligning natural language instructions intricately with visual elements, emphasizing spatial relationships and object references. ReaSCAN, a further development, brings the challenges of real-world reasoning into this environment by introducing more challenging tasks and concept relations. Together, these datasets offer a high-complexity framework for assessing the compositional and relational understanding of machine learning models in visual environments. A detailed explanation of both the gSCAN and Grounded Systematic Relation Reasoning datasets can be found in Appendix B.
Baselines. We conducted a series of experiments designed to evaluate our model's effectiveness compared to the most recent state-of-the-art models on the mentioned multimodal compositional generalization datasets. We include the following baselines.

Results. We comprehensively evaluated our approach across all the previously mentioned benchmarks, compared to the baselines. Alongside the accuracy and efficacy metrics, we also provide insights into the computational overhead associated with our method. Furthermore, a qualitative analysis is presented, delving deeper into our approach's performance nuances and strengths.
The benchmark results, presented in Tables 2, 3, and 4, demonstrate our model's superior performance over all reported models, with a notable 3% improvement on the average of ReaSCAN benchmark splits. This substantiates our hypothesis that incorporating syntactic parsing significantly boosts the model's generalization derived from grounded compositional training data. Moreover, dependency parsing consistently outperformed constituency parsing, or performed very similarly, across multiple benchmarks, including GSRR and gSCAN. Our model displayed improvements across nearly all ReaSCAN splits except for C2. As per Sikarwar et al. (2022), the C2 split is "unfair," lacking the information in the training data required for comprehensive model training. Even including syntactic information could not improve the model's performance on this split, and it even caused a decrease in performance. Our methodology also showcased its merit in the object property test cases (A1-3), effectively constraining attention to words pertinent to target object descriptors. For instance, as shown in Figure 4, the attention weights from the properties to the corresponding objects are high.
Notably, our model exhibited considerable strides in the C1 split, indicative of the value added by syntactic information. For a more reliable comparison, we applied a t-test to our C1 test split results. Using a significance level (α) of 0.05, this statistical analysis provided further validation for the observed enhancements in our model's performance, particularly within the context of the C1 test split. Furthermore, our model exhibits enhanced performance on the GSRR dataset, as illustrated in Table 4.

Table 3: Results of our proposed model on the gSCAN dataset test splits. The results are an average of three runs.
We do not report the results on the D and G splits, since we achieved 0.00±0.00% performance, but we take them into account in the averaged result. † denotes the models with masking. Models marked with * refer to the multimodal version of their implementation.
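The significance test above can be reproduced with a standard pooled two-sample t statistic; the accuracy values below are purely hypothetical placeholders for three-seed runs, not our reported numbers.

```python
import statistics as st

def two_sample_t(a, b):
    """Pooled two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    return (st.mean(a) - st.mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

# Hypothetical C1 exact-match accuracies over three seeds (placeholders):
ours = [93.1, 92.4, 93.8]
baseline = [88.0, 87.2, 88.9]
t = two_sample_t(ours, baseline)
# With df = na + nb - 2 = 4, the two-tailed critical value at alpha = 0.05
# is about 2.776; a statistic above that threshold indicates significance.
```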
While our proposed techniques effectively address splits A, B, C, E, and F, mirroring the successes of previous works such as Sikarwar et al. (2022) and Qiu et al. (2021b), they struggle with challenges presented by specific gSCAN compositionality splits, notably D, G, and H. These particular splits are designed to assess the model's capacity for systematic generalization when novel patterns occur in the output sequence rather than in grounding the input instruction (Sikarwar et al., 2022), a facet that our proposed model is not expected to capture.

Ablation
For a granular understanding of the contributions of each alteration to the baseline model, we undertook an ablation study. This involved the sequential removal of each modification to measure its individual impact. As depicted in Table 5, while individual modifications did not significantly change the baseline, their collective integration enhanced the model's generalization. Remarkably, eliminating either dependency parsing or weight sharing resulted in a noticeable performance dip. The improvement upon integration suggests that weight sharing can offset the prolonged convergence caused by masking, by reducing the parameter count and thereby mitigating the convergence issues.

Qualitative Analysis
In our previous discussions, we highlighted the significance of integrating dependency parsing as a fundamental approach to understanding the complex structures inherent in sentences. This integration is not a mere enhancement; it critically enriches the model's grounding capabilities, offering a more robust bridge between raw textual sequences and their semantic structure.
To provide empirical evidence for our technique of guiding attention, we conducted an analysis of the cross-attention module. We aimed to compare its behavior before and after applying attention masking. The results, presented in Figure 5, indicate a clear trend: in 86% of validation samples, the cross-attention module exhibits a pronounced focus on the target object.
Figures 5b and 5c elucidate the impact of self-attention masking on these weights. After applying attention masking (see Figure 5b), the attention distribution becomes notably sparser; instead of individual words attending in isolation to every potentially relevant cell, they now form cohesive compositional expressions, each attending to the corresponding cells as a whole. For instance, in Figure 5c, the tokens of the phrase "and in the same" attend to cells (1,2), (4,3), and (5,2) together, with greater attention on the target object, in contrast to their attention pattern without masking.

Efficiency Analysis
In the realm of modern model design, the challenge lies in amplifying capabilities while managing computational overhead. By sharing weights across encoder layers, we significantly reduce the parameter space. This not only streamlines memory utilization and accelerates training but also acts as an implicit regularizer, bolstering the model's generalization capabilities and reducing overfitting. Further enhancing this is our implementation of attention masking, which refines computational efficiency. By enabling the model to selectively bypass attention to certain tokens, we can optimize the model to avoid redundant computation, ensuring optimal resource allocation and superior performance.
As illustrated in Table 6, our model stands out in terms of efficiency. Despite having fewer parameters (1.9M) than the models by Qiu et al. (2021a) and Sikarwar et al. (2022), which have 3M and 4.6M parameters respectively, our model consistently outperforms them across all benchmarks.

Conclusion
Our research demonstrated that exploiting the syntactic structure of compositional and complex linguistic and spatial expressions improves the grounding ability of an instruction-following agent in multimodal environments. Our results indicated improvements compared to the previous state-of-the-art models. In particular, we showed that our proposed model is effective for generalization on tasks and test splits that require generalization over unobserved reasoning depths, such as the C1 split in the ReaSCAN dataset. By utilizing syntax-guided attention masking along with weight sharing, we achieved not only more accurate but also more parameter-efficient models for grounded compositional generalization.

Limitations
Despite the promising results achieved in our study, several limitations warrant consideration:

Synthetic Data: Our experiments predominantly rely on synthetic datasets. While these datasets provide a controlled environment for assessing model performance, they might not capture the complexities and nuances of real-world data. Evaluating the models on real-world datasets is crucial to ensure their practical applicability.
Error Propagation from the Parser: The model's performance is intrinsically tied to the accuracy of the pre-trained parsers we utilized. Errors or inaccuracies in parsing can lead to suboptimal model outputs. Additionally, our synthetic data, being unambiguous, might not reveal the full extent of potential parser-related issues.
Computational Constraints: Due to computational limitations, the hyperparameter search might not have been exhaustive. A more comprehensive exploration might yield better model configurations.

A Hyperparameters
Here, we present the hyperparameters used in the models for every benchmark in Table 7.

B Datasets Description

B.1 Grounded SCAN dataset
The Grounded SCAN (gSCAN) dataset is a pivotal benchmark for assessing compositional generalization in machine learning models. Evolving from the foundational SCAN (Lake and Baroni, 2018) dataset, gSCAN is designed to evaluate a model's proficiency in translating command sequences into actions within a grid-world environment, with an emphasis on compositional challenges.
This benchmark offers systematic test splits that rigorously examine a model's capability to generalize beyond its training data. These compositional splits include:

• A (Random): Random data with a similar distribution to the training data.
• B (Color-Shape): Novel composition of object properties at test time; yellow squares are referred to by color and shape.
• C (Color Only): Red squares as target.
• D (Novel Direction): Challenges a model's spatial comprehension, with targets set in an unseen direction (southwest of the agent).
• E (Novel Contextual References): Evaluates a model's understanding of relative sizes, with commands pointing to circles of size 2 described as "small."

• F (Novel Composition of Actions and Arguments): Probes a model's grasp of object classes and their nuances, exemplified by squares of size 3 necessitating two pushes.

• H (Adverb-Verb Combination): Generalizes to commands pairing actions and their modifiers, like "while spinning" combined with "pull."

The compositional test splits of the gSCAN dataset ensure that models are not indulging in learning statistical shortcuts but are genuinely mastering compositional reasoning. In gSCAN, every command is mapped to an action sequence for an agent in the grid world, whether moving to a particular spot or interacting with a distinct described object.

B.2 Grounded Systematic Relation Reasoning dataset (GSRR)
The Grounded Systematic Relation Reasoning (GSRR) dataset, introduced by Qiu et al. (2021b), extends the gSCAN benchmark. Their initial analyses of the gSCAN dataset indicated its efficacy; the authors observed that several remaining challenges might not be primarily tied to visual grounding. In light of this, they proposed the GSRR task, characterized by an elevated complexity in aligning natural language instructions with the visual environment.
In this dataset, language expressions specifically delineate target objects and explicitly describe their relationships with a secondary referenced object. Two types of relations are incorporated into the dataset: immediate adjacency ("next to") and cardinal directions such as "north" and "west." In addition, visual distractor objects are placed within the environment to emphasize the critical role of spatial relations in identifying the target objects.
The dataset is systematically divided into various splits to ensure a comprehensive assessment:

• I (Random): Similar distribution as the training.
• II (Visual): Commands centering on "red squares" either as targets or references.
• V (Relative Position 1): Commands where targets are situated to the "north" of their reference points.
• VI (Relative Position 2): Instructions where targets are located "southwest" relative to their references.

C Evaluation Card
Here, we present the evaluation card of our compositional generalization experiments based on the (Hupkes et al., 2023) taxonomy.

Figure 1: This example is taken from the ReaSCAN dataset. Here, an agent is provided with a command. Its objective is to generate/execute a series of predefined actions to fulfill the task within the given environment.

C2: 2-relative-clause command with "that is" instead of "and".

Figure 3: Examples of dependency and constituency parse trees for an input command.

Figure 2: Overall architecture of the proposed model.
Figure 4: Self-attention example from the A2 test set of the ReaSCAN dataset. Figures (a) and (b) depict the averaged self-attention map from our models over all encoder layers and heads. Rows and columns correspond to text tokens. Brighter attention cells indicate higher attention weights.

(a) Ruis et al. (2020) (Multimodal LSTM): a fusion of sequence-to-sequence (seq2seq) architecture with a visual encoder, employing a recurrent command encoder to process the instructions.
(b) Gao et al. (2020) (GCN-LSTM): integrates a Graph Convolutional Network (GCN) with a multimodal LSTM. The command encoding is achieved via a BiLSTM equipped with multi-step textual attention, while the world is encoded through a GCN layer.
(c) Qiu et al. (2021b) (Multimodal Transformer): a multimodal transformer equipped with cross-attention for multimodal compositional generalization.
(d) Sikarwar et al. (2022) (GroCoT): another transformer-based model that incorporates interleaved self-attention into the multimodal transformer with cross-attention.
Figure 5: Cross-attention from text to image. In Figure (a), the purple zone indicates the model's incorrect object selection, while the red zone highlights the accurate choice. Figures (b) and (c) depict the averaged cross-attention map from our models over encoder layers and attention heads. The rows represent environment cells (the first element shows the row, and the second shows the column index, both starting from 0), and the columns correspond to text tokens. Brighter attention cells signify elevated attention weights.

Table 2: Results of our proposed model on the ReaSCAN dataset test splits. The results are an average of three runs. † denotes the models with masking. Models marked with * refer to the multimodal version of their implementation.

Both variants of our model demonstrate improvements in the II split. It is worth noting that the II split shares the same challenge as the A2 split from the ReaSCAN dataset, but in a less complex environment.

Table 5: The ablation study results of our modifications on the ReaSCAN dataset test splits. Results are reported as an average of three runs. We evaluate every combination of components from our best model. W/S stands for weight sharing, and ✓ shows the presence of the module. Dep in this table refers to dependency masking; we evaluate the model with or without dependency masking in the masking component.

Table 6: Comparing model parameters: our model vs. current state-of-the-art models. Dependency † refers to the model with dependency parsing for attention masking.

Table 7: Hyperparameters used in the experiments.

• G (Adverb): Commands carrying the adverb "cautiously" test how well the model interprets action modifiers after seeing limited training samples (k=1).