LAGr: Label Aligned Graphs for Better Systematic Generalization in Semantic Parsing

Semantic parsing is the task of producing structured meaning representations for natural language sentences. Recent research has pointed out that the commonly-used sequence-to-sequence (seq2seq) semantic parsers struggle to generalize systematically, i.e. to handle examples that require recombining known knowledge in novel settings. In this work, we show that better systematic generalization can be achieved by producing the meaning representation directly as a graph and not as a sequence. To this end we propose LAGr (Label Aligned Graphs), a general framework to produce semantic parses by independently predicting node and edge labels for a complete multi-layer input-aligned graph. The strongly-supervised LAGr algorithm requires aligned graphs as inputs, whereas weakly-supervised LAGr infers alignments for originally unaligned target graphs using approximate maximum-a-posteriori inference. Experiments demonstrate that LAGr achieves significant improvements in systematic generalization upon the baseline seq2seq parsers in both strongly- and weakly-supervised settings.


Introduction
Recent research has shown that neural models struggle to systematically generalize to examples with unseen combinations of seen rules from the training set (Lake and Baroni, 2018; Finegan-Dollak et al., 2018; Hupkes et al., 2019). Systematic generalization is especially important for the task of semantic parsing, which requires models to translate natural language sentences to structured meaning representations (MRs), such as SPARQL database queries or lambda calculus logical forms. To generalize systematically in this task, the model must be capable of producing MRs for examples that feature new combinations of meaning construction rules, such as the rule that maps a noun like "hedgehog" in Figure 1 to its respective predicate hedgehog(.), and the rule that defines which semantic role with respect to the verb (e.g. agent or theme) the resulting predicate takes. Using synthetic (Bahdanau et al., 2019; Kim and Linzen, 2020a; Keysers et al., 2020) and natural benchmarks (Finegan-Dollak et al., 2018; Shaw et al., 2020), researchers have been studying systematic generalization of existing semantic parsing methods as well as proposing new approaches such as using meta-learning (Conklin et al., 2021), pretrained models (Furrer et al., 2020), or intermediate meaning representations (Herzig et al., 2021).

Figure 1: Examples from the training and the generalization sets of the COGS dataset (Kim and Linzen, 2020b). While "hedgehog" is only observed in the agent role during training, the generalization set features this word in the theme role. Generalization example: "The baby liked the hedgehog" → * baby(x_1) ∧ hedgehog(x_4) ∧ like.agent(x_2, x_1) ∧ like.theme(x_2, x_4).
The dominant framework in these studies is sequence-to-sequence (seq2seq, Sutskever et al., 2014; Bahdanau et al., 2015) learning, whereby the model produces a serialized MR in an autoregressive fashion, by predicting one token at a time, while conditioning on all previously generated tokens. We hypothesize that for semantic parsing, constructing the MR by combining independent predictions that are not conditioned on each other can generalize more systematically than seq2seq. For example, consider the sentence "The dog liked that the hippo danced". Arguably, the predictions that "dog" is the agent of "like" and that "hippo" is the agent of "danced" can be made independently of each other. Our intuition is that a model that predicts such aspects of meaning independently of each other can be better at learning context-insensitive rules because the overall context for each individual prediction is reduced.
Following this intuition, we propose LAGr (Label Aligned Graphs), a framework to produce semantic parses by independently labelling the nodes and edges of a fully-connected multi-layer output graph that is aligned with the input utterance. While the general idea of predicting semantic parses as graphs is not new (Lyu and Titov, 2018), the systematic generalization benefits of doing so have not been investigated prior to this work. Importantly, LAGr retains most of the flexibility that seq2seq models have, without the complexity and rigidity that comes with other alternatives to seq2seq, such as grammar-based methods (Herzig and Berant, 2020).
We first introduce LAGr in the strongly-supervised setting where output graphs are aligned to the input sequences, thus allowing for standard supervised training. For the weakly-supervised case, when the alignment is not available, we treat it as a latent variable. We infer the latent alignment with a simple and novel approximate maximum-a-posteriori (MAP) inference approach which involves solving several minimum cost bipartite matching problems with the Hungarian algorithm (Kuhn, 1955a). We then use the resulting aligned graphs to train the model. Our experiments demonstrate that in both strongly- and weakly-supervised settings LAGr significantly improves upon comparable seq2seq semantic parsers on the COGS and CFQ datasets (Kim and Linzen, 2020a; Keysers et al., 2020).

Semantic Parsing by Labeling Aligned Graphs
We present LAGr (Label Aligned Graphs), a framework for constructing meaning representations (MR) directly as graphs (i.e., MR graphs).
When LAGr is used to output logical forms, the graph nodes can be variables, entities, categories and predicates, and graph edges can be the Neo-Davidsonian style semantic role relations that the nodes appear in, e.g. "is-agent-of" or "is-theme-of" (Parsons, 1990). While this work focuses on predicting logical forms, LAGr can, in principle, also be used to output other kinds of graphs, such as abstract syntax tree parses of SQL queries. As illustrated in Figure 2, LAGr predicts the output by labeling the nodes and edges of a fully-connected multi-layer output graph that is aligned with the input utterance. We label a multi-layer as opposed to a single-layer graph because some MR graphs have more nodes than the number of input tokens (see Section 4.2 for an example).
Notation and Terminology. Formally, let $x = x_1, x_2, \ldots, x_N$ denote a natural language utterance of $N$ tokens. LAGr produces an MR graph $G$ by labeling the nodes and edges of a complete graph $\Gamma_a$ with $M = L \cdot N$ nodes that are arranged in $L$ layers. The layers are aligned with the input sequence $x$ in such a way that for each input position $i$ there is a unique corresponding output node in each layer. We say that nodes from different layers that are aligned with the position $i$ form a column (an example column in Figure 2b contains the nodes labeled as actor and ?x0 for the word "star" at the position $i = 3$).
We write $\Gamma_a = (z, \xi)$ to indicate that a complete labeled graph $\Gamma_a$ is characterized by its node labels $z \in V_n^{M}$ and edge labels $\xi \in V_e^{M \times M}$, where $V_n$ and $V_e$ are node and edge label vocabularies, respectively. Both vocabularies also include additional null labels that we use as padding (e.g. grey nodes in Figure 2 are labeled as null). To produce the output MR graph $G$ from $\Gamma_a$, we remove all null nodes and null edges. Lastly, we use the notations $z_j$ and $\xi_{jk}$ to refer to the labels of node $j$ and of the edge $(j, k)$, where $j = (l - 1)N + i$ is a one-dimensional index that corresponds to the $i$-th node in the $l$-th layer.
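The indexing and null-removal conventions above can be illustrated with a minimal sketch. The names `NULL`, `flat_index` and `extract_mr_graph` are hypothetical, the toy labels are made up for illustration, and the choice to also discard edges that touch a null node is an assumption rather than something the text specifies.

```python
NULL = "null"  # hypothetical name for the padding label


def flat_index(layer, position, num_tokens):
    """1-based index j = (l - 1) * N + i of the i-th node in the l-th layer."""
    return (layer - 1) * num_tokens + position


def extract_mr_graph(node_labels, edge_labels):
    """Drop null nodes and null edges from a complete labeled graph.

    node_labels: list of M labels (z); edge_labels: M x M nested list (xi).
    Returns (nodes, edges): nodes maps the 1-based index j to its label,
    edges maps (j, k) to the edge label.
    Assumption: edges attached to a null node are discarded as well.
    """
    nodes = {j + 1: z for j, z in enumerate(node_labels) if z != NULL}
    edges = {
        (j + 1, k + 1): label
        for j, row in enumerate(edge_labels)
        for k, label in enumerate(row)
        if label != NULL and (j + 1) in nodes and (k + 1) in nodes
    }
    return nodes, edges


# Toy single-layer graph (N = 3, L = 1) with one null node.
node_labels = [NULL, "hedgehog", "eat"]
edge_labels = [[NULL] * 3 for _ in range(3)]
edge_labels[2][1] = "agent"  # edge from node 3 ("eat") to node 2 ("hedgehog")
nodes, edges = extract_mr_graph(node_labels, edge_labels)
```

With two layers and three tokens, `flat_index(2, 3, 3)` gives 6, i.e. the last node of the second layer.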

Labeling Aligned Graphs
To label the nodes of $\Gamma_a$ we encode the input utterance $x$ as a matrix of $N$ $d$-dimensional vectors $H = f_{enc}(x) \in \mathbb{R}^{N \times d}$, where $f_{enc}$ can be an arbitrary encoder model such as an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Vaswani et al., 2017). LAGr then defines a factorized distribution $p(z|x)$ over the node labels $z$ as follows:

$$O = \Big\Vert_{l=1}^{L} H W^{l}, \qquad p(z|x) = \prod_{j=1}^{M} p(z_j|x) = \prod_{j=1}^{M} \mathrm{softmax}(O)_{j, z_j},$$

where $O \in \mathbb{R}^{M \times |V_n|}$ contains logits for $M = N \times L$ nodes from all the $L$ graph layers, $\Vert$ denotes the concatenation operation along the node axis, and $W^{l}$ denotes the weight matrix for layer $l$. Here and in the following equations, $\mathrm{softmax}(\cdot)$ is applied to the last dimension of the input tensor, and every multiplication by a weight matrix is followed by the addition of a bias vector, which we omit for clarity.

Our edge labelling computation is reminiscent of the multi-head self-attention by Vaswani et al. (2017), with the key difference that softmax is applied across the edge labels and not across positions:

$$H_q^{\alpha} = \Big\Vert_{l=1}^{L} H U^{\alpha,l}, \qquad H_k^{\alpha} = \Big\Vert_{l=1}^{L} H V^{\alpha,l}, \qquad E = \mathrm{stack}_{\alpha \in V_e}\left(H_q^{\alpha} (H_k^{\alpha})^{\top}\right),$$

where $H_q^{\alpha}$ and $H_k^{\alpha}$ contain concatenated query and key vectors for the label $\alpha \in V_e$ across all $L$ graph layers, $U^{\alpha,l}, V^{\alpha,l} \in \mathbb{R}^{\frac{d}{|V_e|}, \frac{d}{|V_e|}}$ are the weights for the edge label $\alpha$, and the stack operator stacks the matrices into a 3D tensor to which softmax is subsequently applied. Similarly to $p(z|x)$, we obtain $p(\xi|x)$ as follows:

$$p(\xi|x) = \prod_{j=1}^{M} \prod_{k=1}^{M} p(\xi_{jk}|x) = \prod_{j=1}^{M} \prod_{k=1}^{M} \mathrm{softmax}(E)_{j, k, \xi_{jk}}.$$

The factorized nature of these distributions (Equations 3 and 4) makes the argmax inference $\hat{z}, \hat{\xi} = \arg\max p(z, \xi|x)$ trivial to perform. When the ground-truth aligned graph $\Gamma_a^{*} = (z^{*}, \xi^{*})$ for the MR graph $G$ is available, LAGr can be trained by directly optimizing $\log p(z = z^{*}, \xi = \xi^{*}|x)$. We refer to this training setting as strongly-supervised LAGr.
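To make the labelling computation concrete, the following is a minimal pure-Python sketch of the node and edge distributions described in this section. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`matmul`, `softmax`, `node_probs`, `edge_probs`) are hypothetical, biases are omitted, the encoder output `H` is replaced by random numbers, and the projection shape `d x d_e` is chosen only so that the products are well-defined.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def node_probs(H, W):
    """p(z_j | x) for all M = L * N nodes.

    H: N x d encoder states; W: list of L weight matrices, each d x |V_n|.
    Per-layer logits are concatenated along the node axis (layer 1 first).
    """
    logits = [row for W_l in W for row in matmul(H, W_l)]
    return [softmax(row) for row in logits]


def edge_probs(H, U, V):
    """p(xi_jk | x) for all M x M edges.

    U[a][l], V[a][l]: d x d_e projections for edge label a and layer l
    (shape is an assumption). Softmax runs across edge labels, not positions.
    """
    num_labels = len(U)
    queries = [[r for U_al in U[a] for r in matmul(H, U_al)] for a in range(num_labels)]
    keys = [[r for V_al in V[a] for r in matmul(H, V_al)] for a in range(num_labels)]
    M = len(queries[0])
    probs = []
    for j in range(M):
        row = []
        for k in range(M):
            scores = [
                sum(q * c for q, c in zip(queries[a][j], keys[a][k]))
                for a in range(num_labels)
            ]
            row.append(softmax(scores))
        probs.append(row)
    return probs


# Toy sizes: N = 3 tokens, d = 4, L = 2 layers, |V_n| = 5, |V_e| = 3, d_e = 2.
rng = random.Random(0)
rand = lambda r, c: [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(r)]
H = rand(3, 4)
P_nodes = node_probs(H, [rand(4, 5) for _ in range(2)])
P_edges = edge_probs(
    H,
    [[rand(4, 2) for _ in range(2)] for _ in range(3)],
    [[rand(4, 2) for _ in range(2)] for _ in range(3)],
)
# Argmax inference is independent per node because the distribution factorizes.
z_hat = [max(range(len(p)), key=p.__getitem__) for p in P_nodes]
```

Because every node and edge has its own categorical distribution, inference reduces to an independent argmax per entry, which is what the last line does for the nodes.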

Weakly-supervised LAGr
In many practical settings, the alignment between the MR graph $G$ and the sequence $x$ is unavailable, making the aligned graph $\Gamma_a$ unknown. To address this common scenario, we propose a weakly-supervised LAGr algorithm based on a latent alignment model. Similarly to the strongly-supervised case, we assume that the MR graph can be represented as a labeled complete multi-layer graph $\Gamma_{na}$, with the difference that in this case the alignment between $x$ and $\Gamma_{na}$ is not known. We assume a generative process whereby $\Gamma_{na}$ is obtained by permuting the columns of the latent aligned graph $\Gamma_a$ with a random permutation $a$, where $a_j$ is the index of the column in $\Gamma_a$ that becomes the $j$-th column in $\Gamma_{na}$. For the rest of this section we focus on the single-layer ($L = 1$) case to simplify the formulas. For this case our probabilistic model defines the following distribution over $\Gamma_{na} = (s, e)$:

$$p(e, s|x) = \sum_{a} \sum_{z} \sum_{\xi} p(e, s, a, z, \xi|x),$$

where $p(a) = 1/N!$. Computing $p(e, s|x)$ exactly is intractable. For this reason, we train LAGr by using an approximation of $p(e, s|x)$ in which, instead of summing over all possible alignments $a$, we only consider the maximum-a-posteriori (MAP) alignment $\hat{a} = \arg\max_a p(a|e, s, x)$. This approach is sometimes called the hard Expectation-Maximization algorithm in the literature on probabilistic models (Svensén and Bishop, 2007). Because the prior $p(a)$ is uniform, the MAP alignment maximizes the log-likelihood of the observed labels:

$$\hat{a} = \arg\max_a \left[ \log p(s|a, x) + \log p(e|a, x) \right], \qquad \log p(s|a, x) = \sum_{j} \log p(z_{a_j} = s_j|x).$$

We are not aware of an exact algorithm for solving the above optimization problem. However, if the edge log-likelihood term $\log p(e|a, x)$ is dropped in the equations above, maximizing the node label probability $p(s|a, x)$ is equivalent to a standard minimum cost bipartite matching problem. This optimization problem can be solved by the polynomial-time Hungarian algorithm (Kuhn, 1955b). We can thus use an approximate MAP alignment $\hat{a}_1 = \arg\max_a \sum_j \log p(z_{a_j} = s_j|x)$. While dropping $p(e|a, x)$ from Equation 6 is a drastic simplification, in situations where node labels $s$ are unique and the model is sufficiently trained to output sharp probabilities $p(z_j|x)$ we expect $\hat{a}_1$ to often match $\hat{a}$. To further improve the MAP alignment approximation and alleviate the reliance on node label uniqueness, we generate a shortlist of $K$ candidate alignments by solving $K$ noisy matching problems of the form $\arg\max_a \sum_j \left[ \log p(z_{a_j} = s_j|x) + \epsilon_{j a_j} \right]$, where $\epsilon_{j a_j} \sim \mathcal{N}(0, \sigma)$. We then select the alignment candidate $a$ that yields the highest full log-likelihood $\log p(s|a, x) + \log p(e|a, x)$.
We refer the reader to Algorithm 1 for a detailed presentation of weakly-supervised LAGr.
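The alignment inference described in this section can be sketched in pure Python as follows. This is an illustrative sketch under assumptions, not the authors' code. The function names are hypothetical, labels are represented as integer ids, `sigma` is treated as a standard deviation, and the edge term is expanded as a sum of `log p(xi_{a_j a_k} = e_jk | x)` over all pairs, following the model's factorization. The `hungarian` routine is a textbook O(n^3) implementation of minimum cost bipartite matching.

```python
import math
import random


def hungarian(cost):
    """Minimum cost perfect matching; returns the column matched to each row."""
    n = len(cost)
    INF = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (n + 1)
    p, way = [0] * (n + 1), [0] * (n + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [INF] * (n + 1), [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    return assignment


def full_log_likelihood(a, s, e, node_logp, edge_logp):
    """log p(s|a,x) + log p(e|a,x) for an alignment a (single layer)."""
    n = len(s)
    node_term = sum(node_logp[a[j]][s[j]] for j in range(n))
    edge_term = sum(edge_logp[a[j]][a[k]][e[j][k]] for j in range(n) for k in range(n))
    return node_term + edge_term


def approximate_map_alignment(s, e, node_logp, edge_logp, K=1, sigma=0.0, rng=None):
    """Solve K noisy matching problems and keep the best-scoring candidate."""
    rng = rng or random.Random(0)
    n = len(s)
    best, best_ll = None, -math.inf
    for _ in range(K):
        # Row j: unaligned node; column i: aligned position; negate to minimize.
        cost = [[-(node_logp[i][s[j]] + rng.gauss(0.0, sigma)) for i in range(n)]
                for j in range(n)]
        a = hungarian(cost)
        ll = full_log_likelihood(a, s, e, node_logp, edge_logp)
        if ll > best_ll:
            best, best_ll = a, ll
    return best


# Toy example: sharp node probabilities, uninformative edges.
aligned = [0, 1, 2]  # label ids at aligned positions
node_logp = [[math.log(0.9 if lab == aligned[i] else 0.05) for lab in range(3)]
             for i in range(3)]
edge_logp = [[[math.log(0.5)] * 2 for _ in range(3)] for _ in range(3)]
s = [2, 0, 1]  # the same labels in permuted (unaligned) order
e = [[0] * 3 for _ in range(3)]
a_hat = approximate_map_alignment(s, e, node_logp, edge_logp, K=1, sigma=0.0)
```

In the toy example the node labels are unique and the probabilities are sharp, so the noise-free matching with `K=1` recovers the permutation that maps each unaligned node back to its aligned position.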

Algorithm 1: Training LAGr with weak supervision
Init: Let K be the number of alignment candidates, T be the number of training steps, and θ_t be the model parameters after t steps.
1 for t = 1, ..., T do
2     sample example (x, e, s)

Related Work
The LAGr approach is heavily inspired by graph-based dependency parsing algorithms (McDonald, 2006). In neural graph-based dependency parsers (Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017) the model is trained to predict the existence and the label of each of the possible edges between the input words. The Abstract Meaning Representation (AMR) parser by Lyu and Titov (2018) brings similar methodology to the realm of semantic parsing, although they do not consider the systematic generalization implications of using a graph-based parser instead of a seq2seq one. Lyu and Titov (2018) only output single-layer graphs, which requires aggressive graph compression; in LAGr we allow the model to output a multi-layer graph instead. Lastly, the amortized Gumbel-Sinkhorn alignment inference used by Lyu and Titov (2018) is much more complex than the Hungarian-algorithm-based approximate MAP inference that we employ here. Another important inspiration for LAGr is the UDepLambda method (Reddy et al., 2016) that converts dependency parses into graph-like logical forms. LAGr can be seen as an algorithm that produces UDepLambda graphs directly with the neural model, side-stepping the intermediate dependency parsing step.
Another alternative to seq2seq semantic parsers are span-based parsers that predict span-level actions for building MR expressions from sub-expressions (Pasupat et al., 2019; Herzig and Berant, 2020; Liu et al., 2021). A prerequisite for using a span-based parser is an MR that can be viewed as a recursive composition of MRs for subspans. While this strong compositionality assumption holds for the logical forms used in earlier semantic parsing research (e.g. Zettlemoyer and Collins (2005)), an intermediate MR would be required to produce other meaning representations, such as SPARQL or SQL queries, with a span-based parser. The designer of an intermediate MR for a span-based parser must think about MRs for spans and how they should be composed. This can sometimes lead to non-trivial corner cases, such as the ternary grammar rules in Herzig and Berant (2020). In contrast, a graph-based parser can in principle produce any graph, although in our experiments we compress the raw graphs slightly to make the learning problem easier.
Other related semantic parsing approaches include the semantic labeling method by Zheng and Lapata (2020) and the structured reordering approach by Wang et al. (2021). Zheng and Lapata (2020) show that labelling the input sequence prior to feeding it to the seq2seq semantic parser improves systematic generalization. Compared to that study, our work goes one step further by adding edge labeling, which allows us to let go of the seq2seq model entirely. Wang et al. (2021) model semantic parsing as structured permutation of the input sequence followed by monotonic segment-level transduction. This approach achieves impressive results, but is considerably more complex than LAGr. Finally, Guo et al. (2020) achieve a very high performance on CFQ by combining the sketch prediction approach (Dong and Lapata, 2018) with an algorithm that outputs the MR as a directed acyclic graph (DAG). Unlike LAGr, their algorithm produces the DAG in a sequential left-to-right fashion. Notably, the non-hierarchical version of this algorithm without sketch prediction performs poorly.
Concurrently with this work, Ontañón et al. (2021) show that semantic parsing by sequence tagging improves systematic generalization.Their sequence tags are similar to the aligned graphs that we predict with LAGr when using a single graph layer.Ontañón et al. (2021) do not discuss how to infer sequence tags from logical forms when the former are not available.

Experiments
We demonstrate the effectiveness of LAGr on two systematic generalization benchmarks for semantic parsing: COGS (Kim and Linzen, 2020a) and CFQ (Keysers et al., 2020).

Graph Construction
In order to study LAGr on COGS, we first convert the logical forms to UDepLambda-style (Reddy et al., 2016) MR graphs. Specifically, we construct the graph nodes using the one- and two-place predicates and definite articles (e.g. hedgehog, apple, eat and the * nodes in Figure 2a). We do not create dedicated nodes for variables, as every variable in COGS is either an argument to a unique one-place predicate (e.g. x_1 for hedgehog(x_1)), or the first argument to a unique two-place predicate (e.g. x_2 for eat in eat.agent(x_2, x_1)). Instead, we let the respective predicate node represent the variable. The labeled edges for our graphs are defined by the Neo-Davidsonian role predicates of the logical forms (such as agent, theme, recipient, ccomp, nmod.on, nmod.in, xcomp, nmod.beside). For example, the conjunct eat.agent(x_2, x_1) results in an agent edge between the eat and hedgehog nodes. We also add special article edges to connect definite article nodes (denoted by the * label) to their respective nouns (e.g. hedgehog in Figure 2a). We take advantage of the correspondence between variable names and input positions (x_i corresponds to the i-th token) to construct single-layer (L = 1) aligned graphs $\Gamma_a$ for COGS that are suitable for strongly-supervised LAGr, as described in Section 2.1. The node and edge vocabularies for the aligned graphs contain 645 and 10 labels respectively, each including a null label.

Training Details. Hyperparameter tuning on COGS is challenging since the performance on the in-distribution development set always saturates to near 100%. We adopt the hyperparameter tuning procedure discussed in Conklin et al. (2021) to find the best configuration for our baselines and strongly-supervised LAGr models. Specifically, we create a "Gen Dev" dataset by sampling 1000 random examples from the generalization set and use them to find the best hyperparameter configuration. We find that our Transformer-based seq2seq and LAGr models perform better when embeddings are initialized following He et al. (2015) and when positional embeddings are scaled down by $1/\sqrt{\mathrm{dim}}$. The latter techniques were adopted following the recent work of Csordás et al. (2021) under the PED (Positional Embedding Downscaling) name. We report the exact match accuracy, i.e., the percentage of examples for which the predicted graphs after serialization yielded the same logical form, as well as the standard deviation over at least 10 random seeds. We tune the hyperparameters for strongly-supervised LAGr first; we then use the same configuration for weakly-supervised LAGr and only tune the inference hyperparameters, i.e. the number of candidates K and the noise level σ. Since weakly-supervised LAGr does not always converge on the training set, we implement a restart mechanism that relaunches experiments with a new random seed where a training performance of at least 95% is not achieved. Setting K = 10 and σ = 1.0 allows us to achieve a convergence rate of around 50%. For more details on our hyperparameter search and best configurations, we refer the reader to Appendix A.1.
Additionally, we observe that the training loss does not go to 0 in the weakly-supervised setting. We attribute this to a significant percentage (2.7%) of training examples in which there are three or more nodes with the same label (namely "*" for definite articles), which presents a challenge to our alignment inference mechanism. To remedy this, we cache the previously used alignment and append it as the (K + 1)-st alignment candidate (see lines 3-8 in Algorithm 1). This allows the model to remember low-loss alignments and thereby helps achieve full convergence. Lastly, we also run weakly-supervised LAGr with retraining, in which we take the final learned alignments for all examples and retrain models using these alignments as strong supervision.

Baselines. We compare LAGr to LSTM- and Transformer-based seq2seq semantic parsers that produce logical forms as sequences of tokens. In addition to training our own seq2seq baselines, we also include baseline results from the original COGS paper by Kim and Linzen (2020a) and from follow-up works by Akyürek and Andreas (2021) and Csordás et al. (2021). We also compare LAGr to a lexicon-based seq2seq model "LSTM+Lex" by Akyürek and Andreas (2021) that leverages the copy mechanism in the seq2seq decoder to perform a lexical lookup to generate the output token.
We experiment with two variations of LAGr: using a shared encoder, or using separate encoders for syntax (i.e., node predictions) and semantics (i.e., edge predictions). These are reflected in Table 1 by the subscripts "sh" and "sep" in the model names, respectively. We achieve the best result in the strongly-supervised setting using separate encoders. While this setting significantly improves the performance of LAGr in all cases, for the strongly-supervised LSTM-based LAGr models, separating encoders seems to be crucial (71.4% vs 39.0%).
The use of retraining in weakly-supervised LAGr is helpful. It allows us to increase the accuracy of weakly-supervised LAGr to match our strongly-supervised result. Finally, LAGr is able to match the performance of the LSTM+Lex approach by Akyürek and Andreas (2021) without relying on the use of lexicons, a result we further discuss in Section 5.

CFQ
Dataset CFQ (Keysers et al., 2020) is a benchmark for systematic generalization in semantic parsing that requires models to translate English sentences to SPARQL database queries.We use CFQ's Maximum Compound Divergence (MCD) splits, which were generated by making the distribution of compositional structures in the train and test sets as divergent as possible.
SPARQL queries contain two components: a SELECT and a WHERE clause. The SELECT clause is either of the form count(*) for yes/no questions or SELECT DISTINCT ?x0 for wh-questions (those starting with "which", "what", "who", etc.). The WHERE clause can contain constraints of three kinds: filter constraints ensuring two variables or entities are distinct (e.g. FILTER ?x0 != M0), two-place predicates expressing a relation between two entities (e.g. ?x0 parent ?x1), and one-place predicates expressing whether an entity belongs to a category (e.g. ?x0 a ns:film.actor).

Graph Construction. Before constructing the graphs, similarly to prior work (Furrer et al., 2020; Guo et al., 2020), we compress the SPARQL queries by merging some triples in the WHERE clauses. As an example, consider the question "Were M2 and M3 directed by a screenwriter that executive produced M1?", where the original MR contains both [M2 directed by ?x0, M3 directed by ?x0] conjuncts. To make it easier to align SPARQL queries to the input question, we merge triples by concatenating their subjects and objects, e.g. yielding [[M2, M3] directed by ?x0] for the above example. With this compression, the SPARQL queries can now contain an arbitrary number of entities in the triples. To convert the compressed SPARQL queries to graphs we first remove the SELECT clauses. To preserve the question type information, for wh-questions we replace the ?x0 variable in the WHERE clause with a special select ?x0 variable. As the example in Figure 2b shows, we define the graph nodes by taking the entities (including variables, e.g. ?x0, M1) and all predicates (parent, sibling, actor) from the triples. For one-place predicates, we connect the entity nodes to the predicate node with an agent edge label. For triples with two-place predicates, we connect the predicate to the left-hand side and right-hand side entities with the agent and theme edge respectively. We add a FILTER edge between the variables or entities that participate in a filter constraint. The resulting node and edge vocabularies contain 84 and 4 labels respectively, each also including a null label.
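The compression and graph construction steps above can be sketched as follows. The exact merging procedure is not spelled out in the text, so this sketch makes several assumptions that should be read as such: triples sharing a predicate and object have their subjects merged first, then triples sharing subjects and a predicate have their objects merged; each merged triple gets its own predicate node; edges are stored as (predicate node, entity node, label); and the select-variable rewriting is left out. The function names and the node-id scheme are hypothetical.

```python
from collections import defaultdict


def merge_triples(triples):
    """Merge (subject, predicate, object) triples by concatenating subjects and objects."""
    by_pred_obj = defaultdict(list)
    for subj, pred, obj in triples:
        if subj not in by_pred_obj[(pred, obj)]:
            by_pred_obj[(pred, obj)].append(subj)
    by_subj_pred = defaultdict(list)
    for (pred, obj), subjects in by_pred_obj.items():
        by_subj_pred[(tuple(subjects), pred)].append(obj)
    return [(subjects, pred, tuple(objects))
            for (subjects, pred), objects in by_subj_pred.items()]


def build_graph(merged, filters=()):
    """Turn merged triples and filter pairs into labeled nodes and edges.

    Returns (nodes, edges): nodes maps a node id to its label, edges is a set
    of (source id, target id, edge label) tuples.
    """
    nodes, edges = {}, set()
    for t, (subjects, pred, objects) in enumerate(merged):
        if pred == "a":
            # One-place predicate: the category acts as the predicate node.
            for c, category in enumerate(objects):
                pred_id = f"{category}#{t}.{c}"
                nodes[pred_id] = category
                for subj in subjects:
                    nodes[subj] = subj
                    edges.add((pred_id, subj, "agent"))
        else:
            pred_id = f"{pred}#{t}"
            nodes[pred_id] = pred
            for subj in subjects:
                nodes[subj] = subj
                edges.add((pred_id, subj, "agent"))
            for obj in objects:
                nodes[obj] = obj
                edges.add((pred_id, obj, "theme"))
    for left, right in filters:
        nodes[left], nodes[right] = left, right
        edges.add((left, right, "FILTER"))
    return nodes, edges


# Toy WHERE clause; the category triple is illustrative, not taken from the dataset.
triples = [
    ("M2", "directed by", "?x0"),
    ("M3", "directed by", "?x0"),
    ("?x0", "executive produced", "M1"),
    ("?x0", "a", "ns:film.actor"),
]
merged = merge_triples(triples)
nodes, edges = build_graph(merged)
```

On this toy input the two "directed by" triples collapse into a single merged triple with subjects (M2, M3), matching the compression example in the text.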
Training Details. Unlike COGS, we use L = 2 graph layers in LAGr in order to accommodate the larger MR graphs in CFQ. This is because CFQ contains examples such as "Who married M1's female German executive producer?" that contain 8 tokens, but induce the following 10 nodes: ?x1, executive produced, M1, gender, ns:m.02zsn, nationality, ns:m.0345h, select ?x0, spouses, person.
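As a quick sanity check on the choice of two layers, the sketch below computes a lower bound on the number of layers: a complete graph with L layers over N tokens offers L · N node slots, so at least ceil(nodes / N) layers are needed. The function name `min_layers` is hypothetical, and the bound says nothing about whether a good alignment exists for a given example.

```python
import math


def min_layers(num_nodes, num_tokens):
    """Smallest L such that L * N is at least the number of MR graph nodes."""
    return math.ceil(num_nodes / num_tokens)


# CFQ example from the text: 10 nodes induced by an 8-token question.
layers_needed = min_layers(10, 8)
```

For the example above the bound evaluates to 2, consistent with the setting used in the CFQ experiments.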
In all our CFQ experiments we use a shared Transformer encoder for both node and edge prediction. To assess performance, we use exact graph accuracy, which we define as the percentage of examples where the predicted and true graphs are isomorphic. The predicted graphs contain enough information to exactly reconstruct the SPARQL query, hence our exact graph accuracy can be compared to the exact match accuracy from prior work. For hyperparameter tuning, we follow Keysers et al. (2020) and use CFQ's in-distribution random split to find the best model configuration. We do this by first fixing the number of candidate alignments at K = 1 to search for the best hyperparameters. Once we find the best configuration, we tune K and σ. For the best found configuration of K = 5, σ = 10, as well as for the base configuration K = 1, σ = 0, we report the average graph accuracy and standard deviation for 8-11 runs of weakly-supervised LAGr on the MCD1, MCD2, MCD3 and the random split. Similarly to COGS, we use the PED initialization technique from Csordás et al. (2021), and discard runs where weakly-supervised LAGr does not reach at least 99.5% graph accuracy on the training set (around 12% of all runs). For further details on our CFQ experiments we refer the reader to Appendix A.2.

Results. We compare LAGr to seq2seq semantic parsing results reported in prior work (Keysers et al., 2020; Furrer et al., 2020), as well as results obtained with compressed SPARQL queries (Guo et al., 2020; Herzig et al., 2021). As shown in Table 2, weakly-supervised LAGr outperforms all comparable baselines on all of CFQ's out-of-distribution MCD splits. While both K = 1 and K = 5 with σ = 10 yield impressive performance gains compared to the baselines, we obtain mixed results about the impact of a higher K and the use of noise. Specifically, the best result on MCD1 is achieved with K = 1, in contrast to MCD2 and MCD3 where K = 5 with σ = 10 performs significantly better than K = 1.
For reference, Table 2 also includes the state-of-the-art Hierarchical Poset Decoding (HPD, Guo et al., 2020) method (see Section 3), which arguably is not a fair baseline to LAGr because of its use of sketch prediction and lexicons. Notably, when these techniques are not used, LAGr performs much better than their base HPD algorithm.
To further zoom into the impact of weakly-supervised LAGr's hyperparameters, we report results of preliminary experiments² in which we tuned the number of alignment candidates K and the noise level σ. One can see that choosing the best alignment out of K > 1 candidates is indeed helpful, and that noise of high magnitude (σ = 10) brings the best improvement on the random split. These improvements also translate into systematic generalization gains for MCD2 and MCD3, as shown in Table 2 where we see that K = 5 achieves better performance than K = 1. The positive effect of a larger K on these splits is in line with our expectation, since 3.7-5.7% of examples in each CFQ split have at least two predicates with identical node labels, which can make it hard to align the MR graph to the input by looking at node labels only. Interestingly, in contrast to our intuition, when using ten candidate alignments, the random split test performance is slightly worse than when using five. We show examples of the node labels that weakly-supervised LAGr predicts in the learned aligned CFQ graphs as well as the corresponding SPARQL queries in Figure 3 (Appendix A.3).

² These experiments were carried out using an earlier preliminary implementation. Results in Table 3 are thus not directly comparable to those reported in Table 2.

Table 3: The effect of the number of alignment candidates K and noise level σ on the performance of weakly-supervised LAGr using CFQ's random split. We report the average graph accuracy and the standard deviation over 5 runs. We show the best configuration in bold.

Discussion & Future Work
In this work we have shown that performing semantic parsing by labeling aligned graphs brings significant gains in systematic generalization. In our COGS and CFQ experiments, LAGr significantly improves upon sequence-to-sequence baselines in both strongly- and weakly-supervised settings. Specifically, on COGS, LAGr outperforms our carefully-tuned seq2seq baselines and performs similarly to LSTMs that leverage lexicons. Lexicons can also be integrated into LAGr, although we do not expect this to improve LAGr's performance on COGS, as our best performing models already predict node labels perfectly. Lexicons also bring their own challenges of dealing with context-dependency and ambiguity, hence it is notable that LAGr matches the performance of a lexicon-equipped model while making fewer assumptions about the nature of the input-to-output mapping. On CFQ, LAGr outperforms all seq2seq baselines on all MCD splits.
Based on our error analysis (see Appendix A.3), we believe that a modification of LAGr that conditions edge predictions on node labels could bring further improvements.Importantly, this modification would be compatible with our current alignment inference algorithm.Another obvious direction to improve LAGr's performance is by using a pretrained encoder.Lastly, while the current alignment inference algorithm is effective, applying more advanced discrete optimization or amortized inference methods could be an interesting direction for future work.

A Appendix
A.1 COGS Hyperparameter Tuning
COGS does not include an out-of-distribution development set, which makes it challenging to find the best model configuration. To overcome this problem, we followed the same hyperparameter tuning procedure for our baselines and our strongly-supervised LAGr models as proposed by Conklin et al. (2021). We sampled 1000 examples from the generalization set as a "Gen Dev" set which was used to pick the best hyperparameter configuration. We tested 0.001, 0.004, 0.0001 and 0.0004 for learning rates, 64, 128 and 256 for batch sizes, and 0.1 versus 0.4 for dropout. We tested an embedding size of 256 versus 512. Furthermore, for the Transformer baselines and for LAGr with a Transformer encoder, we also tested 2 versus 4 layers, and 4 versus 8 attention heads. We trained all models for 70,000 steps, with no early stopping. Each configuration was evaluated on 5 seeds. Once the best configuration was found, we retrained all models on at least 10 seeds. The final numbers of seeds used to report our results in Table 1 are the following: 20 seeds for each of the weakly-supervised LAGr experiments with and without retraining, 80 and 20 seeds for strongly-supervised LAGr with a separate and shared encoder, respectively, and finally, 20 seeds for our baseline Transformer experiments. We varied the number of seeds in order to obtain more accurate estimates for the mean performance measures. The best configurations for COGS are shown in Table 6.
For weakly-supervised LAGr, we used the best configuration we found for strongly-supervised LAGr. We then investigated different values for K, the number of candidate alignments (1, 5 and 10), and for the noise level σ (0, 0.001, 0.01, 0.1, 1, 10, 15 and 20). In addition, we also implemented a random restart procedure to restart runs with a new random seed if they were not able to reach at least 98% training accuracy. We found that only with K = 10 and σ = 1 were we able to get around 50% of the runs to converge. This was different from our CFQ experiments, where 97% of runs converged to at least 98% when appropriate noise levels were chosen (i.e., σ < 15).
As for our seq2seq baseline, in order to reproduce the same Transformer performance as reported by Csordás et al. (2021), we reused both their hyperparameters and their model implementation.Namely, we used a learning rate of 1e-4 with a linear scheduler and no warmup, a batch size of 128, an encoder dimension of 512 with dropout of 0.1.Lastly, we clipped gradients larger than 1.0.

A.2 CFQ Hyperparameter Tuning
We performed hyperparameter tuning on CFQ's random split, and chose the best configuration based on the development exact graph accuracy.For LAGr with both shared and separate Transformer encoders, we tested learning rates of 0.0001, 0.0004, 0.0006, 0.0008 and 0.001, with a linear warmup of 0, 1000 versus 5000 steps, with dropout of 0.1 and 0.4, batch sizes of 64, 128, 256 and 512, and 2 versus 4 Transformer layers and attention heads of 4 versus 8.In contrast to COGS, we were able to drive the training loss to 0 without caching and appending previously learned alignments as the K + 1st alignment candidates.For this reason, we did not use this caching technique.Lastly, similarly to COGS, we filtered out runs that diverged in terms of their training graph accuracy.While for COGS weakly-supervised LAGr is more sensitive to varying K and σ, in CFQ, we obtained 97% convergence from all our runs in Table 3.We report the best configuration used for CFQ in Table 7.

A.3 Error analysis
Table 4 shows some commonly encountered errors on COGS with strongly-supervised LAGr. In all examples, the model predicted the correct set of nodes. However, even when all nodes are correctly predicted, some may not show up in the final logical form if they have no connecting edges to other nodes (see the "dog" node in example 4).
Figure 3 shows the predicted nodes of aligned graphs and resulting queries produced by the best weakly-supervised LAGr model on CFQ.The top two rows show common errors where some edge labels do not get predicted, and where some nodes are missing due to the model not having predicted any connecting edges for the nodes, thus omitting the nodes from the final output graph.The bottom two rows show the inferred aligned graphs for examples that result in the correct output graph.

A.4 Further COGS examples
Table 5 shows further examples from COGS's generalization set, covering various cases that test models' ability to generalize systematically.
Example 1: wrong edge label, between right nodes In A cockroach sent Sophia the sandwich beside the yacht .
A child was blessed.

Figure 2: Aligned and unaligned graphs for COGS (a) and CFQ (b). For COGS, pink, blue and black denote agent, theme and article edges, respectively. For CFQ, yellow, pink and blue mark FILTER, agent and theme edges. Grey nodes mark null nodes, and * denotes the definite article. The aligned graph for CFQ is provided for illustration purposes, and was not used for training. See Section 4 for the learned aligned graphs.

Figure 3: Predicted nodes of aligned graphs and resulting queries produced by the best weakly-supervised LAGr with K = 5, σ = 10 on the development set of CFQ. Top two rows show common errors with missing edge labels and missing nodes, and bottom rows show the inferred alignments for correct examples.

Table 2: Average graph accuracy and standard deviation of weakly-supervised LAGr on CFQ (bottom). Middle: results by several seq2seq baselines from prior work (Keysers et al. (2020) ♥, Furrer et al. (2020) ♣). Top: results not directly comparable to LAGr: Hierarchical Poset Decoding (Guo et al., 2020) ♠, and pretrained T5-small seq2seq model with intermediate representations (IR) (Herzig et al., 2021) ♦. Approaches other than LAGr report the average exact match accuracy with 95% confidence intervals.
Luke S. Zettlemoyer and Michael Collins. 2005. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence, UAI'05, pages 658-666.