Measuring Alignment Bias in Neural Seq2seq Semantic Parsers

Prior to deep learning, the semantic parsing community was interested in understanding and modeling the range of possible word alignments between natural language sentences and their corresponding meaning representations. Sequence-to-sequence models changed the research landscape, suggesting that we no longer need to worry about alignments since they can be learned automatically by means of an attention mechanism. More recently, researchers have started to question this premise. In this work we investigate whether seq2seq models can handle both simple and complex alignments. To answer this question we augment the popular Geo semantic parsing dataset with alignment annotations and create Geo-Aligned. We then study the performance of standard seq2seq models on the examples that can be aligned monotonically versus examples that require more complex alignments. Our empirical study shows that performance is significantly better over monotonic alignments.


Introduction
In semantic parsing, the goal is to map natural language (NL) sentences into machine-readable meaning representations (MR) which allow for automated reasoning. For example, consider the following pair:

NL: What is the population of Georgia ?
MR: answer (population (state (georgia) ) )

Prior to deep learning models, a popular approach was to learn a grammar-based parser that explicitly models alignments between the NL and MR sequences (Wong and Mooney, 2006; Zettlemoyer and Collins, 2005, 2007; Lu et al., 2008; Kwiatkowski et al., 2010, 2011). The emergence of sequence-to-sequence (seq2seq) semantic parsers with attention mechanisms changed the research landscape: one of the initial premises of seq2seq models is that alignments no longer need to be explicitly modeled because the attention mechanisms will automatically learn them (Bahdanau et al., 2015). More recently, researchers started to question this premise, having observed that seq2seq models fail to make proper generalizations on out-of-distribution test sets on which traditional grammar-based models excel (Liu et al., 2020, 2021; Wang et al., 2021).
In this paper we follow this line of research and ask two questions: Can standard seq2seq models handle arbitrary alignments? And if not, what kind of alignment bias do they have? To answer these questions, we augment the GEO semantic parsing benchmark (Zelle and Mooney, 1996) with alignment annotations and create GEO-ALIGNED. We then compare the performance of seq2seq models on examples that can be easily aligned with simple monotonic alignments to the performance of these models on examples that require word reordering. Our empirical study shows that seq2seq parsers perform significantly better over examples that can be monotonically aligned. In other words, the flexibility of not having to explicitly model alignments comes at a cost: seq2seq models have difficulties in learning complex alignments.
The main contributions of this paper are:
1. We introduce a new dataset, GEO-ALIGNED, that augments the GEO semantic parsing benchmark with alignment annotations. We used the English and German versions of the original dataset, and we additionally introduce a new Italian version.
2. Using GEO-ALIGNED we define new evaluation splits to distinguish parsing performance on monotonic and non-monotonic alignments.
3. Our empirical study shows that seq2seq parsers are significantly better at handling monotonic alignments, and quantifies the impact of using attention.
4. As a side contribution we offer a measure of the complexity of the GEO dataset, showing that more than half of the examples involve monotonic alignments.

The GEO-ALIGNED Benchmark
In this section we describe the GEO-ALIGNED dataset, an augmentation of the popular GEO semantic parsing benchmark first introduced by Zelle and Mooney (1996). We start by providing a brief formal definition of word alignments following standard notation from the statistical machine translation literature, and we define monotonic and non-monotonic alignments (Wu, 2010). We then detail how we augment the GEO dataset and provide statistics that measure the complexity of the dataset.

Bi-text alignments
Given an input sequence of $N$ words $x = x_1, \ldots, x_N$ and a target sequence of $M$ words $y = y_1, \ldots, y_M$, a bi-text is defined as the tuple $(x, y)$. A bi-text word alignment is a set of bi-symbols $A$, where each bi-symbol $(x_i, y_j)$ couples a word $x_i$ in the input sequence at position $i$ to a word $y_j$ in the target sequence at position $j$.
If a word $x_i$ from the input sequence does not need an alignment to a word in the target, we introduce an $\varepsilon$ in $y$ at position $i$. This bi-symbol $(x_i, \varepsilon_i)$ amounts to a deletion, i.e. mapping from input to target involves deleting a word from the input. Conversely, if a word $y_j$ from the target does not require an alignment to a word in the input, we introduce an $\varepsilon$ in $x$ at position $j$. This bi-symbol $(\varepsilon_j, y_j)$ amounts to an insertion, i.e. mapping from input to target involves inserting an extra word in the target. We refer to the number of insertions and deletions in an alignment as the gap length. Figure 1 shows examples of alignments from the GEO-ALIGNED dataset.
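As a minimal sketch of these definitions, an alignment can be stored as a list of bi-symbols, with `None` standing in for ε. The type alias `BiSymbol`, the function `gap_length` and the example alignment below are illustrative assumptions and are not taken from the GEO-ALIGNED release.

```python
from typing import List, Optional, Tuple

# A bi-symbol couples an input word with a target word; None plays the role of ε.
BiSymbol = Tuple[Optional[str], Optional[str]]


def gap_length(alignment: List[BiSymbol]) -> int:
    """Count deletions (x_i, ε) and insertions (ε, y_j) in an alignment."""
    return sum(1 for x, y in alignment if x is None or y is None)


# Hypothetical alignment, written by hand for illustration only.
example: List[BiSymbol] = [
    ("what", "answer"),
    ("is", None),                  # deletion
    ("the", None),                 # deletion
    ("population", "population"),
    ("of", None),                  # deletion
    ("state", "state"),
    ("?", None),                   # deletion
]

print(gap_length(example))  # 4
```

Under this representation the gap length is simply the number of bi-symbols with an ε on either side.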

Monotonic and non-monotonic alignments
Monotonic alignments are bi-text alignments where $A$ contains bi-symbols of the forms $(x_i, y_j)$, $(x_i, \varepsilon_j)$ or $(\varepsilon_i, y_j)$ where $i = j$. In other words, a monotonic alignment does not involve any reordering of the words. Conversely, non-monotonic alignments also include bi-symbols of the form $(x_i, y_j)$ where $i \neq j$. Figure 1 shows an example of a monotonic alignment versus a non-monotonic one.
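One way to test this property in code is to look for crossing links, which is how Figure 1 depicts the distinction. The sketch below assumes that an alignment is given as index pairs `(i, j)` between actual words, with ε bi-symbols left out; the function name `is_monotonic` and both toy alignments are hypothetical.

```python
from typing import List, Tuple


def is_monotonic(links: List[Tuple[int, int]]) -> bool:
    """Return True if no two word-to-word links cross.

    links holds (input position, target position) pairs; deletions and
    insertions are omitted because they never cause reordering.
    """
    ordered = sorted(links)
    return all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))


# Toy MR with brackets removed: answer(0) largest(1) state(2) all(3)
# English word order: what(0) is(1) the(2) largest(3) state(4) ?(5)
english = [(0, 0), (3, 1), (4, 2)]
# Noun-before-adjective word order (illustrative): noun at 3, adjective at 5
noun_first = [(0, 0), (3, 2), (5, 1)]

print(is_monotonic(english))     # True
print(is_monotonic(noun_first))  # False
```

The crossing check agrees with the padded-position definition when links are one-to-one; many-to-one links would need a convention that the definition above does not spell out.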

Alignment annotation
The original GEO dataset contains 880 English questions about US geography, paired with a meaning representation. Several MR formalisms have been introduced for this dataset, including a first-order logic as in Zelle and Mooney (1996), a variable-free functional language introduced by Kate et al. (2005) and SQL (Popescu et al., 2003; Giordani and Moschitti, 2013; Iyer et al., 2017).
In GEO-ALIGNED, we use the variable-free functional language formalism. Similarly to Wang et al. (2021), we further simplify the MR by removing the brackets. This is done to avoid introducing numerous ε in the alignments, and also to better reveal the structural similarity between the NL and MR sequences. Similarly to Dong and Lapata (2016), we remove constants used to identify states, rivers, cities, places and countries by substituting them with their type.

Alignments were provided by four expert annotators. For each pair, the annotators were first asked to decide whether there was a monotonic or non-monotonic alignment. Secondly, annotators were asked to provide the actual alignment from NL to MR words. More specifically, two annotators aligned the entire dataset, while the other two each annotated fifty disjoint examples. Inter-annotator agreement was calculated by comparing the alignments provided. The first agreement metric is Cohen's Kappa statistic (Cohen, 1960), which measures the agreement on monotonic versus non-monotonic labels: the average score obtained is 0.803, which corresponds to substantial agreement. We then calculated the average percentage of exact matches between the alignments of the two main annotators and those of each of the other annotators, which resulted in a 90% average match. Disagreements were resolved by keeping the annotation that best matched the alignment strategy taken by the majority.
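For reference, Cohen's Kappa for two annotators can be computed from the observed agreement and the agreement expected by chance. The sketch below is a plain implementation of the standard formula; the function name `cohen_kappa` and the toy labels ("M" for monotonic, "N" for non-monotonic) are assumptions and do not reproduce the actual annotations.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels for six examples, invented for illustration.
annotator_1 = ["M", "M", "N", "M", "N", "N"]
annotator_2 = ["M", "M", "N", "N", "N", "N"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.667
```

Here the annotators agree on five of six items (p_o = 5/6) and chance agreement is 0.5, giving a Kappa of 2/3.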
Bi-text word alignments vary depending on the order in which the words appear both in the natural language and the meaning representation (Steedman, 2020). If we keep the MR fixed, a sentence in one language might be monotonically aligned, while the same sentence in another language might not be. To better understand the range of alignments between natural language utterances and meaning representations one should ideally consider multiple languages. With this objective in mind, we additionally annotated the German version (Jones et al., 2012) of GEO, and a new Italian version that we introduce, obtained by translations of the English sentences provided by an Italian native speaker.
The resulting dataset contains the NL and MR data pairs, augmented with
• a label indicating whether there is a monotonic alignment;
• the alignment that maps NL and MR words.
Table 1 reports annotation statistics for GEO-ALIGNED. In general, it can be observed that across all languages the majority of the alignments are monotonic and the average gap length is less than three. For non-monotonic alignments the average number of reordered words is below three.
With respect to differences between the three languages, Figure 2 shows a histogram of the gap lengths of monotonic alignments. As we can see, the distributions are quite similar, but slightly shifted towards longer gaps for German and Italian. In particular, there are significantly more alignments with no gap in English.

The proportion of monotonic alignments reflects the structural similarity between the variable-free MRs and the NL sequences. It is highest in the case of English, after which the MR formalism was modeled. German is syntactically more similar to English than Italian and as a result it can be more easily aligned with the MR sequences. An exemplary syntactic difference is adjective placement: in English and German adjectives come before nouns, whilst in Italian they are usually placed after. When a superlative is used in the NL sentence, the MR, being modeled after English, places it before the noun. This creates a monotonic alignment with English and German sentences and a non-monotonic one with Italian ones. For example, if the question is What is the largest state ? the corresponding MR will be answer(largest(state(all))). Because largest comes before state in both English and German as well as in the MR, the alignment will be monotonic. In Italian, largest comes after state and the alignment will require reordering.
Measuring Alignment Bias

Models and Experiments
The goal of our study is to compare the performance of neural seq2seq models over monotonic and non-monotonic alignments. Our hypothesis is that seq2seq models can implicitly learn monotonic alignments more easily than non-monotonic alignments. To evaluate this hypothesis we compared the performance of two seq2seq architectures on GEO-ALIGNED.

LSTM SEQ2SEQ: A standard seq2seq model based on a bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) and a unidirectional LSTM decoder that uses attention (Bahdanau et al., 2015). We then ablate the attention layer of the decoder to investigate its impact on performance for the different alignment types.

BART: A pre-trained seq2seq model based on a bidirectional encoder and a left-to-right decoder (Lewis et al., 2020). Since it was pre-trained on English corpora, we only used this model on the English version of the dataset.
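To make concrete how attention can act as a soft alignment, the sketch below computes a distribution over input positions for one decoding step. It is a simplification under stated assumptions: it scores with a plain dot product, whereas Bahdanau et al. (2015) use a learned additive scorer, and the vectors are invented toy values rather than states from the models used here. The function names `softmax` and `attention_weights` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(decoder_state: List[float],
                      encoder_states: List[List[float]]) -> List[float]:
    """Dot-product attention: one weight per input position, summing to 1."""
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    return softmax(scores)


# Toy encoder states for three input words and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 3.0]

weights = attention_weights(decoder_state, encoder_states)
# The position with the largest weight is the "softly aligned" input word.
print(weights.index(max(weights)))  # 1
```

Removing attention, as in the ablation, takes away this per-step distribution, so the decoder has to rely on its recurrent state alone.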
For our experiments we use exact-match accuracy as the evaluation metric, i.e. the percentage of exact matches between the predicted and ground-truth MRs. The alignment labels in GEO-ALIGNED allow us to break down the accuracy score for the two classes of alignments and observe whether the seq2seq framework has an implicit bias towards monotonic alignments. Further implementation and experimental setup details can be found in Appendix A.
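The breakdown described above can be sketched as follows. The record type `Example`, the helper names and the toy predictions are assumptions made for illustration; only the metric names Acc, MAcc and NMAcc follow the caption of Table 2.

```python
from typing import Dict, List, NamedTuple


class Example(NamedTuple):
    gold: List[str]        # ground-truth MR tokens
    predicted: List[str]   # MR tokens produced by the parser
    monotonic: bool        # alignment label from the dataset


def accuracy(examples: List[Example]) -> float:
    """Exact-match accuracy; NaN when the subset is empty."""
    if not examples:
        return float("nan")
    return sum(e.gold == e.predicted for e in examples) / len(examples)


def accuracy_breakdown(examples: List[Example]) -> Dict[str, float]:
    """Overall accuracy plus accuracy per alignment class."""
    mono = [e for e in examples if e.monotonic]
    non_mono = [e for e in examples if not e.monotonic]
    return {
        "Acc": accuracy(examples),
        "MAcc": accuracy(mono),
        "NMAcc": accuracy(non_mono),
    }


# Toy data, invented for illustration.
data = [
    Example(["answer", "state"], ["answer", "state"], True),
    Example(["answer", "river"], ["answer", "river"], True),
    Example(["answer", "city"], ["answer", "city"], False),
    Example(["answer", "largest", "state"], ["answer", "state"], False),
]
print(accuracy_breakdown(data))  # {'Acc': 0.75, 'MAcc': 1.0, 'NMAcc': 0.5}
```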

Results
Table 2 shows the performance for the different models and languages. As we can observe, accuracy for all models is significantly lower over non-monotonic alignments, and this is true for all languages. The difference in performance between monotonic and non-monotonic alignments is more pronounced for models with no attention, but it holds true for all of them.
The performance follows the same pattern across languages and models: accuracies are higher for monotonic sequences than for non-monotonic ones. For English and Italian the differences are quite similar: models with attention score 0.13 points higher for monotonic sequences; without attention the difference is 0.19 for English and 0.17 for Italian. German has a lower accuracy overall. One possible explanation (as shown in Figure 2) is that the monotonic gap distribution for the other two languages is slightly shifted towards shorter gaps; in particular, the sequences with no gap could help the models to implicitly induce better alignments. Moreover, for German the difference between monotonic and non-monotonic performance is starker: the model scored 0.19 and 0.23 better on monotonic examples with and without attention respectively. This might be due to the fact that more words are reordered on average for German than for the other two languages (see Table 1). Figure 3 shows accuracy for monotonic sequences binned by gap length. We observe that for all languages there is a negative correlation between accuracy and gap length.

We performed a qualitative analysis of the predictions by categorizing errors based on how many steps are needed to correct the mistake. Simpler errors are those where the correct MR can be recovered by inserting, deleting or changing at most two tokens, for example a prediction where the gold MR can be recovered by inserting river in the second position. More complex errors require correcting three or more tokens, and can also require reordering of the output. Table 3 reports statistics of our analysis. In general, we found that errors on monotonic examples are of the simpler category in much higher proportion than for non-monotonic ones: across languages, non-monotonic sequences require much more complex corrections involving three or more tokens as well as considerable reordering.
Another interesting finding is that, although BART and our LSTM-based seq2seq model achieve similar results in English (see Table 2), the LSTM-based model makes more complex mistakes, particularly in the monotonic case. For these examples, the vast majority of the errors for BART were one-token errors, and we found that most of these were minor mistakes such as predicting the token loc_2 instead of loc_1. The predictions of the LSTM-based model are more dissimilar to the gold MR.
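The number of correction steps can be approximated automatically with a token-level edit distance. The sketch below is an assumption-laden stand-in for the qualitative analysis: it uses Levenshtein distance, which counts insertions, deletions and substitutions but does not model reordering as a single step, and it takes "simple" to mean at most two edits. The function names and the toy MR tokens are hypothetical.

```python
from typing import List


def edit_distance(predicted: List[str], gold: List[str]) -> int:
    """Token-level Levenshtein distance between two MR sequences."""
    prev = list(range(len(gold) + 1))
    for i, p in enumerate(predicted, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # delete a predicted token
                            curr[j - 1] + 1,      # insert a gold token
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def categorize_error(predicted: List[str], gold: List[str]) -> str:
    """Label a prediction as correct, simple (1-2 edits) or complex (3+)."""
    steps = edit_distance(predicted, gold)
    if steps == 0:
        return "correct"
    return "simple" if steps <= 2 else "complex"


# Toy prediction that misses one token in the second position.
gold = ["answer", "river", "loc_2", "state"]
predicted = ["answer", "loc_2", "state"]
print(edit_distance(predicted, gold), categorize_error(predicted, gold))  # 1 simple
```

Because a reordered output shows up here only as several substitutions or insert-delete pairs, the counts should be read as a rough proxy rather than as the categories reported in Table 3.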

Related Work
Several grammar formalisms have been proposed for semantic parsing, including categorial grammars (Steedman, 1996, 2000; Zettlemoyer and Collins, 2005; Clark and Curran, 2003; Zettlemoyer and Collins, 2007; Kwiatkowski et al., 2010, 2011) and synchronous context-free grammars (Wong and Mooney, 2006). Both approaches model alignments explicitly and induce them from data. There have also been attempts to derive a more general formalism to unify the different grammar-based approaches to semantic parsing (Jones et al., 2011).
More recently, neural seq2seq models were proposed for semantic parsing by Dong and Lapata (2016), Jia and Liang (2016) and Iyer et al. (2017). The seq2seq approach aims to relax the reliance upon high-quality lexicons, i.e. domain-specific word alignments. Most seq2seq systems implement an attention mechanism such as those proposed by Bahdanau et al. (2015), Luong et al. (2015) and Xu et al. (2015), which can be seen as a strategy to learn soft alignments (Dong and Lapata, 2016).
Recently there has been an interest in testing the generalization abilities of neural semantic parsers, which resulted in the creation of several new benchmarks (Bastings et al., 2018; Lake and Baroni, 2018; Loula et al., 2018; Ruis et al., 2020; Keysers et al., 2020; Kim and Linzen, 2020), on which recent work has shown improved performance by introducing more alignment bias in the models either explicitly (Liu et al., 2021) or implicitly (Wang et al., 2021).

Conclusion
In this paper we introduced the GEO-ALIGNED dataset that offers an evaluation framework for testing the performance of semantic parsers over examples of varying alignment complexity. Our experiments have shown that seq2seq neural parsers perform significantly better over simpler monotonic alignments, suggesting that they have an implicit bias. We hope that GEO-ALIGNED can be used by other researchers to further test alignment biases.

Figure 1: Example alignments from the GEO-ALIGNED benchmark. Each bi-symbol is represented as a vertical line coupling words in the NL with words in the corresponding MR. The monotonic alignment (a) does not involve crossings of bi-symbols, while the non-monotonic alignment (b) involves considerable reordering.

Figure 2: Distribution of gap lengths for the monotonic alignments.

Figure 3: Accuracy for monotonic examples as a function of gap length.

Table 2: Summary of results for the different models and languages. LSTM is the seq2seq model based on a bidirectional LSTM encoder and an LSTM decoder with attention. LSTM-attn ablates the attention layer in the decoder. Acc reports the overall accuracy for each model; MAcc and NMAcc are the accuracy over sequences with monotonic and non-monotonic alignments respectively.