Enforcing Consistency in Weakly Supervised Semantic Parsing

The predominant challenge in weakly supervised semantic parsing is that of spurious programs that evaluate to correct answers for the wrong reasons. Prior work uses elaborate search strategies to mitigate the prevalence of spurious programs; however, they typically consider only one input at a time. In this work we explore the use of consistency between the output programs for related inputs to reduce the impact of spurious programs. We bias the program search (and thus the model’s training signal) towards programs that map the same phrase in related inputs to the same sub-parts in their respective programs. Additionally, we study the importance of designing logical formalisms that facilitate this kind of consistency-based training. We find that a more consistent formalism leads to improved model performance even without consistency-based training. When combined together, these two insights lead to a 10% absolute improvement over the best prior result on the Natural Language Visual Reasoning dataset.


Introduction
Semantic parsers map a natural language utterance into an executable meaning representation, called a logical form or program (Zelle and Mooney, 1996;Zettlemoyer and Collins, 2005). These programs can be executed against a context (e.g., database, image, etc.) to produce a denotation (e.g., answer) for the input utterance. Methods for training semantic parsers from only (utterance, denotation) supervision have been developed (Clarke et al., 2010;Liang et al., 2011;Berant et al., 2013); however, training from such weak supervision is challenging. The parser needs to search for the correct program from an exponentially large space, and the presence of spurious programs-incorrect repre- * Work done while interning with Allen Institute for AI. Figure 1: Utterance x and its program candidates z 1 -z 4 , all of which evaluate to the correct denotation (True). z 2 is the correct interpretation; other programs are spurious. Related utterance x shares the phrase yellow object above a black object with x. Our consistency reward would score z 2 the highest since it maps the shared phrase most similarly compared to z .
sentations that evaluate to the correct denotationgreatly hampers learning. Several strategies have been proposed to mitigate this issue (Guu et al., 2017;Liang et al., 2018;Dasigi et al., 2019). Typically these approaches consider a single input utterance at a time and explore ways to score programs.
In this work we encourage consistency between the output programs of related natural language utterances to mitigate the issue of spurious programs. Consider related utterances, There are two boxes with three yellow squares and There are three yellow squares, both containing the phrase three yellow squares. Ideally, the correct programs for the utterances should contain similar sub-parts that corresponds to the shared phrase. To incorporate this intuition during search, we propose a consistencybased reward to encourage programs for related utterances that share sub-parts corresponding to the shared phrases ( §3). By doing so, the model is provided with an additional training signal to distinguish between programs based on their consistency with programs predicted for related utterances.
We also show the importance of designing the logical language in a manner such that the groundtruth programs for related utterances are consistent with each other. Such consistency in the logical language would facilitate the consistency-based training proposed above, and encourage the semantic parser to learn generalizable correspondence between natural language and program tokens. In the previously proposed language for the Natural Language Visual Reasoning dataset (NLVR; Suhr et al., 2017), we notice that the use of macros leads to inconsistent interpretations of a phrase depending on its context. We propose changes to this language such that a phrase in different contexts can be interpreted by the same program parts ( §4).
We evaluate our proposed approaches on NLVR using the semantic parser of Dasigi et al. (2019) as our base parser. On just replacing the old logical language for our proposed language we see an 8% absolute improvement in consistency, the evaluation metric used for NLVR ( §5). Combining with our consistency-based training leads to further improvements; overall 10% over the best prior model, reporting a new state-of-the-art on the NLVR dataset.

Background
In this section we provide a background on the NLVR dataset (Suhr et al., 2017) and the semantic parser of Dasigi et al. (2019).
Natural Language Visual Reasoning (NLVR) dataset contains human-written natural language utterances, where each utterance is paired with 4 synthetically-generated images. Each (utterance, image) pair is annotated with a binary truth-value denotation denoting whether the utterance is true for the image or not. Each image is divided into three boxes, where each box contains 1-8 objects. Each object has four properties: position (x/y coordinates), color (black, blue, yellow), shape (triangle, square, circle), and size (small, medium, large). The dataset also provides a structured representation of each image which we use in this paper. Figure 1 shows an example from the dataset.
Weakly supervised iterative search parser We use the semantic parser of Dasigi et al. (2019) which is a grammar-constrained encoder-decoder with attention model from Krishnamurthy et al. (2017). It learns to map a natural language utterance x into a program z such that it evaluates to the correct denotation y = z r when executed against the structured image representation r. Dasigi et al. (2019) use a manually-designed, typed, variablefree, functional query language for NLVR, inspired by the GeoQuery language (Zelle and Mooney, 1996).
Given a dataset of triples (x i , c i , y i ), where x i is an utterance, c i is the set of images associated to it, and y i is the set of corresponding denotations, their approach iteratively alternates between two phases to train the parser: Maximum marginal likelihood (MML) and a Reward-based method (RBM). In MML, for an utterance x i , the model maximizes the marginal likelihood of programs in a given set of logical forms Z i , all of which evaluate to the correct denotation. The set Z i is constructed either by performing a heuristic search, or generated from a trained semantic parser.
The reward-based method maximizes the (approximate) expected value of a reward function R.
Here,p is the re-normalization of the probabilities assigned to the programs on the beam, and the reward function R = 1 if z i evaluates to the correct denotation for all images in c i , or 0 otherwise. Please refer Dasigi et al. (2019) for details.

Consistency reward for programs
Consider the utterance x = There is a yellow object above a black object in Figure 1. There are many program candidates decoded in search that evaluate to the correct denotation. Most of them are spurious, i.e., they do not represent the meaning of the utterance and only coincidentally evaluate to the correct output. The semantic parser is expected to distinguish between the correct program and spurious ones by identifying correspondence between parts of the utterance and the program candidates. Consider a related utterance x = There are 2 boxes with a yellow object above a black object. The parser should prefer programs for x and x which contain similar sub-parts corresponding to the shared phrase p = yellow object above a black object. That is, the parser should be consistent in its interpretation of a phrase in different contexts.
To incorporate this intuition during program search, we propose an additional reward to programs for an utterance that are consistent with programs for a related utterance.
Specifically, consider two related utterances x and x that share a phrase p. We compute a reward for a program candidate z of x based on how similarly it maps the phrase p as compared to a program candidate z of x . To compute this reward we need (a) relevant program parts in z and z that correspond to the phrase p, and (b) a consistency reward that measures consistency between those parts.
(a) Relevant program parts Let us first see how to identify relevant parts of a program z that correspond to a phrase p in the utterance.
Our semantic parser (from Krishnamurthy et al. (2017)) outputs a linearized version of the program z = [z 1 , . . . , z T ], decoding one action at a time from the logical language. At each time step, the parser predicts a normalized attention vector over the tokens of the utterance, denoted by We use these attention values as a relevance score between a program action and the utterance tokens. Given the phrase p with token span [m, n], we identify the relevant actions in z as the ones whose total attention score over the tokens in p exceeds a heuristically-chosen threshold τ = 0.6.
This set of program actions A(z, p) is considered to be generated due to the phrase p. For example, for utterance There is a yellow object above a black object, with program objExists(yellow(above(black(allObjs))), this approach could identify that for the phrase yellow object above a black object the actions corresponding to the functions yellow, above, and black are relevant.
(b) Consistency reward Now, we will define a reward for the program z based on how consistent its mapping of the phrase p is w.r.t. the program z of a related utterance. Given a related program z and its relevant action set A(z , p), we define the consistency reward S(z, z , p) as the F1 score for the action set A(z, p) when compared to A(z , p). If there are multiple shared phrases p i between x and x , we can compute a weighted average of different S(z, z , p i ) to compute a singular consistency reward S(z, z ) between the programs z and z . In this work, we only consider a single shared phrase p between the related utterances, hence S(z, z , p) = S(z, z , p) in our paper.
As we do not know the gold program for x , we decode top-K program candidates using beamsearch and discard the ones that do not evaluate to the correct denotation. We denote this set of programs by Z c . Now, to compute a consistency reward C(x, z, x ) for the program z of x,we take a weighted average of S(z, z ) for different z ∈ Z c where the weights correspond to the probability of the program z as predicted by the parser.
Consistency reward based parser Given x and a related utterance x , we use C(x, z, x ) as an additional reward in Eq. 1 to upweight programs for x that are consistent with programs for x .
This consistency-based reward pushes the parser's probability mass towards programs that have consistent interpretations across related utterances, thus providing an additional training signal over simple denotation accuracy. The formulation presented in this paper assumes that there is a single related utterance x for the utterance x. If multiple related utterances are considered, the consistency reward C(x, z, x j ) for different related utterances x j can be summed/averaged to compute a single consistency reward C(x, z) the program z of utterance x based on all the related utterances.

Consistency in Language
The consistency reward ( §3) makes a key assumption about the logical language in which the utterances are parsed: that the gold programs for utterances sharing a natural language phrase actually correspond to each other. For example, that the phrase yellow object above a black object would always get mapped to yellow(above(black)) irrespective of the utterance it occurs in. On analyzing the logical language of Dasigi et al. (2019), we find that this assumption does not hold true. Let us look at the following examples: x 1 : There are items of at least two different colors z 1 : objColorCountGrtEq(2, allObjs) x 2 : There is a box with items of at least two different colors z 2 : boxExists( memberColorCountGrtEq(2, allBoxes)) Here the phrase items of at least two different colors is interpreted differently in the two utterances. In x 2 , a macro function memberColorCountGrtEq is used, which internally calls objColorCountGrtEq for each box in the image. Now consider, x 3 : There is a tower with exactly one block z 3 : boxExists(memberObjCountEq(1,allBoxes)) x 4 : There is a tower with a black item on the top z 4 : objExists(black(top(allObjs))) Here the phrase There is a tower is interpreted differently: z 3 uses a macro for filtering boxes based on their object count and interprets the phrase using boxExists. In the absence of a complex macro for checking black item on the top, z 4 resorts to using objExists making the interpretation of the phrase inconsistent. These examples highlight that these macros, while they shorten the search for programs, make the language inconsistent.
We make the following changes in the logical language to make it more consistent. Recall from §2 that each NLVR image contains 3 boxes each of which contains 1-8 objects. We remove macro functions like memberColorCountGrtEq, and introduce a generic boxFilter function. This function takes two arguments, a set of boxes and a filtering function f: Set[Obj] → bool, and prunes the input set of boxes to the ones whose objects satisfies the filter f. By doing so, our language is able to reuse the same object filtering functions across different utterances. In this new language, the gold program for the utterance x 2 would be z 2 : boxCountEq(1, boxFilter(allBoxes, objColorCountGrtEq(2))) By doing so, our logical language can now consistently interpret the phrase items of at least two different colors using the object filtering function f: objColorCountGrtEq(2) across both x 1 and x 2 . Similarly, the gold program for x 4 in the new logical language would be z 4 : boxExists(boxFilter(allBoxes, black(top))) making the interpretation of There is a box consistent with x 3 . Please refer appendix §A for details.

Experiments
Dataset We report results on the standard development, public-test, and hidden-test splits of NLVR. The training data contains 12.4k (utterance, image) pairs where each of 3163 utterances are paired with 4 images. Each evaluation set roughly contains 270 unique utterances.
Evaluation Metrics (1) Accuracy measures the proportion of examples for which the correct denotation is predicted. (2) Since each utterance in NLVR is paired with 4 images, a consistency metric is used, which measures the proportion of utterances for which the correct denotation is predicted for all associated images. Improvement in this metric is indicative of correct program prediction as it is unlikely for a spurious program to correctly make predictions on multiple images.
Experimental details We use the same parser, training methodology, and hyper-parameters as Dasigi et al. (2019). For discovering related utterances, we manually identify ∼10 sets of equivalent phrases that are common in NLVR. For example, there are NUM boxes, COLOR1 block on a COLOR2 block, etc. For each utterance that contains a particular phrase, we pair it with one other randomly chosen utterance that shares the phrase. We make 1579 utterance pairs in total. Refer appendix §B for details about data creation. 1 Baselines We compare against the state-of-theart models; ABS. SUP. (Goldman et al., 2018) that uses abstract examples, ABS. SUP. + RERANK that uses additional data and reranking, and the iterative search parser of Dasigi et al. (2019).
Results Table 1 compares the performance of our two proposed methods to enforce consistency in the decoded programs with the previous approaches. We see that changing the logical language to a more consistent one ( §4) significantly improves performance: the accuracy improves by 2-4% and consistency by 4-8% on the dev. and public-test sets. Additionally, training the parser using our proposed consistency reward ( §3) further improves performance: accuracy improves by 0.3-0.4% but the consistency significantly improves by 1.4-2.3%. 2 On the hidden-test set of NLVR, our final model improves accuracy by 7% and consistency by 10% compared to previous approaches. Larger improvements in consistency across evaluation sets indicates that our approach to enforce consistency between programs of related utterances greatly reduces the impact of spurious programs.

Conclusion
We proposed two approaches to mitigate the issue of spurious programs in weakly supervised semantic parsing by enforcing consistency between output programs. First, a consistency based reward that biases the program search towards programs that map the same phrase in related utterances to similar sub-parts. Such a reward provides an additional training signal to the model by leveraging related utterances. Second, we demonstrate the importance of logical language design such that it facilitates such consistency-based training. The two approaches combined together lead to significant improvements in the resulting semantic parser.

A Logical language details
In Figure 2, we show an example utterance with its gold program according to our proposed logical language. We use function composition and function currying to maintain the variable-free nature of our language. For example, action z 7 uses function composition to create a function from Set Actions z 8 -z 10 use function currying to curry the 2-argument function objectCountGtEq by giving it one int=2 argument. This results in a 1-argument function objectCountGtEq(2) from Set[Object] → bool.

B Dataset details
To discover related utterance pairs within the NLVR dataset, we manually identify 11 sets of phrases that commonly occur in NLVR and can be interpreted in the same manner: In each phrase, we replace the abstract COLOR, NUMBER, SHAPE token with all possible options from the NLVR dataset to create grounded phrases. For example, black block at the top, yellow object above a blue object. For each set of equivalent grounded phrases, we identify the set of utterances that contains any of the phrase. For each utterance in that set, we pair it with 1 randomly chosen utterance from that set. Overall, we identify related utterances for 1420 utterances (out of 3163) and make 1579 pairings in total; if an utterance contains two phrases of interest, it can be paired with more than 1 utterance.