Semantic-aware Contrastive Learning for More Accurate Semantic Parsing

Meaning representations are detailed and precise annotations that express fine-grained, sequence-level semantics, which makes it hard to train discriminative semantic parsers via Maximum Likelihood Estimation (MLE) in an autoregressive fashion. In this paper, we propose a semantic-aware contrastive learning algorithm, which learns to distinguish fine-grained meaning representations and takes the overall sequence-level semantics into consideration. Specifically, a multi-level online sampling algorithm is proposed to sample confusing and diverse instances; three semantic-aware similarity functions are designed to accurately measure the distance between meaning representations as a whole; and a ranked contrastive loss is proposed to pull the representations of semantically identical instances together and push negative instances away. Experiments on two standard datasets show that our approach achieves significant improvements over MLE baselines and reaches state-of-the-art performance by simply applying semantic-aware contrastive learning to a vanilla Seq2Seq model.

Unfortunately, a meaning representation is a formal and detailed annotation; that is, it should be viewed as a sequence-level whole possessing fine-grained semantics. Such features make it hard to train accurate semantic parsers via MLE, which computes the loss token by token and is insensitive to small perturbations. For example, in Figure 1(b) even a single token error (from ≥ to <) reverses the semantics of a meaning representation: Property ( λ s (s num_turnovers ≥ 3), player) becomes Property ( λ s (s num_turnovers < 3), player). In the PP-attachment case, the MRs in Figure 1(a) have very similar token sequences but very different semantics. We analyzed the error cases of a classical SEQ2SEQ parser on the OVERNIGHT dataset 1 and found that 42.7% of the erroneous parses have an edit distance of only 1 to the correct MRs, and 74.2% have an edit distance of at most 3. That is, most errors are due to the lack of ability to distinguish fine-grained semantics. Therefore it is crucial to develop learning algorithms that take the sequence-level semantics and the fine granularity of meaning representations into consideration.
In this paper, we propose Semantic-aware Contrastive Learning (SemCL), which learns semantic-aware, fine-grained meaning representations for accurate semantic parsing. To resolve the fine-granularity challenge, we sample negative instances at different divergence levels, and a multi-level online sampling algorithm is proposed to collect confusing and diverse instances. To resolve the sequence-level semantics challenge, we compare meaning representations as a whole rather than token by token, and three semantic-aware similarity functions are designed to accurately measure the distance between utterances and meaning representations. Finally, we propose a ranked contrastive loss, which pulls the representations of semantically identical instances together and pushes negative instances away (even if they look very similar to the positive ones). In this way, semantic parsers can learn to distinguish fine-grained semantics and take the overall semantics into consideration.
In summary, the main contributions of this paper are:
• We propose a semantic-aware contrastive learning algorithm, which can effectively model the fine-grained and sequence-level semantics in semantic parsing. To the best of our knowledge, this is the first attempt to adopt contrastive learning for semantic parsing.
• We design an effective contrastive learning algorithm, which consists of a multi-level online sampling algorithm, three semantic-aware similarity functions, and a ranked contrastive loss. This framework can also benefit other tasks that depend on the ability to distinguish fine-grained or sequence-level semantics.
• Experiments on two standard datasets show that our approach achieves significant improvements over MLE baselines and achieves state-of-the-art performance.

Base SEQ2SEQ Parser
This paper uses the classical SEQ2SEQ semantic parser as our base model due to its simplicity and effectiveness (Dong and Lapata, 2016).
Attention-based Decoder. Given the sentence representation, the tokens of the logical form are generated sequentially. Specifically, the decoder is first initialized with the hidden states of the encoder. Then at each step t, let φ(y_{t−1}) be the vector of the previously predicted token; the current hidden state s_t is obtained from φ(y_{t−1}) and s_{t−1}. We calculate the attention weights and the attended source context representation for the current step t:

$$a_{t,i} = \mathrm{softmax}_i\big(s_t^{\top} W_a h_i\big) \qquad (1)$$

$$c_t = \sum_{i=1}^{|x|} a_{t,i}\, h_i \qquad (2)$$

and the next token is generated from the vocabulary distribution:

$$p(y_t \mid y_{<t}, x) = \mathrm{softmax}\big(W_o\,[s_t; c_t] + b_o\big) \qquad (3)$$

MLE Learning. The SEQ2SEQ model is trained by maximizing the likelihood of the tokens in an autoregressive fashion:

$$\mathcal{L}_{\mathrm{MLE}} = -\sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x) \qquad (4)$$

where D is the corpus, x is the sentence, and y is its logical form label. However, such a token-by-token autoregressive training paradigm is insensitive to the overall semantics of the structured MR, making it hard to train effective and discriminative semantic parsers. We propose semantic-aware contrastive learning to help semantic parsers perceive the divergence of fine-grained semantics, which is overlooked in existing autoregressive training approaches.
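For illustration, a minimal PyTorch sketch of one decoding step and the MLE objective above might look as follows (this is a sketch, not the authors' implementation; the dot-product score and layer sizes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    """One step of a Luong-style attention decoder (illustrative sketch)."""

    def __init__(self, vocab_size: int, emb_dim: int = 128, hid_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # phi(y_{t-1})
        self.cell = nn.LSTMCell(emb_dim, hid_dim)         # s_t from phi(y_{t-1}) and s_{t-1}
        self.out = nn.Linear(2 * hid_dim, vocab_size)      # softmax over [s_t; c_t]

    def forward(self, y_prev, state, enc_states):
        # y_prev: (B,) previous token ids; enc_states: (B, |x|, hid_dim) encoder states
        s_t, cell_t = self.cell(self.embed(y_prev), state)
        scores = torch.bmm(enc_states, s_t.unsqueeze(2)).squeeze(2)  # Eq. (1), dot-product score
        a_t = F.softmax(scores, dim=-1)
        c_t = torch.bmm(a_t.unsqueeze(1), enc_states).squeeze(1)     # Eq. (2), context vector
        logits = self.out(torch.cat([s_t, c_t], dim=-1))              # Eq. (3), vocab distribution
        return logits, (s_t, cell_t), c_t

def mle_loss(step_logits, gold_tokens):
    # Eq. (4): token-by-token negative log-likelihood of the gold logical form.
    return F.cross_entropy(step_logits.reshape(-1, step_logits.size(-1)),
                           gold_tokens.reshape(-1))
```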
3 Semantic-aware Contrastive Learning for Semantic Parsing
In this section, we describe how to address the sequence-level semantics and fine-granularity challenges via semantic-aware contrastive learning. Contrastive learning aims to disperse semantically distinct instances apart and pull semantically identical instances closer in the vector representation space. Specifically, to learn to differentiate fine-grained meaning representations, we design a multi-level online sampling algorithm, which collects confusing negative samples and diverse positive samples in a multi-level way. To compare meaning representations as a whole at the sequence level, we design three semantic-aware compatibility functions. To learn accurate and discriminative semantic parsers, we propose a ranked contrastive loss that supports the multi-level samples.
In the following, we describe each component in detail.

Multi-level Online Instance Sampling
Contrastive learning algorithms learn good parameters by pulling positive instances closer and pushing negative instances away. Positive and negative instances play a fundamental role in contrastive learning (Karpukhin et al., 2020; Gao et al., 2021), and many studies focus on how to construct good positive and negative instances.
In semantic parsing, it is challenging to sample good contrastive instances. Firstly, because meaning representations are formal and diverse, it is hard to tell how the semantics change after a small perturbation. Secondly, to distinguish fine-grained semantic representations, contrastive learning needs accurate negative/positive samples. However, many instances are vague and cannot be accurately categorized into a positive-negative partition. For example, paraphrasing, one common way to build positive instances, may change the original fine-grained semantics; and two very different MRs may represent the same meaning and therefore cannot be treated as negative samples. To resolve the above challenges, we propose a multi-level partition algorithm to handle vague instances, and sample instances via an online algorithm.

Multi-Level Sample Partition
In contrastive learning, each instance is an utterance-MR pair ⟨x, y⟩. To address the vagueness of instances, we divide samples into different levels according to their confidence, and assign each instance a Rank value. Specifically, Rank = 0 indicates true positive instances, Rank = 2 indicates true negative instances, and Rank = 1 indicates vague instances that may be correct. In the following we describe how to divide instances into these levels and leave the sampling algorithm to the next section.
Rank = 0: This level contains true positive samples. We use the gold annotations in the training corpus as positive instances. Two common types of aliases are also used as positive samples, which are shown in Table 1. Given an MR, the utterances labeled with it and its aliases are used as positive samples.
Rank = 1: This level contains vague samples that we cannot clearly identify as positive or negative. There are two types of vague instances. One is the utterance-paraphrased version of an instance, i.e., we paraphrase the annotated pair ⟨x, y⟩ and obtain ⟨x′, y⟩; because paraphrasing may change the original semantics, we treat this instance as vague. The other is the MR-alias version, i.e., for a positive instance ⟨x, y⟩ we view ⟨x, y′⟩ and ⟨x′, y⟩ as vague instances if the annotated instance ⟨x′, y′⟩ has the same execution result as y; because the same execution result means they are potential aliases and may entail the same semantics, we use them as vague instances.
Because these samples are vague, directly adding them as positives would mislead the model, but ignoring them may reduce the diversity of positive instances and thus hurt the model's generalization ability. Therefore we treat them as vague instances.
Rank = 2: This level contains true negative instances. For an utterance, the negative MRs are those with wrong execution results. For an MR, the negative utterances are those labeled with MRs producing wrong execution results.
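To make the partition concrete, the following sketch assigns a Rank to a candidate pair using execution results; the helper names (`execute`, `is_alias`) and the interface are hypothetical, not taken from the paper:

```python
from typing import Callable

def assign_rank(x: str, y_gold: str, y_cand: str,
                execute: Callable[[str], object],
                is_alias: Callable[[str, str], bool],
                is_paraphrase_pair: bool = False) -> int:
    """Assign a Rank to the candidate pair <x, y_cand> w.r.t. the gold pair <x, y_gold>.

    Rank 0: true positive  -- the gold MR or a known alias of it.
    Rank 1: vague          -- paraphrased utterances, or MRs that merely share the
                              execution result with the gold MR (potential aliases).
    Rank 2: true negative  -- MRs whose execution result is wrong.
    """
    if y_cand == y_gold or is_alias(y_cand, y_gold):
        return 0
    if is_paraphrase_pair:
        return 1                      # paraphrasing may drift semantically
    if execute(y_cand) == execute(y_gold):
        return 1                      # same denotation: potential alias, keep as vague
    return 2
```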

Instance Sampling
There are two common sampling algorithms for contrastive learning: 1) Batch sampling: positive and negative sample pairs are collected from the same batch. As shown in SimCLR (Chen et al., 2020), this algorithm is simple and efficient.
2) Online sampling: given an annotated ⟨x, y⟩ pair, we sample its positive, vague, and negative instances during parsing. Given an input utterance, we use the top-K parses as candidates, and then divide them into Rank 0, 1, and 2 according to the methods described above.
Because online sampling can collect hard negative samples (i.e., the top-K ranked instances), this paper uses it for better distinguishing confusing, fine-grained meaning representations.
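A rough sketch of this online sampling loop is shown below; `beam_search` is a hypothetical helper, and `assign_rank` refers to the earlier sketch, so this is illustrative rather than the authors' implementation:

```python
def sample_contrastive_instances(model, x, y_gold, K=10, execute=None, is_alias=None):
    """Collect multi-level contrastive instances for one annotated pair <x, y_gold>."""
    # The top-K parses of the *current* model serve as hard, confusing candidates.
    candidates = beam_search(model, x, beam_size=K)            # hypothetical helper
    ranked = {0: [y_gold], 1: [], 2: []}
    for y_cand in candidates:
        ranked[assign_rank(x, y_gold, y_cand, execute, is_alias)].append(y_cand)
    return ranked
```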

Ranked Contrastive Loss
Traditional contrastive learning only considers positive and negative samples, and its instances are usually not fine-grained. In our semantic-aware contrastive learning, we need to deal with multi-level instances and use special semantic-aware similarities.
To this end, we propose a ranked contrastive loss, which aims to learn accurate and robust representations from the multi-level sample instances. Concretely, given sample instances at several levels, the ranked contrastive loss compares both utterances and meaning representations:

$$\mathcal{L}_{cl}^{x} = -\sum_{r} \log \frac{\sum_{\mathrm{Rank}(x') = r} e^{\phi_\theta(x', y)/\tau}}{\sum_{\mathrm{Rank}(x') \ge r} e^{\phi_\theta(x', y)/\tau}} \qquad (5)$$

$$\mathcal{L}_{cl}^{y} = -\sum_{r} \log \frac{\sum_{\mathrm{Rank}(y') = r} e^{\phi_\theta(x, y')/\tau}}{\sum_{\mathrm{Rank}(y') \ge r} e^{\phi_\theta(x, y')/\tau}} \qquad (6)$$

in which τ denotes a temperature parameter.
When there are only positive and negative samples as two ranks, the ranked contrastive loss gracefully degrades into the ordinary InfoNCE loss (van den Oord et al., 2018; Carse et al., 2021). With a minibatch B of size k, consisting of one positive pair (x, y) and k − 1 negative pairs (x_i, y), the InfoNCE loss is defined as

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{e^{\phi_\theta(x, y)/\tau}}{e^{\phi_\theta(x, y)/\tau} + \sum_{i=1}^{k-1} e^{\phi_\theta(x_i, y)/\tau}} \qquad (7)$$

which has also been proved to be a lower bound on the mutual information between x and y.
The final training objective is to minimize the decoding loss together with the contrastive losses:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \alpha\, \mathcal{L}_{cl}^{x} + \beta\, \mathcal{L}_{cl}^{y}$$

where α and β are hyper-parameters that weight the contrastive learning terms. In this way, the model can reduce the influence of noise in the sample instances and robustly improve generalization through the augmentation of the instances.
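The following PyTorch sketch shows how a ranked contrastive loss over multi-level samples can be computed for one anchor, consistent with Eqs. (5) and (6) as written above; it assumes the compatibility scores are precomputed and is not the authors' exact code:

```python
import torch

def ranked_contrastive_loss(scores: torch.Tensor, ranks: torch.Tensor,
                            tau: float = 0.1) -> torch.Tensor:
    """Ranked contrastive loss for one anchor.

    scores: (N,) compatibility scores phi_theta(x_i, y) of N candidate instances.
    ranks:  (N,) integer Rank of each candidate (0 = positive, 1 = vague, 2 = negative).
    For each rank r, instances of rank r are contrasted against all instances with
    rank >= r, so better-ranked instances are pushed above worse-ranked ones.
    """
    logits = scores / tau
    loss = scores.new_zeros(())
    for r in ranks.unique(sorted=True)[:-1]:   # the worst rank contributes zero loss
        log_num = torch.logsumexp(logits[ranks == r], dim=0)
        log_den = torch.logsumexp(logits[ranks >= r], dim=0)
        loss = loss - (log_num - log_den)
    return loss
```

For example, with `scores = torch.tensor([2.1, 1.8, 0.3])` and `ranks = torch.tensor([0, 1, 2])`, the loss encourages the Rank-0 score to dominate all candidates and the Rank-1 score to dominate the Rank-2 one.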

Semantic-aware Compatibility Function
In contrastive learning, it is critical to measure the similarities between utterances and meaning representations, so that positive ⟨x, y⟩ instances have high similarity and negative instances have low similarity. As described above, a semantic parsing system needs to measure this similarity by taking semantic representations as a whole. Specifically, we design three compatibility functions based on sequence representations, attention-based representations, and MR-conditioned representations.
Compatibility Function on Sequence Representations This similarity measure treats both the utterance and the meaning representation as token sequences. We project the representations of utterances and MRs onto a latent embedding space and compute the similarity between their mean-pooled vectors:

$$\phi_\theta(x, y) = \cos\Big(W_u \cdot \frac{1}{|x|}\sum_{i=1}^{|x|} h_i,\;\; W_m \cdot \frac{1}{|y|}\sum_{j=1}^{|y|} g_j\Big)$$

in which h_x = h_1, ..., h_{|x|} is the encoded contextual embedding of the utterance x from the SEQ2SEQ encoder, and g_y = g_1, g_2, ..., g_{|y|} is the encoded representation of y. An additional LSTM encoder is employed to represent MRs, which shares its token embeddings with the decoder.
Compatibility Function with Attention Because different tokens may have different importance, we extend the above mean pooling with an attention mechanism as a soft selection to compute token-specific sentence representations.
Then the compatibility function is computed on the attention-weighted representations:

$$\phi_\theta(x, y) = \cos\big(W_u \tilde{h}_x,\; W_m \tilde{g}_y\big), \quad \tilde{h}_x = \sum_{i} a_i h_i, \;\; \tilde{g}_y = \sum_{j} b_j g_j$$

where a_i and b_j are attention weights over the utterance and MR tokens, respectively.

MR-Conditioned Compatibility Function Semantic parsing is a SEQ2SEQ generation process, and the decoder decides which utterance tokens are used to decode an MR token y_t. Therefore we take this conditional association into consideration when measuring the similarity between x and y. In SEQ2SEQ decoding, c_t is the attended source context representation at decoding step t, as in Equ. 2. Then the compatibility function is

$$\phi_\theta(x, y) = \frac{1}{|y|}\sum_{t=1}^{|y|} \cos\big(W_c\, c_t,\; W_m\, g_t\big)$$

where c_t captures the parts of the utterance representation used in decoding.
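To make the three compatibility functions concrete, here is a minimal PyTorch sketch; the mean/attention pooling, cosine similarity, and projection layers are assumptions for illustration and do not necessarily match the paper's exact parameterization:

```python
import torch
import torch.nn.functional as F

def seq_compat(h_x, g_y, W_u, W_m):
    """Sequence-representation compatibility: cosine of projected mean-pooled vectors."""
    u = W_u(h_x.mean(dim=0))                         # h_x: (|x|, H) encoder states
    m = W_m(g_y.mean(dim=0))                         # g_y: (|y|, H) MR-encoder states
    return F.cosine_similarity(u, m, dim=0)

def attn_compat(h_x, g_y, W_u, W_m):
    """Attention compatibility: soft token selection instead of mean pooling."""
    a = torch.softmax(h_x @ g_y.mean(dim=0), dim=0)  # attend utterance tokens to the MR
    b = torch.softmax(g_y @ h_x.mean(dim=0), dim=0)  # attend MR tokens to the utterance
    return F.cosine_similarity(W_u(a @ h_x), W_m(b @ g_y), dim=0)

def mr_conditioned_compat(c, g_y, W_c, W_m):
    """MR-conditioned compatibility: per-step similarity between the decoder's
    attended source contexts c_t (Equ. 2) and the MR token representations."""
    sims = F.cosine_similarity(W_c(c), W_m(g_y), dim=-1)   # c: (|y|, H) context vectors
    return sims.mean()
```

Here `W_u`, `W_m`, and `W_c` are projection layers (e.g., `torch.nn.Linear`) passed in by the caller.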

Experimental Settings
Datasets We conduct experiments on OVERNIGHT and GEOGRANNO, which cover various domains. Our implementations are publicly available 2 .
OVERNIGHT This is a multi-domain dataset which contains natural language queries paired with lambda DCS logical forms. The OVERNIGHT benchmark consists of eight semantic parsing datasets covering a range of semantic phenomena, and it requires precise semantic learning ability to map natural language queries to structured logical forms. We use the same train/test splits as Wang et al. (2015), and a held-out development set is used to choose the best model during training.
GEOGRANNO This is a version of GEO (Zelle and Mooney, 1996; Herzig and Berant, 2019) labeled with lambda DCS logical forms. The dataset is constructed by paraphrase detection: crowd workers are employed to select the correct canonical utterance from a candidate list. Strong generalization ability is required to handle the 278 test queries given a small training set of only 487 instances. We follow the same splits as the original paper (Herzig and Berant, 2019).
In all our experiments, standard accuracy is used to evaluate systems. Accuracies on all datasets are calculated in the same way as Jia and Liang (2016) and Herzig and Berant (2019).
Data Preprocessing Following Dong and Lapata (2016), we handle entities with a replacing mechanism, which replaces identified entities with their types and IDs. The entity mapping lexicons of Cao et al. (2019) are also used. The paraphrasing model is a trained paraphraser based on T5 3 , and we generate 20 different paraphrases for each utterance.
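A minimal illustration of this entity-replacing preprocessing is given below; the lexicon format and placeholder scheme are hypothetical simplifications (the actual lexicons follow Cao et al., 2019):

```python
from typing import Dict, List, Tuple

def replace_entities(tokens: List[str], lexicon: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Replace identified entities with type-ID placeholders, e.g. 'kobe bryant' -> 'player0'."""
    counts: Dict[str, int] = {}
    mapping: Dict[str, str] = {}     # placeholder -> original surface form
    out: List[str] = []
    i = 0
    while i < len(tokens):
        # greedily try the longest entity span starting at position i
        for j in range(len(tokens), i, -1):
            span = " ".join(tokens[i:j])
            if span in lexicon:
                etype = lexicon[span]
                placeholder = f"{etype}{counts.get(etype, 0)}"
                counts[etype] = counts.get(etype, 0) + 1
                mapping[placeholder] = span
                out.append(placeholder)
                i = j
                break
        else:
            out.append(tokens[i])
            i += 1
    return out, mapping
```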

Overall Results
The overall results of the baselines and of different settings of our method are shown in Table 2 and Table 3; SR, Att, and Cond indicate the three compatibility functions above. We can see that: 1. Semantic-aware contrastive learning is effective, achieving state-of-the-art performance with a simple base model. On both OVERNIGHT and GEOGRANNO we achieve state-of-the-art average performance (81.1% and 73.4%, respectively). The results demonstrate the superiority of our contrastive learning algorithm.

2. By taking the fine granularity and sequence-level semantics into consideration, semantic-aware contrastive learning significantly outperforms the MLE algorithm. Compared with its MLE counterpart, SEQ2SEQ, contrastive learning yields accuracy improvements of 5.4 and 1.8 points on OVERNIGHT and GEOGRANNO, respectively. This verifies that, compared with MLE, our semantic-aware contrastive learning can learn more accurate semantic parsers.
3. Semantic-aware similarity is critical for accurate semantic parsing. All three compatibility functions show advantages over the MLE baselines, and more accurate similarity measures lead to better performance; for example, the MR-conditioned compatibility function is more stable across domains and datasets. In general, the improvement from semantic-aware contrastive learning is significant regardless of which of the three functions is used.

Detailed Analysis
Effect of multi-level sampling To analyze the effect of the multi-level sample partition, we conduct experiments with a plain positive-negative partition. The results are shown in Table 4. When there are only positive and negative samples, we try three ways of dealing with the Rank = 1 part: ignoring it, or treating it as positive or as negative samples. We can see that treating it as negative is inadvisable and brings a significant performance drop. We think this is because the Rank = 1 part contains many positive examples, which should be gathered in the representation space, and viewing them as negatives misleads the model. Treating the Rank = 1 part as positive also brings a slight performance drop, which may be due to noisy samples.
The results also show that our approach is better than ignoring the Rank = 1 part. We believe that ignoring these instances leads to insufficient generalization ability. In general, it is problematic to force vague instances into a positive-negative partition, and our multi-level treatment facilitates learning from vague samples.

Effect of contrastive losses
To investigate the effect of the contrastive losses, we compare settings with contrastive learning only on the utterance side or only on the MR side. The results in Table 4 show that using contrastive losses on both sides performs best in all domains. We believe that by jointly optimizing the contrastive losses on both sides, the model learns better sentence-side and semantic-side representations, which benefits the model's awareness of fine-grained semantics.

Effect of instances sampling
We conduct another experiment by changing the sampling method; the results are shown in Table 4. Batch Sampling denotes that the sample instances are collected from the same batch; Random Sampling denotes that for each utterance/MR we randomly select 100 MRs/utterances to form contrastive samples.
We can see that instance-level sampling is important: Random Sampling at the instance level is slightly better than Batch Sampling. The performance of Online Sampling (FULL MODEL) is significantly higher than that of the other methods. This verifies that differentiating hard negative samples, which are confusing for the current model, makes the model learn fine-grained semantics more accurately.
Visualization of the representation space of utterances We use t-SNE (van der Maaten and Hinton, 2008) to visualize the utterance representations on a 2D map. The utterance representations are calculated by averaging the hidden states over words. The encoders of the SEQ2SEQ model and of our contrastive learning model (with the sequence-representation compatibility function) are used to obtain the utterance representations. In Figure 3 we also show the paraphrases of each utterance with transparent markers. Although there are some noisy samples among the paraphrases, compared with using no paraphrases the generalization ability of the model is improved by our multi-level sampling mechanism.
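A short sketch of how such a visualization can be produced with scikit-learn follows; the variable names and plotting choices are assumptions for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_utterance_space(hidden_states, labels, out_path="tsne.png"):
    """hidden_states: list of (len_i, H) arrays of encoder states; labels: numeric MR ids."""
    # Mean-pool the word-level hidden states to get one vector per utterance.
    reps = np.stack([h.mean(axis=0) for h in hidden_states])
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(reps)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
    plt.savefig(out_path, dpi=200)
```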
Case study In Table 5 we compare the parsed results of our model with those of the SEQ2SEQ baseline. In the PUBLICATION domain, the utterance is mistaken by the baseline for the similar semantics "articles published in 2004 or 2010". Our model distinguishes the fine-grained difference and generates the correct MR, even though the expression about "recent" is rare in the training set; we think both the generalization and the discrimination ability of the model are improved. In the BASKETBALL domain, the baseline over-translates the word "points", but SemCL perceives the whole semantics and produces the right MR, which shows the effectiveness of treating both the utterance and the MR as a whole.

Related Work
Contrastive Learning. Contrastive learning is a representation learning method (Hadsell et al., 2006) that pulls relevant embeddings together and pushes different ones apart to provide more effective representations (van den Oord et al., 2018; Chen et al., 2020). In NLP, contrastive learning is widely used for sentence representation learning (Qu et al., 2021; Gao et al., 2021; Kim et al., 2021; Giorgi et al., 2021), and it also improves many natural language understanding tasks (Chen et al., 2021; Wang et al., 2021a,b; Qin et al., 2021). To the best of our knowledge, our work is the first attempt to adopt contrastive learning for semantic parsing.
Beyond MLE, structured learning methods have been employed to maximize margins or minimize expected risks (Yih et al., 2015; Yin et al., 2018; Xiao et al., 2016b; Iyyer et al., 2017). Dual learning methods have been proposed, which also design and optimize sequence-level rewards (Cao et al., 2019, 2020). Globally normalized models have been proposed to relieve the label bias problem of MLE (Huang et al., 2021). Different from previous work, our approach aims to acquire more discriminative sequence representations via contrastive learning.
Semantic Generalization. Recently, the generalization problem has become a research hotspot in semantic parsing. To generalize over various natural language expressions, semantic-invariance knowledge is introduced by paraphrasing (Berant and Liang, 2014; Wang et al., 2015; Dong et al., 2017; Herzig and Berant, 2019). There are also many studies on achieving generalization in meaning composition (Oren et al., 2020; Liu et al., 2020; Conklin et al., 2021; Bogin et al., 2021; Herzig et al., 2021). To generalize in low-resource settings, data augmentation (Jia and Liang, 2016; Marzoev et al., 2020) and constrained decoding (Wu et al., 2021; Shin et al., 2021) methods have been proposed. In this paper, the generalization of representations is improved by pulling utterances and MRs with similar overall semantics closer in the representation space.

Conclusions
This paper proposes Semantic-aware Contrastive Learning, an effective contrastive learning framework for semantic parsing which takes sequence-level semantics and fine granularity into consideration. Specifically, we propose a multi-level online sampling algorithm to accurately collect confusing and diverse samples, and design three semantic-aware similarity functions to measure the similarity between utterances and MRs. We also propose a Ranked Contrastive Loss to optimize the representations learned from the multi-level samples. Experimental results show that our approach achieves state-of-the-art performance across several domains and datasets.
There are two main limitations of this work. 1) Since additional negative instances are used for contrast, our SemCL method requires more training time than vanilla semantic parsing methods. 2) Our method still relies on a certain amount of annotated data. Many contrastive learning methods have been proposed for low-resource tasks, and we will exploit contrastive learning for low-resource semantic parsing in future work.

Figure 1 :
Figure 1: Meaning representations are detailed and accurate annotations which express fine-grained, sequence-level semantics. (a) The correct MR y and the predicted MR y′ have similar token sequences but very different semantics; (b) a single token error (from ≥ to <) reverses the semantics of the MR.

Figure 2 :
Figure 2: The architecture of our method, in which (1) we first collect samples via online sampling, where Parse(x) are the parses of the current model and Aug(x) are the paraphrased augmentations; (2) we then divide samples into multiple levels, where Rank = 0 indicates true positives, Rank = 2 indicates true negatives, and Rank = 1 indicates vague instances containing paraphrases and potential aliases; (3) finally we use the Ranked Contrastive Loss for optimization, based on the similarities between the multi-level samples.

Figure 3 :
Figure 3: Comparison of t-SNE visualizations of the learned utterance representations. The colors and markers indicate different MR labels. The transparent markers indicate the representations of the paraphrases.

Table 2 :
Overall results on OVERNIGHT.

Table 4 :
Ablation results of our model with different settings on OVERNIGHT (Bas., Blo., Cal., Hou., Pub., Rec., Res., Soc. denote the eight domains; Avg. is the average).

Table 5 :
Cases on OVERNIGHT with MLE baseline and our SemCL.The MRs are simplified for readability.