AESOP: Paraphrase Generation with Adaptive Syntactic Control

We propose to control paraphrase generation through carefully chosen target syntactic structures to generate higher-quality paraphrases. Our model, AESOP, leverages a pretrained language model and adds deliberately chosen syntactic control via a retrieval-based selection module to generate fluent paraphrases. Experiments show that AESOP achieves state-of-the-art performance on semantic preservation and syntactic conformation on two benchmark datasets with ground-truth syntactic control from human-annotated exemplars. Moreover, with the retrieval-based target syntax selection module, AESOP generates paraphrases of even better quality than the current best model that uses human-annotated target syntactic parses, according to human evaluation. We further demonstrate the effectiveness of AESOP in improving classification models' robustness to syntactic perturbation via data augmentation on two GLUE tasks.


Introduction
Syntactically-controlled paraphrase generation, which aims to generate paraphrases that conform with given syntactic structures, has drawn increasing attention in the community. On the one hand, paraphrase generation has benefited a wide range of NLP applications, such as neural machine translation (Yang et al., 2019), dialogue generation (Gao et al., 2020), as well as improving model robustness and interpretability (Jiang et al., 2019). On the other hand, syntactically-controlled paraphrasing has been used for diverse question generation (Yu and Jiang, 2021), diversifying creative generation (Tian et al., 2021) and improving model robustness (Iyyer et al., 2018).
However, selecting suitable target syntactic structures to control paraphrase generation for diverse and high-quality results is a less explored direction.

Figure 1: Given a source sentence, AESOP selects target syntactic parses adaptively to guide paraphrase generation. The paraphrases shown are all generated by AESOP; they preserve the semantics of the source sentence and conform with the selected syntactic parses.

Prior works usually use a fixed set of syntactic
structures for all input sentences (Iyyer et al., 2018). A challenge with this method is that not all sentences can be paraphrased into the same set of syntactic structures. For example, it is impossible to turn a long sentence with multiple clauses into a noun phrase. Thus, Chen et al. (2019b) proposed to use crowd-sourcing to collect exemplars that provide syntax compatible with the source sentence to guide generation. The disadvantages of this method are that the crowd-sourcing process is costly and that one exemplar sentence can only provide one specific piece of syntactic guidance, while many syntactic parses can properly guide the paraphrase generation (as shown in Figure 1). In contrast, we propose to automatically select multiple syntactic parse structures to control paraphrase generation for more diverse and higher-quality generation. Our first contribution is the proposal of AESOP (Adaptive Syntactically-Controlled Paraphrasing), a model that integrates pretrained Language Models (LMs) with a novel retrieval-based target syntactic parse selection module to control paraphrase generation. By leveraging the expressiveness of pretrained LMs and the adaptive selection module, AESOP is capable of generating fluent and syntactically-diverse paraphrases. With ground-truth target syntactic parses from human-annotated exemplars, AESOP achieves state-of-the-art performance on both semantic preservation and syntactic conformation metrics. Through human evaluation, we show that AESOP can generate paraphrases of even better quality than the current best model that uses human-annotated exemplars, which points out the importance of studying adaptive target parse selection in future work on controlled paraphrase generation.
Our second contribution is the construction of two datasets containing adversarial examples with syntactic perturbations generated by AESOP, further validated and labeled by crowd workers. Experiments show that the two datasets are challenging for current classification models, and that using AESOP to augment the training data can effectively improve classification models' robustness to syntactic attacks.

Task Formulation
We formulate the task of adaptive syntactically-controlled paraphrase generation as: given an input sentence X, find a set of proper syntactic controls Y to generate paraphrases Z, such that Z's syntax conforms to Y while retaining the semantics of X.
We use the term target syntactic parses to refer to the syntactic structure that guides the generation, which could be from exemplar sentences, a set of fixed templates, or our adaptive selection module. AESOP has two components: i) a retrieval-based module that adaptively selects a set of target syntactic parses to guide the paraphrase generation; ii) an encoder-decoder architecture that leverages BART (Lewis et al., 2020) to generate paraphrases.

Adaptive Target Syntactic Parse Selection
In AESOP, we propose a retrieval-based strategy to select target syntactic parses adaptively (Algorithm 1). For a given syntactic parse of a source sentence pruned at height H (as shown in Figure 2), denoted as T_s^H, we aim to find k suitable target syntactic parses to guide the generation. First, we collect (source sentence X, paraphrase Z) pairs from the training data. Then, we prune X's and Z's constituency parse trees at height H simultaneously and get corresponding (T_s^H, T_t^H) pairs. By counting, we obtain the frequencies of all unique paired combinations of pruned source parses with the target syntactic parses of their paraphrases.

Figure 3: AESOP Framework. With a source sentence as input, AESOP has i) a retrieval-based selection module that adaptively chooses a set of target syntactic parses as control signals, together with ii) an encoder-decoder architecture to generate fluent paraphrases. With ground-truth target syntactic parses from exemplars, AESOP leverages the syntactic information at different heights from exemplars to guide the generation.

Given the source parse, we first retrieve the m most similar pruned source parses from the training data, {T_{s_1}^H, ..., T_{s_m}^H}. Then, for each parse T_{s_i}^H, we retrieve all possible target syntactic parses from the pairwise parse combinations in the training data. For each combination, we count how many times it occurs in the training data. For a given combination with occurrence frequency #(T_{s_i}^H, T_t^H), we divide its frequency by the sum of frequencies over all possible target syntactic parses for T_{s_i}^H and obtain a list of frequency ratios. We use this ratio distribution as sampling probabilities to select k/m target syntactic parses T_t^H for each of the m parses T_{s_i}^H, as shown in Equation 2, which results in k (= m * k/m) target syntactic parses in total:

P(T_t^H | T_{s_i}^H) = #(T_{s_i}^H, T_t^H) / Σ_{T'} #(T_{s_i}^H, T')    (2)

In our later experiments, we use the ranker in Equation 1 to retrieve top-ranked target syntactic parses and their corresponding paraphrases. With this two-step strategy, instead of ranking all syntactic parses based on similarity, we aim to find diverse target syntactic parses suitable for the source sentence. We use the weighted sampling strategy rather than directly choosing the most frequently occurring combinations in order to cover compatible combinations that occur less often in a specific dataset.
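The selection procedure above can be sketched in Python. This is an illustrative sketch, not the released implementation: all function names are hypothetical, parses are represented by their linearized strings, and the similarity-based retrieval of the m candidate source parses (Equation 1) is assumed to have already been done.

```python
import random
from collections import Counter, defaultdict

def build_pair_counts(train_pairs, prune):
    """Count pruned (source parse, target parse) combinations in the training data.
    `train_pairs` holds (source_parse, paraphrase_parse) items; `prune` cuts a
    parse at height H (the identity function works for already-pruned strings)."""
    counts = defaultdict(Counter)
    for src, tgt in train_pairs:
        counts[prune(src)][prune(tgt)] += 1
    return counts

def select_target_parses(candidate_src_parses, counts, k, seed=0):
    """For each of the m retrieved source parses, sample k/m target parses with
    probability proportional to co-occurrence frequency (the ratio distribution
    of Equation 2). Assumes k is divisible by m and every candidate parse was
    observed in training."""
    rng = random.Random(seed)
    per_parse = k // len(candidate_src_parses)
    selected = []
    for src in candidate_src_parses:
        freq = counts[src]
        total = sum(freq.values())
        parses = list(freq)
        weights = [freq[p] / total for p in parses]  # frequency ratios
        selected.extend(rng.choices(parses, weights=weights, k=per_parse))
    return selected
```

Weighted sampling, rather than always taking the most frequent combination, keeps rarer-but-compatible target parses reachable, matching the motivation given above.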

Architecture of AESOP
AESOP takes as input the source sentence X, its full syntactic parse T_s, and target syntactic parse(s) Y, and generates as output a paraphrase Z of X together with a duplicate of the target parse Y. Specifically, given a source sentence X, we tokenize it and get its constituency-based parse tree, denoted as T_s (shown as the source parse tree in Figure 3). Similar to previous works (Iyyer et al., 2018; Chen et al., 2019a; Kumar et al., 2020), we linearize the constituency parse tree into a sequence (shown as the source full syntactic parse in Figure 3).
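The pruning and linearization steps can be sketched as follows, assuming a constituency tree is represented as a nested [label, children...] list; the height convention follows the paper's example, in which pruning at height 2 yields (ROOT(S(NP)(VP)(.))).

```python
def prune(tree, height, depth=0):
    """Prune a constituency tree (nested [label, children...] lists) at `height`.
    Token leaves (plain strings) are always removed so no lexical content leaks
    into the syntactic control signal. ROOT sits at depth 0, so height 2 keeps
    labels down to the grandchildren of ROOT, e.g. (ROOT(S(NP)(VP)(.)))."""
    label, children = tree[0], tree[1:]
    if depth == height:
        return [label]
    kept = [prune(c, height, depth + 1) for c in children if isinstance(c, list)]
    return [label] + kept

def linearize(tree):
    """Flatten a (pruned) tree into the bracketed string fed to the encoder."""
    label, children = tree[0], tree[1:]
    return "(" + label + "".join(linearize(c) for c in children) + ")"
```

Because token leaves are dropped unconditionally, even the full (unpruned) linearized parse carries only syntactic, not lexical, information.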
To utilize the encoder-decoder BART (Lewis et al., 2020) model for syntactically-controlled paraphrase generation, we propose an effective design with source sentence<sep>source full syntactic parse<sep>target syntactic parse as the input sequence for the encoder. The output sequence from the decoder is target syntactic parse<sep>paraphrase. We showcase the effectiveness of our model design in Section 4 and provide a visual interpretation showing that AESOP successfully disentangles semantic and syntactic information in Section 5. During training, we get gold target syntactic parses directly from parallel-annotated paraphrases.

Table 1: Performance comparison with ground-truth syntactic control. With coarse syntactic control from a shallow pruning height, AESOP already outperforms the current state-of-the-art model SGCP. AESOP-H4 outperforms SGCP across all semantic preservation (BLEU, ROUGE scores and METEOR) and syntactic conformation metrics (TED-R and TED-E). ↑ means higher is better, while ↓ means lower is better. With the full syntactic parse (-F), AESOP achieves its best controllability, which is comparable to the previous best performance. source-as-input and exemplar-as-output are for quality-check purposes and not for comparison.
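The encoder input and decoder output formats described above can be sketched as below; the literal `<sep>` separator string is an assumption, since the exact special token depends on the tokenizer configuration.

```python
SEP = "<sep>"  # assumed separator token; the real one depends on the tokenizer

def build_encoder_input(source_sentence, source_full_parse, target_parse):
    """sentence <sep> full source parse <sep> target parse, as described above."""
    return SEP.join([source_sentence, source_full_parse, target_parse])

def split_decoder_output(output_sequence):
    """The decoder emits `target parse <sep> paraphrase`; recover both parts."""
    target_parse, paraphrase = output_sequence.split(SEP, 1)
    return target_parse, paraphrase
```

Asking the decoder to first reproduce the target parse is the design choice whose benefit is analyzed in Section 5.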
In our setting, we train separate models using pruned trees of target parses at different heights H. During inference, the target syntactic parses are either from exemplar sentences, fixed templates or our adaptive selection module.

Paraphrase Generation with Syntactic Control
We train and evaluate AESOP on ParaNMT-small (Chen et al., 2019b) and QQP-Pos (Kumar et al., 2020). Our train/dev/test split follows previous work (Kumar et al., 2020). During our experiments, we aim to answer three research questions:
• Q1: Will AESOP conform with the syntactic control while preserving the semantics, given ground-truth target parses? (Section 4.1)
• Q2: Can AESOP generate high-quality paraphrases without ground-truth target parses, using the adaptive selection module? (Section 4.2)
• Q3: How good are the target syntactic parses retrieved by the adaptive selection module? (Section 4.3)

Baselines. For supervised models that utilize exemplar sentences to obtain target parses, we compare with CGEN (Chen et al., 2019a) and two versions of SGCP (Kumar et al., 2020): SGCP-R and SGCP-F. SGCP prunes constituency parse trees of exemplar sentences from height 3 up to 10. During evaluation, SGCP-R chooses the best paraphrase out of many, and SGCP-F uses the full parse tree.
To the best of our knowledge, SGCP-R is the current state-of-the-art model under this setting. For models that utilize a fixed set of target syntactic parses, we compare with SCPN (Iyyer et al., 2018) that proposes 10 syntactic parses at height 2 to guide the generation.

Ground-truth Syntactic Control
To answer Q1, we evaluate AESOP on both datasets with ground-truth target syntactic parses from exemplar sentences.
Experiment Setup. First, we get the constituency parse trees of exemplar sentences. Then, we remove all leaf nodes (i.e., the tokens in the sentences) from the constituency parse trees to prevent any semantics from propagating from the exemplar sentences into the generation. We further prune the parse trees of exemplars at different heights to get different levels of syntactic specification. Technically, the deeper we prune the parse tree, the more fine-grained the syntactic information the model can use. Practically, it is less likely that fine-grained target syntactic parses can be provided. For example, it is easy to provide a target syntactic parse at height 2 containing a verb phrase and a noun phrase, such as (ROOT(S(NP)(VP)(.))), but it is hard to provide more fine-grained syntactic information, even for experts. In AESOP, we try to use syntactic information from exemplar sentences that is as shallow as possible. We train separate models by using target syntactic parses obtained from pruning the constituency parse trees of paraphrases at heights 2, 3 and 4 (implementation details are in Appendix A.1). Correspondingly, we denote them as AESOP(-H2/H3/H4). During evaluation, we only use the target syntactic parse from the exemplar sentences at the corresponding height.

Table 2: Performance of AESOP without ground-truth target parses. Valid@100 is the validity check on the best paraphrases of the first 100 test instances, and Votes is the percentage of received votes for a paraphrase from one model to be the best among the 4 models. Human evaluation indicates that AESOP generates even better-quality paraphrases than the current best model SGCP, which uses the human-annotated target syntactic parse from exemplars.
Quality Check. We use source sentences and exemplar sentences to check the quality of the datasets in Table 1. Using the source sentences as paraphrases leads to high semantic preservation scores, but their syntactic structures differ from those of the paraphrases, so the TED-R scores are poor. On the other hand, exemplar sentences have semantics distinct from both the source sentences and the paraphrases, which leads to poor semantic-preservation metrics. From the TED-R scores, we can see that the tree edit distance between the parse trees of exemplar sentences and paraphrases is low but not 0. This indicates that the quality of these human-annotated exemplar sentences is good yet imperfect.
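TED-R and TED-E compare parse trees via tree edit distance. A minimal unit-cost ordered tree edit distance is sketched below, written as the classic rightmost-root forest recursion with memoization; it has exponential state in the worst case and is only meant for the small pruned parses discussed here (a Zhang-Shasha style algorithm would be used at scale).

```python
def size(forest):
    """Number of nodes in a forest of (label, child, child, ...) tuples."""
    return sum(1 + size(tree[1:]) for tree in forest)

def ted(a, b):
    """Unit-cost ordered tree edit distance between two trees, via the
    standard forest recursion: delete the rightmost root, insert the
    rightmost root, or match the two rightmost subtrees."""
    memo = {}
    def dist(F, G):
        if not F and not G:
            return 0
        key = (F, G)
        if key in memo:
            return memo[key]
        if not F:
            res = size(G)
        elif not G:
            res = size(F)
        else:
            f, g = F[-1], G[-1]
            res = min(
                dist(F[:-1] + f[1:], G) + 1,          # delete rightmost root of F
                dist(F, G[:-1] + g[1:]) + 1,          # insert rightmost root of G
                dist(F[:-1], G[:-1]) + dist(f[1:], g[1:])
                + (f[0] != g[0]),                     # match / relabel the roots
            )
        memo[key] = res
        return res
    return dist((a,), (b,))
```

For example, a generated parse that adds one ADVP node to the exemplar parse sits at distance 1, mirroring the "low but not 0" TED-R observation above.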
Experiment Results. Table 1 shows the performance comparison. Unsurprisingly, the deeper we prune the target syntactic parses from exemplars, the more syntactic information AESOP gets and the better controllability it achieves. With the full target syntactic parse tree, AESOP achieves its best syntactic controllability, which is comparable to the previous best performance. On the other hand, AESOP outperforms SGCP-R on semantic-preservation metrics using only coarse syntactic information from height 2 (AESOP-H2) for ParaNMT-small and height 3 (AESOP-H3) for QQP-Pos. With more syntactic information, AESOP-H4 outperforms the current state-of-the-art SGCP-R on both semantic preservation and syntactic conformation metrics. This showcases AESOP's strong ability for syntactically-controlled paraphrase generation.

Adaptive Target Parse Selection
To answer Q2, we evaluate AESOP without annotated exemplars. By having SGCP-R in our experiments, we aim to evaluate if AESOP can generate even better paraphrases compared to the current best model with human-annotated exemplars.
Experiment Setup. How to select suitable target syntactic parses to guide the generation is still an open problem in the paraphrase generation community. To compare fairly with SCPN, which proposes 10 syntactic templates at height 2, we also adopt AESOP trained at height 2 (shown as AESOP-H2 in Table 1). Unlike previous work, AESOP uses the adaptive selection module to decide a set of target syntactic parses automatically. For a fair comparison, we also feed the same 10 target syntactic parses from SCPN to AESOP, denoted as AESOP-static. It is hard to evaluate retrieved target syntactic parses directly because paraphrases are intrinsically diverse, so many target syntactic parses could be reasonable. Therefore, we use the quality of the generated paraphrases, which is our end goal, to reflect the quality of the retrieved target syntactic parses. For evaluation, we use automatic metrics together with extensive human evaluations.

Table 3: Human validity check of top-k selected target syntactic parses. All numbers are 10-round means with standard deviations. In AESOP, we use the ranker in Equation 1 to sort and get the top-k target parses, while the others use random selection. The high validity rate of paraphrases indicates the high quality of our retrieved target syntactic parses. The trend that higher-ranked syntactic parses have higher validity rates verifies the effectiveness of our ranker.

Automatic Metrics. First, we generate 10 paraphrases from each model. To establish a strong baseline, for each model we choose the best paraphrase as the one with the highest BLEU score against the source sentence. As shown in Table 2, the improvement from AESOP-static to AESOP indicates the effectiveness of our adaptive selection strategy. SCPN performs better on the TED-E@2 metric on both datasets. After qualitative checks, we share the same finding as previous works (Kumar et al., 2020; Chen et al., 2019a) that SCPN tends to strictly adhere to syntactic parses at the cost of semantics.
On the other hand, AESOP leans towards generating fluent paraphrases and can compensate when the target syntactic parse is less reasonable. AESOP achieves better syntactic conformation when the syntactic control signal is more accurate, as indicated by the decreases in TED-E@2 scores in Table 2.

Human Evaluation. We ask workers to check the validity of the best paraphrases of the first 100 test instances, reported as Valid@100 in Table 2 (details in Appendix A.3.1). Besides, we show workers the 4 paraphrases from all models and ask them to vote for the best one; we report the percentage of votes each model received as Votes. As a result, AESOP generates more valid paraphrases than all baselines and receives the most votes, even more than SGCP-R, which utilizes human-annotated exemplars. This finding demonstrates the effectiveness of AESOP and points out the importance of studying automatic target parse selection in paraphrase generation.
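The best-of-10 selection used for the automatic metrics can be sketched as follows. This is a minimal stand-in: the add-one smoothed BLEU below is a hypothetical simplification, not the exact scorer used in the experiments.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: add-one smoothed n-gram precisions up
    to 4-grams, combined geometrically, with a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngr & r_ngr).values())  # clipped n-gram matches
        total = sum(c_ngr.values())
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_prec / max_n)

def pick_best(candidates, source):
    """Among the generated paraphrases, keep the highest-scoring one
    against the source sentence."""
    return max(candidates, key=lambda c: sentence_bleu(c, source))
```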

Quality of Retrieved Syntactic Parses
To answer Q3, we evaluate the quality of the retrieved top-k target syntactic parses by checking the validity of their corresponding paraphrases. We generate 10 paraphrases for each of the first 50 test instances (500 in total) using SCPN, AESOP-static, and AESOP and ask workers to validate them. After annotation, we use the similarity ranker in Equation 1 to rank and get the top-k target syntactic parses and their corresponding paraphrases for AESOP. For the other baselines, since they use a fixed set of target syntactic parses and do not have any ranking mechanism, we apply a random permutation to rank the target parses and get the top-k paraphrases. We run the experiments for 10 rounds and report the validity rate of paraphrases for the top-k target syntactic parses in Table 3. Compared to pre-designed syntactic parses, the higher validity rates of paraphrases from AESOP indicate the better quality of our retrieved target syntactic parses. The trend that higher-ranked syntactic parses have higher validity rates also verifies the effectiveness of our ranker.
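The random-permutation baseline protocol above can be sketched as follows; `valid_flags` is a hypothetical list of 0/1 worker judgments for one instance's generated paraphrases.

```python
import random
import statistics

def topk_validity(valid_flags, k, rounds=10, seed=0):
    """Validity rate among the top-k paraphrases when target parses are
    ranked by a random permutation (the protocol used for models without a
    ranker); repeated over several rounds to report mean and std deviation."""
    rng = random.Random(seed)
    rates = []
    for _ in range(rounds):
        order = list(valid_flags)
        rng.shuffle(order)            # random permutation stands in for a ranker
        rates.append(sum(order[:k]) / k)
    return statistics.mean(rates), statistics.stdev(rates)
```

A model with an actual ranker (AESOP) would instead sort by the Equation 1 similarity score before taking the top k, so a rising validity rate toward the top of the ranking is evidence the ranker works.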

Model Analysis and Interpretation
Ablation Studies. We take out each part of the sequences in both the encoder and the decoder and conduct several ablation studies on AESOP-H4 with exemplars. We show how each part of the sequences influences AESOP's performance in Table 4. The takeaways from our ablation studies are: 1) AESOP's performance plummets without any syntactic specification (row 1 & row 4). 2) Taking out the target parse (tp) in the output sequence leads to worse performance in both semantic preservation and syntactic controllability (row 1 & row 2, row 3 & row 4). We visually interpret the benefit of this design later in this section. 3) Taking out each part of the input sequence for the encoder leads to a significant performance drop for AESOP on the QQP-Pos dataset for both criteria (i.e., semantic preservation and syntactic controllability). The trend is the same for the ParaNMT-small dataset, except that taking out only the full parse (fp) leads to around a 1% improvement on the semantic preservation metrics, while the syntactic controllability stays almost the same. Considering the much larger performance drop on the other criteria, we keep the current design of AESOP.
Interpretation. In Figure 4, we visualize the cross attention between the encoder and decoder for two designs, i.e., AESOP with (right) and without (left) the target syntactic parse in the decoder, on the test set of ParaNMT-small. Technically, we search for the final output with beam size 4 and take the average of the cross-attention scores of the 12 attention heads from the last layer of the decoder. Finally, we add up the attention of all tokens within each component (ss, fp and tp). To manifest the difference, we denote the highest attention score as 100 and calculate the cross attention relative to the highest. Compared to the design without the target syntactic parse in the decoder, the cross attention between paraphrases and source sentences stays the highest in AESOP. However, the ratio of the cross-attention scores of (paraphrases, target parses) and (paraphrases, full source parses) decreases. These decreases indicate that having target parses in the decoder helps to disentangle the semantic and syntactic information in the input sequence: AESOP instead learns the syntactic information from the target syntactic parses through self-attention in the decoder. As a result, this leads to the performance boost in Table 4. At the same time, target parses influence paraphrase generation directly during decoding through the decoder's self-attention, which leads to better controllability of AESOP. Take the example in Figure 4: without the target parse in the decoder, the model outputs "a large black dog sits in the corner beside him." as the paraphrase of "by his side crouched a huge black wolfish dog." After adding the target parse in the decoder, the model no longer generates the prepositional phrase "in the corner" and outputs "a large black dog sits beside him.", which matches the input target parse better.
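The aggregation of cross-attention scores described above can be sketched as follows, assuming the per-head attention has already been summed over output positions so that each head contributes one score per input token; the function name and data layout are illustrative.

```python
def aggregate_cross_attention(head_attns, segment_spans):
    """Average cross-attention over heads, sum the per-token scores within
    each input segment (ss / fp / tp), and rescale so the largest segment
    reads as 100, as in the Figure 4 visualization."""
    n_heads = len(head_attns)
    n_tokens = len(head_attns[0])
    avg = [sum(h[t] for h in head_attns) / n_heads for t in range(n_tokens)]
    totals = {name: sum(avg[i] for i in range(start, end))
              for name, (start, end) in segment_spans.items()}
    peak = max(totals.values())
    return {name: 100 * v / peak for name, v in totals.items()}
```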

Improve Robustness
Recent works show that powerful LMs (e.g., BERT (Devlin et al., 2019)) capture superficial lexical features (McCoy et al., 2019) and are vulnerable to simple perturbations (Jin et al., 2020). Motivated by this, we first test whether BERT is robust to syntactic perturbations introduced by paraphrasing.
We fine-tune BERT models on two GLUE (Wang et al., 2018) tasks (SST-2 and RTE). Then, we generate 10 paraphrases using AESOP-H2 for each instance in the dev set and choose the top 5 to get 2 larger dev sets. We then run the trained BERT models on the new dev sets.
Human Annotation. We collect the paraphrases on which the models fail but succeeded on the original sentences as adversarial examples. We then put all these examples on MTurk and ask workers to re-annotate them (see more annotation details in Appendix A.6). For SST-2, we ask workers to assign sentiment labels as positive, negative or undecided (mixed sentiments). For RTE, one test instance has sentence1 and sentence2 with a label indicating whether sentence1 entails sentence2. We generate paraphrases for sentence2 and ask workers to make a binary decision on whether sentence1 entails the generated paraphrases. We show the statistics of the collected adversarial sets and the original dev sets in Table 6. Researchers can test their models' robustness to syntactic perturbations on our collected datasets.
Augmentation. We augment each training instance with the 5 best paraphrases from AESOP-H2. For SynPG and SCPN, as the pre-designed templates for SynPG are a subset of SCPN's, we generate 5 paraphrases using the selected templates in SynPG. Then, we retrain the BERT models with the augmented training data from each model and get their test accuracies. We define ParaGAP as the accuracy difference between after- and before-augmentation for each paraphrase generation model. ParaGAP indicates how effective the augmentation is at improving model robustness to syntactic perturbations.
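The augmentation and ParaGAP bookkeeping can be sketched as follows; `augment` and `para_gap` are illustrative helpers, not the exact pipeline.

```python
def augment(train_set, paraphrase_fn, n=5):
    """Each (text, label) instance keeps its label and gains n paraphrases."""
    augmented = list(train_set)
    for text, label in train_set:
        augmented.extend((p, label) for p in paraphrase_fn(text)[:n])
    return augmented

def para_gap(acc_after, acc_before):
    """ParaGAP: accuracy after minus before augmentation on the adversarial
    set; larger means the augmentation improved robustness more."""
    return acc_after - acc_before
```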

Experiment Result. As shown in Table 5, BERT models perform poorly on our collected datasets before augmentation, which indicates that our collected adversarial datasets are challenging and that BERT is vulnerable to syntactic perturbations. After using 4 different paraphrasing models to augment the training data, the models' robustness to such perturbations improves in all cases.

Related Works

Huang and Chang (2021) propose a transformer-based model, SynPG, for paraphrase generation. AESOP is a supervised paraphrase generation model, which means that we require parallel paraphrases during training. Previous supervised paraphrase models are mostly RNN-based, including SCPN (Iyyer et al., 2018), CGEN (Chen et al., 2019a) and SGCP (Kumar et al., 2020). Such models struggle with generating long sentences and do not utilize the power of recent pretrained language models. Goyal and Durrett (2020a) is concurrent work that also builds on BART to generate paraphrases but with a different model design. For syntactic control, Goyal and Durrett (2020b) use target syntactic parses to reorder source sentences to guide the generation, while other works, including AESOP, use target syntactic parses directly to guide the generation. CGEN (Chen et al., 2019a) and SGCP (Kumar et al., 2020) use target syntactic parses from crowd-sourced exemplars, SCPN (Iyyer et al., 2018) and SynPG use pre-designed templates, while AESOP retrieves target syntactic parses automatically.

Conclusion and Future Works
In this work, we propose AESOP for paraphrase generation with adaptive syntactic control. One interesting and surprising finding of this paper is that using automatically retrieved parses to control paraphrase generation can result in better quality than the current best model using human-annotated exemplars. This finding manifests the benefits of adaptive target parse selection for controlled paraphrase generation: it generates not only more diverse but also higher-quality paraphrases. This suggests that future work on syntactically controlled paraphrase generation should pay more attention to target parse selection, and we hope AESOP can serve as a strong baseline for this direction.
In our work, we use generated paraphrases to reflect the quality of automatically-selected target parses; future works can design specific metrics to evaluate the quality of retrieved syntactic parses.
In addition, we find that having the control signal in the decoder leads to better controllability of AESOP. Future works can test the generalizability of this modeling strategy in other controlled generation tasks. Finally, we show that AESOP can effectively attack classification models, and we contribute two datasets to test models' robustness to syntactic perturbations. We find that using AESOP to augment training data can effectively improve classification models' robustness to syntactic perturbations.

Acknowledgments
Many thanks to I-Hung Hsu for his constructive suggestion and fruitful discussion for AESOP. We thank Kuan-Hao Huang, Sarik Ghazarian, Yu Hou and anonymous reviewers for their great feedback to improve our work.

Ethical Consideration
Our proposed model AESOP utilizes a pretrained language model to generate paraphrases. It is well known that such pretrained language models, trained on massive online texts, can capture biases reflected in the training data. Therefore, AESOP could potentially generate offensive or biased content. We suggest that interested parties carefully check the generated content before deploying AESOP in any real-world application. Note that AESOP might be used for malicious purposes because it does not have a filtering mechanism that checks the toxicity, bias, or offensiveness of source sentences in the input. Therefore, AESOP can generate paraphrases of harmful content that may offend certain groups or individuals. Our collected datasets are based on the development sets of two public classification tasks on GLUE: SST-2 for sentiment analysis and RTE for textual entailment. These do not contain any explicit details that leak information about a user's name, health, racial or ethnic origin, religious or philosophical affiliation or beliefs.

A Appendix
A.1 Implementation Details

Parameters. We use a learning rate of 3 × 10^-5 to train AESOP. We use 6 encoder layers and 6 decoder layers with a model dimension of 768 and 12 attention heads. We set the max input sequence length to 128 and the max output sequence length to 62.
We train 25 epochs for each model. Training takes about one day for ParaNMT-small and about half a day for QQP-Pos on one NVIDIA GeForce RTX 2080.
Optimization. We use Adam (β1 = 0.9, β2 = 0.999) with a linear learning-rate decay schedule for optimization. All experiments are implemented with the Huggingface library (Wolf et al., 2020; https://huggingface.co/).

A.2 Supplementary Results

Table 7 is supplementary to Table 2. Using AESOP-H2 yields better performance in terms of the semantic preservation metrics. We share the same finding as in Section 4.1 that syntactic controllability improves when we use deeper syntactic parse trees. However, the semantic preservation metrics get worse with more fine-grained syntactic control. We hypothesize that this is because deeper-level control signals can be misleading, and such signals restrict the model to generating paraphrases that conform to the provided misleading syntactic signals, which impairs the ability of the pretrained language model to generate fluent text.
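The linear learning-rate decay used for optimization in Appendix A.1 can be written as below; the optional warmup term is an assumption, since the warmup configuration is not stated.

```python
def linear_decay_lr(step, total_steps, base_lr=3e-5, warmup_steps=0):
    """Linear learning-rate schedule: optional linear warmup (assumed, not
    stated in the paper), then linear decay from base_lr down to 0."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```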

A.3 Validity Check on Paraphrases
In Section A.3.1, we give more details of the human validity check in Table 2, and in Section A.3.2 we give more details of the human evaluation in Table 3.

A.3.1 Validity@100 and Votes
We choose the best paraphrases among the 10 generated paraphrases from SCPN, AESOP-static and AESOP for the first 100 test instances in both datasets. For SGCP, we take its output paraphrases that use the exemplar sentences. Then, we perform a human validity check of these 400 paraphrases on the Amazon MTurk platform. For each source sentence, we provide all 4 paraphrases from these four models to three workers. In our instructions, we ask them to annotate three levels of validity: invalid paraphrase, imperfect paraphrase that does not lose key information, and perfect paraphrase. We binarize the workers' labels, counting both imperfect and perfect paraphrases as valid instances and everything else as invalid. Then, we take the majority vote of the labels among the three workers as the final label. We calculate the ratio of valid instances over 100 and report it as Validity@100 in Table 2. As a supplement, Table 8 shows the breakdown of the three-level validity annotations. In addition, we ask workers to vote for the best paraphrase among the four paraphrases, and report the ratio of the total votes a model gets over all 300 votes as Votes in Table 2 to reduce the influence of personal preference. We use Fleiss' kappa (McHugh, 2012) to measure the Inter-Annotator Agreement (IAA). The IAA for Validity@100 is 0.63, which indicates substantial agreement among workers.
MTurk Setup Details. We require workers to be located in the US, to have completed no fewer than 5,000 HITs, and to have an approval rate no lower than 98%. One HIT contains 10 instances, and three respondents (workers) work on each HIT. For payment, we pay workers $0.8 per HIT with a potential bonus of $1 if they participate in over 5 HITs published by us.

A.3.2 Validity@500
The annotators of the human evaluation in Section 4.3 are three graduate students from our institute. None of them are involved in this project. Two of them worked on validity checks for ParaNMT-small and QQP-Pos, and one student worked on both. We checked their understanding of paraphrases before the study and instructed them to label a paraphrase as valid only when it is natural, fluent, and preserves the semantics of the source sentence. To understand the Inter-Annotator Agreement (IAA), we randomly selected 50 samples of (source sentence, paraphrase) pairs and asked them to annotate independently whether each is a valid paraphrase. After the annotation, we counted it as an agreement if they agreed on the same label (either valid or invalid). The average IAA is 0.9 between the three of them, which indicates good agreement. Then, we had these three annotators annotate all instances sampled in Table 2. After annotation, we count a paraphrase as valid only if both of its 2 annotators think it is valid.

A.4 Case Study with Invalid Target Syntax
Strict conformation to an inappropriate target syntactic parse sometimes leads to semantic loss and abrupt termination of sentences, which hurts the goal of generating fluent and natural paraphrases, as indicated in Section 4.2. For example, given the input sentence "i had a dream yesterday and it was about you" and a target syntactic parse at height 2 of (ROOT(S(ADVP)(NP)(VP)(.))), SCPN generates "maybe it was about you .", which has the same syntactic parse as the target parse, while AESOP generates "you were in my dream last night .", whose syntactic parse at height 2 is (ROOT(S(NP)(VP)(.))).

A.5 Qualitative Comparison
We provide a qualitative comparison between AESOP and other competitive paraphrase generation models under both settings, with or without exemplar sentences, in Table 9. We show that with ground-truth syntactic control (Setting I), AESOP can generate paraphrases that are closer to the ground-truth paraphrases. Without ground truth, AESOP can generate diverse paraphrases that are more natural and better preserve the semantics than SCPN.

A.6 Adversarial Set Collection
We contribute two datasets constructed from AESOP in Section 6 by crowd-sourcing. We collect all adversarial examples that successfully attacked the models, as shown in the all column of Table 5, and put them on Amazon MTurk to annotate whether the paraphrases are valid. We require workers to be located in the US, to have completed no fewer than 5,000 HITs, and to have an approval rate no lower than 98%. One HIT contains 12 instances, and 3 respondents (workers) work on each. For payment, we pay workers $0.4 per HIT as a qualification test.
After selecting qualified workers, we pay them $1 per HIT with another potential bonus of $1 if they participate in over 5 HITs published by us. On average, experienced workers spent around 10 minutes completing one HIT, which means our payment is above the federal minimum wage in the US.
Instruction and Annotation. As sentiment analysis on SST-2 is intuitive, we list examples as instructions to guide the annotation. We count it as an agreement if all three workers give the same label to an instance (i.e., positive, negative or undecided), and we calculate the IAA as the ratio of agreements over all instances for qualified workers. The average IAA of the three workers over all instances is 0.8, which indicates good agreement. During dataset collection, we use the majority vote to decide the final label of an instance. For textual entailment on the RTE dataset, we refer to the original guideline of RTE-4 to explain the textual entailment task itself with examples. The IAA for the RTE annotation is 0.71.

A.7 AESOP Helps to Improve the Decision Boundary
We conduct a study on how augmenting the training data would influence models' decision boundaries.
More specifically, we test BERT models before and after augmentation with AESOP on the combination of the original gold test set and our collected adversarial datasets on the two downstream tasks. For visualization, we use t-SNE to reduce the dimension of the [CLS] token from the last layer of the BERT model. Figure 5 shows that AESOP helps the BERT models sharpen their decision boundaries, which is also indicated by Table 5 in the main content.

Table 9: A qualitative comparison of paraphrases generated by AESOP with or without exemplar sentences. SGCP and AESOP-H4 use target syntactic parses from exemplar sentences to guide the generation. SCPN uses fixed target syntactic templates, while AESOP retrieves target syntactic parses automatically. Example AESOP outputs: with target parse (ROOT (S (S ) (NP ) (VP ) (. ))), "there was a big black wolf lying next to him ."; with target parse (ROOT (NP (NP ) (. ))), "a large , black , wolf like dog lay beside him .".

Figure 5: AESOP helps to improve the model decision boundary. Panels: (a) SST before augmentation, (b) SST after augmentation, (c) RTE before augmentation, (d) RTE after augmentation. For visualization, we use t-SNE to reduce the dimension of the [CLS] token from the last layer of the BERT model, combining the collected data and dev sets for SST-2 and RTE.