Syntactically-Informed Unsupervised Paraphrasing with Non-Parallel Data

Previous works on syntactically controlled paraphrase generation rely heavily on large-scale parallel paraphrase data that is not easily available for many languages and domains. In this paper, we take this research direction to the extreme and investigate whether it is possible to learn syntactically controlled paraphrase generation from non-parallel data. We propose a syntactically-informed unsupervised paraphrasing model based on a conditional variational auto-encoder (VAE) that can generate texts in a specified syntactic structure. In particular, we design a two-stage learning method to effectively train the model using non-parallel data. The conditional VAE is trained to reconstruct the input sentence according to the given input and its syntactic structure. Furthermore, to improve the syntactic controllability and semantic consistency of the pre-trained conditional VAE, we fine-tune it using syntax controlling and cycle reconstruction learning objectives, and employ Gumbel-Softmax to combine these new learning objectives. Experimental results demonstrate that the proposed model, trained only on non-parallel data, is capable of generating diverse paraphrases with a specified syntactic structure. Additionally, we validate the effectiveness of our method for generating syntactically adversarial examples on the sentiment analysis task.


Introduction
Paraphrases are texts or passages conveying the same meaning but with different surface realizations. Paraphrase generation (PG) is a key technology for automatically generating a restatement of a given text, with potential uses in many downstream tasks such as question answering (Dong et al., 2017), machine translation, and text summarization (Zhao et al., 2018).
In recent years, learning controllable paraphrase generation (CPG) with specified styles has attracted intense research interest, e.g., satisfying particular syntactic templates (Iyyer et al., 2018) or exemplars (Kumar et al., 2020). As CPG can produce diverse paraphrases by exposing syntactic control, it can also be employed for adversarial example generation (Iyyer et al., 2018).
Existing syntactically controlled paraphrase networks (Iyyer et al., 2018) rely on large paraphrase parallel data for training. Unfortunately, paraphrase parallel corpora are not easily available for many languages, and are expensive to build. Conversely, non-parallel data is much easier to find, and many languages with limited parallel data still possess a huge amount of non-parallel data.
In this paper, we propose a Syntactically-informed Unsupervised Paraphrasing (SUP) framework based on a conditional variational auto-encoder (VAE), which generates paraphrases with specified syntactic skeletons and does not require any parallel paraphrase data. The basic assumption behind SUP is that, given a sentence, there may exist many valid paraphrases with different syntactic structures. Specifically, as shown in Figure 1, SUP runs in two stages. At stage 1, we train a conditional VAE to reconstruct a given input sentence according to the sentence itself and its syntactic parse tree. The model trained at this stage is endowed with the basic ability to generate texts of desired syntactic structures (similar to a warm-up procedure). At stage 2, to improve the syntactic controllability and semantic consistency of generated sentences, we fine-tune the model trained at stage 1 using carefully-designed objective functions involving syntax controlling and cycle reconstruction. After the conditional VAE model is fine-tuned, given an input sentence and a different syntactic structure, the model can generate a paraphrase according to the given structure.
We evaluate SUP on both syntactic paraphrase generation and adversarial example generation tasks.
Experiments show that SUP outperforms the previous unsupervised paraphrasing method SIVAE (Zhang et al., 2019). It is also capable of generating syntactically adversarial examples that have a significant impact on the performance of attacked neural models. We further show that augmenting training data with such examples can improve the robustness of target neural models.
In summary, the major contributions of this paper are as follows:
• We propose a syntactically-informed unsupervised paraphrasing model based on the conditional VAE framework and use it to generate syntactically adversarial examples.
• To enable the model to generate syntactically-controlled paraphrases, we propose a novel tree encoder to effectively model structural information and a syntax controlling learning objective to further improve syntactic controllability. Meanwhile, we also introduce a cycle reconstruction learning objective to preserve the semantics of the input sentence.
• Experiments show that our model can successfully generate syntactically adversarial examples. By augmenting training data with such examples, we can improve the robustness of target neural models.

Related Work
Paraphrase Generation The task of paraphrase generation has recently received significant attention (Liu et al., 2020a). Previous works mainly explore supervised paraphrasing methods, which require large corpora of parallel sentences for training. Due to the lack of parallel data, unsupervised paraphrasing has become an emerging research direction (Miao et al., 2018; Liu et al., 2020c). However, these methods mainly rely on lexical changes to generate paraphrases. In contrast, our work focuses primarily on syntactically controlled paraphrase generation, which generates a paraphrase according to a given syntactic structure.
Controlled Text Generation Recent works on controlled generation aim at controlling attributes such as sentiment (Hu et al., 2017; John et al., 2019). These works use a categorical feature as the controlling signal. Different from them, we use a more complicated, non-categorical syntactic structure as the controlling signal. To ensure syntactic controllability, we design a tree encoder and a syntax controlling loss to encourage the model to generate sentences that conform to the given syntax. Other works attempt to control structural aspects of generation, such as studies using a given syntactic form (Iyyer et al., 2018; Liu et al., 2020b). Our work is closely related to this category, and to the syntactically controlled paraphrase networks (SCPN) proposed by Iyyer et al. (2018) in particular. They use the attentional seq2seq framework to build a parse generator and a paraphrase generator in a two-stage generation process: in the first stage, they generate full parse trees from syntactic templates, and then produce the final generations in the second stage. Both the parse generator and the paraphrase generator require parallel data for training. Significantly different from their method, our conditional-VAE-based model is unsupervised and does not require any parallel data for training.

Conditional Variational Autoencoder
Our work is also related to syntax-infused text generation (Bao et al., 2019; Zhang et al., 2019). These models use two variational autoencoders to introduce two latent variables designed to capture semantic and syntactic information, respectively. The variational autoencoder (VAE) was proposed by Kingma and Welling (2014) for image generation; Bowman et al. (2016) successfully applied VAE to fluent sentence generation from a latent space. The conditional VAE is a modification of VAE to generate diverse images conditioned on certain attributes, e.g., generating different human faces given a skin color (Sohn et al., 2015; Yan et al., 2016). Inspired by the conditional VAE, we view the syntactic structure as the conditional attribute and adopt a conditional VAE to generate syntactic paraphrases. Furthermore, to improve the syntactic controllability and semantic consistency of generated sentences, we use syntax controlling and cycle reconstruction objective functions to fine-tune the model.

Adversarial Example Generation
To generate adversarial examples for NLP models, most previous works rely on injecting noise either at the character level (Ebrahimi et al., 2018; Gao et al., 2018) or at the word level by adding and deleting words (Garg and Ramakrishnan, 2020). In this paper, we generate syntactically adversarial examples, which still remains an open challenge, as the semantic meaning of these examples should be preserved despite their substantial structural changes.

[Figure 1: Architecture of the proposed syntactically-informed unsupervised paraphrasing model. Stage 1: training a conditional VAE by reconstructing the input sentence given the sentence itself and its syntactic structure; here we simply take x = {x_1, x_2} as an example. Stage 2: fine-tuning the model using novel objective functions. x, s, s′ (different from s), and y denote the input sentence, its syntactic structure, another syntactic structure, and the output sentence, respectively; L_* denote the loss terms.]

Approach
We use the constituency parse tree to provide syntactic information. Given a set of training instances D = {(x_i, s_i)}_{i=1}^{N}, where s_i is the syntactic parse tree of the sentence x_i, we aim to train a syntactic paraphrasing model that can produce diverse paraphrases given arbitrary syntax.
However, a full parse tree (the whole tree without leaf nodes) is too specific and poses the challenge of selecting such a tree for a given input, as the syntactic structures of two different sentences are not easily compatible with each other. Therefore, we mainly use a general template (the top three layers of a parse tree), as shown in Figure 2, as the controlling signal, which is beneficial for generating meaningful paraphrases.
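As a concrete illustration, the template extraction can be sketched by pruning a parse tree to its top three layers. This is our own sketch, not the paper's code; the nested-list tree representation is an assumption.

```python
# Sketch: pruning a constituency parse to its top-3-layer template.
# A tree is [label, child1, child2, ...]; a leaf is a bare label string.

def prune(tree, depth=3):
    """Keep only the top `depth` layers of a parse tree."""
    if isinstance(tree, str) or depth == 1:
        # At the cut-off depth, a subtree collapses to its root label.
        return tree if isinstance(tree, str) else tree[0]
    return [tree[0]] + [prune(child, depth - 1) for child in tree[1:]]

# Toy full parse (word-level leaves omitted from the template).
full = ["ROOT",
        ["S",
         ["CC", "and"],
         ["NP", ["DT", "the"], ["NN", "man"]],
         ["VP", ["VBZ", "is"], ["ADVP", ["RB", "here"]]],
         ["Dot", "."]]]

template = prune(full, depth=3)
# -> ['ROOT', ['S', 'CC', 'NP', 'VP', 'Dot']]
```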
Specifically, we employ the conditional variational autoencoder (VAE) framework, which has proven able to generate diverse texts conditioned on certain attributes. In this work, we view syntax as the conditional attribute. The training process consists of two stages: in the first stage, we train the model in an auto-encoding manner, while in the second stage, we fine-tune it with new objective functions, as shown in Figure 1. The two stages are described in detail below.

Stage 1: Training a Conditional VAE
At this stage, we pre-train the conditional VAE model. The model is required to reconstruct the input sentence given the sentence itself and its syntactic template. In doing so, the model acquires the preliminary ability to generate a desired sentence conditioned on a given syntactic template, which makes training in the subsequent stage easier.
Sentence Encoding Given a sentence x, we first obtain the sentence hidden state h_x from the sentence encoder. For the semantic variable z_x, we compute the mean and variance of q(z_x|x) from h_x as:

µ_x = W_µ h_x + b_µ,  log σ_x² = W_σ h_x + b_σ,  (1)

where W_µ, W_σ, b_µ, b_σ are trainable parameters.

Syntax Encoding This encoder provides the necessary syntactic guidance for the generation of paraphrases. Formally, let the syntactic template be s = {V, E}, where V is the set of nodes and E the set of edges.
As shown in Figure 2, we traverse the given syntactic template in a top-down (green line) and left-to-right (blue line) manner to model parent-child and sibling relationships, respectively. For the top-down (TD) direction, we encode each node in a depth-first manner. Specifically, the representation h_v of each node v ∈ V is computed from the hidden-state representation of its parent node pa(v) and the node's embedding as follows:

h_v^TD = GRU(e(v), h_pa(v)^TD),

where e(v) is the embedding of the node v. Although we can obtain TD representations for all nodes of the syntactic template, only the TD representations of the leaf nodes are used for left-to-right encoding. For the left-to-right (LR) encoding, the encoder is a forward GRU network. Taking the leaf node sequence Leaf_seq = {CC, NP, VP, Dot} of the example in Figure 2 as input, we compute the LR representations of all leaf nodes; for instance, for the NP node:

h_NP^LR = GRU(h_NP^TD, h_CC^LR).

Then, we use the last hidden state of the syntactic encoder, h_Dot^LR, as the final syntax representation h_s that provides the syntactic signal to the decoder.
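The two traversal orders can be sketched as follows. The nested-list tree format and helper names are our own assumptions, and the GRU updates themselves are omitted; the sketch only shows which (parent, child) pairs the top-down pass visits and which leaf sequence the left-to-right GRU consumes.

```python
# Sketch of the two traversal orders used by the syntax encoder:
# top-down depth-first (parent before child) and the left-to-right
# leaf sequence fed to the forward GRU.

def top_down_pairs(tree, parent=None, out=None):
    """Collect (parent_label, node_label) pairs in depth-first order."""
    if out is None:
        out = []
    label = tree[0] if isinstance(tree, list) else tree
    out.append((parent, label))
    if isinstance(tree, list):
        for child in tree[1:]:
            top_down_pairs(child, label, out)
    return out

def leaves(tree):
    """Left-to-right sequence of leaf labels of a template."""
    if not isinstance(tree, list):
        return [tree]
    seq = []
    for child in tree[1:]:
        seq += leaves(child)
    return seq

template = ["ROOT", ["S", "CC", "NP", "VP", "Dot"]]
pairs = top_down_pairs(template)   # [(None,'ROOT'), ('ROOT','S'), ('S','CC'), ...]
leaf_seq = leaves(template)        # ['CC', 'NP', 'VP', 'Dot']
```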
Decoding in the Training Phase We employ the reparameterization trick to obtain the semantic variable z_x = µ_x + σ_x ε, ε ∼ N(0, I). Then at each time step, we concatenate the syntactic representation h_s with the previous word's embedding as the input to the decoder, and concatenate the semantic variable z_x with the hidden state output by the decoder for predicting the word at the next time step, as shown in stage 1 of Figure 1. Note that the initial hidden state of the decoder is set to zero.
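A minimal sketch of the reparameterization step, with the deterministic test-time variant described in the next paragraph; dimensions are illustrative, not the model's actual sizes.

```python
import numpy as np

# Sketch of the reparameterization trick: z_x = mu + sigma * eps with
# eps ~ N(0, I), so gradients can flow through mu and sigma.

rng = np.random.default_rng(0)
d = 8                               # latent size (assumed)
mu = rng.normal(size=d)             # mean from the sentence encoder
log_var = rng.normal(size=d)        # log-variance from the sentence encoder
sigma = np.exp(0.5 * log_var)

eps = rng.normal(size=d)
z_train = mu + sigma * eps          # stochastic sample used in training

z_test = mu                         # MAP inference at test time: z_x = mu
```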
Decoding in the Test Phase Given the same sentence but a different syntactic template, the model can generate a syntactically controlled paraphrase. We obtain the semantic variable z_x by maximum a posteriori (MAP) inference, i.e., z_x = µ_x. In this way, semantic information from the input sentence is preserved as much as possible. After that, the decoding process is the same as in the training phase.
The Objective Function To train the above model, we optimize the following objective function:

L_stage1 = L_cvae + λ_bow L_bow,

where L_cvae and L_bow denote the conditional VAE loss and the bag-of-words loss, respectively, and λ_bow is a hyper-parameter balancing the two losses.

Conditional VAE Loss: This loss is used to optimize the conditional VAE model by minimizing the reconstruction loss L_rec while minimizing the KL loss L_kl to encourage the posterior q(z_x|x) to match the prior p(z_x):

L_cvae = λ_res L_rec + λ_kl L_kl,
L_rec = −E_{q(z_x|x)}[log p(x | z_x, s)],
L_kl = KL(q(z_x|x) ∥ p(z_x)),

where p(z_x) is the standard normal distribution N(0, I) and q(z_x|x) takes the form N(µ_x, σ_x²). Here, µ_x and σ_x are computed by Eq. (1), and λ_* are balancing hyper-parameters.
Bag-of-Words Loss: We introduce the bag-of-words loss to enhance content preservation during paraphrase generation. Specifically, we take z_x as input and predict the bag-of-words distribution:

p_bow = softmax(W_bow z_x + b_bow),

where W_bow, b_bow are trainable parameters. The bag-of-words loss is computed as follows:

L_bow = −Σ_{w∈V} t_w log p_bow(w),

where V denotes the word vocabulary and t is the bag-of-words ground-truth distribution of the corresponding sentence.
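A sketch of this computation in NumPy; the vocabulary size, parameter names, and toy word counts are assumptions for illustration.

```python
import numpy as np

# Sketch of the bag-of-words loss: project z_x to a vocabulary
# distribution and score it against the sentence's bag-of-words target.

rng = np.random.default_rng(1)
V, d = 10, 8                        # vocab size, latent size (assumed)
W_bow = rng.normal(size=(V, d)) * 0.1
b_bow = np.zeros(V)
z_x = rng.normal(size=d)

logits = W_bow @ z_x + b_bow
p_bow = np.exp(logits - logits.max())
p_bow /= p_bow.sum()                # softmax over the vocabulary

# Target t: normalized word counts of the input sentence (toy word ids).
t = np.zeros(V)
for w in [2, 5, 5, 7]:
    t[w] += 1.0
t /= t.sum()

L_bow = -np.sum(t * np.log(p_bow + 1e-12))   # cross-entropy against t
```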

Stage 2: Fine-tuning the Conditional VAE Model
During inference, we give different syntactic structures to every input sentence to generate paraphrases. To encourage generalization to different syntactic structures, we fine-tune the pre-trained conditional VAE in a cycle-learning manner. Specifically, as shown in stage 2 of Figure 1, given an input sentence x, its syntactic template s, and another syntactic template s′, we feed x and s′ into the conditional VAE model to generate a sentence y (green line). We compute the syntax controlling (blue line) and cycle reconstruction (red line) losses, and then fine-tune the model to generate a better sentence that is formed in the syntactic structure of s′ and preserves the semantic meaning of x.
Syntax Controlling Loss First, we build a GRU-based seq2seq neural parser as the evaluator, which is pre-trained on the above-mentioned training data D, with x as the input and the linearized syntactic template s as the decoding target. For example, the linearized syntactic template in Figure 2 is (ROOT(S(CC)(NP)(VP)(Dot))).
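The linearization step can be sketched as follows, assuming the same nested-list tree representation as before (our assumption, not the paper's code):

```python
# Sketch: linearizing a syntactic template into the bracket string
# that the seq2seq evaluator decodes.

def linearize(tree):
    if not isinstance(tree, list):
        return "(" + tree + ")"
    return "(" + tree[0] + "".join(linearize(c) for c in tree[1:]) + ")"

template = ["ROOT", ["S", "CC", "NP", "VP", "Dot"]]
print(linearize(template))  # (ROOT(S(CC)(NP)(VP)(Dot)))
```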
Second, we apply the pre-trained evaluator to predict the linearized syntactic structure of the output sentence y, where the parameters of the conditional VAE are updated to encourage the target syntactic template s′ to be predicted from the output sentence, i.e., minimizing the following term:

L_sc = −log p_evaluator(s′_l | GS(y)),  (8)

where s′_l is the linearized s′, and GS(y) denotes a "softly" generated sentence based on the Gumbel-Softmax distribution (Jang et al., 2016), in which the representation of each word is defined as the weighted sum of word embeddings with the prediction probabilities at the current timestep. Note that the parameters of the evaluator are not updated in this step.
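A minimal sketch of the Gumbel-Softmax relaxation at one decoding timestep; the shapes and the temperature value are illustrative assumptions.

```python
import numpy as np

# Sketch of Gumbel-Softmax: sample soft one-hot weights over the
# vocabulary, then represent the generated word as the weighted sum of
# word embeddings, so gradients can pass through the discrete choice.

rng = np.random.default_rng(2)
V, d, tau = 10, 8, 0.5              # vocab, embedding size, temperature

logits = rng.normal(size=V)         # decoder logits at one timestep
u = rng.uniform(size=V)
g = -np.log(-np.log(u + 1e-12) + 1e-12)   # Gumbel(0, 1) noise
y = np.exp((logits + g) / tau)
y /= y.sum()                        # soft one-hot sample

E = rng.normal(size=(V, d))         # word embedding matrix
soft_word = y @ E                   # "softly" generated word representation
```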
Cycle Reconstruction Loss Using only the above syntax controlling loss, however, would result in generating a sentence that conforms to the target syntactic structure but drifts away from the original meaning. To address this issue, we borrow the cycle reconstruction loss L_cr from style-transfer research (Hu et al., 2017) to encourage the generated sentence to preserve the meaning of the input sentence.
We feed the generated sentence y and the syntactic template s of x to the conditional VAE and update the model to reconstruct the original input sentence x by minimizing the following term:

L_cr = −log p(x | GS(y), s),

where GS(y) is the same as in Eq. (8).
The Objective Function The final loss function for fine-tuning is defined as follows:

L_ft = L_cvae + λ_sc L_sc + λ_cr L_cr,

where λ_* are balancing hyper-parameters. In our experiments, we still optimize L_cvae during the fine-tuning stage, which helps to stabilize the training process.

Experiments
In this section, we will answer the following questions:
• First, we investigate whether our model can generate syntactically controlled paraphrases.
• Second, we examine whether our model can generate syntactically adversarial examples for sentiment analysis.

Syntactically-Informed Paraphrase Generation
Given an input sentence, a syntactically-informed paraphrase is a sentence that conveys the same meaning as the input sentence but follows a different, specified syntactic structure.

Models for Comparison
We compared with the following unsupervised paraphrase models: 1) VAE: a vanilla variational autoencoder (Bowman et al., 2016) as a simple baseline; 2) SIVAE: a syntax-infused variational autoencoder (Zhang et al., 2019) that utilizes additional syntax information, provided as a linearized parse tree, to improve the quality of sentence generation and paraphrase generation.
We also compared against the supervised method SCPN (Iyyer et al., 2018) which uses an extended pointer-generator network (See et al., 2017) to encode input sentences and linearized parse trees to generate paraphrases.

Datasets
Quora. The dataset contains 140k pairs of paraphrase sentences and 260k pairs of non-paraphrase sentences. In the standard dataset split, there are 3k and 30k paraphrase pairs in the held-out validation and test sets, respectively. We followed the same unsupervised setting as Miao et al. (2018) and Bao et al. (2019), using non-paraphrase sentences that do not appear in the validation and test sets as training instances. For the supervised method SCPN (Iyyer et al., 2018), we used the paraphrase sentences for training.

Evaluation Metrics
We employed original sentences and syntactic templates (or full parse trees) obtained from references as input, which is convenient for evaluation; in an application scenario, any syntactic template can be given to the trained model. For semantic evaluation, we computed BLEU (Papineni et al., 2002) scores against the reference and the original sentence, denoted as BLEU-ref and BLEU-ori, respectively. Additionally, we used i-BLEU (Sun and Zhou, 2012) to measure the diversity of expressions. We also used the embedding-based evaluation method Sentence-BERT (Reimers and Gurevych, 2019) to evaluate the semantic similarity between the generated sentence and the reference sentence. (For SCPN, we used the released implementation: https://github.com/miyyer/scpn.)
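For reference, i-BLEU is commonly computed as a weighted difference between BLEU against the reference and BLEU against the source, rewarding fidelity while penalizing copying. The sketch below uses a toy unigram-precision stand-in for BLEU and an assumed α, so it illustrates the metric's shape rather than the paper's exact scorer.

```python
# Sketch of i-BLEU with a toy BLEU stand-in; alpha and the plug-in
# scorer are our assumptions, not the paper's exact setup.

def unigram_bleu(candidate, reference):
    """Toy unigram-precision stand-in for a real BLEU implementation."""
    cand, ref = candidate.split(), set(reference.split())
    return sum(w in ref for w in cand) / max(len(cand), 1)

def i_bleu(candidate, reference, source, alpha=0.9):
    # High when close to the reference, penalized when copying the source.
    return (alpha * unigram_bleu(candidate, reference)
            - (1 - alpha) * unigram_bleu(candidate, source))

src = "what is the best way to learn french"
ref = "how can i learn french well"
out = "how can i learn french"
score = i_bleu(out, ref, src)
```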
For syntactic evaluation, we evaluated how often generated paraphrases completely conform to the target syntactic templates by computing the rate of exact syntactic match (ESM): a paraphrase g is deemed an exact syntactic match to reference r only if the top three levels of its parse tree p_g exactly match those of p_r. All hyper-parameters were tuned based on the BLEU-ref score on the validation set.
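The ESM check can be sketched by pruning both parse trees to their top three levels and comparing; the nested-list trees are our assumed representation.

```python
# Sketch of the exact-syntactic-match (ESM) check: differences below
# the third level of the parse trees are ignored.

def top_levels(tree, depth=3):
    if not isinstance(tree, list) or depth == 1:
        return tree if not isinstance(tree, list) else tree[0]
    return [tree[0]] + [top_levels(c, depth - 1) for c in tree[1:]]

def exact_syntactic_match(parse_g, parse_r, depth=3):
    return top_levels(parse_g, depth) == top_levels(parse_r, depth)

# Same top-3 skeleton (ROOT(S(NP)(VP)(Dot))), different internals: match.
p_g = ["ROOT", ["S", ["NP", "DT", "NN"], ["VP", "VBZ"], "Dot"]]
p_r = ["ROOT", ["S", ["NP", "PRP"], ["VP", "VBD", "NP"], "Dot"]]
# Extra CC at level 3: no match.
p_o = ["ROOT", ["S", "CC", ["NP", "PRP"], ["VP", "VBD"], "Dot"]]
```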

Implementation Details
We parsed all sentences in the training set and the reference sentences in the validation and test sets using Stanford CoreNLP (Manning et al., 2014). We used the Adam optimizer (Kingma and Ba, 2014) for optimization, setting the learning rate to 5e-4 for stage 1 and 1e-4 for stage 2. The word embedding layer was initialized with the publicly available GloVe 300-dimensional embeddings. We adopted the tricks of KL annealing and word dropout following Bowman et al. (2016). We set λ_res to 5, λ_bow to 0.5, λ_sc to 2.5, and λ_cr to 1.
We reimplemented VAE and SIVAE, and set the same KL weights for fair comparison.

Results
As shown in Table 1, results in the first row are computed over original sentences, which show a BLEU-ori score of 100. We can see that all models achieve strong results when using full parse trees as syntactic control. This is because full parse trees contain more fine-grained syntactic information, which guides the model to correctly substitute words with equivalents. With full parse trees, SUP (stage 1) outperforms the existing unsupervised methods on all metrics; with syntactic templates, we beat them on the ESM and i-BLEU metrics. VAE and SIVAE-T tend to copy the input sentence as the output and therefore get low ESM but high BLEU-ori scores.
Among our models, SUP-T obtains an ESM of 73.9% and 65.9% on the Quora and ParaNMT datasets, respectively. This shows that it can generate sentences according to the given syntactic templates (compared to row 1). At stage 2, adding the conditional VAE loss leads to improvements in the semantic metrics. Using the conditional VAE loss and the syntax controlling loss, we observe that while the syntactic accuracy is greatly improved, the semantic metrics decrease. Adding all loss terms leads to gains across both the semantic and syntactic metric scores. These results demonstrate the effectiveness of the proposed fine-tuning methods.
Even without using any parallel data, our model is competitive with the supervised SCPN trained on parallel data on some metrics. In particular, SUP-T (stage 2) achieves a higher S-BERT score than SCPN on the Quora dataset and a higher ESM score than SCPN on the ParaNMT dataset. Table 2 shows several paraphrases generated by each model; more generation results are presented in Appendix A. We can observe that SUP-F produces better results than SIVAE-F in terms of both semantics and syntax. VAE and SIVAE-T tend to copy the source sentences. SUP-T can generate paraphrases syntactically similar to the reference.

Human Evaluation
We also conducted human evaluation to measure paraphrase quality in a blind fashion. Following previous work (Iyyer et al., 2018; Goyal and Durrett, 2020), three annotators were asked to evaluate 100 randomly selected generations from the Quora test set according to a three-point scale: 0 denotes that the generated sentence is not a paraphrase at all; 1 means that the generated sentence is a paraphrase containing grammatical errors; 2 indicates that the generated sentence is a grammatically good paraphrase. Additionally, we also asked annotators to evaluate syntactic controllability (ESM-H): whether generations follow the given syntactic templates. Table 3 shows the results of human evaluation, which are somewhat consistent with the automatic metrics. We notice that the quality of generations from SIVAE-T is better than that of the SCPN-T model. The reason is that SIVAE-T tends to copy input sentences as outputs, which also means that SIVAE-T cannot generate meaningful paraphrases (only copies of inputs) according to the given syntactic templates. SUP-T obtains comparable results with SCPN if we consider paraphrases scored 2 and 1 as meaningful paraphrases. Additionally, most generations from SUP-T follow the given target syntax.

Influence of KL-Weight on Results
We also analyzed the influence of different KL weights on the SUP-T (stage 1) model. We can see in Table 4 that content preservation and syntactic controllability are somewhat contradictory to each other. Usually, a smaller KL weight makes the autoencoder less "variational" but more "deterministic," leading to a lower syntactic match but better content preservation. In this experiment, to trade off content preservation against syntactic controllability, we set the KL weight to 0.3.

Adversarial Example Generation
We further examined the utility of controlled paraphrase generation for adversarial example generation. Following previous work (Iyyer et al., 2018), we evaluated our syntactically adversarial examples on the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013). We generated 10 syntactically different paraphrases for each instance using the 10 most frequent syntactic templates and added them to the SST training set. Since we cannot generate a valid paraphrase for every syntactic template, we filtered generated paraphrases using a BLEU (1-3 gram) threshold to remove nonsensical outputs. In this experiment, we set the threshold to 0.5.
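The filtering step can be sketched as follows; the smoothed 1-3 gram BLEU below is a simplified stand-in for the paper's exact scorer, and the threshold value is taken from the text.

```python
from collections import Counter
import math

# Sketch: keep a generated paraphrase only if its 1-3 gram BLEU
# against the source sentence clears the threshold (0.5 in the paper).

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_1to3(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in (1, 2, 3):
        c_ng, r_ng = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ng & r_ng).values())     # clipped n-gram matches
        total = max(sum(c_ng.values()), 1)
        log_p += math.log((overlap + 1) / (total + 1)) / 3  # add-1 smoothing
    bp = min(1.0, math.exp(1 - len(ref) / max(len(cand), 1)))
    return bp * math.exp(log_p)

source = "the film is almost unsurpassed as a visual treat"
candidates = ["as a visual treat the film is almost unsurpassed",
              "completely unrelated gibberish output here"]
kept = [p for p in candidates if bleu_1to3(p, source) >= 0.5]
```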

Evaluation Metrics
We evaluated this task with the following metrics:
1. Dev Failure (Failure). We regard a development instance x as a prediction failure if the original prediction is correct but the prediction for at least one paraphrase is incorrect. Dev Failure is the percentage of instances on the development set that become prediction failures after paraphrasing.
2. Validity (Valid). To measure the validity of our adversarial examples, we performed manual evaluation on 100 randomly selected adversarial examples. We asked three workers to choose the appropriate label (e.g., positive or negative) for a given sentence, and then compared the workers' judgments to the original sentiment labels.
3. Test Accuracy (Acc). It is used to measure the performance of sentiment classification models on the test set.
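The Dev Failure computation can be sketched as follows; the data layout and the choice of denominator (all development instances) are our assumptions.

```python
# Sketch of the Dev Failure metric: the share of dev instances whose
# original prediction is correct but where at least one paraphrase
# flips the model's prediction.

def dev_failure(examples, predict):
    """examples: list of (sentence, label, [paraphrases])."""
    failures = 0
    for sent, label, paras in examples:
        originally_correct = predict(sent) == label
        if originally_correct and any(predict(p) != label for p in paras):
            failures += 1
    # Denominator choice (all dev instances) is an assumption.
    return 100.0 * failures / max(len(examples), 1)

# Toy model: predicts "pos" iff the text contains the word "good".
predict = lambda s: "pos" if "good" in s.split() else "neg"
data = [("a good movie", "pos", ["a fine movie"]),   # flips -> failure
        ("a good film", "pos", ["a good picture"]),  # stays correct
        ("a bad film", "pos", ["a poor film"])]      # originally wrong
rate = dev_failure(data, predict)
```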

Implementation Details
We first pre-trained our model on 2.1M preprocessed sentences from the One-Billion-Word Corpus, and then fine-tuned it on the SST dataset. For the pre-trained classification models, we used the bidirectional LSTM baseline of Tai et al. (2015). The word embedding layer was initialized with the publicly available GloVe 300-dimensional embeddings. We used the Adam optimizer (Kingma and Ba, 2014) for optimization and set the learning rate to 1e-4.

Results
As shown in Table 5, we obtain a validity score of 68.0 and a Dev Failure score of 28.0. By augmenting the training data with paraphrases generated by our model, we obtain a lower Dev Failure score of 25.0. These results suggest that our model can generate legitimate adversarial examples, and that we improve the robustness of models against syntactic adversaries with little effect on test accuracy. We also observe that SCPN obtains strong results. This is because it is trained on large-scale parallel data, and its generated paraphrases include both lexical and syntactic variations. However, these advantages stem from the use of a large-scale parallel corpus; our unsupervised method could be very effective for low-resource languages where no parallel data are available. Table 6 lists some paraphrases generated by SUP with different syntactic templates, and Table 7 shows adversarial examples generated by our model. We find that the generated sentences always conform to the target templates, which further shows that our model can generate legitimate adversarial examples. We also observe that the generated paraphrases exhibit only syntactic variations, not lexical variations, because it is difficult for the model to learn to substitute words with equivalents using only non-parallel data. We leave enabling word-level or phrase-level variations for creating more diverse adversarial examples to future work.

Conclusions
[Table 6: Paraphrases generated by SUP under different syntactic templates.]
original: still, as a visual treat, the film is almost unsurpassed.
( S ( S ) ( , ) ( CC ) ( S ) ): the film is a visual treat, but almost unsurpassed.
( S ( PP ) ( , ) ( NP ) ( VP ) ): as a visual treat, the film is almost unsurpassed.
( S ( ADVP ) ( , ) ( NP ) ( VP ) ): still, the film is almost unsurpassed as a film.
original: it proves quite compelling as an intense, brooding character study.
( S ( PP ) ( , ) ( NP ) ( VP ) ( . ) ): as compelling, it proves quite an intense character study.
( S ( NP ) ( ADVP ) ( VP ) ( . ) ): it still proves compelling as an intense character study.
( S ( CC ) ( NP ) ( VP ) ): but it proves compelling as an intense character study.
( S ( ADVP ) ( , ) ( NP ) ( VP ) ( . ) ): however, it proves quite compelling as an intense.
original: though only 60 minutes long, the film is packed with information and impressions.
( S ( NP ) ( VP ) ): the film is only 60 minutes long and packed with information and impressions.
( S ( S ) ( , ) ( CC ) ( S ) ): only 60 minutes long, and the film is packed with information.
original: this film seems thirsty for reflection, itself taking on adolescent qualities.
( S ( NP ) ( VP ) ): this film seems thirsty for taking on adolescent qualities.
( S ( CC ) ( NP ) ( VP ) ): but this film seems thirsty for taking on adolescent qualities.

We have presented an unsupervised syntactically-informed paraphrasing model based on a conditional VAE and a two-stage training process. We first train the conditional VAE model to generate sentences in desired syntactic structures. To further improve the syntactic controllability and semantic consistency of generated sentences, we introduce syntax controlling and cycle reconstruction objective functions to fine-tune the pre-trained model. Experiments show that our model achieves strong improvements over baselines in the unsupervised setting and can generate syntactically controlled paraphrases.
Furthermore, adversarial example generation experiments also validate that our model is able to generate syntactically adversarial examples for sentiment analysis, which can be used to improve the robustness of the sentiment classifier model via adversarial training.