Factorising Meaning and Form for Intent-Preserving Paraphrasing

We propose a method for generating paraphrases of English questions that retain the original intent but use a different surface form. Our model combines a careful choice of training objective with a principled information bottleneck, to induce a latent encoding space that disentangles meaning and form. We train an encoder-decoder model to reconstruct a question from a paraphrase with the same meaning and an exemplar with the same surface form, leading to separated encoding spaces. We use a Vector-Quantized Variational Autoencoder to represent the surface form as a set of discrete latent variables, allowing us to use a classifier to select a different surface form at test time. Crucially, our method does not require access to an external source of target exemplars. Extensive experiments and a human evaluation show that we are able to generate paraphrases with a better tradeoff between semantic preservation and syntactic novelty compared to previous methods.


Introduction
A paraphrase of an utterance is "an alternative surface form in the same language expressing the same semantic content as the original form" (Madnani and Dorr, 2010). For questions, a paraphrase should have the same intent, and should lead to the same answer as the original, as in the examples in Table 1. Question paraphrases are of significant interest, with applications in data augmentation (Iyyer et al., 2018), query rewriting (Dong et al., 2017) and duplicate question detection (Shah et al., 2018), as they allow a system to better identify the underlying intent of a user query.
Recent approaches to paraphrasing use information bottlenecks with VAEs (Bowman et al., 2016) or pivot languages to try to extract the semantics of an input utterance, before projecting back to a (hopefully different) surface form. However, these methods have little to no control over the preservation of the input meaning or variation in the output surface form. Other work has specified the surface form to be generated (Iyyer et al., 2018; Chen et al., 2019a; Kumar et al., 2020), but has so far assumed that the set of valid surface forms is known a priori.

How is a dialect different from a language?
The differences between language and dialect?
What is the difference between language and dialect?

What is the weight of an average moose?
Average weight of the moose?
How much do moose weigh?
How heavy is a moose?

What country do parrots live in?
In what country do parrots live?
Where do parrots naturally live?
What part of the world do parrots live in?

Table 1: Examples of question paraphrase clusters, drawn from Paralex (Fader et al., 2013). Each member of the cluster has essentially the same semantic intent, but a different surface form. Each cluster exhibits variation in word choice, syntactic structure and even question type. Our task is to generate these different surface forms, using only a single example as input.
In this paper, we propose SEPARATOR, a method for generating paraphrases that exhibit high variation in surface form while still retaining the original intent. Our key innovations are: (a) to train a model to reconstruct a target question from an input paraphrase with the same meaning, and an exemplar with the same surface form, and (b) to separately encode the form and meaning of questions as discrete and continuous latent variables respectively, enabling us to modify the output surface form while preserving the original question intent. Crucially, unlike prior work on syntax controlled paraphrasing, we show that we can generate diverse paraphrases of an input question at test time by inferring a different discrete syntactic encoding, without needing access to reference exemplars.
Figure 1: Overview of our approach. The model is trained to reconstruct a target question from one input with the same meaning and another input with the same form. This induces separate latent encoding spaces for meaning and form, allowing us to vary the output form while keeping the meaning constant. Using a discretized space for the syntactic encoding makes it tractable to predict valid surface forms at test time.

We limit our work to English questions for three reasons: (a) the concept of a paraphrase is more
clearly defined for questions compared to generic utterances, as question paraphrases should lead to the same answer; (b) the space of possible surface forms is smaller for questions, making the task more achievable; and (c) better dataset availability. However, our approach does not otherwise make any assumptions specific to questions.

Problem Formulation
The task is to learn a mapping from an input question, represented as a sequence of tokens X, to paraphrase(s) Y which have different surface form to X, but convey the same intent.
Our proposed approach, which we call SEPARATOR, uses an encoder-decoder model to transform an input question into a latent encoding space, and then back to an output paraphrase. We hypothesize that a principled information bottleneck (Section 2.1) and a careful choice of training scheme (Section 2.2) lead to an encoding space that separately represents the intent and surface form. This separation enables us to paraphrase the input question, varying the surface form of the output by directly manipulating the syntactic encoding of the input and keeping the semantic encoding constant (Section 2.3). We assume access to reference paraphrase clusters during training (e.g., Table 1), sets of questions with different surface forms that have been collated as having the same meaning or intent.
Our model is a variant of the standard encoder-decoder framework (Cho et al., 2014), and consists of: (a) a vanilla Transformer sentence encoder (Vaswani et al., 2017), that maps an input question X to a multi-head sequence of encodings, e_{h,t} = ENCODER(X); (b) a principled choice of information bottleneck, with a continuous variational path and a discrete vector-quantized path, that maps the encoding sequence to a pair of latent vectors, z_sem, z_syn = BOTTLENECK(e_{h,t}), represented in more detail in Figure 1; (c) a vanilla Transformer decoder, that attends over the latent vectors to generate a sequence of output tokens, Y = DECODER(z_sem, z_syn). The separation between z_sem and z_syn is induced by our proposed training scheme, shown in Figure 1 and described in detail in Section 2.2.

Model Architecture
While the encoder and decoder used by the model are standard Transformer modules, our bottleneck is more complex and we now describe it in more detail.
Let the encoder output be {e_{h,1}, ..., e_{h,|X|}} = ENCODER(X), where e_{h,t} ∈ R^{D/H_T}, h ∈ 1, ..., H_T, with H_T the number of transformer heads, |X| the length of the input sequence and D the dimension of the transformer. We first pool this sequence of encodings to a single vector, using the multi-head pooling described in Liu and Lapata (2019). For each head h, we calculate a distribution over time indexes α_{h,t} using attention,

α_{h,t} = softmax_t(k_h · e_{h,t}),

with k_h ∈ R^{D/H} a learned parameter.
We then take a weighted average of a linear projection of the encodings, to give the pooled output ẽ_h,

ẽ_h = Σ_t α_{h,t} V_h e_{h,t},

with V_h ∈ R^{D/H × D/H} a learned parameter. Transformer heads are assigned either to a semantic group H_sem, that will be trained to encode the intent of the input, ẽ_sem = [...; ẽ_h; ...], h ∈ H_sem, or to a syntactic group H_syn, that will be trained to represent the surface form, ẽ_syn = [...; ẽ_h; ...], h ∈ H_syn (see Figure 1).
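As a concrete illustration, the pooling step can be sketched in plain Python. This is a minimal sketch, not the actual implementation: the parameter names (`keys` for k_h, `projs` for V_h) and list-based representation are our own, and a real implementation would use batched tensor operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def multi_head_pool(enc, keys, projs):
    """Pool per-head token encodings e_{h,t} into one vector per head.

    enc[h][t] : encoding of token t under head h (list of floats, dim D/H)
    keys[h]   : learned key k_h used to score each position
    projs[h]  : learned projection V_h, given as a list of rows
    Returns one pooled vector e~_h per head.
    """
    pooled = []
    for h, head_enc in enumerate(enc):
        # attention weights alpha_{h,t} = softmax_t(k_h . e_{h,t})
        scores = [sum(k * e for k, e in zip(keys[h], tok)) for tok in head_enc]
        alpha = softmax(scores)
        # weighted average of projected encodings: sum_t alpha_{h,t} V_h e_{h,t}
        dim = len(head_enc[0])
        vec = [0.0] * dim
        for a, tok in zip(alpha, head_enc):
            proj = [sum(r * e for r, e in zip(row, tok)) for row in projs[h]]
            for i in range(dim):
                vec[i] += a * proj[i]
        pooled.append(vec)
    return pooled
```

With a zero key (uniform attention) and an identity projection, the pooled output is simply the mean of the token encodings, which is a useful sanity check.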
The space of possible question intents is extremely large and may be reasonably approximated by a continuous vector space. However, the possible surface forms are discrete and smaller in number. We therefore use a Vector-Quantized Variational Autoencoder (VQ-VAE, van den Oord et al., 2017) for the syntactic encoding z syn , and model the semantic encoding z sem as a continuous Gaussian latent variable, as shown in the upper and lower parts of Figure 1, respectively.
Vector Quantization Let q_h be discrete latent variables corresponding to the syntactic quantizer heads, h ∈ H_syn. (The number and dimensionality of the quantizer heads need not be the same as the number of transformer heads.) Each variable can take one of K possible latent codes, q_h ∈ {1, ..., K}. The heads use distinct codebooks, C_h ∈ R^{K × D/H}, which map each discrete code to a continuous embedding C_h(q_h) ∈ R^{D/H}. Given a sentence X and its pooled encodings {ẽ_1, ..., ẽ_H}, we independently quantize the syntactic subset of the heads h ∈ H_syn to their nearest codes from C_h and concatenate, giving the syntactic encoding

z_syn = [...; C_h(q_h); ...], q_h = argmin_k ||ẽ_h − C_h(k)||_2, h ∈ H_syn.

The quantizer module is trained through backpropagation using straight-through estimation (Bengio et al., 2013), with an additional loss term to constrain the embedding space as described in van den Oord et al. (2017),

L_VQ = λ Σ_{h ∈ H_syn} ||ẽ_h − sg(C_h(q_h))||²_2,

where the stop-gradient operator sg(·) is defined as the identity during forward computation and zero on backpropagation, and λ is a weight that controls the strength of the constraint. We follow the soft EM and exponential moving average training approaches described in earlier work (Angelidis et al., 2021), which we find improve training stability.
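The nearest-code lookup at the heart of the quantizer can be sketched as follows. This is an illustrative sketch only: `codebooks` is a plain-list stand-in for the learned C_h, and the straight-through gradient copy and EMA codebook updates used during training are omitted.

```python
def quantize(pooled, codebooks):
    """Map each syntactic head's pooled vector to its nearest codebook entry.

    pooled[h]    : pooled encoding e~_h for head h
    codebooks[h] : list of K code embeddings C_h(0), ..., C_h(K-1)
    Returns (codes, z_syn): chosen indices q_h and concatenated embeddings.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    codes, z_syn = [], []
    for vec, book in zip(pooled, codebooks):
        # q_h = argmin_k ||e~_h - C_h(k)||^2
        q = min(range(len(book)), key=lambda k: sq_dist(vec, book[k]))
        codes.append(q)
        z_syn.extend(book[q])  # concatenate code embeddings across heads
    return codes, z_syn
```

In training, gradients would be copied straight through from z_syn to the pooled encodings, since the argmin itself is not differentiable.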
Variational Bottleneck For the semantic path, we introduce a learned Gaussian posterior that represents the encodings as smooth distributions in space instead of point estimates (Kingma and Welling, 2014),

q(z_h | X) = N(µ(ẽ_h), σ(ẽ_h)), h ∈ H_sem,

where µ(·) and σ(·) are learned linear transformations. To avoid vanishingly small variance and to encourage a smooth distribution, a prior is introduced, p(z_h) ∼ N(0, 1). The VAE objective is the standard evidence lower bound (ELBO), given by

ELBO = E_{q(z|X)}[log p(X | z)] − KL[q(z | X) || p(z)].

We use the usual Gaussian reparameterisation trick, and approximate the expectation in the ELBO by sampling from the training set and updating via backpropagation (Kingma and Welling, 2014). The VAE component therefore only adds an additional KL term to the overall loss,

L_KL = Σ_{h ∈ H_sem} KL[q(z_h | X) || p(z_h)].

In sum, BOTTLENECK(e_{h,t}) maps a sequence of token encodings to a pair of vectors z_sem, z_syn, with z_sem a continuous latent Gaussian, and z_syn a combination of discrete code embeddings.
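The reparameterised sample and the closed-form KL term against the N(0, 1) prior can be sketched as below. The log-variance parameterisation is our assumption for the sketch; any equivalent parameterisation of σ works.

```python
import math
import random

def sample_semantic(mu, log_var, rng=random):
    """Reparameterised sample z = mu + sigma * eps, with eps ~ N(0, 1),
    so gradients can flow through mu and log_var."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

The KL term is zero exactly when the posterior matches the prior (mu = 0, sigma = 1), which is the behaviour the regulariser encourages.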

Factorised Reconstruction Objective
We now describe the training scheme that causes the model to learn separate encodings for meaning and form: z_sem should encode only the intent of the input, while z_syn should capture any information about the surface form of the input. Although we refer to z_syn as the syntactic encoding, it will not necessarily correspond to any specific syntactic formalism. We also acknowledge that meaning and form are not completely independent of each other; arbitrarily changing the form of an utterance is likely to change its meaning. However, it is possible for the same intent to have multiple phrasings, and it is this 'local independence' that we intend to capture.
We create triples {X sem , X syn , Y}, where X sem has the same meaning but different form to Y (i.e., it is a paraphrase, as in Table 1) and X syn is a question with the same form but different meaning  (i.e., it shares the same syntactic template as Y), which we refer to as an exemplar. We describe the method for retrieving these exemplars in Section 2.3. The model is then trained to generate a target paraphrase Y from the semantic encoding z sem of the input paraphrase X sem , and from the syntactic encoding z syn of the exemplar X syn , as demonstrated in Figure 1.
Recalling the additional losses from the variational and quantized bottlenecks, the final combined training objective is given by

L = L_Y(X_sem, X_syn) + L_VQ + L_KL,

where L_Y(X_sem, X_syn) is the cross-entropy loss of teacher-forcing the decoder to generate Y from z_sem(X_sem) and z_syn(X_syn).
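A sketch of how the terms combine, with the reconstruction term made explicit. The function names are our own, and `token_probs` is a stand-in for the decoder's softmax outputs under teacher forcing.

```python
import math

def cross_entropy(token_probs, target_ids):
    """Teacher-forced reconstruction loss L_Y: negative log-likelihood of
    the target tokens Y under the decoder's predicted distributions.

    token_probs[t] : predicted distribution over the vocabulary at step t
    target_ids[t]  : index of the gold token at step t
    """
    return -sum(math.log(dist[t]) for dist, t in zip(token_probs, target_ids))

def combined_loss(l_y, l_vq, l_kl):
    # Final objective: reconstruction + quantizer constraint + KL term
    return l_y + l_vq + l_kl
```

In practice all three terms are computed on the same batch and backpropagated jointly.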

Exemplars
It is important to note that not all surface forms are valid or licensed for all question intents. As shown in Figure 1, our approach requires exemplars during training to induce the separation between latent spaces. We also need to specify the desired surface form at test time, either by supplying an exemplar as input or by directly predicting the latent codes. The output should have a different surface form to the input but remain fluent.
Exemplar Construction During training, we retrieve exemplars X syn from the training data following a process which first identifies the underlying syntax of Y, and finds a question with the same syntactic structure but a different, arbitrary meaning. We use a shallow approximation of syntax, to ensure the availability of equivalent exemplars in the training data. An example of the exemplar retrieval process is shown in Table 2; we first apply a chunker (FlairNLP, Akbik et al., 2018) to Y, then extract the chunk label for each tagged span, ignoring stopwords. This gives us the template that Y follows. We then select a question at random from the training data with the same template to give X syn . If no other questions in the dataset use this template, we create an exemplar by replacing each chunk with a random sample of the same type. We experimented with a range of approaches to determining question templates, including using part-of-speech tags and (truncated) constituency parses. We found that using chunks and preserving stopwords gave a reasonable level of granularity while still combining questions with a similar form. The templates (and corresponding exemplars) need to be granular enough that the model is forced to use them, but abstract enough that the task is not impossible to learn.
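The retrieval procedure can be sketched as follows, assuming the chunker output is already available as (token, label) pairs. The stopword list and chunk labels here are illustrative stand-ins, not the exact ones used in our experiments, and the fallback of resampling chunk contents is omitted.

```python
import random

# Illustrative stopword list; the real one would be larger.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "do", "how", "in", "?"}

def template(chunked_question):
    """Build a shallow syntactic template from a chunked question.

    chunked_question: list of (token, chunk_label) pairs, e.g. produced by
    a chunker such as FlairNLP. Stopwords are kept verbatim; other tokens
    are replaced by their chunk label.
    """
    out = []
    for token, label in chunked_question:
        out.append(token.lower() if token.lower() in STOPWORDS else label)
    return " ".join(out)

def retrieve_exemplar(target_chunks, training_set, rng=random):
    """Pick a random training question whose template matches the target's.

    training_set: list of (question_text, chunks) pairs.
    Returns None if no other question shares the template.
    """
    t = template(target_chunks)
    candidates = [q for q, chunks in training_set
                  if template(chunks) == t and chunks != target_chunks]
    return rng.choice(candidates) if candidates else None
```

For example, "What is the weight of a moose?" and "What is the size of a bear?" share the template "what is the NP of a NP ?", so either can serve as an exemplar for the other.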
Prediction at Test Time In general, we do not assume access to reference exemplars at test time and yet the decoder must generate a paraphrase from semantic and syntactic encodings. Since our latent codes are separated, we can directly predict the syntactic encoding, without needing to retrieve or generate an exemplar. Furthermore, by using a discrete representation for the syntactic space, we reduce this prediction problem to a simple classification task. Formally, for an input question X, we learn a distribution over licensed discrete codes q_h, h ∈ H_syn. We assume that the heads are independent, so that p(q_1, ..., q_{H_syn}) = Π_i p(q_i). We use a small fully connected network with the semantic and syntactic encodings of X as inputs, giving p(q_h | X) = MLP(z_sem(X), z_syn(X)).
The network is trained to maximize the likelihood of all other syntactic codes licensed by each input. We calculate the discrete syntactic codes for each question in a paraphrase cluster, and minimize the cross-entropy loss of the network with respect to these codes. At test time, we set each q_h to its most probable value under this distribution, q_h = argmax_k p(q_h = k | X).
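Under the independence assumption, test-time code prediction reduces to an argmax per head over the classifier's outputs. A minimal sketch, with the MLP abstracted away as precomputed per-head logits:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def predict_codes(logits_per_head):
    """Independently predict each head's syntactic code.

    logits_per_head[h][k]: classifier score for head h taking code k,
    produced by an MLP over (z_sem(X), z_syn(X)). Because we assume
    p(q_1, ..., q_H) = prod_h p(q_h), each argmax can be taken separately.
    """
    probs = [softmax(l) for l in logits_per_head]
    return [max(range(len(p)), key=p.__getitem__) for p in probs]
```

Sampling from each per-head distribution instead of taking the argmax would yield multiple candidate surface forms for the same input.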

Experimental Setup
Datasets We evaluate our approach on two datasets: Paralex (Fader et al., 2013), a dataset of question paraphrase clusters scraped from WikiAnswers; and Quora Question Pairs (QQP), sourced from the community question answering forum Quora. We observed that a significant fraction of the questions in Paralex included typos or were ungrammatical. We therefore filter out any questions marked as non-English by a language detection script (Lui and Baldwin, 2012), then pass the questions through a simple spellchecker. While this destructively edited some named entities in the questions, it did so in a consistent way across the whole dataset. There is no canonical split for Paralex, so we group the questions into clusters of paraphrases, and split these clusters into train/dev/test partitions with weighting 80/10/10. Similarly, QQP does not have a public test set. We therefore partitioned the clusters in the validation set randomly in two, to give us our dev/test splits. Summary statistics of the resulting datasets are given in Appendix B. All reported scores are on our test split.
Model Configuration Following previous work (Kaiser et al., 2018;Angelidis et al., 2021), our quantizer uses multiple heads (H = 4) with distinct codebooks to represent the syntactic encoding as 4 discrete categorical variables q h , with z syn given by the concatenation of their codebook embeddings C h (q h ). We use a relatively small codebook size of K = 256, relying on the combinatoric power of the multiple heads to maintain the expressivity of the model. We argue that, assuming each head learns to capture a particular property of a template (see Section 4.3), the number of variations in each property is small, and it is only through combination that the space of possible templates becomes large.
We include a detailed list of hyperparameters in Appendix A. Our code is available at http:// github.com/tomhosking/separator.

Comparison Systems
We compare SEPARATOR against several related systems. These include a model which reconstructs Y only from X_sem, with no signal for the desired form of the output. In other words, we derive both z_sem and z_syn from X_sem, and no separation between meaning and form is learned. This model uses a continuous Gaussian latent variable for both z_syn and z_sem, but is otherwise equivalent in architecture to SEPARATOR. We refer to this as the VAE baseline. We also experiment with a vanilla autoencoder or AE baseline by removing the variational component, such that z_sem, z_syn = ẽ_sem, ẽ_syn.
We include our own implementation of the VQ-VAE model described in Roy and Grangier (2019). They use a quantized bottleneck for both z_sem and z_syn, with a large codebook K = 64,000, H = 8 heads and a residual connection within the quantizer. For QQP, containing only 55,611 training clusters, the configuration in Roy and Grangier (2019) leaves the model overparameterized and training did not converge; we instead report results for K = 1,000.
ParaNMT translates input sentences into a pivot language (Czech), then back into English. Although this system was trained on high volumes of data (including Common Crawl), the training data contains relatively few questions, and we would not expect it to perform well in the domain under consideration. 'Diverse Paraphraser using Submodularity' (DiPS; Kumar et al., 2019) uses submodular optimisation to increase the diversity of samples from a standard encoder-decoder model. Latent bag-of-words (BoW; Fu et al., 2019) uses an encoder-decoder model with a discrete bag-of-words as the latent encoding. SOW/REAP (Goyal and Durrett, 2020) uses a two-stage approach, deriving a set of feasible syntactic rearrangements that is used to guide a second encoder-decoder model. We additionally implement a simple tf-idf baseline (Jones, 1972), retrieving the question from the training set with the highest similarity to the input. Finally, we include a basic copy baseline as a lower bound, that simply uses the input question as the output.

Results
Our experiments were designed to answer three questions: (a) Does SEPARATOR effectively factorize meaning and form? (b) Does SEPARATOR manage to generate diverse paraphrases (while preserving the intent of the input)? (c) What does the underlying quantized space encode (i.e., can we identify any meaningful syntactic properties)? We address each of these questions in the following sections.

Table 4: Generation results, without access to oracle exemplars. Our approach achieves the highest iBLEU scores, indicating the best tradeoff between output diversity and fidelity to the reference paraphrases.

Verification of Separation
Inspired by Chen et al. (2019b) we use a semantic textual similarity task and a template detection task to confirm that SEPARATOR does indeed lead to encodings {z sem , z syn } in latent spaces that represent different types of information.
Using the test set, we construct clusters of questions that share the same meaning, C_sem, and clusters that share the same template, C_syn. For each cluster C_q ∈ {C_sem, C_syn}, we extract one question at random, X_q ∈ C_q, compute its encodings {z_sem, z_syn, z}, and its cosine similarity to the encodings of all other questions in the test set. We take the question with maximum similarity to the query, X_r, r = argmax_r (z_q · z_r), and compare the cluster that it belongs to, C_r, to the query cluster, I(C_q = C_r), giving a retrieval accuracy score for each encoding type and each clustering type. For the VAE, we set {z_sem, z_syn} to be the same heads of z as for the separated model. Table 3 shows that our approach yields encodings that successfully factorise meaning and form, with negligible performance loss compared to the VAE baseline; paraphrase retrieval performance using z_sem for the separated model is comparable to using z for the VAE.
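The retrieval accuracy measure can be sketched as below. This is a simplified sketch: encodings are plain lists, cluster identifiers stand in for the check I(C_q = C_r), and the query is excluded from its own candidate pool by object identity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def retrieval_accuracy(queries, pool):
    """Fraction of queries whose nearest neighbour (by cosine similarity)
    belongs to the same cluster.

    queries, pool: lists of (encoding, cluster_id) pairs; the encoding can
    be z_sem, z_syn, or the full z, giving one score per encoding type.
    """
    hits = 0
    for enc, cluster in queries:
        # nearest neighbour in the pool, excluding the query itself
        best = max((p for p in pool if p[0] is not enc),
                   key=lambda p: cosine(enc, p[0]))
        hits += int(best[1] == cluster)
    return hits / len(queries)
```

Running this once with z_sem against meaning clusters and once with z_syn against template clusters yields the two halves of the separation check.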

Paraphrase Generation
Automatic Evaluation While we have shown that our approach leads to disentangled representations, we are ultimately interested in generating diverse paraphrases for unseen data. That is, given some input question, we want to generate an output question with the same meaning but different form.
We use iBLEU (Sun and Zhou, 2012) as our primary metric, a variant of BLEU (Papineni et al., 2002; Post, 2018) that is penalized by the similarity between the output and the input,

iBLEU = α · BLEU(output, references) − (1 − α) · BLEU(output, input),

where α = 0.7 is a constant that weights the tradeoff between fidelity to the references and variation from the input. We also report the usual BLEU(output, references) as well as Self-BLEU(output, input). The latter allows us to examine whether the models are making trivial changes to the input. The Paralex test set contains 5.6 references on average per cluster, while QQP contains only 1.3. This leads to lower BLEU scores for QQP in general, since the models are evaluated on whether they generated the specific paraphrase(s) present in the dataset. Table 4 shows that the Copy, VAE and AE models display relatively high BLEU scores, but achieve this by 'parroting' the input; they are good at reconstructing the input, but introduce little variation in surface form, reflected in their high Self-BLEU scores. This highlights the importance of considering similarity to both the references and the input. The tf-idf baseline performs surprisingly well on Paralex; the large dataset size makes it more likely that a paraphrase cluster with a similar meaning to the query exists in the training set. The other comparison systems (in the second block of Table 4) achieve lower Self-BLEU scores, indicating a higher degree of variation introduced, but this comes at the cost of much lower scores with respect to the references. SEPARATOR achieves the highest iBLEU scores, indicating the best balance between fidelity to the references and novelty compared to the input. We give some example output in Table 5; while the other systems mostly introduce lexical variation, SEPARATOR is able to produce output with markedly different syntactic structure to the input, and can even change the question type while successfully preserving the original intent.
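Given the two BLEU scores from an external BLEU implementation, the iBLEU computation itself is a simple weighted difference, sketched here:

```python
def ibleu(bleu_refs, self_bleu, alpha=0.7):
    """iBLEU (Sun and Zhou, 2012): reward similarity to the references
    while penalising similarity to the input.

    bleu_refs : BLEU(output, references)
    self_bleu : BLEU(output, input), a.k.a. Self-BLEU
    alpha     : tradeoff constant (0.7 in our experiments)
    """
    return alpha * bleu_refs - (1 - alpha) * self_bleu
```

A system that copies its input scores high on both terms, so its iBLEU is dragged down by the Self-BLEU penalty, which is exactly the 'parroting' behaviour the metric is designed to expose.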
The last row in Table 4 (ORACLE) reports results when our model is given a valid exemplar to use directly for generation, thus bypassing the code prediction problem. For each paraphrase cluster, we select one question at random to use as input, and select another to use as the target. We retrieve a question from the training set with the same template as the target to use as an oracle exemplar. This represents an upper bound on our model's performance. While SEPARATOR outperforms existing methods, our approach to predicting syntactic codes (using a shallow fully-connected network) is relatively simple. SEPARATOR using oracle exemplars achieves by far the highest scores in Table 4, demonstrating the potential expressivity of our approach when exemplars are guaranteed to be valid. A more powerful code prediction model could close the gap to this upper bound, as well as enabling the generation of multiple diverse paraphrases for a single input question. However, we leave this to future work.
Human Evaluation In addition to automatic evaluation we elicited judgements from crowdworkers on Amazon Mechanical Turk. Specifically, they were shown a question and two paraphrases thereof (corresponding to different systems) and asked to select which one was preferred along three dimensions: the dissimilarity of the paraphrase compared to the original question, how well the paraphrase reflected the meaning of the original, and the fluency of the paraphrase (see Appendix C). We evaluated a total of 200 questions sampled equally from both Paralex and QQP, and collected 3 ratings for each sample. We assigned each system a score of +1 when it was selected, −1 when the other system was selected, and took the mean over all samples. Negative scores indicate that a system was selected less often than an alternative. We chose the four best performing models according to Table 4 for our evaluation: SEPARATOR, DiPS (Kumar et al., 2019), Latent BoW (Fu et al., 2019) and VAE. Figure 2 shows that although the VAE baseline is the best at preserving question meaning, it is also the worst at introducing variation to the output. SEPARATOR introduces more variation than the other systems evaluated and better preserves the original question intent, as well as generating significantly more fluent output (using a one-way ANOVA with post-hoc Tukey HSD test, p<0.05).

Analysis
When predicting latent codes at test time, we assume that the code for each head may be predicted independently of the others, as working with the full joint distribution would be intractable. We now examine this assumption as well as whether different encodings represent distinct syntactic properties.

Figure 2: Human evaluation results. Although the VAE baseline is the best at preserving question meaning, it is the worst at introducing variation to the output. SEPARATOR offers the best balance between dissimilarity and meaning preservation, and is more fluent than both DiPS and Latent BoW.

Following Angelidis et al. (2021), we compute the probability of a question property f_1, f_2, ... taking a particular value a, conditioned on head h and quantized code k_h,

p(f_i = a | h, k_h) = Σ_X I(f_i(X) = a, q_h(X) = k_h) / Σ_X I(q_h(X) = k_h),

where I(·) is the indicator function, and examples of values a are shown in Figure 3. We then calculate the mean entropy of these distributions, to determine how property-specific each head is:

H̄(h, f_i) = (1/K) Σ_{k_h} H[p(f_i | h, k_h)].

Heads with lower entropies are more predictive of a property, indicating specialisation and therefore independence. Figure 3 shows our analysis for four syntactic properties: head #2 has learned to control the high-level output structure, including the question type or wh-word, and whether the question word appears at the beginning or end of the question. Head #3 controls which type of prepositional phrase is used. The length of the output is not determined by any one head, implying that it results from other properties of the surface form. Future work could leverage this disentanglement to improve the exemplar prediction model, and could lead to more fine-grained control over the generated output form.
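The entropy analysis for a single head and property can be sketched as below. This is an illustrative sketch: the averaging is over codes actually observed in the corpus, and the base-2 logarithm is our choice for the sketch.

```python
import math
from collections import Counter, defaultdict

def head_property_entropy(samples):
    """Mean entropy of p(property value | head code) over observed codes.

    samples: list of (code, value) pairs for one head and one property,
    e.g. (q_2(X), wh_word(X)) for every question X in the corpus.
    A lower mean entropy means the head's code is more predictive of
    the property, i.e. the head has specialised.
    """
    by_code = defaultdict(Counter)
    for code, value in samples:
        by_code[code][value] += 1
    entropies = []
    for counts in by_code.values():
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total)
                 for c in counts.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)
```

A head whose codes perfectly determine the wh-word scores entropy 0, while a head whose codes are uninformative about it scores close to the entropy of the marginal distribution.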
In summary, we find that SEPARATOR successfully learns separate encodings for meaning and form. SEPARATOR is able to generate question paraphrases with a better balance of diversity and intent preservation compared to prior work. Although we are able to identify some high-level properties encoded by each of the syntactic latent variables, further work is needed to learn interpretable syntactic encodings.

Related Work
Paraphrasing Prior work on generating paraphrases has looked at extracting sentences with similar meaning from large corpora (Barzilay and McKeown, 2001; Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013), or at identifying paraphrases from sources that are weakly aligned (Dolan et al., 2004; Coster and Kauchak, 2011).
More recently, neural approaches to paraphrasing have shown promise. Several models have used an information bottleneck to try to encode the semantics of the input, including VAEs (Bowman et al., 2016), VQ-VAEs (van den Oord et al., 2017; Roy and Grangier, 2019), and a latent bag-of-words model (Fu et al., 2019). Other work has relied on the strength of neural machine translation models, translating an input into a pivot language and then back into English (Hu et al., 2019). Kumar et al. (2019) use submodular function maximisation to improve the diversity of paraphrases generated by an encoder-decoder model. Dong et al. (2017) use an automatic paraphrasing system to rewrite inputs to a question answering system at inference time, reducing the sensitivity of the system to the specific phrasing of a query.

Syntactic Templates
The idea of generating paraphrases by controlling the structure of the output has seen recent interest, but most work so far has assumed access to a template oracle. Iyyer et al. (2018) use linearized parse trees as templates, then sample paraphrases by using multiple templates and reranking the output. Chen et al. (2019a) use a multi-task objective to train a model to generate output that follows an input template. Their approach is limited by their use of automatically generated paraphrases for training, and by their reliance on the availability of oracle templates. Bao et al. (2019) use a discriminator to separate the encoding spaces, but rely on noising the latent space to induce variation in the output form. Their results show good fidelity to the references, but low variation compared to the input. Goyal and Durrett (2020) use the artificially generated dataset ParaNMT-50m for their training and evaluation, which displays low output variation according to our results. Kumar et al. (2020) show strong performance using full parse trees as templates, but focus on generating output with the correct parse and do not consider the problem of template prediction.
Huang and Chang (2021) independently and concurrently propose training a model with a similar 'split training' approach to ours, but using constituency parses instead of exemplars, and a 'bag-of-words' instead of reference paraphrases. Their approach has the advantage of not requiring paraphrase clusters during training, but they do not attempt to solve the problem of template prediction and rely on the availability of oracle target templates. Russin et al. (2020) modify the architecture of an encoder-decoder model, introducing an inductive bias to encode the structure of inputs separately from the lexical items to improve compositional generalisation on an artificial semantic parsing task. Chen et al. (2019b) use a multi-task setup to generate separated encodings, but do not experiment with generation tasks. Shu et al. (2019) learn discrete latent codes to introduce variation to the output of a machine translation system.

Conclusion
We present SEPARATOR, a method for generating paraphrases that balances high variation in surface form with strong intent preservation. Our approach consists of: (a) a training scheme that causes an encoder-decoder model to learn separated latent encodings, (b) a vector-quantized bottleneck that results in discrete variables for the syntactic encoding, and (c) a simple model to predict different yet valid surface forms for the output. Extensive experiments and a human evaluation show that our approach leads to separated encoding spaces with negligible loss of expressivity, and is able to generate paraphrases with a better balance of variation and semantic fidelity than prior methods.
In future, we would like to investigate the properties of the syntactic encoding space, and improve on the code prediction model. It would also be interesting to reduce the levels of supervision required to train the model, and induce the separation without an external syntactic model or reference paraphrases.

A Hyperparameters
Hyperparameters were selected by manual tuning, based on a combination of: (a) validation encoding separation, (b) validation BLEU scores using oracle exemplars, and (c) validation iBLEU scores using predicted syntactic codes.
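The iBLEU criterion used in (c) combines fidelity to the reference with dissimilarity from the input; a common formulation is α·BLEU(output, reference) − (1−α)·BLEU(output, input). The sketch below uses a deliberately simplified, unsmoothed sentence-level BLEU for illustration only, and is not our evaluation implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions with a brevity penalty (no smoothing; illustrative only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())       # clipped n-gram matches
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def ibleu(candidate, reference, source, alpha=0.8):
    """iBLEU: rewards similarity to the reference while penalising
    similarity to the source, trading off fidelity against novelty."""
    return alpha * bleu(candidate, reference) - (1 - alpha) * bleu(candidate, source)
```

An output that copies the input verbatim is penalised by the second term, which is exactly the failure mode that plain BLEU cannot detect.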

B Dataset Statistics
Summary statistics for our partitions of Paralex (Fader et al., 2013) and QQP are shown in Table 7. Questions in QQP were 9.7 tokens long on average, compared to 8.2 for Paralex. We also show the distribution of different question types in Figure 4; QQP contains a higher percentage of why questions, and we found that its questions tend to be more subjective compared to the predominantly factual questions in Paralex.
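Statistics of this kind are straightforward to compute; the sketch below uses a rough leading-wh-word heuristic for question typing, which is our illustrative simplification rather than the typing scheme behind Figure 4:

```python
from collections import Counter

def question_stats(questions):
    """Mean token length and question-type distribution for a list of
    whitespace-tokenized questions, keyed by the leading wh-word
    (a crude heuristic; real question typing is more involved)."""
    wh_words = {"what", "why", "how", "who", "where", "when", "which"}
    lengths, types = [], Counter()
    for q in questions:
        tokens = q.lower().split()
        lengths.append(len(tokens))
        first = tokens[0] if tokens else ""
        types[first if first in wh_words else "other"] += 1
    mean_len = sum(lengths) / len(lengths)
    total = sum(types.values())
    return mean_len, {t: c / total for t, c in types.items()}
```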

C Human Evaluation
Annotators were asked to rate the outputs according to the following criteria:
• To what extent is the meaning expressed in the original question preserved in the rewritten version, with no additional information added? Which of the questions generated by a system is likely to have the same answer as the original?
• Does the rewritten version use different words or phrasing to the original? You should choose the system that uses the most different words or word order.

D Reproducibility Notes
All experiments were run on a single Nvidia RTX 2080 Ti GPU. Training time for SEPARATOR was approximately 2 days on Paralex, and 1 day for QQP. SEPARATOR contains a total of 69,139,744 trainable parameters.

E Template Dropout
Early experiments showed that, while the model was able to separately encode meaning and form, the 'syntactic' encoding space showed little ordering. That is, local regions of the encoding space did not necessarily encode templates that co-occurred with each other in paraphrase clusters. We therefore propose template dropout, where exemplars X syn are replaced with probability p td = 0.3 by a question with a different template from the same paraphrase cluster. This is intended to provide the model with a signal about which templates are similar to each other, and thus reduce the distance between their encodings.
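Template dropout can be sketched as follows, assuming a mapping from each question in a paraphrase cluster to its template id; the data structures are illustrative, not those used in our implementation:

```python
import random

def apply_template_dropout(exemplar, cluster, p_td=0.3, rng=random):
    """Template dropout: with probability p_td, replace the exemplar with
    a question from the same paraphrase cluster that has a *different*
    template. `cluster` maps each question in the cluster to its
    template id (a hypothetical representation)."""
    if rng.random() < p_td:
        alternatives = [q for q, template in cluster.items()
                        if template != cluster[exemplar]]
        if alternatives:
            return rng.choice(alternatives)
    return exemplar
```

Because the decoder must still reconstruct the target from the swapped exemplar, the model is pushed to place co-occurring templates near each other in the syntactic encoding space.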
F Ordering of the Encoding Space
Figure 5 shows that the semantic encodings z_sem are tightly clustered by paraphrase, but the set of valid forms for each cluster overlaps significantly. In other words, regions of licensed templates for each input are not contiguous, and naively perturbing a syntactic encoding for an input question is not guaranteed to lead to a valid template. Template dropout, described in Appendix E, seems to improve the arrangement of the encoding space, but is not sufficient to allow us to 'navigate' the encoding space directly. The ability to induce an ordered encoding space and introduce syntactic diversity by simply perturbing the encoding would allow us to drop the template prediction network, and we hope that future work will build on this idea.

Figure 5: Visualisations of z_sem and z_syn using t-SNE (van der Maaten and Hinton, 2008), coloured by paraphrase cluster: (a) semantic encodings; (b) syntactic encodings. The semantic encodings are clustered by meaning, as expected, but there is little to no local ordering in the syntactic space; valid surface forms of a particular question do not necessarily have syntactic encodings near to each other.

G Failure Cases
A downside of our approach is the use of an information bottleneck; the model must learn to compress a full question into a single, fixed-length vector. This can lead to loss of information or corruption, with the output occasionally repeating words or generating a number that is slightly different to the correct one, as shown in Table 8.
We also occasionally observe instances of the well-documented posterior collapse phenomenon, where the decoder ignores the input encoding and generates a generic, high-probability sequence.