MCPG: A Flexible Multi-Level Controllable Framework for Unsupervised Paraphrase Generation



Introduction
Paraphrase generation aims to restate a given sentence in a way that conveys the same semantic meaning but uses a different expression form. Paraphrasing results benefit many downstream tasks, such as text classification (Wang et al., 2022; Chen et al., 2019a), question answering (Dong et al., 2017; Cheng et al., 2021), and semantic matching (Chen et al., 2022). Traditional approaches (Prakash et al., 2016; Gupta et al., 2018) require supervised training on large parallel corpora. However, manual annotation is expensive. Thus, unsupervised methods are welcomed in the absence of annotated datasets (Bowman et al., 2016; Wieting et al., 2017; Miao et al., 2019; Hegde and Patil, 2020; Meng et al., 2021; Shen et al., 2022).
In this paper, we focus on controllable unsupervised paraphrase generation, a promising direction in which some progress has been achieved. However, most existing works are either limited to supervised settings (Iyyer et al., 2018; Chen et al., 2019b; Kazemnejad et al., 2020; Huang and Chang, 2021; Bandel et al., 2022) or mainly control generation from a single perspective (e.g., syntactic diversity) in the unsupervised condition (Huang and Chang, 2021). Multi-level controllable approaches for unsupervised generation are still not well explored. To this end, we propose a simple and effective framework for unsupervised paraphrase generation called Multi-level Controllable Paraphrase Generator (MCPG), as shown in Figure 1. Overall, MCPG is controllable at three levels, i.e., global semantics, local lexicons, and universal styles. This approach is motivated by a commonly observed phenomenon: the human paraphrasing process is decomposable into different levels. For example, given a sentence, we first read it through to capture the general semantic meaning (global semantics). Then we read it over to identify factual keywords, such as names of persons and locations, which should be protected from tampering (local lexicons). Furthermore, if there are any exemplars for reference, we may optionally draw inspiration from them regarding word editing or sentence restructuring (universal styles). Finally, we rewrite the remaining parts as much as possible while preserving the original meaning.
Figure 1(a) shows the basic paradigm of MCPG tailored for the unsupervised scenario. We employ T5 (Raffel et al., 2020) as the Generator backbone to integrate both global semantic and local lexical constraints. The whole model is trained by reconstructing the input text, which is compatible with the denoising objective used for pretraining T5.
The global semantics is controlled by the semantic embedding. We show that enabling dropout inside the Semantic Encoder during inference works well for creating different semantic embeddings for the same sentence while keeping the semantic meaning. This provides helpful guidance for MCPG to generate paraphrases with various expression forms. Besides, we can further control the degree of diversity by tuning the dropout probability, and filter out bad semantic embeddings according to the cosine similarity between them and the dropout-disabled standard embedding of the same input.
The local lexicons are controlled by the keywords. Previous works (Zeng et al., 2019; Su et al., 2021; Yang et al., 2022) pre-specify keywords using either rule-based or model-based methods, but neglect the importance of factual entities, which are exactly the unchangeable words we must copy during paraphrasing. To this end, we adopt Named Entity Recognition tools to identify factual entities, such as names of persons and locations, as keywords. Fixing these keywords turns out to be an effective way to improve paraphrase quality.
In Figure 1(b), we offer an alternative option to constrain the universal style of outputs, which helps MCPG adapt to the target domain efficiently without training.
The universal styles are controlled by a transfer vector, which implicitly encodes the lexical and structural mapping mode between a pair of exemplar paraphrases retrieved from the training set of the target domain. This mode is similar to style transfer (Riley et al., 2021). We derive the transfer vector by subtracting the semantic embeddings of two parallel paraphrases, which requires no extra training. By adding the transfer vector to the semantic embedding of the input sentence, our model is encouraged to paraphrase in a tone more consistent with the target domain.
Our contributions are to: (1) present a multi-level controllable framework for unsupervised paraphrase generation that is flexible for domain adaptation; (2) propose a dropout-diversified semantic encoding method to control both the global semantics and output diversity; (3) explore the importance of factual entities for controlling local lexicons; (4) use a transfer vector based on target-domain exemplars to control universal styles in an efficient way that requires no extra training.

Related Work
Supervised Approaches Recent works on supervised paraphrase generation mainly focus on improving paraphrase quality using neural models (Prakash et al., 2016). An important direction is to make the generation process more controllable, either from a single perspective or in a hybrid way. One approach is to seek guidance from exemplars (Iyyer et al., 2018; Chen et al., 2019b; Goyal and Durrett, 2020; Kazemnejad et al., 2020; Yang et al., 2021). Another is to identify keywords to control lexicons (Zeng et al., 2019; Su et al., 2021). Fu et al. (2019) use predicted neighbors of source words to form the target paraphrase. Li et al. (2019) decompose the paraphrasing transformation into different granularity levels. Other hybrid methods include (Yang et al., 2022; Bandel et al., 2022). Although these models controlled by various constraints demonstrate promising performance, they require parallel data for training, which is labor-intensive and does not extend to other domains.
Unsupervised Approaches There are two main lines of work for unsupervised paraphrase generation: optimization-oriented and pivot-based methods. The goal of optimization-oriented methods (Miao et al., 2019; Liu et al., 2020; Siddique et al., 2020) is to find the optimal paraphrase by optimizing an objective function that considers semantic fidelity, expression diversity, language fluency, etc. Nonetheless, designing reliable metrics to automatically evaluate the quality of paraphrases has been a long-standing problem. Pivot-based approaches first represent the input sentence by some kind of pivot and then reconstruct the input from it (Cai et al., 2021). A popular pivot is the latent semantic representation learned by a variational autoencoder (VAE) (Bowman et al., 2016; Roy and Grangier, 2019), which is mathematically interpretable but intrinsically hard to train. Another option is to use other languages as the pivot (Wieting et al., 2017; Lapata et al., 2017; Wieting and Gimpel, 2018; Guo et al., 2019), which relies on external machine translation systems. The works most similar to ours exploit large-scale language models and adopt corrupted sequences as the paraphrasing pivot (Hegde and Patil, 2020; Niu et al., 2021; Meng et al., 2021). However, these methods lack global semantic guidance due to the absence of the complete input sentence. In addition to the above methods, Huang and Chang (2021) take the first step towards syntactically controllable unsupervised paraphrasing by manipulating the embedding of the potential constituency parse tree. However, constituent parsing is time-consuming, which increases the overall cost of paraphrase generation.

Overview
Figure 1 illustrates the overall architecture of MCPG. The basic framework in Figure 1(a), denoted as MCPG-basic, is built for the unsupervised condition and consists of three modules, i.e., the Dropout-enabled Semantic Encoder, the Factual Keyword Selector, and the Controllable Generator. Given a source sentence, we first encode its semantic meaning into a dense vector x̃ via the Semantic Encoder, which performs the dropout operation internally to obtain diverse semantic embeddings. In addition to using the dropout-disturbed embedding as a global semantic constraint, we also provide the Generator with a few keywords W = {w_i}_{i=1}^M as a local lexical constraint to ensure that factual entities are preserved. Figure 1(b) shows how to further control the universal expression style with the adapted framework MCPG-adapt when annotated samples from the target domain are available. We seek guidance from the transfer vector v based on parallel exemplars, which are retrieved by the Exemplar Provider from the training set of the target domain. By adding the transfer vector v to the semantic embedding x̃, the Generator is able to produce paraphrases that are better adapted to the target domain. Note that the style constraint based on annotated exemplars is not involved during training; we only use it to increase the model's domain adaptability in a plug-and-play way.

Global Semantic Constraint
The Semantic Encoder is responsible for deriving an embedding for the input sentence, which is helpful for both preserving global semantics of the original sentence and increasing expression diversity.
SimCSE (Gao et al., 2021) has proven effective for sentence representation. Following this work, we perform unsupervised contrastive learning upon BERT and use it as the backbone of our Semantic Encoder. The last-layer [CLS] representation is taken as the semantic embedding. Note that the semantic similarity between two sentences can be measured by the cosine similarity of their embeddings. This is helpful for filtering bad semantic embeddings (see the last paragraph of this section) and retrieving exemplars (see Sec 5.1).
Dropout is a useful technique for deep network regularization. Conventionally, it is turned on during training and disabled during inference. Here, we show that enabling dropout during inference helps produce embeddings with subtle differences for the same input while keeping the semantics, thus guiding the model to generate more diverse paraphrases.
Given an input sentence x, we obtain the disturbed semantic embedding x̃ from the Semantic Encoder by

x̃ = SemEncoder(x, dp, drop=True), (1)

where dp denotes the dropout probability and drop=True means that we enable dropout during inference. On the other hand, the conventional standard embedding x̄ of the same input is derived without dropout by setting drop=False. We only use the built-in dropout mechanism inside BERT and do not add any additional dropout operations.
We assume that the standard embedding x̄ conveys the full semantic meaning of the input sentence without losing any information about the original expression form. In contrast, x̃ is continuously disturbed by the dropout operation acting on the intermediate layers of the encoder. Generally, the higher the dropout probability dp, the further the disturbed embedding x̃ lies from the standard embedding x̄, and the more diverse the generated texts. Along this line, we can tune dp to achieve a trade-off between semantic preservation and expression diversity. Besides, we further control the dropout behaviour by checking the similarity between x̃ and x̄: only the disturbed embeddings x̃ with a cosine similarity larger than a threshold λ (= 0.75) are used for generating paraphrase candidates.
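The dropout-and-filter procedure above can be sketched as follows. This is a minimal pure-Python illustration, not the released implementation: in MCPG the dropout acts inside BERT's intermediate layers, whereas here, for simplicity, we perturb the final embedding vector directly; all function names are our own.

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two vectors represented as plain lists.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dropout(vec, dp, rng):
    # Inverted dropout: zero each unit with probability dp and rescale the rest.
    return [0.0 if rng.random() < dp else x / (1.0 - dp) for x in vec]

def sample_disturbed_embeddings(standard, dp, n, lam=0.75, seed=0):
    """Draw n dropout-perturbed copies of the standard embedding and keep only
    those whose cosine similarity to the standard embedding exceeds lam."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n):
        cand = dropout(standard, dp, rng)
        if cosine(cand, standard) > lam:
            kept.append(cand)
    return kept
```

With a moderate dp (e.g., 0.1), nearly all perturbed embeddings pass the λ = 0.75 filter while still differing from the standard embedding, mirroring the trade-off described above.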

Local Lexical Constraint
Dense embeddings are effective for encoding the global semantics of sentences. However, as in many previous models, relying entirely on a dense semantic embedding for paraphrasing may fail to retain important keywords, such as factual entities (e.g., persons and locations), which should not be changed during paraphrase generation. Besides, using a few words as local clues tends to help produce semantically consistent paraphrases when the input text is long. Therefore, we utilize entity-based keywords to make the model more controllable at the lexical level.
We use the Keyword Selector to extract keywords. We first adopt the high-accuracy version of the TexSmart fine-grained NER tool (Zhang et al., 2020; Liu et al., 2021) to recognize a set of named entities W_1 from the input sentence. Then we remove the stop words and randomly sample the remaining non-entity keywords W_2, roughly one from every 3-10 words. W_1 and W_2 are combined to form the final keyword set W = {w_i}_{i=1}^M, where the subscript i corresponds to the order in which the keyword appears in the input sequence. Note that the selected keywords account for only a small proportion of the whole sentence, which does not hurt the output diversity too much.
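A sketch of this selection logic, assuming the NER step has already produced token-level entity spans (the `entity_spans` argument stands in for TexSmart's output, and the tiny `STOP_WORDS` set is an illustrative placeholder for a real stop-word list):

```python
import random

STOP_WORDS = {"the", "a", "an", "is", "it", "to", "of", "in", "at", "what", "like"}

def select_keywords(tokens, entity_spans, interval=(3, 10), seed=0):
    """Combine named-entity tokens (W1) with sparsely sampled non-entity
    content words (W2), returning keywords in their original input order."""
    rng = random.Random(seed)
    entity_idx = {i for start, end in entity_spans for i in range(start, end)}
    chosen = set(entity_idx)  # W1: all entity tokens are kept
    i = 0
    while i < len(tokens):
        i += rng.randint(*interval)  # step roughly 3-10 tokens forward
        if i < len(tokens) and i not in entity_idx and tokens[i].lower() not in STOP_WORDS:
            chosen.add(i)  # W2: a sampled non-entity, non-stop-word keyword
    return [tokens[i] for i in sorted(chosen)]
```

Sorting by token index preserves the subscript ordering required by the framework.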

Paraphrase Generator
We employ T5 (Raffel et al., 2020) as the Generator backbone, which is based on an encoder-decoder structure and shows strong performance on the text-infilling task (Gao et al., 2022). We follow the standard denoising objective used in the pretraining stage of T5 to reconstruct the input sentence from a list of keywords W. The only difference is that we introduce the semantic embedding x̃ as a global semantic constraint. The generation function is defined as

y = GenDecoder(x̃ ⊕ GenEncoder(W)), (2)

where GenEncoder and GenDecoder denote the encoder and decoder of the T5-based Generator respectively, and ⊕ denotes concatenation.
To construct the input sequence for the GenEncoder, we separate the keywords in W by sentinel tokens [BLANK_i], each of which stands for a consecutive span of missing content (see the example in Figure 1(a), where the only keywords are "McGill University").

Training Objective The whole model is trained in an unsupervised way, where the target sentence y is exactly the input sentence x itself. The loss function over the whole corpus D is defined as

L = − Σ_{x∈D} log P(x | x̃, W), (3)

i.e., the negative log-likelihood of reconstructing each input sentence from its disturbed semantic embedding and keywords.
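The sentinel-based input construction can be sketched as follows. This is a simplified illustration using the paper's [BLANK_i] notation over whitespace tokens; the actual T5 tokenizer operates on subwords and uses sentinel ids of its own.

```python
def build_encoder_input(tokens, keyword_positions):
    """Replace each maximal run of non-keyword tokens with a [BLANK_i]
    sentinel, keeping keyword tokens in place."""
    pieces, blank_id, i = [], 0, 0
    n = len(tokens)
    while i < n:
        if i in keyword_positions:
            pieces.append(tokens[i])
            i += 1
        else:
            pieces.append(f"[BLANK_{blank_id}]")
            blank_id += 1
            while i < n and i not in keyword_positions:
                i += 1  # skip the whole masked span
    return " ".join(pieces)
```

For the sentence "What is it like to study at McGill University ?" with keywords at positions 7 and 8, this yields "[BLANK_0] McGill University [BLANK_1]", matching the example in Figure 1(a).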

Universal Style Constraint
MCPG-basic trained on the non-parallel corpus is already qualified for paraphrase generation. In this section, we refine it into MCPG-adapt with an additional style constraint, applied in a training-free way when annotated exemplars from the target domain are accessible, as shown in Figure 1(b).
Given an input x, we first retrieve the K most similar paraphrase pairs {(p_k, q_k)}_{k=1}^K from the training corpus, where p_k and q_k denote parallel paraphrases. The selection criterion is the cosine similarity between the standard semantic embedding of x and the average embedding of (p_k, q_k), both derived from the Semantic Encoder (Sec 3.2). The selected pairs are taken as exemplars to guide paraphrase generation.
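The retrieval criterion can be sketched as follows, with embeddings represented as plain Python lists; the function names are ours, not part of any released code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def retrieve_exemplars(x_emb, pair_embs, k):
    """Rank paraphrase pairs by the cosine similarity between the input's
    standard embedding and the average embedding of each (p, q) pair,
    then return the indices of the top-K pairs."""
    scored = []
    for idx, (p_emb, q_emb) in enumerate(pair_embs):
        avg = [(a + b) / 2.0 for a, b in zip(p_emb, q_emb)]
        scored.append((cosine(x_emb, avg), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]
```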
Denote the semantic embeddings of a pair of exemplar paraphrases p, q as p, q ∈ R^H, and that of the input sentence x as x̃ ∈ R^H. We model the difference in universal style between p and q by a transfer vector

v = (q − p) / ∥q − p∥, (4)

where ∥·∥ denotes the L2 norm. Note that we assume p is more syntactically similar to x than q, based on the Levenshtein distance between the corresponding POS tag sequences; otherwise, we simply swap p and q in Equation (4). Then we transfer x̃ to the paraphrase embedding ỹ by

ỹ = x̃ + v. (5)

Finally, we decode the paraphrase sequence y by

y = GenDecoder(ỹ ⊕ GenEncoder(W)), (6)

where we simply replace the semantic embedding x̃ in Equation (2) with the transferred embedding ỹ.
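Assuming the transfer vector is the L2-normalized difference of the exemplar embeddings, which is then added to the input embedding (our reading of the text; the POS-based swap check is omitted here), the vector arithmetic is:

```python
import math

def transfer_vector(p_emb, q_emb):
    # v = (q - p) / ||q - p||: unit direction from the exemplar closer to the
    # input (p) toward its paraphrase (q).
    diff = [b - a for a, b in zip(p_emb, q_emb)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def transferred_embedding(x_emb, p_emb, q_emb):
    # Shift the input embedding along the style direction: y = x + v.
    v = transfer_vector(p_emb, q_emb)
    return [x + d for x, d in zip(x_emb, v)]
```

The transferred embedding then replaces the semantic embedding when conditioning the decoder, with no parameter update involved.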

Experimental Setup

Datasets
There are two variants of our proposed model, i.e., the unsupervised MCPG-basic and the domain-adapted MCPG-adapt. We evaluate MCPG-basic on three widely used datasets, Quora, MSCOCO (Lin et al., 2014), and Twitter (Lan et al., 2017), under the unsupervised setting following Meng et al. (2021). For MCPG-adapt, we compare it with supervised baselines on both the Quora and MSCOCO datasets, following the supervised evaluation protocol of Su et al. (2021). The details are as follows.
Quora is derived from duplicated question pairs sharing the same answers on the Quora forum. We use 30k instances for testing in the unsupervised experiments. To validate domain-adapted performance, the sizes of the training, validation, and test sets are 100k, 4k, and 20k respectively.
MSCOCO was originally constructed for image captioning and contains 5 different captions for each image. We randomly pick one of them as the source sentence and the rest as targets. We use 20k test instances under the unsupervised setting. For the supervised experiments, we split the dataset into 93k, 4k, and 20k instances for training, validation, and testing.
Twitter was built from linked tweets sharing URLs. It contains both automatically and manually annotated sentence pairs. We only use the pairs manually labeled as "paraphrases", which results in 566 instances for unsupervised testing.

Baselines and Evaluation Metrics
We compare MCPG-basic and MCPG-adapt with unsupervised and supervised models respectively.
The unsupervised baselines include several recent methods, and the supervised baselines include IANet+S (Su et al., 2021), among others. Results for the supervised baselines are cited from Su et al. (2021). We evaluate all models with several automatic metrics: BLEU (Papineni et al., 2002), iBLEU (Sun and Zhou, 2012), and ROUGE (Lin, 2004). We report the 4-gram BLEU score and both the 1-gram and 2-gram ROUGE scores. The iBLEU score is included to penalize trivial outputs that simply repeat the input sentence. We further compute SelfBleu (Zhu et al., 2018) to measure output diversity in Sec 6.1; a lower SelfBleu score indicates better diversity.
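As a toy illustration of how iBLEU penalizes copying the input, the sketch below combines a clipped 1-gram precision (standing in for the real 4-gram BLEU) with the usual weighted difference; the value alpha=0.8 is an assumption for illustration, not necessarily the paper's setting.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Toy clipped 1-gram precision (no brevity penalty), a stand-in for BLEU."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return overlap / len(cand) if cand else 0.0

def ibleu(output, reference, source, alpha=0.8):
    # iBLEU rewards fidelity to the reference while penalizing copies of the input.
    return (alpha * unigram_precision(output, reference)
            - (1 - alpha) * unigram_precision(output, source))
```

An output identical to the source scores lower than a diverse output that matches the reference, which is exactly the behaviour the metric is meant to capture.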

Implementation Details
We implement the Generator of MCPG based on the Hugging Face T5-base checkpoint and train it on a subset of CommonCrawl containing 243 million sentences. To speed up training, we directly use the open-source checkpoint of SimCSE as our Semantic Encoder and only update the parameters of the Generator. During training, we set the learning rate to 1e-3 and the batch size to 1024. During inference, MCPG-basic first increases the dropout probability dp from 0.1 to 0.2 to create different semantic embeddings for the same input and then uses top-k sampling to generate the 1-best paraphrase for each embedding. Following Hegde and Patil (2020), we eliminate candidates identical to the input and obtain 10 distinct candidates. The unsupervised results in Table 1 are reported on the best candidates, i.e., those achieving the top sentence-level iBleu scores. Similarly, MCPG-adapt produces K outputs conditioned on different exemplars retrieved from the training set; Table 2 shows the results using 10, 15, and 20 exemplars respectively.

Main Results
Unsupervised Performance From Table 1, we see that our MCPG-basic significantly outperforms most baselines across all three datasets. Notably, our model improves the Bleu scores by a large margin: 6.53 on Quora, 3.56 on MSCOCO, and 6.59 on Twitter. This is because most factual entities appear in both the source and target sentences, which demonstrates the effectiveness of our simple keyword selection strategy in the unsupervised scenario. Directly copying keywords from the source sentence may hurt the diversity of generated paraphrases. Nevertheless, MCPG-basic still shows its superiority in terms of the iBleu score, which considers both the fidelity to the reference and the difference from the input.

Domain-adapted Performance Table 2 compares the results of our domain-adapted model and the supervised baselines, where the annotated parallel corpus from the target domain is accessible.
Note that our model requires no extra training on the annotated data; its parameters stay the same as those of the unsupervised MCPG-basic. Even so, MCPG-adapt achieves competitive performance compared with the baselines that need supervised training. Besides, the model's performance improves as the number of exemplars increases.
Combined with Table 1, we find that, with the help of transfer vectors, MCPG-adapt outperforms MCPG-basic on both datasets. The improvement on Quora is more significant than on MSCOCO. A possible reason is the difference in data distribution: during experiments we observe that, in Quora, the exemplars retrieved from the training set are more similar to the input sentence than in MSCOCO, which makes the derived transfer vectors more accurate as paraphrasing guidance on Quora.

Human Evaluation
To further verify the performance of the proposed model, we conduct human evaluation under the unsupervised setting. We randomly sample 100 instances from the Quora test set and ask three annotators to evaluate the outputs of UPSA, Corruption, MCPG-basic, and the reference (ConRPG (Meng et al., 2021) is excluded because its code is not released). Each model is given a rank from 1 (best) to 4 (worst) for each input sentence by comparing overall output quality w.r.t. semantic fidelity, language fluency, and surface-form diversity, following Li et al. (2019). We report the mean rank of the three annotators over all evaluation instances in Table 3; lower ranks are better. The agreement is the rate at which at least two of the three annotators give the same rank to a model. The average Spearman's correlation coefficient between any two annotators is 0.488. We observe that MCPG-basic achieves the best rank compared with UPSA and Corruption, with high agreement, which demonstrates the effectiveness of our method.
Detailed Analysis

Analysis of Semantic Constraint
During inference, we tune the dropout probability of the Semantic Encoder to control the global semantics and increase generation diversity (Sec 3.2). We provide both qualitative and quantitative analysis of how this works in Figure 2 and Figure 3.
In Figure 2, as the dropout probability dp increases from 0.10 to 0.28, the input and output sentences generally become more and more different. When dp is moderately large (e.g., no more than 0.18), the model produces paraphrases with more diverse surface forms without losing the original semantic meaning. However, as dp keeps increasing, it may slightly hurt the fluency of the output sentences. Moreover, when dp becomes too large, the output sentence may fail to keep the core semantics of the input. The same trend is statistically demonstrated in Figure 3, where we plot the metrics (SelfBleu, Bleu, and iBleu) of MCPG-basic with different dropout probabilities on Twitter. SelfBleu measures the similarity between input and output: the smaller the SelfBleu, the more diverse the output sentence. The results show that dropout enables a new way to control paraphrase generation, where we can tune the dropout probability to strike a balance between semantic fidelity and output diversity.

Analysis of Style Constraint
In Sec 5.1, we have shown that retrieving exemplars from the annotated parallel corpus as a universal style constraint is helpful for domain adaptation, where our model achieves results competitive with the supervised models even without training. Here, we show that our model is also more robust to the size of the annotated corpus than traditional supervised models. Since IANet+S (Su et al., 2021) does not release code, we reimplement LBOW-Topk (Fu et al., 2019) and compare our model with it on different parallel data sizes.
Figure 4 shows a dramatic drop in the Bleu score of LBOW-Topk as the size of the Quora parallel corpus decreases from 100K to 500. On the contrary, the Bleu score of MCPG-adapt@10 remains above 25.00 even when there are only 500 candidates for retrieval. One reason is that our model only uses the exemplars retrieved from the parallel corpus to assist paraphrase generation, and does not rely on them to update model parameters. The other reason is that, although the retrieved exemplars may become less semantically similar to the input sentence as the annotated data size shrinks, our model can still benefit from mimicking the universal style mapping mode of the exemplars through the transfer vector. Figure 5 lists some example outputs guided by different exemplars, where the similarity between the input sentence and the exemplars decreases from top to bottom. Nevertheless, our model is able to generate reasonable paraphrases under all these conditions.

Analysis of Lexical Constraint
Table 4 reveals the importance of using factual keywords as a local lexical constraint. Replacing the factual entities with random keywords negatively affects the results on the three datasets to different degrees. The effect is most significant on Twitter, where many sentences are about politics and are more likely to contain the names of politicians, countries, etc. In contrast, the effect is less obvious on MSCOCO, since image captions tend to describe images in a more general way regardless of the specific names of objects.

Conclusion
In this paper, we propose a simple and effective model, MCPG, for more controllable unsupervised paraphrase generation that is also flexible for domain adaptation. MCPG decomposes the generation constraints into multiple levels, i.e., global semantics, local lexicons, and universal styles, in a human-like manner. In particular, we manipulate a transfer vector derived from the semantic embeddings of exemplars to control output styles, which provides a new possibility for domain adaptation with annotated samples in a training-free way. In future work, we will explore (1) controlling the model from more perspectives, such as syntactic structure, and (2) generating paraphrases with higher quality than the input, i.e., polishing the input texts.

Limitations
In this work, we investigate a multi-level controllable approach for unsupervised and domain-adaptive paraphrase generation. Despite the promising experimental results, there are still some limitations of our work:
• We use a default dropout probability during training but tune it during inference, which may cause a slight inconsistency.
• We do not shuffle the factual keywords and force the model to copy them, which may slightly hurt the output diversity.
• We use the transfer vector based on parallel exemplars to help the model in domain adaptation. Although the whole process is training-free, it would be better if we could obtain exemplar sentences without a parallel corpus.

Figure 1 :
Figure 1: Model overview. (a) MCPG-basic is the basic framework tailored for the unsupervised scenario, which controls the global semantics through the semantic embedding and the local lexicons through the pre-specified factual keywords. The semantic embedding is disturbed by the dropout mechanism inside the Semantic Encoder with a probability dp, which controls the output diversity. (b) MCPG-adapt is the refined framework, which controls the universal style through a transfer vector. This version is designed to increase the model's domain adaptability in a training-free way when annotated exemplar paraphrases from the target domain are available.
For the example in Figure 1(a), the keywords only include "McGill University", so the input sequence is "[BLANK_0] McGill University [BLANK_1]". Denote the last-layer hidden states of the GenEncoder as [h_1, h_2, ..., h_S], where h_i ∈ R^H and S is the input sequence length. We concatenate the semantic embedding x̃ ∈ R^H with them to form [x̃, h_1, ..., h_S]. These vectors are then attended to by the Generator decoder in each layer as memory vectors to guide paraphrase generation. The goal of the GenDecoder is to fill in the blanks (e.g., [BLANK_0] and [BLANK_1] in the example input). That is, the decoder only predicts the missing contexts and the output positions of the given keywords, which are indicated by placeholders [KEY_i]. For the above example, the target sequence should be "What is it like to study [KEY_0] ?", where [KEY_0] indicates the input keyword "McGill University". By replacing the placeholders [KEY_i] with the corresponding keywords, we finally obtain the paraphrase sentence y.
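Recovering the final paraphrase from the decoder output is then a simple placeholder substitution, sketched below for illustration:

```python
def fill_keywords(target_sequence, keywords):
    """Replace each [KEY_i] placeholder predicted by the decoder with the
    corresponding input keyword to recover the final paraphrase."""
    out = target_sequence
    for i, kw in enumerate(keywords):
        out = out.replace(f"[KEY_{i}]", kw)
    return out
```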

Figure 2:
Figure 2: Example outputs of MCPG-basic conditioned on different dropout probabilities (ranging from 0.10 to 0.28). Purple parts are new lexical expressions that do not appear in the input sentence. Orange parts are words from the input that are used in a different way in the output sentences.

Figure 3:
Figure 3: SelfBleu, Bleu, and iBleu scores of MCPG-basic with different dropout probabilities on Twitter.

Figure 4 :
Figure 4: Effect of the parallel data size.

Figure 5 :
Figure 5: Example outputs of MCPG-adapt conditioned on different exemplars. Purple parts are new expressions in the output y inspired by the exemplar pair (p, q). Orange parts are old expressions that appear in both the input and output sentences.
Generator outputs:
• (dp=0.10) What is it like to study at McGill University ?
• (dp=0.12) What is it like to study , to be at McGill University ?
• (dp=0.16) What does it mean to study like this inside McGill University ?
• (dp=0.22) What is it like to be a student in McGill University ?

Table 2 :
Domain-adapted performance. MCPG-adapt@K denotes the model variant using K exemplars.

Table 3 :
Human evaluation results.
Input: you just lived through the hottest year on our planet ever recorded.
Candidates:
• you just lived the hottest year on our planet ever.
• you just lived the hottest year we have ever had.
• you just lived through the hottest year on earth so far recorded.
• you guys can live by the hottest year ever recorded on this planet.
• we just lived the hottest year you can ever feel on the planet.
• you lived through the year we just registered the hottest time on the planet.
• you recorded us for the hottest era ever lived.
• you just sweltered the hottest we have ever recorded year of our lives.
• in any case we recorded the hottest planet ever lived right now.
• you got the record of the year through the smallest possible time in our planet.
Input: what is the best way to improve your chess skills ?
p: what is the best way to teach chess to children ?
q: what is the best way and right age to introduce chess to a child ?
y: what is the best way to improve your chess skills and the right age to enter ?
p: which age is the best to start teaching my little son to play chess ?
q: what are the benefits of teaching children to play chess and what is a good age to start ?
y: what are the benefits of playing chess and what is a good way to improve your skills ?

Table 4 :
Performance of MCPG-basic with random keywords.The 'blue' parts indicate the performance degradation compared with the results in Table 1.