Effective Unsupervised Constrained Text Generation based on Perturbed Masking

Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions and actions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best edit position and action at each step. Specifically, PMCTG extends the perturbed masking technique to effectively search for the most incongruent token to edit. It then introduces four multi-aspect scoring functions to select the edit action, further reducing search difficulty. Since PMCTG does not require supervised data, it can be applied to different generation tasks. We show that under the unsupervised setting, PMCTG achieves new state-of-the-art results on two representative tasks, namely keywords-to-sentence generation and paraphrasing.


Introduction
Constrained text generation is the task of generating text that satisfies a given set of constraints, and it serves many real-world text generation applications, such as dialogue generation (Li et al., 2016) and summarization (See et al., 2017). There are broadly two types of constraints. Hard constraints require, for example, including a set of given words or phrases in the generated text: Example 1 in Table 1 shows that the keywords "You" and "beautiful" must occur in the generated sentence. Soft constraints require, for example, the generated text to be semantically similar to the original text: Example 2 in Table 1 shows a pair of paraphrases where "What are the effective ways to learn cs?" and "How to learn cs effectively?" share a similar meaning.
Conventional approaches model the task in an encoding-decoding paradigm with a supervised setting (Prakash et al., 2016; Gupta et al., 2018). However, these methods have shortcomings for both types of constrained generation. For hard-constrained generation, without external constraint mechanisms, such methods can hardly guarantee that the generated text satisfies all constraints. For soft-constrained generation, conventional methods treat the task as a machine translation (MT) problem (Sutskever et al., 2014) and require massive parallel supervised data for training. Unfortunately, constructing such datasets is resource-intensive. In addition, domain-specific supervised models may be difficult to transfer to new domains (Li et al., 2019). Recently, unsupervised text generation has been proposed to address these challenges, with two main research directions. Beam search-based methods generate candidates from left to right, satisfying the constraints at each step, inspired by MT methods (Hokamp and Liu, 2017; Post and Vilar, 2018). However, while the search space of MT systems is relatively small, this approach does not work as well as expected when applied to other generation tasks, such as paraphrasing, because of their much larger search spaces (Sha, 2020). Local edit-based methods, represented by CGMH (Miao et al., 2019) and UPSA (Liu et al., 2020), are another effective solution. These methods propose stochastic local edit strategies to search for reasonable sentences in a huge search space based on the given constraints. One main concern is that they may take a long time to find the optimal solution because they rely on stochastic strategies.

No.  Original Text                 Generated Text
1    You, beautiful                You are so beautiful .
2    How to learn cs effectively?  What are the effective ways to learn cs?


Related Work

Constrained Text Generation
Constrained text generation is conventionally formulated as a supervised sequence-to-sequence problem under the encoding-decoding paradigm (Sutskever et al., 2014). For example, Prakash et al. (2016) and Li et al. (2019) respectively propose a stacked residual LSTM network and a transformer-based model (Vaswani et al., 2017), and Gupta et al. (2018) leverage a combination of variational autoencoders (VAEs) and LSTM models to generate paraphrases. Guu et al. (2018) propose a sentence generation model in which a prototype sentence is first extracted from the training corpus and then edited into a new sentence. However, these methods do not support constraint integration (Miao et al., 2019). Later, some works attempted to add constraints to generative models. Wuebker et al. (2016) and Knowles and Koehn (2016) utilize prefixes to guide target text generation. Mou et al. (2016) use pointwise mutual information (PMI) to predict a keyword and treat it as a constraint for generating the target text. However, these methods bind the constraints to the original model and are therefore difficult to apply to new domains and new generation models (Li et al., 2019). Moreover, the above approaches rely on an adequate parallel supervised corpus, which is hard to obtain in real-world application scenarios.
Unsupervised constrained text generation has become a research hotspot due to its low training cost and its mitigation of insufficient training data. VAEs and their variants (Bowman et al., 2016; Roy and Grangier, 2019) are leveraged to generate sentences from a continuous latent space. These methods remove the reliance on supervised datasets but remain difficult to control, as generative constraints are hard to incorporate. Beam search is a representative approach for unsupervised constrained text generation. Grid Beam Search (GBS) (Hokamp and Liu, 2017) extends beam search by allowing the inclusion of pre-specified lexical constraints. Post and Vilar (2018) propose Dynamic Beam Allocation (DBA), a much faster beam search-based method with hard lexical constraints. Zhang et al. (2020) propose an insertion-based approach consisting of insertion-based generative pre-training and inner-layer beam search. For tasks where the search space is limited (represented by machine translation), these methods work well. However, when faced with a large search space, they do not work as well as expected (Sha, 2020).
Local edit-based methods have attracted attention recently, as they can help reduce search spaces. CGMH (Miao et al., 2019) applies the Metropolis-Hastings algorithm (Metropolis et al., 1953) to unsupervised constrained generation. UPSA (Liu et al., 2020) is another local edit-based method: it directly models paraphrasing as an optimization problem and uses simulated annealing to solve it. However, these models may require many steps and a long running time to generate reasonable sentences since they rely on stochastic strategies. Sha (2020) proposes a gradient-guided method, G2LC, that uses token gradients to determine the edit actions and positions, making the generation process more controllable. However, G2LC still relies on a supervised corpus to train a binary classification model serving its semantic similarity objective.

Perturbed Masking
Perturbed masking (Wu et al., 2020) is a parameter-free probing technique to analyze and interpret pre-trained models. Based on a pre-trained BERT-based model with the masked language modeling (MLM) objective, it can measure the impact one token has on predicting another token. It was originally used in syntax-oriented tasks such as syntactic parsing and discourse dependency parsing.
In this paper, we extend perturbed masking to constrained text generation. Since the edit-based approach edits only one token at each step, we need to find the most incongruent token to edit. Our insight is to use perturbed masking to measure the congruency between different tokens. We believe that the token with the weakest correlation with its adjacent tokens is the most incongruent and thus the most suitable to edit. Perturbed masking can evaluate the impact of one token on another; a high impact factor means that the token strongly influences its adjacent tokens, and we consider such chunks (the current token with its adjacent tokens) congruent. Therefore, we edit the tokens in chunks with low impact to make those chunks more congruent.

Methodology
In this section, we introduce the proposed PMCTG by first describing how perturbed masking is used to select edit positions, and then explaining the proposed scoring functions and how they are used to select edit actions.

Edit Position Selection
Most previous works select edit positions stochastically, which leads to many unnecessary search steps. To reduce the number of search steps, we propose to use perturbed masking (Wu et al., 2020) to sample the edit position.
Background. The perturbed masking technique was proposed to assess inter-token information (i.e., the impact one token has on another token in a sequence) based on masked language modeling (MLM). It was originally used for dependency parsing.
Formally, given a sequence with n tokens x = {x_i}_{i=1}^{n} and a pre-trained BERT-based model (Devlin et al., 2019) trained with the MLM objective, we obtain a contextual representation H(x)_i for each token. To quantify the impact a token x_j has on another token x_i, we conduct a three-step calculation: steps 1 and 2 mask tokens to obtain the representations H(x\{x_i})_i and H(x\{x_i, x_j})_i, and step 3 computes their distance:

I(x|x_j, x_i) = d(H(x\{x_i})_i, H(x\{x_i, x_j})_i),

where d(·, ·) is the Euclidean distance. I(x|x_j, x_i) indicates the impact x_j has on x_i; a higher value indicates a higher impact, and vice versa. Intuitively, if H(x\{x_i})_i and H(x\{x_i, x_j})_i are similar, the presence or absence of x_j has little effect on the prediction of x_i, reflecting the low importance of x_j to x_i.

Position Selection. It is natural to apply perturbed masking to select the edit position for constrained text generation. Based on the perturbed masking technique, we compute an edit score for each token in the sequence and then sample a token to edit, favoring those with high scores. A token with minimal impact on its adjacent tokens has the weakest correlation with them and therefore most requires editing. We add the special tokens [CLS] and [SEP] to the original sentence and use the pre-trained BERT to calculate the edit score for each token as the negated average impact it has on its adjacent tokens:

ES_i = -(I(x|x_i, x_{i-1}) + I(x|x_i, x_{i+1})) / 2.

This yields an edit score vector ES = {ES_i}_{i=0}^{n}, which we feed into a softmax layer to obtain the edit probabilities p_edit = softmax(ES). Finally, p_edit is used as the weights to sample the edit position x_e in x, where e denotes the edit position index.
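The position-selection procedure above can be sketched as follows. This is a simplified, self-contained illustration rather than the actual implementation: `toy_contextual_repr` is a hypothetical stand-in for BERT's contextual representations, and the edit score is taken as the negated mean impact of a token on its neighbours.

```python
import math

MASK = "[MASK]"

def toy_contextual_repr(tokens, i):
    # Hypothetical stand-in for BERT's contextual representation H(x)_i:
    # a bag-of-characters vector over the unmasked tokens, with tokens
    # closer to position i weighted more heavily.
    vec = [0.0] * 26
    for pos, tok in enumerate(tokens):
        if tok == MASK:
            continue
        w = 1.0 / (1.0 + abs(pos - i))
        for ch in tok.lower():
            if ch.isalpha():
                vec[ord(ch) - ord("a")] += w
    return vec

def impact(tokens, i, j):
    """I(x|x_j, x_i): how much additionally masking x_j changes the
    masked-position representation of x_i (Euclidean distance between
    the two perturbed representations)."""
    masked_i = list(tokens)
    masked_i[i] = MASK
    masked_ij = list(masked_i)
    masked_ij[j] = MASK
    h1 = toy_contextual_repr(masked_i, i)
    h2 = toy_contextual_repr(masked_ij, i)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def edit_probabilities(tokens):
    # Edit score of token i = negated mean impact of i on its neighbours;
    # a softmax turns the scores into sampling probabilities, so tokens
    # with LOW impact (incongruent ones) get HIGH edit probability.
    scores = []
    for i in range(len(tokens)):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
        scores.append(-sum(impact(tokens, j, i) for j in neighbours) / len(neighbours))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = edit_probabilities(["you", "are", "so", "beautiful"])
```

The probabilities sum to one and can be fed directly to a weighted sampler to pick the edit position.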

Edit Action Selection
After sampling the edit position, we next determine the edit action. The three edit actions we consider are insert, replace, and delete. Our strategy is to pre-execute all three actions first and then sample one action based on its action score. When scoring the insertion action, we insert the token before or after the selected position with equal probability. We first introduce the scoring functions for different tasks and then explain how edit actions are selected based on the action scores.

Scoring Function Design
We propose multiple scoring functions to improve the generated text. Given the initial sentence x_0 with n tokens and the generated sentence x* with m tokens, the scoring functions cover fluency, editorial rationality, semantic similarity, and diversity.

Fluency. The primary condition for a reasonable sentence is fluency, so we estimate a sentence's fluency with its average negative log-likelihood under a forward language model:

S_flu(x*) = -(1/m) Σ_{i=1}^{m} log p_LM(x*_i | x*_1, ..., x*_{i-1}).

Editorial Rationality. Since sentence generation proceeds by local edits, we further use perturbed masking to design a local edit score that evaluates the rationality of each action. After a replacement action at index i in x_0, we obtain x* = {x_{0,1}, ..., x_{0,i-1}, x', x_{0,i+1}, ..., x_{0,n}}, where x' is the replacing token and m = n. The edit score measures how congruent the new token is with its neighbors:

S_edit(x*) = (I(x*|x', x_{0,i-1}) + I(x*|x', x_{0,i+1})) / 2.

Similarly, after an insertion action we obtain x* = {x_{0,1}, ..., x_{0,i}, x', x_{0,i+1}, ..., x_{0,n}}, where x' is the inserted token and m = n + 1, and the edit score is computed analogously over x' and its new neighbors x_{0,i} and x_{0,i+1}. After a deletion action we obtain x* = {x_{0,1}, ..., x_{0,i-1}, x_{0,i+1}, ..., x_{0,n}}, where m = n - 1. The edit score for deletion differs slightly from replacement and insertion, since there is no new token: it measures the congruency between the two tokens that become adjacent, x_{0,i-1} and x_{0,i+1}.

Semantic Similarity. The semantic similarity consists of keyword similarity and sentence similarity.
We use KeyBERT (Grootendorst, 2020) to extract the keyword set K from x_0, and the pre-trained BERT is leveraged to encode x_0 and x*. Let i_k = idx(k) denote the index of keyword k in x_0. The keyword similarity finds the closest token in x* for each keyword by cosine similarity:

S_sem,key(x*, x_0) = (1/|K|) Σ_{k∈K} max_j cos(H(x*)_j, H(x_0)_{i_k}).

As for the sentence similarity, let H(x) denote the [CLS] representation of x from BERT, which is leveraged to represent the whole sentence (Devlin et al., 2019); we then define the sentence similarity as S_sem,sen(x*, x_0) = cos(H(x*), H(x_0)). Altogether, the semantic similarity score S_sem combines the keyword and sentence similarities.

Diversity. Following (Liu et al., 2020), a BLEU-based (Papineni et al., 2002) function S_exp is adopted to evaluate the expression diversity between the original and generated sentences.
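A minimal sketch of two of these scoring functions follows, using plain Python lists as stand-ins for the language model's log-probabilities and for BERT token embeddings (both stand-ins are assumptions made only for self-containment):

```python
import math

def fluency_score(token_logprobs):
    """Average negative log-likelihood under a forward language model
    (lower = more fluent). The per-token log-probabilities would come
    from the LM; here they are supplied directly for illustration."""
    return -sum(token_logprobs) / len(token_logprobs)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def keyword_similarity(keyword_vecs, generated_token_vecs):
    """For each keyword embedding, take the best cosine match among the
    generated sentence's token embeddings, then average over keywords."""
    return sum(
        max(cosine(k, g) for g in generated_token_vecs) for k in keyword_vecs
    ) / len(keyword_vecs)

# Toy checks: average NLL of three log-probabilities, and a keyword
# embedding that exactly matches one generated-token embedding.
flu = fluency_score([-0.5, -1.5, -1.0])
sim = keyword_similarity([[1.0, 0.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The sentence-level similarity would be a single cosine between the two [CLS] vectors, computed with the same `cosine` helper.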

Action Scoring
As mentioned above, after sampling the edit position i, we determine the edit action by pre-executing the three actions and sampling among them based on their action scores. The candidate token x' for insertion and replacement is generated from a language model such as an LSTM (Hochreiter and Schmidhuber, 1997) or GPT (Radford et al., 2019).
The language model yields a candidate distribution p_candidate, which we use as weights to sample x'. After obtaining the edit position i and the candidate x', we calculate the edit score for each action. We adopt S_flu and S_edit as the scoring function for keywords-to-sentence generation:

S(x*) = λ_flu S_flu(x*) + λ_edit S_edit(x*),    (12)

and S_flu, S_sem, S_exp, and S_edit for paraphrasing:

S(x*) = λ_flu S_flu(x*) + λ_sem S_sem(x*, x_0) + λ_exp S_exp(x*, x_0) + λ_edit S_edit(x*).    (13)

Notably, since the different scores have different magnitudes, they need to be normalized to avoid the dominance of any specific score. After scoring the actions, we use the scores as weights to sample the edit action.
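The normalize-then-sample step can be sketched as below. The paper only states that scores are normalized, so the min-max scheme here is an assumption; the small epsilon keeps the worst action sampleable with non-zero probability.

```python
import random

def sample_action(raw_scores, rng=None):
    """Min-max normalise per-action scores to a common [0, 1] scale
    (an assumed normalisation scheme), then sample one action using
    the normalised scores as weights. `raw_scores` maps action name
    to its combined score (higher = better)."""
    rng = rng or random.Random(0)
    lo, hi = min(raw_scores.values()), max(raw_scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    weights = {a: (s - lo) / span + 1e-6 for a, s in raw_scores.items()}
    actions = list(weights)
    return rng.choices(actions, weights=[weights[a] for a in actions], k=1)[0]

chosen = sample_action({"insert": 0.2, "replace": 3.1, "delete": 0.9})
```

Sampling (rather than greedily taking the argmax) preserves some stochasticity, which helps the search escape local optima.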

Overall Searching Process
With x_0 (the given keywords in the keywords-to-sentence generation task, or the original sentence in the paraphrasing task) as input, we repeat the above steps of edit position selection with perturbed masking and edit action selection with the scoring functions. Once the maximum number of search steps is reached, we choose the sentence with the highest score as the final output, according to equation (12) for the keywords-to-sentence generation task or equation (13) for the paraphrasing task.
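The overall loop can be sketched as follows, with `propose_edit` standing in for the combined position/action selection and `total_score` for the task's scoring function; both are hypothetical callables supplied for illustration.

```python
def pmctg_search(x0, propose_edit, total_score, max_steps=100):
    """Skeleton of the overall search: repeatedly apply one local edit
    and return the visited state with the highest score, mirroring the
    keep-the-best-over-all-steps rule described above."""
    best, best_score = x0, total_score(x0)
    x = x0
    for _ in range(max_steps):
        x = propose_edit(x)          # one position + action edit
        score = total_score(x)       # task-specific scoring function
        if score > best_score:
            best, best_score = x, score
    return best

# Toy check with integers standing in for sentences: walking upward
# from 0, the best-scoring visited state under -(x - 3)^2 is 3.
result = pmctg_search(0, lambda x: x + 1, lambda x: -(x - 3) ** 2, max_steps=10)
```

Because every intermediate sentence is scored, the method never loses a good candidate found mid-search even if later edits degrade it.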

Experiments
We evaluate our method on two constrained text generation tasks, namely keywords-to-sentence generation and paraphrasing.

Keywords-to-sentence Generation
Experimental Setting. Keywords-to-sentence generation, a representative hard-constrained text generation task, aims to generate a sentence containing the given keywords. We conduct keywords-to-sentence generation experiments on the One-Billion-token dataset (Chelba et al., 2014). Two language models for generation are evaluated: a two-layer LSTM (following (Miao et al., 2019; Sha, 2020)) and GPT2 (Radford et al., 2019). Following (Gururangan et al., 2020), in order to adapt the language models to the specific domain, we randomly sample 5 million sentences to continually pre-train BERT-base-cased and GPT2. 3 thousand sentences are held out as the test set.
As for hyperparameters, for each test sentence we randomly sample 1 to 4 keywords as hard constraints. Following previous works (Miao et al., 2019; Sha, 2020), the initial sentence for searching is the concatenation of the keywords. The maximum number of search steps in this task is 100, and λ_flu and λ_edit are set to 1 in equation (12). Besides, when a keyword index is sampled as the edit position, we directly conduct the insert action, since keywords cannot be replaced or deleted.
As for evaluation metrics, the generated target sentence is measured by negative log-likelihood (NLL) loss. NLL is given by a third-party language model, an n-gram Kneser-Ney language model (Heafield, 2011) trained on a monolingual English corpus from WMT18 (http://www.statmt.org/wmt18/translation-task.html). In addition to automatic evaluation metrics, we also conduct human evaluation. Specifically, we invite 3 experts who are fluent English speakers to score the generated sentences according to their quality. The score ranges from 0 to 1 with a precision of two decimal places, where 1 indicates the best score. The automatic and human evaluation criteria are consistent with previous works (Sha, 2020). The scoring guideline is shown in Table 2.

Baseline. We compare our method with several advanced methods:
• sep-B/F (Mou et al., 2016)

Automatic and Human Evaluation Results. Table 3 shows the performance of multiple methods on the keywords-to-sentence generation task. Among the different kinds of methods, the local edit-based methods work better than beam search-based methods, indicating their superior search ability. CGMH can narrow the search space and make it easier to find higher-quality sentences. G2LC and PMCTG outperform CGMH, which illustrates the importance of determining the correct edit position and action at each step. Strategies for these two issues can better guide the model toward a more optimal solution while greatly reducing the waste of potentially non-essential search steps. Overall, the proposed PMCTG model outperforms the other methods on average in both automatic and human evaluation metrics.
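The NLL metric can be illustrated with a toy bigram language model standing in for the Kneser-Ney model (the smoothing and the WMT18 training data are omitted; the uniform probability below is made up purely for illustration):

```python
import math

def sentence_nll(tokens, bigram_prob):
    """Average negative log-likelihood of a sentence under a bigram LM.
    `bigram_prob(prev, cur)` returns p(cur | prev); a real evaluator
    would use a Kneser-Ney-smoothed n-gram model instead."""
    prev = "<s>"
    total = 0.0
    for tok in tokens:
        total += -math.log(bigram_prob(prev, tok))
        prev = tok
    return total / len(tokens)

# Uniform toy model: every token has probability 0.1 regardless of
# context, so the average NLL is exactly -log(0.1).
nll = sentence_nll(["you", "are", "so", "beautiful"], lambda p, c: 0.1)
```

Lower average NLL means the evaluating language model finds the sentence more probable, i.e., more fluent.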
PMCTG utilizes the perturbed masking technique to identify edit positions and to reflect the reasonableness of edit actions more intuitively and practically. Compared to previous baselines, our approach may either require fewer steps to search for the optimal sentence or achieve better results with the same number of steps. In this task, our method runs only 100 steps per sample while CGMH needs 200, yet achieves better results (7.47 vs 7.70 in average NLL). Besides, although G2LC also runs only 100 steps per sample, our method (PMCTG-LSTM) gives better results (7.47 vs 7.56 in average NLL). Although the process requires an additional BERT model for perturbed masking, we transform a sentence into a batch of vectors and only need to call the BERT model once per search step to calculate the perturbed masking scores for all tokens. Compared to CGMH and UPSA, our method makes fuller use of each search step, reducing the extra time spent on random strategies.
Interestingly, PMCTG-LSTM seems to be superior to PMCTG-GPT2 in this task. For one thing, part of GPT2's superiority over the LSTM lies in the semantic richness of its generated sentences; however, in the target dataset the sentence forms and semantics are relatively simple, so the LSTM performs comparably to GPT2 when there is no need to generate sentences with complex semantics. For another, since keywords are locally ill-formed and semantically distant, their information may be insufficient for GPT2, which does not take backward probability into account, to generate reasonable candidates. In contrast, the two-layer LSTM considers both forward and backward probabilities and may be more suitable for generating candidates between two less correlated tokens. We also find that more keywords may lead to better results; one possible reason is that more keywords further narrow the search space and facilitate the model's search.

Case Study. Some generated examples of PMCTG-LSTM are shown in Table 4. We observe that the proposed model can generate fluent and meaningful sentences while containing the given keywords.

Keywords                       Sentences
worried                        We are very worried about there .
agreement                      To achieve such an agreement , it is important .
competition, action            The shots of competition and action are on display here .
change, hours                  This will change it in the next 24 hours .
The, greatest, court           The world's greatest size court will be presented to you .
I, things, him                 I can do lots of things for him .
body, advanced, July, funeral  The body was found advanced in July and funeral were held in September .
Miley, more, final, spots      But Miley Cyrus has played more than three times in the final two spots .

Paraphrasing
Experimental Setting. Paraphrasing aims to convert a sentence into a different surface form with the same meaning. We evaluate PMCTG on two paraphrase datasets, namely Quora and Wikianswers (Fader et al., 2013). The Quora question pair dataset consists of 140 thousand parallel sentence pairs and 640 thousand non-parallel sentences. The Wikianswers dataset contains 2.3 million question pairs crawled from the WikiAnswers website. We again conduct experiments on a two-layer LSTM (following (Miao et al., 2019; Liu et al., 2020; Sha, 2020)) and GPT2 for better comparison. Following previous works (Liu et al., 2020), we randomly sample 20 thousand sentences from each dataset as test sets and use the remaining sentences to continually pre-train BERT-base-cased and GPT2 for domain adaptation as in (Gururangan et al., 2020).
As for hyperparameters, the maximum number of search steps in this task is 50, and all λ are set to 1 in equation (13). The initial sentence for searching is the original sentence in the dataset.
In terms of evaluation metrics, we leverage the representative sentence-level BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) as the basic metrics. In addition, as stated in (Sun and Zhou, 2012), standard BLEU and ROUGE cannot reflect the diversity between the generated and original sentences. Therefore, we adopt iBLEU (Sun and Zhou, 2012), which penalizes generated sentences that are highly similar to the original ones, as an additional evaluation metric. Besides, we also invite experts to evaluate the generated paraphrases. Specifically, we sample 300 sentences from the Quora test set and ask 3 experts to score each sentence according to two aspects: relevance and fluency. The evaluation criteria are again consistent with previous works (Miao et al., 2019; Liu et al., 2020). The scoring guidelines are shown in Table 2 and Table 5.

Baseline. We compare our method with three types of baselines:
• Supervised methods are original sequence-to-sequence models trained on in-domain supervised data, including ResidualLSTM (Prakash et al., 2016), VAE-SVG-eq (Gupta et al., 2018), Pointer-generator (See et al., 2017), the Transformer (Vaswani et al., 2017), and DNPG (decomposable neural paraphrase generation) (Li et al., 2019).
• Domain-adapted supervised methods train models in one domain and then adapt them to another, including shallow fusion (Gülçehre et al., 2015) and the multi-task learning (MTL) method (Domhan and Hieber, 2017).
• Unsupervised methods are free of any supervised data and easily adapted to new domains, including VAE (Kingma and Welling, 2014), CGMH (Miao et al., 2019), UPSA (Liu et al., 2020), and the recent state-of-the-art method G2LC (Sha, 2020). Notably, G2LC has two variants, G2LC-Generator and G2LC-Recognizer.
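To illustrate why iBLEU discourages trivial copying, the sketch below uses clipped unigram precision as a crude stand-in for sentence-level BLEU (the real metric uses n-grams up to length 4, and α = 0.8 is one common choice rather than necessarily the setting used here):

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Crude stand-in for sentence-level BLEU: clipped unigram
    precision over whitespace-separated tokens."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, r[w]) for w, n in c.items())
    return overlap / max(sum(c.values()), 1)

def ibleu(candidate, reference, source, alpha=0.8):
    """iBLEU-style score: reward similarity to the reference while
    penalising overlap with the source, so verbatim copies score lower."""
    return (alpha * unigram_precision(candidate, reference)
            - (1 - alpha) * unigram_precision(candidate, source))

# A verbatim copy of the source is penalised relative to a candidate
# that matches the reference but diverges from the source.
copy_score = ibleu("how to learn cs",
                   "what is the best way to learn cs",
                   "how to learn cs")
para_score = ibleu("what is the best way to learn cs",
                   "what is the best way to learn cs",
                   "how to learn cs")
```

Here `para_score` exceeds `copy_score`, reflecting the penalty on simply echoing the input.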

Automatic Evaluation Results. Table 6 presents the results of multiple methods on the paraphrasing task. From the first part of Table 6, we can see that supervised methods significantly outperform the other two kinds of methods. The supervised models were trained on 100 thousand question pairs for Quora and 500 thousand question pairs for Wikianswers; their superiority indicates the effectiveness of learning from massive parallel data. However, such in-domain supervised data is hard to obtain in real-world applications. The second section of Table 6 shows the domain-adapted supervised models' performance. These models are trained in one domain (Quora or Wikianswers) and then evaluated in the other. Their performance is much lower than that of the in-domain supervised models, demonstrating the poor generalizability of supervised models and the need for unsupervised methods.
The last section of Table 6 shows the results of multiple unsupervised methods. VAE works worst on both datasets, which suggests that paraphrasing by latent-space sampling does not perform as well as local edit methods. PMCTG achieves the best performance in most cases, which again indicates its effectiveness. Unsupervised PMCTG does not require parallel data and can easily generalize to new domains; accordingly, several unsupervised methods achieve higher performance than the domain-adapted supervised models. In addition, it is worth noting that some unsupervised methods (UPSA, G2LC, and PMCTG) even outperform some supervised methods (Residual LSTM and VAE-SVG-eq), indicating that the gap between supervised and unsupervised methods has narrowed thanks to the effective search strategies of the local edit-based methods. Moreover, unlike in the keywords-to-sentence generation task, GPT2 works better than the two-layer LSTM for paraphrasing. We believe that given a partially fluent text, GPT2 can generate more reasonable candidates due to its powerful language modeling capability.
Human Evaluation Results. As Table 7 shows, PMCTG-GPT2 achieves state-of-the-art performance in terms of fluency but still lags in relevance, which we plan to improve in future research.
Case Study. Table 8 lists some representative examples generated by PMCTG-GPT2, showing the four most common types of paraphrasing produced by the proposed method. The first type is a change of syntax, such as the interchange of "what can..." and "how to..." in the first example. The second type is a change of adjective, as in the second example where "possible" is changed into "good". The third type is a change of personal pronoun, such as the interchange of "you" and "I" in the third example. The last type is a change of tense; the most common is the interchange of simple past and simple present, as in the last example. In general, one limitation of the proposed model is the relatively low expressive diversity of the generated sentences. One possible reason is that each search step modifies only one token, while the unit of conversion from one expression to another is usually a phrase or sentence block, so the model may be biased against searching in that direction.

Conclusion
We propose PMCTG to improve upon previous stochastic search methods for unsupervised constrained text generation. PMCTG leverages the perturbed masking technique to find the best edit position and newly designed multi-aspect scoring functions to decide the best edit action. We evaluate the proposed method on two representative tasks: keywords-to-sentence generation (hard constraints) and paraphrasing (soft constraints). Experimental results demonstrate the effectiveness of the proposed method, which achieves competitive results on three datasets against multiple advanced baselines. We plan to improve the diversity and relevance of the generated sentences in future work.

Table 1 :
Examples on constrained text generation.
1. Replace x_i with the [MASK] token and feed the new sequence x\{x_i} into BERT; a contextual representation H(x\{x_i})_i for x_i is obtained.
2. Replace both x_i and x_j with [MASK] tokens and feed the new sequence x\{x_i, x_j} into BERT; another contextual representation H(x\{x_i, x_j})_i for x_i is obtained.
3. Compute the distance between the two representations, I(x|x_j, x_i) = d(H(x\{x_i})_i, H(x\{x_i, x_j})_i), where d(·, ·) is the Euclidean distance.

Table 3 :
Performance on the keywords-to-sentence generation task. Lower NLL and higher human score indicate better results. 1, 2, 3, and 4 denote the number of keywords and avg indicates the average score.

Table 4 :
Generated examples of PMCTG-LSTM in keywords-to-sentence generation task.

Table 6 :
Performance on the paraphrasing task. R1 and R2 indicate ROUGE-1 and ROUGE-2 respectively. The first/second/third blocks respectively show the results of supervised/domain-adapted supervised/unsupervised methods.

Table 7 :
Human evaluation results on paraphrasing.