Parallel Refinements for Lexically Constrained Text Generation with BART

Lexically constrained text generation aims to control the generated text by incorporating certain pre-specified keywords into the output. Previous work injects lexical constraints into the output by controlling the decoding process or refining the candidate output iteratively, which tends to produce generic or ungrammatical sentences and incurs high computational cost. To address these challenges, we propose Constrained BART (CBART) for lexically constrained text generation. CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality. Concretely, we extend BART by adding a token-level classifier over the encoder, aiming to instruct the decoder where to replace and insert. Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence. To further reduce the inference latency, the decoder predicts all tokens in parallel. Experiment results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while largely accelerating inference.


Introduction
Controllable text generation aims to generate text in a controlled way, such as transferring text style (Shen et al., 2017; Fu et al., 2018) and generating text with control codes (Keskar et al., 2020). Lexically constrained text generation requires that the given keywords appear in the output, which can be applied to incorporating keywords into a dialog response (Mou et al., 2016), creating a story from keywords (Fan et al., 2018), generating advertisements for products (Miao et al., 2019) and writing a concrete meeting summary based on several key phrases.

Table 1: An example of CBART's refinement process during inference.
Constraints: lovely, time, forever, try
Step 1: this lovely experience time and forever to try .
Step 2: this was lovely experience ! time and it forever to try this .
Step 3: this was a lovely experience ! first time here and it took forever to try this place .

To generate sentences with keywords, Mou et al. (2015) proposed a backward and forward language model (B/F-LM). Liu et al. (2019a) applied adversarial learning (Goodfellow et al., 2014) to B/F-LM. Both models are limited to generating text with one keyword. To incorporate multiple keywords into machine translation, Hokamp and Liu (2017) proposed grid beam search (GBS), which adds an additional constrained dimension to beam search. However, GBS does not consider future lexical constraints when generating earlier tokens, thereby degrading the quality of the generated sentences.
Recently, Markov chain Monte Carlo (MCMC) sampling has been applied to text generation (Berglund et al., 2015; Su et al., 2018; Devlin et al., 2019). Compared with GBS, MCMC-based models can iteratively refine tokens based on their contexts. CGMH (Miao et al., 2019) uses Metropolis-Hastings sampling to generate constrained sentences with a series of actions, such as insertion, deletion, and replacement. In most cases, refinements conducted by CGMH are invalid because the actions and positions are chosen randomly. To solve this problem, gradient information (Sha, 2020) and a token-level classifier (He and Li, 2021) have been used to determine the position to be edited and the action to be taken. These models are computation-intensive as they update only one token in each step. POINTER (Zhang et al., 2020b) reduces the inference latency by refining multiple tokens in one step. However, POINTER is based on BERT (Devlin et al., 2019) and imposes the entire burden of generation on the decoder. Previous work (Lewis et al., 2020) has shown that BART is more suitable than BERT for text generation. Nevertheless, BART cannot be directly applied to constrained text generation: if we feed the keywords into the encoder, the decoder cannot guarantee that the output contains the given keywords.
To alleviate the above problems, we propose CBART, a parallel refinement model for lexically constrained text generation, shown in Figure 1. CBART benefits from the large-scale pre-trained model BART. In addition, CBART shifts part of the decoder's burden to the encoder. Specifically, we put a token-level classifier over BART's encoder, which analyzes the input and provides the decoder with coarse-grained modification information, namely the positions to be refined and the actions to be conducted. This refinement information enables the decoder to revise multiple tokens of the input in one step to make the sentence more fluent, such as inserting missing tokens before certain positions and replacing inappropriate tokens with other tokens. In addition, the decoder predicts all tokens simultaneously, which further speeds up inference. As shown in Table 1, CBART keeps refining the generated text during inference until it is complete.
Our work's main contributions are threefold: (1) We propose CBART for lexically constrained text generation (our code is available at https://github.com/NLPCode/CBART). The proposed model takes advantage of the pre-trained model BART. Besides, we lighten the generation burden on the decoder by decomposing constrained sentence generation into two sub-tasks. Furthermore, CBART can insert and replace multiple tokens in each refinement step and predict all tokens in parallel. (2) To train CBART, we propose a new method to construct synthetic data.
(3) Experiment results on One-Billion-Word and Yelp demonstrate that CBART outperforms previous work in terms of generation quality, generation diversity and inference speed.

Problem Definition
Lexically Constrained Text Generation aims to incorporate the given keywords into the generated text. Given a set of lexical constraints c_1, c_2, ..., c_k, this task aims to find a fluent text X by maximizing the conditional probability:

X* = \arg\max_X p(X | c_1, c_2, ..., c_k),   (1)

where X is the text containing the given keywords.

Figure 1: The overview of our proposed model. <S>, </S> and <M> represent the start, end, and mask tokens. The refinement process indicated by the dashed line only appears in inference. During training, the decoder input Y^M is created based on the gold encoder label sequence; during inference, it is created based on the predicted encoder label sequence.

Methodology
In Section 3.1, we will first introduce the proposed model, CBART. Then, we will show how to create synthetic datasets to train CBART in Section 3.2. Finally, we will introduce several parallel decoding strategies for generating text, the repetition penalty mechanism used to discourage the generation of repetitive tokens, and the termination criterion for inference in Section 3.3.

Model Architecture and Training
The overview of our proposed model is shown in Figure 1. The proposed model consists of two modules: an encoder E and a decoder D.

Encoder and Action Classifier. We use the pre-trained language model BART to initialize the proposed model. The encoder is responsible for providing coarse-grained refinement information for the decoder; in other words, it is expected to instruct the decoder where to replace and insert. To this end, we add a fully connected feed-forward layer over the encoder. More concretely, the encoder takes an incomplete sequence as input and outputs a corresponding label sequence, from which the decoder knows how to refine the candidate sentence. The encoder thus serves as a three-class token-level classifier, where labels 0, 1 and 2 refer to the copy, replacement and insertion actions. The copy action means that the decoder should keep the current token. The replacement action suggests that the decoder should replace the current token with another token to make the text more coherent. Similarly, the insertion action indicates that the decoder should insert a token before the current token to complete the text. We use (X, L, Y^M, Y) to represent a training instance, where X = {x_1, ..., x_n} denotes the incomplete text fed into the encoder and L = {l_1, ..., l_n} is the encoder label sequence. The cross-entropy loss for the encoder is:

L_E = -\sum_{t=1}^{n} \log p(l_t | x_1, ..., x_n).   (2)

Decoder. The decoder takes Y^M as input and aims to reconstruct the original text Y, where Y^M = {y^M_1, ..., y^M_m} is constructed based on X and L, and Y = {y_1, ..., y_m} is the decoder label sequence. Following BART, the decoder predicts the complete output Y rather than only the masked tokens of Y^M during training. We optimize the decoder by minimizing the reconstruction loss:

L_D = -\sum_{t=1}^{m} \log p(y_t | y^M_1, ..., y^M_t, X).   (3)

Joint Training. We jointly optimize the encoder and decoder by minimizing the total loss:

L = L_E + \alpha L_D,   (4)

where α is the trade-off parameter.
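To make the architecture concrete, the following is a minimal PyTorch sketch of the two heads and the joint loss, assuming HuggingFace's transformers library; the class name, variable names, and omitted details (padding, label smoothing, batching) are our illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F
from transformers import BartModel

class CBARTSketch(torch.nn.Module):
    """Sketch of CBART: BART plus a 3-way token-level action classifier
    (0=copy, 1=replace, 2=insert) over the encoder, trained jointly
    with the decoder's reconstruction objective."""

    def __init__(self, alpha=1.0):
        super().__init__()
        self.bart = BartModel.from_pretrained("facebook/bart-large")
        d, v = self.bart.config.d_model, self.bart.config.vocab_size
        self.classifier = torch.nn.Linear(d, 3)   # encoder action labels
        self.lm_head = torch.nn.Linear(d, v)      # decoder token logits
        self.alpha = alpha                        # trade-off of Eq. (4)

    def forward(self, x_ids, action_labels, y_mask_ids, y_ids):
        out = self.bart(input_ids=x_ids, decoder_input_ids=y_mask_ids)
        # Encoder loss of Eq. (2): cross-entropy over the label sequence L.
        enc_logits = self.classifier(out.encoder_last_hidden_state)
        loss_e = F.cross_entropy(enc_logits.transpose(1, 2), action_labels)
        # Decoder loss of Eq. (3): reconstruct the full target Y from Y^M.
        dec_logits = self.lm_head(out.last_hidden_state)
        loss_d = F.cross_entropy(dec_logits.transpose(1, 2), y_ids)
        return loss_e + self.alpha * loss_d       # joint loss of Eq. (4)
```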

Creating the Synthetic Dataset
Before training CBART, we need to create the synthetic dataset D = {(X, L, Y^M, Y)}. Each time, we randomly choose a sentence from One-Billion-Word or Yelp. Suppose the selected text is "<S> A B C D E F G H I </S>", where <S> and </S> denote the start and end tokens. Next, we randomly select some tokens from it (e.g., "<S> F G H I </S>"). Then, we randomly replace 15% of the kept tokens with other tokens (e.g., replace 'H' with 'K'). We thus obtain the encoder input X = {<S>, F, G, K, I, </S>} and the corresponding encoder label sequence L = {0, 2, 0, 1, 0, 0}, where 2 denotes that a token should be inserted before 'F' and 1 means that 'K' should be replaced with another token. Further, we construct Y^M = {</S>, <S>, <M>, F, G, <M>, I} by inserting a special mask token <M> before 'F', replacing 'K' with <M>, and shifting the sequence one position to the right. The decoder label sequence Y is constructed by replacing the masked tokens of Y^M with gold tokens. It is trivial to decide the gold token for the replacement action: since 'H' was replaced with 'K', the gold label for the second mask token <M> should be 'H'. However, it is challenging to determine the gold token for the insertion action, especially when multiple tokens are missing before a position. Since five tokens ('A', 'B', 'C', 'D' and 'E') before 'F' have been deleted from the original text, we need to decide which token should be inserted before 'F' first; this token is regarded as the gold token for the first mask token <M>.
We test five different methods (left, middle, right, random and TF-IDF) to construct synthetic datasets for the insertion action. The left method regards the leftmost token 'A' as the first inserted token before 'F', so we get Y = {<S>, A, F, G, H, I, </S>}. Similarly, the middle, right and random methods regard the middle token 'C', the rightmost token 'E', or a randomly chosen token (e.g., 'D') as the first inserted token. We also consider the importance of tokens by computing their TF-IDF scores: assuming token 'B' has the highest TF-IDF value, it is regarded as the first inserted token. We compare the effect of these methods in Section 4.4.
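As an illustration, here is a word-level sketch of how one (X, L, Y^M, Y) instance might be constructed; the real implementation operates on BART subword tokens, and the sampling rates, helper names, and the simplification that a token preceded by deletions is never itself corrupted are our assumptions.

```python
import random

MASK, START, END = "<M>", "<S>", "</S>"

def make_instance(tokens, vocab, keep_prob=0.5, replace_rate=0.15,
                  pick=lambda deleted: deleted[0]):
    """Build one (X, L, Y_M, Y) instance. `pick` chooses which deleted token
    becomes the gold target of the inserted <M>: deleted[0] is the left
    strategy, deleted[len(deleted)//2] middle, deleted[-1] right,
    random.choice(deleted) random; TF-IDF would pick the highest-scoring one."""
    x, labels, masked, gold = [], [], [], []
    pending = []                                  # deletions awaiting insertion
    for tok in [START] + tokens + [END]:
        if tok not in (START, END) and random.random() > keep_prob:
            pending.append(tok)                   # delete: drop tok from X
            continue
        if pending:                               # label 2: insert before tok
            labels.append(2)
            x.append(tok)
            masked.extend([MASK, tok])
            gold.extend([pick(pending), tok])
            pending = []
        elif tok not in (START, END) and random.random() < replace_rate:
            labels.append(1)                      # label 1: tok was corrupted
            x.append(random.choice(vocab))
            masked.append(MASK)
            gold.append(tok)
        else:
            labels.append(0)                      # label 0: copy
            x.append(tok)
            masked.append(tok)
            gold.append(tok)
    y_mask = [END] + masked[:-1]                  # shift right, as in BART
    return x, labels, y_mask, gold
```

On the worked example above (keeping F, G, H, I and corrupting 'H' into 'K'), this reproduces X = {<S>, F, G, K, I, </S>}, L = {0, 2, 0, 1, 0, 0}, Y^M = {</S>, <S>, <M>, F, G, <M>, I} and, under the left strategy, Y = {<S>, A, F, G, H, I, </S>}.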

Inference
We set the lexical constraints as the initial encoder input X^0. We start inference by feeding X^0 into the encoder and obtaining the predicted label sequence L̂^0 with argmax decoding. Next, we construct Y^M based on X^0 and L̂^0, and feed it into the decoder. Then, we run a decoding strategy on the decoder to get Ŷ^0, which serves as the encoder input of the next refinement step, X^1. We continue to refine the encoder input until a termination condition is met. Unlike training, we forbid replacing any keyword with the mask token <M> when constructing Y^M, and the decoder only needs to predict the masked tokens of Y^M, which ensures that the given keywords appear in the output. In the following, we introduce greedy decoding for the encoder and four parallel decoding strategies for the decoder: greedy, top-k, top-p, and multiple-sequence decoding. Each decoding strategy allows the decoder to predict all masked tokens of Y^M in parallel.

Greedy Decoding for the Encoder. At refinement step r, the encoder takes X^r as input, and we choose the label with the highest probability as the predicted label l̂^r_t, where L̂^r = {l̂^r_1, ..., l̂^r_{n_r}}:

l̂^r_t = \arg\max_{l} p(l | X^r).   (5)

Greedy Decoding. Analogously, greedy decoding selects the token with the highest probability at position t as the decoder output, where Ŷ^r = {ŷ^r_1, ..., ŷ^r_{m_r - 1}, </S>}:

ŷ^r_t = \arg\max_{y} p(y | Y^M, X^r).   (6)

Top-k and Top-p Decoding. Since maximization-based decoding, such as greedy decoding and beam search, may cause text degeneration, we use top-k (Fan et al., 2018; Holtzman et al., 2018) and top-p decoding (Holtzman et al., 2020) to alleviate this problem. For each position, top-k decoding samples a token from the k most probable tokens, rather than always choosing the most probable one. Similarly, top-p decoding samples a token from the smallest possible set of tokens whose cumulative probability exceeds the probability p.
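The following sketch illustrates one parallel decoding step for a single sequence under the three single-sequence strategies; tensor shapes, the alignment convention, and the function name are our assumptions, and batching over sequences is omitted.

```python
import torch

def refine_step(logits, prev_tokens, is_masked, k=0, p=0.0):
    """One parallel decoding step: (re)predict every masked position at once.
    logits: (m, vocab) decoder outputs aligned with target positions;
    prev_tokens: (m,) current candidate ids (unshifted); is_masked: (m,) bool.
    k > 0 selects top-k sampling, 0 < p < 1 selects top-p, otherwise greedy."""
    probs = torch.softmax(logits, dim=-1)
    if k > 0:                                    # top-k: sample among the k best
        top_probs, top_ids = probs.topk(k, dim=-1)
        picked = torch.multinomial(top_probs, 1)
        tokens = top_ids.gather(-1, picked).squeeze(-1)
    elif 0.0 < p < 1.0:                          # top-p (nucleus) sampling
        sp, si = probs.sort(dim=-1, descending=True)
        keep = sp.cumsum(-1) - sp < p            # smallest set with mass >= p
        picked = torch.multinomial(sp * keep, 1)
        tokens = si.gather(-1, picked).squeeze(-1)
    else:                                        # greedy decoding
        tokens = probs.argmax(dim=-1)
    # Unmasked positions (including all keywords) are copied unchanged, which
    # guarantees the lexical constraints survive every refinement step.
    return torch.where(is_masked, tokens, prev_tokens)
```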
Multiple-sequence Decoding. Top-k or top-p decoding can generate more diverse text but risks producing low-quality sentences. To remedy this, we propose multiple-sequence decoding: when generating sentences with top-k or top-p decoding, we run the decoding method N times to obtain multiple sequences. Because all sequences are mutually independent, they can be decoded simultaneously. After obtaining the generated sentences, we resort to a pre-trained language model, the GPT-2 small model (Radford et al., 2019), to rank them and choose the one with the lowest negative log-likelihood (NLL). The ranking operation also runs in a non-autoregressive way, thus avoiding excessive overhead.

Repetition Penalty. Even large, well-trained generation models may produce repetitive phrases or sentences, lowering the diversity of the generated text (Holtzman et al., 2020). We find that the proposed model suffers from this issue more seriously because the masked tokens are predicted conditionally independently in each refinement step. To alleviate this problem, we resort to a repetition penalty strategy, which discounts the scores of previously generated tokens. Slightly different from Keskar et al. (2020), we discourage the generation of tokens appearing in Y^M instead of previously generated tokens, achieving the repetition penalty without hurting the non-autoregressive property. The probability distribution for the t-th token at refinement step r is defined as follows:

p(ŷ^r_t = i) = \frac{\exp(h_i / I(i ∈ Y^M))}{\sum_j \exp(h_j / I(j ∈ Y^M))},   (7)

where h_i is the logit of token i at position t. If the condition c is true, I(c) equals θ; otherwise it equals 1.

Termination Criterion. During inference, we refine the output iteratively. When should we stop refining? One option is to monitor the encoder: if all predicted encoder labels are 0, no token requires revision, and we can stop. However, this criterion is so strict that refinement may never stop in many cases. We therefore adopt a relatively loose standard by monitoring the output of the decoder: if the decoder output is the same as that of the previous refinement step, we stop the refinement process.
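Below is a sketch of the repetition penalty of Eq. (7) and the termination check, with illustrative names; in practice the penalty would be fused into the decoding step above, before the softmax.

```python
import torch

def penalize(logits, y_mask_ids, theta=2.0):
    """Repetition penalty of Eq. (7): divide the logit of every token that
    already appears in Y_M by theta, discouraging the decoder from
    re-generating input tokens without breaking the parallel prediction."""
    penalty = torch.ones(logits.size(-1), device=logits.device)
    penalty[y_mask_ids] = theta                  # I(i in Y_M) = theta
    return logits / penalty                      # broadcasts over positions

def should_stop(current_ids, previous_ids):
    """Termination criterion: stop when the decoder output is identical
    to that of the previous refinement step."""
    return previous_ids is not None and torch.equal(current_ids, previous_ids)
```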

Experiment Setup
Datasets and Pre-processing. Following Miao et al. (2019) and Zhang et al. (2020b), we conduct experiments on One-Billion-Word and the Yelp dataset. One-Billion-Word is a public dataset for language modeling produced from the WMT 2011 News Crawl data. The Yelp dataset consists of business reviews on Yelp. For each dataset, we filter out sentences shorter than 10 or longer than 40 tokens. After pre-processing, we choose 1M and 0.1M sentences from each dataset as the training and validation sets, respectively. We also select 1K sentences to provide keywords. To be specific, we extract 1-6 keywords from each sentence. Therefore, we construct six kinds of test sets for lexically constrained text generation, and the size of each test set is 1K.

Baselines. We compare our proposed model with several strong baselines for lexically constrained text generation, including three traditional baselines (sep-B/F, asyn-B/F, and GBS) and three recent models (CGMH, POINTER, and X-MCMC-C). We implement two variants of the backward and forward language model (sep-B/F and asyn-B/F) (Mou et al., 2015), GBS (Hokamp and Liu, 2017) and CGMH (Miao et al., 2019). For a fair comparison, these baselines are based on the GPT-2 small model (n_layer = 12, n_head = 12, d_hidden = 768; 117M parameters), which has a similar architecture to the decoder of BART-large.
X-MCMC-C benefits from the guidance of an XLNet-based classifier, thus substantially improving the generation quality compared to CGMH. We train X-MCMC-C with the code provided by He and Li (2021), which is based on the XLNet-base-cased model (n_layer = 12, n_head = 12, d_hidden = 768; 110M parameters). We also compare our model with POINTER (Zhang et al., 2020b). Similar to our model, POINTER can insert multiple tokens in each step. We train two different POINTER models, POINTER and POINTER-2, with the code released by Zhang et al. (2020b). Specifically, POINTER is initialized with BERT-large, while POINTER-2 is initialized with the general model pre-trained on the English Wikipedia dataset. Both models have a comparable number of parameters (n_layer = 24, n_head = 16, d_hidden = 1024; 336M parameters) to CBART.

Training and Inference. For our model, we create synthetic data with the left method. For each sentence, we create 10 synthetic data instances. Therefore, for each dataset, the sizes of the synthetic training and validation sets are 10M and 1M, respectively. We initialize our model with the BART-large model (n_layer = 12, n_head = 16, d_hidden = 1024; 406M parameters). We use AdamW (Loshchilov and Hutter, 2019) with an initial learning rate of 1e-5 and α = 1 to update our proposed model for two epochs and choose the checkpoint with the lowest validation loss. Please refer to Table 6 for the effect of the hyper-parameter α.
During inference, we run beam search decoding with beam width 5 to generate text for sep-B/F, asyn-B/F and GBS. Following He and Li (2021), we run CGMH and X-MCMC-C for 200 refinement steps and select the candidate text with the lowest NLL as output. For POINTER, we use greedy decoding to generate constrained text. For CBART, we use the four parallel decoding methods (see Section 3.3). We apply the repetition penalty to sep-B/F, asyn-B/F, GBS and our models with θ = 2, and to POINTER with its default value θ = 1.25. We implement our model and baselines with HuggingFace (Wolf et al., 2019). Results of fine-tuned language models and well-trained classifiers of CBART are shown in Appendices A and B.

Automatic Evaluation Metrics. We evaluate the generated sentences from two aspects: generation quality and diversity. Following previous work (Zhang et al., 2020b), we use BLEU (Papineni et al., 2002), NIST (Doddington, 2002) and METEOR (Banerjee and Lavie, 2005) as metrics for generation quality, which measure the similarity between the generated text and the human reference. A higher BLEU, NIST or METEOR score indicates that a model can generate sentences more similar to human references. In this paper, we do not use NLL as a metric for sentence fluency, since a lower NLL value does not always denote better sentence quality: recent work (Holtzman et al., 2020) has found that language models assign low NLL scores not only to high-quality sentences, but also to repetitive and generic ones.
As for generation diversity, we first compute the cumulative 4-gram Self-BLEU score (SB-4) (Zhu et al., 2018) to measure how similar one sentence is to the other generations, treating one sentence as the hypothesis and the others as references. Then, we calculate distinct bigrams (D-2) and 4-grams (D-4), i.e., the number of unique bigrams and 4-grams divided by the total number of generated tokens. A lower Self-BLEU or a higher distinct n-gram value indicates higher diversity. Finally, we measure n-gram repetitions at the sentence level. Since the length of generated sentences varies greatly, we focus on the first 20 tokens of each sentence. Concretely, if a unigram appears more than two times or a trigram appears more than one time within the first 20 tokens of a sentence, we regard the sentence as containing a repetition.
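For concreteness, the repetition check and the distinct-n metric described above might be implemented as follows; tokenization details are left unspecified and the function names are ours.

```python
from collections import Counter

def has_repetition(tokens, window=20):
    """Sentence-level 'Rep' check: within the first `window` tokens, flag a
    unigram occurring more than twice or a trigram occurring more than once."""
    head = tokens[:window]
    unigram_counts = Counter(head)
    trigram_counts = Counter(zip(head, head[1:], head[2:]))
    return (any(c > 2 for c in unigram_counts.values())
            or any(c > 1 for c in trigram_counts.values()))

def distinct_n(sentences, n):
    """Distinct-n (D-2, D-4): unique n-grams across all generations divided
    by the total number of generated tokens."""
    ngrams, total = set(), 0
    for toks in sentences:
        total += len(toks)
        ngrams.update(zip(*(toks[i:] for i in range(n))))
    return len(ngrams) / max(total, 1)
```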

Main Comparison Experiment Results
We show the experiment results on the One-Billion-Word and Yelp test sets in Table 2, from which we can draw four conclusions. (1) Sep-B/F, asyn-B/F and GBS have low generation quality and diversity. These three models have low BLEU, NIST and METEOR values, indicating poor generation quality, possibly because they force keywords into the output during decoding. Moreover, compared with human-written text, sentences generated by sep-B/F, asyn-B/F and GBS are much less diverse, as they have higher Self-BLEU and lower distinct n-gram scores. Since these models are not aware of the keywords before generation, they tend to generate generic phrases ('he said', 'he would', etc.), thereby degrading the generation diversity.

Table 2: Results on One-Billion-Word and Yelp test sets. "Human" means human references. k and p are hyper-parameters for top-k and top-p decoding, respectively. c is the number of parallel sequences for multiple-sequence decoding. "M" refers to METEOR. "Ref" denotes the average number of refinements taken during decoding. "La" (latency) is the average decoding time (seconds) per sentence, computed on the test sets without mini-batching. "S" denotes speedup. "Rep" means the percentage of sentences containing n-gram repetitions. "Len" represents the average length of the generated sentences. Results for sep-B/F and asyn-B/F are on the test set with N = 1 constraint; results for the remaining models are averaged over the six test sets with N = 1 to N = 6 lexical constraints.

(2) Sampling-based methods have high generation diversity but low quality. CGMH is on par with humans in generation diversity (Self-BLEU and distinct n-gram scores), yet this comes at the expense of sentence quality (lower BLEU scores). This conclusion is in line with the results of previous work (Zhang et al., 2020b; He and Li, 2021). Compared with CGMH, X-MCMC-C slightly boosts the generation quality owing to fewer random modifications.
(3) POINTER reduces the inference latency but has low generation quality. Compared with the other baselines, POINTER significantly reduces the inference latency but generates lower-quality text. There are two possible reasons: (1) POINTER is based on BERT, which is not designed for text generation; (2) POINTER imposes the entire burden of generation on the decoder. In addition, pre-training POINTER on Wikipedia (POINTER-2) improves the performance, consistent with what is observed in previous work (Zhang et al., 2020b). However, this comparison is not entirely fair to the other models, which might also improve if trained on larger datasets.
(4) The proposed model outperforms baselines on most metrics. Similar to POINTER, CBART can refine multiple tokens in each refinement step. Therefore, CBART needs only a few steps to complete a sentence with the given keywords, dramatically reducing the inference time. As shown in Table 2, CBART with greedy decoding needs around five refinement steps and is about 28 and 31 times faster than CGMH on One-Billion-Word and Yelp, respectively. Moreover, CBART outperforms POINTER in generation quality and diversity by a large margin, possibly because CBART shifts part of the burden from the decoder to the encoder and benefits from the intrinsic generation ability of BART.
To summarize, it is non-trivial to satisfy all metrics when generating constrained sentences. CBART with greedy decoding can generate sentences with relatively high sentence quality and diversity while largely reducing the inference latency. We can also control sentence quality and diversity with top-k or top-p sampling. For example, increasing k or p allows CBART to generate tokens with low probabilities, thus improving sentence diversity.

Human Evaluation
To further assess the proposed model, we conduct a human evaluation. We compare CBART (greedy decoding) with GBS, CGMH, X-MCMC-C, POINTER-2 and human references. For each model, we randomly select 50 sentences and invite three volunteers to compare the sentences generated by different models. Following previous work (Huang et al., 2020), each annotator compares sentence A with sentence B and decides which one is more fluent or informative; a tie is allowed if they have no preference. Sentences in each pair are shuffled before annotation to avoid bias. We show the results of the human evaluation in Table 3. The inter-rater agreement measured by Fleiss' kappa (Fleiss, 1971) is 0.69 for fluency and 0.65 for informativeness, indicating substantial agreement according to Landis and Koch (1977). CBART outperforms the baselines in both fluency and informativeness, and is even on par with humans in terms of sentence fluency. However, the proposed model still lags far behind humans in terms of informativeness. We speculate that this is because the model is only taught to generate fluent sentences during training, resulting in generated sentences that are shorter than human references.
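As a reference for the agreement statistic, here is a small sketch of Fleiss' kappa over an item-by-category count matrix; the data layout (three raters; categories A wins, B wins, tie) and the toy numbers are assumptions for illustration only.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (n_items, n_categories) matrix, where
    counts[i, j] is how many raters put item i into category j.
    Every row must sum to the same number of raters (here, 3)."""
    counts = np.asarray(counts, dtype=float)
    n = counts[0].sum()                          # raters per item
    p_j = counts.sum(axis=0) / counts.sum()      # marginal category shares
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()    # observed vs. chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Toy usage: 4 sentence pairs, 3 raters, categories [A wins, B wins, tie].
print(fleiss_kappa([[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]))
```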

Ablation Study and Analysis
Effect of Methods for Creating Synthetic Datasets. We train CBART on synthetic datasets constructed with the different insertion strategies and show the results at the top of Table 4. CBART trained with the left method achieves the best sentence quality in terms of BLEU, NIST and METEOR. We speculate that this is because the left method makes CBART generate the leftmost missing token first, which is consistent with the left-to-right generation order of BART; CBART trained with the left method therefore has a smaller gap between pre-training and fine-tuning than the other variants.

Effect of the Repetition Penalty. In Table 4, CBART with the repetition penalty removed tends to get stuck in repetition loops, and the percentage of sentences containing repetitions surges from 2.3% to 54.7% (row 1 vs. row 6).
Effect of the Number of Constraints. As shown at the bottom of Table 4, the generation quality improves as the number of constraints increases. Furthermore, the generation diversity improves because the model is less likely to generate similar sentences as the constraints increase.
Effect of Training Objectives. We conduct experiments to analyze the effect of training CBART with different training objectives, language modeling (LM) and masked language modeling (MLM).
The difference between them is that during training, LM reconstructs the original text, while MLM only predicts the masked tokens. As shown in Table 5, CBART trained with LM (row 1) performs better than CBART trained with MLM (row 2), possibly because LM is more suitable for text generation, in line with previous results (Lewis et al., 2020).

Effect of the Causal Attention Mask. Previous non-autoregressive translation (NAT) models (Ghazvininejad et al., 2019; Stern et al., 2019) removed the causal attention mask (CAM) from the decoder so that each target token can attend to the other tokens of the decoder input. Therefore, we also conduct experiments to analyze the effect of CAM. However, we do not observe any significant improvement (row 1 vs. row 3) for our task. This arises from the difference in input between our task and machine translation. In machine translation, the encoder and decoder take different languages as input. By comparison, in our task, the decoder input is constructed by inserting mask tokens before some positions and replacing some tokens with mask tokens, so each token of the decoder input Y^M also appears in the encoder input X. When training CBART with CAM, each token of the decoder input can attend to the encoder input via cross attention, which is equivalent to attending to the other tokens of the decoder input after removing CAM. That is why removing CAM brings no further improvement.

Effect of Pre-trained Models. We train two base CBART models with 6 layers, initialized with random values (row 5) or the BART-base model (row 6), and a large CBART model initialized with the BART-large model (row 7). From Table 5, we conclude that pre-trained models (row 5 vs. row 6) and model size (row 6 vs. row 7) are important. These results are in line with our intuitions: (1) CBART inherits syntactic and semantic knowledge from BART; (2) increasing the model size improves the performance. Note that in this paper, the experiment results of our proposed model are based on the large CBART model unless otherwise specified.
Effect of the Hyper-parameter α. Since α is an important hyper-parameter for training CBART, we train CBART-base (the base CBART model, initialized with BART-base) with different values of α. From the results in Table 6, we find that α = 1.0 is an appropriate value for CBART: when α is too large, CBART pays more attention to the generation/decoder loss; conversely, when α is too small, CBART puts more focus on the classification/encoder loss. Both losses are essential for the performance of CBART. As shown, CBART trained with α = 1.0 performs best on most metrics, striking a good trade-off between the encoder loss and the decoder loss.

Samples and Analysis
We show some sentences generated by the baselines and our proposed model with lexical constraints extracted from the Yelp test sets in Table 7. From this table, we can see that the sentence generated by CBART with greedy decoding is more fluent and meaningful than those of the baselines. We can also generate more diverse and informative sentences with top-k or top-p sampling by increasing k or p, but at the risk of producing less fluent sentences (see p = 0.95, c = 1). Increasing the number of sampled sequences c can slightly alleviate this (see p = 0.95, c = 5). More generated sentences are shown in Table 11 and Table 12 in Appendix C.

Table 7: Sentences generated with lexical constraints extracted from the Yelp test sets.
Constraints: family, good, location, star
Human: my family and i did not have a good experience here at this location . the one star is for the food .
GBS: this is a good location for family and star wars fans .
CGMH: very nice family friendly spot , good location on star .
CBART with different decoding methods:
greedy: my family and i always get good food at this location . the one star is for customer service .
k=5, c=1: my favorite family owned restaurant ! friendly staff , good food and the location is great and always a 5 star experience .
k=50, c=1: it was a family friendly restaurant in las vegas with good food for children and nice location , this two star is because of service .
k=50, c=5: great family owned restaurant and good food . great location and a solid star coffee experience in general !

Related Work
Pre-trained Language Models. Large-scale pre-trained models have achieved remarkable success in many natural language understanding (NLU) tasks (Devlin et al., 2019; Liu et al., 2019b) and natural language generation (NLG) tasks (Radford, 2018; Radford et al., 2019). Some unified pre-trained language models, such as XLNet (Yang et al., 2019), UNILM (Dong et al., 2019) and BART (Lewis et al., 2020), attempt to solve both NLU and NLG tasks. Unlike previous work, we extend BART in order to generate text under specified lexical constraints.

Non-autoregressive Generation. NAT (Gu et al., 2018) generates all tokens in parallel, thus speeding up inference. This parallel decoding comes at the cost of degraded generation quality, since NAT breaks the dependency among target tokens.
To alleviate this problem, recent NAT models (Lee et al., 2018; Stern et al., 2019; Gu et al., 2019; Ghazvininejad et al., 2019) generate the output in several steps, making a trade-off between decoding speed and generation quality. Susanto et al. (2020) extended the Levenshtein Transformer (Gu et al., 2019) by injecting keywords into machine translation. This works for machine translation because the source and target are mostly aligned and the solution space is small; however, it performs worse than BART on general text generation (Lin et al., 2020), which has a much larger solution space. Different from previous work, CBART inherits the generation ability of BART while maintaining the decoding speed of NAT models.

Lexically Constrained Text Generation. B/F-LMs (Mou et al., 2015; Liu et al., 2019a) are limited to generating text with one lexical constraint. GBS (Hokamp and Liu, 2017; Post and Vilar, 2018) incorporates multiple constraints into the output by controlling the decoding process, yet degrades generation quality and diversity. Recently, MCMC sampling has been applied to constrained text generation (Miao et al., 2019; Zhang et al., 2020a). Nevertheless, refinements conducted by these models are decided randomly. To alleviate this, Sha (2020) used gradient information, and He and Li (2021) used a token-level classifier to decide the refinements, but these models update only one token in each step. POINTER (Zhang et al., 2020b) can refine multiple tokens in each step but is based on BERT and imposes the entire generation burden on the decoder, thus degrading the sentence quality. To solve these problems, we propose CBART, which is based on BART and transfers part of the decoder's burden to the encoder.
Another line of work also generates text based on keywords (Fan et al., 2018; Lin et al., 2020) but does not force them to appear in the output. By comparison, our work requires all keywords to appear in the output.

Conclusion
In this paper, we presented CBART for lexically constrained text generation. Compared with previous work, CBART leverages BART and transfers part of the generation burden from the decoder to the encoder. Furthermore, CBART refines multiple tokens in parallel in each refinement step, thus accelerating inference. Experiment results on the One-Billion-Word and Yelp datasets show that CBART can generate fluent and diverse text with lexical constraints while dramatically reducing the inference time.

A Performance of Language Models
The forward GPT-2, backward GPT-2, separate forward GPT-2 and separate backward GPT-2 models are initialized with the pre-trained GPT-2 small model and fine-tuned on the training sets of One-Billion-Word or Yelp. We choose the checkpoint with the lowest NLL loss on the validation set. These models are used by the baselines, including sep-B/F, asyn-B/F, GBS and CGMH. NLL results of the fine-tuned language models on the validation sets are shown in Table 8.

B Performance of Classifiers
We create synthetic datasets for CBART from One-Billion-Word and Yelp. We select 1M and 0.1M sentences from each dataset as the training and validation sets, respectively. For each sentence, we create 10 synthetic data instances; therefore, for each dataset, the sizes of the synthetic training and validation sets are 10M and 1M. We fine-tune CBART on the synthetic training set for two epochs with a learning rate of 1e-5 and select the checkpoint with the lowest loss on the synthetic validation set. We show the performance of the classifiers of CBART-base (the base CBART model, initialized with BART-base) and CBART-large (the large CBART model, initialized with BART-large) in Table 9 and Table 10, respectively.

Table 9: Results of the classifier of CBART-base on the synthetic validation sets of One-Billion-Word and Yelp. "P" and "R" denote precision and recall.

Table 10: Results of the classifier of CBART-large on the synthetic validation sets of One-Billion-Word and Yelp. "P" and "R" denote precision and recall.

C Generating Text with Lexical Constraints
We show some text generated by the baselines and our proposed model with lexical constraints extracted from the One-Billion-Word and Yelp test sets in Table 11 and Table 12.

Table 11: Generated text with constraints from One-Billion-Word test sets. "Human" refers to the human reference.

Constraints: hearing, system, need
Human: We are already hearing arguments for focusing everything on the economy damaged by failure in the banking system , dropping the need to fix the climate system .
GBS: " We need to have a system of hearing protection , " he said .
CGMH: A public hearing is just one more system that we need .
X-MCMC-C: The new public hearing system will need funding for about six months after the court ruling .
POINTER-2: and he said at the senate hearing that it 's changing in the entire current system , it 's a need for a reform now . . .
CBART, greedy: The new hearing system is less expensive , and there was no need for a specialist .
CBART, k=5, c=1: " The hearing system is something we need to improve .
CBART, k=50, c=1: She has a hearing system and is in need of glasses .
CBART, p=0.5, c=1: The hearing system is closed , but you will not need to pay for it .
CBART, p=0.9, c=1: Is there any existing hearing system that would need to be adapted ?

Constraints: admitted, health, heavy, cold
Human: Philip , 86 , was admitted to the King Edward VII Hospital in central London on Thursday after his health deteriorated having caught a heavy cold .
GBS: He admitted that he had been " cold and heavy " in the health department
CGMH: He admitted that he had faced mental health problems and a heavy cold .
X-MCMC-C: The Labour MP was admitted to a mental health hospital after suffering a heavy cold for most of the week .
POINTER-2: but when she was admitted to a local mental health unit because what she did was so heavy a burden on mea , and one or more afraid of the other colds . . . ?
CBART, greedy: The singer admitted to having mental health problems and suffering from a heavy cold .
CBART, k=5, c=1: The court admitted the pair have two previous health problems , including heavy doses of cold and flu medication .
CBART, k=50, c=1: Health ministers admitted that the health service was weak because it had suffered a heavy dose of cold and flu .
CBART, p=0.5, c=1: She admitted to having health problems , including heavy cold and flu .
CBART, p=0.9, c=1: He had been admitted to the hospital mental health unit , suffering from a heavy head cold and fever .

Constraints: way, back, missing, weeks, due
Human: John Lackey is all the way back after missing the first six weeks due to injury , and pitching like an ace again .
GBS: " The way back is due to the missing weeks , " he said .
CGMH: The only way back is after missing four weeks due to injury .
X-MCMC-C: He is on his way back home after missing two weeks due to an Achilles tendon injury .
POINTER-2: he has found his way into england despite his back injury , missing three of the last two weeks with a calf injury and another two due to a calf injury .
CBART, greedy: He is on his way back to the squad after missing two weeks due to a stomach bug .
CBART, k=5, c=1: She is making her way back into action after missing three weeks due to illness .
CBART, k=50, c=1: Took his way back from a groin injury after missing six weeks due to knee injuries .
CBART, p=0.5, c=1: But he was on his way back after missing two weeks due to a visa issue .
CBART, p=0.9, c=1: He is on the way back after missing two weeks due to a fractured collarbone and bruised ribs .

Constraints: likely, certain, respond, others, new, environment
Human: That said , it is likely that certain forms of religion would respond better than others to the new environment .
GBS: " The new environment is likely to respond to certain others , " he said
CGMH: Not likely , but certain areas will respond better than others to the new environment .
X-MCMC-C: And they are also likely to be certain to respond to others who want to create a new environment to replace the old .
POINTER-2: they are more likely to be unable to do a certain things , about how they respond better than others in this area , are new new or why the current environment is different .
CBART, greedy: The first group is most likely to be certain to respond better than others in a new environment .
CBART, k=5, c=1: The company is likely to depend on certain sectors and respond to others in the new environment .
CBART, k=50, c=1: We have no more likely chance of survival , while certain groups respond differently and the others adapt to a new environment .
CBART, p=0.5, c=1: How likely to be that you learn about certain issues and respond to others in a new environment .
CBART, p=0.9, c=1: I suspect it is likely to find certain chemicals would respond better than others , which could cope in the new environment .