Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation

Prior studies on text-to-text generation typically assume that the model can figure out what to attend to in the input and what to include in the output via seq2seq learning, with only the parallel training data and no additional guidance. However, it remains unclear whether current models can preserve important concepts in the source input, as seq2seq learning places no explicit focus on the concepts and commonly used evaluation metrics treat them as equally important as other tokens. In this paper, we present a systematic analysis that studies whether current seq2seq models, especially pre-trained language models, are good enough at preserving important input concepts and to what extent explicitly guiding generation with the concepts as lexical constraints is beneficial. We answer these questions by conducting extensive analytical experiments on four representative text-to-text generation tasks. Based on the observations, we then propose a simple yet effective framework to automatically extract, denoise, and enforce important input concepts as lexical constraints. This new method performs comparably to or better than its unconstrained counterpart on automatic metrics, demonstrates higher coverage for concept preservation, and receives better ratings in the human evaluation. Our code is available at https://github.com/morningmoni/EDE.


Introduction
Text-to-text generation is an important research problem with a broad set of applications, such as dialog response generation (Dinan et al., 2019), headline generation, and summarization (Mao et al., 2020b). A distinct feature of text-to-text generation (vs. free-form text generation) is that it is often desired to preserve the

[Figure 1: Examples of text-to-text generation tasks where preserving important input concepts is crucial for producing satisfactory results. Source input: "Israel's air forces on Monday launched a third missile strike on Gaza city in less than four hours, Palestinian witnesses said." Task: headline generation.]

concepts in the source input (see Fig. 1 for an illustration). On one hand, concept preservation is crucial for maintaining the factual consistency between the input and output (Maynez et al., 2020; Nan et al., 2021). On the other hand, encouraging the model to focus on important input concepts may also improve its generation quality (Yao et al., 2019). Mainstream text-to-text generation methods are mostly data-driven and "hope" to learn meaningful mappings between source input and target output via sequence-to-sequence (seq2seq) learning. This is particularly the case for the recent pre-trained language models (PLMs) (Lewis et al., 2020; Raffel et al., 2020), where seq2seq learning is expected to identify what to attend to in the source input and what to include in the model output, with access to only parallel training data.
However, as seq2seq learning does not explicitly focus on key concepts (e.g., named entities) and commonly used evaluation metrics (e.g., BLEU and ROUGE) treat all the tokens in a sequence as equally important, it is unclear how many of the important input concepts can be (or have been) preserved. Existing attempts to alleviate this issue use soft constraints, such as a copy mechanism or additional attention, to focus on (certain parts of) the source input (See et al., 2017; Dinan et al., 2019; Dou et al., 2021). Nevertheless, they still lack explicit guidance and resort to seq2seq learning itself to figure out what is important, with no guarantee on the model output and no evaluation of input concept preservation.
Explicit guidance for text generation can be achieved by lexically (hard) constrained generation (LCGen), which specifies lexical constraints (tokens) that must be included in the model output. However, to what extent guiding text-to-text generation with lexical constraints works in general remains unknown, as existing studies on LCGen (Hokamp and Liu, 2017; Post and Vilar, 2018; Zhang et al., 2020b) focus on scenarios where gold (ground-truth) constraints are given (e.g., generating a story using user-specified keywords), while in generic text-to-text generation tasks the target output, from which such constraints would be taken, is unavailable.
In this paper, we present a systematic analysis on generic text-to-text generation to understand (1) the ability of seq2seq models, especially PLMs, to preserve important input concepts and (2) whether more explicit guidance that uses important input concepts as lexical constraints can complement seq2seq learning. We select four representative tasks where preserving important concepts (entities) in the source input and incorporating them in the model output is essential: question generation, knowledge-grounded dialog, headline generation, and abstractive summarization. We examine the effectiveness of guiding generation with the important input concepts as lexical constraints, where the concepts are either obtained by comparing the source input with the target output in an analytical study or automatically extracted from the source input in a practical setting.
Specifically, in the analytical study, we first evaluate how many of the concepts found in the target output (named gold concepts) are available in the source input, and how many of them are already preserved by seq2seq models. We then investigate the room for improvement if we guide generation with the gold concepts as lexical constraints (named gold constraints). Next, when the target output is unavailable, we propose a simple yet effective framework, named EDE, to automatically Extract, Denoise, and Enforce input concepts as lexical constraints. EDE achieves significant improvement on two tasks and moderately better or comparable performance on the other two under sequence-level automatic evaluation. Moreover, EDE receives better ratings in the human evaluation and demonstrates higher coverage in the concept-level evaluation on all the tasks that we examine.

Contributions.
(1) We analyze whether current PLMs for text-to-text generation are good enough for preserving important input concepts via seq2seq learning.
(2) We study the usefulness of guiding generation explicitly with important input concepts as lexical constraints. (3) We propose a framework to automatically extract, denoise, and enforce concept constraints, which achieves comparable or better performance than unconstrained generation on a range of text-to-text generation tasks.

Analysis on Concept Preservation
In this section, we conduct a series of analytical experiments on concept preservation, which aim to answer the following two questions: Q1: Is the current "PLMs + seq2seq fine-tuning" paradigm for text-to-text generation good at preserving the important concepts in the source input? Q2: What is the room for improvement if we guide text-to-text generation by enforcing high-quality (gold) concepts as lexical constraints?

Analysis Setup
We conduct extensive analytical experiments on four text-to-text generation tasks including question generation, knowledge-grounded dialog, headline generation, and abstractive summarization. All the tasks require the model to preserve important concepts (entities) in the input and include them in the output in order to achieve satisfactory results. Correspondingly, we consider the input as the source of lexical constraints and the entities in the target output as the desired (gold) constraints.
To answer Q1, we first evaluate the availability of concepts, namely how many of the gold concepts appearing in the target output can be found in the source input. We then conduct a manual analysis to study why some of the gold concepts are missing in the source input. Finally, we analyze model performance by matching gold concepts with the unconstrained model output, which reveals how many of the important input concepts are already preserved without the use of explicit constraints.
To answer Q2, we use the gold concepts as lexical constraints to guide the generation process. We first estimate an ideal upper bound by enforcing all the gold constraints taken from the target output. We next remove the gold constraints that cannot be found in the source input and examine how large the gap is from the ideal upper bound.

Tasks and Datasets
Question Generation. Question generation is the task of generating a question given a passage and the corresponding answer. Ideally, a seq2seq model would learn to focus on the relevant information surrounding the answer span and reuse some of the concepts in the source input as part of the generated question. We use the SQuAD 1.1 dataset (Rajpurkar et al., 2016), which is repurposed by Du et al. (2017) for question generation.

Knowledge-Grounded Dialog.
Knowledge-grounded dialog involves utterances grounded in specific knowledge sources. It is used as another test case for our analysis, as the model is supposed to extract important concepts from the grounded knowledge when appropriate. We use the Wizard of Wikipedia (WoW) dataset (Dinan et al., 2019) for evaluation. As we focus on generation instead of knowledge retrieval, we adopt the gold knowledge setting of WoW, where the sentences with relevant knowledge are provided, and use the test split that consists of new dialogues under topics overlapping with the training set.
Headline Generation. Headline generation fits our analysis as a headline usually consists of the most important concepts in the input article. We use the English Gigaword dataset (Napoles et al., 2012) for evaluation. As training on the full training set of Gigaword (Rush et al., 2015) is computationally prohibitive, we adopt two low-resource settings where the 10k training examples in Dong et al. (2019) and the first 300k of the 3.8M training examples are used.
Abstractive Summarization. Abstractive summarization is used as another testbed as a summary generally involves important concepts in the source document. We take two widely used news summarization datasets, CNN/Daily Mail (CNN/DM) (Nallapati et al., 2016) and XSum (Narayan et al., 2018), for evaluation, where CNN/DM is more extractive (Mao et al., 2020a) and XSum more abstractive (Maynez et al., 2020) in nature.

Constrained Generation
Next, we briefly introduce constrained generation and the specific methods used for our analysis.

Hard Constrained Generation. Lexically (hard) constrained generation (LCGen) specifies lexical constraints (tokens) that must be present in the model output. Compared to soft constrained generation, LCGen has the advantage of explicitly ensuring the presence of certain input concepts, but it can also be problematic when the enforced constraints are noisy (inappropriate). We choose LCGen methods that enforce constraints by constraining beam search (Hokamp and Liu, 2017; Post and Vilar, 2018), as they operate at the inference stage and can be easily combined with different seq2seq models, making our analysis more generalizable. In contrast, sampling-based and insertion-based approaches are not easily applicable to generic text-to-text generation, as they typically involve specialized training schemes or do not accept inputs other than the constraints (Miao et al., 2019; Zhang et al., 2020b; Sha, 2020).
Specifically, we take dynamic beam allocation (DBA) (Post and Vilar, 2018) for our analysis, as it is more efficient than other LCGen methods that constrain beam search (Hokamp and Liu, 2017). DBA revises beam search by dividing the beam into a number of groups (named banks), each of which stores the hypotheses that satisfy the same number of constraints. DBA ensures that every constraint is present in the model output, as the <EOS> token is only allowed once all the constraints are met. At each decoding step, there are three sources of candidates: (1) the top-k tokens across all hypotheses, as in standard beam search; (2) all unfulfilled constraint tokens for each hypothesis; and (3) the single-best token for each hypothesis. The banks are trimmed by sequence probability if the total number of candidate tokens exceeds the capacity (beam size).

Soft Constrained Generation. We additionally examine the effectiveness of soft constrained generation for comparison. There are various types of soft constraints; we particularly consider the type that implicitly specifies input texts the model should focus on, using the copy mechanism as a representative example. The copy mechanism estimates the importance of tokens in the source input and learns to copy them when appropriate, which is useful for preserving important input concepts, especially rare concepts that the model is not sufficiently exposed to during training. For PLMs, we take the encoder-decoder cross-attention in the last decoder layer as the copy distribution (Xu et al., 2020).
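The bank allocation at the heart of DBA can be illustrated with a deliberately simplified sketch. Here each hypothesis is reduced to a (score, constraints-met) pair, and the beam is filled round-robin across banks; the function name and the trimming policy are our own illustration, not the exact procedure of Post and Vilar (2018).

```python
from collections import defaultdict

def allocate_banks(candidates, beam_size, n_constraints):
    """Group candidate hypotheses by the number of constraints they satisfy
    (one bank per count), then fill the beam by taking the best-scoring
    hypothesis from each non-empty bank in turn."""
    banks = defaultdict(list)
    for score, n_met in candidates:
        banks[n_met].append((score, n_met))
    for bank in banks.values():
        bank.sort(reverse=True)  # highest score first within each bank
    beam = []
    while len(beam) < beam_size and any(banks.values()):
        # give one slot per non-empty bank, preferring more constraints met
        for n_met in range(n_constraints, -1, -1):
            if banks[n_met] and len(beam) < beam_size:
                beam.append(banks[n_met].pop(0))
    return beam
```

In this toy form, hypotheses that satisfy many constraints cannot crowd out unconstrained high-probability hypotheses, which is the intuition behind DBA's banks.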

Experimental Settings
We use BART (Lewis et al., 2020) as the main base model for our analysis. We use spaCy (Honnibal and Montani, 2017) to extract entities from the target output as the gold concepts; the quality of this extraction has been shown to be reasonably good for source-target alignment in summarization (Nan et al., 2021). We mainly use exact matching, to be consistent with current automatic metrics that generally measure lexical overlap, while also manually analyzing the missing concepts to address the limitations of exact matching. More implementation details can be found in App. A.

Results and Analysis
Concept Availability. We first examine how many of the gold concepts can be found in the source input. A high degree of concept overlap between the source input and target output is expected, since the tasks specifically require grounding in the input. As listed in the 3rd column of Table 1, however, it is not uncommon for the corresponding dataset to accidentally involve extrinsic information, i.e., target output that contains information which cannot be found in the source input. The issue is most severe on Gigaword and XSum: around half of the gold concepts are not found in the source input, which coincides with recent studies on the factual consistency of text-to-text generation (Matsumaru et al., 2020; Maynez et al., 2020).

We then manually categorize why the missing gold concepts cannot be matched; one such reason is that the concept is rephrased with commonsense knowledge (e.g., "on Valentine's Day" to "in February"). As shown in Fig. 2, for most datasets, the major reason that gold concepts are missing is simply that they are spelled differently and not found by exact matching. That said, three of the five datasets still have over 10% missing gold concepts after ruling out this factor. Such findings suggest that we need to put more effort into factual consistency when creating text-to-text generation datasets and, where possible, remove problematic (hallucinated) examples from existing datasets. The quality of the gold concepts (NER) is reasonably high except on SQuAD. Nevertheless, upon closer inspection, we find that most NER errors on SQuAD are repeated ones (e.g., "what year" is often incorrectly recognized as a date) that can easily be fixed manually.

Concept Fulfillment. We next study concept fulfillment, i.e., matching the gold concepts against the model output to examine how many of the important concepts are preserved without the use of constraints.
As shown in the 4th column of Table 1, roughly half of the gold concepts are preserved through standard seq2seq learning. The unconstrained model performs especially poorly on SQuAD and CNN/DM, where the gap between the concepts in the source input and those in the model output (the 3rd and 4th columns) is quite large. Interestingly, the model output contains more gold concepts than the source input on XSum, possibly because the significant amount of extrinsic information in the target output of XSum encourages the model to hallucinate. If we only consider available gold concepts (the 5th column), the degree of concept fulfillment improves consistently but remains relatively low.

Upper Bounding. In Fig. 3, we list the ROUGE F1 scores when enforcing all the gold concepts as constraints (all) and only those found in the input (available). We observe that the room for improvement is usually large when all the gold constraints are used, but it may become smaller once the constraints not found in the source input are excluded, especially on Gigaword and XSum, the two datasets with more extrinsic information. One exception is CNN/DM, where using available constraints performs better than using all of them. We hypothesize that this is because CNN/DM involves longer outputs and many more concepts than the other tasks, which together may mislead the beam search process if the beam is mostly occupied with fulfilling the constraints; a reduced number of constraints is hence more appropriate. The results above indicate that when guiding generation with important input concepts as lexical constraints, it can be difficult to improve overall output quality on datasets that involve abundant extrinsic information.
That said, as commonly used metrics like ROUGE conduct sequence-level evaluation without distinguishing different tokens or identifying concepts, one also needs to examine model performance by measuring concept preservation in particular to truly reflect the effectiveness of LCGen (Sec. 4.3).

Guiding Text-to-Text Generation with Extracted Concepts
In this section, we study a practical setting where automatically extracted concepts are used as constraints (named automatic constraints), with the following research question in mind: Q3: Can we automatically extract important input concepts and use them as lexical constraints to guide text-to-text generation?

Automatic Constraint Extraction
Existing studies on LCGen generally focus on scenarios where the gold constraints are provided and the model output is centered on the constraints, but for most applications in text-to-text generation, the gold constraints (derived from the target output) are not accessible. One thus needs to extract constraints automatically from the source input. While it is not uncommon to extract keywords from the source input to help generation, previous methods (Yao et al., 2019) typically use them as soft constraints via attention rather than as lexical constraints that guide the generation process explicitly. To extract constraints automatically, we create constraint labels on the training set by mapping gold concepts from the target output to the source input (as in the study of concept availability). We then train a state-of-the-art keyphrase extraction model, BERT-KPE, to predict constraints at test time. BERT-KPE uses contextualized representations and jointly optimizes keyphrase identification and ranking. We conduct constraint extraction independently, as our preliminary experiments suggest that multi-task learning of text-to-text generation and constraint extraction is not particularly helpful. Moreover, decoupling the two tasks makes our framework generally applicable to different trained seq2seq models.
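The labeling step above can be sketched as token-level supervision: gold concepts found in the target output are matched back into the tokenized source, and matched spans become positive labels for the extractor. This is our own simplified illustration of the label creation, not the BERT-KPE training code.

```python
def label_constraints(source_tokens, gold_concepts):
    """Return a 0/1 label per source token: 1 if the token lies inside an
    exact (case-insensitive) match of some gold concept, else 0."""
    labels = [0] * len(source_tokens)
    lowered = [t.lower() for t in source_tokens]
    for concept in gold_concepts:
        c_toks = concept.lower().split()
        # slide a window of the concept's length over the source tokens
        for i in range(len(lowered) - len(c_toks) + 1):
            if lowered[i:i + len(c_toks)] == c_toks:
                for j in range(i, i + len(c_toks)):
                    labels[j] = 1
    return labels
```

A keyphrase extraction model trained on such labels can then propose constraints for unseen inputs where the target output is unavailable.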

Constraint Denoising
Unlike using the gold constraints, directly enforcing all the automatic constraints is likely to worsen generation as some of the extracted constraints are inappropriate (see example in Fig. 4). Therefore, we propose a Denoised variant of DBA, named DDBA, which is designed specifically for dealing with automatic constraints that are noisy in nature. DDBA conducts step-level dynamic constraint denoising by modifying the DBA method as follows.
First, only constraints deemed appropriate at a decoding step are added to the beam, instead of all unmet constraints as in DBA. We use the generation probability of the seq2seq model to measure the appropriateness of constraints. Intuitively, inappropriate constraints that would cause nonfluency at the current step are likely to receive low probability scores and thus be filtered out. Another advantage of dynamic constraint denoising is that by filtering noisy constraints, more beam space is saved for better sequences that are likely to lead to a final output of higher quality. Second, unlike DBA, DDBA allows the <EOS> token even when not all the constraints are fulfilled. In this way, the model is not forced to include noisy constraints in its output. Note that not satisfying all constraints is acceptable, since our goal is not to fulfill every constraint but rather to use them to guide generation.

[Figure 4: Illustration of the proposed framework EDE with a real example. DBA enforces all lexical constraints and results in nonfluency, while DDBA filters the noisy constraint and generates output of higher quality.]

Comparison with Supervised Denoising. We considered training supervised classifiers using features such as self-attention and the copy distribution to predict constraint quality, but found that simply filtering the constraints by their token probability at the current decoding step performs competitively with better efficiency; we hence use the vocabulary probability distribution as the scoring function (more details in App. B). We observe that DDBA, despite its simplicity, improves the quality, and especially the fluency, of constrained generation under both automatic and human evaluation.
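The step-level filtering described above can be sketched in a few lines. Here `next_token_probs` stands in for the seq2seq model's vocabulary distribution at the current decoding step, and the threshold is a hypothetical hyperparameter of our own.

```python
def filter_constraints(unmet_constraints, next_token_probs, threshold=0.01):
    """Keep only unmet constraint tokens whose generation probability at the
    current decoding step exceeds the threshold; in this sketch, the rest
    are treated as noise and skipped for this step."""
    return [c for c in unmet_constraints
            if next_token_probs.get(c, 0.0) > threshold]
```

A constraint filtered at one step may still be fulfilled later if the model assigns it a higher probability at a subsequent step, which is why the filtering is dynamic rather than a one-shot removal.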

Experiments
In this section, we conduct experiments of guiding generation with automatic constraints and compare with the base as well as state-of-the-art models. The constraints are automatically extracted without access to the target output.

Automatic Evaluation
Question Generation. We show the performance comparison for question generation on SQuAD in Table 2. EDE (DBA) leads to worse BLEU-4 and ROUGE-L but higher METEOR, possibly due to the noise in the automatic constraints and the fact that METEOR uses stemming and synonym matching in addition to exact matching, which successfully matches terms with different spellings. In contrast, EDE (DDBA) performs significantly better than EDE (DBA) and vanilla unconstrained BART (up to +4.3/+2.5 on BLEU-4), consistently outperforming existing baselines with BART-base and achieving new state-of-the-art results with BART-large on BLEU-4 and ROUGE-L. The copy mechanism generally leads to higher ROUGE-L but worse performance on the other two metrics.

Knowledge-Grounded Dialog. We show the results of knowledge-grounded dialog on WoW in Table 3. We observe that adding both hard and soft constraints helps the generation process, which again implies that more explicit guidance than pure seq2seq learning can still be beneficial even for PLMs. The improvement is consistent with the relatively high upper bound of WoW when available gold constraints are used. Also, enforcing all constraints via DBA without dynamic filtering appears to be more detrimental than helpful.

Headline Generation. We show the results of headline generation in Table 4. The improvement of EDE on Gigaword is not as large as on previous tasks, probably because concept availability on Gigaword is low and it is hard to extract and utilize meaningful concepts from the source input. Nevertheless, we will later show that EDE is preferred in the human evaluation and also preserves input concepts better on Gigaword.

Abstractive Summarization. We list the results of abstractive summarization in Table 5. The performance of EDE is comparable to its unconstrained counterpart on both datasets. Better performance may be achieved when the constraint extraction method is improved.
Apart from sequence-level evaluation (ROUGE), we observe that EDE is consistently better at concept preservation under concept-level evaluation (Sec. 4.3).

Human Evaluation
In addition to automatic evaluation, we further conduct human evaluation with the following three aspects: closeness, relevancy, and fluency (Prabhumoye et al., 2019). Closeness measures the similarity between the model output and target output.
Relevancy considers the quality of the model output with respect to the source input directly, since generation tasks usually admit more valid outputs than the single target output. Fluency of the model output is rated on a scale of 1 (unreadable) to 4 (perfect). We randomly sample 50 examples from each task and conduct pairwise comparisons between BART, EDE (DBA), and EDE (DDBA) with the help of three external annotators. As listed in Table 6, the results are largely consistent with the automatic evaluation. For example, DBA leads to a similar number of better and worse outputs relative to BART on SQuAD and more losses on Gigaword, while DDBA outperforms BART more stably on both datasets in terms of closeness as well as relevancy. The gaps are generally larger on relevancy than on closeness, indicating that DDBA is preferred by humans when the output is not compared with the target output directly. DDBA also consistently outperforms DBA on both aspects and produces more ties with BART thanks to its denoising function. The fluency ratings of BART, DBA, and DDBA are 3.82, 3.52, and 3.88 on SQuAD, and 3.34, 2.92, and 3.26 on Gigaword. These results indicate that DBA negatively impacts generation fluency, while the fluency of DDBA is comparable to, and sometimes even better than, that of unconstrained generation.

Evaluation on Concept Preservation
As sequence-level metrics consider all tokens equally important and can only measure overall generation quality, we further conduct concept-level evaluation to measure model performance on concept preservation in particular. We first examine the quality of the extracted concepts (automatic constraints) in Table 7. We observe that the F1 scores on different datasets largely fall in the range of 0.4 to 0.5, which is on par with state-of-the-art performance on keyphrase extraction benchmarks (Meng et al., 2017).

Similarly, we next analyze the concept preservation of seq2seq models and show the results in Fig. 5. We only consider gold concepts available in the source input for a fair evaluation. We observe that DBA generally leads to higher recall but lower precision, while DDBA balances the two and consistently achieves the best F1. The improvements are most remarkable on datasets where the unconstrained model does not preserve input concepts well (e.g., +3.6 on SQuAD and +1.1 on CNN/DM) and less significant when the unconstrained model already preserves most of the concepts or the dataset involves abundant extrinsic information.
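The concept-level evaluation can be sketched as set precision/recall/F1 between the (available) gold concepts and the concepts found in the model output. This is our own minimal formulation of the metric described above.

```python
def concept_prf(gold_concepts, output_concepts):
    """Precision, recall, and F1 of concept preservation, treating the
    gold and predicted concepts as sets (exact matching)."""
    gold, pred = set(gold_concepts), set(output_concepts)
    tp = len(gold & pred)  # concepts both required and preserved
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Enforcing many constraints drives recall up at the cost of precision (as with DBA); filtering noisy constraints trades a little recall for precision, which is how DDBA attains the best F1.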

Takeaways
From the analysis of concept preservation and of model performance when automatic constraints are used, we can see that EDE is likely to improve overall generation quality when the upper-bound performance using available gold concepts is high. EDE may not be very effective when many gold concepts cannot be found in the source input, an issue caused by mixed factors: (Dataset) the target output itself can be problematic and involve extrinsic concepts; (Model) concept extraction uses exact matching and misses some of the concepts; (Evaluation) existing metrics also use exact matching and would not recognize model improvement even when concepts with different spellings are extracted. That said, EDE still achieves comparable or better performance than its unconstrained counterpart under sequence-level evaluation (overall quality) and better preserves input concepts under concept-level evaluation (concept preservation).

Related Work
Hard Constrained Generation. Lexically (hard) constrained generation (LCGen) has been adopted in various applications. One line of LCGen methods involves specialized model designs or training schemes and is thus not easily applicable to generic text-to-text generation (Miao et al., 2019; Zhang et al., 2020b; Sha, 2020). Other methods constrain the search space during decoding and can be plugged into fine-tuned models without additional training (Hokamp and Liu, 2017; Post and Vilar, 2018). The constraints used in existing studies are either from external sources (Hokamp and Liu, 2017) or taken directly from the target output (Post and Vilar, 2018; Zhang et al., 2020b; Sha, 2020). Different from previous settings of LCGen, where the goal is to generate outputs satisfying the specified constraints, our study focuses on how to use lexical constraints to guide generation in more generic tasks. That is, the model can take (or ignore) any constraint as long as the decision is beneficial for generation.

Soft Constrained Generation. Soft constrained generation does not specify lexical constraints explicitly but encourages the model to attend to certain input texts. In text-to-text generation tasks where incorporating important concepts of the source input into the model output is essential, soft constraints are usually realized via attention mechanisms over the source document (See et al., 2017; Dou et al., 2021) or over additional inputs such as keywords (Yao et al., 2019) and external knowledge (Dinan et al., 2019).

Factual Consistency of Text-to-Text Generation. Accurately preserving the important concepts of the source input is critical for many text-to-text generation tasks to ensure factual consistency between the source input and model output.
However, recent studies (Maynez et al., 2020) show that models learned in a seq2seq manner are prone to hallucinating unfaithful or nonfactual information, hindering their applicability to real-world settings. Guiding text-to-text generation with explicit constraints has recently been shown to help alleviate model hallucination and improve factual consistency (Mao et al., 2020c).

Conclusion
In this work, we examine whether current pre-trained language models for text-to-text generation are good enough at preserving important concepts in the source input without explicit guidance, relying on pure seq2seq learning. We conduct extensive analytical experiments on a range of text-to-text generation tasks and study when adding important input concepts as lexical constraints can help guide text-to-text generation. We propose a simple yet effective framework for automatic constraint extraction, denoising, and enforcement, which is shown to perform comparably to or better than unconstrained generation on various text-to-text generation tasks while better preserving important input concepts.

A Implementation Details
We use an NVIDIA RTX 2080 Ti and a Quadro RTX 8000 for training BART-base and BART-large, respectively. The evaluation metrics for each task are consistent with those of existing methods for comparison. For constraint extraction, we completely remove a constraint before the generation process if its score predicted by the keyphrase extraction model is lower than a threshold (tuned to optimize constraint F1 for each task). LCGen is typically slower than standard beam search, but the runtime overhead is acceptable when the beam size is not too large (we set it no larger than 20) and the number of constraints is small (most tasks involve one to two constraints on average).

B Supervised Denoising
In our preliminary experiments, we considered training a supervised classifier to estimate step-level constraint importance during decoding. Specifically, we use the following features to measure constraint importance: two dynamic features based on the token probability in the vocabulary distribution and the token probability in the copy distribution (if available), and two static features, the out-degree and in-degree centrality defined below, based on the transformer self-attention over source tokens.

Source token centrality is based on a directed graph built from the self-attention (Xu et al., 2020). Let G = (V, E) denote a graph representing the encoder self-attention, where the vertices V are the source tokens and E(i, j) is the self-attention from token i to token j, with Σ_i E(i, j) = 1. The out-degree centrality of token i is defined as Σ_j E(i, j), which measures the degree to which token i contributes to other tokens. A transition probability matrix T is defined as T(i, j) = E(i, j) / Σ_j E(i, j), where the self-attention is normalized by the out-degree centrality. The in-degree centrality of token i is then defined as Σ_j T(j, i).

We first run EDE and store the intermediate dynamic features during decoding. Then, gold constraints with a generation probability greater than a threshold at a given decoding step are treated as positive examples. As the features are all scalars, we use a random forest as the classifier for efficiency; other conventional classifiers performed worse in our experiments. The comparison between DDBA (supervised) and DDBA (unsupervised) is shown in Table 8. We observe that the performance of the two variants is rather similar, possibly because there are no step-level labels determining whether it is appropriate to use a constraint at a specific decoding step, and we have to use approximate labels that also depend on the generation probability.
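The two centrality features can be computed directly from the attention matrix. Below is a plain-Python sketch, assuming E is a square nested list where E[i][j] is the self-attention from token i to token j; the function names are our own.

```python
def out_degree(E, i):
    """Out-degree centrality of token i: the sum of attention from i
    to all tokens (row sum of E)."""
    return sum(E[i])

def in_degree(E, i):
    """In-degree centrality of token i: the sum over j of the transition
    probabilities T(j, i) = E[j][i] / out_degree(E, j)."""
    return sum(E[j][i] / out_degree(E, j) for j in range(len(E)))
```

Note that the transition matrix T never needs to be materialized: each entry is computed on the fly by normalizing the attention weight by the source row's out-degree.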
We also explored reinforcement learning with a sequence-level reward (e.g., scores on the evaluation metrics) to bypass the lack of step-level labels. However, reinforcement learning turned out to be unstable and costly to train.

C Human Evaluation
In Table 9, we provide additional results on human evaluation. The results are largely consistent with automatic evaluation: EDE (DDBA) performs significantly better than BART on question and dialog generation, moderately better on headline generation, and comparable on summarization.