Mention Flags (MF): Constraining Transformer-based Text Generators

This paper focuses on Seq2Seq (S2S) constrained text generation where the text generator is constrained to mention specific words, given as inputs to the encoder, in the generated outputs. Pre-trained S2S models or a Copy Mechanism can be trained to copy surface tokens from encoders to decoders, but they cannot guarantee constraint satisfaction. Constrained decoding algorithms always produce hypotheses satisfying all constraints; however, they are computationally expensive and can lower the generated text quality. In this paper, we propose Mention Flags (MF), which trace whether lexical constraints are satisfied in the generated outputs of an S2S decoder. MF models can be trained to keep generating tokens in a hypothesis until all constraints are satisfied, yielding high constraint satisfaction. Our experiments on the Common Sense Generation task (CommonGen) (Lin et al., 2020), the End2end Restaurant Dialog task (E2ENLG) (Dušek et al., 2020) and the Novel Object Captioning task (nocaps) (Agrawal et al., 2019) show that the MF models maintain higher constraint satisfaction and text quality than the baseline models and other constrained decoding algorithms, achieving state-of-the-art performance on all three tasks. These results are achieved with a much lower run-time than constrained decoding algorithms. We also show that the MF models work well in the low-resource setting.


Introduction
This paper focuses on Seq2Seq (S2S) constrained text generation where a set of encoder input tokens are required to be present in the generated outputs. For example, Keyword-to-Text, Data-to-Text (Gardent et al., 2017; Dušek et al., 2020) and Image-to-Text (Lin et al., 2014; Agrawal et al., 2019) tasks require the models to mention all or some of the input keywords, key-value pairs and image object labels (respectively), potentially with linguistic variants, in the generated outputs. Large (pre-trained) Transformer-based S2S models such as T5 (Raffel et al., 2019) can be trained (fine-tuned) to perform this task. However, they only learn to copy surface tokens from encoder inputs to decoder outputs, and there is no underlying mechanism guaranteeing good constraint satisfaction (the ratio of satisfied lexical constraints to given lexical constraints). Constrained Beam Search (CBS) (Anderson et al., 2017) and related algorithms can guarantee outputs satisfying all constraints, but they are much slower than the standard beam search algorithm. In addition, as they are all inference-time algorithms, their underlying models are not aware of the constraint words or phrases, and the resulting generations can be poor. Ideally, a method for producing constrained text should: a) generate high-quality text; b) achieve high constraint satisfaction; c) have an efficient inference procedure.

Figure 1: An overview of the Mention Flag mechanism for Transformer-based S2S models. Here, the tokens flower and bee are required to appear in the generated outputs. Each generated token has a corresponding set of Mention Flags which inform the decoder whether each lexical constraint has been satisfied in the current decoder input sequence. For example, the Mention Flag for flower is set (indicated by orange dots) from the third token because flower is generated at the second step. Both token and Mention Flag embeddings are input to the decoder, but Mention Flags are injected into the decoder in a different way from the tokens (see Fig. 3). Task-specific encoder inputs are omitted for brevity.

The source code for this paper is released at https://github.com/GaryYufei/ACL2021MF
To this end, we propose Mention Flags (MF), which trace whether a lexical constraint has been realized in the partial decoder outputs. Specifically, each decoder input token is provided with a set of flags indicating which constraints have been satisfied up to that token. As shown in Fig. 1, the Mention Flag for flower is set from the third step, because flower is generated at the second step. We represent the three possible Mention Flag values as separate trainable embeddings and inject them into the decoder of the S2S Transformer-based text generator. The dynamic Mention Flags explicitly inform the model about which constraints have been satisfied, which helps the model produce high-quality text satisfying the constraints (Goal a). During training, all Mention Flags are set when the model is tasked to generate the End-of-Sequence (EOS) token, strongly encouraging the model not to stop generation until all constraints are satisfied (Goal b). MF models only require ordinary decoding algorithms; their inference time and memory requirements are similar to those of their baseline models (Goal c).
We conduct experiments on three benchmarks: Commonsense Generative Reasoning (CommonGen) (Lin et al., 2020), where the only input is a set of words representing concepts, and the output text is constrained to include all of them; End-to-End Data-to-Text (E2ENLG) (Dušek et al., 2020), where the constraints are meaning representations with lexicalised attributes and values that the output text should mention; and Novel Object Captioning at scale (nocaps) (Agrawal et al., 2019), where constraints are salient image objects that should be mentioned in the generated caption. Compared to the constrained decoding algorithms, the MF models produce higher-quality text with a similar level of constraint satisfaction and much less inference run-time and memory. Mention Flags are a general mechanism that improves constraint satisfaction in both non-pre-trained and pre-trained S2S Transformer-based models. Furthermore, our experiments show that the MF models can satisfy novel constraints (i.e., involving words or phrases not seen during training) and work well in low-resource settings. Our MF models set a new state-of-the-art on all three tasks.

Background
In this paper, we focus on constraining transformer-based text generation models due to their popularity and success in various domains, especially large-scale pre-trained language models (Raffel et al., 2019; Lewis et al., 2020). Previous work can be roughly categorized into two streams: S2S training approaches and constrained decoding approaches.

Training S2S Models S2S models can implicitly capture the co-occurrence between encoder and decoder sequences, particularly pre-trained ones such as T5 (Raffel et al., 2019) and BART (Lewis et al., 2020). Wen et al. (2015) use a special gate to control what information is generated in subsequent steps. Kale and Rastogi (2020) have shown that fine-tuned T5 models achieve state-of-the-art results in various Data-to-Text tasks that require copying from encoder to decoder. As an alternative, the Copy Mechanism (Gu et al., 2016) explicitly learns where to copy the input constraints into the output by adding an extra copy pathway to the models. However, these approaches cannot control or guarantee constraint satisfaction, and prior work has also observed lower constraint satisfaction for them compared to the constrained decoding approaches.
Constrained Decoding These algorithms, including Constrained Beam Search (CBS) (Anderson et al., 2017) and Grid Beam Search (GBS) (Hokamp and Liu, 2017), maintain a set of states, each with its own size-k beam, and only allow hypotheses satisfying specific constraints to be considered during inference. Each CBS state corresponds to the hypotheses satisfying a different subset of constraints (exponential in the number of constraints), while each GBS state corresponds to the hypotheses satisfying the same number of constraints (linear in the number of constraints). Balakrishnan et al. (2019); Juraska et al. (2018); Dušek and Jurčíček (2016) also modify their inference algorithms in a similar way to fulfill specific output requirements. However, these methods significantly increase inference run-time and memory, and can produce sub-optimal outputs.

Method
This section first formulates constrained text generation tasks, then introduces Mention Flags and their integration with Transformer-based text generators.

S2S Constrained Text Generation
In S2S constrained text generation tasks, we are given encoder inputs $x = [x_1, \ldots, x_{l_x}] \in \mathcal{X}$ that describe the task, where some $x_i$ correspond to lexical constraints that must be satisfied in the generated outputs. At generation step $t$, the decoder takes as input the tokens generated so far, $y_{:t} = [y_1, \cdots, y_t] \in \mathcal{Y}$, and generates the next output token $y_{t+1}$.

Mention Flag
At generation step $t$, a set of Mention Flags indicates whether each lexical constraint has been satisfied up to this step (i.e., in the decoder input sequence $y_{:t}$). Formally, for each encoder input token $x_i$,

$$m(x, y_{:t})_i = \begin{cases} 0 & x_i \text{ does not correspond to a lexical constraint} \\ 1 & x_i \text{ corresponds to a constraint not yet mentioned in } y_{:t} \\ 2 & x_i \text{ corresponds to a constraint mentioned in } y_{:t} \end{cases}$$

The values 1 and 2 represent the status of constraint satisfaction: once $y_{:t}$ satisfies a constraint, the value of the corresponding Mention Flag(s) is updated from 1 to 2. Value 0 is a static default for all tokens $x_i$ that do not correspond to any constraint; these tokens are not required to be mentioned in the outputs and typically act as instructions to the model. At the start, $m(x, \varepsilon) \in \{0, 1\}^{l_x}$, where $\varepsilon$ is the empty string, because the empty string mentions nothing. During generation, $m$ is monotonic: given decoder input sequences $y_{:t}$ and $y_{:(t+1)}$, $m(x, y_{:t})_i \le m(x, y_{:(t+1)})_i$, i.e., the Mention Flag for any token $x_i$ can only remain unchanged or be updated from 1 to 2.
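To make the definition concrete, here is a minimal sketch (our illustration, not the authors' released implementation) of computing $m(x, y_{:t})$ for single-word constraints with exact-match mentions; `constraint_idx` maps each constraint word to its encoder position:

```python
from typing import Dict, List

def mention_flags(encoder_tokens: List[str],
                  constraint_idx: Dict[str, int],
                  decoded: List[str]) -> List[int]:
    """Return m(x, y_{:t}) for every encoder token.

    0 = token is not a constraint, 1 = constraint not yet mentioned,
    2 = constraint already mentioned in the decoded prefix.
    """
    flags = [0] * len(encoder_tokens)
    for word, i in constraint_idx.items():
        flags[i] = 2 if word in decoded else 1  # exact match only
    return flags

# m(x, eps): no constraint can be satisfied by the empty prefix.
x = ["generate:", "flower", "bee"]
print(mention_flags(x, {"flower": 1, "bee": 2}, []))               # [0, 1, 1]
print(mention_flags(x, {"flower": 1, "bee": 2}, ["a", "flower"]))  # [0, 2, 1]
```

Monotonicity holds by construction here, since a word present in a prefix remains present in every extension of that prefix.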
Value Update for Multi-Word Constraints As shown in Figure 2, Mention Flags for tokens corresponding to the same constraint are updated together. Given encoder input tokens $x_i, \cdots, x_j$ forming a multi-word constraint, we require that $m(x, y_*)_i = \cdots = m(x, y_*)_j$ for all (partial) outputs $y_*$, and $m(x, y_{:t})_i = \cdots = m(x, y_{:t})_j = 2$ iff $x_i, \cdots, x_j$ are all mentioned in $y_{:t}$. We use conventions from the relevant dataset to determine whether a constraint is a multi-word constraint. This avoids false updates when the model generates only a prefix of a constraint rather than the full constraint. For example, given the constraint "washing machine", the output could be "I put my washing in the new washing machine." The situation becomes more complicated when both washing and washing machine are given as lexical constraints. In this case, we delay the value-2 update for washing until the word in is generated, which rules out the longer constraint. Modern subword tokenization methods, such as BPE (Sennrich et al., 2016), make this situation frequent.
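A sketch of this delayed update, under the simplifying assumption of whole-word tokens; `longer_overlaps` lists longer constraints sharing a prefix with the current one (function name and signature are ours):

```python
from typing import List

def phrase_flag(phrase: List[str], decoded: List[str],
                longer_overlaps: List[List[str]]) -> int:
    """Flag (1 or 2) shared by every encoder token of a multi-word constraint.

    The update to 2 is delayed while a match could still grow into a longer
    constraint starting with the same tokens (the 'washing' vs 'washing
    machine' case): we only commit once a following token rules the longer
    constraint out.
    """
    n = len(phrase)
    for start in range(len(decoded) - n + 1):
        if decoded[start:start + n] != phrase:
            continue
        # Could this occurrence still be the prefix of a longer constraint?
        ambiguous = any(
            decoded[start:start + len(p)] == p[:len(decoded) - start]
            for p in longer_overlaps
        )
        if not ambiguous:
            return 2
    return 1

decoded = ["i", "put", "my", "washing", "in", "the", "washing", "machine"]
print(phrase_flag(["washing", "machine"], decoded, []))               # 2
print(phrase_flag(["washing"], decoded, [["washing", "machine"]]))    # 2, via the
# first 'washing': the following word 'in' rules out 'washing machine' there.
```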

Definition of Mentions
We deliberately allow a flexible notion of "mention" in the function $m()$. Different types of mentions can be defined to fulfill the requirements of different applications and tasks; with this flexibility, end-users can apply Mention Flags in many constraint scenarios. For tasks with strict constraints, we define a mention to be an exact string match in $y_{:t}$. Otherwise, inflectional variants or synonyms of the words in a lexical constraint are also accepted when checking for mentions. The Mention Flag mechanism thus supports lexical constraints with multiple verbalizations. We leave more sophisticated constraints (e.g., using NLP parsers) to future work.
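The mention test can thus be a pluggable predicate. A minimal sketch, assuming a stand-in lemmatizer (a real system might use spaCy or NLTK):

```python
from typing import Callable, List

MentionFn = Callable[[str, List[str]], bool]

def exact_match(constraint: str, decoded: List[str]) -> bool:
    """Strict tasks: the constraint must appear verbatim."""
    return constraint in decoded

def make_variant_match(lemma: Callable[[str], str]) -> MentionFn:
    """Looser tasks: any inflectional variant counts as a mention.
    `lemma` is a stand-in for a real lemmatizer (spaCy, NLTK, ...)."""
    def match(constraint: str, decoded: List[str]) -> bool:
        return lemma(constraint) in {lemma(w) for w in decoded}
    return match

# Toy lemmatizer that is good enough for this example only.
toy_lemma = lambda w: w.rstrip("s")
variant_match = make_variant_match(toy_lemma)
print(exact_match("flower", ["two", "flowers"]))    # False
print(variant_match("flower", ["two", "flowers"]))  # True
```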
Mention Flag Matrix Given $x$ and $y_{:t}$, we define the two-dimensional Mention Flag matrix $F \in \{0, 1, 2\}^{l_x \times t}$ as $F_{i,j} = m(x, y_{:j})_i$, i.e., column $j$ holds the flags visible to the $j$-th decoder input token. During training, given $x$ and the ground-truth output $Y^{gt}$ (with $l_{gt}$ tokens), we construct the ground-truth Mention Flag matrix $F^{gt} \in \{0, 1, 2\}^{l_x \times l_{gt}}$ by finding the positions in $Y^{gt}$ where the tokens of each lexical constraint are mentioned. $F^{gt}$ follows the same masking strategy as the decoder input tokens $y_{:t}$. Tokens whose corresponding lexical constraints have no alignment with $Y^{gt}$ are also assigned value 0. During inference, we build the Mention Flag matrix incrementally, starting from $m(x, \varepsilon)$ and appending one column per generated token.

Why Mention Flags work During the training of MF models, the ground-truth always has all MFs set to "completed" before generation stops (i.e., before the EOS token is generated). This provides a strong signal to satisfy all constraints before completing generation. The value update from 1 to 2 provides an implicit signal about where the constraints are satisfied during training; without it, the model would have to learn this information from co-occurring sub-sequences between input and output. These two signals allow the model to achieve high constraint satisfaction while maintaining high text quality (Sec. 4.5). Since only 3 embeddings are added, learning them does not require a substantial amount of training data (Sec. 4.7). And since these embeddings are independent of particular lexical constraints, we expect improved performance on novel constraints not seen during training (Sec. 4.5).
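Returning to the incremental construction at inference time, a minimal sketch (our illustration; the monotonicity of $m$ guarantees that appended columns never decrease any flag):

```python
from typing import List

def extend_mf_matrix(F: List[List[int]],
                     new_flags: List[int]) -> List[List[int]]:
    """Append the column for the latest decoder step.

    F has l_x rows (encoder tokens) and t columns (decoder steps);
    `new_flags` is m(x, y_{:t+1}) computed after the new token, so
    F[i][t] <= F[i][t+1] holds by construction.
    """
    assert len(new_flags) == len(F)
    for i, row in enumerate(F):
        assert not row or row[-1] <= new_flags[i], "flags never decrease"
        row.append(new_flags[i])
    return F

# Start from m(x, eps) as the first column, then grow step by step.
F = [[0], [1], [1]]                 # x = [prompt, flower, bee]
F = extend_mf_matrix(F, [0, 1, 1])  # generated "a"
F = extend_mf_matrix(F, [0, 2, 1])  # generated "flower"
print(F)  # [[0, 0, 0], [1, 1, 2], [1, 1, 1]]
```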

Integration with S2S Transformer
As shown in Figure 3, Mention Flags are injected into the Transformer decoder. We first review the standard S2S Transformer proposed in Vaswani et al. (2017), then discuss how to inject Mention Flag information into it.

Standard S2S Transformer Model
The encoder input tokens $x$ are fed into the Transformer encoder, $h^e = \mathrm{Enc}(x)$, where $h^e \in \mathbb{R}^{l_x \times d}$ and $d$ is the model hidden size. The Transformer decoder has two attention modules: Self Multi-Head Attention (SA), which handles the current decoder input sequence $y_{:t}$, and Cross Multi-Head Attention (CA), which handles the interaction between the encoder output $h^e$ and $y_{:t}$:

$$h^d_t = \mathrm{SA}(y_{:t}) = \mathrm{KV}(y_{:t}, y_{:t}), \qquad \mathrm{CA}(h^d_t, h^e) = \mathrm{KV}(h^d_t, h^e)$$

where $\mathrm{KV}(\cdot, \cdot)$ is the standard key-value attention proposed in Vaswani et al. (2017). The output of $\mathrm{CA}(h^d_t, h^e)$ then determines the model output $y_{t+1}$ via a feed-forward layer, a residual connection and a softmax layer.
Incorporating the Mention Flag Matrix Our two-dimensional Mention Flag matrix $F \in \{0, 1, 2\}^{l_x \times t}$ relates elements of the encoder output $h^e$ to the current decoder input $y_{:t}$. Ideally, we would incorporate the full $F$ matrix into a single component of the Transformer decoder. We note that the CA module already uses $y_{:t}$ as query and $h^e$ as key; the resulting query-key similarity matrix has the same size as our Mention Flag matrix, making CA a natural place to incorporate $F$.

Mention Flag Matrix as Relative Position
Inspired by Shaw et al. (2018), which incorporates token relative positions into the SA module, we propose to inject Mention Flags as "relative positions" between the encoder output $h^e$ and the current decoder input $y_{:t}$ in the CA module. In each decoder layer, we represent $F$ as two sets of trainable embeddings, a Mention Flag key $m^k = E^k(F)$ and a Mention Flag value $m^v = E^v(F)$, where $E^k$ and $E^v$ are the Mention Flag embedding tables and $m^k, m^v \in \mathbb{R}^{l_x \times t \times d}$. We use separate Mention Flag representations for each decoder layer. The CA computation above is changed to

$$\mathrm{CA}(h^d_t, h^e) = R(h^d_t, h^e, m^k, m^v)$$

where $R$ is the attention function with relative positions, defined following Shaw et al. (2018) as

$$e_{i,j} = \frac{(h^d_i W^Q)(h^e_j W^K + m^k_{j,i})^\top}{\sqrt{d}}, \qquad \alpha_{i,j} = \mathrm{softmax}_j(e_{i,j}), \qquad z_i = \sum_j \alpha_{i,j} \, (h^e_j W^V + m^v_{j,i})$$

As an alternative to representing $F$ as $m^k$ and $m^v$, we could follow the approach to relative positions in the T5 model (Raffel et al., 2019) and represent $F$ as scalars added to the corresponding logits $e_{i,j}$ used for computing the attention weights. However, we find this scalar approach less effective than our proposed one (Sec. 4.6).
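A single-head PyTorch sketch of this MF-augmented cross-attention (the class name and the single-head simplification are ours; the released model is multi-head with per-layer embedding tables):

```python
import torch
import torch.nn as nn

class MFCrossAttention(nn.Module):
    """Single-head cross-attention with Mention Flags as relative positions.

    F in {0,1,2}^(l_x x t) contributes a key offset m^k and a value offset
    m^v, following Shaw et al. (2018).
    """
    def __init__(self, d: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d, d) for _ in range(3))
        self.mf_key = nn.Embedding(3, d)    # E^k: one row per flag value
        self.mf_value = nn.Embedding(3, d)  # E^v
        self.d = d

    def forward(self, h_dec, h_enc, F):
        # h_dec: (t, d) decoder states; h_enc: (l_x, d); F: (l_x, t)
        q = self.q(h_dec)                       # (t, d)
        k = self.k(h_enc)                       # (l_x, d)
        v = self.v(h_enc)                       # (l_x, d)
        m_k = self.mf_key(F).transpose(0, 1)    # (t, l_x, d)
        m_v = self.mf_value(F).transpose(0, 1)  # (t, l_x, d)
        # e_{i,j} = q_i . (k_j + m^k_{j,i}) / sqrt(d)
        e = torch.einsum('id,ijd->ij', q, k.unsqueeze(0) + m_k) / self.d ** 0.5
        a = torch.softmax(e, dim=-1)            # (t, l_x)
        # z_i = sum_j a_{i,j} (v_j + m^v_{j,i})
        return torch.einsum('ij,ijd->id', a, v.unsqueeze(0) + m_v)

layer = MFCrossAttention(d=16)
z = layer(torch.randn(4, 16), torch.randn(3, 16), torch.randint(0, 3, (3, 4)))
print(z.shape)  # torch.Size([4, 16])
```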

Experiments
We conduct experiments on three benchmarks with different forms of constraints: Commonsense Generative Reasoning (CommonGen) (Lin et al., 2020) with keyword constraints, End-to-End restaurant dialog (E2ENLG) (Dušek et al., 2020) with key-value constraints, and Novel Object Captioning at scale (nocaps) (Agrawal et al., 2019) with visual object word constraints. For each task, we integrate Mention Flags with a three-layer standard S2S Transformer model (Trans, L3) (Vaswani et al., 2017) and with pre-trained T5 models (Raffel et al., 2019). The T5 models achieve state-of-the-art results in various Data-to-Text tasks (Kale and Rastogi, 2020). For the T5-Base and T5-Large models, we use the implementation in the Hugging Face transformers library. The Trans, L3 model shares the implementation of T5-Base, except that it is not initialized with the pre-trained parameters and uses 3 layers, rather than 12, for both encoder and decoder. In addition, to improve the generalization of our pre-trained models, we freeze the parameters of the Self-Attention and Feed-Forward modules in each layer of the T5 decoder. This parameter-freezing technique is applied to both the T5 baseline models and the MF models in all of our experiments. We report constraint satisfaction for all tasks. We use GBS in the CommonGen task (max 5 constraints) and CBS in the E2ENLG (max 1 constraint) and nocaps (max 2 constraints) tasks.
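The parameter freezing described above might look as follows with the Hugging Face transformers API; that each decoder block stores its self-attention, cross-attention and feed-forward sublayers in `block.layer` is a property of the library, but whether this matches the authors' exact freezing code is our assumption:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")

# In HF T5, each decoder block holds [self-attention, cross-attention,
# feed-forward] sublayers in block.layer. Freeze the first and last,
# leaving cross-attention (where Mention Flags are injected) trainable.
for block in model.decoder.block:
    for sublayer in (block.layer[0], block.layer[-1]):
        for p in sublayer.parameters():
            p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```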

CommonGen
In this task, the encoder input is a sequence of concepts $C = [c_1, \cdots, c_k]$, $k \le 5$. The models should generate a coherent sentence describing all concepts in $C$. Here $m(C, \varepsilon) = [1, 1, \cdots, 1]$ and $m$ allows inflectional variants to satisfy lexical constraints. We train (fine-tune) the Trans, L3, T5-Base and T5-Large models as our baselines, and apply Mention Flags to the T5-Base and T5-Large models (+ MF). Following the suggestions in Lin et al. (2020), we report CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016) as generated-text quality metrics. We calculate constraint satisfaction for all constraints (ALL), novel constraints (Novel) and seen constraints (Seen). In Table 1, bold marks the highest score and underline the second highest.
Results Table 1 shows that the MF models improve constraint satisfaction over the baselines in all cases, achieving close to 100% (i.e., 99.6% and 99.1%). Notably, Mention Flags improve novel constraint satisfaction from 2.3% to 49.2% in the randomly initialized Transformer models. Compared to the LevenTrans (Gu et al., 2019) and ConstLeven (Susanto et al., 2020) models, our Trans, L3 + MF model achieves higher CIDEr and SPICE scores, with constraint satisfaction only 4.1% lower than the non-autoregressive ConstLeven model. While GBS guarantees maximal constraint satisfaction (i.e., 100%), doing so significantly degrades output text quality (by more than 50 CIDEr). Our MF models achieve near-optimal constraint satisfaction while improving text quality (a 5.7 CIDEr improvement for T5-Base and 6.5 for T5-Large). Finally, our T5-Large + MF model outperforms the previous state-of-the-art (Liu et al., 2021), which integrates ConceptNet (Speer et al., 2017) into the BART model, by 6.5 CIDEr and 0.7 SPICE, suggesting that pre-trained language models with textual concepts may provide sufficient information for this task.

E2ENLG
In this task, the encoder input is a sequence of key-value meaning representations $C = [k_1, v_1, \cdots, k_n, v_n]$, $n \le 8$. We list all given key-value pairs as a space-separated string. Here $m(C, \varepsilon) = [0, 1, 0, 1, \cdots, 0, 1]$ and $m$ allows synonyms to satisfy lexical constraints; for example, welcome children and is family friendly are both mentions of familyFriendly[yes]. The models must generate a fluent and coherent dialog response using all key-value pairs in the encoder input. E2ENLG includes 79 different in-domain key-value constraints; we use the scripts from Dušek et al. (2019) to construct the synonym sets for these inputs. We use the Trans, L3 and T5-Base models as our baselines, and use CBS to constrain the T5 model to satisfy all missing constraints (T5-Base + C). We report NIST (Lin and Hovy, 2003), BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), as they are common metrics for evaluating the quality of the long texts in E2ENLG outputs (more than 20 tokens).
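A minimal sketch of this linearization and the initial flag pattern (the function name is ours; real values may span multiple subword tokens, which the multi-word update described earlier handles):

```python
from typing import Dict, List, Tuple

def linearize_mr(mr: Dict[str, str]) -> Tuple[List[str], List[int]]:
    """Flatten an E2E-style MR into encoder tokens plus initial Mention Flags.

    Keys act as instructions (flag 0); values are the lexical
    constraints that must be mentioned (flag 1).
    """
    tokens, flags = [], []
    for key, value in mr.items():
        tokens.append(key);   flags.append(0)
        tokens.append(value); flags.append(1)
    return tokens, flags

mr = {"name": "Punter", "familyFriendly": "yes", "priceRange": "£20-25"}
print(linearize_mr(mr))
# (['name', 'Punter', 'familyFriendly', 'yes', 'priceRange', '£20-25'],
#  [0, 1, 0, 1, 0, 1])
```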
Results Table 2 shows that the MF models consistently achieve higher output text quality and constraint satisfaction than the baseline models.

nocaps
Using T5 for Image Captioning In image captioning, each input image is represented by a sequence of visual objects, each assigned a textual label by the object detector. The encoder input is a sequence of object features followed by their textual labels, $C = [v_1^1, \cdots, v_1^{s_1}, l_1, \cdots, v_k^1, \cdots, v_k^{s_k}, l_k]$, where $v_i^*$ are visual feature vectors (similar to those in Li et al. (2020)) and $l_i$ is the corresponding textual label. The visual features are used in the same way as ordinary textual tokens in the T5 models. We find this approach works well for both nocaps and the standard COCO image captioning task.
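A sketch of assembling this interleaved encoder input, assuming the visual features are already projected to the model dimension (the helper name is ours):

```python
import torch

def build_caption_input(obj_feats, label_embeds):
    """Interleave per-object visual features with their label embeddings:
    [v_1^1..v_1^{s_1}, l_1, ..., v_k^1..v_k^{s_k}, l_k]. Visual vectors are
    consumed by the model exactly like token embeddings (a simplifying
    assumption here: everything is already in the model dimension d)."""
    parts = []
    for feats, label in zip(obj_feats, label_embeds):
        parts.append(feats)               # (s_i, d) region features
        parts.append(label.unsqueeze(0))  # (1, d) label embedding
    return torch.cat(parts, dim=0)

d = 8
feats = [torch.randn(3, d), torch.randn(2, d)]   # two objects, 3 and 2 regions
labels = [torch.randn(d), torch.randn(d)]
print(build_caption_input(feats, labels).shape)  # torch.Size([7, 8])
```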
Experiment Setup Traditional image captioning models select and describe a subset of input objects jointly (Anderson et al., 2018). However, Puduppully et al. (2019) show the benefits of separating content selection and text planning for general data-to-text tasks. Following this, we first select salient objects and then incorporate them into the description using Mention Flags: $m(C, \varepsilon) = [0, 0, \cdots, 1, \cdots, 0, 0, \cdots, 1]$, where only salient object labels receive value 1, and $m()$ allows inflectional variants to satisfy lexical constraints. We use the T5-Base model in this experiment. The T5 + C and T5 + MF + C models are constrained with CBS. Following Wang et al. (2021), we report CIDEr and SPICE as output text quality metrics, and constraint satisfaction for novel constraints (Novel) and all constraints (ALL). We present performance on all evaluation images (Overall) and on the challenging images containing only novel objects (out-of-domain split).

Salient Object Selector
We use a transformer-based salient object detector to select a subset of object labels as lexical constraints. The visual representations of the detected image objects are fed into a 3-layer standard Transformer model without any positional embeddings, and we train this detector with a binary cross-entropy loss averaged over all detected input objects (a sketch follows at the end of this subsection). The training data for salient object detection is the nocaps training data; we use the COCO 2017 Dev set as the evaluation set to select the best checkpoint.

Results Mention Flags achieve optimal constraint satisfaction in almost all cases. In particular, the Trans, L3 + MF model shows a marked improvement (from 16.3% to 49.3%) on novel constraints, despite the corresponding token embeddings never being updated from their random initialisation. The generated text quality is also improved, particularly in the out-of-domain split. The T5 + C model is 0.3 SPICE lower than the T5 + MF model both overall and in the out-of-domain split, indicating that the MF model correctly captures more long-range relationships (calculated via the parse trees used in SPICE) among the (novel) objects than CBS. Our T5 + MF model outperforms the existing state-of-the-art end-to-end single-stage image captioning systems (Agrawal et al., 2019; Li et al., 2020; Wang et al., 2021). We also examine why constraint satisfaction here falls slightly short of that of the MF models in the other two tasks (99.5+%), and find that the missing cases frequently involve instances with two constraints related by a) (near-)synonymy (e.g., mule and horse) or b) hyponymy (e.g., hot dog and fast food). A more advanced salient object detector would address this issue.
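A sketch of the salient object selector described above (hyperparameters such as hidden size and head count are our assumptions):

```python
import torch
import torch.nn as nn

class SalientObjectSelector(nn.Module):
    """3-layer Transformer encoder over detected-object features, with no
    positional embeddings (object order is arbitrary), scoring each object
    with a binary saliency logit."""
    def __init__(self, d: int = 256, layers: int = 3):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.score = nn.Linear(d, 1)

    def forward(self, obj_feats):            # (batch, n_objects, d)
        return self.score(self.encoder(obj_feats)).squeeze(-1)

model = SalientObjectSelector()
feats = torch.randn(2, 10, 256)              # 2 images, 10 objects each
labels = torch.randint(0, 2, (2, 10)).float()
# Binary cross-entropy averaged over all detected input objects.
loss = nn.BCEWithLogitsLoss()(model(feats), labels)
loss.backward()
print(float(loss))
```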

Model Efficiency
The MF models use standard beam search and run much faster, with less memory, than the constrained beam search algorithms. For comparison we select the GBS algorithm, because its resource use is linear in the number of constraints and it requires less run time and memory than CBS. We run the MF models and the models with GBS using beam size 5 and compare their run time (RT) and memory requirement (#M) in Table 4.

Main Result Discussion
Constraint Satisfaction & Text Quality In all tasks, the MF models improve text quality over their baselines (and over CBS and GBS) while achieving constraint satisfaction close to 100%. This supports the claim in Sec. 3.2 that the training signals from Mention Flags help improve both constraint satisfaction and text quality.
Non-Pre-trained vs. Pre-trained Models In all tasks, Mention Flags have a similar effect (higher text quality and constraint satisfaction) on both non-pre-trained and pre-trained models. This indicates that Mention Flags do not rely on information from pre-trained models to be effective.

Novel Constraints
In the CommonGen and nocaps tasks, the Trans, L3 + MF model achieves much higher coverage (i.e., 2.3% to 49.2% in CommonGen; 16.3% to 49.3% in nocaps) for constraints with novel lexical items than the baseline models. The MF models can thus satisfy novel constraints even when the corresponding token representations received no training signal. Because Mention Flags are decoupled from the token representations, the MF models learn lexicon-independent indicators for mentioning novel words.

Design Choices for Mention Flags
We conduct experiments on several design choices for Mention Flags. Representing Mention Flags as embedding vectors works better than as scalars added to the attention logits (Sec. 3.3). Finally, sharing MF embeddings across all decoder layers has a negative impact (e.g., the constraint satisfaction ratio drops) in all three tasks.

Low-Resource Learning
This section shows that Mention Flags remain useful for improving constraint satisfaction and generated text quality when trained with far fewer instances. We use 0.1%, 1% and 10% of the original training instances to train the models. In the first two tasks (E2ENLG and CommonGen), we compare the MF models with the T5-Base models. In the nocaps task, we additionally compare the T5-Base + MF model with the T5-Base + C model. We report BLEU for E2ENLG and CIDEr for CommonGen and nocaps. As shown in Table 6, the MF models consistently generate higher-quality text (higher BLEU or CIDEr scores) and achieve higher constraint satisfaction than the baseline models. The MF models reach 97+% constraint satisfaction when trained with only 10% of the E2ENLG and CommonGen training data. This confirms our claim in Sec. 3.2 that the three added Mention Flag embeddings can be learned with relatively little training data.

Qualitative Analysis
We chose three representative examples that illustrate successful use of Mention Flags (Table 7).

i) E2ENLG
T5-B: Punter is a restaurant in the £20-25 price range. It is in the riverside area
+ C: Punter is a kid friendly restaurant in the riverside area. It has a price range of £20-25.
+ MF: Punter is a kid friendly restaurant in riverside with a price range of £20-25

ii) CommonGen (concepts: mother, washer, clothes, toddler, help)
T5-B: a mother helps a toddler to wash his clothes
+ G: mother helping her toddler clothe in washer
+ MF: a mother helps a toddler to wash clothes in the washer
GT: the mother helps her toddler put the clothes in the washer

iii) nocaps (salient objects: bee, flower; non-salient objects: plant, leaf)
T5-B: a close up of a flower on a tree
+ C: a close up of a bee flower on a tree
+ MF: a small white flower with a bee in it
GT: a white flower has a bee on it with green around.

i) The MF model generates the most concise dialogue response, compared to the baseline and the constrained decoding model; ii) the MF model is the only model that generates a fluent and coherent sentence satisfying all input constraints; iii) the MF model is the only model that accurately describes the relationship between bee and flower, grounded in the input image and constraints.

Human Evaluation
We have shown that our proposed MF models achieve a higher constraint satisfaction ratio and higher automatic metric scores. However, automatic metrics do not necessarily reflect human preferences over the generated text. We therefore select 100 output samples from the T5 baseline and our MF model in each of the three tasks (300 in total). For each sample pair, we ask three annotators to judge which sample is "more human-like". Table 8 shows that more than 70% of the MF model's outputs are judged better than or similar to the baseline model's outputs, verifying the output quality of our MF model.

Conclusion and Future Work
In this paper, we propose Mention Flags, which constrain Transformer-based text generators by injecting mention-status embeddings into the decoder. Extensive experiments on three different tasks show that Mention Flags maintain high generated text quality with excellent constraint satisfaction, comparing favourably to competitive constrained decoding algorithms. We plan to extend Mention Flags i) to settings with larger input source texts, such as constrained text summarization and machine translation; and ii) to constraints of larger granularity, such as sentence-level constraints.