Enconter: Entity Constrained Progressive Sequence Generation via Insertion-based Transformer

Pretrained on large amounts of data, autoregressive language models are able to generate high-quality sequences. However, these models do not perform well under hard lexical constraints as they lack fine control over the content generation process. Progressive insertion-based transformers can overcome this limitation and efficiently generate a sequence in parallel given some input tokens as constraints. These transformers, however, may fail to support hard lexical constraints as their generation process is more likely to terminate prematurely. This paper analyses such early termination problems and proposes the ENtity CONstrained insertion TransformER (ENCONTER), a new insertion transformer that addresses the above pitfall without compromising much generation efficiency. We introduce a new training strategy that considers predefined hard lexical constraints (e.g., entities to be included in the generated sequence). Our experiments show that ENCONTER outperforms other baseline models in several performance metrics, rendering it more suitable for practical applications.


Introduction
The field of Natural Language Generation (NLG) (Gatt and Krahmer, 2018) has seen significant improvements in recent years across many applications such as neural machine translation (Bahdanau et al., 2015), text summarization (Chopra et al., 2016), poem generation (Zugarini et al., 2019) and recipe generation (H. Lee et al., 2020). Constrained text generation (CTG) is one of the challenging problems in NLG that is important to many real-world applications but has not been well addressed. CTG imposes input constraints which may be in the form of objects expected to exist in the generated text or rules over objects in the generated text (Hokamp and Liu, 2017). The objects here can be entities, phrases, predefined nouns, verbs, or sentence fragments. The constraints can be categorized into two types: (1) Hard-constraints, which require mandatory inclusion of certain objects and complete compliance with given rules (Post and Vilar, 2018; Miao et al., 2019; Welleck et al., 2019; Zhang et al., 2020); and (2) Soft-constraints, which allow some constraint objects or rules to be not strictly enforced in the generated text (Qin et al., 2019; Tang et al., 2019). As autoregressive models generate tokens from left to right, they cannot easily support constraints involving multiple input objects; hard-constrained text generation therefore often requires non-autoregressive models.
Recently, Zhang et al. (2020) proposed a non-autoregressive hard-constrained text generation model (POINTER) that generates a text sequence in a progressive manner using an insertion transformer. To train an insertion transformer to generate a missing token between every two tokens in an input sequence, the training data is prepared by masking "less important" tokens in the original text sequence in an alternating manner. The process is then repeated using the masked input sequence as the new original sequence, further masking alternate tokens in it. The process ends when the masked sequence meets some length criteria.
While POINTER shows promising results, it does not consider hard constraints which involve entities that must be included in the generated sequence. Such entity constraint requirements are unfortunately prevalent in many applications. For example, we may want to generate a job description with some given skills, or a food recipe with some given ingredients.
A naive approach to the problem is to apply constraints on POINTER's masking strategy, forcing it to keep entity tokens. We call this modified model POINTER-E. Although this allows entity information to enter POINTER-E, another problem arises. POINTER-E suffers from a cold start problem, which refers to the inability to generate meaningful tokens at the early stages of inference, forcing the generation to end prematurely. This issue can be attributed to POINTER-E's top-down masking strategy for training the insertion transformer and to the tokens of input entities not being evenly spread out across the sequence.
To solve the cold start generation problem, we propose ENCONTER, which incorporates a bottom-up masking strategy. ENCONTER supports hard entity constraints and encourages more meaningful tokens to be generated in the early stages of generation, thus reducing cold start. On top of that, we further introduce a balanced binary tree scheme to reduce the number of stages in generation and to improve generation efficiency.

Entity Constrained Sequence Generation
In this section, we first describe the state-of-the-art POINTER model, its preprocessing of training data, and its inference process. We highlight the pitfalls of the entity-constrained variant of POINTER, POINTER-E. We then present our proposed entity-constrained insertion transformer called ENCONTER.

POINTER
POINTER adopts a progressive masking approach to train an insertion transformer. Let X = {x_1, x_2, ..., x_T} denote a sequence where x_t ∈ V, T is the sequence length, and V is a finite vocabulary set. Suppose X is a training sequence; POINTER preprocesses it to obtain the training pairs S = {(X_k, Y_k)}, k ∈ {K, ..., 0}, using a progressive masking strategy. As shown in Figure 1a, X_k represents the input sequence for stage k, and Y_k represents the sequence of masked tokens to be inferred. X_K is identical to the full training sequence (X_K = X), at which point there are no additional tokens to infer. X_0, on the other hand, represents the initial lexical constraints. In stage k, Y_k contains the tokens to be predicted between adjacent tokens of X_k. A special no-insertion token [NOI] is added to the vocabulary V and used in Y_k to indicate that no token is to be generated between two adjacent tokens. (In Figure 1, B, D, and F are the tokens forming the entity constraints, and the stopping criterion for POINTER is set to n = 3.) Y_K is thus a sequence of all [NOI]'s, indicating the end of generation. WordPiece (Wu et al., 2016) tokenization is applied in POINTER, and tokens split from the same word share the same score.
Token importance scoring POINTER assigns each token x_t ∈ X an importance score α_t = α_t^{TF-IDF} + α_t^{POS} + α_t^{YAKE}, where α_t^{TF-IDF}, α_t^{POS}, and α_t^{YAKE} denote the term frequency-inverse document frequency (TF-IDF), POS tag, and YAKE (Campos et al., 2020) keyword scores, respectively. These scores are normalized to [0, 1]. α_t^{POS} is defined such that the scores of nouns and verbs are higher than those of other POS tags. The token importance scores are used to derive the masking pattern Y_{k-1} of stage k-1 from X_k.
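As a concrete sketch, the combination of normalized component scores might look like the following. The component scorers themselves (TF-IDF, POS, YAKE) are stood in by precomputed lists, and the additive form is our reading of POINTER's scoring, not a verbatim reimplementation:

```python
def normalize(scores):
    # Min-max normalize a list of raw scores to [0, 1].
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def importance(tfidf, pos, yake):
    # alpha_t = normalized TF-IDF + POS + YAKE scores (assumed additive form);
    # each argument is one precomputed score per token of the sequence.
    return [a + b + c for a, b, c in
            zip(normalize(tfidf), normalize(pos), normalize(yake))]
```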
POINTER adopts four criteria to derive Y_{k-1} from X_k: (1) Y_{k-1} can only include non-adjacent tokens in X_k; (2) the number of tokens to be masked is maximized in each stage to make the model more efficient; (3) less important tokens are masked before more important ones; and (4) a stopping criterion n is defined, and the algorithm stops when |X_k| = n. Kadane's algorithm (Gries, 1982) is used in POINTER to fulfill these criteria. Specifically, the algorithm selects as many unimportant tokens as possible to be masked while never masking two adjacent tokens. As X_0 is automatically determined when |X_k| = n, it does not necessarily match the way the initial input sequence is provided by real-world applications or users, including the entity constraints.

Inference Given X_0 as the input sequence, POINTER infers Ŷ_0 and combines the two sequences to get X_1. If any inferred token in Ŷ_0 happens to be [NOI], it is deleted, leaving only non-[NOI] tokens in X_1. The process repeats until all the generated tokens in Ŷ_k are [NOI]s.
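The stage-wise selection can be sketched as a small dynamic program that maximizes the number of masked positions (criterion 2) subject to non-adjacency (criterion 1), breaking ties toward lower total importance (criterion 3). The paper credits Kadane's algorithm, so treat this as an illustrative equivalent rather than the authors' implementation:

```python
def mask_step(scores):
    """One top-down masking stage (hedged sketch): pick a maximum-size set of
    non-adjacent positions, preferring lower total importance among ties."""
    n = len(scores)
    # dp[i]: best (num_masked, total_importance, positions) over scores[:i]
    dp = [(0, 0.0, ())] * (n + 1)
    for i in range(1, n + 1):
        skip = dp[i - 1]
        prev = dp[i - 2] if i >= 2 else (0, 0.0, ())
        take = (prev[0] + 1, prev[1] + scores[i - 1], prev[2] + (i - 1,))
        # More masked tokens first; among equals, lower total importance.
        dp[i] = max(skip, take, key=lambda s: (s[0], -s[1]))
    return list(dp[n][2])
```

For example, given importance scores [0.9, 0.1, 0.8, 0.2], both {0, 2} and {1, 3} are maximal non-adjacent sets, and the tie-break selects the less important pair {1, 3}.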
As shown in Figure 1a, entities may not be preserved during the preprocessing steps, and the lexical constraint X_0 is not guaranteed to cover the entity constraint X_e even when entity tokens are assigned high importance scores. The trained POINTER therefore may not be able to generate a sequence successfully when given entity constraints during inference. We therefore propose some changes to POINTER to make it entity-aware.

Entity Aware POINTER (POINTER-E)
The entity-aware POINTER model, POINTER-E, adopts a different preprocessing approach. Let X_e ⊂ X be an ordered sequence of entity tokens (e.g., the person names in a news document). As X_e is likely to be used as the initial generation input (i.e., X_0 = X_e), POINTER-E's preprocessing does not mask these entity tokens over the different preprocessing stages. This way, the model is trained to focus on generating tokens around the entities. Such tokens form the context around each entity and the context relating one entity to another. We achieve this goal by ignoring the importance scores of entity tokens; that is, we only compute α_t for x_t ∉ X_e. We then apply POINTER's masking strategy on the subsequence between every two entity tokens x_{e_l} and x_{e_{l+1}}. Masking is applied on this subsequence iteratively until only {x_{e_l}, x_{e_{l+1}}} are left. As shown in Figure 1b, POINTER-E always picks the optimal masking patterns while preserving the entities.
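The entity-preserving restriction can be sketched as follows. For simplicity this sketch uses a greedy pass in ascending importance order rather than the optimal selection POINTER derives with Kadane's algorithm, so it is an illustration of the constraint, not the authors' procedure:

```python
def pointer_e_mask(scores, entity_pos):
    """One POINTER-E masking stage (hedged sketch): mask non-adjacent,
    non-entity positions, taking less important tokens first."""
    protected = set(entity_pos)   # entity tokens are never masked
    chosen = set()
    for p in sorted((i for i in range(len(scores)) if i not in protected),
                    key=lambda i: scores[i]):
        # Enforce non-adjacency among masked positions.
        if p - 1 not in chosen and p + 1 not in chosen:
            chosen.add(p)
    return sorted(chosen)
```

With scores [0.5, 0.2, 0.9, 0.1, 0.4] and an entity at position 2, the least important tokens at positions 3 and 1 are masked while the entity and its masked tokens' neighbours survive to the next stage.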
Cold Start Problem While POINTER-E is aware of entities, entities in X_e may appear very close to or very far from one another in the full sequence X, i.e., the gap between entities in X can vary a lot. Consider two pairs of adjacent entities (x_i, x_j) and (x_u, x_v), where x_u and x_v are much closer to each other than x_i and x_j. The tokens between (x_u, x_v) will then be masked out long before the tokens between (x_i, x_j) during preprocessing and training. This results in POINTER-E being trained to generate a lot of [NOI]s in Y_k for small k's. Figure 1b depicts this cold start problem, as the entity tokens B, D and E are near one another in X. As the tokens between them are masked in the early stages, the masked sequences of stages 0 and 1, Y_0 and Y_1, contain many [NOI] tokens. POINTER-E trained with such data will therefore lack the ability to generate meaningful tokens in between these entity tokens. In the worst case, POINTER-E simply generates all [NOI] tokens and ends the generation prematurely, which is known as the cold start problem.
To better show the problem, we define the NOI ratio of a stage as the proportion of [NOI] tokens in Y_k, i.e., NOI ratio = #{[NOI] tokens in Y_k} / |Y_k|. A clear problem of a high NOI ratio is that Y_k is very similar to Y_{k+1}. When the NOI ratio equals 1, the generation ends. In cases where the NOI ratio is very high for masked sequences in early stages, say Y_0, the trained POINTER-E will more likely infer all [NOI]'s for Ŷ_0 from X_0 and end the generation process. To address this, we need to re-examine the top-down masking strategy used in POINTER and POINTER-E.
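The NOI ratio as described reduces to a one-line computation:

```python
def noi_ratio(y_k):
    # Proportion of no-insertion tokens in a stage's target sequence Y_k.
    return y_k.count("[NOI]") / len(y_k)
```

A stage with ratio 1.0 contributes nothing new, and a model trained mostly on such stages learns to stop early.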

ENCONTER
In this section, we propose ENCONTER, which adopts a bottom-up masking strategy to overcome the cold start problem. There are two variants: GREEDY ENCONTER and BBT-ENCONTER.

GREEDY ENCONTER Different from POINTER-E, we now construct the training pairs S from X bottom-up by setting X_0 to be X_e, where Y_k represents the sequence of masked tokens to be inserted into X_k to form X_{k+1}. Similar to POINTER, a [NOI] token is used when no token is to be inserted between two adjacent tokens x_i and x_j of X_k. Otherwise, we select the token x_t with the maximum importance score α_t within (x_i, x_j) as the mask token. The sequence Y_k is formed after we go through all the adjacent token pairs. By inserting Y_k into X_k, we obtain the next sequence X_{k+1}. The iterative process stops when all the tokens to be inserted are [NOI]s. GREEDY ENCONTER thus greedily selects the token with the maximum importance score in each span to be generated, in a bottom-up insertion (or unmasking) process. By forcing more non-[NOI] tokens to be included in Y_k for small k's, GREEDY ENCONTER achieves a lower NOI ratio in the early stages of inference. Experimentally, we find that the cold start problem is eliminated.

Balanced binary tree ENCONTER (BBT-ENCONTER) To further improve the efficiency of GREEDY ENCONTER, we incorporate a balanced binary tree (BBT) scheme into ENCONTER to bias the masking toward tokens near the center of the unobserved subsequence. A BBT reward is added to the importance score function as follows. Suppose x_i and x_j are two adjacent tokens in X_k, and (x_i, x_j) represents the corresponding subsequence in X. We define the distance d_p for each token x_p ∈ (x_i, x_j) as its distance to the nearer end of the span, so that d_p is largest at the center. We use a softmax function over d_p to compute the reward, and the weights in the span are then normalized to [0, 1] and added to the importance score (Eq. 7). The construction of S is almost the same as for GREEDY ENCONTER; the only difference is the new importance score function defined by Eq. 7.
This proposed model, known as BBT-ENCONTER, will learn to predict the central, semantically important token in X between two adjacent tokens of X_k.
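The bottom-up construction and the center-bias reward can be sketched together. Two caveats on assumptions: the slot layout here includes the positions before the first and after the last observed token, and `bbt_reward` encodes our reading of Eq. 7 (distance to the nearer span boundary, softmaxed, then min-max normalized), neither of which is spelled out verbatim in the text:

```python
import math

def greedy_stages(tokens, scores, entity_pos):
    """Bottom-up construction of Y_0, Y_1, ... for GREEDY ENCONTER (hedged
    sketch): X_0 is the entity token sequence; at each stage, every gap
    between observed tokens yields either the most important unobserved
    token in that gap or [NOI]."""
    observed = sorted(entity_pos)
    stages = []
    while True:
        y, new_pos = [], []
        bounds = [-1] + observed + [len(tokens)]
        for i, j in zip(bounds, bounds[1:]):
            gap = range(i + 1, j)
            if len(gap) == 0:
                y.append("[NOI]")
            else:
                p = max(gap, key=lambda q: scores[q])  # greedy: most important first
                y.append(tokens[p])
                new_pos.append(p)
        stages.append(y)
        if not new_pos:          # all slots are [NOI]: generation would stop here
            break
        observed = sorted(observed + new_pos)
    return stages

def bbt_reward(span_len):
    """Assumed BBT reward for a span of unobserved tokens: distance to the
    nearer span boundary, softmaxed over the span, min-max normalized."""
    d = [min(p + 1, span_len - p) for p in range(span_len)]
    e = [math.exp(x) for x in d]
    z = sum(e)
    w = [x / z for x in e]
    lo, hi = min(w), max(w)
    return [(x - lo) / (hi - lo) if hi > lo else 1.0 for x in w]
```

Adding `bbt_reward` to the per-token importance scores before the `max` selection biases each insertion toward the center of its span, which halves the remaining gap at every stage and shortens the stage sequence.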

Models with Entity Span Aware Inference Option (ESAI)
So far, all the above-mentioned models assume that each entity consists of a single token. In real-world use cases, an entity may contain more than one token. Without any control during the inference process, it is possible for other tokens to be generated in between tokens of the same entity. For example, in Table 5, "Group Consolidation" may be split into "handling Group s project / Consolidation". To avoid inserting any tokens in between tokens of a multi-token entity, we introduce the entity span aware inference (ESAI) option to the inference process of POINTER-E and ENCONTER, which forces the inference of Ŷ_k to always generate [NOI] in between the tokens of multi-token entities. With ESAI applied, multi-token entities remain unbroken during the generation process.
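One way to realize ESAI is to precompute, per insertion slot, whether insertion is allowed; disallowed slots are then forced to [NOI] at inference time. The `(start, end)` span interface below is an assumed representation, not the paper's:

```python
def esai_slots(x_k, entity_spans):
    """ESAI sketch: for each insertion slot between adjacent tokens of the
    current sequence x_k, return whether insertion is allowed. Slots inside
    a multi-token entity span (given as (start, end) index pairs into x_k)
    are disallowed, i.e., forced to [NOI]."""
    allowed = [True] * (len(x_k) - 1)
    for start, end in entity_spans:
        for slot in range(start, end):   # slot s sits between tokens s and s+1
            allowed[slot] = False
    return allowed
```

For the "Group Consolidation" example, the slot between the two entity tokens is blocked while the slot after the entity stays open.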

Empirical Analysis of POINTER-E and ENCONTER
In this section, we conduct an analysis of the data preprocessing step in POINTER-E, GREEDY ENCONTER and BBT-ENCONTER. Our objective is to empirically evaluate the characteristics of the training data generated for these models. We leave out POINTER as it is inherently not entity-aware and POINTER-E is its entity-aware variant. We first present the two datasets used in this study.

Datasets
CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003): We select the English version, which contains 1,393 news articles labeled with four named entity types: persons, locations, organizations, and miscellaneous names. The training and development sets are used to train the models. Documents with more than 512 tokens under the WordPiece tokenizer used in BERT (Devlin et al., 2019) are discarded to ensure that a whole document can fit into the models.
Jobs: This is a job post dataset collected from Singapore's Jobsbank. The dataset consists of 7,474 job posts under the software developer occupation (SD) and 7,768 job posts under the sales and marketing manager occupation (SM). We extract the requirement section of these job posts as the text sequences to be generated. For each requirement text sequence (or document), we use a dictionary of skills to annotate the skill- and job-related entities in the sequence.
The detailed information of the datasets can be found in Table 1.

Analysis of NOI ratio and Stage Counts
We first analyse the ratio of [NOI] tokens inserted or masked in every stage of the training data. Figure 3 shows the mean together with one standard deviation of the [NOI] ratios for POINTER-E, GREEDY ENCONTER and BBT-ENCONTER on each dataset. The x-axis is in log scale; note that we add 1 to the stage number for display (e.g., 10^0 in the figure indicates the ratio of [NOI] tokens in Y_0). From Figure 3, we find that all datasets share a few similar characteristics, namely: (1) for POINTER-E, the [NOI] ratio is quite high in the first few stages and drops in the later stages; the sudden increase of the ratio to 1 is due to the ending sequence consisting of all [NOI]'s; (2) for ENCONTER, the [NOI] ratio is low in the first few stages and slowly increases to 1. The result shows that ENCONTER can learn to generate a balanced proportion of [NOI] and non-[NOI] tokens in the first few stages, and also learn not to generate too many non-[NOI] tokens when approaching the end of the generation process.

Figure 2 shows the number of stages each training document requires under the different models. The numbers are sorted according to the following priority: GREEDY ENCONTER, POINTER-E, then BBT-ENCONTER. Since BBT-ENCONTER incorporates the binary tree reward scheme, it is able to perform insertions in the middle stages more efficiently compared to GREEDY ENCONTER. This helps to lower the total number of stages required to derive the training pairs.

Models for Comparison
GPT-2 (Radford et al., 2019): GPT-2 can be used to conduct conditional generation as well (soft constraints). For a training sequence X together with its entities X_e, we concatenate X_e with X to form a training sequence {X_e, X}. X_e then serves as a control code sequence to guide GPT-2 in the generation of X. We fine-tune the GPT-2 small model pretrained by huggingface with a 10^-5 learning rate. Warmup and weight decay are applied, and 10 epochs are used for fine-tuning.

POINTER-E, GREEDY ENCONTER, and BBT-ENCONTER: We use BERT (Devlin et al., 2019) as the underlying insertion transformer for all these models, similar to POINTER. Specifically, we use the bert-base-cased model pretrained by huggingface. BERT with a language model head is fine-tuned on all the training pairs to obtain the models. The learning rate is set to 10^-5 with warmup and weight decay, and 10 epochs are used for fine-tuning.
For POINTER-E, GREEDY ENCONTER, and BBT-ENCONTER, top-k (top-20) sampling is used to derive Ŷ_k. For GPT-2, we feed in X_e and let GPT-2 generate the following tokens until it reaches the end-of-generation token.
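Top-k sampling over a slot's vocabulary distribution can be sketched as below; this mirrors the standard technique with the paper's k = 20 as a default, and is not the authors' code:

```python
import math
import random

def top_k_sample(logits, k=20, rng=random):
    """Top-k sampling sketch: keep the k highest-scoring vocabulary ids,
    softmax their logits, then sample one id from the renormalized set."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]  # numerically stable softmax numerators
    return rng.choices(top, weights=weights, k=1)[0]
```

With k = 1 this degenerates to greedy decoding, which is a convenient sanity check.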

Evaluation Metrics
We evaluate the models using a few criteria, namely: recall of entities, quality with respect to human-crafted text, diversity, fluency, cold start, and generation efficiency. We measure recall of entity constraints by the proportion of entity tokens found in the generated text. Even without ESAI, the recall metric allows us to compare the recall ability of the models. Besides recall, we also consider BLEU (Papineni et al., 2002), METEOR (MTR) (Lavie and Agarwal, 2007) and NIST (Doddington, 2002), which are common metrics for evaluating the quality of generated text against human-crafted text. We compute BLEU-2 (B-2) and BLEU-4 (B-4), which are n-gram precision-based metrics. For the BLEU-based evaluation metric NIST, we compute NIST-2 (N-2) and NIST-4 (N-4). To measure the diversity of generation, Entropy (Zhang et al., 2018) and Distinction (Li et al., 2016) are used. Entropy-4 (E-4) is computed from the frequency distribution of unique 4-gram terms. Dist-1 (D-1) and Dist-2 (D-2) measure the distinct n-grams in the generated text. We also utilize a pretrained language model to measure fluency: perplexity (PPL) is calculated using pretrained GPT-2 (Radford et al., 2019) without fine-tuning. The lower the perplexity, the more fluent the generation (according to GPT-2). "AvgLen" is the average word count of the generated sequences. "failure" indicates the proportion of test sequences that fail to be generated at the first step (i.e., Ŷ_0 is all [NOI]'s). Finally, "AvgSteps" shows the average number of steps for the model to complete the generation. Note that for GPT-2, AvgSteps is based on tokens, while AvgLen is based on words.
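Of these metrics, the Distinction scores are simple enough to sketch directly; the following is our reading of Dist-n as the ratio of unique n-grams to total n-grams:

```python
def distinct_n(tokens, n):
    """Dist-n sketch: ratio of unique n-grams to total n-grams in the
    generated text (Li et al., 2016)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0
```

A degenerate generation that repeats the same phrase scores low on Dist-2 even if its Dist-1 is moderate, which is why both are reported.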

Experiment Results
Tables 2, 3, and 4 show the results of the different models on the different datasets. On recall, GPT-2, due to its inability to enforce hard lexical constraints, yields the worst recall. The non-autoregressive models, even without ESAI, still achieve high recall. Nevertheless, the high recall of POINTER-E is "contributed by" its relatively high failure ratio ("failure"), as recall is 1 even when the model fails to generate anything in the first stage. In other words, POINTER-E suffers from the cold start problem. GREEDY ENCONTER and BBT-ENCONTER, in contrast, enjoy both good recall and zero failure ratio. With the ESAI option, all non-autoregressive models can achieve perfect recall without many additional generation steps. However, this option does not reduce the high failure ratio of POINTER-E.

On generation quality compared with human-crafted text, GREEDY ENCONTER and BBT-ENCONTER outperform all other models on NIST, BLEU, and MTR. This suggests that the ENCONTER models learn the context of entities better than the other models. On generation diversity, POINTER-E again has the highest diversity, largely due to its high failure ratio. Finally, we discuss the efficiency of the models as measured by AvgSteps. The autoregressive nature of GPT-2 makes it the least efficient model among all. POINTER-E's ability to optimize masking patterns makes it the most efficient model. With the balanced binary tree reward, BBT-ENCONTER is able to finish its generation in fewer iterations than GREEDY ENCONTER.

Table 5 shows a case example from the Jobs SM dataset. The entities of the given constraint are underlined. Invalid entities generated are colored in red, while the remaining ones are colored in blue. There are three types of invalid cases: first, the letter case of the entity is not the same as specified; second, the entity is not recalled in the generation; third, the entity has its tokens separated by some other token(s). In this example, POINTER-E and POINTER-E ESAI terminate their generations prematurely.
They fail to perform generation at the very first stage.


Related Work

Another line of work directly steers the pretrained language model with a bag-of-words model or a simple linear discriminator. The above models in their own ways gain a certain level of control over the content generation process. However, they do not provide a mechanism to directly enforce lexical constraints on the final generation. Non-monotonic sequence generation (Welleck et al., 2019) is designed to perform hard lexically constrained generation based on a binary tree structure. By leveraging level-order and in-order traversals of a binary tree, the model allows text to be generated non-monotonically. Although the results from non-monotonic generation models seem promising, they do not perform token generation in parallel, and the tree structure governing the generation process may produce many unused tokens during generation. The emergence of non-autoregressive language models provides another approach to supporting hard lexical constraints. The insertion transformer uses the transformer architecture with a balanced binary tree loss to perform insertion-based generation. KERMIT is proposed as a structure to unify insertion transformers. The Levenshtein transformer (Gu et al., 2019) further introduces deletion as an action to take during generation. Our ENCONTER models differ from these previous models, which are not designed to support lexical constraints such as entity constraints.

Conclusions
Constrained text generation is an important task for many real-world applications. In this paper, we focus on hard entity constraints and the challenges associated with enforcing them in text generation. Our analysis of state-of-the-art insertion transformers reveals two issues, namely, the cold start problem and inefficient generation. We therefore propose two insertion transformer models, GREEDY ENCONTER and BBT-ENCONTER, that use a bottom-up preprocessing strategy to prepare training data so as to eliminate the cold start problem caused by the top-down preprocessing strategy. BBT-ENCONTER further incorporates a balanced binary tree reward scheme to make the generation process more efficient. Through experiments on real-world datasets, we show that the two models outperform the strong baselines, POINTER-E and GPT-2, in recall, quality and failure rate while not compromising much generation efficiency. For future research, it will be interesting to consider more diverse constraints (e.g., soft constraints, rules, etc.) and user interaction in the generation process to expand the scope of applications that can benefit from this research.