Injecting Entity Types into Entity-Guided Text Generation

Recent successes in deep generative modeling have led to significant advances in natural language generation (NLG). Incorporating entities into neural generation models has yielded notable improvements by helping to infer the summary topic and to generate coherent content. To enhance the role of entities in NLG, in this paper we model the entity type in the decoding phase so as to generate contextual words accurately. We develop a novel NLG model that produces a target sequence based on a given list of entities. Our model has a multi-step decoder that injects the entity types into the process of entity mention generation. Experiments on two public news datasets demonstrate that type injection performs better than existing type-embedding concatenation baselines.


Introduction
Entities, as important elements of natural language, play a key role in making text coherent (Grosz et al., 1995). Recently, modeling entities in NLG methods has demonstrated great improvements by helping to infer the summary topic (Amplayo et al., 2018) or to generate coherent content (Ji et al., 2017; Clark et al., 2018). To enhance the representation of an entity, the entity type is often used in existing work: it is represented as a separate embedding and concatenated with the embedding of the entity mention (i.e., surface name) in the encoding/decoding phase (Puduppully et al., 2019; Yu et al., 2018; Chan et al., 2019). Although concatenation performs better than using the entity mention embedding alone, the relationship between entity mention and entity type is not reflected, so the signal from the entity type is undermined in NLG.
To address the above issue, our idea is to model the entity type carefully in the decoding phase to generate contextual words accurately. In this work, we focus on developing a novel NLG model that produces a target sequence based on a given list of entities. Compared to the number of words in the target sequence, the number of given entities is much smaller. Since the source information is extremely insufficient, it is difficult to generate precise contextual words describing the relationships between, or events involving, multiple entities such as persons, organizations, and locations. Besides, since the input entities are important prompts about the content of the target sequence (Yao et al., 2019), the quality of the generated sequence depends significantly on whether the input entities are logically connected and expressed in the output. However, existing generation models may stop halfway and fail to generate words for the expected entities, leading to serious incompleteness (Feng et al., 2018).

* The first two authors have equal contributions.
§ Our code and datasets are available at https://github.com/DM2-ND/InjType.

Example target sequence (cf. Figure 1): "US (1) vice president Dick_Cheney (2) made a surprise visit to Afghanistan (3) on Monday (4) for talks with Afghan (5) president Hamid_Karzai (6), ahead of the NATO (7) summit early next month in Bucharest (8)." The numbers index the given entities.
In this paper, we propose a novel method of utilizing the type information in NLG, called InjType. It keeps the same encoder as standard Seq2Seq models. During decoding, it first predicts the probability that each token is a contextual word from the vocabulary or an entity from the given list. If the token is an entity, the model directly injects the embedding of the entity type into the process of generating the entity mention, using a mention predictor to predict the entity mention based on the type embedding and the current decoding hidden state. The type injection maximizes the likelihood of generating an entity indicator rather than the likelihood of sparse entity mentions. The hidden state is jointly optimized by predicting the role of each token and predicting the entity mention, so the entity's information is effectively embedded into the hidden states. Experiments on two public news datasets, GIGAWORDS and NYT, demonstrate that InjType can generate more precise contextual words than existing concatenation-based models.

Figure 1: The decoding process of InjType has four steps: (S1) predicting the <Ent> token (i.e., entity indicator); (S2) injecting the entity types; (S3) combining an entity-type-enhanced NLU with backward information of the target sequence; (S4) predicting the entity mention from the type embedding and hidden state with a mention predictor.

Related Work
Entity-related Text Generation. Entities in natural language carry useful contextual information (Nenkova, 2008) and therefore play an important role in different NLG tasks such as summarization (Sharma et al., 2019; Amplayo et al., 2018), concept generation (Zeng et al., 2021), table description (Puduppully et al., 2019), and news generation. In summarization, entity mentions have been used to extract non-adjacent yet coherent sentences, link to existing knowledge bases, and infer the summary topic (Sharma et al., 2019; Amplayo et al., 2018). In table description, entity mentions have been used to achieve discourse coherence (Puduppully et al., 2019). Our task is related to Chan et al. (2019), who generate a product description from a list of product entities. Different from the above work, we leverage the entity class in the decoding phase to better predict entities and contextual words.
Words-to-text Generation. This task is also referred to as constrained text generation (Zhang et al., 2020; Qin et al., 2019). Generating text from topic words and keywords is a popular task in NLG. It not only has plenty of practical applications, e.g., benefiting intelligent education by assisting in essay writing (Feng et al., 2018; Yang et al., 2019) and automated journalism by helping news generation (Zheng et al., 2017; Zhang et al., 2020), but also serves as an ideal test bed for controllable text generation (Wang and Wan, 2018; Yao et al., 2019). The main challenge of words-to-text generation is that the source information is extremely insufficient compared to the target output, leading to poor topic consistency in the generated text (Yu et al., 2020).

Proposed Method: InjType
In this section, we first give the task definition and then introduce our proposed type injection method. We note that InjType uses the same encoder as standard Seq2Seq models, i.e., a bidirectional GRU. So, in Figure 1 and the following sections, we only describe the decoding process.

Task Definition

Given a list of entities X = ((x^M_1, x^T_1), ..., (x^M_n, x^T_n)), the pair (x^M_i, x^T_i) consists of the mention and type of the i-th entity, where M is the set of entity mentions and T is the set of entity types. The expected output sequence is y = (y_1, ..., y_m), containing all the entity mentions. We denote the vocabulary of contextual words by V, so y_j ∈ M ∪ V for j ∈ {1, ..., m}. The task is to learn a predictive function f : X → Y, mapping a list of entities to a target sequence.
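As a minimal illustration of this input/output contract, consider the following Python sketch (the names `Entity` and `is_valid_output` are ours, not part of the model):

```python
from typing import List, Tuple

# An entity is a (mention, type) pair; the input X is a list of such pairs.
Entity = Tuple[str, str]

def is_valid_output(X: List[Entity], y: List[str], vocab: set) -> bool:
    """Check that every token of y is either a contextual word from the
    vocabulary V or one of the given entity mentions, and that all input
    mentions appear somewhere in the output."""
    mentions = {m for m, _ in X}
    tokens_ok = all(tok in mentions or tok in vocab for tok in y)
    all_covered = mentions.issubset(set(y))
    return tokens_ok and all_covered

X = [("Dick_Cheney", "PERSON"), ("Afghanistan", "LOCATION")]
vocab = {"visited", "on", "Monday"}
print(is_valid_output(X, ["Dick_Cheney", "visited", "Afghanistan", "on", "Monday"], vocab))  # True
```

Note that a valid target must cover all given entity mentions, which is exactly the completeness property the model is trained toward.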

Entity Indicator Predictor
At each step, the decoder predicts either an entity indicator or a contextual word. An entity indicator, denoted <Ent>, indicates that the current decoding step should generate an entity in the output sequence. If the input has n entities, n entity indicators <Ent> will mark entity positions in the output sequence, so the first-step output sequence is: block_1, <Ent>, block_2, ..., <Ent>, block_{n+1}. Each block has one or multiple contextual words and ends with an entity indicator (<Ent>). Within each block, generation follows the standard auto-regressive decoding process; when the decoder generates an <Ent>, the current block ends. When the decoder generates the (n+1)-th entity indicator <Ent>, i.e., the one closing the final block, the entire generation terminates.
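The indicator-level target sequence described above is obtained by replacing each entity mention in the target with <Ent>. A minimal sketch (our own helper, not the authors' code):

```python
def to_indicator_sequence(target_tokens, entity_mentions):
    """Replace each entity mention token in the target with the <Ent> indicator,
    leaving contextual words unchanged."""
    mentions = set(entity_mentions)
    return ["<Ent>" if tok in mentions else tok for tok in target_tokens]

target = ["US", "vice", "president", "Dick_Cheney", "visited", "Afghanistan"]
entities = ["US", "Dick_Cheney", "Afghanistan"]
print(to_indicator_sequence(target, entities))
# -> ['<Ent>', 'vice', 'president', '<Ent>', 'visited', '<Ent>']
```

The contextual words between consecutive <Ent> tokens form the blocks described above.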
Suppose the ground truth of the entity indicator output ỹ is the target sequence with entity mentions replaced by entity indicators <Ent>. The loss function with the entity indicator is then defined as:

L_EI = − Σ_t log p(ỹ_t ∈ {<Ent>} ∪ V | ỹ_<t, X).
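The indicator loss above is an ordinary negative log-likelihood over the indicator-level sequence. A minimal sketch, assuming per-step probabilities are given as dictionaries mapping candidate tokens to probabilities:

```python
import math

def entity_indicator_loss(step_probs, gold_tokens):
    """Negative log-likelihood of the indicator-level target sequence,
    where each gold token is either <Ent> or a contextual word.
    step_probs[t] maps candidate tokens to the model probability at step t."""
    return -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(gold_tokens))

probs = [{"<Ent>": 0.5, "the": 0.5}, {"<Ent>": 0.25, "the": 0.75}]
loss = entity_indicator_loss(probs, ["<Ent>", "the"])
```

Because all n entities share the single <Ent> symbol, this objective concentrates probability mass on the indicator instead of spreading it over sparse entity mentions.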

Mention Predictor
Since each block's generation ends with the entity indicator token (<Ent>), the representations of the last hidden states in different blocks become assimilated, which may lose the contextual information of previously generated tokens. To let the decoding hidden states carry rich contextual information and better predict the entity mention, we present a novel mention predictor that injects the entity type embedding into the current hidden state and feeds the combined embedding into a mention classifier. In the i-th block, the predicted entity mention is

x̂^M_i = softmax(W_p · [s_t ⊕ x^T_i]),

where s_t is the hidden state of the t-th token in the generated text, x^T_i is the i-th entity type embedding, and W_p is a trainable parameter. In this way, the last hidden state in each block not only has to be classified as an entity indicator (<Ent>), but also carries both entity type and entity mention information in order to make precise generation. The classification loss L_MP can be written as:

L_MP = − Σ_{i=1}^{n} log p(x^M_i | s_t, x^T_i).
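The mention predictor reduces to a softmax over mention scores computed from the concatenated hidden state and type embedding. A minimal pure-Python sketch (toy dimensions; `W` is one row per candidate mention):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def predict_mention(s_t, x_type, W):
    """Score each candidate mention as W · [s_t ⊕ x_type], then softmax.
    Each row of W has length len(s_t) + len(x_type)."""
    z = s_t + x_type  # list concatenation plays the role of ⊕
    logits = [sum(w * v for w, v in zip(row, z)) for row in W]
    return softmax(logits)

s_t = [0.2, -0.1]           # toy decoding hidden state
x_type = [1.0, 0.0]         # toy type embedding, e.g., PERSON
W = [[1.0, 0.0, 2.0, 0.0],  # row for mention "Dick_Cheney"
     [0.0, 1.0, 0.0, 2.0]]  # row for mention "Afghanistan"
probs = predict_mention(s_t, x_type, W)
```

Injecting `x_type` into the score means a PERSON-typed slot pushes probability toward person mentions even when the hidden state alone is ambiguous.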

Entity Type-Enhanced NLU
Inspired by the UniLM (Dong et al., 2019), we let our decoder complete a type enhanced NLU task along with its original generation task. We borrow the decoder from the NLG task to conduct an NLU task on the ground truth articles during training. The entity type enhanced NLU task asks the decoder to predict entity mentions corresponding to the types in the ground truth based on context words. If the decoder is able to correctly predict the entity mention given contextual information, it should be capable of generating good context words that can help predict entity mentions as well.
Since the decoder used for generation is naturally one-way (left-to-right), in order to complete the NLU task more reasonably, we train a GRU module in the reversed direction, denoted ←GRU, and reuse the original NLG decoder without attention, denoted →GRU, for the NLU task. This module generates the prediction as follows:

s̃_t = s_t ⊕ ←s_t,

where s̃_t is the concatenation of the original hidden state s_t and the new hidden state ←s_t derived from the added backward GRU module. The entity mention is then predicted as x̂^M_i = softmax(W_r · s̃_t). The NLU loss is calculated only at the entity positions:

L_NLU = − Σ_{i=1}^{n} log p(x^M_i | s̃_t).
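The NLU objective can be sketched as follows, assuming forward and backward hidden states per position are given as plain lists (toy dimensions; not the authors' implementation):

```python
import math

def nlu_entity_loss(fwd_states, bwd_states, gold_mention_ids, entity_positions, W_r):
    """Concatenate forward and backward hidden states at each position,
    score candidate mentions with W_r, and sum the negative log-likelihood
    only at the positions where an entity occurs."""
    loss = 0.0
    for t in entity_positions:
        z = fwd_states[t] + bwd_states[t]  # s_t ⊕ backward state
        logits = [sum(w * v for w, v in zip(row, z)) for row in W_r]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[gold_mention_ids[t]] - log_z)  # -log softmax
    return loss
```

Restricting the sum to entity positions is what makes this an NLU-style fill-in task rather than a second full generation objective.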

Joint Optimization
InjType jointly optimizes the following loss:

L = L_EI + λ1 · L_MP + λ2 · L_NLU,    (1)

where λ1 and λ2 are hyperparameters that control the importance of the different tasks.

Baselines

All Seq2Seq baselines are implemented with the concatenation of entity mentions and entity types. We fine-tune GPT-2 and UniLM on the training set for 20 epochs. Since our model is not pre-trained on large corpora and has many fewer parameters, competing with GPT-2 and UniLM is very challenging.

Implementation Details
We set the dimension of the hidden states of the GRU encoder and decoder to 256, and the word embedding size to 300. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4. We trained our model for 60 epochs on an NVIDIA 2080 Ti GPU. We performed a grid search over the hyperparameters in the loss of Eq. (1) and obtained the best performance at λ1 = λ2 = 2.

Experimental Results
Table 3 compares our InjType with competitive baseline methods. As shown in the table, InjType performs the best among all methods, which demonstrates that our proposed multi-step decoder is a more effective strategy than simple entity type embedding concatenation. Table 5 shows a case from the test set comparing the generated results of different models; we observe that our model generates more precise contextual words than the baseline methods. We also compare our model with its variants in Table 3. The mention predictor (MP) contributes more than the NLU module: adding MP improves BLEU-4 by +1.65%, while adding NLU improves BLEU-4 by +0.26%. The combination of the two performs the best. The NLU module has a positive impact, though we do not claim it as a core contribution. It reuses the decoder to predict entity mentions and aligns with our goal of effectively embedding entity meaning into the hidden states. The bi-directional GRU used in the NLU brings extra coherence into the model, since it considers the entities and context after the current token.

Human Evaluation
We sample 50 examples from the GIGAWORDS test set. Every generated news article is presented to 5 annotators on Amazon Mechanical Turk (AMT). Annotators were shown two news articles and asked to decide which one was better and which was worse in terms of grammar and fluency, coherence, and informativeness; the result is "win", "lose", or "tie". Table 4 shows the human evaluation results. InjType significantly outperforms Seq2Seq attention with type concatenation on coherence and informativeness, showing that entity type injection is a more effective way to leverage entity type information than simple concatenation.

Conclusions
Entities play a key role in making text coherent in news generation. To enhance the role of entities, we propose a novel multi-step decoder that effectively embeds an entity's meaning into the decoding hidden states, making the generated words precise. Experiments on two news datasets demonstrate that our method performs better than conventional type embedding concatenation.

Table 5 (case study):
Ground truth: China has agreed to back Russia's entry into the WTO, Chinese prime minister Wen Jiabao said Friday after talks in Moscow, the Interfax news agency reported.
Seq2Attn: China has lodged a veto global WTO plan to WTO the global trade body at its senior Chinese counterpart Wen Jiabao on Friday, Interfax news agency reported, as saying from Moscow, Interfax news agency said, quoted by the Interfax news agency. [The second "WTO" should be "Russia". "Interfax" appeared multiple times.]
CopyNet: China and Russia want to sell WTO trade sanctions against the World_Trade_Organisation, warning the proposed sanctions against the WTO agreement are at the WTO agreement we are concerned, the report said after meeting with his meeting counterpart Wen_Jiabao on Friday in Moscow, the Interfax_news_agency reported.
InjType (our method): China has agreed Russia's bid to join the WTO opening, Chinese foreign minister Wen Jiabao said Friday Moscow, report quoted the state official from Interfax news agency.

Additional Dataset Information
We create two datasets from public sources: GIGAWORDS (Graff et al., 2003) and NYT (Sandhaus, 2008). The GIGAWORDS dataset contains around 4 million human-written news articles from famous news publishers such as the New York Times and the Washington Post, spanning 1994 to 2010. The NYT dataset contains news articles written and published by the New York Times from January 1987 to June 2007; it also contains 650,000 article summaries. For both datasets, we use 1,000 articles for development, 1,000 for testing, and the remainder for training. On average, the models are expected to predict more than 70 contextual words precisely from a list of about 10 entities.

Entity Type Embedding Concatenation
A natural way to incorporate type information into a Seq2Seq generation framework is to concatenate entity mention embeddings and type embeddings (Yu et al., 2018; Chan et al., 2019). In the encoding phase, we take both the entity mention and the entity type as input, and learn a contextual representation of each entity in the given list through a bi-directional GRU encoder. In the decoding phase, we adopt a standard attention mechanism (Bahdanau et al., 2015) to generate the output sequence.

Specifically, in the encoding phase, we concatenate the embedding of entity mention x^M_i with the embedding of its corresponding type x^T_i, extracted by CoreNLP (Finkel et al., 2005). The input embedding of the i-th entity is defined as

x_i = x^M_i ⊕ x^T_i,

where ⊕ denotes vector concatenation. We adopt a bi-directional gated recurrent unit (Bi-GRU) (Cho et al., 2014) as the encoder to capture a contextualized representation of each entity in the input list. The forward GRU reads the input entities X from x_1 to x_n, where n is the length of the input list, and the backward GRU reads them in the reverse direction. We then concatenate the hidden states from both directions:

h_i = →h_i ⊕ ←h_i.

In the decoding phase, another GRU serves as the decoder to generate entity mentions and contextual words of the target sequence: s_t = →GRU(y_t, s_{t-1}, c_t), where y_t is the concatenation of the entity mention embedding y^M_i with the embedding of its corresponding type y^T_i (as shown in Figure 2). Given the current decoding hidden state s_t and the source-side attentive context vector c_t, the readout hidden state is defined as r_t = tanh(W_c · [s_t ⊕ c_t]). The context vector c_t at decoding step t is computed through an attention mechanism that matches the last decoding state s_{t-1} with each encoder hidden state h_i to get an importance score:

e_{t,i} = tanh(W_a · s_{t-1} + U_a · h_i),

where W_a and U_a are trainable parameters. The importance scores e_{t,i} are then normalized into attention weights α_{t,i} via softmax over i, and the current context vector is their weighted sum, c_t = Σ_i α_{t,i} · h_i. Finally, the readout state r_t is passed through a multilayer perceptron (MLP) to predict the next word with a softmax layer over the decoder vocabulary (all entity mentions and contextual words): p(y_t | y_<t, X) = softmax(W_r · r_t).
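The attention computation can be sketched in pure Python. Note one assumption: the score e_{t,i} above is written as a tanh of a vector, so we add a hypothetical projection vector `v_a` to reduce it to a scalar, as in standard Bahdanau attention; this projection is not explicit in the text.

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

def attention_context(s_prev, enc_states, W_a, U_a, v_a):
    """e_{t,i} = v_a · tanh(W_a s_{t-1} + U_a h_i); α = softmax(e); c = Σ α h.
    v_a is an assumed projection vector reducing the tanh output to a scalar."""
    scores = []
    for h in enc_states:
        e = tanh_vec(vadd(matvec(W_a, s_prev), matvec(U_a, h)))
        scores.append(sum(x * y for x, y in zip(v_a, e)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    c = [sum(a * h[k] for a, h in zip(alphas, enc_states))
         for k in range(len(enc_states[0]))]
    return alphas, c
```

With zero weights, all entities receive equal attention and the context vector is the mean of the encoder states, which matches the softmax-then-weighted-sum definition above.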
The loss function is defined as:

L = − Σ_t log p(y_t ∈ M ∪ V | y_<t, X).

Figure 2: Concatenating entity mention embeddings and type embeddings is a straightforward strategy to use the type information. However, it may not be effective due to the lack of contextual information in the hidden states. Note that the figure only highlights the decoder; the attention mechanism is also employed but is not shown.