Grouped-Attention for Content-Selection and Content-Plan Generation

Recent neural data-to-text generation models employ Pointer Networks to explicitly learn a content-plan given a set of attributes as input. They use an LSTM to encode the input, which assumes a sequential relationship in the input. This may be sub-optimal for encoding a set of attributes, which has a composite structure: the attributes are unordered, while each attribute value is an ordered list of tokens. We address this problem by proposing a neural content-planner that can capture both the local and global contexts of such a structure. Specifically, we propose a novel attention mechanism called GSC-attention. A key component of the GSC-attention is grouped-attention, a token-level attention constrained within each input attribute, which enables our proposed model to capture both local and global context. Moreover, our content-planner explicitly learns content-selection, which is integrated into the content-planner via an attention masking procedure to select the most important data to be included in the generated text. Experimental results show that our model outperforms the competitors by 4.92%, 4.70%, and 16.56% in terms of Damerau-Levenshtein Distance scores on the WIKIALL, WIKIBIO, and ROTOWIRE datasets, respectively.


Introduction
Data-to-text generation (Reiter and Dale, 2000) is an important and challenging task in natural language processing. It aims to produce sentences given structured data. There are many downstream applications of data-to-text generation, such as biography summarization (Lebret et al., 2016), automatic weather forecasting (Mei et al., 2016), etc.
Traditional approaches follow a pipeline framework consisting of three stages: content-selection, content-planning, and surface-realization. Content-selection selects the data to be expressed; content-planning determines the structure of the sentences

Table 1: Example input and output of content-plan generation.

Example description: Barack Obama (born August 4, 1961, in Honolulu, Hawaii) is an American politician who lives in Washington, D.C., U.S.

Output (content-plan): Barack, Obama, August, 4, 1961, Honolulu, Hawaii, politician, Washington, D.C.

to be generated; and surface-realization generates the output based on the content-plan. Recent neural data-to-text generation approaches integrate these stages into an end-to-end model, i.e., the tasks of these stages are learned implicitly, as end-to-end training becomes popular (Wang et al., 2018a). Despite their success, end-to-end models without proper content-planning may generate repetitive, incomplete, and incoherent sentences. Moreover, end-to-end models are less interpretable, making it difficult to perform error analysis for further improvement.
It has been shown that explicitly learning content-planning improves both the performance (e.g., reducing repetition and generating coherent sentences) and the interpretability of neural data-to-text generation models (Trisedya et al., 2018). In this paper, we study the problem of neural content-plan generation. Given a set of attributes of an entity, we aim to select the salient attributes (i.e., the attributes mentioned) and reorder the selected attributes such that they follow the common attribute mentioning order in natural sentences. For example, in Table 1, the input is a set of attributes for the entity Barack Obama (in the form of key-value pairs): ⟨name; Barack Hussein Obama⟩, ⟨birth_place; Honolulu, Hawaii⟩, etc. Suppose that the target description is Barack Obama (born August 4, 1961, in Honolulu, Hawaii) is an American politician who lives in Washington, D.C., U.S. Our goal is to generate the content-plan of the target description, which is the order in which attribute values are mentioned in the description. In this example, the content-plan is ⟨Barack, Obama, August, 4, 1961, Honolulu, Hawaii, politician, Washington, D.C.⟩.

Existing neural data-to-text generation models (Puduppully et al., 2019; Trisedya et al., 2020) employ Pointer Networks (Vinyals et al., 2015) as the content-planner. Such models have two limitations. First, the oracle Pointer Networks used by Puduppully et al. (2019) and Trisedya et al. (2020) do not explicitly learn content-selection. Moreover, the input to a data-to-text generation model is a set of attributes, which may be given in any order and has no sequential relationship; using an LSTM to encode ordering relationships in such an input may be sub-optimal. Second, Pointer Networks do not handle content-selection properly. Typically, a content-planner is applied at the token level, selecting salient tokens for generating the description.
For the example in Table 1, the tokens that are supposed to be selected for the attribute name are Barack and Obama, while the token Hussein is supposed to be filtered. In summary, the input of the task is attributes of an entity that have a composite structure (instead of a sequence): the attributes are disordered while each attribute value is an ordered list of tokens. To encode such a structure properly, the encoder should learn both the representation for each token of an attribute (i.e., local context) and the attribute as a whole (i.e., global context).
To address the above limitations, we propose a novel neural content-planner. Specifically, we propose a novel GSC-attention to capture the local and global contexts of the input set. The GSC-attention consists of three attention mechanisms. The first is grouped-attention, a token-level attention mechanism restricted within each attribute. The grouped-attention lets an attribute representation capture the relationships between tokens in an attribute (i.e., the local context). The second is self-attention among attribute-level representations. This attention updates the attribute representations based on the overall attribute information (i.e., the global context). The third is cross-attention, which updates token representations based on attribute representations to capture the attributes' composite relationships. We stack multiple layers of GSC-attention, and the updated token representations of one layer are used as input to the next layer. This way, GSC-attention captures the interaction between the local and global contexts.
We further propose a content-selection masking procedure to integrate content-selection into our content-planner. Content-selection aims to filter out non-salient attribute tokens, which helps the content-planner arrange the selected attribute tokens properly. We integrate content-selection with our content-planner as follows. First, the content-selection module generates a pseudo-content-selection, a binary value that indicates whether an attribute token should be selected. Then, the pseudo-content-selection is used as a mask in the content-planning module to let content-planning focus on arranging the selected attribute tokens. The advantages of our masking procedure are twofold. First, it allows end-to-end joint training of content-selection and content-planning. Second, it makes content-selection explicit, which improves the interpretability of the model, specifically when analyzing model errors. The contributions of this paper are as follows.
• We propose a neural model for content-plan generation from a set of attributes that explicitly learns content-selection and content-planning.
• To properly encode a set of attributes, we propose a novel attention mechanism, GSC-attention, that effectively captures the local and global contexts in an input set and their composite relationships.
• To integrate the content-selection and content-planning procedures, we propose a content-selection masking procedure.
Related Work

As end-to-end training becomes prevalent (Wang et al., 2021b), recent data-to-text generation approaches employ neural networks that can be trained end-to-end (Shen et al., 2020). These approaches use encoder-decoder frameworks (Cho et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017). The encoder transforms the input into a vector representation; the decoder takes the vector representation as context to generate the target sentence. In both the encoder and the decoder, sequence models such as LSTM (Hochreiter and Schmidhuber, 1997) and the Transformer (Vaswani et al., 2017) are used to process the data. Studies in this line of work include biography summarization (Lebret et al., 2016), weather forecasting (Mei et al., 2016), and game summarization (Wiseman et al., 2017). Despite their success, these models may generate incoherent sentences since they do not have a proper content-plan.
To improve the coherence of the generated text, recent studies re-introduce content-planning mechanisms into neural data-to-text generation. One line of work proposes a content-planning mechanism via link-based attention to model relationships among input data, learning a matrix in which each element indicates the transition probability between two attributes. Goldfarb-Tarrant et al. (2020) employ the Aristotelian framework and a re-scoring model to find the best story plot in story generation. Other studies along this line propose two-stage models, i.e., learning content-planning and surface-realization consecutively. The aforementioned models use content-planning mechanisms with LSTM or shallow transformation networks as the input encoder. However, LSTM is sub-optimal for capturing the relationships in the input. This is because the input is a set (e.g., a set of attributes), and each attribute may consist of multiple tokens, which form a hierarchical structure among the attributes instead of a sequence. Thus, applying LSTM to the input data may lead to improper representations of the input. In this paper, we propose a novel content-planner that captures the hierarchical structure of the input.

Preliminary
Let A = {a_1, a_2, ..., a_n} be a set of n attributes of an entity. Each attribute a_i = ⟨k_i; v_i⟩ is a pair of a key k_i and a value v_i (i = 1, 2, ..., n). Key k_i denotes the type of an attribute. Value v_i = [t_{i,1}, t_{i,2}, ..., t_{i,m_i}] is an ordered list of m_i tokens. The content-plan of a sentence specifies which attributes are selected to be included in the generated sentence and their order in the sentence. It consists of a sequence of tokens, where each token belongs to one of the attributes. The content-plan can be represented by a sequence of index pairs P = [(i_1, j_1), (i_2, j_2), ..., (i_c, j_c)], where i_k indicates the index of an attribute and j_k indicates the index of a token within that attribute (k = 1, 2, ..., c).
Given a set of attributes A, we aim to generate the content-plan P. Note that our goal is not to generate a sentence given a set of attributes. Our work aims to organize the attributes in a way that enables generating better (i.e., non-repetitive and coherent) sentences for downstream textual generation models.

Proposed Model
Solution overview. We propose a novel end-to-end model for content-plan generation. The model consists of four modules: an embedding and linear transformation module, an attribute-encoding module, a content-selection module, and a content-planning module.
The embedding and linear transformation module (cf. Section 4.1) aims to obtain initial vector representations of attributes and tokens. The representations should maintain the order of tokens within an attribute and should not impose any order on the set of attributes.
The attribute-encoding module (cf. Section 4.2) aims to encode the attributes of an entity into fixed-length embeddings, which will be used as inputs by the content-selection and content-planning modules. The attributes of an entity have a composite structure: the attributes are unordered, while each attribute value is an ordered list of tokens. To encode such a structure, we propose a novel attention mechanism capable of learning a representation for each token of an attribute and for the attribute as a whole.
The content-selection module (cf. Section 4.3) aims to accurately predict both content-selection labels and attribute-selection labels. The content-selection labels help highlight the selected attributes, whereas the attribute-selection labels provide additional supervision signals for training the parameters of the content-selection module.
The content-planning module (cf. Section 4.4) integrates the content-selection into the Pointer Networks to generate a content-plan.

Embedding and Linear Transformation
We obtain representations of tokens and attributes by applying a linear transformation to their embeddings. To distinguish the same token appearing in different attributes and to maintain the internal order of tokens within an attribute, we represent a token of an attribute by a quadruple ⟨k_i, v_{i,j}, f_{i,j}, r_{i,j}⟩, where k_i is the key of the attribute, v_{i,j} is the token, and f_{i,j} and r_{i,j} are the forward and backward positions of the token within the attribute, respectively. The representation t_{i,j} of the token is computed by applying linear transformations to an embedding e^k_{i,j} of the key, an embedding e^v_{i,j} of the token, and embeddings e^f_{i,j}, e^r_{i,j} of the forward and backward positions:

t_{i,j} = tanh(k_{i,j} + v_{i,j}),

where k_{i,j} and v_{i,j} denote the linearly transformed key (with position) embeddings and the linearly transformed token embedding, respectively.
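A minimal sketch of this token representation (the embedding sizes, parameter names `W_k`, `W_v`, and the exact split of the linear transformations into the k and v parts are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # shared model dimension (illustrative)
d_k, d_v, d_p = 4, 6, 2     # key/token/position embedding sizes (assumed)

# Assumed learned projections: map each embedding into R^d.
W_k = rng.normal(size=(d, d_k + 2 * d_p))   # key + forward/backward positions
W_v = rng.normal(size=(d, d_v))             # token embedding

def token_representation(e_key, e_tok, e_fwd, e_bwd):
    """t_{i,j} = tanh(k_{i,j} + v_{i,j}), with k and v linear transforms."""
    k = W_k @ np.concatenate([e_key, e_fwd, e_bwd])
    v = W_v @ e_tok
    return np.tanh(k + v)

t = token_representation(rng.normal(size=d_k), rng.normal(size=d_v),
                         rng.normal(size=d_p), rng.normal(size=d_p))
assert t.shape == (d,) and np.all(np.abs(t) <= 1.0)
```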
The representation a_i of an attribute is computed analogously, by applying a linear transformation to an embedding e^a_i of attribute a_i.

Attribute Encoding Module
We propose a novel attribute encoder to learn two types of embeddings: embeddings of tokens and embeddings of attributes. The attribute encoder consists of a stack of N identical layers (N is a system parameter). Each layer is composed of two sub-layers: the first is a Grouped-Self-Cross attention (GSC-attention) layer, a novel attention mechanism, and the second is a feed-forward layer. We employ each of the two sub-layers to learn a residual function and then apply a batch normalization layer. Formally, the output of each sub-layer is BatchNorm(x + SubLayer(x)), where x is the input to the sub-layer, SubLayer(·) is the residual function learned, and BatchNorm(·) denotes batch normalization. We use a fixed dimensionality d for all layers throughout this paper to facilitate the residual connections. As a key building block of the attribute encoder, the GSC-attention comprises three attention mechanisms: a grouped-attention, a self-attention, and a cross-attention. We illustrate them in Fig. 1 and describe them next.
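The residual sub-layer wrapper BatchNorm(x + SubLayer(x)) can be sketched as follows (a minimal sketch with a toy residual function; computing the normalization statistics over the token axis is an illustrative simplification of batch normalization):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize over axis 0 (here, the token axis), as a simplified batch norm
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer_output(x, sublayer):
    """BatchNorm(x + SubLayer(x)): residual connection, then normalization."""
    return batch_norm(x + sublayer(x))

x = np.random.default_rng(1).normal(size=(4, 8))   # 4 tokens, d = 8
y = sublayer_output(x, lambda z: np.tanh(z))       # toy residual function
assert y.shape == x.shape
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
```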

Grouped-attention
Grouped-attention. The grouped-attention aims to learn a representation for each attribute based on interactions among the tokens within the attribute. For simplicity, we use a one-dimensional index to represent the sequence of all tokens of all input attributes, [t_1, t_2, ..., t_m], which is a simplified form of [t_{1,1}, t_{1,2}, ..., t_{n,m_n}]. In the grouped-attention, we require the tokens of the same attribute to appear together in the sequence; the attributes themselves may appear in any order. Let G ∈ R^{n×m} be a group mask matrix where each entry g_{i,j} = 1 if the value of attribute a_i contains token t_j, and g_{i,j} = 0 otherwise. We compute the grouped-attention as follows:

GA(Q_g, K_g, V_g) = (softmax(Q_g K_g^⊤ / √d) ⊙ G) V_g W,

where Q_g, K_g, and V_g are the query, key, and value matrices, respectively. Here, Q_g ∈ R^{n×d}, while K_g, V_g ∈ R^{m×d}; ⊙ denotes element-wise multiplication of two matrices, and W denotes a learned parameter matrix. The use of the group mask G makes the attention weights focus on intra-attribute interactions, i.e., interactions among tokens within the same attribute.
Self-attention. The grouped-attention layer only considers intra-attribute interactions but not inter-attribute interactions, i.e., interactions among tokens from different attributes. To capture inter-attribute interactions, we employ a self-attention layer over the attribute embeddings:

SA(Q_s, K_s, V_s) = softmax(Q_s K_s^⊤ / √d) V_s,

where Q_s, K_s, and V_s are the query, key, and value matrices computed from the updated attribute representations given by Eq. 8.

Cross-attention. After obtaining the attribute embeddings that capture both intra-attribute and inter-attribute interactions, we update the token embeddings with a cross-attention layer:

CA(Q_c, K_c, V_c) = softmax(Q_c K_c^⊤ / √d) V_c,

where Q_c is a query matrix computed from the token embeddings t, and K_c and V_c are the key and value matrices computed from the attribute representations given by the self-attention (cf. Eq. 9). Unlike the grouped-attention, we do not employ the group mask in the cross-attention, so that the token embeddings capture both intra-attribute and inter-attribute interactions among attribute tokens.
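Putting the three attentions together, one GSC-attention layer can be sketched as follows. This is a minimal NumPy sketch: it applies the group mask by setting out-of-group scores to a large negative value before the softmax (a common alternative realization of group-restricted attention), and it omits the learned query/key/value projections and the feed-forward sub-layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(2)
d = 8
tokens = rng.normal(size=(5, d))   # m = 5 tokens across all attributes
attrs = rng.normal(size=(2, d))    # n = 2 attribute embeddings
# Group mask G (n x m): attribute 0 owns tokens 0-2, attribute 1 owns 3-4.
G = np.array([[1, 1, 1, 0, 0],
              [0, 0, 0, 1, 1]], dtype=float)

# 1) Grouped-attention: attribute queries attend only to their own tokens.
scores = (attrs @ tokens.T) / np.sqrt(d)
scores = np.where(G > 0, scores, -1e9)   # mask out other attributes' tokens
attrs_local = softmax(scores) @ tokens   # local context per attribute

# 2) Self-attention over attribute embeddings: global context.
attrs_global = attention(attrs_local, attrs_local, attrs_local)

# 3) Cross-attention: tokens attend to all updated attribute embeddings
#    (no group mask here, so inter-attribute information flows to tokens).
tokens_out = attention(tokens, attrs_global, attrs_global)
assert tokens_out.shape == (5, d) and attrs_global.shape == (2, d)
```

The updated `tokens_out` would feed the next stacked GSC-attention layer, so local and global contexts interact across layers.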

Content-Selection Module
We define a content-plan as a sequence of attribute tokens in the order that they are mentioned in a sentence. For example, the content-plan for the example in Table 1 is ⟨Barack, Obama, August, 4, 1961, Honolulu, Hawaii, politician, Washington, D.C.⟩. The content-selection label is a set of binary variables F = {l_{i,j} | 1 ≤ i ≤ n, 1 ≤ j ≤ m_i}, where subscripts i and j indicate the attribute index and the token index, i.e., l_{i,j} indicates whether token t_{i,j} in the value of attribute a_i appears in the content-plan (l_{i,j} = 1) or not (l_{i,j} = 0). Given the sequence of token embeddings [t_1, t_2, ..., t_m] output by the attribute-encoding module, we use a fully-connected layer on top of each token embedding to compute the probability that the content-plan includes the token:
p_j = sigmoid(W t_j + b_t),

where W is a trainable parameter matrix of the fully-connected layer and b_t is the bias. We refer to these probabilities as content-selection probabilities. Since the content-selection labels are binary, we use a binary cross-entropy loss:

L_cs = -Σ_j [ l_j log p_j + (1 - l_j) log(1 - p_j) ].
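A minimal sketch of the content-selection head and its loss (the parameter names and all shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
d, m = 8, 5
tokens = rng.normal(size=(m, d))     # encoder token embeddings
W_p, b_p = rng.normal(size=d), 0.0   # fully-connected layer parameters (assumed)

p = sigmoid(tokens @ W_p + b_p)      # content-selection probabilities
labels = np.array([1, 0, 1, 1, 0], dtype=float)  # gold content-selection labels

# Binary cross-entropy, averaged over tokens (eps guards against log(0))
eps = 1e-12
bce = -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
assert p.shape == (m,) and bce >= 0.0
```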

Content-Planning Module
Given the token embeddings [t_1, t_2, ..., t_m] and the content-selection probabilities [p_1, p_2, ..., p_m], we aim to generate a sequence of pointers P = [j_1, j_2, ..., j_c], where each pointer j_k is an index corresponding to the j_k-th token. We adapt Pointer Networks (Vinyals et al., 2015) to incorporate the content-selection probabilities into content-plan generation via a content-selection masking procedure as follows. We first transform the content-selection probabilities into a pseudo content-selection, a binary value (i.e., p̄_j = 1 if the corresponding probability score is > 0.5, and p̄_j = 0 otherwise) that indicates whether an attribute token is selected. Then, we use the Pointer Networks to produce a vector that modulates a content-based attention mechanism over the tokens at each step. Let [h_1, h_2, ..., h_c] be the sequence of hidden states of the networks. At each step k, we incorporate the pseudo content-selection to compute the pointer attention over all tokens:

u^k_j = v^⊤ tanh(W_t t_j + W_h h_k),    u^k = softmax(u^k ⊙ p̄).

Here, u^k is a probability distribution over the tokens, where u^k_j is the probability for token t_j, and v, W_t, W_h are parameters. We train the content-planner by minimizing a cross-entropy loss:

L_cp = -Σ_k Σ_j 1(j = j_k) log u^k_j,

where 1(·) is an indicator function that returns 1 if the proposition in its argument is true, and 0 otherwise. The overall objective function is the sum of the objective function of the binary classifier and the pointer generator's objective function.
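One pointer-decoding step with the content-selection mask can be sketched as follows (a minimal NumPy sketch; setting the scores of unselected tokens to a large negative value before the softmax is an assumed realization of the masking, and the parameter shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(4)
d, m = 8, 5
tokens = rng.normal(size=(m, d))   # encoder token embeddings
h_k = rng.normal(size=d)           # decoder hidden state at step k
v = rng.normal(size=d)             # pointer-attention parameters (assumed)
W_t = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))

# Pseudo content-selection: threshold the selection probabilities at 0.5.
probs = np.array([0.9, 0.2, 0.8, 0.7, 0.1])
mask = (probs > 0.5).astype(float)

# Additive pointer scores, then mask out unselected tokens before softmax.
scores = np.array([v @ np.tanh(W_t @ tokens[j] + W_h @ h_k) for j in range(m)])
scores = np.where(mask > 0, scores, -1e9)
u = softmax(scores)                # pointer distribution over tokens
assert np.isclose(u.sum(), 1.0)
assert u[1] < 1e-6 and u[4] < 1e-6  # masked tokens receive ~zero mass
```

At each step, the decoder would point to the token with the highest probability in `u`, so the content-plan is guaranteed to draw only from tokens the content-selector kept.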
We optimize the overall objective function with the backpropagation algorithm (Wang et al., 2018b, 2021a).

Experiments

Dataset
We evaluate our proposed model on three real-world datasets: WIKIALL (Trisedya et al., 2020), WIKIBIO (Lebret et al., 2016), and ROTOWIRE (Wiseman et al., 2017). The WIKIALL and WIKIBIO datasets contain attributes-description pairs. The description is a sentence describing an entity, extracted from the first sentence of the corresponding Wikipedia page; the attributes are a set of attributes that belong to the entity. The WIKIALL dataset obtains the attributes from Wikidata (Vrandecic and Krötzsch, 2014), while the WIKIBIO dataset obtains them from the Wikipedia infobox. The ROTOWIRE dataset contains pairs of data-records and NBA basketball game summaries, where each data-record is a table of statistics about an NBA game. The WIKIALL dataset contains 152,231 attributes-description pairs. It includes 53 entity types with an average of 15 attribute types (and up to 100 attribute tokens) per entity and an average of 20 tokens per description. The WIKIBIO dataset focuses on biographies (i.e., it contains only one entity type: PERSON). It contains 728,321 attributes-description pairs; its average number of attribute types per entity is 19 (with up to 100 attribute tokens), and its average number of tokens per description is 26. The ROTOWIRE dataset contains 4,900 record-summary pairs with 39 record types; its average number of attribute tokens per game record is 600, and its average number of tokens per summary is 337.
Our primary goal is to generate a content-plan from a set of attributes, and our proposed model includes content-selection learning to obtain a better content-plan. To train such a model, we need content-selection and content-plan labels for each attributes-description pair in all three datasets. For the WIKIALL and WIKIBIO datasets, we use the data pre-processed by Trisedya et al. (2020); for the ROTOWIRE dataset, we use the data pre-processed by Puduppully et al. (2019). The pre-processed data have content-plan labels but not content-selection labels. To obtain content-selection labels, we assign label 1 to input tokens that appear in the target content-plan and label 0 to the rest of the input tokens.
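The label-derivation rule above can be sketched as (token values follow the Table 1 example):

```python
def content_selection_labels(input_tokens, content_plan):
    """Label 1 for input tokens that appear in the target content-plan."""
    plan = set(content_plan)
    return [1 if tok in plan else 0 for tok in input_tokens]

tokens = ["Barack", "Hussein", "Obama", "Honolulu", "Hawaii"]
plan = ["Barack", "Obama", "Honolulu", "Hawaii"]
assert content_selection_labels(tokens, plan) == [1, 0, 1, 1, 1]
```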

Training Details
We implement our model in TensorFlow and train it on an NVIDIA GeForce RTX 2080 Ti. We use grid search to tune the hyperparameters, selecting the embedding size in [8, 128], the dropout rate in [0.1, 0.5], and the learning rate in [1e-2, 1e-4]. The best hyperparameter settings are as follows. We use 128 hidden units for the networks. We use 32, 16, and 8 dimensions for the word embeddings (attribute value tokens), type embeddings (attribute keys), and position embeddings, respectively. We use a 0.1 dropout rate and Adam (Kingma and Ba, 2015) with a learning rate of 1e-4. The memory cost to store the model is 2,475 MB, and the average running time for training and testing the model is 350 minutes on the WIKIALL dataset.

Tested Models
We compare our proposed model (GSC-attention) with the following models.
• Enc-Dec (Wiseman et al., 2017), which employs an encoder-decoder framework with LSTM in both the encoder and the decoder. It also uses conditional copy (Gulcehre et al., 2016) on the decoder side.
• NCP (Puduppully et al., 2019), which uses LSTM in the encoder and Pointer Networks in the decoder.
• Transformer, which is a direct adaptation of the Transformer model (Vaswani et al., 2017). For this model, we use the Transformer encoder and couple it with Pointer Networks to generate content-plans.
It is worth noting that we only take the content-planner part from the existing models (Enc-Dec and NCP), since the main goal of this paper is to generate a content-plan from a set of attributes. For the ablation tests, we run three variants of each compared model as follows.
• + Sorted. This variant sorts the attributes in the input set alphabetically. The intuition is that sorted input may be easier to learn.
• + CS loss. This variant jointly learns content-selection and content-planning but does not use the masking strategy described in Section 4.4.
• + CS mask. This variant uses the masking strategy described in Section 4.4.

Main Results
We use the following metrics, following Wiseman et al. (2017), to evaluate the models. To measure model performance on extracting salient attributes (i.e., content-selection), we use precision and recall. To measure how well a model orders the selected attributes (i.e., content-planning), we use the Damerau-Levenshtein Distance (DLD) between the generated content-plan and the gold standard. Table 2 shows the results. From these results, we see that our proposed GSC-attention achieves the best content-planning performance, indicated by the highest DLD score on all three datasets. We also see that the Transformer adaptation outperforms the two existing models, Enc-Dec and NCP. This is because both existing models employ LSTM on the encoder side, which is sub-optimal for encoding the attribute set. Our model further outperforms the direct Transformer adaptation because it captures both local and global relationships among the attributes, whereas the Transformer adaptation linearizes the input set, which omits the local relationships between tokens within the same attribute. In general, all models achieve higher DLD scores on the WIKIALL and WIKIBIO datasets but lower scores on the ROTOWIRE dataset. This is because the ROTOWIRE dataset has a larger (i.e., 600 records per game compared to 100 attributes per entity in WIKIALL and WIKIBIO) and more homogeneous (i.e., mainly numbers related to game statistics) input.
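For reference, a minimal implementation of the restricted Damerau-Levenshtein distance over token sequences; the reported DLD scores are presumably a normalized variant of this distance, so treat this as a sketch of the underlying metric:

```python
def dld(a, b):
    """Restricted Damerau-Levenshtein distance between two sequences."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

gold = ["Barack", "Obama", "Honolulu", "Hawaii"]
pred = ["Barack", "Obama", "Hawaii", "Honolulu"]  # one adjacent transposition
assert dld(gold, pred) == 1
assert dld(gold, gold) == 0
```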
Sorting the attributes in the input set (i.e., the + Sorted variant) gives a deterministic order to the model input. However, this strategy does not ensure that the encoder (especially the LSTM encoders) can capture the relationships among the attributes. The alphabetically ordered attributes may not reflect the correct attribute relationships. These results verify that capturing the relationships of the input set is non-trivial.
Ablation test results. In the ablation tests, we evaluate the effectiveness of our proposed content-selection integration, which we apply to all models. In general, the content-selection integration improves the performance of the content-planner, and all models benefit from it. The variants that jointly learn content-selection and content-planning (i.e., the + CS loss variants) gain 1 to 2 points in DLD compared with the respective models without the integration. The + CS mask variants further improve the content-planner's performance by 1 to 3 points in DLD. It is worth noting that the masking strategy substantially improves the precision and DLD score, but it may lower the recall. This is because the masking strategy narrows the output selection to the attributes selected by the content-selector. However, in content-plan generation, we argue that precision and DLD are more important than recall because we want to generate accurate plans. Take basketball summary generation as an example: the content-planner should make an accurate content-plan prediction to generate an accurate game summary.

Evaluation with Text Generation
The primary goal of this paper is content-plan generation, which is an intermediate step of text generation. In this experiment, we further evaluate the quality of the generated content-plans by using them for text generation. We train a text generation model using an encoder-decoder framework, with LSTMs in both the encoder and the decoder. We train the model on the gold-standard content-plan-target-sentence pairs of each dataset (cf. Section 5.1). For testing, we use the content-plans generated by the tested models as input and compute the BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) scores of the sentences generated by the text generation model.
For each model, we take the best variant (i.e., the + CS mask variant) to generate the content-plan. Table 3 shows the text generation results, which confirm that our proposed model achieves the best performance. The BLEU scores of the text generated from the content-plans produced by our model on WIKIALL, WIKIBIO, and ROTOWIRE are 64.59, 43.57, and 19.21, respectively. Note that the upper-bound performances are 67.21, 46.32, and 24.06 BLEU on the WIKIALL, WIKIBIO, and ROTOWIRE datasets, respectively.

Manual evaluations. We further perform manual evaluations of the generated text using three metrics: repetition, completeness, and coherence. Repetition measures whether there is repeated information (or tokens) in the generated text. Completeness measures whether any attribute is missing from the text. Coherence measures the correctness of the attribute order in the generated text.
We randomly select 100 input sets of the WIKIALL dataset along with the generated text. The manual evaluation is done by assigning a score from 1 to 3 for the generated text on each metric: score 3 for text with no error, score 2 for text with a single error, and score 1 for text with more than one error. Table 4 shows the manual evaluation results. From these results, we can see that exploiting the content-plan helps a text generation model produce better output. A text generation model that takes the original attribute set as input generates text with many repetition errors and missing information, i.e., it achieves low scores on the repetition and completeness metrics compared with the text generation models that receive a content-plan as input. This is because the generated content-plan has been filtered of unnecessary information. Among the compared content-planners, our proposed GSC-attention achieves the best score on all manual evaluation metrics. This result is consistent with the automatic evaluation, where our proposed model outperforms the competitors.

Conclusions and Future Works
We presented a model for generating a content-plan from a set of entity attributes. To capture the local and global contexts of the input set, we proposed a novel GSC-attention. This attention mechanism consists of three attention schemes, which combine intra-attention among the tokens of an attribute and inter-attention among attributes. Our content-planner further integrates a content-selection mechanism via a masking strategy. Experimental results on real-world datasets confirm the effectiveness of our model for generating content-plans.
Despite outperforming all competitors, our generated content-plans can be further improved, especially for large and homogeneous input sets (e.g., the ROTOWIRE dataset). We will explore further decoding strategies in future work. Another interesting direction is to design a text generation model that exploits the proposed content-planner.