Generalized Supervised Attention for Text Generation

The attention-based encoder-decoder framework is widely used in many natural language generation tasks. The attention mechanism builds alignments between target words and source items that facilitate text generation. Previous work proposes supervised attention that uses human knowledge to guide the attention mechanism to learn better alignments. However, well-designed supervision built from ideal alignments can be costly or even infeasible. In this paper, we build a Generalized Supervised Attention method (GSA) based on quasi alignments, which specify candidate sets of alignments and are much easier to obtain than ideal alignments. We design a Summation Cross-Entropy (SCE) loss and a Supervised Multiple Attention (SMA) structure to accommodate quasi alignments. Experiments on three text generation tasks demonstrate that GSA improves generation performance and is robust against errors in attention supervision.


Introduction
The encoder-decoder framework has been applied to various natural language generation (NLG) tasks, such as neural machine translation (Cho et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Liu et al., 2016), text generation (Wiseman et al., 2017; Puduppully et al., 2019), text summarization (Liu and Lapata, 2019; Lin et al., 2018), image captioning (Anderson et al., 2018), dialogue systems (Liu et al., 2018), and so on. The attention mechanism (Bahdanau et al., 2015) plays a significant role in the framework: it automatically extracts the alignments between the target and the source for predicting the next target output. One disadvantage of vanilla attention mechanisms is that the automatic weights do not necessarily encode prior knowledge, such as the alignments between input and output (Jain and Wallace, 2019). To alleviate this problem, supervised attention was proposed (Liu et al., 2016; Mi et al., 2016; Kamigaito et al., 2017), which shows that human knowledge is helpful for guiding the learning process of attention models. Previous work on supervised attention assumes access to ideal alignments. Unfortunately, obtaining ideal alignments is infeasible or extremely costly for most NLG tasks. For example, in Figure 1, for the AMR-to-text generation task, given the AMR graph for the sentence "From among them, pick out 50 for submission to an assessment committee to assess.", the ideal alignment of the last word "assess" is node (10). As the names of nodes (8) and (10) are the same, it is not easy to pick out (10) exactly. On the other hand, it is much easier to obtain a candidate set containing both (8) and (10) and be rather confident that the ideal alignment is in the set. (* The work was done when Yixian Liu and Xinyu Zhang were students of ShanghaiTech University. Kewei Tu is the corresponding author.)
For different tasks, either EM-based algorithms (Brown et al., 1993; Pourdamghani et al., 2014) or rule-based methods (Flanigan et al., 2014) can be used to obtain such ambiguous alignments. However, little work has discussed making use of ambiguous labels for supervised attention.
We investigate generalized supervised attention (GSA), where the supervision signal aligns a target word to multiple possible source items (named a quasi alignment), although only a subset of the items are the true alignment targets. The multiple source items form the candidate set of the quasi alignment. A generalized supervised attention framework is built for various text generation tasks with alignment relationships between target words and source items. One challenge for generalized supervised attention is that the standard Cross-Entropy (CE) loss (Liu et al., 2016) can be limited because it is not suitable for quasi alignments. We design a new loss function named Summation Cross-Entropy (SCE) to replace the Cross-Entropy loss given a set of quasi alignments. SCE considers multiple candidates as a whole and is more robust against spurious candidates than traditional CE.

[Figure 1: Example of an AMR graph and the alignments between the graph and the target sentence "From among them, pick out 50 for submission to an assessment committee to assess." The ideal alignment of the target word "assess" points to the most related source node, (10). The quasi alignments, (8) and (10), point to the source nodes with the same name. The relevant alignments, (4), (10), (12), (14), point to more source items with a weak relation to "assess".]

Supervised attention and automatically learned attention can be complementary. In Figure 1, the relevant alignments (4), (10), (12), (14) are useful for predicting "assess", but such alignments cannot be captured by simple rules and may require human annotation in order to be used for attention supervision. It is therefore more practical to rely on automatic attention to uncover such alignments. To balance supervised attention and automatic attention, we design a Supervised Multiple Attention (SMA) module for GSA. In SMA, there are multiple attention channels with the same structure but different parameters. One of them is used for supervised attention and the others are used for purely automatic attention (named unsupervised attention below) that is not influenced by the attention supervision. SMA can be seen as an extension of the multi-head attention introduced in the Transformer (Vaswani et al., 2017).
We evaluate GSA on three real-world tasks: data-to-text generation (Koncel-Kedziorski et al., 2019), AMR-to-text generation (Mager et al., 2020), and text summarization (Yan et al., 2020). The results demonstrate that our method improves the performance in general. We also examine the robustness of our method against alignment errors. Our code will be released at https://github.com/LiuYixian/Supervised_attention.

Related Work
Previous work (Liu et al., 2016; Mi et al., 2016; Kamigaito et al., 2017) has studied supervised attention. That work is based on well-designed alignments; however, none of it has considered ambiguous labels, which are practically more common. We study quasi alignments as the attention supervision and design the Summation Cross-Entropy loss to deal with the ambiguity in quasi alignments.

Learning with ambiguous labels, in which the true label is not precisely annotated but lies in a candidate label set, has been widely studied. In cross-lingual part-of-speech tagging, annotations are derived for low-resource languages via cross-language projection, which results in partial or uncertain labels. To solve this problem, Täckström et al. (2013) proposed a partially observed conditional random field (CRF) (Lafferty et al., 2001) method, Wisniewski et al. (2014) built a history-based model, and Buys and Botha (2016) proposed an HMM-based model. SCE, in contrast, is designed for training attention weights using ambiguous labels. Xu et al. (2020) also study learning from ambiguous labels (called partial label learning) in classification tasks. Their method is based on constructing similar and dissimilar pairs of samples. However, supervised attention is not a traditional classification problem: the label spaces vary across samples, making it difficult to construct similar pairs. Thus, their method is not suitable for GSA.

Encoder-decoder Model with Attention
Encoder-decoder models, including RNN models (Cho et al., 2014; Luong et al., 2015) and the Transformer model (Vaswani et al., 2017), are used for a variety of NLP tasks. The encoder extracts information from the source data into a memory bank, and the decoder makes use of the memory bank to generate the target sentence. Let x = {x_1, ..., x_m} denote a sequence of source items, and y = {y_1, ..., y_n} a target sentence. The encoder converts the source data x into a memory bank H = {h_1, ..., h_m}, where each vector h_i represents the contextual embedding of x_i. At the t-th time step of the decoder, the model obtains the feature vector s_t; the meaning of s_t varies across decoders, e.g., for an RNN decoder it is the hidden state of the RNN at time t. The attention mechanism computes the contextual feature c_t of s_t over H, which is used to predict the next target word.
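As an illustration, a minimal dot-product attention step over a memory bank can be sketched as follows (NumPy; the plain dot-product scoring is an illustrative assumption, since the actual parameterization depends on the decoder):

```python
import numpy as np

def attention_context(s_t, H):
    """Minimal dot-product attention sketch: score each memory vector h_i
    against the decoder feature s_t, softmax-normalize into the attention
    distribution alpha_t, and mix H into a contextual feature c_t."""
    scores = H @ s_t                       # (m,) one score per source item
    weights = np.exp(scores - scores.max())
    alpha_t = weights / weights.sum()      # attention over x_1 .. x_m
    c_t = alpha_t @ H                      # (d,) contextual feature
    return alpha_t, c_t

H = np.random.randn(4, 8)    # memory bank for m = 4 source items, d = 8
s_t = np.random.randn(8)
alpha_t, c_t = attention_context(s_t, H)
assert np.isclose(alpha_t.sum(), 1.0) and c_t.shape == (8,)
```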
The objective function of generation is the negative log conditional likelihood loss:

L_gen(x, y) = -Σ_{t=1}^{n} log P(y_t | y_{<t}, x)    (1)

Supervised Attention (SA) with Cross-Entropy (CE) Loss
Supervised attention (SA) was first introduced by Liu et al. (2016) and Mi et al. (2016) for neural machine translation. They obtain attention supervision between source and target words with off-the-shelf aligners. SA is a multi-task learning approach, whose objective function is the sum of the loss of sequence generation (generation loss) and the disagreement between the attention distribution and the attention supervision (attention loss):

L(x, y) = L_gen(x, y) + λ Σ_{t=1}^{n} Δ(α_t, ᾱ_t)    (2)

where α_t is the computed attention distribution at step t, L_gen(x, y) is the generation loss in Eq. (1), and λ is a positive hyper-parameter that balances the two losses. ᾱ_t is the target attention distribution, and Δ measures the disagreement between the attention distribution and the target distribution. Liu et al. (2016) assume that every target word is aligned to at least one source word: if a target word is aligned to k source words, the corresponding elements of ᾱ_t are 1/k and the other elements are 0. They apply the Cross-Entropy loss as the attention loss function:

Δ_CE(α_t, ᾱ_t) = -Σ_{i=1}^{m} ᾱ_{t,i} log α_{t,i}    (3)
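A minimal sketch of this CE attention loss (NumPy; the 1/k target construction follows the description above, while the small epsilon is our own numerical-stability addition):

```python
import numpy as np

def ce_attention_loss(alpha_t, aligned_idx):
    """Cross-Entropy attention loss in the style of Liu et al. (2016):
    the target distribution puts mass 1/k on each of the k aligned
    source items, and the loss is -sum_i target_i * log(alpha_t_i)."""
    target = np.zeros_like(alpha_t)
    target[aligned_idx] = 1.0 / len(aligned_idx)
    return -np.sum(target * np.log(alpha_t + 1e-12))

alpha_t = np.array([0.5, 0.3, 0.1, 0.1])
print(ce_attention_loss(alpha_t, [0]))      # aligned to one source word
print(ce_attention_loss(alpha_t, [0, 1]))   # aligned to two source words
```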

Generalized Supervised Attention (GSA)
The overall architecture of GSA is shown in Figure 2. We first introduce quasi alignments, then the Summation Cross-Entropy loss function and the Supervised Multiple Attention structure.

Quasi Alignments
We consider quasi alignments as shown in Figure 1, in which a target word is allowed to be aligned to a candidate set of source items, although only a subset of the candidates are the true alignment targets. The supervision signal provided by the quasi alignments is ᾱ_t = {ᾱ_{t,1}, ..., ᾱ_{t,m}}, where ᾱ_{t,i} = 1 if x_i and y_t should be aligned with considerable probability, and ᾱ_{t,i} = 0 otherwise. If |ᾱ_t| = 1, ᾱ_t is a one-hot alignment vector and y_t is aligned to a single source item. If |ᾱ_t| > 1, ᾱ_t expresses a discrete uniform distribution over the candidate set. Such candidate sets usually include some irrelevant items that should not be aligned to y_t, but it is costly to pick out the correct subset from the candidates. Therefore, we retain all the candidates and expect the training process to determine the better alignments automatically. If |ᾱ_t| = 0, no aligned item is found for y_t.
In our experiments, we obtain the quasi alignments using simple rule-based methods, which differ across tasks, as discussed in Section 5.
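A toy sketch of such a rule-based aligner; exact (case-insensitive) string matching here is a stand-in for the task-specific rules, and the AMR node names with sense suffixes are illustrative assumptions:

```python
def quasi_alignments(target_words, source_items):
    """Align each target word y_t to every source item with the same
    surface form. Returns one 0/1 vector abar_t per target word;
    multiple 1s form the candidate set of a quasi alignment."""
    alignments = []
    for w in target_words:
        abar_t = [1 if w.lower() == s.lower() else 0 for s in source_items]
        alignments.append(abar_t)
    return alignments

# Toy AMR-style nodes; strip the sense suffix ("-01") before matching.
src = ["assess-01", "committee", "assess-01", "submit-01"]
src_names = [s.split("-")[0] for s in src]
print(quasi_alignments(["assess", "to"], src_names))
# "assess" matches both assess nodes: [[1, 0, 1, 0], [0, 0, 0, 0]]
```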

Summation Cross-Entropy (SCE) Loss
Quasi alignments form candidate sets containing the potential aligned source items but do not indicate the true ones among them. Intuitively, we want our attention loss to penalize attention probability outside the candidate set but allow an arbitrary attention distribution within the set. To this end, we design the SCE loss function, which maximizes the total attention probability in the candidate set:

Δ_SCE(α_t, ᾱ_t) = -log⟨α_t, ᾱ_t⟩ = -log Σ_{i=1}^{m} ᾱ_{t,i} α_{t,i}    (4)

where ⟨·, ·⟩ stands for the inner product. The SCE loss is the negative logarithm of the summed likelihood of all candidate items.

Theoretically, the SCE loss can be derived from a generative model. Assume that each target word should be aligned to exactly one true source item. For the t-th target word, we define a random variable z_t as the true aligned source item, with P(z_t = i) = α_{t,i}. Given z_t, we re-define the candidate set as

ᾱ_t = {ᾱ_{t,1}, ..., ᾱ_{t,m}}, where ᾱ_{t,z_t} = 1.    (5)

In this way, the candidate set contains z_t. Since z_t is a hidden variable, the likelihood of the candidate set can be defined as

P(ᾱ_t) = Σ_{i=1}^{m} P(ᾱ_t | z_t = i) P(z_t = i).    (6)

We assume it is a certain event that, given z_t, we obtain a candidate set containing the true alignment: P(ᾱ_t | z_t) is a distribution over candidate sets in which only the candidate set containing all the items identical to x_{z_t} has probability 1, and all other candidate sets have probability 0. Thus, P(ᾱ_t | z_t = i) = I(ᾱ_{t,i} = 1), and optimizing the SCE loss maximizes the likelihood of the candidate set.
Comparing Eq. (3) and Eq. (4), the CE loss optimizes the sum of the log-likelihoods of target alignments, while the SCE loss optimizes the log of the sum of the likelihoods of target alignments. If a target word is not aligned to any source item, the attention loss is 0. If it is aligned to only one source item, SCE reduces to CE (de Boer et al., 2005). If it is aligned to multiple items, the SCE loss penalizes the attention probability outside the candidate set and uniformly increases the attention probabilities within the set. Note that this behavior differs from that of CE, which encourages uniform attention over all the candidates in the set and hence produces different updates to the attention probabilities of different candidates during training.
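The contrast between the two losses can be seen in a small numerical sketch (NumPy; the attention values are made up, and the epsilon is only for numerical stability):

```python
import numpy as np

def sce_attention_loss(alpha_t, abar_t):
    """Summation Cross-Entropy sketch: negative log of the *summed*
    attention mass inside the candidate set, -log <alpha_t, abar_t>."""
    return -np.log(np.dot(alpha_t, abar_t) + 1e-12)

alpha_t = np.array([0.6, 0.2, 0.1, 0.1])
abar_t = np.array([1.0, 0.0, 1.0, 0.0])    # candidate set {x_1, x_3}

# SCE only cares about the total mass inside the set (0.6 + 0.1 = 0.7) ...
print(sce_attention_loss(alpha_t, abar_t))         # -log 0.7 ≈ 0.357
# ... while CE with a uniform target over the set also penalizes the skew:
target = abar_t / abar_t.sum()
print(-np.sum(target * np.log(alpha_t + 1e-12)))   # ≈ 1.407
```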

Multiple Attention (MA) and Supervised Multiple Attention (SMA)

The motivation of supervised attention is to incorporate prior knowledge of alignments between source and target items into the attention mechanism. One problem with supervised attention is that alignments are typically established between similar items, but ideally the decoder should also attend to other informative source items (Figure 1), which are not necessarily similar to the target word.
Besides, the automatic aligner may make errors and align the target word to irrelevant source items. Therefore, the unsupervised automatic attention mechanism is still a useful supplement to supervised attention.
We define a multiple-channel attention (MA) structure, which is closely related to multi-head attention (Vaswani et al., 2017). MA contains K attention channels with the same structure but different parameters, which work concurrently; their output contextual feature vectors are combined into one contextual feature vector:

c_t^k = Attention_k(s_t, H),  k = 1, ..., K    (9)

c_t = G(c_t^1, ..., c_t^K)    (10)

where G is a combination function. Multi-head attention (Vaswani et al., 2017) can be regarded as a special case of the MA structure: one head of multi-head attention is an attention channel in Eq. (9), and the contextual features are combined by concatenation followed by a linear projection. We do not use the standard multi-head attention because its structure is rigid and not suitable for all generation tasks.

To balance supervised attention and unsupervised attention, SMA has the same structure as MA, and we compute the attention loss of the first channel only, leaving the other channels unsupervised. The objective of the SMA model is

L(x, y) = L_gen(x, y) + λ Σ_{t=1}^{n} Δ_SCE(α_t^1, ᾱ_t)    (11)

where α_t^1 is the attention distribution of the first (supervised) channel. An existing generation model with attention can easily be modified with the SMA structure. If the original attention is one-headed, we add a new attention channel with the same structure parallel to the original one and compute the attention loss for the new channel; the contextual features of the two attention channels are averaged in Eq. (10). If the original attention is multi-headed, we keep the structure unchanged and compute the supervised attention loss for the first head.
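A schematic of SMA under a hypothetical bilinear attention parameterization (the W_k matrices are illustrative assumptions, since the actual attention form is that of each task's base model; averaging as the combination G and supervising only the first channel follow the description above):

```python
import numpy as np

def sma_forward(s_t, H, channel_params, abar_t, lam=0.1):
    """SMA sketch: K channels with the same structure but separate
    parameters W_k. Only channel 0 receives the SCE attention loss;
    the rest stay unsupervised. Contexts are averaged (G = mean)."""
    contexts, losses = [], []
    for k, W in enumerate(channel_params):
        scores = H @ (W @ s_t)               # bilinear scoring (assumed)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        contexts.append(alpha @ H)
        if k == 0 and abar_t.sum() > 0:      # supervise first channel only
            losses.append(-np.log(np.dot(alpha, abar_t) + 1e-12))
    c_t = np.mean(contexts, axis=0)
    att_loss = lam * sum(losses)
    return c_t, att_loss

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))
params = [rng.standard_normal((8, 8)) for _ in range(2)]  # K = 2 channels
c_t, att_loss = sma_forward(rng.standard_normal(8), H, params,
                            abar_t=np.array([1.0, 0.0, 1.0, 0.0]))
assert c_t.shape == (8,) and att_loss > 0
```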

Experiments
We apply GSA to three tasks: data-to-text generation, AMR-to-text generation, and text summarization. For each task, we choose one of the best published approaches as our basic model and modify it with GSA. In the three tasks, the relations between the source and the target are diverse. For text summarization, the source items contain more information than the target words. For data-to-text generation, the source items only contain key contents. For AMR-to-text generation, the source and the target contain the same information. We report the details of model structures and hyper-parameters in the appendix.

Data-to-text Generation
Task and Model: We consider the Abstract GENeration DAtaset (AGENDA) (Ammar et al., 2018), which contains pairs of a literature abstract and a knowledge graph extracted from the abstract. The nodes in the knowledge graphs are entity types, such as "Task" and "Method". The edges are the relations between different entities, including "COMPARE", "PART-OF", and so on. We use the training/development/test splits of 38,720/1,000/1,000, as in Ammar et al. (2018).
We use GraphWriter (Koncel-Kedziorski et al., 2019) on this task. The encoder of this model is a graph transformer, and the decoder is an RNN decoder with attention and a copying mechanism. More details are given in the appendix.
Aligner: The source items of this task include entities and relations, as shown in Figure 3. We use our string-matching aligner to extract the alignments from target words to the source entities, and extend the aligner to alignments of relations, e.g., aligning the target words "use" and "apply" to the source relation "USED-FOR". The details of the relation aligner are given in Appendix B. (GraphWriter code: https://github.com/rikdz/GraphWriter)

Experimental Settings: The unsupervised approach (UA) means the original method without supervision on attention. As the decoder applies multi-head attention, we design the SA approach, in which the attention distributions of all heads are averaged to compute the attention loss; in this way, the whole multi-head attention is treated as one supervised attention channel. The SMA approach is designed as in Section 4.3, in which only the first head is a supervised attention channel. In the SCE and CE approaches, we use the SCE and CE loss functions to supervise the attention, respectively.

AMR-to-text Generation
Task and Model: Abstract meaning representation (AMR) (Banarescu et al., 2013) is a semantic graph representation that is independent of the syntactic realization of a sentence. In the graph, nodes represent concepts and edges represent semantic relations between the concepts. AMR-to-text generation aims to generate sentences from AMR graphs. We use the AMR dataset LDC2015E86, which contains 16,833 training samples, 1,368 development samples, and 1,371 test samples. We use the model of Mager et al. (2020) on this task, which is a GPT-2 (Radford et al., 2019) model with fine-tuning.
Aligner: We apply lemma matching to build the attention supervision, as shown in Figure 1. There is a quasi alignment between a source item and a target word if they have the same lemma (we apply Pattern (Smedt and Daelemans, 2012) for lemmatization).
Experimental Settings: We experiment with 4 different approaches: UA, SA-CE, SA-SCE, SMA-SCE. We apply GSA to the multi-head attention in the last Transformer layer of the decoder.

Text Summarization
Task and Model: We use the CNN/DailyMail dataset (Nallapati et al., 2016). This dataset contains 287,226 training samples, 13,368 validation samples, and 11,490 test samples. We use ProphetNet (Yan et al., 2020) on this task, which builds a pre-training and fine-tuning method for text generation. Both the encoder and the decoder are Transformers. ProphetNet is pre-trained on a large-scale dataset (160GB).
Aligner: We obtain the quasi alignments with lemma matching as in Section 5.2. A target word is aligned to a source item if they have the same lemma. Some words, such as "is" and "do", appear very frequently and are likely to cause wrong alignments. We use the inverse document frequency (IDF) (Robertson, 2004) scores to downweight these words. More details about IDF applied here are shown in the appendix.
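A toy sketch of lemma matching with IDF downweighting; the hard-coded lemma table stands in for a real lemmatizer, and the +1 smoothing in the IDF denominator is our own illustrative assumption (the exact IDF formula is in Appendix C):

```python
import math

def idf(word, target_sentences):
    """Inverse document frequency over the training targets; +1 smoothing
    added here only to avoid division by zero in this toy example."""
    df = sum(1 for sent in target_sentences if word in sent)
    return math.log(len(target_sentences) / (df + 1))

# Hard-coded lemma table standing in for a real lemmatizer (an assumption).
LEMMA = {"is": "be", "picked": "pick", "picks": "pick"}
def lemma(w):
    return LEMMA.get(w, w)

def align_with_idf(target, source, target_sentences):
    """Quasi-align on matching lemmas and attach each target word's IDF,
    so frequent words like 'is' receive less supervision weight."""
    out = []
    for w in target:
        cand = [i for i, s in enumerate(source) if lemma(s) == lemma(w)]
        out.append((w, cand, idf(w, target_sentences)))
    return out

corpus = [["is", "a"], ["is", "b"], ["pick", "c"]]
for w, cand, weight in align_with_idf(["picked", "is"],
                                      ["picks", "is", "c"], corpus):
    print(w, cand, round(weight, 3))
```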
Experimental Settings: The model has a Transformer decoder. We set up the experiments similarly to Section 5.2. The basic model is that of Yan et al. (2020). We cannot fully reproduce their reported result (ROUGE-1/2/L of 44.2/21.17/41.30) by running their public model. Thus, we report the results of our own runs.

Main Results
The test results of GSA on the three tasks are shown in Table 1. For data-to-text generation, the basic model with unsupervised attention (UA) gives the baseline performance.

To analyze the variation of GSA's performance across tasks, we compute the alignment coverage (A.C.), multi-alignment coverage (M.C.), and average multi-alignment size (M.S.) of the different tasks (Table 2). A.C. is the percentage of target words with at least one alignment. M.C. is the percentage of target words with at least two alignments over all aligned target words. M.S. is the average number of alignments of target words with at least two alignments.

For data-to-text and AMR-to-text generation, SMA outperforms SA. On the other hand, SA performs better than SMA for text summarization. One possible reason is that the summarization dataset has much higher alignment coverage and multi-alignment coverage, and the alignment accuracy may also be higher; consequently, supervised attention works so well that automatic attention becomes unnecessary or even distracting.

Significance Test
To assess the evidence of significance, we perform significance tests on GSA. The p-value is calculated using the one-tailed sign test with bootstrap resampling on the test sets of all three tasks, following Chollampatt et al. (2019):

• For data-to-text, we compare the Rouge-L score of SMA-SCE to the result of SA-CE.
• For AMR-to-text, we compare the BLEU score of SMA-SCE to the result of SA-CE.
• For summarization, we compare the Rouge-L score of SA-SCE to the result of SA-CE.
The p-value results are shown in Table 4, which show that the improvements are significant.

SCE Analysis
Variance We compute the variance of attention probabilities in the candidate set for the text summarization task. For every generated token, we get the candidate set containing the input tokens with the same lemma as the generated token. If the candidate set contains more than one token, we compute the normalized variance and entropy of the attention scores in the candidate set. Normalized variance means that we divide every attention score by their sum and compute the variance of the normalized attention scores. Then we average the values of normalized variance and entropy over the test set. As shown in Table 3, the normalized attention variance of SCE is larger than that of CE, and the entropy of SCE is smaller. This implies that CE homogenizes the attention over the candidate set, while SCE concentrates the attention on certain tokens. It echoes Section 4.2: CE encourages uniform attention, while SCE fixes this issue.
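The two statistics can be sketched as follows (NumPy; the candidate indices and attention values are made up for illustration):

```python
import numpy as np

def set_statistics(alpha_t, candidate_idx):
    """Normalized variance and entropy of attention inside a candidate
    set: renormalize the set's attention scores to sum to 1, then
    compute their variance and entropy."""
    a = alpha_t[candidate_idx]
    p = a / a.sum()
    variance = p.var()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return variance, entropy

peaked = np.array([0.60, 0.05, 0.10])   # attention concentrated on one item
flat = np.array([0.25, 0.25, 0.25])
print(set_statistics(peaked, [0, 1, 2]))  # higher variance, lower entropy
print(set_statistics(flat, [0, 1, 2]))    # zero variance, maximal entropy
```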
Attention Accuracy We design an automatic evaluation method to investigate whether our SCE method can find the correct alignment within the quasi-alignment set in an unsupervised way. For a token of length greater than 5 in the generated result, if it is matched (by lemma matching) with more than one input token, we study the generated token and the candidate set containing the matched input tokens. Specifically, we consider the local context window of length 7 around the tokens in the candidate set. The correct alignment is defined as the input token whose context window shares the most tokens with the same window around the generated token. We find that the alignment selected by this automatic method almost always fits human judgment. An example is shown in Table 5.

[Table 5: Context windows for a generated occurrence of "atlanta". Generated: "wayne was in atlanta for a performance"; Matching 1: "early sunday in atlanta . no one"; Matching 2: "been made , atlanta police spokes woman"; Matching 3: "parking lot in atlanta ' s buckhead"; Matching 4: "wayne was in atlanta for a performance".]
The top-K accuracy indicates the rate that the attention score corresponding to the correct alignment is among the largest K scores. Figure 5 shows the top-K accuracy of CE and SCE for text summarization. We can find that our SA-SCE method gets higher top-K accuracy than SA-CE. That means our SCE method could find the correct alignment token and pay more attention to it without supervision.
Case Study An example of text summarization is shown in Figure 4. The figure displays a fragment of a test output sentence and the corresponding source fragment. The abscissas indicate the input text, and the ordinates denote the output summary. Figure 4(a) shows the attention of the CE approach, and 4(b) shows the attention of SCE. Both SCE and CE select the correct alignment in this example; however, the SCE approach puts a higher attention probability on the correct alignment. As shown in Table 2, most output words can be flexibly aligned to more than one source word. Consider the attention probabilities of the word "police" framed by green squares. For this output word, there are two similar input occurrences of "police" shown in the figure, with the first one being correct and the other one being incorrect. CE gives a probability of 0.07 to the correct alignment and 0.04 to the incorrect one. SCE gives a probability of 0.1 to the correct alignment and 0.03 to the incorrect one. As discussed in Section 4.2, the SCE loss reduces the effect of incorrect alignments in the candidate set, which promotes the true source word.

More Powerful Supervision
In the main experiments, we apply a simple and general-purpose string-matching aligner. For certain tasks, more powerful aligners are available. To study the impact of better aligners, we investigate the performance of GSA on AMR-to-text generation with the ISI aligner proposed by Pourdamghani et al. (2014), which is specially designed for the AMR-to-text task. The result is shown in Table 6. The improvement over the result in Table 1 shows that a more accurate aligner helps the supervised attention method for text generation. Besides, the result of SA-SCE is better than that of SA-CE, which shows that SCE also works well with a more accurate aligner.
We also analyze the alignments by the ISI aligner following the metrics of Table 2. The alignment coverage is 64.75%; the multi-alignment coverage is 52.17%; and the multi-alignment size is 3.11. It shows that the better aligner also produces ambiguous alignments. Therefore, SCE outperforms CE.

Robustness Analysis
We test the robustness of GSA by corrupting the attention supervision, changing correct alignments into incorrect ones. For every N target words with alignments, we change the alignments of one target word to a random, different source item. Then we test GSA on the data-to-text and AMR-to-text tasks with the corrupted attention supervision. N ranges over {2, 3, 5, 10, 20}, corresponding to error rates of {50%, 33%, 20%, 10%, 5%}, respectively.
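The corruption protocol can be sketched as follows (which aligned words get corrupted, here every N-th one, and the fixed random seed are illustrative assumptions):

```python
import random

def corrupt_alignments(alignments, N, num_source, seed=0):
    """Robustness-test sketch: for every N aligned target words,
    reassign one word's alignment to a random *different* source item,
    giving an error rate of 1/N."""
    rng = random.Random(seed)
    corrupted = [set(a) for a in alignments]
    aligned = [t for t, a in enumerate(alignments) if a]
    for t in aligned[::N]:                     # every N-th aligned word
        wrong = rng.choice([i for i in range(num_source)
                            if i not in alignments[t]])
        corrupted[t] = {wrong}
    return corrupted

align = [{0}, set(), {2, 3}, {1}, {4}]
print(corrupt_alignments(align, N=2, num_source=5))
```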
The results are shown in Table 7. For AMR-to-text generation, we test the SMA-SCE approach. We observe that the supervised attention approach with a 20% error rate is still better than UA; only with a 33% error rate does the supervised attention approach underperform the unsupervised attention. For data-to-text generation, we also test the SMA-SCE approach and observe that it is still better than UA even with a 33% error rate. These results demonstrate the robustness of our supervised attention. On the other hand, in both experiments, the performance almost always decreases with more errors, demonstrating the importance of correct supervision.
Although GSA is shown to be robust to alignment errors, an overly high error rate would prevent the attention mechanism from finding the true alignments and make the supervised attention approaches worse than UA. Thus, reducing the mistake error rate is the most important when designing the aligner. More analyses about errors in attention supervision are in the appendix.

Conclusion
We studied generalized supervised attention (GSA) for text generation tasks, considering quasi alignments instead of ideal alignments, which are much more difficult to obtain in practice. A Summation Cross-Entropy (SCE) loss function was designed to deal with quasi alignments, and a Supervised Multiple Attention (SMA) structure was used to balance supervised attention and unsupervised attention. Experiments on three generation tasks demonstrated that generalized supervised attention produces competitive results and is robust against errors in attention supervision.
A.2 AMR-to-text Generation

In the decoding process, the beam size is 15. The supervised attention weight for the lemma-matching aligner is 0.001, and for the ISI aligner it is 0.01. The weight is tuned on the validation set by the BLEU score.

A.3 Text Summarization
The baseline model is based on an encoder-decoder structure with Transformers. It has 12 block layers with a hidden size of 1024. The beam size for decoding is 5. The supervised attention weight is 0.1. The weight is tuned on the validation set by the BLEU score.

B Aligner for Relations in Graph-to-text Generation
There are seven different relations, as mentioned in the main paper. A relation can be expressed by different words in the target text; for example, "use" and "apply" both suggest the relation "USED-FOR". We build a corresponding keyword list (shown in Table 8) for each relation. In the source data, each relation is of the form "a-R-b", where "a" and "b" are two entities and "R" is the relation type. To find the alignments, we first look for the co-occurrence of "a" and "b" in the abstract with the shortest distance between them. Then we examine the words between "a" and "b", as well as the four preceding words, and align any word that appears in the corresponding keyword list to that relation, in both the forward and backward directions.

C IDF Score in Text Summarization
The IDF score of a word w is computed as:

IDF(w) = log ( M / Σ_{i=1}^{M} I(w ∈ X^{(i)}) )

where I(·) is the indicator function, M is the number of training samples, and X^{(i)} is the target sentence of the i-th sample in the training set.
In the training step, the IDF scores of target words are used to downweight the attention loss:

L_att(x, y) = Σ_{t=1}^{n} IDF(w_t) Δ_SCE(α_t, ᾱ_t)

where w_t is the t-th word in sentence y.
In the loss function, the attention loss of a target word is scaled by its IDF score. The IDF scores are only applied in this experiment because the alignments of these high-frequency words are rare in previous experiments.
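A minimal sketch of this scaling (the per-word SCE losses and the IDF values are made-up numbers for illustration):

```python
def idf_scaled_attention_loss(sce_losses, target_words, idf_scores):
    """Appendix C sketch: each target word's SCE attention loss is
    scaled by its IDF score before summing, so frequent words whose
    alignments are often spurious contribute less supervision."""
    return sum(idf_scores.get(w, 0.0) * l
               for w, l in zip(target_words, sce_losses))

idf_scores = {"police": 2.0, "is": 0.25}   # rare word vs frequent word
losses = [0.5, 0.5]                        # per-word SCE losses
print(idf_scaled_attention_loss(losses, ["police", "is"], idf_scores))
# 2.0*0.5 + 0.25*0.5 = 1.125
```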

D Alignment Error Analysis
The performance of supervised attention is influenced by the quality of the aligner. There are three types of alignment errors.
• Missing: a target word is not aligned to any source item.
• Redundancy: a target word is aligned not only to the correct source items but also to irrelevant items.
• Mistake: a target word is only aligned to some irrelevant items but not aligned to correct source items.
Missing errors reduce the alignment coverage over the target sentences. They decrease the number of target words that receive attention supervision but will not make supervised attention worse than the unsupervised baseline.
Redundancy errors, which are related to the candidate set from the flexible alignments, are handled by our SCE loss. In the worst case, a target word is aligned to all the source items, and the attention loss of this word becomes 0, resulting in no supervision. Thus, redundancy errors will not make supervised attention worse than the unsupervised baseline either. A case study is provided in the appendix showing that our method is not confused by the redundancy errors.
We empirically analyze mistake errors in section 4.5. Although supervised attention is shown to be robust to mistake errors, an overly high error rate would prevent the attention mechanism from finding the correct alignments and make the supervised attention approaches worse than the baseline. Thus, reducing the mistake error rate is the most important when designing the aligner.