Transformer-based Lexically Constrained Headline Generation

This paper explores a variant of automatic headline generation in which the generated headline is required to include a given phrase, such as a company or product name. Previous Transformer-based methods generate a headline including a given phrase by providing the encoder with additional information corresponding to that phrase. However, these methods cannot guarantee that the phrase appears in the generated headline. Inspired by previous RNN-based methods that generate token sequences in the backward and forward directions from the given phrase, we propose a simple Transformer-based method that is guaranteed to include the given phrase in a high-quality generated headline. We also consider a new headline generation strategy that takes advantage of the controllable generation order of the Transformer. Our experiments with the Japanese News Corpus demonstrate that our methods, which are guaranteed to include the phrase in the generated headline, achieve ROUGE scores comparable to previous Transformer-based methods. We also show that our generation strategy performs better than previous strategies.


Introduction
Following the initial work of Rush et al. (2015), abstractive headline generation using the encoder-decoder model has been studied extensively (Chopra et al., 2016; Nallapati et al., 2016; Paulus et al., 2018). In automatic headline generation for advertising articles, there is demand to include a given phrase, such as a company or product name, in the headline.
Generating a headline that includes a given phrase has been considered one of the lexically constrained sentence generation tasks. For these tasks, there are two major approaches. One approach is to select a plausible sentence including the given phrase from several candidate sentences generated from left to right (Hokamp and Liu, 2017; Anderson et al., 2017; Post and Vilar, 2018). Although these methods can include multiple phrases in a generated sentence, they are computationally expensive due to the large search space of the decoding process. In addition, since they try to force given phrases into sentences at every step of the generation process, these methods may harm the quality of the generated sentence (Liu et al., 2019).
Another approach, proposed by Mou et al. (2015), is to generate token sequences in the backward and forward directions from the given phrase. Mou et al. (2016) proposed Sequence to Backward and Forward Sequences (Seq2BF), which applies the method of Mou et al. (2015) to the sequence-to-sequence (seq2seq) framework. They use an RNN-based model and adopt the best strategy proposed by Mou et al. (2015): generating the backward sequence from the phrase and then generating the remaining forward sequence. Liu et al. (2019) introduced a Generative Adversarial Network (GAN) into the model of Mou et al. (2015) to resolve the exposure bias problem caused by generating the two sequences individually, and used the attention mechanism (Bahdanau et al., 2015) to improve the consistency between the sequences. However, their model does not support the seq2seq framework.
Recently, He et al. (2020) used a Transformer-based model (Vaswani et al., 2017), which is reported to achieve high performance, to generate a headline containing a given phrase. They proposed providing the encoder with additional information related to the given phrase. However, their method does not always include the given phrase in the generated headline.
In this study, we work on generating lexically constrained headlines using a Transformer-based Seq2BF. The RNN-based model used by Mou et al. (2016) executes a strategy of continuous generation in one direction, and thus cannot utilize the information of the forward sequence when generating the backward sequence. In contrast, the Transformer can execute a variety of generation strategies by devising attention masks, which overcomes this limitation of the RNN-based model. We propose a new strategy that generates tokens alternately in the backward and forward directions from the given phrase, in addition to adapting and extending the strategies of Mou et al. (2016) to the Transformer architecture.
Our experiments with a Japanese summarization corpus show that our proposed method always includes the given phrase in the generated headline and achieves performance comparable to previous Transformer-based methods. We also show that our proposed generation strategy performs better than the extended strategies of the previous methods.

Proposed Method
We propose the Transformer-based Seq2BF model, which applies the Seq2BF of Mou et al. (2016) to the Transformer model to generate headlines that include a given phrase. Seq2BF takes the given phrase W = (w_1, ..., w_L), also written w_{1:L}, consisting of L tokens, and generates the headline tokens y_{-M:-1} (M tokens backward from W) and y_{1:N} (N tokens forward from W). The Transformer-based Seq2BF is a Transformer model with two generation components, each consisting of a linear and a softmax layer (see Figure 1).
Unlike the standard Transformer, which generates tokens from left to right, the token positions in the Transformer-based Seq2BF change relative to the already generated tokens. For the token positions fed to the positional encoding layer of the decoder, we set the position of the ⌈(L+1)/2⌉-th (middle) token of W to 0, positions in the backward direction to negative values, and positions in the forward direction to positive values.
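The position scheme above can be sketched as follows. This is our reading of the paper's description, not reference code; the function name and signature are our own.

```python
def seq2bf_positions(L, M, N):
    """Token positions fed to the decoder's positional encoding layer.

    The middle token of the given phrase W (length L) gets position 0;
    backward tokens get negative positions, forward tokens positive ones.
    """
    mid = (L + 2) // 2  # ceil((L+1)/2): 1-indexed middle token of W
    phrase = [i - mid for i in range(1, L + 1)]          # e.g. L=3 -> [-1, 0, 1]
    backward = [phrase[0] - k for k in range(1, M + 1)]  # positions left of W
    forward = [phrase[-1] + k for k in range(1, N + 1)]  # positions right of W
    return backward[::-1] + phrase + forward             # left-to-right order
```

For instance, with a 3-token phrase and two tokens generated on each side, the final headline occupies positions −3 through 3, with the phrase centered at 0.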
We consider the following four generation strategies. In addition to the two strategies (a) and (b), which extend those proposed by Mou et al. (2016), we propose the new strategies (c) and (d), which perform step-wise alternating generation to keep better contextual consistency in both the backward and forward directions.

The Transformer-based Seq2BF is formulated as

P(Y | X, W) = ∏_{j=1}^{M+N} P(y_{POS_j} | Y_obs, X),

where X denotes the tokens of the article, W denotes the tokens of the given phrase, Y (= y_{-M:-1}, w_{1:L}, y_{1:N}) denotes the tokens of the final generated headline, and Y_obs denotes the already generated partial headline including W. POS_j denotes a list of token positions representing the order in which tokens are generated under each generation strategy (see Figure 1).

In Tok-B/F, where M and N differ, once the generation in one direction is completed, generation continues only in the remaining direction until M + N steps. For example, in the case of M > N in Tok-B, our method completes generating tokens in the forward direction first, so it generates tokens in both directions until step 2N, and then generates them only in the backward direction from step 2N + 1 to step M + N.

To train the model on these generation strategies, we prepare an attention mask for the decoder. The Transformer can control the generation order of tokens by devising the attention mask used in the decoder's self-attention mechanism. The standard Transformer generates tokens from left to right, so it suffices to disable attention to tokens forward of the input tokens. However, the Transformer-based Seq2BF needs to specify, for each generation strategy, the areas in both the backward and forward directions where input tokens disallow attention, depending on the generation strategy (see Figure 2).
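The four generation orders (POS_j) can be sketched as follows, using position 0 as a stand-in for the phrase and negative/positive integers for backward/forward headline positions. The strategy names follow the paper, but this enumeration is our own illustration under the assumption that Tok-B starts with a backward token and Tok-F with a forward one.

```python
def generation_order(strategy, M, N):
    """Order in which headline positions are generated around the phrase.

    M backward positions (-1 .. -M) and N forward positions (1 .. N).
    Seq-B/F generate one whole direction first; Tok-B/F alternate
    token by token, continuing in one direction once the other is done.
    """
    back = list(range(-1, -M - 1, -1))  # -1, -2, ..., -M
    fwd = list(range(1, N + 1))         # 1, 2, ..., N
    if strategy == "Seq-B":             # all backward, then all forward
        return back + fwd
    if strategy == "Seq-F":             # all forward, then all backward
        return fwd + back
    # Tok-B / Tok-F: step-wise alternation, starting backward / forward
    first, second = (back, fwd) if strategy == "Tok-B" else (fwd, back)
    order = []
    for a, b in zip(first, second):
        order += [a, b]
    shorter = min(len(first), len(second))
    longer = first if len(first) > len(second) else second
    return order + longer[shorter:]     # finish the remaining direction
```

With M = 3 and N = 2, Tok-B yields the order [-1, 1, -2, 2, -3]: both directions alternate for 2N = 4 steps, after which only backward tokens remain.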

Experiment
We conducted experiments to verify the performance of our methods on the headline generation task. The objective of our experiments is to compare our method with previous Transformer-based methods that generate tokens from left to right. We also compare Seq-B/F, the generation orders proposed by Mou et al. (2016), with Tok-B/F, our new generation orders.

Setting
We used the 2019 version of the Japanese News Corpus (JNC) (Hitomi et al., 2019; https://cl.asahi.com/api_data/jnc-jamul-en.html) as the dataset. The JNC contains 1,932,399 article-headline pairs, and we split them randomly at a ratio of 98:1:1 for use as training, validation, and test sets, respectively. (We applied the preprocessing script at https://github.com/asahi-research/script-for-transformer-based-seq2bf to the original JNC to obtain the split dataset.) For tokenization, we used MeCab (Kudo et al., 2004) with IPAdic and then applied the Byte Pair Encoding (BPE) algorithm (Gage, 1994). We trained BPE with 10,000 merge operations and obtained a vocabulary of the most frequent 32,000 tokens.
We used context word sequences extracted from the reference headlines by GiNZA as the 'given' phrases. An average of 4.99 phrases was extracted from each reference headline, and the 'given' phrases consisted of 2.32 tokens on average. We evaluated our methods using the precision, recall, and F-score of ROUGE-1/2/L (Lin, 2004), and the success rate (SR), i.e., the percentage of generated headlines that include the given phrase. We also calculated the Average Length Difference (ALD) to analyze the length of the generated headlines, defined as

ALD = (1/n) Σ_{i=1}^{n} (l_i − len_i),

where n, l_i, and len_i are the number of samples, the length of the i-th generated headline, and the length of the i-th reference headline, respectively.
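The SR and ALD metrics can be computed straightforwardly. The sketch below follows our reconstruction of the ALD formula (mean signed length difference); the function names are ours.

```python
def success_rate(headlines, phrases):
    """SR: fraction of generated headlines containing their given phrase."""
    hits = sum(phrase in headline for headline, phrase in zip(headlines, phrases))
    return hits / len(headlines)


def avg_length_difference(generated_lens, reference_lens):
    """ALD: mean signed difference between generated and reference lengths.

    Negative values indicate that generated headlines are shorter on
    average than the references.
    """
    n = len(generated_lens)
    return sum(l - ref for l, ref in zip(generated_lens, reference_lens)) / n
```

Because ALD keeps the sign of each difference, a model that systematically under-generates (as the two-direction models discussed below) shows up as a negative ALD rather than being masked by averaging absolute values.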
As a comparison method, we adopted the method proposed by He et al. (2020) with the vanilla Transformer instead of BART (Lewis et al., 2020). This method controls the output by inserting the given phrase and the special token '|' in front of the input article, and randomly drops the given phrase from the input article during training to improve performance. The hyperparameters of both the comparison models and our models were set as described in Vaswani et al. (2017). Training was terminated when the perplexity computed on the validation set did not improve three times in a row, and we used the model with the minimum perplexity on the validation set. The beam size during inference was set to three.

Table 1 shows the experimental results. Note that both the proposed and compared methods achieved higher ROUGE scores than the plain Transformer because we computed ROUGE scores between the reference and the system-generated headlines, which include the phrase extracted from the reference headlines.

Results
Our methods always include the given phrase in the generated headlines, whereas the comparison method had a success rate of around 90%. Although the recall of the ROUGE scores tended to be higher for the comparison method than for the proposed methods, the precision and F-scores of the proposed methods were comparable to or higher than those of the comparison method.

From the ALD, we found that the Transformer-based Seq2BF generated shorter headlines than the Transformer models. It has been confirmed that Transformer models with a single output direction tend to generate shorter headlines than the reference; because the Transformer-based Seq2BF has two output directions, the generated headlines are presumably even shorter. This explains why our methods had lower recall scores than the comparison methods. Comparing the generation strategies of the Transformer-based Seq2BF, Tok-B/F achieved higher scores than Seq-B/F.

To analyze how the four generation strategies of the Transformer-based Seq2BF affected the system-generated headlines, Figure 3 shows histograms of the character-level position of the given phrase in the headline. All generation strategies had distributions similar to those of the reference headlines, and hence the Transformer-based Seq2BF is presumed to have learned the position of a given phrase in the headline. Focusing on headlines that include the given phrase at the beginning, the difference between the reference and the headlines generated by Tok-B/F is smaller than that of the headlines generated by Seq-B/F. Also, the headlines generated by Seq-B tend to place the given phrase at the beginning, while this tendency is reversed for the headlines generated by Seq-F. Table 2 shows examples of the headlines generated by the Transformer-based Seq2BF (Tok-B).
When a product name such as "桜とイワシのパフェ" ("Cherry Blossom and Sardine Parfait") was given, our methods could generate a natural headline that includes the given phrase. Also, given the phrase "6月末" ("the End of June"), our methods generated a headline with the addition of "販売" ("on Sale"), which matches the given phrase. On the other hand, we found a problem of generating the same words related to the given phrase in both the backward and forward directions, as in the headline generated given "群れ" ("Schools"). In addition, given the phrase "約1万匹" ("About 10,000"), our methods generated a headline meaning that the special sweets contain about 10,000 sardines. In this way, we confirmed examples that were not faithful to the article.
Article: 約1万匹のイワシが群れで泳ぐ様子を見られる京都水族館。この展示に合わせ、ちょっぴり変わった特別スイーツが6月末まで販売される。名前は「桜といわしのパフェ」で、... (At the Kyoto Aquarium, you can see about 10,000 sardines swimming in schools. To coincide with this exhibition, a slightly unusual special sweet will be on sale until the end of June. It is called "Cherry Blossom and Sardine Parfait," and ...)

As can be seen from Table 2, various headlines are generated according to the given phrases. In general, it is difficult to control diversity in headline generation, but our methods can generate diverse headlines when given a variety of phrases. However, whether our methods truly generate diverse headlines is debatable, because all the examples are only partially diverse: they always include "特別スイーツ" ("Special Sweets") and "京都水族館" ("Kyoto Aquarium") as important content in the headline.

Conclusion
We proposed the Transformer-based Seq2BF, which generates a lexically constrained headline by devising the attention mask for the decoder and generating the backward and forward sequences from the given phrase. Our experiments using the JNC demonstrated that the Transformer-based Seq2BF always includes the given phrase in the generated headline and achieves performance comparable to previous Transformer-based methods. We also showed that strategies generating tokens alternately in the backward and forward directions are more effective than those generating a whole sequence in one direction and then a sequence in the other.
In future work, we will investigate whether the Transformer-based Seq2BF can generate natural headlines even when given a variety of phrases, such as phrases that appear in neither the reference nor the article, and quantitatively examine whether our methods can creatively generate diverse headlines given a variety of phrases. We will also explore methods for generating headlines that include multiple phrases.