LenAtten: An Effective Length Controlling Unit For Text Summarization

Fixed-length summarization aims at generating summaries with a preset number of words or characters. Most recent studies incorporate length information with word embeddings as the input to the recurrent decoding unit, forcing a compromise between length controllability and summary quality. In this work, we present an effective length controlling unit, Length Attention (LenAtten), to break this trade-off. Experimental results show that LenAtten not only brings improvements in length controllability and ROUGE scores but also has great generalization ability. In the task of generating a summary with a target length, our model achieves 732 times better length controllability than the best-performing length-controllable summarizer on the CNN/Daily Mail dataset.


Introduction
Automatic text summarization aims at generating a short and coherent summary from one or multiple documents while preserving the main ideas of the original documents. Building upon the conventional summarization task, fixed length text summarization (FLS) demands extra focus on controlling the length of output summaries. Specifically, it requires generating summaries with a preset number of characters or words.
FLS is a rising research topic required in many scenarios. For example, in order to get a universal user experience across multiple platforms and devices, titles and abstracts for news articles are expected to have different numbers of characters. Instead of manually rewriting summaries, FLS can automatically generate the required summaries by simply inputting the desired output length. Besides, FLS can help news editors reduce post-editing time (Makino et al., 2019) and further improve summary quality (Liu et al., 2018; Makino et al., 2019). Last but not least, as shown in Table 1, with FLS, users can get customizable summaries by setting different desired lengths.

Despite the benefits that could be brought, previous studies on FLS are very limited. Recent research in FLS applies length information to either (i) the decoder (Kikuchi et al., 2016; Liu et al., 2018; Takase and Okazaki, 2019) or (ii) the optimization objective function (Makino et al., 2019). Though these systems are promising, they have to make a compromise between length controllability and summary quality. Kikuchi et al. (2016) and Makino et al. (2019) can generate high-quality summaries, but perform inadequately at controlling length. Liu et al. (2018) and Takase and Okazaki (2019) control the output length accurately, but these models suffer from producing summaries with low ROUGE scores.

* Equal Contribution  † Corresponding author
1 Code is publicly available at: https://github.com/X-AISIG/LenAtten

Source document: egyptian president hosni mubarak arrived here friday morning to discuss the latest developments of iraqi crisis with his turkish counterpart suleyman demirel .
Reference summary: egyptian president to discuss iraqi crisis with turkish counterpart

Model            | Summary
PAULUS           | egyptian president arrives in ankara
PAULUS+LA2 (GT)  | mubarak arrives in ankara for talks on iraqi crisis with turkish pm
PAULUS+LA2 (30)  | egyptian president arrives in ankara
PAULUS+LA2 (50)  | egyptian president arrives in ankara for talks on iraq crisis
PAULUS+LA2 (70)  | mubarak arrives in ankara for talks on iraqi crisis with turkish president demirel

Table 1: Output examples from the proposed method Length Attention (LA) on the Annotated English Gigaword dataset. Numbers in the parentheses represent different desired lengths. (GT) means the desired length is equal to the number of characters in the reference summary. PAULUS (Paulus et al., 2018).
In this paper, we present an effective length controlling unit, Length Attention (LenAtten). With LenAtten, summarizers can generate high-quality summaries with a preset number of characters, breaking the trade-off between length controllability and summary quality.
Our contributions in this work are as follows: (1) A novel length controlling unit with great generalization capability is proposed to make summarizers generate high-quality summaries with a preset number of characters.
(2) Experimental results show that LenAtten breaks the trade-off between length controllability and summary quality. To our knowledge, the length controllability of the proposed method sets a new state of the art on the examined datasets.
Related Work

Derived from work in general text summarization, two approaches have been developed for FLS: (1) Incorporating length information into the decoder. LenInit, proposed in Kikuchi et al. (2016), introduces length information at the initialization stage of an LSTM decoder. Liu et al. (2018) follows a similar approach to LenInit but is based on a CNN sequence-to-sequence architecture. Other studies exploit length information at each decoding step. LenEmb, introduced in Kikuchi et al. (2016), generates a learnable embedding for each target length and uses it as an additional input to the decoder. Takase and Okazaki (2019) extended the Transformer's sinusoidal positional encoding (Vaswani et al., 2017) to make summarizers take the stepwise remaining length into account during prediction. (2) Incorporating length information into the optimization objective. GOLC (Makino et al., 2019) is a global optimization method that takes the desired output length into account during training.

Figure 1: Overview of LenAtten. Firstly, the decoder hidden state (blue) and the remaining length (yellow) are employed to compute the attention weights a^l. Then, the length context vector c^l_t (green) is produced by calculating the weighted sum between the attention weights and the pre-defined length embeddings (purple). Best viewed in color.

Our Approach: Length Attention
The motivation of LenAtten is to separate length information from the input of the recurrent decoding unit and to exploit proper length information based on the stepwise remaining length. As shown in Figure 1, at each decoding step, a length context vector is generated by calculating the weighted sum of a set of pre-defined embedding vectors l_j. Then, the length context vector is concatenated with the decoder hidden state and other attention vectors and fed into the word prediction layer (details are shown in §4.2), so that summarizers can take the remaining length into account. The length context vector c^l_t at the t-th decoding step is defined as follows:

$$e^l_t = V^l \tanh(W^l h^d_t + w^r r_t + b^l)$$
$$\alpha^l_{tj} = \frac{\exp(e^l_{tj})}{\sum_{k=1}^{\aleph} \exp(e^l_{tk})}$$
$$c^l_t = \sum_{j=1}^{\aleph} \alpha^l_{tj}\, l_j$$

where e^l_t ∈ R^{ℵ×1}, α^l_{tj} is the length attention score on the j-th length embedding at the t-th decoding step, h^d_t is the decoder hidden state, and V^l, W^l, w^r, b^l are learnable parameters. r_t is a scalar representing the remaining length at the current decoding step, and ℵ is a hyperparameter indicating the number of pre-defined length embeddings.
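This per-step computation can be sketched in a few lines of NumPy (parameter names and shapes here are illustrative assumptions, not the authors' released code):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def length_context(h_dec, r_t, L, V_l, W_l, w_r, b_l):
    """One LenAtten step.

    h_dec : (d,)    decoder hidden state h_t^d
    r_t   : scalar  remaining length at step t
    L     : (K, m)  K pre-defined length embeddings l_j (K plays the role of aleph)
    V_l, W_l, w_r, b_l : learnable parameters (shapes assumed for illustration)
    """
    # Attention logits over the K length embeddings, conditioned on the
    # decoder state and the remaining length.
    e = V_l @ np.tanh(W_l @ h_dec + w_r * r_t + b_l)  # (K,)
    alpha = softmax(e)                                # attention weights a^l
    # Length context vector: weighted sum of the length embeddings.
    return alpha @ L                                  # (m,)
```

The resulting vector is then concatenated with the decoder state and the other attention vectors before the word prediction layer.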
For length embeddings, we adopt the positional encoding proposed in Vaswani et al. (2017). We keep the embeddings fixed to remove the bias brought by the length distribution of the data. The j-th length embedding l_j is defined as follows:

$$l_j = PE(j)$$

where PE(·) is the positional encoding. At the t-th decoding step, the remaining length r_t is updated by subtracting the length of the previously generated token. For the first decoding step, r_1 is initialized with the desired output length. The following equation is used when t > 1:

$$r_t = r_{t-1} - L(y_{t-1})$$

where L(y_{t−1}) returns the number of characters in the output word y_{t−1}.
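A minimal sketch of the fixed length embeddings and the remaining-length bookkeeping (the dimension handling is an assumption; the text only specifies that the sinusoidal encoding of Vaswani et al. (2017) is used):

```python
import numpy as np

def positional_encoding(j, dim):
    """Fixed sinusoidal encoding PE(j), used here as the embedding for length j."""
    pe = np.zeros(dim)
    pos = np.arange(dim // 2)
    div = 10000.0 ** (2.0 * pos / dim)
    pe[0::2] = np.sin(j / div)   # even indices
    pe[1::2] = np.cos(j / div)   # odd indices
    return pe

def update_remaining(r_prev, last_token):
    """r_t = r_{t-1} - L(y_{t-1}): subtract the character length of the
    previously generated word from the remaining budget."""
    return r_prev - len(last_token)
```

Because the embeddings are deterministic functions of j, no length statistics from the training data leak into them.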

Experimental Settings
We evaluate LenAtten on the CNN/Daily Mail dataset (See et al., 2017) to compare it with previous studies. In addition, we test LenAtten with short articles and summaries on the Annotated English Gigaword dataset (Rush et al., 2015). By default, all models are trained with maximum likelihood estimation (MLE) on an NVIDIA TITAN RTX GPU. 2 For evaluation metrics, we adopt the standard F1 score of ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) to evaluate summary quality. For evaluating models' ability to control the output sequence length, we follow Makino et al. (2019) to compute (1) the character-level length variance Var between reference summaries and generated summaries and (2) the over-length ratio %over, which measures how many of the generated summaries are longer than their reference summaries. The length variance Var is computed as follows:

$$Var = \frac{1}{N} \sum_{i=1}^{N} 0.001 \times \left(\mathrm{len}(y_i) - \mathrm{len}(\hat{y}_i)\right)^2$$

where y_i is the reference summary, ŷ_i is the generated summary, and len(·) returns the number of characters in the given summary. For the FLS task, the length variance Var is expected to be zero, as that indicates the lengths of the output summaries exactly match the desired summary lengths.
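As a sketch, both length metrics can be computed as below (the 0.001 scaling factor inside Var follows Makino et al. (2019); treat it as an assumption if comparing against other implementations):

```python
def length_variance(refs, hyps):
    """Character-level length variance between reference and generated summaries."""
    n = len(refs)
    return sum(0.001 * (len(r) - len(h)) ** 2 for r, h in zip(refs, hyps)) / n

def over_length_ratio(refs, hyps):
    """Fraction of generated summaries longer than their references (%over)."""
    n = len(refs)
    return sum(1 for r, h in zip(refs, hyps) if len(h) > len(r)) / n
```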
2 Detailed model configurations are provided in the Appendix.

Methods to be compared
We compare the proposed Length Attention unit with the following methods: LEAD-3 extracts the first three sentences of the source article as the summary.
PG is the standard pointer-generator network proposed in See et al. (2017).
MASS (Song et al., 2019) is a sequence-to-sequence pre-trained model based on the Transformer.
LenAtten is also compared with length controllable summarization methods. For a fair comparison, we choose methods that also aim at generating summaries with a preset number of characters in a word-by-word manner.
LE is the LenEmb method proposed in Kikuchi et al. (2016).
GOLC is a global optimization method introduced in Makino et al. (2019).
We apply LenAtten to three summarization models: S2S (RNN-based Seq2Seq model) is a vanilla encoder-decoder summarizer. Specifically, we adopt a Bi-LSTM as the encoder and a unidirectional LSTM as the decoder. To integrate LenAtten, the length context vector c^l_t is added to the input of the word prediction layer to produce the vocabulary distribution P_vocab:

$$P_{vocab} = \mathrm{softmax}\left(W\, [h^d_t \,\|\, y_{t-1} \,\|\, C \,\|\, c^l_t] + b\right)$$

where W, b are learnable parameters, h^d_t is the decoder hidden state, y_{t−1} is the word embedding of the last generated token, and "||" is the vector concatenation operator. C is the last encoder hidden state, which is known as the fixed context vector.
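A sketch of this word prediction layer (the concatenation order and all shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def vocab_distribution(W, b, h_dec, y_prev, C, c_len):
    """P_vocab = softmax(W [h_t^d || y_{t-1} || C || c_t^l] + b): the length
    context vector is simply one more input to the word prediction layer."""
    features = np.concatenate([h_dec, y_prev, C, c_len])
    return softmax(W @ features + b)
```

Because c^l_t enters only here, the recurrent decoding unit itself never sees the length signal, which is the separation LenAtten is designed to achieve.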
PAULUS (copying mechanism) follows the design of Paulus et al. (2018), which incorporates two attention modules and the copying mechanism into a Seq2seq summarizer. To integrate LenAtten, the vocabulary distribution P_vocab is calculated using:

$$P_{vocab} = \mathrm{softmax}\left(W\, [h^d_t \,\|\, c^e_t \,\|\, c^d_t \,\|\, c^l_t] + b\right)$$

where c^e_t and c^d_t are the context vectors generated from the encoder and decoder attention units.
ATTENTION (attention-based model) is implemented by removing the copying mechanism from PAULUS. For the three models above, the ablation study removes the length context vector c^l_t and keeps all other components unchanged.

Experimental Results
Reference Summary Lengths In this experiment, we evaluate our model by comparing it with previous works. The desired length is set to the number of characters in the corresponding reference summary. Table 2 shows that LenAtten has superior length controllability and higher ROUGE scores on both datasets. Specifically, the length variance (Var) of LenAtten is 732 times better than that of the best-performing length controllable method, PG+LE(GOLC), on the CNN/DM dataset. Besides, adding LenAtten boosts ROUGE scores by 1-3 points. We believe the improvement in ROUGE scores comes from the introduction of the desired length information, which can be viewed as an inductive bias that helps summarizers prefer some outputs over others. Under the same context, by conditioning on the desired output length, summarizers may prefer candidate summaries whose lengths are similar to the desired length. Thus, summarizers can learn a better alignment with the reference summaries during training and output summaries with higher ROUGE scores at inference.
In addition, previous length controllable methods control the output lengths at the cost of damaging the ROUGE scores. The ROUGE scores of PG and PAULUS drop after adding LenEmb (i.e. PG + LE(MLE) and PAULUS + LE). In comparison, LenAtten not only performs better at reducing the length variance V ar but also significantly improves ROUGE scores. This suggests that LenAtten can break the trade-off between summary quality and length controllability.
After integrating with LenAtten, the %over ratio of summarizers rises. This suggests that more of the generated summaries ended up being longer than the references. We believe this is because when the remaining length is small (e.g. 4 characters) but not 0, instead of stopping the generation process, summarizers with LA tend to generate more tokens to meet the length requirement. Since summarizers output a word at each inference step, they may select a word that's longer than the remaining length. Thus, the generated summaries may end up being longer than the references.
Perplexity To figure out how LenAtten affects the performance of the language model, we examine the log-perplexity of models on the test sets. Perplexity is a commonly used metric for evaluating language models; a lower perplexity score indicates better language model performance. In this experiment, the desired length is set to the reference summary length. As shown in Table 3, after adding LenAtten, log-perplexity drops consistently on both datasets. This suggests that adding LenAtten can boost language model performance.

Various Preset Lengths

On the AEG dataset, few reference summaries are more than 100 characters; for the CNN/DM dataset, most reference summaries are 100-750 characters. Thus, the desired length is set to 30, 50, 75, 100, and 120 for the AEG dataset and to 100, 200, 400, 800, and 1600 for the CNN/DM dataset. We add the LenAtten unit to PAULUS and use the full reference summaries to compute ROUGE scores.

As shown in Table 4, on the AEG dataset, for frequently appearing lengths (30, 50, 75) and for exceptionally long lengths (100, 120), LenAtten demonstrates great length controllability along with good ROUGE scores. The same conclusions can be drawn on the CNN/DM dataset. This shows that LenAtten has great generalization ability under various desired lengths.
Exploring Hyperparameter ℵ In this experiment, we analyze how different values of ℵ (the number of pre-defined length embeddings) affect the performance of LenAtten on the AEG dataset. Desired lengths are set to the lengths of the reference summaries. Figure 2 shows that length controllability improves as ℵ increases, with no harm to the ROUGE-L scores.

Conclusions
In this paper, we present a novel length controlling unit, LenAtten, which helps summarization models generate high-quality summaries with a preset number of characters. On the examined datasets, LenAtten steadily outperforms length-controllable summarization baselines in terms of length controllability and demonstrates great generalization ability. LenAtten also breaks the trade-off between length controllability and summary quality. To our knowledge, in the task of generating summaries with target lengths, LenAtten is the new state of the art on the CNN/Daily Mail dataset.

A.2 Additional Experiments
Semantic Similarity The recall score of BERTScore (Zhang et al., 2019), another automatic evaluation metric, is used to measure the semantic similarity between system outputs and reference summaries. As shown in Figure 5, models with the Length Attention module (LA2) outperform the baselines (FREE) on both datasets.
Human Evaluation Correctness (CORR), completeness (COMP), and fluency (FLUE) of system outputs are assessed through two human evaluations. We randomly select 10 samples from each dataset. 30 skilled English speakers are presented with the original article and two summaries: one from a model without LenAtten, and the other from the same model plus LenAtten (e.g., S2S and S2S+LA2). The evaluation process is designed so that participants cannot tell the source of the presented summaries. Models without LenAtten generate summaries without length restriction, and models with LenAtten are required to output summaries with the desired lengths. In total, 467 responses were collected for the first experiment and 160 for the second.
In the first experiment (Table 5), participants are asked to choose a better one from two given summaries. The desired length is set as the length of the reference summary.
In the second experiment (Table 6), the desired length is set to (30, 50, 70) on the AEG dataset and (150, 250, 350) on the CNN/DM dataset. Participants rate each summary from 0 to 5. To guarantee the accuracy and credibility of the results, each article is presented only once to each participant. As shown in Table 5, models with LenAtten achieve better completeness and correctness scores on both datasets, along with small improvements in fluency. In the second experiment, Table 6 shows that (1) the completeness and correctness scores increase as the desired length increases. This trend is reasonable, since more information should be included in the final summary as the summary length gets longer. It also suggests that, as the desired length grows, models with LenAtten generate meaningful words instead of simply repeating one or two words. (2) When the desired length gets smaller (but not too small), LenAtten helps summarization models use concise words and phrases while maintaining summary quality.

A.3 Output Examples
Synonym substitution When examining the generated summaries, we find that adding LenAtten makes summarizers replace long/short words with synonyms to meet the length requirement. Examples are showcased in Table 7.