Demoting the Lead Bias in News Summarization via Alternating Adversarial Learning

In news articles, lead bias is a common phenomenon that usually dominates the learning signals for neural extractive summarizers, severely limiting their performance on data with different or even no such bias. In this paper, we introduce a novel technique to demote lead bias and make the summarizer focus more on content semantics. Experiments on two news corpora with different degrees of lead bias show that our method can effectively demote the model's learned lead bias and improve its generality on out-of-distribution data, with little to no performance loss on in-distribution data.


Introduction
Neural extractive summarization, which produces a short summary for a document by selecting a set of representative sentences, has shown great potential in real-world applications, including news (Cheng and Lapata, 2016; Nallapati et al., 2017) and scientific paper summarization (Cohan et al., 2018; Xiao and Carenini, 2019). Typically, a general-purpose extractive summarizer learns to select the most important sentences from a document to form the summary by considering their content salience, informativeness and redundancy. However, when restricted to a specific domain, the summarizer can learn to exploit particular biases in the data, the most famous of which is the lead bias in news (Nenkova et al., 2011; Hong and Nenkova, 2014); namely, that sentences at the beginning of a news article are more likely to contain summary-worthy information. Not surprisingly, such bias is strongly captured by neural extractive summarizers for news, in which sentence positional information tends to dominate the actual content of the sentence in model prediction (Jung et al., 2019; Grenander et al., 2019; Zhong et al., 2019a,b).
While learning a summarizer that reflects the biases in the training dataset is perfectly fine when the summarizer will be deployed to summarize documents with similar biases, it becomes problematic when the model is applied to documents coming from a mixture of datasets with different degrees of such biases. In this paper, we address this problem in the context of lead bias in the news domain by exploring ways in which an extractive summarizer for news can be trained so that it learns to balance the lead bias with the content of the sentences, resulting in a model that can be applied more effectively when the target documents belong to news datasets in which lead bias is present to rather different degrees.
Recently, Grenander et al. (2019) proposed two preliminary solutions. One is to pretrain the summarizer on an automatically generated "unbiased" corpus in which the document sentences are randomly shuffled, which, however, has the negative effect of preventing the learning of inter-sentential information. The other, which can only be applied to RL-based summarizers, is to add an explicit auxiliary loss to directly balance position with content. Alternatively, Zhong et al. (2019b) and Wang et al. (2019) investigate strategies to train the summarizer on multiple news datasets with different degrees of lead bias, but this may still be problematic when the trained summarizer is applied to documents with a degree of lead bias not covered in the training data. Outside the summarization area, methods have also been proposed to eliminate data biases in other NLP tasks such as text classification and entailment (Kumar et al., 2019; Clark et al., 2019, 2020). Inspired by Kumar et al. (2019), we have developed an alternating adversarial learning technique that demotes the summarizer's lead bias while maintaining performance on in-distribution data. We introduce a position prediction component as an adversary and optimize it along with the neural extractive summarizer in an alternating manner. Furthermore, in contrast with Grenander et al. (2019) and Wang et al. (2019), our proposal is model-independent and only requires one type of news dataset as training input.
In this paper, we apply our proposed method to a biased transformer-based extractive summarizer (Vaswani et al., 2017) trained on the CNN/DM training set (Hermann et al., 2015) and conduct experiments on two test sets with different degrees of lead bias: CNN/DM and XSum (Narayan et al., 2018), for in-distribution and generality evaluation respectively. The experimental results indicate that our proposed "debiasing" method can effectively demote the lead bias learned by the neural news summarizer and improve its generalizability, while still largely maintaining the model's performance on data with a similar lead bias.

Proposed Debiasing Method
Our method aims to demote the lead bias learned by the summarizer and encourage it to select content based more on the semantics of the sentences. As shown in Figure 1, our method comprises two components: one for Summarization (red) and the other for sentence Position Prediction (green).

Summarization Component
Following previous work, we formulate extractive summarization as a sequence labeling task (Xiao and Carenini, 2019, 2020). For a document d = {s_1, s_2, ..., s_k}, each sentence is assigned a score α_i ∈ [0, 1], and the summary is then formed from the highest-scored sentences. We adopt a transformer-based model (Vaswani et al., 2017) as our basic "biased" summarization component (red in Fig. 1), as it has been shown to be heavily impacted by the lead bias (Zhong et al., 2019a). This component contains a transformer-based encoder Enc_θt and a multilayer perceptron (MLP) decoder Dec_θs, parameterized by θ_t and θ_s respectively. We use the averaged GloVe word embeddings of a sentence as its sentence embedding, as suggested in Kedzie et al. (2018). We optimize this summarization system by minimizing the loss L_1 = (1/k) Σ_{i=1..k} CE(α_i, y_i), where CE denotes the cross-entropy loss and y_i ∈ {0, 1} is the ground-truth label for sentence s_i, representing whether s_i is selected to form the summary.
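As a concrete illustration, the cross-entropy summarization loss over per-sentence scores can be sketched in plain Python (a minimal sketch with toy numbers; in the actual model the scores come from the transformer encoder and MLP decoder):

```python
import math

def summarization_loss(alphas, labels):
    # Mean binary cross-entropy between predicted per-sentence scores
    # alpha_i in (0, 1) and ground-truth extractive labels y_i in {0, 1}
    # (whether sentence s_i belongs to the oracle summary).
    assert len(alphas) == len(labels)
    ce = [-(y * math.log(a) + (1 - y) * math.log(1 - a))
          for a, y in zip(alphas, labels)]
    return sum(ce) / len(ce)

# A confident, mostly correct scorer incurs a small loss.
print(summarization_loss([0.9, 0.1, 0.8], [1, 0, 1]))
```

A scorer that is confident and wrong is penalized sharply, which is what drives sentence selection during training.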

Position Prediction Component
Our goal is to train the summarization model to make accurate predictions based more on sentence semantics than on whether the sentence is in the lead position. More specifically, we aim to design an encoder network Enc_θt that outputs a set of contextualized sentence representations H = {h_1, ..., h_k} covering less sentence positional information, so that the following decoder Dec_θs makes predictions that depend less on such information. To achieve this, the first step is to understand how much, and in what form, positional information is encoded in Enc_θt. Therefore, we propose a position prediction network that learns to predict the position of the sentences in a document based only on H. Intuitively, the higher the accuracy this component can achieve, the more positional information is contained in H. This position prediction component then plays the role of an adversary module to demote the influence of lead bias during the training of the summarization component. Concretely, because predicting the exact position of each sentence would require an extremely large set of labels with a skewed distribution, we choose to predict the portion of the document each sentence belongs to. In particular, once we obtain the set of contextualized sentence representations H from the encoder network Enc_θt, we initialize an MLP (parameterized by θ_p and followed by a Softmax) as the position prediction component Pos_θp (green in Fig. 1). In essence, Pos_θp takes H as input and outputs, for each sentence, an M-dimensional multinomial distribution representing its position in the document. More formally, Pos_θp(h_i) = (p_1, ..., p_M), where p_j is the predicted probability that the i-th sentence belongs to the j-th portion of the document when the document is divided into M parts (M is a tunable hyperparameter).
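The coarse position targets can be derived by bucketing each sentence's index into one of the M portions (a minimal sketch with M = 10 as in our experiments; the resulting label is the target class for the position predictor):

```python
def portion_label(i, k, M=10):
    # Map the i-th sentence (0-based) of a k-sentence document to one of
    # M coarse position classes: label j means the sentence falls in the
    # j-th portion when the document is split into M equal parts.
    assert 0 <= i < k
    return min(i * M // k, M - 1)

# In a 20-sentence document with M = 10, every 2 sentences share a label.
print([portion_label(i, 20) for i in range(20)])
```

Bucketing keeps the label set small and roughly balanced, avoiding the skewed distribution that exact sentence indices would produce.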
We use the cross-entropy loss to optimize Pos_θp to extract the sentence positional signals encoded in the system: L_2 = (1/k) Σ_{i=1..k} CE(Pos_θp(h_i), p_i), where p_i is the true position (portion) label of sentence s_i.

Alternating Adversarial Learning
To demote the influence of positional bias and balance it with sentence semantics in the summarization system, we want to modify the encoder so that it produces an H that remains accurate for summary generation but fails at sentence position prediction. We achieve this by alternately executing "Position learning" and "Position debiasing", as proposed in Kumar et al. (2019) and presented in Algorithm 1. In the "Position learning" phase, once a pretrained summarization system is obtained, we first fix its weights and train an adversary network Pos*_θp (sentence position predictor) to extract the positional information contained in the encoder. Then, in the "Position debiasing" phase, we fix the weights of Pos*_θp and update the parameters of the summarization component to maximize the position prediction loss of the adversary (L_adv in Eq. 3) while minimizing the summarization loss L_1: L_3 = L_1 − β · L_adv. To maximize the position prediction loss, the fixed adversary Pos*_θp should ideally output the uniform distribution U_M = (1/M, ..., 1/M) for the position prediction of each sentence. β is a trade-off parameter, tuned at the validation stage, that controls the degree of lead bias demoting.
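The objective of the debiasing phase can be sketched numerically (a minimal sketch; `adv_probs` is an illustrative stand-in for the frozen adversary's softmax outputs, and the uniform case shows the maximal adversary loss the encoder is pushed toward):

```python
import math

def debias_objective(summ_loss, adv_probs, pos_labels, beta=0.9):
    # L = L1 - beta * L_adv: minimize the summarization loss while
    # maximizing the (frozen) adversary's position cross-entropy.
    l_adv = -sum(math.log(p[t])
                 for p, t in zip(adv_probs, pos_labels)) / len(pos_labels)
    return summ_loss - beta * l_adv

# When the adversary is reduced to a uniform guess over M = 10 portions,
# its cross-entropy is log(10) regardless of the true labels.
uniform = [[0.1] * 10, [0.1] * 10]
print(debias_objective(0.5, uniform, [0, 7], beta=0.9))
```

Because L_adv enters with a negative sign, gradient descent on L_3 pushes the encoder toward representations on which the fixed adversary can do no better than a uniform guess.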
In practice, we notice that reusing the same adversary across all iterations does not weaken the positional signals but merely leads them to be encoded in a different way. To avoid this problem, we follow Kumar et al. (2019) and train a newly initialized adversary in each "Position learning" phase.

Datasets
We use the standard CNN/DM dataset (204,045 training, 11,332 validation and 11,334 test examples) (Hermann et al., 2015) for training, since it is one of the mainstream news datasets with an observed lead bias (Jung et al., 2019; Grenander et al., 2019). For model evaluation, we use the test set of CNN/DM to evaluate the model's in-distribution performance, as well as the test set of XSum (Narayan et al., 2018), which consists of 11,334 datapoints, to evaluate the model's generality when transferred to less biased data. The empirical analyses in Narayan et al. (2018) and Jung et al. (2019) show that the documents and summaries in XSum are shorter and exhibit less lead bias compared to CNN/DM.

Experimental Design
Baselines: We compare our proposal with various baselines (see Table 1). The top section of Table 1 presents the Lead baseline and the Oracle; for CNN/DM, the lead baseline is Lead-3, and for XSum it is Lead-1. The middle section of Table 1 contains the basic transformer-based summarizer, which accepts "sentence representation + position encoding" as input, and its two variants: one without positional encoding, and the other with only positional encoding as input. The bottom section contains Shuffling (Grenander et al., 2019), a recently proposed method for demoting lead bias in summarization, and Learned-Mixin (Clark et al., 2019), a general debiasing method for NLP tasks in which the type of data bias in the training set is known and a bias-only model is available. In our case, the data bias is the lead bias and the bias-only model is the transformer trained with only positional encoding as input. In Table 1, the best and second best performances over the basic transformer are in bold and underlined, respectively; ↑/↓ indicates results significantly higher/lower than the Basic Transformer, and ⇑/⇓ indicates results significantly higher/lower than Shuffling (p < 0.01 with the bootstrap resampling test (Lin, 2004)).

Implementation Details: All the transformer-based models use the same setting as the standard transformer (Vaswani et al., 2017), with 6 layers, 8 heads per layer, and d_model = 512. We use Adam to train all models with a scheduled learning rate with warm-up (initial learning rate lr = 2e-3). We choose the top-3 sentences to form the final summary for CNN/DM and the top-1 sentence for XSum, due to their different average summary lengths. The number of sentence position classes M is set to 10 and the trade-off parameter β is set to 0.9 (searched from 0 to 1 in steps of 0.1). We tune these hyper-parameters on a "balanced" validation set sampled from the standard CNN/DM validation data.


Results and Analysis
Table 1 reports the performance of the chosen baselines and our proposal on the CNN/DM test set, which has the same data distribution as the training data, and the XSum test set, which comes from another news source and carries much less lead bias than CNN/DM. From the middle section of Table 1, we observe that if we withhold the position cues (No Position Encoding) by using only semantic representations as input, the model's performance drops considerably on CNN/DM but increases remarkably on XSum. In contrast, if we merely use position cues as input (Only Position Encodings), the decrease in performance on CNN/DM is much more modest, while there is a substantial performance drop on XSum. These results confirm that the positional signal is an important feature for bias-reliant neural summarizers; however, relying too much on it limits the model's generality when applied to datasets with less bias than the training samples. Seeking strategies to balance semantic and position features is therefore crucial for neural extractive news summarization.

When we compare the lead bias demoting methods presented at the bottom of Table 1, our proposal and Shuffling give a significant performance boost on XSum, while Learned-Mixin decreases performance on both datasets. Comparing our method and Shuffling directly: while they are essentially equivalent at maintaining performance on the in-distribution CNN/DM data (a 0.09 difference in the average of ROUGE scores, ROUGE-Mean), our method provides a significant improvement on XSum, outperforming Shuffling and the basic transformer by 0.14 and 0.29 on ROUGE-Mean respectively. It is noteworthy that the transformer without position encoding achieves the best performance on XSum; however, it is the worst system on in-distribution data. Across all the comparisons, our proposal best balances sentence position and content semantics.
To investigate more deeply the behavior of our demoting method on documents whose summary sentences come from different portions of the document, we follow Grenander et al. (2019) and create three subsets, D_early, D_middle and D_late, from the CNN/DM test set. Documents are ranked by the mean of their summary sentences' indices in ascending order, and then the top-ranked 100 documents, the 100 documents closest to the median, and the bottom-ranked 100 documents are selected to form D_early, D_middle and D_late, respectively. Results in Table 2 show that even though our model does not match the basic transformer on documents in D_early, it yields significant improvements on both D_middle and D_late, while the competitive baseline Shuffling only achieves this on D_late.

Position of Selected Content: To investigate more explicitly how well the predictions of different models fit the ground-truth sentence selection (Oracle), we compare the relative position of the content selected by our method with that of the undebiased model (Basic Transformer) and the most competitive debiased model (Shuffling), as illustrated in Figure 2. We observe that: (1) CNN/DM contains much more lead bias than XSum, shown by a more right-skewed histogram for the Oracle; thus, the basic transformer trained on it is also heavily impacted by the lead bias and tends to select sentences with relative position in [0, 0.1] with much higher probability, even on the less biased XSum.
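The D_early/D_middle/D_late subset construction described above can be sketched as follows (a minimal sketch; `oracle_indices` is an illustrative input holding, for each test document, the indices of its oracle summary sentences):

```python
def split_by_summary_position(oracle_indices, n=100):
    # Rank documents by the mean index of their oracle summary sentences
    # (ascending), then take the top n (D_early), the n closest to the
    # median (D_middle), and the bottom n (D_late).
    ranked = sorted(range(len(oracle_indices)),
                    key=lambda d: sum(oracle_indices[d]) / len(oracle_indices[d]))
    mid = len(ranked) // 2
    early = ranked[:n]
    middle = ranked[mid - n // 2: mid - n // 2 + n]
    late = ranked[-n:]
    return early, middle, late

# With 5 toy documents and n = 1, the early/middle/late picks separate cleanly.
print(split_by_summary_position([[0], [5], [10], [2], [8]], n=1))
```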

Conclusion and Future Work
We propose a lead bias demoting method that makes neural extractive news summarizers more robust across datasets by optimizing a position prediction component and a summarization component in an alternating way. Experiments indicate that our method improves the model's generality on out-of-distribution data while still largely maintaining its performance on in-distribution data. As such, it represents the best viable solution when, at inference time, input documents may come from an unknown mixture of datasets with different degrees of position bias.
In the future, we plan to explore more sophisticated and effective methods (e.g., adjusting the lead bias online) and to integrate them with neural abstractive summarization models, which are known to generate more succinct and natural summaries. Another interesting direction for future work is to explore applying our proposed bias demoting strategy to other tasks that can also be framed as sequence labeling problems and may likewise be affected by biases in the training data (e.g., Topic Segmentation (Xing et al., 2020) and Semantic Role Labeling (Ouchi et al., 2018)).