Deep Differential Amplifier for Extractive Summarization

For sentence-level extractive summarization, there is a disproportionate ratio of selected to unselected sentences, which flattens the summary features when accuracy is maximized. This class imbalance is inherent to summarization and cannot easily be addressed by common algorithms. In this paper, we conceptualize single-document extractive summarization as a rebalancing problem and present a deep differential amplifier framework. Specifically, we first calculate and amplify the semantic difference between each sentence and all other sentences, and then apply a residual unit as the second term of the differential amplifier to deepen the architecture. Finally, to compensate for the imbalance, the objective loss of the minority class is boosted by a weighted cross-entropy. In contrast to previous approaches, this model pays more attention to the pivotal information of each sentence, instead of modeling all informative context with a recurrent or Transformer architecture. We demonstrate experimentally on two benchmark datasets that our summarizer performs competitively against state-of-the-art methods. Our source code will be available on GitHub.


Introduction
Single-document extractive summarization forms a summary by copying and concatenating the most important spans (usually sentences) of a document. Sentence-level summarization is a very challenging task, because it arguably requires an in-depth understanding of the source document, and current automatic solutions are still far from human performance. Recent approaches frame the task as a sequence labeling problem, taking advantage of the success of neural network architectures. However, there are still two inherent obstacles for sentence-level extractive summarization: 1) It is detrimental to keep tangential information (West et al., 2019). The intuitive limitation of existing approaches is that they prefer to model and retain all informative content from the source document. This goes against the fundamental goal of summarization, which crucially needs to forget all but the "pivotal" information. Recently, the Information Bottleneck principle (Tishby et al., 2000; West et al., 2019) has been introduced to incorporate a tradeoff between information selection and pruning. A length penalty and a topic loss (Baziotis et al., 2019) have been used in autoencoding systems to augment the reconstruction loss. However, these methods require external variables or augmentative terms, and do not enhance the representation of pivotal information.
2) Imbalanced classes inherently result in models with poor predictive performance, specifically for the minority class. The distribution of examples across the classes can vary from a slight bias to a severe imbalance, with one example in the minority class for dozens in the majority class. For instance, only 7.33% of the sentences in the popular summarization dataset CNN/DM (Hermann et al., 2015) are labeled "1" and the rest are labeled "0", the label indicating whether a sentence should be selected for the summary. Conversely, most machine learning algorithms for classification are designed and demonstrated on problems that assume an equal distribution of classes. This means that a naive application of a model may focus only on learning the characteristics of the abundant observations, neglecting the examples from the minority class. Furthermore, as shown in Figure 1, the ROUGE score gradually declines as the number of sentences grows, since the valuable summary sentences are generally a tiny minority (typically 1-4 sentences), and more and more majority sentences swamp the minority ones. Unfortunately, the imbalance in summarization is inherent and cannot be addressed by common data augmentation (He and Ma, 2013; Asai and Hajishirzi, 2020; Min et al., 2020; Zoph et al., 2019; Xie et al., 2020), because adding or deleting entire documents has almost no influence on the 0/1 distribution.
These two obstacles are interrelated and interact with each other. Highlighting the pivotal information strengthens the unique semantics and weakens the common informative content. Additionally, a more balanced distribution would make the minority class more attractive. If we cannot resolve the class imbalance problem in extractive summarization by data augmentation, how can we make the minority class more attractive? Inspired by the differential amplifier of analog electronics (https://en.wikipedia.org/wiki/Differential_amplifier), we propose a heuristic model, DifferSum, shorthand for Differential Amplifier for Extractive Summarization, to enhance the representation of summary sentences. Specifically, we calculate and amplify the semantic difference between each sentence and the other sentences by a subtraction operation. The original differential amplifier consists of two terms, and the second term prevents the final output from being zero. In our model, we use a residual unit instead of the second term to make the architecture deeper. We further design a more appropriate objective function that avoids biasing toward the data, by making the loss of the minority class much greater than that of the majority. DifferSum shows superiority over other extractive methods in two aspects: 1) enhancing the representation of pivotal information and 2) compensating the minority class while penalizing the majority. Experimental results validate the effectiveness of DifferSum. A human evaluation also shows that our model is better in relevance compared with others. Our contributions in this work are as follows: • We propose a novel conceptualization of extractive summarization as a rebalancing problem.
• We introduce a heuristic approach that calculates and amplifies the semantic representation of pivotal information by integrating both the differential amplifier and residual learning.
• Our proposed framework achieves superior performance compared with strong baselines.

Related Work

Several recent approaches focus on the extract-then-compress learning paradigm, which first trains an extractor for content selection. Zhong et al. (2020) introduce an extract-then-match framework, which employs BERTSUMEXT (Liu and Lapata, 2019) as a first stage to prune unnecessary information. However, these extractive approaches prefer to model all informative context of the source and pay little attention to the imbalance problem.

Deep Residual Learning
The original deep residual learning was introduced in image recognition (He et al., 2016a) to address the notorious degradation problem. Residual connections were then brought to natural language processing by the Transformer (Vaswani et al., 2017). Essentially, when building a deep network we cannot determine its optimal depth in advance: some layers are necessary, while the layers beyond the optimal depth are redundant, and we would like these redundant layers to map their input directly to their output, namely an identity mapping (He et al., 2016a,b; Veit et al., 2016; Balduzzi et al., 2018). ResNet (He et al., 2016a) addresses the degradation problem by introducing a deep residual learning framework.
If an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers (Huang and Wang, 2017). In this paper, the residual unit serves as the second term of the differential amplifier, keeping our architecture deep enough to capture pivotal information.

Problem Definition
We model the sentence extraction task as a sequence tagging problem (Kedzie et al., 2018).
Let D = (s_1, s_2, ..., s_M) denote a document with M sentences, and let h_i and h_ij denote the embeddings of sentence s_i and of its j-th word in a continuous space. The extractive summarizer aims to produce a summary S by selecting m sentences from D (where m ≤ M). For each sentence s_i ∈ D, there is a ground-truth label y_i ∈ {0, 1}, and we predict a label ŷ_i ∈ {0, 1}, where 1 means that s_i should be included in the summary. We assign a score p(ŷ_i | s_i, D, θ) to quantify s_i's relevance to the summary, where θ denotes the parameters of the neural network. Finally, we assemble the summary S by selecting the m sentences with the highest probability p(1 | s_i, D, θ).
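For concreteness, here is a minimal sketch of this final selection step in PyTorch (function and variable names are ours, not part of the original model):

```python
# Minimal sketch: assemble a summary from the m highest-scoring sentences
# under p(1 | s_i, D, theta), keeping document order in the output.
import torch

def assemble_summary(scores: torch.Tensor, sentences: list[str], m: int = 3) -> list[str]:
    """scores[i] holds p(1 | s_i, D, theta) for the i-th sentence."""
    top = torch.topk(scores, k=min(m, len(sentences))).indices
    return [sentences[i] for i in sorted(top.tolist())]
```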

Sentence Encoder
The sentence encoder in extractive summarization models is usually a recurrent neural network with Long Short-Term Memory (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units. In this paper, our sentence encoder builds on the BERT architecture (Devlin et al., 2019), a recently proposed, highly efficient model which is based on the deep bidirectional Transformer (Vaswani et al., 2017) and has achieved state-of-the-art performance in many NLP tasks. The Transformer aims at removing the fundamental constraint of sequential computation that underlies most recurrent architectures. It eliminates recurrence in favor of a self-attention mechanism which directly models relationships between all words in a sentence. Our extractive model is composed of a sentence-level Transformer (T_S) and a document-level Transformer (T_D). For each sentence s_i in the input document, T_S is applied to obtain a contextual representation for each word:

$$u_{i1}, \dots, u_{in} = T_S(h_{i1}, \dots, h_{in}) \quad (1)$$

The representation of a sentence is then acquired by applying weighted pooling with learned attention weights α_ij over its words:

$$s_i = \sum_{j=1}^{n} \alpha_{ij} u_{ij} \quad (2)$$

The document-level Transformer T_D takes the pooled sentence representations as input and yields a contextual representation v_i for each sentence:

$$v_1, \dots, v_M = T_D(s_1, \dots, s_M) \quad (3)$$
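A minimal PyTorch sketch of how Eqs. (1)-(3) might be realized; the layer sizes and the linear scorer producing the pooling weights are our assumptions, not details from the paper:

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sentence-level Transformer T_S + weighted pooling + document-level T_D."""

    def __init__(self, d_model: int = 768, nhead: int = 8, nlayers: int = 2):
        super().__init__()
        self.t_s = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), nlayers)
        self.t_d = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), nlayers)
        self.scorer = nn.Linear(d_model, 1)  # produces the pooling weights

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (M, n, d_model) word embeddings h_ij for M sentences.
        u = self.t_s(words)                            # Eq. (1): word contexts
        alpha = torch.softmax(self.scorer(u), dim=1)   # attention over words
        s = (alpha * u).sum(dim=1)                     # Eq. (2): weighted pooling
        return self.t_d(s.unsqueeze(0)).squeeze(0)     # Eq. (3): v_1, ..., v_M
```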

Deep Differential Amplifier
In the Transformer model sketched above, inter-sentence relations are modeled by multi-head attention based on softmax functions, which only captures shallow structural information.
A differential amplifier is a type of electronic amplifier that amplifies the difference between two input voltages but suppresses any voltage common to the two inputs. The output of an ideal differential amplifier is given by:

$$V_{out} = A_d (V^+_{in} - V^-_{in})$$

where V^+_in and V^-_in are the input voltages and A_d is the differential-mode gain.
In practice, the gain is not exactly equal for the two inputs, and even when V^+_in and V^-_in are equal, the output V_out is not exactly zero. Modern differential amplifiers are therefore usually described by a more realistic expression which includes a second term:

$$V_{out} = A_d (V^+_{in} - V^-_{in}) + A_c \frac{V^+_{in} + V^-_{in}}{2}$$

where A_c is called the common-mode gain of the amplifier. Inspired by the differential amplifier above, we calculate and amplify the semantic difference between each sentence and the other sentences by subtraction over the sentence representations; the analogues of V^+_in and V^-_in are calculated as follows:

$$V^+_{in} = v_i, \qquad V^-_{in} = \frac{1}{M-1} \sum_{j \neq i} v_j$$

The original differential amplifier consists of two terms, and the second one prevents the final output from being zero. For a deep neural network, however: 1) the inputs of the differential amplifier are vectors in a high-dimensional space, for which an exactly zero output is practically impossible, unlike scalars; 2) the second term of the differential amplifier is not suitable for a deep iterative architecture, since it is exposed to the degradation problem.
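As a tiny numeric illustration of the two equations (the values are arbitrary), the difference between the inputs is amplified by A_d, while their common level is only scaled by the much smaller A_c:

```python
# Differential vs. common-mode gain on toy values.
A_d, A_c = 100.0, 0.1
v_plus, v_minus = 1.02, 1.00
ideal = A_d * (v_plus - v_minus)                    # 2.0: difference amplified
realistic = ideal + A_c * (v_plus + v_minus) / 2    # 2.101: tiny common-mode term
print(ideal, realistic)
```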
Notably, residual learning was introduced in deep learning as shortcut connections that skip one or more layers, which makes it a natural alternative to the second term of the differential amplifier. The advantages of this choice are: 1) the residual architecture highlights the pivotal information while preserving the original sentence representation; 2) it is easier to optimize the residual mapping than to optimize the original one (He et al., 2016a). Hence, the residual unit is employed as the second term, along with an iterative refinement algorithm to enhance the final representation of sentences.

Residual Representation for Sentence
The differential amplifier in our architecture consists of a few stacked layers that iteratively refine the pivotal representation. Let us consider H(x) as an underlying mapping to be fit, with x denoting the input to the first of these layers. Since multiple nonlinear layers can asymptotically approximate complicated functions (He et al., 2016a; Montúfar et al., 2014), the differential amplifier mapping H(x) is recast into a residual mapping F(x) plus an identity mapping x:

$$H(x) = F(x) + x$$

Obviously, residual learning is just a variant of the differential amplifier: the output voltage V_out becomes the original mapping H(x), the first term of the amplifier becomes the residual mapping F(x), and in our model the second term of the differential amplifier is replaced by the identity mapping x, which is the shortcut connection whose output is added to the output of F(x). Furthermore, 1) the identity shortcut connections advance the architecture without extra parameters; 2) the identity shortcut does not add computational complexity (He et al., 2016a). Thus, for a sentence representation v_i, the deep differential amplifier is:

$$\hat{v}_i = F(v_i) + v_i$$

where F computes the amplified difference between v_i and the representations of the other sentences.
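A minimal sketch of one such layer, under our assumption that F is a small feed-forward network applied to the difference between a sentence and the mean of all other sentences:

```python
import torch
import torch.nn as nn

class DifferentialAmplifier(nn.Module):
    """One amplifier layer: F(v_i - mean of the others) plus identity shortcut."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (M, d_model) sentence representations from the encoder.
        m = v.size(0)
        total = v.sum(dim=0, keepdim=True)
        others_mean = (total - v) / max(m - 1, 1)  # V-_in: mean of the others
        return self.f(v - others_mean) + v         # F(v_i) + identity shortcut
```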

Iterative Structure Refinement
The differential amplifier and residual unit specialize in modeling the pivotal information, while deeper neural networks with more parameters are able to infer semantics more accurately. Therefore, an iterative refinement algorithm is introduced to enhance the final representation of pivotal information. For a sentence representation v_i, the fundamental iterative unit is:

$$v_i^{(k)} = F\big(v_i^{(k-1)}\big) + v_i^{(k-1)}, \qquad k = 1, \dots, K, \qquad v_i^{(0)} = v_i$$

where we iteratively refine the representation v_i for K iterations; thanks to the built-in residual mechanism, training relies mostly on the shorter paths, since longer paths contribute hardly any gradient. Under supervision, each iteration pays more attention to the key semantic difference F(v_i) of sentences with label 1, while pushing F(v_j) toward zero for the other sentences. Conversely, previous extractive approaches without a differential amplifier can only separate those sentences by compensating or penalizing v_i / v_j directly, which is more difficult to model.
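Putting the pieces together, a sketch of the K-step refinement loop, reusing the DifferentialAmplifier sketch above (the per-iteration scoring head anticipates Equation 4 below; the names are ours):

```python
import torch
import torch.nn as nn

class DifferSumHead(nn.Module):
    """K refinement iterations, each followed by a sigmoid scorer (Eq. 4)."""

    def __init__(self, d_model: int = 768, k: int = 4):
        super().__init__()
        self.k = k
        self.amplifier = DifferentialAmplifier(d_model)
        self.out = nn.Linear(d_model, 1)

    def forward(self, v: torch.Tensor) -> list[torch.Tensor]:
        probs = []
        for _ in range(self.k):          # v^(k) = F(v^(k-1)) + v^(k-1)
            v = self.amplifier(v)
            probs.append(torch.sigmoid(self.out(v)).squeeze(-1))
        return probs                     # one score vector r^(k) per iteration
```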
Following previous work (Nallapati et al., 2017), we use a sigmoid function after a linear transformation to calculate the probability r_i of selecting s_i as a summary sentence:

$$r_i^{(k)} = \sigma\big(W_o v_i^{(k)} + b_o\big) \quad (4)$$

Weighted Objective Function
To rebalance the bias between the minority 1-class and the majority 0-class, we have built a deep differential amplifier to amplify and capture the information unique to summary sentences. In addition, another heuristic is to make the model pay more attention to the 1-class: a weighted cross-entropy function. In particular, we design a more appropriate objective function that avoids biasing toward the data by making the loss of the minority class much greater than that of the majority. The weights we employ rebalance the observations of each class, so that the total weight of the observations of each class is equal. Finally, we define the model's loss function as the summation of the losses of all iterations:

$$\mathcal{L} = -\sum_{k=1}^{K} \sum_{i=1}^{M} \Big[ w_1\, \mathbb{I}(y_i = 1) \log r_i^{(k)} + w_0\, \mathbb{I}(y_i = 0) \log\big(1 - r_i^{(k)}\big) \Big]$$

where I(·) is an indicator function, K is the number of iterations, and the class weights w_0 and w_1 are inversely proportional to the class frequencies.
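A sketch of this objective; the inverse-frequency form of the weights is our reading of "the total weight of each class is equal":

```python
import torch

def weighted_loss(probs_per_iter: list[torch.Tensor], y: torch.Tensor) -> torch.Tensor:
    """probs_per_iter: r^(k) from each iteration; y: (M,) float 0/1 labels."""
    n_pos = y.sum().clamp(min=1.0)
    n_neg = (1 - y).sum().clamp(min=1.0)
    w1 = y.numel() / (2 * n_pos)          # boosts the minority 1-class
    w0 = y.numel() / (2 * n_neg)
    loss = y.new_zeros(())
    for r in probs_per_iter:              # sum the loss of every iteration
        r = r.clamp(1e-7, 1 - 1e-7)
        loss = loss - (w1 * y * r.log() + w0 * (1 - y) * (1 - r).log()).sum()
    return loss
```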

Datasets
As shown in Table 1, we employ two widely-used datasets with multi-sentence summaries: CNN and DailyMail (CNN/DM) (Hermann et al., 2015) and New York Times (NYT) (Sandhaus, 2008).
CNN/DM Sentences were split with the Stanford CoreNLP toolkit (Manning et al., 2014), and the dataset was pre-processed following See et al. (2017) and Zhong et al. (2020). This dataset contains news articles and several associated abstractive highlights. We use the unanonymized version, as in previous summarization work, and each document is truncated to 800 BPE tokens.
NYT Following previous work (Zhang et al., 2019b; Xu and Durrett, 2019), we use 137,778, 17,222, and 17,223 samples for training, validation, and test, respectively. We also follow their filtering procedure: documents whose summaries are shorter than 50 words are removed from the dataset. Sentences were split with the Stanford CoreNLP toolkit (Manning et al., 2014). Input documents were truncated to 800 BPE tokens as well.

Parameters
Our code is based on PyTorch (Paszke et al., 2019), and the pre-trained model employed in DifferSum is 'albert-xxlarge-v2' from huggingface/transformers. We train DifferSum for two days and 100,000 steps on 2 GPUs (Nvidia Tesla V100, 32 GB) with gradient accumulation every two steps. Adam with β_1 = 0.9 and β_2 = 0.999 is used as the optimizer. The learning-rate schedule follows the warm-up strategy with 10,000 warm-up steps. We tried 2/4/6/8 refinement iterations, and K = 4 is the best choice based on the validation set. We select the top-3 checkpoints by evaluation loss on the validation set and report the averaged results on the test set. Following Jia et al. (2020a) and Jia et al. (2021), we employ the greedy algorithm for the sentence-level soft labels, which falls under the umbrella of subset selection. Besides, we employ the Trigram Blocking strategy for decoding, a simple but powerful version of Maximal Marginal Relevance (Carbonell and Goldstein, 1998). Specifically, when predicting summaries for a new document, we first use the model to obtain the probability score p(1|s_i, D, θ) for each sentence, then rank the sentences by their scores and discard those which have trigram overlap with their predecessors.
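A minimal sketch of the Trigram Blocking decoding step as described above (function names are ours):

```python
def trigrams(tokens: list[str]) -> set[tuple[str, str, str]]:
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def trigram_blocking(sentences: list[str], scores: list[float], m: int = 3) -> list[str]:
    """Greedily pick high-scoring sentences, skipping trigram overlaps."""
    seen: set[tuple[str, str, str]] = set()
    picked: list[int] = []
    for i in sorted(range(len(sentences)), key=lambda i: -scores[i]):
        tri = trigrams(sentences[i].lower().split())
        if tri & seen:
            continue                  # blocked: repeats an already chosen trigram
        picked.append(i)
        seen |= tri
        if len(picked) == m:
            break
    return [sentences[i] for i in sorted(picked)]  # restore document order
```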

Metric
ROUGE (Lin, 2004) is the standard metric for evaluating the quality of summaries. We report the ROUGE-1, ROUGE-2, and ROUGE-L of DifferSum computed with ROUGE-1.5.5.pl, which measures the overlapping lexical units between the extracted sentences and the ground truth.

Results on CNN/DM

Table 2 compares extractive summarization models against abstractive models; the abstractive paradigm is certainly still at the frontier of summarization. The first part of the extractive approaches comprises the Lead-3 baseline and the Oracle upper bound, while the second part includes other extractive summarization models. We present our models at the bottom. It is obvious that our DifferSum outperforms all extractive baseline models. Compared with the large version of BERTSUMEXT, DifferSum achieves improvements of 0.85/1.02/0.93 on R-1, R-2, and R-L, which indicates that the pivotal information captured by the differential amplifier is more powerful than that captured by other structures. Comparing the earlier approaches, especially BERTSUMEXT, we observe that BERT outperforms all previous non-BERT-based summarization systems, and that Trigram Blocking leads to a great improvement on all ROUGE metrics. MATCHSUM is a comparable competitor to our DifferSum; it formulates extractive summarization as a two-step problem and extracts-then-matches summaries based on a well-trained BERTSUMEXT. Therefore, we only train a large version of DifferSum for a fair comparison.

Results on NYT
Results on NYT are summarized in Table 3. Note that we use limited-length ROUGE recall, following Durrett et al. (2016), where the selected sentences are truncated to the length of the human-written summaries. The structure of Table 3 is similar to that of Table 2. The first four lines are abstractive models, and the next two lines are our golden baselines for extractive summarization. The third part reports the performance of other extractive works and of our model, respectively. Again, we observe that our differential amplifier modeling performs better than both LSTM- and BERT-based models. Meanwhile, we find that extractive approaches show superiority over abstractive models, and the ROUGE scores are higher than on CNN/DailyMail.

Ablation Studies
We propose several strategies to improve the performance of extractive summarization, including the differential amplifier (vs. a normal residual network), pre-trained ALBERT (vs. BERT), and iterative refinement (vs. none). To investigate the influence of these factors, we conduct experiments and list the results in Table 4. Notably, 1) the differential amplifier is more critical than ALBERT, because the pivotal information is essential yet difficult for ALBERT to model; 2) the iterative refinement mechanism enlarges the advantage of the differential amplifier, demonstrating the superiority of the deep architecture.

Human Evaluation for Summarization
Relying only on ROUGE is not enough to evaluate a summarization system, although ROUGE correlates well with human judgments (Owczarzak et al., 2012). Therefore, we design an experiment based on a ranking method to evaluate the performance of DifferSum with human judges. Following Cheng and Lapata (2016), Narayan et al. (2018), and Zhang et al. (2019b), we first randomly select 40 samples from the CNN/DM test set. The human participants are then presented with one original document and a list of corresponding summaries produced by different systems.
Participants are asked to rank these summaries (ties allowed) by taking into account informativeness (can the summary capture the important information from the document?) and fluency (is the summary grammatical?). Each document is annotated by three different participants separately. The input article and ground-truth summaries are shown to the participants in addition to the three model summaries (SummaRuNNer, BERTSUMEXT, and DifferSum). From the results shown in Table 5, it is clear that DifferSum is better in relevance than the others.

Trigram Blocking Strategy
Trigram Blocking leads to a great improvement on all ROUGE metrics for many extractive approaches (Wang et al., 2020), and it has become a fundamental module in extractive summarization. In this paper, DifferSum extracts summary sentences with the Trigram Blocking algorithm, but does it bring as great an improvement as for SummaRuNNer or BERTSUMEXT? As explained by Nallapati et al. (2017), picking all sentences whose predicted probability exceeds a threshold may not be an optimal strategy, since the training data is very imbalanced in terms of summary membership of sentences. Therefore, the Trigram Blocking algorithm is introduced to select the top-k sentences and reduce redundancy.
Coincidentally, our DifferSum is designed to 1) rebalance the distribution of the majority and the minority and 2) filter tangential and redundant information. Thus, the Trigram Blocking algorithm may add little for DifferSum. Table 6 summarizes the performance gain of the Trigram Blocking strategy. It is obvious that this strategy is essential for BERTSUMEXT and SummaRuNNer, bringing improvements of more than 2.68 and 0.98 on R-1 respectively, because neither of them models redundancy sufficiently. On the other hand, the effect of the Trigram Blocking strategy on DifferSum is weak.

Documents with a Different Number of Sentences
In this paper, we emphasize the inherent imbalance between the majority 0-class and the minority 1-class. In fact, the CNN/DailyMail dataset contains documents of widely varying lengths, ranging from 3 to 100 sentences, while the number of summary sentences (those labeled 1) ranges from 1 to 5; on average, only 7.33% of the sentences in CNN/DailyMail are labeled 1. What is worse, the number of sentences per document is roughly uniformly distributed, so we cannot avoid the imbalance by cleaning the data.
We therefore design another experiment to analyze the harmful effect of imbalanced classes. We train BERTSUMEXT (12 layers) from scratch on CNN/DailyMail and evaluate the model on the test set to check how the ROUGE scores change as the number of sentences grows. The result is shown in the line charts of Figure 1 and Figure 3a; note that we only consider documents with fewer than 55 sentences. Specifically, each document is truncated to 2,000 BPE tokens to involve more sentences, but this still cannot cover whole documents with more than 55 sentences. Therefore, we calculate the ROUGE scores for documents with 3 to 55 sentences.
For comparison, we train our DifferSum (12 layers) from scratch, with each document likewise truncated to 2,000 BPE tokens. The trend for DifferSum is shown in Figure 3b. Compared with the trend for BERTSUMEXT, there is no obvious ROUGE decrease, demonstrating that our approach has strengthened the representation of pivotal information and rebalanced the disproportionate ratio of summary sentences to other sentences.
Note that truncating at more BPE tokens increases the final average ROUGE slightly, since truncating too aggressively may drop some summary sentences. Unfortunately, our 24-layer DifferSum can only be trained with 800 BPE tokens due to the limitation of GPU resources.

Map Words Representation into Sentence Representation
A key issue motivating the sentence-level Transformer (T_S) and the document-level Transformer (T_D) is that the word features after T_S might be at different scales or magnitudes. This can be due to some words having very sharp or very distributed attention weights when summing over the features of the other words.
In this paper, we apply two ways to map the word representations into a sentence representation: weighted pooling as in Equation 2, and picking the [CLS] token as the sentence representation. Table 7 shows that [CLS] does not convey enough information about the words, for both our DifferSum and BERTSUMEXT. In particular, DifferSum is more sensitive to the word features, since our differential amplifier amplifies the semantic features effectively.

Conclusion
In this paper, we introduce a heuristic model, DifferSum, 1) to calculate and amplify the pivotal information and 2) to rebalance the distribution of the minority 1-class and the majority 0-class. Besides, we employ a weighted cross-entropy function to compensate for the imbalance. Experimental results show that our method significantly outperforms previous models. In the future, we would like to generalize DifferSum to other fields.