Long Document Summarization in a Low Resource Setting using Pretrained Language Models

Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we use a modern pretrained abstractive summarizer, BART (Lewis et al., 2020), which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm, based on GPT-2 (Radford et al., 2019) language model perplexity scores, that operates within the low-resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with independent human labeling by domain experts.


Introduction and Related Work
Text summarization is the task of generating a shorter coherent version of a document while preserving key information. Typical abstractive summarization algorithms use seq2seq models with attention (Chopra et al., 2016), copy mechanisms (Gu et al., 2016), content selection (Cheng and Lapata, 2016), pointer-generator methods (See et al., 2017) and reinforcement learning (Wu and Hu, 2018). These methods perform well on high-resource summarization datasets with short documents, such as CNN/DailyMail (Nallapati et al., 2016) and Gigaword (Rush et al., 2015). However, summarization over long documents with thousands of tokens is a more practically relevant problem. Existing solutions focus on leveraging document structure (Cohan et al., 2018) or perform mixed-model summarization involving compression or selection followed by abstractive summarization (Liu et al., 2018; Gehrmann et al., 2018). However, these methods require large amounts of training data. Low-resource settings are common in real-world applications, as curating domain-specific datasets, especially over long documents and at large scale, is both expensive and time-consuming.
A human summarizing a long document would first understand the text, then highlight the important information, and finally paraphrase it to generate a summary. Building on this intuition, we present a low-resource long document summarization algorithm (Section 2) operating in three steps: (1) ground sentences of every training-set summary into its source, identifying salient sentences; (2) train a salience classifier on this data, and use it to compress the source document at test time; (3) feed the compressed document to a state-of-the-art abstractive summarizer pretrained on a related domain to generate a coherent and fluent summary.

Figure 1: Our method for the long document summarization task in a low-resource setting. The Extraction Model generates a compressed document D by identifying salient sentences. It is trained by computing a salience score for each training-set source sentence. The pretrained abstractive summarizer takes the compressed document as input.
To tackle data scarcity, we use pretrained language models in all three steps, which show strong generalization (Devlin et al., 2019) and are sample efficient (Yogatama et al., 2019). Notably, our step (1) uses a novel method based on GPT-2 perplexity (Radford et al., 2019) to ground sentences.
Unlike prior work tackling data scarcity in summarization (Parida and Motlicek, 2019; Magooda and Litman, 2020), our method needs no synthetic data augmentation. Moreover, we study a significantly more resource-constrained setting: a complex legal briefs dataset (Section 2) with only 120 available (document, summary) pairs and an average of 4.3K tokens per document. Parida and Motlicek (2019) assume access to 90,000 pairs with a maximum of 0.4K source document tokens, while Magooda and Litman (2020) use 370 pairs with 0.2K source document tokens.
Despite this challenging setup, our method beats an abstractor-only approach by 6 ROUGE-L points, and also beats several competitive salience detection baselines (Section 3). Interestingly, identified salient sentences show agreement with an independent human labeling by domain experts, further validating the efficacy of our approach.

Dataset and Approach
To mimic the real-world scenario of summarization over long domain-specific documents, we curate 120 document-summary pairs from publicly available Amicus Briefs, thus simulating the legal domain. As shown in Table 1, the source documents are long, averaging 4,268 words, while only 120 (document, summary) pairs are available. To tackle this low-resource setting, we use the state-of-the-art abstractive summarizer BART (Lewis et al., 2020), pretrained on a dataset from a related domain (CNN/DM). Since BART was trained on short documents, it truncates documents longer than 1024 subwords. Hence, instead of feeding the whole source document as input to BART, we feed salient sentences extracted using a salience classifier. Our salience classification dataset is built using a novel method which grounds summary sentences to sentences in the source with language model perplexity scores. Our approach (Figure 1) resembles the extract-then-abstract paradigm popular in prior work (Gehrmann et al., 2018; Liu et al., 2018; Subramanian et al., 2019; Chen and Bansal, 2018).
Extraction Stage: To extract from the source document the most important content required to generate the summary, we pose content selection as a binary classification task, labeling every sentence in the source document as salient or non-salient. Sentences classified as salient are concatenated in their order of occurrence in the source document to generate a compressed "extractive summary", which is then fed to the abstractive summarizer.
In addition to identifying important information, the salience classifier is able to remove repetitive boilerplate text which is common in technical documents but often irrelevant to the actual content.
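As a concrete illustration, the test-time compression step could look like the sketch below. The pipeline wrapper, the model path, and the SALIENT label name are illustrative assumptions, not released artifacts of the paper.

    # Sketch of test-time compression: classify every source sentence and
    # keep the ones predicted salient, in their original order.
    import nltk
    from transformers import pipeline

    nltk.download("punkt", quiet=True)

    # Hypothetical path to a finetuned salience classifier; "SALIENT" is
    # an assumed label name, not the paper's released model.
    salience_clf = pipeline("text-classification",
                            model="path/to/salience-classifier")

    def compress(document: str) -> str:
        sentences = nltk.sent_tokenize(document)
        preds = salience_clf(sentences, truncation=True)
        salient = [s for s, p in zip(sentences, preds)
                   if p["label"] == "SALIENT"]
        return " ".join(salient)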
Training Data for Salience Classification: Since we do not have sentence-level training data for the classifier, we construct it by grounding sentences of the ground-truth summary to sentences in the source document. Consider a source document S consisting of m sentences $s_{1:m}$ and a target summary T consisting of n sentences $t_{1:n}$, where $m \gg n$. We compute the salience score for every source sentence $s_i \in S$ as $\frac{1}{n}\sum_{j=1}^{n} f(s_i, t_j)$.
Here, f(s, t) is a measure of how much source sentence s grounds target sentence t. Following this, we sort the sentences in the source document by salience score. The highest-scoring 3n sentences are chosen as salient sentences and the lowest-scoring 3n as non-salient sentences. We construct our dataset for salience classification by running this algorithm for every (S, T) pair in the training dataset. To ensure generalization with limited training data, we incorporate transfer learning and build our classifier by finetuning a pretrained BERT model.

Choice of f(s, t): To measure how much a source sentence s grounds a target sentence t, we measure the perplexity of t conditioned on s, using the pretrained language model GPT-2 large (Radford et al., 2019). More formally, we concatenate s and t as [s; t], feed this as input to GPT-2 large, and calculate the perplexity over the tokens of t.
Here, a lower perplexity corresponds to a higher f(s, t) score. We find that this measure correlates with entailment and outperforms other choices of f(s, t) such as n-gram overlap, sentence embedding similarity, and entailment classifiers (Section 3.3).
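The scoring and labeling procedure described above could be sketched as follows. This is a minimal sketch under the paper's stated design (GPT-2 large, perplexity over the tokens of t, top/bottom-3n selection); the helper names cond_perplexity and label_sentences and the tokenization details are our assumptions.

    # Sketch of the grounding measure f(s, t) as GPT-2 perplexity of t
    # conditioned on s, and of the top/bottom-3n labeling from Section 2.
    import math

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

    @torch.no_grad()
    def cond_perplexity(s: str, t: str) -> float:
        # Concatenate [s; t] and compute perplexity over t's tokens only.
        s_ids = tokenizer(s, return_tensors="pt").input_ids
        t_ids = tokenizer(" " + t, return_tensors="pt").input_ids
        ids = torch.cat([s_ids, t_ids], dim=1)
        labels = ids.clone()
        labels[:, : s_ids.size(1)] = -100  # ignore the context positions
        loss = model(ids, labels=labels).loss  # mean NLL over t's tokens
        return math.exp(loss.item())

    def label_sentences(source_sents, summary_sents):
        # Salience score of s_i: mean grounding over all summary sentences.
        # Lower perplexity means stronger grounding, so we negate it.
        n = len(summary_sents)
        scores = [-sum(cond_perplexity(s, t) for t in summary_sents) / n
                  for s in source_sents]
        order = sorted(range(len(source_sents)), key=lambda i: scores[i])
        # Assumes m >> n, so the two slices do not overlap.
        non_salient = [source_sents[i] for i in order[:3 * n]]
        salient = [source_sents[i] for i in order[-3 * n:]]
        return salient, non_salient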
Abstraction Stage: Having compressed the source document using our extractor, we use a black-box pretrained abstractive summarizer trained on a related domain. In this work, we use the state-of-the-art BART model, which is based on pretrained language models. Pretraining on CNN/DM helps BART generalize to unseen but related domains like legal briefs.

Evaluating the extractor
To evaluate our proposed extractor, we first check whether our salience classifier generalizes to a held-out test set. Indeed, it achieves a classification accuracy of 73.66%, and qualitative analysis of the classifications confirms its ability to identify boilerplate sentences as non-salient. Our classifier compresses source documents by 61% on average. Next, we evaluate the quality of the extracted salient sentences by checking the extent to which they overlap in information with the gold test-set summaries, measuring ROUGE-1/2 recall scores. As shown in Table 2, our extractor outperforms a random selection of the same number of sentences and is comparable to the upper-bound recall performance achieved by feeding in the whole source document.

Table 2: ROUGE-1/2 (R-1/2) recall scores of the gold summary with respect to the "Source" document. Our saliency-driven extractor performs better than a random selection of the same number of sentences and is close to the upper-bound recall performance achieved by feeding in the whole source document.

Finally, to measure the extent to which our salience classifier matches human judgment, domain experts identified, at our request, 8-10 salient sentences in each of four test documents containing more than 200 sentences. Despite the scarcity of these labels, our salience classifier recovers 64.7% of the marked sentences, confirming agreement with human judgments.
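The recall numbers in Table 2 can be reproduced in spirit with Google's rouge-score package; the sketch below assumes per-document ROUGE-1/2 recall of the gold summary against the compressed document, which mirrors the description above but is not the authors' released evaluation code.

    # Sketch of the recall evaluation (pip install rouge-score): how much
    # gold-summary content the compressed document retains.
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

    def recall_scores(extractive_summary: str, gold_summary: str):
        # score(target, prediction); .recall measures coverage of the gold
        # summary's n-grams by the extracted sentences.
        scores = scorer.score(gold_summary, extractive_summary)
        return scores["rouge1"].recall, scores["rouge2"].recall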

Evaluating the entire pipeline
We evaluate the entire pipeline by measuring the quality of abstractive summaries obtained by feeding the extractive summary to BART. We study two abstractor settings: (1) treating BART as a black box with no modification; (2) finetuning BART on the training and validation splits of the Amicus dataset, which together comprise 96 documents (the test split was not used). We present results on the Amicus test set. We compare our model against several competitive baselines: (1) NE: no extraction; (2) Random: a random selection of the same number of sentences as our extractive summary; (3) TextRank (Mihalcea and Tarau, 2004; Liu et al., 2018): an unsupervised graph-based approach to rank text chunks within a document; (4) Bottom-up summarizer (Gehrmann et al., 2018): a strong extract-then-abstract baseline where content selection is posed as a word-level sequence tagging problem. Similar to our setting, their content selector also uses large pretrained models (ELMo; Peters et al., 2018), which we finetune on our training set. As seen in Table 3, we observe a 4.8 / 6.0 ROUGE-1/L improvement over the no-extraction baseline (NE) and a 2.3 / 3.2 ROUGE-1/L improvement over the strongest extractor baseline (per metric), confirming the effectiveness of our method. In addition, finetuning the CNN/DM-pretrained BART on 96 Amicus documents helps domain adaptation and boosts the ROUGE scores of both the baselines and our method (f.t. BART).
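Of these baselines, Random is simple to restate precisely; the sketch below (the sentence splitter and function name are our assumptions) draws the same number of sentences as our extractive summary uniformly at random, preserving document order.

    # Sketch of the "Random" baseline: pick as many sentences as the
    # salience extractor keeps, uniformly at random, preserving order.
    import random
    import nltk

    def random_baseline(document: str, k: int) -> str:
        sentences = nltk.sent_tokenize(document)
        chosen = sorted(random.sample(range(len(sentences)),
                                      min(k, len(sentences))))
        return " ".join(sentences[i] for i in chosen)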

Validating the choice of f (s, t)
In Section 2 we used GPT-2 perplexity scores to measure the extent to which a source sentence grounds a target sentence. To motivate this choice, we measure its correlation with existing entailment datasets. We randomly sample 5000 examples from each class of the MultiNLI dataset (Williams et al., 2018) and compute the perplexity of the hypothesis with the premise as context. As seen in Figure 2, entailment pairs tend to have the lowest perplexity. This motivates our choice of f(s, t), since hypothesis sentences are best grounded in premise sentences for entailment pairs. To further validate the merit of GPT-2 perplexity, we conduct ablations using alternatives for f(s, t), including n-gram overlap (Papineni et al., 2002), sentence embedding similarity, and entailment classifiers. We present ROUGE scores using our whole extract-then-abstract pipeline with different choices of f(s, t) in Table 4. We note that perplexity performs the best: 2.4 ROUGE-1 better than the best alternative and 3.41 ROUGE-1 better than entailment.
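This sanity check could be sketched as below, assuming the HuggingFace multi_nli dataset (fields premise, hypothesis, and label, with 0 = entailment, 1 = neutral, 2 = contradiction) and a conditional-perplexity helper like the one in the earlier sketch.

    # Sketch of the MultiNLI sanity check: mean per-class perplexity of the
    # hypothesis conditioned on the premise.
    import math
    import random
    from collections import defaultdict

    import torch
    from datasets import load_dataset
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
    model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

    @torch.no_grad()
    def cond_perplexity(s: str, t: str) -> float:
        # Perplexity of t when the model is conditioned on s (see Section 2).
        s_ids = tokenizer(s, return_tensors="pt").input_ids
        t_ids = tokenizer(" " + t, return_tensors="pt").input_ids
        ids = torch.cat([s_ids, t_ids], dim=1)
        labels = ids.clone()
        labels[:, : s_ids.size(1)] = -100  # score only t's tokens
        return math.exp(model(ids, labels=labels).loss.item())

    mnli = load_dataset("multi_nli", split="train")
    by_class = defaultdict(list)
    for ex in mnli:
        by_class[ex["label"]].append((ex["premise"], ex["hypothesis"]))

    for label, pairs in sorted(by_class.items()):
        sample = random.sample(pairs, 5000)
        mean_ppl = sum(cond_perplexity(p, h) for p, h in sample) / len(sample)
        print(label, mean_ppl)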

Conclusion
We tackle the important real-world problem of summarizing long domain-specific documents with very little training data. We propose an extract-then-abstract pipeline which uses GPT-2 perplexity and a BERT classifier to estimate sentence salience. The salience scores need to be generated only once and can be reused across experiments; the sampling methods that choose salient and non-salient sentences for each document take less than a minute to run.
Analysis: (a) Table 5 shows the classifier accuracies for combinations of f(s, t) and sampling methods. We observe that for the aggregate sampling method, although the perplexity-based classifier does not have the highest accuracy, our pipeline with perplexity as f(s, t) gives the best ROUGE results among the ablation experiments (Table 4). Classifier accuracy is measured against automated labels derived from the salience score rather than against true labels, so the best classifier does not imply the best summarization. (b) Table 6 shows examples of using perplexity as f(s, t) to inspect how the summary is grounded in the source: it lists three summary sentences and the corresponding source sentences with the lowest perplexity scores. In each case the summary sentence either has a similar meaning to, or logically follows from, the source sentence. (c) Table 7 lists three examples each of salient and non-salient sentences inferred by the classifier trained on data prepared as described in Section 2. The third sentence in the non-salient column is an example of detected boilerplate content that is present across documents.

A.3 Abstractive Summarizer: BART
BART is a seq2seq model with a denoising pretraining objective, designed to generalize well across natural language understanding tasks, abstractive summarization being one of them. For the abstractive stage of our proposed approach, we use the bart.large.cnn variant, which is the BART-large model (with 12 encoder and decoder layers and 400 million parameters) finetuned for the CNN/DM summarization task; we use the pre-computed weights available at https://github.com/pytorch/fairseq/tree/master/examples/bart. Using BART's text generation script, we set the length penalty (lenpen) to 2.0 and the minimum length (min_len) to 500 words in order to encourage BART to produce longer outputs, which better suits our dataset. We also use a beam size of 4 and a no-repeat ngram size of 3.
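The settings above refer to fairseq's generation script; an approximately equivalent sketch using the HuggingFace transformers facebook/bart-large-cnn checkpoint is shown below. Note this is an approximation: transformers' min_length counts subword tokens, whereas the text specifies 500 words.

    # Approximate transformers equivalent of the fairseq generation settings
    # described above (beam 4, lenpen 2.0, no-repeat 3-grams, long minimum
    # length).
    from transformers import BartForConditionalGeneration, BartTokenizer

    tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    bart = BartForConditionalGeneration.from_pretrained(
        "facebook/bart-large-cnn")

    def summarize(compressed_doc: str) -> str:
        inputs = tok(compressed_doc, max_length=1024, truncation=True,
                     return_tensors="pt")
        out = bart.generate(inputs["input_ids"], num_beams=4,
                            length_penalty=2.0, min_length=500,
                            max_length=1024, no_repeat_ngram_size=3)
        return tok.decode(out[0], skip_special_tokens=True)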
Finetuning: We use the train and dev splits of the Amicus dataset (96 source-target pairs) and finetune BART for the summarization task starting from its CNN/DM-finetuned checkpoint. First, we preprocess the dataset following the guidelines in the official code. We finetune for 500 epochs with a learning rate of 3e-5 and stop early if the validation loss does not decrease for 50 epochs. Other parameters are as follows: total_num_updates = 20000, warmup_updates = 500, update_freq = 4, optimizer = Adam with weight decay of 0.01. The remaining parameters were kept at the defaults in the official script. Results (precision, recall, F1) on the Amicus test set using the existing BART model and the finetuned BART are shown in Table 8. Table 9 shows a target summary and the summary generated by our model (Section 2) for one sample source document. We can see that the summary generated by our model is fluent and has a coherent flow of information.

Table 6: Using GPT-2 perplexity as f(s, t), three sentences from the summary are shown with the corresponding source sentence having the lowest perplexity.

Summary: In the immigration context, this jurisprudence has prompted the Court to reject the notion that the so-called entry fiction is of constitutional significance.
Source: Prior to Knauff and Mezei, the distinction between noncitizens who had entered the United States and those who remained outside it had not been elevated to a bright-line constitutional rule, and entry had never been completely determinative of the fact or extent of protection under the Due Process Clause.

Summary: It has accordingly authorized such detention only in limited circumstances pursuant to a carefully defined scheme.
Source: The Court's substantive due process jurisprudence also recognizes that an individual may be subjected to regulatory detention only in narrow circumstances under a carefully drawn scheme.

Summary: With respect to substantive due process, this Court has increasingly recognized the punitive consequences of indefinite regulatory detention.
Source: Thus, the Court has substantially restricted the availability and duration of regulatory confinement in the years since it decided Mezei. In Zadvydas, this Court established that its substantive due process jurisprudence provided the appropriate framework for evaluating the administrative detention of noncitizens pending removal from the United States.

Table 7: Examples of salient and non-salient sentences inferred by the salience classifier.

Salient sentences:
- At the same time, the Court has long been skeptical of the military's authority to try individuals other than active service personnel.
- On the basis of this revised test, the Court of Appeals refused to apply the exceptional circumstances exception to Al-Nashiri's petition.
- Consonant with that tradition, this Court should review the Court of Appeals' decision to confirm that exceptional delay before trial remains of central concern on habeas review and is indeed one of the very dangers the writ of habeas corpus was designed to avoid.

Non-salient sentences:
- A government predicated on checks and balances serves not only to make Government accountable but also to secure individual liberty.
- At present, the Rules for Courts-Martial require that the accused be brought to trial within 120 days after the earlier of preferral of charges or confinement.

Table 9: Comparison of summaries, where the top summary is the target summary and the bottom summary is the one generated by our extractor and f.t. BART. As we can see, the generated summary is coherent and has a fluent flow of information.

Because there is no dispute that the fundamental right to parent is at stake in abuse and neglect proceedings, the ABA focuses its discussion on the second and third factors of the three-factor test. As to the second, so-called "risk of error" factor, the ABA's conclusion, after years of investigation and analysis, is that the absence of counsel for indigent parent-defendants in abuse and neglect proceedings results in a significant risk of an erroneous determination. This is especially true where the opposing party is the State. As to the third, state's interest factor, the ABA's investigation shows that the interests of both the parent and the state are best served where indigent parent-defendants are represented. The ABA respectfully suggests that the evidence and analysis relevant to these two factors is so compelling in most, if not all, abuse and neglect proceedings involving indigent parent-defendants, that a case-by-case balancing of the factors should be rejected in favor of a rule requiring the appointment of counsel for indigent parent-defendants in all such proceedings. The evidence and analysis supporting the ABA's policy includes the fact that a substantial majority of states have recognized an unqualified right to counsel for indigent parent-defendants in child custody proceedings. Similarly, other industrial democracies provide indigent parent-defendants with such a right to counsel. The ABA respectfully submits that this Court should require no less as a matter of due process under the New Hampshire Constitution. Although In re Shelby R. resulted in a plurality ruling, the Court was not split on the question of whether or not a natural parent's role in the family is a fundamental liberty interest protected by the State Constitution. See In re Shelby R., 148 N.H. at 244 (dissenting opinion). The New Hampshire constitution requires this court to determine whether indigent parents have a legally protected interest.

Most indigent parent-defendants are incapable of performing the advocacy functions required in abuse and neglect proceedings. Most unrepresented parents cannot perform the advocacy functions -- including investigating facts, making an orderly factual presentation, and cross-examining witnesses -- that are required. The intense, emotionally charged backdrop against which custody decisions are often made further exacerbates the inherent disadvantages faced by unrepresented indigent parents. The need for counsel for the indigent parent-defendant is especially great where the opposing party is the state. The court must weigh three factors: (1) the private interests that will be affected; (2) the risk of erroneous deprivation of the liberty interest through the procedures used and the value, if any, of additional or substitute procedural safeguards; (3) the state's interest, including the function involved and fiscal and administrative burdens that additional or substitute procedural requirements would entail. Id. at 240; see also In re Father, 155 N.H. 93, 95 (2007). This court has previously concluded as to the first factor that adversary child custody proceedings implicate a fundamental liberty interest -- the right to parent. In this case, the central question thus becomes whether that right is sufficiently protected. The conclusion that counsel must be provided is so compelling in most, if not all cases, that a case-by-case balancing of the factors should be rejected in favor of a rule requiring the appointment of counsel for low-income parent-defendants in all such proceedings to be constitutionally acceptable. The state is not the only adversary finding the only meaningful right to be heard when her adversary is not represented by counsel is not spaled against the traditional weapons of the state, such as the state's attorney general. The courts must also weigh the public interest in the child custody case, including the function involved and the cost of additional or substitute safeguards, as well as the cost to the state of the additional or substituted safeguards. The risk of an erroneous deprivation of the fundamental right to parent only increases the only increase in the risk that the state will find the child is not heard when the state is the adversary. The public interest is only increased by the fact that the child will not be heard by the state when the parent is represented by a lawyer. The high level of complexity of child custody cases makes it difficult for the court to make a fair and just decision.