Towards Argument-Aware Abstractive Summarization of Long Legal Opinions with Summary Reranking

We propose a simple approach for the abstractive summarization of long legal opinions that considers the argument structure of the document. Legal opinions often contain complex and nuanced argumentation, making it challenging to generate a concise summary that accurately captures the main points of the legal opinion. Our approach involves using argument role information to generate multiple candidate summaries, then reranking these candidates based on alignment with the document's argument structure. We demonstrate the effectiveness of our approach on a dataset of long legal opinions and show that it outperforms several strong baselines.


Introduction
Legal opinions contain implicit argument structure that spreads across long texts. Existing summarization models often struggle to accurately capture the main arguments of such documents, leading to suboptimal summaries (Xu et al., 2021; Elaraby and Litman, 2022). We propose an approach for the abstractive summarization of long legal opinions that leverages argument structure.
Legal opinions often follow a specific argumentative structure, with the main points of the argument presented clearly and logically (Xu et al., 2021; Habernal et al., 2022; Xu and Ashley, 2022). Prior work has shown that by considering this structure during summarization, it is possible to generate extractive and abstractive summaries that more accurately reflect the original argumentation in the document (Elaraby and Litman, 2022; Zhong and Litman, 2022; Agarwal et al., 2022). In this paper, we present a framework for abstractive summarization of long legal opinions that extends this literature by leveraging argument structure during summary reranking to both generate and score candidates. Our method uses the Longformer-Encoder-Decoder (LED) (Beltagy et al., 2020) model to generate multiple candidate summaries by training it on various input formats, which allows different argument representations to be considered during summary generation. Additionally, we use beam search to further diversify the output. Finally, we rank the candidate summaries by measuring their lexical similarity to the input's main arguments.
We evaluate our approach on a dataset of long legal opinions obtained from the Canadian Legal Information Institute (CanLII) and demonstrate that our method outperforms competitive baselines. Our results with ROUGE and BERTScore (Lin, 2004; Zhang et al., 2019) suggest that considering the argumentative coverage of the original opinions can lead to a more effective selection of summaries.
Our contributions are: (1) We propose a simple reranking approach that takes into account the argumentative structure of legal opinions to improve over standard finetuning of generation models. (2) We demonstrate, through empirical results and ablation analysis, why our approach is effective for summarizing long legal opinions. Our code can be accessed at: https://github.com/EngSalem/legalSummReranking

Related Work
Long Legal Document Summarization Legal documents have a distinct format, with a hierarchical structure and specialized vocabulary that differs from that of other domains (Kanapala et al., 2019). They also tend to be longer than documents in other domains (Kan et al., 2021; Huang et al., 2020; Moro and Ragazzi, 2022), which has led to the use of transformer models with sparse attention mechanisms (Michalopoulos et al., 2022; Guo et al., 2022; Beltagy et al., 2020) to reduce the complexity of encoding lengthy text. Legal opinions, in particular, have a complex argumentative structure that spans the text, making it crucial to address in summaries (Xu et al., 2021; Xu and Ashley, 2022; Elaraby and Litman, 2022). We use prior legal opinion summarization methods as evaluation baselines.
Summarization and Argument Mining Using a dialogue summarization dataset with argument information, Fabbri et al. (2021b) converted an argument graph into a textual format to train a summarizer. For legal documents, Agarwal et al. (2022) used argument role labeling to improve extractive summarization using multitask learning. Elaraby and Litman (2022) blended argument role labeling and abstractive summarization using special markers, generating summaries that better aligned with legal argumentation. We incorporate the models of Elaraby and Litman (2022) into summary reranking and further improve performance.
Second Stage Reranking Generating multiple outputs and reranking them according to certain criteria has been successfully applied to downstream NLP applications, including abstractive summarization. Some methods use different input formats to generate multiple outputs. Oved and Levy (2021) perturbed input multi-opinion reviews to generate multiple candidate summaries, then ranked them by coherency. Ravaut et al. (2022) used a multitask mixture of experts to directly model the probability that a summary candidate is the best one. Liu and Liu (2021) ranked candidate summaries generated from 16 diverse beam searches to improve news summarization in terms of ROUGE score. Liu et al. (2022) presented a summary reranking technique based on a non-deterministic training objective, which enables the model to directly rank probable summaries from beam-search decoding according to their quality. We rely on distinct argument-aware input formats, in addition to diverse beam decoding, to develop our argument-aware reranking method.

Annotated Dataset
We employ the annotated subset (Xu et al., 2021; Elaraby and Litman, 2022) of the CanLII dataset (Zhong and Litman, 2022) used in prior summarization research on legal opinions. This subset contains 1049 opinion/summary pairs annotated with sentence-level argument role labels for both input documents and reference summaries. The input opinions have mean/max lengths of 4375/62786 words, motivating us to use models for long text.
Recent work has proposed argument role taxonomies aligned with structures commonly found in legal text (Habernal et al., 2022; Xu et al., 2021). The CanLII data was annotated for argument roles using the IRC scheme for legal opinions (Xu et al., 2021), which divides argument roles into Issues (legal questions which a court addressed in the document), Reasons (pieces of text which indicate why the court reached its specific conclusions), and Conclusions (the court's decisions for the corresponding issues). We use these three fine-grained IRC labels, and also collapse them into a single argumentative label, to incorporate argument structure into our models. An IRC-annotated opinion and summary pair can be found in Appendix A.
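As a small illustration, collapsing the fine-grained IRC roles into the single argumentative label can be done as follows; the label strings and function name here are illustrative choices, not fixed by the dataset:

```python
# Fine-grained IRC roles from the CanLII annotations (Xu et al., 2021).
IRC_ROLES = {"Issue", "Reason", "Conclusion"}

def collapse_role(role: str) -> str:
    """Map a fine-grained role to the single binary argumentative label.

    The label strings ("IRC" / "Non-IRC") are illustrative, not prescribed
    by the dataset.
    """
    return "IRC" if role in IRC_ROLES else "Non-IRC"
```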

Model and Methods
Our proposed method follows the generate-then-rank paradigm and can be split into two parts. First, we explore techniques to utilize an argumentation-augmented LED model to generate a set of candidate summaries $\mathcal{S}$. Second, we propose a function $\mu$ that scores each summary $S \in \mathcal{S}$ based on its argumentative alignment with the input document. The best candidate $S^*$ is selected as $S^* = \arg\max_{S_i \in \mathcal{S}} \mu(S_i)$. Figure 1 shows an overview of our approach.
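A minimal sketch of this selection rule, where `score_fn` stands in for the scoring function $\mu$ defined in the following subsections:

```python
def select_best_summary(candidates, score_fn):
    """Return S* = argmax over candidate summaries of the score mu(S).

    `candidates` is the list of generated candidate summaries and
    `score_fn` is any callable implementing the scoring function mu.
    """
    return max(candidates, key=score_fn)
```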

Generating Candidates: Argument-Aware Training + Diverse Decoding
Diverse decoding techniques such as beam search can help diversify the summary output; however, they are limited to the underlying language model used in the decoder and are completely isolated from the input format. We therefore propose to complement beam search by finetuning LED on three different input formats. We refer to this model as $M_{arg\text{-}augmented}$; its parameters $\theta^*_{arg\text{-}augmented}$ are selected by maximizing the likelihood of the reference summary over all input formats, $\theta^*_{arg\text{-}augmented} = \arg\max_{\theta} \sum_{X \in \mathcal{X}} \log P(S \mid X; \theta)$. During finetuning, $S$ is the reference summary, $\theta$ represents the trainable model parameters, and $\mathcal{X}$ is a set of inputs $\mathcal{X} = \{X_{raw}, X_{arg\_binary}, X_{arg\_finegrained}\}$, where $X_{raw}$ is the input without argument markers, $X_{arg\_binary}$ is the input document with binary argument markers added to highlight argument role sentences, and $X_{arg\_finegrained}$ is the input document with fine-grained argumentative markers added to also delineate the roles (i.e., Issue, Reason, Conclusion). These three representations of the input share the same reference summary, meaning that we augment the training data threefold. Table 1 shows an example of the distinct representations of our new training data. At inference time, we use markers predicted by the argument mining code of Elaraby and Litman (2022), instead of the manually labeled ones, to construct $\hat{X}_{arg\_binary}$ and $\hat{X}_{arg\_finegrained}$ of $\hat{\mathcal{X}}$, where $\hat{\mathcal{X}} = \{X_{raw}, \hat{X}_{arg\_binary}, \hat{X}_{arg\_finegrained}\}$. Our motivation is that different formats of the input yield different generated summaries that take into account different representations of the argumentative structure in the input.
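A minimal sketch of how the three input formats can be built from sentence-level argument role labels, using the marker tokens of Appendix C (function and variable names are illustrative):

```python
# Marker pairs from Appendix C: a single binary pair, or one pair per IRC role.
BINARY_MARKERS = {role: ("<IRC>", "</IRC>") for role in ("Issue", "Reason", "Conclusion")}
FINE_MARKERS = {
    "Issue": ("<Issue>", "</Issue>"),
    "Reason": ("<Reason>", "</Reason>"),
    "Conclusion": ("<Conclusion>", "</Conclusion>"),
}

def add_markers(sentences, roles, marker_map):
    """Wrap each argumentative sentence with its marker pair; leave the rest untouched."""
    marked = []
    for sent, role in zip(sentences, roles):
        if role in marker_map:
            start, end = marker_map[role]
            marked.append(f"{start} {sent} {end}")
        else:
            marked.append(sent)
    return " ".join(marked)

def build_training_points(sentences, roles, reference_summary):
    """One opinion yields three (input, reference) pairs sharing the same summary."""
    x_raw = " ".join(sentences)
    x_binary = add_markers(sentences, roles, BINARY_MARKERS)
    x_fine = add_markers(sentences, roles, FINE_MARKERS)
    return [(x_raw, reference_summary),
            (x_binary, reference_summary),
            (x_fine, reference_summary)]
```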

Scoring and Reranking Summaries
We propose a scoring method to rank the candidate summaries based on how well they capture the main argument points in the input. First, we employ a sentence-level argument role classifier to extract the sentences with argument roles, $\hat{X}_{args}$. The predicted sentences are used to construct an extractive summary. Then, we measure the lexical overlap between a generated candidate summary $\hat{S}$ and this extractive summary using the ROUGE-1 F1-score, which assigns each candidate summary a score representing its alignment with the legal opinion's argumentative content. Our scoring function $\mu$ can be written as $\mu(\hat{S}) = \text{ROUGE-1}(\hat{X}_{args}, \hat{S})$.
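A minimal sketch of this scoring step using the rouge_score package; the names and the stemming choice are illustrative, and the predicted argumentative sentences are assumed to be available as a list of strings:

```python
from rouge_score import rouge_scorer

# ROUGE-1 F1 between the extractive argument summary and a candidate.
_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def mu(candidate_summary, predicted_argument_sentences):
    """Score a candidate by its ROUGE-1 F1 overlap with the argumentative content."""
    extractive_arg_summary = " ".join(predicted_argument_sentences)
    return _scorer.score(extractive_arg_summary, candidate_summary)["rouge1"].fmeasure
```

The resulting `mu` can be passed directly as `score_fn` to the earlier selection sketch.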

Experiments
All models use the LED-base checkpoint as a base model. LED-base encodes up to 16k tokens, which fits our long inputs. All experiments use 5-fold cross-validation, with the documents of the four training folds split into 90% training and 10% validation; the validation split is used to select the best checkpoint. We compare all rank-based methods (baseline and proposed) to abstractive baselines previously explored in legal opinion summarization: finetuned LED-base (which refers to vanilla model finetuning using our dataset) and arg-LED-base (Elaraby and Litman, 2022) (which finetunes LED on the dataset blended with argument markers that mark the start and end of each argument role in the input; argument marker details can be found in Appendix C). We also compare our proposed rank-based approach from Section 4 with ranking baselines that use different input formats or diverse decoding alone. Specifically, we apply ranking on top of the output of the three LED models outlined in Elaraby and Litman (2022), which are trained on distinct argument-aware input formats (we refer to this as "baseline ranking"). Additionally, for diverse decoding, we employ beam widths in the range of 1 to 5 (we ran out of memory with beam widths greater than 5) on top of the model trained on the input with fine-grained markers (arg-LED-fine-grained), which achieved the best abstractive baseline ROUGE results.
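A minimal sketch of candidate generation with the Hugging Face LED implementation; the finetuned checkpoint path and summary length are illustrative placeholders, while the beam widths follow the 1 to 5 range used above:

```python
import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("path/to/finetuned-arg-augmented-led")  # hypothetical path

def generate_candidates(document_text, beam_widths=(1, 2, 3, 4, 5), max_summary_len=512):
    """Generate one candidate summary per beam width for a (possibly marked) input format."""
    inputs = tokenizer(document_text, return_tensors="pt", truncation=True, max_length=16384)
    # LED expects global attention on at least the first token.
    global_attention_mask = torch.zeros_like(inputs["input_ids"])
    global_attention_mask[:, 0] = 1
    candidates = []
    for k in beam_widths:
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            global_attention_mask=global_attention_mask,
            num_beams=k,
            max_length=max_summary_len,
            early_stopping=True,
        )
        candidates.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return candidates
```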
All models that utilize argument markers are evaluated under both oracle and predicted conditions at inference time, using human annotations or argument mining, respectively, to produce the markers.

Results
Table 2 reports ROUGE and BERTScore results, computed with the SummEval toolkit (https://github.com/Yale-LILY/SummEval).
Utility of any Ranking The ranking-based methods (rows 6-13) consistently outperform the abstractive baselines (rows 1-5; see Appendix D for extractive baseline results) in both predicted and oracle conditions. Also, the abstractive baseline results (rows 1-5) align with those of Elaraby and Litman (2022), where leveraging fine-grained markers in the input yields the highest scores.
Utility of Proposed Ranking Framework and its Components In the predicted case, our proposed arg-augmented-LED (row 10) improves over the abstractive baselines (rows 1-3) by 1.5-3.19 ROUGE-1 points and 1.27-3.07 ROUGE-L points, while maintaining a limited drop of 0.1 and 0.01 in ROUGE-2 and BS, respectively. Similarly, compared to our ranking baselines, our proposed model improves over the ROUGE-1 and ROUGE-L scores obtained by baseline ranking by 0.56-0.73 points, while dropping in ROUGE-2 and BS by 0.31 and 0.02 points, respectively. This indicates that incorporating argument information into the source inputs can lead to the generation of effective summary candidates. Our best predicted results are achieved by combining our proposed model with diverse beam decoding (row 11), which combines the strengths of various input formats and multiple beam decodings, resulting in statistically significant improvements over the previously proposed argument-aware abstractive baseline (row 3).
Inference with Predicted versus Oracle Argument Roles For the same model, predicted markers can impact the summarization results. In prior baselines (rows 3 and 5), we observe a drop of 2.05-2.14 ROUGE points and 0.06 BS points when switching from oracle to predicted markers. This observation is consistent between rows 6 and 8, and between rows 10 and 12. With our proposed arg-augmented-LED and diverse beam decoding, this performance gap is mitigated, reduced to between -0.02 and 0.66 ROUGE points and -0.03 BS points (rows 11 and 13). We believe this is due to the combination of distinct argumentative formats and diverse decoding, which allows more diverse candidates to be considered in the ranking and enhances robustness to noisy predictions during inference.

Conclusion and Future Work
We proposed a framework for improving the summarization of long legal opinions by combining distinct argument formats of the input with diverse decoding to generate candidate summaries. Our framework selects the summary with the highest lexical overlap with the input's argumentative content. Our results indicate that ranking alone can improve over abstractive baselines. Moreover, combining ranking with our proposed candidate generation method further improves results while maintaining robustness to noisy predictions. In future research, we plan to incorporate human expert evaluations to compare automatic metrics with human ratings. We also aim to explore the impact of using noisier argument roles during training on a larger corpus, by using the predicted markers obtained from our smaller dataset to experiment with the remaining unannotated portion of the CanLII dataset.

Limitations
The primary constraints in our research stem from our dependence on a single dataset for experimentation and from limited computing resources. Despite these, we postulate that our ranking-based methodology can be utilized for any summarization task that requires robust correspondence with a specific structure within the input. To validate this hypothesis, further experimentation is required to assess the generalizability of our technique to other datasets and domains. In addition, our limited computational resources prevented us from experimenting with other long document encoder-decoder models, such as BigBird and LongT5 (Michalopoulos et al., 2022; Guo et al., 2022), as well as from using higher beam widths during decoding. Furthermore, the cost and complexity of procuring expert evaluators within the legal domain resulted in using automatic metrics alone.

Ethical Considerations
How the summaries generated from legal opinions are used remains an important consideration. Abstractive summarization models have been found to contain hallucinated artifacts that do not come from the source texts (Kryscinski et al., 2019; Zhao et al., 2020; Kryscinski et al., 2020). While our model incorporates the argument structure of the source article, the generated results may still carry certain levels of non-factual information and need to be utilized with extra care. Similarly, as mentioned in prior work using CanLII (Elaraby and Litman, 2022; Zhong and Litman, 2022), CanLII has taken measures to limit the disclosure of defendants' identities (such as blocking search indexing). Abstractive approaches may cause user information leakage, so the dataset needs to be used cautiously to avoid impacting those efforts.

C Argumentative Markers

Special markers have been used to guide summarization and generation models (Khalifa et al., 2021; DeYoung et al., 2021). These markers can be added to the text by a human annotator, or they can be generated automatically by a model. They can take many forms, such as highlighting certain words or phrases or adding special tags to certain sentences. A summarization model can use them to identify the key parts of the text that should be included in the summary while also considering the overall structure and coherence of the text. This can help improve the accuracy and effectiveness of the summarization process, especially when the text is long or complex. In this work, we use the marker sets proposed by Elaraby and Litman (2022) to distinguish between argumentative and non-argumentative sentences.

Binary markers The binary markers aim to distinguish argumentative from non-argumentative sentences regardless of the type of the argument role (i.e., Issues, Reasons, or Conclusions). In our work, we used the markers <IRC> and </IRC> to highlight the start and end of each argumentative sentence.
Fine-grained markers We also used markers designated to distinguish between each argument role type: <Issue>, </Issue>, <Reason>, </Reason>, <Conclusion>, and </Conclusion>. Table 3 shows an example of using the different argumentative markers to highlight the start and end of a "Reason" sentence.

D Extractive Baselines
In addition to the abstractive baselines, we compare our methods to graph-based unsupervised extractive baselines built on top of HipoRank (Dong et al., 2021) and extractive baselines based on Extractive-BERT (Zheng and Lapata, 2019), which were previously applied to the same dataset (Zhong and Litman, 2022). Table 4 shows our abstractive summarization results compared to the extractive baselines in cross-validation settings. Our ranking-based methods show consistent improvement over both the extractive and the abstractive baselines.

E ROUGE-based Ranking Results
Table 5 shows a comparison between the usage of ROUGE-1, ROUGE-2, and ROUGE-L as potential ranking criteria to select the summary that aligns with the predicted argumentative content of the input legal opinion. While there are no substantial differences between the results with each ROUGE metric, ROUGE-L has marginally lower scores than ROUGE-1 and ROUGE-2.

Figure 1: Illustration of the basic components of our approach. For input documents with fine-grained markers, colored sticks are sentences with argument role labels of Issue, Reason, and Conclusion. We use one IRC label for the binary version. In our real dataset, marked sentences are surrounded with special markers (Appendix C).

Figure 2: An example of the annotated Issue, Reason, and Conclusion sentences in the CanLII dataset's legal opinion and summary pair (ID: a_1991canlii2497).

Table 1: An example of $\mathcal{X}$, which consists of three data points in different formats that share the same reference summary. In the table, S1 refers to the first sentence of the text, S2 to the second sentence, and so on. <IRC>, <Issue>, and <Reason> are the argumentative marker tokens described in Appendix C.

Table 2: Summarization ROUGE (R1, R2, RL) and BERTScore (BS) cross-validation results. Best results in each column are bolded when obtained with oracle markers and italicized when obtained with predicted markers. For the full framework (rows 11/13), * indicates results that are statistically significant in all scores over the best argument-aware baseline (row 3).
Yue Dong, Andrei Mircea, and Jackie Chi Kit Cheung. 2021. Discourse-aware unsupervised summarization for long scientific documents. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1089-1102.

Table 3: An example of using the argumentative marker tokens to highlight the start and end of a "Reason" sentence.