Extractive Summarization via ChatGPT for Faithful Summary Generation

Extractive summarization is a crucial task in natural language processing that aims to condense long documents into shorter versions by directly extracting sentences. The recent introduction of large language models has attracted significant interest in the NLP community due to their remarkable performance on a wide range of downstream tasks. This paper first presents a thorough evaluation of ChatGPT's performance on extractive summarization and compares it with traditional fine-tuning methods on various benchmark datasets. Our experimental analysis reveals that ChatGPT exhibits inferior extractive summarization performance in terms of ROUGE scores compared to existing supervised systems, while achieving higher performance based on LLM-based evaluation metrics. In addition, we explore the effectiveness of in-context learning and chain-of-thought reasoning for enhancing its performance. Furthermore, we find that applying an extract-then-generate pipeline with ChatGPT yields significant improvements in summary faithfulness over abstractive baselines. These observations highlight potential directions for enhancing ChatGPT's capabilities in faithful summarization using two-stage approaches.


Introduction
Document summarization aims to compress text material while retaining its most salient information. With the increasing amount of publicly available text data, automatic summarization approaches have become increasingly important. These approaches can be broadly classified into two categories: abstractive and extractive summarization. While abstractive methods (Nallapati et al., 2016; Gupta and Gupta, 2019) have the advantage of producing flexible and less redundant summaries, they often struggle with generating ungrammatical or even nonfactual content (Kryściński et al., 2019; Zhang et al., 2022b). In contrast, extractive summarization directly selects sentences from the source document to form the summary, resulting in summaries that are grammatically correct and faithful to the original text.
The growing interest in applying advanced large language models (LLMs) such as ChatGPT to text summarization tasks has sparked significant attention. A recent study by Goyal et al. (2022) compared GPT-3 with traditional fine-tuning methods and found that, despite lower ROUGE scores, human annotators preferred the GPT-3-generated text. Another study by Zhang et al. (2023d) conducted a comprehensive analysis of large language models for news summarization and found that the generated summaries were comparable to those produced by humans. However, existing research (Yang et al., 2023; Luo et al., 2023) has only focused on abstractive summarization approaches, and the performance of ChatGPT on extractive summarization remains an open question. Moreover, the hallucination problem has dramatically hindered the practical use of abstractive summarization systems, highlighting the need to explore extractive summarization with LLMs for faithful summaries.
In this study, we comprehensively evaluate ChatGPT's performance on extractive summarization and investigate the effectiveness of in-context learning and chain-of-thought explanation approaches. Our experimental analysis demonstrates that ChatGPT exhibits inferior extractive summarization performance in terms of ROUGE scores compared to existing supervised systems, while achieving higher performance based on LLM-based evaluation metrics. Additionally, we observe that using an extract-then-generate pipeline with ChatGPT yields large improvements in summary faithfulness over abstractive baselines.
The main contributions of this paper are: 1) This study represents the first attempt to extend the application of ChatGPT to extractive summarization and evaluate its performance. 2) We investigate the effectiveness of in-context learning and chain-of-thought reasoning approaches for extractive summarization with ChatGPT. 3) We further extend the extraction step to abstractive summarization and find that the extract-then-generate framework can improve summary faithfulness by a large margin compared to abstractive-only baselines without hurting summary quality.

Related Work
Most extractive summarization works formulate the task as a sequence classification problem and use sequential neural models with diverse encoders such as recurrent neural networks (Cheng and Lapata, 2016; Nallapati et al., 2016) and pre-trained language models (Liu and Lapata, 2019; Zhang et al., 2023b). Another group of works formulates extractive summarization as a node classification problem and applies graph neural networks to model inter-sentence dependencies (Xu et al., 2019; Wang et al., 2020; Zhang et al., 2022a, 2023a).
Several studies have also explored the use of large language models (Brown et al., 2020) for summarization. Goyal et al. (2022) found that while GPT-3 summaries obtained slightly lower ROUGE scores than fine-tuned models, human evaluators preferred them. Likewise, Zhang et al. (2023d) reported that large language model-generated summaries were on par with human-written summaries in the news domain. In addition, Yang et al. (2023) explored the limits of ChatGPT on query-based summarization beyond generic summarization. Luo et al. (2023) explored the use of ChatGPT as a factual inconsistency evaluator for abstractive text summarization. Zhang et al. (2023c) proposed a self-evaluation and revision framework with ChatGPT. While most existing research has focused on abstractive summarization, this work aims to investigate the applicability of ChatGPT to extractive summarization and examine whether extractive methods can enhance abstractive summarization faithfulness.

Task Formulation
Extractive summarization systems form a summary by identifying and concatenating the most salient sentences from a given document. These approaches have gained widespread traction in various real-world applications owing to their ability to produce accurate and trustworthy summaries devoid of grammatical inconsistencies.
Formally, given a document d consisting of n sentences, the goal of an extractive summarization system is to produce a summary s comprising m (m ≪ n) sentences by directly extracting relevant sentences from the source document. Most existing work formulates this as a sequence labeling problem, where a model M selects sentences based on the probability that each should be included in the summary s:

p(y_i | s_i, d) = M(s_i, d), y_i ∈ {0, 1}, (1)

where y_i indicates whether the i-th sentence s_i is included in the summary. In the training of supervised summarization models, it is common to employ a greedy algorithm, as described in Nallapati et al. (2017), to generate extractive ground-truth labels (ORACLE) by selecting multiple sentences that maximize the ROUGE score compared to the gold summary.
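The greedy ORACLE construction can be sketched as follows. This is an illustrative sketch using a simplified ROUGE-1 F1 (no stemming or stopword handling), not the exact implementation of Nallapati et al. (2017):

```python
from collections import Counter

def rouge1_f(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1 between two token lists."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def greedy_oracle(doc_sents, gold_summary, max_sents=3):
    """Greedily add the sentence that most improves ROUGE-1 against the
    gold summary; stop when no sentence improves the score."""
    ref = gold_summary.lower().split()
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i, sent in enumerate(doc_sents):
            if i in selected:
                continue
            cand = " ".join(doc_sents[j] for j in sorted(selected + [i]))
            gains.append((rouge1_f(cand.lower().split(), ref), i))
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:
            break
        best = score
        selected.append(idx)
    return sorted(selected)
```

The ORACLE labels are then the binary indicators y_i = 1 for the returned indices and y_i = 0 otherwise.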

In-context Learning
Recent studies have shown that large language models have strong few-shot performance on various downstream tasks, known as in-context learning (ICL) (Brown et al., 2020). The standard ICL setting prompts a large language model M with a set C = {(d_1, s_1), ..., (d_k, s_k)} of k exemplar document-summary pairs and predicts a summary ŝ for the document d by:

ŝ = argmax_s p_M(s | d, C). (2)

Beyond simple input-output pairs, previous works show that including explanations and chain-of-thought (COT) reasoning in prompts (Nye et al., 2021; Wei et al., 2022) also benefits language models, represented as:

ŝ = argmax_s p_M(s | d, C'), (3)

where C' = {(d_1, e_1, s_1), ..., (d_k, e_k, s_k)} is the set of input-explanation-output triplets in the prompts. Besides the zero-shot setting, this study also investigates the impact of in-context learning on extractive summarization, with and without explanations.
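As a concrete illustration, a k-shot prompt with optional explanations might be assembled as below. The instruction wording and field names here are hypothetical, not the exact prompts used in our experiments (those are listed in Table 4):

```python
def build_icl_prompt(exemplars, document, with_reasoning=False):
    """Assemble a k-shot prompt for extractive summarization.

    exemplars: list of dicts with 'document', 'summary', and optionally
    'explanation' keys. Explanations are interleaved only when
    with_reasoning=True, mirroring the ICL-with-COT setting.
    """
    instruction = ("Extract the most salient sentences from the document "
                   "to form its summary.")
    parts = [instruction]
    for ex in exemplars:
        block = f"Document: {ex['document']}"
        if with_reasoning and "explanation" in ex:
            block += f"\nExplanation: {ex['explanation']}"
        block += f"\nSummary: {ex['summary']}"
        parts.append(block)
    # The query document comes last, leaving the summary slot open.
    parts.append(f"Document: {document}\nSummary:")
    return "\n\n".join(parts)
```

With k = 0 exemplars this reduces to the zero-shot prompt; with explanations included it corresponds to the in-context reasoning setting.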

Extract-abstract Summarization
Using extractive summaries to guide abstractive summary generation is not new (Dou et al., 2020; Wang et al., 2022). Here we also propose to use the LLM in a two-stage manner: first extract salient sentences to form an extractive summary s^E, and then ask the LLM to generate a summary guided by the extractive summary, represented as:

p(s | d, s^E) = ∏_t p(s_t | s_{<t}, d, s^E), (4)

where s_{<t} denotes the tokens generated before step t. We explore this extract-then-generate pipeline in this study, aiming to alleviate the hallucination problem in LLM summary generation.
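The two-stage pipeline can be sketched as follows, with chat_fn standing in for any text-in/text-out LLM call. The prompt wording is illustrative rather than the exact prompts from our experiments:

```python
def extract_then_generate(document, chat_fn, num_sents=3):
    """Two-stage pipeline: (1) ask the LLM for an extractive summary s^E,
    (2) ask it to generate an abstractive summary grounded in s^E.

    chat_fn: any callable mapping a prompt string to a completion string
    (e.g. a wrapper around a chat API); treated as a black box here.
    """
    extract_prompt = (
        f"Extract the {num_sents} most important sentences, verbatim, "
        f"from the following document:\n\n{document}"
    )
    extractive = chat_fn(extract_prompt)
    generate_prompt = (
        "Write a concise abstractive summary of the document below. "
        "Stay faithful to these extracted key sentences:\n\n"
        f"Key sentences: {extractive}\n\nDocument: {document}"
    )
    return chat_fn(generate_prompt), extractive
```

Replacing the first stage's output with the ORACLE sentences yields the Oracle-Abs setting used in our experiments.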
We selected the best prompts on a dev set of 50 examples and randomly sampled 1000 examples from each test set of the original datasets for evaluation. The detailed prompts used in the experiments and further details about the experimental setup can be found in Table 4 and Appendix B.

Experiments Results
The overall results are shown in Table 1. The upper block includes extractive results and SOTA scores from MatchSum (Zhong et al., 2020). The lower block includes abstractive results and SOTA scores from BRIO (Liu et al., 2022) for CNN/DM and XSum, SummaReranker (Ravaut et al., 2022) for Reddit, and GSum (Dou et al., 2020) for PubMed.
It is observed that ChatGPT generally achieves lower ROUGE scores than previous fine-tuning methods on all datasets under both extractive and abstractive settings, but achieves higher scores on the LLM-based evaluation metric G-EVAL. These findings are consistent with the previous conclusions in (Goyal et al., 2022; Zhang et al., 2023d). We also observe that ChatGPT-Ext outperforms ChatGPT-Abs on the two extractive datasets CNN/DM and PubMed while performing worse on the other two abstractive datasets. We argue that these results stem from bias within the reference summaries of the datasets and the limitations of ROUGE scores. Nonetheless, we note that despite being primarily designed for generation tasks, ChatGPT achieves impressive results in extractive summarization, which requires comprehension of the documents. The decoder-only structure of ChatGPT does not degrade its comprehension capability compared to encoder models like BERT. We also find that the ROUGE score gap between ChatGPT and SOTA fine-tuned baselines is smaller in the extractive setting than in the abstractive setting.
The results also indicate that in-context learning and reasoning are generally beneficial for the extractive summarization task across the four datasets from different domains. We only observe performance degradation for in-context learning on the XSum dataset. We argue that the degradation comes from the short ORACLE of XSum, which causes more confusion when only a few ORACLE examples are provided. However, with chain-of-thought reasoning explanations, ChatGPT can better understand the pattern and thus shows improvements with in-context reasoning. More in-context learning results can be found in Table 5 in the Appendix.

Extract Then Generate
We conduct further experiments to examine the effectiveness of the extract-then-generate framework as presented in Table 3.
The results show large improvements in summary factual consistency across all four datasets with the extract-then-generate framework. Notably, the FactCC scores are extremely low for generate-only baselines (less than 10 percent), highlighting the hallucination problem of ChatGPT-based summarization, where ChatGPT tends to make up new content in the summary. Nevertheless, the extract-then-generate framework effectively alleviates the hallucination problem of abstractive summaries by guiding the summary generation process with salient sentences extracted from the documents. We also find that guiding ChatGPT's summary generation with its own extracted summaries leads to similar faithfulness improvements as guiding generation with ORACLE.
In terms of summary quality, the results demonstrate that ChatGPT's ROUGE scores improve substantially when grounded with the ORACLE summaries. However, when grounded with its own extractive summaries, the ROUGE performance of the extract-then-generate framework relies heavily on the extractive performance. In summary, the extract-then-generate framework can effectively improve summary faithfulness with similar or even better summary quality.

Positional Bias
Lead bias is a common phenomenon in extractive summarization, especially in the news domain, where the early parts of an article often contain the most salient information. As shown in Figure 1, we find that the position distribution of the summary sentences extracted by ChatGPT is more heavily skewed toward early positions than that of the ORACLE sentences. In addition, in-context learning introduces more positional bias into the summaries. These results indicate that LLMs may rely on superficial features like sentence position for extractive summarization.
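The position distribution in Figure 1 can be computed by normalizing each extracted sentence index by its document length and binning the results; a minimal sketch:

```python
def position_histogram(extracted_indices, doc_lengths, num_bins=10):
    """Histogram of relative sentence positions for extracted summaries.

    extracted_indices: per-document lists of extracted sentence indices.
    doc_lengths: number of sentences in each document.
    Returns counts per bin over the normalized position range [0, 1).
    """
    bins = [0] * num_bins
    for idxs, n in zip(extracted_indices, doc_lengths):
        for i in idxs:
            # Normalize index to [0, 1), then map to a bin; clamp the
            # last index into the final bin.
            b = min(int(i / n * num_bins), num_bins - 1)
            bins[b] += 1
    return bins
```

A distribution concentrated in the first bins for ChatGPT extractions, relative to the ORACLE distribution, is what we refer to as positional (lead) bias.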

Conclusion
This paper presents a thorough evaluation of ChatGPT's performance on extractive summarization across four benchmark datasets. The results indicate ChatGPT's strong potential for the task and the possibility of generating more factual summaries using the extract-then-generate framework. Overall, this study suggests that ChatGPT is a powerful tool for text summarization, and we hope the insights gained from this work can guide future research in this area.

Limitations
Instead of conducting experiments on the entire test sets, we randomly sample 1000 examples from each test set due to budget limits. Previous research efforts (Goyal et al., 2022; Zhang et al., 2023d) have likewise tested GPT-3 on a small number of instances.
Our experimental results are mainly evaluated with various automatic metrics (summary quality and faithfulness).We plan to include a human study to further verify the conclusions in the future.
We only use the gpt-3.5-turbo model from the OpenAI API as an instance of large language models. The emphasis of this paper is to explore extractive summarization and the extract-then-generate pipeline with ChatGPT.

E Case Study
Here we show the ChatGPT-generated summaries under different prompt settings in Table 6 for one example from the CNN/DM dataset.

ORACLE
Tough challenge: the 36-year-old daredevil will walk atop the Orlando Eye as it turns on April 29. The seventh-generation member of the 'Flying Wallenda' family of acrobats has walked across the Grand Canyon and Niagara Falls. In November, he walked twice between two Chicago skyscrapers without a net or harness, doing the second walk blindfolded.

Abstractive
Daredevil Nik Wallenda plans to walk untethered atop the 400-foot Orlando Eye observation wheel on April 29th. The 36-year-old high-wire performer, a seventh-generation member of the "Flying Wallenda" acrobat family, has previously completed daring walks across the Grand Canyon, Niagara Falls, and between Chicago skyscrapers. Although the exact distance for this stunt hasn't been determined, Wallenda stated he's unlikely to walk the entire 20 minutes it takes for the wheel to complete a full circle.

Extractive
Daredevil Nik Wallenda says he'll walk untethered on top of a 400-foot observation wheel in Orlando, Florida, this month. Wallenda said Monday at a New York City news conference that the Orlando Eye will be moving when he attempts his feat on April 29. The 36-year-old daredevil will walk atop the Orlando Eye as it turns on April 29.

In-Context
Daredevil Nik Wallenda says he'll walk untethered on top of a 400-foot observation wheel in Orlando, Florida, this month. Wallenda said Monday at a New York City news conference that the Orlando Eye will be moving when he attempts his feat on April 29. Wallenda is the great-grandson of Karl Wallenda, who fell to his death in Puerto Rico at 73.

Extract-abstract

Nik Wallenda, a daredevil and seventh-generation member of the "Flying Wallenda" family of acrobats, will attempt to walk untethered on top of a 400-foot observation wheel in Orlando, Florida, on April 29. The Orlando Eye, part of a new entertainment complex, will offer views of central Florida from inside 30 enclosed, air-conditioned glass capsules when it opens to the public on May 4. Wallenda previously walked atop a Ferris wheel in Santa Cruz, California, but he said the size of the much larger Orlando wheel and the fact that he will not use a pole set this stunt apart. Wallenda is the great-grandson of Karl Wallenda, who fell to his death in Puerto Rico at 73.

Figure 1: Position distribution of extracted sentences on 1000 random samples from the CNN/DM test set.
The ROUGE score distribution over documents of different lengths is presented in Figure 2. Our findings suggest that ChatGPT maintains consistent performance across document lengths, indicating the model's robustness in the context of extractive summarization.

Figure 2: ROUGE-1 score distribution over documents of different lengths.

Table 1: Summarization results on four benchmark datasets. '+context' and '+reason' refer to ChatGPT with three in-context examples and human reasoning. The best results in both extractive and abstractive settings are in bold.

Table 2: Detailed statistics of the datasets. Doc # words and Sum # words refer to the average number of words in the source documents and summaries. # Ext refers to the number of sentences to extract.

Table 3: Summarization results of the extract-then-generate pipeline. Abs, Ext-Abs, and Oracle-Abs refer to the generate-only baseline, the extract-then-generate pipeline, and generation based on ORACLE, respectively.
Document

Daredevil Nik Wallenda says he'll walk untethered on top of a 400-foot observation wheel in Orlando, Florida, this month. Wallenda said Monday at a New York City news conference that the Orlando Eye will be moving when he attempts his feat on April 29. The Orlando Eye, part of a new entertainment complex, will offer views of central Florida from inside 30 enclosed, air-conditioned glass capsules when it opens to the public on May 4. Eyes on the prize: high-wire performer Nik Wallenda announces his latest stunt at the 400-foot Orlando Eye, during a news conference, in New York on Monday. Tough challenge: the 36-year-old daredevil will walk atop the Orlando Eye as it turns on April 29. The Orlando Eye team issued a statement saying it's excited to have Wallenda attempt the 'amazing stunt.' No distance for the performance has been set yet, but Wallenda, 36, said he was not likely to walk the entire 20 minutes or so that it takes the wheel to go a full circle. Wallenda previously walked atop a Ferris wheel in Santa Cruz, California, but he said the size of the much larger Orlando wheel and the fact that he will not use a pole sets this stunt apart. The seventh-generation member of the 'Flying Wallenda' family of acrobats has walked across the Grand Canyon and Niagara Falls. In November, he walked twice between two Chicago skyscrapers without a net or harness, doing the second walk blindfolded. Wallenda is the great-grandson of Karl Wallenda, who fell to his death in Puerto Rico at 73.

Reference

The 36-year-old will stage his next stunt on April 29. In November, Wallenda walked back and forth between two Chicago skyscrapers in a live television event. His great-grandfather Karl Wallenda died in a tightrope walk in Puerto Rico in 1978. Wallenda has also tightrope walked across Niagara Falls and the Grand Canyon.

Table 6: Case study of different settings.