Prompted Opinion Summarization with GPT-3.5

Large language models have shown impressive performance across a wide variety of tasks, including text summarization. In this paper, we show that this strong performance extends to opinion summarization. We explore several pipeline methods for applying GPT-3.5 to summarize a large collection of user reviews in a prompted fashion. To handle arbitrarily large numbers of user reviews, we explore recursive summarization as well as methods for selecting salient content to summarize through supervised clustering or extraction. On two datasets, an aspect-oriented summarization dataset of hotel reviews (SPACE) and a generic summarization dataset of Amazon and Yelp reviews (FewSum), we show that GPT-3.5 models achieve very strong performance in human evaluation. We argue that standard evaluation metrics do not reflect this, and introduce three new metrics targeting faithfulness, factuality, and genericity to contrast these different methods.


Introduction
Recent years have seen several shifts in summarization research, from primarily extractive models (Erkan and Radev, 2004; Gu et al., 2022; Kwon et al., 2021; Jia et al., 2020; Zhong et al., 2020) to abstractive models with copy mechanisms (See et al., 2017; Song et al., 2018; Gehrmann et al., 2018) to pre-trained models (Devlin et al., 2019; Isonuma et al., 2021; Lewis et al., 2020; Zhang et al., 2020a; He et al., 2020). GPT-3 (Brown et al., 2020; Wu et al., 2021; Saunders et al., 2022; Goyal et al., 2022) and GPT-4 represent another shift: they show excellent zero- and few-shot performance across a variety of text generation tasks. However, their capabilities have not been extensively benchmarked for opinion summarization. Unlike news, where extractive lead baselines are often highly effective, opinion summarization requires balancing contradictory opinions and a higher degree of abstraction to convey all of the viewpoints faithfully.
In this paper, we apply GPT-3.5, specifically the text-davinci-002 model, to the task of opinion summarization, focusing on reviews of products, hotels, and businesses. Applying GPT-3.5 in this setting is not straightforward, as the combined length of the reviews or posts may exceed the model's maximum input length. Furthermore, we find that certain styles of inputs can lead to GPT-3.5 simply echoing back an extract of the inputs. To mitigate these issues, we explore a family of pipelined approaches, specifically (1) filtering a subset of sentences with an extractive summarization model, (2) chunking with repeated summarization, and (3) review-score-based stratification. In the context of aspect-oriented summarization, we also explore the inclusion of a sentence-wise topic prediction and clustering step.
We show that our approaches yield high-quality summaries according to human evaluation. The errors of the systems consist of subtle issues of balancing contradictory viewpoints and erroneous generalization of specific claims, which are not captured by metrics like ROUGE (Lin, 2004) or BERTScore (Zhang et al., 2020b). This result corroborates work calling for a re-examination of current metrics (Fabbri et al., 2021; Tang et al., 2023) and the need for fine-grained evaluation (Gehrmann et al., 2022). We therefore introduce a set of metrics, using entailment as a proxy for support, to measure the factuality, faithfulness, and genericity of produced summaries. These metrics measure the extent of over-generalization of claims and misrepresentation of viewpoints while ensuring that summaries are not overly generic.
Our results show that basic prompted GPT-3.5 produces reasonably faithful and factual summaries when the input reviews are short (fewer than 1000 words); more sophisticated techniques do not show much improvement. However, as the input size grows larger, repeated summarization leads GPT-3.5 to produce generalized and unfaithful selections of viewpoints relative to the first round. We demonstrate that using QFSumm (Ahuja et al., 2022), an extractive summarization model, to filter out sentences prior to GPT-3.5 (instead of multi-level summarization) can slightly help with factuality and faithfulness. The resulting summaries also present a more specific selection of viewpoints but are generally shorter and use a higher proportion of common words. A topic-wise clustering and filtering step pre-pended to the pipeline alleviates these issues while relinquishing a portion of the gains on factuality and faithfulness.

[Figure 1: An example pipeline combining (T)opic classification of sentences, (C)hunk summarization, and (G)eneration of the final summary. Review sentences (e.g., "The rooms were so clean!", "Stained carpets and untidy beds... ew.") are classified by topic, each chunk is summarized (e.g., "The staff was found to be polite and friendly, with special praise given to the housekeeping staff."; intermediate summaries are denoted, e.g., S_2 = S_2(C_1 | service)), and a final summary is generated ("The reviews were generally positive about the service, with praise for the housekeeping staff and chefs. Some reviewers did find their room damp and dark, but were happy to be upgraded to a better suite.").]
Our main contributions are: (1) We introduce two approaches to long-form opinion summarization with GPT-3.5, namely, hierarchical GPT-3.5 summarization with chunking, and pre-extraction with an extractive summarization model. (2) We establish the strength of these approaches with a human study and demonstrate the need for objective and automatic means of evaluation. (3) We develop three entailment-based metrics for factuality, faithfulness, and genericity that are better suited to evaluating extremely fluent summaries than metrics based on n-gram matching. The relevant artifacts and code for this work are publicly available at https://github.com/testzer0/ZS-Summ-GPT3/.

Motivation and Problem Setting
Review summarization involves the summarization of the text of multiple reviews of a given product or service into a coherent synopsis. More formally, given a set of reviews R = {R_i}_{i=1}^{n}, with review R_i consisting of l_i sentences {r_{ij}}_{j=1}^{l_i}, we define a summarization system S to be a function that takes as input the combined reviews C and produces k output sentences S = {s_i}_{i=1}^{k}, written as S = S(C), where C ≡ combine(R) is typically obtained by concatenating the review sentences. We use the notation combine to refer to the combination of both sentences and reviews.
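This staged formulation can be sketched in code. The `Stage` functions below are hypothetical stand-ins for concrete extractors or GPT-3.5 summarizer calls, not the paper's released implementation:

```python
from typing import Callable, List

Review = List[str]                        # a review is a list of sentences
Stage = Callable[[List[str]], List[str]]  # each system condenses sentences

def combine(reviews: List[Review]) -> List[str]:
    """Flatten the reviews into one pooled list of sentences (C = combine(R))."""
    return [sentence for review in reviews for sentence in review]

def run_pipeline(stages: List[Stage], reviews: List[Review]) -> List[str]:
    """Feed the condensed output of each system into the next one."""
    sentences = combine(reviews)
    for stage in stages:
        sentences = stage(sentences)
    return sentences
```

For instance, a two-stage pipeline could pass an extractor's selected sentences into a final summarizer, mirroring the extract-then-summarize pipelines described later.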
We can also instantiate this pipeline for aspect-oriented review summarization, which involves the summarization of multiple reviews conditioned on an aspect a (such as 'cleanliness'). In particular, the summarization is written as S = S(C | a). We consider aspect-agnostic review summarization as a special case of aspect-oriented review summarization with the aspect 'none' for notational simplicity.

Desiderata
Opinion summaries should demonstrate three key characteristics.
First, the summaries should be faithful, i.e., select the most subjectively important viewpoints with the largest consensus. For instance, if five reviews raised the issue of small rooms while eight complained about dusty carpets, choosing (due to a limited output size) to discuss the latter over the former would be considered faithful. Thus, faithfulness is about careful management of the word budget given the constrained output length.
The summaries should also be factual, i.e., report information grounded in statements that actually do appear in the set of reviews, without containing extrinsic hallucinations. For instance, if five reviews found hotel rooms to be small, but three found them large, the statement The rooms were large is considered factual despite the viewpoint being in the minority. By contrast, A pipe burst and flooded my room is unfactual if this is never actually reported in the reviews.

Finally, the summaries should be relevant: the points raised in them should only discuss topics relevant to the specified aspect. For example, in a summary about the cleanliness of a hotel room, bad food should be omitted even if it was frequently brought up in the reviews.

[Table 1: The pipelines compared for SPACE and FewSum, and their constituents.]

Framework
Based on the desiderata, we need to ensure that the summaries represent all of the reviews; however, the reviews are too many in number and too long in combined length. We therefore define a summarization pipeline to be a series of summarization systems S_1, ..., S_m, where each system takes as input the condensed results of the previous system. We showcase an example pipeline in Figure 1, with one stage extracting the relevant sentences from the reviews and the next summarizing the extracted sentences.

GPT-3.5 Summarization Pipelines
The components of our summarization pipelines may be broadly categorized into extractors and summarizers, which we describe next. More details can be found in Appendix A.

Example Reference Summary
The room itself was very nice and quite large by normal hotel standards. The small kitchenette in the room included a coffee maker, microwave oven, and a refrigerator. The beds were comfortable and the sheets good quality, but the furniture is pretty dated, and the bathroom very tired looking and small.

QFSumm (Extractive) [Q]
The Primrose is a good hotel for people who plan on staying just a few dates in Toronto and plan on only sleeping there. The hotel parking was a little expensive (CAN $15) and the garage is compact so I would be careful if you drive a big car or SUV. The bathroom needed a little work but it was good enough for my needs.

AceSum [A]
The room was spacious and comfortable. The bathroom was a bit big, but the bathroom had a king size bed and a sofa.

Topicwise Clustering + GPT-3.5-Chunking + GPT-3.5 [TCG]
Overall, reviewers thought the rooms were spacious, clean, comfortable, and a good value for the price. Some reviewers noted that the hotel seemed to be aging, with noise from the air conditioning unit and slow drainage in the shower, but these were not major concerns.

Aspect: Rooms (Hotel ID 182002)
Figure 2: Example summaries from TCG, Q, and A, and a reference summary from the SPACE dataset.

First, extractors select relevant parts of a set of reviews, optionally conditioned on an aspect. Our extractors include:

GPT-3.5 Topic Clustering (T) We prompt GPT-3.5 to produce a single-word topic for each sentence, which we map to the closest aspect with GloVe (Pennington et al., 2014) similarity. This defines a set of sentences to be used for aspect-based summarization. This step is only used for pipelines on SPACE, as FewSum is aspect-agnostic.
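The topic-to-aspect mapping can be sketched as follows. The `vectors` argument is a stand-in for pretrained GloVe embeddings, which would be loaded separately:

```python
import math
from typing import Dict, List

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def closest_aspect(topic: str,
                   aspects: List[str],
                   vectors: Dict[str, List[float]]) -> str:
    """Map a one-word topic predicted by GPT-3.5 to the nearest aspect
    by embedding-space similarity."""
    return max(aspects, key=lambda aspect: cosine(vectors[topic], vectors[aspect]))
```

Sentences whose mapped aspect matches the query aspect would then be retained for summarization.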
QFSumm-long (Q) We use the aspect-specific extractive summarization model introduced by Ahuja et al. (2022) to extract up to the 35 most relevant sentences from the input text. QFSumm was designed to allow extremely long inputs, and thus no truncation is required at this stage.
Review Stratification (R) This involves clustering reviews by reviewer scores (given in the dataset) and summarizing each cluster with GPT-3.5.
In addition to extractors, we also utilize GPT-3.5-chunking (C) in some of our pipelines. We segment the sentences from the prior step into non-overlapping chunks, then summarize each individually with GPT-3.5. The results are then concatenated for the next step.
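A minimal sketch of this chunking step, assuming a `summarize` callable that wraps a prompted GPT-3.5 request; the word budget below is illustrative rather than the exact value used:

```python
from typing import Callable, List

def chunk_summarize(sentences: List[str],
                    summarize: Callable[[str], str],
                    max_words: int = 900) -> List[str]:
    """Split sentences into non-overlapping chunks of at most `max_words`
    words, summarize each chunk, and return the per-chunk summaries
    (to be concatenated for the next pipeline stage)."""
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return [summarize(chunk) for chunk in chunks]
```

Applied repeatedly, this yields the recursive (multi-level) summarization compared against in the experiments.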
Our summarizers summarize the text one final time to produce the output summary. All of our pipelines use GPT-3.5 as the summarizer. However, we also compare to QFSumm (Ahuja et al., 2022), AceSum (Amplayo et al., 2021a), and the model released with FewSum (Bražinskas et al., 2020a), also referred to as FewSum. These building blocks are composed to build various summarization pipelines, which we list in Table 1. An illustration of one pipeline (TCG) is shown in Figure 1. Since topic-wise clustering is unnecessary for FewSum (due to the lack of aspects), we only compare G (vanilla GPT-3.5 used to summarize the set of product reviews, truncated to fit if necessary), CG (Chunking + GPT-3.5), QG (QFSumm-long + GPT-3.5), Q (QFSumm), and FS (FewSum) for this dataset. The table also lists some approaches that are the first stages of pipelines beginning with GPT-3.5-chunking, which we also compare against in Section 5.

Datasets
SPACE (Amplayo et al., 2021a) involves the summarization of reviews of hotels along the aspects {general, rooms, building, cleanliness, location, service, food} and provides three human-written summaries for each (hotel, aspect) pair. Table 3 shows that the reviews of SPACE are too long to summarize with a non-pipelined system given text-davinci-002's context window size. We exclude the general aspect from our experiments.

Here's what some reviewers said about a hotel:
The location couldn't be better and there are lots of restaurants around the hotel. We left early to get to the parks and arrived back late after shopping at the outlets, so it did us for a week.
[…] Walking distance to many restaurants and shopping.
The hotel is about 10 minutes from Disney and 20 from Universal or Sea World. […] Has to be the word place I ever stayed. It's not the best location as you do need to drive everywhere but that was all part of the fun for us!! Summarize what reviewers thought of the location:

GPT-3.5 (G stage)
The vast majority of reviewers thought that the location was great. They especially appreciated how close it was to the park and to restaurants. A few reviewers thought that the location was not the best because it was not centrally located and you had to drive to get to most places.

Figure 3: Example of errors made by GPT-3.5. The viewpoint of a single reviewer is wrongly expressed as that of a "few reviewers" and generalized to the hotel not being centrally located, contradicting other reviews (blue).
FewSum (Bražinskas et al., 2020a) contains product reviews from Amazon and Yelp. As opposed to SPACE, FewSum is not aspect-oriented, and the reviews are typically much shorter. For many of the products, the combined length of the reviews falls below 900 words, enabling direct summarization with GPT-3.5. FewSum provides three gold summaries for only a small portion of the products. Across the two splits, FewSum provides gold summaries for 32 and 70 products in the Amazon and Yelp categories, respectively. We list SPACE and FewSum statistics in Table 3.
The BERTScores for AceSum, as well as all GPT-3.5-related models, are in the range of 88-90, and differences in performance are unclear. AceSum achieves the highest ROUGE-1 as well as ROUGE-L scores by far, and is followed by TQG and QG. QFSumm does particularly poorly on the ROUGE scores. On FewSum, the scores are all in the same ballpark apart from FS, making it difficult to draw any conclusions; FS achieves the highest ROUGE-L as well as BERTScore. The GPT-3.5 systems perform slightly better than QFSumm on the Yelp split, which we attribute to the smaller combined review lengths of Yelp. We argue that these scores are not informative and that they are at times unreliable when comparing the quality of two summaries. ROUGE and BERTScore have been critiqued in prior work as inaccurate indicators of summary quality (Fabbri et al., 2021; Liu and Liu, 2008; Cohan and Goharian, 2016), particularly as the fluency and coherence of the outputs increase to near-human levels (Goyal et al., 2022). Figure 2 demonstrates this with an example. n-gram methods penalize GPT-3.5 for generating summaries in a slightly different style: "The reviewers found the rooms to be clean" instead of "The rooms were clean." Similarly, the extractive nature of QFSumm drives it to produce sentences like "We were served warm cookies on arrival." While its selections are factual, they are not completely representative of the review opinions themselves. The actual mistakes in our systems include over-generalization and misrepresentation of viewpoints or of their popularity, which are not well captured by matching n-grams. Figure 3 shows an example of such errors. We conclude that metrics benchmarking the summaries on different dimensions are necessary.

Human Evaluation
For a more reliable view of performance, we manually evaluated the summaries of the pipelines TCG, TQG, AceSum (A), and QFSumm (Q) for 50 randomly chosen (hotel, aspect) pairs from the SPACE dataset, and G, CG, QG, Q, and FS for 50 randomly chosen products (25 each from the Amazon and Yelp splits) from the FewSum dataset. The axes of evaluation were the attributes established in Subsection 2.1, namely Factuality, Faithfulness, and Relevance. In addition, as we often observed our systems produce summaries of the form "While most reviewers thought ..., some said ..." to highlight contrasting opinions, we also evaluate on Representativeness. Representativeness is a more restricted form of Faithfulness that measures whether the more popular opinion was exhibited between two opposing ones. For instance, if four people found the rooms of a hotel clean but two did not, the summary is expected to convey that the former was the more popular opinion.
The three authors of this paper independently rated the summaries along the above axes, on Likert scales of 1-3 for the two factuality-style axes (Factuality and Representativeness) and 1-5 for Faithfulness and Relevance. The average scores, along with the Krippendorff's Alpha and Fleiss' Kappa scores (measuring consensus among the raters), are presented in Table 4. Among the compared pipelines, TCG improves upon TQG and QG substantially in terms of relevance. All three have a very high score under Factuality, showing that GPT-3.5 models seldom make blatantly wrong statements. Viewpoints selected by QFSumm are generally faithful and factual due to their extractive nature, but may include irrelevant statements.
We list the corresponding metrics for FewSum in Table 5. CG tends to perform well, but the consensus is low for Faithfulness and Relevance. FS performs poorly across the board due to hallucinated statements harming its Factuality and bad viewpoint selection resulting in low Faithfulness. The lack of aspects may contribute to the low agreement on FewSum; dimensions such as Relevance may be considered underconstrained, and thus more difficult to agree upon in this setting (Kryscinski et al., 2019).
We remark that all of our systems achieve close to the maximum scores; the small differences should not obscure the fact that the pipelines all demonstrate very strong performance across the board.

New Tools for Evaluation and Analysis
Enabling fast automatic evaluation of systems will be crucial for the development of future opinion summarizers. Furthermore, when a large number of reviews are presented to a system, it may be nearly impossible even for a dedicated evaluator to sift through all of them to evaluate a summary. We investigate the question of how we can automate this evaluation using existing tools.
One of the areas where automatic evaluation may help is faithfulness. Since faithfulness represents the degree to which a system is accurate in representing general consensus, it requires measuring the proportion of reviews supporting each claim of a summary. A viewpoint with larger support is more popular and, consequently, more faithful. Our key idea is to use entailment as a proxy for support. Past work (Goyal and Durrett, 2021; Laban et al., 2022) has used Natural Language Inference (NLI) models to assess summary factuality by computing entailment scores between pairs of sentences.
However, the summaries produced by GPT-3.5 and related pipelines often consist of compound sentences that contrast two viewpoints. In addition, GPT-3.5 prefers to say "The reviewers said..." instead of directly stating a particular viewpoint. We found these artifacts to impact the entailment model. We use a split-and-rephrase step to split these sentences into atomic value judgments by prompting GPT-3.5, as shown in Figure 4. We then use the zero-shot entailment model from SummaC (Laban et al., 2022) to compute the entailment scores for these atomic value judgments. Similar to the approach in the SummaC paper, we observe that a summary statement is factual when strongly entailed by at least one sentence; we thus select the top entailment score of each summary sentence as its factuality score, and aggregate this score to produce per-system numbers. The choice of the model, as well as that of using GPT-3.5 for the split-and-rephrase step, are explained further in Appendix B, and the relevant metric of abstractiveness is discussed in Appendix D.
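The top-score aggregation can be sketched as follows, with `entail` standing in for the SummaC-ZS sentence-pair scorer (which returns values in [-1, 1]):

```python
from typing import Callable, List

def top_score(split_summary: List[str],
              review_sentences: List[str],
              entail: Callable[[str, str], float]) -> float:
    """Average, over atomic summary propositions, of the strongest
    entailment score any single review sentence assigns to the proposition."""
    per_sentence = [
        max(entail(premise, claim) for premise in review_sentences)
        for claim in split_summary
    ]
    return sum(per_sentence) / len(per_sentence)
```

Per-summary scores computed this way would then be averaged per system, as described above.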
A system could potentially game this metric by producing relatively "safe" statements (like most reviewers found the rooms clean). We therefore also want to evaluate genericity.

[Figure 4: The split-and-rephrase step. GPT-3.5 is prompted with in-context examples ("Split and rephrase the following sentences into simple propositions: Sentence: The reviewers were mixed, with some praising the central location of the hotel, and some finding the surrounding area to be polluted. Output: The hotel is centrally located. The surrounding area of the hotel is polluted.") to decompose compound summary sentences such as "The reviews of the hotel were generally positive, with most people finding the rooms clean and the staff apologetic. However, some found the carpets to be stained, and one reviewer reported dust balls in their room." into atomic propositions like "The rooms were clean.", "The rooms had dust balls.", "The staff was apologetic.", and "The carpets had stains on them.", which are then scored against review sentences with the entailment model.]

Terminology
The set of sentences in the summary of the reviews of a hotel h ∈ H w.r.t. aspect a ∈ A is called S_{h,a}. Passing these to the split-and-rephrase step gives us a set of split sentences Z_{h,a}. For any two sentences s_1, s_2, we denote the entailment score of s_2 with respect to s_1 according to the SummaC-ZS (Laban et al., 2022) model by e(s_1, s_2) ∈ [-1.0, 1.0]. A score of 1.0 indicates perfect entailment, while -1.0 denotes complete contradiction. Finally, we denote by N_n(s) the multiset of n-grams (with multiplicity) of the sentence s. In particular, N_1(s) is the set of words in the sentence s.

Evaluation of Entailment
We first evaluate by human evaluation whether entailment is effective at identifying the support of the mentioned viewpoints. The three authors of this paper marked 100 random pairs (50 each from SPACE and FewSum) of sentences and assertions entailed with a score above 0.5, on a scale of 0-2. Here, 2 indicates that the assertion is completely supported, and 1 that the assertion's general hypothesis is supported but some specifics are left out. The average score of the selection across the raters was 1.88, with a Fleiss' Kappa consensus score of 0.56 (moderate agreement). Many of the lower-rated entailed sentences also had lower entailment scores (closer to 0.5). The score illustrates that the precision of the entailment approach is high.

Faithfulness: Support Set Sizes
We propose an entailment metric for determining how well the viewpoints in the summary reflect the consensus of the input. We first compute per-sentence entailment scores as shown in Figure 4. For each sentence of the split-and-rephrased summary, we measure the number of review sentences that entail it with a score greater than a threshold τ = 0.75 (the "support" of the sentence). This threshold was determined based on manual inspection. We bin these counts into 0, 1, 2-4, and 5+. The frequencies of the bins are converted to percentages and listed in Table 6. FS performs poorly due to presenting hallucinated viewpoints, and repeated summarization slightly hurts CG on the Amazon split. G and CG outperform other methods on the Yelp split, likely because it has fewer reviews per product than Amazon, making it much likelier for the combined reviews of a product to fit in a manageable number of words. The "pure" GPT-3.5 systems generally perform well on the short review sets of FewSum. As we move to the long combined lengths of the reviews on SPACE, however, the pure GPT-3.5 pipelines fall behind in terms of faithfulness. Repeated summarization causes a major dip from First-TCG to TCG, indicating that it is not effective for long-form inputs. QG outperforms other GPT-3.5-related pipelines by a large margin. As we saw in human evaluation, however, QG may include some irrelevant viewpoints in this process. Abating this behavior by performing a topic-clustering step first brings its numbers down to a level comparable with First-TCG, which is still more faithful than the TCG pipeline. AceSum has the largest number of statements with 5+ supports on SPACE; however, as we will see later, many of its summaries are very generic, and support for them can easily be found among the large number of reviews. Q has the smallest percentage of statements with no support because it is extractive.
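A sketch of the binning computation; `entail` again stands in for the SummaC-ZS scorer, and the bins follow the ones above:

```python
from collections import Counter
from typing import Callable, Dict, List

def support_bins(split_summary: List[str],
                 review_sentences: List[str],
                 entail: Callable[[str, str], float],
                 tau: float = 0.75) -> Dict[str, float]:
    """Bin each atomic summary proposition by how many review sentences
    entail it above the threshold tau, and report bin percentages."""
    def bin_of(count: int) -> str:
        if count == 0:
            return "0"
        if count == 1:
            return "1"
        return "2-4" if count <= 4 else "5+"

    counts = Counter(
        bin_of(sum(entail(premise, claim) > tau for premise in review_sentences))
        for claim in split_summary
    )
    total = len(split_summary)
    return {b: 100.0 * counts.get(b, 0) / total for b in ["0", "1", "2-4", "5+"]}
```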

Factuality: Top Score
As depicted in Figure 4, averaging the per-sentence entailment scores (first per-summary, then per-system) gives us the Top Score metric. The average top score is a proxy for factuality, since true statements will typically be strongly entailed by at least one sentence of the reviews. We list the computed average top scores in Table 7. FS performs poorly on FewSum in terms of Factuality. The numbers for other systems are similar, with QG and CG performing best on the Amazon and Yelp splits. However, on the longer inputs of SPACE, the differences in factuality become more apparent. In particular, to reconcile similar but distinct viewpoints, repeated summarization leads to a type of generalizing that hurts the factuality of TCG and TG. Among the GPT-3.5 pipelines, QG performs the best, followed by TQG. TQG yet again delivers performance comparable to First-TCG and therefore presents a reasonable trade-off, with some gains on factuality and increased relevance.

Genericity
As mentioned before, we want to measure whether summaries contain largely generic statements like the service was helpful, which are likely to be faithful and factual but not very useful to a user of a system. We first focus on semantic genericity, i.e., the use of statements generally applicable to other products/services in the same class; lexical genericity, the overuse of generic words, is tackled next. Our approach to measuring semantic genericity employs the observation that generic sentences from a summary are often widely applicable and thus likely to be strongly entailed by statements from other summaries. We calculate the similarity sim(S, S′) of two sets of sentences using the averaged top score, as Figure 4 shows. Similarly, we also measure the fraction frac(S, S′, τ) of sentences whose top score exceeds a threshold τ. Equation 1 computes the average similarity score between summaries produced by the same system for different (hotel, aspect) pairs (normalizing by the number of pairs N). Equation 2 computes the corresponding metric based on frac.
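Concretely, these two metrics can be written as follows; this is a sketch consistent with the definitions above, summing over pairs of distinct (hotel, aspect) summaries from the same system, and the exact normalization in the original may differ:

```latex
\mathrm{sim\text{-}avg} \;=\; \frac{1}{N} \sum_{(h,a) \neq (h',a')} \mathrm{sim}\!\left(S_{h,a},\, S_{h',a'}\right) \tag{1}

\mathrm{frac\text{-}avg}(\tau) \;=\; \frac{1}{N} \sum_{(h,a) \neq (h',a')} \mathrm{frac}\!\left(S_{h,a},\, S_{h',a'},\, \tau\right) \tag{2}
```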
We report these two metrics in Table 8. On the short inputs of FewSum, all GPT-3.5 pipelines give similar results, with FS being slightly less generic.
Moving to SPACE, however, the range of scores becomes much wider. Forced to reconcile disparate opinions during repeated summarization, TCG and RG produce generic summaries, although AceSum is the most generic. We note that pre-extraction with QFSumm and topic-wise clustering help QG and TQG remain less generic.
To measure lexical genericity, we use the sentences from all summaries on the corresponding dataset as the set of documents to calculate an averaged Inverse Document Frequency (IDF) of the summaries, with stopwords removed and stemming applied. Since generic words are likely to occur more frequently and therefore have a low IDF, a smaller score indicates higher genericity. The scores calculated this way are listed in Table 9. As expected, QFSumm is highly specific due to being extractive. We observe that AceSum generates summaries that overuse generic words, in line with our prior observations. We also note that pre-extraction with QFSumm helps with lexical genericity as it did with semantic genericity. Finally, on FewSum, we observe that FS does better than every other pipeline apart from Q. This bolsters our previous claim that its low Factuality and Faithfulness scores were due to hallucinated, but specific, viewpoints.
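The averaged-IDF computation can be sketched as below; for brevity this sketch omits the stopword removal and stemming applied in the actual metric, assuming pre-processed inputs:

```python
import math
from typing import List

def mean_idf(summaries: List[List[str]]) -> List[float]:
    """Average IDF of each summary's words, computed with all summary
    sentences as the document collection. Lower scores = more generic."""
    documents = [sentence for summary in summaries for sentence in summary]
    n = len(documents)

    def idf(word: str) -> float:
        df = sum(word in doc.split() for doc in documents)
        return math.log(n / df)

    return [
        sum(idf(word) for sent in summary for word in sent.split())
        / sum(len(sent.split()) for sent in summary)
        for summary in summaries
    ]
```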

Correlation with Human Judgments
Our entailment-based approaches set out to measure Factuality and Faithfulness; how well do these correlate with our human evaluation? We compute Spearman's rank correlation coefficient on the human-annotated SPACE examples with the averaged annotator scores, as the consensus among rater scores was high on that dataset. In particular, we use the average of the Factuality scores among the raters as the net human score on Factuality for an example, and likewise the mean Faithfulness score for Faithfulness. Correspondingly, we consider the Top Score metric as the automatic measurement of Factuality, and the percentage of statements with 3 or more supports as that of Faithfulness. We list the obtained Spearman correlation coefficients in Table 10. While there is room for stronger metrics, the fact that the introduced metrics correlate with human judgments better than ROUGE provides an encouraging signal that they target the factors of interest.
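For reference, Spearman's rank correlation is the Pearson correlation of the ranks, which for untied ranks reduces to the 1 - 6Σd²/(n(n²-1)) formula; the values in the test below are a toy illustration, not the paper's scores:

```python
from typing import List

def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rank correlation coefficient (no tie correction,
    for clarity): 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(values: List[float]) -> List[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```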

Multi-Stage Summarization
Most summarization systems of both types are now end-to-end (Liu and Lapata, 2019b; Du et al., 2022; Ahuja et al., 2022). However, multi-stage approaches (Chen and Bansal, 2018; Li et al., 2021; Zhang et al., 2022) like ours have recently shown great promise. For instance, Li et al. (2021) extract relevant evidence spans and then summarize them to tackle long documents. Recursive summarization was explored by Wu et al. (2021) for book summarization, but involved fine-tuning GPT-3 to the task. Other approaches, such as the mixture-of-experts re-ranking model of Ravaut et al. (2022), can be considered two-step approaches where the combine function ranks and filters the outputs of the first stage.

Evaluation Metrics
The domain of news summarization has recently seen interest in using factuality/faithfulness for evaluation (Scialom et al., 2021; Kryscinski et al., 2020; Tang et al., 2023). In news, faithfulness and factuality are quite similar, as news articles usually do not present incorrect information or conflicting opinions. Opinion summarization is therefore quite distinct in this regard, and a separate treatment of factuality and faithfulness is sensible. For the same reason, although unified approaches to evaluating text generation (Deng et al., 2021; Zhong et al., 2022) are useful, more targeted metrics are likely to be more informative for opinion summarization specifically.
Aspect-Oriented Summarization In addition to opinion summarization (Amplayo et al., 2021a), aspect-oriented summarization has also been explored in other domains of NLP (Bahrainian et al., 2022; Yang et al., 2022). However, as highlighted above, opinion summarization differs from news summarization with respect to desired characteristics, and this work focuses specifically on those issues.

Conclusion
In this work, we show that GPT-3.5-based opinion summarization produces highly fluent and coherent summaries, but is not perfectly faithful to the input reviews and over-generalizes certain viewpoints. ROUGE is unable to capture these factors accurately. We propose using entailment as a proxy for support and develop metrics that measure the faithfulness, factuality, and genericity of the produced summaries. Using these metrics, we explore the impact of two approaches to controlling the size of the input via pre-summarization on two opinion summarization datasets. With the reasonably sized inputs of FewSum, GPT-3.5 and CG produce faithful and non-generic outputs. However, as we move to long-form review summarization, the factuality and faithfulness of these approaches drop. A pre-extraction step using QFSumm helps in this setting but leads to generally shorter and more generic summaries; a topic clustering step can then make summaries less generic and more relevant at a small cost to faithfulness and factuality. We hope that our efforts inspire future improvements to systems and metrics for opinion summary evaluation.

Limitations
Our study here focused on the most capable GPT-3.5 model available at the time the experiments were conducted, text-davinci-002. We believe that models like ChatGPT and GPT-4, as well as future models, are likely to perform at least as well, and if they improve further, the metrics we have developed here will be useful in benchmarking that progress. However, significant further paradigm shifts could change the distribution of errors in such a way that certain of our factors (e.g., genericity) become less critical. In addition, the latest iterations of GPT have a much greater input window size, which helps them digest much larger swaths of text in one go and potentially makes our pipelined approaches less necessary in certain settings.
Furthermore, the text-davinci-002 model is fine-tuned with data produced by human demonstrations. The precise data used is not publicly available, so it is difficult to use our results to make claims about what data or fine-tuning regimen leads to what failure modes in these models.
Recent work has noted that language models may be susceptible to learning biases from training data (Sheng et al., 2019; Wallace et al., 2019; Shwartz et al., 2020), and this phenomenon has also been observed for GPT-3.5 (Lucy and Bamman, 2021). We did not stress test the models studied for biases, and furthermore we only experimented on English-language data.
When properly used, the summarization models described in this paper can be time-saving. However, as noted above, summary outputs may be factually inconsistent with the input documents or not fully representative of the input, and in such cases could contribute to misinformation. This issue is present among all current abstractive models and is an area of active research.

A Pipeline Details
A.1 Details of the Infrastructure, Models, and Datasets Used
Computational Resources All experiments were run on a machine equipped with an Intel Xeon W-2123 and a TITAN RTX GPU with 24 GB of memory. We estimate the total computational GPU budget to be roughly 100 GPU-hours.
Model Sizes QFSumm (Ahuja et al., 2022)
Datasets and Evaluation Both the SPACE and FewSum datasets consist of reviews in English. The former consists of reviews of hotels; the latter, product reviews from Amazon and service reviews from Yelp. We use pre-existing datasets that are standard in opinion summarization. Through our human evaluation, we did not see any personally identifying information or offensive content in the reviews we assessed. All of our human evaluation experiments were performed once by the authors, and we report Krippendorff's alpha and Fleiss' kappa scores as measurements of consensus. We used ROUGE with the default settings. We used NLTK's (Loper and Bird, 2002) WordNet (Miller, 1994) lemmatizer where needed. Sentence splitting was done using the sent_tokenize() function of NLTK.

A.2 Details of the Configurations and Prompts
Here we provide more details of the configurations and/or prompts used for the various models. Below, GPT-3.5 refers to the text-davinci-002 model.
QFSumm and QFSumm-long (Q) QFSumm allows one to specify the number n of sentences to extract from the reference text to shape into a summary. We use n = 3 (the default setting) for QFSumm (summarizer) and n = 35 for QFSumm-long (extractor). On the SPACE dataset, we pass the aspect-specific keywords from Ahuja et al. (2022) to the model. On the FewSum dataset, however, the set of relevant keywords may be drastically different across examples. Therefore, for each product, we pass 5 randomly chosen reviews to GPT-3.5 with a prompt consisting of the reviews and the directive "Output up to eight comma-separated keywords that capture these reviews most saliently:". The produced keywords are then used with QFSumm to summarize the reviews.
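The keyword-elicitation step for FewSum can be sketched as below. Only the directive string comes from the paper; the function name, review-joining format, and seeding are illustrative assumptions.

```python
import random

# Directive wording taken verbatim from the paper.
DIRECTIVE = ("Output up to eight comma-separated keywords "
             "that capture these reviews most saliently:")

def build_keyword_prompt(reviews, k=5, seed=0):
    """Assemble a prompt from k randomly chosen reviews followed by the
    keyword directive; the resulting keywords are then passed to QFSumm."""
    rng = random.Random(seed)  # seeded here only for reproducibility
    sample = rng.sample(reviews, min(k, len(reviews)))
    return "\n\n".join(sample) + "\n\n" + DIRECTIVE
```

The model's comma-separated completion would then be split on commas and handed to QFSumm as its query keywords.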

GPT-3.5 Topic Clustering (T)
The prompt we use is "Describe the topic of each sentence in one word", followed by three examples and then the sentence whose topic is to be determined. We then map the produced words to their corresponding normalized GloVe (Pennington et al., 2014) vectors, which are mapped to the closest aspects in terms of L2 distance. This is functionally equivalent to using cosine similarity, as the vectors are normalized.
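The nearest-aspect lookup can be sketched as follows. The vectors here stand in for GloVe embeddings, and the aspect names are illustrative; for unit vectors, minimizing L2 distance is equivalent to maximizing cosine similarity because ||a - b||² = 2 - 2·cos(a, b).

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def closest_aspect(topic_vec, aspect_vecs):
    """Return the aspect whose (normalized) embedding is nearest to the
    topic word's embedding in L2 distance."""
    topic = normalize(topic_vec)
    best, best_dist = None, float("inf")
    for aspect, vec in aspect_vecs.items():
        v = normalize(vec)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(topic, v)))
        if dist < best_dist:
            best, best_dist = aspect, dist
    return best
```

In the pipeline, each sentence's GPT-produced topic word is looked up in the GloVe table and assigned to the aspect returned by this function.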
GPT-3.5 Chunking (C) We strive for the lengths of the chunks (in sentences) to be as close to each other, and to 30, as possible; thus, when there are l sentences in total to be chunked, we take c = ⌈l/30⌉ to be the number of chunks and allocate ⌊l/c⌋ sentences to each chunk, with the final chunk taking the remainder.
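A minimal sketch of this chunk allocation, under the reading that the final chunk absorbs any remainder:

```python
import math

def chunk_sentences(sentences, target=30):
    """Split l sentences into c = ceil(l / target) chunks of
    floor(l / c) sentences each; the final chunk takes the remainder."""
    if not sentences:
        return []
    l = len(sentences)
    c = math.ceil(l / target)
    size = l // c
    chunks = [sentences[i * size:(i + 1) * size] for i in range(c - 1)]
    chunks.append(sentences[(c - 1) * size:])
    return chunks
```

For example, 65 sentences yield c = 3 chunks of sizes 21, 21, and 23.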
Review Stratification (R) If a cluster's length exceeds GPT-3.5's upper limit at this stage, it is truncated to the maximum number of sentences that fit.
GPT-3.5 (G) When used as a summarizer, we feed the penultimate set of sentences to GPT-3.5 with the prompt "Summarize what the X said of the Y:", where X is either "reviewers" or "accounts" depending on whether GPT-3.5 chunking was used so far. Y is the aspect being summarized (SPACE) or just "Product" (FewSum). The preamble is either "Here's what some reviewers said about a hotel:" or "Here are some accounts of what some reviewers said about the hotel" in the case of SPACE. The word "hotel" is replaced by "product" for FewSum.
[Figure content: an example set of hotel reviews together with the GPT-3.5 summaries produced under different prompt wordings, and example supporting sentences for the statement "The hotel is situated close to restaurants and shops"; see Figures 5 and 6.]

B Entailment and Decomposition
In line with our motivation, we would like to use an NLI (Natural Language Inference) model to retrieve entailment scores of the produced summaries with respect to the input reviews. We tested several approaches, including BERTScore (due to it being trained on entailment/contradiction pairs), but finally settled on the zero-shot model from SummaC (Laban et al., 2022) to produce the entailment scores. SummaC is already becoming a standard evaluation tool for summarization factuality. We chose to forgo the trained "Conv" SummaC model, as we found that it did not generalize well to the kind of data we were working with. Specifically, two common issues were that (1) the range of scores assigned to the sentences from the reviews was very small, and (2) sometimes (especially for the most weakening statements) the scores assigned to the sentences seemed arbitrary and did not make much sense. The zero-shot model had neither of these issues. These issues are highlighted in Figure 6.
Further, a proposition X is typically not judged by models to entail statements of the form "The reviewers said X" or "X and Y", where Y is another proposition; accordingly, the entailment scores for these two cases are not very high. We highlight this in Figure 7. Thus, we split and rephrase all sentences of the produced summary into simple value propositions for all entailment-related metrics. Note that rephrasing here also includes removing any attribution, such as "The guests said...".
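The per-proposition scoring (cf. Figure 4, which takes the maximum entailment score among candidates) can be sketched as below. `nli_score` is an assumed callable standing in for whatever NLI scorer is used (the paper uses zero-shot SummaC); the toy scorer in the test is purely illustrative.

```python
def proposition_support_scores(propositions, review_sentences, nli_score):
    """For each split-and-rephrased proposition, take the maximum
    entailment score over all review sentences. `nli_score` is any
    callable (premise, hypothesis) -> float; plug in an NLI model here."""
    return {
        prop: max(nli_score(sent, prop) for sent in review_sentences)
        for prop in propositions
    }
```

Thresholding these max scores (e.g., at τ = 0.75 as in Table 6) then determines whether a proposition counts as supported.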

Figure 7: The scores of three statements ("The room was warm.", "The reviewers said that the room was warm.", and "The room was warm and the rugs were clean.") with respect to a set of review sentences, highlighting the issues with directly using the model output to compute entailment scores. Quoting a proposition as said by someone else, or combining multiple propositions in the same sentence, serves to cloud entailment scores.
We considered several models to this end, including BiSECT (Kim et al., 2021) and ABCD (Gao et al., 2021), but found the following common issues with all of them:
• The split sentences retained the words of the original sentences, so a sentence such as "The food was received well but it was served late" would yield an output part "It was served late", which requires a round of entity disambiguation to follow the split-and-rephrase step.
• These models do not remove attribution of viewpoints as we would like.
• A statement such as "I liked the setting of the movie but not its cast" yields "Not its cast" as one of the outputs, which does not make sense by itself.
Thus, we utilize GPT-3.5 to perform the split-and-rephrase task, with few-shot prompting used to illustrate the removal of attribution and other desired characteristics. We also experimented with having separate steps for splitting and rephrasing and found no significant difference in the outputs or their quality. We use the split-and-rephrased sentences for all of the automatic metrics that involve entailment of any sort.
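The few-shot prompt for this step might be assembled as below. The paper does not publish its exact prompt or demonstrations, so the directive wording and the single in-context example here are hypothetical; they only illustrate the desired behavior (splitting into simple propositions and dropping attribution).

```python
# Hypothetical demonstration pair; the paper's actual few-shot examples
# are not given.
FEW_SHOT = [
    ("The guests said the pool was clean but crowded.",
     ["The pool was clean.", "The pool was crowded."]),
]

def build_split_prompt(sentence):
    """Assemble a few-shot split-and-rephrase prompt for one sentence."""
    parts = ["Split each sentence into simple propositions and remove "
             "any attribution:"]
    for src, props in FEW_SHOT:
        parts.append(f"Sentence: {src}")
        parts.append("Propositions: " + " ".join(props))
    parts.append(f"Sentence: {sentence}")
    parts.append("Propositions:")
    return "\n".join(parts)
```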

C Measuring Complexity
One of the challenges of opinion summarization is that sentences may contrast opinions: "Most reviewers liked the service, but there were a few complaints about sluggish response times." We quantify the percentage of simple and contrasting statements in the model outputs, since it is subtly related to the extent to which opposing viewpoints are expressed. We use the original (non-split) sentences for this purpose and classify a sentence as contrasting if it contains one or more words from the set K = {'while', 'but', 'though', 'although', 'other', 'others', 'however'}, as Equation 3 depicts. We present these percentages in Table 11.
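This keyword-based contrast check can be sketched as follows; the punctuation-stripping tokenization is an assumed detail, not specified in the paper.

```python
# Marker set K from the paper.
CONTRAST_MARKERS = {"while", "but", "though", "although",
                    "other", "others", "however"}

def is_contrasting(sentence):
    """A sentence counts as contrasting if any token matches a marker."""
    tokens = {t.strip(".,;:!?\"'()").lower() for t in sentence.split()}
    return bool(tokens & CONTRAST_MARKERS)

def contrast_percentage(sentences):
    """Percentage of contrasting sentences among the (non-split) outputs."""
    if not sentences:
        return 0.0
    return 100.0 * sum(map(is_contrasting, sentences)) / len(sentences)
```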
We note that AceSum produces the smallest percentage of contrasting statements. We see that topic-wise clustering pushes up the number of contrasting statements for QG. We hypothesize that this is because bringing together statements with the same topic in a cluster makes two opposing statements likelier to fall into the same chunk. In cases where two opposing statements fall into different chunks, say X and Y, each chunk is likely to contain statements similar to the others in that chunk. Thus, the summaries of those chunks are likely to be highly contrasting, which increases the above measure even more at the final stage, as observed above for TCG.

D Abstractiveness
We further investigate how the choice of pipeline affects abstractiveness. To measure this, we calculate the percentage of n-grams in the summaries that do not appear in the input reviews, for n ∈ {3, 4, 5}. For this, we use the original (non-split) sentences from the output summaries. The results are tabulated in Table 8. Since QFSumm is a purely extractive model, it is no surprise that Q has low abstractiveness. The numbers are non-zero due to some quirks of QFSumm's sentence splitting, which leads to some partial sentences ending up next to each other. The next stand-out is that A has very low abstractiveness. This is in line with our observation that even though AceSum is abstractive, it tends to produce highly generic observations such as "The rooms were clean", which very likely appear almost verbatim in some user reviews. We also observe that QG has relatively low abstractiveness and that topic clustering drives abstractiveness up. We suspect this is a result of GPT-3.5 simply mashing together some sentences when presented with chunks containing highly disparate sentences (since it is hard to find a common thread among them), which promotes extraction over abstraction. Another observation is that multi-GPT-3.5 pipelines (TCG and RG) are more abstractive than single-GPT-3.5 ones, since there are two rounds of abstraction as opposed to one. All the GPT-3.5-derived pipelines are highly abstractive in the case of FewSum, and slightly more so than FS. This is unsurprising, since the combined length of the reviews in FewSum is much smaller than in SPACE, and therefore there are relatively fewer propositions to compress into general statements. Motivated by Ladhak et al. (2022), we display the line graph of the average Top Score vs. 3-gram abstractiveness for the SPACE dataset in Figure 9. The trio of QG, TQG, and TCG define the best frontier on the factuality-abstractiveness tradeoff, followed by RG, then A and Q.
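The novel n-gram measure above can be sketched as follows; whitespace tokenization and lowercasing are assumed details, not specified in the paper.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) over a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(summary, source, n):
    """Percentage of summary n-grams that never occur in the source reviews."""
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = ngrams(source.lower().split(), n)
    return 100.0 * len(summary_ngrams - source_ngrams) / len(summary_ngrams)
```

A fully extractive summary scores near 0, while heavy paraphrasing pushes the percentage up, which is why Q and A sit low on this measure.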
D1. Did you report the full text of instructions given to participants, including e.g. screenshots and disclaimers of any risks to participants or annotators? Not applicable; the human evaluators were the authors themselves. The ratings were on Likert scales, and the explanation of the scales is included in Section 4.3.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss whether such payment is adequate given the participants' demographic (e.g., country of residence)? Not applicable; the human evaluators were the authors themselves.
D3. Did you discuss whether and how consent was obtained from people whose data you are using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? Not applicable; the human evaluators were the authors themselves.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board? Not applicable; the human evaluators were the authors themselves.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? Not applicable; the human evaluators were the authors themselves.

Figure 1: Illustration of the TCG pipeline. Sentences are clustered based on the aspects closest to their topic (T step); examples are shown for rooms, food, and service. The relevant cluster is then repeatedly chunked and summarized until the combined length falls below 35 sentences (C step). A final round of GPT-3.5 summarization follows (G step).

Figure 4: Per-sentence entailment scores are calculated by taking the maximum among the various candidates.

Figure 5: Aspects of summarization such as verbosity or the format of the output are affected by the specific wording of the prompt. We use the leftmost prompt, "Summarize what the reviewers said."

Figure 6: The top 5 supporting and weakening sentences from the reviews for the statement "The hotel is situated close to restaurants and shops", as found by the Conv SummaC model. The corresponding entailment scores are included in parentheses. The scores are very close to each other, and the "weakening" statements do not weaken the statement at all. These issues led us to use the zero-shot model instead.

Figure 8: Abstractiveness as measured by the percentage of novel n-grams compared with the source reviews.
Figure 9: Average Top Score vs. 3-gram abstractiveness on the SPACE dataset.

C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? Section A.1
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? Sections 4.3, 5.2, and A.1
C4. If you used existing packages (e.g., for preprocessing, normalization, or evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, SpaCy, ROUGE)? Section A.1
D. Did you use human annotators (e.g., crowdworkers) or research with human participants? Sections 4.3 and 5.2

Table 3: SPACE and FewSum dataset statistics.

Table 4: Results of human evaluation on the SPACE dataset. Colors indicate moderate (light green) and substantial (darker green) agreement, respectively.

Table 6: Percentages of split-and-rephrased sentences binned according to support sizes, for all compared pipelines. The threshold used is τ = 0.75.

Table 7: The average Top Score for each pipeline on the SPACE and FewSum datasets.

Table 8: Semantic genericity based on entailment, along with the raw percentage of scores above the threshold. The threshold used is τ = 0.5.

Table 10: Spearman correlation coefficients of our metrics and ROUGE with human judgments.

Table 11: Complexity as measured by the percentage of contrasting sentences.