Unsupervised Opinion Summarization Using Approximate Geodesics

Opinion summarization is the task of creating summaries capturing popular opinions from user reviews. In this paper, we introduce Geodesic Summarizer (GeoSumm), a novel system to perform unsupervised extractive opinion summarization. GeoSumm consists of an encoder-decoder based representation learning model that generates topical representations of texts. These representations capture the underlying semantics of the text as a distribution over learnable latent units. GeoSumm generates these topical representations by performing dictionary learning over pre-trained text representations at multiple layers of the decoder. We then use these topical representations to quantify the importance of review sentences using a novel approximate geodesic distance-based scoring mechanism. We use the importance scores to identify popular opinions in order to compose general and aspect-specific summaries. Our proposed model, GeoSumm, achieves strong performance on three opinion summarization datasets. We perform additional experiments to analyze the functioning of our model and showcase the generalization ability of GeoSumm across different domains.


Introduction
As more and more human interaction takes place online, consumers find themselves wading through an ever-increasing number of documents (e.g., customer reviews) when trying to make informed purchasing decisions. As this body of information grows, so does the need for automatic systems that can summarize it in an unsupervised manner. Opinion summarization is the task of automatically generating concise summaries from online user reviews (Hu and Liu, 2004; Pang, 2008; Medhat et al., 2014). For instance, opinion summaries allow a consumer to understand product reviews without reading all of them. Opinion summaries are also useful for sellers to receive feedback and compare different products. The recent success of deep learning techniques has led to significant improvements in summarization (Rush et al., 2015; Nallapati et al., 2016; Cheng and Lapata, 2016; See et al., 2017; Narayan et al., 2018; Liu et al., 2018) in supervised settings. However, it is difficult to leverage these techniques for opinion summarization due to the scarcity of annotated data. It is expensive to collect good-quality opinion summaries, as human annotators need to read hundreds of reviews to write a single summary (Moussa et al., 2018). Therefore, most works on opinion summarization tackle the problem in an unsupervised setting.

† Work done during an internship at Google Research.
Recent works (Bražinskas et al., 2021; Amplayo et al., 2021a) focus on abstractive summarization, where fluent summaries are generated using novel phrases. However, these approaches suffer from issues like text hallucination (Rohrbach et al., 2018) that affect the faithfulness of generated summaries (Maynez et al., 2020). Extractive summaries are less prone to these problems, as they present the user with a representative subset of the original reviews.
We focus on the task of unsupervised extractive opinion summarization, where the system selects sentences representative of user opinions. Inspired by previous works (Chowdhury et al., 2022; Angelidis et al., 2021a), we propose (1) a novel encoder-decoder architecture, with accompanying objectives, for learning sentence representations that capture the underlying semantics, and (2) a sentence selection algorithm to compose a summary.
One of the challenges in extractive summarization is quantifying the importance of opinions. An opinion is considered important if it is semantically similar to opinions from other users. Using off-the-shelf pre-trained representations to obtain semantic similarity scores has known issues (Timkey and van Schijndel, 2021). These similarity scores can behave counterintuitively due to the high anisotropy of the representation space (a few dimensions dominate the cosine similarity scores). Therefore, we use topical representations (Blei et al., 2003), which capture the semantics of text as a distribution over latent semantic units. These semantic units encode underlying concepts or topics, and can be captured using a learnable dictionary (Engan et al., 1999; Mairal et al., 2009; Aharon et al., 2006; Lee et al., 2006). Topical representations enable us to effectively measure semantic similarity between text representations, as they are distributions over the same support. Text representations from reviews lie on a high-dimensional manifold, and it is important to consider the underlying manifold while computing the importance score of a review. Therefore, we use the approximate geodesic distance between topical text representations to quantify the importance scores of reviews.
In this paper, we present Geodesic Summarizer (GeoSumm), which learns topical text representations in an unsupervised manner from distributed representations (Hinton, 1984). We also present a novel sentence selection scheme that compares topical sentence representations in high dimensions using approximate geodesics. Empirical evaluations show that GeoSumm achieves strong performance on three opinion summarization datasets: OPOSUM+ (Amplayo et al., 2021a), AMAZON (He and McAuley, 2016), and SPACE (Angelidis et al., 2021b). Our primary contributions are:
• We present an extractive opinion summarization system, GeoSumm. It consists of an unsupervised representation learning system and a sentence selection algorithm (Section 3).
• We present a novel representation learning model that learns topical text representations from distributed representations using dictionary learning (Section 3.1).
• We present a novel sentence selection algorithm that computes the importance of text using approximate geodesic distance (Section 3.2).
• GeoSumm achieves strong performance on 3 opinion summarization datasets (Section 4.4).

Task Setup
In extractive opinion summarization, the objective is to select representative sentences from a review set. Specifically, each dataset consists of a set of entities E and their corresponding review sets R. For each entity e ∈ E (e.g., a particular hotel such as the Holiday Inn in Redwood City, CA), a review set R_e = {r_1, r_2, ...} is provided, where each review is an ordered set of sentences r_i = {s_1, s_2, ...}. For simplicity of notation, we represent the set of review sentences corresponding to an entity e as S_e = ∪_{r_i ∈ R_e} r_i. For each entity, reviews encompass a set of aspects A_e = {a_1, a_2, ...} (e.g., the service or food of a hotel). In this work, we consider two forms of extractive summarization: (a) general summarization, where the system selects a subset of sentences O_e ⊂ S_e that best represents popular opinions in the review set R_e; and (b) aspect summarization, where the system selects a representative sentence subset O_e^(a) ⊂ S_e about a specific aspect a (e.g., service) of an entity e (e.g., a hotel).
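The task setup above can be sketched in a few lines of Python (the review texts and budget below are illustrative placeholders, not from any dataset):

```python
# Sketch of the task setup for one entity e: each review r_i is an ordered
# list of sentences; S_e is the union of all review sentences.
R_e = [  # review set R_e = {r_1, r_2, ...}
    ["The staff was friendly.", "Rooms were clean."],
    ["Rooms were clean.", "Great location near the beach."],
]

# S_e = union of r_i over all reviews: the pool of candidate summary sentences
S_e = {s for r_i in R_e for s in r_i}

# A general summary O_e is a size-q subset of S_e; the selection criterion
# (importance scoring) comes later. Here we use a placeholder selection.
q = 2
O_e = set(sorted(S_e)[:q])
```

The summary is always a subset of the candidate pool, so `O_e ⊆ S_e` holds by construction; only the scoring function changes between general and aspect summarization.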

Geodesic Summarizer (GeoSumm)
In this section, we present our proposed approach, Geodesic Summarizer (GeoSumm). GeoSumm has two parts: (a) an unsupervised model to learn topical representations of review sentences, and (b) a sentence selection algorithm that uses the approximate geodesic distance between topical representations to compose the extractive summary.

Unsupervised Representation Learning
The goal of the representation learning model is to learn topical representations of review sentences. Topical representations model text as a distribution over underlying concepts or topics. This is useful for unsupervised extractive summarization because we want to capture the aggregate semantic distribution and quantify the importance of individual review sentences with respect to that aggregate distribution. Topical representations allow us to achieve both. Being a distribution over latent units, topical representations can be combined to form an aggregate (mean) representation, enabling compositionality. It is also convenient to measure the similarity between representations using conventional metrics (like cosine similarity).
We propose to model topical representations by decomposing pre-trained representations using dictionary learning (Tillmann, 2015; Lotfi and Vidyasagar, 2018). In this setup, the various components of the dictionary capture latent semantic units, and we consider the representation over dictionary elements as the topical representation. Unlike conventional dictionary learning algorithms, we use a sentence reconstruction objective for learning the dictionary. We use an encoder-decoder architecture to achieve this. We retrieve word embeddings from a pre-trained encoder. We modify the architecture of a standard Transformer decoder by introducing a dictionary learning component at each decoder layer. The pre-trained word embeddings obtained from the encoder are decomposed using these dictionary learning components to obtain topical representations. Then, we combine the topical word representations at different decoder layers to form a sentence representation. The schematic diagram of the model is shown in Figure 1. Next, we discuss each of the components in detail.

Encoder. We obtain contextual word embeddings from a pre-trained BART (Lewis et al., 2020) encoder. We keep the weights of the encoder frozen during training. In Section 5, we discuss why frozen representations are important for our model. Given an input sentence s = {w_1, ..., w_L}, we retrieve contextual word embeddings z_i from the BART encoder:

    z_i = sg(Encoder(s))_i,

where sg(·) denotes the stop-gradient operator.

Dictionary Learning. We describe the dictionary learning component within each decoder layer. We use dictionary learning to decompose pre-trained word representations from the encoder to obtain a sparse representation for each word. We want word representations to be sparse because each word can capture only a small number of semantics. We forward word representations from the encoder to the decoder layers. For the j-th decoder layer, we use a dictionary D^(j) ∈ R^(m×d) and a kernel function k_j(·,·), where j ∈ {1, ..., N} (N is the number of decoder layers). The dictionary captures the underlying semantics in the text by enabling us to model text representations as a combination of dictionary elements. Specifically, we learn a topical word representation T_j(w_i) over the dictionary D^(j) as:

    T_j(w_i) = k_j(z_i, D^(j)),    ẑ_i^(j) = T_j(w_i)^T D^(j),    (2)

where ẑ_i^(j) is the reconstructed word embedding, and k_j(·,·) ∈ R^m is the kernel function that measures the similarity between z_i and individual dictionary elements. In practice, since the dictionary is common for all word embeddings z_i, the kernel function can be implemented as:

    k_j(z_i, D^(j)) = f_θ^(j)(z_i),

where f_θ^(j) is a feed-forward neural network with ReLU non-linearity. The ReLU non-linearity ensures that the kernel coefficients are positive and also encourages sparsity.
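A minimal numerical sketch of one dictionary-learning component (toy sizes; the kernel here is a single linear layer with ReLU, standing in for the paper's feed-forward network, and all weights are random placeholders):

```python
import random

random.seed(0)
d, m = 4, 6  # embedding dim d and dictionary size m (paper uses d=768, m=8192)

# Dictionary D^(j): m learnable elements, each of dimension d
D = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
# Kernel f_theta^(j): modeled here as a single linear map followed by ReLU
W = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(m)]

def kernel(z):
    """ReLU(W z): non-negative, sparsity-encouraging kernel coefficients."""
    return [max(0.0, sum(W[k][n] * z[n] for n in range(d))) for k in range(m)]

def reconstruct(t):
    """z_hat = sum_k t_k * D_k: the embedding as a combination of dictionary elements."""
    return [sum(t[k] * D[k][n] for k in range(m)) for n in range(d)]

z = [0.3, -1.2, 0.7, 0.1]   # a frozen pre-trained word embedding z_i
T = kernel(z)                # topical word representation T_j(w_i)
z_hat = reconstruct(T)       # reconstructed embedding z_hat_i^(j)
```

Because of the ReLU, every coefficient of `T` is non-negative, so the topical representation can be read as (unnormalized) weights over dictionary elements.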
Following conventional dictionary learning algorithms (Beck and Teboulle, 2009), the dictionary D^(j) and kernel layer f_θ^(j) are updated iteratively. We ensure the sparsity of the word representations f_θ^(j)(z) by adding an L1 penalty to the loss. Overall, this can be achieved using the loss function:

    L_dict^(j) = Σ_i [ ||z_i − ẑ_i^(j)||_2^2 + λ ||f_θ^(j)(z_i)||_1 ],

where the gradient updates of the dictionary D^(j) and kernel layer f_θ^(j) are performed independently.

Decoder. We build on the decoder architecture introduced by Vaswani et al. (2017). A decoder layer consists of 3 sub-layers: (a) a masked multi-head attention layer that takes as input decoder token embeddings, (b) a multi-head attention layer that performs cross-attention between decoder tokens and the encoder stack output, and (c) a feed-forward network. We modify the cross-attention sub-layer to attend over the reconstructed word embeddings ẑ_i^(j) (Equation 2), instead of the encoder stack output (shown in Figure 1). Finally, the decoder autoregressively generates the reconstructed sentence ŝ = {ŵ_1, ..., ŵ_L}.

Training. The system is trained using the sentence reconstruction objective. The overall objective function is shown below:

    L = L_CE(ŝ, s) + Σ_{j=1}^N Σ_i [ ||z_i − ẑ_i^(j)||_2^2 + λ ||f_θ^(j)(z_i)||_1 ],

where L_CE is the cross-entropy loss, and f_θ^(j) is the implementation of the kernel function k_j(·,·) corresponding to the j-th decoder layer. The above loss is used to update the decoder, the dictionary elements, and the kernel parameters, while keeping the encoder weights frozen.

Sentence Representations. We combine topical word representations from different decoder layers to form a sentence representation. First, we obtain a word representation T_j(w) ∈ R^m from each decoder layer. We compose the final word representation x^w by concatenating representations from all decoder layers:

    x^w = [T_1(w); ...; T_N(w)] ∈ R^(mN),

where m is the dictionary dimension and N is the number of decoder layers. We use max-pooling over the dimensions of word representations to form a sentence representation x^s:

    x^s_n = max_{w ∈ s} x^w_n,

where x^w_n is the n-th entry of the vector x^w. The sentence representation x^s is normalized to a unit vector. Next, we discuss how we leverage these topical sentence representations to compute importance scores using approximate geodesics. We use the importance scores to compose the final extractive summary for a given entity.
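The pooling and normalization steps above can be sketched as follows (toy dimensions; each input vector stands in for a concatenated per-layer topical word representation):

```python
import math

def sentence_representation(word_reps):
    """Max-pool concatenated topical word representations into a
    unit-normalized sentence vector.  Each element of `word_reps` is a
    vector x^w = [T_1(w); ...; T_N(w)] for one word of the sentence."""
    dim = len(word_reps[0])
    # x^s_n = max over words w in s of x^w_n
    x_s = [max(x_w[n] for x_w in word_reps) for n in range(dim)]
    # normalize to a unit vector
    norm = math.sqrt(sum(v * v for v in x_s)) or 1.0
    return [v / norm for v in x_s]

# Toy example: a two-word sentence with N*m = 4 total dimensions
x_s = sentence_representation([[0.0, 2.0, 0.0, 1.0],
                               [3.0, 0.0, 0.0, 1.0]])
```

Max-pooling keeps, per dimension, the strongest activation of any word in the sentence, so a topic triggered by a single word still surfaces in the sentence representation.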

General Summarization
We use representations retrieved from GeoSumm to select sentences representative of popular opinions in the review set. For an entity e, the set of sentence representations is denoted as X_e = {x^s | s ∈ S_e}.
For a summary budget q, we select a subset of sentences O_e ⊂ S_e according to their importance scores, such that |O_e| = q. First, we compute a mean representation:

    μ_e = (1 / |S_e|) Σ_{s ∈ S_e} x^s.

Second, we define the importance of a sentence s as the distance from the mean representation, d(x^s, μ_e). However, we do not directly evaluate d(·,·) using a similarity metric. Representations in X_e lie on a high-dimensional manifold, and we aim to measure the geodesic distance (Jost and Jost, 2008) between two points along that manifold. An illustration of the geodesic distance between two points is shown in Figure 2. Computing the exact geodesic distance is difficult without explicit knowledge of the manifold structure (Surazhsky et al., 2005). We approximate the manifold structure using a k-NN graph. Each sentence representation forms a node in this graph. A directed edge exists between two nodes if the target node is among the k-nearest neighbours of the source node. The edge weight between two nodes (s, s') is defined using their cosine distance:

    w(s, s') = 1 − x^s · x^{s'}.

The geodesic distance between two sentence representations is computed as the shortest-path distance along the weighted graph. Therefore, the importance score I(s) for a sentence s is defined as:

    I(s) = −ShortestPath(x^s, μ_e),    (7)

where the shortest-path distance is computed using Dijkstra's algorithm (Dijkstra, 1959). We select the top-q sentences according to their importance scores I(s) to form the final general extractive summary. The overall sentence selection routine is shown in Algorithm 1.
Algorithm 1: General Summarization Routine. Input: a set of sentence representations X_e = {x^s | s ∈ S_e} for the review sentences of entity e. (Steps: compute the mean μ_e, build the k-NN graph, compute shortest-path distances from μ_e with Dijkstra's algorithm, and return the top-q sentences by importance score.)
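This selection step can be sketched end-to-end with standard-library Python (toy 2-D vectors stand in for the unit-normalized topical representations; k and the vectors are illustrative):

```python
import heapq
import math

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u)) or 1.0
    nv = math.sqrt(sum(x * x for x in v)) or 1.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def geodesic_importance(reps, k=2):
    """Approximate-geodesic importance scores I(s) for sentence representations.

    Builds a directed k-NN graph (edge weight = cosine distance), adds the
    mean representation as an extra node, runs Dijkstra from the mean, and
    returns I(s) = -shortest_path(mean, x^s) for every sentence."""
    n = len(reps)
    mean = [sum(col) / n for col in zip(*reps)]
    nodes = reps + [mean]                      # node index n is the mean
    # Each node points to its k nearest neighbours by cosine distance.
    adj = {}
    for i, u in enumerate(nodes):
        adj[i] = sorted(((1.0 - cosine(u, v), j)
                         for j, v in enumerate(nodes) if j != i))[:k]
    # Dijkstra's shortest paths from the mean node
    dist = {n: 0.0}
    heap = [(0.0, n)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue
        for w, j in adj[i]:
            nd = d + w
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return [-dist.get(i, float("inf")) for i in range(n)]

# Toy cluster of three similar sentences plus one outlier
reps = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.0, 1.0]]
scores = geodesic_importance(reps, k=2)
```

Note that with directed k-NN edges an outlier may be unreachable from the mean; its score becomes negative infinity, which naturally ranks it last when taking the top-q sentences.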

Aspect Summarization
In aspect summarization, the goal is to select representative sentences to form a summary specific to an aspect (e.g., durability) of an entity (e.g., bag). To perform aspect summarization, we identify aspect-specific sentences S_e^(a) ⊂ S_e by detecting the presence of aspect-specific keywords available with the dataset, and compute the mean representation of aspect-specific sentences:

    μ_e^(a) = (1 / |S_e^(a)|) Σ_{s ∈ S_e^(a)} x^s.

To ensure the selected sentences are aspect-specific, we introduce a measure of informativeness (Chowdhury et al., 2022; Peyrard, 2019). Informativeness penalizes a sentence for being close to the overall mean μ_e. Therefore, we model the aspect-specific importance score I_a(s) as:

    I_a(s) = −ShortestPath(x^s, μ_e^(a)) − γ I(s),    (8)

where γ is a hyperparameter and I(s) is the overall importance score (obtained from Equation 7). The aspect summary O_e^(a) is composed of the top-q sentences according to the aspect-specific scores I_a(s).
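The trade-off behind the aspect score can be sketched with a one-line function (signs follow the convention that importance is a negated distance; γ and the input values below are illustrative):

```python
def aspect_importance(dist_to_aspect_mean, overall_importance, gamma=0.5):
    """Aspect-specific score: reward closeness to the aspect mean, penalize
    sentences that merely restate the overall (general) consensus.
    `dist_to_aspect_mean` is the approximate geodesic distance to mu_e^(a);
    `overall_importance` is I(s) from the general-summarization step."""
    return -dist_to_aspect_mean - gamma * overall_importance

# Two sentences equally close to the aspect mean: the one far from the
# overall mean (low I(s)) is preferred over the generic one.
s_aspect_only = aspect_importance(0.1, -0.8)   # far from overall mean
s_generic     = aspect_importance(0.1, -0.1)   # close to overall mean
```

Larger γ pushes the summary further away from generic consensus sentences and toward content unique to the aspect.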

Experiments
We evaluate the performance of GeoSumm on extractive summarization. Given a set of user reviews, the system needs to select a subset of the sentences as the summary. This summary is then compared with human-written summaries. In this section, we discuss the experimental setup in detail.

Datasets & Metrics
We evaluate GeoSumm on three publicly available opinion summarization datasets: (a) OPOSUM+ (Amplayo et al., 2021b), (b) AMAZON (He and McAuley, 2016), and (c) SPACE (Angelidis et al., 2021b). Dataset statistics are shown in Table 1. We observe that the SPACE dataset has significantly more reviews per entity compared to the other datasets.

Implementation Details
Our experiments are implemented using the TensorFlow (Abadi et al., 2015) framework. We use the BART base (Lewis et al., 2020) architecture as our encoder-decoder model. We initialize the encoder with pre-trained weights from BART, while the decoder is trained from scratch. In our experiments, we use dictionary dimension m = 8192, number of decoder layers N = 6, and hidden dimension d = 768. GeoSumm was trained for 15K steps on 16 TPUs in all setups. We optimize our model using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 10^-5. We set the aspect-summarization parameter γ = 0.5 for OPOSUM+ and γ = 0.7 for SPACE (Equation 8). All hyperparameters were tuned using grid search on the development set. We will make our code publicly available.

Baselines
We compare GeoSumm with several summarization systems (including the current state-of-the-art) that can be classified into three broad categories:
• Single Review systems select a single review as the summary. We compare with the following systems: (a) Random samples a review randomly from the review set.
• Extractive systems select a subset of review sentences as the summary. We compare with QT (Angelidis et al., 2021a) and SemAE (Chowdhury et al., 2022).
• Abstractive systems generate summaries using novel phrasing. We compare GeoSumm with the following systems: MeanSum (Chu and Liu, 2019), Copycat (Bražinskas et al., 2020b), PlanSum (Amplayo et al., 2021c), TranSum (Wang and Wan, 2021), and COOP (Iso et al., 2021).

Results
We discuss the performance of GeoSumm on general and aspect-specific summarization. We evaluate the quality of the extracted summaries using an automatic metric, ROUGE F-scores (Lin, 2004), which measure the n-gram overlap with the human-written summaries.

General Summarization. We present the results of GeoSumm and baseline approaches on general summarization in Table 2. GeoSumm achieves strong performance, falling slightly short of the state-of-the-art model, SemAE. However, we observe that GeoSumm's summaries are much more diverse, leading to significantly better human evaluation scores compared to SemAE.

Aspect Summarization. We report the performance of different approaches on aspect summarization in Table 3 on OPOSUM+ and SPACE. We observe that GeoSumm achieves state-of-the-art performance for all metrics on the OPOSUM+ dataset. On the SPACE dataset, it achieves comparable scores to other extractive approaches.

Human Evaluation. We perform a human evaluation to compare the summaries from GeoSumm with the state-of-the-art extractive summarization systems SemAE and QT. General summaries were judged based on the following criteria: informativeness, coherence, and redundancy. We present human evaluators with summaries in a pairwise fashion and ask them to select which one is better/worse/similar according to the criteria. The final scores for each system reported in Table 4 were computed using Best-Worst Scaling (Louviere et al., 2015). We observe that GeoSumm outperforms the baselines in coherence and redundancy. GeoSumm performs slightly worse than QT in informativeness. This is expected, as GeoSumm greedily selects sentences (that are often similar), while QT performs sampling, leading to more informative summaries (compromising on coherence).
For aspect summaries, we ask annotators to judge whether a summary discusses a specific aspect exclusively, partially, or does not mention it at all. In Table 5, we report the human evaluation results for aspect summaries on the OPOSUM+ dataset. We observe that GeoSumm generates summaries that are significantly more aspect-specific compared to baselines. We provide further details about human evaluation in Appendix A.1.
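The Best-Worst Scaling scores reported here reduce to a simple count-based formula (the counts below are illustrative, not from the paper's evaluation):

```python
def best_worst_score(n_best, n_worst, n_total):
    """Best-Worst Scaling: percentage of comparisons in which a system was
    chosen best minus the percentage in which it was chosen worst.
    The resulting score lies in the range -100 to +100."""
    return 100.0 * (n_best - n_worst) / n_total

# Illustrative counts for one system over 50 pairwise judgments
score = best_worst_score(n_best=30, n_worst=10, n_total=50)
```

A system never chosen worst and always chosen best would score +100; one always chosen worst would score -100, matching the score range stated in Appendix A.1.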
Fine-tuning the Encoder. We observe that there is a significant drop in performance when the encoder is fine-tuned. We hypothesize that this happens because the model overfits shallow word-level semantics and is unable to capture more abstract semantics. This showcases the utility of pre-trained representations, which help GeoSumm perform well in an unsupervised setting.
Next, we investigate the efficacy of the representation learning and sentence selection modules by replacing each of them with a competitive variant.
We observe a significant drop in performance across all three datasets.
Euclidean-based Importance Score. We investigate the utility of geodesic-based importance scoring over Euclidean-based scoring. In this experiment, instead of I(s) (defined in Equation 7), we compute the importance score of a sentence s as the negative squared Euclidean distance from the mean representation, I(s) = −∥x^s − μ_e∥²₂. We report the results of this setup in Table 7 (relative performance to GeoSumm is shown in brackets). We observe that performing sentence selection using Euclidean distance results in a significant drop in performance across all datasets. We believe that leveraging the k-NN graph provides a better approximation of the underlying representation manifold, which results in better summarization performance.

Distributed vs. Topical Representations. In this experiment, we investigate the relative efficacy of topical representations compared to distributed representations. We retrieve distributed sentence representations from RoBERTa (Liu et al., 2019) (the [CLS] token feature) and the SimCSE (Gao et al., 2021) model. Then, we use these representations in our sentence selection algorithm (Section 3.2) to compose the summary. In Table 8, we observe that topical representations outperform distributed representations by a significant margin across all setups. This shows the utility of topical representations over distributed representations for unsupervised summarization.

Summary Coherence. In this experiment, we evaluate the coherence of the generated extractive summaries using automatic measures. Specifically, we measure perplexity scores (from the HuggingFace (Wolf et al., 2020) Evaluate API) using the GPT-Neo model (Black et al., 2021). The perplexity scores are indicative of the coherence of the generated text. In Table 9, we report the perplexity scores on SPACE and AMAZON datasets for the extractive systems QT, SemAE, and GeoSumm. We observe that GeoSumm achieves the best perplexity scores, showing that it generates superior-quality summaries in terms of coherence. We believe that the greedy aggregation of sentences in GeoSumm often results in the selection of semantically similar sentences, thereby leading to more coherent summaries with fewer context switches.
Cluster Interpretation. In this experiment, we investigate whether different parts of the representation space capture distinct semantics. We partition the space by performing agglomerative clustering with Ward's linkage (Ward Jr, 1963) on the representation set for a particular entity. In Table 10, we report example sentences within different clusters.
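The partitioning step can be sketched with a minimal bottom-up clustering routine (average linkage over cosine distance here, as a simplified stand-in for the Ward's-linkage clustering used in the paper; the 2-D points are toy placeholders):

```python
import math

def cosine_dist(u, v):
    nu = math.sqrt(sum(x * x for x in u)) or 1.0
    nv = math.sqrt(sum(x * x for x in v)) or 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (nu * nv)

def agglomerative(points, n_clusters):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):  # average pairwise distance between two clusters
        pairs = [(i, j) for i in a for j in b]
        return sum(cosine_dist(points[i], points[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        _, x, y = min((linkage(a, b), x, y)
                      for x, a in enumerate(clusters)
                      for y, b in enumerate(clusters) if x < y)
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

# Toy sentence representations forming two obvious groups
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = agglomerative(points, n_clusters=2)
```

Each returned cluster is a list of sentence indices, which can then be inspected for a shared theme as in Table 10.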
We observe that sentences belonging to the same cluster share a common theme. The underlying semantics of a cluster can vary from being coarse, like the presence of the phrase 'Calistoga', to more nuanced concepts like pillows & beds in the room, flowers in the hotel's garden, etc.

Phrase 'Calistoga'
• Calistoga is a beautiful historic town with good restaurants and beautiful old houses - a fun place to walk.
• The Roman Spa and Calistoga is our favorite spot in the Wine Country.

Pillows & Beds
• The rooms were in great shape, very clean, comfortable beds with lots of pillows.
• The pillows and bed coverings were of very good quality. There was also a mini-refrigerator and coffeemaker.

Phrase 'every year'
• We return every year to the Roman Spa after the holidays and brought Seattle friends this January.
• Every year for the past 15 years we have met at the Roman Spa ...

Table 10: Sentences within clusters produced from agglomerative clustering. Sentences in a row belong to the same cluster. We highlight the dominant theme of each cluster.

Generated Summaries. In Table 11, we report the summaries generated by GeoSumm and other comparable extractive summarization systems, SemAE and QT. We observe that GeoSumm is able to generate a comprehensive summary that reflects the main considerations mentioned in the human summary. Compared to SemAE, we see more specific adjectival descriptions; SemAE indicates that many of the hotel characteristics are simply 'great'. Compared to QT, we see a summary that more accurately reflects the human-written summary.
We perform additional ablation experiments, investigating domain transfer capabilities and the sparsity of representations, among others, in Appendix A.2.

Related Work
Most work on opinion summarization focuses on generating summaries in an unsupervised setup due to the scarcity of labeled data. These works are broadly classified into two categories based on the type of summaries being generated: abstractive (Ganesan et al., 2010; Carenini et al., 2006; Di Fabbrizio et al., 2014) or extractive (Erkan and Radev, 2004; Nenkova and Vanderwende, 2005; Zhao et al., 2022; Li et al., 2023). Abstractive systems, in an unsupervised setup (Chu and Liu, 2019; Bražinskas et al., 2020b; Iso et al., 2021; Wang and Wan, 2021; Amplayo et al., 2021a), train an encoder-decoder model using a self-supervised objective and generate the summary by leveraging the aggregate opinion representation. On the other hand, extractive opinion systems (Kim et al., 2011) select sentences using an importance score that quantifies their salience. Salience has been computed using frequency-based approaches (Nenkova and Vanderwende, 2005), distance from the mean (Radev et al., 2004), or graph-based techniques (Erkan and Radev, 2004). A few approaches focus on aspect specificity and sentiment polarity for sentence selection (Angelidis and Lapata, 2018b; Zhao and Chaturvedi, 2020).

Human-written: All staff members were friendly, accommodating, and helpful. The hotel and room were very clean. The room had modern charm and was nicely remodeled. The beds are extremely comfortable. The rooms are quiet with wonderful beach views. The food at Hash, the restaurant in the lobby, was fabulous. The location is great, very close to the beach. It's a longish walk to Santa Monica. The price is very affordable.

GeoSumm: Overall we had a nice stay at the hotel. Our room was very clean and comfortable. The atmosphere is stylish and the service was great. We ate breakfast at the hotel and it was great. I appreciate the location and the security in the hotel. The food and service at the restaurant was awesome. The Hotel is classy and has a rooftop bar. The restaurant is cozy but they have good healthy food. Great hotel.

SemAE: The staff is great. The Hotel Erwin is a great place to stay. The staff were friendly and helpful. The location is perfect. We ate breakfast at the hotel and it was great. The hotel itself is in a great location. The service was wonderful. It was great. The rooms are great. The rooftop bar HIGH was the icing on the cake. The food and service at the restaurant was awesome. The service was excellent.

QT: Great hotel. We liked our room with an ocean view. The staff were friendly and helpful. There was no balcony. The location is perfect. Our room was very quiet. I would definitely stay here again. You're one block from the beach. So it must be good! Filthy hallways. Unvacuumed room. Pricy, but well worth it.

Table 11: Human-written and generated summaries from GeoSumm, SemAE, and QT. For a fair comparison, we present the summary for the instance reported in previous works. GeoSumm generates a comprehensive review with a relatively logical ordering that starts with a clear topic sentence and then proceeds to details. Compared to SemAE, we see more descriptive sentences selected. Compared to QT, we see a summary that more closely matches the human-written summary.
Our work is most similar to the extractive summarization systems SemAE (Chowdhury et al., 2022) and QT (Angelidis et al., 2021a). Similar to these systems, Geodesic Summarizer has two components: a representation learning system and a sentence selection routine. However, unlike these approaches, we leverage pre-trained models to learn topical representations over a latent dictionary and propose a sentence selection mechanism using approximate geodesics to perform summarization.
Approaches in our work resemble prior work on deep clustering, which considers a similar combination of unsupervised representation learning and sparse structures (Yang et al., 2016; Jiang et al., 2016; Law et al., 2017; Caron et al., 2020; Zhao et al., 2020). In a similar fashion, dictionary learning-like approaches have been combined with deep networks (Liang et al., 2021; Zheng et al., 2021) for various tasks.

Conclusion
We present Geodesic Summarizer, a novel framework for extractive opinion summarization. GeoSumm uses a representation learning model to convert distributed representations from a pre-trained model into topical text representations. GeoSumm uses these representations to compute the importance of a sentence using approximate geodesics. We show that GeoSumm achieves strong performance on several opinion summarization datasets. However, there are many open questions about the inductive biases of representation learning needed for unsupervised summarization. In this work, we show the efficacy of topical representations. However, are there better approaches to capturing language semantics that help us quantify the importance of an opinion? Our analysis shows that representations from GeoSumm span the high-dimensional space in a manner where different parts of it capture distinct semantics. This opens up the possibility of leveraging the representation geometry to capture different forms of semantics. Future work can explore ways to leverage topical representations from GeoSumm for tasks where there is a scarcity of labeled data.

Acknowledgments
The authors are thankful to Anneliese Brei, Haoyuan Li, Anvesh Rao Vijjini, and Chao Zhao for helpful feedback on an earlier version of this paper. This work is supported in part by the National Science Foundation under award DRL-2112635.

Limitations
We propose GeoSumm, a novel system that learns topical representations of text and uses them to compute the importance of opinion reviews for extractive summarization. One limitation of GeoSumm is that it requires pre-training of the representation learning module using review sentences from a similar domain. For this, GeoSumm requires access to a large collection of review data from the target domain, limiting its applicability in zero-shot or few-shot setups. This can be alleviated by future research on developing foundational models that learn topical representations on large-scale datasets and generalize across different opinion summarization domains.

Ethical Considerations
We do not foresee any ethical issues arising from the technology introduced in this paper. However, we would like to mention certain limitations of extractive summarization systems in general. As extractive systems select review sentences from the input, they can produce undesirable output when the input reviews contain foul or offensive language. Therefore, it is important to remove foul language from the input to ensure the end user is not affected. In general, we use public datasets and do not annotate any data manually. All datasets used in this paper contain customer reviews in the English language. Human evaluations for summarization were performed on the Amazon Mechanical Turk (AMT) platform. Human judges were based in the United States and were compensated at a rate of at least $15 USD per hour.

A.1 Human Evaluation
We perform the human evaluation on the Amazon Mechanical Turk (AMT) platform. We designed the payment rate per Human Intelligence Task (HIT) to ensure that judges were compensated at a rate of at least $15 USD per hour. In all tasks, each HIT was evaluated by three human judges.
For general summarization, we performed a pairwise evaluation of two summarization systems. Specifically, given two system summaries, the judges were asked to rate each pair as better, worse, or similar. We asked the judges to evaluate the pairs based on the following criteria: informativeness, redundancy, and coherence, in independent tasks. For informativeness, we also provide the judges with a human-written summary. The judges annotate a summary as more informative only if the information is consistent with the human-written summary. The reported scores (-100 to +100) were computed using Best-Worst Scaling (Louviere et al., 2015). For a fair comparison, we consider the version of SemAE that does not use additional aspect-related information.
For aspect summarization, we provide human judges with a system-generated aspect summary and the corresponding aspect. Judges were asked to annotate whether the system summary discusses the mentioned aspect exclusively, partially, or not at all.

A.2 Analysis
Dictionary Size Ablation. In this experiment, we vary the number of elements in each dictionary (m) and observe the summarization performance. We conduct these experiments on the SPACE dataset. In Table 12, we observe that GeoSumm achieves comparable performance with significantly smaller dictionary sizes. In fact, for the smallest dictionary sizes, GeoSumm achieves the best ROUGE-1 and ROUGE-L scores.

Sparsity.
We examine the sparsity of sentence representations from GeoSumm. For each sentence representation, we sort the dimensions by magnitude, from smallest to largest. This enables us to compare magnitudes across sentences for a specific sorted rank position. We then plot the mean magnitude (and two standard deviations) for each sorted rank position, as illustrated in Figure 3. The distributions indicate that most sentences possess only a few dimensions with high magnitude, while the remaining dimensions have magnitudes of zero or close to zero.

Table 12: Evaluation results with a varying number of dictionary elements on the SPACE dataset. We observe only a small drop in the performance of GeoSumm when the dictionary sizes are reduced.
Figure 3: Plot depicting the sparsity of sentence representations retrieved from GeoSumm. We sort, individually for each sentence, the dimensions from smallest to largest magnitude and report the mean magnitude for each sorted position (with two standard deviations). Most sentences have only a few large-magnitude dimensions and many close to zero.
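The sorting-and-averaging procedure described above can be sketched as follows; the random toy data is an illustrative assumption, not the actual GeoSumm representations:

```python
import numpy as np

def sparsity_profile(X):
    """Sort each row's dimensions by absolute magnitude (ascending) and
    return the mean and std of each sorted rank position across rows."""
    sorted_mags = np.sort(np.abs(X), axis=1)  # per-sentence ascending sort
    return sorted_mags.mean(axis=0), sorted_mags.std(axis=0)

# Toy sparse representations: ~90% of dimensions are exactly zero,
# mimicking sentence representations with only a few active units.
rng = np.random.default_rng(0)
X = np.where(rng.random((100, 50)) < 0.9, 0.0, rng.random((100, 50)))
mean, std = sparsity_profile(X)
# Early sorted positions average near zero; only the last few are large.
```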

Domain Transfer capability.
In this experiment, we investigate the domain transfer capability of GeoSumm. Specifically, we evaluate how GeoSumm trained on one dataset performs on others. We also evaluate GeoSumm when it is trained on the C4 dataset (Raffel et al., 2020). In Table 14, we report the results of this experiment. We observe that when training on the non-domain-specific C4 corpus, performance is nearly that of in-domain training. The largest degradation in performance occurs when training on OPOSUM+ or AMAZON and evaluating on SPACE. We hypothesize that this happens due to a domain shift: both AMAZON and OPOSUM+ are product review datasets, while SPACE has reviews for hotel entities. When evaluating on OPOSUM+ or AMAZON, we observe that GeoSumm generalizes well, and out-of-domain performance is not much worse than in-domain performance (highlighted in gray).

NMF (Lee and Seung, 2000)           | 32.85 10.44 18.96 | 30.33 5.07 16.10 | 34.88  6.14 18.87
LDA (Blei et al., 2003)             | 32.70 10.85 19.60 | 31.31 5.27 16.51 | 26.57  3.46 14.81
LSA (Dumais et al., 2004)           | 32.41 10.33 19.66 | 31.71 6.11 17.79 | 31.64  5.57 17.72
HDP (Teh et al., 2004)              | 34.60 11.29 19.39 | 30.60 4.91 16.20 | 29.77  4.44 16.49
NTM-BERT (Bianchi et al., 2021)     | 33.00 11.01 19.01 | 31.62 5.29 16.54 | 26.12  2.74 15.29
Geodesic Summarizer (GeoSumm)       | 41.55 20.77 25.19 | 33.75 7.15 18.79 | 42.36 12.44 24.80

Table 13: Comparison of Geodesic Summarizer's performance with other unsupervised topic modeling techniques on general summarization (each group of three columns reports R1/R2/RL on one of the three datasets). In this experiment, we modify the representation learning module of Geodesic Summarizer while keeping the sentence selection approach the same. We observe that Geodesic Summarizer's topic modeling approach achieves the best performance across all datasets.

Unsupervised Topic Modeling Ablations. In this setup, we experiment with different unsupervised topic modeling approaches: latent Dirichlet allocation (LDA) (Blei et al., 2003), latent semantic analysis (LSA) (Dumais et al., 2004), non-negative matrix factorization (NMF) (Lee and Seung, 2000), hierarchical Dirichlet process (HDP) (Teh et al., 2004), and a neural topic model (NTM) using contextual embeddings (Bianchi et al., 2021). Most of these approaches factorize sentence representations into topical representations over a set of learned topics. We set the sentence representation dimension d = 100 for all approaches. Specifically, we replace the representation learning module of GeoSumm while keeping the sentence selection algorithm the same. In Table 13, we report the general summarization performance of the different methods. We observe that most of the other topical approaches perform significantly worse than GeoSumm. These approaches use significantly fewer parameters than the Transformer decoder used in GeoSumm. We believe that leveraging more parameters helps the unsupervised model capture latent semantics better, leading to better summarization performance.
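As a hedged sketch of how one such ablation baseline can be wired up (the actual pipeline and the d = 100 setting are not reproduced; the tiny corpus and n_components=2 below are illustrative assumptions), NMF can factorize sentence vectors into non-negative topical weights, which would then feed the unchanged sentence selection step:

```python
# Illustrative stand-in for the representation learning module:
# factorize TF-IDF sentence vectors into topical weights with NMF.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the room was clean", "staff were friendly", "clean quiet room"]
tfidf = TfidfVectorizer().fit_transform(sentences)

# The paper uses d = 100 topics; 2 is used here only because the toy
# corpus has 3 sentences.
topical = NMF(n_components=2, init="nndsvda", random_state=0).fit_transform(tfidf)
# topical[i] is sentence i's non-negative weight over the learned topics.
```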

Figure 1 :
Figure 1: Architecture of Geodesic Summarizer. Sparse representations of words are formed via the kernel function f_θ^(j). The representations are trained to reconstruct the output embeddings of the encoder layer. Alongside the dictionary learning objective, we use an unsupervised sentence reconstruction cross-entropy loss. N indicates the number of decoder layers.
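As a hedged illustration of the dictionary-learning idea the caption refers to (not the actual GeoSumm objective, which couples dictionary learning with sentence reconstruction across decoder layers), scikit-learn's DictionaryLearning can factor pre-trained embeddings into sparse codes over learned dictionary elements; the toy data and sizes below are assumptions:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for pre-trained word embeddings at one decoder layer.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))

# Learn 16 dictionary elements; each embedding is approximated as a
# sparse combination of them. The sparse codes play the role of the
# topical representation in this sketch.
dl = DictionaryLearning(n_components=16, alpha=1.0, max_iter=20, random_state=0)
codes = dl.fit_transform(embeddings)   # (200, 16) sparse codes
recon = codes @ dl.components_         # reconstruction of the embeddings
```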

Figure 2 :
Figure 2: Illustration of the geodesic shortest path (shown in blue) between two sentence representations x_s and x_s′ on a three-dimensional manifold.
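The paper's exact scoring mechanism is not reproduced here; as an illustrative sketch, geodesic distances on a manifold are commonly approximated Isomap-style by connecting each point to its k nearest Euclidean neighbors and running shortest paths over the weighted graph. The function name and toy circle data below are assumptions:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def approx_geodesic(X, k=5):
    """Approximate pairwise geodesic distances: build a k-NN graph
    weighted by Euclidean distance, then run Dijkstra over it."""
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")
    return shortest_path(knn, method="D", directed=False)

# Points on a circle: the approximate geodesic between opposite points
# follows the arc, exceeding the straight-line (chord) distance.
t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
X = np.c_[np.cos(t), np.sin(t)]
G = approx_geodesic(X, k=2)
chord = np.linalg.norm(X[0] - X[20])  # diameter, 2.0
# G[0, 20] sums 20 short chords along the arc, giving roughly pi
```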
(a) x̄_e^(a) = E_{s∼S_e^(a)}[x_s], where S_e^(a) is the set of sentences mentioning aspect a, identified from the review set; (b) Centroid selects the review closest to the centroid of the review set, where the centroid is computed using BERT (Devlin et al., 2019) embeddings; (c) Oracle selects the best review based on ROUGE overlap with the human-written summary.

Table 1 :
Dataset statistics for the OPOSUM+, AMAZON and SPACE datasets. (Train/Test Ent.: number of entities in the training and test set; Rev./Ent.: number of reviews per entity in the test set.) OPOSUM+ is an extended version of the original OPOSUM (Angelidis et al., 2021a). SPACE (Angelidis et al., 2021a) contains reviews for hotels from Tripadvisor, and provides three human-written abstractive summaries and six aspect-specific summaries per hotel entity. Statistics of the datasets are provided in Table 1.

Table 2 :
Evaluation results of GeoSumm and baseline approaches on general summarization. We observe that GeoSumm achieves strong performance on all datasets. We report the ROUGE-F scores, denoted as R1: ROUGE-1, R2: ROUGE-2, RL: ROUGE-L. We highlight the best performance achieved by an extractive summarization system in bold and the best abstractive summarization performance with an underline.

Table 3 :
Evaluation results on aspect summarization. The best scores for each metric are highlighted in bold. GeoSumm achieves state-of-the-art performance on OPOSUM+, while achieving performance competitive with other extractive methods on SPACE.
, and AceSum (Amplayo et al., 2021b). • Extractive systems select text phrases from the review set to form the summary. We compare with

Table 5 :
Human evaluation results of aspect summarization on the OPOSUM+ dataset. GeoSumm generates more aspect-specific summaries than the baselines.

Table 7 :
Evaluation results of GeoSumm with a modified score I

Table 9 :
Perplexity of the summaries generated by different extractive summarization systems. We observe that GeoSumm achieves the best perplexity scores, indicating more coherent summaries.

Table 14 :
Evaluation results when the representation learning system is trained on a different dataset. In-domain performance is highlighted in gray. GeoSumm shows decent domain transfer performance for the OPOSUM+ and AMAZON datasets.