Unsupervised Extractive Opinion Summarization Using Sparse Coding

Opinion summarization is the task of automatically generating summaries that encapsulate information expressed in multiple user reviews. We present Semantic Autoencoder (SemAE) to perform extractive opinion summarization in an unsupervised manner. SemAE uses dictionary learning to implicitly capture semantic information from the review text and learns a latent representation of each sentence over semantic units. Our extractive summarization algorithm leverages the representations to identify representative opinions among hundreds of reviews. SemAE is also able to perform controllable summarization to generate aspect-specific summaries using only a few samples. We report strong performance on SPACE and AMAZON datasets and perform experiments to investigate the functioning of our model.


Introduction
Opinion summarization is the task of automatically generating digests for an entity (e.g. a product, a hotel, a service, etc.), from user opinions in online forums. Automatic opinion summaries enable faster comparison, search, and better consumer feedback understanding (Hu and Liu, 2004;Pang, 2008;Medhat et al., 2014). Although there has been significant progress towards summarization (Rush et al., 2015;Nallapati et al., 2016;Cheng and Lapata, 2016;See et al., 2017;Narayan et al., 2018;Liu et al., 2018), existing approaches rely on human-annotated reference summaries, which are scarce for opinion summarization. For opinion summarization, human annotators need to read hundreds of reviews per entity across different sources for writing a summary, which may not be feasible.
This lack of labeled training data has prompted a series of works to leverage unsupervised or weakly-supervised techniques for opinion summarization (Mei et al., 2007; Titov and McDonald, 2008; Angelidis and Lapata, 2018a; Angelidis et al., 2021). Recent works in this direction have focused on performing opinion summarization in an abstractive setting (Tian et al., 2019; Coavoux et al., 2019; Chu and Liu, 2019; Isonuma et al., 2019; Bražinskas et al., 2020; Amplayo et al., 2021b; Suhara et al., 2020; Iso et al., 2021; Wang and Wan, 2021). Abstractive models are able to produce fluent summaries using novel phrases. However, they suffer from problems common in text generation, like hallucination (Rohrbach et al., 2018), text degeneration (Holtzman et al., 2020), and topic drift (Sun et al., 2020). Moreover, these approaches have been evaluated at small scales (10 reviews per entity or fewer), which does not reveal their utility in the real world, where there are typically hundreds of reviews per entity.
In this work, we focus on extractive opinion summarization: selecting review sentences that are representative of the popular opinions about an entity. We introduce an unsupervised model, Semantic Autoencoder (SemAE), which learns a representation of text over latent semantic units using dictionary learning (Dumitrescu and Irofti, 2018). SemAE does not require a parallel training corpus of reviews and summaries. Prior works on learning latent text representations (Hosking and Lapata, 2021; Angelidis et al., 2021) leverage vector quantization (van den Oord et al., 2017) to assign a text to a single latent representation that is supposed to capture one semantic sense. This approach is restrictive, as a text phrase can encapsulate multiple semantic senses. Semantic Autoencoder instead treats text as a combination of semantics and learns a distribution over latent semantic units. SemAE leverages a Transformer (Vaswani et al., 2017) for sentence reconstruction to simultaneously learn latent semantic units and sentence representations. We present a novel algorithm for performing general and controllable summarization by applying information-theoretic measures (such as relevance and redundancy) to sentence representations obtained from SemAE. Our algorithm shows strong performance on the SPACE and AMAZON datasets. Our main contributions are:
• We present Semantic Autoencoder (SemAE), which learns representations of sentences over latent semantic units.
• We introduce novel inference algorithms for general and controllable summarization utilizing information-theoretic measures.
• We show that SemAE outperforms previous methods in automatic and human evaluations.
• We perform analyses to understand how the learnt representations align with human semantics.

Related Work
Opinion summarization in an abstractive setting was first explored by Ganesan et al. (2010), where opinion redundancy is captured by modeling text in a graphical structure. Di Fabbrizio et al. (2014) proposed a hybrid approach, selecting sentences in an extractive setup and producing abstractive summaries via hand-written templates (Carenini et al., 2006). A more recent approach, MeanSum (Chu and Liu, 2019), generates summaries from an aggregate representation of the input. DenoiseSum (Amplayo and Lapata, 2020) uses a denoising autoencoder to capture popular opinions by filtering out unpopular ones. Different encoding schemes have been explored to generate the aggregate representation for abstractive decoding (Bražinskas et al., 2020; Amplayo et al., 2021b; Iso et al., 2021; Wang and Wan, 2021). Unsupervised extractive approaches for opinion summarization focus on selecting review sentences based on their saliency. Saliency computation has been explored using traditional frequency-based approaches (Nenkova and Vanderwende, 2005), similarity with the centroid in the representation space (Radev et al., 2004), and lexical similarity with all sentences in a graph-based representation (Erkan and Radev, 2004). Weakly supervised approaches (Angelidis and Lapata, 2018a; Zhao and Chaturvedi, 2020) extract opinions based on their aspect specificity and the nature of their sentiment polarity.
QT (Angelidis et al., 2021) learns sentence representations using a VQ-VAE (van den Oord et al., 2017) and performs extractive opinion summarization using a two-step sampling procedure. Unlike QT, SemAE represents sentences as distributions over latent semantic units and selects sentences for extractive summarization using information-theoretic measures. Our implementation is similar to neural topic-model-based approaches (Iyyer et al., 2016; He et al., 2017; Angelidis and Lapata, 2018a) that use a variant of dictionary learning (Elad and Aharon, 2006; Olshausen and Field, 1997) to represent text as a combination of specific semantics (e.g. aspects, relationships). In contrast to these models, where texts from the same topic are trained to have similar representations using a max-margin loss, SemAE uses an autoencoder setup to capture diverse latent semantics.

Problem Statement
Given a set of entities (e.g. hotels), a review set R_e = {r_1, r_2, ...} is provided for each entity e, where each review r_i is a sequence of sentences {s_1, s_2, ...}. The review set R_e covers a range of aspects A = {a_1, a_2, ...} relating to the domain (e.g. service and location for hotels). We denote by S_e the set of sentences from all reviews of an entity e. Our model performs two types of extractive opinion summarization: (a) general summarization, which selects a subset of sentences O_e ⊂ S_e that best represents the popular opinions in R_e, and (b) aspect summarization, where the generated summary O_e^(a) ⊂ S_e focuses on a single aspect a ∈ A.

The Semantic Autoencoder
The intuition behind Semantic Autoencoder is to represent text as a distribution over latent semantic units using dictionary learning. Learning semantic representations over a common dictionary makes them structurally aligned, enabling comparison of sentences using information-theoretic measures.
Semantic Autoencoder consists of: (a) a sentence encoder, a Transformer-based encoder that encodes an input sentence s into a multi-head representation {s_1, ..., s_H}, where H is the number of heads and s_h ∈ R^d; (b) a reconstructor, a module that forms a representation of each head vector s_h over the elements of a dictionary D ∈ R^{K×d} to produce reconstructed representations z = {z_1, ..., z_H}; and (c) a sentence decoder, a Transformer-based decoder that attends over the latent representations z to produce the reconstructed sentence ŝ. SemAE is trained on the sentence reconstruction task. The overall workflow of SemAE is shown in Figure 1.

Sentence Encoder
Every sentence s starts with a special token [SNT] and is fed to a Transformer-based encoder. We only consider the final-layer representation of the [SNT] token, s_snt ∈ R^d, and discard the other word representations. We split s_snt into H contiguous sub-vectors {s'_1, ..., s'_H}, where s'_h ∈ R^{d/H}. A multi-head representation is formed by passing each sub-vector through a linear projection followed by a layer-normalization layer:

s_h = LayerNorm(W s'_h + b),

where W ∈ R^{d×d/H} and b ∈ R^d are trainable parameters, and s_h is the vector representation of the h-th head.
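The split-and-project step above can be sketched in plain Python. This is an illustrative sketch only (no training, learned gain/bias in layer norm omitted); the function names are ours, not from the released implementation:

```python
import math

def linear(x, W, b):
    """y = W x + b, with W given as a list of rows (maps R^{d/H} -> R^d)."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean / unit variance (gain and bias omitted)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def encode_heads(s_snt, H, W, b):
    """Split the [SNT] vector into H contiguous sub-vectors and form
    s_h = LayerNorm(W s'_h + b) for each head h."""
    d = len(s_snt)
    assert d % H == 0, "d must be divisible by H"
    step = d // H
    subs = [s_snt[i * step:(i + 1) * step] for i in range(H)]
    return [layer_norm(linear(s, W, b)) for s in subs]
```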

Reconstruction
We reconstruct each encoded head representation s_h by forming a distribution α_h over the dictionary D:

α_h = softmax(D s_h),    z_h = Σ_{k=1}^{K} α_h[k] D_k,

where the reconstructed vector z_h ∈ R^d and the representation α_h ∈ R^K. We hypothesize that the dictionary D captures the representations of latent semantic units, and that α_h captures the degree to which the text encapsulates each semantic. The vectors z = {z_1, ..., z_H} are forwarded to the decoder for sentence reconstruction. The dictionary D and s_h are updated simultaneously using backpropagation. For summarization (Section 5), we consider α_h (not z_h) as the sentence representation.
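A minimal sketch of the reconstructor, under the softmax form given above (a reconstruction of the garbled equation, so treat the exact normalization as an assumption; D is a list of K dictionary rows of dimension d):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruct(s_h, D):
    """Form alpha_h, a distribution over the K dictionary rows, and the
    reconstruction z_h as the alpha-weighted combination of those rows."""
    scores = [sum(d_i * s_i for d_i, s_i in zip(row, s_h)) for row in D]
    alpha_h = softmax(scores)
    d = len(s_h)
    z_h = [sum(alpha_h[k] * D[k][i] for k in range(len(D))) for i in range(d)]
    return alpha_h, z_h
```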

Sentence Decoder
We employ a Transformer-based decoder that attends over the latent sentence representations z = {z_1, ..., z_H} instead of token-level representations from the encoder. The MultiHead(z, z, t) attention module in the decoder takes z as key and value, and the target tokens t as the query. The reconstructed sentence is generated from the decoder as ŝ = Decoder(z, t). As our goal is sentence reconstruction, we set the target tokens to be the same as the input sentence s. A sentence can capture only a small number of semantic senses; we enforce this with sparsity constraints on the representations α_h, so that z_h is a combination of only a few semantic units. The encoder, reconstructor, and decoder are trained together to minimize the loss function:

Figure 1: An example workflow of SemAE. The encoder produces H = 3 representations (s_h) for a review sentence s (input: "The rooms were comfortable."), which are used to generate activation distributions (α_h) over dictionary elements. The decoder reconstructs the input sentence (ŝ: "The rooms were cozy.") using vectors (z_h) formed from the activation distributions.
L = L_CE + λ_1 Σ_h |α_h|_1 + λ_2 Σ_h H(α_h),

where L_CE is the reconstruction cross-entropy loss of the decoder, and, to ensure sparsity of α_h, we penalize its L1-norm |α_h|_1 and its entropy H(α_h), weighted by λ_1 and λ_2.
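The sparsity-regularized objective can be sketched as follows (a hedged sketch: the weight names `lam_l1` and `lam_ent` are ours, and `ce_loss` stands in for the decoder's cross-entropy term):

```python
import math

def entropy(alpha, eps=1e-12):
    """H(alpha) for a non-negative activation vector."""
    return -sum(a * math.log(a + eps) for a in alpha if a > 0)

def l1_norm(alpha):
    return sum(abs(a) for a in alpha)

def semae_loss(ce_loss, alphas, lam_l1=1.0, lam_ent=1.0):
    """L = L_CE + lam_l1 * sum_h |alpha_h|_1 + lam_ent * sum_h H(alpha_h)."""
    return (ce_loss
            + lam_l1 * sum(l1_norm(a) for a in alphas)
            + lam_ent * sum(entropy(a) for a in alphas))
```

Both penalties push each α_h toward a peaked distribution, so that z_h mixes only a few dictionary elements.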

Summarization using Latent Representations
We leverage the latent representations α_h generated by SemAE to perform opinion summarization.

General Summarization
For obtaining the general summary of an entity, we first compute the mean representation of all review sentences in S_e, which represents the aggregate distribution over semantic units. The general summary is then obtained as the collection of sentences that most resemble this mean distribution. Formally, every sentence s is associated with a representation over dictionary elements α_s = [α_1^s, ..., α_H^s], where α_s ∈ R^{H×K}. We form the mean representation of the review sentences S_e of an entity as:

ᾱ = (1 / |S_e|) Σ_{s ∈ S_e} α_s,

where α_s is the representation of sentence s ∈ S_e. For general summarization, we compute the relevance score R(·) of each sentence s from its similarity with the mean representation ᾱ:

R(α_s) = ∆(ᾱ, α_s) = −Σ_h KL(ᾱ_h || α_h^s),

where α_h^s is the latent representation of sentence s for the h-th head. ∆(x, y) denotes the similarity between two representations x and y, implemented as the negated sum of KL-divergences between corresponding head representations. We also experimented with other divergence metrics and observed similar summarization performance (Appendix A.3).
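The mean representation and KL-based relevance score can be sketched directly from the definitions above (plain Python; each sentence representation is an H × K nested list, and the epsilon smoothing is our addition for numerical safety):

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) between two activation vectors, smoothed against zeros."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def delta(x, y):
    """Delta(x, y): negated sum of head-wise KL divergences (x, y are H x K)."""
    return -sum(kl(xh, yh) for xh, yh in zip(x, y))

def mean_representation(alphas):
    """Element-wise mean of sentence representations (each H x K)."""
    n = len(alphas)
    H, K = len(alphas[0]), len(alphas[0][0])
    return [[sum(a[h][k] for a in alphas) / n for k in range(K)] for h in range(H)]

def relevance(alpha_s, alpha_bar):
    """R(alpha_s) = Delta(alpha_bar, alpha_s); 0 is the maximum (identical)."""
    return delta(alpha_bar, alpha_s)
```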
We rank sentences in descending order of R(·) and select the top N (a constant hyperparameter, N < |S_e|) sentences as the summary O_e (shown in Figure 2). The extracted summary is the concatenation of the text of the N selected input sentences (Input (s) in Figure 1). However, modeling relevance using ∆(·, ·) alone results in the selection of similar sentences. We overcome this by designing variants of our system with additional information-theoretic constraints.
(a) Redundancy: We introduce diversity into the generated summary by penalizing sentences that are highly similar to already selected sentences. This is achieved by adding a redundancy term to the relevance score:

R(α_s, Ô_e) = ∆(ᾱ, α_s) − γ Σ_{s' ∈ Ô_e} ∆(α_{s'}, α_s),    (6)

where Ô_e is the set of sentences selected so far for the summary and γ controls the strength of the penalty. The selection routine proceeds greedily, choosing s_0 = arg max_{s ∈ S_e} ∆(ᾱ, α_s) when Ô_e = ∅.
(b) Aspect-awareness: Another drawback of sentence selection using ∆(·, ·) alone is that the summary frequently switches context among different aspects (example shown in Table 7). To mitigate this, we identify the aspect of a review sentence from occurrences of the aspect-denoting keywords provided in the dataset (Section 5.2). We then cluster the sentences into aspect-specific buckets {S_e^(a1), S_e^(a2), ...} and rank sentences within each bucket, ignoring sentences that are not part of any bucket. We select sentences using two different strategies:
• We iterate over the sentence buckets {S_e^(ai)} and select from each bucket the first m sentences ranked according to R(α_s).
• We prevent the selection of similar sentences from a bucket by additionally including the redundancy term: we iterate over the individual buckets and select from each the first m sentences ranked by their relevance R(α_s, Ô_e) (Equation 6).
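The greedy, redundancy-penalized loop can be sketched as follows (γ = 0.1 as tuned in Appendix A.1; the KL helpers are repeated so the sketch is self-contained):

```python
import math

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def delta(x, y):
    """Negated sum of head-wise KL divergences between H x K representations."""
    return -sum(kl(xh, yh) for xh, yh in zip(x, y))

def select_summary(sentences, alphas, alpha_bar, N, gamma=0.1):
    """Greedily pick N sentences: relevance to the mean representation,
    adjusted by similarity to sentences already in the summary (O_e)."""
    chosen = []                        # indices selected so far
    remaining = list(range(len(sentences)))
    while remaining and len(chosen) < N:
        def score(i):
            redundancy = sum(delta(alphas[j], alphas[i]) for j in chosen)
            return delta(alpha_bar, alphas[i]) - gamma * redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return [sentences[i] for i in chosen]
```

On the first iteration `chosen` is empty, so the score reduces to plain relevance, matching the initialization s_0 = arg max ∆(ᾱ, α_s).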

Aspect Summarization
SemAE can perform aspect summarization without additional training. It requires only a small set of keywords to identify sentences that discuss an aspect; for example, the food aspect is captured using keywords such as "breakfast" and "buffet". For a given aspect a, let the keyword set be Q_a = {w_1, w_2, ...}. We use Q_a to identify, for each entity e, a set of sentences S_e^(a) belonging to aspect a from a held-out dev set S_dev. As in general summarization, we proceed by computing the mean representation of the sentences S_e^(a) belonging to aspect a:

ᾱ^(a) = (1 / |S_e^(a)|) Σ_{s ∈ S_e^(a)} α_s.

We then select the sentences most similar to this mean representation as the summary.
(a) Informativeness: Sentences selected for aspect summarization should discuss the aspect rather than general information. We model informativeness (Peyrard, 2019) by ensuring that a selected sentence representation α_s resembles the aspect mean ᾱ^(a) but diverges from the overall representation mean ᾱ for the given entity e. For an aspect a, we iterate over sentences in S_e^(a) and compute the relevance score of a sentence s as:

R_a(α_s) = β ∆(ᾱ^(a), α_s) − (1 − β) ∆(ᾱ, α_s),    (8)

where β balances aspect-specific relevance against general relevance (tuned on the development set; Appendix A.1). We rank sentences according to their aspect-specific relevance score R_a(·) and select the first N sentences as the summary for aspect a, O_e^(a).
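The aspect-specific score might be sketched as below. Note the β-weighted combination follows our reconstruction of Equation 8 (the original equation was lost in extraction), with β = 0.7 from Appendix A.1:

```python
import math

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def delta(x, y):
    """Negated sum of head-wise KL divergences between H x K representations."""
    return -sum(kl(xh, yh) for xh, yh in zip(x, y))

def aspect_relevance(alpha_s, aspect_mean, overall_mean, beta=0.7):
    """R_a: reward closeness to the aspect mean, penalize closeness to the
    overall mean (informativeness); beta balances the two terms."""
    return (beta * delta(aspect_mean, alpha_s)
            - (1 - beta) * delta(overall_mean, alpha_s))
```

A sentence matching the aspect mean exactly gets a bonus from its distance to the overall mean, while a generic sentence near the overall mean is pushed down.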

Experimental Setup
In this section, we discuss the experimental setup, results and analysis.

Datasets
We evaluated our model on two public customer-review datasets: SPACE hotel reviews (Angelidis et al., 2021) and AMAZON product reviews (He and McAuley, 2016; Bražinskas et al., 2020). Dataset statistics are reported in Table 1. The test sets of both datasets contain three human-written general summaries per entity. The SPACE corpus was created in a two-step process of sentence selection followed by summarization of the selected sentences by annotators (further details in Appendix A.2). The SPACE dataset also provides human-written summaries for six different aspects of hotels: building, cleanliness, food, location, rooms, and service.

Implementation Details
We used a 3-layer Transformer with 4 attention heads as the encoder and decoder. The input and hidden dimensions were set to 320, and the Transformer feed-forward dimension to 512. The encoder and decoder of SemAE were trained for 4 warmup epochs before the dictionary-learning-based reconstruction component was introduced. We split the encoded vector into H = 8 head representations and use K = 1024 dictionary elements, each of dimension d = 320. The dictionary elements are initialized using k-means clustering of review sentence representations. All hyperparameters were tuned on the development set of each dataset (see Appendix A.1 for more details).

Metrics
We measure lexical overlap with the human-written summaries using ROUGE F-scores. For the SPACE dataset, we also measure how well general summaries cover different aspects by computing the mean ROUGE-L score against the gold aspect summaries (denoted RL_ASP).
We also compute a perplexity (PPL) score to evaluate the readability of summaries; perplexity is computed from the cross-entropy loss of a BERT-base model. We measure the aspect coverage of a system by computing the average number of distinct aspects, N_ASP, in the generated summaries. Lastly, to evaluate repetition in summaries, we compute the percentage of distinct n-grams (n = 2).
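The distinct n-gram metric is straightforward to compute; a minimal sketch (whitespace tokenization assumed for illustration):

```python
def distinct_n(text, n=2):
    """Fraction of distinct n-grams among all n-grams of a tokenized text.
    Higher values indicate less repetition."""
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```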

Baselines
We compare SemAE with three types of systems: (a) Best Review systems: we report the performance of the Centroid method, where reviews are encoded using BERT or SentiNeuron (Radford et al., 2017), and the review most similar to the mean representation is selected; (b) Abstractive systems: MeanSum (Chu and Liu, 2019) and CopyCat (Bražinskas et al., 2020); (c) Extractive systems: we report the performance of LexRank (Erkan and Radev, 2004), where sentences are encoded using BERT, SentiNeuron, or tf-idf vectors. We also report the performance achieved by selecting review sentences at random.

Results
General Summarization: We present the results of general summarization on the SPACE dataset in Table 2. SemAE and its variants show strong improvements over the previous state of the art, QT, and other baselines across all ROUGE metrics. They also outperform abstractive systems (such as CopyCat and MeanSum) by a large margin, which shows that SemAE can effectively select relevant sentences from a large pool of reviews. All variants of SemAE outperform other models on the RL_ASP metric, showing that general summaries from SemAE cover aspects better than the baselines.
We further evaluate summary quality for all variants of SemAE, along with our strongest baseline QT, using other automatic metrics in Table 3. The first row of Table 3 reports the performance of QT, which achieves the highest distinct n-gram score but a poor perplexity score, showing that QT generates diverse but less coherent text. SemAE achieves the best perplexity score (second row of Table 3) but produces the least diverse text (lowest distinct n-gram score). The third row of Table 3 reports the performance of SemAE with the redundancy term. Comparing rows 2 and 3 of Table 3, we observe that summaries from SemAE (w/ redundancy) have more distinct n-grams (less repetition), while falling behind in perplexity. The remaining rows of Table 3 report the aspect-aware variants. We observe that iteratively covering aspects reduces repetition (an increase in the distinct-n score). As expected, the mean aspect coverage (E[N_ASP]) improves in the aspect-aware SemAE variants. However, a slight drop in aspect coverage is observed when the redundancy term is introduced (last row of Table 3). We also observe an increase in perplexity for the aspect-aware variants, which may be caused by multiple changes of aspect context. Overall, SemAE (w/ aspect + redundancy) produces diverse text with high aspect coverage and a decent perplexity score, making it the best-performing variant.
Evaluation results on the AMAZON dataset are reported in Table 4. SemAE and its variants achieve similar performance, with SemAE performing best among all extractive summarization systems. (We do not have aspect-aware selection variants on AMAZON, as it does not provide aspect-denoting keywords.) SemAE falls short only of abstractive summarization systems, which have the advantage of generating novel phrases not present in the input reviews. Also, while SemAE beats most baselines on AMAZON, the performance gain is not as large as on SPACE.
We believe this is because the number of reviews per entity in AMAZON (8) is much lower than in SPACE (100). As SemAE depends on the mean representation ᾱ, having more reviews helps capture the popular opinion distribution accurately. For practical purposes, opinion summarization systems are most useful when there are hundreds or more reviews per entity; the larger improvement on SPACE therefore shows the efficacy of SemAE in real-world settings.
Aspect Summarization: For aspect summarization, each sentence cluster S_e^(a) was assigned an aspect a based on the frequency of aspect-denoting keywords in the cluster's sentences. The models then produced summaries for each aspect a given the input set S_e^(a). All models, including SemAE, use the same aspect-denoting keywords.
Evaluation results on SPACE are reported in Table 5, where SemAE's performance is comparable to or better than the baselines. We observe that adding the informativeness term (∆(ᾱ, α_s) in Equation 8) improves the specificity of the aspect, thereby boosting performance. SemAE also shows significant gains in average ROUGE-1/2 and ROUGE-L across the different aspects.
Human Evaluation: We performed human evaluations of the general and aspect summaries. We evaluated general summaries from QT, our best-performing variant SemAE (w/ aspect + redundancy), and the gold summaries. Summaries were judged by 3 human annotators on three criteria: informativeness, coherence, and non-redundancy. The judges were presented summaries in a pairwise manner and asked to select which one was better, worse, or similar. Scores (-100 to +100) were computed using Best-Worst Scaling (Louviere et al., 2015). The first half of Table 6 reports the evaluation results, where we observe that SemAE (w/ aspect + redundancy) outperforms our strongest baseline, QT, on all criteria (statistical significance information is provided in the caption of Table 6). However, summaries generated by both systems remain far from the gold summaries on all criteria. We also evaluated aspect summaries generated by SemAE and QT in a similar manner. Aspect summaries were judged on two criteria: aspect informativeness (usefulness of opinions for a specific aspect, consistent with the reference) and aspect specificity (how specific the summary is to an aspect, without considering other factors). The bottom half of Table 6 reports the results for aspect summaries. We observe that both QT and SemAE produce aspect-specific summaries; however, SemAE shows a statistically significant improvement over QT in aspect informativeness.

Analysis
Latent Dictionary Interpretation. In this section, we investigate the semantic meanings learnt by individual dictionary elements D_k. We visualized the UMAP projection (McInnes et al., 2018) of the dictionary element representations (shown in Figure 3). Across different runs of SemAE, we found that the dictionary representations converged into clusters, as shown in Figure 3 (elements are color-coded by the cluster identities assigned by the k-means algorithm with k = 12).
We hypothesize that the clusters capture certain semantic meanings. We explore this hypothesis by identifying sentences whose representations are most similar to the mean representations {µ_1, ..., µ_K} of each cluster. For each head h in the encoder (Section 4.1), we compute the cosine similarity of sentences with the cluster means. Table 8 shows examples of sentences with the highest similarity to a cluster mean µ_k for a head representation h. We observe that in most cases sentences closest to a cluster share a similar semantic meaning. For hotel reviews, we observe that sentences often discuss a specific aspect like service, food, or rooms, as shown for the (h, k) configurations (3, 5), (2, 8), and (5, 8) in Table 8. The clusters sometimes capture coarse semantics, like the presence of a particular word or phrase (e.g. configuration (0, 10) in Table 8), and can also capture high-level semantics, like the experience of a customer (e.g. configuration (6, 0)). It was interesting to observe that a single cluster can capture different semantics for distinct heads (cluster 8 in configurations (2, 8) and (5, 8)).

Table 7: Example general summaries produced by SemAE, SemAE (w/ redundancy), SemAE (w/ aspect), and SemAE (w/ aspect + redundancy) for a SPACE hotel.

Qualitative Examples. Table 7 shows summaries generated by SemAE and its variants for the SPACE dataset. While the summary generated by SemAE talks about location, staff, and service multiple times (highlighted in Table 7), the summary from SemAE (w/ redundancy) avoids that repetition.
Also, the summary generated by SemAE switches context frequently. For example, the aspect of the first three sentences changes from service→location→service. We observe that compared to SemAE, both aspect-aware SemAE variants generate summaries without abrupt context switches. The summary generated by SemAE (w/ aspect) covers aspects like service, hotel, food and rooms sequentially, but sentences referring to an aspect are quite similar. SemAE (w/ aspect + redundancy) overcomes this shortcoming, and introduces diversity among the aspect-specific sentences.
Training Data Efficiency. We analyze the performance of SemAE, QT, and CopyCat for general summarization (ROUGE-1) on SPACE with varying fractions of training data in Table 9. We observe that both QT and SemAE perform well with little training data; however, SemAE outperforms QT in all low-resource settings. SemAE with 10% of the data yields significant ROUGE-1 improvements over QT with access to 100% of the data.
Impact of number of reviews. We investigate whether SemAE's performance gain on SPACE is due to the larger number of reviews available (reviews per entity: AMAZON 8, SPACE 100). Specifically, we perform ablation experiments by reducing the number of reviews per entity in the SPACE dataset. We remove user reviews with low relevance scores (the relevance score of a review is the average R(·) of its sentences). Table 10 reports the performance of SemAE with different numbers of reviews per entity in the test set. We observe a gradual decline in ROUGE-1 score as the reviews per entity are reduced, which shows that having more reviews per entity helps extractive summarization.
Additional Controllable Summarization: We showcase that SemAE can perform different forms of controllable summarization. Specifically, we perform sentiment-based summarization using a small number (10) of seed sentences belonging to the positive, negative, and neutral sentiment classes. Seed sentences were annotated using the rule-based system VADER (Hutto and Gilbert, 2014). An example of sentiment-based summarization is shown in Table 11. We observe that SemAE generates summaries aligning with the seed sentiments. We also perform multi-aspect summarization with SemAE by controlling the aspect of the selected sentences. Table 12 shows an example of multi-aspect summarization. An interesting observation is that SemAE is able to select single sentences that cover multiple aspects (shown in blue), rather than independent sentences from different aspects.
These experiments show that SemAE is able to capture and leverage granular semantics for summarization. In Appendix A.5, we perform additional analyses: head-wise analysis, the efficacy of the sparsity constraints, dictionary evolution, and a qualitative comparison of SemAE with the baselines (QT and CopyCat).

Conclusion
We proposed a novel opinion summarization approach using Semantic Autoencoder, which encodes sentences as representations over latent semantic units learned via dictionary learning.

Table 11: Example of sentiment-based summarization on SPACE.

Positive: Love the warm chocolate chips cookies and the service has always been outstanding. Excellent morning breakfasts and the airport shuttle runs every 15 minutes but we have made the 10 minute walk numerous times to the airport terminal.

Negative: To add insult to injury, for people who use the parking lot to "park and fly", the charge is $7.95/day, almost half of what the hotel guests are charged!! Cons - Hotel is spread out so pay attention to how to get to your room as you may get lost, Feather pillows (synthetic available on request), Pay parking ($16 self/day $20 valet/day), warm cookies on check in.

Neutral: Stayed at this hotel beause the park n fly. We have stayed at this hotel several times in the family suite (2 bedrooms/1 king and 2 queen beds). Despite the enormity of this hotel, it very much feels almost family run. The staff was friendly and helpful and we enjoyed the warm, chocolate chip cookie we were given at check-in. The breakfast in the restaurant was amazing, and the staff was very attentive and friendly.

Table 12: Example of multi-aspect summarization.

(room, cleanliness): The bed was very nice, room was clean, we even had a balcony. The beds were comfortable and the room was very clean.

A Appendix
A.1 Implementation Details

The Transformer is trained without the dictionary-learning reconstruction for 4 warmup epochs. We tokenized text in an unsupervised manner using the SentencePiece tokenizer with a 32K vocabulary size. The model was trained using the Adam optimizer with a learning rate of 10^-3 and a weight decay of 0.9. Our model was trained for 10 epochs on a single GeForce GTX 2080 Ti GPU in 35 hours. The loss function parameters are reported in Table 13. Hyperparameters were tuned on the development set of each dataset based on ROUGE-1 F-score. For aspect summarization, we set β = 0.7 after tuning (grid search between 0.1 and 1, with intervals of 0.1) on the development set. We chose the redundancy term constant γ = 0.1 in a similar manner. After training, summaries were generated with N = 20, and we limit the summary length to 75 tokens. Each keyword w_i ∈ Q_a is associated with a confidence score for aspect a; if a sentence contains multiple keywords belonging to different aspects, we use the confidence scores to assign the aspect.

A.2 Dataset Construction
In this section, we provide some background on the dataset creation process for SPACE and AMAZON. The SPACE corpus has a large number of reviews per entity; Angelidis et al. (2021) therefore collected summaries from reviews following a two-step procedure: (a) sentence voting and (b) summary collection. The sentence voting step involves selecting informative review sentences using a majority vote among the annotators, who were prompted to select between 20-40% of the total sentences. Summary collection involves generating an overview summary of the selected sentences up to a 100-word budget. For aspect summaries, the selected sentences were annotated using an off-the-shelf aspect classifier (Angelidis and Lapata, 2018b), and human annotators were asked to summarize the selected sentences belonging to each aspect. The AMAZON dataset has a relatively low number of reviews per entity. Its evaluation set was created by sampling 60 entities and 8 reviews per entity, which were provided to human annotators for summarization (Bražinskas et al., 2020).

A.3 Ablations
• Divergence metric: SemAE uses KL divergence to measure the relevance of a sentence representation α_s with respect to the mean ᾱ. In this ablation, we instead use cosine similarity as the divergence function ∆(·, ·). The second row in Table 14 reports the performance in this setup, which is similar to the baseline SemAE performance. This shows that cosine similarity can serve as a good proxy for measuring relevance R(·).
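Both divergence choices can be sketched as below. The exact form of the cosine-based divergence was not recoverable from the text, so we assume the common convention 1 − cos(·, ·), which, like KL divergence, is zero for identical representations and grows with dissimilarity.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two (normalized) probability vectors."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def cosine_divergence(p, q):
    """Assumed form of the cosine-based divergence: 1 - cosine similarity."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(1.0 - p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# Toy sentence representation and mean representation over dictionary elements.
mean_repr = np.array([0.5, 0.3, 0.2])
sent = np.array([0.4, 0.4, 0.2])
```

Either function can be plugged in as ∆(·, ·); lower values indicate a sentence closer to the mean, i.e., more relevant.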
• Informativeness: In this ablation experiment, we incorporate an informativeness term into general summarization. The modified relevance score subtracts a divergence term computed against α^(b), the mean representation of all sentences across all entities. α^(b) captures the background knowledge distribution (Peyrard, 2019), and a good summary should be divergent from the background information. The third row in Table 14 reports the performance in this setup, where we do not observe any gain over the baseline. We believe this may be due to the fact that α^(b) does not capture the background knowledge properly, as it is the mean representation of hotel review sentences across all entities.
For both ablation setups, we observe almost no change in perplexity, aspect coverage, and distinct n-gram metrics.

A.4 Variations of Sentence Selection
(a) Herding (Chen et al., 2010): In this setup, we modify the selection mechanism of SemAE by updating the mean representation every time a sentence is selected: we take the mean over only the sentences that have not yet been selected. The intuition behind this approach is that the next selected sentence should best capture information that is not yet present in the summary. At each time step t, we select the representation α_{s_t} closest to ᾱ_t, the mean representation of the sentences that are not yet part of the summary, and add it to the set Ô_e of sentences selected so far. Table 15 reports the results of this setup. We observe a significant drop in performance compared to SemAE. We believe that removing the selected sentences skews the mean towards outlier review sentences, resulting in the drop in performance.

(b) Optimal Transport: In this setup, we consider the Wasserstein distance between two probability distributions. The Wasserstein distance (Peyré et al., 2019), arising from the concept of optimal transport, takes into account the underlying geometry of the representation space. Let M_1^+(R^d) be the space of probability distributions defined on R^d with d ∈ Z^+. The Wasserstein distance between two arbitrary probability distributions μ ∈ M_1^+(X) and ν ∈ M_1^+(Y) is denoted by W(μ, ν). Following Colombo et al. (2021), we compute a Wasserstein barycenter of all sentences for each head h; the overall barycenter representation is μ^c = [μ^c_1, . . . , μ^c_H]. Next, we derive the relevance score of each sentence s from its distance to the barycenter and, as shown in Equation 14, select sentences with low Wasserstein distance from the barycenter. We report the results for this optimal transport setup in Table 15. We find that the performance of this setup is significantly lower than SemAE on the SPACE dataset, but comparable to other baselines on the AMAZON dataset.
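The herding-style selection in (a) can be sketched as a greedy loop. The selection equations were lost in extraction, so this follows the prose description: pick the sentence closest (here, in KL divergence) to the mean of the not-yet-selected pool, then recompute that mean.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def herding_select(reprs, n_select):
    """Greedy herding-style selection (sketch): at each step, pick the
    sentence closest to the mean of the unselected pool, then recompute
    the mean over the remaining sentences."""
    remaining = list(range(len(reprs)))
    selected = []
    for _ in range(min(n_select, len(reprs))):
        mean = np.mean([reprs[i] for i in remaining], axis=0)
        best = min(remaining, key=lambda i: kl(mean, reprs[i]))
        selected.append(best)
        remaining.remove(best)  # the mean now shifts toward unpicked sentences
    return selected

# Toy pool: two similar sentences and one outlier.
reprs = [[0.60, 0.30, 0.10], [0.62, 0.28, 0.10], [0.10, 0.20, 0.70]]
selected = herding_select(reprs, 2)
```

Because the outlier (index 2) pulls the recomputed mean toward itself once the typical sentences are removed, this variant can favor atypical sentences, consistent with the performance drop reported above.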
(c) Clustering-based Sentence Selection: In this setup, instead of selecting sentences similar to the mean representation, we identify clusters formed by the representations. For clustering, we flatten the sentence representation α_s ∈ R^{HK} and use k-means clustering (the number of clusters K is a hyperparameter). We select sentences that are representative samples of each cluster. The relevance score of each sentence is computed from α_C, the representation of the center of the cluster to which s belongs, and |C|, the size of that cluster. The first term in Equation 15 penalizes the relevance of a sentence for being too far from its cluster center, and the second term favors the selection of samples from large clusters. The hyperparameters γ = 0.005 and K = 5 were selected using development set performance.
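The relevance score of Equation 15 can be sketched as below. Since the equation itself was lost in extraction, the exact form (negative Euclidean distance to the cluster center plus a γ-weighted cluster-size bonus) is an assumption consistent with the two-term description above.

```python
import numpy as np

def cluster_relevance(alpha_s, cluster_center, cluster_size, gamma=0.005):
    """Assumed form of Equation 15: penalize distance to the assigned
    cluster center, reward membership in a large cluster."""
    dist = float(np.linalg.norm(np.asarray(alpha_s, float) - np.asarray(cluster_center, float)))
    return -dist + gamma * cluster_size
```

For example, a sentence sitting exactly at the center of a 100-sentence cluster outscores a sentence far from the center of a 10-sentence cluster; cluster assignments themselves would come from running k-means over the flattened representations.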
In Table 15, we observe that clustering-based sentence selection works poorly on the SPACE dataset, while its performance on AMAZON is decent. The performance on SPACE is poor because it has a large number of reviews, making identification of representative clusters difficult with this approach.

We also experiment with different configurations of sparsity losses. Specifically, we gauge SemAE's performance when the L1 loss and the entropy loss are removed. Table 16 reports the results with the different loss setups. We observe a drop in performance when either of the sparsity losses is removed. This shows that ensuring sentence representations are a sparse combination of semantic units helps summarization.

A.5 Extended Analysis
(b) Head-wise Analysis: We analyze whether there is a correlation between the head-wise representations and the clusters formed by dictionary elements. For each dictionary element, we compute the average attention α_h it receives from each head h, and assign the element to the head from which it receives the maximum mean attention (head-wise dictionary elements are shown in Figure 4). We also compute the performance of general summarization when only a single head representation is considered, i.e., ∆(α_s, ᾱ) = KL(ᾱ_h, α_s,h). In Figure 4, we observe that heads whose instances span multiple dictionary element clusters (h = 0, 3, 5, 7) perform better than heads whose instances are concentrated over a few clusters (h = 1, 2).

SemAE | QT | Copycat

All staff members were friendly, accommodating, and helpful. The hotel and room were very clean. The room had modern charm and was nicely remodeled. The beds are extremely comfortable. The rooms are quiet with wonderful beach views. The food at Hash, the restaurant in the lobby, was fabulous. The location is great, very close to the beach. It's a longish walk to Santa Monica. The price is very affordable.

The staff is great. The Hotel Erwin is a great place to stay. The staff were friendly and helpful. The location is perfect. We ate breakfast at the hotel and it was great. The hotel itself is in a great location. The service was wonderful. It was great. The rooms are great. The rooftop bar HIGH was the icing on the cake. The food and service at the restaurant was awesome. The service was excellent.

Great hotel. We liked our room with an ocean view. The staff were friendly and helpful. There was no balcony. The location is perfect. Our room was very quiet. I would definitely stay here again. You're one block from the beach. So it must be good! Filthy hallways. Unvacuumed room. Pricy, but well worth it.

This hotel is in a great location, just off the beach. The staff was very friendly and helpful. We had a room with a view of the beach and ocean. The only problem was that our room was on the 4th floor with a view of the ocean. If you are looking for a nice place to sleep then this is the place for you.

Table 17: Human-written and system-generated summaries from SemAE, QT and Copycat. We showcase the summary for the same instance reported by previous works.

Food: The food and service at the restaurant was awesome. The food at Hash, the restaurant just off of the lobby, was fabulous for breakfast. The food was excellent (oatmeal, great wheat toast, fresh berries and a tasty corned beef hash).

Location: The Hotel Erwin is a great place to stay. The hotel is not only in the perfect location for the ideal LA beach experience, but it is extremely hip and comfortable at the same time.

Cleanliness: The room was spacious and had really cool furnishings, and the beds were comfortable. The room itself was very spacious and had a comfortable bed. We were upgraded to a partial ocean view suite and the room was clean and comfortable.

Service: The hotel staff were friendly and provided us with great service. The staff were friendly and helpful. The staff was extremely helpful and friendly. The hotel staff was friendly and the room was well kept.

Building: The rooftop bar at the hotel, "High", is amazing. The rooftop bar HIGH was the icing on the cake. The Hotel Erwin is a great place to stay. The best part of the hotel is the 7th floor rooftop deck.

Rooms: The room was spacious and had really cool furnishings, and the beds were comfortable. The room itself had a retro 70's feel with a comfortable living room and kitchen area, a separate bedroom with a nice king size bed, and a sink area outside the shower/toilet area.

Table 18: Aspect-specific summaries generated by SemAE for a hotel entity.
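Returning to the head-wise analysis in (b): the element-to-head assignment step can be sketched as an argmax over mean attention. The attention values below are illustrative only; in the actual model they would be averaged over all sentences in the corpus.

```python
import numpy as np

# Sketch of head assignment: mean_attn[h, k] is the (toy) average attention
# that dictionary element k receives from head h. Each element is assigned
# to the head giving it the highest mean attention.

def assign_elements_to_heads(mean_attn):
    """Return, for each dictionary element, the index of its assigned head."""
    return np.asarray(mean_attn, float).argmax(axis=0)

# Toy example: 3 heads x 3 dictionary elements.
mean_attn = np.array([[0.6, 0.1, 0.2],
                      [0.3, 0.7, 0.1],
                      [0.1, 0.2, 0.7]])
heads = assign_elements_to_heads(mean_attn)
```

Grouping elements by their assigned head yields the head-wise element clusters visualized in Figure 4.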
(c) Output summaries: Table 17 shows the summaries generated by SemAE, QT and Copycat alongside the human-written summary. We observe that SemAE selects well-formed sentences, avoiding truncated sentences and ones written in the first person. Table 18 reports the summaries generated by SemAE for different aspects of a hotel entity. We observe that SemAE is able to produce summaries that discuss only the specified aspect.
(d) Evolution of Dictionary Representations: We plot the UMAP projections of dictionary elements from epoch 4 (after encoder warmup is complete) to epoch 10 in Figure 5. During training, we observe that the UMAP projections of dictionary elements form a set of clusters. The first signs of cluster formation appear in epoch 7, and the clusters become more distinct over the later epochs.
(e) Ablations with QT: In this section, we analyze the efficacy of our sentence selection (SS) module. We evaluate summarization performance using our sentence selection scheme with sentence representations retrieved from QT and from SemAE. The experiments were performed using 5% of the data from the SPACE dataset. For QT's representations, we obtain α_h (Equation 2) analogously from QT's sentence encodings. In Table 19, we observe that incorporating our sentence selection (SS) improves QT's performance in terms of ROUGE-2 and ROUGE-L scores, with a small drop in ROUGE-1. However, the performance still falls behind SemAE, showing that our representation learning model complements the sentence selection scheme. From these two results, we conclude that the better performance of SemAE can be attributed to the combination of the two components. Note that using QT's sentence selection with SemAE's representations is not feasible, as SemAE does not quantize sentences to a single latent code.
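The sentence selection (SS) scheme evaluated here can be sketched end to end. This is a simplified sketch under stated assumptions: relevance is negative KL divergence from the corpus mean, and the redundancy term (with the γ = 0.1 constant mentioned in A.1) discounts candidates similar to already-selected sentences; the exact redundancy formulation in the paper may differ.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def select_sentences(reprs, n_select, gamma=0.1):
    """Sketch of the SS scheme: greedily pick sentences close (in KL) to the
    corpus mean, with a redundancy adjustment against already-selected ones."""
    mean = np.mean(reprs, axis=0)
    selected, candidates = [], list(range(len(reprs)))
    for _ in range(min(n_select, len(reprs))):
        def score(i):
            relevance = -kl(mean, reprs[i])
            # Similarity to the most similar already-selected sentence
            # (0 at the first step, when nothing is selected yet).
            redundancy = max((-kl(reprs[j], reprs[i]) for j in selected), default=0.0)
            return relevance - gamma * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy pool: two near-duplicate typical sentences and one atypical one.
reprs = [[0.6, 0.3, 0.1], [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
selected = select_sentences(reprs, 2)
```

The same loop runs unchanged whether the representations α_s come from SemAE or from QT, which is what makes the SS-module swap in Table 19 possible.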