Aspect-Controllable Opinion Summarization

Recent work on opinion summarization produces general summaries based on a set of input reviews and the popularity of opinions expressed in them. In this paper, we propose an approach that allows the generation of customized summaries based on aspect queries (e.g., describing the location and room of a hotel). Using a review corpus, we create a synthetic training dataset of (review, summary) pairs enriched with aspect controllers which are induced by a multi-instance learning model that predicts the aspects of a document at different levels of granularity. We fine-tune a pretrained model using our synthetic dataset and generate aspect-specific summaries by modifying the aspect controllers. Experiments on two benchmarks show that our model outperforms the previous state of the art and generates personalized summaries by controlling the number of aspects discussed in them.


Introduction
Consumers oftentimes resort to review websites to inform their decision making (e.g., whether to buy a product or use a service). The proliferation of online reviews has accelerated research on opinion mining (Pang and Lee, 2008), where the ultimate goal is to glean information from reviews so that users can make decisions more efficiently. Opinion mining has assumed several guises in the literature such as sentiment analysis (Pang et al., 2002), aspect extraction (Hu and Liu, 2004; He et al., 2017), combinations thereof (Mukherjee and Liu, 2012; Pontiki et al., 2016), and notably opinion summarization (Hu and Liu, 2006; Wang and Ling, 2016), whose aim is to create a textual summary of opinions found in multiple reviews.
Text summarization models, both extractive (Narayan et al., 2018; Zheng and Lapata, 2019; Cachola et al., 2020) and abstractive (See et al., 2017; Gehrmann et al., 2018; Liu and Lapata, 2019), operate under the assumption that salient content is relevant (Erkan and Radev, 2004) and should be presented in the summary. Opinion summarization is no exception, focusing on creating summaries based on opinions that are popular or redundant across reviews (Angelidis and Lapata, 2018b; Chu and Liu, 2019; Amplayo and Lapata, 2020; Bražinskas et al., 2020; Amplayo et al., 2021).

Table 1: General and aspect-specific summaries generated by our model for a hotel from the SPACE dataset. Aspects and aspect-specific sentences are color-coded.
However, the notion of salience in reviews largely depends on user interest. For example, one might only care about the connectivity of a television product, an aspect which might be unpopular amongst reviews. As a result, models that create general opinion summaries may not satisfy the needs of all users, limiting their ability to make decisions. Angelidis et al. (2021) mitigate this problem with an extractive approach that produces both general and aspect-specific opinion summaries. They achieve this essentially by clustering opinions through a discrete latent variable model (van den Oord et al., 2017) and extracting sentences based on popular aspects or a particular aspect. By virtue of being extractive, their summaries can be incoherent and verbose, containing unnecessary redundancy. And although their model creates summaries for individual aspects, it is not clear how to control the number of aspects in the output (e.g., to obtain summaries that mention multiple rather than a single aspect of an entity).
In this paper, we propose an abstractive opinion summarization model that generates aspect-controllable summaries. Using a corpus of reviews on entities (e.g., hotels, television sets), we construct a synthetic training dataset consisting of reviews, a pseudo-summary, and three types of aspect controllers which reflect different levels of granularity: aspect-related keywords, review sentences, and document-level aspect codes. We induce aspect controllers automatically based on a multiple instance learning model (Keeler and Rumelhart, 1991) with very little human involvement. Using the aspect-enriched dataset, we then fine-tune a pretrained model (Raffel et al., 2020) on summary generation. By modifying the controllers, we can flexibly generate general and aspect-specific summaries, discussing one or more aspects. Table 1 shows summaries generated by our model.
We perform experiments on SPACE (Angelidis et al., 2021), a single-domain dataset consisting of hotel reviews, and OPOSUM (Angelidis and Lapata, 2018b), a dataset with product reviews from multiple domains (e.g., "laptop bags", "boots"). Automatic and human evaluation show that our model outperforms previous approaches on both tasks of general and aspect-specific summarization. We also demonstrate that it can effectively generate multi-aspect summaries based on user preferences. We make our code and data publicly available at https://github.com/rktamplayo/AceSum.

Related Work
Earlier work on opinion summarization has focused on general summarization using extractive (Hu and Liu, 2006; Kim et al., 2011; Angelidis and Lapata, 2018b) or abstractive methods (Ganesan et al., 2010; Carenini et al., 2013; Fabbrizio et al., 2014). Due to the absence of opinion summaries on review websites and the difficulty of annotating them on a large scale, more recent methods consider an unsupervised learning setting where only reviews are available without corresponding summaries (Chu and Liu, 2019; Bražinskas et al., 2020). They make use of autoencoders (Kingma and Welling, 2014) and variants thereof to learn a review decoder through reconstruction, and use it to generate summaries conditioned on averaged representations of the inputs.
A more successful approach to opinion summarization is through the creation of synthetic datasets, where (review, summary) pairs are constructed from a review corpus to enable supervised training. These methods usually start by randomly selecting a review which they treat as a pseudo-summary and subsequently pair it with a set of input reviews based on different strategies. These include random sampling (Bražinskas et al., 2020), generating noisy versions of the pseudo-summary (Amplayo and Lapata, 2020), ranking reviews based on similarity and relevance (Elsahar et al., 2021), and making use of content plans to create more naturalistic pairs (Amplayo et al., 2021).
Our work is closest to Angelidis et al. (2021) who propose an extractive summarization model that uses a vector-quantized variational autoencoder (van den Oord et al., 2017) to learn aspect-specific review representations. Their model effectively groups opinion sentences into clusters and extracts those capturing aspect-relevant information. We employ multi-instance learning to identify aspect-bearing elements in reviews with varying degrees of granularity (e.g., words, sentences, documents), which we argue affords greater flexibility and better control of the output summaries. In doing so, we also introduce an effective method to create synthetic datasets for aspect-guided opinion summarization. Our work also relates to approaches which attempt to control summarization output based on length (Kikuchi et al., 2016), content (Fan et al., 2018), style (Cao and Wang, 2021), or textual queries (Dang, 2006). Although we focus solely on aspect, our method is general and could be used to adjust additional properties of a summary such as sentiment (e.g., positive vs. negative) or style (e.g., formal vs. colloquial).

Problem Formulation
Let C denote a corpus of reviews about entities (e.g., products, hotels). Let R_e = {r_1, r_2, ..., r_N} denote a set of reviews for entity e and A_e = {a_1, a_2, ..., a_M} a set of aspects that are relevant for the entity (e.g., cleanliness, location). Each review r_i is a sequence of tokens {w_1, w_2, ...}, while each aspect a_j is represented by a small set of seed words {v_1, v_2, ...} (e.g., spotless, dirty, stain).

Figure 1: Overview of the controller induction model. Token-level aspect predictions are aggregated into sentence-level predictions using a multiple instance pooling mechanism (described on the right). The process is repeated from sentence- to document-level predictions.
These seed words can be acquired automatically (Angelidis and Lapata, 2018b) or provided by users (see Appendix for those used in our experiments). Our approach creates two types of summaries: (a) a general summary that contains salient opinions about all aspects of an entity, and (b) an aspect-specific summary that focuses on opinions about particular aspects of interest specified by a query Q = {q_1, q_2, ..., q_M}; here, q_j is an indicator which designates whether aspect a_j should be mentioned in the summary. We emphasize that the query can represent more than one aspect, to reflect real-world usage. To facilitate supervised training, we create a synthetic training dataset D = (X, z, y), which is a set of triples composed of input reviews X, a pseudo-summary y, and aspect controllers z (Section 3.2). Our aspect controllers are induced with a unified model based on multi-instance learning (Section 3.1) and correspond to different levels of granularity: (1) document-level aspect codes, (2) aspect-related review sentences, and (3) aspect keywords.
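Concretely, an aspect query can be represented as an indicator vector over the aspect set. The sketch below is illustrative (we use the SPACE aspect set from our experiments; the function name is ours):

```python
# SPACE aspect inventory (from the experimental setup).
ASPECTS = ["building", "cleanliness", "food", "location", "rooms", "service"]

def make_query(wanted):
    """Indicator vector q over aspects: q[j] = 1 if aspect j should be
    covered by the summary, else 0. All ones requests a general summary."""
    return [1 if a in wanted else 0 for a in ASPECTS]

assert make_query({"location", "rooms"}) == [0, 0, 0, 1, 1, 0]
assert make_query(set(ASPECTS)) == [1] * 6  # general summary
```

A multi-aspect query is thus simply a vector with several ones, which is what allows the model to be steered toward one or more aspects at once.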
At training time, we fine-tune a pretrained sequence-to-sequence Transformer model (Raffel et al., 2020) using controllers z as input and a pseudo-summary as output. During inference, we modulate summary generation by modifying the controllers, e.g., we produce a general summary using all aspect codes, or an aspect-specific one based on a subset thereof (Section 3.3).

Controller Induction Model
A key feature of our approach is the set of aspect controllers which render our summarization model controllable. We induce these controllers using a multiple instance learning (MIL) model, illustrated in Figure 1. MIL is a machine learning framework where labels are associated with groups of instances (i.e., bags), while instance labels are unobserved (Keeler and Rumelhart, 1991). The goal is then to infer labels for bags (Dietterich et al., 1997; Maron and Ratan, 1998) or jointly for instances and bags (Zhou et al., 2009; Wei et al., 2014; Kotzias et al., 2015; Xu and Lapata, 2019; Angelidis and Lapata, 2018a). Our MIL model is an example of the latter variant.
In our setting, documents are bags of sentences and sentences are bags of tokens. We further assume that only documents have aspect labels. Given review r with tokens {w_k}, we obtain token encodings e = {e_k} from a pretrained language model (PLM) which uses the popular Transformer architecture (Vaswani et al., 2017). We use a non-linear transformation to obtain token-level aspect predictions z_T:

z_T = tanh(e W + b)

where z_T ∈ R^{N×M}, and N and M are the number of tokens and aspects, respectively. A positive value denotes that the token is related to the aspect of interest (and a negative value that it is unrelated).
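This prediction layer can be sketched as follows (a minimal NumPy sketch; the paper only specifies a non-linear transformation, so the linear-plus-tanh parameterization and weight shapes are our assumptions):

```python
import numpy as np

def token_aspect_predictions(E, W, b):
    """Map token encodings E (N x d) to token-level aspect scores z_T (N x M)
    with a non-linear transformation. Positive scores indicate the token is
    related to that aspect."""
    return np.tanh(E @ W + b)

rng = np.random.default_rng(0)
N, d, M = 5, 8, 3                 # tokens, hidden size, aspects
E = rng.normal(size=(N, d))       # stand-in for PLM token encodings
W = rng.normal(size=(d, M))
b = np.zeros(M)
z_T = token_aspect_predictions(E, W, b)
assert z_T.shape == (N, M)
assert np.all(np.abs(z_T) < 1)    # tanh keeps scores in (-1, 1)
```

The tanh range matches the ±1 silver labels used later with the soft margin loss.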

Multiple Instance Pooling
To obtain sentence-level aspect predictions z_S, we aggregate token-level predictions z_T using a new pooling method particularly effective for our multi-instance learning setting. We first obtain multiple predictions z_h for each attention head h:

z_h = Σ_k softmax_k(qry_h · key_{h,k}) * z_T[k]

where * is element-wise multiplication, · is dot product, k is the token index, qry_h is a head-specific query vector, and key_{h,k} is a head-specific key computed from the token encoding:

key_{h,k} = tanh(W_h e_k)

We hypothesize that different attention heads represent different aspects of the semantic space, and are thus helpful at predicting multiple aspects. We obtain a sentence-level prediction by max pooling the predictions of individual heads:

z_S = maxpool(z_1, ..., z_H)

We use max pooling since we want to isolate the most pertinent aspects for a given sentence; standard pooling methods such as mean and attention pooling (Angelidis and Lapata, 2018a; Xu and Lapata, 2019) assume that all instances of a bag contribute to its label. Figure 1 (right) illustrates our pooling mechanism, which we empirically show (see Section 5.1) is superior to alternatives.
We have so far discussed how multiple instance pooling is applied at the token level to obtain sentence-level predictions z_S. Analogously, multiple instance pooling is applied to sentences to obtain document-level predictions z_D (see Figure 1).
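The pooling step can be sketched as follows (NumPy; since the extracted text dropped the equations, the exact head parameterization is our reconstruction of the description above, not a verbatim reimplementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multiple_instance_pool(z_inst, keys, queries):
    """Aggregate instance-level aspect scores z_inst (N x M) into a bag-level
    score (M,). Each head h attends over instances with its own query vector,
    then the per-head predictions are max-pooled so that only the most
    pertinent aspect evidence survives."""
    H = queries.shape[0]
    per_head = []
    for h in range(H):
        alpha = softmax(keys[h] @ queries[h])                    # (N,) weights
        per_head.append((alpha[:, None] * z_inst).sum(axis=0))   # (M,)
    return np.max(np.stack(per_head), axis=0)                    # max over heads

rng = np.random.default_rng(1)
N, M, H, d = 6, 3, 4, 8
z_inst = np.tanh(rng.normal(size=(N, M)))   # instance (e.g., token) scores
keys = rng.normal(size=(H, N, d))           # head-specific instance keys
queries = rng.normal(size=(H, d))           # head-specific query vectors
z_bag = multiple_instance_pool(z_inst, keys, queries)
assert z_bag.shape == (M,)
```

The same function serves both levels: instances are tokens when producing sentence predictions, and sentences when producing document predictions.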

Training and Inference
Training the multiple instance model just described requires a dataset consisting of (review, aspect label) pairs. Unfortunately, we do not have access to annotations denoting which aspects are discussed in each review. Recall, however, that aspects are represented by seed words {v_1, v_2, ...}, which we exploit to induce silver-standard labels. Specifically, for each review in the dataset, we obtain binary labels ẑ_D where ẑ_D[a] = 1 if at least one seed word for aspect a is found in the review (and −1 otherwise). We train the model using a soft margin loss, summing over all aspects a ∈ A:

L = Σ_{a∈A} log(1 + exp(−ẑ_D[a] · z_D[a]))

The parameters of the pretrained language model are frozen, i.e., they are not fine-tuned during training, which makes our controller induction model lightweight and efficient.

Summary y: At first they took us to an unready room which was disappointing but after a short wait they took us to a really big room with a great harbor scene as an apology to the mess. The rooms are pretty new or renovated recently. Bathroom is clean and wide. The beds are comfortable and big.

Review x1
Check in was quick and our bags were brought to the room in a timely manner. The rooms and hallways left a little more to be desired. The rooms didnt look nearly as good as they did in other less known cities. No safe or frig in the rooms. The staff was great.

Review x2
Only option for a hot meal for breakfast was scrambled eggs and bacon; The toaster was broken as well, with burned out elements. Other food in the lounge was good (fruit, coffee). Recommendation: eat elsewhere; even room service would probably have been better.

Figure 2: Pseudo-summary y and input reviews X; the aspect code for summary y is room. Review sentences with the same aspect are underlined and same-aspect keywords are magnified.
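The silver-label induction and training objective described above can be sketched as follows (a minimal NumPy sketch; the loss form log(1 + exp(−ẑ·z)) is the standard soft margin loss and our reading of the omitted equation, and the seed words shown are illustrative):

```python
import numpy as np

def silver_labels(review, seed_words):
    """Binary aspect labels: +1 if any seed word for the aspect occurs
    in the review, -1 otherwise."""
    tokens = set(review.lower().split())
    return np.array([1.0 if tokens & set(seeds) else -1.0
                     for seeds in seed_words.values()])

def soft_margin_loss(z_D, z_hat):
    """Soft margin loss summed over aspects; z_D are document-level
    predictions, z_hat the silver labels in {-1, +1}."""
    return np.sum(np.log1p(np.exp(-z_hat * z_D)))

seed_words = {
    "cleanliness": ["spotless", "dirty", "stain"],
    "location":    ["location", "walk", "near"],
}
review = "The room was spotless and a short walk to the beach."
z_hat = silver_labels(review, seed_words)
assert (z_hat == np.array([1.0, 1.0])).all()
loss = soft_margin_loss(np.array([0.9, -0.2]), z_hat)
assert loss > 0  # penalizes the wrongly negative 'location' prediction
```

Because only this small prediction head is trained (the PLM stays frozen), the controller induction model remains lightweight.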

Synthetic Dataset Creation
The MIL model allows us to learn three kinds of aspect controllers, which are subsequently used to create a synthetic dataset for training our summarizer: aspect codes, essentially document-level aspect predictions z_D, which control the overall aspect of the summary; aspect keywords, which ensure content support by explicitly highlighting which tokens from the input should appear in the summary; and aspect-relevant sentences, which provide textual context for summary generation (while non-aspect-related sentences are ignored).
We first sample review r_i as a pseudo-summary from review set R_e of entity e. We treat r_i as a pseudo-summary provided it resembles a real summary. We assume that opinion summaries discuss specific aspects of entity e. We use our controller induction model to verify this, i.e., document-level aspect predictions z_D for r_i should be positive for at least one aspect. Provided r_i fulfills this constraint, we use it as summary y and R_e − {r_i} as review set X. A simplified example is shown in Figure 2: the pseudo-summary is highlighted in gray and the input reviews in cyan. The summary focuses on the room aspect of a hotel, and this is its aspect code (shown in blue).
Let (X, y) denote review set X for summary y (we only show two reviews in Figure 2, but there are usually hundreds). We obtain (positive) document-level aspect predictions z_D^(y) for summary y and sentence-level aspect predictions z_S^(x) for all reviews x ∈ X. We then rank review sentences in X based on their similarity to the summary's overall aspect. Specifically, we compare predictions z_S^(x) with z_D^(y) using the soft margin loss function introduced above. We also compare token-level predictions z_T^(x) with z_D^(y) using the same function to induce aspect keywords. In Figure 2, sentences which discuss the same aspect as the summary are underlined, and same-aspect keywords are magnified. For illustration purposes we only show one aspect code in Figure 2, but there can be several, in which case different review sentences and keywords would be selected for different aspects.
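The pair-construction procedure can be sketched as follows (a pure-Python sketch under our reading of the description; function names and the toy scores are ours, and agreement is scored with the negated soft margin loss so that higher means closer to the summary's aspects):

```python
import math

def margin_agreement(pred, target):
    """Negated soft margin loss between a prediction vector and the summary's
    sign-valued document-level aspects; higher = better agreement."""
    return -sum(math.log1p(math.exp(-t * p)) for p, t in zip(pred, target))

def build_pair(reviews, doc_preds, sent_preds):
    """Pick a review with at least one positive aspect as the pseudo-summary,
    then rank the remaining reviews' sentences by agreement with its
    document-level aspect predictions."""
    for i, z_D in enumerate(doc_preds):
        if max(z_D) > 0:  # pseudo-summary must discuss >= 1 aspect
            target = [1.0 if z > 0 else -1.0 for z in z_D]
            pool = [(s, z_S) for j, sents in enumerate(sent_preds)
                    if j != i for (s, z_S) in sents]
            ranked = sorted(pool,
                            key=lambda p: margin_agreement(p[1], target),
                            reverse=True)
            return reviews[i], [s for s, _ in ranked]
    return None

reviews = ["Rooms were big and clean.", "Staff was rude.", "Great pool."]
doc_preds = [[0.8, -0.5], [-0.9, 0.7], [-0.2, -0.3]]   # aspects: rooms, staff
sent_preds = [
    [("Rooms were big and clean.", [0.9, -0.6])],
    [("Staff was rude.", [-0.8, 0.8])],
    [("Great pool.", [-0.1, -0.2])],
]
summary, ranked_sents = build_pair(reviews, doc_preds, sent_preds)
assert summary == "Rooms were big and clean."
assert ranked_sents[0] == "Great pool."  # closer to 'rooms' than the staff sentence
```

Keywords are induced the same way, only scoring token-level rather than sentence-level predictions against the summary's aspects.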

Opinion Summarization Model
We use a pretrained sequence-to-sequence Transformer model (Raffel et al., 2020) to generate opinion summaries. We transform the aspect controllers z into the following format:

z = [ASP] aspect codes [KEY] keywords [SNT] sentences

where [ASP], [KEY], and [SNT] are indicators denoting that the next tokens correspond to aspect codes, keywords, and review sentences, respectively.
Instead of the full set of input reviews X, the encoder takes z as input and produces multi-layer encodings Z. The decoder then outputs a token distribution p(y_t) for each time step t, conditioned on both Z and y_{1:t−1} through attention:

p(y_t) = Decoder(Z, y_{1:t−1})

We fine-tune the model using a maximum likelihood loss that optimizes the probability distribution p(y) given gold summary ŷ:

L = − Σ_t log p(ŷ_t | Z, ŷ_{1:t−1})

During inference, we can generate different kinds of opinion summaries by modifying the aspect controllers. When creating a general summary, we use all aspect codes as input. Analogously, when generating a single-aspect summary, we use one aspect code. The aspect codes guide the selection of keywords and sentences from the input reviews (see Figure 2), which are given as input to our Transformer model to generate the summary.
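The controller-to-input serialization can be sketched as follows (illustrative; the indicator token names follow our reading of the format description above, and the example controllers are made up):

```python
def serialize_controllers(aspect_codes, keywords, sentences):
    """Flatten aspect controllers into a single input string with
    [ASP]/[KEY]/[SNT] indicator tokens, as consumed by the
    sequence-to-sequence model."""
    return " ".join(["[ASP]", *aspect_codes,
                     "[KEY]", *keywords,
                     "[SNT]", *sentences])

# General summary: pass all aspect codes.
general = serialize_controllers(
    ["cleanliness", "location", "rooms"],
    ["spotless", "walk", "spacious"],
    ["The room was spotless.", "A short walk to the beach."])

# Aspect-specific summary: pass a single code (keywords and sentences
# would also be re-selected for that aspect).
rooms_only = serialize_controllers(
    ["rooms"], ["spacious"], ["The room was spotless."])

assert general.startswith("[ASP] cleanliness location rooms [KEY]")
assert rooms_only == "[ASP] rooms [KEY] spacious [SNT] The room was spotless."
```

Controllability at inference time then amounts to changing this input string rather than retraining the model.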

Experimental Setup
Datasets We performed experiments on two opinion summarization datasets covering different review domains. SPACE (Angelidis et al., 2021) is a large corpus of "hotel" reviews from TripAdvisor; it contains human-written abstractive opinion summaries for evaluation only. Each instance in the evaluation set consists of 100 reviews and seven summaries: one general summary and six aspect-specific ones representing the aspects building, cleanliness, food, location, rooms, and service. OPOSUM (Angelidis and Lapata, 2018b) is a large corpus of product reviews from six different domains: "laptop bags", "bluetooth headsets", "boots", "keyboards", "televisions", and "vacuums". It also includes an evaluation set with extractive general summaries. We extended this dataset by (a) adding human-written abstractive aspect-specific summaries following the methodology of Angelidis et al. (2021), and (b) increasing the size of the corpus. We call this extended dataset OPOSUM+. Both datasets include five human-annotated seed words for each aspect (see Appendix for details). Data statistics are shown in Table 2. Using our synthetic dataset creation method, we generated 512K and 341K training instances for SPACE and OPOSUM+, respectively.
Implementation For our pretrained Transformer models, we used weights and settings available in the HuggingFace library (Wolf et al., 2020). Specifically, we used distilroberta-base (Sanh et al., 2019) as our language model and t5-small (Raffel et al., 2020) as our sequence-to-sequence model. We trained the controller induction model with a learning rate of 1e-4 for 100K steps, using h = 12 heads. For OPOSUM+, we trained separate controller induction models for different domains. For the aspect controllers, we selected the 10 best keywords, and review sentences were truncated up to 500 tokens to fit in the pretrained model. For summarization, we used a learning rate of 1e-6 and 500K training steps. We used Adam with weight decay (Loshchilov and Hutter, 2019) to optimize both models. We added a linear learning rate warm-up for the first 10K steps. We generate summaries with beam search of size 2 and refrain from repeating n-grams of size 3 (Paulus et al., 2018).

Results
We compared our Aspect Controlled Summarization (ACESUM) model with several extractive and abstractive approaches. Traditional extractive systems include selecting as a summary the review closest to the CENTROID (Radev et al., 2004) of the input reviews, and LEXRANK (Erkan and Radev, 2004), a PageRank-like algorithm that selects the most salient sentences from the input. For both methods we used BERT encodings (Devlin et al., 2019) to represent sentences and documents. Other extractive systems include QT (Angelidis et al., 2021), a neural clustering method that uses Vector-Quantized Variational Autoencoders (van den Oord et al., 2017) to represent opinions in quantized space, and ACESUMEXT, an extractive version of our model that feeds sentences ranked by our controller induction model (truncated up to 500 tokens) to LexRank.
Abstractive systems include MEANSUM (Chu and Liu, 2019), an autoencoder that generates summaries by reconstructing the mean of review encodings, COPYCAT (Bražinskas et al., 2020), a hierarchical variational autoencoder which learns a latent code of the summary, and two variants of T5 (Raffel et al., 2020) trained with different synthetic dataset creation methods. For T5-RANDOM, summaries are randomly sampled (Bražinskas et al., 2020), whereas for T5-SIMILAR reviews are sampled based on their similarity to a candidate summary (Amplayo and Lapata, 2020).
Finally, we compared against two upper bounds: an extractive ORACLE which selects as a summary the review with the best ROUGE score against the input, and a HUMAN upper bound, calculated as inter-annotator ROUGE. Examples generated by our model are shown in Table 1.

Automatic Evaluation
We evaluated the quality of general and aspect-specific opinion summaries using F1 ROUGE (Lin and Hovy, 2003). Unigram and bigram overlap (ROUGE-1/2) are proxies for assessing informativeness, while the longest common subsequence (ROUGE-L) measures fluency. Table 3 reports results on general opinion summarization. As can be seen, ACESUM outperforms all competing models on SPACE and performs best among abstractive systems on OPOSUM+. Our extractive model, ACESUMEXT, is overall best on OPOSUM+. This is expected since general OPOSUM+ summaries are extractive. Amongst abstractive models, Transformer-based models outperform MEANSUM and COPYCAT, demonstrating that pretraining is helpful for opinion summarization.

Aspect-Specific Opinion Summarization
Most comparison systems (all except QT) cannot naturally generate aspect-specific summaries. We use a simple sentence-filtering method to remove non-aspect-related sentences from the input during inference. Specifically, we use BERT encodings (Devlin et al., 2019) to represent tokens in review sentences {r_i^(bert)} and aspect seeds {a_j^(bert)}. We then rank the review sentences based on the maximum similarity between seed and sentence tokens, calculated as max_{i,j} sim(r_i^(bert), a_j^(bert)), where sim(a, b) is the cosine similarity function. This method cannot be ported to the CENTROID and ORACLE baselines, and thus we do not compare with them.
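The filtering score can be sketched as follows (NumPy; random vectors stand in for BERT token encodings, and the function names are ours):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def aspect_relevance(sent_tokens, seed_tokens):
    """Score a sentence by the maximum cosine similarity between any of its
    token embeddings and any aspect-seed token embedding."""
    return max(cos(r, a) for r in sent_tokens for a in seed_tokens)

rng = np.random.default_rng(2)
seed = rng.normal(size=(2, 4))                       # seed-word embeddings
on_topic = np.vstack([seed[0] + 0.01 * rng.normal(size=4),
                      rng.normal(size=4)])           # one token near a seed
off_topic = rng.normal(size=(2, 4))
assert aspect_relevance(on_topic, seed) > aspect_relevance(off_topic, seed)
```

Sentences are then ranked by this score and the low-scoring ones dropped before the baseline summarizer is run.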
Our results are summarized in Table 4. Note that SPACE and OPOSUM+ focus exclusively on single-aspect summaries. We assess our model's ability to generate summaries covering multiple aspects in the following section. Overall, ACESUM performs best across datasets and metrics, which shows that our controllers can effectively customize summaries based on aspect queries. Interestingly, amongst extractive models, ACESUMEXT performs best. This suggests that a simple centrality-based extractive approach such as LexRank (Erkan and Radev, 2004) can produce good enough summaries as long as an effective sentence-filtering method is applied beforehand (in our case, one based on the controller induction model). T5 models perform substantially worse on this task, indicating that synthetic datasets based on either random or similarity-based sampling techniques are not suited to aspect-specific opinion summarization.

Ablation Studies
We present various ablation studies on the controller induction model and the summarization model itself. In Table 5, we compare our multiple instance pooling (MIP) mechanism with three standard pooling methods: mean, max, and attention-based pooling. We evaluate models using document and sentence F1, which measure the quality of document- and sentence-level aspect predictions. We extrapolate aspect labels for documents and sentences from the development set, which contains aspect-specific summaries. We assume the aspect for which a summary is written is the document label and that all sentences within the summary are also representative of the same aspect. Results show that attention and mean pooling are not suitable for multi-instance learning, underperforming especially on document-level F1. This suggests that token-level predictions are not used effectively to predict higher-level aspects. Our results confirm that using multiple experts (i.e., attention heads) yields better aspect predictions.

Table 6: Variants of ACESUM with different aspect controllers. Results are shown using ROUGE-L for general and aspect-specific opinion summaries.
In Table 6, we evaluate the contribution of different aspect controllers to summarization output. Selecting sentences randomly rather than based on aspect hurts performance, in particular when generating aspect-specific summaries. We also find that aspect codes substantially increase model performance in OPOSUM+. We conjecture that this is due to OPOSUM+ having multiple domains and, consequently, more aspects compared to SPACE.

Human Evaluation
We conducted several human elicitation studies to further analyze the summaries produced by competing systems using the Amazon Mechanical Turk crowdsourcing platform.
Best-Worst Scaling The first study assessed the quality of general opinion summaries using Best-Worst Scaling (BWS; Louviere et al., 2015). Participants were shown a human-written summary, in relation to which they were asked to select the best and worst among system summaries, taking into account the following criteria: Informativeness (how consistent are the opinions with the reference?), Coherence (is the summary easy to read and well-organized?), Conciseness (does the summary provide useful information in a concise manner?), and Fluency (is the summary grammatical?).
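For reference, a system's BWS score is conventionally computed as the percentage of times it was chosen best minus the percentage of times it was chosen worst, which yields the [−100, 100] range reported below (a small sketch under that standard definition):

```python
def bws_score(best_count, worst_count, n_judgments):
    """Best-Worst Scaling score in [-100, 100]: % chosen best minus
    % chosen worst over all judgments in which the system appeared."""
    return 100.0 * (best_count - worst_count) / n_judgments

assert bws_score(30, 30, 60) == 0.0     # as often best as worst
assert bws_score(60, 0, 60) == 100.0    # unanimously best
assert bws_score(0, 60, 60) == -100.0   # unanimously worst
```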
We compared general summaries produced by the two best performing extractive (LEXRANK, QT) and abstractive (T5-SIMILAR, ACESUM) systems according to ROUGE. We elicited three judgements for all entities in the SPACE and OPOSUM+ test sets. Table 7 summarizes our results. BWS values range from −100 (unanimously worst) to 100 (unanimously best). ACESUM is deemed best on all criteria for both datasets. Crowdworkers also rated QT high on informativeness, which indicates that aspect modeling is helpful, but low on other criteria (e.g., coherence and conciseness) due to its extractive nature.

Table 7: Best-Worst Scaling evaluation. Best values are bold-faced. An asterisk (*) means that the system is significantly better than the second best system (one-way ANOVA with post-hoc Tukey HSD tests, p < 0.05). Inf: informative, Coh: coherent, Con: concise, Flu: fluent.

Aspect Controllability
We also conducted a user study to assess the quality of aspect-specific summaries. We showed participants the aspect in question as well as aspect summaries from T5-SIMILAR, QT, ACESUM, and HUMAN. Crowdworkers were asked to decide whether the summaries discussed the given aspect exclusively, partially, or not at all. We elicited three judgments for all test entities. As can be seen in Table 8, SPACE summaries produced by ACESUM exclusively discuss a single aspect 50.9% of the time. T5-SIMILAR mostly produces general summaries (74.8% of them partially discuss the given aspect), which is not surprising given that it has no special-purpose mechanism for modeling aspect. QT summaries are more topical for the opposite reason. In general, automatic systems perform worse on OPOSUM+, whose larger number of domains renders this dataset more challenging. Finally, we observe a big gap between model and HUMAN performance.

We further verified whether ACESUM can produce summaries covering two aspects. Although it can generate summaries with more aspects (see Table 1), we hypothesize that user queries pertaining to two aspects would be most frequent. Besides, if performance with two aspects is inferior, there is little chance it will improve with more aspects. For each test example we elicited three judgments and randomly selected two aspect pairs from the set of all possible aspect combinations. We compared ACESUM against QT (for which we used seed words representing both target aspects). Participants were shown the two aspects and the summaries generated by QT and ACESUM. They were asked to decide whether the summaries discussed (a) both target aspects exclusively, (b) one of the aspects, (c) other aspects in addition to the target ones, or (d) none of the two aspects. The results in Table 9 show that ACESUM is able to produce two-aspect summaries effectively 61.3% of the time on SPACE and 47.0% of the time on OPOSUM+. QT, on the other hand, mostly creates single-aspect summaries.

Table 9: Proportion of target aspects discussed in system summaries (All: both aspects are mentioned; One: only one is mentioned; Other: other aspects are also mentioned; None: no aspects are mentioned).
Summary Veridicality Our third study examined the veridicality of the generated summaries, i.e., whether the opinions mentioned in them are indeed discussed in the input reviews. Participants were shown reviews and corresponding system summaries and were asked to verify, for each sentence of the summary, whether it was fully supported by the reviews, partially supported, or not supported at all. We performed this experiment on OPOSUM+ only, since the number of reviews is small and participants could read them all in a timely fashion. We collected three judgments for all system summaries, both general and aspect-specific ones. Participants assessed the summaries produced by T5-SIMILAR and ACESUM. We also included GOLD-standard summaries as an upper bound, but no output from an extractive system, as it by default produces veridical summaries which contain facts mentioned in the reviews. Table 10 reports the percentage of fully supported (FullSupp), partially supported (PartSupp), and unsupported (NoSupp) sentences. Unsurprisingly, GOLD summaries display the highest percentage of fully supported sentences for both general and aspect-specific summaries. ACESUM and T5-SIMILAR present similar proportions of supported sentences when it comes to general summaries, with ACESUM having a slight advantage. The proportion of supported sentences is higher in aspect summaries for T5-SIMILAR. Note that this model struggles to actually generate aspect-specific summaries (see Table 8); instead, it generates any-aspect summaries which may be veridical but off-topic.

Conclusions
In this work, we presented an abstractive approach to aspect-controlled opinion summarization. Key to our model is the induction of aspect controllers which facilitate the creation of a synthetic training dataset and guide summary generation towards the designated aspects. Extensive experiments on two benchmarks show that our model achieves state of the art across the board, for both general and aspect-specific opinion summarization.
In the future, we would like to focus on controlling additional facets of opinion summaries such as sentiment or length. It would also be interesting to learn aspects from data rather than specifying them a priori, as well as to deal with unseen aspects (e.g., in a scenario where reviews discuss new features of a product).

A Appendix
A.1 List of Seed Words Tables 11 and 12 show the seed words we used in our experiments. These were generated semi-automatically: we first obtained aspect-specific words through the automatic method introduced in Angelidis and Lapata (2018b) and Angelidis et al. (2021) and then asked human annotators to filter out the noise (i.e., words that were assigned incorrect aspects).

A.3 Extensions to OPOSUM Dataset
In this section, we present our additions to the OPOSUM dataset (Angelidis and Lapata, 2018b). Firstly, we increased the size of the review corpus. The original dataset includes only 359K reviews, which is the result of down-sampling the Amazon Product Dataset introduced in McAuley et al. (2015). We instead gathered all reviews tagged with at least one of the OPOSUM domains ("Laptop Bags", "Bluetooth Headsets", "Boots", "Keyboards", "Televisions", and "Vacuums") from the newest version of the Amazon Product Dataset compiled by Ni et al. (2019). Since "Laptop Bags" and "Bluetooth Headsets" were significantly smaller than the other four domains, we additionally included all reviews tagged with "Bags" and "Headsets". We were able to increase the dataset to 4.13M reviews, i.e., by a factor of 12.
Secondly, we created a large collection of human-written abstractive summaries for aspect-specific summarization evaluation. For each test product (e.g., television set) and for each aspect (e.g., image quality), we asked three annotators to write an opinion summary about the given aspect. The annotators were shown 10 input reviews, in which opinions about the target aspect were highlighted to aid them in their task. We only used the three most common aspects for each domain, since opinions about less common aspects do not appear frequently in reviews. We gathered 540 aspect-specific summaries in total.

A.4 Example Summaries
Finally, we show general and aspect-specific summaries produced by QT, T5-SIMILAR, ACESUM, and HUMAN on SPACE (Table 14) and OPOSUM+ (Table 15). We also show two-aspect summaries produced by QT and ACESUM in Table 16.

T5-SIMILAR summaries
General: I stayed at the hotel for 3 nights. The room was very clean, the staff was friendly and the breakfast was excellent! The location was great - a short walk to the Eiffel Tower and St Marks Square.
Building: I stayed here for 3 nights. The staff was very friendly and helpful, the rooms were clean and the location was great! The breakfast was good and there was a lot to do in the city.
Cleanliness: I stayed at the hotel for 3 nights. The staff was very friendly and helpful, the rooms were clean and the breakfast was excellent! The location is great - just a few minutes walk from the Eiffel Tower and it's close to the metro and Bourbon Street.
Food: I stayed at the San Diego for a week. The staff was very friendly and helpful, the rooms were clean and the location was great! The breakfast was good and there wasn't much to do in the room.
Location: I stayed at the hotel for 3 nights. The room was very clean, the staff were very helpful and the breakfast was excellent! The location is great - a short walk to the Eiffel Tower and St Marks Square.
Rooms: I stayed at the hotel for 3 nights. The room was very clean, the staff were very helpful and the location was great! The rooms were clean and well appointed - the breakfast was good and there was a great selection of food and drink options in the morning.
Service: I stayed at the hotel for 3 nights. The room was very clean, the staff was friendly and the breakfast was good! The location was great - a short walk to the Eiffel Tower and St Marks Square.

QT summaries
General: Great location. The breakfast was very good. We would definitely stay here again. Room was clean. This hotel is great. The room was large with two queen beds. Nice hotel in a nice location. This is a multi-year award winning hotel. Staff were very helpful. The hotel is very clean. Front desk was friendly and helpful. The room was clean and comfy. The breakfast was average. It is very good. We enjoyed our stay here.
Building: Plus all these fancy hotels have the irritating routine of charging around $16 for internet access. The bad: the hotel is quite old and needs renovating.
Cleanliness: Pick this one. Toom was clean. The hotel is very clean. Great 5 star service. Room was nice and clean. This one was by far the best.
Food: The breakfast was very good. When you factor in the delicious complimentary breakfast consisting of scrambled eggs, grits, freshly-made waffles, bagels, bacon, sausage, cereal, toast, juice, and coffee.
Location: But it is just far enough away from the craziness of Bourbon and Canal streets. Walk. The location was also nice. The location of the hotel was excellent.
Rooms: The room was great. The bed was comfortable. The room was large with two queen beds. Hotel room was clean and comfortable.
Service: The staff was very nice. Every member of the staff we encountered was gracious, friendly and helpful. The staff at the hotel are super nice and attentive.

Table 14: General and aspect-specific summaries for a hotel generated by four systems (SPACE dataset).

HUMAN summaries
General: The price is great. Lightweight and comfortable fit in the ear. Based on feedback from others I can be heard clearly. Very easy to use and compatible with all of my phones! It holds a charge great, is light enough. The sound quality is great, but cheap.
Comfort: This unit is very lightweight and comfortable to fit in the ear. Makes it really easy to keep in there. Fits perfectly.
Ease of Use: The headphones are very easy to use and fit almost any cell phone. The only problem is that sometimes it is difficult to know if the equipment is on or off.
Sound Quality: The sound from the headphones is very good, the audio quality is excellent. Despite this they could have a louder sound for their maximum volume.
ACESUM summaries
General: I've had this for a few months now and it's still working great. I haven't had any issues with the sound quality, and I am very happy with the quality of the product. The earpiece is very comfortable and is very lightweight.
Comfort: It's a comfortable fit. The earpiece isn't too big, but it fits your ear well. It is a little snug, but my ear is very comfortable. It is comfortable to wear for long periods of time.
Ease of Use: It is easy to use, easy to set up, and easy to connect to my phone. It's not that easy to get on and off.
Sound Quality: The sound quality is good, but I'm not a fan. I haven't had any issues with the sound of the headphones.

T5-SIMILAR summaries
General: I bought this headset for my son. He loves it, but a lot of people have complained about the sound quality of the headset! It is very comfortable and the mic isn't loud enough to hear. It has sounded great.
Comfort: I bought this headset for my husband. He loves it, and it works great! It has a great sound and the sound quality is excellent - the only thing is that the microphone isn't very loud.
Ease of Use: I bought this headset for my husband. He loves it, and it is very comfortable! If you are looking for a good headset, this is the best headset you can buy for the price.
Sound Quality: I bought these for my husband. He loves them, and they are very comfortable! They don't have a lot of noise. If you are looking for something that will work for you, then they're ok but they will not work with the earbuds.

QT summaries
General: Thank you! The battery life is ... bizarre. Light to the ear. I highly recommend this bluetooth headset. Lightweight and comfortable fit in the ear. I returned it and received a refund. I used it mostly in my car on my commute to work. Great product.
Comfort: I would really like it if it would stay in my ear or if the loop that went around my ear would hold it to my ear. I could not get this headset to work.
Ease of Use: Item delivery just as described! Its made of the cheapest of materials and the bluetooth has a hard time staying connected. My only gripe is that sometimes there's a small lapse between my voice.
Sound Quality: Also they are comfy and stay in my ears. The headset is light and fits comfortably in my ears (though it takes some time to find the right angle and fit it right in).

Table 15: General and aspect-specific summaries for the "Bluetooth Headsets" domain generated by four different systems (OPOSUM+ dataset).

ACESUM summaries
Cleanliness and Location of a hotel: The hotel is clean and the rooms are very clean. The location is great, right on the beach, and close to the Eiffel Tower.
Building and Cleanliness of a hotel: The room was very clean and the bathroom was very clean. The pool was nice, but the pool area was a bit small.
Food and Rooms of a hotel: The breakfast was good, the food was good and the staff was very friendly. The breakfast buffet was good with a variety of choices.
Quality and Size of a laptop bag: It's a good size for a laptop. It is not a heavy bag, it is made of a soft material.
Ease of Use and Suction Power of a vacuum: I've had this vacuum for a few months now and it's very easy to use. I don't like the fact that it is a little heavy, but it does a great job of picking up the hair.
Comfort and Looks of a pair of boots: They are a little tight, and they are not comfortable. They look great with jeans and skirts. If you are looking for a comfortable shoe that will last a long time, do not order this.

QT summaries
Cleanliness and Location of a hotel: Overall we had a nice stay at the hotel. It's well worth the extra money. For the price I paid it underwhelmed ($350 for 1 night). Doesn't get more LA than this have a drink at the roof top.
Building and Cleanliness of a hotel: The service was great! The hotel was beautiful. amazing. Holy cow. I love staying at this hotel. Excellent. Superb service!! I can't say enough about how perfect this hotel was for us. I stayed at this hotel not too long.
Food and Rooms of a hotel: (Note that breakfast isn't necessarily included in the price.) On the first floor there is a small breakfast room but no restaurant. Also a small but cosy terrace with swimming pool. Rooms are a decent size but walls are paper thin.
Quality and Size of a laptop bag: The hand straps have not ripped or torn so really I think the problem was that I put too much weight in the bag. Barely fit a 14 inch HP sleek notebook. I would not recommend this bag.
Ease of Use and Suction Power of a vacuum: I even tried putting ear plugs in to vacuum with it, but it still hurts my ears. I looked at every small but powerful vacuum I could find in stores and on line.
Comfort and Looks of a pair of boots: Once the weather got cold the shoes became more stiff and they really hurt now so it looks like I wasted $40. I am wondering if they are worth returning or just passing off to someone