Informative and Controllable Opinion Summarization

Opinion summarization is the task of automatically generating summaries for a set of opinions about a specific target (e.g., a movie or a product). Since the number of input documents can be prohibitively large, neural network-based methods sacrifice end-to-end elegance and follow a two-stage approach where an extractive model first pre-selects a subset of salient opinions and an abstractive model creates the summary while conditioning on the extracted subset. However, the extractive stage leads to information loss and inflexible generation capability. In this paper we propose a summarization framework that eliminates the need to pre-select salient content. We view opinion summarization as an instance of multi-source transduction, and make use of all input documents by condensing them into multiple dense vectors which serve as input to an abstractive model. Beyond producing more informative summaries, we demonstrate that our approach can take user preferences into account through a simple zero-shot customization technique. Experimental results show that our model improves the state of the art on the Rotten Tomatoes dataset by a wide margin and generates customized summaries effectively.


Introduction
The proliferation of opinions expressed in online reviews, blogs, internet forums, and social media has created a pressing need for automated systems which enable customers, companies, or service providers to make informed decisions without having to absorb large amounts of opinionated text. Opinion summarization is the task of automatically generating summaries for a set of opinions about a specific target (Conrad et al. 2009). Figure 1 shows various reviews about the movie "Coach Carter" and example summaries generated by humans and automatic systems.
The vast majority of previous work (Hu and Liu 2004) views opinion summarization as the final stage of a three-step process involving: (1) aspect extraction (i.e., finding features pertaining to the target of interest, such as battery life or sound quality); (2) sentiment prediction (i.e., determining the sentiment of the extracted aspects); and (3) summary generation (i.e., presenting the identified opinions to the user). Textual summaries are created following mostly extractive methods which select representative segments (usually sentences) from the source text (Popescu and Etzioni 2005; Blair-Goldensohn et al. 2008; Lu, Zhai, and Sundaresan 2009; Lerman, Blair-Goldensohn, and McDonald 2009). Despite being less popular, abstractive approaches seem more appropriate for the task at hand as they attempt to generate summaries which are maximally informative and minimally redundant without simply rearranging passages from the original opinions (Ganesan, Zhai, and Han 2010; Carenini, Cheung, and Pauls 2013; Gerani et al. 2014; Di Fabbrizio, Stent, and Gaizauskas 2014).
Figure 1: Three out of 150 reviews for the movie "Coach Carter", and summaries written by the editor, and generated by a model following the EXTRACT-ABSTRACT approach and the proposed CONDENSE-ABSTRACT framework. The latter produces more informative and factual summaries whilst allowing control over aspects of the generated summary (such as the acting or plot of the movie).
General-purpose summarization approaches have recently shown promising results with end-to-end models which are data-driven and take advantage of the success of sequence-to-sequence neural network architectures. Most approaches (Rush, Chopra, and Weston 2015; Chen et al. 2016; Kryściński et al. 2018; Fabbri et al. 2019) encode documents and then decode the learned representations into an abstractive summary, often by attending to the source input (Bahdanau, Cho, and Bengio 2015) and copying words from it (See, Liu, and Manning 2017). Under this modeling paradigm, it is no longer necessary to identify aspects and their sentiment for the opinion summarization task, as these are learned indirectly from training data (i.e., sets of opinions and their corresponding summaries). These models are usually tested on domains where the input is either one document or a small set of documents.
However, the number of opinions tends to be very large (150 for the example in Figure 1). It is therefore practically infeasible to train a model in an end-to-end fashion, given the memory limitations of modern hardware. As a result, current approaches (Wang and Ling 2016; Liu et al. 2018; Liu and Lapata 2019; Perez-Beltrachini, Liu, and Lapata 2019) sacrifice end-to-end elegance in favor of a two-stage framework which we call EXTRACT-ABSTRACT: an extractive model first selects a subset of opinions and an abstractive model then generates the summary while conditioning on the extracted subset (see Figure 2a). The extractive pass unfortunately has two drawbacks. Firstly, because the abstractive model has access to only a subset of opinions, the summaries can be less informative and inaccurate, as shown in Figure 1. Secondly, user preferences cannot be easily taken into account (e.g., the reader may wish to obtain a summary focusing on the acting or plot of a movie as opposed to a general-purpose summary) since more specialized information might have been removed.
In this paper, we propose CONDENSE-ABSTRACT, an alternative two-stage framework which uses all input documents when generating the summary (see Figure 2b). We view the opinion summarization problem as an instance of multi-source transduction (Libovický and Helcl 2017); we first represent the input documents as multiple encodings, aiming to condense their meaning and distill information relating to sentiment and various aspects of the target being reviewed. These condensed representations are then aggregated using a multi-source fusion module based on which an opinion summary is generated using an abstractive model. We also introduce a zero-shot customization technique allowing users to control important aspects of the generated summary at test time. Our approach enables controllable generation while leveraging the full spectrum of opinions available for a specific target.
We perform experiments on a dataset consisting of movie reviews and opinion summaries elicited from the Rotten Tomatoes website (Wang and Ling 2016; see Figure 1). Our framework outperforms state-of-the-art models by a large margin using automatic metrics and in a judgment elicitation study. We also verify that our zero-shot customization technique can effectively generate need-specific summaries. In the CA framework, users can obtain need-specific summaries at test time (e.g., give me a summary focusing on acting).

Related Work
Most opinion summarization models follow extractive methods (see Kim et al. 2011 and Angelidis and Lapata 2018 for overviews), with the exception of a few systems which are able to generate novel words and phrases not featured in the source text. Ganesan, Zhai, and Han (2010) propose a graph-based framework for generating ultra concise opinion summaries, while Gerani et al. (2014) represent reviews by discourse trees which they aggregate to a global graph from which they generate a summary. Other work (Carenini, Cheung, and Pauls 2013; Mukherjee and Joshi 2013) takes the distribution of opinions and their aspects into account so as to generate more readable summaries. Di Fabbrizio, Stent, and Gaizauskas (2014) present a hybrid system which uses extractive techniques to select salient quotes from the input reviews and embeds them into an abstractive summary to provide evidence for positive or negative opinions. More recent work has seen the effective application of sequence-to-sequence models (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015) to various abstractive summarization tasks including headline generation (Rush, Chopra, and Weston 2015), single- (See, Liu, and Manning 2017; Nallapati et al. 2016), and multi-document summarization (Wang and Ling 2016; Liu et al. 2018; Liu and Lapata 2019). Closest to our approach is the work of Wang and Ling (2016) who generate opinion summaries following a two-stage process which first selects documents bearing pertinent information, and then generates the summary by conditioning on these documents. Specifically, they use a ridge regression model with hand-engineered features such as TF-IDF scores and word counts, to estimate the importance of a document relative to its cluster (see also Liu et al. 2018 for a survey of additional document selection methods). The extracted documents are then concatenated into a long sequence and fed to an encoder-decoder model.
Our proposed framework eliminates the need to pre-select salient documents, which we argue leads to information loss and less flexible generation capability. Instead, a separate model first condenses the source documents into multiple dense vectors which serve as input to a decoder to generate an abstractive summary. Beyond producing more informative summaries, we demonstrate that our approach allows them to be customized. Recent conditional generation models have focused on controlling various aspects of the output such as politeness (Sennrich, Haddow, and Birch 2016), length (Kikuchi et al. 2016; Fan, Grangier, and Auli 2018), content (Fan, Grangier, and Auli 2018), or style (Ficler and Goldberg 2017). In contrast to these approaches, our customization technique requires neither training examples of documents and corresponding (customized) summaries nor specialized pre-processing to encode which tokens in the input might give rise to customization.

CONDENSE-ABSTRACT Framework
We propose an alternative to the EXTRACT first, ABSTRACT later (EA) approach which eliminates the need for an extractive model and enables the use of all input documents when generating the summary. Figure 2b illustrates our CONDENSE-ABSTRACT (CA) framework. In lieu of an integrated encoder-decoder, we generate summaries using two separate models. The CONDENSE model returns document encodings for N input documents, while the ABSTRACT model uses these encodings to create an abstractive summary. This two-step approach has at least three advantages for multi-document summarization. Firstly, optimization is easier since parameters for the encoder and decoder weights are learned separately. Secondly, CA-based models are more space-efficient, since N documents in the cluster are not treated as one very large instance but as N separate instances when training the CONDENSE model. Finally, it is possible to generate customized summaries targeting specific aspects of the input since the ABSTRACT model operates over the encodings of all available documents.

The CONDENSE Model
Let $D$ denote a cluster of $N$ documents about a specific target (e.g., a movie or product). For each document $X = \{w_1, w_2, \ldots, w_M\} \in D$, the CONDENSE model learns an encoding $d$ and word-level encodings $h_1, h_2, \ldots, h_M$. We use a BiLSTM autoencoder as the CONDENSE model. Specifically, we employ a Bidirectional Long Short-Term Memory (BiLSTM) encoder (Hochreiter and Schmidhuber 1997), where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden states of the BiLSTM at timestep $i$, which are concatenated (denoted by ;) to form the encodings.
Training is performed with a reconstruction objective. Specifically, we use a separate LSTM as the decoder where the first hidden state $z_0$ is set to $d$ (see Equation (5)). Words $w_t$ are generated using a softmax classifier, and the autoencoder is trained with a maximum likelihood loss. An advantage of using a separate encoder is increased training data, since we treat a single target with $N$ input documents as $N$ different instances. Once training has taken place, we use the CONDENSE model to obtain $N$ pairs of document encodings $\{d_i\}$ and word-level encodings.
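The description above is consistent with a standard BiLSTM autoencoder, which can be written as follows. This is a sketch under assumptions rather than the exact formulation used: in particular, forming $d$ from the final forward and backward states, the names $\mathrm{LSTM}_f$, $\mathrm{LSTM}_b$, $\mathrm{LSTM}_d$, and the output parameters $W_o$, $b_o$ are all illustrative choices, and no equation numbers are implied.

```latex
\begin{align*}
\overrightarrow{h}_i &= \mathrm{LSTM}_f(w_i, \overrightarrow{h}_{i-1}), &
\overleftarrow{h}_i &= \mathrm{LSTM}_b(w_i, \overleftarrow{h}_{i+1}), \\
h_i &= [\overrightarrow{h}_i ; \overleftarrow{h}_i], &
d &= [\overrightarrow{h}_M ; \overleftarrow{h}_1]
   \quad \text{(assumed choice of document encoding)}, \\
z_t &= \mathrm{LSTM}_d(w_{t-1}, z_{t-1}), &
z_0 &= d, \\
p(w_t) &= \mathrm{softmax}(W_o z_t + b_o), &
\mathcal{L}_{\mathrm{rec}} &= -\sum_{t=1}^{M} \log p(w_t).
\end{align*}
```

Under this reading, each of the $N$ documents in a cluster contributes its own reconstruction loss, which is why a cluster yields $N$ training instances for the CONDENSE model.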

The ABSTRACT Model
The ABSTRACT model first fuses the multiple encodings obtained from the CONDENSE stage and then generates a summary using a decoder.

Multi-source Fusion
The $N$ pairs of document encodings $\{d_i\}$ and word-level encodings $\{h_{i,1}, h_{i,2}, \ldots, h_{i,M}\}$, $1 \le i \le N$, are aggregated into a single pair of document encoding $d'$ and word-level encodings $h'_1, h'_2, \ldots, h'_V$, where $V$ is the total number of unique tokens in the input.
We fuse document encodings using an attentive pooling method which gives more weight to important documents. Specifically, we learn a set of attention weights $a$, where the mean encoding $\bar{d}$ is used as the query vector, and the fused document encoding $d'$ is obtained by pooling the document encodings with these weights. We also fuse word-level encodings, since the same words may appear in multiple documents. To do this, we simply average all encodings of the same word, if multiple tokens of the word exist; the sum of these encodings is divided by $V_{w_j}$, the number of tokens for word $w_j$ in the input.
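The fusion step can be illustrated with a small, dependency-free sketch. It assumes a bilinear attention score between each document encoding and the query, normalized with a softmax; the exact scoring function is not recoverable from the text here, so `bilinear`, `fuse_documents`, and `fuse_words` are hypothetical names and the score form is an assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def bilinear(u, W, v):
    """u^T W v (assumed form of the attention score)."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

def fuse_documents(doc_encodings, W_p, query=None):
    """Attentive pooling: weight each document by its score against the query.
    By default the query is the mean document encoding."""
    if query is None:
        query = mean_vector(doc_encodings)
    weights = softmax([bilinear(d, W_p, query) for d in doc_encodings])
    dim = len(doc_encodings[0])
    fused = [sum(w * d[k] for w, d in zip(weights, doc_encodings))
             for k in range(dim)]
    return fused, weights

def fuse_words(tokens, token_encodings):
    """Average the encodings of all tokens that share the same word."""
    buckets = {}
    for tok, enc in zip(tokens, token_encodings):
        buckets.setdefault(tok, []).append(enc)
    return {tok: mean_vector(encs) for tok, encs in buckets.items()}

# Toy usage: three 2-dimensional document encodings, identity W_p.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = fuse_documents(docs, identity)
word_encodings = fuse_words(["good", "film", "good"],
                            [[1.0, 3.0], [0.0, 2.0], [3.0, 5.0]])
```

In the toy usage, the third document is closest to the mean query and therefore receives the largest weight, while the two occurrences of "good" collapse into a single averaged encoding.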
Decoder The decoder generates summaries conditioned on the reduced document encoding $d'$ and reduced word-level encodings $h'_1, h'_2, \ldots, h'_V$. We use a simple LSTM decoder enhanced with attention (Bahdanau, Cho, and Bengio 2015) and copy mechanisms (Vinyals, Fortunato, and Jaitly 2015). We set the first hidden state $s_0$ to $d'$, and run an LSTM to calculate the current hidden state using the previous hidden state $s_{t-1}$ and word $y_{t-1}$ at time step $t$:

$$s_t = \mathrm{LSTM}(y_{t-1}, s_{t-1}) \tag{12}$$

At each time step $t$, we use an attention mechanism over word-level encodings to output the attention weight vector $a_t$ and context vector $c_t$. Finally, we employ a copy mechanism over the input words to output the final word probability $p(y_t)$ as a weighted sum over the generation probability $p_g(y_t)$ and the copy probability $p_c(y_t)$, computed with learned parameters $W$, $v$, and $b$, where $t$ is the current timestep.
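The final mixing step of the copy mechanism can be sketched as below. The sketch assumes the mixture has the common form $p(y_t) = \lambda\, p_g(y_t) + (1 - \lambda)\, p_c(y_t)$ with a scalar gate $\lambda \in [0, 1]$; which term the gate multiplies, and how the gate is computed from the decoder state, are assumptions, and `mix_generate_and_copy` is a hypothetical helper.

```python
def mix_generate_and_copy(p_gen, p_copy, gate):
    """Combine a generation distribution over the vocabulary with a copy
    distribution over input words.

    p_gen, p_copy: dicts mapping word -> probability.
    gate: scalar in [0, 1]; here it weights the generation term (assumption).
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    words = set(p_gen) | set(p_copy)
    return {w: gate * p_gen.get(w, 0.0) + (1.0 - gate) * p_copy.get(w, 0.0)
            for w in words}

# Toy usage: the copy distribution puts all its mass on an input word
# ("carter") that the generation vocabulary does not cover.
p_gen = {"funny": 0.5, "film": 0.5}
p_copy = {"carter": 1.0}
p_final = mix_generate_and_copy(p_gen, p_copy, gate=0.75)
```

Because both inputs are probability distributions and the gate is a convex weight, the result is again a distribution, and words that appear only in the input can still be produced.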
Salience-biased Extracts The model presented so far treats all documents as equally important and has no specific mechanism to encourage saliency and eliminate redundancy. In order to encourage the decoder to focus on salient content, we can straightforwardly incorporate information from an extractive step. In experiments, we select k documents using SUMMARUNNER (Nallapati, Zhai, and Zhou 2017), a state-of-the-art neural extractive model where each document is classified as to whether it should be part of the summary or not.
We concatenate k preselected documents into a long sequence and encode it using a separate BiLSTM encoder. The encoded sequence serves as input to an LSTM decoder which generates a salience-biased hidden state r t . We then update hidden state s t in Equation (12) as s t = [s t ; r t ]. Notice that we still take all input documents into account, while acknowledging that some might be more descriptive than others.
Training We use two objective functions to train the ABSTRACT model. Firstly, we use a maximum likelihood loss to optimize the generation probability distribution $p(y_t)$ based on gold summaries $Y = \{y_1, y_2, \ldots, y_L\}$ provided at training time. Secondly, we propose a way to introduce supervision and guide the attention pooling weights $W_p$ in Equation (9) when fusing the document encodings. Our motivation is that the resulting fused encoding $d'$ should be roughly equivalent to the encoding of the summary $y$, which can be calculated as $z = \mathrm{CONDENSE}(y)$. Specifically, we use a hinge loss that maximizes the inner product between $d'$ and $z$ and simultaneously minimizes the inner product between $d'$ and $n_i$, where $n_i$ is the encoding of one of five randomly sampled negative summaries. The final objective is then the sum of both loss functions.
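A hinge loss of the kind described can be sketched as follows. The margin value (1.0 here) and the summation over negatives are assumptions, since the exact form is not given in the text; `fusion_hinge_loss` and `total_loss` are hypothetical names.

```python
def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))

def fusion_hinge_loss(fused, summary_enc, negative_encs, margin=1.0):
    """Push the fused encoding towards the gold summary encoding and away
    from the encodings of sampled negative summaries.

    margin is an assumed hyperparameter, not a value taken from the paper.
    """
    positive = dot(fused, summary_enc)
    return sum(max(0.0, margin - positive + dot(fused, n))
               for n in negative_encs)

def total_loss(generation_loss, fusion_loss):
    """Final objective: the sum of both loss terms."""
    return generation_loss + fusion_loss

# Toy usage with two negatives (the paper samples five).
fused = [1.0, 0.0]
gold = [2.0, 0.0]
negatives = [[0.0, 1.0], [1.5, 0.0]]
l_fuse = fusion_hinge_loss(fused, gold, negatives)
l_total = total_loss(2.0, l_fuse)
```

The first negative is already separated from the gold summary by more than the margin and contributes nothing; only the second, which is too similar to the fused encoding, is penalized.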

Zero-shot Customization
Another advantage of our approach is that at test time, we can either generate a general-purpose summary or a need-specific summary. To generate the former, we run the trained model as is and use beam search to find the sequence of words with the highest cumulative probability. To generate the latter, we employ a simple technique that revises the query vector $\bar{d}$ in Equation (8). More concretely, in the movie review domain, we assume that users might wish to obtain a summary that focuses on the positive or negative aspects of a movie, the quality of the acting, or the plot. In a different domain, users might care about the price of a product, its comfort, and so on. We undertake such customization without requiring access to need-specific summaries at training time. Instead, at test time, we assume access to background reviews to represent the user need. For example, if we wish to generate a positive summary, our method requires a set of reviews with positive sentiment which approximately provide some background on how sentiment is communicated in a review.
We use these background reviews conveying a user need $x$ (e.g., acting, plot, positive or negative sentiment) during fusion to attend more to input reviews related to $x$. Let $C_x$ denote the set of background reviews. We obtain a new query vector $\hat{d} = \sum_{c=1}^{|C_x|} d_c / |C_x|$, where $d_c$ is the document encoding of the $c$'th review in $C_x$, calculated using the CONDENSE model. This change allows the model to focus on input reviews with semantics similar to the user need as conveyed by the background reviews $C_x$. The new query vector $\hat{d}$ is used instead of $\bar{d}$ to obtain document encoding $d'$ (see Equation (8)).
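The effect of swapping the query vector can be seen in a toy sketch. It assumes dot-product attention (that is, an identity pooling matrix) purely for illustration; `customized_query` and `attention_weights` are hypothetical names, and the two-dimensional encodings are made up.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def customized_query(background_encodings):
    """New query: the mean encoding of the background reviews C_x."""
    return mean_vector(background_encodings)

def attention_weights(doc_encodings, query):
    """Softmax over dot-product scores (identity pooling matrix assumed)."""
    scores = [sum(a * b for a, b in zip(d, query)) for d in doc_encodings]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two input reviews: the first is (say) about acting, the second about plot.
input_docs = [[1.0, 0.0], [0.0, 1.0]]

# General-purpose summary: the query is the mean of the input encodings.
generic = attention_weights(input_docs, mean_vector(input_docs))

# Customized summary: the query is the mean of background "acting" reviews.
acting_background = [[2.0, 0.0], [0.0, 0.0]]
custom = attention_weights(input_docs, customized_query(acting_background))
```

With the generic query both reviews receive equal weight, whereas the background-derived query shifts attention towards the review that resembles the background set. No parameters are retrained, which is what makes the technique zero-shot.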

Experimental Setup
Dataset We performed experiments on the Rotten Tomatoes dataset 3 provided in Wang and Ling (2016). It contains 3,731 movies; for each movie we are given a large set of reviews (99.8 on average) written by professional critics and users and a gold-standard consensus, i.e. a summary written by an editor (see an example in Figure 1). On average, reviews are 19.7 tokens long, while the summary length is 19.6 tokens. The dataset is divided into 2,458 movies for training, 536 movies for development, and 737 movies for testing. Following previous work (Wang and Ling 2016), we used a generic label for movie titles during training which we replace with the original movie names at test time.
Training Configuration For all experiments, our model used word embeddings with 128 dimensions, pretrained on the training data using GloVe (Pennington, Socher, and Manning 2014). We set the dimensions of all hidden vectors to 256, the batch size to 8, and the beam search size to 5. We applied dropout (Srivastava et al. 2014) at a rate of 0.5. The model was trained using the Adam optimizer (Kingma and Ba 2015) and an $\ell_2$ constraint (Hinton et al. 2012) of 2. We performed early stopping based on model performance on the development set. Our model is implemented in PyTorch 4 .

Comparison Systems
We present two variants of our approach: (a) AE+ATT+COPY uses the CONDENSE and ABSTRACT models described above, but without salience-biased extracts, while (b) AE+ATT+COPY+SALIENT does incorporate them. We further compared our approach against two types of methods: one-pass methods and methods that use the EA framework. One-pass methods include (c) LEXRANK (Erkan and Radev 2004), a PageRank-like summarization algorithm which generates a summary by selecting the n most salient units, until the length of the target summary is reached; (d) SUBMODULAR (Sipos, Shivaswamy, and Joachims 2012), a supervised learning approach to train submodular scoring functions for extractive multi-document summarization; (e) OPINOSIS (Ganesan, Zhai, and Han 2010), a graph-based abstractive summarizer that generates concise summaries of highly redundant opinions; and (f) SUMMARUNNER (Nallapati, Zhai, and Zhou 2017).
3 http://www.ccs.neu.edu/home/luwang/publications.html
4 Our code can be downloaded from xxx.yyy.zzz.
EA-based methods include (g) REGRESS+S2S (Wang and Ling 2016), an instantiation of the EA framework where a ridge regression model with hand-engineered features implements the EXTRACT model, while an attention-based sequence-to-sequence neural network is the ABSTRACT model; (h) SUMMARUNNER+S2S, our implementation of an EA-based system which uses SUMMARUNNER instead of REGRESS as the EXTRACT model; and (i) SUMMARUNNER+S2S+COPY, the same model as (h) but enhanced with a copy mechanism (Vinyals, Fortunato, and Jaitly 2015). For all EA-based systems, we set k = 5, which is tuned on the development set. Larger k leads to worse performance, possibly because the ABSTRACT model becomes harder to optimize.

Results
Automatic Evaluation We considered two evaluation metrics which are also reported in Wang and Ling (2016): METEOR (Denkowski and Lavie 2014), a recall-oriented metric that rewards matching stems, synonyms, and paraphrases, and ROUGE-SU4 (Lin 2004), which is calculated as the recall of unigrams and skip-bigrams up to four words. We also report F1 for ROUGE-1, ROUGE-2, and ROUGE-L, which are widely used in summarization (Lin 2004). They respectively measure word overlap, bigram overlap, and the longest common subsequence between the reference and system summaries.
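To make the ROUGE-SU4 description concrete, the following simplified sketch counts unigrams and skip-bigrams and reports recall against a single reference. It is an illustration only: it assumes whitespace tokenization, interprets "up to four words" as at most four intervening words, and omits details of the official toolkit such as stemming, so its numbers will not match reported scores. The function names are hypothetical.

```python
from collections import Counter

def su4_units(tokens, max_skip=4):
    """Count unigrams plus skip-bigrams whose two words are separated by
    at most max_skip intervening words (assumed interpretation)."""
    units = Counter(("U", t) for t in tokens)
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            units[("S", left, tokens[j])] += 1
    return units

def su4_recall(reference, candidate, max_skip=4):
    """Fraction of reference units that also occur in the candidate."""
    ref = su4_units(reference.lower().split(), max_skip)
    cand = su4_units(candidate.lower().split(), max_skip)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[unit]) for unit, count in ref.items())
    return overlap / total

# Toy usage: the reference "a b c" has 3 unigrams and 3 skip-bigrams;
# the candidate "a c" matches 2 unigrams and the skip-bigram (a, c).
score = su4_recall("a b c", "a c")
```

Because skip-bigrams tolerate gaps, a candidate that preserves word order but drops words still earns partial credit, which plain bigram overlap would not give.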
Our results are presented in Table 1. The first block shows one-pass systems, both supervised (SUBMODULAR, SUMMARUNNER) and unsupervised (LEXRANK, OPINOSIS). We can see that SUMMARUNNER is the best performing system in this block; despite being extractive, it benefits from training data and the ability of neural models to learn task-specific representations. The second block in Table 1 shows several two-pass abstractive systems based on the EA framework. Our implementation of an EA-based system, SUMMARUNNER+S2S+COPY, improves over the purely extractive SUMMARUNNER and the previously reported best EA-based system, REGRESS+S2S. The third block presents two models using the proposed CA framework. Both systems outperform all other models across all metrics; AE+ATT+COPY+SALIENT is the best model overall, as it exploits information from all documents as well as the most salient ones.

Model                     Rating
SUMMARUNNER               -0.115
SUMMARUNNER+S2S+COPY      -0.434
AE+ATT+COPY+SALIENT        0.038
GOLD                       0.511

Table 2: System ranking based on human judgments, using Best-Worst Scaling.
Human Evaluation In addition to automatic evaluation, we also assessed system output by eliciting human judgments. Participants compared summaries produced from the best extractive baseline (SUMMARUNNER), and the best EA- and CA-based systems (SUMMARUNNER+S2S+COPY and AE+ATT+COPY+SALIENT, respectively). As an upper bound, we also included GOLD standard summaries.
The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling (BWS; Louviere, Flynn, and Marley 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad 2017). Specifically, participants were shown the movie title and basic background information (i.e., synopsis, release year, genre, director, and cast). They were also presented with three system summaries and asked to select the best and worst among them according to Informativeness (i.e., does the summary convey opinions about specific aspects of the movie in a concise manner?), Correctness (i.e., is the information in the summary factually accurate and does it correspond to the information given about the movie?), and Grammaticality (i.e., is the summary fluent and grammatical?). Examples of system summaries are shown in Figure 1 and Figure 3. We randomly selected 50 movies from the test set and compared all possible combinations of summary triples for each movie. We collected three judgments for each comparison. The order of summaries and movies was randomized per participant.
The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. The scores range from -1 (worst) to 1 (best) and are shown in Table 2. Perhaps unsurprisingly, the human-generated gold summaries were considered best, whereas our model (AE+ATT+COPY+SALIENT) was ranked second, indicating that humans find its output more informative, correct, and grammatical compared to other systems. SUMMARUNNER was ranked third followed by SUMMARUNNER+S2S+COPY. We inspected the summaries produced by the latter system and found they were factually incorrect, bearing little correspondence to the movie (examples shown in Figure 3), possibly due to the huge information loss at the extraction stage. All pairwise system differences are statistically significant using a one-way ANOVA with posthoc Tukey HSD tests (p < 0.01).
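The Best-Worst Scaling score described above can be computed as in the sketch below. It assumes that "percentage of times" is taken relative to the number of comparisons in which a system was shown; `bws_scores` is a hypothetical helper and the toy judgments use placeholder system names A to D, not data from the study.

```python
from collections import Counter
from itertools import combinations

def bws_scores(judgments):
    """judgments: iterable of (systems_shown, best, worst) tuples.
    Score = (#best - #worst) / #times shown, which lies in [-1, 1]."""
    shown, best, worst = Counter(), Counter(), Counter()
    for systems, b, w in judgments:
        shown.update(systems)
        best[b] += 1
        worst[w] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}

# Size of the study: four systems yield four possible summary triples per
# movie; with 50 movies and three judgments per comparison this gives 600.
SYSTEMS = ["GOLD", "SUMMARUNNER", "SUMMARUNNER+S2S+COPY", "AE+ATT+COPY+SALIENT"]
triples = list(combinations(SYSTEMS, 3))
n_judgments = len(triples) * 50 * 3

# Toy judgments (made up for illustration).
toy = [
    (("A", "B", "C"), "A", "C"),
    (("A", "B", "D"), "A", "B"),
    (("A", "C", "D"), "D", "C"),
]
scores = bws_scores(toy)
```

A system that is always picked as best scores 1, one that is always picked as worst scores -1, and a system that is rarely singled out in either direction stays near 0.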
Customizing Summaries We further assessed the ability of CA-based systems to generate customized summaries at test time. As discussed earlier, customization at test time is not trivially possible for EA-based systems and as a result we cannot compare against them. Instead, we evaluate two CA-based systems, namely AE+ATT+COPY and AE+ATT+COPY+SALIENT. Similar to EA-based systems, the latter biases summary generation towards the k most salient extracted opinions using an additional extractive module, which may not contain information relevant to the user's need (we set k = 5 in our experiments). We thus expect this model to be less effective for customization than AE+ATT+COPY which makes no assumptions regarding which summaries to consider. In this experiment, we assume users may wish to control the output summaries in four ways, focusing on acting- and plot-related aspects of a movie review, as well as its sentiment, which may be positive or negative. Let CUST(x) be the zero-shot customization technique discussed in the previous section, where x is an information need (i.e., acting, plot, positive, or negative). We sampled a small set of background reviews $C_x$ ($|C_x| = 1{,}000$) from a corpus of 1 million reviews covering 7,500 movies from the Rotten Tomatoes website, made available in Ficler and Goldberg (2017). The reviews contain sentiment labels provided by their authors and heuristically classified aspect labels. We then ran CUST(x) using both AE+ATT+COPY and AE+ATT+COPY+SALIENT models. We show in Figure 3 customized summaries generated by the two models.
To determine which system is better at customization, we again conducted a judgment elicitation study on AMT. Participants read a summary which was created by a general-purpose system or its customized variant. They were then asked to decide if the summary is generic or focuses on a specific aspect (plot or acting) and expresses positive, negative, or neutral sentiment. We selected 50 movies (from the test set) which had mixed reviews and collected judgements from three different participants per summary. The summaries were presented in random order per participant.

"Kitchen Stories"
GOLD: By turns touching and funny, this Norwegian import offers quietly absorbing commentary on modern life and friendship.
SUMMARUNNER: An enjoyable movie. Uniquely eccentric.
SUMMARUNNER+S2S+COPY: The Kitchen Stories is a morally ambiguous, exceedingly _coming-of-age story_.
AE+ATT+COPY
  General: Kitchen Stories is an offbeat, thought-provoking tale that's both funny and moving.
  Customized (Acting): Kitchen Stories is an intelligent, funny social comedy that benefits from an impressive cast and outstanding performances from Isak.
  Customized (Plot): Kitchen Stories is both funny and smart, featuring a highly original script.
AE+ATT+COPY+SALIENT
  General: Kitchen Stories is a smart, offbeat comedy with fine performances.
  Customized (Acting): Kitchen Stories is a smart, offbeat comedy with fine performances.
  Customized (Plot): Kitchen Stories is a smart, offbeat comedy with fine performances.

"Gremlins"
GOLD: Whether you choose to see it as a statement on consumer culture or simply a special effects-heavy popcorn flick, Gremlins is a minor classic.
SUMMARUNNER: A wholesome Christmas family flick that veers over to the dark side. Gleefully mischievous and full of dark, magical energy.
SUMMARUNNER+S2S+COPY: Despite its _sportsmanlike swagger_, Gremlins's aimless plot isn't worth betting on.
AE+ATT+COPY
  General: Gremlins may appeal to the dark Christmas horror genre.
  Customized (Positive): Gremlins is an intelligent, funny Christmas horror film from Joe Dante's novel.
  Customized (Negative): Gremlins is an atrociously-acted project whose unoriginal and ineptly-staged horror film from Joe Dante's novel.
AE+ATT+COPY+SALIENT
  General: Gremlins is a good introduction to the 1984 season.
  Customized (Positive): Gremlins is a good horror movie with a talented cast.
  Customized (Negative): Gremlins is a good horror movie with a talented cast.

Figure 3: Examples of general-purpose and need-specific summaries generated by four systems. We also show the consensus summary (GOLD). Phrases marked with underscores (underlined in the original figure) denote factually incorrect information. In the original figure, words/phrases in color highlight aspects pertaining to acting, plot, positive and negative sentiment. The examples show that incorporating an extractive module (+SALIENT) prevents the model from customizing summaries.

Table 3 shows what participants thought of summaries produced by non-customized systems (see column No) and systems which had customization switched on (see column Yes). Overall, we observe that AE+ATT+COPY is able to customize summaries to a great extent. In all cases, crowdworkers perceive a significant increase in the proportion of aspect x when using CUST(x). AE+ATT+COPY+SALIENT is unable to generate need-specific summaries, showing no discernible difference between generic and customized summaries.
This shows that the use of an extractive module, which is used as one of the main components of EA-based approaches, limits the flexibility of the abstractive model to customize summaries based on a user need.

Conclusions
We proposed the CONDENSE-ABSTRACT (CA) framework for opinion summarization. Both automatic and human-based evaluation show that CA-based approaches produce more informative and factually correct summaries compared to purely extractive models and models including an extractive summary pre-selection stage. We also show that a simple zero-shot customization technique is able to generate aspect- and sentiment-based summaries at test time. In the future, we plan to apply CA-based approaches to other multi-document summarization tasks and domains. It would also be interesting to investigate an unsupervised or semi-supervised approach where reviews are available but no (or only a few) gold-standard summaries are given.