Self-Supervised and Controlled Multi-Document Opinion Summarization

We address the problem of unsupervised abstractive summarization of collections of user-generated reviews through self-supervision and control. We propose a self-supervised setup that treats an individual document as a target summary for a set of similar documents. This setting makes training simpler than in previous approaches, as it relies only on a standard log-likelihood loss and mainstream models. We address the problem of hallucinations through the use of control codes, which steer the generation towards more coherent and relevant summaries.


Introduction
Recent progress in unsupervised methods has led to breakthroughs in natural language processing applications, such as machine translation (Artetxe et al., 2018; Lample et al., 2018). These methods have mostly been based on a bootstrapping approach, which consists in iteratively alternating between two representations and optimizing a reconstruction loss. Beyond machine translation, other applications include question answering (Lewis et al., 2019) and parsing (Drozdov et al., 2019). While similar ideas have also been applied to video summarization (Yuan et al., 2019), such a bootstrapping approach seems less suited to summarization, because of the inherent information loss when going from the full text to the summary. Existing unsupervised approaches to summarization have therefore relied mostly on extractive graph-based systems (Mihalcea and Tarau, 2004). Only recently have there been proposals for unsupervised abstractive summarization, using auto-encoders (Chu and Liu, 2019; Bražinskas et al., 2020). However, these setups are complex and require a combination of loss functions (Chu and Liu, 2019) or hierarchical latent variables (Bražinskas et al., 2020) to ensure that the generated summaries remain on-topic.
In this paper, we investigate a self-supervised approach to multi-document opinion summarization. In this setting, there are multiple opinions (reviews) about a single entity (a product, venue, movie, etc.), and the goal is to produce a short summary of those opinions. Our approach is based on self-supervision and does not require any gold summaries. We train a supervised model on examples artificially created by selecting (i) one review that acts as a target summary and (ii) a subset of reviews of the same entity that acts as the document collection.
Neural models have a well-known problem of hallucination (Rohrbach et al., 2018), which can be misleading in natural language generation tasks, as the fluency of these models often distracts from the incorrect facts stated in the generated text. To reduce this effect, we propose to use control tokens (Fan et al., 2018; Keskar et al., 2019; Krause et al., 2020). Control tokens are discrete variables used to condition the generation. Differently from previous work, our goal is not to let users control the generated text, but to steer the generation towards an output that is consistent with the input documents being summarized.
Our main contributions are therefore threefold:
• performing multi-document summarization by modelling it as a self-supervised problem where one document acts as the summary of a subset of the others. We carefully select both, and link the resulting formulation to a recent theoretical framework (Peyrard, 2019) (Sect. 3);
• using control tokens to steer the model towards consistency, increasing the relevance of the generated summary (Sect. 4);
• an application of the multi-input transformer model (Libovický et al., 2018) to summarization. This model encodes each input independently and, at decoding time, applies parallel attention to each encoded input (Sect. 5).
Our experimental results (Sect. 6 and 7) show that our approach outperforms existing models on two datasets: Yelp reviews of venues (Chu and Liu, 2019) and Rotten Tomatoes movie reviews (Wang and Ling, 2016). We focus the human evaluation on the faithfulness of the summaries, confirming that they are more factually correct than those of the baselines.

Related Work
Unsupervised Opinion Summarization Extractive summarization consists in selecting a few sentences from the input documents to form the output summary. The centroid method (Radev et al., 2004; Rossiello et al., 2017; Gholipour Ghalandari, 2017) ranks sentences according to their relevance to the whole input. Graph-based methods, such as LexRank (Erkan and Radev, 2004) or TextRank (Mihalcea and Tarau, 2004; Zheng and Lapata, 2019), use the PageRank algorithm to find the most central sentences in a graph of input sentences, where edge weights indicate word overlap. In contrast to these methods, we focus on abstractive summarization.
Non-neural abstractive methods (Ganesan et al., 2010; Nayeem et al., 2018) are also graph-based, but work on word-type graphs. West et al. (2019) introduced a self-supervised model for sentence compression: they use an unsupervised extractive system to generate training data for a supervised sentence compressor. Their system works on single sentences, whereas our end-to-end approach summarizes multiple reviews.
Recently, a few approaches to neural unsupervised abstractive summarization have been proposed. Chu and Liu (2019, MeanSum) introduced a summarization system based on a review auto-encoder. At inference time, MeanSum encodes every review of a product into a vector, computes the centroid of these vectors, and uses this centroid to seed the decoder and generate a summary. However, averaging representations of statements that are sometimes contradictory tends to confuse the decoder, and may lead it to ignore the input signal. To address this limitation, Coavoux et al. (2019) add a clustering step to group similar reviews and generate one sentence per cluster. Bražinskas et al. (2020) proposed to solve unsupervised opinion summarization with an auto-encoder with latent variables. They use latent variables for products and reviews to address the hallucination issue, while at the same time allowing the model to capture information from the set of reviews of the same entity. In contrast, we argue that our self-supervised setting is simpler, as it relies on training standard models. In addition, the use of a Transformer (as opposed to a GRU in their case) makes it possible to apply a separate attention to each input. Probably most similar to our self-supervised proposal is the recent work of Amplayo and Lapata (2020), in particular their document noising sub-strategy. Compared to it, our simple dataset selection criterion avoids any use of a (domain-specific) noise generator. In addition, our use of control tokens makes it easy to include existing (or inferred) meta-information. A similar approach is also used by Shapira and Levy (2020), who train a seq2seq model by clustering reviews and using the medoid as the target summary.
Another work that has recently shown the promise of self-supervision for summarization is Zhang et al. (2020a), in which masked-out sentences are predicted from the surrounding text. Our self-supervised training mechanism can be seen as a multi-document version of that approach.
Controlled Generation Controllable text generation has previously been investigated to apply global constraints on text generation, by directly optimizing evaluation metrics through policy gradient methods (Ranzato et al., 2016; Liu et al., 2017; Li et al., 2016b; Yi et al., 2018) or continuous approximation methods (Chu and Liu, 2019; Yang et al., 2018).
Other methods apply control only at inference time. Weighted decoding (Holtzman et al., 2018) was shown to be challenging, and often detrimental to fluency and coherence (See et al., 2019). Constrained beam search (Anderson et al., 2017; Hokamp and Liu, 2017; Post and Vilar, 2018) is slower, requires very large beam sizes, and does not support soft constraints. Finally, updating the decoder hidden states (Chen et al., 2018; Dathathri et al., 2020) requires an extra training step.
Control codes were introduced in generation as an early form of copy mechanism (Luong et al., 2015; ElSahar et al., 2018) to address the problem of rare words. They have been widely adopted to steer language models towards specific features, such as aspects (Keskar et al., 2019) or structured outputs (Zellers et al., 2019).
In prior work, controlled language models rely on a predefined set of control tokens, collected manually (Keskar et al., 2019) or from dictionaries (Dathathri et al., 2020), which can lead to low domain coverage. Nabil et al. (2014) and ElSahar and El-Beltagy (2015) construct lexicons by exploiting the feature selection ability of sentiment classifiers, an approach that produces more relevant lexicons than classical topic models (e.g. LDA, Blei et al., 2003). In our work, we also rely on classifiers, using the categories of reviews provided as meta-data. Without meta-data, we could instead have relied on unsupervised or weakly supervised aspect extractors (He et al., 2017; Angelidis and Lapata, 2018).
Hierarchical encoding. In order to allow a neural summarizer to read several sections, Cohan et al. (2018) propose a hierarchical LSTM that works at two levels. Similarly to our proposal, Liu and Lapata (2019) extend a Transformer network to read several ranked paragraphs as input, avoiding a retrieve-then-read pipeline. In multi-document summarization, however, the paragraphs are not ranked but independent, which entails a significant change to the model. We propose to encode each review independently (avoiding inter-paragraph self-attention) and to adapt only the decoder-encoder attention.

Self-Supervision
In order to create our training dataset, we assume that a review s_i for an entity (venue or product) can serve as a summary for a set D_i of other, similar reviews. This simple intuition allows us to create training points (D_i, s_i) in a way very similar to what the model will experience at inference time. However, there are two issues with this approach. First, the potential set of training points is too large to be explored exhaustively: given the set of all reviews D, the total number of possible input-output pairs is 2^(|D|−1) × |D|. Second, the assumption that any review can serve as a summary for any set of other reviews is obviously not true, and might yield a very noisy training dataset.
To solve the combinatorial explosion, we limit the size of D_i to k and, for a given s_i, look for a set of k good reviews D_i for which s_i serves as a good summary. Fixing k also simplifies training, and enables comparison with previous work where the number of input reviews is fixed (Chu and Liu, 2019; Bražinskas et al., 2020). Both s_i and all members of D_i are reviews of the same entity.
Having fixed s_i, we now search for the k reviews d_1, …, d_k for which s_i is most relevant:

D_i = argmax_{d_1,…,d_k} Σ_{j=1..k} sim(s_i, d_j)    (1)

where sim is an arbitrary similarity function (which we define at the end of this section). Fixing the target summaries first turns traditional approaches upside down. In particular, a recently proposed theoretical model of importance in summarization (Peyrard, 2019) defines the importance of a summary based on three aspects: (i) minimum redundancy, (ii) maximum relevance to the input documents, and (iii) maximum informativeness. In that line of work, D_i is considered fixed; redundancy and informativeness do not depend on D_i and can therefore be ignored once s_i is fixed, so that the model of Peyrard (2019) reduces to Eq. 1. We then sort the data points (D_i, s_i) according to their relevance value Σ_j sim(s_i, d_j) and, depending on the desired size of the training dataset, keep the top-T pairs for training. Limiting T inherently increases informativeness, since it limits the creation of training examples where inputs and outputs are repetitive, similar reviews that may be very prominent at the corpus level (e.g. "Great restaurant."). This method is simple and fast, thanks to nearest-neighbour search libraries (Pedregosa et al., 2011b). For all our experiments, we defined sim as the cosine similarity over a tf-idf bag-of-words representation (Ramos et al., 2003).
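The pair construction described above can be sketched with off-the-shelf scikit-learn components. This is an illustrative sketch, not the authors' code; `build_pairs` and its signature are our own names, and the relevance score is the sum of cosine similarities described in the text.

```python
# Sketch of self-supervised pair construction: tf-idf vectors, cosine
# nearest neighbours within an entity, top-T filtering by relevance.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def build_pairs(reviews_by_entity, k=8, top_t=100):
    """reviews_by_entity: dict mapping entity id -> list of review strings."""
    all_reviews = [r for revs in reviews_by_entity.values() for r in revs]
    vectorizer = TfidfVectorizer().fit(all_reviews)
    pairs = []
    for entity, revs in reviews_by_entity.items():
        if len(revs) <= k:       # not enough reviews to form an input set
            continue
        vecs = vectorizer.transform(revs)
        # cosine distance = 1 - cosine similarity on the tf-idf vectors
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(vecs)
        dists, idx = nn.kneighbors(vecs)
        for i, (d_row, i_row) in enumerate(zip(dists, idx)):
            # skip the review itself (distance 0), keep its k neighbours
            neighbours = [revs[j] for j in i_row if j != i][:k]
            relevance = (1 - d_row[1:]).sum()  # sum of similarities
            pairs.append((relevance, neighbours, revs[i]))
    pairs.sort(key=lambda p: -p[0])            # most relevant targets first
    return [(src, tgt) for _, src, tgt in pairs[:top_t]]
```

Each returned pair is a (document collection, target summary) training example; sorting by relevance before truncating to `top_t` implements the filtering step.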

Controlling Hallucinations
Hallucinations are pieces of generated text that bear no relationship to the text they were conditioned on. They are likely to happen in our self-supervised setting, due to the noise introduced when constructing training instances. This might happen if the synthetically created training data contains contradictory signals, or because certain types of review are over-represented (e.g. "great movie"). The model might default to those frequent patterns if it finds itself in an infrequent state during decoding. To alleviate the problem of hallucinations, we propose to use control tokens that represent desired traits of the output text, steering the generation towards more input-coherent summaries. These control tokens are inferred from each review, and used as prompts at inference time. We use two types of codes, as follows.
1) Metadata control tokens. These are special tokens associated with each input review (the capitalized control tokens in Fig. 1). We use two types of metadata: (i) the review polarity, a numerical value denoting the average sentiment score of the input reviews; and (ii) categorical tokens representing the type of entity being reviewed (e.g. Deli, Beauty&Spa, Furniture Stores). When meta-data labels are not available for all reviews (as in the Rotten Tomatoes dataset), we infer control tokens with the same process, but using categories predicted by a classifier trained on labeled examples from the same domain.
2) Inferred control tokens. We follow recent work (Keskar et al., 2019; Dathathri et al., 2020) showing that it is preferable to condition NLG models on control tokens that naturally co-occur in text. On the one hand, this allows for better control; on the other, it appears more robust when new (previously unseen) control codes are used. Here, we propose to use control codes that represent informative aspects (e.g. wine, service, ingredients) occurring in the text of the input reviews. However, instead of relying on manually created bags of control tokens for each desired attribute, which comes with obvious domain coverage limitations, we propose to infer these control codes from the text corpus.
To do so, we rely on the intrinsic feature selection capabilities of regularized linear classification models. For each category in the meta-data associated with reviews, we train a linear support vector machine (SVM) classifier (Vapnik and Lerner, 1963) that learns to distinguish reviews from this category from negative examples sampled randomly from the rest of the corpus. The features of each SVM are parameterized by a weight vector θ ∈ R^d, where d is the number of features (in our experiments: all unigrams and bigrams present in the corpus). We use a squared hinge loss with L1 regularization over θ, the latter to increase sparsity and force feature selection (Tibshirani, 1996; Ng, 2004). Finally, we trim the feature list to those features with positive weights and renormalize the weights. The output of this step is a ranked list of n-grams that represent the distinctive aspects of each category.
When creating training data for summarization, we enrich each review with the top-weighted n-grams of its corresponding categories, as follows. For a given review d about entity p, we consider all m labels of p and use the weights of the corresponding classifiers θ_i (one for each label i = 1, …, m) to score n-grams. We only consider those n-grams actually occurring in d, and keep the top 8 such features. Note that these features can come from different classifiers, as we consider all m labels.
During training, we enrich each review with its tailored control codes. In particular, the reviews acting as summaries also contain them and, by construction, these are n-grams present in the text. At inference time, when the target side and its control codes are not available, we select the most repeated control tokens from the input side and feed them as a prefix to the decoder. There is clearly a risk that the model simply learns to copy the control codes it has seen somewhere in the text. We check whether this is the case in Sect. 7.
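The inference-time selection rule above can be sketched in a few lines (names are ours; the paper does not specify tie-breaking among equally frequent tokens):

```python
# Pick the control tokens repeated most often across the k input reviews,
# to be fed as a prefix to the decoder at inference time.
from collections import Counter

def inference_prompt(control_tokens_per_review, max_tokens=8):
    """control_tokens_per_review: one list of inferred tokens per input review."""
    # count in how many reviews each token appears (set() avoids double counting)
    counts = Counter(t for tokens in control_tokens_per_review for t in set(tokens))
    return [t for t, _ in counts.most_common(max_tokens)]
```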

Multi-source Transformer Model
Previous work on multi-document summarization (Chu and Liu, 2019) built multi-source input representations through a simple mean over the last hidden states of the encoder. An intrinsic limitation of this method is that the full set of reviews is represented as a single vector. This aggregation may cause information distortion, especially when some input reviews express conflicting opinions. Standard transformer models (Vaswani et al., 2017) consider only a single input to the decoder part of the model. Aggregating all input reviews into a single input (Junczys-Dowmunt, 2019), with special tokens to represent document boundaries, can be slow and impractical due to the O(n²) complexity of the self-attention mechanism. We therefore experiment with several input combination strategies for the transformer cross-attention (Libovický et al., 2018).
Parallel. At each cross-attention head, the decoder's set of queries Q attends separately to each of the encoded inputs, from which the sets of keys (K_i ∈ K_{1:m}) and values (V_i ∈ V_{1:m}) are generated. The resulting contexts are then averaged, followed by a residual connection from the previous decoder layer (box C in Fig. 1).
Mean. We also propose a simpler and less computationally demanding input combination strategy. Instead of applying the cross-attention to each encoder output separately, the sets of keys and values coming from each input encoder are averaged at each absolute position. The decoder's set of queries then attends to this aggregated set of keys and values. This combination can be seen as a more efficient variant of the flat combination strategy (Libovický et al., 2018), with a mean instead of a concatenation. Fig. 2 depicts this strategy, which replaces box (C) in Fig. 1.
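A simplified single-head sketch of the two strategies (our code, not the Fairseq implementation; multi-head projection, masking, and padding are omitted): "parallel" runs one attention pass per input and averages the contexts, while "mean" first averages keys/values position-wise and runs a single pass.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q: (tgt_len, d), k/v: (src_len, d) -> context: (tgt_len, d)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def parallel_combine(q, encoded_inputs):
    # one cross-attention per encoded input, contexts averaged afterwards
    contexts = [attention(q, enc, enc) for enc in encoded_inputs]
    return torch.stack(contexts).mean(dim=0)

def mean_combine(q, encoded_inputs):
    # average keys/values at each absolute position, then a single attention;
    # assumes all inputs are padded to the same length
    enc = torch.stack(encoded_inputs).mean(dim=0)
    return attention(q, enc, enc)
```

With m inputs of length n, "parallel" performs m attention passes while "mean" performs one, which is where the computational saving comes from.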
In practice, we share the parameters across all encoders; this can also be seen as a single encoder used to encode each input document independently. We believe this is an appropriate design choice, as the order of the input documents does not matter. Furthermore, it is necessary to allow a variable number of input documents across training batches or at inference time. In Sect. 7, we compare both approaches through an ablation study, focusing on summary quality as well as empirical training times.

Experimental Setup
Experimental Details All our models are implemented with the PyTorch (Paszke et al., 2019) and Fairseq (Ott et al., 2019) libraries, as well as scikit-learn (Pedregosa et al., 2011a) for the classifiers used either for inferring control tokens or for evaluation. For all our models, we use SentencePiece (Kudo and Richardson, 2018) as a tokenizer, with a vocabulary size of 32,000. We use the same hyperparameters as the Transformer Big model described by Vaswani et al. (2017) (d_model = 1024, n_heads = 16, n_layers = 6, dropout = 0.1). We optimize with Nesterov accelerated SGD with a learning rate of 0.01. We train all models for a total of 80,000 steps across 25 epochs, with a linear warm-up over the first 8,000 steps. We select the best model checkpoint based on perplexity on the validation set. All models were trained on one machine with 4 NVIDIA V100 GPUs; the longest-training model took 50 hours. For inference, we use a beam size of 35. We discard hypotheses that contain the same trigram twice. We limit each generated summary to a maximum budget of 150 tokens for Yelp, as was done by Chu and Liu (2019), and 50 tokens for Rotten Tomatoes. We set a similar budget for all extractive baselines in the experiments. Finally, we use length normalization (Wu et al., 2016) with a length penalty of 1.2 to counteract the model's bias towards shorter sequences.
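The two decoding heuristics mentioned above can be sketched as follows (helper names are ours, not Fairseq's; the length penalty follows the GNMT formula of Wu et al., 2016):

```python
# Trigram blocking: discard any hypothesis that repeats a trigram.
def has_repeated_trigram(tokens):
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    return len(trigrams) != len(set(trigrams))

# GNMT length normalisation: divide the log-probability by
# ((5 + |Y|) / 6) ** alpha, here with alpha = 1.2.
def length_normalised_score(log_prob, length, alpha=1.2):
    return log_prob / (((5 + length) / 6) ** alpha)
```

Since log-probabilities are negative, a larger penalty denominator pulls the score of longer hypotheses towards zero, counteracting the bias towards short sequences.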
Datasets We evaluate our proposal on two English datasets: Yelp (Chu and Liu, 2019; https://www.yelp.com/dataset/challenge) and Rotten Tomatoes (Wang and Ling, 2016). The Yelp dataset contains reviews of businesses (around 1M reviews for 40k venues). As described in Sect. 3, for each venue we select the best reviews to use as target summaries: either the top-p (with p = 15%) or the top-T (with T = 100) reviews, whichever is smaller. For each selected target summary, we take its k = 8 most similar reviews (by cosine similarity) to form its input. We obtain around 340k training examples, representing 22.5k venues. The Rotten Tomatoes dataset was constructed by Wang and Ling (2016) from the movie review website rottentomatoes.com. We use the same process as for Yelp, but with p = 1% and T = 150. We construct around 170k training examples, representing 3.7k movies. We provide more details in the supplementary material.

Evaluation Metrics
We evaluate summarization systems with the classical ROUGE-F-{1,2,L} metrics (Lin, 2004). We also report BERTScore (Zhang et al., 2020b), a metric that uses pre-trained BERT (Devlin et al., 2019) to compute the semantic similarity between a candidate summary and the gold summary. Dist-n and Dist_c-n (n = 1, 2, 3) scores (Li et al., 2016a) are the percentage of distinct n-grams in the generated text, at the summary level and corpus level respectively. Dist-n is an indicator of repetitiveness within a single summary, while Dist_c-n indicates the diversity across different generations. Finally, as done by Chu and Liu (2019), we use a classifier to check whether the sentiment of the summary is consistent with that of the input reviews (Sentiment Acc., Tab. 1). We extend this method to check whether the correct product category can also be inferred from the summary, reporting F_category, the micro F-score of the multi-label category classifier.
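As we read the Dist-n definition, it is the ratio of distinct n-grams to total n-grams, computed either within one summary or over all generated summaries. A minimal sketch (function names are ours):

```python
# Dist-n within a single summary: distinct n-grams / total n-grams.
def dist_n(tokens, n):
    ngrams = list(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Corpus-level Dist_c-n: pool n-grams across all summaries (without
# creating spurious n-grams that cross summary boundaries).
def corpus_dist_n(summaries, n):
    total, distinct = 0, set()
    for tokens in summaries:
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        distinct.update(ngrams)
    return len(distinct) / total if total else 0.0
```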

Fig. 3 shows example summaries generated by the different systems (including MeanSum) from the same input.

Baselines We compare against graph-based extractive baselines, among them LexRank (https://github.com/crabcamp/lexrank), with their default parameters. For Opinosis, we use the official Java implementation (https://github.com/kavgan/opinosis-summarization) with default hyperparameters, except for the redundancy parameter, which we set to one since the default led to many empty outputs. We also compare our systems with recent neural unsupervised summarization systems (Chu and Liu, 2019; Bražinskas et al., 2020). In addition, in our ablation study (next section), we compare against a vanilla Transformer system, to measure the relative gains obtained on top of that model.

Evaluation Results
Automatic Evaluation Table 1 contains the automatic evaluation metrics with respect to reference summaries. The proposed multi-input self-supervised model with control codes performs consistently better on the Yelp dataset than all benchmarked models, including the recent neural unsupervised models MeanSum and H-VAE. For the recent H-VAE model, we report the numbers from their paper. For MeanSum, we re-run their provided checkpoint and run evaluation through the same pipeline. The BERTScore (Zhang et al., 2020b) differences are closer and seem to favour neural models.

Figure 4: Proportion of control tokens fed as prompts that occur in the generated summary. When the model is fed control tokens that occur in the input reviews (correct), it tends to generate them in the output. By contrast, incorrect control tokens are mostly ignored.
On the Rotten Tomatoes dataset, we only benchmarked the graph-based unsupervised methods, since the released pretrained MeanSum model does not cover the domain of movie reviews. We attribute the lower sentiment accuracy to the fact that the "summaries" in Rotten Tomatoes are critics' reviews, written in a very different style from the original reviews. Table 2 contains reference-less evaluation, analyzing the number of distinct n-grams (an indicator of repetitiveness) at the summary level and corpus level. At the summary level, our model outperforms all the baselines: it is capable of generating richer and less repetitive summaries. At the level of all generations, our model generates text with more diversity than MeanSum. In general, however, extractive models tend to have more diversity at the corpus level, as they copy directly from each input separately, while abstractive models tend to learn repetitive patterns present in the training set. Fig. 3 shows summaries generated by different models from the same input. We notice that our model learned to copy aspects of the input documents, such as the restaurant name "Capricotti's" and menu items like "the Bobbie", possibly thanks to the cross-attention mechanism. We provide more examples in the supplementary material.
Human Evaluation Existing natural language generation systems are known to generate very fluent language that looks natural to native speakers. On the other hand, current neural models are known to generate factually incorrect content, something that was less of a concern with pre-neural methods and is also much harder to detect. As noted by Kryscinski et al. (2019): "Neither of the methods explicitly examines the factual consistency of summaries, leaving this important dimension unchecked." Inspired by Falke et al. (2019), we decided to focus the human evaluation on the aspect of summarization evaluation in which existing models risk failing the most: faithfulness.
We annotated 94 summaries through a crowdsourcing platform, comparing 3 systems (Gold, MeanSum and ours). Workers were asked whether "the summary contains correct information given the original reviews". In total we had 282 tasks (94×3); each task was labeled by 3 annotators, paid $0.50 (an amount defined in a pilot study to aim for $15/hour), and restricted to experienced, English-speaking workers. A full description of the campaign, including the filtering of the annotations, is given in the supplementary material.

Faithfulness    Gold    Ours    MeanSum
Correct         67      47      43
Incorrect       3       7       16
%Correct        95.71   87.04   72.88

Table 4: Results of the human evaluation focused on the faithfulness of generated summaries.
The results in Table 4 show that 87.0% of the summaries generated by our system are considered factually correct (compared with 95.7% for the gold summaries), as opposed to 72.9% for MeanSum.
Ablation We analyze the impact of our proposed variations of the basic self-supervised setting in Table 3. Removing control codes significantly degrades the sentiment and category classification F1 of the produced summaries. It also greatly impacts the ROUGE score. Changing the decoder-encoder attention from parallel to mean (Sect. 5) also degrades ROUGE. Without control codes, the effect of this attention change is smaller but, surprisingly, goes in the opposite direction.

Control Codes
The previous ablation study shows the importance of the control codes for the quality of the final summaries. In order to see how rigidly the model follows those control codes, we devise the following experiment, which tests whether the tokens used as control codes are forced to appear in the output text independently of the input text.
For this, we sample 500 reviews (covering 279 venues from the Yelp validation set). For each input example, we randomly sample 8 inferred control tokens (see Sect. 4) from the tokens occurring in the review, referring to these as correct control tokens. We run the decoder using these control tokens as a prompt and count the proportion of them that also occur in the generated summary. For comparison, we repeat the same experiment but instead sample 8 control tokens that do not occur in the input text, referring to these as incorrect.
To minimize the possibility of conditioning on control tokens that might show up naturally in the generated text, for both settings we repeat the process 5 times per input example (resulting in 2,500 runs with correct control tokens as prefix and 2,500 with incorrect ones). We report in Fig. 4 the proportion of fed control codes that are generated by the model in both cases. We observe that the model tends to comply with correct control tokens that occur in the input documents (e.g. 89% of the summaries contain more than 50% of the control tokens), but tends to ignore control tokens that do not occur in the input. We illustrate this behaviour with a set of examples generated from the same input but with different control tokens in the supplementary material.
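The per-summary statistic reported in this experiment reduces to a one-line computation; this is our sketch of it (names are ours):

```python
# Proportion of prompted control tokens that appear in the generated summary.
def prompt_compliance(control_tokens, summary_tokens):
    if not control_tokens:
        return 0.0
    summary_set = set(summary_tokens)
    present = sum(1 for t in control_tokens if t in summary_set)
    return present / len(control_tokens)
```

Computed over the correct-prompt and incorrect-prompt runs separately, this yields the two distributions shown in Fig. 4.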

Conclusion
The promise of unsupervised multi-document abstractive summarization has been hampered by the complexity of existing models and the problem of hallucinations. Our proposed model has the advantage of being very simple to train compared to previous proposals. In addition, the combined use of control codes to steer the generation and of multi-input transformers results in summaries that are better as measured by automatic metrics, and more faithful as measured by human evaluation.
While the generated reviews are more factual than those generated by other models, we want to stress that inaccuracies can still appear. Generated summaries are often written in the first person, which could lead readers to believe that an actual human wrote them. We strongly recommend that any use of such algorithms be accompanied by a clear disclaimer about its true nature.

References
Reinald Kim Amplayo and Mirella Lapata. 2020. Unsupervised opinion summarization with noising and denoising. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020, pages 1934-1945.

B Control Codes Analysis

Figure 5 illustrates the behaviour of the model on the same input when it is fed either 'correct' or 'incorrect' control codes. The correct control codes occur in the output, but this is not the case for incorrect control codes: the model does not simply learn to copy control codes; it ignores them when they are inconsistent with the input reviews.
Correct Control Tokens: eat, lentil, eggplant, new, remember, flavourful, friendly
Seriously best we've had in Toronto. We were looking for a new place to eat and stumbled upon this place. The atmosphere is very authentic, the food is authentic, and the service is very friendly and attentive. We started with the lentil soup, which was very flavourful and full of flavor. For dinner, I had the lamb shank and my husband had the eggplant dish. Both were very good. We also had the baklava for dessert and it was amazing. We can't wait to come back and try more of the menu

Correct Control Tokens: yogurt, remember, service, shawarma, dishes, eat, restaurant, flavourful
Seriously best I've had in Toronto. We were looking for a place to eat and this was the first restaurant we went to. The service was great and the dishes were authentic, flavourful and authentic. We had the shawarma sampler and the yogurt sampler. Both dishes were amazing. We also had the baklava, which was also very good. I can't remember the name of the dish, but it was very authentic

Incorrect Control Tokens: open, tortellini, filter, details, rabbit, signature, craving, double
Seriously best I've had in Toronto. We were looking for a new place to try and stumbled upon Syrianian food. We came across this place on yelp and decided to give it a try. The food, the service, the atmosphere, the quality of the food, and the service were all amazing. We ordered the sampler platter for two and it was enough for two people to share. We also ordered the Brabianianian appetizer, which was one of the best we've ever had. For our main course, we ordered the Crabian Frabian Bianianca and the Bianca. Both were so authentic, and full of flavour. For dessert, we had the Turkish coffee which was also very good. The only thing I didn't like was the bread they give you.

Incorrect Control Tokens: beer, balls, skip, mains, croissant, replacement, wish
Seriously best I've had in Toronto. We came here with a friend and we were looking for a place to eat. We found Syrianian food, which was very authentic. We were very impressed with the quality of the food, the service, and the atmosphere. We had the sampler platter, which came with two mains and two mains for dinner. We also had the bread pudding for dessert and it was to die for. I'm not a huge fan of sweets, but this was one of the best we've ever had. I wish we lived in Toronto so we could come here all the time. We'll be back to try more of the menu.

Figure 5: Summaries generated from the same input when different 'correct' and 'incorrect' control tokens are fed as prefixes at inference time. Control tokens that occur in the summary are highlighted (green/italics for the first two rows, red/underlined for the other two).
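To make the mechanism concrete, the sketch below shows one plausible way control tokens could be fed as a textual prefix to the concatenated input reviews at inference time. The separator token and the exact input layout are illustrative assumptions, not the paper's actual implementation; the point is only that swapping the prefix steers generation.

```python
def build_model_input(control_tokens, reviews, sep="</s>"):
    """Prefix the encoder input with control tokens, then the reviews.

    `sep` and the overall layout are assumptions for illustration: at
    inference time, changing `control_tokens` changes the prefix the
    model conditions on, and hence steers the generated summary.
    """
    prefix = " ".join(control_tokens)
    body = f" {sep} ".join(reviews)
    return f"{prefix} {sep} {body}"
```

For instance, `build_model_input(["eat", "lentil"], ["great food", "nice staff"])` yields a single string with the control tokens up front, followed by the separator-joined reviews.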

D Human Evaluation Campaign
We used Amazon Mechanical Turk to ask 3 "workers" to assess whether 282 summaries produced by 3 systems (94 from each: ours, gold from human experts, and MeanSum) aligned correctly with sets of 8 reviews. Workers had to read the reviews and the summary, then answer the question: "does the summary contain correct information given in the original reviews?" Instructions specified to "assess the faithfulness of the summary with respect to the set of reviews," specifically to "verify that the summary did not contain factually incorrect or self-contradicting statements that could not be inferred from what was provided in the original reviews." Using Mechanical Turk qualification criteria, we required workers: (1) to be located in the United States, Canada or United Kingdom; (2) to have a HIT approval rate higher than 98%; (3) to have more than 1000 HITs approved. Note that in an initial pilot, we asked evaluators to pick the best and worst summaries for fluency, coherency and alignment, as well as overall. We decided to simplify the task because it turned out to be quite difficult: for many summaries, workers struggled to decide on the best and worst. We therefore focused the human evaluation on the aspect that is currently very difficult to automate, faithfulness to the original text. We see this evaluation as complementary to the automatic evaluation, focusing on different aspects.
We did an internal run to estimate the time needed per individual assignment; each Human Intelligence Task, or HIT (an annotation in our case), was assigned to 3 workers. We followed it with a short pilot to validate the average of 2 minutes we had estimated. This is important to establish the rate to pay: 2 minutes translate into 30 potential assignments per hour, so we picked $0.50 per HIT to target an average $15 hourly wage. Beyond the timing, the pilot was also used as a dry run for the full campaign. The average time to answer and the theoretical hourly wage are provided in Table 6. By using shuffled gold summaries, hence written for another set of reviews, we included 21 badly aligned "negatives." Workers who answered yes on these obviously misaligned summaries were filtered out as "dubious" from the results: all their answers were discarded. After filtering out the "negatives" HITs and those from "dubious" workers, we were left with 446 annotations, out of the 782 we received. We further discarded all annotations made in less than a minute, keeping 377 realistic answers. One minute may seem harsh, but we estimated it was the minimum time needed to first read the reviews, then the summary, and to assess the latter, given the question: proceeding backward by first reading the summary would still require the worker to read all the reviews, to make sure a factual error according to one review is not in fact supported by another one. Finally, we looked for full agreement at the HIT level and kept only those with either 0 "yes" or 0 "no" answers, with between 1 and 3 answers remaining per HIT after the filtering of the "dubious" and "unrealistic" ones. Not surprisingly, as we focused on alignment, Gold summaries scored best, but ours scored nicely, with a very low number of misaligned summaries.
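The filtering pipeline described above (discarding all annotations from workers who failed a catch trial, then dropping too-fast answers, then keeping only unanimous HITs) can be sketched as follows. The `Annotation` record and its field names are hypothetical, introduced purely for illustration; they are not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    worker: str       # anonymized worker id (hypothetical field names)
    hit_id: str       # which summary/review-set pair was judged
    answer: bool      # True = "yes, the summary is faithful"
    seconds: float    # completion time reported by Mechanical Turk
    is_negative: bool # shuffled gold summary used as a catch trial

def filter_annotations(annotations, min_seconds=60.0):
    # Workers who answered "yes" on a catch trial are "dubious":
    # every one of their annotations is discarded.
    dubious = {a.worker for a in annotations if a.is_negative and a.answer}
    return [a for a in annotations
            if a.worker not in dubious
            and not a.is_negative
            and a.seconds >= min_seconds]

def full_agreement(annotations):
    # Keep only HITs where all remaining answers are unanimous.
    by_hit = {}
    for a in annotations:
        by_hit.setdefault(a.hit_id, []).append(a.answer)
    return {h: votes[0] for h, votes in by_hit.items()
            if len(set(votes)) == 1}
```

A HIT surviving `full_agreement` corresponds to a summary with either 0 "yes" or 0 "no" among its remaining answers.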
Assessing the alignment of summaries to a set of reviews is not an easy task. To be on the safe side, we decided to discard all answers from the "dubious" workers who erred on our "negative" summaries. Mechanical Turk reports the time taken for an assignment; its average is an interesting metric to look at, especially the way it evolves across our successive filtering steps. We translated it into the associated theoretical hourly wages, which were unfortunately all under the $15 we initially targeted.
We also looked at the results without full agreement: instead of doing it per HIT (i.e., per summary), this had to be done at the lower level of individual evaluations. For the 276 evaluations with full agreement, the numbers are: Gold 108/4 (96.43% correct), Ours 69/7 (90.79%), MeanSum 63/25 (71.59%). When including the disagreements (377 evaluations), they are: Gold 116/13 (89.92%), Ours 89/27 (76.72%), MeanSum 85/47 (64.39%). The numbers are similar; however, given the difficulty of the assessment for the workers, we decided to focus on the summaries they agreed on.

Table 6: Average time to answer and the theoretical hourly wage of workers (in USD) for the crowdsourcing experiments of human evaluation.
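The percentages above follow directly from the yes/no counts. The small helper below (an illustrative addition, not from the paper) reproduces them as the share of "correct" (faithful) verdicts among all verdicts, rounded to two decimals.

```python
def correct_rate(yes, no):
    """Percentage of 'yes' (faithful) verdicts among all verdicts."""
    return round(100 * yes / (yes + no), 2)

# Full-agreement evaluations (276 total):
# Gold 108/4, Ours 69/7, MeanSum 63/25.
```

For example, `correct_rate(108, 4)` gives 96.43 and `correct_rate(63, 25)` gives 71.59, matching the reported figures.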