MReD: A Meta-Review Dataset for Structure-Controllable Text Generation

When directly using existing text generation datasets for controllable generation, we face the problem of lacking domain knowledge, and thus the aspects that can be controlled are limited. A typical example is using the CNN/Daily Mail dataset for controllable text summarization: there is no guiding information on the emphasis of summary sentences. A more useful text generator should leverage both the input text and the control signal to guide the generation, which can only be built with a deep understanding of the domain knowledge. Motivated by this vision, our paper introduces a new text generation dataset, named MReD. Our new dataset consists of 7,089 meta-reviews, and all of its 45k meta-review sentences are manually annotated with one of 9 carefully defined categories, including abstract, strength, decision, etc. We present experimental results on state-of-the-art summarization models, and propose methods for structure-controlled generation with both extractive and abstractive models using our annotated data. By exploring various settings and analyzing the model behavior with respect to the control signal, we demonstrate the challenges of our proposed task and the value of our dataset MReD. Meanwhile, MReD also allows us to gain a better understanding of the meta-review domain.


Introduction
Text generation entered a new era because of the development of neural network based generation techniques. Along the dimension of the mapping relation between the input information and the output text, we can roughly group the recent tasks into three clusters: more-to-less, less-to-more, and neck-to-neck. The more-to-less text generation tasks output a concise piece of text from some more abundant input, such as text summarization (Tan et al., 2017; Kryściński et al., 2018). The less-to-more generation tasks generate a more abundant output from some obviously simpler input, such as prompt-based story generation (Fan et al., 2018b). The neck-to-neck generation aims at generating an output text which conveys the same quantity of knowledge as the input but in natural language, such as typical RDF triples to text tasks (Gardent et al., 2017).

* Equally Contributed. † Chenhui, Liying, and Ran are under the Joint PhD Program between Alibaba and their corresponding universities.

meta-review:
[This paper studies n-step returns in off-policy RL and introduces a novel algorithm which adapts the return's horizon n in function of a notion of policy's age.] ←ABSTRACT
[Overall, the reviewers found that the paper presents interesting observations and promising experimental results.] ←STRENGTH
[However, they also raised concerns in their initial reviews, regarding the clarity of the paper, its theoretical foundations and its positioning (notably regarding the bias/variance tradeoff of uncorrected n-step returns) and parts of the experimental results.] ←WEAKNESS
[In the absence of rebuttal or revised manuscript from the authors, not much discussion was triggered.] ←REBUTTAL PROCESS
[Based on the initial reviews, the AC cannot recommend accepting this paper, but the authors are encouraged to pursue this interesting research direction.] ←DECISION

Table 1: An example of an annotated meta-review. CATEGORY indicates the category of each sentence.
To some extent, the existing task settings are inadequate because they do not build on a deep understanding of the domains they work on, i.e., domain knowledge. Taking text summarization as an example, the most well-experimented dataset, CNN/Daily Mail (Nallapati et al., 2016), is composed of training pairs of news content and news titles. However, it does not tell why a particular piece of news content should have that corresponding title: for the same earnings report, why does one media outlet emphasize its new business success in the title, while another emphasizes its net income? Obviously, there is no standard answer regarding right or wrong. For such cases, if we can specify a control signal, e.g., "emphasizing new business", the generated text would make more sense to users of the text generator.
To allow controlling not only the intent of a single generated sentence but also the whole structure of a generated passage, we prepare a new dataset MReD (short for Meta-Review Dataset) with in-depth understanding of the structure of meta-reviews in a peer-reviewing system, namely the open review system of ICLR. MReD for the first time allows a generator to be trained by simultaneously taking the text (i.e. reviews) and the structure control signal as input to generate a meta-review which is not only derivable from the reviews but also complies with the control intent. Thus from the same input text, the trained generator can generate varied outputs according to the given control signals. For example, if the area chair is inclined to accept a borderline paper, he or she may invoke our generator with a structure of "abstract | strength | decision" to generate a meta-review, or may use a structure of "abstract | weakness | suggestion" otherwise. Note that for ease of preparation and explanation, we ground our dataset in the peer review domain. However, the data preparation methodology and proposed models are transferable to other domains, which is indeed what we hope to motivate with this effort.
Specifically, we collect 7,089 meta-reviews of ICLR in recent years (2018 -2021) and fully annotate the dataset. Each sentence in a meta-review is classified into one of the 9 pre-defined intent categories: abstract, strength, weakness, rating summary, area chair (AC) disagreement, rebuttal process, suggestion, decision, and miscellaneous (misc). Table 1 shows an annotated example, where each sentence is classified into a single category that best describes the intent of this sentence. Our MReD is obviously different from the previous text generation/summarization datasets because, given the rich annotations of individual meta-review sentences, a model is allowed to learn more sophisticated generation behaviors to control the structure of the generated passage. Our proposed task is also noticeably different from the existing controllable text generation tasks (e.g., text style transfer on sentiment polarity (Shen et al., 2017;Liao et al., 2018) and formality (Shang et al., 2019)) because we focus on controlling the macro structure of the whole passage, rather than the wordings.
To summarize, our contributions are as follows. (1) We introduce a fully-annotated meta-review dataset to make better use of the domain knowledge for text generation. With thorough data analysis, we derive useful insights into the domain characteristics.
(2) We propose a new task of controllable generation focusing on controlling the passage macro structures. It offers stronger generation flexibility and applicability for practical use cases.
(3) We design simple yet effective control methods that are independent of the model architecture. We show the effectiveness of enforcing different generation structures with a detailed model analysis.

MReD: Meta-Review Dataset
In this paper, we explore a new task, named structure-controllable text generation, in a new domain, namely meta-reviews in the peer-reviewing system. Unlike previous datasets that mainly focus on domains like news, the meta-review domain is worth studying because it contains essential and high-density opinions. Specifically, during the peer review process of scientific papers, a senior reviewer or area chair recommends a decision and manually writes a meta-review to summarize the opinions from the different reviews written by the reviewers. We first introduce the data collection process and then describe the annotation details, followed by dataset analysis.

Data Collection
We collect the meta-review related data of ICLR from an online peer-reviewing platform, OpenReview, from 2018 to 2021. Note that the submissions from earlier years are not collected because their meta-reviews are not released. To prepare our dataset for controllable text generation, for each submission, we collect all of its corresponding official reviews with reviewer ratings and confidence scores, the final meta-review decision, and the meta-review passage. Table 2 shows the statistics of data collected from each year. Initially, 7,894 submissions are collected. After filtering, 7,089 meta-reviews are retained with their corresponding 23,675 reviews. Note that even without any further annotation, the dataset can already naturally serve the purpose of multi-document summarization (MDS). Compared with those conventional datasets for MDS, such as TAC (Owczarzak and Dang, 2011) and DUC (Over and Yen, 2004), which contain in total a few hundred input articles (equivalent to reviews in MReD), our dataset is more than 10 times larger.

Year   #Submissions  #withReviews  #Meta-Reviews
2018      994           942           892
2019    1,689         1,639         1,412
2020    2,595         2,517         2,169
2021    2,616         2,616         2,616
Total   7,894         7,714         7,089

Table 2: Statistics of the data collected from each year.

Data Annotation
As aforementioned, the structure-controllable text generation aims at controlling the structure of the generated passage. Therefore, we need to comprehensively understand the structures of meta-reviews so as to enable a model to learn how to generate outputs complying with certain structures. Specifically, based on the nature of meta-reviews, we pre-define 9 intent categories: abstract, strength, weakness, suggestion, rebuttal process, rating summary, area chair (AC) disagreement, decision, and miscellaneous (misc). Table 3 shows the definition for each category (see example sentences in Appendix A.1). The identification of the category for some sentences is fairly straightforward, while other sentences are relatively ambiguous. Therefore, besides following the definition of each category, the annotators are also required to follow additional rules as elaborated in Appendix A.2. To conduct the annotation work, 14 professional data annotators from a data company are initially trained, and 12 of them are selected for the task according to their annotation quality during a trial round. These 12 annotators are fully paid for their work. Each meta-review sentence is independently labeled by 2 different annotators, and a third expert annotator resolves any disagreement between the first two. We label 45,929 sentences from 7,089 meta-reviews in total, and the Cohen's kappa between the first two annotators is 0.778, indicating that the annotation is of quite high quality.
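The inter-annotator agreement reported above can be computed directly from the two annotators' label sequences. The sketch below is a minimal pure-Python implementation of Cohen's kappa for illustration; it is not the annotation tooling actually used, and the label values are hypothetical examples:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' category labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # observed agreement
    po = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # expected agreement by chance, from each annotator's label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    pe = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (po - pe) / (1 - pe)

a = ["abstract", "strength", "weakness", "decision"]
b = ["abstract", "strength", "weakness", "weakness"]
kappa = cohen_kappa(a, b)  # agreement on 3 of 4 sentences
```

A kappa of 0.778, as obtained on MReD, is conventionally read as substantial agreement.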

Data Analysis
To better understand the MReD dataset, we conduct the following analysis along different dimensions.
Sentence distribution across categories. The number of sentences in each category is shown in Figure 1, broken down by decision (i.e., accept or reject). Among the 7,089 submissions, 2,368 are accepted and 4,721 are rejected. Across all submissions and the rejected submissions, "weakness" accounts for the largest proportion, while across the accepted ones, "abstract" and "strength" take up a large proportion. To some extent, these three categories, which dominate in meta-reviews, could be easily summarized from the reviewers' comments. However, some minor or subjective categories (e.g., "ac disagreement") are hard to generate.
Breakdown analysis by meta-review lengths and average rating scores. We present the percentage of meta-reviews of different lengths in each score range, as shown in Figure 2. For example, among the meta-reviews that receive an average reviewer score below 2 (i.e., the first column in the figure), 28% are less than or equal to 50 words, and 38% fall in the length range of 51 to 100 words. We can observe that meta-reviews tend to be longer for submissions receiving scores in the middle range, and shorter for those with lower or higher scores. This coincides with common sense: for high-score and low-score submissions, the decision tends to be a clear accept or reject, so meta-reviews can be relatively short, while for borderline submissions, area chairs have to carefully weigh the pros and cons to make the final decision (see Appendix B.1 for borderline submission analysis). As shown in Figure 3, meta-reviews with more than 150 words generally have a larger proportion of sentences describing "weakness" and "suggestion" for authors to improve their submissions. Additional analysis of the category breakdown for accepted and rejected papers across the score ranges is shown in Appendix B.2.
Meta-review patterns. To study the common structures of meta-reviews, we present the transition matrix of different category segments in Figure 4, where the sum of each row is 1. Note that each segment represents the longest consecutive run of sentences with the same category. We add "<start>" and "<end>" tokens before and after each meta-review accordingly to investigate which categories tend to appear at the start/end of the meta-reviews. It is clear that "abstract" usually appears at the beginning of the meta-review, while "suggestion" and "decision" usually appear at the end. There are also some clear patterns in the meta-reviews, such as "abstract | strength | weakness", "rating summary | weakness | rebuttal process", and "abstract | weakness | decision".
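The transition statistics behind such a matrix can be computed directly from the annotated label sequences. The following sketch (our own illustration, not the paper's code) collapses consecutive identical labels into segments, adds the "<start>" and "<end>" tokens, and row-normalizes the counts:

```python
from collections import defaultdict
from itertools import groupby

def transition_matrix(label_seqs):
    """Segment-level category transition probabilities over many meta-reviews."""
    counts = defaultdict(lambda: defaultdict(float))
    for labels in label_seqs:
        # collapse runs of identical sentence labels into one segment each
        segments = [k for k, _ in groupby(labels)]
        path = ["<start>"] + segments + ["<end>"]
        for prev, nxt in zip(path, path[1:]):
            counts[prev][nxt] += 1
    # normalize each row so it sums to 1
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        matrix[prev] = {nxt: c / total for nxt, c in row.items()}
    return matrix

seqs = [
    ["abstract", "abstract", "strength", "decision"],
    ["abstract", "weakness", "decision"],
]
m = transition_matrix(seqs)
```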
Structure-Controllable Text Generation

Task Definition
As aforementioned, in uncontrolled generation, users cannot instruct the model to emphasize desired aspects. However, in a domain such as meta-reviews, given the same review inputs, one AC may emphasize more on the "strength" of the paper following a structure of "abstract | strength | decision", whereas another AC may prefer a different structure with more focus on reviewers' opinions and suggestions (i.e., "rating summary" and "suggestion"). To achieve such flexibility, the task of structure-controllable text generation is defined as: given the text input (i.e., reviews) and a control sequence of the output structure, a model should generate a meta-review that is derivable from the reviews and presents the required structure.

[Figure 3: Sentence-level category distribution percentage breakdown by different lengths of meta-reviews.]

Explored Methods
As recent generation works (Vaswani et al., 2017; Liu and Lapata, 2019; Xing et al., 2020) generally adopt an encoder-decoder based architecture and achieve state-of-the-art performance on many tasks and datasets, we primarily investigate the performance of this framework on our task. Thus, in this subsection, we mainly present how to re-organize the input reviews and the control structure into a single input sequence for the encoder. We also explore other baselines in the experiments later.
In order to summarize multiple reviews into a meta-review exhibiting a required structure, we explicitly specify the control label sequence that a model should comply with during generation. Specifically, we add the control sequence in front of the input text. By directly combining both the control and textual information into a single input, our control method is independent of any specially designed encoder and decoder structures. Moreover, by placing the short control sequence in front, an encoder can observe the control signal at the very beginning, thus avoiding possible interference from the subsequent sequence. In addition, the control sequence in front is never truncated when the encoder truncates the input to a certain length limit.

Given the multiple review inputs, we need to linearize them into a single input. One simple method, concat, is to concatenate all inputs one after another (Fabbri et al., 2019). Besides the text inputs, the review rating, which cannot be found in the review passages but exists in the rating score field, is also crucial information for writing meta-reviews. Therefore, we create a rating sentence that consists of the extracted ratings given by the corresponding reviewers and prepend it to our concatenated review texts to obtain the final input. We name this method rate-concat (see Table 4, upper). We also explore an alternative method, merge, as follows: from all review inputs, we use the longest one as a backbone. We segment all reviews' content on a paragraph level, and encode them using SentenceTransformers (Reimers and Gurevych, 2019). Then, for each paragraph embedding in the non-backbone reviews, we calculate a cosine similarity score with each backbone paragraph embedding. We then insert each non-backbone paragraph after the backbone paragraph with which it has the highest similarity score. We repeat the process for all paragraphs in non-backbone reviews to obtain a single passage.
We further add rating sentences in front of the results of merge to obtain rate-merge.
Additionally, we provide a longest-review baseline, which does not combine reviews but only uses the longest review as the input.
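As a rough illustration of the merge strategy described above, the sketch below uses the longest review as the backbone and inserts each remaining paragraph after its most similar backbone paragraph. For self-containment, a simple bag-of-words `embed` stand-in replaces the SentenceTransformers encoder used in the paper:

```python
from collections import Counter

def cosine(a, b):
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def embed(text):
    # stand-in for a SentenceTransformers paragraph embedding
    return Counter(text.lower().split())

def merge_reviews(reviews):
    """Merge multiple reviews into one passage around the longest backbone."""
    backbone = max(reviews, key=len)
    bb_paras = backbone.split("\n\n")
    bb_vecs = [embed(p) for p in bb_paras]
    attached = {i: [] for i in range(len(bb_paras))}
    for review in reviews:
        if review is backbone:
            continue
        for para in review.split("\n\n"):
            v = embed(para)
            # attach after the most similar backbone paragraph
            best = max(range(len(bb_paras)), key=lambda i: cosine(v, bb_vecs[i]))
            attached[best].append(para)
    out = []
    for i, p in enumerate(bb_paras):
        out.append(p)
        out.extend(attached[i])
    return "\n\n".join(out)

reviews = [
    "This paper proposes a new method.\n\nThe experiments are limited.",
    "More experiments are needed.",
]
merged = merge_reviews(reviews)
```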
As aforementioned, we place the control sequence in front of the re-organized review information. Specifically, we explore two different control methods, namely sent-ctrl and seg-ctrl. Sent-ctrl uses one control label per target sentence and controls generation on the sentence level. Note that this method allows implicit control over the length (i.e., number of sentences) of the generation. Seg-ctrl treats consecutive sentences with the same label as one segment and uses only one label per segment. Example inputs of the different control settings are shown in Table 4 (lower). For instance, sent-ctrl repeats "abstract" in its control sequence whereas seg-ctrl does not, because seg-ctrl treats the 1st and 2nd target sentences of "abstract" as the same segment and uses a single label to indicate it in the sequence. Additionally, we provide a vanilla setting for uncontrolled generation, unctrl, where no control sequence is used.
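The two control settings can be illustrated with a small input-construction helper. The separator formatting below is hypothetical (the actual input format follows Table 4); the point is that a seg-ctrl sequence is obtained from a sent-ctrl sequence by collapsing consecutive identical labels:

```python
from itertools import groupby

def build_input(labels, rating_sentence, reviews, mode="sent-ctrl"):
    """Prepend a control sequence (and rating sentence) to concatenated reviews."""
    if mode == "seg-ctrl":
        # collapse runs of identical labels into single segment labels
        labels = [k for k, _ in groupby(labels)]
    control = " | ".join(labels)
    # "<sep>" is an illustrative separator, not necessarily the one used
    return control + " <sep> " + rating_sentence + " " + " ".join(reviews)

labels = ["abstract", "abstract", "strength", "decision"]
sent_input = build_input(labels, "Ratings: 6, 5, 7.", ["Review 1 text.", "Review 2 text."])
seg_input = build_input(labels, "Ratings: 6, 5, 7.", ["Review 1 text."], mode="seg-ctrl")
```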
Using the above input sequence as the source and the corresponding meta-review as the target, we can train an encoder-decoder model for controllable generation. Many transformer-based models have achieved state-of-the-art performance. Common abstractive summarization models include BART (Lewis et al., 2020), T5 (Raffel et al., 2020) and PEGASUS (Zhang et al., 2020). In this paper we focus on the bart-large-cnn model, one variant of the BART model (results on other pretrained models, which show a similar trend, can be found in Appendix C.1). More specifically, we use the PyTorch implementation in the open-source library Hugging Face Transformers (Wolf et al., 2020) and its hosted pretrained models.

Baselines
Extractive Baselines. We employ three common extractive summarization baselines, each of which basically provides a mechanism to rank the input sentences. LexRank (Erkan and Radev, 2004) represents sentences in a graph and uses eigenvector centrality to calculate sentence importance scores. TextRank (Mihalcea and Tarau, 2004) is another graph-based sentence ranking method that obtains vertex scores by running a "random-surfer model" until convergence. MMR (Carbonell and Goldstein, 1998) calculates sentence scores by balancing the redundancy score with the information relevance score. After ranking with each of the above models, we select sentences as output with different strategies according to the controlled and uncontrolled settings. For the uncontrolled setting, we simply select the top k sentences as the generated output, where k is a hyperparameter deciding the size of the generated output. For the controlled setting, we select only the top sentences with the right category labels according to the control sequence. To do so, we employ an LSTM-CRF (Lample et al., 2016) tagger trained on the labeled meta-reviews to predict the label of each input review sentence. Refer to Appendix C.2 for more details of the tagger.
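The controlled extractive selection step can be sketched as follows, assuming each input sentence already carries a category label predicted by the tagger and the sentences are listed in ranked order (the sentence strings and labels below are hypothetical):

```python
def controlled_extract(ranked, control):
    """Pick, for each control label in order, the best-ranked unused sentence
    whose predicted category matches that label.

    ranked:  list of (sentence, predicted_label), best-ranked first
    control: list of category labels (the control sequence)
    """
    used = set()
    output = []
    for wanted in control:
        for i, (sentence, label) in enumerate(ranked):
            if i in used or label != wanted:
                continue
            output.append(sentence)
            used.add(i)
            break  # at most one sentence per control label
    return output

ranked = [("S1", "weakness"), ("S2", "abstract"), ("S3", "decision"), ("S4", "weakness")]
control = ["abstract", "weakness", "weakness", "decision"]
result = controlled_extract(ranked, control)
```

A control label with no remaining matching sentence is simply skipped here; other fallback strategies are possible.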
Generic Sentence Baselines. Considering the nature of meta-reviews, we could imagine some categories may have common phrases inflating the Rouge scores, such as "This paper proposes ..." for abstract, and "I recommend acceptance." for decision, etc. To examine such impact, we select sentences that are generic in each category and combine these sentences to generate outputs according to the control sequences. For instance, if the control sequence is "abstract | strength | decision", we take the most generic sentences from the categories of "abstract", "strength" and "decision" respectively to form the output. Specifically, we create two generic sentence baselines by obtaining generic sentences from the training data from either the meta-review references (i.e., target) or the input reviews (i.e., source), namely "Target Generic" and "Source Generic". Moreover, we also study such impact on the high-score and low-score submissions respectively, since an AC may write more succinct meta-reviews for clear-cut papers, as suggested by Figure 2. See Appendix C.3 for more details and results on generic sentence baselines.

Experimental Setting
To conduct text generation experiments, we preprocess our MReD dataset by filtering to ensure that the selected meta-reviews have 20 to 400 words, as certain meta-review passages are extremely short or long. After preprocessing, we obtain 6,693 source-target pairs, which we randomly split into train, validation, and test sets by a ratio of 8:1:1. We evaluate our generated outputs against the reference meta-reviews using the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L.

For the extractive and generic baselines, a key hyperparameter is the sentence number k. Recall that under the sent-ctrl setting, the control sequence length is the same as the number of sentences in the target meta-review. Therefore, for a fair comparison, we set the hyperparameter k equal to the number of labels in the control sequence for both controlled and uncontrolled extractive baselines, and sent-ctrl is used for all controlled extractive baselines. We also adopt the same k for the generic baselines.
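The preprocessing described above amounts to a length filter followed by a random 8:1:1 split. A minimal sketch (the pair representation and seed are illustrative, not the exact script used):

```python
import random

def filter_and_split(pairs, seed=0):
    """Keep (source, target) pairs whose target has 20-400 words, then split 8:1:1."""
    kept = [(src, tgt) for src, tgt in pairs if 20 <= len(tgt.split()) <= 400]
    random.Random(seed).shuffle(kept)
    n = len(kept)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train = kept[:n_train]
    val = kept[n_train:n_train + n_val]
    test = kept[n_train + n_val:]
    return train, val, test

# 10 in-range pairs plus one too-short and one too-long target
pairs = [("reviews %d" % i, " ".join(["w"] * 30)) for i in range(10)]
pairs += [("a", "too short"), ("b", " ".join(["w"] * 500))]
train, val, test = filter_and_split(pairs)
```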
For bart-large-cnn, we first load the pretrained model and then fine-tune it on MReD. All experiments are conducted on a single V100 GPU, using a batch size of 1 in order to fit the large pretrained model on a single GPU. During fine-tuning, we set the hyperparameters "minimum_target_length" to 20 and "maximum_target_length" to 400, according to our filter range on the meta-review lengths. Due to long inputs (see Table 17), we experiment with different source truncation lengths of 1024, 2048, and 3072 tokens. We cannot explore truncation lengths of more than 3072 tokens due to the limitation of GPU memory. Our learning rate is 5e-5, and we use the Adam optimizer with momentum β1 = 0.9, β2 = 0.999, without any warm-up steps or weight decay. We set the seed to 0, and train the model for 3 epochs with a gradient accumulation step of 1. For decoding, we use a beam size of 4 and a length penalty of 2.

Main Results
We show results in Table 5. Only the best settings of rate-concat (Section 4.4) and input truncation of 2048 tokens (Appendix C.4) for bart-large-cnn are included. Amongst the extractive baselines, TextRank performs the best in both unctrl and sent-ctrl settings. Nevertheless, all controlled methods outperform their unctrl counterparts (the same holds for the Transformers). This validates our intuition that structure-controlled generation is more suitable for user-subjective writings such as meta-reviews, because the model can better satisfy different structure requirements when supplied with the corresponding control sequences. On the other hand, for bart-large-cnn, sent-ctrl is the best, followed by seg-ctrl. This is most likely due to the former's more fine-grained sentence-level control, which provides a clearer structure outline than the coarser segment-level control.
Moreover, bart-large-cnn far outperforms the extractive baselines, showing that extraction-based methods are insufficient for MReD. This also suggests that meta-review writing differs from the input reviews, so copying full review sentences to form meta-reviews does not work well. This is further validated by the "Target Generic" baseline's consistent improvement over the "Source Generic" baseline, which shows that generic sentences from meta-reviews suit generation better than those from reviews. Nevertheless, all Transformers results are still much better than the "Target Generic" sentence baseline, showing that despite generic phrases in some categories contributing to Rouge, the Transformers model is capable of capturing content-specific information for each input.

Review Combination Results
We also show uncontrolled generation results for different review combination methods in Table 6, with a source truncation of 2048 tokens. The longest-review setting has the worst performance, validating that the review combination methods are necessary so as not to omit important information. Rate-concat has the best overall performance, and is the setting used for the main results. Nevertheless, it is not significantly better than merge. It is also interesting to see that for merge, providing additional rating information (rate-merge) slightly worsens the performance. We leave the investigation of better review combination methods for future work.

Case Study
We study some cases for a better understanding of the structure-controllable generation.
Identify the control label for each sentence.
We first evaluate whether the model is able to attend to the correct control label during generation. For each generation step, we obtain the cross attention weights from the decoder's output tokens towards the control labels, and plot them in Figure 5. The given control sequence is "abstract | weakness | decision". When generating each sentence, we can see that the attention weights of the corresponding control token are the highest, which demonstrates that our model can effectively pay attention to the correct control label and thus generate content complying with the intent.

Extract information from the input sentences.
To understand what information the model attends to when generating each sentence, we aggregate the cross attention weights to obtain the attention scores from each generated sentence towards all input sentences (Appendix C.5). Then, we select the top 3 input sentences with the highest attention scores for each generated sentence, and visualize the normalized attention weights on all tokens in the selected sentences and the control sequence in Table 8. As shown, the model can correctly extract relevant information from the source sentences. For example, it identifies important phrases such as "interesting", "clarity" and "lack of comparison to baselines" when generating "Sent 2".

Sent 2 (weakness): The reviewers agree that the idea is interesting, but have concerns about the clarity of the paper and the lack of comparison to the baselines.
Sent 3 (decision): The paper is not suitable for publication at ICLR in its current form.
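The sentence-level attention scores used in this analysis can be sketched as follows, assuming `attn` holds per-token cross-attention weights and token-index spans delimit the generated and source sentences (a simplification of the actual aggregation procedure described in Appendix C.5):

```python
def sentence_attention(attn, gen_spans, src_spans):
    """Aggregate token-level cross-attention into sentence-to-sentence scores.

    attn[t][s]: attention weight from generated token t to source token s
    gen_spans, src_spans: (start, end) token-index spans per sentence
    """
    scores = []
    for g0, g1 in gen_spans:
        row = []
        for s0, s1 in src_spans:
            total = sum(attn[t][s] for t in range(g0, g1) for s in range(s0, s1))
            row.append(total / (g1 - g0))  # average over the generated tokens
        scores.append(row)
    return scores

def top_k_sources(row, k=3):
    """Indices of the k source sentences with the highest scores."""
    return sorted(range(len(row)), key=lambda i: -row[i])[:k]

# toy example: 2 generated tokens, 3 single-token source sentences
attn = [[0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1]]
scores = sentence_attention(attn, [(0, 2)], [(0, 1), (1, 2), (2, 3)])
```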
Generate varied outputs given different control sequences. To further investigate the effectiveness of the control sequence, we change the control sequence of the above example and re-generate the meta-reviews given the same input reviews. In Table 7, we first show the gold meta-review and the model output using the original control sequence in Row 0 and Row 1, and then show the model outputs with alternative control sequences in Row 2 and Row 3. From the outputs, we can see that each generated sentence indeed corresponds to its control label well. In Row 2, we add an additional "abstract" label to the sequence, and by repeating this label, the generator elaborates further details of the studied method. This is one key advantage of sent-ctrl over seg-ctrl: it allows control over the length and level of detail of the generation. In Row 3, a very comprehensive control sequence is specified, and the output meta-review rejects the borderline paper in a fluent and polite manner. See Appendix C.6 for more examples.

Human Evaluation
In addition to the Rouge evaluation, we ask 3 human judges to manually assess the generation quality of the bart-large-cnn model trained under the different control methods from Table 5 on 100 random test instances. For each test instance, we provide the judges with the input reviews and randomly ordered generations from different models, and ask them to individually evaluate the generations based on the following criteria: (1) Fluency: is the generation fluent, grammatical, and without unnecessary repetitions? (2) Content Relevance: does the generation reflect the review content well, or does it produce general but trivial sentences? (3) Structure Similarity: how closely does the generation structure resemble the gold structure (i.e., the control sequence)? (4) Decision Correctness: does the generation correctly predict the gold human decision?

Table 9: Human evaluation. * indicates that the ratings of the corresponding models significantly (by Welch's t-test) outperform unctrl: p < 0.01 for decision correctness, p < 0.0001 for fluency and structure similarity.
We grade fluency and content relevance on a scale of 1 to 5, whereas structure similarity and decision correctness are calculated from 0 to 1 (Appendix C.7). For structure similarity, because sent-ctrl and seg-ctrl have different control sequences, we evaluate the two models on sentence-level (sent) and segment-level (seg) structures respectively, and provide both evaluations for unctrl. As shown in Table 9, both sent-ctrl and seg-ctrl show significant improvements in generation structure over the uncontrolled baseline, which affirms the effectiveness of our proposed methods for structure-controllable generation. Sent-ctrl also has better fluency and decision correctness, suggesting that a better output structure can benefit readability and decision generation. For content relevance, the scores of all methods are reasonably good, and significance tests cannot identify a best model (p > 0.08). Nevertheless, it is possible that the looser the control a method applies, the better the relevance score it achieves, because tighter control narrows the content that a model can use from the reviews.
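The significance test referenced above is Welch's t-test, which does not assume equal variances between the two rating samples. The sketch below computes the statistic and its degrees of freedom; p-values would then come from the t distribution (e.g., via scipy.stats), which is omitted here:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a), variance(b)  # sample variances (n-1 denominator)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / se2 ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4], [1, 2, 3, 4])  # identical samples
```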

Related Work
To facilitate the study of text summarization, earlier datasets are mostly in the news domain with relatively short input passages, such as NYT (Sandhaus, 2008) (2018), and Fisas et al. (2016). In this paper, we explore text summarization in a new domain (i.e., the peer review domain) and provide a new dataset, MReD. Moreover, MReD's reference summaries (i.e., meta-reviews) are fully annotated and thus allow us to propose a new task, namely structure-controllable text generation.
Researchers have recently explored peer review domain data for a few tasks, such as PeerRead (Kang et al., 2018) for paper decision prediction, AMPERE (Hua et al., 2019) for proposition classification in reviews, and RR (Cheng et al., 2020) for paired-argument extraction from review-rebuttal pairs. Additionally, a meta-review dataset without any annotation is introduced by Bhatia et al. (2020). Our work is the first fully-annotated dataset in this domain for the structure-controllable generation task. There are also some datasets and annotation schemes for research articles (Teufel et al., 1999; Liakata et al., 2010; Lauscher et al., 2018), which differ in nature from the peer review domain and cannot be easily transferred to our task.
A wide range of control perspectives has been explored in controllable generation, including style control (e.g., sentiment). Our structure-controlled generation differs from these works in that we control the high-level output structure, rather than specific styles or surface details such as which keywords to include in the generated output. Our task also differs from content planning (Reiter and Dale, 1997; Shao et al., 2019; Hua and Wang, 2019), which involves explicitly selecting and arranging the input content. Instead, we provide the model with high-level control labels, and let the model decide on its own the relevant styles and contents.

Conclusions
This paper introduces a fully-annotated text generation dataset MReD in a new domain, i.e., the meta-reviews in the peer review system, and provides thorough data analysis to better understand the data characteristics. With such rich annotations, we propose simple yet effective methods for structure-controllable text generation. Extensive experimental results are presented as baselines for future study and thorough result analysis is conducted to shed light on the control mechanisms.

Ethical Concerns
We have obtained approval from ICLR organizers to use the data collected from ICLR 2018-2021 on OpenReview.
Categories        Examples
abstract          "The paper presents/explores/describes/addresses/proposes ..."
strength          "The reviewers found the paper interesting." / "The method and justification are clear." / "The quantitative results are promising."
weakness          "The paper is somewhat incremental ..." / "... claims are confusing" / "The main concern is ..." / "... unfair experimental comparisons ..."
rating summary    "R1 recommends Accept." / "All four reviewers ultimately recommended acceptance." / "Reviews were somewhat mixed, but also with mixed confidence scores."
ac disagreement   "The area chair considers the remaining concerns by Reviewer 3 as invalid." / "I do not agree with the criticism about ..." / "I disagree with the second point ..."
rebuttal process  "The authors have made various improvements to the paper" / "... remained after the author rebuttal ..." / "Authors provided convincing feedbacks on this key point."
suggestion        "... more analysis ..." / "The authors are advised to take into account the issues about ..."
decision          "The paper is recommended as a poster presentation." / "AC recommends Reject." / "I recommend rejection."
miscellaneous     "Thank you for submitting you paper to ICLR." / "I've summarized the pros and cons of the reviews below."
Table 10: Category examples of meta-review sentences.

A.1 Category definitions
We show category examples in Table 10.

A.2 Additional annotation rules
The additional rules for annotation are as follows. First, instead of labeling individual sentences in isolation, the annotators are given the complete meta-review paragraph so that each sentence is labeled with context information. For example, if the area chair writes a sentence providing some extra background knowledge while discussing the weakness of the submission, even though that sentence by itself could be considered "miscellaneous", it should still be labeled as "weakness" to stay consistent with its context. Second, not every sentence can be strictly classified into a single category. When a sentence contains information from multiple categories, the annotators should consider its main point and primary purpose. One example is: "Although the paper discusses an interesting topic and contains potentially interesting idea, its novelty is limited." Although the first half of the sentence discusses the strength of the submission, the primary purpose of this sentence is to point out its weakness, and therefore it should be labeled as "weakness".
Furthermore, there are still some cases where the main point of the sentence is hard to attribute to a single category. We then define a priority order over the 9 categories, in which strength ?= weakness > ac disagreement > rebuttal process > abstract > suggestion > miscellaneous. We use the sign "?=" because there are some rare cases where a sentence contains both "strength" and "weakness" with no obvious emphasis on either, and it is hard to tell whether "strength" should have priority over "weakness" or the other way round. We then label such a sentence based on the final decision: if the submission is accepted, we label the sentence as "strength", and vice versa.

Category          Accept   Reject
abstract          23.8%    18.1%
strength          18.1%     9.3%
weakness          13.5%    34.3%
rating summary     6.3%     4.1%
ac disagreement    2.2%     0.5%
rebuttal process  13.2%    11.0%
suggestion         7.7%     8.2%
decision           9.2%     8.1%
miscellaneous      6.2%     6.4%
Table 11: Category distribution for accepted and rejected borderline submissions.
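These tie-breaking rules can be summarized in a small sketch. The function and variable names are ours, and the priority list reproduces only the portion of the order stated above; treating "strength" and "weakness" as outranking the listed categories follows the "strength ?= weakness > ..." ordering:

```python
# Partial priority order as stated in the annotation rules (highest
# first); the special-cased "strength"/"weakness" pair outranks these.
PRIORITY = ["ac disagreement", "rebuttal process",
            "abstract", "suggestion", "miscellaneous"]

def resolve_label(candidate_labels, paper_accepted):
    """Pick a single label for a sentence that touches several
    categories. Rare special case: when both "strength" and
    "weakness" apply with no clear emphasis, fall back to the
    submission's final decision.
    """
    cands = set(candidate_labels)
    if {"strength", "weakness"} <= cands:
        return "strength" if paper_accepted else "weakness"
    if "weakness" in cands:
        return "weakness"
    if "strength" in cands:
        return "strength"
    for label in PRIORITY:
        if label in cands:
            return label
    return candidate_labels[0]
```

For instance, a sentence mixing praise and criticism in an accepted paper resolves to "strength", while the same sentence in a rejected paper resolves to "weakness".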

B.1 Borderline papers
We further analyze the category distribution in borderline papers. As shown in Table 11, for submissions within the score range of [4.5, 6), there are 713 accepted submissions and 2,588 rejected submissions. One clear difference is the percentage of "strength" and "weakness". Another difference is the percentage of "ac disagreement", where the accepted papers have more than four times the percentage of the rejected ones. This suggests that for accepted borderline papers, the area chair tends to hold opinions different from the reviewers', and thus decides to accept the borderline submissions.

B.2 Percentage of each category for accepted and rejected papers across score ranges
We further analyze the occurrence of each category for accepted and rejected papers separately across different score ranges, as shown in Table 12. For accepted papers, as the score increases, the percentage of meta-reviews containing "weakness" and "suggestion" drops, because the high-score submissions are more likely to be accepted. Even the percentage of "decision" drops following the same trend. In addition, the proportion of meta-reviews containing "rebuttal process" is larger for submissions with lower scores. This suggests that the rebuttal process plays an important role in the peer review process, especially in helping borderline papers get accepted. On the other hand, for rejected papers, the percentage of meta-reviews containing "strength" increases as the average score increases. This coincides with our common sense that submissions receiving higher scores tend to have more strengths. One interesting finding here is that the percentage of "weakness" and "suggestion" also increases as the average rating score increases. This may be due to two main reasons. First, to reject a submission with higher scores, the area chair has to explain the weaknesses in more detail and provide more suggestions for the authors to further improve their submissions. Second, compared to the percentage of "strength", "weakness" consistently has a larger percentage within every range of rating scores. Intuitively, the gap between the percentages of "strength" and "weakness" differs between the accepted and the rejected papers.

C.1 Additional Transformer models
We provide baselines of uncontrolled and controlled generation on MReD using other common pretrained Transformer models in Table 13. Note that due to limited GPU memory, we cannot fit 2048 input tokens for T5. Thus, for a fair comparison, all results shown use source truncation at 1024 tokens.

C.2 Tagger for source sentences
To obtain labels on the source input, we train a tagger on the human-annotated meta-reviews, then use it to predict labels for the input review sentences. Specifically, we define the task as a sequence labeling problem and apply models such as the long short-term memory (LSTM) network and pretrained Transformer encoders. We select the best model parameters based on the best micro F1 score on the development set and apply the model to the test set for evaluation. All models are run on single V100 GPUs. We use Adam (Kingma and Ba, 2014) with an initial learning rate of 2e-5.
We report the F1 scores for each category as well as the overall micro F1 and macro F1 scores in Table 14. Micro F1 is the overall accuracy regardless of category, whereas macro F1 is the average of the per-category F1 scores. Since some of the category labels (e.g., "ac disagreement") are very rare, their classification accuracy is low. Overall, micro F1 is the more important metric since it reflects general performance. The results demonstrate that the majority of the categories have their own characteristics that distinguish them from other categories. RoBERTa-base is the best performing model, so we use it to predict review sentence labels.
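To make the difference between the two metrics concrete, the following sketch computes both on toy labels (hypothetical labels, not our tagger's actual outputs). In single-label multi-class tagging, micro F1 reduces to overall accuracy, while macro F1 averages the per-category F1 scores, so rare categories weigh equally:

```python
def micro_macro_f1(gold, pred):
    """Compute micro and macro F1 for single-label multi-class tagging."""
    labels = sorted(set(gold) | set(pred))
    per_class = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append(f1)
    # Micro F1 pools all decisions: here it equals overall accuracy.
    micro = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    # Macro F1 averages per-category F1 scores.
    macro = sum(per_class) / len(per_class)
    return micro, macro

# Toy labels: the frequent category dominates micro F1, while the
# poorly-predicted rare category drags macro F1 down.
gold = ["weakness"] * 8 + ["ac_disagreement"] * 2
pred = ["weakness"] * 8 + ["weakness", "ac_disagreement"]
micro, macro = micro_macro_f1(gold, pred)  # micro 0.90, macro ~0.80
```

This mirrors why our rare categories hurt macro F1 far more than micro F1.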

C.3 Generic sentence baselines
Besides the baselines of "Source Generic" and "Target Generic", we explore subsets of papers with high scores (average reviewers' rating ≥ 7) or low scores (average reviewers' rating ≤ 3) to obtain 4 additional generic baselines: "Source High Score", "Source Low Score", "Target High Score", and "Target Low Score". We use "Target High Score" as an example to explain how we obtain the generic sentences: from the training subset of high-score papers, we first separate all meta-review sentences into the corresponding label categories, obtaining a total of 9 groups of sentences. Then, we rank the sentences in each group using TextRank (our best extractive model). Since TextRank ranks the input sentences based on each sentence's content connection with the others, sentences with higher rankings are also more generic in the sense that they share more content with the others.
After obtaining the generic sentence sets, we can create baseline generations using the sent-ctrl sequence on the corresponding high-score paper test data. We avoid using the same sentence twice within the same generation: if the same label appears multiple times in a control sequence, we take multiple generic sentences for that category down the ranking order.
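The assembly step above can be sketched as follows, assuming the per-category TextRank rankings have already been computed (the category names and sentences below are illustrative only):

```python
def generic_baseline(control_seq, ranked_by_category):
    """Assemble a generic-sentence baseline for one control sequence.

    `ranked_by_category` maps each category label to its sentences
    ranked by TextRank (most "generic" first). If a label repeats in
    the control sequence, we walk down the ranking so that no
    sentence is reused within the same generation.
    """
    next_rank = {label: 0 for label in ranked_by_category}
    output = []
    for label in control_seq:
        sentences = ranked_by_category[label]
        idx = next_rank[label]
        if idx < len(sentences):
            output.append(sentences[idx])
            next_rank[label] = idx + 1
    return " ".join(output)

# Hypothetical pre-ranked generic sentences (illustrative only).
ranked = {
    "strength": ["The reviewers found the paper interesting.",
                 "The results are promising."],
    "weakness": ["The main concern is novelty."],
    "decision": ["AC recommends acceptance."],
}
baseline = generic_baseline(["strength", "strength", "decision"], ranked)
```

Here the repeated "strength" label consumes the top two ranked sentences of that category in order.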
All generic sentence baselines can be obtained by a procedure similar to the above, and we show results in Table 15. Both "Target High Score" and "Target Low Score" perform much better than the "Target Generic" baseline, suggesting that papers with very high or low scores tend to have more typical patterns in their meta-reviews. Nevertheless, the pattern is less evident in the source (reviews) baselines.

C.4 Ablation on truncation length
By default, the Transformer models truncate the source to 1024 tokens. We further investigate the performance of different source truncation lengths under the rate-concat setting. As shown in Table 18, truncating the source to 2048 tokens consistently achieves the best performance.

C.5 Attention aggregation method
During generation, we can obtain the attention weights of each output token towards all input tokens. Specifically, we average all decoder layers' cross attention weights for the same output token generated at each decoding step. We then calculate an attention value for that output token on each input sentence, by aggregating the token's attention weights on the list of input tokens that belong to the same sentence by max pooling. Finally, we can calculate an output-sentence-to-input-sentence attention score, by adding up these attention values for the output tokens that belong to the same sentence.
Common attention aggregation methods include summation, average-pooling, and max-pooling. We use max-pooling to aggregate attention over same-sentence input tokens, because summation unfairly gives high attention scores to excessively long sentences due to attention weight accumulation, whereas average-pooling disfavors long sentences containing only a few relevant phrases by averaging the weights out. With max-pooling, we can correctly identify sentences with spiked attention at important phrases, regardless of sentence length. For attention aggregation over same-sentence output tokens, summation is used, and can be viewed as allowing each output token to vote an attention score on all input sentences, so that the input sentence receiving the highest total score is the most relevant. We conduct trial runs of all aggregation methods on input tokens (with summation for output-token aggregation) on multiple generation examples, and indeed max-pooling outperforms the other two by identifying input sentences more relevant to the generated sentence. Once we have the attention scores, we can attribute the generation of each output sentence to a few topmost relevant input sentences. Then, we can draw a color map of the input tokens in the selected sentences based on their relative attention weights.
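A minimal NumPy sketch of this two-step aggregation (max-pool over same-sentence input tokens, then sum over same-sentence output tokens), assuming the token-to-sentence mappings are given:

```python
import numpy as np

def sentence_attention(attn, in_sent_ids, out_sent_ids):
    """Aggregate token-level cross attention into an
    output-sentence-to-input-sentence score matrix.

    attn:         [n_out_tokens, n_in_tokens] cross-attention weights
                  (already averaged over decoder layers).
    in_sent_ids:  input-sentence index of each input token.
    out_sent_ids: output-sentence index of each output token.
    """
    n_in = max(in_sent_ids) + 1
    n_out = max(out_sent_ids) + 1
    # Step 1: max-pool attention over same-sentence input tokens,
    # picking up spiked attention regardless of sentence length.
    per_token = np.zeros((attn.shape[0], n_in))
    for s in range(n_in):
        cols = [i for i, sid in enumerate(in_sent_ids) if sid == s]
        per_token[:, s] = attn[:, cols].max(axis=1)
    # Step 2: sum over same-sentence output tokens, letting each
    # output token "vote" for its most relevant input sentences.
    scores = np.zeros((n_out, n_in))
    for s in range(n_out):
        rows = [i for i, sid in enumerate(out_sent_ids) if sid == s]
        scores[s] = per_token[rows].sum(axis=0)
    return scores
```

Ranking each row of the returned matrix then attributes each output sentence to its topmost relevant input sentences.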

C.6 Structure-controlled generation examples
We show examples of the generation results using alternative control sequences on another submission in Table 16. We can see the effectiveness of controlling the output structure using our proposed method.

C.7 Human evaluation
For structure similarity, we instruct the judges to label each generated sentence with the closest category. We then calculate the normalized token-level edit distance between the judge-annotated label sequence and the given control sequence, where each label is treated as a single token, and finally subtract this value from 1. For decision correctness, we evaluate on a binary scale, where 1 indicates complete correctness and 0 otherwise. Specifically, we give 0 if the generation produces contradictory decisions or a wrong decision, or if the generation does not show enough hints for rejection or acceptance.
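As an illustration, the sketch below computes the metric as 1 minus the Levenshtein distance over label tokens; normalizing by the longer sequence's length is our assumption of how the normalization is done:

```python
def structure_similarity(annotated, control):
    """Structure similarity between a judge-annotated label sequence
    and the given control sequence: 1 minus the normalized
    token-level (Levenshtein) edit distance, treating each category
    label as a single token.
    """
    m, n = len(annotated), len(control)
    # Standard dynamic-programming edit distance over label tokens.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if annotated[i - 1] == control[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return 1.0 - dp[m][n] / max(m, n)

# A generation that realizes 3 of the 4 requested labels in order
# misses one label, giving an edit distance of 1 over 4 tokens.
sim = structure_similarity(
    ["abstract", "strength", "decision"],
    ["abstract", "strength", "weakness", "decision"])  # 0.75
```

A perfectly structured generation thus scores 1.0, and every missed, spurious, or mislabeled sentence lowers the score proportionally.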