ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis

To assist the human review process, we build a novel ReviewRobot that automatically assigns review scores and writes comments for multiple categories such as novelty and meaningful comparison. A good review needs to be knowledgeable, namely the comments should be constructive and informative to help improve the paper, and explainable, providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs, we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4%-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.


Introduction
As the number of papers in our field increases exponentially, reviewing practices today are more challenging than ever. The quality of peer paper reviews is widely debated across the academic community (Bornmann et al., 2010; Mani, 2011; Sculley et al., 2018; Lipton and Steinhardt, 2019). How many times do we complain about a bad, random, dismissive, unfair, biased or inconsistent peer review? Authors have even created various groups on social media to vent their frustration and anger, such as the "Reviewer #2 must be stopped" group on Facebook. How many times are our papers rejected by a conference and then accepted by a better venue with only a few changes? As the number of paper submissions continues to double or even triple every year, so does the need for high-quality peer reviews.
The following are two different reviews for the same paper, rejected by ACL2019 and accepted by EMNLP2019 without any change in content:
• ACL 2019: Idea is too simple and tricky.
• EMNLP 2019: The main strengths of the paper lie in the interesting, relatively under-researched problem it covers, the novel and valid method, and the experimental results.
These reviews, including the positive one, are too vague and generic to be helpful. We often see review comments stating that a paper is missing references without pointing to any specific references, or criticizing an idea as not novel without showing similar ideas in previous work. Some bad reviewers ask authors to add citations to their own papers to inflate their citation counts and h-index, even when these papers are irrelevant or were published after the submission deadline of the paper under review. An early study (Anderson, 2009) shows that the acceptance of a computer systems paper is often random and that the dominant factor is the variability between reviewers. The inter-annotator agreements between two review scores for the ACL2017 accepted papers (Kang et al., 2018) are only 71.5%, 68.4%, and 73.1% for substance, clarity and overall recommendation, respectively. (Pier et al., 2018) found no agreement among reviewers evaluating the same NIH grant applications. The organizers of NIPS2014 assigned 10% of the submissions to two different sets of reviewers and observed that the two committees disagreed on 25.9% of the papers (Bornmann et al., 2010), and half of the NIPS2016 papers would have been rejected if they had been reviewed by a different group (Shah et al., 2017).

[Figure 1: ReviewRobot Architecture Overview]
These findings highlight the subjectivity of human reviews and call for ReviewRobot, an automatic review assistant that helps human reviewers generate knowledgeable and explainable review scores and comments, along with detailed evidence. We start by installing a brain for ReviewRobot: a large-scale background knowledge graph (KG) constructed from previous papers in the target domain using domain-specific Information Extraction (IE) techniques. For each paper under review, we apply the same IE method to construct two KGs: one from its related work section and one from its other sections. By comparing these KGs, we extract pieces of evidence (e.g., novel knowledge subgraphs that appear in the current paper but not in the background KG) for each review category and use them to predict review scores. We manually select constructive human review sentences and generalize them into templates for each category. Then we apply these templates to convert structured evidence into natural language comments for each category, using the predicted scores as a controlling factor.
Experimental results show that our review score predictor reaches 71.4% accuracy on overall recommendation, which is very close to inter-human agreement (72.2%). The score predictor achieves 100% accuracy for both the appropriateness and impact categories. Human assessment by domain experts shows that up to 70.5% of the comments generated by ReviewRobot are valid, and better than human-written ones 20% of the time.
In summary, the major contributions of this paper are as follows: • We propose a new research problem of generating paper reviews and present the first complete end-to-end framework to generate scores and comments for each review category.
• Our framework is knowledge-driven, based on fine-grained knowledge element comparison among papers, and thus the comments are highly explainable and constructive, supported by detailed evidence.
• We create a new benchmark that includes 8K paper and review pairs, 473 manually selected pairs of paper sentences and constructive human review sentences, and a background KG constructed from 174K papers.
Approach

Overview

Figure 1 illustrates the overall architecture of ReviewRobot. ReviewRobot first constructs knowledge graphs (KGs) for each target paper and a large collection of background papers, then it extracts evidence by comparing knowledge elements across multiple sections and papers, and uses the evidence to predict scores and generate comments for each review category. We adopt the following most common categories from NeurIPS2019 and PeerRead (Kang et al., 2018):

• Potential Impact: How significant is the work described? If the ideas are novel, will they also be useful or inspirational? Does the paper bring any new insights into the nature of the problem?

[Figure 2: Knowledge Graph Construction Example for the Paper (Bahdanau et al., 2015)]

Knowledge Graph Construction
Generating meaningful and explainable reviews requires ReviewRobot to understand the knowledge elements of each paper. We apply a state-of-the-art Information Extraction (IE) system for the Natural Language Processing (NLP) and Machine Learning (ML) domains (Luan et al., 2018) to construct the following knowledge graphs (KGs):
• G_Pτ: a KG constructed from the abstract and conclusion sections of the target paper under review P_τ, which describes its main techniques.
• Ḡ_Pτ: a KG constructed from the related work section of P_τ, which describes related techniques.
• G_B: a background KG constructed from all NLP/ML papers published before the publication year of P_τ, which teaches ReviewRobot what has been happening in the field.
Each node v ∈ V in a KG represents an entity, namely a cluster of co-referential entity mentions, assigned one of six types: Task, Method, Evaluation Metric, Material, Other Scientific Terms, and Generic Terms. Following the previous work on entity coreference for scientific domains (Koncel-Kedziorski et al., 2019), we choose the longest informative entity mention in each cluster to represent the entity. We consider two entity clusters from different papers as coreferential if one's representative mention appears in the other. Each edge represents a relation between two entities.
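To make the entity matching concrete, below is a minimal sketch of the two heuristics described above: picking the longest informative mention as a cluster representative, and treating two clusters from different papers as coreferential when one representative mention appears inside the other. The function names and the small stop-word list are illustrative assumptions, not part of the released system.

```python
# Minimal sketch of the entity-cluster heuristics described above.
# Names and the generic-word stop list are illustrative, not from the released system.

GENERIC_WORDS = {"model", "method", "approach", "system", "task", "it"}  # assumed stop list

def representative_mention(cluster):
    """Pick the longest informative mention in a coreference cluster."""
    informative = [m for m in cluster if m.lower() not in GENERIC_WORDS]
    candidates = informative or cluster
    return max(candidates, key=len)

def coreferential(cluster_a, cluster_b):
    """Two clusters (from different papers) corefer if one representative
    mention appears inside the other."""
    rep_a = representative_mention(cluster_a).lower()
    rep_b = representative_mention(cluster_b).lower()
    return rep_a in rep_b or rep_b in rep_a

# Example: "attention mechanism" and "neural attention mechanism" are merged.
print(coreferential(["attention mechanism", "it"],
                    ["neural attention mechanism"]))  # True
```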

Evidence Extraction
We compare the differences among the constructed KGs to extract evidence for each review category. Table 1 shows the methods to extract evidence and some examples for each category.

[Table 1: Evidence Extraction for the example paper "Attention-over-Attention Neural Networks for Reading Comprehension" (Cui et al., 2017)]
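As one concrete illustration of this KG comparison, the sketch below computes a single kind of evidence mentioned above: knowledge elements (entities and relation triples) that appear in the target-paper KG but not in the background KG, which serve as novelty evidence. Representing each KG as plain node and edge sets, and using exact matching instead of the coreference heuristic, are simplifying assumptions for illustration.

```python
# Hypothetical sketch: novelty evidence as the set difference between the
# target-paper KG and the background KG. KGs are modeled as node/edge sets.

def novelty_evidence(target_kg, background_kg):
    """Return entities and relation triples present in the target paper's KG
    but absent from the background KG built from earlier papers."""
    new_nodes = target_kg["nodes"] - background_kg["nodes"]
    new_edges = target_kg["edges"] - background_kg["edges"]
    return {"nodes": new_nodes, "edges": new_edges}

target_kg = {
    "nodes": {"attention-over-attention", "reading comprehension"},
    "edges": {("attention-over-attention", "Used-for", "reading comprehension")},
}
background_kg = {
    "nodes": {"reading comprehension", "attention mechanism"},
    "edges": {("attention mechanism", "Used-for", "reading comprehension")},
}
print(novelty_evidence(target_kg, background_kg))
```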

Score Prediction
Following (Kang et al., 2018), we consider review score prediction as a multi-label classification task. For a target paper, we first encode its category related sentences with an attentional Gated Recurrent Unit (GRU) (Cho et al., 2014;Bahdanau et al., 2015) and encode the extracted evidence for each review category with an embedding layer. Then we predict the quality score r in the range of 1 to 5 with a linear output layer. We use the prediction probability as the confidence score.
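A minimal PyTorch sketch of this predictor is given below: an attentional GRU pools the category-related sentence tokens, the extracted evidence items are embedded and averaged, and a linear layer produces logits over the five score values, whose softmax probability serves as the confidence score. The dimensions and the exact way evidence is featurized are assumptions for illustration, not the paper's configuration.

```python
# Minimal sketch (assumed hyper-parameters) of the score predictor:
# attentional GRU over category-related sentences + embedded evidence -> score in 1..5.
import torch
import torch.nn as nn

class ScorePredictor(nn.Module):
    def __init__(self, vocab_size, evid_vocab_size, dim=128, num_scores=5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.attn = nn.Linear(dim, 1)                         # attention over GRU states
        self.evid_emb = nn.Embedding(evid_vocab_size, dim)    # evidence item embeddings
        self.out = nn.Linear(2 * dim, num_scores)             # logits over scores 1..5

    def forward(self, sent_ids, evid_ids):
        h, _ = self.gru(self.word_emb(sent_ids))              # (B, T, dim)
        a = torch.softmax(self.attn(h), dim=1)                # (B, T, 1) attention weights
        sent_vec = (a * h).sum(dim=1)                         # attentive pooling
        evid_vec = self.evid_emb(evid_ids).mean(dim=1)        # average the evidence items
        logits = self.out(torch.cat([sent_vec, evid_vec], dim=-1))
        probs = torch.softmax(logits, dim=-1)                 # confidence scores
        return probs.argmax(dim=-1) + 1, probs                # predicted score, confidence

model = ScorePredictor(vocab_size=10000, evid_vocab_size=500)
score, probs = model(torch.randint(0, 10000, (2, 40)), torch.randint(0, 500, (2, 6)))
```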

Comment Generation
Given the evidence graphs and predicted scores as input, we perform template-based comment generation for each category. We aim to learn good templates from human reviews. Unfortunately, as discussed earlier, not all human-written review sentences are of high quality, even for accepted papers. Therefore, in order to generalize templates, we need to carefully select the constructive and informative human review sentences that are supported by evidence in the papers. To avoid expensive manual selection, we design a semi-automatic bootstrapping approach. We manually annotate 200 paper-review pairs from the ACL2017 and ICLR2017 datasets, and then use them as seed annotations to train an attentional GRU (Cho et al., 2014) based binary (select/not select) classifier to process the remaining human review sentences and keep high-quality reviews with high confidence. Our attentional GRU achieves a binary classification accuracy of 85.25%. Table 2 shows the annotation statistics and some examples.
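The selection loop can be sketched as follows: the seed annotations train a binary classifier, the remaining review sentences are scored, and only high-confidence positives are kept as additional template sources. The classifier below is a stand-in (a bag-of-words logistic regression) rather than the attentional GRU used in the paper, and the confidence threshold is an assumed value.

```python
# Hypothetical sketch of the semi-automatic bootstrapping selection.
# A simple scikit-learn classifier stands in for the attentional GRU;
# the 0.9 confidence threshold is an assumption, not the paper's value.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def select_constructive(seed_sents, seed_labels, unlabeled_sents, threshold=0.9):
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vec.fit_transform(seed_sents), seed_labels)             # train on manual seeds
    probs = clf.predict_proba(vec.transform(unlabeled_sents))[:, 1]
    # keep only review sentences the classifier is confident are constructive
    return [s for s, p in zip(unlabeled_sents, probs) if p >= threshold]
```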
For the appropriateness, soundness, and potential impact categories, we generate generic positive or negative comments based on the predicted scores. For the summary, novelty, and meaningful comparison categories, we consider review generation as a template-based graph-to-text generation task. Specifically, for summary and novelty, we generate reviews by describing the Used-for, Feature-of, Compare and Evaluate-for relations in the evidence graphs. We choose positive or negative templates depending on whether the predicted scores are above 3. We use the predicted overall recommendation score to control summary generation. For meaningful comparison (related work), we keep the knowledge elements in the evidence graph with a TF-IDF score (Jones, 1972) higher than 0.5; for each such knowledge element, we recommend the 5 most recent papers that are not cited by the target paper as related work.
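A minimal sketch of the template selection and related-work recommendation logic described above is shown below. The template strings and the background paper-index structure are illustrative placeholders; only the score-above-3 switch, the 0.5 TF-IDF cutoff, and the five-paper limit come from the text.

```python
# Illustrative sketch of template-based comment generation.
# Template wording and the paper-index structure are placeholders;
# the score>3 switch, TF-IDF>0.5 cutoff, and 5-paper limit follow the text.

POSITIVE = "The paper proposes {method}, which is novel for {task}."
NEGATIVE = "{method} has already been applied to {task} in prior work."

def novelty_comment(score, method, task):
    """Choose a positive or negative template based on the predicted score."""
    template = POSITIVE if score > 3 else NEGATIVE
    return template.format(method=method, task=task)

def recommend_related(evidence, tfidf, cited, paper_index, k=5):
    """For each salient uncited knowledge element, suggest the k most recent papers."""
    suggestions = {}
    for element in evidence:
        if tfidf.get(element, 0.0) <= 0.5:        # keep only salient elements
            continue
        papers = [p for p in paper_index.get(element, []) if p["id"] not in cited]
        papers.sort(key=lambda p: p["year"], reverse=True)
        suggestions[element] = papers[:k]         # most recent k uncited papers
    return suggestions
```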

Data
We choose papers in the NLP and ML domains for our experiments because it is easy for us to analyze the results, and ours is not the harshest community in Computer Science: the average review score in our corpus is 3.3 out of 5, while it is 2.5 out of 5 in the computer systems community (Anderson, 2009). In addition to the review corpus constructed by (Kang et al., 2018), we have collected additional paper-review pairs from OpenReview and NeurIPS. In total, we have collected 8,110 paper and review pairs, as shown in Table 3. We construct the background KG from 174,165 papers in the open research corpus. Table 4 shows the data statistics of the background KG.

[Table 2: Annotation statistics and example pairs of paper sentences and constructive review sentences for each category]

• Summary: "In this paper, we present a simple but novel model called attention-over-attention reader for better solving cloze-style reading comprehension task." / "The paper describes a new method called attention-over-attention for reading comprehension."

• Novelty (33): "The paper presents a new framework to solve the SR problem - amortized MAP inference and adopts a pre-learned affine projection layer to ensure the output is consistent with LR." / "It introduces a novel neural network architecture that performs a projection to the affine subspace of valid SR solutions ensuring that the high resolution output of the network is always consistent with the low resolution input."

• Soundness (174): "In high dimensions we empirically found that the GAN based approach, AffGAN produced the most visually appealing results." / "Combined with GAN, this framework can obtain plausible and good results."

• Meaningful Comparison (16): "As a concrete instantiation, we show in this paper that we can enable recursive neural programs in the NPI model, and thus enable perfectly generalizable neural programs for tasks such as sorting where the original, non-recursive NPI program fails." / "This paper improves significantly upon the original NPI work, showing that the model generalizes far better when trained on traces in recursive form."

• Potential Impact (14): "Since there may be several rounds of questioning and reasoning, these requirements bring the problem closer to task-oriented dialog and represent a significant increase in the difficulty of the challenge over the original bAbI (supporting fact) problems." / "I am a bit worried that the tasks may be too easy (as the bAbI tasks have been), but still, I think locally these will be useful."

We use the ACL2017 dataset for the score prediction task because it has complete score annotations for each review category. We follow the data split of PeerRead (Kang et al., 2018), excluding the training pairs on which we fail to run the IE system; the test set remains the same as in (Kang et al., 2018). Unlike PeerRead, which uses multiple review scores for the same input paper, we use the rounded average score of each category as the target score. Table 5 shows that our model, trained on the carefully selected constructive reviews, already reaches a prediction accuracy of 71.43% for overall recommendation, which is very close to the human inter-annotator agreement (72.2%) and dramatically advances the state of the art in most categories.

Score Prediction Performance
Our knowledge graph synthesis based approach is particularly effective at predicting the Novelty score, achieving an accuracy of 71.43%, which is much higher than the accuracy (28.57%) of all other automatic prediction methods that use only paper abstracts as input. In Figure 3 we show the average number of new knowledge elements for our test set of ACL2017 papers when they are reviewed in different years. When the background KG includes newer work, the novelty of these papers decreases, especially after 2017. This indicates that our approach provides a reliable measure for computing novelty.
As a fun experiment, we also run ReviewRobot on this paper submission itself. The predicted re-

Comment Generation Performance
For the system-generated review comments for 50 ACL2017 papers, we ask domain experts to check whether each comment is constructive and valid. Two researchers independently annotate the reviews and reach inter-annotator agreements of 92%, 92%, and 82% for Novelty, Summary, and Related Work, respectively. One expert annotator performs data adjudication. The percentages of constructive and valid comments are 70.5%, 44.6%, and 41.7% for Summary, Novelty, and Meaningful Comparison, respectively. Human assessors also find that for 20% of these papers, human reviewers do not suggest missing related work for Meaningful Comparison, while ReviewRobot generates constructive and informative comments. For example, one reviewer states "The paper would be stronger with the inclusion of more baselines based on related work", but fails to provide any useful references. In the following we compare the human- and system-generated reviews for an example paper (Niu et al., 2017):

Summary
* [SYSTEM] The paper proposes novel skipgram, attention scheme, sememe-encoded models and word representation learning for NLP tasks. The authors uses linguistic common-sense knowledge bases. *

Remaining Challenges and Limitations
The quality of ReviewRobot is mainly limited by state-of-the-art Information Extraction performance in the scientific literature domain. In the future we plan to annotate more data to cover more dimensions for paper profiling (such as the goal and main contributions) and more fine-grained knowledge types to improve extraction quality. For example, for the NLP domain we can extract finer-grained subtypes: a model can include parameters, components and features. The goal of an NLP paper could belong to: "New methods for specific NLP problems", "End-user applications", "Corpora and evaluations", "New machine learning methods for NLP", "Linguistic theories", "Cognitive modeling and psycholinguistic research", or "Applications to social sciences and humanities". Our current evidence extraction framework also lacks a salience measure to assign different weights to different types of knowledge elements.
Paper review generation requires background knowledge acquisition and comparison with the target paper's content. Our novel approach to constructing the background KG has helped improve the quality of review comments on novelty, but the KG is still too flat to support comments on soundness. For example, from the following two sentences in a paper: "Third, at least 93% of time expressions contain at least one time token." and "For the relaxed match on all three datasets, SynTime-I and SynTime-E achieve recalls above 92%.", a knowledgeable human reviewer can infer 93% as the upper bound of performance and write a comment: "Section 5.2: given this approach is close to the ceiling of performance since 93% expressions contain time token, and the system has achieved 92% recall, how do you plan to improve further?". Similarly, ReviewRobot cannot generalize knowledge elements into high-level comments such as "deterministic" in "The tasks 1-5 are also completely deterministic".
ReviewRobot still lacks the deep knowledge reasoning ability needed to judge the soundness of algorithm design details, such as whether the data set split makes sense or whether a model is able to generalize. ReviewRobot is not able to comment on missing hypotheses, problems in the experimental setting, or future work. It currently focuses on text only and cannot comment on mathematical formulas, tables and figures.
Good machine learning models rely on good data, and we need massive amounts of good human reviews to fuel ReviewRobot. In our current approach, we manually select a subset of good human review sentences that are also supported by corresponding sentences in the target papers. This process is very time-consuming and expensive. We need to build a better review infrastructure in our community, e.g., asking authors to provide feedback and ratings to select constructive reviews, as was done at NAACL2018.

Related Work
Paper Acceptance Prediction. Kang et al. (2018) constructed a paper review corpus, PeerRead, and trained paper acceptance classifiers. Huang (2018) applies an interesting visual feature comparing PDF layouts and shows its effectiveness for predicting paper acceptance decisions. Ghosal et al. (2019) applies sentiment analysis features to improve acceptance prediction. The KDD2014 PC chairs exploited author status and review comments to predict paper acceptance (Leskovec and Wang, 2014). We extend these methods to score prediction and comment generation, with detailed knowledge-element-level evidence for each specific review category.
Paper Review Generation. Bartoli et al. (2016) propose the first deep neural network framework to generate paper review comments; the generator is trained on 48 papers from their own lab. In comparison, we perform more concrete and explainable review generation by predicting scores and generating comments for each review category based on a rich set of evidence, and we use a much larger data set. Nagata (2019) generates comment sentences that explain grammatical errors as feedback to improve paper writing.
Review Generation in Other Domains. Automatic review generation techniques have been applied to many other domains, including music (Tata and Di Eugenio, 2010), restaurants (Bražinskas et al., 2020), and products (Ni and McAuley, 2018; Li and Tuzhilin, 2019; Bražinskas et al., 2020). These methods generally apply a sequence-to-sequence model with attention to aspects and attributes (e.g., food type). Compared to these domains, paper review generation is much more challenging because it requires the model to perform deep understanding of paper content, construct knowledge graphs to compare knowledge elements across sections and papers, and synthesize information as input evidence for comment generation.

Application Limitations and Ethical Statement
The types of evidence we have designed in this paper are limited to NLP, ML and related areas, and thus they are not directly applicable to other scientific domains such as biomedical science and chemistry. Whether ReviewRobot is ultimately beneficial to the scientific community also depends on who uses it. Here are some example scenarios where ReviewRobot should and should not be used: • Should-Do: Reviewers use ReviewRobot merely as an assistant to write more constructive comments and compare notes; Editors use ReviewRobot to help filter out very weak papers during screening; Authors use ReviewRobot to get initial feedback to improve their writing; Researchers use ReviewRobot to perform literature surveys, find more good papers, and validate the novelty of their papers.
• Should-Not-Do: Reviewers submit ReviewRobot's output without reading the paper carefully; Editors send out ReviewRobot's output as reviews or make decisions based on it; Authors revise their papers to fit ReviewRobot's features in order to boost review scores.

Conclusions and Future Work
We build ReviewRobot for predicting review scores and generating detailed comments for each review category; it can serve as an effective assistant for human reviewers and for authors who want to polish their papers. The key innovation of our approach is to construct knowledge graphs from the target paper and a large collection of in-domain background papers, and to summarize the pros and cons of each paper at the knowledge element level with detailed evidence. We plan to enhance ReviewRobot's knowledge reasoning capability by building a taxonomy on top of the background KG and by incorporating multi-modal analysis of formulas, tables, figures, and citation networks.