Uncertainty Aware Review Hallucination for Science Article Classification

The high subjectivity and costs inherent in peer reviewing have recently motivated the preliminary design of machine learning-based acceptance decision methods. However, such approaches are limited in that they: a) do not explore the usage of both the reviewer and area chair recommendations, b) do not explicitly model subjectivity on a per-submission basis, and c) are not applicable in realistic settings, because they assume that review texts are available at test time, when these are exactly the inputs that should be considered missing in this application. We propose to utilise methods that model the aleatory uncertainty of the submissions, while also exploring different loss importance interpolations between area chair and reviewers' recommendations. We also propose a modality hallucination approach to impute review representations at test time, providing the first realistic evaluation framework for this challenging task.


Introduction
An analysis (Langford and Guzdial, 2015) of the NeurIPS 2014 experiment shows that 60% of the selected accepted papers were rejected by a second, independent review committee. Such significant reviewer disagreement makes the task of the area chair harder, and may even invite questioning of their decision. Software tools have been piloted in an effort to aid the human reviewers with a computational recommendation on aspects like absence of bias and proper statistical reporting in scientific submissions (Sizo et al., 2019).
Natural Language Understanding (NLU) could also offer decision support to the area chair, as argued in (Ghosal et al., 2019; Stappen et al., 2020). Such systems jointly model all or part of the article and either one review (Kang et al., 2018; Wang and Wan, 2018; Ghosal et al., 2019) or a variable number of potentially contradicting reviews (Stappen et al., 2020). We adopt the latter, review-aggregating approach, which more closely resembles the editorial process.
* KF and GR contributed equally to this work.

Contributions
In this short paper, we offer solutions to three particularities of this task that the above approaches do not address: a) Often, the recommendations given by the area chair and the reviewers are in disagreement. Whereas previous studies have used either the former (Kang et al., 2018; Wang and Wan, 2018; Ghosal et al., 2019) or a soft label average of the latter (Stappen et al., 2020) for supervision, we show that the two signals carry complementary information. b) Whereas soft labels de-emphasise subjective articles with disagreeing reviews during training (Stappen et al., 2020), we outperform the latter study by explicitly modelling aleatory uncertainty as an auxiliary prediction task. c) A model that aims to support the editorial decision process should only assume the availability of human review text during training, and be able to make recommendations in its absence. Inspired by missing modality hallucination methods (Hoffman et al., 2016; Tang et al.; Pérez et al., 2020), we propose a realistic system that uses all available data for training, but imputes review representations at test time based on the abstract text.

Purpose & Ethical Statement
We sincerely believe that human peer reviews should continue to be the main component of the paper acceptance selection process, and this work in no way attempts to replace the human reviewers. Instead, we believe an NLU model can serve as an additional reviewer, aiding an area chair's decision-making process by slot-filling a missing reviewer, or by providing a data-driven, tie-breaking perspective to the editor in cases of borderline reviews. The motivation behind this proposal is that NLU models trained on large-scale data can learn to robustly cancel out individual human biases, in a similar way to how neural networks are robust to non-systematic label noise (Rolnick et al., 2017). Admittedly, such a model can still learn and reflect systematic biases, but we leave an approach to this problem by means of methods that learn with biased data (Kim et al., 2019) for future work.
Related work
Kang et al. (2018) compiled the PeerRead dataset of submissions, and proposed NLU baselines for binary acceptance decision and for the prediction of aspect scores such as novelty and technical correctness. Wang and Wan (2018) explored the acceptance task by modelling the abstract via a memory mechanism (Weston et al., 2015), along with one review. Ghosal et al. (2019) improved performance on the PeerRead dataset by utilising sentiment information using the VADER tool (Hutto and Gilbert, 2014) and universal sentence embeddings (Cer et al., 2018). Unfortunately, PeerRead is imbalanced in that the NeurIPS rejected submissions are not included, despite the fact that 90% of the accepted submissions with reviews in PeerRead are from NeurIPS. Furthermore, around 80% of the submissions are from arXiv, and thus have no reviews attached to them. Stappen et al. (2020) worked on the largest such dataset, the Interspeech 2019 submission corpus, and fused the variable number of text reviews per submission. On incorporating reviewer disagreement information, they showed the simple label average to be better than the adapted version proposed in (Ando et al., 2018), and also approached the score prediction tasks via deep quantile regression (Rodrigues and Pereira, 2020).
Direct modelling of a label disagreement value, instead of using soft labels, has been utilised in areas such as affective computing (Han et al., 2017) and medical image modelling (Raghu et al., 2019). Alternatively, Kendall and Gal (2017) devised a method for aleatory uncertainty modelling that is learnt from the data, instead of requiring ground truth disagreement "labels". Both Han et al. (2017) and Kendall and Gal (2017) have shown regularisation benefits of learnt uncertainty prediction.

Submission-level modelling
Following Wang and Wan (2018) and Stappen et al. (2020), we focus on the abstract $x^{abs}_i$ and review texts $x^{rev}_{i,r}$ (numbering $R_i$) for the $i$-th submission, the acceptance classification label given by the area chair $y^{ac}_i$, as well as those given by the reviewers $y^{rev}_{i,r}$. We use a model $M$ that: a) learns abstract $h^{abs}_i$ and review $h^{rev}_i$ representations using corresponding modules, b) fuses these into a submission representation $h^{sub}_i$, and c) generates the class probability distribution $\hat{y}_i$ via a prediction module and softmax. We then calculate the cross entropy (CE) loss with the true probability distribution $y^{true}_i$:

$$L_{pred} = \mathrm{CE}(y^{true}_i, \hat{y}_i) = -\sum_{c} y^{true}_{i,c} \log \hat{y}_{i,c} \qquad (1)$$

The most straightforward way to do this is by using a hard label, i.e., assuming $y^{true}_i \equiv y^{ac}_i$, with all the probability concentrated at the final recommendation given by the area chair. This way, however, we withhold information about the reviewer uncertainty for the particular submission. Stappen et al. (2020) have successfully used the simple soft label, i.e., the average of the reviewer recommendations:

$$y^{soft}_i = \frac{1}{R_i} \sum_{r=1}^{R_i} y^{rev}_{i,r} \qquad (2)$$

The value of soft labels becomes clear when one considers that, in their absence, true acceptance probabilities of .51 and .89 would receive the same treatment. Occasionally, the area chair may disagree with the reviewers' aggregate decision, which motivates the interpolation of the two factors:

$$L_{pred} = \lambda_{hard} L^{hard}_{pred} + \lambda_{soft} L^{soft}_{pred} \qquad (3)$$

where $L^{*}_{pred}$ and $\lambda_{*}$ refer to the prediction loss and regularisation parameter for either hard or soft labels.
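As an illustration of the hard label, the soft label and their interpolation, the following is a minimal sketch in plain Python rather than the implementation used in this work. It assumes binary labels with class index 1 meaning "accept", reviewer recommendations given as 0/1 integers, and interpolation weights that sum to one (so that `lam_hard = 1 - lam_soft`); all function names are hypothetical.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy between a target and a predicted class distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def one_hot(label, num_classes=2):
    """Hard label: all probability mass on a single class."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


def soft_label(reviewer_labels, num_classes=2):
    """Soft label: average of the one-hot reviewer recommendations."""
    n = len(reviewer_labels)
    return [sum(1 for r in reviewer_labels if r == c) / n
            for c in range(num_classes)]


def interpolated_loss(y_pred, ac_label, reviewer_labels, lam_soft=0.8):
    """Weighted sum of the hard-label and soft-label prediction losses.

    Assumption: the two weights sum to one.
    """
    lam_hard = 1.0 - lam_soft
    hard = cross_entropy(one_hot(ac_label), y_pred)
    soft = cross_entropy(soft_label(reviewer_labels), y_pred)
    return lam_hard * hard + lam_soft * soft
```

For a submission accepted by the area chair with reviewer recommendations `[1, 1, 0]`, the soft target places two thirds of the mass on acceptance, so a confident "accept" prediction is penalised more under the soft term than under the hard term.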

Modelling recommendation subjectivity
We add a second "head" to our prediction module that outputs a predictive uncertainty estimate $\hat{\sigma}_i$. We now require a supervision signal to train it, either by: a) treating label disagreement as ground truth uncertainty (GTU), or b) learning a heteroscedastic loss attenuation score (HLA). Inspired by (Han et al., 2017; Raghu et al., 2019), we define our approach to GTU as a multi-task loss:

$$L_{GTU} = \gamma_{pred} L_{pred} + \gamma_{unc} \mathrm{MSE}(\sigma_i, \hat{\sigma}_i) \qquad (4)$$

where $\sigma_i$ is the standard deviation among reviewer recommendations, and MSE is the mean squared error. For HLA we use the method proposed in (Kendall and Gal, 2017), where $\hat{\sigma}_i$ is the standard deviation of a normal distribution centred at the mean denoted by the main head logits. We sample $T$ logits, calculate a corresponding class distribution $\hat{y}^t_i$ for each, and compute the loss from these sampled distributions as in Kendall and Gal (2017). A larger $\hat{\sigma}_i$ relaxes the loss value for a sample that is difficult to predict correctly.

Figure 1: The peer review machine support system, including the hallucination mechanism. In a first training stage, only the Classification Loss on Review is used, to learn review representations. In the second training stage, the MSE Hallucination Loss and the Classification Loss with Hallucination are used.
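The two uncertainty-aware losses can be sketched as follows in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the GTU target is taken to be the population standard deviation of 0/1 reviewer recommendations, the per-sample MSE reduces to a squared error, and the HLA loss is written as the cross entropy against the mean of the sampled class distributions, which for a one-hot target matches the form in Kendall and Gal (2017). Function names and default values are hypothetical.

```python
import math
import random
import statistics


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def gtu_loss(logits, sigma_hat, y_true, reviewer_labels,
             gamma_pred=0.5, gamma_unc=0.5):
    """Multi-task loss: prediction loss plus squared error on the uncertainty.

    Assumption: the ground truth uncertainty is the population standard
    deviation of the reviewer recommendations.
    """
    sigma = statistics.pstdev(reviewer_labels)
    pred = cross_entropy(y_true, softmax(logits))
    return gamma_pred * pred + gamma_unc * (sigma_hat - sigma) ** 2


def hla_loss(logits, sigma_hat, y_true, num_samples=20, rng=None):
    """Loss attenuation: perturb logits with Gaussian noise scaled by
    sigma_hat, average the sampled class distributions, then take the
    cross entropy against the target."""
    rng = rng if rng is not None else random.Random(0)
    mean_probs = [0.0] * len(logits)
    for _ in range(num_samples):
        noisy = [z + sigma_hat * rng.gauss(0.0, 1.0) for z in logits]
        for c, p in enumerate(softmax(noisy)):
            mean_probs[c] += p / num_samples
    return cross_entropy(y_true, mean_probs)
```

With `sigma_hat = 0` the HLA loss reduces to the ordinary cross entropy. For a confidently wrong prediction, a larger `sigma_hat` spreads the sampled distributions and lowers the loss, which is the attenuation effect described above.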
Imputing reviews at test time

Experiments
Small available dataset size is a limitation known to the community working on this domain (Kang et al., 2018; Ghosal et al., 2019). We use the largest database of its kind (Stappen et al., 2020), i.e., the 2 179 preprocessed academic submissions and 5 842 reviews, with corresponding acceptance decisions and reviewer scores, from the submission system of Interspeech 2019, shared with us by the technical chairs of the conference. After data cleaning and removal of corrupt entries, the accepted and rejected classes are well-balanced: 50.2% acceptances and 49.8% rejections. The dataset is shuffled and split into 80-10-10 train-validation-test set percentages. We monitor the validation performance in terms of macro-averaged F1 score, and also report the macro-averaged Area Under the Receiver Operating Characteristic curve (AU-ROC), averaged across 20 trials. We use the Adam optimiser (Kingma and Ba, 2014) with a learning rate of 1e-3, and represent words using FastText (Bojanowski et al., 2017). Our abstract and review modules comprise a stacked 1D convolutional network with kernel sizes 4-4, interleaved with max pooling with rates 2-2, followed by a recurrent layer with a gated recurrent unit cell and 100 hidden units, and attentional sequence pooling. The prediction module consists of two dense layers of 50-2 units, with a ReLU activation between them.
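The data split and the monitored metric can be sketched as below. This is a hedged illustration only: the integer rounding of the split sizes, the seed, and the function names are assumptions, and the macro-averaged F1 is written out by hand for the two-class case rather than taken from any particular library.

```python
import random


def split_indices(n, seed=0):
    """Shuffle indices and split them 80-10-10 into train/validation/test.

    Assumption: sizes are rounded down, with the remainder going to test.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = n * 8 // 10
    n_val = n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because the two classes are nearly balanced here, macro-averaged F1 and accuracy will tend to move together, but the macro average remains the safer choice if a later split turns out skewed.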

To model the reviewer or the area chair?
The interpolation weights $\lambda_{soft}$, $\lambda_{hard}$ for the prediction error (cf. Eq. 3) are dataset-dependent and should be set based on validation performance. We experiment with a grid ranging from [1.0, 0.0] to [0.0, 1.0] using a step of 0.2; $\lambda_{soft} \equiv 0.0$ denotes the simple hard label case. The results using the GTU loss are summarised in Table 1. We find that the area chair and the reviewers' recommendations carry complementary information, and the best results of this study are at $\lambda_{soft} \equiv 0.8$. Interestingly, the agreement (accuracy) between the editorial labels and the reviewer soft averages (rounded to 0 or 1) is 78.901%. The disagreements occur on close-to-borderline papers, in which cases the additional supervision is the most informative.
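The weight grid and the agreement figure quoted above can be computed along the following lines. This is a sketch under assumptions: the grid pairs are ordered as (soft weight, hard weight), reviewer recommendations are 0/1 integers, and a soft average of exactly 0.5 is rounded up to "accept"; the tie rule and the function names are hypothetical.

```python
def lambda_grid():
    """(lambda_soft, lambda_hard) pairs from (1.0, 0.0) to (0.0, 1.0), step 0.2."""
    return [(k / 10, (10 - k) / 10) for k in range(10, -1, -2)]


def agreement(ac_labels, reviewer_labels_per_paper):
    """Fraction of papers where the rounded reviewer average equals the
    area chair label. Assumption: an average of exactly 0.5 rounds to 1."""
    hits = 0
    for ac, revs in zip(ac_labels, reviewer_labels_per_paper):
        soft_accept = sum(revs) / len(revs)
        rounded = 1 if soft_accept >= 0.5 else 0
        hits += int(rounded == ac)
    return hits / len(ac_labels)
```

Each grid pair would be used to train one model, with the pair that gives the best validation score retained.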

Are soft labels enough?
A comparison among the different loss functions, without hallucination, is summarised in Table 2. We report the best soft loss interpolation per case.

Table 2: Results on using the abstract with or without the reviews, using different kinds of losses. BL denotes the baseline by Stappen et al. (2020).
In the case of GTU we found that the choice of $\gamma_{unc} \equiv \gamma_{pred} \equiv 0.5$ works best. The additional complexity of explicit uncertainty modelling is shown to be beneficial when compared to the simple soft labels, and GTU is better than the self-learnt uncertainty method HLA. Our model implementation with soft-hard loss mixing is also shown to greatly outperform a baseline (BL), i.e., the best result found in (Stappen et al., 2020). We also performed statistical significance testing, using Welch's unequal variances t-test. No significant improvement was found for uncertainty-aware methods over hard labels in the abstract-only experiments. However, GTU with hallucinated review representations is significantly better than abstract-only with p < 0.1 for AU-ROC and p < 0.05 for F1, and HLA with hallucinated review representations with p < 0.05 for both measures. In the experiments using both abstract and reviews, the simple soft labels and HLA are both significantly better than hard labels in terms of AU-ROC with p < 0.05. GTU was significantly better than hard labels with p < 0.1 for F1 and p < 0.05 for AU-ROC.

Table 3: We report the review hallucination results; for the uncertainty-aware methods, we used $\lambda_{soft} \equiv 0.8$.
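For reference, the statistic behind Welch's unequal variances t-test, as applied to two sets of per-trial scores, can be sketched in plain Python as follows. The function name is hypothetical, and the sketch stops at the t statistic and the Welch-Satterthwaite degrees of freedom; obtaining a p-value additionally needs the Student t distribution, which is not in the standard library.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two
    independent samples with possibly unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    n_a, n_b = len(a), len(b)
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df
```

In this setting `a` and `b` would each hold the 20 per-trial F1 or AU-ROC values of the two methods being compared.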

Can we impute reviews?
Table 3 summarises the improvement brought by hallucinated reviews over the abstract-only case. Even though we only report a specific hard-soft interpolation weight $\lambda \equiv 0.8$, we observe this improvement universally. The HLA method with hallucination achieves both the best performance in this experiment and the largest relative improvement (t-test, p < .05) upon the abstract-only case, as shown in Table 4. Lacking the true reviews, we have high label variance for the same abstract input, i.e., high aleatory uncertainty. HLA (Kendall and Gal, 2017) is designed for such cases, and guides the learning of hallucinated review representations through regularisation, allowing a significant fraction of the performance gap to be covered. Hard labels with hallucination is the method that performs relatively closest to its ceiling performance, but this can be explained by the ceiling being comparatively low in the hard-labels case. The additional label uncertainty information, whether explicit or learnt, informs not just the classification capacity of the model, but also its ability to generate review representations. These hallucinated representations should be placed in embedding space such that they inform the model regarding the label, but not in an overconfident manner, given that the actual reviews are missing; this is exactly where knowledge of uncertainty contributes. As a final method recommendation, we recommend the learnt-attenuation-based HLA, due to its better performance along with modality hallucination and the fact that it does not require the presence of multiple reviewer recommendations even at training time.
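The second training stage named in the Figure 1 caption combines an MSE hallucination loss with a classification loss computed on the hallucinated review representation. The sketch below shows one way such an objective could be assembled. It is heavily simplified and rests on assumptions that are not taken from this paper: the hallucinator is a hypothetical linear map from the abstract representation, the two terms are combined with a hypothetical weight `beta`, and `classification_loss` is a caller-supplied stand-in for the fusion and prediction modules.

```python
def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def hallucinate(h_abs, weights, bias):
    """Hypothetical linear hallucinator: abstract representation in,
    imputed review representation out."""
    return [sum(w * x for w, x in zip(row, h_abs)) + b
            for row, b in zip(weights, bias)]


def second_stage_loss(h_abs, h_rev_true, weights, bias,
                      classification_loss, beta=1.0):
    """Classification loss on the hallucinated representation plus a
    weighted MSE pulling it towards the true review representation."""
    h_rev_hat = hallucinate(h_abs, weights, bias)
    return classification_loss(h_abs, h_rev_hat) + beta * mse(h_rev_hat, h_rev_true)
```

At test time only `hallucinate` would be called, since the true review representation is unavailable by construction.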

Can we learn model disagreement?
Table 4: Relative improvements (in %) brought by hallucinated reviews compared to using only the abstract, and relative reductions compared to the performance ceiling in the case the reviews are available at test time. In cases where the true reviews cannot be assumed to be present in test/deployment, our hallucination approach allows for improvement of results compared to excluding reviews altogether.

The Pearson Correlation Coefficient (PCC) between the predicted uncertainty and the standard deviation of reviewer recommendations is .25 and .08 for GTU and HLA respectively in the abstract plus review case. The former indeed learns on actual disagreement labels, although high uncertainty prediction fidelity may not be necessary for high predictive performance, as shown by the competitive HLA. When using only abstracts, PCC drops to .08 and .04 respectively, whereas by using hallucination we observe .08 and .05, indicating that the true review representations are required for good uncertainty prediction.
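The PCC between predicted uncertainties and reviewer standard deviations is the standard sample correlation; a plain-Python sketch follows, with a hypothetical function name. It raises an error when either input is constant, since the coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `x` would hold the predicted uncertainty of each test submission and `y` the standard deviation of its reviewer recommendations.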

Conclusion
We have proposed a machine learning framework for automatic peer review support that makes better use of the available information, and is also realistic with respect to the limitations set by the task. We have found that the area chair and reviewer recommendations comprise independent supervision signals that should be used in conjunction to train the system. Furthermore, in order to relax the penalty for mispredicting subjective submissions, it is not enough to use a simple soft label average of the reviewer recommendations; one has to directly model an aleatory uncertainty score as an auxiliary task, either using ground truth "uncertainty labels", or through learnt attenuation of the loss. Finally, we utilise review representation hallucinations at test time to make the best use of available review texts in a realistic manner, and find that this approach works well with, and benefits from, the regularisation introduced by direct uncertainty modelling. Even with the application of our review representation hallucination, the performance gap from the ceiling set by using the true review representations is still high. We intend to approach the task via self-supervision methods (He et al., 2020) that focus on multimodal data (Nagrani et al., 2020). Requesting additional reviewers based on inference-time uncertainty, similar to (Raghu et al., 2019), is another promising future work step, as is an analysis of the uncertainties predicted by our model using the different losses. Finally, we have shown in our study that only a representation of the abstract is required as the input both for acceptance and hallucination modelling. Since previous work (Ghosal et al., 2019) has shown that modelling an article based on the entire paper can be beneficial, we also intend to explore the impact of using such a highly expressive article representation for hallucinating review representations.