Exploiting Labeled and Unlabeled Data via Transformer Fine-tuning for Peer-Review Score Prediction



Introduction
Over the past few years, the number of submissions to AI-related international conferences and journals has increased substantially, making the review process more challenging. Automatic peer-review aspect score prediction (PASP) scores academic papers on a numeric scale along aspects such as "clarity" and "originality", and can serve as a helpful assistant tool for both reviewers and authors. PeerRead is the first publicly available dataset of scientific peer reviews for research purposes (Kang et al., 2018). It has been used in various ways, such as paper acceptance classification (Ghosal et al., 2019; Maillette de Buy Wenniger et al., 2020; Fytas et al., 2021) and review aspect score prediction (Li et al., 2020; Wang et al., 2020); modified versions have also served citation recommendation (Jeong et al., 2019) and citation count prediction (van Dongen et al., 2020).

Much of the previous work on PASP is based on supervised learning (Kang et al., 2018; Li et al., 2020). However, the dataset with annotated aspect scores is relatively small, which deteriorates overall performance. To mitigate this drawback and improve the performance of PASP, we propose a semi-supervised learning (SSL) method that can leverage contextual features from a larger unannotated dataset. SSL has been widely utilized in many NLP tasks, such as classification (Miyato et al., 2016; Li et al., 2021), sequence labeling (Yasunaga et al., 2018; Chen et al., 2020), and parsing (Zhang and Goldwasser, 2020; Lim et al., 2020). It has been shown to be effective by leveraging a large amount of unlabeled data to compensate for the lack of labeled data. SSL is particularly beneficial for PASP because an enormous body of publications is available online, and unlabeled data, i.e., scholarly papers, can often be obtained with minimal effort.

1 https://github.com/panitan-m/gamma_trans
Recently, transformer-based pre-trained language models (LMs) such as BERT (Devlin et al., 2019) and its variants have been very successful, attaining unprecedented performance on many NLP tasks.
In this paper, we combine the strengths of both techniques and propose a Transformer-based Γ-model (Γ-Trans) that incorporates a pre-trained transformer into the Γ-model (Rasmus et al., 2015), a variant of the ladder network (Valpola, 2014; Rasmus et al., 2015), an SSL autoencoder. The unsupervised part of Γ-Trans uses a denoising autoencoder to help the model focus on relevant features derived from supervised learning. The contributions of our work can be summarized as follows:
• We propose Γ-Trans for PASP, which incorporates a pre-trained transformer into SSL by fine-tuning the model on labeled and unlabeled data simultaneously.
• The experimental results show that Γ-Trans outperforms the supervised learning baselines and naive SSL methods with a small amount of labeled training data.
• We compare several BERT variants and unlabeled data sizes to examine the effectiveness of Γ-Trans for PASP.

Γ-Transformer
Existing works have applied ladder networks to NLP tasks, e.g., information extraction (Nagesh and Surdeanu, 2018) and sentiment analysis (Pan et al., 2020; Zheng et al., 2021). The latter uses the encoder of the ladder network (Rasmus et al., 2015) to extract features from a pre-trained LM without fine-tuning it. By freezing the features from the LM, the model only uses the fully connected layers of the encoder and does not exploit the transformer layers of the LM. To mitigate this issue, we fine-tune the LM while training the Γ-model, instead of merely acquiring sequence embeddings from the pre-trained LM. The model can be plugged into any feedforward network without a decoder implementation, i.e., the denoising cost is applied only at the top layer of the model. Figure 1 illustrates the Γ-Trans network.

Let x be the input and y be the output with targets t. The labeled training data of size N consists of pairs {x(n), t(n)}, where 1 ≤ n ≤ N. The unlabeled data of size M consists of inputs x(n) without targets, where N+1 ≤ n ≤ N+M. As shown in Figure 1, the network consists of two forward passes, the clean path and the corrupted path. The former is illustrated in a dotted frame on the right-hand side of Figure 1 and produces the clean z and y:

  h^(0) = e,   h^(l) = Tr^(l)(h^(l−1)),   1 ≤ l ≤ L,
  z = N_B(f(h^(L))) = N_B(W h^(L)),
  y = φ(γ (z + β)),                                  (1)

where e denotes the input embedding of x with positional encoding, Tr^(l) refers to the transformer block at layer l of the L-layer pre-trained LM (e.g., BERT), and N_B indicates batch normalization. W is the weight matrix of the linear transformation f, φ is an activation function, and β and γ are trainable bias and scaling parameters, respectively. The clean path shares the mappings Tr^(l) and f with the corrupted path.
The corrupted z̃ and ỹ are produced by adding Gaussian noise n in the corrupted path (left-hand side of Figure 1):

  h̃^(0) = e + n,   h̃^(l) = Tr^(l)(h̃^(l−1)),   1 ≤ l ≤ L,
  z̃ = N_B(W h̃^(L)) + n,
  ỹ = φ(γ (z̃ + β)).                                 (2)

The supervised cost C_s is the average negative log-probability of the noisy output ỹ matching the target t(n) given the input x(n):

  C_s = −(1/N) Σ_{n=1}^{N} log P(ỹ = t(n) | x(n)),   (3)

where N denotes the number of labeled data. Given the corrupted z̃ and the prior information ỹ, the denoising function g reconstructs the denoised ẑ:

  ẑ = g(z̃, ỹ),                                      (4)

where g is identical to that of Rasmus et al. (2015) and has its own learnable parameters. The unsupervised denoising cost is given by:

  C_d = λ / (M d) · Σ_{n=N+1}^{N+M} ‖ẑ(n) − z(n)‖²,  (5)

where M indicates the number of unlabeled data, λ is a coefficient for the unsupervised cost, and d refers to the width of the output layer. The final cost C is given by:

  C = C_s + C_d.                                     (6)
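As an illustration, the clean path, corrupted path, and combined cost can be sketched in a few lines of NumPy. This is a minimal sketch, not the actual implementation: the transformer stack Tr^(1..L) is replaced by random features h, noise is injected only at the top layer, and all dimensions and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(a, eps=1e-6):
    # N_B: normalize each unit over the batch.
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dimensions: h stands in for the top-layer transformer output,
# with N labeled and M unlabeled examples and d = 2 output classes.
N, M, hidden, d = 4, 6, 8, 2
h = rng.normal(size=(N + M, hidden))
W = 0.1 * rng.normal(size=(hidden, d))   # linear transformation f
beta, gamma = np.zeros(d), np.ones(d)    # trainable bias / scale
t = rng.integers(0, d, size=N)           # targets for the labeled part

# Clean path: z = N_B(W h), y = softmax(gamma * (z + beta)).
z = batch_norm(h @ W)
y = softmax(gamma * (z + beta))

# Corrupted path: Gaussian noise with std 0.3, as in the paper
# (here added only at the top layer for brevity).
z_tilde = z + rng.normal(scale=0.3, size=z.shape)
y_tilde = softmax(gamma * (z_tilde + beta))

def g(z_tilde, u, a):
    # Vanilla denoising combinator of Rasmus et al. (2015); the ten
    # rows of `a` are learnable per-unit parameters (fixed here).
    mu = a[0] * sigmoid(a[1] * u + a[2]) + a[3] * u + a[4]
    nu = a[5] * sigmoid(a[6] * u + a[7]) + a[8] * u + a[9]
    return (z_tilde - mu) * nu + mu

a = np.ones((10, d))                     # arbitrary initialization
z_hat = g(z_tilde, y_tilde, a)

# Supervised cost C_s on the N labeled examples (eq. 3).
C_s = -np.mean(np.log(y_tilde[np.arange(N), t]))

# Denoising cost C_d on the M unlabeled examples (eq. 5), lambda = 1.
lam = 1.0
C_d = lam / (M * d) * np.sum((z_hat[N:] - z[N:]) ** 2)

C = C_s + C_d                            # final cost (eq. 6)
```

Both costs flow into one backward pass, so the labeled cross-entropy and the unlabeled reconstruction term are minimized jointly.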

Experimental Settings
We performed the experiments on the ACL dataset with review aspect scores included in the PeerRead dataset (Kang et al., 2018). We took the mean score of multiple reviews, ranging from 1 to 5, and mapped it to two classes: ≥ 4 (Positive) and < 4 (Negative). We balanced the data, i.e., made the two classes the same size, by randomly downsampling the majority class. Table 1 shows the statistics of the dataset. Although the PeerRead dataset contains both paper and review texts, we only used the papers to predict the aspect scores. We used the first 512 tokens of each paper, according to the maximum input length of the most common pre-trained LM, BERT (Devlin et al., 2019). As unlabeled data, we used ACL papers from the ScisummNet Corpus (Yasunaga et al., 2019), which provides 1,000 papers from the ACL Anthology. We used 5-fold cross-validation to evaluate all systems, with an 80/20 split for the train and test sets. We selected the best model based on the performance on the test set. The final result is the average over the five folds. As evaluation metrics, we used accuracy and F1-score.
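The labeling and balancing steps above can be sketched as follows. The records here are randomly generated stand-ins for the PeerRead reviews, so the field names and counts are purely illustrative.

```python
import random

random.seed(0)

# Hypothetical records: each paper carries the scores that several
# reviewers gave for one aspect (the real data comes from PeerRead).
papers = [{"id": i, "scores": [random.randint(1, 5) for _ in range(3)]}
          for i in range(200)]

def label(paper):
    # Mean of the review scores, binarized at 4.
    mean = sum(paper["scores"]) / len(paper["scores"])
    return "positive" if mean >= 4 else "negative"

positive = [p for p in papers if label(p) == "positive"]
negative = [p for p in papers if label(p) == "negative"]

# Balance the classes by randomly downsampling the majority class.
k = min(len(positive), len(negative))
balanced = random.sample(positive, k) + random.sample(negative, k)
```

Downsampling (rather than reweighting) keeps the training objective unchanged at the cost of discarding some majority-class papers.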

Baselines and Implementation Details
We compare Γ-Trans with supervised learning and semi-supervised learning baselines.
• ReviewRobot (RR) (Wang et al., 2020) -This method extracts evidence by comparing the knowledge graph of the target paper and a large collection of background papers and uses the evidence to predict scores.
• Multi-task (Li et al., 2020) - A multi-task approach that automatically selects shared network structures and other review aspects as auxiliary resources. The model is based on a CNN text classification model.

The Γ-model and Ladder baselines employ a ladder network on top of frozen BERT-base representations. Each baseline and its implementation details are described in Appendix A.

Results and Discussion

On average, the results obtained by Γ-Trans are the best among the SSL methods. This shows that our assumption holds: incorporating fine-tuning of the pre-trained LM into the ladder network helps improve performance significantly. BERT has the worst performance, even worse than the other supervised learning baselines that use a common neural network layer such as a GRU or CNN, probably because the amount of supervised data alone is insufficient to tune the millions of parameters of BERT. Among the aspects, Impact is predicted best on both metrics. We investigated the distribution of each aspect score and found that more than 60% of the papers whose Impact score is ≥ 4 also have a score of ≥ 4 in the other aspects, while the reverse does not hold for the other aspects. This indicates that the Impact aspect has relatively distinctive features compared with the other aspects. In contrast, Meaningful Comparison score prediction has the worst performance. One possible reason is the limited length of the input sequence, i.e., the first 512 tokens: this typically covers the abstract and introduction but not the related work section, which hurts the Meaningful Comparison score.

We recall that Γ-Trans fine-tunes the LM through training the ladder network. To examine how the LM affects the overall performance on PASP, we tested several pre-trained LMs. Table 3 shows the Overall Recommendation score prediction by F1 obtained from several transformer-based pre-trained LMs with Γ-Trans and the second-best method, Ladder. Our approach generates better results with all models. We can see that SciBERT does not outperform BERT. Table 3 also shows that Longformer performs better than BERT with Γ-Trans, but not with Ladder. This indicates that a longer sequence of textual information helps improve the performance of PASP. In contrast, Ladder does not work well with Longformer; one reason is that Ladder cannot utilize the attention mechanism of Longformer for the domain of ACL papers, as it only employs the sequence embeddings obtained from the Longformer. We also examined how the amount of unlabeled training data affects overall performance. Figure 2 shows the F1-score of the SSL methods against the number of unlabeled papers, obtained by 5-fold cross-validation. Overall, more unlabeled data helps improve the performance of every SSL method except VAT, whose performance drops at 1,000 unlabeled papers. Γ-Trans consistently outperformed the other SSL methods; notably, its result with 100 unlabeled papers already outperformed other methods with 700.

Error Analysis
We analyzed the prediction probabilities on the Overall Recommendation test data. The average probability of the selected class is 50.26%, which is relatively low. Such close probabilities for the two classes indicate that the extracted features of the two classes are not very different from each other. The average probabilities of the correct and incorrect predictions are 50.30% and 50.13%, respectively, showing no significant difference. Figure 3 shows the ratio of negative to positive predictions. Our model tends to be biased toward positive predictions for every aspect. The most biased prediction is Meaningful Comparison, with 84.31% positive. One reason is that several reviewers are assigned to one paper. Assume that a sample labeled negative has scores of 3, 3, and 4 (the sample is labeled negative because the average of these scores is less than 4). Such a sample has some positive features that trigger the model to predict it as positive. In contrast, there was no such case among the positive samples.
We further investigated the negative predictions. Table 4 shows the precision on negative samples. Although our model predicts positive more often than negative, the precision on negative predictions is very high: the highest is 0.938, on the Impact aspect, and even the lowest is above 0.8. High precision on negative samples indicates that our model is suitable as a first screen to filter out poor-quality works. Moreover, it can also help authors assess their first drafts.
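The metric discussed above, precision on the negative class, can be computed as follows. The confusion counts in the example are made up to illustrate the point, not taken from Table 4.

```python
# Precision on the negative class: of all papers the model predicts
# as negative, the fraction that are truly negative.
# Label convention (an assumption): 0 = negative, 1 = positive.
def negative_precision(y_true, y_pred):
    predicted_negative = [t for t, p in zip(y_true, y_pred) if p == 0]
    if not predicted_negative:
        return 0.0
    return (sum(1 for t in predicted_negative if t == 0)
            / len(predicted_negative))

# A model biased toward positive can still be precise when it does
# say "negative": here it rejects three papers and two are truly bad.
y_true = [0, 0, 1, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1]
```

With these toy labels the model predicts negative three times, of which two are correct, giving a negative precision of 2/3 despite the positive bias.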

Conclusion
In this paper, we focused on the PASP task and proposed a method, Γ-Trans, that incorporates transformer fine-tuning into the Γ-model variant of the ladder network. The experimental results showed the effectiveness of our model, which attained the best accuracy and F1 on average. Through the experiments, we found that our method improves the performance of all pre-trained LMs, including SciBERT and Longformer. Future work includes (i) extending the method to imbalanced aspect score datasets, (ii) exploiting the relations between aspects, and (iii) generating knowledgeable and explainable review comments.

Limitations
We should be able to obtain further gains in efficacy from our pre-trained LM. We used the first 512 word tokens of each input paper and a 768-dimensional hidden layer, as most pre-trained LMs restrict text length and embedding size, which may lead to a lack of contextual information about the aspects. Furthermore, in our experiments, fine-tuning Longformer on 1,000 tokens while freezing the first ten layers required around 50 GB of GPU memory. We plan to improve our Γ-Trans model so that it can process papers consisting of long token sequences.

A Implementation details

A.1 Fine-tuning BERT

We used Huggingface's Transformers package to fine-tune BERT. We fine-tuned the model with a learning rate of 1e-6 until convergence, with a batch size of 8 and a maximal sequence length of 512. Optimization was done using Adam with warm-up = 0.1 and a weight decay of 0.01.

A.2 PeerRead model
We used a simple MLP with a single hidden layer of 100 neurons on top of the last recurrent state of a single GRU layer with 100 units. We trained the MLP until convergence using the Adam optimizer, with a learning rate of 1e-4, a batch size of 8, and an L2 penalty of 1.

A.3.1 Recurrent LM Pre-training
We used a unidirectional single-layer LSTM with 1,024 hidden units. The dimension of the word embeddings was 256. For optimization, we used the Adam optimizer with a batch size of 32, an initial learning rate of 0.001, and a learning rate decay factor of 0.9999. We trained for 50 epochs. We applied gradient clipping with the norm set to 5.0, and dropout with a rate of 0.5 on the word embedding layer and the output layer.

A.3.2 Model Training
We added a hidden layer between the final output of the LSTM and the softmax layer for the target; its dimension is set to 30. For optimization, we again used the Adam optimizer, with a 0.001 initial learning rate and 0.9998 exponential decay. Batch sizes are set to 32 and 96 for calculating the loss of virtual adversarial training. We trained for 30 epochs and applied gradient clipping with the norm set to 5.0.

A.4 Multi-task
We modified the model from performing a regression task to a classification task by changing the output layer. We used a CNN with 64 filters of width 2, fastText vectors as initial word embeddings, and a hidden dimension of 1,024. We trained the model using the Adam optimizer with a learning rate of 0.001 and a batch size of 8. We trained all candidate multi-task models with one and two auxiliary tasks to find the best one.

A.5 Γ-model and Ladder
We set the layer sizes of the ladder network to 768-100-500-250-250-250-2, according to BERT's representation dimension and the number of output classes. We set the denoising cost multipliers λ to [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1] from the input layer to the output layer for the Ladder, and to [0, 0, 0, 0, 0, 0, 1] for the Γ-model. The std of the Gaussian corruption noise n is set to 0.3. We trained the model with a learning rate of 3e-3 until convergence, with a batch size of 8 each for labeled and unlabeled data (16 in total). Optimization was done using Adam with a weight decay of 0.01.
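The per-layer multipliers above determine how much each layer's reconstruction error contributes to C_d. The sketch below shows the weighted sum; the activations are random stand-ins, so only the weighting scheme, not the values, reflects the actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths and per-layer denoising cost multipliers from the
# paper: the full Ladder penalizes reconstruction at every layer,
# while the Γ-model keeps only the output-layer term.
widths     = [768, 100, 500, 250, 250, 250, 2]
lam_ladder = [1000, 10, 0.1, 0.1, 0.1, 0.1, 0.1]
lam_gamma  = [0, 0, 0, 0, 0, 0, 1]

def denoising_cost(z_hat, z, lams, widths, m):
    # C_d = sum_l lam_l / (m * d_l) * ||z_hat(l) - z(l)||^2
    return sum(lam / (m * d) * np.sum((zh - zc) ** 2)
               for zh, zc, lam, d in zip(z_hat, z, lams, widths))

# Random stand-ins for clean and reconstructed activations (m = 8).
m = 8
z_clean = [rng.normal(size=(m, d)) for d in widths]
z_recon = [zc + 0.1 * rng.normal(size=zc.shape) for zc in z_clean]

ladder_cost = denoising_cost(z_recon, z_clean, lam_ladder, widths, m)
gamma_cost  = denoising_cost(z_recon, z_clean, lam_gamma, widths, m)
```

With the Γ-model multipliers, every term but the last vanishes, which is why the Γ-model needs no decoder below the top layer.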

A.6 Γ-Trans
We used Huggingface's Transformers package to fine-tune transformer-based pre-trained LMs. The denoising cost multiplier λ is set to 1. We set the std of the Gaussian corruption noise n to 0.3, as in the Γ-model and Ladder. For optimization, we used the Adam optimizer, with a 1e-4 initial learning rate, 0.01 weight decay, and 0.1 warm-up. The batch size is set to 8 each for labeled and unlabeled data (16 in total).
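For reference, a warm-up fraction of 0.1 corresponds to a schedule like the one below. The paper only states the warm-up fraction; the linear decay after warm-up is one common choice for transformer fine-tuning and is an assumption here, not the paper's confirmed shape.

```python
# Linear warm-up to base_lr over the first warmup_frac of training,
# then linear decay to zero (the decay shape is an assumption).
def lr_at(step, total_steps, base_lr=1e-4, warmup_frac=0.1):
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * step / warmup       # ramp up from 0
    # decay linearly to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup))
```

Warm-up keeps the early gradient updates small, which protects the pre-trained transformer weights from being disrupted before the randomly initialized top layer has settled.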