Exploring Jiu-Jitsu Argumentation for Writing Peer Review Rebuttals

In many domains of argumentation, people's arguments are driven by so-called attitude roots, i.e., underlying beliefs and world views, and their corresponding attitude themes. Given the strength of these latent drivers of arguments, recent work in psychology suggests that instead of directly countering surface-level reasoning (e.g., falsifying given premises), one should follow an argumentation style inspired by the Jiu-Jitsu 'soft' combat system (Hornsey and Fielding, 2017): first, identify an arguer's attitude roots and themes, and then choose a prototypical rebuttal that is aligned with those drivers rather than invalidating them. In this work, we are the first to explore Jiu-Jitsu argumentation for peer review by proposing the novel task of attitude and theme-guided rebuttal generation. To this end, we enrich an existing dataset for discourse structure in peer reviews with attitude roots, attitude themes, and canonical rebuttals. To facilitate this process, we recast established annotation concepts from the domain of peer reviews (e.g., the aspect a review sentence relates to) and train domain-specific models. We then propose strong rebuttal generation strategies, which we benchmark on our novel dataset for the task of end-to-end attitude and theme-guided rebuttal generation and two subtasks.


Introduction
Peer review, one of the most challenging arenas of argumentation (Fromm et al., 2020), is a crucial element for ensuring high quality in science: authors present their findings in the form of a publication, and their peers argue why it should or should not be added to the accepted knowledge in a field. Often, the reviews are also followed by an additional rebuttal phase. Here, the authors have a chance to convince the reviewers to raise their assessment scores with carefully designed counterarguments. Recently, the analysis of review-rebuttal dynamics has received more and more attention in NLP research (Cheng et al., 2020; Bao et al., 2021; Kennard et al., 2022). For instance, previous efforts focused on understanding the links between reviews and rebuttals (Cheng et al., 2020; Bao et al., 2021) and on categorizing different rebuttal actions (e.g., thanking the reviewers; Kennard et al., 2022). Gao et al. (2019) find that well-structured rebuttals can indeed induce a positive score change. However, writing good rebuttals is challenging, especially for younger researchers and non-native speakers. Argumentation technology can be leveraged to support them in writing effective rebuttals.
In computational argumentation, existing efforts towards generating counterarguments are based on surface-level reasoning (Wachsmuth et al., 2018; Alshomary et al., 2021; Alshomary and Wachsmuth, 2023). That is, existing research focuses on directly rebutting the opponents' claims based on what is evident from the argument's surface form. In contrast, researchers also proposed theories that explain how one's arguments are driven by underlying beliefs (Kiesel et al., 2022; Liscio et al., 2022). One such concept, rooted in psychology, relates to 'attitude roots' and corresponding finer-grained 'attitude themes'. These attitude roots are the underlying beliefs that help sustain surface opinions (Hornsey and Fielding, 2017). Based on this idea, the authors further proposed 'Jiu-Jitsu argumentation', a rebuttal strategy that relies on understanding and using one's attitude roots to identify generic but customizable counterarguments, termed canonical rebuttals. As those are aligned with the opponent's underlying beliefs, Jiu-Jitsu argumentation promises to change opinions more effectively than simple surface-level counterargumentation. In this work, we acknowledge the potential of leveraging the latent drivers of arguments and are the first to explore Jiu-Jitsu argumentation for peer-review rebuttals.
Concretely, we propose the task of attitude root and theme-guided rebuttal generation. In this context, we explore reusing established concepts from peer review analysis for cheaply obtaining reviewer attitude roots and themes (Kennard et al., 2022; Ghosal et al., 2022): reviewing aspects and reviewing targets (paper sections). We show an example in Figure 1: the example review sentence criticizes the clarity of the paper without being specific about its target. The argument is thus driven by the attitude root Clarity and the attitude theme Overall. The combination of these two drivers can be mapped to an abstract and generic description of the reviewer's beliefs. Next, given a specific rebuttal action, here Reject Criticism, a canonical rebuttal sentence can be retrieved or generated, serving as a template for further rebuttal refinement.
Contributions. Our contributions are three-fold: (1) we are the first to propose the novel task of attitude root and theme-guided peer review rebuttal generation, inspired by Jiu-Jitsu argumentation.
(2) Next, we present JITSUPEER, an enrichment of an existing collection of peer reviews with attitude roots, attitude themes, and canonical rebuttals. We build JITSUPEER by recasting established concepts from peer review analysis and training a series of models, which we additionally specialize for the peer-reviewing domain via intermediate training.
(3) Finally, we benchmark a range of strong baselines for end-to-end attitude root and theme-guided peer review rebuttal generation as well as for two related subtasks: generating abstract review descriptions reflecting the underlying attitude, and scoring the suitability of sentences for serving as canonical rebuttals. We hope that our efforts will fuel more research on effective rebuttal generation.

Jiu-Jitsu Argumentation for Peer Review Rebuttals
We provide an introduction to Jiu-Jitsu argumentation and the task we propose.
Background: Jiu-Jitsu Argumentation
'Jiu-Jitsu' describes a close-combat-based fighting system practiced as a form of Japanese martial arts. The term 'Jiu' refers to soft or gentle, whereas 'Jitsu' is related to combat or skill. The idea of Jiu-Jitsu is to use the opponent's strength to combat them rather than using one's own force. This concept serves as an analogy in psychology to describe an argumentation style proposed for persuading anti-scientists, e.g., climate-change skeptics, by understanding their underlying beliefs ('attitude roots', e.g., fears and phobias; Hornsey and Fielding, 2017). The idea is to use the opponent's attitude roots to identify counterarguments, termed canonical rebuttals, that can change their opinion more effectively. A canonical rebuttal aligns with, and is congenial to, the underlying attitude roots. It serves as a general-purpose counterargument that can be used for any argument in that particular attitude root-theme cluster.

Attitude Root and Theme-Guided Peer Review Rebuttal Generation
We explore the concept of attitude roots for the domain of peer review. Our goal is to identify the underlying beliefs and opinions reviewers have while judging the standards of scientific papers (cf. §3). Note that a good rebuttal in peer reviewing also depends on the specific rebuttal action to be performed (Kennard et al., 2022), e.g., mitigating criticism vs. thanking the reviewers. For example, in Figure 1, we display the canonical rebuttal for 'Reject Criticism'. For the same attitude root and theme, the canonical rebuttal for the rebuttal action 'Task Done' would be "We have significantly improved the writing, re-done the bibliography and citing, and organized the most important theorems and definitions into a clearer presentation.". We thus define a canonical rebuttal as follows:
Definition. 'A counterargument that is congenial to the underlying attitude roots while addressing them. It is general enough to serve (as a template) for many instances of the same (attitude root-theme) review tuples while expressing specific rebuttal actions.'
Given this definition, we propose attitude root and theme-guided rebuttal generation: given a peer review argument rev and a rebuttal action a, the task is to generate the canonical rebuttal c based on the attitude root and theme of rev.

JITSUPEER Dataset
We describe JITSUPEER, which allows for attitude and theme-guided rebuttal generation by linking attitude roots and themes in peer reviews to canonical rebuttals based on particular rebuttal actions. To facilitate the building of JITSUPEER, we draw on existing corpora and on established concepts, which we recast to attitude roots and attitude themes. We describe our selection of datasets (cf. §3.1) and then detail how we employ those for building JITSUPEER (cf. §3.2). We discuss licensing information in the Appendix (cf. Table 5).

Starting Datasets
DISAPERE (Kennard et al., 2022). The starting point for our work consists of reviews and corresponding rebuttals from the International Conference on Learning Representations (ICLR) in 2019 and 2020. We reuse the review and rebuttal texts, which are split into individual sentences (9,946 review sentences and 11,103 rebuttal sentences), as well as three annotation layers:
Review Aspect and Polarity. Individual review sentences are annotated with reviewing aspects as per the ACL review guidelines, along with their polarity (positive, negative, neutral) (Kang et al., 2018; Chakraborty et al., 2020). We focus on negative review sentences, as these are the ones rebuttals need to respond to (2,925 review sentences and 6,620 rebuttal sentences). As attitude roots, we explore the use of the reviewing aspects: we hypothesize that reviewing aspects represent the scientific values shared by the community, e.g., papers need to be clear, authors need to compare their work, etc.
Review-Rebuttal Links. Individual review sentences are linked to rebuttal sentences that answer those review sentences. We use the links for retrieving candidates for canonical rebuttals.
Rebuttal Actions. Rebuttal sentences are directly annotated with the corresponding rebuttal actions. We show the labels in the Appendix (cf. Table 7).
PEER-REVIEW-ANALYZE (Ghosal et al., 2022). The second dataset we employ is a benchmark resource consisting of 1,199 reviews from ICLR 2018 with 16,976 review sentences. Like DISAPERE, it comes with several annotation layers, of which we use a single type, Paper Sections. These detail which particular part of a target paper a review sentence refers to (e.g., method, problem statement, etc.). We hypothesize that these could serve as attitude themes. Our intuition is that while our attitude roots already represent the underlying beliefs about a work (e.g., comparison), the target sections add important thematic information to these values. For instance, while a missing comparison within the related work might lead the reviewer to question the research gap, a missing comparison within the experiments points to missing baseline comparisons. We show the paper section labels in the Appendix (cf. Table 6).

Enrichment
Our final goal is to obtain a corpus in which review sentences are annotated with attitude roots and themes that, in turn, link to canonical rebuttal sentences given particular rebuttal actions. As detailed above, DISAPERE already provides us with review sentences, their attitude roots (i.e., review aspects), and links to corresponding rebuttal sentences annotated with actions. The next step is to further enrich DISAPERE. For this, we follow a three-step procedure: (1) predict attitude themes using PEER-REVIEW-ANALYZE, (2) describe attitude root and theme clusters (automatically and manually), and (3) identify a suitable canonical rebuttal through pairwise annotation and ranking. The enrichment pipeline is outlined in Figure 2.

Theme Prediction
For finer-grained typing of the attitude roots, we train models on PEER-REVIEW-ANALYZE and predict themes (i.e., paper sections) on our review sentences. We test general-purpose models as well as variants specialized to the peer review domain.
Models and Domain Specialization. We start with BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), two popular general-purpose models, as well as SciBERT (Beltagy et al., 2019), which is specifically adapted for scientific text. We compare those to variants which we further specialize ourselves to the fine-grained domain of peer reviews using Masked Language Modeling (MLM) (Gururangan et al., 2020; Hung et al., 2022).
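For illustration, the following is a minimal sketch of such MLM-based domain specialization with the HuggingFace transformers library; the input file name, checkpoint choice, and hyperparameters are illustrative assumptions rather than the exact setup used here.

```python
# Illustrative sketch: continued masked language modeling on review sentences
# (Gururangan et al., 2020). File name and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# One review sentence per line, e.g., the negative DISAPERE sentences ("ds_neg").
raw = load_dataset("text", data_files={"train": "negative_review_sentences.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="scibert_ds_neg", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```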
Evaluation. On PEER-REVIEW-ANALYZE, we conduct 5 runs starting from different random seeds and report the average performance across those runs. We select the model with the best micro-F1 for the final prediction. We compare against a MAJORITY and a RANDOM baseline.

Table 1: Results for general-purpose and domain-specialized models on the theme enrichment task over 5 random runs. We report Precision, Recall, and Micro-F1 on the PEER-REVIEW-ANALYZE test set and highlight the best result in bold. We underline the 4 best-performing models and separate model variants with dashes.

Results and Final Prediction. All transformer models outperform the baselines by a huge margin (cf. Table 1). SciBERT ds_neg yields the best score, outperforming its non-specialized counterpart by more than 1 percentage point. This points to the effectiveness of our domain specialization. Accordingly, we run the final theme prediction for JITSUPEER with the fine-tuned SciBERT ds_neg. To ensure high quality of our data set, we only preserve predictions for which the sigmoid-based confidence is higher than 70%. This way, we obtain 2,332 review sentences annotated with attitude roots and attitude themes, linked to 6,614 rebuttal sentences. This corresponds to 143 clusters.

Root-Theme Cluster Descriptions
We add natural language descriptions to each attitude root-theme cluster. While these are not necessary for performing the final generation, they provide better human interpretability than the label tuples alone. We compare summaries obtained automatically with manually written descriptions.
Summary Generation. For the human-written descriptions, we display ≤ 5 sentences randomly sampled from each review cluster to a human labeler. We then ask the annotator to write a short abstractive summary (one sentence that summarizes the shared concerns mentioned in the cluster sentences). For this task, we ask a CS Ph.D. student who has experience in NLP and is familiar with the peer review process. For the automatic summaries, we simply extract the most representative review sentence of each cluster. To this end, we embed the sentences with our domain-specific SciBERT ds_neg (averaging over the last-layer representations). Following Moradi and Samwald (2019), we extract the sentence with the highest average cosine similarity to all other review sentences in the cluster.
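The extractive variant can be sketched as follows; the checkpoint name is a placeholder and the exact pooling details are an assumption.

```python
# Sketch of the extractive cluster summary: embed each cluster sentence with the
# domain-specialized encoder and return the sentence with the highest average cosine
# similarity to all others.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("scibert_ds_neg")
enc = AutoModel.from_pretrained("scibert_ds_neg").eval()

def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state           # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding positions
    return (hidden * mask).sum(1) / mask.sum(1)

def most_representative(cluster_sentences):
    emb = torch.nn.functional.normalize(embed(cluster_sentences), dim=-1)
    sims = emb @ emb.T                                        # pairwise cosine similarities
    avg = (sims.sum(1) - 1.0) / (len(cluster_sentences) - 1)  # drop self-similarity
    return cluster_sentences[int(avg.argmax())]
```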
Evaluation. Having a manual and an automatic summary for each cluster in place, we next decide which one to choose as the final description. We show the summaries together with the same ≤ 5 cluster sentences to an annotator and ask them to select the one that better describes the cluster. We develop an annotation interface based on INCEpTION (Klie et al., 2018) and hire two additional CS Ph.D. students for the task. All instances are labeled by both annotators. We measure the inter-annotator agreement on the 99 instances and obtain a Cohen's κ of 0.8. Afterwards, a third annotator with high expertise in NLP and peer reviewing resolves the remaining disagreements.

Canonical Rebuttal Identification
Last, we identify canonical rebuttals for each attitude root-theme cluster given particular rebuttal actions. To this end, we follow a three-step procedure: since theoretically all 6,614 rebuttal sentences qualify as candidates, we first reduce the candidate set automatically, then collect pairwise human preferences among the remaining candidates, and finally rank them to select the canonical rebuttals.
Candidate Set Reduction. To reduce the set of canonical rebuttal candidates to annotate in a later step, we obtain scores from two predictors: the first filter mechanism relies on confidence scores from a binary classifier, which predicts a rebuttal sentence's overall suitability for serving as a canonical rebuttal and which we train ourselves. Second, as the prototypical nature of canonical rebuttals plays an important role, we additionally use specificity scores from SPECIFICITELLER (Li and Nenkova, 2015). (1) For training our own classifier, we annotate 500 randomly selected sentences for their suitability as canonical rebuttals. The annotator is given our definition and has to decide whether or not, independent from any reference review sentences, the rebuttal sentence could potentially serve as a canonical rebuttal. We then use this subset for training classifiers based on our general-purpose and domain-specific models with sigmoid heads. We evaluate the models with 5-fold cross-validation (70/10/20 split). We train all models for 5 epochs with batch size 16 and grid search over the learning rates λ ∈ {1e-5, 2e-5, 3e-5}. The results are depicted in Table 2. As SciBERT ds_neg achieves the highest scores, we choose this model for filtering the candidates: we predict suitability on the full set of rebuttals and keep only those for which the sigmoid-based confidence score is >95%.
(2) SPECIFICITELLER, a pre-trained feature-based model, provides scores indicating whether sentences are generic or specific. We apply the model to our 500 annotated rebuttal sentences and observe that good candidates obtain scores between 0.02 and 0.78, indicating lower specificity. We thus use this range to further reduce the number of pairs for rebuttal annotation. The complete filtering procedure leaves us with 1,845 candidates.
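Putting both filters together amounts to a simple rule; the sketch below assumes the classifier probabilities and the SPECIFICITELLER scores have already been computed for every candidate and are passed in as parallel lists.

```python
# Sketch of the two-stage candidate filtering: keep a rebuttal sentence only if
# (a) the suitability classifier is at least 95% confident and (b) its specificity
# score falls into the generic band observed above (0.02-0.78). Scores are assumed
# to be precomputed.
def filter_candidates(sentences, suitability_probs, specificity_scores,
                      min_conf=0.95, spec_range=(0.02, 0.78)):
    return [s for s, p, spec in zip(sentences, suitability_probs, specificity_scores)
            if p > min_conf and spec_range[0] <= spec <= spec_range[1]]
```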
Manual Annotation. For manually deciding on canonical rebuttals given the pre-filtered set of candidates, we devise the following setup: we show a set of ≤ 5 review sentences from an attitude root (e.g., 'Clarity') and attitude theme (e.g., 'Overall') cluster. We additionally pair this information with a particular rebuttal action (e.g., 'Reject Criticism'). Next, we retrieve two random rebuttal sentences that (a) are linked to any of the review sentences in that cluster, and (b) correspond to the rebuttal action selected. The annotators need to select the best rebuttal from this pair (again, interface implemented with INCEpTION), which is a common setup for judging argument quality and ranking (e.g., Habernal and Gurevych, 2016; Toledo et al., 2019). For a set of n rebuttal sentences available for a particular (attitude root, attitude theme, rebuttal action)-tuple, the pairwise labeling setup requires judgments for n(n-1)/2 pairs (in our case, 4,402). We recruit 2 CS Ph.D. students for this task. In an initial stage, we let them doubly annotate pairs for two attitude roots (Clarity, Meaningful Comparison). We obtain a Cohen's κ of 0.45 (a moderate agreement, which is considered decent for such highly subjective tasks; Kennard et al., 2022). We calculate MACE-based competencies (Hovy et al., 2013) and choose the annotator with the higher competence (0.82) to complete the annotations.
Canonical Rebuttal Selection. Following Habernal and Gurevych (2016), we obtain the best rebuttals from the collected preferences based on Annotation Graph Ranking. Concretely, we create a directed graph for each root-theme-action cluster with the rebuttal sentences as the nodes. The edge directions are based on the preferences: if A is preferred over B, we create A → B. We then use PageRank (Page et al., 1999) to rank the nodes (with each edge having a weight of 0.5). The lowest-ranked node, i.e., the node with very few or no incoming edges, is the canonical rebuttal.
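A minimal sketch of this graph-based selection with networkx; the preference pairs are toy examples.

```python
# Sketch: each judgment "A preferred over B" becomes an edge A -> B with weight 0.5;
# the node with the lowest PageRank score (few or no incoming edges) is taken as the
# canonical rebuttal.
import networkx as nx

preferences = [("r1", "r2"), ("r1", "r3"), ("r2", "r3")]  # (winner, loser) pairs

graph = nx.DiGraph()
for winner, loser in preferences:
    graph.add_edge(winner, loser, weight=0.5)

scores = nx.pagerank(graph, weight="weight")
canonical = min(scores, key=scores.get)   # here: "r1", the most preferred rebuttal
print(canonical, scores)
```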

Dataset Analysis
The final dataset consists of 2,332 review sentences, labeled with 8 attitude roots (aspects in DISAPERE) and 143 themes (paper sections in PEER-REVIEW-ANALYZE). We show label distributions in Figures 3 and 4, and provide more analysis in the Appendix. Most review sentences have the attitude root Substance, which also has the highest number of themes (29). The most common theme is Methodology, followed by Experiments and Related Work. This is intuitive since reviewers in machine learning are often concerned with the soundness and utility of the methodology. In total, we identified 302 canonical rebuttals for different attitude roots and rebuttal actions. Our canonical rebuttals can be mapped to 2,219 review sentences (out of 2,332). The highest number of canonical rebuttal sentences relates to the rebuttal action Task Done and the attitude root Substance. In Table 3, we show examples of some of the canonical rebuttals. We clearly note that different attitude root-theme descriptions connect to different canonical rebuttals (e.g., Concede Criticism in Clarity and Substance).

Attitude root: Clarity
Description: The paper is neither nicely written nor easy to follow.
Answer: 'We have significantly improved the writing, re-done the bibliography and organized the most important theorems into a clearer presentation.'
Concede Criticism: 'In addition, there are indeed a few places in the paper where our phrasing could have been better, thank you for pointing this out.'
Description: Unclear description of method
Future Work: 'While we have provided some diagnostics statistics, understanding this method deeply that will help fuel interesting future research.'
Reject Criticism: 'As far as explaining the method of combination, and the associated mathematical properties, we have tried to do this in greater detail in section 3 (Approach).'

Attitude root: Substance
Description: Incomplete details on performance of the method
Structuring: 'The experiment section lacks more detailed analysis which can intuitively explain how well the proposed method performs on the benchmarks'
Answer: 'While we put this experiment into the appendix, for now, to not change the main paper too much compared to the submitted version, if the reviewers agree we would also be very happy to include this experiment in the main paper.'
Description: Limited improvement over baselines
Task Done: 'We provided a detailed explanation about the experimental setting and further experimental results of the state-of-the-art performance in our response to "The Common concerns about experimental setting and results".'
Concede Criticism: 'Nevertheless, we agree with your comments that it is more meaningful to emphasize our improvement over the state-of-the-art training methods.'

Table 3: Canonical rebuttals for different rebuttal actions (in italics), attitude roots, and theme descriptions.

Baseline Experiments
We propose three novel tasks on our dataset: canonical rebuttal scoring, review description generation, and end-to-end canonical rebuttal generation.

Canonical Rebuttal Scoring
Task Definition. Given a natural language description d and a rebuttal action a, the task is to predict, for all rebuttals r ∈ R (relating to a particular attitude root-theme cluster), a score indicating the suitability of r for serving as the canonical rebuttal of that cluster.
Experimental Setup. The task amounts to a regression problem. We only consider combinations of rebuttal actions and attitude root-theme clusters that have a canonical rebuttal (50 attitude root-theme cluster descriptions with 3,986 rebuttal sentences, out of which 302 are canonical). We use the PageRank scores from before (cf. §3.2.3) as our prediction target for model training (for canonical rebuttals obtained without pairwise annotation, i.e., rebuttals that were the only ones predicted as canonical by our identifier and had no other candidates for comparison, we set the score to 0, since the lower the score, the better the rebuttal; all other rebuttals receive a score of 1). To avoid any information leakage, we split the data into train-validation-test on the level of the attitude roots (4-1-3). The total number of instances amounts to 5,941, out of which we use 2,723-1,450-1,768 for train-validation-test. We experiment with all models described in §3.2.1 in a fine-tuning setup. Following established work on argument scoring (e.g., Gretz et al., 2020; Holtermann et al., 2022), we concatenate the description d with the action a using a separator token (e.g., [SEP] for BERT). We grid search for the optimal number of epochs e ∈ {1, 2, 3, 4, 5} and learning rates λ ∈ {1e-4, 2e-4} with a batch size b = 32 and select models based on their performance on the validation set. We also compare to a baseline where we randomly select a score between 0 and 1. We report Pearson's correlation coefficient (r) and, since the scores can be used for ranking, Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
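A sketch of this input construction with a regression head is given below; note that feeding the candidate rebuttal as the second input segment is our assumption for illustration, and the checkpoint and example strings are placeholders.

```python
# Sketch of the scoring setup: the cluster description and the rebuttal action are
# joined with the separator token; we assume the candidate rebuttal is supplied as
# the second segment of the input pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

description = "Reviewers find the paper hard to follow overall."
action = "Reject Criticism"
rebuttal = "We invite the referee to be specific about the unclear sections."

inputs = tok(f"{description} {tok.sep_token} {action}", rebuttal,
             truncation=True, return_tensors="pt")
score = model(**inputs).logits.squeeze()                       # predicted rank score
loss = torch.nn.functional.mse_loss(score, torch.tensor(0.0))  # 0 = canonical (lower is better)
```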
Results. From Table 4, we observe that most of the domain-specialized models perform better than their non-specialized counterparts. SciBERT ds_all has the highest Pearson correlation across the board; however, BERT ds_neg has the highest ranking scores. The use of other cluster-related information, such as representative review sentences, or paraphrasing the descriptions may lead to further gains, which we leave to future investigation.

Review Description Generation
Task Definition. Given a peer review sentence rev, the task is to generate the abstract description d of the cluster to which rev belongs.
Experimental Setup. Our data set consists of 2,332 review sentences, each belonging to one of 144 clusters with associated descriptions. We apply a train-validation-test split (70/10/20) and experiment with the following seq2seq models: BART (Lewis et al., 2020) (bart-large), Pegasus (Zhang et al., 2020) (pegasus-large), and T5 (Raffel et al., 2020) (t5-large). We grid search over the number of epochs e ∈ {1, 2, 3, 4, 5} and learning rates λ ∈ {1e-4, 5e-4, 1e-5} with a batch size b = 32. We use beam search with 5 beams as the decoding strategy. We run this experiment in a full fine-tuning setup as well as in zero- and few-shot scenarios (with random shot selection). We report the performance in terms of lexical overlap and semantic similarity (ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L) (Lin, 2004), and BERTScore (Zhang et al., 2020); we use the default roberta-large model for BERTScore evaluation).
Results. We show the R-1 scores in Figure 5 (full results in Table 11). Interestingly, all models exhibit a very steep learning curve, roughly doubling their performance according to most measures when seeing a single example only. BART excels in the 0-shot and 1-shot setups across all scores. However, when fully fine-tuning the models, T5 performs best. We hypothesize that this relates to T5's larger capacity (406M parameters in BART vs. 770M in T5).
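For illustration, the inference side of this setup looks roughly as follows; the checkpoint shown is the base bart-large model, whereas in our experiments the models are first fine-tuned (or shown a few examples).

```python
# Sketch of description generation at inference time: a seq2seq model maps a review
# sentence to its cluster description and decodes with beam search (5 beams).
# In practice the checkpoint would be the fine-tuned model, not the base one.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

review = "I think this is a very interesting direction, but the present paper is somewhat unclear."
inputs = tok(review, return_tensors="pt", truncation=True)
output = model.generate(**inputs, num_beams=5, max_length=64)
print(tok.decode(output[0], skip_special_tokens=True))
```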

End2End Canonical Rebuttal Generation
Task Definition. Given a review sentence rev and a rebuttal action a, the task is to generate the canonical rebuttal c.
Experimental Setup. We start from the 2,219 review sentences that have canonical rebuttals for at least 1 action. As input, we concatenate rev with a, placing a separator token in between, resulting in 17,873 unique review-rebuttal action instances. We use the same set of hyperparameters, models, and measures as before (cf. §4.2) and experiment with full fine-tuning as well as zero-shot and few-shot prediction. For these experiments, we apply a 70/10/20 split for obtaining train-validation-test portions on the level of the canonical rebuttals (302 rebuttals linked to 17,873 unique instances).
Results. The differences among the models are in line with our findings from before (Figure 6, full results in Table 12): BART excels in the zero-shot and few-shot setups, and T5 starts from the lowest performance but quickly catches up with the other models. However, the models' performances grow even more steeply than before and seem to reach a plateau already after two shots. We think that this relates to the limited variety of canonical rebuttals and to our decision to split train and test on the level of canonical rebuttals: the task is to generate templates and to generalize over those. Seeing only a few of those templates, the models quickly get the general gist but are unable to generalize beyond what they have been shown. This finding leaves room for future research and points to the potential of data-efficient approaches in this area.

Related Work
Peer Review Analysis. In recent years, research on peer reviews has gained attention in NLP, with most of the efforts concentrated on creating new datasets to facilitate future research. For instance, Hua et al. (2019) presented a data set for analyzing discourse structures in peer reviews annotated with different review actions (e.g., EVALUATE, REQUEST). Similarly, Fromm et al. (2020) developed a dataset that models the stance of peer-review sentences in relation to accept/reject decisions. Yuan et al. (2022) extended peer review labels with polarity labels and aspects based on the ACL review guidelines (as in Chakraborty et al., 2020). Newer works focused mainly on linking review sections to the exact parts of the target papers they relate to (Ghosal et al., 2022; Kuznetsov et al., 2022; Dycke et al., 2023). Overall, only a few works focused on understanding the dynamics between review and rebuttal sentences. Exceptions to this are provided by Cheng et al. (2020) and Bao et al. (2021), who study discourse based on sentence-wise links between review and rebuttal sentences. Kennard et al. (2022) proposed a dataset that unifies existing review annotation labels and also studied review-rebuttal interaction.
Computational Argumentation and Attitude Roots. Computational argumentation covers the mining, assessment, generation, and reasoning over natural language argumentation (Lauscher et al., 2022b). Here, most works focused on analyzing the explicated contents of an argument, e.g., w.r.t. its quality (Toledo et al., 2019; Gretz et al., 2020) or structure (Morio et al., 2020), and on generating arguments based on such surface-level reasoning (Slonim et al., 2021). In contrast, Hornsey and Fielding (2017) analyze the underlying reasons driving people's arguments. In a similar vein, Lewandowsky et al. (2013) study the motivated rejection of science and demonstrate similar attitudes toward climate-change skepticism. Fasce et al. (2023) extend this theory to the domain of anti-vaccination attitudes during the COVID-19 era.
We borrow this idea and adapt it to the domain of peer reviewing to understand scientific attitudes.

Conclusion
In this work, we explored Jiu-Jitsu argumentation for peer reviews, based on the idea that reviewers' comments are driven by their underlying attitudes. For enabling research in this area, we created JITSUPEER, a novel data set consisting of review sentences linked to canonical rebuttals, which can serve as templates for writing effective peer review rebuttals. We proposed different NLP tasks on this dataset and benchmarked multiple baseline strategies. We make the annotations for JITSUPEER publicly available. We believe that this dataset will serve as a valuable resource to foster further research in computational argumentation for writing effective peer review rebuttals.

Limitations
In this work, we present a novel resource in the domain of peer reviews, JITSUPEER. Even though we developed our data set with the goal of fostering equality in science by helping junior researchers and non-native speakers write rebuttals, this resource naturally comes with a number of limitations: JITSUPEER contains different attitude roots and attitude themes, along with the canonical rebuttals, derived from the peer reviews of ICLR 2019 and 2020. ICLR is a top-tier machine learning conference, and thus the taxonomy developed in this process is specific to the machine learning community and does not cover the peer-reviewing domain completely (e.g., the natural sciences, arts, and humanities). Thus, the resource will have to be adapted as the domain varies.
The canonical rebuttals also do not form a closed set, since the original dataset we started from, DISAPERE, does not contain rebuttals for every review sentence. Accordingly, the mapping from peer review sentences to canonical rebuttals is sparse. For this and other reasons, we highlight that writing rebuttals should be a human-in-the-loop process, in which models trained on JITSUPEER provide assistance by generating templates that can be further refined into customized rebuttals.

Information on dataset licenses
The licenses for different datasets used in the paper are listed in Table 5.

Dataset details
We list the distribution of review sentences (in %) for the datasets used in our work, namely DISAPERE and PEER-REVIEW-ANALYZE, in Table 6. We show the distribution of rebuttal sentences with respect to rebuttal actions in Table 7. We display part of the label hierarchy developed for JITSUPEER in Table 8. We plot the number of canonical rebuttals with respect to different rebuttal actions in Figure 9. 'Task Done' and 'Answer' are the most common rebuttal actions for the annotated canonical rebuttals.

Probing of the Domain Specialized Models
In order to gauge the quality of the review representations of the different transformer models (pre-trained and domain-specialized), we additionally perform a probing task. We adopt a classic probing procedure following previous works (Tenney et al., 2019; Zhang et al., 2021; Lauscher et al., 2022a), where a classifier is placed on top of the frozen features. Following Zhang et al. (2021) and Lauscher et al. (2022a), we use a simple two-layer feed-forward neural network on top of the frozen features. For a given text, we average over the representations of all tokens except the special tokens. We perform the probing task on DISAPERE, using the classification tasks already available in the dataset, such as 'Aspect', 'Review Action', and 'Fine-grained Review Action' classification, as proxy tasks. We evaluate the models in terms of Accuracy (Acc) and Macro-F1 (F1). For probing, we follow Lauscher et al. (2022a) and use a batch size of 32 with a learning rate λ = 1e-3 and the Adam optimizer. The models are trained for 100 epochs with early stopping on the validation set with a patience of 5. The results are averaged over 5 runs. We use the standard train/dev/test splits from DISAPERE.
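A sketch of this probing classifier is shown below; the hidden size, label count, and the special-token handling via the tokenizer's special_tokens_mask are illustrative assumptions.

```python
# Sketch of the probing setup: a frozen transformer provides sentence features (mean
# over non-special tokens), and a two-layer feed-forward classifier is trained on top
# (Adam, lr 1e-3). Hidden size and number of labels are illustrative.
import torch
import torch.nn as nn

class Probe(nn.Module):
    def __init__(self, encoder, hidden_dim=768, num_labels=8):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():   # freeze the transformer
            p.requires_grad = False
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_labels))

    def forward(self, input_ids, attention_mask, special_tokens_mask):
        # special_tokens_mask comes from tokenizing with return_special_tokens_mask=True
        with torch.no_grad():
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
        keep = (attention_mask * (1 - special_tokens_mask)).unsqueeze(-1).float()
        pooled = (hidden * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        return self.classifier(pooled)

# e.g.: optimizer = torch.optim.Adam(probe.classifier.parameters(), lr=1e-3)
```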

Annotator Details and Interface Design
We recruited 2 CS Ph.D. students for the different tasks that require human annotation while creating our dataset, JITSUPEER. Initially, we experimented with varying degrees of expertise but found that CS Masters students struggled with the task, which is why we resorted to recruiting Ph.D. students. We initially trained the annotators for 1 hour to explain the guidelines and then answered their questions when needed. We adapted the INCEpTION platform (Klie et al., 2018) for carrying out the annotations. In the dataset creation pipeline, there are two tasks that explicitly require human intervention: i) Root-Theme Cluster Description (cf. §3.2.2) and ii) Manual Evaluation for Canonical Rebuttal Identification (cf. §3.2.3). The interfaces for both of these tasks are shown in Figures 7 and 8, respectively.

Canonical Rebuttal Ranking Task
Task Definition. As another variant of the Canonical Rebuttal Scoring task (cf. §4.1), we now seek to directly identify the canonical rebuttal using retrieval: given a description d and a rebuttal action a, the task is to retrieve the canonical rebuttal c from the set of rebuttals R that corresponds to the attitude root-theme cluster.
Experimental Setup. We formulate this problem as a ranking task. The number of cluster descriptions and rebuttals is the same as in the Canonical Rebuttal Scoring task (cf. §4.1). We implement a BM25-based information retrieval system as a baseline for this task. We also use different bi-encoder-based models built using S-BERT (Reimers and Gurevych, 2019). As in Task 1, we split the data on the level of attitude roots (4-1-3) for training-validation-test. We concatenate d and a using a [SEP] token. We encode the concatenated cluster description and the rebuttals independently and use cosine similarity as the distance measure for training and inference. We fine-tune pre-trained sentence transformer models on this task, along with training new sentence transformer models using embeddings from different language models (BERT, RoBERTa, SciBERT, and their domain-specialized variants). We perform a grid search over learning rates λ ∈ {1e-5, 2e-5}, epochs e ∈ {1, 2, 3, 4, 5}, and batch sizes b ∈ {16, 32}. We report Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
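A sketch of the bi-encoder ranking with the sentence-transformers library is given below; the checkpoint and the toy query/candidates are placeholders, and in our experiments the encoder is fine-tuned on our data.

```python
# Sketch of the bi-encoder retrieval variant: the "[SEP]"-joined description and action
# form the query; query and candidate rebuttals are embedded independently and ranked
# by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Reviewers find the paper hard to follow overall. [SEP] Reject Criticism"
candidates = [
    "We invite the referee to be specific about the unclear sections.",
    "Thank you for the careful reading of our manuscript.",
]

q_emb = model.encode(query, convert_to_tensor=True)
c_emb = model.encode(candidates, convert_to_tensor=True)
ranking = util.cos_sim(q_emb, c_emb).squeeze(0).argsort(descending=True)
print([candidates[i] for i in ranking])
```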

Figure 1 :
Figure 1: Review to Canonical Rebuttal Mapping via intermediate steps. The review has the attitude root 'Clarity' and theme 'Overall'. The canonical rebuttal is for the rebuttal action 'Reject Criticism'.

Figure 2 :
Figure 2: Structure of our starting dataset, DISAPERE, along with the enrichment pipeline used for the construction of our dataset, JITSUPEER. The enrichment pipeline consists of three main steps that finally link the review sentences from each attitude root-theme cluster to the canonical rebuttals for different rebuttal actions. The three main steps are in blue boxes and the sub-steps are in yellow boxes.

Figure 3 :
Figure 3: Final data set analysis: number of review sentences and number of themes per attitude root.

Figure 5 :
Figure 5: ROUGE-1 variation on the Review Description Generation task of BART, Pegasus, and T5 with an increasing number of shots.

Figure 6 :
Figure 6: ROUGE-1 variation on the End2End Rebuttal Generation task of BART, Pegasus, and T5 with an increasing number of shots.

Figure 7 :
Figure 7: Snapshot of the interface used for collecting annotations for the Root-Theme Cluster Description task. The attitude root here is 'Motivation Impact' (asp_motivation-impact) and the theme is 'Experiment' (EXP).

Figure 8 :
Figure 8: Snapshot of the interface used for preference collection in the Manual Evaluation for Canonical Rebuttal Identification task. The attitude root is 'Clarity' (asp_clarity) and the theme 'Overall' (OAL), with respect to the rebuttal action 'rebuttal done'.

Table 5 :
License details of the different datasets used in the paper.

Table 6 :
Distribution of review sentences in DISAPERE and PEER-REVIEW-ANALYZE. For PEER-REVIEW-ANALYZE, these labels constitute 60.24% of review sentences; the rest of the review sentences are annotated with different combinations of these labels.

Table 7 :
Distribution of rebuttal sentences (in %) for different rebuttal actions in DISAPERE.