Argument Pair Extraction via Attention-guided Multi-Layer Multi-Cross Encoding

Argument pair extraction (APE) is a research task for extracting arguments from two passages and identifying potential argument pairs. Prior work treats this task as a sequence labeling problem and a binary classification problem over two passages that are directly concatenated together, which fails to fully exploit the unique characteristics and inherent relations of the two different passages. This paper proposes a novel attention-guided multi-layer multi-cross encoding scheme to address these challenges. The new model processes the two passages with two individual sequence encoders and updates their representations using each other's representations through attention. In addition, the pair prediction part is formulated as a table-filling problem by updating the representations of the two sequences' Cartesian product. Furthermore, an auxiliary attention loss is introduced to guide each argument to align to its paired argument. Extensive experiments show that the new model significantly improves the APE performance over several alternatives.


Introduction
Mining argumentation structures within a corpus is a crucial task in the argument mining research field (Palau and Moens, 2009). There are usually two main components in learning natural language argument structures: (1) detecting argumentative units, and (2) predicting relations between the identified arguments. The problem has been widely studied by natural language processing (NLP) researchers (Cabrio and Villata, 2018) and applied to domains such as web debating platforms (Boltužić and Šnajder, 2015; Swanson et al., 2015; Chakrabarty et al., 2019), persuasive essays (Stab and Gurevych, 2014; Persing and Ng, 2016), social media (Abbott et al., 2016), etc. Unlike traditional argument extraction tasks, which mainly target monologues, Cheng et al. (2020) propose a new task, argument pair extraction (APE) from two passages, in a new domain, namely the peer review process, focusing on exploiting the interactions between reviewer comments and author rebuttals. As shown in Figure 1, the APE task aims to extract the argument pairs from two passages. Specific suggestions, questions or challenges in reviews are considered review arguments. Response sentences that answer or explain a specific review argument are its paired rebuttal arguments. For example, in the pink area, the reviewer points out the lack of literature review in the submission (i.e., review sentences 11-12). In response, the authors argue that they selected the literature based on the special focus of their work (i.e., rebuttal sentences 6-7).
Similar to the two components in the traditional argumentation structure mining, the APE task can be divided into two subtasks: (1) extracting the review and rebuttal arguments from two passages, (2) predicting if an extracted review argument and a rebuttal argument form an argument pair. The first subtask can be cast as a sequence labeling problem and the second one can be cast as a binary classification problem. One straightforward approach is to couple the two subtasks in a pipeline. However, such a pipeline approach learns two subtasks independently without sharing ample information. To address this limitation, the pioneering work (Cheng et al., 2020) employs a multi-task learning framework to train two subtasks simultaneously.
Figure 1: An example of APE task. The review and rebuttal passage pair is shown on the left. The grey area refers to non-arguments, while the blue and pink areas refer to two paired arguments. The table representing the pairing relation is shown on the right (filled entries: paired; unfilled entries: unpaired). Review and rebuttal sentence indices are on the left and the top of the table. Review and rebuttal sequence labels for argument extraction are on the right and the bottom of the table.

However, there are several shortcomings in the multi-task model. First, the review passage and its rebuttal passage are concatenated as a single passage to perform the argument extraction subtask with sequence labeling. It is obvious to see from Figure 1 that the review and rebuttal passages have their own styles in terms of structure and wording. Hence, it is not suitable to concatenate them as one long sequence, which is against the fact that they are two unique sequences in essence and hinders the model from well-utilizing their different characteristics. To overcome this limitation, we treat review and rebuttal passages as two individual sequences and design two sequence encoders for them respectively. In each sequence encoder, the sequence representations will be updated by the other's representations through mutual attention. It allows us to better distinguish two passages, and meanwhile, to conveniently exchange information between them through the attention mechanism.
Second, the subtask coordination capability of their multi-task framework is weak, as the two subtasks coordinate with each other only via the shared feature encoders, i.e., the sentence encoder for the sequence of word tokens and the passage encoder for the concatenation of sentences. Thus, the shared information between the two subtasks is only learned implicitly. To overcome this limitation, we propose an attention-guided multi-layer multi-cross (MLMC) encoding mechanism. Inspired by the table-filling approach (Miwa and Sasaki, 2014), we form a table that represents features for the Cartesian product of the review and rebuttal sequences by utilizing both of their embeddings, as shown in the right portion of Figure 1. The table representations are updated with the incorporation of the two sequence representations and, in return, help to update the mutual attention mentioned above. It is named a multi-cross encoder because these three encoding components (i.e., one table and two sequences) interact with each other explicitly and extensively. By stacking multiple encoder layers, the two subtasks can further benefit each other. In addition, we design an auxiliary attention loss to guide each argument to refer to its paired arguments. This additional loss not only enhances model performance, but also significantly improves attention interpretability.
To summarize, the contributions of this paper are three-fold. Firstly, we apply the table-filling approach to model the sentence-level correlation between two passages with multiple sentences for the first time. Secondly, on the model side, we propose an MLMC encoder to explicitly learn the useful shared information in the two passages. Furthermore, we introduce an auxiliary attention loss, which is able to further improve the efficacy of the mutual attentions. Thirdly, we evaluate our model on the benchmark dataset (Cheng et al., 2020), and the results show that our model achieves a new state-of-the-art performance on the APE task.

Related Work
Argument mining has wide applications in the educational domain, including persuasive essays (Stab and Gurevych, 2017; Eger et al., 2017), scientific articles (Teufel et al., 2009; Guo et al., 2011), writing assistance (Zhang and Litman, 2016), essay scoring (Persing and Ng, 2015; Somasundaran et al., 2016), peer reviews (Hua et al., 2019), etc. Unlike previous works, Cheng et al. (2020) introduce a new task named APE in the domain of peer review and rebuttal, which intends to extract the argument pairs from two passages simultaneously. Table-filling approaches (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017) have been proposed for the joint task of named entity recognition (NER) and relation extraction (RE). In those works, the diagonal entries of the table hold the words' entity types and the off-diagonal entries hold the relation types with other words. More recently, various table-filling models have been proposed for different tasks. Wang and Lu (2020) propose to learn two separate encoders (a table encoder and a sequence encoder) that interact with each other for the joint NER and RE task. Wu et al. (2020) propose a grid tagging scheme to address the aspect-oriented fine-grained opinion extraction task. Compared to our model, one major difference is the table shape. In their tables, the row and column represent the same sequence, so the table is square. In our model, the table is rectangular: the row and column represent two different sequences with different lengths. Another clear difference is that each entry in their tables captures a word-pair relation, whereas each entry in our table captures a sentence-pair relation. As we can see from Figure 1, the review/rebuttal sequence consists of a list of sentences, so extra effort is required to learn comprehensive sentence representations.

Task Formulation
In this paper, we tackle the APE task, which aims to study the internal structure and relations between two passages, e.g., review and rebuttal passages. For example, as shown in Figure 1, given a pair of review passage $s_{rv} = [s_{rv,1}, \cdots, s_{rv,12}]$ (in the red box) and rebuttal passage $s_{rb} = [s_{rb,1}, \cdots, s_{rb,7}]$ (in the orange box), we intend to automatically extract all argument pairs between them. First, for the argument mining subtask, we cast it as a sentence-level sequence labeling problem following the work (Cheng et al., 2020) using the standard BIO scheme (Ramshaw, 1995; Ratinov and Roth, 2009). This subtask segments the argumentative units (highlighted in blue/pink) from non-argumentative units (highlighted in grey) for each passage. The label sequences for the review passage and the rebuttal passage are shown in the right portion of Figure 1. Second, the sentence pairing subtask predicts whether two sentences belong to one argument pair. Here, we formulate it as a table-filling problem following the work (Miwa and Sasaki, 2014). Take the 8th review sentence $s_{rv,8}$ in the first review argument as an example: the rebuttal argument sentences $\{s_{rb,2}, s_{rb,3}, s_{rb,4}, s_{rb,5}\}$ forming sentence pairs with it are filled with green, as shown in the table. With the collaboration of these two subtasks, we can perform the overall argument pair extraction task. In this case, two argument pairs (highlighted in blue/pink in the two passages) are extracted, which correspond to the two green rectangles shown in the table.

Model Overview

Figure 2 shows our proposed attention-guided multi-layer multi-cross (MLMC) encoding based model. The model mainly consists of three parts: a sentence embedder, an n-layer multi-cross encoder, and a predictor. The review sentences and rebuttal sentences first go through the sentence embedder separately to obtain their sentence embeddings. We then utilize the representations of the review and rebuttal sequences to form a table, as shown earlier in Figure 1. Next, the representations of the table and the two sequences are updated through n multi-cross encoder layers. Finally, the model predicts the review and rebuttal arguments through a conditional random field (CRF) (Lafferty et al., 2001) layer based on the two sequence representations, and extracts the pairing information through a multi-layer perceptron (MLP) based on the table representations.
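To make the supervision signals of the two subtasks concrete before detailing the model components, here is a minimal sketch (illustrative code of ours, not the authors' implementation; the span indices are adapted loosely from the Figure 1 example) of how gold BIO labels and the gold pairing table can be built from annotated argument spans and pairs:

```python
import numpy as np

def build_targets(num_rv, num_rb, rv_spans, rb_spans, pairs):
    """Build gold BIO labels for both passages and the gold pairing table.

    rv_spans/rb_spans: lists of (start, end) sentence spans (inclusive, 0-based)
    pairs: list of (rv_span_idx, rb_span_idx) paired-argument indices
    """
    def bio(num_sents, spans):
        labels = ["O"] * num_sents
        for start, end in spans:
            labels[start] = "B"
            for i in range(start + 1, end + 1):
                labels[i] = "I"
        return labels

    y_rv, y_rb = bio(num_rv, rv_spans), bio(num_rb, rb_spans)

    # I x J table: entry (i, j) = 1 iff review sentence i and rebuttal
    # sentence j belong to a paired review/rebuttal argument.
    table = np.zeros((num_rv, num_rb), dtype=np.int64)
    for rv_idx, rb_idx in pairs:
        (rs, re), (bs, be) = rv_spans[rv_idx], rb_spans[rb_idx]
        table[rs:re + 1, bs:be + 1] = 1
    return y_rv, y_rb, table

# Loosely the Figure 1 example (0-based): review sentences 8-10 pair with
# rebuttal sentences 2-5, and review 11-12 pair with rebuttal 6-7.
y_rv, y_rb, table = build_targets(
    num_rv=12, num_rb=7,
    rv_spans=[(7, 9), (10, 11)], rb_spans=[(1, 4), (5, 6)],
    pairs=[(0, 0), (1, 1)])
```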

Sentence Embedder
The bottom left part of Figure 2 shows our sentence embedder, whose input is a review sentence or a rebuttal sentence with $l$ tokens, $s = [t_0, t_1, \cdots, t_{l-1}]$. We obtain the pre-trained BERT (Devlin et al., 2019) token embeddings $[x_0, x_1, \cdots, x_{l-1}]$ for all word tokens in the sentence, after which all token embeddings are fed into a bidirectional long short-term memory (biLSTM) (Hochreiter and Schmidhuber, 1997) layer. The last hidden states from both directions are concatenated as the sentence embedding $S^{(0)}$. A more common practice is to use the [CLS] token embedding as the sentence embedding. However, given the high density of scientific terms and the correspondence between review and rebuttal, token-level information is naturally crucial for the task. The same conclusion is drawn from the experimental results in the previous work (Cheng et al., 2020).

Figure 2: Overview of our model architecture with n multi-cross encoder layers (shown on the left). The sentence embedder (in the grey dotted box at the bottom left) shows the process of obtaining initial review and rebuttal sentence embeddings from pre-trained BERT token embeddings with biLSTM. The kth multi-cross encoder layer (in the blue dotted box on the right) shows the process of getting the sentence representations of review and rebuttal and the pair representations for the next layer.
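As an illustration, a minimal PyTorch sketch of such a sentence embedder might look as follows (our own sketch, assuming frozen BERT token embeddings from the transformers library; the authors' code may differ):

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentenceEmbedder(nn.Module):
    """Embed one sentence: BERT token embeddings -> biLSTM -> concatenation
    of the last hidden states from both directions."""

    def __init__(self, hidden_dim=256, bert_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # pre-trained token embeddings, kept frozen here
            x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.bilstm(x)                 # h_n: (2, batch, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2*hidden_dim)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(["Thank you for your comments."], return_tensors="pt",
                truncation=True, max_length=200)
s0 = SentenceEmbedder()(enc["input_ids"], enc["attention_mask"])
```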

Multi-Cross Encoder
The entire multi-cross encoder consists of n layers. The details of each multi-cross encoder layer are shown in the blue dotted box on the right of Figure 2. The input of the layer includes table representations and two sequence representations, i.e., review and rebuttal sequence representations. In each layer, table features are updated by sequence features and vice versa.
Sequence Encoder Phase I To well utilize the different characteristics of review and rebuttal, we regard them as two individual sequences. The two sequence embeddings $S^{(k-1)}_{rv}$ and $S^{(k-1)}_{rb}$, of lengths $I$ and $J$ respectively (i.e., the output from the previous layer), are passed through the same biLSTM layer (colored light yellow in Figure 2). Taking the review sequence as an example, the review hidden states at position $i$ are updated as follows:

$$\bar{S}^{(k)}_{rv,i} = \mathrm{biLSTM}\big(S^{(k-1)}_{rv}\big)_i$$

The rebuttal hidden states $\bar{S}^{(k)}_{rb}$ in layer $k$ are obtained from the same biLSTM in the same manner.

Table Encoder The table features from the previous layer, $T^{(k-1)}$, are combined with the two sequence hidden states through concatenation and linear projection with layer normalization:

$$\tilde{T}^{(k)}_{i,j} = \mathrm{LayerNorm}\big(W\,[\bar{S}^{(k)}_{rv,i};\, \bar{S}^{(k)}_{rb,j};\, T^{(k-1)}_{i,j}]\big)$$

The entry $T^{(k-1)}_{i,j}$ at row $i$ and column $j$ represents specific features between the review sentence at position $i$ and the rebuttal sentence at position $j$. The table hidden states $T^{(k)}_{i,j}$ are then updated through a 2D-GRU:

$$T^{(k)}_{i,j} = \mathrm{2D\text{-}GRU}\big(\tilde{T}^{(k)}_{i,j},\, T^{(k)}_{i-1,j},\, T^{(k)}_{i,j-1}\big)$$

The 2D-GRU settings are similar to the previous work (Wang and Lu, 2020), except that the table to be processed is not necessarily square ($I \neq J$ in general); therefore, the 2D-GRU implemented here is more general. The table hidden states $T^{(k)}_{rv \times rb}$ of layer $k$ are further exploited by the mutual attention mechanism explained below to update the review and rebuttal sequence embeddings.
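A minimal PyTorch sketch of this table encoder, under our own assumptions about the exact update equations (the combination of the left and top hidden states in particular is assumed), might look as follows:

```python
import torch
import torch.nn as nn

class TableEncoderLayer(nn.Module):
    """One table-update step: concat + projection + LayerNorm, then a
    2D-GRU sweep over the (possibly non-square) I x J table."""

    def __init__(self, seq_dim, table_dim):
        super().__init__()
        self.proj = nn.Linear(2 * seq_dim + table_dim, table_dim)
        self.norm = nn.LayerNorm(table_dim)
        # GRUCell input: projected entry; hidden: mix of left/top states.
        self.cell = nn.GRUCell(table_dim, table_dim)
        self.mix = nn.Linear(2 * table_dim, table_dim)

    def forward(self, s_rv, s_rb, t_prev):
        I, J, d = s_rv.size(0), s_rb.size(0), t_prev.size(-1)
        # Entry (i, j) combines review sentence i, rebuttal sentence j,
        # and the previous layer's table entry.
        grid = torch.cat([s_rv.unsqueeze(1).expand(I, J, -1),
                          s_rb.unsqueeze(0).expand(I, J, -1), t_prev], dim=-1)
        tilde = self.norm(self.proj(grid))
        t_new = torch.zeros(I, J, d)
        for i in range(I):
            for j in range(J):
                left = t_new[i, j - 1] if j > 0 else torch.zeros(d)
                top = t_new[i - 1, j] if i > 0 else torch.zeros(d)
                h = torch.tanh(self.mix(torch.cat([left, top])))
                t_new[i, j] = self.cell(tilde[i, j].unsqueeze(0),
                                        h.unsqueeze(0)).squeeze(0)
        return t_new
```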

Mutual Attention
The mutual attention mechanism (shown as the review attention and rebuttal attention modules in Figure 2) links the review, rebuttal, and table embeddings together, through which the review and rebuttal embeddings update each other with the help of the table features. The attention scores $\alpha^{(k)}_{i,j}$ and $\beta^{(k)}_{i,j}$ at position $(i, j)$ in layer $k$ are computed as follows:

$$\alpha^{(k)}_{i,j} = v_\alpha^{\top} T^{(k)}_{i,j}, \qquad \beta^{(k)}_{i,j} = v_\beta^{\top} T^{(k)}_{i,j}$$

where $v_\alpha$ and $v_\beta$ are learnable vectors. We further normalize the attention weights:

$$a^{(k)}_{i,j} = \frac{\exp\big(\alpha^{(k)}_{i,j}\big)}{\sum_{j'=1}^{J} \exp\big(\alpha^{(k)}_{i,j'}\big)}, \qquad b^{(k)}_{i,j} = \frac{\exp\big(\beta^{(k)}_{i,j}\big)}{\sum_{i'=1}^{I} \exp\big(\beta^{(k)}_{i',j}\big)}$$

Here, $a^{(k)}_{i,j}$ and $b^{(k)}_{i,j}$ are the normalized attention weights ranging from 0 to 1. We then take the weighted averages of the sentence representations:

$$\hat{S}^{(k)}_{rv,i} = \sum_{j=1}^{J} a^{(k)}_{i,j}\, \bar{S}^{(k)}_{rb,j}, \qquad \hat{S}^{(k)}_{rb,j} = \sum_{i=1}^{I} b^{(k)}_{i,j}\, \bar{S}^{(k)}_{rv,i}$$

Here, $\hat{S}^{(k)}_{rv}$ and $\hat{S}^{(k)}_{rb}$ are the updated review and rebuttal embeddings. Information in the review and rebuttal sequences is thus exchanged via mutual attention.
Sequence Encoder Phase II The addition and layer normalization used to combine $\bar{S}^{(k)}$ and $\hat{S}^{(k)}$ in the sequence encoder are similar to those in the table encoder. We obtain the review sequence embedding $S^{(k)}_{rv}$ and the rebuttal sequence embedding $S^{(k)}_{rb}$ as the sequence outputs of layer $k$ as follows:

$$S^{(k)}_{rv} = \mathrm{LayerNorm}\big(\bar{S}^{(k)}_{rv} + \hat{S}^{(k)}_{rv}\big), \qquad S^{(k)}_{rb} = \mathrm{LayerNorm}\big(\bar{S}^{(k)}_{rb} + \hat{S}^{(k)}_{rb}\big)$$
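The mutual attention step together with the Phase II add-and-norm can be sketched as follows (again our own illustrative reading of the equations above, not the released implementation):

```python
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    """Cross-update review/rebuttal states via attention over the table."""

    def __init__(self, seq_dim, table_dim):
        super().__init__()
        self.v_alpha = nn.Linear(table_dim, 1, bias=False)  # review attention
        self.v_beta = nn.Linear(table_dim, 1, bias=False)   # rebuttal attention
        self.norm = nn.LayerNorm(seq_dim)

    def forward(self, s_rv, s_rb, table):
        # table: (I, J, table_dim); s_rv: (I, d); s_rb: (J, d)
        a = torch.softmax(self.v_alpha(table).squeeze(-1), dim=1)  # (I, J)
        b = torch.softmax(self.v_beta(table).squeeze(-1), dim=0)   # (I, J)
        s_rv_hat = a @ s_rb                   # review attends to rebuttal
        s_rb_hat = b.transpose(0, 1) @ s_rv   # rebuttal attends to review
        # Phase II: residual addition + layer normalization.
        return self.norm(s_rv + s_rv_hat), self.norm(s_rb + s_rb_hat), a, b
```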

Stacking Multi-Cross Encoder Layers
The updating process described above continues as the layer index grows from 1 to n. The table features are updated by both the review and rebuttal sequences, and each sequence updates the other via the table later on.

There are also residual connections between adjacent layers, which take the previous layer's output as the current layer's input and include it as part of the new embedding, making the system more robust. All three feature sets (i.e., the review sequence, the rebuttal sequence, and the table) are intertwined with each other, and information flows across the different components of the encoder. This is why the encoder is described as multi-layer multi-cross (MLMC).

Argument Pair Predictor
After the final multi-cross encoder layer, sequence features are used for argument mining and table features are used for pair prediction.

Argument Predictor
We adopt a CRF to predict argument sequence labels. The sequence labeling loss $\mathcal{L}_{seq}$ for both the review sequence $s_{rv}$ and the rebuttal sequence $s_{rb}$ in each instance is defined as:

$$\mathcal{L}_{seq} = -\big(\log p(y_{rv} \mid s_{rv}) + \log p(y_{rb} \mid s_{rb})\big)$$

where $y_{rv}$ and $y_{rb}$ are the review and rebuttal sequence labels (see Appendix A.1 for details).
During inference, the predicted sequence label is the one with the highest conditional probability given the original sequence:

$$\hat{y} = \arg\max_{y} p(y \mid s)$$

Pair Predictor We use an MLP to predict sentence pairs (see Appendix A.2 for details). The pairing loss $\mathcal{L}_{pair}$ for each instance is:

$$\mathcal{L}_{pair} = -\sum_{i,j} y^{pair}_{i,j} \log p\big(y^{pair}_{i,j} = 1 \mid s_{rv}, s_{rb}\big)$$

where $y^{pair}_{i,j}$ is 1 when $s_{rv,i}$ and $s_{rb,j}$ are paired, and 0 otherwise.
Following (Cheng et al., 2020), during evaluation, a pair of candidate spans ($[s_{rv,i_1}, \cdots, s_{rv,i_2}]$ and $[s_{rb,j_1}, \cdots, s_{rb,j_2}]$) form an argument pair if the sentence-pair predictions over their Cartesian product satisfy a thresholding criterion.

Attention Loss The attention loss is a loss term specifically designed for this task. It aims to increase the effectiveness of the review attention and rebuttal attention discussed above. Even without this auxiliary loss term, sentences in the review are supposed to attend to relevant sentences in the rebuttal and vice versa. The auxiliary loss therefore augments this mutual reference explicitly by guiding the paired arguments to refer to each other. Intuitively, under the settings of argument mining and pairing, it is natural that review arguments refer to their paired rebuttal arguments to update their embeddings and vice versa during mutual attention. Hence, we introduce an auxiliary loss term to increase the attention weights computed for paired arguments and to decrease them otherwise, for both review and rebuttal attentions in all layers. For each instance, $\mathcal{L}_{attn}$ sums a per-layer penalty on the attention weights over all $n$ layers, where each layer's contribution is weighted by an exponential moving average with decaying parameter $\gamma$: larger weights are assigned to layers closer to the final predictor, as they are more related to the final prediction. Defining the attention loss as a summation across all layers increases the accuracy and interpretability of both the review and rebuttal attentions in all layers. If the tendency to attend to the paired argument is augmented, the benefits of the attention mechanism can be further exploited (e.g., learning better sentence representations, increasing pair prediction accuracy). The overall loss $\mathcal{L}$ is then defined by summing the three losses:

$$\mathcal{L} = \mathcal{L}_{seq} + \lambda_1 \mathcal{L}_{pair} + \lambda_2 \mathcal{L}_{attn}$$

where $\lambda_1$ and $\lambda_2$ are tuned hyperparameters.
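To illustrate how the three terms could be combined in training code, here is a hedged sketch; the exact form of the attention loss is our assumption (a layer-decayed binary cross-entropy on attention weights against the gold pair table), since it is described above only in prose:

```python
import torch
import torch.nn.functional as F

def attention_loss(attn_per_layer, pair_table, gamma=0.9):
    """Assumed form: push attention weights toward the gold pair table,
    weighting layer k (1..n) by gamma**(n - k) so later layers count more."""
    n = len(attn_per_layer)
    loss = 0.0
    for k, (a, b) in enumerate(attn_per_layer, start=1):  # (I, J) weights
        target = pair_table.float()
        layer_loss = F.binary_cross_entropy(a.clamp(1e-8, 1.0), target) \
                   + F.binary_cross_entropy(b.clamp(1e-8, 1.0), target)
        loss = loss + gamma ** (n - k) * layer_loss
    return loss

def overall_loss(l_seq, l_pair, attn_per_layer, pair_table,
                 lam1=0.5, lam2=2.0, gamma=0.9):
    # L = L_seq + lambda_1 * L_pair + lambda_2 * L_attn
    return l_seq + lam1 * l_pair + lam2 * attention_loss(
        attn_per_layer, pair_table, gamma)
```

The default values lam1=0.5, lam2=2.0, and gamma=0.9 follow the tuned hyperparameters reported in Appendix B.1.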

Data
We conduct experiments on the benchmark RR dataset (Cheng et al., 2020) to evaluate the effectiveness of our proposed model. The RR dataset includes 4,764 pairs of peer reviews and author rebuttals collected from ICLR 2013 to ICLR 2020. Two dataset versions are provided: RR-Submission-v1 and RR-Passage-v1. In RR-Submission-v1, multiple review-rebuttal passage pairs of the same paper submission are kept within the same split (train, dev or test), while in RR-Passage-v1, different review-rebuttal passage pairs of the same submission may be placed in different splits. We further modify the RR-Submission-v1 dataset by fixing some minor bugs in the labels, and name the result RR-Submission-v2. The data are split into train, dev and test sets with a ratio of 8:1:1 for all three dataset versions.

Baselines
We compare our model with two baselines:

• The pipeline approach used as a baseline model in the previous work (Cheng et al., 2020). It trains the two subtasks independently and then pipes them together to extract argument pairs.

• The multi-task learning model proposed by Cheng et al. (2020), which trains the two subtasks simultaneously via shared feature encoders.

Experimental Settings
We implement our attention-guided MLMC encoding based model in PyTorch. The dimension of the pre-trained BERT sentence embeddings is 768 by default. The maximum number of BERT tokens for each sentence is set to 200. The MLP layer is composed of 3 linear functions and 2 ReLU functions. We use Adam (Kingma and Ba, 2014) with an initial learning rate of 0.0002, and update parameters with a batch size of 1 and a dropout rate of 0.5. We train our model for at most 25 epochs. We select the best model parameters based on the best overall F1 score on the development set and apply them to the test set for evaluation. All models are run on a V100 GPU. Note that in this paper, the parameters are mainly tuned on RR-Submission-v1; more details about the hyperparameter settings (e.g., the weight for the pair loss λ1, the weight for the attention loss λ2, the decaying parameter γ of the exponential moving average) and experimental results (e.g., running time, number of parameters, performance on the development set) can be found in Appendix B. Following the previous work (Cheng et al., 2020), we report the precision (Prec.), recall (Rec.) and F1 scores on both subtasks as well as the overall extraction performance.

Main Results

Table 1 shows the performance comparison between our proposed models and the previous work on the RR-Submission-v1 and RR-Passage-v1 datasets. Note that the previous work adopts a negative sampling technique for the sentence pairing subtask and evaluates the performance on a partial test set; for a fair comparison, we re-evaluate the previous work's sentence pairing subtask on the whole test set and mark those results with * in Table 1. Besides the two baseline models mentioned before, we also implement a bi-cross encoding scheme (Bi-Cross) for comparison. The key difference between the bi-cross encoder and the multi-cross encoder is that in the bi-cross encoder, the review sentences and rebuttal sentences are concatenated as one sequence, so it has only one sequence encoder, whereas our multi-cross encoder has two individual sequence encoders. With the same number of layers, our multi-cross model outperforms the bi-cross model on both datasets, except for RR-Passage-v1 with 4 layers; the gap is especially conspicuous when the number of layers is 3. The superiority of the multi-cross model demonstrates the importance and robustness of learning the review and rebuttal sequences separately. Our model achieves the highest F1 score when the number of layers is 3. Adding more layers hurts the performance, probably because the model overfits with too many layers. Table 2 shows the performance on RR-Submission-v2. The main conclusion is consistent with the performance on RR-Submission-v1: both the bi-cross and multi-cross models outperform the multi-task model, and the multi-cross models further outperform the bi-cross models. Although the baselines achieve slightly better performance on the argument mining subtask than both the bi-cross model and the multi-cross model, they still perform worse than our models on the sentence pairing subtask and the overall APE task. This is plausibly due to two main reasons. First, in the multi-task model, the subtask coordination capability is weak, as the shared information between the two subtasks is learned implicitly. In our model, by contrast, the three encoding components are explicitly mingled with each other through the mutual attention mechanism and the table encoder.
On one hand, the better sentence pairing performance demonstrates the effectiveness of the table-filling approach; on the other hand, the better overall APE performance demonstrates the strong subtask coordination capability of our model architecture. Second, we further analyze the breakdown performance of the multi-task model and our multi-cross (n=3) model on the argument mining subtask. Figure 3 shows the subtask performance on the RR-Submission-v1 dataset for reviews, rebuttals, and both combined. We observe that the difference in F1 scores between reviews and rebuttals is smaller for our model than for the multi-task model. Despite the slight decrease in the overall argument mining performance, a more balanced argument extraction performance on reviews and rebuttals yields better overall APE performance, because more accurate review argument extraction increases the chance for the extracted rebuttal arguments to be paired correctly.

Ablation Study
We conduct an ablation study of the multi-cross (n=3) model on the RR-Submission-v1 dataset from three perspectives, as presented in Table 3. Firstly, we evaluate the effect of sharing the biLSTM layer (the light yellow modules in Figure 2) and the CRF layer: the F1 drops 1.92 without sharing the biLSTM layer, drops 1.75 without sharing the CRF layer, and drops 1.02 when sharing neither. Interestingly, when the two sequences use their own biLSTMs and CRFs simultaneously (i.e., without sharing either), the F1 drops less than in the models that stop sharing only one of them. This suggests that having an individual set of biLSTM and CRF layers for each type of sequence is plausibly a worthwhile setting, though not as effective as sharing both. One possible reason is that the advantage of such a tailor-made sequence tagging configuration for each passage type is overwhelmed by the disadvantage of fewer training instances per configuration. Secondly, without cross updates between the review and rebuttal embeddings (while the mutual attention modules still exist), the F1 drops 1.78. This result again demonstrates the effectiveness of explicitly blending the two sequence embeddings via the mutual attention mechanism specifically designed for this task. Thirdly, we investigate the effect of the attention loss term by removing it from the overall loss; the performance drops about 2.87 F1 points. We elaborate further with the attention visualization below.

Attention Visualization
To examine the effectiveness of the auxiliary attention loss, we visualize the sum of attention weights over all layers for four test samples, as shown in Figure 4. The sum is computed for visualization because the attention weights in all layers are guided by the attention loss. The distribution of attention is significantly improved, as the colors for arguments in Column (c) are considerably darker. In Column (b), without the guidance of the attention loss, the attention weights are distributed in a rather haphazard manner despite some patterns. The interpretability of our model is therefore much better, as we can easily understand which part of the discourse each sentence refers to. Specifically, the boundaries of most attention blocks in Column (c) match well with the start and end positions of the ground-truth review and rebuttal arguments. The gold and predicted argument spans and argument pairs of these four samples are shown in Appendix C.1, together with more discussion of the reasons for some mistakenly predicted boundaries. The effectiveness of the auxiliary attention loss is also quantitatively illustrated by the higher F1 score after its incorporation (32.44 vs. 29.57) in Table 3.
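A figure of this kind can be reproduced with a few lines of matplotlib; the sketch below (our own illustration, with hypothetical tensor names) sums the per-layer review-attention weights and renders them as a heatmap:

```python
import matplotlib.pyplot as plt
import torch

def plot_summed_attention(attn_per_layer, path="attention.png"):
    """attn_per_layer: list of (I, J) review-attention weight tensors,
    one per multi-cross encoder layer."""
    summed = torch.stack(attn_per_layer).sum(dim=0)  # (I, J)
    fig, ax = plt.subplots()
    ax.imshow(summed.detach().numpy(), cmap="Greys")  # darker = more attention
    ax.set_xlabel("rebuttal sentence index")
    ax.set_ylabel("review sentence index")
    fig.savefig(path, bbox_inches="tight")
```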

Conclusions
In this paper, we adopt the table-filling approach for modeling the sentence-level correlation between two passages, and propose the attention-guided multi-layer multi-cross (MLMC) encoding scheme for the argument pair extraction (APE) task. Our model better captures the internal relations between a review and its rebuttal with two sequence encoders and a table encoder connected via a mutual attention mechanism. We also introduce an auxiliary attention loss to further improve the efficacy of the mutual attentions. Extensive experiments on the benchmark dataset demonstrate the effectiveness of our model architecture, which is potentially beneficial for other NLP tasks.

A.1 Argument Predictor
We cast the task of predicting argument spans as a sequence labeling problem. We adopt a conditional random field (CRF) (Lafferty et al., 2001) that assigns each label sequence a score. The probability of each label sequence (for both review and reply) is defined as follows:

$$p(y \mid s) = \frac{\exp\big(\mathrm{score}(s, y)\big)}{\sum_{y'} \exp\big(\mathrm{score}(s, y')\big)}$$

where $s$ represents the original sequence and $y$ is the gold label sequence encoding argument spans under the BIO scheme (Ramshaw, 1995; Ratinov and Roth, 2009). The score function is defined as:

$$\mathrm{score}(s, y) = \sum_{i} A_{y_{i-1}, y_i} + \sum_{i} F_{\theta_1}(s)_{i, y_i}$$

where $A$ is the matrix of trainable parameters representing transition scores within the CRF layer, and $F_{\theta_1}$ represents the emission scores obtained after feeding the review and rebuttal sequences into the multi-cross encoder with parameters $\theta_1$. The negative log-likelihood loss for both the review and reply sequences in each instance is then defined as:

$$\mathcal{L}_{seq}(A, \theta_1) = -\big(\log p(y_{rv} \mid s_{rv}) + \log p(y_{rb} \mid s_{rb})\big)$$
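As a hedged illustration (not the authors' code), an equivalent CRF loss and decoding step can be written with the pytorch-crf package, assuming emission scores produced by a linear layer on top of the multi-cross encoder outputs:

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

NUM_TAGS = 3  # B, I, O

class ArgumentPredictor(nn.Module):
    def __init__(self, seq_dim):
        super().__init__()
        self.emit = nn.Linear(seq_dim, NUM_TAGS)     # emission scores F_theta1
        self.crf = CRF(NUM_TAGS, batch_first=True)   # holds transition matrix A

    def loss(self, seq_feats, tags, mask):
        # Negative log-likelihood -log p(y|s); called once per sequence
        # (review and rebuttal) and summed.
        return -self.crf(self.emit(seq_feats), tags, mask=mask)

    def decode(self, seq_feats, mask):
        # Viterbi decoding: argmax_y p(y|s).
        return self.crf.decode(self.emit(seq_feats), mask=mask)
```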

A.2 Pair Predictor
Given the table features $T^{(k)}$ and an MLP layer, the probability that two sentences belong to an argument pair can be expressed as follows:

$$p\big(y^{pair} = 1 \mid (s_{rv}, s_{rb})\big) = \frac{1}{1 + \exp\big(-F_{\theta_2}(T^{(k)})\big)}$$

$F_{\theta_2}$ is a composite function of Linear and ReLU functions, whose final linear function has an output dimension of 1. The pairing loss $\mathcal{L}_{pair}(\theta_2, \theta_1)$ for each instance is then defined as:

$$\mathcal{L}_{pair}(\theta_2, \theta_1) = -\sum_{i,j} y^{pair}_{i,j} \log p\big(y^{pair}_{i,j} = 1 \mid s_{rv}, s_{rb}\big)$$

where $\theta_2$ are the parameters of the MLP layer. Note that the attention loss is a function of $\theta_1$, and the overall loss is a function of $\theta_1$, $A$ and $\theta_2$; the formulae in the main paper omit these parameters for brevity.
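A minimal sketch of this MLP pair predictor (the hidden width is our assumption; the paper only specifies 3 linear layers, 2 ReLUs, and a final output dimension of 1):

```python
import torch
import torch.nn as nn

class PairPredictor(nn.Module):
    """3 linear layers with 2 ReLUs; the final layer outputs a single logit."""

    def __init__(self, table_dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(table_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, table):
        # table: (I, J, table_dim) -> pairing probabilities (I, J)
        return torch.sigmoid(self.mlp(table).squeeze(-1))
```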

B.1 Hyperparameters
We manually tune the hyperparameter values for our proposed multi-layer multi-cross model: the weight for the pair loss λ1 from 0.3 to 0.7 with a step size of 0.1 (Table 4), the weight for the attention loss λ2 from 0.5 to 2.5 with a step size of 0.5 (Table 5), and the decaying parameter γ of the exponential moving average from 0.7 to 1 with a step size of 0.1 (Table 6). We select the best hyperparameters based on the best F1 score achieved on the development set and apply them to the test set for evaluation. Specifically, λ1 is set to 0.5, λ2 is set to 2, and γ is set to 0.9.

B.2 Running Time, Number of Parameters and Results on Development Set

Table 8 shows the running time, the number of parameters, and the results on the development set of our models on the RR-Submission-v1 dataset. For the bi-cross models, as the review and rebuttal sentences are concatenated as one sequence in a single sequence encoder during training, the sequences are generally longer; thus, the bi-cross models require a longer running time. As the number of layers increases, the performance on the development set improves, yet the performance on the test set becomes worse, plausibly because the model overfits.

B.3 MLP vs. CNN
We replace the MLP module with a convolutional neural network (CNN) to predict the pairs and compare their performance on the RR-Submission-v1 dataset. The comparison results are presented in Table 7. The theoretical advantage of CNN over MLP is that CNN is able to capture surrounding information with the help of kernels. However, the experimental results show that the convolutional structure performs worse than the simple MLP structure. Examining the kernel weights of the convolution layers, we observe no significant magnitude difference between the center weights and the peripheral weights. Taking a 3x3 kernel as an example, the center weight is the weight at the center grid cell, while the weights located in the remaining 8 cells are peripheral weights. Since the peripheral cells collectively carry similar per-cell magnitude, the CNN accords far more importance to the surrounding information (8 times more, in the case of a 3x3 kernel) than to the original grid cell. This overemphasis on surrounding information brings too much noise into the pair prediction, as sketched below.
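The kernel-weight check described above can be sketched as follows (our own illustration; the convolution module shown is hypothetical and freshly initialized, only to demonstrate the call):

```python
import torch
import torch.nn as nn

def center_vs_peripheral(conv: nn.Conv2d):
    """Compare mean |weight| at the kernel center against the periphery.

    Similar magnitudes imply the periphery (8 cells of a 3x3 kernel)
    collectively outweighs the center cell roughly 8 to 1.
    """
    w = conv.weight.detach().abs()            # (out_ch, in_ch, kH, kW)
    kh, kw = w.shape[-2] // 2, w.shape[-1] // 2
    center = w[..., kh, kw].mean()
    mask = torch.ones_like(w, dtype=torch.bool)
    mask[..., kh, kw] = False                 # exclude the center cell
    peripheral = w[mask].mean()
    return center.item(), peripheral.item()

# Example with a hypothetical 3x3 pair-prediction convolution:
c, p = center_vs_peripheral(nn.Conv2d(64, 1, kernel_size=3, padding=1))
print(f"center |w| = {c:.4f}, peripheral |w| = {p:.4f}")
```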
C.1 Gold and Predicted Results

Each row of Table 9, which shows the exact gold and predicted results, corresponds to the respective row of Figure 4 in the main paper. We can see that instances (1) and (2) are perfectly predicted, whereas one predicted reply argument is shorter than the gold argument in instance (3), and some argument pairs are identified wrongly in instance (4).

Table 9: Gold and predicted review arguments, rebuttal arguments and argument pairs for four test data samples.

The attention distribution turns out to be strongly connected with the final output of the model, as the attention weights exhibit exactly the same errors as the wrongly predicted argument spans and argument pairs. In instance (3), we can see from the attention visualization that the review argument at position 15 only refers to the reply sentences from position 14 to 16; the wrong prediction of the reply span (14, 16) (gold: (14, 26)) directly results from the inaccurate distribution of attention weights. In Figure 4 row (4), as highlighted in red, it can also be noticed that some review arguments attend to the wrong rebuttal argument and some rebuttal arguments attend to the wrong review argument. The attention blocks in Figure 4 row (4) are (8, 9)-(2, 7), (10, 12)-(8, 9) and (13, 13)-(10, 10), and the wrongly predicted argument pairs are exactly (8, 9)-(2, 7), (10, 12)-(8, 9) and (13, 13)-(10, 10). Across all four test instances, we conclude that there is a one-to-one correspondence between the predicted paired arguments and the distribution of attention weights. Therefore, the remaining hindrance to further improving the model performance lies in inaccurately allocated attention weights.

C.2 Breakdown by Argument Density
We further evaluate the multi-cross (n=3) model performance on RR-Submission-v1 across different numbers of argument pairs per instance. Figure 5 shows the argument mining performance on review and rebuttal separately, as well as the overall APE performance. Their F1 scores all increase as the number of argument pairs grows from 1 to 4 and reach plateaus afterwards. The likely reason is that most review-rebuttal pairs with about 4 argument pairs are written in a more formatted manner and are hence easier to extract. When the number of argument pairs is smaller than 3, it is highly likely that the authors reply to only one or two review arguments; the irregular format may increase the difficulty of pair extraction. When the number of argument pairs is larger than average, the F1 score of APE decreases slightly as the structure becomes more complicated.

In addition, we can see from Figure 5 that when the number of argument pairs ranges from 2 to 6, the F1 scores of the argument mining subtask on review and rebuttal are very close. Compared to the multi-task model in the previous work (Cheng et al., 2020), our model's argument mining performance on review and rebuttal is more balanced, which leads to better overall APE performance.