Probing Graph Decomposition for Argument Pair Extraction

Argument pair extraction (APE) aims to extract interactive argument pairs from two passages within a discussion. The key challenge of APE is to effectively capture the complex context-aware interactive relations of arguments between the two passages. In this paper, we elicit relational semantic knowledge from large-scale pre-trained language models (PLMs) via a probing technique. The induced sentence-level relational probing graph can help capture rich explicit interactive relations between argument pairs effectively. Since the relevance score of a sentence pair within a passage is generally larger than that of a sentence pair from different passages, each sentence tends to propagate information within its own passage and under-explore the interactive relations between the two passages. To tackle this issue, we propose a graph decomposition method that decomposes the probing graph into four sub-graphs from intra- and inter-passage perspectives, where the intra-passage graphs help detect argument spans within each passage and the inter-passage graphs help identify the argument pairs between the review and rebuttal passages. Experimental results on two benchmark datasets show that our method achieves substantial improvements over strong baselines for APE.


Introduction
Dialogical argumentation, which focuses on the analysis of argumentation in debates or discussions, has rapidly emerged as a hot research topic in recent years. Argument pair extraction (APE) (Cheng et al., 2020) is a new and challenging task in the field of dialogical argumentation, which aims to extract interactive argument pairs from two argumentative passages within a discussion. As illustrated in Figure 1, a peer review process involves rich interactive arguments, with each argument consisting of several consecutive sentences. An argument in the review can form an argument pair with the corresponding argument in the rebuttal that discusses the same topic.

[Figure 1: An example of APE where a review passage and its corresponding rebuttal passage are shown on the top and bottom. Rev:Arg-i/Rep:Arg-i denotes the i-th argument in the review/rebuttal, which form the i-th argument pair. The white area refers to non-argument sentences, while the green and yellow areas refer to argument pairs.]
The core of APE is to detect the arguments within each passage and construct the relations between interactive arguments in the two passages. Most existing works (Cheng et al., 2020, 2021; Bao et al., 2022) apply powerful encoders, such as table encoders and attention mechanisms, to learn sentence-level semantic representations for implicitly modeling relationships between argument pairs. However, as revealed in (Cheng et al., 2021), the sentence representations learned by pure attention-based methods struggle to effectively capture the complicated relations between sentences from different passages. Bao et al. (2021) explicitly established argument links based on co-occurring words within sentence pairs and verified the importance of word-level relations among arguments. Nevertheless, they ignore the fact that the sentence pairs within argument pairs generally contain semantically similar words, such as "space" and "interval" in the sentence pair illustrated in Figure 2. Although such semantic-aware relational information is already contained in the continuous representations of PLMs, neural networks lack an effective mechanism to benefit from it.
To address the aforementioned issues, we propose a novel ProbIng Graph dEcompositiON (PIGEON) framework for argument pair extraction, which exploits explicit semantic knowledge induced from large-scale PLMs. Specifically, we employ a two-stage masked language modeling process to construct sentence-level relation graphs between sentences (global linguistic properties) based on the number of highly similar word pairs within each sentence pair. The key idea behind this probing method is that we obtain similar sentence-pair representations when we mask out one word from a word pair with high similarity. The sentence-level relation graph is essential to effectively identify argument pairs.
Since the review and rebuttal passages have different writing styles and word distributions, the learned sentence-level probing graph may under-explore the interactive relations between the two passages. In particular, the relevance score of each sentence pair within a passage is generally larger than that of sentence pairs from different passages. For example, as shown in Figure 1, the tenth review sentence contains more semantically similar words with the eleventh review sentence than with the second rebuttal sentence (further analysis is presented in Appendix A.1). Consequently, the review sentences tend to propagate information within the same passage and under-explore the interactive relations between the two passages. To effectively capture argument relations across passages, we decompose the sentence-level probing graph into four sub-graphs from intra- and inter-passage perspectives. The intra-passage graphs help detect the argument spans within each passage, while the inter-passage graphs are used to identify the argument pairs from different passages. To further improve the performance of our method, we also design an auxiliary graph contrastive loss to weaken the impact of noisy edges introduced by the probing procedure.
Our contributions can be summarized as follows. (1) We propose a probing technique to elicit semantic-aware relational knowledge from large-scale PLMs for constructing a sentence-level probing graph. (2) We decompose the sentence-level probing graph into four sub-graphs from intra- and inter-passage perspectives so as to effectively detect argument spans within each passage via the intra-passage graphs and identify argument pairs across the two passages via the inter-passage graphs. (3) We conduct experiments on two APE benchmark datasets, and the results show that our method outperforms strong baselines by a noticeable margin.

Methodology
Following previous works (Cheng et al., 2020; Bao et al., 2021), we aim to automatically extract interactive argument pairs from the review and rebuttal passages by casting argument mining and argument pair extraction as two sentence-level sequence labeling problems using the standard BIOES scheme (Ratinov and Roth, 2009). Formally, given a review passage S^v = {s^v_1, …, s^v_m} consisting of m sentences and a rebuttal passage S^u = {s^u_1, …, s^u_n} consisting of n sentences, we first identify argument spans within the review and rebuttal passages, obtaining a review argument span set X^v = {x^v_1, x^v_2, …} and a rebuttal argument span set X^u = {x^u_1, x^u_2, …}, where x^v_i and x^u_j are sentence-level spans in the review and rebuttal passages, respectively. Then, we extract the paired arguments from the review and rebuttal passages, collecting a set of interactive argument pairs P = {p_1, p_2, …}, where p = (x^v_i, x^u_j) is an interactive argument pair.
As illustrated in Figure 2, PIGEON contains four components: a sentence representation learning module, a probing graph construction module, a graph decomposition module, and an argument pair prediction module. Next, we describe each component of PIGEON in detail.

Sentence Representation Learning
We first apply BERT (Devlin et al., 2018) and a bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) to encode the sentence-level dependencies within each passage. Specifically, we feed each sentence s_i into BERT and obtain the sentence embedding e_i ∈ R^{d_b} by mean pooling over all token representations, where d_b is the vector dimension of the last layer of BERT. A BiLSTM is then utilized to encode the sentence representations {e_1, e_2, …} within each passage into the contextual sentence representations {h_1, h_2, …}, where h_i ∈ R^d and d is the hidden size of the BiLSTM. We denote the contextual representations of the review and rebuttal passages as H^v and H^u, respectively.

Probing Graph Construction
To effectively capture the relation between each sentence pair, we need to detect the semantically similar words within the sentence pair. There are many possible ways to derive semantically similar words, such as WordNet (Miller, 1995), Word2Vec (Mikolov et al., 2013), and GloVe (Pennington et al., 2014). However, these methods generally focus on context-free word similarity and ignore the context, failing to handle words with multiple meanings in different contextual scenarios. In this paper, we elicit explicit semantic knowledge from large-scale pre-trained language models (PLMs) and build a relation graph between each sentence pair through a probing procedure. It is worth noting that such semantic knowledge is extracted in an unsupervised and offline manner.
Formally, given a sentence pair (s_i, s_j), we first concatenate s_i and s_j into a single sequence s = ([CLS]; s_i; [SEP]; s_j; [SEP]), where [CLS] and [SEP] represent the classification and separation tokens, respectively. Then, we propose a probing approach with a masking technique to learn the semantic similarity between arbitrary word pairs from each sentence pair. We employ a two-stage masked language modeling (MLM) process to measure the impact a context word has on predicting another word. After the probing process, we can construct sentence-level relation graphs based on the number of highly similar word pairs within each sentence pair. The key idea behind the probing process is that we obtain similar sentence-pair representations when we mask out one word from a word pair with high similarity. Concretely, we replace the k-th token w^i_k in sentence s_i with a special mask token [MASK] and feed the resulting sequence s_{/w^i_k} into BERT to obtain the contextualized representation h_{/w^i_k} of the k-th token. To calculate the correlation between w^i_k and the t-th word w^j_t in sentence s_j, we further mask out w^j_t from s_{/w^i_k} to obtain the second corrupted sequence s_{/w^i_k w^j_t} and feed it into BERT. We use h_{/w^i_k w^j_t} to denote the new representation of word w^i_k when both w^i_k and w^j_t are masked out simultaneously.
After that, we measure the distance f(w^i_k, w^j_t) between h_{/w^i_k} and h_{/w^i_k w^j_t} to induce the semantic correlation between the k-th word w^i_k of s_i and the t-th word w^j_t of s_j. We use the Euclidean distance metric to implement the distance function f(·) due to its simplicity and effectiveness:

f(w^i_k, w^j_t) = ||h_{/w^i_k} − h_{/w^i_k w^j_t}||_2

Note that one word may be split into multiple tokens; in that case, we mask all tokens of the split-up word and apply mean pooling over the token representations to obtain the word representation.
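The two-stage masking procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: the hypothetical `encode` argument stands in for a BERT forward pass and may be any function that maps a token list to per-token contextual vectors.

```python
import numpy as np

MASK = "[MASK]"

def probing_distance(encode, tokens, k, t):
    """Two-stage MLM probing: compare the representation of position k
    when only token k is masked against the representation of position k
    when tokens k and t are both masked. A large Euclidean distance
    means token t strongly influences the prediction of token k, i.e.
    the word pair is treated as semantically related.

    `encode` is a stand-in for BERT: token list -> (len, dim) array."""
    once = list(tokens)
    once[k] = MASK
    h_k = encode(once)[k]            # h_{/w_k}: only w_k masked
    twice = list(once)
    twice[t] = MASK
    h_kt = encode(twice)[k]          # h_{/w_k w_t}: w_k and w_t masked
    return float(np.linalg.norm(h_k - h_kt))
```

With a real PLM, `encode` would run the masked sequence through BERT and return the last-layer hidden states; multi-token words would be masked jointly and mean-pooled, as the text describes.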
Word-level Similarity Matrix By repeating the two-stage MLM process on each pair of words (w^i_k, w^j_t) of the sentence pair (s_i, s_j), we obtain a word-level similarity matrix M for (s_i, s_j), where M_{k,t} = f(w^i_k, w^j_t) denotes the relation between the word pair (w^i_k, w^j_t). Then, we apply min-max normalization to reduce the impact of the range of correlation scores:

M̃_{k,t} = (M_{k,t} − min) / (max − min)

where max and min are the maximum and minimum similarity scores over all word pairs in the review and rebuttal passages.
Sentence-level Probing Graph We construct the sentence-level probing graph in which nodes are sentences. The sentence-level relation matrix is derived from the word-level similarity matrices of all sentence pairs obtained by the probing procedure. Specifically, we compute the relevance between each sentence pair (s_i, s_j) by counting the word pairs with high semantic similarity:

R_{i,j} = Σ_{k,t} I(M̃_{k,t} > σ)

where σ is a pre-defined threshold and I is the indicator function. By traversing all sentence pairs, we obtain the symmetric sentence-level relation matrix for the review and rebuttal passages. Intuitively, if two sentences have many semantically similar word pairs, the corresponding edge will have a large weight.
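The normalization and counting steps above can be sketched as a small numpy routine. This is a simplified reading of the text, assuming each sentence pair's raw word-level scores are available as a matrix; the function and variable names are illustrative, not the paper's.

```python
import numpy as np

def sentence_relation(M, lo, hi, sigma):
    """Min-max normalize one word-level score matrix M, then count the
    word pairs whose normalized score exceeds the threshold sigma.
    lo/hi are the global min/max over all word pairs in both passages."""
    M_norm = (M - lo) / (hi - lo)
    return int((M_norm > sigma).sum())

def build_probing_graph(word_matrices, sigma):
    """word_matrices[(i, j)] holds the raw word-level matrix of sentence
    pair (i, j); returns the symmetric sentence-level relation matrix R."""
    all_scores = np.concatenate([m.ravel() for m in word_matrices.values()])
    lo, hi = all_scores.min(), all_scores.max()
    n = max(max(i, j) for i, j in word_matrices) + 1
    R = np.zeros((n, n))
    for (i, j), M in word_matrices.items():
        R[i, j] = R[j, i] = sentence_relation(M, lo, hi, sigma)
    return R
```

The edge weight thus grows with the number of highly similar word pairs, matching the intuition stated in the text.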

Graph Decomposition
The review and rebuttal passages have different writing styles and word distributions. The relevance score of each sentence pair within a passage is generally larger than that of a sentence pair from different passages. To effectively capture argument relations across passages, we decompose the sentence-level probing graph into four sub-graphs from intra- and inter-passage perspectives. The intra-passage graphs help detect the argument spans within the review (or rebuttal) passage, while the inter-passage graphs are used to identify the argument pairs from the review and rebuttal passages. Formally, we decompose the sentence-level relation matrix R ∈ R^{(m+n)×(m+n)} of the two passages (S^v, S^u) into four sub-matrices R^{vv} ∈ R^{m×m}, R^{uu} ∈ R^{n×n}, R^{vu} ∈ R^{m×n} and R^{uv} ∈ R^{n×m}, as illustrated in the left part of Figure 2. Among these, R^{vv} and R^{uu} represent the intra-passage relation matrices of the review and rebuttal passages, respectively, while R^{vu} and R^{uv} denote the inter-passage relation matrices.
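The decomposition itself is block slicing of the joint relation matrix, which can be made concrete in a few lines (an illustrative sketch; variable names follow the text's notation):

```python
import numpy as np

def decompose(R, m, n):
    """Split the (m+n)x(m+n) sentence-level relation matrix of a review
    (m sentences) and rebuttal (n sentences) into the four intra-/
    inter-passage blocks described in the text."""
    assert R.shape == (m + n, m + n)
    R_vv = R[:m, :m]   # review   <-> review   (intra-passage)
    R_uu = R[m:, m:]   # rebuttal <-> rebuttal (intra-passage)
    R_vu = R[:m, m:]   # review   ->  rebuttal (inter-passage)
    R_uv = R[m:, :m]   # rebuttal ->  review   (inter-passage)
    return R_vv, R_uu, R_vu, R_uv
```

Since R is symmetric, R_uv equals the transpose of R_vu; keeping both blocks simply gives each passage its own view of the cross-passage relations.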

Intra-passage Graph Construction
The intrapassage graph of each passage takes the sentences within the passage as vertices.The embeddings of the vertices are initialized with the corresponding sentence representations.Then, we refine the edges and the corresponding weights by the relative positions between sentences and intra-passage relevance matrices.Specifically, given a review passage S v (or a rebuttal passage S u ), the edge weight A vv i,j for each sentence pair (s i , s j ) can be computed by: where τ is a pre-defined threshold.max(R vv i ) represents the maximum value of the i-th row of intrapassage relation matrix R vv .
Inter-passage Graph Construction The inter-passage graph of each passage is a bipartite graph, where edges only exist between sentences from different passages. The inter-passage adjacency matrix A^{vu} ∈ R^{m×n} of passage S^v is likewise obtained by thresholding, where max(R^{vv}) is the maximum value of the intra-passage relation matrix R^{vv}. Consequently, each passage derives two different graphs (i.e., an intra-passage graph and an inter-passage graph). Note that the intra- and inter-passage graphs of each passage are mutually independent even though they are derived from the same passage.
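One plausible reading of the two thresholding rules, consistent with the quantities named in the text (τ, the row-wise maximum max(R^vv_i) for intra edges, and the global maximum max(R^vv) for inter edges), can be sketched as follows. The exact formulas, including the relative-position term mentioned for intra edges, are an assumption here and are omitted for brevity:

```python
import numpy as np

def intra_adjacency(R_vv, tau):
    """Keep R_vv[i, j] only if it exceeds tau times the maximum of row i
    (max(R^vv_i) in the text); zero out all other entries."""
    row_max = R_vv.max(axis=1, keepdims=True)
    return np.where(R_vv > tau * row_max, R_vv, 0.0)

def inter_adjacency(R_vu, R_vv, tau):
    """Keep R_vu[i, j] only if it exceeds tau times the global maximum
    of the intra-passage matrix R^vv (max(R^vv) in the text)."""
    return np.where(R_vu > tau * R_vv.max(), R_vu, 0.0)
```

Referencing max(R^vv) in the inter-passage rule compensates for cross-passage scores being systematically smaller than within-passage scores, which is the imbalance the decomposition is designed to address.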
Relation-aware Sentence Representations We use graph convolutional networks (GCNs) to learn representations over the intra- and inter-passage relation graphs. Each node is updated according to the hidden representations of its neighborhoods based on the adjacency matrices of the intra- and inter-passage graphs. Given the intra- and inter-passage graphs of passage S^v, the l-th GCN block updates the node states using the normalized adjacency matrices Â^{vv} and Â^{vu} learned from A^{vv} and A^{vu}, where Z^{v,l−1} represents the sentence representations of the review passage produced by the (l−1)-th GCN block. The node representations of the first GCN layer are initialized as Z^{v,0} = H^v. For simplicity, we denote the final output of the GCN blocks as Z^v. In this way, we obtain the updated representation Z^v_i for the i-th sentence of passage S^v by integrating the representations of its neighbouring nodes within the intra- and inter-passage graphs. Similarly, we can compute the updated sentence representations Z^u of passage S^u.
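A single GCN block of this kind can be sketched in numpy. This is one plausible form under stated assumptions: the intra- and inter-passage messages are combined additively with separate weight matrices, and the paper's unspecified adjacency normalization is replaced with simple row normalization.

```python
import numpy as np

def row_normalize(A):
    """Row-normalize an adjacency block so each node averages over its
    neighbours (a simple stand-in for the paper's normalization)."""
    d = A.sum(axis=1, keepdims=True)
    return np.divide(A, d, out=np.zeros_like(A, dtype=float), where=d > 0)

def gcn_block(Z_v, Z_u, A_vv, A_vu, W1, W2):
    """One update of the review-side node states: aggregate intra-passage
    neighbours (A_vv @ Z_v) and inter-passage neighbours (A_vu @ Z_u),
    apply per-branch weights, then a ReLU nonlinearity."""
    msg = row_normalize(A_vv) @ Z_v @ W1 + row_normalize(A_vu) @ Z_u @ W2
    return np.maximum(msg, 0.0)
```

Stacking L such blocks (with the rebuttal side updated symmetrically) yields the final relation-aware representations Z^v and Z^u used by the taggers.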

Argument Pair Prediction
After decomposing the probing graphs, the updated sentence representations are used for argument mining and argument pair extraction, following previous works (Cheng et al., 2020; Bao et al., 2021).
Argument Mining We adopt a BiLSTM sequence tagger followed by a CRF sequence tagger to identify all potential arguments. Concretely, we feed the sentence representations Z^v and Z^u into the BiLSTM tagger to learn the output hidden states O^v and O^u. Then, O^v and O^u are passed into the CRF tagger to predict the argument labels Ŷ^v = {ŷ^v_1, …, ŷ^v_m} and Ŷ^u = {ŷ^u_1, …, ŷ^u_n} for the review and rebuttal respectively, where ŷ_i is the BIOES label for the i-th sentence. Based on these two label sequences Ŷ^v and Ŷ^u, we can parse the predicted argument spans for the review and rebuttal passages, i.e., X̂^v = {x̂^v_1, x̂^v_2, …} and X̂^u = {x̂^u_1, x̂^u_2, …}, where x̂_i is the i-th predicted argument span. The sequence labeling loss L_AM for each instance is the sum of the CRF losses over the two passages, where Y^v and Y^u are the ground-truth label sequences of the review and rebuttal.
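Parsing a BIOES label sequence into sentence-level spans, the step that turns the predicted labels into the span sets X̂^v and X̂^u, can be illustrated with a small helper (an illustrative sketch, not the paper's code):

```python
def parse_bioes(labels):
    """Recover sentence-level argument spans (start, end, inclusive)
    from a BIOES label sequence: S marks a single-sentence argument,
    B...E marks a multi-sentence one, O is non-argument."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab == "S":
            spans.append((i, i))
            start = None
        elif lab == "B":
            start = i
        elif lab == "E" and start is not None:
            spans.append((start, i))
            start = None
        elif lab == "O":
            start = None  # an unmatched B without E is discarded
    return spans
```

For example, the sequence O B I E O S yields two spans: sentences 1 to 3 and the single sentence 5.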
Argument Pair Extraction With the predicted argument span sets (X̂^v and X̂^u) and the argument-specific sentence representations (Ẑ^v = Z^v + O^v and Ẑ^u = Z^u + O^u), we extract argument pairs with dual sequence taggers. Specifically, we first produce the representation a^v_k of the k-th extracted argument span x̂^v_k = (b_k, e_k) by mean pooling over the corresponding argument-specific sentence representations. Then, we concatenate a^v_k to the argument-specific sentence representations Ẑ^u of rebuttal S^u to obtain the argument-aware representations, where [·; ·] is the concatenation operation. We feed the learned representations into a BiLSTM tagger and a CRF tagger to predict the label sequence Ŷ^u_k marking the paired arguments from the rebuttal passage. Similarly, we perform the same procedure to capture, for the k-th rebuttal argument, its paired arguments from the review by predicting the label sequence Ŷ^v_k with another BiLSTM and CRF tagger. The sequence labeling loss L_APE of APE in each instance is defined analogously to L_AM.

Graph Contrastive Loss We introduce an auxiliary graph contrastive loss to weaken the impact of the noisy edges introduced by the probing procedure. Taking the intra- and inter-passage graphs of review S^v as an example, we follow an i.i.d. uniform distribution to randomly drop the noisy edges (non-argument pairs) in the graph with probability μ and generate auxiliary graph views with adjacency matrices Ã^{vv} = drop(Â^{vv}) and Ã^{vu} = drop(Â^{vu}) from the original graph. Note that the removal probabilities of the edges for the argument pairs are zero. Then, we feed the auxiliary graph views into the GCNs and produce the auxiliary updated node representations Z̃^v and Z̃^u for the review S^v and rebuttal S^u, respectively. After that, we employ a contrastive objective L_GCL to distinguish the representations of different views of the same node from the representations of different nodes, where Z = [Z^v, Z^u] represents the updated node representation matrices of passages S^v and S^u, ψ(·, ·) denotes the cosine similarity, and g(·) is a two-layer perceptron.
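One plausible instantiation of such an objective is an InfoNCE-style loss over the two graph views: for each node, the projected original view should be closer in cosine similarity to its own auxiliary view than to the auxiliary views of all other nodes. In the sketch below the projection g is the identity for brevity (the paper uses a two-layer perceptron), and the temperature is an added assumption:

```python
import numpy as np

def graph_contrastive_loss(Z, Z_tilde, g=lambda x: x, temp=1.0):
    """InfoNCE-style contrastive objective between two views of the same
    node set: the diagonal of the cosine-similarity matrix (same node,
    different views) forms the positives; all other pairs are negatives."""
    P, Q = g(Z), g(Z_tilde)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Q = Q / np.linalg.norm(Q, axis=1, keepdims=True)
    sim = P @ Q.T / temp                       # pairwise cosine similarities
    log_denom = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(log_denom - np.diag(sim)))  # -log softmax of diagonal
```

Since the positive term is one summand of each row's denominator, the loss is always non-negative and shrinks as the two views of each node align while different nodes separate.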
Joint Training Objective We minimize the joint loss L_joint by summing up the three training objectives:

L_joint = L_AM + L_APE + λ L_GCL

where λ is a tuned hyper-parameter controlling the impact of L_GCL.

Experimental Setup
Datasets We conduct experiments on the Review-Rebuttal (RR) dataset, a benchmark proposed by Cheng et al. (2020). The RR dataset includes 4,764 review-rebuttal pairs collected from ICLR 2013 to ICLR 2020. Two versions are provided: RR-Passage-v1 (RR-P) and RR-Submission-v2 (RR-S). Both RR-P and RR-S are split by the ratio of 8:1:1 into train, dev, and test sets. In the RR-P dataset, different review-rebuttal passage pairs of the same paper submission may be put into different sets, while in the RR-S dataset, multiple review-rebuttal passage pairs of the same submission are included in the same set. Since RR-S is more challenging than RR-P, we conduct further experiments on RR-S. The detailed statistics of RR-P and RR-S are summarized in Appendix A.2.

Evaluation Metrics Following previous works (Cheng et al., 2020; Bao et al., 2021), we adopt precision (Prec.), recall (Rec.) and F1 scores to measure the performance on both argument mining (AM) and argument pair extraction (APE).
Baselines To evaluate the effectiveness of PIGEON, we compare it with several strong baselines. PL-H-LSTM-CRF (Cheng et al., 2020) learns separate sequence labeling and sentence relation classification models, and then combines the two results to predict the argument pairs. Similar to PL-H-LSTM-CRF, MT-H-LSTM-CRF (Cheng et al., 2020) trains the two subtasks via a shared feature encoder in a multi-task learning manner. MLMC (Cheng et al., 2021) is an attention-guided model based on a table-filling approach. MGF (Bao et al., 2021) proposes a mutual guidance framework with an inter-sentence relation graph. MRC-APE (Bao et al., 2022) applies a machine reading comprehension framework with a Longformer (Beltagy et al., 2020) as the encoder, and is the state-of-the-art method on RR.
Implementation Details PIGEON is implemented in PyTorch on an NVIDIA TITAN RTX GPU. We apply the uncased BERT-base as our PLM. The AdamW optimizer (Loshchilov and Hutter, 2018) is employed for parameter optimization, and the initial learning rates for the BERT layer and the other layers are set to 1e-5 and 1e-3, respectively. Similar to previous works (Cheng et al., 2021), we set the batch size to 1 due to limited memory. The maximum numbers of GCN blocks on RR-S and RR-P are set to 5 and 3, respectively. The hidden size of the BiLSTMs is set to 256. In addition, the parameters of the BiLSTMs and CRFs used in the three taggers are not shared. All experiments are performed five times with different random seeds, and we report the averaged scores. Our code and data are available at https://github.com/syiswell/PIGEON.

Overall Performance
We report the overall performance of our proposed framework and the baseline methods in Table 1. Our method achieves the best performance on both RR-S and RR-P. On RR-S, our method outperforms the current state-of-the-art method (i.e., MRC-APE) by 2.94% in terms of F1 score on the APE subtask. On RR-P, our model also exceeds MRC-APE, obtaining an F1 score about 3.05% higher on the APE subtask. These experimental results verify the superiority of our method on the APE subtask. In addition, PIGEON is also more efficient than the baselines, as shown in Appendix A.4.
We also observe that the pipeline method (i.e., PL-H-LSTM-CRF) performs worse than the other end-to-end baselines because it may suffer from error propagation. The attention-based method (i.e., MLMC) achieves a significant improvement over MT-H-LSTM-CRF, since MLMC can implicitly model the argument correlation. The graph-based method (i.e., MGF) surpasses MLMC but underperforms MRC-APE. This may be because MGF only considers word overlap and explicitly constructs an incomplete argument correlation without considering semantic information. Our PIGEON performs better than all baselines by probing semantic knowledge from PLMs.

[Figure 4: The inter-passage adjacency matrix (a) and a word-level similarity matrix (b) of an example. The red blocks denote the ground-truth argument pairs. The word-level similarity matrix belongs to the sentence pair (9, 3) of the review and rebuttal passages.]
We further compare our probing method with several alternative word similarity measures (i.e., co-occurrence, Word2Vec, GloVe and WordNet) for detecting word pairs with high semantic similarity. The co-occurrence based method focuses on string matching. The Word2Vec and GloVe methods obtain the word-pair similarity by computing the cosine similarity of the word vectors. The WordNet method computes the word-pair similarity based on the shortest path connecting the two word senses in a taxonomy. After learning the similarity of arbitrary word pairs, we construct the inter- and intra-passage graphs in the same way as in our PIGEON method. We report the AM and APE results in Table 3. Our probing method performs significantly better than the compared word similarity measures by eliciting context-aware semantic knowledge from large-scale PLMs.

Case Study
We provide an exemplary case selected from the RR-S test set by visualizing the adjacency matrix of the inter-passage graphs, where the distribution of edge weights is similar to the distribution of the ground-truth argument pair labels. As shown in Figure 4b, the probing method can capture the semantic similarity between the word pair "observation" and "visible" with the help of the knowledge elicited from PLMs. We believe that PIGEON can probe rich semantic knowledge from PLMs, helping detect argument relations for APE.

Related Work
Argument Pair Extraction Most existing argument mining methods focus on modeling the arguments in monologues, such as argumentation structure parsing (Stab and Gurevych, 2014; Morio et al., 2020), argument quality assessment (Lauscher et al., 2020), and argumentation strategy modeling (Al Khatib et al., 2017). However, in real-life scenarios, arguments often take the form of dialogues. Several prior studies detect agreement and disagreement in online debates and discussions (Morio and Fujita, 2018; Chakrabarty et al., 2019; Ji et al., 2021). Subsequently, Cheng et al. (2020) introduced the argument pair extraction (APE) task in the domain of peer review and rebuttal, aiming to extract argument pairs from the review and rebuttal passages simultaneously. Cheng et al. (2021) applied an attention mechanism and a table-filling approach to implicitly model the interaction between argument pairs. To explicitly model the relations between argument pairs, Bao et al. (2021) proposed a mutual guidance framework with an inter-sentence graph for APE. Bao et al. (2022) explored a bidirectional machine reading comprehension (MRC) framework to capture the interactions between argument pairs. Different from previous works, we explicitly capture the relations between argument pairs by eliciting context-aware semantic knowledge from PLMs. In addition, we propose a graph decomposition method to deal with the issue that the review and rebuttal passages have different styles and word distributions.
Probing Knowledge from PLMs Recently, the success of PLMs has led to plenty of studies applying probing techniques to elicit rich knowledge from large-scale PLMs (Jawahar et al., 2019; Clark et al., 2019; Wu et al., 2020; Wang et al., 2022). A typical probing study investigates the knowledge and linguistic properties contained in PLMs, such as morphology (Belinkov et al., 2017), word sense (Reif et al., 2019), and syntax (Hewitt and Manning, 2019; Dai et al., 2021). The key idea behind these works is to define a precise task and then design a simple model (called a probe) to solve the task using the contextualized representations provided by PLMs. There are also some studies (Petroni et al., 2019; Zhong et al., 2021) that seek to answer to what extent PLMs store factual, relational and commonsense knowledge.

Conclusion
In this paper, we designed a probing technique to elicit semantic-aware relational knowledge from large-scale PLMs, which captures rich explicit interactive relations between argument pairs. In addition, we proposed a graph decomposition method to decompose the probing graph into four sub-graphs from intra- and inter-passage perspectives, which alleviates the issue that different participants may have different writing styles and word distributions in the APE task. Experimental results on two benchmark datasets showed that our method outperforms strong baselines significantly.

Limitation
To better understand the limitations of the proposed model, we carry out an analysis of the errors made by PIGEON. Specifically, we randomly select 100 instances that are incorrectly predicted by PIGEON and summarize the primary types of error. The first category is boundary prediction error. Since we model the APE task as sequence labeling, our model may recognize only part of an argument; conversely, multiple consecutive arguments may be identified as a single argument. The second type of error is caused by the absence of semantically similar words in an argument pair, in which case the proposed probing graphs cannot model the relations between the paired arguments. Third, errors occur when semantically similar words are also present in non-matching argument pairs, so that the argument relation may be misled by these words. This suggests that a more refined relation modeling method needs to be devised in the future to better infer argument relations. For example, we may leverage high-level topic information over argument pairs to guide the learning of relation-specific features.
In addition, the proposed probing approach may be computationally expensive; this can be alleviated by computing and caching the similarities of all word pairs once for the entire dataset. We will address this issue in future work.

A Appendix
A.1 Sentence Relevance Analysis In Figure 5, we randomly select 100 samples from the RR-S test set and show the distributions of the number of similar words and the similarities of sentences from intra- and inter-passage pairs. The number of similar words between sentences is computed by our probing procedure, while the sentence similarities are measured by the cosine similarity of sentence representations obtained from BERT. As shown in Figure 5a, the number of similar words between sentences within a passage is larger than that between sentences from different passages. In addition, as illustrated in Figure 5b, the sentence representations within a passage, on average, have higher similarity than those across passages. This may be because the review and rebuttal passages have different writing styles and word distributions.

A.2 Data Statistics
The detailed statistics of the RR-S and RR-P datasets are summarized in Table 4. Both RR-S and RR-P contain 4,764 review-rebuttal pairs collected from ICLR 2013 to ICLR 2020, split by the ratio of 8:1:1 into train, dev, and test sets. In the RR-P dataset, different review-rebuttal passage pairs of the same paper submission may be put into different sets, while in the RR-S dataset, multiple review-rebuttal passage pairs of the same submission are included in the same set. Thus, RR-S is more challenging than RR-P.

A.3 Hyperparameter Settings
We manually tune the hyperparameter values (e.g., the weight λ of the graph contrastive loss and the drop probability μ of noisy edges) on RR-S, and report the results in Table 5 and Table 6. The weight λ of the graph contrastive loss is tuned from 0.001 to 1 with a ratio of 10. The drop probability μ is tuned from 0.1 to 0.5 with a step size of 0.1.
Based on the results in Table 5 and Table 6, we set λ to 0.01 and μ to 0.1.

A.4 Computational Cost
Table 7 shows the training time, the testing time, the number of parameters, and the APE results of our model on the RR-S development set. As the number of GCN blocks increases, the performance on the development set improves, yet the performance on the test set becomes worse, implying that our model might suffer from overfitting as the number of GCN blocks grows. In addition, our PIGEON is more efficient than the baselines during inference owing to its fewer model parameters. Note that MRC-APE with fewer parameters

Figure 2: The architecture of PIGEON.

Figure 3: The impacts of graph parameters on RR-S.

Figure 5: Visualizing the distributions of the number of similar words and the similarities of sentences in the intra- and inter-passage settings.

Table 4: The statistics of the evaluated datasets, where SPA denotes sentences per argument.