Infusing Hierarchical Guidance into Prompt Tuning: A Parameter-Efficient Framework for Multi-level Implicit Discourse Relation Recognition

Multi-level implicit discourse relation recognition (MIDRR) aims to identify hierarchical discourse relations among arguments. Previous methods achieve improvements by fine-tuning PLMs. However, due to data scarcity and the task gap, the pre-trained feature space cannot be accurately tuned to the task-specific space, which even aggravates the collapse of the vanilla space. Besides, comprehending the hierarchical semantics of MIDRR makes the conversion even harder. In this paper, we propose a prompt-based Parameter-Efficient Multi-level IDRR (PEMI) framework to solve the above problems. First, we leverage parameter-efficient prompt tuning to drive the inputted arguments to match the pre-trained space and realize the approximation with few parameters. Furthermore, we propose a hierarchical label refining (HLR) method for the prompt verbalizer to deeply integrate hierarchical guidance into prompt tuning. Finally, our model achieves results comparable to strong baselines on PDTB 2.0 and 3.0 using only about 0.1% of their trainable parameters, and visualization demonstrates the effectiveness of our HLR method.


Introduction
Implicit discourse relation recognition (IDRR) (Pitler et al., 2009) is one of the most vital sub-tasks in discourse analysis; it aims to discover the discourse relation between two discourse arguments without the guidance of explicit connectives. Due to the lack of connectives, a model can only recognize relations through semantic clues and entity anaphora between the arguments, which makes IDRR a challenging task. A deeper understanding of this task benefits a series of downstream tasks such as text summarization (Li et al., 2020b), dialogue summarization (Feng et al., 2021) and event relation extraction (Tang et al., 2021). Meanwhile, discourse relations are annotated with multi-level labels. As shown in Figure 1, the top-level label of the argument pair is Comparison, while the sub-label Contrast is a fine-grained semantic expression of Comparison. Beyond that, when annotating an implicit relation, the annotators simulate adding a connective (here, Consequently). We regard these connectives as the bottom level of discourse relations.
Since pre-trained language models (PLMs) have been widely applied, IDRR has also achieved considerable improvement. However, previous work (Xu et al., 2018; Shi et al., 2018; Dou et al., 2021) has noted the data scarcity of IDRR: the data are insufficient for deep neural networks to depict the high-dimensional task-specific feature space accurately. Moreover, the hierarchical division of discourse relations is complex, and extracting hierarchical semantics relies on a large amount of data.
Previous studies (Xu et al., 2018; Dai and Huang, 2019; Kishimoto et al., 2020; Guo et al., 2020; Shi and Demberg, 2021) alleviate this problem with data augmentation or additional knowledge. However, there are several deficiencies: 1) annotating sufficient data and introducing appropriate knowledge is considerably difficult; 2) noisy data drive models to deviate from the target feature distribution, and unreasonable knowledge injection exacerbates the collapse of the feature space of PLMs.
Recently, some prompt tuning (PT) methods (Hambardzumyan et al., 2021; Li and Liang, 2021; Lester et al., 2021; Liu et al., 2021a; Zhang et al., 2022) have shown remarkable results in low-resource scenarios (i.e., parameter-efficient prompt tuning, PEPT). They freeze most or all parameters of the PLM and leverage a few additional parameters to restrict the approximation to a small manifold, thus reducing the dependency on the scale of data.
Inspired by the above, we leverage PEPT to drive the input to match the pre-trained feature space and present a Parameter-Efficient Multi-level IDRR framework (PEMI), which alleviates the under-training problem caused by data scarcity and infuses hierarchical guidance into the prompt verbalizer. Thus we can mine better context patterns guided by hierarchical label signals for IDRR. Generally, a prompt-based framework consists of two parts: template engineering and verbalizer engineering.
For the template formulation, instead of manually designing templates, we inject soft prompts into the template and treat them as learnable global context vectors that mine the unique patterns of arguments and adjust the input features to align with the target distribution in the pre-trained semantic space.
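As a concrete illustration, the following sketch shows one way soft-prompt placeholders can be scattered around an argument pair with a ⟨mask⟩ slot for the MLM; the function name and the exact layout are hypothetical, not the paper's actual template.

```python
def build_template(arg1, arg2, k=20):
    """Assemble a prompted input: k soft-prompt placeholders <P_i>
    scattered around the two arguments, with <mask> between them.
    The placeholders are later swapped for learnable embeddings."""
    front = " ".join(f"<P_{i}>" for i in range(k // 2))
    back = " ".join(f"<P_{i}>" for i in range(k // 2, k))
    return f"{front} {arg1} <mask> {back} {arg2} <sep>"

text = build_template("It was a breeze", "the wind was cold")
```

During training, only the embeddings bound to the `<P_i>` placeholders receive gradients; the arguments and the frozen PLM vocabulary stay untouched.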
However, this alignment is marginal on its own, so it is crucial to adopt a verbalizer for the masked language model (MLM), which maps several label words in the vocabulary to a specific category. But such a verbalizer has no means of learning the hierarchical connections among discourse relations. Besides, existing methods (Wu et al., 2020; Chen et al., 2021; Wu et al., 2022; Wang et al., 2022) require feature alignment or extra structures (e.g., GCN, CRF), which conflicts with the hypothesis of PEPT. Therefore, we propose a novel method called hierarchical label refining (HLR) to incorporate hierarchical information into the verbalizer. In our method, only the bottom-level label words are parameterized; the others are refined from the bottom up according to the hierarchical division. The dispersed label semantics are continuously aggregated into more generalized ones in each iteration, realizing dynamic updating of the verbalizer.
Finally, our framework carries out joint learning at all levels, thus combining the intra-level label discrimination process and the inter-level hierarchical information integration process.
Our contributions are summarized as follows:
• We are the first to leverage PEPT to drive arguments to match the pre-trained feature space, alleviating the data scarcity of IDRR from the parameter side.
• We propose a parameter-efficient multi-level IDRR framework that deeply infuses hierarchical label guidance into prompt tuning and jointly mines the unique patterns of arguments and labels for MIDRR.
• Results and visualizations demonstrate the effectiveness of our framework with only about 100K trainable parameters.
Related Work

Implicit discourse relation recognition
We introduce deep learning methods for IDRR (Pitler et al., 2009) along two routes.
One route is argument pair enhancement. Early work (Zhang et al., 2015; Chen et al., 2016; Qin et al., 2016; Bai and Hai, 2018) tends to build heterogeneous neural networks to acquire structured argument representations. Other methods (Liu and Li, 2016; Lan et al., 2017; Guo et al., 2018; Ruan et al., 2020; Liu et al., 2020) focus on capturing interactions between arguments. Moreover, several methods (Dai and Huang, 2018; Kishimoto et al., 2018; Guo et al., 2020; Kishimoto et al., 2020; Zhang et al., 2021) aim at obtaining robust representations via data augmentation or knowledge projection. However, these methods lack exploration of relation patterns.
Another route is discourse relation enhancement. These methods are concerned not only with argument pairs but also with discourse relations. He et al. (2020) utilizes a triplet loss to establish spatial relationships between argument and relation representations. Jiang et al. (2021) predicts a response related to the target relation. Most studies (Nguyen et al., 2019; Wu et al., 2020, 2022) import different levels of relations to complete task understanding. However, they lack consideration of data scarcity and weaken the effectiveness of PLMs. We combine prompt tuning with hierarchical label refining to mine argument and label patterns from a multi-level perspective and adopt a parameter-efficient design to alleviate the above problems.

Prompt Tuning
The essence of prompt-based learning is to bridge the gap between the MLM and downstream tasks by reformulating specific tasks as cloze questions. At present, some papers (Xiang et al., 2022b; Zhou et al., 2022) hand-craft prompts to achieve improvements for IDRR. However, they require numerous experiments to obtain reliable templates.
Recently, prompt tuning (PT) (Liu et al., 2022; Ding et al., 2022) was proposed to search for prompt tokens in a soft embedding space. Depending on the resource scenario, it can be mainly divided into two kinds of studies: full prompt tuning (FPT) and parameter-efficient prompt tuning (PEPT).
With sufficient data, FPT (Han et al., 2021; Liu et al., 2021b; Wu et al., 2022) combines the parameters of the PLM with soft prompts to accomplish bidirectional alignment between the semantic feature space and the inputs. Among them, P-Tuning (Liu et al., 2021b) replaces discrete prompts with soft ones and adopts the MLM for downstream tasks. PTR (Han et al., 2021) concatenates multiple sub-templates and selects unique label word sets for different sub-prompts.
However, in low-resource scenarios, this strategy cannot accurately depict the high-dimensional task-specific space. Therefore, PEPT methods (Hambardzumyan et al., 2021; Lester et al., 2021; Li and Liang, 2021; Liu et al., 2021a; Zhang et al., 2022; Gu et al., 2022) fix the parameters of PLMs and leverage soft prompts to map the task-specific input into the unified pre-trained semantic space. For example, WARP (Hambardzumyan et al., 2021) uses adversarial reprogramming to tune input prompts and a self-learning verbalizer, achieving superior performance on NLU tasks. Prefix-Tuning (Li and Liang, 2021) tunes PLMs for NLG by updating prepended parameters in each transformer layer. In this paper, we combine PEPT with our proposed hierarchical label refining method, which not only takes full advantage of PEPT for IDRR but also effectively integrates the extraction of hierarchical guidance into the process of prompt tuning.

Overall Framework
Let x = (x_1, x_2) ∈ X be an argument pair and L = {L^1, L^2, ..., L^Z} be the set of all labels, where L^z is the level-z label set. The goal of MIDRR is to predict the relation sequence l = (l^1, ..., l^z, ..., l^Z), where l^z ∈ L^z is the prediction at level z. The overview of our framework is shown in Figure 2. In this section, we explain our framework in three parts. First, we analyze the theory of PEPT for single-level IDRR and relate it to our idea. Next, we describe how to extend PEPT to MIDRR through our proposed hierarchical label refining method. Finally, we conduct joint learning over multiple levels so as to fuse intra- and inter-level label information.

Prompt Tuning for Single-level IDRR
Prompt tuning is a universal approach to stimulate the potential of PLMs for most downstream tasks; its goal is to find the best prompts that make the MLM predict the desired answer for the ⟨mask⟩ in templates. It is also suitable for single-level IDRR. Inspired by a PEPT method called WARP (Hambardzumyan et al., 2021), we aim to achieve objective approximation with fewer parameters given the data scarcity of IDRR. To our knowledge, our work is the first successful application of PEPT to IDRR.
In theory, given an MLM M and its vocabulary V, we must transform the z-th level IDRR into an MLM task. For the input x, we first construct a modified input x̂ ∈ X̂ through a template projection T : X → X̂, which surrounds the arguments with soft prompts P = {⟨P_0⟩, ⟨P_1⟩, ..., ⟨P_{K−1}⟩} ⊂ V (K is the number of prompt tokens) and the special tokens ⟨mask⟩ and ⟨sep⟩. These soft prompt tokens behave like other words in V, but they do not refer to any real word and are learnable through gradient backpropagation. The actual input x̂ ∈ X̂ can be formulated as follows:

x̂ = T(x) = [⟨P_0⟩ ... x_1 ... ⟨mask⟩ ... x_2 ... ⟨P_{K−1}⟩ ⟨sep⟩]    (1)

where the arrangement of the prompt tokens is optional, and we will discuss the main factors of template selection in Section 4.6.

Figure 2: The overall architecture of our PEMI framework.

Then, we leverage the MLM M to predict discourse relations. We denote E : X̂ → H and F : H → V as the encoder and vocabulary classifier of M. For the encoder E, we make no extra modifications and obtain the feature representation h_⟨mask⟩ ∈ H from the ⟨mask⟩ position. Through the attention in E, the prompts constantly mine the context pattern and guide the model to acquire semantic representations with IDRR characteristics. For F, label word selection should be made to constrain the probabilities to fall on words associated with relation labels. Here, instead of picking a verbalizer by hand or by rules, we adopt a self-learning verbalizer V^z = {⟨V_1⟩, ⟨V_2⟩, ..., ⟨V_{|L^z|}⟩} ⊂ V to represent the label words of the level-z classes. We denote this new projection as F^z : H → V^z. In practice, we replace the final projection in F with the verbalizer embedding matrix M^z ∈ R^{|L^z|×d} to acquire F^z. The matrix M^z is represented as:

M^z = [e(⟨V_1⟩); e(⟨V_2⟩); ...; e(⟨V_{|L^z|}⟩)]    (2)

where e(·) is the embedding projection of M. The calculation of F^z is as follows:

p^z = {p^z_i}_{i=1}^{|L^z|} = softmax(M^z h'_⟨mask⟩)    (3)

where p^z is the probabilistic prediction of the z-th level and h'_⟨mask⟩ is the representation before the verbalizer projection (different PLMs apply different operations here, e.g., layer norm).
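The verbalizer projection can be sketched numerically as follows; NumPy arrays stand in for the real PLM tensors, and the shapes (batch of 4, hidden size 768, 16 label words) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the label dimension
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def verbalizer_predict(h_mask, M_z):
    """Score the <mask> representation against the level-z label-word
    embeddings: h_mask has shape (batch, d), M_z has shape (|L^z|, d),
    and the result is a (batch, |L^z|) probability matrix."""
    return softmax(h_mask @ M_z.T)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 768))      # hidden states at the <mask> position
M_z = rng.normal(size=(16, 768))   # e.g. 16 second-level label words
p = verbalizer_predict(h, M_z)     # (4, 16), each row sums to 1
```

Because the label-word embeddings replace the full vocabulary projection, only |L^z| rows need gradients instead of the whole MLM head.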
Finally, we train the model through a cross-entropy loss to approximate the real distribution of the z-th level:

L_z = − Σ_{i=1}^{|L^z|} y^z_i log p^z_i    (4)

where y^z = {y^z_i}_{i=1}^{|L^z|} is the one-hot representation of the ground-truth relation.
Although the above narrows the gap between pre-training and IDRR, it is inappropriate to fine-tune the pre-trained feature space into a task-specific one in low-resource scenarios, which further aggravates the collapse of the vanilla space. Therefore, we propose to approximate the original objective by adjusting the input to fit the vanilla PLM space. Let θ_M be the parameters of M and δ = {θ_P, θ_{V^z}} the parameters of the soft prompts and the verbalizer. Our method seeks a new approximate objective function L̂_z(·; δ) such that:

|L_z(x; θ_M) − L̂_z(x̂; θ_M, δ)| ≤ ε    (5)

where ε is the approximation error. Moreover, if we assume that the difference in F^z between L_z and L̂_z is insignificant when L_z reaches its optimum, the purpose of PEPT can be understood as:

E(T(x); θ_M) ≈ E^+(x)    (6)

where E^+ is the optimal encoder. Through this method, we restrict the MLM to a small manifold in function space (Aghajanyan et al., 2021), thus adjusting the input to fit the original PLM feature space. Especially in low-resource situations, this approach can effectively achieve the approximation.
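To see why this setup is parameter-efficient, a back-of-envelope count under the reported configuration (RoBERTa-base hidden size d = 768, 20 soft prompts, 102 bottom-level connective label words; the exact totals below are our estimate, not figures quoted from the paper):

```python
# theta_P: 20 soft-prompt embeddings; theta_{V^Z}: 102 connective
# label-word embeddings. Everything else in the PLM stays frozen.
d = 768
trainable = 20 * d + 102 * d       # 93,696 trainable parameters
roberta_base = 125_000_000         # ~125M parameters, all frozen
ratio = trainable / roberta_base   # well under 0.1% of the full model
```

This is consistent with the "about 100K trainable parameters" figure stated in the contributions.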

Hierarchical Label Refining
Despite the success of single-level IDRR, PEPT suffers from the absence of hierarchical label guidance. Besides, existing hierarchical infusion methods (Wu et al., 2020; Chen et al., 2021; Wu et al., 2022; Wang et al., 2022) inevitably introduce additional parameters beyond δ, which accelerates the deconstruction of the pre-trained feature space. Therefore, we propose a hierarchical label refining (HLR) method that integrates hierarchical guidance into the verbalizer. Our method not only avoids increasing the scale of θ_V = {θ_{V^m}}_{m=1}^{Z} but also restricts the parameters to θ_{V^Z}.
In detail, for multi-level IDRR, the arguments are annotated at different semantic granularities during labeling, and all the labels form a graph G with Z levels according to the predefined relationships. In this graph, for a particular z-th level label l^z_j (j ∈ {1, 2, ..., |L^z|}), its relevant sub-labels are distributed at level z+1, and we denote them as:

L^{z+1}_j = { l ∈ L^{z+1} : Parent(l) = l^z_j }    (7)

where Parent(·) denotes the parent node of a label.
In the abstract, the nodes in L^{z+1}_j are the semantic divisions of l^z_j and represent its local meanings; in other words, the meaning of l^z_j can be extracted by merging its sub-labels. In the embedding space, this relationship translates into clustering, where l^z_j is the semantic center of its support set L^{z+1}_j. Therefore, if the embeddings of the sub-labels make sense, we can regard the semantic center extracted from them as their parent label. Under this concept, we only need to build the semantics of the bottom-level labels; the other levels are produced by aggregation from the bottom up. From the view of graph neural networks, our method limits the adjacent nodes of each node in G to the fine-grained labels in its first-order neighborhood, and the updating of node embeddings depends only on the aggregation of the adjacent nodes, without the node itself. In practice, the verbalizer V* consists of only |L^Z| learnable label words, and the others are generated from V*.

Furthermore, we discuss how to achieve effective semantic refining. A major consideration is the proportion of the support nodes. However, the refining weights depend on numerous factors, e.g., the label distribution of the dataset, the semantic importance of the parent label, polysemy, and so on. Hence we apply several learnable weight units in the refining process to balance these factors, which is equal to adding weights to the edges of G. All the weights are acquired through the iterations of prompt tuning. Formally, the elements of the weight vector w^z_j = {w^z_{j,i}} for l^z_j are obtained as follows:

w^z_{j,i} = unit(z, j, i)    (8)

where unit(·) is the function that retrieves the target weight unit indexed by z, i, and j.
After that, we formalize the calculation of the verbalizer matrix M^z at the z-th level as follows:

M^z_j = e(l^z_j) = Σ_i f(w^z_j)_i · e(l^{z+1}_{j,i}),  l^{z+1}_{j,i} ∈ L^{z+1}_j    (9)

where W^z = {w^z_j}_{j=1}^{|L^z|} is the weight matrix of the z-th level, and f(·) stands for a normalization method such as softmax or the L1 norm.
Our method repeats this process from the bottom up to obtain semantic embeddings at all levels. It is performed in each iteration before the objective function is computed, thus aggregating upper-level semantics from progressively more precise basic ones and infusing them into the whole process of PT. In this way, discourse relations receive hierarchical guidance from the generation process and continually enrich the verbalizer V*.
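One bottom-up refining step can be sketched as follows; this is a NumPy toy with a made-up two-parent hierarchy, and in the real model the raw weights come from the learnable weight units rather than being fixed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def refine_level(child_emb, children_of, raw_weights):
    """Bottom-up HLR step: each parent embedding is the
    softmax-normalised weighted sum of its sub-labels' embeddings.
    child_emb: (|L^{z+1}|, d); children_of[j]: indices of parent j's
    sub-labels; raw_weights[j]: the learnable edge weights for parent j."""
    return np.stack([
        softmax(np.asarray(raw_weights[j], dtype=float)) @ child_emb[idx]
        for j, idx in enumerate(children_of)
    ])

rng = np.random.default_rng(0)
connectives = rng.normal(size=(6, 8))   # 6 bottom-level label words, d=8
children_of = [[0, 1, 2], [3, 4, 5]]    # two hypothetical level-2 parents
raw_weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform -> plain mean
level2 = refine_level(connectives, children_of, raw_weights)
```

With uniform raw weights each parent is exactly the mean of its children; training the weights lets a parent drift toward the more informative connectives, as observed for the minor classes in Section "Impact of Hierarchical Label Refining".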

Joint Learning
After the embeddings of all levels are generated vertically, we conduct horizontal training for intra-level senses. Precisely, we first calculate the probability distribution of each level independently, following Equations (3) and (4).
Eventually, our model jointly learns the overall loss as the weighted sum of Equation (4):

L = Σ_{z=1}^{Z} λ_z L_z    (10)

where λ_z is a trade-off hyper-parameter balancing the losses of the different levels. Through joint learning over the levels, our model naturally combines information within and between hierarchies. Besides, it can synchronously manage all the levels through one gradient descent step, without the multiple iterations of a sequence generation model, thus speeding up the computation while keeping the hierarchical label guidance.
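The joint objective reduces to a weighted sum of per-level cross-entropies, which the toy sketch below illustrates with fabricated probabilities (two levels, batch of two):

```python
import numpy as np

def cross_entropy(p, y):
    """Mean negative log-probability assigned to the gold labels."""
    return float(-np.log(p[np.arange(len(y)), y]).mean())

def joint_loss(probs_per_level, gold_per_level, lambdas):
    """L = sum_z lambda_z * L_z: one scalar over all levels, so a single
    backward pass updates prompts, verbalizer and refining weights."""
    return sum(lam * cross_entropy(p, y)
               for lam, p, y in zip(lambdas, probs_per_level, gold_per_level))

p1 = np.array([[0.9, 0.1], [0.2, 0.8]])            # level-1 predictions
p2 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # level-2 predictions
gold = [np.array([0, 1]), np.array([0, 1])]
loss = joint_loss([p1, p2], gold, lambdas=[1.0, 1.0])
```

Setting every λ_z to 1.0 reproduces the paper's equal weighting; the per-level terms stay separable, so no sequential decoding across levels is needed.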

Experiments

Dataset
To facilitate comparison with previous work, we evaluate our model on the PDTB 2.0 and 3.0 datasets. The original benchmark (Prasad et al., 2008) contains three-level relation hierarchies. However, the third-level relations cannot be used for classification due to the lack of samples in most categories. Following previous work (Wu et al., 2020, 2022), we regard the connectives as the third level for MIDRR. PDTB 2.0 contains 4 (top level), 16 (second level) and 102 (connectives) categories per level. Five second-level labels without validation and test instances are removed. For PDTB 3.0, following Kim et al. (2020), we conduct 4-way and 14-way classification for the top and second levels. Since previous work has not defined a criterion for PDTB 3.0 connectives, we choose 150 connectives from implicit instances for classification.³ For data partitioning, we adopt the most popular splitting strategy, PDTB-Ji (Ji and Eisenstein, 2015), which uses sections 2-20 as the training set, sections 0-1 as the development set, and sections 21-22 as the test set. More details of the PDTB-Ji splitting are shown in Appendix A.
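The PDTB-Ji split described above amounts to a simple section lookup; the sketch below encodes it (the `pdtb_ji_split` helper is ours, for illustration):

```python
def pdtb_ji_split(section: int) -> str:
    """Assign a WSJ section number to its PDTB-Ji partition:
    sections 2-20 train, 0-1 dev, 21-22 test."""
    if 2 <= section <= 20:
        return "train"
    if section in (0, 1):
        return "dev"
    if section in (21, 22):
        return "test"
    return "unused"  # remaining sections are not part of this split

splits = {s: pdtb_ji_split(s) for s in range(25)}
```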

Experimental Settings
Our work uses the PyTorch and Huggingface libraries for development, and also verifies the effectiveness of our model on the MindSpore library. For better comparison with recent models, we apply RoBERTa-base (Liu et al., 2019) as our encoder. All hyper-parameter settings remain the same as its original settings, except that dropout is set to 0. We update only the parameters of δ = {θ_P, θ_{V^Z}} and the weight units {W^z}_{z=1}^{Z}, freezing all other parameters during training. The loss weight coefficients λ_z are all 1.0, and the normalization function f is softmax. We use the Adam optimizer with a learning rate of 1e-3 and a batch size of 8. Training uses early stopping with a maximum of 15 epochs, choosing the model with the best result on the development set; the evaluation step is 500. In practice, one training run of PEMI takes about 1.5 hours on a single RTX 3090 GPU. Finally, we choose macro-F1 and accuracy as our evaluation metrics.

³ https://github.com/cyclone-joker/IDRR_PDTB3_Conns

The Comparison Models
In this section, we select baselines for PDTB 2.0 and 3.0 separately and introduce them briefly.
• PDTB 2.0: We select comparable PLM-based models and group them into two aspects.

Argument Pair Enhancement

1) FT-RoBERTa: Liu et al. (2019) improves BERT by removing the NSP task and pre-training on wider corpora. We conduct experiments for each level separately.
2) BMGF: Liu et al. (2020) proposes a bilateral multi-perspective matching encoder to enhance argument interaction at both the text-span and sentence level.

Discourse Relation Enhancement
3) MTL-KT: Nguyen et al. (2019) predicts relations and connectives simultaneously and transfers knowledge between them through label embeddings. We import the RoBERTa version from Wu et al. (2022).
5) FT-RoBERTa: We also fine-tune a RoBERTa model on PDTB 3.0 for better comparison.

Results and Analysis
In this section, we display the main results at three levels on PDTB 2.0 (Table 1) and PDTB 3.0 (Table 7), and the label-wise F1 of level 2 on PDTB 2.0 (Table 2) and PDTB 3.0 (Table 6).
We can draw the following observations from these results: 1) In Table 1, our model achieves performance comparable to strong baselines using only 0.1% trainable parameters, and the improvement occurs mainly on the level-3 senses, which indicates that our model is more aware of fine-grained hierarchical semantics. 2) In Table 7, our model exceeds all currently fine-tuned models, which proves that its effect is also guaranteed with sufficient data. 3) In Table 2, our model mainly outperforms on the minor classes. For PDTB 2.0, the improvement depends on three minor categories: Comp.Concession, Expa.List and Expa.Instantiation, which indicates that approximation through fewer trainable parameters drives the model to pay more attention to minor classes. More details for PDTB 3.0 are shown in Appendix B.

Ablation Study and Analysis
We conduct an ablation study on PDTB 2.0 to analyze the impact of our framework in depth. Our Baseline is a fine-tuned RoBERTa MLM with a learnable verbalizer. Compared with fine-tuned RoBERTa, the baseline acquires argument representations through ⟨mask⟩ and retains some parameters of the MLM head. Besides, it treats IDRR at different levels as individual classification tasks but shares the encoder parameters. We then decompose our model into the two parts described in Section 3: parameter-efficient prompt tuning (PEPT) and hierarchical label refining (HLR).
From Table 3, we observe that: 1) The results of our baseline are higher than the vanilla PLM, which indicates that adapting the MLM to IDRR is practicable. 2) Baseline+HLR gains improvements at all levels, especially level 2, which suggests that information from both the upper- and lower-level labels guides it to be more semantically authentic. 3) PEMI achieves the best performance over the other combinations, which shows that PEPT keeps HLR unaffected by redundant parameters and focused on the semantic information in the verbalizer.

Template Selection
Furthermore, we design experiments on PDTB 2.0 for two main factors of the prompt templates: the location and the number of prompt tokens, shown in Table 8 and Figure 3 respectively. Table 8 shows that the locations have a great influence on our model. Generally, most templates with scattered prompt tokens surpass the compact ones, so it is beneficial to place prompt tokens scattered around the sentences. In particular, placing more prompt tokens around the first sentence achieves a particular promotion, suggesting that early intervention by prompts could better guide the prediction of discourse relations.
In Figure 3, as the number of prompt tokens increases, the situations differ across the three levels. The level-1 and level-2 senses peak when the number rises to 20 and then start to decline, which indicates that too many prompt tokens may dilute the attention between arguments. However, the performance on connectives continues to improve as the number increases, mainly because the classification difficulty rises and more prompts need to be involved. We therefore measured overall performance across all levels and chose 20 prompt tokens as our final setting, although there is still room for improvement.

Impact of Hierarchical Label Refining
Finally, we carry out two experiments to explore the impact of our HLR method: the weight coefficients learned by the weight units (Tables 9 and 10) and the visualization of label embeddings (Figure 4).
In Table 9, we find that most of the weight coefficients are inversely proportional to data size, with a few exceptions such as Expa.Alternative. Combined with Table 4, we can infer that our model pays more attention to the minor classes and lowers the weights of the well-performing classes.
Besides, in Figure 4, we note that visible clustering relationships exist in the embedding space. For major classes such as Cont.Cause and Expa.Conjunction, the class centers tend to be the average of the connectives in the cluster. In contrast, minor classes such as Expa.Alternative and Expa.List are biased towards a particular connective. The reason is that some connectives belonging to multiple discourse relations can transfer knowledge from other relations and improve the prediction of the current one; the model then increases the weights of those connectives to get closer to the actual distribution. Therefore, the HLR method transfers inter- and intra-level guidance information in the embedding space.

Conclusion
In this paper, we tackle the data scarcity problem of IDRR from a parameter perspective and present a novel parameter-efficient multi-level IDRR framework, which leverages PEPT to adjust the input to match the pre-trained space with fewer parameters and infuses hierarchical label guidance into the verbalizer. Experiments show that our model is comparable with recent SOTA models despite its parameter-efficient design. The results also indicate that our framework can effectively stimulate the potential of PLMs without any additional data or knowledge. In the future, we will further explore the linguistic features of labels and enhance the discrimination of connectives.

Limitations
Although our model obtains satisfying results, it also has some limitations. First, for a fair comparison with other models, we mainly carry out experiments on PDTB 2.0; due to the lack of baselines on PDTB 3.0, further analysis and comparison cannot be conducted there. Second, our experiments show that the HLR method does not effectively improve the top-level or bottom-level results, indicating that as the level rises, the refining method becomes insufficient to continue generalizing the bottom-level labels, and further improvement should be made according to the specific features of the IDRR task. Third, due to space limitations, this paper does not focus much on the semantic weights for refining sub-labels. This is a broad topic involving the rationality of discourse relation annotation and the interpretability of label embeddings, and we will study it further in future work.

B Experimental Results on PDTB 3.0

Due to page limitations, we provide the results on PDTB 3.0 in this section.

C Selection of Input Templates
In this section, we provide several templates that vary the locations of the prompt tokens and ⟨mask⟩ to explore their validity for IDRR; Table 8 shows the overall results for reference. We find that it is preferable to put the ⟨mask⟩ token in the middle of the argument pair, as described in Section 3.1.

D Details of Weight Units
In this section, we display the weight coefficients learned by the weight units in Section 3.2, as shown in Tables 9 and 10. We can observe some characteristics of the learned weights. Comparing Tables 4 and 9, it is apparent that the weight is inversely proportional to the number of samples, which suggests that our model intentionally learns features from minor classes. For the second level, the situation is more complicated: some minor connectives, like "meanwhile" in Expa.List, are given high weight, while others, like "furthermore", are quite the opposite. Therefore, it is not enough to learn a good weight from sample size alone. Besides, since connectives can belong to different labels, the semantics learned from other relations can benefit the current ones.

Figure 4: Visualization of the HLR method for connectives. ∆ represents level-2 labels and different colors indicate different classes. We use different markers since some connectives overlap due to the many-to-many mapping between level 2 and connectives.


Table 3: Ablation study on PDTB 2.0. Our Baseline is a fine-tuned RoBERTa MLM with a learnable verbalizer. PEPT means parameter-efficient prompt tuning and HLR is hierarchical label refining.
Figure 3: The effect of prompt token size for MIDRR on PDTB 2.0. We follow the best template in Table 8 and try to put the prompt tokens uniformly in each location.

Table 4: Statistics for the relation senses of level 2 in PDTB 2.0 under PDTB-Ji splitting.

Table 5: Statistics for the relation senses of level 2 in PDTB 3.0 under PDTB-Ji splitting.
Table 6 displays the label-wise F1 for level-2 senses on PDTB 3.0, and Table 7 shows the main results on PDTB 3.0 compared with the baselines stated in Section 4.3.

Table 8: Results of changing the locations of the prompt tokens and ⟨mask⟩ on PDTB 2.0. We fix the number of prompt tokens at 20 and test some extreme cases based on simple permutations. ⟨P:x⟩ indicates that x prompt tokens are inserted at that location.