Facilitating Contrastive Learning of Discourse Relational Senses by Exploiting the Hierarchy of Sense Relations

Implicit discourse relation recognition is a challenging task that involves identifying the sense or senses that hold between two adjacent spans of text, in the absence of an explicit connective between them. In both PDTB-2 (Prasad et al., 2008) and PDTB-3 (Webber et al., 2019), discourse relational senses are organized into a three-level hierarchy, ranging from four broad top-level senses to more specific senses below them. Most previous work on implicit discourse relation recognition has used the sense hierarchy simply to indicate what sense labels were available. Here we do more, incorporating the sense hierarchy into the recognition process itself and using it to select the negative examples used in contrastive learning. With no additional effort, the approach achieves state-of-the-art performance on the task. Our code is released at https://github.com/wanqiulong0923/Contrastive_IDRR.


Introduction
Discourse relations are an important aspect of textual coherence. In some cases, a speaker or writer signals the sense or senses that hold between clauses and/or sentences in a text using an explicit connective. In the absence of an explicit connective, recognizing the sense or senses that hold can be more difficult.
Automatically identifying the sense or senses that hold between sentences and/or clauses can be useful for downstream NLP tasks such as text summarization (Cohan et al., 2018), machine translation (Meyer et al., 2015) and event relation extraction (Tang et al., 2021). Recent studies on implicit discourse relation recognition have shown great success. In particular, pre-trained neural language models (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019) have been used and have dramatically improved model performance (Shi and Demberg, 2019b; Liu et al., 2020; Kishimoto et al., 2020). The senses available for labelling discourse relations in the PDTB-2 (and later in the PDTB-3) are arranged in a three-level hierarchy, with the most general senses at the top and more specific senses further down. In the PDTB-3, annotators could only choose senses at terminal nodes in the hierarchy: level-2 senses for symmetric relations such as EXPANSION.EQUIVALENCE and TEMPORAL.SYNCHRONOUS, and level-3 senses for asymmetric relations, with the direction of the relation encoded in its sense label, as in SUBSTITUTION.ARG1-AS-SUBST (where the text labelled Arg1 substitutes for the denied text labelled Arg2) and SUBSTITUTION.ARG2-AS-SUBST (where the text labelled Arg2 substitutes for the denied text labelled Arg1). Early work on recognizing implicit relations used the hierarchy only to choose a target for recognition (e.g., the senses at level-1 (classes) or those at level-2 (types)). Recently, Wu et al. (2022) have tried to leverage the dependence between the level-1 and level-2 labels (cf. Section 2). The current work goes further, using the whole three-level sense hierarchy to select the negative examples for contrastive learning.
Contrastive learning, which aims to minimize the distance between similar instances (defined as positive examples) and widen the difference from dissimilar instances (negative examples), has been shown to be effective in constructing meaningful representations (Kim et al., 2021; Zhang et al., 2021; Yan et al., 2021). Previous work on contrastive learning indicates that it is critical to select good negative samples (Alzantot et al., 2018; Wu et al., 2020b; Wang et al., 2021). The insight underlying the current work is that the hierarchy of sense labels can enable the selection of good negative examples for contrastive learning. To see this, consider Examples 1-3 below from the PDTB-3. On the surface they look somewhat similar, but in Examples 1 and 2, the annotators took the second sentence (Arg2) as providing more detail about the first sentence (Arg1), the sense called EXPANSION.LEVEL-OF-DETAIL.ARG2-AS-DETAIL, while in Example 3, they took the second sentence as expressing a substitute for "American culture" in terms of what is relevant, the sense called EXPANSION.SUBSTITUTION.ARG2-AS-SUBST.
(1) Valley National "isn't out of the woods yet". The key will be whether Arizona real estate turns around or at least stabilizes.
(2) The House appears reluctant to join the senators. A key is whether House Republicans are willing to acquiesce to their Senate colleagues' decision to drop many pet provisions.
(3) Japanese culture vs. American culture is irrelevant. The key is how a manager from one culture can motivate employees from another.
In this work, we use a multi-task learning framework, which consists of classification tasks and a contrastive learning task. Unlike most previous work, which uses one benchmark dataset (usually PDTB-2 or PDTB-3), we evaluate our systems on both PDTB-2 and PDTB-3. In addition, Wang et al. (2021) have shown that data augmentation can make representations more robust by enriching the data used in training. We thus follow Ye et al. (2021) and Khosla et al. (2020) in identifying a relevant form of data augmentation for our contrastive learning approach to implicit relation recognition.
The main contributions of our work are as follows:

• We leverage the sense hierarchy to obtain contrastive learning representations, learning an embedding space in which examples of the same type at level-2 or level-3 stay close to each other while sister types are far apart.

• We explore and compare different methods of defining the negatives based on the sense hierarchies in PDTB-2 and PDTB-3, finding the approach which leads to the greatest improvements.

• Our proposed data augmentation method for generating examples improves the overall performance of our model.

• We demonstrate that implicit relation recognition can benefit from a deeper understanding of the sense labels and their organization.

Related Work
Implicit discourse relation recognition For this task, Dai and Huang (2018) considered paragraph-level context and inter-paragraph dependency. Recently, Shi and Demberg (2019b) showed that using the bidirectional encoder representations from BERT (Devlin et al., 2019) recognizes Temporal.Synchrony, Comparison.Contrast, Expansion.Conjunction and Expansion.Alternative more accurately. Liu et al. (2020) showed that different levels of representation learning are all important to implicit relation recognition, and they combined three modules to better integrate context information, capture the interaction between two arguments, and understand the text in depth. However, only two existing works have leveraged the hierarchy in implicit relation recognition. Both Wu et al. (2020a) and Wu et al. (2022) first attempted to assign a level-1 sense that holds between arguments, and then considered as possible level-2 senses only those that are daughters of the level-1 sense.
Contrastive learning Recently, there has been growing interest in applying contrastive learning in both the pre-training and fine-tuning objectives of pre-trained language models. Gao et al. (2021) used a contrastive objective to fine-tune pre-trained language models to obtain sentence embeddings, greatly improving state-of-the-art sentence embeddings on semantic textual similarity tasks. Suresh and Ong (2021) proposed a label-aware contrastive loss for settings with a larger number of classes and/or more confusable classes, which helps models produce more differentiated output distributions.
In addition, many works have demonstrated that selecting good negative examples is very important for contrastive learning (Schroff et al., 2015; Robinson et al., 2021; Cao et al., 2022). In our work, we integrate a contrastive learning loss with supervised losses, and we use the structure of the sense hierarchy to guide the selection of negative examples.
Learning Loss

Supervised Learning Loss
The standard approach today for a classification task is to use a standard cross-entropy loss:

$$\mathcal{L}_{sup} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(W_{y_i}^{\top}h_i)}{\sum_{j}\exp(W_{j}^{\top}h_i)}$$

where $N$ denotes the number of training examples, $y_i$ is the ground-truth class of the $i$-th example, $h_i$ is its feature vector, and $W_j$ is the weight vector of the $j$-th class.
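To make the loss above concrete, here is a minimal NumPy sketch of softmax cross-entropy over a batch of logits (i.e., the scores $W_j^{\top}h_i$); it is an illustration, not the paper's implementation, and the function name is ours:

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """Standard softmax cross-entropy, averaged over N examples.

    logits: (N, C) array of per-class scores W_j^T h_i.
    labels: (N,) array of ground-truth class indices y_i.
    """
    # Stabilize the softmax by subtracting the per-row maximum.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    n = labels.shape[0]
    # Pick out the log-probability of each example's gold class.
    return -log_probs[np.arange(n), labels].mean()

# Confident, correct predictions yield a small loss.
logits = np.array([[5.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
labels = np.array([0, 1])
loss = cross_entropy_loss(logits, labels)
```

With completely uninformative (all-zero) logits over three classes, the loss reduces to log 3, a useful sanity check.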

Contrastive Learning Loss
In contrastive learning, each example can be treated as an anchor with its own positive and negative examples. Contrastive learning pulls the anchor and its positive examples together in the embedding space, while the anchor and negative samples are pushed apart. This contrastive learning loss was used before by Chen et al. (2020) and Suresh and Ong (2021). A set of $N$ randomly sampled sample-label pairs is defined as $\{x_k, y_k\}$, $k = 1, \ldots, N$, where $x$ and $y$ represent samples and labels respectively. Let $i$ be the index of an anchor sample and $j$ the index of a positive sample, where $i \in \{1, \ldots, N\}$, $i \neq j$. The contrastive loss is defined as:

$$\ell_{i,j} = -\log\frac{\exp(h_i \cdot h_j/\tau)}{\sum_{k=1, k\neq i}^{N}\exp(h_i \cdot h_k/\tau)} \qquad (2)$$

Here, $h$ denotes the feature vector in the embedding space, and $\tau$ is the temperature parameter. Intuitively, the numerator computes the dot product between the anchor $i$ and its positive sample $j$, while the denominator computes the dot products between $i$ and all other samples, of which a total of $N-1$ are computed. Supervised contrastive learning (Gunel et al., 2021) extends Equation 2 to the supervised scenario. In particular, given the presence of labels, the positive examples are all examples with the same label. The loss is defined as:

$$\mathcal{L} = -\sum_{i=1}^{N}\frac{1}{N_{y_i}}\sum_{\substack{j=1, j\neq i \\ y_j = y_i}}^{N}\log\frac{\exp(h_i \cdot h_j/\tau)}{\sum_{k=1, k\neq i}^{N}\exp(h_i \cdot h_k/\tau)} \qquad (3)$$

where $N_{y_i}$ indicates the number of examples in a batch that have the same label as $i$, $\tau$ is the temperature parameter, and $h$ denotes the feature vector from the l2-normalized final encoder hidden layer before the softmax projection.
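A supervised contrastive loss of this kind can be sketched in a few lines of NumPy. This is a simplified illustration of Eq. 3 (not the paper's code): every other example with the same label is a positive, and all non-self examples appear in the denominator:

```python
import numpy as np

def supervised_contrastive_loss(h, labels, tau=0.1):
    """Supervised contrastive loss over a batch of feature vectors.

    h: (N, d) features; rows are l2-normalized inside the function.
    labels: (N,) integer class labels; same-label examples are positives.
    """
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # l2-normalize
    sim = h @ h.T / tau                               # pairwise similarities
    n = h.shape[0]
    mask_self = ~np.eye(n, dtype=bool)                # exclude the anchor itself
    # Denominator: sum over all non-self pairs for each anchor.
    exp_sim = np.exp(sim) * mask_self
    log_prob = sim - np.log(exp_sim.sum(axis=1, keepdims=True))
    total = 0.0
    for i in range(n):
        pos = (labels == labels[i]) & mask_self[i]
        if pos.any():
            total += -log_prob[i, pos].mean()         # average over N_{y_i} positives
    return total / n

# Two well-separated classes: same-label neighbours are close.
h = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
tight = supervised_contrastive_loss(h, labels)
```

Swapping the labels so that positives point at dissimilar vectors makes the loss rise, matching the intuition that the loss pulls same-label examples together.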

Our Approach
Figure 2 shows the overall architecture of our method. As Figure 2 illustrates, we first use a simple multi-task model based on RoBERTa-base (Liu et al., 2019), and then develop a contrastive learning algorithm in which the sense hierarchy is used to select positive and negative examples. Detailed descriptions of our framework and our data augmentation method are given below.

Sentence Encoder
Every annotated discourse relation consists of two sentences or clauses (its arguments) and one or more relational senses that the arguments bear to each other. We concatenate the two arguments of each example and input them into RoBERTa. Following standard practice, we add two special tokens to mark the beginning ([CLS]) and the end ([SEP]) of sentences. We use the representation of [CLS] in the last layer as the representation of the whole input.

Data Augmentation
To increase the number of training examples, we take advantage of the meta-data recorded with each implicit discourse relation in the PDTB (cf. Webber et al., 2019, Section 8). For each sense taken to hold between the arguments of a relation, annotators have recorded in the meta-data an explicit connective that could have signalled that sense. In the past, this meta-data was used in implicit relation recognition by both Patterson and Kehler (2013) and Rutherford and Xue (2015). We have used it in a different way, shown in Figure 3, to create an additional training example for each connective that appears in the meta-data. In the added training example, this connective becomes part of the second argument of the relation (i.e., appearing after the [SEP] token). Since there is at least one explicit connective recorded in the meta-data for each implicit discourse relation, and at most two, a training batch of N instances will gain at least another N instances through this data augmentation method, increasing the training batch to at least 2N instances.
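The augmentation step can be sketched as plain string manipulation: for each recorded connective, emit an extra (Arg1, Arg2) pair whose Arg2 is prefixed with that connective. This is an illustrative sketch (function name, lower-casing heuristic, and formatting are our assumptions, not the paper's exact procedure):

```python
def augment_with_connectives(arg1, arg2, connectives):
    """For each connective recorded in the PDTB meta-data, create an
    additional training example whose second argument is prefixed with
    that connective (simplified sketch; tokenization details vary).
    """
    examples = [(arg1, arg2)]  # keep the original implicit example
    for conn in connectives:
        # Lower-case the original sentence start so the inserted
        # connective reads naturally (a simplifying assumption).
        examples.append((arg1, f"{conn}, {arg2[0].lower()}{arg2[1:]}"))
    return examples

pairs = augment_with_connectives(
    "Japanese culture vs. American culture is irrelevant.",
    "The key is how a manager from one culture can motivate employees from another.",
    ["in other words"],  # a hypothetical meta-data connective
)
```

With one recorded connective, a batch entry yields two training instances, matching the at-least-2N count described above.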

Positive Pair and Negative Pair Generation
We use the structure of the sense hierarchy to identify the positive and negative examples needed for contrastive learning. The only senses used in annotating discourse relations are those at terminal nodes of the sense hierarchy. This is level 2 for symmetric senses and level 3 for asymmetric senses (i.e., those where the inverse of the sense that holds between Arg1 and Arg2 is what holds between Arg2 and Arg1). For example, CONTRAST and SIMILARITY are both symmetric senses, while MANNER and CONDITION are asymmetric, given that there is a difference between Arg2 being the manner of doing Arg1 and Arg1 being the manner of doing Arg2.
In our work, when the lowest level of a sense is level-3, we directly use the level-3 labels instead of their parent at level-2. For example, under the level-2 label Temporal.Asynchronous, there are two labels at level-3, Precedence and Succession. In this case, we replace the level-2 label Temporal.Asynchronous with the two level-3 labels Precedence and Succession. Although supervised contrastive learning in Eq. 3 can accommodate different classes of positive example pairs, its negative examples come from all examples inside a batch except the anchor itself. We define $l_1$, $l_2$, $l_3$ as the first, second, and third level in the hierarchical structure respectively, and $l \in l_i$ refers to the labels from level $i$.
Instance e ∼ same sub-level e_pos: Given the representation of a sentence $e_i$ and its first, second and third level labels $l_1^i$, $l_2^i$, $l_3^i$, we search the set of examples with the same second-level labels, or the same third-level labels (if the lowest level is level-3), as e_pos in each training batch. E.g., if the label of the anchor is Temporal.Asynchronous.Precedence, its positive examples are the examples with the same label.
Instance e ∼ batch instance e_neg: Here, we would like to help the model discriminate the sister types at level-2 and level-3 (if the lowest level is level-3). We search the set of examples with different level-2 labels or level-3 labels as e_neg in each training batch. E.g., if the label of the anchor is Temporal.Asynchronous.Precedence, its negative examples are its sister types at level-2 and level-3, namely Temporal.Asynchronous.Succession and Temporal.Synchronous.
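The selection rule can be sketched over dotted sense paths. Following the worked example above, positives share the anchor's terminal sense and negatives are its sister senses within the same level-1 family; whether examples from other level-1 families are simply left out of both sets is our reading of the text, so treat this as an illustrative sketch:

```python
def select_pairs(anchor_label, batch_labels):
    """Positives share the anchor's terminal sense; negatives are its
    sister senses, i.e. same level-1 family but a different level-2 or
    level-3 label. Labels are dotted paths in the PDTB hierarchy."""
    anchor_l1 = anchor_label.split(".")[0]
    positives, negatives = [], []
    for idx, label in enumerate(batch_labels):
        if label == anchor_label:
            positives.append(idx)           # same terminal sense
        elif label.split(".")[0] == anchor_l1:
            negatives.append(idx)           # sister sense at level-2/3
    return positives, negatives

batch = [
    "Temporal.Asynchronous.Precedence",
    "Temporal.Asynchronous.Succession",   # sister type at level-3
    "Temporal.Synchronous",               # sister type at level-2
    "Expansion.Conjunction",              # different level-1 family
    "Temporal.Asynchronous.Precedence",
]
pos, neg = select_pairs("Temporal.Asynchronous.Precedence", batch)
```

In a real implementation the anchor's own index would be excluded from its positive set; it is kept here only to keep the sketch short.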

Loss Algorithms
As described above, given the query $e_i$ with its positive pairs and negative pairs, and based on the general contrastive learning loss (see Equation 2), the contrastive learning loss for our task and approach is:

$$\mathcal{L}_{con} = -\sum_{i}\frac{1}{|P(i)|}\sum_{j\in P(i)}\log\frac{w_{p}\exp(\mathrm{sim}(h_i,h_j)/\tau)}{w_{p}\exp(\mathrm{sim}(h_i,h_j)/\tau)+\sum_{k\in N(i)}w_{n}\exp(\mathrm{sim}(h_i,h_k)/\tau)}$$

where $w_p$ and $w_n$ are weight factors for positive pairs and negative pairs respectively, $P(i)$ and $N(i)$ are the positive and negative sets selected for anchor $i$, $\mathrm{sim}(h_i, h_j)$ is cosine similarity, and $\tau$ is a temperature hyperparameter.
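A minimal NumPy sketch of this weighted loss for a single anchor follows. The weight values mirror the paper's reported hyperparameters (1.6 for positives, 1.0 for negatives), but the exact placement of the weights inside the fraction is an assumption on our part:

```python
import numpy as np

def weighted_contrastive_loss(h, pos_idx, neg_idx, anchor=0,
                              w_pos=1.6, w_neg=1.0, tau=0.1):
    """Weighted contrastive loss for one anchor over its hierarchy-selected
    positive and negative sets, using cosine similarity (sketch only)."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # l2-normalize rows
    sim = h @ h.T                                     # cosine similarities
    denom = sum(w_neg * np.exp(sim[anchor, k] / tau) for k in neg_idx)
    loss = 0.0
    for j in pos_idx:
        num = w_pos * np.exp(sim[anchor, j] / tau)
        loss += -np.log(num / (num + denom))
    return loss / max(len(pos_idx), 1)

# Anchor close to its positive, orthogonal to its negative.
h = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
loss = weighted_contrastive_loss(h, pos_idx=[1], neg_idx=[2])
```

Moving the positive further from the anchor increases the loss, which is exactly the pull-together / push-apart behaviour the loss is meant to enforce.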
Our overall training goal is:

$$\mathcal{L} = \mathcal{L}_{sup}^{L1} + \mathcal{L}_{sup}^{L2} + \beta\,\mathcal{L}_{con} \qquad (7)$$

As our classifications are done at the first level and the second level for the same inputs, we use a standard cross-entropy loss to obtain the supervised losses $\mathcal{L}_{sup}^{L1}$ and $\mathcal{L}_{sup}^{L2}$, and $\beta$ is the weighting factor for the contrastive loss.

Datasets
Besides providing a sense hierarchy, the Penn Discourse TreeBank (PDTB) also frequently serves as a dataset for evaluating the recognition of discourse relations. The earlier corpus, PDTB-2 (Prasad et al., 2008), included 40,600 annotated relations, while the later version, PDTB-3 (Webber et al., 2019), includes an additional 13K annotations, primarily intra-sentential, as well as corrections of some inconsistencies in the PDTB-2. The sense hierarchy used in the PDTB-3 differs somewhat from that used in the PDTB-2, with additions motivated by the needs of annotating intra-sentential relations and changes motivated by difficulties that annotators had in consistently using some of the senses in the PDTB-2 hierarchy.
Because of the differences between these two hierarchies, we use the PDTB-2 hierarchy for PDTB-2 data and the PDTB-3 hierarchy for PDTB-3 data. Following earlier work (Ji and Eisenstein, 2015; Bai and Zhao, 2018; Liu et al., 2020; Xiang et al., 2022), we use Sections 2-20 of the corpus for training, Sections 0-1 for validation, and Sections 21-22 for testing, for both PDTB-2 and PDTB-3. With regard to instances with multiple annotated labels, we also follow previous work (Qin et al., 2016): they are treated as separate examples during training, and at test time, a prediction matching any one of the gold types is taken as correct. Implicit relation recognition is usually treated as a classification task. While 4-way (level-1) classification has been carried out on both PDTB-2 and PDTB-3, more detailed 11-way (level-2) classification has been done only on the PDTB-2, and 14-way (level-2) classification only on the PDTB-3.
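The multi-label convention above amounts to two small rules, sketched here (function names and data layout are illustrative, not from the paper's code):

```python
def expand_training(instances):
    """Training: an instance annotated with multiple senses is duplicated,
    once per label (following Qin et al., 2016)."""
    return [(args, label) for args, labels in instances for label in labels]

def is_correct(predicted, gold_labels):
    """Test time: a prediction counts as correct if it matches any of
    the instance's gold sense labels."""
    return predicted in gold_labels

train = expand_training([
    (("arg1 text", "arg2 text"), ["Contingency.Cause", "Expansion.Conjunction"]),
    (("arg1 text", "arg2 text"), ["Temporal.Synchronous"]),
])
```

So a doubly-labelled instance contributes two training examples, and at test time either of its gold senses is accepted.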

Baselines
To demonstrate the effectiveness of our proposed method, we compare it with strong baselines. As previous work usually used one dataset (PDTB-2 or PDTB-3) for evaluation, we use different baselines for PDTB-2 and PDTB-3. Since PDTB-3 was not released until 2019, the PDTB-3 results for baselines from 2016 and 2017 are taken from Xiang et al. (2022), who reproduced on PDTB-3 those models originally applied to PDTB-2. Baselines for PDTB-2:

Parameters Setting
In our experiments, we use the pre-trained RoBERTa-base (Liu et al., 2019) as our Encoder.
We adopt Adam (Kingma and Ba, 2015) with a learning rate of 3e−5 and a batch size of 256 to update the model. The maximum number of training epochs is set to 25 and the patience for early stopping is set to 10 for all models. We clip the gradient L2-norm with a threshold of 2.0. For contrastive learning, the weight of positive examples is set to 1.6 and the weight of negative examples is set to 1. All experiments are performed on a single 80GB NVIDIA A100 GPU.

Evaluation Metrics
We use Accuracy and Macro-F1 score as evaluation metrics, because the PDTB datasets are imbalanced and the Macro-F1 score has been argued to be a more appropriate measure for imbalanced datasets (Akosa, 2017; Bekkar et al., 2013).
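Why Macro-F1 suits imbalanced data can be seen from a small sketch: each class's F1 is computed independently and averaged with equal weight, so a minority class that is never predicted drags the score down even when accuracy stays high (illustrative implementation, ours):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 computed independently, then
    averaged with equal weight, so minority classes count as much as
    majority ones."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Predicting only the majority class: accuracy 0.9, but Macro-F1 suffers.
y_true = ["maj"] * 9 + ["min"]
y_pred = ["maj"] * 10
score = macro_f1(y_true, y_pred, ["maj", "min"])
```

Here accuracy is 0.9 while the Macro-F1 is below 0.5, because the minority class contributes an F1 of zero.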

Effects of the Coefficient β
As shown in Equation 7, the coefficient β is an important hyperparameter that controls the relative importance of the supervised loss and the contrastive loss. Thus, we vary β from 0 to 2.4 in increments of 0.2, and inspect the performance of our model for each β on the validation set.
From Figure 4, we find that, compared with the model without contrastive learning (β = 0), the performance of our model at every level is always improved by contrastive learning. For PDTB-2, when β exceeds 1.0, the performance of our model tends to plateau and finally declines. Thus, we set β = 1.0 for all subsequent PDTB-2 experiments. For PDTB-3, the accuracy and F1 on the validation set reach their highest point at β = 2.0, so we choose β = 2.0 for all related experiments.
We have considered three possible explanations for this difference in the optimal weighting coefficient. First, compared with PDTB-2, PDTB-3 contains about 6,000 more implicit tokens annotated for discourse relations. Second, although the sense hierarchies of the PDTB-2 and the PDTB-3 both have three levels and share the same senses at level-1, there are many changes at level-2 and level-3 due to difficulties found in annotating certain senses. Third, intra-sentential implicit relations might play a role: in PDTB-3, many more discourse relations are annotated within sentences, and Liang et al. (2020) report quite striking differences in the distribution of sense relations inter-sententially vs. intra-sententially between PDTB-2 and PDTB-3. These major differences between the PDTB-3 and the PDTB-2 might explain the variation in the optimal coefficient value.

Results and Analysis
The results on PDTB-2 and PDTB-3 for level-1 and level-2 are presented in Table 1 and Table 2 respectively, where the best results are highlighted in bold. Classification performance on PDTB-2 in terms of Macro-F1 for the four general sense types at level-1 and the 11 sense types at level-2 is shown in Table 3 and Table 4. These results demonstrate better performance than previous systems for both level-1 and level-2 classification on both PDTB-2 and PDTB-3. In particular, the results clearly demonstrate the benefits to be gained from contrastive learning. But there is more to be said: in Section 6.1, we discuss different ways of defining negative examples with respect to the sense hierarchy, and in Section 6.2, we discuss the relative value of the particular form of data augmentation we have used (cf. Section 4.2) as compared with our method of contrastive learning.

Table 2: Experimental results on PDTB-3.

Comparison with Other Negative-Example Selection Methods
There is more than one way to select negative examples for contrastive learning based on the PDTB's hierarchical structure. In addition to the method we adopt, we have explored four other methods of defining positive and negative examples using the sense hierarchies, shown in Figure 5. One can choose the level against which to select negative examples: method 2 below uses examples with different labels at level-2, while methods 1, 3 and 4 use examples with different labels at level-1. With regard to the use of weights in methods 3 and 4, we aim to give more weight to more similar (potentially) positive examples based on the hierarchy. Specifically, when all of the examples from the same level-1 sense count as positive examples, we give more weight to examples of the same level-2/level-3 type than to their sister types at level-2/level-3. In addition, method 4 leverages level-3 labels, while methods 1 to 3 only consider level-1 and level-2 labels. In our experiments with these other methods of defining negatives, we use the same hyperparameters as in the experimental setup of our own method. For methods 3 and 4, the weights of the positive examples are set to 1.6 and 1.3 respectively, and the weight of negative examples is still 1.
It can be seen from Table 5 and Table 6 that our method is better than the above methods on both datasets for both level-1 and level-2 classification tasks. Compared with method 2, we utilize level-3 labels, which indicates that the level-3 label information is helpful to the approach. The greatest difference between our method and the other three methods is whether examples from a different level-1 sense are assumed to be effective as negative examples. Specifically, consider the following example: (4) when [they built the 39th Street bridge]_1, [they solved most of their traffic problems]_2.
If the connective "when" is replaced with "because", the sentence still does not sound strange. Therefore, treating all examples from a different level-1 sense as negative examples might have some negative impact on learning the representations.

Ablation Study
We wanted to know how useful our data augmentation method and our contrastive learning method are, so we have undertaken ablation studies. Effects of the contrastive learning algorithm From Table 7, it can be seen that the multi-task learning method, where level-1 and level-2 labels are predicted simultaneously using the same representation, performs better than predicting level-1 and level-2 labels separately, which verifies the dependency between the different levels. Compared with the multi-task learning method, our model with a contrastive loss performs better on both PDTB-2 and PDTB-3, which means that our contrastive learning method is indeed helpful.
Effects of data augmentation Table 8 compares the results with and without data augmentation for both PDTB-2 and PDTB-3. From the comparison, it is clear that the data augmentation method generates useful examples. Khosla et al. (2020) showed that having a large number of hard positives/negatives in a batch leads to better performance. Since there are many classes at the second level (11 types for PDTB-2 and 14 types for PDTB-3), in a batch of size 256 it is difficult to guarantee enough positive examples for each class to take full advantage of contrastive learning. Therefore, without data augmentation, the performance of our method degrades considerably.

Limitations and Future work
With regard to PDTB-2 and PDTB-3 annotation, there are two cases: (1) annotators can assign multiple labels to an example when they believe more than one relation holds simultaneously; (2) annotators can be told (in the annotation manual) to give precedence to one label if they take more than one to hold. For example, they are told in the manual (Webber et al., 2019) that examples satisfying the conditions for both Contrast and Concession should be labelled as Concession. We over-simplified the presence of multiple labels by following Qin et al. (2017) in treating each label as a separate example, and we did not consider the second case. Thus, our approach might be inadequate for dealing with the actual distribution of the data and could be extended or modified. It is worth exploring how to extend our approach to allow for examples with multiple sense labels and for cases where one label takes precedence over another. We believe this will be an important direction for future work.
Another limitation is that we only use English datasets. There are PDTB-style datasets in other languages, including a Chinese TED discourse bank corpus (Long et al., 2020), a Turkish discourse treebank corpus (Zeyrek and Kurfalı, 2017) and an Italian discourse treebank (Pareti and Prodanof, 2010). Moreover, Zeyrek et al. (2019) proposed the TED Multilingual Discourse Bank (TED-MDB) corpus, which covers 6 languages. These datasets would allow us to assess the approach in languages other than English. There are also datasets similar in style to the PDTB, such as the Prague Dependency Treebank (Mírovský et al., 2014). These different datasets use essentially similar sense hierarchies, but two things need to be investigated: (i) whether there are comparable differences between tokens that realise "sister" relations, and (ii) whether tokens often have multiple sense labels, which would change what could be used as negative examples when applying our approach to them.
In the future, we can also assess whether contrastive learning could help in separating out EntRel relations and AltLex relations from implicit relations, or whether other methods would perform better.

Conclusions
In this paper, we leverage the sense hierarchy to select the negative examples needed for contrastive learning for the task of implicit discourse relation recognition. Our method has better overall performance than previous systems and, compared with previous work, is better at learning minority labels. Moreover, we compared different methods of selecting negative examples based on the hierarchical structures, which showed that negative impacts can arise when negative examples include those from other level-1 types. We also conducted ablation studies to investigate the effects of our data augmentation method and our contrastive learning method, and discussed limitations and future work.

A Appendix
A.1 PDTB Hierarchy The hierarchies of both PDTB 2.0 and PDTB 3.0 consist of three levels, but for implicit relation recognition, no classification of third-level labels has been done so far. We likewise focus on the hierarchy between level-1 and level-2. The PDTB-3 relation hierarchy simplifies and extends the PDTB-2 relation hierarchy: it restricts level-3 relations to differences in directionality and eliminates rare and/or difficult-to-annotate senses, while also augmenting the hierarchy.

A.2 The results on relation types on PDTB-3
We also examine the classification performance on PDTB-3 in terms of Macro-F1 for the four main relation types at level-1 and the 14 sense types at level-2. The results can be seen in Table 9 and Table 10.
Our model has significantly better performance for all level-1 relations.
As for the level-2 sense types, because there are no results from previous systems, we simply report the results for the 14 level-2 sense types in PDTB-3 in terms of F1.

Figure 2: The overall architecture of our model. When given an anchor, we search for positive and negative examples in a training batch based on the sense hierarchy of the PDTB. We narrow the distances among examples from the same types at level-2 or level-3 and enlarge the distances among examples from different types at level-2 and level-3.

Figure 3: An example with an inserted connective: the connective is "In contrast".

Figure 5: The other four negative-example selection methods. Orange balls represent anchors, green balls represent negative examples, and blue balls represent positive examples. A darker blue ball means more weight is given to a more similar (potentially) positive example.

Figure 7: The PDTB 3.0 sense hierarchy. The leftmost column contains the Level-1 senses and the middle column the Level-2 senses. For asymmetric relations, Level-3 senses are located in the rightmost column.

Table 4: The results for relation types at level-2 on PDTB-2 in terms of F1 (%) (second-level multi-class classification).

Table 8: Effects of data augmentation.

Table 9: The results of different relations on PDTB-3 in terms of F1 (%) (top-level multi-class classification).

Table 10: The results of different relations on PDTB-3 in terms of F1 (%) (second-level multi-class classification).