Global and Local Hierarchy-aware Contrastive Framework for Implicit Discourse Relation Recognition

Due to the absence of explicit connectives, implicit discourse relation recognition (IDRR) remains a challenging task in discourse analysis. The critical step for IDRR is to learn high-quality discourse relation representations between two arguments. Recent methods tend to integrate the whole hierarchical information of senses into discourse relation representations for multi-level sense recognition. Nevertheless, they insufficiently incorporate the static hierarchical structure containing all senses (defined as global hierarchy), and ignore the hierarchical sense label sequence corresponding to each instance (defined as local hierarchy). To sufficiently exploit the global and local hierarchies of senses and learn better discourse relation representations, we propose a novel GlObal and Local Hierarchy-aware Contrastive Framework (GOLF), which models the two kinds of hierarchies with the aid of multi-task learning and contrastive learning. Experimental results on the PDTB 2.0 and PDTB 3.0 datasets demonstrate that our method remarkably outperforms current state-of-the-art models at all hierarchical levels. Our code is publicly available at https://github.com/YJiangcm/GOLF_for_IDRR.


Introduction
Implicit discourse relation recognition (IDRR) aims to identify logical relations (named senses) between a pair of text segments (named arguments) without an explicit connective (e.g., however, because) in the raw text. As a fundamental task in discourse analysis, IDRR has benefited a wide range of Natural Language Processing (NLP) applications such as question answering (Liakata et al., 2013), summarization (Cohan et al., 2018), and information extraction (Tang et al., 2021).
The critical step for IDRR is to learn high-quality discourse relation representations between two arguments. Early methods are dedicated to manually designing shallow linguistic features (Pitler et al., 2009; Park and Cardie, 2012) or constructing dense representations relying on word embeddings (Liu and Li, 2016; Dai and Huang, 2018; Liu et al., 2020). Despite their successes, they train multiple models to predict multi-level senses independently, ignoring that the sense annotation of IDRR follows a hierarchical structure (as illustrated in Figure 1). To solve this issue, some researchers propose global hierarchy-aware models that exploit the prior probability of label dependencies based on a Conditional Random Field (CRF) (Wu et al., 2020) or a sequence generation model (Wu et al., 2022).

Figure 1: An IDRR instance in the PDTB 2.0 corpus (Prasad et al., 2008). Argument 1 is in italics, and argument 2 is in bold. The implicit connective is not present in the original discourse context but is assigned by annotators. All senses defined in PDTB are organized in a three-layer hierarchical structure (defined as global hierarchy in our paper), and the implicit connectives can be regarded as the most fine-grained senses.
However, existing hierarchy-aware methods still have two limitations. First, although they recognize that complex dependencies among senses should be encoded into discourse relation representations, the way they encode the holistic hierarchical graph of senses may be insufficient, since they fail to strengthen the correlation between a discourse relation representation and its associated sense labels, which is highly useful for classification (Chen et al., 2020a). Second, they only consider the graph of the entire label hierarchy and ignore the benefit of the label sequence corresponding to each instance. As shown in Figure 2, the label sequences of Instances (1) and (2) differ at both the top and second levels, while the label sequences of Instances (1) and (3) only differ at the most fine-grained level. The similarity between label sequences provides valuable information for regularizing discourse relation representations, e.g., by ensuring that the distance between the representations of Instances (1) and (2) is greater than the distance between the representations of Instances (1) and (3). Based on this observation, we categorize the sense hierarchy into global and local hierarchies to fully utilize the hierarchical information in IDRR. We define global hierarchy as the entire hierarchical structure containing all senses, while local hierarchy is defined as the hierarchical sense label sequence corresponding to each input instance. Therefore, global hierarchy is static and irrelevant to input instances, while local hierarchy is dynamic and pertinent to input instances.
Built on these motivations, we raise our research question: how can we sufficiently incorporate global and local hierarchies to learn better discourse relation representations? To this end, we propose a novel GlObal and Local Hierarchy-aware Contrastive Framework (GOLF) that injects additional information into the learned relation representation through auxiliary tasks that are aware of the global and local hierarchies, respectively. This is achieved via the joint use of multi-task learning and contrastive learning. The key idea of contrastive learning is to narrow the distance between two semantically similar representations while pushing apart representations of dissimilar pairs (Chen et al., 2020b; Gao et al., 2021); it has achieved extraordinary success in representation learning (He et al., 2020). Concretely, our multi-task learning framework consists of classification tasks and two additional contrastive learning tasks. The global hierarchy-aware contrastive learning task explicitly matches textual semantics and label semantics in a text-label joint embedding space, refining the discourse relation representations to be semantically similar to the target label representations while semantically far away from the incorrect label representations. In the local hierarchy-aware contrastive learning task, we propose a novel scoring function to measure the similarity among sense label sequences; this similarity is then utilized to guide the distance between discourse relation representations.
The main contributions of this paper are threefold: (1) we categorize the sense hierarchy of IDRR into global and local hierarchies and study how to exploit both for learning discourse relation representations; (2) we propose GOLF, a global and local hierarchy-aware contrastive framework that combines multi-task learning and contrastive learning; (3) extensive experiments and analysis demonstrate that our approach delivers state-of-the-art performance on the PDTB 2.0 and PDTB 3.0 datasets at all hierarchical levels, and more consistent predictions on multi-level senses.
Related Work

Implicit Discourse Relation Recognition
Early studies resort to manually designed features to classify implicit discourse relations into four top-level senses (Pitler et al., 2009; Park and Cardie, 2012). With the rapid development of deep learning, many methods explore building deep neural networks on top of static word embeddings. Typical works include a shallow CNN (Zhang et al., 2015), an LSTM with multi-level attention (Liu and Li, 2016), and knowledge-augmented LSTMs (Dai and Huang, 2018, 2019; Guo et al., 2020). These works aim to learn better semantic representations of the arguments as well as capture the semantic interaction between them. More recently, contextualized representations learned from large pre-trained language models (PLMs) and prompting (Schick and Schütze, 2021) have substantially improved the performance of IDRR. More fine-grained levels of senses have been explored by Liu et al. (2020), Long and Webber (2022), and Chan et al. (2023b). Besides, works such as Wu et al. (2020, 2022) utilize the dependencies between hierarchically structured sense labels to predict multi-level senses simultaneously. However, these methods may not sufficiently exploit the global and local hierarchies for discourse relation representations.

Contrastive Learning
Contrastive learning was initially proposed in Computer Vision (CV) as a self-supervised representation learning method, aiming to pull semantically close samples together and push dissimilar samples apart (He et al., 2020; Chen et al., 2020b). In NLP, contrastive learning has also achieved extraordinary success in various tasks, including semantic textual similarity (STS) (Gao et al., 2021; Shou et al., 2022; Jiang et al., 2022), information retrieval (IR) (Hong et al., 2022), and relation extraction (RE) (Chen et al., 2021). Though supervised contrastive learning could intuitively be applied to IDRR by constructing positive pairs according to the annotated sense labels, doing so ignores the hierarchical structure of senses. This paper is the first work to meticulously adapt contrastive learning to IDRR by considering the global and local hierarchies of senses.

Problem Definition
Given $M$ hierarchical levels of defined senses $S = (S^1, \dots, S^m, \dots, S^M)$, where $S^m$ is the set of senses at the $m$-th hierarchical level, and an input instance consisting of two text spans $x_i = (\mathrm{arg}_1, \mathrm{arg}_2)$, our model aims to output a sequence of senses $y_i = (y_i^1, \dots, y_i^m, \dots, y_i^M)$, where $y_i^m \in S^m$.
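For concreteness, the input/output structure can be sketched as follows; this is a hypothetical illustration, and the class and field names are ours rather than taken from the released code:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IDRRInstance:
    arg1: str            # first text span
    arg2: str            # second text span
    senses: List[str]    # one gold sense per level, coarse to fine

# a made-up instance with M = 3 levels (Top, Second, Connective)
example = IDRRInstance(
    arg1="The company posted strong earnings.",
    arg2="Its stock price fell.",
    senses=["Comparison", "Contrast", "but"],
)
```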

Discourse Relation Encoder
Given an instance $x_i = (\mathrm{arg}_1, \mathrm{arg}_2)$, we concatenate the two arguments and formulate them as a sequence with special tokens: $[\mathrm{CLS}]\ \mathrm{arg}_1\ [\mathrm{SEP}]\ \mathrm{arg}_2\ [\mathrm{SEP}]$, where [CLS] and [SEP] denote the beginning and the end of sentences, respectively. Then we feed the sequence through a Transformer (Vaswani et al., 2017) encoder to acquire contextualized token representations $H$. Previous works (Liu and Li, 2016; Liu et al., 2020) have shown that modeling the interaction between the two arguments is beneficial for IDRR; we therefore apply $L_1$ layers of Multi-Head Interactive Attention (MHIA) over the argument representations and derive the discourse relation representation $h_i$.
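Below is a minimal sketch of the encoding step, assuming RoBERTa-base via Huggingface's transformers (as used in our experiments) and omitting the MHIA module, so it is an approximation rather than the exact encoder:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def encode_pair(arg1: str, arg2: str) -> torch.Tensor:
    # RoBERTa's tokenizer inserts its own boundary tokens (<s>, </s>),
    # which play the roles of [CLS] and [SEP] in the formulation above
    inputs = tokenizer(arg1, arg2, return_tensors="pt", truncation=True)
    H = encoder(**inputs).last_hidden_state   # (1, seq_len, d)
    return H[:, 0]                            # sequence-level representation
```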

Staircase Classifier
Given the discourse relation representation $h_i$ of an instance, we propose a "staircase" classifier inspired by Abbe et al. (2021) to output the label logits in a top-down manner, where the higher-level logits are used to guide the logits at the current level:

$$o_i^m = W^m [h_i; o_i^{m-1}] + b^m, \qquad p_i^m = \mathrm{softmax}(o_i^m),$$

where $o_i^m$ denotes the logits at the $m$-th level ($o_i^0$ is an empty vector), $[\cdot;\cdot]$ denotes concatenation, and $W^m$ and $b^m$ are learnable parameters. Then the cross-entropy loss of the classifier is defined as:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|N|} \sum_{i \in N} \sum_{m=1}^{M} \vec{y}_i^m \cdot \log p_i^m,$$

where $N$ denotes a batch of training instances and $\vec{y}_i^m$ is the one-hot encoding of the ground-truth sense label $y_i^m$.
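A possible PyTorch realization of this top-down scheme is sketched below; the exact parameterization in the paper may differ, but the sketch follows the guidance idea literally, concatenating the higher-level logits to the input of the next head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StaircaseClassifier(nn.Module):
    """Each level's head sees the relation representation h plus the
    logits of the previous (coarser) level."""
    def __init__(self, d_model: int, level_sizes: list):
        super().__init__()
        heads, prev = [], 0
        for n_labels in level_sizes:        # e.g. [4, 11, 102] for PDTB 2.0
            heads.append(nn.Linear(d_model + prev, n_labels))
            prev = n_labels
        self.heads = nn.ModuleList(heads)

    def forward(self, h: torch.Tensor):
        logits, prev = [], h.new_zeros(h.size(0), 0)
        for head in self.heads:
            o = head(torch.cat([h, prev], dim=-1))
            logits.append(o)
            prev = o                         # higher-level logits guide this level
        return logits

def staircase_ce_loss(logits, targets):
    # sum of per-level cross-entropy losses; targets: list of (B,) tensors
    return sum(F.cross_entropy(o, t) for o, t in zip(logits, targets))
```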

Global Hierarchy-aware Contrastive Learning
The Global Hierarchy-aware Contrastive Learning module first exploits a Global Hierarchy Encoder to encode the global hierarchy into sense label embeddings. Then, it matches the discourse relation representation of an input instance with its corresponding sense label embeddings in a joint embedding space based on contrastive learning.

Global Hierarchy Encoder
To encode the label hierarchy from a global view, we regard the hierarchical structure of senses as an undirected graph, where each sense corresponds to a graph node. Then we adopt a graph convolutional network (GCN) (Kipf and Welling, 2017) to induce node embeddings for each sense based on the properties of its neighborhood. The adjacency matrix $A \in \mathbb{R}^{|S| \times |S|}$ is defined as:

$$A_{ij} = \begin{cases} 1, & \text{if } \mathrm{child}(i) = j \text{ or } \mathrm{child}(j) = i \text{ or } i = j, \\ 0, & \text{otherwise}, \end{cases}$$

where $S$ is the set of all senses, $i, j \in S$, and $\mathrm{child}(i) = j$ means that sense $j$ is a subclass of sense $i$. Setting the number of GCN layers to $L_2$ and the initial representation of sense $i$ to $r_i^0 \in \mathbb{R}^{d_r}$, the GCN updates the sense embeddings with the following layer-wise propagation rule:

$$r_i^l = \sigma\Big( \sum_{j \in S} \frac{A_{ij}}{\sqrt{D_{ii} D_{jj}}} \big( W^l r_j^{l-1} + b^l \big) \Big),$$

where $l \in [1, L_2]$, $W^l \in \mathbb{R}^{d_r \times d_r}$ and $b^l \in \mathbb{R}^{d_r}$ are learnable parameters at the $l$-th GCN layer, $D_{ii} = \sum_j A_{ij}$ is the degree of node $i$, and $\sigma$ is a non-linear activation function.
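The following is a compact sketch of this encoder with the symmetric normalization written out explicitly; it assumes the adjacency matrix $A$ (with self-loops) defined above:

```python
import torch
import torch.nn as nn

class GlobalHierarchyEncoder(nn.Module):
    def __init__(self, A: torch.Tensor, d_r: int = 100, num_layers: int = 2):
        super().__init__()
        deg = A.sum(dim=1)                  # D_ii = sum_j A_ij (>= 1 via self-loops)
        d_inv_sqrt = deg.pow(-0.5)
        # symmetrically normalized adjacency: D^{-1/2} A D^{-1/2}
        self.register_buffer("A_hat", d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
        self.layers = nn.ModuleList(nn.Linear(d_r, d_r) for _ in range(num_layers))

    def forward(self, R: torch.Tensor) -> torch.Tensor:
        # R: (|S|, d_r) initial sense embeddings
        for layer in self.layers:
            R = torch.relu(self.A_hat @ layer(R))   # layer-wise propagation
        return R                                    # final sense embeddings
```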

Semantic Match in a Joint Embedding Space
In this part, we match textual semantics and label semantics in a text-label joint embedding space where correlations between texts and labels are exploited, as depicted in the upper right part of Figure 3. We first project the discourse relation representation $h_i$ of an instance $x_i$ and the sense label embeddings $\{r_s\}_{s \in S}$ into a common latent space by two different Multi-Layer Perceptrons (MLPs) $\Phi_1$ and $\Phi_2$. Then, we apply a contrastive loss to capture text-label matching relationships, regularizing the discourse relation representation to be semantically close to the representations of its target labels and far away from those of the incorrect labels:

$$\mathcal{L}_{\mathrm{Global}} = -\frac{1}{|N|} \sum_{i \in N} \frac{1}{|y_i|} \sum_{y \in y_i} \log \frac{\exp\big(\mathrm{sim}(\Phi_1(h_i), \Phi_2(r_y)) / \tau\big)}{\sum_{s \in S} \exp\big(\mathrm{sim}(\Phi_1(h_i), \Phi_2(r_s)) / \tau\big)},$$

where $N$ denotes a batch of training instances, $y_i$ is the sense label sequence of instance $x_i$, $\mathrm{sim}(\cdot)$ is the cosine similarity function, and $\tau$ is a temperature hyperparameter. By minimizing the global hierarchy-aware contrastive loss, the distribution of discourse relation representations is refined to be similar to the label distribution.
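A minimal sketch of this loss, assuming the projections $\Phi_1$ and $\Phi_2$ have already been applied and the gold labels of each instance are given as index lists:

```python
import torch
import torch.nn.functional as F

def global_contrastive_loss(h_proj, r_proj, gold_labels, tau=0.1):
    """h_proj: (B, d) projected relation representations Phi_1(h);
    r_proj: (|S|, d) projected label embeddings Phi_2(r);
    gold_labels: per-instance lists of gold sense indices (one per level)."""
    h = F.normalize(h_proj, dim=-1)
    r = F.normalize(r_proj, dim=-1)
    # cosine similarity / temperature, normalized over all senses
    log_prob = F.log_softmax(h @ r.t() / tau, dim=-1)
    # pull each instance toward all of its gold labels, averaged per instance
    return -torch.stack(
        [log_prob[i, idx].mean() for i, idx in enumerate(gold_labels)]
    ).mean()
```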
Here we would like to highlight the key differences between our model and LDSGM (Wu et al., 2022), since both utilize a GCN to acquire label representations. First, we use a different approach to capture the associations between the acquired label representations and the input text. In Wu et al. (2022), the associations are implicitly captured by a standard attention mechanism. In contrast, our model learns them explicitly, refining the distribution of discourse relation representations to match the label distribution via contrastive learning. Second, our work introduces an aspect overlooked by earlier studies including Wu et al. (2022): the utilization of local hierarchy information, which enables our model to better differentiate between similar discourse relations and achieve further improvements.

Local Hierarchy-aware Contrastive Learning
Following Gao et al. (2021), we duplicate a batch of training instances $N$ as $N^+$ and feed both $N$ and $N^+$ through our Discourse Relation Encoder with diverse dropout augmentations to obtain $2|N|$ discourse relation representations. Then we apply an MLP layer $\Phi_3$ over the representations, which is shown to be beneficial for contrastive learning (Chen et al., 2020b).
To incorporate the local hierarchy into discourse relation representations, it is tempting to directly apply supervised contrastive learning (Gunel et al., 2021), which requires positive pairs to have identical senses at every hierarchical level $m \in [1, M]$:

$$\mathcal{L}_{\mathrm{SCL}} = -\frac{1}{2|N|} \sum_{i=1}^{2|N|} \frac{1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(z_i, z_p)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}, \qquad (6)$$

where $z_i$ is the projected representation of the $i$-th augmented instance and $P(i) = \{p \neq i : y_p^m = y_i^m \ \forall m \in [1, M]\}$ is the set of in-batch instances sharing all sense labels with $x_i$. However, Equation (6) ignores the more subtle semantic structures of the local hierarchy: it only admits pairs with identical label sequences as positives and takes no account of examples with highly similar annotations. To illustrate, consider Instances (1) and (3) in Figure 2, whose sense label sequences only differ at the most fine-grained level. They are nevertheless regarded as a negative pair in Equation (6), rather than a "relatively" positive pair. The standard for selecting positive pairs in Equation (6) is thus too strict and may pull semantically similar representations apart. To loosen this restriction, we regard all instance pairs as positive pairs but assign each a degree of positiveness, using a novel scoring function to calculate the similarity between label sequences $y_i = (y_i^1, \dots, y_i^m, \dots, y_i^M)$ and $y_j = (y_j^1, \dots, y_j^m, \dots, y_j^M)$. In our case, there are three hierarchical levels, Top, Second, and Connective, denoted by T, S, and C. Consequently, there are in total $K = 6$ sub-paths in the hierarchy, i.e., $P = \{\mathrm{T}, \mathrm{S}, \mathrm{C}, \mathrm{TS}, \mathrm{SC}, \mathrm{TSC}\}$. We calculate the Dice similarity coefficient for each sub-path among the hierarchical levels and take the average as the similarity score between $y_i$ and $y_j$:

$$\mathrm{Score}(y_i, y_j) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Dice}(P_i^k, P_j^k), \qquad (7)$$

where $\mathrm{Dice}(A, B) = \frac{2|A \cap B|}{|A| + |B|}$ and $P_i^k$ is the $k$-th sub-path label set of $y_i$. Taking Instances (1) and (3) in Figure 2 as examples, their label sequences are (Top: Comparison, Sec: Contrast, Conn: but) and (Top: Comparison, Sec: Contrast, Conn: however), respectively, so the similarity score is $\frac{1}{6}(1 + 1 + 0 + 1 + \frac{1}{2} + \frac{2}{3}) \approx 0.69$. Finally, our local hierarchy-aware contrastive loss utilizes the similarity scores to guide the distance between discourse relation representations:

$$\mathcal{L}_{\mathrm{Local}} = -\frac{1}{2|N|} \sum_{i=1}^{2|N|} \sum_{j \neq i} \mathrm{Score}(y_i, y_j) \log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}. \qquad (8)$$

Compared with Equation (6), Equation (8) considers the more subtle semantic structures of the local hierarchy when selecting positive pairs. It increases the relevance of representations for all similarly labeled instances and only pushes apart instances with entirely different local hierarchies. Thus, the local hierarchical information is sufficiently incorporated into discourse relation representations.
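The scoring function and the weighted loss can be sketched as follows; the per-instance normalization in the loss is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

# sub-paths over level indices 0 (Top), 1 (Second), 2 (Connective)
SUB_PATHS = [(0,), (1,), (2,), (0, 1), (1, 2), (0, 1, 2)]  # T, S, C, TS, SC, TSC

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

def label_similarity(y_i, y_j) -> float:
    # average Dice coefficient over the K = 6 sub-path label sets
    return sum(
        dice({y_i[m] for m in p}, {y_j[m] for m in p}) for p in SUB_PATHS
    ) / len(SUB_PATHS)

def local_contrastive_loss(z, label_seqs, tau=0.1):
    """z: (n, d) projected representations Phi_3 of the duplicated batch;
    label_seqs: list of n sense label sequences, coarse to fine."""
    z = F.normalize(z, dim=-1)
    n = z.size(0)
    sim = (z @ z.t()) / tau
    mask = torch.eye(n, dtype=torch.bool)
    log_prob = F.log_softmax(sim.masked_fill(mask, float("-inf")), dim=-1)
    loss = 0.0
    for i in range(n):
        for j in range(n):                   # self-pairs excluded
            if i != j:
                loss -= label_similarity(label_seqs[i], label_seqs[j]) * log_prob[i, j]
    return loss / n
```

For the two label sequences of the worked example above, `label_similarity` returns (1 + 1 + 0 + 1 + 0.5 + 2/3) / 6 ≈ 0.69, matching the calculation in the text.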
The overall training objective is the combination of the classification loss, the global hierarchy-aware contrastive loss, and the local hierarchy-aware contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_1 \mathcal{L}_{\mathrm{Global}} + \lambda_2 \mathcal{L}_{\mathrm{Local}},$$

where $\lambda_1$ and $\lambda_2$ are coefficients for the global and local hierarchy-aware contrastive losses, respectively. We set them to 0.1 and 1.0 during training, according to a hyperparameter search (see Appendix C).
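As a trivial helper tying the pieces together, the overall objective is just a weighted sum of the three losses with the coefficients above:

```python
def total_loss(loss_ce, loss_global, loss_local, lam1=0.1, lam2=1.0):
    # L = L_CE + lambda_1 * L_Global + lambda_2 * L_Local
    return loss_ce + lam1 * loss_global + lam2 * loss_local
```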

Dataset
The Penn Discourse Treebank 2.0 (PDTB 2.0) (Prasad et al., 2008) is a large-scale English corpus annotated with information on discourse structure and semantics. PDTB 2.0 has three levels of senses, i.e., classes, types, and sub-types. Since only part of the instances are annotated with third-level senses, we take the top-level and second-level senses into consideration and regard the implicit connectives as third-level senses. There are 4 top-level senses: Temporal (Temp), Contingency (Cont), Comparison (Comp), and Expansion (Expa). Further, there exist 16 second-level senses, of which we only consider the 11 major second-level implicit types, following previous works (Liu et al., 2020; Wu et al., 2022). For connective classification, we consider all 102 connectives defined in PDTB 2.0.
Appendix A shows the detailed statistics of the PDTB corpora. We follow earlier works (Ji and Eisenstein, 2015; Liu et al., 2020; Wu et al., 2022) in using Sections 2-20 of the corpus for training, Sections 0-1 for validation, and Sections 21-22 for testing. In PDTB 2.0 and PDTB 3.0, around 1% of data samples have multiple annotated senses. Following Qin et al. (2016), we treat them as separate instances during training to avoid ambiguity. At test time, a prediction matching any one of the gold senses is regarded as correct.
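For illustration, this treatment of multiply-annotated samples can be sketched as follows, with hypothetical field names:

```python
def expand_for_training(instances):
    """Split instances with multiple annotated sense sequences into
    separate training instances (one per gold annotation)."""
    return [
        (x["arg1"], x["arg2"], senses)
        for x in instances
        for senses in x["gold_sense_sequences"]
    ]

def is_correct(predicted_senses, gold_sense_sequences):
    """At test time, a prediction matching any gold annotation counts."""
    return predicted_senses in gold_sense_sequences
```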

Baselines
To validate the effectiveness of our method, we compare it with current state-of-the-art models. Since prior work generally evaluates on a single dataset (either PDTB 2.0 or PDTB 3.0), we use a distinct set of baselines for each. Because PDTB 3.0 was only released in 2019, fewer baselines are available for it than for PDTB 2.0.
Baselines for PDTB 2.0

• NNMA (Liu and Li, 2016): a neural network with multiple levels of attention.
• PDRR (Dai and Huang, 2018): a paragraph-level neural network that models inter-dependencies between discourse units as well as discourse relation continuity and patterns.
• IDRR-Con (Shi and Demberg, 2019): a neural model that leverages the inserted connectives to learn better argument representations.
• RoBERTa (Fine-tuning): a RoBERTa-based model fine-tuned on three sense levels separately.
Baselines for PDTB 3.0

• MANF (Xiang et al., 2022a): a multi-attentive neural fusion model to encode and fuse both semantic connection and linguistic evidence.
• RoBERTa (Fine-tuning): a RoBERTa-based model fine-tuned on three sense levels separately.
• ConnPrompt (Xiang et al., 2022b): a PLM-based model using a connective-cloze prompt to transform the IDRR task into a connective-cloze prediction task.

Implementation Details
We implement our model based on Huggingface's transformers (Wolf et al., 2020) and use pre-trained RoBERTa (Liu et al., 2019) (base or large version) as our Transformer encoder. The numbers of MHIA and GCN layers ($L_1$ and $L_2$) are both set to 2. We set the temperature $\tau$ in contrastive learning to 0.1. We implement $\Phi_1$, $\Phi_2$, and $\Phi_3$ as simple MLPs with one hidden layer and a tanh activation function, which enables the gradient to be easily backpropagated to the encoder. The node embeddings of senses, with dimension 100, are randomly initialized by Kaiming normal initialization (He et al., 2015). To avoid overfitting, we apply dropout with a rate of 0.1 after each GCN layer. We adopt the AdamW optimizer with a learning rate of 1e-5 and a batch size of 32 to update the model parameters for 15 epochs. The evaluation step is set to 100, and all hyperparameters are determined according to the best average model performance over the three levels on the validation set.
All experiments are performed five times with different random seeds, and all reported results are averages over these runs.
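For reference, the projection heads $\Phi_1$, $\Phi_2$, and $\Phi_3$ described above can be realized as follows; the hidden size is not specified in the text, so it is left as a free parameter here:

```python
import torch.nn as nn

def make_projection(d_in: int, d_hidden: int, d_out: int) -> nn.Sequential:
    # one hidden layer with tanh activation, as specified for Phi_1..Phi_3
    return nn.Sequential(
        nn.Linear(d_in, d_hidden),
        nn.Tanh(),
        nn.Linear(d_hidden, d_out),
    )
```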

Multi-label Classification Comparison
The primary experimental results are presented in Table 1, which enables us to draw the following conclusions:

• First, our GOLF model achieves new state-of-the-art performance across all three levels, as evidenced by both macro-F1 and accuracy metrics. Specifically, on PDTB 2.0, GOLF (base) outperforms the current state-of-the-art LDSGM model (Wu et al., 2022) by 2.03%, 1.25%, and 1.11% macro-F1 at the three levels, respectively. Additionally, it exhibits 1.34%, 0.83%, and 0.65% improvements over the current best results in terms of accuracy. Moreover, on PDTB 3.0, GOLF (base) also outperforms the current state-of-the-art ConnPrompt model (Xiang et al., 2022b) by 1.37% F1 and 1.19% accuracy at the top level.
• Second, employing RoBERTa-large embeddings in GOLF leads to a significant improvement in its performance. This observation indicates that our GOLF model can effectively benefit from larger pre-trained language models (PLMs).
• Finally, despite the impressive performance of recent large language models (LLMs) such as ChatGPT (OpenAI, 2022) in few-shot and zero-shot learning for various understanding and reasoning tasks (Bang et al., 2023; Jiang et al., 2023), they still lag behind our GOLF (base) model by approximately 30% on PDTB 2.0. This difference suggests that ChatGPT may struggle to comprehend the abstract sense of each discourse relation and extract the relevant language features from the text. Therefore, implicit discourse relation recognition remains a challenging and crucial task for the NLP community, which requires further exploration.

Label-wise Classification Comparison
Here we present an evaluation of GOLF's performance on PDTB 2.0 using a label-wise F1 comparison for the top-level and second-level senses.

Multi-level Consistency Comparison

Following Wu et al. (2022), we evaluate the consistency among multi-level sense predictions via two metrics: 1) Top-Sec: the percentage of instances whose top-level and second-level senses are both predicted correctly; 2) Top-Sec-Con: the percentage of instances whose senses at all three levels are predicted correctly. Our model's results, as displayed in Table 5, show more consistent predictions across the three levels than current state-of-the-art models.
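Both metrics are straightforward to compute; a small sketch, assuming predictions and gold labels are given as (top, second, connective) triples:

```python
def consistency_metrics(preds, golds):
    """preds, golds: lists of (top, second, connective) label triples."""
    n = len(golds)
    # correct at both the top and second levels
    top_sec = sum(p[0] == g[0] and p[1] == g[1] for p, g in zip(preds, golds))
    # correct at all three levels
    top_sec_con = sum(tuple(p) == tuple(g) for p, g in zip(preds, golds))
    return top_sec / n, top_sec_con / n
```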

Limitations
In this section, we discuss the limitations of our method, which can be summarized in two aspects. First, since cumbersome data annotation has led to few publicly available IDRR datasets, we only conduct experiments on English corpora, namely PDTB 2.0 and PDTB 3.0. In the future, we plan to comprehensively evaluate our model on more datasets and on datasets in other languages.
Second, considering that PDTB instances are contained in paragraphs of Wall Street Journal articles, our approach ignores the wider paragraph-level context beyond the two discourse arguments. As shown by Dai and Huang (2018), positioning discourse arguments in the wider context of a paragraph may further benefit implicit discourse relation recognition. It is worth exploring how to effectively build wider-context-informed discourse relation representations and capture the overall discourse structure at the paragraph level.

Ethics Statement
Since our method relies on pre-trained language models, it may run the risk of inheriting and propagating some of the negative biases from the data the models were pre-trained on (Bender et al., 2021). Beyond this, we do not see any other potential risks.

C Effects of Hyperparameters
Here we investigate the effects of various hyperparameters on the development set of PDTB 2.0: the number of MHIA layers $L_1$ (Figure 5), the number of GCN layers $L_2$ (Figure 6), the coefficient $\lambda_1$ of the global hierarchy-aware contrastive loss (Figure 7), the coefficient $\lambda_2$ of the local hierarchy-aware contrastive loss (Figure 8), and the temperature $\tau$ in contrastive learning (Figure 9). Note that we only change one hyperparameter at a time.

Figure 2: Three instances from PDTB 2.0. The sense label sequence of each instance is defined as local hierarchy in our paper.

Figure 3 illustrates the overall architecture of our multi-task learning framework. Beginning at the left part of Figure 3, we utilize a Discourse Relation Encoder to capture the interaction between the two arguments.

Figure 3: The overall architecture of our framework. The squares denote discourse relation representations. For the local hierarchy-aware contrastive loss $\mathcal{L}_{\mathrm{Local}}$, we use colored squares to denote discourse relation representations of various instances in a mini-batch and list their sense label sequences on the left. The numbers on the right are similarity scores between sense label sequences calculated by our scoring function.

Figure 4: t-SNE visualization of discourse relation representations for the top-level and second-level senses on the PDTB 2.0 test set.

Figure 5: Effects of the number of MHIA layers $L_1$ on the development set.

Figure 6: Effects of the number of GCN layers $L_2$ on the development set.

Figure 7: Effects of the coefficient $\lambda_1$ of the global hierarchy-aware contrastive loss on the development set.

Figure 8: Effects of the coefficient $\lambda_2$ of the local hierarchy-aware contrastive loss on the development set.

Figure 9: Effects of the temperature $\tau$ in contrastive learning on the development set.

Table 2: Label-wise F1 scores (%) for the top-level senses of PDTB 2.0. The proportion of each sense is listed below its name.
Table 2 showcases the label-wise F1 comparison for the top-level senses, demonstrating that GOLF significantly improves the performance on minority senses such as Temp and Comp. In Table 3, we compare GOLF with the current state-of-the-art models on the second-level senses. Our results show clear gains on several second-level senses, including Temp.Synchrony and Comp.Concession.

Table 3: Label-wise F1 scores (%) for the second-level senses of PDTB 2.0. The proportion of each sense is listed behind its name.

To further validate our model's ability to derive better discourse relation representations, we compare the generated representations of GOLF with those of current state-of-the-art models for both top-level and second-level senses in Appendix B.

Table 4: Ablation study on PDTB 2.0, reporting the accuracy and F1 at each level as well as the consistency between hierarchies. "w/o" stands for "without"; "r.p." stands for "replace"; MHIA stands for Multi-Head Interactive Attention; $\mathcal{L}_G$ stands for the global hierarchy-aware contrastive loss; $\mathcal{L}_L$ stands for the local hierarchy-aware contrastive loss.

Table 5: Comparison with current state-of-the-art models on the consistency among multi-level sense predictions.

Table 6: The data statistics of second-level senses in PDTB 2.0.

Table 7: The data statistics of second-level senses in PDTB 3.0.