TEPrompt: Task Enlightenment Prompt Learning for Implicit Discourse Relation Recognition

Implicit Discourse Relation Recognition (IDRR) aims at classifying the relation sense between two arguments without an explicit connective. Recently, ConnPrompt~\cite{Wei.X:et.al:2022:COLING} leveraged the powerful prompt learning paradigm for IDRR by fusing multi-prompt decisions from three different yet quite similar connective prediction templates. Instead of multi-prompt ensembling, we propose to design auxiliary tasks with enlightened prompt learning for the IDRR task. Although an auxiliary task does not directly output the final prediction, we argue that during joint training some of its learned features can help boost the main task. In light of such motivations, we propose a task enlightenment prompt learning model, called TEPrompt, that fuses features learned from three related tasks for IDRR. In particular, TEPrompt contains three tasks, viz., Discourse Relation Recognition (DRR), Sense Semantics Classification (SSC), and Annotated Connective Prediction (ACP), each with a unique prompt template and an answer space. In the training phase, we jointly train the three prompt learning tasks with a shared argument representation. In the testing phase, we take only the DRR output with fused features as the final IDRR decision. Experiments under the same conditions show that the proposed TEPrompt outperforms ConnPrompt. We attribute this to the promoted decision features and the language model benefiting from the joint training of auxiliary tasks.


Introduction
Implicit Discourse Relation Recognition (IDRR) is to detect and classify a latent relation between a pair of text segments (called arguments) without an explicit connective (Xiang and Wang, 2023). Fig. 1 illustrates an argument pair example with a Contingency relation in the Penn Discourse TreeBank (PDTB) corpus, where the implicit connective 'so' was inserted by annotators. IDRR is of great importance for many downstream Natural Language Processing (NLP) applications, such as question answering (Liakata et al., 2013), machine translation (Guzmán et al., 2014), summarization (Huang and Kurohashi, 2021), etc. However, due to the absence of an explicit connective, inferring discourse relations from the contextual semantics of arguments remains a challenging task. The conventional pre-train and fine-tuning paradigm (Liu et al., 2021) designs sophisticated neural networks to encode the representation of argument pairs upon a Pre-trained Language Model (PLM) for relation classification (Chen et al., 2016b; Liu and Li, 2016; Ruan et al., 2020; Li et al., 2020; Liu et al., 2020). On the one hand, these task-specific neural networks introduce additional parameters that need to be trained with a large amount of labelled data. On the other hand, the task objective function is often not in accordance with that of the PLM, so the PLM needs to be fine-tuned for downstream tasks, resulting in poor utilization of the encyclopedic linguistic knowledge embedded during pre-training.
The recent ConnPrompt model (Xiang et al., 2022b) has successfully applied the pre-train, prompt, and predict paradigm, i.e., so-called prompt learning, to the IDRR task by transforming IDRR into a connective-cloze task that predicts an answer word and maps it to a relation sense. ConnPrompt achieved new state-of-the-art performance on the commonly used PDTB corpus (Webber et al., 2019); however, it designs three different yet quite similar connective prediction templates, which insert the [MASK] token between the two arguments or at the beginning of one argument for answer prediction. Moreover, to fuse the different prompt predictions, ConnPrompt employs simple majority voting for the final relation sense prediction.
Instead of simple multi-prompt ensembling, we argue that auxiliary prompt tasks can be designed to enlighten the main prompt task with promoted decision features. For example, as the top-level relation labels in the PDTB corpus are plain vocabulary words, we can design an auxiliary task to directly predict such label words from the PLM vocabulary. Furthermore, as the PDTB corpus also contains manually annotated implicit connectives, we can design another auxiliary task to directly predict an annotated connective. Although such auxiliary tasks are not necessarily used to output the final IDRR prediction, they can be jointly trained with the main task on a shared PLM, by which some features learned from the auxiliary tasks can be fused into the main task to promote its decision features for the final prediction.
Motivated by such considerations, we propose a Task Enlightenment Prompt Learning (TEPrompt) model, in which the main IDRR task is enlightened by auxiliary prompt tasks whose learned features are fused to promote its decision features. Specifically, TEPrompt contains a main prompt task, Discourse Relation Recognition (DRR), and two auxiliary prompt tasks: Sense Semantics Classification (SSC) and Annotated Connective Prediction (ACP). We design each prompt task with a unique template and an answer space. We concatenate the three prompt templates into an entire word sequence, with two newly added special tokens [Arg1] and [Arg2] for shared argument representation, as the input of a PLM. In the training phase, we jointly train the three prompt tasks upon one PLM but with three different answer predictions as objective functions. In the testing phase, we take only the main prompt decision features, promoted by fusing the features from the two auxiliary prompts, to output the final IDRR decision.
Experiment results show that our proposed TEPrompt outperforms ConnPrompt under the same conditions and achieves new state-of-the-art performance on the latest PDTB 3.0 corpus.

Pre-train and Fine-tuning Paradigm
The conventional pre-train and fine-tuning paradigm usually approaches the IDRR task as a classification problem, where the key is to design a sophisticated downstream neural network for argument representation learning (Zhang et al., 2015; Rutherford et al., 2017). For example, the SCNN model (Zhang et al., 2015) obtains each argument representation via a single convolution layer and concatenates the two arguments' representations for relation classification. Some hybrid models have attempted to combine CNNs, LSTMs, graph convolutional networks, etc., for argument representation learning (Zhang et al., 2021; Jiang et al., 2021b).
Attention mechanisms have been widely used in neural models to unequally encode each word according to its importance for argument representation (Zhou et al., 2016; Guo et al., 2020; Ruan et al., 2020; Li et al., 2020). For example, Zhou et al. (2016) apply self-attention to weight a word according to its similarity to its belonging argument. Ruan et al. (2020) propose a pipeline workflow that applies interactive attention after self-attention. Li et al. (2020) use a penalty-based loss re-estimation method to regulate the attention learning.
Word-pair features have been exploited to capture interactions between arguments for representation learning (Chen et al., 2016a,b; Xiang et al., 2022a). For example, Chen et al. (2016b) construct a relevance-score word-pair interaction matrix based on a bilinear model (Jenatton et al., 2012) and a single-layer neural model (Collobert and Weston, 2008). Xiang et al. (2022a) propose an offset matrix network to encode word pairs' offsets as linguistic evidence for argument representation.

Pre-train, Prompt, and Predict Paradigm
Recently, large-scale PLMs have been proposed, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and T5 (Raffel et al., 2020). Prompt learning has become a new paradigm for many NLP tasks, which uses the probability of text in PLMs to perform a prediction task and has achieved promising results (Seoh et al., 2021; Wang et al., 2021; Ding et al., 2021). For example, Seoh et al. (2021) propose a cloze question prompt and a natural language inference prompt. Some studies design appropriate prompts to reformulate the IDRR task for predicting discourse relations (Jiang et al., 2021a,b; Xiang et al., 2022b). Jiang et al. (2021a) use a masked PLM to generate a pseudo-connective for relation classification. Jiang et al. (2021b) utilize the PLM T5 (Raffel et al., 2020) to generate a target sentence that contains the meaning of the discourse relation. Xiang et al. (2022b) propose the ConnPrompt model with new state-of-the-art performance, which reformulates the IDRR task as a connective-cloze task; they further use majority voting to fuse the decisions of the same task with three quite similar cloze templates for final relation sense prediction.
Our proposed TEPrompt model fuses the learned features of two auxiliary prompt tasks to boost the main prompt task for relation prediction.
The Proposed TEPrompt Model

Fig. 2 presents our TEPrompt model, including three modules of prompt templatize, answer prediction, and verbalizer for the main prompt task (DRR) and two auxiliary prompt tasks (SSC and ACP). The main DRR prompt task uses a kind of connective-cloze prompt to predict a manually selected answer word between two arguments and map it to a relation sense; the SSC auxiliary prompt task describes and classifies the sense semantics between the two arguments; and the ACP task describes and predicts the implicit connective words.

Prompt Templatize
We first reformulate an input argument pair $x = (Arg_1; Arg_2)$ into a prompt template $T(x)$ by concatenating the main DRR prompt template with the two auxiliary prompt templates, SSC and ACP, as the input of a PLM. Some PLM-specific tokens such as [MASK], [CLS], and [SEP] are inserted in the prompt template: the [MASK] tokens are added for the PLM to predict an answer word $v$, while the [CLS] and [SEP] tokens indicate the beginning and ending of each prompt template, respectively. Fig. 3 illustrates the three templates for our DRR, SSC, and ACP tasks. We use a kind of connective-cloze prompt template as the main DRR prompt template $T_D(x)$, in which argument-1 and argument-2 are concatenated into an entire word sequence and the [MASK] token is inserted between the two arguments. Besides, two newly added specific tokens [Arg1] and [Arg2] are inserted in front of argument-1 and argument-2 to represent their semantics, which are also shared in the SSC template.
We also design two discrete prompt templates $T_S(x)$ and $T_A(x)$ for the auxiliary tasks SSC and ACP, respectively. The text of the SSC template describes the sense semantics between argument-1 and argument-2, while the text of the ACP template describes the implicit connective words. The [MASK] tokens are inserted at the end of the SSC and ACP templates for prediction. Note that in the SSC template, the specific tokens [Arg1] and [Arg2] are used to represent the semantics of argument-1 and argument-2, which are shared and trained with the main prompt task.
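As a concrete illustration, the template concatenation described above can be sketched in a few lines of Python. The wording of the SSC and ACP descriptions and the function name are placeholders for illustration, not the paper's exact template texts:

```python
# Illustrative sketch of the concatenated TEPrompt input sequence.
# The template wording below is a placeholder; only the token layout
# (special tokens, [MASK] positions) follows the description above.

def build_teprompt_input(arg1: str, arg2: str) -> str:
    # Main DRR template: connective-cloze with [MASK] between the arguments,
    # and the shared tokens [Arg1]/[Arg2] prefixed to each argument.
    drr = f"[CLS] [Arg1] {arg1} [MASK] [Arg2] {arg2} [SEP]"
    # Auxiliary SSC template: describes the sense semantics between the two
    # arguments via the shared [Arg1]/[Arg2] tokens, with [MASK] at the end.
    ssc = "[CLS] The sense between [Arg1] and [Arg2] is [MASK] [SEP]"
    # Auxiliary ACP template: describes the implicit connective, [MASK] at the end.
    acp = "[CLS] The connective word is [MASK] [SEP]"
    return " ".join([drr, ssc, acp])

example = build_teprompt_input("it was raining", "the match was cancelled")
```

Each of the three sub-templates keeps its own [CLS]/[SEP] pair and one [MASK], so the PLM produces three [MASK] hidden states (one per task) from a single forward pass.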

Answer Prediction
After the PLM, we obtain a hidden state $h$ for each input token in the prompt templates, where $h \in \mathbb{R}^{d_h}$ and $d_h$ is the dimension of the hidden state. We use $h_m^{DRR}$, $h_m^{SSC}$, and $h_m^{ACP}$ to denote the hidden states of the [MASK] tokens in the DRR, SSC, and ACP templates, respectively, which are used for the joint training of task enlightenment prompt learning; while $h_c^{SSC}$ and $h_c^{ACP}$ denote the hidden states of the [CLS] tokens in the SSC and ACP templates, respectively, which are used for the feature fusion of the auxiliary prompt tasks.
To fuse the features of the auxiliary prompts SSC and ACP into the main DRR task, we use a fusion gate mechanism to integrate their [CLS] representations into the [MASK] representation of the main DRR task, which is then used for the final answer word prediction. Specifically, we first use a fusion gate to integrate the [CLS] representations of SSC and ACP; the transition functions are computed as follows:
$$g_c = \sigma(W_s h_c^{SSC} + W_a h_c^{ACP} + b_c),$$
$$\tilde{h}_c = g_c \odot h_c^{SSC} + (1 - g_c) \odot h_c^{ACP},$$
where $W_s, W_a \in \mathbb{R}^{d_h \times d_h}$ and $b_c \in \mathbb{R}^{d_h}$ are learnable parameters, $\sigma$ is the sigmoid function, and $\odot$ denotes the element-wise product of vectors.
With the fusion gate, we adaptively assign different importance to the features of the SSC and ACP prompt tasks and output $\tilde{h}_c \in \mathbb{R}^{d_h}$ as the auxiliary prompt vector. We then use another fusion gate to integrate the auxiliary prompt vector $\tilde{h}_c$ into the [MASK] hidden state of the main DRR prompt, $h_m^{DRR}$, for the final answer prediction. The transition functions are:
$$g_m = \sigma(W_d h_m^{DRR} + W_c \tilde{h}_c + b_m),$$
$$\tilde{h}_m = g_m \odot h_m^{DRR} + (1 - g_m) \odot \tilde{h}_c,$$
where $W_d, W_c \in \mathbb{R}^{d_h \times d_h}$ and $b_m \in \mathbb{R}^{d_h}$ are learnable parameters. Finally, the Masked Language Model (MLM) classifier of the PLM uses the fused hidden state $\tilde{h}_m$ to estimate the probability of each word $v$ in its vocabulary $V$ for the [MASK] token of the DRR task:
$$P_D([\mathrm{MASK}] = v \mid T(x)) = \mathrm{softmax}(W_{mlm} \tilde{h}_m), \quad v \in V.$$
Note that the MLM classifier also estimates answer word probabilities $P_S$ and $P_A$ for the [MASK] tokens of the auxiliary prompt tasks SSC and ACP, without feature fusion, in the joint training.
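The two-stage gate fusion can be sketched framework-agnostically with NumPy. The weight names and random values here are illustrative stand-ins for the learned parameters, and the hidden size is shrunk from 768 for readability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_h = 8  # hidden size (768 in the actual model; small here for illustration)

# Hypothetical learnable parameters of the two fusion gates.
W_s, W_a, b_c = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)
W_d, W_c, b_m = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

h_ssc = rng.normal(size=d_h)  # [CLS] hidden state of the SSC template
h_acp = rng.normal(size=d_h)  # [CLS] hidden state of the ACP template
h_drr = rng.normal(size=d_h)  # [MASK] hidden state of the main DRR template

# Gate 1: blend the two auxiliary [CLS] features into one auxiliary vector.
g_c = sigmoid(W_s @ h_ssc + W_a @ h_acp + b_c)
h_aux = g_c * h_ssc + (1.0 - g_c) * h_acp

# Gate 2: fuse the auxiliary vector into the DRR [MASK] state; the result
# is what the MLM classifier scores for the DRR answer prediction.
g_m = sigmoid(W_d @ h_drr + W_c @ h_aux + b_m)
h_fused = g_m * h_drr + (1.0 - g_m) * h_aux
```

Because each gate lies in (0, 1) element-wise, the fused vector is a convex combination of its two inputs per dimension, which lets the model learn how much auxiliary signal to admit into the main task's decision feature.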

Verbalizer
We define a discrete answer space for each of the DRR, SSC, and ACP prompt tasks, all subsets of the PLM vocabulary. Specifically, we use sixteen manually selected answer words as the answer space $V_d$ of the DRR task, the same as in ConnPrompt (Xiang et al., 2022b). Besides, we use the four top-level sense labels in the PDTB corpus as the SSC answer space, $V_s$ = {Comparison, Contingency, Expansion, Temporal}, and the 174 manually annotated implicit connectives in the PDTB corpus as the ACP answer space $V_c$. We note that the answer space of DRR is next mapped to a relation sense in the verbalizer process, while the answer spaces of SSC and ACP are only used in the auxiliary task training. After answer prediction, a softmax layer is applied to the prediction scores of our pre-defined answer space to normalize them into probabilities:
$$P(v \mid T(x)) = \frac{\exp(s_v)}{\sum_{v' \in V_d} \exp(s_{v'})}, \quad v \in V_d,$$
where $s_v$ is the prediction score of answer word $v$. Then, the predicted answer word of DRR is projected onto a unique discourse relation sense based on a pre-defined connection regulation. Table 1 presents the verbalizer connection from the answer words to the PDTB discourse relation sense labels.
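A minimal sketch of this verbalizer step in plain Python follows. The answer-word-to-sense mapping below is a hypothetical subset in the style of Table 1, not the full sixteen-word answer space:

```python
import math

# Hypothetical subset of the answer-word-to-sense verbalizer (Table 1 style);
# the actual DRR answer space has sixteen manually selected words.
VERBALIZER = {
    "however": "Comparison", "but": "Comparison",
    "so": "Contingency", "because": "Contingency",
    "instead": "Expansion", "and": "Expansion",
    "then": "Temporal", "before": "Temporal",
}

def softmax(scores):
    # Normalize raw answer-space scores into probabilities (stable form).
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_sense(answer_scores):
    """answer_scores: dict mapping each answer word to its MLM score s_v."""
    words = list(answer_scores)
    probs = softmax([answer_scores[w] for w in words])
    best = words[max(range(len(words)), key=probs.__getitem__)]
    return VERBALIZER[best], dict(zip(words, probs))

# Toy scores: 'so' wins, so the predicted sense is Contingency.
sense, probs = predict_sense({"so": 3.1, "however": 1.2, "and": 0.4, "then": -0.5})
```

The softmax is taken only over the restricted answer space, not the full PLM vocabulary, so mass assigned to non-answer words is ignored at classification time.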

Training and Prediction
In the training phase, we tune the PLM parameters based on the DRR, SSC, and ACP prompt tasks jointly to fuse their learned features. We compute a cross-entropy loss for the DRR loss $\mathcal{L}_d$, SSC loss $\mathcal{L}_s$, and ACP loss $\mathcal{L}_c$, respectively:
$$\mathcal{L} = -\sum_{k} y^{(k)} \log \hat{y}^{(k)} + \lambda \|\theta\|_2^2,$$
where $y^{(k)}$ and $\hat{y}^{(k)}$ are the answer label and predicted answer of the $k$-th training instance, respectively, and $\lambda$ and $\theta$ are the regularization hyper-parameter and model parameters. We use the AdamW optimizer (Loshchilov and Hutter, 2019) with L2 regularization for model training. The cost function of our TEPrompt is optimized as follows:
$$\mathcal{L} = \mathcal{L}_d + \beta \mathcal{L}_s + \gamma \mathcal{L}_c,$$
where $\beta$ and $\gamma$ are weight coefficients to balance the importance of the SSC loss and ACP loss.
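The weighted joint objective can be sketched as follows. The probabilities are toy values, and the L2 term is folded into AdamW's weight decay rather than computed explicitly here:

```python
import math

def cross_entropy(probs, gold_index):
    # Negative log-likelihood of the gold answer word for one instance.
    return -math.log(probs[gold_index])

def teprompt_loss(p_drr, y_drr, p_ssc, y_ssc, p_acp, y_acp,
                  beta=0.3, gamma=0.4):
    # Weighted sum of the main DRR loss and the two auxiliary losses.
    # beta/gamma match the paper's reported settings (0.3 and 0.4);
    # L2 regularization is handled by AdamW's weight decay and omitted.
    l_d = cross_entropy(p_drr, y_drr)
    l_s = cross_entropy(p_ssc, y_ssc)
    l_c = cross_entropy(p_acp, y_acp)
    return l_d + beta * l_s + gamma * l_c

# Toy answer-space probabilities for one instance; gold index 0 in each task.
loss = teprompt_loss([0.7, 0.2, 0.1], 0, [0.6, 0.4], 0, [0.5, 0.3, 0.2], 0)
```

Only the joint scalar loss is back-propagated, so gradients from the SSC and ACP heads flow into the shared PLM and the shared [Arg1]/[Arg2] token embeddings alongside the main DRR gradients.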

Experiment Setting
In this section, we present our experiment settings, including the dataset, PLMs, competitors, and parameter settings.
The PDTB 3.0 Dataset: Our experiments are conducted on the Penn Discourse TreeBank (PDTB) 3.0 corpus (Webber et al., 2019), which contains more than one million words of English texts from the Wall Street Journal. Following the conventional data splitting, we use sections 2-20 as the full training set, sections 21-22 as the testing set, and sections 0-1 as the development set (Ji and Eisenstein, 2015). Our experiments are conducted on the four top-level classes of relation sense: Comparison, Contingency, Expansion, and Temporal.
Pre-trained Language Models: We use two of the most representative masked pre-trained language models (PLMs) for comparison. BERT (Devlin et al., 2019) is the first Transformer-based large-scale PLM, proposed by Google, and is pre-trained using a cloze task and a next sentence prediction task. RoBERTa (Liu et al., 2019) is a BERT-enhanced PLM proposed by Facebook, which removes the next sentence prediction objective and is pre-trained on a much larger dataset with some modified key hyper-parameters.
Competitors: We compare our TEPrompt with the following advanced models: • DAGRN (Chen et al., 2016b) encodes word-pair interactions by a neural tensor network.
• PLR (Li et al., 2020) uses a penalty-based loss re-estimation to regulate the attention learning.
• BMGF (Liu et al., 2020) combines bilateral multi-perspective matching and global information fusion to learn a contextualized representation.
• MANF (Xiang et al., 2022a) encodes two kinds of attentive representation for arguments and fuses them with the word-pairs features.
• ConnPrompt (Xiang et al., 2022b) applies the prompt learning for IDRR based on the fusion of multi-prompt decisions.
Parameter Setting: We implement the 768-dimension PLM models provided by HuggingFace transformers (Wolf et al., 2020) and run the PyTorch framework with CUDA on NVIDIA GTX 3090 Ti GPUs. The maximum length of our TEPrompt template is set to 150 tokens, in which the maximum length of the arguments is 70 tokens. We set the mini-batch size to 32, the learning rate to 1e-5, and the weight coefficients β and γ to 0.3 and 0.4, respectively; all trainable parameters are randomly initialized from normal distributions. We release the code at: https://github.com/HustMinsLab/TEPrompt.

Overall Result
Table 3 compares the overall performance of our TEPrompt against the competitors. We implement four-way classification on the top-level relation senses of the PDTB dataset and adopt the commonly used macro F1 score and accuracy (Acc) as performance metrics.
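For reference, macro F1 averages the per-class F1 scores uniformly over the four senses, which can be computed as in this small sketch (the gold/predicted labels are toy values, not experiment outputs):

```python
def macro_f1(gold, pred, labels):
    # Per-class F1 averaged uniformly over classes, as used for four-way IDRR.
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)

SENSES = ["Comparison", "Contingency", "Expansion", "Temporal"]
gold = ["Expansion", "Contingency", "Expansion", "Temporal"]
pred = ["Expansion", "Contingency", "Comparison", "Temporal"]
score = macro_f1(gold, pred, SENSES)  # toy labels, not real results
```

Unlike accuracy, macro F1 weights the four senses equally, so performance on the rarer Comparison and Temporal classes counts as much as on the dominant Expansion class.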
We note that the competitors in the first group all use the pre-train and fine-tuning paradigm, while our TEPrompt and ConnPrompt use the pre-train, prompt, and predict paradigm, i.e., prompt learning. Besides, the first two competitors both use distributed, static word embeddings (Word2vec and GloVe), while the others use Transformer-based PLMs (BERT and RoBERTa).
The first observation is that DAGRN and NNMA cannot outperform the other competitors. This is not unexpected, as the others employ the more advanced dynamic PLMs pre-trained with deeper neural networks and larger-scale parameters, which have been proven more effective for many downstream NLP tasks (Devlin et al., 2019; Liu et al., 2019). The gap between large PLM fine-tuning and static embeddings for representation learning also has a certain impact on the performance of the IDRR task.
The second observation is that our TEPrompt and ConnPrompt, adopting the prompt learning paradigm, significantly outperform the other competitors. Finally, our TEPrompt achieves better performance than ConnPrompt with the same PLM and outperforms all the other models in terms of both macro F1 score and accuracy. Similar results can also be observed for the binary classification (i.e., one-versus-others) of implicit discourse relation recognition in Table 4. We attribute the outstanding performance of our TEPrompt to the use of auxiliary tasks for enlightenment prompt learning, by which the jointly trained features of the auxiliary SSC and ACP prompt tasks can be well fused into the main DRR task to improve the final answer prediction. This will be further analyzed in our ablation study.
Table 4: Comparison of binary classification results on the PDTB (F1 score %). We have reproduced some of the competitors on PDTB 3.0 for fair comparison.

Ablation Study
To examine the effectiveness of different prompt tasks, we design the following ablation studies.
• Prompt-SSC uses only the SSC prompt, concatenating argument-1 and argument-2 in front, without the DRR and ACP tasks.
• TEPrompt-SSC combines the SSC prompt with DRR and ACP, and only uses the predicted answer of SSC for relation sense mapping.
• Prompt-ACP uses only the ACP prompt, concatenating argument-1 and argument-2 in front, without the DRR and SSC tasks.
• TEPrompt-ACP combines the ACP prompt with DRR and SSC, and uses the predicted answer of ACP for relation sense mapping.
• Prompt-DRR is only the DRR prompt without the auxiliary prompt SSC and ACP.
• TEPrompt w/o Gate is our task enlightenment prompt model without the gate fusion mechanisms.
Table 5 compares the results of our ablation study models with both the single-prompt and multi-prompt ConnPrompt.
Task enlightenment prompt: We can observe that Prompt-DRR has comparable performance to each single ConnPrompt, viz., ConnPrompt-1/2/3. This is not unexpected. All three single ConnPrompts use the same connective-cloze prompt model, and the only difference is the location of the cloze mask in each template, while Prompt-DRR uses the same connective-cloze prompt model and answer space as a single ConnPrompt. ConnPrompt-Multi uses multi-prompt majority voting and outperforms any of the single ConnPrompts, while TEPrompt designs two auxiliary tasks to augment the main task and outperforms both Prompt-DRR and ConnPrompt-Multi, which validates the effectiveness of our task enlightenment prompt learning via fusing features from both main and auxiliary prompt tasks by joint training.
Prompt ablation study: Among the second group of prompt ablation models, it can be observed that Prompt-SSC and Prompt-ACP cannot outperform Prompt-DRR, while TEPrompt-SSC and TEPrompt-ACP also cannot outperform TEPrompt. Although both the SSC and ACP prompt models can each output the final prediction by mapping their predicted answers to a relation sense, their objectives are not completely in accordance with the IDRR task: the SSC prompt is designed to classify sense semantics, while the ACP prompt aims at predicting manually annotated connectives. Furthermore, we can also observe that TEPrompt-SSC and TEPrompt-ACP achieve better performance than Prompt-SSC and Prompt-ACP, respectively. This again validates our argument that fusing features from jointly trained auxiliary prompt tasks can be useful to boost the main prompt task prediction.
Gate fusion mechanism: We also observe that the TEPrompt w/o Gate model cannot outperform the full TEPrompt model, even though it jointly trains a PLM as well as the MLM head with two auxiliary tasks. This indicates that the features learned from auxiliary tasks can indeed augment the main task prediction.
Auxiliary prompt effects: To further investigate the task enlightenment effects, we design several combinations of individual prompt models: DRR with only the main task, DRR+SSC and DRR+ACP with the main task enlightened by only one auxiliary task, and DRR+SSC+ACP (viz., TEPrompt) with the main task enlightened by both auxiliary tasks.
Fig. 4 compares the performance of the different auxiliary prompt ablation models. We can observe that both the SSC and ACP auxiliary tasks help improve the performance of the main DRR task. This suggests that fusing either the sense semantics feature from training SSC or the annotated connective feature from training ACP (viz., the two [CLS] tokens) can help promote the decision feature of the main DRR task (viz., the [MASK] token) to improve the IDRR prediction. Finally, our TEPrompt, jointly trained with both SSC and ACP auxiliary prompts, yields substantial improvements over all ablation models, again validating our arguments and design objectives.

Case Study
We use a case study to compare TEPrompt and the DRR prompt. Note that the DRR prompt can be regarded as ConnPrompt using only one template, without multi-prompt ensembling. Fig. 5 visualizes the representation of the [MASK] token, as well as its prediction probability and classified relation sense, by a pie chart. The [MASK] token representation of TEPrompt is quite different from that of the DRR prompt, as the former also fuses the two auxiliary prompt task features. Such feature fusion from the auxiliary tasks may enlighten the main task to make correct predictions.
It can be observed that the DRR prompt by itself tends to predict a Comparison relation (64.76%), corresponding to the answer word 'however' with the highest probability of 35.99%. After feature fusion, TEPrompt correctly recognizes the Contingency relation (83.59%) between the two arguments by predicting the answer word 'so' with a much higher probability (75.43%) than that of the DRR prompt prediction (10.60%). We argue that such benefits from the adjustment of prediction probabilities can be attributed to the feature fusion of the two auxiliary prompt tasks.

Concluding Remarks
In this paper, we have argued that a main prompt task can be enlightened by auxiliary prompt tasks for performance improvements. For the IDRR task, we have proposed TEPrompt, a task enlightenment prompt model that fuses learned features from our designed auxiliary SSC and ACP tasks into the decision features of the main DRR task. Since the three prompt tasks are trained jointly, the auxiliary task features learned in the training phase can help promote the main task decision feature and improve the final relation prediction in the testing phase. Experiment results and ablation studies have validated the effectiveness of our arguments and design objectives in terms of improved state-of-the-art IDRR performance.
In our future work, we shall investigate other types of auxiliary tasks for the IDRR task as well as the applicability of such task enlightenment prompt learning for other NLP tasks.

Limitations
The two auxiliary prompt tasks are closely related to the PDTB corpus, as the top-level relation sense labels are plain vocabulary words and the PDTB provides manually annotated connectives. We have purchased the PDTB 3.0 corpus with license for our experiments; the corpus is cited in the Introduction and in the Experiment Setting section.

Figure 1 :
Figure 1: An example of implicit discourse relation annotation with manually inserted connective.

Figure 2 :
Figure 2: Illustration of our TEPrompt framework.It contains three modules of the prompt templatize, answer prediction and verbalizer for the main prompt task (DRR) and two auxiliary prompt tasks (SSC and ACP).

Figure 3 :
Figure 3: Illustration of our TEPrompt template, which is a concatenation of the three task templates.

Figure 4 :
Figure 4: Comparison of auxiliary prompt effects.

Table 1 :
Answer space of the DRR prompt and the connection to the top-level class discourse relation sense labels in the PDTB corpus.

Table 2 :
Statistics of implicit discourse relation instances in PDTB 3.0 with four top-level relation senses.

Table 3 :
Comparison of overall results on the PDTB.

Table 5 :
Results of ablation study on the PDTB corpus.