Enhancing Reasoning Capabilities by Instruction Learning and Chain-of-Thoughts for Implicit Discourse Relation Recognition



Introduction
Discourse relation recognition refers to identifying the sense of the relation between two arguments. The task is categorized into two types, explicit discourse relation recognition (EDRR) and implicit discourse relation recognition (IDRR), depending on whether explicit connectives, such as "because" and "but", are present between the argument pair. Our work investigates the potential of generative models and natural language generation for improving the performance of IDRR.
Recognizing implicit discourse relations involves comprehending and examining the semantic connections between argument pairs. Previous works have commonly employed semantic encoding to enhance the model's classification accuracy (Liu et al., 2020; Dou et al., 2021; Xiang et al., 2022a). Generative models have also been utilized for IDRR, for example as an auxiliary generation task (Jiang et al., 2021) or in a limited form restricted to prompt learning (Zhou et al., 2022; Xiang et al., 2022b). With the emergence of large language models, there is growing interest in using generative models rather than encoder-only models for NLP applications. However, some studies suggest that generic generative models do not perform as well as fine-tuned, relatively small encoder-only models on NLU tasks (Qin et al., 2023). Our experiments also reveal that employing generative models to directly generate the relation sense is an ineffective approach for IDRR. This work investigates how simple yet effective methods (IICOT) can unleash the inference capabilities of generative models. Figure 1 illustrates our approach of utilizing a thinking process to guide the model's output (COT). Specifically, we do not allow the model to output the relation sense directly; we compel the model to first identify whether the argument pair pertains to implicit or explicit relation data. This step reduces the unwanted noise introduced by explicit data. Next, the model identifies a reasonable conjunction between the argument pair and bases its final inference on this analysis. To optimize the model's performance, we formulate a prefix prompt in the form of instructions (I) for better guidance. By fine-tuning on the instructions, we enhance the model's ability to learn and understand the task definition. Additionally, we employ in-context learning (Min et al., 2022) (I) to provide additional examples that aid the model's comprehension of the prompt.

[Figure 1: The prompt format. Instruction: "The task is to determine the conjunction that connects two given text fragments and identify whether they have a temporal, comparative, contingency, or extensional relationship. This analysis should consider both implicit and explicit relation sense. The expected output format is: 'type-conjunction-relationship'." Input: Text fragment1: "<Arg1>" Text fragment2: "<Arg2>" Output: <Rel>-<Conn>-<Label>]
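As an illustration, the prompt format above can be assembled with a small helper. The function name is our own sketch, not the paper's released code:

```python
# Sketch of the prompt format described above (helper names are ours,
# not from the paper's released code).

INSTRUCTION = (
    "The task is to determine the conjunction that connects two given text "
    "fragments and identify whether they have a temporal, comparative, "
    "contingency, or extensional relationship. This analysis should consider "
    "both implicit and explicit relation sense. The expected output format "
    'is: "type-conjunction-relationship".'
)

def build_prompt(arg1: str, arg2: str) -> str:
    """Combine the instruction with one argument pair in the paper's format."""
    return (
        f"{INSTRUCTION}\n"
        f'Input: Text fragment1: "{arg1}" Text fragment2: "{arg2}" Output:'
    )

prompt = build_prompt(
    "It was a great party.",
    "Everyone left early.",
)
```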
Contributions: Our work makes the following contributions. (a) We apply generative methods to the IDRR task and explore methods for improving the inference power of generative models. (b) We investigate the impact of instruction learning, in-context learning, and Chain-of-Thoughts (COT) on the performance of generative models, and through this exploration identify the causes and effects of these learning methods. (c) We achieve state-of-the-art performance on all three datasets, indicating the effectiveness of our approach.

Instruction Learning
The primary objective of instruction fine-tuning is to enhance a language model's capacity to respond to natural language instructions. The method uses supervised signals to teach language models to perform the tasks described in instructions; as a result, the models learn to follow instructions and respond to tasks of the same kind. To apply this approach, we devise instruction fine-tuning templates, as illustrated in Figure 2. The templates provide a comprehensive task definition for the model, enabling a deeper understanding of the task at hand. They use natural language to guide the model's thought process and restrict the format of the model's output to facilitate subsequent evaluation.

In-context Learning
In-context learning enables a language model to grasp a task and produce answers to queries based on given illustrative examples. Essentially, it treats a proficient language model as an estimator of a conditional probability distribution, conditioned on the provided context. In our research we have found that providing the model with a small number of instances during training improves its adherence to the output format and facilitates more effective convergence. In our experiments, we prepare one example for each of the four relation senses, as illustrated in Appendix A.
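The way demonstrations are supplied can be sketched as follows; the demonstration strings below are invented placeholders (one per relation sense), not the actual examples from Appendix A:

```python
# One placeholder demonstration per relation sense; the real examples
# appear in Appendix A of the paper.
DEMONSTRATIONS = [
    ('Text fragment1: "I missed the bus" Text fragment2: "I was late"',
     "Implicit-so-Contingency"),
    ('Text fragment1: "He is tall" Text fragment2: "she is short"',
     "Implicit-but-Comparison"),
    ('Text fragment1: "She spoke" Text fragment2: "then she left"',
     "Explicit-then-Temporal"),
    ('Text fragment1: "We ate" Text fragment2: "we also drank"',
     "Implicit-also-Expansion"),
]

def with_in_context(prompt: str) -> str:
    """Prepend one worked example per relation sense to the query prompt."""
    shots = "\n".join(f"Input: {x} Output: {y}" for x, y in DEMONSTRATIONS)
    return f"{shots}\n{prompt}"
```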

Chain-of-Thoughts
While regular training methods require models to tackle complex problems in a single step, people prefer an incremental approach, breaking problems down into smaller components to facilitate complex reasoning. This inclination towards incremental thinking enables more nuanced and effective problem-solving. Our approach presents a simple yet effective prompting method that mimics this thinking process in the form of natural language prompts, as shown in Figure 2. Rather than providing a categorical answer directly, the model first considers whether the relationship is explicit or implicit. It then identifies an appropriate connective between the pair of arguments before finally providing the answer.
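In training-target terms, COT simply prepends the two intermediate judgments to the label; a minimal sketch (the helper names are ours):

```python
def direct_target(label: str) -> str:
    """Standard generation target: the relation sense alone."""
    return label

def cot_target(rel: str, conn: str, label: str) -> str:
    """COT target: the explicit/implicit judgment and the connective come
    first, so each autoregressive step conditions the next."""
    return f"{rel}-{conn}-{label}"
```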
In the autoregressive generative mode, each token output by the model is influenced by its predecessors, creating a natural progression of thought. The reasoning processes under the standard generation prompt and with COT are depicted in Figure 1; the latter requires the model to produce a COT before giving a response. By incorporating COT into the prompting strategy, the performance of the model improves. All the specific prompt templates we use are given in Appendix B.

Experiment

Experiment Settings
This study conducts experiments on three benchmark datasets: PDTB 2.0 (Prasad et al., 2008), PDTB 3.0 (Prasad et al., 2019), and CoNLL16 (Xue et al., 2016). Notably, the CoNLL16 dataset lacks manually annotated connectives in 450 training instances. To address this, we use the gpt-3.5-turbo model to predict the connectives and treat them as the ground truth. Detailed statistics for each dataset and the hyperparameter settings can be found in Appendix C.
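A sketch of how such connective annotation could be requested from gpt-3.5-turbo; the message wording and helper name are our assumption (the paper does not release this script), and the actual API call is left commented out so the sketch stays self-contained:

```python
# Hypothetical connective-annotation request for gpt-3.5-turbo; the exact
# prompt used in the paper is not specified, so this wording is an assumption.

def build_annotation_messages(arg1: str, arg2: str) -> list:
    """Chat messages asking the model for a single connective word."""
    return [
        {"role": "system",
         "content": "Reply with the single conjunction that best connects "
                    "the two text fragments."},
        {"role": "user",
         "content": f'Fragment 1: "{arg1}"\nFragment 2: "{arg2}"'},
    ]

# With the openai client installed, the call would look roughly like:
#   from openai import OpenAI
#   client = OpenAI()
#   resp = client.chat.completions.create(
#       model="gpt-3.5-turbo",
#       messages=build_annotation_messages(arg1, arg2),
#   )
#   connective = resp.choices[0].message.content.strip()
```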
To ensure reproducibility, we will make all of our source code publicly available. Our backbone is flan-T5, a model specifically fine-tuned for a broad range of natural language processing tasks, which makes it well suited to instruction learning. Remarkably, our approach outperforms current state-of-the-art methods on all datasets. Specifically, we achieve an impressive 4.62% increase in the Contingency category on PDTB 2.0, while realizing significant improvements in all categories on PDTB 3.0.

Ablation Experiments
Table 3 presents the findings of our ablation experiments, where Fine-tuning denotes that the input is only the argument pair and the output is the relation sense directly; Instruction indicates that the task definition directs the model to output labels; ICL refers to in-context learning; Conn refers to predicting connectives; and Rel denotes predicting whether the data is explicit or implicit.
Our ablation experiments lead to four key conclusions. First, using Instruction and ICL improves the model's performance compared with directly outputting results: providing a certain number of examples enhances the model's understanding of the task. Second, appending Rel or Conn to ICL, so that the chain is predicted before the label, further improves the model's reasoning ability. Third, even without ICL, adding both Rel and Conn leads to a performance improvement, indicating that the COT approach is highly effective. Fourth, the optimal result is reached by combining all of the approaches above, demonstrating their individual validity and mutual reinforcement.

Explicit Data
Our experiments on the CoNLL16 task reveal that incorporating both explicit and implicit judgments into the COT further enhances model performance. Accordingly, we add explicit data to the PDTB dataset to examine whether performance improves. Our experimental findings (Appendix D) demonstrate that performance improves upon the inclusion of a limited amount of explicit data. However, with increasing amounts of explicit data, performance deteriorates substantially. This is because the distribution of explicit data differs from that of implicit data, and introducing more explicit data amplifies the noise. Figures 4 and 5 show that as explicit data increases, the model's accuracy in distinguishing explicit from implicit data decreases. Nonetheless, the model maintains a high accuracy of 96.83% and exhibits optimal performance at the 20% threshold of explicit data. These results suggest that the model remains relatively unaffected by the noise and that successful data augmentation has been achieved.
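The augmentation schedule can be sketched as follows; the 20% ratio is the best-performing threshold reported above, while the function name and data shapes are our own illustration:

```python
import random

def mix_explicit(implicit: list, explicit: list, ratio: float, seed: int = 0) -> list:
    """Add explicit instances amounting to `ratio` of the implicit set size."""
    k = min(len(explicit), int(len(implicit) * ratio))
    rng = random.Random(seed)
    mixed = implicit + rng.sample(explicit, k)
    rng.shuffle(mixed)
    return mixed

# e.g. augment with explicit data equal to 20% of the implicit training set
train = mix_explicit(["imp"] * 100, ["exp"] * 500, ratio=0.2)
```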

How Chain-of-Thoughts works
Denoising. It is believed that the model acquires a denoising ability while generating the COT. To demonstrate the denoising process, we analyze the representations in the model's output vectors, which gives us insight into the denoising mechanism.
The t-SNE method is frequently preferred for visualizing high-dimensional data, as it effectively presents local relationships and clustering structures; in particular, t-SNE is adept at capturing similarities and differences within the data.
Mitigating overfitting. It is our contention that the efficacy of COT also lies in its ability to alleviate model overfitting. Figure 6 presents the training loss, the loss on the development (dev) set during training, and changes in dev set performance for models trained both with and without COT. The figure shows that training without COT leads to a further drop in training loss, but also to a shift in dev set loss from low to high, along with a trend of decreasing performance. These phenomena suggest that the model overfits the training data.
COT enhances the informational content of model outputs. Although it increases the complexity of the task, this additional information effectively guides the model in the desired direction, thereby mitigating the risk of overfitting.

Conclusion
We aim to enhance the reasoning capabilities of generative models in IDRR by employing a generation task framework and incorporating instruction learning, in-context learning, and COT. Our approach achieves a notable improvement over the baseline model, leading to state-of-the-art performance on three benchmark datasets.
In our future research, we plan to further investigate the utilization of generative models and even large language models.Specifically, we aim to explore the efficacy of larger models, including the implementation of knowledge distillation to transfer knowledge from large models to smaller ones.

Rel & Conn & Label:
Instruction: The task is to determine the conjunction that connects two given text fragments and identify whether they have a temporal, comparative, contingency, or extensional relationship. This analysis should consider both implicit and explicit relation sense. The expected output format is: "type-conjunction-relationship".
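At evaluation time the "type-conjunction-relationship" string must be split back into its three fields; a minimal parser sketch (ours, not the paper's released code):

```python
from typing import Optional, Tuple

def parse_output(text: str) -> Optional[Tuple[str, str, str]]:
    """Split a 'type-conjunction-relationship' string into its three fields.

    Returns None when the model breaks the format, so malformed outputs
    can be counted as errors instead of crashing evaluation.
    """
    parts = text.strip().split("-", 2)
    if len(parts) != 3 or not all(parts):
        return None
    rel, conn, label = (p.strip() for p in parts)
    return rel, conn, label
```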

C Dataset statistics and the hyperparameters
Table 4 presents the statistical information for each dataset. In our evaluation, we exclusively employ implicit data from the PDTB 2.0 and PDTB 3.0 datasets. Conversely, for the CoNLL16 dataset, we utilize all available data in accordance with the official partitioning strategy.
During training, we employ a batch size of 16 and set the learning rate to 5e-5. We train for 5 epochs with the AdamW optimizer, using its default parameter settings. Moreover, we incorporate both warmup and linear learning-rate decay, with a warmup ratio of 0.1.
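The warmup plus linear-decay schedule described above can be sketched as a multiplier on the base learning rate; this mirrors the usual linear-schedule-with-warmup formulation, and the function is our illustration rather than the training script:

```python
def lr_multiplier(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to 1.0, then linear decay to 0.0 over the remaining steps."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# base_lr = 5e-5, as in our settings
lrs = [5e-5 * lr_multiplier(s, total_steps=100) for s in range(100)]
```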

D Add Explicit Data
To examine the impact of explicit data on model performance, we gradually increase the amount of explicit data in the training sets of PDTB 2.0 and PDTB 3.0. The experimental results are shown in Figures 4 and 5.

E Observation of Overfitting
Overfitting can be readily observed by analyzing the training loss, evaluation-set loss, and performance variation of the model trained without COT. In contrast, including COT leads to improved performance. Note that the training steps differ between the two experiments due to the early-stopping strategy.
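The early-stopping strategy mentioned above can be sketched as follows; the patience value and class name are our assumptions, since the paper does not specify them:

```python
class EarlyStopper:
    """Stop training when the dev metric has not improved for `patience` evals."""

    def __init__(self, patience: int = 3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, dev_metric: float) -> bool:
        if dev_metric > self.best:
            self.best = dev_metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```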

Figure 2 :
Figure 2: The full prompt template of our model.

Figure 3 :
Figure 3: The noise reduction effect of COT is demonstrated through the distribution.
Figure 3-(b) illustrates the model that incorporates Rel within the COT framework, in contrast to the COT model shown in Figure 3-(a), which does not feature this judgment. A generative model such as flan-T5 lacks a dedicated [CLS] token, unlike BERT. Consequently, we use the encoding vector of the [BOS] token to represent the sentence and apply t-SNE dimensionality reduction to examine the distribution pattern. The COT judgment effectively reduces noise and facilitates the acquisition of semantic knowledge pertaining to explicit data: the different types of data are well separated once the COT judgment is included.
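A sketch of the visualization step, assuming the [BOS] encoder states have already been collected into a matrix; the array shapes and the scikit-learn usage are our own illustration (random stand-in vectors replace the real encodings):

```python
import numpy as np
from sklearn.manifold import TSNE

# Suppose `bos_vectors` holds one [BOS] encoding per instance, e.g. (n, 768).
# Random stand-in data here; in practice these come from the encoder.
rng = np.random.default_rng(0)
bos_vectors = rng.normal(size=(200, 768)).astype(np.float32)

# Project to 2-D for plotting; perplexity must stay below the sample count.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(bos_vectors)
```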

Figure 6 :
Figure 6: The experimental records encompass training loss, evaluation set loss, and performance metrics.

Table 1 :
Main experiments on the 4-class classification in Macro-F1 and accuracy. The highest reported results of previous works are underlined.

Table 2 :
The binary classification performance on PDTB 2.0 and PDTB 3.0 benchmark datasets.

Table 3 :
We conduct ablation experiments on the latest PDTB 3.0 dataset for IDRR, evaluating the F1 and Macro-F1 metrics in the binary and 4-way scenarios.

Table 4 :
Descriptive statistics of implicit discourse relation instances are reported for the datasets.