Not Just Classification: Recognizing Implicit Discourse Relation on Joint Modeling of Classification and Generation

Implicit discourse relation recognition (IDRR) is a critical task in discourse analysis. Previous studies regard it only as a classification task and lack an in-depth understanding of the semantics of different relations. Therefore, we first view IDRR as a generation task and further propose a method that jointly models classification and generation. Specifically, we propose a joint model, CG-T5, that recognizes the relation label and generates a target sentence containing the meaning of the relation simultaneously. Furthermore, we design three target sentence forms, including a question form, to let the generation model incorporate prior knowledge. To address the issue that large discourse units can hardly be embedded into the target sentence, we also propose a target sentence construction mechanism that automatically extracts core sentences from those large discourse units. Experimental results on both the Chinese MCDTB and English PDTB datasets show that our model CG-T5 achieves the best performance against several state-of-the-art systems.


Introduction
Discourse relations describe the logical connection between two discourse units (e.g., clauses, sentences, or paragraphs). As an essential discourse analysis task, discourse relation recognition aims to recover the rhetorical relation that holds between discourse units (DUs). Due to the absence of explicit connectives, implicit discourse relation recognition (IDRR) remains a challenging task and a research hotspot. Moreover, IDRR benefits many downstream natural language processing (NLP) applications, such as machine translation (Webber et al., 2017), text generation (Bosselut et al., 2018), and text summarization.
With the success of representation learning in discourse analysis, most existing IDRR methods focus on three aspects: enhancing discourse unit representation (Ji and Eisenstein, 2015; Qin et al., 2016; Liu and Li, 2016), enhancing semantic interaction (Guo et al., 2018; Ruan et al., 2020; Guo et al., 2020), and joint learning with other tasks (Bai and Zhao, 2018; Nguyen et al., 2019; He et al., 2020). They all regard IDRR as a classification task and lack a deeper understanding of the relation semantics; even recent work with label embedding (Nguyen et al., 2019; He et al., 2020) cannot directly introduce prior knowledge of discourse relation semantics into the model.
DU1: Ningbo Free Trade Zone ... has achieved fruitful results after three years of construction. ... the development level is among the best...
DU2: ... the Ningbo Free Trade Zone had completed a total of US$812 million in import and ... At the same time, the bonded zone has ...
Relation: Elaboration
Target Sentence: DU2 is a detailed description of DU1.

Table 1: An example of an implicit discourse relation between two DUs, where DU1 and DU2 are paragraphs that contain several sentences. There is no explicit hint of the relation in DU1 or DU2. The full sentences of these two DUs are shown in Appendix A.
In the implicit discourse relation annotation stage, annotators usually not only give the relation type but also provide a description of or basis for the relation. Therefore, we hope the model, like a human annotator, gives a target sentence instead of a simple label index so as to understand the relation more deeply. The target sentence should describe, in natural language, the core information of the two DUs and their relation. As the example in Table 1 shows, the Elaboration relation can be transformed by definition into the target sentence "DU2 is a detailed description of DU1". Through such a learning goal, the model can learn the semantics of the Elaboration relation more explicitly.
The Question-Answering (QA) method can incorporate prior knowledge into a model by using generation instead of classification. It has achieved success in several fine-grained tasks, such as named entity recognition and coreference resolution. However, it is challenging to directly apply the traditional QA method to IDRR due to the following two issues. First, unlike the above fine-grained tasks, where the answers or clues exist in the input context, the discourse relation in IDRR is implicit between the two DUs and does not appear explicitly in the context, so an IDRR model cannot extract the answer from the input directly. Second, since a DU is usually large and contains several sentences, a target sentence that contains the two DUs, as in Table 1, is too long to encode. Therefore, it is essential to extract the core information of each DU in a short form before embedding it into the target sentence.
Besides, the classification model and the generation model have complementary advantages. The former usually performs better on major classes because it searches in a limited space, while the latter can introduce prior knowledge to better capture the semantics of minor classes, and its output is a natural language expression with stronger interpretability. Therefore, how to combine the advantages of the classification model and the generation model is another challenge.
Different from previous work, we are the first to regard IDRR as a text generation task, and we design three forms of the target sentence to represent prior knowledge. In particular, inspired by the annotation work of Pyatkin et al. (2020), we use questions instead of answers to describe the discourse relation between two DUs as the target sentence. Our model can therefore understand the discourse relation more deeply by generating a target sentence that describes the relation's meaning instead of an index of the relation type. Moreover, we design a method to automatically extract the core information from large DUs via semantic role labeling and then compress each DU into a short form.
To address the second challenge, we propose the CG-T5 model, which combines a classification model and a generation model to leverage their complementary advantages. Specifically, inspired by pre-trained models (e.g., BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019)), we first extract the hidden states of the T5 (Raffel et al., 2020) encoder and feed them to a span representation layer for the classification task, and then use the T5 decoder for the generation task. Finally, we combine the two models with a joint learning mechanism. Our CG-T5 integrates the advantages of two different models: the classification model constrains the generation model, while the generation model explains and supports the classification model. In summary, the main contributions of this paper are fourfold:
• We regard IDRR as a generation task to generate a target sentence containing the meaning of the relation, which introduces prior knowledge to understand discourse relations more deeply.
• We propose a joint learning model CG-T5 to integrate the classification task and generation task.
• We design three forms of the target sentence, including the question form, and propose a construction method to extract the core information from the large DUs automatically.
• The experimental results on both the MCDTB and PDTB datasets show that our CG-T5 outperforms the SOTA baselines.

Related Work
We first briefly introduce related discourse corpora, then summarize the existing methods for IDRR, and finally introduce the success of the question-answering method in fine-grained tasks.

Related Discourse Corpora
In English, one of the most popular discourse corpora is the Penn Discourse TreeBank (PDTB) (Prasad et al., 2008). It annotates about 2.3K Wall Street Journal articles with three-level discourse relations (4 classes, 16 types, and 23 sub-types), including 18.4K explicit relations and 16K implicit relations. In Chinese, two popular discourse corpora are the Chinese Discourse TreeBank (CDTB) (Li et al., 2014) and the Macro Chinese Discourse TreeBank (MCDTB) (Jiang et al., 2018). CDTB contains 500 articles with two-level discourse relations (4 classes and 17 types) between clauses and sentences; following the PDTB-style annotation, it annotates both explicit and implicit discourse relations. MCDTB annotates 720 articles from Xinhua News with relations of 3 classes and 15 types between paragraphs. Since there are few connectives between paragraphs, all discourse relations in MCDTB are annotated as implicit.

Implicit Discourse Relation Recognition
Most previous studies on IDRR can be divided into the following three categories: enhancing DU representation, enhancing the interaction between DUs, and joint learning of IDRR and other tasks.
In English, early work explored methods of enhancing DU representation via shallow convolutional neural networks (Zhang et al., 2015), recursive neural networks (Ji and Eisenstein, 2015), collaborative gated neural networks (Qin et al., 2016), or attention mechanisms (Liu and Li, 2016). To enhance the interaction between DUs, Guo et al. (2018) and Ruan et al. (2020) proposed various interactive attention mechanisms for IDRR, and Guo et al. (2020) proposed a knowledge-enhanced attention neural network that introduces external knowledge to enhance the interaction. Besides, a few studies combined IDRR with other tasks for joint learning, e.g., explicit relation recognition (Lan et al., 2017), connective prediction (Bai and Zhao, 2018; Shi and Demberg, 2019), and label embedding learning (Nguyen et al., 2019; He et al., 2020).

Formalizing Fine-grained Tasks as QA
A few fine-grained tasks can be formalized as QA and have achieved success by introducing prior knowledge, such as relation extraction, named entity recognition, and coreference resolution. It is worth noting that the above studies all take questions as the input and extract answers from the context as the output.

IDRR as Text Generation
To understand the semantics of discourse relations more deeply, we regard IDRR as a text generation task, as shown in Figure 1. Unlike previous work, we use a generation model instead of a classification model to generate a target sentence representing the semantics of the discourse relation. Then, we obtain the relation type by mapping the generated sentence to the corresponding discourse relation. In this section, we mainly describe our solution to the first challenge, obtaining the target sentence: its different forms and the corresponding construction method.
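The mapping from a generated sentence back to a relation type can be sketched as a simple similarity match against the templates. This is an illustrative sketch only: the template wordings and the `map_to_relation` helper are placeholders, not the paper's exact templates or procedure.

```python
# Sketch of mapping a generated target sentence back to a relation label.
# The templates below are illustrative placeholders, not the paper's exact wording.
TEMPLATES = {
    "Elaboration": "DU2 is a detailed description of DU1.",
    "Causality": "DU2 states the result of DU1.",
    "Coordination": "DU2 and DU1 describe parallel facts.",
}

def map_to_relation(generated: str) -> str:
    """Return the relation whose template overlaps most with the generation."""
    def overlap(a: str, b: str) -> float:
        # Jaccard similarity over lowercased token sets.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)
    return max(TEMPLATES, key=lambda r: overlap(generated, TEMPLATES[r]))

print(map_to_relation("DU2 is a detailed description of DU1."))  # Elaboration
```

In practice an exact or near-exact match usually suffices, since the decoder is trained to reproduce the templates; the soft match above merely guards against minor generation variations.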

Forms of Target Sentence
Unlike other fine-grained tasks that can extract the target sentence (answer) from the context, we cannot directly transfer their methods to IDRR without manual annotation. In IDRR, the relation label is implicit between the two DUs, and there is no hint of it in the context. To alleviate this issue, we design the following three forms of the target sentence, which map the relation sense to templates according to the relation definition: the name, explanation, and question of the relation.

Relation Name As an intuitive choice, we use the relation name (Name) as the target sentence, as shown in Table 9. The relation name introduces prior knowledge by itself, and there is no need to extract external information from the context.
Relation Explanation Furthermore, we believe that using only the relation name is not enough, so we design the target sentence as an explanation of the relation, as shown in Table 9. This form has two variants: the explanation before the relation name (Exp-Rel) and after the relation name (Rel-Exp).
Although this form covers the relation semantics more comprehensively, it requires extracting the core information (CI1 and CI2) from the two DUs to form the target sentence.
Relation Question Inspired by intra-sentence discourse relation annotation (Pyatkin et al., 2020), we use a question instead of a declarative sentence as the target sentence to capture the discourse relation better. On the one hand, this form extracts core information (CI1 or CI2) from only one DU, which can reduce cascading errors. On the other hand, a question sentence integrating prior knowledge can connect the semantics of the two DUs more naturally. This form also has two variants: the question is guided by the core information of the first DU (Q1) or by that of the second DU (Q2), as shown in Table 9.
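The five target-sentence variants can be illustrated as template instantiations. The wordings below are hypothetical stand-ins for the templates in Table 9, shown only to make the Name / Rel-Exp / Exp-Rel / Q1 / Q2 distinction concrete.

```python
# Illustrative construction of the target-sentence forms for an Elaboration
# relation; the exact template wording used in the paper (Table 9) may differ.
def build_targets(relation: str, ci1: str, ci2: str) -> dict:
    return {
        # name only: no core information needed
        "Name": relation,
        # explanation after / before the relation name, using both CIs
        "Rel-Exp": f"{relation}: {ci2} is a detailed description of {ci1}.",
        "Exp-Rel": f"{ci2} is a detailed description of {ci1}, so the relation is {relation}.",
        # question guided by the core information of one DU only
        "Q1": f"What does the second unit add about {ci1}?",
        "Q2": f"Which unit does {ci2} elaborate on?",
    }

targets = build_targets("Elaboration",
                        "the zone's achievements",
                        "the trade statistics")
print(targets["Name"])  # Elaboration
```

Note how Q1 and Q2 each require core information from only one DU, while Rel-Exp and Exp-Rel need both, which is the source of the cascading-error difference discussed above.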

Constructing Target Sentence
Although we build three forms of the target sentence, the latter two forms need to integrate the core information of the DU into the template. However, it is challenging to extract a DU's core information without manual annotation. A DU usually contains many tokens, and directly inserting them into the target sentence would make it too long to represent the semantics of the corresponding relation. Therefore, we design a core information extraction method that compresses a DU into a short form in three steps: extracting, filtering, and selecting. We first extract all candidate tuples from the given DU with a Semantic Role Labeling (SRL) tool 2 . Then we use the following three rules to filter out redundant tuples: (1) Streamlining core semantics. We remove unimportant elements other than arguments and predicates from the extracted candidate tuples. (2) Ensuring semantic integrity. We remove semantically incomplete tuples (i.e., tuples that do not contain both A0 and A1 in SRL) from the candidates.
(3) Reducing semantic overlap. We remove small tuples contained in larger candidate tuples due to semantic overlap. Finally, considering that important information usually appears at the front of a DU, we extract the first tuple containing complete semantics and reproduce its elements in the original order as the CI. In particular, to ensure a "Subject-Verb-Object" form, we place the predicate in the second position.
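The filter-and-select steps above can be sketched as follows. This is a minimal sketch under simplifying assumptions: SRL tuples are mocked as plain dictionaries (in practice they would come from an SRL tool such as LTP or AllenNLP), and rule (1) is assumed to have already stripped non-argument elements.

```python
# Sketch of the filtering and selecting steps of core-information extraction.
def filter_tuples(tuples):
    # Rule (2): keep only semantically complete tuples (both A0 and A1 present).
    complete = [t for t in tuples if "A0" in t and "A1" in t]
    # Rule (3): drop tuples whose token span is strictly contained in a
    # larger tuple's span (semantic overlap).
    def span(t):
        return set(t["A0"].split()) | set(t["V"].split()) | set(t["A1"].split())
    kept = []
    for t in complete:
        if not any(span(t) < span(u) for u in complete if u is not t):
            kept.append(t)
    return kept

def select_ci(tuples):
    # Select the first surviving tuple and linearize it as Subject-Verb-Object
    # (predicate in the second position).
    kept = filter_tuples(tuples)
    if not kept:
        return ""
    t = kept[0]
    return f"{t['A0']} {t['V']} {t['A1']}"

tuples = [
    {"A0": "Ningbo Free Trade Zone", "V": "achieved", "A1": "fruitful results"},
    {"V": "established"},  # incomplete: removed by rule (2)
]
print(select_ci(tuples))  # Ningbo Free Trade Zone achieved fruitful results
```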
The Ningbo Free Trade Zone, with a total area of 2.3 square kilometers, has achieved fruitful results after three years of construction. Ningbo Free Trade Zone is one of the 13 free trade zones in China. It was established with the approval of the State Council in 1992. At present, the various functions of the bonded area have begun to take shape, and the level of development is among the best in China's bonded areas.
[Figure 2 illustrates the core information extraction process for this DU; the extracted candidate fragments include "The Ningbo Free Trade Zone, with a total area of 2.3 square kilometers, has achieved fruitful results".]

We use the example in Figure 2 to illustrate the process. The DU contains four sentences, and we extract all candidate SRL tuples from it. Then we filter them with the three proposed rules to obtain two simplified tuples. Finally, we select the sentence that contains the first tuple as the CI. Besides, if there is more than one tuple in a sentence, we combine them into one CI with a comma.

CG-T5 Model
To integrate the advantages of the classification model and the generation model in IDRR, we propose the Classification and Generation T5 (CG-T5) model, which recognizes the relation class and generates the target sentence simultaneously, as shown in the figure. CG-T5 comprises three parts: the classification module based on the encoder, the generation module based on the decoder, and the joint learning module. Therefore, given two DUs as input, CG-T5 produces two outputs: the class label from the classification module and the target sentence from the generation module.

Classification Module
In the encoding layer, consistent with the input of a traditional classification model such as BERT, we first encode the two discourse units (DU1 and DU2) into a single string as follows.
Then, we send the input (S) to the Encoder Stack (i.e., the encoder of T5) to obtain the encoder hidden states (Hidden_E) as follows.
Since T5, unlike BERT, does not use the state at the [CLS] position in its pre-training tasks, we use the Endpoint 4 representation of the span, which combines the hidden states of the head (H_0) and tail (H_-1), to represent the two DUs, as shown in Equation 3. Then we feed the output of the span representation layer, V_s, into a linear layer with softmax to classify the implicit discourse relation (r) between the two DUs, as shown in Equations 4 and 5.

4 We have evaluated our model with various span representations (Toshniwal et al., 2020), including Average Pooling, Attention Pooling, Endpoint, and Coherent, and Endpoint achieves the best performance. The reason is that this representation not only uses the head, which corresponds to the [CLS] representation commonly used in pre-trained models for span classification, but also considers the last hidden state (the tail), which is closest to the generation module.
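The Endpoint representation and the classification head can be sketched in a dependency-free form: concatenate the first and last encoder hidden states, then apply a linear layer with softmax. The toy weights and dimensions below are illustrative; in CG-T5 they are learned parameters of the model.

```python
# Minimal sketch of the Endpoint span representation and classifier.
# hidden_states: list of d-dimensional encoder states; W: (num_relations x 2d).
import math

def endpoint_classify(hidden_states, W, b):
    # Endpoint: concatenate the head (H_0) and tail (H_-1) states -> 2d vector.
    v_s = hidden_states[0] + hidden_states[-1]
    # Linear layer followed by a numerically stable softmax.
    logits = [sum(w_i * x for w_i, x in zip(row, v_s)) + b_j
              for row, b_j in zip(W, b)]
    exps = [math.exp(l - max(logits)) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]  # probability over relation classes

H = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # 3 positions, d = 2 (toy values)
W = [[1, 0, 0, 0], [0, 0, 0, 2]]          # 2 relations x 2d (toy values)
probs = endpoint_classify(H, W, [0.0, 0.0])
print(max(range(len(probs)), key=probs.__getitem__))  # predicted relation index
```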

Generation Module
There are two inputs to the decoding layer during training: the hidden states of the encoding layer (Hidden_E) and the gold target sentence (T). We format T in the same style as the encoding-layer input as follows.
where W_t = {w_t^1, w_t^2, ..., w_t^k} represents the token sequence of the target sentence. Then we feed them into the decoding layer to obtain the decoder hidden states (Hidden_D) as follows.
Finally, we use a linear-layer generator with softmax to produce the predicted target sentence. At test time, our model generates the predicted sentence W_p = {w_p^1, w_p^2, ..., w_p^g} from the input of the two DUs.

Joint Learning Module
We jointly learn the above two modules. The loss functions for the classifier (Loss_cla) and the generator (Loss_gen) are cross-entropy losses, and the total loss (Loss) is the sum of the two as follows.
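The joint objective can be sketched directly from the definition: a classification cross-entropy plus a per-token generation cross-entropy, summed without weighting. The toy probability values below are illustrative.

```python
# Sketch of the joint loss: Loss = Loss_cla + Loss_gen, both cross-entropy.
import math

def cross_entropy(probs, gold_index):
    return -math.log(probs[gold_index])

def joint_loss(cla_probs, gold_label, gen_steps):
    # gen_steps: list of (per-token probability distribution, gold token id).
    loss_cla = cross_entropy(cla_probs, gold_label)
    loss_gen = sum(cross_entropy(p, g) for p, g in gen_steps)
    return loss_cla + loss_gen

loss = joint_loss([0.7, 0.3], 0, [([0.9, 0.1], 0), ([0.2, 0.8], 1)])
print(round(loss, 4))  # 0.6852
```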

Experimentation
In this section, we first introduce the datasets and experimental settings. Then, we evaluate our model CG-T5 on MCDTB and PDTB.

Datasets and Experimental Settings
We mainly evaluate our model on two popular discourse relation datasets: Chinese MCDTB and English PDTB. First, considering that macro discourse units are longer and the connections between discourse units are more obscure in Chinese, we conduct experiments on MCDTB. Then, we conduct experiments on PDTB, one of the most popular English discourse relation corpora, to verify the generality of our model. Besides, we also conduct experiments on another Chinese dataset, CDTB; the results are shown in Appendix D.3. MCDTB: following previous work (Jiang et al., 2019; Sun et al., 2020), we use the same dataset division and five-fold cross-validation for the experiments.
PDTB: following previous work (Ji and Eisenstein, 2015; Kim et al., 2020), we adopt the most-used dataset split PDTB-Ji, which takes sections 2-20 as the training set, 0-1 as the development set, and 21-22 as the test set.
We use PyTorch and Huggingface (Wolf et al., 2020) 5 as the deep learning framework; the key parameter settings of our model are described in Appendix C. Since there is no official Chinese T5 model, we use parameter weights provided by a third party 6 : a T5 (base) model with a 12-layer encoder and a 12-layer decoder, trained with an automatic summarization task on a corpus of about 30GB. In English, we use the official parameter weights 7 of T5 (base) for our CG-T5 model. Besides, we use AllenNLP instead of the LTP tools to extract core information from discourse units in English.
Following previous work, we use Micro-F1 and Macro-F1 to evaluate IDRR models on the top-level and second-level classes. In MCDTB, there are 3 classes at the top level and 15 types at the second level. In PDTB, there are 4 classes at the top level and 11 types at the second level 8 .
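For reference, the two evaluation scores can be computed as follows. This is a plain-Python sketch of the standard definitions (libraries such as scikit-learn provide equivalent functions); for single-label multi-class data, Micro-F1 coincides with plain accuracy.

```python
# Micro-F1 and Macro-F1 over a multi-class prediction list.
from collections import Counter

def micro_macro_f1(gold, pred):
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    def f1(t, false_pos, false_neg):
        return 2 * t / (2 * t + false_pos + false_neg) if t else 0.0
    # Micro: pool all counts; Macro: average per-class F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

micro, macro = micro_macro_f1(["A", "A", "B", "C"], ["A", "B", "B", "B"])
print(round(micro, 3), round(macro, 3))  # 0.5 0.389
```

Macro-F1 weights every class equally, which is why it is the more sensitive score for the minor classes discussed in the experiments.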

Experimentation on MCDTB
To exhibit the effectiveness of our CG-T5 model, we select several strong baselines, including MSRN. Table 3 shows the performance of our CG-T5 (the bottom ten lines, using different forms of the target sentence) and the baselines on MCDTB. The pre-trained BERT performs better than the other classification-based baselines. However, there is still a significant gap between the performance of the traditional generation model GPT-2 (Q1) and that of the classification models, which indicates that the traditional generation model is not suitable for recognizing implicit discourse relations.
In addition, CG-T5 using the relation name (Name) and the explanation of the relation (Rel-Exp and Exp-Rel) as the target sentence achieves similar performance. In particular, Table 3 shows that with the relation name (Name) and joint learning, our generation and classification models reach 31.09 and 31.50 Macro-F1 at the second level, significant gains of 5.97 and 6.3, respectively, over BERT, which demonstrates that this form better recognizes classes with fewer samples.
Our CG-T5 using the question sentence guided by the first DU (Q1) achieves the best performance and improves fine-grained (15-type) IDRR by up to 1.94 and 6.82 in Micro-F1 and Macro-F1, respectively, compared with the best baseline BERT. There are two reasons, as mentioned in Section 3.1. First, the question sentence extracts core information from only one DU to construct the target sentence, which reduces cascading errors. Second, the question sentence highlights the meaning of the discourse relation and the differences between types, which enables the model to understand the discourse relation better than the other two forms. Besides, according to our statistics, the average length of Q1's target sentence is 45.44 words, while that of Exp-Rel's is 87.73 words. It is more difficult for the model to learn the relation from Exp-Rel because the relation description takes up a smaller proportion of the target sentence. Compared with generating only the relation name (Name), which lacks the core information of the DUs, and generating the explanation and relation (Exp-Rel), which does not pay enough attention to the relation description, using questions as the target sentence (Q1) balances learning the relation description and the core information of the discourse units, achieving the best performance.
In addition, we also notice that the performance of the generation model (gen) is slightly lower than that of the classification model (cla). The reason may be that the pre-trained Chinese T5 differs from the vanilla English T5.
However, the performance of CG-T5 using the question sentence guided by the second DU (Q2) is not as good as that guided by the first DU (Q1). We believe the reason is the uneven distribution of nuclearity: the first DU is usually the nucleus. In MCDTB, the first DU is the nucleus in 63.94% of cases, the second DU in 2.52%, and the two DUs are equally important in 33.54% of discourse relation pairs. Therefore, the model (Q1) using the question sentence guided by the first DU, which carries more important information, can better grasp the connection between the two DUs and recognize the discourse relation better.

Experimentation on PDTB
To evaluate model generality, we also conduct experiments on PDTB and select six strong baselines for a fair comparison: 1) Bai2018 (Bai and Zhao, 2018): it uses different-grained text representations on ELMo to enhance DU representation. 2) Bai2019 (Bai et al., 2019): it adds a memorizing mechanism to their previous work (Bai and Zhao, 2018). 3) Nguyen2019 (Nguyen et al., 2019): it uses multi-task learning via label embedding. 4) Guo2020 (Guo et al., 2020): it is a knowledge-enhanced attention neural network that enhances the interaction between discourse units by introducing external knowledge. 5) He2020 (He et al., 2020): it translates the discourse relations into a low-dimensional embedding space and proposes a joint learning framework with the semantic features of arguments. In addition, we reproduce 6) BERT (base) 9 at the same scale as our model for a fair comparison. Similar to the performance on MCDTB, our CG-T5 with Q1 achieves the best performance, and almost all of its F1 scores, whether from the classification or the generation mechanism, are better than the strong baseline BERT. In particular, compared with BERT, our generation model (gen) improves Micro-F1 and Macro-F1 by 1.25 and 1.66 in 11-way classification, and by 1.63 and 2.05 in 4-way classification, respectively. These results indicate that our model achieves the best performance under the same order of magnitude of model parameters, proving its effectiveness on English IDRR.

We further analyze the performance of 4-way classification on PDTB at the top level across different classes, as shown in Table 5. We notice that our improvement mainly comes from the Comparison relation, with a significant increase of 8.21%. The reason is that the two words "negate" and "opposite" in the target sentences (questions) of the Concession and Contrast relations more accurately represent their meanings, which helps the model better recognize these two relations.

Analysis
To further demonstrate the effectiveness of our CG-T5, we choose the 1st fold of MCDTB as an example and analyze two aspects: ablation experiments on joint modeling, and text generation quality assessment.

Ablation Study
We conduct ablation studies to evaluate CG-T5 with joint modeling of classification and generation, as shown in Table 6. Both the only-generation (only-gen) model (i.e., vanilla T5) and the only-classification (only-cla) model achieve better performance than GPT-2 and BERT, respectively. It is worth noting that through joint modeling, our CG-T5 further improves the performance of the classification model (cla) and the generation model (gen) simultaneously, which proves that CG-T5 integrates the advantages of the two models and achieves better performance. In addition, the improvement of the generation model is more significant than that of the classification model in the joint modeling architecture. Since the performance of the generation model is similar to that of the classification model in Table 3, we further analyze the difference between the two models with Q1 in the joint framework, as shown in Table 7. Although the performance gap between the two outputs is not significant and the agreement rate is 91.34%, the oracle value (the final result is counted as correct if either of the two outputs is correct) shows a further improvement. Specifically, we find that the generation model performs better on minor classes with fewer samples, while the classification model performs better on major classes. This demonstrates that our model effectively integrates the classification and generation models so that they complement each other's advantages. Table 7: The Micro-F1 comparison between the classification and generation outputs of our best model (Q1) on MCDTB (the 1st fold).
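The agreement rate and oracle value above can be computed straightforwardly from the two output streams. A minimal sketch with toy labels:

```python
# Agreement rate and "oracle" accuracy between the two outputs of the joint
# model: the oracle counts a sample correct if either the classification
# output or the generation output matches the gold label.
def agreement_and_oracle(gold, cla_pred, gen_pred):
    n = len(gold)
    agree = sum(c == g for c, g in zip(cla_pred, gen_pred)) / n
    oracle = sum(c == y or g == y
                 for c, g, y in zip(cla_pred, gen_pred, gold)) / n
    return agree, oracle

gold = ["Elab", "Cause", "Elab", "Joint"]
cla  = ["Elab", "Cause", "Joint", "Joint"]
gen  = ["Elab", "Elab", "Elab", "Cause"]
print(agreement_and_oracle(gold, cla, gen))  # (0.25, 1.0)
```

The gap between either single accuracy and the oracle is exactly the room that joint modeling tries to exploit: samples where only one of the two modules is right.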

Text Generation Quality Assessment
We use Rouge scores 10 , commonly used in generation tasks, to evaluate the generation quality of our best model (Q1), as shown in Table 8. Due to its more advanced generation architecture, our model achieves better performance than the traditional GPT-2. Moreover, since joint modeling with classification constrains the generation, our model CG-T5 also outperforms the vanilla T5.
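As a reference point for how these scores are computed, a minimal ROUGE-1 F1 (unigram overlap) sketch follows; real evaluations use a full ROUGE implementation with stemming and multiple n-gram orders.

```python
# Minimal ROUGE-1 F1: unigram overlap between reference and candidate.
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("DU2 is a detailed description of DU1",
                  "DU2 is a description of DU1")
print(round(score, 3))  # 0.923
```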

Conclusion
In this paper, we regard IDRR as a generation task to generate a target sentence, which introduces prior knowledge to understand discourse relations more deeply. Moreover, we propose a joint learning model, CG-T5, to integrate the classification model and the generation model, and we design three forms of the target sentence along with a construction method to automatically extract core content from large DUs. The experimental results on both the MCDTB and PDTB datasets show that our CG-T5 outperforms the SOTA baselines. In future work, we will focus on constructing more robust target sentences automatically and on integrating question generation and answer generation to recognize discourse relations better.

DU2: According to statistics, by the end of last year, the Ningbo Free Trade Zone had completed a total of US$812 million in import and export trade, and the import and export trade volume through the customs of the Free Trade Zone last year alone reached US$365 million. At present, there are ten bonded warehouses in the zone with a storage area of more than 80,000 square meters; last year alone, the zone stored goods worth 2.627 billion yuan. With the adjustment of China's special policies outside the bonded area since April this year, the bonded area's certificate and tax exemption and the stability of the bonded policy have become more obvious advantages. A large number of domestic and foreign industrial processing projects have successively settled in the area. By the end of December last year, a total of 1,614 enterprises had been established in the zone, with a total investment of 1.2 billion U.S. dollars, of which 260 were foreign-invested enterprises, and the actual utilization of foreign capital was 113 million U.S. dollars. In addition, many domestic enterprises have also connected with the international market through the bonded zone. In order to complement the free trade zone in terms of operating mechanism, the Ningbo Free Trade Zone took the lead in China in implementing a trial one-stop management system of direct registration of enterprises in accordance with the law, handled in a single visit. At the same time, the bonded zone has vigorously promoted the construction of the information expressway network system in the zone to create good supporting conditions for modern management. (Finish)

B Complete Templates of Target Sentences

The key parameters of our experimental model are shown in Table 13.

Figure 4 shows the confusion matrix of CG-T5 (Q1) on the 11 relation types. We find that the five major types (Contrast, Cause, Conjunction, Instantiation, and Restatement), whose sample counts exceed 100 in the test set, achieve higher performance (accuracy > 50). Although instances of two types (Pragmatic cause and Synchrony) cannot be recognized due to too few samples, our model with the target sentence still improves the other types (e.g., Alternative, Asynchronous, and List). In addition, we find that the main errors come from confusion between the relations Synchrony and Conjunction, Pragmatic cause and Cause, Pragmatic cause and Restatement, and List and Conjunction. These pairs are difficult to distinguish, even when our model takes the question as the target sentence, because of their similar semantics.

Table 14 shows the two prediction outputs of our joint model CG-T5 (Q1) and the prediction output of BERT for a sample. BERT cannot distinguish the relations Pragmatic cause and Instantiation well because the form of the discourse units in the two relation types is similar. However, our CG-T5 model accurately generates the target sentence and better grasps the difference between the two relations through joint learning. Therefore, both the classification and the generation outputs of our CG-T5 model are correct, because the classification and generation modules complement each other.

D.3 Experiments on CDTB
In CDTB, following previous work (Xu et al., 2019) 11 , we use 446 articles as the training set and 49 articles as the test set. We reproduce the following baselines: Bi-LSTM, CNN, GCN (Dauphin et al., 2017), and Xu19 (Xu et al., 2019). In addition, we also reproduce the BERT (base) model for a fairer comparison. The experimental settings are the same as those on MCDTB. To be consistent with previous work, we report the results of converting the 17 types into four classes (top level).

Table 14 example:
DU1: TV programmers could let audiences vote on different endings for a movie
DU2: Fox Broadcasting experimented with this concept last year when viewers of "Married ... With Children" voted on whether Al should say "I love you" to Peg on Valentine's Day
True relation: Expansion.Instantiation
True target sentence: Can you give me an example of TV programmers let audiences vote on different endings for a movie?
Relation predicted by BERT: Contingency.Pragmatic cause
Relation predicted by CG-T5: Expansion.Instantiation
Target sentence generated by CG-T5: Can you give me an example of TV programmers let audiences vote on different endings for a movie?

Table 15 shows that, thanks to its large-scale pre-training, the BERT model achieves 76.7 Micro-F1 and 55.7 Macro-F1, better than the other SOTA systems without pre-training. Our model with Q2 achieves the best performance, 0.2 and 2.9 higher on Micro-F1 and Macro-F1 than BERT, and it significantly improves the Macro-F1 of Caus. by 4.5 points, which benefits from introducing prior knowledge.
Unlike the experimental results on MCDTB, the model using Q2 instead of Q1 achieves the best performance here. The reason may be that the semantic differences between types within a class are not significant in the Q1 form. For example, within Caus., the difference between the Hypothesis and Conditional relations is more difficult for the model to distinguish with Q1 than with Q2.