Enhancing Dialogue-based Relation Extraction by Speaker and Trigger Words Prediction

Identifying relations from dialogues is more challenging than traditional sentence-level relation extraction (RE), due to the difficulties of speaker information representation and long-range semantic reasoning. Despite successful efforts, existing methods do not fully consider the particularity of dialogues, making it difficult for them to truly understand the semantics between conversational arguments. In this paper, we propose two beneficial tasks, speaker prediction and trigger words prediction, to enhance the extraction of dialogue-based relations. Specifically, speaker prediction captures the characteristics of speaker-related entities, and trigger words prediction provides supportive contexts for relations between arguments. Extensive experiments on the DialogRE dataset show noticeable improvements over the baseline models, achieving a new state-of-the-art performance with an F1 score of 65.5% and an F1_c score of 60.5%, respectively.


Introduction
The task of relation extraction is to identify relation facts between two arguments in plain text, which is a fundamental step of many natural language processing applications. Recent years have seen increasing efforts on sentence-level RE, i.e., relations that hold within a single sentence (Fu et al., 2019; Luan et al., 2019; Zhao et al., 2020; Wang and Lu, 2020). To adapt to complex scenarios, some current works have moved forward to document-level RE, i.e., relations that can exist across multiple sentences (Yao et al., 2019; Nan et al., 2020; Jain et al., 2020; Zhou et al., 2021).

* Work done during an internship at Tencent Cloud Xiaowei.
† Corresponding Author.

A more challenging yet practical extension is dialogue-based relation extraction. Dialogues contain multi-turn conversations among a group of speakers. Relations exist not only between the entities in the dialogue text but also between the speakers of each utterance. Additionally, most relations appear across multiple turns of conversation, which requires cross-sentence extraction. Considering these complexities, we divide dialogue-based RE into three categories. In the first category, the relation can be directly inferred from the current utterance, as shown in Dialogue 1 of Figure 1. In the second category, the relation involves utterances among multiple speakers and there is clear evidence in the dialogue that triggers the relation. In Dialogue 2 of Figure 1, "mom" is the trigger word of the relation per:parents between "S1" and "S2". Nevertheless, there are still cases where no clear context indicates the relationship. As shown in Dialogue 3 of Figure 1, the relation between "S1" and "S3", as well as the relation between "S2" and "S3", can only be inferred from the tones and expression habits of the speakers. Therefore, to identify relations from complex dialogues, it is necessary to 1) discover highly supportive information about the arguments, and 2) capture the unique features of speakers.
Existing studies propose to solve this task with a speaker-aware BERT model (Yu et al., 2020) and a Gaussian graph-based method (Xue et al., 2021). The former replaces the speaker arguments in the dialogue text with special tokens to highlight speaker-related information. The latter builds a latent multi-view graph to encode the long-distance dependencies between arguments. However, these works regard a dialogue as plain text, without considering the supporting information of relations and the characteristics of speakers. As emphasized before (Xue et al., 2021), trigger words and speaker-related features play a critical role in dialogue-based relation extraction. Consequently, it is difficult for these models to detect speaker-related relations in complicated conversations.
To address the above limitations, we propose two beneficial tasks, speaker prediction and trigger words prediction, to enhance dialogue-based relation extraction. Specifically, the speaker prediction task aims to capture the unique features of speakers: we randomly mask the speaker tokens and use the context to predict who said the utterance. The trigger words prediction task detects the supportive context of the current relation; we solve it with a sequence labeling method. Moreover, we design an integration module for the relation extraction task to combine both the global dialogue representation and the local argument representations. Finally, the three tasks are jointly trained under a multi-task learning framework.
The contributions of our work are summarized as follows. We propose two beneficial tasks, speaker prediction and trigger words prediction, to capture the unique features of speakers and to detect supportive information about arguments, both of which effectively enhance dialogue-based relation extraction. We evaluate our method on the DialogRE dataset and achieve a new state-of-the-art performance with an F1 score of 65.5% and an F1_c score of 60.5%, respectively. Our code is available at https://github.com/TanyaZhao/DialogRE-Trigger-Speaker-Prediction.

Figure 2: Overall structure of our method.

Problem Definition
Given a dialogue d = s_1: t_1, s_2: t_2, ..., s_m: t_m and two arguments a_1, a_2, where s_i and t_i are the speaker and the utterance of the i-th turn, dialogue-based relation extraction aims to predict the relation type r ∈ R between a_1 and a_2, where R is the set of predefined relation categories.
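As a minimal illustration of this setting, one DialogRE-style example can be represented as follows (the field names are our own for illustration, not the dataset's actual schema):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class DialogueREInstance:
    """One dialogue-based RE example: a dialogue d as (speaker, utterance)
    turns, an argument pair (a1, a2), and the relation r to predict."""
    turns: List[Tuple[str, str]]    # [(s_i, t_i) for i = 1..m]
    a1: str                         # first argument (a speaker or an entity mention)
    a2: str                         # second argument
    relation: Optional[str] = None  # gold label r in R; None at inference time

example = DialogueREInstance(
    turns=[("S1", "Mom!"), ("S2", "Sweetie! So this is where you work?")],
    a1="S1", a2="S2", relation="per:parents",
)
```

The argument pair may name speakers (as here, mirroring Dialogue 2 of Figure 1) or entity mentions inside the utterances.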

Methodology
This section introduces the structure of our method, including three tasks: relation extraction, trigger words prediction, and speaker prediction. Figure 2 shows the overall structure of our method.

Relation Extraction Task
The relation extraction task takes the dialogue d and the argument pair (a_1, a_2) as input and outputs a relation type r between the two arguments. For the dialogue d, we first modify it into d̂ = ŝ_1: t̂_1, ŝ_2: t̂_2, ..., ŝ_m: t̂_m, where ŝ_i and t̂_i denote the speaker and utterance of the i-th turn with special markers inserted around the mentions of a_1 and a_2. We feed the sequence into the pre-trained language model BERT (Devlin et al., 2019) and obtain the hidden semantic representation h_i of each input token. Among them, h_[CLS] ∈ R^{d_h} is the representation of the [CLS] token, where d_h is the hidden size of BERT. We use h_[CLS] to represent the global relational feature between a_1 and a_2.
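The input construction can be sketched as follows; the marker token strings (`<a1>`, `</a1>`, etc.) are hypothetical placeholders, since the paper does not spell out its exact marker vocabulary:

```python
def build_input(turns, a1, a2):
    """Flatten the dialogue into one BERT input string, wrapping every
    mention of the two arguments (in speaker slots and in utterances)
    with start/end markers so their hidden states can be located later."""
    markers = {a1: ("<a1>", "</a1>"), a2: ("<a2>", "</a2>")}
    pieces = []
    for speaker, utterance in turns:
        for arg, (start, end) in markers.items():
            utterance = utterance.replace(arg, f"{start} {arg} {end}")
            if speaker == arg:
                speaker = f"{start} {arg} {end}"
        pieces.append(f"{speaker}: {utterance}")
    return "[CLS] " + " ".join(pieces) + " [SEP]"

seq = build_input([("S1", "Mom!"), ("S2", "Hi S1")], "S1", "S2")
```

A real implementation would mark mentions at the token level via the BERT tokenizer rather than by naive string replacement, which is used here only to keep the sketch self-contained.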
To better represent the semantic information of the arguments, we distill all the hidden states of a_k's start marker in the sequence, including those in the dialogue text, and formulate them as H_{a_k} = {h^{a_k}_1, ..., h^{a_k}_{n_k}} (Eq. 2). Then, we apply max pooling to obtain a combined representation of a_k: h_{a_k} = MaxPooling(H_{a_k}) (Eq. 3). Next, we concatenate h_[CLS], h_{a_1}, and h_{a_2} as h = [h_[CLS]; h_{a_1}; h_{a_2}], where h_[CLS] is the global relational information of the sequence, and h_{a_1}, h_{a_2} are the local features of the arguments. Furthermore, we propose an integration module to enhance the correlation between the dialogue and the arguments. Specifically, h is fed into a two-layer highway network (Srivastava et al., 2015) as ĥ = Highway(W_t h + b_t) (Eq. 4), where W_t ∈ R^{3d_h × d_t} and b_t ∈ R^{d_t} are learned weights with d_t as the hidden size of the highway network. Finally, we conduct a multi-class classification to calculate the probability of the relation between a_1 and a_2 by p(r | d, a_1, a_2) = softmax(W_r ĥ + b_r) (Eq. 5).
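The pooling-and-concatenation step can be sketched in plain Python, operating on toy list vectors rather than tensors; the highway layers and the softmax classifier are omitted from this sketch:

```python
def max_pool(states):
    """Element-wise max over a set of marker hidden-state vectors."""
    return [max(col) for col in zip(*states)]

def relation_feature(h_cls, a1_marker_states, a2_marker_states):
    """Build h = [h_[CLS]; h_a1; h_a2]: pool each argument's start-marker
    states, then concatenate them with the global [CLS] feature."""
    return h_cls + max_pool(a1_marker_states) + max_pool(a2_marker_states)

# Toy example: d_h = 2, a1 has two marker occurrences, a2 has one.
h = relation_feature([0.1, 0.9], [[0.0, 0.5], [0.3, 0.1]], [[0.2, 0.2]])
```

Pooling over all marker occurrences lets an argument mentioned in several turns contribute its strongest feature in each dimension, rather than only its first mention.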

Trigger Words Prediction Task
Generally, to identify the relation between two arguments, it is necessary to detect the supportive context that triggers the relation. Yu et al. (2020) have verified that trigger words play an important role in relation extraction. However, their work directly appends the ground-truth trigger words to the input sequence, which is not feasible in scenarios where golden triggers are unavailable. An intuitive remedy is to predict the trigger words from the conversation; in this way, we can still obtain supporting information to guide the relation extraction without relying on golden triggers.
We propose the trigger words prediction task, which applies a simple and effective way to improve the relation extraction task. Specifically, we perform sequence labeling over the hidden outputs of BERT. Considering that the trigger words are closely related to the relation, we first map the predicted relation r (Eq. 5) into a distributed embedding e_r ∈ R^{d_r}, and concatenate it with each hidden output of BERT as z_i = [e_r; h_i]. Then, we predict the boundary label for every token. Formally, the probability of the token x_i having label l is calculated by p(l | x_i) = softmax(W_l z_i + b_l) (Eq. 6), where W_l ∈ R^{(d_h + d_r) × |B|} and b_l ∈ R^{|B|} with B = {B, I, O}.
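For training supervision, the annotated trigger span can be converted into B/I/O boundary labels over the token sequence. A minimal sketch, assuming exact matching and labeling only the first occurrence (the dataset's actual annotation alignment may differ):

```python
def bio_labels(tokens, trigger):
    """Convert an annotated trigger span into B/I/O supervision:
    B on the first trigger token, I on the rest, O everywhere else."""
    labels = ["O"] * len(tokens)
    n = len(trigger)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == trigger:
            labels[i] = "B"
            labels[i + 1:i + n] = ["I"] * (n - 1)
            break  # label only the first match
    return labels

tags = bio_labels(["S1", ":", "that", "is", "my", "mom", "!"], ["my", "mom"])
```

At inference time, the predicted B/I spans are read off directly as the trigger words supporting the relation.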

Speaker Prediction Task
Notably, a majority of the relations in dialogue-based RE are associated with speakers, e.g., the triplet (S1, per:parents, S2) in Figure 1. Different from ordinary entities, speakers have distinctive personal features, including tone of voice, expression habits, etc., which are important indicators for relation extraction. Therefore, we further propose the speaker prediction task based on the discourse structure to capture the speaker-related features. The motivation behind it is that if the model can distinguish who said an utterance, it has learned the speaker's unique information, which is helpful for speaker-related relation prediction.
Concretely, we randomly select the speaker words s_i in d with a probability of 10% and replace them with a special token [MASK]. Next, the BERT model takes the modified sequence as input and obtains the hidden state of each masked speaker, denoted as h^{mask}_i. Then, we predict the speaker through a multi-class classification as p(s | h^{mask}_i) = softmax(W_s h^{mask}_i + b_s) (Eq. 7), where W_s ∈ R^{d_h × S} and b_s ∈ R^S with S as the maximum number of speakers in the dialogues.
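The masking step can be sketched as follows; the seeded `random.Random` is only for reproducibility of the sketch and is not prescribed by the paper:

```python
import random

def mask_speakers(turns, p=0.1, rng=None):
    """Replace each speaker token with [MASK] with probability p; the
    original speakers of the masked turns become the prediction targets."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for speaker, utterance in turns:
        if rng.random() < p:
            masked.append(("[MASK]", utterance))
            targets.append(speaker)  # gold label for this masked position
        else:
            masked.append((speaker, utterance))
    return masked, targets
```

During training, the BERT hidden states at the [MASK] positions are fed to the classifier of Eq. 7 to recover the `targets`.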

Joint Training Objective
The above three tasks share the BERT encoder and are jointly trained under the multi-task learning framework. During training, we minimize the joint objective L = L_RE + L_TP + L_SP, where L_RE is the binary cross-entropy loss for relation extraction, and L_TP and L_SP are the cross-entropy losses for trigger words prediction and speaker prediction, respectively. For inference, we directly use the relation predicted by Equation 5 as the final result.
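The joint objective can be sketched as a sum of the three per-task losses; the original equation was garbled, so the optional task weights below are an assumption (defaulting to 1.0, i.e., an unweighted sum):

```python
def joint_loss(l_re, l_tp, l_sp, w_tp=1.0, w_sp=1.0):
    """Joint training objective: the relation-extraction loss plus the two
    auxiliary losses. w_tp / w_sp are hypothetical task-weight knobs; the
    paper does not specify any weighting, so they default to 1.0."""
    return l_re + w_tp * l_tp + w_sp * l_sp
```

In practice the three losses would be computed per batch from the shared encoder's outputs and backpropagated together.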

Experiments
In this section, we compare the proposed method with the current state-of-the-art approaches to evaluate its effectiveness.

Experimental Setup
Dataset We conduct experiments on the dialogue-based RE benchmark dataset, DialogRE (Yu et al., 2020). It contains 1,788 dialogues from transcripts of the Friends corpus, with 36 relation types in total. 49.6% of the relation triples are annotated with trigger words.
Evaluation Metrics Following previous work (Yu et al., 2020), we adopt the F1 score and the F1_c score as evaluation metrics. F1_c is a supplement to F1 that considers only the first i ≤ m turns of utterances, rather than the entire dialogue.
Baseline Models We compare our method with the existing advanced models: BERT (Devlin et al., 2019), BERTs (Yu et al., 2020), and GDPNet (Xue et al., 2021). The BERT model for dialogue-based RE directly applies BERT as the dialogue encoder and uses the hidden state of [CLS] for relation prediction. BERTs is a speaker-aware BERT, which modifies the speaker tokens into special markers. GDPNet uses a Gaussian graph-based network to capture the interactions within dialogues, and achieves the current state-of-the-art performance.

Experimental Results
Main Results Table 1 presents the performance comparison of our method with the existing advanced models. The results show that our method clearly outperforms the previous models and achieves a new state-of-the-art on the test set with an F1 score of 65.5% and an F1_c score of 60.5%, demonstrating the effectiveness of the proposed method.
Ablation Study We conduct ablation experiments to investigate the influence of each proposed task. Although Ours w/o SP and TP retains only the relation extraction task, it still outperforms BERTs by a large margin. This result shows that our improved relation extraction method is also effective on its own.
Analysis on Discourse Structure Modeling To show the necessity of considering the discourse structure in dialogue-based RE, we design a naive way to degenerate a dialogue into a plain document: we replace the colon after a speaker with text like "said", "responded", or "continued". For example, Dialogue 2 in Figure 1 is converted into "S1 said Mom! S2 responded Sweetie! So this is where you work? ...". Then, we apply our method to the converted text. The performance on the test set degrades significantly, to 58.0% F1 and 56.8% F1_c. This result indicates that dialogues contain important discourse structural information. Therefore, it is important to study extraction strategies tailored to dialogues rather than directly applying common sentence-level or document-level extraction methods.
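The degeneration can be sketched as follows; the verb-assignment rule (first turn "said", second "responded", later turns "continued") is our reading of the quoted example, not a scheme the paper states explicitly:

```python
def flatten_dialogue(turns):
    """Degenerate a dialogue into a plain document by replacing the
    speaker colon with connective verbs, discarding the turn structure."""
    verbs = ["said", "responded", "continued"]
    return " ".join(
        f"{speaker} {verbs[min(i, len(verbs) - 1)]} {utterance}"
        for i, (speaker, utterance) in enumerate(turns)
    )
```

Running the same model on this flattened text isolates the contribution of the dialogue's turn structure from that of its lexical content.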
Trigger Words Prediction To further evaluate the effect of trigger words prediction task, we calculate the prediction performance on the cases annotated with the ground truth trigger words. Note that, 49.6% of relational triplets have trigger words in DialogRE. The prediction accuracy is 75.6%. The result demonstrates that we can correctly recognize most of the the trigger words. And with the help of the supporting information, the performance of relation extraction is considerably improved, as shown in Table 2.
Case Study We present a case study to analyze the quality of the results produced by our approach and the baseline model. The cases in Table 3 show that our method is capable of capturing the trigger words information and the characteristics of speakers. In case 1, the baseline model fails to utilize the trigger words information and identifies the relation as unanswerable. In contrast, our method correctly recognizes that the word "Mom" triggers the relation between "S1" and "S2", which promotes the right prediction. Moreover, in case 2, our method captures the characteristic information of speakers and thus correctly predicts that "S3" is the boss of "S1" and "S2", whereas the baseline model has difficulty handling such cases.

Conclusion
In this paper, we propose to enhance dialogue-based relation extraction with two beneficial tasks, speaker prediction and trigger words prediction. Extensive experiments on the benchmark dataset DialogRE demonstrate the effectiveness of our method, which achieves state-of-the-art performance in both F1 score and F1_c score.