TREND: Trigger-Enhanced Relation-Extraction Network for Dialogues

The goal of dialogue relation extraction (DRE) is to identify the relation between two entities in a given dialogue. During conversations, speakers may expose their relations to certain entities through explicit or implicit clues; such evidence is called “triggers”. However, trigger annotations are not always available for the target data, so it is challenging to leverage such information to enhance performance. Therefore, this paper proposes to learn how to identify triggers from data with trigger annotations and then transfer the trigger-finding capability to other datasets for better performance. The experiments show that the proposed approach improves relation extraction performance on unseen relations and also demonstrate the transferability of the proposed trigger-finding model across different domains and datasets.


Introduction
The goal of relation extraction (RE) is to identify the semantic relation type between two mentioned entities in a given piece of text, which is one of the basic and important natural language understanding (NLU) problems (Zhang et al., 2017; Zhou and Chen, 2021; Cohen et al., 2020). In the usual problem setting, we are given a written sentence and a query pair containing two entities and are asked to return the most probable relation type from a predefined set of relations. Dialogue Relation Extraction (DRE), on the other hand, aims to uncover underlying cross-sentence relations in natural human communication (Yu et al., 2020; Jia et al., 2020). The problem itself is well motivated: relations between entities in dialogues could provide dialogue systems with additional features for better dialogue management (Peng et al., 2018; Su et al., 2018a) and for generating more appropriate responses (Su et al., 2018b). Figure 1 illustrates an example from the recently proposed dataset, DialogRE (Yu et al., 2020). Given a conversation and a query pair, we aim to identify the interpersonal relationship between the entities; the entities can be not only humans but also other types of entities such as locations. Furthermore, with a longer context than a single sentence, DialogRE also annotates the evidence of relations within the conversation flow, called triggers. A trigger can be a short phrase or even a single word, and it can be of any part of speech. In the example, the clue for knowing Speaker 3 has negative impression on Speaker 3 is that Speaker 2 once said "You are arrogant." Such a hint is intuitively useful for identifying the relations. However, no previous work has tried to explicitly leverage the trigger information for DRE.

The first two authors contributed to this work equally.
Prior work can be divided into two main lines. One is graph-based methods: DHGAT (Chen et al., 2020) presents an attention-based heterogeneous graph network to model multiple types of features; GDPNet (Xue et al., 2020b) constructs latent multi-view graphs to model possible relationships among tokens in a long sequence and then refines the graphs by iterative graph convolution and special graph pooling techniques. The other is BERT-based (Devlin et al., 2018) methods (Yu et al., 2020; Xue et al., 2020a). SimpleRE (Xue et al., 2020a) is a simple BERT model with an additional refinement gate for iteratively finding high-confidence predictions. LSR (Nan et al., 2020) proposes a latent structure refinement method for better reasoning in the document-level relation extraction task.

Figure 2: The proposed method is composed of two components: (1) a multi-tasking BERT with two fine-tuning tasks (trigger prediction, and prediction of having a trigger or not), and (2) a relation predictor with feature fusion by attention. (Diagram labels: Relation Predictor, Trigger Span, Context Vector, Relation, Have Trigger?)
In this work, we propose TREND, a multi-tasking model based on BERT with an attentional relation predictor, for which we design auxiliary tasks for trigger prediction. Specifically, TREND has (1) extractive-style trigger identification by predicting start-end pointers, and (2) a binary classifier for the existence of triggers. The proposed methods are simple and flexible, and the experimental results show that our model achieves the state of the art on DialogRE (Yu et al., 2020) and DDRel (Jia et al., 2020).

Proposed Method
The core idea of this work is to identify trigger spans and leverage them to improve relation extraction. We hereby propose the Trigger-enhanced Relation-Extraction Network for Dialogues, TREND.

Problem Formulation
Given a piece of dialogue context $D$ composed of text tokens, $D = \{x_i\}$, and a query pair $q = (s, o)$ containing a subject entity $s$ and an object entity $o$, we aim to learn a function $f$ that finds the most probable relation between the entities from a predefined set of relations $R$, i.e., $f(D, q) \in R$. Note that a single query pair can have multiple relations, but we follow the setting of previous work: if a query has multiple relation labels, it is split into multiple data samples with the same input $(D, q)$ and different single target labels.
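The splitting of multi-label queries described above can be sketched as follows. This is a minimal illustration only: the tuple layout of a sample and the relation names `relation_a` and `relation_b` are hypothetical placeholders, not the data format of DialogRE.

```python
from typing import List, Tuple

# Hypothetical layout: (dialogue tokens D, query pair q = (s, o), relation labels).
Sample = Tuple[List[str], Tuple[str, str], List[str]]
SingleLabelSample = Tuple[List[str], Tuple[str, str], str]


def split_multi_label(samples: List[Sample]) -> List[SingleLabelSample]:
    """Turn each (D, q, [r1, r2, ...]) into one (D, q, r) sample per label."""
    flat: List[SingleLabelSample] = []
    for dialogue, query, relations in samples:
        for relation in relations:
            # The input (D, q) is shared; only the target label differs.
            flat.append((dialogue, query, relation))
    return flat


example: List[Sample] = [
    (["Speaker 1:", "Hi", "there"], ("Speaker 1", "Speaker 2"),
     ["relation_a", "relation_b"]),
]
flat_samples = split_multi_label(example)
```

Each resulting sample carries exactly one target label, so the relation predictor can be trained as a single-label classifier.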

TREND
The proposed TREND has two main modules, (1) a multi-tasking BERT (Devlin et al., 2018) for encoding context and identifying triggers, and (2) a relation predictor for predicting relation by fusing the context feature and the trigger span.
Trigger Prediction. As illustrated in Figure 2, an input $(D, q)$ is first augmented into a BERT-style sequence built with the special tokens [CLS] and [SEP], which are the classification and separator tokens, respectively. We also follow the method of Yu et al. (2020) to replace the speaker tokens in $D$. The [CLS] tokens at different positions in the sequence may carry different meanings after encoding by BERT. In our model, we assume that the encoding of the [CLS] token at the beginning of the sequence contains the contextual information of the whole input sequence.
Since triggers are spans of the input dialogue context, we propose an extractive-style method that predicts start-end pointers (Devlin et al., 2018), which is prevalent in question answering (Lee et al., 2016; Rajpurkar et al., 2016). Each pointer is a single-label classification problem of predicting the most probable position, hence the cross-entropy loss is used.
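A framework-free sketch of start-end pointer prediction is given below. It assumes that the encoder has already produced one start score and one end score per token; the greedy decoding rule (best start, then best end at or after it) is an assumption made for illustration and is not specified in the text.

```python
import math
from typing import List, Tuple


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def span_loss(start_logits: List[float], end_logits: List[float],
              gold_start: int, gold_end: int) -> float:
    """Sum of the cross-entropy losses for the start and end positions."""
    return -(log_softmax(start_logits)[gold_start]
             + log_softmax(end_logits)[gold_end])


def decode_span(start_logits: List[float],
                end_logits: List[float]) -> Tuple[int, int]:
    """Assumed greedy rule: best start, then best end at or after the start."""
    start = max(range(len(start_logits)), key=start_logits.__getitem__)
    end = max(range(start, len(end_logits)), key=end_logits.__getitem__)
    return start, end


# Toy scores for a 4-token context.
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.5, 1.5, 0.2]
predicted_span = decode_span(start_scores, end_scores)
loss = span_loss(start_scores, end_scores, gold_start=1, gold_end=2)
```

Treating the start and the end as two independent classifications over token positions is what makes the plain cross-entropy loss applicable.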
Table 1: Performance of the models on automatic metrics. The official DDRel has different levels (session-level/pair-level) and different granularities (4-, 6-, and 13-class) of evaluation settings. * Though SimpleRE reports 66.7 in the paper, its problem setting is different from the others; here we take the result under the same setting for fair comparison, as detailed in Section 3.

Binary Gate. Not every given query has a relation; in these cases, the label is "Unanswerable". Such samples do not have trigger annotations. We therefore propose to learn a binary classifier as a gate: if the binary gate gives a score over 0.5, we assume that the given sample does not have triggers and accordingly use an empty trigger span for prediction. The binary cross-entropy loss is used as the loss function.
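The gating rule can be sketched as below. Following the description above, the gate score is read as the probability that the sample has no trigger. The function names, and the assumption that the gate outputs a raw logit passed through a sigmoid, are illustrative.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(score: float, no_trigger: bool) -> float:
    """Binary cross-entropy for one sample; `score` is a probability in (0, 1)."""
    target = 1.0 if no_trigger else 0.0
    return -(target * math.log(score)
             + (1.0 - target) * math.log(1.0 - score))


def gated_trigger(tokens: List[str], span: Tuple[int, int],
                  gate_logit: float, threshold: float = 0.5) -> List[str]:
    """Return the predicted trigger tokens, or an empty span if the gate fires."""
    if sigmoid(gate_logit) > threshold:
        return []  # the gate says "no trigger": use an empty trigger span
    start, end = span
    return tokens[start:end + 1]


tokens = ["You", "are", "arrogant", "."]
kept = gated_trigger(tokens, (2, 2), gate_logit=-3.0)
dropped = gated_trigger(tokens, (2, 2), gate_logit=3.0)
```

With a strongly negative logit the predicted span is kept, while a strongly positive logit replaces it with an empty span.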
Relation Predictor. Given a context vector (the encoded [CLS] token) and a predicted trigger span, we feed them into the predictor for relation prediction, as depicted in Figure 2. The features are fused by a generic attention mechanism in which the query is the context vector $c$ and the keys and values are the BERT encodings $x_i$ of the trigger words. The merged feature is then fed into a 1-layer feed-forward network for the final relation prediction. Because the task is a single-label classification problem, the cross-entropy loss is used. Finally, all the losses from the above objectives are linearly combined, with a weight parameter for each objective to adjust its impact, and the whole model can be trained in an end-to-end manner. We also apply scheduled sampling (Bengio et al., 2015) to the trigger prediction and the binary signal when feeding them into the relation predictor.
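The attention-based fusion and the weighted loss can be sketched as follows. The exact scoring function is not given in the text, so plain dot-product attention is assumed here; the fallback to the context vector for an empty trigger span and the loss names in `total_loss` are likewise assumptions for illustration.

```python
import math
from typing import Dict, List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(context: Vector, trigger_vectors: List[Vector]) -> Vector:
    """Attend from the context vector (query) over trigger-word vectors.

    Assumption: dot-product scores; an empty span falls back to the context.
    """
    if not trigger_vectors:
        return list(context)
    scores = [sum(c * x for c, x in zip(context, vec))
              for vec in trigger_vectors]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, trigger_vectors))
            for d in range(len(context))]


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Linear combination of the per-objective losses."""
    return sum(weights[name] * value for name, value in losses.items())


fused = fuse([0.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
combined = total_loss({"relation": 1.0, "trigger": 2.0, "gate": 4.0},
                      {"relation": 1.0, "trigger": 0.5, "gate": 0.25})
```

In the toy call, a zero query scores both trigger vectors equally, so the fused feature is their average; the per-objective weights let the auxiliary trigger and gate losses be scaled relative to the relation loss.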

Experiments
In all the experiments, we use mini-batch Adam with a learning rate of 3e−5 as the optimizer on an Nvidia Tesla V100. The teacher-forcing ratio and other hyper-parameters were selected by grid search in (0, 1] with step 0.1. Training takes 30 epochs without early stopping. The entire implementation was based on PyTorch and the HuggingFace transformers package. Other details are reported in Appendix A.
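A minimal sketch of how a teacher-forcing ratio could control which trigger span is fed to the relation predictor during training is shown below. A fixed ratio per run is assumed (one value from the grid); the function name and the use of a seeded `random.Random` are illustrative choices.

```python
import random
from typing import List, Tuple

Span = Tuple[int, int]


def choose_trigger_span(gold_span: Span, predicted_span: Span,
                        teacher_forcing_ratio: float,
                        rng: random.Random) -> Span:
    """Feed the gold span with probability `teacher_forcing_ratio`,
    otherwise feed the model's own prediction."""
    if rng.random() < teacher_forcing_ratio:
        return gold_span
    return predicted_span


# Candidate ratios for the grid search: 0.1, 0.2, ..., 1.0, i.e. (0, 1] with step 0.1.
ratio_grid: List[float] = [round(0.1 * k, 1) for k in range(1, 11)]

rng = random.Random(0)
always_gold = choose_trigger_span((3, 4), (7, 7), 1.0, rng)
never_gold = choose_trigger_span((3, 4), (7, 7), 0.0, rng)
```

Because `rng.random()` lies in [0, 1), a ratio of 1.0 always feeds the gold span, and lower ratios expose the relation predictor to its own, possibly wrong, trigger predictions.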

Datasets
The benchmark datasets used in our experiments are DialogRE (Yu et al., 2020) and DDRel (Jia et al., 2020), both of which are DRE datasets. The official DialogRE dataset has two versions; we chose the English part of the latest version (v2). Since the conversations in DialogRE are quite natural and colloquial, preprocessing includes text normalization such as lemmatization and expanding contractions. Because of the different characteristics of the datasets, the batch size is 16 for DialogRE and 4 for DDRel.

Analysis
The experimental results are shown in Table 1. In our experiments, we take CNN (row (a)), BERT (row (b)), GDPNet (row (c)) (Xue et al., 2020b), and SimpleRE (row (d)) (Xue et al., 2020a) as the baselines for comparison. GDPNet and SimpleRE were not evaluated on DDRel in their original experiments, so for DDRel we only report the performance of CNN and BERT. Although SimpleRE reported an F1 score of 66.7 for its best-performing model, its problem setting is different from that of the other work. Specifically, it concatenates all the argument pairs of a dialogue sample into a long query and appends it after the dialogue context. Because a dialogue sample can have up to 20 argument pairs, this concatenation gives SimpleRE more contextual information. For fair comparison, we report SimpleRE under the same setting as the other models, taking a single argument pair. TREND (row (e)) uses the BERT-base model while TREND-L (row (f)) is based on BERT-large; both models outperform all the baselines on DialogRE and achieve the state of the art, and TREND-L further improves the performance by 1.0 F1. Unlike SimpleRE (row (d)) and GDPNet (row (c)), which need to iteratively refine latent features or latent graphs, the prediction of the proposed TREND is straightforward. Such a design makes training and inference efficient and robust.

Table 2 shows a generated example from our model. In this example, our model successfully identifies the correct trigger, which helps it predict the right relation. In terms of exact match, the trigger prediction of our model still has much room for improvement; however, an exact match is not strictly necessary. For instance, if the ground truth trigger is "Mom" but the predicted trigger is "Dad", the prediction could still help, since the label might be "parent". Partial matches are another case: if the ground truth trigger is "got married" but the predicted trigger is "married", such a prediction still helps.
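The difference between exact and partial matches can be made concrete with a token-overlap score. The sketch below uses a token-level F1 of the kind common in extractive question answering; it is an illustrative measure, not a metric reported in this paper, and the whitespace tokenization with lowercasing is an assumption.

```python
from collections import Counter


def exact_match(predicted: str, gold: str) -> bool:
    """True only if the two spans contain the same tokens in the same order."""
    return predicted.lower().split() == gold.lower().split()


def token_f1(predicted: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold trigger span."""
    pred_tokens = predicted.lower().split()
    gold_tokens = gold.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


partial_em = exact_match("married", "got married")
partial_f1 = token_f1("married", "got married")
unrelated_f1 = token_f1("Dad", "Mom")
```

For the partial match, exact match fails while the overlap score gives partial credit. The "Mom"/"Dad" case scores zero under both, which shows that surface overlap does not capture the semantic closeness that can still help relation prediction.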
Our trained binary gate has about 85% accuracy, while the trigger prediction has no more than 50% accuracy. Though these sub-modules are not perfectly trained, the ablation test (rows (g)-(h)) shows that they are useful. To examine the effectiveness of our modeling, we further estimate the upper-bound performance by using the ground truth trigger spans for the final relation prediction. The estimated upper bound of TREND (row (e)) is 75.3; in other words, our design of the relation predictor is validated, and the potential of the proposed model could be realized by improving the trigger prediction.
Transfer Learning. Since the DDRel dataset does not provide trigger annotations, the reported numbers in rows (e)-(h) are the transferred results, where the models were first pre-trained on DialogRE and then fine-tuned on DDRel. Since the output space is different, the last prediction layer is replaced. From Table 1, we can see that the TREND models (rows (e)-(f)) perform best in all the evaluation settings. The improvement over the baselines is larger for pair-level evaluation, which takes a much longer dialogue context as input. We suppose this is because, when a longer context is provided, extracting key evidence becomes more important for overcoming information overload. Surprisingly, TREND-L does not consistently outperform TREND. Table 3 shows an example of the predicted results of our model adapted to DDRel: the model identifies the word "father" as the trigger, which is reasonable for the target relation "Child-Parent". All the results show that TREND is capable of transferring learned knowledge to a new dataset and a new domain.

Conclusion
In this paper, we propose TREND, a multi-tasking model that predicts relation triggers to improve dialogue relation extraction. TREND is a simple, flexible, end-to-end model based on BERT with three main components: (1) an extractive-style trigger identifier that predicts start-end pointers, (2) a binary classifier for the existence of triggers, and (3) a relation predictor with attentional feature fusion. On the DRE benchmark datasets, DialogRE and DDRel, the proposed method achieves state-of-the-art performance. The experimental results also show that the proposed TREND (1) can transfer the learned knowledge from DialogRE to DDRel, extracting informative evidence without further instruction, and (2) has the potential for further gains as trigger prediction improves.