Graph Based Network with Contextualized Representations of Turns in Dialogue

Dialogue-based relation extraction (RE) aims to extract relation(s) between two arguments that appear in a dialogue. Because dialogues have the characteristics of high personal pronoun occurrences and low information density, and since most relational facts in dialogues are not supported by any single sentence, dialogue-based relation extraction requires a comprehensive understanding of dialogue. In this paper, we propose the TUrn COntext awaRE Graph Convolutional Network (TUCORE-GCN) modeled by paying attention to the way people understand dialogues. In addition, we propose a novel approach which treats the task of emotion recognition in conversations (ERC) as a dialogue-based RE. Experiments on a dialogue-based RE dataset and three ERC datasets demonstrate that our model is very effective in various dialogue-based natural language understanding tasks. In these experiments, TUCORE-GCN outperforms the state-of-the-art models on most of the benchmark datasets. Our code is available at https://github.com/BlackNoodle/TUCORE-GCN.


Introduction
The task of relation extraction (RE) aims to identify semantic relations between arguments from a text, such as a sentence, a document, or even a dialogue. However, since a large number of relational facts are expressed in multiple sentences, sentence-level RE suffers from inevitable restrictions in practice (Yao et al., 2019). Therefore, cross-sentence RE, which aims to identify relations between two arguments that are not mentioned in the same sentence or relations that cannot be supported by any single sentence, is an essential step in building knowledge bases from large-scale corpora automatically (Ji et al., 2010; Swampillai and Stevenson, 2010; Surdeanu, 2013). In this respect, because dialogues readily exhibit cross-sentence relations, extracting relations from dialogues is necessary.

S1: Hey Pheebs.
S2: Hey!
S1: Any sign of your brother?
S2: No, but he's always late.
S1: I thought you only met him once?
S2: Yeah, I did. I think it sounds y'know big sistery, y'know, 'Frank's always late.'
S1: Well relax, he'll be here.
Subject: Frank, Object: S2, Relation: per:siblings
Subject: S2, Object: Frank, Relation: per:siblings
Subject: S2, Object: Pheebs, Relation: per:alternate_names
Table 1: An example dialogue and its desired relations in DialogRE. S1, S2: anonymized speaker of each utterance.
To support the prediction of relation(s) between two arguments that appear within a dialogue, DialogRE, a human-annotated dialogue-based RE dataset, was recently proposed. Table 1 shows an example from DialogRE. Because conversational texts such as DialogRE have a higher personal pronoun frequency (Biber, 1988) and a lower information density (Wang and Liu, 2011) than formal written texts, most relational triples require reasoning over multiple sentences in a dialogue. Indeed, 65.9% of relational triples in DialogRE involve arguments that never appear in the same turn. Therefore, multi-turn information plays an important role in dialogue-based RE.
Effective relation extraction from dialogue poses several major challenges, which we identify by considering how people understand dialogue in practice. First, a dialogue has speakers, and who speaks each utterance matters, because the subject and object of a relational triple depend on who speaks which utterance. For example, in Table 1, if S3 rather than S2 answered "Hey!" after "Hey Pheebs.", the relational triple (S2, per:alternate_names, Pheebs) would change to (S3, per:alternate_names, Pheebs). Second, understanding the meaning of a turn requires knowing the meaning of the surrounding turns. For example, "No, but he's always late." in Table 1 does not by itself reveal who is always late; only the previous turn shows that it is S2's brother. Third, a dialogue consists of several sequential turns, and the arguments may appear in different turns. Consequently, it is important to grasp multi-turn information in order to capture the relation between the two arguments, which can be done by exploiting the sequential characteristics of dialogues. We aim to tackle these challenges to better extract relations from dialogues.
In this paper, we propose the TUrn COntext awaRE Graph Convolutional Network (TUCORE-GCN) for dialogue-based RE. It is designed to tackle the aforementioned challenges. TUCORE-GCN encodes the input sequence so that it reflects speaker information in the dialogue by applying BERT_s and the speaker embedding of SA-BERT (Gu et al., 2020). Then, to better extract the representation of each turn from the encoded input sequence, Masked Multi-Head Self-Attention (Vaswani et al., 2017) is applied using a surrounding turn mask. Next, TUCORE-GCN constructs a heterogeneous dialogue graph to capture the relational information between arguments in the dialogue. The graph consists of four types of nodes (dialogue node, turn node, subject node, and object node) and three types of edges (speaker edge, dialogue edge, and argument edge). Because the turn nodes are sequential, we obtain a surrounding turn-aware representation for each node by applying a bidirectional LSTM (BiLSTM) (Schuster and Paliwal, 1997) to the turn nodes and a Graph Convolutional Network (Kipf and Welling, 2017) to the heterogeneous dialogue graph. Finally, we classify the relations between arguments with the obtained features.
The task of emotion recognition in conversations (ERC) aims to identify the emotion of each utterance in a dialogue. ERC is a challenging task that has recently gained popularity due to its potential applications. It can be used to analyze user behaviors (Lee and Hong, 2016) and detect fake news (Guo et al., 2019). Table 2 shows an example from EmoryNLP (Zahiri and Choi, 2018), a dataset widely used in the ERC task. We propose a novel approach that treats the ERC task as dialogue-based RE. If we define an emotion relation as holding when the subject (the speaker) says the object (the utterance) with a particular emotion (e.g., joyful, neutral, scared), the emotion of each utterance in the dialogue can be seen as a triple (speaker of utterance, emotion, utterance), as shown in Table 3. To the best of our knowledge, this approach has not been introduced in previous studies.

Table 3: An example of converting the example in Table 2 to the DialogRE format in order to treat the ERC task as dialogue-based RE. S1, S2, S3, S4: anonymized speaker of each utterance.
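The conversion of ERC annotations into relational triples can be sketched as follows. This is only an illustration of the (speaker of utterance, emotion, utterance) idea: the function name, the dictionary keys, and the emotion labels attached to the example utterances are hypothetical and are not taken from any dataset or from the released code.

```python
def erc_to_re_triples(dialogue):
    """Turn (speaker, utterance, emotion) records into RE-style triples.

    Speakers are anonymized as S1, S2, ... in order of first appearance.
    The output keys ("subject", "object", "relation") are illustrative.
    """
    speaker_ids = {}
    triples = []
    for speaker, utterance, emotion in dialogue:
        # Assign the next anonymized ID the first time a speaker is seen.
        anon = speaker_ids.setdefault(speaker, f"S{len(speaker_ids) + 1}")
        triples.append({"subject": anon, "object": utterance, "relation": emotion})
    return triples


# Hypothetical emotion labels, for illustration only.
example = [
    ("A", "Hey Pheebs.", "joyful"),
    ("B", "Hey!", "joyful"),
    ("A", "Any sign of your brother?", "neutral"),
]
triples = erc_to_re_triples(example)
```

Each utterance yields exactly one triple, so an ERC dataset with one label per utterance maps to an RE dataset in which every argument pair has a single relation.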
In summary, our main contributions are as follows:
• We propose a novel method, TUrn COntext awaRE Graph Convolutional Network (TUCORE-GCN), to better cope with the dialogue-based RE task.
• We introduce a surrounding turn mask to better capture the representation of the turns.
• We introduce a heterogeneous dialogue graph to model the interaction among elements (e.g., speakers, turns, arguments) across the dialogue and propose a GCN mechanism combined with BiLSTM.
• We propose a novel approach that treats the ERC task as dialogue-based RE.
Related Work

Dialogue-Based Relation Extraction
Relation extraction has been studied extensively over the past few years, and many approaches have achieved remarkable success. Most previous approaches focused on sentence-level RE (Zeng et al., 2014; Wang et al., 2016; Zhang et al., 2017; Zhu et al., 2019), but recently cross-sentence RE has been studied more because a large number of relational facts are expressed in multiple sentences in practice.
Recent work has begun to explore cross-sentence relation extraction on documents in formal genres, such as professionally written and edited news reports or well-edited websites. In document-level RE, various approaches including transformer-based methods (Tang et al., 2020; Ye et al., 2020; Wang et al., 2019) and graph-based methods (Christopoulou et al., 2019; Nan et al., 2020; Zeng et al., 2020) have been proposed. Among these, graph-based methods are widely adopted in document-level RE due to their effectiveness and strength in representing complicated syntactic and semantic relations among structured language data.
Unlike previous work, we focus on extracting relations from dialogues, which are texts with high pronoun frequencies and low information density. Only a few early works have addressed dialogue-based RE, and several dialogue-based RE approaches were introduced together with the DialogRE dataset. Among these approaches, BERT_s, a model built on BERT, shows good performance. BERT_s slightly modifies the original input sequence of BERT to take speaker information into account. However, it has the limitation that it cannot predict asymmetric inverse relations well. Our model basically follows the input sequence of BERT_s, but we design it to overcome this limitation to some extent. A more detailed explanation is given in Sec 4.1.5.
GDPNet is a graph-based approach that constructs a latent multi-view graph to capture various possible relationships among tokens and refines this graph to select important words for relation prediction. In this approach, the refined graph and the BERT-based sequence representations are concatenated for relation extraction. The graph of GDPNet is a multi-view directed graph that aims to model all possible relationships between tokens. Unlike GDPNet, we combine tokens into meaningful units to form nodes and connect the nodes with speaker edges, dialogue edges, and argument edges, so that each edge has an explicit meaning. In addition, GDPNet focuses on refining its multi-view graph to capture important words from long texts for RE, whereas we extract the relations using the features of the nodes in the graph.

Emotion Recognition in Conversation
Emotion recognition in conversation has emerged as an important problem in recent years, and many successful approaches have been proposed, including recurrence-based methods (Ghosal et al., 2020) and graph-based methods (Ghosal et al., 2019; Ishiwatari et al., 2020). For instance, DialogueRNN uses an attention mechanism to pick out the relevant utterances from the whole conversation and models the party state, global state, and emotional dynamics with several RNNs. COSMIC (Ghosal et al., 2020) adopts a network structure similar to DialogueRNN but adds external commonsense knowledge to improve performance. DialogueGCN (Ghosal et al., 2019) treats each dialogue as a graph in which each node represents an utterance and is connected to the surrounding utterances. RGAT (Ishiwatari et al., 2020) is based on DialogueGCN and adds relational position encodings that can capture speaker dependency along with sequential information. Although many of these studies have achieved remarkable success, none of them, unlike our approach, can be applied to other dialogue-based tasks as well as to ERC.

Model
TUCORE-GCN mainly consists of four modules: the encoding module (Sec 3.1), the turn attention module (Sec 3.2), the dialogue graph with sequential nodes module (Sec 3.3), and the classification module (Sec 3.4), as shown in Figure 1.

Figure 1: The overall architecture of TUCORE-GCN. First, a contextualized representation of each token is obtained by feeding the input dialogue to the context encoder. Next, Masked Multi-Head Attention using the surrounding turn mask is applied to obtain representations that enhance the meaning of each turn. Then, TUCORE-GCN constructs a dialogue graph and applies a GCN mechanism combined with BiLSTM. Finally, the classification module predicts relations using information from the previous module.

Encoding Module
We follow BERT_s in constructing the input sequence of the encoding module. Given a dialogue $d = s_1 : t_1, s_2 : t_2, \ldots, s_M : t_M$ and its associated argument pair $(a_1, a_2)$, where $s_i$ and $t_i$ denote the speaker ID and text of the $i$-th turn, respectively, and $M$ is the total number of turns, BERT_s constructs $\hat{d} = \hat{s}_1 : t_1, \hat{s}_2 : t_2, \ldots, \hat{s}_M : t_M$, where $\hat{s}_i$ is:

$$\hat{s}_i = \begin{cases} [S_1] & \text{if } s_i = a_1 \\ [S_2] & \text{if } s_i = a_2 \\ s_i & \text{otherwise} \end{cases}$$

Here $[S_1]$ and $[S_2]$ are special tokens. In addition, $\hat{a}_k$ ($k \in \{1, 2\}$) is defined to be $[S_k]$ if $\exists i \, (s_i = a_k)$, and $a_k$ otherwise. Then, we concatenate $\hat{d}$ and $(\hat{a}_1, \hat{a}_2)$ with a classification token [CLS] and a separator token [SEP] in BERT as the input sequence [CLS] $\hat{d}$ [SEP] $\hat{a}_1$ [SEP] $\hat{a}_2$ [SEP].
To model the speaker change information, following SA-BERT (Gu et al., 2020), we add additional speaker embeddings to the token representations. $E_s(\hat{s}_i)$ is added to each token representation of $\hat{s}_i : t_i$, where $E_s(\cdot)$ denotes the speaker embedding layer. All remaining token representations, which carry no speaker information, receive a separate output of the same embedding layer that is reserved for tokens without a speaker. A visual architecture of our input representation is illustrated in the Appendix.
Then, the token representations containing speaker change information are fed into an encoder to extract speaker-sensitive token representations. The encoder can be BERT or one of its variants (Conneau and Lample, 2019; Lan et al., 2020).
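The construction of the input sequence can be sketched at the string level as follows. This is a minimal illustration that assumes the BERT_s convention in which a speaker that coincides with an argument is replaced by a special token; the token spellings "[S1]" and "[S2]", the function name, and the "speaker: text" joining format are assumptions, and real code would tokenize the result with the encoder's tokenizer.

```python
def build_input_sequence(turns, arg1, arg2):
    """Build a BERT_s-style input string from (speaker, text) turns.

    A speaker equal to arg1 / arg2 is replaced by the assumed special
    token [S1] / [S2]; the arguments are replaced in the same way when
    they occur as a speaker somewhere in the dialogue.
    """
    def mark(name):
        if name == arg1:
            return "[S1]"
        if name == arg2:
            return "[S2]"
        return name

    speakers = {speaker for speaker, _ in turns}
    d_hat = " ".join(f"{mark(speaker)}: {text}" for speaker, text in turns)
    a1_hat = "[S1]" if arg1 in speakers else arg1
    a2_hat = "[S2]" if arg2 in speakers else arg2
    return f"[CLS] {d_hat} [SEP] {a1_hat} [SEP] {a2_hat} [SEP]"


sequence = build_input_sequence(
    [("Speaker 1", "Hey Pheebs."), ("Speaker 2", "Hey!")],
    arg1="Speaker 2",
    arg2="Pheebs",
)
```

In this example the first argument is a speaker, so both its turn prefix and its argument slot become "[S1]", while "Pheebs" is not a speaker and is kept verbatim.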

Turn Attention Module
To obtain a turn context-sensitive representation for each turn, we apply Masked Multi-Head Self-Attention (Vaswani et al., 2017) to the output of the encoder using the surrounding turn mask. The range of turns that a turn may attend to is called the window. The number of preceding and following turns that are regarded as surrounding turns is called the surrounding turn window size $c$, which is a hyper-parameter.
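Let $X = [x_1, x_2, x_3, \ldots, x_N]$ be the output of the encoding module, where $x_j$ is the $j$-th token representation in the output and $N$ is the number of tokens. Let $D_i$ denote the token representations corresponding to $\hat{s}_i : t_i$, so that $D = [D_1, D_2, \ldots, D_M]$ denotes the representations of the dialogue $\hat{d}$, and let $F(x_m)$ denote the turn number in which $x_m$ is included (e.g., $F(x_m) = 2$ if $x_m \in D_2$). We implement the surrounding turn mask so that each token attends only to its own turn and to the turns within the window. For two dialogue tokens $x_m$ and $x_n$:

$$\mathrm{Mask}[m, n] = \begin{cases} 1 & \text{if } |F(x_m) - F(x_n)| \le c \\ -\infty & \text{otherwise} \end{cases}$$

A visual example of the surrounding turn mask is illustrated in the Appendix.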
Then, we reinforce the representation of each turn from representations of surrounding turns.
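A surrounding turn mask over the dialogue tokens can be sketched as below. The values 1 and negative infinity follow the description of the mask in the Appendix figure; the function name and the plain nested-list representation are illustrative assumptions, and the handling of tokens outside the dialogue (such as [CLS] or the argument tokens) is deliberately left out.

```python
NEG_INF = float("-inf")


def surrounding_turn_mask(turn_ids, c):
    """Return an N x N mask for dialogue tokens.

    turn_ids[m] is the turn number of token m, and c is the surrounding
    turn window size. Entry [m][n] is 1.0 when token n lies in the same
    turn as token m or within c turns of it, and -inf otherwise.
    """
    n_tokens = len(turn_ids)
    return [
        [
            1.0 if abs(turn_ids[m] - turn_ids[n]) <= c else NEG_INF
            for n in range(n_tokens)
        ]
        for m in range(n_tokens)
    ]


# Five tokens spread over three turns, window size 1.
mask = surrounding_turn_mask([1, 1, 2, 3, 3], c=1)
```

With a window size of 1, tokens of the first turn cannot attend to the third turn, while tokens of the second turn can attend to every token in this short example.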

Dialogue Graph with Sequential Nodes Module
To model the dialogue-level information, interactions between turns and arguments, and interactions between turns, a heterogeneous dialogue graph is constructed.
We form four distinct types of nodes in the graph: dialogue node, turn node, subject node, and object node. The dialogue node is intended to contain the overall dialogue information. Turn nodes represent the information of each turn in the dialogue, and one turn node is created for every turn. The subject node and object node represent the information of each argument. In our work, the initial representation of the dialogue node is the feature corresponding to [CLS] in the output of the turn attention module. The initial representations of the $i$-th turn node, the subject node, and the object node are the averages of the token representations corresponding to $\hat{s}_i : t_i$, $\hat{a}_1$, and $\hat{a}_2$ in the output of the turn attention module, respectively.
There are three different types of edges:
• dialogue edge: All turn nodes are connected to the dialogue node with the dialogue edge so that the dialogue node learns while being aware of turn-level information.
• argument edge: To model the interaction between turns and arguments, the $i$-th turn node and the argument nodes (i.e., the subject node and the object node) are connected with the argument edge if the argument is mentioned in $\hat{s}_i : t_i$.
• speaker edge: To model the interaction among different turns of the same speaker, turn nodes uttered by the same speaker are fully connected with speaker edges.
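The three edge types above can be sketched as a small graph-construction routine. The node numbering (0 for the dialogue node, 1..M for turn nodes, then subject and object), the function name, and the use of plain substring matching to decide whether an argument is "mentioned" in a turn are assumptions made for illustration; they are not taken from the released code.

```python
def build_dialogue_graph(turns, subject, obj):
    """Build typed, undirected edge sets for a heterogeneous dialogue graph.

    turns is a list of (speaker, text). Node 0 is the dialogue node,
    nodes 1..M are turn nodes, M+1 is the subject and M+2 the object.
    Each undirected edge is stored once as a (smaller, larger) pair.
    """
    m = len(turns)
    subject_node, object_node = m + 1, m + 2
    edges = {"dialogue": set(), "argument": set(), "speaker": set()}
    for i, (speaker, text) in enumerate(turns, start=1):
        edges["dialogue"].add((0, i))
        span = f"{speaker}: {text}"
        # Assumption: an argument is "mentioned" if it is a substring.
        if subject in span:
            edges["argument"].add((i, subject_node))
        if obj in span:
            edges["argument"].add((i, object_node))
        # Fully connect turns uttered by the same speaker.
        for j in range(i + 1, m + 1):
            if turns[j - 1][0] == speaker:
                edges["speaker"].add((i, j))
    return edges


table1_turns = [
    ("S1", "Hey Pheebs."),
    ("S2", "Hey!"),
    ("S1", "Any sign of your brother?"),
    ("S2", "No, but he's always late."),
    ("S1", "I thought you only met him once?"),
    ("S2", "Yeah, I did. I think it sounds y'know big sistery, y'know, 'Frank's always late.'"),
    ("S1", "Well relax, he'll be here."),
]
graph = build_dialogue_graph(table1_turns, subject="S2", obj="Frank")
```

For the dialogue of Table 1 with subject S2 and object Frank, the subject node is linked to the three turns spoken by S2, and the object node is linked only to the turn that contains the name.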
Next, we apply a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) to aggregate the features of each node from the features of its neighbors. To inject sequential information into the turn nodes, the GCN is applied after the turn nodes pass through bidirectional LSTM (Schuster and Paliwal, 1997) layers. Given node $u$ at the $l$-th GCN layer, $h_u^{(l)}$ and $\hat{h}_u^{(l)}$ denote the representation of the node before and after injecting sequential information, respectively. $\hat{h}_u^{(l)}$ can be defined as:

$$\hat{h}_u^{(l)} = \begin{cases} \dot{h}_{T_i}^{(l)} & \text{if } u \text{ is the turn node } T_i \\ h_u^{(l)} & \text{otherwise} \end{cases} \qquad (4)$$

where $T_i$ represents the $i$-th turn node, $\dot{h}_{T_i}^{(l)}$ represents the turn node feature into which the sequential information of the dialogue has been injected by concatenating the BiLSTM hidden states of the two directions, and $d$ is the dimension of the node representations. Then, the graph convolution operation computes the representation of node $u$ at the next layer by aggregating the features $\hat{h}_v^{(l)}$ of its neighbors over all edge types, where $\kappa$ is the set of edge types, $N_k(u)$ denotes the neighbors of node $u$ connected by the $k$-th type of edge, and $W$ denotes the trainable weights of the layer.
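One way to write the graph convolution described above is given below. It assumes the common relational GCN form with one weight matrix $W_k^{(l)}$ and one bias $b_k^{(l)}$ per edge type and a nonlinearity $\sigma$ such as ReLU; the bias term, the choice of activation, and the placement of the indices are assumptions and are not taken from the text.

```latex
h_u^{(l+1)} = \sigma\!\left( \sum_{k \in \kappa} \sum_{v \in N_k(u)}
    \left( W_k^{(l)} \hat{h}_v^{(l)} + b_k^{(l)} \right) \right)
```

Under this reading, the BiLSTM output only affects the turn nodes through $\hat{h}_v^{(l)}$, while the dialogue, subject, and object nodes enter the sum with their unchanged features $h_v^{(l)}$.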

Classification Module
We concatenate the dialogue node, subject node, and object node to classify the relation between arguments. Furthermore, to cover features of all different abstract levels from each layer of the GCN, we concatenate the hidden states of each GCN layer as follows:

$$C = [h_d^{(0)}; \ldots; h_d^{(G)}; h_s^{(0)}; \ldots; h_s^{(G)}; h_o^{(0)}; \ldots; h_o^{(G)}]$$

where $G$ is the number of GCN layers and $d$, $s$, and $o$ denote the dialogue node, subject node, and object node, respectively. For each relation type $r$, we introduce a vector $W_r \in \mathbb{R}^{3(G+1)d}$ and obtain the probability $P_r$ of the existence of $r$ between the arguments by $P_r = \mathrm{sigmoid}(C W_r^T)$. We use the cross-entropy loss as the classification loss to train our model in an end-to-end way.
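The dimension bookkeeping of the classification step can be checked with a small sketch. It assumes that the concatenation runs over the dialogue, subject, and object nodes and, for each node, over the initial state plus the G layer outputs, which gives the 3(G+1)d entries required by W_r; the function name, the dictionary layout, and the toy numbers are hypothetical.

```python
import math


def relation_probabilities(layer_states, relation_weights):
    """Concatenate node states across layers and score each relation.

    layer_states maps "d", "s", "o" to a list of G + 1 vectors of size d
    (initial state plus one per GCN layer). relation_weights maps each
    relation name to a weight vector of length 3 * (G + 1) * d.
    """
    c = [x for node in ("d", "s", "o") for vec in layer_states[node] for x in vec]
    probs = {}
    for relation, w in relation_weights.items():
        if len(w) != len(c):
            raise ValueError("weight vector must have length 3 * (G + 1) * d")
        score = sum(ci * wi for ci, wi in zip(c, w))
        probs[relation] = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    return c, probs


# Toy setting: G = 1 GCN layer, dimension d = 2, all-zero relation weights.
states = {
    "d": [[0.1, 0.2], [0.3, 0.4]],
    "s": [[0.5, 0.6], [0.7, 0.8]],
    "o": [[0.9, 1.0], [1.1, 1.2]],
}
c, probs = relation_probabilities(states, {"per:siblings": [0.0] * 12})
```

Because each relation has its own sigmoid output, several relations can be predicted for the same argument pair, which matches the multi-label nature of DialogRE.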

Experiments
In this section, we report our experimental results on two tasks, dialogue-based RE and ERC. We experiment with two versions of TUCORE-GCN, TUCORE-GCN_BERT and TUCORE-GCN_RoBERTa, which are based on the uncased base model of BERT and the large model of RoBERTa, respectively. TUCORE-GCN is trained using Adam (Kingma and Ba, 2015) as the optimizer with a weight decay of 0.01. We run each experiment five times and report the average score along with the standard deviation (σ) for each metric. The full details of our training settings are provided in the Appendix.

Dataset
We evaluate our model on DialogRE, using an updated English version with a few annotation errors fixed (see footnote 1). DialogRE has 36 relation types, 1,788 dialogues, and 8,119 triples in total, not including no-relation argument pairs. We follow the standard split of the dataset.

Metrics
For DialogRE, we calculate both the F1 and F1_c scores as the evaluation metrics. F1_c is an evaluation metric that supplements the standard F1. It is computed by taking part of the dialogue as input, instead of only considering the entire dialogue.

Baselines and State-of-the-Art
For a comprehensive performance evaluation, we compared our model with the following baseline and state-of-the-art methods:
BERT: The BERT baseline for dialogue-based RE, initialized with the pre-trained parameters of BERT-base. Classification uses the final hidden vector corresponding to the [CLS] token.
BERT_s: A modification to the input sequence of the above BERT baseline. This modification prevents a model from overfitting to the training data and helps a model locate the start positions of relevant turns.
GDPNet: A state-of-the-art model for DialogRE. GDPNet finds indicative words from long sequences by constructing a latent multi-view graph and refining the graph. It uses the same input format as BERT_s and the pre-trained parameters of BERT-base.
RoBERTa_s: A variant of the above BERT_s that uses the pre-trained parameters of RoBERTa-large instead of those of BERT.

Footnote 1: https://dataset.org/dialogre

Results
We show the performance of TUCORE-GCN on the DialogRE dataset in Table 4 compared with other baselines.
Among the models using BERT, TUCORE-GCN_BERT outperforms all baselines by 5.3 to 7.6 F1 points and 2.9 to 7.1 F1_c points on the test set. GDPNet, the previous state-of-the-art model, achieved a large improvement in F1_c, but TUCORE-GCN achieves large improvements in both F1 and F1_c. Among the models using RoBERTa, TUCORE-GCN_RoBERTa improves F1/F1_c on the test set by 1.8/2.2 over the strong baseline RoBERTa_s. Our model can use BERT or its variants as the encoder, and TUCORE-GCN shows strong performance even when BERT-base is used as the encoder. With RoBERTa-large, it achieves state-of-the-art performance on the DialogRE dataset, with an F1 score of 73.1 and an F1_c score of 65.9. These results suggest that TUCORE-GCN is very effective in the dialogue-based RE task.

Analysis on Inverse Relations
We analyze the performance on asymmetric inverse relations and symmetric inverse relations in the dialogue-based RE task. We divide the DialogRE dev set into three groups depending on whether a relation is an asymmetric inverse relation, a symmetric inverse relation, or other. Then, we report the F1 score for each group in the Appendix. In the dialogue-based RE task, BERT makes more mistakes when predicting asymmetric inverse relations than when predicting symmetric inverse relations. Since BERT learns the representation of the tokenized input sequence through a Self-Attention mechanism, whether an argument in the input sequence is a subject or an object is not learned in detail. As a result, performance on asymmetric inverse relations, which indicate different relations when subject and object are swapped, is significantly lower than on symmetric inverse relations, which indicate the same relation even when subject and object are swapped. In contrast, TUCORE-GCN creates separate nodes for the arguments, learns the features of these nodes, and classifies relations from them, which alleviates this issue. TUCORE-GCN_BERT improves performance in all groups compared to BERT and BERT_s, especially for asymmetric inverse relations.

Table 4: Performance on DialogRE. The scores marked by "*" are based on our re-implementation, because of the data differences.

Dataset
We evaluate our model on three ERC benchmark datasets. We follow the standard split of the datasets and, as in Table 3, treat the emotion label of each utterance as the relation between the speaker and the utterance in the dialogue.
MELD (Poria et al., 2019) is a multimodal dataset collected from the TV show Friends. We use only the textual modality of this dataset. It has seven emotion labels, 2,458 dialogues, and 12,708 utterances. Each utterance is annotated with one of the seven emotion labels.
DailyDialog (Li et al., 2017) reflects the way we communicate in daily life and covers various everyday topics. It has seven emotion labels, 13,118 dialogues, and 102,979 utterances. Each utterance is annotated with one of the seven emotion labels. Since it does not have speaker information, we treat the utterances as turns of two anonymized speakers by default.

Metrics
For DailyDialog, we calculate micro-F1 excluding the neutral class, because this class forms an overwhelming majority of the labels. For MELD and EmoryNLP, we calculate weighted-F1.
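The DailyDialog metric can be sketched as follows, assuming that "excluding the neutral class" means that true positives, false positives, and false negatives are pooled only over the non-neutral labels. The function name and the toy labels are illustrative, and this is not the evaluation script used for the reported numbers.

```python
def micro_f1_excluding(gold, pred, excluded="neutral"):
    """Micro-averaged F1 pooled over every label except `excluded`."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g != excluded:
                tp += 1  # correct prediction of a counted label
        else:
            if p != excluded:
                fp += 1  # wrongly predicted a counted label
            if g != excluded:
                fn += 1  # missed a counted gold label
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


gold = ["joy", "neutral", "anger", "neutral", "joy"]
pred = ["joy", "joy", "neutral", "neutral", "anger"]
score = micro_f1_excluding(gold, pred)
```

In the toy example, the correctly predicted neutral utterance contributes nothing to the score, which is why a model cannot obtain a high value simply by predicting the majority class.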

Baselines and State-of-the-Art
For a comprehensive performance evaluation, we compared our model with the following baseline and state-of-the-art methods:
Previous methods: CNN (Kim, 2014), CNN+cLSTM (Poria et al., 2017), DialogueRNN, DialogueGCN (Ghosal et al., 2019), and RoBERTa.
RGAT (Ishiwatari et al., 2020): A model that is provided with some information reflecting relation types through relational position encodings that can capture speaker dependency and sequential information.
RoBERTa_r: The RoBERTa baseline for ERC under our proposed approach, initialized with the pre-trained parameters of RoBERTa-large. We set the input sequence of RoBERTa to [CLS] ... actions, and cause-effect relations for emotion recognition in conversation.
CESTa (Wang et al., 2020): A state-of-the-art model for DailyDialog that uses a Conditional Random Field (CRF) layer. The layer is used for sequential tagging, and it has an advantage in learning when there is emotional consistency in a conversation.
The only difference between RoBERTa and RoBERTa_r is the form of the input sequence, yet RoBERTa_r is better at solving the ERC task. Accordingly, we claim that treating ERC as dialogue-based RE is useful in practice. TUCORE-GCN_RoBERTa outperforms COSMIC, the previous state-of-the-art model for MELD and EmoryNLP, by 0.15, 1.13, and 3.43 on the test sets of MELD, EmoryNLP, and DailyDialog, respectively, and achieves state-of-the-art performance on both MELD and EmoryNLP. On the other hand, TUCORE-GCN_RoBERTa shows slightly lower performance than CESTa on the DailyDialog dataset. Adjacent utterances in a conversation tend to show similar emotions. This tendency is called emotional consistency, and the CRF layer of CESTa fits it well. Therefore, CESTa performs better on DailyDialog, where emotional consistency is pronounced, but it performs very poorly on MELD, where emotional consistency rarely appears (Wang et al., 2020). Considering these observations, our model shows consistently strong performance on MELD, EmoryNLP, and DailyDialog, which suggests that TUCORE-GCN is effective in ERC as well as in dialogue-based RE.

Ablation Study
We conduct ablation studies to evaluate the effectiveness of different modules and mechanisms in TUCORE-GCN. The results are shown in Table 6.
First, we removed the speaker embedding in the encoding module. To be specific, the encoder and input format of TUCORE-GCN_RoBERTa are then the same as those of RoBERTa_s. Without speaker embedding, the performance of TUCORE-GCN_RoBERTa drops by 0.4 F1 and 0.1 F1_c on the DialogRE test set and by 0.72 and 1.65 F1 on the MELD and EmoryNLP test sets, respectively. This drop shows that a better representation of a dialogue can be obtained by encoding speaker change information.
Next, we removed the turn attention module. To be specific, the output of the encoding module is delivered directly to the dialogue graph with sequential nodes module. Without turn attention, the performance of TUCORE-GCN_RoBERTa drops sharply by 1.1 F1 and 0.6 F1_c on the DialogRE test set and by 0.77 and 2.17 F1 on the MELD and EmoryNLP test sets, respectively. This drop shows that the turn attention module helps capture the representation of the turns and, therefore, improves dialogue-based RE and ERC performance.
Finally, we removed the turn-level BiLSTM for turn nodes in the dialogue graph with sequential nodes module. To be specific, we apply the GCN without injecting sequential information into the turn nodes. Without the turn-level BiLSTM, the performance of TUCORE-GCN_RoBERTa drops by 0.6 F1 and 0.2 F1_c on the DialogRE test set and by 0.34 and 0.89 F1 on the MELD and EmoryNLP test sets, respectively. This means that reflecting the sequential characteristics of the turn nodes when learning the graph helps to learn the features of each node and, therefore, improves dialogue-based RE and ERC performance.

Conclusion and Future Work
In this paper, we propose TUCORE-GCN for dialogue-based RE. TUCORE-GCN is designed according to the way people understand dialogues in practice in order to better cope with dialogue-based RE. In addition, we propose a way to treat the ERC task as dialogue-based RE and show its effectiveness through experiments. Experimental results on a dialogue-based RE dataset and three ERC datasets demonstrate that TUCORE-GCN outperforms existing models on most of the benchmark datasets, achieving new state-of-the-art results on DialogRE, MELD, and EmoryNLP.
Since TUCORE-GCN is modeled for the dialogue text type, we expect it to perform well in dialogue-based natural language understanding tasks. In future work, we will explore its effectiveness on other dialogue-based tasks.

... asymmetric relation group and the symmetric relation group. The F1 score difference between the two groups was 5.8, which was the smallest F1 score difference among the models that use BERT. In addition, compared with RoBERTa_s, TUCORE-GCN_RoBERTa reduced the F1 score difference between the asymmetric relation group and the symmetric relation group by 2.9. This suggests that TUCORE-GCN mitigates the limitation of BERT and its variants in predicting inverse relations.

Table 9: Performance (F1 (σ)) on the asymmetric inverse relations group, the symmetric inverse relations group, and the other relations group of DialogRE. The scores of BERT, BERT_s, and GDPNet are based on our re-implementation.

Figure 3: The surrounding turn mask of TUCORE-GCN when the surrounding turn window size is 1. For each token, the surrounding turns and its own turn correspond to 1, and the rest is −∞.