Directed Acyclic Graph Network for Conversational Emotion Recognition

The modeling of conversational context plays a vital role in emotion recognition from conversation (ERC). In this paper, we put forward a novel idea of encoding the utterances with a directed acyclic graph (DAG) to better model the intrinsic structure within a conversation, and design a directed acyclic neural network, namely DAG-ERC, to implement this idea. In an attempt to combine the strengths of conventional graph-based neural models and recurrence-based neural models, DAG-ERC provides a more intuitive way to model the information flow between long-distance conversation background and nearby context. Extensive experiments are conducted on four ERC benchmarks with state-of-the-art models employed as baselines for comparison. The empirical results demonstrate the superiority of this new model and confirm the motivation of the directed acyclic graph architecture for ERC.


Introduction
Utterance-level emotion recognition in conversation (ERC) is an emerging task that aims to identify the emotion of each utterance in a conversation. The task has recently attracted considerable attention from NLP researchers due to its potential applications in several areas, such as opinion mining in social media (Chatterjee et al., 2019) and building emotional and empathetic dialog systems (Majumder et al., 2020).
The emotion of a query utterance is likely to be influenced by many factors such as the utterances spoken by the same speaker and the surrounding conversation context. Indeed, how to model the conversational context lies at the heart of this task (Poria et al., 2019a). Empirical evidence also shows that a good representation of conversation context significantly contributes to the model performance, especially when the content of the query utterance is too short to be identified alone (Ghosal et al., 2019).
Figure 1: Conversation as a directed acyclic graph, with brown directed edges representing the information propagation between speakers and blue ones representing the information propagation within the same speaker.
Numerous efforts have been devoted to the modeling of conversation context. Basically, they can be divided into two categories: graph-based methods (Zhang et al., 2019a; Ghosal et al., 2019; Zhong et al., 2019; Ishiwatari et al., 2020; Shen et al., 2020) and recurrence-based methods (Hazarika et al., 2018a; Hazarika et al., 2018b; Ghosal et al., 2020). Graph-based methods concurrently gather information from the surrounding utterances within a certain window, while neglecting distant utterances and sequential information. Recurrence-based methods consider distant utterances and sequential information by encoding the utterances temporally. However, they tend to update the query utterance's state with only relatively limited information from the nearest utterances, which makes it difficult for them to achieve satisfactory performance.
According to the above analysis, an intuitively better way to solve ERC is to allow the advantages of graph-based methods and recurrence-based models to complement each other. This can be achieved by regarding each conversation as a directed acyclic graph (DAG). As illustrated in Figure 1, each utterance in a conversation only receives information from some previous utterances and cannot propagate information backward to itself and its predecessors through any path. This characteristic indicates that a conversation can be regarded as a DAG. Moreover, by the information flow from predecessors to successors through edges, DAG can gather information for a query utterance from both the neighboring utterances and the remote utterances, which acts like a combination of graph structure and recurrence structure. Thus, we speculate that DAG is a more appropriate and reasonable way than graph-based structure and recurrence-based structure to model the conversation context in ERC.
In this paper, we propose a method to model the conversation context in the form of DAG. Firstly, rather than simply connecting each utterance with a fixed number of its surrounding utterances to build a graph, we propose a new way to build a DAG from the conversation with constraints on speaker identity and positional relations. Secondly, inspired by DAGNN (Thost and Chen, 2021), we propose a directed acyclic graph neural network for ERC, namely DAG-ERC. Unlike the traditional graph neural networks such as GCN (Kipf and Welling, 2016) and GAT (Veličković et al., 2017) that aggregate information from the previous layer, DAG-ERC can recurrently gather information of predecessors for every utterance in a single layer, which enables the model to encode the remote context without having to stack too many layers. Besides, in order to be more applicable to the ERC task, our DAG-ERC has two improvements over DAGNN: (1) a relation-aware feature transformation to gather information based on speaker identity and (2) a contextual information unit to enhance the information of historical context. We conduct extensive experiments on four ERC benchmarks and the results show that the proposed DAG-ERC achieves comparable performance with the state-of-the-art models. Furthermore, several studies are conducted to explore the effect of the proposed DAG structure and the modules of DAG-ERC.
The contributions of this paper are threefold. First, we are the first to consider a conversation as a directed acyclic graph in the ERC task. Second, we propose a method to build a DAG from a conversation with constraints based on the speaker identity and positional relations. Third, we propose a directed acyclic graph neural network for ERC, which takes DAGNN as its backbone and has two main improvements designed specifically for ERC.
Related Work

Emotion Recognition in Conversation
Recently, several ERC datasets with textual data have been released (Busso et al., 2008; Schuller et al., 2012; Zahiri and Choi, 2017; Li et al., 2017; Chen et al., 2018; Poria et al., 2019b), arousing the widespread interest of NLP researchers. In the following paragraphs, we divide the related works into two categories according to the methods they use to model the conversation context.
Graph-based Models. DialogGCN (Ghosal et al., 2019) treats each dialog as a graph in which each utterance is connected with the surrounding utterances. RGAT (Ishiwatari et al., 2020) adds positional encodings to DialogGCN. ConGCN (Zhang et al., 2019a) regards both speakers and utterances as graph nodes and makes the whole ERC dataset a single graph. KET (Zhong et al., 2019) uses hierarchical Transformers (Vaswani et al., 2017) with external knowledge. DialogXL (Shen et al., 2020) improves XLNet with enhanced memory and dialog-aware self-attention.
Recurrence-based Models. In this category, ICON (Hazarika et al., 2018a) and CMN (Hazarika et al., 2018b) both utilize gated recurrent units (GRU) and memory networks. HiGRU (Jiao et al., 2019) contains two GRUs, one for the utterance encoder and the other for the conversation encoder. DialogRNN is a recurrence-based method that models dialog dynamics with several RNNs. COSMIC (Ghosal et al., 2020) is the latest model, which adopts a network structure very close to DialogRNN and adds external commonsense knowledge to improve performance.

Directed Acyclic Graph Neural Network
Directed acyclic graph is a special type of graph structure that can be seen in multiple areas, for example, the parsing results of source code (Allamanis et al., 2018) and logical formulas (Crouse et al., 2019). A number of neural networks that employ DAG architecture have been proposed, such as Tree-LSTM (Tai et al., 2015), DAG-RNN (Shuai et al., 2016), D-VAE (Zhang et al., 2019b), and DAGNN (Thost and Chen, 2021). DAGNN is different from the previous DAG models in the model structure. Specifically, DAGNN allows multiple layers to be stacked, while the others have only one single layer. Besides, instead of merely carrying out naive sum or element-wise product on the predecessors' representations, DAGNN conducts information aggregation using graph attention.

Problem Definition
In ERC, a conversation is defined as a sequence of utterances $\{u_1, u_2, \ldots, u_N\}$, where $N$ is the number of utterances. Each utterance $u_i$ consists of $n_i$ tokens, namely $u_i = \{w_{i1}, w_{i2}, \ldots, w_{in_i}\}$. A discrete value $y_i \in S$ is used to denote the emotion label of $u_i$, where $S$ is the set of emotion labels. The speaker identity is denoted by a function $p(\cdot)$. For example, $p(u_i) \in P$ denotes the speaker of $u_i$ and $P$ is the collection of all speaker roles in an ERC dataset. The objective of this task is to predict the emotion label $y_t$ for a given query utterance $u_t$ based on the dialog context $\{u_1, u_2, \ldots, u_N\}$ and the corresponding speaker identity.

Building a DAG from a Conversation
We design a directed acyclic graph (DAG) to model the information propagation in a conversation. A DAG is denoted by $G = (V, E, R)$. In this paper, the nodes in the DAG are the utterances in the conversation, i.e., $V = \{u_1, u_2, \ldots, u_N\}$, and the edge $(i, j, r_{ij}) \in E$ represents the information propagated from $u_i$ to $u_j$, where $r_{ij} \in R$ is the relation type of the edge. The set of relation types of edges, $R = \{0, 1\}$, contains two types of relation: 1 if the two connected utterances are spoken by the same speaker, and 0 otherwise.
We impose three constraints to decide when an utterance would propagate information to another, i.e., when two utterances are connected in the DAG:
Direction: A previous utterance can pass messages to a future utterance, but a future utterance cannot pass messages backwards.
Remote information: $\exists \tau < i$, $p(u_\tau) = p(u_i)$, $(\tau, i, r_{\tau i}) \in E$ and $\forall j < \tau$, $(j, i, r_{ji}) \notin E$. For each utterance $u_i$ except the first one, there is a previous utterance $u_\tau$ that is spoken by the same speaker as $u_i$. The information generated before $u_\tau$ is called remote information, which is relatively less important. We assume that when the speaker speaks $u_\tau$, she/he has been aware of the remote information before $u_\tau$. That is, $u_\tau$ has included the remote information and will be responsible for propagating it to $u_i$.
Local information: $\forall l$, $\tau < l < i$, $(l, i, r_{li}) \in E$. Usually, the information of the local context is important. Consider $u_\tau$ and $u_i$ defined in the second constraint. We assume that every utterance $u_l$ in between $u_\tau$ and $u_i$ contains local information, and that these utterances propagate the local information to $u_i$.
Algorithm 1: Building a DAG from a Conversation. Input: the dialog $\{u_1, u_2, \ldots, u_N\}$, speaker identity $p(\cdot)$, hyper-parameter $\omega$.
The first constraint ensures that the conversation graph is a DAG, and the second and third constraints indicate that $u_\tau$ is the cut-off point of remote and local information. We regard $u_\tau$ as the $\omega$-th latest utterance spoken by $p(u_i)$ before $u_i$, where $\omega$ is a hyper-parameter. Then for each utterance $u_l$ in between $u_\tau$ and $u_i$, we make a directed edge from $u_l$ to $u_i$. We show the above process of building a DAG in Algorithm 1.
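The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `build_dag`, the 0-based indexing, and the representation of speakers as a list of strings are all hypothetical choices. For each utterance, the sketch walks backwards, adds an edge from every earlier utterance it passes, and stops once it has met the ω-th latest utterance by the same speaker. If the speaker has fewer than ω earlier utterances, it connects to all previous utterances.

```python
from typing import List, Tuple

# (source index, target index, relation type); relation 1 = same speaker, 0 = otherwise
Edge = Tuple[int, int, int]


def build_dag(speakers: List[str], omega: int = 1) -> List[Edge]:
    """Build DAG edges for one conversation (hypothetical helper, 0-based indices)."""
    edges: List[Edge] = []
    for i in range(1, len(speakers)):
        same_speaker_seen = 0
        # Walk backwards from the utterance right before u_i.
        for j in range(i - 1, -1, -1):
            relation = 1 if speakers[j] == speakers[i] else 0
            edges.append((j, i, relation))
            if relation == 1:
                same_speaker_seen += 1
                if same_speaker_seen == omega:
                    break  # u_j plays the role of the cut-off point u_tau
    return edges


if __name__ == "__main__":
    for edge in build_dag(["A", "B", "C", "A", "B"], omega=1):
        print(edge)
```

Every edge goes from an earlier index to a later one, so the result is acyclic by construction. In a strictly alternating two-speaker dialog, each sufficiently late utterance ends up with 2ω predecessors.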
An example of the DAG is shown in Figure 2. In general, our DAG has two main advantages over the graph structures developed in previous works (Ghosal et al., 2019; Ishiwatari et al., 2020): First, our DAG doesn't have edges from future utterances to previous utterances, which we argue is more reasonable and realistic, as the emotion of a query utterance should not be influenced by the future utterances in practice. Second, our DAG seeks a more meaningful $u_\tau$ for each utterance, rather than simply connecting each utterance with a fixed number of surrounding utterances.
Figure 2: An example DAG built from a three-party conversation, with $\omega = 1$. The three speakers' utterances are colored red, blue and green, respectively. Solid lines represent the edges of local information, and dashed lines denote the edges of remote information.

Directed Acyclic Graph Neural Network
In this section, we introduce the proposed Directed Acyclic Graph Neural Network for ERC (DAG-ERC). The framework is shown in Figure 3.

Utterance Feature Extraction
DAG-ERC regards each utterance as a graph node, the feature of which can be extracted by a pre-trained Transformer-based language model. Following the convention, the pre-trained language model is first fine-tuned on each ERC dataset, and its parameters are then frozen while training DAG-ERC. Following Ghosal et al. (2020), we employ RoBERTa-Large (Liu et al., 2019), which has the same architecture as BERT-Large (Devlin et al., 2018), as our feature extractor. More specifically, for each utterance $u_i$, we prepend a special token [CLS] to its tokens, so that the input takes the form $\{\text{[CLS]}, w_{i1}, w_{i2}, \ldots, w_{in_i}\}$. Then, we use the pooled embedding of [CLS] at the last layer as the feature representation of $u_i$.

GNN, RNN and DAGNN
Before introducing the DAG-ERC layers in detail, we first briefly describe graph-based models, recurrence-based models and directed acyclic graph models to help understand their differences.
For each node at each layer, graph-based models (GNN) aggregate the information of its neighboring nodes at the previous layer as follows:
$$H_i^l = f\big(\mathrm{Aggregate}(\{H_j^{l-1} \mid j \in N_i\}),\; H_i^{l-1}\big), \tag{1}$$
where $H_i^l$ denotes the hidden state of the $i$-th node at the $l$-th layer, $f(\cdot)$ is the information processing function, $\mathrm{Aggregate}(\cdot)$ is the information aggregation function to gather information from neighboring nodes, and $N_i$ denotes the neighbours of the $i$-th node.
Recurrence-based models (RNN) allow information to propagate temporally at the same layer, while the $i$-th node only receives information from the $(i-1)$-th node:
$$H_i^l = f\big(H_{i-1}^l,\; H_i^{l-1}\big). \tag{2}$$
Directed acyclic graph models (DAGNN) work like a combination of GNN and RNN. They aggregate information for each node in temporal order, and allow all nodes to gather information from neighbors and update their states at the same layer:
$$H_i^l = f\big(\mathrm{Aggregate}(\{H_j^l \mid j \in N_i\}),\; H_i^{l-1}\big). \tag{3}$$
The strength of applying DAGNN to ERC is relatively apparent: by allowing information to propagate temporally at the same layer, DAGNN can get access to distant utterances and model the information flow throughout the whole conversation, which is hardly possible for GNN. Besides, DAGNN gathers information from several neighboring utterances, which is more appealing than RNN, as the latter only receives information from the $(i-1)$-th utterance.
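To make the contrast between the three update patterns concrete, the toy sketch below runs one layer of each on scalar node states. Everything in it is an illustrative assumption: the choice of `f` (own state plus half the aggregated value), mean aggregation, and a neighbourhood made of the two preceding nodes. Only the first node carries a non-zero signal, so the last node's state shows how far information travels within a single layer.

```python
from typing import Dict, List


def mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def f(aggregated: float, own: float) -> float:
    # Hypothetical processing function: keep own state, add half the aggregate.
    return own + 0.5 * aggregated


def gnn_layer(prev: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    # Neighbours are read from the PREVIOUS layer.
    return [f(mean([prev[j] for j in neighbors[i]]), prev[i]) for i in range(len(prev))]


def rnn_layer(prev: List[float]) -> List[float]:
    # Node i reads only node i-1, from the CURRENT layer.
    new: List[float] = []
    for i in range(len(prev)):
        left = new[i - 1] if i > 0 else 0.0
        new.append(f(left, prev[i]))
    return new


def dagnn_layer(prev: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    # Predecessors are read from the CURRENT layer, in temporal order.
    new: List[float] = []
    for i in range(len(prev)):
        new.append(f(mean([new[j] for j in neighbors[i]]), prev[i]))
    return new


if __name__ == "__main__":
    states = [1.0, 0.0, 0.0, 0.0, 0.0]
    preds = {i: [j for j in (i - 1, i - 2) if j >= 0] for i in range(len(states))}
    print("GNN  :", gnn_layer(states, preds))
    print("RNN  :", rnn_layer(states))
    print("DAGNN:", dagnn_layer(states, preds))
```

With this setup the signal from the first node does not reach the last node after one GNN-style layer, while it does after one recurrent or DAG-style layer.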

DAG-ERC Layers
Our proposed DAG-ERC is primarily inspired by DAGNN (Thost and Chen, 2021), with novel improvements specially made for emotion recognition in conversation. At each layer l of DAG-ERC, due to the temporal information flow, the hidden state of utterances should be computed recurrently from the first utterance to the last one.
For each utterance $u_i$, the attention weights between $u_i$ and its predecessors are calculated by using $u_i$'s hidden state at the $(l-1)$-th layer to attend to the predecessors' hidden states at the $l$-th layer:
$$\alpha_{ij}^l = \mathrm{Softmax}_{j \in N_i}\big(W_\alpha^l \, [H_j^l \,\Vert\, H_i^{l-1}]\big), \tag{4}$$
where $W_\alpha^l$ are trainable parameters and $\Vert$ denotes the concatenation operation.
The information aggregation operation in DAG-ERC is different from that in DAGNN. Instead of merely gathering information according to the attention weights, inspired by R-GCN (Schlichtkrull et al., 2018), we apply a relation-aware feature transformation to make full use of the relational type of edges:
$$M_i^l = \sum_{j \in N_i} \alpha_{ij}^l \, W_{r_{ij}}^l H_j^l, \tag{5}$$
where $W_{r_{ij}}^l \in \{W_0^l, W_1^l\}$ are trainable parameters for the relation-aware transformation.
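The attention step and the relation-aware aggregation described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the helper names (`softmax`, `matvec`, `aggregate`), the use of a single weight vector `w_alpha` that scores the concatenation of a predecessor state and the query state, and the list-of-lists matrices are all assumptions made for readability.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight: Matrix, vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def aggregate(
    h_prev_i: Vector,            # state of u_i at layer l-1
    pred_states: List[Vector],   # states of the predecessors at layer l
    pred_relations: List[int],   # 1 = same speaker as u_i, 0 = otherwise
    w_alpha: Vector,             # attention weights over [H_j ; H_i]
    w_rel: List[Matrix],         # w_rel[0] and w_rel[1], one matrix per relation
) -> Tuple[List[float], Vector]:
    # Score each predecessor against u_i, then normalise over predecessors.
    scores = [
        sum(w * x for w, x in zip(w_alpha, h_j + h_prev_i)) for h_j in pred_states
    ]
    alphas = softmax(scores)
    # Weighted sum of relation-specific transformations of predecessor states.
    message = [0.0] * len(w_rel[0])
    for alpha, h_j, rel in zip(alphas, pred_states, pred_relations):
        transformed = matvec(w_rel[rel], h_j)
        message = [m + alpha * t for m, t in zip(message, transformed)]
    return alphas, message


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    swap = [[0.0, 1.0], [1.0, 0.0]]
    alphas, m = aggregate(
        [0.5, -0.5], [[1.0, 2.0], [3.0, 4.0]], [0, 1], [0.1, 0.2, 0.3, 0.4], [identity, swap]
    )
    print(alphas, m)
```

Keeping one matrix per relation type means a predecessor spoken by the same speaker is transformed differently from one spoken by someone else before the weighted sum is taken.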
After the aggregated information $M_i^l$ is calculated, we make it interact with $u_i$'s hidden state at the previous layer, $H_i^{l-1}$, to obtain the final hidden state of $u_i$ at the current layer. In DAGNN, the final hidden state is obtained by allowing $M_i^l$ to control the information propagation of $H_i^{l-1}$ to the $l$-th layer with a gated recurrent unit (GRU):
$$\tilde{H}_i^l = \mathrm{GRU}_H^l\big(H_i^{l-1},\; M_i^l\big), \tag{6}$$
where $H_i^{l-1}$, $M_i^l$, and $\tilde{H}_i^l$ are the input, hidden state and output of the GRU, respectively.
We refer to the process in Equation 6 as the nodal information unit, because it focuses on the node information propagating from the past layer to the current layer. The nodal information unit may be suitable for the tasks that DAGNN was originally designed to solve. However, we find that using only the nodal information unit is not enough for ERC, especially when the emotion of the query utterance $u_i$ should be derived from its context. The reason is that in DAGNN, the information of context $M_i^l$ is only used to control the propagation of $u_i$'s hidden state, and under this circumstance, the information of context is not fully leveraged. Therefore, we design another GRU called the contextual information unit to model the information flow of historical context through a single layer. In the contextual information unit, the roles of $H_i^{l-1}$ and $M_i^l$ in the GRU are reversed, i.e., $H_i^{l-1}$ controls the propagation of $M_i^l$:
$$C_i^l = \mathrm{GRU}_M^l\big(M_i^l,\; H_i^{l-1}\big). \tag{7}$$
The representation of $u_i$ at the $l$-th layer is the sum of the nodal information unit's output $\tilde{H}_i^l$ and $C_i^l$:
$$H_i^l = \tilde{H}_i^l + C_i^l. \tag{8}$$
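The role reversal between the two units can be illustrated with a deliberately simplified GRU cell. The sketch below is an assumption-laden toy, not the model's actual parameterisation: `ToyGRU` is a hypothetical element-wise cell with one scalar weight per gate (a real GRU uses weight matrices and biases), and `combine` is a hypothetical helper name. What it does show is the wiring: the nodal unit takes the previous-layer state as input and the aggregated context as hidden state, the contextual unit swaps them, and the two outputs are summed.

```python
import math
from dataclasses import dataclass
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class ToyGRU:
    """Element-wise GRU cell with scalar gate weights (illustrative only)."""

    wz: float
    uz: float
    wr: float
    ur: float
    wn: float
    un: float

    def __call__(self, x: List[float], h: List[float]) -> List[float]:
        out = []
        for xi, hi in zip(x, h):
            z = sigmoid(self.wz * xi + self.uz * hi)          # update gate
            r = sigmoid(self.wr * xi + self.ur * hi)          # reset gate
            n = math.tanh(self.wn * xi + r * self.un * hi)    # candidate state
            out.append((1.0 - z) * n + z * hi)
        return out


def combine(h_prev: List[float], m: List[float], gru_h: ToyGRU, gru_c: ToyGRU) -> List[float]:
    nodal = gru_h(h_prev, m)       # input: previous-layer state, hidden: context
    contextual = gru_c(m, h_prev)  # roles reversed
    return [a + b for a, b in zip(nodal, contextual)]


if __name__ == "__main__":
    unit_h = ToyGRU(0.3, -0.2, 0.1, 0.4, 0.5, -0.1)
    unit_c = ToyGRU(-0.1, 0.2, 0.3, -0.4, 0.2, 0.6)
    print(combine([2.0, 4.0], [1.0, -2.0], unit_h, unit_c))
```

The two units hold separate parameters in this sketch, matching the description of the contextual information unit as another GRU.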

Training and Prediction
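We take the concatenation of $u_i$'s hidden states at all DAG-ERC layers as the final representation of $u_i$, and pass it through a feed-forward neural network to get the predicted emotion:
$$H_i = \big\Vert_{l} H_i^l, \qquad P_i = \mathrm{Softmax}\big(\mathrm{FFN}(H_i)\big), \qquad \hat{y}_i = \arg\max_{k} P_i[k],$$
where the concatenation runs over all layers $l$ and $\mathrm{FFN}(\cdot)$ denotes the feed-forward network. For the training of DAG-ERC, we employ the standard cross-entropy loss as the objective function:
$$\mathcal{L}(\theta) = -\sum_{i=1}^{M} \sum_{t=1}^{N_i} \log P_{i,t}[y_{i,t}],$$
where $M$ is the number of training conversations, $N_i$ is the number of utterances in the $i$-th conversation, $P_{i,t}$ is the predicted distribution for the $t$-th utterance of the $i$-th conversation, $y_{i,t}$ is the ground truth label, and $\theta$ is the collection of trainable parameters of DAG-ERC.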
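As a small worked example of the training objective, the sketch below computes a cross-entropy loss over a batch of conversations and picks the predicted label by arg-max. It assumes the loss is summed (not averaged) over all utterances of all conversations, and that each prediction is already a probability distribution over emotion classes; the function names `predict` and `cross_entropy` are hypothetical.

```python
import math
from typing import List

Distribution = List[float]


def predict(probs: Distribution) -> int:
    """Index of the most probable emotion class."""
    return max(range(len(probs)), key=probs.__getitem__)


def cross_entropy(
    batch_probs: List[List[Distribution]],  # [conversation][utterance][class]
    batch_labels: List[List[int]],          # [conversation][utterance]
) -> float:
    """Summed negative log-likelihood of the gold labels (assumed reduction)."""
    loss = 0.0
    for conv_probs, conv_labels in zip(batch_probs, batch_labels):
        for probs, label in zip(conv_probs, conv_labels):
            loss -= math.log(probs[label])
    return loss


if __name__ == "__main__":
    probs = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [[0.3, 0.3, 0.4]]]
    labels = [[0, 1], [2]]
    print(cross_entropy(probs, labels))
    print([[predict(p) for p in conv] for conv in probs])
```

Frameworks commonly compute the same quantity from unnormalised logits for numerical stability; the sketch works on probabilities only to stay close to the description above.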

Datasets
We evaluate DAG-ERC on four ERC datasets. Their statistics are shown in Table 1.
IEMOCAP (Busso et al., 2008): A multimodal ERC dataset. Each conversation in IEMOCAP comes from a script-based performance by two actors. Models are evaluated on the samples with 6 types of emotion, namely neutral, happiness, sadness, anger, frustrated, and excited. Since this dataset has no validation set, we follow Shen et al. (2020) and use the last 20 dialogues in the training set for validation.
MELD (Poria et al., 2019b): A multimodal ERC dataset collected from the TV show Friends. There are 7 emotion labels: neutral, happiness, surprise, sadness, anger, disgust, and fear.
DailyDialog (Li et al., 2017): Human-written dialogs collected from communications of English learners. 7 emotion labels are included: neutral, happiness, surprise, sadness, anger, disgust, and fear. Since it has no speaker information, we consider utterance turns as speaker turns by default.
EmoryNLP (Zahiri and Choi, 2017): TV show scripts collected from Friends, which differ from MELD in the choice of scenes and emotion labels. The emotion labels of this dataset include neutral, sad, mad, scared, powerful, peaceful, and joyful.
We utilize only the textual modality of the above datasets for the experiments. For evaluation metrics, we follow Ishiwatari et al. (2020) and Shen et al. (2020) and choose micro-averaged F1 excluding the majority class (neutral) for DailyDialog and weighted-average F1 for the other datasets.
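The two evaluation metrics mentioned above can be sketched without any external library. The functions below are illustrative and their names are hypothetical. The micro-averaged variant pools true positives, false positives and false negatives over all non-neutral classes, and the weighted variant averages per-class F1 scores using each class's share of the gold labels as its weight. Established toolkits expose equivalent options, and this sketch is only meant to show what is being counted.

```python
from collections import Counter
from typing import List


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1_excluding(gold: List[str], pred: List[str], excluded: str = "neutral") -> float:
    """Micro-averaged F1 pooled over every class except `excluded`."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g != excluded:
                tp += 1
        else:
            if p != excluded:
                fp += 1  # predicted a counted class wrongly
            if g != excluded:
                fn += 1  # missed a counted class
    return f1_from_counts(tp, fp, fn)


def weighted_f1(gold: List[str], pred: List[str]) -> float:
    """Per-class F1 averaged with weights proportional to gold support."""
    support = Counter(gold)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        total += count * f1_from_counts(tp, fp, fn)
    return total / len(gold)


if __name__ == "__main__":
    gold = ["neutral", "happy", "sad", "neutral", "happy"]
    pred = ["neutral", "happy", "neutral", "sad", "sad"]
    print(micro_f1_excluding(gold, pred), weighted_f1(gold, pred))
```

Excluding the majority class keeps a model that predicts neutral everywhere from scoring well on a dataset dominated by neutral utterances.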

Overall Performance
The overall results of all the compared methods on the four datasets are reported in Table 2. We can note from the results that our proposed DAG-ERC achieves competitive performances across the four datasets and reaches a new state of the art on the IEMOCAP, DailyDialog and EmoryNLP datasets.
As shown in the table, when the feature extracting method is the same, graph-based models generally outperform recurrence-based models on IEMOCAP, DailyDialog, and EmoryNLP. This phenomenon indicates that recurrence-based models cannot encode the context as effectively as graph-based models, especially for the more important local context. What's more, we see a significant improvement of DAG-ERC over the graph-based models on IEMOCAP, which demonstrates DAG-ERC's superior ability to capture remote information given that the dialogs in IEMOCAP are much longer (almost 70 utterances per dialog). On MELD, however, we observe that neither graph-based models nor our DAG-ERC outperforms the recurrence-based models. After going through the data, we find that due to the data collection method (collected from TV shows), sometimes two consecutive utterances in MELD are not coherent. Under this circumstance, graph-based models' advantage in encoding context is not that important.
Besides, the graph-based models see considerable improvements when implemented with the powerful feature extractor RoBERTa. In spite of this, our DAG-ERC consistently outperforms these improved graph-based models and DAGNN, confirming the superiority of the DAG structure and the effectiveness of the improvements we make to build DAG-ERC upon DAGNN.

Variants of DAG Structure
In this section, we investigate how the structure of the DAG affects DAG-ERC's performance by applying different DAG structures to DAG-ERC. In addition to our proposed structure, we further define three kinds of DAG structure: (1) sequence, in which utterances are connected one by one; (2) DAG with single local information, in which each utterance only receives local information from its nearest neighbor, and the remote information remains the same as in our DAG; (3) common DAG, in which each utterance is connected with κ previous utterances. Note that if there are only two speakers taking turns to speak in a dialog, then our DAG is equivalent to the common DAG with κ = 2ω, making the comparison less meaningful. Therefore, we conduct the experiment on EmoryNLP, where there are usually multiple speakers in one dialog, and the speakers speak in arbitrary order. The test performances are reported in Table 3, together with the average number of each utterance's predecessors.
Several instructive observations can be made from the experimental results. Firstly, the performance of DAG-ERC drops significantly when equipped with the sequence structure. Secondly, our proposed DAG structure has the highest performance among the DAG structures. Considering our DAG with ω = 2 and the common DAG with κ = 6, with very close numbers of predecessors, our DAG still outperforms the common DAG by a certain margin. This indicates that the constraints based on speaker identity and positional relation are effective inductive biases, and the structure of our DAG is more suitable for the ERC task than rigidly connecting each utterance with a fixed number of predecessors. Finally, we find that increasing the value of ω may not contribute to the performance of our DAG, and ω = 1 tends to be enough.

Ablation Study
To study the impact of the modules in DAG-ERC, we evaluate DAG-ERC by removing relation-aware feature transformation, the nodal information unit, and the contextual information unit individually. The results are shown in Table 4.
As shown in the table, the effect of removing the relation-aware feature transformation differs across datasets. Note that dialogs in IEMOCAP and DailyDialog involve two speakers, whereas there are usually more than two speakers in dialogs of MELD and EmoryNLP. Therefore, we can infer that the relation of whether two utterances have the same speaker is sufficient for two-speaker dialogs, but falls short in the multi-speaker setting.
Table 4: Results of ablation study on the four datasets, with rel-trans, H, and C denoting relation-aware feature transformation, nodal information unit, and contextual information unit, respectively.
Moreover, we find that on each dataset, the performance drop caused by ablating the nodal information unit is similar to that caused by ablating the contextual information unit, and none of these drops is critical. This implies that either the nodal information unit or the contextual information unit alone is effective for the ERC task, while combining the two yields further performance improvement.

Number of DAG-ERC Layers
According to the model structure introduced in Section 3.3.2, the only way for GNNs to receive information from a remote utterance is to stack many GNN layers. However, it is well known that stacking too many GNN layers might cause performance degradation due to over-smoothing (Kipf and Welling, 2016). We investigate whether the same phenomenon happens when stacking many DAG-ERC layers. We conduct an experiment on IEMOCAP and plot the test results for different numbers of layers in Figure 4, with RGAT-RoBERTa and DAGNN as baselines. As illustrated in the figure, RGAT suffers a significant performance degradation once the number of layers exceeds 6. For DAGNN and DAG-ERC, in contrast, performance fluctuates within a relatively narrow range as the number of layers changes, indicating that over-smoothing tends not to happen in directed acyclic graph networks.

Error Study
After going through the prediction results on the four datasets, we find that our DAG-ERC does not distinguish well between similar emotions, such as frustrated vs anger, happiness vs excited, scared vs mad, and joyful vs peaceful. This kind of mistake is also reported by Ghosal et al. (2019). Besides, we find that DAG-ERC tends to misclassify samples of other emotions as neutral on MELD, DailyDialog and EmoryNLP, because neutral samples form the majority in these datasets. We also look closely into the emotional shift issue, which means the emotions of two consecutive utterances from the same speaker are different. Existing ERC models generally work poorly on emotional shift. As shown in Table 5, our DAG-ERC also fails to perform better on samples with emotional shift than on those without it, though the performance is still better than that of previous models. For example, the accuracy of DAG-ERC in the case of emotional shift is 57.98% on the IEMOCAP dataset, which is higher than the 52.5% achieved by DialogueRNN and the 55% achieved by DialogXL (Shen et al., 2020).
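The emotional shift analysis can be reproduced on any set of predictions with a short script. The sketch below is an assumption-based illustration: it treats an utterance as an emotional-shift sample when its gold label differs from the gold label of the same speaker's previous utterance, skips each speaker's first utterance, and reports accuracy separately for the two groups. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def shift_flags(speakers: List[str], gold: List[str]) -> List[Optional[bool]]:
    """True = shift, False = no shift, None = speaker's first utterance."""
    last_label: Dict[str, str] = {}
    flags: List[Optional[bool]] = []
    for speaker, label in zip(speakers, gold):
        flags.append(None if speaker not in last_label else last_label[speaker] != label)
        last_label[speaker] = label
    return flags


def shift_accuracy(
    speakers: List[str], gold: List[str], pred: List[str]
) -> Dict[str, Optional[float]]:
    flags = shift_flags(speakers, gold)
    result: Dict[str, Optional[float]] = {}
    for name, wanted in (("shift", True), ("no_shift", False)):
        idx = [i for i, flag in enumerate(flags) if flag is wanted]
        correct = sum(1 for i in idx if gold[i] == pred[i])
        result[name] = correct / len(idx) if idx else None
    return result


if __name__ == "__main__":
    speakers = ["A", "B", "A", "B", "A"]
    gold = ["happy", "sad", "happy", "angry", "sad"]
    pred = ["happy", "sad", "happy", "sad", "sad"]
    print(shift_accuracy(speakers, gold, pred))
```

Running this per conversation and pooling the counts across a test set would give the kind of split accuracy discussed above. The exact protocol behind the reported numbers is not specified here, so the figures need not match.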

Conclusion
In this paper, we presented a new idea of modeling conversation context with a directed acyclic graph (DAG) and proposed a directed acyclic graph neural network, namely DAG-ERC, for emotion recognition in conversation (ERC). Extensive experiments were conducted and the results show that the proposed DAG-ERC achieves comparable performance with the baselines. Moreover, by comprehensive evaluations and ablation study, we confirmed the superiority of our DAG-ERC and the impact of its modules. Several conclusions can be drawn from the empirical results. First, the DAG structures built from conversations do affect the performance of DAG-ERC, and with the constraints on speaker identity and positional relation, the proposed DAG structure outperforms its variants. Second, the widely utilized graph relation type of whether two utterances have the same speaker is insufficient for multi-speaker conversations. Third, the directed acyclic graph network does not suffer from over-smoothing as easily as GNNs when the number of layers increases. Finally, many of the errors made by DAG-ERC can be accounted for by similar emotions, neutral samples and emotional shift. These causes have been partly mentioned in previous works but have yet to be resolved, and they are worth further investigation in future work.