Knowledge-Interactive Network with Sentiment Polarity Intensity-Aware Multi-Task Learning for Emotion Recognition in Conversations

Emotion Recognition in Conversation (ERC) has gained much attention from the NLP community recently. Some models concentrate on leveraging commonsense knowledge or multi-task learning to help complicated emotional reasoning. However, these models neglect direct utterance-knowledge interaction. In addition, these models utilize emotion-indirect auxiliary tasks, which provide limited affective information for the ERC task. To address the above issues, we propose a Knowledge-Interactive Network with sentiment polarity intensity-aware multi-task learning, namely KI-Net, which leverages both commonsense knowledge and a sentiment lexicon to augment semantic information. Specifically, we use a self-matching module for internal utterance-knowledge interaction. Considering its correlation with the ERC task, a phrase-level Sentiment Polarity Intensity Prediction (SPIP) task is devised as an auxiliary task. Experiments show that the knowledge introduction, self-matching and SPIP modules each improve model performance on three datasets. Moreover, our KI-Net model shows a 1.04% performance improvement over the state-of-the-art model on the IEMOCAP dataset.


Introduction
Emotion recognition in conversation aims to identify the emotion of each utterance in a conversation, which requires machines to understand how emotions are expressed during conversations (Poria et al., 2019b). Research on ERC helps in creating empathetic dialogue systems (Ghosh et al., 2017), thus improving the overall human-computer interaction experience. Hence, the ERC task has a wide range of applications such as social media analysis (Li et al., 2019; Chatterjee et al., 2019) and intelligent systems like smart homes and chatbots (Young et al., 2018).

Figure 1: Illustration of a conversation where both the sentiment lexicon and commonsense knowledge aid the ERC task. Cylinders denote commonsense knowledge, and rectangles denote sentiment lexicon knowledge.
Unlike vanilla emotion recognition of sentences, context modeling for conversations is crucial for ERC models. Early Recurrent Neural Network (RNN)-based ERC works adopt memory networks to store historical conversation context (Hazarika et al., 2018b,a). Recent progress in Pre-trained Language Models (PLMs) like BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019b) has benefited many downstream tasks, such as dialogue systems (Henderson et al., 2020; Bao et al., 2020) and reading comprehension (Yang et al., 2019a; Shwartz et al., 2020). Nevertheless, Ilievski et al. (2021) indicate that PLMs lack some dimensions of knowledge, which may limit the performance of the corresponding downstream tasks. Hence, most recent PLM-based ERC works adopt hierarchical structures that obtain word-level or utterance-level representations via PLMs and then devise elaborate modules to complement the missing knowledge. Some of them explicitly combine structured commonsense knowledge with the model and form knowledge-enriched representations (Zhong et al., 2019). For knowledge that is abstract or unstructured, other models adopt multi-task learning to compensate for the missing knowledge dimensions implicitly. However, the above works do not consider internal interactions between utterance and knowledge representations when incorporating commonsense knowledge but simply concatenate them, which may negatively affect model performance, as shown in our subsequent experiments. Besides, the auxiliary tasks of most multi-task learning methods are emotion-indirect, such as topic inference and utterance-speaker verification, which do not directly contribute additional affective information. Ilievski et al. (2021) also show that there is little overlap between different knowledge sources. Intuitively, for the ERC task, complementing different dimensions of knowledge helps the reasoning process. In Figure 1, we illustrate a conversation where both commonsense knowledge and the sentiment lexicon aid emotion detection. For example, considering the keyword "divorce" in the first utterance, with "an_affair" as a possible cause, "stop_being_married" as an action, and "divide_a_family" as a result, commonsense knowledge enables the model to build a semantics-enhanced chain around "divorce". The sentiment lexicon assigns an extremely negative sentiment polarity intensity of -0.83 to "divorce", which directly instructs the model toward negative emotions. Clearly, in judging this utterance as "Frustrated", the two knowledge sources play their respective roles.
To cope with the above challenges, we propose a Knowledge-Interactive Network with sentiment polarity intensity-aware multi-task learning (KI-Net). We first adopt a context- and dependency-aware encoder for context modeling. To further enhance the word-level representations, we leverage a large-scale commonsense knowledge graph and a sentiment lexicon. Inspired by Yang et al. (2019a), knowledge representations are incorporated into word-level representations using a self-matching mechanism, allowing full internal interaction. We also introduce a phrase-level Sentiment Polarity Intensity Prediction (SPIP) task as an auxiliary task, which is expected to provide more direct instruction on emotion recognition of the target utterance.
In summary, this paper makes the following contributions:
• We try to make up for some of the missing knowledge dimensions in the PLM by applying multi-source knowledge. The subsequent ablation study shows that the introduced knowledge has a positive impact on model performance.
• For the first time on the ERC task, we discuss the necessity of explicit interactions between utterance and knowledge, guiding future work on knowledge integration.
• We adopt a new auxiliary task for ERC, namely phrase-level sentiment polarity intensity prediction. Experiments show that the SPIP task provides promising improvement for the ERC task.

Related Work
Emotion recognition in conversation has gained attention from the NLP community only in the past few years (Yeh et al., 2019) with the growing availability of public conversational data (Busso et al., 2008; Poria et al., 2019a; Li et al., 2017). The ERC task naturally requires modeling interaction between conversation participants. Considering this requirement, many works adopt RNNs to model contextual utterances in a temporal sequence, such as CMN (Hazarika et al., 2018b) and ICON (Hazarika et al., 2018a). Building on them, the attentive RNN-based model DialogueRNN is proposed to model party states and emotional dynamics. The Transformer (Vaswani et al., 2017) has also been adopted to model input sequences in many recent works (Zhong et al., 2019), leading to better results. Besides, modules such as memory networks (Wenxiang Jiao and King, 2020) and graph-based networks (Ghosal et al., 2019; Ishiwatari et al., 2020) are also introduced for representation learning to better model contextual information and utterance dependencies.
Limited by the scale of currently available high-quality datasets, some works incorporate task-related knowledge to boost model performance. Hazarika et al. (2021) and Chapuis et al. (2020) propose elaborate pre-training tasks to improve the generalization of models. Zhong et al. (2019) explicitly extract commonsense knowledge from large-scale knowledge graphs and concatenate it to word embeddings, forming knowledge-enriched representations. In addition, some works implicitly introduce knowledge through multi-task learning, such as label imbalance confusion, dialogue topic information and utterance-speaker relations.

Figure 2: Overall architecture of our model. Rep. denotes representation. (a) is a sub-graph extracted from ConceptNet with the keyword "happy", while (b) is an example provided by SenticNet with the keyword "bless".

Task Definition and Model Overview
We define the ERC task as follows. Given $\{\{X_i^j\}, Y_i\}$, where $i = 1, \dots, N$ and $j = 1, \dots, N_i$, as a collection of {utterance, emotion label} pairs in one conversation, the conversation $X$ consists of $N$ utterances, and each utterance $X_i$ consists of $N_i$ tokens, namely $X_i = \{X_i^1, \dots, X_i^{N_i}\}$, and is uttered by a participant in $P$, where $P$ is the set of conversation participants. The discrete value $Y_i \in S$ denotes the emotion label, where $S$ is the set of pre-defined emotion labels and $|S| = h_c$. The objective of the ERC task is to predict the emotion label $Y_i$ of the target utterance $X_i$ given its previous context and the mappings between $X$ and $P$. Our proposed KI-Net is illustrated in Figure 2.
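For concreteness, the following is a minimal sketch of how one conversation might be represented under this formulation; the field names and label indices are illustrative and are not taken from the paper's released code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    tokens: List[str]   # the N_i tokens of utterance X_i
    speaker: str        # participant in P who utters X_i
    label: int          # emotion label Y_i, an index into S (|S| = h_c)

# A conversation is an ordered list of utterances; the model predicts the
# label of each utterance given its preceding context and speaker mappings.
conversation: List[Utterance] = [
    Utterance(tokens=["I", "want", "a", "divorce", "."], speaker="A", label=5),
    Utterance(tokens=["What", "?"], speaker="B", label=4),
]
```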
To compensate for the PLM's missing knowledge dimensions, we design a hierarchical model whose key idea is to enhance the PLM with richly interacting and strongly correlated knowledge. Based on this idea, KI-Net is built with four major components, as depicted in Figure 2. We first use an XLNet-based encoder that computes context- and dependency-aware representations for utterances (Sec. 3.2). Then a knowledge introduction module is devised to retrieve commonsense knowledge and form graph attention-based representations (Sec. 3.3). A self-matching module is employed for utterance-knowledge interaction based on self-attention mechanisms (Sec. 3.4). We also propose the SPIP task, which introduces strongly correlated knowledge to the model, and a multi-task learning setting that combines the ERC and SPIP tasks (Sec. 3.5).

Context- and Dependency-Aware Encoder
Both historical conversational information and dependency modeling are crucial for the ERC task. Therefore, based on XLNet (Yang et al., 2019b), we use a Context- and Dependency-Aware (CDA) encoder to exploit both of the above elements by improving the original self-attention mechanism.
For time step $i$, the target utterance $X_i$ is prepended with the "[CLS]" token:
$$x_i = \{[\mathrm{CLS}], X_i^1, \dots, X_i^{N_i}\}.$$
Then $x_i$ is passed through the embedding layer:
$$h_i^0 = \mathrm{Embedding}(x_i), \qquad (1)$$
where $h_i^0 \in \mathbb{R}^{N_i \times D_h}$ and $D_h$ denotes the input dimension of XLNet-base. $h_i^0$ is regarded as the input states of the CDA encoder's first layer. $h_i^0$ is also used in the concept embedding layer of the knowledge introduction module, which we discuss in Sec. 3.3.
Besides the ordinary global self-attention, our model devises a local self-attention, which uses a limited conversational window to focus on the neighboring part of the target utterance; a speaker self-attention, which retains historical context belonging to the target speaker; and a listener self-attention, which focuses on historical context uttered by the other participants. These four types of self-attention results are combined to form the output of every layer in the CDA encoder. Following DialogXL (Shen et al., 2020), the context memory $m$ is combined with the hidden states using an utterance recurrence mechanism. Given the input $h_i^0$, the CDA encoder adopts $L$ layers of Transformer to obtain word-level representations. For convenience, we denote this process as:
$$h_i^L, m_i = \mathrm{Encoder}(h_i^0, m_{i-1}), \qquad (2)$$
where $h_i^L \in \mathbb{R}^{N_i \times D_h}$, $m_{i-1} \in \mathbb{R}^{L \times D_m \times D_h}$, and $D_m$ is a pre-defined maximum memory length. $\mathrm{Encoder}$ denotes the encoding process.
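A minimal sketch of how the four attention views could be realized as boolean masks over a shared self-attention layer. The mask construction below, the window default, and the way the four attended outputs are later combined are assumptions for illustration, not the paper's exact design.

```python
import torch

def cda_attention_masks(speakers, target_idx, window=2):
    """Boolean masks (True = may attend) over context utterances for one target.

    speakers:   speaker id per context utterance, oldest first
    target_idx: position of the target utterance in that list
    window:     local window size for the neighboring-context view (assumed value)
    """
    n = len(speakers)
    idx = torch.arange(n)
    same_speaker = torch.tensor([s == speakers[target_idx] for s in speakers])
    return {
        "global":   torch.ones(n, dtype=torch.bool),        # full history
        "local":    (idx - target_idx).abs() <= window,     # neighboring turns only
        "speaker":  same_speaker,                            # target speaker's own history
        "listener": ~same_speaker,                           # the other participants' history
    }

# Example: a 4-turn dialogue between speakers A and B; the target is turn 3.
masks = cda_attention_masks(["A", "B", "A", "B"], target_idx=3)
# Each mask would gate one self-attention view; the four attended outputs are
# then combined (e.g. summed) to form the output of each CDA encoder layer.
```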

Knowledge Introduction
This section introduces the knowledge introduction process, in which ConceptNet (Speer et al., 2017) is leveraged as the commonsense knowledge base. ConceptNet is a large multi-lingual semantic graph, where each node denotes a phrase-level concept and each edge denotes a relation. Each quadruple <concept1, relation, concept2, weight> in ConceptNet denotes an assertion, where the weight is a confidence score assigned to the assertion.
We first introduce the knowledge retrieval process. For a token $t$, we extract a graph $g_t$, which consists of $t$'s immediate neighbors in ConceptNet. For each $g_t$, we discard concepts that are stopwords or not in the word vocabulary $V$ of the encoder mentioned in the last section, and remove assertions with confidence scores less than 1.0 for denoising.
$$g_t = \{(c_1, w_1), (c_2, w_2), \dots, (c_{N_c}, w_{N_c})\},$$
where $c_i$ denotes the $i$-th connected concept of $t$ and $w_i$ denotes its corresponding confidence score. An example of $g_t$ is illustrated in Figure 2(a).
We then adopt a graph attention mechanism to form knowledge representations. For each non-stopword token $X_i^j \in x_i$, we have a graph $g_i^j$. For $X_i^j$ and $c_p \in g_i^j$, we obtain their embeddings $h_{ij}^0$ and $h_{c_p}^0$ via the embedding layer mentioned in Equation 1. The knowledge representation $k_i^j$ is then computed as follows:
$$t_p = h_{ij}^0 \cdot h_{c_p}^0, \qquad (3)$$
$$k_i^j = \sum_{p=1}^{N_c^{ij}} \frac{\exp(t_p)}{\sum_{q=1}^{N_c^{ij}} \exp(t_q)}\, h_{c_p}^0, \qquad (4)$$
where $\cdot$ denotes the dot product operation and $N_c^{ij}$ denotes the number of concepts in $g_i^j$. If $N_c^{ij} = 0$, we set $k_i^j$ to the average of all node vectors.
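A sketch of the retrieval filtering and the graph-attention pooling described above. The exact scoring function (a dot product followed by a softmax, mirroring the assumed form of Equations 3-4) is an assumption, and `conceptnet` stands in for whatever local index of the knowledge base is used.

```python
import torch
import torch.nn.functional as F

def retrieve_graph(token, conceptnet, vocab, stopwords, min_weight=1.0):
    """Return [(concept, weight)] for a token, filtered as described in Sec. 3.3.
    `conceptnet[token]` is assumed to yield (concept, relation, weight) triples."""
    return [(c, w) for c, _rel, w in conceptnet.get(token, [])
            if c in vocab and c not in stopwords and w >= min_weight]

def knowledge_representation(h_token, h_concepts):
    """Graph-attention pooling of concept embeddings (assumed form of Eqs. 3-4).

    h_token:    (D_h,)       embedding of token X_i^j
    h_concepts: (N_c, D_h)   embeddings of the concepts in g_i^j
    """
    if h_concepts.shape[0] == 0:
        # The paper falls back to the average of all node vectors when the graph
        # is empty; a zero vector stands in for that fallback in this sketch.
        return torch.zeros_like(h_token)
    scores = h_concepts @ h_token        # dot-product scores t_p, shape (N_c,)
    alpha = F.softmax(scores, dim=0)     # attention weights over concepts
    return alpha @ h_concepts            # k_i^j, shape (D_h,)
```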
We adopt SenticNet (Cambria et al., 2020) as another knowledge source. For each phrase $s_i$ in SenticNet, there is a quintuple <polarity_value, polarity_intense, moodtags, sentics, semantics>, where polarity_value is either positive or negative. Polarity_intense is a floating-point number between -1 and +1, denoting the positivity of $s_i$. For phrase $s_i$, its mood tags $\hat{m}_i \subset M$, where $M$ is the set of pre-defined emotion description words. Sentics is a quadruple, and semantics $\hat{e}_i$ defines a set of semantics-related concepts of $s_i$. An example of these tuples is illustrated in Figure 2(b). We add the mood tags and semantics to the commonsense knowledge base retrieved in Sec. 3.3.
Specifically, for $s_i \in V$, we construct a mood tag set with an associated weight value. With the enhanced knowledge graph $\hat{g}$, we make minor changes to Equation 3. For $X_i^j$ and $\hat{c}_p \in \hat{g}_i^j$, we obtain their token embeddings $h_{ij}^0$ and $h_{\hat{c}_p}^0$ via the embedding layer mentioned in Equation 1. We modify Equation 3 as follows:
$$t_p = w_{\hat{c}_p}\,\big(h_{ij}^0 \cdot h_{\hat{c}_p}^0\big),$$
where $h_{\hat{c}_p}^0 \in \mathbb{R}^{D_h}$ and $w_{\hat{c}_p}$ is the weight of $\hat{c}_p$. Then $t_p$ is used in the computation of Equation 4, with the rest unchanged.
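A sketch of how the SenticNet mood tags and semantics could be folded into the retrieved concept graph. The quintuple field names mirror the description above, while the choice of weighting mood-tag concepts by the absolute polarity intensity is an assumption about the modified Equation 3, not a detail stated in the text.

```python
def enhance_with_senticnet(graph, phrase, senticnet):
    """Add SenticNet mood tags and semantics of `phrase` to its concept graph.

    graph:     list of (concept, weight) pairs from ConceptNet retrieval
    senticnet: dict phrase -> {"polarity_value", "polarity_intense",
                               "moodtags", "sentics", "semantics"}
    """
    entry = senticnet.get(phrase)
    if entry is None:
        return graph
    intensity = entry["polarity_intense"]   # float in [-1, 1]
    enhanced = list(graph)
    # Mood tags are attached with a weight derived from the polarity intensity
    # (this particular weighting scheme is assumed for illustration).
    enhanced += [(tag, abs(intensity)) for tag in entry["moodtags"]]
    # Semantics-related concepts are added as ordinary neighbors.
    enhanced += [(concept, 1.0) for concept in entry["semantics"]]
    return enhanced
```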

Self-Matching
To employ internal utterance-knowledge interaction in the model, we propose a self-matching module based on self-attention. For each token $X_i^j$, we obtain $u_i^j$ as follows:
$$u_i^j = [h_{ij}^L; k_i^j],$$
where $[;]$ denotes the concatenation operation, $h_{ij}^L \in \mathbb{R}^{D_h}$, and $u_i^j \in \mathbb{R}^{2D_h}$. For two tokens $X_i^j$ and $X_i^m$ within one utterance, we compute their similarity via a trilinear function (Seo et al., 2017):
$$\bar{r}_{jm} = W^{\top}\,[\,u_i^j;\ u_i^m;\ u_i^j \odot u_i^m\,],$$
where $W \in \mathbb{R}^{6D_h}$ is a model parameter and $\odot$ denotes element-wise multiplication. We obtain the similarity matrix $\bar{R}$ accordingly, with $\bar{r}_{jm}$ being its $jm$-th entry. We then obtain the self-attention matrix $Q$ by normalizing each row of $\bar{R}$ with a softmax, where $q_{jm}$ is the $jm$-th entry of $Q$. Intuitively, indirect interaction allows the model to learn deeper semantic relations within the knowledge-enriched representations. To further achieve indirect interaction, we conduct a self-multiplication of $Q$: $\tilde{Q} = QQ$. With indirect interaction, all token pairs can interact through every other token within the utterance. With $Q$ and $\tilde{Q}$, we compute two attended vectors for each token $X_i^j$:
$$v_i^j = \sum_{m} q_{jm}\, u_i^m, \qquad \tilde{v}_i^j = \sum_{m} \tilde{q}_{jm}\, u_i^m,$$
where $v_i^j, \tilde{v}_i^j \in \mathbb{R}^{2D_h}$ and $\tilde{q}_{jm}$ is the $jm$-th entry of $\tilde{Q}$. We concatenate $u_i^j$ with the two attended vectors and their element-wise interaction terms to allow rich interactions, yielding $c_i^j \in \mathbb{R}^{12D_h}$, where $c_i^j$ denotes the $j$-th row of the self-matching output matrix $C$. $C$ is derived from semantic and knowledge interactions between utterance tokens, which allows knowledge to be introduced purposefully instead of acting as noise.
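A sketch of the self-matching module. The trilinear similarity follows Seo et al. (2017); the row-wise softmax and the particular mix of interaction terms in the final concatenation are assumptions chosen to match the stated output size of $12D_h$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMatching(nn.Module):
    def __init__(self, d_h):
        super().__init__()
        # Trilinear similarity weight over [u_j; u_m; u_j * u_m], each block of size 2*d_h.
        self.w = nn.Parameter(torch.empty(6 * d_h))
        nn.init.normal_(self.w, std=0.02)

    def forward(self, u):
        """u: (N, 2*d_h) knowledge-enriched token representations of one utterance."""
        n = u.size(0)
        uj = u.unsqueeze(1).expand(n, n, -1)         # token j, broadcast over m
        um = u.unsqueeze(0).expand(n, n, -1)         # token m, broadcast over j
        tri = torch.cat([uj, um, uj * um], dim=-1)   # trilinear features
        r = tri @ self.w                              # similarity matrix R, shape (N, N)
        q = F.softmax(r, dim=-1)                      # direct interaction Q
        q2 = q @ q                                    # indirect interaction Q~ = QQ
        v, v2 = q @ u, q2 @ u                         # two attended vectors
        # Concatenate with element-wise interaction terms -> (N, 12*d_h).
        # The exact combination is an assumption consistent with the stated size.
        return torch.cat([u, v, v2, u * v, u * v2, v * v2], dim=-1)
```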

Sentiment Polarity Intensity Prediction
In this section, we propose a phrase-level Sentiment Polarity Intensity Prediction (SPIP) task. SPIP is used as an auxiliary task to incorporate sentiment polarity knowledge into the model. Specifically, the model predicts sentiment intensity values for all SenticNet phrases within the utterance. For $x_i$, we retrieve a set $P_i = \{p_i^k \mid p_i^k \in \text{n-grams of } x_i,\ n = 1, 2, \dots, N_g\}$, where $N_g$ is a hyper-parameter. For $p_i^k \in \text{SenticNet} \cap V$, where $p_i^k$ denotes the $k$-th phrase of $P_i$, we record its starting and ending positions $\langle P_0^{ik}, P_1^{ik} \rangle$ in the utterance and the corresponding intensity value $O_i^k$. Therefore, for each utterance $x_i$ we have $\{\langle P_0^{ik}, P_1^{ik} \rangle, O_i^k\}$, $k = 1, \dots, \hat{N}_i$, where $\hat{N}_i$ denotes the number of SenticNet phrases within $x_i$.
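A sketch of the SPIP target construction: enumerate n-grams up to $N_g$, keep those found in SenticNet, and record their spans with intensity values. The tokenization and the `senticnet_intensity` lookup structure are assumptions; the example intensity for "getting_married" is illustrative, not an actual SenticNet value.

```python
def build_spip_targets(tokens, senticnet_intensity, n_max=4):
    """Return [((start, end), intensity)] for SenticNet phrases in one utterance.

    tokens:              token list of utterance x_i
    senticnet_intensity: dict mapping a phrase string to its polarity intensity
    n_max:               N_g, the maximum n-gram length
    """
    targets = []
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            phrase = "_".join(tokens[start:start + n]).lower()
            if phrase in senticnet_intensity:
                # (start, end) are the token positions <P0, P1>; end is exclusive here.
                targets.append(((start, start + n), senticnet_intensity[phrase]))
    return targets

# Example usage with a toy lexicon:
lexicon = {"getting_married": 0.72, "divorce": -0.83}
print(build_spip_targets(["we", "are", "getting", "married"], lexicon))
# [((2, 4), 0.72)]
```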
For each utterance $x_i$, we obtain its word-level representation $h_i^L$ via Equation 2. For a SenticNet phrase $p_i^k$, we obtain its representation $r_i^k$ using phrase-level max pooling:
$$r_i^k = \mathrm{maxpooling}\big(h_i^L[P_0^{ik} : P_1^{ik}]\big)\, W_0 + b_0,$$
where $W_0 \in \mathbb{R}^{D_h \times h}$ and $b_0 \in \mathbb{R}^{h}$ are model parameters, $h$ denotes a pre-defined hidden dimension, $[:]$ denotes the matrix slice operation, and $\mathrm{maxpooling}$ denotes the max pooling operation. We compute the final prediction score:
$$\hat{O}_i^k = r_i^k\, W_1 + b_1,$$
where $W_1 \in \mathbb{R}^{h \times 1}$ and $b_1 \in \mathbb{R}^{1}$ are model parameters. As the training objective, we compute the standard MSE loss for the SPIP task:
$$\mathrm{loss}_a = \frac{1}{\sum_{i}\hat{N}_i}\sum_{i}\sum_{k=1}^{\hat{N}_i}\big(\hat{O}_i^k - O_i^k\big)^2.$$
For utterance $x_i$, we have obtained its word-level knowledge-enriched representation $c_i$ from the self-matching layer (Sec. 3.4), where $c_i$ is the $i$-th entry of $C$. We compute its utterance-level representation through max pooling:
$$\hat{c}_i = \mathrm{maxpooling}(c_i)\, W_2 + b_2,$$
where $c_i \in \mathbb{R}^{N_i \times 12D_h}$, $W_2 \in \mathbb{R}^{12D_h \times h}$, and $b_2 \in \mathbb{R}^{h}$ are model parameters. We compute the final classification probabilities as follows:
$$P_i = \mathrm{softmax}\big(\hat{c}_i\, W_3 + b_3\big),$$
where $\hat{c}_i \in \mathbb{R}^{h}$, $W_3 \in \mathbb{R}^{h \times h_c}$, and $b_3 \in \mathbb{R}^{h_c}$ are model parameters, and softmax denotes the softmax operation.
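A sketch of the two heads as reconstructed above: phrase-level max pooling feeding the SPIP regressor, and utterance-level max pooling over the self-matching output feeding the emotion classifier. The layer names (mirroring $W_0$-$W_3$) and the absence of extra non-linearities are assumptions.

```python
import torch
import torch.nn as nn

class KINetHeads(nn.Module):
    def __init__(self, d_h=768, hidden=300, n_classes=6):
        super().__init__()
        self.spip_proj = nn.Linear(d_h, hidden)        # W_0, b_0 (names assumed)
        self.spip_out  = nn.Linear(hidden, 1)          # W_1, b_1
        self.erc_proj  = nn.Linear(12 * d_h, hidden)   # W_2, b_2
        self.erc_out   = nn.Linear(hidden, n_classes)  # W_3, b_3

    def spip_score(self, h_words, span):
        """h_words: (N_i, d_h) CDA encoder output; span: (P0, P1) of one phrase."""
        pooled = h_words[span[0]:span[1]].max(dim=0).values   # phrase-level max pooling
        return self.spip_out(self.spip_proj(pooled)).squeeze(-1)

    def erc_logits(self, c_words):
        """c_words: (N_i, 12*d_h) self-matching output for one utterance."""
        pooled = c_words.max(dim=0).values                    # utterance-level max pooling
        return self.erc_out(self.erc_proj(pooled))            # softmax applied in the loss
```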
We compute the loss of the ERC task using the standard cross-entropy loss:
$$\mathrm{loss}_m = -\frac{1}{N}\sum_{i=1}^{N} \log P_i[Y_i].$$
With both $\mathrm{loss}_m$ for the main ERC task and $\mathrm{loss}_a$ for the auxiliary SPIP task, we compute the total loss as follows:
$$\mathrm{loss} = \mathrm{loss}_m + \lambda\, \mathrm{loss}_a,$$
where $\lambda \in [0, 1]$ is the pre-defined weight coefficient of $\mathrm{loss}_a$.
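A sketch of the multi-task objective: MSE over the SPIP intensity predictions plus cross-entropy over the emotion labels, weighted by the coefficient above. The default value of `lam` here is an arbitrary placeholder.

```python
import torch.nn.functional as F

def total_loss(erc_logits, gold_labels, spip_preds, spip_golds, lam=0.5):
    """Joint objective loss = loss_m + lam * loss_a, with lam in [0, 1] (value assumed).

    erc_logits:  (batch, h_c) classification logits; gold_labels: (batch,) label indices
    spip_preds:  (num_phrases,) predicted intensities; spip_golds: (num_phrases,) targets
    """
    loss_m = F.cross_entropy(erc_logits, gold_labels)   # main ERC loss
    loss_a = F.mse_loss(spip_preds, spip_golds)          # auxiliary SPIP loss
    return loss_m + lam * loss_a
```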

Experimental Setting
In this section we present experimental settings such as datasets, baselines, implementation details and evaluation metrics.

Datasets
We evaluate our model on the following three ERC datasets. The statistics are shown in Table 1.
MELD (Poria et al., 2019a): A multi-modal dataset enriched from the EmotionLines dataset, collected from the scripts of the TV show Friends. The labels are neutral, happiness, surprise, sadness, anger, disgust and fear.
DailyDialog (Li et al., 2017): Built from human-written daily conversations with no speaker information. The labels are similar to MELD's.

Baselines and State of the Art
We compare our model with the following baselines: BERT_BASE (Devlin et al., 2019): Initialized with the pre-trained parameters of BERT-BASE, the model is fine-tuned for the ERC task.
DialogueRNN: DialogueRNN uses three GRUs to model speaker states, global context and emotion dynamics. The model is expected to capture inter-speaker relations in multi-party conversations.
DialogueGCN (Ghosal et al., 2019): The model utilizes a graph-based structure to model utterance relations within a conversation.
KET (Zhong et al., 2019): The model uses a graph attention mechanism to combine commonsense knowledge into utterance representations.
AGHMN (Wenxiang Jiao and King, 2020): The model uses a hierarchical memory network to model and store context representations.
HiTrans: Based on a hierarchical Transformer, the model uses multi-task learning to become speaker-sensitive.
IEIN (Lu et al., 2020): IEIN uses predicted emotion labels instead of gold labels and designs a loss to constrain the prediction of each iteration.
RGAT (Ishiwatari et al., 2020): Based on a graph structure, the model augments relation modeling of conversations and adds relational position encodings to incorporate sequential information.
COSMIC (Ghosal et al., 2020): COSMIC incorporates different elements of commonsense and leverages them to learn interlocutors' interactions.
DialogXL (Shen et al., 2020): The model uses a dialog-aware self-attention to introduce the awareness of inter-and intra-speaker dependency.

Other Experimental Settings
We conducted all experiments on a Xeon(R) Silver 4110 CPU with 768GB of memory and a GeForce GTX 1080Ti GPU with 11GB of memory. We tokenize and pre-process the above three datasets using the XLNet tokenizer provided by Hugging Face to match the vocabulary of the pre-trained XLNet. For the hyper-parameter setting, $D_h$=768, $h$=300, $L$=12 and $N_g$=4, while $h_c$ and $D_m$ depend on the dataset. We set the initial learning rate to 1e-5 on IEMOCAP and 1e-6 on MELD and DailyDialog. We employ the AdamW optimizer (Loshchilov and Hutter, 2017) with a scheduled learning rate and a batch size of 6 on IEMOCAP and 4 on MELD and DailyDialog during training. We set the dropout rate to 0.3 on DailyDialog and 0 on the other datasets. All results are obtained using the text modality only. The evaluation metric is micro-F1 for DailyDialog and weighted-F1 for the other datasets. The results reported in our experiments are averages of 5 random runs on the test set.
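For reference, the hyper-parameters stated above gathered into a single configuration sketch; values not given in the text (e.g., the number of training epochs or the loss weight) are omitted rather than guessed.

```python
# Hyper-parameters as reported above; dataset-dependent values shown per dataset.
CONFIG = {
    "d_h": 768,           # XLNet-base hidden size D_h
    "hidden": 300,        # projection size h
    "layers": 12,         # CDA encoder layers L
    "n_gram_max": 4,      # N_g for SPIP phrase extraction
    "optimizer": "AdamW",
    "learning_rate": {"IEMOCAP": 1e-5, "MELD": 1e-6, "DailyDialog": 1e-6},
    "batch_size":    {"IEMOCAP": 6,    "MELD": 4,    "DailyDialog": 4},
    "dropout":       {"IEMOCAP": 0.0,  "MELD": 0.0,  "DailyDialog": 0.3},
    "metric":        {"IEMOCAP": "weighted-F1", "MELD": "weighted-F1",
                      "DailyDialog": "micro-F1"},
}
```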

Overall Results
Overall results of our model and the baselines are listed in Table 2 and Table 3. According to the results on IEMOCAP, DialogXL, COSMIC and RGAT outperform the other models with performance above 65%. All three models consider modeling dependencies within conversations, again indicating that elaborate context modeling modules are essential for the ERC task. We also notice that models such as KET improve over Transformer-based PLMs since they explicitly introduce commonsense knowledge. Besides, HiTrans devises an auxiliary task to incorporate task-related information, which also shows some improvement. Our KI-Net model refreshes the current best performance on IEMOCAP, bringing a 1.04% performance improvement. We attribute this result to our model considering all three factors mentioned above. Similar results are also reflected on MELD and DailyDialog. KI-Net achieves 63.24% on MELD, which is around 5% better than KET. Considering the structure of KET, we believe this improvement mainly comes from the introduction of the self-matching module. KI-Net achieves 57.30% on DailyDialog, which is around 2.5% better than DialogXL. This may be because external knowledge complements the lacking knowledge dimensions of PLMs. KI-Net is weaker than COSMIC on these two datasets but still ranks in the top two. Unlike our model, COSMIC leverages a different PLM and knowledge source. We speculate that performance on short conversations (fewer than ten turns) depends more on the selection of knowledge sources.

Emotion-Specific Results
We present emotion-specific testing results on the IEMOCAP dataset in Table 2. KI-Net remains in the top two for most emotions and shows balanced performance. Specifically, on the emotions Neutral and Frustrated, our model achieves the best results at 65.63% and 68.38%. We believe the interaction between the knowledge and the utterance provides reasonable instruction for the final judgment, which benefits the detection of fine-grained emotions such as Frustrated and Angry.

Effect of Element Selection for SPIP
As mentioned above, for each phrase $s_i$ in SenticNet, there is a tuple with several sentiment-related elements. In addition to the sentiment intensity value, we also explore some of the other elements to provide supervision for our auxiliary task. The results are shown in Table 4. We tried different combinations, such as sentiment polarity and mood tags, but their effect is weaker than that of sentiment intensity. We think this is because sentiment intensity already subsumes sentiment polarity, and because SPIP is a phrase-level auxiliary task while the main task must be judged in context, which can shift the fine-grained emotions corresponding to the mood tags; sentiment intensity is therefore the more appropriate choice.

Case Study
We provide two cases obtained from the actual testing process on the IEMOCAP dataset to verify the influence of the introduced knowledge and the SPIP task. As illustrated in Figure 3, in case 1, with commonsense concepts such as "miss", "husband" and "wedded", the model gains a deeper insight into the semantics of "married". Meanwhile, the SPIP classifier gives relatively strong positivity to the phrase "getting_married", which, together with the keyword "not", establishes the emotional direction of the target utterance. Clearly, these two ways of introducing knowledge play different roles in the reasoning process. The result of the CDA encoder further shows that context plays little role in this case.
In case 2, the model does not obtain direct emotion-related information from commonsense knowledge concepts such as "earned". Hence, in this case, the knowledge introduction module plays a relatively small role and makes the same prediction as the CDA encoder. However, with the negative intensity value that the SPIP classifier assigns to the token "cheap", the model manages to label the utterance "Angry", which is also a negative emotion but clearly more intense than "Frustrated".

Error Analysis
We present the confusion matrix of our test results on the IEMOCAP dataset in Figure 4. We notice that many of the misclassifications are between neutral and non-neutral emotions, which could be improved by adding clues from other modalities. Despite the strong performance of our model, distinguishing similar emotions (e.g., excited and happy) remains a great challenge. A possible reason is that the sentiment lexicon assigns close polarity intensity values to words with similar emotional expressions.

Ablation Study
We perform an ablation study of our designed modules. "-Self-Matching" denotes that the utterance and knowledge representations are directly concatenated. "-Knowledge Integration" means that both knowledge introduction and self-matching are discarded. As shown in Table 5, the performance drops when any of the components is removed. In particular, after removing self-matching, the performance may even fall below that of the CDA encoder alone. This result proves that self-matching is crucial for integrating knowledge, without which knowledge may even bring noise into emotional reasoning.
Performance drops more when SPIP is removed on the IEMOCAP dataset, while knowledge integration plays a relatively more important role on the other two datasets. This may be because, on the MELD dataset, only about 65% of utterances contain SenticNet phrases, with an average of only 1.9 phrases per utterance. To some extent, this once again confirms our earlier conjecture that short conversations are more sensitive to the choice of knowledge sources than long conversations.

Conclusion
This paper proposes KI-Net for emotion recognition in conversations. Our model outperforms state-of-the-art models on IEMOCAP. Extensive experiments prove the necessity of interaction between knowledge and utterances, and show that the new auxiliary task SPIP further improves performance. Utterance-level interaction and the confusion of similar emotions are the focus of our future research. Which dimensions of knowledge ERC relies on most is also worth in-depth investigation.