Exploring the Role of Context in Utterance-level Emotion, Act and Intent Classification in Conversations: An Empirical Study



Abstract
The recent abundance of conversational data on the Web and elsewhere calls for effective NLP systems for dialogue understanding. Complete utterance-level understanding often requires context understanding, partly defined by the nearby utterances and by the user intention and background. In recent years, a number of context-aware approaches have been proposed for various utterance-level dialogue understanding tasks. In this paper, we explore and quantify the role of context for different aspects of a dialogue, namely emotion, dialogue act, and intent identification, using state-of-the-art dialogue understanding methods as baselines. Specifically, we employ various perturbations to distort the context of a given utterance and study its impact on the different tasks and baselines. This provides us with insights into the fundamental context factors that have immediate implications on different aspects of a dialogue. Such insights may inspire more effective dialogue understanding models and provide support for future text generation approaches.

Introduction
Human-like conversational systems are a longstanding goal of Artificial Intelligence (AI). However, the development of such systems is not a trivial task, as we often participate in dialogues by relying on several contextual factors such as emotions, prior assumptions, intent, or personality traits. It is thus not surprising that the landscape of dialogue understanding research embraces several challenging tasks, such as emotion recognition in conversations (ERC), dialogue intent classification, user-state representation, and others. These tasks are often performed at the utterance level and can be grouped under the umbrella of utterance-level dialogue understanding. Due to the fast-growing research interest in dialogue understanding, several novel approaches have recently been proposed (Qin et al., 2020; Rashkin et al., 2019; Xing et al., 2020; Lian et al., 2019; Saha et al., 2020) to address these tasks by adopting speaker-specific and contextual modeling. However, to the best of our knowledge, the role of context has not been thoroughly explored across these tasks, partly due to the lack of a unified framework across the various utterance-level dialogue understanding tasks. In this work, we explore the role of context in utterance-level dialogue understanding. We use a contextual utterance-level dialogue understanding model, bcLSTM (Poria et al., 2017), as a strong baseline for the six dialogue-understanding tasks in four datasets. We propose several unique context probing strategies and experimental designs that test and measure: (1) speaker-specific context; (2) context order; (3) paraphrased context; (4) label shifts; (5) the role of CRF in the sequence tagging of utterances in a dialogue. These strategies can be easily adapted for other tasks for similar purposes and provide insights into the development of new approaches to address these tasks.
Task Definition: Given a conversation along with the speaker information of each constituent utterance, the utterance-level dialogue understanding task aims to identify the label of each utterance from a set of predefined labels, which can be a set of emotions, dialogue acts, intents, etc. Fig. 1 illustrates one such conversation between two people, where each utterance is labeled by emotion and intent. Formally, given the input sequence of $N$ utterances $[(u_1, p_1), (u_2, p_2), \ldots, (u_N, p_N)]$, where each utterance $u_i = [u_{i,1}, u_{i,2}, \ldots, u_{i,T}]$ consists of $T$ words and is spoken by party $p_i$, the task is to predict the label $e_i$ of each utterance $u_i$. The classifier can make use of the conversational context in the process.

Models
We train our classification models in an end-to-end setup. We first extract utterance level features with a CNN module with pretrained GloVe vectors. The resulting features are non-contextual in nature as they are obtained from utterances without the surrounding context. We then classify the utterances with one of the following two models: i) Logistic Regression, or ii) bcLSTM. Among these, the Logistic Regression model is non-contextual in nature, whereas the bcLSTM is contextual. We expand on the feature extractor and the classifier in detail next.

Utterance Feature Extractor
Utterance-level features are extracted using the following method: GloVe CNN. A convolutional neural network (Kim, 2014) is used to extract features from the utterances of the conversation. We use a convolutional layer followed by max-pooling and a fully-connected layer to obtain the representation of the utterance. Each word in the utterances is initialized with 300d pretrained GloVe embeddings (Pennington et al., 2014). We pass these to convolutional filters of sizes 1, 2, and 3, each having 100 feature maps. The outputs of these filters are then max-pooled across all the words of an utterance. These are then concatenated and fed to a 100-dimensional fully-connected layer with ReLU activation (Nair and Hinton, 2010). The output after the activation forms the final representation of the utterance.

Utterance Classifier
The representations obtained from the Utterance Feature Extractor are then classified using one of the following two methods: Without Context Classifier. In this model, classification of an utterance is performed using a fully connected multi-layer perceptron layer. This classification setup is non-contextual in nature as there is no flow of information from the contextual utterances. We call this model GloVe CNN.
GloVe bcLSTM. The Bidirectional Contextual LSTM model (bcLSTM) (Poria et al., 2017) creates context-aware utterance representations by capturing the contextual content from the surrounding utterances using a Bi-directional LSTM (Hochreiter and Schmidhuber, 1997) network. bcLSTM is a strong contextual utterance-level dialogue understanding baseline, with consistent performance across all six dialogue-understanding tasks considered in this work. In our experiments, bcLSTM is on average only 1% worse than the state of the art across the six tasks that we address in this work. As opposed to more complicated models (Qin et al., 2020; Zhong et al., 2019), the simpler architecture of bcLSTM is devoid of complicated interactions amongst the contextual utterances, such as attention. This enables easier interpretation of the effects of the perturbations of the context. The feature representations extracted by the Utterance Feature Extractor serve as the input to the bcLSTM network. Finally, the context-aware utterance representations from the output of the bcLSTM are used for the label classification. The bcLSTM model is speaker independent as it does not model any speaker-level dependency. In our implementation, we add a residual connection between the first and the output from the final layer to improve the network's stability. We call this model GloVe bcLSTM.
Why GloVe-based CNN and LSTM-based Models: In this study, we consider GloVe CNN and GloVe bcLSTM and set up different scenarios to analyze them because these models are conceptually much more straightforward than other state-of-the-art models such as DialogueRNN and DialogueGCN (Ghosal et al., 2019). For example, DialogueRNN also tracks the speaker states in addition to context. Thus, perturbations in the input would influence speaker modeling along with context modeling. This results in more complex deviations than in bcLSTM, which are more difficult to analyze. Simple models are likely to be more interpretable. For example, owing to DialogueRNN's complexity, we would need to perform different levels of ablation studies to explain its behavior.
Furthermore, we use GloVe embeddings because recent transformer-based models such as BERT (Devlin et al., 2018) are trained using the masked language model (MLM) objective, which is already very powerful in modeling cross-sentential context representations, as demonstrated by other works. Hence, to conduct a fair comparison between non-contextual and contextual models and, further, for an easier understanding of the role of contextual information in utterance-level dialogue understanding, we resort to the GloVe CNN and LSTM-based models. Additionally, as we perform a number of analysis studies, the GloVe-based models were computationally much more efficient and faster to train and analyze.

Datasets
All the dialogue classification datasets that we consider in this work consist of two-party conversations in the English language. We benchmark the models on the following datasets (see Table 1):
IEMOCAP (Busso et al., 2008) is a dataset of two-person conversations among ten unique speakers. The train set dialogues come from the first eight speakers, whereas the test set dialogues are from the last two. Each utterance is annotated with one of the following six emotions: happy, sad, neutral, angry, excited, and frustrated.
DailyDialog (Li et al., 2017) covers various topics about our daily life and follows the natural human communication approach. All utterances are labeled with both emotion categories and dialogue acts. The emotion can belong to one of the following seven labels: anger, disgust, fear, joy, neutral, sadness, and surprise. The dataset has over 83% neutral labels, and these are excluded during Macro-F1 evaluation. In comparison, the dialogue act label distribution is relatively more balanced. The act labels can belong to the following four categories: inform, question, directive, and commissive.
MultiWOZ (Budzianowski et al., 2018), or the Multi-Domain Wizard-of-Oz dataset, is a fully-labeled collection of human-human conversations spanning multiple domains and topics. The dataset has been created for task-oriented dialogue modelling and has 10,000 dialogues, which is at least an order of magnitude larger than previously available task-oriented corpora. The dialogues are labelled with belief states and actions. It contains conversations between a user and a system from the following seven domains: restaurant, hotel, attraction, taxi, train, hospital and police. Here we focus on classifying the intent of the utterances from the user, which belong to one of the following categories: book restaurant, book train, find restaurant, find train, find attraction, find bus, find hospital, find hotel, find police, find taxi, and None. The None utterances are not included in evaluation. Note that utterances from the system side are not labelled and thus are not classified in our framework.
Persuasion For Good (Wang et al., 2019) is a persuasive dialogue dataset where one participant aims to persuade the other participant to donate their earnings using different persuasion strategies. The two participants are denoted as Persuader (ER) and Persuadee (EE), respectively. In this work, we formulate our problem as classifying the utterances of the Persuader and the Persuadee separately using the full context of the conversation. This task can also be considered a dialogue act classification task. The Persuader strategies are to be classified into the following eleven categories: donation-information, logical-appeal, personal-story, foot-in-the-door, credibility-appeal, emotion-appeal, personal-related-inquiry, source-related-inquiry, self-modeling, task-related-inquiry, and non-strategy-acts. For the Persuadee, the strategy can belong to one of the following thirteen categories: disagree-donation-more, ask-org-info, agree-donation, provide-donation-amount, disagree-donation, personal-related-inquiry, task-related-inquiry, ask-donation-procedure, negative-reaction-to-donation, positive-reaction-to-donation, ask-persuader-donation-intention, neutral-reaction-to-donation, and other-acts.

Evaluation Metrics
In our experiments, we use the Weighted average (W-Avg) F1 score for IEMOCAP emotion and MultiWOZ intent classification. For the other tasks (DailyDialog emotion, DailyDialog act, Persuader and Persuadee strategy classification), the label distribution is highly imbalanced, hence we report Macro F1 scores. In DailyDialog emotion classification, neutral labels are excluded (masked) while calculating the metrics. However, these utterances are still passed as input to the different models.
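The two metrics can be illustrated with a minimal, standard-library sketch. The function names are hypothetical, and the way the neutral class is excluded here (averaging per-label F1 over the non-excluded labels while keeping all utterances in the counts) is an assumption about one reasonable reading of "masked", not necessarily the exact evaluation procedure used in the experiments.

```python
from collections import Counter


def per_label_f1(gold, pred, labels):
    """One-vs-rest F1 for each label in `labels`."""
    scores = {}
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores[lab] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, exclude=()):
    """Unweighted mean of per-label F1, skipping labels in `exclude`."""
    labels = sorted(set(gold) - set(exclude))
    scores = per_label_f1(gold, pred, labels)
    return sum(scores.values()) / len(labels)


def weighted_f1(gold, pred):
    """Per-label F1 averaged with weights equal to each label's support."""
    support = Counter(gold)
    scores = per_label_f1(gold, pred, sorted(support))
    return sum(scores[lab] * support[lab] for lab in support) / len(gold)


# Toy example (made-up labels, not taken from any dataset).
gold = ["neutral", "joy", "neutral", "anger", "joy"]
pred = ["neutral", "joy", "joy", "anger", "neutral"]
macro_without_neutral = macro_f1(gold, pred, exclude=("neutral",))
weighted = weighted_f1(gold, pred)
```

Macro F1 treats every retained label equally, which is why it is the more informative choice when one class dominates the data.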

Speaker-specific Context Control
We first report the performance of the baseline GloVe CNN and GloVe bcLSTM models in the first two rows of Table 2. To further evaluate the intra- and inter-speaker dependence and relation across the different tasks in the GloVe bcLSTM model, we adopt two different settings as follows:
• w/o Inter-Speaker Dependency: when classifying a target utterance from speaker A, we drop the utterances of speaker B from the context, and vice versa.
• w/o Intra-Speaker Dependency: when classifying a target utterance from speaker A, we only keep utterances of the speaker B and drop all other utterances of speaker A from the context and vice versa.
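The two settings above can be sketched as a simple filter over a dialogue. This is a minimal illustration: the function name, the setting strings, and the representation of a dialogue as a list of (utterance, speaker) pairs are assumptions made for the example.

```python
def control_speaker_context(dialogue, target_idx, setting):
    """Return the target utterance plus the context allowed by `setting`.

    dialogue   : list of (utterance, speaker) pairs in conversation order
    target_idx : index of the utterance being classified
    setting    : "wo_inter" keeps only the target speaker's utterances,
                 "wo_intra" keeps only the other speaker's utterances
    """
    if setting not in ("wo_inter", "wo_intra"):
        raise ValueError(f"unknown setting: {setting}")
    target_speaker = dialogue[target_idx][1]
    kept = []
    for i, (utterance, speaker) in enumerate(dialogue):
        same_speaker = speaker == target_speaker
        if i == target_idx:
            kept.append((utterance, speaker))  # the target is always retained
        elif setting == "wo_inter" and same_speaker:
            kept.append((utterance, speaker))
        elif setting == "wo_intra" and not same_speaker:
            kept.append((utterance, speaker))
    return kept


# Toy two-party dialogue with placeholder utterances.
dialogue = [("u1", "A"), ("u2", "B"), ("u3", "A"), ("u4", "B"), ("u5", "A")]
wo_inter = control_speaker_context(dialogue, 2, "wo_inter")
wo_intra = control_speaker_context(dialogue, 2, "wo_intra")
```

In both cases the relative order of the retained utterances is preserved, so only the speaker composition of the context changes.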
Utterances of the Non-target Speaker are Important. The first setting forces the LSTM to rely only on the context of the target speaker (the speaker of the target utterance) for prediction. The results are reported in Table 2. As expected, performance drops are observed for all the datasets except IEMOCAP emotion recognition, reinforcing the fact that the contextual utterances from the non-target speakers are important. The performance drop for act classification in the DailyDialog dataset is noticeably the steepest. In the IEMOCAP dataset, we observe that speakers tend to maintain the same emotion across consecutive utterances of a dialogue, which induces a dataset bias. Hence, unlike the task of dialogue generation, where the listener's utterance is key to generating the speaker's response, we suspect that for emotion recognition in the IEMOCAP dataset, removing the other interlocutor's utterances from the context makes it easier and less confusing for the LSTM-based model to learn relevant contextual representations for the prediction. In contrast, repetitions of the same or similar emotions in consecutive utterances of a speaker do exist in the DailyDialog dataset but are less prevalent.
Utterances of the Target Speaker are also Important. The 'w/o Intra-Speaker Dependency' scenario reported in Table 2 exhibits the importance of the utterances of the non-target speaker in the classification of the target utterance. In DailyDialog act and MultiWOZ intent classification, even when we remove the contextual utterances from the same speaker, the utterances from the non-target speaker provide key contextual information, as evidenced by the performance in the 'w/o Intra-Speaker Dependency' setting. In those tasks, dropping the utterances of the non-target speaker results in more performance degradation than removing the utterances of the target speaker from the target utterance's context. This observation also supports the dialogue generation works (Zhou et al., 2017) that mainly consider previous utterances of the non-target speaker as the context for response generation. For emotion classification in DailyDialog and strategy classification in Persuasion For Good, the results obtained from the 'w/o Intra-Speaker Dependency' setting are also lower than those of the baseline bcLSTM setting. This confirms the higher contextual salience of the target speaker's utterances over the non-target speaker's utterances for these particular tasks. In the case of IEMOCAP emotion classification, removing the target speaker's utterances from the context causes a substantial performance dip for the reasons stated earlier.
Interestingly, the 'w/o Inter-Speaker Dependency' setting in the DailyDialog dataset manifests two distinct trends for two different tasks: act classification and emotion recognition. While non-target speakers' utterances carry little value for emotion recognition, they are extremely beneficial for act classification. This calls for task-specific context modeling techniques, which should be the focus of future work.
Key Takeaways of this Experiment. Although both target and non-target speakers' utterances are useful in several utterance-level tasks, we observe divergent trends in some of the tasks in our experiments. Hence, we surmise that a task-agnostic unified context model may not be optimal for solving all the tasks. In the future, we should strive for task-specific contextual models, as each task can have unique features that make it distinct from others. One can also think of multi-task architectures where two tasks can corroborate each other in improving the overall performance.
Logically, dropping contextual utterances in a dialogue leads to inconsistency in the context and, consequently, it should degrade the performance of a model that relies on the context for inference. Hence, given an unmodified dialogue flow, an ideal contextual model is expected to refer to the right amount of contextual utterances relevant to inferring the label of a target utterance. In contrast, bcLSTM shows performance improvement for IEMOCAP emotion classification when utterances from the non-target speaker are dropped (refer to the 'w/o Inter-Speaker Dependency' row in Table 2). The performance does not change much for dialogue act and intent classification in DailyDialog and MultiWOZ, respectively, when we drop utterances of the target speaker. These contrasting results indicate a potential drawback of the bcLSTM model in efficiently utilizing contextual utterances of both interlocutors in unmodified dialogues for the above-mentioned tasks.

Classification in Shuffled Context
To analyze the importance of context, we shuffle the utterance order of a dialogue and try to classify the correct label from the shuffled sequence. For example, a dialogue having the utterance sequence $\{u_1, u_2, u_3, u_4, u_5\}$ is shuffled to $\{u_5, u_1, u_4, u_2, u_3\}$. This shuffling is carried out randomly, resulting in an utterance sequence whose order is different from the original sequence. We design three such shuffling experiments: i) dialogues in the train and validation sets are shuffled, and the test set is unchanged; ii) dialogues in the train and validation sets are kept unchanged, but dialogues in the test set are shuffled; iii) dialogues in the train, validation and test sets are all shuffled. We analyze these shuffling strategies in the GloVe bcLSTM model. In theory, the recurrent nature of the LSTM model allows it to model contextual information from the beginning of the sequence to the very end. However, when classifying an utterance, the most crucial contextual information comes from the neighbouring utterances. In an altered context, the model would find it difficult to predict the correct labels because the original neighbouring utterances may not be in the immediate context after shuffling. This kind of perturbation would make the context modelling less efficient, and performance is likely to drop compared to the non-shuffled counterpart. This is empirically shown in Table 3.
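The shuffling perturbation can be sketched as follows. This is a minimal illustration under two assumptions that the text does not spell out: each label moves together with its utterance, and the permutation is redrawn until it differs from the original order. The function name and the seeded generator are hypothetical.

```python
import random


def shuffle_dialogue(utterances, labels, rng):
    """Randomly permute a dialogue so that its order differs from the original.

    Each label travels with its utterance (assumption). Dialogues with fewer
    than two utterances cannot be reordered and are returned unchanged.
    """
    n = len(utterances)
    if n < 2:
        return list(utterances), list(labels)
    identity = list(range(n))
    order = identity[:]
    while order == identity:  # redraw until the order actually changes
        rng.shuffle(order)
    return [utterances[i] for i in order], [labels[i] for i in order]


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
utterances = ["u1", "u2", "u3", "u4", "u5"]
labels = ["inform", "question", "inform", "directive", "commissive"]
shuffled_utts, shuffled_labels = shuffle_dialogue(utterances, labels, rng)
```

The three experimental regimes then differ only in which splits (train and validation, test, or all) this function is applied to.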
We observe that whenever there is some shuffling in the train, validation, or test set, the performance decreases by a few points in all the datasets across all tasks and evaluation metrics. Notably, the performance drop is highest when the dialogues in the train and validation sets are kept unchanged and the dialogues in the test set are shuffled. Note that the result for this shuffling strategy (only the test set is shuffled) in MultiWOZ stands at 67.91%, much lower than the original baseline of 96.22%. This is because the test score of 67.91% is reported at the best validation score, even though we obtain better test scores (around 78%) at the initial epochs of training.
Our reported results and observations contradict the claims made by Sankar et al. (2019). According to Sankar et al. (2019), the shuffling of contextual utterances does not affect the response generation performance of a seq2seq model. There can be a number of reasons for these two contradicting observations: 1) first, the characteristics of utterance labels in a dialogue are different from those of responses (responses are subjective and not unique, whereas labels are usually agreed upon by the observers to some degree); 2) second, instead of reporting qualitative results, Sankar et al. (2019) only reported the perplexity score of their experiments. As stated in Cai et al. (2019), perplexity and BLEU scores may not correctly represent the quality of the response generation.

Attacks with Context and Target Paraphrasing
Modern machine learning systems are often susceptible to attacks that slightly perturb the input without any drastic change in the semantics. Although prevalent in images, adversarial examples also exist in neural network-based NLP applications. In the context of NLP, crafting adversarial examples would require making character-, word-, or sentence-level modifications to the input text to trick the classifier into misclassification. Paraphrasing sentences is one such method to construct effective adversarial examples (Iyyer et al., 2018). We conduct several experiments to evaluate the sensitivity of utterance-level dialogue understanding systems to input paraphrasing. It should be noted that although task-specific adversarial strategies could be adopted, we chose to use a general set of attacking strategies in order to understand the behavior of the baseline across different tasks and datasets. This also facilitates a fair comparison among the tasks and whether there is a confounding factor that differentiates one task from another under the same attacking strategies.
Method. We use the following scheme to analyze this effect:
• The input utterances are modified at the word level. For this modification, an average of 3 to 4 words are selected per utterance and masked. The pretrained RoBERTa model is then used to fill the masks with the most likely candidates. The utterance with the substituted words forms the new input. We call this method Paraphrasing-based Attack (PA).
• For each utterance ($u_t$) in a dialogue, we take a window of $w$ immediate neighbouring utterances (context) on which the above modifications are performed. The window is selected as follows:
  - Only past $w$ utterances: $u_{t-w}, \ldots, u_{t-1}$
  - Only future $w$ utterances: $u_{t+1}, \ldots, u_{t+w}$
  - Past $w$ and future $w$ utterances: $u_{t-w}, \ldots, u_{t-1}, u_{t+1}, \ldots, u_{t+w}$
  - Past $w$, future $w$, and the target utterance: $u_{t-w}, \ldots, u_{t-1}, u_t, u_{t+1}, \ldots, u_{t+w}$
  - Only the target utterance: $u_t$
In the last case, the window is empty. In other cases, we experiment with window sizes $w = 3, 5, 10$.
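The five window configurations can be sketched as a function that returns the positions of the utterances to be perturbed. The function name, the mode strings, the zero-based indexing, and the truncation of the window at the dialogue boundaries are assumptions made for this illustration; the masking and mask-filling step itself is not shown.

```python
def attack_indices(t, n, w, mode):
    """Zero-based positions of the utterances to paraphrase for target t.

    t    : index of the target utterance
    n    : number of utterances in the dialogue
    w    : window size (ignored when mode is "target")
    mode : one of "past", "future", "past_future",
           "past_future_target", "target"
    The window is clipped at the start and end of the dialogue (assumption).
    """
    past = list(range(max(0, t - w), t))
    future = list(range(t + 1, min(n, t + w + 1)))
    modes = {
        "past": past,
        "future": future,
        "past_future": past + future,
        "past_future_target": past + [t] + future,
        "target": [t],
    }
    if mode not in modes:
        raise ValueError(f"unknown mode: {mode}")
    return modes[mode]


# Target in the middle of a ten-utterance dialogue, window size 3.
middle = attack_indices(4, 10, 3, "past_future_target")
# Target near the start of a short dialogue: the past window is clipped.
clipped = attack_indices(1, 4, 3, "past_future")
```

Every index returned would then be passed through the word-level masking and substitution described in the first bullet.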
We train a GloVe bcLSTM and a GloVe CNN model with unadulterated train and validation data. During evaluation, however, the context and target are paraphrased as described before. The results of these experiments for bcLSTM and GloVe CNN are shown in Table 4 and Table 5, respectively.
Observations. We observe that the Paraphrasing-based Attack is quite effective in fooling the classifier in a number of tasks. The classification performance progressively deteriorates with larger window sizes.
In DailyDialog act classification, the Paraphrasing-based Attack on only future utterances barely affects the results: the classification performance remains very close to the original score of 79.46%. We observe that there is a strong reliance on the label and content of the past utterance in this task. For example, a question is likely to be followed by an inform or another question and much less likely to be followed by a commissive utterance. Unchanged past context thus results in performance that is very close to the original setup. Attacking the past utterances combined with future and/or target utterances results in a relatively bigger performance drop. We also notice that the drop in performance is much smaller than in the other tasks, except for intent classification in MultiWOZ. This is possibly because the act labels are mostly driven by the sentence type and hence are unlikely to be affected by paraphrasing perturbations. For instance, around 30% of the act labels are of type question, and our attack strategy is almost guaranteed not to change an utterance with the label question to something which might be classified as inform, commissive, or directive. Overall, we observe a consistent plunge in performance when the target utterance is attacked by the Paraphrasing-based Attack method. For intent classification in MultiWOZ, utterances often have keywords which indicate the label (the presence of train might indicate the class label find train or book train). In these cases, if the target utterance is not paraphrased, the model is still likely to predict the correct label. Finally, in Persuasion for Good, we observe that the attack method is slightly more effective in fooling the classifier for persuadee strategy classification.
In terms of window direction, we observe that perturbations in the past or future utterances result in a similar range of reduction in performances. One notable exception is act prediction in DailyDialog, where the model continues to perform near the original score of 79.46% irrespective of the attack in future utterances in the window.
Performance Comparison for Attacks in GloVe CNN and GloVe bcLSTM. We summarize the performance of the GloVe CNN and GloVe bcLSTM models against the Paraphrasing-based Attack in Table 5. For all the tasks, we observe a very significant drop in performance for GloVe CNN. For example, in emotion classification, the drop is around 23% and 40% for the Paraphrasing-based Attack in IEMOCAP and DailyDialog, respectively. However, for the same setting, the relative decrease in performance is only around 6% and 10% for bcLSTM. We observe the same trend in other tasks, where the bcLSTM model is much more robust against the attack than the CNN model. This is because contextual models such as bcLSTM are harder to fool: the context carries key information regarding the semantics of the target, and salient information about the target can be inferred from its context. It is thus evident that even when the target utterance is corrupted, bcLSTM is capable of using contextual information to predict the label correctly, and consequently the decline in performance is much smaller.
In principle, our findings in Table 5 can be related to how transformer-based pre-trained language models work. For example, in BERT (Devlin et al., 2018), the masked language modeling (MLM) and the next sentence prediction (NSP) objectives force the model to infer or predict the target using contextual information. Such contextual models are more powerful and robust because context information plays a crucial role in almost every natural language processing task. An objective similar to next sentence prediction in BERT or permutation language modeling in XLNet (Yang et al., 2019) can be used for conversation-level pre-training to improve several downstream conversational tasks. Such approaches have been found to be useful in the past.

Performance for Label Shift
As discussed before, a few of our tasks of interest exhibit the label copying property, which means that consecutive utterances from the same speaker or different speakers often have the same or similar emotion, act, or intent label. Inter-speaker and intra-speaker label copying is especially prevalent in the IEMOCAP emotion task, the DailyDialog act task, and the MultiWOZ intent task. Contextual models such as bcLSTM make correct predictions when utterances display this kind of continuation of the same label. But what happens when there is a change of label? Does bcLSTM continue to perform at the same level, or is it affected by the change? To understand this occurrence in more detail, we define this event as Label Shift and look at the following two kinds of shifts that could happen in the course of a dialogue:
• Intra-Speaker Shift: The label of the utterance is different from the label of the previous utterance from the same speaker.
• Inter-Speaker Shift: The label of the utterance is different from the label of the previous utterance from the non-target speaker.
In the two scenarios explained above, we are interested in seeing how bcLSTM performs at the utterances where the label shift takes place. We report results for utterances in the test data that show Intra-Speaker Shift and Inter-Speaker Shift in Table 6. The Inter-Speaker Shift is not defined in MultiWOZ as we don't have intent labels for system utterances. We also don't report Inter-Speaker Shift results in Persuasion for Good as the persuader and persuadee strategy sets are different. The emotion labels in IEMOCAP display the largest extent of label copying. We also observe in Table 6 that label shifts occur with high frequency in IEMOCAP. These are the likely reasons why we observe a significant number of errors for utterances with Label Shift for this task in Table 6. The performance for both Intra-Speaker Shift and Inter-Speaker Shift stands at around 52.0%, much lower than the overall average of 61.9% in the test data. Although not as strong as in IEMOCAP, the intra-speaker label copying feature can also be seen in MultiWOZ intent and DailyDialog act labels. For these two tasks, we again observe a drop in performance when either Intra-Speaker Shift or Inter-Speaker Shift occurs. In contrast, the extent of transition is spread over a much larger combination of labels in DailyDialog emotion and Persuasion for Good. We observe that the results for utterances with Label Shift in those tasks are in fact better than the overall score. In DailyDialog emotion, the scores are 44.23% and 47.77%, which is an improvement over the original 41.16%. The scores of 57.84% and 49.4% in Persuasion for Good also stand above the scores of 56.28% and 44.83% in the original setup.
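The two kinds of Label Shift can be located with a single pass over a dialogue. In this minimal sketch the function name is hypothetical, and "the previous utterance from the non-target speaker" is taken to mean the most recent earlier utterance by any other speaker, which is an assumption.

```python
def find_label_shifts(speakers, labels):
    """Return (intra_shift_indices, inter_shift_indices) for one dialogue.

    speakers : speaker id of each utterance, in order
    labels   : label of each utterance, in order
    """
    last_index_by_speaker = {}
    intra, inter = [], []
    for i, (speaker, label) in enumerate(zip(speakers, labels)):
        # Intra-Speaker Shift: differs from the same speaker's previous label.
        if speaker in last_index_by_speaker:
            if labels[last_index_by_speaker[speaker]] != label:
                intra.append(i)
        # Inter-Speaker Shift: differs from the other speaker's latest label.
        others = [j for s, j in last_index_by_speaker.items() if s != speaker]
        if others and labels[max(others)] != label:
            inter.append(i)
        last_index_by_speaker[speaker] = i
    return intra, inter


# Toy two-party dialogue with made-up emotion labels.
speakers = ["A", "B", "A", "B", "A"]
labels = ["happy", "happy", "sad", "happy", "sad"]
intra_shifts, inter_shifts = find_label_shifts(speakers, labels)
```

Evaluating a model only on the returned indices gives the per-shift scores, which can then be compared with the score over all test utterances.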

Sequence Tagging using Conditional Random Field (CRF)
On the surface, the task of utterance level dialogue understanding looks similar to sequence tagging. Are there any distinct label dependency and patterns across the tasks that are dataset agnostic and likely to be captured by CRF (Lafferty et al., 2001)?
In the quest to answer this, we plug in three different CRF layers on top of the bcLSTM network.
Global-CRF. It is a linear-chain CRF used on top of bcLSTM. In this setting, we do not consider speaker information. It can be defined using the equations below:

$$P(Y \mid U) = \frac{1}{Z(U)} \prod_{i=1}^{N} \phi_T(y_{i-1}, y_i)\, \phi_E(y_i, u_i), \tag{1}$$

$$Z(U) = \sum_{Y'} \prod_{i=1}^{N} \phi_T(y'_{i-1}, y'_i)\, \phi_E(y'_i, u_i). \tag{2}$$

Global-CRF ext. The linear-chain CRF is extended to include not only the transition potential from the previous label to the current label, but also from the prior-to-previous label. Concisely, the current label is predicated on the previous two labels. Therefore, the transition potential function $\phi_T$ takes one extra argument $y_{i-2}$. The advantage here is that it also considers the previous label from the target speaker, should utterance $i - 2$ have come from the target speaker. This becomes useful in the tasks where a speaker tends to retain the label of their last utterance. It can be defined using the equations below:

$$P(Y \mid U) = \frac{1}{Z(U)} \prod_{i=1}^{N} \phi_T(y_{i-2}, y_{i-1}, y_i)\, \phi_E(y_i, u_i), \tag{3}$$

$$Z(U) = \sum_{Y'} \prod_{i=1}^{N} \phi_T(y'_{i-2}, y'_{i-1}, y'_i)\, \phi_E(y'_i, u_i). \tag{4}$$

Speaker-CRF. In this setting, we use two distinct CRFs for the two speakers in a dialogue. Inter-speaker label dependency and transitions are not likely to be captured in this setting by the CRFs.
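To make the linear-chain factorization used by Global-CRF concrete, the following standard-library sketch scores a label sequence as a product of transition potentials $\phi_T$ and emission potentials $\phi_E$ and normalizes it with the forward algorithm. It works in log space, uses made-up scores, and assumes no transition term at the first position; it is an illustration of a generic linear-chain CRF, not the implementation used in the experiments.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sequence_score(labels, emit, trans):
    """Log of the unnormalized product of potentials along one label path."""
    score = emit[0][labels[0]]  # assumption: no transition into position 0
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def log_partition(emit, trans):
    """Forward algorithm: log of the sum of path scores over all label paths."""
    num_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            logsumexp([alpha[a] + trans[a][b] for a in range(num_labels)])
            + emit[i][b]
            for b in range(num_labels)
        ]
    return logsumexp(alpha)


def log_prob(labels, emit, trans):
    """Log-probability of a label path under the linear-chain CRF."""
    return sequence_score(labels, emit, trans) - log_partition(emit, trans)


# Made-up log-potentials: 3 utterances, 2 labels.
emit = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9]]   # emit[i][y]  = log phi_E(y, u_i)
trans = [[0.8, -0.5], [-0.3, 0.6]]              # trans[a][b] = log phi_T(a, b)
paths = list(product(range(2), repeat=len(emit)))
total_probability = sum(math.exp(log_prob(p, emit, trans)) for p in paths)
brute_force_log_z = logsumexp([sequence_score(p, emit, trans) for p in paths])
```

The forward recursion computes the normalizer in time linear in the dialogue length, whereas the brute-force sum enumerates every label path and is shown only as a check. Extending the transition table to condition on the previous two labels would give the Global-CRF ext variant.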
Negative Results. We report the results for the CRF experiments in Table 7 and Table 8. Unlike in well-known sequence tagging tasks such as Named Entity Recognition (NER) and Part-of-Speech Tagging, CRF does not improve the performance of utterance-level dialogue understanding tasks. There could be multiple reasons:
1: A dialogue is governed by multiple variables or pragmatics, e.g., topic, personal goal, past experience, expressing opinions or presenting facts based on personal knowledge, and the role of the interlocutors. Hence, the response pattern can vary depending on these variables. The personality of the speakers adds an extra layer of complexity, causing speakers to respond differently under the same circumstances. An identical utterance can be uttered with different emotions by two different speakers. CRF relies on surface label patterns, which can vary across datasets. Due to this dynamic nature of dialogues and the presence of latent controlling variables, the label transition matrix of the CRF does not learn any distinct pattern that is complementary to what is learned by the feature extractor.
2: Some of the datasets, IEMOCAP and MultiWOZ, contain distinct label-transition patterns between the same and distinct speakers, e.g., the label copying feature in the IEMOCAP dataset, where the same or similar emotions are repeated by the same or both speakers. Similarly, in the MultiWOZ dataset, the intent book restaurant is frequently followed by the intent find taxi. We believe the distinct label patterns in the IEMOCAP and MultiWOZ datasets are potentially one of the reasons why contextual models perform so well on these datasets and tasks compared to the rest. On these two datasets, we expected bcLSTM w/ global-CRF to outperform vanilla bcLSTM. However, we do not observe any statistically significant improvement using bcLSTM w/ global-CRF over bcLSTM.
We posit that the evident label-transition patterns that exist in these two datasets are straightforward to capture without a CRF. In fact, we also tried GloVe CNN with a CRF layer on top of it, and surprisingly the result was not significantly higher than that of GloVe CNN. This can be attributed to the absence of explicit contextual and label transition-based features in the CRF.

Results in IEMOCAP and Persuasion for Good Datasets
We observe a minor performance improvement in the IEMOCAP dataset using speaker-CRF for emotion recognition. This observation directly correlates with the experiment under the "w/o Inter-Speaker Dependency" setting in Table 2 and can be largely attributed to the label copying feature in the IEMOCAP dataset, as explained in the last paragraph. In the "w/o Inter-Speaker Dependency" setting, contextual utterances of speaker B are not utilized to classify utterances of speaker A, and vice versa. The results do not improve when we use speaker-level CRF on bcLSTM under the "w/o Inter-Speaker Dependency" setting. From these observations, we can conclude that the CRF is not learning any distinct label dependency and transition patterns that are not already learned by the feature extractor or bcLSTM alone. Global-CRF ext shows significant performance improvement on the Persuasion for Good dataset. Some of the key controllable factors of the dialogues in this dataset, such as topics, are fixed and can be learned intrinsically by the classifier. The scope of the dialogues in this dataset is very limited, as there are only two possible outcomes: agree to donate and disagree to donate. Hence, there can be some label transition patterns learned by the Global-CRF ext using a larger label-context window in the transition potential.

Conclusion
In this paper, we explored the role of context for six utterance-level dialogue understanding tasks in four different datasets. Using a strong contextual baseline system (bcLSTM), we gained insights into the behavior of such contextualized models in the presence of various context perturbations. Such probes have bolstered many interesting intuitions about utterance-level dialogue understanding: the role of label dependency and future utterances; the role of speaker-specific contextual modelling; and the robustness of contextual models, as opposed to their non-contextual counterparts, against adversarial probes. We believe that these probing strategies can be straightforwardly adapted to other context-reliant tasks. The implementation pertaining to this work is available at https://github.com/declare-lab/dialogue-understanding.

A Label Transitions.
To check whether there are any patterns in the label sequences of the datasets, in Fig. 2 and 3 we plot the frequency of the label pairs (x, y), where x and y are the labels of $U_{s_{t-1},t-1}$ and $U_{s_t,t}$ respectively. Fig. 2 shows the inter-speaker label transitions and Fig. 3 illustrates the intra-speaker label transitions. Both plots reveal that the same emotion labels appear in consecutive utterances with high frequency in the IEMOCAP dataset. This induces label dependencies and consistencies and can be called the label copying feature of the dataset. From our empirical analysis in Section 4, we confirm this property of the IEMOCAP dataset.
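The transition statistics behind such heatmaps can be sketched as below. This is an illustrative sketch rather than the authors' plotting code: the function name `transition_frequencies` is hypothetical, and the pairing rules are assumptions (an inter-speaker pair uses consecutive utterances from different speakers; an intra-speaker pair uses a speaker's current and previous utterance). Frequencies are normalized so that all entries of a matrix sum to 1, matching the normalization described for the heatmaps.

```python
from collections import Counter


def transition_frequencies(dialogues, kind):
    """Normalized frequency of label pairs (x, y) over a list of dialogues.

    dialogues: iterable of (speakers, labels) pairs of equal-length lists.
    kind: "inter" or "intra" (pairing rules are assumptions, see above).
    """
    counts = Counter()
    for speakers, labels in dialogues:
        last_label_of = {}  # previous label of each speaker
        for t, (speaker, label) in enumerate(zip(speakers, labels)):
            if kind == "inter":
                if t > 0 and speakers[t - 1] != speaker:
                    counts[(labels[t - 1], label)] += 1
            elif kind == "intra":
                if speaker in last_label_of:
                    counts[(last_label_of[speaker], label)] += 1
            else:
                raise ValueError("kind must be 'inter' or 'intra'")
            last_label_of[speaker] = label
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


# Toy dialogue (invented for illustration only).
toy = [(["A", "B", "A", "B", "A"], ["happy", "happy", "sad", "sad", "sad"])]
inter_freq = transition_frequencies(toy, "inter")
intra_freq = transition_frequencies(toy, "intra")
```

Large diagonal entries such as `("sad", "sad")` in either dictionary correspond to the label copying behavior discussed in the text.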
Although not as strong as in IEMOCAP, the intra-speaker label copying feature is also prevalent in the MultiWOZ and DailyDialog (Act) datasets (refer to Fig. 2). Moreover, we observe interesting patterns in DailyDialog (Act). A directive utterance is commonly followed by a commissive utterance. This indicates that utterances with acts such as request and instruct (directive label) are followed by accepting or rejecting the request or order (commissive label). We also notice that an utterance with the act of questioning is commonly followed by an utterance with the act of answering, which is quite natural. Fig. 2 also corroborates the frequent joint appearance of similar emotions in both speakers' utterances, e.g., negative emotions (anger, frustration, sad) expressed by one speaker are met with a similar negative emotion from the other speaker. Interestingly, the DailyDialog dataset for emotion classification does not exhibit any such patterns. We can attribute this to the scripted utterances present in IEMOCAP, which have been specifically designed to invoke more emotional content. On the other hand, the DailyDialog dataset comprises naturalistic utterances that are more dynamic in nature, as they depend on the interlocutors' personalities. In both the IEMOCAP and DailyDialog datasets, repetitions of the same emotion can be found in consecutive utterances of a speaker. The repetition of the same or similar emotions for a speaker is frequent and often forms long chains in IEMOCAP. However, such repetitions are much less prevalent in DailyDialog. Readers are referred to Fig. 3 for a clearer view. These two different types of datasets were purposefully chosen in order to study dataset-specific nuances of the same task. In DailyDialog, approximately 80% of utterances are labeled as no-emotion (see Fig. 4), which makes emotion classification difficult.
These two datasets also differ in average dialogue length. While the average number of utterances per dialogue in the IEMOCAP dataset is more than 50, the average in the DailyDialog dataset is just 8.
Among other semantically plausible label transitions, we can see in Fig. 3 that the intent book restaurant is frequently followed by the intent find taxi in the MultiWOZ dataset. We believe this is potentially one of the reasons why contextual models perform so well on these three datasets and tasks compared to the rest, which we discuss in the subsequent sections. Further, label dependency and consistency can aid in filtering likely labels given the prior labels. Notably, such patterns are not visible in the other datasets. Hence, one can use a Conditional Random Field (CRF) to find any hidden label patterns and dependencies.
B Utterance Classifier
cLSTM. Similar to bcLSTM but without the bidirectionality in the LSTM, this model is intended to ignore the presence of future utterances while classifying an utterance $U_t$.
DialogueRNN. It is a recurrent network based model for emotion recognition in conversations. It uses two GRUs to track individual speaker states and global context during the conversation. Further, another GRU is employed to track the emotion state through the conversation. In this work, we consider the emotion state to be a general state which can be used for utterance-level classification (i.e., not limited to emotion classification). Similar to the bcLSTM model, the features extracted by the Utterance Feature Extractor are the input to the DialogueRNN network. DialogueRNN aims to model inter-speaker relations, and it can be applied to multiparty datasets.
cLSTM, bcLSTM and DialogueRNN with Residual Connections. Deep neural networks can often have difficulties in information propagation. Multi-layered RNN-like architectures in particular often succumb to vanishing gradient problems while modeling long-range sequences. Residual connections or skip connections (He et al., 2016) are an intuitive way to tackle this problem by improving information propagation and gradient flow. Inspired by the early works on residual LSTMs (Wu et al., 2006; Kim et al., 2017), we adopt a simple strategy to introduce a residual connection in our recurrent contextual models, bcLSTM and DialogueRNN. For each utterance, a residual connection is formed between the output of the feature extractor and the output of the bcLSTM/DialogueRNN module. These two vectors are added and the final classification is performed from the resultant vector.
Figure 4: The heatmap of intra-speaker (left) and inter-speaker (right) label transition statistics in the DailyDialog dataset including neutral emotion. The color bar represents the normalized number of inter-speaker and intra-speaker transitions such that the elements of each matrix add up to 1.
We report results for the IEMOCAP and DailyDialog datasets in Table 9 and for the MultiWOZ and Persuasion for Good datasets in Table 10. We ran each experiment multiple times and report the average test scores based on the best validation scores.
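The residual connection described in this appendix amounts to an element-wise sum per utterance, which can be sketched as follows. This is a minimal illustration with a hypothetical helper name, `add_residual`; it assumes that the feature extractor output and the contextual module output have the same dimensionality (how the dimensions are matched in the actual models is not stated here), and plain Python lists stand in for tensors.

```python
def add_residual(features, context_outputs):
    """Add feature-extractor vectors to contextual-model vectors, per utterance.

    features[i] and context_outputs[i] are the two vectors for utterance i.
    Assumption: both have the same length; otherwise a ValueError is raised.
    """
    if len(features) != len(context_outputs):
        raise ValueError("one vector per utterance is expected in both inputs")
    fused = []
    for feature_vec, context_vec in zip(features, context_outputs):
        if len(feature_vec) != len(context_vec):
            raise ValueError("residual addition needs matching dimensions")
        fused.append([f + c for f, c in zip(feature_vec, context_vec)])
    return fused


# Toy values for a two-utterance dialogue (invented for illustration only).
utterance_features = [[1.0, 2.0], [0.0, -1.0]]
contextual_outputs = [[0.5, -1.0], [2.0, 1.0]]
classifier_inputs = add_residual(utterance_features, contextual_outputs)
```

The fused vectors would then be passed to the final classification layer, so the classifier retains direct access to the non-contextual utterance features.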

IEMOCAP
We observe a general trend of improvement in performance when moving from the GloVe CNN feature extractor to the RoBERTa-based feature extractor, except in the intent prediction task in the MultiWOZ dataset. As the RoBERTa model has been pre-trained on a large amount of textual data and has considerably more parameters, this improvement is expected. The results could possibly be improved even further if a RoBERTa-Large model were used instead of the RoBERTa-Base model that we use in this work.
We also observe that the contextual models, bcLSTM and DialogueRNN, perform much better than the non-contextual Logistic Regression models in most cases. Context information is crucial for emotion, act, and intent classification, and models such as bcLSTM or DialogueRNN are some of the most prominent methods to model the contextual dependency between utterances and their labels. In IEMOCAP, DailyDialog and MultiWOZ, there is a sharp improvement in performance in contextual models compared to the non-contextual models. However, for the strategy classification task in the Persuasion for Good dataset, the improvement from contextual models is relatively smaller. Notably, for Persuadee classification, the RoBERTa non-contextual model achieves the best result, outperforming the contextual models. Without residual connections, the GloVe cLSTM and GloVe bcLSTM baselines perform worse than the non-contextual GloVe CNN baseline in the Persuasion for Good dataset. This calls for better contextual models for this dataset. To analyze the results of the different models, we look at the following aspects: Importance of the Residual Connections in the Models. It is also to be noted that the introduction of the residual connections generally improves