CIDER: Commonsense Inference for Dialogue Explanation and Reasoning

Commonsense inference to understand and explain human language is a fundamental research problem in natural language processing. Explaining human conversations poses a great challenge as it requires contextual understanding, planning, inference, and several aspects of reasoning including causal, temporal, and commonsense reasoning. In this work, we introduce CIDER – a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference. Extracting such rich explanations from conversations can be conducive to improving several downstream applications. The annotated triplets are categorized by the type of commonsense knowledge present (e.g., causal, conditional, temporal). We set up three different tasks conditioned on the annotated dataset: Dialogue-level Natural Language Inference, Span Extraction, and Multi-choice Span Selection. Baseline results obtained with transformer-based models reveal that the tasks are difficult, paving the way for promising future research. The dataset and the baseline implementations are publicly available at https://github.com/declare-lab/CIDER.


Introduction
Understanding and explaining a conversation requires the decomposition of dialogue concepts (entities, events, and actions) and connecting them through definitive relations. The process of breaking down dialogues into such explanations is grounded in the conversational context and often requires commonsense inference. Such explanations, when expressed in the form of structured knowledge triplets (Fig. 1), can describe the exact commonsense relation (causal/temporal/conditional/others) through which the concepts are related in the particular conversational context. Establishing such concept links that help explain the dialogue demands two distinct forms of commonsense inference: i) Explicit: the explanation is worded verbatim in the triplet. Such triplets can be easily extracted by a parser (e.g., syntactic, pattern matching) and are also prevalent in existing commonsense knowledge graphs (Speer et al., 2017; Sap et al., 2019); and ii) Implicit: the explanation is entirely contextual, making it more difficult for machines to infer as it requires complex multi-hop commonsense reasoning skills. Our goal is to explain a dialogue by means of these commonsense inferred triplets. This form of explanation may not be complete, but it can give a substantial understanding of the dialogue by breaking it down into contextual triplets. The key element of dialogue explanation using such triplets is the aspect of contextuality. The triplets extracted from a dialogue using commonsense inference are contextual and are grounded exclusively in that particular dialogue. From our world knowledge, we know that missing a bus could cause being late, but (missed the bus, causes, late) is grounded and definitive only in the dialogue illustrated in Fig. 1. This particular triplet may not be valid in a different dialogue, where the cause of being late could be something different. Similarly, losing a wallet could cause a different consequence (apart from being late), e.g., getting anxious, in the context of another dialogue. It is also important to highlight that some extracted triplets could be persona-specific. For instance, (tardiness, causes, embarrassment) is grounded in the conversation of Fig. 1, but tardiness may not cause embarrassment for every listener.
In the literature, there has been much work on extracting structured knowledge triplets from natural language text. However, there has been little research distinguishing implicit triplets from explicit triplets present in the text. Explicit triplets can be parsed out relatively easily using semantic parsing (Speer et al., 2017) and simple co-reference resolution. Implicit triplets, however, involve non-trivial inference, which becomes even more challenging on dialogue data due to the contextual interplay and latent background knowledge shared between the speakers. Extraction of both explicit and implicit triplets can be conducive to improved dialogue understanding, leading to better question-answering systems and richer knowledge bases. To this end, we construct a dataset of Commonsense Inference for Dialogue Explanation and Reasoning (CIDER), as illustrated in Fig. 1, which captures the relations between textual concepts or spans appearing in a dialogue. A concept or span can constitute one or multiple entities, objects, actions, states, or events that can be extracted from the dialogue. The relations are commonsense based, as elaborated in §3.2. Each triplet is tagged as explicit or implicit.
Through this dataset, we aim to evaluate whether state-of-the-art natural language processing models can really read, understand, and comprehend the conversational context of dialogues. We define three tasks on this dataset that require dialogue-level contextual commonsense reasoning: (i) Dialogue-level Natural Language Inference, (ii) Span Extraction, and (iii) Multi-choice Span Selection. All three tasks require an overall contextual understanding of the dialogue combined with commonsense reasoning and inference. We set up different state-of-the-art transformer language models as baselines and find that the tasks are challenging to solve.
The Importance of this Dataset: The immediate aim of this research is to develop a rich corpus of dialogues with structured explanations in the form of implicit and explicit triplets, and then use this corpus to perform commonsense inference and reasoning. We formulate non-trivial natural language inference (NLI) and question answering (QA) tasks that can be used to benchmark such reasoning capabilities of natural language processing models.

Related Work
Recently, language models have been scaled up substantially and have shown performance improvements on various tasks (Brown et al., 2020; Raffel et al., 2020). However, it has been shown that declarative knowledge remains valuable, especially implicit relationships that are hardly acquired by state-of-the-art models (Hwang et al., 2020).
Widely used commonsense knowledge bases such as ConceptNet (Speer et al., 2017) and ATOMIC (Sap et al., 2019) are mainly based on crowd-sourced efforts. ConceptNet is a semantic network whose nodes are common words or phrases in their natural language form. It contains 34 relations, including taxonomic, temporal, and causal ones, such as MotivatedByGoal and Causes. However, the knowledge in ConceptNet is annotated solely based on the first entity without any other context, making it difficult to capture long-tail knowledge beyond the most common facts. ATOMIC focuses on inferential knowledge and consists of nine relations, such as xIntent (the intent behind personX's action) and xEffect (the effect of the event on personX). It covers knowledge around the agents involved in an event for if-then reasoning, including subsequent events, mental states, and persona. However, it ignores causal relationships between events not carried out by a person. In contrast, our work captures relationships between spans across multiple turns in dialogues. As a result of the dialogue aspect of our data, we also manage to cover implicit knowledge that requires context from conversations to make sense.
More recent work such as GLUCOSE (Mostafazadeh et al., 2020), which is annotated on top of ROCStories (Mostafazadeh et al., 2016), captures implicit knowledge across multiple sentences. Our work instead annotates dialogues, which contain more complex sentences and spoken conversational exchanges.

Background
The primary impetus behind this dataset is the contextualized structured explanation of a dialogue in the form of concept triplets that can be inferred only through commonsense reasoning. The triplets are considered to be the commonsense explanations of different aspects and events that occur in the dialogue. Such aspects include attributional, comparative, and temporal knowledge, and the events may range from physical events involving physical entities to conditional and causal chains, social interactions, and persona.
We focus on conversations as our data source, the choice being motivated by the fact that part of the context in conversations is naturally implicit and interlocutor dependent (Grice, 1975). Commonsense knowledge is considered to be the set of all facts and knowledge about the everyday world that is assumed to be known by all humans (Davis, 2014). For this very reason, human-to-human dialogues, typically guided by the Gricean maxims of human interaction, tend to avoid explicit mentions of commonsense knowledge and the associated reasoning steps. It is thus reasonable to assume that conversations are generally likely to hold more context-specific inferable implicit knowledge than other genres. This ensures a rich dataset with plenty of contextual implicit triplets and a reasonable amount of explicit triplets.
Two distinct spans (e.g., events, entities) in a dialogue may have an implicit connection that can be trivial for humans to interpret using commonsense reasoning and contextual understanding, but challenging for machines. Uncovering implicit explanations has the potential to enable many important tasks, which we focus on later. In this work, we propose a dataset that contains manually labeled implicit explanations present in dyadic dialogues that require commonsense reasoning to infer. We use this dataset to evaluate the ability of pre-trained language models to perform commonsense-based implicit reasoning tasks.
The extracted triplets or explanations, of the form (h, r, t), consist of a head span h, a tail span t, and a directed relation r from h to t. These spans are representative of events, actions, objects, entities, and so on. The directed relation r comes from a predefined set of relations R that explain or describe the relationship between the head and tail spans within the context of the conversation, illustrated in Fig. 1 with the arrows between spans. Notably, the relation set R is intended to be generic in nature, rather than specifically factual or taxonomic, so as to accommodate wide categories of knowledge (§3.2) inferred from the context of the conversation.
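To make this representation concrete, here is a minimal sketch of how one annotated instance could be stored; the field names are illustrative only and do not reflect the released data schema.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    """One annotated explanation, grounded in a single dialogue."""
    head: str         # head span h, e.g. "missed the bus"
    relation: str     # relation r from the predefined set R, e.g. "Causes"
    tail: str         # tail span t, e.g. "late"
    explicit: bool    # True if the triplet is worded verbatim in an utterance
    dialogue_id: str  # the dialogue in which the triplet is grounded

# The example triplet from Fig. 1; the explicit flag here is purely illustrative.
example = Triplet("missed the bus", "Causes", "late", explicit=False, dialogue_id="d001")
```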

Types of Triplets
The extracted triplets are either explicit or implicit, as defined below:
Explicit triplets represent explanations (see Fig. 2a) that are overtly expressed in an utterance of a dialogue. Fig. 1 illustrates one such annotated instance in utterance 13, (tardiness, Causes, embarrassment), where the triplet is worded verbatim in a head-relation-tail sequence. The head and tail spans may contain pronouns that can be resolved by simple co-reference resolution. When the co-reference is complex, however, the context suggests many possible candidates, and the triplet is considered implicit.
Implicit triplets, on the other hand, are not directly expressed in the dialogue and must be inferable through commonsense reasoning using the contextual information present in the dialogue. Instances of such triplets are shown in Fig. 1 and 2b with the relations in purple font.
Why Focus on Implicit Triplets? As pointed out earlier, extracting explicit triplets from a conversation or any natural language text is relatively straightforward and has been studied in much detail in the literature (Auer et al., 2007; Carlson et al., 2010; Speer et al., 2017). The much more challenging problem, however, is to extract implicit triplets or explanations. The decomposition of a dialogue into such implicit explanations requires contextual understanding and complex commonsense reasoning involving multiple steps and utterances. Thus, the extraction of implicit explanations is challenging and a focus of this work.
Latent Spans and Differences with GLUCOSE (Mostafazadeh et al., 2020): As argued earlier, annotating implicit triplets often requires multi-step reasoning. In such cases, one or more intermediate spans (which may not be present in the dialogue) may be required to explain the relation between the constituting spans; see Fig. 2c for one such example. Annotators were given the freedom to identify such intermediate steps when they deemed it necessary. However, such cases are infrequent in our dataset, and thus we have chosen to omit the intermediate spans in our experimental studies for the sake of simplicity. We leave the modelling of intermediate steps as a direction for future work.
In this context, it is also important to highlight the fundamental differences between our dataset and GLUCOSE (Mostafazadeh et al., 2020). First, in our dataset, the knowledge represented by the spans and the relation connecting them is true (valid) given the context, but establishing this connection using an explicit relation requires complex commonsense inference and understanding of the discourse. The resulting triplet is thus valid in the context and grounded by the context. This is similar to deductive commonsense reasoning (Davis, 2014). GLUCOSE, however, focuses on abductive commonsense inference, where, given an event/state and its context, the annotators provide inferred, speculative causal explanations of the event (state) according to their world and commonsense knowledge. These explanations, although they may fit the given context, may not always be entailed by it. As a consequence, GLUCOSE is conducive to generative modeling, whereas our dataset leads to extractive modeling. Second, GLUCOSE has a limited set of relations, where inference is only performed across the following dimensions: cause, enable, and result in. In contrast, we have a much more diverse set of relations (§3.2). Finally, we construct our dataset based on conversations between two humans, while GLUCOSE is built using monologue-like stories that differ significantly in discourse structure and semantics.

Types of Relations
Our proposed CIDER dataset contains 25 main and 6 negated relations. Among the 25 main relations, 19 have been adopted from ConceptNet (Speer et al., 2017), and we introduce 6 new relations to cover aspects that ConceptNet does not. Brief explanations and examples of the relations, including the new ones we introduce, are shown in Table 1; among the newly introduced relations are the temporal relations Happens On and Simultaneous.

Negative and Symmetric Relations
Apart from the relations in Table 1, the dataset also contains negated counterparts of some of the main relations (the 6 negated relations mentioned above). In addition, some of the relations are symmetric, i.e., they hold in both directions between the connected spans.

Source Datasets of Dialogues
The annotation is performed on the following datasets containing dyadic dialogues:
DailyDialog (Li et al., 2017) is aimed at emotion and dialogue-act classification at the utterance level. The conversations cover various topics ranging from ordinary life, work, and relationships to tourism, finance, and politics.
MuTual (Cui et al., 2020) is a manually annotated dataset for multi-turn dialogue reasoning. It was introduced to evaluate several aspects of dialogue-level reasoning in terms of next-utterance prediction given a dialogue history. These aspects include attitude reasoning, intent prediction, situation reasoning, multi-fact reasoning, and others.
DREAM (Sun et al., 2019) is a dialogue-based multiple-choice reading-comprehension dataset collected from exams of English as a foreign language. It presents several challenges as it contains non-extractive answers that require commonsense reasoning beyond a single sentence.
In total, we sampled 807 dialogues from the three datasets. Each sampled dialogue has 5 to 12 utterances, and each constituent utterance has no more than 30 words.

Annotation Process
Annotation guidelines. The annotators are instructed to identify either explicit or implicit triplets in a dialogue (§3.1). Such a triplet consists of a pair of spans, say A and B, and an appropriate relation R between them, denoted as (A, R, B). A span is defined as a word, phrase, or sub-sentence unit of an utterance that represents some concept such as an entity, event, or action. The annotators are instructed to meet the following constraints during the annotation:
• The extracted triplets must be entailed by the conversation to be valid.
• The spans of a triplet should be as short and concise as possible. Also, a triplet may connect a pair of spans from distinct utterances in a dialogue.
• Multiple distinct valid relations between the same pair of spans are allowed. All these relations correspond to distinct triplets.
We used a web-based tool called BRAT (Stenetorp et al., 2012) for the annotation. The annotators are three PhD students who have thorough knowledge of the task. They were first briefed about the annotation rules, followed by a trial with a few samples to evaluate their understanding of the annotation guidelines and their ability to extract both explicit and implicit triplets. Although annotators extract both types, they were instructed to focus more on annotating implicit triplets, since extracting those is more challenging. The trial stage was conducted to ensure that annotators are well versed in annotating high-quality triplets in the final phase.

Annotation Verification and Agreement
Each dialogue is primarily annotated by a single annotator. We then verify the validity of the annotated triplets using the following strategy:
1. All extracted triplets are independently validated by two other validation annotators, in terms of their inferability from their source dialogues.
2. Unanimously agreed-upon valid triplets are kept, while unanimously agreed-upon invalid triplets are discarded. In the case of a disagreement, we bring in a third annotator to break the tie.
3. The final set of valid triplets is labelled as being explicit or implicit by the same two annotators as in step (1). The majority vote is assigned as the final label. Similar to the previous step, in case of a disagreement, we bring in a third annotator to break the tie.
After this stage, we obtained an inter-annotator agreement (Cohen's kappa) of 0.91 among the validation annotators for triplet verification and 0.93 for relation type labelling. We found that the proportion of explicit triplets (4.5%) in the final annotated dataset is far smaller than that of implicit triplets (95.5%). The reason is the informal nature of the source datasets' conversations, which yields far more implicit triplets than explicit ones. Statistics of the annotated dataset are shown in Table 2.

Experimental Setup and Results
We formulate three tasks on the CIDER dataset: 1) Dialogue-level Natural Language Inference; 2) Span Extraction; and 3) Multi-choice Span Selection.

Dialogue-level Cross Validation
We consider a dialogue-level cross-validation strategy to benchmark our models. We partition the annotated dialogues into five disjoint and roughly equal-sized folds. In each cross-validation round, the triplets from four folds are used for training, and the triplets from the remaining fold are used for testing.
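A minimal sketch of this partitioning, assuming each dialogue has a unique ID and each annotated triplet carries a dialogue_id field (the helper names are ours, not from the released code):

```python
import random

def dialogue_level_folds(dialogue_ids, k=5, seed=0):
    """Split dialogue IDs into k disjoint, roughly equal-sized folds."""
    ids = sorted(dialogue_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

def fold_split(triplets, test_ids):
    """Triplets from test-fold dialogues form the test set; all others form the train set."""
    test_ids = set(test_ids)
    train = [t for t in triplets if t["dialogue_id"] not in test_ids]
    test = [t for t in triplets if t["dialogue_id"] in test_ids]
    return train, test
```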

Task 1: Dialogue-level Natural Language Inference (DNLI)
Textual entailment, later renamed as natural language inference (NLI), is the task of identifying if a "hypothesis" is true (entailment), false (contradiction), or undetermined (independent) given a "premise". We extend this definition to conversations and propose Dialogue-level Natural Language Inference (DNLI), which is the task of determining whether a triplet (hypothesis) is true or false given a dialogue (premise) (see Fig. 3a).
It should be noted that most NLI datasets, such as SNLI (Bowman et al., 2015), MultiNLI (Williams et al., 2017), and SciTail (Khot et al., 2018), consist of single-sentence hypotheses and premises, whereas for DNLI the hypothesis and the premise are a triplet and a conversation, respectively.
For our experiments, the hypothesis is formed by concatenating the elements of the triplet (h, r, t) in h, r, t order. Similarly, the premise is formed by concatenating the utterances of the dialogue. The annotated valid triplets T constitute the entailed hypotheses (positive samples). The contradicting triplets/hypotheses for the negative samples are created from T with perturbation strategies such as the following:
Reverse Relation Direction. The head and tail spans of a valid triplet are swapped, reversing the direction of the relation, which generally yields a triplet that is not entailed by the dialogue.
Combination of All. A combination of the above strategies can also be used to create the contradicting hypothesis. We ensure that the contrived contradicting hypotheses do not appear in the set of annotated triplets T.
The above strategies allow us to create multiple negative samples from a positive sample. In our experiments, we had two and eight negative samples per positive sample in the training and test splits, respectively. We intentionally keep fewer negative samples in the training data to evaluate the generalization capacity of the models on a more diverse range of negative samples in the test data. Fold-wise statistics are shown in Table 3. An example of the DNLI task is illustrated in Fig. 3a.
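As an illustration of one such perturbation, here is a sketch of generating reversed-direction negatives while checking against the annotated set T, so that no contrived hypothesis coincides with a valid triplet (triplets are assumed to be dicts with head/relation/tail keys):

```python
def reverse_direction_negatives(triplets):
    """Create contradicting hypotheses by swapping head and tail of valid triplets."""
    valid = {(t["head"], t["relation"], t["tail"]) for t in triplets}
    negatives = []
    for t in triplets:
        swapped = (t["tail"], t["relation"], t["head"])
        if swapped not in valid:  # discard candidates that are themselves annotated as valid
            negatives.append({"head": swapped[0], "relation": swapped[1],
                              "tail": swapped[2], "label": "not_entailed"})
    return negatives
```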

Baseline
RoBERTa-large Fine-tuned on MNLI. We use the pretrained roberta-large-mnli model (Liu et al., 2019) to benchmark this task. The input to the model is: <CLS> Premise <SEP> Hypothesis <SEP>. The classification is performed on the <CLS> token vector from the final layer. We choose this model as it has been fine-tuned on the MNLI dataset and shows impressive performance on a number of NLI tasks.
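A minimal sketch of encoding one premise-hypothesis pair for this baseline with the Hugging Face transformers library; utterances, head, relation, and tail are assumed variables, and the fine-tuning loop for the binary valid/invalid decision is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = " ".join(utterances)                 # concatenated dialogue utterances
hypothesis = " ".join([head, relation, tail])  # triplet concatenated in h, r, t order

# The tokenizer inserts the <CLS>/<SEP>-style special tokens around the pair.
inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits            # MNLI head: contradiction / neutral / entailment
print({model.config.id2label[i]: p
       for i, p in enumerate(logits.softmax(-1).squeeze().tolist())})
```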
The performance of the RoBERTa-MNLI model is reported in Table 5. As DNLI is a classification task, we report macro F1, weighted F1, and precision and recall over the positive examples (with valid triplets). We notice that the metrics are quite consistent across the five folds and thus report our conclusions against the average scores. We obtain an average weighted F1 score of 85.78%. However, the macro F1 score is noticeably lower at 69.83%, suggesting that the model performs poorly on the less frequent positive examples. The recall score suggests that 76.85% of the valid hypotheses are correctly identified by the model. However, the precision score is quite low at 37.25%, suggesting that almost two-thirds of the predicted valid hypotheses are in fact invalid. Without fine-tuning, the model produces a much lower macro F1 of 17.76%, precision of 15.06%, and recall of 47.4%. The state-of-the-art RoBERTa-MNLI model is thus not very capable of correctly identifying triplets entailed by the conversation. We conclude that inference from conversational context based on commonsense reasoning is not straightforward for pretrained language models.

Task 2: Span Extraction
Span Extraction is defined as identifying the tail span B, given the head span A, the relation R between A and B, and the conversation C in which (A, R, B) is encoded. It is analogous to the task of node prediction in knowledge bases, where the missing tail node B in (A, R, ?) is to be predicted. Fig. 3b depicts an example of this subtask.
In this paper, Span Extraction is formulated as a Machine Reading Comprehension (MRC) task similar to SQuAD (Rajpurkar et al., 2016), where a question is to be answered from a given passage of text or, more generally, a context. The equivalences with MRC are defined as follows:
Context. The entire conversation C is treated as the context, as the span B in the triplet (A, R, B) can come from any utterance of C.
Question and Answer. For each relation type R, we create a question template that includes a placeholder for span A and asks for span B as the answer. The templates are filled with the appropriate valid triplets to generate the question-answer pairs. Please refer to the question templates in the appendix.
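To make the MRC formulation concrete, here is a sketch of mapping one triplet to a SQuAD-style example; the template wording below is hypothetical (the actual templates are listed in Table 10), and utterances, head, relation, and tail are assumed inputs:

```python
# Hypothetical per-relation question templates; the released ones appear in Table 10.
TEMPLATES = {
    "Causes": "What does {X} cause?",
    "HasPrerequisite": "What is a prerequisite of {X}?",
}

def make_qa_example(utterances, head, relation, tail):
    """Map a triplet (head, relation, tail) to a SQuAD-style context/question/answer dict."""
    context = " ".join(utterances)                # the whole conversation C is the context
    question = TEMPLATES[relation].format(X=head)
    start = context.find(tail)                    # the tail span must occur somewhere in C
    return {"context": context, "question": question,
            "answers": {"text": [tail], "answer_start": [start]}}
```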

Baselines
We use two pretrained transformer-based models to benchmark the Span Extraction task. The methodology of BERT-style QA models (Devlin et al., 2019) is used to extract the tail spans/answers.
RoBERTa Base. We use the roberta-base model (Liu et al., 2019) as one baseline.
SpanBERT Fine-tuned on SQuAD. We use SpanBERT (Joshi et al., 2020) fine-tuned on the SQuAD 2.0 dataset as the other baseline.
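A short sketch of querying such a fine-tuned QA model; the checkpoint path below is a placeholder for whichever SpanBERT-on-SQuAD-2.0 weights are used, the question string is a filled hypothetical template, and utterances is an assumed variable:

```python
from transformers import pipeline

# Placeholder checkpoint: substitute the SpanBERT model fine-tuned on SQuAD 2.0.
qa = pipeline("question-answering", model="path/to/spanbert-squad2")

question = "What does missed the bus cause?"  # hypothetical template filled with the head span
context = " ".join(utterances)                # the conversation, as in the sketch above
pred = qa(question=question, context=context)
print(pred["answer"], pred["score"])          # predicted tail span and its confidence
```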

Evaluation
Metrics:
EM (Exact Match). Percentage of the predicted answers that are identical to the gold answers.
NM (No Match). Percentage of the predicted answers that bear no match with the gold answer.
F1. The F1 score introduced by Rajpurkar et al. (2016) to evaluate word-level overlap of predictions with the gold answers for extractive QA models.
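A sketch of the two overlap-based metrics under simple whitespace tokenization (the SQuAD answer-normalization steps are omitted):

```python
def exact_match(pred, gold):
    """1.0 if the predicted answer string equals the gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Word-level F1 overlap between a predicted and a gold answer span."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(w), g.count(w)) for w in set(p) & set(g))
    if common == 0:
        return 0.0  # NM corresponds to the fraction of predictions scoring 0 here
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```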

The results for this task are reported in Table 4. We notice that the SpanBERT model performs significantly better than the RoBERTa model. This is expected, as SpanBERT has been pretrained with a different objective function and particularly excels at span-extraction tasks such as question answering. However, the EM score of 28.41% and the F1 score of 42.06% for the superior SpanBERT model are still subpar. The EM score suggests that the model extracts the exact correct answer less than a third of the time. The NM score also indicates that the extracted answer and the gold answer have no overlap around half of the time. Without fine-tuning, the SpanBERT model produces an EM score of 7.96% and an F1 score of 20.78%, much lower than the fine-tuned model. We conclude that state-of-the-art pretrained language models struggle with extracting missing spans.

Task 3: Multi-choice Span Selection
Multi-choice Span Selection is motivated by the SWAG commonsense inference task (Zellers et al., 2018). In SWAG, given a partial description of a situation, the appropriate ending is to be selected from a given list of choices using commonsense inference. In our case, Multi-choice Span Selection is formulated as a multiple-choice question answering task. Similar to the previous task, given a conversation C and partial information about a triplet (A, R, ?), the goal is to predict the missing span B as an answer to a question created from A and R. However, in contrast to Task 2, the missing span B has to be selected from a list of four possible answers S = {s_1, ..., s_4}. We show an example of this task in Fig. 4. The context, question, and answers for this task are created as follows:

Creating Confounding Options
To mitigate the stylistic artifacts that could give away the target answer (Gururangan et al., 2018;Poliak et al., 2018), the confounding options are generated in an adversarial fashion.
We first select a large number of spans from C to form a confounding-option collection N by leveraging the SpanBERT model fine-tuned on the samples of Task 2 (§5.3). We feed each individual utterance as the context, along with the question created from A and R, to this fine-tuned SpanBERT model. This yields one or two candidate answers (spans) per contextual utterance per question, averaging around 30 confounding spans per question. We discard the spans that form a valid triplet with A and R.
Adversarial Filtering. Once we have the collection N , we follow Zellers et al. (2018) to filter the confounding options generated in §5.4.1. Please check Appendix Section A for more details. We use the roberta-base model to filter out stylistic patterns. During the filtering process, discriminator prediction accuracy decreased from 0.55 to 0.27, suggesting the method's effectiveness in removing easy confounding candidates with stylistic patterns.

Baseline
We experiment with bert-base-uncased and roberta-base on the adversarially created dataset. The input to the models is the concatenation of the conversation C, the question Q, and a candidate answer A_j, j ∈ {1, ..., 4}: <CLS> C <SEP> Q <SEP> A_j <SEP>. Each candidate's score is predicted from the corresponding <CLS> token vector, and the highest-scoring candidate is selected as the answer.
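A minimal sketch of this scoring scheme with a multiple-choice head from transformers; conversation, question, and options are assumed variables, and the fine-tuning on Task 3 data is omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMultipleChoice.from_pretrained("roberta-base")  # to be fine-tuned on Task 3

first = [conversation + " " + question] * len(options)  # C and Q repeated for each candidate
enc = tok(first, options, truncation=True, padding=True, return_tensors="pt")
batch = {k: v.unsqueeze(0) for k, v in enc.items()}     # shape (1, num_choices, seq_len)

with torch.no_grad():
    logits = model(**batch).logits                       # one score per candidate answer
print(options[logits.argmax(-1).item()])                 # highest-scoring option is the answer
```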

Results
The results reported in Table 7 indicate the importance of contextual information in improving models' performance. Our human verifiers could also predict the answers significantly more accurately when contextual information was available.
It is worth noting that all the pre-trained language models perform poorly on this task, and the obtained results are far from human-level performance. Besides, the accuracy scores for bert-base-uncased and roberta-base without fine-tuning are 25.60% and 26.22%, respectively, which are close to the random baseline (25.00%), confirming the conclusion from Task 2 (§5.3) that current language models have difficulty predicting the missing span.
Performance across Relation Categories. We report the results across different relation categories for each task with the corresponding best-performing models in Table 6. We notice that Spatial is one of the top-performing categories across all three tasks. Performance in the Attribution category is also reasonably good in Task 1, and in the Temporal category in Tasks 1 and 2. Interestingly, the result for the Temporal category in Task 3 is the worst. The performance in the Causal and Conditional categories is around the average mark across all three tasks, implying that pretrained language models find it difficult to understand the concepts of causal or dependent events. Finally, we observe that the performance in the Social category is the worst or among the worst for all tasks, suggesting that the models find it very challenging to reason about social norms, rules, and conventions.

Conclusion
In this work, we introduced CIDER-a new dataset that focuses on commonsense-based implicit explanation extraction from dialogues. The dataset consists of more than 4,500 manually annotated triplets from over 800 dialogues. We also introduced dialogue-level NLI and QA tasks, along with pre-trained transformer-based baselines to evaluate their inference and reasoning capabilities.
A Adversarial Filtering.
For Task 3: Multi-choice Span Selection, once we have the collection N, we follow Zellers et al. (2018) to filter the confounding options in an iterative fashion. The procedure is as follows:
1. Initially, we select 3 random candidates from N and the correct answer to form a fake dataset.
2. We split the fake dataset randomly into a dummy train set and a dummy test set following a 1:2 ratio.
3. We train our discriminator D on the dummy train set and score each option in the dummy test set with a probability, which lets us filter out confounding options with unwanted stylistic patterns.
4. We replace the easiest confounding option (lowest probability) with another option from N.
5. We merge the dummy train set and the dummy test set after replacement to form the fake dataset for the next iteration, and repeat from step 2.
We design the input fed to D as a combination of the context C and the relation R; specifically, we feed <CLS> Conversation <SEP> Relation <SEP> Option_i <SEP> as input, where Option_i is the i-th candidate option. The probability score is computed from the final-layer vector corresponding to the <CLS> token. We posit that, by excluding A from the model input, the discriminator can only pick up on low-level stylistic patterns with respect to the relation R and the context C, rather than performing high-level inference. It therefore filters options based solely on such low-level patterns. We use the roberta-base model to filter out stylistic patterns. During the filtering process, the discriminator's prediction accuracy decreased from 0.55 to 0.27, suggesting the method's effectiveness in removing easy confounding candidates with stylistic patterns.
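A schematic, per-question simplification of this loop; score_confounders stands in for training D on the dummy train split and scoring the dummy test split, and all names are ours:

```python
import random

def adversarial_filtering(pool, score_confounders, n_options=3, iterations=10, seed=0):
    """Iteratively replace the confounding option the discriminator finds easiest to reject."""
    rng = random.Random(seed)
    confounders = rng.sample(pool, n_options)     # step 1: initial fake option set
    for _ in range(iterations):
        scores = score_confounders(confounders)   # steps 2-3: probability per option
        easiest = min(range(n_options), key=lambda i: scores[i])
        replacement = rng.choice([s for s in pool if s not in confounders])
        confounders[easiest] = replacement        # step 4: swap out the easiest one
        # step 5: the updated set forms the fake dataset for the next iteration
    return confounders
```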

B.1 Task 4: Relation Prediction
The fourth task of our interest is Relation Prediction between two spans from a conversation. Given two spans A and B from a conversation C, the task is to predict the unknown relation R between them in (A, ?, B).
We propose two different settings to evaluate the relation prediction task: 1) Without Conversational Context and 2) With Conversational Context.

B.1.1 Task Description
Without Conversational Context. This setting is similar to the standard relation prediction task in knowledge graphs. Given the input spans (A, B), the task is to predict the relation R between A and B.
With Conversational Context. We surmise that the conversational context from C is key to predicting the relation between any two given spans. This task setting is thus designed to evaluate that hypothesis. In this case, given the input spans and the conversation (A, B, C), the task is to predict the commonsense relation R between A and B.

B.1.2 Models
We use pretrained transformer-based models to benchmark this task as well. In particular, we use the bert-base and roberta-base models. The input to the models is formulated as <CLS> A <SEP> B <SEP> in the setting without conversational context, and <CLS> A <SEP> B <SEP> C <SEP> in the setting with conversational context. The relation category R is classified from the final-layer vector corresponding to the <CLS> token.
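A sketch of the two input variants; span_a, span_b, utterances, and num_relations are assumed, and the conversation is appended to the second text segment since the pair tokenizer only supports two segments (a simplification of the <CLS> A <SEP> B <SEP> C <SEP> layout above):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=num_relations)  # classification head to be fine-tuned on CIDER

# Without conversational context: only the two spans are paired.
no_ctx = tok(span_a, span_b, return_tensors="pt", truncation=True)

# With conversational context: the concatenated utterances follow the tail span.
with_ctx = tok(span_a, span_b + " " + " ".join(utterances),
               return_tensors="pt", truncation=True)

with torch.no_grad():
    relation_id = model(**with_ctx).logits.argmax(-1).item()  # index into the relation set R
```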

B.1.3 Results
The results for the relation prediction task are shown in Table 8, where we report accuracy and macro-level scores. We observe that the macro-level scores are quite subpar, partly because the annotated dataset contains a large number of relations. We also note that incorporating the conversational context brings a large improvement across all evaluation metrics. The results support our hypothesis that contextual information is substantially important for predicting the relation between spans.

C Hyperparameters
We use the AdamW (Loshchilov and Hutter, 2018) optimizer to train the models for all the tasks. More details about learning rate, batch size and epochs are given below.

C.1 Hyperparameters for Task 1: NLI
The roberta-large-mnli model is trained with a learning rate of 1e-5 and a batch size of 8 for 10 epochs.

C.2 Hyperparameters for Task 2: Span Extraction
The roberta-base and SpanBERT models are both trained with a learning rate of 1e-5 and a batch size of 16 for 12 epochs.

D Relation Count
The frequency of the categorized relations in the final annotated dataset is shown in Table 9.

E Question Templates

The question templates used in Task 2: Span Extraction and Task 3: Multi-choice Span Selection are shown in Table 10. The placeholder X in the Question column is replaced with the actual annotated span A.